DataChain - A Tool For AI Workflows

Written by Sue Gee

Tuesday, 23 July 2024

Iterative has released a new open-source tool for processing and evaluating unstructured data at scale. DataChain is an open-source Python library designed to make it easier to use generative AI on unstructured data by providing a link between the unstructured data and AI workflows based in languages such as Python.

According to McKinsey's Global Survey on the state of AI, published in early 2024, only 15 percent of surveyed companies feel they have made meaningful use of generative AI in their business. Iterative says a large part of the problem lies in the challenge of processing unstructured data at scale and evaluating the results, including assessing and improving data quality in unstructured multimodal data such as text and images.


Dmitry Petrov, CEO of Iterative, says that to overcome this challenge, there's a need for AI models that can evaluate and improve existing AI models. Iterative says that in practice, most AI engineers are still building custom code for converting their JSON model responses, adapting them to databases, and running models in parallel with out-of-memory data.

DataChain provides AI-based analytical capabilities in which large language models (LLMs) judge the output of other LLMs, alongside multimodal GenAI evaluations, to improve data curation and pre-processing. DataChain can also store and structure Python object responses using the latest data model schemas.
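The "LLM judging another LLM" idea can be pictured with a small sketch. This is not DataChain code: the `judge` function here is a hypothetical, rule-based stand-in for a call to a second model, and the `Verdict` class and sample data are invented for illustration. It only shows the shape of the loop, in which each response gets a structured verdict and the verdict drives curation.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Structured result returned by the judge (hypothetical schema)."""
    passed: bool
    reason: str


def judge(question: str, answer: str) -> Verdict:
    # Stand-in for a second LLM. A real setup would send the question and
    # the answer to a judge model and parse its structured reply.
    if not answer.strip():
        return Verdict(False, "empty answer")
    if len(answer.split()) < 3:
        return Verdict(False, "too short")
    return Verdict(True, "ok")


# Responses produced by the model under evaluation (made-up examples).
responses = [
    ("What is DVC?", "A tool for versioning data and models."),
    ("What is DVC?", "Dunno."),
    ("What is DVC?", ""),
]

# Keep only the question/answer pairs that the judge accepts.
curated = [(q, a) for q, a in responses if judge(q, a).passed]
rejected_reasons = [judge(q, a).reason for q, a in responses if not judge(q, a).passed]
```

Swapping the rule-based body of `judge` for a real model call would leave the rest of the loop unchanged, which is the point of returning a typed verdict rather than free text.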

The name DataChain comes from the fact that it lets analysts run multimodal API calls and local AI inferences in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. It uses the concept of a data chain: a sequence of data manipulation steps such as reading data from storage, running AI or LLM models, or calling external service APIs to validate or enrich data. Data in DataChain is presented as Python classes with an arbitrary set of fields, including nested classes. It can also persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
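To make the chaining idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the DataChain library: the `Chain`, `Sample`, and `Rating` classes and the `fake_model` function are hypothetical names invented for this illustration, and the real API may look quite different. It shows records as Python classes with a nested field, a model step applied in parallel, and a filter step chained after it.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Rating:
    """Nested object standing in for a structured model response."""
    label: str
    score: float


@dataclass
class Sample:
    """One record in the chain: source path, content, optional model output."""
    path: str
    text: str
    rating: Rating | None = None


class Chain:
    """Hypothetical stand-in for a data chain; not the DataChain class."""

    def __init__(self, samples: Iterable[Sample]):
        self.samples = list(samples)

    def map(self, fn: Callable[[Sample], Sample], workers: int = 4) -> Chain:
        # Apply fn to every sample in parallel; pool.map preserves input order.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return Chain(pool.map(fn, self.samples))

    def filter(self, pred: Callable[[Sample], bool]) -> Chain:
        return Chain(s for s in self.samples if pred(s))


def fake_model(sample: Sample) -> Sample:
    # Placeholder for an external API call or a local inference.
    positive = "good" in sample.text.lower()
    rating = Rating("positive" if positive else "negative", 0.9 if positive else 0.2)
    return Sample(sample.path, sample.text, rating)


chain = (
    Chain([
        Sample("a.txt", "Good service"),
        Sample("b.txt", "Slow and rude"),
        Sample("c.txt", "Really good food"),
    ])
    .map(fake_model)                                                   # enrich
    .filter(lambda s: s.rating is not None and s.rating.score > 0.5)   # curate
)

kept_paths = [s.path for s in chain.samples]
```

Each step returns a new `Chain`, so steps compose left to right and the intermediate result of any step could be saved or versioned before the next one runs.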

DataChain has two main elements: a set of Pythonic APIs that integrate with the Python ecosystem, and a Data Version Control (DVC) tool for unstructured data that makes use of data warehouses and Pythonic libraries to manage and version large volumes of unstructured data. It is built to handle large-scale data operations, and to ensure that AI workflows remain efficient and scalable as data volumes grow.

DataChain is available now on GitHub, and an online webinar will be hosted on July 24 to showcase DataChain's capabilities.


More Information

Announcing DataChain

DataChain Getting Started

DataChain On GitHub



Last Updated ( Tuesday, 23 July 2024 )
