DataChain - A Tool For AI Workflows
Written by Sue Gee   
Tuesday, 23 July 2024

Iterative has released DataChain, an open-source Python library for processing and evaluating unstructured data at scale. It is designed to make it easier to use generative AI on unstructured data by providing a link between that data and AI workflows written in languages such as Python.

According to McKinsey's Global Survey on the state of AI, published in early 2024, only 15 percent of surveyed companies feel they have made meaningful use of generative AI in their business. Iterative says a large part of the problem is the difficulty of processing unstructured data at scale and evaluating the results, including assessing and improving data quality in unstructured multimodal data such as text and images.


Dmitry Petrov, CEO of Iterative, says that to overcome this challenge, there's a need for AI models that can evaluate and improve existing AI models. Iterative says that in practice, most AI engineers are still writing custom code to convert their JSON model responses, adapt them to databases, and run models in parallel over data that does not fit in memory.

DataChain supports AI-based analytical capabilities, such as large language models (LLMs) judging the output of other LLMs and multimodal GenAI evaluations, to improve data curation and pre-processing. It can also store and structure the Python objects that models return, using the latest data model schemas.
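To make the "LLM as judge" idea concrete, here is a minimal sketch in plain Python. It does not use DataChain's API: the `Verdict` schema, the `stub_judge` function and the `curate` helper are all hypothetical names invented for illustration, and the judge is a stand-in for a real model call. It shows a JSON reply being parsed into a typed Python object and then used to filter samples.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    """Structured form of a judge model's JSON reply (hypothetical schema)."""
    score: float
    reason: str


def stub_judge(answer: str) -> str:
    """Stand-in for a call to a judge LLM; returns a JSON string.

    A real judge would be a model call. This stub simply rewards longer answers.
    """
    score = min(len(answer.split()) / 10, 1.0)
    reason = "detailed" if score >= 0.5 else "too short"
    return json.dumps({"score": score, "reason": reason})


def parse_verdict(raw: str) -> Verdict:
    """Convert the raw JSON reply into a typed Python object."""
    data = json.loads(raw)
    return Verdict(score=float(data["score"]), reason=str(data["reason"]))


def curate(answers: list[str], threshold: float = 0.5) -> list[tuple[str, Verdict]]:
    """Keep only the answers whose judged score meets the threshold."""
    judged = [(a, parse_verdict(stub_judge(a))) for a in answers]
    return [(a, v) for a, v in judged if v.score >= threshold]


if __name__ == "__main__":
    samples = ["Yes.", "The invoice total is 42 dollars, due at month end."]
    for answer, verdict in curate(samples):
        print(verdict.score, verdict.reason, answer)
```

In a real pipeline the stub would be replaced by a model or API call, and the typed verdicts would be stored alongside the samples so that they can be queried later.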

The name comes from the fact that DataChain lets analysts run multimodal API calls and local AI inferences in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. It uses the concept of a data chain: a sequence of data manipulation steps such as reading data from storage, running AI or LLM models, or calling external service APIs to validate or enrich data. Data in DataChain is represented as Python classes with an arbitrary set of fields, including nested classes. DataChain can also persist features of Python objects returned by AI models and enables vectorized analytical operations over them.
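The chaining concept can be sketched in a few lines of standard-library Python. This is an illustration of the pattern only, not DataChain's actual API: `MiniChain`, `Record`, `FileInfo` and `fake_caption_model` are hypothetical names, and the "model" is a stub. The sketch shows records as Python classes with a nested class, a filter step, and a map step that runs in parallel over the samples.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class FileInfo:
    """Nested class describing where a sample lives."""
    path: str
    size: int


@dataclass(frozen=True)
class Record:
    """One sample: a file plus an optional model-produced field."""
    file: FileInfo
    caption: Optional[str] = None


class MiniChain:
    """Hypothetical illustration of chained steps; not DataChain's API."""

    def __init__(self, records: Iterable[Record]):
        self._records = list(records)

    def filter(self, keep: Callable[[Record], bool]) -> "MiniChain":
        """Return a new chain holding only the records that pass `keep`."""
        return MiniChain(r for r in self._records if keep(r))

    def map(self, fn: Callable[[Record], Record], workers: int = 4) -> "MiniChain":
        """Apply `fn` to every record in parallel, preserving order."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return MiniChain(pool.map(fn, self._records))

    def collect(self) -> list[Record]:
        """Materialize the chain's records as a list."""
        return list(self._records)


def fake_caption_model(record: Record) -> Record:
    """Stub standing in for a model or external API call."""
    name = record.file.path.rsplit("/", 1)[-1]
    return replace(record, caption=f"a picture named {name}")


if __name__ == "__main__":
    chain = (
        MiniChain([
            Record(FileInfo("img/cat.jpg", 2048)),
            Record(FileInfo("img/tiny.jpg", 10)),
        ])
        .filter(lambda r: r.file.size > 100)
        .map(fake_caption_model)
    )
    for r in chain.collect():
        print(r.file.path, "->", r.caption)
```

Each step returns a new chain, which is what allows the steps to be composed in sequence. A thread pool suits steps that wait on network calls, while a CPU-bound local model would more likely need processes or batching.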

DataChain has two main elements: a set of Pythonic APIs that integrate with the Python ecosystem, and a Data Version Control (DVC) tool for unstructured data that makes use of data warehouses and Pythonic libraries to manage and version large volumes of unstructured data. It is built to handle large-scale data operations and to ensure that AI workflows remain efficient and scalable as data volumes grow.

DataChain is available now on GitHub, and an online webinar will be hosted on July 24 to showcase DataChain's capabilities.


More Information

Announcing DataChain

DataChain Getting Started

DataChain On GitHub



Last Updated ( Tuesday, 23 July 2024 )