DataChain - A Tool For AI Workflows
Written by Sue Gee
Tuesday, 23 July 2024
Iterative has released a new open-source tool for processing and evaluating unstructured data at scale.

DataChain is an open-source Python library designed to make it easier to use generative AI on unstructured data by providing a link between the unstructured data and AI workflows written in languages such as Python.

According to McKinsey's Global Survey on the state of AI, published in early 2024, only 15 percent of surveyed companies feel they have made meaningful use of generative AI in their business. Iterative says a large part of the problem lies in the challenge of processing unstructured data at scale and evaluating the results. Part of that challenge is assessing and improving the data quality of unstructured multimodal data such as text and images. Dmitry Petrov, CEO of Iterative, says that overcoming it requires AI models that can evaluate and improve existing AI models.

Iterative says that in practice, most AI engineers are still building custom code for converting their JSON model responses, adapting them to databases, and running models in parallel with out-of-memory data.

DataChain provides a way to use AI-based analytical capabilities, such as large language models (LLMs) judging the output of other LLMs and multimodal GenAI evaluations, to improve data curation and pre-processing. DataChain can also store and structure Python object responses using the latest data model schemas.

The name DataChain comes from the fact that the library lets analysts run multimodal API calls and local AI inferences in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. DataChain is built around the concept of a data chain: a sequence of data manipulation steps such as reading data from storage, running AI or LLM models, or calling external service APIs to validate or enrich data.
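
To make the idea of a data chain more concrete, here is a minimal sketch in plain Python. It is not DataChain's API: the `Chain` class, the `Sample` record and the `fake_judge` function are all hypothetical names invented for illustration, and the "judge" is a trivial word-count heuristic standing in for an LLM or external API call. It only shows the general pattern of steps that run in parallel over many samples and can be chained one after another.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    """One record flowing through the chain (hypothetical shape)."""
    path: str
    text: str
    score: float = 0.0


class Chain:
    """A toy chain: every step returns a new Chain, so steps compose."""

    def __init__(self, samples: Iterable[Sample]):
        self.samples: List[Sample] = list(samples)

    def map(self, fn: Callable[[Sample], Sample], workers: int = 4) -> "Chain":
        # Apply fn to every sample in parallel; pool.map preserves input order.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return Chain(pool.map(fn, self.samples))

    def filter(self, keep: Callable[[Sample], bool]) -> "Chain":
        # Keep only the samples for which the predicate is true.
        return Chain(s for s in self.samples if keep(s))


def fake_judge(sample: Sample) -> Sample:
    # Stand-in for an LLM or external API call (assumption, not a real model):
    # scores a text by its word count, capped at 1.0.
    return replace(sample, score=min(len(sample.text.split()) / 5, 1.0))


raw = [
    Sample("a.txt", "short reply"),
    Sample("b.txt", "a much longer and more detailed reply"),
]

# Read -> score in parallel -> keep only high-scoring samples.
curated = Chain(raw).map(fake_judge).filter(lambda s: s.score >= 0.8)
print([s.path for s in curated.samples])  # prints ['b.txt']
```

In a real workflow the scoring step would be slow and I/O-bound, which is why the sketch uses a thread pool; saving and versioning the result, as DataChain is described as doing, is left out here.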
Data in DataChain is presented as Python classes with an arbitrary set of fields, including nested classes. DataChain can also persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.

DataChain has two main elements: a set of Pythonic APIs that integrate with the Python ecosystem, and a Data Version Control (DVC) tool for unstructured data that makes use of data warehouses and Pythonic libraries to manage and version large volumes of unstructured data. It is built to handle large-scale data operations and to ensure that AI workflows remain efficient and scalable as data volumes grow.

DataChain is available now on GitHub, and an online webinar will be hosted on July 24 to showcase DataChain's capabilities.
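
As an illustration of what representing model output as Python classes with nested fields can look like, the following sketch uses only the standard library. The `ModelResponse` and `Usage` classes and the JSON layout are hypothetical and are not taken from DataChain; the point is simply that a JSON model response becomes a typed object whose nested fields can then be aggregated across many samples.

```python
import json
from dataclasses import dataclass
from statistics import mean


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


@dataclass
class ModelResponse:
    model: str
    text: str
    usage: Usage  # nested class


def parse_response(raw: str) -> ModelResponse:
    """Convert one JSON model response (assumed layout) into typed objects."""
    data = json.loads(raw)
    return ModelResponse(
        model=data["model"],
        text=data["text"],
        usage=Usage(**data["usage"]),
    )


# Two made-up responses in the assumed JSON layout.
raw_responses = [
    '{"model": "demo", "text": "yes", "usage": {"input_tokens": 10, "output_tokens": 2}}',
    '{"model": "demo", "text": "no", "usage": {"input_tokens": 14, "output_tokens": 4}}',
]

responses = [parse_response(r) for r in raw_responses]

# Aggregate over a nested field across all samples.
total_output = sum(r.usage.output_tokens for r in responses)
avg_input = mean(r.usage.input_tokens for r in responses)
print(total_output, avg_input)
```

A library working at scale would push such aggregations down to a database or columnar engine rather than looping in Python; the loop here is only meant to show the shape of the data.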
Last Updated ( Tuesday, 23 July 2024 )