Apache Spark Now Understands English

Written by Nikos Vaggalis

Monday, 10 July 2023

An SDK for Apache Spark has been released that takes English instructions and compiles them into PySpark objects, making Spark more user-friendly and accessible.


It was expected that at some point someone would use generative AI to compile English into SQL or code. SQL, although easy to grasp, has been the barrier to management's direct interaction with the organization's data silos.

In practice, the developer was the bridge between the two. The developer would receive management's request in English, or another natural language, and translate it into SQL or code in order to generate the necessary report. That role can now be played by AI, so that even those who do not understand SQL can quickly query business data and generate reports.

But where the developer had the edge was in knowing the schema, the internal details of the database, and the business logic, all of which went into constructing the query. An AI system without that knowledge was destined to fail. Nowadays, however, LLMs can be trained on your data and thus gain the knowledge that was missing.

One such attempt to use natural language to access data is Alibaba's Chat2DB, a general-purpose SQL client and reporting tool for databases that integrates AIGC (AI-generated content) capabilities such as ChatGPT and can convert natural language into SQL. It can also convert SQL into natural language and suggest SQL optimizations, with the aim of making developers more efficient. According to its website:

It is a tool for database developers in the AI era, and even non-SQL business operators in the future can use it to quickly query business data and generate reports.

Another attempt along these lines is the English SDK for Apache Spark, which takes English instructions and compiles them into PySpark objects such as DataFrames. Instead of having to write or understand the complex generated code, you get the result from a simple instruction in plain English, like:

transformed_df = df.ai.transform('get 4 week moving average sales by dept')

The English SDK, with its understanding of Spark tables and DataFrames, handles the complexity, returning a DataFrame directly and correctly.
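To make concrete what an instruction like "get 4 week moving average sales by dept" asks for, here is a minimal sketch of the underlying computation in plain Python. It is not the code the SDK generates (in PySpark this would more likely be a window function); the function name `moving_average_by_dept`, the `(dept, week, sales)` row layout and the sample figures are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def moving_average_by_dept(rows, window=4):
    """Return (dept, week, sales, moving_avg) tuples.

    rows: iterable of (dept, week, sales) tuples (a hypothetical layout).
    The average covers the current week and up to window - 1 earlier weeks.
    """
    by_dept = defaultdict(list)
    for dept, week, sales in rows:
        by_dept[dept].append((week, sales))

    result = []
    for dept in sorted(by_dept):
        recent = deque(maxlen=window)  # keeps only the last `window` values
        for week, sales in sorted(by_dept[dept]):
            recent.append(sales)
            result.append((dept, week, sales, sum(recent) / len(recent)))
    return result


# Made-up sample data, for illustration only.
sample = [
    ("toys", 1, 100.0), ("toys", 2, 200.0), ("toys", 3, 300.0),
    ("toys", 4, 400.0), ("toys", 5, 500.0),
    ("books", 1, 10.0), ("books", 2, 30.0),
]

for row in moving_average_by_dept(sample):
    print(row)
```

Even this small version has to decide how to partition, order and handle the first few weeks, which is the kind of detail the English instruction leaves to the SDK.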


The SDK offers the following key features:

Data Ingestion

The SDK can perform a web search using a provided description, use the LLM to determine the most appropriate result, and then incorporate the chosen web data into Spark, all in a single step.

You can ingest data via search engine:

auto_df = spark_ai.create_df("2022 USA national auto sales by brand")

Or you can ingest data via URL:

auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")

DataFrame Operations

The SDK provides functionalities on a given DataFrame that allow for transformation, plotting, and explanation based on your English description.

DataFrame Transformation

auto_top_growth_df = auto_df.ai.transform("brand with the highest growth")

auto_top_growth_df.show()

DataFrame Explanation

auto_top_growth_df.ai.explain()

DataFrame Attribute Verification

auto_top_growth_df.ai.verify("expect sales change percentage to be between -100 to 100")
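The check described by that instruction amounts to a simple range test. The sketch below shows one way to express it in plain Python. It is only an illustration of the logic, not what the SDK produces: the function name `sales_change_in_range`, the inclusive bounds and the sample values are assumptions.

```python
def sales_change_in_range(percentages, low=-100.0, high=100.0):
    """Return True if every percentage lies within [low, high].

    Inclusive bounds are an assumption; the English instruction
    does not say whether the endpoints are allowed.
    """
    return all(low <= p <= high for p in percentages)


# Made-up values, for illustration only.
print(sales_change_in_range([-12.5, 3.0, 48.9]))
print(sales_change_in_range([-12.5, 250.0]))
```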

User-Defined Functions (UDFs)

This feature simplifies the UDF creation process, letting you focus on function definition while the AI takes care of the rest.

@spark_ai.udf
def convert_grades(grade_percent: float) -> str:
    """Convert the grade percent to a letter grade using standard cutoffs"""
    ...

Now you can use the UDF in SQL queries or DataFrames:

SELECT student_id, convert_grades(grade_percent) FROM grade
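For comparison, a hand-written body for the same function might look like the sketch below. The article does not say which cutoffs the AI chooses, so the 90/80/70/60 thresholds here are an assumption about what "standard cutoffs" means, and the code runs as ordinary Python without the decorator.

```python
def convert_grades(grade_percent: float) -> str:
    """Convert the grade percent to a letter grade using standard cutoffs"""
    # Assumed cutoffs: 90 -> A, 80 -> B, 70 -> C, 60 -> D, otherwise F.
    if grade_percent >= 90:
        return "A"
    if grade_percent >= 80:
        return "B"
    if grade_percent >= 70:
        return "C"
    if grade_percent >= 60:
        return "D"
    return "F"


print(convert_grades(91.5), convert_grades(72.0), convert_grades(40.0))
```

With the SDK, the docstring plays the role of this body: the type hints and the description are what the AI works from.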

Caching

The SDK incorporates caching to speed up execution, make results reproducible, and reduce cost.
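The general idea can be sketched as a cache keyed by the English prompt, so that a repeated instruction reuses the earlier answer instead of calling the LLM again. This is a minimal illustration, not the SDK's implementation: `PromptCache` and `fake_llm` are hypothetical names, and the stand-in function takes the place of a real, paid LLM call.

```python
import hashlib


class PromptCache:
    """In-memory cache mapping a prompt to the text generated for it."""

    def __init__(self, generate):
        self._generate = generate  # callable that produces text for a prompt
        self._store = {}
        self.misses = 0  # how many times the generator was actually called

    def get(self, prompt):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(prompt)
        return self._store[key]


def fake_llm(prompt):
    # Hypothetical stand-in for an LLM call; returns a placeholder string.
    return f"-- generated code for: {prompt}"


cache = PromptCache(fake_llm)
first = cache.get("brand with the highest growth")
second = cache.get("brand with the highest growth")
print(first == second, cache.misses)
```

The second lookup returns the stored text, which is why caching gives the same result on a rerun and avoids paying for a second call.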

Thus tooling enters a new era powered by natural language. The potential is beyond imagination - talking to robots, anyone? A dream come true?


More Information

Chat2DB

Introducing English as the New Programming Language for Apache Spark

English SDK for Apache Spark Github

