AWS Glue 4 Adds Pandas Support
Written by Kay Ewbank
Thursday, 01 December 2022
AWS Glue has been updated with newer engines and support for Pandas.

AWS Glue is a serverless data integration service that Amazon says makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning and application development. Glue includes a collection of libraries, engines, and tools developed by the open source community. It consists of a Data Catalog, which is a central metadata repository; an extract, transform, and load (ETL) engine that can automatically generate Scala or Python code; a flexible scheduler that handles dependency resolution, job monitoring, and retries; and AWS Glue DataBrew for cleaning and normalizing data with a visual interface.

Glue 4 includes AWS Glue Studio, a new graphical interface that makes it easy to create, run, and monitor ETL jobs. The Studio can be used to visually compose data transformation workflows that run on AWS Glue's Apache Spark-based serverless ETL engine.

The Pandas support means Python developers can use the data analysis and manipulation facilities of Pandas. The new version of Glue also updates the Python and Spark engines, to Python 3.10 and Apache Spark 3.3.0. Both engines include bug fixes and performance enhancements; Spark includes new features such as row-level runtime filtering and additional built-in functions.

Glue and Amazon EMR use the same optimized Spark runtime, which the Glue team says has been tuned to run in the AWS cloud and can be two to three times faster than the basic open source version. Glue 4.0 also adds native support for the Cloud Shuffle Service Plugin for Spark to help scale disk usage, and Adaptive Query Execution to dynamically optimize queries as they run.

Another improvement in the new release is support for more data formats. Glue now supports Apache Hudi, Apache Iceberg, and Delta Lake.
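The article does not say how these formats are switched on for a job. As a rough sketch, assuming Glue's `--datalake-formats` job parameter and the general shape of a Glue `CreateJob` request, a job definition targeting Glue 4.0 might be assembled as below. The helper name `glue4_job_definition` and every name, ARN, and S3 path are hypothetical placeholders, and the code only builds a dictionary; it does not contact AWS. Check the details against the current AWS documentation before relying on them.

```python
# Values assumed to be accepted by the --datalake-formats job parameter.
ALLOWED_FORMATS = {"hudi", "iceberg", "delta"}


def glue4_job_definition(name, role_arn, script_location, datalake_formats=()):
    """Hypothetical helper: build a Glue 4.0 job definition as a plain dict."""
    unknown = set(datalake_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported data lake formats: {sorted(unknown)}")

    default_arguments = {}
    if datalake_formats:
        # Assumption: multiple formats are passed as a comma-separated list.
        default_arguments["--datalake-formats"] = ",".join(datalake_formats)

    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "4.0",
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "DefaultArguments": default_arguments,
    }


# Placeholder values for illustration only.
job = glue4_job_definition(
    name="example-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/job.py",
    datalake_formats=["hudi", "iceberg"],
)
print(job["DefaultArguments"])
```

With boto3 installed and credentials configured, a dictionary of this shape could in principle be passed on as `boto3.client("glue").create_job(**job)`. Individual formats may need further Spark configuration that this sketch leaves out.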
Glue 4.0 also includes the Parquet vectorized reader, with support for additional data types and encodings. It has been upgraded to use log4j 2 and is no longer dependent on log4j 1.
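To give a flavor of the Pandas data analysis and manipulation facilities mentioned above, here is a minimal sketch of the kind of cleanup a Python developer might write. It uses only standard Pandas calls and no Glue-specific API; the `orders` sample data and the `clean_orders` function are made up for illustration, and how data reaches Pandas inside a real Glue job is not covered here.

```python
import pandas as pd

# Hypothetical sample records standing in for data a job has already loaded.
orders = pd.DataFrame(
    {
        "customer": [" alice ", "BOB", "alice", None],
        "amount": ["10.50", "3", "bad", "7.25"],
    }
)


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize customer names and keep only rows with a numeric amount."""
    out = df.dropna(subset=["customer"]).copy()
    out["customer"] = out["customer"].str.strip().str.lower()
    # Unparseable amounts become NaN and are dropped on the next line.
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out.dropna(subset=["amount"])


# Total amount per customer after cleaning.
totals = clean_orders(orders).groupby("customer")["amount"].sum()
print(totals)
```

Rows with a missing customer or a non-numeric amount are discarded rather than repaired, which is a deliberate simplification for the example.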
Last Updated (Thursday, 01 December 2022)