Apache Tika 2 Adds New Pipes Modules

Written by Kay Ewbank

Monday, 30 August 2021

Apache Tika 2 has been released with improvements including modularization of the parser modules and new pipes modules.

Apache Tika is a content analysis toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis and translation.


Apache Tika provides a Java library as well as server and command line tools that can be used to access it from other programming languages. Tika uses a number of document parsers and document type detection techniques to detect and extract data, and can be used to extract structured text as well as metadata from document types including spreadsheets, text documents, images, PDFs and some multimedia input formats.
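To make the idea of type detection concrete, here is a minimal sketch of one common technique: inspecting the leading "magic" bytes of a file. It uses only the Java standard library. The class `MagicSniffer` and its methods are hypothetical names invented for illustration; this is not Tika's detector or API, and a real detector considers far more signals than these three signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a hypothetical sniffer, not Tika's detector or API.
class MagicSniffer {

    // Leading byte signatures for a few well-known formats.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 0x50, 0x4E, 0x47};
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // True if data begins with the given signature.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Maps the leading bytes to a media type, falling back to a generic one.
    static String detect(byte[] data) {
        if (startsWith(data, PDF)) return "application/pdf";
        if (startsWith(data, PNG)) return "image/png";
        if (startsWith(data, ZIP)) return "application/zip";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7 example".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
        System.out.println(detect(new byte[] {1, 2, 3}));
    }
}
```

Once the type is known, a toolkit can hand the content to the matching parser, which is what lets callers work through a single entry point regardless of format.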

The improvements in the new version start with the modularization of the parser modules, a change made to allow for easily configurable parser sub-packages. The tika-app, tika-server and tika-bundle jars have been growing and are now all larger than 50MB. The developers have modularized the parsers so that users can easily specify the subset of parsers they care about, either a la carte or by category such as image, common office files (MSOffice, PDF, etc.), or environmental data, and be provided with only the dependencies required for that subset.

The next improvement is the new pipes modules. These enable synchronous and asynchronous fetching from numerous data sources including JDBC, fileshares and S3. The data is then parsed and emitted to other endpoints such as fileshares, S3, Solr, or Elasticsearch.
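The fetch, parse and emit flow described above can be sketched in plain Java. Everything here is an assumption made for illustration: `Fetcher`, `Emitter`, `MapFetcher`, `MapEmitter` and `PipeSketch` are hypothetical names, the in-memory maps stand in for sources such as S3 and endpoints such as Solr, and the "parser" simply decodes bytes to text. It is not the tika-pipes API.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative only: hypothetical types, not the tika-pipes API.
class PipeSketch {

    interface Fetcher { byte[] fetch(String key); }
    interface Emitter { void emit(String key, String text); }

    // Stand-in for a file share, S3 bucket or JDBC source.
    static final class MapFetcher implements Fetcher {
        private final Map<String, byte[]> store;
        MapFetcher(Map<String, byte[]> store) { this.store = store; }
        public byte[] fetch(String key) {
            byte[] data = store.get(key);
            if (data == null) throw new IllegalArgumentException("No such key: " + key);
            return data;
        }
    }

    // Stand-in for an endpoint such as Solr or Elasticsearch.
    static final class MapEmitter implements Emitter {
        final Map<String, String> index = new ConcurrentHashMap<>();
        public void emit(String key, String text) { index.put(key, text); }
    }

    // Synchronous path: fetch one item, parse it, emit the result.
    static void process(String key, Fetcher fetcher,
                        Function<byte[], String> parser, Emitter emitter) {
        emitter.emit(key, parser.apply(fetcher.fetch(key)));
    }

    // Asynchronous path: run the same steps for each key on the common pool.
    static CompletableFuture<Void> processAsync(List<String> keys, Fetcher fetcher,
                                                Function<byte[], String> parser, Emitter emitter) {
        CompletableFuture<?>[] tasks = keys.stream()
                .map(k -> CompletableFuture.runAsync(() -> process(k, fetcher, parser, emitter)))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(tasks);
    }

    public static void main(String[] args) {
        Map<String, byte[]> source = Map.of(
                "a.txt", " first document ".getBytes(StandardCharsets.UTF_8),
                "b.txt", " second document ".getBytes(StandardCharsets.UTF_8));
        Fetcher fetcher = new MapFetcher(source);
        MapEmitter emitter = new MapEmitter();
        // Placeholder parser: decode the bytes and trim whitespace.
        Function<byte[], String> parser = b -> new String(b, StandardCharsets.UTF_8).trim();

        process("a.txt", fetcher, parser, emitter);
        processAsync(List.of("b.txt"), fetcher, parser, emitter).join();
        System.out.println(emitter.index.keySet());
    }
}
```

Keeping the fetcher and emitter behind small interfaces is what would let the same processing step run against different sources and endpoints, whether it is called directly or scheduled asynchronously.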


More Information

Tika Website

Apache Kafka 2.7 Updates Broker

Tika in Action
