Apache Tika Improves Security
Written by Kay Ewbank   
Monday, 04 April 2022

Apache Tika 2.3 has been released with improvements including security upgrades to several dependencies, and a move to using Apache POI 5.2.

Tika is a content analysis toolkit for detecting and extracting metadata and text. It can be used to extract metadata from over a thousand different file types, all of which can be parsed through a single interface, making Tika useful for search engine indexing, content analysis and translation.


Tika has a Java library as well as server and command line tools. It uses a number of document parsers and document type detection techniques to detect and extract data.
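
One common detection technique is matching a file's leading "magic" bytes against a table of known signatures. The sketch below illustrates that general idea using only the Java standard library. It is not Tika's implementation: the class name MagicSniffer, the three-entry signature table and the media type labels are assumptions chosen for illustration, and real detectors combine many more signatures with file names and container inspection.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of magic-byte detection; not part of Tika.
public class MagicSniffer {

    // Pairs a leading byte sequence with an illustrative media type label.
    private static final class Signature {
        final byte[] magic;
        final String mediaType;

        Signature(byte[] magic, String mediaType) {
            this.magic = magic;
            this.mediaType = mediaType;
        }
    }

    private static final List<Signature> SIGNATURES = new ArrayList<>();

    static {
        // PDF files start with the ASCII text "%PDF-".
        SIGNATURES.add(new Signature(
                "%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        // ZIP local file header "PK\3\4"; OOXML documents are ZIP containers.
        SIGNATURES.add(new Signature(
                new byte[] {0x50, 0x4B, 0x03, 0x04}, "application/zip"));
        // OLE 2 compound document header, used by older Office formats.
        SIGNATURES.add(new Signature(
                new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1},
                "application/x-ole-storage"));
    }

    // Returns the label of the first signature that prefixes the data,
    // or a generic binary type when nothing matches.
    public static String detect(byte[] prefix) {
        for (Signature s : SIGNATURES) {
            if (startsWith(prefix, s.magic)) {
                return s.mediaType;
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader));
        System.out.println(detect(new byte[] {0x00, 0x01}));
    }
}
```

A signature table on its own cannot tell a .docx from any other ZIP archive, which is one reason a toolkit needs several detection techniques rather than one.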

Apache POI used to be part of the Jakarta Project, and provides Java APIs for reading and writing files in the Office Open XML standards (OOXML) and OLE 2 Microsoft Office formats. The move to using POI 5.x in Tika represents a major refactoring, according to the developers, who also say that users may experience significantly more logging.

The new release also includes several security upgrades in dependencies, including an upgrade to log4j2 to address the known security vulnerabilities in log4j.

Most of the other work has been on the Tika parsers, particularly the PDF parser, which now extracts annotation types, subtypes and 3D annotations into metadata. There's a new parser for Translation Memory eXchange (TMX) files, another for IDML, and improved identification of iWorks 13 files, along with added parsing for their thumbnails, some metadata and attachments.

Tika Config has changes to improve the configuration of maps (key/value attributes) as parameters for parsers. Another change has been made to all the parsers for embedded files to improve consistency in the reporting of package-entry divs. The team says this will lead to some more text, specifically embedded file names, in files with many embedded attachments.

Tika 2.3 is available now. 


More Information

Tika Website

Related Articles

Apache Tika 2 Adds New Pipes Modules

Apache Kafka 2.7 Updates Broker

Tika in Action

 
