ApacheCon NA 2015 has ended
Wednesday, April 15 • 10:00am - 10:50am
Evaluating Text Extraction: Developing a Toolkit for Apache Tika™ - Tim Allison, The MITRE Corporation

Text extraction tools are essential for obtaining the textual content and metadata of computer files for use in a wide variety of applications, including search and natural language processing tools. Techniques and tools for evaluating text extraction tools are missing from academia and industry. Apache Tika™ detects file types and extracts metadata and text from many file types. Tika is a crucial component in a wide variety of tools, including Solr™, Nutch™, Alfresco, Elasticsearch and Sleuth Kit®/Autopsy®. In this talk, we will give an overview of a new initiative within Tika to create an evaluation toolkit that allows integrators to evaluate Tika and other content extraction systems on client-specific documents. This talk will end with a brief discussion of a related initiative to take this evaluation methodology public and evaluate Tika on large batches of public domain documents.

Note: This talk was co-authored with Paul M. Herceg, Lead Artificial Intelligence Engineer, The MITRE Corporation. Paul holds an M.S. in Computer Science and a B.S. in Computer Science-Mathematics, both from the State University of New York at Binghamton.

Tim Allison

Principal Artificial Intelligence Engineer, The MITRE Corporation
Tim has been working in natural language processing since 2002. In recent years, his focus has shifted to advanced search and content/metadata extraction. Tim is a committer and PMC member on Apache PDFBox (since September 2016), and on Apache POI and Apache Tika (since July 2013...

Wednesday April 15, 2015 10:00am - 10:50am CDT
Texas I
