Wednesday, April 15 • 10:00am - 10:50am
Evaluating Text Extraction: Developing a Toolkit for Apache Tika™ - Tim Allison, The MITRE Corporation


Text extraction tools are essential for obtaining the textual content and metadata of computer files for use in a wide variety of applications, including search and natural language processing. Techniques and tools for evaluating text extractors, however, are missing from academia and industry. Apache Tika™ detects file types and extracts metadata and text from many formats. Tika is a crucial component in a wide variety of tools, including Solr™, Nutch™, Alfresco, Elasticsearch and Sleuth Kit®/Autopsy®. In this talk, we will give an overview of a new initiative within Tika to create an evaluation toolkit that allows integrators to evaluate Tika and other content extraction systems on client-specific documents. The talk will end with a brief discussion of a related initiative to take this evaluation methodology public and evaluate Tika on large batches of public domain documents.

Note: This talk was co-authored with Paul M. Herceg, Lead Artificial Intelligence Engineer, The MITRE Corporation. Paul holds an M.S. in Computer Science and a B.S. in Computer Science-Mathematics, both from the State University of New York at Binghamton.

Speakers

Tim Allison

Principal Artificial Intelligence Engineer, The MITRE Corporation
Tim has been working in natural language processing since 2002. In recent years, his focus has shifted to advanced search and content/metadata extraction. Tim has been a committer and PMC member on Apache POI and Apache Tika since July 2013. Tim holds a Ph.D. in Classical Studies from the University of Michigan, and in a former life, he was a professor of Latin and Greek.


Wednesday April 15, 2015 10:00am - 10:50am
Texas I
