ApacheCon NA 2015 has ended
Monday, April 13 • 5:00pm - 5:50pm
Content Extraction from Images and Video in Tika - Chris Mattmann, NASA

The DARPA Memex project and NSF Polar Cyber Infrastructure project have been funding a ton of improvements in the Apache Tika framework. Apache Tika is a content detection and analysis toolkit that supports file type identification (MIME identification) for over 1200 types of files; extraction of text, metadata, and language information from those files; and even translation!

Though Tika supports all those file types, its support for extraction from images and videos has been lacking. Via the Memex and NSF projects, we have expanded Tika to extract text from images (using Tesseract OCR), and we are actively integrating other analyses (visual sentiment analysis; geo-location using toolkits like GDAL; and analyses of scenes and objects).
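
As a rough illustration of the image text extraction described above, here is a minimal Python sketch that sends an image to a running Tika server and reads back plain text. It is not taken from the talk. It assumes a tika-server instance listening locally on its default port 9998, with Tesseract installed where Tika can find it so that the OCR parser is picked up for images. The helper names `build_ocr_request` and `extract_text` are hypothetical.

```python
import urllib.request

# Assumption: a tika-server instance is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_ocr_request(image_bytes, content_type="image/png", url=TIKA_URL):
    """Build a PUT request asking the Tika server for plain-text extraction."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="PUT",
        headers={
            "Content-Type": content_type,  # tells Tika what kind of file this is
            "Accept": "text/plain",        # ask for extracted text, not XHTML
        },
    )


def extract_text(image_bytes, content_type="image/png", url=TIKA_URL):
    """Send the image to the server and return whatever text it extracts."""
    request = build_ocr_request(image_bytes, content_type, url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With a server running, reading an image file in binary mode and passing its bytes to `extract_text` should return the recognized text. If Tesseract is not available to Tika, the response for an image will likely contain no text.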

I'll tell you all about how to install and use these improvements and even illustrate them in a cool example from Memex and NSF Polar.

Speakers
Chris Mattmann

Chief Architect & Adjunct Associate Professor, NASA Jet Propulsion Laboratory & USC
Chris Mattmann has a wealth of experience in software design, and in the construction of large-scale data-intensive systems. His work has infected a broad set of communities, ranging from helping NASA unlock data from its next generation of earth science system satellites, to assisting...


Monday April 13, 2015 5:00pm - 5:50pm CDT
Texas II