ApacheCon NA 2015 has ended
Wednesday, April 15 • 9:00am - 9:50am
What's With the 1s and 0s? Making Sense of Binary Data at Scale With Tika and Friends - Nick Burch, Quanticate

If you have one or two files, you can take the time to manually work out what they are, what they contain, and how to get the useful bits out (probably...). However, this approach really doesn't scale, mechanical turks or no! Luckily, there are Apache projects out there which can help!

In this talk, we'll first look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how to do all of this with Apache Tika, and how to dive down to the underlying libraries (including its Apache friends like POI and PDFBox) for specialist cases. Finally, we'll take a brief look at how to roll this all out in a Big Data or large-scale search setting.
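
To give a flavour of the first step (working out what a blob of 1s and 0s is), here is a minimal sketch of magic-number sniffing using only the Python standard library. It is not Tika's implementation or API: the `sniff_type` function and the small `MAGIC_SIGNATURES` table are hypothetical names made up for illustration, and a real detector handles far more formats and edge cases.

```python
# Hypothetical, deliberately tiny signature table: (leading bytes, media type).
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_type(blob: bytes) -> str:
    """Guess a media type from the leading bytes of a blob (illustrative only)."""
    # Known binary formats first: compare the start of the blob to each signature.
    for magic, media_type in MAGIC_SIGNATURES:
        if blob.startswith(magic):
            return media_type

    # Empty input or embedded NUL bytes: treat as unknown binary.
    if not blob or b"\x00" in blob:
        return "application/octet-stream"

    # Otherwise, call it text if the whole blob decodes cleanly as UTF-8.
    try:
        blob.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"


if __name__ == "__main__":
    print(sniff_type(b"%PDF-1.4\n..."))
    print(sniff_type("just some text".encode("utf-8")))
```

Note that a ZIP signature alone cannot tell a plain archive from a ZIP-based document format such as .docx, which is one reason to look inside the container rather than stop at the first few bytes.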

Speakers
Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! Most of the projects Nick has worked on belong in the "Content" space, such as Apache POI (ex-PMC Chair), Apache Tika and Apache Chemistry. As well as coding projects, Nick is also involved in a number...


Wednesday April 15, 2015 9:00am - 9:50am
Texas I
