ApacheCon NA 2015 has ended


Science
Monday, April 13

11:45am CDT

Applying Apache Hadoop to NASA’s Big Climate Data - Glenn Tamkin, NASA
The NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance analytics because it optimizes computer clusters and combines distributed storage of large data sets with parallel computation. We have built a platform for developing new climate analysis capabilities with Hadoop.

Hadoop is best known for text-based problems, but our scenario involves binary data, so we created custom Java applications to read and write the data during the MapReduce process. Our solution is unique because it: a) uses a custom composite key design for fast data access, and b) utilizes the Hadoop Bloom filter, a probabilistic data structure that tests quickly and memory-efficiently whether an element is present in a set.
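
To make the Bloom-filter idea concrete, here is a minimal plain-Java sketch of the data structure (a simplified stand-in written for illustration, not Hadoop's actual `org.apache.hadoop.util.bloom.BloomFilter` API; the hashing scheme here is a deliberately naive assumption):

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash probes into an m-bit array.
// It may report false positives, but never false negatives.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;  // number of bits
    private final int k;  // number of hash probes per key

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th probe position from the key's hash code.
    private int probe(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, m);
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    // false means definitely absent; true means possibly present.
    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(probe(key, i))) return false;
        return true;
    }
}
```

The payoff in a MapReduce job over binary climate records is that a task can consult the filter first and skip deserializing any record whose key is definitely not in the requested set.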

This presentation, which touches on motivation, use cases, and lessons learned, will explore the software architecture, including all Apache contributions (Avro, Maven, etc.).


Glenn Tamkin

Mr. Tamkin is the lead software engineer and architect for the NASA Center for Climate Simulation’s (NCCS) Climate Informatics project. Recently, he has built a Hadoop-based system designed to perform analytics across NASA’s Big Climate Data. Prior endeavors extended from spacecraft...

Monday April 13, 2015 11:45am - 12:35pm CDT
Texas II

3:00pm CDT

Streaming-OODT: Combining Apache Spark's Power with Apache OODT - Michael Starch, NASA Jet Propulsion Laboratory
Streaming-OODT was designed to overcome the limitations of Apache OODT, which does not include cutting-edge data processing technologies and is limited in its ability to handle extremely large data sets.
As an extension to Apache OODT funded through the NASA Jet Propulsion Laboratory’s Big Data Research & Technology Development initiative “Archiving, Processing and Dissemination for the Big Data Era”, Streaming-OODT encapsulates state-of-the-art big data technologies within Apache OODT, providing a prepackaged yet powerful data system.
Streaming-OODT enables OODT to use the in-memory MapReduce processing provided by Apache Spark. Cluster management and multi-tenancy are provided via Apache Mesos. Apache Kafka and Spark Streaming enable the system to handle both streaming data types and stream processing. Together, these capabilities equip Apache OODT to handle next-generation big data.
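
The "in-memory MapReduce" pattern Spark brings to OODT can be illustrated with the classic word count, written here with plain Java streams so the example is self-contained (this is the shape of the computation, not the actual Spark API, which would express the same thing with `flatMap`/`reduceByKey` over a distributed dataset):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// The word-count "hello world" of MapReduce, run entirely in memory:
// the same map -> shuffle -> reduce shape that Spark distributes across
// a cluster while keeping intermediate results in RAM between stages.
public class InMemoryWordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // map phase: split each record into tokens
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // reduce phase: group by word and sum the counts
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }
}
```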


Michael Starch

Computer Engineer in Applications, NASA Jet Propulsion Laboratory
Michael Starch has been employed by the Jet Propulsion laboratory for the past 5 years. His primary responsibilities include: engineering big data processing systems for handling scientific data, researching the next generation of big data technologies, and helping infuse these systems...

Monday April 13, 2015 3:00pm - 3:50pm CDT
Texas II

4:00pm CDT

Exploring Apache Tika's Translate API to Enable Linguistic Analysis of Scientific Metadata through Internationalizing NASA JPL’s Physical Oceanographic Data Active Archive Centre - Lewis McGibbney, The Apache Software Foundation
The NASA Jet Propulsion Laboratory's (JPL) Physical Oceanography Distributed Active Archive Center (PO.DAAC) is one of a number of NASA data archives containing many petabytes of oceanographic data. The primary goal (and challenge) for PO.DAAC is to enable provision, dissemination, and availability of such data to the global scientific community at large. The driving justification behind the Internationalization Product Retrieval Services (iPReS) project is the growing requirement for PO.DAAC to provide high-quality data products and services in a user-oriented manner by introducing language translation support for any data products retrieved from the archive. Currently, this information is available only in English. This presentation will show how recent work on Apache Tika's Translate API has been leveraged to back the iPReS service.
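
Tika's translate API centers on a `Translator` interface whose implementations delegate to external translation services. A minimal sketch of how an iPReS-style service might plug into that shape (the interface is reproduced locally here in simplified form, and the glossary-backed implementation is purely hypothetical, standing in for Tika's real service-backed translators):

```java
import java.util.Map;

// Simplified local stand-in for the shape of Tika's translate interface
// (org.apache.tika.language.translate.Translator).
interface Translator {
    String translate(String text, String sourceLang, String targetLang);
}

// Hypothetical glossary-backed translator for illustration only; Tika's
// shipped implementations call out to real translation services instead.
public class GlossaryTranslator implements Translator {
    private final Map<String, String> glossary;

    public GlossaryTranslator(Map<String, String> glossary) {
        this.glossary = glossary;
    }

    @Override
    public String translate(String text, String sourceLang, String targetLang) {
        // Fall back to the original text when no translation is known,
        // so untranslated metadata still reaches the user.
        return glossary.getOrDefault(text, text);
    }
}
```

Keeping the translation backend behind a single interface is what lets a service like iPReS swap providers without touching the product-retrieval code.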


Lewis McGibbney

Enterprise Search Technologist III, Jet Propulsion Laboratory

Monday April 13, 2015 4:00pm - 4:50pm CDT
Texas II

5:00pm CDT

Content Extraction from Images and Video in Tika - Chris Mattmann, NASA
The DARPA Memex project and NSF Polar Cyber Infrastructure project have been funding a ton of improvements in the Apache Tika framework. Apache Tika is a content detection and analysis toolkit that supports file type identification (MIME identification) for over 1200 types of files; extraction of text, metadata, and language information from those files; and even translation!
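
At its core, the MIME identification Tika performs starts with matching a file's leading "magic" bytes against known signatures. A toy version in plain Java (Tika's real detector consults a registry covering the 1200+ types mentioned above; this illustrative table covers just two):

```java
import java.util.Arrays;

// Toy MIME sniffer: compare a file's leading bytes against known magic
// signatures. This is the core idea behind magic-byte detection; a real
// detector like Tika's also weighs globs, content heuristics, and more.
public class ToyMimeDetector {
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};

    public static String detect(byte[] header) {
        if (startsWith(header, PNG)) return "image/png";
        if (startsWith(header, PDF)) return "application/pdf";
        return "application/octet-stream"; // unknown binary fallback
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}
```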

Though Tika supports all those file types, its support for extraction from images and videos has been lacking. Via the Memex and NSF projects, we have expanded Tika to extract text from images (using Tesseract OCR) and are actively integrating other analyses (visual sentiment analysis; geolocation using toolkits like GDAL; and analyses of scenes and objects).

I'll tell you all about how to install and use these improvements and even illustrate them in a cool example from Memex and NSF Polar.


Chris Mattmann

Chief Architect & Adjunct Associate Professor, NASA Jet Propulsion Laboratory & USC
Chris Mattmann has a wealth of experience in software design and in the construction of large-scale data-intensive systems. His work has impacted a broad set of communities, ranging from helping NASA unlock data from its next generation of earth science system satellites, to assisting...

Monday April 13, 2015 5:00pm - 5:50pm CDT
Texas II
Tuesday, April 14

10:40am CDT

Programming Math in Java - Lessons from Apache Commons Math - Phil Steitz
Apache Commons Math is a general-purpose mathematics library written in Java. In this talk, we will provide an overview of the library, showing how to use it to solve a wide range of common mathematical programming problems. Along the way, we will point out design and implementation challenges that we have faced over the years in choosing algorithms, developing the API, handling corner cases, and balancing performance, accuracy, and usability considerations. We will conclude with an update on work in progress and what the community is talking about regarding future directions.
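
One such common problem is simple linear regression, which Commons Math exposes through `org.apache.commons.math3.stat.regression.SimpleRegression`. To keep this sketch dependency-free, here is the underlying ordinary-least-squares computation written by hand (an illustrative derivation, not the library's implementation):

```java
// Ordinary least squares for y = a + b*x over paired samples.
// Commons Math's SimpleRegression wraps this same computation (plus
// streaming updates, significance statistics, and corner-case handling).
public class LeastSquares {
    // Returns {intercept a, slope b} minimizing sum of squared residuals.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new double[]{intercept, slope};
    }
}
```

The "corner cases" the abstract alludes to show up even here: a vertical line of points makes the denominator zero, which is exactly the kind of situation a library API must detect and report sensibly.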


Phil Steitz

Director, Apache Software Foundation
Phil Steitz is a member of the Board of Directors of the Apache Software Foundation. He has been an ASF volunteer since 2003 and an ASF member since 2005. He has served as VP, Apache Commons, as a mentor in the Apache Incubator and a committer on multiple ASF projects. His involvement...

Tuesday April 14, 2015 10:40am - 11:30am CDT
Texas II

3:00pm CDT

Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials - Eran Withana, Comprehend Systems
Pharmaceutical and medical device makers spend over $130bn each year collecting and analyzing new data, mostly through clinical trials. It costs over $1.8bn to bring a new drug to market, and over $4bn when factoring in the cost of failures. By understanding and analyzing this data more efficiently, drug makers can bring new drugs to patients faster, more safely, and at lower cost.

In this presentation, Eran will discuss how ETL pipelines can be built using Apache and other open source projects to improve clinical trial development. We will examine how the system is built, the challenges we faced, and how we were able to reduce cost, accelerate execution time, and improve results. We will also demonstrate how reliable resource allocation, scalable data ingestion adapters, on-demand and fault-tolerant job deployments, and monitoring benefit clinical trial decision-making and execution.
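
The extract-transform-load pattern behind such pipelines can be sketched in a few lines of Java (all class and parameter names here are illustrative, not the speakers' actual system; a production pipeline would add the scheduling, retries, and fault tolerance mentioned above around this core):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Skeleton of an ETL pipeline as composable stages: extract parses a raw
// record, transform cleans or normalizes it, and run() materializes the
// results (standing in for the "load" step into a target store).
public class EtlPipeline<R, T> {
    private final Function<String, R> extract;
    private final Function<R, T> transform;

    public EtlPipeline(Function<String, R> extract, Function<R, T> transform) {
        this.extract = extract;
        this.transform = transform;
    }

    public List<T> run(List<String> rawRecords) {
        return rawRecords.stream()
                .map(extract)
                .map(transform)
                .collect(Collectors.toList());
    }
}
```

Keeping each stage a pure function is what makes the pipeline easy to test in isolation and to redeploy on demand when a job fails.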


Eran Withana

Comprehend Systems
Eran has been a member of the Apache Software Foundation since 2005 and has contributed to numerous Open Source projects for over a decade. He has spoken at several technology conferences, including ApacheCon US and Europe, JAX, and other scalable-systems research conferences. He is an Open Source...

Tuesday April 14, 2015 3:00pm - 3:50pm CDT
Texas II

4:20pm CDT

The Emergence of the Datacenter Developer - Tobi Knaup, Mesosphere
A new category of developer is emerging in the datacenter. It used to be that individual servers were the building block for applications, but today’s datacenter developers have thousands of servers at their disposal. In this talk, Tobias Knaup will explain how all applications are becoming distributed applications and how that is creating a new breed of developer that programs against the datacenter as if it were their laptop. Knaup will provide an outlook for the emerging “datacenter developer” category, describe how advancements in abstractions of datacenter resources will forever change the balance of power between developers and operations, and share specific examples of what it means to “program against the datacenter”, explaining these trends in the context of hot new frameworks (Kubernetes, Rocket, Mesos, etc.).


Tobi Knaup

Tobias is the CTO and Co-Founder of Mesosphere, a startup that is building a data center operating system based on Apache Mesos to support the next generation of large-scale distributed applications. He was one of the first engineers and engineering leaders at Airbnb. At Airbnb, he...

Tuesday April 14, 2015 4:20pm - 5:10pm CDT
Texas II
Wednesday, April 15

10:00am CDT

Hadoop Applications on High Performance Computing (HPC) - Devaraj Kavali, Intel
High-performance workloads have expanded. Today’s HPC users are demanding application frameworks to analyze the vast amounts of data created by complex simulations. As the most widely deployed file system for HPC, Lustre software can play a critical role for these data-intensive applications. The HPC Adapter for Lustre (HAL) makes it possible to run Hadoop File System operations in an HPC environment without any changes to applications. The HPC Adapter for MapReduce/YARN (HAM) allows users to run their MapReduce/YARN applications, without changes, directly on shared, fast, Lustre-powered storage. This optimizes the performance of MapReduce/YARN tasks while delivering faster, more scalable, easier-to-manage storage. This session explains the architecture and design-level technical details of running MapReduce and YARN applications on HPC-native schedulers like Slurm, MOAB, etc.


Devaraj Kavali

Intel Corporation
Devaraj Kavali is an Apache Hadoop committer and a contributor to Hadoop YARN & MapReduce. He is currently working with Intel Corporation. He has been working on various distributed platforms and applications for more than 8 years.

Wednesday April 15, 2015 10:00am - 10:50am CDT
Texas II