ApacheCon NA 2015 has ended


Science
Monday, April 13

10:45am CDT

Getting Started with Apache OODT - Tom Barber, Meteorite Consulting
Apache OODT is a modular, distributed data processing framework that helps collect, process, and catalogue data.
In our getting-started tutorial, we will take a look at OODT, its various modules, how to build and deploy OODT using RADiX or Docker, and how to modify and extend OODT to fit your data processing and storage needs. We will also look at how to process data and distribute it across multiple servers.


Tom Barber

Technical Director, Spicule LTD
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and a regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and an open-source stalwart, when not working for NASA, Tom currently deals…

Monday April 13, 2015 10:45am - 11:35am CDT
Texas II

11:45am CDT

Applying Apache Hadoop to NASA’s Big Climate Data - Glenn Tamkin, NASA
The NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance analytics because it optimizes computer clusters and combines distributed storage of large data sets with parallel computation. We have built a platform for developing new climate analysis capabilities with Hadoop.

Hadoop is well known for text-based problems, but our scenario involves binary data. So, we created custom Java applications to read and write data during the MapReduce process. Our solution is unique because it: a) uses a custom composite key design for fast data access, and b) utilizes the Hadoop Bloom filter, a data structure for testing rapidly and memory-efficiently whether an element is present in a set.
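The Bloom-filter idea in the abstract can be sketched in a few lines. This is an illustrative toy in Python, not the actual `org.apache.hadoop.util.bloom` implementation the speakers used, and the composite key string is a made-up example:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array.
    Membership tests can yield false positives, never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k independent bit positions from a single hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# A composite key (dataset/variable/date) flattened to one string lets a
# reader skip whole data blocks whose keys cannot possibly be present.
bf = BloomFilter()
bf.add("merra/T2M/2014-01-01")
```

In the setting described, a much larger filter sits in front of the binary sequence data so most non-matching lookups are rejected in memory without touching disk.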

This presentation, which touches on motivation, use cases, and lessons learned, will explore the software architecture, including all Apache contributions (Avro, Maven, etc.).


Glenn Tamkin

Mr. Tamkin is the lead software engineer and architect for the NASA Center for Climate Simulation’s (NCCS) Climate Informatics project. Recently, he has built a Hadoop-based system designed to perform analytics across NASA’s Big Climate Data. Prior endeavors extended from spacecraft…

Monday April 13, 2015 11:45am - 12:35pm CDT
Texas II

2:00pm CDT

Apache Tika: Cool Insights into Polar Data - Annie Burgess, USC
Climate change is amplified in the Polar Regions. Polar amplification is captured via space and airborne remote sensing, in-situ measurement, and climate modeling. While simply finding these data is often a challenge, this talk will focus on what to do with the data (and metadata) once it is found! Here we present our current efforts using Apache Tika to help us ask some big questions about Arctic and Antarctic data. Apache Tika is an open source framework for metadata exploration, automatic text mining, and information retrieval. Over the past year, we have expanded Apache Tika to parse, extract, and analyze common data formats used in Arctic and Antarctic research making them more easily accessible, searchable, and retrievable by all major content management systems. Come to this talk to hear about how we’ve expanded Tika and what cool new insights we have into polar data!
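As a flavor of what "parsing common data formats" involves, here is a minimal Python stand-in for Tika-style content detection. The magic-byte signatures below (NetCDF classic, HDF5, GRIB) are real, but the function itself is an illustrative sketch, not Tika's API:

```python
# Map the leading "magic" bytes of common polar-science formats to a
# MIME type, the way a content detector identifies files by signature.
MAGIC = {
    b"CDF\x01": "application/x-netcdf",          # NetCDF classic
    b"\x89HDF\r\n\x1a\n": "application/x-hdf5",  # HDF5
    b"GRIB": "application/x-grib",               # GRIB
}

def detect(header: bytes) -> str:
    """Return a MIME type for a file's leading bytes, or a generic
    fallback when no known signature matches."""
    for magic, mime in MAGIC.items():
        if header.startswith(magic):
            return mime
    return "application/octet-stream"

detect(b"CDF\x01" + b"\x00" * 28)  # -> "application/x-netcdf"
```

Once a file's type is known, a type-specific parser can pull out the text and metadata that make the data set searchable.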


Annie Burgess

Lab Director, ESIP

Monday April 13, 2015 2:00pm - 2:50pm CDT
Texas II

3:00pm CDT

Streaming-OODT: Combining Apache Spark's Power with Apache OODT - Michael Starch, NASA Jet Propulsion Laboratory
Streaming-OODT was designed to overcome the limitations of Apache OODT, which does not include cutting-edge data processing technologies and is limited in its ability to handle extremely large data sets.
As an extension to Apache OODT funded through the NASA Jet Propulsion Laboratory’s Big Data Research & Technology Development initiative “Archiving, Processing and Dissemination for the Big Data Era”, Streaming-OODT encapsulates state-of-the-art big data technologies within Apache OODT, providing a prepackaged yet powerful data system.
Streaming-OODT enables OODT to use the in-memory MapReduce processing provided by Apache Spark. Cluster management and multi-tenancy are provided via Apache Mesos. Apache Kafka and Spark Streaming enable the system to handle both streaming data types and streaming processing. Together, these enable Apache OODT to handle next-generation big data.
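Conceptually, Spark Streaming discretizes a live feed (such as a Kafka topic) into micro-batches and runs a batch job over each one. A pure-Python sketch of that micro-batch model, for illustration only (not the Spark API):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded event stream into fixed-size micro-batches,
    mirroring the discretized-stream model Spark Streaming applies to
    incoming Kafka data."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each micro-batch can then be handed to a (Spark-style) batch job:
events = range(10)
sums = [sum(batch) for batch in micro_batches(events, 4)]
# sums == [6, 22, 17]
```

In the real system the per-batch work runs in parallel across a Mesos-managed cluster; the sketch only shows how a stream becomes a sequence of small batch jobs.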


Michael Starch

Computer Engineer in Applications, NASA Jet Propulsion Laboratory
Michael Starch has been employed by the Jet Propulsion Laboratory for the past 5 years. His primary responsibilities include engineering big data processing systems for handling scientific data, researching the next generation of big data technologies, and helping infuse these systems…

Monday April 13, 2015 3:00pm - 3:50pm CDT
Texas II

4:00pm CDT

Exploring Apache Tika's Translate API to Enable Linguistic Analysis of Scientific Metadata through Internationalizing NASA JPL’s Physical Oceanographic Data Active Archive Centre - Lewis McGibbney, The Apache Software Foundation
The NASA Jet Propulsion Laboratory (JPL) Physical Oceanography Distributed Active Archive Center (PO.DAAC) is one of a number of NASA data archives containing many petabytes of oceanographic data. The primary goal (and challenge) for PO.DAAC is to enable the provision, dissemination and availability of such data to the global scientific community at large. The driving justification behind the Internationalization Product Retrieval Services (iPReS) project is the growing requirement for PO.DAAC to provide high-quality data products and services in a user-oriented manner by introducing language translation support for any data products retrieved from the archive. Currently, this information is available only in English. This presentation will show how recent work on Apache Tika's Translate API has been leveraged to back the iPReS service.
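Tika's Translate API exposes pluggable translators behind a common translate(text, source, target) shape. The following is a hypothetical, self-contained Python sketch of that pattern; the lookup-table backend merely stands in for a real translation service, and all names here are illustrative:

```python
class DictionaryTranslator:
    """Stand-in translation backend. A production backend would call an
    external translation service; a tiny lookup table keeps this sketch
    runnable and self-contained."""

    TABLE = {
        ("en", "es"): {
            "sea surface temperature": "temperatura superficial del mar",
        },
    }

    def translate(self, text, source, target):
        # Fall back to the original text when no translation is known,
        # so untranslated metadata is still served rather than dropped.
        return self.TABLE.get((source, target), {}).get(text, text)

translator = DictionaryTranslator()
translator.translate("sea surface temperature", "en", "es")
```

The fall-through behavior matters for a data archive: a failed or missing translation should degrade to the English original, not to an error.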


Lewis McGibbney

Enterprise Search Technologist III, Jet Propulsion Laboratory

Monday April 13, 2015 4:00pm - 4:50pm CDT
Texas II

5:00pm CDT

Content Extraction from Images and Video in Tika - Chris Mattmann, NASA
The DARPA Memex project and NSF Polar Cyber Infrastructure project have been funding a ton of improvements in the Apache Tika framework. Apache Tika is a content detection and analysis toolkit that supports file type identification (MIME identification) for over 1200 types of files; extraction of text, metadata, and language information from those files; and even translation!

Though Tika supports all those file types, its support for extraction from images and videos has been lacking. Via the Memex and NSF projects, we have expanded Tika to extract text from images (using Tesseract OCR), and we are actively integrating other analyses (visual sentiment analysis, geolocation using toolkits like GDAL, and analyses of scenes and objects).
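The extraction flow can be pictured as a registry that routes each detected MIME type to a parser, with image types routed to an OCR step. A hypothetical Python sketch of that dispatch pattern (the names are illustrative, not Tika's real classes):

```python
def parse_text(payload: bytes) -> str:
    """Plain text needs no extraction beyond decoding."""
    return payload.decode("utf-8", errors="replace")

def parse_image(payload: bytes) -> str:
    # A real pipeline would invoke an OCR engine such as Tesseract here;
    # a placeholder keeps the sketch self-contained and runnable.
    return "<text recovered by OCR would appear here>"

# Route detected MIME types to the appropriate extraction step.
PARSERS = {
    "text/plain": parse_text,
    "image/png": parse_image,
    "image/jpeg": parse_image,
}

def extract(mime_type: str, payload: bytes) -> str:
    parser = PARSERS.get(mime_type)
    return parser(payload) if parser else ""

extract("text/plain", b"ice core log")  # -> "ice core log"
```

Adding OCR support then amounts to registering a new parser for the image types, which is roughly the spirit of the work described above.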

I'll tell you all about how to install and use these improvements and even illustrate them in a cool example from Memex and NSF Polar.


Chris Mattmann

Chief Architect & Adjunct Associate Professor, NASA Jet Propulsion Laboratory & USC
Chris Mattmann has a wealth of experience in software design and in the construction of large-scale data-intensive systems. His work has impacted a broad set of communities, ranging from helping NASA unlock data from its next generation of earth science system satellites, to assisting…

Monday April 13, 2015 5:00pm - 5:50pm CDT
Texas II
Tuesday, April 14

10:40am CDT

Programming Math in Java - Lessons from Apache Commons Math - Phil Steitz
Apache Commons Math is a general-purpose mathematics library written in Java. In this talk, we will provide an overview of the library, showing how to use it to solve a wide range of common mathematical programming problems. Along the way, we will point out design and implementation challenges that we have faced over the years in choosing algorithms, developing the API, handling corner cases, and balancing performance, accuracy and usability considerations. We will conclude with an update on work in progress and what the community is talking about regarding future directions.
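One concrete example of the accuracy corner cases such a library must handle: the textbook sum-of-squares variance formula loses precision on large, similar-valued inputs, so careful implementations use a one-pass update scheme (Welford's algorithm). A small Python sketch of that technique (Commons Math itself is Java; this only illustrates the idea):

```python
def running_stats(values):
    """Numerically stable one-pass mean and sample variance
    (Welford's algorithm): update the mean incrementally and
    accumulate squared deviations against the *current* mean."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

running_stats([1.0, 2.0, 3.0, 4.0])  # -> (2.5, 1.666...)
```

Because each input is visited exactly once, the same routine also works for streaming data, a property library authors weigh against the simplicity of the two-pass formula.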


Phil Steitz

Director, Apache Software Foundation
Phil Steitz is a member of the Board of Directors of the Apache Software Foundation. He has been an ASF volunteer since 2003 and an ASF member since 2005. He has served as VP, Apache Commons, as a mentor in the Apache Incubator and a committer on multiple ASF projects. His involvement…

Tuesday April 14, 2015 10:40am - 11:30am CDT
Texas II

11:40am CDT

User-Friendly Workflows with Apache OODT - Tom Barber, Meteorite Consulting
Apache OODT is a data processing platform made up of a number of modules. Data can be run through OODT workflows as it is ingested, or after ingestion.

Workflows are the OODT data transformation pipeline and allow you to pre- or post-process the data.

We'll be looking at OODT workflows: how to build, extend, deploy, and optimise them so that they can be used by the wider community in both large-scale and small-scale data processing pipelines.
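The shape of such a pipeline can be sketched in a few lines: tasks that each take and return a shared metadata context, run in order. This is a hypothetical illustration of the pattern (task and file names are made up), not OODT's actual workflow API:

```python
def crawl(ctx):
    """Pretend crawler: 'discovers' one granule file (made-up name)."""
    ctx["files"] = ["granule_001.dat"]
    return ctx

def validate(ctx):
    """Pre-processing step: check every discovered file's extension."""
    ctx["valid"] = all(name.endswith(".dat") for name in ctx["files"])
    return ctx

def catalog(ctx):
    """Post-processing step: catalogue only files that passed validation."""
    ctx["cataloged"] = ctx["files"] if ctx["valid"] else []
    return ctx

def run_workflow(tasks, ctx=None):
    """Run tasks in order, threading one metadata context through them."""
    ctx = ctx if ctx is not None else {}
    for task in tasks:
        ctx = task(ctx)
    return ctx

result = run_workflow([crawl, validate, catalog])
# result["cataloged"] == ["granule_001.dat"]
```

Extending a pipeline then means inserting another task into the list, which is the kind of composability the talk is about.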


Tom Barber

Technical Director, Spicule LTD
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and a regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and an open-source stalwart, when not working for NASA, Tom currently deals…

Tuesday April 14, 2015 11:40am - 12:30pm CDT
Texas II

2:00pm CDT

Pharmacovigilance - Big Data for Real-Time Drug Monitoring - Pei Chen, Apache cTAKES & Jay Vyas, Red Hat
Real-time drug safety monitoring in the cloud: collecting and harnessing knowledge from large public data sources in real time to monitor and detect adverse drug effects. In this presentation, Pei and Jay will demonstrate an entire system that uses Apache Bigtop, OpenStack, Spark, cTAKES, and Cassandra to proactively monitor and detect adverse drug events from Twitter data.


Pei Chen

Pei Chen is VP of the Apache cTAKES project. He is also a lead application development specialist at the Informatics Program at Boston Children’s Hospital/Harvard Medical School and Co-Founder of Wired Informatics. Mr. Chen’s interests lie in building practical applications…

Jay Vyas

Cloud Native Engineering Stuffs, VMware
Jay Vyas is a Kubernetes engineer at VMWare (ex-RedHat, Blackduck), and has worked on K8s at its inception in 2015 as an open source project.  He likes to hang out w/ the sig-network and sig-windows crews and hack on K8s stuff.  On the business side ~ he's moved large on premise…

Tuesday April 14, 2015 2:00pm - 2:50pm CDT
Texas II

3:00pm CDT

Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials - Eran Withana, Comprehend Systems
Pharmaceutical and medical device makers spend over $130bn each year collecting and analyzing new data, mostly through clinical trials. It costs over $1.8bn to bring a new drug to market, and over $4bn when factoring in the cost of failures. By understanding and analyzing this data more efficiently, new drugs can reach patients more quickly, more safely, and at lower cost.

In this presentation, Eran will discuss how ETL pipelines can be built using Apache and other open-source projects to improve clinical trial development. We will examine how the system is built, the challenges we faced, and how we are able to reduce cost, accelerate execution time, and improve results. We will also demonstrate how reliable resource allocation, scalable data ingestion adapters, on-demand and fault-tolerant job deployments, and monitoring benefit clinical trial decision-making and execution.


Eran Withana

Comprehend Systems
Eran has been a member of the Apache Software Foundation since 2005 and has contributed to numerous open source projects for over a decade. He has spoken at several technology conferences, including ApacheCon US, ApacheCon Europe, and JAX, as well as scalable-systems research conferences. He is an Open Source…

Tuesday April 14, 2015 3:00pm - 3:50pm CDT
Texas II

4:20pm CDT

The Emergence of the Datacenter Developer - Tobi Knaup, Mesosphere
A new category of developer is emerging in the datacenter. It used to be that individual servers were the building block for applications, but today’s datacenter developers have thousands of servers at their disposal. In this talk, Tobias Knaup will explain how all applications are becoming distributed applications and how that is creating a new breed of developer that programs against the datacenter as if it were their laptop. Knaup will provide an outlook for the emerging “datacenter developer” category, describe how advancements in abstractions of datacenter resources will forever change the balance of power between developers and operations, share specific examples of what it means to “program against the datacenter”, and explain these trends in the context of hot new frameworks (Kubernetes, Rocket, Mesos, etc.).


Tobi Knaup

Tobias is the CTO and Co-Founder of Mesosphere, a startup that is building a data center operating system based on Apache Mesos to support the next generation of large-scale distributed applications. He was one of the first engineers and engineering leaders at Airbnb. At Airbnb, he…

Tuesday April 14, 2015 4:20pm - 5:10pm CDT
Texas II

5:20pm CDT

Data Stream Algorithms in Apache Storm and R - Radek Maciaszek, Data Mine Lab
Streaming data presents new challenges for statistics and machine learning on extremely large data sets. Tools such as Apache Storm, a stream processing framework, can power a range of data analytics but lack advanced statistical capabilities. In this talk, I will discuss developing streaming algorithms with the flexibility of both Storm and R, a statistical programming language.

I will address the critical issues of why and how to use Storm and R to develop streaming algorithms; in particular I will focus on:
• Streaming algorithms
• Online machine learning algorithms
• Use cases showing how to process hundreds of millions of events a day in (near) real time
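As a concrete example of the kind of streaming algorithm covered here, reservoir sampling keeps a uniform random sample of a stream of unknown length in constant memory. It is sketched below in Python rather than Storm or R, purely for illustration:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """One-pass streaming algorithm: maintain a uniform random sample
    of k items from a stream of unknown length using O(k) memory.
    Item i (0-based) replaces a random reservoir slot with
    probability k / (i + 1)."""
    rng = random.Random(seed)  # seeded for reproducibility
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # uniform in [0, i]
            if j < k:
                sample[j] = x         # replace a random slot
    return sample

reservoir_sample(range(1_000_000), 5)  # five items, one pass, O(k) memory
```

The same one-pass, bounded-memory discipline underlies the online learning algorithms in the talk: each event updates a small state and is then discarded, which is what makes hundreds of millions of events a day tractable.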


Radek Maciaszek

Data Mine Lab
I am the founder of Data Mine Lab, a big-data consultancy. The company specialises in large-scale data number crunching and cloud computing. Currently I work as a data scientist contractor with a London-based hedge fund. I share my passion for data science by leading a number of training…

Tuesday April 14, 2015 5:20pm - 6:10pm CDT
Texas II
Wednesday, April 15

9:00am CDT

Apache Airavata Overview and Roadmap - Suresh Marru, Apache Software Foundation
Apache Airavata is software for providing services to manage scientific applications on a wide range of remote computing resources. Airavata can be used both by individual scientists to run scientific workflows and by communities of scientists through Web browser interfaces. Airavata is composed of several components (Registry, Orchestrator, Application Factory, Workflow Interpreter, Messenger, Credential Store) that implement these capabilities. The Airavata community is in the process of rearchitecting the software to serve as the basis of a multi-tenanted, elastically scalable, fault-tolerant Platform as a Service for our community. This introduces several challenges to the current architecture as well as opportunities to leverage and collaborate with other Apache projects. We discuss these experiences and future directions.


Suresh Marru

Member, Indiana University
Suresh Marru is a Member of the Apache Software Foundation and is the current PMC chair of the Apache Airavata project. He is the deputy director of Science Gateways Research Center at Indiana University. Suresh focuses on research topics at the intersection of application domain…

Wednesday April 15, 2015 9:00am - 9:50am CDT
Texas II

10:00am CDT

Hadoop Applications on High Performance Computing (HPC) - Devaraj Kavali, Intel
High-performance workloads have expanded. Today’s HPC users are demanding application frameworks to analyze vast amounts of data created by complex simulations. As the most widely deployed file system for HPC, Lustre software can play a critical role for these data-intensive applications. The HPC Adapter for Lustre (HAL) makes it possible to run Hadoop File System operations in an HPC environment without any changes to applications. The HPC Adapter for MapReduce/YARN (HAM) allows users to run their MapReduce/YARN applications—without changes—directly on shared, fast, Lustre-powered storage. This optimizes the performance of MapReduce/YARN tasks while delivering faster, more scalable, easier-to-manage storage. This session explains the architecture- and design-level technical details of running MapReduce and YARN applications on HPC-native schedulers such as Slurm and Moab.


Devaraj Kavali

Intel Corporation
Devaraj Kavali is an Apache Hadoop committer and a contributor to Hadoop YARN and MapReduce. He is currently working with Intel Corporation. He has been working on various distributed platforms and applications for more than 8 years.

Wednesday April 15, 2015 10:00am - 10:50am CDT
Texas II