ApacheCon NA 2015 has ended

Big Data: Big Picture
Wednesday, April 15

9:00am CDT

Kafka at Scale: Multi-Tier Architectures - Todd Palino, LinkedIn
If data is the lifeblood of high technology, Apache Kafka is the circulatory system in use at LinkedIn. It moves every type of data between systems, and it touches virtually every server, every day. This can only be accomplished with multiple Kafka clusters, installed at several sites, which must all work together to ensure no message loss and almost no message duplication. In this presentation, we will discuss the architectural choices behind how the clusters are deployed, and the tools and processes that have been developed to manage them. Todd Palino will also discuss some of the challenges of running Kafka at this scale, and how they are being addressed both operationally and in the Kafka development community.

Todd Palino

Staff Site Reliability Engineer, http://linkedin.com/
Todd Palino is a Staff Site Reliability Engineer at LinkedIn, tasked with keeping ZooKeeper, Kafka, and Samza deployments fed and watered. He is responsible for architecture, day-to-day operations, and tools development, including the creation of an advanced monitoring and notification...

Wednesday April 15, 2015 9:00am - 9:50am CDT
Texas VI

10:00am CDT

From MapReduce to Spark with Apache Crunch - Micah Whitacre, Cerner Corporation
With companies having made heavy investments in MapReduce, the emergence of Apache Spark as a new processing platform is both tempting and daunting. Refactoring code or altering processing steps can be a significant investment. The Apache Crunch project can help with the transition through its built-in support for reusing code in both execution environments. Teams can incrementally migrate their processing workflows, or choose the appropriate execution engine for each use case, while still relying on a common set of concepts provided by Apache Crunch. The presentation will cover the basics of Apache Spark, how to reuse the same code in both MapReduce and Spark, and the differences between using Apache Crunch and plain Apache Spark.

Micah Whitacre

Software Architect, Cerner Corporation
Micah is a committer on the Apache Crunch project as well as a Software Architect for Cerner Corporation, a leading provider of healthcare technology. For almost a decade he has worked on building infrastructure and reusable assets. In the last few years his focus has shifted towards...

Wednesday April 15, 2015 10:00am - 10:50am CDT
Texas VI

1:15pm CDT

Delivering Systems of Insight by Leveraging the Hadoop Ecosystem - Eberhard Hechler, IBM Germany R&D Lab
This presentation will illustrate how to complement existing 'traditional' analytical capabilities with Big Data analytics, e.g. by using text analytics and Natural Language Processing (NLP) as part of IBM InfoSphere BigInsights. This leverages key Hadoop components (MapReduce programming model, HDFS, HBase, ZooKeeper, etc.) to analyse data from enterprise-owned systems of engagement (e.g. call center transcripts, e-mail traffic, Facebook) and data from external social media sites (e.g. Twitter tweets, Facebook sites, blogs), and puts this in context with transaction insight from data on IBM z Systems. We will provide examples of how Hadoop systems, using HBase and Hive with corresponding connectors to existing systems, and Big SQL on HDFS and Hive, will enrich analytical insight.


Eberhard Hechler

Executive Architect, IBM Germany R&D Lab
Eberhard is an Executive Architect working at the IBM Germany R&D Lab. He is a member of IBM DB2 Analytics Accelerator development. After 2.5 years at the IBM Kingston Development Lab in New York, he worked in software development, performance optimization and benchmarking, IT/solution...

Wednesday April 15, 2015 1:15pm - 2:05pm CDT
Texas VI

3:15pm CDT

Real-time Big Data Analytics with Apache Spark and Apache Solr - Timothy Potter, Lucidworks
Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to meet even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark.

Specifically, Tim covers how to populate Solr from a Spark streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering basic use cases, Tim digs a little deeper to show how to use MLlib to enrich documents before indexing in Solr, with techniques such as sentiment analysis (logistic regression), language detection, topic modeling (LDA), and document classification.

Timothy Potter

Senior Software Engineer, Lucidworks
Timothy Potter is a senior member of the engineering team at Lucidworks and PMC member of the Apache Lucene/Solr project. At Lucidworks, Tim leads a team that builds tools to empower business analysts and data scientists to search, analyze, and visualize large-scale enterprise data...

Wednesday April 15, 2015 3:15pm - 4:05pm CDT
Texas VI