ApacheCon NA 2015 has ended


Spark Forum
Thursday, April 16

9:00am CDT

Apache Spark in 2015 and Beyond - Reynold Xin, Databricks
In this talk, I will give a quick introduction to Apache Spark, one of the most widely used cluster compute engines and Big Data frameworks. I will cover some of the important developments in the project, including:
  • our efforts to scale up Spark, which enabled us to set a new world record in 100TB sorting, beating the previous Hadoop MapReduce record by 3X using 1/10 of the nodes.
  • our efforts to expand the Spark API to make it easier to use for data scientists and application developers
  • and last but not least, a number of efforts including Spark Packages aimed at facilitating better community contribution at scale


Reynold Xin

co-founder of Databricks
Reynold is a co-founder and the Chief Architect of Databricks.

Thursday April 16, 2015 9:00am - 9:50am CDT
Texas VI

10:00am CDT

Hive Now Sparks - Chao Sun, Cloudera
Apache Hive has become the de facto standard for SQL on big data in the Hadoop ecosystem. With its open architecture and backend neutrality, Hive queries can run on MapReduce and Tez. Meanwhile, Apache Spark, an open-source cluster computing framework for data analytics, has gained significant momentum recently. Marrying the two, that is, providing Spark as a new execution engine for Hive, has many benefits for both Spark users and Hive users.
Hive on Spark (HIVE-7292) is probably the most watched project in Hive, with 100+ watchers. The effort has attracted developers from both communities, from around the globe, and from well-known companies such as Intel, IBM, Cloudera, and MapR. This presentation will cover the motivation, design principles, architecture, challenges, and current status of the project, followed by a live demo.


Chao Sun

Chao Sun is currently a Software Engineer at Cloudera, Inc. He has been working on the Hive on Spark project since joining the company in mid-2014. Prior to that, he was a PhD student in Computer Science at UW-Milwaukee, focusing on type systems and mechanized proofs.

Thursday April 16, 2015 10:00am - 10:50am CDT
Texas VI

10:50am CDT

Thursday April 16, 2015 10:50am - 11:20am CDT

11:20am CDT

Faster ETL Workflows using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid Analytics
Pig on Spark aims to combine the simplicity of Pig with the faster Spark execution engine and make Pig more attractive to developers. Currently, with the help of the Apache Software Foundation, various contributors are working on the project toward a release-quality build. With Pig on Spark, a significant performance benefit has been observed in ETL workflows already running on MapReduce. Our initial benchmarks have shown a 2x-5x improvement over MapReduce. For a benchmarking test, we considered the 'distinct' operation. We used the wikistats dump for 25 days, with a size of 270 GB, on a cluster of one master and four worker machines (16 cores and 64 GB RAM each). It took about 14 minutes with Pig on Spark, compared to about 30 minutes on MapReduce. In this talk, Praveen will share the progress of the project with the community and help people take advantage of Pig on Spark in their workflows.

Praveen Rachabattuni

Technical Team Lead, SigmoidAnalytics
Praveen Rachabattuni is a technical team lead at Sigmoid Analytics. His areas of expertise include real-time Big Data analytics using open source technologies like Apache Spark, Shark and Pig on Spark. He is working as a committer on the Apache Pig project and contributing for Pig...

Thursday April 16, 2015 11:20am - 12:10pm CDT
Texas VI

12:10pm CDT

Lunch Break (Attendees on own)
Thursday April 16, 2015 12:10pm - 2:00pm CDT

2:00pm CDT

Going Deep With Spark Streaming - Andrew Psaltis, Shutterstock
Today, if a byte of data were a gallon of water, in only 10 seconds there would be enough data to fill an average home; in 2020 it will take only 2 seconds. The Internet of Things is driving a tremendous amount of this growth, providing more data at a higher rate than we've ever seen. With this explosive growth comes the demand from consumers and businesses to leverage and act on what is happening right now. Without stream processing these demands will never be met, and there will be no big data and no Internet of Things. Apache Spark, and Spark Streaming in particular, can be used to fulfill this stream processing need now and in the future. In this talk I will peel back the covers and we will take a deep dive into the inner workings of Spark Streaming, discussing topics such as DStreams, input and output operations, transformations, and fault tolerance.

Andrew Psaltis

DataFlow & IoT Principal Solution Architect, Hortonworks
Andrew Psaltis is deeply entrenched in streaming and IoT systems and obsessed with delivering insight at the speed of thought. As the author of Streaming Data (http://manning.com/psaltis/) by Manning, and an international speaker and trainer, he spends most of his waking hours thinking...

Thursday April 16, 2015 2:00pm - 2:50pm CDT
Texas VI

3:00pm CDT

Near Real-Time Stream Processing Architectures with Open-Source Tools - Anand Iyer, Cloudera
We are continuously producing vast streams of data, and thanks to phenomena such as the Internet of Things, the volume of streaming data is poised to see exponential growth over the coming years. Businesses want to process this data almost as soon as it is produced, drastically reducing time to action and enabling a whole new category of use cases. This paradigm is called "Near Real-Time Stream Processing". In this presentation, Anand Iyer will describe real-world use cases across diverse industries. He will describe the open source tools (Kafka, Spark Streaming, Storm, Samza, etc.) that are used to build near real-time stream processing architectures, and will also describe some of the common architectural patterns. Lastly, he will describe future trends, such as machine learning and SQL on streaming data.


Anand R Iyer

Anand R Iyer is a Senior Product Manager at Cloudera, the leading vendor of open source Apache Hadoop. His primary areas of focus are platforms for Real-Time Streaming, Apache Spark and tools for data ingestion into the Hadoop platform. Before joining Cloudera, he worked as an engineer...

Thursday April 16, 2015 3:00pm - 3:50pm CDT
Texas VI

3:50pm CDT

Thursday April 16, 2015 3:50pm - 4:20pm CDT
