ApacheCon NA 2015 has ended
Thursday, April 16 • 11:20am - 12:10pm
Faster ETL Workflows using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid Analytics

Pig on Spark aims to combine the simplicity of Pig with the faster Spark execution engine, making Pig more appealing to developers. Currently, with the help of the Apache Software Foundation, various contributors are working on the project toward a release-quality build. With Pig on Spark, a significant performance benefit has been observed in ETL workflows already running on MapReduce. Our initial benchmarks have shown a 2x-5x improvement over MapReduce. For one benchmark, we considered the ‘distinct’ operation. We used 25 days of the wikistats dump, 270 GB in size, on a cluster of one master and four worker machines (16 cores and 64 GB RAM each). The job took about 14 minutes with Pig on Spark, compared with about 30 minutes on MapReduce. In this talk, Praveen will share the project's progress with the community and help people take advantage of Pig on Spark in their workflows.
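
As a rough check on the ‘distinct’ benchmark figures quoted in the abstract, the short Python sketch below computes the implied speedup. The timings are the approximate values from the abstract; the function and constant names are illustrative only and are not part of the Pig on Spark project.

```python
# Approximate timings quoted in the abstract, in minutes.
MAPREDUCE_MINUTES = 30
PIG_ON_SPARK_MINUTES = 14


def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    if new_minutes <= 0:
        raise ValueError("new_minutes must be positive")
    return baseline_minutes / new_minutes


ratio = speedup(MAPREDUCE_MINUTES, PIG_ON_SPARK_MINUTES)
print(f"Speedup over MapReduce: {ratio:.1f}x")
```

Under these assumptions the ratio is a little over 2x, which sits at the lower end of the 2x-5x range reported for the initial benchmarks.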


Praveen Rachabattuni

Technical Team Lead, Sigmoid Analytics
Praveen Rachabattuni is a technical team lead at Sigmoid Analytics. His areas of expertise include real-time big data analytics using open source technologies such as Apache Spark, Shark, and Pig on Spark. He is a committer on the Apache Pig project and is contributing to Pig...

Thursday April 16, 2015 11:20am - 12:10pm CDT
Texas VI