In this talk, I will give a quick introduction to Apache Spark, one of the most widely used cluster computing engines and Big Data frameworks. I will cover some of the important developments in the project, including:
- our efforts to scale up Spark, which enabled us to set a new world record in 100TB sorting, beating the previous Hadoop MapReduce record by 3X while using one-tenth of the nodes;
- our efforts to expand the Spark API to make it easier to use for data scientists and application developers;
- and last but not least, a number of efforts, including Spark Packages, aimed at facilitating community contribution at scale.