Big Learning Systems

A new wave of systems is emerging in the space of Big Data Analytics that open the door to programming models beyond Hadoop MapReduce (HMR). It is well understood that HMR is not ideal for applications in the domain of machine learning and graph processing. This realization is fueling a number of new (Big Data) system efforts: Berkeley Spark, Google Pregel, GraphLab (CMU), and Hyracks (UC Irvine), to name a few. Each of these add unique capabilities, but form islands around key functionalities: fault-tolerance, resource allocation, and data caching. In this talk, I will provide an overview of some Big Data systems starting with Google’s MapReduce, which defined the foundational architecture for processing large data sets. I will then identify a key limitation in this architecture; namely, its inability to efficiently support iterative workflows. I will then describe real-world examples of systems that aim to fill this computational void. I will conclude with a description of my own work on a layering that unifies key runtime functionalities (fault-tolerance, resource allocation, data caching, and more) for workflows (both iterative and acyclic) that process large data sets.