Next Generation Information Systems

We are at the beginning of a data revolution in which entire industries are being built around finding, extracting, and refining "Big Data" into valuable information and insights. Over the next decade, the value of a company will be based not solely on its products and services but also on the data it owns. The data center will be viewed as tomorrow's refinery, and developing new programming abstractions and systems for processing massive amounts of data is an active area of research.

MapReduce is the current platform of choice for extracting, transforming, and loading (ETL) massive amounts of raw data. Unfortunately, MapReduce runtimes such as Hadoop provide neither a complete high-performance solution nor an ideal API for a crucial domain: machine learning. We seek to address these shortcomings on multiple levels. First, we provide an efficient runtime that directly supports not only ETL workflows but also iterative (equivalently, recursive) algorithms. Second, we expose a high-level programming language in which data scientists can succinctly encode their machine learning algorithms alongside the ETL workflows that extract data features and evaluate models.

In this talk, I will introduce ScalOps, a new domain-specific language (DSL) written in the Scala programming language, and its runtime Hyracks, a system developed at UC Irvine that natively supports a rich set of relational algebra operators as well as graph-based and machine learning algorithms through recursive data flows. This work is part of an ongoing collaboration between Yahoo! Research, UC Irvine, and UC Santa Cruz to provide a complete systems foundation for extracting information and insights from massive data sets.