Building A TB-Scale Math Platform

ÜberConf

Denver · July 16 - 19, 2013

About this Presentation

Datasets have grown to PB scale, but the modeling you can do has been limited to a single node (e.g. R, SAS), stuck inside the database, or slowed to hours on Hadoop-like technologies. We have built a simple clustering package and are using it to do distributed analytics across the sum of all RAM in a cluster.

This talk focuses on how the clustering technology, plus a Java-based vector math API, is being used to build full algorithms such as GLM/GLMNET, Random Forest, and K-means. These algorithms are complex multi-pass programs, and traditional distributed programming models expose the distributed boundaries, making the algorithms hard to reason about. We have a basic JDK for doing at-scale math: we can run most Plain Olde Java in (distributed) inner loops and communicate via a K/V store with exact Java Memory Model consistency (not lazy consistency). Adding more CPUs makes these algorithms run faster, and adding more RAM allows larger datasets. We are bringing back Moore's Law!
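To make the "plain Java in inner loops" idea concrete, here is a minimal single-JVM sketch of the general pattern: an ordinary Java loop runs over each chunk of a column, and the per-chunk partial results are merged. This is not the 0xdata API. The names `ChunkedMean`, `Stats`, and `summarize` are hypothetical, and a parallel stream over in-memory arrays stands in for chunks spread across cluster nodes.

```java
import java.util.Arrays;

// Hypothetical illustration only: a parallel stream on one JVM stands in
// for a cluster, and each double[] stands in for a chunk held on some node.
final class ChunkedMean {

    /** Immutable partial result for one or more chunks. */
    static final class Stats {
        final double sum;
        final long count;

        Stats(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** "Map" step: a plain Java inner loop over a single chunk. */
        static Stats of(double[] chunk) {
            double s = 0;
            for (double v : chunk) {
                s += v;
            }
            return new Stats(s, chunk.length);
        }

        /** "Reduce" step: combine two partial results into a new one. */
        Stats merge(Stats other) {
            return new Stats(sum + other.sum, count + other.count);
        }

        double mean() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    /** Summarize every chunk in parallel, then merge the partial results. */
    static Stats summarize(double[][] chunks) {
        return Arrays.stream(chunks)
                .parallel()
                .map(Stats::of)
                .reduce(new Stats(0, 0), Stats::merge);
    }

    public static void main(String[] args) {
        double[][] chunks = { {1, 2, 3}, {4, 5}, {6} };
        Stats s = summarize(chunks);
        System.out.println("rows=" + s.count + " mean=" + s.mean());
    }
}
```

Because `merge` is associative and returns a new object, the partial results can be combined in any grouping. That property is what lets the per-chunk loop stay ordinary sequential Java while the merging happens elsewhere. A multi-pass algorithm such as K-means would repeat this map-and-merge step once per iteration.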

CTO & Co-Founder of 0xdata

Cliff Click is the CTO and Co-Founder of 0xdata, a firm dedicated to creating a new way to think about web-scale data storage and real-time analytics. Cliff wrote his first compiler when he was 15 (Pascal to TRS Z-80!), although his most famous compiler is the HotSpot Server Compiler (the Sea of Nodes IR). He helped Azul Systems build an 864-core pure-Java mainframe that keeps GC pauses on 500 GB heaps to under 10 ms, and worked on all aspects of that JVM. Before that, Cliff worked on HotSpot at Sun Microsystems and is at least partially responsible for bringing Java into the mainstream.

Cliff is invited to speak regularly at industry and academic conferences and has published many papers about HotSpot technology. He holds a PhD in Computer Science from Rice University and about 15 patents.