Hadoop: Divide and Conquer Gigantic Datasets (Advanced)

With the basics of Hadoop under your belt, we'll dig into the depths of this amazing framework by writing our own reducer in Java and deploying it to the cluster. Next, we'll explore DSLs like Pig and its log-file processing cousin, Chukwa. Since grid topology is intentionally very opaque in Hadoop, we'll look at the benefits of that design and at how to achieve a properly tuned cluster with replication. Specific to HDFS, we'll tune the configurable parameters for storage redundancy and block sizes.
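As a rough illustration of what a reducer does, the following is a minimal plain-Java sketch of the reduce step for a word count. It deliberately does not use the Hadoop API: the class name `ReduceSketch` and the methods `shuffle`, `reduce`, and `run` are hypothetical names chosen for this example, and the in-memory grouping only stands in for the shuffle that the framework performs between map and reduce on a real cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {

    // Groups (key, value) pairs by key. On a real cluster the framework
    // does this between the map and reduce phases; here it is in memory.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // The reduce step: collapse all values that share one key into a single sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs shuffle, then reduce once per key, collecting results sorted by key.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Pretend mapper output: one (word, 1) pair per word occurrence.
        List<Map.Entry<String, Integer>> mapped = List.of(
                Map.entry("hadoop", 1),
                Map.entry("pig", 1),
                Map.entry("hadoop", 1));
        System.out.println(run(mapped)); // prints {hadoop=2, pig=1}
    }
}
```

In an actual Hadoop job the same summing logic would live in a class extending the framework's reducer type, with the grouping, sorting, and distribution across nodes handled by the cluster rather than by a local map.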

In the ultimate use case of running many Hadoop components in harmony, you'll need a centralized synchronization and coordination framework. Don't build these capabilities on your own, though, as you might be tempted to do in a homegrown distributed system. Instead, leverage Hadoop ZooKeeper's ability to store small (sub-1 MB) blocks of data that contain state, naming, and mutex information.

About Matthew McCullough

Matthew McCullough is an energetic 15-year veteran of enterprise software development and open source education, and co-founder of Ambient Ideas, LLC, a Denver consultancy. Matthew is currently VP of Training at GitHub.com, author of the Git Master Class series for O'Reilly, speaker at over 30 national and international conferences, author of three of the top 10 DZone RefCards, and President of the Denver Open Source Users Group. His current research topics center on project automation: build tools (Gradle), distributed version control (Git, GitHub), Continuous Integration (Jenkins, Travis), and Quality Metrics (Sonar). Matthew resides in Denver, Colorado, with his beautiful wife and two young daughters, who are active in nearly every outdoor activity Colorado has to offer.
