See the essence of the Spark Streaming API.

Spark Streaming

Using the Scala sbt build tool with Spark code.

sbt build for Scala/Spark

How to unit test Scala Spark code.

Spark Unit Testing

Here is how to set up and run SparkR on a cluster.

SparkR

This shows how to add Scala functions that look like they are part of the Spark API.

Extending Spark

Joins in Hive using both reduce-side and bucket map-side approaches.

Hive Joins

Here is the classic wordcount example, running on Spark on YARN, using the three language APIs.

Scala Wordcount
Java Wordcount
[Python Wordcount, coming soon]

Here is the classic wordcount example, using the new Hadoop API.

Wordcount
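
For orientation, the data flow that a wordcount job implements can be sketched in plain Java with no Hadoop dependency. This is only a local illustration of the map and reduce steps, not the code from the guide and not the Hadoop API: the class name WordCountSketch and its methods are hypothetical, and lowercasing the tokens is an assumption made here for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free sketch of the wordcount data flow.
class WordCountSketch {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    // Lowercasing is an assumption of this sketch, not part of the guide.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(Locale.ROOT), 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum the counts emitted for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run the map step over every line, then reduce all emitted pairs.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("the quick brown fox", "the lazy dog")));
    }
}
```

In a real job the framework performs the grouping between the two steps across the cluster; here a single in-memory map stands in for that shuffle.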

Here are two methods for chaining and managing multiple MapReduce Jobs.

Chaining and Managing Multiple MapReduce Jobs with two drivers
Chaining and Managing Multiple MapReduce Jobs with one driver

To inspect worker node logs in a Hadoop cluster that sits behind a firewall with only SSH access, a browser must be set up for tunneling.

Using a browser to tunnel into a Hadoop cluster to inspect worker node logs

Hadoop MapReduce can write key/value output to HDFS in a variety of formats. Here is how to display them.

Display Your MapReduce Output

MRUnit offers two families of input/output methods, add and with. Here is the difference.

MRUnit: with vs. add

This four-part series shows how to pass multiple values from a mapper to a reducer, and from the reducer to output.

Passing Multiple Values in MapReduce Part 1: Strings
Passing Multiple Values in MapReduce Part 2: Custom Writables
Passing Multiple Values in MapReduce Part 3: Maps
Passing Multiple Values in MapReduce Part 4: AVRO
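
As a rough idea of the simplest approach in this series, several values can be packed into one delimited string on the mapper side and split apart again in the reducer. The sketch below assumes that this is what the Strings part covers; the class name DelimitedValues, the tab delimiter, and the method names are hypothetical, and the code uses plain Java strings rather than Hadoop types.

```java
// Hypothetical helper: pack several fields into one string value and unpack them again.
// Assumes no field contains the delimiter character itself.
class DelimitedValues {

    // Tab is an arbitrary choice for this sketch.
    static final String DELIMITER = "\t";

    // Mapper side: join the fields into a single value.
    static String pack(String... fields) {
        return String.join(DELIMITER, fields);
    }

    // Reducer side: split the value back into its fields.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static String[] unpack(String packed) {
        return packed.split(DELIMITER, -1);
    }

    public static void main(String[] args) {
        String packed = pack("2014-01-15", "42.5", "MN");
        for (String field : unpack(packed)) {
            System.out.println(field);
        }
    }
}
```

This is quick to write but fragile: any field that contains the delimiter corrupts the record, which is one reason to prefer typed containers when the data is not fully under your control.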

Using Maven, the build and project lifecycle tool, you can configure Eclipse for Hadoop development in minutes.

Setup Eclipse for Hadoop Development Using Maven (Linux/Mac version)

Setup Eclipse for Hadoop Development Using Maven (Windows version)


Questions about this material?

Hadoop Concepts Forum


These guides are brought to you by

Center of Excellence for Big Data (CoE4BD)
Graduate Programs in Software
University of St. Thomas
St. Paul, Minnesota

http://www.stthomas.edu/CoE4BD
CoE4BD@stthomas.edu
@CoE4BD

In collaboration with Cloudera and their Academic Partnership program

Also see our Technical Reports