Danny Malter

Data Science Manager - Accenture
M.S. in Predictive Analytics - DePaul University

Guide to Running RHadoop over Amazon EMR

Assumptions

That you have already set up an Amazon AWS account.
That you have basic knowledge of R and of how Hadoop MapReduce works.

Setting up Elastic MapReduce

Step 1: Log into Amazon AWS

Step 2: Click the EMR icon - Managed Hadoop Framework

(Screenshot: the EMR - Managed Hadoop Framework icon in the AWS console)

Step 3: Create your cluster

IMPORTANT BOOTSTRAP INSTRUCTION

  1. Download the Amazon EMR Bootstrap File
  2. Open a new tab and upload the .sh script into your S3 (either create a new folder/bucket in S3 or use an existing folder, or upload it from the command line as shown after this list)
  3. Within EMR, select “custom action” from the “Select a bootstrap action” drop-down. Then select “Configure and add”
  4. Link the above .sh file in “S3 location” and press “Add”

    (Screenshot: the “Add bootstrap action” dialog with the S3 location filled in)
    • Fill in any remaining fields as you wish and click the “Create cluster” button
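
If you prefer the command line, the script can also be uploaded with the AWS CLI; a minimal sketch, assuming hypothetical bucket and file names:

# Upload the bootstrap script to S3 (bucket and key names are placeholders)
aws s3 cp rhadoop-bootstrap.sh s3://my-emr-bucket/bootstrap/rhadoop-bootstrap.sh

# Confirm the upload
aws s3 ls s3://my-emr-bucket/bootstrap/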

IMPORTANT NOTE:
The cluster will initially show as “Pending”. With the bootstrap script, the cluster will take around 15 minutes to fully complete the setup process.

Initial Stage:

Second Stage:

Final Stage:
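
Rather than refreshing the console, you can also poll the cluster state from the AWS CLI; a quick check, assuming a hypothetical cluster id:

# Prints the current state: STARTING, BOOTSTRAPPING, WAITING, etc.
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Status.State' --output text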

Running R over the Cluster

Open a terminal and connect to your cluster's master node with the following command (the public DNS is shown on the cluster details page):

ssh -i /path/to/keypair.pem hadoop@Amazon_Public_DNS
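
If ssh refuses the key with a permissions warning, restrict the key file first:

# ssh requires the private key to be readable only by its owner
chmod 400 /path/to/keypair.pem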

Once connected, either make a directory for your input file or download it directly into the /home/hadoop directory. You can import data from Amazon S3 with the wget command. The file link can be found under ‘Properties’ within S3, and you will need to make sure the correct permissions are set for the file to be downloaded.

# Create a local directory with the data
mkdir data
cd data
wget https://s3-us-west-2.amazonaws.com/bucket-name/shakespeare.txt

# Another option: download the file from a URL inside an R session
shakespeare_works <- "data/shakespeare_works.txt"
if (!file.exists(shakespeare_works)) {
  download.file(url = "http://www.gutenberg.org/ebooks/100.txt.utf-8",
                destfile = shakespeare_works)
}
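
If you would rather not open up the file's S3 permissions for wget, the AWS CLI (typically preinstalled on EMR nodes) can copy it using the cluster's instance credentials instead:

# Copy straight from S3 using the node's IAM role; no public link needed
aws s3 cp s3://bucket-name/shakespeare.txt /home/hadoop/data/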

Word Count Example

The following code is an example of running a word count on a text file using the rmr2 package and loading data into HDFS with the rhdfs package.

# From the shell, start R with root privileges
sudo R

# Inside R: point the RHadoop packages at the Hadoop installation
# (these paths match the default layout on the EMR node)
Sys.setenv(HADOOP_HOME = "/home/hadoop")
Sys.setenv(HADOOP_CMD = "/home/hadoop/bin/hadoop")
Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java")
Sys.setenv(HADOOP_PREFIX = "/home/hadoop/")
Sys.setenv(HADOOP_STREAMING = "/home/hadoop/contrib/streaming/hadoop-streaming.jar")

library(rmr2)
library(rhdfs)
hdfs.init()

# Copy the file to HDFS
hdfs.mkdir("/user/shakespeare/wordcount/data")
hdfs.put("/home/hadoop/data/shakespeare.txt", "/user/shakespeare/wordcount/data")
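
# Optional sanity check: confirm the file landed in HDFS
hdfs.ls("/user/shakespeare/wordcount/data")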


# Map: split each line on one or more spaces, lowercase the tokens,
# and emit a (word, 1) pair for each
map <- function(., lines) {
  keyval(
    tolower(unlist(strsplit(x = lines, split = " +"))),
    1)
}

# Reduce: sum the counts for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
	
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce)
}

hdfs.root <- '/user/shakespeare/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')   # 'out' must not be an already created folder

system.time(out <- wordcount(hdfs.data, hdfs.out))

results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors=FALSE)

install.packages('tm')
library(tm)
sw <- stopwords()

# Keep words longer than five characters that are not stop words,
# then show the top 20 by count
top.words5 <- results.df[nchar(results.df$key) > 5 & !results.df$key %in% sw, ]
top.words5[order(-top.words5$val, top.words5$key)[1:20], ]

Example of output:

(Screenshot: the top 20 words by count)

Aggregation Example
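
This example totals purchase dollars by city from a tab-delimited purchases.txt file; the map function below assumes the city is in the third column and the dollar amount in the fifth.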

hdfs.mkdir("/user/purchases/data")
hdfs.put("/home/hadoop/data/purchases.txt", "/user/purchases/data")

# Map: emit a (city, dollars) pair for each record
purchases.map <- function(k, v) {
  city <- v[[3]]      # third column: city
  dollars <- v[[5]]   # fifth column: purchase amount
  return(keyval(city, dollars))
}

# Reduce: total dollars per city
purchases.reduce <- function(k, v) {
  keyval(k, sum(v))
}

purchases.format <- make.input.format("csv", sep = "\t")

m <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = purchases.format,
            map = purchases.map, reduce = purchases.reduce)
}

hdfs.root <- '/user/purchases'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')   # 'out' must not already exist

system.time(out <- m(hdfs.data, hdfs.out))

results = from.dfs(out)
results.df = data.frame(results$key, results$val)

# Top 20 cities by total purchase amount
results.df[order(-results.df$results.val, results.df$results.key)[1:20], ]

Example of output:

(Screenshot: the top 20 cities by total purchase amount)

REMEMBER TO STOP OR TERMINATE YOUR INSTANCE
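
From the AWS CLI, termination is a single command (same hypothetical cluster id as above):

# Terminate the cluster so it stops accruing charges
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX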
