Running a MapReduce Job (WordCount) on Hadoop Single-Node Cluster

In the previous post we saw how to install Hadoop on Ubuntu; now it's time to run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
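The WordCount program itself ships with Hadoop inside the examples jar, so you do not have to write any code for this tutorial. For orientation only, here is a minimal sketch of what such a WordCount job looks like with the org.apache.hadoop.mapreduce API. It is modelled on the standard example, not copied from it, so the class that actually ships in the jar may differ in details.

// Minimal WordCount sketch (modelled on the standard Hadoop example).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word and emits (word, total).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // one output line per word: word <TAB> count
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Hadoop 1.x style constructor
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/hduser/gutenberg
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/hduser/gutenberg-output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}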


1. Download example input data
We will use three ebooks from Project Gutenberg for this example.

Download each ebook and store the files in a local temporary directory of your choice, for example '/tmp/gutenberg'. Now we have to change the ownership of the files to hduser. Open a terminal and run:

1 sudo chown -R hduser:hadoop /tmp/gutenberg


2. Restart the Hadoop cluster
Open a new terminal and start your Hadoop cluster if it is not already running:

1 su - hduser
2 /usr/local/hadoop/bin/start-all.sh


3. Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.

1 cd /usr/local/hadoop
2 bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

We can also check whether the files were copied correctly (see image 1):

1 bin/hadoop dfs -ls /user/hduser
2 bin/hadoop dfs -ls /user/hduser/gutenberg

Image 1. Files Copied Successfully.


4. Run the MapReduce job
Now, we actually run the WordCount example job (image 2).

1 bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

This command will read all the files in the HDFS directory '/user/hduser/gutenberg', process them, and store the result in the HDFS directory '/user/hduser/gutenberg-output'.

Image 2. Running WordCount.

Check if the result is successfully stored in HDFS directory '/user/hduser/gutenberg-output' (image 3):

1 bin/hadoop dfs -ls /user/hduser
2 bin/hadoop dfs -ls /user/hduser/gutenberg-output

Image 3. Results Stored Successfully.

If you want to modify some Hadoop settings on the fly, like increasing the number of reduce tasks, you can use the "-D" option:

1 bin/hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output

Note: Hadoop does not honor mapred.map.tasks beyond treating it as a hint, but it does accept a user-specified mapred.reduce.tasks and does not alter it. In other words, you cannot force the number of map tasks, but you can set the number of reduce tasks.
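The same setting can also be fixed in the job driver instead of on the command line. A hedged sketch, using the same org.apache.hadoop.mapreduce API as the WordCount sketch above (the class name ReducerCountExample is just for illustration):

// Sketch: fixing the number of reducers in the driver instead of via "-D".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setNumReduceTasks(16);   // same effect as -D mapred.reduce.tasks=16
    // conf.setInt("mapred.reduce.tasks", 16) would also work on Hadoop 1.x,
    // but it has to be done before the Job object is created, because the
    // Job copies the configuration at construction time.
    return job;
  }
}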


5. Retrieve the job result from HDFS
You can use the command

1 bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

to read the file directly from HDFS. Alternatively, you can copy it from HDFS to the local file system:

1 mkdir /tmp/gutenberg-output
2 bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output

Note: The command dfs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
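If you prefer to do the same merge from Java, the Hadoop 1.x FileUtil.copyMerge helper does roughly what the shell command does (it concatenates, it does not sort). A sketch under that assumption, using the paths from this tutorial (the class name MergeOutput is hypothetical):

// Sketch: merging the part files from Java instead of with "dfs -getmerge".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml etc.
    FileSystem hdfs = FileSystem.get(conf);         // the configured HDFS
    FileSystem local = FileSystem.getLocal(conf);   // the local file system
    FileUtil.copyMerge(hdfs, new Path("/user/hduser/gutenberg-output"),
                       local, new Path("/tmp/gutenberg-output/gutenberg-output"),
                       false /* keep the source files */, conf,
                       null /* no separator string between files */);
  }
}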

Now we can view the whole output file in any editor. Open a new terminal and run:

1 sudo gedit /tmp/gutenberg-output/gutenberg-output
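If you would rather read the result from a Java program than from an editor, a small HDFS client along these lines could be used. This is a sketch: the class name ReadResult is hypothetical, and the part file name assumes the default single-reducer output shown above.

// Sketch: reading the WordCount output file straight from HDFS.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadResult {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path result = new Path("/user/hduser/gutenberg-output/part-r-00000");
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(result)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // each line: word <TAB> count
      }
    } finally {
      reader.close();
    }
  }
}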


6. Hadoop Web Interfaces
Hadoop comes with several web interfaces, which by default (see conf/hadoop-default.xml) are available at these locations:

  • http://localhost:50070/ (NameNode web UI)
  • http://localhost:50030/ (JobTracker web UI)
  • http://localhost:50060/ (TaskTracker web UI)

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.


I. NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at http://localhost:50070/.


II. JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine’s Hadoop log files (the machine the web UI is running on).

By default, it’s available at http://localhost:50030/.


III. TaskTracker Web Interface (MapReduce layer)
The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at http://localhost:50060/.

— * — * — * — * —

Sources

How to Install Hadoop-1.2.1 (Single-Node Cluster) on Ubuntu-12.04, with JDK 8

In this DIY we will see how to set up a single-node Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu-12.04. The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

This tutorial has been tested with the following software versions:
  • Ubuntu 12.04 (LTS)
  • Hadoop 1.2.1, released August 2013
  • JDK 8 update 5


Step 1: Prerequisites

1. Download Hadoop
Hadoop-1.2.1 can be downloaded from here. Select a mirror, then select the 'hadoop-1.2.1/' directory and download hadoop-1.2.1.tar.gz. I assume you have downloaded it into your '/home/user_name/Downloads' directory.

2. Install JDK 8
Hadoop requires a working Java installation, so open up a terminal and run the following:

1 sudo add-apt-repository ppa:webupd8team/java
2 sudo apt-get update && sudo apt-get install oracle-java8-installer

It will take some time to download and install, so sit back and wait. Once it's done, we have to add JAVA_HOME to the Ubuntu environment. Run the following in a terminal to open the /etc/environment file:

1 sudo gedit /etc/environment

Now, append the following at the end of the file and save it:

JAVA_HOME="/usr/lib/jvm/java-8-oracle"

3. Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine. The following commands create the group and the user; image 1 shows a typical output screen.

1 sudo addgroup hadoop
2 sudo adduser --ingroup hadoop hduser

Image 1. Adding a Hadoop System User.

4. Installing and configuring SSH Server
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section. Install the OpenSSH server:

1 sudo apt-get install openssh-server

Now open a new terminal, switch to hduser, and generate an SSH key for the hduser user:

1 su - hduser
2 ssh-keygen -t rsa -P ""

The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the pass-phrase every time Hadoop interacts with its nodes). The output will look like this:

Image 2. Creating an RSA Key Pair.

Now we have to enable SSH access to our local machine with this newly created key:

1 cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to our local machine with the hduser user. This step is also needed to save our local machine’s host key fingerprint to the hduser user's known_hosts file.

1 ssh localhost

The output will look like this:

Image 3. Testing the ssh Setup.

NOTE: If you get an error saying “ssh: connect to host localhost port 22: Connection refused”, then openssh-server is not installed properly; install it first.

5. Disable IPv6
Exit the hduser session and close the current terminal (type "exit" repeatedly until the terminal closes). Open a new terminal and open the file /etc/sysctl.conf:

1 sudo gedit /etc/sysctl.conf

Now append the following lines at the end of the file and save it.

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to restart your system in order to make the changes take effect. You can check whether IPv6 is enabled on your machine with the following command:

1 cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled (image 4), a value of 1 means disabled (image 5, that’s what we want).

Image 4. IPv6 Enabled.

After restart

Image 5. IPv6 Disabled.


Step 2: Install Hadoop

1. Extract Hadoop and modify the permissions
Extract the contents of the Hadoop package to a location of your choice, say ‘/usr/local/hadoop’. Make sure to change the owner of all the files to the hduser user and hadoop group. But first, move the downloaded hadoop-1.2.1.tar.gz to ‘/usr/local’ (image 6).

1 sudo mv /home/user_name/Downloads/hadoop-1.2.1.tar.gz /usr/local
2 cd /usr/local
3 sudo tar xzf hadoop-1.2.1.tar.gz
4 sudo mv hadoop-1.2.1 hadoop
5 sudo chown -R hduser:hadoop hadoop

Image 6. Extracting and Modifying the Permissions.

2. Update '$HOME/.bashrc' of hduser

Open the $HOME/.bashrc file of user hduser

1 sudo gedit /home/hduser/.bashrc

and add the following lines at the end:

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin


Step 3: Configuring Hadoop

Now we have to configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little cluster only contains our single local machine.

1. Setting up the working directory
We will use the directory ‘/app/hadoop/tmp’ in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point. Now we create the directory and set the required ownerships and permissions:

1 sudo mkdir -p /app/hadoop/tmp
2 sudo chown hduser:hadoop /app/hadoop/tmp
3 sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.

2. Configuring Hadoop setup files

I. hadoop-env.sh
The only required environment variable we have to configure for Hadoop is JAVA_HOME. Open conf/hadoop-env.sh (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the JDK 8 installation directory.

1 sudo gedit /usr/local/hadoop/conf/hadoop-env.sh

Replace

Image 7. Replace This.

With

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

II. core-site.xml
Open up the file /usr/local/hadoop/conf/core-site.xml

1 sudo gedit /usr/local/hadoop/conf/core-site.xml

and add the following snippet between the <configuration>...</configuration> tags (see image 8). You can leave the settings below as they are, with the exception of the hadoop.tmp.dir parameter, which must match the directory you created in the previous step; a small verification sketch follows image 8.

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

Image 8. Add as Shown.
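Once core-site.xml is in place, a quick way to confirm that clients actually pick up fs.default.name is a tiny Java check like the one below. This is a sketch: the class name CheckDefaultFs is hypothetical, and it assumes you compile the class and run it through bin/hadoop so that the conf directory is on the classpath.

// Sketch: confirm which file system the Hadoop configuration points at.
// With the core-site.xml above this should print hdfs://localhost:54310.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CheckDefaultFs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // loads core-site.xml from the classpath
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    FileSystem fs = FileSystem.get(conf);
    System.out.println("FileSystem URI  = " + fs.getUri());
  }
}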

III. mapred-site.xml
Open up the file /usr/local/hadoop/conf/mapred-site.xml

1 sudo gedit /usr/local/hadoop/conf/mapred-site.xml

add the following snippet between <configuration>...</configuration> tags.

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

IV. hdfs-site.xml
Open up the file /usr/local/hadoop/conf/hdfs-site.xml

1 sudo gedit /usr/local/hadoop/conf/hdfs-site.xml

add the following snippet between <configuration>...</configuration> tags.

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
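As the description above notes, the replication factor can also be set per file when the file is created, or changed afterwards. A hedged Java sketch (the class name ReplicationExample and the file path are hypothetical, for illustration only):

// Sketch: overriding dfs.replication for a single file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/hduser/example.txt");   // hypothetical path
    // Create the file with an explicit replication factor of 1.
    FSDataOutputStream out = fs.create(file, (short) 1);
    out.writeBytes("hello hdfs\n");
    out.close();
    // Or change the replication factor of an existing file later on.
    fs.setReplication(file, (short) 1);
  }
}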


Step 4: Formatting the HDFS Filesystem via the Namenode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your “cluster” (which consists of only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

NOTE: Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the following commands in a new terminal:

1 su - hduser
2 /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:

Image 9. Formatting Namenode.


Step 5: Starting your Single-Node Cluster

1. Run the command:

1 /usr/local/hadoop/bin/start-all.sh

This will start a NameNode, a SecondaryNameNode, a DataNode, a JobTracker and a TaskTracker on your machine. The output will look like this:

Image 10. Starting the Single-Node Cluster.

2. Check whether the expected Hadoop processes are running:

1 cd /usr/local/hadoop
2 jps

The output will look like this (Process ids and ordering of processes may differ):

Image 11. Running Hadoop Processes.

If all six processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker and the Jps tool itself) are listed, your Hadoop installation is working fine.

3. You can also check if Hadoop is listening on the configured ports. Open a new terminal and run

1 sudo netstat -plten | grep java

Output will look like this:

Image 12. Hadoop is Listening.


Step 6: Stopping your Single-Node Cluster

To stop all the daemons running on your machine, run the command:

1 /usr/local/hadoop/bin/stop-all.sh

The output will look like this:

Image 13. Stopping the Single-Node Cluster.

Congratulations! You have successfully installed Hadoop. You can start working with it now; just remember to start your cluster first. Happy Hadooping!

— * — * — * — * —

Sources