Big Data on your local machine: How to install Hadoop 2.6.0
Many people, from time to time, google the phrase “how to install hadoop in my cloud”.
Today I will try to help all these people with their problem. I am going to describe, step by step, the whole Apache Hadoop 2.6.0 installation, together with the troubles I ran into along the way.
Exordium
Let’s start!
Of course, you should have a Virtual Machine (VM) with, for example, Ubuntu 14.10. I can’t give any guarantees for other Unix systems, sorry.
Let’s suppose you have already installed the operating system.
If you don’t have the ssh and rsync clients, please install them:
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Before installing Hadoop you need a fresh version of Java on your VM. If you have no Java, install it with the following commands.
If you want OpenJDK 7:
$ sudo apt-get install openjdk-7-jdk
If you want the Oracle JDK instead, first remove all OpenJDK packages:
$ sudo apt-get purge openjdk*
After that you can install Oracle JDK 7 with the following commands:
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
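Whichever option you chose, you can quickly confirm which Java ended up on the PATH:
$ java -version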
Comment: I’d prefer Java 8, but it can bring a lot of problems with a fresh Hadoop 2.6.0, and it is very unlikely that a newbie will write complicated MapReduce jobs on their first encounter with Hadoop.
Hadoop user setup
All the manuals recommend creating a new user with the typical name ‘hduser’, which will own and run everything related to Hadoop (whether it really handles our data is another story):
$ sudo addgroup hadoop   # creates a dedicated user group
$ sudo adduser --ingroup hadoop hduser   # creates the user and puts it into the group
The system will ask you for a password for the new user; please type it and remember it.
$ sudo usermod -aG sudo hduser   # gives the new user sudo rights
After that you should switch to the new user; all further environment changes will be made for this user.
It’s easy: type su hduser and enter the password (which, of course, you remembered).
Now, after logging in successfully, you should create and set up SSH keys, because Hadoop uses SSH to access its nodes.
$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
After the first command the system will ask for a filename to store the key; you can simply press Enter to accept the default. The second command appends the public key to the list of authorized keys, so Hadoop can connect over SSH without asking for a password.
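You can check that passwordless SSH really works before going any further (the very first connection may ask you to confirm the host key):
$ ssh localhost
$ exit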
Installing Hadoop
Let’s create a dedicated directory for downloading Hadoop.
$ sudo mkdir /downloads
$ cd /downloads/
$ sudo wget http://apache-mirror.rbc.ru/pub/apache/hadoop/common/stable/hadoop-2.6.0.tar.gz
After that we should move the archive to /usr/local (the typical directory) and unpack it. Of course, you should also give permissions (write, read and all possible dirty things) to hduser.
$ sudo mv /downloads/hadoop-2.6.0.tar.gz /usr/local/
$ cd /usr/local/
$ sudo tar xzf hadoop-2.6.0.tar.gz
$ sudo mv hadoop-2.6.0 hadoop
$ sudo chown -R hduser:hadoop hadoop
Configuration files: to change or not to change?
I am going to give you simple settings for a Single Node Cluster; it will be enough for playing in the sandbox.
There are several files we need to modify:
- ~/.bashrc
- /usr/local/hadoop/etc/hadoop/hadoop-env.sh
- /usr/local/hadoop/etc/hadoop/core-site.xml
- /usr/local/hadoop/etc/hadoop/yarn-site.xml
- /usr/local/hadoop/etc/hadoop/mapred-site.xml (created from mapred-site.xml.template)
- /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Let’s open .bashrc with the command sudo nano ~/.bashrc and append the following lines:
#HADOOP VARIABLES START
export JAVA_HOME=<your path to jdk>
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
where <your path to jdk> is the directory where the JDK was installed.
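If you are not sure where the JDK lives, on a typical Ubuntu setup the JDKs sit under /usr/lib/jvm, and you can also resolve the java binary found on the PATH (the JAVA_HOME value is the installation directory above bin or jre/bin):
$ ls /usr/lib/jvm/
$ readlink -f /usr/bin/java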
Comment: To save a file in nano, press Ctrl+X and then Y. To reload the environment variables without re-logging, type source ~/.bashrc
Of course, Hadoop also wants to know the path to your Java installation for its own secret deals. You need to edit hadoop-env.sh:
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
where you should set the path to your current Java location in the line
export JAVA_HOME=/usr/lib/jvm/<... something else>
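Once JAVA_HOME is set both in .bashrc and in hadoop-env.sh, a quick sanity check is the version command; it only prints build information and needs no further configuration:
$ hadoop version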
But we need to go deeper, to the core of Hadoop settings, core-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Here you should add your own configuration, which overrides the default Hadoop configuration.
Please add the HDFS property between the <configuration> tags (in Hadoop 2.x this property is officially called fs.defaultFS, but the old fs.default.name still works):
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
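After saving the file you can double-check that Hadoop picks the value up; the getconf tool ships with Hadoop 2.x and simply reads the configuration (assuming the variables from .bashrc are already loaded):
$ hdfs getconf -confKey fs.defaultFS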
Don’t forget about the YARN settings; Hadoop has changed a lot since the times of the first Hadoop. Now you should open and edit yarn-site.xml:
$ sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
and add the shuffle properties between the <configuration> tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Do you know the idea behind the MapReduce algorithm? Chains of jobs, mapping, reducing and so on. Of course we need to configure it as well, but in this case you should first copy the template file and then edit the copy:
$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
and add the framework property between the <configuration> tags:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Each node in a Hadoop cluster stores part of HDFS (Hadoop’s distributed file system), and you should set correct paths to the directories that will hold the future metadata and data, known as the namenode and datanode directories respectively.
But first you must create these directories and hand them over to hduser:
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
and after that you can open the HDFS settings and edit them
$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
and add the replication and path properties between the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
Last steps …
You need to format HDFS before the first use with the following command
$ hdfs namenode -format
and if it finishes successfully you can start Hadoop with the following commands:
$ start-dfs.sh
$ start-yarn.sh
and in many cases you should also start the history server
$ mr-jobhistory-daemon.sh start historyserver
After this hard work you can enjoy a working Hadoop; verify it with the command
$ jps
4868 SecondaryNameNode
5243 NodeManager
5035 ResourceManager
4409 NameNode
4622 DataNode
5517 Jps
If you see a result similar to the six rows above, it means that you now have a functional instance of Hadoop running on your VM.
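You can also take a look at Hadoop through its web interfaces; with the default ports of Hadoop 2.6.0 they should be available at:
- http://localhost:50070 (NameNode web UI)
- http://localhost:8088 (YARN ResourceManager web UI)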
In conclusion, I suggest running one of the bundled Hadoop examples and having a lot of fun with a real working Hadoop instance.
All available examples can be listed with the following command:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar
The famous ‘wordcount’ job can be run with the following command and parameters (you should put something into the ‘/in’ directory in HDFS first, see the example below):
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /in /out
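For example, you can create the /in directory in HDFS and upload a couple of Hadoop configuration files as sample input; after the job finishes, the word counts can be read back from /out (the paths match the ones used earlier in this guide):
$ hdfs dfs -mkdir /in
$ hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml /in
$ hdfs dfs -cat /out/*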
The next post, about Pig installation, is coming soon!
So Long, and Thanks for all the Fish!
Originally published at zaleslaw.blogspot.com.