Big Data on your local machine: How to install Hadoop 2.6.0
Many people, from time to time, google the phrase “how to install hadoop in my cloud”.
Today I will try to help all these people with their problem. I am going to describe, step by step, the whole Apache Hadoop 2.6.0 installation, together with the troubles I ran into along the way.
Exordium
Let’s start!
Of course, you should have a Virtual Machine (VM) with, for example, Ubuntu 14.10. I can’t give any guarantees for other Unix systems, sorry.
Let’s suppose you have already installed the operating system.
If you don’t have the ssh and rsync clients, please install them:
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Before installing Hadoop you need a fresh version of Java on your VM. If you have no Java, install it with the following commands.
If you want OpenJDK 7:
$ sudo apt-get install openjdk-7-jdk
If you want the Oracle JDK instead, first remove all OpenJDK packages:
$ sudo apt-get purge openjdk*
After that you can install Oracle JDK 7 with the following commands:
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
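Whichever option you chose, you can quickly confirm which Java ended up on the PATH:
$ java -version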
Comment: I’d prefer Java 8, but it can bring a lot of problems with a fresh Hadoop 2.6.0, and it is very unlikely that a newbie will write complicated MapReduce jobs on their first encounter with Hadoop.
Hadoop user setup
All the manuals recommend creating a new user with the typical name ‘hduser’, which will own and run everything related to Hadoop (whether it really handles our data is another story):
$ sudo addgroup hadoop   # creates a dedicated user group
$ sudo adduser --ingroup hadoop hduser   # creates the user and puts it into the group
The system will ask you for a password for the new user; please type it and remember it.
$ sudo usermod -aG sudo hduser   # gives the new user sudo rights
After that you should switch to the new user; all further environment changes will be made for this user.
It’s easy: type su hduser and enter the password (which, of course, you remembered).
Now, after logging in successfully, you should create and set up SSH keys, because Hadoop uses SSH to access its nodes.
$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
After the first command the system will ask for a filename to store the key; you can simply press Enter to accept the default. The second command appends the public key to the list of authorized keys, so Hadoop can connect over SSH without asking for a password.
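You can check that passwordless SSH really works before going any further (the very first connection may ask you to confirm the host key):
$ ssh localhost
$ exit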
Installing Hadoop
Let’s create a dedicated directory for downloading Hadoop.
$ sudo mkdir /downloads
$ cd /downloads/
$ sudo wget http://apache-mirror.rbc.ru/pub/apache/hadoop/common/stable/hadoop-2.6.0.tar.gz
After that we should move the archive to /usr/local (the typical directory) and unpack it. Of course, you should also give permissions (write, read and all possible dirty things) to hduser.
$ sudo mv /downloads/hadoop-2.6.0.tar.gz /usr/local/
$ cd /usr/local/
$ sudo tar xzf hadoop-2.6.0.tar.gz
$ sudo mv hadoop-2.6.0 hadoop
$ sudo chown -R hduser:hadoop hadoop
Configuration files: to change or not to change?
I am going to give you simple settings for a Single Node Cluster; it will be enough for playing in the sandbox.
There are several files we need to modify:
- ~/.bashrc
- /usr/local/hadoop/etc/hadoop/hadoop-env.sh
- /usr/local/hadoop/etc/hadoop/core-site.xml
- /usr/local/hadoop/etc/hadoop/yarn-site.xml
- /usr/local/hadoop/etc/hadoop/mapred-site.xml (created from mapred-site.xml.template)
- /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Let’s open .bashrc with the command sudo nano ~/.bashrc and append the following lines:
#HADOOP VARIABLES START
export JAVA_HOME=<your path to jdk>
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
where <your path to jdk> is the directory where the JDK was installed.
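If you are not sure where the JDK lives, on a typical Ubuntu setup the JDKs sit under /usr/lib/jvm, and you can also resolve the java binary found on the PATH (the JAVA_HOME value is the installation directory above bin or jre/bin):
$ ls /usr/lib/jvm/
$ readlink -f /usr/bin/java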
Comment: To save a file in nano, press Ctrl+X and then Y. To reload the environment variables without re-logging, type source ~/.bashrc
Of course, Hadoop also wants to know the path to your Java installation for its own secret deals. You need to edit hadoop-env.sh:
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
where you should set the path to your current Java location in the line
export JAVA_HOME=/usr/lib/jvm/<... something else>
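Once JAVA_HOME is set both in .bashrc and in hadoop-env.sh, a quick sanity check is the version command; it only prints build information and needs no further configuration:
$ hadoop version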
But we need to go deeper, to the core of Hadoop settings, core-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Here you should add your own configuration, which overrides the default Hadoop configuration.
Please add the HDFS property between the <configuration> tags (in Hadoop 2.x this property is officially called fs.defaultFS, but the old fs.default.name still works):
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
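After saving the file you can double-check that Hadoop picks the value up; the getconf tool ships with Hadoop 2.x and simply reads the configuration (assuming the variables from .bashrc are already loaded):
$ hdfs getconf -confKey fs.defaultFS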
Don’t forget about the YARN settings; Hadoop has changed a lot since the times of the first Hadoop. Now you should open and edit yarn-site.xml:
$ sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
and add the shuffle properties between the <configuration> tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Do you know the idea behind the MapReduce algorithm? Chains of jobs, mapping, reducing and so on. Of course we need to configure it as well, but in this case you should first copy the template file and then edit the copy:
$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
and add the framework property between the <configuration> tags:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Each node in a Hadoop cluster stores part of HDFS (Hadoop’s distributed file system), and you should set correct paths to the directories that will hold the future metadata and data, known as the namenode and datanode directories respectively.
But first you must create these directories and hand them over to hduser:
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
and after that you can open the HDFS settings and edit them
$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
and add the replication and path properties between the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
Last steps …
You need to format HDFS before the first use with the following command
$ hdfs namenode -format
and if it finishes successfully you can start Hadoop with the following commands:
$ start-dfs.sh
$ start-yarn.sh
and in many cases you should also start the history server
$ mr-jobhistory-daemon.sh start historyserver
After this hard work you can enjoy a working Hadoop; verify it with the command
$ jps
4868 SecondaryNameNode
5243 NodeManager
5035 ResourceManager
4409 NameNode
4622 DataNode
5517 Jps
If you see a result similar to the six rows above, it means that you now have a functional instance of Hadoop running on your VM.
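You can also take a look at Hadoop through its web interfaces; with the default ports of Hadoop 2.6.0 they should be available at:
- http://localhost:50070 (NameNode web UI)
- http://localhost:8088 (YARN ResourceManager web UI)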
In conclusion, I suggest running one of the bundled Hadoop examples and having a lot of fun with a real working Hadoop instance.
All available examples can be listed with the following command:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar
The famous ‘wordcount’ job can be run with the following command and parameters (you should put something into the ‘/in’ directory in HDFS first, see the example below):
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /in /out
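For example, you can create the /in directory in HDFS and upload a couple of Hadoop configuration files as sample input; after the job finishes, the word counts can be read back from /out (the paths match the ones used earlier in this guide):
$ hdfs dfs -mkdir /in
$ hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml /in
$ hdfs dfs -cat /out/*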
The next post, about Pig installation, is coming soon!
So Long, and Thanks for all the Fish!
Originally published at zaleslaw.blogspot.com.