Home > Blogs > An Overview to Setting up Hadoop Clusters
What is Big data?
Data is an integral part of a business right from the smallest to the most widespread organizations. Managing data is now very relevant because there currently exists more sored information that we can process. This information is obtained from almost everything electronic, like sensors used in locomotives and machinery, social and digital media, GPS and mobile phone data, and even text messages. This data is not necessarily structured. Structured data is data that is organized perfectly using a relational database for example. Structured data can be queried and analyzed using available and straightforward algorithms. Unstructured data, which is what big data encapsulates, is data that exists but lacks structure. Once again, examples of unstructured data are social media conversations, email, weather and satellite data from sensors, cell phone data, texts and messages.
People quickly realized how much data we’d be processing, and in 2001 recognized 3 characteristics of data that would be used to classify it as big data or not. These are the 3 V’s - Volume of data that is to be managed, Variety of data that must be handled, and Velocity of data creation and transfer.
It was soon clear that traditional data management tools could not manage big data, and this drove the need to create distributed frameworks like Apache Hadoop to do just this.
What is Hadoop?
Apache Hadoop is a Java based open-source software framework for distributed storage and processing of large amounts of data. Hadoop modules are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. In addition to fault tolerance Hadoop is capable managing very very large amounts of data and an even larger number of concurrent tasks. Hadoop is open-source, modular and utilizes a distributed file system.
The Hadoop framework has two main parts – a data processing framework and a distributed filesystem for data storage. The distributed file system is a far-flung array of storage clusters termed as the Hadoop Distributed File System (HDFS). The data processing framework is a java system termed as MapReduce.
Setting up Hadoop Clusters
We’ll now look at the steps involved in configuring a single node cluster setup and a multi node cluster setup using Ubuntu Linux.
Java 6 JDK -
Hadoop requires a working installation of Java 1.5+. To get Java update your machine source list;
user@ubuntu:~$ sudo apt-get update
Or, if you already have Java on your system install it using;
user@ubuntu:~$ sudo apt-get install sun-java6-jdk
Adding a dedicated Hadoop system user
Add a dedicated Hadoop user account for running Hadoop using;
user@ubuntu:~$ sudo addgroup test_group
user@ubuntu:~$ sudo adduser --ingroup test_group user1
This will add the user “user1” and the group test_group to the local machine. Add user1 to the sudo group using;
user@ubuntu:~$ sudo adduser user1 sudo
Hadoop needs SSH access to manage its nodes. We therefore need to configure SSH access to localhost for the user we created in the earlier.
To work seamlessly, SSH needs to be setup to allow password-less login for the hadoop user from machines in the cluster. To do this, generate a public/private key pair, that will be shared across the cluster.
Switch to the user you created. “user1” using;
user1@ubuntu:~$ su - user1
Next, download and extract the Hadoop package, and set up environment variables for Hadoop.
To configure hadoop-env.sh and conf/*-site.xml visit doctuts Hadoop documentation.
Starting the Single Node Cluster
Before you start the node, formatting the HDFS filesystem via the NameNode, by running this command;
user1@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode –format
Next, give the hadoop folder a -R 777 permission and run the following command to start the cluster;
This will start a Namenode, Datanode, Jobtracker and a Tasktracker;
To start a multi-node cluster, we’ll take an example of using two single-nodes and merge them into one multi-node cluster, of which one will be the master but also act as the slave, and the other will be a dedicated slave.
Configure two single-node clusters and network them by putting them both on the same hardware configuration.
Ensure SSH access where the “user1” on the master must be able to connect it’s own user account on the master and to “user1” on the slave.
To configure the conf/masters, conf/slaves and conf/*-site.xml visit the doctuts Hadoop documentation.
Start the multi-node Cluster
The cluster is started in 3 steps:
1.Format the cluster’s HDFS filesystem using; user1@master:~/usr/local/hadoop$ bin/hadoop namenode -format
2.Start the HDFS daemons: NameNode daemon started on master, and DataNode daemons started on all slaves.
3.Next, start the MapReduce daemons: JobTracker started on master, and TaskTracker daemons started on all slaves.
Finally start the cluster by running the following command on the master;