An Overview to Setting up Hadoop Clusters
What is Big data?
Data is an integral part of every business, from the smallest to the most widespread organizations. Managing data matters now more than ever because more information is stored than we can readily process. This information comes from almost everything electronic: sensors in locomotives and machinery, social and digital media, GPS and mobile phone data, and even text messages. Not all of this data is structured. Structured data is data organized neatly, for example in a relational database, and it can be queried and analyzed with straightforward, readily available algorithms. Unstructured data, which is what big data largely encapsulates, is data that exists but lacks such organization: social media conversations, email, weather and satellite sensor data, cell phone records, texts and messages.
People quickly realized how much data we would be processing, and in 2001 analyst Doug Laney identified three characteristics used to classify data as big data or not. These are the 3 V's: the Volume of data that is to be managed, the Variety of data that must be handled, and the Velocity of data creation and transfer.
It soon became clear that traditional data management tools could not handle big data, which drove the creation of distributed frameworks like Apache Hadoop to do exactly that.
What is Hadoop?
Apache Hadoop is a Java-based open-source software framework for distributed storage and processing of large amounts of data. Hadoop modules are designed with the fundamental assumption that hardware failures are common and should be handled automatically by the framework. In addition to fault tolerance, Hadoop is capable of managing very large amounts of data and an even larger number of concurrent tasks. It is open-source, modular and built on a distributed file system.
The Hadoop framework has two main parts: a distributed file system for data storage and a data processing framework. The distributed file system is a far-flung array of storage clusters called the Hadoop Distributed File System (HDFS). The data processing framework is a Java system called MapReduce.
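The division of labor in MapReduce can be sketched with ordinary shell pipes on a single machine (purely illustrative; this is not Hadoop itself, just the same word-count pattern that MapReduce distributes across a cluster):

```shell
# Word count in the MapReduce style, on one machine:
#   map:     split each line into one word per line (tr)
#   shuffle: bring identical keys together (sort)
#   reduce:  count the occurrences of each key (uniq -c)
printf 'the quick brown fox\nthe lazy dog\n' \
  | tr -s ' ' '\n' | sort | uniq -c | sort -rn
```

In Hadoop, the same pattern runs in parallel: mappers emit (word, 1) pairs on many nodes, the framework shuffles pairs with the same key to one reducer, and reducers sum the counts.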
Setting up Hadoop Clusters
We’ll now look at the steps involved in configuring a single node cluster setup and a multi node cluster setup using Ubuntu Linux.
Java 6 JDK -
Hadoop requires a working installation of Java 1.5 or later (Java 6 recommended). To get Java, first update your machine's source list;
user@ubuntu:~$ sudo apt-get update
Then, if Java is not already installed on your system, install it using;
user@ubuntu:~$ sudo apt-get install sun-java6-jdk
Adding a dedicated Hadoop system user
Add a dedicated user account for running Hadoop using;
user@ubuntu:~$ sudo addgroup test_group
user@ubuntu:~$ sudo adduser --ingroup test_group user1
This will add the user “user1” and the group test_group to the local machine. Add user1 to the sudo group using;
user@ubuntu:~$ sudo adduser user1 sudo
Hadoop needs SSH access to manage its nodes. We therefore need to configure SSH access to localhost for the user we created earlier.
To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from machines in the cluster. To do this, generate a public/private key pair that will be shared across the cluster.
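A minimal way to do this (assuming OpenSSH is installed) is to generate an RSA key with an empty passphrase for user1 and authorize it for logins to localhost:

```shell
# As user1: create an RSA key pair with an empty passphrase (-P "")
mkdir -p ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorize the new public key for password-less login to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Verify: this should now log in without asking for a password
ssh localhost
```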
Switch to the user you created, “user1”, using;
user@ubuntu:~$ su - user1
Next, download and extract the Hadoop package, and set up environment variables for Hadoop.
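As a sketch (the 1.2.1 release, the archive URL and the /usr/local/hadoop location are assumptions; substitute whatever release and paths you actually use):

```shell
# Download and unpack a Hadoop release into /usr/local (1.2.1 is only an example)
cd /usr/local
sudo wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
sudo tar -xzf hadoop-1.2.1.tar.gz
sudo mv hadoop-1.2.1 hadoop
sudo chown -R user1:test_group hadoop

# Make Java and Hadoop visible to user1's shell (append to user1's ~/.bashrc)
export JAVA_HOME=/usr/lib/jvm/java-6-sun   # assumed JDK path; match your install
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
```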
To configure hadoop-env.sh and conf/*-site.xml, see the doctuts Hadoop documentation.
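For orientation, a minimal conf/core-site.xml for a single-node setup conventionally looks like the following (the port 54310 and the tmp directory are customary choices from single-node tutorials, not requirements):

```xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>Base directory for Hadoop's temporary files.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>URI of the default (HDFS) filesystem.</description>
  </property>
</configuration>
```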
Starting the Single Node Cluster
Before you start the node, format the HDFS filesystem via the NameNode by running this command;
user1@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
Next, make the Hadoop folder writable by user1, for example with a blunt sudo chmod -R 777 /usr/local/hadoop (a more restrictive sudo chown -R user1:test_group /usr/local/hadoop also works), and run the following command to start the cluster;
user1@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start a NameNode, a DataNode, a JobTracker and a TaskTracker.
To build a multi-node cluster, we'll take two single-node setups and merge them into one multi-node cluster, in which one machine (the master) will run the master daemons while also acting as a slave, and the other will be a dedicated slave.
Configure two single-node clusters and connect them over the same network, for example by giving them the hostnames master and slave in each machine's /etc/hosts file.
Ensure SSH access, where “user1” on the master must be able to connect to its own user account on the master and to “user1” on the slave without a password.
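For example, with the (assumed) addresses 192.168.0.1 and 192.168.0.2, adding these lines to /etc/hosts on both machines lets them reach each other as master and slave:

```
192.168.0.1    master
192.168.0.2    slave
```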
To configure the conf/masters, conf/slaves and conf/*-site.xml files, see the doctuts Hadoop documentation.
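For this two-machine layout, the files on the master conventionally contain the following; note that in Hadoop 1.x, conf/masters actually lists where the SecondaryNameNode runs, and the master appears in conf/slaves as well because it also acts as a slave:

```
conf/masters:
    master

conf/slaves:
    master
    slave
```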
Starting the Multi-Node Cluster
The cluster is started in 3 steps:
1. Format the cluster's HDFS filesystem via the NameNode;
user1@master:/usr/local/hadoop$ bin/hadoop namenode -format
2. Start the HDFS daemons: the NameNode daemon on the master, and DataNode daemons on all slaves;
user1@master:/usr/local/hadoop$ bin/start-dfs.sh
3. Start the MapReduce daemons: the JobTracker on the master, and TaskTracker daemons on all slaves;
user1@master:/usr/local/hadoop$ bin/start-mapred.sh
Running these three commands on the master brings the whole cluster up.
By Aditi Bhat