Loading data into Hadoop using Sqoop and Flume
What is Hadoop?
Apache Hadoop is a Java-based, open-source software framework for distributed storage and processing of large amounts of data. Hadoop modules are designed on the fundamental assumption that hardware failures are common and should be handled automatically by the framework. In addition to being fault tolerant, Hadoop can manage very large amounts of data and a large number of concurrent tasks. It is modular and built around a distributed file system.
A Hadoop-based big data analytics framework has two main parts: a data processing framework and a distributed file system for data storage.
- The storage layer of Hadoop, HDFS (Hadoop Distributed File System), is a distributed, scalable, Java-based file system used to store large volumes of unstructured data.
- The data processing framework is a Java-based system called MapReduce. The “Map” function divides a query into multiple parts and processes the data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
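To make the Map and Reduce roles concrete, here is a minimal sketch in plain Python of the classic word-count job. It only simulates the flow on one machine and does not use Hadoop's own API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and each input string stands in for a block of data that would live on a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for a block of data processed on a separate node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data nodes store data"]))
```

In a real cluster the map calls would run in parallel on the nodes holding the data, and the framework would perform the grouping step before handing each key to a reducer.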
Loading Data into Hadoop
Analyzing data with Hadoop requires large amounts of data to be loaded into HDFS on the cluster before it can be processed. Because Hadoop is often used with unstructured data, loading heterogeneous data from many sources poses a challenge. Maintaining data consistency and making efficient use of resources are among the considerations when selecting the right approach to load data.
Apache Sqoop is a tool designed for bulk import of data into HDFS from structured data stores such as relational databases, data warehouses, and NoSQL systems. Sqoop works with any RDBMS that offers Java Database Connectivity (JDBC). It is not event driven; it is based on a connector architecture in which pluggable connectors provide connectivity to different external systems.
Most Database Management Systems (DBMS) are designed with the SQL standard in mind; however, each DBMS differs from the others to some degree. These differences make it challenging to bulk load data from one system into another. Sqoop connectors are the components that help overcome these challenges.
Sqoop has connectors designed to work with a range of popular relational databases, including MySQL, Oracle, SQL Server, and DB2. Each of these connectors is written to interact with its associated DBMS. Sqoop also includes a generic JDBC connector for any database that supports the JDBC protocol, and it provides optimized MySQL connectors that use database-specific APIs to perform bulk transfers efficiently.
Sqoop also has various third-party connectors that do not ship with the package but can be downloaded and installed into an existing Sqoop installation.
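As a rough illustration of a bulk import, the sketch below builds the argument list for a `sqoop import` command in Python. The flags shown (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import arguments; the helper name `build_sqoop_import` and every value passed to it (JDBC URL, user, paths, table) are hypothetical placeholders, not settings from a real environment.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, num_mappers=4):
    """Return the argument list for a `sqoop import` invocation."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,   # file holding the database password
        "--table", table,                   # source table to import
        "--target-dir", target_dir,         # HDFS directory for the imported files
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    args = build_sqoop_import(
        jdbc_url="jdbc:mysql://db.example.com/sales",
        username="etl_user",
        password_file="/user/etl/.db_password",
        table="orders",
        target_dir="/data/raw/orders",
    )
    # Print the command for review; pass `args` to subprocess.run to execute it.
    print(shlex.join(args))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later handed to `subprocess.run`.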
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming log data. Flume is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. Log files are a particularly difficult source of data to manage: they take up large amounts of space and are rarely stored in a place on disk where Hadoop developers can use them effectively. Apache Flume is an easy-to-use tool that pushes logs from application servers to various repositories via configurable agents, which helps with collecting and processing large volumes of data.
A Flume agent is a JVM process with three components, which developers can combine into different topologies to collect data on any application server and direct it to any log repository:
- Flume Source - defines where the data comes from.
- Flume Sink - defines the destination of the data pipelined from the sources.
- Flume Channel - the pipe that connects a source to a sink.
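The three components above are wired together in a properties file that the agent reads at startup. The following sketch renders such a file from Python, assuming an `exec` source that tails a log file, a `memory` channel, and an `hdfs` sink, all of which are built-in Flume component types. The helper name `render_flume_agent`, the component names `r1`, `c1`, and `k1`, and the agent name and paths in the usage example are illustrative assumptions.

```python
def render_flume_agent(agent, log_path, hdfs_path):
    """Render a Flume properties file wiring one source, channel, and sink."""
    lines = [
        # Declare the components that belong to this agent.
        f"{agent}.sources = r1",
        f"{agent}.channels = c1",
        f"{agent}.sinks = k1",
        "",
        # Source: where the data comes from (here, a tailed log file).
        f"{agent}.sources.r1.type = exec",
        f"{agent}.sources.r1.command = tail -F {log_path}",
        f"{agent}.sources.r1.channels = c1",
        "",
        # Channel: the pipe between the source and the sink.
        f"{agent}.channels.c1.type = memory",
        "",
        # Sink: the destination of the data (here, a directory in HDFS).
        f"{agent}.sinks.k1.type = hdfs",
        f"{agent}.sinks.k1.hdfs.path = {hdfs_path}",
        f"{agent}.sinks.k1.channel = c1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Agent name and both paths are hypothetical placeholders.
    print(render_flume_agent("a1", "/var/log/app/app.log", "hdfs://namenode/flume/events"))
```

Note that a source lists its `channels` in the plural because it can feed several channels, while a sink reads from exactly one `channel`.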
A node is an event pipe in Flume that reads from the source and writes to the sink. A Flume node can be configured to interpret each event and transform it as it passes through.
The characteristics and role of a Flume node are determined by the behaviour of its source and sink. Apache Flume ships with several source and sink options, but if none of them fits their requirements, developers can write their own.
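The idea of a node as an event pipe with a pluggable source, sink, and transformation can be sketched in a few lines of plain Python. This is only a single-process simulation of the concept, not Flume's API: `run_node` and `tag_level` are hypothetical names, the `deque` stands in for a channel, and the sample log lines are invented.

```python
from collections import deque


def run_node(source, sink, transform=None):
    """Read events from `source`, optionally transform them, and write them to `sink`."""
    channel = deque()  # stands in for the channel buffering events in transit
    for event in source:
        if transform is not None:
            event = transform(event)
        channel.append(event)
    while channel:
        sink(channel.popleft())  # deliver events in arrival order


def tag_level(line):
    """Example transform: split a raw log line into a level and a message body."""
    level, _, body = line.partition(" ")
    return {"level": level, "body": body}


if __name__ == "__main__":
    collected = []
    # Any iterable can act as the source and any callable as the sink.
    run_node(["INFO service started", "ERROR disk full"], collected.append, transform=tag_level)
    print(collected)
```

Because the source is any iterable and the sink is any callable, swapping in a different origin or destination does not change the pipe itself, which mirrors how custom sources and sinks plug into a node.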