What is Hadoop? Introduction, Architecture, Ecosystem, Components — Ultimate Guide 2023

London Data Consulting (LDC)
13 min read · Jan 10, 2023


Do you want to use Hadoop and its ecosystem in your Big Data project? You are in the right place! According to the findings of experts and of public and private institutions, 90% of the data collected since the beginning of humanity has been generated in the last two years. The market now calls this explosion of data “Big Data”.

To successfully exploit “Big Data”, the idea is no longer to centralize the storage and processing of data on a single server, but to distribute storage and parallelize processing across several computers. This is what Hadoop makes possible today. Hadoop long remained a project confined to the open source world, but it is now fast becoming the de facto standard for data processing in companies, much like Microsoft Excel gradually became the default data analysis software.

Despite this position, many companies still do not understand how to use Hadoop for their Big Data projects. In this article, we explain how to use Hadoop and its technological ecosystem in a Big Data project.

1. Hadoop and Big Data

You must understand that before Hadoop, the strategic approach companies used to manage their data was to centralize storage and processing on a central server in a client/server architecture. On that server, the data is managed by an RDBMS (Oracle, SQL Server, DB2, etc.). The central server is a very powerful machine, custom-designed by companies specializing in IT infrastructure such as EMC, Dell, HP and Lenovo.

The growth of the company’s data was handled by upsizing the server, that is, by increasing the physical capacity of its components: for example, increasing memory from 128 GB to 256 GB, moving from a 3 GHz quad-core CPU to a 5 GHz quad-core CPU, or replacing a 500 GB hard drive with a 2 TB one. The following figure illustrates this strategy.

Unfortunately, although many companies still manage their data according to this strategy, it poses several problems in the current context:

  • The scale of data growth today exceeds the reasonable capacity of traditional technologies, and even of the typical hardware configurations used to access this data. Centralizing data storage and processing on a single server puts significant pressure on the company’s IT architecture, which in turn increases the response time of client requests (latency). This increase in latency is harmful in many sectors of activity, in particular e-commerce, retail banking and industry.
  • The upsizing used to make the central server cope with the growth in data volume (this is what we call scalability) is limited by the maximum capacity of the hardware components. Even though it requires no change to the architecture, upsizing cannot exceed the inherent limits of the hardware: for example, you cannot currently find a 500 GB RAM stick on the market, and a motherboard only has a limited number of slots. The same reasoning applies to upsizing the hard disk and the processor. Upsizing therefore increases the scalability of the system only up to a certain threshold, beyond which performance no longer improves.

Google is one of the first companies to have felt these weaknesses very early on. In 2002, its then CEO, Eric Schmidt, sent shockwaves through the IT industry by announcing that Google had no intention of buying HP’s new server built around Intel’s latest Itanium microprocessor. In Google’s vision, with the falling cost of computers predicted by Moore’s Law, the future of computing would rest on data centers composed of many commodity machines organized in clusters.

From this point of view, Google introduced a new technology strategy that would gradually replace the classic client/server architecture. In 2002 this vision seemed far-fetched, but today it makes perfect sense. The approach proposed by Google consists in distributing the storage of data and parallelizing its processing across several commodity PCs organized into a cluster (each machine is called a node). Figure 2 illustrates this new strategy. Hadoop is the most mature software implementation of this approach.

Several software vendors have tried to offer solutions that implement this strategy of distributed storage and parallel processing, but without success. The few vendors that have succeeded offer solutions that cost a fortune (which indicates the level of investment required to develop them). Apart from specialized massively parallel DBMS solutions such as Teradata or SAP HANA, Hadoop is currently the only mature platform that successfully implements distributed storage and data parallelism. Unlike commercial solutions, whose evolution depends on the vendor’s financial investments, Hadoop capitalizes on the vast pool of resources represented by open source communities (Apache, GitHub, Hacker News, Stack Overflow, etc.).

Thus, from a strictly architectural point of view, distributing data storage and parallelizing processing on a cluster is a much better strategy for Big Data projects than the classic client/server architecture. In addition, with the cost of computer hardware falling, acquiring and scaling a cluster can over time be less expensive than maintaining a central server. For more details on Hadoop, you can consult the articles in our “Big Data Tutorial” section.

2. Basic Components of a Hadoop Cluster

Over time, a real technological ecosystem has developed around Hadoop to support the wide variety of data exploitation use cases and industry sectors. Hadoop has moved from being a “one-size-fits-all” piece of software, that is, software expected to provide all the functionality for every possible Big Data use case, to a real framework: a data management platform on which solutions to specific Big Data problems can be built. So when you start a Big Data project, keep in mind that acquiring Hadoop is only the first step; you then need to identify (if they exist) the specific solutions that can help you leverage the power of Hadoop for your use case. To make a simple analogy, developing a use case with Hadoop is similar to assembling a LEGO set: you have to know how to combine the tools of the ecosystem so that the whole meets the needs of your project. To date, the Hadoop ecosystem is made up of hundreds of technologies, which we have chosen to group into 14 categories according to their problem segment; we will come back to this below. All of these technologies rest on 3 basic components:

  • The distributed file system, for example Hadoop’s HDFS, which manages distributed data storage and provides the fault tolerance needed when operating a cluster.
  • The computational model, such as MapReduce, which defines how processing is parallelized across the cluster nodes.
  • The resource manager, such as YARN, which makes it possible to run several computation engines on the cluster and to exploit its capacity to the fullest.

The distributed file system, the parallel computation model and the resource manager are the 3 elements on which all Big Data technologies are built.
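To make the computation model concrete, here is a minimal, hedged sketch of MapReduce: a word count in which the map phase emits (word, 1) pairs and the reduce phase sums them per word. On a real cluster, the shuffle/sort between the two phases and the distribution across nodes are handled by Hadoop (for example via Hadoop Streaming, which runs scripts like this over HDFS data); here the pipeline is simply simulated locally on standard input.

```python
#!/usr/bin/env python3
"""Minimal word-count sketch in the MapReduce style, runnable locally."""
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key after the shuffle/sort.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of Map -> Shuffle/Sort -> Reduce over stdin.
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```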

3. The Hadoop Ecosystem

Hadoop is becoming the de facto standard for data processing, much like Excel gradually became the default data analysis software. Unlike Excel, however, Hadoop was not designed for “business analysts” but for developers. Yet the large-scale adoption and success of a standard does not depend on developers, but on business analysts. For this reason, Big Data problems have been segmented from a functional point of view, and for each segment, technologies that rely on Hadoop have been developed to meet its challenges.

All of these tools form what is called the Hadoop ecosystem. The Hadoop ecosystem enriches Hadoop and makes it capable of solving a wide variety of business problems. To date, it is made up of hundreds of technologies, which we have chosen to group into 14 categories according to their problem segment: abstraction languages, SQL on Hadoop (Hive, Pig), computation models (MapReduce, Tez), real-time processing tools (Storm, Spark Streaming), databases (HBase, Cassandra), streaming ingestion tools (Kafka, Flume), data integration tools (Sqoop, Talend), workflow coordination tools (Oozie, Control-M for Hadoop), distributed service coordination tools (ZooKeeper), cluster administration tools (Ranger, Sentry), user interfaces (Hue, Jupyter), content indexing tools (Elasticsearch, Splunk), distributed file systems (HDFS), and resource managers (YARN and Mesos).

In this part, we will review the function of each of the main tools that make up this ecosystem of Big Data technologies. If you want to go further afterwards, we recommend downloading our guide “Initiation to the Hadoop ecosystem”. The following mind map provides an overview of the Hadoop ecosystem.

The base configuration of the Hadoop ecosystem contains the following technologies: Spark, Hive, Pig, HBase, Sqoop, Storm, ZooKeeper, Oozie, and Kafka.

Spark

Before explaining what Spark is, remember that for an algorithm to run on several nodes of a Hadoop cluster, it must be parallelizable. We say that an algorithm is “scalable” if it is parallelizable (and can therefore take advantage of the scalability of a cluster). Hadoop is an implementation of the MapReduce computational model. The problem with MapReduce is that it is built on a directed acyclic graph model: the operations are executed in three strictly sequential phases (Map -> Shuffle -> Reduce), and no phase is iterative (or cyclic). This acyclic model is not suitable for applications that reuse data across multiple operations, such as most statistical learning algorithms (which are mostly iterative) and interactive data analysis queries.

Spark is a response to these limits. It is a computation engine that performs distributed, in-memory processing on a cluster. In other words, it is a distributed in-memory computation engine. Whereas MapReduce works in batch mode, Spark works in interactive mode: it loads data into memory before processing it, which makes it particularly suitable for machine learning workloads.
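As an illustration, here is a minimal PySpark sketch (assuming the pyspark package is installed; the input file data.txt is a made-up placeholder). It runs locally, but the same code can be submitted to a YARN-managed cluster; the cache() call is what keeps the intermediate result in memory for iterative or interactive reuse.

```python
# Minimal PySpark sketch: word count with an in-memory (cached) result.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wordcount-demo")
         .master("local[*]")          # on a cluster, point this at YARN instead
         .getOrCreate())

lines = spark.read.text("data.txt").rdd.map(lambda row: row[0])  # placeholder file
words = lines.flatMap(lambda line: line.split())
counts = (words.map(lambda w: (w.lower(), 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())              # keep the result in memory for reuse

# Because the RDD is cached, repeated or interactive queries reuse the
# in-memory data instead of re-reading and recomputing from disk.
print(counts.take(10))
print(counts.count())

spark.stop()
```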

Hive

Hive is a data warehouse infrastructure that provides query and aggregation services for very large volumes of data stored on a distributed file system such as HDFS. Hive provides an SQL-like query language (close to the ANSI SQL-92 standard) called HiveQL (Hive Query Language), which is used to query data stored on HDFS. HiveQL also allows advanced users/developers to embed Map and Reduce functions directly into their queries to cover a wider range of data management problems. When you write a query in HiveQL, Hive transforms it into a MapReduce job and submits it to the JobTracker (or YARN) for execution.
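As a hedged illustration, the following sketch queries Hive from Python using the PyHive client. It assumes a reachable HiveServer2 instance; the host name and the orders table are made up for the example. Hive itself takes care of compiling the HiveQL into MapReduce (or Tez) jobs.

```python
# Hedged sketch: querying Hive from Python with PyHive (pip install pyhive).
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into distributed jobs on the cluster.
cursor.execute("""
    SELECT country, COUNT(*) AS nb_orders
    FROM orders
    GROUP BY country
    ORDER BY nb_orders DESC
    LIMIT 10
""")

for country, nb_orders in cursor.fetchall():
    print(country, nb_orders)

cursor.close()
conn.close()
```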

Pig

Pig is an interactive data flow execution environment for Hadoop. It is composed of 2 elements: a data flow expression language called Pig Latin, and an interactive environment for executing these data flows.

The language offered by Pig, Pig Latin, is roughly similar to scripting languages such as Perl, Python or Ruby. However, it is more specialized than these and is better described by the term “data flow language”. It allows you to write queries as sequential flows that transform source data into “target” data in Hadoop, much like an ETL tool.

These flows are then transformed into MapReduce jobs, which are finally submitted to the JobTracker for execution. Simply put, Pig is Hadoop’s ETL. Programming in Pig Latin means describing, as independent but interconnected flows, how data is loaded, transformed and aggregated using specific Pig instructions called operators. Mastering these operators is the key to mastering Pig Latin programming, especially since there are relatively few of them compared with, for example, the constructs of Hive.
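Here is a hedged sketch of a small Pig Latin data flow, written as a string in Python and executed with the pig command-line client in local mode. The file paths and field names are made up; on a cluster, the same script would be compiled into MapReduce jobs and submitted to Hadoop.

```python
# Hedged sketch: a tiny Pig Latin flow run in local mode from Python.
# Assumes the `pig` CLI is on the PATH; paths and fields are placeholders.
import subprocess
import textwrap

pig_script = textwrap.dedent("""
    -- Load raw logs, keep errors only, count them per day (ETL-style flow).
    logs    = LOAD 'logs.csv' USING PigStorage(',') AS (day:chararray, level:chararray, msg:chararray);
    errors  = FILTER logs BY level == 'ERROR';
    by_day  = GROUP errors BY day;
    counts  = FOREACH by_day GENERATE group AS day, COUNT(errors) AS nb_errors;
    STORE counts INTO 'error_counts' USING PigStorage(',');
""")

with open("error_counts.pig", "w") as f:
    f.write(pig_script)

# -x local runs the flow on the local file system for testing.
subprocess.run(["pig", "-x", "local", "error_counts.pig"], check=True)
```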

HBase

Before talking about HBase, recall that the RDBMSs used until now for data management have quickly shown their limits in the face of both the high volume and the diversity of data. Indeed, RDBMSs are designed to manage only structured data (row/column tables), and as data volume increases, query latency increases. This latency is harmful for many businesses that require near real-time responses. To overcome these limits, new DBMSs called “NoSQL” have emerged. These impose no particular structure on the data, are able to distribute the storage and management of data across several nodes, and are scalable. As a reminder, scalability means that system performance remains stable as the processing load increases. HBase falls into this category of DBMS.

HBase is a distributed, column-oriented DBMS that provides real-time read and write access to data stored on HDFS. Where HDFS provides sequential, batch-oriented access to data, which is unsuitable for problems requiring fast access such as streaming, HBase fills this gap and provides fast access to data stored on HDFS.
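As a hedged illustration, the sketch below uses the happybase Python client to write and read individual rows in near real time. It assumes an HBase Thrift server is running; the table, row keys and column families are made up for the example.

```python
# Hedged sketch: random reads/writes against HBase via happybase (Thrift).
import happybase

connection = happybase.Connection(host="hbase-thrift.example.com", port=9090)
table = connection.table("user_events")   # hypothetical table

# Write: a row key plus column-family:qualifier cells (all values are bytes).
table.put(b"user42#2023-01-10", {
    b"event:type": b"click",
    b"event:page": b"/pricing",
})

# Read a single row back in near real time, without a batch job.
print(table.row(b"user42#2023-01-10"))

# Scan a range of row keys (HBase keeps rows sorted by key).
for key, data in table.scan(row_prefix=b"user42#"):
    print(key, data)

connection.close()
```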

Sqoop

Sqoop, or SQL-to-Hadoop, is a tool for transferring data from a relational database to Hadoop’s HDFS and vice versa. It is integrated into the Hadoop ecosystem and is what we call the data ingestion scheduler of Hadoop. You can use Sqoop to import data from RDBMSs such as MySQL, Oracle or SQL Server into HDFS, transform the data in Hadoop via MapReduce or another computation model, and then export it back into the RDBMS.

We call it the data ingestion scheduler because, just like Oozie (below), it automates this import/export process and schedules when it runs. All you have to do as a user is write the SQL queries that will be used to perform the import/export. Under the hood, Sqoop uses MapReduce to import and export the data, which makes it efficient and fault-tolerant. The following figure illustrates Sqoop’s functions particularly well.
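In practice, Sqoop is driven from the command line. The hedged sketch below launches a typical sqoop import from Python; the JDBC connection string, credentials and table name are placeholders.

```python
# Hedged sketch: a typical `sqoop import` invocation launched from Python.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/sales",   # placeholder JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",   # safer than --password on the CLI
    "--table", "orders",                           # source table in the RDBMS
    "--target-dir", "/data/raw/orders",            # destination directory in HDFS
    "--num-mappers", "4",                          # degree of MapReduce parallelism
], check=True)
```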

Storm

To understand Storm, you have to understand the notion of lambda (λ) architectures, and to understand the interest of lambda architectures, you have to understand the concept of connected objects. Connected objects, or the Internet of Things (IoT), represent the extension of the Internet to our daily lives. They generate streaming data and, in most of their use cases, require that data be processed in real time. Batch computation models are not suited to the real-time problems raised by the IoT, and even interactive computation models are not suitable for continuous real-time processing.

Unlike the operational data produced by a company’s operational systems, such as finance or marketing, which even when produced as a stream can be logged for later processing, the data produced in streaming by phenomena such as the IoT or the Internet expires (is no longer valid) within moments of its creation and therefore requires immediate processing. Beyond connected objects, business problems such as fraud detection, social network analysis and geolocation require very low response times, often under a second.

To solve this problem in a Big Data context, so-called λ architectures have been devised. These architectures add two extra processing layers to MapReduce in order to reduce latency. Storm is a software implementation of the λ architecture: it allows you to develop applications on Hadoop that process data in (near) real time.

ZooKeeper

Synchronizing or coordinating communication between nodes during parallel task execution is one of the most difficult problems in distributed application development. To solve this problem, Hadoop has introduced so-called service coordination tools, in this case ZooKeeper, into its ecosystem. ZooKeeper takes care of the complexity inherent in synchronizing the execution of distributed tasks in the cluster and frees the other tools of the Hadoop ecosystem from having to deal with this problem themselves.

It also allows users to develop distributed applications without being experts in distributed programming. Without going into the intricate details of coordination between the nodes of a Hadoop cluster, ZooKeeper provides a distributed configuration service, a synchronization service, and a naming registry for distributed applications. ZooKeeper is Hadoop’s way of coordinating distributed jobs.
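To give a feel for these services, here is a hedged sketch using the kazoo Python client for ZooKeeper: shared configuration in a znode, an ephemeral node for membership, and a distributed lock. It assumes a reachable ZooKeeper ensemble; all paths and names are made up.

```python
# Hedged sketch: basic coordination primitives with kazoo (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")   # placeholder ensemble address
zk.start()

# Shared configuration: any node in the cluster can read this znode.
zk.ensure_path("/myapp/config")
zk.set("/myapp/config", b"batch_size=500")

# Ephemeral znode: disappears automatically if this worker dies, which is
# how group membership and failure detection are coordinated.
zk.create("/myapp/workers/worker-", b"", ephemeral=True, sequence=True, makepath=True)

# Distributed lock so only one worker runs a critical section at a time.
lock = zk.Lock("/myapp/locks/reindex", identifier="worker-1")
with lock:
    print("Only one worker at a time gets here.")

zk.stop()
```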

Oozie

By default, Hadoop runs jobs as they are submitted by the user, regardless of the relationships they may have with each other. However, the problems for which Hadoop is used generally require writing several interdependent jobs. When two related jobs are submitted to the JobTracker (or YARN), it executes them without paying attention to the link between them, which risks raising an error (exception) and stopping the processing.

How do you manage the execution of several jobs that relate to the same problem? The simplest solution today is to use a job scheduler, in this case Oozie. Oozie is a job execution scheduler that runs as a service on a Hadoop cluster. It is used to schedule Hadoop jobs and, more generally, the execution of any job that can run on the cluster, for example a Hive script, a MapReduce job, a Hama job, a Storm job, etc. It was designed to manage the immediate or deferred execution of thousands of interdependent jobs on a Hadoop cluster automatically. To use Oozie, all you have to do is configure 2 XML files: a configuration file for the Oozie engine and a configuration file for the job workflow.
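As a hedged illustration, the sketch below submits a workflow with the Oozie command-line client from Python. The Oozie URL and the job.properties values are placeholders, and the workflow.xml referenced by oozie.wf.application.path is assumed to already exist in HDFS.

```python
# Hedged sketch: launching an Oozie workflow from Python via the Oozie CLI.
import subprocess

# Submission properties alongside the workflow.xml already stored in HDFS.
job_properties = """\
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032
oozie.wf.application.path=${nameNode}/user/etl/workflows/daily_import
"""

with open("job.properties", "w") as f:
    f.write(job_properties)

subprocess.run([
    "oozie", "job",
    "-oozie", "http://oozie.example.com:11000/oozie",  # placeholder Oozie server URL
    "-config", "job.properties",
    "-run",
], check=True)
```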

That’s it, you now have a clear idea of the technologies that revolve around the Hadoop ecosystem and of the state of the art of Big Data technology. We hope you have understood that in the era of Big Data, the best way to get value from your data is to use a cluster on which you install a distributed file system, one or more parallel computation models, and a resource manager. You can then choose, from the plethora of technologies in the Hadoop ecosystem, the LEGO pieces that fit your project. Of course, it is not possible to cover a subject as vast as the Hadoop ecosystem exhaustively here. If you want to go further, we recommend the following resources.

Additional Resources

If you want to go further in learning about the Hadoop ecosystem, you should know that the Apache Software Foundation now hosts most of the technologies in the ecosystem, and that many other technologies are incubating there. Visit the Apache project pages for more information about each of these technologies.

ABOUT LONDON DATA CONSULTING (LDC)

We, at London Data Consulting (LDC), provide all sorts of Data Solutions. This includes Data Science (AI/ML/NLP), Data Engineering, Data Architecture, Data Analysis, CRM & Lead Generation, Business Intelligence and Cloud solutions (AWS/GCP/Azure).

For more information about our range of services, please visit: https://london-data-consulting.com/services

Interested in working for London Data Consulting? Please visit our careers page at https://london-data-consulting.com/careers

More info on: https://london-data-consulting.com
