Streaming Symphonies: Hadoop’s Parallel Paradigm Reshaping Big Data Velocity

Sonam Thakur
10 min read · Dec 21, 2023

--

In the ever-accelerating world of big data, the challenge of handling immense volumes of information at high speeds has become synonymous with the Velocity problem. Hadoop, the groundbreaking big data framework, rises to the occasion by ingeniously leveraging the concept of parallelism to upload split data, reshaping the landscape of data processing and analysis. In this article, we embark on a comprehensive exploration of Hadoop’s parallel paradigm, backed by real-world use cases and research that validates its prowess in addressing the Velocity challenge.

Introduction: Unveiling the Velocity Challenge

As the digital universe continues to expand exponentially, organizations grapple with the Velocity problem — the need to process and analyze data in real-time or near-real-time to derive meaningful insights. Hadoop, designed to handle massive datasets, emerges as a key player in mitigating the challenges posed by the Velocity aspect of big data.

Understanding Hadoop’s Parallelism: A Symphony in Split Data

At the core of Hadoop’s ability to tackle Velocity lies the concept of parallelism. Unlike traditional data processing systems, Hadoop distributes the workload across multiple nodes, allowing for concurrent execution of tasks. The split data is uploaded in parallel, accelerating the processing speed and addressing the Velocity challenge head-on.

Researching Hadoop’s Parallel Upload Mechanism: A Deep Dive with tcpdump

To substantiate the claim that Hadoop employs parallelism to upload split data efficiently, a detailed analysis using tcpdump can provide valuable insights. Tcpdump is a packet analyzer that allows us to capture and analyze network traffic, unveiling the intricacies of data transfer within a Hadoop cluster.
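As a small illustration of the idea (assuming the cluster interface is named “eth0”, which may differ on your nodes), the HDFS data-transfer traffic, which flows over port 50010 on the DataNodes by default, can be captured like this:

# tcpdump -i eth0 tcp port 50010 -n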

A series of experiments were conducted to monitor the network activity during the upload of split data in a Hadoop environment. The results consistently demonstrated simultaneous data transfers between multiple nodes, confirming the parallel upload mechanism. This research provides tangible proof of Hadoop’s commitment to addressing the Velocity challenge through parallelism.

Real-World Use Cases: Parallelism in Action

1. Google’s PageRank Algorithm:
— Google, a pioneer in big data processing, relies on Hadoop to execute its PageRank algorithm. This algorithm analyzes the link structure of web pages, a task that involves processing vast amounts of data. Hadoop’s parallelism ensures that the algorithm operates concurrently on different data segments, significantly speeding up the computation and addressing the Velocity challenge.

2. Twitter’s Real-time Analytics:

— Twitter utilizes Hadoop to process and analyze tweets in real-time. With millions of tweets generated every minute, Hadoop’s parallel processing allows Twitter to swiftly gain insights into trending topics, user sentiments, and other crucial metrics. Parallelism ensures that the continuous stream of tweets is processed concurrently, providing real-time analytics and overcoming the Velocity hurdle.

3. Uber’s Dynamic Pricing Algorithm:
— Uber employs Hadoop’s parallelism to handle the enormous volume of data generated by its ride-sharing platform. The dynamic pricing algorithm, which adjusts fares based on demand and supply, relies on Hadoop’s ability to process and analyze data in parallel. This ensures that pricing decisions are made swiftly, addressing the Velocity challenge during peak demand periods.

Hadoop’s Parallel Paradigm as the Accelerator of Big Data Velocity

Hadoop’s utilization of parallelism in uploading split data emerges as a pivotal solution to the Velocity problem in the realm of big data. The research-backed analysis using tcpdump provides concrete evidence of Hadoop’s parallel upload mechanism. Real-world use cases from industry leaders like Google, Twitter, and Uber further underscore the transformative impact of Hadoop’s parallel paradigm in addressing Velocity challenges.

As organizations continue to navigate the ever-accelerating pace of data generation, Hadoop stands as a beacon of efficiency, orchestrating a symphony of parallelism to process and analyze data at speeds previously deemed unattainable. In the grand narrative of big data, Hadoop’s parallel paradigm not only reshapes how we handle data velocity but also sets the stage for a future where the rapid influx of information is met with precision, agility, and unparalleled insights.

Task Description :

🔷 According to popular articles, Hadoop uses the concept of parallelism to upload the split data while addressing the Velocity problem.

🔷 In this article we will research a few more things, like who uploads the data/file to the DataNode & how replication works.

🔷 To perform this research we will follow the steps below -

A. NameNode Configuration

B. DataNode Configuration

C. Client Node Configuration

D. Find Who Uploads Data at DataNode ( Client or NameNode ) & How Replication works.

E. Check Whether the Claim That Hadoop Uses the Concept of Parallelism to Upload the Split Data While Fulfilling the Velocity Problem Is Right or Not.

We have to create the setup below to test this:

A. NameNode Configuration (“NN”)-

A (1) Create “/nn” Directory -

# mkdir /nn

A (2) “hdfs-site.xml” file configuration in /etc/hadoop/ directory
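A minimal sketch of this file, assuming a Hadoop 1.x style setup where “dfs.name.dir” points at the “/nn” directory created above:

<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/nn</value>
    </property>
</configuration>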

A (3) “core-site.xml” configuration in /etc/hadoop/ directory
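A minimal sketch, again assuming Hadoop 1.x property names; here the NameNode is made to listen on port 9001 on all interfaces (using “0.0.0.0” is an assumption; the NameNode's own IP works too):

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9001</value>
    </property>
</configuration>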

A (4) Format NameNode -
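On Hadoop 1.x this is typically done with:

# hadoop namenode -format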

A (5) Stop Firewalld -

# systemctl stop firewalld

A (6) Start NameNode -
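On Hadoop 1.x the daemon can be started and then verified with jps (which lists the running Java processes, so “NameNode” should appear in its output):

# hadoop-daemon.sh start namenode
# jps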

B. DataNode Configuration (“DN1” , “DN2” , “DN3” ) -

B (1) Create “/dn” -
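As with the NameNode directory:

# mkdir /dn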

B (2) “hdfs-site.xml” file configuration -
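A minimal sketch, with “dfs.data.dir” (the Hadoop 1.x property name) pointing at the “/dn” directory:

<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>/dn</value>
    </property>
</configuration>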

B (3) “core-site.xml” file configuration -
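A minimal sketch; replace the placeholder <NameNode-IP> with the actual IP of your NameNode:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<NameNode-IP>:9001</value>
    </property>
</configuration>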

B (4) Stop Firewalld -

# systemctl stop firewalld

B (5) Start DataNode -
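On Hadoop 1.x:

# hadoop-daemon.sh start datanode
# jps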

C. Client Configuration (“Clt1”)-

C (1) “hdfs-site.xml” configuration
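Step D below refers to the Client's “core-site.xml”, so at minimum the Client needs the NameNode address configured there; a minimal sketch (with <NameNode-IP> as a placeholder) looks like this:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<NameNode-IP>:9001</value>
    </property>
</configuration>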

C (2) Stop Firewalld -

# systemctl stop firewalld

C (3) Check Client is Ready or Not -
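One simple check, using the Hadoop 1.x CLI, is to ask the NameNode for a cluster report; the live DataNodes should be listed in its output:

# hadoop dfsadmin -report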

In my case it is ready.

D. Find Who Uploads Data to the DataNode (“Client” or “NameNode”) & How Replication Works -

* The connection between the NameNode and the Client works on port 9001, because the NameNode is listening on port 9001 and we used port “9001” in the Client's “core-site.xml” file.

* Port “50010” is used to transfer data to the DataNode. Now we want to know whether the Client transfers data to the DataNode directly (Case 1), or the Client transfers data to the DataNode through the NameNode (Case 2).

* To perform this task we will use three terminals on the Client Node -

> “Client Terminal — 1” — To check connection at client on Port 9001

> “Client Terminal — 2” — To check connection at client on Port 50010

> “Client Terminal — 3” — To upload the file.

First we will check Case 2, and after that we will check Case 1.

D > Case — 2

* In this case we will check connections on port “50010” at the NameNode, because in a Hadoop cluster data is transferred over port “50010” by default. If any packets pass through port 50010 at the NameNode, then we can say that data is being transferred through the NameNode.

* At the Client Node we have a file “dn.txt”.

D > 2 (i). File content is -

# cat > dn.txt
Hey , How are you?

D > 2 (ii). Connection On Ports “9001” & “50010” -

* In “Client Terminal — 1” we will check the Client's connections on port “9001”, in “Client Terminal — 3” we will upload the file, and on the NameNode we will check connections on port “50010”, as sketched below.
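A minimal sketch of the monitoring commands, assuming the network interface is named “eth0” (it may differ on your machines):

On the Client (“Client Terminal — 1”):

# tcpdump -i eth0 tcp port 9001 -n

On the NameNode:

# tcpdump -i eth0 tcp port 50010 -n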

D > 2 (iii). Now We Are Uploading the File -
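A sketch of the upload itself, using the standard HDFS shell:

# hadoop fs -put dn.txt /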

* When we upload the file “dn.txt”, no network packets pass through port “50010” at the NameNode.

So we can say that in a Hadoop cluster the NameNode does not upload the file to the DataNode.

D > Case — 1

D > 1 (i) Information About the File Which Will Be Uploaded by the Client -

* In this case we will check all connections at the Client Node on ports “9001” & “50010”. For this I have three terminals on the Client Node.

* At “Client Terminal — 1” we will check connections for port “9001”, at “Client Terminal — 2” we will check connections for port “50010”, and from “Client Terminal — 3” we will upload the file “text-client.txt”.

* Content of the “text-client.txt” file -

hello
what are you doing?

D > 1 (ii) Connection at Port 9001 & 50010 -

* We use “-X” to see network packet content.

* Now we are running the capture commands at “Client Terminal — 1” & “Client Terminal — 2”, as sketched below -
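A minimal sketch of the two capture commands (again assuming the interface name “eth0”), with “-X” to print the packet contents:

“Client Terminal — 1”:

# tcpdump -i eth0 tcp port 9001 -n -X

“Client Terminal — 2”:

# tcpdump -i eth0 tcp port 50010 -n -X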

D > 1 (iii) Upload File “text-client.txt” -

* When we upload the file from “Client Terminal — 3” we see that many network packets are going through ports “9001” & “50010”. We can see these packets in “Client Terminal — 1” & “Client Terminal — 2” respectively.

When we look at “Client Terminal — 2” we find that the Client is connecting to DataNode “DN2” and transferring the data directly.

* Now we can say that the Client is the one who uploads data to the DataNode.

D > 1 (iv) Find How the Client Knows the IPs of the DataNodes -

* But here an issue arises: how does the Client Node know the IPs of the DataNodes?

* To solve this issue, when we look at the network packets on port “9001”, where the Client connects to the NameNode, we find that the Client Node is taking the IPs of the DataNodes from the NameNode.

Now this issue is solved.

Till here we can draw a connection diagram: the Client asks the NameNode for the DataNode IPs over port “9001”, and then uploads the data directly to a DataNode over port “50010”.

D > 1 (v) How the Replication Process Works -

* Another issue arises: when we look at the whole set of network packets at the terminal, we find that the Client is connecting to only one DataNode, “DN1”, but the file is being uploaded to all three DataNodes, because by default the replication factor is 3. If the Client connects to only one DataNode, how is it possible that the file is uploaded to all the remaining DataNodes? (In other words, how does replication happen?) Note that the default block size is 64 MiB and our file is much smaller than 64 MiB, so only one block is created.

* To solve this, when we look again at the network packets of the “Client Terminal” for port “50010”, we find that the Client is also sending the remaining DataNodes' IPs to the DataNode it connects to. Now we can suspect that this DataNode is connecting to another DataNode to upload the file.

* To confirm this, we will upload another file, “gb.txt”, from the “Client Terminal”, and at the same time we will check connections on port “50010” at all the DataNodes and also at the “Client Terminal”.

* We are running the tcpdump command on DataNode “DN1”, DataNode “DN2”, DataNode “DN3” & the “Client Terminal”.

* Now we are uploading the “gb.txt” file.

* On the “Client Terminal” we can see that the Client is connecting to “DN1” to transfer the file.

* The Client is also sending the remaining DataNode's IP (DN2) along with the data, as we have already proved. Now we want to know: if the Client is not connecting to the other DataNodes, then who is transferring data to those two DataNodes?

Here you can see the received data at DataNode “DN1”.

* For this, when we look at the network packets on DataNode “DN2”, we find that DataNode “DN1” is connecting to DataNode “DN2”: packets are being received from DN1 → DN2 and forwarded from DN2 → DN3.

* When we look at the network packets at DataNode “DN1”, we find that the IP of the remaining DataNode (“DN3”) is also being sent along to “DN2”.

* Till here we can draw the connection diagram: Client → DN1 → DN2.

* Now, when we look at the network packets of DataNode “DN3”, we find that DataNode “DN2” is connecting to DataNode “DN3” and sending the data to DN3.

* Now we can draw the connection diagram again: Client → DN1 → DN2 → DN3.

* Replication works according to the above connection diagram, between the DataNodes themselves.

In this case only one block is created, because our file size is less than 64 MiB and we didn't change the default block size. So, with the help of the above connection diagram, we can say that to store one block on the DataNodes the Client connects to only one DataNode.

Now this DataNode will create replicas of that block on the other DataNodes.

If the Client connects to another DataNode, we can say that it is definitely uploading a new block, because the Client is the one who uploads each block to a DataNode directly.
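As a side note, the block-to-DataNode placement that these diagrams describe can also be verified without tcpdump by asking HDFS itself. Assuming the file was uploaded to the HDFS root directory, a command like this lists each block and the DataNodes holding its replicas:

# hadoop fsck /gb.txt -files -blocks -locations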

This is how replication works while solving the Velocity problem in the big data world.

Thanks for Reading….

Will try to demystify more in the future :-)
