
Comparing Hadoop and SQL

If you are used to working with conventional SQL databases, hearing someone talk about Hadoop can sound like a confusing mess. In this blog, we are going to simplify three of the many differences between conventional SQL databases and Hadoop, which will hopefully provide context as to where each model is used.


Schema on Write vs. Schema on Read

The first difference we want to point out is schema on write vs. schema on read. When we transfer data from SQL database A to SQL database B, we need some information in hand before we write the data to database B: we have to know the structure of database B, and how to adapt the data from database A to fit that structure.

Furthermore, we have to make sure that the data being transferred matches the data types the SQL database is expecting. Database B will reject the data and return errors if we try to load something that does not meet its expectations. This is called schema on write.
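As a minimal sketch of schema on write, the following Java snippet uses the standard JDBC API against an in-memory H2 database (the H2 driver on the classpath, and the users table with its columns, are purely illustrative assumptions). The schema is declared up front, and an insert that violates the declared type is rejected before anything is written:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SchemaOnWriteDemo {
        public static void main(String[] args) throws SQLException {
            // In-memory H2 database used purely for illustration; any JDBC
            // database enforces its declared schema the same way.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo")) {
                try (Statement st = conn.createStatement()) {
                    // The schema is declared up front: this is the gate data must pass.
                    st.execute("CREATE TABLE users (id INT NOT NULL, name VARCHAR(50))");
                }
                try (PreparedStatement ps =
                             conn.prepareStatement("INSERT INTO users VALUES (?, ?)")) {
                    ps.setInt(1, 1);
                    ps.setString(2, "Alice");
                    ps.executeUpdate(); // accepted: matches the declared schema
                }
                try (Statement st = conn.createStatement()) {
                    // A value that cannot be coerced to INT is rejected at write time.
                    st.execute("INSERT INTO users VALUES ('not-a-number', 'Bob')");
                } catch (SQLException e) {
                    System.out.println("Rejected on write: " + e.getMessage());
                }
            }
        }
    }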

Hadoop, on the other hand, takes a schema-on-read approach. When data is written to HDFS, the Hadoop Distributed File System, it is simply brought in without any gatekeeping rules. The structure is applied later, in the code that reads the data, rather than being pre-configured ahead of time.
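To make the contrast concrete, here is a small sketch of schema on read in plain Java (the file name profiles.txt and its comma-separated layout are assumptions for illustration). The raw lines were stored as-is; the structure exists only in the code that reads them:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class SchemaOnReadDemo {
        public static void main(String[] args) throws IOException {
            // The file was written with no schema checks; it is just raw text.
            List<String> lines = Files.readAllLines(Paths.get("profiles.txt"));
            for (String line : lines) {
                // The "schema" lives here, in the reader: we decide at read time
                // that each line is id,name -- and skip anything that doesn't fit.
                String[] fields = line.split(",", 2);
                if (fields.length != 2) continue; // malformed rows are skipped, not rejected on write
                System.out.println("id=" + fields[0].trim() + " name=" + fields[1].trim());
            }
        }
    }

This concept of schema on read vs. schema on write has fundamental implications for how data is stored in SQL vs. Hadoop, which leads us to our second difference.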

How data is stored

In SQL, data is stored in logical form, with interrelated tables and defined columns. In Hadoop, data is stored as files, typically compressed, containing any type of data or text. The moment data enters Hadoop, the file is replicated across multiple nodes in HDFS. For example, let's assume we're loading Instagram data onto a large HDFS cluster of 500 servers. Your Instagram data may be replicated across 50 of them, along with the profiles of all the other Instagram users.

Hadoop keeps track of where all the copies of your profile are. This might seem like a waste of space, but it's actually the secret behind Hadoop's massive scalability.
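As a minimal sketch, assuming a reachable HDFS cluster and a file already loaded at the hypothetical path /data/profile.txt, Hadoop's FileSystem API lets you inspect and adjust the replication factor that drives this behavior:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/profile.txt"); // hypothetical file already in HDFS
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication factor: " + status.getReplication());

            // Ask HDFS to keep more copies of this file; the NameNode tracks where they live.
            fs.setReplication(file, (short) 5);
        }
    }

That replication strategy leads us to the third difference.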

Big data in Hadoop vs. in SQL

A big data solution like Hadoop is architected to scale across a virtually unlimited number of servers. Going back to our previous example: if you're searching Instagram data and want to see all the sentences that include the word “happy,” Hadoop will distribute the search computation across the 50 servers where your profile data is replicated in HDFS.

But instead of each server conducting the exact same search, Hadoop divides the workload so that each server works on just a portion of your Instagram history. As each node finishes its assigned portion, a reducer program running in the cluster adds up all the tallies and produces a consolidated result.
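Here is a hedged sketch of how that division of labor looks in Hadoop's Java MapReduce API. The job wiring (input paths, job submission) is omitted and the tokenization is deliberately simplistic; the point is that each mapper scans only its slice of the data for “happy,” and the reducer consolidates the tallies:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class HappyCount {
        // Each mapper sees only its assigned split of the input files.
        public static class HappyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final Text KEY = new Text("happy");
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String word : line.toString().toLowerCase().split("\\W+")) {
                    if (word.equals("happy")) {
                        context.write(KEY, ONE); // one tally per occurrence
                    }
                }
            }
        }

        // The reducer adds up the tallies produced by all the mappers.
        public static class HappyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable count : counts) {
                    total += count.get();
                }
                context.write(key, new IntWritable(total));
            }
        }
    }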

So what happens if one of the 50 servers breaks down or runs into an issue while processing the request? Should the response be held up because of the incomplete data set? The answer to this question highlights one of the main differences between SQL and Hadoop.

Hadoop would say no: it would deliver an immediate response, and eventually arrive at a consistent answer.

SQL would say yes: before releasing anything, we must have consistency across all the nodes. This is called a two-phase commit. Neither approach is right or wrong, as both play an important role depending on the type of data being handled.

Hadoop's eventual-consistency methodology is far more realistic for reading constantly updating streams of unstructured data across thousands of servers, while the two-phase commit methodology of SQL databases is well suited for rolling up and managing transactions and ensuring we get the right answer.
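As a minimal, in-memory sketch of the two-phase commit idea (not any particular database's implementation — the Participant interface here is hypothetical), a coordinator first asks every node to prepare, and commits only if all of them vote yes:

    import java.util.List;

    public class TwoPhaseCommitSketch {
        // Hypothetical participant; real databases expose this through their
        // transaction managers rather than a plain Java interface.
        interface Participant {
            boolean prepare(); // vote: can this node durably apply the change?
            void commit();
            void rollback();
        }

        static boolean run(List<Participant> participants) {
            // Phase 1: every node must vote yes before anything becomes visible.
            for (Participant p : participants) {
                if (!p.prepare()) {
                    participants.forEach(Participant::rollback);
                    return false; // one failure holds up the whole transaction
                }
            }
            // Phase 2: only now is the change committed everywhere.
            participants.forEach(Participant::commit);
            return true;
        }
    }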

Hadoop is so flexible that Facebook created a program called Hive, which allows team members who don't know Hadoop's Java code to write queries in standard SQL and mimic SQL behavior on demand. Today, some of the leading analysis and data management tools are beginning to generate MapReduce programs and prepackaged schemas on read natively, so that organizations don't have to hire expensive data experts to get value from this powerful architecture.
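For illustration, here is a hedged sketch of the kind of standard-looking SQL that Hive makes possible, submitted from Java through the Hive JDBC driver (the server address, credentials, table name, and column are all assumptions, and the hive-jdbc driver is assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL; host, port, and database are hypothetical.
            String url = "jdbc:hive2://hive-server.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "analyst", "");
                 Statement st = conn.createStatement()) {
                // Looks like ordinary SQL; under the hood Hive turns this into
                // distributed jobs that scan the files stored in HDFS.
                ResultSet rs = st.executeQuery(
                        "SELECT COUNT(*) FROM posts WHERE caption LIKE '%happy%'");
                while (rs.next()) {
                    System.out.println("Posts mentioning 'happy': " + rs.getLong(1));
                }
            }
        }
    }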

To dive deeper into this subject, please download the StoneFly white paper on Big Data & Hadoop.

 
