
Comparing Hadoop and SQL

If you are used to working with conventional SQL databases, hearing someone talk about Hadoop can sound like a confusing mess. In this blog, we are going to simplify three of the many differences between conventional SQL databases and Hadoop, which will hopefully provide context as to where each model is best used.


Schema on Write vs. Schema on Read

The first difference we want to point out is schema-on-write vs. schema-on-read. When we transfer data from SQL database A to SQL database B, we need to have some information in hand before we write the data to database B. We have to know what the structure of database B is, and how to adapt the data from database A to fit that structure.

Furthermore, we have to make sure that the data being transferred matches the data types the SQL database is expecting. Database B will reject the data and return errors if we try to load something that does not meet its expectations. This is called schema-on-write.
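
To make the gate-keeping idea concrete, here is a minimal Python sketch (a toy illustration, not real database code; the column names and types are invented for the example). Every row is checked against the declared schema before it is written, and anything that doesn't conform is rejected:

    # Toy schema-on-write: the schema is enforced at load time,
    # before anything is stored.
    schema = {"user_id": int, "username": str, "followers": int}

    def write_row(table, row):
        # Reject the row if any column is missing or has the wrong type.
        for column, expected_type in schema.items():
            if column not in row or not isinstance(row[column], expected_type):
                raise TypeError(f"rejected: {column!r} must be {expected_type.__name__}")
        table.append(row)

    table = []
    write_row(table, {"user_id": 1, "username": "alice", "followers": 120})  # accepted
    try:
        write_row(table, {"user_id": "2", "username": "bob", "followers": 75})
    except TypeError as error:
        print(error)  # rejected: 'user_id' must be int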

Hadoop, on the other hand, takes a schema-on-read approach. When data is written to HDFS, the Hadoop Distributed File System, it is simply brought in without any gate-keeping rules. Rules are applied only when we want to read the data, and they live in the code that reads it rather than in a structure pre-configured ahead of time. This concept of schema-on-read vs. schema-on-write has fundamental implications for how data is stored in SQL vs. Hadoop, which leads us to our second difference.
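
Here is the same toy example turned around into schema-on-read (again a sketch, with an invented file format): the writes are completely unchecked, and structure is imposed only by the code that reads the file back.

    # Toy schema-on-read: writes are unchecked; structure is imposed
    # only by the reading code.
    with open("profiles.txt", "w") as f:
        f.write("1,alice,120\n")
        f.write("2,bob,seventy-five\n")  # nothing stops a malformed record

    def read_rows(path):
        # The "schema" lives here, in the reader, applied at read time.
        for line in open(path):
            user_id, username, followers = line.rstrip("\n").split(",")
            try:
                yield {"user_id": int(user_id), "username": username,
                       "followers": int(followers)}
            except ValueError:
                continue  # the reader decides how to handle bad records

    print(list(read_rows("profiles.txt")))  # only the well-formed row survives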

How data is stored

In SQL, the data is stored in a logical form, with inter-related tables and defined columns. In Hadoop, the data is a compressed file containing any type of data or text. The moment data enters Hadoop, the file is replicated across multiple nodes in HDFS. For example, let's assume we're loading Instagram data into a large Hadoop Distributed File System cluster of 500 servers. Your Instagram data may be replicated across 50 of them, along with all the other profiles of Instagram users.

Hadoop keeps track of where all the copies of your profile are. This might seem like a waste of space, but it is actually the secret behind Hadoop's massive-scalability magic, which leads us to the third difference.
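
As a rough mental model, the following Python sketch shows what that bookkeeping looks like (the random placement policy and block size are invented for illustration; HDFS uses a smarter, rack-aware placement algorithm, though its default replication factor really is 3):

    import random

    # Toy model of HDFS-style replication: split a file into blocks,
    # copy each block to several nodes, and track where the copies live.
    NODES = [f"node-{i}" for i in range(500)]  # a 500-server cluster
    REPLICATION_FACTOR = 3                     # HDFS defaults to 3 copies

    def store_file(data, block_size=8):
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        # The "namenode" metadata: block number -> nodes holding a copy.
        block_map = {n: random.sample(NODES, REPLICATION_FACTOR)
                     for n in range(len(blocks))}
        return blocks, block_map

    blocks, block_map = store_file("a long stream of instagram profile data")
    for n, holders in block_map.items():
        print(f"block {n} replicated on {holders}")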

Big data in Hadoop vs. in SQL

A big data solution like Hadoop is architected for an unlimited number of servers. Going back to our previous example: if you're searching Instagram data and want to see all the sentences that include the word “happy”, Hadoop will distribute the calculation of the search request across all 50 servers where your profile data is replicated in HDFS.

But instead of each server conducting the exact same search, Hadoop divides the workload so that each server works on just a portion of your Instagram history. As each node finishes its assigned portion, a reducer program running on the cluster adds up all the tallies and produces a consolidated list.
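
This map-and-reduce split can be sketched in a few lines of Python (a local simulation of the pattern, with made-up sample data; in a real cluster, each mapper would run on a node that already holds its shard of the data):

    # Each mapper scans only its own portion of the data and emits a
    # tally for every sentence containing the word "happy".
    def mapper(lines):
        for line in lines:
            if "happy" in line.lower():
                yield ("happy", 1)

    # The reducer collects the tallies from all mappers and sums them.
    def reducer(pairs):
        return sum(count for _key, count in pairs)

    # Simulate three nodes, each holding a different shard of the history.
    shards = [
        ["I am so happy today", "just had lunch"],
        ["Happy birthday!", "stuck in traffic"],
        ["what a gloomy day"],
    ]
    mapped = [pair for shard in shards for pair in mapper(shard)]
    print("sentences containing 'happy':", reducer(mapped))  # -> 2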

So what happens if one of the 50 servers breaks down or runs into some sort of issue while processing the request? Should the response be held up because of the incomplete data set? The answer to this question is one of the main differences between SQL and Hadoop.

Hadoop would say no: it delivers an immediate response and eventually arrives at a consistent answer.

SQL would say yes: before releasing anything, we must have consistency across all the nodes. This is called the two-phase commit. Neither approach is right or wrong, as both play an important role depending on the type of data being used.

Hadoop's consistency methodology is far more realistic for reading constantly updating streams of unstructured data across thousands of servers, while the two-phase-commit methodology of SQL databases is well suited for rolling up and managing transactions and ensuring we get the right answer.
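
The two-phase commit itself can be sketched in a few lines of Python (a simplified model of the protocol, not production code): a coordinator first asks every node to prepare, and only if every single one votes yes does it tell them all to commit; a single failure rolls the whole transaction back.

    # Toy two-phase commit: all nodes must vote yes in phase one
    # before any node is told to commit in phase two.
    class Node:
        def __init__(self, name, healthy=True):
            self.name, self.healthy = name, healthy

        def prepare(self, txn):
            return self.healthy  # phase 1: vote on whether we can commit

        def commit(self, txn):
            print(f"{self.name}: committed {txn}")

        def rollback(self, txn):
            print(f"{self.name}: rolled back {txn}")

    def two_phase_commit(nodes, txn):
        if all(node.prepare(txn) for node in nodes):  # phase 1: prepare
            for node in nodes:
                node.commit(txn)                      # phase 2: commit
        else:
            for node in nodes:
                node.rollback(txn)                    # any "no" aborts everyone

    nodes = [Node("A"), Node("B"), Node("C", healthy=False)]
    two_phase_commit(nodes, "txn-42")  # node C cannot prepare; all roll back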

Hadoop is so flexible that Facebook created a program called “Hive” that allowed team members who didn't know Hadoop Java code to write queries in standard SQL and mimic SQL behavior on demand. Today, some of the leading tools for analysis and data management are beginning to natively write MapReduce programs and prepackage schemas-on-read, so that organizations don't have to hire expensive data experts to get value from this powerful architecture.
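
To give a taste of what SQL-on-Hadoop looks like in practice, here is a sketch using PySpark, one of the modern tools in this family (this assumes a pyspark installation, and the file path and column names are hypothetical). The structure is inferred when the file is read, and the query itself is plain SQL:

    from pyspark.sql import SparkSession

    # Schema-on-read with plain SQL, in the spirit of Hive: structure is
    # inferred when the file is read, not enforced when it was written.
    spark = SparkSession.builder.appName("happy-search").getOrCreate()

    profiles = spark.read.json("hdfs:///data/profiles.json")  # hypothetical path
    profiles.createOrReplaceTempView("profiles")

    result = spark.sql(
        "SELECT username, post FROM profiles WHERE post LIKE '%happy%'"
    )
    result.show()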

To dive deeper into this subject, please download the following StoneFly white paper on Big Data & Hadoop.
