The focus of this post is the Hadoop Distributed File System (HDFS). The wider Hadoop ecosystem includes other components, such as MapReduce, which will be the subject of a separate post. Many businesses now also use Spark, but Spark is a tool for processing and analysing vast data sets rather than for storing them. The parts of the ecosystem overlap, yet they are largely independent of one another. For example, Spark is in principle independent of HDFS, but you still need a big data storage solution in place for Spark to run against, and HDFS is a common choice. This image gives a nice visualisation of this explanation.
We are living in a remarkable time: terabytes of data are being generated every day, from the measurements collected by the latest and greatest scientific equipment (e.g. the Large Hadron Collider at CERN) to the hundreds of millions of tweets shared every day across the world. All this data needs to be stored and analysed.
This is typically done using a Relational Database Management System (RDBMS), where you split your data into multiple tables that can be joined and queried with Structured Query Language (SQL) and analysed using Python, R, etc. Classically this data is stored on a single server, which is fine if you are only accessing a subset of the data. However, problems arise when you want to access a significant amount of your data at once. Banks, for example, receive real-time data on their clients’ spending habits and want to monitor it to detect and stop fraudulent activity. With everything on a single server, the query time is going to be huge: if you have 1TB of data on one hard drive that reads at 100MB/s, then reading all of your data is going to take nearly three hours!
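The three-hour figure is simple arithmetic, and a few lines of Python make it easy to check or to rerun with your own numbers. This is only a back-of-the-envelope sketch: it assumes decimal units (1TB = 1,000,000MB) and a drive that sustains its full read speed for the whole scan, and the function name is my own.

```python
# Back-of-the-envelope read time for scanning one hard drive end to end.
# Assumptions: decimal units (1 TB = 1,000,000 MB) and a constant read speed.

def full_scan_seconds(data_mb: float, throughput_mb_per_s: float) -> float:
    """Seconds needed to read data_mb megabytes at a constant throughput."""
    return data_mb / throughput_mb_per_s


DATA_MB = 1_000_000      # 1 TB expressed in MB
DRIVE_SPEED_MB_S = 100   # single-drive read speed from the example

seconds = full_scan_seconds(DATA_MB, DRIVE_SPEED_MB_S)
print(f"{seconds:.0f} s = {seconds / 3600:.2f} hours")  # 10000 s = 2.78 hours
```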
Paradigm shift: Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) was created to solve exactly this problem. It manages the splitting of data into blocks that are spread across multiple hard drives on multiple machines, so the drives can be read in parallel and you can query all of your data far more quickly. Let’s go back to the example above. Rather than having your 1TB of data on one hard drive, using HDFS you can split your data across multiple machines. Say you split the 1TB equally across 100 hard drives, each reading at 100MB/s. You can now read all of your data in under two minutes (about 100 seconds, ignoring network and coordination overhead).
What about the pitfalls of hardware failure? The more hard drives you use, the greater the chance that at least one of them fails. HDFS handles this because it can be set up so that each block of data is stored in multiple places (three copies is the default), so if one hard drive fails the data is neither lost nor inaccessible.
Other benefits of Hadoop are that it is scalable, open source and, most importantly, can be run on off-the-shelf equipment as well as on cloud computing platforms. From a data science perspective, Hadoop can be used with SQL, R, Python, etc. to conduct analysis, which makes it a good fit. The final and possibly the most important question is: when does your data become Big? In short, there is no formal definition. The rule of thumb is about 400GB, but it is a blurred line.
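The speed-up and the protection against a failed drive can both be shown with a toy model. The sketch below is only an illustration under stated assumptions: the function names are my own, and the round-robin placement of copies is a deliberate simplification, not the placement policy HDFS actually uses.

```python
# Toy model of parallel reads and block replication.
# Assumption: copies are placed round-robin across drives. This is a
# simplification for illustration, not HDFS's real placement policy.

def parallel_scan_seconds(data_mb, drives, throughput_mb_per_s):
    """Seconds to read data_mb split evenly over drives read in parallel."""
    return (data_mb / drives) / throughput_mb_per_s


def place_replicas(num_blocks, num_drives, replication):
    """Map each block id to the list of drives holding a copy of it."""
    if replication > num_drives:
        raise ValueError("replication cannot exceed the number of drives")
    return {
        block: [(block + r) % num_drives for r in range(replication)]
        for block in range(num_blocks)
    }


def lost_blocks(placement, failed_drives):
    """Blocks whose every copy sits on a failed drive."""
    return [
        block
        for block, drives in placement.items()
        if all(d in failed_drives for d in drives)
    ]


# 1 TB over 100 drives at 100 MB/s each.
print(parallel_scan_seconds(1_000_000, 100, 100))  # 100.0

single_copy = place_replicas(num_blocks=8, num_drives=4, replication=1)
three_copies = place_replicas(num_blocks=8, num_drives=4, replication=3)

# Drive 0 fails: with one copy per block some data is gone, with three it is not.
print(lost_blocks(single_copy, failed_drives={0}))   # [0, 4]
print(lost_blocks(three_copies, failed_drives={0}))  # []
```

The price of this safety is storage: keeping three copies of every block means 1TB of data occupies roughly 3TB of raw disk space.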
In conclusion, Hadoop is a way of storing big data in a reliable and scalable manner. Previously the data would have been stored in a standard Relational Database Management System on a single server, but once your data set gets very large (rule of thumb, >400GB), storing it on one server results in significant problems. Hadoop also deals with the increased risk of hardware failure by storing each block of data in multiple places, so that the data does not become inaccessible if one server goes down and is not lost if a hard drive fails. Its compatibility with off-the-shelf hardware and cloud-based infrastructure alike makes it an economical option for businesses of all sizes, and its ability to interface with all the major analysis software (e.g. Python) makes it even more useful and appealing for data science.
I hope that this post has been helpful and informative. Please get in touch if you have any questions, and I look forward to welcoming you to my next blog post.