Hadoop Tutorial – Architecture



Hello and welcome to Hadoop tutorials at Learning Journal. In the previous session we tried to understand HDFS at a high level. In this session I will cover HDFS architecture and core concepts. This video will take you one level deeper into HDFS, so let's start.

We already know that HDFS is a distributed file system, so the first thing that comes to my mind is a bunch of computers. We need to network them together and form a cluster. So this is how a simple but typical Hadoop cluster is networked. One column of computers is called a rack. The term rack is important because there are a few concepts associated with the rack, so let me explain it. A rack is nothing but a kind of box; multiple computers are fixed into a rack. Typically each rack is given its own power supply and a dedicated network switch, so if the switch fails or there is a problem with the power supply of the rack, all the computers within the rack can go off the network. The point that I am trying to make is that there is a possibility of an entire rack failing. Just keep this in mind and come back to the Hadoop cluster. So this is how a typical Hadoop cluster is networked: we have multiple racks, each with its own switch, and finally we connect all these switches to a core switch. Everything is on the network, and we call it the Hadoop cluster.

HDFS is designed using a master/slave architecture. In this architecture there is one master and all others are slaves. So let's assume that this one is the master and all others are slaves. The Hadoop master is called the name node and the slaves are called data nodes. One friend asked me an interesting question: why do we call it a name node, why not simply a master node or a super node or a king node? We call it the name node because it stores and manages names – the names of directories and the names of files. The data node stores and manages the data of the file, so we call it a data node.

Let me explain further. Since HDFS is a file system, we can create directories and files using HDFS. There are many ways to do it, and we will look at some examples and demos later, but for now just assume that you are creating a large file in HDFS. The question is, how does HDFS store the file on this cluster? When we create a file in HDFS, what happens under the hood? Let's try to understand this. There are three actors here: the Hadoop client, the Hadoop name node, and the Hadoop data node. The Hadoop client will send a request to the name node saying that it wants to create a file. The client will also supply the target directory name and the file name. On receiving the request, the name node will perform various checks, like whether the directory already exists, whether the file doesn't already exist, and whether the client has the right permissions to create the file. The name node can perform these checks because it maintains an image of the entire HDFS namespace in memory. We call it the in-memory FS image, or file system image. If all the checks pass, the name node will create an entry for the new file and return success to the client. The file creation is over, but the file is empty; you haven't started writing data to it yet.

Now it's time to start writing data. The client will create an FSDataOutputStream and start writing data to this stream. The FSDataOutputStream is the Hadoop streamer class, and internally it does a lot of work. It buffers the data locally until you accumulate a reasonable amount of data, let's say 128 MB. We call it a block, an HDFS data block.
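On the client side, this whole sequence – asking the name node to create the file and then pushing data through the streamer – sits behind a couple of calls in the Hadoop FileSystem Java API. Here is a minimal sketch of what such a client could look like; the name node URI and the file path are made-up placeholders for illustration, not values from the video.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Assumed name node address; replace with your cluster's fs.defaultFS.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        // create() goes to the name node, which checks the directory, the file
        // name, and permissions, and adds an entry to its in-memory FS image.
        Path file = new Path("/user/demo/sample.txt");
        FSDataOutputStream out = fs.create(file);

        // Writes are buffered on the client; as blocks fill up, the streamer
        // asks the name node for block allocations and ships them to data nodes.
        out.writeBytes("hello hdfs\n");

        // close() flushes the remaining data and completes the file.
        out.close();
        fs.close();
    }
}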
Once there is one block of data, the streamer reaches out to the name node asking for a block allocation. It is just like asking the name node, "Where do I store this block?" The name node doesn't store data, but it knows the amount of free disk space at each data node. With that information the name node can easily assign a data node to store that block. So the name node will perform this allocation and send back the data node name to the streamer. Now the streamer knows where to send the data block, and that's it: the streamer starts sending the block to the data node. If the file is larger than one block, the streamer will again reach out to the name node for a new block allocation. This time the name node may assign some other data node, so your next block may go to a different data node. Once you finish writing to the file, the name node will commit all the changes.

I hope you followed this process. Let me summarize some takeaways from this entire discussion.

HDFS has a master/slave architecture. An HDFS cluster consists of a single name node and several data nodes. The name node manages the file system namespace and regulates access to files by clients. When I say regulate, I mean checking access permissions and user quotas, etc. The data node stores file data in the form of blocks. Each data node periodically sends a heartbeat to the name node to inform it that it is alive. This heartbeat also includes resource capacity information that helps the name node in various decisions. The data node also sends a block report to the name node. The block report is the health information of all the blocks maintained by the data node.

HDFS will split the file into one or more blocks and store these blocks on different data nodes. The name node maintains the mapping of the blocks to the file, their order, and all other metadata. A typical block size used by HDFS is 128 MB, and we can specify the block size on a per-file basis. You should notice that the block size in HDFS is quite large compared to a local file system, but it was a crucial design decision to avoid disk seeks. Some cluster setups configure the block size to be even greater, such as 256 MB. However, choosing too big a value for the block size may have an adverse impact. We will again visit block size in a later video.

The name node determines the mapping of blocks to data nodes, but after the mapping, the client directly interacts with the data nodes for reading and writing. When a client is writing data to an HDFS file, the data first goes to a local buffer. This approach is adopted to provide a streaming read/write capability to HDFS.

The name node and the data node are pieces of software, so at the minimum configuration you can run both on the same machine and create a single-node Hadoop cluster. But a typical deployment has a dedicated computer that runs only the name node software, and each of the other machines in the cluster runs one instance of the data node software.

OK, so far we talked about the core architecture elements of HDFS. We also talked about the heartbeat, the block report, block sizes, client-side buffering, and the FS image. In the next video I will cover the fault tolerance and high-availability features of Hadoop. Thank you for watching Learning Journal. Please like, subscribe, and share to support us. Keep learning and keep growing.
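To make the per-file block size point concrete, here is one more small sketch, again assuming the Hadoop FileSystem Java API, that requests a 256 MB block size and a replication factor of 3 through one of the FileSystem.create() overloads. The path and the chosen values are only illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long blockSize = 256L * 1024 * 1024; // 256 MB instead of the 128 MB default
        short replication = 3;               // common default replication factor
        int bufferSize = 4096;               // client-side I/O buffer size

        // This overload overrides the cluster-wide dfs.blocksize and
        // dfs.replication settings for this one file only.
        Path file = new Path("/user/demo/large-file.dat");
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);

        out.writeBytes("data that will be split into 256 MB blocks\n");
        out.close();
        fs.close();
    }
}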


21 thoughts on "Hadoop Tutorial – Architecture"

  1. Suppose we are maintaining 3 copies of data (copy 1 is in Rack A, copies 2 and 3 are in Rack B), and suppose Rack B fails due to some network problem. Hadoop can access the data from Rack A, which is fine. But my doubt is: before we fix Rack B, if Rack A also fails, how do we get the data? Do we have any mechanism for maintaining a replication factor of 3 if some copies fail? That is, does HDFS create those 2 copies from the Rack A copy to maintain the replication factor of 3 before we fix the problem with Rack B?
