Hadoop Components


Node: A node is simply a computer, typically non-enterprise, commodity hardware, used to store and process data.

Rack: A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch.
         Network bandwidth between any two nodes in the same rack is greater than the bandwidth between two nodes on different racks.

Hadoop has two major components:

- The distributed filesystem component, the main example of which is the Hadoop Distributed File System (HDFS), though other file systems are also supported.

- The MapReduce component, which is a framework for performing calculations on the data in the distributed file system.

HDFS:

- HDFS runs on top of the existing file systems on each node in a Hadoop cluster.
- Hadoop works best with very large files. The larger the file, the less time Hadoop spends seeking the next data location on disk and the more time it runs at the bandwidth limit of your disks.
- Seeks are generally expensive operations that are useful only when you need to analyze a small subset of your dataset. Since Hadoop is designed to run over your entire dataset, it is best to minimize seeks by using large files.
- Hadoop is designed for streaming or sequential data access rather than random access. Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block and then reads sequentially from there.
- Hadoop uses blocks to store a file or parts of a file. A sketch of this sequential access pattern follows below.
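To make the streaming-access point concrete, here is a minimal sketch of reading a file sequentially through the HDFS Java API. It assumes the cluster's configuration files (core-site.xml, hdfs-site.xml) are on the classpath; the path /data/large-input.txt and the class name HdfsSequentialRead are hypothetical placeholders, not anything defined above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsSequentialRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // handle to the configured filesystem (HDFS here)

        Path file = new Path("/data/large-input.txt"); // hypothetical input path

        // Stream the file from start to end. Because access is sequential,
        // HDFS seeks only once per block and then reads at disk bandwidth.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println("Lines read: " + lines);
        }
    }
}
```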

MapReduce:

 A MapReduce program consists of two types of transformations that can be applied to data any number of times: a map transformation and a reduce transformation.
 A MapReduce job is a running instance of a MapReduce program, divided into map tasks that run in parallel with each other and reduce tasks that run in parallel with each other.
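As an illustration of these two transformations, below is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The map tasks emit (word, 1) pairs and the reduce tasks sum the counts per word; the class names (WordCount, TokenMapper, SumReducer) and the use of command-line arguments for input/output paths are illustrative choices.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map transformation: for each input line, emit (word, 1) per word.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce transformation: sum all counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

When submitted to a cluster, the framework splits the input among many parallel map tasks and groups their output by key before the parallel reduce tasks run, which is exactly the job/task structure described above.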

Hadoop History