Hadoop Environment Setup

Downloading Hadoop

Hadoop can be downloaded from one of the following websites (a sample download command follows the list):

1)      www.apache.org

2)      www.ant.apache.org

3)      www.hadoop.apache.org
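
As a rough illustration, a release tarball can also be fetched directly from an Apache mirror with wget. The version number and install path below are only examples, so check the downloads page for the current release.

# Download and unpack a Hadoop release (version and path shown are illustrative)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop

# Make the install location available to later commands
export HADOOP_HOME=/usr/local/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"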

Installation—

Invest some time up front in setting up and installing a properly working Hadoop environment.

How to install Hadoop

Installing and using Hadoop is a fairly simple task. Hadoop is set up in a normal Linux environment, and one of the first steps is to create a dedicated user. It is recommended to create a new user rather than reusing an existing one, in order to keep the Hadoop File System separate from the Unix file system. The most important prerequisite is Java: since Hadoop is Java based, the computer must have Java pre-installed, and you should also know which version of Java is installed.
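
As a minimal sketch of these two steps (the user name hadoop is just an example), the dedicated user can be created and the Java installation checked from the shell:

# Create a dedicated user for Hadoop (the name "hadoop" is an example)
sudo adduser hadoop
su - hadoop

# Confirm that Java is installed and note the version
java -version
javac -version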

Once Hadoop has been successfully installed on the computer, it can be operated in one of three modes:

•    Local/Standalone Mode– After download and installation, Hadoop operates in local (standalone) mode by default. In this mode everything runs as a single Java process.

•    Pseudo-Distributed Mode– This mode lets the user simulate a distributed deployment on a single machine. Each Hadoop daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process (see the configuration sketch after this list).

•    Fully Distributed Mode– This mode has to be configured explicitly. It is fully distributed and requires a cluster of two or more machines.
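
As an illustrative sketch of the pseudo-distributed setup mentioned above, the HDFS address is usually set in core-site.xml under the Hadoop configuration directory. The use of $HADOOP_HOME and port 9000 below are assumptions about a typical install, not requirements.

# Point HDFS at the local machine for pseudo-distributed mode
# ($HADOOP_HOME and port 9000 are assumptions; adjust to your install)
cat > "$HADOOP_HOME/etc/hadoop/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# Format the NameNode once, then start the HDFS daemons
"$HADOOP_HOME/bin/hdfs" namenode -format
"$HADOOP_HOME/sbin/start-dfs.sh"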

Goals of Hadoop Distributed File System or HDFS

The Hadoop Distributed File System (HDFS) was created to make life easier for people dealing with big data. Apart from that overall goal, some of its other goals are:

•    Fault Detection and Recovery– HDFS deals with huge datasets spread over many components, so it is natural that some components will occasionally fail. A goal of HDFS is to detect such faults and recover from them as quickly as possible (the sketch after this list shows how cluster health can be inspected).

•    Huge Datasets– HDFS is designed for applications with very large datasets, and it aims to be the natural choice whenever such datasets need to be stored and processed.

•    Hardware at Data– A requested task can be done more efficiently when the computation takes place near the data it needs; especially with huge datasets, this reduces network traffic and increases throughput.
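
To make the fault-detection goal above concrete, the sketch below uses two standard HDFS commands to inspect cluster health and block status once HDFS is running (assuming the Hadoop bin directory is on the PATH); the exact output depends on the cluster.

# Summarize live and dead DataNodes, capacity, and replication status
hdfs dfsadmin -report

# Check the file system tree for missing or under-replicated blocks
hdfs fsck /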

Hadoop Source Code—

First, we need the Hadoop source code, which we can obtain by following the instructions on the official Hadoop wiki at https://wiki.apache.org/hadoop/GitAndHadoop. Git and Hadoop are a good combination, as Git makes it easy to manage patches against Hadoop.

Analysing BUILDING.txt –

The next step after obtaining the source code is to read the BUILDING.txt file, which contains instructions for building Hadoop on different platforms along with platform-specific adjustments. The file can be read at the following location:

https://git-wip-us.apache.org/repos/asf?p=hadoop.git;a=blob;f=BUILDING.txt
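
As a minimal sketch, the source tree can be cloned with Git and BUILDING.txt read locally; the GitHub mirror URL below is one common option.

# Clone the Hadoop source tree (read-only mirror) and read the build notes
git clone https://github.com/apache/hadoop.git
cd hadoop
less BUILDING.txt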

 Editor—

We can use our favorite text editor or an IDE (Integrated Development Environment); building and testing Hadoop are done on the command line. The IDE should be set up to follow the conventions of the source layout. To avoid noise when patches are reviewed, disable any automatic "reformat" and "strip trailing spaces" features.

Tools needed to build and set up—

1)      First and foremost is the JDK (Java Development Kit). Preferred sources are https://java.com/en/ and http://openjdk.java.net/

2)      Google Protocol Buffers. Refer to https://wiki.apache.org/hadoop/ProtocolBuffers for guidance.

3)      Apache Maven, version 3 or later, as the build tool.

4)      Java API 

We need to make sure all of these are installed, which can be verified by executing mvn, git, and javac. A good internet connection is also required, because Hadoop builds download dependencies from external Maven repositories. If Maven's HTTP requests must go through a proxy, Maven's proxy settings need to be configured; refer to http://maven.apache.org/guides/mini/guide-proxies.html for details.
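
A quick way to confirm the tools are on the PATH, followed by a sketch of the Maven proxy configuration described in the guide above; the proxy host and port are placeholders, and the file is written fresh here only for illustration.

# Verify that the required tools are installed
javac -version
mvn -version
git --version
protoc --version

# Sketch of ~/.m2/settings.xml for building behind a proxy
# (proxy.example.com:8080 is a placeholder; merge into an existing file if you have one)
cat > ~/.m2/settings.xml <<'EOF'
<settings>
  <proxies>
    <proxy>
      <id>build-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
EOF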

For a successful build behind a proxy, the Ant proxy settings also need to be passed down explicitly, because Maven does not pass its proxy settings down to the Ant tasks it runs.
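
One commonly used workaround, assumed here rather than taken from the Hadoop documentation, is to expose the proxy host and port as JVM system properties so that the Ant tasks running inside the Maven JVM see them as well; the host and port are placeholders.

# Expose proxy settings as JVM system properties for the build
# (placeholder host/port; adjust for your environment)
export MAVEN_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"

# Then run the build as usual
mvn clean install -DskipTests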