Hadoop Distributed File System Services
HDFS Services (for writing data; 1st generation Hadoop only):
- NameNode: Master
- DataNode: Slave
NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
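You can see this mapping for yourself with the fsck utility, which asks the NameNode to report a file's blocks and the DataNodes that hold them (the path below is only a placeholder):
Example:
hadoop fsck /user/hadoop/file1 -files -blocks -locations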
The NameNode is a Single Point of Failure for the HDFS
Cluster. HDFS is not currently a High Availability system. When the NameNode
goes down, the file system goes offline. There is an optional SecondaryNameNode that
can be hosted on a separate machine. It only creates checkpoints of the
namespace by merging the edits file into the fsimage file and does not provide
any real redundancy. Hadoop 0.21+ has a BackupNameNode that
is part of a plan to have an HA name service, but it needs active contributions
from the people who want it (i.e. you) to make it Highly Available.
It is essential to look after the NameNode. Here are some recommendations from production use:
- Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
- Use ECC RAM.
- On Java 6u15 or later, run the server VM with compressed pointers (-XX:+UseCompressedOops) to cut the JVM heap size down.
- List more than one NameNode directory in the configuration, so that multiple copies of the file system metadata will be stored. As long as the directories are on separate disks, a single disk failure will not corrupt the metadata. (See the configuration sketch after this list.)
- Configure the NameNode to store one set of transaction logs on a separate disk from the image.
- Configure the NameNode to store another set of transaction logs to a network-mounted disk.
- Monitor the disk space available to the NameNode. If free space is getting low, add more storage.
- Do not host DataNode, JobTracker, or TaskTracker services on the same system.
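A minimal configuration sketch along these lines for Hadoop 1.x, assuming two local disks mounted at /disk1 and /disk2 plus an NFS mount at /remote/nn-backup (all paths are placeholders; in Hadoop 2.x the properties are named dfs.namenode.name.dir and dfs.namenode.edits.dir):

<!-- hdfs-site.xml, inside the <configuration> element -->
<property>
  <name>dfs.name.dir</name>
  <!-- Two copies of the namespace image: one local disk, one network mount -->
  <value>/disk1/hdfs/name,/remote/nn-backup/name</value>
</property>
<property>
  <name>dfs.name.edits.dir</name>
  <!-- Transaction (edits) logs on a separate local disk, plus a copy on the network mount -->
  <value>/disk2/hdfs/edits,/remote/nn-backup/edits</value>
</property>

# hadoop-env.sh: run the NameNode JVM with compressed pointers
export HADOOP_NAMENODE_OPTS="-XX:+UseCompressedOops $HADOOP_NAMENODE_OPTS"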
DataNode:
A DataNode stores data in the Hadoop file system (HDFS). A functional filesystem has more than one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.
DataNode
instances can talk to each other, which is what they do when they are
replicating data.
- There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than across multiple disks on the same server. (See the storage configuration sketch after this list.)
- An ideal configuration is for a server to run a DataNode and a TaskTracker, with separate physical disks and one TaskTracker slot per CPU. This allows every TaskTracker slot 100% of a CPU, and separate disks to read and write data.
- Avoid using NFS for data storage in a production system.
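A matching storage sketch for a DataNode in Hadoop 1.x, assuming four plain (non-RAID) data disks mounted at /disk1 through /disk4 (placeholders); the property is renamed dfs.datanode.data.dir in Hadoop 2.x:

<!-- hdfs-site.xml on a slave node, inside the <configuration> element -->
<property>
  <name>dfs.data.dir</name>
  <!-- One directory per physical disk; HDFS spreads blocks across them, so no RAID or NFS is needed -->
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data,/disk4/hdfs/data</value>
</property>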
MapReduce Service (replaced in Hadoop 2 by the ResourceManager and NodeManager):
- JobTracker: Master (ResourceManager)
- TaskTracker: Slave (NodeManager)
* As with the NameNode and its n DataNodes, there is one JobTracker and n TaskTrackers.
MapReduce processing in Hadoop 1 is handled by the
JobTracker and TaskTracker daemons. The JobTracker maintains a view of all
available processing resources in the Hadoop cluster and, as application
requests come in, it schedules and deploys them to the TaskTracker nodes for
execution.
As applications are running, the JobTracker receives
status updates from the TaskTracker nodes to track their progress and, if
necessary, coordinate the handling of any failures. The JobTracker needs to run
on a master node in the Hadoop cluster as it coordinates the execution of all
MapReduce applications in the cluster, so it’s a mission-critical service.
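For example, submitting the word-count job from the bundled examples JAR hands it to the JobTracker, which then farms the map and reduce tasks out to the TaskTracker nodes (the JAR name and paths are placeholders for your installation):
Example:
hadoop jar hadoop-examples.jar wordcount /user/hadoop/input /user/hadoop/output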
An instance of the TaskTracker daemon runs on every
slave node in the Hadoop cluster, which means that each slave node has a
service that ties it to the processing (TaskTracker) and the storage
(DataNode), which enables Hadoop to be a distributed system.
As a slave process, the TaskTracker receives
processing requests from the JobTracker. Its primary responsibility is to track
the execution of MapReduce workloads happening locally on its slave node and to
send status updates to the JobTracker.
TaskTrackers manage the processing resources on each
slave node in the form of processing slots — the slots defined for map tasks
and reduce tasks, to be exact. The total number of map and reduce slots
indicates how many map and reduce tasks can be executed at one time on the
slave node.
Keep in mind that every map and reduce task spawns
its own Java virtual machine (JVM) and that the heap represents the amount of
memory that’s allocated for each JVM. The ratio of map slots to reduce slots is
also an important consideration.
For example, if you have too many map slots and not
enough reduce slots for your workloads, map slots will tend to sit idle, while
your jobs are waiting for reduce slots to become available.
Distinct sets of slots are defined for map tasks and
reduce tasks because they use computing resources quite differently. Map tasks
are assigned based on data locality, and they depend heavily on disk I/O and
CPU. Reduce tasks are assigned based on availability, not on locality, and they
depend heavily on network bandwidth because they need to receive output from
map tasks.
Note: When it comes to tuning a Hadoop cluster, setting the
optimal number of map and reduce slots is critical. The number of slots needs
to be carefully configured based on available memory, disk, and CPU resources
on each slave node. Memory is the most critical of these three resources from a
performance perspective. As such, the total number of task slots needs to be
balanced with the maximum amount of memory allocated to the Java heap size.
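A sketch of the relevant Hadoop 1.x properties in mapred-site.xml, assuming a slave node with eight cores and enough memory for ten task JVMs of 1 GB heap each (the numbers are purely illustrative):

<!-- mapred-site.xml, inside the <configuration> element -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>  <!-- map slots on this TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>  <!-- reduce slots on this TaskTracker -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>  <!-- heap for each task JVM; slots x heap must fit in physical RAM -->
</property>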
HADOOP DISTRIBUTED FILE SYSTEM SHELL COMMANDS
The
Hadoop shell is a family of commands that you can run from your operating
system’s command line. The shell has two sets of commands: one for file
manipulation (similar in purpose and syntax to Linux commands that many of us
know and love) and one for Hadoop administration. The following list summarizes
the first set of commands for you, indicating what the command does as well as
usage and examples, where applicable.
cat:
Copies source paths to stdout.
Usage:
hdfs dfs -cat URI [URI …]
Example:
hdfs dfs -cat hdfs://<path>/file1
hdfs dfs -cat file:///file2 /user/hadoop/file3
chgrp:
Changes the group association of files. With -R, makes the change recursively
by way of the directory structure. The user must be the file owner or the
superuser.
Usage:
hdfs dfs -chgrp [-R] GROUP URI [URI …]
chmod:
Changes the permissions of files. With -R, makes the change recursively by way
of the directory structure. The user must be the file owner or the superuser.
Usage:
hdfs dfs -chmod [-R] <MODE[,MODE]… | OCTALMODE> URI [URI …]
Example:
hdfs dfs -chmod 777 test/data1.txt
chown:
Changes the owner of files. With -R, makes the change recursively by way of the
directory structure. The user must be the superuser.
Usage:
hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]
Example:
hdfs dfs -chown -R hduser2 /opt/hadoop/logs
copyFromLocal:
Works similarly to the put command, except that the source is restricted to a
local file reference.
Usage:
hdfs dfs -copyFromLocal <localsrc> URI
Example:
hdfs dfs -copyFromLocal input/docs/data2.txt
hdfs://localhost/user/rosemary/data2.txt
copyToLocal:
Works similarly to the get command, except that the destination is restricted
to a local file reference.
Usage:
hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Example:
hdfs dfs -copyToLocal data2.txt data2.copy.txt
count:
Counts the number of directories, files, and bytes under the paths that match
the specified file pattern.
Usage:
hdfs dfs -count [-q] <paths>
Example:
hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
cp:
Copies one or more files from a specified source to a specified destination. If
you specify multiple sources, the specified destination must be a directory.
Usage:
hdfs dfs -cp URI [URI …] <dest>
Example:
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
du:
Displays the size of the specified file, or the sizes of files and directories
that are contained in the specified directory. If you specify the -s option,
displays an aggregate summary of file sizes rather than individual file sizes.
If you specify the -h option, formats the file sizes in a “human-readable” way.
Usage:
hdfs dfs -du [-s] [-h] URI [URI …]
Example:
hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1
dus:
Displays a summary of file sizes; equivalent to hdfs dfs -du -s.
Usage:
hdfs dfs -dus <args>
expunge:
Empties the trash. When you delete a file, it isn't removed immediately from HDFS, but is moved to a trash directory (.Trash under your HDFS home directory). As long as the file remains there, you can undelete it if you change your mind, though only the latest copy of the deleted file can be restored.
Usage:
hdfs dfs -expunge
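Trash is disabled by default in Hadoop 1.x; it is turned on by setting fs.trash.interval (in minutes) in core-site.xml, for example to keep deleted files for 24 hours:

<!-- core-site.xml, inside the <configuration> element -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>  <!-- minutes to keep deleted files; 0 disables the trash -->
</property>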
get:
Copies files to the local file system. Files that fail a cyclic redundancy
check (CRC) can still be copied if you specify the -ignorecrc option. The CRC
is a common technique for detecting data transmission errors. CRC checksum
files have the .crc extension and are used to verify the data integrity of
another file. These files are copied if you specify the -crc option.
Usage:
hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
Example:
hdfs dfs -get /user/hadoop/file3 localfile
getmerge:
Concatenates the files in src and writes the result to the specified local
destination file. To add a newline character at the end of each file, specify
the addnl option.
Usage:
hdfs dfs -getmerge <src> <localdst> [addnl]
Example:
hdfs dfs -getmerge /user/hadoop/mydir/ ~/result_file addnl
ls:
Returns statistics for the specified files or directories.
Usage:
hdfs dfs -ls <args>
Example:
hdfs dfs -ls /user/hadoop/file1
lsr:
Serves as the recursive version of ls; similar to the Unix command ls -R.
Usage:
hdfs dfs -lsr <args>
Example:
hdfs dfs -lsr /user/hadoop
mkdir:
Creates directories on one or more specified paths. Its behavior is similar to
the Unix mkdir -p command, which creates all directories that lead up to the
specified directory if they don’t exist already.
Usage:
hdfs dfs -mkdir <paths>
Example:
hdfs dfs -mkdir /user/hadoop/dir5/temp
moveFromLocal:
Works similarly to the put command, except that the source is deleted after it
is copied.
Usage:
hdfs dfs -moveFromLocal <localsrc> <dest>
Example:
hdfs dfs -moveFromLocal localfile1 localfile2 /user/hadoop/hadoopdir
mv:
Moves one or more files from a specified source to a specified destination. If
you specify multiple sources, the specified destination must be a directory.
Moving files across file systems isn’t permitted.
Usage:
hdfs dfs -mv URI [URI …] <dest>
Example:
hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
put:
Copies files from the local file system to the destination file system. This
command can also read input from stdin and write to the destination file
system.
Usage:
hdfs dfs -put <localsrc> … <dest>
Example:
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir; hdfs dfs -put - /user/hadoop/hadoopdir (reads input from stdin)
rm:
Deletes one or more specified files. This command doesn't delete directories; see rmr for recursive deletes. To bypass the trash (if it's enabled) and delete the specified files
immediately, specify the -skipTrash option.
Usage:
hdfs dfs -rm [-skipTrash] URI [URI …]
Example:
hdfs dfs -rm hdfs://nn.example.com/file9
rmr:
Serves as the recursive version of -rm.
Usage:
hdfs dfs -rmr [-skipTrash] URI [URI …]
Example:
hdfs dfs -rmr /user/hadoop/dir
setrep:
Changes the replication factor for a specified file or directory. With -R,
makes the change recursively by way of the directory structure.
Usage:
hdfs dfs -setrep <rep> [-R] <path>
Example:
hdfs dfs -setrep 3 -R /user/hadoop/dir1
stat:
Displays information about the specified path.
Usage:
hdfs dfs -stat URI [URI …]
Example:
hdfs dfs -stat /user/hadoop/dir1
tail:
Displays the last kilobyte of a specified file to stdout. The syntax supports
the Unix -f option, which enables the specified file to be monitored. As new
lines are added to the file by another process, tail updates the display.
Usage:
hdfs dfs -tail [-f] URI
Example:
hdfs dfs -tail /user/hadoop/file1
test:
Returns attributes of the specified file or directory. Specify -e to
determine whether the file or directory exists; -z to determine whether the
file or directory is empty; and -d to determine whether the URI is a directory.
Usage:
hdfs dfs -test -[ezd] URI
Example:
hdfs dfs -test -e /user/hadoop/dir1
text:
Outputs a specified source file in text format. Valid input file formats are
zip and TextRecordInputStream.
Usage:
hdfs dfs -text <src>
Example:
hdfs dfs -text /user/hadoop/file8.zip
touchz:
Creates a new, empty file of size 0 in the specified path.
Usage:
hdfs dfs -touchz <path>
Example:
hdfs dfs -touchz /user/hadoop/file12
HADOOP ADMINISTRATION COMMANDS
Any
Hadoop administrator worth his salt must master a comprehensive set of commands
for cluster administration. The following list summarizes the most important
commands, indicating what the command does as well as syntax and examples. Know
them, and you will advance a long way along the path to Hadoop wisdom.
balancer:
Runs the cluster-balancing utility. The specified threshold value, which
represents a percentage of disk capacity, is used to override the default
threshold value (10 percent). To stop the rebalancing process, press Ctrl+C.
Syntax:
hadoop balancer [-threshold <threshold>]
Example:
hadoop balancer -threshold 20
daemonlog:
Gets or sets the log level for each daemon (also known as a service). Connects
to http://host:port/logLevel?log=name and prints or sets the log level of the
daemon that’s running at host:port. Hadoop daemons generate log files that help
you determine what’s happening on the system, and you can use the daemonlog
command to temporarily change the log level of a Hadoop component when you’re
debugging the system. The change becomes effective when the daemon restarts.
Syntax:
hadoop daemonlog -getlevel <host:port> <name>; hadoop daemonlog
-setlevel <host:port> <name> <level>
Example:
hadoop daemonlog -getlevel 10.250.1.15:50030
org.apache.hadoop.mapred.JobTracker; hadoop daemonlog -setlevel
10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker DEBUG
datanode:
Runs the HDFS DataNode service, which coordinates storage on each slave node.
If you specify -rollback, the DataNode is rolled back to the previous version.
Stop the DataNode and distribute the previous Hadoop version before using this
option.
Syntax:
hadoop datanode [-rollback]
Example:
hadoop datanode -rollback
dfsadmin:
Runs a number of Hadoop Distributed File System (HDFS) administrative
operations. Use the -help option to see a list of all supported options. The
generic options are a common set of options supported by several commands.
Syntax:
hadoop dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get |
wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details |
force] [-metasave filename] [-setQuota <quota>
<dirname>…<dirname>] [-clrQuota <dirname>…<dirname>]
[-restoreFailedStorage true|false|check] [-help [cmd]]
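Two commonly used invocations, for example, print a cluster status report and query safe mode:
Example:
hadoop dfsadmin -report; hadoop dfsadmin -safemode get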
mradmin:
Runs a number of MapReduce administrative operations. Use the -help option to
see a list of all supported options. Again, the generic options are a common
set of options that are supported by several commands. If you specify
-refreshServiceAcl, reloads the service-level authorization policy file
(JobTracker reloads the authorization policy file); -refreshQueues reloads the
queue access control lists (ACLs) and state (JobTracker reloads the
mapred-queues.xml file); -refreshNodes refreshes the hosts information at the
JobTracker; -refreshUserToGroupsMappings refreshes user-to-groups mappings;
-refreshSuperUserGroupsConfiguration refreshes superuser proxy groups mappings;
and -help [cmd] displays help for the given command or for all commands if none
is specified.
Syntax:
hadoop mradmin [ GENERIC_OPTIONS ] [-refreshServiceAcl] [-refreshQueues]
[-refreshNodes] [-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration] [-help [cmd]]
Example:
hadoop mradmin -help -refreshNodes
jobtracker:
Runs the MapReduce JobTracker node, which coordinates the data processing
system for Hadoop. If you specify -dumpConfiguration, the configuration that’s
used by the JobTracker and the queue configuration in JSON format are written
to standard output.
Syntax:
hadoop jobtracker [-dumpConfiguration]
Example:
hadoop jobtracker -dumpConfiguration
namenode:
Runs the NameNode, which coordinates the storage for the whole Hadoop cluster.
If you specify -format, the NameNode is started, formatted, and then stopped;
with -upgrade, the NameNode starts with the upgrade option after a new Hadoop
version is distributed; with -rollback, the NameNode is rolled back to the
previous version (remember to stop the cluster and distribute the previous
Hadoop version before using this option); with -finalize, the previous state of
the file system is removed, the most recent upgrade becomes permanent, rollback
is no longer available, and the NameNode is stopped; finally, with
-importCheckpoint, an image is loaded from the checkpoint directory (as
specified by the fs.checkpoint.dir property) and saved into the current
directory.
Syntax:
hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] |
[-importCheckpoint]
Example:
hadoop namenode -finalize
secondarynamenode:
Runs the secondary NameNode. If you specify -checkpoint, a checkpoint
on the secondary NameNode is performed if the size of the EditLog (a
transaction log that records every change that occurs to the file system
metadata) is greater than or equal to fs.checkpoint.size; specify -force and a
checkpoint is performed regardless of the EditLog size; specify -geteditsize
and the EditLog size is printed.
Syntax:
hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]
Example:
hadoop secondarynamenode -geteditsize
tasktracker:
Runs a MapReduce TaskTracker node.
Syntax:
hadoop tasktracker
Example:
hadoop tasktracker