Monday, September 22, 2014

Hadoop Configuration Parameters revisited

fs.default.name specifies the URI of the default filesystem (for HDFS, the namenode's host and port)

fs.checkpoint.dir used by the secondary namenode to store filesystem metadata during a checkpoint operation

fs.trash.interval specifies the number of minutes a deleted file remains available in .Trash before final deletion

topology.script.file.name absolute path of the script that maps hosts to racks, making the cluster rack aware
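
A rack-awareness script receives one or more hostnames or IP addresses as command-line arguments and prints one rack path per argument. The following is a minimal sketch in Python, not a production script: the RACK_MAP table, its addresses, and the rack names are hypothetical placeholders, and /default-rack is used here as the fallback for unknown hosts.

```python
#!/usr/bin/env python
import sys

# Hypothetical host-to-rack table; replace with your own inventory.
RACK_MAP = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"  # fallback for hosts missing from the table


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_MAP.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hosts arrive as arguments; rack paths go to stdout, one per line.
    print("\n".join(resolve(sys.argv[1:])))
```

The script must be executable by the user running the namenode and jobtracker, and it should always print exactly as many rack paths as it was given arguments.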

hadoop.log.dir The directory in which log data should be written. This should be the same path as specified in HADOOP_LOG_DIR in the hadoop-env.sh file.

io.file.buffer.size (core-site.xml) general-purpose buffer size, in bytes, used for file read/write I/O and network I/O
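
To show how such parameters end up in a site file, here is a small sketch that renders name/value pairs into the `<configuration>` / `<property>` layout Hadoop site files use. The helper name render_site_xml is made up for this example, and the hostname, port, and values are illustrative assumptions rather than recommendations.

```python
import xml.etree.ElementTree as ET


def render_site_xml(props):
    """Build a Hadoop-style <configuration> document from a dict."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; the hostname and port are assumptions.
core_site = render_site_xml({
    "fs.default.name": "hdfs://namenode.example.com:8020",
    "fs.trash.interval": 24 * 60,      # minutes: keep deleted files one day
    "io.file.buffer.size": 64 * 1024,  # bytes: 64 KB buffer
})
print(core_site)
```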

dfs.block.size specifies the default block size, in bytes, for files stored on HDFS
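
As a quick illustration of what the block size means in practice, this sketch computes how many blocks a file of a given size occupies. The function name is hypothetical; the 64 MB default matches the Hadoop 1.x default for this parameter.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of HDFS blocks needed for a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


print(block_count(1024 * MB))            # 1 GB file, 64 MB blocks
print(block_count(200 * MB, 128 * MB))   # 200 MB file, 128 MB blocks
```

The last block of a file only occupies as much disk space as the data it holds, so a larger block size does not waste space on small files; it mainly reduces the number of blocks the namenode has to track.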

dfs.name.dir specifies a comma-separated list of directories in which the namenode stores its metadata; each directory holds a full copy

dfs.data.dir comma-separated list of directories where datanodes store HDFS block data

dfs.datanode.du.reserved disk space, in bytes per volume, reserved for non-HDFS use
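
A rough sketch of how the reservation reduces the space a datanode can offer to HDFS. This is a simplified model (it ignores filesystem overhead and space already used by other data), and the function name and disk sizes are assumptions made for the example.

```python
GB = 1024 ** 3


def hdfs_capacity(volume_sizes, reserved_per_volume):
    """Approximate bytes left for HDFS after the per-volume reservation."""
    return sum(max(0, size - reserved_per_volume) for size in volume_sizes)


# Four 2000 GB data disks, 10 GB reserved on each one.
usable = hdfs_capacity([2000 * GB] * 4, 10 * GB)
print(usable // GB)  # 7960
```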

dfs.namenode.handler.count number of worker threads the namenode uses to process RPC requests from clients as well as from other cluster daemons

dfs.datanode.failed.volumes.tolerated specifies the number of disks that are permitted to die before failing the entire datanode

dfs.hosts path to a file listing the hostnames of the datanodes that are allowed to communicate with the namenode.

dfs.hosts.exclude path to a file listing the hosts to exclude; used for decommissioning a datanode or for blocking a host from communicating with the namenode
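
The interaction between the include file (dfs.hosts) and the exclude file can be summarized with a small decision function. This is a simplified model of the behaviour, not the namenode's actual code; the function name and the status labels are invented for the example, and an empty include list is assumed to mean that every host is allowed.

```python
def datanode_status(host, include, exclude):
    """Classify a datanode from the include and exclude host lists.

    An empty include list is treated as "every host is allowed".
    """
    allowed = not include or host in include
    if not allowed:
        return "rejected"        # not permitted to talk to the namenode
    if host in exclude:
        return "decommission"    # permitted, but being drained of blocks
    return "allowed"


include = {"dn1", "dn2", "dn3"}
exclude = {"dn3", "dn9"}
for host in ("dn1", "dn3", "dn9"):
    print(host, datanode_status(host, include, exclude))
```

After editing either file, the namenode has to be told to reread them, which is normally done with `hadoop dfsadmin -refreshNodes`.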

dfs.permissions.supergroup specifies the group whose members have privileges equivalent to the superuser

dfs.balance.bandwidthPerSec used by each datanode to limit the bandwidth, in bytes per second, that it spends on block rebalancing

mapred.job.tracker specifies the job tracker hostname and port

mapred.local.dir comma-separated list of directories on the machine's local disks where MapReduce jobs store their intermediate output

mapred.child.java.opts specifies the JVM options for child task processes, such as initial heap size, maximum heap size, etc.

mapred.child.ulimit a limit, in kilobytes, on how much virtual memory a child process may consume before it is terminated.

mapred.tasktracker.map.tasks.maximum maximum number of map tasks a worker node can run in parallel

mapred.tasktracker.reduce.tasks.maximum maximum number of reduce tasks a worker node can run in parallel
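
The slot counts and the child JVM options need to be sized together, since every occupied slot runs its own JVM. The sketch below estimates the worst-case heap demand on one worker; both function names are hypothetical, and the slot counts and heap flags are example values only.

```python
import re


def max_heap_mb(java_opts):
    """Extract the -Xmx value from a JVM options string, in megabytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx flag with a k/m/g suffix found")
    value, unit = int(match.group(1)), match.group(2).lower()
    return {"k": value / 1024, "m": value, "g": value * 1024}[unit]


def worst_case_task_heap_mb(map_slots, reduce_slots, java_opts):
    """Heap needed if every slot on a worker runs a task at full heap."""
    return (map_slots + reduce_slots) * max_heap_mb(java_opts)


# Example: 8 map slots, 4 reduce slots, 1 GB maximum heap per child JVM.
print(worst_case_task_heap_mb(8, 4, "-Xms256m -Xmx1g"))  # 12288
```

The result should leave room on the node for the datanode and tasktracker daemons and for the operating system, and JVMs use some memory beyond the heap, so treat the figure as a lower bound.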

io.sort.mb specifies the size, in megabytes, of the circular buffer that holds the intermediate key-value pairs emitted by the mapper

io.sort.factor specifies the number of files/streams to merge at once
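
A back-of-the-envelope sketch of how these two settings interact for a single map task. It is a deliberately rough model: it assumes the buffer spills to disk once it is a given fraction full (the spill_fraction argument, 0.80 here, stands in for a separate Hadoop setting not covered in this list), and it ignores record-accounting overhead and combiners. The function names are made up for the example.

```python
import math


def estimated_spills(map_output_mb, io_sort_mb, spill_fraction=0.80):
    """Rough number of spill files one map task writes to local disk."""
    usable_mb = io_sort_mb * spill_fraction
    return max(1, math.ceil(map_output_mb / usable_mb))


def needs_multiple_merge_passes(spill_files, io_sort_factor):
    """True if the spill files cannot all be merged in a single pass."""
    return spill_files > io_sort_factor


spills = estimated_spills(map_output_mb=1000, io_sort_mb=100)
print(spills, needs_multiple_merge_passes(spills, io_sort_factor=10))
```

Fewer spills and a single merge pass mean less local disk I/O, which is the usual reason for raising either value when map tasks emit a lot of data.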

mapred.compress.map.output true/false, depending on whether you want to compress the data emitted by the mapper

mapred.map.output.compression.codec specifies the codec that you want to use to compress the intermediate data

mapred.output.compression.type NONE, RECORD or BLOCK; the level at which job output written as SequenceFiles is compressed

mapred.job.tracker.handler.count size of the pool of worker threads the jobtracker maintains to handle RPC requests

mapred.jobtracker.taskScheduler specifies the Java class name of the scheduler plugin that the jobtracker should use

mapred.reduce.parallel.copies controls the number of map-output copies each reduce task initiates in parallel during the shuffle phase

mapred.reduce.tasks controls the number of reduce tasks for a job

tasktracker.http.threads number of threads available to handle HTTP requests concurrently

mapred.reduce.slowstart.completed.maps indicates when to begin allocating reducers, expressed as the fraction (0 to 1) of the job's map tasks that must have completed
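
To make the threshold concrete, this sketch converts the setting into a number of map tasks, treating the value as a fraction between 0 and 1 (the shipped default in Hadoop 1.x is 0.05). The function name is hypothetical, and rounding up to a whole task is an assumption of this example.

```python
import math


def maps_before_reducers(total_maps, slowstart=0.05):
    """Completed map tasks required before reducers are scheduled."""
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0 and 1")
    return math.ceil(slowstart * total_maps)


print(maps_before_reducers(100, 0.5))   # half of the maps must finish
print(maps_before_reducers(40, 1.0))    # reducers wait for every map
```

A low value lets reducers start copying map output early but ties up reduce slots while they wait; a value close to 1 frees those slots for other jobs at the cost of a later shuffle start.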

mapred.acls.enabled true/false; access control lists must be globally enabled with this parameter before they can be used