The core of the Hadoop framework consists of two components: HDFS and MapReduce. HDFS provides storage for massive data sets, while MapReduce provides the computation over them.
Hadoop implements a distributed file system, the Hadoop Distributed File System, abbreviated HDFS.
HDFS is highly fault-tolerant and is designed to run on low-cost hardware. It provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes a few POSIX requirements to allow streaming access to file system data.
HDFS uses a master/slave architecture: an HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is the master server; it manages the file system namespace and regulates client access to files. The DataNodes manage the storage attached to the nodes they run on. HDFS exposes data in the form of files.
Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files and directories, and it maintains the mapping of blocks to DataNodes. The DataNodes serve read and write requests from file system clients, and they create, delete, and replicate blocks under instruction from the NameNode. The NameNode manages all HDFS metadata; user data never flows through the NameNode.
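Once the cluster assembled later in this post is running, this block-to-DataNode mapping can be inspected directly with hdfs fsck; for example, against the test file uploaded near the end of this guide:

# List the blocks that make up the file and the DataNodes holding each replica
hdfs fsck /anaconda-ks.cfg -files -blocks -locations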
Hadoop MapReduce is a clone of Google's MapReduce.
MapReduce is a computational model for processing large volumes of data. The Map step applies a given operation to independent elements of the data set and emits intermediate results as key-value pairs; the Reduce step then aggregates all values that share the same key to produce the final result. This division of work makes MapReduce well suited to data processing in a distributed, parallel environment made up of many machines.
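The same flow can be imitated on a single machine with an ordinary shell pipeline, which makes the model concrete before any cluster is involved (purely illustrative; /etc/hosts is just a handy local text file):

# "map": split the input into words, emitting one key per line
# "shuffle": sort brings identical keys together
# "reduce": count each run of identical keys
tr -s ' ' '\n' < /etc/hosts | sort | uniq -c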
Hadoop MapReduce (MRv1) uses a Master/Slave (M/S) architecture whose main components are the Client, the JobTracker, the TaskTrackers, and the Tasks. (In Hadoop 2.x, the version installed below, scheduling is handled instead by YARN's ResourceManager and NodeManagers, which is what will appear in the process list later.)
# Download the release
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
# Unpack it
tar xf hadoop-2.7.3.tar.gz && mv hadoop-2.7.3 /usr/local/hadoop
# Create working directories
mkdir -p /home/hadoop/{name,data,log,journal}
Create the file /etc/profile.d/hadoop.sh with the following content:

# HADOOP ENV
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Make the Hadoop environment variables take effect:
source /etc/profile.d/hadoop.sh
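As a quick sanity check, the hadoop command should now be on the PATH:

echo $HADOOP_HOME    # expected: /usr/local/hadoop
hadoop version       # should report Hadoop 2.7.3 (needs JAVA_HOME, set system-wide or via hadoop-env.sh below)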
Edit /usr/local/hadoop/etc/hadoop/hadoop-env.sh and set the following fields:

export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/local/hadoop
Edit /usr/local/hadoop/etc/hadoop/yarn-env.sh and set the following field:
export JAVA_HOME=/usr/java/default
Edit /usr/local/hadoop/etc/hadoop/slaves and list the DataNode hosts, one per line:

datanode01
datanode02
datanode03
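start-all.sh and the configuration sync later in this post assume passwordless SSH from the NameNode to every host listed in slaves. A quick check, assuming the hostnames resolve and keys are already distributed:

for h in datanode01 datanode02 datanode03; do
    ssh "$h" hostname    # should print each hostname without asking for a password
done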
Edit /usr/local/hadoop/etc/hadoop/core-site.xml as follows:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://cluster1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>zk01:2181,zk02:2181,zk03:2181</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131702</value>
    </property>
</configuration>
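With the file in place, the value Hadoop actually resolves for a key can be checked with hdfs getconf, for example:

hdfs getconf -confKey fs.defaultFS    # fs.default.name is the deprecated alias; expected: hdfs://cluster1:9000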
Edit /usr/local/hadoop/etc/hadoop/hdfs-site.xml as follows:

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.nameservices</name>
        <value>cluster1</value>
    </property>
</configuration>
Edit /usr/local/hadoop/etc/hadoop/mapred-site.xml (copy it from mapred-site.xml.template if it does not yet exist) as follows:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/home/hadoop/data</value>
    </property>
    <property>
        <name>mapreduce.admin.map.child.java.opts</name>
        <value>-Xmx256m</value>
    </property>
    <property>
        <name>mapreduce.admin.reduce.child.java.opts</name>
        <value>-Xmx4096m</value>
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
    </property>
    <property>
        <name>mapred.task.timeout</name>
        <value>1200000</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.hosts.exclude</name>
        <value>slaves.exclude</value>
    </property>
    <property>
        <name>mapred.hosts.exclude</name>
        <value>slaves.exclude</value>
    </property>
</configuration>
Edit /usr/local/hadoop/etc/hadoop/yarn-site.xml as follows:

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>namenode01</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${yarn.resourcemanager.hostname}:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:8088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>${yarn.resourcemanager.hostname}:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>${yarn.resourcemanager.hostname}:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8182</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>12</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>
    <property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>98.0</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
Distribute the configuration to all DataNodes and fix ownership and permissions:

cd /usr/local/hadoop/etc/hadoop
scp * datanode01:/usr/local/hadoop/etc/hadoop
scp * datanode02:/usr/local/hadoop/etc/hadoop
scp * datanode03:/usr/local/hadoop/etc/hadoop
chown -R hadoop:hadoop /usr/local/hadoop
chmod 755 /usr/local/hadoop/etc/hadoop
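An equivalent way to push the files, easier to extend if more DataNodes are added later, is a small loop (a sketch assuming the same host names, which also fixes ownership on each DataNode):

for h in datanode01 datanode02 datanode03; do
    scp /usr/local/hadoop/etc/hadoop/* "$h":/usr/local/hadoop/etc/hadoop/
    ssh "$h" chown -R hadoop:hadoop /usr/local/hadoop
done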
On namenode01, format HDFS and start the NameNode:

hdfs namenode -format
hadoop-daemon.sh start namenode
Then restart the whole cluster:

stop-all.sh
start-all.sh
Check the running processes on each node:

[root@namenode01 ~]# jps
17419 NameNode
17780 ResourceManager
18152 Jps

[root@datanode01 ~]# jps
2227 DataNode
1292 QuorumPeerMain
2509 Jps
2334 NodeManager

[root@datanode02 ~]# jps
13940 QuorumPeerMain
18980 DataNode
19093 NodeManager
19743 Jps

[root@datanode03 ~]# jps
19238 DataNode
19350 NodeManager
14215 QuorumPeerMain
20014 Jps
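Beyond jps, two checks that should show all three DataNodes and NodeManagers registered with the cluster:

hdfs dfsadmin -report    # should list 3 live DataNodes
yarn node -list -all     # should list 3 NodeManagers in RUNNING state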
To verify from a browser, open the NameNode web UI (http://namenode01:50070 by default) and the ResourceManager web UI (http://namenode01:8088, as configured in yarn-site.xml above).
Upload a test file to HDFS:

[root@namenode01 ~]# hadoop fs -put /root/anaconda-ks.cfg /anaconda-ks.cfg
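To confirm the upload and the replication factor configured earlier (dfs.replication = 2):

hadoop fs -ls /anaconda-ks.cfg     # the second column is the replication factor
hadoop fs -du -h /anaconda-ks.cfg  # size of the file as stored in HDFS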
Run the bundled WordCount example against the uploaded file:

[root@namenode01 ~]# cd /usr/local/hadoop/share/hadoop/mapreduce/
[root@namenode01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /anaconda-ks.cfg /test
18/11/17 00:04:45 INFO client.RMProxy: Connecting to ResourceManager at namenode01/192.168.1.200:8032
18/11/17 00:04:45 INFO input.FileInputFormat: Total input paths to process : 1
18/11/17 00:04:45 INFO mapreduce.JobSubmitter: number of splits:1
18/11/17 00:04:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541095016765_0004
18/11/17 00:04:46 INFO impl.YarnClientImpl: Submitted application application_1541095016765_0004
18/11/17 00:04:46 INFO mapreduce.Job: The url to track the job: http://namenode01:8088/proxy/application_1541095016765_0004/
18/11/17 00:04:46 INFO mapreduce.Job: Running job: job_1541095016765_0004
18/11/17 00:04:51 INFO mapreduce.Job: Job job_1541095016765_0004 running in uber mode : false
18/11/17 00:04:51 INFO mapreduce.Job:  map 0% reduce 0%
18/11/17 00:04:55 INFO mapreduce.Job:  map 100% reduce 0%
18/11/17 00:04:59 INFO mapreduce.Job:  map 100% reduce 100%
18/11/17 00:04:59 INFO mapreduce.Job: Job job_1541095016765_0004 completed successfully
18/11/17 00:04:59 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=1222
		FILE: Number of bytes written=241621
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1023
		HDFS: Number of bytes written=941
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=1758
		Total time spent by all reduces in occupied slots (ms)=2125
		Total time spent by all map tasks (ms)=1758
		Total time spent by all reduce tasks (ms)=2125
		Total vcore-milliseconds taken by all map tasks=1758
		Total vcore-milliseconds taken by all reduce tasks=2125
		Total megabyte-milliseconds taken by all map tasks=1800192
		Total megabyte-milliseconds taken by all reduce tasks=2176000
	Map-Reduce Framework
		Map input records=38
		Map output records=90
		Map output bytes=1274
		Map output materialized bytes=1222
		Input split bytes=101
		Combine input records=90
		Combine output records=69
		Reduce input groups=69
		Reduce shuffle bytes=1222
		Reduce input records=69
		Reduce output records=69
		Spilled Records=138
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=99
		CPU time spent (ms)=970
		Physical memory (bytes) snapshot=473649152
		Virtual memory (bytes) snapshot=4921606144
		Total committed heap usage (bytes)=441450496
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=922
	File Output Format Counters
		Bytes Written=941
[root@namenode01 mapreduce]# hadoop fs -cat /test/part-r-00000
#	11
#version=DEVEL	1
$6$kRQ2y1nt/B6c6ETs$ITy0O/E9P5p0ePWlHJ7fRTqVrqGEQf7ZGi5IX2pCA7l25IdEThUNjxelq6wcD9SlSa1cGcqlJy2jjiV9/lMjg/	1
%addon	1
%end	2
%packages	1
--all	1
--boot-drive=sda	1
--bootproto=dhcp	1
--device=enp1s0	1
--disable	1
--drives=sda	1
--enable	1
--enableshadow	1
--hostname=localhost.localdomain	1
--initlabel	1
--ipv6=auto	1
--isUtc	1
--iscrypted	1
--location=mbr	1
--onboot=off	1
--only-use=sda	1
--passalgo=sha512	1
--reserve-mb='auto'	1
--type=lvm	1
--vckeymap=cn	1
--xlayouts='cn'	1
@^minimal	1
@core	1
Agent	1
Asia/Shanghai	1
CDROM	1
Keyboard	1
Network	1
Partition	1
Root	1
Run	1
Setup	1
System	4
Use	2
auth	1
authorization	1
autopart	1
boot	1
bootloader	2
cdrom	1
clearing	1
clearpart	1
com_redhat_kdump	1
configuration	1
first	1
firstboot	1
graphical	2
ignoredisk	1
information	3
install	1
installation	1
keyboard	1
lang	1
language	1
layouts	1
media	1
network	2
on	1
password	1
rootpw	1
the	1
timezone	2
zh_CN.UTF-8	1
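MapReduce jobs do not have to be written in Java: Hadoop Streaming lets any executable act as mapper and reducer. Below is only a hedged sketch; the streaming jar path is assumed from the 2.7.3 tarball layout, and /test-streaming is an example output directory that must not exist beforehand. With cat as the mapper and wc as the reducer, the job simply reports the line/word/byte count of the input.

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -input /anaconda-ks.cfg \
    -output /test-streaming \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
hadoop fs -cat /test-streaming/part-00000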
View help for the file system commands: hadoop fs -help

Check HDFS disk usage: hadoop fs -df -h
Create a directory: hadoop fs -mkdir
Upload a local file: hadoop fs -put
List files: hadoop fs -ls
Print file contents: hadoop fs -cat
Copy a file: hadoop fs -cp
Download an HDFS file to the local file system: hadoop fs -get
Move a file: hadoop fs -mv
Delete a file: hadoop fs -rm
Delete a directory recursively: hadoop fs -rm -r -f
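A short end-to-end sketch tying these commands together (the HDFS path /demo and the local file /etc/hosts are just examples):

hadoop fs -mkdir /demo                      # create a directory in HDFS
hadoop fs -put /etc/hosts /demo/hosts       # upload a local file
hadoop fs -ls /demo                         # list the directory
hadoop fs -cat /demo/hosts                  # print the file contents
hadoop fs -get /demo/hosts /tmp/hosts.bak   # download it back to the local disk
hadoop fs -rm -r -f /demo                   # clean up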
Reposted from: https://blog.51cto.com/wzlinux/2317912