
Hadoop 2.2 + Spark installation notes

Note: this document only records the steps I actually performed; it has not been systematically organized, so please bear with any repetition or verbosity.
Also: for problems hit while configuring SSH and so on, see my other document, 《ubuntu问题汇总.txt》.

1. Download
Get the latest Hadoop 2.2 directly from the Apache website. The official release only ships 32-bit Linux binaries, so to deploy on a 64-bit system you have to download the src tarball and build it yourself (comment #10 links to a workaround); a minimal build sketch is given below the download address.
下载地址:https://www.sodocs.net/doc/184378004.html,/hadoop/common/hadoop220/
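For the 64-bit case, the following is a minimal build-from-source sketch, not taken from the linked workaround: it assumes a JDK, Maven, protobuf 2.5.0, cmake and the usual native build dependencies are already installed, and that the hadoop-2.2.0-src tarball has been downloaded.
tar -zxvf hadoop-2.2.0-src.tar.gz
cd hadoop-2.2.0-src
# build the distribution with native 64-bit libraries; skip tests to save time
mvn package -Pdist,native -DskipTests -Dtar
# the rebuilt distribution typically ends up under hadoop-dist/target/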

2. Topology:
192.168.1.107 ubuntu02 [work laptop]  /home/hadoop  (added later)
192.168.1.104 ubuntu03 [work laptop]  /home/hadoop
192.168.1.105 ubuntu04 [home laptop]  /home/hadoop  (namenode)
192.168.1.106 ubuntu05 [home laptop]  /home/hadoop
3. Configure /etc/hosts
cd /etc
sudo vi /etc/hosts

192.168.1.107 ubuntu02
192.168.1.104 ubuntu03
192.168.1.105 ubuntu04
192.168.1.106 ubuntu05
Paste the lines above into the file and save.

Repeat on every machine until all nodes are done.

4. Configure SSH
On ubuntu04, go to the hadoop user's home directory (/home/hadoop):
mkdir .ssh
chmod 700 .ssh
cd .ssh
Generate a key pair (press Enter through any prompts):
ssh-keygen -t rsa -P ''

cp id_rsa.pub authorized_keys

Repeat the steps above on every node.

Then merge each node's authorized_keys into one combined authorized_keys and copy the merged file to every node's ~/.ssh, overwriting the one already there (see the sketch after the scp commands below):

scp /home/hadoop/.ssh/authorized_keys hadoop@ubuntu02:/home/hadoop/.ssh
scp /home/hadoop/.ssh/authorized_keys hadoop@ubuntu03:/home/hadoop/.ssh
scp /home/hadoop/.ssh/authorized_keys hadoop@ubuntu05:/home/hadoop/.ssh
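A minimal sketch of that merge step, run from ubuntu04. The hostnames and paths are the ones used in this cluster; the ssh/scp calls will still prompt for passwords at this point, since passwordless login is exactly what is being set up.
#!/bin/bash
# collect every other node's public key into the local authorized_keys
cd /home/hadoop/.ssh
for h in ubuntu02 ubuntu03 ubuntu05; do
    ssh hadoop@$h cat /home/hadoop/.ssh/id_rsa.pub >> authorized_keys
done
chmod 644 authorized_keys
# push the merged file back out to all nodes (this is what the scp commands above do)
for h in ubuntu02 ubuntu03 ubuntu05; do
    scp authorized_keys hadoop@$h:/home/hadoop/.ssh/
done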



On ubuntu04, test passwordless login to another node: ssh ubuntu03

5. Install the JDK (using the .tar.gz package for a 32-bit system as an example)
Installation reference: https://www.sodocs.net/doc/184378004.html,/javase/7/docs/webnotes/install/linux/linux-jdk.html
5.1 Choose where to install Java, e.g. under /home/hadoop/, and create a java directory there (mkdir java)
5.2 Move jdk-7u40-linux-i586.tar.gz to /home/hadoop/
5.3 Unpack it: tar -zxvf jdk-7u40-linux-i586.tar.gz
5.4 Delete jdk-7u40-linux-i586.tar.gz (to save space)
That completes the JDK installation; next, configure the environment variables.
5.5 Open /etc/profile (vim /etc/profile)
and append the following:

JAVA_HOME=/home/hadoop/jdk1.7.0_67
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export PATH
export CLASSPATH


5.6 source /etc/profile
5.7 Verify the installation: java -version

[Note] Do the same on every machine, preferably installing Java under the same path everywhere (not required, but it makes the later configuration much easier).

6. Install Hadoop 2.2.0
6.1 Unpack: tar -zxvf hadoop220.tar.gz
6.2 Hadoop configuration
The config files below reference these data directories (placed under /home/hadoop/data on each node; a mkdir sketch follows this list):
/dfs/name
/dfs/data
/temp
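A minimal sketch for creating these directories; the /home/hadoop/data prefix is taken from the file: paths used in core-site.xml and hdfs-site.xml below. Repeat on every node:
mkdir -p /home/hadoop/data/dfs/name
mkdir -p /home/hadoop/data/dfs/data
mkdir -p /home/hadoop/data/temp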

Seven configuration files are involved here:
~/hadoop220/etc/hadoop/hadoop-env.sh
~/hadoop220/etc/hadoop/yarn-env.sh
~/hadoop220/etc/hadoop/slaves
~/hadoop220/etc/hadoop/core-site.xml
~/hadoop220/etc/hadoop/hdfs-site.xml
~/hadoop220/etc/hadoop/mapred-site.xml
~/hadoop220/etc/hadoop/yarn-site.xml
Some of these files do not exist by default; create them by copying the corresponding .template file (example below).
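For example, in Hadoop 2.2.0 mapred-site.xml usually ships only as a template (adjust if your copy differs):
cd ~/hadoop220/etc/hadoop
cp mapred-site.xml.template mapred-site.xml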

Config file 1: hadoop-env.sh
Set JAVA_HOME in it (export JAVA_HOME=/home/hadoop/jdk1.7.0_67)
Config file 2: yarn-env.sh
Set JAVA_HOME in it (export JAVA_HOME=/home/hadoop/jdk1.7.0_67)

Config file 3: slaves (this file lists all slave nodes)
Put the following in it:
ubuntu03
ubuntu05


Config file 4: core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ubuntu04:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/data/temp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
  </property>
</configuration>



Config file 5: hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ubuntu04:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/data/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/data/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>



Config file 6: mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>ubuntu04:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>ubuntu04:19888</value>
  </property>
</configuration>



Config file 7: yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ubuntu04:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ubuntu04:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ubuntu04:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>ubuntu04:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>ubuntu04:8088</value>
  </property>
</configuration>




Copy the configuration to the other nodes.

A small shell script makes this easier when there are many nodes:

cp2slave.sh

#!/bin/bash
scp /home/hadoop/hadoop220/etc/hadoop/* hadoop@ubuntu03:/home/hadoop/hadoop220/etc/hadoop
scp /home/hadoop/hadoop220/etc/hadoop/* hadoop@ubuntu05:/home/hadoop/hadoop220/etc/hadoop
# or copy the whole installation directory:
scp -r /home/hadoop/hadoop220 hadoop@ubuntu03:/home/hadoop/
scp -r /home/hadoop/hadoop220 hadoop@ubuntu05:/home/hadoop/

Adding node ubuntu02 (later):


scp /home/hadoop/hadoop220/etc/hadoop/* hadoop@ubuntu03:/home/hadoop/hadoop220/etc/hadoop
scp /home/hadoop/hadoop220/etc/hadoop/* hadoop@ubuntu05:/home/hadoop/hadoop220/etc/hadoop

scp /home/hadoop/spark102/conf/*.* hadoop@ubuntu03:/home/hadoop/spark102/conf
scp /home/hadoop/spark102/conf/*.* hadoop@ubuntu05:/home/hadoop/spark102/conf


scp -r /home/hadoop/hadoop220 hadoop@ubuntu02:/home/hadoop/
scp -r /home/hadoop/spark102 hadoop@ubuntu02:/home/hadoop/
scp -r /home/hadoop/scala-2.9.3 hadoop@ubuntu02:/home/hadoop/
scp -r /home/hadoop/jdk1.7.0_67 hadoop@ubuntu02:/home/hadoop/


7. Start and verify

7.1 Start Hadoop

Go to the installation directory: cd ~/hadoop220/

Format the namenode:
./bin/hdfs namenode -format


Start HDFS:
./sbin/start-dfs.sh
(stop with ./sbin/stop-dfs.sh)

At this point the namenode host (ubuntu04) is running: NameNode, SecondaryNameNode

and the slave nodes (ubuntu03, ubuntu05) are running: DataNode

Start YARN:
./sbin/start-yarn.sh
(stop with ./sbin/stop-yarn.sh)

Now the namenode host is running: NameNode, SecondaryNameNode, ResourceManager

and the slave nodes are running: DataNode, NodeManager

Check cluster status:
./bin/hdfs dfsadmin -report


Check file and block health:
./bin/hdfs fsck / -files -blocks

View HDFS: http://192.168.1.105:50070
View the RM (ResourceManager): http://192.168.1.105:8088



Create a directory in HDFS and upload a test file:
hadoop fs -mkdir -p data/wordcount
hadoop fs -put /home/hadoop/mydata/kpi.txt data/wordcount/

Run the wordcount example on it:
./bin/hadoop jar /home/hadoop/hadoop220/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount data/wordcount/ data/out/wordcount


7.2 Run the example programs:

First create a directory on HDFS:

./bin/hdfs dfs -mkdir /input

./bin/hadoop jar /home/hadoop/hadoop220/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar randomwriter input

--------
/home/hadoop/jdk1.7.0_67/bin/jps
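To check all nodes at once, a small loop like the following can be run from the namenode; it is just a sketch and assumes passwordless ssh is working and that the JDK sits under the same path on every node, as set up above:
for h in ubuntu02 ubuntu03 ubuntu04 ubuntu05; do
    echo "== $h =="
    ssh hadoop@$h /home/hadoop/jdk1.7.0_67/bin/jps
done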


------------------------------------- Spark (following the article "Spark On YARN 环境搭建": https://www.sodocs.net/doc/184378004.html,/353572/1352713) -----------------------------------------------
This part describes deploying Apache Spark on Hadoop 2.2.0; if your Hadoop is a different version, e.g. CDH4, you can follow the official instructions directly.

Two points to note: (1) the Hadoop used must be from the 2.0 line (e.g. 0.23.x, 2.0.x, 2.x.x, CDH4, CDH5). Running Spark on Hadoop really means running Spark on Hadoop YARN, because Spark itself only provides job management and relies on a third-party system such as YARN or Mesos for resource scheduling. (2) YARN is chosen over Mesos because YARN has strong community support and is gradually becoming the standard resource management system.
Spark distributed cluster configuration [note: apply the same configuration on every node]
1. Download from the Spark website:
spark102.tgz
2. Unpack:
spark102.tgz

3. Install Scala
https://www.sodocs.net/doc/184378004.html,/files/archive/scala-2.9.3.tgz

sudo vi /etc/profile

export SCALA_HOME=/home/hadoop/scala-2.9.3
export PATH=$PATH:$SCALA_HOME/bin
export SPARK_HOME=/home/hadoop/spark102
export PATH=$PATH:$SPARK_HOME/bin

4. Configure spark-env.sh:
cd /home/hadoop/spark102/conf
cp spark-env.sh.template spark-env.sh
Add the following to spark-env.sh:
export SPARK_HOME=/home/hadoop/spark102
export PATH=$PATH:$SPARK_HOME/bin

export JAVA_HOME=/home/hadoop/jdk1.7.0_67
export SCALA_HOME=/home/hadoop/scala-2.9.3
export HADOOP_HOME=/home/hadoop/hadoop220


Edit /home/hadoop/spark102/conf/slaves:
ubuntu03
ubuntu05

# Logging configuration
cd /home/hadoop/spark102/conf
cp log4j.properties.template log4j.properties

Distribute to the slave nodes:
scp -r /home/hadoop/spark102 hadoop@ubuntu03:/home/hadoop/
scp -r /home/hadoop/spark102 hadoop@ubuntu05:/home/hadoop/

scp -r /home/hadoop/scala-2.9.3 hadoop@ubuntu03:/home/hadoop/
scp -r /home/hadoop/scala-2.9.3 hadoop@ubuntu05:/home/hadoop/

Test:
cd /home/hadoop/spark102/sbin
./start-all.sh
(stop with ./stop-all.sh)
/home/hadoop/jdk1.7.0_67/bin/jps



## Monitoring page URLs
http://192.168.1.105:8080/
http://192.168.1.105:4040 (Spark UI)

Enter the spark-shell console:
cd /home/hadoop/spark102/bin
spark-shell

hadoop fs -ls hdfs://ubuntu04:9000/input/
--
hadoop fs -put /home/hadoop/hadoop/README.txt /input/
Check the file's block size:
hadoop fs -stat "%o" input/kpi.txt

Back in the spark-shell console,
read the README.txt file just uploaded to HDFS with the following commands.

Test 1 (is this way of reading from DFS wrong?):
#var file=sc.textFile("hdfs://ubuntu04:9000/input/README.txt")
var file=sc.textFile("/home/hadoop/hadoop220/README.txt")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)

Test 2
--------- Note: the two lines in Test 1 failed, while the two lines below succeeded; confusingly, Test 2 is the one that actually reads from DFS ------------------
val textFile = sc.textFile("input/README.txt")
textFile.count()
val count = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
count.collect

Test 3: large file (80 MB). The first run threw java.lang.OutOfMemoryError; the second run succeeded.
val textFile = sc.textFile("input/kpi.txt")
textFile.count()
val count = textFile.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_+_)
count.collect
Test 4: check that the counts are correct
val textFile = sc.textFile("input/my.txt")
textFile.count()
val count = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
count.collect

Test 5:
var file=sc.textFile("/input/README.md");
var sparks=file.filter(line=>line.contains("spark"));
sparks.count

Test 6 (500 million rows, 4.2 GB; took 21 min. By comparison, grep 139******** dw_acct_shoulditem_ms.txt finds every matching line in just a few seconds; a timed grep baseline is shown after the two Spark snippets below):
var file=sc.textFile("/input/dw_acct_shoulditem_ms.txt");
var sparks=file.filter(line=>line.contains("139********"));
sparks.count


var file=sc.textFile("/user/hive/warehouse/dw_acct_shoulditem_ms/dw_acct_shoulditem_ms.txt");
var sparks=file.filter(line=>line.contains("139********"));
sparks.count
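For reference, the grep baseline mentioned above can be timed like this; it is only a sketch, assuming a local copy of dw_acct_shoulditem_ms.txt in the current directory, with 139******** standing in for the real (masked) number:
time grep 139******** dw_acct_shoulditem_ms.txt | wc -l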



Examples:

## First cd to the /home/hadoop/spark102/bin directory
(1) Local mode
./run-example org.apache.spark.examples.SparkPi local[10]


(2) Standalone cluster mode
./run-example org.apache.spark.examples.SparkPi spark://ubuntu04:7077
./run-example org.apache.spark.examples.SparkLR spark://ubuntu04:7077
./run-example org.apache.spark.examples.SparkKMeans spark://ubuntu04:7077 file:/usr/local/spark/kmeans_data.txt 2 1


(3) Cluster mode reading from HDFS
# hadoop fs -put README.md /input/
# MASTER=spark://ubuntu04:7077 ./spark-shell
scala> val file = sc.textFile("hdfs://ubuntu04:9000/user/root/README.md")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
scala> count.collect()
scala> :quit


(4) YARN mode
# SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
./spark-class org.apache.spark.deploy.yarn.Client \
--jar examples/target/scala-2.9.3/spark-examples_2.9.3-assembly-0.8.1-incubating.jar \
--class org.apache.spark.examples.SparkPi \
--args yarn-standalone \
--num-workers 3 \
--master-memory 4g \
--worker-memory 2g \
--worker-cores 1


The output ends up in:
/usr/local/hadoop/logs/userlogs/application_*/container*_000001/stdout


(5) Other sample programs
examples/src/main/scala/org/apache/spark/examples/


(6) Troubleshooting [logs on the data nodes]
/data/hadoop/storage/tmp/nodemanager/logs


(7) Some tuning
# vim /usr/local/spark/conf/spark-env.sh
export SPARK_WORKER_MEMORY=16g   # set according to the actual memory size
......

--- Test:
./bin/run-example SparkPi 30


========================================================================================
Part V. Shark data warehouse [to be added later]




Part VI. Building a Spark IDE development environment (IntelliJ IDEA)
The preferred IDE for Spark development is IntelliJ IDEA; set it up as follows.
1. Download the free Community Edition from https://www.sodocs.net/doc/184378004.html,/idea/dowload
2. Unpack it.
3. Configure the environment variable:
export PATH=$PATH:/home/hadoop/idea-IC-135.1289/bin
4. Go to /home/hadoop/idea-IC-135.1289/bin and run idea.sh
5. On the welcome screen, click plugin -> search for scala -> click scala -> install plugin.
6. After restarting, on the welcome screen choose Configure -> Project defaults -> Project structure;
otherwise, when importing a project, IntelliJ IDEA Community Edition 13.1.5 reports "Cannot determine Java VM executable in selected JDK".
(Some people online say setting the environment variable JDK_HOME=${JAVA_HOME} also fixes this; not verified.)
7. Create a Scala project:
create new project -> choose Scala in the left pane (not Scala under Java) -> choose sbt on the right -> enter the project name and path. IDEA then installs the SBT tooling automatically, which takes a while (about 5 minutes here); once SBT is ready it creates the standard project directories for you.

Spark example source code:
https://https://www.sodocs.net/doc/184378004.html,/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples
or
/home/hadoop/spark102/examples/src/main/scala/org/apache/spark/examples

Part VII. Building a Spark IDE development environment with Eclipse
https://www.sodocs.net/doc/184378004.html,/art/201401/426592.htm

----------------------------- Maintenance log ----------------------------------------
1. 2014.10.31 Added node ubuntu02
2. 2014.10.31 Changed dfs.replication in hdfs-site.xml from 3 to 2
scp /home/hadoop/hadoop220/etc/hadoop/hdfs-site.xml hadoop@ubuntu03:/home/hadoop/hadoop220/etc/hadoop
scp /home/hadoop/hadoop220/etc/hadoop/hdfs-site.xml hadoop@ubuntu04:/home/hadoop/hadoop220/etc/hadoop
scp /home/hadoop/hadoop220/etc/hadoop/hdfs-site.xml hadoop@ubuntu05:/home/hadoop/hadoop220/etc/hadoop

3. 2014.11.01 Reduced the DFS block size from the 128 MB default to 20 MB:
<property>
  <name>dfs.block.size</name>
  <value>20971520</value>
</property>

scp /home/hadoop/hadoop220/etc/hadoop/hdfs-site.xml hadoop@ubuntu02:/home/hadoop/hadoop220/etc/hadoop
scp /home/hadoop/hadoop220/etc/hadoop/hdfs-site.xml hadoop@ubuntu03:/home/hadoop/hadoop220/etc/hadoop
scp /home/hadoop/hadoop220/etc/hadoop/hdfs-site.xml hadoop@ubuntu05:/home/hadoop/hadoop220/etc/hadoop
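Only files written after the change pick up the new block size. A quick check after redistributing the config and restarting HDFS (a sketch that reuses the stat command from the Spark section; the target file name is just an example):
hadoop fs -put /home/hadoop/mydata/kpi.txt /input/kpi_newblock.txt
hadoop fs -stat "%o" /input/kpi_newblock.txt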

4. 2014.11.07 Upgraded Spark to 1.1, Scala to 2.10.4 and Hadoop to 2.3; other configuration unchanged. (Except for Java, every component directory was renamed to a version-less name so future upgrades will not require touching the environment variables.)

JAVA_HOME=/home/hadoop/jdk1.7.0_67
HADOOP_HOME=/home/hadoop/hadoop
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export PATH
export CLASSPATH

export SCALA_HOME=/home/hadoop/scala
export PATH=$PATH:$SCALA_HOME/bin
export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:/home/hadoop/idea-IC-135.1289/bin


cp /home/hadoop/hadoop220/etc/hadoop/hadoop-env.sh /home/hadoop/hadoop/etc/hadoop/
cp /home/hadoop/hadoop220/etc/hadoop/yarn-env.sh /home/hadoop/hadoop/etc/hadoop/
cp /home/hadoop/hadoop220/etc/hadoop/slaves /home/hadoop/hadoop/etc/hadoop/
cp /home/hadoop/hadoop220/etc/hadoop/core-site.xml /home/hadoop/hadoop/etc/hadoop/
cp /home/hadoop/hadoop220/etc/hadoop/hdfs-site.xml /home/hadoop/hadoop/etc/hadoop/
cp /home/hadoop/hadoop220/etc/hadoop/mapred-site.xml /home/hadoop/hadoop/etc/hadoop/
cp /home/hadoop/hadoop220/etc/hadoop/yarn-site.xml /home/hadoop/hadoop/etc/hadoop/


scp -r /home/hadoop/scala hadoop@ubuntu02:/home/hadoop/
scp -r /home/hadoop/scala hadoop@ubuntu03:/home/hadoop/
scp -r /home/hadoop/scala hadoop@ubuntu05:/home/hadoop/

scp -r /home/hadoop/spark hadoop@ubuntu02:/home/hadoop/
scp -r /home/hadoop/spark hadoop@ubuntu03:/home/hadoop/
scp -r /home/hadoop/spark hadoop@ubuntu05:/home/hadoop/

scp -r /home/hadoop/hadoop hadoop@ubuntu02:/home/hadoop/
scp -r /home/hadoop/hadoop hadoop@ubuntu03:/home/hadoop/
scp -r /home/hadoop/hadoop hadoop@ubuntu05:/home/hadoop/

--------------------------------------------------------------------------
5. 2014.11.08 Added Hive
1) Install MySQL on node ubuntu02, following https://www.sodocs.net/doc/184378004.html,/wuhou/archive/2008/09/28/1301071.html
and https://www.sodocs.net/doc/184378004.html,/forum.php?mod=viewthread&tid=197019
(1). sudo apt-get install mysql-server (with this command the actual install location is /etc/mysql)

mysqladmin -u<username> -p<old_password> password <new_password>
mysqladmin -uroot password hadoop_123


mysqladmin -umysql -p hadoop_123
mysql -uroot -p


mysqladmin -u root -p version

/etc/init.d/mysql start|stop|restart

sudo /etc/init.d/mysql restart


Log in (root is the user, hadoop is the database):
mysql -u root -p hadoop
List the databases:
show databases;
Switch database:
use mysql;
List the tables:
show tables;





2) Configure MySQL for Hive
mysql -uroot -p
grant all privileges on *.* to mysql@'%' identified by 'mysql' with grant option;
create user 'hadoop' identified by 'hadoop_123';
grant all on *.* to hadoop@'%' with grant option;
grant all PRIVILEGES on hive.* to hadoop@'192.168.1.107' identified by 'hadoop_123';


quit;
mysql -uhadoop -p
create database hive;


On ubuntu04, install the client: sudo apt-get install mysql-client
Test the connection (note: I had not planned to test this, but starting Hive later failed, so I came back to check whether the problem was the Hive config or MySQL itself; this cost a whole afternoon):
mysql -uroot -p -h192.168.1.107
Error: ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.1.107' (111)
A JDBC connection (DBConTester.java) failed as well.
Fix:
In /etc/mysql/my.cnf, find
bind-address = 127.0.0.1
and comment it out. That fixed it (a sketch of the edit follows).

(This dragged on for a long time; the explanations found online blamed permissions, a port other than the default 3306, or extra my.cnf parameters, and at one point MySQL would not start at all.
I finally found the answer at https://www.sodocs.net/doc/184378004.html,/s/blog_60fcb5a10100qkyb.html)
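A minimal sketch of that edit on ubuntu02; it assumes the Ubuntu-packaged MySQL with its config at /etc/mysql/my.cnf, and backs the file up first:
sudo cp /etc/mysql/my.cnf /etc/mysql/my.cnf.bak
sudo sed -i 's/^bind-address/#bind-address/' /etc/mysql/my.cnf
sudo /etc/init.d/mysql restart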


3) Install Hive, following https://www.sodocs.net/doc/184378004.html,/forum.php?mod=viewthread&tid=197019
ubuntu02: hive metastore service
ubuntu04: hive client
(1) Install the hive metastore service
On ubuntu02.
Version: apache-hive-0.13.1-bin.tar.gz
tar -zxvf apache-hive-0.13.1-bin.tar.gz
mv apache-hive-0.13.1-bin hive
chown -R hadoop:hadoop hive

Set the environment variables:
export HIVE_HOME=/home/hadoop/hive
export PATH=$PATH:$HIVE_HOME/bin
Edit the configuration files:
cd hive/conf
cp hive-default.xml.template hive-site.xml
cp hive-env.sh.template hive-env.sh
cp hive-log4j.properties.template hive-log4j.properties

vi hive-env.sh
HADOOP_HOME=/home/hadoop/hadoop

vi hive-site.xml


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://ubuntu02:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop_123</value>
  <description>password to use against metastore database</description>
</property>

Install the Hive client on ubuntu04:
scp -r hive hadoop@ubuntu04:/home/hadoop/

Edit hive-site.xml on the client:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://ubuntu02:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
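For a remote metastore setup like this, the metastore service has to be running on ubuntu02 before the client can connect. A minimal sketch of starting it (the standard hive command, assuming HIVE_HOME/bin is on the PATH as configured above):
hive --service metastore &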


4) Start and test
(1) Start Hadoop.
(2) On ubuntu02:
starting Hive with the plain hive command works fine;
starting the web UI with hive --service hwi
fails with: cannot access /home/hadoop/hive/lib/hive-hwi-*.war: No such file or directory
Cause: hive-hwi-0.13.0.war is not shipped.

Per https://www.sodocs.net/doc/184378004.html,/wulantian/article/details/38271803 :
earlier releases shipped the hwi war in lib, but the 0.13 lib directory has no hive-hwi-0.13.0.war; this took a long time to figure out and little information could be found online.
Workaround: download the 0.13 source, find the web directory inside the hwi module, zip that directory by hand, rename the zip to a .war, copy it into lib, and then start hwi.

cd /home/hadoop/apache-hive-0.13.1-src/apache-hive-0.13.1-src/hwi/web
zip hive-hwi-0.13.1.zip ./*
mv hive-hwi-0.13.1.zip hive-hwi-0.13.1.war

scp hive-hwi-0.13.1.war hadoop@ubuntu02:/home/hadoop/hive/lib/
Set the hive.hwi.war.file property in hive-site.xml to: hive-hwi-0.13.1.war
http://192.168.1.107:9999/hwi/

Then another error (not solved for now):
Perhaps JAVA_HOME does not point to the JDK.

5) Hive example


Create a file user_info.txt:
shibenting,石本厅,30
yanghao,杨浩,29
mamingjian,马明杰,22

hadoop fs -put user_info.txt /input/

create table user_info (user_id STRING, user_name STRING, age int)
row format delimited
fields terminated by ','
lines terminated by '\n';

LOAD DATA INPATH '/input/user_info.txt' OVERWRITE INTO TABLE user_info;

select * from user_info;   -- works, but the Chinese characters come back garbled; not solved yet

6) Hive experiment: billing detail table dw_acct_shoulditem_ms (about 500 million rows), billing item table dim_acct_item, billing item categories dim_pub_itemstat.txt, and map_acct_summary_item.txt.
Goal: total up the revenue per billing item.
(1) Create the tables
create table dw_acct_shoulditem_ms (op_time DATE,user_id STRING,product_no STRING,item_code STRING,city_id SMALLINT,county_id INT,fee_01 DOUBLE,fee_02 DOUBLE,fee_03 DOUBLE,fee_04 DOUBLE,fee_05 DOUBLE,fee_06 DOUBLE,fee_07 DOUBLE)
row format delimited
fields terminated by ','
lines terminated by '\n';

create table dim_acct_item(item_code STRING,item_name STRING)
row format delimited
fields terminated by ','
lines terminated by '\n';

(2) Load the data
LOAD DATA INPATH '/input/dw_acct_shoulditem_ms.txt' OVERWRITE INTO TABLE dw_acct_shoulditem_ms;
(3) Queries
select * from dw_acct_shoulditem_ms where product_no='139********';   (took 2 min 16 s)
select sum(fee_02) from dw_acct_shoulditem_ms where product_no='139********';   (took 474.955 s)
select count(1) from dw_acct_shoulditem_ms where product_no='139********';





----------- 2014.11.09: passwordless SSH suddenly stopped working; tried every suggestion found online (cost most of a day plus the first evening) ---------------------
The advice online mostly blames permissions on /home/<user>, on .ssh, or on authorized_keys (none of it solved the problem in the end).
Debug with:
ssh -v -v -v hadoop@ubuntu02
ssh -v -v -v hadoop@ubuntu03

Login log:
grep sshd /var/log/auth.log

Summary of the advice found online:
the home directory must be 755 or 700, never 77x (e.g. if you log in as user hadoop, that means /home/hadoop)
the .ssh directory must be 755
id_rsa.pub and authorized_keys must be 644
id_rsa must be 600



----- Later I found the ssh-copy-id -i command, which appends id_rsa.pub to authorized_keys automatically. The steps below fixed the problem completely (which suggests the earlier manual copy-and-paste broke something invisibly, e.g. the character encoding):
cd .ssh
rm *
ssh-keygen -t rsa -P ''

ubuntu02
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu02
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu03
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu04
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu05
ubuntu03
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu02
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu03
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu04
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu05

ubuntu04
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu02
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu03
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu04
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu05

ubuntu05
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu02
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu03
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu04
ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@ubuntu05












