Spark Basic Installation and Setup
There are many ways to install Spark. This post walks through one basic approach (Python version) so you can see which packages are needed to bring a cluster up, using the standalone cluster manager. If you are familiar with Docker or Ambari, installing through those is of course recommended instead.
The installation steps are as follows.
JAVA
Check the installed Java version
java -version
Update apt-get
sudo apt-get update
Install the JDK (java-8-openjdk-amd64)
sudo apt-get install default-jdk
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk

Check the Java installation path
update-alternatives --display java
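To double-check the JDK path that JAVA_HOME will point to later in ~/.bashrc, you can also resolve the java binary directly (a quick sanity check, not one of the original steps):
# Resolve the actual JDK installation path behind the java command
readlink -f $(which java)
# Typically prints something like /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java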

Cluster network configuration

Edit the interfaces network configuration file
sudo gedit /etc/network/interfaces
======================================================
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet static
address 192.168.0.41
netmask 255.255.255.0
network 192.168.0.0
broadcast 192.168.0.255
gateway 192.168.0.254
# dns-* options are implemented by the resolvconf package, if installed
dns-nameservers 192.168.0.1
dns-search nimbus.com
======================================================

Edit the interfaces network configuration file - worker node
Same as above, but change the address to 192.168.0.122
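The new static IP only takes effect after networking is restarted; a minimal check, assuming the classic ifupdown setup shown above:
# Restart networking and confirm the static address is active
sudo /etc/init.d/networking restart
ifconfig eth0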

Edit the hostname
sudo gedit /etc/hostname
master

Edit the hostname - worker node
BigDataTest04

Edit the hosts file
sudo gedit /etc/hosts
127.0.0.1 localhost
127.0.1.1 hadoop
192.168.0.41 master
192.168.0.42 data1
192.168.0.121 BigDataTest03
192.168.0.122 BigDataTest04
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Edit the hosts file - worker node
Same as above
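A quick way to confirm the hosts entries work from every node (hostnames taken from the hosts file above):
# Each name should resolve and answer from any node
ping -c 2 master
ping -c 2 data1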

Environment variables

Edit ~/.bashrc
sudo gedit ~/.bashrc
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
#SCALA Variables
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
#SPARK Variables
export SPARK_HOME=/usr/local/spark2
export PATH=$PATH:$SPARK_HOME/bin
#ANACONDA Variables
export PATH=/home/charles/anaconda2/bin:$PATH
export ANACONDA_PATH=/home/charles/anaconda2
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python
#CASSANDRA Variables
export CASSANDRA_HOME=/usr/local/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin
#KAFKA Variables
export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:$KAFKA_HOME/bin
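After saving ~/.bashrc, reload it and spot-check a few of the variables (a simple sanity check, not one of the original steps):
source ~/.bashrc
echo $JAVA_HOME
echo $HADOOP_HOME
echo $SPARK_HOME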

Hadoop

Download Hadoop (2.6.x to pair with Spark 2.0)
sudo tar zxvf hadoop-2.6.5.tar.gz
sudo mv hadoop-2.6.5 /usr/local/hadoop2.6.5
Check the installation directory
ll /usr/local/hadoop2.6.5
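Note that ~/.bashrc and the XML files below refer to /usr/local/hadoop while the archive was moved to /usr/local/hadoop2.6.5; one way to reconcile the two (an assumption, the original setup may simply have used /usr/local/hadoop directly) is a symbolic link:
# Point /usr/local/hadoop at the versioned install directory
sudo ln -s /usr/local/hadoop2.6.5 /usr/local/hadoop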
Edit core-site.xml to set the default HDFS name
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.0.41:9000</value>
</property>
</configuration>

Edit hdfs-site.xml to set the HDFS replication and NameNode directory
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>

Edit hdfs-site.xml - worker node (DataNode storage directory)
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
</configuration>

Edit the masters file - master node only
gedit /usr/local/hadoop/etc/hadoop/masters
master

Edit the slaves file - master node only
gedit /usr/local/hadoop/etc/hadoop/slaves
data1
data2
data3

Copy to the other machines
You can finish the configuration on master first, copy it to each node, and then adjust the node-specific settings (for example with scp, as sketched below).
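A minimal sketch of copying the configured Hadoop directory from master to a worker (user name and host taken from the examples above):
# Copy the whole Hadoop directory to data1, then fix the ownership there
sudo scp -r /usr/local/hadoop charles@data1:/usr/local
ssh data1 "sudo chown -R charles:charles /usr/local/hadoop"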

Create the HDFS directories - master node
Remove any existing hdfs directories
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
Create the NameNode storage directory
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
Change the directory owner to the admin user
sudo chown -R charles:charles /usr/local/hadoop

Create the HDFS directories - worker node
ssh data1
Remove any existing hdfs directories
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
Create the DataNode storage directory
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
Change the directory owner to the admin user
sudo chown -R charles:charles /usr/local/hadoop
exit

Format HDFS
hadoop namenode -format

Start HDFS
start-dfs.sh (stop with stop-dfs.sh)

HDFS UI
http://192.168.0.41:50070
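To confirm the daemons actually came up before moving on to Spark (a quick check, assuming the steps above succeeded):
# On master, NameNode and SecondaryNameNode should be listed; on workers, DataNode
jps
# Summary of live DataNodes and HDFS capacity
hdfs dfsadmin -report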

Spark

Scala
tar xvf scala-2.12.0.tgz
sudo mv scala-2.12.0 /usr/local/scala2.12.0
Run scala at the command line to confirm the installation.
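SCALA_HOME in ~/.bashrc above points to /usr/local/scala while the archive was moved to /usr/local/scala2.12.0; a symbolic link is one way to make the two match (an assumption, mirroring the Hadoop note earlier):
sudo ln -s /usr/local/scala2.12.0 /usr/local/scala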

Download Spark
Pick the prebuilt package that matches your Hadoop version (the hadoop2.7 build for Hadoop 2.7.x, the hadoop2.6 build for Hadoop 2.6.x).
tar zxf spark-2.2.0-bin-hadoop2.7.tgz
tar zxf spark-2.0.2-bin-hadoop2.6.tgz
sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark2.2.0_h2.7
sudo mv spark-2.0.2-bin-hadoop2.6 /usr/local/spark2
Run pyspark at the command line to confirm the installation.
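A quick smoke test of the unpacked Spark before any cluster configuration (run-example ships with the Spark distribution and runs locally unless a master is configured):
# Computes an approximation of Pi; a "Pi is roughly ..." line confirms Spark runs
/usr/local/spark2/bin/run-example SparkPi 10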

Edit the Spark configuration - logging
cd /usr/local/spark2.2.0_h2.7/conf (or cd /usr/local/spark2/conf)
cp log4j.properties.template log4j.properties
sudo gedit log4j.properties (or sudo vim log4j.properties)
Change INFO to WARN so that INFO-level log messages no longer appear.
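The relevant line in log4j.properties looks like this once changed (the Spark 2.x template defaults to INFO):
# Before: log4j.rootCategory=INFO, console
log4j.rootCategory=WARN, console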

Set up the Spark standalone cluster
cd /usr/local/spark2.2.0_h2.7/conf
cp spark-env.sh.template spark-env.sh
sudo gedit spark-env.sh (or sudo vim spark-env.sh)
Example configuration 1:
export SPARK_MASTER_IP=192.168.0.128
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_INSTANCES=4
Example configuration 2:
export SPARK_MASTER_IP=192.168.95.128
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_INSTANCES=3
export SPARK_WORKER_INSTANCES=3
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=172800"

Copy Spark to the worker nodes
ssh data1
Create the spark directory
sudo mkdir /usr/local/spark
Change the directory owner to the admin user
sudo chown charles:charles /usr/local/spark
exit
Or copy it over directly:
sudo scp -r /usr/local/spark charles@data1:/usr/local

Configure slaves - master node only
cp slaves.template slaves
sudo vim slaves
192.168.95.132
192.168.95.133
192.168.95.158
192.168.95.159
192.168.95.160

Set the Cassandra IP - master node only
cd /usr/local/spark/conf
cp spark-defaults.conf.template spark-defaults.conf
sudo gedit spark-defaults.conf
spark.cassandra.connection.host 192.168.0.41,192.168.0.42
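With the connection host set, jobs still need the Cassandra connector on the classpath; one hedged way to pull it in at launch time (the package coordinates here are an assumption and must match your Spark and Scala versions):
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.5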

Start the cluster
/usr/local/spark/sbin/start-all.sh

Spark UI
http://192.168.0.41:8080
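A quick way to confirm the standalone cluster is actually up (assuming the start script above succeeded):
# On master a Master process should be listed, on each worker a Worker process
jps
# Connect a shell to the cluster; the UI on port 8080 should then show the application
pyspark --master spark://master:7077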

Upgrade commands
wget http://ftp.twaren.net/Unix/Web/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar zxf spark-2.2.0-bin-hadoop2.7.tgz
sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark2.2.0_h2.7
sudo chown charles:charles /usr/local/spark2.2.0_h2.7
sudo vim ~/.bashrc
cat ~/.bashrc
source ~/.bashrc
cd /usr/local/spark2.2.0_h2.7/conf
cp log4j.properties.template log4j.properties
vim log4j.properties
--------------------------------
WARN
--------------------------------
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
--------------------------------
export SPARK_MASTER_IP=192.168.95.128
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_EXECUTOR_INSTANCES=4
#export SPARK_WORKER_INSTANCES=4
sudo apt-get install ssh
===== master node only from here =====
cp slaves.template slaves
vim slaves
--------------------------------
192.168.95.132
192.168.95.133
192.168.95.158
192.168.95.159
192.168.95.160
--------------------------------
cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
--------------------------------
spark.cassandra.connection.host 192.168.95.127,192.168.95.122,192.168.95.123
--------------------------------

Python

Anaconda
bash Anaconda2-2.5.0-Linux-x86_64.sh -b
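After the installer finishes, confirm that the Anaconda Python from the PATH settings in ~/.bashrc is the one being picked up (a simple check):
which python
python --version   # should report the Anaconda 2.x build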

Jupyter
# Start IPython Notebook against the standalone cluster
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=spark://master:7077 pyspark --total-executor-cores 3 --executor-memory 1g
# Start IPython Notebook locally
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

JDBC

Fetch MSSQL data via JDBC
https://www.microsoft.com/en-us/download/details.aspx?id=11774
Unzip the download and locate sqljdbc42.jar, then move it to the target directory.
spark-submit ... --driver-class-path /usr/local/spark2/jars/sqljdbc42.jar
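For the interactive example that follows, one hedged way to launch pyspark with the MSSQL driver visible to both the driver and the executors (jar path assumed from above):
pyspark --driver-class-path /usr/local/spark2/jars/sqljdbc42.jar --jars /usr/local/spark2/jars/sqljdbc42.jar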

Commands
jdbcDF = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://192.168.95.74;databaseName=MftBasis;user=sa;password=p@ssw0rd") \
    .option("dbtable", "dbo.ConversationMember") \
    .load()
df = jdbcDF.select("serial", "cgroupname", "userid")
|
Spark歷史服務紀錄
4040端口只有在運行中才看的到, 必須透過歷史服務存起來, 建議存到HDFS等共用空間
1. 修改spark-defaults.conf.template
cd conf >>
mv spark-defaults.conf.template spark-defaults.conf
2. 修改spark-defaults.conf 寫入用
spark.evenLog.enabled = true
spark.evenLog.dir = hdfs://192.168.0.1:9000/directory
3. 修改spark-env.sh, 讀取用
retainedApplications :內存保存多少筆
vi spark-env.sh
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=30
-Dspark.history.fs.logDirectory=hdfs://192.168.0.1:9000/directory"
4. 分發給多台
xsync spark-defaluts.conf
xsync spark-env.sh-Dspark.history.retainedApplications=30
-Dspark.history.fs.logDirectory=hdfs://192.168.0.1:9000/directory"
4. 分發給多台
xsync spark-defaluts.conf
5. 啟動服務
sbin/start-history-server.sh
6.啟動任務, 然後在192.168.0.1:18080看
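Note: the event-log directory should exist before applications start writing to it; a minimal sketch, with the HDFS path assumed from the configuration above:
hdfs dfs -mkdir -p hdfs://192.168.0.1:9000/directory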
Ref:
- 林大貴, Python+Spark 2.0+Hadoop機器學習與大數據分析實戰, 博碩 (DrMaster)
- Spark history server configuration