Sunday, July 22, 2018

[Spark] Spark Basic Installation and Setup


There are many ways to install Spark. This post walks through one basic approach (the Python flavor) so you can see which packages are needed to bring a cluster up, using the standalone cluster manager. If you are comfortable with Docker or Ambari, installing through those is of course the better choice.

The installation steps are listed below.


JAVA
Check the Java version
java -version
Update apt-get
sudo apt-get update
Install the JDK (the target package is java-8-openjdk-amd64)
sudo apt-get install default-jdk

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk

Check the Java installation path
update-alternatives --display java
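A quick cross-check of where the JDK actually lives before JAVA_HOME is wired into .bashrc later (paths are the usual Ubuntu ones; adjust to what update-alternatives reports):

readlink -f /usr/bin/java
ls /usr/lib/jvm/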
Cluster network setup
Edit the interfaces network configuration file
sudo gedit /etc/network/interfaces
======================================================
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet static
address 192.168.0.41
netmask 255.255.255.0
network 192.168.0.0
broadcast 192.168.0.255
gateway 192.168.0.254
# dns-* options are implemented by the resolvconf package, if installed
dns-nameservers 192.168.0.1
dns-search nimbus.com
======================================================
Edit the interfaces network configuration file - worker (child) nodes
Same as above, but change the address to 192.168.0.122
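After editing interfaces on each node, the network has to be restarted (or the machine rebooted) for the static address to take effect; the exact command depends on the Ubuntu release, for example:

sudo /etc/init.d/networking restart
ip addr show eth0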
Edit the hostname
sudo gedit /etc/hostname

master
Edit the hostname - worker (child) nodes
BigDataTest04
Edit the hosts file
sudo gedit /etc/hosts

127.0.0.1 localhost
127.0.1.1 hadoop
192.168.0.41 master
192.168.0.42 data1
192.168.0.121 BigDataTest03
192.168.0.122 BigDataTest04
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Edit the hosts file - worker (child) nodes
Same as above
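Once /etc/hosts is identical on every machine, a quick check that the names resolve and the nodes can reach one another (hostnames follow the table above):

ping -c 1 master
ping -c 1 data1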
Environment variables
sudo gedit ~/.bashrc
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

#SCALA Variables
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin

#SPARK Variables
export SPARK_HOME=/usr/local/spark2
export PATH=$PATH:$SPARK_HOME/bin

#ANACONDA Variables
export PATH=/home/charles/anaconda2/bin:$PATH
export ANACONDA_PATH=/home/charles/anaconda2
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python

#CASSANDRA Variables
export CASSANDRA_HOME=/usr/local/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin

#KAFKA Variables
export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:$KAFKA_HOME/bin
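After saving .bashrc, reload it in the current shell and spot-check a few variables; the paths must match where the packages are actually placed in the steps below:

source ~/.bashrc
echo $JAVA_HOME
echo $HADOOP_HOME
echo $SPARK_HOME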
Hadoop
Download Hadoop
Hadoop 2.6.x pairs with Spark 2.0.
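If you still need to fetch the tarball, the Apache archive keeps old releases (any mirror that carries hadoop-2.6.5.tar.gz also works):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz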

sudo tar zxvf hadoop-2.6.5.tar.gz
sudo mv hadoop-2.6.5 /usr/local/hadoop2.6.5
Note that the configuration below (and HADOOP_HOME in .bashrc) refers to /usr/local/hadoop, so either rename the folder to that or add a symlink, e.g. sudo ln -s /usr/local/hadoop2.6.5 /usr/local/hadoop.
Check the installation directory
ll /usr/local/hadoop2.6.5
Edit core-site.xml to set the default HDFS name
sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.0.41:9000</value>
  </property>
</configuration>
Edit hdfs-site.xml to set replication and the NameNode storage directory
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
Edit hdfs-site.xml on the worker (child) nodes - DataNode storage directory
The data nodes use the DataNode storage property instead of the NameNode one; the rest of the file is the same as on the master:
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
Edit the masters file (master node only)
gedit /usr/local/hadoop/etc/hadoop/masters

master
Edit the slaves file (master node only)
gedit /usr/local/hadoop/etc/hadoop/slaves

data1
data2
data3
Copy to the other machines
You can finish the configuration on the master first, copy it to the other nodes, and then adjust the node-specific parameters, as sketched below.
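A minimal sketch of that copy step, reusing the charles account and the data1 hostname from this post (repeat for each data node, then fix the node-specific settings such as hdfs-site.xml):

sudo scp -r /usr/local/hadoop charles@data1:/usr/local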
Create the HDFS directories (master node)
Remove any existing hdfs directories
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
Create the NameNode storage directory
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
Change the owner of the directory to the admin user
sudo chown -R charles:charles /usr/local/hadoop

Create the HDFS directories on the worker (child) nodes
ssh data1
Remove any existing hdfs directories
sudo rm -rf /usr/local/hadoop/hadoop_data/hdfs
Create the DataNode storage directory
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
Change the owner of the directory to the admin user
sudo chown -R charles:charles /usr/local/hadoop
exit
Format HDFS
hadoop namenode -format
Start HDFS
start-dfs.sh (and stop-dfs.sh to stop)
HDFS web UI
http://192.168.0.41:50070
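To confirm the daemons actually came up, the standard JDK and Hadoop tools are enough (run jps on each node; the process list differs between the master and the data nodes):

jps
hdfs dfsadmin -report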
Spark
Scala
tar xvf scala-2.12.0.tgz
sudo mv scala-2.12.0 /usr/local/scala2.12.0
(Note: SCALA_HOME in .bashrc points to /usr/local/scala, so rename the folder to that or add a symlink.)

Running scala on the command line confirms the installation.
Download Spark
Choose the prebuilt package matching your Hadoop version: the hadoop2.7 build for Hadoop 2.7, the hadoop2.6 build for Hadoop 2.6 (both are shown below).
tar zxf spark-2.2.0-bin-hadoop2.7.tgz
tar zxf spark-2.0.2-bin-hadoop2.6.tgz

sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark2.2.0_h2.7
sudo mv spark-2.0.2-bin-hadoop2.6 /usr/local/spark2

Running pyspark on the command line confirms the installation.
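As an additional local smoke test before any cluster settings exist, the Pi example bundled with the Spark distribution can be submitted (path follows the spark2 layout above):

spark-submit /usr/local/spark2/examples/src/main/python/pi.py 10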
Modify the Spark configuration - logging
cd /usr/local/spark2.2.0_h2.7/conf
cd /usr/local/spark2/conf

cp log4j.properties.template  log4j.properties
sudo gedit log4j.properties
sudo vim log4j.properties

Change INFO to WARN so that INFO-level log messages no longer appear, as shown below.
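Concretely, the line to edit in log4j.properties (as it appears in the Spark 2.x template) is:

log4j.rootCategory=WARN, console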
Set up the Spark standalone cluster
cd /usr/local/spark2.2.0_h2.7/conf
cp spark-env.sh.template  spark-env.sh
sudo gedit spark-env.sh
sudo vim spark-env.sh


# Example 1
export SPARK_MASTER_IP=192.168.0.128
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_INSTANCES=4

# Example 2
export SPARK_MASTER_IP=192.168.95.128
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_INSTANCES=3
export SPARK_WORKER_INSTANCES=3
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=172800"

With the second example, each machine starts 3 workers of 1 core and 2 GB each, so every node contributes 3 cores and 6 GB to the cluster; the cleanup options purge finished application data after two days.

Copy Spark to the worker (child) nodes
ssh data1
Create the spark directory
sudo mkdir /usr/local/spark
Change the owner of the directory to the admin user
sudo chown charles:charles /usr/local/spark
exit

or simply copy the whole folder from the master in one step:

sudo scp -r /usr/local/spark charles@data1:/usr/local
Configure the slaves file (master node only)
cp slaves.template slaves
sudo vim slaves
192.168.95.132
192.168.95.133
192.168.95.158
192.168.95.159
192.168.95.160

Configure the Cassandra IPs (master node only)
cd /usr/local/spark/conf
cp spark-defaults.conf.template  spark-defaults.conf
sudo gedit spark-defaults.conf

spark.cassandra.connection.host 192.168.0.41,192.168.0.42
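This property only has an effect when the spark-cassandra-connector is on the classpath; a hedged example of pulling it in at launch time (the artifact version is an assumption, check Maven Central for the release matching your Spark and Scala build):

pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.7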
Start the cluster
/usr/local/spark/sbin/start-all.sh
Spark web UI
http://192.168.0.41:8080
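Besides checking the web UI, submitting the bundled Pi example against the master URL confirms that the cluster really accepts work (master resolves via the hosts file above):

spark-submit --master spark://master:7077 $SPARK_HOME/examples/src/main/python/pi.py 10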
Upgrade commands
wget http://ftp.twaren.net/Unix/Web/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar zxf spark-2.2.0-bin-hadoop2.7.tgz
sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark2.2.0_h2.7
sudo chown charles:charles  /usr/local/spark2.2.0_h2.7
sudo vim ~/.bashrc
cat ~/.bashrc
source ~/.bashrc
cd /usr/local/spark2.2.0_h2.7/conf
cp log4j.properties.template  log4j.properties
vim log4j.properties
--------------------------------
log4j.rootCategory=WARN, console
--------------------------------
cp spark-env.sh.template  spark-env.sh
vim spark-env.sh
--------------------------------
export SPARK_MASTER_IP=192.168.95.128
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_EXECUTOR_INSTANCES=4
#export SPARK_WORKER_INSTANCES=4

sudo apt-get install ssh
===== The following is needed on the master node only =====
cp slaves.template slaves
vim slaves
--------------------------------
192.168.95.132
192.168.95.133
192.168.95.158
192.168.95.159
192.168.95.160
--------------------------------
cp spark-defaults.conf.template  spark-defaults.conf
vim spark-defaults.conf
--------------------------------
spark.cassandra.connection.host 192.168.95.127,192.168.95.122,192.168.95.123
--------------------------------

Python
Anaconda
bash Anaconda2-2.5.0-Linux-x86_64.sh -b
Jupyter
# start IPython Notebook against the standalone cluster
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=spark://master:7077 pyspark --total-executor-cores 3 --executor-memory 1g

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
JDBC
Fetch MSSQL data via JDBC
https://www.microsoft.com/en-us/download/details.aspx?id=11774

Unzip the download and locate sqljdbc42.jar
Move it to a directory of your choice, then add it to the launch command:
spark-submit ...
--driver-class-path /usr/local/spark2/jars/sqljdbc42.jar
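A sketch of the full launch line; adding --jars as well makes the driver available to the executors, and read_mssql.py is just a placeholder script name:

pyspark --driver-class-path /usr/local/spark2/jars/sqljdbc42.jar --jars /usr/local/spark2/jars/sqljdbc42.jar
spark-submit --driver-class-path /usr/local/spark2/jars/sqljdbc42.jar --jars /usr/local/spark2/jars/sqljdbc42.jar read_mssql.py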

Example code
jdbcDF = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://192.168.95.74;databaseName=MftBasis;user=sa;password=p@ssw0rd") \
    .option("dbtable", "dbo.ConversationMember") \
    .load()

df = jdbcDF.select("serial", "cgroupname", "userid")


Spark history server
The UI on port 4040 is only available while an application is running, so the history has to be persisted through the history server; storing it on shared storage such as HDFS is recommended.

1. Rename spark-defaults.conf.template
cd conf
mv spark-defaults.conf.template spark-defaults.conf
2. Edit spark-defaults.conf (for writing the event log)
spark.eventLog.enabled true
spark.eventLog.dir hdfs://192.168.0.1:9000/directory

3. Edit spark-env.sh (for reading the event log)
retainedApplications: how many applications the history server keeps in memory
vi spark-env.sh
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=30
-Dspark.history.fs.logDirectory=hdfs://192.168.0.1:9000/directory"

4. Distribute the files to all nodes (xsync is a cluster sync helper script; plain scp works as well)
xsync spark-defaults.conf
xsync spark-env.sh

5. Start the service
sbin/start-history-server.sh

6. Run a job, then check it at 192.168.0.1:18080 (see the sketch below).
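A minimal end-to-end check, assuming the event log directory name from the config above: create it in HDFS, run any job, then open the history UI.

hdfs dfs -mkdir -p hdfs://192.168.0.1:9000/directory
spark-submit --master spark://master:7077 $SPARK_HOME/examples/src/main/python/pi.py 10

Then browse http://192.168.0.1:18080 to see the completed application.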

Ref:
  1. 林大貴, Python+Spark 2.0+Hadoop機器學習與大數據分析實戰, 博碩
  2. Spark history server configuration

