

Raspberry Pi - PySpark Hadoop Cluster - 6 Nodes

 

 

Parts:

 

Put it together:

Video: Assembly of a cloudlet case (by C4 Labs)
https://www.youtube.com/watch?v=ZUDcgoOgY_A

 

Give your Pis hostnames:

I have 6-8 shells open while going through this exercise.
Update the hostname file on each of your Pis.
I have a 6-node cluster, so each Pi gets a 6-bit identifier with a single bit set marking its slot within the case.

sudo nano /etc/hostname

Set each Pi's hostname in its file (one name per machine):

ubuntu-pi-100000
ubuntu-pi-010000
ubuntu-pi-001000
ubuntu-pi-000100
ubuntu-pi-000010
ubuntu-pi-000001


Give your Pis static IPs:

Update the netplan configuration on each of your Pis

sudo nano /etc/netplan/50-cloud-init.yaml

Modify the network file to match your adapter and IPs.

network:
  ethernets:
    eth0:
        addresses:
        - <This Pi's IP>/24
        dhcp4: false
        gateway4: 192.168.1.1
        nameservers:
            addresses:
            - 192.168.1.1
            search: []
  version: 2

 

Reload your network configuration

sudo netplan apply
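
With the static IPs in place, make sure every Pi can resolve every hostname. If your router/DNS doesn't handle this, one simple option is to map the names in /etc/hosts on each node (the IPs below are placeholders - substitute the addresses you assigned above):

#append to /etc/hosts on every Pi (example IPs)
192.168.1.101 ubuntu-pi-100000
192.168.1.102 ubuntu-pi-010000
192.168.1.103 ubuntu-pi-001000
192.168.1.104 ubuntu-pi-000100
192.168.1.105 ubuntu-pi-000010
192.168.1.106 ubuntu-pi-000001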

 

Create Passwordless SSH:

Create the SSH config file ~/.ssh/config on your NameNode (ubuntu-pi-100000):

Host ubuntu-pi-100000
	HostName ubuntu-pi-100000
	User ubuntu
	IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-010000
	HostName ubuntu-pi-010000
	User ubuntu
	IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-001000
	HostName ubuntu-pi-001000
	User ubuntu
	IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-000100
	HostName ubuntu-pi-000100
	User ubuntu
	IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-000010
	HostName ubuntu-pi-000010
	User ubuntu
	IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-000001
	HostName ubuntu-pi-000001
	User ubuntu
	IdentityFile ~/.ssh/hadoopuser

 

Create an SSH key for the cluster

Use ssh-keygen on your NameNode. Running it with sudo leaves the files owned by root, which is why we change ownership in the next step.

sudo ssh-keygen -f ~/.ssh/sshkey_rsa -t rsa -P ""

 

Change the owner of the files we just created

sudo chown ubuntu ~/.ssh/sshkey_rsa
sudo chown ubuntu ~/.ssh/sshkey_rsa.pub

Append the public key to authorized_keys (locally and on each node), then copy the private key and SSH config to every machine

I renamed my private key to hadoopuser. You will be prompted for passwords while doing the secure copy.

sudo mv ~/.ssh/sshkey_rsa ~/.ssh/hadoopuser

cat ~/.ssh/sshkey_rsa.pub >> ~/.ssh/authorized_keys

sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-010000 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-010000:~/.ssh

sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-001000 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-001000:~/.ssh

sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-000100 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-000100:~/.ssh

sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-000010 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-000010:~/.ssh

sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-000001 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-000001:~/.ssh

 

Server updates, software downloads and installs:

Log on to the first Pi for apt updates and upgrades

ssh ubuntu@ubuntu-pi-100000
sudo apt update
sudo apt -y upgrade

Now run these commands from the first node to update the other nodes. 

ssh ubuntu@ubuntu-pi-010000 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-001000 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-000100 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-000010 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-000001 "sudo apt update && sudo apt -y upgrade"

Install basic networking tools 

sudo apt -y install wireless-tools net-tools iw

Install Java: SSH onto each machine and install Java.
https://openjdk.java.net/install/

sudo apt-get -y install openjdk-8-jdk
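
As with the updates, you can push the Java install out to the workers from the NameNode rather than logging into each one (a sketch using the same hostnames as above):

for node in ubuntu-pi-010000 ubuntu-pi-001000 ubuntu-pi-000100 ubuntu-pi-000010 ubuntu-pi-000001; do
  ssh ubuntu@$node "sudo apt-get -y install openjdk-8-jdk"
done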

 

Install Hadoop (repeat on every Pi)

http://apache.mirrors.hoobly.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

wget http://apache.mirrors.hoobly.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz -P ~/downloads

Extract the compressed Hadoop archive to /usr/local and rename the folder to "hadoop":

sudo tar zxvf ~/downloads/hadoop-2* -C /usr/local
sudo mv /usr/local/hadoop-* /usr/local/hadoop
sudo chown -R ubuntu /usr/local/hadoop


Update Environment Variables on all Pi's

Open your ~/.bashrc file in your favorite editor (nano or vi) and add the following environment variables to the end:

#Hadoop
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_PREFIX=$HADOOP_HOME

Reload profile on each Pi

. ~/.bashrc
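
Rather than editing ~/.bashrc by hand on every Pi, one option is to edit it once on the NameNode and push it out (a sketch, assuming the same ubuntu user and paths on every node):

for node in ubuntu-pi-010000 ubuntu-pi-001000 ubuntu-pi-000100 ubuntu-pi-000010 ubuntu-pi-000001; do
  scp ~/.bashrc $node:~/.bashrc
done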

 

NameNode Configuration - (ubuntu-pi-100000): Create a data directory for the NameNode

sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
sudo chown -R ubuntu $HADOOP_HOME/hadoop_data

Update JAVA_HOME in $HADOOP_CONF_DIR/hadoop-env.sh (/usr/local/hadoop/etc/hadoop/hadoop-env.sh)
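
The line in hadoop-env.sh should end up matching the JAVA_HOME we exported in ~/.bashrc, something like:

#in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr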

 

NameNode Configuration - core-site.xml 

Modify the /usr/local/hadoop/etc/hadoop/core-site.xml configuration file.
Update it with the NameNode hostname and static user:

/usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://ubuntu-pi-100000:9000</value>
        </property>
       <property>
               <name>hadoop.http.staticuser.user</name>
               <value>ubuntu</value>
       </property>
</configuration>

 

NameNode Configuration - hdfs-site.xml

Modify the /usr/local/hadoop/etc/hadoop/hdfs-site.xml configuration file

<configuration>
<property>
        <name>dfs.replication</name>
        <value>3</value>
</property>
<property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
</property>
<property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ubuntu-pi-010000:50090</value>
</property>
</configuration>

 

NameNode Configuration - yarn-site.xml 

Modify the /usr/local/hadoop/etc/hadoop/yarn-site.xml configuration file

<configuration>
<!-- Site specific YARN configuration properties -->
<property>
   <name>yarn.nodemanager.recovery.enabled</name>
   <value>true</value>
</property>
<property>
   <name>yarn.resourcemanager.hostname</name>
   <value>ubuntu-pi-100000</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
</configuration>

NameNode Configuration - mapred-site.xml

<configuration>
<property>
        <name>mapreduce.cluster.acls.enabled</name>
        <value>false</value>
</property>
<property>
        <name>mapreduce.jobhistory.address</name>
        <value>ubuntu-pi-100000:10020</value>
</property>
<property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>ubuntu-pi-100000:19888</value>
</property>
  <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
  </property>
</configuration>


NameNode Configuration - Add NameNode entry to the masters file

echo "ubuntu-pi-100000"  >> $HADOOP_CONF_DIR/masters

NameNode Configuration - Specify the workers in the 'slaves' file (/usr/local/hadoop/etc/hadoop/slaves)

#create a new file with the first entry.
echo "ubuntu-pi-010000" > $HADOOP_CONF_DIR/slaves
#append the rest.
echo "ubuntu-pi-001000" >> $HADOOP_CONF_DIR/slaves
echo "ubuntu-pi-000100" >> $HADOOP_CONF_DIR/slaves
echo "ubuntu-pi-000010" >> $HADOOP_CONF_DIR/slaves
echo "ubuntu-pi-000001" >> $HADOOP_CONF_DIR/slaves

 

 

DataNode Configuration (ubuntu-pi-0xxxxx) : Copy files from the NameNode to the worker nodes

Copy the configuration files hdfs-site.xml, core-site.xml, and mapred-site.xml to the data nodes (ubuntu-pi-010000 through ubuntu-pi-000001).

scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-010000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-010000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-010000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR
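
If you prefer, the same copies can be expressed as a single loop run from the NameNode (hostnames as above):

for node in ubuntu-pi-010000 ubuntu-pi-001000 ubuntu-pi-000100 ubuntu-pi-000010 ubuntu-pi-000001; do
  scp $HADOOP_CONF_DIR/{hdfs-site,core-site,mapred-site}.xml $node:$HADOOP_CONF_DIR
done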

 

DataNode Configuration (ubuntu-pi-010000) : yarn-site.xml 

SSH onto the first data node / worker (ubuntu-pi-010000) and modify the yarn-site.xml configuration.

<configuration>
<property>
   <name>yarn.resourcemanager.hostname</name>
   <value>ubuntu-pi-100000</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
</configuration>

Copy the yarn-site.xml and the hadoop-env.sh file to the other data nodes

scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR

scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-000001:$HADOOP_CONF_DIR

 

Data Node Configuration (ubuntu-pi-010000) : Create Data Directories

#Run these from the first data node to create data directories on all data nodes.
sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode
sudo chown -R ubuntu $HADOOP_HOME

#creating remote data dirs
ssh ubuntu@ubuntu-pi-001000 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"
ssh ubuntu@ubuntu-pi-000100 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"
ssh ubuntu@ubuntu-pi-000010 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"
ssh ubuntu@ubuntu-pi-000001 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"

 

 

START HADOOP:

Format the Namenode, start the services and view the report : (ubuntu-pi-100000) 

hdfs namenode -format
#Start everything! (the start-all command is deprecated but still works as of 2.10)
#you could also use start-dfs.sh && start-yarn.sh
$HADOOP_HOME/sbin/start-all.sh
hdfs dfsadmin -report


 

Check Java - NameNode ubuntu-pi-100000:

jps

Check Java - Data Node / Workers ubuntu-pi-010000:

jps
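
If everything started cleanly you should see something like this (PIDs will differ); a SecondaryNameNode should also show up on ubuntu-pi-010000, since hdfs-site.xml points the secondary there.

#example jps output on the NameNode
2481 NameNode
2752 ResourceManager
3120 Jps

#example jps output on a worker
1893 DataNode
2004 NodeManager
2310 Jps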

 

 

Check Hadoop UI (hdfs manager)

Navigate to the Hadoop UI: http://ubuntu-pi-100000:50070

Check the resource manager (yarn ui)

Navigate to the Hadoop Cluster Metrics: http://ubuntu-pi-100000:8088/cluster?user.name=ubuntu

 

Test a MapReduce job:

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar pi 25 5
#or
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar pi 25 5
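
If the cluster is healthy, the job runs its 25 map tasks and ends with an estimate along these lines (your timing and estimate will differ):

Job Finished in 95.01 seconds
Estimated value of Pi is 3.14400000000000000000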

 

 

#useful hadoop commands: 
#file System Check
hdfs fsck /

All Healthy. 

Install Hive:

http://www.apache.org/dyn/closer.cgi/hive/

wget https://downloads.apache.org/hive/hive-2.3.6/apache-hive-2.3.6-bin.tar.gz -P ~/downloads

#untar
sudo tar zxvf ~/downloads/apache-hive* -C /usr/local
#rename to "hive'
sudo mv /usr/local/apache-hive-2.3.6-bin/ /usr/local/hive

Update ~/.bashrc with environment variables

export HIVE_HOME=/usr/local/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.

Edit the Hive config file ($HIVE_HOME/bin/hive-config.sh) to tell Hive where Hadoop lives.

sudo nano $HIVE_HOME/bin/hive-config.sh
#add to the bottom of the file 
export HADOOP_HOME=/usr/local/hadoop

 

Create and configure the $HIVE_HOME/conf/hive-site.xml file if it doesn't exist

I've added the connection data for a MySQL instance we will create in the next step.
Ensure the hive.metastore.warehouse.dir URL points to your Hadoop NameNode and that the port number is correct.

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password for connecting to mysql server</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://ubuntu-pi-100000:9000/user/hive/warehouse</value>
<description>Location of default database for the warehouse</description>
</property>
</configuration>

 

Create directories on HDFS for hive

hadoop fs -mkdir /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

 

Set up MySQL as the Hive metastore

The Hive metastore holds table definitions; the data behind those tables is spread across the Hadoop file system.

Install MySQL

sudo apt install -y mysql-server
sudo systemctl start mysql
sudo systemctl status mysql

 

Create a database for the Hive metastore

mysql -u root
mysql> create database hive;
mysql> create user 'hive' identified by 'hive';
mysql> grant all on hive.* to hive;
mysql> flush privileges;
mysql> exit;

mysql -u hive -p

mysql> show databases;



Download mysql-connector-java and add the connector jar to the hive/lib directory

wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.18.tar.gz
tar -xvf mysql-connector-java-8.0.18.tar.gz
sudo cp mysql-connector-java-8.0.18/mysql-connector-java-8.0.18.jar /usr/local/hive/lib/

 

#if present, move the local derby metastore_db directory out of the way; we're using MySQL
mv /usr/local/hive/metastore_db /usr/local/hive/metastore_db.tmp

 Initialize the mysql meta store: 

schematool -initSchema -dbType mysql --verbose
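
To confirm the schema actually landed in MySQL, you can list the metastore tables (using the hive/hive credentials created above):

mysql -u hive -p -e "USE hive; SHOW TABLES;"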

 

 

 Start Hive Metastore

hive --service metastore
#I had to leave hdfs safe mode
hdfs dfsadmin -safemode leave

Start Hive

hive

Test Hive 

This simple insert will kick off a full MapReduce job (two, actually); the select just reads the rows back.

create table rockstars (id integer, name string);
insert into rockstars (id, name) values (1, 'Nyella!');
select * from rockstars;
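
If everything is wired up, the select returns the row you just inserted, something like:

#example output (timing will differ)
OK
1	Nyella!
Time taken: 0.141 seconds, Fetched: 1 row(s)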

 

#Useful hive debugging commands: 
hive -hiveconf hive.root.logger=DEBUG,console

 

Spark

Install Spark on all nodes of the cluster

download it: 

wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz -P ~/downloads
sudo tar zxvf ~/downloads/spark* -C /usr/local
sudo mv /usr/local/spark-* /usr/local/spark
sudo chown -R ubuntu /usr/local/spark

Configure our master Spark node (also our NameNode, ubuntu-pi-100000)

Update /usr/local/spark/conf/spark-env.sh

cd /usr/local/spark/conf
#copy the template file and make a new spark-env.sh
cp spark-env.sh.template spark-env.sh
nano spark-env.sh

add the SPARK_DIST_CLASSPATH variable

export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*

Update the master node's ~/.bashrc file

#SPARK VARS
#spark and py spark variables - added by DR. 20200409
export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export TERM=xterm-color
#here we're actually changing the pyspark working directory and configuring pyspark to use jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip ubuntu-pi-100000 --notebook-dir=/home/ubuntu/notebooks'

Update the master node's slaves configuration file, /usr/local/spark/conf/slaves

If the file isn't available, create it by copying the template file slaves.template

cp slaves.template slaves

remove 'localhost' from this file and add your servers:

#A Spark Worker will be started on each of the machines listed below.

ubuntu-pi-010000
ubuntu-pi-001000
ubuntu-pi-000100
ubuntu-pi-000010
ubuntu-pi-000001


 
Now configure the worker nodes: on each worker, modify spark-env.sh

nano /usr/local/spark/conf/spark-env.sh  

Add the environment variables to the file 

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*

Update ~/.bashrc with Spark Variables

#SPARK VARS
export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_CONF_DIR=/usr/local/spark/conf

Start the spark cluster from your main node

/usr/local/spark/sbin/start-all.sh

This will read the workers/slaves file and initialize the worker nodes as well. 

Test Spark 

cd /usr/local/spark/bin
spark-shell --version
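
For a fuller end-to-end check, you can also submit the bundled SparkPi example to the standalone master (the jar name below assumes the spark-2.4.5-bin-hadoop2.7 build):

/usr/local/spark/bin/spark-submit \
  --master spark://ubuntu-pi-100000:7077 \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_2.11-2.4.5.jar 100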

 

View the Spark UI : http://ubuntu-pi-100000:8080/

 

PySpark:

PySpark comes with Spark; we just need to make sure Jupyter Notebook is available.

As noted earlier in this post, I set the PYSPARK_DRIVER_PYTHON variables, which instruct pyspark to launch Jupyter instead of the shell when run. This spins up a notebook server and prints the URL for accessing it.

#Install Jupyter notebooks
sudo apt install python-pip
pip install jupyter

#start pyspark and specify the Spark master node url
/usr/local/spark/bin/pyspark --master spark://ubuntu-pi-100000:7077

 

 

Go to the URL printed by Jupyter.

Helpful-ish!

#Useful jupyter notebook commands: 
jupyter troubleshoot 
#in general
jupyter notebook stop <port the notebook is running on> 
jupyter notebook list

#useful spark commands:
#start a single worker by hand, pointing it at the master
./start-slave.sh spark://ubuntu-pi-100000:7077



STARTING AND STOPPING HADOOP

1. All services: start-all.sh & stop-all.sh, available in the /usr/local/hadoop/sbin directory. Call "start-all.sh".

2. YARN (ResourceManager, NodeManager): start-yarn.sh & stop-yarn.sh, available in the /usr/local/hadoop/sbin directory.

3. HDFS (NameNode, DataNode): start-dfs.sh & stop-dfs.sh, available in the /usr/local/hadoop/sbin directory.

 

 Great references for default values and install instructions:

 https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/ClusterSetup.html

https://www.informit.com/articles/article.aspx?p=2190194&seqNum=2

https://spark.apache.org/docs/2.2.0/

https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/