
Parts:
Put it together:
Video: Assembly of a cloudlet case (by C4 Labs)
https://www.youtube.com/watch?v=ZUDcgoOgY_A
Give your Pis hostnames:
I have 6-8 shells open while going through this exercise.
Update the hostname file on each of your Pis.
I have a 6-node cluster, so each Pi's hostname gets a 6-bit pattern (a single 1 among 0s) identifying its place within the case.
sudo nano /etc/hostname
Set the hostname in the file; each Pi gets the single entry below that matches its position:
ubuntu-pi-100000
ubuntu-pi-010000
ubuntu-pi-001000
ubuntu-pi-000100
ubuntu-pi-000010
ubuntu-pi-000001
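Your Pis also need to be able to resolve each other's hostnames. If your router doesn't handle that for you, one option is adding entries to /etc/hosts on every Pi. The IPs below are just an example; use the static addresses you assign in the next step.
sudo nano /etc/hosts
#example entries - substitute your own static IPs
192.168.1.101 ubuntu-pi-100000
192.168.1.102 ubuntu-pi-010000
192.168.1.103 ubuntu-pi-001000
192.168.1.104 ubuntu-pi-000100
192.168.1.105 ubuntu-pi-000010
192.168.1.106 ubuntu-pi-000001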
Give your Pis IPs:
Update the netplan configuration on each of your Pis
sudo nano /etc/netplan/50-cloud-init.yaml
Modify the network file to match your adapter and IPs.
network:
  ethernets:
    eth0:
      addresses:
        - <This Pi's IP>/24
      dhcp4: false
      gateway4: 192.168.1.1
      nameservers:
        addresses:
          - 192.168.1.1
        search: []
  version: 2
Reload your network configuration
sudo netplan apply
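To sanity-check that the static address took effect (assuming eth0 as above):
ip addr show eth0
ping -c 3 192.168.1.1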
Create Passwordless SSH:
Create the SSH config file ~/.ssh/config on your NameNode (ubuntu-pi-100000)
Host ubuntu-pi-100000
    HostName ubuntu-pi-100000
    User ubuntu
    IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-010000
    HostName ubuntu-pi-010000
    User ubuntu
    IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-001000
    HostName ubuntu-pi-001000
    User ubuntu
    IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-000100
    HostName ubuntu-pi-000100
    User ubuntu
    IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-000010
    HostName ubuntu-pi-000010
    User ubuntu
    IdentityFile ~/.ssh/hadoopuser
Host ubuntu-pi-000001
    HostName ubuntu-pi-000001
    User ubuntu
    IdentityFile ~/.ssh/hadoopuser
Create an SSH key for the cluster
Use ssh-keygen on your NameNode.
sudo ssh-keygen -f ~/.ssh/sshkey_rsa -t rsa -P ""

Change the owner of the files we just created (running ssh-keygen with sudo leaves them owned by root)
sudo chown ubuntu ~/.ssh/sshkey_rsa
sudo chown ubuntu ~/.ssh/sshkey_rsa.pub
Move the public key to authorized keys and copy the public and private key to all machines
I renamed my private key to hadoopuser. You will be prompted for passwords while doing the secure copy.
sudo mv ~/.ssh/sshkey_rsa ~/.ssh/hadoopuser
cat ~/.ssh/sshkey_rsa.pub >> ~/.ssh/authorized_keys
sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-010000 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-010000:~/.ssh
sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-001000 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-001000:~/.ssh
sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-000100 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-000100:~/.ssh
sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-000010 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-000010:~/.ssh
sudo cat ~/.ssh/sshkey_rsa.pub | ssh ubuntu-pi-000001 "cat >> ~/.ssh/authorized_keys"
scp ~/.ssh/hadoopuser ~/.ssh/config ubuntu-pi-000001:~/.ssh
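At this point each worker should accept a key-based login from the NameNode without a password prompt. A quick check, for example:
ssh ubuntu-pi-010000 "hostname"
ssh ubuntu-pi-000001 "hostname"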
Server updates, software downloads and installs:
Log on to the first Pi for apt updates and upgrades
ssh ubuntu@ubuntu-pi-100000
sudo apt update
sudo apt -y upgrade
Now run these commands from the first node to update the other nodes.
ssh ubuntu@ubuntu-pi-010000 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-001000 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-000100 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-000010 "sudo apt update && sudo apt -y upgrade"
ssh ubuntu@ubuntu-pi-000001 "sudo apt update && sudo apt -y upgrade"
Install basic networking tools
sudo apt -y install wireless-tools net-tools iw
Install Java: SSH onto each machine and install Java.
https://openjdk.java.net/install/
sudo apt-get -y install openjdk-8-jdk
Install Hadoop
http://apache.mirrors.hoobly.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
wget http://apache.mirrors.hoobly.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz -P ~/downloads
extract compressed hadoop file to /usr/local
rename the folder "hadoop"
sudo tar zxvf ~/downloads/hadoop-2* -C /usr/local
sudo mv /usr/local/hadoop-* /usr/local/hadoop
sudo chown -R ubuntu /usr/local/hadoop
Update Environment Variables on all Pis
Open your ~/.bashrc file in your favorite editor (nano or vi) and add the following environment variables to the end:
#Hadoop
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
Reload profile on each Pi
. ~/.bashrc
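A quick check that the variables stuck and Hadoop is on the PATH:
echo $HADOOP_HOME
hadoop version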
NameNode Configuration - (ubuntu-pi-100000): Create a data directory for the NameNode
sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
Update JAVA_HOME in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
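For example, matching the JAVA_HOME already exported in ~/.bashrc above:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
#point Hadoop at the same Java install used in ~/.bashrc
export JAVA_HOME=/usr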

NameNode Configuration - core-site.xml
Modify the /usr/local/hadoop/etc/hadoop/core-site.xml configuration file.
Update it with the NameNode hostname and static user:
/usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ubuntu-pi-100000:9000</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>ubuntu</value>
  </property>
</configuration>
NameNode Configuration - hdfs-site.xml
Modify the /usr/local/hadoop/etc/hadoop/hdfs-site.xml configuration file
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>ubuntu-pi-010000:50090</value>
  </property>
</configuration>
NameNode Configuration - yarn-site.xml
Modify the /usr/local/hadoop/etc/hadoop/yarn-site.xml configuration file
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>ubuntu-pi-100000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
</configuration>
NameNode Configuration - mapred-site.xml
Modify the /usr/local/hadoop/etc/hadoop/mapred-site.xml configuration file (if it doesn't exist yet, copy it from mapred-site.xml.template)
<configuration>
  <property>
    <name>mapreduce.cluster.acls.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>ubuntu-pi-100000:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>ubuntu-pi-100000:19888</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
NameNode Configuration - Add NameNode entry to Masters file
echo "ubuntu-pi-100000" >> $HADOOP_CONF_DIR/masters
NameNode Configuration - Specify workers in the 'slaves' file, /usr/local/hadoop/etc/hadoop/slaves
#create a new file with the first entry.
echo "ubuntu-pi-010000" > $HADOOP_CONF_DIR/slaves
#append the rest.
echo "ubuntu-pi-001000" >> $HADOOP_CONF_DIR/slaves
echo "ubuntu-pi-000100" >> $HADOOP_CONF_DIR/slaves
echo "ubuntu-pi-000010" >> $HADOOP_CONF_DIR/slaves
echo "ubuntu-pi-000001" >> $HADOOP_CONF_DIR/slaves
DataNode Configuration (ubuntu-pi-0xxxxx) : Copy files from NameNode to worker nodes
Copy the configuration files hdfs-site.xml, core-site.xml, mapred-site.xml to the data nodes (ubuntu-pi-010000 : ubuntu-pi-000001).
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-010000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hdfs-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-010000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/core-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-010000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/mapred-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR
DataNode Configuration (ubuntu-pi-010000) : yarn-site.xml
SSH onto the first data node / worker (ubuntu-pi-010000) and modify the yarn-site.xml configuration.
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>ubuntu-pi-100000</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Copy the yarn-site.xml and the hadoop-env.sh file to the other data nodes
scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/yarn-site.xml ubuntu-pi-000001:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-001000:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-000100:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-000010:$HADOOP_CONF_DIR
scp $HADOOP_CONF_DIR/hadoop-env.sh ubuntu-pi-000001:$HADOOP_CONF_DIR
Data Node Configuration (ubuntu-pi-010000) : Create Data Directories
#Run these from the first data node to create data directories on all data nodes.
sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode
sudo chown -R ubuntu $HADOOP_HOME
#creating remote data dirs
ssh ubuntu@ubuntu-pi-001000 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"
ssh ubuntu@ubuntu-pi-000100 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"
ssh ubuntu@ubuntu-pi-000010 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"
ssh ubuntu@ubuntu-pi-000001 "sudo mkdir -p ${HADOOP_HOME}/hadoop_data/hdfs/datanode && sudo chown -R ubuntu ${HADOOP_HOME}"
START HADOOP:
Format the Namenode, start the services and view the report : (ubuntu-pi-100000)
hdfs namenode -format
#Start everything! (the start-all command is deprecated but still works as of 2.10)
#you could also use start-dfs.sh && start-yarn.sh
$HADOOP_HOME/sbin/start-all.sh
hdfs dfsadmin -report

Check Java - Namenode ubuntu-pi-100000:
jps

Check Java - Data Node / Workers ubuntu-pi-010000:
jps
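Roughly what to expect, based on the configuration above: NameNode and ResourceManager on ubuntu-pi-100000, DataNode and NodeManager on each worker, and SecondaryNameNode on ubuntu-pi-010000. If a daemon is missing, its log is the place to start:
ls $HADOOP_HOME/logs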

Check Hadoop UI (hdfs manager)
Navigate to the Hadoop UI: http://ubuntu-pi-100000:50070

Check the resource manager (yarn ui)
Navigate to the Hadoop Cluster Metrics: http://ubuntu-pi-100000:8088/cluster?user.name=ubuntu

Test a MapReduce job:
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar pi 25 5
#or
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar pi 25 5


#useful hadoop commands:
#file System Check
hdfs fsck /

All Healthy.
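A few other commands I found handy for poking at the cluster (standard hdfs dfs / dfsadmin calls):
hdfs dfs -ls /
hdfs dfs -df -h
hdfs dfsadmin -report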
Install hive:
http://www.apache.org/dyn/closer.cgi/hive/
wget https://downloads.apache.org/hive/hive-2.3.6/apache-hive-2.3.6-bin.tar.gz -P ~/downloads
#untar
sudo tar zxvf ~/downloads/apache-hive* -C /usr/local
#rename to "hive"
sudo mv /usr/local/apache-hive-2.3.6-bin/ /usr/local/hive
Update ~/.bashrc with environment variables
export HIVE_HOME=/usr/local/hive
export HIVE_CONF_HOME=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.
Edit the hive config file ($HIVE_HOME/bin/hive-config.sh). Tell hive where hadoop lives.
sudo nano $HIVE_HOME/bin/hive-config.sh
#add to the bottom of the file
export HADOOP_HOME=/usr/local/hadoop
Create and configure the $HIVE_HOME/conf/hive-site.xml file if it doesn't exist
I've added the connection data for a MySQL instance we will create in the next step.
Ensure your hive.metastore.warehouse.dir URL points to your Hadoop NameNode and that the port number is correct.
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password for connecting to mysql server</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://ubuntu-pi-100000:9000/user/hive/warehouse</value>
    <description>Location of default database for the warehouse</description>
  </property>
</configuration>
Create directories on HDFS for hive
hadoop fs -mkdir /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
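Verify the directories and their permissions before moving on:
hadoop fs -ls /
hadoop fs -ls /user/hive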
Set up MySQL as the Hive metastore
The Hive metastore holds table definitions. The data behind those tables is spread across the Hadoop file system.
Install MySQL
sudo apt install -y mysql-server
sudo systemctl start mysql
sudo systemctl status mysql

Create database for the hive metastore
mysql -u root
mysql> create database hive;
mysql> create user 'hive' identified by 'hive';
mysql> grant all on hive.* to hive;
mysql> flush privileges;
mysql> exit;
mysql -u hive -p
mysql> show databases;

Download mysql-connector-java and add the connector jar to the hive/lib directory
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.18.tar.gz
sudo tar -xvf mysql-connector-java-8.0.18.tar.gz -C /usr/local/hive/lib --wildcards --no-anchored --strip-components=1 '*.jar'
#if present, move the local metastore_db directory out of the way; we're going to use MySQL
sudo mv /usr/local/hive/metastore_db /usr/local/hive/metastore_db.tmp
Initialize the MySQL metastore:
schematool -initSchema -dbType mysql --verbose
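To confirm the schema actually landed in MySQL, the metastore tables should now be visible from the hive user:
mysql -u hive -p
mysql> use hive;
mysql> show tables;
mysql> exit;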


Start Hive Metastore
hive --service metastore
#I had to leave hdfs safe mode
hdfs dfsadmin -safemode leave
Start Hive
hive

Test Hive
This simple insert will create a full-on MapReduce job (two, actually).
create table rockstars (id integer, name string);
insert into rockstars (id, name) values (1, 'Nyella!');
select * from rockstars;
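The rows behind that table live on HDFS under the warehouse directory configured in hive-site.xml, so you can inspect them from the NameNode as well:
hadoop fs -ls /user/hive/warehouse/rockstars
hadoop fs -cat /user/hive/warehouse/rockstars/*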

#Useful hive debugging commands:
hive -hiveconf hive.root.logger=DEBUG,console
Spark
Install Spark on all nodes of the cluster
download it:
wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz -P ~/downloads
sudo tar zxvf ~/downloads/spark* -C /usr/local
sudo mv /usr/local/spark-* /usr/local/spark
sudo chown -R ubuntu /usr/local/spark
Configure our Master Spark Node: (also our namenode ubuntu-pi-100000)
update /usr/local/spark/conf/spark-env.sh
cd /usr/local/spark/conf
#copy the template file and make a new spark-env.sh
cp spark-env.sh.template spark-env.sh
nano spark-env.sh
add the SPARK_DIST_CLASSPATH variable
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*
update the master node's ~/.bashrc file
#SPARK VARS
#spark and py spark variables - added by DR. 20200409
export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export TERM=xterm-color
#here we're actually changing the pyspark working directory and configuring pyspark to use jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip ubuntu-pi-100000 --notebook-dir=/home/ubuntu/notebooks'
update the master node's slaves configuration file, /usr/local/spark/conf/slaves
If the file isn't available, create it by copying the template file slaves.template
cp slaves.template slaves
remove 'localhost' from this file and add your servers:
#A Spark Worker will be started on each of the machines listed below.
ubuntu-pi-010000
ubuntu-pi-001000
ubuntu-pi-000100
ubuntu-pi-000010
ubuntu-pi-000001

Start configuring our worker nodes by modifying the spark-env.sh
nano /usr/local/spark/conf/spark-env.sh
Add the environment variables to the file
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*
Update ~/.bashrc with Spark Variables
#SPARK VARS
export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_CONF_DIR=/usr/local/spark/conf
Start the spark cluster from your main node
/usr/local/spark/sbin/start-all.sh
This will read the workers/slaves file and initialize the worker nodes as well.


Test Spark
cd /usr/local/spark/bin
spark-shell --version

View the Spark UI : http://ubuntu-pi-100000:8080/
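To push an actual job through the cluster, the example jar that ships with Spark works well; something like the following (check /usr/local/spark/examples/jars if the jar name differs for your download):
/usr/local/spark/bin/spark-submit --master spark://ubuntu-pi-100000:7077 --class org.apache.spark.examples.SparkPi /usr/local/spark/examples/jars/spark-examples_2.11-2.4.5.jar 10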

Pyspark:
PySpark comes with Spark; we just need to make sure Jupyter notebooks are available.
As noted earlier in this post, I set the PYSPARK_DRIVER_PYTHON variables, which instruct PySpark to launch Jupyter instead of the shell when it runs. This spins up a notebook server and displays the URL for accessing it.
#Install Jupyter notebooks
sudo apt install python-pip
pip install jupyter
#start pyspark and specify the Spark master node url
/usr/local/spark/bin/pyspark --master spark://ubuntu-pi-100000:7077
go to the URL

Helpful ish!
#Useful jupyter notebook commands:
jupyter troubleshoot
#in general
jupyter notebook stop <port the notebook is running on>
jupyter notebook list
#useful spark commands:
./start-slave.sh spark://ubuntu-pi-100000:7077
STARTING AND STOPPING HADOOP
1. All services : start-all.sh & stop-all.sh, available in the /usr/local/hadoop/sbin directory.
2. YARN (ResourceManager, NodeManagers) : start-yarn.sh & stop-yarn.sh, available in the /usr/local/hadoop/sbin directory.
3. HDFS (NameNode, DataNodes) : start-dfs.sh & stop-dfs.sh, available in the /usr/local/hadoop/sbin directory.
Great references for default values and install instructions:
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://www.informit.com/articles/article.aspx?p=2190194&seqNum=2
https://spark.apache.org/docs/2.2.0/
https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/