01. Introduction
Big Data
Suppose we have data that fits on a local computer, on the scale of 0-32 GB depending on the RAM available. Data at this scale is not Big Data. Even with 30 to 64 GB of data, we can buy extra RAM so the data still fits in memory. Such data can be opened in something like an Excel file, and because it sits in RAM, we have quick access to it.
When we have a larger set of data that we cannot hold in RAM,
- we can use a SQL database to move storage onto a hard drive instead of RAM, or
- we can use a distributed system that spreads the data across multiple machines/computers.
This is where Spark comes into play. If we are using Spark, we are at the point where it no longer makes sense to keep the data in RAM or on a single machine.
A local system is a single machine or computer, where everything shares the same RAM and the same hard drive.
In a distributed system, we have one main computer (a master node) and distribute the data and calculations across the other computers.
A local process uses the computational resources of a single machine. A distributed process has access to the computational resources of a number of machines connected through a network.
At a certain point, it is easier to scale out to many machines with weaker CPUs than to try to scale up a single machine with a very powerful CPU.
Distributed machines also have the advantage of scaling easily: we can just add more machines.
Distributed systems also include fault tolerance: if one machine fails, the whole network can still carry on.
Hadoop
- Hadoop is a way to distribute very large files across multiple machines.
- It uses Hadoop Distributed File System (HDFS).
- HDFS allows a user to work with larger data sets.
- HDFS also duplicates blocks of data for fault tolerance.
- It also then uses MapReduce.
- MapReduce allows computations across the distributed data sets.
- HDFS will use blocks of data, with a size of 128 MB by default.
- Each of these blocks is replicated 3 times.
- The blocks are distributed in a way that supports fault tolerance.
- Smaller blocks provide more parallelization during processing.
- Multiple copies of a block prevent loss of data due to the failure of a node.
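As a rough worked example of what these defaults imply (the 512 MB file size below is a made-up illustration), we can compute how many block copies end up on the cluster:

    import math

    file_size_mb = 512        # hypothetical file size, used only for this illustration
    block_size_mb = 128       # HDFS default block size
    replication = 3           # HDFS default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)   # 4 blocks
    copies_stored = blocks * replication               # 12 block copies across the cluster
    print(blocks, copies_stored)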
MapReduce
- MapReduce is a way of splitting a computation task across a distributed set of files (such as HDFS).
- It consists of a Job Tracker and multiple Task Trackers.
- The Job Tracker sends code to run on the Task Trackers.
- The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes.
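To make the map/reduce split concrete, here is a minimal word-count sketch in plain Python. This only illustrates the idea behind MapReduce, not the Hadoop API; the mapper and reducer functions and the sample lines are made up for the example.

    from collections import defaultdict

    # map step: each mapper turns its chunk of input into (key, value) pairs
    def mapper(line):
        return [(word, 1) for word in line.split()]

    # reduce step: each reducer combines all values seen for one key
    def reducer(word, counts):
        return (word, sum(counts))

    lines = ["spark is fast", "hadoop and spark"]   # stand-in for the distributed blocks

    # shuffle: group the mapped pairs by key before reducing
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)

    print([reducer(word, counts) for word, counts in grouped.items()])
    # [('spark', 2), ('is', 1), ('fast', 1), ('hadoop', 1), ('and', 1)]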
Spark
- Spark is one of the latest technologies being used to quickly and easily handle Big Data.
- It is an open source project on Apache.
- It was first released in Feb 2013 and has exploded in popularity due to its ease of use and speed.
- It was created at the AMPLab at UC Berkeley.
We can think of Spark as a flexible alternative to MapReduce.
Spark can use data stored in a variety of formats:
- Cassandra
- AWS S3
- HDFS and more.
Spark vs MapReduce
- MapReduce requires files to be stored in HDFS, Spark does not!
- Spark can also perform operations up to 100x faster than MapReduce.
- MapReduce writes most data to disk after each map and reduce operation.
- Spark keeps most of the data in memory (RAM) after each transformation.
- Spark can spill over to disk if the memory is filled.
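A minimal PySpark sketch of this behaviour, assuming a local SparkContext and a hypothetical input file, is to persist an RDD with a memory-and-disk storage level so Spark keeps it in RAM and only spills to disk when memory runs out:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "memory_demo")   # assumed local setup, just for the sketch
    rdd = sc.textFile("example.txt")               # hypothetical input file
    rdd.persist(StorageLevel.MEMORY_AND_DISK)      # keep in RAM, spill to disk if it does not fit
    print(rdd.count())                             # action that materialises the persisted data
    sc.stop()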
Spark RDDs
- At the core of Spark is the idea of a Resilient Distributed Dataset (RDD).
- A Resilient Distributed Dataset (RDD) has 4 main features:
- Distributed Collection of Data.
- Fault-tolerant
- Parallel operation - partitioned
- Ability to use many data sources.
- RDDs are immutable, lazily evaluated and cacheable.
- There are two types of spark operations:
- Transformations
- Actions
- Transformations are basically a recipe to follow.
- Actions actually perform what the recipe says to do and return something back.
- This behaviour of transformations and actions carries over to the syntax when coding.
- A lot of the time, we will write a method call (a transformation) but won't see any result until we call an action.
- This makes sense because with large data sets we don't want to compute all the transformations until we are sure we want to perform them, as the sketch after this list shows.
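A minimal PySpark sketch of this lazy behaviour (the local SparkContext and the numbers are assumptions made up for the illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy_demo")
    nums = sc.parallelize([1, 2, 3, 4, 5])    # an RDD: a distributed collection of data
    squared = nums.map(lambda x: x * x)       # transformation: just a recipe, nothing runs yet
    print(squared.collect())                  # action: the recipe is executed, returns [1, 4, 9, 16, 25]
    sc.stop()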
Spark DataFrames
Spark DataFrames are also now the standard way of using Spark's machine learning capabilities.
Spark is not a programming language; it is a framework for dealing with large data and doing calculations across distributed data.
Spark itself is written in the Scala programming language.
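As a minimal PySpark DataFrame sketch (the app name, column names and rows below are assumptions made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df_demo").getOrCreate()
    df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])  # hypothetical data
    df.filter(df["age"] > 40).show()   # DataFrame operations run across the distributed data
    spark.stop()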
Spark and Python Setup
While installing Spark, we will either connect to an online Linux-based system or use VirtualBox to set up a Linux-based system locally.
This is because,
Realistically, Spark won't be running on a single machine; it will run on a cluster, on a service like AWS.
These cluster services will pretty much always be a Linux based system.
Installing using AWS EC2 - Ubuntu instance
* sudo apt-get update
* sudo apt-get install python3-pip
* pip3 install jupyter
* sudo apt-get install default-jre
* sudo apt-get install scala
* pip3 install py4j
* Install actual spark library
wget http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
* sudo tar -zxvf spark-3.0.0-bin-hadoop2.7.tgz (this extracts the Spark distribution, pre-built for Hadoop 2.7)
* ls
* cd spark-3.0.0-bin-hadoop2.7
* pwd
* cd (back to home directory)
* pip3 install findspark
(the findspark module helps us to connect Python with Spark easily)
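Inside the notebook, findspark can then be pointed at the extracted Spark folder. The path below assumes the archive was extracted in the home directory of the ubuntu user, as in the steps above:

    import findspark
    findspark.init('/home/ubuntu/spark-3.0.0-bin-hadoop2.7')   # path to the extracted Spark folder

    import pyspark   # pyspark is now importable in the notebook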
* ~/.local/bin/jupyter notebook --generate-config
* cd
* mkdir certs
* cd certs
* sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
(Creating .pem files that are going to be used by jupyter config files)
* cd ~/.jupyter
* vi jupyter_notebook_config.py
* Add the lines below at the start of the file
c = get_config()
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
* ~/.local/bin/jupyter notebook
* copy the URL and replace localhost (or the IP address) with the EC2 instance's public DNS
Install Spark using Databricks