I. Data science & data analytics with big data

Before diving into data science in big data, we need to define each term. Big data is a large set of data collected from different places, and its type varies: it can be structured, semi-structured, or unstructured. Data science is the method of solving problems over this collected data set: implementing different algorithms and statistics, and visualizing the data in a way that makes it very simple for management to understand.

When we come to data science and big data, we find two roles in this subject: big data analytics and data science. Big data analytics is the role that deals with collecting, cleansing, transforming, and modeling data, while data science deals with predictive algorithms such as machine learning and deep learning.

A main part of data science is knowing how to write SQL queries. We use queries in many parts of the work, especially in data analytics.
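For example, here is a minimal sketch of running a SQL query with Spark SQL in Python (the documents table and its contents are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

documents_df = spark.createDataFrame(
    [(1, 'cats are cute', 0), (2, 'dogs are playful', 0)],
    ['doc_id', 'text', 'user_id'])

# Register the DataFrame as a temporary view so SQL can query it.
documents_df.createOrReplaceTempView('documents')
spark.sql('SELECT doc_id, text FROM documents WHERE user_id = 0').show()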

In this article we will concentrate on the data analytics process and its methods; in many places, data analytics is considered part of data science.

II. Data analytics process methods

Data analytics has five steps in our model, and the sequence of these steps is not strict. A short example sketch of each step follows the list.

  1. Collect data
    In this step we collect data from different resources such as log files, machine-generated files, databases, structured data like CSV/PSV files, and semi-structured data like JSON/XML files.
  2. Transform data
    After we collect the data, we can apply transformation methods such as:
    – String parsing
    – Mapping data with static tables
  3. Enrich data
    In this part we can:
    – Map data
    – Do calculations to produce new columns or data
    – Join two data sets
    – Union two data sets
  4. Filter data
    – Filter through a where condition
    – Filter through a join statement
  5. Data modeling
    In this part we have a set of methods or business rules, and we implement these rules on the data set to get new columns or to filter the data.
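A minimal sketch of the collect and transform steps (the raw lines and column names are invented; in a real job the data would come from a source such as spark.read.csv or spark.read.json):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, trim

spark = SparkSession.builder.getOrCreate()

# Collect: in practice spark.read.csv(...) or spark.read.json(...);
# here we build a small DataFrame in memory so the sketch is self-contained.
raw_df = spark.createDataFrame(
    [('1001, cats are cute',), ('1002, dogs are playful',)], ['line'])

# Transform: string parsing, splitting each raw line into separate columns.
parsed_df = raw_df.select(
    trim(split('line', ',')[0]).alias('doc_id'),
    trim(split('line', ',')[1]).alias('text'))
parsed_df.show()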
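A sketch of the enrich step with a calculated column, a join, and a union (all three data sets are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import length

spark = SparkSession.builder.getOrCreate()

docs_df = spark.createDataFrame(
    [(1, 'cats are cute', 0), (2, 'lions are big', 1)],
    ['doc_id', 'text', 'user_id'])
users_df = spark.createDataFrame([(0, 'alice'), (1, 'bob')],
                                 ['user_id', 'user_name'])
more_docs_df = spark.createDataFrame([(3, 'cars are fast', 1)],
                                     ['doc_id', 'text', 'user_id'])

# Calculation: produce a new column from an existing one.
enriched_df = docs_df.withColumn('text_length', length('text'))

# Join: map each row to user attributes from a second data set.
joined_df = enriched_df.join(users_df, on='user_id', how='left')
joined_df.show()

# Union: append the rows of another data set with the same columns.
docs_df.union(more_docs_df).show()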
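A sketch of the filter step, once through a where condition and once through a join (the active-users table is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

docs_df = spark.createDataFrame(
    [(1, 'cats are cute', 0), (2, 'lions are big', 1), (3, 'cars are fast', 1)],
    ['doc_id', 'text', 'user_id'])
active_users_df = spark.createDataFrame([(1,)], ['user_id'])

# Filter through a where condition.
docs_df.where(docs_df.user_id == 1).show()

# Filter through a join: a left semi join keeps only the rows of docs_df
# whose user_id also appears in active_users_df, without adding columns.
docs_df.join(active_users_df, on='user_id', how='leftsemi').show()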
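A sketch of the data modeling step, applying a made-up business rule to the data set to derive a new column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

docs_df = spark.createDataFrame(
    [(1, 'cats are cute', 0), (2, 'lions are big', 1)],
    ['doc_id', 'text', 'user_id'])

# A hypothetical rule: documents of user 0 are 'internal', all others 'external'.
modeled_df = docs_df.withColumn(
    'owner_type',
    when(col('user_id') == 0, 'internal').otherwise('external'))
modeled_df.show()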

III. The programming languages in data analytics

In data analytics for big data, there are many languages to use:

1- Python 2- Scala 3- R 4- Java
Python is a very famous language in data science, and we will write all the examples in this article in Python. With Python we can write data science code and run it on a single machine or on a distributed system with Spark. When we write a machine learning model, we can implement it in plain Python and run it locally, or we can build the model with Python and Spark to run it on a big data platform.
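As a rough sketch of the second option, here is a small model built with Spark's ML library (the toy features and labels are invented; this is not a complete workflow):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Toy training data: two numeric features and a binary label.
train_df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 0.5, 1.0), (0.5, 3.0, 0.0)],
    ['f1', 'f2', 'label'])

# Spark ML expects the features packed into one vector column.
assembler = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features')
train_vec = assembler.transform(train_df)

model = LogisticRegression(featuresCol='features', labelCol='label').fit(train_vec)
model.transform(train_vec).select('label', 'prediction').show()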

IV. The tools of data science

In data analytics for data science, we use two tools: 1- pyspark and 2- Jupyter. pyspark is a text command-line tool: we write "pyspark" in the command line to start it, and inside it we can write Python script code to implement the data analytics. We can also use Jupyter Notebook (it ships with Anaconda); in Jupyter, we can write our scripts for data analysis.

When we work with a Jupyter notebook, we have two options: pure Python (on a single machine, the machine that hosts Jupyter) or Spark with Python (Spark works on a cluster, which is a set of machines working together as a distributed system).

1. Pyspark

Start pyspark by opening the cmd command line and writing "pyspark". When you work with pyspark, you have two options: work with Python only, or use Spark.

Example(1)
Pure Python in pyspark, after you type pyspark in the command line:

for item in [1, 3, 5, 6]:
    print(item)

Example(2)
Spark with PySpark:

# sc (the SparkContext) is created automatically by the pyspark shell.
documents_rdd = sc.parallelize([[1, 'cats are cute', 0], [2, 'dogs are playful', 0], [3, 'lions are big', 1], [4, 'cars are fast', 1]])
documents_df = documents_rdd.toDF(['doc_id', 'text', 'user_id'])
documents_df.show()

2. Jupyter notebook

We open Anaconda and choose Jupyter from it. Below we write one example that uses Spark with Python to display data.

import findspark
findspark.init()
import pyspark
from pyspark.sql import HiveContext

# Run Spark locally, using all available cores.
conf = pyspark.SparkConf().setAppName('sentiment-analysis').setMaster('local[*]')
sc = pyspark.SparkContext(conf=conf)
sqlContext = HiveContext(sc)

documents_rdd = sc.parallelize([[1, 'cats are cute', 0], [2, 'dogs are playful', 0], [3, 'lions are big', 1], [4, 'cars are fast', 1]])
documents_df = documents_rdd.toDF(['doc_id', 'text', 'user_id'])
documents_df.show()

V. Spark Dataframe

Definition of the DataFrame: it is a distributed collection of data. This collection represents a table in a database: it has columns and rows.
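A small illustration (the data is made up): a Spark DataFrame has a schema of named, typed columns and is displayed like a table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'cats are cute'), (2, 'lions are big')],
                           ['doc_id', 'text'])
df.printSchema()  # the columns and their types
df.show()         # the rows, rendered as a table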

VI. Pandas dataframe

Python has the pandas DataFrame. It is a data structure like a table, made up of rows and columns, and it runs on a single machine.
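A small illustration with made-up data:

import pandas as pd

# A pandas DataFrame is an in-memory table on a single machine.
pdf = pd.DataFrame({'doc_id': [1, 2],
                    'text': ['cats are cute', 'lions are big']})
print(pdf)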

VII. Transfer data between Spark & Pandas dataframes

We can transfer a data set between Spark and pandas DataFrames.
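A minimal sketch of moving a small, made-up data set in both directions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, 'cats are cute')], ['doc_id', 'text'])

# Spark -> pandas: toPandas() collects the distributed rows onto the driver,
# so it is only suitable for data that fits in one machine's memory.
pandas_df = spark_df.toPandas()

# pandas -> Spark: createDataFrame() distributes the local table to the cluster.
spark_df_again = spark.createDataFrame(pandas_df)
spark_df_again.show()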