Converting a dataset between a Spark DataFrame and a Pandas DataFrame lets us use the strengths of both.

Spark DataFrame: The Dataset interface, added in Spark 1.6, combines the benefits of RDDs (Resilient Distributed Datasets) with the benefits of Spark SQL’s optimized execution engine. A DataFrame is a Dataset organized into named columns, and it is immutable.

Pandas DataFrame: a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, the data is arranged in a table of rows and columns.
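The three properties in the Pandas definition can be seen on a tiny example. This is only a sketch: the column names and values below are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
pdf = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.5, 2.0, 3.5]})

# Size-mutable: a new column can be added to the same object in place
pdf["passed"] = pdf["score"] > 1.8

# Potentially heterogeneous: each column keeps its own dtype
print(pdf.dtypes)

# Labeled axes: a cell is addressed by row label and column label
print(pdf.loc[1, "score"])
```

Adding the column changes `pdf` itself, which is the main practical difference from a Spark DataFrame.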

import findspark
findspark.init()  # locate the Spark installation and add pyspark to sys.path before importing it
import pyspark

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
spark = SparkSession.builder.appName('abc').getOrCreate()
sc = spark.sparkContext

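# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
# Load the Boston housing data into a Pandas DataFrame
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data['target']  # append the target as an extra column
df.head(4)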

  • Convert Pandas DataFrame to Spark DataFrame

# From here on, df refers to the Spark DataFrame, not the Pandas one
df = spark.createDataFrame(df)
df.show(5)

df.select('CRIM','NOX','AGE').show(5)

  • Convert Spark DataFrame to Pandas DataFrame

# toPandas() collects every row to the driver, so the data must fit in driver memory
dfPandas = df.toPandas()
dfPandas.head(5)