Load CSV file with Spark using Python-Jupyter notebook

neovijayk
Jul 6, 2020
2 min read

In this article I use a Jupyter notebook to read data from a CSV file with Spark, using Python code. For this demonstration I use an input dataset from Kaggle (you can download the input dataset from there).

Now we will take a look at some of the ways to read data from the input CSV file:

1. Without mentioning the schema:

from pyspark.sql import SparkSession

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example: Reading CSV file without mentioning schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",")

sdfData.show()

Now check the columns: sdfData.columns

Output will be:

['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']

Check the datatype for each column:

sdfData.schema

StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,StringType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,StringType,true),StructField(CustomerID,StringType,true),StructField(Country,StringType,true)))

When no schema is given and inferSchema is not set, Spark reads every column as StringType.
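If only a few of the string columns need another type, one option is to cast them after reading. The following is a minimal sketch, not part of the original walkthrough: the helper `build_cast_exprs` and the chosen target types are hypothetical, and the resulting strings are meant to be passed to `DataFrame.selectExpr`.

```python
def build_cast_exprs(columns, casts):
    """Return selectExpr strings that cast the chosen columns and keep the rest as-is."""
    exprs = []
    for name in columns:
        if name in casts:
            # SQL CAST keeps the original column name via the alias
            exprs.append(f"CAST(`{name}` AS {casts[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs


columns = ["InvoiceNo", "StockCode", "Description", "Quantity",
           "InvoiceDate", "UnitPrice", "CustomerID", "Country"]

# Hypothetical choice of target types, for illustration only
casts = {"Quantity": "INT", "UnitPrice": "DOUBLE"}

exprs = build_cast_exprs(columns, casts)
print(exprs)

# With the DataFrame read above, the expressions would be applied as:
# sdfTyped = sdfData.selectExpr(*exprs)
```

Values that cannot be converted become null after the cast, so it is worth checking the result before relying on it.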

2. With schema:

If you know the schema, or want particular data types for the columns in the table above, define the schema explicitly and pass it to the reader. For example, suppose the file has the following columns and each of them should be read with a specific data type:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

# Spark applies a user-supplied CSV schema by position,
# so every column of the file is listed here, in file order.
schema = StructType([
    StructField("InvoiceNo", IntegerType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("InvoiceDate", StringType()),
    StructField("UnitPrice", DoubleType()),
    StructField("CustomerID", DoubleType()),
    StructField("Country", StringType()),
])

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL example: Reading CSV file with schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",", schema=schema)

Now check the schema for the datatype of each column:

sdfData.schema

StructType(List(StructField(InvoiceNo,IntegerType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,DoubleType,true),StructField(Country,StringType,true)))
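With the default reader settings, Spark applies a user-supplied schema to the CSV columns by position and does not match fields to header names, so a schema that skips a column silently shifts the values that follow it. A quick check in plain Python before reading can catch this. The helper `header_mismatches` below is a hypothetical sketch, and the small temporary file only stands in for the real data.csv.

```python
import csv
import os
import tempfile


def header_mismatches(csv_path, schema_names, sep=","):
    """Compare the CSV header with the schema field names, position by position."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        header = next(csv.reader(f, delimiter=sep))
    problems = []
    if len(header) != len(schema_names):
        problems.append(
            f"header has {len(header)} columns, schema has {len(schema_names)} fields"
        )
    for pos, (in_file, in_schema) in enumerate(zip(header, schema_names)):
        if in_file.strip() != in_schema:
            problems.append(
                f"position {pos}: file has {in_file.strip()!r}, schema has {in_schema!r}"
            )
    return problems


# Tiny stand-in file (hypothetical), so the sketch runs without the Kaggle data
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("InvoiceNo,StockCode,UnitPrice,CustomerID\n536365,85123A,2.55,17850\n")
    sample_path = tmp.name

# A schema that forgets UnitPrice is reported as misaligned
problems = header_mismatches(sample_path, ["InvoiceNo", "StockCode", "CustomerID"])
os.remove(sample_path)
print(problems)
```

For the schema defined above, the call would be `header_mismatches("data.csv", schema.fieldNames())`; an empty list means the header and the schema line up.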

Alternatively, we can let Spark infer the data types instead of specifying the schema explicitly. Note that inference requires an extra pass over the data:

sdfData = scSpark.read.csv("data.csv", header=True, inferSchema=True)

sdfData.schema

The output is:

StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true)))
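To get an intuition for why some columns in this output come back as IntegerType or DoubleType while others stay StringType, here is a simplified illustration in plain Python. It is not Spark's actual inference algorithm, only a sketch of the idea that a column gets the narrowest type that fits all of its values. The function `infer_type` is hypothetical, and the value "C536379" is an assumed example of a non-numeric invoice number.

```python
def infer_type(values):
    """Pick the narrowest of integer, double, string that fits every non-empty value."""
    def fits(cast):
        for v in values:
            if v == "":
                continue  # empty cells are treated as nulls in this sketch
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "IntegerType"
    if fits(float):
        return "DoubleType"
    return "StringType"


print(infer_type(["6", "32"]))            # Quantity-like values
print(infer_type(["2.55", "3.39"]))       # UnitPrice-like values
print(infer_type(["85123A", "71053"]))    # StockCode-like values
print(infer_type(["536365", "C536379"]))  # one non-numeric value forces a string column
```

A single value that does not parse as a number is enough to keep the whole column as a string, which is one plausible reason for InvoiceNo being StringType in the inferred schema.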

Now we will print some of the rows of the output (the excerpt below omits the Country column):

sdfData.show()

+---------+---------+--------------------+--------+--------------+---------+----------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|
+---------+---------+--------------------+--------+--------------+---------+----------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|     2.75|     17850|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|     3.39|     17850|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|     3.39|     17850|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|12/1/2010 8:26|     7.65|     17850|
|   536365|    21730|GLASS STAR FROSTE...|       6|12/1/2010 8:26|     4.25|     17850|
|   536366|    22633|HAND WARMER UNION...|       6|12/1/2010 8:28|     1.85|     17850|
|   536366|    22632|HAND WARMER RED P...|       6|12/1/2010 8:28|     1.85|     17850|
|   536367|    84879|ASSORTED COLOUR B...|      32|12/1/2010 8:34|     1.69|     13047|
|   536367|    22745|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|      2.1|     13047|
|   536367|    22748|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|      2.1|     13047|
|   536367|    22749|FELTCRAFT PRINCES...|       8|12/1/2010 8:34|     3.75|     13047|
|   536367|    22310|IVORY KNITTED MUG...|       6|12/1/2010 8:34|     1.65|     13047|
|   536367|    84969|BOX OF 6 ASSORTED...|       6|12/1/2010 8:34|     4.25|     13047|
|   536367|    22623|BOX OF VINTAGE JI...|       3|12/1/2010 8:34|     4.95|     13047|
|   536367|    22622|BOX OF VINTAGE AL...|       2|12/1/2010 8:34|     9.95|     13047|
|   536367|    21754|HOME BUILDING BLO...|       3|12/1/2010 8:34|     5.95|     13047|
|   536367|    21755|LOVE BUILDING BLO...|       3|12/1/2010 8:34|     5.95|     13047|
|   536367|    21777|RECIPE BOX WITH M...|       4|12/1/2010 8:34|     7.95|     13047|
+---------+---------+--------------------+--------+--------------+---------+----------+

only showing top 20 rows
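The InvoiceDate values shown above, such as "12/1/2010 8:26", are still plain strings. Before converting them to timestamps in Spark, it can help to test the date pattern on one sample value in plain Python. This is a hedged sketch: it assumes the month comes first (so the sample means 1 December 2010), which the source does not state.

```python
from datetime import datetime

# Sample value copied from the output above
sample = "12/1/2010 8:26"

# Assumption: month/day/year order with a 24-hour clock
parsed = datetime.strptime(sample, "%m/%d/%Y %H:%M")
print(parsed)

# Under the same assumption, the Spark equivalent would be along these lines:
# from pyspark.sql.functions import col, to_timestamp
# sdfData = sdfData.withColumn(
#     "InvoiceDate", to_timestamp(col("InvoiceDate"), "M/d/yyyy H:mm"))
```

If the dates in the file turn out to be day-first, the two patterns need to be swapped accordingly.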

That is it for this article. If you have any questions, feel free to ask in the comment section below. Also, if you find the above information useful, please like this article and subscribe to my blog 🙂
