This blog post explains how to read data from a CSV file and write it out as a Parquet file.
The Parquet file format is better suited than CSV for many data operations. Because Parquet stores data in a columnar layout, query engines can prune columns and read only the ones a query actually needs, which dramatically speeds up many analytical queries.
Go is a great language for ETL. Writing out Parquet files makes it easier for downstream Spark or Python jobs to consume the data in an optimized manner.
The parquet-go library makes it easy to convert CSV files to Parquet files.
Sample CSV data
Let’s start with the following sample data in the data/shoes.csv file:
nike,air_griffey
fila,grant_hill_2
steph_curry,curry7
Let’s read this data and write it out as a Parquet file.
Check out the parquet-go-example repo if you’d like to run this code yourself.
Create Parquet file
Create a Shoe struct that’ll be used for each row of data in the CSV file:
type Shoe struct {
    ShoeBrand string `parquet:"name=shoe_brand, type=UTF8"`
    ShoeName  string `parquet:"name=shoe_name, type=UTF8"`
}
Set up the Parquet writer so it’s ready to accept data writes:
var err error
fw, err := local.NewLocalFileWriter("tmp/shoes.parquet")
if err != nil {
    log.Println("Can't create local file", err)
    return
}
pw, err := writer.NewParquetWriter(fw, new(Shoe), 2)
if err != nil {
    log.Println("Can't create parquet writer", err)
    return
}
pw.RowGroupSize = 128 * 1024 * 1024 // 128 MB
pw.CompressionType = parquet.CompressionCodec_SNAPPY
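For reference, the local and writer packages above come from the parquet-go project. Assuming the xitongsys/parquet-go package layout (your version’s import paths may differ), the import block for this example would look roughly like this:
import (
    "bufio"
    "encoding/csv"
    "io"
    "log"
    "os"

    "github.com/xitongsys/parquet-go-source/local" // assumed home of local.NewLocalFileWriter
    "github.com/xitongsys/parquet-go/parquet"      // assumed home of the CompressionCodec constants
    "github.com/xitongsys/parquet-go/writer"       // assumed home of writer.NewParquetWriter
)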
Open up the CSV file, iterate over every line in the file, and then write each line to the Parquet file:
csvFile, err := os.Open("data/shoes.csv")
if err != nil {
    log.Fatal(err)
}
reader := csv.NewReader(bufio.NewReader(csvFile))
for {
    // Read one CSV record per iteration until we hit the end of the file.
    line, err := reader.Read()
    if err == io.EOF {
        break
    } else if err != nil {
        log.Fatal(err)
    }
    shoe := Shoe{
        ShoeBrand: line[0],
        ShoeName:  line[1],
    }
    if err = pw.Write(shoe); err != nil {
        log.Println("Write error", err)
    }
}
Once we’ve iterated over all the lines in the file, we stop the Parquet writer and close the local file writer:
if err = pw.WriteStop(); err != nil {
    log.Println("WriteStop error", err)
    return
}
log.Println("Write Finished")
fw.Close()
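As a small variation (a sketch, not necessarily how the parquet-go-example repo does it), you could defer the file writer’s Close right after creating it, so the file handle is released even on an early return; WriteStop still needs to run before the function exits:
fw, err := local.NewLocalFileWriter("tmp/shoes.parquet")
if err != nil {
    log.Fatal("Can't create local file", err)
}
// The deferred Close runs when the surrounding function returns,
// which happens after pw.WriteStop() in the happy path.
defer fw.Close()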
The data will be written to the tmp/shoes.parquet file. You can run this on your local machine with the go run csv_to_parquet.go command.
Let’s read this Parquet file into a Spark DataFrame to verify that it’s compatible with another framework. Spark loves Parquet files 😉
Read into Spark DataFrame
You can download Spark to run this code on your local machine if you’d like.
The Parquet file was output to /Users/powers/Documents/code/my_apps/parquet-go-example/tmp/shoes.parquet on my machine.
cd into the downloaded Spark directory (e.g. cd ~/spark-2.4.0-bin-hadoop2.7/bin/) and then run ./spark-shell to start the Spark console.
Let’s read the Parquet file into a Spark DataFrame:
val path = "/Users/powers/Documents/code/my_apps/parquet-go-example/tmp/shoes.parquet"
val df = spark.read.parquet(path)
Run the show() method to inspect the DataFrame contents:
df.show()
+-----------+------------+
| shoe_brand|   shoe_name|
+-----------+------------+
|       nike| air_griffey|
|       fila|grant_hill_2|
|steph_curry|      curry7|
+-----------+------------+
Run the printSchema() method to view the DataFrame schema:
df.printSchema()
root
 |-- shoe_brand: string (nullable = true)
 |-- shoe_name: string (nullable = true)
You can use Go to build a Parquet data lake and then do further data analytics with Spark. Parquet is the perfect hand-off format between Go and Spark!
Reading into a Go DataFrame
qframe seems to be the most promising Go DataFrame library.
It doesn’t support Parquet yet, but hopefully we can get a qframe.ReadParquet method added 😉
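In the meantime, you can read a Parquet file back into a slice of Go structs with parquet-go’s reader. Here’s a minimal sketch, assuming the same Shoe struct as above and the xitongsys/parquet-go layout (github.com/xitongsys/parquet-go/reader and github.com/xitongsys/parquet-go-source/local); your version’s API may differ slightly:
fr, err := local.NewLocalFileReader("tmp/shoes.parquet")
if err != nil {
    log.Fatal("Can't open file", err)
}
pr, err := reader.NewParquetReader(fr, new(Shoe), 2)
if err != nil {
    log.Fatal("Can't create parquet reader", err)
}
// Read every row back into a slice of Shoe structs.
num := int(pr.GetNumRows())
shoes := make([]Shoe, num)
if err = pr.Read(&shoes); err != nil {
    log.Println("Read error", err)
}
pr.ReadStop()
fr.Close()
log.Println(shoes)
It’s not a DataFrame, but a slice of structs is often enough for simple downstream processing in Go.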
Next steps
We need to create more examples and demonstrate that parquet-go can also write out other column types like integers.
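As a starting point, here’s an untested sketch of what a struct with an integer column might look like, based on the struct tag conventions used above (ShoeSale and pairs_sold are made-up names, and the exact tag syntax may vary between parquet-go versions):
// Hypothetical struct mixing a string column and an integer column.
type ShoeSale struct {
    ShoeBrand string `parquet:"name=shoe_brand, type=UTF8"`
    PairsSold int32  `parquet:"name=pairs_sold, type=INT32"`
}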
Go is a great language for ETL. Parquet support makes it even better!