Converting CSV files to Parquet with Go
This blog post explains how to read data from a CSV file and write it out as a Parquet file.
The Parquet file format is better suited than CSV for most analytical workloads. Columnar file formats support column pruning, so query engines can read only the columns a query needs, which dramatically speeds up many queries.
Go is a great language for ETL. Writing out Parquet files makes it easier for downstream Spark or Python processes to consume the data efficiently.
The parquet-go library makes it easy to convert CSV files to Parquet files.
Sample CSV data
Let's start with the following sample data in the data/shoes.csv file:
nike,air_griffey
fila,grant_hill_2
steph_curry,curry7
Let's read this data and write it out as a Parquet file.
Check out the parquet-go-example repo if you'd like to run this code yourself.
Create Parquet file
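The snippets in this section all come from a single csv_to_parquet.go file. If you're coding along, the program needs imports roughly like the following (the parquet-go import paths may differ slightly depending on the version you install):

import (
    "bufio"
    "encoding/csv"
    "io"
    "log"
    "os"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/parquet"
    "github.com/xitongsys/parquet-go/writer"
)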
Create a Shoe struct that'll be used for each row of data in the CSV file:
type Shoe struct {
    ShoeBrand string `parquet:"name=shoe_brand, type=UTF8"`
    ShoeName  string `parquet:"name=shoe_name, type=UTF8"`
}
Set up the Parquet writer so it's ready to accept data writes:
var err error
fw, err := local.NewLocalFileWriter("tmp/shoes.parquet")
if err != nil {
    log.Println("Can't create local file", err)
    return
}
// The third argument is the degree of parallelism (number of goroutines) used by the writer.
pw, err := writer.NewParquetWriter(fw, new(Shoe), 2)
if err != nil {
    log.Println("Can't create parquet writer", err)
    return
}
pw.RowGroupSize = 128 * 1024 * 1024 // 128 MB
pw.CompressionType = parquet.CompressionCodec_SNAPPY
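Snappy is a sensible default because it compresses and decompresses quickly. If smaller files matter more than write speed, the same CompressionCodec enum exposes other codecs; for example (an untested variation on the setup above):

pw.CompressionType = parquet.CompressionCodec_GZIP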
Open the CSV file, iterate over every line, and write each record to the Parquet file:
csvFile, err := os.Open("data/shoes.csv")
if err != nil {
    log.Fatal("Can't open CSV file", err)
}
defer csvFile.Close()
reader := csv.NewReader(bufio.NewReader(csvFile))
for {
    line, err := reader.Read()
    if err == io.EOF {
        break
    } else if err != nil {
        log.Fatal(err)
    }
    shoe := Shoe{
        ShoeBrand: line[0],
        ShoeName:  line[1],
    }
    if err = pw.Write(shoe); err != nil {
        log.Println("Write error", err)
    }
}
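Our sample CSV doesn't have a header row. If yours does, one simple approach (a hypothetical tweak, not part of the example repo) is to read and discard the first record before entering the loop:

// Read and discard the header row before the main loop.
if _, err := reader.Read(); err != nil {
    log.Fatal("Can't read header row", err)
}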
Once we've iterated over all the lines in the file, we can stop the Parquet writer and close the local file writer.
if err = pw.WriteStop(); err != nil {
    log.Println("WriteStop error", err)
    return
}
log.Println("Write Finished")
fw.Close()
The data will be written to the tmp/shoes.parquet file. You can run this example on your local machine with the go run csv_to_parquet.go command.
Let's read this Parquet file into a Spark DataFrame to verify that it's compatible with another framework. Spark loves Parquet files ;)
Read into Spark DataFrame
You can download Spark to run this code on your local machine if you'd like.
The Parquet file was output to /Users/powers/Documents/code/my_apps/parquet-go-example/tmp/shoes.parquet on my machine.
cd into the downloaded Spark directory (e.g. cd ~/spark-2.4.0-bin-hadoop2.7/bin/) and then run ./spark-shell to start the Spark console.
Let's read the Parquet file into a Spark DataFrame:
val path = "/Users/powers/Documents/code/my_apps/parquet-go-example/tmp/shoes.parquet"
val df = spark.read.parquet(path)
Run the show() method to inspect the DataFrame contents:
df.show()
+-----------+------------+
| shoe_brand| shoe_name|
+-----------+------------+
| nike| air_griffey|
| fila|grant_hill_2|
|steph_curry| curry7|
+-----------+------------+
Run the printSchema() method to view the DataFrame schema:
df.printSchema()
root
|-- shoe_brand: string (nullable = true)
|-- shoe_name: string (nullable = true)
You can use Go to build a Parquet data lake and then do further data analytics with Spark. Parquet is the perfect handoff format between Go and Spark!
Reading into a Go DataFrame
qframe seems to be the most promising Go DataFrame library.
It doesn't support Parquet yet, but hopefully we can get a qframe.ReadParquet method added ;)
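In the meantime, parquet-go itself can read the file back into a slice of Shoe structs. Here's a rough sketch using the library's reader and local packages (double-check the exact API against the parquet-go version you're running):

fr, err := local.NewLocalFileReader("tmp/shoes.parquet")
if err != nil {
    log.Fatal("Can't open parquet file", err)
}
pr, err := reader.NewParquetReader(fr, new(Shoe), 2)
if err != nil {
    log.Fatal("Can't create parquet reader", err)
}
// Read every row into a slice of Shoe structs.
shoes := make([]Shoe, pr.GetNumRows())
if err = pr.Read(&shoes); err != nil {
    log.Println("Read error", err)
}
pr.ReadStop()
fr.Close()
log.Println(shoes)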
Next steps
We need to create more examples and demonstrate that parquet-go can also write out other column types like integers.
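For instance, an additional integer column could be declared with an INT64 tag, along the lines of this sketch (using the same tag conventions as the Shoe struct above):

type ShoeSale struct {
    ShoeBrand  string `parquet:"name=shoe_brand, type=UTF8"`
    PriceCents int64  `parquet:"name=price_cents, type=INT64"`
}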
Go is a great language for ETL. Parquet support makes it even better!