Different ways to write CSV files with Dask

This post explains how to write a Dask DataFrame to CSV files.

You’ll see how to write CSV files, customize the filename, change the compression, and append files to an existing data lake.

We’ll also discuss best practices and gotchas that you need to watch out for when productionizing your code.

Simple example

Let’s create a dataset, write it out to disk, and observe the files that are created.

Create a Dask DataFrame with two partitions and then write it out to disk with to_csv:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)
df = dd.from_pandas(pdf, npartitions=2)
df.to_csv("./tmp/csv_simple")

Here are the files that are outputted:

csv_simple/
  0.part
  1.part

The Dask DataFrame is written as two files because it contains two partitions. Each partition in a Dask DataFrame is written in parallel so the operation is quick.
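The number of output files matches the partition count, so a quick way to predict how many files you’ll get (purely a sanity check) is to print it before writing:

# each Dask DataFrame partition becomes one output file
print(df.npartitions)  # 2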

Here are the contents of 0.part:

,num1,num2
0,1,7
1,2,8

Dask writes the index by default. You can set index=False so the index isn’t written:

df.to_csv("./tmp/csv_simple", index=False)

Now let’s see how to customize the filename.

Customize filename

Here’s how to write the data out with a proper file extension:

df.to_csv("./tmp/csv_simple2/number-data-*.csv", index=False)

Here are the files that are outputted:

csv_simple2/
  number-data-0.csv
  number-data-1.csv

You can even write the files out to a directory structure:

df.to_csv("./tmp/csv_simple3/batch-1/numbers-*.csv", index=False)

Here’s how the files are written:

csv_simple3/
  batch-1/
    numbers-0.csv
    numbers-1.csv

Set compression

Write out the files with gzip compression:

df.to_csv("./tmp/csv_compressed/hi-*.csv.gz", index=False, compression="gzip")

Here’s how the files are outputted:

csv_compressed/
  hi-0.csv.gz
  hi-1.csv.gz
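
If you later read these compressed files back with Dask, remember that gzip files can’t be split, so it’s safest to disable the blocksize heuristic. Here’s a minimal sketch assuming the files written above:

import dask.dataframe as dd

# gzip isn't splittable, so read each file as a single partition
ddf = dd.read_csv(
    "./tmp/csv_compressed/hi-*.csv.gz",
    compression="gzip",
    blocksize=None,
)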

Watch out!

The to_csv writer clobbers existing files in the event of a name conflict. Be careful whenever you’re writing to a folder with existing data, especially if you’re using the default file names.

Let’s write some data to a CSV directory.

pdf = pd.DataFrame(
    {"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)
df = dd.from_pandas(pdf, npartitions=2)
df.to_csv("./tmp/csv_danger")

Here are the files on disk:

csv_danger/
  0.part
  1.part

Let’s write another file to disk:

pdf = pd.DataFrame(
    {"firstname": ["cat"], "lastname": ["dog"]},
)
df = dd.from_pandas(pdf, npartitions=1)
df.to_csv("./tmp/csv_danger")

The csv_danger folder still has two files, but 0.part was clobbered.

Here are the contents of 0.part:

,firstname,lastname
0,cat,dog

Here are the contents of 1.part:

,num1,num2
2,3,9
3,4,10

Luckily, it’s easy to add a UUID to the outputted filename to avoid accidental clobbering.

Concurrency safe appends

You can use a UUID in the filename to make sure none of the underlying files will be clobbered when performing an append operation.

Here’s a chunk of code that’ll write out two files in the first batch and then write out another file in the second batch:

import uuid

pdf = pd.DataFrame(
    {"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)
df = dd.from_pandas(pdf, npartitions=2)
batch_id = uuid.uuid4()
df.to_csv(f"./tmp/csv_appender/mything-{batch_id}-*.csv", index=False)

pdf = pd.DataFrame(
    {"num1": [11, 11], "num2": [12, 12]},
)
df = dd.from_pandas(pdf, npartitions=1)
batch_id = uuid.uuid4()
df.to_csv(f"./tmp/csv_appender/mything-{batch_id}-*.csv", index=False)

Here’s what gets written to the filesystem:

csv_appender/
  mything-44a0201c-c381-4ac8-adb6-91ba82b60a92-0.csv
  mything-44a0201c-c381-4ac8-adb6-91ba82b60a92-1.csv
  mything-3ff08ac4-50e3-4b68-bbdb-611a8afe0d09-0.csv

You can tell which files belong to each batch by the UUID that’s embedded in the filename.

Multiple different writers can append to the same directory and there won’t be any name collisions with this design pattern.
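
When you want to analyze the whole directory, a single glob picks up every batch, no matter which writer produced it. Here’s a minimal sketch assuming the mything filenames used above:

import dask.dataframe as dd

# the glob matches the files from every batch
ddf = dd.read_csv("./tmp/csv_appender/mything-*.csv")
print(ddf.compute())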

Writing to a single file

The single_file flag makes it easy to write a single file:

df.to_csv("./tmp/my_single_file.csv", index=False, single_file=True)

There are other techniques for writing single files that are more appropriate in other situations; see here for the full description.
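
One such alternative, reasonable when the result comfortably fits in memory, is to collect the Dask DataFrame into pandas and let pandas write the file. Here’s a minimal sketch (the output filename is just an example):

# pull everything into a single pandas DataFrame on the client,
# then write one file; only sensible when the data fits in RAM
df.compute().to_csv("./tmp/my_single_file_pandas.csv", index=False)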

Having fun

Dask even lets you supply a name_function argument to customize the numeric part of the filenames. Let’s write the two files as 10.part and 12.part instead of 0.part and 1.part.

my_fun = lambda x: str(x * 2 + 10)
pdf = pd.DataFrame(
    {"num1": [1, 2, 3, 4], "num2": [7, 8, 9, 10]},
)
df = dd.from_pandas(pdf, npartitions=2)
df.to_csv("./tmp/csv_even", name_function=my_fun)

Here’s what is written to the filesystem:

csv_even/
  10.part
  12.part
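
A more practical use of name_function is stamping each partition’s filename with metadata, such as the date the batch was written. The directory name below is just an example:

from datetime import date

# produces files like <today's date>-0.part, <today's date>-1.part
df.to_csv(
    "./tmp/csv_dated",
    name_function=lambda i: f"{date.today()}-{i}",
)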

When to use CSV

CSV is a bad file format for data analyses:

  • quoting bugs are common
  • cannot distinguish between the empty string and “nothing”
  • doesn’t allow for column pruning
  • less compressible than a columnar file format

CSV is human readable and that’s really the only advantage.

Dask offers great support for file formats that are better suited to production-grade data workflows, like Parquet.

Parquet gives better performance and saves you from suffering through a lot of weird bugs. Avoid CSV whenever possible!
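
As a point of comparison, writing the same DataFrame to Parquet is a one-line change. This sketch assumes a Parquet engine like pyarrow is installed:

# same data, columnar format; requires pyarrow or fastparquet
df.to_parquet("./tmp/parquet_example")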

Conclusion

Dask makes it easy to write CSV files and provides a lot of customization options.

Only write CSVs when a human needs to actually open the file and inspect the contents.

Whenever possible, use a better file format for data analyses like Parquet, Avro, or ORC.
