Analyzing Parquet Metadata and Statistics with PyArrow
The PyArrow library makes it easy to read the metadata associated with a Parquet file.
This blog post shows you how to create a Parquet file with PyArrow and review the metadata that contains important information like the compression algorithm and the min / max value of a given column.
Parquet files are central to many data analysis workflows. Knowing how to read Parquet metadata will help you work with Parquet files more effectively.
Converting a CSV to Parquet with PyArrow
PyArrow makes it really easy to convert a CSV file into a Parquet file. Suppose you have the following data/people/people1.csv file:
first_name,last_name
jose,cardona
jon,smith
You can read in this CSV file and write out a Parquet file with just a few lines of PyArrow code:
import pyarrow.csv as pv
import pyarrow.parquet as pq
table = pv.read_csv('./data/people/people1.csv')
pq.write_table(table, './tmp/pyarrow_out/people1.parquet')
Let's look at the metadata associated with the Parquet file we just wrote out.
Fetching Parquet file metadata
Let's create a PyArrow Parquet file object to inspect the metadata:
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('./tmp/pyarrow_out/people1.parquet')
parquet_file.metadata
<pyarrow._parquet.FileMetaData object at 0x10a3d8650>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 531
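The fields in that summary are also exposed as attributes on the metadata object, which is handy when you want the values programmatically rather than parsing the printed output. A minimal sketch:
metadata = parquet_file.metadata
# Each attribute mirrors a field from the printed summary above
print(metadata.num_rows)        # 2
print(metadata.num_columns)     # 2
print(metadata.num_row_groups)  # 1
print(metadata.format_version)  # 1.0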
Use the row_group method to get row group metadata:
parquet_file.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x10a3dcdc0>
num_columns: 2
num_rows: 2
total_byte_size: 158
You can use the column method to get column chunk metadata:
parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x10a413a00>
file_offset: 78
file_path:
physical_type: BYTE_ARRAY
num_values: 2
path_in_schema: first_name
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x10a413a50>
has_min_max: True
min: jon
max: jose
null_count: 0
distinct_count: 0
num_values: 2
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 35
total_compressed_size: 74
total_uncompressed_size: 70
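Since column takes an index, you can also loop over all the column chunks in a row group to build a quick per-column summary. Here's a small sketch that prints the column path, compression codec, and compressed size for each column chunk:
row_group = parquet_file.metadata.row_group(0)
# One line per column chunk: column path, codec, and compressed size in bytes
for i in range(row_group.num_columns):
    column = row_group.column(i)
    print(column.path_in_schema, column.compression, column.total_compressed_size)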
The compression algorithm used by the file is stored in the column chunk metadata, and you can fetch it as follows:
parquet_file.metadata.row_group(0).column(0).compression # => 'SNAPPY'
Fetching Parquet column statistics
The min and max values for each column are stored in the metadata as well.
Let's create another Parquet file and fetch the min / max statistics via PyArrow.
Here's the CSV data in the data/pets/pets1.csv file:
nickname,age
fofo,3
tio,1
lulu,9
Convert the CSV file to a Parquet file.
table = pv.read_csv('./data/pets/pets1.csv')
pq.write_table(table, './tmp/pyarrow_out/pets1.parquet')
Inspect the Parquet metadata statistics to see the min and max values of the age column.
parquet_file = pq.ParquetFile('./tmp/pyarrow_out/pets1.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)
<pyarrow._parquet.Statistics object at 0x11ac17eb0>
has_min_max: True
min: 1
max: 9
null_count: 0
distinct_count: 0
num_values: 3
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
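The printed statistics are available as attributes too, so you can grab the min and max values directly. A minimal sketch:
stats = parquet_file.metadata.row_group(0).column(1).statistics
# Only read min / max when the writer actually recorded them
if stats.has_min_max:
    print(stats.min)  # 1
    print(stats.max)  # 9
print(stats.null_count)  # 0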
The Parquet metadata statistics can make certain types of queries a lot more efficient. Suppose you'd like to find all the pets that are 10 years or older in a Parquet data lake containing thousands of files. You know that the max age in the tmp/pyarrow_out/pets1.parquet file is 9 based on the Parquet metadata, so you know that none of the data in that file is relevant for your analysis of pets that are 10 or older. You can simply skip the file entirely.
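Here's a rough sketch of what that kind of metadata-based file skipping could look like with plain PyArrow. The pets_files list and the assumption that age is the column at index 1 in every file are made up for this example; many query engines do this kind of pruning for you automatically.
import pyarrow.parquet as pq
def file_might_contain_old_pets(path, min_age=10, age_column_index=1):
    # Return False only when the metadata proves no row group can contain age >= min_age
    metadata = pq.ParquetFile(path).metadata
    for i in range(metadata.num_row_groups):
        stats = metadata.row_group(i).column(age_column_index).statistics
        # Missing statistics means we can't rule the row group out
        if stats is None or not stats.has_min_max:
            return True
        if stats.max >= min_age:
            return True
    return False
# Hypothetical list of files in the lake
pets_files = ['./tmp/pyarrow_out/pets1.parquet']
relevant_files = [path for path in pets_files if file_might_contain_old_pets(path)]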
num_rows and serialized_size
The number of rows (num_rows) and the size of the serialized file metadata (serialized_size) are also included in the Parquet metadata.
Big data systems are known to accumulate small files over time with incremental updates. Too many small files can cause performance bottlenecks, so the small files should periodically get compacted into bigger files.
You can query the metadata of all the Parquet files in a lake to identify the small files and determine how they should be compacted so the Parquet lake can be queried efficiently.
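As a rough sketch, here's how you might scan a directory of Parquet files and report each file's row count and serialized metadata size so the small files stand out. The lake path and the row-count cutoff are placeholder values, not anything prescribed by PyArrow:
import glob
import pyarrow.parquet as pq
# Placeholder assumptions: where the lake lives and what counts as "small"
LAKE_PATH = './tmp/pyarrow_out'
SMALL_FILE_ROW_THRESHOLD = 1_000_000
for path in sorted(glob.glob(f'{LAKE_PATH}/*.parquet')):
    metadata = pq.ParquetFile(path).metadata
    if metadata.num_rows < SMALL_FILE_ROW_THRESHOLD:
        print(f'{path}: num_rows={metadata.num_rows}, serialized_size={metadata.serialized_size}')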
Next steps
Parquet files are important when performing analyses with Pandas, Dask, Spark, or AWS services like Athena.
Most Parquet file consumers don't know how to access the file metadata. This blog post has taught you an important trick that'll put you ahead of your competition ;)