Metadata for win — Apache Parquet

You read the title right! Apache Parquet provides data layout properties that help optimize the capabilities of your data processing engine. Popular distributed computing solutions such as Apache Spark and Presto exploit these properties of Apache Parquet to read and write data faster.

Also, enterprise solutions on the market like Databricks provide ACID properties on top of the Apache Parquet format to build Delta tables. Newer table formats have also arrived, such as Apache Iceberg and Apache Hudi.

But how does it work? And what if you have to write your own custom processing solution, because you need a needle instead of a sword like Apache Spark?

Setting up Apache Spark is often another big elephant that most people shy away from when they just want to process a manageable amount of data on a single machine.

For such cases, Apache Arrow is a great fit. Although it is a language-agnostic platform, it works well for single-machine processing. Other libraries like Polars can also be used for single-machine processing.

But how do these frameworks make the best of the Apache Parquet format? The answer lies in the columnar format and the inherent structure of the file. The file is laid out so that retrieving columns is much more efficient than row-based retrieval. In fact, analytical queries typically retrieve only the columns they need instead of selecting all columns as a whole.
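For example, with PyArrow (the file and column names below are illustrative) you can project only the columns a query needs instead of reading the whole file:

```python
import pyarrow.parquet as pq

# Read only the columns the query needs; the byte ranges of all other
# column chunks are never fetched from storage.
table = pq.read_table("user_clicks.parquet", columns=["user_id", "clicks"])
print(table.num_rows, table.column_names)
```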

How is the file structured?

Apache Parquet is optimized for analytical processing queries, which is why it follows a columnar format. Referring to the official illustration below, I'll explain how it works:

What’s going on in the above illustration? Let me explain.

The file is structured in parts, with three crucial sections:

Header

The header contains the official magic number of the format, “PAR1”. This information mostly serves to designate that the file is in Parquet format.

Data

The actual data is stored in the data section of the file. It is organized as row groups, which contain column chunks, which in turn contain pages. Don't worry, we'll come back to these when we discuss the footer.

Footer

The footer is the main block of information we are interested in, and it is what this article is about. The footer holds critical information that processing frameworks exploit to optimize read and write operations.

It contains metadata about the whole file. The metadata is written at write time to keep track of information so that it can be used at read time. The high-level metadata we are interested in is as follows:

FileMetadata — metadata block containing file-level information such as the schema and the version of the Parquet format.

RowGroupMetadata — holds metadata about each row group, such as the column chunks it contains and the number of records per row group.

ColumnMetadata — describes each column chunk: its name, value counts (distinct counts if collected), compression type, offset to the data pages, byte size, and column statistics such as min and max values.

PageMetadata — pages are the blocks that hold the actual data, with each column chunk broken down into multiple pages. Each page's metadata contains the offset value needed to access the next page.

There is additional metadata, such as column indexes, that can be set while writing the file.
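To make this concrete, here is a minimal sketch using PyArrow to walk the footer metadata of a single file (the file name is illustrative):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("user_clicks.parquet")
meta = pf.metadata  # FileMetadata parsed from the footer

# File-level metadata: format version, schema, row/row-group counts
print(meta.format_version, meta.created_by)
print(meta.num_rows, meta.num_row_groups, meta.num_columns)
print(pf.schema_arrow)

# Row-group and column-chunk metadata: compression, offsets, statistics
for rg_idx in range(meta.num_row_groups):
    rg = meta.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        stats = col.statistics  # may be None if statistics were not written
        print(
            col.path_in_schema,
            col.compression,
            col.data_page_offset,
            col.total_compressed_size,
            stats.min if stats is not None else None,
            stats.max if stats is not None else None,
        )
```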

If you want to know how this looks when the file is laid out on disk, the structure is as follows:

```
4-byte magic number “PAR1”
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number “PAR1”
```

Source: https://parquet.apache.org/docs/file-format/metadata/
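As a quick sanity check of this layout, you can locate the footer yourself: the file both starts and ends with the “PAR1” magic number, and the 4 bytes just before the trailing magic hold the footer length. A minimal sketch (the file name is illustrative):

```python
import struct

with open("user_clicks.parquet", "rb") as f:
    # Header magic number
    assert f.read(4) == b"PAR1"

    # Last 8 bytes: 4-byte footer length (little endian) + 4-byte magic
    f.seek(-8, 2)
    footer_len = struct.unpack("<I", f.read(4))[0]
    assert f.read(4) == b"PAR1"

    # The Thrift-encoded footer sits immediately before those 8 bytes
    f.seek(-(8 + footer_len), 2)
    footer_bytes = f.read(footer_len)
    print(f"footer is {footer_len} bytes")
```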

Using metadata in practice

The file structure and metadata discussed above can be used to build efficient systems for retrieving and processing large numbers of files. In this section we will discuss how you can use this information in your program's read and processing logic.

Often, the mistakes are made at write time itself. Most people ignore the benefits of the Parquet format and treat it like any other file format such as CSV or TSV. But if done right, a few extra parameters can be used to apply these properties: sorting on low-cardinality columns, indexing on a column, or sorting high-cardinality metric columns so that values within a range are co-located.
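As an illustration, here is a hedged sketch of such write-time tuning with PyArrow (the table, column names, and sizes are assumptions, not the exact settings used later in the experiment):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny illustrative table; in practice this would be your click data.
table = pa.table({
    "country": ["IN", "US", "IN", "DE"],
    "event_ts": [4, 1, 3, 2],
    "clicks": [10, 5, 7, 2],
})

# Sorting on a low-cardinality column co-locates similar values, which
# tightens per-row-group min/max statistics and improves pruning at read time.
sorted_table = table.sort_by([("country", "ascending"), ("event_ts", "ascending")])

pq.write_table(
    sorted_table,
    "user_clicks_sorted.parquet",
    row_group_size=100_000,   # records per row group
    compression="snappy",
    write_statistics=True,    # per-column-chunk min/max statistics
)
```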

Let’s not waste time and dive into a real example.

Data in practice

We are going to use a script to generate some random data with a defined structure. The script will produce 100k records per file, with 100 files in total. At first, we will add some extra parameters to force certain conditions for efficient reads, such as sorting based on columns. This data will be stored in a GCS bucket, since most production environments are cloud-based and blob storage systems are used to handle concurrent reads and writes by the processing solutions.

Data points are scattered randomly across the files, since it is a challenge for most processing engines when reads are spread across 80–90% of the files in question. It is also a worst case for exercising and testing our processing system.

Refer to the code below to generate the data.
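The exact script lives in the repository linked at the end; the sketch below is a simplified stand-in that follows the description above (column names and value ranges are my assumptions):

```python
import os
import random
import uuid

import pyarrow as pa
import pyarrow.parquet as pq

NUM_FILES = 100
ROWS_PER_FILE = 100_000

def make_batch(num_rows: int) -> pa.Table:
    """Generate random user-click records (illustrative schema)."""
    return pa.table({
        "user_id": [str(uuid.uuid4()) for _ in range(num_rows)],
        "page_id": [random.randint(1, 10_000) for _ in range(num_rows)],
        "clicks": [random.randint(1, 50) for _ in range(num_rows)],
        "event_ts": [random.randint(1_600_000_000, 1_700_000_000) for _ in range(num_rows)],
    })

os.makedirs("data", exist_ok=True)
for i in range(NUM_FILES):
    # Writing locally here; in the experiment the files live in a GCS bucket
    # (e.g. uploaded via gcsfs or PyArrow's GCS filesystem).
    pq.write_table(
        make_batch(ROWS_PER_FILE),
        f"data/user_clicks_{i:03d}.parquet",
        row_group_size=100_000,
        compression="snappy",
    )
```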

Metadata Collection and Filtering Processes

We'll divide our codebase into two parts. The first is the metadata collection process, which is responsible for reading the metadata across all 100 files and writing it to a metadata folder at the same path where the other files exist. The second is the filtering process, which takes some query parameters as arguments and searches the collected metadata to identify only those parts of the files we need to read.

We are going to use Python's multiprocessing module to parallelize the reading and writing of metadata. Mind you, we are also going to use Apache Arrow just for reading the metadata and for memory-mapping the files while reading the actual data.

In the cover image, each process belongs to one of two classes: the collection and write processes are part of the MetadataCollector class, while the filter and data collection processes are part of the MetadataProcessor class.

Both classes provide executable methods for multiprocessing.
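The real classes live in the repository linked at the end; the sketch below is a simplified approximation of the same two-phase idea (function names, the filter column, and the metadata layout are my assumptions, and memory mapping is left out for brevity). One pass collects per-row-group statistics with a multiprocessing pool, and a second pass uses those statistics to read only the matching row groups:

```python
import glob
import json
import os
from multiprocessing import Pool

import pyarrow.parquet as pq

def collect_file_metadata(path: str) -> list[dict]:
    """Collect min/max stats per row group for the filter column."""
    meta = pq.ParquetFile(path).metadata
    entries = []
    for rg_idx in range(meta.num_row_groups):
        rg = meta.row_group(rg_idx)
        for col_idx in range(rg.num_columns):
            col = rg.column(col_idx)
            if col.path_in_schema == "page_id" and col.statistics is not None:
                entries.append({
                    "file": path,
                    "row_group": rg_idx,
                    "min": col.statistics.min,
                    "max": col.statistics.max,
                })
    return entries

def filter_row_groups(index: list[dict], value: int) -> list[tuple[str, int]]:
    """Keep only (file, row_group) pairs whose stats can contain `value`."""
    return [
        (e["file"], e["row_group"])
        for e in index
        if e["min"] <= value <= e["max"]
    ]

if __name__ == "__main__":
    files = glob.glob("data/*.parquet")
    os.makedirs("data/metadata", exist_ok=True)

    # Phase 1: metadata collection in parallel (MetadataCollector in the repo).
    with Pool(processes=4) as pool:
        index = [e for entries in pool.map(collect_file_metadata, files) for e in entries]
    with open("data/metadata/index.json", "w") as f:
        json.dump(index, f)

    # Phase 2: filtering + data collection (MetadataProcessor in the repo).
    candidates = filter_row_groups(index, value=42)
    tables = [
        pq.ParquetFile(path).read_row_group(rg, columns=["user_id", "clicks"])
        for path, rg in candidates
    ]
```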

Experiment in Consideration

I took around 20 GB of user-click data generated from a script; it has the following configuration:

No partitioning logic — random splits
No ordering of the columns — predicate pushdown will most probably read all files while filtering
Row group size — 100K records per row group
Number of rows per file — 1M
Compression type — Snappy

Compute Engine

The Google Cloud Compute Engine service is used to run the module. It has the following configuration:

4 vCPUs (2 Cores)
16 GB Memory

Performance

The whole process took around 1.1 minutes (29 seconds for metadata collection and 34 seconds for metadata filtering and data collection).

Conclusion

Although processing with Python suffers from plenty of drawbacks, the module provides a basic understanding of how Apache Parquet can be used for more efficient I/O operations. Future work would be to use Bloom filters, sort-merge compaction, Z-ordering (key co-location per file) and other tricks to make it more mature.

Code Repository

Refer to the GitHub link below to check out and run your own process. The code is not mature and lacks proper constructs, but it's still a WIP, so feel free to add suggestions or open a PR.

parquet-reader — https://github.com/RahulDubey391/parquet-reader