
Advantages of a denormalised Apache Parquet storage


It is important for our customers to know that they will always have access to their data when they need it. Whether they want to display a dashboard, drill down or extract data via the API, they need to be able to query large volumes of data efficiently to achieve their objectives and bring value to their companies.
Behind the scenes, AT Internet’s engineers are working to set up data processing and storage systems to make this possible. They use the various tools at their disposal for this purpose. 
This article is the first of a series that will introduce you to some of the data storage technologies we use at AT Internet. It focuses on Apache Parquet, a file storage format designed for big data.

AT Internet started a major overhaul of its processing chain several years ago. Some fundamental aspects of this redesign have recently become more visible with tools such as Data Query 3 and the new Data Model. Internally, the first steps of this transition started a few years ago when the company began to redesign how it processes and stores data from the ground up, taking advantage of the potential that the Big Data ecosystem has to offer. 
One of the cornerstones of this new approach is column-oriented storage, which makes it possible to denormalise data efficiently. This article will explain what that means and what the benefits are. But before getting into these explanations, I'll describe how data is traditionally stored, in particular through normalisation and row-oriented storage.

The traditional approach to data storage: normalisation

Imagine we want to create a database with a certain amount of information related to films. 

Movie Id | Movie name             | Release year | Author name    | Author country
1        | The Matrix             | 1999         | The Wachowskis | USA
2        | The Matrix Reloaded    | 2003         | The Wachowskis | USA
3        | The Matrix Revolutions | 2003         | The Wachowskis | USA

As you can see, in the storage model above, the author's information is duplicated on each film, even though, in this case, the data is exactly the same for each row.
In the real world a lot of data is duplicated or shared in databases. For example, we could want to store the actors who played in each film; or we could go to a deeper level and store information about the countries the actors or directors come from... or both!
We intuitively feel that storing the data by duplicating it can quickly become a problem if a technical solution is not found.
Traditionally, the 'right', standard way to store this type of data is to separate the different types of data into different tables, in order to reduce or even eliminate duplicates. This technique has long been the de facto approach in the industry.

Movie Id | Movie name             | Release year | Author id
1        | The Matrix             | 1999         | 1
2        | The Matrix Reloaded    | 2003         | 1
3        | The Matrix Revolutions | 2003         | 1

Author id | Name           | Country
1         | The Wachowskis | USA

In the current (and time-honoured) processing chain, i.e. the one used to execute your Data Query requests, most of the data is stored this way, in what is called a normal form.
This approach has the advantage of reducing duplication and therefore the amount of data that needs to be stored. In many use cases, storing data in this way is the most natural, economical and efficient option, because database management systems (DBMS) have mechanisms to make queries on this type of schema efficient. How? Through techniques such as the calculation and storage of statistics internal to the database engine, or query optimisation.
In the world of analytics, where OLAP-type queries dominate, this paradigm is beginning to pose problems as the quantities of data collected explode.
The most important of these problems occurs when you try to combine several large tables. In technical jargon, this is called performing a join operation between several tables.
At the scale of data processed and requested at AT Internet, this join step can be very costly and complex to perform efficiently. If a query is not properly optimised by the DBMS engine, the processing cost may be such that the resulting query is too slow for the end customer.
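To make this concrete, here is a minimal pandas sketch (the tables and column names are illustrative, not our actual schema) of the normalised layout above and the join required to answer a simple analytical question:

import pandas as pd

# Normalised layout: two separate tables linked by author_id.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "movie_name": ["The Matrix", "The Matrix Reloaded", "The Matrix Revolutions"],
    "release_year": [1999, 2003, 2003],
    "author_id": [1, 1, 1],
})
authors = pd.DataFrame({
    "author_id": [1],
    "author_name": ["The Wachowskis"],
    "author_country": ["USA"],
})

# Answering "how many films per author country?" requires a join.
# On tables with billions of rows, this is the expensive step.
films_per_country = (
    movies.merge(authors, on="author_id")      # the join
          .groupby("author_country")["movie_id"]
          .count()
)
print(films_per_country)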

Another way to do things: denormalise the data for performance at query time

Data Engineer 1 – Why not just store everything in a denormalised way? This way we won’t have to pay the price of these damn joins every time the customer wants to request their data! 
Data Engineer 2 – Are you serious? Do you have any idea how much duplication this would create? 
Data Engineer 2 -… 
Data Engineer 2 – This is madness! 
Data Engineer 1 – Madness? … THIS IS DATA! 

The rationale for adopting such an approach is quite simple: avoiding joins means simpler queries that are easier to optimise, and therefore better response times for the end user at query time.
Empirically, we find that the extra storage cost is more than compensated by the query performance gained, all the more so because a number of techniques can be deployed to relieve the burden of redundancy.
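For comparison, here is a toy sketch (again with illustrative columns) of the same question asked of a denormalised, flat table: no merge is needed, only a single scan and aggregation.

import pandas as pd

# Denormalised layout: author information repeated on every row.
movies_flat = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "movie_name": ["The Matrix", "The Matrix Reloaded", "The Matrix Revolutions"],
    "release_year": [1999, 2003, 2003],
    "author_name": ["The Wachowskis"] * 3,
    "author_country": ["USA"] * 3,
})

# No join: a single scan and aggregation answers the question.
print(movies_flat.groupby("author_country")["movie_id"].count())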

The column format 

One way to compensate for the data duplication that comes with denormalised storage is to use what is called a column-oriented format.

In a traditional database, each record is stored in one block. The blocks follow each other, and all the data for a given record sits in a contiguous space:

1;The Matrix;1999;The Wachowskis 
2;The Matrix Reloaded;2003;The Wachowskis 
3;The Matrix Revolutions;2003;The Wachowskis 

One of the problems with storing data in this form is that you have to read the whole row from disk even if you only want part of the data. For example, we are forced to load the titles from disk even if the only information we want to retrieve is each film's year of release.
Reading a relatively small subset of the available columns is a prototypical example of the kind of query our customers run in an analytical context.
In a column-oriented format, each column is stored separately: 

The Matrix:1;The Matrix Reloaded:2;The Matrix Revolutions:3 
1999:1;2003:2;2003:3 
The Wachowskis:1;The Wachowskis:2;The Wachowskis:3 

At first glance, this may not seem like a big difference, but in reality, this alteration changes the constraints so much that it is a real paradigm shift. 
One of the immediate advantages is that it is now much easier to read only the data in certain columns. This implies fewer disk I/Os, which is crucial because disk I/O is one of the main factors limiting performance.
A somewhat less obvious advantage of this paradigm shift is that, since the data within a single column is generally quite homogeneous, it can be compressed aggressively by applying appropriate compression algorithms, which can even be chosen on the fly and on a case-by-case basis.
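As an illustration, here is a small sketch using the pyarrow library (the file name and columns are hypothetical) showing how a column-oriented file lets you read only the columns you need:

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small denormalised table to a Parquet file (column-oriented).
table = pa.table({
    "movie_id": [1, 2, 3],
    "movie_name": ["The Matrix", "The Matrix Reloaded", "The Matrix Revolutions"],
    "release_year": [1999, 2003, 2003],
    "author_name": ["The Wachowskis"] * 3,
})
pq.write_table(table, "movies.parquet")

# Column pruning: only the 'release_year' column is read from disk;
# the titles and author names are never loaded.
years = pq.read_table("movies.parquet", columns=["release_year"])
print(years)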
To give an overview, using our previous example, the date and author columns could be stored as follows:

1999:1;2003:2,3 
The Wachowskis:* 

Many optimisations are possible, but these are only mentioned here as examples to give you an idea of the possibilities offered by this type of format. 
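To give a flavour of how this can be steered in practice, here is a hedged pyarrow sketch: the codec choices and the decision to dictionary-encode the author column are purely illustrative, not a recommendation.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "movie_id": [1, 2, 3],
    "release_year": [1999, 2003, 2003],
    "author_name": ["The Wachowskis"] * 3,
})

# Dictionary-encode the highly repetitive author column and pick
# a compression codec per column (codec choices are illustrative).
pq.write_table(
    table,
    "movies_compressed.parquet",
    use_dictionary=["author_name"],
    compression={"movie_id": "snappy",
                 "release_year": "snappy",
                 "author_name": "zstd"},
)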
Most DBMS on the market offer options to store data in a column format. Microsoft SQL Server, the database traditionally used at AT Internet, is for example able to handle columnstore storage, and this feature has been used in our databases for several years now. Nevertheless, the data in these databases remains normalised, unlike in the New Data Factory.

A few words about Apache Parquet 

The official Apache Parquet site defines the format as: 
A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

This technology is one of the most popular implementations of a column-oriented file format. 
One aspect I would like to stress is that this is a file format, not a database management system. This is an important distinction, in particular because, being made up of simple files, a Parquet data lake can be stored wherever you want: on your SAN storage array, in a data centre, or with a cloud provider.
Adopting such a storage format thus makes it possible to start on a sound basis, compatible with the principle of decoupling compute and storage, one of the prevailing principles in big data storage and one that we try to follow as data engineers at AT Internet.

We will not have time in this article to cover all the features of the Apache Parquet format, but here is a brief summary of the most useful ones:

  • Parquet natively handles nested data structures
  • Empty values (NULL) are handled natively and cost almost nothing in storage
  • Parquet files are self-describing (the schema is contained in each file)
  • Engines that read the Parquet format can dynamically discover the schema of a Parquet data lake (although this discovery may take some time)
  • The format natively supports predicate push-down by skipping row groups, which means that when filtering a Parquet file, only the part of the data that really interests us is loaded from disk (see the sketch after this list)
  • It is strongly supported across the tools of the Big Data ecosystem
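Here is a short pyarrow sketch illustrating a few of these points; the file name and columns are hypothetical:

import pyarrow as pa
import pyarrow.parquet as pq

# Nested structures and NULLs are handled natively.
events = pa.table({
    "movie_id": [1, 2, 3],
    "release_year": [1999, 2003, None],                         # NULL costs almost nothing
    "cast": [["Keanu Reeves", "Carrie-Anne Moss"], [], None],   # nested list column
})
pq.write_table(events, "events.parquet")

# The file is self-describing: the schema travels with the data.
print(pq.read_schema("events.parquet"))

# Predicate push-down: only row groups that may contain matching
# rows are read from disk.
matches = pq.read_table("events.parquet",
                        filters=[("release_year", "=", 2003)])
print(matches.to_pydict())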

A final important aspect to mention about this format is partitioning. In a Parquet data lake, files are generally stored in directories corresponding to one of the columns of the data. This is called partitioning. If we partition our files by date, we can for example have:


|- date=2019-01-01/data.parquet 
|- date=2019-01-02/data.parquet 
|- ... 

Partitioning by date or time is often a natural and efficient choice for data collected as an uninterrupted stream. Partitioning lets you query the data efficiently by directly targeting the files likely to contain the data you are looking for.
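As a sketch of what this can look like with pyarrow (paths, dates and the metric column are illustrative), writing a partitioned data lake and then querying a single day only touches the matching directory:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "date": ["2019-01-01", "2019-01-01", "2019-01-02"],
    "movie_id": [1, 2, 3],
    "visits": [120, 80, 95],
})

# Writes date=2019-01-01/... and date=2019-01-02/... directories.
pq.write_to_dataset(table, root_path="movies_lake", partition_cols=["date"])

# Only the files under date=2019-01-02/ are read for this query.
day = pq.read_table("movies_lake", filters=[("date", "=", "2019-01-02")])
print(day.to_pydict())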

Not the ideal remedy: pitfalls… and solutions 

Alongside the positive aspects of this paradigm shift, new constraints and difficulties emerge, for which solutions have to be found. Here are some of these issues, as well as ways to limit their inconvenience.

Updating data 

Parquet does not offer a native way to update only a few rows in a file. To do this, the file has to be completely rewritten.
Fortunately, in most Big Data workflows the data is "write once, read many times", which mitigates the problem in this context. Good partitioning of the data also improves the situation, because it makes it possible to update one or more partitions independently of the rest of the data lake.
When it is known in advance that some data will change frequently, one possible solution is to re-normalise it.
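Under the assumption of date-based partitioning, a partition-level rewrite might look like the following sketch (the paths and helper function are hypothetical); only the affected directory is replaced, and note that the delete-then-write sequence is not atomic:

import shutil
import pyarrow as pa
import pyarrow.parquet as pq

def rewrite_partition(root_path: str, date: str, new_table: pa.Table) -> None:
    """Replace a single date partition with freshly computed data."""
    partition_dir = f"{root_path}/date={date}"
    shutil.rmtree(partition_dir, ignore_errors=True)      # drop the old files
    pq.write_to_dataset(new_table, root_path=root_path,   # rewrite just this slice;
                        partition_cols=["date"])          # the rest of the lake is untouched

corrected = pa.table({
    "date": ["2019-01-02"],
    "movie_id": [3],
    "visits": [99],
})
rewrite_partition("movies_lake", "2019-01-02", corrected)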

Variable geometry properties 

Depending on how the files are written and the type of technology storing them (Hadoop cluster, S3, local file system), the data lake does not have exactly the same properties, especially in terms of the atomicity of operations.
As a data engineer, it is therefore important to know and master the properties of the file system hosting your data lake, so as to avoid misinterpretation.

Transaction management 

Transaction management has to be done by hand: if several processes write and read the same data, they must be synchronised manually so that data is never read in a corrupted state.
Data processing tools such as Apache Spark natively have connectors that allow transactional writing for most file systems. In cases where these features cannot be used, it is still possible to implement a distributed lock system to regulate read and write access to the data lake.
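As a purely illustrative sketch of the lock idea (not a production-grade solution, and only valid on file systems where exclusive file creation is atomic), a shared lock file can serialise access to the data lake:

import os
import time
from contextlib import contextmanager

@contextmanager
def lake_lock(lock_path: str, retry_seconds: float = 1.0):
    """Naive mutual exclusion via atomic, exclusive file creation."""
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break                        # we own the lock
        except FileExistsError:
            time.sleep(retry_seconds)    # another process is reading or writing
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)             # release the lock

# Writers and readers wrap their access to the shared data:
with lake_lock("/tmp/movies_lake.lock"):
    pass  # read or rewrite the data lake here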

Sorting data within a file 

The way data is distributed inside a partition and sorted within a given Parquet file matters a great deal, and strongly determines the size of the written files.
There is no miracle solution here: it is essential to know the data being processed in order to choose the right sort order. One of the best approaches is to test different sort and partitioning keys, the goal being to group similar data into the same Parquet row groups.
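As a sketch (the columns and the sort key are illustrative), sorting the data before writing it and controlling the row-group size could look like this with pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "site_id": [42, 7, 42, 7, 42],
    "page": ["/home", "/home", "/cart", "/product", "/home"],
    "visits": [10, 3, 5, 8, 12],
})

# Sort so that rows about the same site end up in the same row groups:
# similar values compress better and predicate push-down skips more data.
sorted_table = table.sort_by([("site_id", "ascending"), ("page", "ascending")])

# row_group_size controls how many rows go into each Parquet row group.
pq.write_table(sorted_table, "pages.parquet", row_group_size=128 * 1024)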

Taking a step back 

Many of the problems mentioned above are some form of loss of ACID guarantees, and most of the technologies on which contemporary data lakes are built suffer from it. As these problems are very general, the big data community is working on solutions, many of which are beginning to emerge; honourable mentions include Delta Lake, ACID ORC and Apache Iceberg. These technologies are promising and should, in the medium term, make it possible to stop worrying about the considerations mentioned above.

In conclusion 

We have seen in this article what database normalisation is, and learned about the row- and column-oriented storage formats. We also saw why the column format is preferred in big data workflows, why it opens the door to storing data in a denormalised form, and how this led us to introduce the Apache Parquet file format.
As the ecosystem is constantly evolving, and as data engineers in a company that processes as much data as AT Internet does, we have a responsibility to stay current and keep adapting our processing chains with the most efficient tools, in order to bring the best value to our customers. We hope to tell you more about these tools in future articles.
If you found this article interesting, do not hesitate to visit the AT Internet website to learn more about our solution.
