Rebuilding Luxauto File Processing with Containers

The Luxauto file processing pipeline is a core part of our application: it keeps our classifieds marketplace up to date with the newest vehicle listings. Over the last few years, the pipeline has undergone substantial improvements:

We moved from a single sequential process running as a cron job to a parallel, multi-step pipeline. Each step became a Docker image running as multiple containers. This architectural shift drastically reduced processing time and increased reliability.

By adding logging capabilities to each step, we enabled live monitoring and observability. This allowed the software engineering team to quickly find and fix problems with the pipeline and reduced the dependency on the infrastructure team.

By moving from cron schedules to message queues, we improved scalability and resource utilisation, since memory and CPU are only used when there is work to do.

We enhanced our customer support by enriching the back-office application with information about each customer.

By decoupling the process into multiple small projects, we were able to upgrade the technology step by step, from PHP 5.4 cron jobs to PHP 8.2 FPM containers.

Our experience over the last year has reinforced our conviction that a reliable, fast and efficient file processing pipeline is fundamental to the continued success of Luxauto: it is what allows us to innovate and keep supporting our classifieds service.

From Cron to Containers

Cron

The Luxauto file processing pipeline used to run as a periodic cron job. The script would spawn a separate process for each file parser – ZIP, XML, TXT, CSV, etc. The files ranged in size from a few kilobytes up to several gigabytes.
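
To make the limitations below more concrete, here is a minimal sketch of what such a synchronous, cron-driven dispatch loop can look like. It is an illustration only; the directory layout and the parsers/parse_<type>.php scripts are assumptions, not the actual legacy code.

<?php
// Illustrative sketch of a synchronous cron-driven dispatch loop (not the actual
// legacy script). The parsers/parse_<type>.php scripts are hypothetical.
$allowed = ['zip', 'xml', 'txt', 'csv'];

foreach (glob('/var/feeds/incoming/*') ?: [] as $path) {
    $type = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!in_array($type, $allowed, true)) {
        continue;
    }

    // Spawn the parser for this file type and wait for it to finish. Because the
    // pipeline is synchronous, a multi-gigabyte file blocks every file behind it,
    // and a fatal error leaves the current file half-imported with no failure record.
    passthru(sprintf('php parsers/parse_%s.php %s', $type, escapeshellarg($path)));
}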

Cron has several limitations when it comes to scaling the file processing pipeline:

Scaling cron jobs up was not an option, and scaling them out is very difficult. Problems with the file processing were frequent, so we needed a solution with high availability and fault tolerance.

Files were partially processed. If the processing threw an uncaught exception or hit a fatal error halfway through, the file ended up in a pseudo-success state: the imported data was incomplete, which was very difficult to monitor and to trace back to the cause of the problem (see the sketch after this list).

Big files caused delays. Whenever the cron started processing a large file – a few gigabytes – every file waiting in the queue was held up, because the pipeline was synchronous.

Maintaining the code was costly. Improvements took too much time because the code was coupled to the file parsers, and each parser contained multiple customer-level customisations.
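
The pseudo-success problem mentioned above largely comes down to never recording an explicit outcome per file. As a minimal sketch of the idea, the snippet below wraps the processing of one file and stores an explicit status; markStatus() is a hypothetical helper standing in for a database update, not the actual Luxauto code.

<?php
// Minimal sketch: wrap each file import so failures are recorded explicitly instead
// of leaving a pseudo-success state. markStatus() is a hypothetical helper standing
// in for a database update.
function markStatus(string $file, string $status): void
{
    error_log(sprintf('[import] %s => %s', basename($file), $status));
}

function importFile(string $file, callable $process): void
{
    markStatus($file, 'processing');

    try {
        $process($file);
        markStatus($file, 'succeeded');
    } catch (\Throwable $e) {
        // In PHP 7+ most fatal errors are thrown as \Error, so they are caught here:
        // a half-processed file is visibly marked as failed and can be reprocessed.
        markStatus($file, 'failed: ' . $e->getMessage());
        throw $e;
    }
}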

In addition to these problems, the software engineering team relied on the infrastructure team to build and release the application.

Therefore, the goal was to build a highly available, fault-tolerant, fast and cheap-to-maintain file processing pipeline. We decided to split the pipeline into multiple independent containers.

Containers

The data and software engineering team started working on the next-generation Luxauto file processing pipeline. The team aimed to increase its autonomy from the infrastructure team, decouple the code and improve its maintainability, and significantly reduce the issues present in the existing implementation.

Creating the new pipeline using containers provides a straightforward CI/CD build and release pipeline, horizontal scaling, and maintainable and testable code. Logging, debugging and monitoring are also facilitated since running the containers locally doesn’t require setting up the entire application infrastructure.

Building the File Processing Pipeline in Gateways v2

Splitting the services

The Luxauto file processing pipeline consists of reading files, parsing them, and importing data and pictures. Hence, the first task was identifying the boundaries and splitting the work into services.

We analysed and grouped these steps into small, manageable services:

Reading files from the storage

Parsing the files into a standardised data structure

Importing and updating the data

Downloading and processing the pictures

Each service has a clear, decoupled responsibility and can be deployed as one or more containers, which gives us redundancy and high availability.
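
To give a feel for the boundary between parsing and importing, here is a hedged sketch of a parser contract that turns any file format into the same standardised structure. ParserInterface, VehicleListing and CsvParser, along with their fields, are illustrative assumptions, not the actual Luxauto code.

<?php
// Illustrative parser boundary: every parser, whatever the file format, yields the
// same standardised structure that the import service consumes. All names and fields
// here are assumptions for the example.

final class VehicleListing
{
    public function __construct(
        public readonly string $externalId,
        public readonly string $make,
        public readonly string $model,
        public readonly int $priceInCents,
        /** @var list<string> */
        public readonly array $pictureUrls = [],
    ) {
    }
}

interface ParserInterface
{
    /** @return iterable<VehicleListing> */
    public function parse(string $path): iterable;
}

final class CsvParser implements ParserInterface
{
    public function parse(string $path): iterable
    {
        $handle = fopen($path, 'rb');

        // Skip the header row, then stream one listing per line to keep memory flat
        // even for multi-gigabyte files.
        fgetcsv($handle);
        while (($row = fgetcsv($handle)) !== false) {
            yield new VehicleListing(
                externalId: $row[0],
                make: $row[1],
                model: $row[2],
                priceInCents: (int) round(((float) $row[3]) * 100),
                pictureUrls: array_values(array_filter(explode('|', $row[4] ?? ''))),
            );
        }

        fclose($handle);
    }
}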

Building the infrastructure

Every service provides a dedicated piece of functionality, and together they import and update the data. The containers communicate through message brokers and sit idle until work arrives: as soon as a new message is available, the first idle container picks it up and starts processing it. When the job is done, it publishes a message for the next step.
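
As a concrete illustration of such a worker, here is a minimal sketch that consumes messages from one queue, does its step and publishes a message for the next step. It assumes a RabbitMQ broker and the php-amqplib client; the host, credentials and queue names are made up for the example.

<?php
// Minimal pipeline worker sketch: consume from one queue, process, publish to the
// next. Broker, credentials and queue names are illustrative assumptions.
require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('rabbitmq', 5672, 'guest', 'guest');
$channel = $connection->channel();

$channel->queue_declare('files.parsed', false, true, false, false);
$channel->queue_declare('pictures.download', false, true, false, false);

// Deliver one message at a time, so the work is spread across the idle containers.
$channel->basic_qos(null, 1, null);

$channel->basic_consume('files.parsed', '', false, false, false, false, function (AMQPMessage $msg) use ($channel) {
    $payload = json_decode($msg->getBody(), true);

    // ...import or update the listings described by $payload here...

    // Hand the work over to the next step of the pipeline, then acknowledge.
    $channel->basic_publish(new AMQPMessage($msg->getBody()), '', 'pictures.download');
    $channel->basic_ack($msg->getDeliveryTag());
});

while ($channel->is_consuming()) {
    $channel->wait();
}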

Monitoring

Distributed applications can be challenging for developers to simulate and run locally, but being able to do so was a requirement for us, in order to facilitate monitoring, debugging and bug fixing.

First, we created a way to run and test each step individually, to make it easier and faster for the engineers to fix and improve the code (a sketch of such a runner follows below).

Second, we made it possible to verify in our back-office application whether a file was imported successfully and whether the data was inserted, updated or deleted. In case of errors, files can easily be reprocessed from the back-office application.
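
As an idea of what running a single step locally can look like, here is a minimal sketch of a command-line runner that executes one parser against a file on disk, reusing the illustrative CsvParser contract sketched earlier; the paths and class names are assumptions, not the actual tooling.

<?php
// Illustrative local runner: execute one pipeline step (parsing) against a file on
// disk, with no queues or containers involved. CsvParser refers to the illustrative
// contract sketched earlier; the autoload path is an assumption.
require __DIR__ . '/../vendor/autoload.php';

$path = $argv[1] ?? '';
if ($path === '' || !is_readable($path)) {
    fwrite(STDERR, "Usage: php bin/run-parser.php <file>\n");
    exit(1);
}

$parser = match (strtolower(pathinfo($path, PATHINFO_EXTENSION))) {
    'csv' => new CsvParser(),
    default => throw new InvalidArgumentException('No parser registered for this file type'),
};

// Print each standardised listing so the engineer can inspect the parser output.
foreach ($parser->parse($path) as $listing) {
    echo json_encode($listing, JSON_THROW_ON_ERROR), PHP_EOL;
}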

Where we are now

We’ve had the new file processing pipeline running in production for over a year now. During this time, we’ve gradually shifted the old file processing over to the new system, while making improvements such as enhanced logging, lower memory usage, reduced cloud cost and new monitoring tools.

While it is still early days, we have already seen the benefits of the new platform: the team now has autonomy from the infrastructure team, the code is easier to test, change and improve, and debugging, bug tracking and fixing are faster and more accurate.