🪄 DuckDB sql hack : get things SORTED w/ constraint CHECK

🪄 DuckDB sql hack : get things SORTED w/ constraint CHECK

🪝 Intro

Ever wanted to get that kind on constraint on a table :

CREATE TABLE sensor_data(
ts TIMESTAMP,
measurement INTEGER,
…,
SORTED(ts)
);

Well, it turns out this topic has a dedicated discussion:

For now it’s not implemented…yet. But we have a lot of use cases out there.

Let’s see how to achieve this in pure sql… and why it would be so useful.

🍿 Demo

🤔 (A bit of) Context

We are currently using duckdb on GitHub Actions (see opt-nc/setup-duckdb-action) as it’s a very convenient and efficient way to check data quality with sql as part of our CI… with very very few efforts, see below:

☝️ Recently, we also felt the need to keep a csv file sorted according to a given column to keep it as clean as possible and letting people make Pull Requests… and, most important : delegating the CI the role to explain the end user why the data he wants to put cannot be merged.

We did not want any human in the loop to moderate contributors and explain how the data should be provided to get the PR merged. In particular in the context of we use GH auto-merge to

“[…] increase development velocity by enabling auto-merge for a pull request so that the pull request will merge automatically when all merge requirements are met.”

👉 This productivity hack is all about achieving that in pure sql… and take profit of:

duckdb columnar storage format (see Exploring DuckDB and the Columnar Advantage for more)
CHECK constraint

ℹ️ Notice that this hack can be applied on very large volumes of data.

🪄 sql tricks

———————————————————–

— Check if a table column is sorted w/ integrity check

———————————————————–

— Create a table with a column that is not sorted
— Feed some random stuff
— The real target table
create or replace table demo_sort (text varchar);

insert into demo_sort values (‘DuckDb’);
insert into demo_sort values (‘duckdb’);
insert into demo_sort values (‘Duckdb’);
insert into demo_sort values (‘DUCKDB’);
insert into demo_sort values (‘duckDB’);
insert into demo_sort values (‘DUCKdb’);
insert into demo_sort values (‘DuckDB’);

— Check the resulting table
from demo_sort;

— Prepare test environment
CREATE SEQUENCE seq_original START 1;
CREATE SEQUENCE seq_sorted START 1;

create or replace temp table orig_table as
select nextval(‘seq_original’) as index,
text from demo_sort;

create or replace temp table sorted_table as
select nextval(‘seq_sorted’) as index,
text
from (select text from demo_sort order by text);

— Check the resulting tables
from orig_table;
from sorted_table;

— Create the table that compares the sorted and original tables columns
create or replace temp table test_table(orig_text varchar,
orig_index integer,
sorted_index integer
— the magic part XD
check(orig_index = sorted_index)
);
— Populate the comparison table
insert into test_table
select
orig_table.text as orig_text,
orig_table.index as orig_index,
sorted_table.index as sorted_index,
from
orig_table,
sorted_table
where
orig_table.text = sorted_table.text
order by orig_table.index;

— Enjoy the resulting “Constraint Error: CHECK constraint failed: test_table”

🔖 Resources

opt-nc/setup-duckdb-action
🪄 DuckDB sql hack : get things SORTED (jupysql)
Automatically merging a pull request
Exploring DuckDB and the Columnar Advantage
Why DuckDB

Leave a Reply

Your email address will not be published. Required fields are marked *