Optimizing Fuzzy Search Across Multiple Tables: pg_trgm, GIN, and Triggers

RMAG news

PostgreSQL offers significant improvements beyond single-column indexing, which YugabyteDB also leverages. For searches where the full value is not known, the pg_trgm extension can generate multiple trigrams, and GIN indexes these multiple values per row. Additionally, bitmap scans can combine indexes on different columns within a table. However, challenges arise when predicates span across different tables in a join. Here, triggers come into play, allowing the creation of custom indexing by maintaining user-defined columns dedicated to fuzzy searches.

Here is an example using a table of Cities and Countries which I build from gvenzl/sample-data:

! curl s https://raw.githubusercontent.com/gvenzl/sampledata/main/countriescitiescurrencies/uninstall.sql > countries.sql
! curl s https://raw.githubusercontent.com/gvenzl/sampledata/main/countriescitiescurrencies/install.sql | awk ‘/D a t a l o a d/{print “do $$ begin”}/ COMMIT /{print “end; $$ ;”}{print}’ >> countries.sql
i countries.sql

I want to list the cities that have ‘port’ in their city name or country name:

yugabyte=# select city.name, country.name
from cities city
join countries country using(country_id)
where city.name ilike ‘%port%’ or country.name ilike ‘%port%’;

name | name
—————-+———————
Port Moresby | Papua New Guinea
Port of Spain | Trinidad and Tobago
Port au Prince | Haiti
Porto Novo | Benin
Port Vila | Vanuatu
Port Louis | Mauritius
Lisbon | Portugal
(7 rows)

Which index to create to accelerate this search?

Trigrams and GIN index on both tables

Because I do a fuzzy search, I can create the pg_trgm extension and a GIN index on both columns:

create extension if not exists pg_trgm;
create index on cities using gin (name gin_trgm_ops);
create index on countries using gin (name gin_trgm_ops);

Unfortunately, it cannot be used with my query because the two columns are on two tables and bitmap scan can only combine bitmaps for one table:

yugabyte=# explain (costs off)
select city.name, country.name
from cities city
join countries country using(country_id)
where city.name ilike ‘%port%’ or country.name ilike ‘%port%’;
QUERY PLAN
———————————————————————————————–
Nested Loop
-> Seq Scan on cities city
-> Index Scan using countries_pk on countries country
Index Cond: ((country_id)::text = (city.country_id)::text)
Filter: (((city.name)::text ~~* ‘%port%’::text) OR ((name)::text ~~* ‘%port%’::text))
(5 rows)

I can re-write the query with a UNION:

yugabyte=# explain (costs off)
select city.name, country.name
from cities city
join countries country using(country_id)
where city.name ilike ‘%port%’
union
select city.name, country.name
from cities city
join countries country using(country_id)
where country.name ilike ‘%port%’;
QUERY PLAN
—————————————————————————————————————————————————-
HashAggregate
Group Key: city.name, country.name
-> Append
-> YB Batched Nested Loop Join
Join Filter: ((city.country_id)::text = (country.country_id)::text)
-> Index Scan using cities_name_idx1 on cities city
Index Cond: ((name)::text ~~* ‘%port%’::text)
-> Index Scan using countries_pk on countries country
Index Cond: ((country_id)::text = ANY (ARRAY[(city.country_id)::text, ($1)::text, ($2)::text, …, ($1023)::text]))
-> YB Batched Nested Loop Join
Join Filter: ((city_1.country_id)::text = (country_1.country_id)::text)
-> Index Scan using countries_name_idx1 on countries country_1
Index Cond: ((name)::text ~~* ‘%port%’::text)
-> Index Scan using cities_countries_fk001 on cities city_1
Index Cond: ((country_id)::text = ANY (ARRAY[(country_1.country_id)::text, ($1025)::text, ($1026)::text, …, ($2047)::text]))
(15 rows)

This works, but the user may have some constraints with the ORM that cannot generate such a query.

Maintain an additional column for searching

I create an additional column in “cities” that will concatenate the name of the city and the name of the country, with a GIN index on trigrams:

alter table cities add fuzzy_search text;

create index on cities using gin (fuzzy_search gin_trgm_ops);

update cities
set fuzzy_search = format(‘%s %s’,cities.name, countries.name)
from countries where countries.country_id = cities.country_id
;

I can now query on this “fuzzy_search” column:

yugabyte=# explain (costs off)
select city.name, country.name
from cities city
join countries country using(country_id)
where fuzzy_search ilike ‘%port%’;
QUERY PLAN
—————————————————————————————————————————–
YB Batched Nested Loop Join
Join Filter: ((city.country_id)::text = (country.country_id)::text)
-> Index Scan using cities_fuzzy_search_idx on cities city
Index Cond: (fuzzy_search ~~* ‘%port%’::text)
-> Index Scan using countries_pk on countries country
Index Cond: ((country_id)::text = ANY (ARRAY[(city.country_id)::text, ($1)::text, ($2)::text, …, ($1023)::text]))
(6 rows)

This is why SQL databases have triggers: to add some data logic transparent to the application.

The code for triggers depends on your application and access patterns to reduce the overhead. The cases I will cover are:

inserting a new city that references a country must add the country name to the search column
updating a city must update it in the search column
inserting, updating, or deleting a country must update the search column on all cities referencing it.
If you have foreign key with cascade constraint, you may not need all those cases.

Here are my triggers:

— Trigger for insert, update, and delete on cities
create or replace function trg_cities_fuzzy_search() returns trigger as $$
begin
if tg_op = ‘INSERT’ or tg_op = ‘UPDATE’ then
new.fuzzy_search := format(‘%s %s’, new.name, (select name from countries where country_id = new.country_id));
return new;
end if;
end;
$$ language plpgsql;
create trigger trg_cities_fuzzy_search
before insert or update or delete on cities
for each row execute function trg_cities_fuzzy_search();

— Trigger for update and delete on countries
create or replace function trg_countries_fuzzy_search() returns trigger as $$
begin
if tg_op in ( ‘UPDATE’ , ‘INSERT’ ) then
update cities
set fuzzy_search = format(‘%s %s’, cities.name, new.name)
where country_id = new.country_id;
return new;
elsif tg_op = ‘DELETE’ then
update cities
set fuzzy_search = format(‘%s’, name)
where country_id = old.country_id;
return old;
end if;
end;
$$ language plpgsql;
create trigger trg_countries_fuzzy_search
after update or delete on countries
for each row
execute function trg_countries_fuzzy_search();

The most important is to add tests to cover all DML that can happen on those tables and verify that the “fuzzy_search column” is always correct. In other words, no rows returned by:

select format(‘%s %s’,city.name, country.name) , fuzzy_search as names
from cities city
join countries country using(country_id)
where format(‘%s %s’,city.name, country.name) != fuzzy_search
;

If you have stress tests for critical use cases that modify those tables, you should also verify that the overhead is acceptable.

By implementing this solution, the changes needed in the application are minimal: simply build the criteria based on the column specifically designated for fuzzy search. This approach offers the benefits of denormalization (filtering before joining) without the drawbacks (such as having to add data logic in the application code and tests to maintain consistency).

This example is compatible with PostgreSQL, YugabyteDB, and any PostgreSQL-compatible databases, implying support for GIN indexes and triggers of course.