SpringAI, llama3 and pgvector: bRAGging rights!


In the very beginning at least, Python reigned supreme in terms of tooling for AI development. Recently, however, the answer came from the Spring community, and it’s called SpringAI! This means that if you’re a Spring developer with a working knowledge of concepts such as Beans, Auto-configurations and Starters, you’re covered: you can write your AI apps following the standard patterns you’re already familiar with.

In this post, I want to share with you an exploration that started with the goal to take SpringAI for a little spin and try out the capabilities of open-source LLMs (large language models). I got curious along the way, and ended up also looking at some pretty low-level details about data storage for AI applications as well.

To support the exploration, I’ve developed a simple Retrieval Augmented Generation (RAG) workflow that runs completely locally on the laptop, for free. If you’re interested, you can find the code itself here. Basically, I’ve used Testcontainers to create a Postgres database container with the pgvector extension to store text embeddings, and an open-source LLM to send requests to: Meta’s llama3, served through ollama.

Alright, let’s have a look at what I’m building with this setup.

Fun with sayings and essays

The RAG workflow works like this. I first ask it to generate a couple of sayings, for example this is one among the ones I got back:

A bad excuse is better than none.

Then, I ask it to “elaborate” and generate an essay (of under 200 words) about the saying it provided me earlier. Here is one such essay generated by llama3 for the saying above:

The saying suggests that while an inadequate justification may not be satisfactory, it’s still preferable to having no explanation at all. This phrase highlights the importance of accountability and acknowledging responsibility for one’s actions or decisions.

In essence, a bad excuse implies that some effort has been made to justify or rationalize one’s behavior, even if the reasoning is flawed or unconvincing. On the other hand, not offering any excuse whatsoever can come across as evasive or dishonest, leading to mistrust and resentment.

This saying encourages people to take ownership of their mistakes and apologize for any wrongdoing, rather than remaining silent and avoiding accountability. It also acknowledges that everyone makes errors and that a bad excuse is better than none because it shows a willingness to acknowledge and learn from those mistakes.

Ultimately, the saying promotes honesty, responsibility, and personal growth by emphasizing the value of taking ownership of one’s actions, no matter how imperfect the explanation may be.

Then, I take these essays and create embeddings from them, which I store in Postgres using the pgvector extension, in columns of the vector data type. All of this happens through SpringAI abstractions, with a minimal amount of custom code. I will skip the step called “chunking”: when you are dealing with very large documents, or want to isolate sections in your data (like in e-mails, where you have a subject, a sender, etc.), you might look into doing that.
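If you do need chunking, SpringAI ships document transformers for that purpose. Here is a minimal sketch of how it could look, assuming the TokenTextSplitter transformer (check the SpringAI docs for the exact class and options in your version; essayText is a placeholder for your raw text):

List<Document> documents = List.of(new Document(essayText));      // wrap the raw text in a Document
List<Document> chunks = new TokenTextSplitter().apply(documents); // split each document into smaller chunks
vectorStore.add(chunks);                                          // store the chunks instead of the full documents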

So far so good. At this point, we have stored the data we need in the next steps.

I will then take each saying and do a similarity search on the embeddings to retrieve the corresponding essay. Then I will supply each retrieved essay back to the LLM and ask it to guess the original saying from which the essay was generated. Finally, I will check how many it got right.

What do you think, will it manage to correctly guess the saying from just the essay? After all, it has generated the essays from those sayings itself in the first place. A human would have no problem doing this.

But let’s first have a look at how the program is set up from a technical perspective. We will look at the results and find out how capable the LLM is a bit later.

The LLM and the vector store with Testcontainers

Testcontainers makes it very easy to integrate with collaborating services that each play a role in use-cases like this. All that is required to set up a database and the LLM are the couple of lines below.

@TestConfiguration(proxyBeanMethods = false)
public class RagDemoApplicationConfiguration {

    private static final String POSTGRES = "postgres";

    @Bean
    @ServiceConnection
    PostgreSQLContainer<?> postgreSQLContainer() {
        return new PostgreSQLContainer<>("pgvector/pgvector:pg16")
                .withUsername(POSTGRES)
                .withPassword(POSTGRES)
                .withDatabaseName(POSTGRES)
                .withInitScript("init-script.sql");
    }

    @Bean
    @ServiceConnection
    OllamaContainer ollamaContainer() {
        return new OllamaContainer("ollama/ollama:latest");
    }
}

I’ve used the @ServiceConnection annotation, which saves typing configuration code. For the ollama container, this has only become possible recently, thanks to this contribution from Eddú Meléndez.

You might have noted there’s an init script there. It contains just one line, shown below, whose purpose is to install a Postgres extension that lets me inspect the contents of the Postgres shared buffers in RAM. I want to have a look at this in order to better understand the operational characteristics of working with vectors: what are the memory demands?

create extension pg_buffercache;

Now, to fully initialise our LLM container such that it’s ready to actually handle our requests for our sayings and essays, we need to pull the models we want to work with, like so:

ollama.execInContainer("ollama", "pull", "llama3");
ollama.execInContainer("ollama", "pull", "nomic-embed-text");

You will notice that besides llama3, which I mentioned before and which will take care of generating text, I am also pulling a so-called embedding model: nomic-embed-text. It converts text into embeddings so that they can be stored.

The ones I’m using are not the only options. New LLM bindings and embedding models are added all the time in both SpringAI and ollama, so refer to the docs for the up-to-date list, as well as the ollama website.

Configuration properties

Let’s have a look at the vector store configuration. Here’s how that looks:

@DynamicPropertySource
static void pgVectorProperties(DynamicPropertyRegistry registry) {
    registry.add("spring.ai.vectorstore.pgvector.index-type", () -> "HNSW");
    registry.add("spring.ai.vectorstore.pgvector.distance-type", () -> "COSINE_DISTANCE");
    registry.add("spring.ai.vectorstore.pgvector.dimensions", () -> 768);
}

The first one is called index-type. This means that we are creating an index in our vector store. We don’t necessarily need to always use an index – it’s a trade-off. With indexing, the idea is that we gain speed (and other things, like uniqueness, etc.) at the expense of storage space. With vector indexes, however, the trade-off also includes a relevance aspect. Without an index, the similarity search is based on the kNN algorithm (k-nearest neighbours), where it checks all vectors in the table. With an index, it performs an aNN (approximate nearest neighbours) search instead, which is faster but might miss some results. Indeed, it’s quite the balancing act.
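To make the difference concrete, here is a small SQL sketch (not part of the application code) comparing the two behaviours once the HNSW index exists. The first query lets the planner use the index (approximate); the second forces an exact scan by disabling index scans for one transaction:

-- approximate search (aNN): the planner uses the HNSW index
EXPLAIN ANALYZE
SELECT content
FROM vector_store
ORDER BY embedding <=> (SELECT embedding FROM vector_store LIMIT 1)
LIMIT 5;

-- exact search (kNN) for comparison: keep the planner off the index in this transaction
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT content
FROM vector_store
ORDER BY embedding <=> (SELECT embedding FROM vector_store LIMIT 1)
LIMIT 5;
COMMIT;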

Let’s have a look at the other configuration options for indexing, which I extracted from the SpringAI code:

NONE,
IVFFLAT,
HNSW;

In the beginning, there was only one indexing option in pgvector, namely ivfflat. More recently, HNSW (Hierarchical Navigable Small Worlds) was added; it is based on different construction principles, is more performant, and keeps getting better. The general recommendation, as of now, is to go for HNSW.
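For reference, this is roughly how you would create the two index types by hand in pgvector (the lists value for ivfflat is only an example; ivfflat is best built after the data has been loaded):

-- IVFFlat: partitions the vectors into lists, build it after loading the data
CREATE INDEX ON vector_store USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- HNSW: builds a navigable graph, generally the better recall/speed trade-off
CREATE INDEX ON vector_store USING hnsw (embedding vector_cosine_ops);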

The next configuration option is the distance-type which is the procedure it uses to compare vectors in order to determine similarity. Here are our options:

EUCLIDEAN_DISTANCE,
NEGATIVE_INNER_PRODUCT,
COSINE_DISTANCE;

I’ll go with the cosine distance, but it might be helpful to have a look at their properties because it might make a difference for your use-case.
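For the curious, these three options map to the pgvector distance operators, so you can also compare them directly in SQL. A small sketch, comparing every row against one arbitrary stored embedding:

SELECT id,
       embedding <->  (SELECT embedding FROM vector_store LIMIT 1) AS euclidean_distance,
       embedding <#>  (SELECT embedding FROM vector_store LIMIT 1) AS negative_inner_product,
       embedding <=>  (SELECT embedding FROM vector_store LIMIT 1) AS cosine_distance
FROM vector_store
LIMIT 3;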

The last configuration property is called dimensions, which represents the number of components (float values) that each embedding consists of. It has to match the dimensionality of the embedding model: in our example, nomic-embed-text produces 768 dimensions, but other models produce more, or fewer. If the model returns embeddings with more dimensions than our table was set up for, it won’t work. Now you might wonder, should you strive for as high a number of dimensions as possible? Apparently not: this blog from Supabase shows that fewer dimensions can actually perform better.
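Once a few rows are stored, a quick sanity check that the model output and the table definition agree could look like this (vector_dims is a pgvector helper function):

-- should return 768 for nomic-embed-text
SELECT vector_dims(embedding) FROM vector_store LIMIT 1;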

Under the hood – what’s created in Postgres?

Let’s explore what SpringAI has created for us in Postgres with this configuration. In a production application, however, you might want to take full control and drive the schema through SQL files managed by migration tools such as Flyway. We didn’t do this here, for simplicity.

Firstly, we find it created a table called vector_store with the following structure:

postgres=# \d vector_store
                      Table "public.vector_store"
  Column   |    Type     | Collation | Nullable |      Default
-----------+-------------+-----------+----------+--------------------
 id        | uuid        |           | not null | uuid_generate_v4()
 content   | text        |           |          |
 metadata  | json        |           |          |
 embedding | vector(768) |           |          |
Indexes:
    "vector_store_pkey" PRIMARY KEY, btree (id)
    "spring_ai_vector_index" hnsw (embedding vector_cosine_ops)

Nothing surprising here. It’s in line with the configuration we saw above in the Java code. For example, we notice the embedding column of type vector, with 768 dimensions. We also notice the index, spring_ai_vector_index, with the vector_cosine_ops operator class, which we expected given the “distance-type” setting earlier. The other index, vector_store_pkey, is created automatically by Postgres, which does this for every primary key by itself.

The command that SpringAI used to create our index is the following:

CREATE INDEX IF NOT EXISTS %s ON %s USING %s (embedding %s)

This creates an index with the default configuration. It might be good to know that you have a couple of options if you’d like to tweak the index configuration for potentially better results (depending on your use-case):

m – the max number of connections per layer (16 by default, min: 2, max: 100)

ef_construction – the size of the dynamic candidate list for constructing the graph (64 by default, min: 4, max: 1000)

In order to understand the internals of this index and what effect changing the above options might have, here is a link to the original paper.

Additionally, see also this post by J. Katz in which he presents results of experimenting with various combinations of the above parameters.

When you know what values you want to set for these parameters you can create the index like so:

CREATE INDEX ON vector_store
USING hnsw (embedding vector_cosine_ops)
WITH (m = 2, ef_construction = 5);

In case you get an error while constructing the index, it’s worth making sure it has enough memory to work with, through the maintenance_work_mem setting.
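A minimal sketch of how that could look (the value is only an illustration; size it to your data set and available RAM):

SET maintenance_work_mem = '512MB';   -- give the index build more memory to work with
CREATE INDEX ON vector_store          -- drop or rename the auto-created index first if it already exists
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);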

Let’s now check how our embedding column is actually stored on disk. We can find out with the following query, whose result will point us to our next step.

postgres=# select
    att.attname,
    case att.attstorage
        when 'p' then 'plain'
        when 'm' then 'main'
        when 'e' then 'external'
        when 'x' then 'extended'
    end as attstorage
from
    pg_attribute att
    join pg_class tbl on tbl.oid = att.attrelid
    join pg_namespace ns on tbl.relnamespace = ns.oid
where
    tbl.relname = 'vector_store' and
    ns.nspname = 'public' and
    att.attname = 'embedding';

Result:

-[ RECORD 1 ]---------
attname    | embedding
attstorage | external

Alright, so it uses the external storage type. This means that this column is stored in a separate, so-called TOAST table. Postgres does this when column values are so large that it could no longer fit at least 4 rows in a page. Interestingly, it will not attempt to also compress the values to shrink them further; for compressed columns, the result above would have said extended instead of external.
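If you ever want to change this behaviour, the storage strategy is configurable per column. A hypothetical example (note that it only affects newly written rows; existing values keep their current storage until rewritten):

-- keep the embeddings inline in the main table instead of TOASTing them;
-- a 768-dimensional vector (roughly 3 kB) still fits comfortably in an 8 kB page
ALTER TABLE vector_store ALTER COLUMN embedding SET STORAGE PLAIN;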

Where are these tables, though? We haven’t created them ourselves; they are managed by Postgres. Let’s try to locate this separate TOAST table using this query:

postgres=# select
    relname,
    oid
from
    pg_class,
    (select
        reltoastrelid
     from
        pg_class
     where
        relname = 'vector_store') as vector_store
where
    oid = vector_store.reltoastrelid or
    oid = (select
               indexrelid
           from
               pg_index
           where
               indrelid = vector_store.reltoastrelid);

       relname        |  oid
----------------------+-------
 pg_toast_16630       | 16634
 pg_toast_16630_index | 16635

So far so good. We now have the TOAST table ID. Let’s use it to have a look at the structure of the TOAST table: for example, what columns does it have? Note that these tables live in the pg_toast schema, so to get there, we have to set the search_path to pg_toast, like below:

postgres=# set search_path to pg_toast;
SET
postgres=# \d pg_toast_16630
TOAST table "pg_toast.pg_toast_16630"
   Column   |  Type
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
Owning table: "public.vector_store"
Indexes:
    "pg_toast_16630_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

We can learn a couple of things from this. As expected, large values from the main table that have to be “TOASTed” are chunked (split up); each chunk is identified by a sequence number and is always retrieved through an index.

Postgres has a mechanism to avoid “blasting” the entire shared buffer cache when it needs to do large reads, like sequential scans of a large table. In that case, it uses a 32-page ring buffer so that it doesn’t evict other data from the cache. This mechanism does not kick in for TOAST tables, though, so vector-based workloads run without this form of protection.

Okay! We had a very good look at the database part. Let’s now “resurface” for a moment and have a look at other topics pertaining to the high level workflow of interacting with the LLM.

Template-based prompts

Initially, I had constructed the prompts for the request to the LLM in the same class where I was using them. However, I found the following different approach in the SpringAI repository itself and adopted it, because it’s indeed cleaner to do it this way. It’s based on externalised resource files, like so:

@Value("classpath:/generate-essay.st")
protected Resource generateEssay;

@Value("classpath:/generate-saying.st")
protected Resource generateSaying;

@Value("classpath:/guess-saying.st")
protected Resource guessSaying;

This is how one of them looks inside.

Write a short essay under 200 words explaining the
meaning of the following saying: {saying}.

As you can see, I have not applied any sophisticated prompt engineering whatsoever, and kept it simple and direct for now.

Calling the LLM

Alright, the pieces are starting to fit together! The next thing I’d like to show you is how to call the LLM.

chatModel
    .withModel(model)
    .call(createPromptFrom(promptTemplate, promptTemplateValues))
    .getResult()
    .getOutput()
    .getContent();
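The createPromptFrom helper above is not part of SpringAI itself; a minimal sketch of how such a method could be implemented with SpringAI’s PromptTemplate (the name and signature are my own):

Prompt createPromptFrom(Resource promptTemplate, Map<String, Object> promptTemplateValues) {
    // PromptTemplate reads the .st resource and substitutes placeholders such as {saying}
    return new PromptTemplate(promptTemplate).create(promptTemplateValues);
}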

I am using the so-called Chat Model API, a powerful abstraction over AI models. This design allows us to switch between models with minimal code changes: if you want to work with a different model, you just change the runtime configuration. It’s a nice example of the Dependency Inversion Principle: higher-level modules do not depend on low-level modules; both depend on abstractions.

Storing the embeddings

To store the embeddings, I must say that I found it a pretty complicated procedure:

vectorStore.add(documents);

Just kidding, that’s it!

This single call does several things: it first converts the documents (our essays) to embeddings with the help of the embedding model, and then runs the following batched insert statement to get the embeddings into our vector_store table:

INSERT INTO vector_store (id, content, metadata, embedding) VALUES (?, ?, ?::jsonb, ?) ON CONFLICT (id) DO UPDATE SET content = ? , metadata = ?::jsonb , embedding = ?

We can see that it actually performs an upsert: in case a row with that ID is already present in the database, the ON CONFLICT part of the query updates its content, metadata and embedding columns instead.

Similarity searches

To do a similarity search on the stored vectors with SpringAI, it’s just a matter of:

vectorStore
    .similaritySearch(SearchRequest.query(saying))
    .getFirst()
    .getContent();

Again, you get a couple of things done for you by SpringAI. It takes the parameter you supply (the saying, in our case) and first creates its embedding using the embedding model we talked about before. Then it uses that embedding to retrieve the most similar results, from which we pick only the first one.
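If you need more control over the retrieval, SearchRequest exposes a few knobs. A small sketch (method names as they appear in the SpringAI version I used, so double-check them against yours):

vectorStore
    .similaritySearch(SearchRequest.query(saying)
        .withTopK(3)                     // return up to 3 nearest documents instead of just the default
        .withSimilarityThreshold(0.7));  // drop results that are not similar enough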

One thing to note is that if you write the query yourself, without letting SpringAI create it for you, you can customise it by supplying a different value for the ef_search parameter (default: 40, min: 1, max: 1000), like so:

SET hnsw.ef_search = 100;

With it, you can influence the number of candidate neighbours considered during the search: the more are checked, the better the recall, but at the expense of performance.
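A sketch of what such a hand-written retrieval query could look like; here $1 stands for the query embedding bound as a parameter by the client, and SET LOCAL keeps the setting scoped to a single transaction:

BEGIN;
SET LOCAL hnsw.ef_search = 100;   -- consider more candidates for better recall
SELECT content
FROM vector_store
ORDER BY embedding <=> $1         -- $1: the embedding of the saying
LIMIT 1;
COMMIT;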

Now that we know how to perform similarity searches to retrieve semantically close data, we can also make a short incursion into how Postgres uses memory (shared buffers) when performing these retrievals.

How much of the Postgres shared buffers gets filled?

Let’s now increase the number of essays we’re working with to 100, and have a look at what’s in the Postgres shared buffers after we run the program. We’ll use the pg_buffercache extension that I mentioned before, which was installed in the init script.

But first, let’s start with looking at the size of the table and index, just to get some perspective.

postgres=# \dt+ vector_store
                                      List of relations
 Schema |     Name     | Type  |  Owner   | Persistence | Access method |  Size  | Description
--------+--------------+-------+----------+-------------+---------------+--------+-------------
 public | vector_store | table | postgres | permanent   | heap          | 584 kB |

postgres=# \di+ spring_ai_vector_index
                                             List of relations
 Schema |          Name          | Type  |  Owner   |    Table     | Persistence | Access method |  Size  | Description
--------+------------------------+-------+----------+--------------+-------------+---------------+--------+-------------
 public | spring_ai_vector_index | index | postgres | vector_store | permanent   | hnsw          | 408 kB |
(1 row)

Okay, so the table is 584 kB and the index is 408 kB. The index gets pretty big, basically about the same size as the table. We don’t mind that much at such a small scale, but if we assume this proportion holds at large scale too, we will have to take it more seriously.

To contrast with how other indexes behave, I checked a table we have at work that amounts to 40 GB. The corresponding B-tree primary key index is 10 GB, while other indexes of the same type on other columns are just 3 GB.

I’m using the following query to get an overview of what’s in the shared buffers:

select
c.relname,
count(*) as buffers
from
pg_buffercache b
inner join
pg_class c on b.relfilenode = pg_relation_filenode(c.oid) and
b.reldatabase in (0, (select
oid
from
pg_database
where
datname = current_database()
)
)
group by
c.relname
order by
2 desc
limit
10;
            relname             | buffers
--------------------------------+---------
 pg_proc                        |      61
 pg_toast_16630                 |      53
 spring_ai_vector_index         |      51
 pg_attribute                   |      35
 pg_proc_proname_args_nsp_index |      30
 pg_depend                      |      23
 pg_operator                    |      19
 pg_statistic                   |      19
 pg_class                       |      18
 vector_store                   |      18
(10 rows)

We see that the index is in there in its entirety. We can deduce this because the index is 408 kB, as we saw before, and dividing that by 8 kB, the size of a Postgres page, gives exactly the 51 buffers we see in the table above (third row).

We can draw a conclusion from this: working with vectors in Postgres is going to be pretty demanding in terms of memory. As a reference, vectors with 1536 dimensions (probably the most common case) occupy about 6 kB each (1536 components of 4 bytes), so one million of them already gets us to 6 GB. If we have other workloads next to the vectors, they might be affected, in the sense that we start seeing cache evictions because there are no free buffers. This means we might even need to consider keeping the vectors in a separate database, in order to isolate the workloads, in case we notice performance going downhill.
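To put the buffer counts above into perspective, it can help to check how large the cache is in the first place and how much of it is in use. Since pg_buffercache has one row per buffer, a quick way to do that is:

SHOW shared_buffers;

-- buffers currently holding data vs. total buffers available
SELECT count(*) FILTER (WHERE relfilenode IS NOT NULL) AS buffers_used,
       count(*)                                        AS buffers_total
FROM pg_buffercache;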

Using the @ParameterizedTest annotation for trying out other models without code changes

Alright, a last remark I want to make about this program is that it’s set up for experimenting with other open-source LLMs. The entrypoint method I’m using to run the workflow is a JUnit parameterized test, where the arguments for each run can be the names of other LLM models distributed with ollama. This is how you set it up to run multiple times, with a different LLM for every execution:

@ParameterizedTest
@ValueSource(strings = {"llama3", "llama2", "gemma", "mistral"})
void rag_workflow(String model) {

}

Results

Finally, it’s time to review how well the LLM managed to guess the sayings. With no other help except for the initial essay provided in the prompt, it managed to guess the saying perfectly a grand total of… once!

Saying: Your most powerful moments are born from the ashes of your greatest fears.
LLM guess: What doesn’t kill you…

Saying: Every sunrise holds the promise of a new masterpiece.
LLM guess: What lies within is far more important than what lies without.

Saying: Every step forward is a declaration of your willingness to grow.
LLM guess: Any Step Forward

Saying: Your most beautiful moments are waiting just beyond your comfort zone.
LLM guess: What lies within…

Saying: Light reveals itself in the darkness it creates.
LLM guess: The darkness is not the absence of light but the presence of a different kind

Saying: Courage is not the absence of fear, but the willingness to take the next step anyway.
LLM guess: Be brave.

Saying: Small sparks can ignite entire galaxies.
LLM guess: Small sparks can ignite entire galaxies.

Saying: Believe in yourself, take the leap and watch the universe conspire to make your dreams come true.
LLM guess: Take the leap

Saying: Life begins at the edge of what you’re willing to let go.
LLM guess: Take the leap.

Some responses are quite amusing, like when it tries to be “mysterious” or more conversational by not completing the sentence and just ending it in three dots (“What doesn’t kill you…”), or when it reaches for extreme succinctness (“Take the leap.”, “Be brave.”).

Let’s give it some help now. In the prompt, this time I’ll provide all the sayings it initially generated as a list of options to pick from. Will it manage to pick the correct one from the bunch this way?

Turns out that, given options to pick from, it did pick the right one, every time. Quite the difference between with and without RAG!

Conclusion

The open-source LLMs I tried have not yet passed my “test” of guessing the initial sayings from which the essays were generated. They don’t seem ready to perform well on requests that require precise and correct answers, but I will keep trying new models as they become available.

However, they are still useful if you know what you can expect from them – they are very good for storing and retrieving many loosely connected facts, and they can help with brainstorming.

When I applied RAG and gave it options, the difference was like night and day compared to when I didn’t: given the options, it picked the right answer every time, but, as we saw, with no support it did not perform well at all.

We also looked at how vectors are stored and how “memory-hungry” they are, and noted that we should account for this and make some arrangements at the beginning of a project in order to keep things running smoothly when the scale grows.

What’s next

While I was writing this post, pgvectorscale was released. It’s an interesting new project that makes pgvector much more performant and cost-effective. I’m planning to try it out!

Thanks for reading!