beginner guide to fully local RAG on entry-level machines

prerequisites

ideally, you have a VPS or a physical server somewhere on which you can test your deployment, but you can also do all the steps on your own machine
you can write and run basic Python code
you know what an LLM is
you know what LLM tokens are

what RAG systems are

Retrieval-Augmented Generation (RAG) systems combine:

retrieval of information that is outside the knowledge base of a given LLM
generation of answers that are more contextualized, accurate, and relevant, on the LLM’s end

This hybrid model leverages the precision of retrieval methods and the creativity of generation models, making it highly effective for tasks requiring detailed and factual responses, such as question answering and document summarization.

the use-case

All companies manage some sort of written documentation, to be used internally or to be published externally. We want to build a RAG system that facilitates access to and querying of this written data, and that is both easy to set up and as cheap as possible in terms of computational power and billing. The system I have in mind:

can run on CPU only
has a UI
uses a free and open-source embedding model and LLM
uses pgvector, as a lot of systems out there run on Postgres

I personally believe in a future where tiny machines and IoT devices will host powerful optimized models that will remove most of the current needs to send everything to a giga model in some giga server farm somewhere (and all the privacy issues that necessarily arise from this practice). Open-source LLMs make daily progress to bring us closer to that future.

For this exercise, let’s use the open-sourced technical documentation of Scalingo, a well-known cloud provider in France. This documentation can be found @ https://github.com/Scalingo/documentation. The actual documentation lies in the src/_posts folder.

step 1: download ollama and select the right model

Nowadays, running powerful LLMs locally is ridiculously easy with tools such as ollama. Just follow the installation instructions for your OS. From now on, we’ll assume you’re using bash on Ubuntu.

what is ollama?

ollama is a versatile tool designed for running large language models (LLMs) locally on your computer. It offers a streamlined and user-friendly way to leverage powerful AI models like Llama 3, Mistral, and others without relying on cloud services. This approach provides significant benefits in terms of speed, privacy, and cost efficiency, as all data processing happens locally, eliminating the need for data transfers to external servers. Additionally, its integration with Python enables seamless incorporation into existing workflows and projects.

The ollama documentation says: “You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.” This means you can run relatively small models easily on a low-end server.

7B models? tokens?

In the world of generative AI, you will often see terms like tokens and parameters pop up.

In our demo, we will try to run the smallest models possible, so that our resource footprint stays very low; let’s cap ourselves at a maximum of 7B parameters.

Parameters in a machine learning model are essentially the weights and biases that the model learns during the training process. They are the internal variables that adjust themselves to minimize the error in predictions. These parameters are critical because they enable the model to learn and generalize from the training data.

Tokens are the pieces of data that LLMs process. In simpler terms, a token can be as small as a single character or as large as a whole word. The specific definition of a token can vary depending on the language model and its tokenizer, but generally, tokens represent the smallest unit of meaningful data for the model. They are numerical representations of units of semantic meaning.

You will often see the term token used in conjunction with the idea of a context window, which represents how many tokens an LLM can keep in memory during a conversation. The longer the context window, the longer (in theory) a meaningful conversation with the LLM can be.
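
To make this a bit more concrete, here is a minimal sketch of tokenization; I’m assuming the transformers package here, and the choice of tokenizer is arbitrary, purely for illustration:

# minimal illustration of tokenization; assumes the `transformers` package is installed
# and uses the bert-base-uncased tokenizer as an arbitrary example, each model ships its own
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("RAG systems retrieve before they generate")
print(tokens)                                   # the sub-word pieces the model works with
print(tokenizer.convert_tokens_to_ids(tokens))  # the numerical IDs the model actually sees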

selecting the right model

With ollama, running an LLM (a Small Language Model in our case, let’s use that term from now on) is as easy as ollama run <model_name>.

So, after browsing the ollama models library for the most popular yet smallest models, I downloaded a few of them and tested them in my terminal; some of them kept outputting nonsensical answers on my machine (such as codegemma:2b), so I discarded them.

I found out, by tinkering around rather than with systematic tests (although those would be quite interesting), that deepseek-coder:1.3b offers a particularly good ratio of performance to quality of answers.

deepseek-coder:1.3b is a 1.3B-parameter SLM (Small Language Model) that weighs only 776MB 🤯, developed by DeepSeek, a major Chinese AI company. It has been trained on a high-quality dataset of 2 trillion tokens. This model is optimized for running on various hardware, including mobile devices, which enables local inference without needing cloud connectivity.

Its strengths are:

a long 16K-token context window
highly scalable, as the 1.3B and the larger models in the series can suit various types of machines and deployments
it has been created for coding tasks, which may make it suitable for technical documentation RAG
it’s small yet it performs quite well on various benchmarks

Some use-cases of such a model are:

environments requiring strict data privacy
mobile agentic applications that can run code
industrial IoT devices performing intelligent tasks on the edge

step 2: make sure that we can run on CPU only

Just open htop or btop and, with another terminal tab or window, run:

ollama run deepseek-coder:1.3b

In the conversation, tell the LLM to generate a very long sentence and then go back to your htop: this will give you a quick sense of the resource consumption of the model’s inference.
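
If you’d rather do this check from Python than from the interactive session, here is a small sketch using the ollama Python package (an extra dependency I’m assuming here; the prompt and the tokens-per-second estimate are purely illustrative):

# rough throughput check with the `ollama` Python package (pip install ollama);
# each streamed chunk roughly corresponds to one generated token
from datetime import datetime
import ollama

start = datetime.now()
n_chunks = 0
stream = ollama.chat(
    model="deepseek-coder:1.3b",
    messages=[{"role": "user", "content": "generate a very long sentence about servers"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
    n_chunks += 1

elapsed = (datetime.now() - start).total_seconds()
print(f"\n~{n_chunks / elapsed:.1f} tokens/second")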

Still, we need to be absolutely sure that this thing can run on a consumer-grade server as well, provided that it is powerful enough:

I have an 8GB RAM, 2-CPU-core VPS somewhere. This server has no GPU, so let’s run the model there:

As you can see, I am warned that the model will run in CPU-only mode. Alright, let’s pull deepseek-coder:1.3b on this remote server.

After running it on the VPS, I noticed the token throughput was a little slower on this remote machine; however, the thing works like a charm! Now I’m thinking “hello, costless intelligent scraping with other general-purpose, small-footprint models” 🤑

step 3: RAG with LlamaIndex

Now that our LLM setup is ready, let’s put together a RAG system using the famous “RAG in 5 lines of code” LlamaIndex example, which we’ll tweak a little bit to meet our requirements.

Basically, we will:

download Scalingo’s documentation on disk
set up a vector store using an open-source embeddings model from HuggingFace
load our local instance of deepseek-coder:1.3b via LlamaIndex
create an index with the vector store and our documents
query our documentation from our terminal!

wait, what is LlamaIndex in the first place?

LlamaIndex is a data framework for building context-augmented LLM applications. With it, you can create:

autonomous agents that can perform research and take actions
Q&A chatbots
tools to extract data from various data sources
tools to summarize, complete, classify, etc. written content

All these use-cases basically augment what LLMs can do with more relevant context than their initial knowledge base and abilities provide. LlamaIndex has been designed to let LLMs query large-scale data efficiently.

Currently, LlamaIndex officially supports Python and TypeScript.

and what about embeddings?

Embeddings are basically a way of representing data, in our case text, as vectors (often represented as lists of numbers). A vector is a quantity that has both magnitude and direction. A 2-D vector such as [3,4], for instance, can be thought of as a point in a 2-dimensional space (like an X-Y plane). Vectors used as embeddings by LLMs are high-dimensional vectors that capture a lot of the semantic intricacies in text.

These embeddings are produced by specialized neural networks that learn to identify patterns and relationships between words based on their context; needless to say, these models are trained on large datasets of text. Embeddings can be used for document similarity analysis, clustering, enhancing search algorithms, and more.

For the embeddings model, we’ll use nomic-embed-text-v1.5, which performs better than several OpenAI models served through their API! This model:

produces high-dimensional vectors (up to 768 dimensions)
produces great alignment of tokens with semantically similar meaning
supports various embedding dimensions (from 64 to 768)
has a long context of up to 8192 tokens, which makes it suitable for very large datasets and pieces of content
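
As a quick sanity check, here is a sketch of what these vectors look like and of how two semantically close sentences score against each other; it reuses the same HuggingFaceEmbedding wrapper as the full script further down, and the example sentences are mine:

# sketch: embedding two sentences with nomic-embed-text-v1.5 and comparing them;
# the sentences are arbitrary examples
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True
)

a = embed_model.get_text_embedding("how do I deploy a Node.js app?")
b = embed_model.get_text_embedding("deploying an Express application")

print(len(a))  # 768 dimensions

# cosine similarity, computed by hand to avoid extra dependencies
dot = sum(x * y for x, y in zip(a, b))
norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
print(dot / norm)  # semantically close sentences should score high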

step 4: setting up pgvector

There are many vector stores out there for storing embeddings, but we want something that integrates with Postgres, so let’s use pgvector, an extension that you have to build after downloading it from GitHub. I personally run a dockerized instance of Postgres; here is the Dockerfile:

# postgres image with `pgvector` enabled
FROM postgres:16.3

RUN apt-get update \
    && apt-get install -y postgresql-server-dev-all build-essential \
    && apt-get install -y git \
    && git clone https://github.com/pgvector/pgvector.git \
    && cd pgvector \
    && make \
    && make install \
    && apt-get remove -y git build-essential \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*

EXPOSE 5432
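
Note that the extension still has to be enabled inside the database itself; here is a sketch of doing that from Python with psycopg2 (harmless if it is already enabled; the connection string mirrors the one used in the script below, adapt it to your own setup):

# sketch: enabling the pgvector extension in the target database;
# the connection string is an assumption, adapt it to your own Postgres instance
import psycopg2

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()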

show me the code! 👨‍💻

Without a UI, our app looks like this (I am showing you the full script first, then we’ll provide more explanations on some parts of it):

from datetime import datetime
from dotenv import load_dotenv
from llama_index.core import (
    # function to create better responses
    get_response_synthesizer,
    SimpleDirectoryReader,
    Settings,
    # abstraction that integrates various storage backends
    StorageContext,
    VectorStoreIndex
)
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.postgres import PGVectorStore
import logging
import os
import psycopg2
from sqlalchemy import make_url
import sys

def set_local_models(model: str = "deepseek-coder:1.3b"):
    # use Nomic
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="nomic-ai/nomic-embed-text-v1.5",
        trust_remote_code=True
    )
    # setting a high request timeout in case you need to build an answer based on a large set of documents
    Settings.llm = Ollama(model=model, request_timeout=120)

# ! comment if you don’t want to see everything that’s happening under the hood
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# time the execution
start = datetime.now()

# of course, you can store db credentials in some secret place if you want
connection_string = "postgresql://postgres:postgres@localhost:5432"
db_name = "postgres"
vector_table = "knowledge_base_vectors"

conn = psycopg2.connect(connection_string)
conn.autocommit = True

load_dotenv()

set_local_models()

PERSIST_DIR = "data"
documents = SimpleDirectoryReader(os.environ.get("KNOWLEDGE_BASE_DIR"), recursive=True).load_data()

url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=vector_table,
    # embed dim for this model can be found on https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
    embed_dim=768
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# if the index does not exist yet, create it
# index = VectorStoreIndex.from_documents(
#     documents, storage_context=storage_context, show_progress=True
# )
# if the index already exists, load it
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)
# configure response synthesizer
response_synthesizer = get_response_synthesizer(streaming=True)
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    # discarding nodes whose similarity is below a certain threshold
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

# getting the query from the command line
query = "help me get started with Node.js express app deployment"
if len(sys.argv) >= 2:
    query = " ".join(sys.argv[1:])

response = query_engine.query(query)
# print(textwrap.fill(str(response), 100))
response.print_response_stream()

end = datetime.now()
# print the time it took to execute the script
print(f"Time taken: {(end - start).total_seconds()}")

the StorageContext

LlamaIndex represents data as indices, nodes, and vectors. These are manipulated via the StorageContext abstraction.
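
For completeness, here is a sketch of the purely local-disk variant of the StorageContext, handy if you want to try the pipeline before wiring up Postgres (the directory names and documents path are placeholders, and it assumes Settings has been configured as in the full script above):

# sketch: persisting and reloading an index on disk instead of pgvector;
# "storage" and the documents path are placeholder names;
# assumes Settings.embed_model / Settings.llm have been set as in the full script above
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

documents = SimpleDirectoryReader("path/to/docs", recursive=True).load_data()

# first run: build the index and persist it to disk
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="storage")

# subsequent runs: reload it without re-embedding everything
storage_context = StorageContext.from_defaults(persist_dir="storage")
index = load_index_from_storage(storage_context)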

nodes

They are the basic building blocks in LlamaIndex: they represent chunks of ingested documents and also encapsulate the metadata around these chunks. They store individual pieces of information from larger documents, and they become part of the various data structures used within the framework.

indices

They are data structures that organize and store metadata about the aforementioned nodes. Their function is to allow for quick location and retrieval of nodes based on search queries; this is done via keyword indices, embeddings, and more.

During data ingestion, documents are split into chunks and converted into nodes. These nodes are then indexed, and their semantic content is embedded into vectors. When a query is made, indices are used to quickly locate relevant nodes, and vector stores facilitate finding semantically similar nodes based on the query’s embedding vector.
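
To see nodes outside of the full pipeline, here is a small sketch that splits the loaded documents into chunks explicitly (the path and chunk size are arbitrary; in the full script above this splitting happens for you when the index is built):

# sketch: turning documents into nodes explicitly; the path and chunk_size are arbitrary
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("documentation/src/_posts", recursive=True).load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

print(len(nodes))
print(nodes[0].text[:200])  # the chunk of text itself
print(nodes[0].metadata)    # e.g. the source file it came from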

the RetrieverQueryEngine

The RetrieverQueryEngine in LlamaIndex is a versatile query engine designed to fetch relevant context from an index based on a user’s query. It consists of:

a data retriever
a response synthesizer

In our case, the data retriever is the VectorIndexRetriever that we have plugged into our Postgres vector database.

the SimilarityPostprocessor

With this LlamaIndex module, we make sure that only a subset of the retrieved data is used for the final output, based on a similarity score threshold. It’s basically a filter for nodes.
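
As a standalone illustration (the nodes and their similarity scores below are made up), this is all the filter does:

# sketch: the similarity filter applied to hand-made nodes with made-up scores
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.schema import NodeWithScore, TextNode

nodes = [
    NodeWithScore(node=TextNode(text="relevant chunk"), score=0.85),
    NodeWithScore(node=TextNode(text="barely related chunk"), score=0.42),
]

filtered = SimilarityPostprocessor(similarity_cutoff=0.7).postprocess_nodes(nodes)
print([n.score for n in filtered])  # only the 0.85 node survives the 0.7 cutoff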

get_response_synthesizer

This function generates responses from the language model using:

a query
a set of text chunks retrieved from the storage context

The text chunks themselves are processed by the LLMs using a configurable strategy: the response mode. Response modes include:

compact: combines text chunks into larger consolidated chunks that fit within the context window of the LLM, reducing the number of calls needed; this is the default mode

refine: iteratively generates and refines an answer by going through each text chunk; this mode makes a separate LLM call per node (something to keep in mind if you’re paying for tokens), making it suitable for detailed answers

tree_summarize: recursively merges text chunks and summarizes them in a bottom-up fashion (i.e. building a tree from leaves to root) => it is a “summary of summaries”

… and more in the docs!
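
Switching modes is a one-argument change on get_response_synthesizer; here is a sketch (the ResponseMode values map to the modes listed above, while the full script above simply uses the default):

# sketch: selecting a response mode explicitly; the full script above uses the default ("compact")
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode

compact_synth = get_response_synthesizer(response_mode=ResponseMode.COMPACT, streaming=True)
refine_synth = get_response_synthesizer(response_mode=ResponseMode.REFINE, streaming=True)
tree_synth = get_response_synthesizer(response_mode=ResponseMode.TREE_SUMMARIZE)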

step 5: serve the app with Streamlit

Now that we have a working script, let’s wire this to a Streamlit UI.

Streamlit is an open-source Python framework designed to simplify the creation and sharing of interactive data applications. It’s particularly popular among data scientists and machine learning engineers due to its ease of use and ability to transform Python scripts into fully functional web applications with minimal code.

Again, this can be done with very few lines of code once you’ve added streamlit to your Python requirements:

import logging
import streamlit as st
import sys

from rag import get_streamed_rag_query_engine

# ! comment if you don’t want to see everything that’s happening under the hood
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# initialize chat history
if "history" not in st.session_state:
    st.session_state.history = []

def get_streamed_res(input_prompt: str):
    query_engine = get_streamed_rag_query_engine()
    res = query_engine.query(input_prompt)
    for x in res.response_gen:
        yield x

st.title("technical documentation RAG demo 🤖📚")

# display chat messages history
for message in st.session_state.history:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# react to user input
if prompt := st.chat_input("Hello 👋"):
    # display user message
    with st.chat_message("user"):
        st.markdown(prompt)
    # add it to chat history
    st.session_state.history.append({"role": "user", "content": prompt})

    # display bot response
    with st.chat_message("assistant"):
        response = st.write_stream(get_streamed_res(prompt))
    # add bot response to history as well
    st.session_state.history.append({"role": "assistant", "content": response})

… less than 50 lines of code and you have a functional chat UI 😎

The responses could be perfected, but the result is truly impressive, considering how small our model is:

video demo 1
video demo 2

wrapping it up

As you can see, it is easier than ever to build context-rich applications that are cheap in both resource consumption and actual money. There are many more ways to improve this little demo, such as:

create a multi-modal knowledge base RAG using llava
deploy the thing on various platforms (VPS, bare metal server, serverless containers, etc.)
enhance the generated answers using various grounding techniques
implement a human-in-the-loop feature, where actual humans take over from the bot when things get difficult with a given customer
make the system more agentic by letting it evaluate if the user’s query has been fulfilled, if the user’s query is relevant, etc.
package the app and build it in WebAssembly
parallelize the calls to the SLM on response generation
update existing vectors with contents from the same source instead of adding to the vector database systematically
update the vectorized documentation on a schedule

… don’t hesitate to PR @ https://github.com/yactouat/documentation-rag-demo if you’d like to improve it!