AWS Bedrock, Claude 3, Serverless RAG, Rust


What a funny time to live in. It is quite challenging to craft an informative blog post title that won’t contain only buzzwords!

Introduction

I had wanted to explore the Amazon Bedrock service for quite a while. I firmly believe that offering multiple LLMs as a serverless service is a huge step toward democratizing access to the current “AI revolution”.

A while ago I heard about LanceDB – an open-source vector database written in Rust. This is an amazing project with a bunch of cool features, but for me, the selling point was that I could use a local file system or S3 as storage and move computation to Lambda. Additionally, because LanceDB is written in Rust, I could use Rust to work with it.

Then I stumbled upon an amazing post from Giuseppe Battista about creating a serverless RAG with LanceDB: Serverless Retrieval Augmented Generation (RAG) on AWS. I treated that post as a starting point for me.

Project

The code for this blog post is available in the repository.

The GenAI ecosystem is a bit overwhelming. What works for me is dividing complex problems into smaller tasks and tackling them one by one.

In general, text generation with a vector database used for RAG can be broken into the following steps:

1. Create a knowledge base
- read the input documents
- transform them into embeddings
- store them in the vector database

2. Generate text based on the user’s request
- transform the user’s query into embeddings
- get related data from the vector database
- construct a prompt for the LLM using that context
- invoke the LLM

In this post, I’ll focus on the second point.

Prepare the knowledge base

As I mentioned above, I won’t focus on this part. I just take ready-to-use code from Giuseppe’s blog.

I create a new folder and initialize an AWS CDK project:

cdk init --language typescript

At this point, the only resource I need is the S3 bucket.

import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as s3 from "aws-cdk-lib/aws-s3";

export class BedrockRustStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // create s3 bucket for vector db
    const vectorDbBucket = new s3.Bucket(this, "lancedb-vector-bucket", {
      versioned: true,
    });

    new cdk.CfnOutput(this, "vector-bucket-name", {
      value: vectorDbBucket.bucketName,
    });
  }
}

Document processor

The code for processing documents is available in the repo from Giuseppe’s blog. I just want to run it manually from my local machine, so I simplify it a bit.

// document-processor/main.ts
import { BedrockEmbeddings } from "langchain/embeddings/bedrock";
import { CharacterTextSplitter } from "langchain/text_splitter";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { LanceDB } from "langchain/vectorstores/lancedb";

import { connect } from "vectordb"; // LanceDB

import dotenv from "dotenv";

dotenv.config();

(async () => {
  const dir = process.env.LANCEDB_BUCKET || "missing_s3_folder";
  const lanceDbTable = process.env.LANCEDB_TABLE || "missing_table_name";
  const awsRegion = process.env.AWS_REGION;

  console.log("lanceDbSrc", dir);
  console.log("lanceDbTable", lanceDbTable);
  console.log("awsRegion", awsRegion);

  const path = `documents/poradnik_bezpiecznego_wypoczynku.pdf`;

  const splitter = new CharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  const embeddings = new BedrockEmbeddings({
    region: awsRegion,
    model: "amazon.titan-embed-text-v1",
  });

  const loader = new PDFLoader(path, {
    splitPages: false,
  });

  const documents = await loader.loadAndSplit(splitter);

  const db = await connect(dir);

  console.log("connected");

  const table = await db.openTable(lanceDbTable).catch((_) => {
    console.log("creating new table", lanceDbTable);
    return db.createTable(lanceDbTable, [
      {
        vector: Array(1536),
        text: "sample",
      },
    ]);
  });

  const preparedDocs = documents.map((doc) => ({
    pageContent: doc.pageContent,
    metadata: {},
  }));

  await LanceDB.fromDocuments(preparedDocs, embeddings, { table });
})();

Now I run the script from document-processor/ (if you want to use an AWS profile other than the default one, it needs to be configured as an environment variable):

npx ts-node main.ts
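
If you use a named profile, it can be set the usual way, e.g. with a hypothetical profile called my-profile:

AWS_PROFILE=my-profile npx ts-node main.ts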

Cross-check in S3 – all looks good.

Input data

Sometimes it might be tricky to build GenAI solutions for non-English languages. In my case, I plan to generate texts in Polish based on the Polish knowledge base.

Luckily, the Titan Embeddings model is multilingual and supports Polish. That’s why I can use it out of the box with the LangChain integration.

Next time I would like to spend more time on this step, especially on preparing chunks of the documents. For now, splitting everything into fixed-size pieces works.

Generate text based on the user’s query

OK, now I can create the main part.

In the root directory, I add a lambdas folder and create a new lambda with cargo lambda

cargo lambda new text_generator

Before I start, I create a config.rs file next to main.rs to keep env variables in one place. I add clap and dotenv to manage them:

cargo add clap -F derive,env
cargo add dotenv

// config.rs
#[derive(clap::Parser, Debug)]
pub struct Config {
    #[clap(long, env)]
    pub(crate) bucket_name: String,
    #[clap(long, env)]
    pub(crate) prefix: String,
    #[clap(long, env)]
    pub(crate) table_name: String,
}
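
For reference, the rest of the snippets in this post also pull in a few more crates. This is an assumed list (cargo lambda new already scaffolds lambda_runtime, tokio, and serde, and the exact crate names, features, and versions may differ from the repo):

cargo add aws-config aws-sdk-bedrockruntime
cargo add lancedb arrow-array
cargo add serde_json futures tracing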

Now I can read the configuration at the beginning of the execution.

// main.rs
#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing::init_default_subscriber();

    info!("starting lambda");

    dotenv::dotenv().ok();
    let env_config = Config::parse();
    // …

Before defining the function handler, I prepare everything that can live outside of a specific request’s context. The SDK client and the LanceDB connection are obvious candidates:

// main.rs
// …

// set up aws sdk config
let region_provider = RegionProviderChain::default_provider().or_else("us-east-1");
let config = aws_config::defaults(BehaviorVersion::latest())
    .region(region_provider)
    .load()
    .await;

// initialize sdk clients
let bedrock_client = aws_sdk_bedrockruntime::Client::new(&config);

info!("sdk clients initialized");
// …

When I started working on this blog post, the LanceDB SDK for Rust didn’t support connecting directly to S3, so I needed to implement logic to download Lance files from S3 to a local directory. That is not needed anymore.

I initialize LanceDB with the S3 bucket URI:

// …

let bucket_name = env_config.bucket_name;
let prefix = env_config.prefix;
let table_name = env_config.table_name;

let start_time_lance = std::time::Instant::now();

let s3_db = format!("s3://{}/{}/", bucket_name, prefix);

info!("bucket string {}", s3_db);

// set AWS_DEFAULT_REGION env
std::env::set_var("AWS_DEFAULT_REGION", "us-east-1");

let db = connect(&s3_db).execute().await?;

info!("connected to db {:?}", db.table_names().execute().await);

let table = db.open_table(&table_name).execute().await?;

info!("connected to db in {}", start_time_lance.elapsed().as_secs_f32());

Finally, I initialize the handler with the “injected” DB table and Bedrock client:

//…

run(service_fn(|event: LambdaEvent<Request>| {
    function_handler(&table, &bedrock_client, event)
}))
.await

Function handler

The Lambda function’s input and output are pretty straightforward:

#[derive(Deserialize)]
struct Request {
    prompt: String,
}

#[derive(Serialize)]
struct Response {
    req_id: String,
    msg: String,
}

The handler signature looks like this:

#[instrument(skip_all)]
async fn function_handler(
    table: &Table,
    client: &aws_sdk_bedrockruntime::Client,
    event: LambdaEvent<Request>,
) -> Result<Response, Error> {

    //…

Transform the query with Amazon Titan

The first task is to send the received prompt to the Bedrock Titan Embeddings model. According to the documentation, the model’s input and response are pretty simple:

{
    "inputText": string
}

{
    "embedding": [float, float, ...],
    "inputTextTokenCount": int
}

To be able to parse the response, I create a struct:

#[derive(Debug, serde::Deserialize)]
#[serde(rename_all = "camelCase")]
struct TitanResponse {
    embedding: Vec<f32>,
    input_text_token_count: i128,
}

And I use the SDK to invoke the model:

// …
// transform prompt to embeddings
let embeddings_prompt = format!(
    r#"{{
        "inputText": "{}"
    }}"#,
    prompt
);

info!("invoking embeddings model with: {}", embeddings_prompt);

let invocation = client
    .invoke_model()
    .content_type("application/json")
    .model_id("amazon.titan-embed-text-v1")
    .body(Blob::new(embeddings_prompt.as_bytes().to_vec()))
    .send()
    .await
    .unwrap();

let titan_response =
    serde_json::from_slice::<TitanResponse>(&invocation.body().clone().into_inner()).unwrap();

let embeddings = titan_response.embedding;

info!("got embeddings for prompt from model");
//…

Look up related documents in LanceDB

Once we have the query transformed into embeddings, we can utilize vector database magic. I query our knowledge base to find related content.

// …
let result: Vec<RecordBatch> = table
    .search(&embeddings)
    .limit(1)
    .execute_stream()
    .await
    .unwrap()
    .try_collect::<Vec<_>>()
    .await
    .unwrap();

let items = result
    .iter()
    .map(|batch| {
        let text_batch = batch.column(1);
        let texts = as_string_array(text_batch);
        texts
    })
    .flatten()
    .collect::<Vec<_>>();

info!(“items {:?}”, &items);

let context = items
    .first()
    .unwrap()
    .unwrap_or("")
    .replace('\u{a0}', "")
    .replace('\n', " ")
    .replace('\t', " ");
// …

Let’s unpack what’s going on here. LanceDB uses Arrow as the in-memory data format. The search query returns a vector of RecordBatches, an Arrow type.

To get the content out of the RecordBatches, I map them to items of type Vec<Option<&str>>. To be honest, I don’t like using a hardcoded column number to get the data I want (batch.column(1)), but so far I haven’t found a more declarative way.
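
A slightly more declarative option might be to look the column up by name on the Arrow RecordBatch. Here is a minimal sketch, assuming the chunk text lives in a column called "text" (the name used when the table was created in the document processor above):

let items = result
    .iter()
    .flat_map(|batch| {
        // look the column up by name instead of by position;
        // column_by_name returns None if the column is missing
        let text_batch = batch
            .column_by_name("text")
            .expect("expected a 'text' column in the LanceDB table");
        as_string_array(text_batch)
    })
    .collect::<Vec<_>>();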

As a last step, I sanitize the received text – otherwise, it won’t work as input for the LLM.

Invoke Claude 3

Finally, the most exciting part. I didn’t try any advanced prompt-engineering techniques, so my prompt is a simple one:

//…
let prompt_for_llm = format!(
    r#"{{
        "system": "Respond only in Polish. Informative style. Information focused on health and safety for kids during vacations. Keep it short and use max 500 words. Please use examples from the following document in Polish: {}",
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "messages": [
            {{
                "role": "user",
                "content": [{{
                    "type": "text",
                    "text": "{}"
                }}]
            }}
        ]
    }}"#,
    context, prompt
);
// …

Calling the model is the same as for embeddings. I needed different structs to parse the answer, but the flow is the same.
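
The response structs are not shown in the snippet below; as a minimal sketch (assuming the Claude 3 Messages API response shape), they could look like the following. CloudeResponse is the name used later in the code, while CloudeContent is just a name picked for this sketch:

#[derive(Debug, serde::Deserialize)]
struct CloudeResponse {
    // Claude 3 returns the answer as a list of content blocks
    content: Vec<CloudeContent>,
}

#[derive(Debug, serde::Deserialize)]
struct CloudeContent {
    // for plain text answers this is "text"
    #[serde(rename = "type")]
    content_type: String,
    text: String,
}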

//…
let generate_invocation = client
    .invoke_model()
    .content_type("application/json")
    .model_id("anthropic.claude-3-sonnet-20240229-v1:0")
    .body(Blob::new(prompt_for_llm.as_bytes().to_vec()))
    .send()
    .await
    .unwrap();

let raw_response = generate_invocation.body().clone().into_inner();

let generated_response = serde_json::from_slice::<CloudeResponse>(&raw_response).unwrap();

println!("{:?}", generated_response.content);

// Prepare the response
let resp = Response {
    req_id: event.context.request_id,
    msg: format!("Response {:?}.", generated_response),
};

// Return `Response` (it will be serialized to JSON automatically by the runtime)
Ok(resp)

Test

Testing is the fun part. First, let’s run the lambda locally with cargo lambda. I’ve prepared a JSON file with the prompt in events/prompt.json:

{
    "prompt": "jakie kompetencje powinni mieć opiekunowie"
}

And a .env file in the function’s root directory:

BUCKET_NAME=xxx
PREFIX=lance_db
TABLE_NAME=embeddings

The prompt is about what skills supervisors need to have. The document I’ve used as a knowledge base is a brochure prepared by the Polish Ministry of Education with general health and safety rules during holidays.

I run …

cargo lambda watch --env-file .env

… and in the second terminal

cargo lambda invoke --data-file events/prompt.json

For this prompt the context found in LanceDB is relevant.

I won’t translate the answer, but the point is that it looks reasonable, and I can see that the injected context was included in it.

The answer for the same query, just without context, is still pretty good, but generic.

I’ve experimented with different queries, and not all of them returned relevant context from the vector database. Preparing the knowledge base and tuning the embeddings are things I would like to dig into next.

Deployment

I use the RustFunction construct for CDK to define the lambda:

// lib/bedrock_rust-stack.ts

// …

const textGeneratorLambda = new RustFunction(this, "text-generator", {
  manifestPath: "lambdas/text_generator/Cargo.toml",
  environment: {
    BUCKET_NAME: vectorDbBucket.bucketName,
    PREFIX: "lance_db",
    TABLE_NAME: "embeddings",
  },
  memorySize: 512,
  timeout: cdk.Duration.seconds(30),
});

vectorDbBucket.grantRead(textGeneratorLambda);

// add policy to allow calling bedrock
textGeneratorLambda.addToRolePolicy(
  new iam.PolicyStatement({
    actions: ["bedrock:InvokeModel"],
    resources: [
      "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-text-v1",
      "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
    ],
    effect: iam.Effect.ALLOW,
  })
);

Cold start penalty

At this point, Rust is so famous for its speed that it would be disappointing to see bad cold start numbers. The init duration in my case was stable at around 300-400 ms. Pretty neat, since I am not “only” initializing SDK clients, but also “connecting” LanceDB to S3.

Overall performance

I didn’t run full-blown performance tests, so don’t treat my benchmarks too seriously.

Getting embeddings for the user’s prompt – 200-300 ms.

Getting context from LanceDB – I observed a max of ~400 ms, but it depends on the query. S3 is the slowest (and the cheapest) storage option for LanceDB. In my case, this is a fair tradeoff. Other serverless options are listed in the documentation.

The rest is invoking the Claude 3 model, which takes around 10 seconds to generate a full answer for my prompt.

Summary

LanceDB is an open-source vector database that allows separating storage from compute. Thanks to that, I was able to create a simple RAG pipeline using Lambda and S3.

Amazon Bedrock offers multiple models as a service. The multilingual Amazon Titan embeddings model lets you create embeddings in various languages, including Polish. Claude 3 Sonnet is a new LLM with outstanding capabilities.

LanceDB and AWS Bedrock provide SDKs for Rust, which are very pleasant to work with. Additionally, thanks to Rust, the cold start penalty is minimal.

I am aware that this is only scratching the surface. I plan to spend more time playing with LanceDB and AWS Bedrock. It feels like the sky is the limit now.
