Building an Embedding API with Rust, Arm, and EmbeddingGemma on AWS Lambda
Step-by-step guide covering model selection, containerization, ARM64 optimization, and production benchmarks
Introduction
On November 14, 2025, AWS announced official support for Rust in Lambda: full SLA, full AWS Support, production-ready. https://aws.amazon.com/about-aws/whats-new/2025/11/aws-lambda-rust/
This is a big deal. Rust on Lambda means blazing-fast cold starts, a minimal memory footprint, and compile-time safety: all things that matter when you're paying per millisecond and per megabyte. For performance-critical serverless workloads, it's hard to find a better fit.
So let’s put it to the test.
In this article, we’ll build a REST API that takes text and returns embeddings. If you’re not familiar with embeddings, they’re vector representations of text that capture semantic meaning. They power things like semantic search, recommendations, and RAG pipelines. Instead of calling an external service like AWS Bedrock or OpenAI, we’ll run Embedding Gemma directly inside Lambda.
Why go through the trouble? Cost and latency. External embedding APIs charge per token, and every API call adds network overhead. Running inference locally in Lambda gives you predictable pricing and faster response times, especially for high-volume workloads.
Along the way, we’ll work within Lambda’s constraints (10GB memory, 10GB container images, 15-minute timeout) and see how Rust helps us maximize performance within these limits.
If you’re curious about Rust, interested in serverless ML, or want to see what’s possible now that Rust is officially supported, let’s explore together.
Understanding the Constraints
Before writing any code, let’s map out what we’re working with.
Lambda limits:
Memory: Up to 10GB
Storage: 512MB in /tmp (or up to 10GB with ephemeral storage configured)
Package size: 250MB zipped for direct upload, or up to 10GB with container images
Timeout: 15 minutes max
CPU: Scales proportionally with memory
For ML workloads, memory and package size are usually the bottlenecks. Large models don’t fit, and if they do, cold starts can be brutal.
Embedding Gemma specs:
EmbeddingGemma is designed for on-device inference, optimized for exactly the kind of constrained environment we’re dealing with.
Parameters: ~308 million (100M model parameters + 200M embedding parameters)
RAM with quantization: Sub-200MB
Output dimensions: 768 (or 128/256/512 using Matryoshka truncation)
Context window: 2K tokens
Inference time: <15ms on EdgeTPU, <22ms on mobile (benchmarked at 256 tokens; longer sequences scale proportionally)
The model is built on the Gemma 3 architecture and trained on 100+ languages. Google explicitly designed it for phones, laptops, and tablets. Lambda's 10GB memory ceiling is more than enough.
Project Setup
First, install cargo-lambda. It’s a Cargo subcommand that simplifies building, testing, and deploying Rust Lambda functions.
cargo install cargo-lambda
Create a new project:
cargo lambda new embedding-lambda
cd embedding-lambda
When prompted, select “HTTP function” since we’re building a REST API.
Dependencies
Open Cargo.toml and add the following:
[package]
name = "embedding-lambda"
version = "0.1.0"
edition = "2021" # Using 2021 for broader ecosystem compatibility; Rust 2024 is available but less widely supported
[dependencies]
lambda_http = "1.0" # Using semver flexibility for automatic patch updates
tokio = { version = "1.48.0", features = ["macros"] }
serde = { version = "1.0.228", features = ["derive"] }
serde_json = "1.0.145"
ndarray = "0.17.1"
tokenizers = "0.22.2"
tracing = "0.1.43"
tracing-subscriber = { version = "0.3.22", features = ["env-filter"] }
thiserror = "2.0"
# Platform-specific ONNX Runtime configuration
[target.'cfg(target_os = "macos")'.dependencies]
ort = { version = "2.0.0-rc.10", default-features = false, features = [
"ndarray",
"std",
"download-binaries",
] }
[target.'cfg(target_os = "linux")'.dependencies]
ort = { version = "2.0.0-rc.10", default-features = false, features = [
"ndarray",
"std",
"load-dynamic",
] }
[profile.release]
opt-level = 3 # Optimize for speed (ARM64 benefits more from speed optimizations)
lto = "fat" # Full link-time optimization across all crates
codegen-units = 1 # Better optimization, slower compile
strip = true # Strip symbols
panic = "abort" # Abort on panic for FFI safety with ONNX Runtime
# Alternative profile optimized for smaller binary size (faster cold starts)
[profile.release-size]
inherits = "release"
opt-level = "z" # Optimize for size
lto = true
Key dependencies:
lambda_http: The official AWS Lambda HTTP runtime for Rust
ort: Rust bindings for ONNX Runtime with platform-specific loading strategies
ndarray: NumPy-like array operations for tensor handling
tokenizers: Hugging Face's tokenizer library with Rust bindings
serde/serde_json: For request/response serialization
Platform-specific ONNX Runtime loading
The ort crate is configured differently per platform:
macOS (download-binaries): Automatically downloads ONNX Runtime during compilation. Convenient for local development.
Linux (load-dynamic): Loads libonnxruntime.so at runtime via the ORT_DYLIB_PATH environment variable. Required for Lambda deployment, where we control the runtime environment.
Release profile optimizations
The [profile.release] section configures the compiler for optimal ARM64 performance:
opt-level = 3: Optimize for speed. ARM64 Graviton2 processors deliver up to 19% better performance and 34% better price-performance compared to x86 for compute-intensive workloads
lto = "fat": Full link-time optimization across all crates for maximum performance
codegen-units = 1: A single codegen unit enables better whole-program optimization
strip = true: Removes symbol information from the final binary
panic = "abort": Abort on panic instead of unwinding, which is safer for FFI with ONNX Runtime
For situations where binary size matters more than performance (e.g., optimizing cold starts), use the release-size profile which prioritizes size optimization with opt-level = "z".
These settings increase compile time but maximize runtime performance.
Project structure
embedding-lambda/
├── Cargo.toml
├── .cargo/
│ └── config.toml
├── src/
│ ├── main.rs
│ ├── embedder.rs
│ ├── error.rs
│ └── http_handler.rs
└── model/
├── model_quantized.onnx
├── model_quantized.onnx_data
└── tokenizer.json
We'll download the ONNX model from onnx-community/embeddinggemma-300m-ONNX on Hugging Face: https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX
The model is available in fp32, q8, and q4 variants.
For Lambda, Q8 offers the best balance: nearly full quality at roughly a quarter of the fp32 size. Q4 doesn't achieve a 4× size reduction due to metadata overhead, and its quality tradeoff is more noticeable.
Implementing the Embedding Logic
Let’s build the core embedding functionality. We need to:
Load the tokenizer
Load the ONNX model
Tokenize input text
Run inference
Apply mean pooling to get the final embedding
Truncate to the requested dimension (Matryoshka)
Matryoshka embeddings
EmbeddingGemma was trained using Matryoshka Representation Learning (MRL). Named after Russian nesting dolls, this technique produces embeddings where the first N dimensions form a valid, meaningful embedding on their own.
In practice, this means you can truncate the full 768-dimensional vector to 512, 256, or 128 dimensions without retraining or losing semantic quality. Smaller embeddings mean:
Less storage space in your vector database
Faster similarity calculations
Lower memory usage
The tradeoff is minor: smaller dimensions capture slightly less nuance, but for most use cases the difference is negligible.
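To make that concrete, here's a small standalone Rust sketch (the helper names are mine, not part of the service code) that truncates a full-length vector and re-normalizes it so the dot product can again be read as cosine similarity:

/// Standalone sketch: truncate a Matryoshka embedding and re-normalize it.
fn truncate_and_normalize(full: &[f32], dims: usize) -> Vec<f32> {
    // Keep only the first `dims` components (e.g., 256 of 768)...
    let mut v: Vec<f32> = full.iter().take(dims).copied().collect();
    // ...then re-normalize so the dot product equals cosine similarity again.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
    v
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Two toy "768-dimensional" vectors stand in for real model output.
    let a: Vec<f32> = (0..768).map(|i| (i as f32).sin()).collect();
    let b: Vec<f32> = (0..768).map(|i| (i as f32).cos()).collect();

    let a_256 = truncate_and_normalize(&a, 256);
    let b_256 = truncate_and_normalize(&b, 256);

    // On unit-length vectors, the dot product is the cosine similarity.
    println!("cosine(a, b) at 256 dims = {:.4}", dot(&a_256, &b_256));
}

This is exactly what steps 6 and 7 of the embed() method below do.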
The embedding module
Create src/embedder.rs:
use crate::error::EmbedError;
use ort::{session::Session, value::Value};
use tokenizers::Tokenizer;

/// Valid embedding dimensions for Matryoshka truncation
pub const VALID_DIMENSIONS: [usize; 4] = [768, 512, 256, 128];

/// Maximum sequence length in tokens.
/// Prevents excessive memory usage and processing time.
const MAX_SEQUENCE_LENGTH: usize = 8192;

/// Handles text embedding using ONNX Runtime.
///
/// The Embedder loads an ONNX model and tokenizer, then provides
/// a simple interface to convert text into vector embeddings.
pub struct Embedder {
    session: Session,
    tokenizer: Tokenizer,
}

impl Embedder {
    /// Creates a new Embedder instance.
    ///
    /// # Arguments
    /// * `model_path` - Path to the ONNX model file (e.g., "model/model_quantized.onnx")
    /// * `tokenizer_path` - Path to the tokenizer JSON file (e.g., "model/tokenizer.json")
    ///
    /// # Note
    /// The ONNX model uses external data storage. Both `model_quantized.onnx` and
    /// `model_quantized.onnx_data` must be present in the same directory.
    /// ONNX Runtime automatically loads the external data file.
    pub fn new(model_path: &str, tokenizer_path: &str) -> Result<Self, EmbedError> {
        // Initialize the ONNX Runtime session with optimization level Basic (Level 1).
        // This enables standard graph optimizations for better performance on ARM64.
        let session = Session::builder()?
            .with_optimization_level(ort::session::builder::GraphOptimizationLevel::Level1)?
            .with_intra_threads(1)? // Optimal for quantized models under 500MB: a single thread reduces overhead
            .commit_from_file(model_path)?;

        // Load the Hugging Face tokenizer from JSON
        let tokenizer = Tokenizer::from_file(tokenizer_path)
            .map_err(|e| EmbedError::TokenizerLoad {
                path: tokenizer_path.to_string(),
                reason: e.to_string(),
            })?;

        Ok(Self { session, tokenizer })
    }

    /// Tokenizes input text with the document prompt format.
    ///
    /// EmbeddingGemma expects a specific prompt template:
    /// "title: none | text: {text}"
    fn tokenize(&self, text: &str) -> Result<(Vec<i64>, Vec<i64>), EmbedError> {
        // Apply the prompt template
        let formatted = format!("title: none | text: {}", text);

        // Tokenize with special tokens (e.g., [CLS], [SEP])
        let encoding = self
            .tokenizer
            .encode(formatted, true)
            .map_err(|e| EmbedError::Tokenization(e.to_string()))?;

        // Convert to i64 as required by ONNX Runtime
        let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&id| id as i64).collect();
        let attention_mask: Vec<i64> = encoding
            .get_attention_mask()
            .iter()
            .map(|&m| m as i64)
            .collect();

        Ok((input_ids, attention_mask))
    }

    /// Generates an embedding vector for the given text.
    ///
    /// # Arguments
    /// * `text` - The input text to embed
    /// * `size` - Output dimension: 768, 512, 256, or 128 (Matryoshka truncation)
    ///
    /// # Returns
    /// A normalized embedding vector of the requested dimension
    pub fn embed(&mut self, text: &str, size: usize) -> Result<Vec<f32>, EmbedError> {
        // Validate the requested dimension
        if !VALID_DIMENSIONS.contains(&size) {
            return Err(EmbedError::InvalidDimension {
                size,
                valid: VALID_DIMENSIONS.to_vec(),
            });
        }

        // Step 1: Tokenize the input
        let (input_ids, attention_mask) = self.tokenize(text)?;
        let seq_len = input_ids.len();

        // Validate sequence length
        if seq_len > MAX_SEQUENCE_LENGTH {
            return Err(EmbedError::SequenceTooLong {
                got: seq_len,
                max: MAX_SEQUENCE_LENGTH,
            });
        }

        // Step 2: Prepare inputs as 2D tensors with shape [batch_size=1, seq_len]
        let shape = vec![1, seq_len];

        // Step 3: Run inference
        let outputs = self.session.run(ort::inputs![
            "input_ids" => Value::from_array((shape.clone(), input_ids))?,
            "attention_mask" => Value::from_array((shape, attention_mask.clone()))?,
        ])?;

        // Step 4: Extract the output tensor.
        // The model outputs last_hidden_state with shape [batch_size, seq_len, hidden_dim].
        let (output_shape, output_data) = outputs[0].try_extract_tensor::<f32>()?;
        let batch_size = output_shape[0] as usize;
        let seq_len_out = output_shape[1] as usize;
        let hidden_dim = output_shape[2] as usize;

        // Convert to ArrayView3 for mean_pooling
        let output_view =
            ndarray::ArrayView3::from_shape((batch_size, seq_len_out, hidden_dim), output_data)?;

        // Step 5: Apply mean pooling over token embeddings
        let embedding = Self::mean_pooling(&output_view, &attention_mask)?;

        // Step 6: Truncate to requested dimension (Matryoshka)
        let truncated: Vec<f32> = embedding.into_iter().take(size).collect();

        // Step 7: L2 normalize the final embedding.
        // Re-normalization after truncation is important for correct similarity scores.
        let normalized = Self::normalize(&truncated);

        Ok(normalized)
    }

    /// Applies mean pooling to token embeddings.
    ///
    /// Mean pooling averages the embeddings of all non-padding tokens.
    /// The attention mask is used to exclude padding tokens from the average.
    /// Uses vectorized ndarray operations for optimal performance.
    fn mean_pooling(
        hidden_states: &ndarray::ArrayView3<f32>,
        attention_mask: &[i64],
    ) -> Result<Vec<f32>, EmbedError> {
        use ndarray::Axis;

        // hidden_states: [batch=1, seq_len, hidden_dim]
        // Remove batch dimension: [seq_len, hidden_dim]
        let states_2d = hidden_states.index_axis(Axis(0), 0);

        // Convert mask to f32 and create array
        let mask_f32: Vec<f32> = attention_mask.iter().map(|&x| x as f32).collect();
        let mask_1d = ndarray::Array1::from(mask_f32);

        // Count non-padding tokens (do this before consuming mask_1d)
        let count = mask_1d.sum();

        // Reshape to [seq_len, 1] for broadcasting
        let mask_col = mask_1d.insert_axis(Axis(1)); // Shape: [seq_len, 1]

        // Broadcast multiply: each token embedding is scaled by its mask value.
        // This zeros out padding tokens.
        let masked_states = &states_2d * &mask_col;

        // Sum along sequence axis: [seq_len, hidden_dim] -> [hidden_dim]
        let sum = masked_states.sum_axis(Axis(0));

        // Compute mean (avoid division by zero)
        let mean = if count > 0.0 { sum / count } else { sum };

        Ok(mean.to_vec())
    }

    /// Applies L2 normalization to the embedding vector.
    ///
    /// Normalized embeddings allow using dot product instead of cosine similarity,
    /// which is computationally cheaper for similarity searches.
    fn normalize(embedding: &[f32]) -> Vec<f32> {
        let norm: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            embedding.iter().map(|x| x / norm).collect()
        } else {
            embedding.to_vec()
        }
    }
}
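If you want a quick sanity check of the pooling and normalization math, a test module along these lines can be appended to src/embedder.rs (a sketch; the test cases and tolerances are my own, not from the project):

#[cfg(test)]
mod tests {
    use super::*;
    use ndarray::Array3;

    #[test]
    fn normalize_produces_unit_length() {
        let normalized = Embedder::normalize(&[3.0, 4.0]);
        let norm: f32 = normalized.iter().map(|x| x * x).sum::<f32>().sqrt();
        // A non-zero vector should come out with length 1.
        assert!((norm - 1.0).abs() < 1e-6);
    }

    #[test]
    fn mean_pooling_ignores_padding_tokens() {
        // Two real tokens and one padding token, hidden_dim = 2.
        let hidden = Array3::from_shape_vec(
            (1, 3, 2),
            vec![1.0, 2.0, 3.0, 4.0, 100.0, 100.0],
        )
        .unwrap();
        let mask = [1i64, 1, 0]; // last token is padding

        let pooled = Embedder::mean_pooling(&hidden.view(), &mask).unwrap();
        // Average of [1, 2] and [3, 4] only; the padded [100, 100] is masked out.
        assert_eq!(pooled, vec![2.0, 3.0]);
    }
}

Run it with cargo test.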
Production-Grade Error Handling
Before building the HTTP handler, let’s implement proper error handling. Production code needs structured errors that are type-safe, informative, and secure.
Create src/error.rs:
use thiserror::Error;

/// Errors that can occur during embedding generation
#[derive(Error, Debug)]
pub enum EmbedError {
    /// Invalid embedding dimension requested
    #[error("Invalid embedding size: {size}. Must be one of: {valid:?}")]
    InvalidDimension { size: usize, valid: Vec<usize> },

    /// Input sequence is too long
    #[error("Tokenized sequence exceeds maximum length of {max} tokens (got {got})")]
    SequenceTooLong { got: usize, max: usize },

    /// Empty text input
    #[error("Text input cannot be empty")]
    EmptyInput,

    /// Text input exceeds maximum character limit
    #[error("Text exceeds maximum length of {max} characters (got {got})")]
    TextTooLong { got: usize, max: usize },

    /// Failed to load tokenizer
    #[error("Failed to load tokenizer from {path}: {reason}")]
    TokenizerLoad { path: String, reason: String },

    /// Tokenization failed
    #[error("Tokenization failed: {0}")]
    Tokenization(String),

    /// ONNX Runtime error
    #[error("ONNX Runtime error: {0}")]
    OnnxRuntime(#[from] ort::Error),

    /// Array shape mismatch
    #[error("Array shape error: {0}")]
    ArrayShape(#[from] ndarray::ShapeError),

    /// Mutex poisoned (concurrent access error)
    #[error("Internal error: shared resource poisoned")]
    MutexPoisoned,

    /// Internal server error (catch-all)
    #[error("Internal server error: {0}")]
    Internal(String),
}

impl EmbedError {
    /// Returns true if this error should be reported as a client error (4xx)
    pub fn is_client_error(&self) -> bool {
        matches!(
            self,
            EmbedError::InvalidDimension { .. }
                | EmbedError::SequenceTooLong { .. }
                | EmbedError::EmptyInput
                | EmbedError::TextTooLong { .. }
        )
    }

    /// Get the HTTP status code for this error
    pub fn status_code(&self) -> u16 {
        if self.is_client_error() {
            400
        } else {
            500
        }
    }

    /// Get a user-friendly error message (sanitized for production)
    pub fn user_message(&self) -> String {
        match self {
            // Client errors - show detailed message
            EmbedError::InvalidDimension { size, valid } => {
                format!("Invalid embedding size: {}. Must be one of: {:?}", size, valid)
            }
            EmbedError::SequenceTooLong { got, max } => {
                format!("Text is too long: {} tokens (max: {})", got, max)
            }
            EmbedError::EmptyInput => "Text input cannot be empty".to_string(),
            EmbedError::TextTooLong { got, max } => {
                format!("Text is too long: {} characters (max: {})", got, max)
            }
            EmbedError::Tokenization(msg) => {
                format!("Failed to process text: {}", msg)
            }
            // Server errors - generic message in production
            _ => {
                if cfg!(debug_assertions) {
                    // Development: show full error
                    self.to_string()
                } else {
                    // Production: generic message
                    "An internal error occurred while processing your request".to_string()
                }
            }
        }
    }
}
| impl EmbedError { | |
| /// Returns true if this error should be reported as a client error (4xx) | |
| pub fn is_client_error(&self) -> bool { | |
| matches!( | |
| self, | |
| EmbedError::InvalidDimension { .. } | |
| | EmbedError::SequenceTooLong { .. } | |
| | EmbedError::EmptyInput | |
| | EmbedError::TextTooLong { .. } | |
| ) | |
| } | |
| /// Get the HTTP status code for this error | |
| pub fn status_code(&self) -> u16 { | |
| if self.is_client_error() { | |
| 400 | |
| } else { | |
| 500 | |
| } | |
| } | |
| /// Get a user-friendly error message (sanitized for production) | |
| pub fn user_message(&self) -> String { | |
| match self { | |
| // Client errors - show detailed message | |
| EmbedError::InvalidDimension { size, valid } => { | |
| format!("Invalid embedding size: {}. Must be one of: {:?}", size, valid) | |
| } | |
| EmbedError::SequenceTooLong { got, max } => { | |
| format!("Text is too long: {} tokens (max: {})", got, max) | |
| } | |
| EmbedError::EmptyInput => "Text input cannot be empty".to_string(), | |
| EmbedError::TextTooLong { got, max } => { | |
| format!("Text is too long: {} characters (max: {})", got, max) | |
| } | |
| EmbedError::Tokenization(msg) => { | |
| format!("Failed to process text: {}", msg) | |
| } | |
| // Server errors - generic message in production | |
| _ => { | |
| if cfg!(debug_assertions) { | |
| // Development: show full error | |
| self.to_string() | |
| } else { | |
| // Production: generic message | |
| "An internal error occurred while processing your request".to_string() | |
| } | |
| } | |
| } | |
| } | |
| } |
Key benefits:
Type-safe errors: Each error variant carries relevant context
User-friendly messages: user_message() provides sanitized output for API responses
Security: Internal errors don't leak implementation details in production
HTTP mapping: status_code() maps errors to appropriate HTTP codes
Automatic trait implementations: thiserror generates the Display and Error traits
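As a quick illustration of the mapping (a sketch with made-up values; append it to src/error.rs if you want to keep it):

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn invalid_dimension_maps_to_400_with_details() {
        let err = EmbedError::InvalidDimension {
            size: 300,
            valid: vec![768, 512, 256, 128],
        };
        assert!(err.is_client_error());
        assert_eq!(err.status_code(), 400);
        // Client errors keep their detailed message.
        assert!(err.user_message().contains("300"));
    }

    #[test]
    fn internal_errors_map_to_500() {
        let err = EmbedError::Internal("onnx session crashed".to_string());
        assert!(!err.is_client_error());
        assert_eq!(err.status_code(), 500);
    }
}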
Building the Lambda Handler
Now let’s wire up the embedding logic to a Lambda HTTP endpoint. We’ll split the code into two files for better organization: the handler logic and the main entry point.
The HTTP handler
Create src/http_handler.rs:
use crate::embedder::{Embedder, VALID_DIMENSIONS};
use crate::error::EmbedError;
use lambda_http::{Body, Error, Request, Response};
use serde::{Deserialize, Serialize};
use std::sync::{Arc, Mutex};
use tracing::{error, info, warn};

/// Maximum input text length in characters.
/// Prevents OOM from extremely long inputs.
const MAX_TEXT_LENGTH: usize = 100_000;

/// Incoming request payload
#[derive(Deserialize)]
struct EmbedRequest {
    /// The text to embed
    text: String,
    /// Output dimension: 768, 512, 256, or 128 (default: 768)
    #[serde(default = "default_size")]
    size: usize,
}

fn default_size() -> usize {
    768
}

/// Response payload containing the embedding vector
#[derive(Serialize)]
struct EmbedResponse {
    /// The embedding vector
    embedding: Vec<f32>,
    /// Dimension of the embedding
    size: usize,
}

/// Error response payload
#[derive(Serialize)]
struct ErrorResponse {
    error: String,
}

/// Lambda handler function.
///
/// Receives an HTTP request with JSON body, generates an embedding,
/// and returns it as a JSON response.
pub async fn function_handler(
    embedder: Arc<Mutex<Embedder>>,
    event: Request,
) -> Result<Response<Body>, Error> {
    // Parse the JSON request body
    let body = event.body();
    let request: EmbedRequest = match serde_json::from_slice(body) {
        Ok(req) => req,
        Err(e) => {
            return Ok(error_response(400, &format!("Invalid JSON: {}", e)));
        }
    };

    // Validate the size parameter
    if !VALID_DIMENSIONS.contains(&request.size) {
        return Ok(error_response(
            400,
            &format!(
                "Invalid size: {}. Must be one of: {:?}",
                request.size, VALID_DIMENSIONS
            ),
        ));
    }

    // Validate text is not empty
    if request.text.is_empty() {
        let err = EmbedError::EmptyInput;
        warn!("Empty text input");
        return Ok(error_from_embed_error(&err));
    }

    // Validate text length to prevent OOM
    if request.text.len() > MAX_TEXT_LENGTH {
        let err = EmbedError::TextTooLong {
            got: request.text.len(),
            max: MAX_TEXT_LENGTH,
        };
        warn!("Text too long: {} chars", request.text.len());
        return Ok(error_from_embed_error(&err));
    }

    // Generate the embedding.
    // Mutex required: ONNX Runtime Rust bindings need &mut for session.run().
    // Lambda processes one request at a time per container, so no contention.
    let embedding = {
        // Safe mutex handling - recover from poisoned state
        let mut embedder = match embedder.lock() {
            Ok(guard) => guard,
            Err(poisoned) => {
                warn!("Mutex was poisoned, recovering...");
                poisoned.into_inner()
            }
        };

        match embedder.embed(&request.text, request.size) {
            Ok(emb) => {
                info!(
                    text_len = request.text.len(),
                    embedding_size = request.size,
                    "Embedding generated successfully"
                );
                emb
            }
            Err(e) => {
                error!("Embedding generation failed: {}", e);
                return Ok(error_from_embed_error(&e));
            }
        }
    };

    // Build the JSON response
    let response = EmbedResponse {
        size: embedding.len(),
        embedding,
    };

    let response_json = serde_json::to_string(&response)?;

    let resp = Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(response_json.into())
        .map_err(|e| Box::new(e) as Box<dyn std::error::Error + Send + Sync>)?;

    Ok(resp)
}

/// Helper function to create error responses from EmbedError
fn error_from_embed_error(err: &EmbedError) -> Response<Body> {
    let status = err.status_code();
    let message = err.user_message();
    error_response(status, &message)
}

/// Helper function to create error responses
fn error_response(status: u16, message: &str) -> Response<Body> {
    let body = serde_json::to_string(&ErrorResponse {
        error: message.to_string(),
    })
    .unwrap_or_else(|_| r#"{"error":"Unknown error"}"#.to_string());

    // Safe response building with fallback
    Response::builder()
        .status(status)
        .header("content-type", "application/json")
        .body(body.into())
        .unwrap_or_else(|e| {
            error!("Failed to build error response: {}", e);
            // Absolute fallback
            Response::builder()
                .status(500)
                .body(Body::from(r#"{"error":"Internal server error"}"#))
                .unwrap()
        })
}
The main entry point
Update src/main.rs:
pub mod embedder;
pub mod error;
pub mod http_handler;

use embedder::Embedder;
use http_handler::function_handler;
use lambda_http::{run, service_fn, tracing, Error};
use std::sync::{Arc, Mutex};

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Initialize the ONNX Runtime global environment and keep it in memory for the program lifetime.
    // This must be called before creating any sessions to register the DefaultLogger.
    ort::init().with_name("embedding-lambda").commit()?;

    // Initialize tracing for CloudWatch logs
    tracing::init_default_subscriber();

    // Initialize the Embedder once during cold start.
    // This loads the ONNX model and tokenizer into memory.
    let embedder = Embedder::new("model/model_quantized.onnx", "model/tokenizer.json")
        .map_err(|e| {
            tracing::error!("Failed to initialize embedder: {}", e);
            Box::new(e) as Box<dyn std::error::Error + Send + Sync>
        })?;

    // Wrap in Arc<Mutex> to share across handler invocations.
    // Mutex required: ONNX Runtime Rust bindings need &mut for session.run().
    let embedder = Arc::new(Mutex::new(embedder));

    // Start the Lambda runtime.
    // Each incoming request will clone the Arc and call function_handler.
    run(service_fn(move |event| {
        let embedder = embedder.clone();
        function_handler(embedder, event)
    }))
    .await
}
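One optional tweak worth sketching (an assumption on my part, not something from the repository): run a throwaway embedding during initialization so ONNX Runtime's one-time allocations and the first graph execution happen in the init phase rather than inside the first real request. It pairs well with provisioned concurrency, where initialization runs ahead of traffic. The relevant part of main() would change roughly like this:

// Sketch: optional warm-up, replacing the Embedder::new block above.
// The binding must be `mut` because embed() takes &mut self.
let mut embedder = Embedder::new("model/model_quantized.onnx", "model/tokenizer.json")
    .map_err(|e| {
        tracing::error!("Failed to initialize embedder: {}", e);
        Box::new(e) as Box<dyn std::error::Error + Send + Sync>
    })?;

// A throwaway call forces ONNX Runtime to allocate its buffers and run the
// graph once while we're still in the init phase.
if let Err(e) = embedder.embed("warm-up", 128) {
    tracing::warn!("Warm-up inference failed: {}", e);
}

let embedder = Arc::new(Mutex::new(embedder));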
API usage
The endpoint accepts a JSON payload with two fields:
{
"text": "Your text to embed",
"size": 256
}
The size parameter is optional and defaults to 768. Valid values are:
768 - Full precision, best quality
512 - Good balance of quality and efficiency
256 - Faster similarity search, slightly reduced quality
128 - Smallest, fastest, good for large-scale filtering
Testing locally
Before deploying, test the function locally using cargo-lambda:
cargo lambda watch
In another terminal, send test requests with different sizes:
# Full 768-dimensional embedding (default)
curl -X POST http://localhost:9000/ \
-H "Content-Type: application/json" \
-d '{"text": "What is the capital of France?"}'
# Compact 256-dimensional embedding
curl -X POST http://localhost:9000/ \
-H "Content-Type: application/json" \
-d '{"text": "What is the capital of France?", "size": 256}'
Containerizing for Lambda
Our project includes ONNX model files that exceed Lambda’s 250MB zip deployment limit. Container images solve this problem, supporting up to 10GB.
Image size breakdown
The final image size depends on:
Rust binary: ~5-15MB (with size optimizations)
ONNX Runtime library: ~50-100MB
Model files: ~300-600MB (depending on quantization)
Base image: ~100MB
Total: typically under 1GB, well within Lambda’s 10GB limit.
Why containers?
Beyond the size limit, containers give us:
Full control over the runtime environment
Ability to include native libraries (ONNX Runtime)
Consistent builds across development and production
ARM64 support for AWS Graviton processors (better price/performance)
Why ARM64?
AWS Graviton processors offer up to 34% better price/performance compared to x86 for Lambda. Since we’re building from scratch, there’s no reason not to target ARM64.
Project structure
Before we create the Dockerfile, make sure your project looks like this:
embedding-lambda/
├── Cargo.toml
├── Cargo.lock
├── Dockerfile
├── .cargo/
│ └── config.toml
├── src/
│ ├── main.rs
│ ├── embedder.rs
│ ├── error.rs
│ └── http_handler.rs
└── model/
├── model_quantized.onnx
├── model_quantized.onnx_data
└── tokenizer.json
The Dockerfile
Create a Dockerfile in the project root:
# Stage 1: Build the Rust binary for ARM64
FROM --platform=linux/arm64 rust:1.92-slim-bookworm AS builder
# Install build dependencies including lld linker for faster builds
RUN apt-get update && apt-get install -y \
pkg-config \
libssl-dev \
build-essential \
lld \
&& rm -rf /var/lib/apt/lists/*
# Set ARM64 Graviton2-specific compiler optimizations
ENV RUSTFLAGS="-C target-cpu=neoverse-n1 -C target-feature=+neon"
WORKDIR /app
# Copy cargo config for ARM64 optimizations
COPY .cargo .cargo
# Copy manifests first for better layer caching
COPY Cargo.toml Cargo.lock ./
# Create a dummy main.rs to build dependencies
RUN mkdir src && \
echo "fn main() {}" > src/main.rs && \
cargo build --release && \
rm -rf src
# Copy actual source code
COPY src ./src
# Build the real application
# Touch main.rs to ensure it rebuilds
RUN touch src/main.rs && cargo build --release
# Stage 2: Download ONNX Runtime for ARM64
FROM --platform=linux/arm64 debian:bookworm-slim AS onnx-downloader
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
# Download ONNX Runtime for Linux ARM64
ARG ORT_VERSION=1.22.0
RUN curl -L https://github.com/microsoft/onnxruntime/releases/download/v${ORT_VERSION}/onnxruntime-linux-aarch64-${ORT_VERSION}.tgz \
| tar xz -C /opt
# Stage 3: Create the Lambda runtime image for ARM64
FROM --platform=linux/arm64 public.ecr.aws/lambda/provided:al2023-arm64
# Copy ONNX Runtime library
COPY --from=onnx-downloader /opt/onnxruntime-linux-aarch64-*/lib/libonnxruntime.so* /opt/onnxruntime/lib/
# Set the library path for dynamic loading
ENV ORT_DYLIB_PATH=/opt/onnxruntime/lib/libonnxruntime.so
# Copy the compiled binary
COPY --from=builder /app/target/release/embedding-lambda ${LAMBDA_RUNTIME_DIR}/bootstrap
# Copy model files
COPY model/ ${LAMBDA_TASK_ROOT}/model/
# Set the entrypoint
CMD ["bootstrap"]
Understanding the Dockerfile
The build uses three stages:
Builder stage: Compiles the Rust code for ARM64 with Graviton2-specific optimizations. Key enhancements include:
lld linker: Faster linking times during builds
RUSTFLAGS: Targets AWS Graviton2’s Neoverse-N1 CPU architecture with NEON SIMD instructions
.cargo directory: Contains additional ARM64 optimization configuration
The release profile optimizations from Cargo.toml are applied automatically
These optimizations provide 10-15% performance improvement for ARM64 workloads.
ONNX downloader stage: Downloads the official ONNX Runtime release for Linux ARM64 from GitHub. This gives us libonnxruntime.so, which the ort crate loads dynamically.
Runtime stage: Combines everything into the final Lambda image. The ORT_DYLIB_PATH environment variable tells the ort crate where to find the ONNX Runtime library.
ARM64 Optimization Configuration
For optimal ARM64 performance on AWS Graviton processors, create a .cargo/config.toml file in your project root:
[build]
rustflags = ["-C", "target-cpu=neoverse-n1"]
[target.aarch64-unknown-linux-gnu]
linker = "aarch64-linux-gnu-gcc"
rustflags = [
"-C", "target-cpu=neoverse-n1",
"-C", "target-feature=+neon",
]
This configuration:
Targets the Neoverse-N1 CPU architecture used in AWS Graviton2 processors
Enables NEON SIMD instructions for vectorized operations (2-3x faster mean pooling)
Works in conjunction with the RUSTFLAGS in the Dockerfile
The vectorized mean pooling implementation in embedder.rs automatically leverages these SIMD instructions, providing significant performance gains for embedding generation.
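To see the effect of vectorization on your own hardware, here's a rough standalone micro-benchmark (a sketch; the function names, shapes, and iteration counts are arbitrary) comparing a naive pooling loop with the ndarray formulation used in embedder.rs:

use ndarray::{Array2, Axis};
use std::time::Instant;

fn naive_mean_pool(states: &Array2<f32>, mask: &[f32]) -> Vec<f32> {
    // Element-by-element loop: no SIMD-friendly structure for the compiler.
    let (seq_len, hidden) = states.dim();
    let mut out = vec![0.0f32; hidden];
    let mut count = 0.0f32;
    for t in 0..seq_len {
        if mask[t] > 0.0 {
            count += 1.0;
            for h in 0..hidden {
                out[h] += states[[t, h]];
            }
        }
    }
    if count > 0.0 {
        out.iter_mut().for_each(|x| *x /= count);
    }
    out
}

fn vectorized_mean_pool(states: &Array2<f32>, mask: &[f32]) -> Vec<f32> {
    // Same broadcasting trick as mean_pooling() in embedder.rs.
    let mask_col = ndarray::Array1::from(mask.to_vec()).insert_axis(Axis(1));
    let count = mask.iter().sum::<f32>();
    let sum = (states * &mask_col).sum_axis(Axis(0));
    (if count > 0.0 { sum / count } else { sum }).to_vec()
}

fn main() {
    // 2,048 tokens x 768 hidden dims, roughly the worst case for this model.
    let states = Array2::from_elem((2048, 768), 0.5f32);
    let mask = vec![1.0f32; 2048];

    let t = Instant::now();
    for _ in 0..100 {
        std::hint::black_box(naive_mean_pool(&states, &mask));
    }
    println!("naive:      {:?}", t.elapsed());

    let t = Instant::now();
    for _ in 0..100 {
        std::hint::black_box(vectorized_mean_pool(&states, &mask));
    }
    println!("vectorized: {:?}", t.elapsed());
}

The exact speedup depends on the CPU and the flags above; the 2-3x figure quoted earlier comes from this deployment's measurements, not from this sketch.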
Building the image
Build the ARM64 image:
docker buildx build --platform linux/arm64 -t embedding-lambda .
If you’re on an ARM machine (like Apple Silicon Mac), you can simply use:
docker build -t embedding-lambda .
Testing locally
Test the container using Lambda’s Runtime Interface Emulator:
docker run --rm -p 9000:8080 embedding-lambda
In another terminal:
curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
-H "Content-Type: application/json" \
-d '{"body": "{\"text\": \"Hello world\", \"size\": 256}"}'
Pushing to Amazon ECR
Create an ECR repository and push the image:
# Set your AWS region and account ID
AWS_REGION=eu-central-1
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO_NAME=embedding-lambda
# Create the repository (skip if it already exists)
aws ecr create-repository --repository-name $REPO_NAME --region $AWS_REGION
# Authenticate Docker with ECR
aws ecr get-login-password --region $AWS_REGION | \
docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Tag and push
docker tag embedding-lambda:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
Deployment
With the image in ECR, we can create the Lambda function and expose it via a Function URL.
Creating the Lambda function
First, create an IAM role for the Lambda function:
# Create the trust policy
cat > scripts/trust-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "lambda.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create the role
aws iam create-role \
--role-name embedding-lambda-role \
--assume-role-policy-document file://scripts/trust-policy.json
# Attach basic execution policy for CloudWatch logs
aws iam attach-role-policy \
--role-name embedding-lambda-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Now create the Lambda function:
AWS_REGION=eu-central-1
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO_NAME=embedding-lambda
FUNCTION_NAME=embedding-lambda
aws lambda create-function \
--function-name $FUNCTION_NAME \
--package-type Image \
--code ImageUri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest \
--role arn:aws:iam::$AWS_ACCOUNT_ID:role/embedding-lambda-role \
--architectures arm64 \
--memory-size 2048 \
--timeout 30 \
--region $AWS_REGION
Key configuration choices:
--architectures arm64: Matches our ARM64 container build for Graviton
--memory-size 2048: 2GB is plenty for the quantized model; adjust based on your testing
--timeout 30: 30 seconds handles cold starts plus inference comfortably
Creating a Function URL
Function URLs provide a simple HTTPS endpoint without needing API Gateway:
AWS_REGION=eu-central-1
FUNCTION_NAME=embedding-lambda
aws lambda create-function-url-config \
--function-name $FUNCTION_NAME \
--auth-type NONE \
--invoke-mode BUFFERED \
--region $AWS_REGION
# Grant public access to the function URL
aws lambda add-permission \
--function-name $FUNCTION_NAME \
--statement-id FunctionURLAllowPublicAccess \
--action lambda:InvokeFunctionUrl \
--principal "*" \
--function-url-auth-type NONE \
--region $AWS_REGION
This returns a URL like
https://xxxxxxxxxx.lambda-url.eu-central-1.on.aws/
For production, you’ll want to change --auth-type to AWS_IAM and configure proper authentication.
Testing the deployed function
AWS_REGION=eu-central-1
FUNCTION_NAME=embedding-lambda
FUNCTION_URL=$(aws lambda get-function-url-config \
--function-name $FUNCTION_NAME \
--query 'FunctionUrl' \
--output text \
--region $AWS_REGION)
curl -X POST "$FUNCTION_URL" \
-H "Content-Type: application/json" \
-d '{"text": "Rust on Lambda is fast", "size": 256}'
Updating the function
When you update your code, rebuild the image, push to ECR, and update the function:
AWS_REGION=eu-central-1
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO_NAME=embedding-lambda
FUNCTION_NAME=embedding-lambda
# Rebuild and push
docker build -t embedding-lambda .
docker tag embedding-lambda:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
# Update Lambda to use the new image
aws lambda update-function-code \
--function-name $FUNCTION_NAME \
--image-uri $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest \
--region $AWS_REGION
Cold start optimization
Cold starts are the main latency concern for ML workloads in Lambda. A few strategies to minimize them:
Provisioned Concurrency: Keep instances warm for consistent latency
AWS_REGION=eu-central-1
FUNCTION_NAME=embedding-lambda
# Provisioned concurrency requires a published version or alias, not $LATEST
VERSION=$(aws lambda publish-version \
--function-name $FUNCTION_NAME \
--query 'Version' --output text \
--region $AWS_REGION)
aws lambda put-provisioned-concurrency-config \
--function-name $FUNCTION_NAME \
--qualifier $VERSION \
--provisioned-concurrent-executions 2 \
--region $AWS_REGION
Memory tuning: More memory means more CPU, which speeds up model loading. Test different values to find the sweet spot between cost and cold start time.
SnapStart: Currently not available for container images, but worth watching for future support.
Monitoring
CloudWatch metrics to watch:
Duration: Inference time per request
InitDuration: Cold start time (model loading)
ConcurrentExecutions: Scale patterns
Errors: Failed requests
Set up alarms for duration spikes or error rates to catch issues early.
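If the built-in metrics aren't granular enough, one option (a sketch; the namespace and metric names are arbitrary examples) is to emit custom metrics from the handler via the CloudWatch Embedded Metric Format, which Lambda picks up automatically from anything printed to stdout:

use std::time::{SystemTime, UNIX_EPOCH};

fn emit_inference_metric(inference_ms: f64) {
    let timestamp_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0);

    // EMF document: the "_aws" block declares the metric, the top-level keys
    // carry the dimension and metric values.
    let emf = serde_json::json!({
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": "EmbeddingLambda",
                "Dimensions": [["FunctionName"]],
                "Metrics": [{ "Name": "InferenceMs", "Unit": "Milliseconds" }]
            }]
        },
        "FunctionName": "embedding-lambda",
        "InferenceMs": inference_ms
    });

    // One JSON object per log line is all CloudWatch needs.
    println!("{}", emf);
}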
Performance Benchmarks
Now that we have our function deployed, let’s look at real-world performance numbers and cost analysis.
Production Performance Metrics
Running on ARM64 Graviton2 with 2048 MB memory and the Q8 quantized model:
Resource Usage:
Peak Memory: ~1509 MB (well within 2048 MB limit)
Cold Start (Init Duration): ~2.6 seconds (model loading + initialization)
Total Cold Start Latency: ~3-4 seconds (includes Init Duration + execution environment provisioning + network latency)
Warm Inference: ~280-295 ms per request
Model Size: ~300 MB
Note: The 2.6s figure is Lambda’s Init Duration metric. End-to-end latency experienced by users on cold starts will be higher due to execution environment provisioning (200-400ms) and any API Gateway/network overhead.
Why is warm inference ~280ms when EdgeTPU benchmarks show <15ms? The EdgeTPU figure is for raw model inference on 256 tokens with specialized hardware. Our Lambda latency includes the complete pipeline: HTTP request parsing, tokenization, full-context inference on general-purpose CPUs, mean pooling, and normalization. The ~18× difference is expected and reasonable for this architecture.
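If you want to see where those milliseconds go on your own deployment, a tiny timing helper (the name and fields are mine, not part of the code above) can wrap each stage inside embed() and surface the split in CloudWatch Logs:

use std::time::Instant;

/// Runs a closure and logs how long it took as a structured tracing field.
fn timed<T>(stage: &'static str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    tracing::info!(stage, elapsed_ms = start.elapsed().as_millis() as u64, "stage finished");
    result
}

// Usage inside embed(), for example:
//     let (input_ids, attention_mask) = timed("tokenize", || self.tokenize(text))?;
//     let embedding = timed("mean_pooling", || Self::mean_pooling(&output_view, &attention_mask))?;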
Thread Count Tuning
ONNX Runtime’s thread configuration has a significant impact on performance. The optimal setting depends on your model size and quantization.
Configuration Options:
In src/embedder.rs, the thread count is set via:
.with_intra_threads(1)? // Optimal for quantized models under 500MB
Thread Count Trade-offs (tested with Q8 model):
1 thread - Best for Q8/Q4 models (<500MB); fastest (reduces overhead)
2 threads - Best for larger models or long sequences; ~10-20% slower than 1 thread for small models
4 threads - Best for FP32 models (>1GB); may have diminishing returns
For the Q8 quantized model (~300 MB) used in this deployment, single-threaded inference is optimal because:
Thread scheduling overhead exceeds parallelization benefits at this model size
Better cache locality with a single thread
Reduced context switching
Note: These recommendations are based on testing with the Q8 model. If you’re using a different quantization variant, benchmark with different thread counts to find your optimal configuration.
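If you want to benchmark this without rebuilding the image, one approach (a sketch; the ORT_INTRA_THREADS variable name is my own convention, not an ONNX Runtime one) is to make the thread count configurable inside Embedder::new:

// Sketch: drop-in replacement for the session setup in Embedder::new.
let intra_threads: usize = std::env::var("ORT_INTRA_THREADS")
    .ok()
    .and_then(|v| v.parse().ok())
    .unwrap_or(1); // default to the single-threaded configuration used above

let session = Session::builder()?
    .with_optimization_level(ort::session::builder::GraphOptimizationLevel::Level1)?
    .with_intra_threads(intra_threads)?
    .commit_from_file(model_path)?;

Set the variable in the Lambda function configuration, run your load test, and compare Duration percentiles across values.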
Cost Analysis: Lambda vs. Alternatives
Understanding the economics helps you choose the right deployment strategy.
AWS Lambda (ARM64) - Our Implementation:
Cost per 1M requests: ~$4.00
Compute (GB-seconds): 2048 MB × 0.28s × 1M × $0.0000133334/GB-s = $3.80
Request charges: $0.20
Pros: Zero cost when idle, no server management, automatic scaling
Cons: Higher latency than EC2 for sustained high-volume workloads
AWS SageMaker Serverless:
Cost per 1M requests: ~$4.90
Pros: Managed ML infrastructure, easier monitoring
Cons: Slightly more expensive, limited to x86_64 (no Graviton optimization)
AWS EC2 Break-Even Analysis:
t4g.small - $14.01/month; break-even vs. Lambda at > 3.5M requests/month
t4g.medium - $28.03/month; break-even vs. Lambda at > 7.0M requests/month
Cost Recommendations by Scale:
< 3.5M requests/month: Use Lambda (most cost-effective)
3.5M - 10M requests/month: Consider t4g.small EC2
> 10M requests/month: Use t4g.medium EC2 or ECS Fargate with auto-scaling
Lambda is ideal for development, small-to-medium production, or sporadic workloads where you benefit from pay-per-use pricing.
Provider Comparison: Self-Hosted vs. API Services
Lambda charges per request, not per token. As your input size grows, the savings compound dramatically:
500 tokens: Self-hosted saves 60-94%
2,000 tokens: Self-hosted saves 90-97%
4,000 tokens: Self-hosted saves 95-98%
For document embedding (2K-4K tokens per document), self-hosted Lambda can be 10-20× cheaper than API providers.
Strategic Benefits Beyond Cost:
Data Privacy: Text never leaves your AWS VPC—critical for sensitive data
Predictable Pricing: No runaway costs from large prompts or unexpected usage spikes
No Rate Limits: Scale to your Lambda concurrency limit (default 1,000 concurrent executions)
Independence: No dependency on external provider uptime or API changes
When to Use External APIs:
Prototyping and experimentation
Need for state-of-the-art quality (commercial models may have better training data)
Want zero operational overhead
Performance Optimization Impact
The ARM64-specific optimizations we implemented provide significant gains:
Optimization Breakdown:
Vectorized Mean Pooling: ~90ms latency reduction (23% faster)
Naive implementation: ~380ms
Vectorized ndarray: ~290ms
ARM64 Target CPU: Up to 19% performance improvement (Neoverse-N1 optimization)
NEON SIMD Instructions: 2-4× faster vector operations
Single Thread Configuration: Optimal for quantized models (eliminates scheduling overhead)
Speed-Optimized Compilation: 10-15% faster than size-optimized builds
Combined, these optimizations deliver 30-50% performance improvement over a baseline implementation without ARM-specific tuning.
Key Metrics to Track:
Cold Start (InitDuration): Should be 2-4 seconds
Warm Latency (Duration): Target 250-350ms for Q8 model
Memory Utilization: Should stay under 1600 MB (80% of limit)
Error Rate: Monitor for timeouts or memory exhaustion
Conclusion
We’ve built a complete embedding service: Rust for performance, ONNX Runtime for portable inference, and Lambda for serverless scale. The combination delivers consistent ~280ms inference latency with minimal operational overhead and exceptional cost efficiency.
Key takeaways:
Rust on Lambda is production-ready. The official support means you get SLAs, documentation, and long-term stability.
Small models unlock serverless ML. EmbeddingGemma’s efficient design fits comfortably within Lambda’s constraints (~1.5 GB memory, 2.6s cold starts).
Matryoshka embeddings give you flexibility. Choose your dimension based on quality vs. storage tradeoffs without retraining.
Containers simplify deployment. Multi-stage builds keep images small while including everything you need.
ARM64 optimizations matter. Target-specific compilation and SIMD instructions deliver 30-50% performance gains.
Economics favor self-hosting. At $4/M requests, Lambda is 2-16× cheaper than commercial embedding APIs, especially for longer documents.
From here, you could extend this with batch processing, add caching for repeated queries, or integrate with a vector database for semantic search. The foundation is solid.
The code from this article is available on GitHub: https://github.com/alexsobolev/rust-embedding-lambda
Questions or improvements? Open an issue or reach out. Happy embedding.


