Building an Embedding API with Rust, Arm, and EmbeddingGemma on AWS Lambda
Step-by-step guide covering model selection, containerization, ARM64 optimization, and production benchmarks
Introduction
On November 14, 2025, AWS announced official support for Rust in Lambda: full SLA, full AWS Support, production-ready. https://aws.amazon.com/about-aws/whats-new/2025/11/aws-lambda-rust/
This is a big deal. Rust on Lambda means blazing-fast cold starts, a minimal memory footprint, and compile-time safety: all things that matter when you're paying per millisecond and per megabyte. For performance-critical serverless workloads, it's hard to find a better fit.
So let’s put it to the test.
In this article, we’ll build a REST API that takes text and returns embeddings. If you’re not familiar with embeddings, they’re vector representations of text that capture semantic meaning. They power things like semantic search, recommendations, and RAG pipelines. Instead of calling an external service like AWS Bedrock or OpenAI, we’ll run Embedding Gemma directly inside Lambda.
Why go through the trouble? Cost and latency. External embedding APIs charge per token, and every API call adds network overhead. Running inference locally in Lambda gives you predictable pricing and faster response times, especially for high-volume workloads.
Along the way, we’ll work within Lambda’s constraints (10GB memory, 10GB container images, 15-minute timeout) and see how Rust helps us maximize performance within these limits.
If you’re curious about Rust, interested in serverless ML, or want to see what’s possible now that Rust is officially supported, let’s explore together.
Understanding the Constraints
Before writing any code, let’s map out what we’re working with.
Lambda limits:
Memory: Up to 10GB
Storage: 512MB in /tmp (or up to 10GB with ephemeral storage configured)
Package size: 250MB zipped for direct upload, or up to 10GB with container images
Timeout: 15 minutes max
CPU: Scales proportionally with memory
For ML workloads, memory and package size are usually the bottlenecks. Large models don’t fit, and if they do, cold starts can be brutal.
Embedding Gemma specs:
EmbeddingGemma is designed for on-device inference, optimized for exactly the kind of constrained environment we’re dealing with.
Parameters: ~308 million (100M model parameters + 200M embedding parameters)
RAM with quantization: Sub-200MB
Output dimensions: 768 (or 128/256/512 using Matryoshka truncation)
Context window: 2K tokens
Inference time: <15ms on EdgeTPU, <22ms on mobile (benchmarked at 256 tokens; longer sequences scale proportionally)
The model is built on the Gemma 3 architecture and trained on 100+ languages. Google explicitly designed it for phones, laptops, and tablets. Lambda's 10GB memory ceiling is more than enough.
Project Setup
First, install cargo-lambda. It’s a Cargo subcommand that simplifies building, testing, and deploying Rust Lambda functions.
cargo install cargo-lambda
Create a new project:
cargo lambda new embedding-lambda
cd embedding-lambda
When prompted, select “HTTP function” since we’re building a REST API.
Dependencies
Open Cargo.toml and add the following:
[package]
name = "embedding-lambda"
version = "0.1.0"
edition = "2021" # Using 2021 for broader ecosystem compatibility; Rust 2024 is available but less widely supported
[dependencies]
lambda_http = "1.0" # Using semver flexibility for automatic patch updates
tokio = { version = "1.48.0", features = ["macros"] }
serde = { version = "1.0.228", features = ["derive"] }
serde_json = "1.0.145"
ndarray = "0.17.1"
tokenizers = "0.22.2"
tracing = "0.1.43"
tracing-subscriber = { version = "0.3.22", features = ["env-filter"] }
thiserror = "2.0"
# Platform-specific ONNX Runtime configuration
[target.'cfg(target_os = "macos")'.dependencies]
ort = { version = "2.0.0-rc.10", default-features = false, features = [
"ndarray",
"std",
"download-binaries",
] }
[target.'cfg(target_os = "linux")'.dependencies]
ort = { version = "2.0.0-rc.10", default-features = false, features = [
"ndarray",
"std",
"load-dynamic",
] }
[profile.release]
opt-level = 3 # Optimize for speed (ARM64 benefits more from speed optimizations)
lto = "fat" # Full link-time optimization across all crates
codegen-units = 1 # Better optimization, slower compile
strip = true # Strip symbols
panic = "abort" # Abort on panic for FFI safety with ONNX Runtime
# Alternative profile optimized for smaller binary size (faster cold starts)
[profile.release-size]
inherits = "release"
opt-level = "z" # Optimize for size
lto = true
Key dependencies:
lambda_http: The official AWS Lambda HTTP runtime for Rust
ort: Rust bindings for ONNX Runtime with platform-specific loading strategies
ndarray: NumPy-like array operations for tensor handling
tokenizers: Hugging Face's tokenizer library with Rust bindings
serde/serde_json: For request/response serialization
Platform-specific ONNX Runtime loading
The ort crate is configured differently per platform:
macOS (download-binaries): Automatically downloads ONNX Runtime during compilation. Convenient for local development.
Linux (load-dynamic): Loads libonnxruntime.so at runtime via the ORT_DYLIB_PATH environment variable. Required for Lambda deployment, where we control the runtime environment.
Release profile optimizations
The [profile.release] section configures the compiler for optimal ARM64 performance:
opt-level = 3: Optimize for speed. ARM64 Graviton2 processors deliver up to 19% better performance and 34% better price-performance compared to x86 for compute-intensive workloads
lto = "fat": Full link-time optimization across all crates for maximum performance
codegen-units = 1: A single codegen unit enables better whole-program optimization
strip = true: Removes symbol information from the final binary
panic = "abort": Abort on panic instead of unwinding, which is safer for FFI with ONNX Runtime
For situations where binary size matters more than performance (e.g., optimizing cold starts), use the release-size profile which prioritizes size optimization with opt-level = "z".
These settings increase compile time but maximize runtime performance.
Project structure
embedding-lambda/
├── Cargo.toml
├── .cargo/
│ └── config.toml
├── src/
│ ├── main.rs
│ ├── embedder.rs
│ ├── error.rs
│ └── http_handler.rs
└── model/
├── model_quantized.onnx
├── model_quantized.onnx_data
└── tokenizer.json
We'll download the ONNX model from onnx-community/embeddinggemma-300m-ONNX on Hugging Face: https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX
The model is available in fp32, q8, and q4 variants.
For Lambda, Q8 offers the best balance: nearly full quality at roughly a quarter of the fp32 size. Q4 doesn't achieve a 4× size reduction due to metadata overhead, and its quality tradeoff is more noticeable.
Implementing the Embedding Logic
Let’s build the core embedding functionality. We need to:
Load the tokenizer
Load the ONNX model
Tokenize input text
Run inference
Apply mean pooling to get the final embedding
Truncate to the requested dimension (Matryoshka)
Matryoshka embeddings
EmbeddingGemma was trained using Matryoshka Representation Learning (MRL). Named after Russian nesting dolls, this technique produces embeddings where the first N dimensions form a valid, meaningful embedding on their own.
In practice, this means you can truncate the full 768-dimensional vector to 512, 256, or 128 dimensions without retraining or losing semantic quality. Smaller embeddings mean:
Less storage space in your vector database
Faster similarity calculations
Lower memory usage
The tradeoff is minor: smaller dimensions capture slightly less nuance, but for most use cases the difference is negligible.
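To make that concrete, here's a small standalone Rust sketch (the helper names are mine, not part of the service code) that truncates a full-length vector and re-normalizes it so the dot product can again be read as cosine similarity:

/// Standalone sketch: truncate a Matryoshka embedding and re-normalize it.
fn truncate_and_normalize(full: &[f32], dims: usize) -> Vec<f32> {
    // Keep only the first `dims` components (e.g., 256 of 768)...
    let mut v: Vec<f32> = full.iter().take(dims).copied().collect();
    // ...then re-normalize so the dot product equals cosine similarity again.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
    v
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Two toy "768-dimensional" vectors stand in for real model output.
    let a: Vec<f32> = (0..768).map(|i| (i as f32).sin()).collect();
    let b: Vec<f32> = (0..768).map(|i| (i as f32).cos()).collect();

    let a_256 = truncate_and_normalize(&a, 256);
    let b_256 = truncate_and_normalize(&b, 256);

    // On unit-length vectors, the dot product is the cosine similarity.
    println!("cosine(a, b) at 256 dims = {:.4}", dot(&a_256, &b_256));
}

This is exactly what steps 6 and 7 of the embed() method below do.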
The embedding module
Create src/embedder.rs:
use crate::error::EmbedError;
use ort::{session::Session, value::Value};
use tokenizers::Tokenizer;

/// Valid embedding dimensions for Matryoshka truncation
pub const VALID_DIMENSIONS: [usize; 4] = [768, 512, 256, 128];

/// Maximum sequence length in tokens.
/// Prevents excessive memory usage and processing time.
const MAX_SEQUENCE_LENGTH: usize = 8192;

/// Handles text embedding using ONNX Runtime.
///
/// The Embedder loads an ONNX model and tokenizer, then provides
/// a simple interface to convert text into vector embeddings.
pub struct Embedder {
    session: Session,
    tokenizer: Tokenizer,
}

impl Embedder {
    /// Creates a new Embedder instance.
    ///
    /// # Arguments
    /// * `model_path` - Path to the ONNX model file (e.g., "model/model_quantized.onnx")
    /// * `tokenizer_path` - Path to the tokenizer JSON file (e.g., "model/tokenizer.json")
    ///
    /// # Note
    /// The ONNX model uses external data storage. Both `model_quantized.onnx` and
    /// `model_quantized.onnx_data` must be present in the same directory.
    /// ONNX Runtime automatically loads the external data file.
    pub fn new(model_path: &str, tokenizer_path: &str) -> Result<Self, EmbedError> {
        // Initialize the ONNX Runtime session with optimization level Basic (Level 1).
        // This enables standard graph optimizations for better performance on ARM64.
        let session = Session::builder()?
            .with_optimization_level(ort::session::builder::GraphOptimizationLevel::Level1)?
            .with_intra_threads(1)? // Optimal for quantized models under 500MB: a single thread reduces overhead
            .commit_from_file(model_path)?;

        // Load the Hugging Face tokenizer from JSON
        let tokenizer = Tokenizer::from_file(tokenizer_path)
            .map_err(|e| EmbedError::TokenizerLoad {
                path: tokenizer_path.to_string(),
                reason: e.to_string(),
            })?;

        Ok(Self { session, tokenizer })
    }

    /// Tokenizes input text with the document prompt format.
    ///
    /// EmbeddingGemma expects a specific prompt template:
    /// "title: none | text: {text}"
    fn tokenize(&self, text: &str) -> Result<(Vec<i64>, Vec<i64>), EmbedError> {
        // Apply the prompt template
        let formatted = format!("title: none | text: {}", text);

        // Tokenize with special tokens (e.g., [CLS], [SEP])
        let encoding = self
            .tokenizer
            .encode(formatted, true)
            .map_err(|e| EmbedError::Tokenization(e.to_string()))?;

        // Convert to i64 as required by ONNX Runtime
        let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&id| id as i64).collect();
        let attention_mask: Vec<i64> = encoding
            .get_attention_mask()
            .iter()
            .map(|&m| m as i64)
            .collect();

        Ok((input_ids, attention_mask))
    }

    /// Generates an embedding vector for the given text.
    ///
    /// # Arguments
    /// * `text` - The input text to embed
    /// * `size` - Output dimension: 768, 512, 256, or 128 (Matryoshka truncation)
    ///
    /// # Returns
    /// A normalized embedding vector of the requested dimension
    pub fn embed(&mut self, text: &str, size: usize) -> Result<Vec<f32>, EmbedError> {
        // Validate the requested dimension
        if !VALID_DIMENSIONS.contains(&size) {
            return Err(EmbedError::InvalidDimension {
                size,
                valid: VALID_DIMENSIONS.to_vec(),
            });
        }

        // Step 1: Tokenize the input
        let (input_ids, attention_mask) = self.tokenize(text)?;
        let seq_len = input_ids.len();

        // Validate sequence length
        if seq_len > MAX_SEQUENCE_LENGTH {
            return Err(EmbedError::SequenceTooLong {
                got: seq_len,
                max: MAX_SEQUENCE_LENGTH,
            });
        }

        // Step 2: Prepare inputs as 2D tensors with shape [batch_size=1, seq_len]
        let shape = vec![1, seq_len];

        // Step 3: Run inference
        let outputs = self.session.run(ort::inputs![
            "input_ids" => Value::from_array((shape.clone(), input_ids))?,
            "attention_mask" => Value::from_array((shape, attention_mask.clone()))?,
        ])?;

        // Step 4: Extract the output tensor.
        // The model outputs last_hidden_state with shape [batch_size, seq_len, hidden_dim].
        let (output_shape, output_data) = outputs[0].try_extract_tensor::<f32>()?;
        let batch_size = output_shape[0] as usize;
        let seq_len_out = output_shape[1] as usize;
        let hidden_dim = output_shape[2] as usize;

        // Convert to ArrayView3 for mean_pooling
        let output_view =
            ndarray::ArrayView3::from_shape((batch_size, seq_len_out, hidden_dim), output_data)?;

        // Step 5: Apply mean pooling over token embeddings
        let embedding = Self::mean_pooling(&output_view, &attention_mask)?;

        // Step 6: Truncate to requested dimension (Matryoshka)
        let truncated: Vec<f32> = embedding.into_iter().take(size).collect();

        // Step 7: L2 normalize the final embedding.
        // Re-normalization after truncation is important for correct similarity scores.
        let normalized = Self::normalize(&truncated);

        Ok(normalized)
    }

    /// Applies mean pooling to token embeddings.
    ///
    /// Mean pooling averages the embeddings of all non-padding tokens.
    /// The attention mask is used to exclude padding tokens from the average.
    /// Uses vectorized ndarray operations for optimal performance.
    fn mean_pooling(
        hidden_states: &ndarray::ArrayView3<f32>,
        attention_mask: &[i64],
    ) -> Result<Vec<f32>, EmbedError> {
        use ndarray::Axis;

        // hidden_states: [batch=1, seq_len, hidden_dim]
        // Remove batch dimension: [seq_len, hidden_dim]
        let states_2d = hidden_states.index_axis(Axis(0), 0);

        // Convert mask to f32 and create array
        let mask_f32: Vec<f32> = attention_mask.iter().map(|&x| x as f32).collect();
        let mask_1d = ndarray::Array1::from(mask_f32);

        // Count non-padding tokens (do this before consuming mask_1d)
        let count = mask_1d.sum();

        // Reshape to [seq_len, 1] for broadcasting
        let mask_col = mask_1d.insert_axis(Axis(1)); // Shape: [seq_len, 1]

        // Broadcast multiply: each token embedding is scaled by its mask value.
        // This zeros out padding tokens.
        let masked_states = &states_2d * &mask_col;

        // Sum along sequence axis: [seq_len, hidden_dim] -> [hidden_dim]
        let sum = masked_states.sum_axis(Axis(0));

        // Compute mean (avoid division by zero)
        let mean = if count > 0.0 { sum / count } else { sum };

        Ok(mean.to_vec())
    }

    /// Applies L2 normalization to the embedding vector.
    ///
    /// Normalized embeddings allow using dot product instead of cosine similarity,
    /// which is computationally cheaper for similarity searches.
    fn normalize(embedding: &[f32]) -> Vec<f32> {
        let norm: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            embedding.iter().map(|x| x / norm).collect()
        } else {
            embedding.to_vec()
        }
    }
}
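If you want a quick sanity check of the pooling and normalization math, a test module along these lines can be appended to src/embedder.rs (a sketch; the test cases and tolerances are my own, not from the project):

#[cfg(test)]
mod tests {
    use super::*;
    use ndarray::Array3;

    #[test]
    fn normalize_produces_unit_length() {
        let normalized = Embedder::normalize(&[3.0, 4.0]);
        let norm: f32 = normalized.iter().map(|x| x * x).sum::<f32>().sqrt();
        // A non-zero vector should come out with length 1.
        assert!((norm - 1.0).abs() < 1e-6);
    }

    #[test]
    fn mean_pooling_ignores_padding_tokens() {
        // Two real tokens and one padding token, hidden_dim = 2.
        let hidden = Array3::from_shape_vec(
            (1, 3, 2),
            vec![1.0, 2.0, 3.0, 4.0, 100.0, 100.0],
        )
        .unwrap();
        let mask = [1i64, 1, 0]; // last token is padding

        let pooled = Embedder::mean_pooling(&hidden.view(), &mask).unwrap();
        // Average of [1, 2] and [3, 4] only; the padded [100, 100] is masked out.
        assert_eq!(pooled, vec![2.0, 3.0]);
    }
}

Run it with cargo test.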
Production-Grade Error Handling
Before building the HTTP handler, let’s implement proper error handling. Production code needs structured errors that are type-safe, informative, and secure.
Create src/error.rs:
use thiserror::Error;

/// Errors that can occur during embedding generation
#[derive(Error, Debug)]
pub enum EmbedError {
    /// Invalid embedding dimension requested
    #[error("Invalid embedding size: {size}. Must be one of: {valid:?}")]
    InvalidDimension { size: usize, valid: Vec<usize> },

    /// Input sequence is too long
    #[error("Tokenized sequence exceeds maximum length of {max} tokens (got {got})")]
    SequenceTooLong { got: usize, max: usize },

    /// Empty text input
    #[error("Text input cannot be empty")]
    EmptyInput,

    /// Text input exceeds maximum character limit
    #[error("Text exceeds maximum length of {max} characters (got {got})")]
    TextTooLong { got: usize, max: usize },

    /// Failed to load tokenizer
    #[error("Failed to load tokenizer from {path}: {reason}")]
    TokenizerLoad { path: String, reason: String },

    /// Tokenization failed
    #[error("Tokenization failed: {0}")]
    Tokenization(String),

    /// ONNX Runtime error
    #[error("ONNX Runtime error: {0}")]
    OnnxRuntime(#[from] ort::Error),

    /// Array shape mismatch
    #[error("Array shape error: {0}")]
    ArrayShape(#[from] ndarray::ShapeError),

    /// Mutex poisoned (concurrent access error)
    #[error("Internal error: shared resource poisoned")]
    MutexPoisoned,

    /// Internal server error (catch-all)
    #[error("Internal server error: {0}")]
    Internal(String),
}

impl EmbedError {
    /// Returns true if this error should be reported as a client error (4xx)
    pub fn is_client_error(&self) -> bool {
        matches!(
            self,
            EmbedError::InvalidDimension { .. }
                | EmbedError::SequenceTooLong { .. }
                | EmbedError::EmptyInput
                | EmbedError::TextTooLong { .. }
        )
    }

    /// Get the HTTP status code for this error
    pub fn status_code(&self) -> u16 {
        if self.is_client_error() {
            400
        } else {
            500
        }
    }

    /// Get a user-friendly error message (sanitized for production)
    pub fn user_message(&self) -> String {
        match self {
            // Client errors - show detailed message
            EmbedError::InvalidDimension { size, valid } => {
                format!("Invalid embedding size: {}. Must be one of: {:?}", size, valid)
            }
            EmbedError::SequenceTooLong { got, max } => {
                format!("Text is too long: {} tokens (max: {})", got, max)
            }
            EmbedError::EmptyInput => "Text input cannot be empty".to_string(),
            EmbedError::TextTooLong { got, max } => {
                format!("Text is too long: {} characters (max: {})", got, max)
            }
            EmbedError::Tokenization(msg) => {
                format!("Failed to process text: {}", msg)
            }
            // Server errors - generic message in production
            _ => {
                if cfg!(debug_assertions) {
                    // Development: show full error
                    self.to_string()
                } else {
                    // Production: generic message
                    "An internal error occurred while processing your request".to_string()
                }
            }
        }
    }
}
| impl EmbedError { | |
| /// Returns true if this error should be reported as a client error (4xx) | |
| pub fn is_client_error(&self) -> bool { | |
| matches!( | |
| self, | |
| EmbedError::InvalidDimension { .. } | |
| | EmbedError::SequenceTooLong { .. } | |
| | EmbedError::EmptyInput | |
| | EmbedError::TextTooLong { .. } | |
| ) | |
| } | |
| /// Get the HTTP status code for this error | |
| pub fn status_code(&self) -> u16 { | |
| if self.is_client_error() { | |
| 400 | |
| } else { | |
| 500 | |
| } | |
| } | |
| /// Get a user-friendly error message (sanitized for production) | |
| pub fn user_message(&self) -> String { | |
| match self { | |
| // Client errors - show detailed message | |
| EmbedError::InvalidDimension { size, valid } => { | |
| format!("Invalid embedding size: {}. Must be one of: {:?}", size, valid) | |
| } | |
| EmbedError::SequenceTooLong { got, max } => { | |
| format!("Text is too long: {} tokens (max: {})", got, max) | |
| } | |
| EmbedError::EmptyInput => "Text input cannot be empty".to_string(), | |
| EmbedError::TextTooLong { got, max } => { | |
| format!("Text is too long: {} characters (max: {})", got, max) | |
| } | |
| EmbedError::Tokenization(msg) => { | |
| format!("Failed to process text: {}", msg) | |
| } | |
| // Server errors - generic message in production | |
| _ => { | |
| if cfg!(debug_assertions) { | |
| // Development: show full error | |
| self.to_string() | |
| } else { | |
| // Production: generic message | |
| "An internal error occurred while processing your request".to_string() | |
| } | |
| } | |
| } | |
| } | |
| } |
Key benefits:
Type-safe errors: Each error variant carries relevant context
User-friendly messages: user_message() provides sanitized output for API responses
Security: Internal errors don't leak implementation details in production
HTTP mapping: status_code() maps errors to appropriate HTTP codes
Automatic trait implementations: thiserror generates the Display and Error traits
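As a quick illustration of the mapping (a sketch with made-up values; append it to src/error.rs if you want to keep it):

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn invalid_dimension_maps_to_400_with_details() {
        let err = EmbedError::InvalidDimension {
            size: 300,
            valid: vec![768, 512, 256, 128],
        };
        assert!(err.is_client_error());
        assert_eq!(err.status_code(), 400);
        // Client errors keep their detailed message.
        assert!(err.user_message().contains("300"));
    }

    #[test]
    fn internal_errors_map_to_500() {
        let err = EmbedError::Internal("onnx session crashed".to_string());
        assert!(!err.is_client_error());
        assert_eq!(err.status_code(), 500);
    }
}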
Building the Lambda Handler
Now let’s wire up the embedding logic to a Lambda HTTP endpoint. We’ll split the code into two files for better organization: the handler logic and the main entry point.
The HTTP handler
Create src/http_handler.rs:
use crate::embedder::{Embedder, VALID_DIMENSIONS};
use crate::error::EmbedError;
use lambda_http::{Body, Error, Request, Response};
use serde::{Deserialize, Serialize};
use std::sync::{Arc, Mutex};
use tracing::{error, info, warn};

/// Maximum input text length in characters.
/// Prevents OOM from extremely long inputs.
const MAX_TEXT_LENGTH: usize = 100_000;

/// Incoming request payload
#[derive(Deserialize)]
struct EmbedRequest {
    /// The text to embed
    text: String,
    /// Output dimension: 768, 512, 256, or 128 (default: 768)
    #[serde(default = "default_size")]
    size: usize,
}

fn default_size() -> usize {
    768
}

/// Response payload containing the embedding vector
#[derive(Serialize)]
struct EmbedResponse {
    /// The embedding vector
    embedding: Vec<f32>,
    /// Dimension of the embedding
    size: usize,
}

/// Error response payload
#[derive(Serialize)]
struct ErrorResponse {
    error: String,
}

/// Lambda handler function.
///
/// Receives an HTTP request with JSON body, generates an embedding,
/// and returns it as a JSON response.
pub async fn function_handler(
    embedder: Arc<Mutex<Embedder>>,
    event: Request,
) -> Result<Response<Body>, Error> {
    // Parse the JSON request body
    let body = event.body();
    let request: EmbedRequest = match serde_json::from_slice(body) {
        Ok(req) => req,
        Err(e) => {
            return Ok(error_response(400, &format!("Invalid JSON: {}", e)));
        }
    };

    // Validate the size parameter
    if !VALID_DIMENSIONS.contains(&request.size) {
        return Ok(error_response(
            400,
            &format!(
                "Invalid size: {}. Must be one of: {:?}",
                request.size, VALID_DIMENSIONS
            ),
        ));
    }

    // Validate text is not empty
    if request.text.is_empty() {
        let err = EmbedError::EmptyInput;
        warn!("Empty text input");
        return Ok(error_from_embed_error(&err));
    }

    // Validate text length to prevent OOM
    if request.text.len() > MAX_TEXT_LENGTH {
        let err = EmbedError::TextTooLong {
            got: request.text.len(),
            max: MAX_TEXT_LENGTH,
        };
        warn!("Text too long: {} chars", request.text.len());
        return Ok(error_from_embed_error(&err));
    }

    // Generate the embedding.
    // Mutex required: ONNX Runtime Rust bindings need &mut for session.run().
    // Lambda processes one request at a time per container, so no contention.
    let embedding = {
        // Safe mutex handling - recover from poisoned state
        let mut embedder = match embedder.lock() {
            Ok(guard) => guard,
            Err(poisoned) => {
                warn!("Mutex was poisoned, recovering...");
                poisoned.into_inner()
            }
        };

        match embedder.embed(&request.text, request.size) {
            Ok(emb) => {
                info!(
                    text_len = request.text.len(),
                    embedding_size = request.size,
                    "Embedding generated successfully"
                );
                emb
            }
            Err(e) => {
                error!("Embedding generation failed: {}", e);
                return Ok(error_from_embed_error(&e));
            }
        }
    };

    // Build the JSON response
    let response = EmbedResponse {
        size: embedding.len(),
        embedding,
    };

    let response_json = serde_json::to_string(&response)?;

    let resp = Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(response_json.into())
        .map_err(|e| Box::new(e) as Box<dyn std::error::Error + Send + Sync>)?;

    Ok(resp)
}

/// Helper function to create error responses from EmbedError
fn error_from_embed_error(err: &EmbedError) -> Response<Body> {
    let status = err.status_code();
    let message = err.user_message();
    error_response(status, &message)
}

/// Helper function to create error responses
fn error_response(status: u16, message: &str) -> Response<Body> {
    let body = serde_json::to_string(&ErrorResponse {
        error: message.to_string(),
    })
    .unwrap_or_else(|_| r#"{"error":"Unknown error"}"#.to_string());

    // Safe response building with fallback
    Response::builder()
        .status(status)
        .header("content-type", "application/json")
        .body(body.into())
        .unwrap_or_else(|e| {
            error!("Failed to build error response: {}", e);
            // Absolute fallback
            Response::builder()
                .status(500)
                .body(Body::from(r#"{"error":"Internal server error"}"#))
                .unwrap()
        })
}
The main entry point
Update src/main.rs:
pub mod embedder;
pub mod error;
pub mod http_handler;

use embedder::Embedder;
use http_handler::function_handler;
use lambda_http::{run, service_fn, tracing, Error};
use std::sync::{Arc, Mutex};

#[tokio::main]
async fn main() -> Result<(), Error> {
    // Initialize the ONNX Runtime global environment and keep it in memory for the program lifetime.
    // This must be called before creating any sessions to register the DefaultLogger.
    ort::init().with_name("embedding-lambda").commit()?;

    // Initialize tracing for CloudWatch logs
    tracing::init_default_subscriber();

    // Initialize the Embedder once during cold start.
    // This loads the ONNX model and tokenizer into memory.
    let embedder = Embedder::new("model/model_quantized.onnx", "model/tokenizer.json")
        .map_err(|e| {
            tracing::error!("Failed to initialize embedder: {}", e);
            Box::new(e) as Box<dyn std::error::Error + Send + Sync>
        })?;

    // Wrap in Arc<Mutex> to share across handler invocations.
    // Mutex required: ONNX Runtime Rust bindings need &mut for session.run().
    let embedder = Arc::new(Mutex::new(embedder));

    // Start the Lambda runtime.
    // Each incoming request will clone the Arc and call function_handler.
    run(service_fn(move |event| {
        let embedder = embedder.clone();
        function_handler(embedder, event)
    }))
    .await
}
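One optional tweak worth sketching (an assumption on my part, not something from the repository): run a throwaway embedding during initialization so ONNX Runtime's one-time allocations and the first graph execution happen in the init phase rather than inside the first real request. It pairs well with provisioned concurrency, where initialization runs ahead of traffic. The relevant part of main() would change roughly like this:

// Sketch: optional warm-up, replacing the Embedder::new block above.
// The binding must be `mut` because embed() takes &mut self.
let mut embedder = Embedder::new("model/model_quantized.onnx", "model/tokenizer.json")
    .map_err(|e| {
        tracing::error!("Failed to initialize embedder: {}", e);
        Box::new(e) as Box<dyn std::error::Error + Send + Sync>
    })?;

// A throwaway call forces ONNX Runtime to allocate its buffers and run the
// graph once while we're still in the init phase.
if let Err(e) = embedder.embed("warm-up", 128) {
    tracing::warn!("Warm-up inference failed: {}", e);
}

let embedder = Arc::new(Mutex::new(embedder));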
API usage
The endpoint accepts a JSON payload with two fields:
{
"text": "Your text to embed",
"size": 256
}
The size parameter is optional and defaults to 768. Valid values are:
768 - Full precision, best quality
512 - Good balance of quality and efficiency
256 - Faster similarity search, slightly reduced quality
128 - Smallest, fastest, good for large-scale filtering
Testing locally
Before deploying, test the function locally using cargo-lambda:
cargo lambda watch
In another terminal, send test requests with different sizes:
# Full 768-dimensional embedding (default)
curl -X POST http://localhost:9000/ \
-H "Content-Type: application/json" \
-d '{"text": "What is the capital of France?"}'
# Compact 256-dimensional embedding
curl -X POST http://localhost:9000/ \
-H "Content-Type: application/json" \
-d '{"text": "What is the capital of France?", "size": 256}'
Containerizing for Lambda
Our project includes ONNX model files that exceed Lambda’s 250MB zip deployment limit. Container images solve this problem, supporting up to 10GB.
Image size breakdown
The final image size depends on:
Rust binary: ~5-15MB (with size optimizations)
ONNX Runtime library: ~50-100MB
Model files: ~300-600MB (depending on quantization)
Base image: ~100MB
Total: typically under 1GB, well within Lambda’s 10GB limit.
Why containers?
Beyond the size limit, containers give us:
Full control over the runtime environment
Ability to include native libraries (ONNX Runtime)
Consistent builds across development and production
ARM64 support for AWS Graviton processors (better price/performance)
Why ARM64?
AWS Graviton processors offer up to 34% better price/performance compared to x86 for Lambda. Since we’re building from scratch, there’s no reason not to target ARM64.
Project structure
Before we create the Dockerfile, make sure your project looks like this:
embedding-lambda/
├── Cargo.toml
├── Cargo.lock
├── Dockerfile
├── .cargo/
│ └── config.toml
├── src/
│ ├── main.rs
│ ├── embedder.rs
│ ├── error.rs
│ └── http_handler.rs
└── model/
├── model_quantized.onnx
├── model_quantized.onnx_data
└── tokenizer.json
The Dockerfile
Create a Dockerfile in the project root:
# Stage 1: Build the Rust binary for ARM64
FROM --platform=linux/arm64 rust:1.92-slim-bookworm AS builder
# Install build dependencies including lld linker for faster builds
RUN apt-get update && apt-get install -y \
pkg-config \
libssl-dev \
build-essential \
lld \
&& rm -rf /var/lib/apt/lists/*
# Set ARM64 Graviton2-specific compiler optimizations
ENV RUSTFLAGS="-C target-cpu=neoverse-n1 -C target-feature=+neon"
WORKDIR /app
# Copy cargo config for ARM64 optimizations
COPY .cargo .cargo
# Copy manifests first for better layer caching
COPY Cargo.toml Cargo.lock ./
# Create a dummy main.rs to build dependencies
RUN mkdir src && \
echo "fn main() {}" > src/main.rs && \
cargo build --release && \
rm -rf src
# Copy actual source code
COPY src ./src
# Build the real application
# Touch main.rs to ensure it rebuilds
RUN touch src/main.rs && cargo build --release
# Stage 2: Download ONNX Runtime for ARM64
FROM --platform=linux/arm64 debian:bookworm-slim AS onnx-downloader
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
# Download ONNX Runtime for Linux ARM64
ARG ORT_VERSION=1.22.0
RUN curl -L https://github.com/microsoft/onnxruntime/releases/download/v${ORT_VERSION}/onnxruntime-linux-aarch64-${ORT_VERSION}.tgz \
| tar xz -C /opt
# Stage 3: Create the Lambda runtime image for ARM64
FROM --platform=linux/arm64 public.ecr.aws/lambda/provided:al2023-arm64
# Copy ONNX Runtime library
COPY --from=onnx-downloader /opt/onnxruntime-linux-aarch64-*/lib/libonnxruntime.so* /opt/onnxruntime/lib/
# Set the library path for dynamic loading
ENV ORT_DYLIB_PATH=/opt/onnxruntime/lib/libonnxruntime.so
# Copy the compiled binary
COPY --from=builder /app/target/release/embedding-lambda ${LAMBDA_RUNTIME_DIR}/bootstrap
# Copy model files
COPY model/ ${LAMBDA_TASK_ROOT}/model/
# Set the entrypoint
CMD ["bootstrap"]
Understanding the Dockerfile
The build uses three stages:
Builder stage: Compiles the Rust code for ARM64 with Graviton2-specific optimizations. Key enhancements include:
lld linker: Faster linking times during builds
RUSTFLAGS: Targets AWS Graviton2’s Neoverse-N1 CPU architecture with NEON SIMD instructions
.cargo directory: Contains additional ARM64 optimization configuration
The release profile optimizations from Cargo.toml are applied automatically
These optimizations provide 10-15% performance improvement for ARM64 workloads.
ONNX downloader stage: Downloads the official ONNX Runtime release for Linux ARM64 from GitHub. This gives us libonnxruntime.so, which the ort crate loads dynamically.
Runtime stage: Combines everything into the final Lambda image. The ORT_DYLIB_PATH environment variable tells the ort crate where to find the ONNX Runtime library.
ARM64 Optimization Configuration
For optimal ARM64 performance on AWS Graviton processors, create a .cargo/config.toml file in your project root:
[build]
rustflags = ["-C", "target-cpu=neoverse-n1"]
[target.aarch64-unknown-linux-gnu]
linker = "aarch64-linux-gnu-gcc"
rustflags = [
"-C", "target-cpu=neoverse-n1",
"-C", "target-feature=+neon",
]
This configuration:
Targets the Neoverse-N1 CPU architecture used in AWS Graviton2 processors
Enables NEON SIMD instructions for vectorized operations (2-3x faster mean pooling)
Works in conjunction with the RUSTFLAGS in the Dockerfile
The vectorized mean pooling implementation in embedder.rs automatically leverages these SIMD instructions, providing significant performance gains for embedding generation.
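To see the effect of vectorization on your own hardware, here's a rough standalone micro-benchmark (a sketch; the function names, shapes, and iteration counts are arbitrary) comparing a naive pooling loop with the ndarray formulation used in embedder.rs:

use ndarray::{Array2, Axis};
use std::time::Instant;

fn naive_mean_pool(states: &Array2<f32>, mask: &[f32]) -> Vec<f32> {
    // Element-by-element loop: no SIMD-friendly structure for the compiler.
    let (seq_len, hidden) = states.dim();
    let mut out = vec![0.0f32; hidden];
    let mut count = 0.0f32;
    for t in 0..seq_len {
        if mask[t] > 0.0 {
            count += 1.0;
            for h in 0..hidden {
                out[h] += states[[t, h]];
            }
        }
    }
    if count > 0.0 {
        out.iter_mut().for_each(|x| *x /= count);
    }
    out
}

fn vectorized_mean_pool(states: &Array2<f32>, mask: &[f32]) -> Vec<f32> {
    // Same broadcasting trick as mean_pooling() in embedder.rs.
    let mask_col = ndarray::Array1::from(mask.to_vec()).insert_axis(Axis(1));
    let count = mask.iter().sum::<f32>();
    let sum = (states * &mask_col).sum_axis(Axis(0));
    (if count > 0.0 { sum / count } else { sum }).to_vec()
}

fn main() {
    // 2,048 tokens x 768 hidden dims, roughly the worst case for this model.
    let states = Array2::from_elem((2048, 768), 0.5f32);
    let mask = vec![1.0f32; 2048];

    let t = Instant::now();
    for _ in 0..100 {
        std::hint::black_box(naive_mean_pool(&states, &mask));
    }
    println!("naive:      {:?}", t.elapsed());

    let t = Instant::now();
    for _ in 0..100 {
        std::hint::black_box(vectorized_mean_pool(&states, &mask));
    }
    println!("vectorized: {:?}", t.elapsed());
}

The exact speedup depends on the CPU and the flags above; the 2-3x figure quoted earlier comes from this deployment's measurements, not from this sketch.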
Building the image
Build the ARM64 image:
docker buildx build --platform linux/arm64 -t embedding-lambda .
If you’re on an ARM machine (like Apple Silicon Mac), you can simply use:
docker build -t embedding-lambda .
Testing locally
Test the container using Lambda’s Runtime Interface Emulator:
docker run --rm -p 9000:8080 embedding-lambda
In another terminal:
curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
-H "Content-Type: application/json" \
-d '{"body": "{\"text\": \"Hello world\", \"size\": 256}"}'
Pushing to Amazon ECR
Create an ECR repository and push the image:
# Set your AWS region and account ID
AWS_REGION=eu-central-1
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO_NAME=embedding-lambda
# Create the repository (skip if it already exists)
aws ecr create-repository --repository-name $REPO_NAME --region $AWS_REGION
# Authenticate Docker with ECR
aws ecr get-login-password --region $AWS_REGION | \
docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Tag and push
docker tag embedding-lambda:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
Deployment
With the image in ECR, we can create the Lambda function and expose it via a Function URL.
Creating the Lambda function
First, create an IAM role for the Lambda function:
# Create the trust policy
cat > scripts/trust-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "lambda.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create the role
aws iam create-role \
--role-name embedding-lambda-role \
--assume-role-policy-document file://scripts/trust-policy.json
# Attach basic execution policy for CloudWatch logs
aws iam attach-role-policy \
--role-name embedding-lambda-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Now create the Lambda function:
AWS_REGION=eu-central-1
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO_NAME=embedding-lambda
FUNCTION_NAME=embedding-lambda
aws lambda create-function \
--function-name $FUNCTION_NAME \
--package-type Image \
--code ImageUri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest \
--role arn:aws:iam::$AWS_ACCOUNT_ID:role/embedding-lambda-role \
--architectures arm64 \
--memory-size 2048 \
--timeout 30 \
--region $AWS_REGION
Key configuration choices:
--architectures arm64: Matches our ARM64 container build for Graviton
--memory-size 2048: 2GB is plenty for the quantized model; adjust based on your testing
--timeout 30: 30 seconds handles cold starts plus inference comfortably
Creating a Function URL
Function URLs provide a simple HTTPS endpoint without needing API Gateway:
AWS_REGION=eu-central-1
FUNCTION_NAME=embedding-lambda
aws lambda create-function-url-config \
--function-name $FUNCTION_NAME \
--auth-type NONE \
--invoke-mode BUFFERED \
--region $AWS_REGION
# Grant public access to the function URL
aws lambda add-permission \
--function-name $FUNCTION_NAME \
--statement-id FunctionURLAllowPublicAccess \
--action lambda:InvokeFunctionUrl \
--principal "*" \
--function-url-auth-type NONE \
--region $AWS_REGION
This returns a URL like
https://xxxxxxxxxx.lambda-url.eu-central-1.on.aws/
For production, you’ll want to change --auth-type to AWS_IAM and configure proper authentication.
Testing the deployed function
AWS_REGION=eu-central-1
FUNCTION_NAME=embedding-lambda
FUNCTION_URL=$(aws lambda get-function-url-config \
--function-name $FUNCTION_NAME \
--query 'FunctionUrl' \
--output text \
--region $AWS_REGION)
curl -X POST "$FUNCTION_URL" \
-H "Content-Type: application/json" \
-d '{"text": "Rust on Lambda is fast", "size": 256}'
Updating the function
When you update your code, rebuild the image, push to ECR, and update the function:
AWS_REGION=eu-central-1
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO_NAME=embedding-lambda
FUNCTION_NAME=embedding-lambda
# Rebuild and push
docker build -t embedding-lambda .
docker tag embedding-lambda:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest
# Update Lambda to use the new image
aws lambda update-function-code \
--function-name $FUNCTION_NAME \
--image-uri $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO_NAME:latest \
--region $AWS_REGION
Cold start optimization
Cold starts are the main latency concern for ML workloads in Lambda. A few strategies to minimize them:
Provisioned Concurrency: Keep instances warm for consistent latency
AWS_REGION=eu-central-1
FUNCTION_NAME=embedding-lambda
# Provisioned concurrency requires a published version or alias, not $LATEST
VERSION=$(aws lambda publish-version \
--function-name $FUNCTION_NAME \
--query 'Version' --output text \
--region $AWS_REGION)
aws lambda put-provisioned-concurrency-config \
--function-name $FUNCTION_NAME \
--qualifier $VERSION \
--provisioned-concurrent-executions 2 \
--region $AWS_REGION
Memory tuning: More memory means more CPU, which speeds up model loading. Test different values to find the sweet spot between cost and cold start time.
SnapStart: Currently not available for container images, but worth watching for future support.
Monitoring
CloudWatch metrics to watch:
Duration: Inference time per request
InitDuration: Cold start time (model loading)
ConcurrentExecutions: Scale patterns
Errors: Failed requests
Set up alarms for duration spikes or error rates to catch issues early.
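If the built-in metrics aren't granular enough, one option (a sketch; the namespace and metric names are arbitrary examples) is to emit custom metrics from the handler via the CloudWatch Embedded Metric Format, which Lambda picks up automatically from anything printed to stdout:

use std::time::{SystemTime, UNIX_EPOCH};

fn emit_inference_metric(inference_ms: f64) {
    let timestamp_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0);

    // EMF document: the "_aws" block declares the metric, the top-level keys
    // carry the dimension and metric values.
    let emf = serde_json::json!({
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": "EmbeddingLambda",
                "Dimensions": [["FunctionName"]],
                "Metrics": [{ "Name": "InferenceMs", "Unit": "Milliseconds" }]
            }]
        },
        "FunctionName": "embedding-lambda",
        "InferenceMs": inference_ms
    });

    // One JSON object per log line is all CloudWatch needs.
    println!("{}", emf);
}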
Performance Benchmarks
Now that we have our function deployed, let’s look at real-world performance numbers and cost analysis.
Production Performance Metrics
Running on ARM64 Graviton2 with 2048 MB memory and the Q8 quantized model:
Resource Usage:
Peak Memory: ~1509 MB (well within 2048 MB limit)
Cold Start (Init Duration): ~2.6 seconds (model loading + initialization)
Total Cold Start Latency: ~3-4 seconds (includes Init Duration + execution environment provisioning + network latency)
Warm Inference: ~280-295 ms per request
Model Size: ~300 MB
Note: The 2.6s figure is Lambda’s Init Duration metric. End-to-end latency experienced by users on cold starts will be higher due to execution environment provisioning (200-400ms) and any API Gateway/network overhead.
Why is warm inference ~280ms when EdgeTPU benchmarks show <15ms? The EdgeTPU figure is for raw model inference on 256 tokens with specialized hardware. Our Lambda latency includes the complete pipeline: HTTP request parsing, tokenization, full-context inference on general-purpose CPUs, mean pooling, and normalization. The ~18× difference is expected and reasonable for this architecture.
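If you want to see where those milliseconds go on your own deployment, a tiny timing helper (the name and fields are mine, not part of the code above) can wrap each stage inside embed() and surface the split in CloudWatch Logs:

use std::time::Instant;

/// Runs a closure and logs how long it took as a structured tracing field.
fn timed<T>(stage: &'static str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    tracing::info!(stage, elapsed_ms = start.elapsed().as_millis() as u64, "stage finished");
    result
}

// Usage inside embed(), for example:
//     let (input_ids, attention_mask) = timed("tokenize", || self.tokenize(text))?;
//     let embedding = timed("mean_pooling", || Self::mean_pooling(&output_view, &attention_mask))?;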
Thread Count Tuning
ONNX Runtime’s thread configuration has a significant impact on performance. The optimal setting depends on your model size and quantization.
Configuration Options:
In src/embedder.rs, the thread count is set via:
.with_intra_threads(1)? // Optimal for quantized models under 500MB
Thread Count Trade-offs (tested with Q8 model):
1 thread - Best for Q8/Q4 models (<500MB); fastest (reduces overhead)
2 threads - Best for larger models or long sequences; ~10-20% slower than 1 thread for small models
4 threads - Best for FP32 models (>1GB); may have diminishing returns
For the Q8 quantized model (~300 MB) used in this deployment, single-threaded inference is optimal because:
Thread scheduling overhead exceeds parallelization benefits at this model size
Better cache locality with a single thread
Reduced context switching
Note: These recommendations are based on testing with the Q8 model. If you’re using a different quantization variant, benchmark with different thread counts to find your optimal configuration.
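If you want to benchmark this without rebuilding the image, one approach (a sketch; the ORT_INTRA_THREADS variable name is my own convention, not an ONNX Runtime one) is to make the thread count configurable inside Embedder::new:

// Sketch: drop-in replacement for the session setup in Embedder::new.
let intra_threads: usize = std::env::var("ORT_INTRA_THREADS")
    .ok()
    .and_then(|v| v.parse().ok())
    .unwrap_or(1); // default to the single-threaded configuration used above

let session = Session::builder()?
    .with_optimization_level(ort::session::builder::GraphOptimizationLevel::Level1)?
    .with_intra_threads(intra_threads)?
    .commit_from_file(model_path)?;

Set the variable in the Lambda function configuration, run your load test, and compare Duration percentiles across values.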
Cost Analysis: Lambda vs. Alternatives
Understanding the economics helps you choose the right deployment strategy.
AWS Lambda (ARM64) - Our Implementation:
Cost per 1M requests: ~$4.00
Compute (GB-seconds): 2048 MB × 0.28s × 1M × $0.0000133334/GB-s = $3.80
Request charges: $0.20
Pros: Zero cost when idle, no server management, automatic scaling
Cons: Higher latency than EC2 for sustained high-volume workloads
AWS SageMaker Serverless:
Cost per 1M requests: ~$4.90
Pros: Managed ML infrastructure, easier monitoring
Cons: Slightly more expensive, limited to x86_64 (no Graviton optimization)
AWS EC2 Break-Even Analysis:
t4g.small - $14.01/month; break-even vs. Lambda at > 3.5M requests/month
t4g.medium - $28.03/month; break-even vs. Lambda at > 7.0M requests/month
Cost Recommendations by Scale:
< 3.5M requests/month: Use Lambda (most cost-effective)
3.5M - 10M requests/month: Consider t4g.small EC2
> 10M requests/month: Use t4g.medium EC2 or ECS Fargate with auto-scaling
Lambda is ideal for development, small-to-medium production, or sporadic workloads where you benefit from pay-per-use pricing.
Provider Comparison: Self-Hosted vs. API Services
Lambda charges per request, not per token. As your input size grows, the savings compound dramatically:
500 tokens: Self-hosted saves 60-94%
2,000 tokens: Self-hosted saves 90-97%
4,000 tokens: Self-hosted saves 95-98%
For document embedding (2K-4K tokens per document), self-hosted Lambda can be 10-20× cheaper than API providers.
Strategic Benefits Beyond Cost:
Data Privacy: Text never leaves your AWS VPC—critical for sensitive data
Predictable Pricing: No runaway costs from large prompts or unexpected usage spikes
No Rate Limits: Scale to your Lambda concurrency limit (default 1,000 concurrent executions)
Independence: No dependency on external provider uptime or API changes
When to Use External APIs:
Prototyping and experimentation
Need for state-of-the-art quality (commercial models may have better training data)
Want zero operational overhead
Performance Optimization Impact
The ARM64-specific optimizations we implemented provide significant gains:
Optimization Breakdown:
Vectorized Mean Pooling: ~90ms latency reduction (23% faster)
Naive implementation: ~380ms
Vectorized ndarray: ~290ms
ARM64 Target CPU: Up to 19% performance improvement (Neoverse-N1 optimization)
NEON SIMD Instructions: 2-4× faster vector operations
Single Thread Configuration: Optimal for quantized models (eliminates scheduling overhead)
Speed-Optimized Compilation: 10-15% faster than size-optimized builds
Combined, these optimizations deliver 30-50% performance improvement over a baseline implementation without ARM-specific tuning.
Key Metrics to Track:
Cold Start (InitDuration): Should be 2-4 seconds
Warm Latency (Duration): Target 250-350ms for Q8 model
Memory Utilization: Should stay under 1600 MB (80% of limit)
Error Rate: Monitor for timeouts or memory exhaustion
Conclusion
We’ve built a complete embedding service: Rust for performance, ONNX Runtime for portable inference, and Lambda for serverless scale. The combination delivers consistent ~280ms inference latency with minimal operational overhead and exceptional cost efficiency.
Key takeaways:
Rust on Lambda is production-ready. The official support means you get SLAs, documentation, and long-term stability.
Small models unlock serverless ML. EmbeddingGemma’s efficient design fits comfortably within Lambda’s constraints (~1.5 GB memory, 2.6s cold starts).
Matryoshka embeddings give you flexibility. Choose your dimension based on quality vs. storage tradeoffs without retraining.
Containers simplify deployment. Multi-stage builds keep images small while including everything you need.
ARM64 optimizations matter. Target-specific compilation and SIMD instructions deliver 30-50% performance gains.
Economics favor self-hosting. At $4/M requests, Lambda is 2-16× cheaper than commercial embedding APIs, especially for longer documents.
From here, you could extend this with batch processing, add caching for repeated queries, or integrate with a vector database for semantic search. The foundation is solid.
The code from this article is available on GitHub: https://github.com/alexsobolev/rust-embedding-lambda
Questions or improvements? Open an issue or reach out. Happy embedding.


