Large Language Models (LLMs) and Generative AI have become central to modern application development. From chatbots to content generation to code analysis, LLMs power increasingly sophisticated user experiences. Yet most LLM integrations happen in Python or JavaScript, languages that prioritize convenience over safety and performance.
Rust changes this calculus. With libraries like reqwest for API calls, tokio for async operations, and emerging frameworks like llm-rs and langchain-rust, Rust enables you to build production-grade AI applications that are simultaneously safe, fast, and maintainable.
This article explores integrating LLMs with Rust across multiple paradigms: API-based inference, local model serving, and embedded inference.
Core Concepts & Terminology
Large Language Models (LLMs)
An LLM is a neural network trained on vast amounts of text data, capable of understanding and generating human-like text. Key characteristics:
- Parameters: Model size (7B, 13B, 70B parameters for open models; frontier models like GPT-4 are widely reported to be in the trillion-parameter range)
- Context Window: Maximum tokens it can consider (4K, 8K, 32K, 128K, or more)
- Inference: The process of feeding input to the model and getting output
- Token: A piece of text (roughly 4 characters on average)
- Latency: Time to first token (TTFT) and time per output token (TPOT)
- Throughput: Tokens per second generated
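These terms combine into useful back-of-envelope estimates. A rough sketch, with the caveat that the 4-characters-per-token rule and the timing figures below are approximations for illustration, not measurements:

```rust
/// Estimate token count from raw text length using the
/// "~4 characters per token" heuristic (rounding up).
fn estimate_tokens(text: &str) -> u32 {
    (text.len() as u32 + 3) / 4
}

/// Estimate total generation time: time-to-first-token (TTFT)
/// plus per-token time (TPOT) for each generated token.
fn estimate_latency_ms(ttft_ms: u32, output_tokens: u32, ms_per_token: u32) -> u32 {
    ttft_ms + output_tokens * ms_per_token
}

fn main() {
    let prompt = "Explain ownership in Rust in one paragraph.";
    println!("~{} input tokens", estimate_tokens(prompt));
    // e.g. 500 ms TTFT, 200 output tokens at 20 ms/token
    println!("~{} ms total", estimate_latency_ms(500, 200, 20));
}
```

For real token counts, use the provider's tokenizer rather than the heuristic; the heuristic is only good enough for rough cost and latency budgeting.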
API-Based vs. Local Inference
API-Based (e.g., OpenAI, Claude, Gemini)
- Pros: No model hosting, automatic updates, high quality
- Cons: Latency, cost per token, data privacy concerns
- Use case: Rapid prototyping, specialized models
Local Inference (e.g., Ollama, llama.cpp, vLLM)
- Pros: Data privacy, no per-token cost, customizable
- Cons: Requires GPU/hardware, model management overhead
- Use case: Privacy-sensitive applications, batch processing
Common LLM Architectures
- Transformers: Foundation of modern LLMs (attention mechanisms)
- Quantization: Reducing model precision (fp32 → int8) for speed/memory
- Fine-tuning: Adapting pre-trained models for specific tasks
- Retrieval-Augmented Generation (RAG): Combining search + generation for knowledge grounding
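To make the quantization idea concrete, here is a toy symmetric int8 quantizer. This is a sketch of the principle only; real frameworks use per-channel scales, calibration data, and optimized kernels:

```rust
/// Toy symmetric int8 quantization: map f32 weights into [-127, 127]
/// with a single scale factor chosen so the largest-magnitude weight
/// maps to 127.
fn quantize(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights.iter().map(|w| (w / scale).round() as i8).collect();
    (q, scale)
}

/// Recover approximate f32 values from the quantized representation.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = [0.52f32, -1.3, 0.007, 0.91];
    let (q, scale) = quantize(&weights);
    let restored = dequantize(&q, scale);
    // int8 storage is 4x smaller than f32; values are approximately preserved
    for (w, r) in weights.iter().zip(&restored) {
        println!("{w:+.3} -> {r:+.3}");
    }
}
```

The storage win (4x over fp32) and the small rounding error visible in the output are exactly the trade quantized LLM formats like GGUF make at much larger scale.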
Architecture Patterns
Typical LLM Integration Architecture
+-----------------------------------------------------+
|                  Rust Application                   |
|  +-----------------------------------------------+  |
|  |           User Interface (Web/CLI)            |  |
|  +-----------------------+-----------------------+  |
|                          |                          |
|  +-----------------------v-----------------------+  |
|  |   Prompt Engineering & Context Management     |  |
|  |   - Build prompts dynamically                 |  |
|  |   - Manage conversation history               |  |
|  |   - Implement RAG for knowledge               |  |
|  +-----------------------+-----------------------+  |
|                          |                          |
|  +-----------------------v-----------------------+  |
|  |          LLM Client Layer (Async)             |  |
|  |   - Handle rate limiting                      |  |
|  |   - Retry logic & error handling              |  |
|  |   - Token counting & cost tracking            |  |
|  +-----------------------+-----------------------+  |
+--------------------------|--------------------------+
                           |
             +-------------v-------------+
             |       LLM Provider        |
             |  - OpenAI API             |
             |  - Anthropic API          |
             |  - Local Ollama/vLLM      |
             |  - HuggingFace Inference  |
             +---------------------------+
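One way to keep the upper layers independent of any single provider is to define the client layer as a small trait. The `LlmClient`, `EchoClient`, and `summarize` names below are illustrative, not from any particular crate:

```rust
/// Provider-agnostic client trait: the prompt-engineering and UI layers
/// depend on this, never on a concrete API.
trait LlmClient {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// A stub implementation, useful for tests and offline development.
struct EchoClient;

impl LlmClient for EchoClient {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("echo: {prompt}"))
    }
}

/// Application code written against the trait works with any provider.
fn summarize(client: &dyn LlmClient, text: &str) -> Result<String, String> {
    client.complete(&format!("Summarize: {text}"))
}

fn main() {
    let client = EchoClient;
    println!("{}", summarize(&client, "Rust ownership").unwrap());
}
```

In a real application the trait method would be async (via the `async_trait` crate or async fn in traits) and implementations would wrap OpenAI, Anthropic, or a local Ollama endpoint; the design benefit — swapping providers without touching business logic — is the same.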
Getting Started: OpenAI API Integration
Project Setup
cargo new llm-app
cd llm-app
cargo add reqwest --features json
cargo add tokio --features full
cargo add serde --features derive
cargo add serde_json
cargo add dotenv
Cargo.toml Configuration
[package]
name = "llm-app"
version = "0.1.0"
edition = "2021"
[dependencies]
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
dotenv = "0.15"
anyhow = "1.0"
Simple Chat Completion Example
use reqwest::Client;
use serde::{Deserialize, Serialize};
use anyhow::Result;

// Deserialize is needed because responses contain Messages (see Choice);
// Clone is needed by later patterns that copy conversation history.
#[derive(Serialize, Deserialize, Debug, Clone)]
struct Message {
    role: String,
    content: String,
}

#[derive(Serialize)]
struct ChatRequest {
    model: String,
    messages: Vec<Message>,
    temperature: f32,
    max_tokens: u32,
}

#[derive(Deserialize, Debug)]
struct Choice {
    message: Message,
    finish_reason: String,
}

#[derive(Deserialize, Debug)]
struct ChatResponse {
    choices: Vec<Choice>,
    usage: Usage,
}

#[derive(Deserialize, Debug)]
struct Usage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

async fn chat_with_gpt(prompt: &str) -> Result<String> {
    let api_key = std::env::var("OPENAI_API_KEY")?;
    let client = Client::new();

    let request = ChatRequest {
        model: "gpt-4-turbo".to_string(),
        messages: vec![
            Message {
                role: "system".to_string(),
                content: "You are a helpful assistant.".to_string(),
            },
            Message {
                role: "user".to_string(),
                content: prompt.to_string(),
            },
        ],
        temperature: 0.7,
        max_tokens: 1000,
    };

    let response = client
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(&api_key)
        .json(&request)
        .send()
        .await?
        .error_for_status()?; // surface HTTP errors instead of failing on JSON parsing

    let chat_response: ChatResponse = response.json().await?;
    Ok(chat_response
        .choices
        .first()
        .map(|c| c.message.content.clone())
        .unwrap_or_default())
}

#[tokio::main]
async fn main() -> Result<()> {
    dotenv::dotenv().ok();
    let response = chat_with_gpt("What is Rust and why is it useful?").await?;
    println!("Response: {}", response);
    Ok(())
}
Advanced LLM Integration Patterns
Pattern 1: Streaming Responses
use futures::stream::StreamExt;

async fn stream_chat(prompt: &str) -> Result<()> {
    let api_key = std::env::var("OPENAI_API_KEY")?;
    let client = Client::new();

    // ChatRequest has no `stream` field, so build the body directly;
    // "stream": true switches the API to Server-Sent Events
    let request = serde_json::json!({
        "model": "gpt-4-turbo",
        "messages": [
            { "role": "user", "content": prompt }
        ],
        "temperature": 0.7,
        "max_tokens": 2000,
        "stream": true
    });

    let response = client
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(&api_key)
        .json(&request)
        .send()
        .await?;

    let mut stream = response.bytes_stream();
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        let text = String::from_utf8_lossy(&chunk);
        // Parse the SSE (Server-Sent Events) format line by line
        for line in text.lines() {
            if line.starts_with("data: ") {
                let data = &line[6..];
                if data == "[DONE]" {
                    break;
                }
                if let Ok(json) = serde_json::from_str::<serde_json::Value>(data) {
                    if let Some(content) = json["choices"][0]["delta"]["content"].as_str() {
                        print!("{}", content);
                    }
                }
            }
        }
    }
    println!();
    Ok(())
}
Pattern 2: Conversation Management with History
struct ConversationManager {
    messages: Vec<Message>,
    system_prompt: String,
    max_messages: usize,
}

impl ConversationManager {
    fn new(system_prompt: String) -> Self {
        ConversationManager {
            messages: Vec::new(),
            system_prompt,
            max_messages: 10,
        }
    }

    fn add_user_message(&mut self, content: String) {
        self.messages.push(Message {
            role: "user".to_string(),
            content,
        });
    }

    fn add_assistant_message(&mut self, content: String) {
        self.messages.push(Message {
            role: "assistant".to_string(),
            content,
        });
        // Keep the conversation window manageable; drop the oldest
        // user/assistant pair so roles stay alternating
        if self.messages.len() > self.max_messages {
            self.messages.drain(0..2);
        }
    }

    fn get_messages_for_request(&self) -> Vec<Message> {
        let mut msgs = vec![Message {
            role: "system".to_string(),
            content: self.system_prompt.clone(),
        }];
        msgs.extend(self.messages.clone());
        msgs
    }
}

// `chat_with_messages` is assumed to be a variant of `chat_with_gpt`
// that accepts a full message list instead of a single prompt.
#[tokio::main]
async fn main() -> Result<()> {
    let mut conversation = ConversationManager::new(
        "You are a Rust programming expert.".to_string(),
    );

    // Multi-turn conversation
    conversation.add_user_message("What is ownership in Rust?".to_string());
    let response = chat_with_messages(conversation.get_messages_for_request()).await?;
    println!("Assistant: {}", response);
    conversation.add_assistant_message(response);

    // Follow-up question
    conversation.add_user_message("How does the borrow checker enforce this?".to_string());
    let response2 = chat_with_messages(conversation.get_messages_for_request()).await?;
    println!("Assistant: {}", response2);
    Ok(())
}
Pattern 3: Token Counting and Cost Tracking
struct TokenCounter {
    model: String,
}

impl TokenCounter {
    fn new(model: &str) -> Self {
        TokenCounter {
            model: model.to_string(),
        }
    }

    // Rough estimation (for accuracy, use a tokenizer library such as tiktoken-rs)
    fn estimate_tokens(&self, text: &str) -> u32 {
        (text.len() / 4) as u32
    }

    // Example prices per 1K tokens; always check the provider's current pricing page
    fn calculate_cost(&self, input_tokens: u32, output_tokens: u32) -> f32 {
        match self.model.as_str() {
            "gpt-4-turbo" => {
                let input_cost = (input_tokens as f32) * 0.01 / 1000.0;
                let output_cost = (output_tokens as f32) * 0.03 / 1000.0;
                input_cost + output_cost
            }
            "gpt-3.5-turbo" => {
                let input_cost = (input_tokens as f32) * 0.0005 / 1000.0;
                let output_cost = (output_tokens as f32) * 0.0015 / 1000.0;
                input_cost + output_cost
            }
            _ => 0.0,
        }
    }
}

fn main() {
    let counter = TokenCounter::new("gpt-4-turbo");
    let input_tokens = counter.estimate_tokens("What is Rust?");
    let output_tokens = 150;
    let cost = counter.calculate_cost(input_tokens, output_tokens);
    println!("Estimated cost: ${:.4}", cost);
}
Pattern 4: Retry Logic with Exponential Backoff
use std::time::Duration;

async fn call_llm_with_retry(prompt: &str, max_retries: u32) -> Result<String> {
    let mut retries = 0;
    loop {
        match chat_with_gpt(prompt).await {
            Ok(response) => return Ok(response),
            Err(e) => {
                retries += 1;
                if retries > max_retries {
                    return Err(e);
                }
                // Exponential backoff: 1s, 2s, 4s, 8s...
                let wait_duration = Duration::from_secs(2_u64.pow(retries - 1));
                eprintln!(
                    "Request failed (attempt {}): {}. Retrying in {:?}...",
                    retries, e, wait_duration
                );
                tokio::time::sleep(wait_duration).await;
            }
        }
    }
}
Pattern 5: Retrieval-Augmented Generation (RAG)
struct Document {
    id: String,
    content: String,
    embedding: Vec<f32>, // would be populated by an embedding model
}

struct RAGSystem {
    documents: Vec<Document>,
}

impl RAGSystem {
    fn new() -> Self {
        RAGSystem {
            documents: Vec::new(),
        }
    }

    fn add_document(&mut self, id: String, content: String) {
        // In production, compute embeddings with an embedding model here
        let doc = Document {
            id,
            content,
            embedding: vec![0.0; 1536], // placeholder
        };
        self.documents.push(doc);
    }

    fn retrieve_relevant_documents(&self, query: &str, k: usize) -> Vec<String> {
        // Simple keyword matching (in production: use embedding similarity search)
        self.documents
            .iter()
            .filter(|doc| doc.content.to_lowercase().contains(&query.to_lowercase()))
            .take(k)
            .map(|doc| doc.content.clone())
            .collect()
    }

    async fn answer_with_context(&self, question: &str) -> Result<String> {
        // Retrieve relevant documents
        let context_docs = self.retrieve_relevant_documents(question, 3);
        let context = context_docs.join("\n---\n");

        // Build the augmented prompt
        let augmented_prompt = format!(
            "Context:\n{}\n\nQuestion: {}\n\nAnswer based on the context above:",
            context, question
        );
        chat_with_gpt(&augmented_prompt).await
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    let mut rag = RAGSystem::new();

    // Add documents to the knowledge base
    rag.add_document(
        "rust-ownership".to_string(),
        "Rust uses ownership to manage memory safely...".to_string(),
    );

    // Ask a question with context
    let answer = rag.answer_with_context("How does Rust manage memory?").await?;
    println!("Answer: {}", answer);
    Ok(())
}
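The keyword filter above is only a stand-in; production RAG ranks documents by embedding similarity. A minimal cosine-similarity ranking sketch (the three-dimensional vectors here are toy stand-ins for real model embeddings):

```rust
/// Cosine similarity between two embedding vectors -- the core operation
/// behind the similarity search that replaces keyword matching in real RAG.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return indices of the k documents most similar to the query embedding.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort descending by similarity (embeddings are never NaN in practice)
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    // Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
    let docs = vec![
        vec![1.0f32, 0.0, 0.0],
        vec![0.0, 1.0, 0.0],
        vec![0.9, 0.1, 0.0],
    ];
    let query = vec![1.0f32, 0.0, 0.0];
    println!("{:?}", top_k(&query, &docs, 2)); // most similar documents first
}
```

Dropping this in place of `retrieve_relevant_documents` requires an embedding model (e.g. an embeddings API or a local model) to produce the vectors; for large corpora you would use an approximate nearest-neighbor index rather than the linear scan shown here.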
Local LLM Inference with Ollama
Using Ollama for Local Models
# Install and run Ollama
# https://ollama.ai
# Pull a model
ollama pull llama2
# Model runs on http://localhost:11434
Rust Integration with Local Ollama
use reqwest::Client;
use serde::{Deserialize, Serialize};
use anyhow::Result;

#[derive(Serialize)]
struct OllamaRequest {
    model: String,
    prompt: String,
    stream: bool,
}

#[derive(Deserialize)]
struct OllamaResponse {
    response: String,
    done: bool,
}

async fn query_local_llm(prompt: &str) -> Result<String> {
    let client = Client::new();
    let request = OllamaRequest {
        model: "llama2".to_string(),
        prompt: prompt.to_string(),
        stream: false,
    };

    let response = client
        .post("http://localhost:11434/api/generate")
        .json(&request)
        .send()
        .await?;

    let ollama_response: OllamaResponse = response.json().await?;
    Ok(ollama_response.response)
}

#[tokio::main]
async fn main() -> Result<()> {
    let response =
        query_local_llm("Write a Rust function that returns the sum of two numbers").await?;
    println!("{}", response);
    Ok(())
}
Common Pitfalls & Best Practices
1. Ignoring Rate Limits
❌ Bad: No rate limit handling
for prompt in prompts {
    let _ = chat_with_gpt(&prompt).await; // hits rate limits
}
✅ Good: Bound concurrency with a semaphore
use tokio::sync::Semaphore;
use std::sync::Arc;

let semaphore = Arc::new(Semaphore::new(10)); // max 10 concurrent requests
let mut handles = Vec::new();
for prompt in prompts {
    let sem = semaphore.clone();
    handles.push(tokio::spawn(async move {
        let _permit = sem.acquire().await.expect("semaphore closed");
        let _ = chat_with_gpt(&prompt).await;
    }));
}
for handle in handles {
    let _ = handle.await; // wait for all spawned tasks to finish
}
2. Hardcoding API Keys
❌ Bad: Keys in source code
let api_key = "sk-1234567890abcdef";
✅ Good: Environment variables
let api_key = std::env::var("OPENAI_API_KEY")
    .expect("OPENAI_API_KEY not set");
3. No Error Recovery
❌ Bad: Single attempt, no fallback
let response = chat_with_gpt(prompt).await?; // fails immediately on any transient error
✅ Good: Retry logic with a fallback
let response = call_llm_with_retry(prompt, 3)
    .await
    .unwrap_or_else(|_| "Sorry, I couldn't process that.".to_string());
4. Unbounded Token Usage
❌ Bad: No token limits
let request = ChatRequest {
    max_tokens: 100_000, // expensive!
    // ...
};
✅ Good: Set reasonable limits
let request = ChatRequest {
    max_tokens: 2000, // reasonable for most tasks
    // ...
};
5. Not Validating Input
❌ Bad: Direct user input to the LLM
let response = chat_with_gpt(&user_input).await?; // could carry prompt injection
✅ Good: Validate and sanitize first
if user_input.len() > 5000 {
    anyhow::bail!("Input too long");
}
let sanitized = user_input.trim();
let response = chat_with_gpt(sanitized).await?;
6. Synchronous Blocking in Async Code
❌ Bad: Blocking operation in async code
async fn process_llm_response(prompt: &str) -> Result<String> {
    let response = chat_with_gpt(prompt).await?;
    std::thread::sleep(Duration::from_secs(10)); // blocks the executor thread!
    Ok(response)
}
✅ Good: Use async alternatives
async fn process_llm_response(prompt: &str) -> Result<String> {
    let response = chat_with_gpt(prompt).await?;
    tokio::time::sleep(Duration::from_secs(10)).await; // non-blocking
    Ok(response)
}
Rust vs. Alternatives for LLM Integration
| Aspect | Rust | Python | Node.js | Go |
|---|---|---|---|---|
| LLM Library Support | Growing (langchain, llm-rs) | Excellent (official SDKs) | Good (official SDKs) | Limited |
| Async Performance | Excellent | Good (asyncio) | Excellent | Good |
| Type Safety | Exceptional | Weak | Weak | Good |
| Memory Usage | Low (no GC) | High (GC) | High (GC) | Low |
| Production Readiness | ✅ Mature | ✅ Mature | ✅ Mature | ⚠️ Growing |
| Development Speed | Medium | Fast | Fast | Medium |
| Concurrency Model | async/await | asyncio/threading | Event loop | goroutines |
| Binary Size | 2-10 MB | N/A (interpreted) | N/A (runtime) | 5-15 MB |
| Deployment | Single binary | Requires Python | Requires Node.js | Single binary |
When to Choose Rust for LLM Applications
✅ Use Rust when:
- Performance and latency are critical
- You need a self-contained binary
- Memory efficiency matters (edge devices)
- Building infrastructure/services
- Type safety is important for correctness
✅ Use Python when:
- Rapid prototyping is essential
- Best LLM library ecosystem needed
- Data science integration required
- Team expertise is Python-focused
✅ Use Node.js when:
- Full-stack JavaScript is preferred
- Real-time streaming is critical
- Web framework integration needed
Production Deployment Considerations
Docker Deployment
FROM rust:latest AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
# ca-certificates is required for TLS connections to the API
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/llm-app /usr/local/bin/
# Pass OPENAI_API_KEY at runtime (docker run -e OPENAI_API_KEY=...);
# never bake secrets into the image with ENV
CMD ["llm-app"]
Environment Configuration
use std::env;

#[derive(Debug)]
struct Config {
    api_key: String,
    model: String,
    max_tokens: u32,
    temperature: f32,
}

impl Config {
    fn from_env() -> Result<Self, env::VarError> {
        Ok(Config {
            api_key: env::var("OPENAI_API_KEY")?,
            model: env::var("MODEL").unwrap_or_else(|_| "gpt-4-turbo".to_string()),
            max_tokens: env::var("MAX_TOKENS")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(2000),
            temperature: env::var("TEMPERATURE")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(0.7),
        })
    }
}
Resources & Learning Materials
Rust LLM Libraries
- langchain-rust - LangChain port to Rust
- llm-rs - Local LLM inference
- ort - ONNX Runtime for inference
- candle - HuggingFace’s ML framework in Rust
Learning Resources
- Practical Deep Learning for Coders - General ML concepts
- LangChain Documentation - RAG patterns (principles apply to Rust)
- OpenAI Cookbook - Best practices
- DeepLearning.AI Short Courses - LLM fundamentals
Useful Crates
- reqwest - HTTP client with async support
- tokio - Async runtime
- serde/serde_json - Serialization
- anyhow - Error handling
- dotenv - Environment variables
- tracing - Structured logging
- tokio-stream - Streaming primitives
- futures - Future combinators
Conclusion
Integrating LLMs with Rust enables you to build AI-powered applications that are simultaneously safe, fast, and maintainable. Whether you’re building chatbot backends, content generation pipelines, or real-time analysis systems, Rust provides the tools necessary for production-grade implementations.
The key advantages are clear:
- Performance: Near-native speed and latency, in the same class as C/C++
- Safety: Memory safety prevents entire classes of bugs
- Concurrency: Async/await handles thousands of concurrent requests efficiently
- Deployment: Single compiled binary with no runtime dependencies
As the Rust AI ecosystem matures, with libraries like candle, llm-rs, and langchain-rust, building sophisticated AI applications in Rust transitions from “possible” to “preferable” for many use cases.
Start with API-based integration for rapid prototyping, then move to local inference for privacy-sensitive applications. In both cases, Rust provides the foundation for applications that scale confidently from prototype to production.