Integrating Large Language Models with Rust

Building AI-Powered Applications with Safety and Performance

Large Language Models (LLMs) and Generative AI have become central to modern application development. From chatbots to content generation to code analysis, LLMs power increasingly sophisticated user experiences. Yet most LLM integrations happen in Python or JavaScript, languages that prioritize convenience over safety and performance.

Rust changes this calculus. With libraries like reqwest for API calls, tokio for async operations, and emerging frameworks like llm-rs and langchain bindings, Rust enables you to build production-grade AI applications that are simultaneously safe, fast, and maintainable.

This article explores integrating LLMs with Rust across multiple paradigms: API-based inference, local model serving, and embedded inference.

Core Concepts & Terminology

Large Language Models (LLMs)

An LLM is a neural network trained on vast amounts of text data, capable of understanding and generating human-like text. Key characteristics:

  • Parameters: Model size (7B, 13B, 70B parameters for open models; the sizes of frontier proprietary models such as GPT-4 are undisclosed)
  • Context Window: Maximum tokens it can consider (4K, 8K, 32K, 128K, or more)
  • Inference: The process of feeding input to the model and getting output
  • Token: A piece of text (roughly 4 characters on average)
  • Latency: Time to first token (TTFT) and time per token (TpT)
  • Throughput: Tokens per second generated
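The latency terms above compose simply: total generation time ≈ TTFT + (tokens − 1) × TpT, and steady-state throughput is the inverse of TpT. A minimal sketch (the numbers in the test are illustrative, not benchmarks):

```rust
/// Estimated wall-clock time to generate `tokens` output tokens,
/// given time-to-first-token and per-token latency (both in ms).
fn generation_time_ms(ttft_ms: f64, tpt_ms: f64, tokens: u32) -> f64 {
    ttft_ms + tpt_ms * tokens.saturating_sub(1) as f64
}

/// Steady-state throughput (tokens/second) implied by per-token latency.
fn throughput_tps(tpt_ms: f64) -> f64 {
    1000.0 / tpt_ms
}
```

For example, a model with a 300 ms TTFT and 20 ms TpT needs roughly 10.3 seconds to produce 500 tokens, streaming at 50 tokens/second once it gets going.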

API-Based vs. Local Inference

API-Based (e.g., OpenAI, Claude, Gemini)

  • Pros: No model hosting, automatic updates, high quality
  • Cons: Latency, cost per token, data privacy concerns
  • Use case: Rapid prototyping, specialized models

Local Inference (e.g., Ollama, llama.cpp, vLLM)

  • Pros: Data privacy, no per-token cost, customizable
  • Cons: Requires GPU/hardware, model management overhead
  • Use case: Privacy-sensitive applications, batch processing

Common LLM Architectures

  • Transformers: Foundation of modern LLMs (attention mechanisms)
  • Quantization: Reducing model precision (fp32 → int8) for speed/memory
  • Fine-tuning: Adapting pre-trained models for specific tasks
  • Retrieval-Augmented Generation (RAG): Combining search + generation for knowledge grounding
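Quantization's memory impact is simple arithmetic: weight memory ≈ parameter count × bytes per weight (billions of parameters × bytes conveniently equals gigabytes). A sketch, ignoring runtime overhead such as the KV cache and activations:

```rust
/// Approximate memory for model weights alone, in GB:
/// billions of parameters × bytes per weight.
/// KV cache and activation memory come on top and are not included.
fn weight_memory_gb(params_billions: f64, bytes_per_weight: f64) -> f64 {
    params_billions * bytes_per_weight
}
```

A 7B model needs about 28 GB at fp32 (4 bytes/weight), 14 GB at fp16, and 7 GB at int8, which is why quantization is often what makes local inference feasible on consumer hardware.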

Architecture Patterns

Typical LLM Integration Architecture

┌──────────────────────────────────────────────────────┐
│                   Rust Application                   │
│  ┌────────────────────────────────────────────────┐  │
│  │  User Interface (Web/CLI)                      │  │
│  └───────────────────────┬────────────────────────┘  │
│                          ↓                           │
│  ┌────────────────────────────────────────────────┐  │
│  │  Prompt Engineering & Context Management       │  │
│  │  - Build prompts dynamically                   │  │
│  │  - Manage conversation history                 │  │
│  │  - Implement RAG for knowledge                 │  │
│  └───────────────────────┬────────────────────────┘  │
│                          ↓                           │
│  ┌────────────────────────────────────────────────┐  │
│  │  LLM Client Layer (Async)                      │  │
│  │  - Handle rate limiting                        │  │
│  │  - Retry logic & error handling                │  │
│  │  - Token counting & cost tracking              │  │
│  └───────────────────────┬────────────────────────┘  │
└──────────────────────────┼───────────────────────────┘
                           ↓
              ┌───────────────────────────┐
              │ LLM Provider              │
              │ - OpenAI API              │
              │ - Anthropic API           │
              │ - Local Ollama/vLLM       │
              │ - HuggingFace Inference   │
              └───────────────────────────┘

Getting Started: OpenAI API Integration

Project Setup

cargo new llm-app
cd llm-app
cargo add reqwest --features json,stream
cargo add tokio --features full
cargo add serde --features derive
cargo add serde_json
cargo add dotenv
cargo add anyhow
cargo add futures

Cargo.toml Configuration

[package]
name = "llm-app"
version = "0.1.0"
edition = "2021"

[dependencies]
reqwest = { version = "0.11", features = ["json", "stream"] }
tokio = { version = "1", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
dotenv = "0.15"
anyhow = "1.0"
futures = "0.3"

Simple Chat Completion Example

use reqwest::Client;
use serde::{Deserialize, Serialize};
use anyhow::Result;

#[derive(Serialize, Deserialize, Debug, Clone)]
struct Message {
    role: String,
    content: String,
}

#[derive(Serialize)]
struct ChatRequest {
    model: String,
    messages: Vec<Message>,
    temperature: f32,
    max_tokens: u32,
}

#[derive(Deserialize, Debug)]
struct Choice {
    message: Message,
    finish_reason: String,
}

#[derive(Deserialize, Debug)]
struct ChatResponse {
    choices: Vec<Choice>,
    usage: Usage,
}

#[derive(Deserialize, Debug)]
struct Usage {
    prompt_tokens: u32,
    completion_tokens: u32,
    total_tokens: u32,
}

async fn chat_with_gpt(prompt: &str) -> Result<String> {
    let api_key = std::env::var("OPENAI_API_KEY")?;
    
    let client = Client::new();
    
    let request = ChatRequest {
        model: "gpt-4-turbo".to_string(),
        messages: vec![
            Message {
                role: "system".to_string(),
                content: "You are a helpful assistant.".to_string(),
            },
            Message {
                role: "user".to_string(),
                content: prompt.to_string(),
            },
        ],
        temperature: 0.7,
        max_tokens: 1000,
    };

    let response = client
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(&api_key)
        .json(&request)
        .send()
        .await?
        .error_for_status()?; // surface HTTP 4xx/5xx instead of failing to parse the body

    let chat_response: ChatResponse = response.json().await?;
    
    Ok(chat_response
        .choices
        .first()
        .map(|c| c.message.content.clone())
        .unwrap_or_default())
}

#[tokio::main]
async fn main() -> Result<()> {
    dotenv::dotenv().ok();
    
    let response = chat_with_gpt("What is Rust and why is it useful?").await?;
    println!("Response: {}", response);
    
    Ok(())
}

Advanced LLM Integration Patterns

Pattern 1: Streaming Responses

use futures::stream::StreamExt;

async fn stream_chat(prompt: &str) -> Result<()> {
    let api_key = std::env::var("OPENAI_API_KEY")?;
    
    let client = Client::new();
    
    // ChatRequest has no `stream` field, so build the body directly with
    // serde_json::json! and set "stream": true to receive incremental chunks
    let request = serde_json::json!({
        "model": "gpt-4-turbo",
        "messages": [
            { "role": "user", "content": prompt }
        ],
        "temperature": 0.7,
        "max_tokens": 2000,
        "stream": true
    });

    // Enable streaming with stream: true
    let response = client
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(&api_key)
        .json(&request)
        .send()
        .await?;

    let mut stream = response.bytes_stream();
    
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        let text = String::from_utf8_lossy(&chunk);
        
        // Parse SSE (Server-Sent Events) format; a network chunk can split a
        // line in two, so production code should buffer partial lines
        for line in text.lines() {
            if line.starts_with("data: ") {
                let data = &line[6..];
                if data == "[DONE]" {
                    break;
                }
                
                if let Ok(json) = serde_json::from_str::<serde_json::Value>(data) {
                    if let Some(content) = json["choices"][0]["delta"]["content"].as_str() {
                        print!("{}", content);
                    }
                }
            }
        }
    }
    
    println!();
    Ok(())
}

Pattern 2: Conversation Management with History

struct ConversationManager {
    messages: Vec<Message>,
    system_prompt: String,
    max_messages: usize,
}

impl ConversationManager {
    fn new(system_prompt: String) -> Self {
        ConversationManager {
            messages: Vec::new(),
            system_prompt,
            max_messages: 10,
        }
    }

    fn add_user_message(&mut self, content: String) {
        self.messages.push(Message {
            role: "user".to_string(),
            content,
        });
    }

    fn add_assistant_message(&mut self, content: String) {
        self.messages.push(Message {
            role: "assistant".to_string(),
            content,
        });
        
        // Keep conversation window manageable by dropping the oldest messages
        // (each turn adds two messages, so trim until back under the limit)
        while self.messages.len() > self.max_messages {
            self.messages.remove(0);
        }
    }

    fn get_messages_for_request(&self) -> Vec<Message> {
        let mut msgs = vec![Message {
            role: "system".to_string(),
            content: self.system_prompt.clone(),
        }];
        msgs.extend(self.messages.clone());
        msgs
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    let mut conversation = ConversationManager::new(
        "You are a Rust programming expert.".to_string()
    );

    // Multi-turn conversation
    conversation.add_user_message("What is ownership in Rust?".to_string());
    
    // chat_with_messages: helper analogous to chat_with_gpt, but taking the
    // full message list instead of a single prompt string
    let response = chat_with_messages(
        conversation.get_messages_for_request()
    ).await?;
    
    println!("Assistant: {}", response);
    conversation.add_assistant_message(response);

    // Follow-up question
    conversation.add_user_message("How does the borrow checker enforce this?".to_string());
    
    let response2 = chat_with_messages(
        conversation.get_messages_for_request()
    ).await?;
    
    println!("Assistant: {}", response2);
    
    Ok(())
}

Pattern 3: Token Counting and Cost Tracking

struct TokenCounter {
    model: String,
}

impl TokenCounter {
    fn new(model: &str) -> Self {
        TokenCounter {
            model: model.to_string(),
        }
    }

    // Rough estimation (more sophisticated: use tokenizer library)
    fn estimate_tokens(&self, text: &str) -> u32 {
        (text.len() / 4) as u32
    }

    // Illustrative per-1K-token prices frozen at time of writing;
    // always check your provider's current pricing page
    fn calculate_cost(&self, input_tokens: u32, output_tokens: u32) -> f32 {
        match self.model.as_str() {
            "gpt-4-turbo" => {
                let input_cost = (input_tokens as f32) * 0.01 / 1000.0;
                let output_cost = (output_tokens as f32) * 0.03 / 1000.0;
                input_cost + output_cost
            }
            "gpt-3.5-turbo" => {
                let input_cost = (input_tokens as f32) * 0.0005 / 1000.0;
                let output_cost = (output_tokens as f32) * 0.0015 / 1000.0;
                input_cost + output_cost
            }
            _ => 0.0,
        }
    }
}

fn main() {
    let counter = TokenCounter::new("gpt-4-turbo");
    
    let input_tokens = counter.estimate_tokens("What is Rust?");
    let output_tokens = 150;
    
    let cost = counter.calculate_cost(input_tokens, output_tokens);
    println!("Estimated cost: ${:.4}", cost);
}

Pattern 4: Retry Logic with Exponential Backoff

use std::time::Duration;

async fn call_llm_with_retry(
    prompt: &str,
    max_retries: u32,
) -> Result<String> {
    let mut retries = 0;
    
    loop {
        match chat_with_gpt(prompt).await {
            Ok(response) => return Ok(response),
            Err(e) => {
                retries += 1;
                if retries > max_retries {
                    return Err(e);
                }
                
                // Exponential backoff: 1s, 2s, 4s, 8s...
                let wait_duration = Duration::from_secs(2_u64.pow(retries - 1));
                eprintln!(
                    "Request failed (attempt {}): {}. Retrying in {:?}...",
                    retries, e, wait_duration
                );
                
                tokio::time::sleep(wait_duration).await;
            }
        }
    }
}
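Plain exponential backoff makes every client retry at the same instants; adding jitter spreads retries out. A dependency-free sketch of the "full jitter" variant (the nanosecond clock stands in for a real RNG; in production use the rand crate):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// "Full jitter" backoff: a pseudo-random delay in [0, min(cap, base * 2^attempt)].
/// Jitter prevents many clients from retrying in lockstep after an outage.
fn backoff_with_jitter(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    // Capped exponential ceiling for this attempt
    let ceiling = base_ms.saturating_mul(2u64.saturating_pow(attempt)).min(cap_ms);
    if ceiling == 0 {
        return Duration::from_millis(0);
    }
    // Cheap entropy source to keep the sketch stdlib-only
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    Duration::from_millis(nanos % (ceiling + 1))
}
```

Dropping this into `call_llm_with_retry` means replacing the fixed `2_u64.pow(retries - 1)` schedule with `backoff_with_jitter(retries - 1, 1000, 8000)`.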

Pattern 5: Retrieval-Augmented Generation (RAG)

struct Document {
    id: String,
    content: String,
    embedding: Vec<f32>, // Would be populated by embedding model
}

struct RAGSystem {
    documents: Vec<Document>,
}

impl RAGSystem {
    fn new() -> Self {
        RAGSystem {
            documents: Vec::new(),
        }
    }

    fn add_document(&mut self, id: String, content: String) {
        // In production, compute embeddings using embedding model
        let doc = Document {
            id,
            content,
            embedding: vec![0.0; 1536], // placeholder
        };
        self.documents.push(doc);
    }

    fn retrieve_relevant_documents(&self, query: &str, k: usize) -> Vec<String> {
        // Simple keyword matching (in production: use similarity search)
        self.documents
            .iter()
            .filter(|doc| doc.content.to_lowercase().contains(&query.to_lowercase()))
            .take(k)
            .map(|doc| doc.content.clone())
            .collect()
    }

    async fn answer_with_context(&self, question: &str) -> Result<String> {
        // Retrieve relevant documents
        let context_docs = self.retrieve_relevant_documents(question, 3);
        let context = context_docs.join("\n---\n");

        // Build augmented prompt
        let augmented_prompt = format!(
            "Context:\n{}\n\nQuestion: {}\n\nAnswer based on the context above:",
            context, question
        );

        chat_with_gpt(&augmented_prompt).await
    }
}
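The keyword filter above is a stand-in: real retrieval ranks documents by embedding similarity, almost always cosine similarity. It needs nothing beyond the standard library:

```rust
/// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
/// Returns 0.0 for mismatched lengths or zero-magnitude vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

A similarity-based `retrieve_relevant_documents` would embed the query, score every document with this function, sort descending, and take the top k; for large corpora a vector database does the same thing with an approximate index.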

#[tokio::main]
async fn main() -> Result<()> {
    let mut rag = RAGSystem::new();
    
    // Add documents to knowledge base
    rag.add_document(
        "rust-ownership".to_string(),
        "Rust uses ownership to manage memory safely...".to_string(),
    );

    // Ask question with context
    let answer = rag.answer_with_context("How does Rust manage memory?").await?;
    println!("Answer: {}", answer);
    
    Ok(())
}

Local LLM Inference with Ollama

Using Ollama for Local Models

# Install and run Ollama
# https://ollama.ai

# Pull a model
ollama pull llama2

# Model runs on http://localhost:11434

Rust Integration with Local Ollama

use reqwest::Client;
use serde::{Deserialize, Serialize};

#[derive(Serialize)]
struct OllamaRequest {
    model: String,
    prompt: String,
    stream: bool,
}

#[derive(Deserialize)]
struct OllamaResponse {
    response: String,
    done: bool,
}

async fn query_local_llm(prompt: &str) -> Result<String> {
    let client = Client::new();
    
    let request = OllamaRequest {
        model: "llama2".to_string(),
        prompt: prompt.to_string(),
        stream: false,
    };

    let response = client
        .post("http://localhost:11434/api/generate")
        .json(&request)
        .send()
        .await?;

    let ollama_response: OllamaResponse = response.json().await?;
    Ok(ollama_response.response)
}

#[tokio::main]
async fn main() -> Result<()> {
    let response = query_local_llm("Write a Rust function that returns the sum of two numbers").await?;
    println!("{}", response);
    Ok(())
}

Common Pitfalls & Best Practices

1. Ignoring Rate Limits

โŒ Bad: No rate limit handling

for prompt in prompts {
    let _ = chat_with_gpt(&prompt).await; // Hits rate limits
}

✅ Good: Implement rate limiting

use tokio::sync::Semaphore;
use std::sync::Arc;

let semaphore = Arc::new(Semaphore::new(10)); // Max 10 concurrent

for prompt in prompts {
    let sem = semaphore.clone();
    tokio::spawn(async move {
        let _permit = sem.acquire().await.expect("semaphore closed");
        let _ = chat_with_gpt(&prompt).await;
    });
}

2. Hardcoding API Keys

โŒ Bad: Keys in source code

let api_key = "sk-1234567890abcdef";

✅ Good: Environment variables

let api_key = std::env::var("OPENAI_API_KEY")
    .expect("OPENAI_API_KEY not set");

3. No Error Recovery

โŒ Bad: Single attempt, no fallback

let response = chat_with_gpt(prompt).await?; // Panics on failure

✅ Good: Retry logic with fallbacks

let response = call_llm_with_retry(prompt, 3)
    .await
    .unwrap_or_else(|_| "Sorry, I couldn't process that.".to_string());

4. Unbounded Token Usage

โŒ Bad: No token limits

let request = ChatRequest {
    max_tokens: 100_000, // Expensive!
    // ...
};

✅ Good: Set reasonable limits

let request = ChatRequest {
    max_tokens: 2000, // Reasonable for most tasks
    // ...
};

5. Not Validating Input

โŒ Bad: Direct user input to LLM

let response = chat_with_gpt(&user_input).await?; // Could be malicious

✅ Good: Validate and sanitize

if user_input.len() > 5000 {
    anyhow::bail!("Input too long");
}

let sanitized = user_input.trim();
let response = chat_with_gpt(sanitized).await?;
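The length cap above is illustrative; the checks can be pulled into one reusable validator. A sketch (the function name and limits are my own, not a standard API):

```rust
/// Basic input hygiene before forwarding text to an LLM: trim whitespace,
/// strip control characters (keeping newlines and tabs), enforce a byte cap.
fn sanitize_input(input: &str, max_len: usize) -> Result<String, String> {
    let cleaned: String = input
        .trim()
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();
    if cleaned.is_empty() {
        return Err("input is empty".to_string());
    }
    if cleaned.len() > max_len {
        return Err(format!("input too long: {} > {} bytes", cleaned.len(), max_len));
    }
    Ok(cleaned)
}
```

Note this guards against malformed input, not prompt injection; the latter needs defenses at the prompt and tool-use level, not string filtering.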

6. Synchronous Blocking in Async Code

โŒ Bad: Blocking operation in async

async fn process_llm_response(prompt: &str) -> Result<String> {
    let response = chat_with_gpt(prompt).await?;
    std::thread::sleep(Duration::from_secs(10)); // Blocks the executor thread!
    Ok(response)
}

✅ Good: Use async alternatives

async fn process_llm_response(prompt: &str) -> Result<String> {
    let response = chat_with_gpt(prompt).await?;
    tokio::time::sleep(Duration::from_secs(10)).await; // Non-blocking
    Ok(response)
}

Rust vs. Alternatives for LLM Integration

| Aspect               | Rust                            | Python                    | Node.js              | Go            |
|----------------------|---------------------------------|---------------------------|----------------------|---------------|
| LLM Library Support  | Growing (langchain-rust, llm-rs) | Excellent (official SDKs) | Good (official SDKs) | Limited       |
| Async Performance    | Excellent                       | Good (asyncio)            | Excellent            | Good          |
| Type Safety          | Exceptional                     | Weak                      | Weak                 | Good          |
| Memory Usage         | Low (no GC)                     | High (GC)                 | High (GC)            | Low           |
| Production Readiness | ✅ Mature                       | ✅ Mature                 | ✅ Mature            | ⚠️ Growing    |
| Development Speed    | Medium                          | Fast                      | Fast                 | Medium        |
| Concurrency Model    | async/await                     | asyncio/threading         | Event loop           | goroutines    |
| Binary Size          | 2-10 MB                         | N/A (interpreted)         | N/A (runtime)        | 5-15 MB       |
| Deployment           | Single binary                   | Requires Python           | Requires Node.js     | Single binary |

When to Choose Rust for LLM Applications

✅ Use Rust when:

  • Performance and latency are critical
  • You need a self-contained binary
  • Memory efficiency matters (edge devices)
  • Building infrastructure/services
  • Type safety is important for correctness

✅ Use Python when:

  • Rapid prototyping is essential
  • Best LLM library ecosystem needed
  • Data science integration required
  • Team expertise is Python-focused

✅ Use Node.js when:

  • Full-stack JavaScript is preferred
  • Real-time streaming is critical
  • Web framework integration needed

Production Deployment Considerations

Docker Deployment

FROM rust:latest as builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
# ca-certificates is needed for TLS connections to the API
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/llm-app /usr/local/bin/
# Supply the key at runtime (docker run -e OPENAI_API_KEY=...) rather than
# baking it into an image layer with ENV
CMD ["llm-app"]

Environment Configuration

use std::env;

#[derive(Debug)]
struct Config {
    api_key: String,
    model: String,
    max_tokens: u32,
    temperature: f32,
}

impl Config {
    fn from_env() -> Result<Self, env::VarError> {
        Ok(Config {
            api_key: env::var("OPENAI_API_KEY")?,
            model: env::var("MODEL")
                .unwrap_or_else(|_| "gpt-4-turbo".to_string()),
            max_tokens: env::var("MAX_TOKENS")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(2000),
            temperature: env::var("TEMPERATURE")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(0.7),
        })
    }
}
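Parsed values are worth validating as well: an out-of-range temperature will simply be rejected by the API at request time. A hedged sketch of post-parse checks (the 0.0–2.0 temperature range follows OpenAI's documented bound; other providers differ):

```rust
/// Clamp temperature to the range OpenAI-style APIs accept.
fn validate_temperature(t: f32) -> f32 {
    t.clamp(0.0, 2.0)
}

/// Cap the requested completion length at the model's context window.
fn validate_max_tokens(requested: u32, context_window: u32) -> u32 {
    requested.min(context_window)
}
```

Calling these at the end of `Config::from_env` means a typo in an environment variable degrades gracefully instead of producing API errors in production.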

Resources & Learning Materials

Rust LLM Libraries

  • langchain-rust - LangChain port to Rust
  • llm-rs - Local LLM inference
  • ort - ONNX Runtime for inference
  • candle - HuggingFace’s ML framework in Rust

Useful Crates

  • reqwest - HTTP client with async support
  • tokio - Async runtime
  • serde/serde_json - Serialization
  • anyhow - Error handling
  • dotenv - Environment variables
  • tracing - Structured logging
  • tokio-stream - Streaming primitives
  • futures - Future combinators

Conclusion

Integrating LLMs with Rust enables you to build AI-powered applications that are simultaneously safe, fast, and maintainable. Whether you're building chatbot backends, content generation pipelines, or real-time analysis systems, Rust provides the tools necessary for production-grade implementations.

The key advantages are clear:

  • Performance: Latency on par with C/C++
  • Safety: Memory safety prevents entire classes of bugs
  • Concurrency: Async/await handles thousands of concurrent requests efficiently
  • Deployment: Single compiled binary with no runtime dependencies

As the Rust AI ecosystem matures, with libraries like candle, llm-rs, and langchain-rust, building sophisticated AI applications in Rust transitions from "possible" to "preferable" for many use cases.

Start with API-based integration for rapid prototyping, then move to local inference for privacy-sensitive applications. In both cases, Rust provides the foundation for applications that scale confidently from prototype to production.
