Building LLM Inference Engines with Rust: Candle and Llama.rs

Run Large Language Models Efficiently with Rust

Large Language Models (LLMs) are reshaping how we build AI applications. But running them efficiently in production is challenging. Python frameworks like PyTorch and Transformers are flexible but carry overhead. Enter Rust: a language designed for performance, memory safety, and reliability.

In this post, we’ll explore two powerful Rust libraries for LLM inference: Candle (Hugging Face’s lightweight ML framework) and Llama.rs (a pure Rust implementation of the LLaMA architecture). We’ll compare them, build examples, and show you when to use each.

Why Rust for LLM Inference?

Before we dive into specific libraries, let’s understand why Rust is compelling for LLM inference:

Performance

  • No garbage collector: Predictable latency, critical for serving requests
  • Zero-cost abstractions: SIMD operations, vectorization, and optimization at compile time
  • Memory efficiency: Tight control over allocations means lower latency spikes

Reliability

  • Type safety: Catch bugs at compile time, not during inference
  • Thread safety: Safe concurrency for multi-user serving
  • Error handling: Explicit error handling prevents silent failures
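
To make the last point concrete, here is a dependency-free sketch: a fallible vocabulary lookup whose signature forces callers to handle a miss, instead of returning a silent sentinel value (the names are illustrative, not from any real tokenizer):

```rust
// A fallible tokenization step: the signature itself advertises failure.
fn token_id(vocab: &[&str], token: &str) -> Result<usize, String> {
    vocab
        .iter()
        .position(|t| *t == token)
        .ok_or_else(|| format!("token {token:?} not in vocabulary"))
}

fn main() {
    let vocab = ["<s>", "once", "upon", "a", "time"];

    // The compiler will not let us use the result without handling the Err arm.
    match token_id(&vocab, "once") {
        Ok(id) => println!("id = {id}"),
        Err(e) => eprintln!("lookup failed: {e}"),
    }

    // An unknown token produces an explicit Err, never a silent -1 or a panic.
    assert!(token_id(&vocab, "dragon").is_err());
}
```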

Production-Ready

  • Single binary: No runtime dependencies, easy deployment
  • Minimal footprint: Deploy on edge devices, embedded systems, or serverless
  • Cross-platform: Compile once, run on Linux, macOS, Windows, ARM

Candle: Hugging Face’s Lightweight ML Framework

What Is Candle?

Candle is a minimal, idiomatic Rust ML framework developed by Hugging Face. It’s designed for inference with a focus on simplicity and performance. Unlike full frameworks, Candle doesn’t aim to be comprehensive; it’s purpose-built for deployment.

Key Features

  • Tensor operations: Broadcasting, slicing, reshaping
  • Pre-trained models: Direct support for Hugging Face models
  • Hardware acceleration: CPU, CUDA, and Metal (Apple Silicon) support
  • Minimal dependencies: Pure Rust core with optional accelerators

When to Use Candle

  • Running inference with pre-trained models
  • Deploying LLMs on servers or embedded devices
  • Building a lightweight inference API
  • You want tight control over the computation graph

Getting Started with Candle

First, add Candle to your Cargo.toml:

[dependencies]
candle-core = "0.3"
candle-transformers = "0.3"
candle-nn = "0.3"
tokenizers = "0.13"
anyhow = "1"
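
GPU backends are opt-in Cargo features rather than separate crates. The feature names below reflect candle 0.3; check the candle README for the release you use:

```toml
# Enable a GPU backend (optional; pick at most one):
candle-core = { version = "0.3", features = ["cuda"] }    # NVIDIA CUDA
# candle-core = { version = "0.3", features = ["metal"] } # Apple Silicon
```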

Example 1: Basic Tensor Operations

use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    // Create a device (CPU)
    let device = Device::Cpu;

    // Create a tensor from data
    let data: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
    let tensor = Tensor::new(data.as_slice(), &device)?
        .reshape((2, 2))?;

    println!("Tensor:\n{}", tensor);

    // Matrix multiplication
    let result = tensor.matmul(&tensor)?;
    println!("Result:\n{}", result);

    Ok(())
}

Output:

Tensor:
[[1.0, 2.0],
 [3.0, 4.0]]
Result:
[[7.0, 10.0],
 [15.0, 22.0]]
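
The numbers are easy to verify by hand; here is a dependency-free check of the same 2x2 product, no candle required:

```rust
// [[1, 2],    [[1*1 + 2*3, 1*2 + 2*4],    [[ 7, 10],
//  [3, 4]] ->  [3*1 + 4*3, 3*2 + 4*4]] =   [15, 22]]
fn matmul2(a: [[f64; 2]; 2], b: [[f64; 2]; 2]) -> [[f64; 2]; 2] {
    let mut c = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..2 {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn main() {
    let t = [[1.0, 2.0], [3.0, 4.0]];
    assert_eq!(matmul2(t, t), [[7.0, 10.0], [15.0, 22.0]]);
    println!("matches the Candle output above");
}
```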

Example 2: Loading and Running a Model

use candle_core::{DType, Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::models::llama::{Cache, Config, Llama};
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    // API shown as of candle 0.3; details shift between releases
    let device = Device::Cpu;

    // Load tokenizer (tokenizers returns a boxed error, so map it for anyhow)
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(anyhow::Error::msg)?;

    // Memory-map the model weights (safetensors format)
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F32, &device)?
    };

    // Model config: in practice, deserialize your checkpoint's config.json;
    // the built-in 7B config is used here for brevity
    let config = Config::config_7b_v2(false);
    let cache = Cache::new(true, DType::F32, &config, &device)?;
    let model = Llama::load(vb, &cache, &config)?;

    // Tokenize input
    let input = "Once upon a time";
    let tokens = tokenizer.encode(input, true).map_err(anyhow::Error::msg)?;
    let token_ids = tokens.get_ids().to_vec();

    // Run inference on a (batch = 1, seq_len) tensor of token ids
    let input_tensor = Tensor::new(token_ids.as_slice(), &device)?.unsqueeze(0)?;
    let logits = model.forward(&input_tensor, 0)?;

    // Greedy choice: the highest-scoring token is the prediction
    let predicted = logits.argmax(candle_core::D::Minus1)?;
    println!("Next token ID: {:?}", predicted);

    Ok(())
}
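
The final argmax is greedy decoding: always take the highest-scoring token. Stripped of tensors, the operation is just this (a framework-free sketch):

```rust
// Greedy decoding: return the index of the largest logit.
fn greedy(logits: &[f32]) -> usize {
    assert!(!logits.is_empty(), "logits must be non-empty");
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate().skip(1) {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

fn main() {
    // A toy 5-token vocabulary; index 3 has the highest score.
    let logits = [0.1_f32, -1.2, 0.7, 2.5, 0.3];
    assert_eq!(greedy(&logits), 3);
    println!("next token id: {}", greedy(&logits));
}
```

A real generation loop appends the chosen token to the context and calls forward again; temperature or top-p sampling replaces the argmax when you want varied output.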

Pros of Candle

  • ✅ Lightweight and fast
  • ✅ Direct Hugging Face integration
  • ✅ CPU and GPU support
  • ✅ Minimal boilerplate
  • ✅ Actively maintained by Hugging Face

Cons of Candle

  • โŒ Newer ecosystem (less community content)
  • โŒ Fewer pre-built models compared to Python
  • โŒ Learning curve for Rust beginners

Llama.rs: Pure Rust LLaMA Implementation

What Is Llama.rs?

Llama.rs is a pure Rust implementation of the LLaMA model architecture. It brings the simplicity of single-file implementations (inspired by llama.cpp) to Rust, with safety guarantees.

Key Features

  • Self-contained: Single library, no external dependencies
  • GGML format: Supports quantized models from llama.cpp ecosystem
  • Memory efficient: Optimized for inference on limited hardware
  • Rust idiomatic: Leverages Rust’s type system and safety

When to Use Llama.rs

  • Running LLaMA or compatible models
  • You want a standalone, dependency-light solution
  • Deploying on resource-constrained devices
  • You prefer pure Rust implementations

Getting Started with Llama.rs

Add to Cargo.toml:

[dependencies]
llama-rs = "0.1"

Example 1: Basic Model Loading and Inference

// API sketch in the style of llama-rs (later folded into the `llm` crate);
// item names vary between releases, so check the docs for your version.
use std::convert::Infallible;
use std::io::Write;

fn main() -> anyhow::Result<()> {
    // Load a quantized model from the llama.cpp ecosystem
    let model = llama_rs::load_model("model.gguf")?;

    // Create an inference session (owns the context and KV cache)
    let mut session = model.start_session(Default::default());

    // Prepare input
    let prompt = "Once upon a time, there was a";
    let mut output = String::new();

    // Run inference; generated tokens are streamed through the callback
    session.infer::<Infallible>(
        &model,
        &mut rand::thread_rng(), // sampling RNG (rand crate)
        &llama_rs::InferenceRequest {
            prompt: prompt.into(),
            parameters: Default::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(100),
        },
        &mut Default::default(),
        |event| match event {
            llama_rs::InferenceResponse::InferredToken(token) => {
                print!("{token}");
                std::io::stdout().flush().ok();
                output.push_str(&token);
                Ok(llama_rs::InferenceFeedback::Continue)
            }
            _ => Ok(llama_rs::InferenceFeedback::Continue),
        },
    )?;

    println!("\n\nFull output:\n{output}");
    Ok(())
}

Example 2: Streaming Inference

// API sketch; exact item names vary between llama-rs / llm releases.
use std::convert::Infallible;
use std::io::Write;

fn main() -> anyhow::Result<()> {
    let model = llama_rs::load_model("model.gguf")?;
    let mut session = model.start_session(Default::default());

    let prompt = "Explain quantum computing in simple terms:";

    session.infer::<Infallible>(
        &model,
        &mut rand::thread_rng(),
        &llama_rs::InferenceRequest {
            prompt: prompt.into(),
            parameters: Default::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(512),
        },
        &mut Default::default(),
        |event| match event {
            // Stream tokens to stdout as they are generated
            llama_rs::InferenceResponse::InferredToken(token) => {
                print!("{token}");
                std::io::stdout().flush().ok();
                Ok(llama_rs::InferenceFeedback::Continue)
            }
            _ => Ok(llama_rs::InferenceFeedback::Continue),
        },
    )?;

    println!("\n[End of generation]");
    Ok(())
}

Pros of Llama.rs

  • ✅ Pure Rust, minimal dependencies
  • ✅ Direct support for GGML quantized models
  • ✅ Excellent for embedded and edge devices
  • ✅ Single-file simplicity
  • ✅ Memory-efficient inference

Cons of Llama.rs

  • โŒ Specialized for LLaMA architecture
  • โŒ Smaller ecosystem than Candle
  • โŒ Limited to quantized models from llama.cpp

Head-to-Head Comparison

Aspect          Candle                  Llama.rs
Scope           General ML inference    LLaMA-specific
Model Support   Any architecture        LLaMA, derivatives
Ease of Use     Moderate                Very easy
Dependencies    Minimal (optional GPU)  Minimal
Performance     Excellent               Excellent
Community       Growing                 Smaller
GPU Support     CUDA, Metal             CPU-focused
Deployment      Production-grade        Production-grade
Learning Curve  Steeper                 Gentler

Practical Decision Tree

Choose Candle if:

  • You need flexibility across multiple model architectures
  • You want GPU acceleration (CUDA, Metal)
  • You’re building a general inference service
  • You prefer a larger ecosystem

Choose Llama.rs if:

  • You’re exclusively working with LLaMA models
  • You want maximum simplicity and minimal dependencies
  • You’re targeting embedded or edge devices
  • You prefer GGML quantized models

Building a Production Inference Server

Here’s a practical example: a simple HTTP server for LLM inference using Candle and Axum:

use axum::{
    extract::{Json, State},
    http::StatusCode,
    routing::post,
    Router,
};
use candle_core::Device;
use serde::{Deserialize, Serialize};
use std::sync::Arc;

#[derive(Deserialize)]
struct InferenceRequest {
    prompt: String,
    max_tokens: Option<usize>,
}

#[derive(Serialize)]
struct InferenceResponse {
    generated_text: String,
}

// Shared state: the device and (eventually) the loaded model live here,
// so they are initialized once at startup, not per request.
struct AppState {
    device: Device,
    // model: Llama, tokenizer, etc.
}

async fn infer(
    State(_state): State<Arc<AppState>>,
    Json(req): Json<InferenceRequest>,
) -> Result<Json<InferenceResponse>, StatusCode> {
    // TODO: tokenize req.prompt, run the model, decode up to req.max_tokens
    let _ = (&req.prompt, req.max_tokens);
    Ok(Json(InferenceResponse {
        generated_text: "Generated text...".to_string(),
    }))
}

#[tokio::main]
async fn main() {
    let state = Arc::new(AppState { device: Device::Cpu });

    let app = Router::new()
        .route("/infer", post(infer))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000")
        .await
        .unwrap();

    axum::serve(listener, app).await.unwrap();
}

Optimization Tips

1. Quantization

Use quantized models (4-bit, 8-bit) to reduce memory and speed up inference:

# With llama.cpp: convert the weights to f16, then quantize to 4-bit
python convert.py /path/to/model --outtype f16
./quantize ggml-model-f16.gguf ggml-model-q4_0.gguf q4_0
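
The savings are easy to estimate, assuming ggml’s q4_0 block layout (32 weights stored as 4-bit values plus one f16 scale, i.e. 18 bytes per block, about 4.5 bits per weight). Back-of-envelope arithmetic, not a measurement:

```rust
// Bytes needed for `params` weights at f16 (2 bytes each) vs q4_0
// (blocks of 32 weights: 16 bytes of nibbles + a 2-byte f16 scale).
fn f16_bytes(params: u64) -> u64 {
    params * 2
}

fn q4_0_bytes(params: u64) -> u64 {
    params * 18 / 32
}

fn main() {
    let params: u64 = 7_000_000_000; // a 7B-parameter model
    println!("f16:  {:.1} GB", f16_bytes(params) as f64 / 1e9);  // 14.0 GB
    println!("q4_0: {:.1} GB", q4_0_bytes(params) as f64 / 1e9); // 3.9 GB
}
```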

2. Batch Inference

Process multiple prompts in parallel:

let prompts = vec!["Prompt 1", "Prompt 2", "Prompt 3"];
for prompt in prompts {
    // Process in parallel with rayon or tokio
}
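
One dependency-free way to sketch that fan-out is std::thread::scope (Rust 1.63+); run_inference is a hypothetical stand-in for your actual model call:

```rust
use std::thread;

// Stand-in for a real model call; hypothetical, for illustration only.
fn run_inference(prompt: &str) -> String {
    format!("completion for {prompt:?}")
}

fn main() {
    let prompts = ["Prompt 1", "Prompt 2", "Prompt 3"];

    // Scoped threads: one worker per prompt, results joined in input order.
    let outputs: Vec<String> = thread::scope(|s| {
        let handles: Vec<_> = prompts
            .iter()
            .map(|p| s.spawn(move || run_inference(p)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    assert_eq!(outputs.len(), 3);
    println!("{outputs:?}");
}
```

For a real server you would more likely use rayon’s par_iter or tokio tasks, but the shape is the same: independent prompts map cleanly onto independent workers.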

3. Model Caching

Keep models in memory between requests:

use once_cell::sync::Lazy;
use std::sync::Mutex;

// `Model` and `load_model` stand in for your framework's types; the point
// is that the model is loaded once, on first access, then shared.
static MODEL: Lazy<Mutex<Model>> = Lazy::new(|| {
    Mutex::new(load_model().expect("failed to load model"))
});

4. KV Cache Reuse

Reuse key-value caches across generations for faster decoding.
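
The idea can be sketched without any ML code: track which token prefix has already been processed, and only run the model over the new suffix. PrefixCache below is a toy illustration of the bookkeeping, not a real attention cache:

```rust
// Conceptual sketch of KV-cache reuse: if a new prompt extends a
// previously processed one, only the new suffix needs a forward pass.
struct PrefixCache {
    tokens: Vec<u32>, // token ids whose keys/values are already computed
}

impl PrefixCache {
    /// How many leading tokens of `prompt` are already cached,
    /// i.e. how many forward steps can be skipped.
    fn reusable_prefix(&self, prompt: &[u32]) -> usize {
        self.tokens
            .iter()
            .zip(prompt)
            .take_while(|(a, b)| a == b)
            .count()
    }
}

fn main() {
    let cache = PrefixCache { tokens: vec![1, 2, 3, 4] };

    // Same conversation with two tokens appended: skip the first 4 steps.
    assert_eq!(cache.reusable_prefix(&[1, 2, 3, 4, 5, 6]), 4);

    // A different prompt: only the shared prefix [1, 2] is reusable.
    assert_eq!(cache.reusable_prefix(&[1, 2, 9]), 2);

    println!("prefix reuse works as expected");
}
```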

Ecosystem Beyond Candle and Llama.rs

  • Burn: Native Rust deep learning framework with dynamic graphs
  • Ort: ONNX Runtime bindings for interoperability
  • Tch-rs: PyTorch bindings for advanced models
  • hf-hub: Download models from the Hugging Face Hub

Conclusion

Rust is ready for LLM inference. Whether you choose Candle for versatility or Llama.rs for simplicity, you’ll get performance and reliability that Python simply can’t match.

The choice depends on your priorities:

  • Flexibility + GPU support → Candle
  • Simplicity + Edge deployment → Llama.rs

As Rust’s ML ecosystem matures, expect more specialized libraries and better tooling. For now, these two frameworks offer a solid foundation for building production-grade LLM inference systems.
