Building LLM Inference Engines with Rust: Candle and Llama.rs

Run Large Language Models Efficiently with Rust

Large Language Models (LLMs) are reshaping how we build AI applications. But running them efficiently in production is challenging. Python frameworks like PyTorch and Transformers are flexible but carry overhead. Enter Rust: a language designed for performance, memory safety, and reliability.

In this post, we’ll explore two powerful Rust libraries for LLM inference: Candle (Hugging Face’s lightweight ML framework) and Llama.rs (a pure Rust implementation of the LLaMA architecture). We’ll compare them, build examples, and show you when to use each.

Why Rust for LLM Inference?

Before we dive into specific libraries, let’s understand why Rust is compelling for LLM inference:

Performance

  • No garbage collector: Predictable latency, critical for serving requests
  • Zero-cost abstractions: SIMD operations, vectorization, and optimization at compile time
  • Memory efficiency: Tight control over allocations means lower latency spikes

Reliability

  • Type safety: Catch bugs at compile time, not during inference
  • Thread safety: Safe concurrency for multi-user serving
  • Error handling: Explicit error handling prevents silent failures
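
To make the last point concrete, here is a dependency-free sketch: a fallible vocabulary lookup whose signature forces callers to handle a miss, instead of returning a silent sentinel value (the names are illustrative, not from any real tokenizer):

```rust
// A fallible tokenization step: the signature itself advertises failure.
fn token_id(vocab: &[&str], token: &str) -> Result<usize, String> {
    vocab
        .iter()
        .position(|t| *t == token)
        .ok_or_else(|| format!("token {token:?} not in vocabulary"))
}

fn main() {
    let vocab = ["<s>", "once", "upon", "a", "time"];

    // The compiler will not let us use the result without handling the Err arm.
    match token_id(&vocab, "once") {
        Ok(id) => println!("id = {id}"),
        Err(e) => eprintln!("lookup failed: {e}"),
    }

    // An unknown token produces an explicit Err, never a silent -1 or a panic.
    assert!(token_id(&vocab, "dragon").is_err());
}
```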

Production-Ready

  • Single binary: No runtime dependencies, easy deployment
  • Minimal footprint: Deploy on edge devices, embedded systems, or serverless
  • Cross-platform: Compile once, run on Linux, macOS, Windows, ARM

Candle: Hugging Face’s Lightweight ML Framework

What Is Candle?

Candle is a minimal, idiomatic Rust ML framework developed by Hugging Face. It’s designed for inference with a focus on simplicity and performance. Unlike full frameworks, Candle doesn’t aim to be comprehensive; it’s purpose-built for deployment.

Key Features

  • Tensor operations: Broadcasting, slicing, reshaping
  • Pre-trained models: Direct support for Hugging Face models
  • Hardware acceleration: CPU, CUDA, and Metal (Apple Silicon) support
  • Minimal dependencies: Pure Rust core with optional accelerators

When to Use Candle

  • Running inference with pre-trained models
  • Deploying LLMs on servers or embedded devices
  • Building a lightweight inference API
  • You want tight control over the computation graph

Getting Started with Candle

First, add Candle to your Cargo.toml:

[dependencies]
candle-core = "0.3"
candle-transformers = "0.3"
candle-nn = "0.3"
tokenizers = "0.13"
anyhow = "1"
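
GPU backends are opt-in Cargo features rather than separate crates. The feature names below reflect candle 0.3; check the candle README for the release you use:

```toml
# Enable a GPU backend (optional; pick at most one):
candle-core = { version = "0.3", features = ["cuda"] }    # NVIDIA CUDA
# candle-core = { version = "0.3", features = ["metal"] } # Apple Silicon
```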

Example 1: Basic Tensor Operations

use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    // Create a device (CPU)
    let device = Device::Cpu;

    // Create a tensor from data
    let data: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
    let tensor = Tensor::new(data.as_slice(), &device)?
        .reshape((2, 2))?;

    println!("Tensor:\n{}", tensor);

    // Matrix multiplication
    let result = tensor.matmul(&tensor)?;
    println!("Result:\n{}", result);

    Ok(())
}

Output:

Tensor:
[[1.0, 2.0],
 [3.0, 4.0]]
Result:
[[7.0, 10.0],
 [15.0, 22.0]]
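
The numbers are easy to verify by hand; here is a dependency-free check of the same 2x2 product, no candle required:

```rust
// [[1, 2],    [[1*1 + 2*3, 1*2 + 2*4],    [[ 7, 10],
//  [3, 4]] ->  [3*1 + 4*3, 3*2 + 4*4]] =   [15, 22]]
fn matmul2(a: [[f64; 2]; 2], b: [[f64; 2]; 2]) -> [[f64; 2]; 2] {
    let mut c = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..2 {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn main() {
    let t = [[1.0, 2.0], [3.0, 4.0]];
    assert_eq!(matmul2(t, t), [[7.0, 10.0], [15.0, 22.0]]);
    println!("matches the Candle output above");
}
```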

Example 2: Loading and Running a Model

use candle_core::{DType, Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::models::llama::{Cache, Config, Llama};
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    // API shown as of candle 0.3; details shift between releases
    let device = Device::Cpu;

    // Load tokenizer (tokenizers returns a boxed error, so map it for anyhow)
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(anyhow::Error::msg)?;

    // Memory-map the model weights (safetensors format)
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F32, &device)?
    };

    // Model config: in practice, deserialize your checkpoint's config.json;
    // the built-in 7B config is used here for brevity
    let config = Config::config_7b_v2(false);
    let cache = Cache::new(true, DType::F32, &config, &device)?;
    let model = Llama::load(vb, &cache, &config)?;

    // Tokenize input
    let input = "Once upon a time";
    let tokens = tokenizer.encode(input, true).map_err(anyhow::Error::msg)?;
    let token_ids = tokens.get_ids().to_vec();

    // Run inference on a (batch = 1, seq_len) tensor of token ids
    let input_tensor = Tensor::new(token_ids.as_slice(), &device)?.unsqueeze(0)?;
    let logits = model.forward(&input_tensor, 0)?;

    // Greedy choice: the highest-scoring token is the prediction
    let predicted = logits.argmax(candle_core::D::Minus1)?;
    println!("Next token ID: {:?}", predicted);

    Ok(())
}
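
The final argmax is greedy decoding: always take the highest-scoring token. Stripped of tensors, the operation is just this (a framework-free sketch):

```rust
// Greedy decoding: return the index of the largest logit.
fn greedy(logits: &[f32]) -> usize {
    assert!(!logits.is_empty(), "logits must be non-empty");
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate().skip(1) {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

fn main() {
    // A toy 5-token vocabulary; index 3 has the highest score.
    let logits = [0.1_f32, -1.2, 0.7, 2.5, 0.3];
    assert_eq!(greedy(&logits), 3);
    println!("next token id: {}", greedy(&logits));
}
```

A real generation loop appends the chosen token to the context and calls forward again; temperature or top-p sampling replaces the argmax when you want varied output.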

Pros of Candle

  • ✅ Lightweight and fast
  • ✅ Direct Hugging Face integration
  • ✅ CPU and GPU support
  • ✅ Minimal boilerplate
  • ✅ Actively maintained by Hugging Face

Cons of Candle

  • โŒ Newer ecosystem (less community content)
  • โŒ Fewer pre-built models compared to Python
  • โŒ Learning curve for Rust beginners

Llama.rs: Pure Rust LLaMA Implementation

What Is Llama.rs?

Llama.rs is a pure Rust implementation of the LLaMA model architecture. It brings the simplicity of single-file implementations (inspired by llama.cpp) to Rust, with safety guarantees.

Key Features

  • Self-contained: Single library, no external dependencies
  • GGML format: Supports quantized models from llama.cpp ecosystem
  • Memory efficient: Optimized for inference on limited hardware
  • Rust idiomatic: Leverages Rust’s type system and safety

When to Use Llama.rs

  • Running LLaMA or compatible models
  • You want a standalone, dependency-light solution
  • Deploying on resource-constrained devices
  • You prefer pure Rust implementations

Getting Started with Llama.rs

Add to Cargo.toml:

[dependencies]
llama-rs = "0.1"

Example 1: Basic Model Loading and Inference

// API sketch in the style of llama-rs (later folded into the `llm` crate);
// item names vary between releases, so check the docs for your version.
use std::convert::Infallible;
use std::io::Write;

fn main() -> anyhow::Result<()> {
    // Load a quantized model from the llama.cpp ecosystem
    let model = llama_rs::load_model("model.gguf")?;

    // Create an inference session (owns the context and KV cache)
    let mut session = model.start_session(Default::default());

    // Prepare input
    let prompt = "Once upon a time, there was a";
    let mut output = String::new();

    // Run inference; generated tokens are streamed through the callback
    session.infer::<Infallible>(
        &model,
        &mut rand::thread_rng(), // sampling RNG (rand crate)
        &llama_rs::InferenceRequest {
            prompt: prompt.into(),
            parameters: Default::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(100),
        },
        &mut Default::default(),
        |event| match event {
            llama_rs::InferenceResponse::InferredToken(token) => {
                print!("{token}");
                std::io::stdout().flush().ok();
                output.push_str(&token);
                Ok(llama_rs::InferenceFeedback::Continue)
            }
            _ => Ok(llama_rs::InferenceFeedback::Continue),
        },
    )?;

    println!("\n\nFull output:\n{output}");
    Ok(())
}

Example 2: Streaming Inference

// API sketch; exact item names vary between llama-rs / llm releases.
use std::convert::Infallible;
use std::io::Write;

fn main() -> anyhow::Result<()> {
    let model = llama_rs::load_model("model.gguf")?;
    let mut session = model.start_session(Default::default());

    let prompt = "Explain quantum computing in simple terms:";

    session.infer::<Infallible>(
        &model,
        &mut rand::thread_rng(),
        &llama_rs::InferenceRequest {
            prompt: prompt.into(),
            parameters: Default::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(512),
        },
        &mut Default::default(),
        |event| match event {
            // Stream tokens to stdout as they are generated
            llama_rs::InferenceResponse::InferredToken(token) => {
                print!("{token}");
                std::io::stdout().flush().ok();
                Ok(llama_rs::InferenceFeedback::Continue)
            }
            _ => Ok(llama_rs::InferenceFeedback::Continue),
        },
    )?;

    println!("\n[End of generation]");
    Ok(())
}

Pros of Llama.rs

  • ✅ Pure Rust, minimal dependencies
  • ✅ Direct support for GGML quantized models
  • ✅ Excellent for embedded and edge devices
  • ✅ Single-file simplicity
  • ✅ Memory-efficient inference

Cons of Llama.rs

  • โŒ Specialized for LLaMA architecture
  • โŒ Smaller ecosystem than Candle
  • โŒ Limited to quantized models from llama.cpp

Head-to-Head Comparison

Aspect          Candle                  Llama.rs
Scope           General ML inference    LLaMA-specific
Model Support   Any architecture        LLaMA, derivatives
Ease of Use     Moderate                Very easy
Dependencies    Minimal (optional GPU)  Minimal
Performance     Excellent               Excellent
Community       Growing                 Smaller
GPU Support     CUDA, Metal             CPU-focused
Deployment      Production-grade        Production-grade
Learning Curve  Steeper                 Gentler

Practical Decision Tree

Choose Candle if:

  • You need flexibility across multiple model architectures
  • You want GPU acceleration (CUDA, Metal)
  • You’re building a general inference service
  • You prefer a larger ecosystem

Choose Llama.rs if:

  • You’re exclusively working with LLaMA models
  • You want maximum simplicity and minimal dependencies
  • You’re targeting embedded or edge devices
  • You prefer GGML quantized models

Building a Production Inference Server

Here’s a practical example: a simple HTTP server for LLM inference using Candle and Axum:

use axum::{
    extract::{Json, State},
    http::StatusCode,
    routing::post,
    Router,
};
use candle_core::Device;
use serde::{Deserialize, Serialize};
use std::sync::Arc;

#[derive(Deserialize)]
struct InferenceRequest {
    prompt: String,
    max_tokens: Option<usize>,
}

#[derive(Serialize)]
struct InferenceResponse {
    generated_text: String,
}

// Shared state: the device and (eventually) the loaded model live here,
// so they are initialized once at startup, not per request.
struct AppState {
    device: Device,
    // model: Llama, tokenizer, etc.
}

async fn infer(
    State(_state): State<Arc<AppState>>,
    Json(req): Json<InferenceRequest>,
) -> Result<Json<InferenceResponse>, StatusCode> {
    // TODO: tokenize req.prompt, run the model, decode up to req.max_tokens
    let _ = (&req.prompt, req.max_tokens);
    Ok(Json(InferenceResponse {
        generated_text: "Generated text...".to_string(),
    }))
}

#[tokio::main]
async fn main() {
    let state = Arc::new(AppState { device: Device::Cpu });

    let app = Router::new()
        .route("/infer", post(infer))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000")
        .await
        .unwrap();

    axum::serve(listener, app).await.unwrap();
}

Optimization Tips

1. Quantization

Use quantized models (4-bit, 8-bit) to reduce memory and speed up inference:

# With llama.cpp: convert the weights to f16, then quantize to 4-bit
python convert.py /path/to/model --outtype f16
./quantize ggml-model-f16.gguf ggml-model-q4_0.gguf q4_0
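
The savings are easy to estimate, assuming ggml’s q4_0 block layout (32 weights stored as 4-bit values plus one f16 scale, i.e. 18 bytes per block, about 4.5 bits per weight). Back-of-envelope arithmetic, not a measurement:

```rust
// Bytes needed for `params` weights at f16 (2 bytes each) vs q4_0
// (blocks of 32 weights: 16 bytes of nibbles + a 2-byte f16 scale).
fn f16_bytes(params: u64) -> u64 {
    params * 2
}

fn q4_0_bytes(params: u64) -> u64 {
    params * 18 / 32
}

fn main() {
    let params: u64 = 7_000_000_000; // a 7B-parameter model
    println!("f16:  {:.1} GB", f16_bytes(params) as f64 / 1e9);  // 14.0 GB
    println!("q4_0: {:.1} GB", q4_0_bytes(params) as f64 / 1e9); // 3.9 GB
}
```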

2. Batch Inference

Process multiple prompts in parallel:

let prompts = vec!["Prompt 1", "Prompt 2", "Prompt 3"];
for prompt in prompts {
    // Process in parallel with rayon or tokio
}
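
One dependency-free way to sketch that fan-out is std::thread::scope (Rust 1.63+); run_inference is a hypothetical stand-in for your actual model call:

```rust
use std::thread;

// Stand-in for a real model call; hypothetical, for illustration only.
fn run_inference(prompt: &str) -> String {
    format!("completion for {prompt:?}")
}

fn main() {
    let prompts = ["Prompt 1", "Prompt 2", "Prompt 3"];

    // Scoped threads: one worker per prompt, results joined in input order.
    let outputs: Vec<String> = thread::scope(|s| {
        let handles: Vec<_> = prompts
            .iter()
            .map(|p| s.spawn(move || run_inference(p)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    assert_eq!(outputs.len(), 3);
    println!("{outputs:?}");
}
```

For a real server you would more likely use rayon’s par_iter or tokio tasks, but the shape is the same: independent prompts map cleanly onto independent workers.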

3. Model Caching

Keep models in memory between requests:

use once_cell::sync::Lazy;
use std::sync::Mutex;

// `Model` and `load_model` stand in for your framework's types; the point
// is that the model is loaded once, on first access, then shared.
static MODEL: Lazy<Mutex<Model>> = Lazy::new(|| {
    Mutex::new(load_model().expect("failed to load model"))
});

4. KV Cache Reuse

Reuse key-value caches across generations for faster decoding.
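
The idea can be sketched without any ML code: track which token prefix has already been processed, and only run the model over the new suffix. PrefixCache below is a toy illustration of the bookkeeping, not a real attention cache:

```rust
// Conceptual sketch of KV-cache reuse: if a new prompt extends a
// previously processed one, only the new suffix needs a forward pass.
struct PrefixCache {
    tokens: Vec<u32>, // token ids whose keys/values are already computed
}

impl PrefixCache {
    /// How many leading tokens of `prompt` are already cached,
    /// i.e. how many forward steps can be skipped.
    fn reusable_prefix(&self, prompt: &[u32]) -> usize {
        self.tokens
            .iter()
            .zip(prompt)
            .take_while(|(a, b)| a == b)
            .count()
    }
}

fn main() {
    let cache = PrefixCache { tokens: vec![1, 2, 3, 4] };

    // Same conversation with two tokens appended: skip the first 4 steps.
    assert_eq!(cache.reusable_prefix(&[1, 2, 3, 4, 5, 6]), 4);

    // A different prompt: only the shared prefix [1, 2] is reusable.
    assert_eq!(cache.reusable_prefix(&[1, 2, 9]), 2);

    println!("prefix reuse works as expected");
}
```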

Ecosystem Beyond Candle and Llama.rs

  • Burn: Native Rust deep learning framework with dynamic graphs
  • Ort: ONNX Runtime bindings for interoperability
  • Tch-rs: PyTorch bindings for advanced models
  • hf-hub: Download models from the Hugging Face Hub

Conclusion

Rust is ready for LLM inference. Whether you choose Candle for versatility or Llama.rs for simplicity, you’ll get performance and reliability that Python simply can’t match.

The choice depends on your priorities:

  • Flexibility + GPU support → Candle
  • Simplicity + Edge deployment → Llama.rs

As Rust’s ML ecosystem matures, expect more specialized libraries and better tooling. For now, these two frameworks offer a solid foundation for building production-grade LLM inference systems.
