Large Language Models (LLMs) are reshaping how we build AI applications. But running them efficiently in production is challenging. Python frameworks like PyTorch and Transformers are flexible but carry overhead. Enter Rust: a language designed for performance, memory safety, and reliability.
In this post, we’ll explore two powerful Rust libraries for LLM inference: Candle (Hugging Face’s lightweight ML framework) and Llama.rs (a pure Rust implementation for LLaMA models). We’ll compare them, build examples, and show you when to use each.
Why Rust for LLM Inference?
Before we dive into specific libraries, let’s understand why Rust is compelling for LLM inference:
Performance
- No garbage collector: Predictable latency, critical for serving requests
- Zero-cost abstractions: SIMD operations, vectorization, and optimization at compile time
- Memory efficiency: Tight control over allocations means lower latency spikes
Reliability
- Type safety: Catch bugs at compile time, not during inference
- Thread safety: Safe concurrency for multi-user serving
- Error handling: Explicit error handling prevents silent failures
Production-Ready
- Single binary: No runtime dependencies, easy deployment
- Minimal footprint: Deploy on edge devices, embedded systems, or serverless
- Cross-platform: Compile once, run on Linux, macOS, Windows, ARM
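These guarantees are concrete. Sharing one loaded model across many request threads is a compile-time-checked operation in Rust. The sketch below uses a hypothetical `Model` struct as a stand-in for real loaded weights; only the sharing pattern is the point.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for a loaded LLM: large, read-only state.
struct Model {
    name: String,
}

impl Model {
    fn infer(&self, prompt: &str) -> String {
        format!("[{}] completion for: {}", self.name, prompt)
    }
}

fn main() {
    // Arc gives shared, immutable access across threads; if Model were
    // not thread-safe (Sync), this would fail to compile, not crash at runtime.
    let model = Arc::new(Model { name: "demo-7b".to_string() });

    let handles: Vec<_> = ["hello", "world"]
        .into_iter()
        .map(|prompt| {
            let model = Arc::clone(&model);
            thread::spawn(move || model.infer(prompt))
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```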
Candle: Hugging Face’s Lightweight ML Framework
What Is Candle?
Candle is a minimal, idiomatic Rust ML framework developed by Hugging Face. It’s designed for inference with a focus on simplicity and performance. Unlike full frameworks, Candle doesn’t aim to be comprehensive; it’s purpose-built for deployment.
Key Features
- Tensor operations: Broadcasting, slicing, reshaping
- Pre-trained models: Direct support for Hugging Face models
- Hardware acceleration: CPU, CUDA, and Metal (Apple Silicon) support
- Minimal dependencies: Pure Rust core with optional accelerators
When to Use Candle
- Running inference with pre-trained models
- Deploying LLMs on servers or embedded devices
- Building a lightweight inference API
- You want tight control over the computation graph
Getting Started with Candle
First, add Candle to your Cargo.toml:
[dependencies]
candle-core = "0.3"
candle-transformers = "0.3"
candle-nn = "0.3"
tokenizers = "0.13"
anyhow = "1"
Example 1: Basic Tensor Operations
use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    // Create a device (CPU)
    let device = Device::Cpu;

    // Create a tensor from data (f32 is Candle's most common dtype)
    let data = vec![1.0f32, 2.0, 3.0, 4.0];
    let tensor = Tensor::new(data, &device)?.reshape((2, 2))?;
    println!("Tensor:\n{}", tensor);

    // Matrix multiplication
    let result = tensor.matmul(&tensor)?;
    println!("Result:\n{}", result);

    Ok(())
}
Output:
Tensor:
[[1.0, 2.0],
[3.0, 4.0]]
Result:
[[7.0, 10.0],
[15.0, 22.0]]
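The `reshape((2, 2))` call above copies no data: Candle, like most tensor libraries, stores tensors contiguously in row-major (C) order, so a reshape only reinterprets the flat buffer. The index mapping can be sketched in plain Rust:

```rust
// Row-major indexing: element (row, col) of a matrix with `cols` columns
// lives at flat position row * cols + col.
fn row_major_index(row: usize, cols: usize, col: usize) -> usize {
    row * cols + col
}

fn main() {
    let flat = [1.0, 2.0, 3.0, 4.0]; // the buffer behind the (2, 2) tensor above
    let cols = 2;
    // (1, 0) maps to flat index 2, i.e. the value 3.0 in the printed tensor
    assert_eq!(flat[row_major_index(1, cols, 0)], 3.0);
    assert_eq!(flat[row_major_index(0, cols, 1)], 2.0);
    println!("row-major mapping verified");
}
```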
Example 2: Loading and Running a Model
use candle_core::{DType, Device, Tensor};
use candle_nn::VarBuilder;
use candle_transformers::models::llama::{Cache, Config, Llama};
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    // Note: this follows the candle 0.3 API; details shift between releases.
    let device = Device::Cpu;

    // Load tokenizer (tokenizers returns a boxed error, so map it into anyhow)
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(anyhow::Error::msg)?;

    // Memory-map the model weights (safetensors format)
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F32, &device)?
    };

    // Model config; real checkpoints ship a config.json you should load instead
    let config = Config::config_7b_v2(false); // false = no flash attention
    let cache = Cache::new(true, DType::F32, &config, &device)?;
    let model = Llama::load(vb, &cache, &config)?;

    // Tokenize input
    let input = "Once upon a time";
    let tokens = tokenizer.encode(input, true).map_err(anyhow::Error::msg)?;
    let token_ids = tokens.get_ids().to_vec();

    // Run inference over the whole prompt (position offset 0)
    let input_tensor = Tensor::new(token_ids.as_slice(), &device)?.unsqueeze(0)?;
    let logits = model.forward(&input_tensor, 0)?;

    // Greedy pick of the next token
    let predicted = logits.argmax(candle_core::D::Minus1)?;
    println!("Next token ID: {:?}", predicted);
    Ok(())
}
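The example above ends with `argmax`, i.e. greedy decoding. Real generation loops usually divide the logits by a temperature and sample from the resulting softmax distribution. A framework-independent sketch over a plain `Vec<f32>` of logits (these names are illustrative, not Candle API):

```rust
// Softmax with temperature over raw logits. Lower temperature sharpens the
// distribution; as temperature -> 0 this approaches greedy argmax.
// Assumes temperature > 0.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the max for numerical stability before exponentiating.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.1];
    let probs = softmax_with_temperature(&logits, 0.7);

    // Probabilities sum to 1, and the ordering of logits is preserved.
    let total: f32 = probs.iter().sum();
    assert!((total - 1.0).abs() < 1e-5);
    assert!(probs[0] > probs[1] && probs[1] > probs[2]);
    println!("probs: {:?}", probs);
}
```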
Pros of Candle
- ✅ Lightweight and fast
- ✅ Direct Hugging Face integration
- ✅ CPU and GPU support
- ✅ Minimal boilerplate
- ✅ Actively maintained by Hugging Face
Cons of Candle
- ❌ Newer ecosystem (less community content)
- ❌ Fewer pre-built models compared to Python
- ❌ Learning curve for Rust beginners
Llama.rs: Pure Rust LLaMA Implementation
What Is Llama.rs?
Llama.rs is a pure Rust implementation of the LLaMA model architecture. It brings the simplicity of single-file implementations (inspired by llama.cpp) to Rust, with safety guarantees.
Key Features
- Self-contained: Single library, no external dependencies
- GGML format: Supports quantized models from llama.cpp ecosystem
- Memory efficient: Optimized for inference on limited hardware
- Rust idiomatic: Leverages Rust’s type system and safety
When to Use Llama.rs
- Running LLaMA or compatible models
- You want a standalone, dependency-light solution
- Deploying on resource-constrained devices
- You prefer pure Rust implementations
Getting Started with Llama.rs
Add to Cargo.toml:
[dependencies]
llama-rs = "0.1"
rand = "0.8"
anyhow = "1"
Example 1: Basic Model Loading and Inference
use std::convert::Infallible;
use std::io::Write;

fn main() -> anyhow::Result<()> {
    // Load a GGML-format model. Treat the loader below as a sketch: the exact
    // entry-point and type names vary across llama-rs versions, so check the
    // docs for the version you depend on.
    let model = llama_rs::load_model("model.bin")?;

    // Create an inference session with default parameters
    let mut session = model.start_session(Default::default());

    // Prepare input
    let prompt = "Once upon a time, there was a";
    let mut output = String::new();
    let mut rng = rand::thread_rng();

    // Run inference, streaming tokens through the callback
    session.infer::<Infallible>(
        &model,
        &mut rng,
        &llama_rs::InferenceRequest {
            prompt: prompt.into(),
            parameters: Default::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(100),
        },
        &mut Default::default(),
        |response| match response {
            llama_rs::InferenceResponse::InferredToken(token) => {
                print!("{token}");
                std::io::stdout().flush().ok();
                output.push_str(&token);
                Ok(llama_rs::InferenceFeedback::Continue)
            }
            _ => Ok(llama_rs::InferenceFeedback::Continue),
        },
    )?;

    println!("\n\nFull output:\n{output}");
    Ok(())
}
Example 2: Streaming Inference
use std::convert::Infallible;
use std::io::Write;

fn main() -> anyhow::Result<()> {
    // As above, names follow the llama-rs lineage but vary between versions.
    let model = llama_rs::load_model("model.bin")?;
    let mut session = model.start_session(Default::default());
    let mut rng = rand::thread_rng();

    let prompt = "Explain quantum computing in simple terms:";

    session.infer::<Infallible>(
        &model,
        &mut rng,
        &llama_rs::InferenceRequest {
            prompt: prompt.into(),
            parameters: Default::default(),
            play_back_previous_tokens: false,
            maximum_token_count: Some(512),
        },
        &mut Default::default(),
        |response| match response {
            // Stream tokens to stdout as they're generated
            llama_rs::InferenceResponse::InferredToken(token) => {
                print!("{token}");
                std::io::stdout().flush().ok();
                Ok(llama_rs::InferenceFeedback::Continue)
            }
            // End-of-text token: stop generating
            llama_rs::InferenceResponse::EotToken => {
                println!("\n[End of generation]");
                Ok(llama_rs::InferenceFeedback::Halt)
            }
            _ => Ok(llama_rs::InferenceFeedback::Continue),
        },
    )?;
    Ok(())
}
Pros of Llama.rs
- ✅ Pure Rust, minimal dependencies
- ✅ Direct support for GGML quantized models
- ✅ Excellent for embedded and edge devices
- ✅ Single-file simplicity
- ✅ Memory-efficient inference
Cons of Llama.rs
- ❌ Specialized for LLaMA architecture
- ❌ Smaller ecosystem than Candle
- ❌ Limited to quantized models from llama.cpp
Head-to-Head Comparison
| Aspect | Candle | Llama.rs |
|---|---|---|
| Scope | General ML inference | LLaMA-specific |
| Model Support | Any architecture | LLaMA, derivatives |
| Ease of Use | Moderate | Very easy |
| Dependencies | Minimal (optional GPU) | Minimal |
| Performance | Excellent | Excellent |
| Community | Growing | Smaller |
| GPU Support | CUDA, Metal | CPU-focused |
| Deployment | Production-grade | Production-grade |
| Learning Curve | Steeper | Gentler |
Practical Decision Tree
Choose Candle if:
- You need flexibility across multiple model architectures
- You want GPU acceleration (CUDA, Metal)
- You’re building a general inference service
- You prefer a larger ecosystem
Choose Llama.rs if:
- You’re exclusively working with LLaMA models
- You want maximum simplicity and minimal dependencies
- You’re targeting embedded or edge devices
- You prefer GGML quantized models
Building a Production Inference Server
Here’s a practical example: a simple HTTP server for LLM inference using Candle and Axum:
use axum::{
    extract::{Json, State},
    http::StatusCode,
    routing::post,
    Router,
};
use candle_core::Device;
use serde::{Deserialize, Serialize};
use std::sync::Arc;

#[derive(Serialize, Deserialize)]
struct InferenceRequest {
    prompt: String,
    max_tokens: Option<usize>,
}

#[derive(Serialize)]
struct InferenceResponse {
    generated_text: String,
}

struct AppState {
    device: Device,
    // Hold the loaded model here so it survives across requests
}

async fn infer(
    State(_state): State<Arc<AppState>>,
    Json(req): Json<InferenceRequest>,
) -> Result<Json<InferenceResponse>, StatusCode> {
    // TODO: run the model against req.prompt instead of echoing it back
    Ok(Json(InferenceResponse {
        generated_text: format!("Generated text for: {}", req.prompt),
    }))
}

#[tokio::main]
async fn main() {
    let state = Arc::new(AppState { device: Device::Cpu });

    let app = Router::new()
        .route("/infer", post(infer))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000")
        .await
        .unwrap();
    axum::serve(listener, app).await.unwrap();
}
Optimization Tips
1. Quantization
Use quantized models (4-bit, 8-bit) to reduce memory use and speed up inference. With llama.cpp's tooling, conversion and quantization are two steps (script and binary names vary between releases):
# Convert the original weights to GGUF, then quantize to 4-bit
python convert.py /path/to/model --outtype f16
./quantize model-f16.gguf model-q4_0.gguf q4_0
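To see why this matters, a quick back-of-the-envelope: weight memory is roughly parameters times bits per weight, ignoring the few percent of overhead that quantization scales and zero-points add.

```rust
// Approximate weight memory for a model: parameters * bits per weight,
// converted to GiB. Ignores quantization block overhead (a few percent).
fn weight_gib(params: u64, bits_per_weight: f64) -> f64 {
    (params as f64 * bits_per_weight / 8.0) / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    let params = 7_000_000_000u64; // a "7B" model
    println!("f16 : {:.1} GiB", weight_gib(params, 16.0)); // ~13.0 GiB
    println!("q8_0: {:.1} GiB", weight_gib(params, 8.0));  // ~6.5 GiB
    println!("q4_0: {:.1} GiB", weight_gib(params, 4.0));  // ~3.3 GiB
}
```

The 4x reduction from f16 to 4-bit is what makes 7B-class models practical on laptops and edge devices.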
2. Batch Inference
Process multiple prompts in parallel:
use std::thread;

let prompts = vec!["Prompt 1", "Prompt 2", "Prompt 3"];
// One OS thread per prompt; run_inference stands in for your own inference code
let handles: Vec<_> = prompts.into_iter()
    .map(|p| thread::spawn(move || run_inference(p)))
    .collect();
let results: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();
3. Model Caching
Keep models in memory between requests:
use once_cell::sync::Lazy;
use std::sync::Mutex;
static MODEL: Lazy<Mutex<Model>> = Lazy::new(|| {
Mutex::new(load_model().unwrap())
});
4. KV Cache Reuse
Reuse key-value caches across generations for faster decoding.
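Concretely, the cache holds the attention key/value projections for every position already processed, so decoding token n+1 touches only one new position instead of recomputing all n. A minimal, framework-independent sketch of the bookkeeping (the `KvCache` type here is illustrative; real implementations store per-head tensors):

```rust
// Per-layer KV cache: one (key, value) pair of projection vectors per
// generated position. Real caches hold tensors per attention head.
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    // Append the projections for one new token instead of recomputing
    // every earlier position on each decoding step.
    fn push(&mut self, key: Vec<f32>, value: Vec<f32>) {
        self.keys.push(key);
        self.values.push(value);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..3 {
        // Stand-in projections; a real model computes these from hidden states.
        cache.push(vec![step as f32; 4], vec![step as f32; 4]);
    }
    // After 3 decoded tokens, attention at the next step reads 3 cached positions.
    assert_eq!(cache.len(), 3);
    println!("cached positions: {}", cache.len());
}
```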
Ecosystem Beyond Candle and Llama.rs
- Burn: Native Rust deep learning framework with dynamic graphs
- Ort: ONNX Runtime bindings for interoperability
- Tch-rs: PyTorch bindings for advanced models
- Hf-hub: Download models and tokenizers from the Hugging Face Hub
Conclusion
Rust is ready for LLM inference. Whether you choose Candle for versatility or Llama.rs for simplicity, you get deployment and latency characteristics that are hard to match from Python.
The choice depends on your priorities:
- Flexibility + GPU support → Candle
- Simplicity + Edge deployment → Llama.rs
As Rust’s ML ecosystem matures, expect more specialized libraries and better tooling. For now, these two frameworks offer a solid foundation for building production-grade LLM inference systems.
Resources
- Candle Documentation
- Candle Examples
- Llama.rs GitHub
- Hugging Face Transformers in Rust
- Llama.cpp (original C++ implementation)