
7 Best Rust Frameworks for Machine Learning on Edge Devices in 2025


Edge ML in 2025 means running useful models on constrained devices (microcontrollers, ARM single-board computers, mobile phones, or small robotics platforms) while balancing memory, latency, power, and cross-compilation needs. This guide compares seven Rust-first (or Rust-friendly) frameworks and runtimes that are most relevant for edge ML, analyzing strengths, trade-offs, hardware-acceleration options, format compatibility, and practical deployment considerations. Use this as a decision guide and quick-start reference to pick the right tool for your use case.


Introduction

Running ML at the edge brings strict constraints: limited RAM and flash, heterogeneous accelerators (GPUs, NPUs, TPUs), battery budgets, and hard real-time needs in some robotics or control systems. Rust’s combination of performance, small binaries, and memory safety makes it an attractive stack for edge ML. This article focuses on frameworks and runtimes that are particularly suitable for edge deployment in 2025.

What this guide covers:

  • Memory efficiency and low resource consumption
  • Real-time inference characteristics
  • Hardware acceleration support and delegates
  • Model format compatibility (ONNX, TFLite, TorchScript)
  • Cross-compilation for ARM and embedded architectures
  • Power and battery efficiency guidance
  • Practical installation and example usage

1) Tract: Embedded-focused ONNX / TFLite runtime (tract)

Brief description & primary use cases

Tract is a lightweight, pure-Rust inference runtime focused on embedded and edge devices. It supports ONNX as its primary format and can also read TFLite models. Tract aims for small memory usage and predictable behavior, making it ideal for sensor gateway boxes, local inference on SBCs, and deterministic embedded systems.

Key technical specifications

  • Model formats: ONNX (primary) and TFLite reading support
  • Execution: CPU-focused, vectorized (SIMD where available)
  • Pure Rust implementation (few native dependencies)
  • Optimizations: graph-level transforms, operator fusion where available

Features

  • Small runtime memory footprint
  • Deterministic inference behavior
  • Good operator coverage for typical vision and classification models

Pros / Cons

Pros

  • Excellent for CPU-only, memory-constrained deployments
  • Simple cross-compilation and static linking
  • Transparent model inspection and optimizations

Cons

  • Limited out-of-the-box GPU/NPU delegate support
  • New or niche operators may lag upstream frameworks

Installation & setup

Add to Cargo.toml:

tract-onnx = "0.19"

Basic usage:

use tract_onnx::prelude::*;

let model = tract_onnx::onnx()
    .model_for_path("model.onnx")?
    .with_input_fact(0, InferenceFact::dt_shape(f32::datum_type(), tvec!(1, 3, 224, 224)))?
    .into_optimized()?
    .into_runnable()?;

// Dummy input; replace with preprocessed image data
let input = Tensor::zero::<f32>(&[1, 3, 224, 224])?;
let result = model.run(tvec!(input))?;

Performance characteristics

  • Low memory usage and competitive CPU inference times on ARM Cortex-A and similar SBCs.
  • Best results with quantized models (int8) and when using build flags enabling SIMD/NEON.

Advantages & limitations

  • Advantage: small binary and runtime overhead; good for constrained devices.
  • Limitation: for heavy acceleration (GPU, NPU) you may need to wrap or integrate an external delegate.

Best suited scenarios

  • Raspberry Pi / other ARM SBCs running small vision/classification models
  • Sensor gateways requiring deterministic, low-RAM inference

2) TensorFlow Lite (Rust bindings): Mobile & micro front-line (tflite / tflite-rs)

Brief description & primary use cases

TensorFlow Lite is the go-to option for many mobile and microcontroller scenarios. Rust bindings to TFLite let you use the TFLite interpreter, delegates (GPU via OpenGL/Vulkan, NNAPI, CoreML, Edge TPU), and TFLite Micro for MCUs.

Key technical specifications

  • Model formats: TFLite FlatBuffers (.tflite) and TFLite Micro for microcontrollers
  • Execution: CPU, GPU/NNAPI delegates, Edge TPU via delegates
  • Cross-compilation: requires building or bundling native TFLite libraries

Features

  • Strong mobile delegate ecosystem (NNAPI, CoreML, GPU)
  • Support for TFLite Micro on tiny MCUs
  • First-class quantized model support (int8)

Pros / Cons

Pros

  • Best-in-class support for very small and mobile models
  • Efficient quantized execution and hardware delegates

Cons

  • Bindings wrap C/C++ libraries, so cross-compilation and tooling are more complex
  • TFLite Micro may require hand-crafted build steps for very small MCUs

Installation & setup

Add a TFLite crate or use FFI bindings (crate names vary):

tflite = "0.7"

Basic usage:

use tflite::ops::builtin::BuiltinOpResolver;
use tflite::{FlatBufferModel, InterpreterBuilder};

// Sketch of the tflite crate API; method names can differ between versions
let model = FlatBufferModel::build_from_file("model.tflite")?;
let resolver = BuiltinOpResolver::default();
let mut interp = InterpreterBuilder::new(model, resolver)?.build()?;

interp.allocate_tensors()?;
let input_index = interp.inputs()[0];
interp.tensor_data_mut::<f32>(input_index)?.copy_from_slice(&input_data);
interp.invoke()?;
let output_index = interp.outputs()[0];
let output: &[f32] = interp.tensor_data(output_index)?;

Performance characteristics

  • With delegates (GPU, NNAPI, Edge TPU), latency and power efficiency improve dramatically.
  • TFLite Micro is optimized for small flash/RAM footprints but requires tight integration with MCU SDKs.

Advantages & limitations

  • Advantage: excellent hardware delegate ecosystem and quantization tools.
  • Limitation: non-Rust native components and extra build effort for some targets.

Best suited scenarios

  • Mobile apps (Android/iOS) using NNAPI/CoreML/GPU delegates
  • Ultra-constrained MCUs using TFLite Micro and int8 models

3) ONNX Runtime via the ort crate: Production-grade cross-platform inference

Brief description & primary use cases

ONNX Runtime (ORT) is a mature, high-performance inference engine with broad execution provider support (CPU, CUDA, TensorRT, DirectML, NNAPI, CoreML). The ort Rust crate provides idiomatic bindings to ORT.

Key technical specifications

  • Model formats: ONNX
  • Execution: many providers (CPU, CUDA, TensorRT, NNAPI, CoreML)
  • Packaging: native C++ runtime with Rust bindings

Features

  • Wide hardware acceleration support via execution providers
  • Comprehensive operator coverage and performance tuning

Pros / Cons

Pros

  • Excellent acceleration with GPUs and NPUs
  • Production-ready toolchain and profiling utilities

Cons

  • Larger binary footprint (native runtime)
  • Cross-compilation requires building ONNX Runtime for target platforms

Installation & setup

Add to Cargo.toml (the snippet below follows the 1.x bindings; check crates.io for the current version):

ort = "1.16"

Basic usage:

use ort::{Environment, SessionBuilder};

// API sketch for the ort 1.x bindings; exact input/output types vary by version
let env = Environment::builder().with_name("edge").build()?;
let session = SessionBuilder::new(&env)?.with_model_from_file("model.onnx")?;
let input = ndarray::Array::from_shape_vec((1, 3, 224, 224), vec![/*...*/]).unwrap();
let outputs = session.run(vec![input.into()])?;

Performance characteristics

  • When paired with a device-specific execution provider (CUDA, TensorRT, NNAPI), ORT delivers the best throughput and latency for many edge deployments with accelerators.
  • CPU-only ORT is robust but may be heavier in memory than minimal runtimes.

Advantages & limitations

  • Advantage: top-tier acceleration support and enterprise-grade stability.
  • Limitation: packaging and cross-compile effort for small embedded targets.

Best suited scenarios

  • Robotics, drones, and industrial edge systems where low latency and hardware acceleration matter
  • Mobile apps needing high-perf inference via NNAPI or CoreML providers

4) Candle: Rust-native transformer & inference runtime (candle)

Brief description & primary use cases

Candle is a Rust-first runtime optimized for transformer-style models (BERT, small GPTs). It targets CPU inference with an emphasis on memory efficiency, quantization support, and optimized attention kernels that are friendly to edge CPUs.

Key technical specifications

  • Focus: Transformer models and attention kernels
  • Strength: quantization, memory-efficient inference for LLM-like workloads
  • Implementation: pure Rust with tuned kernels

Features

  • Very memory-efficient inference for transformer architectures
  • Good tooling for quantized models

Pros / Cons

Pros

  • Excellent for on-device NLP and compact LLMs
  • Pure Rust: easier cross-compilation and embedding

Cons

  • Narrow scope focused on transformer workloads
  • Smaller ecosystem than mainstream runtimes

Installation & setup

Add to Cargo.toml (the core crate is published as candle-core; check crates.io for the current version):

candle-core = "0.4"

Conceptual usage:

use candle_core::{DType, Device, Tensor};

// Illustrative sketch only; real model loading goes through candle's companion
// crates (e.g. candle-transformers for weights, candle-onnx for ONNX graphs)
let device = Device::Cpu;
let input = Tensor::zeros((1, 128), DType::F32, &device)?;
// let logits = model.forward(&input)?;

Performance characteristics

  • Very competitive CPU inference for quantized transformer models; designed to reduce working memory and peak allocations.

Advantages & limitations

  • Advantage: run useful transformer workflows on CPU-only edge devices.
  • Limitation: not a general-purpose NN runtime; best for NLP/transformer workloads.

Best suited scenarios

  • On-device assistants, smart cameras with NLP tagging, and devices running compact LLMs without an accelerator

5) Rune: Sandboxed, portable runtime with model packaging (rune)

Brief description & primary use cases

Rune is a small, sandboxed runtime that enables packing and delivering model logic to devices safely. It's optimized for portability (WASM-friendly) and secure execution, which is useful for multi-tenant or untrusted environments.

Key technical specifications

  • Execution: WASM-first, sandboxed execution
  • Packaging: model + logic packaged as a portable artifact
  • Target: edge devices where secure execution matters

Features

  • Sandboxed runtime for safe execution of third-party models
  • WASM support enabling broad portability

Pros / Cons

Pros

  • Secure packaging and execution model for edge deployments
  • Small runtime footprint when running packaged WASM models

Cons

  • Not optimized for raw numerical speed; relies on WASM SIMD for performance
  • For heavy inference, wrap native providers or run in a secure enclave

Installation & setup

Add to Cargo.toml:

rune = "0.10"

Conceptual usage (illustrative pseudo-API; the real Rune workflow is driven through its CLI and Runefile packaging, so treat these names as assumptions):

// illustrative pseudo-API, not a published crate interface
let ctx = rune::load_model_bytes(include_bytes!("model.rune"))?;
let result = ctx.run("infer", &input)?;

Performance characteristics

  • Best for secure, portable deployments; performance depends on available WASM SIMD and host capabilities.

Advantages & limitations

  • Advantage: safe, auditable execution for distributed fleets
  • Limitation: less numerically optimized than native runtimes

Best suited scenarios

  • Fleet deployments where third-party models are distributed and must be sandboxed
  • Devices that prefer WASM portability over native binaries

6) dfdx: Small, pure-Rust tensor library for compact models (dfdx)

Brief description & primary use cases

dfdx is a compact, pure-Rust tensor and NN library geared toward experiments, small models, and WASM/embedded targets. It's useful where you want a Rust-native stack from model definition to inference (and small-scale training).

Key technical specifications

  • Pure Rust tensors, neural layers, and optimizers
  • WASM / no-std friendly when configured properly

Features

  • Great for prototyping and running simple networks in Rust
  • Explicit control over memory layout and allocations

Pros / Cons

Pros

  • Tight Rust integration and small binary size
  • Good fit for WASM and constrained devices

Cons

  • Not designed for running large pre-trained models from major ecosystems without reimplementation
  • Fewer built-in operator optimizations compared to ORT or TFLite

Installation & setup

Add to Cargo.toml:

dfdx = "0.13"

Usage example:

use dfdx::prelude::*;

// Illustrative sketch for the dfdx 0.13 API; the tiny MLP architecture here
// is an assumption, not a model shipped with the crate
type Mlp = (Linear<4, 8>, ReLU, Linear<8, 2>);

let dev: Cpu = Default::default();
let model = dev.build_module::<Mlp, f32>();
let x: Tensor<Rank1<4>, f32, _> = dev.sample_normal();
let y = model.forward(x);

Performance characteristics

  • Compact and efficient for small networks; speed depends heavily on build flags (SIMD) and target hardware.

Advantages & limitations

  • Advantage: full Rust ownership and easy cross-compilation
  • Limitation: manual reimplementation or export may be needed for complex models

Best suited scenarios

  • Custom lightweight ML pipelines, WASM apps, and embedded inference where you control the model definition

7) tch-rs + libtorch mobile: PyTorch models on edge (tch)

Brief description & primary use cases

tch-rs provides Rust bindings to libtorch (PyTorch C++). Combined with libtorch mobile builds, it is a practical choice when your model pipeline originates from PyTorch and you need to deploy TorchScript artifacts to the edge.

Key technical specifications

  • Model formats: TorchScript (.pt) and traced/scripted models
  • Execution: CPU/GPU depending on libtorch build; mobile builds available

Features

  • Direct reuse of PyTorch models and tooling
  • Integration with TorchScript for production deployment

Pros / Cons

Pros

  • Minimal friction if your pipeline uses PyTorch
  • Access to many PyTorch ops and utilities

Cons

  • Heavier runtime and larger binary size than pure-Rust runtimes
  • Cross-compilation and mobile builds require careful packaging

Installation & setup

Add to Cargo.toml:

tch = "0.7"

Usage:

use tch::CModule;

let model = CModule::load("model.pt")?;
let input = tch::Tensor::of_slice(&[/* ... */]).reshape(&[1, 3, 224, 224]);
// forward_ts takes tensor inputs (forward_is is for IValue inputs)
let output = model.forward_ts(&[input])?;

Performance characteristics

  • Good performance when using optimized libtorch mobile builds or GPU-enabled builds; heavier than Tract/TFLite for small CPU-only devices.

Advantages & limitations

  • Advantage: straightforward for teams already using PyTorch
  • Limitation: packaging complexity and binary size for constrained devices

Best suited scenarios

  • Teams using PyTorch looking to ship TorchScript models to mobile or edge Linux devices

Comparison Table: At a glance

Framework | Memory Efficiency | Real-time Inference | HW Acceleration | Model Formats | Cross-compile Friendliness | Community / Maturity
Tract | ✅ High | ✅ Good (CPU) | ⚠️ Limited | ONNX / TFLite | ✅ Easy | Mature (embedded)
TFLite (bindings) | ✅ High (with quant) | ✅ Excellent (with delegates) | ✅ GPU / NNAPI / Edge TPU | TFLite | ⚠️ Native toolchain needed | Very mature
ONNX Runtime (ort) | ⚠️ Medium | ✅ Excellent | ✅ Wide provider support | ONNX | ⚠️ Complex native build | Very mature
Candle | ✅ Very high (quant) | ✅ Excellent (transformers) | ⚠️ CPU-optimized | safetensors / ONNX (partial) | ✅ Good (pure Rust) | Growing
Rune | ✅ High (packaging) | ✅ OK | ⚠️ Depends on backend | Packaged / ONNX | ✅ WASM-friendly | Niche & growing
dfdx | ✅ High (small models) | ✅ Good (WASM/CPU) | ⚠️ Limited | Manual / small exports | ✅ Excellent | Growing
tch + libtorch | ⚠️ Medium-large | ✅ Good (mobile libtorch) | ✅ GPU (if built) | TorchScript | ⚠️ Complex | Mature (via PyTorch)

Cross-compilation, power, and battery tips 🔧

  • Prefer quantized/int8 models where possible to reduce memory and CPU usage.
  • Use platform-native delegates (NNAPI, CoreML, Edge TPU) for energy-efficient acceleration.
  • Build size: strip symbols, enable LTO, and use size-optimized Rust flags for smaller deployed binaries.
  • Cross-compile using rustup target add or use tooling like cross to simplify building crates that include native components.
  • Measure power with a realistic workload on target hardware (use external meters or kernel power stats) to get accurate battery impact.
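The "measure on target" advice can start with nothing more than a stdlib timing harness wrapped around your inference call. A minimal sketch (the bench helper and percentile choice here are illustrative, not from any framework):

```rust
use std::time::Instant;

/// Run `f` `iters` times (iters > 0) and return (p50, p95) latency in microseconds.
fn bench<F: FnMut()>(mut f: F, iters: usize) -> (u128, u128) {
    let mut samples: Vec<u128> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_micros()
        })
        .collect();
    samples.sort_unstable();
    (samples[iters / 2], samples[(iters * 95) / 100])
}
```

Run it against your model's run call with warmed caches and realistic inputs, and report p95 rather than the mean: edge workloads usually care about tail latency.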

Deployment checklist (practical)

  • Choose the model format that matches the target runtime (ONNX for ORT/Tract, TFLite for TFLite delegates).
  • Quantize and prune the model to fit memory and energy budgets.
  • Build or bundle native runtime components for target architecture; prebuild or cross-compile where appropriate.
  • Validate on-device latency and memory usage with production-like inputs.
  • Automate packaging and OTA updates for fleets.

Recommendations for common edge use cases 🎯

  • IoT sensors / ultra-constrained MCUs: TFLite Micro or Tract (with aggressive quantization). Prioritize int8, minimal working memory, and a small interpreter.

  • Mobile apps (Android/iOS): TFLite with delegates or ONNX Runtime with NNAPI / CoreML providers. Use quantized models and native delegates to maximize battery life.

  • Robotics / real-time control: ONNX Runtime with proper execution providers (CUDA/TensorRT on GPUs) for strict low-latency needs; fall back to Tract for deterministic CPU-only scenarios.

  • Embedded vision on SBCs: ONNX Runtime (Jetson with CUDA/TensorRT) or Tract (Raspberry Pi, CPU-only) depending on acceleration requirements.

  • Secure/third-party model distribution: Rune for sandboxed WASM packaging and safe execution on untrusted devices.

  • Prototyping & research in Rust: dfdx, burn, or candle depending on whether you need small custom models or transformer support.


How to choose in 3 steps

  1. Pick your model format: TFLite for MCU/mobile, ONNX for cross-platform acceleration.
  2. Select the runtime by device capability: TFLite/Tract for small devices, ORT for hardware-accelerated edge, Candle for on-device LLMs.
  3. Quantize, cross-compile, and benchmark on the actual device to validate latency and power.

Note: Benchmarks depend heavily on target hardware and model architecture; always measure on your intended device.


Demo pipeline: PyTorch → ONNX → Quantize → Run on Raspberry Pi

This short demo walks through a practical pipeline that is easy to reproduce: export a PyTorch model to ONNX, apply post-training quantization, copy the quantized ONNX to a Raspberry Pi (ARM), and run inference there using tract (pure Rust). The pipeline is intentionally compact, suitable for a simple image classification model like MobileNet or a small custom CNN.

1) Export PyTorch โ†’ ONNX (local workstation)

Save export.py:

# export.py
import torch
from torchvision import models

model = models.mobilenet_v2(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, x, "model.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
)

Run:

python export.py

Verify:

python -c "import onnx; onnx.checker.check_model('model.onnx'); print('ONNX ok')"

Image preprocessing & local sanity-check (Python)

Quick script to preprocess a JPEG and run a local ONNX inference to validate the exported model:

# preprocess_and_run.py
from PIL import Image
import numpy as np
import onnxruntime as ort

def preprocess(path):
    img = Image.open(path).convert('RGB').resize((224,224))
    arr = np.array(img).astype('float32') / 255.0
    mean = np.array([0.485,0.456,0.406])
    std = np.array([0.229,0.224,0.225])
    arr = (arr - mean) / std
    # move to NCHW
    arr = np.transpose(arr, (2,0,1))[None,:,:,:].astype('float32')
    return arr

sess = ort.InferenceSession('model.onnx')
inp = preprocess('sample.jpg')
out = sess.run(None, {'input': inp})[0]
print('Top-5 indices:', out[0].argsort()[-5:][::-1])

This helps ensure your exported model gives expected outputs before quantizing.

2) Quantize the ONNX model (post-training dynamic quantization)

Install the quantization tools and run dynamic quantization. Dynamic quantization converts weights to int8 while activations stay in float; it is the simplest starting point, though static quantization with a calibration dataset usually yields larger speedups for CNNs:

pip install onnx onnxruntime
python - <<'PY'
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic('model.onnx', 'model.quant.onnx', weight_type=QuantType.QInt8)
print('Quantized model saved: model.quant.onnx')
PY

Quick check (optional): compare size and run basic inference with onnxruntime on your workstation to sanity-check outputs.
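To make the arithmetic concrete, here is a stdlib-only Rust sketch of the affine int8 scheme that weight-only (dynamic) quantization applies per tensor; real tooling adds more careful range and per-channel handling:

```rust
/// Affine int8 quantization of a weight slice: w ≈ scale * (q - zero_point).
/// Assumes a non-degenerate weight range (max > min).
fn quantize(weights: &[f32]) -> (Vec<i8>, f32, i8) {
    let (min, max) = weights
        .iter()
        .fold((f32::MAX, f32::MIN), |(lo, hi), &w| (lo.min(w), hi.max(w)));
    let scale = (max - min) / 255.0;
    // choose the zero point so that `min` maps to -128
    let zero_point = (-128.0 - min / scale).round() as i32;
    let q = weights
        .iter()
        .map(|&w| ((w / scale).round() as i32 + zero_point).clamp(-128, 127) as i8)
        .collect();
    (q, scale, zero_point as i8)
}

/// Recover approximate f32 weights from the int8 representation.
fn dequantize(q: &[i8], scale: f32, zero_point: i8) -> Vec<f32> {
    q.iter()
        .map(|&v| scale * (v as i32 - zero_point as i32) as f32)
        .collect()
}
```

Each f32 weight (4 bytes) becomes one i8 (1 byte) plus a shared scale/zero-point per tensor, which is where the roughly 4x size reduction comes from.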

3) Copy files to Raspberry Pi

scp model.quant.onnx [email protected]:~/models/
# or use rsync/USB

On the Raspberry Pi (or aarch64 Debian/Ubuntu): install Rust and prepare a small runner (you can also cross-compile):

# Install Rust (if not present)
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env

4) Rust runner using tract-onnx

Create a minimal Cargo project on the Pi (or copy & build locally for the Pi target):

Cargo.toml:

[package]
name = "rp-infer"
version = "0.1.0"
edition = "2021"

[dependencies]
tract-onnx = "0.19"
ndarray = "0.15"

src/main.rs:

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the quantized ONNX model and pin the input shape
    let model = tract_onnx::onnx()
        .model_for_path("models/model.quant.onnx")?
        .with_input_fact(0, InferenceFact::dt_shape(f32::datum_type(), tvec!(1, 3, 224, 224)))?
        .into_optimized()? // apply graph optimizations
        .into_runnable()?;

    // Dummy input (replace with preprocessed image data)
    let input = Tensor::zero::<f32>(&[1, 3, 224, 224])?;

    let result = model.run(tvec!(input))?;
    let view = result[0].to_array_view::<f32>()?;
    println!("First outputs: {:?}", &view.as_slice().unwrap()[..10]);
    Ok(())
}
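The runner prints raw logits; to turn them into class predictions you can add a small stdlib helper (hypothetical, not part of tract) that applies softmax and picks the top-k indices:

```rust
/// Softmax over raw logits, then return the top-k (class index, probability) pairs.
fn top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // subtract the max logit for numerical stability
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|&e| e / sum).enumerate().collect();
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(k);
    probs
}
```

Apply it to the output slice from the runner to mirror the Python Top-5 check from earlier in the pipeline.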

Build and run on Raspberry Pi:

cargo build --release
# run
./target/release/rp-infer
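To replace the dummy input with real data, the Python preprocessing step can be mirrored in stdlib Rust. This sketch assumes 8-bit RGB pixels already decoded into HWC order (decoding the JPEG itself would need an image crate) and the same ImageNet mean/std:

```rust
/// Convert an HWC u8 RGB image into a normalized NCHW f32 buffer,
/// matching the Python preprocessing used to validate the model.
fn to_nchw(rgb: &[u8], h: usize, w: usize) -> Vec<f32> {
    let mean = [0.485f32, 0.456, 0.406];
    let std = [0.229f32, 0.224, 0.225];
    let mut out = vec![0.0f32; 3 * h * w];
    for y in 0..h {
        for x in 0..w {
            for c in 0..3 {
                // scale to [0, 1], then normalize per channel
                let v = rgb[(y * w + x) * 3 + c] as f32 / 255.0;
                out[c * h * w + y * w + x] = (v - mean[c]) / std[c];
            }
        }
    }
    out
}
```

The resulting Vec<f32> has the (1, 3, 224, 224) layout the model expects once wrapped in a tensor.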

5) Tips & verification

  • If building on the Pi is slow, cross-compile from an x86 host targeting aarch64-unknown-linux-gnu or use cross.
  • Use RUSTFLAGS='-C target-cpu=native -C lto' cargo build --release to get better performance on the Pi (only when building on Pi or for the Pi’s CPU type).
  • Compare FP32 vs quantized outputs on a few samples to ensure acceptable accuracy drop.
  • For extra speed enable NEON/SIMD on the target and build with appropriate target flags, and prefer model.quant.onnx with per-channel quantization if supported by your model.
  • Measure latency (time command or microbenchmark tool) and memory with realistic inputs to confirm the model fits the device constraints.
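The FP32-vs-quantized comparison suggested above can be automated with a small stdlib helper (names are illustrative):

```rust
/// Compare FP32 and quantized output vectors:
/// returns (top-1 agreement, maximum absolute difference).
fn compare_outputs(fp32: &[f32], quant: &[f32]) -> (bool, f32) {
    let argmax = |v: &[f32]| -> Option<usize> {
        v.iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
    };
    let same_top1 = argmax(fp32) == argmax(quant);
    let max_diff = fp32
        .iter()
        .zip(quant)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    (same_top1, max_diff)
}
```

Top-1 agreement across a handful of samples plus a bounded maximum difference is usually enough of a smoke test before rolling a quantized model out to a fleet.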

Cross-compile from x86 (quick example)

rustup target add aarch64-unknown-linux-gnu
# install aarch64 cross linker (Debian/Ubuntu example)
sudo apt-get install -y gcc-aarch64-linux-gnu
# build the release binary for Raspberry Pi (aarch64)
cargo build --release --target aarch64-unknown-linux-gnu
# copy binary to Pi
scp target/aarch64-unknown-linux-gnu/release/rp-infer [email protected]:~/rp-infer

If your crate depends on native C libraries or more complex toolchains, prefer using cross or building in an aarch64 CI runner to avoid linker issues.


This short pipeline gives a practical, reproducible route from training artifacts to a quantized ONNX model running on Raspberry Pi with a small Rust runner using tract.


Conclusion

By 2025, Rust-based and Rust-friendly ML tooling covers a wide span of edge use cases, from microcontrollers using TFLite Micro to transformer-focused runtimes like Candle that enable on-device LLMs. Choose the framework that best matches your device capabilities and deployment constraints, quantize aggressively for battery and memory savings, and always validate on real hardware. The demo pipeline above (PyTorch to ONNX to a quantized model on a Raspberry Pi) is a template you can adapt to other targets such as Android with NNAPI or an MCU pipeline with TFLite Micro.

