Inference Frameworks & Runtimes on Embedded Systems Development

TensorFlow Lite Micro

TensorFlow Lite for Microcontrollers (TFLM) is an inference runtime designed for bare-metal environments where there is no operating system, no heap allocator, and as little as 16 KB of RAM. The runtime loads a pre-trained model stored as a FlatBuffer in flash memory, allocates all intermediate tensor storage from a single pre-allocated byte array (the tensor arena), and executes the model’s operations sequentially through a lightweight interpreter. There is no dynamic memory allocation at any point during inference: every byte comes from the arena, and the arena size is fixed at compile time. This makes TFLM’s memory use deterministic, which suits hard real-time systems, but it also means the developer bears full responsibility for sizing the arena correctly and registering exactly the operators the model requires.
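The arena idea can be illustrated with a toy bump allocator. This is a minimal sketch in Python for readability, not the TFLM API (which is C++); the class name `TensorArena`, the 16-byte alignment, and the tensor sizes are all assumptions made for illustration. The real runtime also reuses arena space between tensors whose lifetimes do not overlap, which this sketch leaves out.

```python
class TensorArena:
    """Toy bump allocator over one fixed buffer (illustrative, not the TFLM API)."""

    def __init__(self, size_bytes: int, alignment: int = 16) -> None:
        self.buffer = bytearray(size_bytes)  # the only storage ever reserved
        self.alignment = alignment           # assumed alignment, for illustration
        self.used = 0                        # offset of the first free byte

    def allocate(self, nbytes: int) -> memoryview:
        # Round the current offset up to the next aligned position.
        start = (self.used + self.alignment - 1) // self.alignment * self.alignment
        end = start + nbytes
        if end > len(self.buffer):
            # Nothing can grow the buffer: an undersized arena is a hard failure.
            raise MemoryError(
                f"arena too small: need {end} bytes, have {len(self.buffer)}"
            )
        self.used = end
        return memoryview(self.buffer)[start:end]


arena = TensorArena(size_bytes=1024)
input_tensor = arena.allocate(100)   # occupies offsets 0..99
hidden_tensor = arena.allocate(200)  # starts at offset 112, the next multiple of 16
```

Because the buffer never grows, a request that does not fit fails immediately. This is why arena sizing is usually settled during development rather than discovered in the field.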

TensorFlow Lite for Linux

TensorFlow Lite (TFLite) on Linux is the full-featured inference runtime for edge devices with an operating system, a filesystem, and megabytes to gigabytes of RAM. Unlike TensorFlow Lite Micro, which targets bare-metal microcontrollers with static memory allocation, TFLite for Linux uses dynamic memory, supports multi-threaded inference, and, critically, provides a delegate architecture that offloads computation to hardware accelerators like GPUs, NPUs, and DSPs. This delegate system is what makes TFLite viable on platforms ranging from a Raspberry Pi 4 running object detection on the CPU to boards with dedicated accelerators, where a vendor-supplied delegate can take over most or all of the model graph.
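The effect of a delegate can be sketched as a partitioning step: operations the accelerator supports are grouped and handed off, and everything else stays on the CPU. The following is a simplified illustration, not TFLite's actual partitioning code. It treats the model as a linear list of ops (real graphs are DAGs), the function name `partition_graph` is hypothetical, and `CUSTOM_NMS` stands in for any op the accelerator cannot handle.

```python
from itertools import groupby


def partition_graph(ops, delegate_supported):
    """Split a linear op sequence into consecutive runs for the delegate or the CPU."""
    partitions = []
    # groupby collects adjacent ops that share the same "supported" flag.
    for on_delegate, run in groupby(ops, key=lambda op: op in delegate_supported):
        partitions.append(("delegate" if on_delegate else "cpu", list(run)))
    return partitions


model_ops = ["CONV_2D", "DEPTHWISE_CONV_2D", "CUSTOM_NMS", "RESHAPE", "SOFTMAX"]
accelerator_ops = {"CONV_2D", "DEPTHWISE_CONV_2D", "SOFTMAX"}  # assumed support set

plan = partition_graph(model_ops, accelerator_ops)
for device, run in plan:
    print(device, run)
```

Each switch between a delegate partition and a CPU partition generally implies moving tensors between devices, so a model that splits into many small partitions may gain little from the accelerator.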

ONNX Runtime & Edge Variants

ONNX (Open Neural Network Exchange) is an open model interchange format that allows models trained in one framework — PyTorch, TensorFlow, scikit-learn, XGBoost — to run on a shared inference runtime. The format defines a standard set of operators (organized into versioned “opsets”), a graph-based computation representation, and a serialization scheme using Protocol Buffers. ONNX Runtime (ORT) is Microsoft’s high-performance inference engine for ONNX models, supporting execution on CPUs, GPUs, NPUs, and DSPs through a pluggable execution provider architecture.
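The pluggable execution provider idea amounts to an ordered preference list with a CPU fallback. The sketch below shows only that selection logic in plain Python and does not import ONNX Runtime. The helper name `choose_providers` is hypothetical, and the set of available providers is hard-coded as an assumption; on a real device it would come from the runtime itself.

```python
CPU_PROVIDER = "CPUExecutionProvider"


def choose_providers(preferred, available):
    """Keep the preferred providers that are available, in priority order.

    The CPU provider is appended last so there is always a fallback.
    """
    chosen = [name for name in preferred if name in available]
    if CPU_PROVIDER not in chosen:
        chosen.append(CPU_PROVIDER)
    return chosen


# Assumed for illustration: a board where only CUDA and CPU are available.
available_on_device = {"CUDAExecutionProvider", "CPUExecutionProvider"}
preferred_order = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]

providers = choose_providers(preferred_order, available_on_device)
print(providers)
```

A list of this shape is the kind of value passed as the `providers` argument when creating an ONNX Runtime inference session, with earlier entries taking priority over later ones.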

Edge Runtime Selection Guide

Selecting an inference runtime for edge deployment is a multi-dimensional decision that involves the target hardware, model format, operator coverage, memory budget, and the training framework that produced the model. There is no single “best” runtime: the choice is constrained by the intersection of what the hardware supports, what the model requires, and what the deployment environment allows. A model that runs perfectly on TensorFlow Lite for Linux with the XNNPACK delegate on a Raspberry Pi 4 cannot simply be moved to a Cortex-M4 microcontroller, where TensorFlow Lite Micro is the only viable option among the runtimes covered here and the model must fit its memory and operator constraints. Conversely, a PyTorch model that exports cleanly to ONNX and runs under ONNX Runtime may have no path to TFLite without an intermediate conversion step that risks operator compatibility issues.
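As a rough illustration, the coarsest part of this decision can be written as a small rule function. This is a deliberately simplified sketch covering only the two constraints named above (whether the target has an operating system, and the model format). The `Target` type and `select_runtime` function are hypothetical, and a real selection would also weigh operator coverage, memory budget, and accelerator support.

```python
from dataclasses import dataclass


@dataclass
class Target:
    has_os: bool        # True for Linux-class boards, False for bare-metal MCUs
    model_format: str   # "tflite" or "onnx"


def select_runtime(target: Target) -> str:
    """Return a candidate runtime for the target, using coarse rules only."""
    if target.model_format not in ("tflite", "onnx"):
        raise ValueError(f"unknown model format: {target.model_format!r}")
    if not target.has_os:
        # Bare metal: TFLM is the only candidate discussed here, and it
        # needs a TFLite FlatBuffer, so other formats require conversion.
        if target.model_format == "tflite":
            return "TensorFlow Lite Micro"
        return "TensorFlow Lite Micro (after conversion to TFLite)"
    if target.model_format == "onnx":
        return "ONNX Runtime"
    return "TensorFlow Lite for Linux"


print(select_runtime(Target(has_os=False, model_format="tflite")))
print(select_runtime(Target(has_os=True, model_format="onnx")))
```

The conversion branch is where most of the risk sits: a runtime that looks right on paper is only usable if every operator in the model survives the conversion.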