Model Optimization & Deployment on Embedded Systems Development

Quantization

Quantization reduces the numerical precision of a neural network’s weights and activations from the default float32 representation to a lower-precision format — typically int8, float16, or int4. A float32 model stores each parameter as a 32-bit IEEE 754 floating-point number. Converting those parameters to int8 reduces model size by 4x (from 4 bytes per weight to 1 byte), decreases memory bandwidth during inference by the same factor, and enables hardware-accelerated integer arithmetic on devices with int8 support — including Arm Cortex-M with CMSIS-NN, dedicated NPUs (Ethos-U55/U65, Hailo-8), Edge TPU, and GPU int8 tensor cores on Jetson platforms.
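The 4x size reduction can be illustrated with a minimal sketch of symmetric per-tensor int8 quantization. This is an illustration under stated assumptions, not the scheme any particular toolchain uses: the function names `quantize_int8` and `dequantize` are hypothetical, the weight values are made up, and production converters typically add per-channel scales, zero points, and calibration data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]


# Illustrative weights only (assumed values, not taken from a real model).
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 4 bytes per float32 weight versus 1 byte per int8 weight.
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
print(q, scale, max_error, float32_bytes // int8_bytes)
```

Note how the small weight 0.003 collapses to zero: values well below one quantization step are lost, which is one source of the accuracy drop that quantization can introduce.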

Pruning & Knowledge Distillation

Quantization reduces the precision of every weight in a model. Pruning and knowledge distillation take a different approach: they reduce the number of computations the model performs — pruning by removing redundant weights or structures, and distillation by training a smaller model to approximate a larger one. Both techniques require some form of retraining, which makes them more involved than post-training quantization, but they address cases where quantization alone cannot meet the latency or size target.
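As a rough illustration of the pruning half of this idea, the sketch below applies unstructured magnitude pruning to a flat list of weights. It is a simplification under stated assumptions: `magnitude_prune` is a hypothetical helper, the weights are made-up values, and the retraining step that normally follows pruning is not shown.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Illustrative weights only (assumed values).
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
remaining = sum(1 for w in pruned if w != 0.0)
print(pruned, remaining)
```

Zeroed individual weights only save computation when the runtime or hardware can skip them; removing whole channels or filters (structured pruning) shrinks the dense computation directly.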

Model Conversion Pipelines

Training frameworks save models in their native formats: PyTorch saves .pt or .pth files containing Python-pickled state dictionaries, and TensorFlow saves SavedModel directories or .keras files. ONNX, an interchange format rather than a training framework, stores models as protobuf .onnx files. The edge inference runtimes covered here do not load these formats directly. TFLite Micro expects .tflite FlatBuffers. TensorRT requires a serialized engine file (often given a .trt extension) built for the specific GPU architecture. Hailo’s Dataflow Compiler produces .hef files for the Hailo-8/8L NPU. The conversion pipeline bridges this gap by transforming a trained model from its source format into the target runtime’s format while preserving numerical correctness.
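One way to check that a conversion preserved numerical correctness is to run the same input through the source model and the converted model and compare the outputs. The sketch below shows only the comparison step, on made-up logits; `outputs_match`, the tolerance value, and the sample numbers are assumptions for illustration, and obtaining the two output vectors depends on the runtimes involved.

```python
def max_abs_diff(reference, converted):
    """Largest element-wise absolute difference between two output vectors."""
    return max(abs(r - c) for r, c in zip(reference, converted))


def argmax(values):
    """Index of the largest element, i.e. the predicted class for logits."""
    return max(range(len(values)), key=lambda i: values[i])


def outputs_match(reference, converted, atol=1e-3):
    """True when shapes agree and every element is within `atol`."""
    if len(reference) != len(converted):
        return False
    return max_abs_diff(reference, converted) <= atol


# Hypothetical outputs for one input: source framework vs. converted model.
reference_logits = [2.31, -0.47, 0.92]
converted_logits = [2.3104, -0.4698, 0.9197]

ok = outputs_match(reference_logits, converted_logits)
same_prediction = argmax(reference_logits) == argmax(converted_logits)
print(ok, same_prediction, max_abs_diff(reference_logits, converted_logits))
```

A tight tolerance like this suits float-to-float conversions. A quantized target will usually need a looser tolerance, or a task-level metric such as agreement on the predicted class across a validation set.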

Profiling & Benchmarking Inference

A model that meets its accuracy target after quantization and conversion is only half deployed. The other half is meeting latency, memory, and power constraints on the target hardware. A MobileNetV2 int8 model that achieves 71% top-1 accuracy is useless for a 30 fps video pipeline if it takes 50 ms per inference on the target SBC: the per-frame budget at 30 fps is about 33 ms, so inference alone overruns it before preprocessing, postprocessing, and frame capture are counted. Profiling and benchmarking on the actual deployment hardware, not on a development workstation, is the only way to validate that the optimized model meets its operational requirements.
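The frame-budget arithmetic and a basic latency measurement can be sketched as follows. This is a minimal harness under stated assumptions: `benchmark`, `frame_budget_ms`, and `fake_inference` are hypothetical names, and the stand-in workload would be replaced by the real runtime's inference call, executed on the target device.

```python
import statistics
import time


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def benchmark(fn, warmup=5, runs=50):
    """Time repeated calls to `fn`; return median and 95th-percentile latency."""
    for _ in range(warmup):  # warm caches and lazy initialization first
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {"median_ms": statistics.median(samples), "p95_ms": p95}


def fake_inference():
    # Stand-in workload (assumption); substitute the runtime's invoke call.
    sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
budget = frame_budget_ms(30)
# Headroom left for capture, preprocessing, and postprocessing.
headroom_ms = budget - stats["p95_ms"]
print(f"budget={budget:.1f} ms, p95={stats['p95_ms']:.3f} ms, "
      f"headroom={headroom_ms:.3f} ms")
```

Comparing the budget against a tail latency such as the 95th percentile, rather than the mean, gives a more conservative picture, since occasional slow inferences are what cause dropped frames.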