<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Model Optimization &amp; Deployment on Embedded Systems Development</title><link>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/</link><description>Recent content in Model Optimization &amp; Deployment on Embedded Systems Development</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/index.xml" rel="self" type="application/rss+xml"/><item><title>Quantization</title><link>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/quantization/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/quantization/</guid><description>&lt;h1 id="quantization"&gt;Quantization&lt;a class="anchor" href="#quantization"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Quantization reduces the numerical precision of a neural network&amp;rsquo;s weights and activations from the default float32 representation to a lower-precision format — typically int8, float16, or int4. A float32 model stores each parameter as a 32-bit IEEE 754 floating-point number. Converting those parameters to int8 reduces model size by 4x (from 4 bytes per weight to 1 byte), decreases memory bandwidth during inference by the same factor, and enables hardware-accelerated integer arithmetic on devices with int8 support — including Arm Cortex-M with CMSIS-NN, dedicated NPUs (Ethos-U55/U65, Hailo-8), Edge TPU, and GPU int8 tensor cores on Jetson platforms.&lt;/p&gt;</description></item><item><title>Pruning &amp; Knowledge Distillation</title><link>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/pruning-distillation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/pruning-distillation/</guid><description>&lt;h1 id="pruning--knowledge-distillation"&gt;Pruning &amp;amp; Knowledge Distillation&lt;a class="anchor" href="#pruning--knowledge-distillation"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Quantization reduces the precision of every weight in a model. Pruning and knowledge distillation take a different approach: they reduce the &lt;strong&gt;number&lt;/strong&gt; of computations the model performs — pruning by removing redundant weights or structures, and distillation by training a smaller model to approximate a larger one. Both techniques require some form of retraining, which makes them more involved than post-training &lt;a href="https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/quantization/"&gt;quantization&lt;/a&gt;, but they address cases where quantization alone cannot meet the latency or size target.&lt;/p&gt;</description></item><item><title>Model Conversion Pipelines</title><link>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/model-conversion/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/model-conversion/</guid><description>&lt;h1 id="model-conversion-pipelines"&gt;Model Conversion Pipelines&lt;a class="anchor" href="#model-conversion-pipelines"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Training frameworks and interchange formats each store models in their own representation: PyTorch saves &lt;code&gt;.pt&lt;/code&gt; or &lt;code&gt;.pth&lt;/code&gt; files with Python-pickled state dictionaries, TensorFlow saves SavedModel directories or &lt;code&gt;.keras&lt;/code&gt; files, and ONNX stores models as protobuf &lt;code&gt;.onnx&lt;/code&gt; files. None of these formats run directly on edge inference runtimes. TFLite Micro expects &lt;code&gt;.tflite&lt;/code&gt; FlatBuffers. TensorRT requires serialized &lt;code&gt;.trt&lt;/code&gt; engine files built for the specific GPU architecture. Hailo&amp;rsquo;s Dataflow Compiler produces &lt;code&gt;.hef&lt;/code&gt; files for the Hailo-8/8L NPU. The conversion pipeline bridges this gap by transforming a trained model from its source format into the target runtime&amp;rsquo;s format while preserving numerical correctness.&lt;/p&gt;</description></item><item><title>Profiling &amp; Benchmarking Inference</title><link>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/profiling-benchmarking/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/profiling-benchmarking/</guid><description>&lt;h1 id="profiling--benchmarking-inference"&gt;Profiling &amp;amp; Benchmarking Inference&lt;a class="anchor" href="#profiling--benchmarking-inference"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;A model that meets its accuracy target after &lt;a href="https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/quantization/"&gt;quantization&lt;/a&gt; and &lt;a href="https://applied-ee.github.io/embedded/docs/edge-ai/model-optimization/model-conversion/"&gt;conversion&lt;/a&gt; is only half deployed. The other half is meeting latency, memory, and power constraints on the target hardware. A MobileNetV2 int8 model that achieves 71% top-1 accuracy is useless for a 30 fps video pipeline if it takes 50 ms per inference on the target SBC: 30 fps allows about 33 ms per frame, so inference alone overruns the budget by roughly 17 ms and leaves nothing for preprocessing, postprocessing, and frame capture. Profiling and benchmarking on the actual deployment hardware, not on a development workstation, is the only way to validate that the optimized model meets its operational requirements.&lt;/p&gt;</description></item></channel></rss>