Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering
Abstract
The deployment of deep neural networks on edge devices relies heavily on INT8 quantization to reduce memory footprint and inference latency. The industry-standard ONNX Runtime (ORT) provides Python-based quantization utilities that inject QuantizeLinear and DequantizeLinear (QDQ) nodes into the computation graph. However, the default node injection strategy isolates Convolution operations from subsequent Activation layers, inadvertently breaking hardware-level kernel fusion.
In this paper, we present Kenosis, a native Rust graph optimization engine that implements Fusion-Aware QDQ Placement. By leveraging the commutative properties of non-linear activations under positive scalar multiplication, Kenosis safely reorders the computation graph to preserve Conv-Activation contiguity. Our benchmarks across four architectures demonstrate speedups of up to 2.46× over FP32 baselines and up to 65% latency reduction against the ORT Python quantizer, with cosine similarity scores of 0.875–0.999 against FP32 outputs. These results are validated on 6,000 real-world images drawn from ImageNet-1K and MS COCO val2017.
1. Introduction
The transition from FP32 (32-bit floating point) to INT8 (8-bit integer) computation is a critical step in preparing machine learning models for production edge environments. While the mathematical theory of quantization is well understood, the implementation of this math within a static computation graph often introduces severe performance bottlenecks.
In the ONNX ecosystem, static INT8 quantization is achieved by wrapping heavy mathematical operations (like Conv and MatMul) in QuantizeLinear (Q) and DequantizeLinear (DQ) nodes. This "QDQ pattern" signals the backend execution provider to map the operation to an accelerated 8-bit integer kernel.
We observed that standard Python-based quantization tools apply a naive, node-by-node injection strategy. This localized approach ignores the broader graph topology — specifically the relationship between Convolutional layers and their subsequent Non-Linear Activations (e.g., ReLU). This paper details how this naive placement causes "fusion breaking," leading to unnecessary memory thrashing, and how Kenosis solves this via topological graph awareness.
2. The Bottleneck: Broken Kernel Fusion
Modern CPU and GPU architectures achieve maximum throughput by minimizing trips to main memory. "Kernel Fusion" is the process by which a runtime engine collapses multiple sequential graph operations into a single, highly optimized kernel, so that intermediate results stay in registers or cache instead of being written back to memory.
In a standard FP32 vision model, a Convolution is almost always followed immediately by a ReLU activation, forming a Conv ➔ ReLU pair.
When an ONNX execution provider sees this contiguous Conv ➔ ReLU pattern, it fuses them: the convolution and the negative-value zeroing are applied in a single pass over the data, without materializing the intermediate tensor in memory.
When the standard ORT Python quantizer converts this to INT8, it evaluates the Conv node in isolation and wraps it in QDQ nodes, producing a Conv ➔ Dequantize ➔ ReLU sequence.
By injecting the Dequantize node between Conv and ReLU, the quantizer severs their contiguity. The runtime engine can no longer fuse them. The hardware is forced to compute the INT8 Convolution, write the result to main memory, read it back to Dequantize it to FP32, write that result to memory, and finally read it again to apply ReLU. This memory thrashing erodes the computational speedup gained from INT8 math and, in some cases, outweighs it entirely.
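The loss of contiguity can be illustrated on a toy, linear list of operator types. This is only a sketch: it is not how ONNX Runtime detects fusion candidates, and `count_fusable_pairs` and the two example lists are hypothetical names introduced for illustration.

```python
# Toy model: a graph is a plain list of op-type strings in execution order.
# Real ONNX graphs are DAGs; a list is enough to show the adjacency problem.

def count_fusable_pairs(ops, producer="Conv", activation="Relu"):
    """Count positions where `producer` is immediately followed by `activation`."""
    return sum(
        1
        for current, following in zip(ops, ops[1:])
        if current == producer and following == activation
    )

fp32_graph = ["Conv", "Relu", "Conv", "Relu"]

# Simplified view of node-by-node QDQ injection: a DequantizeLinear
# ends up between every Conv and its Relu.
naive_qdq_graph = [
    "QuantizeLinear", "Conv", "DequantizeLinear", "Relu",
    "QuantizeLinear", "Conv", "DequantizeLinear", "Relu",
]

print(count_fusable_pairs(fp32_graph))       # 2
print(count_fusable_pairs(naive_qdq_graph))  # 0
```

Under this simplified model, the quantized graph offers no adjacent Conv ➔ Relu pair for a fusion pass to match.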
3. The Mathematical Guarantee of Reordering
To restore kernel fusion, the Dequantize node must be moved after the ReLU node. In computational graphs, altering the order of operations generally corrupts the output. Kenosis relies on a specific mathematical property of ReLU interacting with the Dequantization formula to guarantee that reordering is safe.
The Dequantize operation is defined as:
y = (x - zero_point) * scale
Kenosis designs Conv output bias quantization such that zero_point is exactly 0. The Dequantization therefore simplifies to pure positive scalar multiplication:
y = x * scale (where scale > 0)
The ReLU operation is defined as y = max(0, x). Because the scale is strictly positive, the scalar multiplication is commutative with the max(0, x) operation:
- Standard Path (Dequantize ➔ ReLU): max(0, x * scale)
- Kenosis Path (ReLU ➔ Dequantize): max(0, x) * scale
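For a negative integer x, both orderings yield exactly 0.0: multiplying by a positive scale keeps the value negative, so clamping before or after the multiplication gives the same result. For a non-negative x, the clamp is a no-op and both orderings yield x * scale. This equivalence is what makes the graph rewrite safe. The same commutativity holds for LeakyRelu, which also commutes with multiplication by a positive scalar. Kenosis treats Clip and HardSwish as commuting activations as well; because these contain fixed thresholds, the equivalence is exact only when the thresholds are rescaled into the quantized domain (for example, Clip(x * scale, a, b) = Clip(x, a / scale, b / scale) * scale). For activations where strict commutativity does not hold (e.g., Sigmoid), Kenosis applies QDQ wrapping on the activation output rather than commuting through it, preserving numerical correctness while still maximizing downstream fusion opportunities.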
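The commutativity argument can be checked numerically. The sketch below is an illustration, not part of Kenosis: the helper names, the scale value, and the use of the INT8 range as test inputs are all assumptions chosen for the example. It compares both orderings for ReLU, LeakyRelu, and Sigmoid with zero_point = 0.

```python
import math

def dequantize(q, scale, zero_point=0):
    """DequantizeLinear: y = (q - zero_point) * scale."""
    return (q - zero_point) * scale

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.1):
    return x if x >= 0 else alpha * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def commutes(activation, scale, values):
    """True if activation(dequantize(q)) matches dequantize(activation(q)) for every q."""
    return all(
        math.isclose(
            activation(dequantize(q, scale)),
            dequantize(activation(q), scale),
            rel_tol=1e-12,
            abs_tol=1e-12,
        )
        for q in values
    )

scale = 0.0371                # arbitrary positive scale, for illustration only
test_values = range(-128, 128)

print(commutes(relu, scale, test_values))        # True
print(commutes(leaky_relu, scale, test_values))  # True
print(commutes(sigmoid, scale, test_values))     # False
```

The tolerance is needed only for LeakyRelu, where the two orderings multiply the same factors in a different order and may differ in the last floating-point bit.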
The resulting Kenosis-optimized graph places Dequantize after the activation, restoring the contiguous Conv ➔ ReLU pattern that the execution provider maps to a single QLinearConv kernel.
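As a rough picture of the rewrite, the following sketch swaps DequantizeLinear and a commuting activation on a linear list of op types. It assumes scale > 0 and zero_point = 0 for every swapped DequantizeLinear; the function name and the list representation are hypothetical, and a real pass would operate on the ONNX protobuf graph and rewire tensor names.

```python
def move_dequantize_after_activation(ops, commuting=("Relu", "LeakyRelu")):
    """Rewrite [Conv|MatMul, DequantizeLinear, Act] into [Conv|MatMul, Act, DequantizeLinear].

    Only activations listed in `commuting` are moved; anything else is left untouched.
    """
    out = list(ops)
    i = 0
    while i + 2 < len(out):
        if (
            out[i] in ("Conv", "MatMul")
            and out[i + 1] == "DequantizeLinear"
            and out[i + 2] in commuting
        ):
            # Swap so the activation sits directly after the producer again.
            out[i + 1], out[i + 2] = out[i + 2], out[i + 1]
            i += 3
        else:
            i += 1
    return out

graph = [
    "QuantizeLinear", "Conv", "DequantizeLinear", "Relu",
    "QuantizeLinear", "Conv", "DequantizeLinear", "Sigmoid",
]
rewritten = move_dequantize_after_activation(graph)
print(rewritten[1:4])  # ['Conv', 'Relu', 'DequantizeLinear']
print(rewritten[5:8])  # ['Conv', 'DequantizeLinear', 'Sigmoid']
```

The Sigmoid branch is deliberately left unchanged, mirroring the rule that non-commuting activations are not reordered.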
4. The Kenosis Pipeline
Kenosis is a native Rust graph optimization engine that applies seven coordinated optimizations statically, prior to deployment. Unlike standard tooling, Kenosis performs a topological traversal of the ONNX protobuf graph:
- Self-calibration: Automatically generates synthetic calibration inputs and runs them through the model via ONNX Runtime to collect per-tensor activation ranges. No external calibration data required. Multi-input models and NLP inputs (token IDs, attention masks) are handled automatically.
- Weight quantization: INT8 symmetric per-tensor or per-channel. All scale computations in f64 to match ORT's internal precision.
- INT32 bias quantization: scale = activation_scale × weight_scale, zero_point = 0. Wrapped with DequantizeLinear for ORT kernel fusion.
- Zero-point nudged activation quantization: UINT8 asymmetric with post-hoc range adjustment ensuring float 0.0 maps exactly to the quantized zero. Prevents rounding asymmetry from compounding across layers.
- Fusion-aware QDQ placement: Detects Conv/MatMul → Activation pairs at graph level and places QDQ after the activation instead of between them. Combined with second-pass wrapping of Add, Concat, MaxPool, and AveragePool, this maximizes QLinear fusions.
- Non-vision tensor protection: For multi-input models (detection, segmentation), tensors reachable from non-primary inputs (scale_factor, image_shape) are traced through the graph and excluded from quantization, preventing metadata paths from being crushed by INT8 range limits.
- Model output protection: Tensors that are direct model outputs are never QDQ-wrapped, preserving full FP32 precision in detection head scores and bounding box coordinates.
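The quantization arithmetic behind the weight, bias, and activation steps can be sketched as follows. This is an illustration under stated assumptions rather than the Kenosis implementation: the function names are hypothetical, the symmetric range of [-127, 127] is an assumed choice, and the nudging shown is the common recipe of recomputing the range from the rounded zero point. Python floats are IEEE 754 doubles, and the built-in `round` rounds halves to even.

```python
def symmetric_weight_scale(weights, qmax=127):
    """Per-tensor symmetric scale: the largest magnitude maps to qmax."""
    max_abs = max(abs(w) for w in weights)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize_symmetric(values, scale, qmin=-127, qmax=127):
    """Round to the nearest integer and clamp to the symmetric INT8 range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def bias_to_int32(bias, activation_scale, weight_scale):
    """Bias uses scale = activation_scale * weight_scale with zero_point = 0."""
    bias_scale = activation_scale * weight_scale
    return [round(b / bias_scale) for b in bias], bias_scale

def nudged_uint8_params(rmin, rmax, qmin=0, qmax=255):
    """Asymmetric UINT8 parameters where float 0.0 lands exactly on zero_point."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)   # the range must contain 0.0
    scale = (rmax - rmin) / (qmax - qmin) or 1.0  # guard against an empty range
    zero_point = max(qmin, min(qmax, round(qmin - rmin / scale)))
    # Recompute the representable range from the integer zero point.
    nudged_min = (qmin - zero_point) * scale
    nudged_max = (qmax - zero_point) * scale
    return scale, zero_point, nudged_min, nudged_max

w_scale = symmetric_weight_scale([0.5, -1.27, 0.25])
print(quantize_symmetric([0.5, -1.27, 0.25], w_scale))  # [50, -127, 25]

q_bias, _ = bias_to_int32([0.1, -0.05], activation_scale=0.02, weight_scale=0.01)
print(q_bias)                                           # [500, -250]

scale, zero_point, nudged_min, nudged_max = nudged_uint8_params(-1.0, 3.0)
print(zero_point)                                       # 64
```

After nudging, dequantizing the integer `zero_point` gives exactly 0.0, which is the property the zero-point nudging step is described as guaranteeing.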
5. Benchmarks and Results
Test Environment
| Component | Specification |
|---|---|
| CPU | Intel i5-13420H (8C/12T, 8 GB DDR5) |
| GPU | Disabled (CPU-only execution) |
| Runtime | ONNX Runtime 1.24 (CPU EP), ort crate v2.0.0-rc.12 |
| Build | Release (--release), Rust 1.85+ |
| Isolated benchmarks | 100 timed iterations after 20 warmup runs, single-threaded ORT (intra_op=1, inter_op=1) |
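The measurement protocol in the table (20 warmup runs followed by 100 timed iterations) can be sketched as a small harness. The code below is a generic illustration, not the benchmark code used for this paper: `benchmark` and `dummy_inference` are hypothetical names, and the source does not state whether mean or median latency is reported, so both are returned. In ONNX Runtime's Python API, single-threaded execution corresponds to setting `intra_op_num_threads` and `inter_op_num_threads` to 1 on `SessionOptions`.

```python
import statistics
import time

def benchmark(run_once, warmup=20, iterations=100):
    """Time `run_once` after discarding warmup runs; returns latencies in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm caches and lazy initialization; not timed

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "samples": len(samples_ms),
    }

def dummy_inference():
    """Stand-in workload; a real benchmark would call the inference session here."""
    sum(i * i for i in range(10_000))

stats = benchmark(dummy_inference)
print(stats["samples"])  # 100
```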
Fidelity is evaluated using Cosine Similarity between INT8 and FP32 output vectors, as well as task-level Top-1 Predict Agreement (the percentage of inputs where INT8's top prediction matches FP32). Accuracy is evaluated on the official ImageNet-1K validation sample (1,000 images) and the full MS COCO val2017 set (5,000 images). All FP32 baseline models are sourced from the ONNX Model Zoo; PP-YOLOE+ Small is exported from PaddleDetection.
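Both fidelity metrics are simple to state in code. The sketch below uses plain Python lists and made-up output vectors; the function names are hypothetical, and returning 0.0 for a zero-norm vector is an assumed convention rather than something specified in this paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two output vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for degenerate vectors
    return dot / (norm_a * norm_b)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def top1_agreement(fp32_outputs, int8_outputs):
    """Fraction of inputs where the INT8 top prediction matches the FP32 one."""
    matches = sum(
        argmax(f) == argmax(q) for f, q in zip(fp32_outputs, int8_outputs)
    )
    return matches / len(fp32_outputs)

# Made-up logits for two inputs and three classes.
fp32 = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
int8 = [[0.12, 0.66, 0.22], [0.35, 0.45, 0.20]]

print(top1_agreement(fp32, int8))  # 0.5
```

In this toy example the first input keeps its top class and the second does not, so agreement is 50% even though both INT8 vectors remain fairly close to their FP32 counterparts in direction.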
5.1 Isolated Latency — PP-YOLOE+ Small (Kenosis INT8 vs FP32)
PP-YOLOE+ is an anchor-free object detection architecture designed for efficient edge deployment. Single-threaded isolated inference.
| Resolution | INT8 Latency | FP32 Latency | Speedup | INT8 Size |
|---|---|---|---|---|
| 320×320 | 23ms | 44ms | 1.89× | 7.9 MB (3.9× smaller) |
| 416×416 | 43ms | 77ms | 1.80× | 7.9 MB (3.9× smaller) |
| 640×640 | 111ms | 187ms | 1.68× | 7.9 MB (3.8× smaller) |
5.2 Isolated Latency — Classifier Benchmarks (Kenosis INT8 vs FP32)
Standard vision classifiers quantized with per-tensor symmetric INT8 weights and self-calibrated activations.
| Architecture | Cosine Sim. | Top-1 Agree. | Kenosis INT8 | FP32 Baseline | Speedup |
|---|---|---|---|---|---|
| SqueezeNet 1.1 | 0.999 | — | 2.85ms | 6.60ms | 2.32× |
| ResNet50 v2 | 0.980 | 94.8% | 27.8ms | 68.4ms | 2.46× |
| MobileNetV2 | 0.970 | 91.3% | 4.61ms | 6.53ms | 1.42× |
| EfficientNet-Lite4 | 0.875 | 81.9% | 14.2ms | 26.8ms | 1.89× |
SqueezeNet 1.1 cosine is measured under synthetic inputs. Real-image accuracy is omitted due to non-standard ONNX Zoo preprocessing.
5.3 Direct Comparison — Kenosis vs ORT Python Quantizer (Isolated)
A head-to-head comparison against the ORT Python quantizer isolates the contribution of fusion-aware QDQ placement. Both quantizers use per-tensor INT8 with synthetic calibration data. ORT quantizer uses quantize_static with QDQ format after quant_pre_process.
| Architecture | Kenosis Latency | ORT Latency | Latency Reduction vs ORT |
|---|---|---|---|
| SqueezeNet 1.1 | 2.85ms | 8.13ms | 65% |
| ResNet50 v2 | 27.8ms | 46.1ms | 40% |
| MobileNetV2 | 4.61ms | 6.29ms | 27% |
| EfficientNet-Lite4 | 14.2ms | 23.5ms | 40% |
The ORT Python quantizer produces a SqueezeNet model that is slower than the FP32 baseline (8.13ms INT8 vs 6.60ms FP32) — a direct consequence of broken Conv-ReLU fusion caused by naive QDQ placement. Kenosis eliminates this regression entirely, delivering a 2.32× speedup over FP32.
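The percentages in the comparison table are consistent with a relative latency reduction, (ORT latency - Kenosis latency) / ORT latency, while the speedup figures are plain latency ratios. The short sketch below reproduces the SqueezeNet numbers from the reported latencies; the function names are hypothetical.

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency (values above 1 are faster)."""
    return baseline_ms / optimized_ms

def latency_reduction_pct(reference_ms, optimized_ms):
    """Percentage of the reference latency that the optimized model removes."""
    return (reference_ms - optimized_ms) / reference_ms * 100.0

# SqueezeNet 1.1 latencies as reported: FP32 6.60 ms, ORT INT8 8.13 ms, Kenosis INT8 2.85 ms.
print(round(speedup(6.60, 2.85), 2))             # 2.32  (Kenosis vs FP32)
print(round(latency_reduction_pct(8.13, 2.85)))  # 65    (Kenosis vs ORT)
print(round(speedup(6.60, 8.13), 2))             # 0.81  (ORT vs FP32, i.e. a slowdown)
```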
For complex detection models like PP-YOLOE+ Small, the standard ORT quantizer fails to quantize the convolutional weights due to its local-node analysis limitation. This leaves the heavy weight matrices in FP32 while injecting redundant casting node pairs, resulting in a bloated model file (30.7 MB vs. 30.4 MB FP32 baseline). Kenosis successfully processes these complex layers, reducing the weight matrices to 8-bit integers and yielding a 7.9 MB file size (3.9× smaller).
5.4 Real-World Accuracy — Kenosis vs ORT Python Quantizer
Accuracy comparison reveals substantial divergence in predictive quality. Results report cosine similarity and Top-1 predict agreement for classifiers on 1,000 ImageNet-1K images, and per-output cosine similarity for PP-YOLOE+ Small on 5,000 MS COCO val2017 images. All metrics are computed against the FP32 baseline.
| Architecture | Dataset (N) | Cosine Sim. (FP32) | Cosine Sim. (Kenosis) | Cosine Sim. (ORT) | Top-1 Agree. (FP32) | Top-1 Agree. (Kenosis) | Top-1 Agree. (ORT) |
|---|---|---|---|---|---|---|---|
| ResNet50 v2 | ImageNet-1K (1000) | 1.000 | 0.980 | 0.974 | 100% | 94.8% | 49.4% |
| MobileNetV2 | ImageNet-1K (1000) | 1.000 | 0.970 | 0.954 | 100% | 91.3% | 5.3% |
| EfficientNet-Lite4 | ImageNet-1K (1000) | 1.000 | 0.875 | 0.364 | 100% | 81.9% | 16.3% |
| PP-YOLOE+ Small | COCO val2017 (5000) | 1.000 / 1.000 | 0.997 / 0.860 | 0.996 / 0.653 | — | — | — |
Across all architectures, Kenosis INT8 consistently achieves higher fidelity to the FP32 baseline. On classifiers, the divergence is most pronounced in Top-1 predict agreement: ResNet50 v2 achieves 94.8% under Kenosis vs. 49.4% under ORT; MobileNetV2 achieves 91.3% vs. 5.3%; and EfficientNet-Lite4 achieves 81.9% vs. 16.3% (with ORT's output cosine collapsing to 0.364).
For PP-YOLOE+ Small, cosine similarity is reported separately for bounding box coordinates and classification confidence scores, in that order. Both quantizers preserve box fidelity at 0.996 or higher (0.997 for Kenosis, 0.996 for ORT), but the classification score head reveals a significant gap: Kenosis achieves 0.860 mean cosine on 5,000 COCO images while ORT degrades to 0.653.
6. Conclusion
The transition from Python-based, localized node injection to Rust-based, topologically aware graph rewriting represents a significant step forward in edge AI deployment. By aligning the static graph structure with the expectations of underlying hardware execution providers, Kenosis achieves native kernel fusion without requiring custom runtime extensions or modifications to the ONNX Runtime itself.
Fusion-aware QDQ placement yields INT8 models with cosine similarity scores of 0.875–0.999 against their FP32 origins while achieving the compute efficiency required to run high-density computer vision pipelines on commodity edge hardware. Evaluated on 6,000 real-world images from ImageNet-1K and MS COCO val2017, Kenosis INT8 achieves 81.9–94.8% Top-1 predict agreement with FP32 baselines across all tested classifiers, compared to 5.3–49.4% for the standard ORT Python quantizer.