Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering

Abstract

The deployment of deep neural networks on edge devices relies heavily on INT8 quantization to reduce memory footprint and inference latency. The industry-standard ONNX Runtime (ORT) provides Python-based quantization utilities that inject QuantizeLinear and DequantizeLinear (QDQ) nodes into the computation graph. However, the default node injection strategy isolates Convolution operations from subsequent Activation layers, inadvertently breaking hardware-level kernel fusion.

In this paper, we present Kenosis, a native Rust graph optimization engine that implements Fusion-Aware QDQ Placement. By exploiting the fact that certain non-linear activations commute with multiplication by a positive scalar, Kenosis safely reorders the computation graph to preserve Conv-Activation contiguity. Our benchmarks across four architectures demonstrate speedups of up to 2.46× over FP32 baselines and up to 65% latency reduction against the ORT Python quantizer, with cosine similarity scores of 0.875–0.999 against FP32 outputs. Fidelity is validated on 6,000 real-world images drawn from ImageNet-1K and MS COCO val2017.

1. Introduction

The transition from FP32 (32-bit floating point) to INT8 (8-bit integer) computation is a critical step in preparing machine learning models for production edge environments. While the mathematical theory of quantization is well understood, the implementation of this math within a static computation graph often introduces severe performance bottlenecks.

In the ONNX ecosystem, static INT8 quantization is achieved by wrapping heavy mathematical operations (like Conv and MatMul) in QuantizeLinear (Q) and DequantizeLinear (DQ) nodes. This "QDQ pattern" signals the backend execution provider to map the operation to an accelerated 8-bit integer kernel.

We observed that standard Python-based quantization tools apply a naive, node-by-node injection strategy. This localized approach ignores the broader graph topology — specifically the relationship between Convolutional layers and their subsequent Non-Linear Activations (e.g., ReLU). This paper details how this naive placement causes "fusion breaking," leading to unnecessary memory thrashing, and how Kenosis solves this via topological graph awareness.

2. The Bottleneck: Broken Kernel Fusion

Modern CPU and GPU architectures achieve maximum throughput by minimizing trips to main memory. "Kernel Fusion" is the process by which a runtime engine collapses multiple sequential graph operations into a single optimized kernel, so that intermediate results stay in registers or cache instead of round-tripping through main memory.

In a standard FP32 vision model, a Convolution is almost always followed immediately by a ReLU activation:

Figure: Conv ➔ ReLU (FP32 baseline: fused by the execution provider into a single memory cycle)

When an ONNX execution provider sees this contiguous Conv ➔ ReLU pattern, it fuses the two: the convolution and the zeroing of negative values are computed in a single kernel, without writing the intermediate tensor back to main memory.

When the standard ORT Python quantizer converts this to INT8, it evaluates the Conv node in isolation and wraps it in QDQ nodes:

Figure: Quantize ➔ Conv ➔ Dequantize ➔ ReLU (ORT Python output: Dequantize severs Conv-ReLU contiguity, breaking fusion)

By injecting the Dequantize node between Conv and ReLU, the quantizer severs their contiguity, and the runtime engine can no longer fuse them. The hardware is forced to compute the INT8 Convolution, push the result to main memory, pull it back to Dequantize to FP32, push it back to memory, and finally pull it once more to apply ReLU. This extra memory traffic can cancel out the computational speedup gained from INT8 math.

3. The Mathematical Guarantee of Reordering

To restore kernel fusion, the Dequantize node must be moved after the ReLU node. In computational graphs, altering the order of operations generally corrupts the output. Kenosis relies on a specific mathematical property of ReLU interacting with the Dequantization formula to guarantee that reordering is safe.

The Dequantize operation is defined as:

y = (x - zero_point) * scale

Kenosis configures Conv output and bias quantization so that zero_point is exactly 0. Dequantization therefore simplifies to multiplication by a strictly positive scalar:

y = x * scale   (where scale > 0)

The ReLU operation is defined as y = max(0, x). Because the scale is strictly positive, multiplication by the scale commutes with the max(0, x) operation:

  • Standard Path (Dequantize ➔ ReLU): max(0, x * scale)
  • Kenosis Path (ReLU ➔ Dequantize): max(0, x) * scale

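For a negative integer x, multiplying by a positive scale and then clamping, or clamping first and then multiplying, both give exactly 0.0; for a non-negative x, both paths give x * scale. The two paths therefore agree for every input, which is what makes the graph rewrite safe. The same argument carries over directly to LeakyRelu, which also commutes with multiplication by a positive scalar. Kenosis extends the same reordering to Clip and HardSwish, although for these the identity is not automatic: Clip commutes with the scale only when its bounds are rescaled by 1/scale, and HardSwish contains fixed constants that do not scale with x. For activations where commutativity does not hold at all (e.g., Sigmoid), Kenosis applies QDQ wrapping on the activation output rather than commuting through it, preserving numerical correctness while still maximizing downstream fusion opportunities.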
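The commutativity argument can be checked numerically. The following is a minimal sketch, assuming zero_point = 0 and an arbitrary illustrative scale of 0.0375; the helper names (`dequantize`, `commutes`) are hypothetical and are not taken from Kenosis. It compares the two orderings over the full signed 8-bit range for ReLU, LeakyRelu, and Sigmoid.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return x if x >= 0.0 else alpha * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dequantize(q, scale):
    # Dequantize with zero_point assumed to be 0: pure scalar multiplication.
    return q * scale


def commutes(activation, scale, q_values):
    # True if activation(dequantize(q)) equals dequantize(activation(q))
    # for every q, up to floating-point rounding.
    return all(
        math.isclose(
            activation(dequantize(q, scale)),
            dequantize(activation(q), scale),
            rel_tol=1e-9,
            abs_tol=1e-12,
        )
        for q in q_values
    )


Q_VALUES = range(-128, 128)  # every signed 8-bit integer
SCALE = 0.0375               # illustrative positive scale (assumption)

RELU_OK = commutes(relu, SCALE, Q_VALUES)           # True
LEAKY_OK = commutes(leaky_relu, SCALE, Q_VALUES)    # True
SIGMOID_OK = commutes(sigmoid, SCALE, Q_VALUES)     # False
```

Sigmoid fails already at q = 0, where sigmoid(0 * scale) = 0.5 but sigmoid(0) * scale = 0.5 * scale, which is why it has to be handled by wrapping rather than reordering.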

The resulting Kenosis-optimized graph places Dequantize after the activation, restoring the contiguous Conv ➔ ReLU pattern that the execution provider maps to a single QLinearConv kernel.

Figure: Quantize ➔ Conv ➔ ReLU ➔ Dequantize (Kenosis output: Conv-ReLU contiguity preserved, QLinearConv fusion achieved)
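To make the reordering concrete, here is a deliberately simplified sketch of the rewrite on a linear chain of operator names. It is not the Kenosis implementation: real ONNX graphs are DAGs with named tensors, and the function name `sink_dequantize` and the two operator sets below are assumptions chosen for illustration.

```python
# Operators whose output may be fused with a following activation (assumption).
FUSABLE_PRODUCERS = {"Conv", "MatMul"}
# Activations treated as commuting with a positive scale in this sketch (assumption).
COMMUTING_ACTIVATIONS = {"Relu", "LeakyRelu"}


def sink_dequantize(chain):
    """Move DequantizeLinear past a commuting activation that follows Conv/MatMul.

    `chain` is a list of operator type names in execution order.
    Returns a new list; the input is left unmodified.
    """
    ops = list(chain)
    i = 0
    while i + 2 < len(ops):
        is_pattern = (
            ops[i] in FUSABLE_PRODUCERS
            and ops[i + 1] == "DequantizeLinear"
            and ops[i + 2] in COMMUTING_ACTIVATIONS
        )
        if is_pattern:
            # Conv, DQ, Act  becomes  Conv, Act, DQ
            ops[i + 1], ops[i + 2] = ops[i + 2], ops[i + 1]
            i += 3
        else:
            i += 1
    return ops


naive = ["QuantizeLinear", "Conv", "DequantizeLinear", "Relu"]
rewritten = sink_dequantize(naive)
# rewritten == ["QuantizeLinear", "Conv", "Relu", "DequantizeLinear"]

non_commuting = ["QuantizeLinear", "Conv", "DequantizeLinear", "Sigmoid"]
untouched = sink_dequantize(non_commuting)
# untouched == non_commuting, because Sigmoid is not in COMMUTING_ACTIVATIONS
```

The swap is only applied when all three operators match, so chains ending in a non-commuting activation are left exactly as the quantizer produced them.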

4. The Kenosis Pipeline

Kenosis is a native Rust graph optimization engine that applies seven coordinated optimizations statically, prior to deployment. Unlike standard tooling, Kenosis performs a topological traversal of the ONNX protobuf graph. The seven optimizations are:

  1. Self-calibration: Automatically generates synthetic calibration inputs and runs them through the model via ONNX Runtime to collect per-tensor activation ranges. No external calibration data required. Multi-input models and NLP inputs (token IDs, attention masks) are handled automatically.
  2. Weight quantization: INT8 symmetric per-tensor or per-channel. All scale computations in f64 to match ORT's internal precision.
  3. INT32 bias quantization: scale = activation_scale × weight_scale, zero_point = 0. Wrapped with DequantizeLinear for ORT kernel fusion.
  4. Zero-point nudged activation quantization: UINT8 asymmetric with post-hoc range adjustment ensuring float 0.0 maps exactly to the quantized zero. Prevents rounding asymmetry from compounding across layers.
  5. Fusion-aware QDQ placement: Detects Conv/MatMul → Activation pairs at graph level and places QDQ after the activation instead of between them. Combined with second-pass wrapping of Add, Concat, MaxPool, and AveragePool, this maximizes QLinear fusions.
  6. Non-vision tensor protection: For multi-input models (detection, segmentation), tensors reachable from non-primary inputs (scale_factor, image_shape) are traced through the graph and excluded from quantization, preventing metadata paths from being crushed by INT8 range limits.
  7. Model output protection: Tensors that are direct model outputs are never QDQ-wrapped, preserving full FP32 precision in detection head scores and bounding box coordinates.
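The arithmetic behind steps 2 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the Kenosis code: the function names are hypothetical, the symmetric range is taken as [-127, 127], and the zero-point nudge follows one common scheme (round the zero point to an integer, then shift the float range to match). Python floats are 64-bit, in line with the f64 scale computation described in step 2.

```python
def symmetric_weight_scale(weights, qmax=127):
    # Step 2 (sketch): per-tensor symmetric scale, zero_point implicitly 0.
    max_abs = max(abs(w) for w in weights)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_symmetric(values, scale, qmin=-127, qmax=127):
    # Round to nearest (ties to even) and clamp to the INT8 range.
    return [min(qmax, max(qmin, round(v / scale))) for v in values]


def quantize_bias(bias, activation_scale, weight_scale):
    # Step 3: scale = activation_scale * weight_scale, zero_point = 0, INT32 storage.
    scale = activation_scale * weight_scale
    lo, hi = -(2 ** 31), 2 ** 31 - 1
    return [min(hi, max(lo, round(b / scale))) for b in bias]


def nudged_uint8_params(rmin, rmax, qmin=0, qmax=255):
    # Step 4 (sketch): asymmetric UINT8 parameters, nudged so that
    # float 0.0 lands exactly on an integer zero_point.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)  # range must contain 0.0
    if rmax == rmin:
        return 1.0, 0, 0.0, 0.0
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = min(qmax, max(qmin, round(qmin - rmin / scale)))
    # Post-hoc range adjustment implied by the integer zero_point.
    nudged_min = (qmin - zero_point) * scale
    nudged_max = (qmax - zero_point) * scale
    return scale, zero_point, nudged_min, nudged_max


w_scale = symmetric_weight_scale([-0.5, 0.25, 1.27])
w_q = quantize_symmetric([-0.5, 0.25, 1.27], w_scale)

b_q = quantize_bias([0.5], activation_scale=0.02, weight_scale=0.01)

a_scale, a_zp, a_min, a_max = nudged_uint8_params(-1.0, 3.0)
zero_q = round(0.0 / a_scale) + a_zp        # quantize float 0.0
zero_back = (zero_q - a_zp) * a_scale       # dequantize it again
```

With the nudge in place, float 0.0 quantizes to the zero point and dequantizes back to exactly 0.0, so ReLU outputs and zero padding introduce no systematic offset.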

5. Benchmarks and Results

Test Environment

Component            Specification
CPU                  Intel i5-13420H (8C/12T, 8 GB DDR5)
GPU                  Disabled (CPU-only execution)
Runtime              ONNX Runtime 1.24 (CPU EP), ort crate v2.0.0-rc.12
Build                Release (--release), Rust 1.85+
Isolated benchmarks  100 timed iterations after 20 warmup runs, single-threaded ORT (intra_op=1, inter_op=1)
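The isolated benchmark protocol (20 warmup runs followed by 100 timed iterations) can be sketched as a small harness. This is an assumption-laden illustration rather than the harness used for the reported numbers: the function name `benchmark` is hypothetical, the inference call is replaced by a placeholder callable, and the text does not state whether mean or median latency is reported, so both are returned.

```python
import statistics
import time


def benchmark(fn, warmup=20, iterations=100):
    """Time `fn` after discarding warmup runs. Returns latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches and allocators; results are discarded

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": len(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
    }


# Placeholder workload standing in for a single-threaded inference call.
calls = []
result = benchmark(lambda: calls.append(1))
```

In a real run, `fn` would wrap one inference call on a session configured for single-threaded execution, matching the intra_op=1, inter_op=1 setting listed above.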

Fidelity is evaluated using Cosine Similarity between INT8 and FP32 output vectors, as well as task-level Top-1 Predict Agreement (the percentage of inputs where INT8's top prediction matches FP32). Accuracy is evaluated on the official ImageNet-1K validation sample (1,000 images) and the full MS COCO val2017 set (5,000 images). The FP32 classifier baselines are sourced from the ONNX Model Zoo; PP-YOLOE+ Small is exported from PaddleDetection.
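Both fidelity metrics are simple to state in code. The sketch below assumes each model output is a flat list of floats per input; the function names and the toy logits are illustrative and do not come from the evaluation scripts.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two output vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def top1_agreement(fp32_outputs, int8_outputs):
    # Percentage of inputs where the INT8 top class matches the FP32 top class.
    matches = sum(
        argmax(f) == argmax(q) for f, q in zip(fp32_outputs, int8_outputs)
    )
    return 100.0 * matches / len(fp32_outputs)


# Toy logits for two inputs (illustrative values only).
fp32 = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
int8 = [[0.2, 0.6, 0.2], [0.3, 0.5, 0.2]]

agreement = top1_agreement(fp32, int8)        # 50.0: only the first input agrees
identical = cosine_similarity(fp32[0], fp32[0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The two metrics measure different things: cosine similarity can remain high while the top class flips, so a large gap between them is possible.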

5.1 Isolated Latency — PP-YOLOE+ Small (Kenosis INT8 vs FP32)

PP-YOLOE+ is an anchor-free object detection architecture designed for efficient edge deployment. Single-threaded isolated inference.

Resolution  INT8 Latency  FP32 Latency  Speedup  INT8 Size
320×320     23ms          44ms          1.89×    7.9 MB (3.9× smaller)
416×416     43ms          77ms          1.80×    7.9 MB (3.9× smaller)
640×640     111ms         187ms         1.68×    7.9 MB (3.8× smaller)

5.2 Isolated Latency — Classifier Benchmarks (Kenosis INT8 vs FP32)

Standard vision classifiers quantized with per-tensor symmetric INT8 weights and self-calibrated activations.

Architecture        Cosine Sim.  Top-1 Agree.  Kenosis INT8  FP32 Baseline  Speedup
SqueezeNet 1.1      0.999        n/a           2.85ms        6.60ms         2.32×
ResNet50 v2         0.980        94.8%         27.8ms        68.4ms         2.46×
MobileNetV2         0.970        91.3%         4.61ms        6.53ms         1.42×
EfficientNet-Lite4  0.875        81.9%         14.2ms        26.8ms         1.89×

SqueezeNet 1.1 cosine is measured under synthetic inputs. Real-image accuracy is omitted due to non-standard ONNX Zoo preprocessing.

5.3 Direct Comparison — Kenosis vs ORT Python Quantizer (Isolated)

A head-to-head comparison against the ORT Python quantizer isolates the contribution of fusion-aware QDQ placement. Both quantizers use per-tensor INT8 with synthetic calibration data. The ORT quantizer is run via quantize_static in QDQ format, after quant_pre_process.

Architecture        Kenosis Latency  ORT Latency  Kenosis Advantage
SqueezeNet 1.1      2.85ms           8.13ms       65% faster
ResNet50 v2         27.8ms           46.1ms       40% faster
MobileNetV2         4.61ms           6.29ms       27% faster
EfficientNet-Lite4  14.2ms           23.5ms       40% faster

The ORT Python quantizer produces a SqueezeNet model that is slower than the FP32 baseline (8.13ms INT8 vs 6.60ms FP32) — a direct consequence of broken Conv-ReLU fusion caused by naive QDQ placement. Kenosis eliminates this regression entirely, delivering a 2.32× speedup over FP32.
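The two headline figures for SqueezeNet use different conventions, which a short worked example makes explicit. The sketch below assumes that "speedup" is the ratio of baseline to optimized latency and that the "faster" percentages are relative latency reductions; the function names are illustrative.

```python
def speedup(baseline_ms, optimized_ms):
    # Ratio convention: values above 1.0 mean the optimized model is faster.
    return baseline_ms / optimized_ms


def latency_reduction_pct(baseline_ms, optimized_ms):
    # Percentage of the baseline latency that was removed.
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


FP32_MS, ORT_INT8_MS, KENOSIS_INT8_MS = 6.60, 8.13, 2.85  # SqueezeNet 1.1

kenosis_vs_fp32 = round(speedup(FP32_MS, KENOSIS_INT8_MS), 2)             # 2.32
kenosis_vs_ort = round(latency_reduction_pct(ORT_INT8_MS, KENOSIS_INT8_MS))  # 65
ort_vs_fp32 = speedup(FP32_MS, ORT_INT8_MS)                               # below 1.0: a regression
```

Under these conventions, a 65% latency reduction against ORT corresponds to a ratio of about 2.85×, so the percentage and ratio columns should not be compared directly.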

For complex detection models like PP-YOLOE+ Small, the standard ORT quantizer fails to quantize the convolutional weights due to its local-node analysis limitation. This leaves the heavy weight matrices in FP32 while injecting redundant casting node pairs, resulting in a bloated model file (30.7 MB vs. 30.4 MB FP32 baseline). Kenosis successfully processes these complex layers, reducing the weight matrices to 8-bit integers and yielding a 7.9 MB file size (3.9× smaller).

5.4 Real-World Accuracy — Kenosis vs ORT Python Quantizer

Accuracy comparison reveals substantial divergence in predictive quality. Results report cosine similarity and Top-1 predict agreement for classifiers on 1,000 ImageNet-1K images, and per-output cosine similarity for PP-YOLOE+ Small on 5,000 MS COCO val2017 images. All metrics are computed against the FP32 baseline.

                                         Cosine Similarity                            Top-1 Predict Agreement
Architecture        Dataset (N)          FP32           Kenosis        ORT            FP32   Kenosis  ORT
ResNet50 v2         ImageNet-1K (1000)   1.000          0.980          0.974          100%   94.8%    49.4%
MobileNetV2         ImageNet-1K (1000)   1.000          0.970          0.954          100%   91.3%    5.3%
EfficientNet-Lite4  ImageNet-1K (1000)   1.000          0.875          0.364          100%   81.9%    16.3%
PP-YOLOE+ Small     COCO val2017 (5000)  1.000 / 1.000  0.997 / 0.860  0.996 / 0.653  n/a    n/a      n/a

PP-YOLOE+ Small cosine values are given as bounding box coordinates / classification scores.

Across all architectures, Kenosis INT8 consistently achieves higher fidelity to the FP32 baseline. On classifiers, the divergence is most pronounced in Top-1 predict agreement: ResNet50 v2 achieves 94.8% under Kenosis vs. 49.4% under ORT; MobileNetV2 achieves 91.3% vs. 5.3%; and EfficientNet-Lite4 achieves 81.9% vs. 16.3% (with ORT's output cosine collapsing to 0.364).

For PP-YOLOE+ Small, cosine similarity is reported separately for bounding box coordinates and classification confidence scores. Both quantizers preserve box fidelity at or above 0.996, but the classification score head reveals a significant gap: Kenosis achieves 0.860 mean cosine on 5,000 COCO images while ORT degrades to 0.653.

6. Conclusion

The transition from Python-based, localized node injection to Rust-based, topologically aware graph rewriting represents a significant step forward in edge AI deployment. By aligning the static graph structure with the expectations of underlying hardware execution providers, Kenosis achieves native kernel fusion without requiring custom runtime extensions or modifications to the ONNX Runtime itself.

Fusion-aware QDQ placement yields INT8 models with cosine similarity scores of 0.875–0.999 against their FP32 origins while achieving the compute efficiency required to run high-density computer vision pipelines on commodity edge hardware. Evaluated on 6,000 real-world images from ImageNet-1K and MS COCO val2017, Kenosis tracks the FP32 baselines more closely than the standard ORT Python quantizer on every tested model; on the ImageNet-1K classifiers it achieves 81.9–94.8% Top-1 predict agreement, compared to 5.3–49.4% for ORT.
