Slimming Down YOLOv8m: Optimizing Latency with NVIDIA FastNAS

1. Problem

Modern object detection models are becoming increasingly heavy, while real-world deployment environments are still constrained by limited computational resources, especially on edge devices such as Jetson and embedded systems.
Although YOLOv8m offers a good balance between accuracy and efficiency, its latency can still be a bottleneck in real-time applications.

Goal:

Find a structured optimization method that improves inference latency while preserving accuracy as much as possible.

2. Limitations of Conventional Approaches

Typical pruning methods have the following limitations:

Unstructured pruning
- Sets small weights to zero
- Usually provides little to no latency improvement: the tensors keep their shape, so dense kernels still process the zeros unless the hardware and runtime support that sparsity pattern

Heuristic-based structured pruning
- Removes channels based on an importance score (e.g., L1 norm)
- Does not consider hardware characteristics
- The resulting FLOPs reduction does not reliably translate into lower latency

In other words,
reducing theoretical computation does not necessarily lead to faster inference in practice.
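
The difference between the two pruning styles can be sketched with a simple multiply-accumulate (MAC) count. This is a minimal illustration in plain Python, not code from the experiment: the layer shape (64 to 128 channels, 3x3 kernel, 80x80 output) and the helper names are hypothetical.

```python
def conv_macs(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulates a dense convolution executes for one image."""
    return c_in * c_out * kernel * kernel * h_out * w_out


def l1_norms(filters):
    """L1 norm of each output filter (each filter is a flat list of weights)."""
    return [sum(abs(w) for w in f) for f in filters]


def keep_top_channels(filters, keep):
    """Indices of the `keep` filters with the largest L1 norm, in original order."""
    norms = l1_norms(filters)
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 80x80 output map.
dense = conv_macs(64, 128, 3, 80, 80)

# Unstructured pruning zeroes individual weights but keeps the tensor shape,
# so a dense kernel still executes the same number of MACs.
unstructured = conv_macs(64, 128, 3, 80, 80)

# Structured pruning removes 32 whole output channels, shrinking the tensor.
structured = conv_macs(64, 96, 3, 80, 80)

print(f"dense={dense}, unstructured={unstructured}, structured={structured}")
```

Only the structured variant changes the work a dense kernel performs. Whether that saving shows up as lower latency still depends on the hardware.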

3. FastNAS Approach

Unlike conventional pruning, NVIDIA ModelOpt’s FastNAS explores optimal subnetworks under given constraints such as FLOPs.

Key characteristics:

Optimizes model structure (e.g., channel width, layer configuration)
Structured search: the result is a smaller dense network, so the savings do not depend on sparse-kernel support
Constraint-based optimization (FLOPs)

Experiment objective: Maintain accuracy while enforcing the constraint $\text{FLOPs} \leq 66\%$ of the baseline model.

4. Experimental Setup

Model: YOLOv8m (Ultralytics)
Optimizer: NVIDIA ModelOpt (FastNAS)
Dataset: COCO128 (PoC)
Constraint: $\text{FLOPs} \leq 66\%$ of baseline
Fine-tuning: 50 epochs

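Note: This experiment is a proof-of-concept. COCO128 contains only 128 images and uses the same images for training and validation, so the reported metrics are optimistic; evaluation on the full COCO dataset is required before drawing conclusions about generalization.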
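
A minimal sketch of how a search with this constraint might be configured follows. It assumes the `modelopt.torch.prune` interface (`mtp.prune` with `mode="fastnas"` and a `constraints` dictionary), which should be checked against the installed ModelOpt version. The helper names, the checkpoint filename, and the way the Ultralytics model, data loader, and scoring function are passed in are hypothetical and not taken from this experiment.

```python
def fastnas_prune_kwargs(model, dummy_input, data_loader, score_func,
                         flops_fraction="66%",
                         checkpoint="fastnas_search_checkpoint.pth"):
    """Collect the arguments for a FastNAS search in one place.

    `score_func` is expected to take a model and return a validation score;
    `checkpoint` is a hypothetical filename for saving search state.
    """
    return {
        "model": model,
        "mode": "fastnas",
        "constraints": {"flops": flops_fraction},  # upper bound relative to baseline
        "dummy_input": dummy_input,
        "config": {
            "data_loader": data_loader,
            "score_func": score_func,
            "checkpoint": checkpoint,
        },
    }


def run_fastnas(**kwargs):
    """Run the search. Requires nvidia-modelopt and PyTorch to be installed."""
    # Imported lazily so the helper above can be used without ModelOpt present.
    import modelopt.torch.prune as mtp

    pruned_model, _ = mtp.prune(**kwargs)
    return pruned_model
```

After the search, the pruned model still needs fine-tuning (50 epochs in this setup) to recover accuracy.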

5. Results – Model Compression and Performance

Metric	Baseline	Pruned	Change
Parameters	25.9M	17.6M	-32%
FLOPs	79.3B	52.0B	-34%
Inference Time (ms)	10.77	8.66	-19.6%
mAP@50-95	0.839	0.785	-5.4 pp
Recall	0.904	0.871	-3.3 pp
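
The Change column can be recomputed from the Baseline and Pruned values. The short script below is a sketch for checking the arithmetic. It treats parameters, FLOPs, and latency as relative changes, and mAP and recall as percentage-point differences; the function names are illustrative.

```python
def relative_change_pct(baseline, pruned):
    """Relative change in percent, e.g. -32.0 for a 32% reduction."""
    return (pruned - baseline) / baseline * 100.0


def point_change(baseline, pruned):
    """Difference in percentage points for metrics on a 0-1 scale."""
    return (pruned - baseline) * 100.0


# Values copied from the results table.
params = relative_change_pct(25.9, 17.6)
flops = relative_change_pct(79.3, 52.0)
latency = relative_change_pct(10.77, 8.66)
map_drop = point_change(0.839, 0.785)
recall_drop = point_change(0.904, 0.871)

# The FLOPs constraint: pruned FLOPs must not exceed 66% of baseline.
flops_ratio = 52.0 / 79.3

print(f"params {params:.1f}%, FLOPs {flops:.1f}%, latency {latency:.1f}%")
print(f"mAP {map_drop:.1f} pp, recall {recall_drop:.1f} pp")
print(f"FLOPs ratio {flops_ratio:.3f} (limit 0.660)")
```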

6. Analysis

6.1 FLOPs vs Latency Gap

Although FLOPs were reduced by 34%, latency dropped by only 19.6%.
This is because GPU latency depends not only on the number of arithmetic operations but also on memory access, kernel launch overhead, and other costs that shrink more slowly than FLOPs or not at all.
As a result, a reduction in FLOPs does not translate into a proportional reduction in latency.
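
One way to picture this gap is a toy two-term model in which latency is a fixed part plus a part that scales with FLOPs. The sketch below fits that model to the two measurements from the results table. The model is an assumption made for illustration: two data points always fit two parameters exactly, so the output is not evidence about where the time actually goes.

```python
def fit_two_term_latency(t_base_ms, t_pruned_ms, flops_ratio):
    """Fit t = fixed + scaled * r, where r is FLOPs relative to baseline.

    r = 1.0 for the baseline model and r = flops_ratio for the pruned model.
    """
    scaled = (t_base_ms - t_pruned_ms) / (1.0 - flops_ratio)
    fixed = t_base_ms - scaled
    return fixed, scaled


# Measurements from the results table.
fixed_ms, scaled_ms = fit_two_term_latency(10.77, 8.66, 52.0 / 79.3)
fixed_share = fixed_ms / 10.77

print(f"fixed: {fixed_ms:.2f} ms, FLOPs-scaled: {scaled_ms:.2f} ms")
print(f"share of baseline latency that does not scale: {fixed_share:.0%}")
```

Under this toy model, roughly 4.6 ms of the 10.77 ms baseline (about 43%) would not shrink with FLOPs. Profiling the individual layers would be needed to confirm any such split.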

6.2 Accuracy Degradation

Some drop in accuracy is expected after structural pruning, because the pruned network has less capacity than the baseline.

Possible causes include:

Reduced feature map channels
Loss of high-resolution features
Weaker small object detection, which would show up as lower recall

Ultimately, reduced representational capacity leads to lower detection sensitivity.

6.3 Trade-off Interpretation

The results show a clear trade-off:

Latency: -19.6% (10.77 ms to 8.66 ms, about 1.24x throughput)
Accuracy: -5.4 pp mAP@50-95, -3.3 pp recall

The optimal balance depends on the application domain.
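
As an illustration of how an application might make this choice, the sketch below picks the most accurate model that meets a latency budget and a minimum recall. The metric values come from the results table. The thresholds, the `Candidate` class, and the `pick_model` helper are hypothetical, and the FPS figure assumes one image per inference call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_ms: float
    map50_95: float
    recall: float

    @property
    def fps(self):
        """Frames per second, assuming one image per inference call."""
        return 1000.0 / self.latency_ms


MODELS = [
    Candidate("baseline", 10.77, 0.839, 0.904),
    Candidate("pruned", 8.66, 0.785, 0.871),
]


def pick_model(candidates, max_latency_ms, min_recall):
    """Return the most accurate candidate meeting both limits, or None."""
    ok = [c for c in candidates
          if c.latency_ms <= max_latency_ms and c.recall >= min_recall]
    return max(ok, key=lambda c: c.map50_95) if ok else None


# A latency-bound application that tolerates somewhat lower recall.
choice = pick_model(MODELS, max_latency_ms=9.0, min_recall=0.85)
print(choice.name if choice else "no model meets the requirements")
```

With a strict recall floor (for example 0.90) and the same 9 ms budget, neither model qualifies, which is the situation where pruning at this ratio is not an option.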

7. Conclusion

In this experiment, FastNAS-based structural optimization produced a measurable latency reduction (19.6%), unlike conventional pruning methods that often fail to translate theoretical gains into real-world performance.

It is particularly suitable for:

Edge deployment (e.g., Jetson devices)
Real-time applications where FPS is critical
Scenarios involving large and clearly distinguishable objects

However, caution is required in cases where:

Small object detection is critical
False negatives can lead to serious risks

Summary

While model compression is important, latency is often the primary constraint in real deployment environments.
FastNAS provides a structural approach that turns a FLOPs reduction into a real, though less than proportional, improvement in inference speed.