Surpassing the Original — Knowledge Distillation on a Pruned YOLOv8m - Tea Lemon Balm

1. Problem (Why this work)

Model compression almost always leads to the same issue:

“It gets faster, but accuracy drops.”

After applying ModelOpt pruning to YOLOv8m, the results were:

25.9M → 16.5M parameters (36% reduction)
mAP50-95: 0.8392 → 0.7845 (-6.5% relative)

Latency improved, but accuracy degraded.
Detailed setup and full results are covered in the previous post.

The goal was not just recovery, but:

To make the compressed model outperform the original

2. Approach (Why Distillation)

After pruning, the model loses representational capacity.

There are two main ways to recover performance:

Fine-tuning
- Uses only ground-truth labels
- Limited ability to recover lost information
Knowledge Distillation
- Learns from the Teacher’s class probability distribution
- Transfers inter-class relationships beyond hard labels

We chose the following strategy:

Train with both ground-truth labels and Teacher signals

3. Core Design

The Student is trained with both the original detection loss
and an additional distillation loss from the Teacher.

\[L_{total} = L_{det} + \alpha \cdot L_{KD}\]

4. KD Loss Design

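YOLOv8 detection involves three per-anchor quantities, matching the three terms of its detection loss:

class scores (cls)
decoded bounding box coordinates (box)
per-side distance distributions from the DFL branch (dfl)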

Accordingly, the KD loss is composed of three parts.

For classification, we match the Teacher’s class probability distribution
using temperature-scaled KL divergence.

\[L_{KD}^{cls} = T^2 \cdot KL\left( \text{softmax}\left(\frac{s}{T}\right) \,\|\, \text{softmax}\left(\frac{t}{T}\right) \right)\]

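Here $s$ and $t$ are the Student and Teacher class logits for one anchor, and $T$ is the temperature.
This allows the Student to learn inter-class relationships beyond hard labels.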
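The classification term can be sketched in plain Python as follows. This is a minimal, framework-free illustration rather than the code used in the experiment: the helper names (`softmax`, `kl_divergence`, `kd_cls_loss`) and the example logits are hypothetical, and the sketch assumes the Teacher's softened distribution is the target of the KL term, which is the usual convention in distillation. If the actual implementation uses the opposite direction, swap the two arguments of `kl_divergence`.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions."""
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(target, pred))

def kd_cls_loss(student_logits, teacher_logits, temperature=3.0):
    """T^2-scaled KL between the softened Teacher and Student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Made-up logits for a single anchor with three classes.
teacher = [4.0, 2.5, -1.0]
student = [3.0, 0.5, 0.0]
print(kd_cls_loss(student, teacher))  # positive: the distributions differ
print(kd_cls_loss(teacher, teacher))  # 0.0: identical distributions
```

The $T^2$ factor keeps the gradient magnitude of the softened term comparable across temperatures, since softening alone scales gradients by roughly $1/T^2$.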

For bounding boxes, instead of matching distributions,
we directly align final coordinates in the spatial domain.

\[L_{KD}^{box} = \| b_s - b_t \|_1\]

This enforces precise localization.
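A minimal sketch of the box term, under assumptions the post does not state: boxes are taken to be decoded `(x1, y1, x2, y2)` tuples, and the per-anchor L1 distances are averaged over anchors. The function name `kd_box_loss` and the sample boxes are hypothetical.

```python
def kd_box_loss(student_boxes, teacher_boxes):
    """Mean over anchors of the L1 distance between decoded boxes."""
    if not student_boxes:
        return 0.0
    total = 0.0
    for b_s, b_t in zip(student_boxes, teacher_boxes):
        # L1 distance: sum of absolute coordinate differences for one anchor.
        total += sum(abs(s - t) for s, t in zip(b_s, b_t))
    return total / len(student_boxes)

# Made-up boxes for two anchors, in (x1, y1, x2, y2) format.
student_boxes = [(10.0, 20.0, 110.0, 220.0), (50.0, 60.0, 90.0, 100.0)]
teacher_boxes = [(12.0, 18.0, 108.0, 224.0), (50.0, 60.0, 90.0, 100.0)]
print(kd_box_loss(student_boxes, teacher_boxes))  # 5.0
```

The first anchor contributes 2 + 2 + 2 + 4 = 10 and the second contributes 0, so the mean over two anchors is 5.0.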

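The DFL (Distribution Focal Loss) branch predicts a discrete distribution over each of the four box-side distances, which captures uncertainty in the box coordinates,
so we align these distributions using KL divergence, averaged over the four sides.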

\[L_{KD}^{dfl} = T^2 \cdot \frac{1}{4} \sum_{i=1}^{4} KL\left( \text{softmax}\left(\frac{s_i}{T}\right) \,\|\, \text{softmax}\left(\frac{t_i}{T}\right) \right)\]

This transfers the sharpness of localization.
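The DFL term applies the same softened KL to each of the four box sides and averages the result. The sketch below is illustrative only: the helper names are hypothetical, the Teacher distribution is again assumed to be the KL target, and the toy example uses 4 bins per side for readability (Ultralytics YOLOv8 uses 16 by default).

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions."""
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(target, pred))

def kd_dfl_loss(student_sides, teacher_sides, temperature=3.0):
    """Average the T^2-scaled KL over the four box sides of one anchor."""
    assert len(student_sides) == len(teacher_sides) == 4
    total = 0.0
    for s_logits, t_logits in zip(student_sides, teacher_sides):
        total += kl_divergence(softmax(t_logits, temperature),
                               softmax(s_logits, temperature))
    return temperature ** 2 * total / 4

# Made-up bin logits for the four sides (left, top, right, bottom) of one anchor.
teacher_sides = [[0.0, 3.0, 0.5, -1.0], [2.5, 0.0, -1.0, -2.0],
                 [0.0, 0.5, 3.0, 0.0], [-1.0, 0.0, 0.5, 2.5]]
student_sides = [[0.5, 1.5, 1.0, 0.0], [1.0, 0.5, 0.0, -0.5],
                 [0.0, 1.0, 1.5, 0.5], [0.0, 0.5, 1.0, 1.5]]
print(kd_dfl_loss(student_sides, teacher_sides))
```

In this toy input the Student's distributions are flatter than the Teacher's, and the loss pushes them toward the Teacher's sharper peaks.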

Using all anchors introduces noise, since most correspond to background.

We therefore apply KD only to anchors where the Teacher confidence exceeds a threshold.

\[\mathcal{M} = \{ a \mid \max_{c} \sigma(t_{cls}^{a,c}) > \tau \}\]
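The anchor selection can be sketched as a simple filter over the Teacher's class logits. The function name `teacher_mask` and the example logits are hypothetical; the only logic taken from the text is "keep an anchor when its highest sigmoid class score exceeds $\tau$".

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def teacher_mask(teacher_cls_logits, tau=0.3):
    """Return the indices of anchors whose highest Teacher class score exceeds tau."""
    return [
        a for a, logits in enumerate(teacher_cls_logits)
        if max(sigmoid(z) for z in logits) > tau
    ]

# Made-up Teacher logits: four anchors, three classes each.
teacher_cls_logits = [
    [-6.0, -5.5, -7.0],  # background: every score is near 0
    [2.0, -3.0, -4.0],   # confident foreground (max score about 0.88)
    [-1.0, -0.5, -2.0],  # max score about 0.38, just above tau = 0.3
    [-2.0, -3.0, -1.5],  # max score about 0.18, below tau = 0.3
]
print(teacher_mask(teacher_cls_logits, tau=0.3))  # [1, 2]
print(teacher_mask(teacher_cls_logits, tau=0.5))  # [1]
```

Raising the threshold from 0.3 to 0.5 drops the borderline anchor, which shows how $\tau$ trades noise against the number of anchors that contribute to the KD loss.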

The final KD loss, computed over the anchors in $\mathcal{M}$, is:

\[L_{KD} = L_{KD}^{cls} + L_{KD}^{box} + L_{KD}^{dfl}\]
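Putting the pieces together with the total loss from Section 3 gives a sketch like the one below. The function names and the numeric values are illustrative placeholders, not numbers from the experiment.

```python
def kd_loss(l_cls, l_box, l_dfl):
    """Unweighted sum of the three KD terms."""
    return l_cls + l_box + l_dfl

def total_loss(l_det, l_kd, alpha=0.5):
    """Detection loss plus the alpha-weighted distillation loss."""
    return l_det + alpha * l_kd

# Illustrative per-batch values only.
l_kd = kd_loss(l_cls=0.25, l_box=0.5, l_dfl=0.25)
print(total_loss(l_det=2.0, l_kd=l_kd, alpha=0.5))  # 2.5
```

Because the three terms are summed without individual weights, their relative scale matters: an L1 term measured in pixels can be much larger than the KL terms. The post does not say whether box coordinates are normalized, so this is worth checking in any reimplementation.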

5. Hyperparameter Design and Tuning

KD performance is highly sensitive to hyperparameters.

We focus on three variables:

$\alpha$: balance between the detection loss and the KD loss
$T$: temperature that controls how smooth the probability distributions are
$\tau$: Teacher confidence threshold for selecting anchors

$\alpha$ controls the strength of KD.

Too small → KD has little effect
Too large → destabilizes training

$T$ controls how soft the Teacher distribution is.

Low → close to one-hot
High → emphasizes inter-class relationships
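The effect of $T$ is easy to see by printing the same distribution at several temperatures. The logits below are made up for illustration, and `peak_probability` is a hypothetical helper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def peak_probability(logits, temperature):
    """Largest class probability after temperature scaling."""
    return max(softmax(logits, temperature))

# Made-up Teacher logits for one anchor with three classes.
logits = [4.0, 2.5, -1.0]
for temperature in (1.0, 3.0, 10.0):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the top class loses probability mass to the others, so the secondary classes carry more of the training signal.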

$\tau$ determines which anchors are used for KD.

Low → includes noisy background
High → reduces training samples

The best performance was achieved with:

\[\begin{aligned} \alpha &= 0.5 \\ T &= 3.0 \\ \tau &= 0.3 \end{aligned}\]

This setting balances detection and distillation losses
without introducing excessive regularization.

6. Results

Model	mAP50-95	Latency
Teacher	0.8392	10.60ms
Pruned	0.7845	8.05ms
Pruned + KD	0.8679	8.05ms
Pruned + Feature KD	0.8737	7.92ms

KD did more than recover the drop from pruning: the pruned Student with KD already surpassed the Teacher (0.8679 vs. 0.8392),
and Feature KD pushed it further to 0.8737.
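The gaps in the table can be recomputed with a few lines of Python. The numbers are copied from the table above; the helper `compare_to_teacher` is hypothetical.

```python
# (mAP50-95, latency in ms), copied from the results table.
results = {
    "Teacher": (0.8392, 10.60),
    "Pruned": (0.7845, 8.05),
    "Pruned + KD": (0.8679, 8.05),
    "Pruned + Feature KD": (0.8737, 7.92),
}

def compare_to_teacher(results):
    """Return {model: (mAP difference vs. Teacher, latency speedup vs. Teacher)}."""
    teacher_map, teacher_ms = results["Teacher"]
    return {
        name: (map5095 - teacher_map, teacher_ms / ms)
        for name, (map5095, ms) in results.items()
    }

for name, (delta, speedup) in compare_to_teacher(results).items():
    print(f"{name:<22} mAP diff = {delta:+.4f}   speedup = {speedup:.2f}x")
```

This gives +0.0287 mAP50-95 for Pruned + KD and +0.0345 for Pruned + Feature KD relative to the Teacher, at roughly 1.32x and 1.34x the Teacher's speed.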

7. Why Does the Student Surpass the Teacher?

At first glance, this seems counterintuitive:

“How can a smaller model outperform a larger one?”

The key is that KD changes the learning signal itself.

Standard training uses only ground-truth labels:

target class: 1
others: 0

This provides binary supervision.

The Teacher, however, provides a probability distribution:

target class: 0.90
similar class: 0.09
unrelated class: 0.01

This encodes relationships and ambiguity between classes.

As a result, the Student learns differently:

Richer gradients
- Instead of a sparse 0/1 target, every class receives a graded target that shapes the gradient
Regularization effect
- Teacher guidance reduces overfitting to hard labels
Representation alignment
- Feature KD aligns intermediate representations across the network
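The first point can be made concrete with the gradient of the softened KL term with respect to a Student logit $s_i$. This is a standard result rather than something derived in the experiment, and it assumes the Teacher distribution $q$ is the target and the Student distribution is $p$ (symbols introduced here for illustration):

\[\frac{\partial}{\partial s_i} \left[ T^2 \cdot KL\left( q \,\|\, p \right) \right] = T \, (p_i - q_i), \qquad p = \text{softmax}\left(\frac{s}{T}\right), \quad q = \text{softmax}\left(\frac{t}{T}\right)\]

With a one-hot label $y$, the cross-entropy gradient is $p_i - y_i$, so every non-target class is pushed toward exactly 0. With the Teacher target, each class is pushed toward its own value $q_i$, which is where the inter-class information enters.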

The Student is not simply mimicking the Teacher;
it appears to converge to a more generalizable solution under the structural constraints imposed by pruning.

In this experiment, that led to better performance than the Teacher.

8. Role of Feature KD

Feature KD transfers multi-scale information from the FPN-style neck at three levels:

P3 (stride 8): small objects
P4 (stride 16): medium objects
P5 (stride 32): large objects

This improves representations in both backbone and neck.
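The post does not specify the Feature KD loss, so the sketch below assumes one common choice: a mean squared error between Student and Teacher feature maps at each level, averaged over P3, P4 and P5. The function names and toy values are hypothetical. The sketch also assumes both feature maps already have the same shape. A pruned Student may have fewer channels than the Teacher, in which case a projection layer (for example a 1x1 convolution) would be needed first, which is omitted here.

```python
def mse(student_feat, teacher_feat):
    """Mean squared error between two flattened feature maps of equal length."""
    assert len(student_feat) == len(teacher_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

def feature_kd_loss(student_feats, teacher_feats, levels=("P3", "P4", "P5")):
    """Average the per-level MSE over the selected pyramid levels."""
    return sum(mse(student_feats[k], teacher_feats[k]) for k in levels) / len(levels)

# Toy flattened feature maps; real maps are (channels, height, width) tensors.
student_feats = {"P3": [0.0, 1.0, 2.0, 3.0], "P4": [1.0, 1.0], "P5": [0.5]}
teacher_feats = {"P3": [0.0, 1.0, 2.0, 5.0], "P4": [1.0, 3.0], "P5": [0.5]}
print(feature_kd_loss(student_feats, teacher_feats))  # 1.0
```

Here the per-level errors are 1.0 (P3), 2.0 (P4) and 0.0 (P5), so the average is 1.0. Averaging per level first keeps the large P3 map from dominating the smaller P5 map.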

9. Summary

The key ideas are:

KD loss enables not just recovery but performance gain
$\alpha$ controls the balance between objectives
$T$ and $\tau$ control information quality and selection
Feature KD improves representation learning

10. Conclusion

Pruning is not just model reduction.

It is a process of compressing information.

Knowledge Distillation injects information back into the compressed model.

As a result:

The model becomes smaller
Faster
And more accurate

In this experiment, the pruned model
actually surpassed the original Teacher.