Neural Network Compression: Making AI Models Faster and Smaller

Neural Networks

02.09.2025

Why Compression Matters in 2025

A major insurance company recently faced a choice: spend $2.4 million annually on GPU infrastructure to serve their claims classification model at scale, or invest three engineer-months compressing that same model to run on CPUs at one-tenth the cost. They chose compression. After applying INT8 quantization and modest pruning, their model processed claims 3.2× faster on existing hardware while maintaining 99.1% of original accuracy—well within acceptable bounds for their use case. The total cost of ownership dropped 87%, freeing budget for new AI initiatives.

This scenario plays out daily across enterprises in 2025. Neural network compression—the suite of techniques that make models smaller, faster, and cheaper to deploy—has evolved from academic curiosity to business necessity. Four converging pressures drive adoption: GPU scarcity and cost where H100s rent for $2-5 per hour and purchasing requires months of lead time; latency requirements as users expect sub-100ms response times for interactive applications; edge deployment constraints where models must fit in mobile devices with limited RAM, storage, and battery; and regulatory scrutiny under frameworks like the NIST AI Risk Management Framework that require documentation of model characteristics, testing across subgroups, and audit trails for deployed systems.

The stakes are substantial. A compressed model can mean the difference between profitable unit economics and unsustainable burn rate for an AI-powered SaaS product. For edge applications from autonomous vehicles to medical devices, compression enables real-time inference that safety-critical use cases demand. For large language model APIs, serving optimizations determine whether you can profitably serve customers at competitive prices or watch margins evaporate as token volumes scale.

The good news: modern compression techniques deliver dramatic improvements without proportional accuracy loss. INT8 quantization typically provides 2-4× speedup and 4× memory reduction with under 1% accuracy degradation. Structured pruning can remove 30-50% of parameters while recovering accuracy through brief fine-tuning. Knowledge distillation creates models 5-10× smaller that retain 95-98% of teacher performance. Serving-time optimizations like FlashAttention and vLLM unlock additional 2-8× throughput gains without touching model weights.

This guide provides practical, vendor-neutral guidance for engineering leaders and ML practitioners deploying compressed models to production. We'll examine quantization, pruning, distillation, parameter-efficient methods, and serving optimizations—explaining when to use each technique, what tools work best, how to validate results, and where accuracy/robustness risks hide.

The Compression Toolbox: Four Core Families

Neural network compression encompasses multiple complementary techniques that attack different aspects of model inefficiency. Understanding each method's strengths, limitations, and interaction patterns enables informed choices about which approaches suit specific deployment constraints.

Quantization: Lower Precision Without Lower Performance

Quantization reduces numerical precision of model weights and activations from 32-bit floating point (FP32) to lower bit-widths like INT8, INT4, or FP8. This technique exploits the fact that neural networks are remarkably tolerant to reduced precision—most models don't need 32 bits to represent parameters effectively.

Bit-width selection depends on hardware support, accuracy tolerance, and deployment context. INT8 (8-bit integer) represents the sweet spot for most production deployments, offering excellent hardware support across CPUs and GPUs, minimal accuracy degradation with proper calibration, and 4× memory reduction plus 2-4× speedup versus FP32. TensorRT and ONNX Runtime provide mature INT8 implementations that work out-of-box for most architectures.

INT4 (4-bit integer) enables more aggressive compression with 8× memory reduction, suitable for memory-bound applications where moderate accuracy loss is acceptable. INT4 works best for large language model weight storage combined with INT8 or FP16 activations, though specialized techniques like GPTQ and AWQ are typically required to maintain quality.

FP8 (8-bit floating point) provides a middle ground preserving more dynamic range than INT8, critical for training and some sensitive inference workloads. NVIDIA's Transformer Engine on Hopper GPUs supports two FP8 formats: E4M3 (4 exponent bits, 3 mantissa bits) for forward pass and E5M2 (5 exponent, 2 mantissa) for backward pass, enabling mixed-precision strategies that balance numerical stability and performance.

Symmetric vs. asymmetric quantization affects how floating-point values map to integers. Symmetric quantization constrains the range symmetrically around zero, simplifying hardware implementation and generally performing better for weights. Asymmetric quantization allows an offset (zero point), which better handles activations with non-centered distributions. Most frameworks default to symmetric for weights and asymmetric for activations.

Per-tensor vs. per-channel quantization determines granularity of scale factors. Per-tensor uses one scale/zero-point for an entire tensor—simple and fast but may not capture intra-tensor variation. Per-channel quantization applies separate scales per output channel (for weights) or per token (for activations), improving accuracy at modest computational cost. Per-channel is the default for production INT8 deployment and should be your baseline.

Post-Training Quantization (PTQ) converts trained FP32 models to lower precision without retraining. PTQ runs a calibration phase using representative data to determine optimal scale factors, then quantizes weights and records activation statistics. PyTorch quantization and TensorFlow Model Optimization Toolkit provide excellent PTQ implementations. PTQ works remarkably well for most computer vision and many NLP models, often maintaining 98-99% of original accuracy with minimal effort.
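
To make the PTQ flow concrete, here is a minimal sketch using PyTorch's FX graph-mode quantization (API details vary slightly across PyTorch versions, and the model must be symbolically traceable); `model` and `calib_loader` are assumed to be your trained FP32 network and a loader of representative calibration batches.

    # Static PTQ sketch: insert observers, run calibration data through the model,
    # then convert to INT8 kernels. `model` and `calib_loader` are assumptions.
    import torch
    from torch.ao.quantization import get_default_qconfig_mapping
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

    model.eval()
    qconfig_mapping = get_default_qconfig_mapping("fbgemm")        # x86 server backend
    example_inputs = (next(iter(calib_loader))[0],)

    prepared = prepare_fx(model, qconfig_mapping, example_inputs)  # add observers
    with torch.no_grad():
        for inputs, _ in calib_loader:                             # calibration pass
            prepared(inputs)

    int8_model = convert_fx(prepared)                              # swap in quantized ops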

Quantization-Aware Training (QAT) simulates quantization during training by inserting fake quantization operations that model rounding effects. The model learns weight values robust to quantization noise, typically recovering 1-2 percentage points versus PTQ. QAT requires more engineering effort—modifications to training loop, longer convergence, hyperparameter tuning—but becomes essential when accuracy is critical or PTQ degrades unacceptably. Use QAT for safety-critical applications, when PTQ accuracy loss exceeds tolerance, or when you need INT4 or aggressive mixed-precision schemes.

Hardware targets and compilers determine what optimizations are actually deployed. NVIDIA TensorRT excels for NVIDIA GPU deployment with extensive INT8 and FP8 kernel fusions. Intel Neural Compressor optimizes for Intel CPUs and GPUs using AVX-512 VNNI and AMX instructions. ONNX Runtime provides portable quantization across hardware vendors. Match your quantization approach to target hardware—don't quantize for NVIDIA GPU optimizations if you're deploying to Intel CPUs.
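
For portable targets, ONNX Runtime ships its own quantization tooling. The snippet below is a hedged sketch of weight-only (dynamic) INT8 quantization of an exported model; "model.onnx" is a placeholder path, and static quantization with activation calibration would instead use quantize_static plus a CalibrationDataReader.

    # Dynamic (weight-only) INT8 quantization of an exported ONNX graph.
    # "model.onnx" is a placeholder for your exported FP32 model.
    from onnxruntime.quantization import quantize_dynamic, QuantType

    quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)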

Pruning & Structured Sparsity: Removing What Doesn't Matter

Pruning removes neural network parameters deemed unimportant, reducing model size and computation. The intuition is straightforward: not all weights contribute equally to model accuracy, and those with minimal impact can be zeroed without significant performance loss.

Magnitude pruning ranks parameters by absolute value and removes the smallest, based on the assumption that small weights contribute little to outputs. This simple heuristic works surprisingly well and serves as the baseline pruning approach. Movement pruning tracks how much each parameter changes during training and removes those that change least, capturing the idea that static weights are less important than dynamic ones actively learning. Movement pruning often outperforms magnitude pruning but requires more careful implementation.

Unstructured vs. structured sparsity determines at what granularity parameters are removed. Unstructured (fine-grained) sparsity zeros individual weights anywhere in the network, achieving high compression rates (70-90% sparsity possible) but requiring specialized sparse kernels for speedups. Without hardware support, unstructured sparsity saves memory but doesn't accelerate inference. Structured sparsity removes entire channels, filters, or attention heads—coarser granularity, but standard dense kernels can exploit it. Structured pruning typically targets 30-50% sparsity with immediate speedups on any hardware.

N:M structured sparsity represents a middle ground where in every M consecutive elements, N are non-zero. The most important pattern is 2:4 sparsity where 2 of every 4 consecutive values are non-zero (50% sparse). NVIDIA Ampere and newer GPUs provide hardware acceleration for 2:4 sparse matrix operations via the Sparse Tensor Core, delivering 2× theoretical speedup. See NVIDIA's 2:4 sparsity whitepaper for implementation details. 2:4 sparsity requires specialized training or pruning procedures but unlocks genuine hardware acceleration without custom kernels.

The Lottery Ticket Hypothesis proposed that dense networks contain sparse subnetworks (winning tickets) that reach comparable accuracy when trained in isolation from their original initialization. While conceptually elegant, finding lottery tickets at scale proves difficult. More practical for production: prune iteratively or one-shot after training, then fine-tune the pruned network to recover accuracy. This straightforward approach works reliably across architectures.

How much to prune depends on model architecture, data complexity, and accuracy tolerance. General guidelines: start conservative (10-30% sparsity), measure the accuracy drop, and increase gradually until you approach your acceptable degradation threshold. Vision models tolerate 40-60% pruning of later layers. Language models are more sensitive—attention layers particularly resist pruning. Pruning 70%+ typically requires iterative pruning with retraining cycles or advanced techniques like learned pruning masks.

Pruning workflow: (1) train a baseline model, (2) apply pruning based on magnitude or movement, (3) fine-tune the pruned model for recovery, (4) validate accuracy across your test distribution, (5) export and quantize if needed. For iterative pruning, repeat steps 2-4 multiple times, gradually increasing sparsity. Monitor not just aggregate accuracy but per-class and subgroup metrics to catch disparate impacts.
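
As a concrete sketch of steps 2-5, the snippet below applies one-shot global magnitude pruning with torch.nn.utils.prune; `fine_tune` and `evaluate` are hypothetical stand-ins for your existing training and evaluation loops.

    # One-shot pruning workflow sketch with torch.nn.utils.prune.
    import torch
    import torch.nn.utils.prune as prune

    # (2) Prune: zero the 30% smallest-magnitude weights globally across Linear layers.
    to_prune = [(m, "weight") for m in model.modules() if isinstance(m, torch.nn.Linear)]
    prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.30)

    # (3) Fine-tune at a reduced learning rate to recover accuracy (hypothetical helper).
    fine_tune(model, train_loader, epochs=2, lr=1e-4)

    # (4) Validate aggregate and per-subgroup metrics before shipping (hypothetical helper).
    evaluate(model, val_loader)

    # (5) Fold the masks into the weights, then export and optionally quantize.
    for module, name in to_prune:
        prune.remove(module, name)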

Knowledge Distillation & Low-Rank Adaptation: Teaching Smaller to Think Bigger

Knowledge distillation transfers knowledge from large "teacher" models to compact "student" models that maintain much of the teacher's performance. This differs from pruning/quantization which compress the same architecture—distillation changes the architecture entirely, often to something much smaller.

Classic distillation trains students to mimic teacher outputs rather than just matching ground-truth labels. The key innovation: using "soft targets"—the teacher's full probability distribution across classes rather than hard one-hot labels. Soft targets convey rich information about class similarities and uncertainties that hard labels discard. The seminal Distilling the Knowledge in a Neural Network paper by Hinton et al. introduced temperature-based softening where logits are divided by temperature T>1 before softmax, smoothing the distribution and amplifying information in near-zero probabilities.

Distillation loss combines soft target loss (student vs. teacher distributions) and hard target loss (student vs. ground truth), typically weighted 90% soft and 10% hard. The student learns from both the teacher's confident predictions and its uncertainty patterns. Temperature typically ranges from 2-10, selected via hyperparameter search. Higher temperatures extract more knowledge but can destabilize training.
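
A minimal PyTorch sketch of that combined loss, with the soft term computed at temperature T and rescaled by T² so its gradient magnitude stays comparable to the hard term:

    # Combined distillation loss: soft targets at temperature T plus hard labels.
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)                                    # rescale for the temperature
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard       # e.g. 90% soft, 10% hard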

Distillation recipes that work in production: use a teacher 2-10× larger than the student; ensure the student has sufficient capacity—too small and it can't learn; match training data distributions between teacher and student; pre-train the student on task data before distillation for better initialization; validate on held-out data not used for distillation to avoid overfitting teacher idiosyncrasies.

LoRA (Low-Rank Adaptation) addresses a different problem: parameter-efficient fine-tuning of large pre-trained models. LoRA freezes base model weights and injects trainable low-rank decomposition matrices into specific layers (typically attention projections). Instead of fine-tuning billions of parameters, you train millions in the low-rank adapters. LoRA delivers 10-100× reduction in trainable parameters while matching full fine-tuning quality for most tasks.

LoRA rank selection trades off capacity and efficiency. Ranks of 4-32 typically suffice, with higher ranks for more complex adaptations. The memory savings are dramatic: full fine-tuning of a 7B parameter model can require on the order of 100GB for weights, gradients, and optimizer states, while LoRA adds only a few GB of trainable adapter state on top of the frozen base weights. This enables fine-tuning large models on single consumer GPUs.

QLoRA combines quantization with LoRA, storing the base model in INT4 while training FP16 adapters. QLoRA introduces additional innovations including 4-bit NormalFloat quantization, double quantization of scaling factors, and paged optimizers. Result: fine-tune a 65B parameter model on a single 48GB GPU. QLoRA has democratized large model customization for organizations without massive compute budgets.

After fine-tuning with LoRA/QLoRA, you have two deployment options: keep adapters separate for dynamic adapter swapping serving multiple fine-tuned variants from one base model, or merge adapters into base weights for a single standalone model that can then be quantized further.

Distill then quantize vs. quantize then distill: empirical guidance suggests distilling first to produce a smaller student, then quantizing that student for maximum compression. However, if your deployment target mandates quantization, consider quantization-aware distillation where the student trains in quantized precision from the start, learning representations robust to quantization noise.

Serving-Time Optimizations: Speed Without Retraining

For large language models, serving-time optimizations often deliver larger throughput gains than model compression techniques, with zero accuracy loss since you're not modifying weights or structure.

FlashAttention revolutionized transformer serving by redesigning the attention mechanism around GPU memory hierarchy. The standard attention implementation materializes the full attention matrix (sequence_length²), causing memory bottleneck. FlashAttention and FlashAttention-2 use tiling and recomputation strategies that never materialize the full attention matrix, reducing memory footprint from O(N²) to O(N) and accelerating training and inference 2-4×. FlashAttention-3 further optimizes for Hopper architecture with asynchronous tensor core operations and overlapped memory transfers.

FlashAttention is essentially free performance—no accuracy impact, straightforward to integrate, and most modern frameworks include it. If you're serving transformers and not using FlashAttention, you're leaving 2-3× speedup on the table.

KV-cache stores previously computed key and value tensors for autoregressive generation, avoiding redundant computation. At long context lengths (8K-128K tokens), KV-cache memory dominates, consuming hundreds of GB for large models. KV-cache quantization applies INT8 or INT4 quantization to cached tensors, reducing memory 2-4× relative to FP16 with minimal quality loss. Techniques like ZipCache achieve better compression through learned codebooks.

vLLM and PagedAttention address fragmentation in KV-cache memory. Traditional implementations pre-allocate fixed blocks for each sequence's cache, wasting memory when sequences vary in length or finish early. vLLM introduces PagedAttention which manages KV-cache in virtual memory pages that can be allocated, reused, and swapped flexibly. Result: 2-4× higher throughput via improved memory utilization enabling larger batch sizes. vLLM has become the de facto standard for LLM serving and should be your starting point for any transformer inference workload.

Batching and scheduling policies determine how requests are grouped and processed. Continuous batching allows new requests to join in-flight batches as others complete, maximizing GPU utilization. Smart scheduling balances small, urgent requests against large batch processing. vLLM's scheduler implements these policies, but understanding the tradeoffs helps configure for your workload: latency-sensitive APIs want aggressive preemption and small batches, while high-throughput batch processing wants larger batches and less frequent preemption.

Serving optimizations stack multiplicatively: FlashAttention (2-3×) + vLLM batching (2-3×) + KV-cache quantization (enables larger batches for another 1.5-2×) = 6-18× total throughput improvement without touching model weights. Apply these before considering more invasive compression techniques.

Choosing the Right Strategy: Decision Playbook

Different deployment scenarios demand different compression approaches. Use this decision framework to identify the optimal starting point for your situation:

Goal: 2-4× throughput on existing GPUs (cloud serving)

  • Primary tactic: INT8 post-training quantization + FlashAttention + vLLM/PagedAttention
  • Expected gains: 3-6× throughput, 4× memory reduction, <1% accuracy loss
  • Effort: Low (1-2 engineer-days for PTQ, framework integration for serving optimizations)
  • Tools: ONNX Runtime, TensorRT, vLLM
  • Best for: Established models serving production traffic where retraining is expensive

Goal: Edge device deployment under tight RAM/flash constraints

  • Primary tactic: INT8 or INT4 quantization + structured pruning 30-50% + optional distillation to smaller architecture
  • Expected gains: 8-16× memory reduction, 3-6× speedup, 1-3% accuracy loss
  • Effort: Medium-High (1-2 engineer-weeks including validation on target hardware)
  • Tools: TensorFlow Model Optimization Toolkit, PyTorch Mobile, ONNX Runtime Mobile
  • Best for: Mobile apps, IoT devices, embedded systems with fixed compute budgets

Goal: Latency-critical API with strict accuracy SLOs

  • Primary tactic: Quantization-aware training to recover accuracy, per-channel INT8, consider FP8 on supported hardware
  • Expected gains: 2-3× speedup, <0.5% accuracy loss, predictable tail latency
  • Effort: Medium (1-2 engineer-weeks for QAT training modifications)
  • Tools: PyTorch quantization, TensorRT, Transformer Engine
  • Best for: Financial services, healthcare, real-time decision systems where accuracy cannot degrade

Goal: Cost reduction for cloud inference

  • Primary tactic: PTQ + moderate pruning 20-40% + serving optimizations; profile and compress hot layers
  • Expected gains: 5-10× throughput per dollar, 60-80% infrastructure cost reduction
  • Effort: Medium (1-2 engineer-weeks including cost modeling)
  • Tools: Combination of quantization + pruning tools, cost profiling scripts
  • Best for: High-volume inference APIs where per-request margin matters

Goal: Fine-tuning on limited GPU budget

  • Primary tactic: LoRA/QLoRA for parameter-efficient training, then quantize the merged model
  • Expected gains: 10-50× reduction in training memory, enables fine-tuning on single GPU
  • Effort: Low-Medium (3-5 engineer-days for LoRA integration)
  • Tools: LoRA implementation, QLoRA
  • Best for: Customizing foundation models without massive compute resources

Quantization: From Theory to Deployment

Successful quantization requires more than selecting a bit-width—calibration, validation, and deployment considerations determine whether quantized models maintain quality in production.

Calibration Data Selection

Post-training quantization relies on calibration data to determine scale factors and zero points for activation quantization. Calibration quality directly affects quantized model accuracy. Best practices:

Use representative production-like data that covers the distribution your model will encounter. Don't calibrate on training data if deployment data differs—calibration should reflect inference conditions.

Avoid data leakage by using held-out validation data for calibration, never test data. Calibration on test data inflates accuracy estimates and may overfit to test distribution.

Calibrating with 100-1,000 samples typically suffices. More isn't always better—too many samples slow calibration without improving results. Start with 100-500 samples and increase if accuracy suffers.

Ensure coverage across important subgroups to prevent disparate quantization quality. If your model serves multiple demographics, languages, or use cases, calibration data should represent each proportionally.

Per-Tensor vs. Per-Channel Trade-offs

Per-tensor quantization is simpler—one scale factor per tensor—but sacrifices accuracy when parameter or activation distributions vary significantly within tensors. Per-channel quantization applies separate scales per output channel (for weights) or per token/batch element (for activations), capturing intra-tensor variation.

For production deployment, use per-channel quantization for weights as the default. The accuracy improvement over per-tensor typically justifies the small computation overhead. For activations, per-tensor often suffices unless you observe significant accuracy degradation.
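
In PyTorch, that default can be expressed as a QConfig pairing a per-channel symmetric weight observer with a per-tensor affine activation observer; a sketch only, since observer choices vary by backend and version:

    # Per-channel INT8 weights + per-tensor activations, expressed as a QConfig.
    import torch
    from torch.ao.quantization import QConfig, MinMaxObserver, PerChannelMinMaxObserver

    qconfig = QConfig(
        activation=MinMaxObserver.with_args(
            dtype=torch.quint8, qscheme=torch.per_tensor_affine),
        weight=PerChannelMinMaxObserver.with_args(
            dtype=torch.qint8, qscheme=torch.per_channel_symmetric),
    )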

Handling Outliers and Advanced Schemes

Extreme outlier values in activations can degrade quantization quality by forcing scales that poorly represent typical values. Several techniques address this:

SmoothQuant migrates difficulty from activations to weights by scaling activations down and weights up, leveraging the fact that weights can be quantized offline with more care while activations must be quantized in real-time during inference.

Group-wise quantization divides each tensor into groups and quantizes each group independently, providing middle ground between per-tensor and per-channel granularity. Particularly effective for large language model weights exhibiting structured variance.
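
The sketch below illustrates the idea with a hypothetical helper that quantizes a weight matrix group-wise to symmetric INT4, one scale per group of 128 input features; production methods such as GPTQ and AWQ build further refinements on top of this basic scheme.

    # Illustrative group-wise symmetric INT4 weight quantization (hypothetical helper).
    import torch

    def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
        out_f, in_f = w.shape                        # assumes in_f divisible by group_size
        groups = w.reshape(out_f, in_f // group_size, group_size)
        scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
        return q, scale                              # dequantize as (q * scale).reshape(out_f, in_f)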

Rounding strategies beyond simple round-to-nearest can improve accuracy. AdaRound learns optimal rounding for each weight, treating rounding as an optimization problem. Typically worthwhile for aggressive quantization (INT4) or when accuracy is marginal with standard rounding.

Validation Protocol: Beyond Top-1 Accuracy

Comprehensive quantization validation examines multiple metrics to catch issues aggregate accuracy might miss:

Task-appropriate metrics: Use AUC/F1 for classification, BLEU/ROUGE for generation, perplexity for language modeling, mAP for detection. Top-1 accuracy alone misses important degradation patterns.

Subgroup analysis: Validate accuracy across demographic groups, languages, or other relevant segments. Quantization can degrade unevenly, causing fairness issues even when aggregate accuracy appears acceptable.

Tail percentile latency: Measure P95/P99 latency, not just median. Quantized inference should reduce tail latency, but misconfigured serving stacks can introduce variance.

Robustness checks: Test on distribution shift, adversarial examples, and edge cases. Quantization can alter model robustness characteristics in subtle ways.

Calibration quality: For probabilistic models, check if predicted probabilities remain well-calibrated after quantization using reliability diagrams and expected calibration error.
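
A simple sketch of expected calibration error over equal-width confidence bins, useful for comparing calibration before and after quantization; here `confidences` are predicted max-class probabilities and `correct` is a 0/1 array of prediction correctness.

    # Expected calibration error (ECE) over equal-width confidence bins.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=15):
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(correct[mask].mean() - confidences[mask].mean())
                ece += mask.mean() * gap             # weight gap by bin population
        return ece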

LLM-Specific Quantization Considerations

Large language models present unique quantization challenges due to size, autoregressive generation, and sensitivity to outliers:

Weight-only quantization stores weights in INT8 or INT4 while computing activations in FP16, reducing memory while avoiding activation quantization complexity. GPTQ and AWQ (Activation-aware Weight Quantization) optimize weight quantization specifically for LLMs, achieving good quality at INT4.

Activation-aware methods account for the fact that LLM activations exhibit severe outliers that degrade naive quantization. These methods either handle outliers specially or use activation statistics to guide weight quantization.

KV-cache quantization compresses the memory-dominant key-value cache in autoregressive generation. INT8 KV-cache typically works well, though INT4 requires careful validation. ZipCache and similar methods achieve better compression through learned representations.

Per-token or per-channel activation quantization is often necessary for LLMs due to high variance across sequence positions. Per-tensor activation quantization frequently degrades quality unacceptably.

Validate LLM quantization using generation metrics (BLEU, ROUGE, human eval) and perplexity across multiple domains. Aggregate metrics can hide degradation on specific task types or prompt patterns.

Pruning & Sparsity: What Survives the Scalpel

Effective pruning requires understanding what to prune, how much, when to fine-tune, and how to deploy sparse models for actual speedups.

How Much to Prune Without Catastrophic Drop

Pruning tolerance varies dramatically by architecture and layer type:

Vision models tolerate aggressive pruning of later layers (50-70%) while early layers are more sensitive (20-40%). Convolutional filters can often be removed entirely with modest impact.

Language models resist pruning more than vision models. Feed-forward layers tolerate 30-50% pruning, but attention layers are highly sensitive—pruning even 20-30% of attention parameters can significantly degrade performance.

Start conservative at 10-20% global sparsity, validate accuracy, then increase iteratively. Sudden high sparsity (>50%) without iterative training often causes unrecoverable accuracy loss.

Layer-wise analysis reveals which layers tolerate pruning. Profile each layer's importance via gradient or activation sensitivity analysis before global pruning. Then apply non-uniform sparsity targeting resilient layers.

Iterative vs. One-Shot Pruning

One-shot pruning applies target sparsity immediately after training, then fine-tunes to recover. Simple and fast, works well for moderate sparsity (30-40%) on robust architectures.

Iterative pruning gradually increases sparsity through multiple prune-train cycles: train baseline, prune 10%, fine-tune, prune another 10%, fine-tune, repeat. Enables higher final sparsity (60-80%) with better accuracy retention. Required for aggressive compression or sensitive models.

Pruning schedule affects results. Linear schedules increase sparsity uniformly. Cubic schedules prune quickly at first, then taper off as sparsity approaches the target—often yielding better final accuracy.
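
A sketch of the iterative loop, again with hypothetical `fine_tune` and `evaluate` helpers; note that PyTorch stacks pruning masks, so each round removes roughly 10% of the weights that remain.

    # Iterative magnitude pruning: prune, fine-tune, validate, repeat.
    import torch
    import torch.nn.utils.prune as prune

    to_prune = [(m, "weight") for m in model.modules() if isinstance(m, torch.nn.Linear)]
    for round_idx in range(6):
        prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.10)
        fine_tune(model, train_loader, epochs=1, lr=1e-4)     # hypothetical recovery step
        metrics = evaluate(model, val_loader)                  # hypothetical validation
        if metrics["accuracy_drop"] > 0.01:                    # stop before unacceptable degradation
            break

    for module, name in to_prune:
        prune.remove(module, name)                             # bake final masks into the weights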

Structured Sparsity for Hardware Speedups

Unstructured sparsity saves parameters but requires specialized sparse kernels to accelerate inference. Most production deployments lack these kernels, making unstructured sparsity useful only for memory reduction.

Structured sparsity removes entire structures: output channels in convolutions, attention heads, feed-forward neurons, or entire transformer layers. Standard dense kernels immediately exploit structured sparsity for speedup since computation reduces proportionally to sparsity.

Channel pruning ranks output channels by importance (via magnitude, gradient, or activation statistics) and removes the least important. Then fine-tune to recover accuracy. Works particularly well for CNNs.

Attention head pruning removes entire attention heads found to contribute minimally. Analysis shows many transformer models have redundant heads that can be removed without significant accuracy loss.

Layer dropping removes entire transformer layers, particularly effective in over-parameterized models. Can combine with knowledge distillation where a dense teacher guides a pruned student.

2:4 sparsity enables hardware acceleration on NVIDIA Ampere and newer GPUs. Requires specialized training or pruning to achieve exactly 50% sparsity with 2-of-4 pattern. See the NVIDIA 2:4 sparsity guide for implementation details. When targeting modern NVIDIA GPUs, 2:4 sparsity provides genuine 2× speedup with manageable accuracy cost.

Fine-Tuning After Pruning

Pruning damages accuracy—fine-tuning recovers it. How long to fine-tune depends on sparsity level and accuracy gap:

Brief fine-tuning (10-20% of original training) often suffices for moderate pruning (30-40% sparsity). Use lower learning rate than original training (10-100× smaller).

Extended fine-tuning (50-100% of original training) required for aggressive pruning (60-80% sparsity) or sensitive models.

Learning rate scheduling matters—too high causes instability, too low prevents recovery. Start with 1/10th to 1/100th of peak training LR and decay to zero.

Re-growth vs. fixed masks: Some methods allow pruned weights to regrow during fine-tuning if they prove important. Fixed masks prevent regrowth, maintaining target sparsity. Fixed masks are simpler and usually sufficient.

Monitoring Drift After Deployment

Pruned models may drift differently than dense models when data distribution shifts. Monitor:

Accuracy across subgroups to detect if pruning caused disparate impact that worsens with distribution shift.

Layer-wise activation statistics to identify if specific pruned layers are struggling with new data patterns.

Retraining triggers when accuracy drops below thresholds. Pruned models may require retraining more frequently than dense models in non-stationary environments.

Distillation & Low-Rank: Smaller Models That Don't Feel Small

Knowledge distillation and parameter-efficient methods enable model compression through fundamentally different mechanisms than quantization and pruning.

Distillation Recipes That Work

Practical distillation requires careful configuration across multiple dimensions:

Teacher-student size ratio: Teachers 2-10× larger than students work well. Too large and knowledge transfer becomes inefficient. Too small and the student learns little beyond what supervised training provides.

Temperature selection: Start with temperature T=3-5 for most tasks. Higher temperatures (5-10) extract more information but can destabilize training. Lower temperatures (1-2) provide less knowledge beyond hard labels. Cross-validate to find optimal temperature for your task.

Loss weighting: Typical recipes use 90% soft target loss (student vs. teacher distributions) and 10% hard target loss (student vs. ground truth). This balances learning from teacher uncertainty while maintaining grounding in actual labels.

Training process: Pre-train the student on task data with hard labels before distillation. This warm start provides better initialization than random weights. Then distill with combined soft+hard loss. Finally, optionally fine-tune on hard labels alone to sharpen predictions.
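
Put together, a single distillation epoch looks like the sketch below, reusing the distillation_loss function sketched earlier; `teacher`, `student`, `optimizer`, and `train_loader` are assumed to already exist.

    # One distillation epoch with a frozen teacher providing soft targets.
    import torch

    teacher.eval()                                   # teacher weights stay frozen
    student.train()
    for inputs, labels in train_loader:
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        loss = distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()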

Intermediate layers: Advanced distillation can match intermediate feature maps between teacher and student, not just final outputs. This provides richer supervision but requires architectural compatibility between teacher and student.

Distill Then Quantize vs. Quantize Then Distill

Two viable paths exist for combining distillation and quantization:

Distill first: Train a smaller student via distillation, achieving desired size reduction. Then quantize the student for additional compression. This approach provides clean separation of concerns and typically yields smaller final models.

Quantize first: Quantize the teacher, then distill from quantized teacher to quantized student. Both teacher and student operate in quantized precision, so knowledge transfer accounts for quantization constraints. Useful when final deployment mandates specific precision and you want the student to learn quantization-robust representations from the start.

Empirically: Distill-first typically works better for achieving maximum compression. Quantize-first makes sense when deployment precision is fixed and you want representations optimized for that precision.

LoRA Rank Selection and Memory Math

LoRA introduces trainable low-rank matrices ΔW = BA where B is dimension d×r and A is r×k, with rank r << d,k. Total trainable parameters = r(d+k) compared to dk for full weight matrix.
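
Plugging the formula into a quick count for an illustrative 32-layer model with hidden size 4096 and LoRA applied to the four attention projections:

    # Trainable parameters from the r(d + k) formula above (illustrative sizes).
    def lora_trainable_params(d: int, k: int, r: int, n_modules: int) -> int:
        return r * (d + k) * n_modules

    # 4 projections per layer × 32 layers, d = k = 4096, r = 32:
    print(lora_trainable_params(4096, 4096, 32, 4 * 32))   # ≈ 33.6M vs. ~2.1B dense parameters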

Rank selection guidelines:

  • r=4-8: Minimal capacity, works for simple task adaptation with limited training data
  • r=16-32: Standard range balancing capacity and efficiency, suitable for most fine-tuning
  • r=64-128: High capacity for complex adaptations or when training data is abundant
  • r≥256: Approaching full fine-tuning, diminishing efficiency benefits

Memory math example: Fine-tuning a 7B parameter model with bf16 weights and AdamW optimizer requires:

  • Full fine-tuning: 7B × 2 bytes (weights) + 7B × 2 bytes (gradients) + 7B × 8 bytes (optimizer states) = ~84GB
  • LoRA with r=32: 7B × 2 bytes (frozen weights) + ~50M × 2 bytes (adapter weights) + ~50M × 2 bytes (gradients) + ~50M × 8 bytes (optimizer states) = ~14.6GB

LoRA reduces training memory by 5-6× for typical configurations, enabling single-GPU fine-tuning of models that would otherwise require distributed training.

Merging adapters: After training, you can merge LoRA adapters into base weights: W_final = W_base + BA. The merged model requires no special serving infrastructure. Alternatively, keep adapters separate for multi-tenant serving where one base model serves multiple task-specific adapters.
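
A hedged sketch of this lifecycle with the Hugging Face peft library; the checkpoint name and target module names are illustrative and vary by architecture.

    # Attach rank-32 adapters, train, then fold them into the base weights.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("my-org/base-7b")   # hypothetical checkpoint
    config = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(base, config)
    model.print_trainable_parameters()

    # ... fine-tune `model` with your usual training loop ...

    merged = model.merge_and_unload()   # W_final = W_base + BA; standard serving from here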

QLoRA for Long-Context LLMs

QLoRA extends LoRA with quantization, storing the base model in INT4 while training FP16 adapters. Key innovations:

4-bit NormalFloat: Custom quantization format optimized for normal distributions common in neural network weights, providing better accuracy than standard INT4.

Double quantization: Quantizes the quantization constants themselves (scale factors and zero points), saving additional memory.

Paged optimizers: Leverage CPU memory for optimizer states when GPU memory fills, enabling larger batch sizes or longer contexts.

Result: Fine-tune 33B parameter models on single 24GB GPU, 65B on 48GB. This democratizes large model customization for organizations without multi-GPU clusters.
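
A sketch of QLoRA setup with transformers, bitsandbytes, and peft; the checkpoint name is illustrative and exact arguments may differ across library versions.

    # Load the base model in 4-bit NF4 with double quantization, then add LoRA adapters.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained("my-org/base-33b",   # hypothetical checkpoint
                                                quantization_config=bnb_config)
    base = prepare_model_for_kbit_training(base)
    model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32,
                                            target_modules=["q_proj", "v_proj"],
                                            task_type="CAUSAL_LM"))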

QLoRA validation is critical—verify that INT4 base + FP16 adapters achieve comparable quality to FP16 throughout. Some tasks are more sensitive to base model quantization than others.

LLM Serving Tricks That Feel Like Compression

For transformer-based language models, serving optimizations deliver compression-like benefits—faster inference, lower memory, higher throughput—without modifying model weights.

FlashAttention Family: IO-Aware Kernels

Standard attention implementations materialize the full N×N attention matrix, causing memory bottleneck that limits batch size and context length. FlashAttention redesigns attention around GPU memory hierarchy:

FlashAttention-1 introduced tiling and recomputation: divide attention into blocks that fit in SRAM, compute attention for each block, recompute intermediate values as needed rather than storing them in HBM. Result: O(N) memory vs. O(N²), enabling longer contexts.

FlashAttention-2 improved parallelism and reduced non-matmul operations, achieving 2× additional speedup over FA-1. Better GPU utilization via optimized work partitioning.

FlashAttention-3 targets Hopper architecture specifically: asynchronous tensor core and TMA operations, overlapped memory transfers, warp-specialization. Achieves near-theoretical peak utilization.

Integration: Most modern frameworks (PyTorch, TensorFlow, JAX) now include FlashAttention or compatible implementations. For custom CUDA code, use the official FlashAttention implementation. This is free performance—no accuracy trade-off, minimal integration effort.
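
For example, PyTorch exposes fused attention through scaled_dot_product_attention, which dispatches to a FlashAttention-style kernel when the hardware, dtype, and shapes allow; a minimal sketch with random tensors:

    # Fused attention via PyTorch SDPA (kernel selection depends on hardware and dtype).
    import torch
    import torch.nn.functional as F

    q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)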

When to use: Always, for any transformer serving. FlashAttention universally improves memory and speed. No reason not to use it.

vLLM and PagedAttention: Fragment-Proof Batching

Traditional LLM serving pre-allocates fixed KV-cache blocks per sequence, leading to memory fragmentation and waste. vLLM's PagedAttention introduces virtual memory management:

Paged KV-cache: Divide KV-cache into pages (blocks) that can be allocated dynamically as sequences grow. Deallocate pages when sequences finish. Share pages across sequences for prefix caching.

Continuous batching: New requests join in-flight batches as others complete, maximizing GPU utilization. Traditional static batching wastes capacity when batch members finish at different times.

Benefits: 2-4× higher throughput via better memory utilization enabling larger effective batch sizes. Lower latency via continuous batching reducing time-to-first-token. Prefix caching shares KV-cache for common prompt prefixes.

Deployment: vLLM has become the production standard for LLM serving. Use it unless you have very specific requirements it doesn't support. Integrates with FastAPI, supports OpenAI-compatible API, handles multiple models, and includes request scheduling.

Configuration tips: Tune max_num_batched_tokens and max_num_seqs based on GPU memory and throughput/latency targets. Enable prefix caching for workloads with common prompt patterns. Monitor KV-cache utilization and tune block size if needed.
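
A minimal offline-serving sketch; the argument names follow vLLM's engine options, but the values are workload-dependent starting points and the model name is a placeholder.

    # vLLM offline engine with paged KV-cache and prefix caching enabled.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="my-org/chat-7b",            # hypothetical model name
        gpu_memory_utilization=0.90,       # fraction of GPU memory for weights + KV-cache
        max_num_seqs=256,                  # cap on concurrent sequences per batch
        max_num_batched_tokens=8192,       # scheduler budget per step
        enable_prefix_caching=True,        # share KV-cache across common prompt prefixes
    )
    outputs = llm.generate(["Summarize the claim below: ..."],
                           SamplingParams(max_tokens=256, temperature=0.2))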

KV-Cache: Memory Dominance and Compression

At long context lengths, KV-cache memory consumption dominates. For a 7B parameter model at 32K context with batch size 32:

  • Model weights: 7B × 2 bytes (FP16) = 14GB
  • KV-cache: 32 layers × 2 (K+V) × 32 batch × 32K context × 4096 hidden × 2 bytes = 512GB

KV-cache is 36× larger than model weights! This explains why long-context serving is memory-constrained.
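
The same arithmetic as a small helper, handy for sizing deployments; 2 bytes per element assumes an FP16 cache.

    # KV-cache footprint: layers × 2 (K and V) × batch × sequence × hidden × bytes/element.
    def kv_cache_bytes(layers, batch, seq_len, hidden, bytes_per_elem=2):
        return layers * 2 * batch * seq_len * hidden * bytes_per_elem

    print(kv_cache_bytes(32, 32, 32_768, 4096) / 2**30)   # ≈ 512 GiB for the example above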

KV-cache quantization reduces this burden. INT8 KV-cache provides 2× reduction with minimal quality loss (typically <0.5 perplexity degradation). INT4 achieves 4× reduction but requires careful validation—acceptable for some tasks, degrading for others.

Implementation: Most serving frameworks support KV-cache quantization via configuration flags. Start with INT8, validate quality, consider INT4 if memory remains constrained and quality is acceptable.

Chunked KV-cache: For extremely long contexts, chunk the cache across CPU/GPU or even disk storage, swapping in relevant chunks as needed, trading latency for memory capacity.

Scheduling and Batching Policies

How requests are scheduled and batched significantly impacts throughput and latency:

Continuous batching allows new requests to join in-flight batches as generation proceeds. Dramatically improves throughput versus static batching that waits for entire batch to complete.

Priority scheduling assigns priorities to requests—low-latency tier-1 requests preempt batch processing. Configure priority levels based on service tiers.

Fair scheduling prevents starvation by ensuring all requests eventually get capacity, even when higher-priority requests arrive continuously.

Preemption strategies: Aggressive preemption reduces tail latency at cost of throughput. Conservative preemption maximizes throughput but increases latency variance. Tune based on workload characteristics.

vLLM implements sophisticated scheduling—generally trust default policies but understand configuration options for workload-specific tuning.

Tool selection priorities:

  1. Match target hardware: NVIDIA TensorRT for NVIDIA GPUs, Intel Neural Compressor for Intel CPUs, ONNX Runtime for portability
  2. Leverage existing framework: If training in PyTorch, start with PyTorch quantization before considering export to other runtimes
  3. Consider production maturity: vLLM and TensorRT are production-hardened; newer tools may lack operational features
  4. Evaluate vendor support: Tools actively maintained by hardware vendors often get optimizations first

Multi-tool workflows are common: Train with PyTorch, quantize with PyTorch quantization, export to ONNX, deploy with ONNX Runtime or TensorRT. The export step introduces validation requirements—test that exported model produces identical outputs to the original.

Accuracy, Robustness, and Safety Trade-offs

Compression can shift model behavior in subtle ways that aggregate accuracy metrics miss. Safety-critical and regulated applications require additional validation.

How Compression Shifts Error Distributions

Compression rarely degrades all predictions uniformly. Common patterns:

Tail risk concentration: Compression may preserve average-case performance while degrading worst-case. The 1% of hardest examples might see 5-10% accuracy drop while aggregate accuracy drops only 1%.

Subgroup disparities: Quantization or pruning can affect demographic subgroups differently. A model performing equally across groups before compression might show disparate accuracy after compression if calibration data or pruning importance metrics favor majority groups.

Confidence miscalibration: Compressed models often become overconfident—predicted probabilities no longer reflect true likelihood. This matters for decision-making where confidence thresholds determine actions.

Task-specific sensitivity: Some tasks tolerate compression better than others. Classification is generally robust, while regression, ranking, and generation tasks may be more sensitive.

Validation best practices:

  • Test across multiple metrics, not just aggregate accuracy
  • Evaluate subgroup performance on protected attributes
  • Check calibration via reliability diagrams
  • Profile tail-percentile error rates
  • Validate on distribution shift and adversarial examples

Robustness to Distribution Shift and Adversarial Examples

Compressed models may exhibit different robustness characteristics than full-precision models:

Distribution shift: Quantized models sometimes generalize worse to distribution shifts not represented in calibration data. Test on datasets with natural variation from training distribution.

Adversarial robustness: Quantization can increase or decrease adversarial robustness depending on architecture and attack. Don't assume compression improves or degrades robustness—measure it.

Gradient-based attacks: Quantization introduces non-differentiability that can break gradient-based adversarial attacks, potentially providing robustness benefits. However, quantization-aware attacks exist, so don't rely on this for security.

Deployment validation: Test compressed models under conditions similar to production, including expected distribution shifts and potential adversarial inputs if applicable to your threat model.

Frequently Asked Questions

Is FP8 safe for production?

FP8 is production-ready on NVIDIA Hopper GPUs (H100, H200) with proper validation. The Transformer Engine provides mature FP8 support for transformer models. Validate that FP8 maintains quality for your specific model and task—most transformers handle FP8 well, but sensitive applications may require mixed-precision where certain operations remain in FP16. FP8 provides 1.5-2× speedup vs. FP16 with minimal accuracy loss when hardware supports it. For non-Hopper hardware, FP8 support is limited—stick with INT8 or FP16.

When does INT4 work?

INT4 quantization works when: (1) you can tolerate 1-3% accuracy degradation, (2) memory is the primary constraint, (3) you use specialized methods like GPTQ, AWQ, or QLoRA rather than naive quantization, (4) the task has sufficient margin (not near decision boundaries), or (5) you apply quantization-aware training. INT4 is most successful for LLM weight storage combined with higher-precision activations. For most production systems, INT8 provides better accuracy/efficiency trade-off unless memory constraints are severe.

Do I prune before or after quantization?

Generally: prune → fine-tune → quantize works best. Prune the FP32 model, fine-tune to recover accuracy, then quantize the pruned model for maximum compression. This sequence allows pruning to learn optimal sparse structure in full precision, then quantization compresses the already-efficient model. However, if you have specific deployment precision requirements, consider quantize → prune → fine-tune to ensure pruning learns importance in the target precision. Test both orders for your specific model—results can vary by architecture.

Can I combine all compression techniques?

Yes, techniques stack: distillation (to smaller architecture) + structured pruning (remove 30-40% of parameters) + INT8 quantization (reduce precision) + serving optimizations (FlashAttention, vLLM). This aggressive stacking can achieve 20-50× compression with 3-7% accuracy loss. However, each technique introduces risk and engineering effort. Start with low-risk techniques (PTQ, serving optimizations), validate results, then add more aggressive methods if needed. Don't over-optimize—stop when you meet deployment constraints.

How do I know if compression is working in production?

Monitor these signals:

  • Throughput increased by expected amount (measure requests/sec)
  • Latency decreased with P95/P99 within targets
  • Memory usage dropped proportionally to compression (4× for INT8, 2× for pruning, etc.)
  • Quality maintained via production accuracy/quality metrics
  • Cost reduced reflected in actual infrastructure spending
  • GPU utilization high (70-90%) indicating efficient hardware use

If metrics don't improve as expected, profile your serving stack—likely configuration issues preventing compression benefits from materializing.

What's the quickest way to get 2-3× speedup?

For transformers: Enable FlashAttention + deploy with vLLM. Total engineering time: 1-2 days. Zero accuracy impact. Immediate 2-4× throughput improvement. This is the highest ROI compression investment for LLM serving.

For other models: INT8 post-training quantization. Engineering time: 2-4 days including validation. Typically <1% accuracy loss. Achieves 2-4× speedup on most hardware. Start here before considering more complex techniques.
