
A Research Review of Scaling Laws for Differentially Private Language Models

  • Writer: Nikita Silaech
  • Sep 26
  • 2 min read

By Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, George Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang

Published: Jan 2025 | https://arxiv.org/abs/2501.18914


Overview

This paper establishes the first comprehensive scaling laws for differentially private (DP) language model training, revealing fundamental differences from non-private scaling behavior. The authors extend traditional compute-utility tradeoffs to include privacy and data budgets, training BERT models from 4.5M to 778M parameters under various differential privacy constraints. 

Their methodology decouples noise calibration from privacy accounting to understand the underlying compute-privacy-utility dynamics. The work demonstrates that optimal DP training configurations can achieve 5-100× compute savings compared to naive baselines while maintaining comparable privacy guarantees and model performance.


Why It Matters

The tension between leveraging user data for AI advancement and protecting privacy has reached a critical juncture. While LLMs require massive, diverse datasets often containing sensitive information, current DP training methods remain largely empirical with limited theoretical guidance. 

This creates a practical barrier to responsible AI development at scale — the largest DP-trained models today have only hundreds of millions rather than billions of parameters. By establishing principled scaling laws, this work provides the theoretical foundation needed to train billion-parameter DP models efficiently, potentially transforming how the industry approaches privacy-preserving AI development.


Technical Approach

The authors develop a semi-parametric methodology centered on the "noise-batch ratio" (σ̄) — the standard deviation of noise added to mean minibatch gradients. Rather than directly modeling privacy budget effects, they decouple noise calibration by training with fixed physical batch sizes (1024) and varying noise levels, then use post-hoc accounting to predict performance at different configurations. 
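To make the noise-batch ratio concrete, here is a minimal numpy sketch of one DP-SGD-style step. The helper name and shapes are hypothetical, not the authors' code: with clip norm C, noise multiplier σ, and batch size B, the Gaussian noise added to the summed clipped gradients has standard deviation σC, so the noise on the mean gradient (the noise-batch ratio) is σC/B.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier):
    """One illustrative DP-SGD step (hypothetical helper, not the paper's code)."""
    batch_size = per_example_grads.shape[0]
    # Clip each per-example gradient to norm at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Gaussian noise with std noise_multiplier * clip_norm is added to the *sum*
    # of clipped gradients; dividing by batch_size yields the noisy mean gradient.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    mean_grad = (clipped.sum(axis=0) + noise) / batch_size
    # Noise-batch ratio: std of the noise applied to the mean gradient.
    noise_batch_ratio = noise_multiplier * clip_norm / batch_size
    return mean_grad, noise_batch_ratio

grads = np.random.randn(1024, 10)  # hypothetical per-example gradients
g, sigma_bar = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0)
print(sigma_bar)                   # ~0.001 at the paper's batch size of 1024
```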

They train BERT variants across 6 model sizes, 1280 iteration counts, and 18 noise-batch ratios, applying isotonic regression to enforce monotonicity properties and extrapolating training curves using parametric forms. This approach enables systematic exploration of the compute-privacy-utility tradeoff space.
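As an illustration of the monotonic smoothing step, the sketch below fits scikit-learn's IsotonicRegression to synthetic loss measurements across 18 noise-batch ratios. The data are made up for illustration; the paper's actual fitting and extrapolation pipeline is more involved.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: final loss at 18 noise-batch ratios for one
# (model size, iteration count) cell; values are illustrative only.
noise_batch_ratios = np.logspace(-6, -2, 18)
losses = 2.0 + 30.0 * noise_batch_ratios**0.5 + rng.normal(0, 0.02, 18)

# Loss should not improve as the noise-batch ratio grows; isotonic
# regression projects the noisy measurements onto the closest
# non-decreasing curve before any cross-configuration extrapolation.
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
smoothed_losses = iso.fit_transform(noise_batch_ratios, losses)
```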


Evaluation Highlights

  • Compute efficiency gains: Optimal configurations achieve 5-100× compute savings over baseline approaches

  • Model size reduction: Optimal DP models are roughly 100× smaller than non-private scaling laws predict (e.g., ~100M vs. ~10B parameters at 10²² FLOPs; see the arithmetic sketch after this list)

  • Token-to-model ratios: Compute-optimal DP training uses on the order of 1,000-100,000 tokens per parameter, versus the standard ~20 tokens per parameter for non-private models

  • Saturating returns: Compute budget increases provide diminishing returns beyond critical thresholds tied to privacy/data budgets
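The model-size numbers above follow from simple budget arithmetic. A rough sketch, assuming the usual C ≈ 6ND FLOP approximation and the token-to-parameter ratios quoted in the bullets (the constants are illustrative):

```python
# Compute-optimal sizing from C ≈ 6*N*D, with D = r*N tokens at a
# token-to-parameter ratio r, giving N = sqrt(C / (6*r)).
C = 1e22                                   # training FLOPs

r_nonprivate = 20                          # the standard ~20 tokens/param
N_nonprivate = (C / (6 * r_nonprivate)) ** 0.5
print(f"non-private optimum: {N_nonprivate:.1e} params")  # ~9e9, i.e. ~10B

r_dp = 1e4                                 # a mid-range DP ratio from above
N_dp = (C / (6 * r_dp)) ** 0.5
print(f"DP optimum:          {N_dp:.1e} params")          # ~4e8, ~100M-scale
```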


Core Contributions

  • Extended scaling laws: First framework incorporating compute, privacy (ε), and data (N) budgets simultaneously

  • Noise-batch ratio methodology: Novel decoupling approach enabling systematic study of DP training dynamics

  • Optimal allocation guidance: Empirical rules for distributing compute between model size, batch size, and training iterations (illustrated by the sketch after this list)

  • Critical budget identification: Discovery of saturation points where additional compute provides minimal benefit under fixed privacy constraints
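To show what such allocation rules operationalize, here is a hypothetical search over (model size, batch size, iterations) under a fixed FLOP budget. The `predicted_loss` function is a made-up stand-in for the paper's fitted scaling law, not its actual form.

```python
# Hypothetical allocation search under a FLOP budget C ≈ 6*N*B*T*L
# (N params, B batch size, T iterations, L tokens per sequence).
SEQ_LEN = 512
FLOP_BUDGET = 1e22

def predicted_loss(n_params, batch_size, iterations, noise_multiplier=1.0):
    # Illustrative shape only: more tokens and params help, more noise hurts.
    tokens = batch_size * iterations * SEQ_LEN
    noise_batch_ratio = noise_multiplier / batch_size
    return 10.0 / tokens**0.1 + 50.0 / n_params**0.2 + 1e2 * noise_batch_ratio**0.5

candidates = [
    (n, b, t)
    for n in (1e7, 1e8, 1e9)               # model sizes
    for b in (2**10, 2**14, 2**18)         # batch sizes
    for t in (1e3, 1e4, 1e5)               # iteration counts
    if 6 * n * b * t * SEQ_LEN <= FLOP_BUDGET   # stay within the budget
]
best = min(candidates, key=lambda cfg: predicted_loss(*cfg))
print(best)
```

With a real fitted law in place of the stand-in, the same enumeration would reflect the paper's qualitative finding that DP budgets favor smaller models trained with larger batches.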


Key Benefits

  • Principled hyperparameter selection: Eliminates guesswork in DP training configuration

  • Compute cost reduction: Dramatic efficiency improvements over current practice

  • Scalability: Clear path toward billion-parameter DP models with appropriate resource allocation

  • Theoretical foundation: Bridges the gap between DP theory and practical large-scale training


This work provides the first rigorous framework for understanding compute-privacy-utility tradeoffs in language model training. It reveals that optimal DP training requires fundamentally different resource-allocation strategies than non-private approaches, and it offers concrete guidance for achieving major efficiency gains.

