Pruning with Occam's Razor

Venue: ICLR 2026 (Desk Reject)
Authors: B.N. Kausik
OpenReview: https://openreview.net/forum?id=0LKnVeXBGK

Relevance

LLM score: 3/3. The paper directly advances energy-efficient training by integrating pruning with gradient descent to reduce model size and compute, aligning with sparsity and training efficiency priorities.

Keyword hits: pruning

TLDR

(none provided)

Abstract

Deep learning neural network models must be large enough to adapt to their problem domain, while small enough to avoid overfitting training data during gradient descent. To balance these competing demands, over-provisioned deep learning models such as transformers are trained for a single epoch on large data sets, and are hence inefficient in their use of both computing resources and training data. In response to these inefficiencies, we derive a provably good algorithm that can combine any training and pruning methods to simultaneously optimize efficiency and accuracy, identifying conditions that resist overfitting and reduce model size while outperforming the underlying training algorithm. We then use the algorithm to combine gradient descent with magnitude pruning into "Occam Gradient Descent." With respect to loss, compute, and model size: (a) on image classification benchmarks, linear and convolutional neural networks trained with Occam Gradient Descent outperform traditional gradient descent with or without post-train pruning; (b) on a range of tabular data classification tasks, neural networks trained with Occam Gradient Descent outperform traditional gradient descent, as well as Random Forests; (c) on natural language transformers, Occam Gradient Descent outperforms traditional gradient descent.
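
Illustrative sketch (not from the paper): the abstract describes interleaving gradient descent with magnitude pruning but does not give the schedule or the conditions under which pruning is applied. The Python below shows only the generic idea, assuming a linear model, squared loss, and a fixed pruning fraction applied after every epoch. The names `magnitude_prune`, `train_with_pruning`, and `prune_fraction` are hypothetical, and nothing here should be read as the author's Occam Gradient Descent algorithm.

```python
import random


def magnitude_prune(weights, mask, fraction):
    """Deactivate the smallest-magnitude `fraction` of the still-active weights."""
    active = [i for i, m in enumerate(mask) if m]
    k = int(len(active) * fraction)  # number of weights to drop this round
    for i in sorted(active, key=lambda j: abs(weights[j]))[:k]:
        mask[i] = False
        weights[i] = 0.0
    return weights, mask


def train_with_pruning(xs, ys, epochs=10, lr=0.02, prune_fraction=0.2):
    """Alternate one epoch of SGD on squared loss with one magnitude-pruning step.

    Assumption: pruning after every epoch at a fixed fraction. The paper's
    actual schedule and stopping conditions are not stated in the abstract.
    """
    dim = len(xs[0])
    weights = [0.0] * dim
    mask = [True] * dim  # True means the weight is still trainable
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = sum(w * xi for w, xi in zip(weights, x)) - y
            for i in range(dim):
                if mask[i]:  # pruned weights stay at zero
                    weights[i] -= lr * err * x[i]
        weights, mask = magnitude_prune(weights, mask, prune_fraction)
    return weights, mask


# Toy usage: 10 input features, only the first two carry signal.
rng = random.Random(0)
xs = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]
ys = [3 * x[0] - 2 * x[1] for x in xs]
weights, mask = train_with_pruning(xs, ys)
print("active weights:", sum(mask), "of", len(mask))
```

In this toy setup the model shrinks during training rather than in a separate post-training pass, which is the contrast the abstract draws with "post-train pruning". The fixed fraction is a simplification chosen for brevity.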

Keywords

Learning theory, Occam's razor, pruning, gradient descent