HiDivDrop: Vision Token Reduction in MLLMs via Late Injection and Differentiable Top-K¶
Venue: ICLR 2026 (Poster)
Authors: (none provided)
OpenReview: https://openreview.net/forum?id=2baJBgfr9S
Relevance¶
LLM score: 3/3. The paper directly advances energy-efficient training by sparsifying visual tokens, reducing data movement and computation in MLLMs.
Keyword hits: pruning
TLDR¶
(none provided)
Abstract¶
The computational cost of Multimodal Large Language Models (MLLMs), driven by the quadratic complexity of processing vision tokens, remains a significant barrier to their widespread adoption. While progressive vision token pruning is a promising solution, we find that its full potential remains unrealized due to two key limitations: it misattributes a crucial fusion role to shallow layers and employs overly rigid, non-adaptive pruning schedules. To address these flaws, we introduce HiDivDrop, a framework that tailors token pruning to the true hierarchical function of MLLM layers. HiDivDrop incorporates two key innovations: (1) a Late Injection strategy that bypasses passive shallow layers, introducing visual tokens directly where active fusion begins; and (2) a Concave Pyramid Pruning scheme with an Early Exit mechanism that dynamically adjusts the pruning rate throughout the middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-$k$ operator. Extensive experiments show that HiDivDrop compresses $\sim$90\% of visual tokens while matching the original performance and accelerating training by 1.72$\times$. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion.
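To make the two optimization ingredients named in the abstract concrete, below is a minimal PyTorch sketch of a straight-through differentiable top-$k$ mask driven by an inter-layer similarity score. The function names, tensor shapes, temperature, and the specific sigmoid relaxation are illustrative assumptions, not the authors' implementation; the paper's actual operator and scoring rule may differ.

```python
# Hypothetical sketch, NOT the paper's code: prune vision tokens between two
# decoder layers using inter-layer similarity + a relaxed (differentiable) top-k.
import torch
import torch.nn.functional as F

def soft_topk_mask(scores: torch.Tensor, k: int, temperature: float = 0.1) -> torch.Tensor:
    """Relaxed top-k: a sigmoid centered on the k-th largest score, combined with
    a straight-through hard mask so (up to ties) exactly k tokens survive in the
    forward pass while gradients flow through the soft surrogate."""
    threshold = scores.topk(k, dim=-1).values[..., -1:].detach()  # k-th largest as soft threshold
    soft = torch.sigmoid((scores - threshold) / temperature)      # differentiable surrogate
    hard = (scores >= threshold).float()                          # exact top-k selection
    return hard + (soft - soft.detach())                          # straight-through estimator

def interlayer_similarity_scores(h_prev: torch.Tensor, h_curr: torch.Tensor) -> torch.Tensor:
    """Score each token by how much its representation changes between layers:
    tokens that barely change (high cosine similarity) are pruning candidates,
    so the least similar ones get the highest keep-scores."""
    return -F.cosine_similarity(h_prev, h_curr, dim=-1)  # higher = more change = keep

# Usage: keep 64 of 576 visual tokens (~89% pruned) between two layers.
B, N, D = 2, 576, 4096                       # batch, visual tokens, hidden size (assumed)
h_prev = torch.randn(B, N, D)
h_curr = torch.randn(B, N, D, requires_grad=True)
mask = soft_topk_mask(interlayer_similarity_scores(h_prev, h_curr), k=64)
pruned = h_curr * mask.unsqueeze(-1)         # zeroed tokens can then be dropped
pruned.sum().backward()                      # gradients reach h_curr via the soft mask
```

The straight-through trick keeps the forward pass an exact top-$k$, so the model sees the same sparsity during training and inference, while the sigmoid surrogate supplies a usable gradient for the otherwise non-differentiable selection step.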
Keywords¶
MLLMs, Vision Token Pruning, Efficiency and Compression, Interpretability and Analysis