On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

Venue: iclr2026 (Withdrawn) Authors: Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song OpenReview: https://openreview.net/forum?id=2zyfDjy5C0

Relevance

LLM score: 1/3 — Mentions efficiency angle (sub-quadratic time for VAR models) but main contribution is a fine-grained complexity analysis, not directly advancing energy-efficient training or Sutro Group's specific priorities. Keyword hits: phase transition, low-rank

TLDR

(none provided)

Abstract

Recently, Visual Autoregressive ($\mathsf{VAR}$) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine ``next-scale prediction'' paradigm. Let $n$ denote the height and width of the last VQ code map generated by $\mathsf{VAR}$ models; the state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the performance bottlenecks and hardness results of $\mathsf{VAR}$ models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve sub-quadratic time complexity. We prove that, assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria.

Formally, suppose that $n$ is the height and width of the last VQ code map in $\mathsf{VAR}$ models, $d$ is the hidden dimension, and $R$ bounds the entries of the input matrices for the attention calculations in $\mathsf{VAR}$ models. We present two results:

On the negative side, we show that when $d = O(\log n)$ and $R = \Theta(\sqrt{\log n})$, assuming $\mathsf{SETH}$, it is impossible to approximate the output of the $\mathsf{VAR}$ model up to $1/\mathrm{poly}(n)$ additive error in truly sub-quartic time $O(n^{4-\Omega(1)})$.

On the positive side, we demonstrate a sharp phase transition in the computational complexity of the $\mathsf{VAR}$ model. When $d = O(\log n)$ and $R = o(\sqrt{\log n})$, the runtime improves dramatically from $O(n^{4+o(1)})$ to $O(n^{2+o(1)})$, while approximating the output of the $\mathsf{VAR}$ model up to $1/\mathrm{poly}(n)$ additive error.
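The intuition behind the positive result is that when attention entries are bounded (small $R$), the exponential kernel $\exp(QK^\top)$ admits an accurate low-rank factorization, so attention can be computed without materializing the $n \times n$ matrix. The sketch below is illustrative, not the paper's construction: it uses a truncated-Taylor feature map (so that $\phi(q)\cdot\phi(k) \approx \exp(q\cdot k)$), with all function names, dimensions, and the entry bound `R` chosen for the demo.

```python
import math
import numpy as np

def taylor_features(X, degree=5):
    """Map rows of X (shape n x d) to features Phi (shape n x r) such that
    Phi[i] . Phi[j] ~= exp(X[i] . X[j]), via the Taylor series of exp.
    The j-th block is x^{(tensor j)} / sqrt(j!), flattened."""
    n, _ = X.shape
    feats = [np.ones((n, 1))]          # degree-0 term: constant 1
    cur = np.ones((n, 1))              # running tensor power x^{(tensor (j-1))}
    for j in range(1, degree + 1):
        cur = np.einsum('ni,nj->nij', cur, X).reshape(n, -1)
        feats.append(cur / math.sqrt(math.factorial(j)))
    return np.concatenate(feats, axis=1)

def exact_attention(Q, K, V):
    """Naive softmax attention: O(n^2) time and memory."""
    A = np.exp(Q @ K.T)
    return (A @ V) / A.sum(axis=1, keepdims=True)

def lowrank_attention(Q, K, V, degree=5):
    """Low-rank approximation: never forms the n x n matrix.
    Cost is O(n * r * d) where r is the feature dimension."""
    PQ, PK = taylor_features(Q, degree), taylor_features(K, degree)
    num = PQ @ (PK.T @ V)              # approximates exp(QK^T) @ V
    den = PQ @ PK.sum(axis=0)          # approximates row sums of exp(QK^T)
    return num / den[:, None]

# Demo with small bounded entries (the regime where low rank suffices).
rng = np.random.default_rng(0)
n, d, R = 64, 4, 0.3                   # illustrative sizes, not the paper's
Q = rng.uniform(-R, R, (n, d))
K = rng.uniform(-R, R, (n, d))
V = rng.standard_normal((n, d))
err = np.max(np.abs(lowrank_attention(Q, K, V) - exact_attention(Q, K, V)))
```

With entries bounded by a small `R`, the dot products $q \cdot k$ are small, so a low truncation degree already gives tiny additive error; as `R` grows toward $\sqrt{\log n}$, the degree (and hence the rank) needed for $1/\mathrm{poly}(n)$ error blows up, matching the phase transition described above.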

This work initiates the study of the computational efficiency of the $\mathsf{VAR}$ model from a theoretical perspective. Our techniques shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.

Keywords

generative model, complexity, fast attention