Vision as LoRA

Venue: ICLR 2026 (Withdrawn)
Authors: Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, Can Huang
OpenReview: https://openreview.net/forum?id=0n7dDguNeJ

Relevance

LLM score: 1/3. Mentions efficiency via LoRA merging and distillation for training acceleration, but the main contribution is the architectural integration of vision into LLMs, which is not a core Sutro Group focus (energy-efficient training).

Keyword hits: distillation, lora

TLDR

(none provided)

Abstract

We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, by inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions.
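
The merging step can be illustrated with a minimal sketch. It assumes the standard LoRA formulation, W' = W + (alpha / r) * B A, which the abstract does not spell out. The helper names (`matmul`, `merge_lora`, `lora_forward`) and the toy matrices are hypothetical and are not taken from the VoRA code.

```python
# Minimal sketch of LoRA merging under the standard LoRA formulation (an assumption;
# the abstract only states that the added parameters can be merged at inference).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def linear(W, x):
    """Apply weight matrix W (d_out x d_in) to vector x (d_in)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, alpha, r, x):
    """Unmerged path: frozen weight plus a scaled low-rank branch B(Ax)."""
    scale = alpha / r
    base = linear(W, x)
    low_rank = linear(B, linear(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update into W so inference needs a single matrix."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example with d_in = 3, d_out = 2, rank r = 1 (hypothetical values).
W = [[1, 0, 2], [0, 1, -1]]
A = [[1, 2, 3]]          # r x d_in
B = [[0.5], [-1.0]]      # d_out x r
alpha, r = 2, 1
x = [1, 1, 1]

unmerged_out = lora_forward(W, A, B, alpha, r, x)
merged_out = linear(merge_lora(W, A, B, alpha, r), x)
print(unmerged_out)  # [9.0, -12.0]
print(merged_out)    # [9.0, -12.0]
```

Both paths give the same output, which is why a merged model carries no extra modules or per-token cost at inference under this formulation.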

To further strengthen VoRA’s visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to image tokens to better capture the contextual information of an image. We demonstrate that, with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs.
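
As a rough illustration of the bi-directional masking idea, the sketch below builds a mask that stays causal for text tokens but lets tokens of the same image attend to one another in both directions. The exact mask used by VoRA is not specified in the abstract, so the function name `build_attention_mask`, the `is_image` flag list, and the rule that each contiguous run of image tokens forms one span are all assumptions.

```python
# Hypothetical sketch: causal attention for text, bi-directional attention
# within each contiguous span of image tokens.

def build_attention_mask(is_image):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    n = len(is_image)

    # Give every contiguous run of image tokens its own span id; text tokens get None.
    span = [None] * n
    current = -1
    for i, flag in enumerate(is_image):
        if flag:
            if i == 0 or not is_image[i - 1]:
                current += 1
            span[i] = current

    # Causal rule (j <= i) for everything, plus full visibility inside one image span.
    return [
        [j <= i or (span[i] is not None and span[i] == span[j]) for j in range(n)]
        for i in range(n)
    ]

# One text token, three image tokens, one text token.
mask = build_attention_mask([False, True, True, True, False])
rows = ["".join("1" if allowed else "0" for allowed in row) for row in mask]
for row in rows:
    print(row)
# 10000
# 11110
# 11110
# 11110
# 11111
```

In this sketch, each image token sees the whole image regardless of position, while the trailing text token still attends only to earlier positions.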

Keywords

MLLM, LoRA