Skip to content

DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Venue: iclr2026 (Poster) Authors: OpenReview: https://openreview.net/forum?id=3UDlRUf1es

Relevance

LLM score: 1/3 — Tangential: mentions efficiency by replacing costly text encoders, but the main focus is on decoupled detection architecture, not energy-efficient training or Sutro Group's core priorities. Keyword hits: knowledge distillation, distillation

TLDR

(none provided)

Abstract

Open-Vocabulary Object Detection (OVOD) plays a critical role in autonomous driving and human-computer interaction by enabling perception beyond closed-set categories. However, current approaches predominantly rely on multimodal fusion, facing dual limitations: multimodal fusion methods incur heavy computational overhead from text encoders, while task-coupled designs compromise between detection precision and open-world generalization. To address these challenges, we propose Decoupled Cognition DETR, a vision framework featuring a three-stage cognitive distillation mechanism: Dynamic Hierarchical Concept Pool constructs self-evolving concept prototypes using LLaVA-generated region descriptions filtered by CLIP alignment, aiming to replace costly text encoders and reduce computational overhead; Hierarchical Knowledge Distillation decouples visual-semantic space mapping via prototype-centric projection, avoiding task coupling to enhance open-world generalization; Parametric Decoupling Training coordinates localization and cognition through dual-stream gradient isolation, further optimizing detection precision. Extensive experiments on the common OVOD evaluation protocol demonstrated that DeCo-DETR achieves state-of-the-art performance compared to existing OVOD methods. It provides a new paradigm for extending OVOD to more real-world applications.

Keywords

Open-Vocabulary Object Detection, Knowledge Distillation, Multi-modal