Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

1Mila - Québec AI Institute, 2UBC, 3Université de Montréal, 4Vector Institute, 5Canada CIFAR AI Chair
Figure: PROGRESS pipeline. PROGRESS dynamically selects informative samples based on the model's evolving learning state across automatically discovered skill clusters.

✨ Highlights ✨

  • Efficient Learning: Achieves near full-data performance using only 16-20% of labeled training data.
  • Dynamic Selection: Periodically self-evaluates to identify skills where performance improves fastest relative to its prior state.
  • Self-Contained: Requires no upfront answer annotations, no auxiliary reference VLMs, and no compute-heavy gradient computations.
  • Scalable & Transferable: Shows strong cross-architecture generalization (e.g., LLaVA-1.5 to Qwen2-VL) without model-specific tuning.

Abstract

Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive—requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PROGRESS, a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples—those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. PROGRESS requires no upfront answer annotations, queries answers only on a need basis, and avoids reliance on additional supervision from auxiliary VLMs. Experiments demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision.

Approach

PROGRESS employs a two-stage pipeline for efficient training:

  1. Multimodal Concept Categorization: Unsupervised partitioning of the unlabeled pool into K skill clusters using concatenated DINO (visual) and BERT (textual) features.
  2. Prioritized Concept Learning: A self-paced strategy where the model prioritizes skills showing the highest relative improvement (Δ) in accuracy or loss compared to its prior state.
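The first stage above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random arrays stand in for DINO image embeddings and BERT question embeddings, the per-modality L2-normalization is an assumption, and a small hand-rolled k-means is used as the illustrative clustering choice (the text only specifies an unsupervised partition into K clusters).

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means (illustrative clustering choice; the paper only
    specifies an unsupervised partition into K skill clusters)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each sample to its nearest center, then recompute centers.
        labels = np.linalg.norm(X[:, None, :] - centers[None], axis=2).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def discover_skill_clusters(visual_feats, text_feats, k=8):
    """Stage 1 sketch: concatenate per-sample visual and textual features,
    then partition the unlabeled pool into K skill clusters.

    visual_feats / text_feats stand in for DINO image and BERT question
    embeddings, shapes (N, d_v) and (N, d_t)."""
    # L2-normalize each modality so neither dominates the joint space
    # (a common choice; the exact normalization is an assumption here).
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return kmeans(np.concatenate([v, t], axis=1), k)

# Toy usage with random arrays standing in for real embeddings.
rng = np.random.default_rng(0)
labels = discover_skill_clusters(rng.normal(size=(100, 16)),
                                 rng.normal(size=(100, 8)), k=4)
```

Each resulting cluster id is then treated as a "skill" label for the prioritization stage.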

We use a temperature-controlled softmax to balance informativeness (focusing on high-improvement skills) and diversity (preventing mode collapse).
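A minimal sketch of that informativeness/diversity trade-off, assuming the per-skill relative improvement scores Δ are already computed: a low temperature concentrates sampling on the fastest-improving skills, while a high temperature flattens the distribution toward uniform and prevents mode collapse.

```python
import numpy as np

def skill_sampling_probs(delta, temperature=1.0):
    """Turn per-skill relative improvement scores (Δ) into sampling
    probabilities via a temperature-controlled softmax (sketch).

    Low temperature  -> concentrate on high-improvement skills (informativeness).
    High temperature -> near-uniform over skills (diversity).
    """
    z = np.asarray(delta, dtype=float) / temperature
    z -= z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

delta = [0.30, 0.10, 0.02]                              # hypothetical Δ per skill
sharp = skill_sampling_probs(delta, temperature=0.1)    # heavily favors skill 0
smooth = skill_sampling_probs(delta, temperature=10.0)  # close to uniform
```

At each selection round, skills would be sampled from these probabilities and unlabeled examples drawn from the chosen clusters.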

Main Results

Across 14 diverse vision-language benchmarks, PROGRESS achieves 98.8% of full-data performance on LLaVA-665K while using only 20% of the training data.

    Experimental Results Table

Citation

@article{chandhok2025learning,
  title={Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection},
  author={Chandhok, Shivam and Yang, Qian and Manas, Oscar and Jain, Kanishk and Sigal, Leonid and Agrawal, Aishwarya},
  journal={arXiv preprint arXiv:2506.01085},
  year={2025}
}