Across 14 diverse vision-language benchmarks, PROGRESS achieves 98.8% relative performance on LLaVA-665K using only 20% of the data.
PROGRESS exhibits robust scalability and cross-architecture generalization. It transfers strongly to newer architectures, achieving 100% relative performance on Qwen2-VL-7B and ranking first or second on 9 out of 11 benchmarks. Furthermore, on the Vision-Flan dataset under a stricter 16.7% budget, PROGRESS maintains 99.0% relative performance, significantly outperforming COINCIDE and Random sampling.
PROGRESS is Pareto superior to baselines, achieving higher accuracy in less total computational time. Full-data finetuning requires ~9 hr of training on 100% of the data, whereas our method needs only 5.67 hr of total training time and 20% of the data.
PROGRESS targets a "sweet spot" in the mid-rarity range, prioritizing skills with the highest potential for improvement. By bypassing over-abundant, saturated concepts and those too rare to generalize, the framework operates within the Zone of Proximal Development. This strategy ensures the model learns most effectively by focusing on tasks that are just beyond its current ability—neither too easy nor too difficult.