Turning Internal Gap into Self-Improvement:
Promoting the Generation–Understanding Unification in MLLMs

1The University of Hong Kong,  2Carnegie Mellon University,  3University of Sydney,  4RIKEN AIP,  5Shanghai Artificial Intelligence Laboratory,  6Shanghai Jiao Tong University,  7Hong Kong University of Science and Technology,  8Alibaba Group
Corresponding Authors
[Figure: Illustration of MLLMs' internal gap]

Illustration of MLLMs' internal gap. We examine challenging cases involving implicit physical principles using ChatGPT o3 and Gemini 2.5 Flash, and find that images produced by the generation branch are identified as incorrect by the understanding branch, revealing non-unification.

  • Non-unification is pervasive in unified MLLMs, with understanding typically outperforming generation.
  • Internal gap–based self-improvement, which leverages the stronger understanding branch to guide the generation branch, effectively improves generation and promotes unification in MLLMs.
  • Self-improvement induces co-improvement in generation and understanding because unified MLLMs share the same empirical neural tangent kernel (eNTK), which encourages aligned learning dynamics (see the first-order sketch after this list).
  • One way to capitalize on co-improvement is curriculum learning: progressively stronger generation and understanding revisit samples underutilized by pre-trained MLLMs, expanding post-training data and boosting performance and unification.
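
To make the eNTK claim concrete, here is a minimal first-order sketch in notation introduced for this page (not necessarily the paper's): let $\theta$ be the shared backbone parameters, $f^{\text{gen}}_\theta$ and $f^{\text{und}}_\theta$ the outputs of the generation and understanding branches, and $\eta$ the learning rate. One gradient step on the generation loss $\mathcal{L}^{\text{gen}}$ at a training example $x$ moves the understanding output on any input $x'$ by approximately

$$\Delta f^{\text{und}}_\theta(x') \;\approx\; -\,\eta\, K_\theta(x', x)\, \nabla_{f}\,\mathcal{L}^{\text{gen}}\!\big(f^{\text{gen}}_\theta(x)\big), \qquad K_\theta(x', x) \;=\; \nabla_\theta f^{\text{und}}_\theta(x')\,\nabla_\theta f^{\text{gen}}_\theta(x)^{\top}.$$

Because both branches differentiate through the same shared $\theta$, the cross-branch eNTK $K_\theta$ is generically non-zero, so updates that reduce the generation loss also move the understanding outputs; when the two branches' gradients are aligned, this coupled movement appears as co-improvement.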

Although unified MLLMs aim to unify generation and understanding, they are commonly believed to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm that non-unification is widespread in MLLMs and demonstrate that it indeed stems from weak generation rather than faulty understanding. This finding motivates us to propose a simple yet effective internal gap–based self-improvement framework, which mitigates the internal gap by leveraging the stronger understanding branch to guide the weaker generation branch, without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt-aligned. To explain this effect, we extend learning dynamics theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self-improvement: progressively enhanced understanding and generation revisit samples underutilized by pre-trained MLLMs, dynamically expanding the post-training data and leading to improved performance and unification.

[Figure: Self-improvement pipeline]

Self-improvement. Generations are scored with the stronger understanding branch to construct image data for post-training (e.g., SFT and DPO) of the generation branch.
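
As an illustration of this pipeline, below is a minimal Python sketch of the data-construction step. The wrappers mllm.generate_images and mllm.score_alignment, as well as the acceptance thresholds, are hypothetical placeholders; the actual interfaces and filtering rules depend on the specific unified MLLM and post-training recipe.

def build_post_training_data(mllm, prompts, n_candidates=8):
    """Construct SFT and DPO data by letting understanding judge generation.

    `mllm.generate_images` and `mllm.score_alignment` are hypothetical
    wrappers around a unified MLLM's generation and understanding branches.
    """
    sft_data, dpo_pairs = [], []
    for prompt in prompts:
        # 1) Sample several images from the (weaker) generation branch.
        candidates = mllm.generate_images(prompt, n=n_candidates)

        # 2) Score each image with the (stronger) understanding branch,
        #    e.g. the probability that the image is judged prompt-aligned.
        scored = [(img, mllm.score_alignment(prompt, img)) for img in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)

        best_img, best_score = scored[0]
        worst_img, worst_score = scored[-1]

        # 3) SFT: keep only generations the understanding branch accepts
        #    (0.5 is an illustrative threshold, not the paper's setting).
        if best_score >= 0.5:
            sft_data.append({"prompt": prompt, "image": best_img})

        # 4) DPO: form (chosen, rejected) pairs when the score gap is informative.
        if best_score - worst_score > 0.2:
            dpo_pairs.append(
                {"prompt": prompt, "chosen": best_img, "rejected": worst_img}
            )
    return sft_data, dpo_pairs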

[Figure: Self-improvement results]

Self-improvement enhances generation and unification, with gains of up to 20% and 16%, respectively (unification measured as one minus the non-unification score). Furthermore, the improvements correlate with the internal gap (correlation coefficient $\rho_{\Delta,\text{Non.}}=0.53$): models and subtasks with larger gaps benefit more.

[Figure: Curriculum learning overview]

Curriculum Learning–based Self-Improvement. As generation and understanding improve together, difficult samples that pre-trained MLLMs could not previously utilize (due to weak generation or inaccurate understanding) can be incorporated later, forming an adaptive data expansion process based on prompt complexity.
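
A rough sketch of this loop is given below, reusing the hypothetical build_post_training_data helper from the earlier sketch. The complexity measure and post_train routine are likewise placeholders passed in as callables; the concrete curriculum schedule in the paper may differ.

def curriculum_self_improvement(mllm, all_prompts, complexity, post_train, n_rounds=3):
    """Curriculum-style self-improvement: revisit harder prompts as the model improves."""
    # Order prompts from simple to complex (e.g., by number of objects/relations).
    pool = sorted(all_prompts, key=complexity)
    used = set()

    for _ in range(n_rounds):
        # Revisit every prompt not yet turned into training data; as generation
        # and understanding strengthen, harder prompts start to yield accepted
        # samples, adaptively expanding the post-training set.
        fresh = [p for p in pool if p not in used]
        sft_data, dpo_pairs = build_post_training_data(mllm, fresh)

        used.update(ex["prompt"] for ex in sft_data)
        used.update(pair["prompt"] for pair in dpo_pairs)

        # Post-train the generation branch on the newly collected data.
        mllm = post_train(mllm, sft_data, dpo_pairs)
    return mllm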

BibTeX

@article{han2025turninginternalgapselfimprovement,
  title={Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs},
  author={Yujin Han and Hao Chen and Andi Han and Zhiheng Wang and Xinyu Liu and Yingya Zhang and Shiwei Zhang and Difan Zou},
  year={2025},
  eprint={2507.16663},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.16663}, 
}