Turning Internal Gap into Self-Improvement:
Promoting the Generation–Understanding Unification in MLLMs

1The University of Hong Kong,  2Carnegie Mellon University,  3University of Sydney,  4RIKEN AIP,  5Shanghai Artificial Intelligence Laboratory,  6Shanghai Jiao Tong University,  7Hong Kong University of Science and Technology,  8Alibaba Group
Corresponding Authors
[Figure: Illustration of MLLMs' internal gap]

Illustration of MLLMs' internal gap. We examine challenging cases involving implicit physical principles using ChatGPT o3 and Gemini 2.5 Flash, and find that images produced by the generation branch are identified as incorrect by the understanding branch, revealing non-unification.

  • Non-unification is pervasive in unified MLLMs, with understanding typically outperforming generation.
  • Internal gap–based self-improvement, which leverages the stronger understanding branch to guide the generation branch, effectively improves generation and promotes unification in MLLMs.
  • Self-improvement induces co-improvement in generation and understanding because unified MLLMs share the same empirical neural tangent kernel (eNTK), which encourages aligned learning dynamics (see the first-order sketch after this list).
  • One way to capitalize on co-improvement is curriculum learning: progressively stronger generation and understanding revisit samples underutilized by pre-trained MLLMs, expanding post-training data and boosting performance and unification.
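
To make the eNTK claim concrete, here is a minimal first-order sketch in notation introduced for this page (not necessarily the paper's): let $\theta$ be the shared backbone parameters, $f^{\text{gen}}_\theta$ and $f^{\text{und}}_\theta$ the outputs of the generation and understanding branches, and $\eta$ the learning rate. One gradient step on the generation loss $\mathcal{L}^{\text{gen}}$ at a training example $x$ moves the understanding output on any input $x'$ by approximately

$$\Delta f^{\text{und}}_\theta(x') \;\approx\; -\,\eta\, K_\theta(x', x)\, \nabla_{f}\,\mathcal{L}^{\text{gen}}\!\big(f^{\text{gen}}_\theta(x)\big), \qquad K_\theta(x', x) \;=\; \nabla_\theta f^{\text{und}}_\theta(x')\,\nabla_\theta f^{\text{gen}}_\theta(x)^{\top}.$$

Because both branches differentiate through the same shared $\theta$, the cross-branch eNTK $K_\theta$ is generically non-zero, so updates that reduce the generation loss also move the understanding outputs; when the two branches' gradients are aligned, this coupled movement appears as co-improvement.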

Although unified MLLMs aim to unify generation and understanding, they are commonly believed to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm that non-unification is widespread in MLLMs and demonstrate that it indeed stems from weak generation rather than faulty understanding. This finding motivates us to propose a simple yet effective internal gap–based self-improvement framework, which mitigates the internal gap by leveraging the stronger understanding branch to guide the weaker generation branch, without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt-aligned. To explain this effect, we extend learning dynamics theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self-improvement: progressively enhanced understanding and generation revisit samples underutilized by pre-trained MLLMs, dynamically expanding the post-training data and leading to improved performance and unification.

[Figure: Self-improvement pipeline]

Self-improvement. Generations are scored with the stronger understanding branch to construct image data for post-training (e.g., SFT and DPO) of the generation branch.
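
As an illustration of this pipeline, below is a minimal Python sketch of the data-construction step. The wrappers mllm.generate_images and mllm.score_alignment, as well as the acceptance thresholds, are hypothetical placeholders; the actual interfaces and filtering rules depend on the specific unified MLLM and post-training recipe.

def build_post_training_data(mllm, prompts, n_candidates=8):
    """Construct SFT and DPO data by letting understanding judge generation.

    `mllm.generate_images` and `mllm.score_alignment` are hypothetical
    wrappers around a unified MLLM's generation and understanding branches.
    """
    sft_data, dpo_pairs = [], []
    for prompt in prompts:
        # 1) Sample several images from the (weaker) generation branch.
        candidates = mllm.generate_images(prompt, n=n_candidates)

        # 2) Score each image with the (stronger) understanding branch,
        #    e.g. the probability that the image is judged prompt-aligned.
        scored = [(img, mllm.score_alignment(prompt, img)) for img in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)

        best_img, best_score = scored[0]
        worst_img, worst_score = scored[-1]

        # 3) SFT: keep only generations the understanding branch accepts
        #    (0.5 is an illustrative threshold, not the paper's setting).
        if best_score >= 0.5:
            sft_data.append({"prompt": prompt, "image": best_img})

        # 4) DPO: form (chosen, rejected) pairs when the score gap is informative.
        if best_score - worst_score > 0.2:
            dpo_pairs.append(
                {"prompt": prompt, "chosen": best_img, "rejected": worst_img}
            )
    return sft_data, dpo_pairs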

[Figure: Self-improvement results]

Self-improvement enhances generation and unification, with gains of up to 20% and 16%, respectively (unification measured as one minus the non-unification score). Furthermore, the improvements correlate with the internal gap (correlation coefficient $\rho_{\Delta,\text{Non.}}=0.53$): models and subtasks with larger gaps benefit more.

[Figure: Curriculum learning overview]

Curriculum Learning–based Self-Improvement. As generation and understanding improve together, difficult samples that pre-trained MLLMs could not previously utilize (due to weak generation or inaccurate understanding) can be incorporated later, forming an adaptive data expansion process based on prompt complexity.
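
A rough sketch of this loop is given below, reusing the hypothetical build_post_training_data helper from the earlier sketch. The complexity measure and post_train routine are likewise placeholders passed in as callables; the concrete curriculum schedule in the paper may differ.

def curriculum_self_improvement(mllm, all_prompts, complexity, post_train, n_rounds=3):
    """Curriculum-style self-improvement: revisit harder prompts as the model improves."""
    # Order prompts from simple to complex (e.g., by number of objects/relations).
    pool = sorted(all_prompts, key=complexity)
    used = set()

    for _ in range(n_rounds):
        # Revisit every prompt not yet turned into training data; as generation
        # and understanding strengthen, harder prompts start to yield accepted
        # samples, adaptively expanding the post-training set.
        fresh = [p for p in pool if p not in used]
        sft_data, dpo_pairs = build_post_training_data(mllm, fresh)

        used.update(ex["prompt"] for ex in sft_data)
        used.update(pair["prompt"] for pair in dpo_pairs)

        # Post-train the generation branch on the newly collected data.
        mllm = post_train(mllm, sft_data, dpo_pairs)
    return mllm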

BibTeX

@article{han2025turninginternalgapselfimprovement,
  title={Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs},
  author={Yujin Han and Hao Chen and Andi Han and Zhiheng Wang and Xinyu Liu and Yingya Zhang and Shiwei Zhang and Difan Zou},
  year={2025},
  eprint={2507.16663},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.16663}, 
}