We propose DivRL, a post-training framework that jointly optimizes identity consistency and structural diversity by leveraging disentangled visual features from a robust similarity model. Specifically, we introduce a Negative Self-Similarity Measure (nSSM) to quantify structural diversity and use Visual Semantic Matching (VSM) to evaluate identity consistency. We propose an "Explore-and-Suppress" strategy that treats VSM as a gated constraint: the model freely explores structurally diverse configurations, and only samples that violate the identity threshold are penalized via a quadratic hinge loss. This converts identity preservation from a competing objective into a feasibility constraint, allowing nSSM and VSM to improve jointly. Experiments show that this gated formulation improves structural diversity while maintaining comparable identity consistency.
Method overview. We quantify structural diversity with the Negative Self-Similarity Measure (nSSM) and evaluate identity consistency with Visual Semantic Matching (VSM). Our two-stage optimization uses a quadratic hinge loss to penalize identity drift, steering the model toward regions of the latent space where structural novelty and identity preservation coexist. In the exploration stage, the model freely explores structurally diverse images; in the suppression stage, generations with high identity drift are penalized to suppress undesirable exploration.
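To make the gating concrete, here is a minimal sketch of how a per-sample reward with a quadratic hinge penalty could look. It is an illustration under stated assumptions, not the DivRL implementation: the function name `gated_reward`, the threshold `tau`, the penalty weight `lam`, and the choice to subtract the penalty from the nSSM score are all hypothetical, and the nSSM and VSM scores are assumed to be precomputed scalars where higher is better.

```python
def gated_reward(nssm: float, vsm: float, tau: float = 0.8, lam: float = 10.0) -> float:
    """Hypothetical Explore-and-Suppress style reward for one generated sample.

    nssm: structural diversity score (assumed precomputed, higher = more diverse).
    vsm:  identity consistency score (assumed precomputed, higher = more consistent).
    tau:  identity threshold (assumed value, not taken from the paper).
    lam:  weight of the hinge penalty (assumed value, not taken from the paper).
    """
    # Quadratic hinge: zero when the identity constraint holds (vsm >= tau),
    # and growing quadratically with the size of the violation otherwise.
    violation = max(0.0, tau - vsm)
    penalty = lam * violation ** 2
    # Feasible samples are rewarded purely for diversity; infeasible ones are suppressed.
    return nssm - penalty


if __name__ == "__main__":
    # Sample that satisfies the identity threshold: reward equals its diversity score.
    print(gated_reward(nssm=0.5, vsm=0.9))
    # Sample that drifts below the threshold: the same diversity score is penalized.
    print(gated_reward(nssm=0.5, vsm=0.7))
```

Because the penalty is exactly zero above the threshold, identity consistency acts as a feasibility constraint rather than a term that always trades off against diversity, which is the behavior the description above attributes to the gated formulation.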
Results highlighted in red are ours.
@article{wang2026divrl,
title = {DivRL: Disentangled Self-Similarity Rewards
for Diverse Subject-Driven Generation},
author = {Wang, Qian and Li, Zhenyu and Eldesokey, Abdelrahman and Wonka, Peter},
journal = {arXiv preprint arXiv:2606.23950},
year = {2026}
}