POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

Lanyun Zhu¹, Tianrun Chen³, Qianxiong Xu⁴, Xuanyi Liu⁵, Deyi Ji², Haiyang Wu², De Wen Soh¹, Jun Liu⁶

¹Singapore University of Technology and Design, ²Tencent, ³Zhejiang University, ⁴Nanyang Technological University, ⁵Peking University, ⁶Lancaster University
CVPR 2025


Abstract

Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods.
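The preference-based ensemble described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attention mechanism, so the sketch assumes the simplest reading, in which each candidate output has a scalar preference score, the scores are turned into weights with a softmax, and the candidate mask probabilities are averaged under those weights. The names `ensemble_masks`, `preference_scores`, `temperature`, and `threshold` are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_masks(candidate_probs, preference_scores,
                   temperature=1.0, threshold=0.5):
    """Fuse several candidate masks using preference-score weights.

    candidate_probs: list of masks, each a list of rows of foreground
        probabilities in [0, 1]; all masks share the same shape.
    preference_scores: one scalar per candidate (assumed to be given;
        how POPEN computes them is not stated in the abstract).
    Returns (weights, fused_probabilities, binary_mask).
    """
    weights = softmax(preference_scores, temperature)
    height = len(candidate_probs[0])
    width = len(candidate_probs[0][0])
    # Weighted average of the candidates at every pixel.
    fused = [
        [
            sum(w * cand[r][c] for w, cand in zip(weights, candidate_probs))
            for c in range(width)
        ]
        for r in range(height)
    ]
    binary = [[1 if p >= threshold else 0 for p in row] for row in fused]
    return weights, fused, binary


if __name__ == "__main__":
    # Two 1x2 candidate masks that disagree; the first is preferred.
    masks = [[[0.9, 0.2]], [[0.1, 0.8]]]
    weights, fused, binary = ensemble_masks(masks, [2.0, 0.0])
    print(weights, fused, binary)
```

Under this assumption, a candidate with a higher preference score dominates the fused mask, while lower-scored candidates still contribute in proportion to their weight; lowering `temperature` moves the behavior toward simply picking the top-scored candidate.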

Results

In these examples, PixelLM suffers from serious hallucinations, generating objects in its text responses that do not exist within the images, such as the "grand piano" in the left example and "candle" in the right example. Furthermore, the segmentation accuracy is suboptimal, with coarse details for the segmentation of "table" and "chair" in the right example (failing to segment the table's left leg). By employing the proposed preference-based optimization and ensemble methods, our POPEN achieves significantly improved results, effectively mitigating hallucination in text responses and enhancing segmentation accuracy.
