In these examples, PixelLM suffers from serious hallucinations: its text responses mention objects that do not appear in the images, such as the ``grand piano'' in the left example and the ``candle'' in the right example. Its segmentation is also imprecise, with coarse masks for ``table'' and ``chair'' in the right example (the table's left leg is not segmented). By employing the proposed preference-based optimization and ensemble methods, our POPEN produces markedly better results, mitigating hallucination in the text responses and improving segmentation accuracy.
@inproceedings{zhu2025popen,
title={{POPEN}: Preference-based Optimization and Ensemble for {LVLM}-based Reasoning Segmentation},
author={Zhu, Lanyun and Chen, Tianrun and Xu, Qianxiong and Liu, Xuanyi and Ji, Deyi and Wu, Haiyang and Soh, De Wen and Liu, Jun},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={30231--30240},
year={2025}
}