In these examples, PixelLM hallucinates severely, describing objects in its text responses that do not exist in the images, such as the ``grand piano'' in the left example and the ``candle'' in the right example. Its segmentation is also inaccurate: the masks for ``table'' and ``chair'' in the right example are coarse, and the table's left leg is not segmented at all. With the proposed preference-based optimization and ensemble methods, our POPEN achieves clearly improved results, mitigating hallucination in the text responses and enhancing segmentation accuracy.