To address the degradation of visual-language (VL) representations during supervised fine-tuning (SFT) of vision-language-action (VLA) models, we introduce Visual Representation Alignment. During SFT, we pull a VLA’s visual tokens ...