Abstract: Vision-Language Pretraining (VLP) has produced a series of foundation models that continuously advance the state of the art on various multimodal tasks. However, there has been ...