Abstract: Vision-Language Pretraining (VLP) has produced a series of foundation models that continuously advance the state of the art on various multimodal tasks. However, there has been ...