Training Data
CLIP is trained on the WebImageText (WIT) dataset, which consists of 400 million pairs of images and their corresponding natural language captions. It should not be confused with the Wikipedia-based Image Text dataset, which shares the WIT abbreviation.