WIT by Google AI
A Wikipedia-Based Image Text Dataset For Multimodal Multilingual Machine Learning
About WIT by Google AI
The WIT (Wikipedia-based Image Text) dataset is a large collection of more than 37 million image-text pairs spanning more than 100 languages. It was developed to help machine learning models learn the relationship between images and the text that describes them.
Motivation
Research on multimodal visio-linguistic models depends on large datasets. With WIT, Google AI aims to provide an expansive dataset that extends beyond English and has the potential to enable breakthroughs in multilingual understanding through images.
WIT was therefore designed as a high-quality dataset with rigorous filtering applied. It contains 37.6 million image-text sets across 108 languages, with more than 12K examples for each language; 53 of those languages have more than 100K image-text pairs.