Multimodal Large Language Models & Apple’s MM1 | by Matthew Gunton | Apr, 2024
For the Image Encoder, they varied between CLIP and AIM models, Image resolution size, and the dataset the models were trained on. The below chart shows you the results for each ablation. Table 1 from the paper Let’s go through the major pieces above and explain what they are. CLIP stands for Contrastive Language Image…