Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
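To make the upcycling idea concrete, the sketch below shows one common way a dense transformer FFN can be converted into a sparse MoE block: each expert is initialized as a copy of the pre-trained dense FFN, and a freshly initialized router is trained to dispatch tokens. This is a minimal illustration, not the paper's implementation; the class `SparseMoEFFN`, the helper `upcycle_ffn_layers`, the `block.mlp` attribute layout, and the top-1 routing choice are all assumptions for the example, and the paper's specific routing configuration and auxiliary losses are not shown.

```python
import copy
import torch
import torch.nn as nn

class SparseMoEFFN(nn.Module):
    """Illustrative top-1 MoE feed-forward block (hypothetical names)."""

    def __init__(self, dense_ffn: nn.Module, num_experts: int, d_model: int):
        super().__init__()
        # Upcycling step: every expert starts from an exact copy of the
        # pre-trained dense FFN weights instead of random initialization.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_ffn) for _ in range(num_experts)
        )
        # The router is new and trained from scratch.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-1 expert
        # and scale the expert output by the router's gate value.
        gates = self.router(x).softmax(dim=-1)   # (tokens, num_experts)
        weight, idx = gates.max(dim=-1)          # top-1 gate per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

def upcycle_ffn_layers(model: nn.Module, num_experts: int = 4, d_model: int = 768):
    """Replace each dense FFN in `model` with an upcycled MoE block.

    Assumes the transformer exposes its blocks as `model.blocks` and each
    block's FFN as `block.mlp` (a hypothetical layout for illustration).
    """
    for block in model.blocks:
        block.mlp = SparseMoEFFN(block.mlp, num_experts, d_model)
    return model
```

Because only the router is new, such an upcycled model starts close to the dense checkpoint rather than from scratch, which is the intuition behind the reduced training cost the abstract reports; in practice an auxiliary load-balancing loss is typically added so that tokens do not collapse onto a single expert.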