They say:
> When training the encoder, we sample from the CLIP and DALL-E datasets (approximately 650M images in total) with equal probability. When training the decoder, upsamplers, and prior, we use only the DALL-E dataset (approximately 250M images).
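
One way to read "with equal probability" is that each training draw first picks one of the two datasets with probability 0.5, then picks an image uniformly within it. The quote does not confirm this reading (it could also mean uniform sampling over the pooled images), so the sketch below is only an illustration under that assumption. The CLIP dataset size of roughly 400M is inferred as 650M minus 250M rather than stated in the quote, and all function names are hypothetical.

```python
import random

# Approximate sizes. The CLIP figure is an inference (650M total - 250M DALL-E),
# not a number given in the quoted passage.
DATASET_SIZES = {"clip": 400_000_000, "dalle": 250_000_000}


def sample_encoder_example(rng):
    """Assumed reading: pick a dataset with probability 0.5, then a uniform index in it."""
    name = rng.choice(sorted(DATASET_SIZES))
    index = rng.randrange(DATASET_SIZES[name])
    return name, index


def sample_decoder_example(rng):
    """Decoder, upsamplers, and prior draw from the DALL-E dataset only."""
    return "dalle", rng.randrange(DATASET_SIZES["dalle"])


def per_image_probability(name):
    """Chance that a single encoder draw lands on one specific image of `name`."""
    return 0.5 / DATASET_SIZES[name]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_encoder_example(rng))
    print(sample_decoder_example(rng))
    print(per_image_probability("dalle") / per_image_probability("clip"))
```

Under this reading, the smaller DALL-E dataset is oversampled per image: each DALL-E image is about 400/250 = 1.6 times as likely to be drawn as each CLIP image when training the encoder.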