Hierarchical text-conditional image generation with CLIP latents