Since clouds are amorphous, it seems there would be problems feeding training data to the model. Could one simply use the entire training image by setting the cloud's bounding box to the full image bounds?
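To make the question concrete, here is a minimal sketch of what a full-image box looks like next to a box fitted to a cloud mask. The function names, the inclusive `(x_min, y_min, x_max, y_max)` pixel convention, and the toy mask are all assumptions for illustration; they do not come from any particular annotation tool.

```python
def full_image_bbox(width, height):
    """Box covering the whole image, as inclusive pixel coordinates
    (x_min, y_min, x_max, y_max). The convention is an assumption."""
    return (0, 0, width - 1, height - 1)


def mask_to_bbox(mask):
    """Tightest box around the nonzero pixels of a 2-D list-of-lists mask.
    Returns None when the mask contains no cloud pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), min(rows), max(cols), max(rows))


def box_fill_ratio(mask, bbox):
    """Fraction of the pixels inside bbox that are actually cloud."""
    x0, y0, x1, y1 = bbox
    area = (x1 - x0 + 1) * (y1 - y0 + 1)
    cloud = sum(
        1
        for y in range(y0, y1 + 1)
        for x in range(x0, x1 + 1)
        if mask[y][x]
    )
    return cloud / area


# Hypothetical 5x4 mask: 1 = cloud, 0 = sky.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

whole = full_image_bbox(width=5, height=4)
tight = mask_to_bbox(toy_mask)
print(whole, box_fill_ratio(toy_mask, whole))
print(tight, box_fill_ratio(toy_mask, tight))
```

In this toy case the full-image box is mostly sky, which is the concern with the approach: the box label says "cloud" for every pixel inside it, so a per-pixel mask may carry more useful signal than any box.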
I'm exploring new models, having already tried a Fully Convolutional DenseNet with semi-satisfactory results (but a very large GPU memory footprint).
Also, while I don’t know much about your use case, using DenseNet seems like it might be overkill since you only have two classes, cloud and sky. A lighter network might give you better results, especially if you don’t have a lot of training data.
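As a rough way to put a number on "lighter", one can count the convolution weights of a candidate network before training it. The sketch below uses the standard count for a 2-D convolution (kernel height x kernel width x input channels x output channels, plus one bias per output channel). The layer list is a made-up small fully convolutional network ending in a two-channel (cloud/sky) output; it is not a recommended architecture.

```python
def conv2d_params(in_ch, out_ch, kernel=3, bias=True):
    """Learnable parameters in one square-kernel 2-D convolution."""
    weights = kernel * kernel * in_ch * out_ch
    return weights + (out_ch if bias else 0)


def count_params(layers):
    """Total parameters for a list of (in_ch, out_ch, kernel) conv layers."""
    return sum(conv2d_params(i, o, k) for i, o, k in layers)


# Hypothetical small network: RGB input, a few 3x3 convs,
# then a 1x1 conv producing 2 class scores per pixel (cloud, sky).
SMALL_FCN = [
    (3, 16, 3),
    (16, 32, 3),
    (32, 32, 3),
    (32, 16, 3),
    (16, 2, 1),
]

for in_ch, out_ch, kernel in SMALL_FCN:
    print(in_ch, out_ch, kernel, conv2d_params(in_ch, out_ch, kernel))
print("total:", count_params(SMALL_FCN))
```

Parameter count is only a proxy. GPU memory during training is often dominated by the stored activations, which scale with image size and channel width, so it is worth measuring actual memory use on your own images as well.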