This is more relevant to search tasks, where text and image pairs are encoded so that text can be used to search for images. ResNet can also serve as a backbone for search tasks such as content-based image search (also called reverse image search, or searching images with an image). For that, you need to remove the ResNet50 classification head and use the output of the remaining layers as the image's feature vector.
TensorFlow and ML.NET, on the other hand, are machine learning frameworks; you can use whichever you prefer to build the model components for this task.
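
As a rough illustration of the image-to-image search idea, here is a minimal sketch assuming TensorFlow's Keras API is the framework you pick. `build_feature_extractor`, `cosine_similarity`, and `search` are hypothetical helper names chosen for this example, and ranking by cosine similarity is one common choice rather than the only option.

```python
import math


def build_feature_extractor():
    """Return ResNet50 without its classification head (assumes TensorFlow is installed)."""
    # Lazy import: the search helpers below do not need TensorFlow.
    from tensorflow.keras.applications import ResNet50

    # include_top=False drops the final dense classifier;
    # pooling="avg" reduces the last feature map to one vector per image.
    return ResNet50(weights="imagenet", include_top=False, pooling="avg")


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=3):
    """Rank indexed images by similarity to the query embedding.

    index maps an image name to its embedding; returns (name, score) pairs, best first.
    """
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In practice you would run every catalogue image through the extractor once (after resizing and applying `preprocess_input` from `tensorflow.keras.applications.resnet50`), store the resulting vectors in `index`, and then embed the query image the same way before calling `search`. For large collections, an approximate nearest-neighbour index would likely replace the linear scan shown here.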