Great job. I recently had a similar idea while reading the sci-fi series The Expanse. I read it in English, and since it's not my first language, I'm often confused by descriptions of spaceships, planets, structures, etc., and have trouble visualising them.
My idea was to use RAG to extract all relevant descriptions of a given object from the book and compile them into a single detailed description, which would then be fed to a text-to-image model.
Have you considered something similar? It would be harder to implement, but the results could be more precise and it would be possible to cover books GPT-4 is not familiar with.
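
To make the idea a bit more concrete, here is a rough sketch of what I have in mind, not a tested pipeline. Everything in it is my own assumption: the function names are made up, plain term counting stands in for the embedding search a real RAG setup would use, and the passages are simply concatenated where an LLM would normally condense them into one description. The final call to a text-to-image model is left out.

```python
import re
from collections import Counter


def split_passages(text: str) -> list[str]:
    """Split a book into paragraph-sized passages on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def retrieve(passages: list[str], query: str, k: int = 3) -> list[str]:
    """Return up to k passages that mention the query terms most often.

    Assumption: term counting is a stand-in for embedding similarity.
    """
    terms = set(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        counts = Counter(tokenize(passage))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((score, idx, passage))
    # Highest score first; ties keep the order they have in the book.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:k]]


def build_image_prompt(obj: str, passages: list[str]) -> str:
    """Compile the retrieved passages into one text-to-image prompt.

    Assumption: a real pipeline would summarise with an LLM here.
    """
    details = " ".join(passages)
    return f"Detailed illustration of {obj}. Source descriptions: {details}"


if __name__ == "__main__":
    # Invented sample text, not a quote from any book.
    book = (
        "The Aurora was a long grey ship with four engines.\n\n"
        "Dinner was served late.\n\n"
        "Scars ran along the hull of the Aurora, the Aurora had seen battle."
    )
    hits = retrieve(split_passages(book), "Aurora", k=2)
    print(build_image_prompt("the Aurora", hits))
```

Because retrieval runs over the book text itself, the same approach should work for books the language model has never seen.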