I have a potential project for my e-commerce site where I want to allow users to upload images of their house exteriors and inpaint awnings.
I have an example of interior decorating inpainting where I replaced a large floor-to-ceiling window with a mirror, and the result was pretty impressive using NB Pro from nearly a year ago.
Locally hostable? For my money I'd argue Flux.2 Klein but Qwen-Edit still puts in the work.
I do agree, however, that the Flux2 family is the SoTA at the moment. Running locally via something like Comfy gets incredible results.
If you want real precision (especially for complex polygonal masks), or if you’re concerned about image degradation over multiple edit rounds, you'll slam against the limitations of those approaches.
Even with SOTA proprietary models, repeatedly editing and re-uploading an image is like making a copy of a copy of a VHS tape: you're gonna see subtle color shifts and quality loss steadily accumulate.
At that point, you either need to put in the manual work in something like Photoshop (bringing elements in as layers and masking them properly) or, as you mentioned, use a model or workflow that properly supports masking.
So you're saying that, if I can calculate from the picture the position (height, inclination and such), and I can render the model (should be doable) for that height and angle, my best course of action could be to combine original + render and only at the end use a visual model? That could be interesting.
This idea rests on the assumption that my understanding of what "awnings" are is correct and matches your project, i.e. additive structures. In that case, your primary problem is adding pixels on top of the user image. Additive modifications are easy to pull off. Inpainting seems like overkill here; it shines when you need to poke holes in the original or replace aspects of it that are not covered by the part you're adding.
OTOH, inpainting might still be your best bet for operational reasons: the additive modification itself may not be a problem, but fixing lighting and shadows might be, and current image generation models should take this in stride.
(I say "should" because that's my expectation, but I've never tested any of the current models for their ability to fix shadows that cover areas similar to the targeted modification but lie beyond it. You might still need a model and a transform estimate just to generate a shadow map as a hint telling the model where it needs to act and how.)
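A minimal sketch of the purely additive step described in this comment: a rendered awning is blended over the user's photo with a per-pixel alpha, and no generative model is involved. The names `alpha_over` and `composite`, and the representation of images as nested lists of RGB tuples, are assumptions made for illustration; a real pipeline would more likely use an imaging library.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (r, g, b), each 0..255
Image = List[List[Pixel]]      # rows of pixels
AlphaMap = List[List[float]]   # per-pixel coverage of the render, 0.0..1.0


def alpha_over(fg: Pixel, alpha: float, bg: Pixel) -> Pixel:
    """Standard 'over' blend for one pixel: alpha*fg + (1 - alpha)*bg."""
    return tuple(round(alpha * f + (1.0 - alpha) * b) for f, b in zip(fg, bg))


def composite(photo: Image, render: Image, alpha_map: AlphaMap) -> Image:
    """Blend a rendered awning over the photo.

    Where alpha is 0 the photo pixel is returned unchanged, so the
    untouched parts of the user's image are never resampled or regenerated.
    """
    if not (len(photo) == len(render) == len(alpha_map)):
        raise ValueError("photo, render and alpha_map must have the same height")
    out = []
    for photo_row, render_row, alpha_row in zip(photo, render, alpha_map):
        if not (len(photo_row) == len(render_row) == len(alpha_row)):
            raise ValueError("photo, render and alpha_map must have the same width")
        out.append([
            alpha_over(r, a, p)
            for p, r, a in zip(photo_row, render_row, alpha_row)
        ])
    return out


# Hypothetical 1x3 image: wall pixels, with a red awning covering
# nothing, half, and all of each pixel respectively.
photo = [[(100, 100, 100), (100, 100, 100), (100, 100, 100)]]
render = [[(200, 0, 0), (200, 0, 0), (200, 0, 0)]]
alpha_map = [[0.0, 0.5, 1.0]]
result = composite(photo, render, alpha_map)
```

A generative pass, if one is used at all, would then only need to handle lighting and shadows on top of this composite.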
Instead, you're supposed to upload it to the cloud and ask a big, multimodal frontier model to maybe please do the thing you want and nothing else.
(I'm counting only times I used generative editing options in my Galaxy phone - if I were to take your question literally, it would be "at least once every other day", simply due to rotating and cropping.)
Is it just me or is it weird seeing these clickbaity AI-generated taglines in an otherwise scientific work?
Apart from this, the text details amazing work. Congrats.
I think it is safe to say this is pretty far from a "scientific" work.
The weirdest thing was when the inpainting tool added strange people to an image. This singer was all decked out in tinsel and red, and the inpainting model added a grumpy old man in a top hat. I don't recall clicking the "Add creepy old man" button.
At the time this was Stable Diffusion on the backend, run by a variety of model hosting services, Amazon being one. They all had different requirements for the input image and that made things really complex. For some the aspect ratio was impossible to meet, and it would fail if the banner was 200x60. For others, you had to resize it before input, which meant you were adding an image with poor resolution to start. Garbage in, garbage out.
All of this to say, there is a lot of preproduction that went into it, and the client never ended up using my attempts.
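For small inputs like the 200x60 banner, one alternative to resizing is to pad the image onto a larger canvas so the original pixels are never resampled. The sketch below only computes the canvas size and paste offset. The constraints (`min_side=512`, dimensions a multiple of 64) are hypothetical stand-ins, not the documented requirements of any particular hosting service.

```python
import math
from typing import Tuple


def pad_to_constraints(
    width: int,
    height: int,
    min_side: int = 512,   # hypothetical minimum accepted dimension
    multiple: int = 64,    # hypothetical "dimensions must be a multiple of" rule
) -> Tuple[int, int, int, int]:
    """Return (canvas_w, canvas_h, offset_x, offset_y).

    The source image is meant to be pasted unscaled at (offset_x, offset_y)
    on a canvas of canvas_w x canvas_h, then cropped back out of the result.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Grow each side to at least min_side, then round up to the multiple.
    canvas_w = math.ceil(max(width, min_side) / multiple) * multiple
    canvas_h = math.ceil(max(height, min_side) / multiple) * multiple
    # Center the original on the canvas.
    offset_x = (canvas_w - width) // 2
    offset_y = (canvas_h - height) // 2
    return canvas_w, canvas_h, offset_x, offset_y


banner_plan = pad_to_constraints(200, 60)
photo_plan = pad_to_constraints(1000, 700)
```

Whether a given model produces anything sensible on a mostly empty canvas is a separate question; this only sidesteps the resampling loss.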
That's because small models like SD (Stable Diffusion) are trained on very specific resolutions. It's the fancier models that are trained on higher-quality data or on more diverse sets of resolutions. If you use a higher-quality model to generate lower-resolution images, what's actually happening is that you're trimming a much bigger image and getting a chunk of it as output. At least that's how it feels based on my many hours of experimenting. If I use major models and try to center a thing, I never see it in the center. :) My GPU can only handle so much.
The general idea was: you mask the area you want changed, and the model inpaints that region at full resolution. The advantage of masking, compared to plain img2img, is that you’re not sending the entire picture to the model.
With the classic setups like SD 1.5 and SDXL, you’d effectively inpaint at full resolution: take the masked area from a larger image, scale just that region up to the model’s native resolution, process it at the full ~1 megapixel, then scale it back down and composite it into the original. This lets you add MORE detail.
Unfortunately, if the OP is using hosted SD models, they might not have that granular control and thus would suffer pretty bad quality loss.
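The geometry of that crop-and-upscale workflow can be sketched as follows. This only plans the crop and the scale factor; the actual model call and the final composite are left out. The function name `plan_region_inpaint`, the `padding` parameter, and the default `native=1024` (chosen to match the ~1 megapixel figure above) are assumptions for illustration, not the API of any particular tool.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def plan_region_inpaint(
    mask_bbox: Box,
    image_size: Tuple[int, int],
    native: int = 1024,   # assumed native side length of the model
    padding: int = 32,    # extra context around the mask, in source pixels
) -> Tuple[Box, float, Tuple[int, int]]:
    """Return (crop_box, scale, model_size) for inpainting only the masked region.

    crop_box   -- region to cut out of the full image (mask bbox plus padding)
    scale      -- factor to resize the crop by before sending it to the model
    model_size -- crop size after scaling; its longer side equals `native`
    """
    left, top, right, bottom = mask_bbox
    img_w, img_h = image_size
    # Add context around the mask, clamped to the image bounds.
    left = max(0, left - padding)
    top = max(0, top - padding)
    right = min(img_w, right + padding)
    bottom = min(img_h, bottom + padding)
    crop_w, crop_h = right - left, bottom - top
    if crop_w <= 0 or crop_h <= 0:
        raise ValueError("mask bounding box is empty or outside the image")
    # Scale so the longer side of the crop fills the model's native size.
    scale = native / max(crop_w, crop_h)
    model_size = (round(crop_w * scale), round(crop_h * scale))
    return (left, top, right, bottom), scale, model_size


# A 200x200 masked area inside a 4000x3000 photo.
crop_box, scale, model_size = plan_region_inpaint(
    (1000, 500, 1200, 700), (4000, 3000), padding=28
)
```

After the model returns the edited crop, it would be resized by `1 / scale` and pasted back at `crop_box`, so the rest of the photo never passes through the model.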
Edit: I think I found it https://huggingface.co/hustvl/Moebius
https://characterdesignreferences.com/artist-of-the-week-3/m...
(Claude Code transcript: https://gisthost.github.io/?58039ba5c1ca3ed177e8659168996ee4)
Wrote this up in more detail on my blog: https://simonwillison.net/2026/Jun/22/porting-moebius/
2) If these are reasonable, a WebGPU demo would be great.
Barely useful enough to erase things in thumbnails.
Also, what's going on behind the in-painted corner of the house? We'd need to see higher resolution pictures, but I'm not convinced that it too shouldn't get a flag. Likewise with the beach just behind the surfboard. Not terrible, but what gets flagged in the competitors is similar.