Show HN: AI assisted image editing with audio instructions

(github.com)

95 points · ShaShekhar · 1y ago · 30 comments

Excited to launch AAIELA, an AI-powered tool that understands your spoken commands and edits images accordingly. By leveraging open-source AI models for computer vision, speech-to-text, large language models (LLMs), and text-to-image inpainting, we have created a seamless editing experience that bridges the gap between spoken language and visual transformation.

Imagine the possibilities if Google Photos integrated voice-assisted editing like AAIELA! Alongside Magic Eraser and other AI tools, editing with audio instructions could revolutionize how we interact with our photos.

29 comments · 13 top-level

throwaway4aday · 1y ago · 6 in thread

Love it! Voice interaction is a great modality for UI. A lot of people have a bad taste left over from early attempts, but I expect to see a lot of progress made now that STT and natural language understanding are so much better.

The biggest reason we should be adding conversational UI to everything is the harm done by RSI and sedentary keyboard and mouse interfaces. We're crippling entire generations of people by sticking to outdated hardware. The good news is we can break free of this now that we have huge improvements in LLMs and AR hardware. We'll be back to healthy levels of activity in 5 to 10 years. Sorry Keeb builders, it's time to join the stamp collectors and typewriter enthusiasts. We'll be working in the park today.

prawn · 1y ago

I'd like to see a voice instruction layer that can work independently of the mouse/keyboard layer without stealing focus. Things like moving files or preparing windows/positioning prior to switching.

mistermann · 1y ago

One big problem would be that in open office environments there would be a lot of noise. I wonder if some sort of active noise cancellation could be introduced so the voices of your co-workers could be ~completely canceled out if you are wearing special headphones?

throwaway4aday · 1y ago

When I consider my own LLM workflow, the amount of time reading/listening/thinking outweighs the amount of time spent typing/speaking. If that's any indication of how a fully fledged conversational workflow would work, then I think open plan offices wouldn't be a lot louder than they currently are. Depending on how quickly agentic LLMs are developed, I'm not even sure we will be using offices the same way we are now. We might only need to meet or check in with our coworkers and our LLM agents every few hours, once a day, or at even longer intervals in order to realign and check on results. Maybe we'll get occasional messages asking us to confirm something or provide clarification. I could honestly see most knowledge work evaporating and leaving behind only high-level coordination, research and ideation.

Before that, I'm certain we'll all be spending a lot more time reviewing work, trying out prototypes and tweaking prompts or specifications than we do typing or talking.

xyproto · 1y ago

Have you tried sitting in a park for hours, talking out loud and seeing what happens?

N0b8ez · 1y ago

Isn't that just like taking a phone call? I'm not sure what you're trying to imply.

throwaway4aday · 1y ago

Ignoring the snark: this will change as technology is adopted. Go back 40 years (or even less) and a person walking around staring at a little black rectangle would have been perceived as weird and anti-social. We used to make fun of people talking on the phone via Bluetooth headsets and now everyone does it with AirPods or whatever.

If you've got the technology to enable you to seamlessly transition from working in your home to working while sitting outside at a cafe to working while sitting on a blanket under a tree in the park to working wherever you feel like it then there will be enough brave people that say "fuck what other people think" and just do it so they can enjoy being active and getting fresh air and eventually more and more people will join them. Eventually we'll reach the point where sitting inside at a desk for 8-12 hours will be the weird thing.

sgbeal · 1y ago · 4 in thread

Wow! We're now just a hair's-width away from finally being able to say, "Computer, enhance image!" without sounding like we're in a bad sci-fi show.

sargstuff · 1y ago

I think the only thing the historical science fiction / Blade Runner photo inspection scene[0] didn't foresee was vocally having AI assist with/analyze a photo to summarize the list of items/objects available to zoom/view (vs. pan/zoom around). Although altavista glasses / hand gestures[3] would have been a future concept at the time, too.

----

[0] : https://scifiinterfaces.com/2020/04/29/deckards-photo-inspec...

[1] 'mirror reality' image / TERI[2] : https://www.hackster.io/news/blade-runner-s-image-enhancemen...

[2] : TERI, almost IRL Blade Runner movie image enhancement tool : https://news.ycombinator.com/edit?id=40844595 / https://github.com/iscilab2020/TERI-3DNLOS/tree/TERI

[3] : Gest : https://news.ycombinator.com/edit?id=40844704

throwaway4aday · 1y ago

Using Whisper as the voice interface, an LLM to understand the prompt and issue function call commands and an image upscaler you could build this in a weekend. Would it be useful? Not especially by itself but I think there is a lot of promise in voice interaction with LLM operated software.
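A minimal sketch of the function-call step in the pipeline described above, assuming the speech-to-text stage has already produced a transcript and the LLM has already turned it into a JSON function call. The tool names (`upscale`, `crop`), their parameters, and the JSON shape are hypothetical stand-ins, not part of Whisper, any LLM API, or AAIELA.

```python
import json

# Hypothetical tools: stand-ins that only describe the action they would take.
def upscale(image_path: str, factor: int = 2) -> str:
    return f"upscale {image_path} x{factor}"

def crop(image_path: str, box: list) -> str:
    return f"crop {image_path} to {box}"

# Registry mapping the function names the LLM may emit to Python callables.
TOOLS = {"upscale": upscale, "crop": crop}

def dispatch(llm_output: str) -> str:
    """Parse a JSON function call emitted by the LLM and run the matching tool."""
    call = json.loads(llm_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))

# In a real pipeline this JSON would come from the LLM; it is hard-coded here.
llm_output = '{"name": "upscale", "arguments": {"image_path": "photo.png", "factor": 4}}'
print(dispatch(llm_output))
```

Rejecting unknown tool names keeps a misheard or hallucinated command from running anything unintended.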

jaggs · 1y ago

Make it so!

sargstuff · 1y ago

gMake it, you gAught it. (once there's enough bandwidth to go around[0])

[0] : Intel CPU with OCI Chiplet Demoed with 4Tbps of Bandwidth and 100M Reach : https://news.ycombinator.com/item?id=40844616

omerhac · 1y ago · 2 in thread

Very cool - which method do you use for editing the images? is it SDEdit or InstructPix2Pix? another one?

ShaShekhar (OP) · 1y ago

Thanks. Stable Diffusion inpainting v1.5. I'd played around with this model so much that I ended up using it. I've read both papers: SDEdit, where you need a mask for inpainting, and InstructPix2Pix, where you don't. I know, I'm a year behind when it comes to using new models like LEDITS++, LCM, SDXL inpainting etc. There is so much work to do. VCs won't fund me as it's not a b2b spinoff.

ShaShekhar (OP) · 1y ago

InstructPix2Pix is fine-tuned on SD v1.5, which is an inpainting model (aware of context and semantics); that's why it doesn't require a mask.

throwaway4aday · 1y ago · 1 in thread

Forgot to share this link as well, not sure if you're aware of it but it's a great write up on fine tuning small local models on specific APIs and seems like it would be a perfect fit for your project. https://bair.berkeley.edu/blog/2024/05/29/tiny-agent/

ShaShekhar (OP) · 1y ago

I integrated and tested Microsoft's Phi-3-mini and it works really well. Having the freedom to run locally without sharing private photos is my utmost objective.

vunderba · 1y ago · 1 in thread

Nice job. I actually experimented with a chat driven instruct2pix sort of interface that connected via API to a stable diffusion backend. The big problem is that it's difficult to know if the inpainting job you've done is satisfactory to the user.

This is why, when you're doing this sort of traditional inpainting in automatic1111, you usually generate several iterations with various mask blurs, whole picture vs. only masked section, and padding. And of course the optimal inpainting checkpoint model to use depends on whether the original image is photorealistic or illustrated, etc.
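The "several iterations" idea above can be sketched as a small settings sweep. This is only an illustration: the dictionary keys (`mask_blur`, `area`, `padding`) and the example values are assumptions chosen for readability, not the actual automatic1111 API field names, and the inpainting call itself is left out.

```python
def inpaint_variants(mask_blurs, paddings):
    """Build one settings dict per combination worth trying."""
    variants = []
    for blur in mask_blurs:
        # Whole-picture inpainting: padding around the mask does not apply.
        variants.append({"mask_blur": blur, "area": "whole_picture", "padding": None})
        # Only-masked inpainting: try each padding value around the masked region.
        for pad in paddings:
            variants.append({"mask_blur": blur, "area": "only_masked", "padding": pad})
    return variants

variants = inpaint_variants(mask_blurs=[4, 8], paddings=[32, 64])
for settings in variants:
    print(settings)
```

Each settings dict would then be passed to whatever inpainting backend is in use, and the user picks the result they prefer.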

ShaShekhar (OP) · 1y ago

Right now, the inpainting is done on a semantic mask (the output of a segmentation model). For more complex instructions, we also have to support contextual mask generation, which is an active area of research in the field of Visual Language Models. When it comes to performing several iterations, you can also do that at the semantic level or get a batch of outputs. The SD v1.5 inpainting model is quite weak and we haven't seen any large-scale open-source inpainting model for a while.
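As a rough illustration of what inpainting on a semantic mask involves, the sketch below turns a segmentation label map into a binary mask for one class. The label ids (`SKY`, `MOUNTAIN`) and the tiny 3x3 map are made up for the example; a real segmentation model would produce a full-resolution map with its own class ids.

```python
def semantic_mask(label_map, target_label):
    """Return a binary mask: 1 where the pixel belongs to target_label, else 0."""
    return [[1 if label == target_label else 0 for label in row] for row in label_map]

# Hypothetical class ids and a toy 3x3 segmentation output.
SKY, MOUNTAIN = 0, 1
label_map = [
    [SKY, SKY, SKY],
    [SKY, MOUNTAIN, SKY],
    [MOUNTAIN, MOUNTAIN, MOUNTAIN],
]

# Pixels marked 1 are the ones an inpainting model would be asked to repaint.
sky_mask = semantic_mask(label_map, SKY)
for row in sky_mask:
    print(row)
```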

leobg · 1y ago · 1 in thread

I love how in the demo video, even the audio instructions themselves are AI generated. No human in the loop, at all! :)

ShaShekhar (OP) · 1y ago

I did it intentionally. The video had my voice, but then I decided to replace it with an AI voice.

kveykva · 1y ago · 1 in thread

This pitches a lot but only seems to support a specific inpainting operation?

ShaShekhar (OP) · 1y ago

The tools are there, we just have to connect them (check out the TODO section). More complex instructions, like when you want to create the mask, require a lot of contextual reasoning, which I tried to point out in the Research section.

ShaShekhar (OP) · 1y ago

Example instructions:

1. Replace the sky with a deep blue sky then replace the mountain with a Himalayan mountain covered in snow.
2. Stylize the car with a cyberpunk aesthetic, then change the background to a neon-lit cityscape at night.
3. Replace the person with a sculpture complementing the architecture.

Check out the Research section for more complex instructions.
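The chained example instructions above suggest splitting one utterance into ordered edit steps. The sketch below does this naively by splitting on the word "then"; this is an assumption made for illustration, since the project is described as using an LLM for instruction understanding, and `split_steps` is a hypothetical helper name.

```python
import re

def split_steps(instruction: str) -> list:
    """Split a chained instruction into ordered steps on the word 'then'."""
    parts = re.split(r",?\s*\bthen\b\s*", instruction, flags=re.IGNORECASE)
    # Drop surrounding spaces and trailing periods, and skip empty fragments.
    return [p.strip(" .") for p in parts if p.strip(" .")]

steps = split_steps(
    "Stylize the car with a cyberpunk aesthetic, "
    "then change the background to a neon-lit cityscape at night."
)
for i, step in enumerate(steps, start=1):
    print(i, step)
```

A keyword split like this breaks on phrasing such as "and after that", which is one reason to hand the parsing to a language model instead.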

G1N · 1y ago

We're so close to being able to create our own Tayne

(https://www.youtube.com/watch?v=a8K6QUPmv8Q)

benzguo · 1y ago

Super cool! We're building an API that makes it easy to build chained multi-model workflows like this that run with zero latency between tasks - https://www.substrate.run/

beautifulfreak · 1y ago

It didn't just replace the sky and background, it replaced the trees. That wasn't part of the instructions.

parentheses · 1y ago

soon the movie trope of saying "enhance" repeatedly could be a real thing!

whatnotests2 · 1y ago

Zoom. Enhance!
