It's pretty simple:
- Use a keyboard shortcut to take a screenshot of your active macOS window and start recording the microphone.
- Speak your question, then press the keyboard shortcut again to send your question + screenshot off to OpenAI Vision.
- The Vision response is presented in-context/overlaid on the active window, and spoken to you as audio.
- The app keeps running in the background, only taking a screenshot/listening when activated by keyboard shortcut.
It's built with NodeJS/Electron, and uses OpenAI Whisper, Vision and TTS APIs under the hood (BYO API key).
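To make the "question + screenshot" step a bit more concrete, here is a minimal sketch of how such a request body could be put together for OpenAI's chat completions endpoint. The function name `buildVisionRequest`, the default model name and the `maxTokens` value are assumptions for illustration, not taken from the macOSpilot source.

```javascript
// Sketch only: builds the JSON body for a Vision-style chat request.
// buildVisionRequest, the default model name and maxTokens are illustrative assumptions.
function buildVisionRequest(question, screenshotPng, options = {}) {
  const { model = "gpt-4-vision-preview", maxTokens = 500 } = options;
  // The screenshot is embedded inline as a base64 data URL.
  const dataUrl = `data:image/png;base64,${screenshotPng.toString("base64")}`;
  return {
    model,
    max_tokens: maxTokens,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          { type: "image_url", image_url: { url: dataUrl } },
        ],
      },
    ],
  };
}

// Usage: the Buffer would normally be read from the saved screenshot file.
const body = buildVisionRequest("How do I merge all layers?", Buffer.from("fake-png-bytes"));
console.log(JSON.stringify(body, null, 2));
```

The transcribed question goes in as the text part, so the Whisper call has to finish before this body can be built.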
There's a simple demo and a longer walk-through in the GH readme https://github.com/elfvingralf/macOSpilot-ai-assistant, and I also posted a different demo on Twitter: https://twitter.com/ralfelfving/status/1732044723630805212
I can see how much time it will save me when I'm working with software or a domain I don't know very well.
Here is the video of my interaction: https://www.youtube.com/watch?v=ikVdjom5t0E&feature=youtu.be
Weird these negative comments. Did people actually try it?
I sent him your video, hopefully he'll believe me now :)
MidiJourney: ChatGPT integrated into Ableton Live to create MIDI clips from prompts. https://github.com/korus-labs/MIDIjourney
I have some work on a branch that makes ChatGPT a lot better at generating symbolic music (a better prompt and music notation).
LayerMosaic lets you use MusicGen text-to-music loops with the music library of our company. https://layermosaic.pixelynx-ai.com/
"Here's a list of effects. Here's a list of things that make a song. Is it good? Yes. What about my drum effects? Yes here's the name of the two effects you are using on your drum channel"
None of this is really helpful and I can't get over how much it sounds like Eliza.
So... beware when you use it.
You can turn it on and off. Not necessary to turn it on when editing confidential documents.
You never enable screen-sharing in videoconferencing software?
There's nothing wrong with using web tech to build things! It's often easier, the documentation is more comprehensive, and if you ever wanted to make it cross-platform, Electron makes it trivial.
If you were working for a company it might be worth considering the trade-offs—do you need to support Macs with less RAM?—but for a side project that's for yourself and maybe some friends, just do what works for you!
If you want to niche down into a more macOS-specific app, you could learn AppKit and SwiftUI and build a fully native macOS app.
If you want to stay cross-platform, but you're not happy with Electron, then it might be worth checking out Tauri. It lets you build the UI with web technologies, but without packaging a browser engine and V8 runtime with your app bundle. Instead, it uses the operating system's own webview (e.g. on macOS it uses WebKit), so it significantly reduces the download size of your app.
In terms of developing this into a product, on one hand it seems like deep integration with the host OS is the best way to build a "moat", but then again, Apple could release their own version and quickly blow a product like that out of the water.
This is a macOS-specific app, it seems. If you want better performance and more integration with the OS, I'd recommend using Swift.
Congrats to the op on shipping!
I prefer speaking over typing, and I sit alone, so probably won't add a text input anytime soon. But I'll hit you up on Discord in a bit and share notes.
That would be great for people with Mac mini who don't have a mic.
Just kidding. Text seems to be the most requested addition, and it wasn't on my own list :) Will see if I add it, should be fairly easy to make it configurable and render a text input window with a button instead of triggering the microphone.
Won't make any promises, but might do it.
I made Clipea, which is similar but has special integration with zsh.
I was skimming through the video you posted, and was curious.
https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s
code link: https://github.com/elfvingralf/macOSpilot-ai-assistant/blob/...
I suspect OSX vs macOS has marginal impact on the outcome :)
[1]: https://iris.fun/
$ ls
a.txt b.txt c.txt
$ # AI: concatenate these files and sort the result on the third column
$ #....
$ # cat a.txt b.txt c.txt | sort -k 3
This already works brilliantly by just pasting into CodeLLaMa, so it's purely terminal integration to make it work. All I need is the rest of life to stop being so annoyingly busy.
$ qq concatenate all files in the current directory and sort the result on the third column
cat * | sort -k3
CodeLLama with GPU acceleration on Mac M1 is almost instant in response, it's really compelling.
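For the terminal integration idea, the glue code is mostly "wrap the request in a prompt, then pull one command out of the reply". Here is a rough sketch of those two pieces. Both helpers and the prompt wording are hypothetical, not part of any existing `qq` tool, and the call to the model itself is left out.

```javascript
// Hypothetical helpers for a qq-style terminal assistant; nothing here comes from an existing tool.
const FENCE = "`".repeat(3); // markdown code fence that models often wrap commands in

// Wrap the user's plain-English request in a prompt that asks for exactly one command.
function buildShellPrompt(request, shell = "zsh") {
  return [
    `You are a ${shell} expert. Reply with a single shell command and nothing else.`,
    `Task: ${request.trim()}`,
  ].join("\n");
}

// Take the first non-empty line of the reply, skipping fence lines and a leading "$ ".
function extractCommand(reply) {
  const lines = reply
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith(FENCE));
  if (lines.length === 0) return null;
  return lines[0].replace(/^\$\s*/, "");
}

console.log(buildShellPrompt("concatenate all files and sort the result on the third column"));
console.log(extractCommand("$ cat * | sort -k3"));
```

Printing the command for the user to confirm, rather than executing it directly, seems like the safer default for something like this.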
1. Download LLaVA from https://github.com/Mozilla-Ocho/llamafile
2. Run Whisper locally for speech to text
3. Save screenshots and send to the model, with a script like https://til.dave.engineer/openai/gpt-4-vision/
Do you know of a simple setup that I can run locally with support for both images and text?
I suspect that you should be able to take my code and only require a few tweaks to make it work tho, shouldn't be much about it that is macOS only.
How does the pricing per message + reply end up in practice? (If my calculations are right, it shouldn't be too bad, but sounds a bit too good to be true)
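If anyone wants to redo that calculation, each interaction is three billable calls: Whisper (per minute of audio), Vision (per input and output token) and TTS (per character). A small sketch of the arithmetic follows. The function and field names are made up for illustration, and the rates in the usage example are placeholders, not actual OpenAI prices, so plug in the current numbers from the pricing page.

```javascript
// Rough per-interaction cost estimate. All rates are supplied by the caller;
// the numbers in the usage example below are placeholders, NOT real prices.
function estimateInteractionCost(usage, rates) {
  const whisper = (usage.audioSeconds / 60) * rates.whisperPerMinute;
  const vision =
    (usage.inputTokens / 1000) * rates.visionPer1kInputTokens +
    (usage.outputTokens / 1000) * rates.visionPer1kOutputTokens;
  const tts = (usage.ttsCharacters / 1000) * rates.ttsPer1kCharacters;
  return { whisper, vision, tts, total: whisper + vision + tts };
}

// Usage with placeholder rates (replace with the published prices).
const estimate = estimateInteractionCost(
  { audioSeconds: 10, inputTokens: 1000, outputTokens: 200, ttsCharacters: 600 },
  { whisperPerMinute: 1, visionPer1kInputTokens: 1, visionPer1kOutputTokens: 1, ttsPer1kCharacters: 1 }
);
console.log(estimate);
```

Note that the screenshot itself counts towards the input tokens, so the image usually matters more than the length of the spoken question.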
But I also pay for ChatGPT Plus, and it's sooo worth it to me.
If you'd like to skip Plus and use something else, I don't think my project is the right one. I'd STRONGLY suggest you check out TypingMind, the best wrapper I've found: https://www.typingmind.com/
Perfect Show HN and a great start of a product if the author wants to.
In general I’m pretty excited about LLM as interface and what that is going to mean going forward.
I think our kids are going to think mice and keyboards are hilariously primitive.
https://github.com/samoylenkodmitry/Linux-AI-Assistant-scrip...
F1 - ask ChatGPT API about current clipboard content
F5 - same, but opens editor before asking
num+ - starts/stops recording microphone, then passes to Whisper (locally installed), copies to clipboard
I find myself rarely using them however.
EDIT: I checked again and it seems the pricing is comparable. Good stuff.
Right now there's also a daily limit on the Vision API that kicks in before it gets really bad, 100+ requests depending on what your max spend limit is.
Thanks for sharing!
The pilot name comes from Microsoft's use of "Copilot" for their AI assistant products, and I tried to play on it with macOSpilot which is maco(s)pilot. I think that naming has completely flown over everyone's heads :D
E.g. in Photoshop: "How do I merge all layers" --> "To merge all layers you can use the keyboard shortcut Shift + command + E"
If you can get that response in JSON, you could prompt the user if they want to take the suggested action. I don't see myself using it very often, so didn't think much further about it.
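A rough sketch of what that could look like, assuming the system prompt asks the model to answer as JSON with an `answer` field and an optional `shortcut` field. That schema and the function name are made up for illustration, and the sketch falls back to plain text when the reply isn't valid JSON.

```javascript
// Sketch: interpret a model reply that is hopefully JSON such as
// {"answer": "...", "shortcut": "Shift+Command+E"}. The schema is an assumption.
function parseAssistantReply(raw) {
  try {
    const parsed = JSON.parse(raw);
    if (parsed !== null && typeof parsed === "object" && typeof parsed.answer === "string") {
      const shortcut = typeof parsed.shortcut === "string" ? parsed.shortcut : null;
      // offerAction tells the UI whether to ask "want me to do that for you?"
      return { answer: parsed.answer, shortcut, offerAction: shortcut !== null };
    }
  } catch (err) {
    // Not JSON: fall through and treat the reply as plain text.
  }
  return { answer: raw, shortcut: null, offerAction: false };
}

const reply = parseAssistantReply(
  '{"answer": "To merge all layers, use the keyboard shortcut.", "shortcut": "Shift+Command+E"}'
);
if (reply.offerAction) {
  console.log(`Suggested action: press ${reply.shortcut}?`);
}
```

Actually sending the keystroke to the active app is a separate problem and is not covered here.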
Make the console.log calls for the three API calls a bit more verbose to find out which call is causing this, and whether there's more info in the error body.
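Something along these lines would do it. This is a sketch, not the code in the repo: the helper names are invented, and it assumes the HTTP client attaches `response.status` and `response.data` to the error it throws (axios-style), which may not match what your setup does.

```javascript
// Format an error from one of the API calls so the log shows which call failed and why.
// Assumes an axios-style error shape (err.response.status / err.response.data).
function describeApiError(label, err) {
  const status = err && err.response ? err.response.status : "no status";
  const body =
    err && err.response && err.response.data !== undefined
      ? JSON.stringify(err.response.data)
      : "no response body";
  const message = err && err.message ? err.message : String(err);
  return `[${label}] failed: ${message} (status: ${status}, body: ${body})`;
}

// Wrap an async API call so failures are logged with their label before being rethrown.
async function withLogging(label, call) {
  console.log(`[${label}] starting`);
  try {
    const result = await call();
    console.log(`[${label}] ok`);
    return result;
  } catch (err) {
    console.error(describeApiError(label, err));
    throw err;
  }
}
```

You would wrap each of the three calls with its own label, e.g. `withLogging("whisper", () => transcribe(audioPath))`, where `transcribe` stands in for whatever function makes that call in your code.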
https://github.com/KoljaB/LocalAIVoiceChat
While the one below uses OpenAI, I don't see why it couldn't be replaced with the project above in local mode.
I went for OpenAI because I wanted to build something quickly, but you should be able to replace the external API calls with calls to your internal models.
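One way to keep that swap cheap is to resolve the endpoint and headers in a single place. A minimal sketch, assuming your internal model sits behind a server that accepts OpenAI-style chat requests; `resolveBackend`, the config fields and the local URL in the usage example are all hypothetical.

```javascript
// Sketch: decide where chat requests go. The "local" provider assumes a server
// that accepts OpenAI-style requests; its URL is whatever you configure.
const OPENAI_CHAT_URL = "https://api.openai.com/v1/chat/completions";

function resolveBackend(config) {
  if (config.provider === "local") {
    if (!config.localUrl) throw new Error("localUrl is required for the local provider");
    // No API key needed when talking to a model on your own machine or network.
    return { url: config.localUrl, headers: { "Content-Type": "application/json" } };
  }
  if (!config.apiKey) throw new Error("apiKey is required for the OpenAI provider");
  return {
    url: OPENAI_CHAT_URL,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
  };
}

// Usage: the local URL is a placeholder for wherever your own server listens.
const backend = resolveBackend({ provider: "local", localUrl: "http://localhost:8080/v1/chat/completions" });
console.log(backend.url);
```

The speech-to-text and text-to-speech calls would need the same treatment, since a local setup has to replace all three APIs.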
https://news.ycombinator.com/item?id=38244883
There are some pros and cons to that. I’m intrigued by your stand-alone macOS app.
> off to OpenAI Vision
Pick one
The problems are obvious: reaction time, API call limits, and only average responses on complex tasks due to the limitations of the vision model. Similar functionality has to be available for free with a local model tuned for these types of tasks (helper/copilot). Apple and Microsoft will include helper models in the OS soon. Let's hope they are generous and don't turn this into a local data-gathering funnel (I have my doubts on this).
Lost me.