Show HN: Open-source macOS AI copilot using vision and voice (opens in new tab)

(github.com)

430 pointsralfelfving2y ago159 comments

Heeey! I built a macOS copilot that has been useful to me, so I open sourced it in case others would find it useful too.

It's pretty simple:

- Use a keyboard shortcut to take a screenshot of your active macOS window and start recording the microphone.

- Speak your question, then press the keyboard shortcut again to send your question + screenshot off to OpenAI Vision

- The Vision response is presented in-context/overlayed over the active window, and spoken to you as audio.

- The app keeps running in the background, only taking a screenshot/listening when activated by keyboard shortcut.

It's built with NodeJS/Electron, and uses OpenAI Whisper, Vision and TTS APIs under the hood (BYO API key).

There's a simple demo and a longer walk-through in the GH readme https://github.com/elfvingralf/macOSpilot-ai-assistant, and I also posted a different demo on Twitter: https://twitter.com/ralfelfving/status/1732044723630805212

Show HN: Open-source macOS AI copilot using vision and voice

(github.com)

430 pointsralfelfving2y ago159 comments

Heeey! I built a macOS copilot that has been useful to me, so I open sourced it in case others would find it useful too.

It's pretty simple:

- Use a keyboard shortcut to take a screenshot of your active macOS window and start recording the microphone.

- Speak your question, then press the keyboard shortcut again to send your question + screenshot off to OpenAI Vision

- The Vision response is presented in-context/overlayed over the active window, and spoken to you as audio.

- The app keeps running in the background, only taking a screenshot/listening when activated by keyboard shortcut.

It's built with NodeJS/Electron, and uses OpenAI Whisper, Vision and TTS APIs under the hood (BYO API key).

159 comments

118 comments · 39 top-level

thomashop2y ago· 9 in thread

Just used it with the digital audio workstation Ableton Live. It is amazing! Its tips were spot-on.

I can see how much time it will save me when I'm working with a software or domain I don't know very well.

Here is the video of my interaction: https://www.youtube.com/watch?v=ikVdjom5t0E&feature=youtu.be

Weird these negative comments. Did people actually try it?

ralfelfvingOP2y ago

So glad when I saw this, thanks for sharing this! It was exactly music production in Ableton was the spark that lit this idea in my head the other week. I tried to explain to a friend that don't use GPT much that with Vision, you can speed up your music production and learn how to use advanced tools like Ableton more quickly. He didn't believe me. So I grabbed a Ableton screenshot off Google and used ChatGPT -- then I felt there had to be a better way, I realized that I have my own use-cases, and it all evolved into this.

I sent him your video, hopefully he'll believe me now :)

thomashop2y ago

You may be interested in two proof of concepts I've been working on. I work with generative AI and music at a company.

MidiJourney: ChatGPT integrated into Ableton Live to create MIDI clips from prompts. https://github.com/korus-labs/MIDIjourney

I have some work on a branch that makes ChatGPT a lot better at generating symbolic music (a better prompt and music notation).

LayerMosaic allows you to allow MusicGen text-to-music loops with the music library of our company. https://layermosaic.pixelynx-ai.com/

3 more replies

mikey_p2y ago

Is it just me or is it incredibly useless?

"Here's a list of effects. Here's a list of things that make a song. Is it good? Yes. What about my drum effects? Yes here's the name of the two effects you are using on your drum channel"

None of this is really helpful and I can't get over how much it sounds like Eliza.

thomashop2y ago

I just made a video where I test it with a proper use case. It helps me find effects to make a bassline more dubby and helps carve out frequencies in the kick drum to make space for the bass.

https://www.youtube.com/watch?v=zyMmurtCkHI

thomashop2y ago

I made that video right at the start but since then I've asked it for example what kind of compression parameters would fit with a certain track and it could explain to me how to find an expert function which I would have had to consult a manual for otherwise.

urbandw311er2y ago

Yeah I thought the same. Ultra generic advice and no evidence it has actually parsed anything unique or useful from the user’s actual composition.

1 more reply

pelorat2y ago

I mean it does send a screenshot of your screen off to a 3rd party, and that screenshot will most likely be used in future AI training sets.

So... beware when you use it.

zwily2y ago

OpenAI claims that data sent via the API (as opposed to chatGPT) will not be used in training. Whether or not you believe them is a separate question, but that's the claim.

thomashop2y ago

Beware of it seeing a screenshot of my music set? OpenAI will start copying my song structure?

You can turn it on and off. Not necessary to turn it on when editing confidential documents.

You never enable screen-sharing in videoconferencing software?

1 more reply

ProfessorZoom2y ago· 9 in thread

e-e-e-electron... for this..

atraac2y ago

Ah yes, cause what's better than building a real, working MVP? Learning Rust for half a year just so you can 'optimize' the f out of an app that does two REST calls.

wtallis2y ago

To be fair, this does sound like the kind of app that would benefit from being able to launch instantly, and potentially registering with the OS as a service in a way that cross-platform frameworks like Electron cannot easily accommodate. But Rust would not be the easiest choice to avoid those limitations.

ralfelfvingOP2y ago

I don't know man. I'm new to development, it's what I chose, probably don't know any better. Tell me what you would have chosen instead?

lolinder2y ago

Don't mind them—there's a certain subset of HN that is upset that web tech has taken over the world. There are some legitimate gripes about the performance of some electron apps, but with some people those have turned into compulsive shallow dismissals of any web app that they believe could have been native.

There's nothing wrong with using web tech to build things! It's often easier, the documentation is more comprehensive, and if you ever wanted to make it cross-platform election makes it trivial.

If you were working for a company it might be worth considering the trade-offs—do you need to support Macs with less RAM?—but for a side project that's for yourself and maybe some friends, just do what works for you!

1 more reply

programmarchy2y ago

My two cents: I think you made a good, practical choice. If you're happy with Electron, I'd say stick with it, especially if you have cross-platform plans in the future.

If you want to niche down into a more macOS specific app, you could learn AppKit and SwiftUI and build a fully native macOS app.

If you want to stay cross-platform, but you're not happy with Electron, then it might be worth checking out Tauri. It provides a JavaScript-based API to display native UI components, but without packaging a V8 runtime with your app bundle. Instead, it uses a native JavaScript host e.g. on macOS it uses WebKit, so it significantly reduces the download size of your app.

In terms of developing this into a product, on one hand it seems like deep integration with the host OS is the best way to build a "moat", but then again, Apple could release their own version and quickly blow a product like that out of the water.

airstrike2y ago

I think the parent comment is a shallow dismissal, but since you're asking, I would have built in SwiftUI

guytv2y ago

What's important is to get an product out there. Nobody cares what stack you use. just us geeks. don't get discouraged. you did well :)

xNeil2y ago

electron's a really nice option, specially for people that aren't interested in porting their apps or spending too much time on development

this is a macOS specific app it seems - if you want better performance and more integration with the OS, i'd recommend using swift

1 more reply

jdamon962y ago

ignore the naysayers; nice job building out your idea

1 more reply

swiftcoder2y ago· 6 in thread

Worth mentioning that if you are in a corporate environment, running a service that sends arbitrary desktop screenshots to a 3rd party cloud service is going to run afoul of pretty much every security and regulatory control in existence

ralfelfvingOP2y ago

I assume that anyone capable of cloning the app, starting the it on their machine and obtaining + adding an OpenAI API key understands that some data is being sent offsite -- and will be aware of their corporate policies. I think that's a fair assumption.

greenie_beans2y ago

that's a fair assumption. feels like swiftcoder is just trying to gotcha

isoprophlex2y ago

You're telling me... the cloud... is other people's computers?!

thelittleone2y ago

The control for that is endpoints should be locked down to prevent install of non approved apps. Any org under regulatory controls would have some variation of that. Safe to assume an orgs users are stupid or nefarious and build defences accordingly.

1 more reply

abrichr2y ago

This is exactly why in https://github.com/OpenAdaptAI/OpenAdapt we have implemented three separate PII scrubbing providers.

Congrats to the op on shipping!

brookst2y ago

True, but also true of other screen capture utilities that send data to the cloud. Your PSA is true, but hardly unique to this little utility. And probably not surprising to the intended audience.

jondwillis2y ago· 5 in thread

You should add an option for streaming text as the response instead of TTS. And also maybe text in place of the voice command as well. I have been tire-kicking a similar kind of copilot for awhile, hit me up on discord @jonwilldoit

ralfelfvingOP2y ago

There's definitely some improvements to shuttling the data between interface<->API, all that was done in a few hours on day 1 and there's a few things I decided to fix later.

I prefer speaking over typing, and I sit alone, so probably won't add a text input anytime soon. But I'll hit you up on Discord in a bit and share notes.

jondwillis2y ago

Yeah, just some features I could see adding value and not being too hard to implement :)

tomComb2y ago

> text in place of the voice command as well

That would be great for people with Mac mini who don't have a mic.

ralfelfvingOP2y ago

Hmmm... what if I added functionality that uses the webcam to read your lips?

Just kidding. Text seem to be the most requested addition, and it wasn't on my own list :) Will see if I add it, should be fairly easy to make it configurable and render a text input window with a button instead of triggering the microphone.

Won't make any promises, but might do it.

1 more reply

ralfelfvingOP2y ago

Added text input instead of voice as an option today.

qainsights2y ago· 4 in thread

Great. I created `kel` for terminal users. Please check it out at https://github.com/qainsights/kel

dave1010uk2y ago

Very cool! Have you had much luck with Llama models?

I made Clipea, which is similar but has special integration with zsh.

https://github.com/dave1010/clipea

qainsights2y ago

Yes, I used Langchain for Llama.

qainsights2y ago

Clipea is cool.

1 more reply

causal2y ago

Chatblade is another good one: https://github.com/npiv/chatblade

e28eta2y ago· 3 in thread

Did you find that calling it “OSX” in the prompt worked better than macOS? Or was that just an early choice that you didn’t spend much time on?

I was skimming through the video you posted, and was curious.

https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s

code link: https://github.com/elfvingralf/macOSpilot-ai-assistant/blob/...

ralfelfvingOP2y ago

No, this is an oversight by me. To be completely honest, up until the other day I thought it was still called OSX. So the project was literally called cOSXpilot, but at some point I double checked and realize it's been called macOS for many years. Updated the project, but apparently not the code :)

I suspect OSX vs macOS has marginal impact on the outcome :)

e28eta2y ago

Haha, makes perfect sense, thanks for the reply!

hot_gril2y ago

Heh. I remember calling it Mac OS back in the day and getting corrected that it's actually OS X, as in "OS ten," and hasn't been called Mac OS since Mac OS 9. Glad Apple finally saw it my way (except it's cased macOS).

kssreeram2y ago· 3 in thread

People reading this should check out Iris[1]. I’ve been using it for about a month, and it’s the best macOS GPT client I’ve found.

[1]: https://iris.fun/

LeoPanthera2y ago

Oof, $20/month is a lot, when I already have my own OpenAI API key.

kssreeram2y ago

I guess having to enter the API key is not a great user experience for regular people who aren’t developers.

1 more reply

mdrzn2y ago

I wish there was something like this for Windows!

netika2y ago· 3 in thread

Such a shame it uses Vision API, i.e. it can not be replaced by some random self-hosted LLM.

ralfelfvingOP2y ago

It can be replaced with a self-hosted LLM, simply change the code where the Vision API is being called. That's true for all of the API calls in the app.

freedomben2y ago

Actually it's open source, so it can be replaced by some random self-hosted LLM

iandanforth2y ago

For example, one of these:

https://opencompass.org.cn/leaderboard-multimodal

hackncheese2y ago· 2 in thread

Love it! Will definitely use this when a quick screenshot will help specify what I am confused about. Is there a way to hide the window when I am not using it? i.e. I hit cmd+shift+' and it shows the window, then when the response finishes reading, it hides again?

ralfelfvingOP2y ago

There's a way for sure, it's just not implemented. Allowing for more configurability of the window(s) is on my list, because it annoys me too! :)

hackncheese2y ago

Annoyance Driven Development™

poorman2y ago· 2 in thread

Currently imagining my productivity while waiting 10 seconds for the results of the `ls` command.

ralfelfvingOP2y ago

It's a basic demo to show people how it works. I think you can imagine many other examples where it'll save you a lot of time.

hot_gril2y ago

The demo on Twitter is a lot cooler, partially because you scroll to show the AI what the page has. Maybe there's a more impressive demo to put on the GH too?

zmmmmm2y ago· 2 in thread

I've been wanting to build something like this by integrating into the terminal itself. Seems very straight forward and avoids the screen shotting. So you would just type a comment in the right format and it would recognise it:

    $ ls 
    a.txt b.txt c.txt

    $ # AI: concatenate these files and sort the result on the third column
    $ #....
    $ # cat a.txt b.txt c.txt | sort -k 3

This already works brilliantly by just pasting into CodeLLaMa so it's purely terminal integration to make it work. All i need is the rest of life to stop being so annoyingly busy.

paulmedwards2y ago

I wrote a simple command line app to let me quickly ask a quick question in the terminal - https://github.com/edwardsp/qq. It outputs the command I need and puts it in the paste buffer. I use it all the time now, e.g.

    $ qq concatenate all files in the current directory and sort the result on the third column
    cat * | sort -k3

zmmmmm2y ago

yep absolutely - have seen a few of those. And how well they work is what inspires me to want the next parts, which are (a) send the surrounding lines and output as context - notice above I can ask it about "these files" (b) automatically add the result to terminal history so I can avoid copy/paste if I want to run it. I think this could make these things absolutely fluid, almost like autocomplete (another crazy idea is to actually tie it into bash-completion so when you press tab it does the above).

CodeLLama with GPU acceleration on Mac M1 is almost instant in response, its really compelling.

1 more reply

I_am_tiberius2y ago· 2 in thread

I would love to have something like this but using an open source model and without any network requests.

dave1010uk2y ago

LLaVA, Whisper and a few bash scripts should be able to do it. I don't know how helpful the model is with screenshots though.

1. Download LLaVA from https://github.com/Mozilla-Ocho/llamafile

2. Run Whisper locally for speech to text

3. Save screenshots and send to the model, with a script like https://til.dave.engineer/openai/gpt-4-vision/

trenchgun2y ago

Probably in three months, approximately.

smcleod2y ago· 2 in thread

Nice project, any plans to make it work with local LLMs rather than "open"AI?

ralfelfvingOP2y ago

Thanks. Had no plans, but might give it a try at some point. For me, personally, using OpenAI for this isn't an issue.

hmottestad2y ago

I think that LM Studio has an OpenAI "compliant" API, so if there is something similar that supports vision+text then it would be easy enough to make the base URL configurable and then point it to localhost.

Do you know of a simple setup that I can run locally with support for both images and text?

d4rkp4ttern2y ago· 2 in thread

I’ve looking for a simple way to use voice input on the main ChatGPT website, since it gets tiresome to type a lot of text into it. Anyone have recommendations? The challenge is to get technical words right.

ralfelfvingOP2y ago

If you're ok with it, you can use the mobile app -- it supports voice. Then you just have the same chat/thread open on your computer in case you need to copy/paste something.

d4rkp4ttern2y ago

Good idea, yes I do use the iOS app with voice all the time. But didn’t occur to me to use the iOS app to start a chat and continue on desktop. The main pain though is where I have lengthy back and forth with GPT4 discussing an approach or getting some piece of code just right. It often gets tiring enough that I just quickly type with lots of typos and it still does fine. But I’d rather not have to do that because these typo-filled chats will be hard to search though later :)

quinncom2y ago· 2 in thread

I’d love to see a version of this that uses text input/output instead of voice. I often have someone sleeping in the room with me and don’t want to speak.

ralfelfvingOP2y ago

Added the text input option today.

ralfelfvingOP2y ago

You're not the first to request. Might add it, can't promise tho.

lordswork2y ago· 2 in thread

This looks very cool. Does anyone know of something similar for Windows? (or does OP intend to extend support to Windows?)

ralfelfvingOP2y ago

Hey, OP here. I don't have a Windows machine so have not been able to confirm if it works, and probably won't be able to develop/test for it either -- sorry! :/

I suspect that you should be able to take my code and only require a few tweaks to make it work tho, shouldn't be much about it that is macOS only.

coolspot2y ago

For testing/development, you can download a free Windows VM here: https://developer.microsoft.com/en-us/windows/downloads/virt...

qirpi2y ago· 2 in thread

Awesome! I love it! I was just about to sign up for ChatGPT Plus, but maybe I will pay for the API instead. So much good stuff coming out daily.

How does the pricing per message + reply end up in practice? (If my calculations are right, it shouldn't be too bad, but sounds a bit too good to be true)

ralfelfvingOP2y ago

I have a hard time saying how much this particular application cost to run, because I use the Voice+Vision APIs for so many different projects on a near daily basis and haven't implemented a prompt cost estimator.

But I also pay for ChatGPT Plus, and it's sooo worth it to me.

If you'd like to skip Plus and use something else, I don't think my project is the right one. I'd STRONGLY suggest you check out TypingMind, the best wrapper I've found: https://www.typingmind.com/

qirpi2y ago

Wow, thanks for sharing that link, I've been looking for something like this :)

havkom2y ago· 2 in thread

A lot of negative comments here. However, I liked it!

Perfect Show HN and a great start of a product if the author wants to.

ralfelfvingOP2y ago

Thank you, it's my first GH project & Show HN.. and.. yeah.. learning here :D

jonplackett2y ago

Also think this is fun.

In general I’m pretty excited about LLM as interface and what that is going to mean going forward.

I think our kids are going to think mice and keyboards are hilariously primitive.

1 more reply

faceless32y ago· 1 in thread

Wrote some similar scripts for my Linux setup, that I bind with XFCE keyboard shortcuts:

https://github.com/samoylenkodmitry/Linux-AI-Assistant-scrip...

F1 - ask ChatGPT API about current clipboard content F5 - same, but opens editor before asking num+ - starts/stops recording microphone, then passes to Whisper (locally installed), copies to clipboard

I find myself rarely using them however.

ralfelfvingOP2y ago

Nice!

Art96812y ago· 1 in thread

Make sure to set OpenAI API spend limits when using this or you'll quickly find yourself learning the difference between the cost of the text models and vision models.

EDIT: I checked again and it seems the pricing is comparable. Good stuff.

ralfelfvingOP2y ago

I think a prompt cost estimator might be a nifty thing to add to the UI.

Right now there's also a daily API limit on the Vision API too that kicks in before it gets really bad, 100+ requests depending on what your max spend limit is.

rchaves2y ago· 1 in thread

Hey, I was working on something to allow GPT-V to actually do stuff on the screen, click around and type, I tested on my Mac and it’s working pretty well, do you think it would be cool to integrate? https://github.com/rogeriochaves/driver

ralfelfvingOP2y ago

Yes. I think you commented this somewhere else, and I like it. I was considering doing something similar to have it execute keyboard commands, but decided it would have to wait for a future version. I think click + type + and performing other actions would be powerful, especially if it can do it fast and accurate. Then it's less about "How do I do X?", and more "Can you do X for me?".

ukuina2y ago· 1 in thread

This is very cool! Thank you for working on it and sharing it with us.

ralfelfvingOP2y ago

Thank you for checking it out! <3

qup2y ago· 1 in thread

I have a tangential question: my dad is old. I would love to be able to have this feature, or any voice access to an LLM, available to him via an easy-to-press external button. Kind of like the big "easy button" from staples. Is there anything like that, that can be made to trigger a keypress perhaps?

ralfelfvingOP2y ago

I personally have no experience with configuring or triggering keyboard shortcuts beyond what I learned and implemented in this project. But with that said, I'm very confident that what you're describing is not only possible but fairly easy.

behat2y ago· 1 in thread

Nice! Built something similar earlier to get fixes from chatgpt for error messages on screen. No voice input because I don't like speaking. My approach then was Apple Computer Vision Kit for OCR + chatgpt. This reminds me to test out OpenAI's Vision API as a replacement.

Thanks for sharing!

ralfelfvingOP2y ago

Thanks! You could probably grab what I have, and tweak it a bit. Try checking if you can screenshot just the error message and check what the value of the window.owner is. It should be the name of the application, so you could just append `Can you help me with this error I get in ${window.owner}?` to the Vision API call.

dekhn2y ago· 1 in thread

I misread the title and thought this was an app you run on a laptop as you drive around... which if you think about it, would be pretty useful. A combined vision/hearing/language model with access to maps, local info, etc.

ralfelfvingOP2y ago

It would be really cool, and I think we're not very far away from this being something you have on your phone.

The pilot name comes from Microsoft's use of "Copilot" for their AI assistant products, and I tried to play on it with macOSpilot which is maco(s)pilot. I think that naming has completely flown over everyone's heads :D

pyryt2y ago· 1 in thread

Do you have use case demo videos somewhere? Would be great to see this in action

ralfelfvingOP2y ago

There's one at 00:30 in this YouTube video (timestamped the link): https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s

stephenblum2y ago· 1 in thread

You made real-life Clippy! for the Mac. This would be great to be for other mac apps too. Add context of current running apps.

ralfelfvingOP2y ago

It should work for any macOS app. It just takes a screenshot of the currently active window, you can even append the application name if you'd like.

jamesmurdza2y ago· 1 in thread

Have you thought about integrating the macOS accessibility API for either reading text or performing actions?

ralfelfvingOP2y ago

No, my thought process never really stretched outside of what I built. I had this particular idea, then sat down to build it. I had some idea of getting OpenAI to respond with keyboard shortcuts that the application could execute.

E.g. in Photoshop: "How do I merge all layers" --> "To merge all layers you can use the keyboard shortcut Shift + command + E"

If you can get that response in JSON, you could prompt the user if they want to take the suggested action. I don't see myself using it very often, so didn't think much further about it.

spullara2y ago· 1 in thread

Did you not find the built-in voice-to-text and text-to-speech APIs to be sufficient?

ralfelfvingOP2y ago

Didn't even think of them to be honest.

satchlj2y ago· 1 in thread

It's not working for me, I get a "Too many requests" http error

ralfelfvingOP2y ago

Hmm.. OpenAI bunch a few things into some error. Iirc this could be because you're out of credits / don't have a valid payment method on file, but it could also be that you're hitting rate limits. The Vision API could be the culprit, while in beta you can only call it X amount of times per day (X varies by account).

Make the console.log:s for the three API calls a bit more verbose to find out which call is causing this, and if there's more info in the error body.

mdrzn2y ago· 1 in thread

Very cool, would love to have a Windows version of this.

ralfelfvingOP2y ago

I've not tried this on Windows, but might actually work if you run the packager. Try it. If it doesn't work, there shouldn't be too much that is macOS specific -- so you should be able to tweak the underlying code to work with Windows with fairly few changes.

knowsuchagency2y ago· 1 in thread

This is brilliant!

ralfelfvingOP2y ago

Glad you liked it!

Jayakumark2y ago· 1 in thread

Was following these two projects by someuser on Github which makes similar things possible with Local models. Sending screenshot to openai is expensive , if done every few seconds or minutes.

https://github.com/KoljaB/LocalAIVoiceChat

While the below one uses openai - don't see why it can't be replaced with above project and local mode.

https://github.com/KoljaB/Linguflex

ralfelfvingOP2y ago

Nice! Although the productivity increase from being able to resolve blockers more quickly adds up to a lot (at least for me), local models would be more cost effective -- and probably feel less iffy for many people.

I went for OpenAI because I wanted to build something quickly, but you should be able to replace the external API calls with calls to your internal models.

jackculpan2y ago· 1 in thread

This is awesome

ralfelfvingOP2y ago

Thanks, glad you liked it!

amelius2y ago· 1 in thread

Please include "OpenAI-based" in the title. (Now many people here are disappointed).

ralfelfvingOP2y ago

Fair point, didn't think it would matter so much. Can't edit it any more, otherwise I'd change it to add OpenAI to the title!

krschacht2y ago

I love it! I’ve been circling around a similar set of ideas, although my version integrates with the web-based ChatGPT:

https://news.ycombinator.com/item?id=38244883

There are some pros and cons to that. I’m intrigued by your stand-alone MacOS app.

fake-name2y ago

> Open Source

> off to OpenAI Vision

Pick one

nbzso2y ago

Welcome to the future where nobody is professional because there is no need for professionals. Just ask Corporate Overlord Surveillance Bot to give you instruction on what to do and how to think. Voilà. You are the master of the Universe. Dunning-Kruger champion for the ages to come.

The problem is obvious. Time to reaction. API calls limitation. Average response for a complex task due to limitation of the vision module. Similar functionality has to be available for free with local model tuned to those type of tasks - helper/copilot. Apple and Microsoft will include helper models into the OS soon. Let's hope they are generous and don't turn this to a local data gathering funnel (I have my doubts on this).

LeoNatan252y ago

“macOSpilot runs NodeJS/Electron”

Lost me.

j / k navigate · click thread line to collapse

159 comments

118 comments · 39 top-level

thomashop2y ago· 9 in thread

Just used it with the digital audio workstation Ableton Live. It is amazing! Its tips were spot-on.

I can see how much time it will save me when I'm working with a software or domain I don't know very well.

Here is the video of my interaction: https://www.youtube.com/watch?v=ikVdjom5t0E&feature=youtu.be

Weird these negative comments. Did people actually try it?

ralfelfvingOP2y ago

I sent him your video, hopefully he'll believe me now :)

thomashop2y ago

You may be interested in two proof of concepts I've been working on. I work with generative AI and music at a company.

MidiJourney: ChatGPT integrated into Ableton Live to create MIDI clips from prompts. https://github.com/korus-labs/MIDIjourney

I have some work on a branch that makes ChatGPT a lot better at generating symbolic music (a better prompt and music notation).

LayerMosaic allows you to allow MusicGen text-to-music loops with the music library of our company. https://layermosaic.pixelynx-ai.com/

3 more replies

mikey_p2y ago

Is it just me or is it incredibly useless?

"Here's a list of effects. Here's a list of things that make a song. Is it good? Yes. What about my drum effects? Yes here's the name of the two effects you are using on your drum channel"

None of this is really helpful and I can't get over how much it sounds like Eliza.

thomashop2y ago

I just made a video where I test it with a proper use case. It helps me find effects to make a bassline more dubby and helps carve out frequencies in the kick drum to make space for the bass.

https://www.youtube.com/watch?v=zyMmurtCkHI

thomashop2y ago

urbandw311er2y ago

Yeah I thought the same. Ultra generic advice and no evidence it has actually parsed anything unique or useful from the user’s actual composition.

1 more reply

pelorat2y ago

I mean it does send a screenshot of your screen off to a 3rd party, and that screenshot will most likely be used in future AI training sets.

So... beware when you use it.

zwily2y ago

OpenAI claims that data sent via the API (as opposed to chatGPT) will not be used in training. Whether or not you believe them is a separate question, but that's the claim.

thomashop2y ago

Beware of it seeing a screenshot of my music set? OpenAI will start copying my song structure?

You can turn it on and off. Not necessary to turn it on when editing confidential documents.

You never enable screen-sharing in videoconferencing software?

1 more reply

ProfessorZoom2y ago· 9 in thread

e-e-e-electron... for this..

atraac2y ago

Ah yes, cause what's better than building a real, working MVP? Learning Rust for half a year just so you can 'optimize' the f out of an app that does two REST calls.

wtallis2y ago

ralfelfvingOP2y ago

I don't know man. I'm new to development, it's what I chose, probably don't know any better. Tell me what you would have chosen instead?

lolinder2y ago

There's nothing wrong with using web tech to build things! It's often easier, the documentation is more comprehensive, and if you ever wanted to make it cross-platform election makes it trivial.

1 more reply

programmarchy2y ago

My two cents: I think you made a good, practical choice. If you're happy with Electron, I'd say stick with it, especially if you have cross-platform plans in the future.

If you want to niche down into a more macOS specific app, you could learn AppKit and SwiftUI and build a fully native macOS app.

airstrike2y ago

I think the parent comment is a shallow dismissal, but since you're asking, I would have built in SwiftUI

guytv2y ago

What's important is to get an product out there. Nobody cares what stack you use. just us geeks. don't get discouraged. you did well :)

xNeil2y ago

electron's a really nice option, specially for people that aren't interested in porting their apps or spending too much time on development

this is a macOS specific app it seems - if you want better performance and more integration with the OS, i'd recommend using swift

1 more reply

jdamon962y ago

ignore the naysayers; nice job building out your idea

1 more reply

swiftcoder2y ago· 6 in thread

ralfelfvingOP2y ago

greenie_beans2y ago

that's a fair assumption. feels like swiftcoder is just trying to gotcha

isoprophlex2y ago

You're telling me... the cloud... is other people's computers?!

thelittleone2y ago

1 more reply

abrichr2y ago

This is exactly why in https://github.com/OpenAdaptAI/OpenAdapt we have implemented three separate PII scrubbing providers.

Congrats to the op on shipping!

brookst2y ago

True, but also true of other screen capture utilities that send data to the cloud. Your PSA is true, but hardly unique to this little utility. And probably not surprising to the intended audience.

jondwillis2y ago· 5 in thread

ralfelfvingOP2y ago

There's definitely some improvements to shuttling the data between interface<->API, all that was done in a few hours on day 1 and there's a few things I decided to fix later.

I prefer speaking over typing, and I sit alone, so probably won't add a text input anytime soon. But I'll hit you up on Discord in a bit and share notes.

jondwillis2y ago

Yeah, just some features I could see adding value and not being too hard to implement :)

tomComb2y ago

> text in place of the voice command as well

That would be great for people with Mac mini who don't have a mic.

ralfelfvingOP2y ago

Hmmm... what if I added functionality that uses the webcam to read your lips?

Won't make any promises, but might do it.

1 more reply

ralfelfvingOP2y ago

Added text input instead of voice as an option today.

qainsights2y ago· 4 in thread

Great. I created `kel` for terminal users. Please check it out at https://github.com/qainsights/kel

dave1010uk2y ago

Very cool! Have you had much luck with Llama models?

I made Clipea, which is similar but has special integration with zsh.

https://github.com/dave1010/clipea

qainsights2y ago

Yes, I used Langchain for Llama.

qainsights2y ago

Clipea is cool.

1 more reply

causal2y ago

Chatblade is another good one: https://github.com/npiv/chatblade

e28eta2y ago· 3 in thread

Did you find that calling it “OSX” in the prompt worked better than macOS? Or was that just an early choice that you didn’t spend much time on?

I was skimming through the video you posted, and was curious.

https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s

code link: https://github.com/elfvingralf/macOSpilot-ai-assistant/blob/...

ralfelfvingOP2y ago

I suspect OSX vs macOS has marginal impact on the outcome :)

e28eta2y ago

Haha, makes perfect sense, thanks for the reply!

hot_gril2y ago

kssreeram2y ago· 3 in thread

People reading this should check out Iris[1]. I’ve been using it for about a month, and it’s the best macOS GPT client I’ve found.

[1]: https://iris.fun/

LeoPanthera2y ago

Oof, $20/month is a lot, when I already have my own OpenAI API key.

kssreeram2y ago

I guess having to enter the API key is not a great user experience for regular people who aren’t developers.

1 more reply

mdrzn2y ago

I wish there was something like this for Windows!

netika2y ago· 3 in thread

Such a shame it uses Vision API, i.e. it can not be replaced by some random self-hosted LLM.

ralfelfvingOP2y ago

It can be replaced with a self-hosted LLM, simply change the code where the Vision API is being called. That's true for all of the API calls in the app.

freedomben2y ago

Actually it's open source, so it can be replaced by some random self-hosted LLM

iandanforth2y ago

For example, one of these:

https://opencompass.org.cn/leaderboard-multimodal

hackncheese2y ago· 2 in thread

ralfelfvingOP2y ago

There's a way for sure, it's just not implemented. Allowing for more configurability of the window(s) is on my list, because it annoys me too! :)

hackncheese2y ago

Annoyance Driven Development™

poorman2y ago· 2 in thread

Currently imagining my productivity while waiting 10 seconds for the results of the `ls` command.

ralfelfvingOP2y ago

It's a basic demo to show people how it works. I think you can imagine many other examples where it'll save you a lot of time.

hot_gril2y ago

The demo on Twitter is a lot cooler, partially because you scroll to show the AI what the page has. Maybe there's a more impressive demo to put on the GH too?

zmmmmm2y ago· 2 in thread

    $ ls 
    a.txt b.txt c.txt

    $ # AI: concatenate these files and sort the result on the third column
    $ #....
    $ # cat a.txt b.txt c.txt | sort -k 3

This already works brilliantly by just pasting into CodeLLaMa so it's purely terminal integration to make it work. All i need is the rest of life to stop being so annoyingly busy.

paulmedwards2y ago

    $ qq concatenate all files in the current directory and sort the result on the third column
    cat * | sort -k3

zmmmmm2y ago

CodeLLama with GPU acceleration on Mac M1 is almost instant in response, its really compelling.

1 more reply

I_am_tiberius2y ago· 2 in thread

I would love to have something like this but using an open source model and without any network requests.

dave1010uk2y ago

LLaVA, Whisper and a few bash scripts should be able to do it. I don't know how helpful the model is with screenshots though.

1. Download LLaVA from https://github.com/Mozilla-Ocho/llamafile

2. Run Whisper locally for speech to text

3. Save screenshots and send to the model, with a script like https://til.dave.engineer/openai/gpt-4-vision/

trenchgun2y ago

Probably in three months, approximately.

smcleod2y ago· 2 in thread

Nice project, any plans to make it work with local LLMs rather than "open"AI?

ralfelfvingOP2y ago

Thanks. Had no plans, but might give it a try at some point. For me, personally, using OpenAI for this isn't an issue.

hmottestad2y ago

Do you know of a simple setup that I can run locally with support for both images and text?

d4rkp4ttern2y ago· 2 in thread

ralfelfvingOP2y ago

If you're ok with it, you can use the mobile app -- it supports voice. Then you just have the same chat/thread open on your computer in case you need to copy/paste something.

d4rkp4ttern2y ago

quinncom2y ago· 2 in thread

I’d love to see a version of this that uses text input/output instead of voice. I often have someone sleeping in the room with me and don’t want to speak.

ralfelfvingOP2y ago

Added the text input option today.

ralfelfvingOP2y ago

You're not the first to request. Might add it, can't promise tho.

lordswork2y ago· 2 in thread

This looks very cool. Does anyone know of something similar for Windows? (or does OP intend to extend support to Windows?)

ralfelfvingOP2y ago

Hey, OP here. I don't have a Windows machine so have not been able to confirm if it works, and probably won't be able to develop/test for it either -- sorry! :/

I suspect that you should be able to take my code and only require a few tweaks to make it work tho, shouldn't be much about it that is macOS only.

coolspot2y ago

For testing/development, you can download a free Windows VM here: https://developer.microsoft.com/en-us/windows/downloads/virt...

qirpi2y ago· 2 in thread

Awesome! I love it! I was just about to sign up for ChatGPT Plus, but maybe I will pay for the API instead. So much good stuff coming out daily.

How does the pricing per message + reply end up in practice? (If my calculations are right, it shouldn't be too bad, but sounds a bit too good to be true)

ralfelfvingOP2y ago

But I also pay for ChatGPT Plus, and it's sooo worth it to me.

If you'd like to skip Plus and use something else, I don't think my project is the right one. I'd STRONGLY suggest you check out TypingMind, the best wrapper I've found: https://www.typingmind.com/

qirpi2y ago

Wow, thanks for sharing that link, I've been looking for something like this :)

havkom2y ago· 2 in thread

A lot of negative comments here. However, I liked it!

Perfect Show HN and a great start of a product if the author wants to.

ralfelfvingOP2y ago

Thank you, it's my first GH project & Show HN.. and.. yeah.. learning here :D

jonplackett2y ago

Also think this is fun.

In general I’m pretty excited about LLM as interface and what that is going to mean going forward.

I think our kids are going to think mice and keyboards are hilariously primitive.

1 more reply

faceless32y ago· 1 in thread

Wrote some similar scripts for my Linux setup, that I bind with XFCE keyboard shortcuts:

https://github.com/samoylenkodmitry/Linux-AI-Assistant-scrip...

I find myself rarely using them however.

ralfelfvingOP2y ago

Nice!

Art96812y ago· 1 in thread

Make sure to set OpenAI API spend limits when using this or you'll quickly find yourself learning the difference between the cost of the text models and vision models.

EDIT: I checked again and it seems the pricing is comparable. Good stuff.

ralfelfvingOP2y ago

I think a prompt cost estimator might be a nifty thing to add to the UI.

Right now there's also a daily API limit on the Vision API too that kicks in before it gets really bad, 100+ requests depending on what your max spend limit is.

rchaves2y ago· 1 in thread

ralfelfvingOP2y ago

ukuina2y ago· 1 in thread

This is very cool! Thank you for working on it and sharing it with us.

ralfelfvingOP2y ago

Thank you for checking it out! <3

qup2y ago· 1 in thread

ralfelfvingOP2y ago

behat2y ago· 1 in thread

Thanks for sharing!

ralfelfvingOP2y ago

dekhn2y ago· 1 in thread

ralfelfvingOP2y ago

It would be really cool, and I think we're not very far away from this being something you have on your phone.

pyryt2y ago· 1 in thread

Do you have use case demo videos somewhere? Would be great to see this in action

ralfelfvingOP2y ago

There's one at 00:30 in this YouTube video (timestamped the link): https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s

stephenblum2y ago· 1 in thread

You made real-life Clippy! for the Mac. This would be great to be for other mac apps too. Add context of current running apps.

ralfelfvingOP2y ago

It should work for any macOS app. It just takes a screenshot of the currently active window, you can even append the application name if you'd like.

jamesmurdza2y ago· 1 in thread

Have you thought about integrating the macOS accessibility API for either reading text or performing actions?

ralfelfvingOP2y ago

E.g. in Photoshop: "How do I merge all layers" --> "To merge all layers you can use the keyboard shortcut Shift + command + E"

If you can get that response in JSON, you could prompt the user if they want to take the suggested action. I don't see myself using it very often, so didn't think much further about it.

spullara2y ago· 1 in thread

Did you not find the built-in voice-to-text and text-to-speech APIs to be sufficient?

ralfelfvingOP2y ago

Didn't even think of them to be honest.

satchlj2y ago· 1 in thread

It's not working for me, I get a "Too many requests" http error

ralfelfvingOP2y ago

Make the console.log:s for the three API calls a bit more verbose to find out which call is causing this, and if there's more info in the error body.

mdrzn2y ago· 1 in thread

Very cool, would love to have a Windows version of this.

ralfelfvingOP2y ago

knowsuchagency2y ago· 1 in thread

This is brilliant!

ralfelfvingOP2y ago

Glad you liked it!

Jayakumark2y ago· 1 in thread

Was following these two projects by someuser on Github which makes similar things possible with Local models. Sending screenshot to openai is expensive , if done every few seconds or minutes.

https://github.com/KoljaB/LocalAIVoiceChat

While the below one uses openai - don't see why it can't be replaced with above project and local mode.

https://github.com/KoljaB/Linguflex

ralfelfvingOP2y ago

I went for OpenAI because I wanted to build something quickly, but you should be able to replace the external API calls with calls to your internal models.

jackculpan2y ago· 1 in thread

This is awesome

ralfelfvingOP2y ago

Thanks, glad you liked it!

amelius2y ago· 1 in thread

Please include "OpenAI-based" in the title. (Now many people here are disappointed).

ralfelfvingOP2y ago

Fair point, didn't think it would matter so much. Can't edit it any more, otherwise I'd change it to add OpenAI to the title!

krschacht2y ago

I love it! I’ve been circling around a similar set of ideas, although my version integrates with the web-based ChatGPT:

https://news.ycombinator.com/item?id=38244883

There are some pros and cons to that. I’m intrigued by your stand-alone MacOS app.

fake-name2y ago

> Open Source

> off to OpenAI Vision

Pick one

nbzso2y ago

LeoNatan252y ago

“macOSpilot runs NodeJS/Electron”

Lost me.

j / k navigate · click thread line to collapse