For that reason I think this will be less appealing to developers than GitHub may think; otherwise I think it's a cool idea.
For the average dev, I agree this is more of a novelty.
Maybe this will be different, and that'd be neat. Though I just think more expressions of code are neat. I also know the accessibility you're talking about isn't for blindness.
That being said I can talk about code decently well, but if you've never heard code come out of text-to-speech, well, it's painful.
I bring up the text-to-speech because if speech is input, it would make sense for speech to also be the output. Selfishly, getting a lot of developers to spend time coding through voice might end up with some novel and well thought out solutions.
I bet if we use our imaginations, we’ll think of a lot of places where using voice to code could come in handy.
Personally, I’ve been waiting for it for a few decades.
The creator of Tcl has RSI and has been using voice since the late 1990s:
https://web.stanford.edu/~ouster/cgi-bin/wrist.php
Thought we were really close 10 years ago when Tavis Rudd developed a system:
GitHub seems to be more high-level. It figures out the syntax and what you actually want to write.
This would help if you barely knew the language.
Time to learn Rust or Scala with a little help from machine learning.
If Copilot is any indicator of effectiveness, then I have high hopes for this! I've always wanted to program while stationary biking :)
If the voice analysis was any good, of course. But maybe it will be better than typical voice analysis because the syntax is limited: when programming I use a much more limited vocabulary than when writing literary criticism. So while speech to text is total crap for handling complex literary phrasing, it might be adequate for programming structures.
I found that for general subjects it was quite difficult to use because of the fairly poor recognition rate.
But when I talked about computers, it got almost everything right. I assumed it must have been trained by the developers, who talked about computers mostly.
This is another special purpose vocabulary, so it seems as if it would have a good chance of a high recognition rate.
> I could never be productive programming like this.
It's likely to work much better than a generic speech-to-text model due to fine-tuning.
Plus, consciously or not, we will adapt our human language to the English-ML "pidgin" (e.g. by introducing more efficient grammatical structures, using a specific subset of vocabulary).
The way I see it is that it's not much different from giving commands to your dog, writing a Google query, writing a Stable Diffusion prompt. It'll get better. Manual input is not as fast as speech though and that's where I see the issue.
I imagine that voice to code would be like standing over the shoulder of a junior coder who knows the syntax and some techniques just well enough to follow orders but has no idea what they're doing, and who, when they get it wrong, will be very wrong.
This not only holds for literature but also for programming. Concerning the hard part, I would argue that is the reason why it is not called "talking is thinking".
Even though the speech recognition rate is now really high, I wonder how many authors use speech to write articles. The comparison may make sense. And I think there are few.
I.e., when you're managing your house you want something that can be communicated in an infinite number of ways, but the "AI" accepts a tiny finitude of ways.
However, when programming it seems like we aren't asking the machine to "write a function to do X", but rather saying, "def open-paren star args...."
This seems like a pretty trivial problem to solve.
Click the link first and take a look at what is being showcased, because your comment is the exact opposite of what they demo when you visit the HN link.
However, I too really doubt there are any better use cases than simple tasks, not to mention that everyone in the office would hear what you ask the AI to do. Oh my! How embarrassing!
VERY PROMISING, in any case you can just manually fill the gaps with the keyboard!
If machines were amazing at Speech-to-Text, okay, sure. But while the capabilities are impressive, they still kinda suck at it.
Yet, it hasn't stuck. I'm exclusively using Siri to set timers. Most people are like me, or don't use it at all. Some use assistants for googling factoids or something. Fidelity wise, it's really underwhelming.
It's not a social acceptance issue, because people would still use it at home, and they don't. There's a small chance there's some key UI insight missing (discoverability for one), but I doubt it. Even with perfect UI, natural language is quite flawed when you're dealing with technical details (see exhibit on variable naming).
Anyway, the chances of GitHub solving this in an exceptionally difficult subdomain, as a side project, seem like a... Let's say, long shot.
That said, the silver lining in all these billions spent on voice interfaces is accessibility. For some people, these things are a life saver.
Because in my experience it is very often like "Call Peter" -> "Today it's sunny in NY".
On macOS it still seems pretty good - I have carpal tunnel syndrome and by Thursday or Friday most weeks I end up using Siri to dictate not code but a lot of conversations in Slack, pull requests, iMessage, etc. In fact, I wrote this reply with Siri right now.
But yeah, something about talking to a device which gets things wrong all the time is ridiculously distracting, at least for me.
Sometimes I look back at the road after trying to work out what it interpreted and I feel scared how focused on the phone I became.
Code is much more constrained by language syntax though.
Even for the "call peter" example, while the input is easy, the expected range of inputs that Siri should handle and be able to differentiate it from is huge.
Of course this is still a problem for e.g. defining variable names, where you could say anything.
Are either of those companies investing particularly heavily into voice agents? Certainly neither of them has anywhere near the kind of power of something like Copilot.
Also, a general agent is way different from one that's specific to writing code.
I would totally enjoy being able to tell my IDE to "call foo with bar and string hello there end string with a block of gee times two" or something, instead of:
foo(:bar, "Hello there") { |gee| gee * 2 }
Just that, not having to think about typing different symbols would be a serious quality of life feature for me.
Why?
Can't really see myself working like this in an office, plane, cafe, with music on (my favorite way to code), in the house where my partner is also working. Then as others have said, editing might suck.
If it was a neural link then I'd be in agreement.
The hard part will be open plan offices.
It’s bad enough that so many meetings are now zoom/teams and proximity to coworkers means you end up hearing their side of their meetings.
Just wait until all the devs are coding this way too.
"USER!! UNDERSCORE LIMIT!! EQUALS TWO THOUSAND AND FORTY EIGHT!"
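For what it's worth, that kind of literal dictation is easy enough to sketch. Below is a minimal Python sketch, assuming a tiny made-up symbol table and my own function names (nothing here comes from the GitHub demo), that turns a spoken phrase like the one above into code text by mapping symbol words and number words.

```python
# Hypothetical spoken-symbol table; a real system would need a far larger one.
SYMBOLS = {"underscore": "_", "equals": " = ", "dot": "."}

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: (i + 2) * 10 for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
SCALES = {"hundred": 100, "thousand": 1000}


def _is_number_word(word):
    return word in UNITS or word in TENS or word in SCALES


def words_to_number(words):
    """Convert number words (up to the thousands) to an int."""
    total = current = 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        # "and" is filler and is skipped
    return total + current


def dictation_to_code(utterance):
    """Map symbol words and number words to code text; other words pass through."""
    words = utterance.lower().split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in SYMBOLS:
            out.append(SYMBOLS[word])
            i += 1
        elif _is_number_word(word):
            # Consume the whole run of number words, allowing "and" inside it.
            j = i
            while j < len(words) and (
                _is_number_word(words[j])
                or (words[j] == "and" and j + 1 < len(words)
                    and _is_number_word(words[j + 1]))
            ):
                j += 1
            out.append(str(words_to_number(words[i:j])))
            i = j
        else:
            out.append(word)
            i += 1
    return "".join(out)


print(dictation_to_code("user underscore limit equals two thousand and forty eight"))
```

Note that this glues plain words together without separators and has no way to undo a mistake, which is pretty much the pain point people are describing in this thread.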
Why?
The problem with speech to code has always been that precise syntax is hard, but AI codegen solves that.
So, no, it might not take off, but I feel like if it does, then it means ai-codegen will become the dominant way code is crafted.
That would be paradigm shifting.
It’s inconceivable that it wouldn’t be.
Usually this kind of exploratory work involves a lot of Googling and copy-pasting snippets from Stackoverflow without putting too much time in trying to deeply understand things. If you get out what you want - great, if not, back to Google.
Then it seemed to just die off. I don't think it was bad technology, because I don't think novelty value was enough to account for its popularity - you had to put hours in to get it to work well, it wasn't a casual toy.
What's changed since then in terms of technology? Unless it's very significant, I suspect it will go the same way. Apart from an assistive technology viewpoint, my gut instinct is that it's not that satisfying or rewarding talking to a computer all day.
The story of the most mainstream-popular dictation software is kind of funny. Back in the late '90s there was Dragon NaturallySpeaking and IBM's ViaVoice. In the early '00s, after a financial fraud and bankruptcy involving both the then-current Dragon owners and Goldman Sachs, they got bought by ScanSoft. ScanSoft bought Nuance, began to use its name, and then got exclusive rights for ViaVoice (!) from IBM.
Now, in March this year, Nuance has been acquired by Microsoft.
Training data is now abundant compared to twenty years ago, and so is computation power. That means training can be much more complex now.
The underlying technology is now typically neural networks (broadly speaking), whereas twenty years ago it might have been Hidden Markov Models.
Overall, recognition quality, even without speaker-specific training, is now on a very different level than back then. Whether it’s considered good is a matter of opinion. But it’s significantly better than twenty years ago.
It's hugely significant - look at this graph of Google's speech model accuracy across 2013 to 2017:
https://sonix.ai/packs/media/images/corp/articles/history-of...
Or this that shows a similar pattern:
https://cdn.static-economist.com/sites/default/files/externa...
1) Improvements in speech to text, as others have mentioned
2) Improvements in language models (and model size) allowing for more flexible interpretation of speech. This isn't dictation anymore. It's more like instruction. You don't have to tell the computer exactly what to write, you tell it in much broader terms. E.g. "pull this out into a function". Or "delete the cookie before creating the transaction". Or "lint the file".
That's my guess anyways! This mostly feels like a voice interface to Copilot in a lot of ways. Can't say whether it'll be effective, but I'd love to be able to program while I'm e.g. on a stationary bike!
Can I have transcription that can then turn my rambling into neat and concise prose?
Ironically, I think only technical people would even want to do something like that. The less technical you get, the more high level (and ambitious) you need to go.
You can see this a lot in game dev questions. Beginner questions will be "How do I make an MMORPG?" and the more advanced questions will be "how do I return x from y" or whatever, and then it scales between the two ends of the spectrum.
As a simpler problem space, building programs from UML charts was one of Java's promises, and it failed miserably, not because the technology was lacking, but because it's just a damn hard problem.
As of now we have nothing approaching "non technical people to be able to instantly create things" if the "things" you want are useful in any way.
I'd encourage anyone who hasn't tried it to give Copilot a try. There really has never been anything like it in my memory, and while I totally agree there have been dozens of efforts to allow non-technical people to generate code, I think Copilot may be on to something very special.
You have to read code and debug it; that is inevitable. You can't say that there will not be any bugs if you use voice instead of writing.
The crazy thing is... this probably will work.
In 20 years. But, it probably will work.
There is absolutely no reason you cannot use a neural network to transcribe appropriately phrased requirements into an AST.
1. How do you check the output of the voice to code step? If you need as much expertise as you do now to actually review the code, then the voice to code step is just a layer that adds confusion.
2. How would debugging work? Again, would you still need to be able to understand the code? Same issue.
3. What if you have to pause and think? This will affect how the voice to code interface interprets your speech.
4. How would you make a precise edit to your source audio using a voice interface?
5. How would you make changes which touch multiple components across the project? How would you coordinate this?
6. Precisely defining interfaces between components and using correct references to specific symbols is very difficult to do in natural speech, which typically uses context to resolve ambiguous references. The language you would be using would still have to resemble the strictness of a programming language even when spoken, but you have replaced a reliable, checkable channel (input through keyboard, transfer as-is to text buffer, feedback from visual view of source) with an unreliable channel (input through microphone, transfer through complex signal processing and multiple neural network language models, through multiple representations, where you have to be able to check multiple representations for feedback about the structure of your program (initial speech-to-text step, text to source)).
But now, given how magical this thing is, it opens many doors to what's possible with no code.
I never really believed in anything no code before, apart from Excel and RAD.
But basic tasks are going to get accessible to a lot of people sooner than I expected.
The problem with current voice programming systems is they're just too slow, so I end up getting impatient and using my fingers anyway.
Could it also do a virtual keyboard, but a custom layout to not trigger arm, elbow and hand pains?
There are only two ways to do this effectively, and unfortunately no one has taken the true path to accessibility. The more common way is plugins/extensions to grab information from the editor.
Accessibility is more than just one editor. It's the OS and all the applications. Microsoft needs to take the hard route to make an accessibility UI automation server to grab that information and only make up the difference through plugins as needed.
It's all about grabbing information from the application and generating on the fly commands, not just parsing free dictation in order to get the best accuracy.
It takes a lot of expertise to make any sort of UI automation fast and efficient for navigating and selecting text or out-of-focus menu items.
I've fussed around and managed to get tree sitter to navigate across code. For example generic commands are like 'next function'. Code simply isn't pronounceable when it's written by others. Therefore, navigating across generic tokens is really the best method. Then other methods can be used for fine navigation if needed.
My hope is that they develop a grammar system that is open source and integrates with accessibility frameworks focused on performance.
I wish I could have a phone call with the development team.
“Top of file
Down 5 lines
Modify import source to …
inside first class
Down 5 methods
Insert new method after
Inside arg list
Append arg named … of type …
…
…”
And add a way to identify types and parameters with special pronunciation.
Accessibility accessing the content of the application and the context is what's important. It's more important than the speech recognition backend.
Speech recognition works best with a narrow context (when those commands are available).
The type of performance we need as a speech recognition community and screen reader community is quite high. By the beginning of speech and just before decode time information needs to be available to be parsed for navigation/editing. That way these tokens can be weighted as commands for recognition.
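To make the weighting idea concrete, here is a minimal Python sketch. It assumes a hypothetical recognizer that hands back an n-best list of (text, log score) pairs, and the boost value is made up; real engines expose contextual biasing in their own ways, so treat this as an illustration only.

```python
def pick_hypothesis(hypotheses, active_commands, boost=2.0):
    """Return the best hypothesis text after boosting currently valid commands.

    hypotheses: list of (text, log_score) pairs from a hypothetical recognizer.
    active_commands: set of phrases that are valid in the current editor context.
    """
    def adjusted(item):
        text, score = item
        return score + (boost if text in active_commands else 0.0)

    return max(hypotheses, key=adjusted)[0]


n_best = [("text function", -2.8), ("next function", -3.1), ("necks function", -4.0)]

# With editor context available, the valid command wins despite a lower raw score.
print(pick_hypothesis(n_best, {"next function", "next parameter"}))
# Without context, the raw top hypothesis is returned.
print(pick_hypothesis(n_best, set()))
```

The point is that the set of active commands has to be known before decode time, which is why fast access to the application's state matters more than the recognition backend.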
Commands could be modeled after vim functionality though.
Outside of tree sitter, it would be interesting to hook into a language server as a client. However, I think they only expect one client. In addition, I still see that as a lesser approach without dedicated support for a high-performance UI automation server for the speech recognition engine to leverage.
Imagine even more precise commands 'next function' followed by a letter. That allows you to navigate to only a function with that letter defined. Really the possibilities are endless when we have complete context of the screen and the structure of the code.
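A rough sketch of what 'next function' plus a letter could look like. I'm using Python's built-in ast module here as a stand-in for tree sitter so the example stays self-contained; the function name and the sample source are my own assumptions.

```python
import ast


def next_function(source, current_line, starts_with=None):
    """Return (name, line) of the first function defined after current_line.

    If starts_with is given, only functions whose name begins with it match.
    Returns None when there is no such function.
    """
    tree = ast.parse(source)
    functions = sorted(
        (node.lineno, node.name)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    for line, name in functions:
        if line > current_line and (starts_with is None or name.startswith(starts_with)):
            return name, line
    return None


SOURCE = """\
import os

def alpha():
    pass

def beta():
    pass

class Box:
    def bravo(self):
        pass
"""

print(next_function(SOURCE, 1))        # 'next function' from the top of the file
print(next_function(SOURCE, 3, "b"))   # 'next function b' from inside alpha
```

Because the command works on syntax nodes rather than on spoken identifiers, you never have to pronounce names that someone else wrote.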
Someday I hope for the release of something like Stable Diffusion for voice coding. An open, complete pipeline that users can iterate on fast and innovate!
Inputting code with voice is generally difficult, often due to variable names, casing, punctuation etc being hard to get right in voice-to-text. I think this might help quite a lot with that.
_However_, some of the hardest things in voice coding aren't necessarily just the input. Navigating large codebases is hard, and particularly editing existing code can be extremely difficult, probably much more difficult than just inputting new code.
I have my doubts, given the demonstration shown here, that it's able to make complex editing tasks simple, but if it does - I cannot overstate how huge of a leap forward it is.
[0]: https://www.gustavwengel.dk/state-of-voice-coding-2017/ [1]: https://www.gustavwengel.dk/state-of-voice-coding-2019/
I'm curious, why have you done this?
However, the average developer doesn't need those fine-grained navigation controls but can still benefit from enhanced input through voice. Some have mental disabilities and interface differently. Others simply supplement their input by voice as a preventative measure against RSI. The hope is to one day develop something that every developer could see the value in and leverage. In a way, accessibility is for everyone.
In general I see accessibility as a hierarchy that could benefit everyone. Accessibility APIs, close to real time OCR, Eye tracking, alternative inputs (eg pedal, touch pad, stylus) allowing for the broadest possible input and APIs to extract information from applications. Extraction of information from applications and input to applications allows the user to specialize for their use case.
In my experience, as people become experts in voice, their command vernacular shortens as they carve out their niche use case. It goes beyond singular shortcuts to series of actions to get stuff done. However, what really needs to happen is that voice systems get access to the OS and to the application to really shine. That would empower not only navigation for those who are disabled but also context-specific commands that are intuitive and abstracted, like 'next function' or 'next parameter'.
I'm bullish.
Copilot is getting better every day, because it's learning from the way we are using it.
So all the more power to them, but I am very skeptical. Especially since Copilot has zero knowledge of the formal semantics of programming languages.
This is a lot different than the half ass auto complete that it already does since that at least has some context.
/*
this function transforms this json from:
{
... some complex structure in json
}
to this json:
{
... some different structure in json
}
*/
... Copilot comes up with the function that takes in the first and spits out the latter. Even if the field names do not match etc., it usually 'guesses' right what fits on what (so it does have some context from its learning phase about what 'looks alike' or 'might be the same thing'). Example: I had a structure with firstName: string, lastName: string and a target structure with name: string; it just did name: firstName+' '+lastName, which was indeed what I wanted. But it comes up with more intricate stuff as well that is pretty much surprising (too human, basically).
Another bonus: if you generated function transfromAtoB(a: A) above, then you only have to do:
/*
do the reverse of function transfromAtoB, accept json structure B as input and return structure A
*/
And it'll come up with the reverse.
It's not hard to write yourself, but it's boring and error-prone (some of these structures are huge). Now I press tab a bunch of times, and run the tests to see if it worked. I am also not that worried I'm infringing someone's open source code; this is all way too custom to look like anything else. That's where this shines; for things where it verbatim copies something, you should've been using a library anyway.
Static typing with TypeScript definitely works better than other combinations I have tried (C# was pretty bad last I tried it, JS is good but often subtly wrong because of type issues).
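To illustrate the firstName/lastName case, here is a minimal sketch of such a transform pair, written in Python rather than TypeScript to keep it short. The function names and the splitting rule in the reverse direction are my own assumptions, not what Copilot actually generated.

```python
def transform_a_to_b(a):
    """Structure A (firstName, lastName) -> structure B (name)."""
    return {"name": f"{a['firstName']} {a['lastName']}"}


def transform_b_to_a(b):
    """Reverse direction: split on the first space.

    Assumption: the first name contains no spaces, otherwise the round trip is lossy.
    """
    first, _, last = b["name"].partition(" ")
    return {"firstName": first, "lastName": last}


a = {"firstName": "Ada", "lastName": "Lovelace"}
b = transform_a_to_b(a)
print(b)
print(transform_b_to_a(b) == a)
```

The reverse direction is exactly where a plausible-looking guess can be subtly wrong, so a round-trip test like the last line is worth running on generated code.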
That sounds exhausting when we have spent countless human years developing languages which let us communicate our intentions precisely to computers.
If you don't do this there is no ambiguity detector. Meaning it's entirely possible for the computer to interpret what you are saying completely differently than intended, yet in a perfectly valid way. So the only one who can judge whether it got it right is you.
If err unequal nil opening bracket, no no don't open the racket opening bracket... BRACKET, do you know what a bracket is No don't do a do while, delete delete. Don't delete everything... sigh
Well something like that, I imagine it being a very painful experience.
"for (int i = 0; i < count; i++)"
you could say something like
"for beep int i click zero boop i bop count boop i pop pop zing"
This would achieve the same thing, but much faster and with less effort than typing.
It does look like we've made some progress in the 15 years since. I do wonder how this would work in an office setting though - so much noise, so much distraction, and so much crosstalk between programmers...
Everyone gets a throat mic and the cubicle farm is full of unintelligible whispering instead of clacking of keyboards? Can't wait for the future. /s
Let’s hope that I never get in a serious accident or get a disabling disease, but if I do I am not planning on giving up coding. What would you do if you lost your hands, or became permanently paralyzed? This is the tool we need to combat that. Hats off to GitHub on this one.
1) import matplotlib.pyplot as plt
Why "as plt"?! Let the import alone. But this is a matter of style.
2) Get titanic csv data from the web [...]
Surprise, it turns out that "the web" is a URL on raw.githubusercontent.com. Hopefully I'll be able to spell a URL of my choice.
3) clean records from titanic data where age is null
Somehow I already know that there is an Age field, and somehow it knows that it must capitalize age into Age.
4) fill null values of column Fare with average column values
The generated code looks great but somehow I managed to spell a capitalized Fare this time :-) (this is probably a typo in the demo)
5) Hey, GitHub! New line
Inserting a new line can't take so many words. We're going to do without new lines or rely on a formatter or something equivalent.
6) plot line graph of age vs fare column
This is where it becomes evident that there was no need to import as plt because I'm not pressing those keys anyway. But this is style and it's going to be uniform across all the users of these tools.
7) Hey, GitHub! Run program
Good.
Considerations:
A) Why do commands (new line, run) need "Hey, GitHub!", which is pretty long and terrible to repeat all day long (just imagine having to say Hey Joe every time we have to say a sentence to Joe, within a long conversation with Joe), and text-to-code doesn't?
B) We got a graph at the end. Now what should I do to edit the code in those 99% of cases where I got the graph wrong? An acceptable answer could be mouse and keyboard. It's a little underwhelming but voice to code already gave me the structure of the code.
C) Does that mean that Microsoft and GitHub are going to know all the closed source code we'll write for our customers (there might be contractual implications) or is this something that will be self hosted in our machines?
Hope this is helpful :)
“Go to line 35” “Open the model controller” “Show the get method and set method side by side”
The voice part seems like an (albeit important) accessibility add on.
I'm sure it won't be perfect but an amazing step forward in the evolution of programming languages
I'm thinking about this in terms of the navigator-pilot pair programming approach, and believe that as a senior, if it's even half-as-good as working with a fresh out of uni hire, then it could have real value. When there's a piece of code that I would like written, when I have good test cases in mind, but would prefer to offload it on someone, I could perhaps write the test cases and function signatures (maybe with the bot's help), get the bot to fill in the blanks until it passes the tests, and then give it direct feedback on how to refactor the code.
I've signed up for the waiting list and am excited to try this out.
GitHub Blocks – waiting list signup - https://news.ycombinator.com/item?id=33537706 - Nov 2022 (41 comments)
GitHub code search – waiting list signup - https://news.ycombinator.com/item?id=33537614 - Nov 2022 (48 comments)
A good HN discussion needs more than a waiting list signup. A good time to have a thread would be when something is actually available.
I have been playing with using Whisper + GitHub Copilot in Vim [0]. The Whisper text transcription runs offline with a custom C/C++ inference and I use Copilot through the copilot.nvim plugin for Neovim. The results were very satisfying.
Edit: And just in case there is interest in this, the code is available [1]. It would be very awesome if someone helps to wrap this functionality in a proper Vim plugin.
[0] https://youtu.be/3flN9kTcZJY
[1] https://github.com/ggerganov/whisper.cpp/tree/master/example...
FEATURES
Write/edit code
Just state your intent in natural language and let Hey, GitHub! do the heavy lifting of suggesting a code snippet. And if you don't like what was generated, ask for a change in plain English.
Go to the next method
Code navigation
No more using mouse and arrow keys. Ask Hey, GitHub! to...
go to line 34
go to method X
go to next block
Control the IDE
"Toggle zen mode", "run the program", or use any other Visual Studio Code command.
Code Summarization
Don’t know what a piece of code does? No problem! Ask Hey, GitHub! to explain lines 3-10 and get a summary of what the code does.
Explain lines 3 - 10
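As a rough idea of how the fixed navigation commands in that list could be recognized once the speech is already text, here is a minimal Python sketch. The patterns and intent names are hypothetical and are not how Hey, GitHub! is implemented; the free-form "state your intent" part is a much harder problem than this.

```python
import re

# Hypothetical command grammar: (pattern, function building a structured intent).
PATTERNS = [
    (re.compile(r"go to line (\d+)", re.IGNORECASE),
     lambda m: ("goto_line", int(m.group(1)))),
    (re.compile(r"go to method (\w+)", re.IGNORECASE),
     lambda m: ("goto_method", m.group(1))),
    (re.compile(r"go to next (\w+)", re.IGNORECASE),
     lambda m: ("goto_next", m.group(1))),
    (re.compile(r"explain lines (\d+)\s*-\s*(\d+)", re.IGNORECASE),
     lambda m: ("explain", int(m.group(1)), int(m.group(2)))),
]


def parse_command(utterance):
    """Return a structured intent tuple, or None if no fixed command matches."""
    text = utterance.strip()
    for pattern, build in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return build(match)
    return None


for phrase in ["go to line 34", "go to method X", "go to next block",
               "Explain lines 3 - 10", "toggle zen mode"]:
    print(phrase, "->", parse_command(phrase))
```

Anything that falls through to None is where the language model would have to take over.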
"insert curly brace", "insert semicolon", "insert insertion", etc. does not sound too fun.
https://www.theverge.com/2022/11/8/23446821/microsoft-openai...
For me, I got an ergonomic keyboard before my wrists went bad, and so far they seem to be holding up!
Moral of the story: get a good keyboard early, or you might need a tool like this one someday!
I have looked at some tools for the blind, but you need just way too much dedication for it to work for you, and since you have working eyes it is usually easier to just open your eyes.
https://www.youtube.com/watch?v=YKuRkGkf5HU
The demos are in Ruby, but I could imagine that languages with strong type-aware auto-completion could be easier to do.
I don’t think the voice part is necessary. It’s easy enough to slap ASR on the front. But going from natural language -> full problem spec -> code is hard in the general case, but doable in well-understood domains. Why can’t Scotty talk to a computer? (https://youtube.com/watch?v=hShY6xZWVGE&feature=share)
I can see this helping as an accessibility tool, but beyond that I don’t think it will be useful. This kind of assumes you know everything about what you’re doing; most of the time you don’t.
https://numen.johngebbie.com/index.html
It's free software, it's local to your machine, you don't have to sign up for it, and it works today.
https://www.amazon.com/Destroy-Tech-Startup-Easy-Steps/dp/09...
But at the end of the book I struck an upbeat note, about how the technology was advancing quickly and within 3 or 4 years someone would achieve something much greater than our own limited successes.
But I was wrong. 7 years later I'm surprised at how little progress there has been. I don't see any startup that's done much better than what we did in 2015. Voice interfaces remain limited in accuracy and use.
If this is reliable I would pay to use it to some capacity, like add an argument.
But for a glimpse of the future watch The Expanse or read William Gibson's Agency.
Also, imagine you are sitting in an office with other teammates - what happens if all of them talk at once but are working on different projects? It will disturb others in terms of noise pollution.
But it will definitely be a fun project and might work perfectly when you are working alone from home.
:/
bool success equals user dot no i mean ah fuck stop stop quit