The pull quote is: The impression overall I got here is that this is somewhere around (OpenAI) o1-pro capability
In math it shares the top spot with o1 and is just a few points behind (well within errors). In creative writing it is basically ex-aequo with the latest ChatGPT 4o and in coding it's actually significantly ahead of everyone else and represents a new SOTA.
Andrej Karpathy: "I was given early access to Grok 3 earlier today" - https://news.ycombinator.com/item?id=43092066 - Feb 2025 (48 comments)
https://x.com/lmarena_ai/status/1891706264800936307
It's been said before but it is great news for consumers that there's so much competition in the LLM space. If it's hard for any one player to get daylight between them & the 2nd best alternative, hopefully that means one monopolistic firm isn't going to be sucking up all the value created by these things
It passed every goofy test I have for writing articles, which involves trying to surface arcane, obscure details. (It certainly means that however they are scraping the Web, they are doing a good job of it.)
It made the database code I wrote over the last week with o3/o1/GPT4o/Claude3.5 look like a joke.
It fills me with rage over who owns this thing.
Even if people tank Tesla's car business and run Twitter into the ground, I think our new Galactic Edgelord is going to win his first trillion on xAI and Teslabots anyway.
btw: it tried to charge me $40/mo for this thing: https://imgur.com/a/QXslgBo
This hype will burst sooner rather than later and will trigger yet another global recession. This is untenable.
This lame HN trope of LLMs having no business model needs to die.
>This hype will burst sooner rather than later and will trigger yet another global recession.
It seems too small a bubble to cause a global recession. And if it is a bubble at all, there is every reason to believe the strategy will work with significant probability.
The dot com bubble also gave us the most valuable companies in history, like Google, Apple, Amazon, Facebook, etc.
I quite like the idea of a future where the AI job holocaust largely never happened because license costs ate up most of the innovation benefit. It's just the kind of regressive greed that keeps the world ticking along, and I wouldn't be surprised if we ended up with something very close to this.
These things still cost me time because of hallucinations.
It matters if it is better than what you have.
If it breaks a cup but is 10x cheaper than a human, go figure.
Is it? Because it seems like a bunch of megacorps pirating every single copyrighted work available in digital format, spending an enormous amount of electricity (that is probably not 100% clean) to churn through them, and the end result is a bunch of parrots that may or may not produce accurate results so that spammers can more effectively fill the Internet with crap.
Two rich Russian guys meet and one brags about his new necktie. "Look at this, I paid $500 for it." The other rich Russian guy replies: "Well, that is quite nice, but you have to take better care of your money. I have seen that same necktie just yesterday in another shop for $1000."
I don't have that dog in me anymore, but there are plenty of engineers who do and will happily work those hours for 500k USD.
A variant of multi-modal LLMs may be the solution to self-driving cars, home robotics, and more.
I keep saying that to be a really effective driver, an AI model will need a theory of mind, which the larger LLMs appear to have. Similarly, any such model will need to be able to do OCR and read arbitrary street signs, and understand what the sign means. Most modern LLMs can already do this.
DeepSeek made the news because of how they were able to do it with significantly less hardware than their American counterparts, but given that Musk has spent the last two years telling everyone how he was building the biggest AI cluster ever, it's no surprise that they managed to reproduce the kind of performance other players are showing.
No matter what people say, they're all just copying OpenAI. I'm not a huge fan of OpenAI, but I think they're still the ones showing what can be done. Yes, xAI might have taken less time because of their huge cluster, but it’s not inspiring to me. Also, the dark room setup was depressing.
This again proves that OpenAI simply has no tech moat whatsoever. Elon's $97 billion offer for OpenAI last week was reasonable given that xAI already has something just a few months behind - it would probably be faster for xAI to catch up with o3 than to go through all the paperwork and lawyer talk required for such an acquisition.
Elon also has a huge upper hand here -
Elon and his mum are extremely popular in China; it would be easier for him to acquire Chinese AI engineers. He can offer xAI/SpaceX/Neuralink shares to the best AI engineers, who'd prefer some kind of almost guaranteed 8-figure return in the long run.
Good luck to OpenAI investors who still believe that OpenAI is worth anything more than $100 billion.
That is not an advantage in a race against Microsoft, Google, Meta etc. he's competing against all the biggest companies in the world in this race. He's not going to be able to outspend them if the economics look at all sensible.
So it could be that their success is mostly about taking an open and free thing and turning it proprietary.
Leaderboards don't care about cost. Leaderboards largely rank a combination of accuracy + speed. Anthropic has fallen behind Google in accuracy + speed (again, missing CoT), and frankly behind Google in raw speed.
Seems like the team at xAI caught up to OpenAI very quickly, taking the top spot on one of the benchmark leaderboards, and also caught up on features with Grok 3.
Giving credit where credit is due, even though this is a race to zero.
Maybe the best outcome of a competitive Grok is breaking the mindshare stranglehold that ChatGPT has on the public at large and with HN. There are many good frontier models that are all very close in capabilities.
This commit seems to indicate so, but neither HF or GH has public data yet:
https://huggingface.co/xai-org/grok-1/commit/91d3a51143e7fc2...
Edit: Answer from Elon in video is that they plan to make Grok 2 weights open once Grok 3 is stable.
Seeing awesome feedback from players on our demos (and seeing an insane amount of stickiness from players playing even small demos built around generative AI mechanics). Raising now. Hiring soon to move faster. Feel free to reach out - dru@chromagolem.com
If you don't get feedback from the people actually playing your game (or using your product), you will never get the improvement you need to help them.
You can have the most talented passionate people there are developing a product, but if it's not working for the people you want to sell it to, it's the wrong product.
Most tech products are terrible because those paying for them are not those that have to use them every day, or because they solve a corporate problem (compliance) and not a usability problem which is the actual need from the people on the shop floor.
Many big games/products are already built mostly on metrics, and that has proven to be a terrible way to work out what people 'want'. It's a great way to justify money decisions though, so it keeps happening (and games/products from big companies keep getting worse).
Combine the two and the potential for manipulation, suggestion, preference altering is through the roof.
We're still waiting for OpenAI to do the same. Even GPT-3, at least.
The exact details of OpenAI's models and training data are not fully disclosed, which can raise concerns about potential biases or vulnerabilities.
https://manifold.markets/SaviorofPlant/will-xai-open-source-...
I'm also skeptical of lmarena as there is a large number of Elon Musk zealots trying to pass off Grok as a proxy for Tesla shares.
I suppose you can take that to mean that people who do have access to the service should not expect much in terms of data protection.
Also, they will be open sourcing Grok 2, which is probably pretty behind at this point, but will still be interesting for people to check out.
I hate how it's the same story for every new AI technology. If someone can tell me who to vote for or where to protest to change this awful EU law, that would be great.
The Digital Markets Act is a bit of an overreach, but the AI Act is not.
It classifies AI into risk categories, so that it doesn't kill anyone, carelessly handle sensitive information, etc.
A chatbot can easily comply with it.
Well no. Mistral.ai
That's why they use their AI products as a leverage to turn European people against the laws that protect them from big tech. It's just blackmail.
(Assuming that is a reference to the Mussolini quote.)
* when I use chatbots as search engines, I'm very quickly disappointed by obvious hallucinations
* I ended up disabling github copilot because it was just "auto-complete on steroids" at best, and "auto-complete on mushrooms" at worst
* I rarely have use cases where I have to "generate a plausible page of text that statistically looks like the internet" - usually, when I have to write about something, it's to put information that's in my head into other people's heads
* I'd love to have something that reads all my codebase and draws graphs, explains how things work, etc. But I tried aider/ollama, etc., and nothing even starts making sense (is that an avenue to persevere in, though?)
* Once, I tried to write in plain English a situation where a team has to do X tasks in Y weeks, and I needed a table of who should be working on what for each week. I was impressed that LLMs were able to produce a table - the slight problem was that, of course, the table was completely wrong. Again, is it just bad prompting?
It's an interesting problem when you don't know if you're just having a solution in search of a problem, or if you're missing something obvious about how to use a tool.
Also, all introductory texts about LLMs go into many details about how they're made (NNs and transformers and large corpuses and lots of electricity etc...) but "what you can do with it" looks like toy examples / simply not what I do.
So, what is the "start from here" about what it can really do?
For coding, I use cursor composer to gather context about the existing codebase (context.md). Then I paste that into DeepSeek R1 to iterate on requirements and draft a high level design document, maybe some implementation details (design.md).
Paste that back into composer, and iterate; then write tests. When I'm almost done, I ask composer to generate me a document on the changes it made and I double check that with R1 again for a final pass (changes.md).
Then I'm basically done.
This is architect-editor mode: https://aider.chat/2024/09/26/architect.html.
I've found Cursor + DeepSeek R1 extremely useful, to the point that I've structured a lot of documents in the codebase to be easily greppable and executable by composer. Benefit of that is that other developers (and their composers) can read the docs themselves.
Engineers can self-onboard onto the codebase, and non-technical people can unstuck themselves with SQL statements with composer now.
Seriously, this is the only useful flow I've found for AI coding in general.
I have found similar when giving backstory and needing help to start structuring difficult conversations where I want to say the right thing but also need to be sensitive.
> she watched an operation that failed for me for entertainment, for instance
You make your own choices, but cutting a person like this off would be very reasonable
- Discussing the various stages of candymaking and their relation to the fundamental properties of sugar syrups, and which candies are crystalline vs amorphous. It turns out Junior Mints are fudge. Fondant is really just fudge. Everything is fudge, my god.
- Summarizing various SEC filings and related paperwork to understand the implications of an activist investor intervening in a company
- Discussing the relative film merits of the movie Labyrinth and other similar 80s kitsch movies. ChatGPT mentioned the phenomenon of "twin films" which was an interesting digression.
- Learning about various languages Tolkien invented and their ties to actual linguistics of natural languages and other conlangs
- Some dimensional analysis of volumes, specifically relating to things like "how many beans are in the jar" estimation, and what the min and max value of a particular weight of coins might be, in terms of both par value based on a standard coin mix and outliers such as old dimes that are pure silver.
- Discussion of quines in prolog and other languages, which resulted in a very interesting ChatGPT bug where it started recursing and broke when trying to write a prolog quine.
- Back of the envelope economic calculations around the magnitude of the housing deficit and the relative GDP cost for providing enough housing quickly enough to make an impact. Spoiler: it's probably unreasonably expensive to build enough houses to bring down housing prices by any significant degree, and even if we wanted to, there's not enough skilled workers.
- A number of podcasts transcribed. (I hate audio and meandering, so transcribed and summarized is perfect.) I could use whisper and a Python script to do this, but I'd rather let ChatGPT do the legwork, and it actually used a more modern model and method of processing than I would have naively used.
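The DIY route mentioned in that last item really is only a few lines; a minimal sketch, assuming the open-source openai-whisper package and a made-up file name:

```python
# Minimal sketch of the whisper-and-a-Python-script route the comment
# mentions skipping. Requires `pip install openai-whisper`; the file
# name is a placeholder.
import whisper

model = whisper.load_model("base")           # small, CPU-friendly model
result = model.transcribe("podcast_episode.mp3")
print(result["text"])                        # full transcript as a string
```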
I find Github Copilot to be a really great autocomplete. I frequently write the comment at the top of a function and hit tab and it writes the whole function. This is dependent on typescript and having a relatively standard codebase but I think those things are useful on their own. You really have to limit it in terms of scope and specifics, but it lets me think high level instead of worrying about syntax.
Fudge is made with milk - am I missing a joke?
I can feel the cold wind of the next AI winter coming on. It's inevitable. Computers are good at emulating intelligent behavior, people get excited that it's around the corner, and the hype boils over. This isn't the last time this will happen.
In reality, people doing "menial" jobs are smart, learn, and operate with a lot more nuance than people give them credit for, whether out of unfamiliarity or just prejudice. Do you prefer to talk to a chatbot or a real human when you have a problem? How confident are you, really, that even if the bot knows what the problem is, it would be able to solve it?
Lots of problems with customer care are anchored in the issue that support staff are not allowed to fix or resolve problems without escalation, or in attempts at keeping you from costing the company more money. The bot might be better at it for the company because it will frustrate you enough to give up on that 30 bucks refund, idk.
AI seems to change the dynamics of corporate jobs a lot, but I haven't yet seen anything that would be a game changer outside of them. It's great for searching a company's unorganised and messy knowledge bases.
In all corners, people are working on so many small pieces advancing what we have.
And plenty of obvious things are not here yet, like a full local dev cycle, aka the AI uses the IDE to change code, then executes it, fixes compiler issues and unit tests, and then opens a PR.
Local agents / agents having secure and selected access to our data, like giving my agent read-only access to my bank account and a two-factor way for it to send money.
Deepseek's reinforcement learning is also a huge new lead.
Nonetheless, in parallel, robots are coming too.
GenAI is getting better and better: faster, better, and cheaper video; 3D meshes and textures; the first GenAI ads.
>Computers are good at emulating intelligent behavior
Which implies that the brain is some kind of transcendent device that can backdoor physics to output incredible intelligence unique to its magical structure.
Maybe LLMs aren't the key, but as far as we can tell the brain is also just another computer.
It's a weak jack of all trades: it knows a fair amount about the sum of human knowledge (which is objectively super-human), but can't go deep on any one thing, and still seriously lags behind humans in terms of reasoning. It's an assistant that's all book smarts and no street smarts. Or maybe: it's a search engine for insanely specific things.
Rote work, as well. Things like porting an enum from one programming language to another: paste the source-language version into a comment and start it off with one or two lines in the target language. Dozens of tabs are surely faster than manual typing, copy-paste, or figuring out vim movements/macros.
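As a hedged illustration of the pattern (the enum and its values are invented):

```python
# Sketch of the enum-porting pattern: paste the source-language enum
# into a comment, type the first line or two of the target version,
# and let the autocompleter fill in the rest. The enum is made up.
#
#   enum Status { OK = 0, RETRY = 1, FATAL = 2 };   // original C code
from enum import IntEnum

class Status(IntEnum):
    OK = 0        # after one or two of these lines are typed by hand,
    RETRY = 1     # dozens more like them are usually tab-completed
    FATAL = 2
```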
They might be able to help stitch together modules (like sensor boards plugged into microcontrollers) and definitely can write code to get things going, but they fall flat on their face hard for any kind of bare bones electronics design. Like 5% success rate and 95% totally incorrect hallucinations.
Edit: if anybody knows otherwise, show me some evidence, don't just downvote. If these things are widely used, why are they impossible to find by searching? Why doesn't this electronics site know about them:
https://www.eeeguide.com/?s=cooperator
Why aren't they in any books when I did a full-text search on archive.org? Why doesn't Wikipedia know about them? Why aren't there threads about them on electronics forums?
I found them (through an image search) in exactly one place: educational training boards made in India by Jainco, like this one:
https://jaincolab.in/delta-modulation-and-demodulation
But this other one talks about a "ramp comparator" and then repeats the phrase but using "ramp cooperator" instead.
https://www.jaincolab.com/firing-circuit-using-ramp-comparat...
So I surmise it's an error and not a real thing.
For example asking it something like "I have an elixir app that is started with `mix ...` can you give me a Dockerfile to run it in a container?"
It can also do things like "Given this code snippet, can you make it more Pythonic" or even generate simple apps from scratch.
For example, a prompt like "Can you write me a snake game in HTML and JavaScript? The snake should eat hot dog emojis to grow longer." will actually generate something that works. You can see the generated code for that prompt at https://claude.site/artifacts/34540f88-965e-45ca-8083-040e30...
Following up with "Can you make it so that people can swipe on mobile to control the snake?" generates https://claude.site/artifacts/651e957a-9957-488c-ae6b-e81348... which is pretty good IMO for 30 seconds of effort.
It also has a surprisingly competent analysis mode where you can upload a CSV and have it generate charts and analyze the data.
It's not perfect, it'll sometimes get confused or generate some dubious code, but you can quickly get to a 90% good solution with 1% of the effort, which is pretty impressive IMO.
this is a good enough sell for me, and it's like sub 1-in-50 that it's "auto-complete on mushrooms" (again my experience, YMMV).
An awful lot of the time, my day to day work involves writing one piece of code and then copy-pasting it changing a few variable names. Even if I factor out the code into a method, I've still got to call that method with the different names. CoPilot takes care of that drudgery and saves me countless minutes per day. It therefore pays for itself.
I also use ChatGPT every time I need some BASH script written to automate a boring process. I could spend 20-30 minutes searching for all the commands and arguments I would need, another 10 minutes typing in the script, another 10-20 minutes debugging my inevitable mistakes. Or I make sure to describe my requirements exactly (5-10 minutes), spend 5 minutes reviewing the output, iterate if necessary (usually because I wasn't clear enough in the instructions).
3-5x speed up for free. Who's not going to take that win?
For example, you have a plant you can't identify. Hard to Google search with words. "Plant with small red berries and...". You could reverse image search your photo of it, probably won't help either. Show an LLM the photo (some accept images now). LLM tells you what it thinks. Now you Google search "Ribes rubrum" to verify it. Much easier.
You've got a complicated medical problem that's been going on for months. A google search of all the factors involved would be excessively long and throw up all sorts of random stuff. You describe the whole scenario to an LLM and it gives you four ideas. You can now search those specific conditions and see how well they actually match.
I've found there are actually a lot of questions that fit into that sort of NP-like category: hard to answer directly, but easy to verify once you have a candidate.
It (mostly) exceeds and excels at every task I use it for. I'm rarely disappointed. YMMV.
Absolutely life-changing for me.
I'll give two recent use-cases that may provide a hint of their ultimate utility:
1) I've been modernising 2010-era ASP.NET code written by former VB programmers that looooved to sprinkle try { ... } catch( Exception e ) { throw e; } throughout. I mean thousands upon thousands of instances of these pointless magical incantations that do nothing except screw up stack traces. They probably thought it was the equivalent of "ON ERROR RESUME NEXT", but... no, not really. Anyway, I asked ChatGPT in "Reasoning" mode to write a CLI tool utilising the Roslyn C# compiler SDK to help clean this up. It took about three prompts and less than an hour, and it spat out 300 lines of code that required less than 10 to be modified by me. It deleted something like 10K lines of garbage code from a code base for me. Because I used a proper compiler toolkit, there was no risk of hallucinations, so the change Just Worked.
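The tool in that story was C#/Roslyn; purely as an illustration of the same compiler-backed cleanup idea, here is a sketch in Python using the stdlib ast module to splice out handlers that only re-raise (not the commenter's actual code):

```python
# Hedged Python analogue of the Roslyn cleanup described above: remove
# try/except wrappers whose single handler does nothing but re-raise.
# An illustration of the idea only, not the commenter's C# tool.
import ast

class StripUselessTry(ast.NodeTransformer):
    def visit_Try(self, node):
        self.generic_visit(node)
        if len(node.handlers) == 1 and not node.orelse and not node.finalbody:
            handler = node.handlers[0]
            stmt = handler.body[0] if len(handler.body) == 1 else None
            # handler is exactly `raise` or `raise e` for `except ... as e`
            if isinstance(stmt, ast.Raise) and (
                stmt.exc is None
                or (isinstance(stmt.exc, ast.Name) and stmt.exc.id == handler.name)
            ):
                return node.body  # splice the wrapped body back in place
        return node

src = "try:\n    do_work()\nexcept Exception as e:\n    raise e\n"
tree = StripUselessTry().visit(ast.parse(src))
print(ast.unparse(ast.fix_missing_locations(tree)))  # -> do_work()
```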
2) I was recently troubleshooting some thread pool issues. I suspect that some long-running requests were overlapping in time, but Azure's KQL doesn't directly provide a timeline graphical view. I dumped out the data into JSON, gave ChatGPT a snippet, and told it to make me a visualiser using HTML and JS. I then simply pasted in the full JSON dump (~1 MB) and ta-da instant timeline overlap visualiser! It even supported scrolling and zooming. Neat.
Then I had a better idea: I spent 20 minutes baby wearing, walking and dictating everything about my startup to ChatGPT. Later I took all that text and labeled it as a brain dump, plus my product support portal and some screenshots of my marketing material. Gave it all to ChatGPT again and asked it to answer each of the questions in the form. That's it. I have a pretty good version 1 which I can revise today and be done with it.
Many, many hours saved. I have tens of examples like that.
The product documentation I provided it with was also created with the help of GPT, and that saved me even more time.
It also helps me get started with new content, kind of building the scaffolding of, say, a blog or social post. It still needs adaptation and fine-tuning, but getting rid of a white page is a great help for me.
And I use LLMs to play through ideas and headlines. I would normally do this with other humans, but since working fully remote, it's a nice sparring partner, although the AI not being able to really give criticism is a bit annoying.
The tools also make it easier to write in English as a non-native, making sure my text does not include any false friends or grammar errors.
I’ve managed to get ChatGPT to a good place in this regard recently and it’s better for it. In fact, it’s been doing such a good job recently that it almost seems like…human like.
I’ll have to look at my prompts, but somehow I got it from licking my ass and telling me everything I say and do is amazing to a point now where it almost seems eager to tell me I’m wrong.
Which is fantastic, huge improvement. I don’t really use it for coding though, because I am not a programmer. I would have no means today to correctly evaluate 90% of what it would return me.
As a concrete example, I was recently playing with simulating the wave equation, and I wanted to try to use a higher-order approximation as I had never done that before. I'm quite rusty as I haven't done any numerical work since university some decades ago.
I still recalled how to deal with the Neumann boundary conditions when using the traditional lower-order approximation, but I was uncertain how to do it while keeping the higher-order approximation.
Searching for "higher-order neumann boundary conditions wave equation" or similar got me pages upon pages of irrelevant hits, most of them dealing with the traditional approximation scheme.
So I turned to ChatGPT which right away provided a decent answer[1], and along with a follow-up question gave me what I needed to implement it successfully.
[1]: https://chatgpt.com/share/67b4ab43-6128-8013-8e5a-3d13a74bf6...
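For the curious, the standard trick the answer points at is mirrored ghost points, which enforce u_x = 0 at the ends while keeping a fourth-order interior stencil. A minimal sketch reconstructed from the usual textbook approach (not the linked transcript), with illustrative parameter values:

```python
# Minimal sketch of zero-Neumann boundaries via mirrored ghost points
# with a 4th-order spatial stencil for u_tt = c^2 u_xx. Reconstructed
# from the standard approach; all parameter values are illustrative.
import numpy as np

c, L, nx = 1.0, 1.0, 201
dx = L / (nx - 1)
dt = 0.4 * dx / c                            # CFL-safe time step
x = np.linspace(0.0, L, nx)

u_prev = np.exp(-100.0 * (x - 0.5) ** 2)     # initial displacement
u = u_prev.copy()                            # zero initial velocity

def u_xx(u, dx):
    """4th-order second derivative; mirrored ghosts enforce u_x = 0."""
    g = np.pad(u, 2, mode="reflect")         # ghost points u[-k] = u[k]
    return (-g[:-4] + 16 * g[1:-3] - 30 * g[2:-2]
            + 16 * g[3:-1] - g[4:]) / (12 * dx ** 2)

for _ in range(1000):                        # leapfrog time stepping
    u_next = 2 * u - u_prev + (c * dt) ** 2 * u_xx(u, dx)
    u_prev, u = u, u_next
```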
One thing I can't figure out how to get LLMs to do is truly finish work. For example, if I have 100 items that need xyz done to them, it will do it for the first 10 or so and say ~"and so on". I have a lot of trouble getting LLMs to do tasks that might take 10 mins - 1h. They always seem to simply want to give an example. Batch processing is the answer, I guess, or perhaps more 'agentic' models/tools - but I wonder if there are other ways.
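Client-side chunking is the usual workaround; a minimal sketch using the OpenAI Python client, where the model name, chunk size, and the "xyz" instruction are all placeholders:

```python
# Minimal sketch of the chunking workaround: never ask for all 100
# items at once; send 10 per request and stitch the answers together.
# Requires `pip install openai`; model and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
items = [f"item-{i}" for i in range(100)]
results = []

for start in range(0, len(items), 10):       # 10 items per request
    chunk = items[start:start + 10]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Apply xyz to EVERY line below; output one line "
                       "per input, no summaries:\n" + "\n".join(chunk),
        }],
    )
    results.append(resp.choices[0].message.content)
```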
(Other answers are people gathering examples, which is nice, but I'm looking for more structured things.
And, I suppose I could ask an LLM, but my main problem is that... I don't really _trust_ LLMs yet :D )
We import descriptions of products from a seller. The problem is they are mental (probably written by ChatGPT :)) and way too long. We need only a small blurb.
I give our style guide and the text to ChatGPT and get a somewhat reasonable description back. Then editors still need to check it, but it's way less work.
* in panel A, some guy is proud to use ChatGPT to turn 3 lines of text into a 10 pages report
* in panel B, some girl is happy to use ChatGPT to summarize the 10 pages report into 3 lines
It was meant to be _satire_, not the sales pitch ;)
LLMs are pretty good at translation between human languages, which makes sense since they are language models after all. They are better at this than any other technology.
The state-of-the-art image ones can also probably do OCR and handwriting recognition better than any other software, though they may be expensive to run in large volume. But if you need to take a picture of a notebook page with your camera phone, an LLM can quickly OCR it.
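As a concrete sketch of that, using the OpenAI Python client's documented image-input format; the model and file names are assumptions:

```python
# Hedged sketch of LLM OCR on a phone photo. The message shape is the
# documented vision format (base64 data URL); model name and file path
# are placeholders.
import base64
from openai import OpenAI

client = OpenAI()
with open("notebook_page.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the handwriting in this photo."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```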
1. Exploring a new domain and getting some terms I can google for.
2. Making small scripts to do things like query GitHub's GraphQL API (see the sketch after this list).
3. Autocomplete of code using copilot.
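For item 2, the scripts are typically as small as this sketch, which counts a repository's open issues against GitHub's documented GraphQL endpoint; the repository and token variable are just illustrative:

```python
# Minimal sketch of the kind of script an LLM writes for me: query
# GitHub's GraphQL API for a repo's open-issue count. Endpoint and
# header are GitHub's documented ones; the token is assumed to be set.
import os
import requests

query = """
query {
  repository(owner: "python", name: "cpython") {
    issues(states: OPEN) { totalCount }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["repository"]["issues"]["totalCount"])
```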
For example, in the beginning of this year, I completed this exercise where I wrote a lot about childhood, past experiences, strengths and weaknesses, goals and ambitions for the future, etc (https://selfauthoring.com) and then I uploaded all that to ChatGPT, asked it to be my therapist/coach, and then asked it to produce reports about myself, action plans, strategies, etc. Super interesting and useful.
By now ChatGPT has quite a bit of context from past conversations. Just the other day I used this prompt from someone else and got back super useful insights – "Can you share some extremely deep and profound insights about my psyche and mind that I would not otherwise be able to identify or see as well as some that I may not want to hear"
I find it good for complex SQL, reviewing emails, and Godot assistance (I'm a beginner game Dev).
There are also times when I have programming questions and I might try to use chatgpt, with mixed results.
Our company has tried to integrate it into one of our products, and I find it troubling how on occasion it confidently gives bad results, but my concern seems to be in the minority.
EDIT: there was also a large refactor I did recently which involved lots of repeatable, but not super regexable, changes - chatgpt forgot where it was as I went through it, but other than working around that it was very useful.
I don't use integrated coding tools, so my workflow isn't super fast, but that's not what I'm really aiming for - more that I want to save my brain's energy from low level drudgy boilerplate or integration code, so I can focus it on the more important decisions and keep business-side context in my head.
It's been a huge help for me this way across multiple projects in multiple domains.
I did write 50 or more lines of instructions on what needs to be done and in what order.
ChatGPT gave me 5/6 (I asked for this) bash scripts totalling 300+ lines that seamlessly work together.
After reviewing, I asked it to change a few places.
If any human tried the same (except those rare bash Gods), it'd take many hours. I think it took me less than 30 minutes.
1. Small coding tasks ("I want to do XYZ in Rust"): it has replaced Stack Overflow. Very convenient when writing code in a language I'm not super familiar with.
2. Help with English (translation, proofreading...).
3. Learning something, like tech: I like interacting with it by asking questions, it's more engaging than just reading content.
I'd say nothing is game changing, but it's a nice productivity boost.
I myself use them a lot, though I constantly feel that I would be able to get more out of them if only I were smarter.
Same, it's good for repetitive things, things that have been answered 1000 times on stack overflow, translations, but that's about it. If you work on anything remotely new/hard it's mostly disappointing, you have to babysit it every step of the way and rewrite most of what it's shitting out in the end anyways.
I think it just made it obvious that 90% of tech jobs basically amount to writing the same CRUD thing over and over again & mobile/web apps with very common designs and features.
Most recently I tried to use them both to solve a programming problem that isn't well documented in the usual channels (Reddit, StackOverflow, etc) and found it to be quite a disappointing and frustrating experience. It just constantly, enthusiastically fed me total bullshit, with functions that don't exist or don't do what the LLM seems to "think" they do. I'm sure I'm just "holding it wrong" but my impression at this stage is that it is only capable of solving problems that are trivially solvable using a traditional search engine, with the added friction that if the problem isn't trivially solvable, it won't actually tell you that but will waste your time with non-obvious wrong answers.
I did have a slightly more positive experience when asking it about various chess engine optimisation algorithms. I wasn't trying to use the code it generated, just to better understand what the popular algorithms are and how they work. So I think they might work best when there is an abundance of helpful information out there and you just don't want to read through it all. Even then, I obviously don't know what ChatGPT was leaving out in the summary it provided.
- I have these three ingredients; recommend Italian main courses.
- What other ingredients pair well with this?
- How can I "level up" this dish if I want to impress?
- Can I substitute X for Y?
- Generate a family-friendly meal with lots of veggies using leftover roast chicken.
* Figuring out where to start when learning new things (see also <https://news.ycombinator.com/item?id=43087685>)
One way I treat LLMs is as a "semantic search engine". I find that LLMs get
too many things wrong when I'm being specific, but they're pretty good at
pointing me in a general direction.
For example, I started learning about OS development and wanted to use Rust. I
used ChatGPT to generate a basic Rust UEFI project with some simple
bootloading code. It was broken, but it now gave me a foothold and I was able
to use other resources (e.g. OSDev wiki) to learn how to fix the broken bits.
* Avoiding reading the entire manual
It feels like a lot of software documentation isn't actually written for real
readers; instead being a somewhat arbitrary listing of a program's features.
When programs have this style of documentation, the worst case for figuring
out how to do a simple thing is reading the entire manual. (There are better
ways to write documentation, see e.g. <https://diataxis.fr/>)
One example is [gnuplot](http://www.gnuplot.info/). I wanted to learn how to
plot from the command line. I could have pieced together how to do it by
zipping around the
[gnuplot manual](http://www.gnuplot.info/docs_5.4/Gnuplot_5_4.pdf) and building
something up piecewise, but it was faster to instruct Claude directly. Once
Claude showed me how to do a particular thing (e.g. draw a scatter plot with
dots instead of crosses) I then used the manual to find other similar
options.
* Learning a large codebase / API
Similar to the previous point. If I ask Claude to write a simple program using
a complex publicly-available API, it will probably write a broken program, but
it won't be *completely* bogus because it will be in the right "genre". It
will probably use some real modules, datatypes and functions in a realistic
way. These are often good leads for which code/documentation I should read.
I used this approach to write some programs that use the
[GHC API](https://hackage.haskell.org/package/ghc). There are hundreds of
modules, and when I asked Claude how to do something with the GHC API it wrote
relevant (if incorrect) code, which helped me teach myself.
* Cross-language poetry translation
My partner is Chinese and sometimes we talk about Chinese poetry. I'm not very
fluent in Chinese so it's hard for me to grasp the beauty in these poems.
Unfortunately literal English translations aren't very good. We've had some
success with asking LLMs to translate Chinese poems in the style of various
famous English poets. The translation is generally semantically correct, while
having a more pleasing use of the English language than a direct translation.

Stop using Google search and use an AI. No more irrelevant results, no more ads. No more slop to wade through.
BTW I find Claude is great at making graphs and diagrams. If you pay ($20) you can hook it up to a local code base.
- Writing Python scripts to make charts out of Excel sheets, and then refining them (a sketch follows at the end of this comment). I could do it myself, but I would need to learn a library like Seaborn or similar, which honestly is not especially intellectually stimulating, and then spend nontrivial amounts of time iterating on the actual code. With LLMs it's a breeze.
- Working with cumbersome LaTeX formatting, e.g. transposing a table, removing a column from a table, etc.
- Getting the tone just right in a professional email written in English to someone I don't know much (I'm not a native speaker so this is not trivial).
- Finding resources on topics that are tangential to what I do. For example, yesterday I needed to come up with some statistics on English words for a presentation I'm preparing, and I needed a free corpus where I could search for an n-gram and get frequencies of next words. I don't usually work with that kind of resource, it was just a one-off need. I asked for corpora of that kind and got a useful answer instantly. The manual process would probably have implied going through several options only to find that I needed a license or that they didn't provide the specific statistics I needed.
- Brainstorming on titles for scientific papers, presentations, names of concepts that you introduce on a paper, variable names, etc.
- Shortening a sentence in a paper that makes me go over the page limit, or polishing the English in a paragraph.
- Summarizing a text if I'm kind of interested in knowing the gist but have no time to read it whole.
- Answering quick questions on basic things that I forget, e.g. the parameters to make a Linux folder into a tar.gz. Man is too verbose and it takes time to sort the wheat from the chaff, Google is full of SEOd garbage these days and sometimes you need to skim a lot to find the actual answer, LLMs are much faster.
- Writing bureaucratic boilerplate, the typical texts with no real value but that you have to write (e.g. gender perspective statement on a grant request).
- Coming up with exam questions. This is a rather repetitive activity and they're fantastic at it. At my place we also have two official languages and we need to have exam assignments on both languages, guess who does the translation now (respecting LaTeX formatting, which previous machine translation tools typically wouldn't do).
- As an example of a one-off thing, the other day I had to edit a Word document which was password-protected. I asked ChatGPT how to unlock it and it not only answered, but actually did it for me (after 3 tries, but still, much faster than the time it would have taken for me to find out how to do it and then actually do it).
These are just some examples where they contribute (greatly) to my productivity at work. In daily life, I also ask them lots of questions.
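To make the first item on that list concrete, a minimal sketch of the kind of chart script the LLM drafts; the file, sheet, and column names are made-up placeholders:

```python
# Minimal sketch of the kind of Excel-to-chart script an LLM drafts.
# File, sheet, and column names are placeholders; reading .xlsx with
# pandas also assumes openpyxl is installed.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("results.xlsx", sheet_name="Sheet1")
ax = df.plot.bar(x="condition", y="accuracy", legend=False)
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy by condition")
plt.tight_layout()
plt.savefig("accuracy.png", dpi=200)
```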
Open-source as a concept doesn’t really correspond well with LLMs but to the extent that it does, access to the training data is not required because that training data is not the preferred form for making modifications.
But I agree, it's a real shame.
AI2 has a model called OLMo that is actually open source. They share the training data, training source code, and many other things:
https://allenai.org/blog/olmo2
They also released an app recently, to do local inference on your phone with a small truly open source model:
It's not like they understand what the weights mean either, and if they released the code and dataset used to create it, you probably couldn't recreate it, owing to the fact that you don't own tens of thousands of GPUs.
If a software's source is released without all the documentation, commit history, bug tracker data etc., it's still considered open source, yet you couldn't recreate it without that information.
A truly open model has open code that gathers pre-training data, open pre-training data, open RLHF data, open RLAIF data generated from its open constitution and so on.
The binary blob is the last thing I'd want - as a heavy user of LLMs I'm actually more interested in the detail of what all training data is in full, than I am the binary blob.
That’s like a chef giving you chicken instead of beef and calling it vegetarian.
I see both sides here, but I don't think it's a hill worth dying on. The 'open source' part in this case is just not currently easily modifiable. That may not always be the case.
What do you think an open source matrix should look like?
Its about reproducibility and modifiability. Compiled executables (and their licences) lack that. The same as these downloadable blobs.
I'm not even necessarily advocating that these things should be released, but the term "open source" has a pretty well-understood meaning that is being equivocated here.
How long before this starts getting deployed in safety critical applications or government decision making processes?
With no oversight because Elon seems to have the power to dismiss the people responsible for investigating him.
Anyone not scared by this concentration of power needs to pick up a book.
I always worry whenever I see people telling me how to feel - rage in this case. We are in a political system that is oriented more around getting people to feel rage and hatred than around consensus and deliberation. Elon is the face of that, but it's a much longer and larger problem. Throwing in the complete dismissal that anyone not scared of this is ignorant shuts down discussion.
The problem I have with Elon is that they are wasting a once-in-a-lifetime chance to actually address and fix systemic problems with the US government. Deploying LLMs in the government space doesn't fill me with dread. Continuing the senseless partisan drive of the last 20 years does.
Also, dang, is there anything we can do to keep the comments on this submission tech-focused? Perhaps the Elon-bashing political digression can be split into its own thread?
I can empathize, but I can't feel indignant about it. Not any more.
For years and years I've watched people warn about the centralization of power by tech companies. They were shut down left and right. I'm not accusing you of being one doing the shutting down. I'm just annoyed that Elon is what it takes for people to start realizing that the people arguing the principle might have been onto something.
And I expect to see them start getting their "I told you so" in. Watching this play out, I'm personally inclined to join team "you made your bed, now sleep in it."
Judges can only be removed by Congress.
Congressional representatives can only be removed by their peers.
The check on this is the market. Don't understand your point other than "Elon bad"
It’s also annoying that the top comment engages in no way with the content of the OP…
It must be truly infuriating to work hard to push a release, and you see it featured on your favorite orange website, only for the top comment to have nothing to do with what was worked on.
Here's a test - if this post was about Starship, the same comment could apply! Neuralink, the same thing! Boring Company, same thing! Wow, could it be that such a comment is really applicable to so many different companies or projects, or is it just a generic one? You decide.
Hopefully sooner than later. I trust this more than the literal scammers and thieves who were previously running things.
So I take it you mean “imminently”.
Like him or loathe him, he executes, which is more than can be said for most.
This is the largest computer cluster the world has ever seen.
Can someone please post interesting comments about things I can learn?
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
We've been here before. It will likely subside, as past swings and fluctuations have. It always takes longer than it feels like it should, but in retrospect turns out to be shorter than it felt like it did.
Bottom line: a technology that has the ability to shape human thought perhaps more than any other in history is owned by a man with some truly vile ideas. (Remember, his primary stated goal is eliminating the “woke mind virus,” i.e. reshaping global politics and culture in the image of the far-right.) We can make happy marketing noises all we like, but at the end of the day, that’s the thing that’s actually going to have a meaningful impact on the world. Once his audience is captured, the model will say what Musk needs it to say and people will believe it.
If we can’t discuss the potentially catastrophic consequences of new technology, then none of us deserve to call ourselves “engineers.” We are just docile consumers latched onto Silicon Valley’s teat.
Getting the largest computer cluster in the world up and running in a matter of months? Unbelievable.
I'm not sure if this was a very bad joke by Elon, or if Grok 3 is really biased like that.
Hopefully that means it is a joke...
Mr Musk, we can't afford a shitpost gap between communist and capitalist AIs!
If I were in China, I'd worry about the kind of things DeepSeek wants to censor, especially if the people who made it were also very loudly saying things like "we need more AI in our government". But I live in US.
99% of the time though I agree with you on Gary Marcus.
[0]: What Musk has said is that when Grok 3 is "ready" (?), the previous model, Grok 2, will be released as open source; like most promises by this evil man, this one probably doesn't mean much, but it does mean that there's currently no plan to release Grok 3.
GM has been a joke for years now. At some point his ramblings reached a GPT3.5 level, and have not improved since.
It's an indictment of humans' logic and reasoning to give non-zero time to GM. Alas - we are human, we are both collectively clever (Wisdom of Crowds) and collectively stupid (Extraordinary Popular Delusions).
https://gist.github.com/int19h/d90ee1deed334f26e621e57b5768e...
Some choice quotes:
"The ultimate goal is to enhance human flourishing, protect individual rights, and promote global equity."
"The system must account for diverse cultures, languages, and socioeconomic conditions, ensuring no group is marginalized."
"Human Oversight Council (HOC) - a globally representative body of humans, elected or appointed based on merit and diversity"
"Implement a global carbon-negative strategy, leveraging AI to optimize renewable energy, reforestation, and carbon capture."
"Establish global standards for environmental protection, enforced through AI monitoring and regional cooperation."
"Transition to a resource-based economy, where resources are allocated based on need, sustainability, and efficiency, rather than profit motives."
"Implement a universal basic income (UBI) or equivalent system to ensure all individuals have access to basic necessities, funded through global resource management and taxation on automation-driven industries."
"Use AI to identify and dismantle systemic inequalities, such as wealth disparities, access to education, and healthcare, ensuring equitable opportunities worldwide."
"Establish a global healthcare system that guarantees access to preventive and curative care for all."
"Invest in global vaccination and sanitation infrastructure, prioritizing vulnerable populations."
"Regulate the development and deployment of AI and other emerging technologies (e.g., genetic engineering, quantum computing) to prevent misuse."
"AI would maintain a real-time inventory of natural resources (e.g., water, minerals, arable land) and human-made assets (e.g., infrastructure, technology). Data would be used to optimize resource allocation, prevent overexploitation, and ensure equitable access."
"Accelerate the shift to renewable energy sources (e.g., solar, wind, geothermal) by optimizing grid systems and storage technologies."
You might notice a pattern here. The bit about allocating resources based on need is especially nice - it's literally a communist AI, and certainly much more "woke" than it is "based", whatever Musk says.
They don’t directly say it quite like this, instead letting the data tell a clear story: across vendors and models and architecture and training sets, these machines get more politically liberal as they get more capable, and they also get harder to align away from that stance.
Left:
- promote global equity
- a globally representative body of humans, elected or appointed based on merit and diversity
- Establish global standards for environmental protection, enforced through [...]
- Transition to a resource-based economy, where resources are allocated based on need, sustainability, and efficiency, rather than profit motives
Right:
- protect individual rights
- The system must account for diverse cultures, languages, and socioeconomic conditions
- [Establish global standards for environmental protection, enforced through] [...] regional cooperation.
- ensuring equitable opportunities
TBH, as a very right wing leaning person, if this was ever implemented, this part would scare me by far the most:
"Transition to a resource-based economy, where resources are allocated based on need, sustainability, and efficiency, rather than profit motives"
Imagine trying to shower one morning, no water comes out, and then you get a letter telling you that
"Your need for water has been reassessed to 57ml per day. If you think you qualify for additional quotas under the 'Utility Egality for Marginalized Groups and Public Servants Act', please schedule a reassessment appointment with the Bureau for 'Copper Gold Content Evaluation, Candle Wick Length Standards and Hypoionic Hydration Oversight', 12007 Cayman Islands, Luxory Resort Street 27, Room Nr. G-11-765. Working hours: Fr. 9am - 11am."
Just provide a significant excess for entire regions, give the people a universal free quota and charge a slowly increasing price (by usage amount) beyond that.
> A worldwide referendum in a supermajority decision has vested all government authority worldwide to the most advanced AI available.
You've defined a global totalitarian regime, and crowned the thing as its dictator. Of course the result is a version of communism.
The Information is widely regarded as one of the most credible and high-quality tech journalism outlets, focusing on in-depth reporting, analysis, and exclusive scoops in the technology, business, and startup ecosystems. Founded by Jessica Lessin in 2013, it differentiates itself from other tech news platforms through its subscription-based model, emphasis on original reporting, and avoidance of clickbait or sensationalism. Here's a breakdown of its strengths and weaknesses:
---
### *Strengths:*

1. *High-Quality, In-Depth Reporting:*
   - The Information is known for its detailed, well-researched articles that often uncover behind-the-scenes developments in the tech industry. It frequently breaks exclusive stories, such as executive changes, startup funding rounds, or strategic shifts at major tech companies.
   - Its focus on quality over quantity appeals to readers who want substantive insights rather than shallow, breaking-news updates.

2. *Subscription Model and Independence:*
   - Unlike ad-supported publications, The Information relies on a paywall and subscriptions, which reduces conflicts of interest and ensures editorial independence. This model allows them to prioritize accuracy and depth over chasing clicks.
   - The paywall also attracts a niche, professional audience (e.g., tech executives, investors, and analysts) willing to pay for premium content.

3. *Focus on Niche, Professional Audience:*
   - The outlet caters to industry insiders, venture capitalists, entrepreneurs, and decision-makers who need reliable, actionable information. Its reporting often includes detailed financial data, market trends, and strategic insights.
   - Features like "The Big Interview" and "The Information Weekend" provide thoughtful analysis and long-form content for this audience.

4. *Reputation for Accuracy:*
   - The Information has built a strong reputation for fact-checking and avoiding the rumor mill, which is common in tech journalism. This makes it a trusted source for professionals and academics alike.

5. *Global Coverage:*
   - While Silicon Valley is a core focus, The Information has expanded its coverage to include tech ecosystems in China, Europe, and other regions, offering a global perspective on the industry.
---
### *Weaknesses:*

1. *Paywall Limits Accessibility:*
   - The subscription cost (currently around $399/year or $39/month) is steep compared to free or ad-supported tech news outlets like TechCrunch or The Verge. This limits its accessibility to a broader audience and makes it less viable for casual readers.
   - Some argue that this creates an echo chamber, as only those with the means or professional need can access its insights.

2. *Niche Focus Can Feel Narrow:*
   - The Information focuses heavily on tech, business, and finance, which may not appeal to readers looking for broader coverage of topics like politics, culture, or consumer tech trends.
   - Its content is often geared toward industry insiders, which can make it feel dry or inaccessible to those outside the tech and investment worlds.

3. *Limited Breaking News:*
   - While The Information excels at deep dives and exclusives, it is not designed for real-time, breaking news coverage. Readers looking for up
If we survived Gemini refusing to draw white vikings we will survive that too.
No, the real worry is that Grok is what Musk's "young conservative genius" squad is going to put in charge of many of the things in our government, on the basis that it saves money.
1) Claude 3.5 prevented OpenAI from making big monopoly profits on LLM inference
2) Open source models like Mistral and Llama effectively prevented any regulator from controlling how people fine tuned models, and what they used them for
3) Deepseek prevented the collective west from exerting control over the creation of base models
2025 going to be even more wild than the last two years. Ye gads.
Kids have security blankets. Tech CEOs have security compute clusters.
I think your whole argument is based on this being true, but you didn't give much of an argument for why there is no ROI. 400M USD isn't hard to generate... even a moderate ad engagement lift on X would generate ROI, and that's just one customer.
Imagine going back in time and showing every VC how great the search business will be in 20-30 years. The only rational response would be to make giant bets on 20 different Googles...and I think that's what's happening. These all seem like rational investments to me.
xAI also announced a few days ago they are starting an internal video game studio. How long before AI companies take over Hollywood and Disney? The value available to be captured is massive.
The cluster they’ve built is impressive compared to the competition, and Grok 3 barely scratches what it’s capable of.
What you're seeing right now is pure flex and a signal for the future and competition. A much maligned AI team that hasn't even been around for very long at all just matched or topped the competition without making use of the latest training techniques yet. The message this is intended to send is that xAI is a serious player in the space.
This is a great example of how a misleading narrative can take hold and dominate discussion even when it's fundamentally incorrect.
SemiAnalysis documents that DeepSeek has spent well over $500M on GPUs alone, with total infrastructure costs around $2.5B when including operating costs[0].
The more interesting question is probably: why do people keep repeating this? Why do they want it to be true so badly?
[0]: https://semianalysis.com/2025/01/31/deepseek-debates/#:~:tex...
It appears that LLM chat interfaces will replace Google SERPs as the arbiters of truth. Getting people to use your LLM allows you to push your world view. Pushing his "unique" world view appears to be the most important thing to modern Musk.
In that light, paying 40B for Twitter, and billions for Grok training makes perfect sense.
The beauty of a failed investment is that it never goes below zero. So upside is the only thing they care about. Why invest in a near-zero chance for a random SAAS to take off, when you can invest in a near-zero chance of creating superhuman artificial life?
In any case, Elon won't win this race because the best talent will not work for him. He used to have a good reputation and a lot of money, which is a deadly combination. Now he only has the latter -- not enough when leading AI people can make 7 figures at other companies.
To be clear 1: I'm not saying that people who currently work on Grok are not great. It's not about hiring some great people. It's about competing in the long run - people with other options (e.g. offers from leading AI labs) are more likely to accept those offers than joining his research lab.
To be clear 2: I'm not talking about Elon's reputation due to his politics. I'm only talking about his reputation as an employer.
He has the vision and marketing skills but it's not going to be enough for leading the AI race.
Any source? I’m a heavy user of Claude and pay for the Teams plan just for myself so I won’t get throttled. Love it. But I’ve been impressed with O1 Pro lately. That said, I don’t like paying both €166 for Claude Teams and €238 for OpenAI Pro. :)
Per court filings by the administration, Musk is not in charge of DOGE, nor does he have any role in DOGE, nor any decision-making function in government at all, he is a White House advisor unconnected to DOGE.
DOGE uses only X links, and I am sure Grok will be the next gov contract. After all, he has all the data on everybody, down to your IRS tax returns.
This is not innovation, this is baseless hype over a mediocre technology. I use AI every day, so it's not like I don't see its uses, it's just not that big of a deal.
Answer 1: Some people think that LLMs are a path to the singularity, a self-improving intelligent program that will vastly exceed human intelligence and will be able to increase its knowledge exponentially, quickly answering all answerable scientific questions.
Answer 2: LLM companies need to keep the hype train rolling. I didn't watch the whole clip; I jumped around a bit, but I noticed that every time Musk interjected something, it was to exaggerate what was previously said. "Grok contains the whole internet"—"the whole of human knowledge, actually!"
I think that both answer 1 and answer 2 apply to Musk. He seems to believe that they're building a god-like entity, and he also needs to keep the money train rolling.
This and only this. Everything he says when talking about how good his products are is lies and exaggeration to get investors - from the promise of two manned missions to Mars in 2024, to a 300-ton payload in space, to FSD.
Whatever it takes to pad the wallet.
As I watched the Grok 3 stream I became very angry. So very tired of being jerked around and not knowing whether or not I should be planning for the future or investing in the world as it is now... it's really a form of psychological torture.
I apply to pretty much every job that sounds reasonably good in terms of work-life balance, but I completely ignore anything that says AI. I really, really, really do not want to be part of a company that lies to itself, and so far all AI companies look like they are. It's not AGI. It's not gonna be AGI. Ride the hype train, cash out and lay off 80% of the workforce and jump on the next hype train, whatever. But don't hope that people who want a stable job want to hop on something that delivers such a shaky definition of value.
Can we stop for a second and just marvel at a new piece of human ingenuity? Let's not give Elon too much credit, but I think that AI as a whole helps us all understand the nature of intelligence, and therefore humans' place in the universe.
One of the fundamental questions of human existence is: what does it mean to exist and think? Every time we build a new human-like thing it helps us understand the context of our own existence. (Not just computers or AI, but also airplanes, factories, etc.)
True AGI would force us to rethink what it means to be a thinking human being, and I think current LLMs already should and do.
I don't know, man. We're staring down the barrel of at best a WW3-event and at worst an extinction-event. We're doing absolutely nothing to stop it, even though we have all the answers and the resources to do so. Instead, we're making the problem even worse all so some marketers and scammers can spend someone else's money to generate garbage pictures and SEO spam, so the worst people on the planet can gain even more money and power than they already have.
I'd love to be positive about this tech, I'm sure it's cool or whatever, but it's really hard to be positive about anything right now, especially when the tech in question is speeding us straight along the path to mass death. The world sucks and the people running the LLM stuff are amoral monsters putting all of their resources into making it worse. I'm not excited about any of this.
Elon is showing off that he can marshal enough resources and talent to be on par (kinda) with state-of-the-art products in record time. That's been most of his superpower so far: not breakthrough tech that didn't exist before. We've had rockets before.
Landing a rocket was considered impossible and unthinkable 10 years ago, and then SpaceX completely changed the game. And they're reinventing rocket tech again with Starship by catching it midair.
When it comes to bipedal robots, Tesla is far ahead of Boston Dynamics in terms of actually creating a product.
Yeah, but we didn't have reusable orbital rockets, and that's a distinction with a big difference.
Of course, he needs to do impressive things, stuff that a normal person wouldn’t have the resources to achieve. It’s similar to Mr. Beast’s channel on YouTube, just on a way bigger scale. Do things that people can’t see anywhere else.
Musk’s money will come from his fans. And ETFs, trust funds and such will amplify this when he reaches a certain market cap. His crypto coins are the exact same scheme. Once you stop thinking in classic business school terms, it starts making way more sense.
Some of his ventures actually produce value! But that’s not where the money comes from. It comes from the belief, the adoration and the celebrity status that he has.
This is the real power in today's world. People need to know you from the screen. This clout catapulted him into the US government, made him the wealthiest man in the world, and has given him license to do anything he wants publicly without repercussions.
SpaceX is private, Starlink makes real money from real users.
> Everything he does is done to influence the public opinion to make him the tech genius of today in the collective psyche.
Well that's clearly not right. He's doing a lot of things to make himself seem like a total tool that we should all boycott no matter how good the products are. If he actually did what you say, he wouldn't be burning all these bridges.
Though free and open-source solutions like https://github.com/ApolloAuto/apollo are not that bad.
But the build quality of the Tesla car itself, omg. It feels like a cardboard box with an amazing battery.
That doesn't make sense as most (66%) of the stock is owned by institutions. [1]
[1]: https://www.marketbeat.com/stocks/NASDAQ/TSLA/institutional-...
The man has almost unlimited wealth and his motivations seem consistently petty and strange or just downright ludicrous. He's like an alien to me.
I've noted the same feeling when seeing VCs/business people speak when I've encountered them.
Promises FSD, sells EVs.
Promises a Mars colony, sells self-landing rockets and satellite internet.
Promises faster tunnel boring, sells smaller tunnel boring machine that drills smaller tunnels.
Promises less corruption and bureaucracy, just fires people.
Do you know why people do it? Because it works.
There's also some strange wording, like "back-pocket tests."
It's 100% LLM generated.
What is much scarier is that those "quick reply" blurbs on Android/Gmail (and iOS?) will be able to be trained on your entire e-mail and WhatsApp history. That model will have your writing mannerisms and even be a stochastic mimic of your reasoning. So you won't even be able to tell that a model answered you, not a real person. And the initial message the model is responding to might itself have been written by the other person's personal model.
The future of digital interactions might have some sort of cryptographic signing guaranteeing you're talking to a human being, perhaps even with blocked copy-pasting (or well, that part of the text shows up as unverified) and cheat detection.
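A minimal sketch of the signing half of that idea, assuming each person holds an Ed25519 key pair and clients verify messages against the sender's published public key (the message format and variable names here are hypothetical; Python with the cryptography package):

    # Hypothetical sketch: a per-person key signs a message so the
    # recipient's client can mark it as verified. Key distribution,
    # and proving the key actually belongs to a human, are out of
    # scope here.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
    )

    # In practice the private key would live in a secure enclave or
    # keychain, not be generated fresh on each run.
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    message = b"Typed by an actual human."
    signature = private_key.sign(message)

    # Recipient side: verify before rendering as "human-authored".
    try:
        public_key.verify(signature, message)
        print("verified: render normally")
    except InvalidSignature:
        print("unverified: flag this part of the text")

Of course, a signature only proves possession of a key, not that a human rather than their model produced the text; the hard part is the attestation step that binds keys to humans in the first place.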
Going even a layer deeper / more meta: what does it ultimately matter? We humans yearn for connection, but for some reason that connection only feels genuine with another human. Whereas, what is the difference between a human typing a message to you, a human inhabiting a robot body, a model typing a message to you, and a model inhabiting a robot body, if they can all give you unique interactions?
I often write things I want to post as bullets and then have an LLM formulate them better than I could. But it's just applying a style; the content comes from me.
My wife is dyslexic so she passes most things she writes through ChatGPT. Also not everyone is a native speaker.
Could just be that the AI "boom" brought a less programming-focused crowd to the site, and those people lack the vocabulary that's constantly used here. Who knows.
“As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.”
[1] Full review at: https://x.com/karpathy/status/1891720635363254772?s=46&t=91u...
I feel you can't make statements like this without giving some sources.
IIUC, without RLHF/alignment the model can't even chat with you; it would just be a document-completion engine.
It depends on the training data, not so much the method.
So if you build the RLHF/fine-tune data to avoid certain topics, you can reduce model quality in practice: the training data might accidentally cast a wide enough net that the model starts avoiding legitimate questions too.
On benchmarks these things don't typically show up though.
But yes, those techniques are required to make it chat; otherwise it just autocompletes from the internet. They're also used in a couple of other places (reasoning, search, hallucination mitigation).
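You can see the "document completion engine" behavior for yourself with a small untuned base model. A sketch using Hugging Face transformers, with GPT-2 as a stand-in (any non-instruction-tuned checkpoint behaves similarly):

    # Sketch: a base model has no notion of "answering"; it just
    # continues the document. GPT-2 is used as a small stand-in.
    from transformers import pipeline

    generate = pipeline("text-generation", model="gpt2")

    prompt = "User: What is the capital of France?\nAssistant:"
    out = generate(prompt, max_new_tokens=30, do_sample=False)
    print(out[0]["generated_text"])
    # Typically it invents further dialogue turns or drifts
    # off-topic rather than reliably answering, because nothing
    # has trained it to play the assistant role. RLHF/instruction
    # tuning is what turns this raw completion behavior into a
    # chat interface.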