There is also a lot of thought put into all the tiny bits of optimization to reduce memory usage, using FP8 effectively without significant loss of precision or dynamic range.
None of the techniques by themselves are really mind blowing, but the whole of it is very well done.
The DeepSeekV3 paper is really a good read: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...
Madness.
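To make the FP8 point concrete, here's a toy sketch of per-tensor scaled quantization, the basic trick that keeps dynamic range usable at 8 bits. A uniform 256-level grid stands in for the real (nonuniform) FP8 e4m3 format, so this illustrates the scaling idea only, not DeepSeek's actual recipe:

```python
# Toy sketch of per-tensor scaled quantization, the idea behind
# training in FP8 without losing dynamic range: rescale values into
# the representable range before casting, undo the scale afterwards.
# A uniform 256-level grid stands in for the real e4m3 format here.

FP8_MAX = 448.0  # largest finite value in FP8 e4m3

def quantize(xs, levels=256):
    """Scale values into [-FP8_MAX, FP8_MAX] and round to the grid."""
    scale = max(abs(x) for x in xs) / FP8_MAX or 1.0
    step = 2 * FP8_MAX / (levels - 1)
    codes = [round((x / scale) / step) for x in xs]
    return codes, scale, step

def dequantize(codes, scale, step):
    return [c * step * scale for c in codes]

xs = [0.003, -1.7, 42.0, -0.25]
codes, scale, step = quantize(xs)
ys = dequantize(codes, scale, step)
# The largest values survive with small relative error; the tiny
# 0.003 gets flushed to zero, which is why real recipes add
# fine-grained (per-block) scaling factors on top.
```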
Sure, training cost will go down with time, but if you are only using 10% of the compute of your competition (TFA: DeepSeek vs. LLaMA), then you could be saving hundreds of millions per training run!
It's also a science-based organization like OpenAI. Very intelligent people, but they aren't programmers first.
Making AI practical for self-hosting was the real disruption of DeepSeek.
With DeepSeek we can now run on non-GPU servers with a lot of RAM. But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?
I guess what I sort of am thinking of is something like a model that comes with its own built in vector db and search as part of every inference cycle or something.
But I know that there is something about the larger models that is required for really intelligent responses. Or at least that is what it seems because smaller models are just not as smart.
If we could figure out how to change it so that you would rarely need to update the background knowledge during inference and most of that could live on disk, that would make this dramatically more economical.
Maybe a model could have retrieval built in, and trained on reducing the number of retrievals the longer the context is. Or something.
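A crude sketch of that retrieval idea, with a hypothetical in-memory store and bag-of-words similarity standing in for a real on-disk vector DB and learned embeddings:

```python
# Toy sketch: "knowledge" lives in an external store fetched per
# query, rather than being baked into hundreds of GB of weights.
# Bag-of-words cosine similarity stands in for learned embeddings;
# the store is a list here but would be an on-disk vector DB.
from collections import Counter
from math import sqrt

KNOWLEDGE = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "LEGO bricks interlock to build larger structures.",
]

def embed(text):
    cleaned = text.lower().replace(".", " ").replace("?", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(KNOWLEDGE, key=lambda d: cosine(q, embed(d)),
                    reverse=True)
    return ranked[:k]

print(retrieve("what is the capital of France?"))
# → ['Paris is the capital of France.']
```

The interesting research question from the comment above is the training side: teaching the model when to issue a retrieval at all, and rewarding it for issuing fewer as context grows.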
Small correction - it's 671B parameters, not 671 gigabytes. Doing some rudimentary math: if you want to run the entire model in memory, it would take ~750 GiB (671B params × 1 byte per param (FP8 = 8 bits) × 1.2 for ~20% overhead ≈ 749.9 GiB).
It's a MoE model, so only a fraction of those ~750 GiB of parameters is active for any given token.
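For what it's worth, the back-of-envelope arithmetic from the correction above, spelled out:

```python
# Back-of-envelope memory estimate for holding DeepSeek-V3's weights:
# 671B parameters at 1 byte each (FP8 = 8 bits), plus a rough 20%
# overhead for KV cache, activations, and framework bookkeeping.
params = 671e9
bytes_per_param = 1      # FP8
overhead = 1.2           # ~20% fudge factor
total_gib = params * bytes_per_param * overhead / 2**30
print(f"{total_gib:.1f} GiB")  # → 749.9 GiB
```

Note the GB/GiB distinction is doing some work here: ~805 GB of raw bytes comes out to ~750 GiB.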
I think maybe what you are asking is "Why do more params make a better model?"
Generally speaking, it's because if you have more units of representation (params), you can encode more information about the relationships in the data used to train the model.
Think of it like building a LEGO city.
A model with fewer parameters is like having a small LEGO set with fewer blocks. You can still build something cool, like a little house or a car, but you're limited in how detailed or complex it can be.
A model with more parameters is like having a giant LEGO set with thousands of pieces in all shapes and colours. Now, you can build an entire city with skyscrapers, parks, and detailed streets.
---
In terms of "is a lot of it irrelevant?" - this is a hot area of research!
It's very difficult currently to know which parameters are relevant and which aren't. There is an area of research called mechanistic interpretability that aims to illuminate this - if you are interested, Anthropic's "Golden Gate Claude" demo and the accompanying "Scaling Monosemanticity" paper are a good introduction.
I tried the same thing on the 7B and 32B model today, neither are as effective as codellama.
No one (publicly) had really pushed any of these techniques far, especially not for such a big run.
the transformer was an entirely new architecture - a very different kind of step change than this
e: and alibaba
I think recomputing MLA and RMSNorm activations on backprop is something few would have done.
Dispensing with tensor parallelism by kind of overlapping forward and backprop. That would not have been intuitive to me. (I do, however, make room for the possibility that I'm just not terribly good at this anymore.)
I don't know? I just think there's a lot of new takes in there.
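For anyone unfamiliar with the recomputation trick mentioned above: instead of caching a layer's activations for the backward pass, you cache only its (cheaper) input and rerun the forward when gradients are needed, trading FLOPs for memory. A toy single-layer illustration (not DeepSeek's implementation; the finite-difference gradient is purely for the demo):

```python
# Toy sketch of activation recomputation (gradient checkpointing):
# cache only a layer's input, and rerun the forward pass during
# backprop instead of storing its output. This trades extra FLOPs
# for a smaller activation footprint - the idea behind recomputing
# MLA / RMSNorm outputs on the backward pass.
import math

def rmsnorm(xs, eps=1e-6):
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms for x in xs]

class CheckpointedRMSNorm:
    def forward(self, xs):
        self.saved_input = list(xs)   # cache the cheap thing only
        return rmsnorm(xs)            # output is deliberately NOT cached

    def backward(self, grad_out, h=1e-6):
        xs = self.saved_input
        y = rmsnorm(xs)               # recompute instead of loading
        # Finite-difference vector-Jacobian product, purely for the
        # demo; a real framework uses analytic gradients.
        grad_in = []
        for i in range(len(xs)):
            bumped = list(xs)
            bumped[i] += h
            y2 = rmsnorm(bumped)
            grad_in.append(sum(g * (b - a) / h
                               for g, a, b in zip(grad_out, y, y2)))
        return grad_in

layer = CheckpointedRMSNorm()
out = layer.forward([1.0, 2.0, 3.0])
grads = layer.backward([1.0, 0.0, 0.0])
```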
"Write Python code for the game of Tetris" resulting in working code that resembles Tetris is great. But the back and forth asking for clarification or details (or even post-solution adjustments) isn't there. The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.
"Do you want to keep score?" "How should scoring work?" "Do you want aftertouch?" "What about pushing down, should it be instantaneous or at some multiple of the normal drop speed?" "What should that multiple be?"
as well as questions from the prompter that inquire about capabilities and possibilities. "Can you add one 5-part piece that shows up randomly, on average every 100 pieces?" or "Is it possible to make the drop speed function as an acceleration rather than a linear drop speed?"... These are somewhat possible, but sometimes require the model to re-reason the entire solution again.
So right now, even the best models may or may not provide working code that generates something that resembles a Tetris game, but have no specifics beyond what some internal self-referential reasoning provides, even if that reasoning happens in stages.
Such a capability would help users of these models troubleshoot or fix specific problems or express specific desires... the Tetris game works but has no left-hand L blocks, for example. Or the scoring makes no sense. Everything happens in a sort of highly superficial way, where the reasoning is used to fill in gaps in the top-down understanding of the problem the model is working on.
For example, I can ask it to write a python script to get the public IP, geolocation, and weather, trying different known free public APIs until it succeeds. But the first successful try was dumping a ton of weather JSON to the console, so I gave feedback to make it human readable with one line each for IP, location, and weather, with a few details for the location and weather. That worked, but it used the wrong units for the region, so I asked it to also use local units, and then both the LLM and myself judged that the project was complete. Now, if I want to accomplish the same project in fewer prompts, I know to specify human-readable output in region-appropriate units.
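For reference, the finished script's shape was roughly like the sketch below. The specific endpoints (ipify, ip-api.com, Open-Meteo) are plausible free APIs chosen for illustration, not necessarily the ones the model picked:

```python
# Sketch of the described script: fetch public IP, geolocation, and
# current weather from free APIs, then print one human-readable line
# each. ipify / ip-api / Open-Meteo are illustrative endpoint choices.
import json
import urllib.request

def fetch_json(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def format_report(ip, city, country, temp_c, units):
    """Pure formatting step: one line each for IP, location, weather."""
    if units == "imperial":
        temp = f"{temp_c * 9 / 5 + 32:.1f} F"
    else:
        temp = f"{temp_c:.1f} C"
    return (f"IP:       {ip}\n"
            f"Location: {city}, {country}\n"
            f"Weather:  {temp}")

def main():
    ip = fetch_json("https://api.ipify.org?format=json")["ip"]
    geo = fetch_json(f"http://ip-api.com/json/{ip}")
    wx = fetch_json("https://api.open-meteo.com/v1/forecast"
                    f"?latitude={geo['lat']}&longitude={geo['lon']}"
                    "&current_weather=true")
    # Region-appropriate units: Fahrenheit for the few non-metric locales.
    units = "imperial" if geo["countryCode"] in ("US", "LR", "MM") else "metric"
    print(format_report(ip, geo["city"], geo["country"],
                        wx["current_weather"]["temperature"], units))

# main()  # uncomment to actually hit the network
```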
This only uses text based LLMs, but the logical next step would be to have a multimodal network review images or video of the running program to continue to self-evaluate and improve.
Sounds like a deficiency in theory of mind.
Maybe that explains some of the outputs I've seen from DeepSeek where it conjectures about the reasons why you said whatever you said. Perhaps this is where we're at in mitigating what you've noticed.
If your prompt instructs the model to ask such questions along the way, the model will, in fact, do so!
But yes, it would be nice if the model were smart enough to realize when it's in a situation where it should ask the user a few questions, and when it should just get on with things.
That's probably the key to understanding why the hallucination "problem" isn't going to be fixed: for language models, as probabilistic models, it's an inherent feature, and they were never designed to be expert systems in the first place.
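One way to see the "inherent feature" point: decoding always turns logits into a probability distribution and samples from it, and nothing in that machinery checks truth. A toy illustration with made-up logits for a hypothetical next-token choice:

```python
# Toy illustration: decoding turns logits into a probability
# distribution and samples from it. Nothing in this machinery checks
# truth, so a fluent wrong continuation with nonzero probability
# will eventually be sampled - hallucination as a feature, not a bug.
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up next-token candidates after "The capital of Australia is"
tokens = ["Canberra", "Sydney", "Melbourne"]
logits = [2.0, 1.8, 0.5]   # hypothetical scores; Sydney is close behind

probs = softmax(logits)
random.seed(0)
samples = [random.choices(tokens, weights=probs)[0] for _ in range(1000)]
wrong = sum(s != "Canberra" for s in samples) / len(samples)
# Roughly half the samples assert a wrong capital, and no sampling
# tweak pushes that probability to exactly zero.
```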
Building a knowledge representation system that can properly model the world is a problem that goes more into the foundations of mathematics and logic than engineering. The current frameworks, like FOL, are very lacking here, and there aren't many people in the world working on such problems.
https://www.mdpi.com/1999-4893/13/7/175
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Diaconescu's theorem will help you understand where Rice's theorem comes into play here.
Littlestone and Warmuth's work will explain where PAC learning really depends on a many-to-one reduction that is similar to fixed points.
Viewing supervised learning as parametric regression (thus dependent on IID) and unsupervised learning as clustering (thus dependent on AC) will help with the above.
Another lens: both IID and AC imply PEM.
Basically, for problems like protein folding, whose rules have the Markovian and ergodic properties, it will work reliably well for science.
The three basic properties (confident, competent, and inevitably wrong) will always be with us.
Doesn't mean that we can't do useful things with them, but if you are waiting for the hallucinations problem to be 'solved' you will be waiting for a very long time.
What this new combo of elements does do is seriously help with being able to leverage base models to do very powerful things, while not waiting for some huge groups to train a general model that fits your needs.
This is a 'no effective procedure/algorithm exists' problem. Leveraging LLMs for frontier search will open up possible paths, but the limits of the tool will still be there.
The stability of planetary orbits is an example of another limit of math, but JPL still does a great job regardless.
Obviously someone may falsify this paper... but the safe bet is that it holds.
https://arxiv.org/abs/2401.11817
Heck, Laplacian determinism has been falsified, but since scientists are more interested in finding useful models, that doesn't mean it isn't useful.
"All models are wrong, but some are useful" is the TL;DR.
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Thank you. Code-as-data problems are innate to the von Neumann architecture, but I could never articulate how LLMs are so huge that they are essentially Turing-complete and computationally equivalent. You _can_ enumerate through them, just not in our universe.
And yes, if you thought this means these models are being commonly misapplied, you'd be correct. This will continue until the bubble bursts.