Non-determinism in GPT-4 is caused by Sparse MoE

(152334h.github.io)

397 points · 152334H · 2y ago · 181 comments

117 comments · 21 top-level

dudus · 2y ago · 42 in thread

Off topic

> 3 months later, reading a paper while on board a boring flight home, I have my answer.

I noticed people from hacker news routinely read scientific papers. This is a habit I envy but don't share.

Any tips or sites for someone interested in picking up more science papers to read?

jldugger · 2y ago

For just getting started I recommend collections:

1. Ideas That Created The Future[1]. It's a collection of fiftyish classic CS papers, with some commentary.

2. Wikipedia's list[2].

3. Test of Time awards[3]. These are papers that have been around for a while and people still think are important.

4. Best paper awards[4]. Less useful than ToT as not every best paper is actually that good or important, and sometimes the award committees can't see past names or brands for novel research.

5. Survey Journals[5]. Students often get their research started with a literature review and some go the extra step to collect dozens of papers into a summary paper. I subscribe to the RSS feed for that one, and usually one or two are interesting enough to read.

6. Citation mining -- As you read all these, consider their citation lists as potential new reading material, or if an old paper leaves you wanting more, use Google Scholar to find papers that cited what you just read.

[1]: https://www.amazon.com/Ideas-That-Created-Future-Computer/dp...

[2]: https://en.wikipedia.org/wiki/List_of_important_publications...

[3]: https://www.usenix.org/conferences/test-of-time-awards

[4]: https://jeffhuang.com/best_paper_awards/

[5]: https://dl.acm.org/journal/csur

puzzledobserver · 2y ago

I'd like to disagree with this. In particular, about [1]: It is a collection of papers in many different topics. There is little technical overlap between Alan Turing's Entscheidungsproblem paper, for instance, and Hoare's paper on axiomatic semantics. Also, the papers are all from the 70s or earlier. They're uniformly influential papers, and have shaped the field, but the fields and the vernacular used by working researchers are very different. At best, the papers approximate a four year undergrad curriculum in CS, and at worst, are a recipe to get distracted and overwhelmed. The link to Wikipedia [2] is somewhat better in that the papers appear to be more modern, but suffers even more from the problem of diversity.

A somewhat similar problem arises with test-of-time and best paper awards. To elaborate on my complaint, imagine the exaggerated case of someone trying to understand modern science by intensely focusing on the work of researchers who won the Nobel Prize. Clearly all very important work, but understanding the 1990 Physics Nobel Prize (on electron-proton scattering) is of no use to understanding the work for which the 1991 Nobel was awarded (complex systems and polymers).

There are two things that (I'm assuming the OP's field of interest is computing) a CS education provides: At the undergrad and in the early stages of grad school, breadth of topics, and their modern synthesis. You don't spend much time reading papers (at least in an undergraduate education), but you understand the basics, and get a feel for the problems considered and the sensibilities of researchers. In an intermediate-level graduate seminar, you pick a narrow topic, and focus on papers in that topic. The first papers in the area (like Dijkstra's papers on distributed computing), the best / most important papers in the area, and the latest papers on topical interests (like Merkle trees and blockchains). There is thematic and technical continuity from one paper to the next, and you start to understand the story being told. Then, late in graduate school, and in the rest of one's professional career, one starts reviewing papers that haven't even been published. At this point, you see the story being written: the steps and the missteps, and the memorable and not-so-memorable papers in a field. To truly understand a field, one needs to read not just the great papers, but also the middling ones.

And one needs to concentrate on a topic. The thing about a forum such as HackerNews is that for every topic of interest, there's likely a person here who's an expert in the area, but it is easy to confuse that observation with the much stronger claim that there's a person here who's an expert on every topic. The last of those people died in the mid-20th century, if they ever existed.

jldugger · 2y ago

From there, just keep a reading queue. If you notice a particular journal is a good source of material, consider subscribing to it.

nerdponx · 2y ago

> I noticed people from hacker news routinely read scientific papers.

Do they? I suspect that most don't, and those that do are either in specialized careers or are engaged in some kind of scientific research.

Some interesting research gets disseminated via Twitter and chatrooms. Or maybe you follow a podcast that mentions new research. But you might also be following new publications from a handful of reputable journals, or following an Arxiv category, or looking through new conference papers. It's very easy to get overwhelmed with new research to read, and not knowing what's worth your time, unless you're already very familiar with the field and well-versed in the material.

MacsHeadroom · 2y ago

Long time HN'er college dropout and I read a LOT of scientific papers. Probably an average of 4 a week over the past couple of decades, sometimes reading 40 in a week.

I probably averaged 20 a week back in March when open source AI was booming in the wake of Llama and on the heels of GPT-4.

i-use-nixos-btw · 2y ago

I strongly agree.

Once upon a time, I was in condensed matter physics. I was (and remain) interested in a very specific niche within that, and I read a small handful of the papers that were published each week. I’m not actively researching or publishing anymore so I cap this to one or two per month now, and mostly scan over them to see if anything piques my interest.

I was still interested in condensed matter as a whole, at the time, and attended group seminars once a month to see what other people were currently excited about - there wasn’t any hope of me reading a cross section of all condensed matter papers because there is far more published per week than I’d be physically able to even glimpse at, and most of it is stuff I don’t understand or particularly care about.

I was likewise interested in physics as a whole, and twice a year I’d attend a departmental seminar and see what people in the entire department were interested in. Most was far over my head, but it still directed me to a small handful of papers that I’d read for the hell of it. Of course, I couldn’t do this without first hearing people review the research. There’s far more published per day in physics as a whole than I could read in a year, and most of it I’d find unrelatable and uninteresting.

I guess where I’m going with this is that anyone with a specific interest is already reading papers. It’s their job. Anyone with a general interest would find actively pursuing paper hunting to be a waste of time with a ridiculously bad signal to noise ratio. Instead, they should use channels that align closely with their own interests, through which they can get recommendations to read papers from the aforementioned specialists who have already filtered out much of the noise themselves. At that point, they should actually read the resulting papers.

There is another trick, though, and that’s to find an individual who publishes two unrelated pieces of work that you find interesting, then read their work and maybe those of their coauthors. Be careful, though, because this is a slippery slope to specialising, after which you’ll find yourself back at the point where you aren’t following 99.9% of the stuff you wanted to follow in the first place.

CSMastermind · 2y ago

I typically look up and read a paper when it's referenced in discussion or cited in something else I'm reading/watching, and the purported contents seem surprising to me. This normally happens 3 or 4 times a week.

Honestly many papers are written in a way that's hard to approach and difficult to understand unless you're prepared to reread them a few times.

You're better off just getting your science news from actual science communicators and not the raw source.

brmgb · 2y ago

> I noticed people from hacker news routinely read scientific papers.

Highly doubt that. It’s very hard to actually read scientific papers when you are not actively doing research.

You can’t just read a research paper in isolation. It’s next to useless. You need to understand its context, where it stands with regard to its sources and what it brings which is actually new and valuable. It’s nearly impossible to do properly if you are not fully immersed in a research subject.

I don’t even know how you would skim introductions and sources to filter out articles which are immediately obviously useless without being immersed in a field.

I guess you can obviously go through lists of papers which have been deemed worthwhile by someone else or won prizes. That solves the filtering issue but then nearly every time you will be better served reading a text book presenting the ideas in said papers.

I fully expect the HN readership to contain a significant number of students and actual researchers, which explains why you encounter people reading papers, but these people aside I would be surprised if the habit is common.

cypress66 · 2y ago

You don't need to be doing research to read an ML paper. With some general knowledge in AI you should be able to understand most papers.

And even then, sometimes you don't understand or care about their procedures, and you just want to look at the pretty results (check out this song they generated using AI!). There's even a very popular YouTube channel that focuses on this (two minute papers).

Finally, you usually hear about these cool papers via Twitter / X

allisdust · 2y ago

Don't read them for the sake of reading them. Read them to solve your current problem or trying to keep up with advancements in a narrow field you love. Most papers (especially the ones in deep learning) seem to also have a mathematical fetish (to put it mildly) where needless representations are used where none are required and are self evident (for example inputs belong to Real number set). It ends up making the paper pseudo complex and unapproachable. Most papers are doing average/summation/series operations but instead of just saying so, use the symbols all over the place. So even if a few papers appear tough, keep reading them and digest your first paper thoroughly. You will find subsequent papers mostly are a rehash of existing work with similar fetish to make trial and error appear like mathematically sound research. Once in a while, you would find some paper which is fully theoretical and try to prove that either the inputs/outputs/components of models have certain well known mathematical properties and hence can be reasoned similarly. These are rare and would be difficult to parse through.

PS: Best papers I have seen are from deepmind where the approaches usually described are novel, varied and path breaking. Worst ones are - well no names but those that just use training and eval sets generated by GPT4 and try to prove things empirically

LudwigNagasena · 2y ago

> Most papers (especially the ones in deep learning) seem to also have a mathematical fetish (to put it mildly) where needless representations are used where none are required and are self evident (for example inputs belong to Real number set). It ends up making the paper pseudo complex and unapproachable.

I completely disagree with that. Spelling out math is literally something out of 12th century. It just hinders understanding, if you have basic STEM-level math literacy, which anyone who reads an ML paper is implied to have (how could you seriously study linear algebra and calculus without it?).

Math may actually be the first thing you recognise in a paper, which can help you cross-reference the text to understand it.

obblekk · 2y ago

Build the habit.

When google doesn't return a good result to a specific question, switch to scholar.google.com and start reading abstracts. Everything may seem like an opaque maze at first, but just keep reading and patterns start emerging quickly and become useful.

TechBro8615 · 2y ago

I don't mind reading research papers, but they're really annoying to read on a phone screen. I remember a few years ago, an HN comment shared a link to some tool that could convert a PDF to single column text and make it more readable on a phone screen, but I can't find it. Anyone remember this or have the link?

cpeterso · 2y ago

Check out the papers and talks from Papers We Love, a "repository of academic computer science papers and a community who loves reading them":

https://paperswelove.org/

eru · 2y ago

It depends on why you want to read papers and what you want to get out of it.

https://news.ycombinator.com/item?id=37006967 suggested some avenues for finding some classic papers. The follow-up https://news.ycombinator.com/item?id=37007360 pointed out some circumstances where that's not ideal. But in the process, it implicitly assumes that you want to become familiar with current research, instead of just enjoying classic papers for some other motivation.

I mostly read papers in mathematics and computer science. For other disciplines I mostly rely on pop science, like Slate Star Codex or Money Stuff and blogs. There's also The Monad Reader (https://wiki.haskell.org/The_Monad.Reader) if you are interested in functional programming.

There's various blogs with interesting articles. Eg Vitalik Buterin has great stuff, like https://vitalik.ca/general/2017/11/09/starks_part_1.html and he links to the original papers. (I have no conclusive opinions on whether crypto-currencies are useful or good for the real world, but I do find the math behind some of them endlessly fascinating. Especially zero-knowledge proofs.)

Wikipedia is also often a good starting point. Whenever you read about a random topic, Wikipedia usually has an article that comes with plenty of references. Eg https://en.wikipedia.org/wiki/Forth_Bridge#References links to http://www.bath.ac.uk/ace/uploads/StudentProjects/Bridgeconf... and down the rabbit hole you go.

https://gwern.net/ also has great write-ups and links to original papers.

alecst · 2y ago

Honestly a lot are really hard to read. You start with the easy ones, learn the lingo, and then just keep going. Eventually you can enjoy reading the harder ones.

You learn pretty quickly that if you want answers, it's better to just go straight to the source, rather than have it filtered through someone else, where the message can (and often does) get twisted.

What are you interested in reading about? Maybe some people can recommend you some papers to start with.

eru · 2y ago

There are certainly easier and harder papers. Though when you are struggling: keep in mind that there are also papers that are just badly written (and some papers that are well written).

Swizec · 2y ago

> I noticed people from hacker news routinely read scientific papers. This is a habit I envy but don't share.

> Any tips or sites for someone interested in picking up more science papers to read.

Personally, the older I get, the more bored I've been getting with the level of information that "crosses my desk".

Eventually I basically stopped reading blogs et al and started getting my insights from books. Those books would often mention papers. Then I noticed a lot of books (and deep well-researched podcasts) mentioning the same papers. So I started reading those papers.

When you read a couple papers, you notice most of them reference a bunch of other papers. Now you have an exponentially growing queue of interesting papers that you'll never get to. Mission accomplished.

The main trick is to read stuff you're interested in knowing and understanding. Many papers can be quite difficult to read, but getting through a single paper will fuel your brain with more valuable information than 2 weeks of "the internet". In my experience at least.

Ultimately, life is short and papers give you a better information density return on your time than almost anything else. Even the bad ones.

mst · 2y ago

For computer science, https://blog.acolyer.org/ is called The Morning Paper and talks about one interesting paper per post.

Edit: It seems to've gone on indefinite hiatus but there's a lot of backlog already there and some of it's really quite fascinating.

NalNezumi · 2y ago

There are some materials about "how to read a scientific paper", like the PDF from U Waterloo [3] with some methodological advice. There is also some good advice in this old HN thread [1].

But I don't see the point of reading a scientific paper unless you're actually curious about a specific topic. They are often hard to read, dense, and have so much field-specific jargon that if you're new, you won't be able to read one paper and grasp everything. You would have to read references, or a book/blog that summarizes the core points.

So find a specific field you're interested in, find a good book/blog/homepage/tutorial/video to get your basics going so that when you start reading papers you won't be completely lost.

Then find a highly cited survey paper to understand what progress has been made beyond what is now basic. Then you can follow your curiosity along that survey and decide on a branch of research to read up on. You'll probably then realize that a few labs research/publish a lot in a specific direction. Now you can follow those professors (Twitter, Google Scholar email notifications) to keep up to date. By reading a lot you'll also start to notice papers that are "published just to get my PhD" and soon enough you can just read abstract + intro/result to judge if it is valuable or not.

If ML/LLM is your curiosity, Lilian Weng's blog [2] is probably a good start for tutorials / surveys.

[1] https://news.ycombinator.com/item?id=24986727

[2] https://lilianweng.github.io/

Edit: direct link [3] https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPape...

necubi · 2y ago

For me it's very helpful to print out papers and read them with a pen in hand, away from my computer. Papers tend to be dense and require a level of focus that (I at least) cannot maintain when reading on a screen. It helps as well to be able to easily take notes and annotate the paper.

quickthrower2 · 2y ago

Pick ones that are easy to read. Some are written like a magazine article. Others are math dense, reference another paper you can’t get hold of every other sentence and are a kind of marketing material anyway.

Also youtube and code: Attention is all you need is not a nice paper to read for Joe programmer, but you can understand what it is doing by watching karpathy and reading his code (or someone else who has implemented it, Llama for example). But you need to do some basic torch training first (karpathy again!)

AnthonBerg · 2y ago

Anyone can read scientific papers. All you need to do is pierce the layer of jargon. It takes practice but you kind of just pick it up. Reading on a computer helps because you can get words defined by clicking on them. Reading on paper is good too, it’s easier to keep at it and it sticks better.

Some sense of urgency helps. Most people will have a medical ailment or physiological issue of some sort. I promise you that there exist useful papers on it.

TX81Z · 2y ago

Once you obtain subject mastery you just need to read the abstracts.

To get a cold start look for a “survey”, “literature review”, or “systematization of knowledge” papers. Those organize a lot of papers, check out the ones that look cool and read the abstracts.

Rinse and repeat for five years and you get a phd.

obblekk · 2y ago

Build the habit.

When google doesn't return a good result to a specific question, switch to scholar.google.com and start reading abstracts. It'll seem like an opaque maze at first, but just keep reading and it'll start clearing up pretty quickly and become useful.

rgoldste · 2y ago

Don’t feel like you need to understand 100%. You can always give yourself an hour to read a paper and gloss over some notation. If you read 5 papers over the course of a month, you can go back to your favorite and dive into the notation.

j7ake · 2y ago

Feedly with keywords for your favorite topics or researchers works decently.

I imagine this routine comes from people with research backgrounds, where browsing papers is the academic way of googling around for answers.

ugh123 · 2y ago

I usually just read the abstract and synthesize that with the comments on HN to get the gist (and legit-ness) of the research.

whimsicalism · 2y ago

They read scientific papers in the same way that everyone "read" Capital in the 21st Century, when that was a thing.

dustingetz · 2y ago

read textbooks instead. most papers are obtuse and poorly written, even famous ones. you can find them in wikipedia footnotes

throwawayadvsec · 2y ago

Step 1. Find papers you're interested in

Step 2. Open them

Step 3. read them

dekhn · 2y ago

Step 4: do a depth-first lookup of every citation, and read/finish that paper before continuing

lannisterstark · 2y ago

Step 4. Get lost within a minute.

au8er · 2y ago

Step 3.5, see some other interesting paper is referenced in the related work, go to step 1.

interrupt21h · 2y ago

Semantic Scholar for search. Scihub for any paywalled papers. Libgen for books. Zotero to organize.

jcims · 2y ago

Pick something you’re interested in and have a passing knowledge of.

152334H · OP · 2y ago

just set up a desktop service to randomly open a paper once every few hours

if they're not too boring, and you're not doing anything important, you'll read it for fun

armchairhacker · 2y ago

I read the abstract and look at the pretty figures :)

dylan604 · 2y ago

I want to know what a non-boring flight would be like

nerdponx · 2y ago

High turbulence definitely makes it less boring. So will a crying baby, disruptive passenger, or someone getting sick. After a few of those, you'll prefer the boring flights.

152334H · OP · 2y ago

https://www.youtube.com/watch?v=iFImKMjM-q4

LordShredda · 2y ago

Snakes on a plane

jiggawatts · 2y ago · 21 in thread

Floating point inaccuracies are generally deterministic - running the same calculations twice ought to yield the same results, down to the bit.

You only get divergent results if there is some other source of state or entropy: not zeroing buffers correctly, race conditions, not setting rounding mode flags consistently, etc…

From the quality of the code I’ve seen being cobbled together in the AI/ML ecosystem I would assume all three of those issues are going on, and maybe more.

n2d4 · 2y ago

No, this is not true for GPUs. https://www.twosigma.com/articles/a-workaround-for-non-deter...

(In this particular case, the order in which the numbers are summed up is non-deterministic due to GPU parallelism, which may change the result slightly.)

I would generally refrain from insulting other people's code if you don't know much about the system it's written on.

Editing here since all the replies to this are mostly saying the same thing: Yes, CPUs can also be parallel and it can happen there as well, but unlike a CPU where most instructions on their own are deterministic, CUDA provides primitives that aren't. This is very much by design (as they're faster than their deterministic counterparts), and I mostly just take issue with how parent phrased this as a bug caused by bad code.

Tunabrain · 2y ago

GPUs are deterministic machines, even for floating point.

The behavior in the linked article has to do with the use of atomic adds to reduce sums in parallel. Floating point addition is not associative, so the order in which addition occurs matters. When using atomic adds this way, you get slightly different results depending on the order in which threads arrive at the atomic add call. It's a simple race condition, although one which is usually deemed acceptable.
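A minimal sketch of that effect, in plain Python rather than CUDA. The two hard-coded lists stand in for two possible orders in which threads might reach the atomic add; the names `sum_in_order`, `arrival_a`, and `arrival_b` are made up for illustration, and the values are picked so the rounding is easy to follow by hand.

```python
import math

def sum_in_order(values):
    """Accumulate left to right, the way a sequence of atomic adds would."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same four numbers, two hypothetical arrival orders.
arrival_a = [1e16, 1.0, 1.0, -1e16]
arrival_b = [1.0, 1.0, 1e16, -1e16]

result_a = sum_in_order(arrival_a)  # 0.0: each 1.0 is rounded away next to 1e16
result_b = sum_in_order(arrival_b)  # 2.0: the small terms combine first and survive
exact = math.fsum(arrival_a)        # 2.0: correctly rounded, independent of order

print(result_a, result_b, exact)
```

Every individual addition here is fully deterministic; only the order differs, which is the whole point of the comment above.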

jiggawatts · 2y ago

Read the article you linked.

It literally says that the GPU is deterministic, the NVIDIA libraries on top are deterministic, but it is Tensorflow that introduces variability (errors!) for “performance”.

My argument is that it is the AI/ML code that is introducing non-determinism, usually by sacrificing repeatability to gain performance.

That's precisely what's happening here. Tensorflow introduced a "harmless"[1] data race to improve performance by not having to use a deterministic but slower algorithm.

The individual floating point computations are deterministic, it's the multi-threaded design on top that's introducing the variability in the output.

[1] Used to be harmless, but cutting corners like this will make it nigh impossible to repeatably validate the safety of future models like GPT5. That seems pretty dangerous...

johndough · 2y ago

The PyTorch documentation has an entire section about how to make your code deterministic. In my experience, the performance difference is negligible.

https://pytorch.org/docs/stable/notes/randomness.html#avoidi...

Unfortunately, determinism across devices or even driver versions is not that easy. You'd have to write your own BLAS kernels using only basic operations, which are guaranteed to follow IEEE 754 semantics.

https://docs.nvidia.com/cuda/floating-point/index.html

One gotcha is fused multiply-adds, which the compiler may or may not introduce, so you have to wrap all your floating point operations with __fma* intrinsics to make sure the compiler does not interpret them differently.
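To see why a fused multiply-add can change a result, here is a rough illustration in Python. It does not call any GPU intrinsic: `fused` is a hypothetical helper that emulates an FMA by computing a*b+c exactly with `fractions.Fraction` and rounding once, while `unfused` uses ordinary float arithmetic, which rounds the product before the add. The inputs are chosen by hand so the two paths disagree.

```python
from fractions import Fraction

def unfused(a, b, c):
    """a*b + c with two roundings: the product is rounded, then the sum."""
    return a * b + c

def fused(a, b, c):
    """Emulated FMA: exact rational arithmetic, then a single rounding to float."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# a*b is exactly 1 + 2**-26 + 2**-54; the last term does not fit in a double.
a = b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

two_roundings = unfused(a, b, c)  # 0.0: the 2**-54 term was lost when a*b was rounded
one_rounding = fused(a, b, c)     # 2**-54: the term survives to the final rounding

print(two_roundings, one_rounding)
```

Both answers are "correct" under their own rules, so the same source can give different bits depending on whether the compiler contracts the expression.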

nextaccountic · 2y ago

As far as I can tell this article doesn't explain why this happens on the GPU (for example, why Tensorflow's reduce_sum is non-deterministic). My hypothesis is that this is entirely due to concurrency: if the same code can be run in two or more different interleavings, they can produce different results. This is corroborated by the first answer here [0].

If so, this exact same issue happens in CPU code as well: have two or more threads, run the program many times, observe different interleavings that expose race conditions which (depending on the algorithm) may or may not produce different results. This can happen even if you don't use floating point, and has nothing to do with floating point non-determinism itself. For example, have a thread print "Hello" and another thread print "World"; even without tearing, you may see either Hello World or World Hello on the screen.
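The Hello/World case can be sketched with Python threads. This is only an illustration: `run_once` and `say` are made-up names, and in practice CPython will almost always record "Hello" first because the first thread finishes before the second one starts. The point is that nothing in the program guarantees that order.

```python
import threading

def run_once():
    """Start two threads that each record one word; return the words in arrival order."""
    words = []
    lock = threading.Lock()

    def say(word):
        # The lock prevents torn updates; it does not impose an order on the threads.
        with lock:
            words.append(word)

    threads = [threading.Thread(target=say, args=(w,)) for w in ("Hello", "World")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return words

result = run_once()
print(" ".join(result))  # either "Hello World" or "World Hello" is a valid outcome
```

No floating point is involved, which matches the claim that this kind of variation comes from scheduling rather than from arithmetic.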

Now, proper floating point non-determinism happens in two cases. One is that when you run the same code on two different architectures you could get different answers (because of rounding modes, or because some architecture doesn't support subnormal numbers or signaling NaNs, or because transcendental functions like sine are implemented with different accuracy, etc). In this case it's deterministic when run on the same machine, but may run differently on another machine with a different architecture.

The other case is that some "optimizations" actually break your code if applied carelessly (you enable those broken optimizations with -ffast-math in C for example). Among other things, this may break numerical stability of algorithms like Kahan summation. And, if you let the compiler decide which exact optimizations will be applied and in what order, you get non-determinism between different compilers. So in this case it's deterministic when compiled with the same compiler, but may run differently with another compiler.
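For reference, a small Python sketch of Kahan summation, with made-up test data. The compensation line computes `(t - total) - y`, which is zero in real-number algebra; a compiler that is allowed to reassociate floating point (as with -ffast-math) may simplify it away, and the algorithm then degrades to the naive loop. Python does no such rewriting, so here the compensation works as written.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Compensated summation: carry the rounding error of each add into the next."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y  # algebraically zero; numerically the rounding error
        total = t
    return total

# Hypothetical data: one large term followed by many terms too small to register alone.
values = [1.0] + [1e-16] * 100_000

naive = naive_sum(values)      # 1.0: every tiny term is rounded away
kahan = kahan_sum(values)      # close to the correctly rounded sum
reference = math.fsum(values)  # correctly rounded sum from the standard library

print(naive, kahan, reference)
```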

[0] https://stackoverflow.com/questions/50744565/how-to-handle-n...

ascar · 2y ago

To nitpick in addition to the already existing comments: this has nothing to do with GPUs per se. You would see the same issue in multithreaded code on a CPU. Even on a single core CPU this can happen with a multithreaded program depending on how the OS schedules and interrupts the threads. It just happens to be an implementation choice in a GPU library/API.

mschuster91 · 2y ago

> I would generally refrain from insulting other people's code if you don't know much about the system it's written on.

Well, the general state of how utterly shoddy most of the code in the AI/ML ecosystem is is observable to anyone trying to follow a guide on how to set up Stable Diffusion on AWS. It's a fucking mess of trying various combinations of driver versions, Ubuntu kernel versions, Python versions, and the fact that Python requirements.txt (similar to NodeJS) doesn't pin versions of transitive dependencies doesn't make it easier because it makes for very brittle and not reproducible builds/guides. Oh, and at least some of that stuff won't work without root.

Yeah I'll keep AI shit cordoned off in its own subnet.

zx14 · 2y ago

There isn't much of a culture around code quality in ML / AI / DS.

benreesman · 2y ago

I don’t know about how insulting it is, I don’t like rushing things out but we’ve all had to.

People are rushing like crazy to get there first with X for AI all over the place, it would be pretty shocking if there weren’t wires sticking out everywhere.

I don’t think that says anything positive or negative about the hackers involved.

jes5199 · 2y ago

it’s basically always reasonable to insult someone’s code because we are computer programmers and we know what we have done

DeathArrow · 2y ago

So you can generate true random numbers using just the GPU parallelism? Consider me impressed!

xyzzy_plugh · 2y ago

You've moved the goal posts. You're conflating CUDA with GPUs. From Wikipedia:

> CUDA (or Compute Unified Device Architecture) is a proprietary and closed source parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

Is the issue we're discussing because of the GPU or is it because of choices made in software libraries?

The parent is right, there is a deterministic, reproducible way to solve these problems, so if determinism is a desired or expected property, then this is a bug. It's not an inherent problem like you make it out to be. The fact that "workarounds" are given in what you link proves this.

KolenCh · 2y ago

What you said can be violated when parallelism is involved. One such example is that we know some floating point operations such as addition and multiplication are non-commutative, hence the result of a reduction, for example, depends on the order of execution. And then in a parallel situation, some implementations will make the order of reduction non-deterministic (for performance reasons) and hence the final result is also non-deterministic.

toxik · 2y ago

Minor nit but commutative is the wrong term. Floats always obey a+b == b+a, but not associativity: (a+b)+c != a+(b+c).
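The distinction is quick to check in Python with the usual decimal suspects. The variable names are just for illustration, and the commutativity check assumes ordinary (non-NaN) values.

```python
a, b, c = 0.1, 0.2, 0.3

commutes = (a + b) == (b + a)  # True: swapping operands gives the same rounded sum

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(commutes, left, right, left == right)
```

Swapping operands is harmless; regrouping them is what changes the last bit.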

DeathArrow · 2y ago

It's still deterministic even if the results appear not to be. If you have memory, CPU cache, CPU registers in the same state, you will get the very same results. You need a source of entropy for the results to be non deterministic.

dwpdwpdwpdwpdwp · 2y ago

Mathematically, computation is deterministic. The author dismisses or ignores the many ways that the physical apparatus driving the computation can force the result of a software application to be a function of time.

Calling GetTimeOfDay() could do it.

Clock frequency drift between multiple processors could do it.

stevefan1999 · 2y ago

Quantum computers fall under the category of computers.

Quantum computation relies on quantum mechanics.

Quantum mechanics is not deterministic.

So, Quantum computers are not deterministic.

Therefore, unless P=NP, not all computations are deterministic.

water92y ago

When theory fails to consult reality.

neatze2y ago

Hmm, I wonder whether the results of Alhazen's Circular Billiard Problem[1] for n steps in simulation will be the same across multiple runs.

[1] https://forumgeom.fau.edu/FG2012volume12/FG201216.pdf

DeathArrow2y ago

On a large scale, not having memory with good ECC is enough to have entropy.

alexnewman2y ago

Small nit. You mean errors due to floating point math

refulgentis2y ago· 8 in thread

This is _excellent_ work, I've been adamantly against MoE for a set of reasons, this is the first compelling evidence I've seen that hasn't been on Substack or a bare repeating of rumor.

I had absolutely no idea GPT4 was nondeterministic and I use it about 2 hours a day. I can see why a cursory look wasn't cutting it: the outputs "feel" the same in your memory, with a lot of similar vocab usage, but they are formatted entirely differently, and have sort of a synonym-phrase thing going where some of the key words are the same.

152334HOP2y ago

Thanks. I'm really no expert (:P) on MoE research; I just noticed what was written in the Soft MoE paper and felt a need to check.

The non-deterministic outputs are really similar, yeah, if you check the gist examples I linked https://gist.github.com/152334H/047827ad3740627f4d37826c867a.... This part is at least no surprise, since the randomness should be bounded.

I suspect OpenAI will figure out some way to reduce the randomness at some point, though, given their public commitment to eventually adding logprobs back to ChatCompletions.

cubefox2y ago

I don't think this commitment had any plausibility. Token "probabilities" only have a straightforward probabilistic interpretation for base models. In fine-tuned models, they no longer represent the probability of the next token given the prompt, but rather how well the next token fulfills the ... tendencies induced by SL and RL tuning. Which is presumably pretty useless information. OpenAI has no intention to provide access to the GPT-4 base model, and they in fact removed API access to the GPT-3.5 base model.

derwiki2y ago

GPT4 web chat for two hours a day? I buy that. But use the API repeatedly with the same inputs, e.g. while developing a program, and the non-determinism is hard to miss.

sebzim45002y ago

I would imagine that most people use nonzero temperature, so they won't need to look for any explanation for non-determinism.

phillipcarter2y ago

Yeah, it's one of the first things you notice when trying to do some kind of "feed GPT some data and get it to produce a novel answer to a question" task with the API.

FanaHOVA2y ago

> I've been adamantly against MoE for a set of reasons

Such as?

lucubratory2y ago

It was completely unsubstantiated, based on rumours from a blog, but everyone repeated it as fact.

bredren2y ago

What do you use it for? Are you using many plugins? Curious what sort of insights someone using the tool this much might have, perhaps even through the batch of features released this week.

gojomo2y ago· 6 in thread

Not sure I understand the excerpt from the referenced paper.

Is it saying that part of its more-efficient inferencing relies on mixing tokens from completely-separate inputs – eg, from other users? And then, depending on what other inputs chance into the same grouping, the relative assignment-to-'experts' varies, and thus the eventual completions?

If so, I'd see that as not just introducing non-determinism, but also potentially making the quality of your responses dependent on how-many-concurrent-requests are fighting for the same expert-allocations.

(For example, maybe the parts of the system best at translating/interpreting Hindi give worse results during peak usage hours-of-the-day in India, when the most concurrent inputs are competing for that same competence.)

Perhaps also, this is another possible explanation for perceived quality-degradation over time. When certain tests were reliably succeeding earlier, there was less congestion for the relevant 'experts'. Now, with more concurrent use, those same tests aren't as reliably winning as much of the relevant 'experts' effort.

This may also suggest a bit of a quagmire: on whatever domains some sub-experts seem impressively good, initially, even more proportionate use will be attracted. But such new congestion means all the copycat use no longer gets the same expert allocations – and thus the initially-impressive performance degrades.

(And if the effect is strong, & known-but-undisclosed-by-OpenAI, does it amount to a bait-and-switch? Attract users with unrepresentative excellence on an initially-uncongested Mixture-of-Experts system, but then offer them the lower-quality results from a more-congested system.)

spott2y ago

The results are showing essentially 12 unique responses from 30 tries… not what you would expect from mixing tokens.

I think it groups the batch up differently. So if I have a batch of 10, and it splits it into 2 groups of 5, I get a different answer depending on whether my prompt lands in the first or the second group. But if I’m in the same location in the batch, then I get the same answer.

The whole batch is deterministic given the same batch (sequences and ordering), but if you shuffle the batch then you lose that determinism.
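A toy sketch of that batch dependence, under loud assumptions: this is not OpenAI's routing, the function name `route` is made up, each sequence's preferred expert is given explicitly instead of coming from a learned router, and the "spill to the next expert when the buffer is full" rule is just one simple capacity policy.

```python
def route(batch, num_experts=2, capacity=2):
    """Toy router: a sequence takes its preferred expert if that expert's
    buffer has room, otherwise it spills over to the next expert."""
    load = [0] * num_experts
    assignment = {}
    for name, preferred in batch:
        for offset in range(num_experts):
            expert = (preferred + offset) % num_experts
            if load[expert] < capacity:
                load[expert] += 1
                assignment[name] = expert
                break
        else:
            assignment[name] = None  # every buffer is full: the sequence is dropped
    return assignment


# Same four sequences; only the position of "mine" in the batch differs.
batch_a = [("other1", 0), ("other2", 0), ("mine", 0), ("other3", 1)]
batch_b = [("mine", 0), ("other1", 0), ("other2", 0), ("other3", 1)]

print(route(batch_a)["mine"])  # 1: expert 0 was already full
print(route(batch_b)["mine"])  # 0: arrived before the buffer filled up
```

Calling `route` twice on the same batch in the same order always gives the same assignment, so the non-determinism in this sketch comes only from how requests happen to be grouped.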

albystein2y ago

this seems like a plausible outcome, and if true could spell disaster for OpenAI models relative to the competition and open source models. Currently, reliability is one of the core obstacles preventing widespread adoption of LLMs in many business critical workflows. And if these rumors, that GPT-4 is inherently un-deterministic and unreliable, are true then most enterprises are better off finetuning open source LLMs—which are just as capable—for their specific domains. they stand to gain better performance that way anyways, as domain-specific models will always outperform generalist ones

mrtranscendence2y ago

> And if these rumors, that GPT-4 is inherently un-deterministic and unreliable, are true then most enterprises are better off finetuning open source LLMs—which are just as capable

Wait, am I misunderstanding you? I feel like I've had a head injury or something, because I've never heard of an open source LLM that's as capable as GPT-4 (in most scenarios).

geysersam2y ago

> domain-specific models will always outperform generalist ones

That's only true assuming you have enough data to train a domain-specific model, and the expertise to train it and test it correctly.

I've encountered cases where an image recognition task could be accomplished well with a very general model like CLIP, but people still fine-tuned another model on their own small data set because that's considered better.

A domain specific model might be more likely to fail on weird outliers not present in the small domain specific training data.

> could spell disaster for OpenAI

Nah I don't think so. They are not all in on one specific model architecture. If the current architecture is found to have serious unfixable flaws then they'll just change architecture.

famouswaffles2y ago

>as domain-specific models will always outperform generalist ones

This is not even close to true for Language models.

famouswaffles2y ago

Fine-tuned MedPalm is worse than GPT-4 on most Medical Challenge Tests. Fine-tuned Minerva is much worse on arithmetic benchmarks.

The LLM space is just different. There's no guarantee a fine-tuned model will beat a bigger generalist one.

alpark32y ago· 5 in thread

_If_ 3.5 is a MoE model, doesn't that give a lot of hope to open source movements? Once a good open source MoE model comes out, maybe even some type of variation of the decoder models available (I don't know whether MoE models have to be trained from scratch), that implies a lot more can be done with a lot less.

152334HOP2y ago

I agree, and really hope that Meta is doing something in that vein. Reducing the FLOPs:Memory ratio (as in Soft MoE) could also open the door to CPU (or at least Apple Silicon) inference becoming more relevant.

osmarks2y ago

It would be bad for single-consumer-GPU inference setups.

Me10002y ago

Not an expert (no pun intended), but MoE where each expert is actually just a LoRA adaptor on top of the base model gets me pretty excited. Since LoRA adaptors can be swapped in and out at runtime, it might be possible to get decent performance without a lot of extra memory pressure.

worldsayshi2y ago

Could this work well with distributed solutions like petals?

https://github.com/bigscience-workshop/petals

I don't understand how petals can work though. I thought LLMs were typically quite monolithic.

kristianp2y ago

It could be good if the relevant expert(s) can be loaded on demand after reading the prompt? If the MoE is, say, 8x8b params, then you could get good speed out of a 12GB GPU, despite the model being 64B params in size. Or am I misunderstanding how this all works?
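A back-of-the-envelope check of those numbers, as a sketch only. It assumes "8x8b" means eight fully separate experts of 8 billion parameters each, counts weights alone (no KV cache, activations, or shared layers), and uses decimal gigabytes; `weights_gb` is a made-up helper.

```python
GB = 1e9  # decimal gigabytes


def weights_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in GB."""
    return num_params * bits_per_param / 8 / GB


params_per_expert = 8e9
num_experts = 8
total_params = params_per_expert * num_experts  # 64e9, i.e. 64B

for bits in (16, 8, 4):
    one = weights_gb(params_per_expert, bits)
    everything = weights_gb(total_params, bits)
    fits = "fits" if one <= 12 else "does not fit"
    print(f"{bits}-bit: one expert {one:.0f} GB ({fits} in 12 GB), all experts {everything:.0f} GB")
```

Under these assumptions a single expert only fits in 12 GB at 8 bits per weight or less. Note also that in the sparse MoE designs described in the literature, routing is typically done per token and per layer rather than once per prompt, so which experts are "relevant" may not be known after reading the prompt.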

osmarks2y ago· 4 in thread

I feel like this introduces the potential for weird and hard-to-implement side channel attacks, if the sequences in a batch can affect the routing of others.

tehsauce2y ago

I think you’re right. Would be very hard to exploit I imagine though.

derwiki2y ago

Hard like building a virtual machine in an image decoder? If there’s a way there’s a will.

catchnear43212y ago

the tools available to imagine such things are limited today.

the language models in our heads have not caught up to the ones in our browsers.

as the similarities and associations crystallize a bit better, it won’t look so hard.

bookmark this if you think it bullshit. eight months.

adql2y ago

Same thing was said about Spectre-like bugs

pazimzadeh2y ago· 3 in thread

Mixture of Experts

TechBro86152y ago

Thanks. I assumed it was Margin of Error. The article doesn't expand the acronym until midway through the post, where it appears almost accidentally. Perhaps the intended audience is a mixture of experts, of which I'm not a part.

mst2y ago

I suspect the article is written primarily to be clear to people sufficiently immersed in the relevant areas to be able to have a concrete opinion on the theory.

Also I strongly suspect that at least in the case of -me-, an article that was easier for me to understand wouldn't make the underlying theory any easier for me to judge.

(on the upside, at least I -did- understand and appreciate your self deprecating pun :)

airstrike2y ago

Thank you! I knew it couldn't mean "Merger of Equals"... but then again, if those experts are equals, then maybe that acronym also works ;-)

cratermoon2y ago· 3 in thread

> It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0

Interestingly, on another discussion there was a claim that setting the temperature to 0.0 made gpt-4 deterministic: https://news.ycombinator.com/item?id=36503146

moonchrome2y ago

This guy probably never did anything nontrivial with the API - you notice almost instantly that the chat models (both 3.5 and 4) are nondeterministic at 0 temperature. Source - built a documentation search bot and had it crap out on me on copy pasted prompts when I was demoing it.

cratermoon2y ago

Apparently, and I haven't tested this, just from what I read, the simpler GPT-2 models are deterministic at 0 temperature.

keskival2y ago

If you want to make it deterministic, just cache the responses keyed by queries.
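A minimal sketch of that idea. The class name `CachedClient` and the `call_model` callable are hypothetical stand-ins for whatever API wrapper you actually use; the cache is in-memory only and keyed on the prompt plus all sampling parameters.

```python
import hashlib
import itertools
import json


class CachedClient:
    """Wraps a possibly non-deterministic completion function with a cache."""

    def __init__(self, call_model):
        self._call_model = call_model  # hypothetical: any callable(prompt, **params) -> str
        self._cache = {}

    def complete(self, prompt, **params):
        # Key on everything that can influence the answer, in a stable order.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._call_model(prompt, **params)
        return self._cache[key]


# Fake backend that returns a different answer on every call.
_counter = itertools.count(1)


def flaky_model(prompt, **params):
    return f"answer #{next(_counter)} to {prompt!r}"


client = CachedClient(flaky_model)
print(client.complete("What is MoE?", temperature=0))
print(client.complete("What is MoE?", temperature=0))  # same text as the line above
```

This makes repeated identical queries reproducible, but it does not make the model itself deterministic: the first answer stored is still whichever one the backend happened to return.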

hyperthesis2y ago· 2 in thread

MoE: Mixture of Experts

ShamelessC2y ago

There’s a comment that’s 3 hours older than yours that clarifies this.

hyperthesis2y ago

I searched for MoE in the comments and didn't see it. ah, you must mean this one https://news.ycombinator.com/item?id=37006549, which doesn't include "MoE", so that's why I didn't find it. Still, my comment's upvotes show it was helpful to some - maybe they searched for "MoE" too, instead of "mixture of experts".

crazypython2y ago· 1 in thread

The GPT-3.0 "davinci-instruct-beta" models have been returning non-deterministic logprobs as early as early 2021. This is speculation. CUDA itself often has nondeterminism bugs.

text-davinci-001 and text-davinci-002 were trained through FeedMe and SFT, while text-davinci-003 was RLHF; the models themselves have more variance at high temperature.

cubefox2y ago

What about the foundation models, i.e. davinci and code-davinci-002?

f1shy2y ago· 1 in thread

I see in the comments there seems to be a huge misunderstanding between 2 uses of “non-deterministic”: 1) from normal English: cannot be determined beforehand (results may vary) 2) from theory of computation: loosely “parallel computation” (unknown path to the solution)

PeterisP2y ago

For floating point math, there's no distinction, as "parallel computation with unknown path to the solution" inherently implies "results will vary", as (a+b)+c != a+(b+c).

throwawayadvsec2y ago

"these tokens often compete against each other for available spots in expert buffers. " So is this also why ChatGPT is often just writing placeholders in place of functions when I ask him for some long code?

afro882y ago

> these tokens often compete against each other for available spots in expert buffers.

Hold up, does this mean that under heavy load the results change? Does this explain why it sometimes feels like the output quality changes?

cainxinth2y ago

I asked GPT to explain this:

>In the MoE approach, different "experts" or portions of the model are selected for different parts of the input data. The selection of which experts to use can be influenced by several factors, including the specific content of the input data, the order in which data is processed in a batch, and possibly even minor variations in the internal state of the model.

>This "expert selection" process introduces a level of stochasticity, or randomness, into the model's operation. For example, if you process the same input data twice in slightly different contexts (e.g., as part of different batches), you might end up consulting slightly different sets of experts, leading to slightly different outputs.

icelancer2y ago

How interesting. I was just discussing this last night with our analysts after I experimentally noticed that temp=0.0 (and all penalties/top_p set accordingly) still showed non-determinate behavior. Wasn't sure why this was, and now this article comes about.

The explanation makes quite a bit of sense.
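One way to see why temp=0.0 alone does not guarantee identical output: greedy decoding just takes the argmax of the logits, so a tiny numerical difference between two near-tied candidates is enough to flip the chosen token. The logit values below are invented for illustration, and `greedy_token` is a made-up helper.

```python
def greedy_token(logits):
    """Temperature 0 in practice: pick the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Two runs of the "same" computation whose logits differ only by rounding noise.
logits_run_a = [2.5, 1.0, 2.5 - 1e-7]
logits_run_b = [2.5 - 2e-7, 1.0, 2.5 - 1e-7]

print(greedy_token(logits_run_a))  # 0
print(greedy_token(logits_run_b))  # 2
```

Once one token differs, everything generated after it is conditioned on a different prefix, so the completions can diverge from there.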

rgoldste2y ago

This is a plausible hypothesis. I’m curious whether OpenAI has considered this already and examined it. I feel like an average senior eng could eval this in under two focused days, but maybe OpenAI has less unit-testing than I expect.

DeathArrow2y ago

Well, a colleague of mine managed to build a non deterministic GET REST API endpoint. :D

albystein2y ago

this hypothesis makes a lot of sense. if indeed gpt-4 is a sparse MoE—which i believe it is—then OpenAI must have tested and proved their initial idea of a large capacity MoE LLM by first training/building a smaller one. this smaller test model might be gpt-3.5-turbo.

rvcdbn2y ago

I wonder if there’s a side channel attack in there waiting to happen..

pmarreck2y ago

Determinism should always be an option in any system.

heroku2y ago

can somebody make some quantum AI, that's super deterministic.


dudus2y ago· 42 in thread

Off topic

> 3 months later, reading a paper while on board a boring flight home, I have my answer.

I noticed people from hacker news routinely read scientific papers. This is a habit I envy but don't share.

Any tips or sites for someone interested in picking up more science papers to read.

jldugger2y ago

For just getting started I recommend collections:

1. Ideas That Created The Future[1]. It's a collection of fiftyish classic CS papers, with some commentary.

2. Wikipedia's list[2].

3. Test of Time awards[3]. These are papers that have been around for a while and people still think are important.

4. Best paper awards[4]. Less useful than ToT as not every best paper is actually that good or important, and sometimes the award committees can't see past names or brands for novel research.

[1]: https://www.amazon.com/Ideas-That-Created-Future-Computer/dp...

[2]: https://en.wikipedia.org/wiki/List_of_important_publications...

[3]: https://www.usenix.org/conferences/test-of-time-awards

[4]: https://jeffhuang.com/best_paper_awards/

[5]: https://dl.acm.org/journal/csur

puzzledobserver2y ago

1 more reply

jldugger2y ago

From there, just keep a reading queue. If you notice a particular journal is a good source of material, consider subscribing to it.

nerdponx2y ago

> I noticed people from hacker news routinely read scientific papers.

Do they? I suspect that most don't, and those that do are either in specialized careers or are engaged in some kind of scientific research.

MacsHeadroom2y ago

Long time HN'er college dropout and I read a LOT of scientific papers. Probably an average of 4 a week over the past couple of decades, sometimes reading 40 in a week.

I probably averaged 20 a week back in March when open source AI was booming in the wake of Llama and on the heels of GPT-4.

1 more reply

i-use-nixos-btw2y ago

I strongly agree.

CSMastermind2y ago

Honestly many papers are written in a way that's hard to approach and difficult to understand unless you're prepared to reread them a few times.

You're better off just getting your science news from actual science communicators and not the raw source.

brmgb2y ago

> I noticed people from hacker news routinely read scientific papers.

Highly doubt that. It’s very hard to actually read scientific papers when you are not actively doing research.

I don’t even know how you would scheme introduction and sources to filter articles which are immediately obviously useless without being immersed in a field.

cypress662y ago

You don't need to be doing research to read an ML paper. With some general knowledge in AI you should be able to understand most papers.

Finally, you usually hear about these cool papers via Twitter / X

1 more reply

allisdust2y ago

LudwigNagasena2y ago

Math may actually be the first thing you recognise in a paper, which can help you cross-reference the text to understand it.

obblekk2y ago

Build the habit.

TechBro86152y ago

4 more replies

cpeterso2y ago

Check out the papers and talks from Papers We Love, a "repository of academic computer science papers and a community who loves reading them":

https://paperswelove.org/

eru2y ago

It depends on why you want to read papers and what you want to get out of it.

https://gwern.net/ also has great write-ups and links to original papers.

alecst2y ago

Honestly a lot are really hard to read. You start with the easy ones, learn the lingo, and then just keep going. Eventually you can enjoy reading the harder ones.

What are you interested in reading about? Maybe some people can recommend you some papers to start with.

eru2y ago

There are certainly easier and harder papers. Though when you are struggling: keep in mind that there are also papers that are just badly written (and some papers that are well written).

Swizec2y ago

> I noticed people from hacker news routinely read scientific papers. This is a habit I envy but don't share.

> Any tips or sites for someone interested in picking up more science papers to read.

Personally, the older I get, the more bored I've been getting with the level of information that "crosses my desk".

Ultimately, life is short and papers give you a better information density return on your time than almost anything else. Even the bad ones.

mst2y ago

For computer science, https://blog.acolyer.org/ is called The Morning Paper and talks about one interesting paper per post.

Edit: It seems to've gone on indefinite hiatus but there's a lot of backlog already there and some of it's really quite fascinating.

NalNezumi2y ago

There are some materials about "how to read scientific paper", like the pdf one from U waterloo [3] with some methodological advice. Some good advice in this old HN thread [1]

So find a specific field you're interested in, find a good book/blog/homepage/tutorial/video to get your basics going so that when you start reading papers you won't be completely lost.

If ML/LLM is your curiosity probably Lillian Wengs blog [2] is a good start for tutorials / surveys.

[1] https://news.ycombinator.com/item?id=24986727

[2] https://lilianweng.github.io/

Edit: direct link [3] https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPape...

necubi2y ago

quickthrower22y ago

AnthonBerg2y ago

Some sense of urgency helps. Most people will have a medical ailment or physiological issue of some sort. I promise you that there exist useful papers on it.

TX81Z2y ago

Once you obtain subject mastery you just need the read the abstracts.

Rinse and repeat for five years and you get a phd.

obblekk2y ago

Build the habit.

rgoldste2y ago

j7ake2y ago

Feedly with keywords for your favorite topics or researchers works decently.

I imagine this routine comes from people with research backgrounds, where browsing papers is the academic way of googling around for answers.

ugh1232y ago

I usually just read the abstract and synthesize that with the comments on HN to get the gist (and legit-ness) of the research.

whimsicalism2y ago

They read scientific papers in the same way that everyone "read" Capital in the 21st Century, when that was a thing.

dustingetz2y ago

read textbooks instead most papers are obtuse and poorly written even famous ones. you can find them in wikipedia footnotes

throwawayadvsec2y ago

Step 1. Find papers you're interested in Step 2. Open them Step 3. read them

dekhn2y ago

Step 4: do a depth-first lookup of every citation, and read/finish that paper before continuing

lannisterstark2y ago

Step 4. Get lost within a minute.

au8er2y ago

Step 3.5, see some other interesting paper is referenced in the related work, go to step 1.

1 more reply

interrupt21h2y ago

Semantic Scholar for search. Scihub for any paywalled papers. Libgen for books. Zotero to organize.

1 more reply

jcims2y ago

Pick something you’re interested in and have a passing knowledge of.

152334HOP2y ago

just set up a desktop service to randomly open a paper once every few hours

if they're not too boring, and you're not doing anything important, you'll read it for fun

armchairhacker2y ago

I read the abstract and look at the pretty figures :)

dylan6042y ago

I want to know what a non-boring flight would be like

nerdponx2y ago

High turbulence definitely makes it less boring. So will a crying baby, disruptive passenger, or someone getting sick. After a few of those, you'll prefer the boring flights.

152334HOP2y ago

https://www.youtube.com/watch?v=iFImKMjM-q4

LordShredda2y ago

Snakes on a plane

1 more reply

jiggawatts2y ago· 21 in thread

Floating point inaccuracies are generally deterministic - running the same calculations twice ought to yield the same results, down to the bit.

You only get divergent results if there is some other source of state or entropy: not zeroing buffers correctly, race conditions, not setting rounding mode flags consistently, etc…

From the quality of the code I’ve seen being cobbled together in the AI/ML ecosystem I would assume all three of those issues going on, and maybe more.

n2d42y ago

No, this is not true for GPUs. https://www.twosigma.com/articles/a-workaround-for-non-deter...

(In this particular case, the order in which the numbers are summed up is non-deterministic due to GPU parallelism, which may change the result slightly.)

I would generally refrain from insulting other people's code if you don't know much about the system it's written on.

Tunabrain2y ago

GPUs are deterministic machines, even for floating point.

1 more reply

jiggawatts2y ago

Read the article you linked.

It literally says that the GPU is deterministic, the NVIDIA libraries on top are deterministic, but it is Tensorflow that introduces variability (errors!) for “performance”.

My argument is that it is the AI/ML code that is introducing non-determinism, usually by sacrificing repeatability to gain performance.

That's precisely what's happening here. Tensorflow introduced a "harmless"[1] data race to improve performance by not having to use a deterministic but slower algorithm.

The individual floating point computations are deterministic, it's the multi-threaded design on top that's introducing the variability in the output.

[1] Used to be harmless, but cutting corners like this will make it nigh impossible to repeatably validate the safety of future models like GPT5. That seems pretty dangerous...

2 more replies

johndough2y ago

The PyTorch documentation has an entire section about how to make your code deterministic. In my experience, the performance difference is negligible.

https://pytorch.org/docs/stable/notes/randomness.html#avoidi...

https://docs.nvidia.com/cuda/floating-point/index.html

nextaccountic2y ago

[0] https://stackoverflow.com/questions/50744565/how-to-handle-n...

ascar2y ago

mschuster912y ago

> I would generally refrain from insulting other people's code if you don't know much about the system it's written on.

Yeah I'll keep AI shit cordoned off in its own subnet.

1 more reply

zx142y ago

There isn't much of a culture around code quality in ML / AI / DS.

1 more reply

benreesman2y ago

I don’t know about how insulting it is, I don’t like rushing things out but we’ve all had to.

People are rushing like crazy to get there first with X for AI all over the place, it would be pretty shocking if there weren’t wires sticking out everywhere.

I don’t think that says anything positive or negative about the hackers involved.

jes51992y ago

it’s basically always reasonable to insult someone’s code because we are computer programmers and we know what we have done

DeathArrow2y ago

So you can generate true random numbers using just the GPU parallelism? Consider me impressed!

1 more reply

xyzzy_plugh2y ago

You've moved the goal posts. You're conflating CUDA with GPUs. From Wikipedia:

Is the issue we're discussing because of the GPU or is it because of choices made in software libraries?

KolenCh2y ago

toxik2y ago

Minor nit but commutative is the wrong term. Floats always obey a+b == b+a, but not associativity: (a+b)+c != a+(b+c).

1 more reply

DeathArrow2y ago

2 more replies

dwpdwpdwpdwpdwp2y ago

Calling GetTimeOfDay() could do it.

Clock frequency drift between multiple processors could it.

stevefan19992y ago

Quantum computer is under the category of computers.

Quantum computation relied on Quantum mechanics.

Quantum mechanics are not deterministic.

So, Quantum computers are not deterministic.

Therefore, unless P=NP, not all computations are deterministic.

water92y ago

When theory fails to consult reality.

neatze2y ago

hmm, how, I wonder if Alhazen’ s Circular Billiard Problem[1] results for n steps in simulation will be same for multiple runs.

[1] https://forumgeom.fau.edu/FG2012volume12/FG201216.pdf

DeathArrow2y ago

On a large scale, not having memory with good ECC is enough to have entropy.

alexnewman2y ago

Small nit. You mean errors due to floating point math

refulgentis2y ago· 8 in thread

This is _excellent_ work, I've been adamantly against MoE for a set of reasons, this is the first compelling evidence I've seen that hasn't been on Substack or a bare repeating of rumor.

152334HOP2y ago

Thanks. I'm really no expert (:P) on MoE research; I just noticed what was written in the Soft MoE paper and felt a need to check.

I suspect OpenAI will figure out some way to reduce the randomness at some point, though, given their public commitment to eventually adding logprobs back to ChatCompletions.

cubefox2y ago

1 more reply

derwiki2y ago

GPT4 web chat for two hours a day? I buy that. Using the API repeatedly for the same inputs, eg developing a program, and the non-determinism is hard to miss.

sebzim45002y ago

I would imagine that most people use nonzero temperature, so they won't need to look for any explanation for non-determinism.

1 more reply

phillipcarter2y ago

Yeah, it's one of the first things you notice when trying to do some kind of "feed GPT some data and get it to produce a novel answer to a question" task with the API.

1 more reply

FanaHOVA2y ago

> I've been adamantly against MoE for a set of reasons

Such as?

lucubratory2y ago

It was completely unsubstantiated, based on rumours from a blog, but everyone repeated it as fact.

1 more reply

bredren2y ago

What do you use it for? Are you using many plugins? Curious what sort of insights someone using the tool this much might have, perhaps even through the batch of features released this week.

gojomo2y ago· 6 in thread

Not sure I understand the excerpt from the referenced paper.

spott2y ago

The results are showing essentially 12 unique responses from 30 tries… not what you would expect from mixing tokens.

The whole batch is deterministic given the same batch (sequences and ordering), but if you shuffle the batch then you lose that determinism.

albystein2y ago

mrtranscendence2y ago

> And if these rumors, that GPT-4 is inherently un-deterministic and unreliable, are true then most enterprises are better off finetuning open source LLMs—which are just as capable

Wait, am I misunderstanding you? I feel like I've had a head injury or something, because I've never heard of an open source LLM that's as capable as GPT-4 (in most scenarios).

1 more reply

geysersam2y ago

> domain-specific models will always outperform generalist ones

That's only true assuming you habe enough data to train a domain-specific model / expertise to train it and test it correctly.

A domain specific model might be more likely to fail on weird outliers not present in the small domain specific training data.

> could spell disaster for OpenAI

Nah I don't think so. They are not all in on one specific model architecture. If the current architecture is found to have serious unfixable flaws then they'll just change architecture.

famouswaffles2y ago

>as domain-specific models will always outperform generalist ones

This is not even close to true for Language models.

famouswaffles2y ago

Fine-tuned MedPalm is worse than GPT-4 on most Medical Challenge Tests. Fine-tuned Minerva is much worse on arithmetic benchmarks.

The LLM space is just different. There's no guarantee a fine-tuned model will beat a bigger generalist one.

alpark32y ago· 5 in thread

152334HOP2y ago

osmarks2y ago

It would be bad for single-consumer-GPU inference setups.

Me10002y ago

1 more reply

worldsayshi2y ago

Could this work well with distributed solutions like petals?

https://github.com/bigscience-workshop/petals

I don't understand how petals can work though. I thought LLMs were typically quite monolithic.

1 more reply

kristianp2y ago

osmarks2y ago· 4 in thread

I feel like this introduces the potential for weird and hard-to-implement side channel attacks, if the sequences in a batch can affect the routing of others.

tehsauce2y ago

I think you’re right. Would be very hard to exploit I imagine though.

derwiki2y ago

Hard like building a virtual machine in an image decoder? If there’s a way there’s a will.

catchnear43212y ago

the tools available to imagine such things are limited today.

the language models in our heads have not caught up to the ones in our browsers.

as the similarities and associations crystallize a bit better, it won’t look so hard.

bookmark this if you think it bullshit. eight months.

adql2y ago

Same thing was said about Spectre-like bugs

pazimzadeh2y ago· 3 in thread

Mixture of Experts

TechBro86152y ago

mst2y ago

I suspect the article is written primarily to be clear to people sufficiently immersed in the relevant areas to be able to have a concrete opinion on the theory.

Also I strongly suspect that at least in the case of -me-, an article that was easier for me to understand wouldn't make the underlying theory any easier for me to judge.

(on the upside, at least I -did- understand and appreciate your self deprecating pun :)

airstrike2y ago

Thank you! I knew it couldn't mean "Merger of Equals"... but then again, if those experts are equals, then maybe that acronym also works ;-)

cratermoon2y ago· 3 in thread

> It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0

Interestingly, on another discussion there was a claim that setting the temperature to 0.0 made gpt-4 deterministic: https://news.ycombinator.com/item?id=36503146

moonchrome2y ago

cratermoon2y ago

Apparently, and I haven't tested this, just from what I read, the simpler GPT-2 models are deterministic at 0 temperature.

keskival2y ago

If you want to make it deterministic, just cache the responses keyed by queries.
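A minimal sketch of that idea, assuming a hypothetical `generate(prompt, params)` callable standing in for whatever API client is actually used (the class name, the callable, and the in-memory dict are all illustrative, not any real library's interface). The key hashes the prompt together with the sampling parameters, so the same query with a different temperature is cached separately:

```python
import hashlib
import json
import random


class CachedModel:
    """Wraps a text generator and replays stored answers for repeated queries."""

    def __init__(self, generate):
        # `generate` is a hypothetical callable: (prompt, params) -> str
        self._generate = generate
        self._cache = {}

    def _key(self, prompt, params):
        # Serialize with sorted keys so equal queries always hash the same.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, prompt, **params):
        key = self._key(prompt, params)
        if key not in self._cache:
            # First time this exact query is seen: call the backend once.
            self._cache[key] = self._generate(prompt, params)
        return self._cache[key]


def flaky_backend(prompt, params):
    # Stand-in for a non-deterministic model: a different answer on every call.
    return f"{prompt} -> {random.random()}"


model = CachedModel(flaky_backend)
first = model.complete("hello", temperature=0.0)
second = model.complete("hello", temperature=0.0)
print(first == second)  # the second call is served from the cache
```

This only pins the answer for byte-identical queries within one cache; the underlying model is still as non-deterministic as before.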

hyperthesis2y ago· 2 in thread

MoE: Mixture of Experts

ShamelessC2y ago

There’s a comment that’s 3 hours older than yours that clarifies this.

hyperthesis2y ago

crazypython2y ago· 1 in thread

The GPT-3.0 "davinci-instruct-beta" models have been returning non-deterministic logprobs since early 2021. This is speculation, but CUDA itself often has nondeterminism bugs.

text-davinci-001 and text-davinci-002 were trained through FeedMe and SFT, while text-davinci-003 was RLHF; the models themselves have more variance at high temperature.

cubefox2y ago

What about the foundation models, i.e. davinci and code-davinci-002?

f1shy2y ago· 1 in thread

PeterisP2y ago

For floating point math, there's no distinction: "parallel computation with unknown path to the solution" inherently implies "results will vary", since (a+b)+c != a+(b+c).
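The non-associativity is easy to see with ordinary IEEE 754 doubles. A small sketch (the variable names and the `naive_sum` helper are just for illustration; a plain loop is used rather than the built-in `sum`, which in recent Python versions compensates for rounding error on floats):

```python
# Floating-point addition is not associative: regrouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # a + b is rounded first
right = a + (b + c)  # b + c is rounded first

print(left == right)  # False
print(left, right)


def naive_sum(values):
    """Left-to-right accumulation, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same four numbers, accumulated in two different orders, as a parallel
# reduction with an unspecified order might do.
in_order = naive_sum([1e16, 1.0, -1e16, 1.0])
reordered = naive_sum([1e16, -1e16, 1.0, 1.0])
print(in_order, reordered)  # 1.0 2.0
```

In the first ordering, one of the `1.0` terms is absorbed by `1e16` and lost to rounding; in the second, the large terms cancel before the small ones are added.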

throwawayadvsec2y ago

afro882y ago

> these tokens often compete against each other for available spots in expert buffers.

Hold up, does this mean that under heavy load the results change? Does this explain why it sometimes feels like the output quality changes?
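To make the quoted sentence concrete, here is a toy sketch of capacity-limited routing. It is not GPT-4's actual router, which is not public; the greedy rule, the `route` function, the affinity scores, and the capacity of one slot per expert are all assumptions chosen to show the effect. Each token goes to its best-scoring expert that still has room, so where a token ends up depends on what else is in the batch:

```python
def route(batch, capacity=1):
    """Assign each token to its best-scoring expert that still has room.

    batch: list of (token, scores), where scores[i] is the affinity to expert i.
    Returns {token: expert index, or None if every expert buffer was full}.
    """
    num_experts = len(batch[0][1])
    load = [0] * num_experts
    assignment = {}
    # Tokens with a higher top score claim buffer slots first.
    for token, scores in sorted(batch, key=lambda item: -max(item[1])):
        assignment[token] = None
        # Try experts from most to least preferred.
        for expert in sorted(range(num_experts), key=lambda e: -scores[e]):
            if load[expert] < capacity:
                load[expert] += 1
                assignment[token] = expert
                break
    return assignment


mine = ("my_token", [0.6, 0.4])                  # prefers expert 0
quiet_batch = [mine, ("other_a", [0.1, 0.9])]    # neighbour wants expert 1
busy_batch = [mine, ("other_b", [0.9, 0.1])]     # neighbour wants expert 0 more

print(route(quiet_batch)["my_token"])  # 0
print(route(busy_batch)["my_token"])   # 1
```

The input `my_token` is identical in both calls; only its batch neighbour changes, and with it the expert that processes the token.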

cainxinth2y ago

I asked GPT to explain this:

icelancer2y ago

The explanation makes quite a bit of sense.

rgoldste2y ago

DeathArrow2y ago

Well, a colleague of mine managed to build a non-deterministic GET REST API endpoint. :D

albystein2y ago