It is rather unfortunate that this sort of paper is hard to reproduce.
That is a BIG downside, because it makes the result unreliable. They invested effort and money in getting an unreliable result. But perhaps other research will corroborate. Or it may give them an edge in their business, for a while.
They chose to publish. So they are interested in seeing it reproduced or improved upon.
Call me cynical, but in my experience this is not the #1 reason for publishing AI papers.
I don’t think this is valid, as this point seems to ignore the fact that the data center that this compute took place in required a massive investment.
A paper like this is more akin to HEPP research. Nobody has the capability to reproduce the Higgs results outside the facility where the research was conducted (CERN).
I don’t think reproduction was a concern of the researchers.
Obviously the 'best' result would be to have a separate collider as well, but no one is going to fund a new collider just to reaffirm the result for a third time.
Not necessarily; publishing also ensures that the stuff is no longer patentable.
Of course it would be a tiny fraction of the $10m figure here, but even 1% would be $100,000. Negligible to Google, but for Google even $10 million is couch cushion money.
$10M is about what Google would spend to get a publication in a top-tier journal. But Google's internal pricing and costs don't look anything like what people cite for external costs; it's more like a state-supported economy with some extremely rich oligarch-run profit centers that feed all the various cottage industries.
That was using idle cycles on Intel CPUs, not GPUs or TPUs though.
From the link: "the total compute cost it would take to replicate the paper"
It's not Google's cost. Google's cost is of course entirely different. It's the cost for the author if he were to rent the resources to replicate the paper.
For Google, all of it is running at a "best effort" resource tier, grabbing available resources when they are not requested by higher-priority jobs. These are effectively free resources (except for electricity consumption). If any "more important" job with a higher priority comes in and asks for the resources, the paper writers' jobs will just be preempted.
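To make the "best effort" tier concrete, here is a minimal sketch of priority-based preemption. It is not a description of Google's actual scheduler; the `Job` class, the `schedule` function, the job names, and the chip counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # higher number wins (assumption for this sketch)
    chips: int     # accelerators requested


def schedule(jobs, capacity):
    """Greedy admission in priority order.

    Jobs that no longer fit once higher-priority work has been placed
    are reported as preempted.
    """
    running, preempted = [], []
    free = capacity
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if job.chips <= free:
            running.append(job.name)
            free -= job.chips
        else:
            preempted.append(job.name)
    return running, preempted


# Hypothetical cluster with 100 chips.
serving = Job("serving", priority=10, chips=60)
paper = Job("paper-experiment", priority=1, chips=40)
urgent = Job("urgent-retrain", priority=5, chips=30)

before = schedule([serving, paper], capacity=100)
after = schedule([serving, paper, urgent], capacity=100)
```

In this toy setup the research job runs for free on idle capacity until the higher-priority job arrives, at which point it is the one that gets bumped.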
For example, if YOU want to rent a backhoe to do some yard rearrangement it’s going to cost you.
But Bob who owns BackHoesInc has them sitting around all the time when they’re not being rented or used; he can rearrange his yard wholesale or almost free.
"Underutilized" isn't the right word here. There's some value in putting your capital to productive use. But, once immediate needs are satisfied, there's more value in having the capital available to address future needs quickly than there would be in making sure that everything necessary to address those future needs is tied up in low-value work. Option value is real value; being prepared for unforeseen but urgent circumstances is a real use.
The job could easily run for weeks, even when you could pay to get it done in a day.
Then have bidding on this “best effort” resource, factoring in the cost of electricity at any given time.
That effort needs to be added to the cost calculation too.
The problem with neoclassical economics is that it doesn't concern itself with the physical counterpart of liquidity. It is assumed that the physical world is just as liquid as the monetary world.
The "liquidity mismatch" between money and physical capital must be bridged through overprovisioning on the physical side. If you want the option to choose among n different products, but only choose m products, then the n - m unsold products must be priced into the m bought products. If you can repurpose the unsold products, then you make a profit or you can lower costs for the buyer of the m products.
I would even go as far as to say that the production of liquidity is probably the driving force of the economy, because it means we don't have to do complicated central planning and instead use simple regression models.
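A rough sketch of the n - m argument above, assuming every product has the same unit cost and that unsold products can be repurposed at some salvage value. The function name `break_even_price` and all the numbers are hypothetical, not taken from any real pricing model.

```python
def break_even_price(unit_cost, n_offered, m_sold, salvage_per_unsold=0.0):
    """Price each sold unit must carry so the seller covers all n units.

    The cost of the n - m unsold units is spread over the m sold units,
    minus whatever is recovered by repurposing the unsold ones.
    """
    if not 0 < m_sold <= n_offered:
        raise ValueError("need 0 < m_sold <= n_offered")
    total_cost = unit_cost * n_offered
    recovered = salvage_per_unsold * (n_offered - m_sold)
    return (total_cost - recovered) / m_sold


# Hypothetical: 5 products on offer at a cost of 10 each, 2 are bought.
no_salvage = break_even_price(10, 5, 2)          # unsold units are a pure loss
full_salvage = break_even_price(10, 5, 2, 10.0)  # unsold units fully repurposed
```

With no salvage the buyer pays for the choice as well as the product; with full repurposing the price falls back to the unit cost.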
Isn't that exactly what high-frequency traders would say? :)
Perhaps there is some limit at which additional liquidity doesn't offer much value?
I was told by an employee that GDM internally has a credits system for TPU allocation, with which researchers have to budget out their compute usage. I may have completely misunderstood what they were describing, though.
It is a hustle only for the near future, while this bubble lasts, but it can help reduce costs.
My wife works on high-throughput drug screens. They routinely use over $100,000 of consumables in a single screen, not counting the cost of the screening “libraries”, the cost of using some of the ~$10 million of equipment in the lab for several weeks, the cost of the staff in the lab itself, and the cost of the time of the scientists who request the screens and then take the results and turn them into papers.
Plus, I mean, there are a lot of products that don't work. We all buy garbage and often can't buy not garbage. Though I guess you're technically correct that in either of these situations there can still be a return on investment, but maybe that shouldn't be good enough...
That means Google paid way less than this amount, and if you wanted to reproduce the paper yourself, you would potentially pay a lot more, depending on how many engineers you have on your team to squeeze every bit of performance per hour out of your cluster.
How is anyone else going to reproduce the experiment if it's going to cost them $10 million because they don't work at Google and would have to rent the infrastructure?
That being said, yes, this is hard to reproduce for your average Joe, but there are also a lot of companies (like OpenAI, Facebook, ...) that are able to throw this amount of hardware at the problem. And in a few years you'll probably be able to do it on commodity hardware.
It works and helps to get a salary raise or a better job, so they continue.
A bit like when someone goes to a job interview, didn't do anything, and claims "My work is under NDA".
No, it's not. The author clearly states in the very first paragraph that this is the price it would take them to reproduce the results.
Nowhere in the article (or the title) have they implied that this is how much Google spent.
[1] If the total cost estimate were relatively low, say less than 10k, then of course the lowest rental price and a random training codebase might make some sense in order to reduce administrative costs; once the cost is in the ballpark of millions of USD, it feels careless not to optimize it further. H100s occasionally turn up in fire sales or on eBay, which could reduce the cost even more, but the author already mentions 2USD/gpu/hour for bulk rental compute, which is better than the 3USD/gpu/hour estimate they used in the writeup.
MFU can certainly be improved beyond 40%, as I mention. But on the point of small models specifically: the paper uses FSDP for all models, and I believe a rigorous experiment should not vary sharding strategy due to numerical differences. FSDP2 on small models will be slow even with compilation.
The paper does not tie embeddings, as stated. The readout layer does lead to 6DV because it is a linear layer of D*V, which takes 2x for a forward and 4x for a backward. I would appreciate it if you could limit your comments to factual errors in the post.
Even if it's a small model, one could use DDP or FSDP/2 without slowdowns on a fast interconnect, which certainly adds to the cost. But if you want to reproduce all the work at the cheapest price point, you only need to parallelize to the minimal level for fitting in memory (or rather, the one that maxes the MFU), so everything below 2B parameters runs on a single H100 or single node.
And the blog post author is talking about the output layer where the model has to produce an output prediction for every possible token in the vocabulary. Each output token prediction is a dot-product between the transformer hidden state (D) and the token embedding (D) (whether shared with input or not) for all tokens in the vocabulary (V). That's where the VD comes from.
It would be great to clarify this in the blog post to make it more accessible but I understand that there is a tradeoff.
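For anyone following along, the 6DV accounting discussed above can be sketched like this. It assumes the usual convention that one multiply-accumulate counts as 2 FLOPs; the function name `readout_flops_per_token` and the model sizes are made up for illustration and are not the paper's.

```python
def readout_flops_per_token(d_model: int, vocab_size: int) -> dict:
    """FLOPs per token for a D x V linear readout layer."""
    macs = d_model * vocab_size  # one multiply-accumulate per weight
    forward = 2 * macs           # each MAC counted as 2 FLOPs
    backward = 4 * macs          # input gradient + weight gradient, 2 * macs each
    return {"forward": forward, "backward": backward, "total": forward + backward}


# Hypothetical sizes, chosen only for illustration.
D, V = 1024, 32_000
flops = readout_flops_per_token(D, V)
```

The total comes out to 6 * D * V per token, i.e. 2x for the forward pass and 4x for the backward pass.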
Looking at [1], the authors there claim that their improvements were needed to push BERT training beyond 30% MFU, and that the "default" training only reaches 10%. Certainly numbers don't translate exactly, it might well be that with a different stack, model, etc., it is easier to surpass, but 35% doesn't seem like a terribly off estimate to me. Especially so if you are training a whole suite of different models (with different parameters, sizes, etc.) so you can't realistically optimize all of them.
It might be that the real estimate is around 40% instead of the 35% used here (frankly it might be that it is 30% or less, for that matter), but I would doubt it's so high as to make the estimates in this blog post terribly off, and I would doubt even more that you can get that "also for small models with plain pytorch and trivial tuning".
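To see how much the MFU and rental price assumptions actually move the estimate, here is a back-of-the-envelope sketch. The 35% and 40% MFU figures and the 2 and 3 USD per GPU-hour prices come from this thread; the total FLOP count and the per-GPU peak throughput are round placeholder numbers I am assuming, not figures from the paper or a spec sheet.

```python
def rental_cost_usd(total_flops, peak_flops_per_gpu, mfu, usd_per_gpu_hour):
    """Rental cost of a training run at a given model FLOPs utilization."""
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * usd_per_gpu_hour


TOTAL_FLOPS = 1e24  # placeholder compute budget (assumption)
PEAK_FLOPS = 1e15   # placeholder peak FLOP/s per GPU (assumption)

cost_35 = rental_cost_usd(TOTAL_FLOPS, PEAK_FLOPS, mfu=0.35, usd_per_gpu_hour=3.0)
cost_40 = rental_cost_usd(TOTAL_FLOPS, PEAK_FLOPS, mfu=0.40, usd_per_gpu_hour=3.0)
cost_40_cheap = rental_cost_usd(TOTAL_FLOPS, PEAK_FLOPS, mfu=0.40, usd_per_gpu_hour=2.0)
```

Going from 35% to 40% MFU changes the bill by a factor of 0.35 / 0.40, about 12.5% less, regardless of the placeholders. The rental price matters more: 2 versus 3 USD per GPU-hour is a third off.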
I’m confident each one of them was an investment of some multiple of $10M.
And this is just what we know because they were launched publicly.
The equivalent wastage for a self-employed person would be allowing a few cups of Starbucks coffee per year to go cold.
The cost was probably the limiting factor.
I mention this because a lot of universities and small labs are being edged out of the research space but we still want their contributions. It is easy to always ask for more experiments but the problem is, as this blog shows, those experiments can sometimes cost millions of dollars. This also isn't to say that small labs and academics aren't able to publish, but rather that 1) we want them to be able to publish __without__ the support of large corporations to preserve the independence of research[0], 2) we don't want these smaller entities to have to go through a roulette wheel in an effort to get published.
Instead, when reviewing, be cautious in what you ask for. You can __always__ ask for more experiments, datasets, "novelty", and so on. Ask instead whether what's presented is sufficient to push the field forward in any way, and when requesting any of those things, be specific about why what's in the paper doesn't answer what's needed and what experiment would answer it (a sentence or two would suffice).
If not, then we'll have the death of the GPU poor and that will be the death of a lot of innovation, because the truth is, not even big companies will allocate large compute for research that is lower level (do you think state space models (mamba) started with multimillion dollar compute? Transformers?). We gotta start somewhere and all papers can be torn to shreds/are easy to critique. But you can be highly critical of a paper and that paper can still push knowledge forward.
[0] Lots of papers these days are indistinguishable from ads. A lot of papers these days are products. I've even had works rejected because they are being evaluated as products not being evaluated on the merits of their research. Though this can be difficult to distinguish when evaluation is simply empirical.
[1] I once got desk rejected for "prior submission." Two months later they overturned it, realizing it was in fact an arXiv paper, only for it to be desk rejected again a month later for "not citing relevant materials" with no further explanation.
> But you can be highly critical of a paper and that paper can still push knowledge forward.
Can you give a concrete example of this?
(A good rule of thumb is that an employee costs about twice their total compensation.)