Though it is notable that, contrary to what many (on HN and Twitter) predicted, Meta did not stop publishing papers and become like other AI labs (e.g. OpenAI). They've continued their rapid pace of releasing papers AND open source models.
Also, that wasn't based purely on hearsay; Zuck explicitly said:
> We believe the benefits of superintelligence should be shared with the world as broadly as possible. That said, superintelligence will raise novel safety concerns. We'll need to be rigorous about mitigating these risks and careful about what we choose to open source. Still, we believe that building a free society requires that we aim to empower people as much as possible. [0]
https://huggingface.co/facebook/models
The most interesting ones to me are:
- CWM (Code World Model), an LLM for coding https://github.com/facebookresearch/cwm
- DINOv3, a vision encoder https://ai.meta.com/dinov3/
- MAPAnything, a 3D reconstruction model https://huggingface.co/facebook/map-anything
- VJEPA v2, a self-supervised video pre-training model https://github.com/facebookresearch/vjepa2
I'd interpret that as meaning "everybody is welcome to be our customer, but we still control all of it".
MSL is not only those few high-profile hires.
A bit of this is true at every major lab. There's tons of untapped potential, but these organizations are very risk averse. I mean, why not continue with the strategy that got us to the point we're at in the first place? Labs used to hire researchers and give them a lot of free rein. But those times ended, and AI progress slowed down too. Maybe if you want to get ahead you've got to stop thinking like everyone else.
Well Meta... you can "hold me hostage" for a lot cheaper than those guys. I'm sure this is true for hundreds of passionate ML researchers. I'd take a huge pay cut to have autonomy and resources. I know for a fact there are many working at Meta right now who would do the same. So maybe if you're going to throw money at the problem, diversify a bit and look back at what made SV what it is today and what made AI take leaps forward.
The other day I was spending some time with a researcher from DeepMind, and I was surprised to find that, while they were sharp and curious to an extent, nearly every ounce of energy they expended on research was strategic. They didn't write about research they were fascinated by; they wrote and researched on topics they strategically felt had the highest probability of getting into a major conference in a short period of time, to earn them a promotion. While I was a bit disappointed, I certainly didn't judge them, because they are just playing the game. This person probably earns more than many rooms of smart, passionate people I've been in, and that money isn't for smarts alone; it's for appealing to the interests of people with the money.
You can see this very clearly by comparing the work being done in the LLM space to that being done in the image/video diffusion model space. There's much more money in LLMs right now, and the field is flooded with papers on any random topic. If you dive in, most of them are not reproducible, or draw very questionable conclusions from the data they present, but that's of little concern so long as the paper can be added to a CV.
In the Stable Diffusion world it's mostly people driven by personal interest (usually very non-commercial personal interests), and you see tons of innovation in that field but almost no papers. In fact, if you really want to understand a lot of the most novel work coming out of the image generation world, you often need to dig into PRs made by anonymous users with anime-themed profile pics.
The bummer of course is that there are very hard limits on what any researcher can do with a home GPU training setup. It does lead to creative solutions to problems, but I can't help but wonder what the world would look like if more of these people had even a fraction of the resources available exclusively to people playing the game.
> Someone has probably studied this
There's even a name for it.
I persist because I'm fantastic at politics while being good enough to do my job. Feels weird man.
I genuinely think science would be better served if scientists got paid modest salaries to pursue their own research interests and all results became public domain. So many universities now fancy themselves startup factories, and startups are great for some things, no doubt, but I don't think pure research is always served by this strategy.
I can't think of it ever really paying off. Bell Labs is the best example: amazing research that was unrelated to the core business of the parent company. Microsoft Research is another great one. Lots of interesting research that... got MS some nerd points? But it has materialized into very, very few actual products and revenue streams. Moving AI research forward doesn't help Meta build any moats or revenue streams. It just progresses our collective knowledge.
On the "human progress" scale it's fantastic to put lots of smart people in a room and let them do their thing. But from a business perspective it seems to almost never pay off. Waiting on the irrational charity of businesses executive is probably not the best way to structure thing.
I'd tell them to go become academics.. but all the academics I know are just busy herding their students and attending meetings
> I can't think of it ever really paying off
Sure worked for Bell Labs.
Also it is what big tech was doing until LLMs hit the scene.
So I'm not sure what you mean by it never paying off. We were doing it right up until one of those things seemed to pay off, and then we hyper-focused on it. I actually think this is a terrible thing we frequently do in tech: we find promise in a piece of tech and hyper-focus on it. Specifically, we hyper-focus on how to monetize it, which ends up stunting the technology, because it hasn't had time to mature and we're trying to monetize the alpha product instead of trying to get that thing to beta.
> But from a business perspective it seems to almost never pay off.
So this is actually what I'm trying to argue: it does pay off. It has paid off. Seriously, look again at Silicon Valley and how we got to where we are today, and look at how things changed in the last decade... Why is it that we like off-the-wall thinkers? That programmers used to be known as a bunch of nerds and weirdos? How many companies were started out of garages (Apple)? How many started as open source projects (Android)? Why did Google start giving work-lifestyle perks and 20% time?
So I don't know what you're talking about. It has frequently paid off. Does it always pay off? Of course not! It frequently fails! But that is pretty true for everything. Maybe the company stocks are doing great [0], but let's be honest, the products are not. Look at the last 20 years and compare it to the 20 years before that: the last 20 years have been much slower. Now maybe it is a coincidence, but the biggest innovation in the last 20 years has been in AI, and from 2012 to 2021 there were a lot of nice free-rein AI research jobs at these big tech companies where researchers got paid well, had a lot of autonomy in research, and had a lot of resources at their disposal. It really might be a coincidence, but things like this have happened a number of times in history and they tend to be fairly productive. So idk, you be the judge. It's hard to conclude that this is definitely what creates success, but I find it hard to rule out.
> I'd tell them to go become academics.. but all the academics I know are just busy herding their students and attending meetings
Same problem, different step of the ladder.
This is very true, and more than just in AI.
I think if they weren't so metrics-focused they probably wouldn't have hit so much bad publicity and scandal, either.
Quite the statement for anybody who follows developments (without excluding xAI).
Well, for starters you need a leader who can rally the troops, someone who "think(s) different" - something like a Steve Jobs.
That person doesn't seem to exist in the industry right now.
Doesn't really scream CEO of AGI to me.
Shareholders would be livid if they knew a single thing about what was going on.
There was an interesting quote, “plain old BM25 from 1994 outperforms vector search on recall”, that is super relevant to what I did yesterday. I am trying to use small local models more often, and yesterday I wrote Common Lisp code that uses a large corpus of text and a user query or prompt to construct a fairly concise one-shot prompt with select context from the text corpus. This is RAG, and I used both BM25 and vector-embedding matching. I added the code and an example as a new chapter in my CL book yesterday afternoon (link directly to the new material: https://leanpub.com/lovinglisp/read#leanpub-auto-autocontext...). BM25 is fast. This is new code and I will certainly be experimenting more with it, but as-is it is useful when working with small local LLMs.
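To make the BM25 point concrete, here is a toy Python version of the standard Okapi BM25 scoring formula (my own sketch, not the Common Lisp code from the book; the corpus and query are made up):

    import math
    from collections import Counter

    # Okapi BM25: score(q, d) = sum over query terms t of
    #   IDF(t) * tf(t,d) * (k1+1) / (tf(t,d) + k1 * (1 - b + b * |d|/avgdl))
    K1, B = 1.5, 0.75

    docs = [
        "bm25 is a ranking function used by search engines".split(),
        "vector embeddings capture semantic similarity".split(),
        "bm25 often beats vector search on recall".split(),
    ]
    avgdl = sum(len(d) for d in docs) / len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency per term

    def idf(term):
        n = df.get(term, 0)
        return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

    def bm25(query, doc):
        tf = Counter(doc)
        return sum(
            idf(t) * tf[t] * (K1 + 1)
            / (tf[t] + K1 * (1 - B + B * len(doc) / avgdl))
            for t in query if t in tf
        )

    query = "vector search recall".split()
    print(max(docs, key=lambda d: bm25(query, d)))  # the bm25-vs-vector chunk wins

No embedding model and no index beyond term counts, which is part of why it is so fast.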
- a predefined document store / document-chunk store where every chunk gets a vector embedding, and a lookup decides what gets pulled into context, so as not to pull in whole classes of documents and fill it up
- the web-search-like features in LLM chat interfaces, where they do keyword search and pull relevant documents into context, but somehow only ephemerally, with the full documents not taking up context later in the thread (unsure about this, did I understand it right?)
With the new models with million-plus-token context windows, some were arguing that we can just throw whole books into the context non-ephemerally, but doesn't that significantly reduce the diversity of possible sources we can include at once, if we hard-commit to everything staying in context forever? I guess it might help with consistency? But isn't the mechanism by which we decide what to keep in context still some kind of RAG, just with larger chunks of whole documents instead of only parts?
I'd be ecstatic if someone who really knows their stuff could clear this up for me.
Throwing everything into one large context window is often impractical - it takes much more time to process, and many models struggle to find information accurately if too much is going on in the context window ("lost in the middle").
The "classic" RAG still has its place when you want low latency (or you're limited by VRAM) and the results are already good enough.
My impression is that GPT-5 gets confused, not quite right away, but after a couple of pages it has no idea. It doesn't take pages on pages before it forgets things.
In both cases, for "question answering" it's about similarity search, but there are two main orthogonal differences between RAG and Non-RAG:
- Knowing the question at the time of index building
- Higher-order features: the ability to compare fetched documents with one another and refine the question
Non-RAG, aka a multi-layer (non-causal) transformer with infinite context, is the more generic version. It is fully differentiable, meaning you can use machine learning to learn how to Non-RAG better. Each layer of the transformer can use the previous layer to reason and refine the similarity search. (A causal transformer knows the question at the time it is fed the question, and can choose to focus its attention on different parts of the previously computed features of the provided documents, but it may benefit from having some reflection tokens, or better: from being given the question before being presented the documents, provided you've trained it to answer like that.)
RAG is an approximation of the generic case to make it faster and cheaper. Usually it breaks end-to-end differentiability by using external tools, which means that if you want to use machine learning to learn how to RAG better, you will need some variant of reinforcement learning, which is slower to learn things. RAG usually doesn't know the question at the time of index building, and documents are treated independently of each other, so there are no (automatic) higher-order features (embeddings are fixed).
A third usual approximation is to feed the output of RAG into Non-RAG, to hopefully get the best of both worlds. You can learn the Non-RAG-given-RAG part with machine learning (if you train it with some conversations where it used RAG), but the RAG part won't improve by itself.
Non-RAG needs to learn, so it needs a big training dataset, but fortunately it can pick up question-answer pairs in an unsupervised fashion when you feed it the whole web, and you only need a small instruction-tuning and preference-optimization dataset to shape it to your needs. If performance isn't what you expect in a specific case, you can provide more specific examples and retrain the model until it gets it, and you get better performance for the case you were interested in. You can improve the best case, but it's hard to improve the worst case.
RAG gives you more control over what you feed it, but the content should be provided in a more structured way. You can prevent worst cases more easily, but it's hard to improve the good case.
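A minimal Python sketch of that third approximation (cosine-similarity retrieval feeding a prompt; `embed` and `llm` are stand-ins for whatever embedding model and LLM you plug in):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rag_then_generate(question, chunks, embed, llm, k=3):
        # RAG step: fixed embeddings, chunks scored independently of each
        # other, so no higher-order features and no gradients through retrieval.
        q = embed(question)
        top = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]
        # Non-RAG step: the transformer attends over the retrieved chunks jointly.
        prompt = ("Answer using only this context:\n" + "\n\n".join(top)
                  + "\n\nQuestion: " + question)
        return llm(prompt)

Note how end-to-end differentiability stops at the `sorted` call: that is the part you would need RL (or a fixed heuristic) to improve.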
RAG is confusing because, if you look at the words making up the acronym, it seems like it could be either of the things you mentioned. But it originally referred to a specific technique of embeddings + vector search - this is the way it was used in the ML article that coined the term, and this is the way most people in the industry actually use it.
It annoys me, because I think it should refer to all augmentation techniques, but in practice it's often not used that way.
There are reasons that make the "embeddings" idea special - namely, it's a relatively new technique that fits LLMs very well, because it's semantic search - meaning it works on "the same input" as LLMs do: a free-text query. (As opposed to traditional lookups that work on keyword search or similar.)
As for whether RAG is dead - if you mean specifically vector embeddings and semantic search, it's possible, because you could theoretically use other techniques for augmentation, e.g. an agent that understands a user's question about a codebase and uses grep/find/etc. to look for the information, or composes a query to search the internet. But it's definitely not going to die in that second sense of "we need some way to augment LLMs' knowledge before text generation"; that will probably always be relevant, as you say.
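That agentic flavor of augmentation can be as simple as letting the model pick a keyword and shelling out to grep; a toy Python helper (the tool-calling loop around it is omitted):

    import subprocess

    def grep_codebase(keyword, repo_path="."):
        # Keyword lookup instead of semantic search; returns matching lines
        # for the agent to pull into context.
        result = subprocess.run(
            ["grep", "-rn", keyword, repo_path],
            capture_output=True, text=True,
        )
        return result.stdout.splitlines()[:20]   # cap what enters the prompt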
> But RAG is a very real world, practical topic for something as significant as a new lab’s first paper.
I would expect exactly the opposite - that a new lab would put out a few random papers that happen to be in areas their researchers were interested in and already working on, and once people had been working together a while and developed some synergy they would maybe come out with something really groundbreaking.
Do people really view a "first paper" as something deeply significant and weighty? Because that just seems like a good way to get bogged down in second-guessing whether any given paper is good enough to be your all-important debut!
Of course here we are talking about a lab, not an individual person, but still I haven't heard of first papers being considered special in any way, even for labs.
(preferably using representative language from the article)
We don't catch every case, but if you're talking about the frontpage, I'm surprised to hear you say "epidemic". What are some recent examples?
IMO vector embedding is the most important innovation in computing of the last decade. There's something magical about it. These people deserve some kind of prize. The idea that you can reduce almost any intricate concept including whole paragraphs to a fixed-size vector which encapsulates its meaning and proximity to other concepts across a large number of dimensions is pure genius.
The fact that simple vector arithmetic can encode concepts like royalty and gender (among all sorts of others) is kind of magic to me.
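The canonical word2vec-style example, with toy 4-d vectors just to show the arithmetic (real embeddings have hundreds of dimensions and learned values):

    import numpy as np

    vec = {
        "king":  np.array([0.9, 0.8, 0.1, 0.2]),
        "queen": np.array([0.9, 0.1, 0.8, 0.2]),
        "man":   np.array([0.1, 0.9, 0.1, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # king - man + woman should land nearest to queen
    target = vec["king"] - vec["man"] + vec["woman"]
    print(max(vec, key=lambda w: cosine(vec[w], target)))  # queen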
But similar ways to reduce huge numbers of dimensions to a much smaller set of "interesting" dimensions have been known for a long time.
Examples include principal component analysis/singular value decomposition, which was the first big breakthrough in face recognition (in the early 90s), and was also used in latent semantic indexing, the Netflix prize, and a large pile of other things. And the underlying technique was invented in 1901.
Dimensionality reduction is cool, and vector embedding is definitely an interesting way to do it (at significant computational cost).
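For comparison, the whole of PCA-style reduction fits in a few lines of numpy (random data here, just to show the mechanics):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))     # 100 samples, 50 dimensions
    Xc = X - X.mean(axis=0)            # center each dimension

    # SVD gives the principal components; keep the top 2 "interesting" ones
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_reduced = Xc @ Vt[:2].T
    print(X_reduced.shape)             # (100, 2)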
- https://arxiv.org/abs/2410.07590 (literally titled "Block-Attention for Efficient RAG")
- https://arxiv.org/abs/2409.15355v3
- https://arxiv.org/abs/2212.10947
The REFRAG paper does not cite any of these.
In general, we need to make it simpler for LLMs to take in different forms of embeddings, or at least build frameworks that simplify it.
Which means that modifications to the architecture, and combining it with other components and approaches, are the next likely step. This paper fits that.
"Send this through the math coprocessor." "Validate against the checklist." "Call out to an agent for X." "Recheck against input stream Y." And so on.
Retrieval augmentation is only one of many uses for this. If this winds up with better integration with agents, it is very possible that the whole is more than the sum of its parts.
It's effectively a multimodal model, which handles "concept" tokens alongside "language" tokens and "image" tokens.
A really big conceptual step, actually, IMO.
I came to believe that LLMs work with token embeddings. Is REFRAG then only "something" in front of the LLM, where the decoder is the RL policy that expands only some chunk embeddings into token embeddings the LLM can consume? Or does REFRAG need you to 'tune' the LLM so it can work with both token embeddings and token-chunk embeddings?
Doesn't this tie the two layers together in a way that they can't evolve separately?
It means you're reading into it too much and need to be let down, gently, from the hype train.
So that others don't also have to look it up, it's Retrieval-Augmented Generation (RAG).
They even say it's "a topic that we didn’t expect"... so... perhaps many people wouldn't have heard of it?
Which other under-pressure labs are you talking about?
TL;DR
• MSI’s first paper, REFRAG, is about a new way to do RAG.
• This slightly modified LLM converts most retrieved document chunks into compact, LLM-aligned chunk embeddings that the LLM can consume directly.
• A lightweight policy (trained with RL) decides which chunk embeddings should be expanded back into full tokens under a budget; the LLM runs normally on this mixed input.
• The net effect is far less KV-cache and attention cost, much lower time-to-first-token latency, and higher throughput, while preserving perplexity and task accuracy in benchmarks.
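If it helps, here is a rough torch sketch of the mixed-input idea as I read the TL;DR (all names, shapes, and the projection layer are my guesses, not the paper's code):

    import torch
    import torch.nn as nn

    D_CHUNK, D_MODEL = 768, 4096   # hypothetical dims: retriever vs LLM embeddings

    class MixedInput(nn.Module):
        # Project chunk embeddings into the LLM's token-embedding space and
        # splice them in alongside ordinary token embeddings.
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(D_CHUNK, D_MODEL)   # the "LLM-aligned" part

        def forward(self, token_embs, chunk_embs, expand_mask, expanded_embs):
            # expand_mask holds the policy's per-chunk decisions
            parts = [token_embs]                       # the prompt's own tokens
            for i, emb in enumerate(chunk_embs):
                if expand_mask[i]:
                    parts.append(expanded_embs[i])     # full tokens: costly, precise
                else:
                    parts.append(self.proj(emb).unsqueeze(0))  # 1 position per chunk
            return torch.cat(parts, dim=0)             # sequence the LLM consumes

A compressed chunk occupies one position instead of, say, 128 token positions, which is where the KV-cache and attention savings would come from.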
I wish more long posts followed this model of a scientific paper.
2. Wild claim that the companies that sell LLMs are actually downplaying their capabilities instead of hyping them
Again, personal experience, but on my team ~40-50% of the PRs are generated by Codex.
https://www.infoworld.com/article/4061078/the-productivity-p...
The real value of AI isn't in helping coding. It's in having a human-like intelligence to automate processes. I can't get into details but my team is doing things that I couldn't dream of three years ago.
Non-software devs are actually making functional programs for themselves for the first time ever. The value is crazy.