If the model comes up with anything even remotely correct, it would be quite strong evidence that LLMs are a path to something bigger. If not, then I think it is time to go back to the drawing board.
Things don't happen all of a sudden, and being able to see all the scientific papers of the era, it's possible those could have fallen out of the synthesis.
Standing on the shoulders of giants, as it were.
There's a very similar parallel with dark matter in modern times. We certainly have endless hints to the truth that will be evident in hindsight, but for now? We are mostly convinced that we know the truth, perform experiments to prove that, find nothing, shrug, adjust the model to be even more esoteric, and repeat onto the next one. And maybe one will eventually show something, or maybe we're on the wrong path altogether. This quote, from Michelson in 1894 (more than a decade before Einstein would come along), is extremely telling of the opinion at the time:
"While it is never safe to affirm that the future of Physical Science has no marvels in store even more astonishing than those of the past, it seems probable that most of the grand underlying principles have been firmly established and that further advances are to be sought chiefly in the rigorous application of these principles to all the phenomena which come under our notice. It is here that the science of measurement shows its importance — where quantitative work is more to be desired than qualitative work. An eminent physicist remarked that the future truths of physical science are to be looked for in the sixth place of decimals." - Michelson 1894
It’s probably even more of a problem that different areas of scientific development don’t know about each other. LLMs combining results would still not be like they invented something new.
But if they could give us a head start of 20 years on certain developments this would be an awesome result.
My own prediction is that the LLMs would totally fail at connecting the dots, but a small group of very smart humans could.
Things don't happen all of a sudden, but they also don't happen everywhere. Most people in most parts of the world would never connect the dots. Scientific curiosity is something valuable and fragile, that we just take for granted.
I imagine possible buffers on that conjecture synthesis being experimentation and acceptance by the scientific community. AIs can come up with new ideas every day but Nature won't publish those ideas for years.
Training data for a base model isn't that hard to come by, even though you have to OCR most of it yourself because the publicly available OCRed versions are commonly unusably bad. But training a model large enough to be useful is a major issue. Training a 700M parameter model at home is very doable (and is what this TimeCapsuleLLM is), but to get that kind of reasoning you need something closer to a 70B model. Also a lot of the "smarts" of a model gets injected in fine tuning and RL, but any of the available fine tuning datasets would obviously contaminate the model with 2026 knowledge.
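To put rough numbers on the 700M vs. 70B gap, here is a back-of-the-envelope sketch. It assumes the common approximation of about 6 FLOPs per parameter per training token and a Chinchilla-style budget of about 20 tokens per parameter. The sustained throughput figure (`flops_per_sec`) is purely an assumption for illustration, not a measurement of any particular GPU or of how TimeCapsuleLLM was trained.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense transformer (~6 * N * D)."""
    return 6.0 * params * tokens


def gpu_days(params: float,
             tokens_per_param: float = 20.0,   # assumption: Chinchilla-style token budget
             flops_per_sec: float = 1e14) -> float:  # assumption: sustained throughput of one device
    """Rough wall-clock days on a single device under the assumptions above."""
    tokens = params * tokens_per_param
    return training_flops(params, tokens) / flops_per_sec / 86_400


if __name__ == "__main__":
    for name, n in [("700M", 700e6), ("70B", 70e9)]:
        print(f"{name}: ~{gpu_days(n):,.0f} device-days")
```

Because both the parameter count and the token budget grow 100x, the compute grows roughly 10,000x under these assumptions, which is why the jump from "doable at home" to "major issue" is so steep.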
It largely cannot imagine anything beyond the usual, but there is a small part that it can. This is similar to in-context learning: it's weak but it is there.
It would be incredible if meta learning/continual learning found a way to train exactly for novel learning paths. But that's literally AGI so maybe 20yrs from now? Or never..
You can see this on CL benchmarks. There is SOME signal but it's crazy low. When I was training CL models I found that the signal was in the single % points. Some could easily argue it was zero but I really do believe there is a very small amount in there.
This is also why any novel work or findings are done via MASSIVE compute budgets. They find RL environments that can extract that small amount out. Is it random chance? Maybe, hard to say.
If we did this (to a good enough level of detail), would it be able to derive relativity? How large of an AI model would it have to be to successfully derive relativity (if it only had access to everything published up to 1904)?
I mostly agree with you, especially about distrusting the self-interested hype beasts.
While I don't think the models are actually "intelligent", I also wonder if there are insights to be gained by looking at how concepts get encoded by the models. It's not really that the models will add something "new", but more that there might be connections between things that we haven't noticed, especially because academic disciplines are so insular these days.
Most ML researchers think AGI is imminent.
At least the model part. Although others have had the same thought as you, afaik none have tried it.
They measure compression (perplexity) on future Wikipedia, news articles, code, arXiv papers, and multi-modal data. Data compression is intimately connected with robustness and generalization.
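As a minimal sketch of how perplexity and compression relate: given the natural-log probabilities a model assigned to each token of a held-out text, perplexity is the exponential of the mean negative log-probability, and the same sum converted to bits and divided by the byte length gives a bits-per-byte compression rate. The function names here are illustrative, and this is not the benchmark's actual code.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_byte(token_logprobs: Sequence[float], text: str) -> float:
    """Total code length in bits, divided by the UTF-8 size of the text."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text.encode("utf-8"))


if __name__ == "__main__":
    # Toy case: four one-byte tokens, each predicted with probability 0.5.
    logprobs = [math.log(0.5)] * 4
    print(perplexity(logprobs), bits_per_byte(logprobs, "abcd"))
```

Bits per byte is convenient for comparing models with different tokenizers, since it normalizes by the text rather than by the token count.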
A) contaminate the model with your own knowledge of relativity, leading it on to "discover" what you know, or
B) you will try to simulate a blind operation but without the "competent human physicist knowledgeable up to the 1900 scientific frontier" component prompting the LLM, because no such person is alive today nor can you simulate them (if you could, then by definition you can use that simulated Einstein to discover relativity, so the problem is moot).
So in both cases you would prove nothing about what a smart and knowledgeable scientist can achieve today from a frontier LLM.
In principle I see your point; in practice, my default assumption here, until proven otherwise, is that a little something slipped through post-1900.
A much easier approach would be to just download some model, whatever model, today. Then 5 years from now, whatever interesting discoveries are found, can the model get there?
I imagine if you explained quantum field theory to a 19th century scientist they might think of it as a more refined understanding of luminiferous aether.
Or if an 18th century scholar learned about positive and negative ions, it could be seen as an expansion/correction of phlogiston theory.
[0] https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/Copy%...
In fact you don’t want it to know about them explicitly, just to have enough background knowledge that you can manage the rest via context.
Then, we have a model that knows quite a bit in modern English. We also legally have a data set for everything it knows. Then, there's all kinds of experimentation or copyright-safe training strategies we can do.
Project Gutenberg up to the 1920's seems to be the safest bet on that.
But I also think we can do this with normal LLMs trained on up-to-date text, by asking them to come up with any novel theory that fits the facts. It does not have to be a groundbreaking theory like QM, just original and not (yet) proven wrong?
But considering that Special Relativity was published in 1905, which means all its building blocks were already floating in the ether by 1900, it would be a very interesting experiment to train something on Claude/Gemini scale and then, say, give it the field equations and ask it to build a theory around them.
1900 era humans are not available to do this so I'm not sure how this experiment is supposed to work.
Special relativity however seems possible.
It's a relief to think that we're not trusting these things for stuff like financial advice, medical advice, mental health counselling, ...
I saw TimeCapsuleLLM a few months ago, and I'm a big fan of the concept but I feel like the execution really isn't that great. I wish you:
- Released the full, actual dataset (untokenized, why did you pretokenize the small dataset release?)
- Created a reproducible run script so I can try it out myself
- Actually did data curation to remove artifacts in your dataset
- Post-trained the model so it could have some amount of chat-ability
- Released a web demo so that we could try it out (the model is tiny! Easily can run in the web browser without a server)
I may sit down and roll a better iteration myself.
If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
You'd have to be specific about what you mean by AGI: all three letters mean a different thing to different people, and sometimes the whole is used to mean something not present in the letters.
> If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
To a limited degree.
Some developments can come from combining existing ideas and seeing what they imply.
Other things, like everything to do with relativity and quantum mechanics, would have required experiments. I don't think any of the relevant experiments had been done prior to this cut-off date, but I'm not absolutely sure of that.
You might be able to get such an LLM to develop all the maths and geometry for general relativity, and yet find the AI still tells you that the perihelion shift of Mercury is a sign of the planet Vulcan rather than of a curved spacetime: https://en.wikipedia.org/wiki/Vulcan_(hypothetical_planet)
Well, they obviously can't. AGI is not science, it's religion. It has all the trappings of religion: prophets, sacred texts, origin myth, end-of-days myth and most importantly, a means to escape death. Science? Well, the only measure of "general intelligence" would be to compare to the only one we know, which is the human one, but we have absolutely no means by which to describe it. We do not know where to start. This is why, when you scratch the surface of any AGI definition, you only find circular definitions.
And no, the "brain is a computer" is not a scientific description, it's a metaphor.
Ways data might leak to the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
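A crude first-pass check for this kind of leak could be sketched as below: scan each document (body, footnotes and metadata concatenated) for four-digit years later than the cutoff and flag it for manual review. The function names and the `docs` mapping are hypothetical, and this is only a heuristic. It will miss leaks that carry no date and will flag harmless numbers that happen to look like years.

```python
import re
from typing import Dict, List

# Matches standalone four-digit numbers in the range 1500-2099.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def years_after_cutoff(text: str, cutoff: int = 1900) -> List[int]:
    """Return the sorted, de-duplicated years in `text` that are later than `cutoff`."""
    return sorted({int(y) for y in YEAR_RE.findall(text) if int(y) > cutoff})


def flag_documents(docs: Dict[str, str], cutoff: int = 1900) -> Dict[str, List[int]]:
    """Map each suspicious document id to the post-cutoff years found in it."""
    flagged = {}
    for doc_id, text in docs.items():
        late = years_after_cutoff(text, cutoff)
        if late:
            flagged[doc_id] = late
    return flagged


if __name__ == "__main__":
    sample = {
        "a": "Printed in London, 1874.",
        "b": "Reprinted 1923 with notes; first edition 1861.",
    }
    print(flag_documents(sample))
```

Anything flagged still needs a human look: a later reprint date may only contaminate a title page, while an editor's annotation could leak actual post-cutoff knowledge.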
As a thought experiment I find it thrilling.
The fact that tech leaders espouse the brilliance of LLMs and don't use this specific test method is infuriating to me. It is deeply unfortunate that there is little transparency or standardization of the datasets available for training/fine tuning.
Having this be advertised will make more interesting and informative benchmarks. OEM models that are always "breaking" the benchmarks are doing so with improved datasets as well as improved methods. Without holding the datasets fixed, progress on benchmarks is very suspect IMO.
LLMs have neither intelligence nor problem-solving ability (and I won't be relaxing the definition of either so that some AI bro can pretend a glorified chatbot is sentient)
You would, at best, be demonstrating that the sharing of knowledge across multiple disciplines and nations (which is a relatively new concept - at least at the scale of something like the internet) leads to novel ideas.
This is your assertion made without any supportive data or sources. It's nice to know your subjective opinion on the issue but your voice doesn't hold much weight making such a bold assertion devoid of any evidence/data.
LLMs trained on data up to a strictly constrained point are our best vehicle to have a view (however biased) on something, detached from its origins, and to escape a local minimum. The speculation is that such LLMs could help us look at correlational links accepted as truths and help us devise an alternative experimental path or craft arguments for such experiments.
Imagine you have an LLM trained on papers up to some threshold, feed your manuscript with correlational evidence and have an LLM point out uncontrolled confounders or something like that.
Do you have any resources that back up such a big claim?
> relatively small amount of focus & critical thinking on the issue of how LLMs & other categories of “AI” work.
I don't understand this line of thought. Why wouldn't the ability to recognize patterns in existing literature or scientific publications result in potential new understandings? What critical thinking am I not doing?
> postulate “new” scientific ideas
What are your examples of "new" ideas that aren't based on existing ones?
When you say "other categories of AI", you're not including AlphaFold, are you?
I think you might be confusing creation ex nihilo with combinatorial synthesis, which LLMs excel at. The proposed scenario is a fantastic test case for exactly this. This doesn't cover verification of course but that's not the question here. The question is whether an already known valid postulate can be synthesized.
PROMPT:Charles Darwin
Charles DarwinECCEMACY. Sir, — The following case is interesting to me : — I was in London a fortnight, and was much affected with an attack of rheumatism. The first attack of rheumatism was a week before I saw you, and the second when I saw you, and the third when I saw you, and the third in the same time. The second attack of gout, however, was not accompanied by any febrile symptoms, but was accompanied with an increased flow of urine, and with a more copious discharge of urine. The third attack was an hour after I saw you, and was succeeded by a return of a paroxysm of gout, and a more rapid return of the gout. The fourth attack was also accompanied by a fever, but was not always accompanied by any febrile symptoms. The third attack of gout was a fortnight after you had been ill, and the fourth was followed by a paroxysm of gout. The fourth attack was a fortnight after you were attacked, and was accompanied by a sense
It would be interesting to know how much text was generated per century!
Very cool concept though, but it definitely has some bias.
So, the OP model will accidentally reflect the biases of the time. The current, commercial models intentionally reflect specific biases. Except for uncensored models, which accidentally have the biases in the training data, modified by the uncensoring set.
To be frank though, I think this is a better way than all people's thoughts all of the time.
I think the "crowd" of information makes the end output of an LLM worse rather than better. Specifically in our inability to know really what kind of bias we're dealing with.
Currently to me it feels really muddy knowing how information is biased, beyond just the hallucination and factual inconsistencies.
But as far as I can tell, "correctness of the content aside", sometimes frontier LLMs respond like freshman college students, other times they respond with the rigor of a mathematics PhD candidate, and sometimes like a marketing hit piece.
This dataset has a consistency which I think is actually a really useful feature. I agree that having many perspectives in the dataset is good, but as an end user being able to rely on some level of consistency with an AI model is something I really think is missing.
Maybe more succinctly: I want frontier LLMs to have a known and specific response style and bias which I can rely on, because there already is a lot of noise.
Another thought, just occurred when thinking about readmes and coding LLMs: obviously this model wouldn't have any coding knowledge, but I wonder if it could be possible to combine this somehow with a modern LLM in such a way that it does have coding knowledge, but it renders out all the text in the style / knowledge level of the 1800's model.
Offhand I can't think of a non-fine-tuning trick that would achieve this. I'm thinking back to how the old style transfer models used to work, where they would swap layers between models to get different stylistic effects applied. I don't know if that's doable with an LLM.
I still think it's super important. Archive your current models - they'll be great in the future.
But on various data (i.e., separate model per source): the Bible, Don Quixote and Franz Kafka. (As well as a (bad!) lyrics generator, and translator.)
The model that could come up with the cure based on the limited data of the time wouldn't just impress, it would demonstrate genuine emergent reasoning beyond pattern matching. The challenge isn't recombining existing knowledge (which LLMs excel at), but making conceptual leaps that require something else. Food for thought.
===
You:
I pray you, who is this Master Newton?
timecapsulellm-v2-1800-1875-mlx:
TI offer to pay you the very same fee as you did before. It was not in the power of your master to deliver the letter to your master. He did. I will be with you as soon as I can keep my word. It is not at all clear, whether the letter has been sent or not. It is not at all clear: but it is clear also that it was written by the person who gave it. "No," I said, "I cannot give it to you." There, the letter was sent to me. "The letter is yours, I believe," I said. "But, I hope, you will not refuse to give it to me?
There's a "Use this model" button on that page that can launch it in LM Studio.
This feels like a neat sci-fi short story hook to explain the continuous emergence of God as an artifact of a simulation
Back in the pre-LLM days, someone trained a Markov chain off the King James Bible and a programming book: https://www.tumblr.com/kingjamesprogramming
I'd love to see an LLM equivalent, but I don't think that's enough data to train from scratch. Could a LoRA or similar be used in a way to get speech style to strictly follow a few megabytes worth of training data?
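For reference, the Markov-chain approach mentioned above can be sketched in a few lines. This is a generic word-level chain, not the actual code behind that blog; the function names and the `order` parameter are illustrative. Mixing two sources just means concatenating their text before building the chain.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Chain = Dict[Tuple[str, ...], List[str]]


def build_chain(text: str, order: int = 2) -> Chain:
    """Map every run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain: Chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain: Chain, length: int = 30, seed: Optional[int] = None) -> str:
    """Random-walk the chain from a random starting state, up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))  # raises IndexError if the chain is empty
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:  # dead end: this state was only seen at the end of the text
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


if __name__ == "__main__":
    corpus = "in the beginning was the word and the word was with the function"
    print(generate(build_chain(corpus, order=1), length=8, seed=0))
```

With only a few megabytes of text, a low `order` (1 or 2) keeps the output varied, while higher orders tend to reproduce long passages verbatim.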
Somewhat missing the cutoff of 1875 is the revised NT of the KJV. Work on it started in 1870 but likely wasn't used widely before 1881.
Edit: I figured it out
"The Lord of the Rings uding the army under the command of his brother, the Duke of York, and the Duke of Richmond, who fell in the battle on the 7th of April, 1794. The Duke of Ormond had been appointed to the command of the siege of St. Mark's, and had received the victory of the Rings, and was thus commanded to move with his army to the relief of Shenham. The Duke of Ormond was at length despatched to oppose them, and the Duke of Ormond was ordered
There's a disconnect somewhere that I can't quite put my finger on. Am I just lacking imagination?
To get Relativity, the model needs to realize the training data isn't just incomplete, but fundamentally wrong. That requires abductive reasoning (the spark of genius) to jump out of the local minimum. Without that AGI-level spark, a 'pure knowledge pile' will just generate a very eloquent, mathematically rigorous defense of Newtonian physics.
https://news.ycombinator.com/item?id=46319826
I found it particularly thought-inspiring how a model with training from that time period completely lacks context/understanding of what it is itself, but then I realized that we are the same (at least for now).
Now if I train a foundation model with docs from the Library of Alexandria and only those texts of that period, I would have a chance to get a rudimentary insight into what the world was like at that time.
And maybe time-shift even further.
Congratulations, you've reinvented the history book (just with more energy consumption and less guarantee of accuracy)
With the internet and pervasive text communication and audio video recording we have the unique ability to make an LLM mimic daily life but I doubt that would be possible for those days.
That happened in the past, and the "naive" way of doing it is usually easy to spot. There are, however, many ways in which testing data can leak into models, even without data contamination. However this doesn't matter much, as any model that only does well in benchmarks but is bad in real-world usage will be quickly sussed out by people actually using them. There are also lots and lots of weird, not very popular benchmarks out there, and the outliers are quickly identified.
> perform on SWE-bench if trained only on data prior to 2024.
There's a benchmark called swe-REbench, that takes issues from real-world repos, published ~ monthly. They perform tests and you can select the period and check their performance. This is fool-proof for open models, but a bit unknown for API-based models.
That is, the architectures of today are chosen to yield the best results given the textual data around today and the problems we want to solve today.
I'd argue that this lack of bias would need to be researched (if it hasn't been already) before this kind of model has credence.
LLMs aren't my area of expertise but during my PhD we were able to encode a lot of a priori knowledge through the design of neural network architectures.
v0.5: 123M Parameters
v1: 700M Parameters
v2mini-eval1: 300M Parameters
I would not call this LLM. This is not large. It's just a normal-sized LM. Or even small.
(It's also not a small LLM.)
Having said that, looking up small language model these days returns tons of results calling 7B models small language models.
------
My understanding of small language models is that they're generally intended for specific purposes, like analysis and classification (whatever you'd call the text equivalent of image interrogation with CLIP models), translation, etc; that they're small because they don't need to be big to do their intended functions, not because they're just smaller versions of bigger models.
I think talking to a legit trained LLM from a different era would be rad. But… this seems the opposite of Gemini making black lady popes and Native American Nazis… that these views wouldn’t really be “allowed” (published by anyone that wants AI funding money).
A truly authentic historical model will have some unsavory opinions and a very distinctive dialect.