I'll take the extreme position that language translation is an unsolvable problem because of this exact phenomena. There was a recent case where a politician was accused of making a racist remark. He said something like "You are a donkey." or "You all are donkeys." to another [minority background] politician. Which was it? Well in many languages the second person plural and the second person singular formal are identical. And there are no articles. So the two statements are literally identical. Which did he mean? Nobody will ever know, besides him.
And outside of inherent language ambiguities, start piling on the endless (and ever/rapidly changing) euphemisms, idioms, colloquialisms, metaphors, just plain old ambiguous sarcasm, and all the other things that make language fun (and more expressive). And these sort of things aren't really the exceptions so much as the rule. And it only becomes more common the more distant languages get. Translations from various Asian languages to English often look just hilarious. Now imagine going back to languages exponentially more detached from any modern language, using one can only imagine what sort of expressions, and trying to convert it.
Especially using a neural network type system you'll probably be able to get something. And, even worse, it might well even make sense. That's a problem because, kind of like ChatGPT, it being coherent is zero indication of it being right.
That's modern European languages ... and post ~1100 BCE if I recall correctly.
So, indigenous languages from people settled in Australia [1] for 50,000+ years can be a little different, some don't have "left" | "right" as relative to PoV directions and stick with East V. West as absolutes for example.
We're down to maybe 20-30 from pre colonial 100's though [2].
[1] https://mgnsw.org.au/wp-content/uploads/2019/01/map_col_high...
That's Indo-European (as in descended from PIE/Yanmaya). Basque is older than all of them and it's a living European language.
You're probably thinking of the indo-european language family, which accounts for about 45% of native language speakers. The largest language family in the world, but not even a majority.
Scripts and languages change over the course of decades, and while there are well known mechanisms to those changes, trying to deduce hieroglyphics or ancient Egyptian from a modern corpus is impossible.
The idea that there is some shared structure in all language is known as universal grammar. If that structure exists is still hotly debated.
I know this may be not be the answer that you are looking for, but the way most of these ML systems are designed is based on the idea that life, the universe, and everything can be modeled by a series of joint probabilities. For toy problems you draw a diagram
https://en.m.wikipedia.org/wiki/Graphical_model
It's an old idea in AI (predates even ML) but people have never been able to do anything useful with it outside of exam problems until the emergence of language models on modern deep learning hardware. All of a sudden variational learning and causal inference are not merely statistical word problems for grad students any more. This is the key to how most of the custom deep learning based avatar generators work. They use a Variational Autoencoder. For LLMs, it is in the form of a transformer which contains a sampling step (sampling from a distribution is the key to Bayesian methods).
I would like to emphasize the theory of probabilistic learning is very different from the actual practice. The theory we have today isn't much different from 20 years ago. Implement the methods in for example Murphy's Probabilistic ML book and they would be useless if you don't have access to modern deep learning hardware and gradient descent optimizers. Without deep learning, we won't have LLMs, regardless of how fancy the variational learning theories are.
There is a good candidate for a test. Someone will probably already work on it. Minoan as written in Linear A has only survived in a few thousand tokens and despite thousands of man years of effort, natural intelligence has made virtually no progress in understanding it. That's still easy mode, since we know that the Minoans were in contact with speakers of indo-european and Semitic languages, and writers of hieroglyphics and phonetician script, so their written Language was probably influenced by that.