Concretely, I mean that it's very easy to generate English text using sentence templates. Just plug in words and it works out: "The $process_name has completed running." "Like $username's comment." "Ban $username."
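To make this concrete, here's a minimal sketch in Python of why English gets away with plain string substitution; the template keys and the "backup job" / "alice" values are made-up examples:

```python
# English UI messages work with plain string substitution:
# no morphology is needed, any noun drops into the slot unchanged.
from string import Template

templates = {
    "done": Template("The $process_name has completed running."),
    "like": Template("Like $username's comment"),
    "ban":  Template("Ban $username"),
}

print(templates["done"].substitute(process_name="backup job"))
# -> The backup job has completed running.
print(templates["ban"].substitute(username="alice"))
# -> Ban alice
```

In an agglutinative language the slot's surroundings would have to change depending on what you plug in, which is exactly what this scheme cannot express.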
Relatedly, I think focusing NLP efforts on English masks a lot of interesting phenomena, because English text already comes in a reasonably tokenized, chunked-up, pre-digested, easy-to-handle form. For example, speech recognition systems started out with closed vocabularies of larger and larger numbers of words, and even in their toy forms they could recognize some proper English sentences. To do that in Hungarian, for example, the "upfront costs" of a "somewhat usable" system are much higher, because a closed vocabulary doesn't get you anywhere. (Similarly, learning basic English is very easy: you can build 100% correct sentences on day 1. You learn "I", "you", "see" and "hear" and can say "I see", "You see", "I see you" and "I hear Peter", which are all 100% correct. In Hungarian these are "nézek", "nézel", "nézlek", "hallom Pétert", requiring you to learn several suffixes, vowel harmony, and the definite/indefinite conjugation. The learning curve until your first 100% correct 3-5 word sentences is just steeper.)
I don't mean it's impossible to handle agglutinative languages in NLP, just that the "minimum viable model" is much simpler and more attainable for English, which on the one hand kickstarted and propelled the early research phases, and on the other hand perhaps fueled a bit too much optimism.
English can seem very well structured, and it can tempt one to think of language in a very symbolic, within-the-box, rule-based way: in terms of syntax trees, sets of valid sentences, and so on, instead of the "fuzzy probabilistic mess" that it really is. Certainly, the syntax-tree, generative-grammar approach (Chomsky and others) gave us a lot of computer science, but this kind of clean, purely symbolic parsing doesn't seem to be what drives today's NLP progress.
In summary, I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.
That's like saying "I wonder when you stopped beating your wife"; you assume there was a head start, when, in fact, the world's first commercial computer was German[1].
And until recently, natural languages had a near-zero effect on computing. Worst case, users ended up seeing messages which weren't grammatically perfect, and it wasn't a big deal.
>I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic
Would have? NLP has only started to matter recently, at a time when it has to work in all languages from the get-go. The current evolution includes contributions of people from many languages and cultures.
And for that matter, English makes a lot of things harder.
Did the Z4 do a lot of German-language text generation, or German-language input parsing? But anyway, German is also not agglutinative, though it does have complexities like gendered declension of articles and adjectives.
> And until recently, natural languages had a near-zero effect on computing.
Seems like we're talking past each other; I packed multiple things into my comment. I meant user-facing messages there. I did some software internationalization (translation) work some years ago, and in many cases the format was just templates: you were expected to translate templates with pluggable strings. Whereas what you would actually need is a function that looks at the word to be plugged in, extracts its vowels, categorizes them with some branching logic, looks at the last consonant, decides whether a linking vowel is needed, picks the suffix form based on vowel harmony, checks whether the word is an exception, and then applies the suffix.
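A toy sketch of that kind of branching logic, for the Hungarian dative suffix (-nak/-nek): the vowel classes are real, but everything else is heavily simplified and skips the exception handling a real library would need:

```python
# Simplified Hungarian dative suffix selection (-nak/-nek).
# Real code would also handle linking vowels, neutral-vowel
# exceptions (e.g. "híd" -> "hídnak"), and an exception list.

BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

def dative(word: str) -> str:
    # Final short a/e lengthen before most suffixes: Anna -> Anná-.
    if word.endswith("a"):
        word = word[:-1] + "á"
    elif word.endswith("e"):
        word = word[:-1] + "é"
    # Vowel harmony: the last harmonic vowel decides the suffix form.
    for ch in reversed(word.lower()):
        if ch in BACK_VOWELS:
            return word + "nak"
        if ch in FRONT_VOWELS:
            return word + "nek"
    return word + "nak"  # fallback; a real library consults a lexicon

print(dative("Péter"))  # Péternek
print(dative("Anna"))   # Annának
print(dative("ház"))    # háznak
```

Even this toy version is already more logic than any English template ever needs, and it still gets the neutral-vowel exceptions wrong.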
In English you can generate the message "Added %s to the %s." These are usually translated to Hungarian as if the original were "%s has been added to the following: %s". Or instead of "with %s" the translators must write "with the following: %s", because applying "with" to a word or personal name requires non-trivial logic. Whenever translators resort to "... the following: %s", you know they couldn't fit the value into the sentence with proper grammar, due to overly primitive string-interpolation-based internationalization.
Until recently, Facebook was not able to apply declension to people's names, as it is quite complicated. Normally, "$person_name likes this post." would require putting $person_name into the dative case, which requires determining vowel harmony. To avoid this, they picked a rarer verb form that doesn't need the dative case but doesn't sound as natural. They only transitioned to the dative case in the last year or so.
A lot of this stuff is just not even on the minds of English-speaking devs, because template-based string interpolation is a good-enough solution in English for the vast majority of cases. The only exceptions that need a little branching logic are choosing "a" or "an" before a word and pluralization, but these don't come up too often.
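For comparison, here is roughly all the branching English ever demands; these are naive heuristics I'm making up for illustration, and they already break on "an hour" or "a unicorn":

```python
# The only morphology-like branching typical English templates need:
# article choice and naive pluralization. Heuristics only.

def article(word: str) -> str:
    # Vowel-letter rule; fails on "hour", "unicorn", etc.
    return "an" if word[0].lower() in "aeiou" else "a"

def pluralize(word: str, n: int) -> str:
    if n == 1:
        return word
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    return word + "s"

print(f"Added {article('apple')} apple.")     # Added an apple.
print(f"3 {pluralize('box', 3)} deleted.")    # 3 boxes deleted.
```

Two ten-line helpers versus vowel harmony, linking vowels, and exception lists: that asymmetry is the whole point.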
Again, my point was that dynamically generating user-facing messages and UI elements is very easy in English, while doing it properly in other languages is much harder.
> Would have? NLP has only started to matter recently, at a time when it has to work in all languages from the get-go. The current evolution includes contributions of people from many languages and cultures.
Most of the research outside of explicit machine translation is still based on English. How many papers are out there, e.g., on visual question answering (VQA) systems in Polish or Finnish? In many cases I'm less impressed by such systems because I feel English is too easy: word order is very predictable, the words are easily separable, and the whole thing is much more machine-processable. Maybe it isn't so; it would be interesting to see empirical results.
The difficult languages are inflectional (fusional) languages, where the word changes completely instead of just having something tacked onto the end.
It's worth pointing out that all whitespace is completely optional in Fortran, one of the first programming languages: doi=0,10 is exactly the same as DO I = 0, 10. So it's not like early computing relied heavily on gratuitous whitespace.
For example, speakers of every language are already used to mathematical notation. Programming languages draw more inspiration from mathematical notation than from English.
That leaves naming. Here agglutinative languages should have an advantage: they offer more natural ways to describe roles, like how English has caller and callee, rather than clumsily camel-casing something like sumOfLists.
> linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.
Probably not much different, except that more elements of morphology are treated together with syntax.
> Concretely I mean it's very easy to generate text using sentence templates. Just plug in words and it works out. "The $process_name has completed running." "Like $username's comment" "Ban $username".
If computing were primarily championed by a fusional language (agglutinative languages usually have somewhat "clean" morphology), I imagine that libraries for inflection would be more prominently used, like how in English more polished apps use a pluralizer library. One natural shape for an inflection API is a fluent API.
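As a sketch of what that fluent shape might look like: the class and method names below are entirely hypothetical, and the "renderer" just tags the lemma with feature labels instead of running real morphological rules:

```python
# Hypothetical fluent API for an inflection library. Each method
# records a grammatical feature and returns self, so calls chain.

class Inflect:
    def __init__(self, lemma: str):
        self.lemma = lemma
        self.features = []

    def plural(self):
        self.features.append("pl")
        return self  # returning self is what makes the API "fluent"

    def dative(self):
        self.features.append("dat")
        return self

    def render(self) -> str:
        # A real library would apply morphological rules here; we
        # just tag the lemma so the feature chain is visible.
        return self.lemma + "[" + "+".join(self.features) + "]"

print(Inflect("ház").plural().dative().render())  # ház[pl+dat]
```

The appeal of the fluent style here is that suffix order in agglutinative languages is itself a fixed chain, so the call order can mirror the morpheme order.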
Re your point on language being a "[f]uzzy probabilistic mess": language is absolutely NOT a fuzzy probabilistic mess, and it's a damn shame that NLP based its success on black-box models, because it means no one bothers realizing that language isn't a mess at all. See Jelinek's law of speech recognizer accuracy [1]. Simply because we get results using messy black-box models doesn't mean that's how things work under the hood.
[1] https://www.researchgate.net/publication/221013038_Unsupervi...
A big issue is that in synthetic languages individual word forms are much rarer (because there are more morpheme combinations per word). So if you're building something like a bag-of-words or an n-gram model, your input data is likely to be very sparse, which translates to poor modelling of the language itself, i.e. of which word forms speakers would judge grammatical.
With agglutinative languages like Turkish, a technique that has been used with considerable success is just considering each morpheme a distinct token, but it has many of the same problems as word-level tokenization. I was looking at a paper recently that claimed to have found a good way to do smoothing so that unseen ngrams could be assigned a non-zero probability in a way that conformed to the rules of the language, but we'll have to see if that can work in practice.
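A toy illustration of why morpheme-level tokens help: the word forms and segmentations below are illustrative stand-ins for what a real morphological analyzer would produce, not real analysis output:

```python
# Contrast word-level vs morpheme-level vocabularies on a tiny
# Turkish-like corpus. The segmentations are toy assumptions.
from collections import Counter

SEGMENTS = {
    "evlerimde": ["ev", "ler", "im", "de"],   # "in my houses"
    "evlerinde": ["ev", "ler", "in", "de"],   # "in your houses"
    "evde":      ["ev", "de"],                # "in the house"
    "evler":     ["ev", "ler"],               # "houses"
    "evim":      ["ev", "im"],                # "my house"
    "evinde":    ["ev", "in", "de"],          # "in your house"
}

corpus = list(SEGMENTS)  # one occurrence of each form

word_vocab = Counter(corpus)
morph_vocab = Counter(m for w in corpus for m in SEGMENTS[w])

print(len(word_vocab), "word types")    # 6, each seen only once
print(len(morph_vocab), "morph types")  # 5, "ev" seen 6 times
```

Every word form is a singleton at the word level, while the morpheme counts are shared across forms, which is exactly what an n-gram model needs to generalize.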
Here’s a real (if unlikely) word in Turkish and how this whole agglutination business works: https://twitter.com/languagecrawler/status/62385880386859827...
I don’t blame Apple though; it might actually be impossible to do Turkish autocorrect the same way English autocorrect works, because the beginning of the word indicates the actual word but the end indicates everything else (direction, modifiers, etc.). So it’s about as easy/hard as in English to guess the beginning of the word, but impossible to guess the modifiers that get added, because the moment the modifier sequence starts, every single letter changes the meaning, so there are almost no incorrect paths. A correct Turkish autocorrect implementation would autocomplete the word root, then stop at the halfway-complete word where the modifier suffixes start, so that the user can complete the modifier sequence on their own.
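That root-only completion strategy can be sketched in a few lines; the root list is a toy assumption standing in for a real lexicon, and consonant mutations (kitap -> kitab- before vowel-initial suffixes) are ignored:

```python
# Sketch of root-only autocomplete: complete the known root, then
# stop so the user can type the suffix chain themselves.

ROOTS = ["kitap", "kalem", "ev"]  # book, pen, house (toy lexicon)

def complete_root(prefix: str):
    """Return the unique root extending prefix, or None if it's
    ambiguous or unknown."""
    matches = [r for r in ROOTS if r.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None

print(complete_root("kit"))  # kitap -- caret left here for suffixes
print(complete_root("k"))    # None: "kitap" vs "kalem" is ambiguous
```

A production version would use a trie over the lexicon and would have to model the stem changes that suffixation triggers, but the stop-at-the-root idea is the same.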
If you find the concept interesting, you will enjoy reading the novel.
Other languages use a lot of morphemes-per-word. One strategy to create words from morphemes is called agglutination (meaning to glue things together). An agglutinative language takes all the morphemes that are going to go into a word, and with minimal or no changes, glues them together to form a word.
For example, the Yupik word "tuntussuqatarniksaitengqiggtuq" means "He had not yet said again that he was going to hunt reindeer". It is formed by taking the following morphemes and agglutinating them:
"tuntu-ssur-qatar-ni-ksaite-ngqiggte-uq"
For example a totally normal Hungarian word is: szolgáltatásaiért = szolgá+l+tat+ás+a+i+ért = for his/her/its services. Szolga means servant, from Slavic origin. Szolgál is a verb meaning to serve. Szolgáltat means to provide service. Szolgáltatás means service (as in "goods and services", "internet service", etc.). Szolgáltatása means his/her/its service. Szolgáltatásai means his/her/its services. Szolgáltatásaiért means "for his/her/its services".
szolga = servant
+ l = verb-forming suffix [1]
+ tat = causative suffix [2]
+ ás = gerund-forming suffix, makes a verbal noun from a verb [3]
+ ai = marker for plural possession [4]
+ ért = causal-final suffix, denotes the reason for the action [5]
These suffixes are morphemes you can (with some simple rules) just add to words to achieve the corresponding change in meaning. In CS terms, they're functions that take the input word and make a new one. Each one of the above, used somewhere else:
winter -> to spend the winter: tél + l -> telel
to read -> to make them read: olvas + tat -> olvastat
to read -> the reading: olvas + ás -> olvasás
house -> his houses: ház + ai -> házai
house -> because of, affecting the house: ház + ért -> házért
[1] https://en.wiktionary.org/wiki/-l#Hungarian [2] https://en.wiktionary.org/wiki/-tat#Hungarian [3] https://en.wiktionary.org/wiki/-%C3%A1s#Hungarian [4] https://en.wiktionary.org/wiki/-ai#Hungarian [5] https://en.wiktionary.org/wiki/-%C3%A9rt
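The "suffixes are functions" idea above can be written as literal Python, with agglutination as function composition; the spelling rules here are heavily simplified assumptions (e.g. the real -ás/-és gerund suffix varies with vowel harmony):

```python
# Each Hungarian suffix as a function from word to word; building
# a word is then just function composition. Simplified spelling.

def gerund(verb: str) -> str:       # -ás: verb -> verbal noun
    return verb + "ás"

def plural_poss(noun: str) -> str:  # -ai: his/her plural possession
    return noun + "ai"

def causal(noun: str) -> str:       # -ért: "for / because of"
    return noun + "ért"

word = "szolgáltat"       # to provide service
word = gerund(word)       # szolgáltatás      (service)
word = plural_poss(word)  # szolgáltatásai    (his/her services)
word = causal(word)       # szolgáltatásaiért (for his/her services)
print(word)
```

Internationalization frameworks that only support %s-style slots have no place to put functions like these, which is the root of the "the following: %s" workaround described elsewhere in the thread.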
It’s oddly circular in English that certain word+ending combinations are only accepted if someone else has used them before.
It allows for a lot of really short sentences; here's a nonsensical example:
His book is on fire - kitabı yanıyor.
The word endings are sufficient to provide context and meaning.
If you find Turkish to be too difficult to learn, try Malay. It's also agglutinative and used by ~300 million people (Malay and Indonesian are for all practical purposes the same language).
- Book = kitap, not kitab
- My book = kitabım, not kitabim
- Your book = kitabın, not kitabsin
An example in Finnish:
- juosta (basic form, the infinitive, of the verb "to run")
- juoksen (I run)
- juoksentelen (I run around)
- juoksentelisinkohan (I wonder should I run around)
- juostaankohammekohaan (I wonder do we run)
The last two forms are very rarely used, and I have no idea whether the last form is even correct. I have some friends who insist on talking like this. Usually, people express the same thing with more words; for example, juoksentelisinkohan is roughly equivalent to:
- Mietin ., että. pitäisikö. minun. juosta. ympäriinsä.
- I wonder., that. should. my (in this context, me). run. around.
The . are to separate the words.
Yet, it would be perfectly fine to just append a question mark to juoksentelisinkohan or juostaankohammekohaan and have a one-word sentence. An interesting remark is that in practice the question mark is redundant in both cases, as the -ko- part of the word already forces the only possible interpretation to be a question.
I have absolutely no idea how one would formalize all this.
The frequentative forms would be "juoksentelemmekohan" and "juoksennellaankohan" respectively.
Also, intransitive verbs[2] are like thunks.