Where would you recommend starting for such a project, in terms of the minimal theoretical and practical knowledge as well as the engineering side of it? What open source libraries and software are available to speed up the process?
Then it depends heavily on the language; for instance, tokenization (splitting a sentence into words) is really easy in English, Spanish, etc. compared to Japanese, Chinese, etc. So I would say a good starting point would be to try using an NLP parser for a similar language. What language is it? What kind of NLP analysis do you want to perform?
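To illustrate the tokenization point, here is a minimal sketch using only Python's standard `re` module. The function name `simple_tokenize` and the sample sentences are made up for illustration; this is not how any particular NLP library tokenizes. It splits on whitespace and punctuation, which is roughly enough for English but leaves a Chinese sentence as a single unsegmented chunk.

```python
import re

def simple_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    # This relies on spaces separating words, which Chinese and Japanese do not use.
    return re.findall(r"\w+|[^\w\s]", text)

english = simple_tokenize("Where is the pharmacy?")
chinese = simple_tokenize("药店在哪里?")

print(english)  # one token per word, plus the question mark
print(chinese)  # the whole sentence stays one token, plus the question mark
```

For languages written without spaces, a dedicated word segmenter is needed before any further analysis.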
I have tried models for Arabic, but they fall short because of the borrowings from other languages and, in many cases, a different grammar. On one hand, transliterating North African Arabic sentences written in Latin script into the regular Arabic alphabet is a challenging task (many transliterations are possible for a given word); on the other, North African Arabic (NAA) is not standardized, so words are commonly written phonetically using a loose set of transcription rules, themselves borrowed from other languages. An example of this last aspect is 'the pharmacist', which in NAA could be written in any of these forms (combinations of the changes are also possible):
- L'farmassian
- Lfarmassian
- Lpharmassian
- Lpharmacian
- l frmsian ..
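One common way to cope with this kind of spelling variation is to map every variant to a shared key before doing anything else. The sketch below is only an illustration: the function name `skeleton_key` and its four rewrite rules (ph to f, soft c to s, collapsing doubled letters, dropping vowels) are assumptions chosen to fit the pharmacist example, not an established normalization scheme for NAA.

```python
import re

def skeleton_key(word):
    """Reduce a Latin-script spelling to a rough consonant skeleton.

    The rules are hypothetical and tuned to the example variants only.
    """
    s = word.lower()
    s = re.sub(r"[^a-z]", "", s)        # drop apostrophes, spaces, punctuation
    s = s.replace("ph", "f")            # French-style 'ph' sounds like 'f'
    s = re.sub(r"c(?=[eiy])", "s", s)   # soft 'c' before e/i/y sounds like 's'
    s = re.sub(r"(.)\1+", r"\1", s)     # collapse doubled letters ('ss' -> 's')
    s = re.sub(r"[aeiou]", "", s)       # vowels are written inconsistently, drop them
    return s

variants = ["L'farmassian", "Lfarmassian", "Lpharmassian", "Lpharmacian", "l frmsian"]
keys = {v: skeleton_key(v) for v in variants}
for variant, key in keys.items():
    print(f"{variant!r} -> {key!r}")
```

With these rules all five spellings collapse to the same key, so they can be grouped as one vocabulary entry. A key this aggressive will also merge unrelated words, so in practice it would more likely serve to generate candidates that are then ranked by context.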
Thanks for the reference, I will check it out!
First you should figure out what type of parsing you want. I would recommend looking at Stanford's CoreNLP library to see which task you actually want. There are multiple ways to parse grammar. Once you can name the actual problem you want to solve it should be googleable.
The downside of classical NLP is that you need to learn some amount of linguistics to create labeled parse trees for your data or even interpret them.
So, if your goal is to build an application, rather than a library, you may want to learn about neural nets/LSTMs. They can let you go from language to the actual information you want without you needing to encode and interpret parse trees.
The downside of neural nets is that they tend to need more data, but the data is much simpler to label, so you could farm this out to Mechanical Turk if you wanted.
As far as I know, no one has actually succeeded in doing NLP in NAA or Amazigh...
It's a topic of great interest to me but unfortunately I don't have time to invest in that subject. Please keep me informed of your progress!
I'd love to see a system doing NLP on Amazigh written in the Latin alphabet and NLG to Tifinagh...
Also, nowadays, word2vec is unrelated to the understanding of grammatical constructs in natural languages. Simply put, it captures the co-occurrence of words. The grammatical interpretation of a sentence must be seen as a general graph, whereas word2vec operates on the linear structure of sentences (one word after the other). For word2vec to work on grammatical constructs, it would have to be able to ingest graph data. word2vec works on matrices, whereas the grammar of a sentence (POS tagging, dependencies, anaphora, probably others) is represented as a graph, in other words a sparse matrix or a matrix with a large number of dimensions. (It seems to me machine learning is always about dimension reduction with some noise.)
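To make the "linear structure" point concrete, here is a small sketch of window-based co-occurrence counting, which is the kind of signal word2vec-style models learn from. This is not word2vec itself (there is no embedding training here), and the function name `cooccurrence_counts` and the `window` parameter are chosen for illustration only.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, neighbour) pairs within a fixed window along each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Only distance along the sentence matters here;
                    # no parse tree or dependency graph is consulted.
                    counts[(target, tokens[j])] += 1
    return counts

pairs = cooccurrence_counts(["the cat sat on the mat"], window=1)
print(pairs[("the", "cat")])
print(pairs[("cat", "mat")])
```

Two words that are grammatically related but far apart in the sentence never form a pair here, which is the limitation described above.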
I am quite ignorant of the literature on machine learning applied to graph data structures.
It is statistics-based.
If you can deal with the math, some papers such as [2] use corpora for existing languages as a tool to parse new languages, for which there are not too many resources available.
In both cases, you can always contact the authors. They might know how to help with your project, and/or direct you to the right people.