Most of the printed books are scattered but available. They're the tip of an iceberg, though: there's a significant amount of 'submerged' knowledge about the language in manuscripts and recorded audio, and this is where a lot of the value comes from. Printed texts are primarily religious, so capturing the colloquial usages of words and phrases from those other sources is very useful.
Many manuscripts aren't digitized at all, or are available but still need transcription.
The language is relatively well-recorded (dating back to at least the late 16th century in written form), and yet small enough that a comprehensive reference is viable: estimates of about 5 million words crop up, and even three times that could easily fit in memory on a DigitalOcean droplet, even if fully POS-tagged[1]. Texts are also mostly in the public domain, and there are a lot of bilingual texts (which act as a Rosetta Stone).
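For anyone who wants to sanity-check the "fits in memory" claim, here's a rough back-of-envelope sketch. The per-token byte counts are my own guesses for a compact encoding (not measurements from the actual corpus), and the function name is just for illustration:

```python
def estimate_corpus_bytes(num_tokens, avg_word_bytes=8, tag_bytes=4, overhead_bytes=4):
    """Rough size of a POS-tagged corpus held in a compact in-memory layout.

    All three per-token sizes are assumptions:
      avg_word_bytes  - average encoded length of a word
      tag_bytes       - space for one part-of-speech tag
      overhead_bytes  - offsets/pointers per token
    """
    return num_tokens * (avg_word_bytes + tag_bytes + overhead_bytes)


estimated_words = 5_000_000          # the ~5 million word estimate
pessimistic_words = estimated_words * 3  # the "even 3x" case

size_mb = estimate_corpus_bytes(pessimistic_words) / 1_000_000
print(f"approx. {size_mb:.0f} MB for {pessimistic_words:,} tagged tokens")
```

With these guesses the 3x case comes out at about 240 MB. A naive representation (say, one Python object per token) would be several times larger, so compare the result against whatever droplet size you'd actually pick.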
[0] https://en.wikipedia.org/wiki/Manx_language#Revival
[1] https://en.wikipedia.org/wiki/Part-of-speech_tagging
EDIT: More than happy to talk in depth about this if anyone wants, via comments or the email on my profile.