The problem: given a long fiction manuscript, I need to extract characters, aliases, relationships, locations, timeline facts, and unresolved plot facts. It has to work reasonably well across both CJK and Latin-script text, which is where it gets hard: Chinese/Japanese/Korean names do not tokenize like English names, aliases may be implicit, and the same character can appear under several surface forms across chapters.
I do not want to ship a huge local model just to do this. The app should work locally on normal Macs, and the extraction/indexing layer needs to be fast enough to run repeatedly as the manuscript changes.
What I am trying to figure out:
- practical local NER/entity-linking options for CJK + Latin text
- whether a hybrid rule-based + embeddings approach is realistic
- how to store weak candidate facts before they are confirmed
- how to link extracted facts back to exact manuscript evidence
- how to avoid one bad extraction polluting future semantic search
- whether anyone has had success with small local models, SQLite FTS, vector indexes, or graph-style memory for this kind of problem
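To make the storage questions concrete, here is a rough sketch of the kind of thing I have in mind, not something I have validated at scale. It assumes plain SQLite via Python's `sqlite3`; the table names (`candidate_fact`, `evidence`), the `status` values, and the helper functions are all hypothetical. Each extracted fact starts as a `candidate`, carries a character span pointing back into the manuscript, and only `confirmed` rows would be exposed to search or embedding.

```python
import sqlite3

# Hypothetical schema: facts start as unconfirmed candidates, and every fact
# keeps a pointer (chapter + character offsets) to the text that supports it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS candidate_fact (
    id         INTEGER PRIMARY KEY,
    subject    TEXT NOT NULL,
    predicate  TEXT NOT NULL,
    object     TEXT NOT NULL,
    confidence REAL NOT NULL,
    status     TEXT NOT NULL DEFAULT 'candidate'
               CHECK (status IN ('candidate', 'confirmed', 'rejected'))
);
CREATE TABLE IF NOT EXISTS evidence (
    fact_id    INTEGER NOT NULL REFERENCES candidate_fact(id),
    chapter    INTEGER NOT NULL,
    start_char INTEGER NOT NULL,
    end_char   INTEGER NOT NULL,
    quote      TEXT NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_candidate(conn, subject, predicate, obj, confidence,
                  chapter, chapter_text, start, end):
    """Insert an unconfirmed fact together with its supporting span."""
    cur = conn.execute(
        "INSERT INTO candidate_fact (subject, predicate, object, confidence) "
        "VALUES (?, ?, ?, ?)",
        (subject, predicate, obj, confidence),
    )
    fact_id = cur.lastrowid
    # Store the quoted text too, so a stale offset can be detected after edits.
    conn.execute(
        "INSERT INTO evidence (fact_id, chapter, start_char, end_char, quote) "
        "VALUES (?, ?, ?, ?, ?)",
        (fact_id, chapter, start, end, chapter_text[start:end]),
    )
    conn.commit()
    return fact_id


def set_status(conn, fact_id, status):
    """Promote or reject a candidate; the CHECK constraint rejects other values."""
    conn.execute(
        "UPDATE candidate_fact SET status = ? WHERE id = ?", (status, fact_id)
    )
    conn.commit()


def confirmed_facts(conn):
    """Only confirmed rows should feed search, so bad candidates stay isolated."""
    return conn.execute(
        "SELECT subject, predicate, object FROM candidate_fact "
        "WHERE status = 'confirmed' ORDER BY id"
    ).fetchall()
```

Two caveats with this sketch: the offsets are Python string indices (Unicode code points), so another runtime that counts characters differently would need a conversion, and storing the quote alongside the offsets is only a cheap way to notice that the manuscript changed under a fact, not a way to re-anchor it.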
Has anyone built something like this? What worked, and what was a dead end?