Ask HN: Local CJK/Latin entity extraction without shipping a model?

2 points | zphou | 15h ago | 0 comments

I am building a local-first Mac writing app, and I am stuck on the story-memory layer.

The problem: given a long fiction manuscript, I need to extract characters, aliases, relationships, locations, timeline facts, and unresolved plot facts. It has to work reasonably well across CJK and Latin-script text. For example, Chinese/Japanese/Korean names do not tokenize like English names, aliases may be implicit, and the same character can appear under several surface forms across chapters.

I do not want to ship a huge local model just to do this. The app should work locally on normal Macs, and the extraction/indexing layer needs to be fast enough to run repeatedly as the manuscript changes.

What I am trying to figure out:

- practical local NER/entity-linking options for CJK + Latin text
- whether a hybrid rule-based + embeddings approach is realistic
- how to store weak candidate facts before they are confirmed
- how to link extracted facts back to exact manuscript evidence
- how to avoid one bad extraction polluting future semantic search
- whether anyone has had success with small local models, SQLite FTS, vector indexes, or graph-style memory for this kind of problem
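
To make the candidate-fact and evidence questions concrete, here is a minimal sketch of the kind of storage I have in mind, using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table name `candidate_facts`, the columns, the three status values, and the helper functions are all hypothetical, and evidence is stored as character offsets into a chapter's text. The only idea it demonstrates is that unconfirmed extractions sit in the table without being visible to downstream lookups.

```python
import sqlite3

# Hypothetical schema: one row per extracted fact, with its review status
# and the character span in the manuscript that supports it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS candidate_facts (
    id           INTEGER PRIMARY KEY,
    kind         TEXT NOT NULL,      -- e.g. 'alias', 'location', 'relationship'
    subject      TEXT NOT NULL,      -- canonical entity name
    value        TEXT NOT NULL,      -- the extracted fact itself
    status       TEXT NOT NULL DEFAULT 'candidate'
                 CHECK (status IN ('candidate', 'confirmed', 'rejected')),
    chapter      INTEGER NOT NULL,
    start_offset INTEGER NOT NULL,   -- inclusive character offset
    end_offset   INTEGER NOT NULL    -- exclusive character offset
);
"""


def open_store(path=":memory:"):
    """Open (or create) the fact store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_candidate(conn, kind, subject, value, chapter, start, end):
    """Record a weak, unconfirmed fact together with its evidence span."""
    cur = conn.execute(
        "INSERT INTO candidate_facts "
        "(kind, subject, value, chapter, start_offset, end_offset) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (kind, subject, value, chapter, start, end),
    )
    conn.commit()
    return cur.lastrowid


def set_status(conn, fact_id, status):
    """Promote or reject a fact; the CHECK constraint rejects unknown statuses."""
    conn.execute(
        "UPDATE candidate_facts SET status = ? WHERE id = ?", (status, fact_id)
    )
    conn.commit()


def confirmed_facts(conn, subject):
    """Return only confirmed facts, so unreviewed guesses never reach search."""
    rows = conn.execute(
        "SELECT kind, value, chapter, start_offset, end_offset "
        "FROM candidate_facts "
        "WHERE subject = ? AND status = 'confirmed' ORDER BY id",
        (subject,),
    )
    return rows.fetchall()


def evidence(chapters, chapter, start, end):
    """Slice the supporting text out of a {chapter_number: text} mapping."""
    return chapters[chapter][start:end]
```

Python strings index by code point, so the offsets work the same for CJK and Latin text. Offsets like these go stale as soon as the chapter is edited, so this sketch does not answer the evidence-linking question by itself.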

Has anyone built something like this? What worked, and what was a dead end?
