This is my first reaction as well. Talking about tokenization and POS tagging is getting lost in the weeds when one has goals like this:
>I also want to be able to assess how much of the text is about a given topic, so that if I'm interested in reading a detective story from e.g. the Project Gutenberg collection, I don't want it to pick up a story where a detective is only mentioned in one paragraph.
This is more of an NLU problem than an NLP problem, isn't it? It's like tracking how much of a Harry Potter book is about Voldemort without knowing ahead of time that he may also be referred to as He Who Must Not Be Named, You-Know-Who, the Dark Lord, and so on. You would first have to identify the thing you're interested in, then learn when the characters or the author invent new ways to refer to it, and carry all of those forward to find new instances. Fun!
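
To make the "carry the aliases forward" idea concrete, here is a toy Python sketch. Everything in it is my own assumption for illustration: the function name `topic_coverage`, the cue phrases in `ALIAS_CUE`, and the choice to measure coverage per paragraph. Real text would need proper coreference resolution rather than a regex, and the plain substring matching here would misfire on short aliases.

```python
import re

# Hypothetical cue phrases that introduce a new name for a known entity.
ALIAS_CUE = r"(?:also known as|also called|known as|called)"


def topic_coverage(paragraphs, seed_aliases):
    """Return (fraction of paragraphs mentioning the topic, aliases found).

    Starts from a few seed names, learns new ones from patterns such as
    "<known name>, also known as <new name>", and carries them forward
    to later paragraphs.
    """
    aliases = {a.lower() for a in seed_aliases}
    hits = 0
    for para in paragraphs:
        text = para.lower()
        # Naive substring match against every alias known so far.
        mentioned = [a for a in aliases if a in text]
        if not mentioned:
            continue
        hits += 1
        for known in mentioned:
            # Capture the phrase after the cue, up to the next punctuation mark.
            pattern = (
                re.escape(known)
                + r",?\s+" + ALIAS_CUE
                + r"\s+([a-z][a-z' -]*?)(?=[,.;!?]|$)"
            )
            for match in re.finditer(pattern, text):
                aliases.add(match.group(1).strip())
    coverage = hits / len(paragraphs) if paragraphs else 0.0
    return coverage, aliases


paragraphs = [
    "Voldemort, also known as the Dark Lord, returned.",
    "The Dark Lord gathered his followers.",
    "Harry ate breakfast.",
    "Nobody spoke of it.",
]
coverage, aliases = topic_coverage(paragraphs, ["Voldemort"])
print(coverage)         # 0.5
print(sorted(aliases))  # ['the dark lord', 'voldemort']
```

Searching for the seed name alone would only catch the first paragraph (25%); the learned alias picks up the second one as well. The hard part, of course, is that authors rarely announce a new alias with such a convenient cue phrase.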