Are topic models reliable or useful?

(medium.com)

2 points · pvankessel · 4y ago · 2 comments


This is consistent with my own experiences with topic models, although I'm left wondering to what extent these observations generalize, and why. I tried to find more details in previous posts about the models used, etc., but couldn't find much.

There's a lot of interest in overfitting in ML, but it tends to focus on supervised methods. I think unsupervised methods deserve more attention, both with regard to overfitting in particular and more broadly.

pvankessel (OP) · 4y ago

We started off by trying LDA and NMF, but the topics were too messy, so we wound up switching to CorEx (https://github.com/gregversteeg/corex_topic), which is a semi-supervised algo that lets you "nudge" the model in the right direction using anchor terms. By the time our topics started looking coherent, it turned out that a regex with the anchor terms we'd picked outperformed the model itself. This case study was on a relatively small sample of relatively short documents (~4k open-ended survey responses), but for what it's worth, we also tried to use topic models to classify congressional Facebook posts (a much larger corpus with longer documents) and the results were the same.

Overfitting is certainly part of the problem - in one of my earlier posts I talk about "conceptually spurious words," which are essentially the product of overfitting - but the more difficult problem is polysemy. I'm sure there are ways to mitigate that - expanding the feature space with POS tagging, etc. - but ultimately I think the solution is to simply avoid using a dimensionality reduction method for text classification. Supervised models are clearly the way to go - even if those "models" are just keyword dictionaries curated based on domain knowledge.
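To make the keyword-dictionary baseline concrete, here is a minimal sketch of the general idea: one regex per topic built from a curated term list, scored against hand labels. The topic names, terms, and sample documents are hypothetical and are not taken from the case study. It also shows how polysemy bites a pure keyword approach ("bank" in a river context).

```python
import re

# Hypothetical keyword dictionary; real terms would come from domain knowledge.
TOPIC_KEYWORDS = {
    "health": ["health", "doctor", "hospital", "illness"],
    "finance": ["money", "debt", "salary", "bank"],
}


def compile_topic_patterns(topic_keywords):
    """Build one case-insensitive regex per topic.

    Each term matches at a word boundary and allows trailing word
    characters, so "doctor" also matches "doctors".
    """
    patterns = {}
    for topic, terms in topic_keywords.items():
        alternation = "|".join(re.escape(term) for term in terms)
        patterns[topic] = re.compile(rf"\b(?:{alternation})\w*", re.IGNORECASE)
    return patterns


def classify(text, patterns):
    """Return the set of topics whose pattern matches the text."""
    return {topic for topic, pattern in patterns.items() if pattern.search(text)}


def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical hand-labeled documents: (text, true topics).
    labeled_docs = [
        ("My doctor says my health is improving", {"health"}),
        ("Paying off debt on a small salary", {"finance"}),
        ("We had a picnic on the river bank", set()),  # polysemy: "bank"
        ("Spending time with my family", set()),
    ]
    patterns = compile_topic_patterns(TOPIC_KEYWORDS)
    for topic in TOPIC_KEYWORDS:
        predicted = [topic in classify(text, patterns) for text, _ in labeled_docs]
        actual = [topic in labels for _, labels in labeled_docs]
        print(topic, precision_recall(predicted, actual))
```

The same `precision_recall` helper can score a topic model's document assignments against the same hand labels, which is one way to set up the regex-versus-model comparison described above.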


teorema · 4y ago
