Using Doc2Vec to Suggest SubReddits

(reddit2vec.com)

23 points · jmportilla · 10y ago · 12 comments


utunga · 10y ago · 3 in thread

Hi!

Great work. I guess my question is: do you use 'averaging' of word vectors or the Chinese Restaurant process to get to subreddit vectors? You describe the Chinese Restaurant process as a "more sophisticated method" that you "can" use, but in my experiments with word2vec and reddit (https://github.com/utunga/gensimred) I quickly discovered that simple averaging just does not work. Averaging has this awful 'revert to mean' effect that turns all the paragraph vectors into a sort of bland gray goo where they are all the same.

If you did use Chinese Restaurant process (I love that phrase - brings back memories of an occasion at a Dim Sum restaurant where this almost literally happened) it'd be great to see any source code you may feel like releasing ;_) ... well, it can't hurt to ask..
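To make the 'revert to mean' point concrete, here is a minimal sketch in plain Python. The 2-d word vectors and the two toy documents are invented for illustration; they are not taken from the linked repositories or from any trained model. The point is only that when most tokens are shared filler words, the averaged document vectors end up nearly parallel even if the distinguishing words are orthogonal.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical word vectors (assumption, for illustration only).
WORD_VECTORS = {
    "the": [1.0, 1.0],
    "a": [0.9, 1.1],
    "cat": [1.0, 0.0],
    "gpu": [0.0, 1.0],
}


def doc_vector(tokens):
    """Average the vectors of all in-vocabulary tokens."""
    return average_vectors([WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS])


# One distinctive word each, drowned out by twenty shared filler words.
cats_doc = doc_vector(["cat"] + ["the", "a"] * 10)
gpus_doc = doc_vector(["gpu"] + ["the", "a"] * 10)

print(cosine(WORD_VECTORS["cat"], WORD_VECTORS["gpu"]))  # distinctive words alone
print(cosine(cats_doc, gpus_doc))                        # after averaging
```

The distinctive words are orthogonal on their own, but the averaged documents come out almost identical. Weighting schemes or a trained paragraph vector are the usual ways around this; which one works best here is not something this toy settles.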

gojomo · 10y ago

The gensim library now offers the 'Paragraph Vector' [1] approach to create vectors for longer ranges of text. It's available in a class named Doc2Vec – but I don't think that's what is being used here.

The Paragraph Vector approach can give interesting results for document-similarity, including similarity after certain 'algebraic'-like additions/subtractions of other topics/word-concepts. [2]

[1] http://arxiv.org/abs/1405.4053

[2] http://arxiv.org/abs/1507.07998

jmportilla (OP) · 10y ago

I used the gensim Doc2Vec implementation. You can check out some of the source code here: https://github.com/jmportilla/Reddit2Vec

utunga · 10y ago

Hi... Thanks for that. Awesome and much appreciated.

sdrothrock · 10y ago · 1 in thread

This is pretty neat, but the biggest problem for me is the case sensitivity; reddit itself doesn't use case sensitivity, so it's hard to remember the exact capitalization of a subreddit name.

jmportilla (OP) · 10y ago

Yeah, I know it's super annoying. Hopefully I'll have time to update the model with lowercase names sometime next week.
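An alternative to retraining with lowercase names is to resolve the user's input against the existing labels at query time. A minimal sketch, assuming the model's subreddit labels are available as a list of strings; `build_case_index` and `resolve` are hypothetical helper names, not part of the Reddit2Vec code.

```python
def build_case_index(names):
    """Map each case-folded name to the first canonical spelling seen."""
    index = {}
    for name in names:
        index.setdefault(name.casefold(), name)
    return index


def resolve(name, index):
    """Return the canonical spelling for `name`, or None if it is unknown."""
    return index.get(name.casefold())


# Hypothetical label list; the real one would come from the trained model.
index = build_case_index(["AskReddit", "pcmasterrace", "Python"])
print(resolve("askreddit", index))
print(resolve("PCMasterRace", index))
print(resolve("nosuchsub", index))
```

This keeps the trained labels untouched and only normalizes what the user types, so it could be dropped in front of the lookup without touching the model.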

Yadi · 10y ago · 1 in thread

Awesome seeing someone use the reddit dataset :)!

Wouldn't w2v as a recommender for the user have been better?

Taking user's comments/likes/subreddits as a feature.

jmportilla (OP) · 10y ago

I think you're thinking of a classic collaborative filtering recommendation system. A simple w2v system would take into account all words, and its results would then have to be filtered down to words that match subreddit names. Although I may have misunderstood your suggestion.
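The filtering step described here could look roughly like the sketch below. It assumes the word-level model returns a ranked list of `(word, score)` pairs and that a list of subreddit names is on hand; the function name and the sample data are made up for illustration.

```python
def keep_subreddits(neighbours, subreddit_names, topn=5):
    """Keep only ranked (word, score) pairs whose word is a subreddit name."""
    known = {name.casefold() for name in subreddit_names}
    hits = [(word, score) for word, score in neighbours if word.casefold() in known]
    return hits[:topn]


# Hypothetical nearest-neighbour output from a word-level model.
neighbours = [("kitten", 0.91), ("aww", 0.88), ("fluffy", 0.80), ("cats", 0.77)]
print(keep_subreddits(neighbours, ["aww", "cats", "funny"]))
```

The ranking is left as the model produced it; only non-subreddit words are dropped.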

haxiomic · 10y ago · 1 in thread

Nice idea :), works well. Spotted a small typo in the examples:

pcmasterace+mac should be pcmasterrace+mac (missing an r)

jmportilla (OP) · 10y ago

Thanks! I'll fix it

joelthelion · 10y ago

Very cool. Little tip: use "-funny" to get high-quality subs :)

riffraff · 10y ago

neat, I'd suggest considering spaces as "+" i.e. "cats awww" should be the same as "cats+awww" I guess :)
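One way to treat spaces as "+" while still supporting "-" terms such as "-funny" is to tokenize the query with a small regular expression. This is a sketch under assumptions: `parse_query` is a hypothetical helper, the site's actual parser is not shown in the thread, and subreddit names are assumed to consist of letters, digits, and underscores.

```python
import re

# An optional sign, optional whitespace, then a name.
TERM = re.compile(r"([+-]?)\s*([A-Za-z0-9_]+)")


def parse_query(query):
    """Split a query into (positive, negative) lists of subreddit names.

    A term with no sign, or with "+", is positive; a term with "-" is negative.
    """
    positive, negative = [], []
    for sign, name in TERM.findall(query):
        (negative if sign == "-" else positive).append(name)
    return positive, negative


print(parse_query("cats awww"))
print(parse_query("cats+awww"))
print(parse_query("pcmasterrace+mac -funny"))
```

The two lists map naturally onto the `positive` and `negative` arguments that gensim's `most_similar` accepts, though whether the site wires its queries up that way is an assumption.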
