What are the Hidden Communities of Reddit?

(cs.utexas.edu)

194 points · eli_awry · 13y ago · 45 comments


gurkendoktor · 13y ago

OT - both Safari (w/o Flash) and Google Chrome max out all CPU cores as long as this site is open. The visualisation might need an upper limit on the work it is doing per second...

mikegioia · 13y ago

I'm on Chrome without Flash (Ubuntu) and I took a screenshot of Chrome using 172% of the CPU!

eli_awry (OP) · 13y ago

Here's a copy of the post with just text and screenshots: http://www.cs.utexas.edu/~elie/noscriptnetworks.html .

I developed this on Chrome on a Mac and Chromium on Ubuntu, and it worked on both of those. Sorry it's giving you problems.

EvilTerran · 13y ago

It made my Firefox / Windows XP / creaky old laptop freak out, too. Might putting the interactive bit on a separate page to the article be in order, so those of us without the specs to play with the interactive bit can still readily read the article?

shrikant · 13y ago

Firefox 19 on Windows 7, and the page sails by just fine.

I would've hazarded a guess at a WebKit bug, but OP mentioned she developed this on Chrome.

oinksoft · 13y ago

It may "sail fine" but sure left my fans howling too. I don't think this is a browser thing.

hosay123 · 13y ago

Not even a hint of stress on Firefox/OSX

DanBC · 13y ago

I'm not sure what "hidden" means in the title. See, eg, (http://www.reddit.com/r/proana). There are a bunch of these closed groups.

The author's work seems really useful for detecting spam. There are some people / bots who post a lot of specialist content. They only ever post links to content on domains that pay when visitors click links. These domains have a lot of ads. There's no other interaction on the site.

_NOT SAFE FOR WORK_:

This user (http://www.reddit.com/user/walfa2) only posts content from sites which pay when viewers see the images. The domains have heavy ad content, with popups etc.

Here's an example domain:

(http://www.reddit.com/domain/img1.picfoco.com/)

Once you find one user you can find a bunch of these domains, and the other users posting to those domains, and thus find a few more domains.

With a bit of tinkering you could show a colour-coded chart of spam domains; of users that only post content from those domains; and of users that never make replies but only make top-level comments.

That could be run once a week and (with human oversight) used to remove content which is not good for reddit.
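The "find one user, then their domains, then the other users on those domains" loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `posts` is a hypothetical list of `(user, domain)` pairs you have already collected, and the `min_overlap` cut-off is an invented heuristic (not something from the comment) to keep ordinary users who once linked a flagged domain from dragging in every site they have ever posted. The output is a list of candidates for a human to review, not a verdict.

```python
from collections import defaultdict


def expand_from_seed(posts, seed_user, min_overlap=0.5, rounds=3):
    """Grow a set of suspect users and domains starting from one known user.

    posts       -- iterable of (user, domain) pairs (hypothetical input format)
    min_overlap -- assumed heuristic: a user is only added if at least this
                   fraction of the domains they post is already flagged
    """
    domains_by_user = defaultdict(set)
    users_by_domain = defaultdict(set)
    for user, domain in posts:
        domains_by_user[user].add(domain)
        users_by_domain[domain].add(user)

    users = {seed_user}
    domains = set()
    for _ in range(rounds):
        # Step 1: every domain posted by a flagged user becomes flagged.
        for user in users:
            domains |= domains_by_user.get(user, set())
        # Step 2: look at everyone who posted to a flagged domain...
        candidates = set()
        for domain in domains:
            candidates |= users_by_domain.get(domain, set())
        # ...but only keep those who post mostly to flagged domains.
        for user in candidates:
            own = domains_by_user[user]
            if len(own & domains) / len(own) >= min_overlap:
                users.add(user)
    return users, domains


# Made-up example data; none of these names are real accounts or sites.
posts = [
    ("seed", "d1.example"), ("seed", "d2.example"),
    ("bot2", "d1.example"), ("bot2", "d3.example"),
    ("regular", "d1.example"), ("regular", "news.example"),
    ("regular", "blog.example"), ("regular", "video.example"),
]
suspect_users, suspect_domains = expand_from_seed(posts, "seed")
```

Here `bot2` is picked up through the shared domain and contributes a third domain, while `regular` stays out because only a quarter of their domains are flagged.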

eli_awry (OP) · 13y ago

That's a cool idea.

In the original title, 'Hidden' was 'Latent' - communities that de facto exist even though they are not explicit. 'Latent' would have been a more precise word, but 'hidden' is more accessible.

true_religion · 13y ago

It's not entirely obvious that posting from a domain that incentivizes traffic is a bad thing.

If the posts are upvoted by the community, then it should be seen as a good and not a negative.

One of the oddities of reddit as compared to other social sites is that content owners and traffickers are looked down upon simply because they can profit from attention.

DanBC · 13y ago

> It's not entirely obvious that posting from a domain that incentivizes traffic is a bad thing.

I agree. To me it's not a problem. But unfortunately some of these posts leak into unsuitable subreddits. In a NSFW porn subreddit it's not much of a problem when someone links to a site heavy with horrible porn ads; but when that link is posted to a non-porn subreddit it's more of a problem.

And, really, Reddit is better as a community rather than a dump for links. So people who have no interaction with the site other than dumping links can be a problem. Being paid when people visit those links can make them more of a problem. They have no interest in Reddit.


k3n · 13y ago

> If the posts are upvoted by the community, then it should be seen as a good and not a negative.

The problem with this thinking is that vote fraud runs rampant and is sometimes nearly impossible to detect.

NZ_Matt · 13y ago

I vaguely remember several years back Reddit added the option for users to allow their subreddit and votes data to be used for research purposes with the hope of building a recommendation engine similar to this. Does anyone know if anything came from that? It would be great if the dataset was publicly available.

Edit: Here are the original threads, I don't think the project got very far. http://www.reddit.com/r/announcements/comments/ddz0s/reddit_...

http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_hel...

naner · 13y ago

The option is opt-in (which is good, the fickle reddit community would revolt otherwise) which means almost nobody uses it. If Reddit would remind their users frequently (e.g. at the end of popular Reddit Blog posts as an aside) or reward people for enabling the option (free Reddit Gold for a week, etc.) I'm sure many more people would sign up.

EDIT: (Sorry for all the parentheticals.)

dmix · 13y ago

Why not just anonymize the user data part?

There are startups selling health data this way, I don't think it would be so bad for subreddit subscription data.


eli_awry (OP) · 13y ago

I didn't look very far into that, but I think that the issue was that subreddit data wasn't included in the vote dump they did. I actually only heard that secondhand - I could be mistaken.

I just used post and comment histories, which suited my purposes fairly well because the larger project was looking into how memes spread.

wting · 13y ago

Took me a while to find it:

http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_hel...

Here's another version:

https://archive.org/details/2010-reddit-research

1wheel · 13y ago

Really cool! Couple of comments:

1. I'm assuming you downloaded comment threads from the front page of each of the subreddits you looked at and then looked at the subreddits each of the posters had commented in. How many requests did you end up making?

2. Did you hand select the subreddits you analysed? If so, what criteria were you looking for?

3. Have you thought about doing any more research into this area? I made http://redditgraphs.com/ and was looking into ways of guessing a user's age & gender based on their commenting history. I found some papers about similar sites:

twitter: http://www.aclweb.org/anthology-new/D/D11/D11-1120.pdf

blogspot: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136...

youtube: http://static.googleusercontent.com/external_content/untrust... (This one looks the most promising; using their methods, treat subreddits as youtube videos to create more accurate profiles of communities and users. They also examine the propagation of speech patterns which capture the spread of some memes.)

Unfortunately, reddit doesn't have user profiles or name-like user names (so there isn't an easily available training set) and I was having difficulties organizing and analyzing the large amount of data I was downloading, so I put the project aside. There has been basically no research done specific to reddit (http://scholar.google.com/scholar?as_ylo=2008&q=reddit+d...) which is surprising to me because of its size and unique subreddit system.

4. If you want to examine the spread of memes, you need access to old threads. http://stattit.com/ is the best way of getting around the reddit API's limit of the 1000 most recent posts.

5. Last month, a similar data set (which only looked at reddit) was collected - I think you're trying to do something different and your presentation is much better, but you might be interested in the discussion: http://www.reddit.com/r/TheoryOfReddit/comments/126pth/scrap...

eli_awry (OP) · 13y ago

I looked at about 60,000 distinct users. But you're right about my overall strategy. I chose all of the subreddits with over some number of subscribers (I forget what the number was now). I ended up with 433 subs. I filtered out the current default subreddits from this visualization.

One thing I was wondering in terms of reddit research - have you looked into this at all - is that they have users check a specific box if they are ok with their voting data being used for research - even if it's already public. My question then is this - is it somehow wrong to use (already-public) data for research? Anyway, I talk about my original aims for the project in some other comments.

Thanks for the link to stattit. My strategy for getting enough threads for my other project was just to keep a slow scraper running for a month and then go back to it - stattit will be incredibly helpful.

1wheel · 13y ago

> One thing I was wondering in terms of reddit research - have you looked into this at all - is that they have users check a specific box if they are ok with their voting data being used for research - even if it's already public. My question then is this - is it somehow wrong to use (already-public) data for research? Anyway, I talk about my original aims for the project in some other comments.

Based on the dozens (at least) of papers published each year that use Twitter data, I'm pretty sure it's kosher to use public posts. You might want to double check with your IRB though. Depending on how you present the information, some users might be concerned about their privacy - I wrote a bot that replied to people posting variations of 'your comment history' with a link to the referenced person's redditgraph and several people said they were creeped out by it (a little more here, if you're interested: http://www.roadtolarissa.com/redditgraphs-retrospective/).

Depending on what you are looking for, the rate limit might slow you down a lot; you might want to contact the site admins:

> tl;dr If you need old data, we'd much rather work out a way to get you a data dump than to have you scrape.

https://groups.google.com/forum/?fromgroups=#!topic/reddit-d...

tansey · 13y ago

Does stattit have an archive that one can download or do they just provide the high-level summary stats shown on the site?

1wheel · 13y ago

I don't think so; the creator /u/Deimorz is pretty cool, you could try asking him.

mahesh_rm · 13y ago

Isn't r/WTF missing from this picture?

eli_awry (OP) · 13y ago

Indeed. I took out all of the default subreddits because they added too much noise - everyone starts out subscribed to all of them.

corin_ · 13y ago

Can you not just flip the switch and get equally useful data - don't look for people who subscribe to a default, but to those who unsubscribe from it?


dmix · 13y ago

I'd be curious to see the connection between politics/economics and other subreddits.

Such as what subreddits are /r/ liberals, conservatives, libertarians, anarchists, etc likely to follow?

Are liberals commonly in /r/trees? Are libertarians big on /r/economics? Are conservatives avoiding /r/wtf and /r/trees?

eli_awry (OP) · 13y ago

Those subs weren't really active enough for me to get significant data in the time I was scraping (Reddit doesn't let you go back very far). Great idea for a future post though.


razkul · 13y ago

Awesome data. Really interesting to look at, and great presentation.

But there are a few things that kinda bother me with this:

The problem I can find with this data is that it isn't a representation of the reddit hidden communities as a whole, just the hidden communities of those who actually post (only 20% of Reddit).

A question I have is whether these are two-way connections between the groups. It's not 100% clear how the analysis is done (perhaps I missed this portion), but could connections between subreddits be generated by a lot of people who post in a very tiny subreddit also posting in a larger subreddit? That would mean that someone who likes Large Subreddit A may not like the more specific Subreddit B, even though a lot of people who like Subreddit B also like Subreddit A.

eli_awry (OP) · 13y ago

I combatted this in two ways - first, I only looked at the top 433 subreddits.

Second, the edge weights are normalized. There are always going to be the same people cross-subscribed between A and B as between B and A. This graph is not of the number of people cross-subscribed between two subreddits - it's of the sum (number of people cross-subscribed)/(users in A) + (number of people cross-subscribed)/(users in B). So if a lot of people in a tiny subreddit A are cross-subscribed, the pair gets a big boost from the first term, but almost no boost from the second if they make up a tiny sliver of subscribers to subreddit B.
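A minimal sketch of that weighting, assuming each subreddit is represented as a plain set of user names (the function name and the example sets are made up for illustration; the source only gives the formula):

```python
def edge_weight(users_a, users_b):
    """shared/|A| + shared/|B| for two sets of users; 0.0 if either is empty."""
    if not users_a or not users_b:
        return 0.0
    shared = len(users_a & users_b)
    return shared / len(users_a) + shared / len(users_b)


# Hypothetical example: a 4-user subreddit entirely contained in a 1000-user one.
tiny = {"u1", "u2", "u3", "u4"}
large = {f"u{i}" for i in range(1, 1001)}

weight = edge_weight(tiny, large)  # 4/4 + 4/1000
```

The first term contributes 1.0 because every user of the tiny subreddit is shared, while the second contributes only 0.004, which matches the "big boost from the first term, almost no boost from the second" description. The weight is symmetric, so it says how strongly the pair is tied together, not which direction the interest flows.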

the_cat_kittles · 13y ago

This is one of only a handful of Graphviz-esque visuals that ACTUALLY convey information effectively, as far as I have seen. Not to mention it is really interesting info! Nice work!

msds · 13y ago

I did a similar thing with all of the departments of the UW: http://www.sorens.in/posts/2012-8-11-uw-courses

Kluny · 13y ago

Insanely fascinating. Keep working and adding more graphs and stuff. Everyone is going to look for their favorite subreddit first, then see how common it is for members of that subreddit to be into the other things they are into.

For instance, I usually read /r/bicycles, but also programming, motorcycles, cars, and 2xc. How many other people have that unique mix of interests?

TGJ · 13y ago

The bottom interactive graph is kinda neat. Setting zero friction and minimal spring tension and gravity center turns the whole thing into a spheroidal structure much like the accretion of objects in space.

toadi · 13y ago

Good work on the visualization of the data. Take a look at http://www.datapointed.net/visualizations/ - his visuals are superb.

skadamat · 13y ago

I go to UT and am on the FAI newsletter and totally get your emails!

rhizome · 13y ago

Adrian Chen thanks you.

jrochkind1 · 13y ago

Hi Eli, neat work!
