But it seems to me that the author is falling into a trap many an unwary data "scientist" falls into: not understanding the discipline of Statistics.
When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.
If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.
No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of, because we have the entire population (in this specific instance, ALL the words spoken by all the characters).
FYI, all budding data "scientists" ...
By the way, we definitely don't hear all the words these characters speak in their lives. It's implied in the story that there are conversations we don't get to see in the actual episodes; these imaginary characters nevertheless speak a lot more than what airs. For example, we don't see each and every breakfast, lunch, and dinner discussion, and we don't hear all their words in the classroom, etc.
Now of course the sampling isn't random, because the creators obviously "select" the more interesting bits of the characters' lives, but in statistics we always make assumptions that simplify the procedure but are known to be technically wrong.
Treating his population as a large sample of a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.
[1]: http://andrewgelman.com/2009/07/03/how_does_statis/
[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)
It's the entire population of wars meeting certain criteria in that time frame. If that is the topic of interest, then it is also the whole population. OTOH, datasets like that are often used in analyses intended to apply to, for instance, "what-if" scenarios about hypothetical wars that could have happened in that time frame. In that case the studied population is clearly not the population of interest, but is taken to be a representative sample of a broader population -- though there may be specific reasons to criticize that in specific cases, for reasons other than "it's the whole population, not a sample."
The old-school interpretation is stricter and considers both the "population" and the "sample" to be physical real things. It's understandable because these methods were developed for statistics about human populations (note the origin of the terminology), medical studies etc. (The word "statistics" itself derives from "state").
Somehow, frequentist statisticians are usually very conservative and set in one way of thinking and do not even like to entertain an alternative interpretation or paradigm... I'm not sure why it is so.
For your reasoning to be applicable here, you have to put together a model of the data generating process from which you can derive a proper model that allows inference. What exactly are the assumptions on P( word_i | character_j ) that make it compatible with these particular tests' assumptions?
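For context, the test under discussion appears to be the log-likelihood ratio (G2) on a 2x2 contingency table (word vs. not-word, character vs. everyone else), which implicitly assumes every token is an independent draw from a speaker-specific multinomial. A minimal sketch of that statistic -- the function name and all counts are my own invention for illustration:

```python
import math

def g2(a, b, c, d):
    """Log-likelihood ratio for a 2x2 table:
    a = target word count in character's speech, b = their other tokens,
    c = target word count in rest of corpus,    d = other tokens elsewhere.
    Assumes tokens are i.i.d. draws from a per-speaker multinomial."""
    def ll(k, n, p):
        # binomial log-likelihood of k occurrences in n trials at rate p
        return k * math.log(p) + (n - k) * math.log(1 - p) if 0 < p < 1 else 0.0
    p = (a + c) / (a + b + c + d)   # pooled rate under the null hypothesis
    p1 = a / (a + b)                # rate in the character's speech
    p2 = c / (c + d)                # rate in everyone else's speech
    return 2 * (ll(a, a + b, p1) + ll(c, c + d, p2)
                - ll(a, a + b, p) - ll(c, c + d, p))

# invented counts: a word used 50 times in 10,000 tokens by one character,
# but only 5 times in 100,000 tokens by everyone else
score = g2(50, 9950, 5, 99995)
```

The i.i.d.-multinomial assumption is exactly what the parent is questioning: scripted dialogue is neither independent across tokens nor stationary across episodes.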
I was working under the assumption that we do not know ALL the words since the show's been renewed through 2019. This covers the first 18 seasons.
Additionally, counting up their most frequent words produced results with very little semantic meaning - things like "just" and "dont" - which can be seen in this (really boring) wordcloud: https://github.com/walkerkq/textmining_southpark/blob/master...
Looking into the log likelihood of each word for each speaker produced results that were much more intuitive and carried more meaning, like ppod said below: I think the idea is that what we are really trying to measure is something unobservable like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking.
Yes, there will be future episodes, but you are not claiming that you are predicting what these characters will say in those future episodes (in which case your whole setup is rather inappropriate).
Also, I suggest you think very hard about this statement:
> The log likelihood value of 101.7 is significant far beyond even the 0.01% level, so we can reject the null hypothesis that Cartman and the remaining text are one and the same.
Even if the statistical test you employed were appropriate, this is not the conclusion you draw from it.
Also, are you confusing the 0.01% level with p = 0.01 (the 1% level), or did you really choose p = 0.0001 as the significance level for your test?
I think that is what parent is implying.
From the text, the author is performing statistical testing (chi-sq) to find which words are most unique to a character, not which words they say the most (although the two metrics are somewhat correlated).
Once again, "words that are most unique" to a character is a quantity that can easily be computed from the set of ALL words with no sampling uncertainty because, yes, we have the population.
10,3,Brian,"You mean like the time you had tea with
Mohammad, the prophet of the Muslim faith?
Peter:
Come on, Mohammad, let's get some tea.
Mr. T:
Try my ""Mr. T. ...tea.""
"
Well, it seems people are not understanding the problem with this line. Here is the screenshot of the original script: http://imgur.com/pcu5N2U Brian: You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? [flashback #3]
Peter: Come on, Mohammad, let's get some tea. [Mohammad is covered by a black box with the words "IMAGE CENSORED BY FOX" printed several times from top to bottom inside the box. They stop at a tea stand.]
Mr. T: Try my "Mr. T. ...tea." [squints]
There, three characters speak. However, R's read.csv will assign all three characters' speech to Brian: http://imgur.com/gLpPKdl
> x[596, ]
Season Episode Character
596 10 3 Brian
Line
596 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \n
> x[597,]
Season Episode Character
597 10 3 Brian
Line
597 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \nMr. T:\nTry my "Mr. T. ...tea." \n
as well as seemingly duplicating part of the conversation. PS: In addition, both Muhammad and Mohammad appear, presumably under-counting the references to the prophet.
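One way to see (and patch) the problem from Python: the stage directions were stripped but the other speakers' dialogue was left inside Brian's quoted Line field, so a correct CSV parser still attributes it all to Brian. A hedged sketch -- the row is reconstructed by hand and the "Name:" regex is a guess at the script's convention, not a general fix:

```python
import csv
import io
import re

# a row shaped like the broken one from season 10, episode 3
# (reconstructed by hand; the repository's exact bytes may differ)
raw = ('Season,Episode,Character,Line\n'
       '10,3,Brian,"You mean like the time you had tea with Mohammad, '
       'the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, '
       "let's get some tea. \n\"\n")

rows = list(csv.DictReader(io.StringIO(raw)))
line = rows[0]["Line"]  # embedded newlines survive inside the quoted field

# split the field back into (speaker, text) pairs on "Name:" marker lines
parts = re.split(r'\n([A-Z][\w. ]*):\n', line)
speakers = [(rows[0]["Character"], parts[0].strip())]
speakers += [(parts[i], parts[i + 1].strip())
             for i in range(1, len(parts) - 1, 2)]
```

With this row, `speakers` recovers Brian's sentence and Peter's separately instead of crediting both to Brian.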
The data sources are CSVs in this repository: https://github.com/BobAdamsEE/SouthParkData/
Looks like all the data is preprocessed, with everyone mostly having only 1 line. (Actually, it appears the line you note in 10-3 is broken!) You can make an argument that the script isn't processed correctly, but that's beyond the scope of the analysis, although a note might be helpful.
This also strongly implies you think the author is a 'budding data scientist' out of his/her league.
This is very much a 'sample' given the context that South Park is still releasing new episodes.
FYI all elitist 'statisticians' ...
As far as I can tell, there are a lot of people out of their leagues going around with the title "data scientist".
This is not a sample. This is a census at this point in time. The fact that there will be another population tomorrow does not change the fact that you have the entire population of all words spoken by all characters up to today.
I am not a statistician. I am an economist who knows enough about statistics and econometrics to know when a significance test is applicable.
Also, do note the issue that R's CSV parsing is going to mis-attribute some characters' speech to others. GIGO speaks loudly.
What does that mean? Does he remove words that are only said once or twice?
Can anyone point me to a text explaining the difference between Identifying Characteristic Words using Log Likelihood and using tfidf. ?
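The short version of the difference: tf-idf is a heuristic weight (term frequency discounted by how many documents contain the term) with no sampling model and no notion of significance, whereas the log-likelihood approach compares observed vs. expected counts under a multinomial model and yields a test statistic. A toy contrast, with invented per-character counts:

```python
import math

# toy per-character token counts (numbers invented for illustration)
docs = {
    "cartman": {"screw": 8, "guys": 5},
    "stan":    {"guys": 4, "dude": 6},
    "kyle":    {"guys": 3, "dude": 2},
}

def tfidf(term, doc):
    # frequency in this character's speech, discounted by how many
    # characters use the term at all -- a heuristic weight, no p-value
    tf = docs[doc].get(term, 0)
    df = sum(1 for d in docs.values() if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0
```

Here a word everyone uses ("guys") scores zero regardless of frequency, while a word only one character uses scores high -- qualitatively similar to the log-likelihood ranking, but without any distributional assumptions to test against.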
# remove sparse terms (drop terms absent from more than 75% of documents)
all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215
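If I understand tm's removeSparseTerms correctly, the 0.75 argument drops every term whose sparsity exceeds 0.75, i.e. terms absent from more than 75% of the documents -- a document-frequency filter, not a filter on raw counts. The same filter sketched in Python, with invented data:

```python
# document-presence sets for a few terms (invented), over 5 "documents"
n_docs = 5
term_docs = {
    "guys": {"d1", "d2", "d3", "d4", "d5"},  # sparsity 0.0 -> kept
    "just": {"d1", "d2", "d3", "d4"},        # sparsity 0.2 -> kept
    "tea":  {"d1"},                          # sparsity 0.8 -> dropped
}

# keep terms whose sparsity (share of documents missing the term)
# does not exceed the 0.75 threshold, mirroring removeSparseTerms
kept = {t for t, present in term_docs.items()
        if 1 - len(present) / n_docs <= 0.75}
```

So a word said once or twice could survive if those few uses are spread across enough documents; what gets removed is whatever is concentrated in too few documents.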
I believe it corresponds to the tf-idf factor. It's also funny how Stan talks more than Kyle, given the show now has a recurring joke that makes fun of Kyle's long educational dialogues.
Error establishing a database connection
Does someone have a cached version, please?