But it seems to me that the author is falling into a trap many an unwary data "scientist" falls into: not understanding the discipline of Statistics.
When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.
If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.
No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of, because we have the entire population (in this specific instance, ALL the words spoken by all the characters).
FYI, all budding data "scientists" ...
By the way, we definitely don't hear all the words these characters speak in their lives. It's implied in the story that there are conversations we don't get to see in the actual episodes; these imaginary characters nevertheless speak a lot more than what airs. For example, we don't see each and every breakfast, lunch, and dinner discussion, and we don't hear all their words in the classroom, etc.
Now of course the sampling isn't random, because the creators obviously "select" the more interesting bits of the characters' lives, but in statistics we always make assumptions that simplify the procedure but are known to be technically wrong.
Treating his population as a large sample of a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.
[1]: http://andrewgelman.com/2009/07/03/how_does_statis/
[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)
It's the entire population of wars meeting certain criteria in that time frame. If that is the topic of interest, then it is also the whole population. OTOH, datasets like that are often used in analyses intended to apply to, for instance, "what-if" scenarios about hypothetical wars that could have happened in that time frame. In that case the studied population is clearly not the population of interest, but is taken to be a representative sample of a broader population -- though there may be specific reasons to criticize that in specific cases, for reasons other than "it's the whole population, not a sample."
The old-school interpretation is stricter and considers both the "population" and the "sample" to be physical real things. It's understandable because these methods were developed for statistics about human populations (note the origin of the terminology), medical studies etc. (The word "statistics" itself derives from "state").
Somehow, frequentist statisticians are usually very conservative and set in one way of thinking and do not even like to entertain an alternative interpretation or paradigm... I'm not sure why it is so.
For your reasoning to be applicable here, you have to put together a model of the data generating process from which you can derive a proper model that allows inference. What exactly are the assumptions on P( word_i | character_j ) that make it compatible with these particular tests' assumptions?
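For context, the test under discussion appears to be the log-likelihood ratio (G2) on a 2x2 contingency table (word vs. not-word, character vs. everyone else), which implicitly assumes every token is an independent draw from a speaker-specific multinomial. A minimal sketch of that statistic -- the function name and all counts are my own invention for illustration:

```python
import math

def g2(a, b, c, d):
    """Log-likelihood ratio for a 2x2 table:
    a = target word count in character's speech, b = their other tokens,
    c = target word count in rest of corpus,    d = other tokens elsewhere.
    Assumes tokens are i.i.d. draws from a per-speaker multinomial."""
    def ll(k, n, p):
        # binomial log-likelihood of k occurrences in n trials at rate p
        return k * math.log(p) + (n - k) * math.log(1 - p) if 0 < p < 1 else 0.0
    p = (a + c) / (a + b + c + d)   # pooled rate under the null hypothesis
    p1 = a / (a + b)                # rate in the character's speech
    p2 = c / (c + d)                # rate in everyone else's speech
    return 2 * (ll(a, a + b, p1) + ll(c, c + d, p2)
                - ll(a, a + b, p) - ll(c, c + d, p))

# invented counts: a word used 50 times in 10,000 tokens by one character,
# but only 5 times in 100,000 tokens by everyone else
score = g2(50, 9950, 5, 99995)
```

The i.i.d.-multinomial assumption is exactly what the parent is questioning: scripted dialogue is neither independent across tokens nor stationary across episodes.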
I was working under the assumption that we do not know ALL the words since the show's been renewed through 2019. This covers the first 18 seasons.
Additionally, counting up their most frequent words produced results with very little semantic meaning - things like "just" and "dont" - which can be seen in this (really boring) wordcloud: https://github.com/walkerkq/textmining_southpark/blob/master...
Looking into the log likelihood of each word for each speaker produced results that were much more intuitive and carried more meaning, like ppod said below: I think the idea is that what we are really trying to measure is something unobservable like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking.
Yes, there will be future episodes, but you are not claiming that you are predicting what these characters will say in those future episodes (in which case your whole setup is rather inappropriate).
Also, I suggest you think very hard about this statement:
> The log likelihood value of 101.7 is significant far beyond even the 0.01% level, so we can reject the null hypothesis that Cartman and the remaining text are one and the same.
Even if the statistical test you employed were appropriate, this is not the conclusion you draw from it.
Also, are you confusing the 0.01% level with p = 0.01 (the 1% level), or did you really choose p = 0.0001 as the significance level for your test?
I think that is what parent is implying.
From the text, the author is performing statistical testing (chi-sq) to find which words are most unique to a character, not which words they say the most (although the two metrics are somewhat correlated).
Once again, "words that are most unique" to a character is a quantity that can easily be computed from the set of ALL words with no sampling uncertainty because, yes, we have the population.
10,3,Brian,"You mean like the time you had tea with
Mohammad, the prophet of the Muslim faith?
Peter:
Come on, Mohammad, let's get some tea.
Mr. T:
Try my ""Mr. T. ...tea.""
"
Well, it seems people are not understanding the problem with this line. Here is the screenshot of the original script: http://imgur.com/pcu5N2U Brian: You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? [flashback #3]
Peter: Come on, Mohammad, let's get some tea. [Mohammad is covered by a black box with the words "IMAGE CENSORED BY FOX" printed several times from top to bottom inside the box. They stop at a tea stand.]
Mr. T: Try my "Mr. T. ...tea." [squints]
There, three characters speak. However, R's read.csv will assign all three characters' speech to Brian: http://imgur.com/gLpPKdl
> x[596, ]
Season Episode Character
596 10 3 Brian
Line
596 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \n
> x[597,]
Season Episode Character
597 10 3 Brian
Line
597 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \nMr. T:\nTry my "Mr. T. ...tea." \n
as well as seemingly duplicating part of the conversation. PS: In addition, both Muhammad and Mohammad appear, presumably under-counting the references to the prophet.
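One way to see (and patch) the problem from Python: the stage directions were stripped but the other speakers' dialogue was left inside Brian's quoted Line field, so a correct CSV parser still attributes it all to Brian. A hedged sketch -- the row is reconstructed by hand and the "Name:" regex is a guess at the script's convention, not a general fix:

```python
import csv
import io
import re

# a row shaped like the broken one from season 10, episode 3
# (reconstructed by hand; the repository's exact bytes may differ)
raw = ('Season,Episode,Character,Line\n'
       '10,3,Brian,"You mean like the time you had tea with Mohammad, '
       'the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, '
       "let's get some tea. \n\"\n")

rows = list(csv.DictReader(io.StringIO(raw)))
line = rows[0]["Line"]  # embedded newlines survive inside the quoted field

# split the field back into (speaker, text) pairs on "Name:" marker lines
parts = re.split(r'\n([A-Z][\w. ]*):\n', line)
speakers = [(rows[0]["Character"], parts[0].strip())]
speakers += [(parts[i], parts[i + 1].strip())
             for i in range(1, len(parts) - 1, 2)]
```

With this row, `speakers` recovers Brian's sentence and Peter's separately instead of crediting both to Brian.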
The data sources are CSVs in this repository: https://github.com/BobAdamsEE/SouthParkData/
Looks like all the data is preprocessed, with everyone mostly having only 1 line. (Actually, it appears the line you note in 10-3 is broken!) You can make an argument that the script isn't processed correctly, but that's beyond the scope of the analysis, although a note might be helpful.
This also strongly implies you think the author is a 'budding data scientist' out of his/her league.
This is very much a 'sample' given the context that South Park is still releasing new episodes.
FYI all elitist 'statisticians' ...
As far as I can tell, there are a lot of people out of their leagues going around with the title "data scientist".
This is not a sample. This is a census at this point in time. The fact that there will be another population tomorrow does not change the fact that you have the entire population of all words spoken by all characters up to today.
I am not a statistician. I am an economist who knows enough about statistics and econometrics to know when a significance test is applicable.
Also, do note the issue that R's CSV parsing is going to mis-attribute some characters' speech to others. GIGO speaks loudly.
What does that mean? Does he remove words that are only said once or twice?
Can anyone point me to a text explaining the difference between Identifying Characteristic Words using Log Likelihood and using tfidf. ?
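The short version of the difference: tf-idf is a heuristic weight (term frequency discounted by how many documents contain the term) with no sampling model and no notion of significance, whereas the log-likelihood approach compares observed vs. expected counts under a multinomial model and yields a test statistic. A toy contrast, with invented per-character counts:

```python
import math

# toy per-character token counts (numbers invented for illustration)
docs = {
    "cartman": {"screw": 8, "guys": 5},
    "stan":    {"guys": 4, "dude": 6},
    "kyle":    {"guys": 3, "dude": 2},
}

def tfidf(term, doc):
    # frequency in this character's speech, discounted by how many
    # characters use the term at all -- a heuristic weight, no p-value
    tf = docs[doc].get(term, 0)
    df = sum(1 for d in docs.values() if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0
```

Here a word everyone uses ("guys") scores zero regardless of frequency, while a word only one character uses scores high -- qualitatively similar to the log-likelihood ranking, but without any distributional assumptions to test against.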
# remove sparse terms (drop terms absent from more than 75% of documents)
all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215
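If I understand tm's removeSparseTerms correctly, the 0.75 argument drops every term whose sparsity exceeds 0.75, i.e. terms absent from more than 75% of the documents -- a document-frequency filter, not a filter on raw counts. The same filter sketched in Python, with invented data:

```python
# document-presence sets for a few terms (invented), over 5 "documents"
n_docs = 5
term_docs = {
    "guys": {"d1", "d2", "d3", "d4", "d5"},  # sparsity 0.0 -> kept
    "just": {"d1", "d2", "d3", "d4"},        # sparsity 0.2 -> kept
    "tea":  {"d1"},                          # sparsity 0.8 -> dropped
}

# keep terms whose sparsity (share of documents missing the term)
# does not exceed the 0.75 threshold, mirroring removeSparseTerms
kept = {t for t, present in term_docs.items()
        if 1 - len(present) / n_docs <= 0.75}
```

So a word said once or twice could survive if those few uses are spread across enough documents; what gets removed is whatever is concentrated in too few documents.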
I believe it corresponds to the tf-idf factor. It's also funny how Stan talks more than Kyle, given the show now has a recurring joke that makes fun of Kyle's long educational dialogues.
Error establishing a database connection
Does someone have a cached version, please?