[MUSIC] Hello everyone. Welcome to Big Data and Language. In the previous lecture, we learned about lemmas, and this time let's talk about the keyness of a word. So are you ready? Let's get started.
The keyness of a word is an indicator of the word's importance in a corpus. It can be computed using chi-square or log likelihood. Let me give you an example. Let's assume that in corpus A, orange appears 9 times, apple 35 times, and banana 20 times, whereas in corpus B, orange appears 30 times, apple 15 times, and banana 21 times. Based on these two corpora, the keyness values of orange and apple should be greater than that of banana, because their frequencies differ sharply between the two corpora while banana's frequency barely changes. You might be curious how I get these keyness values, so let me explain in more detail, step by step.
As I said, we can calculate the keyness of a word by chi-square or log likelihood, so let's talk about chi-square first. Chi-square measures the significance of the difference between the observed and the expected frequency of a word. The higher the value, the more significant the difference, and the more important the word. In the equation, O_i is the observed frequency of a word and E_i is the expected frequency of a word. Let's work through the calculation one step at a time.
Here is the example again. In corpus A, orange appears 9 times, apple 35 times, and banana 20 times. These are the words, and the numbers are their frequencies. In corpus B, orange appears 30 times, apple 15 times, and banana 21 times. Let's start with the observed value table for orange. The expected frequency rate of orange is 39/130, and the expected frequency rate of the other words is 91/130, because orange appears 39 times in the two corpora combined, the other words appear 91 times, and the total is 130. So in corpus A, which has 64 words in total, E_orange is 64 x 39/130 and E_other is 64 x 91/130.
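For reference, the chi-square calculation described here can be written out as below. This is a sketch that assumes the equation on the slide is the standard Pearson chi-square statistic; the symbols R_i, C_i, and N (the corpus total, the word total, and the grand total for cell i) are labels introduced here, not taken from the lecture.

```latex
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},
\qquad
E_i = \frac{R_i \, C_i}{N}

% Expected values for corpus A in the orange example:
E_{\text{orange},A} = 64 \times \frac{39}{130} = 19.2,
\qquad
E_{\text{other},A} = 64 \times \frac{91}{130} = 44.8
```

The sum runs over the four cells of the table: orange and other words, in corpus A and in corpus B.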
You might want to pause here to study the table on the screen, and you might also want to do the calculation yourself. Again, in corpus A orange appears 9 times and the other words 55 times, so the total for corpus A is 64, whereas in corpus B orange appears 30 times and the other words 36 times, so the total is 66. The 130 is 64 from corpus A plus 66 from corpus B. The total for orange across corpus A and B is 39, and the total for the other words is 91, which is 55 plus 36. Once you have those numbers, you apply the equation to get the expected values.
Now let's look at the expected value table for orange. In corpus A, the expected value for orange is 19.2 and for the other words 44.8, so the total is 64. In corpus B, the expected value for orange is 19.8 and for the other words 46.2, so the total is 66. Again, the overall total is the same, 130.
So how do you get the chi-square value of orange? Take (9 - 19.2), where 19.2 is the expected value, square it, and divide by 19.2. Then add (55 - 44.8) squared divided by 44.8, plus (30 - 19.8) squared divided by 19.8, and finally (36 - 46.2) squared divided by 46.2. Add those four values and you get the chi-square value of orange, which is about 15.24. You can compute the chi-square values for apple and banana in the same way. With the same equation, you will find that the chi-square value of apple is 14.02 and the chi-square value of banana is about 0.0048. So orange and apple are the more significant words in these corpora, and the small difference for banana may be just an accident, because 0.0048 is a very small value. So that is chi-square.
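The hand calculation above can be checked with a short script. This is a minimal sketch, not the lecturer's own tool: the function name `chi_square_keyness` and the dictionary layout are assumptions made for illustration, and the counts are the ones from the fruit example.

```python
def chi_square_keyness(freq_a, freq_b, total_a, total_b):
    """Chi-square for one word versus all other words in two corpora."""
    grand_total = total_a + total_b
    word_total = freq_a + freq_b
    other_total = grand_total - word_total

    # Observed counts: word and other words, in corpus A and in corpus B.
    observed = [freq_a, total_a - freq_a, freq_b, total_b - freq_b]

    # Expected count = corpus total x (word total / grand total).
    expected = [
        total_a * word_total / grand_total,
        total_a * other_total / grand_total,
        total_b * word_total / grand_total,
        total_b * other_total / grand_total,
    ]

    # Sum (O - E)^2 / E over the four cells.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


corpus_a = {"orange": 9, "apple": 35, "banana": 20}
corpus_b = {"orange": 30, "apple": 15, "banana": 21}
total_a = sum(corpus_a.values())  # 64
total_b = sum(corpus_b.values())  # 66

chi_square = {
    word: chi_square_keyness(corpus_a[word], corpus_b[word], total_a, total_b)
    for word in corpus_a
}

for word, value in chi_square.items():
    print(f"{word}: {value:.4f}")
```

Orange and apple come out far above banana, whose value stays below 0.01.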
If you need more time, you might want to stop, go back, and review this part of the lecture again. But if you understand chi-square well, let's move to the second tool, which is log likelihood. Log likelihood is a measure similar to the chi-square test, and it provides a better estimate for low-frequency terms. So depending on the frequency of your target words, you might want to use either chi-square or log likelihood. Again, O_i is the observed frequency of a word and E_i is the expected frequency of a word, but this time the calculation does not take the other words into account. This is the equation for log likelihood.
Let's apply it to the same example corpora. Again, in corpus A, orange appears 9 times, apple 35 times, and banana 20 times. In corpus B, orange appears 30 times, apple 15 times, and banana 21 times. Again, you can build the expected values table. In corpus A, the expected value for orange is 19.2 out of a total of 64. In corpus B, the expected value for orange is 19.8 out of a total of 66. Across corpus A and B, orange appears 39 times and the total is 130. The log likelihood value of orange is a slightly more complex calculation, but you can use a calculator or any other tool, and you will find that the value is 11.29.
Now, can you compute the log likelihood values for apple and banana? It is the same equation, so if you understand how we found the log likelihood value of orange, you can find the values for apple and banana as well. You might want to pause here and calculate them. Here is the answer: the log likelihood value of apple is 8.86, and the log likelihood value of banana is 0.003. So, as a result, orange and apple are again more significant than banana.
Now let's talk about how keyness is used. We compare the top keywords to mark the difference between two corpora.
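Before moving on, the log likelihood values for the fruit example can be checked with a short script. This is a sketch under one assumption: that the equation on the slide is 2 x the sum of O_i x ln(O_i / E_i), taken over the target word's two cells only (the other words are left out, as the lecture says). That form reproduces the values quoted in the lecture. The function name `log_likelihood_keyness` is hypothetical.

```python
import math


def log_likelihood_keyness(freq_a, freq_b, total_a, total_b):
    """Log likelihood for one word, using only that word's two cells."""
    grand_total = total_a + total_b
    word_total = freq_a + freq_b

    # Expected count of the word in each corpus.
    expected_a = total_a * word_total / grand_total
    expected_b = total_b * word_total / grand_total

    total = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


corpus_a = {"orange": 9, "apple": 35, "banana": 20}
corpus_b = {"orange": 30, "apple": 15, "banana": 21}
total_a = sum(corpus_a.values())  # 64
total_b = sum(corpus_b.values())  # 66

log_likelihood = {
    word: log_likelihood_keyness(corpus_a[word], corpus_b[word], total_a, total_b)
    for word in corpus_a
}

for word, value in log_likelihood.items():
    print(f"{word}: {value:.3f}")
```

The ranking is the same as with chi-square: orange first, then apple, with banana close to zero.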
For example, you can use AntConc to compare a text with a comparison corpus. In this result, the word size has a frequency of 3466 and a keyness of 7140, and the word please has a frequency of 4468 but a keyness of 6122. So the word size is more significant in the given text despite its lower frequency: its frequency, 3466, is lower than that of please, 4468, but its keyness, 7140, is higher than the 6122 of please. So the keyness value tells you which word is more significant in the given text, regardless of raw frequency.
Keyness is also interpreted as a hypothesis test, using a p-value. Using a published table of critical values, the p-value threshold can be 0.1, 0.05, or 0.01, and the keyness value that corresponds to 0.01 is 6.63. Depending on the value, you can say whether a word is significantly important in the given text or not. Keyness also tells you whether the frequency of a word is accidental. What I mean is that the keyness value for orange was 15.24, which is bigger than 6.63, so we can say that its frequency is statistically not an accident. However, the keyness value of banana was about 0.0048, which is way lower than 6.63, so we can conclude that its frequency in the corpus may be due to chance.
Let me summarize. The critical cutoff point for statistical significance is usually a p-value lower than 0.01, though it can also be a p-value lower than 0.05, depending on your research question or your research design. A chi-square value above 6.63 is considered significant. If a word appears just by chance, the chi-square value is small.
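The cutoff rule can be sketched as a small lookup. The 6.63 threshold is the one given in the lecture; the 3.84 and 2.71 thresholds for p = 0.05 and p = 0.1 are the standard chi-square critical values for one degree of freedom and are added here as an assumption about the table being used. The function name `significance_level` is hypothetical, and the banana value is the chi-square that the example counts give (about 0.0049).

```python
# (p-value threshold, chi-square critical value for 1 degree of freedom),
# ordered from the strictest threshold to the loosest.
CRITICAL_VALUES = [(0.01, 6.63), (0.05, 3.84), (0.10, 2.71)]


def significance_level(keyness):
    """Return the strictest p threshold the keyness value passes, or None."""
    for p_value, critical in CRITICAL_VALUES:
        if keyness > critical:
            return p_value
    return None


# Chi-square keyness values from the fruit example.
keyness_values = {"orange": 15.24, "apple": 14.02, "banana": 0.0048}

for word, value in keyness_values.items():
    level = significance_level(value)
    if level is None:
        print(f"{word}: {value} is not significant; may be due to chance")
    else:
        print(f"{word}: {value} is significant at p < {level}")
```

Which threshold to report depends on the research design, as the lecture notes; the function simply reports the strictest one that is passed.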
So now you know how to interpret the chi-square value. Today we've talked about the keyness of a word. Next time, let's talk about collocations. Thank you for your attention.