[MUSIC] Hello everyone. Welcome to Big Data and Language. This week, so far, we've talked about lemmas, the keyness of a word, and collocations. Are you familiar with those terms? If not, please feel free to go back and review the lectures, because these terms, and the tools for analyzing linguistic features in text data, will be important for your final project at the end of this course. Okay? So, if you are ready, let me give you another term: n-grams. N-grams are very similar to collocations, but slightly different, so let's look at the similarities and differences one by one. Are you ready? Let's get started.

So, n-grams are similar to collocations, as I mentioned, but an n-gram is simply a continuous sequence of n words from spoken or written text. Because of that, n-grams are commonly used for word prediction. Let me give you the sub-terminology under n-grams. If we look at just one single word, we call it a unigram, okay? If we look at two words connected together, we call it a bigram. And three words together we call a trigram, and so on, right? So, let me give you an example with N = 1. If the sentence is "This is a sentence," then the unigrams are the single words: this, is, a, sentence. Okay? However, if you examine the bigrams, we put two words together in one chunk, so the bigrams are this is, is a, and a sentence. Okay? Now you understand the meaning and the concept of bigrams. And what about trigrams? If you investigate the trigrams of "This is a sentence," we make chunks of three words, so the trigrams are this is a and is a sentence. Okay?

So, now you see the difference, right? A collocation is a combination of words that occurs more often than we would expect, so the words are strongly associated with each other. For n-grams, that association doesn't matter: an n-gram is just an objective chunk of a fixed number of consecutive words, whether or not those words are especially connected.

So, now let me give you another example: n-grams from the Google n-gram corpus. If you look at trigrams, 3-grams, you can find trigrams such as ceramics collectibles fine, 130 times; ceramics collected by, 52 times; ceramics collectible pottery, 50 times; and ceramics collectibles cooking, 45 times. So, starting from ceramics, if you want to know which two words most frequently come together with it, you can examine the 3-gram counts and you will find lists like these. And if you search 4-grams, there are examples from the Google n-gram corpus such as serve as the incoming, 92 times; serve as the incubator, 99 times; serve as the independent, 794 times, which is a pretty big number; serve as the index, 223 times; serve as the indication, 72 times; and serve as the indicator, 120 times. So now we notice that the verb serve is used a lot with as the, followed by nouns such as index, indication, and indicator. Okay?
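To make the n-gram counting idea concrete, here is a minimal Python sketch; the helper function, the toy sentences, and the resulting counts are just illustrations, not part of the lecture or of the Google n-gram corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a sentence".split()

print(ngrams(tokens, 1))  # [('this',), ('is',), ('a',), ('sentence',)]
print(ngrams(tokens, 2))  # [('this', 'is'), ('is', 'a'), ('a', 'sentence')]
print(ngrams(tokens, 3))  # [('this', 'is', 'a'), ('is', 'a', 'sentence')]

# Counting n-grams over a corpus gives frequency lists like the Google
# n-gram examples above (this is only a toy corpus, not real data).
corpus = "this is a sentence and this is a sentence too".split()
trigram_counts = Counter(ngrams(corpus, 3))
for trigram, count in trigram_counts.most_common(3):
    print(" ".join(trigram), count)
```

The same counting idea, applied to a web-scale corpus, is what produces frequency lists like serve as the independent, 794 times.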
So, let's talk about the uses of n-grams. The items in an n-gram can be phonemes, syllables, letters, words, or base pairs, according to the application. That means the n-gram approach can be used in many other fields, such as protein sequencing or DNA sequencing; the concept of n-grams is not only used in linguistics but also, for example, when you analyze DNA. In DNA sequencing, if the sample sequence is AGCTTCGA, then the 1-gram sequence is just the single characters A, G, C, T, T, C, G, A. However, for the bigram sequence you need to put two characters together: AG, GC, CT, TT, TC, CG, GA, right? And if you want to examine the 3-gram, the trigram sequence, then you characterize the same sample as AGC, GCT, CTT, TTC, TCG, CGA, and so on. Right? I will show you more examples in the table; please take a look, and that might make it easier for you to understand the concept of n-grams. Okay.

So, now let's talk about the relationship between n-grams and word prediction. N-gram analysis can be used to predict the next word, for example in frequent search terms. If you go to the Google website and type olympic, other words appear, right, based on prediction, based on n-gram analysis: olympic 2018, olympic drone, olympic live, olympic opening ceremony, olympic games, torch relay, logo, rings, and so on. Okay? So, how can we use n-grams in sentence generation? First, choose a random bigram (<s>, w) according to its probability. Then choose a random bigram (w, x) according to its probability, and so on, until we choose </s>. Then string the words together; a short sketch of this procedure appears at the end of this segment. So we can generate, for example, I want to eat Chinese food, right? And we can see the probability of each step.

Word embedding is also related to n-grams. A skip-gram is an n-gram that skips some words on the left or right. For example, take the word interface in the sequence human machine interface for ABC computer applications. If you look at the n-gram right before interface, you find machine, right? Or two words before interface, you find human. And what about after interface? Then you have for, ABC, computer, applications. Okay? So, you can choose the word that best fits a sentence by calculating the maximum probability, based on the probability of each word. For instance, if you have a sentence like research about the (something) of the convenient applications for the interaction between human and machine, then from big data and the probability of each word, the computer can find possible words for that gap, such as interface, design, software, or programs. So now you know how n-grams can work in word embedding, okay?

Speaking of word embedding, you might have heard about Word2Vec, in which words are represented as vectors. With vectors, mathematical similarity and word analogy calculations become possible: for example, the word father is close in the vector space to words like son and boy, just as the word mother is closely connected to words like daughter and girl. Okay? That is the kind of thing you can find with word embeddings as well.

And let's talk about n-grams in language modeling. With a language model, we can compute the probability of a sentence or a sequence of words, and we can also use the probability of each component, such as each word.
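Coming back to the sentence-generation procedure above, here is a minimal sketch with made-up toy probabilities rather than real corpus counts; the bigram table and the function name are only illustrative assumptions.

```python
import random

# Toy bigram probabilities P(next word | current word); in practice these
# would be estimated from counts in a large corpus, not written by hand.
bigram_probs = {
    "<s>":     {"i": 0.7, "the": 0.3},
    "i":       {"want": 0.8, "am": 0.2},
    "want":    {"to": 1.0},
    "to":      {"eat": 0.9, "go": 0.1},
    "eat":     {"chinese": 0.6, "food": 0.4},
    "chinese": {"food": 1.0},
    "the":     {"food": 1.0},
    "am":      {"hungry": 1.0},
    "hungry":  {"</s>": 1.0},
    "go":      {"</s>": 1.0},
    "food":    {"</s>": 1.0},
}

def generate_bigram_sentence():
    """Start from <s>, keep sampling the next word from P(next | current),
    and stop when </s> is chosen, then string the words together."""
    word, sentence = "<s>", []
    while True:
        nexts = bigram_probs[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            return " ".join(sentence)   # e.g. "i want to eat chinese food"
        sentence.append(word)

print(generate_bigram_sentence())
```

If we drop the conditioning on the previous word and instead sample every word independently from its unigram probability, we get exactly the "word salad" behavior described in the clip that follows.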
So, for example, you can also generate a sentence using only unigram probabilities. Okay? I will share a clip about that; please take a look.

>> The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the unigrams. And if we generate sentences by randomly picking words, you can see that the result looks like a word salad. So, here are some automatically generated sentences, generated by Dan Klein. And you can see the word fifth, the word an, the word of; this doesn't look like a sentence at all. It's just a random sequence of words: thrift, did, eighty, said. That's the property of a unigram model: words are independent.

>> Okay, today we've talked about n-grams. So, next time we will talk about the POS parser. Thank you for your attention.