This lecture is about natural language content analysis. Natural language content analysis is the foundation of text mining, so we're going to talk about it first. In particular, natural language processing affects how we can represent text data, and this determines what algorithms can be used to analyze and mine text data. We're going to take a look at the basic concepts in natural language processing first, and I'm going to explain these concepts using the simple example that you are seeing here: "A dog is chasing a boy on the playground." This is a very simple sentence. When we read such a sentence, we don't have to think about it to get its meaning. But when a computer has to understand the sentence, it has to go through several steps.

First, the computer needs to know what the words are, that is, how to segment the words. In English this is very easy: we can just look at the spaces. Then the computer needs to know the syntactic categories of these words. For example, "dog" is a noun, "chasing" is a verb, "boy" is another noun, etc. This is called lexical analysis; in particular, tagging these words with their syntactic categories is called part-of-speech tagging. After that, the computer also needs to figure out the relationships between these words. So "a" and "dog" would form a noun phrase, and "on the playground" would be a prepositional phrase, etc. There are certain ways for them to be connected together in order to create meaning; some other combinations may not make sense. This is called syntactic parsing, or syntactic analysis, of a natural language sentence. The outcome is the parse tree that you are seeing here, which tells us the structure of the sentence, so that we know how to interpret it.

But this is not semantics yet. In order to get the meaning, we would have to map these phrases and structures onto real-world entities that we have in our mind. So dog is a concept that we know, and boy is a concept that we know; connecting these phrases to the entities we know is understanding. A computer would have to formally represent these entities using symbols. So Dog(d1) means d1 is a dog, Boy(b1) means b1 refers to a boy, etc. The chasing action is represented as a predicate: Chasing(d1, b1, p1), where p1 is the playground. This is a formal rendition of the semantics of the sentence. Once we reach that level of understanding, we might also make inferences. For example, if we assume there's a rule that says if someone is being chased then that person can get scared, then we can infer that this boy might be scared. This is the inferred meaning, based on additional knowledge.

And finally, we might further infer what this sentence is requesting, or why the person is saying the sentence. This has to do with the purpose of saying the sentence, and it's called speech act analysis, or pragmatic analysis, which refers to the use of language. In this case, a person saying this may be reminding another person to bring back the dog. So when saying a sentence, the person actually takes an action; the action here is to make a request. Now, this clearly shows that in order to really understand a sentence, there are a lot of things a computer has to do. In general, it's very hard for a computer to do all of this, especially if you want it to do everything correctly. That is very difficult.
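To make these steps concrete, here is a minimal sketch in Python using the open-source spaCy library with its small English model (both are my tooling assumptions, not what the lecture used). It shows lexical analysis (word segmentation and part-of-speech tagging) and syntactic analysis for the example sentence; spaCy produces a dependency parse rather than the constituency parse tree shown on the slide, but it captures similar structure. The predicate-style semantic representation at the end is written by hand, since deriving it automatically is exactly what computers cannot yet do reliably.

```python
# A minimal sketch of the first analysis steps, assuming spaCy is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A dog is chasing a boy on the playground.")

# Lexical analysis: word segmentation plus part-of-speech tagging.
for token in doc:
    print(token.text, token.pos_)          # e.g. "dog NOUN", "chasing VERB"

# Syntactic analysis: a dependency parse (head/relation for each word),
# an alternative view to the constituency parse tree on the slide.
for token in doc:
    print(token.text, token.dep_, "->", token.head.text)

# Noun phrases such as "A dog" and "a boy" fall out of the parse.
print([chunk.text for chunk in doc.noun_chunks])

# Semantic representation, constructed by hand here (computers cannot
# yet derive this reliably):
#   Dog(d1), Boy(b1), Playground(p1), Chasing(d1, b1, p1)
```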
Now, the main reason why natural language processing is so difficult is that natural language is designed to make human communication efficient. As a result, for example, we omit a lot of common-sense knowledge: because we assume all of us have this knowledge, there's no need to encode it. That makes communication efficient. We also tolerate a lot of ambiguity, such as ambiguity of words, again because we assume we have the ability to disambiguate words, so there's no problem with having the same word mean different things in different contexts. Yet for a computer this is very difficult, because a computer does not have the common-sense knowledge that we do; the computer will indeed be confused. This makes natural language processing hard; indeed, it makes every step on the slide I showed you earlier hard. Ambiguity is the main killer, meaning that at every step there are multiple choices, and the computer has to decide which is the right one, and that decision can be very difficult, as you will also see in a moment. In general, we need common-sense reasoning in order to fully understand natural language, and computers today don't yet have that. That's why it's very hard for computers to precisely understand natural language at this point.

So here are some specific examples of challenges. Think about word-level ambiguity. A word like "design" can be a noun or a verb, so we have an ambiguous part-of-speech tag. "Root" also has multiple meanings: it can have a mathematical sense, as in the square root of a number, or it can be the root of a plant. Syntactic ambiguity refers to different interpretations of a sentence in terms of structure. For example, "natural language processing" can actually be interpreted in two ways. One is the ordinary meaning that we've been using as we talk about this topic: the processing of natural language. But there is also another possible interpretation, which is to say that language processing is natural. We don't generally have this problem, but for the computer to determine the structure, it would have to choose between the two. Another classic example is "A man saw a boy with a telescope," and this ambiguity lies in the question: who had the telescope? This is called prepositional phrase attachment ambiguity, meaning where to attach the prepositional phrase "with a telescope": should it modify "the boy," or should it modify the verb "saw"? Another problem is anaphora resolution: in "John persuaded Bill to buy a TV for himself," does "himself" refer to John or Bill? Presupposition is another difficulty: "He has quit smoking" implies that he smoked before, and we need to have such knowledge in order to understand the language.

Because of these problems, state-of-the-art natural language processing techniques cannot do any of these tasks perfectly. Even for the simplest task, part-of-speech tagging, we still cannot solve the whole problem. The accuracy listed here, about 97%, was taken from some earlier studies. These studies obviously had to use particular data sets, so the numbers are not really meaningful if you take them out of the context of the data sets used for evaluation. But I show these numbers mainly to give you some sense of the accuracy, or how well we can do things like this; it doesn't mean the accuracy on any data set would be precisely 97%.
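To make the word-level ambiguity concrete, here is a small sketch (again using spaCy, an assumption rather than the lecture's tooling) showing that a statistical tagger assigns different part-of-speech tags to the same word "design" depending on context, and it chooses by probability, so it can also be wrong:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The same surface word gets different tags in different contexts;
# the tagger picks the most probable tag given the surrounding words.
for sentence in ["This is a good design.", "They design new systems."]:
    doc = nlp(sentence)
    for token in doc:
        if token.text.lower() == "design":
            print(sentence, "->", token.pos_)   # typically NOUN vs. VERB
```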
But, in general, we can do part-of-speech tagging fairly well, although not perfectly. Parsing is more difficult, but for partial parsing, meaning getting some phrases correct, we can probably achieve 90% or better accuracy; getting the complete parse tree correct is still very, very difficult. We can also do some aspects of semantic analysis, particularly extraction of entities and relations: for example, recognizing that this is a person, that's a location, and this person and that person met in some place, etc. We can also do word sense disambiguation to some extent: the occurrence of "root" in this sentence refers to the mathematical sense, etc. Sentiment analysis is another aspect of semantic analysis that we can do; that means we can tag a sentence as generally positive or negative when it's talking about a product or a person. Inference, however, is very hard, and we generally cannot do it for any broad domain; it's only feasible for very limited domains, and it's a generally difficult problem in artificial intelligence. Speech act analysis is also very difficult: we can probably only do it for very specialized cases, and with a lot of help from humans to annotate enough data for the computers to learn from.

So this slide also shows that computers are far from being able to understand natural language precisely, and that also explains why the text mining problem is difficult: we cannot rely on mechanical approaches or computational methods to understand the language precisely. Therefore, we have to use what we have today, in particular statistical machine learning methods and statistical analysis methods, to try to get as much meaning out of the text as possible. Later you will see that there are actually many such algorithms that can indeed extract interesting models from text, even though we cannot fully and precisely understand the meaning of all the natural language sentences.
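As an illustration of the semantic-analysis tasks that do work reasonably well today, here is a sketch of entity extraction with spaCy and lexicon-based sentiment scoring with NLTK's VADER analyzer. Both library choices, the example sentences, and the indicated outputs are my assumptions for illustration, not tools or results from the lecture:

```python
import spacy
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Entity extraction: recognize persons, locations, organizations, etc.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama met Angela Merkel in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Barack Obama', 'PERSON'), ('Angela Merkel', 'PERSON'),
#       ('Berlin', 'GPE')]

# Sentiment analysis: tag a sentence as generally positive or negative
# using VADER's sentiment lexicon.
nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This product is excellent."))
# e.g. a dict with a clearly positive 'compound' score
```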