[SOUND] So, looking at the text mining problem more closely, we see that the problem is similar to general data mining, except that we'll be focusing more on text data. And we're going to have text mining algorithms to help us to turn text data into actionable knowledge that we can use in real world, especially for decision making, or for completing whatever tasks that require text data to support. Because, in general, in many real world problems of data mining we also tend to have other kinds of data that are non-textual. So a more general picture would be to include non-text data as well. And for this reason we might be concerned with joint mining of text and non-text data. And so in this course we're going to focus more on text mining, but we're also going to also touch how do to joint analysis of both text data and non-text data. With this problem definition we can now look at the landscape of the topics in text mining and analytics. Now this slide shows the process of generating text data in more detail. More specifically, a human sensor or human observer would look at the word from some perspective. Different people would be looking at the world from different angles and they'll pay attention to different things. The same person at different times might also pay attention to different aspects of the observed world. And so the humans are able to perceive the world from some perspective. And that human, the sensor, would then form a view of the world. And that can be called the Observed World. Of course, this would be different from the Real World because of the perspective that the person has taken can often be biased also. Now the Observed World can be represented as, for example, entity-relation graphs or in a more general way, using knowledge representation language. But in general, this is basically what a person has in mind about the world. And we don't really know what exactly it looks like, of course. But then the human would express what the person has observed using a natural language, such as English. And the result is text data. Of course a person could have used a different language to express what he or she has observed. In that case we might have text data of mixed languages or different languages. The main goal of text mining Is actually to revert this process of generating text data. We hope to be able to uncover some aspect in this process. Specifically, we can think about mining, for example, knowledge about the language. And that means by looking at text data in English, we may be able to discover something about English, some usage of English, some patterns of English. So this is one type of mining problems, where the result is some knowledge about language which may be useful in various ways. If you look at the picture, we can also then mine knowledge about the observed world. And so this has much to do with mining the content of text data. We're going to look at what the text data are about, and then try to get the essence of it or extracting high quality information about a particular aspect of the world that we're interested in. For example, everything that has been said about a particular person or a particular entity. And this can be regarded as mining content to describe the observed world in the user's mind or the person's mind. If you look further, then you can also imagine we can mine knowledge about this observer, himself or herself. So this has also to do with using text data to infer some properties of this person. And these properties could include the mood of the person or sentiment of the person. And note that we distinguish the observed word from the person because text data can't describe what the person has observed in an objective way. But the description can be also subjected with sentiment and so, in general, you can imagine the text data would contain some factual descriptions of the world plus some subjective comments. So that's why it's also possible to do text mining to mine knowledge about the observer. Finally, if you look at the picture to the left side of this picture, then you can see we can certainly also say something about the real world. Right? So indeed we can do text mining to infer other real world variables. And this is often called a predictive analytics. And we want to predict the value of certain interesting variable. So, this picture basically covered multiple types of knowledge that we can mine from text in general. When we infer other real world variables we could also use some of the results from mining text data as intermediate results to help the prediction. For example, after we mine the content of text data we might generate some summary of content. And that summary could be then used to help us predict the variables of the real world. Now of course this is still generated from the original text data, but I want to emphasize here that often the processing of text data to generate some features that can help with the prediction is very important. And that's why here we show the results of some other mining tasks, including mining the content of text data and mining knowledge about the observer, can all be very helpful for prediction. In fact, when we have non-text data, we could also use the non-text data to help prediction, and of course it depends on the problem. In general, non-text data can be very important for such prediction tasks. For example, if you want to predict stock prices or changes of stock prices based on discussion in the news articles or in social media, then this is an example of using text data to predict some other real world variables. But in this case, obviously, the historical stock price data would be very important for this prediction. And so that's an example of non-text data that would be very useful for the prediction. And we're going to combine both kinds of data to make the prediction. Now non-text data can be also used for analyzing text by supplying context. When we look at the text data alone, we'll be mostly looking at the content and/or opinions expressed in the text. But text data generally also has context associated. For example, the time and the location that associated are with the text data. And these are useful context information. And the context can provide interesting angles for analyzing text data. For example, we might partition text data into different time periods because of the availability of the time. Now we can analyze text data in each time period and then make a comparison. Similarly we can partition text data based on locations or any meta data that's associated to form interesting comparisons in areas. So, in this sense, non-text data can actually provide interesting angles or perspectives for text data analysis. And it can help us make context-sensitive analysis of content or the language usage or the opinions about the observer or the authors of text data. We could analyze the sentiment in different contexts. So this is a fairly general landscape of the topics in text mining and analytics. In this course we're going to selectively cover some of those topics. We actually hope to cover most of these general topics. First we're going to cover natural language processing very briefly because this has to do with understanding text data and this determines how we can represent text data for text mining. Second, we're going to talk about how to mine word associations from text data. And word associations is a form of use for lexical knowledge about a language. Third, we're going to talk about topic mining and analysis. And this is only one way to analyze content of text, but it's a very useful ways of analyzing content. It's also one of the most useful techniques in text mining. Then we're going to talk about opinion mining and sentiment analysis. So this can be regarded as one example of mining knowledge about the observer. And finally we're going to cover text-based prediction problems where we try to predict some real world variable based on text data. So this slide also serves as a road map for this course. And we're going to use this as an outline for the topics that we'll cover in the rest of this course. [MUSIC]