In this lecture, we're going to try and put some of what we saw in the previous lecture into practice, by learning how to read CSV, TSV, or JSON files in Python. We're going to demonstrate some of the main methods, and also to understand some of the edge cases that make reading these file formats difficult. So we're going to look at a few of the main functions in this lecture that we can use to read CSV, TSV, and JSON files in Python. Specifically, we'll look at string.split; csv.reader, which comes from a library; eval and ast.literal_eval; as well as the json.loads function in the JSON library. So essentially what we're trying to cover is both the manual way of reading CSV, TSV, or JSON files, which is string.split and eval, as well as the library functions that can automate the same functionality. As we'll see when we try to do it the manual way, there'll be a bunch of edge cases that we get stuck on, and that's where the libraries become useful. So starting with the string.split function. Here we see a possible first line of a CSV or TSV file. Actually, in this case, it's a space-separated file, which contains a list of five different fields separated by a space character. So if we run the string.split operation on that line, what it's going to do is look for any instance of a whitespace character and tokenize the string, converting it to a list of five shorter strings. So we go from a string containing the header to a list containing the individual entries. That's what will happen if we run the string.split function with no argument: it will just look for any whitespace character. In the third line here, we see another possible header row that we might see in a CSV or TSV file. In this case, it's actually a semicolon-separated file, where each of the entries is separated by a semicolon.
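As a minimal sketch of this first case, here's str.split called with no argument on a possible header line; the field names here are hypothetical, not from any particular dataset:

```python
# A hypothetical space-separated header line
header = "review_id user_id stars text date"

# With no argument, split tokenizes on any run of whitespace
fields = header.split()
print(fields)  # ['review_id', 'user_id', 'stars', 'text', 'date']
```

One string in, a list of five shorter strings out, exactly as described above.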
If we'd like to split or tokenize the string from this file, we need to provide an optional argument to the string.split operation to tell it what split character to use; in this case, it's a semicolon. So by providing the semicolon as the optional argument, we'll again get a tokenized list of the different entries from this file. So that's basically it. That's how the string.split function works. It converts a string to a list given a particular separator. By default, if we provide no argument, it will look for any whitespace character to use as the separator: tabs, spaces, or newlines. But if we have a different separator, such as we might see in a CSV file or a semicolon-separated file, we can handle that by providing an optional argument to string.split. So that's the basic case covered for string.split. What's going to happen in more complex situations? For instance, what would happen if the delimiter actually appeared within a column? You might imagine that a comma-separated file could be a challenging format to use for something like Amazon reviews, because the reviews themselves would typically contain commas. So here's an example of a possible entry from a CSV file of Amazon reviews, containing a star rating followed by a review, where the review itself contains a comma character. If we ran string.split on that type of file using a comma as the optional argument, we would get something that might not be what we want. It would split the line at every single comma, whereas we'd like to treat the entire review as being a single entry. We might be able to handle that by using a different delimiter, such as a semicolon, but it's going to be very hard to deal with the most general cases: Amazon reviews could, of course, contain semicolons or other characters as well. By default, the CSV format handles this by using what are called escape characters, such as quotes.
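Here's a small sketch of both situations just described: splitting on an explicit delimiter, and the failure case where the delimiter also appears inside a field. The header names and the review text are hypothetical examples:

```python
# Splitting on an explicit delimiter works for a simple header
header = "review_id;user_id;stars"
print(header.split(";"))  # ['review_id', 'user_id', 'stars']

# But a naive split breaks when the delimiter appears inside a field,
# as in this hypothetical Amazon-style review line:
line = '5,"Great product, would buy again"'
print(line.split(","))  # ['5', '"Great product', ' would buy again"']
```

The second split produces three pieces instead of the two we wanted, because str.split has no notion of quoting.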
So by wrapping the review in quotation characters, it tells us that any commas that appear inside the review should not be treated as separating individual entries, but should be treated as part of a single entry. Handling all of those different edge cases with the string.split function could be difficult, and that's where the csv.reader library comes in. Here's how we use that library. We first specify a path to the file we'd like to read, we open that file like we would any regular file, then we create an instance of a csv.reader object by giving it that file, as well as the delimiter that file uses, which in this case is a tab character. Then we can read that file line by line from the csv.reader object, using the operator called next. That's going to give us the next line of the file, which in this case is the first line of the file, corresponding to the header, now produced for us as a list. If we run the next operator again, it's going to give us the first review in the dataset, again nicely separated into a list. So the csv.reader function in the CSV library is going to nicely handle all of the edge cases when reading CSV and TSV files. What about JSON files? Those turn out to be much easier to read in Python, as these files are very closely analogous to Python's inbuilt dictionary objects. If we wanted to read some data from the Yelp review dataset, we can simply specify the path to it in the first line, open the file in the second line, read the first line of that file in the third line, and then we get a string containing the first line of review data from Yelp. Again, it doesn't look like very much yet. It's so far just a string; it's not yet a structured data object. All right. But to read the string into a Python object, or a Python dictionary, we can run the function eval on it, which is going to treat that string as though it were a piece of Python code.
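The csv.reader pattern above can be sketched as follows. To keep it self-contained, this version feeds the reader a list of in-memory lines rather than an opened file (in the lecture, the reader is given a file object from open(path), with delimiter='\t' for a TSV); the data itself is hypothetical:

```python
import csv

# csv.reader correctly handles quoted fields containing the delimiter;
# a file object opened with open(path) would work the same way
lines = ['stars,review', '5,"Great product, would buy again"']
reader = csv.reader(lines, delimiter=',')

header = next(reader)  # first call to next() returns the header row
first = next(reader)   # second call returns the first review

print(header)  # ['stars', 'review']
print(first)   # ['5', 'Great product, would buy again']
```

Note that the quoted review comes back as a single entry, with the quotes stripped, which is exactly the edge case that str.split got wrong.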
In this case, it's going to interpret that code as a Python dictionary, since it has the same syntax as a dictionary object. In other words, it's in the form of key-value pairs. So if we run eval on the line and we store the result in d, then d is now going to be a dictionary object, which we can treat as a bunch of key-value pairs, and we can extract, for example, the user ID from that first line of the file. So that's a very simple, very clean way to read JSON-structured data in Python. But we should note that it's a little bit risky. Essentially, what the eval function is doing is just treating an arbitrary string as if it were Python code. So in this example, we see a string containing "4 + 2". It's not the expression 4 + 2, it's the string "4 + 2", but the eval function is going to treat it as an expression and evaluate it, giving the result 6. So that seems very convenient, it seems very nice, but it could be dangerous to execute this type of code. If we're running it on an untrusted dataset, we're essentially saying that we're happy for it to go ahead and execute whatever lines of code it finds in that dataset, which may not be safe. So what we'd like to do to get around that is to use library functions to ensure that only valid JSON data is actually going to get evaluated, and that we can't just be executing arbitrary code by reading from a file. We'll look at two ways to do that. One is the abstract syntax tree library and one is the JSON library. The abstract syntax tree, or AST, library is essentially going to allow us to read things that look like dictionary objects, but it's going to ensure that they really are just that: that what we're reading is just a dictionary object, and that it's not trying to execute some arbitrary or possibly malicious code.
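Here's a small sketch of both behaviors of eval described above; the review line is a hypothetical stand-in for a line read from the Yelp file, not actual dataset content:

```python
# eval treats a string as Python code: convenient when the line is
# dictionary-shaped, but unsafe on untrusted data
line = "{'user_id': 'abc123', 'stars': 4}"  # hypothetical review line
d = eval(line)
print(d['user_id'])  # abc123 -- we can now index it like any dictionary

# The same mechanism evaluates arbitrary expressions, which is the risk:
print(eval("4 + 2"))  # 6
```

If that string had contained a call with side effects instead of a dictionary literal, eval would have executed it just as happily, which is why the safer alternatives matter.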
So we can now run literal_eval from the AST library on that same line. It's going to produce precisely the same output, but it's going to be a little bit safer to execute, because it's actually checking that the line has valid formatting corresponding to a Python literal, such as a dictionary. The JSON library is going to be much the same. If we import that library and we run the json.loads, or load-from-string, function on that same line, again it's going to convert it to essentially a Python dictionary object, but it's going to check that all of the formatting is valid and that this really is JSON data. So these are two equivalent but somewhat safer ways to read JSON data. So briefly, to summarize what we've covered in today's lecture. We've looked at a few different ways to read CSV, TSV, or JSON files. We've looked at the split and eval methods, which are essentially oversimplified, manual ways of trying to read CSV and JSON, and we've seen some of the dangers and issues that you can run into if you use those functions. To get around those, we've looked at some libraries that can handle CSV, TSV, and JSON data for us. So by now, you should be able to read JSON and CSV data in Python on your own. Go ahead and try loading the Amazon and Yelp datasets using the csv.reader library and the json.loads function.
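The two safer alternatives can be sketched side by side. The example lines are hypothetical; note that json.loads requires strictly valid JSON, which in particular means double-quoted strings, while ast.literal_eval accepts Python-literal syntax:

```python
import ast
import json

# ast.literal_eval only accepts Python literals (dicts, lists, strings,
# numbers, ...) and refuses anything that would execute code
py_line = "{'user_id': 'abc123', 'stars': 4}"
d1 = ast.literal_eval(py_line)

# json.loads only accepts valid JSON (double-quoted keys and strings)
json_line = '{"user_id": "abc123", "stars": 4}'
d2 = json.loads(json_line)

print(d1 == d2)  # True -- both produce the same dictionary
```

Either one gives you the same dictionary that eval would have, without the risk of running arbitrary code from the file.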