We're back with our next lecture. In the previous lecture, we pulled out all the product IDs that were associated with Nike products in this Amazon product dataset. Now we need to download the reviews dataset and find reviews that match those products. I've got another data file for you, and it looks just like what we worked with last time: it's a gzipped JSON file, we're saving it into our folder, we're decompressing it, and we're loading some basic stuff here. Really, this is the exact same stuff as last time, except now I'm opening up the reviews JSON. You won't see that error there; your file should open okay. The next step is to parse the review data, which means actually going through it and figuring out how many reviews we have, because I think that's a good place to get started: how many entries are in our dataset? Again, counting is always going to be helpful here. We're going to create a dictionary for all reviews, and for each line in the loaded JSON we're going to go ahead and evaluate it, saying, "Hey, turn this text into a Python dictionary," and then we're going to use the count variable as the key of the dictionary and the review as the actual metadata. Why are we using the count? Well, this data doesn't actually have unique identifiers for reviews. Reviews don't have IDs, and since they don't, we need to create one. The cheapest and easiest way to do that is to just take the iteration count and set it as the ID for the review. Not ideal; we'll use something slightly more unique in a later lecture, but for now it's going to work. If we print out the number of reviews, let's see how many we have. I need you to be patient with this printing process, because I just sat through it and it takes about five minutes.
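The counting loop described above can be sketched as follows. This is a minimal, runnable sketch, so it uses a couple of in-memory sample lines in place of the decompressed reviews file, and the field names (`asin`, `reviewerID`, `reviewText`) are assumed to match the Amazon dump. It also uses `ast.literal_eval` as a safer equivalent of the `eval()` call used in the notebook:

```python
import ast

# Tiny stand-in for the decompressed reviews file: one Python-dict-style
# record per line, as in the Amazon review dumps (field names assumed).
sample_lines = [
    "{'asin': 'B000123', 'reviewerID': 'A1', 'reviewText': 'Great shoes'}",
    "{'asin': 'B000456', 'reviewerID': 'A2', 'reviewText': 'Too small'}",
]

# The raw data has no unique review IDs, so the iteration count stands in
# as one. ast.literal_eval only accepts Python literals, never arbitrary
# code, which makes it a safer drop-in for eval() here.
all_reviews = {}
count = 0
for line in sample_lines:
    review = ast.literal_eval(line)
    all_reviews[count] = review
    count += 1

print(len(all_reviews))  # with the real file this is about 5.7 million
```

With the real file you would iterate over the open file object line by line instead of a list, which keeps memory usage to one record at a time during parsing.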
We have about 5.7 million Amazon reviews in this dataset, and I think it's really kind of special that we're able to parse through 5.7 million reviews in five minutes in a little Colab notebook. We have come a long way with big data, and let me tell you, nothing has been more equalizing than Google Colab. That being said, we're not done yet. We've got 5.7 million reviews stored in our little notebook, and you can see here our RAM is starting to get full, about two-thirds full, so we really couldn't take much more data into this Colab notebook. If you ever run out of RAM in Google Colab, I will plug this: the 10-dollar-a-month package is wonderful. It gives you more RAM, letting you request a runtime with more memory. If you ever hit these out-of-memory issues in Python, it'll basically look like the notebook runs until the RAM gets full and then it just dies; it goes back to zero and resets. If you pay the 10 bucks a month, you get access not only to priority GPUs, which would help with the deep learning stuff we did last class, but you also get to change your runtime type to have more RAM. So it is worth it if you're doing this stuff regularly: 10 bucks a month, the cost of Hulu. I think it's far more useful than Hulu, and although I did like some shows on there recently and I'm not going to say Hulu is bad, this is a great tool; for 10 bucks a month they really give you a ton. We've got our reviews parsed into a dictionary now. The next step for us is to load back in those ASINs, the product IDs that we extracted from the product data. For Amazon, the only way we can match our products to our actual reviews is through that identifier, the ASIN. So let's load that data back in.
Remember, it's essentially just a CSV file, so we're going to open it up in read mode and tell it to take this data and split it. All it was, was a collection of ASINs separated by commas, so we split it by commas. Then we go through this list of ASINs and put it back into a dictionary. Could we have saved the list as a pickle and then loaded the pickle back in? Yes, and it would have worked just as well. You can see here we're back to our original 8,327 Nike ASINs, and I always encourage you to print out the length of things when you have the chance. If it's off by one or two, which can happen when a data file doesn't get closed correctly or Colab crashes in the middle of writing something, you won't know until it's too late, with potentially really damaging results. So in this case, I just print it out to make sure. The next line of code is still running, so let's see if it finishes by the time we're done here. In this next line, we're going to build a dictionary called Nike reviews, and this is going to contain the actual stuff we'll use for topic modeling. How do we get these reviews out of this JSON? Again, we iterate through that dictionary we just created, called all reviews. Again, we set up a counter, and I hard-coded the total number of reviews that we expect, so that our counter will produce a decimal somewhere between 0 and 1 as a progress indicator. A couple of things I'm going to want to do. Remember this is in dictionary format, so to reference an entry, or a key, in a dictionary, we've got to do a few things. The first is to pull the actual review out as a dictionary. This is going to be the dynamic reference to a single review as we iterate through all of the reviews.
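Loading the ASINs back in might look like the sketch below. To keep it runnable, the file read is replaced with an inline stand-in string; with the real file you'd use `open(path).read()` with your filename from the previous lecture. Storing the ASINs as dictionary keys (rather than a list) is what makes the later membership check fast:

```python
# Stand-in for open('nike_asins.csv').read() -- the file is just ASINs
# separated by commas (filename here is hypothetical).
csv_text = "B000123,B000456,B000789"

# Put the ASINs back into a dictionary. Dict (or set) lookups are O(1),
# so checking "is this review's ASIN a Nike ASIN?" stays cheap even
# across 5.7 million reviews. A list would make each check a linear scan.
nike_asins = {}
for asin in csv_text.split(","):
    nike_asins[asin] = 1

# Always print the length: an off-by-one here means a corrupted save.
print(len(nike_asins))  # with the real file: 8,327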
Now we need to extract a few specific fields to make this match. Of course, we need to extract the ASIN from the review, and we need to extract the reviewer ID as well. By using the combination of product ID and reviewer ID, we can create a unique identifier that maps to one review, which is at least helpful in some vague way. We go through and say: if the ASIN that we've extracted for this review is in that large list of all Nike ASINs, then it's a Nike review. If that condition is satisfied, we create the key for the dictionary entry, which again is the combination of the ASIN and the reviewer ID, and I'm just separating them here with a period; I could use an underscore or whatever. Then I set that as the key for this dictionary, and I put the entire metadata of the review into the Nike reviews dictionary. What I'm going to do here is make sure that I save this, because this is going to be essentially my final corpus. I've taken this much larger corpus of 5.7 million reviews and gotten it down to just the Nike ones, the thing I want to do topic modeling on, so it's a good place to stop and say, "Hey, I'm going to save this, I'm going to dump this out into a file, and now I have a much smaller dataset to work from than the 5.7 million." This process is probably going to take 10 to 15 minutes to run, and you can see the counter ranging from 0 to 1; it's finished here. I'm going to go ahead and wrap up this lecture by saying: when this is finished, it will save this JSON file into your working directory, and then we'll be able to actually begin to parse through it to make sense of these Amazon reviews in a topic model.
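The filtering and saving steps above can be sketched like this. It's a minimal version assuming the field names `asin` and `reviewerID` and an output filename of `nike_reviews.json` (both assumptions), and it uses small in-memory stand-ins for the two dictionaries built earlier so it runs on its own:

```python
import json

# Stand-ins for the dicts built earlier in the lecture.
all_reviews = {
    0: {'asin': 'B000123', 'reviewerID': 'A1', 'reviewText': 'Great shoes'},
    1: {'asin': 'B999999', 'reviewerID': 'A2', 'reviewText': 'Not Nike'},
}
nike_asins = {'B000123': 1}

total = len(all_reviews)  # in the lecture this is hard-coded (~5.7M)
nike_reviews = {}
count = 0
for idx in all_reviews:
    review = all_reviews[idx]          # pull the review out as a dict
    asin = review['asin']
    reviewer_id = review['reviewerID']
    if asin in nike_asins:
        # ASIN + reviewer ID gives a mostly-unique key for one review;
        # the separator (a period here) is an arbitrary choice.
        nike_reviews[asin + '.' + reviewer_id] = review
    count += 1
    # progress indicator: a decimal between 0 and 1
    # print(count / total)

# Dump the filtered corpus so we never re-filter 5.7M reviews again.
with open('nike_reviews.json', 'w') as f:
    json.dump(nike_reviews, f)

print(len(nike_reviews))
```

Saving the intermediate result here is the key design choice: the expensive 10-to-15-minute filtering pass runs once, and every later topic-modeling session starts from the small JSON file instead.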