We're back here talking about topic modeling, and now we're ready to actually get our hands dirty and do some topic modeling. If you have not reviewed the project for the course, you really need to do that, because it's going to set up what we're doing and why, and what you're ultimately going to need to create and submit for your grade. Remember, we're looking at Amazon product reviews, and we're going to try to do some topic modeling to cluster out the common comments that people have about Nike shoes. Let's go ahead and make sure that we get our data imported. The first thing that we have to do, and it looks just like what we did in our first class, is use wget to download the data from a server straight onto your Google Drive. Remember that wget is a command that we issue in the terminal, so we use an exclamation point to let Colab run it in the notebook environment. We have our URL from the server here, and then we're specifying with the -P parameter the path where we're going to store the data. By storing it straight onto Google Drive, we're saving ourselves a lot of headache, because one, we don't have to download the file and then upload it to Google Drive, which would probably take an hour or two, and two, we're saving ourselves a lot of local storage and duplication of files. Remember, your path could be very different from mine; it's wherever you want to store your project notebook. Remember that when you issue a path in the terminal, you've got to escape spaces. I don't have my spaces escaped because I don't have any spaces in my path, but if the path of my folder had spaces in it, I would simply escape them with a backslash like this. Of course, I actually do not, because I hate escaping stuff, so I have no issues to worry about. If you ever wonder why I use underscores, this is why.

The next step: as you can see, this is a pretty big file, 266 megabytes, and we need to extract it into its raw format. The easiest way to do this, since it's a gzip file, is just to use the gzip command. We're using -d, which is just decompress. What's really nice about this is that we're able to do it straight on Google Drive again; we don't have to decompress locally, which saves time. What's also nice about gzip is that when it decompresses the file, it deletes the compressed file by default, so you'll have just one file on your drive. I think it's somewhere around a gigabyte, so it's big. Then there are some basic imports: we're going to load in pickle to load and save some data files, we're going to use json because this is a JSON format, and we're going to use sleep occasionally if we want to make the computer pause as we're printing out data. Then this is the working directory. It's just represented as a string; it's the path where our files are stored on Google Drive. We do this for a couple of reasons. One, if we move our files, we don't have to change all of the references, we just have to change this one. Two, and I think this is the best reason to do it, when we're opening up a file like this, we don't have to type out the full path name; we can just reference it by piping in the variable, and it will load up fine. When we're loading a JSON file, we have to open it as a Python file object first, and we're opening it in read mode, so that's what we're doing there.
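To make that concrete, here is a minimal sketch of what those setup cells might look like. The URL, folder names, and filename below are placeholders of mine, not necessarily the ones in the course notebook.

```python
# Mount Google Drive first so the paths below exist (run once per session).
from google.colab import drive
drive.mount('/content/drive')

# Download the compressed metadata straight onto Google Drive;
# -P tells wget which directory to save into. (Placeholder URL and path.)
# !wget http://example.com/meta_Clothing_Shoes_and_Jewelry.json.gz -P /content/drive/MyDrive/topic_modeling/

# Decompress in place; gzip -d deletes the .gz file by default.
# !gzip -d /content/drive/MyDrive/topic_modeling/meta_Clothing_Shoes_and_Jewelry.json.gz

import pickle            # for saving and loading intermediate data files
import json              # the data is JSON-like
from time import sleep   # to pause while printing out data

wk_dir = '/content/drive/MyDrive/topic_modeling/'   # working directory as a string

# Open the decompressed metadata file in read mode, piping in the working directory.
product_file = open(wk_dir + 'meta_Clothing_Shoes_and_Jewelry.json', 'r')
```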
About this data that we're using: I just want to say, verbally, that I want to thank Julian McAuley and his colleagues for allowing us to have such rich Amazon data. They built scrapers that went out there and looked at a lot of Amazon pages and a lot of reviews, and they collected this stuff so that we can play around with it as data scientists, so it's a really cool marketing goldmine. Getting these reviews is really going to be a two-step process. First, we've got to extract all of the Amazon product entries for the things that we want to study, or that we want to get the reviews for; in this case, that's going to be Nike products. We're going to look at all of the reviews for Nike, but the way that this data is structured, we can't really do that until we first find all the Nike products in the product catalog. Once we get a list of the IDs that correspond to Nike products, then it's really easy to go through and parse out the reviews. The first step we're doing today is pulling out those Amazon product IDs for Nike. We've got to extract these ASINs; an ASIN is the identifier that Amazon assigns to these products. I don't know if you knew this, but you know now: every Amazon product has an ID that's specific to Amazon. If you google an ASIN, you'll usually get what the product is; the Amazon page will pop up and you can inspect what it is. Amazon uniformly creates this product catalog using these ASINs, and we've got to find the ASINs that match Nike today.

We're going to be doing a lot of iterating through large data sets. Trust me, you will get bored waiting for the notebook to finish. You'll see the wheel spinning, and you'll say, is it ever going to end? Did it die? Is it ever going to finish? Whenever we're in that kind of situation, where we're worried about whether this thing is still going and how long we're going to have to wait, using a counter is so important, because it helps us stay patient and lets us know that the computer is working on it; we just have to go get a cup of coffee or whatever. We've loaded in our JSON data, our Amazon product catalog data, as a file, but we haven't yet parsed through it, so we need to do that. What I'm doing here is just counting every time I iterate through a document. Of course, I could use i and enumerate to do this as well, but I'm doing it this way deliberately to show you what's happening. Now I'm building a little counter here, and I think counters are so underused but so important in data science when you have a lot of documents, because again, you want to know that this thing is still going. There are several hundred thousand products that we're going to go through here, and we want to know our computer is making progress on them. What I'm saying here is: if the count, when divided by 100,000, has no remainder, then print the count. At every 100,000 interval that we hit with the count variable, we print out the count. That's nice; it gives us status, but it doesn't spam our output with every line, 1, 2, 3, 4, which would be crazy to look at. We only want to print when we reach these even intervals of 100,000, and that's what this logic is allowing us to do. Really, parsing through the data is wonderfully simple, thanks to the data structure that Julian McAuley used. It really is just stored as Python dictionaries in text, not genuine JSON, although to the naive eye they look similar; to me they look almost identical.
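As a minimal sketch, that counter logic looks something like this, reading the file we opened above; the variable names are mine, not necessarily the notebook's.

```python
count = 0
for line in product_file:
    count += 1
    # Print only when count divides evenly by 100,000: a status update at every
    # 100k documents, without spamming a line of output for every single document.
    if count % 100000 == 0:
        print(count)
```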
Because it's actually Python dictionaries stored in text, all we have to do is tell Python to evaluate each line of text, and it will automatically say, hey, this is a dictionary. I know it's a dictionary because if you entered one of these lines into the Python command line, it would return a dictionary as the object it recognizes it as. There's no real JSON interpretation or parsing needed; we just use the eval function to get it into a dictionary, and this product variable actually becomes a dictionary. What I'm doing in this line of code is essentially making a little dictionary where each key is a unique Amazon product ID, and the value is all of the metadata associated with that product. You can see here that I have 1.5 million Amazon products in the clothing, shoes, and jewelry category; that's a lot of products. If you think about it, it's amazing that Amazon sells that much stuff, even though this is old data now, from four or five years ago, but they do. We've got 1.5 million products, and we've got to parse through them and figure out which products are Nike products.

I've created a new iteration here where I'm iterating over the dictionary that I just created. Again, I'm printing out that count just so I know where we are. This time I went ahead and hard-coded in the number of products, so instead of just seeing 100,000, 200,000, 300,000, I'll actually see how far along we are. It'll print out a percentage; once it reaches one, we know that it's done. It's not technically a percentage, it's a decimal, but you get what I'm saying. I'm just taking that product metadata and setting it as a variable; for each iteration through, I get the metadata stored in a product variable. Then I'm saying, if there's a categories entry in the metadata of the product, then we want to continue. That means there's an actual category assigned to this Amazon item, and that means we can look inside of it to see if it is a category that we're interested in. Then, for each category in the product's categories, and inside of that there can be multiple categories that a product is assigned to, because if you think about it, Amazon products often aren't just one category. Shoes can also be considered fashion, and pants can be considered leisure wear and also athletic wear, and so on and so forth. I just want to iterate through the data and count up the common categories that are available; I just want to see what categories of products we have in our Amazon data. What I'm doing here is making a new dictionary, which I initiated up here. This new dictionary just takes the category name and tallies it up. If it's the first time it has seen a category, it makes a new entry in the dictionary and says all categories for shoes equals 1. The next time it sees shoes, it just keeps that count going, so that we get a count for each category in our data. Remember, dictionaries are not sorted by default; they have no natural order, and we have to create ordered or sorted representations of them whenever we want to print them out that way. So I'm going to take this dictionary that I just made and turn it into a sorted list. All I'm doing is saying, for each category in all categories, append to the sorted list that category and the count for that category. I'm actually putting the count first.
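Here is a minimal sketch of that parsing and tallying, continuing from the sketch above. It assumes the 'categories' field is a list of category lists, as in the McAuley metadata; the variable names are illustrative.

```python
# Re-open the file, since the counter loop above already read through it.
product_file = open(wk_dir + 'meta_Clothing_Shoes_and_Jewelry.json', 'r')

all_products = {}
for line in product_file:
    # Each line is a Python dict written out as text, so eval() turns it
    # straight into a dictionary; no JSON parsing needed.
    product = eval(line)
    all_products[product['asin']] = product   # key = unique ASIN, value = all the metadata

# Second pass: tally how many products carry each category label.
all_categories = {}
count = 0
num_products = len(all_products)   # the lecture hard-codes this number (about 1.5 million)
for asin in all_products:
    count += 1
    if count % 100000 == 0:
        print(count / num_products)           # prints the fraction finished, ending at 1
    product = all_products[asin]
    if 'categories' in product:               # only products with categories assigned
        for category_list in product['categories']:   # 'categories' is a list of category lists
            for category in category_list:
                if category not in all_categories:
                    all_categories[category] = 1       # first time we've seen this category
                else:
                    all_categories[category] += 1      # keep the tally going

# Turn the tallies into a list of (count, category) tuples, count first so it sorts by count.
sorted_categories = []
for category in all_categories:
    sorted_categories.append((all_categories[category], category))
```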
The count for the category comes first, and then the name of the category. Once this is in tuple format, we have a list of tuples, and all I have to do to sort a list of tuples is tell it that I want to sort the list and then use reverse, which means sorting from the highest count, the most products, down to the fewest products, and I've just printed out the top 50 categories here. It's no surprise that the general category, clothing, shoes, and jewelry, comes up the most, but then the second product category is women, the third is clothing, and so on and so forth. We can see here that this is a good collection of fashion wear. This is a rarer LaCroix, by the way; I hope you guys appreciate that I'm drinking the limoncello one, and it tastes just like key lime pie. It's just amazing. It doesn't sound good, you know, key lime pie, but it does. There is a key lime flavor, but this one's actually quite good. If I look at all categories here, you can actually see that the brands are present in the category types. For major brands, Amazon labels every product that's Nike as Nike, and so, using this all categories variable, we can see that there are 8,327 products associated with Nike that are for sale on Amazon. That's a whole lot. We can do this for all kinds of other categories: we can do this for Nike, we can do this for Adidas, we can do this for whatever product we want to fetch out of the data. Now we know that we should be able to get 8,327 product IDs so that we can then go and fetch the reviews for those product IDs. Let's see if we can do it.

I'm going to create a new variable here called allnikeasins. Why do you think I'm using a set here? Well, I'm using a set because a set is a unique collection of entities, so it removes duplicates, and that is important: we might run across a product that was accidentally entered into the catalog twice, and we don't want to duplicate those ASINs; we just want a unique list of ASINs. So I'm saying, hey, make a new set here called allnikeasins, and of course I'm counting through the data as I go. Really it's very simple: this time I'm iterating through that first dictionary that I created, which is all products. I'm looking through each product and fetching the categories inside of it, and remember there are multiple categories, so we've got to iterate through each category in the list of categories. If Nike is a category in there, then I want to go ahead and add that product to the set. Now, I went ahead and lowercased the category and used the lowercase 'nike', because I didn't know if Amazon was consistent with its capitalization. I wanted to make sure that even lowercase Nike products still made it into our ASINs list. It turns out that they were consistent, so this isn't going to matter. But when we're matching terms based on text, we want to be really clear on what the capitalization practices are, and if we have any reason to suspect they won't be consistent, then we want to lowercase everything to make sure a lowercase version is still a match. If I find a Nike category in a product, then I'm going to go ahead and add the product to my Nike set, and specifically just the ASIN, because that's all I need. I need to know the IDs that are associated with Nike products; then I'm going to collect all those Nike reviews.
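Here is a minimal sketch of the sorting step and the Nike ASIN extraction, continuing from the sketch above. The exact match on the lowercased category string is my assumption about how the lecture does the comparison.

```python
sorted_categories.sort(reverse=True)        # highest product count first
for num, category in sorted_categories[:50]:
    print(num, category)                    # top 50 categories

print(all_categories['Nike'])               # 8,327 Nike products in the lecture's data

# Collect the ASINs of Nike products; a set keeps them unique.
all_nike_asins = set()
count = 0
for asin in all_products:
    count += 1
    if count % 100000 == 0:
        print(count / len(all_products))    # fraction finished
    product = all_products[asin]
    if 'categories' in product:
        for category_list in product['categories']:
            for category in category_list:
                if category.lower() == 'nike':           # lowercase in case capitalization varies
                    all_nike_asins.add(product['asin'])  # just the ID; that's all we need

print(len(all_nike_asins))                  # should also come back as 8,327
```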
If I do this, you can see that this time, as I'm iterating through the data, I get a series of decimals that culminate in one, and then I get that lovely 8,327 ASINs, so it matches perfectly. This tells me two things about the Amazon data: one, the capitalization did not matter, and two, there were no duplicated values, because we used a set here and it still comes back with the same count we tallied when we were counting categories. We've got 8,327 products and they are unique. That's all we're really going to do in this first part: get our data in here, parse out the ASINs, and now we've got a list of ASINs. I'm going to save this in my working directory so that I can reference it later, and then I can go through and actually retrieve the reviews associated with the ASINs in the next step. I'm going ahead and opening a file and setting it to the write parameter, and then I'm just going to go ahead and write it. You don't really need to be specific with these formats; I mean, you could write it as a pandas file, you could write it however you want. But since this is just a list of IDs, and there are no special characters or anything in Amazon product IDs that might screw up the list, I can literally just write the entire set joined with commas. So what format is a list of entries separated by commas? A CSV. Even though we're calling this allasins.txt, we've essentially just made a CSV where each ASIN is separated by a comma. Then we close the output file. Remember, anytime you open a file to write, you've got to close it to finish it, or else you can get all kinds of wonky results, so always remember to close that file. Now you should have in your Google Drive something called allasins.txt. Let's go ahead and browse our drive. I actually have not connected yet, because I'm just reading this off of a run that already finished, so I'm going to reconnect to make sure my drive is connected; that's something you should do at the beginning of the lecture. Now I see my drive is there, so I go to My Drive and browse through my folders, and there it is, under master files, under topic modeling, and I can see that I have my allasins file where I saved it. We're all ready to go for the next step.
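A minimal sketch of that save step; the filename allasins.txt matches the lecture, and the rest is illustrative.

```python
out_file = open(wk_dir + 'allasins.txt', 'w')   # 'w' = open for writing
out_file.write(','.join(all_nike_asins))        # one line of ASINs separated by commas: effectively a CSV
out_file.close()                                # always close a file you opened for writing
```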