Hello, and welcome back to this course. In this video, we're talking about looking through the file system for data of interest. We can define data of interest fairly widely. We could be looking for user account credentials stored in insecure password files. For example, if you've got a [inaudible] file on your computer that has all your passwords in it, that would be data of interest. We could also be talking about consumer data or customer data, intellectual property, etc. Essentially, our goal here is to look for data on a particular system that might be of interest to an attacker. We're going to determine whether data is of interest using a few different heuristics. Our Python file on the left here shows how we're going to perform our file analysis. We're going to check three different factors: keywords in the name of the file, keywords in the contents of the file, and how the file is used, based on its timestamps. Before we do that, let's move down to the bottom here, where we'll start out. Here in our main function, we're defining the directory we're going to look at. In this case, I'm going to look generally through my Documents folder, and I'm going to call the file search function to go through that. Then, once we're done and get some results from that file search, we'll print out each result for future analysis. Our file search function is the main driver behind our program. In file search, we take our root directory and call os.walk, which lets us search through that entire directory. We're going to look at every file in it, and we're going to look at subdirectories as well. For each directory it visits, os.walk gives us the directory path and a list of the files within that directory (along with a list of its subdirectories).
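As a rough illustration of what os.walk hands back, here is a minimal sketch. The function name list_tree is made up for this example and isn't from the script on screen; it just collects each directory path together with the files found in it.

```python
import os


def list_tree(root):
    """Return (directory path, filenames) pairs for root and every subdirectory.

    list_tree is a hypothetical helper used only for illustration.
    """
    results = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory visited.
    for dirpath, dirnames, filenames in os.walk(root):
        results.append((dirpath, sorted(filenames)))
    return results
```

Calling list_tree on a Documents folder would return one entry for the folder itself and one more for each subfolder beneath it.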
For each of the filenames that we found, we'll call os.path.join to build the complete path to the file. Because os.walk also descends into subdirectories, the directory path isn't always the directory we passed in, so joining it with the filename keeps the code correct for nested files and a little more adaptable. Once we've built that fname, we have a complete path to a file on the system, so something within my Documents folder. Now we're going to start our checks. The best way to guarantee that we find everything of interest would be to scan through every single document on the system. However, that's slow, it's noisy, and it's inefficient. There's a high probability that we're going to get tagged as malware and be blocked by the system. Accessing every file on the system is the sort of thing that ransomware does. While we're just reading from the files rather than reading them and then modifying them by encrypting them, it's possible that we'd still be flagged by anti-ransomware solutions, and we don't want that. We also want to make our search faster. So we're only going to look inside a file if we've already determined that there's a good chance it might be of interest to us. We're going to use two heuristics for that. One is a keyword check against the filename, and one is a look at usage data from the timestamps for that particular file. Let's look at the keyword check first. This is fairly straightforward. We've got our filename and we've got a list of keywords. In this case, I'm particularly looking for files with the word password in the name. If you've got, say, passwords.txt on your system, I don't care about the heuristic for access, etc. I want to look at that file because the passwords in there are useful information and might help with compromising other systems, accounts, etc. I could definitely expand this list to include other keywords as well.
That might mean looking for a customer database, patent information, other intellectual property, really anything on your system that I could guess a keyword for and that might be of interest. What we're going to do is use a list comprehension to determine whether any of our keywords appears in the filename. How this works is we'll take our list of keywords and iterate over it within the list comprehension, where k is one particular keyword. We'll then use Python's in operator to see if that keyword exists in a particular string, where that string is the filename converted to lowercase. The result of this list comprehension is a list containing True or False for each item, depending on whether the keyword at that position in our keywords list matched the filename. A list of Trues and Falses isn't really what we want. We want to know whether we matched something in our list of keywords or not. To do that, we can use the in operator again. If we say True in this list, we get True if there's one or more True in the list of results, or False if they're all False. That tells us whether or not our filename includes a keyword that marks it as being of interest. That's one of the quick and easy ways in which we determine if we want to look at a particular file. The other one is more of a heuristic based on user behavior, and that's in our usageCheck function. Imagine that you've got an important file that has useful information in it, information that you need regularly. For example, you might have a password file. You're going to open and access that password file all the time, but it stays fairly static, because unless you're creating new user accounts all the time, most of what you're going to do is read from it, either to type a password into a prompt or to copy and paste from that password file into a webpage.
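Before we get into the usage check itself, here is a minimal sketch of the filename keyword check described a moment ago. The names keyword_check and FILENAME_KEYWORDS are assumptions for illustration, and the exact code on screen may differ.

```python
# Assumed keyword list; the video only mentions "password" and says it could be extended.
FILENAME_KEYWORDS = ["password"]


def keyword_check(filename, keywords=FILENAME_KEYWORDS):
    """Return True if any keyword appears in the lowercased filename."""
    name = filename.lower()
    # One True/False entry per keyword, in the same order as the keyword list.
    matches = [k in name for k in keywords]
    # Collapse the list into a single answer: did at least one keyword match?
    return True in matches
```

Writing any(k in name for k in keywords) would give the same answer and is arguably the more idiomatic form, but the version above mirrors the two-step explanation.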
If you've got a file that's relatively old, meaning it was set up a while ago and hasn't changed much, but it's accessed frequently, that might be something of interest to us. That's what we're taking a look at with usageCheck. We pass in our filename, and then we use the pathlib library to grab some timestamp statistics for the file. From the result, we can ask for the access time, the modification time, and the creation time. This if statement here is just a little bit of a heuristic to determine whether the file meets some of our criteria. For example, if we have an access time that is a certain amount later than the last modification time, that might be of interest, because it suggests a file where the last time we added a password was a while ago, but we're still opening it because we need to read those passwords from it. That matches our idea of the behavior we'd expect from something of interest. We don't want data that's constantly changing, because that might just be nothing, but we want to ensure that it's something that's still accessed, so it's not just filed and forgotten. The other heuristic we're going to use in this particular case is making sure that the modification time and the creation time aren't the same. This is just a guess: if you've got a password file, you've created it and added your first passwords, and then you've probably added more passwords to it at some point. That means the creation time and the modification time aren't the same. Really, we should be using some threshold here, because the modification time will almost never exactly match the creation time: you create a file and then you save it, and those are different times. But if we had, say, another threshold requiring that they differ by a certain amount, we could use that as a heuristic for user behavior. There's no guarantee that this is a set of perfect heuristics. This is just an example.
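To make the timestamp comparison concrete, here is a minimal sketch of a usage check along these lines. The 30-day gap, the snake_case names, and the split into two functions are all assumptions for illustration; the video doesn't state the threshold it uses.

```python
from pathlib import Path

DAY = 24 * 60 * 60       # seconds in a day
ACCESS_GAP = 30 * DAY    # assumed threshold, not taken from the video


def timestamps_look_interesting(accessed, modified, created, gap=ACCESS_GAP):
    """Apply the two heuristics to raw timestamps (seconds since the epoch)."""
    # Heuristic 1: the file was read well after it was last changed.
    read_long_after_edit = (accessed - modified) > gap
    # Heuristic 2: the file was changed at some point after it was created.
    edited_after_creation = modified != created
    return read_long_after_edit and edited_after_creation


def usage_check(filename, gap=ACCESS_GAP):
    """Fetch the file's timestamps with pathlib and apply the heuristics."""
    stats = Path(filename).stat()
    # Note: st_ctime is the creation time on Windows, but on Unix-like systems
    # it is the time of the last metadata change, so results differ by platform.
    return timestamps_look_interesting(
        stats.st_atime, stats.st_mtime, stats.st_ctime, gap
    )
```

Keeping the comparison separate from the stat call makes it easy to try different thresholds against known timestamps without touching real files.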
We might want to say it's been accessed recently, which isn't something we're testing for here. Or we might want to say the creation time was a certain length of time ago. Or, if we've got some other information, maybe we know that a particular user account was created at a certain time, we might want to ask whether the file was modified around that time, indicating that a password was placed in there, etc. But for these heuristics, if the file meets the requirements we'll return True; otherwise we'll return False. Between these, if we have a filename that's of interest, say you've got a passwords.txt, I definitely want to look at it. Or we have something that meets the behavior heuristics we built: something that was created a while ago and modified a while ago, but accessed more recently because you're copying and pasting those passwords out. That's of interest as well. If we trigger either of those, we say, okay, it's worth opening this file and looking inside. That's what our contents check function is here for. It's designed to do essentially a keyword search of the contents of the file. This is the sort of thing that's computationally expensive, but it can help us determine whether or not we've got anything good. Here our keyword list is a bit longer, and it's focused on finding passwords. If you've got a passwords file, you probably say, this is the URL that it's associated with, and then here's the password. Matching on passwords is tough because they should be random, but hopefully we can hit one of the URLs. We're looking for things like HTTP. If you've got the entire URL, it'll start with HTTP or HTTPS, so that's helpful. We're looking for things like .com, .org, .net, .edu, and .gov, common top-level domains that would probably show up there.
Then also, if you're the type who labels your passwords as this is the Facebook password, this is the Twitter one, this is the Gmail one, on the assumption that you know how to get to those sites from that information, then we'll match some of those most common account names as well. If we have any of these in the file, it's of interest. Here, we're going to open the file. We say with open(filename, "r") as f. This means open the file and give us a handle to it in f. Then we try to read in the contents of the file. If we succeed, we'll go on to this return statement; otherwise we'll return False, because if something goes wrong and we can't read the file, we want the code to keep going rather than error out. Then we use the same list comprehension that we used in the keyword check to determine if we match the keywords. In this case, we actually have multiple keywords, so we'll end up with a list with multiple True and False values, and then we'll use that True in check to determine if there's at least one True result in there. This combination of heuristics is going to return false positives. We'll see that in just a moment. But it gives us a shot at identifying files of interest. You might develop your own heuristics based on use cases, behavior, etc., that you can use to narrow your search to the files and directories that are most likely to have something you actually want. Let's give this a run. We'll call python FileAnalysis.py, run it, and we should get a few results, and we do. We've got a few files here. Ironically, they're all Python files. The reason we got hits for these is that when developing this learning path, I was using some of the previous learning path's Python for Cybersecurity courses as reference. Those learning paths were developed over six months apart, so the last modified time of these files is six months earlier than the last access time.
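For reference, a minimal sketch of the contents check walked through above might look like this. The function name, the exact keyword list, and the lowercasing of the file contents are assumptions for illustration rather than the code on screen.

```python
# Assumed keyword list based on the examples mentioned: URL fragments,
# common top-level domains, and a few well-known account names.
CONTENT_KEYWORDS = [
    "http", ".com", ".org", ".net", ".edu", ".gov",
    "facebook", "twitter", "gmail",
]


def contents_check(filename, keywords=CONTENT_KEYWORDS):
    """Return True if any keyword appears in the file's text, False otherwise."""
    try:
        with open(filename, "r") as f:
            contents = f.read().lower()
    except (OSError, UnicodeDecodeError):
        # Unreadable, missing, or non-text file: skip it and keep going.
        return False
    # Same pattern as the filename check: one True/False per keyword.
    return True in [k in contents for k in keywords]
```

Because "http" is matched as a bare substring, any source file that mentions HTTP at all will count as a hit, which is one way a check like this produces false positives.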
We're not hitting our keyword check, since none of these have the word password in their names, but we're hitting our usage check criteria because I used them as a reference, just like you'd use a password file. Now, let's take a look inside one of these to see why we're hitting our contents check. I'll just do type, paste that path, and put it in quotes because of the spaces. I accidentally hit up there. If we scroll through this, I'm certain that we're going to find one of the things that we're looking for. In fact, here it is. We've got a false positive here because the word HTTP appears in the file. Because I used HTTP without the colon and slashes after it, so that we'd match both HTTP and HTTPS, we're getting a hit here. We might find similar things for the rest of these Python files as well. It's a false positive in this case, but if I had files with the word password in the name, or maybe a key file, or other keywords, or some type of valuable information other than credentials, this file analysis script might have helped us find them more efficiently than doing a brute-force search through the contents of every file on the file system. Thank you.