Hello, this is the last video of this module, and we will finish the challenges that you may face while using health data. More specifically, we'll talk about denominator selection issues, privacy issues in using such data, and a couple of other challenges. So the key takeaways are: explain the denominator selection challenges when you want to run analytics on health data, list privacy challenges when using health data, and identify other potential challenges in using such data. So what is a denominator or variable selection issue? In any research you need to define what the population of interest is. You need to write a query and find the patients that match your criteria, and that is more challenging than you think. There is something called denominator selection, where you want to find patients that match certain criteria. So for example, they have a certain age. They have a certain type of disease. They have a special condition like a disability. They have a special insurance coverage. They have a special social condition, or maybe they have an acceptable data quality level. So that is the denominator selection. And depending on whether you are using one database or a couple of databases, you might have issues with interoperability, standardization of the codes, and everything else. And no one simple SQL query will work on all of these different data sources. So the denominator selection is very, very important. Another issue with denominator selection is that if you don't select a good denominator of interest, you might have already cut out the signal that you wanted to look at, or you might have already included too much noise in the selection, which basically interferes with your analytic process. Now it's not only about which patients to select, there are also other selection issues, like timeframe. What is the length of the data that you need?
Outcome selection: what outcomes are you selecting to show the effect of certain variables? And then there's also something called factor selection and reduction. How many variables do you want to have in your model? Can you have all of the ICD codes, like 100,000 different ICD codes? Probably not. It doesn't work in a regression model that way, so you need to be very careful about how you prep your data to make sure that it doesn't introduce bias or skewness but still stays doable. And the last item, of course, is purpose identification, making sure that what you have identified and what you will do with your analytics fits the purpose of that operation or research. So if you remember from the previous videos, we did have a process for how to do analytics, and that entailed bringing in databases, cleaning them up, and then running analytics. And eventually closing the loop, where you share the analytics with others so they can run it, and they generate data that then feeds back into the databases that you are using. Now while you're developing the analytics, you might have that denominator selection issue. And here at the bottom of this diagram I'm trying to show you what the issue is. You can see the highlighted area here; those might be the rows that include the patients that are of interest, so that you can predict something, so you predict y, which is your outcome. And for the base population that you selected, all of those rows of course have a bunch of variables, shown here by x1, x2, up to xN, and they represent all of the variables that you think would be predictive of that outcome, or y. So the number of rows, the number of patients that you're bringing in here, is the denominator selection challenge that you would face. As I mentioned earlier in this video, you might also have timeframe selection issues.
So as you can see here, at the bottom of this diagram, you don't know whether you need one year of data, two years, or maybe ten years of data to better predict something next year. So that is also a challenge: how much of a timeframe do you need? Does the database have enough years, enough instances, enough encounters for a given patient that you can then build your analytics on top of it? The outcome selection is another challenge. You don't know exactly what you want to predict. Sometimes you just say, I want to predict hospitalization. But then you find that in the database there are like ten different columns that indicate some sort of hospitalization, so which one should you use? Is the admission alone enough, or should you predict the discharge, or maybe something in the middle? Maybe the admission has a couple of different columns indicating different things. So that also becomes very challenging, and you have to spend enough time to figure out what the y is that you are trying to predict in your analytics. As I stated earlier, factor selection and reduction is also a big challenge. And that is very important when you want to bring in variables that have a lot of different categories, like diagnostic codes or medications. So we might have like 100,000 different diagnostic codes and 50,000 different medication codes, and do you want to just put them all in your statistical or machine learning method? Maybe not, or maybe yes. So depending on the method you want to use, sometimes you can manually prune them or manually group them to make it easier. Or you can use an automated approach to reduce this space, the number of factors that you're putting into your model to predict the outcome of interest.
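To make the manual grouping and pruning idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the patient records are made up, the grouping rule (keep the three-character category in front of the dot) is one possible manual rule rather than a recommended one, and the names `group_code`, `build_features`, and `min_patients` are hypothetical.

```python
from collections import Counter

# Hypothetical diagnosis codes per patient (illustrative values, not real records).
patient_codes = {
    "p1": ["E11.9", "E11.65", "I10"],
    "p2": ["E11.9", "J45.909"],
    "p3": ["I10", "I10", "J45.20"],
    "p4": ["E11.21", "S52.501A"],
}


def group_code(code: str) -> str:
    """Collapse a detailed code to its three-character category (an assumed grouping rule)."""
    return code.split(".")[0][:3]


def build_features(codes_by_patient, min_patients=2):
    """Return (feature_names, rows) with one 0/1 column per sufficiently common group."""
    # Manual grouping: each patient keeps the set of code categories they carry.
    grouped = {
        pid: {group_code(c) for c in codes}
        for pid, codes in codes_by_patient.items()
    }
    # Pruning: count how many patients carry each group and drop the rare ones.
    counts = Counter(g for groups in grouped.values() for g in groups)
    features = sorted(g for g, n in counts.items() if n >= min_patients)
    rows = {
        pid: [int(f in groups) for f in features]
        for pid, groups in grouped.items()
    }
    return features, rows


features, rows = build_features(patient_codes)
print(features)    # the groups that survive pruning
print(rows["p1"])  # 0/1 indicators for patient p1, in the same order as features
```

In this toy data, seven distinct detailed codes collapse into three indicator columns. The threshold and the grouping rule are both choices that change which signal survives, so they are worth treating as parameters rather than fixed decisions.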
So that also becomes very tricky when you work with health data, because you will soon find out that a lot of these variables might be correlated with each other, and just dropping one might actually remove some of the effect of another one. So you would sometimes need to work with clinicians or senior data analysts who understand how to select the variables of interest before you put them into some automated version of so-called factor or feature reduction methodologies. And of course, you need to have a clear purpose for why you're doing it. And that is always a problem: people come up with a question of interest that has implications for operations or research. You try to find the data, and the data does not have everything you need to answer that question. So you slightly modify the question, you get a slightly different answer, and then that answer is interpreted in a slightly different way. And at the end of the day you can see that you are basically answering a completely different question than the original question that people had in the first place. So that is also a tough act. As a data scientist you need to find the silver lining, or sort of the best use case, where the questions, the data, and the results of interest can be harmonized in a way that gives the best answer for the highest need on the ground. Apart from the typical analytics challenges that I discussed, there are also data governance issues. You cannot get the data that easily. It takes months or years, depending on where you are and what data governance challenges you face. Here in the US, one issue is the HIPAA rules, or the Health Insurance Portability and Accountability Act, which was originally designed to better manage the data that payers and maybe commercial vendors exchange. But now it affects all sorts of research, and it basically states that certain data types are protected.
They are called protected health information, or PHI. You can see that list here: names of patients; geographical data, especially if it pinpoints a location, like an address; all elements of dates; telephone numbers; fax numbers; email; social security number, which is a unique number given to each person in the US; medical record numbers of hospitals within an EHR or a claims database; and a lot of other things on this list. So as you can see, all of this data could be used to find an individual person in the real world. And if you're just doing secondary use of data, where you're going into a data warehouse to query it and run analytics, there's no way to go and get consent from 1 million patients. So one way is to make sure that you remove all of these data elements, so there's no way that somebody accessing the data can identify these patients. Then you might be able to get a waiver of consent and go ahead with your research. So this is very important, and you will see it as you start building your career in health data science. Fairly soon you will understand that a lot of times there are delays around how to get your hands on the data, how to remove these data elements, what can remain in it in a limited way, and all of the other issues about data access and privacy. So to comply with HIPAA, basically you need to remove all of the PHI elements, but doing this will affect your health analytics in the following ways. First, it limits the use of some key health data, like if you don't have date of birth, geographical data, or the various dates of encounters. There are ways to get some of this data without removing all of the PHI, called limited data sets, where, for example, the date of birth cannot be given to you, but maybe the year of birth could be shared with you if patients are less than 85 years old.
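As a rough illustration of what removing these elements can look like in practice, here is a small Python sketch. It is not a compliant de-identification routine: the field names in `DIRECT_IDENTIFIERS` are hypothetical and cover only part of the PHI list, the function `deidentify` is made up for this example, and the `age_cutoff` of 85 is simply the figure mentioned above, so it should be checked against the rule that actually applies to your data.

```python
from datetime import date

# Hypothetical column names for direct identifiers; a real list would follow the full PHI list.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "fax", "email", "ssn", "mrn"}


def deidentify(record: dict, as_of: date, age_cutoff: int = 85) -> dict:
    """Drop direct identifiers and coarsen date of birth to a year.

    age_cutoff is an assumption taken from the lecture, not a verified rule.
    """
    out = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "date_of_birth"
    }
    dob = record.get("date_of_birth")
    if dob is not None:
        # Age in whole years as of the reference date.
        had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
        age = as_of.year - dob.year - (0 if had_birthday else 1)
        # Share only the year, and only for patients below the cutoff.
        out["birth_year"] = dob.year if age < age_cutoff else None
    return out


record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "mrn": "A12345",
    "date_of_birth": date(1960, 7, 15),
    "diagnosis": "I10",
}
clean = deidentify(record, as_of=date(2024, 1, 1))
print(clean)
```

Notice what is lost along the way: without the medical record number there is nothing left to link this row to another data source, and without the full date of birth any age-based variable becomes approximate.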
So there are certain workarounds, but overall it limits some of the key health data types that you may need. Second, there are issues with the master patient index, because if you don't have that identifying information you cannot link the different data sources and merge them to do more analytics. Third, there is the issue of shifting health research to quality improvement efforts, where people try to get around all of these HIPAA requirements and IRB requirements by defining a lot of projects as quality improvement rather than research. But quality improvement means it's for the internal operational mandates of, let's say, a health network, so they are not necessarily going to publish it, and it's not generalizable to others. And that is exactly where the problem is: a lot of times, because of data access and privacy issues, all of the findings remain within the walls of that health system because it's considered quality improvement. That eases the data governance issues, but then it creates barriers to sharing results. And the last item on the list is the increasing cost of developing large but anonymized health data sets for researchers. It's very costly to create these population-level data sources, partly because HIPAA prevents us from sharing the data that is necessary to link all of these data sources. And there are three more challenges that are good to know. One is the process of care, another is the nature of intervention, and the last one is random chance, or external factors. So I'll talk about each of them briefly here, and then more when I show you some diagrams that will help you understand them better. So the process of care basically says that different providers may generate different data values for the same event. So it's the same diagnosis, but one physician chooses this code and another physician chooses a slightly different code.
So the process of care by itself has nuances, and it creates some noise, because not all humans will think the same way about the same value or the same diagnosis or the same medication. Now if you look at it the other way around, we also have something called the nature of intervention, where there might be different interventions that actually mean different things, but different physicians code them the same way. So there are two different diseases, but because they're so close, two different physicians just use the same code. Or there might not even be codes available for these small nuances or differences; there is only one code, and that's why they just use that one code. So instead of showing all of these small differences, the nature of intervention basically hides those differences in your analytics and groups everything in a way that you lose that potential signal. And the last item is random chance. There might be two exactly similar patients who have the same diagnosis, same medications. Everything looks the same, but their outcomes are different, and of course you always need to be aware that not everything is in that database. There are a lot of other factors that you don't see, that are not in the database, that might affect the patient's outcome. You can call them mediators, moderators, external factors, social factors. You can call them anything you like, but there are a lot of things that are not in your database. So let's look at these challenges on a diagram to make them easier to understand. So here's the effect of the process of care. We can see there are two physicians or two provider settings, A and B, and they started using two slightly different codes for the same diagnosis. So for the same diagnosis, provider A coded it as ICD code XXX.XX, just think of the Xs as digits, but the other provider is coding it as XXX.XO.
So it's slightly different from the other ICD code, but all of this creates some sort of noise in your modeling. So on the right side you can see that your model may not predict risk, or any other outcome of interest, as well as you wished. Now in contrast to the process of care, there is the nature of intervention, where there are two different interventions. On the left side of the slide, you can see a high-risk intervention, which is a surgery with general anesthesia, and the other one is a surgery with local anesthesia. So one is high risk, the other one is low risk. But there is no CPT code to tell these two surgeries apart. So they both use one CPT code, which is denoted here by XXXXX. And that loses all of the nuance that you really needed for your analytics. So grouping these CPT, ICD, or NDC codes into something bigger might sometimes help, but sometimes it might also harm your analytics, as shown in the slide. And finally, you might have totally similar patient populations. You run your modeling and you expect that, hey, everybody should have the same outcome. And it doesn't happen, because there are a lot of external factors, and the simple nature of random chance, so people might develop different outcomes and the reason is not in your data. So the most important thing you should always remember is that your analytics, your findings, your results are only as good as your data. And if you're missing things in your data, if your data has any challenges, data quality, interoperability, coverage, you name it, then your results will always be bound by those limitations. So in summary, we talked about the denominator or variable selection challenges, including patient denominator selection, timeframe, outcome, and factor selection, and also the purpose of the study.
We talked about HIPAA and data access and privacy challenges, and also a couple of other challenges, such as the process of care, the nature of intervention, and random chance or external factors, that could always challenge your analytical efforts. Thank you.