Hello, in this video and the next two videos, we will talk about the challenges that you will face in using health data. In this video, we'll talk about these challenges: data quality, interoperability, and the data system architecture. You will see how data quality affects your analytics and how that, in turn, may affect clinical operations and research. We will also look at the data interoperability challenges that can hinder both health operations and research.

So, what are the data management challenges? We have data quality challenges, which we'll talk more about: accuracy, completeness, and timeliness issues. We have data linkage and integration challenges, such as interoperability, or simply being able to link one dataset to another, knowing who's who, and creating something called a master patient index, or MPI. We have data system architecture issues, where one architecture might be centralized and another distributed, and each has its own advantages and disadvantages. And we have data access and privacy challenges, which we will talk more about in the next video.

So, what is data quality and what are the challenges? Here is a simple diagram built around one of the simplest variables in a clinical data warehouse or an EHR: weight. You might think weight is one of the cleanest data elements there and should be very easy to work with, but it's not that easy, especially if you're looking at a large sample of a population. There are measurement issues: the weight scale may give different results and may have validity and reliability problems. There may be issues with the temporal data, where for some days you have the data and for other days you don't. That is the problem of whether the actual value was captured at all, and as a data scientist you will never know the real value, because you never know what happened in that office. There are also issues with how the data is recorded: maybe the user typed it differently, used the wrong unit, made an assumption and truncated the decimals, or added something that turned the value into free text instead of a number. All of these user-introduced issues add messiness to your data. The database structure can bring in more data quality challenges as well: the database may automatically truncate the data, convert units, or run a cleaning algorithm that changes some values, and many other things can affect how the data is stored. And even when you get the data into your analytical program, it's up to you how to clean it. Sometimes you have three weights for a patient in one week, or three weights in a year, and you have to decide which one to use: the last one, the first one, the middle one, the mean. It depends on what you want to do. How you clean the data, how you remove outliers, and how you detect typos will all change your results, as sketched in the example below.
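To make that concrete, here is a minimal sketch in Python of the kind of cleaning decisions just described, using pandas on a small made-up weight table; the column names and the plausibility thresholds are assumptions for illustration, not a standard cleaning rule.

```python
import pandas as pd

# Made-up weight records; column names are illustrative, not a real EHR schema.
weights = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2023-01-02", "2023-01-04", "2023-01-06",
                            "2023-03-01", "2023-09-15", "2023-05-20"]),
    "weight_kg": [82.4, 820.0, 81.9, 65.0, 66.2, 70.1],  # 820.0 looks like a typo
})

# One possible policy: drop implausible values, then keep the most recent
# plausible weight per patient. Keeping the first value or the mean instead
# would be an equally defensible choice, and it can change downstream results.
plausible = weights[weights["weight_kg"].between(20, 350)]
latest = (plausible.sort_values("date")
                   .groupby("patient_id", as_index=False)
                   .last())
print(latest)
```

The point is not these particular thresholds but that every one of these choices is an analytic decision that should be made deliberately and documented.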
There are many dimensions of data quality; I've seen up to 10, 11, or 12 different dimensions, but in this video we will cover just three of them: accuracy, completeness, and timeliness. Accuracy is basically the precision of your data: whether your data really represents the actual value. It is also relative, because how much accuracy you need depends on what you want to do with the data. Completeness is whether the data exists or not, and it's one of the easiest dimensions to measure. Timeliness is whether the data is there when it should be there, whether you will have it in time for your analytics, or whether there will be a delay. These are the three we will talk about; again, data quality has more dimensions, and you can always look online or read the literature to find out more.

Here is data accuracy. As I said, if you have a weight scale and you're weighing patients, you have, say, 14 patients in this little table along with their actual weights. You can then see these weights represented in three different databases on the right side, and each of them has tweaked the values a bit. The first database, for example, dropped the decimals and replaced them with a zero, so truncation or rounding changes the accuracy. Maybe your research is sensitive enough that you need the decimals; then that database is no longer good for you, but maybe it's all you have. In the second database, something has happened to the data, perhaps a conversion from pounds to kilograms. If you are not aware of it, you might think all of these patients are underweight because you assume the values are in pounds, or the other way around. In the third database, there are obvious typos; maybe somebody wrote the weights on a paper chart and mistyped them when entering them. All of these are accuracy challenges, and frankly there is no turnkey solution that fixes all of them; it takes a lot of work to fix accuracy.

Completeness is usually easier to measure. Here, for example, is another database of those 14 patients and their weights, and there are simply some null values or empty cells showing that the data was not collected. The data could be missing for a patient entirely, or missing only for particular encounters or visits. On the right side, you can see that patient number 2 had five visits, and weight was collected in three of them but not in the other two. Completeness is probably the first thing you would measure, to make sure you have enough data for the population to start with, but accuracy, as I said, is the bigger challenge.
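As a rough illustration of how completeness, and a crude unit sanity check like the pound/kilogram mix-up mentioned above, might be computed, here is a sketch assuming a hypothetical visit-level table with a nullable weight column; the 1.8x threshold is only a made-up heuristic, and real unit reconciliation needs metadata from the source system.

```python
import pandas as pd

# Hypothetical visits for patient 2; None marks a visit where weight was not recorded.
visits = pd.DataFrame({
    "patient_id": [2, 2, 2, 2, 2],
    "weight":     [68.0, None, 150.0, 67.5, None],
})

# Completeness: share of visits with a weight recorded (3 of 5 here).
completeness = visits["weight"].notna().mean()
print(f"Completeness: {completeness:.0%}")

# Crude accuracy check: a value far above the patient's median may be in pounds
# rather than kilograms (150 lb is roughly 68 kg).
median_weight = visits["weight"].median()
print(visits[visits["weight"] > 1.8 * median_weight])
```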
Timeliness is another perspective on data quality, and it matters most for operational purposes. If your analytical model is feeding into operations and the data doesn't arrive in time, you can never keep up with the pace of operations and inform them in a timely fashion. Here, for example, you can see that when EHR systems send billing data to an insurer, it goes through a lot of hoops: patient pre-authorization, benefits verification, charge capture, coding, claim submission, and then denial or acceptance, before it shows up in a claims database. So the data latency could be anywhere from one or two days to almost 30 days. If you want to run something on claims data, that latency might create issues for you; timeliness is simply another aspect of data quality.

Here you can see some of the potential sources of health data, like EHRs, claims, and so on, and each of them may have different data quality issues, either in reliability or in how easily the data can be accessed. I don't want to go through the entire list, but I'll mention two of them. Lab values, for example, tend to have good reliability because much of the data is machine-generated, but access to lab values across a large population is limited, so accessibility is on the low side. For something like claims data, on the other hand, the quality and reliability are on the medium side, but accessibility at a population level is much higher.

So, we talked about data quality. The next data challenge is interoperability. Data interoperability is basically defined as the ability of a system to exchange electronic health information with, and use electronic health information from, other systems without any special effort on the part of the user. If you look at the image, there have been some drivers in the US that have improved data interoperability, and they are listed there, but the core definition is in the middle: you should be able to send information, and the receiver should be able to receive it, find the information they need in what you send, and then use it. That is interoperability. A lot of times you send information and the receiver cannot use it or cannot find what they want, and they cannot send you what you need either. Interoperability is very important because without it, data stays within silos and you never have a good representation of all the data you need for your entire population.

Here is another way to look at data interoperability. In the middle of the picture, from left to right, we have the patient, the practice, the population, and the public, and interoperability is about making sure that data is exchanged among all of these entities. Here is a more schematic way to look at it. There are two population-level health databases here, one on the left labeled number one and another on the right labeled number two. Number one has connectivity to some data sources, like claims, partial connections to some EHRs in the region, labs, and surveys, but population health database number two has a much better connection with claims, labs, and surveys, and also partially has mHealth data, social data, and an EHR. So population health database number two has a better representation of all of the data in that region, for that population, than number one.

Interoperability, of course, also brings up the problem of data linkage and integration. As soon as you want to integrate one EHR with another, you also need to reconcile who is who, because you want to bring together all of the records for patient A from both data sources. To do that, you need an index, something that identifies who is who: the master patient index, or MPI. The MPI is what lets you connect data sources such as one EHR with another EHR, an EHR with insurance claims, and so on. But creating the MPI is hard, and it can even introduce errors or bias into your merged databases.
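To make the MPI idea concrete, here is a toy sketch of deterministic linkage between two hypothetical patient lists; the field names and the exact-match rule are simplifying assumptions, and real master patient indexes usually rely on probabilistic matching and manual review.

```python
import pandas as pd

# Two made-up source systems with overlapping patients.
source_a = pd.DataFrame({
    "name": ["John Smith", "Ana Lopez", "Mei Chen"],
    "dob":  ["1980-05-01", "1975-11-23", "1990-02-14"],
    "zip":  ["21201", "21230", "21218"],
})
source_b = pd.DataFrame({
    "name": ["John Smith", "Ana Lopez", "Mei Chen "],  # trailing space: a small data-entry quirk
    "dob":  ["1980-05-01", "1975-11-23", "1990-02-14"],
    "zip":  ["21201", "21230", "21218"],
})

# Deterministic rule: identical name, date of birth, and ZIP means the same person.
# "Mei Chen " fails to match because of the extra space, which is exactly how
# linkage can introduce errors or bias into a merged database.
matches = source_a.merge(source_b, on=["name", "dob", "zip"], how="inner")
matches["mpi"] = ["MPI-%04d" % i for i in range(1, len(matches) + 1)]
print(matches)
```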
Another challenge is that if you want to create an MPI for data linkage, you can't just use names, because a lot of people might be named John Smith. You need the date of birth, address, phone number, gender, and, if available, the social security number or any other unique identifiers that help you link patients with better accuracy. The problem is that most of these data elements, like name, date of birth, and address, are protected under HIPAA, the Health Insurance Portability and Accountability Act, and they are not easily accessible to researchers. So most of the linkage has to happen on the operational side, and the researcher then gets a version with all of these identifiers stripped out.

Another challenge is the architecture of your data warehouse or population-level database. It can be either centralized or federated, federated meaning distributed and non-centralized. In a centralized architecture, all of the data sources feed into one big database. The advantages are that it's simple, the data is consistent, you can manage everything in one place, you can make sure everybody is using the same standards, and it's easier to link patients. The disadvantages are that it doesn't scale up well, there can be problems managing such a big database, the data providers have to trust you enough to hand over all of their data, it needs a lot of leadership, and the communication infrastructure has to be in place. So it is tough to do, but more and more we do see centralized architectures for population-level data sources. Here's an example: a data repository that gets data from multiple EHRs, multiple claims sources, surveys, personal health records, clinical registry databases, and laboratory information systems, all in one place, so you end up with a good representation of all types of data about a population.

On the other side is the federated architecture, where the advantage is that the data owners keep their data. They do not send their entire data; they only send the patient identifiers needed to create an MPI. Because you have a central MPI instead of a central database of all the information, if somebody asks, "Can you give me data on this specific patient?", you can query all of those databases, because you know who is who, and compile the data at runtime. That's the advantage: it's very scalable, it builds on the existing infrastructure, and there is more room for creativity. But there are disadvantages as well: it needs a lot of coordination, you can't run very large queries because every query has to look up all of those smaller databases, and there is always the question of how accurate the MPI is. Here is a schematic representation: all of these standalone databases belong to different providers, EHRs, claims, surveys, and so on, and their data does not go into the central data warehouse. The only thing in the data warehouse is the master patient index, and when you query it, the infrastructure queries all of those databases in real time and brings you the data about that patient.
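Here is a toy sketch, in plain Python, of that federated pattern: only the MPI lives centrally, and patient-level data is pulled from each source at query time. The source names, local identifiers, and lookup function are hypothetical placeholders, not a real federation protocol.

```python
# Central store holds only the MPI: which local IDs belong to the same person.
central_mpi = {
    "MPI-0001": {"ehr_north": "pt-55", "claims_payer": "mbr-9001", "lab_system": "acc-3321"},
}

# Local data stays with its owners; in practice these would be remote queries.
local_sources = {
    "ehr_north":    {"pt-55":    {"weight_kg": 81.9, "diagnoses": ["E11.9"]}},
    "claims_payer": {"mbr-9001": {"claims_count": 14}},
    "lab_system":   {"acc-3321": {"hba1c": 7.2}},
}

def federated_lookup(mpi_id):
    """Resolve a person through the central MPI, then query each source at runtime."""
    record = {}
    for source, local_id in central_mpi[mpi_id].items():
        record[source] = local_sources[source].get(local_id, {})
    return record

print(federated_lookup("MPI-0001"))
```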
So, in summary, we talked about these data challenges: the data quality issues of accuracy, completeness, and timeliness; the interoperability, linkage, and integration challenges, especially the master patient index; and the architecture and design of a data system, whether centralized or federated. Thank you.