[MUSIC] Hello. In this video we will continue discussing the data challenges that you may face while working with health data. More specifically, in this video we'll talk about Big Data challenges and the problems you may face when you run analytics. So the key takeaway is to identify challenges in using various Big Data health data sources for analytic purposes. So what is Big Data? By definition, Big Data is a collection of data sets that are so large and complex that it becomes impractical to manage them using traditional database management tools or data processing applications. So you often need something that goes beyond just a relational database. Just to give you some background: for a long time we had relational databases, then more and more we created something called web data, and then there were data warehouses for all of this. Then unstructured data grew, and we needed to mine that unstructured data. Then virtualization and parallel processing started in the computing world, which also improved in-memory databases, and those became faster and faster at running queries on very large data sets in almost real time. And nowadays, of course, it's cloud-based computing, which helps with the storage of Big Data. It comes with its own tools that you can use to better manage the data and then use it for your analytic projects. To give you some history of Big Data in healthcare, probably one of the earliest Big Data projects was the Human Genome Project back in the 90s, which created really large data sets. But that was all genomic data and genomic sequencing. Then, going forward to 2009, when EHR adoption was funded by the HITECH Act, there was a massive rollout of EHRs. There was also integration of various data sources within those EHRs, and that started a revolution in clinical data, which is massive and is considered Big Data as well.
And in the near future, in just a couple of years, we expect that our healthcare data will grow from the 500 petabytes we had in 2012 in the US to more than 25,000 petabytes. So as you can see, processing these very large data sets on a population level would most probably require certain infrastructure and certain methods or tools. So what are the specifications of Big Data? There are many definitions out there in terms of the specs. I like the five Vs. Some people say there are six or seven of them, but these are the common five. So Big Data is not only about volume; volume is just one of them. Volume is basically the quantity of the data: how big it is, how lengthy it is, and so on. There's also variety, meaning the different data types that you have within your database. If the variety is high, then managing the data also becomes very hard. Another is velocity, meaning how fast your data is being refreshed. The faster it refreshes, the harder it is to work with. Just think about ICU monitoring devices feeding, let's say, EKG data in real time. So velocity is another issue with some data types. Another is veracity, meaning how good your data quality is. We have talked about this in previous videos: data accuracy, completeness, timeliness, and so on. That's another Big Data issue that we're facing in healthcare. And the last one is value. What is the added value of this very specific data for the outcomes that I care about? That is also very tough in healthcare. So these five Vs define Big Data, but you don't have to have all five for something to count as Big Data. It could be any of them. So again, it's volume, variety, velocity, veracity, and value. Now, talking about the health analytic process: most of the time we spend a lot of our effort on putting together different data sets and linking them. That is box number 1, which you can see is database development.
And then in box number 2 you can see that the majority of an analyst's time is spent on cleaning up or prepping the data to make sure that the data quality is good enough. Then box number 3 is where the actual analytics happens, meaning that you develop a model and find a pattern, whether it's a statistical approach or a machine learning approach. You mine the data and then you predict something with it. So there's a training, or base, sort of database, and then there's an outcome, or test, database. Now, for these models you of course need to double-check validity and reliability, goodness of fit, consistency, and so on, for all of the data sets you have. That's box number 4. And then we hope this circle will be completed with boxes number 5 and 6, where whatever you generated can be used by others, by other health providers or players that have similar data sets. Then they can generate new knowledge, and that new knowledge can generate new data, and that goes back to box number 1. So basically that creates something called a learning health system, where all of the data that ends up in analytics and all of the findings can feed back into this cycle. So that's sort of the optimal way to look at the health analytic process. Now, talking about the Big Data issues, let's go through all five Vs and see how each affects your process. Remember the first three stages: putting the data together, prepping the data, and then running the models. If your data is very big, usually it doesn't fit in a relational database. So you can see at the bottom of this image there is a Big Data volume challenge, where you might have billions or trillions of rows and thousands and thousands of columns, and the traditional queries may not work. That's where you might need to start using Big Data architectures like NoSQL.
There are a couple here: document databases, columnar databases, graph databases, and XML and hierarchical databases that might fit your operations. Now variety, another Big Data challenge. Let's see how that affects your analytical process. Here you can see at the bottom that there are multiple tables with multiple codings and rigid structures, and it is very hard to incorporate them into one big database on which you can do the analytics. For that reason, you might want to use Big Data platforms that are more cloud based, where you can bring in all of these data as you need them. There are a lot of different concepts in Big Data to deal with this, such as Big Data islands or Big Data lakes, where you can bring data in and merge it with other data in almost real time. Velocity is another Big Data challenge. You can see at the bottom of this diagram that some data, like EKGs or real-time temperature data, don't fit tables. If you have millions of points generated every minute, your tables will very soon get to trillions of rows, and it's very inefficient to manage that using SQL databases. So that's where the Big Data clouds come into play once more, and there are certain solutions to deal with temporal data and make it useful for health analytics. Now, talking about data veracity and data quality. You can see at the bottom of this diagram that there might be issues with data completeness, data accuracy, or data timeliness, and each of them might affect your results. If this becomes very large in scale, you can never fix everything every time, and you need to use some Big Data platforms that can manage some of the data quality checks for you automatically. So you can see there are many initiatives, many concepts, and many tools that you can use as an alternative to the typical SQL tools to improve data veracity. And of course we have the ambiguity of data value. You never know: which table should I bring in to better predict an outcome?
How do I improve my health analytics? That's where there might be Big Data alternatives that can help you better identify which data source is better than the others. Talking about Big Data, you might wonder where the volume of data is. Which data source contributes the most to the volume of Big Data in healthcare? You may remember from previous videos the picture where the patient and the provider interact, and each of them interacts with their own circles. You can see here as well that on the left side are the physicians or providers, each interacting with their own circles and generating different data sources such as EHRs, HIEs, genomic data, and so on. On the right side are the patient, the patient community, and the population around them, all creating data. In this image the proportions are of course not accurate; they are just figurative. But you can see that genomic data is way bigger than EHR data in terms of size, and on the right side the community-wide data sets are way bigger than, let's say, personal health records. This is just to give you an idea of where the heavy lifts of incoming health data are and how hard it is to manage them. So I'll quickly give you some examples of Big Data sources. We have sort of covered them in previous videos about population-level data sources. Here is the example of the Centers for Medicare & Medicaid Services. CMS has a number of data warehouses that hold the Medicare data and also the Part D medication data on almost 50 million members, and these are updated on a daily basis. There are some commercial entities listed on this slide, you can see IMS and so on, that have collated massive amounts of claims data, and some also include EHR data, on large populations, updated either monthly or annually.
Those could also be used for research, but each of them is again very big, and you may not be able to run simple SQL queries anymore for some of them. Here's more, on the Veterans Health Administration. VHA has a data warehouse called CDW, the Corporate Data Warehouse, that has data on a very large population of our veterans. We have the National Patient-Centered Clinical Research Network, PCORnet, which has different regions, and each region has collated its EHR data into one big EHR data set. These are called the Clinical Data Research Networks, and they also cover a large population. There are also added data sources like the Health Care Systems Research Network, which collects data from HMOs, the health maintenance organizations, and also HIEs or geo-data warehouses. So all of these are sort of at the threshold of being a Big Data source. And remember, Big Data is not always about volume. Sometimes it's about variety, velocity, and veracity as well. Some of these are definitely data sources that bring in a lot of different health data types, and dealing with them is not that easy. A big analytic challenge to know about with Big Data is that when your data becomes super big and gets very close to having almost the entire population in it, your n, or data sample, is basically getting close to all people in that area. So you sort of don't even have sampling anymore, because every patient is in your database. There are always challenges around what type of statistical analysis you should run, and what it means to run traditional statistical methods, which were designed for sampled data, on population-level data. There is enough literature out there. I would recommend that you go and read the literature about Big Data analytics and the challenges that arise when your n gets equal to the entire population. So in summary, we talked about the Big Data challenges and how they might affect your analytics.
More specifically we talked about volume, variety, velocity, veracity, and the value of big data. Thank you.
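As a small supplement to the point about n approaching the entire population, here is a minimal sketch in Python. It is not something shown in the lecture; it uses the textbook finite population correction for the standard error of a mean as one way to see why sampling-based statistics change meaning when nearly everyone is in the database. The function name and all numbers are hypothetical and chosen only for illustration.

```python
import math
from typing import Optional


def standard_error_of_mean(sample_sd: float, n: int,
                           population_size: Optional[int] = None) -> float:
    """Standard error of a sample mean.

    Without population_size this is the usual s / sqrt(n), which assumes
    the sample is small relative to the population. With population_size
    (N) it applies the finite population correction sqrt((N - n) / (N - 1)).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    se = sample_sd / math.sqrt(n)
    if population_size is None:
        return se
    if n > population_size:
        raise ValueError("n cannot exceed the population size")
    if population_size == 1:
        return 0.0  # a single-person population has no sampling error
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return se * fpc


# Hypothetical region of 1,000,000 patients, sample standard deviation of 10.
N = 1_000_000
for n in (100, 100_000, 900_000, N):
    print(n, standard_error_of_mean(10.0, n), standard_error_of_mean(10.0, n, N))
```

With the correction applied, the standard error shrinks toward zero as n approaches N: once every patient is in the data set there is no sampling uncertainty left, so what remains is bias and data quality rather than sampling error. That is one reason methods designed for samples need to be interpreted carefully on population-level data.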