Next, we'll talk about feature construction. Our goal here is to construct potentially relevant features about patients in order to predict the target outcome. First, we want to introduce a few key concepts related to feature construction. Raw patient data arrive as event sequences over time; for example, each of those icons indicates a clinical event, which could be a diagnosis, a medication, a lab result, and so on, and the sequence evolves over time. Eventually, some target event happens, for example, heart failure. In this example, the case patients are those diagnosed with heart failure on a particular day, which we call the diagnosis date. Since a control patient does not necessarily have a heart failure diagnosis, in theory we could use any day from the control patient's record as the diagnosis date; we just need an anchor point. But commonly, we choose the heart failure diagnosis date of the matching case as the diagnosis date for the corresponding control. The benefit of that strategy is that, for this matching pair, the data come from the same period of time, because both are anchored on the same diagnosis date. Then we look before this diagnosis date, because we want to predict this event. There is a window right before the diagnosis date called the prediction window, and right before the prediction window is the index date, at which we want to use the learned model to make a prediction about the target outcome. Before the index date, we have another time window called the observation window. We use patient information in the observation window to construct features. There are many different ways to construct features. For instance, we can use the number of times an event occurs as a feature: if a type 2 diabetes code appears three times during the observation window, the corresponding feature for type 2 diabetes will be three. Or sometimes we can take an average of the event values.
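The windowing and feature-construction logic described above can be sketched in code. This is a minimal illustration assuming a hypothetical event schema of `(event_date, event_type, code, value)` tuples; the function names and window parameters are illustrative, not from the lesson itself.

```python
from datetime import date, timedelta

def build_features(events, diagnosis_date, prediction_days, observation_days):
    """Construct count and average features from one patient's event sequence.

    events: list of (event_date, event_type, code, value) tuples, where
    value is None for diagnosis codes and numeric for lab results.
    (Hypothetical schema for illustration only.)
    """
    # The index date is where the prediction is made: the prediction
    # window sits between the index date and the diagnosis date.
    index_date = diagnosis_date - timedelta(days=prediction_days)
    # The observation window ends at the index date.
    window_start = index_date - timedelta(days=observation_days)

    counts, lab_values = {}, {}
    for event_date, etype, code, value in events:
        if not (window_start <= event_date < index_date):
            continue  # outside the observation window, ignore
        if etype == "diagnosis":
            counts[code] = counts.get(code, 0) + 1          # count feature
        elif etype == "lab":
            lab_values.setdefault(code, []).append(value)   # collect for averaging

    features = dict(counts)
    for code, values in lab_values.items():
        features[code] = sum(values) / len(values)           # average feature
    return features
```

For example, a type 2 diabetes code appearing three times in the observation window yields the feature value 3, and two A1C results of 6.5 and 7.5 yield the average 7.0; any events inside the prediction window are excluded.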
If a patient has two A1C measurements during the observation window, we can take the average of the two as the feature value for A1C, which is a lab test. The lengths of the observation window and the prediction window are two important parameters that impact model performance. To understand these different observation and prediction windows, let's look at a few quizzes together. Which of the following timelines is easiest to model? Is it A, large observation window and small prediction window; B, small observation window and large prediction window; C, small observation window and small prediction window; or D, large observation window and large prediction window? The answer is A, large observation window and small prediction window. It is often easier to predict outcomes in the near future; that is, a small prediction window is easier to predict. On the other hand, a larger observation window means more information can be used to construct features, which is often the better setting, since in general we can model patients better with more data. Therefore, a large observation window with a small prediction window is the easiest to model. Next, let's look at another one. Which of the following timelines gives the most useful model, assuming we can model them all accurately? Is it a large observation window and small prediction window; a small observation window and large prediction window; a small observation window and small prediction window; or a large observation window and large prediction window? The answer is actually B, small observation window and large prediction window. In the ideal situation, if we could predict accurately in all of these scenarios, then we would want to predict far into the future, which means a large prediction window, without needing much data about the patient, which means a small observation window; therefore B reflects this idealistic timeline. However, this setting is often the most difficult to model. Here's another example illustrating the typical impact of the prediction window.
When the prediction window increases, accuracy will typically drop, because the model is trying to predict further into the future. Here's another quiz on the prediction window. The x-axis is the prediction window size: zero days, 90 days, 180 days, 270 days, and so on. The y-axis is model accuracy; the higher the better. Which of the following options is the most desirable prediction curve: A, B, C, or D? The answer is B, because we can predict accurately for a fairly long prediction window, up to 450 days, while the performance of the other curves drops much more quickly as the prediction window increases. Now let's consider the impact of the observation window. Typically, as the observation window increases, the performance of the model improves, because we know more about the patients. Here's a quiz about the observation window. Given the performance curve when varying the observation window, what is the optimal observation window to choose? Is it A, 90 days; B, 270 days; C, 630 days; or D, 900 days? The answer is C, because the model performance plateaus after 630 days, which indicates diminishing returns for going beyond that point. Next, let's understand the feature selection step. So far we have talked about how to construct features using longitudinal patient event sequences from electronic health records. In particular, we construct features from the raw data in the observation window. In general, different types of features can be constructed, including patient demographics, symptoms, diagnoses, medications, lab results, and vital signs. However, not all features are relevant for predicting a specific target. The goal of feature selection is to find the truly predictive features to include in the final model. Here are example charts for two patients. We see features such as demographics (age, sex, race), vital signs (blood pressure), and diagnoses (such as diabetes and hypertension).
However, in real data there are many more features to consider. In a real EHR record, there are often over 20,000 features for any given patient. Not all of those features are relevant for predicting a given target. We need to select the ones that are relevant for, say, predicting heart failure; those are indicated with the yellow lines. If we want to predict a different condition, for example, diabetes, then we may select another set of features, indicated by the purple lines. Depending on the target, the feature selection result will be different.
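As a minimal sketch of target-dependent feature selection, the example below ranks features with a simple univariate filter: the absolute difference in feature means between cases and controls. This scoring rule is an illustrative assumption, not the specific method used in the lesson; real pipelines typically use statistical tests or regularized models.

```python
def select_features(X, y, k):
    """Keep the top-k features by absolute mean difference between
    cases (y == 1) and controls (y == 0).

    X: list of per-patient feature dicts (missing feature -> 0.0);
    y: list of 0/1 labels, one per patient.
    (A simple univariate filter, assumed for illustration.)
    """
    names = sorted({name for row in X for name in row})
    scores = {}
    for name in names:
        case_vals = [row.get(name, 0.0) for row, lbl in zip(X, y) if lbl == 1]
        ctrl_vals = [row.get(name, 0.0) for row, lbl in zip(X, y) if lbl == 0]
        # Score: how differently this feature behaves in cases vs controls.
        scores[name] = abs(sum(case_vals) / len(case_vals)
                           - sum(ctrl_vals) / len(ctrl_vals))
    return sorted(names, key=lambda n: scores[n], reverse=True)[:k]
```

Running this with heart failure labels versus diabetes labels on the same feature matrix would generally return different feature subsets, which is the point made above: the selection result depends on the target.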