Welcome to the What-If activity for ML for Business Professionals. In this activity, you'll learn how to use the What-If Tool, a tool for inspecting machine-learning models with fairness in mind. One of the AI principles we discussed in the course is to avoid creating or reinforcing unfair bias, and the What-If Tool is one aid for doing that. You'll learn how to use the tool to visualize the predictions of a machine-learning model, explore a model's misclassifications, and examine the model under different fairness constraints. The What-If Tool lets you visualize inference results, edit a data point and see how your model responds, explore the effects of a single feature, view confusion matrices, and test algorithmic fairness constraints, all without writing a single line of code. This is the What-If Tool. On the left, there's a panel featuring several tabs, and on the right, there's a visualization of the dataset. The visualization shows 500 data points sampled from the UCI census income dataset. The model predicts whether someone earns more than $50,000 per year based on individual characteristics such as age, race, sex, and education. The data points in red are individuals predicted to have an income greater than $50,000, while those in blue are predicted to have an income less than or equal to $50,000. As the left panel suggests, select a data point. Click on the blue dot labeled 499; you'll find it in the row that lies at the intersection of the blue and red regions. On the left, you'll see all the feature values for the selected individual, including age, education, and others. At the bottom, you'll see the probability, or score, that the individual received for each label. We want to start by identifying where the errors are. That is, which data points are misclassified.
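The "inference correct" idea, which points are misclassified, is simple bookkeeping under the hood. Here is a minimal sketch in Python with made-up scores and labels (the What-If Tool computes this for you; nothing here is the tool's actual code):

```python
import numpy as np

# Hypothetical data: model scores and true labels for 10 individuals.
# A score is the model's estimated probability that income > $50K.
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3, 0.55, 0.45])
labels = np.array([1,   0,   0,   1,   1,   0,   1,   0,   1,    0])  # 1 = over $50K

threshold = 0.5
predictions = (scores > threshold).astype(int)   # red (1) vs. blue (0) dots
inference_correct = predictions == labels        # which points are misclassified?

print(predictions.tolist())
print(inference_correct.tolist())
```

Points where `inference_correct` is False are the ones we'll be hunting for in the visualization.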
Individuals can be predicted to earn more than $50,000 but earn less than $50,000 in reality, and the opposite can be true as well. As in the loan example from the lecture, misclassification can have downstream effects, such as denying loans to people predicted to earn less than $50,000 when in fact they were earning more. Let's start by rearranging the interface a little and looking at what kinds of errors the algorithm makes. Under Binning | X-Axis, click the drop-down menu and select Inference correct. You'll see the data points move around a bit. Under Binning | Y-Axis, click the drop-down menu and select over_50k. On the left are all the data points that are predicted correctly, while on the right are all the data points that are predicted incorrectly. By itself, this doesn't tell you much about the potential bias of your algorithm. To test how well the model does across different subgroups, under Binning | X-Axis select Age, and under Binning | Y-Axis select Inference correct. Which age group has the most incorrect data points? You can get a sense of how the model performs for each age subgroup by comparing the size of one rectangle to the rectangle directly below it. In this case, you can see that the model performs extremely well for the 17-to-23 age bucket, and also extremely well for everyone above 68, but it doesn't perform nearly as well for people between 24 and 61. Let's try the same thing with race. In this case, you can see that the model performs very well for Amer-Indian-Eskimo and for people who said their race is Other, but for people who are white, the model doesn't perform nearly as well. Let's try the same thing with sex. In this case, the model performs much better for females than it does for males. Perhaps the model performs worse on a subgroup distinguished by two variables, for instance age and sex.
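Binning by a feature and checking which bins hold the errors is, in effect, a group-by over the data. A minimal pandas sketch with toy data (the column names and values are hypothetical, chosen only to mirror what the tool shows):

```python
import pandas as pd

# Toy data: each row is one person, with a feature value and
# whether the model's prediction for that person was correct.
df = pd.DataFrame({
    "sex":               ["Male", "Male", "Male", "Female", "Female", "Female"],
    "inference_correct": [True,   False,  False,  True,     True,     False],
})

# Error rate per subgroup -- the numeric version of comparing
# the "correct" rectangle to the "incorrect" rectangle below it.
error_rate = 1 - df.groupby("sex")["inference_correct"].mean()
print(error_rate)
```

In this toy sample the model errs on 2 of 3 males but only 1 of 3 females, the same kind of disparity the binned view surfaces visually.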
Under Binning | X-Axis, select Race, and under Binning | Y-Axis, select Sex. Let's change the color of the data points to indicate whether the prediction was correct: under Color By, select Inference correct. Which categories have the most incorrect predictions? Now you can look at the color composition of each box, that is, each combination of race and sex, and get a sense of the relative performance of each combination. The Amer-Indian-Eskimo subgroup seems to be predicted entirely correctly, and likewise with Other, but every other combination of race and sex has some incorrect predictions in it. Recall that algorithmic fairness can be defined in several different ways for machine-learning classifiers. One strategy for pursuing fairness is to adjust the thresholds for particular subgroups. The threshold is the number that separates the points predicted to be in one class from those predicted to be in the other. By default, the threshold is 0.5, but we can change it manually or according to different algorithmic fairness constraints. Group-unaware fairness establishes the same threshold across every group. Demographic-parity fairness equalizes positive rates across groups; recall that the positive rate is the percentage of the time the model predicts positive rather than negative. Lastly, equal-opportunity fairness equalizes true positive rates across groups; recall that the true positive rate is the fraction of actual positives that the model correctly predicts as positive. To explore these fairness considerations, click the Performance & Fairness tab on the left. Under Compare Slices and Fairness Metrics, first select Race under Slice by, then select Sex under Slice by secondary. Under Explore Performance, you'll see what's called the ROC curve on the left and the confusion matrix. The ROC curve is an indicator of performance: in general, the greater the area under the curve, the better the predictive capability of the model.
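All three constraints above work by adjusting decision thresholds. A toy sketch of the difference between one shared (group-unaware) threshold and per-group thresholds, with made-up scores and group labels:

```python
import numpy as np

# Hypothetical scores and group membership for six individuals.
scores = np.array([0.3, 0.6, 0.7, 0.4, 0.8, 0.2])
groups = np.array(["A", "A", "A", "B", "B", "B"])

# Group-unaware: one threshold for everyone.
shared = scores > 0.5

# Group-specific: a different threshold per group, as the
# demographic-parity and equal-opportunity strategies allow.
thresholds = {"A": 0.5, "B": 0.35}
per_group = np.array([s > thresholds[g] for s, g in zip(scores, groups)])

print(shared.tolist())
print(per_group.tolist())
```

Lowering group B's threshold flips one of its members from a negative to a positive prediction while leaving group A untouched, which is exactly the lever the fairness strategies pull.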
Now, look at the confusion matrix. The cells in green are correct predictions, where the model predicts income correctly, and those in red are incorrect. Which subgroup has the most incorrect predictions by count? To answer this question, we scroll down through the box and, for each subgroup, look at the sum of the two numbers in the bottom left and the top right of the confusion matrix. It appears that white males account for the largest number of incorrect predictions. How about by percentage? Now we do the same thing, but instead of summing the counts in parentheses, we sum the two percentages. In this case, the subgroup with the worst performance changes: instead of white males, it's Asian-Pac-Islander males, with a 25 percent incorrect prediction rate. We can now consider different algorithmic fairness constraints. Under Optimize sliced thresholds for, click Group unaware. What happens to the threshold for each group? Remember, group unaware maintains the same threshold across every subgroup. In this case, the optimizer determined that it's better to lower the threshold, because doing so increases the number of true positives recognized relative to the number of false positives introduced. How does the visualization on the right change? Now we have slightly different proportions of blue to red in each of these boxes; the total number of red dots seems to have gone down. In fact, certain combinations of race and sex now have perfect performance in our sample. Now click Demographic parity. What happens to the threshold in each subgroup? Recall that demographic-parity fairness equalizes positive rates across groups. That means the percentage of the time the model says yes for each subgroup should be about the same, and that's what we observe.
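The way the group-unaware optimizer "figured out that it's better to lower the threshold" can be sketched as a search: try candidate thresholds and score each by true positives gained versus false positives introduced. This is only a conceptual toy, with made-up scores, labels, and costs, not the tool's actual optimizer:

```python
import numpy as np

# Hypothetical scores and true labels for seven individuals.
scores = np.array([0.15, 0.3, 0.45, 0.55, 0.6, 0.75, 0.9])
labels = np.array([0,    0,   1,    0,    1,   1,    1])

def utility(threshold, tp_value=1.0, fp_cost=1.0):
    # Reward each true positive, penalize each false positive.
    pred = scores > threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    return tp_value * tp - fp_cost * fp

# Scan candidate thresholds and keep the best-scoring one.
candidates = np.arange(0.0, 1.0, 0.05)
best = max(candidates, key=utility)
print(round(float(best), 2))
```

With these numbers, the best threshold comes out below the default 0.5: dropping it captures an extra true positive without admitting an extra false positive, mirroring what the optimizer did on our slice.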
If we look at the values in the predicted-yes total column, you'll see they're fairly consistent across combinations. Here we have 9.8, there we have 9.7. In some cases, due to the small number of examples in our sample, we can't match them exactly; ten is not the same, but it's close. However, the composition on the right-hand side has changed quite a bit. Some subgroup combinations now have perfect performance, shown as all blue circles, but white females seem to have suffered in their performance. Now, click Equal opportunity. What happens to the threshold of each group? As with demographic parity, the thresholds for the groups are no longer tied to each other. Recall that equal-opportunity fairness equalizes true positive rates across groups. That means we should observe the same ratio of the top-left cell of the confusion matrix to the sum of the top-left and top-right cells, and that's what we see. In this case, it's about 0.5: here 14.9 relative to 14.9 plus 15.3, there 6 relative to 12, and so forth. What this means is that the fraction of actual positives the model correctly identifies is the same across subgroups, or as close as it can be. However, many subgroups that had perfect performance before now have terrible performance. If you look at these race and sex combinations, whereas before they were all blue, now they're in fact all red. Of the constraints, which seems to produce the best results for each of the subgroups? Is there one that works better for different subgroups? Absolutely. If we choose group unaware, we get the same threshold across all the groups. However, if we look at the percentage of the time the model says yes for each of the subgroups, we get highly variable results.
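The per-group true positive rate check described above is straightforward to compute. A toy sketch (hypothetical labels, predictions, and groups) in which both groups end up with the same TPR even though their other errors differ, which is all equal opportunity asks for:

```python
import numpy as np

def true_positive_rate(pred, labels):
    # TPR = correctly predicted positives / all actual positives.
    tp = np.sum(pred & (labels == 1))
    fn = np.sum(~pred & (labels == 1))
    return tp / (tp + fn)

labels = np.array([1, 1, 0, 1, 1, 0])
groups = np.array(["A", "A", "A", "B", "B", "B"])
pred   = np.array([True, False, False, False, True, True])

tprs = {g: true_positive_rate(pred[groups == g], labels[groups == g])
        for g in ["A", "B"]}
print(tprs)
```

Both groups score a TPR of 0.5 here, so this toy classifier satisfies equal opportunity, even though group B also carries a false positive that equal opportunity simply ignores, which is one reason accuracy can crater under this constraint.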
For white males, for example, the model says yes 29.5 percent of the time, but for white females, it says yes nine percent of the time, which means that as a subgroup, white females are penalized by this model. Similarly, black females are penalized. Meanwhile, Asian-Pac-Islander males are rewarded: fifty percent of the time, the model says positive. If you look at performance with respect to the accuracy of the model, you'll notice that there are a decent number of red dots, and they're distributed across most of the subgroups, though a small number of subgroup combinations do have perfect performance. If we choose demographic parity, the percentage of the time the model says yes is now as constant as it can be across subgroups, which is fair by that definition. We've also increased performance in a number of subgroup combinations. However, performance has gone down significantly for white males. Equal opportunity, as you recall, equalizes the true positive rates across groups. Now we're concerned with the ratio of this cell to its entire row, and you'll notice it's about 50 percent in all cases. However, the accuracy of the model within subgroups has gone down substantially, and from a fairness perspective, the percentage of the time the model says yes still varies between subgroups. In this case, we could possibly alleviate disparate error rates by using one of the fairness constraints defined in the What-If Tool. However, this isn't going to be doable in every case. Other bias-mitigation techniques include collecting more training data and sub-sampling your data. When collecting more training data, the objective is to have the same number of dots in the under-represented combinations as in the over-represented combinations, focusing specifically on, for instance, Amer-Indian-Eskimo females.
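The per-group "says yes" rates quoted above are the quantity demographic parity equalizes, and they are cheap to compute. A pandas sketch with invented predictions (the group codes and numbers are illustrative only):

```python
import pandas as pd

# Toy predictions: 1 means the model says "over $50K".
df = pd.DataFrame({
    "group":     ["WM"] * 4 + ["WF"] * 4,   # hypothetical subgroup labels
    "predicted": [1, 1, 1, 0,   1, 0, 0, 0],
})

# Positive rate per group -- demographic parity wants these to match.
positive_rate = df.groupby("group")["predicted"].mean()
print(positive_rate)
```

Here the gap (0.75 versus 0.25) is the kind of disparity that shows up in the tool as 29.5 percent for one subgroup against nine percent for another.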
When sub-sampling your data, the intention is to shrink the number of dots in the over-represented categories. So, for instance, we might sample from one group so that we have the same number of white males as black males. What are the implications of getting it wrong? In the lecture, we discussed granting loans. If we grant someone a loan and they can't repay it, it costs us $700. But if we do not grant them a loan and they could have repaid it, they could be prevented from home ownership and from making a significant investment. Think about several more cases. What are the implications of getting it wrong in each of these scenarios? Let's consider marketing. If you label someone as a low-value customer instead of a high-value customer, what does the company using the model risk? Probably not a lot. What does the misclassified person risk? Well, they could be sent the wrong products, but they also might not be considered for things that would affect their lives significantly. What about natality, where you label an expecting parent as normal when they're actually at risk of having an underweight child? In this case, what does the hospital or doctor risk? They risk a significant amount in insurance liability for any case where inappropriate care was given to a patient. What does the parent risk? The parent's risk is extraordinarily high: they are at risk of losing their child. That's the What-If Tool. In this activity, we learned how to use the What-If Tool to visualize the predictions of a machine-learning model, explore the misclassifications of a model, and consider the model under different fairness constraints: group unaware, demographic parity, and equal opportunity.
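The loan trade-off above can be framed as an expected-cost calculation. The $700 figure comes from the lecture; the false-negative cost and the error counts below are invented purely for illustration:

```python
# Cost of each error type (illustrative; $700 is the lecture's loan example).
cost_false_positive = 700   # granted a loan that isn't repaid
cost_false_negative = 300   # hypothetical: value lost by wrongly denying a loan

def expected_cost(n_fp, n_fn):
    # Total dollar cost of a model's mistakes under this cost matrix.
    return n_fp * cost_false_positive + n_fn * cost_false_negative

# Comparing two candidate models (or thresholds) by their error counts:
print(expected_cost(n_fp=10, n_fn=5))   # 8500
print(expected_cost(n_fp=4, n_fn=20))   # 8800
```

Note how the answer flips depending on the relative costs: the second model makes fewer expensive errors but more cheap ones, and whether that's better depends entirely on what "getting it wrong" costs each party, exactly the question posed for the marketing and natality scenarios.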