# Explainable AI for Science and Medicine

[MUSIC]>>Yeah. Thanks for

letting me be here today. I’m excited to share

some of our work on explainable AI specifically

applied to science and medicine. I’m defending at U-Dub

here this quarter. So just about wrapping up, and I work with Su-In Lee, if you happen to know her, over in compile lab there in

U-Dub computer science. So maybe you guys, this group that are very well

sympathetic to this question, but I’d like to start out and ask, why do we care about

explainability in ML? Because a lot of our work

has focused on this, and to take a look at that, we can start with just a simple

example of a guy named John. So he is a typical bank customer, and like most customers today, whenever you interact with

a company oftentimes data about you ends up getting

sent into some kind of model. In this case, if you’re in

a financial institution, often these models are trying

to predict things that you care about in terms of outcomes. So in this case, predicting whether John has any repayment problems. So here, a chance of 55 percent leads to

the bank denying his loan. So in this case, there’s a standard thing that happens thousands of times every

day around the world, and it leads to questions from John, obviously, like, why

did you deny my loan? If you have a good manager, they should also ask

the same question, right? Why are we making these really

business-critical decisions? But unfortunately, sometimes

for the data scientist who came in, building this model was

all about accuracy for them, and so they don’t have good answers sometimes when they’re trying to

explain what’s going on here. So why does this happen? Well, it happens often because many times you run into a trade-off

between a complex model, where because you have a very, very large potentially

complicated dataset, you can take advantage of the

flexibility of these complex models, and get a lot of accuracy. But that can lead to a lack

of interpretability. In contrast, some simple models can be interpretable

in the right contexts, but if they are too restrictive, they’ll lead to a lack of accuracy. Okay? This is what often leads people to run into this in

practice, particularly in finance. So if you have interpretability or accuracy and you can only choose one, that’s a painful trade-off. For the bank, it’s

particularly painful because accuracy directly corresponds

to money for them, because accuracy will correspond

to default rates for the loans. But interpretability

is really important. These are wonderful icons that

are different on Mac and Windows, but this is meant to

be a happy person. So if you’re interpretable, you have a happy customer,

happy manager. There’s also really

important legal concerns, such as GDPR and other things. So there are strong drivers on both sides of this trade-off

for many companies. Now, one thing you can do is try and make simple

models better, right? So that you can improve

their ability to retain the interpretability and

also move towards accuracy. That’s a great approach.

We’re actually going to focus on the other one

though, which is basically, taking what’s already

considered very complex models and trying to extract

interpretability from them. Okay? So if you do that though and you just look at

a complex model by itself, it could be very

complicated in the mapping, and trying to explain an entire mapping space could

just be hopelessly complex. So instead what we’re

going to do is focus on explaining individual

predictions one at a time, because those involve, perhaps, just a small piece of

the overall complexity of the model. So I don’t have to describe

how a model behaves in all circumstances in order to tell John what was going on with his loan, I just need to tell him what

was going on that affected him. So if we do that, one way to go about it is to say, “Let’s start with a simple model

and see what this means, be really specific and concrete.” So here’s a linear model. Why do we consider

these things interpretable? Often, it’s because they have this giant summation sitting in

the middle of the model, right? So we have a bunch

of terms coming in. These are things about

John, let’s say, these input attributes, and a bunch of terms come

together and they just get summed up and sent to

you as some output. Okay? So we think of these as

interpretable because we can look at that and see a bunch of things

coming together additively, and what’s coming into

the summation can be viewed as a credit that’s attributed

to each input feature. In contrast, if we look

at a complex model, say a neural net or a random forest, maybe gradient boosted

decision trees, things like this. Often, there’s so much going

on in there that there’s not a nice term that we can

just surface and say, “This is how the model’s going.” So what we’re going to talk about is, how we can explain an individual prediction in a way very similar to how

a linear model works, where again we’re going to have

a summation sitting inside there. But now, instead of having just terms that are straight from

the definition of the model, we’re going to have to come

up with our own definition of a credit attributed

to each feature. Okay? So this Phi function

is going to be indexed by the feature and it depends both on the model and the current input. Okay? So we’re essentially replacing the input to a summation

that you would get in a linear model with something that represents the importance of that feature in

the complicated model. Okay? So this is

a high-level motivation. So we can look at this, and this is actually

really just a form of an explanation, if

you think about this. So that’s really what we’re

doing is just saying, “Why did this model

make this prediction?” We say, “Well, whenever you give an explanation

it has to have some form, and this is the form

we’re saying it has.” Perhaps not surprisingly,

this form has been used by a lot

of previous methods. Okay? This is something

we noted before. So LIME, from Marco here

uses this approach, and there’s a variety

of methods here from Shapley values based on ideas from game theory

that I’ll get back to, that are really interesting, also, produce explanations of this form. These are all model agnostic, but there’s also ones

that are specific to particular popular model types that

are typically hard to explain. So Saabas is one targeted at trees, also came out of Microsoft here, and then there’s a variety that

target deep learning methods. All of them come out with

explanations that looked like this, they have a sum of feature

attribution values that they sum up and

represent the output. Okay? So this is interesting and people hadn’t previously appreciated this

unity among these methods, and so we named this class the class of additive feature

attribution methods. Which is nice because it gives you some connection and

understanding about how this part of the literature

relates to one another. But you can learn a lot

more than that as well. Because it turns out there’s

various strengths and weaknesses in these methods. One really nice thing about the approaches based in

game theory is that they come as the unique

solution to a set of properties that can be

specified in terms of fairness. When you’re explaining something, fairness as defined

by these properties, turns out to be

really, really useful. So we’ll get back to that. In

contrast, these other methods. So I should mention, these

are nice because of that, but they can be a bit slow. Okay? Because of the way

they have to be computed. In contrast, these other

methods are typically based on what could be viewed as

a bit more heuristic approaches, but they tend to be much faster. Okay? So on one side, you have some better theoretical grounding

but slower computation. On the other side, you have faster

estimation but fewer guarantees. So what we’ve attempted to do is combine these strengths together to propose a method

that we call SHAP, and this is something that we

presented at NIPS in 2017. So in order to define this, we need to say, “How do we define

these feature attributions?” Okay? So I talked about, we’re going to make an explanation

by summing together a set of values that represent the

importance of each input. How can we define those things?
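A minimal sketch of the additive form described so far, using a linear model where the credits are simply the terms entering the sum. The weights, intercept, and input values are made up for illustration and do not come from the talk.

```python
import numpy as np

# Hypothetical linear model; every number here is an illustrative assumption.
weights = np.array([0.30, -0.20, 0.10])
intercept = 0.05
x = np.array([1.0, 0.5, 2.0])  # one individual's input features

# Each term entering the summation is the credit attributed to one feature.
credits = weights * x

# The model output is the intercept plus the sum of the per-feature credits.
output = intercept + credits.sum()

print("credits per feature:", credits)
print("model output:", output)
```

For a complex model the same sum is kept, but each term is replaced by a Phi value that depends on both the model and the current input.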

How do we define the credit? Let’s go back to John. So we’re explaining to John why

his loan was denied, and we’re going to do it using

this type of explanation. Of course, there are

many other types you could use. It’s good to start at a base rate. Okay? So in this case, we have

a base rate of 20 percent, and this is just, how

often do people get their loans denied on average? Okay? So that’s a base rate

of loan rejection, or at least, trouble in repayment. Now, the prediction for

John was 55 percent, right? So if we’re going to explain

what’s so special about John, what we need to do is

explain how we got from the base rate to our current

prediction, all right? Because if we predicted 20 percent, there might not have been

anything special about John just because we

always predict 20 percent. Okay. But when John says,

“Why am I 55 percent?” What we really have to explain is

this 35 percent difference here. Okay? So how can we do this? Well, if we assume our

model’s fairly accurate, then we should just

take the expected value of the output of our model, that’s going to be the base rate. Okay? So we can just say what’s the expected value of

the model, let’s say, over our training dataset?
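As a rough sketch of that idea, the base rate can be estimated by averaging the model's output over the training rows. The `toy_model` and `X_train` below are hypothetical stand-ins, chosen only so the example runs and lands on the 20 percent figure used in the talk.

```python
import numpy as np

def base_rate(model, X_train):
    """Mean model output over the training rows (the expected value of the model)."""
    return float(np.mean([model(row) for row in X_train]))

# Hypothetical stand-in model and data, not the bank model from the talk.
def toy_model(row):
    return 0.1 + 0.2 * row[0]

X_train = np.array([[0.0], [0.0], [1.0], [1.0]])

rate = base_rate(toy_model, X_train)

# John's prediction in the talk is 55 percent; the explanation has to
# account for the gap between that prediction and the base rate.
john_prediction = 0.55
gap_to_explain = john_prediction - rate
```

The feature attributions are then expected to add up to `gap_to_explain`, which is the 35 percent difference mentioned above.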

That’s the base rate. Then what we can do is, since

this is an expectation, we can just introduce a term into

that conditional expectation. That term here will say, “Let’s see. We condition on the fact

that he is 20 years old.” So on the condition on

the fact that John’s 20, his risk jumps up by 15 percent. Well, we can attribute

that 15 percent jump, if we’d like, to John’s age. Now, we can condition on the

fact that he’s a day trader, and that’s a very risky profession, and that jumps up to 70 percent. Then we condition on the fact

he only has one open account, like a waiting financial disaster, but he made a ton of money in

the stock market last year. So his capital gains pushes

him down to 55 percent. So what we’ve done

is, we’ve basically divided up how we got

from here to here by conditioning one at a time on all the features until we’ve

conditioned on all of them, which means we’re of course going

to be at the output of the model. So any questions at this point?>>Yeah. This is amazing. Is it all independent?>>Good question. So let’s assume they’re independent

for the moment, and let’s talk about

whether the model itself is linear or nonlinear, because both of those are going

to impact how this works. Because this is not the final

way we want to do this, because the order really matters

as we introduce these things. Okay? Either if they’re independent or dependent between

the input features, or if the model itself is nonlinear. So let’s assume that the inputs are fairly independent

for the moment, and just think about, maybe it’s particularly

bad to be a young day trader. Okay? If that’s the case,

there could be some sort of interaction effect between

day trader and age. What would have

happened here is that, when we saw age we don’t know

you’re a day trader yet, but when we see day trader, we

already know your age. So we get the extra boost

from that interaction effect, maybe it’s particularly bad

to be a young day trader. If we were to reverse the order, age will get the interaction

effect and day trader won’t. Okay? So there’s potentially n factorial different ways

of allocating credit if you were just to start throwing things in one at a time

into the ordering.>>If you are using the training

data to compute these marginals->>Yeah.>>Or conditionals. The more you condition

on past features, I’m going to be slicing and dicing my data in a way that

I’m going to have few [inaudible] estimating

the conditional next.>>Yeah.>>How does that actually

contribute into the approach?>>Yeah. Good question. There’s

two ways to go about it. Because what you’re getting

at is the challenge of being able to fully estimate the whole

joint distribution, because in order to compute

this exactly you would have to know the full joint distribution

of the input features, and then you could compute

all these conditional expectations. In practice, I don’t

think it’s wise to assume you can do that,

at least accurately. So in practice, what happens is, you often assume independence

between different input features in order to calculate

the conditional expectation.>>But then I’m imagining, given that 20-year-old, what is the probability of

the day trader, right?>>Well, so let’s say, for

example, we go back here. So we’ve introduced age. Now, what makes this tricky is if age is conditionally dependent

with other features, that would make it hard in order

to impute all the other features. But if I assume independence

between x1 and all the other xs, I can simply run an expectation

over just those terms. Now, if I assume independence

between x1 and x2, I can do the exact same thing

again. Does that make sense?>>Yeah, I’m just

trying to understand the assumption because I think

I’m hearing two different things. I think I’m hearing that I can assume complete independence and it will

complete this thing like that, or I can go it sequentially, where the ordering

will really matter. I’m going to be using

my condition at the start, but then I have actually a data

scarcity problem as I go forward. For a given problem, how do you decide which way you are? Are you measuring

conditional independence between the variables somehow or just picking and assuming one

and going with that?>>Yeah. So in this case here, we’ll compute this expectation by simply taking the mean

over the dataset. So just evaluate your model on the entire training dataset,

take the mean, you’re done. In this set here, what we can do is we can say, John is 20 years old. So one way to do this is like

a partial dependence approach. You simply plug 20 in and then

plug for all the other values, everything that’s in your training

dataset and you’d get that. Now, that makes

an independence assumption between age and the other features, and so that’s an assumption

that we are making. Then you can repeat that here, where now you’re fixing age to be 20, employment to be day trader, and then you sample from

the rest of the features. Again, independent from these two.>>There’s no data scarcity problem

like this because you’re taking expectation f

of x, rather than x. That is how the model is

going to be behaving. So it can [inaudible] generate as

much data as you want [inaudible].>>But then, do I need

to generate the model?>>Yes, you do need the model. You don’t need the generative

model to input data, but you do need the ability to

re-evaluate the model you’re explaining because that’s

going to give you your new labels.>>Okay.>>Yeah.>>So it seems to me like in

terms of human explainability, making it depend on the order in some sense spoils some of

the charm of an additive model, which is that I can think about all the different summed

contributions independently in an

informal sense, right?>>Right, yes.>>Particularly I’m thinking

about the actionability for poor John here. So John gets denied and

wants to know what can I do and maybe the last thing on the list was like

capital gains or something, which is something I maybe

could do something about. Well, I can’t change my age, changing my occupation is a pretty big move for just getting

a loan but some things [inaudible]>>Maybe open accounts.>>Change your- right, I could

open an account but now, it’s not just what’s going to

happen if I open an account, it’s going to be what’s happening if I open an account given that I’m 20 and that’s much

harder to think about.>>Yeah, exactly. So hold on to that

for just a sec if you don’t mind.>>Just one more question.>>Go ahead.>>It seems like you’re

assuming that in this case x1, x2, x3 are conditionally

independent of everything else.>>The other x’s, yes.>>The other x’s, but you’re

also assuming that x1, x2, and x3 are somehow dependent

because the ordering matters.>>Well, so the dependence is driven by the nonlinearities

in the model, not by the input distribution. In the sense that, let’s

imagine I have an AND function, the order will matter even

if my input features are independent because whenever I see the second one

on my AND function, that’s going to change my model. So the only way that the ordering

will not matter is if all my input features are

independent and my model is linear. So non-linear models are what

we care about explaining, and so that’s why the order matters. Which gets back to what

you’re saying about, “I don’t want to have

to reason about orderings when I’m looking at an explanation.” There are many ways to

do the ordering, and there is not even a good way to pick

the right ordering necessarily. So I think this is a good place

to actually step back and ask, have other people asked

this question before? Have other people asked, “Is

there a good way to allocate responsibility among a set of inputs to a function for

the output of that function?” It turns out that, perhaps not surprisingly, there have been people who thought about that. If you look back in the 1950s, there’s a guy named Lloyd Shapley, who worked on this in the context

of cooperative game theory. There the idea was to say, we have a set of

players that are coming together and they’re going

to play a cooperative game. The output of this game is going

to be let’s say some money, and I need to divide that money

among the players in a fair way, but these people they

interact differently. So some people clearly

deserve more than others, how can I fairly divide that money? It turns out you can extend that, it doesn’t have to just be

positive values like money, it can just be any output

of the function. When you do that, you can put that in a variety of ways but the way Lloyd approached

it was he said, “Let me write down

some properties about fairness that I think should hold. If I write down it turns out only

a few properties about fairness, there’s only one way

to divide up the money such that these fairness properties

are not violated.” That I think is really compelling. He did a lot of good work on this

and a bunch of other things. He got a Nobel Prize in 2012. So this is like an economics thing, but these are very

very well known values called the Shapley values now, that are the unique solution

to these fairness properties. That’s what gets at how do we get around this whole

dependence on ordering? So what are these properties, first of all, that he wrote down? These are actually

some later properties updated in the ’80s because they’re a bit easier to connect with

machine learning, but one of them is called local accuracy or additivity

from game theory. Essentially, it’s

pretty straightforward. It just says, “You’re giving me

a set of feature attributions, I want it to sum up from the base value to

the output of the model.” So if you’re giving

me some summation, I would like the sum of

these local feature attributions, the sum of these Phi values

to equal this to this. The thing we set out to do at the beginning is to

explain that difference. So it’s a natural property

we’d like to hold. I mean, you can certainly

violate it, but let’s assume we would like this

to hold for our explanation. Then, the second property

here is called consistency or monotonicity

in game theory. For this one,

essentially what it says is that if you were to have a model, and then you’re going to

reach in and make that model depend more on a feature, then your attribution for that

feature shouldn’t go the other way. It shouldn’t actually decrease. This is really important because

if you violate consistency, it means you can’t trust any

of the feature orderings in your attributions even

within the same model. So really this is a core like, don’t be wrong kind of axiom. Then there are a couple of others

that are basically trivial and they’re always like, “I’m allowed to swap input features, zero means zero, and

things like that.” So these are the two

fundamental properties that really define these values.

What are these values? Well, they’re easy

to say, very easy to specify that the Shapley

values result from averaging, exactly what I just

told you to do, except over all n factorial

possible orderings. So across the orderings, I’m going

to get n factorial Phi1 values, whenever I introduce

the first feature in whatever position, and

then I just average those. So it’s very simple to say but of

course very painful to compute. You wouldn’t actually want

to enumerate all n factorial possible orderings in

order to get this result. So it’s natural to ask the question, maybe there’s a way to do

this faster, quicker, easier. How do we do this? Well,

it turns out it’s NP-hard. So there are a couple of options

we have for NP-hard problems. We’re really motivated perhaps by the theory and we

really want it to hold, but how do we approach this

to solve this for ML? Well, one, of course, is to prove that P equals NP, that’s one option. The other is to find

an approximate solution. I really thought this

would help our NIPS paper a lot and the impact, but no. So we focused on

approximate solutions. In reality, an approximate solution is pretty trivial if you

just look at the problem. It’s an n factorial thing,

I’m taking a mean, so just draw random samples from

that set of permutations. That’s exactly what

previous methods have done. Basically, they say we have

a certain number of permutations, let’s just draw maybe 100 or 1,000 permutations for each feature

and then we’ll take an average. But what we’re going to do

here is say, we can do better. How can we do better? By looking at the unification that

we just talked about, and trying to draw strengths

from some other methods. In particular, there are connections

with all of these methods, but I’m going to focus on

the connection with LIME, and talk about how we use some

of the advantages in LIME to improve our ability to estimate these classic values

from game theory. So maybe you guys are

familiar with this, but if you don’t remember what

Marco wrote in his paper, he’s got a minimization function. It looks like this,

where essentially you’re trying to fit a linear model locally to your original model

that you’re trying to explain. The details here are not important, but what is important

is that you have parameters you need to choose. You have to pick a loss function that pushes these things

close to each other. You have to pick some regularizer, and most importantly you have

to pick a local kernel that defines what local means. Perhaps naturally, you

just choose these things heuristically when

you’re trying to build this type of explanation method. But if we go back and

we say now we know that for dealing with

toggled type inputs, then we’re going to be inside this class and that means that

all these parameters are forced. There’s only one answer

that matches these axioms so that means there

must be only one set of parameters that we

can choose in order to maintain our local accuracy

and our consistency. So that’s a

non-constructive statement. But well, this is a difference

between Mac and Windows. There is a formula that we derived; we’re simultaneously checking consistency between Mac and

Windows PowerPoint here. So this formula here is essentially a specific local

weighting kernel that we derived, that allowed us to estimate the

Shapley values in a new way. So these are values

that have been around for the better part of a century, but now we can estimate them

using linear regression, which is fundamentally different than just random sampling

from a permutation. So that’s cool because these values

have come with a lot of important properties and

people care about them a lot not just in ML but elsewhere. So what’s nice about this is that, it’s not just a new way, but

it’s a way that can be helpful. So here, you can see that

permutation sampling, so just drawing from a permutation

has fairly high variance. So here I’m just drawing

more and more and more permutations, which involves evaluating the original model

more and more and more. So it’s computationally

painful to do this, and this shows the standard

deviation of my estimate over time. I have right here the right answer

for two different models. So this is an individual feature

we’re estimating a value for. So permutation sampling

has a high variance. If you look at LIME, it has a much lower variance due

to its regression formulation, but of course, it’s not

converging to the values that we would like based on these axioms. So what we’re able to do with Shapley is actually have

our cake and eat it too. So we get to keep the low variance of a regression-based formulation but also converge to the values we want, so we get to keep

the axiomatic agreement.>>How many features are

there in the [inaudible]?>>Yeah, good question. So each of these is just looking at

the convergence of a single feature. This is looking at

the convergence in a dense model. This is looking at the convergence

in a sparse model. Because in a sparse model, you’re allowed to use regularization to improve the

convergence like lasso, for example, for regression

and lots of tools of regression that allow you

to have better sample power. So this is demonstrating that. This is just a dense model

where we don’t get to use any regularization but we

still get a lower variance. So in this case, I think there were about 20 features, and in this one

there were maybe 200. So this is fun. Maybe we’ll stop here and do a quick review of

where we’ve been. So we’re going to talk about

contributions in all these areas, but we just talked about in theory a unification of explanation methods. So we just talked

about a wide number of explanation methods that were

in the literature and how they all connect together

because they have the same type of explanation

that they produce as an output. Then we talked about some strong

uniqueness results that now apply to the whole class and allow

us to go into where we used to make heuristic choices

and instead have a theoretical basis for choosing parameters such that we obey

these classic Shapley axioms. Then in practice, this led to a new estimation method for

these classic Shapley values. It has lower variance and hence requires less computation in order to explain individual

predictions from the models. At this point, I want us to move over to an application

of these techniques. This is actually the application that motivated all of this

work in the first place. So I told the story in

reverse if you will. The application is anesthesia safety, so several doctors

came to my advisor and myself and they wanted to improve the safety of anesthesia

through machine learning. I don’t know if you guys

have ever stopped to think about it but anesthesia is this wonderful thing that’s

only been around for maybe 150, 200 years or so in a

widely available sense. We’ve made a lot of progress but there are still

more safety things to happen. It’s still not

a perfectly risk-free event to keep you asleep but not dead. So why would ML help here? Well, the operating room, it turns out, is

a very data rich environment. It could be one of the places

you get the most data recorded about you in a health sense in your entire life if you go

into an operating room. There’s tons of high-frequency

measurements from lots and lots of sensors hooked up

to you over a time series. Of course there’s a lot of background medical data that comes in as well. So the anesthesiologists

who came to us said, “If you could predict

adverse events just before they happen that would

allow us to be proactive.” Being proactive would help them

better manage the patient. So the particular adverse event

they cared about here is hypoxemia, where this

is low blood oxygen. So essentially, this used to

be the number one cause of anesthesia-related death

about 50 years ago. But then

pulse oximetry got invented. Now, you can actually see where

the status of the patient is in terms of their hypoxemia or

their blood oxygen content. That just totally revolutionized

the safety of anesthesia. But it’s still retrospective, it’s still looking backwards. So if we could just look maybe five minutes in the future and guess, “Will you have hypoxemia

in five minutes?” They can make less severe

interventions during the care of

patients in the hospital. So we built a system called

Prescience that predicts hypoxemia. Again, five minutes is the number

that the doctors gave us is like this is about when the time

range we’d like to know. Then how does it work? Well, so it predicts hypoxemia

by taking a variety of inputs. So we have text data, numeric data like age, categorical data, and all the other different types of data input you would think

of except for images. Then we also have a bunch of

time series data coming in. It’s minute-by-minute data

coming in from U-Dub and Harborview here over

the course of several years. So we have about

eight million minutes of training data that we get

to feed into this model. So a large number of minutes. Then out of the model,

we’re predicting at every minute the

risk of the patient. So at this time, we’re

predicting odds ratios for the patient over time. This is forward in time. So this is five minutes, negative five minutes, negative

10 minutes in the past. You can see their risk

is going up and down over time where the risk is, “Will they fall into this

desaturation region?” So this is the saturation of O2. It’s pretty high and

then it dropped here. So this is an actual

desaturation event. Then we’re here and the question is are you going to have

another desaturation event or not? This is from an actual patient

and the risk right now is 2.4.>>Do you know of any intervention? So when there was the first drop,

was there an intervention?>>Here?>>Yes.>>I do not know in

this point from this.>>[inaudible]>>The doctors are

constantly intervening. So that is a good question. I mean in an OR, it’s not like will

I have done interventions, it’s what are the current interventions,

because you’re on a ventilator, there are like seven knobs that

are set to certain settings. I’ll talk about some of

those knobs in just a moment. But yeah, there are lots

of possible interventions. There’s lots of ongoing intervention. So this is definitely

risk with respect to the current standard of care is

definitely what we’re predicting.>>It’s varying over time.>>It is varying over time, yeah. So the prediction of

the current moment is about twice the normal

likelihood and it has recently gone up in

the last couple of minutes. But just like John or John’s boss, the doctor has the same question. You can see now why we got pushed

into model explainability. They want to understand why. Well, if you want explainability, let’s start with a linear model. Maybe logistic regression

would be a great simple get the job done model. If we do that, we could produce an explanation in an odds ratio form. So here’s 2.4, odds ratio

at the current moment. We could express that as a product of terms if we’re doing

logistic regression. So let’s sort the terms and we see the biggest term is the fact

that they’re overweight. The biggest risk factor for

having breathing problems or hypoxemia issues is overweight. Now, that’s something

I don’t expect or hope is not intervened on at

this moment in the hospital. You’re not going to

fix that instantly. The next one however

is low tidal volume. This is how much you’re

breathing in and out. So tidal volume is essentially

a knob sitting on the ventilator. How much air are we putting

in and out of your lungs? It turns out that

this spike here is almost entirely driven by the

doctor’s setting of that knob. Now, there are good

reasons to set the knob at a low setting but this quantifies the risk that goes into

it very explicitly. That’s what doctors

found very helpful, much more helpful than just

staring at a number and wondering what’s wrong

in the room somewhere. There’s some things, of course,

that help the patient; in this case, their pulse is

looking good. Then there are thousands of

other features pulled out of here that all together

have a moderate effect. So it turns out that at

each point there’s going to be a different set of features

but those features are going to be a fairly small set of features that really are

driving the risk at this moment. But maybe we don’t want to use a linear model because it turns out, if we plot the ROC curve, if you’re familiar with

this, higher is better. Plot the ROC curve for the logistic

regression model we were just talking about and you get

a reasonable performance at 0.86. But then if we throw

a complex model in this case, gradient boosted trees, at this we get a significant jump

in performance. This is a 15 percent jump in the true positive rate if

we fix our FPR at 0.1. So that’s a very non-trivial change

in accuracy of the system. We tried a lot of marginal

transformations to try and make a linear Lasso work

better but we did not succeed. So this left us with this question of how do we best address

this trade-off? Again, that’s what pushed us

into explainability methods and led us to this understanding of unity and kept pulling that string. So now, that we have

pulled that string, we came back to this problem

and said, “All right. How would we address this now that we have this approach called SHAP?” Well, how will we use these SHAP

values in the operating room? Well, one way to convey

these values to a doctor for example would be to start

with the base value of one. So this is odds ratio, this is normal typical risk, and then we can show you

your current risk of 2.4. What we can do is we can take all of the features that are positive. That is they are increasing

your risk and we can plot them where the width of the bar is the impact

of that feature on your risk. So tidal volume has this much

impact, height, and weight. We can sort them and you

can see that almost all of them have

very little weight because there’s thousands of features here and stacked against that we can see

features lowering your risks. So the sum of the

purple plus the green, pushing against each other equals

of course the output 2.4.>>I find it interesting that

this presentation doesn’t make any distinction between time-varying measures

and these kind of like BMI and other things that are

going to be the same throughout.>>Yeah, that’s true.>>Those are in

the sense not actionable so I don’t know what the right thing to do is

because I don’t engage with doctors, but I would think that a distinction

would be interesting to that.>>So there’s two ways, at

least, that we thought about that. One was do we want to actually go back into

the time series and try and figure out where in the time

series we’re getting the signal. Because in a time

varying thing, maybe your tidal volume is actually

really a time series of things. So you could maybe try and tell

me what part of it we went into. Some of the people in my lab

are actually working on that as an extension but we

didn’t do that right away. In terms of which ones are

intervenable and which ones are not, this is something that I’ll talk about the user studies

that we did do on this. But I think there’s still

more questions to tease out there because doctors almost instantly recognize that right

away when they see it. So I can of course label

which ones are affecting me and which ones are not but

they found it nice to know, because they also in their head

have a risk model and they would like to see even the unintervenable ones that

they have in their head, they want to also see here.>>I can imagine those serving

as landmarks or something, they might even be like thresholds

relative to your BMI risk, you just went above your BMI risk.>>Yeah, that’s a really good point. Yeah, exactly because the

fact that your tidal volume is at equal risk to the fact

that you’re overweight, suddenly, it’s like

something where, well, I’ve pretty much quantified that as a doctor. I know how much I think

this weight is affecting you. My turn of that knob is now equal

to that? That’s interesting.>>This model you’re focusing

on hypoxemia in particular. But if I change, say,

the tidal volume, that may have effects on other measures of

the likelihood of success. How do you even bring those in? So I love this, but it’s preconditioned on the fact that I’m following one thing and

I care about that.>>Yeah, that’s

an excellent question. Essentially, it’s saying

how do you best measure a whole handful of outcomes or dozens of outcomes potentially

that you constantly care about. I think we thought about that

and said this is like I agree, that’s totally where

you want to be when you’re in a hospital setting

because you don’t want them myopically affecting one risk and

totally messing up another one.>>[inaudible]>>Yeah. So I could hypothesize how best

to display that to doctors, but I haven’t actually tested that. So I’m not sure it

would be helpful for me to go at it except for the fact that of course you can run many of these models or train

one multitask model if you prefer. Then you’re going to have lots and lots of these things showing up. I think, at some point,

the real important point is people are not going to be

staring at this all day probably. At some point, you’re going to raise some warning or bring

their attention to something. I think that should

probably be one of a whole panel of things

that you’re interested in. At some point, one of those things in the panel is going to light up and you’re going to

go and look at it. At that point, we were trying to say we want to explain at least what the model was concerned about at that moment so you can

warn a number, yeah.>>I could have framed the problem.>>Okay.>>So right now you’re framing it as, “There is a risk factor

via the thing.” What are the features, both actionable and un-actionable, that explain why I’m predicting

this risk to be like this? I could have framed the problem as, “We are at this risk, my goal is reducing this risk.” Give me interventions. Give me concrete things I can play with so that I

can reduce this risk. Because you can’t give

me an explanation like this where there is nothing

for me that is actionable. What do you think as the [inaudible]

between framing the question as pure explanation versus framing the question greatly targeted

on the intervention?>>Yeah. So what you’re asking is

obviously what the doctor wants. I think the issue is oftentimes

we don’t have that to give them, in the sense that this

isn’t a causal model. It’s trained on historical data, it’s full of confounding, and that’s actually one

of the things going forward that I think is

a lot of open work here. I think in order to properly

interpret this model, you really probably have to have some sort of causal

graphical model sitting in your head about where or how all these variables

relate to one another. You have to actually

sit down in front of this with experts before

it ever goes into the operating room and look and see like peak pressure actually helps people but you found it

hurt people because of an association with some confounder

that you didn’t think about. That’s all stuff that needs to be sorted out before it goes

into the OR and it has to be done beforehand because you can’t expect people to

do that on the fly. But that being said, I’m not sure where the boundary is but I’m hesitant to hide too

much of the actual association. I would rather probably tell them I don’t know how

to visualize this but I really think that it’s

future work, how to integrate causal modeling

with this to help people not make causal claims

about things that are not causal. But I guess what we came

away with was oftentimes, something could go wrong

but doctors are there for six hours in some of

these surgeries and there can be really simple obvious

things that would be obvious if they were paying

100 percent attention for six hours but they

aren’t otherwise. So just seeing why the algorithm

is concerned can get you a long way there but it’s not

going to get you to causality. So I think that if we wanted

to answer the causal question, we would actually have

to be in the loop in order for me to trust it at least. We’d have to be in the loop, and there would have to

be some understanding of what’s randomized and what isn’t. So from our perspective,

we said, “Well, how far can we get with

an observational study?” That’s why we’re out here. Does

that answer your question?>>Yeah. That answers my questions, just that there has been

this more recent line of work. I’m sure you know about counterfactual explanations and

the actionable explanations and having found the

absolute [inaudible] and I think the only insight

they believe is that, “I just want my explanation to be on things that are actionable.” In my mind also I don’t

have an answer as you do. I’m just trying to do the trade

off between giving this kind of an explanation versus giving like

a counterfactual actionable, whatever that is, and

it will be, I guess, good to do the same

observational studies with professionals to see how

they’re interpreting them.>>Yeah. No I agree. I guess the one thing we want

to avoid is presenting things that are

observational as though they were true counterfactual things.>>I think there’s [inaudible]

that they look at this.>>I agree, yeah.>>They’re going to look at

the tidal volume and say, “Oh, yeah. I’ll drop this and then I should

expect to see this risk go down.”>>My question’s along those lines too but perhaps maybe

it’s an easier question. So let’s make the assumption

that this model is a perfect model and it truly represents the real world

in every possible way. I’m wondering, maybe I just missed this in the presentation

earlier and processes, how simulatable is this as a doctor? Like, if I know the kind of

Shapley’s score I get from tidal volume and I know that

the volume is two right now, can I make assumptions about what the model would predict if I

dropped the tidal volume to one? So I’ve got a black-box model, I see a score like an attribution

volume two, and I say, “Okay. I can turn this knob down to one.” Do I know how the model will change? Is that valid?>>There’s two questions there. One I’m going to postpone to later

in the talk where we actually try and understand the

individual features one at a time, which this plot does not tell

you. That’s the first thing. Just by looking at

this all you know is what the current value is doing. However, you can draw

curves with these things. First you could of course do

a partial dependence plot, just literally drag

the curve and see it, but it also turns out that you

can make interesting plots using lots and lots of

these explanations and that’s what I’ll talk

about a little bit later. So maybe ask that question

again when you see that and then tell me

if that answers it.>>Yeah. I guess I have one other question too

about taking base values. So I guess like the guidance is to pick it based on your

training data set but if the people I’m evaluating on are different than the people that

I’ve seen in the training set, is there any problems there?>>Yeah. So for example, let’s go to a financial case. It might make a lot more sense to

use the background distribution, not to be your training

data set but to be the distribution of

people who got a loan. Now everything is with respect

to people who got loans. Here is why I didn’t get a loan. So I think it really depends

on what is the reference frame because if you’re sitting in a room as a doctor and

someone comes in and says, “I have a fever,” and your risk

for something goes up, that’s because you mentally

have some background model in your head of the people who

typically come into your office. If you live in the jungle you might have a different

background distribution than someone who lives in Alaska. But you should think the same way when you’re

building this algorithm. It’s like the person has

something in their mind. You’d like your background

distribution to match their mental model

as much as possible. All right. I’m going to

move pretty quickly. I think 11:45, is that

when we’re trying to wrap up, or somewhere in there? I’m not sure of the exact timing.>>I think you have until 12 but you’re getting

a lot of questions.>>Okay. I’ll aim for 11:45 and do that. So let’s move quickly here. So we have this plot explaining an individual prediction at

an individual point in time. If you rotate it 90 degrees and

then you can run it over time, you get things that look like this. So essentially, this is

going forward in time. This is the current time, so this is actually

this slice right here and remember this junction

is the current prediction. So you can look at this as the prediction and the junction between green and purple over time, and we can see the area

of different features changing throughout

the course of the surgery. In fact, tidal volume

is actually something that expands and then goes away, and just in this little

piece right here. So that tells us it’s

a transient effect at this point in time

due to the model. So this is nice to see the flow of risk of a patient over time

during a surgery.>>Can I ask a question about this?>>Yeah.>>So if you are

explaining a prediction?>>That’s right.>>Are you explaining

a prediction at a point in time? Because when I create these

explainable models for each point, the model may take

different features?>>That’s correct.>>It’s representative.

So barring effects, which features are meaningful? Because, for example, tidal volume may go away and

something else may come in.>>Yeah. So if I’m looking at an individual prediction,

I just sort them. So tidal volume maybe here

because it’s really big but over here it’s going

to be buried somewhere in this tail, so you’ll never see it.>>But I mean the different features. Does the explanation use

the same features or?>>Yeah. So this is all features. All the features are here. Just some of them have

larger widths than others because they’re more important

at that moment than others.>>So in this sort order here is

based on some sort of average?>>This sort of here?>>Yeah. Are we doing

crossovers [inaudible]?>>No. There’s no crossovers here. So I just sort by

their global importance, and then tidal volume may be farther down in the bundle

when it pops out. That is to roughly put

most of them toward the tails.>>So if I have two features that

I believe are highly correlated, I may randomly pick

one over the other at any given time and I can’t see all of these things

fluctuating very much? They are actually

representing the same thing?>>Is that possible?>>Well, what you’re explaining here is what the models

are actually using. So that would only fluctuate

if the model itself was actually randomly at each time

point picking different features.>>But if I have two features, they’re identical, they’re

giving me the same message. Can I really ensure that I’m

going to be- I’m trying to understand that [inaudible]

I’m giving it the right rate.>>Essentially, you have to think

about it like this: there is a model, that is, a function

making a prediction, and in the set of inputs let’s say two of them

are very correlated, and I want to know how much

those inputs matter for my output. That depends a lot on the function. If it’s a lasso, maybe it only picked one and ignored the other one, in which case you’ll

never see one of them in this plot because the model

doesn’t actually touch it. Maybe it was a ridge model,

ridge regression. In that case, it will spread them out equally perhaps between the two, and then you would

have seen both doing all the same things together

throughout the whole plot. But in order for them to swap, the model itself would have to change from one time

point to another. It would have to have

some sort of like if this, then that kind of statement and

that would be very unlikely.>>I think I need to understand how. So the predictor model itself

is a black box at this point?>>Sure, it’s just a

function you can evaluate.>>So I think I’m questioning

how the explanation can really understand the internal logic of the model because things

are getting highly correlated.>>It’s essentially by breaking that correlation, perturbing

them independently. That’s deep into it I guess. I don’t know if we should go into

that conversation at the moment, but essentially it’s basically perturbing the inputs to

the model to understand. That’s how all black-box methods have to understand the

function because there’s no other way to look

inside it without making some assumption

about the internals. But ultimately that’s

what’s going to happen. So that’s how it’s

going to perturb what’s going on and then assuming

that your model is consistent, you’ll get consistent

explanations across time. So we tested how this

worked by going back and replaying historical data

to a set of anesthesiologists, practicing anesthesiologists

at U-Dub, Harborview, and

Children’s Hospital. We showed them cases both with and without assistance by

Prescience and then asked them to anticipate, on a scale of 1-100, risk in the next five minutes. So that allowed us to make

ROC curves not just for Prescience, but also for the doctors

and not just doctors, but doctors with and without

help from Prescience. So if we do that, what we can do is run that for

all five doctors and then average their ROC curves to get

the solid green curve here, and we can do the same

thing when they were assisted by Prescience

and get a blue curve. It’s hard to say exactly what

the false positive rate of a doctor is because they don’t

have that in their head, but let’s assume that they

anticipate 15 percent. Then that provides

a very significant improvement in their ability to anticipate

which we were encouraged by.>>Do we really find individual differences

at low false positive rates?>>Yeah. Exactly, and that’s because once you get to lower and lower

false positive rates, we didn’t want to

kill off the doctors by asking them to run forever. Yeah. So that’s why we highlighted these things here and tried to pick an FPR high enough that we are confident with this distance

of separation, which is significant in that case. This is something we wrote

up; you can read more about it in Nature BME. I’m going to move fairly quickly here to make it through

the rest of this stuff, but there’s still room

for improvement here. Why is there room for improvement? Because as we just talked about, model-agnostic methods

can only explain models by perturbing their inputs and

observing how their output changes. It’s really the only way

that you can explain things, if you don’t make any assumptions

about what’s inside them. But how does that hurt us? Well, it turns out if you want

to explain a whole data set, it can be inconvenient to do this. Let’s imagine we’re trying to explain a whole data set just

with a very fast model. This is going to be

XGBoost, and inference on sets of

trees is very quick. What we’re doing here is using simulated data and essentially increasing the number of

features in our model, so we can control that all of

these features really matter, and then retraining

an XGBoost model and then explaining 10,000 predictions. What you can see is

this is minutes of run time and it’s just a linear increase as you

might expect with the number of features, because I have to

do this for each feature, essentially perturb things,

and this is the lower bound. This is just the time

to run the model, so whatever estimation stuff you

do on top is in addition. Here, we’re talking a couple

hours by the time we’re up to 90 or so really important features, which is certainly doable for a whole data set of

10,000 predictions, but it’s unpleasant

and would totally change the way a data scientist

does their workflow. It’s not that we were using

too many samples either. This is the previous permutation sampling approach that we’ve talked about and then this is the new one,

the regression-based approach. This has lower variance, but still a couple percent of the magnitude of these bars is twiddling

around a little bit. So it’s not trivial. You have to think about the

fact that there is noise, there’s this balancing act

that’s going on. So if we go back to what we

did for an NP hard problem, there really is

a third option and that is we can restrict

the problem definition. So we don’t have to solve this

for all possible functions, we can maybe just solve it

for a class of functions, which is very attractive

in the context of machine learning because machine

learning models come in classes, many of which are very popular. So we decided to choose the class

of trees and this leads to a new estimation method for these classic Shapley values for tree-based machine

learning models. So why did we pick trees? Not just because we’re using them, but also because other people,

it turns out, were using them. So this is Kaggle 2017; they basically asked

data scientists at work, what do you use for your

machine learning models? Not surprisingly, logistic regression

is right up there at the top. But then if you look at

it, it’s Random Forests, Decision Trees, Ensemble Models,

Gradient Boosted Machines. All of these involve or are

totally dependent on trees, so they’re tree-based

machine learning models. So whenever you’re thinking of

explaining a complex model, people typically think

a non-linear model. So that means almost all the

nonlinear models that are deployed and used currently in

practice today are based on trees. So if we can make significant

advances on this class of models, that will impact a lot

of current applications. So how will we do this? Well, let’s imagine we wanted to explain it directly with no sampling. What would we be facing? We have T trees in

our ensemble, let’s say, and each of them has L leaves, and then we have M input features. So now we have M factorial because we have all these permutations

we want to do, and then we still have

to compute expectations. So that’s another sample, however many N you want to draw in order to compute that expectation.
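To make that cost concrete, here is a minimal Python sketch of the direct, no-sampling computation: enumerate every ordering of the features and average each feature's marginal contribution, estimating each expectation over a background set. It is an illustration under stated assumptions, not the speaker's implementation. The names `and_model`, `expected_output`, and `brute_force_shapley` are hypothetical, the toy model is a three-way AND of binary inputs, and missing features are filled in from background rows as if features were independent.

```python
from itertools import permutations
from math import factorial


def and_model(x):
    """Hypothetical toy model: returns 1.0 only when every binary input is 1."""
    return float(all(v == 1 for v in x))


def expected_output(model, x, known, background):
    """Estimate the expected model output with the features in `known` fixed to x.

    The remaining features are copied from each background row, which assumes
    the features are independent of one another.
    """
    total = 0.0
    for row in background:
        mixed = [x[i] if i in known else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)


def brute_force_shapley(model, x, background):
    """Average each feature's marginal contribution over all M! orderings."""
    m = len(x)
    phi = [0.0] * m
    for order in permutations(range(m)):
        known = set()
        prev = expected_output(model, x, known, background)
        for i in order:
            known.add(i)
            cur = expected_output(model, x, known, background)
            phi[i] += cur - prev  # credit feature i for the change it causes
            prev = cur
    return [p / factorial(m) for p in phi]


x = [1, 1, 1]             # the prediction being explained
background = [[0, 0, 0]]  # assumed all-zero background distribution
phi = brute_force_shapley(and_model, x, background)
print(phi)
```

With an all-zero background, the output only changes when the last of the three inputs is switched on, so each input should receive about one third of the credit and the attributions should sum to the model output of 1. The loop structure shows where the cost comes from: M! orderings, M steps per ordering, and N background rows per step.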

So that’s factorial. It turns out there’s a way to

rephrase it as exponential, but we’re not really much closer

to tractability yet. But now we can restrict

ourselves to the class of trees, and it turns out

there’s a fairly cool but involved algorithm, a recursive method that I

won’t go through today. It’s interesting because, of course, the solution here depends

on an exponential number of conditional expectations, but they fall out in the trees

in such a way that we can design an algorithm

that’s TLD squared, where D is the depth of the trees. So this means TL is just linear time. So that’s the number of nodes in all the trees and then

D squared is the depth. This polynomial-time run time

gives us the exact solutions, just like you had done this, which is very nice because now we don’t have any of

the problems we had before. In fact, if we plot the run time of our new method against the lower

bound for the agnostic methods, it’s like indistinguishable

from zero, which is nice, and of course, there’s absolutely

no explanation variability. So if you’re a data scientist who’s thinking about explaining your model, suddenly this thing

is done in seconds and you never have to worry

about sampling variability, which really changes how

people can use this stuff.>>It turns out it matters too

because if you think about trees, you might think trees have

been around for a long time, like we should know how to

explain trees by now, right? They have been studied

for quite a while. A lot of people have

looked into explaining and understanding feature

importance in trees. But almost always people are thinking about global feature importance. What is the importance of

this feature over my whole model, over my whole training data set? Gain, Gini importance,

permutation importance. These are all things that are

global feature importance methods. There’s only one approach that we found

out there in the literature to explain an

individual prediction from a tree, which is what we need for John, what we need in

our medical examples, etc. It’s heuristic and

it’s inconsistent. So what do we mean by inconsistent? So this goes back to

those two properties. We had local accuracy

and consistency. Consistency, is one of

those properties you want to hold. Here’s an example of when it fails, and this is from

the heuristic methods we have for trees right now. So here is a set of AND functions. So my model is very simple,

it’s just an AND function. Take some binary inputs, all independent, and it just

says if they’re all true, one; otherwise, zero. A two-way, a three-way, a four-way, all the way up

to 10-way AND functions, and then if I was to be consistent, of course I’d have to allocate

credit fairly between them. If my background is typically zero, then I’m going to just

split up my 1: 0.5 and 0.5 for two inputs, then thirds, fourths, fifths, etc. So we know the right answer because

it’s such a simple model. But then what we do is we say, “Let’s run the heuristic methods.” Well, we get something

very different. It turns out that most

of the credit actually goes to the leaves of the tree, just the way that this method works. It’s actually very similar

to the gain method. Essentially, as you go down the tree, the purer and purer your group gets. All sudden you get more and

more gains for getting it done. So this is exactly the opposite of what you want because

these are probably the most important features

because you greedily selected them for your tree-building. But by the time you get to

like a depth-four AND function, almost no credit is going to the root and almost all of it’s

ending up in the leaves. So this is the definition

of inconsistency. When we use our Tree SHAP, of course it identically

produces exactly this, because it comes with guarantees. So this is one strong motivation for using these values over

what we had before. Another one is to

basically go back and say, “Well there’s lots of metrics

out there that we can use to measure the performance of

a local explanation method.” Particularly, ones that are

assigning numbers to their inputs. So local feature attribution methods. No individual metric is going

to be the right metric, but there are many

ways to go about it. So what we did is we defined

lots of metrics that, I’m not going to go into

the details of all of them, but they’re all about

perturbing or removing features and seeing,

did I get the right one? You said that it was

positive, I’ll take it out. Did that make things

go down? Okay, etc. Then, we’re going to

plot that against lots of explanation methods designed specifically for trees, because we are not looking at

model-agnostic ones necessarily; we’re just looking

at the tree-based ones. When we then look at

our exact explanations for trees, which we call TreeExplainer, we can see it consistently

performs well across all these metrics in general, which we found very encouraging both for a decision tree

and a random forest, and we also tested this

on gradient boosted trees. So I think this broad benchmark evaluation means these axiomatic

things are not just nice properties, but lead to practical improvements in

the explanations you get. Another thing that we

found actually surprising, and we weren’t actually

setting out to do was you could of course take local explanations and put them

back together to get a global one. Because we take

the mean absolute value, or if I’m explaining

the loss function, I can just take the mean

and the loss, for example. When we do that, we can

get a single number, but now we were doing this axiomatically

motivated attributions for each individual explanation. So when we put them back together, it turns out we get better

feature selection power than previous approaches

like gain or permutation. Gain is like Gini importance, if you’re more familiar

with that term. So these are the ways that people do feature importance

for trees today, which is fairly widely

used in a lot of domains. So we’ve found it

pretty surprising that this is a simulated data

set we pulled from some other paper that was using

to check feature selection power, and this is when the

interactions are minimum, and you can see reasonable jump both of the decision tree and

with a random forest. So we’ve found that

surprising and nice. We’ve actually got better

global feature selection power when we went back to global

from our local methods. The last thing we checked is that

while these are explanations, so they should probably

be consistent with human intuition. How

can we measure that? What we can do is go back to

the simple functions like AND, OR, exclusive OR, things like this. These are things people in theory, if explained well, should

understand themselves. They know how the whole thing works. Then, you can ask them, how

would you define credit? How would you fairly allocate credit

among the input features? If you do that for an AND function, unsurprisingly, this

is an AND function mixed with some additive effects. They fairly divide it

between fever and cough, which is part of a story problem

that we used on Mechanical Turk to get people to understand

what this model was doing. Of course, we didn’t tell them it

was a machine learning model. Then, we can compare that to

the SHAP values that we get, which nicely line up

with human intuition, but then we can compare that

to the heuristic values and get significant differences. This significant difference

happens not just for AND but for OR, exclusive OR, all the non-linear functions. So it agrees when you have

a linear function but otherwise we see significant deltas

between what we observed, and a consensus from Mechanical

Turk studies on simple models, which is another reason that we felt we should prefer

these SHAP values. So, in review, we’ve narrowed in on

a particular type of model, trees. We have really fast exact

computation methods that allow this to be very practical. They come still with these

attractive theoretical guarantees because you’re not doing

any approximations now. They have really good

performance across a broad breadth of

explainable AI metrics. Surprisingly, they actually improved global feature selection

power which is fun, and they have strong consistency

with human intuition. So, using all those results, we basically said, what

can we do with this? How can we use these to improve our ability to do

machine learning in practice? So we built a set of

explainable AI tools to do that. So, to explore this, let’s look at a couple data sets. This one here is a classic data set

called NHANES ONE. From the 1970s, it’s a very classic, that 20,000 people in

the United States were given lots of standard medical tests, and then 20 years later they

were followed up for mortality. So we can train a Cox

proportional hazards model with gradient boosted trees

in this case, to predict mortality

in the US in the 1970s. So again, as I say, we can take the mean

absolute value to get a global feature importance. If we do that, we can learn what’s the number one risk factor

for death in the 1970s in the United States, and, drumroll,

it’s age. Big surprise. What was most surprising

to me at least in magnitude was that the number

two killer was being a guy. Then, after that come the things that you might think

about in terms of blood pressure, inflammation, body mass, and

a whole bunch of other things. So this is what you would get if you just explained the model today, maybe not with

the theoretical guarantees, but you would get

bar plots like this if you just got feature importance

from one of these things.>>But now that we have explanations

for each individual sample, we can do better than just

put a bar chart here. Okay? Because one thing

this bar chart does, is it conflates the prevalence of an effect with the

magnitude of an effect. Whenever you have

a global feature measure, you have to come up with one number. So you have to pick a trade-off and somehow combine

these two things together. That’s really important

because we often want to find rare high magnitude

effects in a population. You could, of course, train lots

of models on lots of subsets of your population and then look at their global feature importance. But now we can actually look at

each individual in the population, look for groups of people that have high-magnitude effects even though

they may be a small group of people. So as an example, let’s go down to the

one at the bottom of our list here, called blood protein. If you look at this

you think, well maybe the global importance is fairly low, so maybe this is not an important

factor for life expectancy. What we can do over here is, on the x axis we’re going to plot our feature impact or SHAP value. So negative is good, meaning it’s lowering

your risk, positive is bad. It’s essentially years of life

in a linear sense on the scale. What we can do is, we plot

a dot for every person, because for every person

we’re going to have a value assigned to the

blood protein measurement, like an attribution assigned to it. Then, we’re going to color those dots by the actual blood proteins. So we can see whether

it’s high or low. When we do that, if we do like a little beeswarm-style plot where the dots pile

up to show density, we can see a long tail stretching off to the right

here of red dots. So there’s a pile-up here right at zero, which means for almost everybody

their blood protein has no impact on their

risk for mortality, at least with respect to

the general US population. But like for that

particular person right there blood protein is

extremely important, it’s probably

the most important factor in their life expectancy perhaps. These rare high magnitude effects are things that pop out when you can explain everything at an individual

level across the population. Now, you can say well, of

course, we could do that before, but it’s really nice now

that we have these very high-speed trustable

explanation approaches for trees, because this makes it really practical: we can quickly explain whole data sets

and then plot them like this. We could do this for

all the features. If we do, we see interesting trends. So here on the right side

you’ll see there’s lots of ways to die young by

being way out of range. Okay. So these are people with

very high blood pressure, these are actually

very underweight people. There’s lots of reasons

to be out of range. But in case you’re looking around

and browsing articles online, there’s not a lot of ways to be wildly out of range and

somehow live way longer. So I would urge skepticism,

except for being young. Okay? So you can be young,

that’s I guess the best way. But these are like

insights that come when we tease apart prevalence

and magnitude of effect. Okay. We can do that now that we’ve

explained individual samples. But we also might want

to zoom in closer on one of these features and

understand more about it. So one way we could do

that is to literally put that feature’s

value on the x-axis, like a partial dependence

plot would be. But instead of scrubbing

things around on a PDP plot, instead what we’re

going to plot now is the SHAP value for

that feature on the y-axis. So for every person, they’re going to have

the impact of their value and the actual value of

their systolic blood pressure. We plot them and we get this. So this looks very much

like the standard curve of risk you would get for standard

systolic blood pressure stuff. So here’s 120; higher than 120, you start going up in

your risk of mortality. But what you notice is that there’s

dispersion here. All right. It’s not like a PDP plot, where

we just get a line. Now there’s a vertical dispersion

and this dispersion is driven by interaction

effects in your model. If you had a linear model, you would never see any vertical dispersion. But what this means is there are a lot of people with

a blood pressure of 180, but for that person, it’s

less concerning than for that person, and question is why? Why is it more concerning for

someone than someone else? Well, it turns out, I don’t have time to

go into the details, but you can extend the ideas from game theory about putting

attribution on individual features, you can extend that to

interactions of any order, but what we did is we implemented

it with this algorithm, a high-speed way to compute the interactions for

all pairwise interactions, so now we’re essentially

assigning credit not just among like a vector of

feature attributions, but a matrix of feature attributions where the diagonal

is the main effects and the off-diagonals are

all your pairwise interactions. When we do that, now we can

look in and see well what are the interactions that are

here and what’s driving that. It turns out in this case it’s age. So now we can color by age in order

to see what’s the effect here. We can see that this is highlighting how early onset

high blood pressure is much more dangerous than late onset high blood pressure

in terms of your mortality risk. We don’t need to go into all the

medical reasoning behind that. But I think this is something

that kind of highlights again another example of when

you have lots and lots of local explanations that you trust, you can pull out signal from your data that you

might not otherwise have seen from models that would have

otherwise been considered opaque, in this case, depth-six decision

trees, several thousand of them. Now, I said we had

interaction effects. So we could also plot a feature, let’s say age on the x-axis, and here we can plot one of the off-diagonal things in

our interaction matrix, this time between age and sex. If we do that we’ll see

the varying impact of being men versus women

over a lifetime. So here we have risk being fairly stable and then they

cross and peak at age 60, which, when we talked to doctors,

was strikingly similar to when cardiovascular risk also peaks, which predominantly affects men. It’s not causal so I don’t know

exactly what’s driving it. But these are the types of very interesting interactions

that just pop out from just plotting

many many local explanations, in this case, from a GBM tree.>>Why does it go down

from [inaudible]?>>Well, what’s happening here is remember this is

the off-diagonal effects. There’s always a main effect. It shows that women are

always better off than men. But this is going to

of course be centered because the main effect

has been subtracted out. So by definition it’s simply

showing the relative difference.>>Yeah.>>So once I know this one

I kind of know this one. Last thing I want to end on here is probably actually one of the more

fun applications of this stuff, and it comes from when you have machine-learning models and you deploy them, it turns out they break. I know you’ve never heard this, but it turns out sometimes

they have problems like most software, bugs show up, features drift, someone

changed something in the data pipeline that messed up the way the thing got

deployed and no one knew. Sometimes that costs a lot of money depending on

what pipeline we’re talking about. So in a hospital that does cost money or lives depending on

what you’re looking at. So we’re asking how

can we better improve the safety profile of ML models when they’re used

in high-stakes decisions? We can talk about

model monitoring in a hospital. So one popular thing people

do with ML models in hospitals is trying to

predict the duration of things so they can do

better scheduling. That’s a fairly benign task. So here we’re predicting

the duration of procedures in a hospital over

the course of five years, so this is data from

U-Dub and Harborview. We use the first year

of data for training. Okay. We train it to predict, based on all kinds

of static features, how long is this procedure

going to take? So which doctors, how long do

they need to be scheduled? Then what you can see, of course, is a natural jump in error, once I go from train to test. So this is training error, this is test error, it’s

just naturally higher. But if you just look

at this over time, you’re like well, I don’t know if

my model is doing great or not. But this is typically what people do, they just look at the loss of

my model over time and see, is it going down? Is it going up? You’d never see if

there was a problem here unless it totally

destroyed your model. To demonstrate that, we actually introduced a bug into the model here: we went in and swapped the codes of two rooms in a hospital. So it’s a really easy change that

can happen in any data pipeline, just swap the names

of those two rooms. This is training, this is test. Can you see where we

introduced the bug? Of course, the answer is no, and you could guess, what you

guess in your head, but then what we can do

before I show you that, I guess I already did, I can’t take it away now

it’s already there.>>What we do is we said, let’s

explain the loss of the model. So not the output of

the model but the loss. If we explain the loss of

the model with these SHAP values, what we’re going to do

is we’re going to take that value which is the loss, and we’re going to allocate

it among the input features. We’re just going to say, for

each of these input features, how much did you hurt or

help the loss of my model? Of course, I can do that on

every single input prediction. So I’ll do that, that’ll

essentially deconvolve my loss backwards through

the model onto the input features, and then I can look at

these individual input features and room number six was one

of the ones that we swapped. The question now is, so red, I’ve colored the ones

where this room is on. So they’re in the room, and

blue is when they’re not. So it’s not very

informative to know when they aren’t in the room because there’s lots of rooms

in the hospital. But when they are in the room,

you can see it’s typically lowering the error of the model,

so it’s got negative value. That means it’s lowering the loss. Then all of a sudden here,

now it’s hurting the loss. So if you look up here, you’d never know that this bug had been introduced

into your model pipeline. But if you look down here, you can see a very clear

signal that pops up. Importantly, you might

be able to find this by just monitoring the input

statistics of your data, because that’s basically what people do today from

my understanding, in order to monitor models. They look at this, and they look at the input statistics of

your data over time. But if these rooms had equal usages, you would never see a change

in the input statistics, their marginals look the same. But the effect on the model’s

prediction is quite dramatic. Not in an overall sense, because there are many procedures happening all over these hospitals, but it’s certainly hurting

a lot of the predictions, hundreds if not thousands of

predictions are impacted by this. So this, I think, is

really interesting and could really be helpful in basically taking

explainable AI and using it to impact model

monitoring in practice. Now, it’s not just bugs

that we introduced. So this is actually a dataset that was in theory

cleaned but of course, data is never fully clean. So what we found was, we just plotted this for lots

of features and I just pulled out a few examples

that were interesting. Here’s one where we plot, are you under general anesthesia, the flag in the electronic

medical record, turns out somehow it was wildly unhelpful for

the algorithm right in here, for a subset of the rooms and

a subset of one of the hospitals. That’s something you would

never have found otherwise, turns out we went back and we

found out it was because of some transient EMR connection issue

between various things. Exactly the thing that you would

want to fix if you were in the middle of it because it’s

hurting your prediction performance. Here’s one where we actually

observed drift over time. So this is another binary feature

where we’re saying, “Are you undergoing

atrial fibrillation ablation?” This is like a procedure where

they zap a part of your heart in order to stop

AFib from happening. We’re trying to predict the duration, and what you can see is a general

upward trend where, particularly in the training period but also

in the early test deployment, we’re helping things and by

the time we get to the end of it, this procedure’s duration

has actually changed, and we went back to the hospital and the cardiology department

and they were like, “Oh, well actually,

different people came in, and we’ve got new technology,

and we’re much faster now.” Well, that’s nice. That’s great for the patients but it’s

bad for the ML model, because now your feature has changed, and again, the marginal statistics

have not changed at all. So another example of how

you can use this stuff. So to review, I guess I’m

going to wrap it up here. In theory, we talked

about how we could unify a variety of explanation

methods into a single class, and that gave us insights

into how they are related. We talked about strong uniqueness results from game theory that now spread to that whole class

that we connected together, and how that can help

us pick new parameters. It’s not just the

theoretical perspective but also impacts

practice because we have a new way of estimating these values through regression instead

of simply random sampling. That applies for black-box models. Then we also proposed a new way of estimating these exactly, at very high

speed, for trees. Based on these things, we

are then able to build a whole set of explainable

AI tools that help you build and monitor and understand

the models that you’re building, particularly if you’re using

these trees, it’s very convenient. We talked about a variety

of applications, the motivating one

that started us down this track in anesthesia safety, and also some work in

mortality risk and hospital scheduling demonstrating how

these tools work in practice. So I’d like to highlight that you don’t

have to take my word for it, you can obviously try it

yourself. It’s on GitHub. It also supports

and is directly integrated into XGBoost, LightGBM, which is a Microsoft thing, CatBoost,

and scikit-learn. These are all tree-based integrations for this fast tree method

implemented in C++, and there’s a whole bunch

of stuff that we did working on deep learning models. Let’s assume the classic deep

learning models, what can we do? That I didn’t talk about, but we have integrations with TensorFlow,

Keras, and PyTorch. We’ve also, as we

said we started with hospital medical problems

that got us down this road, but it turns out it got

used all over the place. You stick things on GitHub, who

knows what they’ll do with it. Turns out you can help optimize

performance in sports teams. Microsoft actually uses it in their Cloud pipeline

right now as does Google. Cleveland Clinic uses it for identifying different

subtypes of cancers, which I thought was cool,

close to what we started with. It’s been used for optimizing manufacturing processes

for jet engine stuff, they build complicated

models and try to figure out what’s breaking and why. I had a chance to work with a very large international

bank, and this was really fun because

I actually got to go in and help them do augmented

intelligence stuff, where we basically said they

have models and they have people and at

a certain risk threshold, they use people not models, but you don’t want to

lose everything that the model had when you

transition over to people. So I got to work with them for, I did consulting with

them essentially to help them use these values to help better essentially collaborate

between humans and machines for their decision-making

and financial risk prediction. It’s also been used for economics research and a bunch of other stuff. So I’m going to skip over well, I want to skip over future work. I’ll skip over what’s next but, there’s a lot of places to

go from here as you can probably imagine, a

lot of fun things. In theory, there’s a lot of fundamental

interpretability trade-offs that exist when you

have correlated features, okay. So you touched on

this a little bit but sometimes you have to

decide am I going to violate my data or am I going to understand what

my model actually did. So there’s inherent trade-offs, that I think will be

really interesting. There’s lot of interesting

work that’s going on that I’m involved

with actually back in our lab too, on using explanation constraints to guide model training. It’s really a whole

new way of saying, the model is doing something, but do it this way, such that

your explanation is like this. That’s a really fun way

of doing things. In practice, I think there’s a lot of work like

this model monitoring stuff is fairly new and I think there’s a lot of things that

we could do there to make that much more practical and usable to a large number of people. It’s also, I think when we’re talking about actually

deploying stuff into hospitals, like causal structure and assumptions and if you have

causal modeling assumptions, that’s essentially expert

knowledge that needs to be integrated into these systems before you ever put

it in the hospitals, so you don’t expect

people will suddenly think about how a confounding

could happen on the spot. In applications, I already talked

about the financial one with high-stakes decision making;

hospitals are very similar. Same like, “I have an AI and I am here and I need to make

a decision, how can you help me?” We’re doing a lot of

interesting things understanding adverse drug interactions

and stuff with genomics and stuff with

understanding protein folding with, these are all ongoing

collaborations here at U-Dub. I think I talked about

the finance stuff. So I’m going to skip over

this because I’m out of time but all of this work of course has been in close collaboration with my advisor Su-In here at U-Dub. I’ve had the opportunity to

mentor a number of younger grad students in the PhD program here. Also some in math, as well as the

MD-PhD program here at U-Dub. So there are a lot of papers

that we’re working on together, both from my basic genomics

research to better understanding time series data to estimating drug-drug interactions, I just talked about

that, even cancer stuff. Of course, there’s always

external collaborations to make any medical stuff happen, you can’t do that on your own. So we’ve worked with

some great anesthesiologists at Children’s, U-Dub, and Harborview as well as stuff

I didn’t talk about at all that had to do with the Kidney Research

Institute here in Seattle, which has done some great work. Cardiology, and then a bunch

of work that I didn’t talk about because of time but we had fun things learning

large-scale graphical models to understand genomics with some

people at University of Toronto. So thanks.
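
Editor's sketch: the talk describes fairly allocating credit between fever and cough for an AND function mixed with additive effects. The block below is a minimal brute-force illustration of that idea using exact Shapley values, not the fast tree algorithm from the talk. The score function and its weights (2, 4, and 6) are hypothetical numbers chosen only for illustration, and the names `shapley_values` and `sickness_score` are made up here.

```python
from itertools import permutations

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average each feature's marginal
    contribution over every ordering in which features can be added."""
    orderings = list(permutations(range(n_features)))
    totals = [0.0] * n_features
    for order in orderings:
        present = set()
        prev = value_fn(frozenset(present))
        for i in order:
            present.add(i)
            cur = value_fn(frozenset(present))
            totals[i] += cur - prev
            prev = cur
    return [t / len(orderings) for t in totals]

def sickness_score(present):
    """Hypothetical model: additive effects plus an AND interaction.
    Feature 0 is fever, feature 1 is cough (weights are assumptions)."""
    fever = 0 in present
    cough = 1 in present
    return 2.0 * fever + 4.0 * cough + 6.0 * (fever and cough)

phi = shapley_values(sickness_score, 2)
print(phi)  # [5.0, 7.0]
```

Each feature keeps its own additive effect (2 for fever, 4 for cough), and the AND bonus of 6 is split evenly, 3 each. The attributions sum to the full score of 12, which is the kind of fair division the Mechanical Turk participants were asked to reason about. Enumerating orderings is exponential in the number of features, so this only works for toy cases.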

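Editor's sketch: the monitoring idea in the talk is to explain the loss rather than the output, so that a swapped room code shows up as a feature that suddenly starts hurting the loss. The block below illustrates this on a toy problem with brute-force Shapley values over a small background set. It is an assumption-laden illustration, not the authors' pipeline: the duration model, the room names, the numbers, and the function `loss_attributions` are all hypothetical.

```python
from itertools import permutations
from statistics import mean

# Hypothetical "fitted" duration model in minutes; all numbers are made up.
PROC_BASE = {"short": 60.0, "long": 120.0}
ROOM_EFFECT = {"room_6": 30.0, "room_13": -20.0}

def model(room, proc):
    return PROC_BASE[proc] + ROOM_EFFECT[room]

# Background rows used to fill in features that are treated as "missing".
BACKGROUND = [(r, p) for r in ROOM_EFFECT for p in PROC_BASE]

def loss_attributions(x, y):
    """Exact Shapley split of the squared error (model(x) - y)^2
    among the features of x, relative to the background rows."""
    n = len(x)

    def expected_loss(present):
        # Keep x's value for present features, background values otherwise.
        rows = [tuple(x[i] if i in present else b[i] for i in range(n))
                for b in BACKGROUND]
        return mean((model(*row) - y) ** 2 for row in rows)

    orders = list(permutations(range(n)))
    phi = [0.0] * n
    for order in orders:
        present = set()
        prev = expected_loss(present)
        for i in order:
            present.add(i)
            cur = expected_loss(present)
            phi[i] += (cur - prev) / len(orders)
            prev = cur
    return phi

true_duration = model("room_6", "long")  # the procedure really ran in room 6
clean = loss_attributions(("room_6", "long"), true_duration)
buggy = loss_attributions(("room_13", "long"), true_duration)  # code swapped upstream

print("room attribution, clean record:", clean[0])  # negative: room lowers the loss
print("room attribution, buggy record:", buggy[0])  # positive: room now hurts the loss
```

In this toy setup the room feature's loss attribution flips sign once the code is swapped, while the overall loss changes only modestly. If the two rooms were used equally often, the marginal distribution of the room feature would look the same before and after the swap, which is why monitoring input statistics alone would miss it.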