Explainable AI for Science and Medicine

>>Yeah. Thanks for
letting me be here today. I’m excited to share
some of our work on explainable AI specifically
applied to science and medicine. I'm defending at U-Dub here this quarter, so I'm just wrapping up, and I work with Su-In Lee, if you happen to know her, over in her lab in U-Dub computer science. So maybe you guys, this group, are already very sympathetic to this question, but I'd like to start out and ask, why do we care about
explainability in ML? Because a lot of our work
has focused on this, and to take a look at that, we can start with just a simple
example of a guy named John. So he is a typical bank customer, and like most customers today, whenever you interact with
a company oftentimes data about you ends up getting
sent into some kind of model. In this case, if you’re in
a financial institution, often these models are trying
to predict things that you care about in terms of outcomes. So in this case, predicting whether John has any repayment problems. So here, a chance of 55 percent leads to
the bank denying his loan. So in this case, there’s a standard thing that happens thousands of times every
day around the world, and it leads to questions from John, obviously, like, why
did you deny my loan? If you have a good manager, you should also ask
the same question, right? Why are we making these really
business-critical decisions? But unfortunately, sometimes the data scientist who came in and built this model was focused entirely on accuracy, and so they don't have good answers when they're trying to
explain what’s going on here. So why does this happen? Well, it happens often because many times you run into a trade-off
between a complex model, where because you have a very, very large potentially
complicated dataset, you can take advantage of the
flexibility of these complex models, and get a lot of accuracy. But that can lead to a lack
of interpretability. In contrast, some simple models can be interpretable
in the right contexts, but if they are too restrictive, they'll lead to a lack of accuracy. Okay? This is what often leads people to run into this in practice, particularly in finance. So if you have interpretability or accuracy and you can only choose one, that's a painful trade-off. For the bank, it's
particularly painful because accuracy directly corresponds
to money for them, because accuracy will correspond
to default rates for the loans. But interpretability
is really important. These are wonderful icons that
are different on Mac and Windows, but this is meant to
be a happy person. So if you’re interpretable, you have a happy customer,
happy manager. There are also really important legal concerns, such as GDPR and other things. So there are strong drivers on both sides of this trade-off for many companies. Now, one thing you can do is try and make simple
models better, right? So that you can improve
their ability to retain the interpretability and
also move towards accuracy. That’s a great approach.
We’re actually going to focus on the other one
though, which is basically, taking what’s already
considered very complex models and trying to extract
interpretability from them. Okay? So if you do that though and you just look at
a complex model by itself, it could be very
complicated in the mapping, and trying to explain an entire mapping space could
just be hopelessly complex. So instead what we’re
going to do is focus on explaining individual
predictions one at a time, because those involve, perhaps, just a small piece of
the overall complexity of the model. So I don’t have to describe
how a model behaves in all circumstances in order to tell John what was going on with his loan, I just need to tell him what
was going on that affected him. So if we do that, one way to go about it is to say, “Let’s start with a simple model
and see what this means, be really specific and concrete." So here's a linear model. Why do we consider
these things interpretable? Often, it’s because they have this giant summation sitting in
the middle of the model, right? So we have a bunch
of terms coming in. These are things about
John, let’s say, these input attributes, and a bunch of terms come
together and they just get summed up and sent to
you as some output. Okay? So we think of these as
interpretable because we can look at that and see a bunch of things
coming together additively, and what’s coming into
the summation can be viewed as a credit that’s attributed
to each input feature. In contrast, if we look
at a complex model, say a neural net or a random forest, maybe gradient boosted
decision trees, things like this. Often, there’s so much going
on in there that there’s not a nice term that we can
just surface and say, “This is how the model’s going.” So what we’re going to talk about is, how we can explain an individual prediction in a way very similar to how
a linear model works, where again we’re going to have
a summation sitting inside there. But now, instead of having just terms that are straight from
the definition of the model, we’re going to have to come
up with our own definition of a credit attributed
to each feature. Okay? So this Phi function
is going to be indexed by the feature and it depends both on the model and the current input. Okay? So we’re essentially replacing the input to a summation
that you would get in a linear model with something that represents the importance of that feature in
the complicated model. Okay? So this is
a high-level motivation. So we can look at this, and this is actually
really just a form of an explanation, if
you think about this. So that’s really what we’re
doing is just saying, “Why did this model
make this prediction?” We say, “Well, whenever you give an explanation
it has to have some form, and this is the form
we’re saying it has.” Perhaps not surprisingly,
this form has been used by a lot
of previous methods. Okay? This is something
we noted before. So Lime, from Marco here
uses this approach, and there’s a variety
of methods here from Shapley values based on ideas from game theory
that I’ll get back to, that are really interesting, also, produce explanations of this form. These are all model agnostic, but there’s also ones
that are specific to particular popular model types that
are typically hard to explain. So Saabas is one targeted at trees, also came out of Microsoft here, and then there’s a variety that
target deep learning methods. All of them come out with
explanations that look like this: they have a set of feature attribution values that sum up to
represent the output. Okay? So this is interesting and people hadn’t previously appreciated this
unity among these methods, and so we named this class the class of additive feature
attribution methods. Which is nice because it gives you some connection and
understanding about how this part of the literature
relates to one another. But you can learn a lot
more than that as well. Because it turns out there’s
various strengths and weaknesses in these methods. One really nice thing about the approaches based in game theory is that they come as the unique
solution to a set of properties that can be
specified in terms of fairness. When you’re explaining something, fairness as defined
by these properties, turns out to be
really, really useful. So we'll get back to that. So I should mention, these are nice because of that, but they can be a bit slow. Okay? Because of the way
they have to be computed. In contrast, these other
methods are typically based on what could be viewed as
a bit more heuristic approaches, but they tend to be much faster. Okay? So on one side, you have some better theoretical grounding
but slower computation. On the other side, you have faster
estimation but fewer guarantees. So what we've attempted to do is combine these strengths to propose a method
that we call SHAP, and this is something that we
presented at NIPS in 2017. So in order to define this, we need to say, "How do we define
these feature attributions?” Okay? So I talked about, we’re going to make an explanation
by summing together a set of values that represent the
importance of each input. How can we define those things?
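(To make that form concrete, here is a sketch in the notation of the SHAP paper, where phi_0 is the base value and phi_i is the credit assigned to feature i of the M inputs; the slide itself isn't reproduced in the transcript.)

```latex
f(x) \;\approx\; \phi_0 \;+\; \sum_{i=1}^{M} \phi_i(f, x)
```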
How do we define the credit? Let’s go back to John. So before explaining to John why
his loan was denied, and we’re going to do it using
this type of explanation. Of course, there are
many other types you could use. It’s good to start at a base rate. Okay? So in this case, we have
a base rate of 20 percent, and this is just, how
often do people get their loans denied on average? Okay? So that’s a base rate
of loan rejection, or at least, trouble in repayment. Now, the prediction for
John was 55 percent, right? So if we’re going to explain
what’s so special about John, what we need to do is
explain how we got from the base rate to our current
prediction, all right? Because if we predicted 20 percent, there might not have been
anything special about John just because we
always predict 20 percent. Okay. But when John says,
“Why am I 55 percent?” What we really have to explain is
this 35 percent difference here. Okay? So how can we do this? Well, if we assume our
model’s fairly accurate, then we should just
take the expected value of the output of our model, that’s going to be the base rate. Okay? So we can just say what’s the expected value of
the model, let’s say, over our training dataset?
That’s the base rate. Then what we can do is, since
this is an expectation, we can just introduce a term into
that conditional expectation. That term here will say, “Let’s see. We condition on the fact
that he is 20 years old.” So on the condition on
the fact that John’s 20, his risk jumps up by 15 percent. Well, we can attribute
that 15 percent jump, if we’d like, to John’s age. Now, we can condition on the
fact that he’s a day trader, and that’s a very risky profession, and that jumps up to 70 percent. Then we condition on the fact
he only has one open account, which looks like a financial disaster waiting to happen, but he made a ton of money in
the stock market last year. So his capital gains pushes
him down to 55 percent. So what we’ve done
is, we’ve basically divided up how we got
from here to here by conditioning one at a time on all the features until we’ve
conditioned on all of them, which means we of course end up at the output of the model. So any questions at this point?>>Yeah. This is amazing. Is it all independent?>>Good questions. So let's assume they're independent
for the moment, and let’s talk about
whether the model itself is linear or nonlinear, because both of those are going
to impact how this works. Because this is not the final
way we want to do this, because the order really matters
as we introduce these things. Okay? Either if they’re independent or dependent between
the input features, or if the model itself is nonlinear. So let’s assume that the inputs are fairly independent
for the moment, and just think about, maybe it's particularly
bad to be a young day trader. Okay? If that’s the case,
there could be some sort of interaction effect between
day trader and age. What would have
happened here is that, when we saw age we don’t know
you’re a day trader yet, but when we see day trader, we
already know your age. So we get the extra boost
from that interaction effect, maybe it’s particularly bad
to be a young day trader. If we were to reverse the order, age will get the interaction effect and day trader won't. Okay? So there are potentially n factorial different ways
of allocating credit if you were just to start throwing things in one at a time
into the ordering.>>If you are using the training
data to compute these marginals->>Yeah.>>Or conditionals. The more you condition
on past features, I’m going to be slicing and dicing my data in a way that
I’m going to have few [inaudible] estimating
the conditional next.>>Yeah.>>How does that actually
contribute into the approach?>>Yeah. Good question. There’s
two ways to go about it. Because what you’re getting
at is the challenge of being able to fully estimate the whole
joint distribution, because in order to compute
this exactly you would have to know the full joint distribution
of the input features, and then you could compute
all these conditional expectations. In practice, I don’t
think it’s wise to assume you can do that,
at least accurately. So in practice, what happens is, you often assume independence
between different input features in order to calculate
the conditional expectation.>>But then I’m imagining, given that 20-year-old, what is the probability of
the day trader, right?>>Well, so let’s say, for
example, we go back here. So we’ve introduced age. Now, what makes this tricky is if age is conditionally dependent
with other features, that would make it hard in order
to impute all the other features. But if I assume independence
between x1 and all the other xs, I can simply run an expectation
over just those terms. Now, if I assume independence
between x1 and x2, I can do the exact same thing
again. Does that make sense?>>Yeah, I’m just
trying to understand the assumption because I think
I’m hearing two different things. I think I’m hearing that I can assume complete independence and it will
compute this thing like that, or I can go at it sequentially, where the ordering
will really matter. I’m going to be using
my condition at the start, but then I have actually a data
scarcity problem as I go forward. For a given problem, how do you decide which way you are? Are you measuring
conditional independence between the variables somehow or just picking and assuming one
and going with that?>>Yeah. So in this case here, we'll compute this expectation by simply taking the mean over the dataset. So just evaluate your model on the entire training dataset, take the mean, and you're done. For the next term, what we can do is say John is 20 years old. So one way to do this is like a partial dependence approach: you simply plug 20 in, then plug in all the other values from everything that's in your training dataset, and you'd get that. Now, that makes an independence assumption between age and the other features, and so that's an assumption that we are making. Then you can repeat that here, where now you're fixing age to be 20 and employment to be day trader, and then you sample the rest of the features, again independently of those two.
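(A rough sketch of that procedure, not the talk's actual code; it assumes a fitted model whose predict method returns a risk score, a NumPy background matrix X_train, and hypothetical column indices AGE and JOB.)

```python
import numpy as np

def expected_output(model, X_background, fixed=None):
    """Estimate E[f(X) | X_S = x_S] by plugging the fixed values into every
    background row and averaging, which assumes the fixed features are
    independent of the remaining ones."""
    X = X_background.copy()
    for col, val in (fixed or {}).items():
        X[:, col] = val
    return model.predict(X).mean()

base_rate = expected_output(model, X_train)                      # ~0.20 in the example
with_age  = expected_output(model, X_train, {AGE: 20})           # ~0.35
with_job  = expected_output(model, X_train, {AGE: 20, JOB: 1})   # ~0.70
phi_age   = with_age - base_rate   # credit given to age in this particular ordering
```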
>>There's no data scarcity problem like this, because you're taking the expectation of f of x rather than of x. That is how the model is going to be behaving, so you can [inaudible] generate as much data as you want [inaudible].>>But then, do I need
to generate the model?>>Yes, you do need the model. You don’t need the generative
model to input data, but you do need the ability to
re-evaluate the model you’re explaining because that’s
going to generate your new labels.>>Okay.>>Yeah.>>So it seems to me like in
terms of human interpretability, this dependence on the ordering in some sense spoils some of
the charm of an additive model, which is that I can think about all the different summed
contributions independently in
an informal sense, right?>>Right, yes.>>Particularly I'm thinking
about the actionability for poor John here. So John gets denied and
wants to know what can I do, and maybe the last thing on the list was, like,
capital gains or something, which is something I maybe
could do something about. Well, I can’t change my age, changing my occupation is a pretty big move for just getting
a loan but somethings [inaudible]>>Maybe open accounts.>>Change your- right, I could
open an account but now, it’s not just what’s going to
happen if I open an account, it’s going to be what’s happening if I open an account given that I’m 20 and that’s much
harder to think about.>>Yeah, exactly. So hold on to that for just a sec if you don't mind.>>Just one more question.>>Go ahead.>>It seems like you're
assuming that in this case x1, x2, x3 are conditionally
independent of everything else.>>The other x’s, yes.>>The other x’s, but you’re
also assuming that x1, x2, and x3 are somehow dependent
because the ordering matters.>>Well, so the dependence is driven by the nonlinearities
in the model, not by the input distribution. In the sense that, let’s
imagine I have an AND function; the order will matter even if my input features are independent, because whenever I see the second input to my AND function, that's going to change my model's output. So the only way that the ordering
will not matter is if all my input features are
independent and my model is linear. So non-linear models are what
we care about explaining, and so that’s why the order matters. Which gets back to what
you're saying about, "I don't want to have to reason about orderings when I'm looking at an explanation." There are many ways to do the ordering, and there isn't necessarily even a good way to pick the right one. So I think this is a good place
to actually step back and ask, have other people asked
this question before? Have other people asked, “Is
there a good way to allocate responsibility among a set of inputs to a function for
the output of that function?” It turns out that, perhaps not surprisingly, there have been people who thought about that. If you look back in 1950s, there’s a guy named Lloyd Shapley, who worked on this in the context
of cooperative game theory. There the idea was to say, we have a set of
players that are coming together and they’re going
to play a cooperative game. The output of this game is going
to be let’s say some money, and I need to divide that money
among the players in a fair way, but these people they
interact differently. So some people clearly
deserve more than others, how can I fairly divide that money? It turns out you can extend that, it doesn’t have to just be
positive values like money, it can just be any output
of the function. When you do that, you can pose it in a variety of ways, but the way Lloyd approached it was to say, "Let me write down some properties about fairness that I think should hold." It turns out that if you write down only a few fairness properties, there's only one way to divide up the money such that those properties are not violated. That, I think, is really compelling. He did a lot of good work on this
and a bunch of other things. He got a Nobel Prize in 2012. So this is like an economics thing, but these are very
very well known values called the Shapley values now, that are the unique solution
to these fairness properties. That’s what gets at how do we get around this whole
dependence on ordering? So what are these properties, first of all, that you wrote down? These are actually
some later properties updated in the ’80s because they’re a bit easier to connect with
machine learning, but one of them is called local accuracy or additivity
from game theory. Essentially, it’s
pretty straightforward. It just says, "If you're giving me a set of feature attributions, I want them to sum up from the base value to the output of the model." So if you're giving me some summation, I would like the sum of these local feature attributions, the sum of these Phi values, to equal that gap. The thing we set out to do at the beginning is to
explain that difference. So it’s a natural property
we’d like to hold. I mean, you can certainly
violate it, but let’s assume we would like this
to hold for our explanation. Then, the second property
here is called consistency or monotonicity
in game theory. For this one,
essentially what it says is that if you were to have a model, and then you’re going to
reach in and make that model depend more on a feature, then your attribution for that
feature shouldn't go the other way. It shouldn't actually decrease. This is really important because
if you violate consistency, it means you can’t trust any
of the feature orderings in your attributions even
within the same model. So really this is a core like, don’t be wrong kind of axiom. Then there are a couple of others
that are basically trivial and they’re always like, “I’m allowed to swap input features, zero means zero, and
things like that.” So these are the two
fundamental properties that really define these values.
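(For reference, a standard way to write the values those properties pin down; this is a sketch in conventional notation rather than the slide's exact formula, where the sum runs over all orderings pi of the M features and S_i^pi is the set of features introduced before i.)

```latex
\phi_i(f, x) \;=\; \frac{1}{M!} \sum_{\pi} \Big(
  \mathbb{E}\big[f(X) \mid X_{S_i^{\pi} \cup \{i\}} = x_{S_i^{\pi} \cup \{i\}}\big]
  \;-\;
  \mathbb{E}\big[f(X) \mid X_{S_i^{\pi}} = x_{S_i^{\pi}}\big] \Big)
```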
What are these values? Well, they're easy to say: the Shapley values result from averaging exactly what I just told you to do, except over all n factorial possible orderings. So for every ordering, I get a value for each Phi whenever I introduce that feature in whatever position it lands, and then I just average those. So it's very simple to say but of
course very painful to compute. You wouldn’t actually want
to enumerate all n factorial possible orderings in
order to get this result. So it’s natural to ask the question, maybe there’s a way to do
this faster, quicker, easier. How do we do this? Well,
it turns out it’s NP-hard. So it’s a couple of options
we have for NP-hard problems. We're really motivated, perhaps, by the theory and we really wanted it to hold, so how do we approach solving this for ML? Well, one option, of course, is to prove that P equals NP. The other is to find an approximate solution. I really thought the first would have helped our NIPS paper's impact a lot, but no. So we focused on an approximate solution. In reality, an approximate solution is pretty trivial if you
just look at the problem. It's an n factorial average, so just draw random samples from those permutations. That's exactly what previous methods have done. Basically, they say we have a huge number of possible permutations, so let's just draw maybe 100 or 1,000 of them for each feature and then take an average.
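(A rough sketch of that sampling estimator; the model, x, and X_background objects are hypothetical placeholders, and this is not the code used in the talk.)

```python
import random
import numpy as np

def sampled_shapley(model, x, X_background, feature, n_perm=1000):
    """Monte Carlo Shapley estimate: average the marginal contribution of
    `feature` over randomly drawn orderings, keeping not-yet-introduced
    features at the values of a randomly chosen background row."""
    M = len(x)
    contribs = []
    for _ in range(n_perm):
        order = list(range(M))
        random.shuffle(order)
        z = X_background[np.random.randint(len(X_background))].copy()
        for i in order:
            if i == feature:
                before = model.predict(z.reshape(1, -1))[0]
                z[i] = x[i]
                after = model.predict(z.reshape(1, -1))[0]
                contribs.append(after - before)
                break
            z[i] = x[i]
    return float(np.mean(contribs))
```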
But what we're going to do here is say, we can do better. How can we do better? By looking at the unification that we just talked about, and trying to draw strengths from some other methods. In particular, there are connections
with all of these methods, but I’m going to focus on
the connection with Lime, and talk about how we use some
of the advantages in Lime to improve our ability to estimate these classic values
in game theory. So maybe you guys are
familiar with this, but if you don’t remember what
Marco wrote in his paper, he’s got a minimization function. It looks like this,
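(The slide isn't reproduced in the transcript; for reference, this is the LIME objective as written in Ribeiro et al. and in the SHAP paper, where g is the simple local model from a class G, L is the local loss, pi_x is the locality kernel, and Omega is the regularizer.)

```latex
\xi(x) \;=\; \operatorname*{arg\,min}_{g \in \mathcal{G}} \; L\big(f, g, \pi_{x}\big) + \Omega(g)
```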
where essentially you’re trying to fit a linear model locally to your original model
that you're trying to explain. The details aren't important here, but what is important
is that you have parameters you need to choose. You have to pick a loss function that pushes these things
close to each other. You have to pick some regularizer, and most importantly have
to pick a local kernel that defines what local means. Perhaps naturally, you
just choose these things heuristically when
you’re trying to build this type of explanation method. But if we go back and
we say now we know that for dealing with
toggle-type inputs, then we're going to be inside this class, and that means that
all these parameters are forced. There’s only one answer
that matches these axioms so that means there
must be only one set of parameters that we
can choose in order to maintain our local accuracy
and our consistency. So that’s a
non-constructive statement. But, well, here's another difference between Mac and Windows; we're simultaneously checking consistency between Mac and Windows PowerPoint here. There is a formula that we derived, and this formula is essentially a specific local weighting kernel that allowed us to estimate the Shapley values in a new way.
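(For reference, the weighting kernel in question, as given in the SHAP paper, where M is the number of simplified inputs and |z'| is the number of features present in a simplified sample z'.)

```latex
\pi_{x'}(z') \;=\; \frac{M - 1}{\binom{M}{|z'|}\; |z'| \,(M - |z'|)}
```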
So these are values that have been around for the better part of a century, but now we can estimate them using linear regression, which is fundamentally different than just randomly sampling permutations. So that's cool, because these values come with a lot of important properties and
people care about them a lot not just in ML but elsewhere. So what’s nice about this is that, it’s not just a new way, but
it’s a way that can be helpful. So here, you can see that
permutation sampling, so just drawing from a permutation
has fairly high variance. So here I’m just drawing
more and more and more permutations, which involves evaluating the original model
more and more and more. So it’s computationally
painful to do this, and this shows the standard
deviation of my estimate over time. I have right here the right answer
for two different models. So this is an individual feature
we’re estimating a value for. So permutation sampling
has a high variance. If you look at Lime, it has a much lower variance due
to its regression formulation, but of course, it’s not
converging to the values that we would like based on these axioms. So what we’re able to do with Shapley is actually have
our cake and eat it too. So we get to keep the low variance of a regression-based formulation but also converge to the values we want, so we get to keep
the axiomatic agreement.>>How many features are
there in the [inaudible]?>>Yeah, good question. So each of these is just looking at
the convergence of a single feature. This is looking at
the convergence in a dense model. This is looking at the convergence
in a sparse model. Because in a sparse model, you’re allowed to use regularization to improve the
convergence like lasso, for example, for regression
and lots of tools of regression that allow you
to have better sample power. So this is demonstrating that. This is just a dense model
where we don’t get to use any regularization but we
still get a lower variance. So in this case, I think there was about 20 features and about this one
there was maybe 200. So this is fun. Maybe we’ll stop here and do a quick review of
where we’ve been through. So we’re going to talk about
contributions in all these areas, and so far on the theory side we've talked about a unification of explanation methods. So we just talked
about a wide number of explanation methods that were
in the literature and how they all connect together
because they have the same type of explanation
that they produce as an output. Then we talked about some strong
uniqueness results that now apply to the whole class and allow
us to go into where we used to make heuristic choices
and instead have a theoretical basis for choosing parameters such that we obey
these classic Shapley axioms. Then in practice, this led to a new estimation method for
these classic Shapley values. It has lower variance and hence requires less computation in order to explain individual
predictions from the models. At this point, I want us to move over to an application
of these techniques. This is actually the application that motivated all of this
work in the first place. So I've told the story in
reverse if you will. The application is anesthesia safety, so several doctors
came to my advisor and myself and they wanted to improve the safety of anesthesia
through machine learning. I don’t know if you guys
have ever stopped to think about it but anesthesia is this wonderful thing that’s
only been around in a broadly available sense for maybe 150 or 200 years. We've made a lot of progress, but there are still more safety improvements to be made. It's still not
a perfectly risk-free event to keep you asleep but not dead. So why would ML help here? Well, the operating room, it turns out, is
a very data rich environment. It could be one of the places
you get the most data recorded about you, in a health sense, in your entire life if you go
into an operating room. There’s tons of high-frequency
measurements from lots and lots of sensors hooked up
to you over a time series. Of course there’s a lot of background medical data that comes in as well. So the anesthesiologists
who came to us said, “If you could predict
adverse events just before they happen that would
allow us to be proactive.” Being proactive would help them
better manage the patient. So the particular adverse event
they cared about here is hypoxemia, which
is low blood oxygen. So essentially, this used to
be the number one cause of anesthesia-related death
about 50 years ago. But then these things called pulse oximeters got invented. Now, you can actually see where
the status of the patient is in terms of their hypoxemia or
their blood oxygen content. That just totally revolutionized
the safety of anesthesia. But it’s still retrospective, it’s still looking backwards. So if we could just look maybe five minutes in the future and guess, “Will you have hypoxemia
in five minutes?", they can make less severe interventions during the care of patients in the hospital. So we built a system called Prescience that predicts hypoxemia. Again, five minutes is the number the doctors gave us; that's roughly the time horizon they'd like to know about. Then how does it work? Well, it predicts hypoxemia
by taking a variety of inputs. So we have text data, numeric data like age, categorical data, and basically all the different types of data input you would think
of except for images. Then we also have a bunch of
time series data coming in. It’s minute-by-minute data
coming in from U-Dub and Harborview here over
the course of several years. So we have about
eight million minutes of training data that we get
to feed into this model. So a large number of minutes. Then out of the model,
we’re predicting at every minute the
risk of the patient. So at this time, we’re
predicting odds ratios for the patient over time. This is forward in time. So this is five minutes, negative five minutes, negative
10 minutes in the past. You can see their risk
is going up and down over time where the risk is, “Will they fall into this
desaturation region?” So this is the saturation of O2. It’s pretty high and
then it dropped here. So this is an actual
desaturation event. Then we’re here and the question is are you going to have
another desaturation of it or not? This is from an actual patient
and the risk right now is 2.4.>>Do you know of any intervention? So when there was the first drop,
was there an intervention?>>Here?>>Yes.>>I don't know at this point, from this.>>[inaudible]>>The doctors are constantly intervening. So that is a good question. I mean, in an OR, it's not a question of whether there will be an intervention; it's what the current interventions are, because you're on a ventilator and there are like seven knobs that
are set to certain settings. I’ll talk about some of
those knobs in just a moment. But yeah, there are lots
of possible interventions. There's lots of ongoing intervention. So risk with respect to the current standard of care is definitely what we're predicting.>>It's varying over time.>>It is varying over time, yeah. So the prediction at
the current moment is about twice the normal
likelihood and it has recently gone up in
the last couple of minutes. But just like John or John’s boss, the doctor has the same question. You can see now why we got pushed
into model explainability. They want to understand why. Well, if you want explainability, let’s start with a linear model. Maybe logistic regression
would be a great simple get the job done model. If we do that, we could produce an explanation in an odds ratio form. So here’s 2.4, odds ratio
at the current moment. We could express that as a product of terms if we’re doing
logistic regression. So let's sort the terms, and we see the biggest term is the fact that they're overweight. The biggest risk factor for having breathing problems or hypoxemia issues is being overweight. Now, that's something I don't expect, or hope, will be intervened on at this moment in the hospital. You're not going to
fix that instantly. The next one however
is low tidal volume. This is how much you’re
breathing in and out. So tidal volume is essentially
a knob sitting on the ventilator. How much air are we putting
in and out of your lungs? It turns out that this spike here is almost entirely driven by the doctor's setting of that knob. Now, there are good
reasons to set the knob at a low setting but this quantifies the risk that goes into
it very explicitly. That’s what doctors
found very helpful, much more helpful than just
staring at a number and wondering what’s wrong
in the room somewhere. There’s some things, of course,
that help the patient; in this case, their pulse is looking good. Then there are thousands of other features that are pulled out here and all together have a moderate effect. So it turns out that at
each point there’s going to be a different set of features
but those features are going to be a fairly small set of features that really are
driving the risk at this moment. But maybe we don’t want to use a linear model because it turns out, if we plot the ROC curve, if you’re familiar with
this, higher is better. Plot the ROC curve for the logistic
regression model we were just talking about and you get
a reasonable performance at 0.86. But then if we throw
a complex model in this case, gradient boosted trees, at this we get a significant jump
in performance. This is a 15 percent jump in the true positive rate if we fix our FPR at 0.1. So that's a very non-trivial change
in accuracy of the system. We tried a lot of marginal
transformations to try and make a linear Lasso work
better but we did not succeed. So this left us with this question of how do we best address
this trade-off? Again, that’s what pushed us
into explainability methods and led us to this understanding of unity and kept pulling that string. So now, that we have
pulled that string, we came back to this problem
and said, “All right. How would we address this now that we have this approach called SHAP?” Well, how will we use these SHAP
values in the operating room? Well, one way to convey
these values to a doctor for example would be to start
with the base value of one. So this is odds ratio, this is normal typical risk, and then we can show you
your current risk of 2.4. What we can do is we can take all of the features that are positive. That is they are increasing
your risk and we can plot them where the width of the bar is the impact
of that feature on your risk. So tidal volume has this much
impact, height, and weight. We can sort them and you
can see that there’s almost all of them have
very little weight because there’s thousands of features here and stacked against that we can see
features lowering your risk. So the sum of the purple plus the green, pushing against each other, equals, of course, the output of 2.4.
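(Numerically, that statement is just the local accuracy property; a minimal sketch, where phi, base, and output are hypothetical placeholders for one prediction's attributions, base value, and model output.)

```python
import numpy as np

increase = phi[phi > 0].sum()   # total width of the risk-increasing (purple) bars
decrease = phi[phi < 0].sum()   # total width of the risk-lowering (green) bars

# Local accuracy: the two stacks push against each other and land on the output.
assert np.isclose(base + increase + decrease, output)
```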
>>I find it interesting that this presentation doesn't make any distinction between time-varying measures and these kinds of things, like BMI, that are going to be the same throughout.>>Yeah, that's true.>>Those are in a sense not actionable, so I don't know what the right thing to do is,
because I don’t engage with doctors, but I would think that a distinction
would be interesting to that.>>So there’s two ways to at
least that we thought about that. One was do we want to actually go back into
the time series and try and figure out where in the time
series we’re getting the signal. Because in a time
varying thing, maybe your tidal volume is actually
really a time series of things. So you could maybe try and tell
me what part of it the signal came from. Some of the people in my lab
are actually working on that as an extension but we
didn’t do that right away. In terms of which ones are
intervenable and which ones are not, this is something that I’ll talk about the user studies
that we did do on this. But I think there’s still
more questions to tease out there because doctors almost instantly recognize that right
away when they see it. So I could of course label which ones they can affect and which ones they can't, but they found it nice to know regardless, because they also have a risk model in their head, and even the unintervenable factors in that mental model are ones they want to see here.>>I can imagine those serving
as landmarks or something, they might even be like thresholds
relative to your BMI risk, you just went above your BMI risk.>>Yeah, that’s a really good point. Yeah, exactly because the
fact that your tidal volume is at equal risk to the fact
that you're overweight, suddenly it's something I've pretty much already quantified as a doctor. I know how much I think this weight is affecting you. And my turn of that knob is now equal to that? That's interesting.>>With this model you're focusing on one particular outcome, hypoxemia. But if I change, let's say, the tidal volume, that can have effects on other measures of the likelihood of success. How do you even bring those in? So I love this, but it's conditioned on the fact that I'm following one thing and
I care about that.>>Yeah, that’s
an excellent question. Essentially, it’s saying
how do you best measure a whole handful of outcomes or dozens of outcomes potentially
that you constantly care about. I think we thought about that
and said this is like I agree, that’s totally where
you want to be when you’re in a hospital setting
because you don’t want them myopically affecting one risk and
totally messing up another one.>>[inaudible]>>Yeah. So I could hypothesize how best
to display that to doctors, but I haven’t actually tested that. So I’m not sure it
would be helpful for me to go at it except for the fact that of course you can run many of these models or train
one multitask model if you prefer. Then you’re going to have lots and lots of these things showing up. I think, at some point,
the real important point is people are not going to be
staring at this all day probably. At some point, you’re going to raise some warning or bring
their attention to something. I think that should
probably be one of a whole panel of things
that you’re interested in. At some point, one of those things in the panel is going to light up and you’re going to
go and look at it. At that point, we were trying to say we want to explain at least what the model was concerned about at that moment, rather than just showing a number, yeah.>>I could have framed the problem differently.>>Okay.>>So right now you're framing it as, "There is a risk; explain the prediction." What are the features, both actionable and un-actionable, that explain why I'm predicting
this risk to be like this? I could have framed the problem as, “We are at this risk, my goal is reducing this risk.” Give me interventions. Give me concrete things I can play with so that I
can reduce this risk. Because you could give me an explanation like this where there is nothing that is actionable for me. What do you think is the [inaudible] between framing the question as pure explanation versus framing the question as directly targeted
on the intervention?>>Yeah. So what you’re asking is
obviously what the doctor wants. I think the issue is oftentimes
we don’t have that to give them, in the sense that this
isn’t a causal model. It’s trained on historical data, it’s full of confounding, and that’s actually one
of the things going forward that I think is
a lot of open work here. I think in order to properly
interpret this model, you really probably have to have some sort of causal
graphical model sitting in your head about where or how all these variables
relate to one another. You have to actually
sit down in front of this with experts before
it ever goes into the operating room and look and see like peak pressure actually helps people but you found it
hurt people because of an association with some confounder
that you didn’t think about. That’s all stuff that needs to be sorted out before it goes
into the OR and it has to be done beforehand because you can’t expect people to
do that on the fly. But that being said, I’m not sure where the boundary is but I’m hesitant to hide too
much of the actual association. I would rather probably tell them I don’t know how
to visualize this but I really think that it’s
future work and how to integrate causal modeling
with this to help people not make causal claims
about things that are not causal. But I guess what we came
away with was oftentimes, something could go wrong
but doctors are there for six hours in some of
these surgeries and there can be really simple obvious
things that would be obvious if they were paying
a 100 percent attention for six hours but they
aren’t otherwise. So just seeing why the algorithm
is concerned can get you a long way there but it’s not
going to get you to causality. So I think that if we wanted
to answer the causal question, we would actually have
to be in the loop in order for me to trust it at least. We’d have to be in the loop, and there would have to
be some understanding of what’s randomized and what isn’t. So from our perspective,
we said, “Well, how far can we get with
an observational study?” That’s why we’re out here. Does
that answer your question?>>Yeah. That answers my questions, just that there has been
this more recent line of work. I’m sure you know about counterfactual explanations and
the actionable explanations and having found the
absolute [inaudible] and I think the core insight there is, "I want my explanation to be based on things that are actionable." In my mind, I also don't have an answer, just as you don't. I'm just trying to weigh the trade-off between giving this kind of an explanation versus giving, like,
a counterfactual actionable, whatever that is, and
it will be, I guess, good to do the same
observational studies with professionals to see how
they’re interpreting them.>>Yeah. No I agree. I guess the one thing we want
to avoid is presenting things that are
observational as though they were true counterfactual things.>>I think there’s [inaudible]
that they look at this.>>I agree, yeah.>>They’re going to look at
the tidal volume and say, “Oh, yeah. I’ll drop this and then I should
expect to see this risk go down.">>My question's along those lines too, but perhaps
it’s an easier question. So let’s make the assumption
that this model is a perfect model and it truly represents the real world
in every possible way. I’m wondering, maybe I just missed this in the presentation
earlier in the process, but how simulatable is this as a doctor? Like, if I know the Shapley score I get from tidal volume and I know that the volume is two right now, can I make assumptions about what the model would predict if I dropped the tidal volume to one? So I've got a black-box model, I see an attribution value of two, and I say, "Okay, I can turn this knob down to one." Do I know how the model will change? Is that valid?>>There's two questions there. One I'm going to postpone to later
in the talk where we actually try and understand the
individual features one at a time, which this plot does not tell
you. That’s the first thing. Just by looking at
this all you know is what the current value is doing. However, you can draw
curves with these things. First you could of course do
a partial dependence plot, just literally draw the curve and see it, but it also turns out that you
can make interesting plots using lots and lots of
these explanations and that’s what I’ll talk
about a little bit later. So maybe ask that question
again when you see that and then tell me
if that answers it.>>Yeah. I guess I have one other question too
about picking base values. So I guess the guidance is to pick it based on your
training data set but if the people I’m evaluating on are different than the people that
I’ve seen in the training set, is there any problems there?>>Yeah. So for example, let’s go to a financial case. It might make a lot more sense to
use the background distribution, not to be your training
data set but to be the distribution of
people who got a loan. Now everything is with respect
to people who got loans. Here is why I didn’t get a loan. So I think it really depends
on what is the reference frame because if you’re sitting in a room as a doctor and
someone comes in and says, “I have a fever,” and your risk
for something goes up, that’s because you mentally
have some background model in your head of what the people
typically come into your office. If you live in the jungle you might have a different
background distribution than someone who lives in Alaska. But you should think the same way when you’re
building this algorithm. It’s like the person has
something in their mind. You’d like your background
distribution to match their mental model
as much as possible. All right. I’m going to
move pretty quickly. I think 11:45, is that
when we're trying to wrap up, or somewhere in there? I'm not sure of the exact timing.>>I think you have until 12 but you're getting
a lot of questions.>>Okay. I’ll aim 11:45 and do that. So let’s move quickly here. So we have this plot explaining an individual prediction at
individual point in time. If you rotate it 90 degrees and
then you can run it over time, you get things that look like this. So essentially, this is
going forward in time. This is the current time, so this is actually
this slice right here and remember this junction
is the current prediction. So you can look at the junction between green and purple as the prediction over time, and we can see the area
of different features changing throughout
the course of the surgery. In fact, tidal volume
is actually something that expands and then goes away, and just in this little
piece right here. So that tells us it’s
a transient effect at this point in time
due to the model. So this is nice for seeing the flow of a patient's risk over time
during a surgery.>>Can I ask a question about this?>>Yeah.>>So if you are
explaining a prediction?>>That’s right.>>Are you explaining
a prediction a point in time because when I create this
explainable models for each point, the model may take
different features?>>That's correct.>>Is it representative, then? So, over time, which features are meaningful? Because, for example, tidal volume may go away and
something else may come in.>>Yeah. So if I’m looking at an individual prediction,
I just sort them. So tidal volume maybe here
because it’s really big but over here it’s going
to be buried somewhere in this tail, so you’ll never see it.>>But I mean the different features. Does the explanation use
the same features or?>>Yeah. So this is all features. All the features are here. Just some of them have
larger widths than others because they’re more important
at that moment than others.>>So in this sort order here is
based on some sort of average?>>This sort of here?>>Yeah. Are we doing
crossovers [inaudible]?>>No. There’s no crossovers here. So I just sort by
their global importance, and then tidal volume may be farther down in the bundle before it pops out. That roughly puts most of them out in the tails.>>So if I have two features that
I believe are highly correlated, I may randomly pick
one over the other at any given time, so won't I see all of these things fluctuating, even though they are actually
representing the same thing?>>Is that possible?>>Well, what you’re explaining here is what the models
are actually using. So that would only fluctuate
if the model itself was actually randomly at each time
point picking different features.>>But if I have two features, they’re identical, they’re
giving me the same message. Can I really ensure that I’m
going to be- I’m trying to understand that [inaudible]
I’m giving it the right rate.>>Essentially, you have to think
about like this is a model, that is the function
that is the model making a prediction and instead of inputs let’s say two of them
are very correlated, and I want to know how much do
those inputs matter for my output. That depends a lot on the function. If it's a lasso, maybe it only picked one and ignored the other one, in which case you'll
never see one of them in this plot because the model
doesn't actually touch it. Maybe it was a ridge model,
ridge regression. In that case, it will spread them out equally perhaps between the two, and then you would
have seen both doing all the same things together
throughout the whole plot. But in order for them to swap, the model itself would have to change from one time
point to another. It would have to have
some sort of like if this, then that kind of statement and
that would be very unlikely.>>I think I need to understand how. So the predictor model itself
is a black box at this point?>>Sure, it’s just a
function you can evaluate.>>So I think I’m questioning
how the explanation can really understand the internal logic of the model because things
are getting highly correlated.>>It’s essentially by breaking that correlation, perturbing
them independently. That’s deep into it I guess. I don’t know if we should go into
that conversation at the moment, but essentially it’s basically perturbing the inputs to
the model to understand it. That's how all black-box methods have to understand the function, because there's no other way to look
inside it without making some assumption
about the internals. But ultimately that’s
what’s going to happen. So that’s how it’s
going to perturb what’s going on and then assuming
that your model is consistent, you’ll get consistent
explanations across time. So we tested how this
worked by going back and replaying historical data
to a set of practicing anesthesiologists, the kind you'd have at Harborview, U-Dub, and Children's Hospital. We showed them cases both with and without assistance from Prescience, and then asked them to anticipate, on a scale of 1-100, the risk in the next five minutes. So that allowed us to make ROC curves not just for Prescience, but also for the doctors, and not just the doctors, but the doctors with and without help from Prescience. So if we do that, what we can do is run that for
all five doctors and then average their ROC curves to get
the solid green curve here, and we can do the same
thing when they were assisted by Prescience
and get a blue curve. It’s hard to say exactly what
the false positive rate of a doctor is because they don’t
have that in their head, but let’s assume that they
anticipate 15 percent. Then that provides
a very significant improvement in their ability to anticipate
which we were encouraged by.>>Can you really find meaningful differences at low false positive rates?>>Yeah, exactly, and that's because once you get to lower and lower false positive rates you need many more rated cases; we didn't want to wear out the doctors by asking them to rate forever. Yeah. So that's why we highlighted these things here and tried to pick an FPR high enough that we are confident in this distance
of separation, which is significant in that case. This is something we wrote
up; you can read more about it in Nature BME. I'm going to move fairly quickly here to make it through
the rest of this stuff, but there’s still room
for improvement here. Why is there room for improvement? Because as we just talked about, model-agnostic methods
can only explain models by perturbing their inputs and
observing how their output changes. It’s really the only way
that you can explain things, if you don’t make any assumptions
about what’s inside them. But how does that hurt us? Well, it turns out if you want
to explain a whole data set, it can be inconvenient to do this. Let’s imagine we’re trying to explain a whole data set just
with a very fast model. This is going to be XGBoost, and inference on sets of trees is very quick. But here what we're doing is taking simulated data and essentially increasing the number of features in our model, so we can control that all of these features really matter, and then retraining an XGBoost model and explaining 10,000 predictions.
What you can see here is minutes of run time, and it's just a linear increase in the number of features, as you might expect, because I essentially have to perturb things for each feature. And this is the lower bound. This is just the time
to run the model, so whatever estimation stuff you
do on top is an addition to that. Here, we're talking a couple of hours by the time we're up to 90 or so really important features, which is certainly doable for a whole dataset of 10,000 predictions, but it's unpleasant
and would totally change the way a data scientist
does their workflow. It’s not that we were using
too many samples either. This is the previous permutation sampling approach that we’ve talked about and then this is the new one,
the regression-based approach. This has lower variance, but still a couple percent of the magnitude of these bars is twiddling around a little bit. So the noise is non-trivial. You have to think about the
fact that there is noise, there’s this balancing act
that’s going on. So if we go back to what we
did for an NP hard problem, there really is
a third option and that is we can restrict
the problem definition. So we don’t have to solve this
for all possible functions, we can maybe just solve it
for a class of functions, which is very attractive
in the context of machine learning because machine
learning models come in classes, many of which are very popular. So we decided to choose the class
of trees, and this leads to a new estimation method for these classic Shapley values for tree-based machine
learning models. So why did we pick trees? Not just because we’re using them, but also because other people,
it turns out, were using them. So this is the Kaggle 2017 survey; they basically asked data scientists at work, what do you use for your
machine learning models? Not surprisingly, logistic regression
is right up there at the top. But then if you look at
it, it’s Random Forests, Decision Trees, Ensemble Models,
Gradient Boosted Machines. All of these involve or are
totally dependent on trees, so they're tree-based machine learning models. So whenever you're thinking of explaining a complex model, people typically mean a non-linear model. So that means almost all the
nonlinear models that are deployed and used currently in
practice today are based on trees. So if we can make significant
advances on this class of models, that will impact a lot
of current applications. So how will we do this? Well, let’s imagine we wanted to explain it directly with no sampling. What would we be facing? We have T trees in
our ensemble, let’s say, and each of them has L leaves, and then we have M input features. So now we have M factorial because we have all these permutations
we want to do, and then we still have to compute expectations. So that's another sample of however many N you want to draw in order to compute each expectation. So that's factorial. It turns out there's a way to rephrase it as exponential, but we're not really much closer to tractability yet. But now we can restrict
ourselves to the class of trees, and it turns out
there's a fairly cool but involved recursive algorithm that I
won’t go through today. It’s interesting because, of course, the solution here depends
on an exponential number of conditional expectations, but they fall out in the trees
in such a way that we can design an algorithm that runs in O(T L D^2) time, where D is the depth of the trees. So T L is just linear, the number of nodes across all the trees, and then the D squared comes from the depth. This polynomial run time gives us the exact solution, just as if you had enumerated everything, which is very nice because now we don't have any of
the problems we had before. In fact, if we plot the run time of our new method against the lower
bound for the agnostic methods, it’s like indistinguishable
from zero, which is nice, and of course, there’s absolutely
no explanation variability. So if you’re a data scientist who’s thinking about explaining your model, suddenly this thing
is done in seconds and you never have to worry
about sampling variability, which really changes how
people can use this stuff.
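(This tree algorithm is what the open-source shap package exposes as TreeExplainer; a minimal usage sketch, assuming a feature matrix X and labels y are already loaded.)

```python
import shap
import xgboost

model = xgboost.XGBClassifier().fit(X, y)

# TreeExplainer runs the polynomial-time tree algorithm described above,
# so the attributions are exact rather than sampled.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one row of attributions per prediction
base_value = explainer.expected_value    # the base rate the attributions start from
```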
It also turns out this matters because, if you think about trees, you might think trees have been around for a long time, like we should know how to explain trees by now, right? They have been studied
for quite a while. A lot of people have
looked into explaining and understanding feature
importance in trees. But almost always people are thinking about global feature importance. What is the importance of
this feature over my whole model, over my whole training data set: gain, Gini importance,
permutation importance. These are all things that are
global feature importance methods. There's only one approach we found out there in the literature that heuristically explains an individual prediction from a tree, which is what we need for John, or what we need in our medical examples, etc. And it's heuristic and it's inconsistent. So what do we mean by inconsistent? So this goes back to
those two properties. We had local accuracy
and consistency. Consistency, is one of
those properties you want to hold. Here’s an example of when it fails, and this is from
the heuristic methods we have for trees right now. So here is a set of AND functions. So my model is very simple, it's just an AND function: take some binary inputs, all independent, and if they're all true output one, otherwise output zero. A two-way, a three-way, a four-way, all the way up to a 10-way AND function, and then if I want to be consistent, of course I'd have to allocate
credit fairly between them. If my background is typically zero, then I’m going to just
split up my 1.5 and 0.5 for two-thirds, fourths, fifths etc. So this is like we know the right answer because
it’s such a simple model. But then what we do is we say, “Let’s run the heuristic methods.” Well, we get something
very different. It turns out that most
of the credit actually goes to the leaves of the tree, just the way that this method works. It’s actually very similar
to the gain method. Essentially, as you go down the tree, the purer and purer your group gets. All of a sudden you get more and more gain for getting it done. So this is exactly the opposite of what you want because
these are probably the most important features
because you greedily selected them for your tree-building. But by the time you get to
like a depth-four AND function, almost no credit is going to the root and almost all of it is ending up in the leaves. So this is the definition
of inconsistency. When we run our exact TreeSHAP method, of course it reproduces exactly this, because it comes with guarantees. So this is one strong motivation for using these values over what we had before.
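As a toy illustration of that check (my own sketch, not the talk's code), you can build a two-way AND out of a small decision tree and verify that the exact tree SHAP values split the credit evenly between the two inputs:

```python
# Sketch: a two-way AND model should give each input equal credit.
import itertools
import numpy as np
import shap
from sklearn.tree import DecisionTreeRegressor

# All 2^2 binary inputs; the label is x1 AND x2.
X = np.array(list(itertools.product([0, 1], repeat=2)), dtype=float)
y = np.logical_and(X[:, 0], X[:, 1]).astype(float)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
shap_values = shap.TreeExplainer(tree).shap_values(X)

# For the all-ones input the prediction is 1 and the average prediction over
# the data is 0.25, so by symmetry each feature should get half of the
# remaining 0.75 of credit.
print(shap_values[-1])  # expect roughly [0.375, 0.375], an equal split
```

Another one is to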
basically go back and say, “Well there’s lots of metrics
out there that we can use to measure the performance of
a local explanation method.” Particularly, ones that are
assigning numbers to their inputs, so local feature attributions. No individual metric is going
to be the right metric, but there are many
ways to go about it. So what we did is we defined
lots of metrics that, I’m not going to go into
the details of all of them, but they’re all about
perturbing or removing features and then seeing, did I get the right one? You said that feature was positive, so I'll take it out; did that make the output go down? Okay, etc. Then, we're going to
plot that against lots of explanation methods designed specifically for trees, because we are not necessarily looking at model-agnostic ones; we're just looking at the tree-based ones. When we then look at our exact explanations for trees, which we call TreeExplainer, we can see it consistently
performs well across all these metrics in general, which we found very encouraging both for a decision tree
and a random forest, and we also tested this
on gradient boosted trees. So I think this broad benchmark evaluation means these axiomatic properties are not just nice theory; they lead to practical improvements in the explanations you get. Another thing that we
found actually surprising, and weren't actually setting out to do, is that you can of course take local explanations and put them
back together to get a global one. Because we take
the mean absolute value, or if I'm explaining the loss function, I can just take the mean of the loss attributions, for example. When we do that, we can
get a single number, but now we are using these axiomatically motivated attributions for each individual explanation. So when we put them back together, it turns out we get better
feature selection power than previous approaches
like gain and permutation importance. Gain is like Gini importance,
with that term. So these are the ways that people do feature importance
for trees today, which is fairly widely
used in a lot of domains. So we’ve found it
pretty surprising that this is a simulated data
set we pulled from some other paper that was using
to check feature selection power, and this is when the
interactions are minimum, and you can see reasonable jump both of the decision tree and
with a random forest. So we’ve found that
surprising and nice. We’ve actually got better
global feature selection power when we went back to global
from our local methods. The last thing we checked is that
while these are explanations, so they should probably
be consistent with human intuition. How
can we measure that? What we can do is go back to
the simple functions like AND, OR and exclusive or things like this. These are things people in theory if explained well should
understand themselves. They know how the whole thing works. Then, you can ask them how
would you define credit? How would you allow fairly allocate credit
among the input features? If you do that for an AND function, unsurprisingly, this
is an AND function mixed with some additive effects. They fairly divide it
between fever and cough, which is part of a story problem
that we used on Mechanical Turk to get people to understand
what this model was doing. Of course, we didn’t tell it
was machine learning model. Then, we can compare that to
the shaft values that we get, which nicely lineup
with human intuition, but then we can compare that
to the heuristic values and get significant differences. This significant difference
happens not just for AND but for OR exclusive OR, all that non-linear functions. So it agrees when you have
a linear function but otherwise we see significant deltas
between what we observed, and a consensus from Mechanical
Turk studies on simple models, which is another reason that we felt we should prefer
these sharp eyes. So, in review, for trees, we’ve narrowed in on
particular type of models trees. We have really fast exact
computation methods that allow this to be very practical. They come still with these
attractive theoretical guarantees because you’re not doing
any approximations now. They have really good
performance across a broad breadth of
explainable AI metrics. Surprisingly, they actually improved global feature selection
power which is fun, and they have strong consistency
with human intuition. So, using all those results, we basically said, what
can we do with this? How can we use these to improve our ability to do
machine learning in practice? So we built a set of
explainable AI tools to do that. So, to explore this, let’s look at a couple data sets. This one here is a classic data set
called NHANES ONE. From the 1970s, it’s a very classic, that 20,000 people in
the United States were given lots of standard medical tests, and then 20 years later they
were followed up for mortality. So we can train a cox
proportional hazards model with graded boosted trees
in this case, to predict mortality over
in the US in the 1970s. So again, as I say, we can take the mean
absolute value to get a global feature importance. If we do that, we can learn what’s the number one risk factor
for death in 1970s in the United States and Drumroll
is age. Big surprise. What was most surprising
to me at least in magnitude was that the number
two killer was being a guy. Then, after that comes, the things that you might think
about in terms of blood pressure, inflammation, body mass, and
a whole bunch of other things. So this is what you would get if you just explained the model today, maybe not with
a theoretical guarantees, but you would get
bar plots like this. You just said got feature importance
from one of these things.>>But now that we have explanations
for each individual sample, we can do better than just
put a bar chart here. Okay? Because one thing
this bar chart does, is it conflates the prevalence of an effect with the
magnitude of an effect. Whenever you have
a global feature measure, you have to come up with one number. So you have to pick a trade-off and somehow combine
these two things together. That’s really important
because we often want to find rare high magnitude
effects in a population. You could, of course, train lots
of models on lots of subsets of your population and then look at their global feature importance. But now we can actually look at
each individual in the population, look for groups of people that have high magnitude even though
those maybe a small group of people. So as an example, let’s go down to the
one at the bottom of our list here, called blood protein. If you look at this
you think, well maybe the global importance is fairly low, so maybe this is not important
factor for life expectancy. What we can do over here is, on the x axis we’re going to plot our feature impact or SHAP value. So negative is good, meaning it’s lowering
your risk, positive is bad. It’s essentially years of life
in a linear sense on the scale. What we can do is, we plot
a dot for every person, because for every person
we’re going to have a value assigned to the
blood protein measurement, like an attribution assigned to it. Then, we’re going to color those dots by the actual blood proteins. So we can see whether
it’s high or low. When we do that, if we do like a little bee swarms style plot where the dots pile
up to show density, we can see a long tail stretching off to the right
here of red dots. So there’s a pile up here rated zero, which means for almost everybody
their blood protein has no impact on their
risk for mortality, at least with respect to
the general US population. But like for that
particular person right there blood protein is
extremely important, it’s probably
the most important factor in their life expectancy perhaps. These rare high magnitude effects are things that pop out when you can explain everything at an individual
level across the population. Now, you can say well, of
course, we could do that before, but it’s really nice now
that we have these very high-speed trustable
explanation approaches for TRIZ because this makes it really practical because we can quickly explain whole data sets
and then plot them like this. We could do this for
all the features. If we do, we see interesting trends. So here on the right side
you’ll see there’s lots of ways to die young by
being way out of range. Okay. So these are people with
very high blood pressure, these are actually
very underweight people. There’s lots of reasons
to be out of range. But in case you’re looking around
and browsing articles online, there’s not a lot of ways to be wildly out of range and
somehow live way longer. So I would urge skepticism
except be young. Okay? So you can be on
that I guess the best way. But these are like
insights that come when we tease apart prevalence
and magnitude of effect. Okay. We can do that now that we’ve
explained individual samples. But we also might want
to zoom in closer on one of these features and
understand more about it. So one way we could do
that is to literally put that features
value on the x-axis, like a partial dependence
plot would be. But instead of scrubbing
things around on a PDP plot, instead what we’re
going to plot now is the SHAP value for
that feature on the y-axis. So for every person, they’re going to have
the impact of their value and the actual value of
their systolic blood pressure. We plot them we get this. So this looks very much
like the standard curve of risk you would get for a standard
systolic blood pressure stuff. So here’s a 120, higher 120 you start going up and in
your risk of mortality. But what you notice is that there’s
dispersion here. All right. There’s not like a PDP plot
we just get a line. Now there’s a vertical dispersion
and this dispersion is driven by interaction
effects in your model. If you had a linear model, you would never see any vertical dispersion. But what this means is a lot of people with
a blood pressure of 180. But for that person, it’s
less concerning than for that person, and question is why? Why is it more concerning for
someone than someone else? Well, it turns out, I don’t have time to
go into the details, but you can extend the ideas from game theory about putting
attribution on individual features, you can extend that to
interactions of any order, but what we did is we implemented
it with this algorithm, a high-speed way to compute the interactions for
all pairwise interactions, so now we’re essentially
assigning credit not just among like a vector of
features attributions, but a matrix of feature attributions where the diagonals
is the main effects and the off-diagonal is
all your pairwise interactions. When we do that, now we can
look in and see well what are the interactions that are
here and what’s driving that. It turns out in this case is age. So now we can color by age in order
to see what’s the effect here. We can see that this is highlighting how early onset
high blood pressure is much more dangerous than late onset high blood pressure
in terms of your mortality risk. We don’t need to go into all the
medical reasoning behind that. But I think this is something
that kind of highlights again another example of when
you have lots and lots of local explanations that you trust, you can pull out signal from your data that you
might not otherwise have seen from models that would have
otherwise been considered opaque, in this case, like depth six decision
trees several 1,000 of them. Now, I said we had
interaction effects. So we could also plot a feature, let’s say age on the x-axis, and here we can plot one of the off-diagonal things in
our interaction matrix, this time between age and sex. If we do that we’ll see
the varying impact of being men versus women
over a lifetime. So here we have risk being fairly stable and then they
cross and peek at age 60, which when we talked to doctors
who was strikingly similar to when cardiovascular risk also peaks which is predominantly affecting men. It’s not causal so I don’t know
exactly what’s driving it. But these are the types of very interesting interactions
that just pop out from just plotting
many many local explanations, in this case, from a GBM tree.>>Why does it go down
from [inaudible]?>>Well, what’s happening here is remember this is
the off-diagonal effects. There’s always a main effect. It shows that women are
always better off than men. But this is going to
of course be centered because its main effect
is subtract the mean. So by definition it’s simply
showing the relative difference.>>Yeah.>>So once I know this one
I kind of know this one. Last thing I want to end on here is probably actually one of the more
fun applications of this stuff, and it comes from when you have machine-learning models and you deploy them, it turns out they break. I know you’ve never heard this, but it turns out sometimes
they have problems like most software, bugs show up, features drift, someone
changed something in the data pipeline that messed up the way the thing got
deployed and no one knew. Sometimes that costs a lot of money depending on
what pipeline we're talking about. So in a hospital that can cost money or lives, depending on what you're looking at. So we're asking how
can we better improve the safety profile of ML models when they’re used
in high-stakes decisions? We can talk about
model monitoring in a hospital. So one popular thing people
do with ML models in hospitals is trying to
predict the duration of things so they can do
better scheduling. That’s fairly benign task. So here we’re predicting
the duration of procedures in a hospital over
the course of five years, so this is data from
U-Dub and Harborview. We use the first year
of data for training. Okay. We trained to predict based on all that kind
of static features, how long is this procedure
going to take? So what doctors, how long do
they need to be scheduled? Then what you can see, of course, is a natural jump in error once I go from train to test. So this is training error, this is test error, and it's
just naturally higher. But if you just look
at this over time, you’re like well, I don’t know if
my model is doing great or not. But this is typically what people do, they just look at the Loss of
my model over time and see, is it going down? Is it going up? You’d never see if
there was a problem here unless it totally
destroyed your model. To demonstrate that we actually introduce a bug into the model here, we actually went in and we changed the codes of two rooms in a hospital. So it’s really easy change that
can happen in any data pipeline, just swap the names
of those two rooms and this training, this test. Can you see where we
introduced the bug? Of course, the answer is no, and you could guess, what you
guess in your head, but then what we can do
before I show you that, I guess I already did, I can’t take it away now
it’s already there.>>What we do is we said, let’s
explain the loss of the model. So not the output of
the model but the loss. If we explain the loss of
the model with these SHAP values, what we’re going to do
is we’re going to take that value which is the loss, and we’re going to allocate
it among the input features. We’re just going to say, for
each of these input features, how much did you hurt or
help the loss of my model? Of course, I can do that on
every single input prediction. So I’ll do that, that’ll
essentially deconvolve my loss backwards through
the model onto the input features.
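Conceptually, that is just pointing the same explainer at each prediction's loss rather than at its output. The shap package exposes this for tree models through a model_output option; the data, model, and feature names in this sketch are purely illustrative assumptions, not the hospital pipeline.

```python
# Sketch: attribute each prediction's loss back onto the input features.
import numpy as np
import shap
import xgboost

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                    # stand-in features, ordered by time
y = (X[:, 0] + rng.normal(size=2000) > 0) * 1     # stand-in binary outcome
X_train, y_train = X[:1000], y[:1000]
X_new, y_new = X[1000:], y[1000:]

model = xgboost.XGBClassifier(n_estimators=100).fit(X_train, y_train)

explainer = shap.TreeExplainer(
    model,
    X_train[:100],                           # background data for the expectations
    feature_perturbation="interventional",
    model_output="log_loss",                 # explain the loss, not the prediction
)
loss_shap = explainer.shap_values(X_new, y_new)   # (n_samples, n_features)

# Negative values mean a feature is lowering the loss (helping); a sustained
# flip to positive for one feature over time is the bug signal described here.
print(loss_shap[:5].round(3))
```

Then I can look at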
these individual input features and room number six was one
of the ones that we swapped. The question now is, so red, I’ve colored the ones
where this room is in use. So they're in the room, and
blue is when they’re not. So it’s not very
informative to know when they aren’t in the room because there’s lots of rooms
in the hospital. But when they are in the room,
you can see it’s typically lowering the error of the model,
so it’s got negative value. That means it’s lowering the loss. Then all of a sudden here,
now it’s hurting the loss. So if you look up here, you’d never know that this bug had been introduced
into your model pipeline. But if you look down here, you can see a very clear
signal that pops up. Importantly, you might
be able to find this by just monitoring the input
statistics of your data, because that’s basically what people do today from
my understanding, in order to monitor models. They look at this, and they look at the input statistics of
your data over time. But if these rooms had equal usages, you would never see a change
in the input statistics, their marginals look the same. But the effect on the model’s
prediction is quite dramatic. Not in overall sense, because there are many procedures happening all over these hospitals, but it’s certainly hurting
a lot of the predictions, hundreds if not thousands of
predictions are impacted by this. So this, I think, is
really interesting and could really be helpful in basically taking
explainable AI and using it to impact model
monitoring in practice. Now, it’s not just bugs
that we introduced. So this is actually a batch of data that was in theory
cleaned but of course, data is never fully clean. So what we found was, we just plotted this for lots
of features and I just pulled out a few examples
that were interesting. Here’s one where we plot, are you under general anesthesia, the flag in the electronic
medical record, turns out somehow it was wildly unhelpful for
the algorithm right in here, for a subset of the rooms and
a subset of one of the hospitals. That’s something you would
never have found otherwise, turns out we went back and we
found out it was because of some transient EMR connection issue
between various things. Exactly the thing that you would
want to fix if you were in the middle of it because it’s
hurting your prediction performance. Here’s one where we actually
observed drift over time. So this is another binary feature
where we’re saying, “Are you undergoing
atrial fibrillation ablation?” This is like a procedure where
they zap a part of your heart in order to stop
AFib from happening. We're trying to predict the duration, and what you can see is a general
upward trend where particularly, the training period but also
in the early test deployment, we’re helping things and by
the time we get to the end of it, this procedure’s duration
has actually changed, and we went back to the hospital and the cardiology department
and they were like, “Oh, well actually,
different people came in, and we’ve got new technology,
and we’re much faster now.” Well, that’s nice. That’s great for the patients but it’s
bad for the ML model, because now your feature has changed, and again, the marginal statistic
has not changed at all. So another example of how
you can use this stuff. So to review, I guess I’m
going to wrap it up here. In theory, we talked
about how we could unify a variety of explanation
methods into a single class, and that gave us insights
into how they are related. We talked about strong uniqueness results from game theory that now spread to that whole class
that we connected together, and how that can help
us pick new parameters. It’s not just the
theoretical perspective but also impacts
practice because we have a new way of estimating these values through regression instead
of simply random sampling. That applies for black-box models. Then we also propose a new way of estimating these in very high
speed way exactly for trees. Based on these things, we
are then able to build a whole set of explainable
AI tools that help you build and monitor and understand
the models that you’re building, particularly if you’re using
these trees, it’s very convenient. We talked about a variety
of applications, the motivating one
that started us down this track in anesthesia safety, and also some work in
mortality risk and hospital scheduling demonstrating how
these tools work in practice. So I’d like to highlight that you don’t
have to take my word for it, you can obviously try it
yourself. It’s on GitHub. It’s also actually, it supports
and is directly integrated into XGBoost, LightGBM (which is a Microsoft thing), CatBoost, and scikit-learn. These are all tree-based integrations for this fast tree model
implementation, and there's a whole bunch
of stuff that we did working on deep learning models. Let’s assume the classic deep
learning models, what can we do? That I didn’t talk about but we have integrations of TensorFlow,
Keras, and PyTorch.
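On the deep learning side the pattern is similar, just with a gradient-based explainer instead of the tree one. A toy, hedged sketch (the data and architecture here are made up for illustration; DeepExplainer is the other deep learning option in the package, depending on your framework versions):

```python
# Sketch: SHAP-style attributions for a small Keras model via GradientExplainer.
import numpy as np
import shap
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)).astype("float32")   # toy stand-in data
y = (X[:, 0] > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=3, verbose=0)

explainer = shap.GradientExplainer(model, X[:100])  # background reference samples
shap_values = explainer.shap_values(X[:10])         # attributions for ten inputs
```

We've also, as we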
said we started with hospital medical problems
that got us down this road, but it turns out it got
used all over the place. You stick things on GitHub, who
knows what they’ll do with it. Turns out you can help optimize
performance in sports teams. Microsoft actually uses it in their Cloud pipeline
right now as does Google. Cleveland Clinic uses it for identifying different
subtypes of cancers, which I thought was cool,
close to what we started with. It’s been used for optimizing manufacturing processes
for a jet engine stuff, they build complicated
models and try to figure out what’s breaking and why? I had a chance to work with a very large international
financial banking stuff, and this is really fun because
actually got to go in and help them do augmented
intelligence stuff, where we basically said they
have models and they have people and at
a certain risk threshold, they use people not models, but you don’t want to
lose everything that the model had when you
transition over to people. So I got to work with them for, I did consulting with
them essentially to help them use these values to help better essentially collaborate
between humans and machines for their decision-making
and financial risk prediction. It’s also been used for economics research and a bunch of other stuff. So I’m going to skip over well, I want to skip over future work. I’ll skip over what’s next but, there’s a lot of places to
go from here as you can probably imagine, a
lot of fun things. In theory there’s a lot of fundamental
interpretability trade-offs that exist when you
have correlated features, okay. So you touched on
this a little bit but sometimes you have to
decide am I going to violate my data or am I going to understand what
my model actually did. So there are some inherent trade-offs there,
really interesting. There’s lot of interesting
work that’s going on that I’m involved
with actually back in our lab two and using explanation constraints to guide model training, it’s a really a whole
new way of saying, the model is doing something, but do it this way, such that
your explanation is like this. That’s a really fun way
of doing things. In practice, I think there’s a lot of work like
this model monitoring stuff is fairly new and I think there’s a lot of things that
we could do there to make that much more practical and usable to the large number of people. It’s also, I think when we’re talking about actually
deploying stuff into hospitals, like causal structure and assumptions and if you have
causal modeling assumptions, that’s essentially expert
knowledge that needs to be integrated into these systems before you ever put
it in the hospitals, so you don’t expect
people will suddenly think about how a confounding
could happen on the spot. In applications, I always talked
about the financial one with high-stakes decision making
hospitals that are very similar. Same like, “I have an AI and I am here and I need to make
a decision, how can you help me? We’re doing a lot of
interesting things understanding adverse drug interactions
and stuff with genomics and stuff with
understanding protein folding with, these are all ongoing
collaborations here at U-Dub. I think I talked about
the finance stuff. So I’m going to skip over
this because I’m out of time but all of this work of course has been in close collaboration with my advisors who in here at the U-Dub. I’ve had the opportunity to
mentor a number of younger grad students in the PhD program here, also some in math, as well as the
MD-PhD program here at U-Dub. So there are a lot of papers
that we’re working on together, both from my basic genomics
research to better understanding time series data to estimating drug-drug interactions, I just talked about
that, even cancer stuff. Of course, there’s always
external collaborations to make any medical stuff happen, you can’t do that on your own. So we’ve worked with
some great anesthesiologists at children’s, U-Dub, and Harborview as well as stuff
I didn’t talk about at all, it had to do with Kidney Research
Institute here in Seattle, which has done some great work there. Cardiology, and then a bunch
of work that I didn’t talk about because of time but we had fun things learning
large-scale graphical models to understand genomics with some
people at University of Toronto. So thanks.

