
University of Washington Open Course: Introduction to Data Science 6 - 7 - 07 Overfitting (11-04)

Okay so where are we?
We're talking about supervised learning
and we're talking specifically about
classification problems where we predict
a class label based on the other
attributes.
And so we gave an example of predicting
whether or not we played some sport based
on a few different attributes
representing the weather.
Or, predicting whether a passenger aboard
the Titanic did or did not survive, based
on things like their gender, their age,
the fare they paid, and so on.
Okay.
And so, to get started here, we talked
about just manually guessing simple rules
that might explain the data, and, as you
might imagine, the relationships between
the attributes and the class label are
complex, so your intuition may be wrong,
and you need some way to automate the
search for these kinds of rules.
Okay.
So we've talked about this one-rule
algorithm that proceduralizes the process
of choosing a good rule, and that worked
fine.
But, you know, obviously it's pretty
limited, right, it can only find very
simple kinds of relationships, and so we
also talked about the sequential covering
algorithm that looks for, and builds up
sets of, rules that may have multiple
conditions in them.
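To make the one-rule idea concrete, here is a minimal sketch in Python. The toy weather table, the attribute names, and the `one_rule` helper are all illustrative assumptions, not the course's exact data or code.

```python
from collections import Counter, defaultdict

def one_rule(rows, class_attr):
    """1R: for each attribute, map each of its values to the majority
    class, score the resulting rule by its error count on the data,
    and keep the single best attribute."""
    best = None
    for attr in rows[0]:
        if attr == class_attr:
            continue
        # Count class labels per attribute value.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_attr]] += 1
        # Majority class per value; errors = rows that don't match it.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Toy weather data (values are made up, not the course's exact table).
weather = [
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
]
print(one_rule(weather, "play"))
```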
Okay, you know, so great, so a set of
rules, themselves more complex, is more
descriptive of the data and therefore has
better strength as a classifier.
But they're also harder to interpret.
And one of the points that we'll be
making in this class is that, you know,
communicability of these models,
communicability of the results is
important, okay.
And so, a single rule is easy, a set of
complex rules is perhaps less so.
And for that reason and for other reasons
we can talk about decision trees.
And we sort of showed that a decision
tree, I was about to say, can be
constructed from a set of rules, each
path from the root being a rule, but
that's not quite true.
Constructing a decision tree from a set
of rules is not necessarily trivial.
Going the other direction is pretty
straightforward, as we described: every
path in the tree is a rule, okay.
But there is a relationship. Moreover,
given a decision tree, it's pretty easy
to interpret, I would claim, right?
You just sort of take your data item and
look at the root node and answer the
question.
You know, is the gender male or female?
Female, well then go down this branch;
male, go down this branch. So it's very
easy to understand what's going on.
It's also easy to understand what the
most important decisions are because they
bubble up to the top, okay?
And so if someone's asking you, well, you
know, how is your model behaving, why
does it make one decision over another?
You can sort of answer those questions,
okay?
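As a rough illustration of both points, walking a tree from the root and reading each root-to-leaf path as a rule, here is a sketch with a tiny hand-built tree. The tree structure and the attribute names are assumptions, loosely following the Titanic example above.

```python
# A tiny hand-built tree as nested dicts: internal nodes ask about one
# attribute, leaves carry a class label. The structure is illustrative only.
tree = {
    "attr": "gender",
    "branches": {
        "female": {"label": "survived"},
        "male": {
            "attr": "age",
            "branches": {"child": {"label": "survived"},
                         "adult": {"label": "died"}},
        },
    },
}

def classify(node, item):
    """Walk from the root, answering one question per node."""
    while "label" not in node:
        node = node["branches"][item[node["attr"]]]
    return node["label"]

def paths_to_rules(node, conditions=()):
    """Every root-to-leaf path reads off as one rule."""
    if "label" in node:
        yield conditions, node["label"]
    else:
        for value, child in node["branches"].items():
            yield from paths_to_rules(child, conditions + ((node["attr"], value),))

print(classify(tree, {"gender": "male", "age": "adult"}))   # -> died
for conds, label in paths_to_rules(tree):
    print(" and ".join(f"{a} = {v}" for a, v in conds), "->", label)
```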
And so we talked about how you make the
decision of which attributes to select at
each level, and we talked about entropy
being a measure of, sort of, the purity,
right?
The goal is to try to make an entire
branch pure with respect to a class
label, right?
Everyone down this branch survived.
Okay.
And so entropy gives us a way to do that.
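Here is a minimal sketch of what that looks like in code, assuming discrete attributes and the usual Shannon entropy; the `information_gain` helper name is mine, not something from the course materials.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels; 0 means the set is pure."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, class_attr):
    """Drop in entropy obtained by splitting the rows on `attr`."""
    labels = [r[class_attr] for r in rows]
    value_counts = Counter(r[attr] for r in rows)
    remainder = sum(
        (count / len(rows)) *
        entropy([r[class_attr] for r in rows if r[attr] == v])
        for v, count in value_counts.items()
    )
    return entropy(labels) - remainder

# e.g. on the toy weather rows from the earlier sketch:
# information_gain(weather, "outlook", "play")
```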
And we talked about a few extensions for
numeric attributes, where you could split
at a particular value. So if you don't
have gender, male and female, but rather
you have a number like humidity or the
[INAUDIBLE], you have to be a little
more careful.
You don't want to have a thousand
different branches coming out of a node,
one for each unique value.
Rather you want to bucket them somehow,
and so what's the way to bucket them?
Well, one idea is just to find a split
point that splits them into two children.
So everything below this value goes one
way, and everything above this value goes
another way, and we talked about how to
find that threshold.
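Here is a hedged sketch of that threshold search: sort the values, try the midpoint between each pair of adjacent distinct values as a candidate split, and keep the one with the largest information gain. The humidity numbers are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Binary split of a numeric attribute: test each midpoint between
    adjacent sorted values and keep the one with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Humidity values paired with the class label (numbers are made up).
humidity = [65, 70, 70, 75, 80, 85, 90, 95]
play     = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_threshold(humidity, play))   # -> (77.5, 1.0)
```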
Okay.
So, fine.
So where we are now is that decision
trees are potentially prone to
overfitting, and so let's talk about
that.
This excerpt is from Pedro Domingos's
2012 CACM paper that we've mentioned
before.
So, you know, what if the knowledge and
data we have are not sufficient to
completely determine the correct
classifier?
So it could be that we've designed a
classifier that is responding to what he
calls random quirks in the data, as
opposed to sort of uncovering some kind
of fundamental truth.
You know, it doesn't have any predictive
power in the real world, it just sort of
describes this particular data set.
And so the problem here is overfitting,
and, you know, he says it's the bugbear
of machine learning.
A lot of machine learning problems really
reduce to: how do we avoid overfitting?
Right?
You can always train a model on a
particular data set, but how do you avoid
specializing to this data set, and how do
you make sure it has some predictive
power in the real world?
Okay, and so, you know, the case to look
out for is: your learner outputs a
classifier that is 100% accurate on the
training data, but only 50% accurate on
test data, when in fact it could have
output one that is 75% accurate on both.
It has overfit.
Okay.
And so then I want to call out that this
is really the definition to think of when
you think about overfitting: low error on
training data and high error on test
data.
Okay?
So an image that is sometimes called to
mind when you're talking about
overfitting is this one, where I've fit
two polynomials to a small set of data
points, one with a high degree, I think
ten, and the other with a lower degree,
in this case actually still pretty high,
five.
And you can sort of see a couple of
things.
One is that the red curve is exactly
matching the data.
Okay.
But you can also see that what appears to
be the trend in the data is probably
better described by the green line.
And, in fact, it was generated by a model
that added some random noise to a curve,
and so the green one probably does
actually reflect the underlying data
model, the underlying process which
generated the data.
But, you know, the other point to look
out for here is sensitivity to change, to
perturbations in the data.
If I move just one of these points a
little bit, how much does that curve
change?
And the green one would not change much
while the red one might, okay.
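Here is a sketch that reproduces the kind of picture being described, under the assumption that numpy's `polyfit` is an acceptable stand-in for whatever actually drew the figure: noisy points from a smooth curve, a degree-5 and a degree-10 fit, and a check of how much each fit moves when one point is nudged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Points from a smooth underlying curve plus random noise,
# like the figure described above.
x = np.linspace(0, 1, 11)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.15, size=x.size)

low = np.polyfit(x, y, deg=5)    # the "green" curve: follows the trend
high = np.polyfit(x, y, deg=10)  # the "red" curve: passes through every point
# (numpy may warn that the degree-10 fit is poorly conditioned, which is
#  itself a hint that it is chasing noise rather than the trend.)

# Sensitivity to a perturbation: nudge one point and refit both curves.
y2 = y.copy()
y2[5] += 0.3
print("low-degree coefficient change: ", np.abs(np.polyfit(x, y2, 5) - low).max())
print("high-degree coefficient change:", np.abs(np.polyfit(x, y2, 10) - high).max())
```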
But I actually don't think this
curve-fitting picture is the best image
to call up when you think about
overfitting, because in my mind it's a
little bit specific to polynomial curves
and the number of degrees of freedom you
are working with.
Okay.
So it's not always clear, at least to me,
how to map this image in my head of
overfitting onto machine learning
problems.
So I think a more useful one is one
that's actually associated with a
Wikipedia article related to overfitting,
released under Creative Commons, and it's
this.
Over time, your error goes down on both
your training set and your test set, but
at some point it only continues to go
down for your training set.
I'm pointing at the wrong screen.
It continues to go down on your training
set, but then it starts to creep up on
your test set, and so at this point,
where we put a little symbol here to
indicate it, that's where overfitting has
set in.
And so I think this is the image to call
to mind when you are thinking about
overfitting problems: this difference
between the error on your training set
and the error on your test set.
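To see that same picture numerically, here is a sketch using scikit-learn and a synthetic data set, which is an assumption on my part rather than something this lecture uses; tree depth stands in for the "time" axis as the measure of model complexity.

```python
# Training error keeps falling as the model gets more complex,
# while test error eventually starts creeping back up.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for depth in range(1, 15):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_err = 1 - clf.score(X_train, y_train)
    test_err = 1 - clf.score(X_test, y_test)
    print(f"depth {depth:2d}  train error {train_err:.2f}  test error {test_err:.2f}")
```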
Now, in some cases it's tough, because,
you know, what is the test set? Is it
something you actually have your hands on
where you can measure this error, or is
it more like the predictive power in the
real world?
But regardless, any kind of estimate you
have of the error that you're actually
achieving on data that you didn't train
on is what to look out for, and when
these start to diverge, that means you're
in trouble, all right?
So other language you should be familiar
with around this concept is, you know, is
the model able to generalize?
If it is able to generalize, then you're
not overfit, okay?
Can it deal with unseen data, or does it
overfit the data it trains on?
So in order to solve for this, you test
on hold-out data.
Okay.
So one way to do this is: split the data
to be modeled into a training set and a
test set, just in a fixed way.
Train the model on the training set.
Evaluate the model on the training set.
And evaluate the model on the test set.
And the difference between those two,
again, is a measure of the model's
ability to generalize, you know, a
measure of how overfit it is, okay.
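Here is a minimal sketch of that hold-out procedure, again assuming scikit-learn and synthetic data; the 1-nearest-neighbour learner is only there as an example of a model that memorizes its training data, so the train/test gap is easy to see.

```python
# One fixed split, train on one part, then compare error on the data you
# trained on against error on the data you held out.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.3,
                           random_state=1)

# One fixed split: first 70% for training, last 30% held out for testing.
cut = int(0.7 * len(X))
X_train, y_train = X[:cut], y[:cut]
X_test, y_test = X[cut:], y[cut:]

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_err = 1 - model.score(X_train, y_train)   # will be (near) zero
test_err = 1 - model.score(X_test, y_test)      # the honest estimate
print(f"training error {train_err:.2f}, test error {test_err:.2f}, "
      f"gap {test_err - train_err:.2f}")
```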
Now, doing this just once, splitting
between [UNKNOWN], is not the most
powerful mechanism to do this, and in a
couple of slides I'll give you a better
one.
Alright, so another image to kind of call
to mind when you're thinking about
overfitting, which also comes from Pedro
Domingos's paper, is this distinction
between bias and variance.
So, in the underfit case, right, you're
missing the mark, you're not describing
the data very well, but you might have
low variance.
You know, you're clearly missing the
mark.
And then, you know, with high bias and
high variance, well, then you're sort of
just way off the mark; your learner is
not producing anything useful at all.
But in the overfit case, you might have
very, very low bias, but you have high
variance.
Now, the way to interpret this language
I'm using, high variance, is: when you
add more data, or when you perturb the
data, or when you evaluate on a different
data set, what does your quality look
like, right, how well do you do?
Okay.
And so if you have low bias on your
training set, but it's sensitive to the
data, alright, that's where the high
variance comes from, then you're in an
overfit scenario.
Okay.
And in the other case, while you're not
very sensitive to the data, you have low
variance, right, but you have high bias,
you're wrong.
You're getting the wrong answer.
So this is like, if I just have a model
that's always predicting, let's say, on
the [UNKNOWN] data set, that everybody
dies all the time.
Well, this is very insensitive to the
data, right, it has low variance.
No matter what data set I give it, it
would always produce the same answer.
But it has high bias, it's wrong.
Right, the error is high, while
overfitting is the other way around: I
can get an exact answer out of my
training set, where I have very, very low
bias, but as soon as I add one more data
point, it changes drastically.
Okay.
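Here is a hedged sketch of that contrast on synthetic data, again assuming scikit-learn: a constant "everybody dies" style predictor whose output never changes (high bias, low variance) next to a 1-nearest-neighbour model whose predictions move around as the training sample changes (low bias on its training data, high variance).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.3,
                           random_state=0)
X_eval, y_eval = X[:500], y[:500]   # held-out evaluation data
X_pool, y_pool = X[500:], y[500:]   # pool to draw training samples from

for trial in range(3):
    # A fresh training sample each trial, to see how predictions vary.
    idx = rng.choice(len(X_pool), size=200, replace=False)
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_pool[idx], y_pool[idx])
    # The constant model always predicts class 0 ("everybody dies").
    const_acc = np.mean(np.zeros(len(y_eval), dtype=int) == y_eval)
    knn_acc = np.mean(knn.predict(X_eval) == y_eval)
    print(f"trial {trial}: constant model {const_acc:.2f} (never changes), "
          f"1-NN {knn_acc:.2f} (moves with the training sample)")
```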
So this is another set of terms and kind
of concepts to think about when you think
about overfitting.
Alright.
