University of Washington Open Course: Introduction to Data Science 2 - 1 - From Data Models to Databases (10-35)

[MUSIC].
Welcome back.
I want to talk a little bit about data
models, as a run up to talk about
databases over the next couple of
segments.
So, this is a data science course, so we
know we have data.
So, the first question to ask is just,
where is this stored?
How do we store data?
And so one way of interpreting this
question is just to talk about
technology, right?
So, what technology do we use?
Well, we use magnetic media, and we might
use solid state drive more recently,
right?
And so both of these have the property
that they persist, even when
the power goes out.
The data is safe, even when the power is
not on, okay, so non-volatile storage,
right?
And so, a lot of the work in databases is
just all about working with data that's
stored on non-volatile storage media.
But another way of interpreting the
question is a little different: it's
about the logical way we store the data,
the organization of the data.
And so, one way of
interpreting this question is, what is the
data model we're using, okay?
So, it's not just bits on a disk, or
bits in a file.
What do we do?
Well, you know, in your personal
computer, or maybe even in your work
computer, you might store data sort of
hierarchically arranged in kind of these
nested folders, right?
That's one organization of data, and so
the data model here is kind of tree-like.
Another way is rows and columns, and this
is what we'll talk about in this
course, right?
So, in this case it's a nasty file, and
these are hits from a biological database,
matches in a biological database for a
particular sequence.
And of course, you might have
spreadsheets that are a little funny,
right?
Maybe they look a little bit like rows
and columns or maybe they don't.
So, you have here sort of an embedded
table within a spreadsheet and so on.
So, the idea is to
think about what data model is being
applied whenever you think about data,
alright?
And so, it could be a tree, it could be a
table, it could be something grid-like
and unstructured like this, or it could
be a graph, and we'll talk a little bit
about that in the future.
So in general, what is a data model?
There are going to be three components that
you should remember.
One is some notion of
the structure, alright?
So, in the case of tables, rows and
columns.
There's going to be some notion of
constraints: what are the legal
structures you're allowed to create?
So, for example, typically if we think
about a tabular data model,
all the rows will have
the exact same number of columns, right?
This is sort of a constraint.
You might have more constraints on the
values themselves, such as every value
in this column is an integer.
And you could even have other kinds of
more semantic constraints, such as every
value in this column must be within a
certain range of
numbers because it represents, say, days
of the year.
Okay, and the third one is the
operations.
And so, this is sometimes thought of as
independent of the other two, but I really
like to call the data model all three of
these things:
the structures, the constraints that
define valid structures, you know, valid
associations of these structures,
and then the operations
that these structures support, okay?
So, let's see some examples.
So, your structures might be, as I
mentioned, rows and columns, or nodes and
edges if it's a graph model.
Key-value pairs have been popular with the
NoSQL movement.
Just a sequence of bytes, that might be
the structure you have if you're just
working with bare files.
And then the constraints you might
imagine are, you know, all rows have the
same number of columns, as I mentioned, or
all values of one column are of the same
type.
For a hierarchical view, you might
have that a child cannot have
two parents, right?
So, for example, one file
cannot be in two folders at the same time
in the data model of your file
system, okay?
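
As a rough illustration of the constraints just listed, here is a minimal Python sketch that checks a small table for the same number of columns in every row, one type per column, and an optional value range. The function name `check_table`, the sample rows, and the 1 to 366 day-of-year range are assumptions made up for this example; they are not from the lecture.

```python
def check_table(rows, column_types, ranges=None):
    """Return a list of constraint violations for a table given as a list of tuples.

    column_types: one Python type per column (structural and type constraints).
    ranges: optional {column_index: (low, high)} for semantic range constraints.
    """
    ranges = ranges or {}
    problems = []
    for i, row in enumerate(rows):
        # Constraint 1: every row has the same number of columns.
        if len(row) != len(column_types):
            problems.append(f"row {i}: expected {len(column_types)} columns, got {len(row)}")
            continue
        for j, (value, expected) in enumerate(zip(row, column_types)):
            # Constraint 2: all values in one column are of the same type.
            if not isinstance(value, expected):
                problems.append(f"row {i}, column {j}: expected {expected.__name__}")
            # Constraint 3: a semantic constraint, e.g. a day of the year.
            elif j in ranges and not (ranges[j][0] <= value <= ranges[j][1]):
                problems.append(f"row {i}, column {j}: {value} is out of range")
    return problems


# Hypothetical table: (name, day_of_year)
rows = [("a", 42), ("b", 400), ("c",), ("d", "x")]
problems = check_table(rows, [str, int], ranges={1: (1, 366)})
for p in problems:
    print(p)
```

Only the first row passes; the other three each break one of the three kinds of constraint.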
And then the operations that are
supported.
Well, maybe you can look up the value
given a key x in these key-value pair
data models. That's one of the primary
operations: if I give you the key, you
give me back the value.
For a tabular data model, you might say,
well, find me all the rows
where a particular column has a
particular value.
In this case, column last name is equal
to the value Jordan.
Okay, and with a file there are not too
many operations you can think about.
It's essentially get the next n bytes,
move to another position within the file,
and then you can open and close the file,
and that's about all the
operations that are supported.
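
The three kinds of operations described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the keys, names, and byte string are invented sample data, and an in-memory `io.BytesIO` object stands in for a real file.

```python
import io

# Key-value model: the primary operation is "given a key, return the value".
kv_store = {"user:1": "Jordan", "user:2": "Smith"}  # hypothetical keys and values
value = kv_store["user:1"]

# Tabular model: find all rows where a particular column has a particular value.
rows = [
    {"first_name": "Michael", "last_name": "Jordan"},
    {"first_name": "Ada", "last_name": "Lovelace"},
    {"first_name": "Barbara", "last_name": "Jordan"},
]
jordans = [row for row in rows if row["last_name"] == "Jordan"]

# Bare file model: open, get the next n bytes, move to another position, close.
f = io.BytesIO(b"0123456789")  # in-memory stand-in for a file opened in binary mode
first_four = f.read(4)  # get the next 4 bytes
f.seek(8)               # move to another position within the file
last_two = f.read(2)    # get the next 2 bytes from the new position
f.close()

print(value, len(jordans), first_four, last_two)
```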
Okay, so I think any time you see data,
especially on non-volatile storage, you
can think about what operations are
supported,
what constraints there are over the
structure, and so on, and this gives you an
idea of the data model.
Alright, so what is a
database?
Well, one definition that I think is
pretty adequate is this one, which is
a collection of information organized to
afford efficient retrieval.
Alright, so this is a pretty general
definition.
It doesn't say anything about tables.
It doesn't say anything about relations.
So, when you think about databases, don't
necessarily assume relational databases.
It's perfectly adequate to talk about a
database that has nothing to do with
relations.
Alright, but it is not just a file of
data either.
It's organized to afford efficient
retrieval, alright?
So, another view of a database is this
idea of a schema.
And so, Jim Gray, in this Fourth Paradigm
book that we talked about in
the eScience segment a little bit ago,
has this quote:
"When people use the word database,
fundamentally what they are saying is
that the data should be self-describing
and it should have a schema. That's
really all the word database means."
And so this goes back to the notion of a
data model, right?
There's a structure there, and
some constraints, and some
operations.
And all three of these are things you can
intuit by looking at the data itself; it
needs to be self-describing, right?
So, if I have a
file of data that's organized into rows
and columns,
somewhere I'm able to inspect which
columns it has and how many rows there
are and so on, right?
You need to be able to understand how to
read this data just by looking at the
data itself.
And so, in a database, for example, there
will be a catalog, there will be an
explicit schema.
So, having this idea that the word
database is synonymous with
self-describing data,
you know, data equipped with a schema,
that's a pretty common view, and it's
important to keep in mind as we go
through this course.
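
To make the idea of self-describing data concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is just one convenient example of a database with a catalog; the table name `people` and its rows are made up for the illustration. The point is that the columns and the row count are discovered by asking the database itself, not by reading any outside documentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, last_name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (last_name, age) VALUES (?, ?)",
    [("Jordan", 50), ("Smith", 31)],  # hypothetical sample rows
)

# The catalog: sqlite_master lists the objects the database contains.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(people)")]

row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
conn.close()

print(tables, columns, row_count)
```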
So, let me give you another view of
databases, motivated by this question:
why would I want one in the first place?
Why would I want a database?
What problems do they solve?
There are maybe four issues you might run
into that putting your data into some
kind of a database, broadly defined, can
help you solve, and these are the ones
that I like to talk about.
So one is sharing data, right?
So, you know, once you have multiple
users trying to access the same file of
data, some sort of infrastructure or
interface to manage concurrent access
starts to become required, and this is
something that, I would argue, all
real databases afford.
Okay, another is the enforcement of a
data model.
You might say, well look, I use a
rows-and-columns model, or I use a
hierarchical data model.
But there's some sort of software that
needs to enforce that, and so this is
something that databases can do.
Now, remember the data model is not just
the raw structure of, you know,
parents and children, say, with trees, or
rows and columns.
It's also higher-level constraints,
such as, you know, this must be a number
from one to five, or this must be a day
of the week.
Okay, or it must be one of the
customers that already exist in another
table.
Okay, so these kinds of constraints are
difficult to enforce in the application
layer, and I'll argue this a little
bit more.
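
As a hedged illustration of a database enforcing these higher-level constraints, the sketch below uses Python's built-in `sqlite3` module. The tables `customers` and `reviews`, the column names, and the sample values are invented for the example. A `CHECK` constraint covers "a number from one to five", and a foreign key covers "one of the customers that already exist in another table".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE reviews (
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        rating INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5)
    )
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Jordan')")
conn.execute("INSERT INTO reviews VALUES (1, 4)")  # valid row

rejected = []
for customer_id, rating in [(1, 9), (42, 3)]:  # bad rating, then unknown customer
    try:
        conn.execute("INSERT INTO reviews VALUES (?, ?)", (customer_id, rating))
    except sqlite3.IntegrityError:
        # The database, not the application code, refused the row.
        rejected.append((customer_id, rating))

accepted = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
conn.close()

print(rejected, accepted)
```

Every program that writes to this database gets the same checks, which is the advantage over re-implementing them in each application.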
Okay, so the third reason might be scale,
right?
So, we know we have a file of data, and
we know we have sort of a data model
floating around here.
But once it gets over a certain size, or
if we get a certain number of instances
of this data model,
we're going to want to use specialized
algorithms to be able to work with it.
And writing all these specialized
algorithms ourselves to traverse larger
and larger data sets becomes the
bottleneck.
And so, a database can organize these
algorithms and expose them through
convenient mechanisms.

We talk a lot about complexity-hiding
interfaces, right?
Databases provide complexity-hiding
interfaces for large-scale data.
And I think the fourth one is
flexibility, which is, you know, you might
have written
some software to access your set of files
in a particular way.
But assume you have to access it in some
way you didn't anticipate, right?
You have to rewrite a bunch of code, and
so what databases try to do is anticipate
a broad set of different ways of
accessing the data and working with the
data, and support all of them.
Okay, and so I'm talking pretty
abstractly here fairly deliberately,
because all of these things are certainly
true of relational databases, which we'll
talk about.
But I think they ought to be true of
anything that sort of earns the term
database, and they're certainly true of
other systems as well.
And so, we want to try and
keep these things in mind when we think
about NoSQL systems and key-value
stores and graph databases and so on.
Okay, so it's not just relational
databases, this is broader than just
relational databases.
Alright, and so I think in general, when
you're looking at these systems and
thinking about the data storage layer,
which is what we're starting with here
in this portion about data
science, the questions are:
how is the data physically
organized on disk?
What kinds of queries,
what kinds of operations, are going to be
efficiently supported by some particular
organization, and what kinds are not,
right?
So, this third bullet leads to one
immediate question:
is it hard to update the data or add new
data?
That's one quick way to split all
the different operations you can do on the
data:
the reads versus the writes.
And in many cases, you'll find that
the organization that makes sense to read
the data efficiently is not the same one
that makes it efficient to write the data.
And that trade off is at the heart of a
lot of the system design challenges, in
databases and other large scale systems.
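
A small Python sketch can make this read-versus-write trade-off concrete. The two classes below are toy structures invented for this illustration, not how any particular database is implemented: `AppendLog` makes writes cheap and reads slow, while `SortedStore` keeps keys in order so reads can use binary search at the cost of more work per write.

```python
import bisect


class AppendLog:
    """Write-friendly: a write is a single append, but a read scans the records."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))

    def read(self, key):
        # Scan from the newest record backwards so the latest write wins.
        for k, v in reversed(self.records):
            if k == key:
                return v
        return None


class SortedStore:
    """Read-friendly: keys stay sorted, so a read is a binary search,
    but inserting a new key may shift many existing entries."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value       # overwrite an existing key in place
        else:
            self.keys.insert(i, key)     # keep both lists in sorted key order
            self.values.insert(i, value)

    def read(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


for store in (AppendLog(), SortedStore()):
    store.write("b", 2)
    store.write("a", 1)
    store.write("b", 3)
    print(type(store).__name__, store.read("a"), store.read("b"), store.read("z"))
```

Both stores answer the same questions; they differ only in which operation pays the cost.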
Okay, but there's other kinds of
operations as well, as we've discussed.
Do you want to look up by key?
Do you want to look up by some other
field and so on, and we'll talk about
this in future segments.
Okay, and then what happens if I encounter
new queries I didn't anticipate?
Do I need to completely reorganize
the data?
Do I need to write a bunch of new code?
You know, how hard is this?
And so, these are the evaluation
criteria you are going to use when you're
thinking about choosing a platform for
your data science task, okay.
Or these are some of the questions, if
perhaps not all of them, right?
So, if broadly your choices are a file or
files, an off-the-shelf database,
a NoSQL system, or something
else,
having these questions in mind as
you're evaluating the pros and cons is
going to be really important, okay?
Alright.
