
University of Washington Open Course: Introduction to Data Science 3 - 6 - MapReduce Relational Join, Social Example

[MUSIC].
So let's look at an example that doesn't
involve processing a corpus of documents.
Let's think about how to implement the
relational join operation that we learned
from relational algebra in MapReduce.
Okay, so here recall that you're given
two relations, and a relation is a set of
tuples, all right?
And you're trying to find every record in
one relation that corresponds to a record
in the other relation, right?
And so here, we're going to join on SSN
equal to EmpSSN.
And actually, it wasn't quite correct for
me to leave this unspecified.
There's the notion of a natural join that
would match up fields with the same names,
but here the field names don't match.
So I need to be explicit about what I'm
joining on.
Okay, and so the output here is these
three records.
This record joins with these two,
and this record joins with this one,
fine.
Now, imagine that both of these relations
are huge.
Okay, so the map phase here is going to
process every tuple in general.
And we have a problem right off the bat:
join is a binary operation.
It has two input relations, a left
relation and a right relation, and you're
trying to find the corresponding tuples,
matching one against the other.
But MapReduce is a unary operation that
processes a single data set.
So, to a first approximation, you can't
express join in MapReduce, period.
But that's okay, there's a bit of a trick
here.
The trick is this: imagine the data set
here is just a big jumble of tuples.
It doesn't matter what table they came
from; we just lump them all together
into a single data set called Tuples.
Okay, and that will be what we process
with MapReduce.
Okay, and so here I'm asking the question:
what is this for?
Well, this is a label that we've attached
to every tuple so that we know where it
came from, and that's used later.
Okay, now I'm not being specific about
where you get this label or how you do it
in MapReduce.
But I want you to understand logically
that it's necessary, and in practice it's
not that difficult.
Probably because, for example, if you're
processing, you have distributed files in
this directory representing all the
employee chunks, and you've got
distributed files in this directory
representing the assigned-departments
chunks.
You can look at the file name to tell
what table it is.
And so in the map function that you
write, you can determine that, aha, this
has the employee relation name in its
file path, and I'm going to go ahead and
attach that as part of the record, okay?
And so we could write pseudocode for
this.
And in fact, you might, as part of an
upcoming assignment, alright?
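As a concrete illustration of that tagging idea, here is a minimal Python sketch. It assumes the framework hands your map function the input file path along with each tuple (how you obtain the path depends on the framework), and the directory names and sample values are hypothetical, not from the lecture slides.

```python
# A hedged sketch of tagging tuples by their source relation, assuming the
# map function can see the input file path. Directory names are made up.
def tag_record(input_path, tup):
    # If the path contains the employee directory, the tuple came from the
    # employee relation; otherwise assume it came from departments.
    relation = "employee" if "employee" in input_path else "department"
    return (relation, tup)

# Hypothetical usage: tag_record("/data/employee/part-00001", ("Sue", "999999999"))
# returns ("employee", ("Sue", "999999999")).
```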
[INAUDIBLE] so what does the map phase
look like for this relational join?
Well, for every record in the input
you're going to produce a key-value pair.
We know we have to produce key-value
pairs, because that's how MapReduce
works.
So what's the key going to be?
Well, the key is going to be the join
attribute, the attribute that you're
joining on, and the value is going to be
sort of everything else, all right?
The rest of the tuple.
We could actually get away with removing
this field to save some space, but
typically you wouldn't bother.
Okay, so fine.
So, given a tuple that looks like this,
with three fields, we produce a key of
777 and so on, and the value is the
entire tuple, and so on.
And again, notice that the tuples from
both relations are all lumped into the
same input.
Alright, I may be belaboring this, but I
want to make sure that it's clear.
Okay, so far we've done two tricks.
One is to lump everything together, and
two is to produce a key-value pair with
the key as the join attribute.
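To make the map phase concrete, here is a minimal Python sketch. It assumes each record arrives already tagged with its relation name, and it assumes hypothetical field layouts for Employee and Department, since the lecture doesn't pin those down.

```python
# A minimal sketch of the join mapper described above. Assumptions: each
# record is (relation_tag, tuple), with hypothetical layouts
# Employee = (name, ssn) and Department = (emp_ssn, dept_name).
def join_map(record):
    relation, tup = record
    if relation == "employee":
        join_key = tup[1]   # SSN is the second field in this layout
    else:
        join_key = tup[0]   # EmpSSN is the first field in this layout
    # Emit the join attribute as the key, and keep the relation tag so the
    # reducer can tell which side each tuple came from.
    yield (join_key, (relation, tup))
```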
Now, what happens in the magic shuffle
phase?
Well, everything with the same key gets
lumped together, as we've talked about.
And so now you get one reducer invocation
for every unique key.
So in this case, there would be one for
999 and so on, one for 777 and so on, and
the list of values associated with that
key will be all the tuples that share the
same join key.
Now, it doesn't matter what relation they
came from; they'll all be in this list.
So in this case we get one tuple from the
employee relation, and we get one tuple
from the department relation.
And here we get one from the employee
relation, and two tuples from the
department relation.
Alright, so now, in the context of a
single computer, we have everything we
need to compute a single join.
And further, each of these reduce
invocations can be done on a separate
machine, and that's how we scale.
So now, this reduce function that the
programmer would have to write, if you
were implementing a join, would have to
take this key and all these tuples and
produce the joined tuples, right?
Where these two attributes came from
employee and these two attributes came
from department.
And same thing here: employee, department.
Now, think about what kind of operation
you need to implement in this reduce
function.
Well, you've got a set of tuples from one
relation, and you need to pair each of
them with every possible tuple from the
other relation, because they all join.
By definition, they all have the same
key, so they all match.
So if you have every member of one set
paired with every possible member of
another set, if you remember, that's a
cross-product operation.
And so here again we see relational
algebra popping up, sort of in a
different context.
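Here is a matching Python sketch of that reduce function, again under the hypothetical field layouts used in the mapper sketch above: it splits the incoming tuples by relation tag and emits the local cross product.

```python
# A minimal sketch of the join reducer: for one join key it receives every
# tagged tuple sharing that key, splits them by relation, and pairs every
# employee tuple with every department tuple (a local cross product).
def join_reduce(join_key, tagged_tuples):
    employees   = [t for rel, t in tagged_tuples if rel == "employee"]
    departments = [t for rel, t in tagged_tuples if rel == "department"]
    for emp in employees:
        for dept in departments:
            # Concatenate the fields to form one joined output tuple.
            yield emp + dept
```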
Okay, I don't want to confuse you too
much: the overall operation we're trying
to do is implement a parallel join.
It just happens that locally, right here
inside one reduce function, we recognize,
aha, that's another relational algebra
operator, a cross product.
But this is more for abstraction purposes
than for algorithm purposes; I just want
to point that out.
All right, so again, you put on your
relational-algebra-colored glasses and
these operators start to pop up
everywhere.
All right, so fine, let's do this one
more time just to make sure it's clear.
So, I'm giving you two relations, Order
and LineItem, and they have these fields:
an order ID, an account, and a date; and
here there's an order ID, an item ID, and
a quantity.
And we're going to join on order ID.
So what does the map phase look like?
Well, once again, the key will be the
join key, in this case order ID, and the
value will be this pair: the relation
name, along with the original tuple.
Before, we just sort of lumped these
together in one tuple, and it doesn't
matter very much.
Here I've structured it slightly
differently, but the point is all the
information is here.
The key must be the join key, and the
value is the tuple as well as some sort
of indicator of what relation it came
from.
So, maybe I'll ask real quick:
why do I need that indicator of the
relation?
Well, in the reduce phase, I wouldn't be
able to produce the joined tuples
properly if we didn't know which tuples
went with which relation.
If I just had a big bundle of tuples and
couldn't really figure it out very
easily, that would be a problem.
Theoretically, you might be able to avoid
tagging explicitly with employee and
department, because you could deduce
that, well, the employee table is the one
that has a string first followed by the
join key, and the department relation has
the join key followed by some other
string.
But it's a little dangerous to rely on
that kind of information, because you
could be joining two tables that have
exactly the same schema, and then there
wouldn't be any obvious way to figure it
out.
So an explicit tag is a little bit safer
here, alright, fine.
So that's the map phase.
We lump all the records together and
produce key-value pairs, where the key is
the join attribute from the corresponding
relation, and the value is the entire
tuple tagged with the relation name,
okay?
And there it is [LAUGH], right there on
the slide.
Okay, so what does the reducer look like?
Well, now we've got all the tuples that
share the same key together, and it will
produce these joined tuples.
It will join this order tuple with both
of these line item tuples to produce
these two joined tuples.
Okay, fine.
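As a usage example, here is a small, self-contained Python simulation of that whole pipeline on the Order / LineItem case. The sample rows and field layouts are made up, and an in-memory dictionary stands in for the shuffle.

```python
from collections import defaultdict

# Made-up sample rows: (relation_tag, tuple). Hypothetical layouts:
# Order = (order_id, account, date), LineItem = (order_id, item_id, quantity).
records = [
    ("order",    (1, "acct-A", "2013-01-01")),
    ("lineitem", (1, "item-10", 2)),
    ("lineitem", (1, "item-20", 1)),
]

def join_map(record):
    relation, tup = record
    yield (tup[0], (relation, tup))        # key = order ID in both layouts

def join_reduce(order_id, tagged):
    orders = [t for rel, t in tagged if rel == "order"]
    items  = [t for rel, t in tagged if rel == "lineitem"]
    for o in orders:
        for li in items:
            yield o + li                   # one joined tuple per match

# Simulate the shuffle: group mapper output by key, then reduce each group.
groups = defaultdict(list)
for rec in records:
    for key, value in join_map(rec):
        groups[key].append(value)

for key, values in groups.items():
    for joined in join_reduce(key, values):
        print(joined)                      # the two joined tuples for order 1
```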
So, now let's go and do a different
example; here maybe we're going to get
started with analyzing social networks.
So here we have a graph where every edge
represents, let's say, a friend
relationship.
Or if you're thinking about Twitter, it
can also be a follower relationship.
So, the input here is a set of edges with
the semantics of Jim is friends with Sue,
or follows Sue, and Sue is friends with
Jim, or follows Jim, and so on.
And so, one point I'm making here is
you'll notice that if Lin points to Joe,
then Joe points to Lin, so there's a
symmetric relationship here, okay?
And I've done that for simplicity's sake,
to sort of avoid the confusion that can
result from thinking about undirected
graphs versus directed graphs.
So, any time you see an edge going one
way, you'll see the other edge coming
back.
Now, before I say what we're going to do,
let's make sure the task is clear.
We're going to count the friends.
We want something very simple.
We just want to say, how many friends
does Jim have?
And how many friends does Sue have, and
so on.
And so the desired output here is Jim
with three, because Jim has Sue as a
friend, Jim has Kai as a friend, and Jim
has Lin as a friend.
And we want Lin equals two, or Lin has
two friends, because of Joe and Jim.
And Kai is just one, right here, Jim.
And Joe is just one, which is right here,
Lin, okay?
So, this should already look like
something we've already done.
It happens to be social network analysis,
and the records happen to be these pairs
of people, but the algorithm you're going
to use should start to stand out to you,
okay?
And maybe you'll see why in a second if
it's not clear.
So, in the map phase, how are we going to
do this?
Well, for every friend on the left-hand
side, produce a key-value pair where the
key is the name of the person and the
value is just the number one, indicating
that we've encountered exactly one friend
associated with, say, Jim in this case,
alright?
So this record gets turned into Jim and 1,
this record gets turned into Sue and 1,
this record gets turned into Lin and 1,
and so on.
Okay, and then through the magic of
MapReduce, the shuffle phase produces
this intermediate result where we have
Jim, the key, associated with a list of
occurrences: 1, 1, and 1.
And Lin associated with two occurrences,
and so on.
And the reduce phase simply adds up all
of these occurrences.
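Here is a minimal, self-contained Python sketch of that friend-count job, with a tiny in-memory simulation of the shuffle; the edge list below is made up to mirror the lecture's example graph.

```python
from collections import defaultdict

def friend_map(edge):
    person, friend = edge
    # One key-value pair per edge: this person has one friend on this record.
    yield (person, 1)

def friend_reduce(person, counts):
    # Just like the word-count reducer: sum up the occurrences.
    yield (person, sum(counts))

# Made-up symmetric edge list matching the lecture's example graph.
edges = [("Jim", "Sue"), ("Sue", "Jim"), ("Jim", "Lin"), ("Lin", "Jim"),
         ("Jim", "Kai"), ("Kai", "Jim"), ("Lin", "Joe"), ("Joe", "Lin")]

# Simulate the shuffle: group the mapper output by key, then reduce.
grouped = defaultdict(list)
for edge in edges:
    for key, value in friend_map(edge):
        grouped[key].append(value)

for person, counts in grouped.items():
    print(list(friend_reduce(person, counts)))   # e.g. [('Jim', 3)], [('Lin', 2)]
```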
And so, what does this remind you of?
Well, it's a lot like the word count
example, right?
Instead of documents producing words and
counts, it's even simpler: for every
record, it just produces a single count
for the person that the record
represents, the person on the left-hand
side.
And after that, the reducer is literally
exactly the same, right?
It'll add up all the occurrences and
produce a single number.
Okay, so in the next segment, we'll talk
about something a little trickier, which
is implementing matrix multiplication in
MapReduce, alright?
