
University of Washington Open Course: Introduction to Data Science 3 - 2 - Parallel Processing Patterns (11-26)

[MUSIC].
Last time we talked about scalability,
and we argued that scalability really
means working in parallel.
And we talked about this specific task,
which is, you know, Read Trimming, okay?
So this is a bunch of small genetic
sequences, and your task was to trim off
the last few characters from each one.
Okay, and we sort of showed that this is
pretty simple to think about in parallel.
You would divide the set of reads into chunks and put them onto separate computers.
And process them all in parallel.
Here, you know there's a function F that
takes a single read and trims off the
last few characters and returns the
prefix.
and you can apply this function in
parallel.
And you can get out the data set you want, which is a set of trimmed reads, okay?
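To make that concrete, here is a minimal sketch in Python (not the course's own code), with a made-up list of reads and an arbitrary trim length; a real run would split the reads across machines rather than local worker processes:

```python
from multiprocessing import Pool

def trim_read(read, drop=3):
    """f: take a single read and return its prefix, dropping the last few characters."""
    return read[:-drop]

if __name__ == "__main__":
    # A made-up set of short reads; in practice these chunks would live on separate machines.
    reads = ["ACGTACGTAA", "TTGACCGTGA", "GGCATTACGA"]
    with Pool() as pool:
        trimmed = pool.map(trim_read, reads)  # apply f to every read in parallel
    print(trimmed)
```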
So let's see some more examples.
So, a new task that needed to be performed at the New York Times in 2008.
And they have a couple of blog posts about it that you can read.
In a simplified version, the task was to convert a bunch of TIFF images into a different format.
And what was really going on here is that they had digitized images from the newspaper, along with some information from the optical character recognition, so some extracted text.
And they wanted to turn this into a more web-friendly format.
And so they had to convert the images to
a web friendly format and they also had
to convert the extracted text into a
little package of JavaScript code.
Okay?
So they're getting ready to put this stuff on the web.
So this is four hundred and five thousand
images, which was quite a bit, especially
at the time.
Okay, but the schematic looks sort of
similar, right, you take a big set of
TIFF images and you split them into
chunks and put them on a bunch of
different computers.
and you have a function f that converts a
TIFF to a PNG, and does the other work
too, let's say.
And what you get out is the dataset you
want, a bunch of PNG images, okay?
And they're distributed across these,
these machines.
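As a rough sketch of what that function f could look like, assuming the Pillow imaging library and made-up file paths (the Times' actual pipeline also packaged the OCR text as JavaScript, which is omitted here):

```python
from pathlib import Path
from PIL import Image  # Pillow, assumed available

def tiff_to_png(tiff_path):
    """f: convert one TIFF image to a PNG next to it and return the new path."""
    out_path = Path(tiff_path).with_suffix(".png")
    Image.open(tiff_path).save(out_path, "PNG")
    return out_path

# Each worker would call tiff_to_png on the chunk of images it was handed, e.g.:
# png_paths = [tiff_to_png(p) for p in my_chunk_of_tiffs]
```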
Right, so let's look at another example.
All right, so now we want to run
thousands of little simulations.
And what we have are the parameters to
each one of those thousands of
simulations.
And an example of this, at the URL here at the bottom of the slide, is from simulating muscle dynamics.
And this comes up a lot: if you have these Monte Carlo simulations, they understand a phenomenon stochastically by running lots and lots and lots of simulations with different inputs, and then kind of averaging the results.
Okay.
As opposed to modeling everything precisely.
So now you want to run thousands of
simulations.
Well, you have a set of inputs, of
parameters, to these simulations.
And you break them into chunks and put
them all in separate machines.
And apply the function, and here the
function is actually running the
simulation.
And what you get out is the output of the simulation, distributed across all the machines.
Okay.
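Here is a hedged illustration of that pattern, where the "simulation" is just a toy Monte Carlo average over random draws and the parameter settings are invented; the real muscle-dynamics code at the linked URL is of course far more involved:

```python
import random
from multiprocessing import Pool

def run_simulation(params):
    """f: run one simulation for one parameter setting and return its result."""
    mean, spread, n_draws = params
    draws = [random.gauss(mean, spread) for _ in range(n_draws)]
    return sum(draws) / n_draws  # stand-in for a real simulation output

if __name__ == "__main__":
    # Thousands of parameter settings in practice; three made-up ones here.
    parameter_sets = [(0.0, 1.0, 10000), (0.5, 2.0, 10000), (1.0, 0.5, 10000)]
    with Pool() as pool:
        results = pool.map(run_simulation, parameter_sets)
    print(results)
```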
So, you know, a pattern should be emerging here, right?
So another example, so imagine each one
of these little bars is a document.
And your task is just to find the most
common word in every individual document.
Okay.
Well, same thing, distribute the
documents across the k computers.
And then your function f now in this case
opens up a single document, figures out
which word is the most common in that
document, and then just produces that
word.
And so now you have a big distributed
list of pairs where, you know, the first
part of the pair is the document ID and
the second one is the word.
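A minimal sketch of that function f, using Python's Counter and a couple of made-up (document ID, text) pairs:

```python
from collections import Counter

def most_common_word(doc):
    """f: map one document to (document ID, its single most common word)."""
    doc_id, text = doc
    word, _count = Counter(text.lower().split()).most_common(1)[0]
    return (doc_id, word)

docs = [("d1", "the people and the people and the government"),
        ("d2", "history repeats and history rhymes")]
print([most_common_word(d) for d in docs])  # exactly one pair per document
```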
Okay, so that could be useful but it's a
bit contrived.
You know, consider a slightly more
general program that computes the word
frequency of every word still in a single
document.
Right?
So instead of just finding the most common one and producing that, now you're going to produce a histogram of the frequencies of every word in the document.
Okay, so given this input, you produce all of these items.
Right?
A set of items.
And the only reason I'm making this distinction from the last one is that the last one, you know, took a single document and produced a single word.
And now we're taking a single document and producing a whole set of items; I just want to make that clear, that that's allowed.
Okay.
Looks like the animation isn't here, but
I don't think I'll wind up fixing it.
So you have millions of documents, you
distribute them again, now your function
returns a set of word frequency pairs,
right?
But that's okay, and now we have, you know, lots of little lines here, each one a single word, let's say.
So they're not one to one any more.
But that's no problem; the function just returns a set of things.
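A sketch of this more general map function; the only point is that a single document now yields a whole list of (word, count) pairs rather than one word:

```python
from collections import Counter

def word_frequencies(doc):
    """Map one document to many (word, count) pairs -- a whole histogram, not one word."""
    doc_id, text = doc
    return list(Counter(text.lower().split()).items())

docs = [("d1", "the people and the government"),
        ("d2", "history repeats and history rhymes")]
for d in docs:
    print(word_frequencies(d))  # each document yields a set of pairs
```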
So there should be a pattern here, right?
We have a function that maps a read to a
trimmed read.
We have a function that maps a TIFF image
to a PNG image.
A function that maps a set of parameters
to the simulation result.
Alright, it's the simulation itself.
We have a function that maps a document
to its most common word.
And we have a function that maps a
document to the histogram of its word
frequencies.
Okay?
So far, so good.
So these kinds of tasks we think we know
how to do in parallel.
Given a big set of objects and a function that knows how to process a single object, you know, you should be able to think about how to parallelize this.
We won't worry about the low-level details, but we should abstractly understand how this is done.
Right?
So, I'll say that one more time.
We have a big set of objects, we have a function that maps a single object to something else, and we have a bunch of computers.
Okay?
We distribute the objects among the computers and apply the function in parallel.
This is trivial.
Alright.
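Abstractly, then, any serial function that handles one object can be applied across a partitioned collection. A minimal local sketch, with Python's multiprocessing pool standing in for the cluster:

```python
from multiprocessing import Pool

def parallel_map(f, objects, workers=4):
    """Distribute objects among workers and apply the single-object function f to each."""
    with Pool(workers) as pool:
        return pool.map(f, objects)

# Any of the earlier single-object functions plugs in unchanged, e.g.:
# trimmed = parallel_map(trim_read, reads)
# pngs    = parallel_map(tiff_to_png, tiff_paths)
```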
Okay, so what if we want to compute the word frequency across all documents, not just, uh, the frequencies for each document, the frequencies within a single document.
Okay, so here, you know, if we have these three documents, now we want to get a single histogram that counts the number of times the word "people" appears across all three of them, the number of times the word "government" appears across all three of them, and so on.
So, let's go back to our schematic here,
the pattern.
Well, now we want to compute the word frequency across five million documents.
And we can still distribute them among k
computers, you know.
So far, so good.
And then for each document we return a set of word frequency pairs, and now I've switched the notation here from f to map, since we can consider this a map; that's the terminology I'll use.
Okay, but now what do we do?
So what we can get out here, what we will get out here, is a set of per-document word frequencies, but that's not what we want; we want one big histogram.
And to build this one big histogram we
have to make sure that a single computer
has access to every occurrence of some
particular term, so.
If the word "history" appears in some document on this machine, and it appears, you know, twice in this document, and three times, and two times in documents on that machine, and so on, we have to sort of group those all up and send them to a single place, just so we can count them.
Okay.
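As a sketch of that regrouping, here is the shuffle step done with an in-memory dictionary standing in for the network; the made-up pairs play the role of per-document outputs arriving from different machines:

```python
from collections import defaultdict

def group_by_word(mapped_pairs):
    """Shuffle step: route every (word, count) pair to the place responsible for that word."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

# Per-document pairs that, in a real cluster, would arrive from different machines:
pairs = [("history", 1), ("people", 2), ("history", 2), ("history", 3), ("people", 1)]
print(dict(group_by_word(pairs)))  # {'history': [1, 2, 3], 'people': [2, 1]}
```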
So, let's look at this again.
So, we distribute the documents across
these computers.
We apply our map function to each document in order to produce a set of word frequency pairs.
And now we have a big distributed list of these sets of word frequencies.
And now we want to get these workers involved in the process.
And these guys are going to be the ones who count the occurrences of a particular word.
Okay, and so imagine all these little colored lines are occurrences of words, such that all the red lines represent occurrences of a single word, and all the green lines represent a different word, and so on.
Well, so these guys are going to be sent to their respective locations.
So, such that this worker is in charge of handling all the occurrences of the blue word, and this worker is in charge of all the occurrences of the red word, and so on.
Okay, so now instead of lines that go, you know, from one to one, we have lines that go from this one computer to a bunch of different computers.
Okay.
And so on.
Alright?
They just sort of shuffle the data, they
have to spray this data out across the
network, in order to regroup it.
Fine, so now that we have the data grouped the way we want and partitioned the right way, we can apply another function, which I'll call the reduce function, which in this case doesn't really do any assembly, it just counts them, and that allows us to produce our final result, which is: oh, there are 4 green words and 4 red words and 3 blue words, and so on.
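For this task the reduce function is just a count, something like the following sketch, where the grouped lists are the output of the shuffle sketched above:

```python
def reduce_counts(word, counts):
    """Reduce step: total up every occurrence of one word."""
    return (word, sum(counts))

grouped = {"history": [1, 2, 3], "people": [2, 1]}
print([reduce_counts(w, c) for w, c in grouped.items()])  # [('history', 6), ('people', 3)]
```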
Okay, so, now the schematic looks a
little different.
We have sort of a two-step process.
So, we start with some large set of objects distributed over a bunch of machines.
And then we want to apply some function f to each one of those objects.
And that was our first step.
But then the outputs of those functions are all going to be redistributed across the network and grouped, to form groups.
Okay, and then the second step is to
process each one of those groups.
So I'll write, instead of f, I'll write
map here.
And I'll write reduce here.
Okay, and so that's exactly what MapReduce does, and we'll explain this in more detail next time, but the key idea here is that the user, the programmer, is going to write these two functions, a map function and a reduce function, which are serial, and what I mean by serial is they're not parallel.
You don't have to worry about how to manage, how to program a distributed cluster, within each one of these functions.
You just write a function map that takes
in an object of some kind and returns
some other object.
And I'm not using objects in this sort of
object-oriented sense, this is sort of in
the mathematical sense, just sort of any
input to produce any kind of output.
Okay, and then the reduce function takes a set of objects, which I'll note this way, and returns some kind of object, and actually this can be a set of objects as well.
And in fact this can be a set of objects as well.
So maybe I'll switch colors here.
This can actually be a set of things, and this can actually be a set of things.
So for example we saw a document
returning a set of word frequency pairs.
But you can think of it that way.
It's, it's important to understand what's going on here, okay.
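Putting it together, a toy single-machine driver (a sketch, not the real framework, which handles the distribution and the shuffle for you) shows how the programmer only supplies the two serial functions:

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Toy driver: map each object, group the (key, value) outputs by key, reduce each group."""
    groups = defaultdict(list)
    for obj in inputs:
        for key, value in map_fn(obj):      # map: one object -> many (key, value) pairs
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

# Word count across all documents, with the two serial functions the programmer writes:
def word_map(text):
    return [(word, 1) for word in text.lower().split()]

def word_reduce(word, ones):
    return (word, sum(ones))

docs = ["the people and the government", "history repeats and history rhymes"]
print(run_mapreduce(word_map, word_reduce, docs))
```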
So this is, this is an interesting hypothesis, right?
That perhaps all distributed algorithms can be expressed as sequences of these two-step operations, right?
A map followed by a reduce, and then maybe more map reduce, map reduce, map reduce, as needed.
Okay, and we'll talk about that in more
detail next time.
