
The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Google

ABSTRACT

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Categories and Subject Descriptors

D [4]: 3—Distributed file systems

General Terms

Design, reliability, performance, measurement

Keywords

Fault tolerance, scalability, data storage, clustered storage

For example, we have relaxed GFS's consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.

Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

2. DESIGN OVERVIEW

2.1 Assumptions

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.

• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.

• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.

• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.

• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.

• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.

2.2 Interface

GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.

Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
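To make the record append contract concrete, here is a toy in-memory model (our illustration, not GFS code): the caller supplies only the data, and the file picks and returns the offset, so concurrent appenders need no synchronization of their own.

```python
import threading

class MockAppendFile:
    """Toy stand-in for a GFS file that supports record append."""
    def __init__(self):
        self._buf = bytearray()
        self._lock = threading.Lock()  # plays the role of the primary's serial order

    def record_append(self, record: bytes) -> int:
        """Append atomically; return the offset the file chose."""
        with self._lock:
            offset = len(self._buf)
            self._buf.extend(record)
            return offset

f = MockAppendFile()
offsets = []
workers = [threading.Thread(target=lambda b=bytes([65 + i]): offsets.append(f.record_append(b * 8)))
           for i in range(4)]
for t in workers: t.start()
for t in workers: t.join()
print(sorted(offsets))  # [0, 8, 16, 24]: every append landed in its own region
```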

2.3 Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux's buffer cache already keeps frequently accessed data in memory.

2.4 Single Master

Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.

[Figure 1: GFS Architecture — clients exchange control messages with the master and data messages with the chunkservers.]

Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key. The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
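The offset-to-chunk translation the client performs is simple integer arithmetic over the fixed chunk size; a minimal sketch:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB chunk size (Section 2.5)

def to_chunk_coords(byte_offset: int) -> tuple[int, int]:
    """Translate a file byte offset into (chunk index, offset within chunk).
    The client sends (file name, chunk index) to the master."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

print(to_chunk_coords(200_000_000))  # (2, 65782272): byte 200M lies in the third chunk
```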

2.5 Chunk Size

Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.

A large chunk size offers several important advantages. First, it reduces clients' need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.

On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 Metadata

The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.

2.6.1 In-Memory Data Structures

Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.

One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
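As a rough check of these numbers (our arithmetic, not a measurement from the paper), the per-chunk bound implies only a few hundred megabytes of chunk metadata even for a multi-hundred-terabyte cluster:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
META_PER_CHUNK = 64             # < 64 bytes of metadata per chunk (upper bound)

def chunk_metadata_bytes(storage_bytes: int) -> int:
    # Most chunks are full, so chunk count ~ storage / chunk size.
    return (storage_bytes // CHUNK_SIZE) * META_PER_CHUNK

storage = 300 * 1024**4         # a 300 TB cluster, as in Section 1
print(chunk_metadata_bytes(storage) // 1024**2, "MB")  # 300 MB of chunk metadata
```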

If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.

2.6.2 Chunk Locations

The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.

We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.

Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.

2.6.3 Operation Log

The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.

Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.

The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.

Because building a checkpoint can take a while, the master's internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.

Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.

                       Write                      Record Append
Serial success         defined                    defined interspersed with inconsistent
Concurrent successes   consistent but undefined   defined interspersed with inconsistent
Failure                inconsistent               inconsistent

Table 1: File Region State After Mutation

2.7 Consistency Model

GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS's guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.

2.7.1 Guarantees by GFS

File namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master: namespace locking guarantees atomicity and correctness (Section 4.1); the master's operation log defines a global total order of these operations (Section 2.6.3).

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions. The applications do not need to further distinguish between different kinds of undefined regions.

Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the "record") to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing (Section 3.3). (In contrast, a "regular" append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.

After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.

Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry's timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.

Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.

2.7.2 Implications for Applications

GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.

Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application's perspective.

In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append's append-at-least-once semantics preserves each writer's output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
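A minimal sketch of such self-validating, self-identifying records follows; the framing layout (length, record id, CRC32) is our own illustration, not the actual shared library's format:

```python
import struct, zlib

HEADER = struct.Struct("<IQI")  # payload length, record id, CRC32 of payload

def pack_record(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), record_id, zlib.crc32(payload)) + payload

def read_records(buf: bytes):
    """Yield (id, payload) for valid records; skip padding and fragments via
    checksum failures, and drop at-least-once duplicates via record ids."""
    seen, pos = set(), 0
    while pos + HEADER.size <= len(buf):
        length, rid, crc = HEADER.unpack_from(buf, pos)
        payload = buf[pos + HEADER.size : pos + HEADER.size + length]
        if length and len(payload) == length and zlib.crc32(payload) == crc:
            pos += HEADER.size + length
            if rid not in seen:
                seen.add(rid)
                yield rid, payload
        else:
            pos += 1  # not a valid record boundary: resynchronize byte by byte

data = pack_record(1, b"doc-A") + b"\x00" * 7 + pack_record(1, b"doc-A")  # padding + duplicate
print(list(read_records(data)))  # [(1, b'doc-A')]
```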

3. SYSTEM INTERACTIONS

We designed the system to minimize the master's involvement in all operations. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order

A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation. Each mutation is performed at all the chunk's replicas. We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.

The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.

[Figure 2: Write Control and Data Flow]

In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.

1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. Section 3.2 discusses this further.

4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.

5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.

6. The secondaries all reply to the primary indicating that they have completed the operation.

7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.

If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. This leaves the file region in consistent but undefined state as noted in Section 2.7.

3.2 Data Flow

We decouple the flow of data from the flow of control to use the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. Our goals are to fully utilize each machine's network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data.

To fully utilize each machine's network bandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (e.g., tree). Thus, each machine's full outbound bandwidth is used to transfer the data as fast as possible rather than divided among multiple recipients.

To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the "closest" machine in the network topology that has not received it. Suppose the client is pushing data to chunkservers S1 through S4. It sends the data to the closest chunkserver, say S1. S1 forwards it to the chunkserver S2 through S4 closest to S1, say S2. Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, and so on. Our network topology is simple enough that "distances" can be accurately estimated from IP addresses.

Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a chunkserver receives some data, it starts forwarding immediately. Pipelining is especially helpful to us because we use a switched network with full-duplex links. Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL where T is the network throughput and L is latency to transfer bytes between two machines. Our network links are typically 100 Mbps (T), and L is far below 1 ms. Therefore, 1 MB can ideally be distributed in about 80 ms.
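Plugging the paper's numbers into B/T + RL, as a quick check of the arithmetic:

```python
B = 1 * 2**20 * 8     # 1 MB expressed in bits
T = 100 * 10**6       # 100 Mbps network throughput
R = 3                 # replicas in the pipeline
L = 1e-3              # a generous 1 ms per hop (the text says L is far below this)

print(B / T + R * L)  # ~0.087 s; with negligible L this is the ~80 ms quoted above
```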

3.3 Atomic Record Appends

GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS's choosing and returns that offset to the client. This is similar to writing to a file opened in O_APPEND mode in Unix without the race conditions when multiple writers do so concurrently. In our workloads, such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients.

Record append is a kind of mutation and follows the control flow in Section 3.1 with only a little extra logic at the primary. The client pushes the data to all replicas of the last chunk of the file. Then, it sends its request to the primary. The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. (Record append is restricted to be at most one-fourth of the maximum chunk size to keep worst-case fragmentation at an acceptable level.) If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally replies success to the client.

If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. This property follows readily from the simple observation that for the operation to report success, the data must have been written at the same offset on all replicas of some chunk. Furthermore, after this, all replicas are at least as long as the end of record and therefore any future record will be assigned a higher offset or a different chunk even if a different replica later becomes the primary. In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). Our applications can deal with inconsistent regions as we discussed in Section 2.7.2.
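A sketch of the primary's decision rule just described (the constants follow the text; the code is our illustration, not GFS source):

```python
CHUNK_MAX = 64 * 1024 * 1024
RECORD_MAX = CHUNK_MAX // 4  # records capped at 1/4 chunk to bound fragmentation

def primary_append(chunk_used: int, record_len: int):
    """Return ('retry_next_chunk', pad) or ('append', offset) for all replicas."""
    assert record_len <= RECORD_MAX, "oversized records are rejected up front"
    if chunk_used + record_len > CHUNK_MAX:
        return "retry_next_chunk", CHUNK_MAX - chunk_used  # pad chunk to its max
    return "append", chunk_used  # every replica writes at exactly this offset

print(primary_append(CHUNK_MAX - 100, 4096))  # ('retry_next_chunk', 100)
print(primary_append(1_000_000, 4096))        # ('append', 1000000)
```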

3.4 Snapshot

The snapshot operation makes a copy of a file or a directory tree (the "source") almost instantaneously, while minimizing any interruptions of ongoing mutations. Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.

Like AFS [5], we use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.

After the leases have been revoked or have expired, the master logs the operation to disk. It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.

The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder. The master notices that the reference count for chunk C is greater than one. It defers replying to the client request and instead picks a new chunk handle C'. It then asks each chunkserver that has a current replica of C to create a new chunk called C'. By creating the new chunk on the same chunkservers as the original, we ensure that the data can be copied locally, not over the network (our disks are about three times as fast as our 100 Mb Ethernet links). From this point, request handling is no different from that for any chunk: the master grants one of the replicas a lease on the new chunk C' and replies to the client, which can write the chunk normally, not knowing that it has just been created from an existing chunk.
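The reference-count bookkeeping behind this copy-on-write step can be sketched as follows (a toy model of the master's metadata, with our own simplifications):

```python
file_chunks = {"/a": ["c1"]}   # file -> list of chunk handles
ref_count = {"c1": 1}          # chunk handle -> number of referencing files

def snapshot(src: str, dst: str):
    """Duplicate metadata only; the chunks become shared."""
    file_chunks[dst] = list(file_chunks[src])
    for c in file_chunks[dst]:
        ref_count[c] += 1

def prepare_write(path: str, index: int) -> str:
    """First write to a shared chunk triggers a local copy under a new handle."""
    c = file_chunks[path][index]
    if ref_count[c] > 1:
        new_c = c + "'"                    # master picks a new handle, e.g. C'
        ref_count[c] -= 1
        ref_count[new_c] = 1
        file_chunks[path][index] = new_c   # chunkservers copy C locally into C'
        return new_c
    return c

snapshot("/a", "/snap/a")
print(prepare_write("/a", 0))  # c1' -- the snapshot still points at c1
```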

4. MASTER OPERATION

The master executes all namespace operations. In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage. We now discuss each of these topics.

4.1 Namespace Management and Locking

Many master operations can take a long time: for example, a snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot. We do not want to delay other master operations while they are running. Therefore, we allow multiple operations to be active and use locks over regions of the namespace to ensure proper serialization.

Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms). GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. With prefix compression, this table can be efficiently represented in memory. Each node in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock.

Each master operation acquires a set of locks before it runs. Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf. Note that leaf may be a file or directory depending on the operation.

We now illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. The two operations will be serialized properly because they try to obtain conflicting locks on /home/user. File creation does not require a write lock on the parent directory because there is no "directory", or inode-like, data structure to be protected from modification. The read lock on the name is sufficient to protect the parent directory from deletion.

One nice property of this locking scheme is that it allows concurrent mutations in the same directory. For example, multiple file creations can be executed concurrently in the same directory: each acquires a read lock on the directory name and a write lock on the file name. The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. The write locks on file names serialize attempts to create a file with the same name twice.

Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are not in use. Also, locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.
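The lock plan for a pathname operation can be sketched like this (our illustration of the scheme, simplified to a single path per operation):

```python
def lock_plan(pathname: str, write_leaf: bool):
    """Read locks on every ancestor, a read or write lock on the full pathname,
    sorted by tree level then lexicographically to avoid deadlock."""
    parts = pathname.strip("/").split("/")
    locks = [("/" + "/".join(parts[:i]), "read") for i in range(1, len(parts))]
    locks.append((pathname, "write" if write_leaf else "read"))
    return sorted(locks, key=lambda lk: (lk[0].count("/"), lk[0]))

print(lock_plan("/home/user", True))       # snapshot side: write lock on /home/user
print(lock_plan("/home/user/foo", True))   # creation side: read lock on /home/user
# Both plans contain /home/user (write vs. read), so the operations serialize.
```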

4.2 Replica Placement

A GFS cluster is highly distributed at more levels than one. It typically has hundreds of chunkservers spread across many machine racks. These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. Communication between two machines on different racks may cross one or more network switches. Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack. Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability.

The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine's network bandwidth. We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.

4.3 Creation, Re-replication, Rebalancing

Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing.

When the master creates a chunk, it chooses where to place the initially empty replicas. It considers several factors. (1) We want to place new replicas on chunkservers with below-average disk space utilization. Over time this will equalize disk utilization across chunkservers. (2) We want to limit the number of "recent" creations on each chunkserver. Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. (3) As discussed above, we want to spread replicas of a chunk across racks.

The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. This could happen for various reasons: a chunkserver becomes unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. Each chunk that needs to be re-replicated is prioritized based on several factors. One is how far it is from its replication goal. For example, we give higher priority to a chunk that has lost two replicas than to a chunk that has lost only one. In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files (see Section 4.4). Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.
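One way to combine the three prioritization factors is a simple score, sketched below; the weights are our own illustration, not values from GFS:

```python
def rereplication_priority(goal: int, live_replicas: int,
                           file_is_live: bool, blocks_client: bool) -> int:
    """Illustrative priority score (higher = re-replicate sooner)."""
    score = (goal - live_replicas) * 10   # farther from the goal -> more urgent
    if file_is_live:
        score += 5                        # prefer live files over deleted ones
    if blocks_client:
        score += 100                      # boost anything stalling a client
    return score

# A chunk down two replicas on a live file outranks one down a single replica:
print(rereplication_priority(3, 1, True, False))  # 25
print(rereplication_priority(3, 2, True, False))  # 15
```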

The master picks the highest priority chunk and "clones" it by instructing some chunkserver to copy the chunk data directly from an existing valid replica. The new replica is placed with goals similar to those for creation: equalizing disk space utilization, limiting active clone operations on any single chunkserver, and spreading replicas across racks. To keep cloning traffic from overwhelming client traffic, the master limits the numbers of active clone operations both for the cluster and for each chunkserver. Additionally, each chunkserver limits the amount of bandwidth it spends on each clone operation by throttling its read requests to the source chunkserver.

Finally, the master rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing. Also through this process, the master gradually fills up a new chunkserver rather than instantly swamps it with new chunks and the heavy write traffic that comes with them. The placement criteria for the new replica are similar to those discussed above. In addition, the master must also choose which existing replica to remove. In general, it prefers to remove those on chunkservers with below-average free space so as to equalize disk space usage.

4.4 Garbage Collection

After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels. We find that this approach makes the system much simpler and more reliable.

4.4.1 Mechanism

When a file is deleted by the application, the master logs the deletion immediately just like other changes. However, instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp. During the master's regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable). Until then, the file can still be read under the new, special name and can be undeleted by renaming it back to normal. When the hidden file is removed from the namespace, its in-memory metadata is erased. This effectively severs its links to all its chunks.
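A sketch of this rename-then-reap mechanism (the hidden-name format is our own; GFS's actual naming convention is not specified here):

```python
import time

GRACE_SECONDS = 3 * 24 * 3600             # three days (configurable)

def delete(namespace: dict, path: str):
    """Rename to a hidden name carrying the deletion timestamp."""
    name = path.rsplit("/", 1)[-1]
    hidden = f".deleted.{int(time.time())}.{name}"   # illustrative scheme
    namespace[hidden] = namespace.pop(path)

def namespace_scan(namespace: dict, now: float):
    """Drop hidden files older than the grace period; until then they
    can still be read or undeleted by renaming back."""
    for key in list(namespace):
        if key.startswith(".deleted."):
            ts = int(key.split(".")[2])
            if now - ts > GRACE_SECONDS:
                del namespace[key]        # severs the links to its chunks
```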

In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks. In a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.

4.4.2 Discussion

Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, it is quite simple in our case. We can easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master. We can also easily identify all the chunk replicas: they are Linux files under designated directories on each chunkserver. Any such replica not known to the master is "garbage."

The garbage collection approach to storage reclamation offers several advantages over eager deletion. First, it is simple and reliable in a large-scale distributed system where component failures are common. Chunk creation may succeed on some chunkservers but not others, leaving replicas that the master does not know exist. Replica deletion messages may be lost, and the master has to remember to resend them across failures, both its own and the chunkserver's. Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. Second, it merges storage reclamation into the regular background activities of the master, such as the regular scans of namespaces and handshakes with chunkservers. Thus, it is done in batches and the cost is amortized. Moreover, it is done only when the master is relatively free. The master can respond more promptly to client requests that demand timely attention. Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion.

In our experience, the main disadvantage is that the delay sometimes hinders user effort to fine tune usage when storage is tight. Applications that repeatedly create and delete temporary files may not be able to reuse the storage right away. We address these issues by expediting storage reclamation if a deleted file is explicitly deleted again. We also allow users to apply different replication and reclamation policies to different parts of the namespace. For example, users can specify that all the chunks in the files within some directory tree are to be stored without replication, and any deleted files are immediately and irrevocably removed from the file system state.

4.5 Stale Replica Detection

Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down. For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.

Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas. The master and these replicas all record the new version number in their persistent state. This occurs before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks and their associated version numbers. If the master sees a version number greater than the one in its records, the master assumes that it failed when granting the lease and so takes the higher version to be up-to-date.

The master removes stale replicas in its regular garbage collection. Before that, it effectively considers a stale replica not to exist at all when it replies to client requests for chunk information. As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver to read the chunk from another chunkserver in a cloning operation. The client or the chunkserver verifies the version number when it performs the operation so that it is always accessing up-to-date data.

5. FAULT TOLERANCE AND DIAGNOSIS

One of our greatest challenges in designing the system is dealing with frequent component failures. The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. Component failures can result in an unavailable system or, worse, corrupted data. We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.

5.1 High Availability

Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

5.1.1 Fast Recovery

Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry. Section 6.2.2 reports observed startup times.

5.1.2 Chunk Replication

As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification (see Section 5.2). Although replication has served us well, we are exploring other forms of cross-server redundancy such as parity or erasure codes for our increasing read-only storage requirements. We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.

5.1.3 Master Replication

The master state is replicated for reliability. Its operation log and checkpoints are replicated on multiple machines. A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas. For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection that change the system internally. When it fails, it can restart almost instantly. If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.

Moreover, "shadow" masters provide read-only access to the file system even when the primary master is down. They are shadows, not mirrors, in that they may lag the primary slightly, typically fractions of a second. They enhance read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results. In fact, since file content is read from chunkservers, applications do not observe stale file content. What could be stale within short windows is file metadata, like directory contents or access control information.

To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. It depends on the primary master only for replica location updates resulting from the primary's decisions to create and delete replicas.

5.2 Data Integrity

Each chunkserver uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.

A chunk is broken up into 64 KB blocks. Each has a corresponding 32 bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver. Therefore chunkservers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.
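The read-path verification can be sketched as follows (illustrative; GFS's actual checksum function and storage layout are not specified here, so we use CRC32 as a stand-in 32-bit checksum):

```python
import zlib

BLOCK = 64 * 1024   # 64 KB checksum blocks, one 32-bit checksum each

def verify_read(chunk: bytes, checksums: list[int], offset: int, length: int) -> bytes:
    """Verify every 64 KB block that overlaps the requested range
    before returning any data to the requester."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        block = chunk[b * BLOCK : (b + 1) * BLOCK]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")  # report to the master
    return chunk[offset : offset + length]
```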

Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/Os.

Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read.

In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.

During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.

5.3 Diagnostic Tools

Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. Without logs, it is hard to understand transient, non-repeatable interactions between machines. GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. These diagnostic logs can be freely deleted without affecting the correctness of the system. However, we try to keep these logs around as far as space permits.

The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. The logs also serve as traces for load testing and performance analysis.

The performance impact of logging is minimal (and far outweighed by the benefits) because these logs are written sequentially and asynchronously. The most recent events are also kept in memory and available for continuous online monitoring.

6. MEASUREMENTS

In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.

6.1 Micro-benchmarks

We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. Note that this configuration was set up for ease of testing. Typical clusters have hundreds of chunkservers and hundreds of clients.

All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.

6.1.1 Reads

N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold cache results.

Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunkserver.
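
As a worked example of the limit curve in Figure 3(a): the aggregate read ceiling is the smaller of the clients' combined NIC bandwidth and the inter-switch link. A sketch using the numbers above (function name ours):

def read_limit_mb_s(n_clients):
    # Each client NIC saturates at 12.5 MB/s (100 Mbps); the 1 Gbps
    # inter-switch link caps the aggregate at 125 MB/s.
    return min(n_clients * 12.5, 125.0)

# read_limit_mb_s(1) == 12.5; measured 10 MB/s, or 80% of it.
# read_limit_mb_s(16) == 125.0; measured 94 MB/s, or about 75%.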

6.1.2 Writes

N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s input connection.

The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit for this is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate.

Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunkserver as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas.
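
The 67 MB/s plateau follows from the replica fan-out: every byte a client writes consumes input bandwidth on three chunkservers. A sketch under the same topology assumptions (name ours):

def write_limit_mb_s(n_clients, n_chunkservers=16, replicas=3, nic_mb_s=12.5):
    # Client NICs bound small N; for large N the chunkservers' combined
    # input bandwidth divided by the replication factor takes over:
    # 16 * 12.5 / 3, roughly 67 MB/s.
    return min(n_clients * nic_mb_s, n_chunkservers * nic_mb_s / replicas)

# write_limit_mb_s(1) == 12.5; measured 6.3 MB/s, about half.
# write_limit_mb_s(16) == 66.7 (approx); measured 35 MB/s.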

Writes are slower than we would like. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.

6.1.3 Record Appends

Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients.

Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. Therefore, the chunkserver network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunkservers for another file are busy.
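
Unlike reads and writes, appends to one file funnel through the replicas of that file's last chunk, so the ceiling does not grow with N. Our reading of the Figure 3(c) limit, where 12.5 MB/s is a single chunkserver's input link (a guess at the intended model, not a published formula):

def append_limit_mb_s(n_clients, nic_mb_s=12.5):
    # One client cannot exceed its own NIC; all clients together cannot
    # exceed the input link of the chunkservers holding the last chunk.
    return min(n_clients * nic_mb_s, nic_mb_s)

# Flat at 12.5 MB/s for every N; measured 6.0 MB/s at N=1, 4.8 at N=16.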

6.2 Real World Clusters

We now examine two clusters in use within Google that are representative of several others like them. Cluster A is used regularly for research and development by over a hundred engineers. A typical task is initiated by a human user and runs up to several hours. It reads through a few MBs to a few TBs of data, transforms or analyzes the data, and writes the results back to the cluster. Cluster B is primarily used for production data processing. The tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention.

[Table 2, characteristics of the two GFS clusters, listed chunkserver counts, available and used disk space, and numbers of files, dead files, and chunks, plus metadata sizes at the chunkservers and the master for A and B; its values are omitted here.]

Figure 3: Aggregate Throughputs. Three panels plot rate (MB/s) against the number of clients N: (a) Reads, (b) Writes, and (c) Record appends. Top curves show theoretical limits imposed by our network topology. Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.

Cluster                      A          B
Read rate (last minute)      -          380 MB/s
Read rate (last hour)        -          384 MB/s
Read rate (since restart)    -          49 MB/s
Write rate (last minute)     1 MB/s     -
Write rate (last hour)       2 MB/s     -
Write rate (since restart)   25 MB/s    -
Master ops (last minute)     -          533 Ops/s
Master ops (last hour)       -          518 Ops/s
Master ops (since restart)   -          347 Ops/s

Table 3: Performance Metrics for Two GFS Clusters (dashes mark values missing from this copy)

6.2.3 Read and Write Rates

The read rates were much higher than the write rates. The total workload consists of more reads than writes as we have assumed. Both clusters were in the middle of heavy read activity. In particular, A had been sustaining a read rate of 580 MB/s for the preceding week. Its network configuration can support 750 MB/s, so it was using its resources efficiently. Cluster B can support peak read rates of 1300 MB/s, but its applications were using just 380 MB/s.

6.2.4 Master Load

Table 3 also shows that the rate of operations sent to the master was around 200 to 500 operations per second. The master can easily keep up with this rate, and therefore is not a bottleneck for these workloads.

In an earlier version of GFS, the master was occasionally a bottleneck for some workloads. It spent most of its time sequentially scanning through large directories (which contained hundreds of thousands of files) looking for particular files. We have since changed the master data structures to allow efficient binary searches through the namespace. It can now easily support many thousands of file accesses per second. If necessary, we could speed it up further by placing name lookup caches in front of the namespace data structures.

6.2.5 Recovery Time

After a chunkserver fails, some chunks will become under-replicated and must be cloned to restore their replication levels. The time it takes to restore all such chunks depends on the amount of resources. In one experiment, we killed a single chunkserver in cluster B. The chunkserver had about 15,000 chunks containing 600 GB of data. To limit the impact on running applications and provide leeway for scheduling decisions, our default parameters limit this cluster to 91 concurrent clonings (40% of the number of chunkservers) where each clone operation is allowed to consume at most 6.25 MB/s (50 Mbps). All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s.
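
These figures are mutually consistent, as a quick back-of-the-envelope check shows (variable names ours):

max_clones = 91                        # the stated 40% cap on concurrent clones
per_clone_mb_s = 6.25                  # 50 Mbps cap per clone operation
ceiling = max_clones * per_clone_mb_s  # about 569 MB/s allowed at most
effective = 600 * 1024 / (23.2 * 60)   # 600 GB in 23.2 min: ~441 MB/s
# The observed 440 MB/s is ~77% of the ceiling, so cloning did not quite
# saturate its bandwidth allotment throughout the recovery.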

In another experiment, we killed two chunkservers each with roughly 16,000 chunks and 660 GB of data. This double failure reduced 266 chunks to having a single replica. These 266 chunks were cloned at a higher priority, and were all restored to at least 2x replication within 2 minutes, thus putting the cluster in a state where it could tolerate another chunkserver failure without data loss.

6.3 Workload Breakdown

In this section, we present a detailed breakdown of the workloads on two GFS clusters comparable but not identical to those in Section 6.2. Cluster X is for research and development while cluster Y is for production data processing.

6.3.1 Methodology and Caveats

These results include only client originated requests so that they reflect the workload generated by our applications for the file system as a whole. They do not include inter-server requests to carry out client requests or internal background activities, such as forwarded writes or rebalancing.

Statistics on I/O operations are based on information heuristically reconstructed from actual RPC requests logged by GFS servers. For example, GFS client code may break a read into multiple RPCs to increase parallelism, from which we infer the original read. Since our access patterns are highly stylized, we expect any error to be in the noise. Explicit logging by applications might have provided slightly more accurate data, but it is logistically impossible to recompile and restart thousands of running clients to do so and cumbersome to collect the results from as many machines.
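
The inference step might look like the following: adjacent read RPCs from the same client on the same file handle are merged back into one logical read. This is our guess at the heuristic; the record fields are assumptions, not the servers' actual log format.

def coalesce_reads(rpcs):
    """Merge contiguous read RPCs into the original application reads."""
    rpcs = sorted(rpcs, key=lambda r: (r["client"], r["handle"], r["offset"]))
    merged = []
    for r in rpcs:
        last = merged[-1] if merged else None
        if (last is not None
                and last["client"] == r["client"]
                and last["handle"] == r["handle"]
                and last["offset"] + last["length"] == r["offset"]):
            last["length"] += r["length"]   # continuation of one larger read
        else:
            merged.append(dict(r))
    return merged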

One should be careful not to overly generalize from our workload. Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.

[Tables 4 and 5, which break down operations and bytes transferred by operation size (buckets from 0K through 1M and up) for Read, Write, and Record Append on clusters X and Y, are omitted here, as is the body of Table 6.]

Table 6: Master Requests Breakdown by Type (%)

The overwriting we measured approximates the case where a client deliberately overwrites previously written data rather than appends new data. For cluster X, overwriting accounts for under 0.0001% of bytes mutated and under 0.0003% of mutation operations. For cluster Y, the ratios are both 0.05%. Although this is minute, it is still higher than we expected. It turns out that most of these overwrites came from client retries due to errors or timeouts. They are not part of the workload per se but a consequence of the retry mechanism.

6.3.4 Master Workload

Table 6 shows the breakdown by type of requests to the master. Most requests ask for chunk locations (FindLocation) for reads and lease holder information (FindLeaseLocker) for data mutations.

Clusters X and Y see significantly different numbers of Delete requests because cluster Y stores production data sets that are regularly regenerated and replaced with newer versions. Some of this difference is further hidden in the difference in Open requests because an old version of a file may be implicitly deleted by being opened for write from scratch (mode “w” in Unix open terminology).

FindMatchingFiles is a pattern matching request that supports “ls” and similar file system operations. Unlike other requests for the master, it may process a large part of the namespace and so may be expensive. Cluster Y sees it much more often because automated data processing tasks tend to examine parts of the file system to understand global application state. In contrast, cluster X's applications are under more explicit user control and usually know the names of all needed files in advance.

7. EXPERIENCES

In the process of building and deploying GFS, we have experienced a variety of issues, some operational and some technical.

Initially, GFS was conceived as the backend file system for our production systems. Over time, the usage evolved to include research and development tasks. It started with little support for things like permissions and quotas but now includes rudimentary forms of these. While production systems are well disciplined and controlled, users sometimes are not. More infrastructure is required to keep users from interfering with one another.

Some of our biggest problems were disk and Linux related. Many of our disks claimed to the Linux driver that they supported a range of IDE protocol versions but in fact responded reliably only to the more recent ones. Since the protocol versions are very similar, these drives mostly worked, but occasionally the mismatches would cause the drive and the kernel to disagree about the drive's state. This would corrupt data silently due to problems in the kernel. This problem motivated our use of checksums to detect data corruption, while concurrently we modified the kernel to handle these protocol mismatches.

Earlier we had some problems with Linux 2.2 kernels due to the cost of fsync(). Its cost is proportional to the size of the file rather than the size of the modified portion. This was a problem for our large operation logs especially before we implemented checkpointing. We worked around this for a time by using synchronous writes and eventually migrated to Linux 2.4.

Another Linux problem was a single reader-writer lock which any thread in an address space must hold when it pages in from disk (reader lock) or modifies the address space in an mmap() call (writer lock). We saw transient timeouts in our system under light load and looked hard for resource bottlenecks or sporadic hardware failures. Eventually, we found that this single lock blocked the primary network thread from mapping new data into memory while the disk threads were paging in previously mapped data. Since we are mainly limited by the network interface rather than by memory copy bandwidth, we worked around this by replacing mmap() with pread() at the cost of an extra copy.

Despite occasional problems, the availability of Linux code has helped us time and again to explore and understand system behavior. When appropriate, we improve the kernel and share the changes with the open source community.
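
To illustrate the mmap()-to-pread() workaround described above (in Python purely for brevity; GFS itself is not written in Python):

import os

def read_region(fd, offset, length):
    # pread() takes the offset explicitly, so no thread has to hold the
    # address-space reader-writer lock that mmap()-backed access needs;
    # the price is one extra copy from kernel to user space.
    return os.pread(fd, length, offset)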

8. RELATED WORK

Like other large distributed file systems such as AFS [5], GFS provides a location independent namespace which enables data to be moved transparently for load balance or fault tolerance. Unlike AFS, GFS spreads a file's data across storage servers in a way more akin to xFS [1] and Swift [3] in order to deliver aggregate performance and increased fault tolerance.

As disks are relatively cheap and replication is simpler than more sophisticated RAID [9] approaches, GFS currently uses only replication for redundancy and so consumes more raw storage than xFS or Swift.

In contrast to systems like AFS, xFS, Frangipani [12], and InterMezzo [6], GFS does not provide any caching below the file system interface. Our target workloads have little reuse within a single application run because they either stream through a large data set or randomly seek within it and read small amounts of data each time.

Some distributed file systems like Frangipani, xFS, Minnesota's GFS [11] and GPFS [10] remove the centralized server and rely on distributed algorithms for consistency and management. We opt for the centralized approach in order to simplify the design, increase its reliability, and gain flexibility. In particular, a centralized master makes it much easier to implement sophisticated chunk placement and replication policies since the master already has most of the relevant information and controls how it changes. We address fault tolerance by keeping the master state small and fully replicated on other machines. Scalability and high availability (for reads) are currently provided by our shadow master mechanism. Updates to the master state are made persistent by appending to a write-ahead log. Therefore we could adapt a primary-copy scheme like the one in Harp [7] to provide high availability with stronger consistency guarantees than our current scheme.

We are addressing a problem similar to Lustre [8] in terms of delivering aggregate performance to a large number of clients. However, we have simplified the problem significantly by focusing on the needs of our applications rather than building a POSIX-compliant file system. Additionally, GFS assumes a large number of unreliable components and so fault tolerance is central to our design.

GFS most closely resembles the NASD architecture [4]. While the NASD architecture is based on network-attached disk drives, GFS uses commodity machines as chunkservers, as done in the NASD prototype. Unlike the NASD work, our chunkservers use lazily allocated fixed-size chunks rather than variable-length objects. Additionally, GFS implements features such as rebalancing, replication, and recovery that are required in a production environment.

Unlike Minnesota's GFS and NASD, we do not seek to alter the model of the storage device. We focus on addressing day-to-day data processing needs for complicated distributed systems with existing commodity components.

The producer-consumer queues enabled by atomic record appends address a similar problem as the distributed queues in River [2]. While River uses memory-based queues distributed across machines and careful data flow control, GFS uses a persistent file that can be appended to concurrently by many producers. The River model supports m-to-n distributed queues but lacks the fault tolerance that comes with persistent storage, while GFS only supports m-to-1 queues efficiently. Multiple consumers can read the same file, but they must coordinate to partition the incoming load.

9. CONCLUSIONS

The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. While some design decisions are specific to our unique setting, many may apply to data processing tasks of a similar magnitude and cost consciousness.

We started by reexamining traditional file system assumptions in light of our current and anticipated application workloads and technological environment. Our observations have led to radically different points in the design space. We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system.

Our system provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery. Chunk replication allows us to tolerate chunkserver failures. The frequency of these failures motivated a novel online repair mechanism that regularly and transparently repairs the damage and compensates for lost replicas as soon as possible. Additionally, we use checksumming to detect data corruption at the disk or IDE subsystem level, which becomes all too common given the number of disks in the system.

Our design delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. We achieve this by separating file system control, which passes through the master, from data transfer, which passes directly between chunkservers and clients. Master involvement in common operations is minimized by a large chunk size and by chunk leases, which delegate authority to primary replicas in data mutations. This makes possible a simple, centralized master that does not become a bottleneck. We believe that improvements in our networking stack will lift the current limitation on the write throughput seen by an individual client.

GFS has successfully met our storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables us to continue to innovate and attack problems on the scale of the entire web.

ACKNOWLEDGMENTS

We wish to thank the following people for their contributions to the system or the paper. Brian Bershad (our shepherd) and the anonymous reviewers gave us valuable comments and suggestions. Anurag Acharya, Jeff Dean, and David desJardins contributed to the early design. Fay Chang worked on comparison of replicas across chunkservers. Guy Edjlali worked on storage quota. Markus Gutschke worked on a testing framework and security enhancements. David Kramer worked on performance enhancements. Fay Chang, Urs Hoelzle, Max Ibel, Sharon Perl, Rob Pike, and Debby Wallach commented on earlier drafts of the paper. Many of our colleagues at Google bravely trusted their data to a new file system and gave us useful feedback. Yoshka helped with early testing.

REFERENCES

[1] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, and Randolph Wang. Serverless network file systems. In Proceedings of the 15th ACM Symposium on Operating System Principles, pages 109–126, Copper Mountain Resort, Colorado, December 1995.

[2] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River: Making the fast case common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS '99), pages 10–22, Atlanta, Georgia, May 1999.

[3] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computer Systems, 4(4):405–436, 1991.

[4] Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th Architectural Support for Programming Languages and Operating Systems, pages 92–103, San Jose, California, October 1998.

[5] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satyanarayanan, Robert Sidebotham, and Michael West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51–81, February 1988.

[6] InterMezzo. http://www.inter-mezzo.org, 2003.

[7] Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, and Michael Williams. Replication in the Harp file system. In 13th Symposium on Operating System Principles, pages 226–238, Pacific Grove, CA, October 1991.

[8] Lustre. http://www.lustre.org, 2003.

[9] David A. Patterson, Garth A. Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 109–116, Chicago, Illinois, September 1988.

[10] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages 231–244, Monterey, California, January 2002.

[11] Steven R. Soltis, Thomas M. Ruwart, and Matthew T. O'Keefe. The Global File System. In Proceedings of the Fifth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, College Park, Maryland, September 1996.

[12] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A scalable distributed file system. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 224–237, Saint-Malo, France, October 1997.
