Exploring MapReduce Efficiency with Highly-Distributed Data*

Michael Cardosa, Chenyu Wang, Anshuman Nangia, Abhishek Chandra, Jon Weissman

University of Minnesota

Minneapolis,MN,USA

{cardosa, chwang, nangia, chandra, jon}@cs.umn.edu

ABSTRACT

MapReduce is a highly popular paradigm for high-performance computing over large data sets in large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed, e.g. data is collected in separate data center locations, the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show the traditional single-cluster MapReduce setup may not be suitable for situations when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.

Categories and Subject Descriptors

D.4.7 [Organization and Design]: Distributed Systems

General Terms

Management, Performance

1. INTRODUCTION

MapReduce, and specifically Hadoop, has emerged as a dominant paradigm for high-performance computing over large data sets in large-scale platforms. Fueling this growth is the emergence of cloud computing and services such as Amazon EC2 [2] and Amazon Elastic MapReduce [1]. Traditionally, MapReduce has been deployed over local clusters or tightly-coupled cloud resources with one centralized data source.

However, this traditional MapReduce deployment becomes inefficient when the source data, along with the computing platform, is widely (or even partially) distributed. Applications such as scientific applications, weather forecasting, click-stream analysis, web crawling, and social networking applications could have several distributed data sources, i.e., large-scale data could be collected in separate data center locations or even across the Internet. For these applications, there usually also exist distributed computing resources, e.g. multiple data centers. In these cases, the most efficient architecture for running MapReduce jobs over the entire data set becomes non-trivial.

* This work was supported by NSF Grant IIS-0916425 and NSF Grant CNS-0643505.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MapReduce'11, June 8, 2011, San Jose, California, USA.
Copyright 2011 ACM 978-1-4503-0700-0/11/06 ...$10.00.

One possible "local" approach is to collect the distributed data into a centralized location from which the MapReduce cluster would be based. This would revert to a tightly-coupled environment for which MapReduce was originally intended. However, large-scale wide-area data transfer costs can be high, so it may not be ideal to move all the source data to one location, especially if compute resources are also distributed.

Another approach would be a "global" MapReduce cluster deployed over all wide-area resources to process the data over a large pool of global resources. However, in a loosely-coupled system, runtime costs could be large in the MapReduce shuffle and reduce phases, where potentially large amounts of intermediate data could be moved around the wide-area system.

A third potential distributed solution is to set up multiple MapReduce clusters in different locales and then combine their respective results in a second-level MapReduce (or reduce-only) job. This would potentially avoid the drawbacks of the first two approaches by distributing the computation such that the required data movement would be minimized through the coupling of computational resources with nearby data in the wide-area system. One important issue with this setup, however, is that the final second-level MapReduce job can only complete after all first-level MapReduce jobs have completed, so a single straggling MapReduce cluster may delay the entire computation by an undesirable amount.

Important considerations in constructing an appropriate MapReduce architecture from the above approaches are the workload and data flow patterns. Workloads that have a high level of aggregation (e.g., Word Count on English text) may benefit greatly from a distributed approach, which avoids shuffling large amounts of input data around the system when the output will be much smaller. However, workloads with low aggregation (e.g., sort) may perform better under local or global architectures.

In this paper, we show the traditional single-cluster MapReduce setup may not be suitable for situations when data and compute resources are widely distributed. We examine three main MapReduce architectures given the distributed data and resources assumption, and evaluate their performance over several workloads. We utilize two platforms: PlanetLab, for Internet-scale environments, and Amazon EC2, for datacenter-scale infrastructures. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.

Figure 1: Data and compute resources are widely distributed and have varying interconnection link speeds.

2. SYSTEM MODEL AND ENVIRONMENT

2.1 Resource and Workload Assumptions

In our system, we assume we have a number of widely-distributed compute resources (i.e., compute clusters) and data sources. We also have a MapReduce job which must be executed over all the combined data. The following are the assumptions we make in our system model:

• Compute resources: Multiple clusters exist, where a cluster is defined as a local grouping of one or more physical machines, and the machines within a cluster are tightly-coupled. Machines belonging to different clusters are assumed to be loosely-coupled. For example, as seen in Figure 1, there could be a compute cluster in the US, one in Europe, and one in Asia.

• Data sources: There exist data sources at various locations, where large-scale data is either collected or generated. In our example in Figure 1, data sources are shown along with their connection speeds to the compute resources. The bandwidth available between the data sources and compute resources is a function of network proximity/topology. In our experiments, we locate data sources within the same clusters as the compute resources.

• Workload: A MapReduce job is to be executed whose input is all the data sources. We explicitly use Hadoop, and assume a Hadoop Distributed File System (HDFS) instantiation must be used to complete the job. Therefore the data must be moved from the data sources into HDFS before the job can begin (a minimal sketch of such a push is given after this list).
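Since every architecture we consider begins by loading the source data into some HDFS instance, the following minimal sketch shows one way such a push could be done with the Hadoop FileSystem API; the NameNode address and paths are illustrative placeholders, not values from our setup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: push locally collected source data into a cluster's HDFS
    // before the MapReduce job starts. Hostname and paths are hypothetical.
    public class PushToHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the target cluster's NameNode (placeholder address).
        conf.set("fs.default.name", "hdfs://master.example.org:9000");
        FileSystem fs = FileSystem.get(conf);
        // HDFS replicates the written blocks according to the configured
        // replication factor once the copy completes.
        fs.copyFromLocalFile(new Path("/local/data/source-us.txt"),
                             new Path("/input/source-us.txt"));
        fs.close();
      }
    }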

2.2 MapReduce Dataflow in Distributed Environments

As previous work in MapReduce has not included performance analysis over wide-area distributed systems, it is important to understand where performance bottlenecks may reside in the dataflow of a MapReduce job in such environments.

Figure 2 shows the general dataflow of a MapReduce job. Prior to the job starting, the delay in starting the job is in the transfer of all data into HDFS (with a given replication factor). If the source data is replicated to distant locations, this could introduce significant delays.

After the job starts, the individual Map tasks are usually executed on machines which have already stored the HDFS data block corresponding to the Map task; these are normal, "block-local" tasks. This stage does not rely on network bandwidth or latency.

Figure 2: In the traditional MapReduce workflow, Map tasks operate on local blocks, but intermediate data transfer during the Reduce phase is an all-to-all operation.

If a node becomes idle during the Map phase, it may be assigned a Map task for which it does not have the data block locally stored; it would then need to download that block from another HDFS data location, which could be costly depending on which data source it chooses for the download.

Finally, and most importantly, there is the Reduce phase. The Reduce operation is an all-to-all transmission of intermediate data from the Map task outputs to the Reduce tasks. If there is a significant amount of intermediate data, this all-to-all communication could be costly depending on the bandwidth of each end-to-end link.

In the next section, we will propose architectures to avoid these potential performance bottlenecks when performing MapReduce jobs in distributed environments.

3. FACTORS IMPACTING MAPREDUCE PERFORMANCE

In this section, we study factors that could impact MapReduce performance in loosely-coupled distributed environments. First, we suggest some potential architectures for deploying MapReduce clusters, and then discuss how different workloads may also impact the performance.

3.1 Architectural Approaches

Intuitively, there are three ways to tackle MapReduce jobs running on widely-distributed data. We now describe these three architectures, showing examples of the respective architectures in Figure 3.

• Local MapReduce (LMR): Move all the data into one centralized cluster and perform the computation as a local MapReduce job in that single cluster. (Figure 3(a))

• Global MapReduce (GMR): Build a global MapReduce cluster using nodes from all locations. We must push data from the different locations into a global HDFS and run the MapReduce job over these global compute resources. (Figure 3(b))

• Distributed MapReduce (DMR): Construct multiple MapReduce clusters using a partitioning heuristic (e.g., node-to-data proximity), partition the source data appropriately, run MapReduce jobs in each cluster, and then combine the results with a final MapReduce job (a sketch of such a combine job is given after Figure 3). This can also be thought of as a hierarchical MapReduce overlay. (Figure 3(c))

Figure 3: Architectural approaches for constructing MapReduce clusters to process highly-distributed data: (a) Local MapReduce (LMR), (b) Global MapReduce (GMR), (c) Distributed MapReduce (DMR). This example assumes there are two widely-separated data centers (US and Asia) but tightly-coupled nodes inside the respective data centers.
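To make the DMR result-combination step concrete, the sketch below shows what a second-level combine job could look like for a wordcount-style workload: it reads the word/count pairs emitted by the first-level per-cluster jobs (after they have been copied to one location) and sums the partial counts. The class names, paths, and the tab-separated output format are illustrative assumptions rather than details of our actual implementation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical second-level "result-combine" job for DMR: merges the
    // partial "word<TAB>count" outputs of the first-level per-cluster jobs.
    public class CombineCounts {

      public static class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws java.io.IOException, InterruptedException {
          // Each input line is one record of a first-level job's output.
          String[] parts = line.toString().split("\t");
          if (parts.length == 2) {
            ctx.write(new Text(parts[0]), new LongWritable(Long.parseLong(parts[1])));
          }
        }
      }

      public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text word, Iterable<LongWritable> partialCounts, Context ctx)
            throws java.io.IOException, InterruptedException {
          long total = 0;
          for (LongWritable c : partialCounts) {
            total += c.get();  // sum the per-cluster partial counts
          }
          ctx.write(word, new LongWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "dmr-result-combine");
        job.setJarByClass(CombineCounts.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // copied sub-results
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final combined result
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }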

The choice of architecture is paramount in ultimately determining the performance of the MapReduce job. Such a choice depends on the network topology of the system, inter-node bandwidth and latency, the workload type, and the amount of data aggregation occurring within the workload. Our goal in this paper is to provide recommendations for which architecture to use, given these variables as inputs.

Next, we examine three different levels of data aggregation occurring in different example workloads, from which we derive our experimental setup for evaluation.

3.2 Impact of Data Aggregation

The amount of data that flows through the system in order for the MapReduce job to complete successfully is a key parameter in our evaluations. We use three different workload data aggregation schemes in our evaluation, described as follows:

• High Aggregation: The MapReduce output is multiple orders of magnitude smaller than the input, e.g. Wordcount on English plain text.

• Zero Aggregation: The MapReduce output is the same size as the input, e.g. the Sort workload.

• Ballooning Data: The MapReduce output is larger than the input, e.g. document format conversion from LaTeX to PDF.

The size of the output data is important because during the reduce phase, the all-to-all intermediate data shuffle may result in considerable overhead in cases where the compute facilities have bandwidth bottlenecks. In our evaluation we analyze the performance of each of the three architectures in combination with these workload data aggregation schemes in order to reach a better understanding of which architecture should be recommended under varying circumstances.

4. EXPERIMENTAL SETUP

In this section we describe our specific experimental setups over two platforms:

• PlanetLab: A planetary-scale, highly-heterogeneous, distributed, shared, virtualized environment where each slice is guaranteed only 1/n of a machine's resources, where n is the number of active slices on the machine. This environment models highly-distributed, loosely-coupled systems in general for our study.

• Amazon EC2: The Amazon Elastic Compute Cloud is a large pool of resources with strong resource guarantees in a large virtualized data center environment. We run experiments across multiple EC2 data centers to model environments where data is generated at multiple data center locations and where clusters are tightly coupled.

We use Hadoop 0.20.1 over these platforms for our experiments. In our experimental setups, due to the relatively low overhead of the master processes (i.e., JobTracker and NameNode) given our workloads, we also couple a slave node with a master node. In all our experiments, source data must first be pushed into an appropriate HDFS before the MapReduce job starts.

We have three main experiments that we use on both platforms for evaluation, involving Wordcount and Sort and the two different source data types. The experimental setups can be seen in Tables 3 and 5. These experiments are meant to model the high-aggregation, zero-aggregation, and ballooning workload dataflow models mentioned in Section 3.2.

4.1 PlanetLab

On PlanetLab we used a total of 8 nodes in two widely-separated clusters: 4 nodes in the US and 4 nodes in Europe. In addition, we used one node in each cluster as a data source. For each cluster, we chose tightly-coupled machines with high inter-node bandwidth (i.e., they were either co-located at the same site or shared some network infrastructure). In the presence of bandwidth limitations in PlanetLab and workload interference from other slices, the inter-site bandwidth we experienced was between 1.5–2.5 MB/s. On the other hand, the inter-continental bandwidth between any pair of nodes (between the US and EU) was relatively low, around 300–500 KB/s. The exact configuration is shown in Tables 1 and 2.

Due to the limited compute resources available to our slice at each node, we were limited to a relatively small input data size so that our experiments would finish in a timely manner and not cause an overload. At each of the two data sources, we placed 400 MB of plain-text data (English text from random Internet articles) and 125 MB of random binary data generated by RandomWriter in Hadoop. In total, 800 MB of plain-text data and 250 MB of random binary data were used in our PlanetLab experiments.

The number of Map tasks by default is the number of input data blocks. However, since we were using a relatively small input data size, the default block size of 64 MB would result in a highly coarse-grained load for distribution across our cluster. Therefore we set the number of Map tasks to 12 for the 250 MB data set (split size approximately 20.8 MB).
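(As a quick check, 250 MB divided into 12 splits gives 250/12 ≈ 20.8 MB per split, matching the split size quoted above.)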

Table 1: PlanetLab US inter-cluster and intra-cluster transmission speed

Node              | Location                | from Data Source US | from Data Source EU | from/to MasterUS
MasterUS/SlaveUS0 | Harvard University      | 5.8 MB/s            | 297 KB/s            | -
SlaveUS1          | Harvard University      | 6.0 MB/s            | 358 KB/s            | 9.6 MB/s
SlaveUS2          | University of Minnesota | 1.3 MB/s            | 272 KB/s            | 1.6 MB/s
SlaveUS3          | University of Minnesota | 1.3 MB/s            | 274 KB/s            | 1.7 MB/s
DataSourceUS      | Princeton University    | -                   | -                   | 5.8 MB/s

Table 2: PlanetLab EU inter-cluster and intra-cluster transmission speed

Node              | Location            | from Data Source US | from Data Source EU | from/to MasterEU
MasterEU/SlaveEU0 | Imperial University | 600 KB/s            | 1.1 MB/s            | -
SlaveEU1          | Imperial University | 600 KB/s            | 1.2 MB/s            | 1.3 MB/s
SlaveEU2          | Imperial University | 580 KB/s            | 1.2 MB/s            | 1.2 MB/s
SlaveEU3          | Imperial University | 580 KB/s            | 1.1 MB/s            | 1.2 MB/s
DataSourceEU      | UCL                 | -                   | -                   | 1.1 MB/s

Table 3: PlanetLab workload and data configurations

Workload  | Data source | Aggregation | Input Size
Wordcount | Plain-text  | High        | 800 MB (400 MB x 2)
Wordcount | Random data | Ballooning  | 250 MB (125 MB x 2)
Sort      | Random data | Zero        | 250 MB (125 MB x 2)

We set the number of Reduce tasks to 2 for each workload. In the case where we separate the job into two disjoint MapReduce jobs, we give each of the two clusters a single Reduce task.

We implemented the MapReduce architectures described in Section 3.1, with the two disjoint clusters being the nodes from the US and EU, respectively. The specific architectural resource allocations are as follows:

• Local MapReduce (LMR): 4 compute nodes in the US are used, along with the US and EU data sources.

• Global MapReduce (GMR): 2 compute nodes from the US, 2 compute nodes from the EU, and both the US and EU data sources are used.

• Distributed MapReduce (DMR): 2 compute nodes from the US, 2 compute nodes from the EU, and both the US and EU data sources are used.

4.2 Amazon EC2

In our Amazon EC2 environment we used m1.small nodes, each of which was allocated 1 EC2 32-bit compute unit, 1.7 GB of RAM, and 160 GB of instance storage. We used nodes from EC2 data centers located in the US and Europe. The data transmission speeds we experienced between these data centers are shown in Table 4. As in the PlanetLab setup, the number of reducers was fixed at 2 in LMR and 1 per cluster in DMR. Our workload configurations can be seen in Table 5. We used data input sizes between 1–3.2 GB. Our EC2 experiment architectures were limited to LMR and DMR¹ and are allocated as follows:

• Local MapReduce (LMR): 6 compute nodes in the US data center are used, along with the US and EU data sources.

• Distributed MapReduce (DMR): 3 compute nodes from the US data center, 3 compute nodes from the EU data center, and both the US and EU data sources are used.

¹ Due to an IP addressing bug between public and private address usage on EC2, we were unable to get GMR to run across multiple EC2 data center locations.

Table 4: EC2 inter-cluster and intra-cluster transmission speed

From/To | US        | EU
US      | 14.3 MB/s | 9.9 MB/s
EU      | 5.8 MB/s  | 10.2 MB/s

Table 5: Amazon EC2 workload and data configurations

Workload  | Data source | Aggregation | Input Size
Wordcount | Plain-text  | High        | 3.2 GB (1.6 GB x 2)
Wordcount | Random data | Ballooning  | 1 GB (500 MB x 2)
Sort      | Random data | Zero        | 1 GB (500 MB x 2)

4.3 Measuring Job Completion Time

We enumerate how job completion time is measured for the MapReduce job under each of the selected architectures (a formula for the DMR case is given after this list):

• Local MapReduce (LMR): Job completion time is measured as the time taken to insert data from both data sources into the HDFS located in the main cluster, plus the MapReduce runtime in the main cluster.

• Global MapReduce (GMR): Job completion time is measured as the global HDFS data insertion time from both data sources, plus the global MapReduce runtime.

• Distributed MapReduce (DMR): Job completion time is taken as the maximum time of the two individual MapReduce jobs (plus HDFS push times) in the respective clusters (both must finish before the results can be combined), plus the result-combination step that combines the sub-results into the final result at a single data location.
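Written as a formula, the DMR completion time described above can be expressed as

    T_DMR = max(T_push,US + T_MR,US, T_push,EU + T_MR,EU) + T_combine

where T_push is the time to push a data source into its cluster's HDFS, T_MR is that cluster's MapReduce runtime, and T_combine is the result-combination step.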

5. EVALUATION

5.1 PlanetLab

We ran 4 trials of each of the three main experiments over the three architectures on PlanetLab. We measured and plotted each component of the job completion times:

1. Push US is the time taken to insert the data from the US data source into the proper HDFS given the architecture,

2. Push EU, similar to Push US, is the time taken to insert the data from the EU data source into the proper HDFS,

3. Map is the Map-phase runtime,

4. Reduce is the residual Reduce-phase runtime after the Map progress has reached 100%,

5. Result-Combine is the combine phase for the DMR architecture only, comprising the data transmission plus combination costs, assuming these could be done in parallel, and

6. Total is the total runtime of the entire MapReduce job.

We have plotted the averages as well as the 95th percentile confidence intervals.

Figure 4: PlanetLab results. In the case of high aggregation in (a), DMR finishes the fastest due to avoiding the transfer of input data over slow links. But in zero-aggregation and ballooning-data conditions, LMR finishes faster since it minimizes intermediate and output data transfer over the slow wide-area links.

Figure 5: Amazon EC2 results. In the case of high aggregation in (a), DMR still outperforms LMR but at a smaller relative margin than PlanetLab due to higher inter-cluster bandwidth. LMR again outperforms DMR in zero-aggregation and ballooning-data scenarios where it is advantageous to centralize the input data immediately instead of waiting until the intermediate or output phases.

High-Aggregation Experiment: We ran Wordcount on 800 MB of plain text. As seen in Figure 4(a), DMR completed 53% faster than the other architectural approaches. First of all, DMR benefits from parallelism, saving time on the initial HDFS data push transmission costs from both data sources. Since the data is only pushed to its local HDFS cluster, and this is done in parallel in both clusters, we avoid a major bottleneck. LMR and GMR both transmit input data across clusters, which is costly in this environment. Secondly, note that the Map and Reduce phase times are almost identical across the three architectures, since the Map tasks are run with local data, and the intermediate data is small due to the high aggregation factor.

Last, the cost of the result-combine step for DMR is low since the output data is small. Therefore we see a statistically significant advantage in using the DMR architecture in our high-aggregation experiment.

Ballooning-Data Experiment: We ran Wordcount on 250 MB of random binary data, which resulted in an output 1.8 times larger than the input, since each "word" in the random data is unique and the textual annotation of each word occurrence adds to the size of the output data.

As seen in Figure 4(b), DMR still benefits from faster HDFS push times from the data sources. However, since the intermediate and output data sizes are quite large, the result-combining step adds a large overhead to the total runtime. Even though the reduce phase appears to be much faster, this is because the final result-combining step is also acting as a final reduce operation.

LMR is the statistically significant best finisher in this experiment due to its avoidance of transmitting large intermediate and output data across the wide-area system (as both GMR and DMR do); instead, by just transmitting the input data between the two clusters, it comes out ahead, 16% and 30% faster than DMR and GMR, respectively.

Zero-Aggregation Experiment: We ran Sort on 250 MB of random binary data. As seen in Figure 4(c), this experiment has results similar to the ballooning-data experiment. However, since there is less intermediate and output data than in the previous experiment, the three architectures finish much closer to each other. LMR finishes only 9% faster than DMR and 17% faster than GMR.

Since there is zero aggregation occurring in this experiment, and it is merely shuffling the same amount of data around, only at different steps, it makes sense that the results are very similar to each other. LMR transmits half the input data between clusters before the single MapReduce job begins; DMR transmits half the output data between the clusters after the half-size MapReduce jobs have completed, only to encounter a similar-sized result-combine step. DMR is also at a disadvantage if its workload partitioning is unequal, in which case one cluster would be waiting idle for the other to complete its half of the work.

GMR has statistically worse performance than LMR, primarily because some data blocks may travel between clusters twice (due to the location-unaware HDFS replication) instead of just once.

5.2 Amazon EC2

Our experiments on Amazon EC2 further confirmed the intuitions behind our PlanetLab results, and provided even stronger statistically significant results. We ran 4 trials of each of the three main experiments over these architectures, plotting the averages and the 95th percentile confidence intervals.

High-Aggregation Experiment: As seen in Figure 5(a), DMR still outperforms LMR, but by only 9%, a smaller relative margin than on PlanetLab due to the higher inter-cluster bandwidth. Also, because of the high inter-cluster bandwidth, LMR incurs less of a penalty from transferring half the input data across data centers, which makes it a close contender to DMR.

Ballooning-Data Experiment: As seen in Figure 5(b), LMR outperforms DMR, and at a noticeably higher relative margin (44%) compared to our PlanetLab results (16%). LMR again avoids the cost of moving large intermediate and output data between clusters, and instead saves on transfer costs by moving the input data before the job begins.

Zero-Aggregation Experiment: From Figure 5(c), our results are quite similar to those from PlanetLab. LMR and DMR are almost in a statistical tie, but DMR finishes in second place (by 21%), most likely due to the equal-partitioning problem: if one of the two clusters falls slightly behind in the computation, the other cluster finishes slightly early, and then half of the data center compute resources sit idle while waiting for the other half to finish. With LMR, in contrast, no compute resources would remain idle if there were straggling tasks.

5.3 Summary

In our evaluation we measured the performance of three separate MapReduce architectures over three benchmarks on two platforms, PlanetLab and Amazon EC2. In the case of a high-aggregation workload, we found that DMR significantly outperformed both the LMR and GMR architectures since it avoids the transfer overheads of the input data.

However, in the case of zero-aggregation or ballooning-data scenarios, LMR outperforms DMR and GMR since it moves the input data to a centralized cluster, incurring an unavoidable data transfer cost at the beginning instead of at later stages, which allows the compute resources to operate more efficiently over local data.

6. RECOMMENDATIONS

In this paper we analyzed the performance of MapReduce when operating over highly-distributed data. Our aim was to study the performance of MapReduce in three distinct architectures over varying workloads, and to provide recommendations on which architecture should be used for different combinations of workloads, data sources, and network topologies.

We make the following recommendations from the lessons we have learned:

• For high-aggregation workloads, distributed computation is preferred. This is true especially with high inter-cluster transfer costs. In this case, avoiding the unnecessary transfer of input data around the wide-area system is a key performance optimization.

• For workloads with zero aggregation or ballooning data, centralizing the data (LMR) is preferred. This assumes that an equal amount of compute resources can be allocated in an LMR architecture as in a DMR or GMR setting.

• For distributed computation, equal partitioning of the workload is crucial for the architecture to be beneficial. As seen with the Sort benchmark on random data, DMR fell slightly behind LMR in both the PlanetLab and EC2 results. Even a slightly unequal partitioning has noticeable effects, since compute resources sit idle while waiting for other sub-jobs to finish. If unsure about input partitioning, LMR or GMR should be preferred over DMR.

• If the data distribution and/or compute resource distribution is asymmetric, GMR may be preferred. If we are unable to decide where the local cluster should be located in consideration of LMR, and we are not able to divide the computation equally in consideration of DMR (where each sub-cluster should have a similar runtime over the disjoint data sets), then GMR is probably the most conservative decision.

• The Hadoop NameNode, by being made location-aware or topology-aware, could make more efficient block replication decisions. If we could insert data into HDFS such that the storage locations are in close proximity to the data sources, this would in fact emulate DMR in a GMR-type architecture (a minimal configuration sketch follows this list). The default block replication scheme assumes tightly-coupled nodes, either across racks (even with rack awareness) in the same datacenter, or in a single cluster. The goals change in a distributed environment where data transfer costs are much higher.
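As a point of reference for the last recommendation, Hadoop already exposes a pluggable topology hook through the topology.script.file.name property, which maps node addresses to a topology path used for replica placement. The sketch below shows how such a mapping might be wired up; the script path and location labels are hypothetical examples, and since the default placement policy reasons about racks, a truly wide-area, location-aware policy would likely require further changes to the replication logic.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch (hypothetical path and labels): point Hadoop's topology
    // hook at a script that maps each node address to a location path such as
    // /us/site1 or /eu/site1. The NameNode consults this mapping when choosing
    // where to place block replicas.
    public class TopologyAwareConf {
      public static Configuration create() {
        Configuration conf = new Configuration();
        // The script receives node addresses as arguments and prints one
        // topology path per address.
        conf.set("topology.script.file.name", "/etc/hadoop/wide-area-topology.sh");
        return conf;
      }
    }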

7. RELATED WORK

Traditionally, the MapReduce [7] programming paradigm assumed the operating cluster was composed of tightly-coupled, homogeneous compute resources that are generally reliable. Previous work has shown that if this assumption is broken, MapReduce/Hadoop performance suffers.

Work in [11] showed that in heterogeneous environments, Hadoop performance greatly suffered from stragglers, simply from machines that were slower than others. With a new scheduling algorithm, these drawbacks were resolved. In our work, we assume that nodes can be heterogeneous since they belong to different data centers or locales. However, the loosely-coupled nature of the systems we address adds an additional bandwidth constraint problem, along with the problem of having widely-dispersed data sources.

Mantri [3] proposed strategies for proactively detecting stragglers in MapReduce to improve the performance of MapReduce jobs. Such improvements are complementary to our techniques; our work is less concerned with node-level slowdowns and more focused on high-level architectures for higher-level performance issues with resource allocation and data movement.

MOON [9]explored MapReduce performance in volatile,vol-unteer computing environments and extended Hadoop to provide

improved performance under situations where slave nodes are un-reliable.In our work,we do not focus on solving reliability issues;

instead we are concerned with performance issues of allocating

compute resources to MapReduce clusters and relocating source

data.Moreover,MOON does not consider WAN,which is a main concern in this paper.Other work has focused on ?ne-tuning MapReduce parameters or offering scheduling optimizations to provide better performance.Sandholm et.al.[10]present a dynamic priorities system for im-proved MapReduce run-times in the context of multiple jobs.Our work is concerned with optimizing single jobs relative to data source and compute resource locations.Work by Shivnath [4]provided al-gorithms for automatically ?ne-tuning MapReduce parameters to optimize job performance.This is complimentary to our work,since these same strategies could be applied in our system after we determine the best MapReduce architecture,in each of our re-spective MapReduce clusters.MapReduce pipelining [6]has been used to modify the Hadoop work?ow for improved responsiveness and performance.This would be a complimentary optimization to our techniques since they are concerned with the packaging and moving of intermediate data without storing it to disk in order to speed up computation time.This could be implemented on top of our architecture recommendations to improve performance.

Work in wide-area data transfer and dissemination includes GridFTP [8]and BitTorrent [5].GridFTP is a protocol for high-performance data transfer over high-bandwidth wide-area networks.Such mid-dleware would further complement our work by reducing data trans-fer costs in our architectures and would further optimize MapRe-duce performance.BitTorrent is a peer-to-peer ?le sharing protocol for wide-area distributed systems.Both of these could act as mid-dleware services in our high-level architectures to make wide-area data more accessible to wide-area compute resources.

8. CONCLUSION

In this paper, we have shown that the traditional single-cluster MapReduce architecture may not be suitable for situations when data and compute resources are widely distributed. We examined three architectural approaches to performing MapReduce jobs over highly distributed data and compute resources, and evaluated their performance over several workloads on two platforms, PlanetLab and Amazon EC2. Based on the lessons learned from our experimental evaluations, we have provided recommendations for when to apply the various MapReduce architectures as a function of several key parameters: workloads and aggregation levels, network topology and data transfer costs, and data partitioning. We learned that a local architecture (LMR) is preferred in zero-aggregation conditions, and distributed architectures (DMR) are preferred in high-aggregation and equal-partitioning conditions.

References

[1] Amazon EMR. http://aws.amazon.com/elasticmapreduce/.
[2] Amazon EC2. http://aws.amazon.com/ec2.
[3] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, and B. Saha. Reining in the outliers in map-reduce clusters. In Proceedings of OSDI, 2010.
[4] S. Babu. Towards automatic optimization of MapReduce programs. In ACM SOCC, 2010.
[5] BitTorrent. http://www.bittorrent.com/.
[6] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. In Proceedings of NSDI, 2010.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, 2004.
[8] GridFTP. http://www.globus.org/toolkit/docs/3.2/gridftp/.
[9] H. Lin, X. Ma, J. Archuleta, W.-c. Feng, M. Gardner, and Z. Zhang. MOON: MapReduce on opportunistic environments. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), 2010.
[10] T. Sandholm and K. Lai. MapReduce optimization using dynamic regulated prioritization. In ACM SIGMETRICS/Performance, 2009.
[11] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of OSDI, 2008.
