Biometrics 62, 19–27

March 2006

DOI: 10.1111/j.1541-0420.2005.00437.x

Statistical Methods for Expression Quantitative Trait Loci (eQTL) Mapping

C. M. Kendziorski,1,* M. Chen,2 M. Yuan, H. Lan,3 and A. D. Attie3

1 Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin 53703, U.S.A.

2 Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin 53703, U.S.A.

3 Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin 53703, U.S.A.

* email: kendzior@…

Summary. Traditional genetic mapping has largely focused on the identification of loci affecting one, or at most a few, complex traits. Microarrays allow for measurement of thousands of gene expression abundances, themselves complex traits, and a number of recent investigations have considered these measurements as phenotypes in mapping studies. Combining traditional quantitative trait loci (QTL) mapping methods with microarray data is a powerful approach with demonstrated utility in a number of recent biological investigations. These expression quantitative trait loci (eQTL) studies are similar to traditional QTL studies, as a main goal is to identify the genomic locations to which the expression traits are linked. However, eQTL studies probe thousands of expression transcripts; and as a result, standard multi-trait QTL mapping methods, designed to handle at most tens of traits, do not directly apply. One possible approach is to use single-trait QTL mapping methods to analyze each transcript separately. This leads to an increased number of false discoveries, as corrections for multiple tests across transcripts are not made. Similarly, the repeated application, at each marker, of methods for identifying differentially expressed transcripts suffers from multiple tests across markers. Here, we demonstrate the deficiencies of these approaches and propose a mixture over markers (MOM) model that shares information across both markers and transcripts. The utility of all methods is evaluated using simulated data as well as data from an F2 mouse cross in a study of diabetes. Results from simulation studies indicate that the MOM model is best at controlling false discoveries, without sacrificing power. The MOM model is also the only one capable of finding two genome regions previously shown to be involved in diabetes.

Key words: Bayesian hierarchical mixture model; Expression trait loci (eQTL) mapping; Gene expression; Microarray; Quantitative trait loci (QTL) mapping.

1. Introduction

Traditional genetic mapping has largely focused on the identification of loci affecting one, or at most a few, complex traits. Microarrays allow for measurement of thousands of gene expression abundances, themselves complex traits, and a number of recent investigations have considered these measurements as phenotypes in mapping studies. This type of approach has the potential to impact a broad range of biological endeavors (Cox, 2004). Utility has been demonstrated in identifying candidate genes (Schadt et al., 2003), in inferring not only correlative but also causal relationships between modulator and modulated genes (Brem et al., 2002; Schadt et al., 2003; Yvert et al., 2003), and in elucidating subclasses of clinical phenotypes (Schadt et al., 2003). As a result of these early successes, a number of efforts are now underway to localize the genetic basis of gene expression.

As part of one such effort, an experiment was designed to identify the genetic basis for differences between two inbred mouse populations (B6 and BTBR) that show diverse responses to a mutation in the leptin gene. Leptin is a protein hormone with important effects in regulating body weight, metabolism, and reproductive function (Zhang et al., 1994). A mutation in the leptin gene causes only mild and transient type II diabetes in B6 mice, but severe diabetes in BTBR mice. Microarray experiments have led to the identification of previously unappreciated genes that are differentially expressed between the populations (Lan et al., 2003a). To identify genetic modifiers and novel regulatory pathways, we have collected second-generation offspring from these populations. Each offspring has been genotyped at 145 markers across the genome and 45,265 expression traits have been obtained for each using Affymetrix chips.

It is clear that the experimental setup in an expression quantitative trait loci (eQTL) mapping study is similar in structure to a traditional quantitative trait loci (QTL) mapping study, but with thousands of phenotypes. The simplicity with which this difference can be stated obscures the resulting challenges posed for the statistical analysis of eQTL data. The statistical methods available for multi-trait QTL mapping consider relatively few traits and are not easily extended to the eQTL setting as they require estimation of a phenotype covariance matrix, which is not feasible for hundreds or thousands of traits (for a review of multiple-trait QTL methods, see Lund et al., 2003 and references therein).

© 2005, The International Biometric Society

To circumvent this, one could apply single-trait QTL mapping methods to reduced summaries of expression obtained, for example, via principal components analysis (Lan et al., 2003b). Doing so has proven useful; however, transcript-specific information is oftentimes of primary interest. When this is the case, simple tests (such as the Wilcoxon–Mann–Whitney) for linkage between each marker and transcript can be carried out with combinations identified as important if the resulting p-value is sufficiently small (Brem et al., 2002). Alternatively, interval mapping methods (see Broman, 2001 for a review) can be used to obtain transcript-specific significance profiles that are then calibrated via a common critical value intended to account for the potential increase in type I error induced by testing at multiple markers (Schadt et al., 2003).

As we show here, the repeated application of a transcript-specific linkage analysis has a number of serious flaws. Most notably, although adjustments are made for multiple tests across markers, few if any adjustments are made for multiple tests across transcripts. Furthermore, information common across transcripts is not utilized, which can lead to a loss in power. The use of a single, approximate, critical value for all transcripts is also problematic as the exact critical value for a given transcript depends not only on the number of transcripts and genomic locations tested (fixed for every data set), but also on the expression levels of that transcript. Using a common critical value further reduces power for some transcripts while increasing type I error for others. To address some of these issues, a marker-based approach can be used.

As a main goal of eQTL mapping is to identify transcripts and genomic locations that are significantly linked, instead of testing each transcript for significant linkage across the genome as described above, one could test each genome location for linked transcripts. At a given marker, this consists of identifying all transcripts with significant differences among phenotype groups where groups are determined by the marker's genotype. In this context, any method for identifying differentially expressed (DE) genes could be applied (for a review of methods, see Parmigiani et al., 2003). An advantage of this marker-based approach is that most methods to identify DE transcripts adjust for the multiple tests across transcripts. However, none of the methods currently used to identify DE genes would be applicable to the eQTL setting between markers where genotypes are unknown; and furthermore, although multiple tests across transcripts would be accounted for, multiple tests across markers would not be.

We have developed an approach that combines advantages from both the transcript- and marker-based methods. Our method maps eQTL by combining information across transcripts while controlling for the multiplicities induced by tests at transcripts and markers. The advantages are demonstrated and validated using simulated data as well as data from the diabetes study described above.

Section 2 describes in detail a transcript-based and two marker-based approaches. As discussed, none of the approaches properly accounts for multiplicities. An empirical Bayes hierarchical mixture over markers (MOM) model, which adjusts for relevant multiplicities, is introduced in Section 3. A simulation study is presented in Section 4, demonstrating that the MOM model controls the false discovery rate (FDR), without a substantial loss in power. The data set of interest is discussed in detail in Section 5 and is analyzed using all methods considered. Section 6 gives a discussion and outlines open questions in the analysis of eQTL data.

2. eQTL Mapping Methods

Consider for simplicity a backcross population from two inbred parental populations, P1 and P2, genotyped as 0 or 1 at M markers (this simplification to a backcross is not required and is relaxed in our simulations and analyses). For the kth animal, let $y_{t,k}$ denote the expression level for transcript t and $g_{m,k}$ denote the genotype at marker m; t = 1, 2, ..., T and k = 1, 2, ..., n. Of interest is the identification of significant linkages between transcripts and markers. To be precise, a transcript t is linked to marker m if $\mu_{t,0} \neq \mu_{t,1}$, where $\mu_{t,0}$ ($\mu_{t,1}$) denotes the latent mean level of expression of transcript t for the population of animals with genotype 0 (1) at marker m. Suppose observations $y_{t,k}$ have density $f_{obs}(y_{t,k} \mid \mu_{t,g_{m,k}}, \theta)$, where $\theta$ denotes any remaining unknown parameters. Under the null hypothesis of no linkage, the data are governed by

$$\prod_{k=1}^{n} f_{obs}(y_{t,k} \mid \mu_{t,0} = \mu_{t,1}, \theta);$$

and under the alternative,

$$\prod_{k=1}^{n} \{f_{obs}(y_{t,k} \mid \mu_{t,0}, \theta)\}^{1-g_{m,k}} \{f_{obs}(y_{t,k} \mid \mu_{t,1}, \theta)\}^{g_{m,k}}.$$

As discussed below, a main difference between the transcript-based (TB) and marker-based (MB) approaches arises from different assumptions regarding the latent means.

2.1 Transcript-Based Approach

A TB approach refers generally to the repeated application of any single-phenotype mapping method to each mRNA transcript, with locations identified as important if the test statistic of interest exceeds some critical value. The LOD score

$$\log_{10}\left\{\frac{\prod_{k=1}^{n} f_{obs}(y_{t,k} \mid \hat{\mu}_{t,0}, \hat{\mu}_{t,1}, \hat{\theta})}{\prod_{k=1}^{n} f_{obs}(y_{t,k} \mid \hat{\mu}, \hat{\theta})}\right\}$$

is often used as the statistic measuring evidence in favor of linkage, where $(\hat{\cdot})$ denotes the maximum likelihood estimate (MLE) of the associated parameter(s) and $\mu$ denotes the mean common across genotype groups (Lander and Botstein, 1989). Critical values can be obtained theoretically (Dupuis and Siegmund, 1999) or via permutations (Churchill and Doerge, 1994).
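As a rough illustration of the LOD score, the sketch below evaluates it for one transcript at one marker in a backcross. It assumes a Gaussian $f_{obs}$ with a single variance shared by the two genotype groups and estimated by maximum likelihood under each hypothesis; the function names `gaussian_loglik` and `lod_score` and the toy data are hypothetical and not taken from the paper.

```python
import math


def gaussian_loglik(values, mean, var):
    """Natural-log likelihood of `values` under N(mean, var)."""
    n = len(values)
    rss = sum((v - mean) ** 2 for v in values)
    return -0.5 * n * math.log(2 * math.pi * var) - rss / (2 * var)


def lod_score(y, g):
    """LOD for one transcript at one marker in a backcross (genotypes 0/1).

    Assumption: Gaussian f_obs with one variance shared by both genotype
    groups, estimated by maximum likelihood under each hypothesis.
    """
    n = len(y)
    y0 = [v for v, gk in zip(y, g) if gk == 0]
    y1 = [v for v, gk in zip(y, g) if gk == 1]
    mean_all = sum(y) / n
    m0, m1 = sum(y0) / len(y0), sum(y1) / len(y1)
    # MLEs of the variance under the null and under the alternative.
    var_null = sum((v - mean_all) ** 2 for v in y) / n
    var_alt = (sum((v - m0) ** 2 for v in y0)
               + sum((v - m1) ** 2 for v in y1)) / n
    ll_alt = gaussian_loglik(y0, m0, var_alt) + gaussian_loglik(y1, m1, var_alt)
    ll_null = gaussian_loglik(y, mean_all, var_null)
    # Convert the natural-log likelihood ratio to base 10.
    return (ll_alt - ll_null) / math.log(10)


y = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9]   # toy log expression values
g = [0, 0, 0, 1, 1, 1]               # toy genotypes at one marker
print(round(lod_score(y, g), 2))
```

Under these assumptions the statistic reduces to (n/2) log10 of the ratio of the null to the alternative residual sum of squares, which gives a convenient check on the code.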

The specific TB approach that will be evaluated here assumes a Gaussian density for $f_{obs}$ with tests performed at every marker and critical values determined theoretically by the formulas given in Dupuis and Siegmund (1999). This marker regression approach, referred to as TB-MR, is identical (at each marker) to that used by Schadt et al. (2003) to identify significantly linked expression traits in an F2 mouse cross.

2.2 Two Marker-Based Approaches

To identify transcripts significantly linked to genomic locations, instead of testing each transcript for significant linkage across markers, one could test at each marker for significant linkage across transcripts. This amounts to identifying DE transcripts at each marker, with groups determined by marker genotypes. The MB approach refers generally to the repeated application, at each marker, of any method for identifying DE transcripts. In this setting, a number of approaches could be used (for a review, see Parmigiani et al., 2003). We consider two: an empirical Bayes approach, EBarrays, described in detail in Kendziorski et al. (2003) and an approach based on the Student's t-test followed by p-value adjustment, similar to that proposed by Dudoit et al. (2002).

EBarrays assumes measurements $y_{t,k}$ arise as conditionally independent random deviations from an observation distribution $f_{obs}(\cdot \mid \mu_{t,\cdot}, \theta)$. Instead of treating the $\mu_{t,\cdot}$'s as fixed effects as in TB-MR, the underlying means are described by a distribution $\pi(\mu)$. In this case, an equivalently expressed (EE) transcript t presents data $y_t = (y_{t,1}, y_{t,2}, \ldots, y_{t,n})$ according to the distribution

$$f_0(y_t) = \int \left\{\prod_{k=1}^{n} f_{obs}(y_{t,k} \mid \mu)\right\} \pi(\mu)\, d\mu, \qquad (1)$$

where $\mu = \mu_{t,0} = \mu_{t,1}$.

For a DE transcript, let $y_t^l$ denote the set of observations for animals with genotype l = 0, 1. The data $y_t = (y_t^0, y_t^1)$ are governed by the distribution

$$f_1(y_t) = f_0(y_t^0)\, f_0(y_t^1), \qquad (2)$$

owing to the fact that different mean values, $\mu_{t,0}$ and $\mu_{t,1}$, govern the different subsets $y_t^0$ and $y_t^1$ of samples and are considered independent draws from $\pi(\mu)$. As a transcript's expression state is never known a priori, the marginal distribution of the data is given by $p f_1(y_t) + (1 - p) f_0(y_t)$, where p denotes the proportion of DE transcripts. With estimates of p, $f_0$, and thus $f_1$ obtained via the EM algorithm, the posterior probability of DE is calculated by Bayes' rule.

Although a number of parametric assumptions are available in EBarrays, for comparison with the TB-MR approach, here we also consider a Gaussian model on the log observations for $f_{obs}$ and a Gaussian model for $\pi$. Specifically, for a log-transformed expression measurement $y_{t,k}$,

$$y_{t,k} \sim N(\mu_{t,g_{m,k}}, \sigma^2) \quad \text{and} \quad \mu_{t,\cdot} \sim N(\mu_0, \tau_0^2). \qquad (3)$$
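Under the Gaussian model of equation (3), the integral in equation (1) has a closed form, so the posterior probability of DE at a single marker can be sketched directly. The code below is a minimal illustration that treats p, $\mu_0$, $\tau_0^2$, and $\sigma^2$ as known, whereas the text estimates them by EM; the names `log_marginal` and `posterior_de` are hypothetical and this is not the EBarrays implementation.

```python
import math


def log_marginal(y, mu0, tau2, sigma2):
    """log f_0(y) when y_k | mu ~ N(mu, sigma2) i.i.d. and mu ~ N(mu0, tau2).

    Marginally y is multivariate normal with mean mu0 and covariance
    sigma2 * I + tau2 * 1 1', whose determinant and inverse are explicit.
    """
    n = len(y)
    d = [v - mu0 for v in y]
    sum_d = sum(d)
    sum_d2 = sum(x * x for x in d)
    quad = (sum_d2 - tau2 / (sigma2 + n * tau2) * sum_d ** 2) / sigma2
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - 0.5 * math.log(1 + n * tau2 / sigma2)
            - 0.5 * quad)


def posterior_de(y0, y1, p, mu0, tau2, sigma2):
    """Posterior probability of DE from p*f_1 / {p*f_1 + (1 - p)*f_0}."""
    log_f0 = log_marginal(y0 + y1, mu0, tau2, sigma2)           # equation (1)
    log_f1 = (log_marginal(y0, mu0, tau2, sigma2)
              + log_marginal(y1, mu0, tau2, sigma2))            # equation (2)
    log_odds = math.log(p / (1 - p)) + log_f1 - log_f0
    # Numerically stable logistic transform of the log posterior odds.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)


# Toy values: two genotype groups with clearly different means.
print(posterior_de([1.0, 1.1, 0.9], [3.0, 3.1, 2.9],
                   p=0.1, mu0=2.0, tau2=1.0, sigma2=0.04))
```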

At a particular marker, a transcript is identified as significantly linked if the posterior probability of differential expression exceeds some threshold. These posterior probabilities have been referred to as "local FDRs" (Efron et al., 2001; Efron and Tibshirani, 2002; Efron, 2004) and it has been shown that to control the posterior expected FDR at $\alpha \cdot 100\%$, the appropriate threshold is the smallest posterior probability such that the average posterior probability of all transcripts exceeding the threshold is larger than $1 - \alpha$ (Efron, 2004; Newton et al., 2004). This marker-based empirical Bayes approach will be referred to as MB-EB.
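The thresholding rule just described can be sketched in a few lines: sort the posterior probabilities of DE in decreasing order and keep the longest leading run whose average stays above $1 - \alpha$. The function name `fdr_threshold` and the example values are hypothetical.

```python
def fdr_threshold(post_probs, alpha=0.05):
    """Return (threshold, n_called) bounding the posterior expected FDR.

    The threshold is the smallest posterior probability of DE such that
    the average over all transcripts at or above it exceeds 1 - alpha.
    Returns (None, 0) when no transcript can be called.
    """
    ranked = sorted(post_probs, reverse=True)
    running_sum = 0.0
    threshold, n_called = None, 0
    for i, prob in enumerate(ranked, start=1):
        running_sum += prob
        if running_sum / i > 1 - alpha:
            threshold, n_called = prob, i
    return threshold, n_called


probs = [0.99, 0.98, 0.97, 0.90, 0.60, 0.10]   # toy posterior probabilities
print(fdr_threshold(probs, alpha=0.05))
```

Because the running average of a decreasing sequence never increases, the last index that passes the check marks the largest admissible call list.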

The second MB approach consists of calculating Student's t-statistics at a marker and obtaining adjusted p-values. Dudoit et al. (2002) propose methods that control the family-wise error rate across transcripts. Here, we control the FDR using q-values (Storey and Tibshirani, 2003). In particular, to control the FDR at $\alpha$, transcripts with q-values $\leq \alpha$ are considered significant; MB-Q will denote this MB approach.

2.3 TB and MB Combined

To test transcript and marker combinations simultaneously, one could consider the p-value matrix obtained from calculating Student's t-statistics for every transcript at every marker, and calculate q-values for the entire matrix at once. The FDR can be controlled using q-values as described above. We refer to this approach as Q-ALL. This approach is justified provided the p-values are weakly dependent (Storey, Taylor, and Siegmund, 2004). Storey and Tibshirani (2003) hypothesize that weak dependence is the most likely form of dependence in genomewide studies such as the eQTL study of Brem et al. (2002).
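The difference between MB-Q and Q-ALL is only in which p-values are pooled before q-values are computed. The sketch below makes this concrete with a simplified q-value calculation in which the null proportion `pi0` is fixed at 1 (an assumption; Storey and Tibshirani estimate it from the data, and with `pi0 = 1` the result matches Benjamini-Hochberg adjusted p-values). The names `q_values` and `q_all` are hypothetical.

```python
def q_values(p_values, pi0=1.0):
    """Simplified q-values: pi0 * m * p_(i) / i, made monotone from the top.

    Assumption: pi0 (proportion of true nulls) is supplied, not estimated.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


def q_all(p_matrix, pi0=1.0):
    """Q-ALL style: pool every transcript-by-marker p-value, then reshape."""
    flat = [p for row in p_matrix for p in row]
    q_flat = q_values(flat, pi0)
    width = len(p_matrix[0])
    return [q_flat[r * width:(r + 1) * width] for r in range(len(p_matrix))]


# Rows are transcripts, columns are markers (toy p-values).
p_matrix = [[0.01, 0.50],
            [0.02, 0.03]]
print(q_all(p_matrix))                          # pooled over the whole matrix
print([q_values(list(col)) for col in zip(*p_matrix)])  # MB-Q: marker by marker
```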

3. Mixture over Markers Model

Although the TB and MB approaches described above are in many ways fundamentally different, they share an important flaw. Separate tests are conducted for each transcript-marker pair, and each measures evidence that the transcript maps to that marker relative to evidence that it maps nowhere. Since a transcript can map to any of many marker locations, the evidence that a transcript maps to a particular marker should not be judged relative only to the possibility that it maps nowhere, but rather relative to the possibility that it maps nowhere or to some other marker. This idea motivates the MOM model.

Suppose a transcript t maps nowhere with probability $p_0$ or to any marker m with probability $p_m$, where $\sum_{i=0}^{M} p_i = 1$ and M denotes the total number of markers. (In fact, this is only an approximation as the transcript could map in between markers. This possibility is discussed in Section 6.) The marginal distribution of the data $y_t$ is then given by

$$p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t), \qquad (4)$$

where $f_m$ describes the distribution of data if transcript t maps to marker m ($f_0$ describes the data for nonmapping transcripts). A density of the form given by equation (1) ((2)) describes the marginal distribution of data for nonmapping (mapping) transcripts. In the degenerate case of a single marker, equation (4) reduces to the mixture model given below equation (2) that forms the basis for MB-EB. For most eQTL mapping data sets, including the one discussed in Section 5, M is large (>100).
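Given the component densities and mixing proportions in equation (4), the posterior probability that a transcript maps nowhere or to each marker follows from Bayes' rule. The sketch below assumes the log densities $\log f_0(y_t), \log f_1(y_t), \ldots, \log f_M(y_t)$ have already been evaluated and that all $p_i > 0$; the function name `mom_posterior` and the numbers are hypothetical.

```python
import math


def mom_posterior(log_f, p):
    """Posterior over {nowhere, marker 1, ..., marker M} for one transcript.

    log_f[0] is log f_0(y_t); log_f[m] is log f_m(y_t) for marker m.
    p[i] are the mixing proportions of equation (4), assumed positive.
    """
    log_terms = [math.log(pi) + lf for pi, lf in zip(p, log_f)]
    top = max(log_terms)                       # subtract the max for stability
    weights = [math.exp(v - top) for v in log_terms]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example with M = 2 markers; the data favour marker 2.
post = mom_posterior(log_f=[-10.0, -10.0, -5.0], p=[0.90, 0.05, 0.05])
print([round(v, 3) for v in post])
```

Working on the log scale and subtracting the largest term avoids underflow when the likelihoods of many animals are multiplied together.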

Similar to MB-EB, a Gaussian model is assumed for $f_{obs}(\cdot)$ and for $\pi(\cdot)$. However, here we allow for the possibility that clusters of transcripts present data with different variances. Thus, $\sigma^2$ as in equation (3) is no longer constant, but is cluster dependent. Cluster membership is determined by K-means prior to model fitting. The total number of clusters is chosen by the Bayes Information Criterion (BIC). Model fit proceeds via EM (see details in Kendziorski et al., 2003). Multiple initial value configurations are used to check convergence. Diagnostics such as those described in Newton and Kendziorski (2003) should always be checked. For the moderately sized data set described in Section 5, parameter estimates were obtained via the EM algorithm implemented in R 1.9.1 (R Development Core Team, 2004). This took under 9 hours on a Dell Precision 650 (Xeon, 3 GHz) with 4 GB of memory. We found that 20 iterations were sufficient to reach convergence. We also found that results were robust to different initial cluster centers, but dependent on the lower bound for the number of clusters chosen via BIC. If too few clusters were chosen (fewer than the optimal predicted by BIC), model diagnostics were poor.

Once parameter estimates are obtained, posterior probabilities of mapping nowhere or to any of the M locations are calculated via Bayes' rule. A transcript is identified as DE using the MOM approach if the posterior probability of EE is smaller than some threshold, where thresholds are chosen to bound the posterior expected FDR at 5% as described in Section 2.2. Expression QTL for identified transcripts are those contained in the 90% highest posterior density (HPD) region (Carlin and Louis, 1998). With thousands of transcripts, posterior uncertainty regarding $\theta$ is generally very small (Kendziorski et al., 2003) and so the anti-conservative nature of the HPD intervals should be minimal.
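For a discrete set of markers, one way to read the 90% HPD region is as the smallest set of markers, taken in order of decreasing posterior probability, whose total mass reaches 0.90. The sketch below follows that reading and renormalizes the marker probabilities to sum to one, which is an assumption; the text does not say whether the probability of mapping nowhere is excluded first. The function name `hpd_markers` is hypothetical.

```python
def hpd_markers(marker_post, mass=0.90):
    """Indices of the smallest set of markers holding at least `mass`.

    marker_post[m] is the posterior probability of mapping to marker m.
    Assumption: probabilities are renormalized over markers only.
    """
    total = sum(marker_post)
    ranked = sorted(range(len(marker_post)),
                    key=lambda m: marker_post[m], reverse=True)
    region, cumulative = [], 0.0
    for m in ranked:
        region.append(m)
        cumulative += marker_post[m] / total
        if cumulative >= mass:
            break
    return sorted(region)


# Toy posterior over five markers (0-based indices).
print(hpd_markers([0.01, 0.70, 0.25, 0.03, 0.01]))
```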

4. Simulation Studies

To assess the performance of these approaches, we performed a small set of simulation studies. The simulations are in no way designed to capture the many complexities of eQTL data, but rather to provide some preliminary information on operating characteristics of the approaches in simple settings. Marker genotype data were obtained from chromosomes 2 and 3 of the F2 data described in Section 5. Chromosome 2 (3) contains 17 (6) markers with an average intermarker distance of 7.6 (17.7) cM. An eQTL at marker 5 on chromosome 2 was simulated; no eQTL is simulated on chromosome 3. Each transcript is simulated as either EE or in any one of four DE patterns (aa|Aa,AA; aa,Aa|AA; aa,AA|Aa; aa|Aa|AA) where "|" denotes inequality among the latent genotype group means. Pattern membership is determined by a multinomial where the expected proportion of transcripts in each pattern is specified at 3%, 3%, 1%, and 3%, respectively.
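The multinomial assignment of transcripts to expression patterns can be sketched as follows. The 90% share for EE is inferred here as the remainder after the four stated DE proportions; the names `PATTERNS`, `PROPORTIONS`, and `draw_patterns` are hypothetical.

```python
import random

# EE plus the four DE patterns; "|" separates genotype groups whose
# latent means differ. The EE proportion is the remainder, 1 - 0.10.
PATTERNS = ["EE", "aa|Aa,AA", "aa,Aa|AA", "aa,AA|Aa", "aa|Aa|AA"]
PROPORTIONS = [0.90, 0.03, 0.03, 0.01, 0.03]


def draw_patterns(n_transcripts, seed=0):
    """Draw one pattern label per transcript from the multinomial."""
    rng = random.Random(seed)
    return rng.choices(PATTERNS, weights=PROPORTIONS, k=n_transcripts)


labels = draw_patterns(10000, seed=1)
for pattern in PATTERNS:
    print(pattern, labels.count(pattern))
```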

Conditional on the mean pattern, simulated log intensities follow a Gaussian distribution. Since both the TB and MB approaches assume a log-normal distribution (for TB, the intensities are logged before analysis), this assumption does not bias the simulation in favor of any method. Rather than specify arbitrary means and variances for the simulation, we use values derived from the F2 data. Consider a single transcript t. Latent means for each genotype group are obtained by calculating sample averages within the groups. As the genotype groupings change at each marker, so too will these averages. To remedy this, the median value across markers within each genotype group specifies $\mu_{t,aa}$, $\mu_{t,Aa}$, and $\mu_{t,AA}$. This is done separately for each transcript. The differences between the aa and AA genotype groups are also considered. A length T vector $\delta$ is defined as the maximum of the differences across markers.

For one transcript t, the aa group mean is sampled from the vector $\mu_{\cdot,aa}$. If t is EE, the means in the heterozygous and homozygous AA group are set to the sampled value, $\mu_{s^*,aa}$. If t is in any DE pattern, a random sample, $\delta_{s^*}$, is taken from the upper quartile of the vector $\delta$. If aa|Aa for t, the heterozygous mean is defined to be $\mu_{s^*,aa} + \delta_{s^*}$. If t is in pattern aa|Aa|AA, the homozygous AA mean is $\mu_{s^*,aa} + 2 \times \delta_{s^*}$.

To set the variance for a transcript t, we use the posterior mean of $\sigma_t^2$, given by

$$\frac{\sum_{k=1}^{n} (y_{t,k} - \bar{y}_{t,\cdot})^2 + \nu_0 \sigma_0^2}{\nu_0 + n - 2}$$

(derived assuming the variance is distributed as scaled inverse chi-square, $\sigma_t^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$). Note that as $\nu_0 \to 0$, the posterior mean approaches $(n-1)s^2/(n-2) \approx s^2$, the transcript-specific sample variance, which is the naive estimate of any EE transcript variance under TB-MR assumptions. Data simulated with small $\nu_0$ are therefore consistent with assumptions made in TB-MR. As $\nu_0 \to \infty$, the posterior mean approaches a constant variance $\sigma_0^2$, which is assumed in MB-EB (note that this assumption implies a constant coefficient of variation on the raw gene expression scale). By varying $\nu_0$, operating characteristics can be evaluated without biasing the results in favor of one method. Data simulated by this empirical method have marginal distributions that are virtually indistinguishable from the observed data.
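The two limits of the posterior mean variance are easy to check numerically. The sketch below implements the formula as stated and evaluates it at the smallest and largest $\nu_0$ used in the simulations; the function name `posterior_mean_variance` and the toy data are hypothetical.

```python
def posterior_mean_variance(y, nu0, sigma0_sq):
    """Posterior mean of sigma_t^2: (SS + nu0 * sigma0^2) / (nu0 + n - 2)."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((v - ybar) ** 2 for v in y)      # equals (n - 1) * s^2
    return (ss + nu0 * sigma0_sq) / (nu0 + n - 2)


y = [1.0, 2.0, 3.0, 4.0]                      # toy data with (n - 1) s^2 = 5
print(posterior_mean_variance(y, nu0=5 ** -5, sigma0_sq=1.0))  # near 5 / 2
print(posterior_mean_variance(y, nu0=5 ** 5, sigma0_sq=1.0))   # near sigma0^2
```

Small $\nu_0$ lets the transcript's own sum of squares dominate, while large $\nu_0$ pulls every transcript toward the common value $\sigma_0^2$.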

Seven sets of simulations were obtained for $\nu_0$ between $5^{-5}$ and $5^{5}$. At each fixed $\nu_0$, the profile marginal MLE is obtained for $\sigma_0^2$. For each simulated data set, thresholds are chosen as described in Section 2 to control the type I error rate across the two simulated chromosomes at 5% for TB-MR (by the formulas in Dupuis and Siegmund, 1999, the critical value for the simulations is 2.57) and to control the FDR at 5% for MB-EB, MB-Q, and Q-ALL. The location of the maximum LOD (TB-MR), maximum posterior probability of DE (MB-EB), or minimum q-value (MB-Q and Q-ALL) for each transcript was recorded. Mapping transcripts are defined as those for which the evidence in favor of linkage at the location of the maximum (minimum) exceeds the threshold (or is smaller than the threshold in the case of MB-Q and Q-ALL). With multiple transcripts and putative linkage locations, the definition of power and FDR in an eQTL study is not obvious. By only considering the single, most likely location of mapping for each transcript as we have done here (given by maximum LOD, minimum q-value, etc.), the definitions are simplified.

Power measures the ability to identify the DE transcripts exactly at marker 5 or either of the flanking markers that are 16.5 cM and 5.8 cM away, respectively (this definition is motivated by that used in Broman and Speed, 2002, where an identification is deemed correct if it is made within a 20 cM window containing the true QTL; in that work, unlike here, the QTL was located in the center of the window). As shown in Figure 1 (left panel), there is little variation in power across $\nu_0$. MB-Q is the most powerful method, followed by TB-MR, Q-ALL, MOM, and MB-EB. Power-b only considers calls exactly at marker 5. Table 1 shows that there is only a slight decrease in power when the flanking markers are not considered. Although power is significantly different among some of the approaches at $\alpha = 5\%$, the magnitude of the differences is quite small. This is not the case for FDR.

FDR gives the proportion of transcripts, out of all that mapped to chromosome 2 or 3, that were not truly DE or that were DE but mapped to a region outside the flanking marker region. Figure 1 (right panel) shows that the MOM model is the only approach with well-controlled FDR over a variety of simulations (indexed by $\nu_0$). For the TB and MB methods, FDR is well over the target level of 0.05 for virtually all values of $\nu_0$. FDR for MB-EB is controlled at the target level of 5% only when $\nu_0$ is large. This is somewhat expected since as $\nu_0 \to \infty$, the simulation more closely approximates the assumptions made in MB-EB. Any increase in FDR due to repeated tests at markers in this case is minimal. Here, a false discovery can be made due to identification of EE genes or DE genes at nonflanking markers. Over two thirds of the false calls for each method are made from the former for every value of $\nu_0$ (results not shown). The number of false calls made on chromosome 3 (N-chr3) alone is also considered. As shown in Table 1, TB-MR identifies the most transcripts on chromosome 3.

Figure 1. For each value of $\nu_0$, 20 simulated data sets are generated (see Section 4). Operating characteristics are evaluated for each of the five methods on each data set. Table 1 reports the average performance at each value of $\nu_0$. Shown here are two operating characteristics, power (left panel) and FDR (right panel), along with the 95% pointwise confidence intervals. [Plot: power and FDR (y-axes) against $\nu_0$ from $5^{-5}$ to $5^{5}$ (x-axis), one curve each for TB-MR, MB-EB, MB-Q, Q-ALL, and MOM.]

Table 1
Average operating characteristics (OCs) for TB-MR, MB-EB, MB-Q, Q-ALL, and MOM

OC        Method   ν0=5^-5   ν0=5^-3   ν0=5^-1   ν0=5^0   ν0=5^1   ν0=5^3   ν0=5^5
Power     TB-MR    0.884     0.886     0.887     0.886    0.889    0.919    0.868
Power     MB-EB    0.820     0.817     0.815     0.823    0.833    0.895    0.837
Power     MB-Q     0.911     0.912     0.913     0.912    0.917    0.949    0.918
Power     Q-ALL    0.874     0.874     0.878     0.875    0.880    0.907    0.848
Power     MOM      0.848     0.851     0.853     0.850    0.856    0.860    0.811
Power-b   TB-MR    0.852     0.85      0.856     0.854    0.853    0.878    0.816
Power-b   MB-EB    0.807     0.803     0.804     0.811    0.818    0.881    0.818
Power-b   MB-Q     0.893     0.893     0.896     0.895    0.898    0.928    0.887
Power-b   Q-ALL    0.844     0.841     0.848     0.846    0.846    0.868    0.799
Power-b   MOM      0.848     0.85      0.852     0.85     0.856    0.86     0.811
FDR       TB-MR    0.286     0.286     0.293     0.285    0.286    0.28     0.301
FDR       MB-EB    0.282     0.281     0.285     0.279    0.269    0.117    0.034
FDR       MB-Q     0.24      0.246     0.246     0.24     0.245    0.23     0.226
FDR       Q-ALL    0.202     0.209     0.213     0.202    0.209    0.195    0.207
FDR       MOM      0.038     0.041     0.046     0.037    0.036    0.005    0.002
N-chr3    TB-MR    86.5      82.2      82.7      86.45    82.1     85.95    82.2
N-chr3    MB-EB    48.8      46        44.95     47.9     43.25    11.45    0.15
N-chr3    MB-Q     0.55      0.65      0.25      0.55     0.65     0.55     0.55
N-chr3    Q-ALL    51        49.55     50.2      50.95    49.6     49.65    42.9
N-chr3    MOM      3.75      4.15      4.3       4.1      3.25

Note: Averages are calculated over 20 data sets; standard errors were less than 0.005 for power, power-b, and FDR and less than 2 for N-chr3. Power measures the ability to identify DE transcripts at marker 5 or either of the flanking markers; Power-b considers calls exactly at marker 5. Other OC definitions and details of the simulation are given in the text (see Section 4).
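The definitions of power and FDR used in the simulations can be written down compactly once each transcript is reduced to its single most likely location. The sketch below assumes each call is either `None` (no linkage declared) or a `(chromosome, marker)` pair, and that the true eQTL sits at marker 5 on chromosome 2 with flanking markers 4 and 6; the function name `power_and_fdr` and the data layout are hypothetical.

```python
def power_and_fdr(is_de, calls, chrom="2", window=(4, 5, 6)):
    """Power and FDR under the single-best-location definitions.

    is_de[t]  : True if transcript t was simulated as DE.
    calls[t]  : None, or (chromosome, marker) of the declared location.
    A call is correct only if the transcript is DE and the location is
    the eQTL marker or one of its flanking markers.
    """
    correct = 0
    n_called = 0
    for truth, call in zip(is_de, calls):
        if call is None:
            continue
        n_called += 1
        if truth and call[0] == chrom and call[1] in window:
            correct += 1
    n_de = sum(1 for truth in is_de if truth)
    power = correct / n_de if n_de else float("nan")
    fdr = (n_called - correct) / n_called if n_called else 0.0
    return power, fdr


is_de = [True, True, True, False, False]
calls = [("2", 5), ("2", 4), ("3", 1), ("2", 5), None]
print(power_and_fdr(is_de, calls))
```

Passing `window=(5,)` gives the stricter Power-b variant, which counts only calls made exactly at marker 5.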

These results suggest that it is difficult in most cases to control FDR using a simple application of a TB or MB approach. This is because the TB-MR approach considers each transcript in isolation, controlling a type I error rate across markers, with no control for multiple tests across transcripts. MB-EB and MB-Q share information across transcripts to control an expected FDR at each marker, but do not account for tests at multiple markers. When model assumptions do not hold, MB-EB performs poorly. The MOM approach addresses these deficiencies. It allows for information sharing across transcripts while controlling for multiplicities across both transcripts and markers; and, as a result, much improved FDR control is observed.

5. eQTL Data Analysis

The ob mutation in the C57BL/6J mouse background (B6-ob/ob) causes obesity, but only mild and transient diabetes (Coleman and Hummel, 1973). In contrast, the same mutation in the BTBR genetic background (BTBR-ob/ob) causes severe type II diabetes (Stoehr et al., 2000). A (B6 × BTBR) F2 cross was generated yielding 110 animals. Selective phenotyping (Jin et al., 2004) was employed to identify 60 F2 ob/ob mice. For each of the 60 mice, pancreatic islets were isolated and 45,265 mRNA abundance traits were collected at 10 weeks of age using Affymetrix Gene Chips (MOE430A, B). The probe level data were processed using Robust Multi-array Average (RMA) to give a single, normalized, background-corrected summary score of expression for each transcript (Irizarry et al., 2003). Low abundance transcripts, defined as transcripts with average expression level below the tenth percentile, were removed leaving 40,738 traits. Genotypes for 145 markers were also obtained (over 90% of the animals provided genotype data at any given marker).
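The low-abundance filter amounts to computing one average per transcript and dropping those below the first decile. The sketch below shows one way to do this; the exact percentile convention used in the study is not stated, so the default interpolation of Python's `statistics.quantiles` is an assumption, as are the function name `drop_low_abundance` and the toy data.

```python
import statistics


def drop_low_abundance(expr):
    """Keep transcripts whose mean expression is at or above the 10th percentile.

    expr maps a transcript name to its list of expression values.
    Assumption: the percentile is the first decile cut point returned by
    statistics.quantiles with its default ("exclusive") method.
    """
    means = {name: sum(vals) / len(vals) for name, vals in expr.items()}
    cutoff = statistics.quantiles(means.values(), n=10)[0]
    return {name: vals for name, vals in expr.items() if means[name] >= cutoff}


# Ten toy transcripts with mean expression 1, 2, ..., 10.
expr = {f"t{i}": [float(i), float(i)] for i in range(1, 11)}
kept = drop_low_abundance(expr)
print(sorted(kept, key=lambda name: int(name[1:])))
```

Removing the lowest tenth is consistent with the reported counts, since 90% of 45,265 is approximately 40,738.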

The TB-MR, MB-EB, MB-Q, Q-ALL, and MOM methods were each applied to the F2 data. As in the simulation study, the location of the maximum LOD (TB-MR), maximum posterior probability of DE (MB-EB and MOM), or minimum q-value (MB-Q and Q-ALL) for each transcript was recorded. Mapping transcripts are defined as those for which the evidence in favor of linkage at the location of the maximum (minimum) exceeds the threshold (or is smaller than the threshold in the case of MB-Q and Q-ALL). For TB-MR, the threshold is 3.5 as determined by Dupuis and Siegmund (1999). To control the FDR at 5% with MB-Q or Q-ALL, q-values smaller than 0.05 are deemed significant (Storey and Tibshirani, 2003). For MB-EB and MOM, the threshold is chosen to control the FDR at 5% as described in Section 2.2.

The approaches named above identified 3689, 4083, 1913, 652, and 3039 transcripts, respectively, that map to at least one location. The most similarity was between MB-Q and Q-ALL with 92% of the Q-ALL transcripts also identified by MB-Q; MB-EB and MOM followed with 84% of the MOM transcripts also identified by MB-EB; the least similarity was between MB-EB and TB-MR with 23% of the TB-MR transcripts identified by MB-EB. A main reason for these differences is shown in Figure 2. The sample standard deviations of transcripts identified as DE by TB-MR and MB-Q are relatively small compared to those identified by the Bayes approaches, MB-EB and MOM. This is perhaps expected, considering the Bayes approaches share information across transcripts to estimate variance; Q-ALL, which uses transcript-specific p-values but considers the entire p-value distribution for assigning significance, falls between these extremes. Another reason for differences is the imposition of strict thresholds designed to control different error rates across methods. When instead considering average evidence given by each approach, there is increased agreement among the methods in terms of genome regions identified.

Figure 2. Sample standard deviations of transcripts identified as DE by each of the five methods. Sample means were very similar across methods (not shown). [Plot: density (y-axis) of the standard deviation of log(intensity) (x-axis), one curve each for TB-MR, MB-EB, MB-Q, Q-ALL, and MOM.]

Figure 3 identifies regions of enhanced linkage (hot spots) for each method, as measured by average evidence in favor of linkage (average is taken across all transcripts). The hot spot D2Mit241 is adjacent to D2Mit9, which has recently been identified as an obesity-modifier locus (Stoehr et al., 2004). Two additional regions identified by at least four of the five methods (on chromosomes 4 and 10) are not yet known to be involved in diabetes although we note that the region identified on chromosome 4 has been implicated in other analyses done in the Attie lab. The two regions identified by MOM alone on chromosomes 5 and 8 have been identified by other groups in earlier studies: D5Mit1 is a location known to affect triglyceride levels (Colinayo et al., 2003) and D8Mit249 is the marker on our map closest to the "fat" gene which is known to affect both diabetes and obesity (Naggert et al., 1995).

Figure 3. Evidence of linkage for each approach (LOD for TB-MR, posterior probability for MB-EB and MOM, and 1 − (q-value) for MB-Q and Q-ALL) averaged over transcripts and normalized by the sum of the evidence over all markers. The five markers with the strongest evidence of mapping transcripts are indicated by triangles for each method. Triangles represent (from top to bottom) TB-MR, MB-EB, MB-Q, Q-ALL, and MOM. D4Mit237 is among the top five markers for each method; D2Mit241 and D10Mit20 are identified by TB-MR, MB-Q, Q-ALL, and MOM. Note that although hot spot regions are identified in common across approaches, the lists of transcripts mapping to these regions are largely different. [Plot: normalized average evidence (y-axis) at each marker along the genome (x-axis).]

6. Discussion

With the advent of microarrays, it is now relevant to consider the QTL mapping problem with thousands of expression traits simultaneously. We have demonstrated that novel applications of existing methods for traditional QTL mapping or microarray studies do not perform well. In particular, a repeated application of standard QTL methods to each transcript results in inflated FDR. A similar inflation is observed if methods for identifying DE transcripts are repeatedly applied at every marker. Much of the inflated FDR results from not correcting for multiple tests across transcripts in the former case and across markers in the latter.

The Q-ALL approach, which combines tests across markers and transcripts simultaneously, is perhaps better justified. This approach is valid provided the p-values are weakly dependent (Storey et al., 2004); and Storey and Tibshirani (2003) hypothesize that weak dependence is the most likely form of dependence in genomewide studies such as the eQTL study of Brem et al. (2002). This hypothesis remains to be verified. We found that Q-ALL did not control the FDR at the target level in our simulations. This could be due to the fact that the simulation induces dependence that does not satisfy the assumption of weak dependence or that the p-values calculated from Student's t-test are not accurate. We find little evidence for the latter in our simulation set up. Further consideration of these issues is warranted.
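The contrast between uncorrected repeated testing and a correction applied jointly over every transcript-by-marker test can be illustrated with a small sketch. Two assumptions are made here that are not part of this paper: the Benjamini-Hochberg adjustment is used as a simpler stand-in for the q-value, and the p-values are drawn independently under a global null (the sizes are hypothetical), so the example says nothing about the dependence issue raised above.

```python
import random


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


random.seed(0)
n_transcripts, n_markers = 200, 20  # hypothetical sizes
# One p-value per transcript-marker pair; every null hypothesis is true.
pvals = [random.random() for _ in range(n_transcripts * n_markers)]

naive_hits = sum(p < 0.05 for p in pvals)              # no correction
pooled_hits = sum(q < 0.05 for q in bh_adjust(pvals))  # one joint correction
print(naive_hits, pooled_hits)
```

Every call counted in `naive_hits` is a false discovery by construction, which is the inflation described above; the jointly adjusted values can never produce more calls than the raw p-values at the same cutoff.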

To address the eQTL mapping problem, we propose a MOM model that shares information across markers and transcripts. The general method is flexible in specification of component densities and different forms will be appropriate for different data sets. Diagnostics such as those prescribed in Newton and Kendziorski (2003) should always be checked. To facilitate comparisons to Gaussian-based methods, we here considered component densities obtained from log-normal–normal hierarchical models. Simulations demonstrate that FDR is well controlled, without a sacrifice in power.
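For readers who want to see how posterior probabilities arise from a mixture over markers, the following sketch applies Bayes' rule to a generic finite mixture with one "maps nowhere" component and one component per marker. It assumes the component log densities and mixing proportions are already available; the function name, argument names, and numbers are hypothetical, and the component densities here are not the log-normal–normal forms used in this paper.

```python
import math


def mixture_posteriors(log_f0, log_fm, p0, pm):
    """Posterior component probabilities for one transcript.

    log_f0 : log density of the data under the 'maps nowhere' component
    log_fm : list of log densities, one per marker-specific component
    p0, pm : strictly positive mixing proportions with p0 + sum(pm) == 1
    Returns (posterior for 'maps nowhere', list of per-marker posteriors).
    """
    if p0 <= 0 or any(p <= 0 for p in pm):
        raise ValueError("mixing proportions must be positive")
    if abs(p0 + sum(pm) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    logs = [math.log(p0) + log_f0]
    logs += [math.log(p) + lf for p, lf in zip(pm, log_fm)]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    post = [w / total for w in weights]
    return post[0], post[1:]


# Hypothetical values: the data favour the second of three markers.
post0, post_m = mixture_posteriors(-10.0, [-10.0, -6.0, -10.0],
                                   0.7, [0.1, 0.1, 0.1])
print(post0, post_m)
```

Because all components enter a single normalizing sum, evidence for one marker is automatically weighed against every other marker and against the no-linkage component.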

The conditions under which data are simulated are always questionable, and particularly so here as the methods compared vary considerably in underlying assumptions. To evaluate these approaches without biasing the results in favor of any one method, we have proposed a simulation framework that allows for evaluation of Bayesian-based methods that share information across units of interest (here, transcripts and markers) as well as those that do not. The framework is in no way designed to capture the many complexities of eQTL data, but it does provide some useful information regarding operating characteristics, and will serve as the basis for the development of more realistic simulation settings.

In addition to simulations, the methods were also compared based on results from a (B6 × BTBR) F2 mouse cross in a study of diabetes. A number of differences were observed. Most notably, TB-MR and MB-Q identify traits with relatively small standard deviations. This type of behavior motivated the Bayes approach considered here, as information across transcripts can be shared to better estimate a transcript-specific variance and help prevent spurious identifications; other Bayesian approaches in the context of microarray studies are similarly motivated (Baldi and Long, 2001; Newton et al., 2001; Tusher, Tibshirani, and Chu, 2001; Lonnstedt and Speed, 2002; Kendziorski et al., 2003).
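The effect of sharing information to estimate a transcript-specific variance can be sketched with a simple weighted average of a transcript's own sample variance and a value pooled across transcripts. This is an illustration of the general idea only, in the spirit of the regularized approaches cited above; it is not the hierarchical model used by MOM or MB-EB, and the prior weight and pooled value below are hypothetical choices.

```python
def moderated_variance(sample_var, df, prior_var, prior_df):
    """Weighted average of a transcript's variance and a shared prior value."""
    return (prior_df * prior_var + df * sample_var) / (prior_df + df)


# Hypothetical sample variances for five transcripts, each with 10 d.f.
sample_vars = [0.01, 0.20, 0.25, 0.30, 0.49]
pooled = sum(sample_vars) / len(sample_vars)  # simple pooled value (assumption)

for s2 in sample_vars:
    shrunk = moderated_variance(s2, df=10, prior_var=pooled, prior_df=10)
    print(s2, round(shrunk, 4))
```

A transcript with an unusually small sample variance is pulled toward the pooled value, so a modest mean difference no longer yields an extreme test statistic; this is the mechanism by which shared variance estimates help prevent spurious identifications.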

Figure 3 shows that in spite of these differences, there are regions of enhanced linkage identified in common among the approaches. These hot spot regions provide support for each approach to some extent and are of most interest to a biologist. The first region we considered is adjacent to one recently identified as an obesity-modifier locus (Stoehr et al., 2004). Two other identified regions are not yet known to be involved in diabetes, but are of particular interest considering they are identified by at least four of the five methods considered here. Of more interest to those evaluating these approaches are regions that are not identified in common across methods.
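The hot spot summary used in Figure 3 (evidence averaged over transcripts at each marker, then normalized by the sum over all markers, with the five strongest markers flagged) can be sketched as follows. The marker names and evidence values are hypothetical, and the evidence scale (LOD, posterior probability, or 1 − q-value) is left to the caller.

```python
def normalized_average_evidence(evidence):
    """evidence: dict mapping marker -> list of per-transcript evidence values."""
    averages = {m: sum(v) / len(v) for m, v in evidence.items()}
    total = sum(averages.values())
    return {m: a / total for m, a in averages.items()}


def top_markers(normalized, k=5):
    """Markers with the k largest normalized average evidence values."""
    return sorted(normalized, key=normalized.get, reverse=True)[:k]


# Hypothetical evidence for three markers across two transcripts.
evidence = {"M1": [0.1, 0.3], "M2": [0.8, 0.6], "M3": [0.2, 0.2]}
normalized = normalized_average_evidence(evidence)
print(normalized)
print(top_markers(normalized, k=1))
```

Normalizing by the sum over markers puts the different evidence scales on a common footing, which is what allows the five methods to be overlaid on one plot.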

In particular, there are two regions identified by MOM alone. We have yet to confirm that these regions of enhanced linkage are real. However, we are encouraged by the results for two reasons. The first is that these regions have been identified by other groups in earlier studies: D5Mit1 is a marker linked to triglyceride levels (Colinayo et al., 2003) and D8Mit249 is the marker on our map closest to the "fat" gene, which is known to affect both diabetes and obesity (Naggert et al., 1995). The second reason is that there is good evidence that MOM may be better able to identify the types of transcripts mapping to hot spot regions (so called trans transcripts). Basically, transcripts can be labeled as cis or trans; trans transcripts are transcripts in which the expression is regulated by genes perhaps distant from the physical location of the transcript. They generally have higher variability compared to cis transcripts, transcripts that are self-regulated. The vast majority of traits mapping to a hot spot region are not physically located at the region; by definition, these traits are trans traits. As shown in Figure 2, MOM generally identifies transcripts with larger variability; most likely, these are trans traits. A close evaluation of these hot spot regions is underway.

In summary, eQTL mapping promises to be among the most statistically challenging problems involving microarray data; and the methods developed for the design and analysis of traditional QTL mapping or microarray studies will not directly apply. The question of selecting the most informative subjects to be phenotyped has been addressed (Jin et al., 2004), but most design and analysis questions for eQTL studies remain open. We have here considered a central problem in the analysis of eQTL data: that of identifying the collection of mapping transcripts and the genome locations to which they map. We have shown that novel applications of some existing methodologies do not fare well and have proposed an alternative approach, the MOM model. The MOM model should prove useful in improving the specificity of eQTL identifications. Specifically, by considering one full model for the data, multiple tests across markers and transcripts are accounted for and FDR can be controlled without a sacrifice in power. Two regions identified by MOM alone are known to be involved in diabetes, providing further support for this approach. Additional validation studies are required.

[Figure 4 plot not reproduced: y-axis: Posterior Probability of DE (0.00 to 0.10); x-axis: chromosome 2 markers D2Mit2, D2Mit296, D2Mit241, D2Mit327, D2Mit249, D2Mit274, D2Mit17, D2Mit106, D2Mit194, D2Mit263, D2Mit49, D2Mit229, D2Mit148.]

Figure 4. Simulation results from 5000 simulated transcripts; 1000 have expression levels determined by 2 QTL (triangles). QTL genotypes were defined by the marker genotypes at the QTL locations. These markers were removed from the analysis to simulate QTL in between markers; the QTLs are not interacting. Intermarker distance surrounding the first QTL is 22.3 cM with the QTL 16.5 cM from D2Mit241. There are 5.5 cM surrounding the second QTL, which is 3.0 cM from D2Mit263. The estimated proportion of DE transcripts is given on the y-axis. As shown, posterior probabilities of DE are highest at the markers nearest the QTL.

The question of the best way to find multiple eQTL remains open. We here use HPD regions to identify the most likely locations to which mapping transcripts are linked. Figure 4 suggests that this approach may be useful for identifying multiple loci, even when the loci lie between markers. In some cases, markers closest to the loci will have the highest posterior probability of DE and, in this way, interesting regions will be identified using the MOM model. The precise conditions under which this is the case remain to be identified. Explicit consideration of a multiple loci model should certainly improve upon the MOM model, particularly when multiple eQTL are interacting. Interval mapping in the context of the MOM model should also prove useful, as identified genome regions are often large. Finally, a substantial benefit is expected by incorporation of sequence and other available information. In the context of the MOM model, information regarding the physical location of transcription factors could inform priors on the mixing proportions while functional categories could be used to more appropriately identify gene clusters, thereby improving model accuracy, power, and eQTL identification.
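One common way to form a highest posterior density (HPD) region over a discrete set of markers is to add markers in decreasing order of posterior probability until a target mass is reached. The sketch below follows that construction; it is an assumption about how such a region might be built rather than a statement of the exact procedure used here, and the marker names, probabilities, and level are hypothetical.

```python
def hpd_markers(posterior, level=0.95):
    """Smallest set of markers whose renormalized posterior mass reaches `level`.

    posterior: dict mapping marker -> posterior probability of linkage for
    one mapping transcript; values are renormalized to sum to 1 over markers.
    """
    total = sum(posterior.values())
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    region, mass = [], 0.0
    for marker, prob in ranked:
        region.append(marker)
        mass += prob / total
        if mass >= level:
            break
    return region


# Hypothetical posterior probabilities over four markers.
post = {"A": 0.05, "B": 0.60, "C": 0.30, "D": 0.05}
print(hpd_markers(post, level=0.8))
```

With two well-separated loci, the resulting set need not be contiguous along the chromosome, which is one reason such regions may help flag multiple loci.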

Acknowledgements

This work was supported in part by HHMI 133-ES29 to C.M.K. as well as NIDDK 58037 and NIDDK 66369 to A.D.A. The authors wish to thank Geoff MacLachlan, Michael Newton, Brian Yandell, and Ping Wang for useful discussions. We also thank Ping Wang for simulation studies of the MOM model, not shown here.

References

Baldi, P. and Long, A. D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509–519.

Brem, R. B., Yvert, G., Clinton, R., and Kruglyak, L. (2002). Genetic dissection of transcriptional regulation in budding yeast. Science 296, 752–755.

Broman, K. W. (2001). Review of statistical methods for QTL mapping in experimental crosses. Laboratory Animal 30, 44–52.

Broman, K. W. and Speed, T. P. (2002). A model selection approach for the identification of quantitative trait loci in experimental crosses (with discussion). Journal of the Royal Statistical Society, Series B 64, 641–656 and 737–775 (discussion).

Carlin, B. and Louis, T. (1998). Bayes and Empirical Bayes Methods for Data Analysis. New York: Chapman & Hall.

Churchill, G. A. and Doerge, R. W. (1994). Empirical threshold values for quantitative trait mapping. Genetics 138, 963–971.

Coleman, D. L. and Hummel, K. P. (1973). The influence of genetic background on the expression of the obese (Ob) gene in the mouse. Diabetologia 9, 287–293.

Colinayo, V. V., Qiao, J. H., Wang, X., Krass, K., Schadt, E., Lusis, A. J., and Drake, T. A. (2003). Genetic loci for diet-induced atherosclerotic lesions and plasma lipids in mice. Mammalian Genome 14, 464–471.

Cox, N. J. (2004). An expression of interest. Nature 12, 733–734.

Dudoit, S., Yang, Y. H., Speed, T. P., and Callow, M. J. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111–139.

Dupuis, J. and Siegmund, D. (1999). Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151, 373–386.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 99, 96–104.

Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 1, 70–86.

Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, 1151–1160.

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.

Jin, C., Lan, H., Attie, A. D., Bulutuglo, D., Churchill, G. A., and Yandell, B. S. (2004). Selective phenotyping for increased efficiency in genetic mapping studies. Genetics 168, 2285–2293.

Kendziorski, C. M., Newton, M. A., Lan, H., and Gould, M. N. (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 22, 3899–3914.

Lan, H., Rabaglia, M. E., Stoehr, J. P., Nadler, S. T., Schueler, K. L., Zou, F., Yandell, B. S., and Attie, A. D. (2003a). Gene expression profiles of non-diabetic and diabetic obese mice suggest a role of hepatic lipogenic capacity in diabetes susceptibility. Diabetes 52, 688–700.

Lan, H., Stoehr, J. P., Nadler, S. T., Schueler, K. L., Yandell, B. S., and Attie, A. D. (2003b). Dimension reduction for mapping mRNA abundance as quantitative traits. Genetics 164, 1607–1614.

Lander, E. S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199.

Lonnstedt, I. and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31–46.

Lund, M. S., Sorenson, P., Guldbrandtsen, B., and Sorensen, D. A. (2003). Multitrait fine mapping of quantitative trait loci using combined linkage disequilibria and linkage analysis. Genetics 163, 405–410.

Naggert, J. K., Fricker, L. D., Varlamov, O., Nishina, P. M., Rouille, Y., Steiner, D. F., Carroll, R. J., Paigen, B. J., and Leiter, E. H. (1995). Hyperproinsulinaemia in obese fat/fat mice associated with a carboxypeptidase E mutation which reduces enzyme activity. Nature Genetics 10, 135–142.

Newton, M. A. and Kendziorski, C. M. (2003). Parametric empirical Bayes methods for microarrays. In The Analysis of Gene Expression Data: Methods and Software, G. Parmigiani, E. S. Garrett, R. Irizarry, and S. L. Zeger (eds), 254–271. New York: Springer Verlag.

Newton, M. A., Kendziorski, C. M., Richmond, C. S., Blattner, F. R., and Tsui, K. W. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8, 37–52.

Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155–176.

Parmigiani, G., Garrett, E. S., Irizarry, R., and Zeger, S. L. (2003). The Analysis of Gene Expression Data: Methods and Software. New York: Springer Verlag.

R Development Core Team. (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Schadt, E., Monks, S., Drake, T. A., et al. (2003). Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297–302.

Stoehr, J. P., Nadler, S. T., Schueler, K. L., Rabaglia, M. E., Yandell, B. S., Metz, S. A., and Attie, A. D. (2000). Genetic obesity unmasks nonlinear interactions between murine type 2 diabetes susceptibility loci. Diabetes 49, 1946–1954.

Stoehr, J. P., Byers, J. E., Clee, S. M., Lan, H., Boronenkov, I. V., Schueler, K. L., Yandell, B. S., and Attie, A. D. (2004). Identification of major quantitative trait loci controlling body weight variation in ob/ob mice. Diabetes 53, 245–249.

Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics 31, 2013–2035.

Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences USA 100, 9440–9445.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society, Series B 66, 187–205.

Tusher, V., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences USA 98, 5116–5121.

Yvert, G., Brem, R. B., Whittle, J., Akey, J. M., Foss, E., Smith, E. N., Mackelprang, R., and Kruglyak, L. (2003). Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nature Genetics 35, 57–64.

Zhang, Y., Proenca, R., Maffei, M., Barone, M., Leopold, L., and Friedman, J. M. (1994). Positional cloning of the mouse obese gene and its human homologue. Nature 372, 425–431.

Received October 2004. Revised June 2005. Accepted June 2005.
