TOP Strand and Forward Strand

Deng Fei    2020-12-29

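Recently a student asked me about the TOP strand and the forward strand in Illumina array data, and how they differ from refSNP. I looked up some material and summarized it below. If anything is wrong, please point it out in the comments; I am learning this as I go.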

Reference:

https://gengen.openbioinformatics.org/en/latest/tutorial/coding/#introduction

Here are the allele coding formats used for array data:

  • Forward allele
  • A/B allele
  • TOP/BOT allele

Forward allele

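This is the allele that most people should work with. It is the allele on the forward strand of the reference genome. Note that different versions of the reference genome can differ at the positions of common SNPs: for example, if GRCh37 carries the minor allele of a SNP on the reference strand, GRCh38 tends to change it back to the major allele. Ideally, therefore, a forward allele should always be reported together with its genome build so that the SNP is identified unambiguously.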

The forward and reverse strand designations come from the dbSNP database.

Illumina's A/B allele coding

This is Illumina's own scheme. Illumina's A/B allele coding, together with its TOP/BOT strand definition, is explained in detail by Illumina. The coding solves the problem described above: the alleles do not depend on a specific genome assembly but on the actual polymorphism itself.

Briefly, if one of the two alleles is A or T and the other is C or G, then the A or T is called allele A and the C or G is called allele B. The strand is TOP when allele A is an A, and BOT when allele A is a T.
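The rule for these unambiguous SNPs can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not Illumina's own code; the function name `ab_for_unambiguous_snp` is made up for this example.

```python
AT = {"A", "T"}
CG = {"C", "G"}


def ab_for_unambiguous_snp(allele1, allele2):
    """Return (strand, allele_a, allele_b) for a SNP with one A/T allele
    and one C/G allele. Hypothetical helper for illustration only."""
    alleles = {allele1.upper(), allele2.upper()}
    at = alleles & AT
    cg = alleles & CG
    if len(at) != 1 or len(cg) != 1:
        # A/T and C/G SNPs (or invalid input) cannot be resolved from the
        # alleles alone; they need the flanking sequence.
        raise ValueError("not an unambiguous SNP: %s/%s" % (allele1, allele2))
    allele_a = at.pop()   # the A-or-T allele is allele A
    allele_b = cg.pop()   # the C-or-G allele is allele B
    strand = "TOP" if allele_a == "A" else "BOT"
    return strand, allele_a, allele_b


print(ab_for_unambiguous_snp("T", "C"))
print(ab_for_unambiguous_snp("G", "A"))
```

The order in which the two alleles are passed does not matter, because the A/B assignment depends only on which allele is A or T.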

Sometimes 1/2 is used in place of A/B, since numeric coding is more convenient in many scenarios and some older association software only recognizes numerically coded alleles.
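As a small illustration, an A/B genotype string can be recoded to 1/2 like this. The function name `ab_to_numeric` and the use of "0" for a missing call are assumptions made for this sketch, not something defined by Illumina or by the tutorial.

```python
def ab_to_numeric(genotype, missing="0"):
    """Recode an A/B genotype such as "AB" to "1 2".

    Any character other than A or B is written as `missing`
    (an assumed convention for this example).
    """
    code = {"A": "1", "B": "2"}
    return " ".join(code.get(ch, missing) for ch in genotype.upper())


for gt in ["AA", "AB", "BB", "--"]:
    print(gt, "->", ab_to_numeric(gt))
```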

TOP/BOT allele

If one of the two alleles is A or T and the other is C or G, the strand is named after the A-or-T allele: a strand that carries A at the SNP is the TOP strand, and a strand that carries T is the BOT strand.

Note:

If the polymorphism is A/T or C/G, walk outward through the surrounding sequence, comparing the nucleotide upstream of the SNP with the nucleotide the same distance downstream, until you find an unambiguous pair (one of A or T, the other C or G). Then apply a similar rule: if the A or T of that pair is on the 5' side of the SNP, the strand is TOP; otherwise it is BOT. On the TOP strand, alleles A and B denote A and T (or C and G), respectively; on the BOT strand, alleles A and B denote T and A (or G and C), respectively.
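The sequence-walking rule above can be sketched as follows. This is my own minimal reading of the rule, assuming both flanks are given 5' to 3' on the same strand; the names `strand_by_walking` and `ab_for_ambiguous_snp` are hypothetical.

```python
AT = set("AT")
CG = set("CG")


def strand_by_walking(upstream, downstream):
    """Decide TOP/BOT for an A/T or C/G SNP from its flanking sequence.

    upstream   -- bases 5' of the SNP, ending right before the SNP
    downstream -- bases 3' of the SNP, starting right after the SNP
    """
    up, down = upstream.upper(), downstream.upper()
    for n in range(1, min(len(up), len(down)) + 1):
        five, three = up[-n], down[n - 1]   # the pair n bases away from the SNP
        if five in AT and three in CG:
            return "TOP"                    # A or T sits on the 5' side
        if five in CG and three in AT:
            return "BOT"                    # A or T sits on the 3' side
        # otherwise the pair is itself ambiguous: keep walking outward
    raise ValueError("no unambiguous pair found in the given flanks")


def ab_for_ambiguous_snp(alleles, strand):
    """Return (allele_a, allele_b) for an A/T or C/G SNP on the given strand."""
    table = {
        ("TOP", frozenset("AT")): ("A", "T"),
        ("TOP", frozenset("CG")): ("C", "G"),
        ("BOT", frozenset("AT")): ("T", "A"),
        ("BOT", frozenset("CG")): ("G", "C"),
    }
    return table[(strand, frozenset(alleles.upper()))]


strand = strand_by_walking("CCAA", "TGGG")  # first pair A/T is ambiguous, second A/G is not
print(strand, ab_for_ambiguous_snp("AT", strand))
```

In the usage line, the first pair (A upstream, T downstream) is still ambiguous, so the walk moves one base further out, finds A upstream and G downstream, and reports TOP.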

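Illumina's coding scheme does not depend on the definition of the forward strand (and hence on a correct genome assembly), so it almost always stays consistent across genome builds and allows alleles to be designated immediately for newly sequenced or unassembled genomes.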

In addition, when exporting genotypes from the Illumina BeadStudio software, the user can choose AB genotypes, ACGT genotypes (commonly referred to as "TOP alleles"), or, in newer versions of the software, forward-strand genotypes. A TOP allele is the allele on the TOP strand, which may or may not be the forward strand. In the reference tutorial's example, the annotation "fwd/B" means that dbSNP's forward strand corresponds to Illumina's BOT strand, so the TOP allele is the complement of the forward-strand allele. The tutorial points out that many users take for granted that "TOP" means "forward" and then find many discordances when merging two data sets, one coded on the forward strand and one exported from BeadStudio. The convert_bim_allele.pl program described in the tutorial is meant to solve problems like this.
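To make the forward-versus-TOP relationship concrete, here is a rough sketch of converting forward-strand alleles to TOP alleles. It assumes you already know, from an annotation, whether the forward strand of each SNP is Illumina's TOP or BOT strand; the function name `forward_to_top` is hypothetical, and this is not how convert_bim_allele.pl is implemented.

```python
# Base-by-base complement (A<->T, C<->G)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def forward_to_top(forward_alleles, forward_strand_is):
    """Convert forward-strand alleles (e.g. "TC") to TOP-strand alleles.

    forward_strand_is -- "TOP" or "BOT": which Illumina strand the
                         forward strand corresponds to for this SNP
                         (assumed to come from an annotation file).
    """
    alleles = forward_alleles.upper()
    if forward_strand_is == "TOP":
        return alleles                        # already on the TOP strand
    if forward_strand_is == "BOT":
        return alleles.translate(COMPLEMENT)  # flip to the other strand
    raise ValueError("forward_strand_is must be 'TOP' or 'BOT'")


print(forward_to_top("TC", "BOT"))
print(forward_to_top("AG", "TOP"))
```

Both calls describe the same kind of SNP: a forward-strand T/C genotype on a BOT strand and a forward-strand A/G genotype on a TOP strand come out as the same TOP alleles, which is why two data sets only merge cleanly once they are on the same strand.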

TOP strand or forward strand: which one to use?

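The forward strand is the more mainstream choice. Of course, if your earlier data are coded on the TOP strand, new data must also be on TOP before the two can be merged.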

**Note:** TOP and forward are not counterparts of each other!

  • The TOP strand pairs with the BOT strand

  • The forward strand pairs with the reverse strand

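Genotypes on the TOP strand sometimes match the forward-strand genotypes and sometimes do not. This is because TOP is determined by the SNP's own alleles and, for A/T or C/G polymorphisms, by its flanking sequence ("if A or T is on the 5' side of the SNP, it is the TOP strand; otherwise it is the BOT strand"), not by the orientation of the reference genome.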

What does T>C mean in dbSNP?

Take the SNP rs1004491 as an example: dbSNP lists it as a T/C variant.

rs1004491 [Homo sapiens]
Variant type: SNV
Alleles: T>C
AAAGCCTTCTGAACTGAGTGAAAATACAGCCAAGATCTTGGCAAAGCTTC
TCCCTCAGTATTTAGACCAGGTAAGAATTTCTTGACTCATCTCCAACATA
[T/C]
GTGTTTACTGTGGAAAACACACATTTTATTTTCTTGCTATTGCATGTTAT
TGCTGGCCGGGGACCCAATTGCAGTCTCTTTAAGCCTTCAACAGTTGGCT

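The notation T>C lists the reference allele first and the alternate allele second: T is the allele on the reference genome and C is the variant. At this site the reference allele T is also, on average, the major allele, and C the minor allele.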

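The figure below shows that overall (209010 samples) the frequency of T is 0.701 and that of C is 0.298. In a few populations, such as Asian here, T is 0.482 and C is 0.518, but overall T is more frequent than C.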

[Figure: dbSNP allele frequencies for rs1004491]

That's all for now. I will update this post as my understanding deepens.



This article comes from the WeChat official account 育种数据分析之放飞自我 (account ID: R-breeding). Do not repost it to other websites without permission.

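This article is published on 科易网 with the authorization of the column author, who retains the copyright. It reflects the author's personal views and does not represent the position of 科易网. To repost, please contact the original author. For any questions, contact ky@1633.com.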