Research on Genome Assembly Methods Based on High-Throughput Sequencing Data

Contents

Abstract (In Chinese)

Abstract (In English)

Chapter 1 Introduction

1.1 Background, objective and significance of the project

1.1.1 Project source

1.1.2 Background of the project

1.1.3 Objective and significance of the research

1.2 Review of related work

1.2.1 Genome assembly

1.2.2 Review of assembly methods for high-throughput sequencing data

1.2.3 Review of dealing with branches due to sequencing errors

1.2.4 Review of dealing with branches due to repeats

1.2.5 Review of mis-assembly identification methods

1.2.6 Problems in genome assembly

1.3 Main research contents of this thesis

Chapter 2 Dealing with branches based on support vector machine (SVM)

2.1 Introduction

2.2 Support vector machine

2.2.1 Optimal classification hyperplane

2.2.2 Support vector machine and kernel functions

2.3 Algorithm for dealing with branches based on SVM

2.3.1 Branch and feature selection

2.3.2 SVM prediction model

2.4 Experimental results and analysis

2.4.1 Experimental data

2.4.2 Experimental results and analysis

2.5 Brief summary

Chapter 3 Dealing with branches based on look-ahead approach

3.1 Introduction

3.2 Look-ahead approach

3.2.1 Dealing with bubbles

3.2.2 Dealing with short tandem repeats

3.3 Experimental results and analysis

3.3.1 Experimental data

3.3.2 Experimental results and analysis

3.4 Brief summary

Chapter 4 Genome assembly method based on multiple heuristics

4.1 Introduction

4.2 General idea of assembly method

4.3 Assembly method based on multiple heuristics

4.3.1 Data preprocessing

4.3.2 Build k-mer hash table

4.3.3 Core assembly algorithm

4.3.4 Align reads to contigs

4.3.5 Assembly using paired-end reads

4.3.6 Assembly using variable overlap size

4.4 Scaffolding method based on paired-end reads

4.4.1 Determine the aligned positions of reads on contigs

4.4.2 Link contigs into scaffolds

4.4.3 Compute the distance between adjacent contigs

4.4.4 Fill gaps between contigs

4.5 Experimental results and analysis

4.5.1 Experimental data

4.5.2 Evaluation metrics

4.5.3 Experimental results and analysis

4.6 Brief summary

Chapter 5 Unbiased mis-assembly identification method

5.1 Introduction

5.2 Unbiased mis-assembly identification method

5.2.1 Identify differences in assembly

5.2.2 Compute breakpoint regions for differences

5.2.3 Identify mis-assemblies

5.2.4 Identify correct assemblies corresponding to structural variations

5.3 Experimental results and analysis

5.3.1 Experimental data

5.3.2 Experimental results and analysis

5.4 Brief summary

Conclusions

References

Papers published during the Ph.D. program

Statement of originality and letter of authorization

Acknowledgements

Resume

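Chapter 1 Introduction

1.1 Background, objective and significance of the project

1.1.1 Project source

General Program of the National Natural Science Foundation of China: "Research on Whole-Genome Assembly Algorithms Based on Next-Generation Sequencing Data" (Grant No. 61173085), January 2012 to December 2015.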

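1.1.2 Background of the project

Sequencing the genetic information of organisms has greatly advanced the life sciences and is an epoch-making milestone in their rapid development in the 21st century. Determining the complete genome sequence of an organism has great scientific value: it lets researchers study the target organism at the scale of the whole genome, clarify the relationship between gene structure and function, investigate more deeply the mechanisms of cell development, growth, and differentiation, and ultimately reveal the mechanisms by which diseases arise. Genome sequencing has a wide range of applications. It can be used to obtain the genome sequence of a new species (de novo sequencing), to determine the genome sequence of a single individual within a population (resequencing), and to obtain the sequences of RNA molecules in a sample (RNA sequencing). Of these, determining the genome sequence of an organism is the most important application. With the rapid development of experimental techniques and data analysis techniques in biology, the genomes of more and more organisms have been sequenced. Determining the complete genome sequence of an organism is the foundation of biological data analysis, and it has great theoretical and practical significance for life science research and for exploring and understanding the nature of life.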

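Beginning in the 1980s, sequencing technologies based on the Sanger method (the dideoxynucleotide chain-termination method) [1] achieved great success. First-generation sequencing, represented by automated and semi-automated capillary sequencing based on the Sanger method, produces high-quality DNA sequences (per-base accuracy of 99.999%) called reads, with read lengths of up to 1000 base pairs (bp) [2]. This technology made possible a large body of sequencing work, from bacterial genomes [3-6] to the complete human genome map [7, 8]. However, it is expensive, slow, and low in throughput, so it cannot keep up with the rapid development of the life sciences. A new sequencing technology with low cost, high speed, and high throughput was therefore urgently needed.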

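Compared with first-generation sequencing, second-generation sequencing offers much higher throughput, and it is therefore also called high-throughput sequencing. Its emergence and rapid development have greatly changed how basic, applied, and clinical research is carried out, making it a landmark event in the field of genomics [2, 9].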