RNA-seq第三部分(比对,比对质控,HTseq-count计数)

一.比对(hisat2)

1.构建索引

a.使用hisat2-build 命令自行构建(耗时长,不推荐)

1
hisat2-build -p 4 hg19.fa genome

hg19.fa为ucsc下载的基因组解压合并后的文件

-p 线程数

genome 生成的index前缀名

b.hisat2官网下载hg19索引(推荐)

网址:http://daehwankimlab.github.io/hisat2/download/#h-sapiens

选择相应版本的index,这里选择hg19

1
tar -zxvf hg19_genome.tar.gz #解压下载好的索引文件

生成以下文件

2.将reads比对到基因组

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
$ hisat2 -h
HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
Usage:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]

<ht2-idx> Index filename prefix (minus trailing .X.ht2).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<SRA accession number> Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
<sam> File for SAM output (default: stdout)

<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.

主要参数:

  • -x
    参考基因组索引文件的前缀。
  • -1
    双端测序结果的第一个文件。若有多组数据,使用逗号将文件分隔。Reads的长度可以不一致。
  • -2
    双端测序结果的第二个文件。若有多组数据,使用逗号将文件分隔,并且文件顺序要和-1参数对应。Reads的长度可以不一致。
  • -U
    单端数据文件。若有多组数据,使用逗号将文件分隔。可以和-1、-2参数同时使用。Reads的长度可以不一致。
  • –sra-acc
    输入SRA登录号,比如SRR353653,SRR353654。多组数据之间使用逗号分隔。HISAT将自动下载并识别数据类型,进行比对。
  • -S
    指定输出的SAM文件。
1
hisat2 -x index/hg19/genome -U qc/clean_data/SRR957680_clean.fastq.gz -S alignment/SRR957680.sam #-x 需跟索引文件的前缀 SRR957680_clean.fastq.gz为fastp去除低质量碱基及接头的fastq文件

使用循环将四个文件比对命令挂后台运行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ for ((i=77;i<=80;i++));do echo $i;done
77
78
79
80
$ for ((i=77;i<=80;i++));do echo nohup hisat2 -x index/hg19/genome -U qc/clean_data/SRR9576${i}_clean.fastq.gz -S alignment/SRR9576${i}.sam \&;done
nohup hisat2 -x index/hg19/genome -U qc/clean_data/SRR957677_clean.fastq.gz -S alignment/SRR957677.sam &
nohup hisat2 -x index/hg19/genome -U qc/clean_data/SRR957678_clean.fastq.gz -S alignment/SRR957678.sam &
nohup hisat2 -x index/hg19/genome -U qc/clean_data/SRR957679_clean.fastq.gz -S alignment/SRR957679.sam &
nohup hisat2 -x index/hg19/genome -U qc/clean_data/SRR957680_clean.fastq.gz -S alignment/SRR957680.sam &
#使用循环前用echo打印出命令查看是否有问题再将其写入脚本方便后续溯源
for ((i=77;i<=80;i++));do echo nohup hisat2 -x index/hg19/genome -p 10 -U qc/clean_data/SRR9576${i}_clean.fastq.gz -S alignment/SRR9576${i}.sam \&;done >hisat2.sh
#运行脚本hisat2.sh
bash hisat2.sh

查看命令是否在后台运行

1
top -u liyue #-u接用户名

生成以下文件

1
2
3
4
5
6
7
$ ls -lh alignment/
total 19G
-rw-r--r--. 1 liyue grp6 5.3G Oct 8 11:42 SRR957677.sam
-rw-r--r--. 1 liyue grp6 2.2G Oct 8 11:40 SRR957678.sam
-rw-r--r--. 1 liyue grp6 5.2G Oct 8 11:41 SRR957679.sam
-rw-r--r--. 1 liyue grp6 6.2G Oct 8 11:41 SRR957680.sam

查看运行日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
$ cat nohup.out 
8772774 reads; of these:
8772774 (100.00%) were unpaired; of these:
533575 (6.08%) aligned 0 times
7261566 (82.77%) aligned exactly 1 time
977633 (11.14%) aligned >1 times
93.92% overall alignment rate
19821487 reads; of these:
19821487 (100.00%) were unpaired; of these:
1204646 (6.08%) aligned 0 times
16033406 (80.89%) aligned exactly 1 time
2583435 (13.03%) aligned >1 times
93.92% overall alignment rate
20716911 reads; of these:
20716911 (100.00%) were unpaired; of these:
1147541 (5.54%) aligned 0 times
17115269 (82.61%) aligned exactly 1 time
2454101 (11.85%) aligned >1 times
94.46% overall alignment rate
24127364 reads; of these:
24127364 (100.00%) were unpaired; of these:
1287901 (5.34%) aligned 0 times
19991304 (82.86%) aligned exactly 1 time
2848159 (11.80%) aligned >1 times
94.66% overall alignment rate

3.sam文件转化为bam文件并排序(两种方法)

1.先转化为bam文件再排序

1
2
samtools view -S SRR957677.sam -b SRR957677.bam
samtools sort SRR957677.bam -o SRR957677_sorted.bam #-o接排序后输出文件名称默认按染色体顺序排序

2.一步到位排序与转化同时进行(推荐)

1
2
3
4
5
6
7
8
9
10
11
samtools sort SRR957677.sam -o SRR957677_sorted.bam #直接跟sam文件
#同样使用循环进行排序转化
$ for ((i=77;i<=80;i++));do echo nohup samtools sort SRR9576${i}.sam -o SRR9576${i}_sorted.bam \&;done
nohup samtools sort SRR957677.sam -o SRR957677_sorted.bam &
nohup samtools sort SRR957678.sam -o SRR957678_sorted.bam &
nohup samtools sort SRR957679.sam -o SRR957679_sorted.bam &
nohup samtools sort SRR957680.sam -o SRR957680_sorted.bam &
#使用循环前用echo打印出命令查看是否有问题再将其写入脚本方便后续溯源
for ((i=77;i<=80;i++));do echo nohup samtools sort SRR9576${i}.sam -o SRR9576${i}_sorted.bam \&;done >sam2bam_sort.sh
#运行脚本sam2bam_sort.sh
bash sam2bam_sort.sh

查看命令是否在后台运行

1
top -u liyue #-u接用户名

生成以下结果

1
2
$ ls *sorted.bam
SRR957677_sorted.bam SRR957678_sorted.bam SRR957679_sorted.bam SRR957680_sorted.bam

生成的bam文件用samtools flagstat查看reads比对情况

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ samtools flagstat SRR957677_sorted.bam
25690281 + 0 in total (QC-passed reads + QC-failed reads)
4973370 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24542740 + 0 mapped (95.53% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

二.htseq-count计数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
htseq-count -h
usage: htseq-count [options] alignment_file gff_file

This script takes one or more alignment files in SAM/BAM format and a feature
file in GFF format and calculates for each feature the number of reads mapping
to it. See http://htseq.readthedocs.io/en/master/count.html for details.

positional arguments:
samfilenames Path to the SAM/BAM files containing the mapped reads.
If '-' is selected, read from standard input
featuresfilename Path to the GTF file containing the features

optional arguments:
-h, --help show this help message and exit
-f {sam,bam,auto}, --format {sam,bam,auto}
Type of <alignment_file> data. DEPRECATED: file format
is detected automatically. This option is ignored.
-r {pos,name}, --order {pos,name}
'pos' or 'name'. Sorting order of <alignment_file>
(default: name). Paired-end sequencing data must be
sorted either by position or by read name, and the
sorting order must be specified. Ignored for single-
end data.
--max-reads-in-buffer MAX_BUFFER_SIZE
When <alignment_file> is paired end sorted by
position, allow only so many reads to stay in memory
until the mates are found (raising this number will
use more memory). Has no effect for single end or
paired end sorted by name
-s {yes,no,reverse}, --stranded {yes,no,reverse}
Whether the data is from a strand-specific assay.
Specify 'yes', 'no', or 'reverse' (default: yes).
'reverse' means 'yes' with reversed strand
interpretation
-a MINAQUAL, --minaqual MINAQUAL
Skip all reads with MAPQ alignment quality lower than
the given minimum value (default: 10). MAPQ is the 5th
column of a SAM/BAM file and its usage depends on the
software used to map the reads.
-t FEATURETYPE, --type FEATURETYPE
Feature type (3rd column in GTF file) to be used, all
features of other type are ignored (default, suitable
for Ensembl GTF files: exon)
-i IDATTR, --idattr IDATTR
GTF attribute to be used as feature ID (default,
suitable for Ensembl GTF files: gene_id). All feature
of the right type (see -t option) within the same GTF
attribute will be added together. The typical way of
using this option is to count all exonic reads from
each gene and add the exons but other uses are
possible as well.
--additional-attr ADDITIONAL_ATTR
Additional feature attributes (default: none, suitable
for Ensembl GTF files: gene_name). Use multiple times
for more than one additional attribute. These
attributes are only used as annotations in the output,
while the determination of how the counts are added
together is done based on option -i.
-m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
Mode to handle reads overlapping more than one feature
(choices: union, intersection-strict, intersection-
nonempty; default: union)
--nonunique {none,all,fraction,random}
Whether and how to score reads that are not uniquely
aligned or ambiguously assigned to features (choices:
none, all, fraction, random; default: none)
--secondary-alignments {score,ignore}
Whether to score secondary alignments (0x100 flag)
--supplementary-alignments {score,ignore}
Whether to score supplementary alignments (0x800 flag)
-o SAMOUTS, --samout SAMOUTS
Write out all SAM alignment records into SAM/BAM files
(one per input file needed), annotating each line with
its feature assignment (as an optional field with tag
'XF'). See the -p option to use BAM instead of SAM.
-p {SAM,BAM,sam,bam}, --samout-format {SAM,BAM,sam,bam}
Format to use with the --samout option.
-d OUTPUT_DELIMITER, --delimiter OUTPUT_DELIMITER
Column delimiter in output (default: TAB).
-c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME
Filename to output the counts to instead of stdout.
--append-output Append counts output. This option is useful if you
have already creates a TSV/CSV/similar file with a
header for your samples (with additional columns for
the feature name and any additionl attributes) and
want to fill in the rest of the file.
-n NPROCESSES, --nprocesses NPROCESSES
Number of parallel CPU processes to use (default: 1).
--feature-query FEATURE_QUERY
Restrict to features descibed in this expression.
Currently supports a single kind of expression:
attribute == "one attr" to restrict the GFF to a
single gene or transcript, e.g. --feature-query
'gene_name == "ACTB"' - notice the single quotes
around the argument of this option and the double
quotes around the gene name. Broader queries might
become available in the future.
-q, --quiet Suppress progress report
--version Show software version and exit

1
2
3
4
5
6
7
8
9
10
11
# 命令参数
-f | --format default: sam 设置输入文件的格式,该参数的值可以是sam或bam。
-r | --order default: name 设置sam或bam文件的排序方式,该参数的值可以是name或pos。前者表示按read名进行排序,后者表示按比对的参考基因组位置进行排序。若测序数据是双末端测序,当输入sam/bam文件是按pos方式排序的时候,两端reads的比对结果在sam/bam文件中一般不是紧邻的两行,程序会将reads对的第一个比对结果放入内存,直到读取到另一端read的比对结果。因此,选择pos可能会导致程序使用较多的内存,它也适合于未排序的sam/bam文件。而pos排序则表示程序认为双末端测序的reads比对结果在紧邻的两行上,也适合于单端测序的比对结果。很多其它表达量分析软件要求输入的sam/bam文件是按pos排序的,但HTSeq推荐使用name排序,且一般比对软件的默认输出结果也是按name进行排序的。
-s | --stranded default: yes 设置是否是链特异性测序。该参数的值可以是yes,no或reverse。no表示非链特异性测序;若是单端测序,yes表示read比对到了基因的正义链上;若是双末端测序,yes表示read1比对到了基因正义链上,read2比对到基因负义链上;reverse表示双末端测序情况下与yes值相反的结果。根据说明文件的理解,一般情况下双末端链特异性测序,该参数的值应该选择reverse(本人暂时没有测试该参数)。
-a | --a default: 10 忽略比对质量低于此值的比对结果。在0.5.4版本以前该参数默认值是0。
-t | --type default: exon 程序会对该指定的feature(gtf/gff文件第三列)进行表达量计算,而gtf/gff文件中其它的feature都会被忽略。
-i | --idattr default: gene_id 设置feature ID是由gtf/gff文件第9列那个标签决定的;若gtf/gff文件多行具有相同的feature ID,则它们来自同一个feature,程序会计算这些features的表达量之和赋给相应的feature ID。
-m | --mode default: union 设置表达量计算模式。该参数的值可以有union, intersection-strict and intersection-nonempty。这三种模式的选择请见上面对这3种模式的示意图。从图中可知,对于原核生物,推荐使用intersection-strict模式;对于真核生物,推荐使用union模式。
-o | --samout 输出一个sam文件,该sam文件的比对结果中多了一个XF标签,表示该read比对到了某个feature上。
-q | --quiet 不输出程序运行的状态信息和警告信息。
-h | --help 输出帮助信息。

1.数据准备

htseq计数需要按照reads名称进行排序,之前的是按染色体排序,所以还得重新排序

1
2
3
4
5
6
7
8
9
10
11
samtools sort -n SRR957677_sorted.bam -o SRR957677_nsorted.bam
#循环
for ((i=77;i<=80;i++));do echo nohup samtools sort -n SRR9576${i}_sorted.bam -o SRR9576${i}_nsorted.bam \&;done
nohup samtools sort -n SRR957677_sorted.bam -o SRR957677_nsorted.bam &
nohup samtools sort -n SRR957678_sorted.bam -o SRR957678_nsorted.bam &
nohup samtools sort -n SRR957679_sorted.bam -o SRR957679_nsorted.bam &
nohup samtools sort -n SRR957680_sorted.bam -o SRR957680_nsorted.bam &
#使用循环前用echo打印出命令查看是否有问题再将其写入脚本方便后续溯源
for ((i=77;i<=80;i++));do echo nohup samtools sort -n SRR9576${i}_sorted.bam -o SRR9576${i}_nsorted.bam \&;done >bam_nsort.sh
#运行脚本bam_nsort.sh
bash bam_nsort.sh

2.htseq-count计数

1
2
3
4
5
6
7
8
9
10
11
htseq-count -f bam -r name SRR957677_nsorted.bam ../genome/gencode.v19.annotation.gtf>SRR957677.conut
#循环
for((i=77;i<=80;i++));do echo nohup htseq-count -f bam -r name SRR9576${i}_nsorted.bam ../genome/gencode.v19.annotation.gtf \>SRR9576${i}.conut \&;done#>和&需要\转义
nohup htseq-count -f bam -r name SRR957677_nsorted.bam ../genome/gencode.v19.annotation.gtf >SRR957677.conut &
nohup htseq-count -f bam -r name SRR957678_nsorted.bam ../genome/gencode.v19.annotation.gtf >SRR957678.conut &
nohup htseq-count -f bam -r name SRR957679_nsorted.bam ../genome/gencode.v19.annotation.gtf >SRR957679.conut &
nohup htseq-count -f bam -r name SRR957680_nsorted.bam ../genome/gencode.v19.annotation.gtf >SRR957680.conut &
#使用循环前用echo打印出命令查看是否有问题再将其写入脚本方便后续溯源
for((i=77;i<=80;i++));do echo nohup htseq-count -f bam -r name SRR9576${i}_nsorted.bam ../genome/gencode.v19.annotation.gtf \>SRR9576${i}.conut \&;done >htseq.sh
#运行脚本htseq.sh
bash htseq.sh

#查看命令是否在后台运行

1
top -u liyue #-u接用户名

生成SRR957677.conut,SRR957678.conut,SRR957679.conut,SRR957680.conut四个文件

1
less SRR957677.conut

不知道问题出在哪,htseq运行的过程也写入这个文件中去了

解决办法:用grep命令去除该文件中的运行过程内容(其他三个文件也一样处理)

1
grep -v "processed" SRR957677.conut >SRR957677_conut
1
less SRR957677_conut
客官打个赏咯.