Building a Pipeline with Snakemake (Part 1)

Snakemake is a Python-based tool, so it inherits Python's readability, clear logic, and maintainability, and it supports Python syntax directly, which makes it very approachable for newcomers. It follows Python conventions: indentation indicates nesting; indexing starts at 0, so {input[0]} refers to the first element of input; and lists are written with square brackets, like ['A', 'B', 'C']. Snakemake's basic building block is the "rule"; each rule contains several directives (input, output, shell, and so on). Snakemake's execution logic chains the rules together by matching outputs to inputs, forming a complete workflow.
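As a plain-Python illustration of those conventions (this is ordinary Python, not Snakemake itself), the {input[0]}-style placeholder behaves just like 0-based list indexing combined with str.format():

```python
# Plain Python mirroring Snakemake's conventions:
# lists use square brackets, and indexing starts at 0.
inputs = ['A.fastq.gz', 'B.fastq.gz', 'C.fastq.gz']

# {input[0]} in a Snakemake shell string corresponds to inputs[0] here,
# i.e. the FIRST element.
command = 'fastqc {}'.format(inputs[0])
print(command)  # fastqc A.fastq.gz
```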

1. Data Preparation

We'll use RNA-seq data as an example.

snakefile_rnaseq
├── genome
│   ├── gencode.v19.annotation.gtf
│   └── hg19.fa
├── index
│   └── hg19
│       ├── genome.1.ht2
│       ├── genome.2.ht2
│       ├── genome.3.ht2
│       ├── genome.4.ht2
│       ├── genome.5.ht2
│       ├── genome.6.ht2
│       ├── genome.7.ht2
│       ├── genome.8.ht2
│       └── make_hg19.sh
├── SRR957677.1_1.fastq.gz
├── SRR957678.1_1.fastq.gz
├── SRR957679.1_1.fastq.gz
└── SRR957680.1_1.fastq.gz

2. Creating the Workflow File (Snakefile)

The workflow file does not have to be named Snakefile, nor does it have to live in a fixed location. If it sits in the current directory and is named exactly Snakefile, you can run snakemake without specifying a file, as in the example below; snakemake automatically picks up the file named Snakefile in the current path. To run a workflow file from another path or with a different name, use the -s option. A tip: give the workflow file a .py extension so that your editor applies Python syntax highlighting while you write it.

  • --snakefile, -s: specify the workflow file; defaults to Snakefile in the current directory
  • --dryrun, -n: do not actually run anything; commonly used to check the Snakefile for errors
  • snakemake --dag | dot -Tpdf > dag.pdf: visualize the workflow
# Create the workflow file
$ touch rnaseqflow.py
$ snakemake -n -s rnaseqflow.py
Building DAG of jobs...
Nothing to be done.

3. Creating the First Rule

rule fastqc:  # define the first rule, named fastqc
    input:  # input: declare the input file(s)
        fq='SRR957677.1_1.fastq.gz'
    output:  # output: declare the output file(s)
        'SRR957677.1_1_fastqc.zip'
    log:  # declare the log file
        'SRR957677.1_1.log'
    params:  # declare extra parameters
        outdir='qc'
    shell:
        'fastqc {input[fq]} -o {params[outdir]} 1>{log[0]} 2>&1'
# shell declares how the step is executed; there are three options: shell, run, and script.
# shell runs a Bash command, run executes Python code, and script runs an external
# script, e.g. script: "scripts/script.py" or script: "scripts/script.R".
# {input[fq]} can also be written as {input[0]}; likewise {params[outdir]} as {params[0]}.

The -n flag performs a dry run, i.e. nothing is actually executed; -p prints the shell commands; -np prints all commands without running anything; -s specifies the workflow file (defaulting to Snakefile in the current directory). Use a dry run to check that the generated commands are correct:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 fastqc
1

[Fri Oct 23 07:51:24 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 0

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1 # note the command is exactly what we would write by hand
Job counts:
count jobs
1 fastqc
1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

4. Matching Multiple Input Files with Wildcards

The fastqc rule above processes only a single sample, but a real project may have dozens or hundreds of samples, and we cannot hard-code every sample name. Snakemake lets you use wildcards to run a command over many inputs in one go.

The input of rule all lists the final target files of the workflow; like a GNU Make target, it declares the goal at the top of the file.

expand is a Snakemake built-in function, similar to a list comprehension.

expand('{sample}.txt', sample=SAMPLES) is equivalent to ['{sample}.txt'.format(sample=sample) for sample in SAMPLES].
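That equivalence can be checked in plain Python (a sketch with a hand-written SAMPLES list):

```python
SAMPLES = ['SRR957677', 'SRR957678', 'SRR957679', 'SRR957680']

# What expand('{sample}.txt', sample=SAMPLES) produces, written as the
# equivalent list comprehension:
targets = ['{sample}.txt'.format(sample=s) for s in SAMPLES]
print(targets)
# ['SRR957677.txt', 'SRR957678.txt', 'SRR957679.txt', 'SRR957680.txt']
```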

(SAMPLES,) = glob_wildcards("{sample}.1_1.fastq.gz")
# Alternatively, list the samples explicitly:
# SAMPLES = ['SRR957677', 'SRR957678', 'SRR957679', 'SRR957680']
rule all:
    input:
        expand('{sample_name}.1_1_fastqc.zip', sample_name=SAMPLES)
rule fastqc:
    input:
        fq='{sample_name}.1_1.fastq.gz'
    output:
        '{sample_name}.1_1_fastqc.zip'
    log:
        '{sample_name}.1_1.log'
    params:
        outdir='qc'
    shell:
        'fastqc {input[fq]} -o {params[outdir]} 1>{log[0]} 2>&1'
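To demystify glob_wildcards, here is a rough plain-Python approximation (my_glob_wildcards is a hypothetical helper, not Snakemake's actual implementation): turn the pattern into a regex and collect whatever the {sample} slot matched.

```python
import re
from pathlib import Path

def my_glob_wildcards(pattern, directory='.'):
    """Rough stand-in for snakemake's glob_wildcards(): convert the
    '{sample}' placeholder into a named regex group, then return the
    captured values for every matching filename in the directory."""
    regex = re.compile(
        re.escape(pattern).replace(re.escape('{sample}'), '(?P<sample>.+)'))
    matches = [m.group('sample')
               for p in Path(directory).iterdir()
               if (m := regex.fullmatch(p.name))]
    return sorted(matches)
```

With the four fastq.gz files from the data-preparation step in the current directory, my_glob_wildcards('{sample}.1_1.fastq.gz') would return the four SRR accession names.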

Do a dry run; add the -p flag so the terminal prints the shell commands:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 all
4 fastqc
5

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 1
wildcards: sample_name=SRR957677

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_fastqc.zip
log: SRR957679.1_1.log
jobid: 3
wildcards: sample_name=SRR957679

fastqc SRR957679.1_1.fastq.gz -o qc 1>SRR957679.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_fastqc.zip
log: SRR957680.1_1.log
jobid: 4
wildcards: sample_name=SRR957680

fastqc SRR957680.1_1.fastq.gz -o qc 1>SRR957680.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_fastqc.zip
log: SRR957678.1_1.log
jobid: 2
wildcards: sample_name=SRR957678

fastqc SRR957678.1_1.fastq.gz -o qc 1>SRR957678.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
localrule all:
input: SRR957677.1_1_fastqc.zip, SRR957678.1_1_fastqc.zip, SRR957679.1_1_fastqc.zip, SRR957680.1_1_fastqc.zip
jobid: 0

Job counts:
count jobs
1 all
4 fastqc
5
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

In this article every step runs the same command on each sample in parallel, so the wildcard {sample} handles all the matching. But some steps must pass the files of all samples to a single command instead of running once per sample, and the wildcard approach no longer applies. Variant calling is a typical example: the BAM files of every sample go into one command.

1. First, define a variable at the top of the workflow file:

samples = ["A", "B", "C"]

2. Then use Snakemake's built-in expand function:

bam=expand("mapped/{sample}.sorted.bam", sample=samples) 
rule calling:
    input:
        fa="data/genome.fa",
        bam=expand("mapped/{sample}.sorted.bam", sample=samples),
        bai=expand("mapped/{sample}.sorted.bam.bai", sample=samples)
    output:
        "calling/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

5. Adding More Rules

On top of fastqc we add fastp, hisat2, samtools sort, and htseq rules. For consecutive rules, the output files of one rule must be the input files of the next.

The input of rule all lists the final target files, declaring the goal at the top. fastqc cannot be chained to the other rules because its output is not the input of any other rule, so its output must be listed separately in rule all.

Once rule all names {sample_name}.count as a target, Snakemake determines the execution order by working backward from that target. It first looks for the rule whose output is {sample_name}.count, which is rule htseq; then it asks which rule produces htseq's input {sample_name}_nsorted.bam, which is rule samtools_sort; likewise, samtools_sort's input comes from the output of rule hisat2, whose input in turn comes from rule fastp. Tracing backward from the top therefore gives rule all > rule htseq > rule samtools_sort > rule hisat2 > rule fastp (rule fastqc hangs directly off rule all), and the actual execution order is the reverse.
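That backward-chaining logic can be sketched in a few lines of plain Python (a toy model, not Snakemake internals; the suffix table is hand-copied from the pipeline's rules):

```python
# Toy model of Snakemake's backward resolution: map each output suffix to
# (rule name, input suffix that rule requires).
RULES = {
    '.count':              ('htseq',         '_nsorted.bam'),
    '_nsorted.bam':        ('samtools_sort', '.sam'),
    '.sam':                ('hisat2',        '.1_1_clean.fastq.gz'),
    '.1_1_clean.fastq.gz': ('fastp',         '.1_1.fastq.gz'),
}

def resolve(target_suffix):
    """Walk backward from the target to the raw input, collecting rules;
    the real execution order is the reverse of the returned chain."""
    chain = []
    while target_suffix in RULES:
        rule, needed = RULES[target_suffix]
        chain.append(rule)
        target_suffix = needed
    return chain

print(resolve('.count'))
# ['htseq', 'samtools_sort', 'hisat2', 'fastp'] (executed in reverse order)
```

Changing the target to '.sam' prunes the chain to ['hisat2', 'fastp'], which mirrors how Snakemake skips downstream rules when rule all asks for an intermediate file.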

(SAMPLES,) = glob_wildcards("{sample}.1_1.fastq.gz")
rule all:
    input:
        expand('{sample_name}.1_1_fastqc.zip', sample_name=SAMPLES),
        expand('{sample_name}.count', sample_name=SAMPLES)
rule fastqc:
    input:
        fq='{sample_name}.1_1.fastq.gz'
    output:
        '{sample_name}.1_1_fastqc.zip'
    log:
        '{sample_name}.1_1.log'
    params:
        outdir='qc'
    shell:
        'fastqc {input[fq]} -o {params[outdir]} 1>{log[0]} 2>&1'
rule fastp:
    input:
        fq='{sample_name}.1_1.fastq.gz'
    output:
        '{sample_name}.1_1_clean.fastq.gz'
    log:
        '{sample_name}.fastp.log'
    params:
        outdir='fastp'
    shell:
        'fastp -i {input[fq]} -o {params[outdir]}/{output[0]}'
rule hisat2:
    input:
        clean_fq='{sample_name}.1_1_clean.fastq.gz'
    output:
        '{sample_name}.sam'
    log:
        '{sample_name}.hisat2.log'
    params:
        outdir='hisat2',
        index='index/hg19/genome'
    shell:
        'hisat2 -x {params[index]} -p 10 -U fastp/{input[clean_fq]} -S {params[outdir]}/{output[0]}'
rule samtools_sort:
    input:
        '{sample_name}.sam'
    output:
        '{sample_name}_nsorted.bam'
    shell:
        'samtools sort -o hisat2/{output[0]} hisat2/{input[0]}'
rule htseq:
    input:
        '{sample_name}_nsorted.bam'
    output:
        '{sample_name}.count'
    log:
        '{sample_name}.htseq.log'
    params:
        outdir='htseq',
        gtf='genome/gencode.v19.annotation.gtf'
    shell:
        'htseq-count -f bam -r name hisat2/{input[0]} {params[gtf]} >{output[0]}'

Do a dry run; add the -p flag so the terminal prints the shell commands:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
4 htseq
4 samtools_sort
21

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_clean.fastq.gz
log: SRR957678.fastp.log
jobid: 18
wildcards: sample_name=SRR957678

fastp -i SRR957678.1_1.fastq.gz -o fastp/SRR957678.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_fastqc.zip
log: SRR957679.1_1.log
jobid: 3
wildcards: sample_name=SRR957679

fastqc SRR957679.1_1.fastq.gz -o qc 1>SRR957679.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_fastqc.zip
log: SRR957680.1_1.log
jobid: 4
wildcards: sample_name=SRR957680

fastqc SRR957680.1_1.fastq.gz -o qc 1>SRR957680.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_clean.fastq.gz
log: SRR957679.fastp.log
jobid: 19
wildcards: sample_name=SRR957679

fastp -i SRR957679.1_1.fastq.gz -o fastp/SRR957679.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_clean.fastq.gz
log: SRR957677.fastp.log
jobid: 17
wildcards: sample_name=SRR957677

fastp -i SRR957677.1_1.fastq.gz -o fastp/SRR957677.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_fastqc.zip
log: SRR957678.1_1.log
jobid: 2
wildcards: sample_name=SRR957678

fastqc SRR957678.1_1.fastq.gz -o qc 1>SRR957678.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 1
wildcards: sample_name=SRR957677

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_clean.fastq.gz
log: SRR957680.fastp.log
jobid: 20
wildcards: sample_name=SRR957680

fastp -i SRR957680.1_1.fastq.gz -o fastp/SRR957680.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957679.1_1_clean.fastq.gz
output: SRR957679.sam
log: SRR957679.hisat2.log
jobid: 15
wildcards: sample_name=SRR957679

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957679.1_1_clean.fastq.gz -S hisat2/SRR957679.sam

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957680.1_1_clean.fastq.gz
output: SRR957680.sam
log: SRR957680.hisat2.log
jobid: 16
wildcards: sample_name=SRR957680

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957680.1_1_clean.fastq.gz -S hisat2/SRR957680.sam

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957678.1_1_clean.fastq.gz
output: SRR957678.sam
log: SRR957678.hisat2.log
jobid: 14
wildcards: sample_name=SRR957678

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957678.1_1_clean.fastq.gz -S hisat2/SRR957678.sam

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957677.1_1_clean.fastq.gz
output: SRR957677.sam
log: SRR957677.hisat2.log
jobid: 13
wildcards: sample_name=SRR957677

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957677.1_1_clean.fastq.gz -S hisat2/SRR957677.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957677.sam
output: SRR957677_nsorted.bam
jobid: 9
wildcards: sample_name=SRR957677

samtools sort -o hisat2/SRR957677_nsorted.bam hisat2/SRR957677.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957679.sam
output: SRR957679_nsorted.bam
jobid: 11
wildcards: sample_name=SRR957679

samtools sort -o hisat2/SRR957679_nsorted.bam hisat2/SRR957679.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957680.sam
output: SRR957680_nsorted.bam
jobid: 12
wildcards: sample_name=SRR957680

samtools sort -o hisat2/SRR957680_nsorted.bam hisat2/SRR957680.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957678.sam
output: SRR957678_nsorted.bam
jobid: 10
wildcards: sample_name=SRR957678

samtools sort -o hisat2/SRR957678_nsorted.bam hisat2/SRR957678.sam

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957678_nsorted.bam
output: SRR957678.count
log: SRR957678.htseq.log
jobid: 6
wildcards: sample_name=SRR957678

htseq-count -f bam -r name hisat2/SRR957678_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957678.count

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957680_nsorted.bam
output: SRR957680.count
log: SRR957680.htseq.log
jobid: 8
wildcards: sample_name=SRR957680

htseq-count -f bam -r name hisat2/SRR957680_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957680.count

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957677_nsorted.bam
output: SRR957677.count
log: SRR957677.htseq.log
jobid: 5
wildcards: sample_name=SRR957677

htseq-count -f bam -r name hisat2/SRR957677_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957677.count

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957679_nsorted.bam
output: SRR957679.count
log: SRR957679.htseq.log
jobid: 7
wildcards: sample_name=SRR957679

htseq-count -f bam -r name hisat2/SRR957679_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957679.count

[Fri Oct 23 10:23:24 2020]
localrule all:
input: SRR957677.1_1_fastqc.zip, SRR957678.1_1_fastqc.zip, SRR957679.1_1_fastqc.zip, SRR957680.1_1_fastqc.zip, SRR957677.count, SRR957678.count, SRR957679.count, SRR957680.count
jobid: 0

Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
4 htseq
4 samtools_sort
21
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Suppose we change {sample_name}.count in rule all to {sample_name}.sam: the workflow then only runs through rule hisat2 and skips everything after it. Tracing backward from the top, the target {sample_name}.sam is the output of rule hisat2, so hisat2 is the last rule to run and the downstream rules are pruned.

(SAMPLES,) = glob_wildcards("{sample}.1_1.fastq.gz")
rule all:
    input:
        expand('{sample_name}.1_1_fastqc.zip', sample_name=SAMPLES),
        expand('{sample_name}.sam', sample_name=SAMPLES)

Do a dry run; add the -p flag so the terminal prints the shell commands:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
13

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_fastqc.zip
log: SRR957679.1_1.log
jobid: 3
wildcards: sample_name=SRR957679

fastqc SRR957679.1_1.fastq.gz -o qc 1>SRR957679.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_fastqc.zip
log: SRR957680.1_1.log
jobid: 4
wildcards: sample_name=SRR957680

fastqc SRR957680.1_1.fastq.gz -o qc 1>SRR957680.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_clean.fastq.gz
log: SRR957678.fastp.log
jobid: 10
wildcards: sample_name=SRR957678

fastp -i SRR957678.1_1.fastq.gz -o fastp/SRR957678.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_clean.fastq.gz
log: SRR957680.fastp.log
jobid: 12
wildcards: sample_name=SRR957680

fastp -i SRR957680.1_1.fastq.gz -o fastp/SRR957680.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_fastqc.zip
log: SRR957678.1_1.log
jobid: 2
wildcards: sample_name=SRR957678

fastqc SRR957678.1_1.fastq.gz -o qc 1>SRR957678.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 1
wildcards: sample_name=SRR957677

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_clean.fastq.gz
log: SRR957677.fastp.log
jobid: 9
wildcards: sample_name=SRR957677

fastp -i SRR957677.1_1.fastq.gz -o fastp/SRR957677.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_clean.fastq.gz
log: SRR957679.fastp.log
jobid: 11
wildcards: sample_name=SRR957679

fastp -i SRR957679.1_1.fastq.gz -o fastp/SRR957679.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957678.1_1_clean.fastq.gz
output: SRR957678.sam
log: SRR957678.hisat2.log
jobid: 6
wildcards: sample_name=SRR957678

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957678.1_1_clean.fastq.gz -S hisat2/SRR957678.sam

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957680.1_1_clean.fastq.gz
output: SRR957680.sam
log: SRR957680.hisat2.log
jobid: 8
wildcards: sample_name=SRR957680

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957680.1_1_clean.fastq.gz -S hisat2/SRR957680.sam

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957677.1_1_clean.fastq.gz
output: SRR957677.sam
log: SRR957677.hisat2.log
jobid: 5
wildcards: sample_name=SRR957677

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957677.1_1_clean.fastq.gz -S hisat2/SRR957677.sam

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957679.1_1_clean.fastq.gz
output: SRR957679.sam
log: SRR957679.hisat2.log
jobid: 7
wildcards: sample_name=SRR957679

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957679.1_1_clean.fastq.gz -S hisat2/SRR957679.sam

[Fri Oct 23 10:43:36 2020]
localrule all:
input: SRR957677.1_1_fastqc.zip, SRR957678.1_1_fastqc.zip, SRR957679.1_1_fastqc.zip, SRR957680.1_1_fastqc.zip, SRR957677.sam, SRR957678.sam, SRR957679.sam, SRR957680.sam
jobid: 0

Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
13
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

6. Generating a Workflow Diagram

Snakemake can export the whole workflow as a diagram (in combination with the dot command) in formats such as PNG or PDF. The command is:

$ snakemake --dag -s rnaseqflow.py| dot -Tpdf > test.pdf
Building DAG of jobs...

References

1. Snakemake: a simple and handy tool for building bioinformatics pipelines

2. A beginner's handbook: easily set up bioinformatics analysis pipelines with Snakemake
