Building a Pipeline with Snakemake (Part 1)

Snakemake is a Python-based tool, so it inherits Python's readability, clear logic, and maintainability, and it supports Python syntax directly, which makes it very approachable for newcomers. It follows Python conventions: indentation indicates nesting; indexing starts at 0, so {input[0]} refers to the first element of input; and lists are written with square brackets, like ['A', 'B', 'C']. Snakemake's basic building block is the "rule"; each rule contains several directives (input, output, shell, and so on). Snakemake's execution logic chains the rules together by matching outputs to inputs, forming a complete workflow.
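As a plain-Python illustration of those conventions (this is ordinary Python, not Snakemake itself), the {input[0]}-style placeholder behaves just like 0-based list indexing combined with str.format():

```python
# Plain Python mirroring Snakemake's conventions:
# lists use square brackets, and indexing starts at 0.
inputs = ['A.fastq.gz', 'B.fastq.gz', 'C.fastq.gz']

# {input[0]} in a Snakemake shell string corresponds to inputs[0] here,
# i.e. the FIRST element.
command = 'fastqc {}'.format(inputs[0])
print(command)  # fastqc A.fastq.gz
```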

1. Data Preparation

We'll use RNA-seq data as an example.

snakefile_rnaseq
├── genome
│   ├── gencode.v19.annotation.gtf
│   └── hg19.fa
├── index
│   └── hg19
│       ├── genome.1.ht2
│       ├── genome.2.ht2
│       ├── genome.3.ht2
│       ├── genome.4.ht2
│       ├── genome.5.ht2
│       ├── genome.6.ht2
│       ├── genome.7.ht2
│       ├── genome.8.ht2
│       └── make_hg19.sh
├── SRR957677.1_1.fastq.gz
├── SRR957678.1_1.fastq.gz
├── SRR957679.1_1.fastq.gz
└── SRR957680.1_1.fastq.gz

2. Creating the Workflow File (Snakefile)

The workflow file does not have to be named Snakefile, nor does it have to live in a fixed location. If it sits in the current directory and is named exactly Snakefile, you can run snakemake without specifying a file, as in the example below; snakemake automatically picks up the file named Snakefile in the current path. To run a workflow file from another path or with a different name, use the -s option. A tip: give the workflow file a .py extension so that your editor applies Python syntax highlighting while you write it.

  • --snakefile, -s: specify the workflow file; defaults to Snakefile in the current directory
  • --dryrun, -n: do not actually run anything; commonly used to check the Snakefile for errors
  • snakemake --dag | dot -Tpdf > dag.pdf: visualize the workflow
# Create the workflow file
$ touch rnaseqflow.py
$ snakemake -n -s rnaseqflow.py
Building DAG of jobs...
Nothing to be done.

3. Creating the First Rule

rule fastqc:  # define the first rule, named fastqc
    input:  # input: declare the input file(s)
        fq='SRR957677.1_1.fastq.gz'
    output:  # output: declare the output file(s)
        'SRR957677.1_1_fastqc.zip'
    log:  # declare the log file
        'SRR957677.1_1.log'
    params:  # declare extra parameters
        outdir='qc'
    shell:
        'fastqc {input[fq]} -o {params[outdir]} 1>{log[0]} 2>&1'
# shell declares how the step is executed; there are three options: shell, run, and script.
# shell runs a Bash command, run executes Python code, and script runs an external
# script, e.g. script: "scripts/script.py" or script: "scripts/script.R".
# {input[fq]} can also be written as {input[0]}; likewise {params[outdir]} as {params[0]}.

The -n flag performs a dry run, i.e. nothing is actually executed; -p prints the shell commands; -np prints all commands without running anything; -s specifies the workflow file (defaulting to Snakefile in the current directory). Use a dry run to check that the generated commands are correct:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 fastqc
1

[Fri Oct 23 07:51:24 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 0

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1 # note the command is exactly what we would write by hand
Job counts:
count jobs
1 fastqc
1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

4. Matching Multiple Input Files with Wildcards

The fastqc rule above processes only a single sample, but a real project may have dozens or hundreds of samples, and we cannot hard-code every sample name. Snakemake lets you use wildcards to run a command over many inputs in one go.

The input of rule all lists the final target files of the workflow; like a GNU Make target, it declares the goal at the top of the file.

expand is a Snakemake built-in function, similar to a list comprehension.

expand('{sample}.txt', sample=SAMPLES) is equivalent to ['{sample}.txt'.format(sample=sample) for sample in SAMPLES].
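That equivalence can be checked in plain Python (a sketch with a hand-written SAMPLES list):

```python
SAMPLES = ['SRR957677', 'SRR957678', 'SRR957679', 'SRR957680']

# What expand('{sample}.txt', sample=SAMPLES) produces, written as the
# equivalent list comprehension:
targets = ['{sample}.txt'.format(sample=s) for s in SAMPLES]
print(targets)
# ['SRR957677.txt', 'SRR957678.txt', 'SRR957679.txt', 'SRR957680.txt']
```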

(SAMPLES,) = glob_wildcards("{sample}.1_1.fastq.gz")
# Alternatively, list the samples explicitly:
# SAMPLES = ['SRR957677', 'SRR957678', 'SRR957679', 'SRR957680']
rule all:
    input:
        expand('{sample_name}.1_1_fastqc.zip', sample_name=SAMPLES)
rule fastqc:
    input:
        fq='{sample_name}.1_1.fastq.gz'
    output:
        '{sample_name}.1_1_fastqc.zip'
    log:
        '{sample_name}.1_1.log'
    params:
        outdir='qc'
    shell:
        'fastqc {input[fq]} -o {params[outdir]} 1>{log[0]} 2>&1'
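To demystify glob_wildcards, here is a rough plain-Python approximation (my_glob_wildcards is a hypothetical helper, not Snakemake's actual implementation): turn the pattern into a regex and collect whatever the {sample} slot matched.

```python
import re
from pathlib import Path

def my_glob_wildcards(pattern, directory='.'):
    """Rough stand-in for snakemake's glob_wildcards(): convert the
    '{sample}' placeholder into a named regex group, then return the
    captured values for every matching filename in the directory."""
    regex = re.compile(
        re.escape(pattern).replace(re.escape('{sample}'), '(?P<sample>.+)'))
    matches = [m.group('sample')
               for p in Path(directory).iterdir()
               if (m := regex.fullmatch(p.name))]
    return sorted(matches)
```

With the four fastq.gz files from the data-preparation step in the current directory, my_glob_wildcards('{sample}.1_1.fastq.gz') would return the four SRR accession names.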

Do a dry run; add the -p flag so the terminal prints the shell commands:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 all
4 fastqc
5

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 1
wildcards: sample_name=SRR957677

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_fastqc.zip
log: SRR957679.1_1.log
jobid: 3
wildcards: sample_name=SRR957679

fastqc SRR957679.1_1.fastq.gz -o qc 1>SRR957679.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_fastqc.zip
log: SRR957680.1_1.log
jobid: 4
wildcards: sample_name=SRR957680

fastqc SRR957680.1_1.fastq.gz -o qc 1>SRR957680.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
rule fastqc:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_fastqc.zip
log: SRR957678.1_1.log
jobid: 2
wildcards: sample_name=SRR957678

fastqc SRR957678.1_1.fastq.gz -o qc 1>SRR957678.1_1.log 2>&1

[Fri Oct 23 08:04:34 2020]
localrule all:
input: SRR957677.1_1_fastqc.zip, SRR957678.1_1_fastqc.zip, SRR957679.1_1_fastqc.zip, SRR957680.1_1_fastqc.zip
jobid: 0

Job counts:
count jobs
1 all
4 fastqc
5
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

In this article every step runs the same command on each sample in parallel, so the wildcard {sample} handles all the matching. But some steps must pass the files of all samples to a single command instead of running once per sample, and the wildcard approach no longer applies. Variant calling is a typical example: the BAM files of every sample go into one command.

1. First, define a variable at the top of the workflow file:

samples = ["A", "B", "C"]

2. Then use Snakemake's built-in expand function:

bam=expand("mapped/{sample}.sorted.bam", sample=samples) 
rule calling:
    input:
        fa="data/genome.fa",
        bam=expand("mapped/{sample}.sorted.bam", sample=samples),
        bai=expand("mapped/{sample}.sorted.bam.bai", sample=samples)
    output:
        "calling/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

5. Adding More Rules

On top of fastqc we add fastp, hisat2, samtools sort, and htseq rules. For consecutive rules, the output files of one rule must be the input files of the next.

The input of rule all lists the final target files, declaring the goal at the top. fastqc cannot be chained to the other rules because its output is not the input of any other rule, so its output must be listed separately in rule all.

Once rule all names {sample_name}.count as a target, Snakemake determines the execution order by working backward from that target. It first looks for the rule whose output is {sample_name}.count, which is rule htseq; then it asks which rule produces htseq's input {sample_name}_nsorted.bam, which is rule samtools_sort; likewise, samtools_sort's input comes from the output of rule hisat2, whose input in turn comes from rule fastp. Tracing backward from the top therefore gives rule all > rule htseq > rule samtools_sort > rule hisat2 > rule fastp (rule fastqc hangs directly off rule all), and the actual execution order is the reverse.
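That backward-chaining logic can be sketched in a few lines of plain Python (a toy model, not Snakemake internals; the suffix table is hand-copied from the pipeline's rules):

```python
# Toy model of Snakemake's backward resolution: map each output suffix to
# (rule name, input suffix that rule requires).
RULES = {
    '.count':              ('htseq',         '_nsorted.bam'),
    '_nsorted.bam':        ('samtools_sort', '.sam'),
    '.sam':                ('hisat2',        '.1_1_clean.fastq.gz'),
    '.1_1_clean.fastq.gz': ('fastp',         '.1_1.fastq.gz'),
}

def resolve(target_suffix):
    """Walk backward from the target to the raw input, collecting rules;
    the real execution order is the reverse of the returned chain."""
    chain = []
    while target_suffix in RULES:
        rule, needed = RULES[target_suffix]
        chain.append(rule)
        target_suffix = needed
    return chain

print(resolve('.count'))
# ['htseq', 'samtools_sort', 'hisat2', 'fastp'] (executed in reverse order)
```

Changing the target to '.sam' prunes the chain to ['hisat2', 'fastp'], which mirrors how Snakemake skips downstream rules when rule all asks for an intermediate file.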

(SAMPLES,) = glob_wildcards("{sample}.1_1.fastq.gz")
rule all:
    input:
        expand('{sample_name}.1_1_fastqc.zip', sample_name=SAMPLES),
        expand('{sample_name}.count', sample_name=SAMPLES)
rule fastqc:
    input:
        fq='{sample_name}.1_1.fastq.gz'
    output:
        '{sample_name}.1_1_fastqc.zip'
    log:
        '{sample_name}.1_1.log'
    params:
        outdir='qc'
    shell:
        'fastqc {input[fq]} -o {params[outdir]} 1>{log[0]} 2>&1'
rule fastp:
    input:
        fq='{sample_name}.1_1.fastq.gz'
    output:
        '{sample_name}.1_1_clean.fastq.gz'
    log:
        '{sample_name}.fastp.log'
    params:
        outdir='fastp'
    shell:
        'fastp -i {input[fq]} -o {params[outdir]}/{output[0]}'
rule hisat2:
    input:
        clean_fq='{sample_name}.1_1_clean.fastq.gz'
    output:
        '{sample_name}.sam'
    log:
        '{sample_name}.hisat2.log'
    params:
        outdir='hisat2',
        index='index/hg19/genome'
    shell:
        'hisat2 -x {params[index]} -p 10 -U fastp/{input[clean_fq]} -S {params[outdir]}/{output[0]}'
rule samtools_sort:
    input:
        '{sample_name}.sam'
    output:
        '{sample_name}_nsorted.bam'
    shell:
        'samtools sort -o hisat2/{output[0]} hisat2/{input[0]}'
rule htseq:
    input:
        '{sample_name}_nsorted.bam'
    output:
        '{sample_name}.count'
    log:
        '{sample_name}.htseq.log'
    params:
        outdir='htseq',
        gtf='genome/gencode.v19.annotation.gtf'
    shell:
        'htseq-count -f bam -r name hisat2/{input[0]} {params[gtf]} >{output[0]}'

Do a dry run; add the -p flag so the terminal prints the shell commands:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
4 htseq
4 samtools_sort
21

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_clean.fastq.gz
log: SRR957678.fastp.log
jobid: 18
wildcards: sample_name=SRR957678

fastp -i SRR957678.1_1.fastq.gz -o fastp/SRR957678.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_fastqc.zip
log: SRR957679.1_1.log
jobid: 3
wildcards: sample_name=SRR957679

fastqc SRR957679.1_1.fastq.gz -o qc 1>SRR957679.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_fastqc.zip
log: SRR957680.1_1.log
jobid: 4
wildcards: sample_name=SRR957680

fastqc SRR957680.1_1.fastq.gz -o qc 1>SRR957680.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_clean.fastq.gz
log: SRR957679.fastp.log
jobid: 19
wildcards: sample_name=SRR957679

fastp -i SRR957679.1_1.fastq.gz -o fastp/SRR957679.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_clean.fastq.gz
log: SRR957677.fastp.log
jobid: 17
wildcards: sample_name=SRR957677

fastp -i SRR957677.1_1.fastq.gz -o fastp/SRR957677.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_fastqc.zip
log: SRR957678.1_1.log
jobid: 2
wildcards: sample_name=SRR957678

fastqc SRR957678.1_1.fastq.gz -o qc 1>SRR957678.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 1
wildcards: sample_name=SRR957677

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1

[Fri Oct 23 10:23:24 2020]
rule fastp:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_clean.fastq.gz
log: SRR957680.fastp.log
jobid: 20
wildcards: sample_name=SRR957680

fastp -i SRR957680.1_1.fastq.gz -o fastp/SRR957680.1_1_clean.fastq.gz

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957679.1_1_clean.fastq.gz
output: SRR957679.sam
log: SRR957679.hisat2.log
jobid: 15
wildcards: sample_name=SRR957679

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957679.1_1_clean.fastq.gz -S hisat2/SRR957679.sam

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957680.1_1_clean.fastq.gz
output: SRR957680.sam
log: SRR957680.hisat2.log
jobid: 16
wildcards: sample_name=SRR957680

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957680.1_1_clean.fastq.gz -S hisat2/SRR957680.sam

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957678.1_1_clean.fastq.gz
output: SRR957678.sam
log: SRR957678.hisat2.log
jobid: 14
wildcards: sample_name=SRR957678

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957678.1_1_clean.fastq.gz -S hisat2/SRR957678.sam

[Fri Oct 23 10:23:24 2020]
rule hisat2:
input: SRR957677.1_1_clean.fastq.gz
output: SRR957677.sam
log: SRR957677.hisat2.log
jobid: 13
wildcards: sample_name=SRR957677

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957677.1_1_clean.fastq.gz -S hisat2/SRR957677.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957677.sam
output: SRR957677_nsorted.bam
jobid: 9
wildcards: sample_name=SRR957677

samtools sort -o hisat2/SRR957677_nsorted.bam hisat2/SRR957677.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957679.sam
output: SRR957679_nsorted.bam
jobid: 11
wildcards: sample_name=SRR957679

samtools sort -o hisat2/SRR957679_nsorted.bam hisat2/SRR957679.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957680.sam
output: SRR957680_nsorted.bam
jobid: 12
wildcards: sample_name=SRR957680

samtools sort -o hisat2/SRR957680_nsorted.bam hisat2/SRR957680.sam

[Fri Oct 23 10:23:24 2020]
rule samtools_sort:
input: SRR957678.sam
output: SRR957678_nsorted.bam
jobid: 10
wildcards: sample_name=SRR957678

samtools sort -o hisat2/SRR957678_nsorted.bam hisat2/SRR957678.sam

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957678_nsorted.bam
output: SRR957678.count
log: SRR957678.htseq.log
jobid: 6
wildcards: sample_name=SRR957678

htseq-count -f bam -r name hisat2/SRR957678_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957678.count

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957680_nsorted.bam
output: SRR957680.count
log: SRR957680.htseq.log
jobid: 8
wildcards: sample_name=SRR957680

htseq-count -f bam -r name hisat2/SRR957680_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957680.count

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957677_nsorted.bam
output: SRR957677.count
log: SRR957677.htseq.log
jobid: 5
wildcards: sample_name=SRR957677

htseq-count -f bam -r name hisat2/SRR957677_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957677.count

[Fri Oct 23 10:23:24 2020]
rule htseq:
input: SRR957679_nsorted.bam
output: SRR957679.count
log: SRR957679.htseq.log
jobid: 7
wildcards: sample_name=SRR957679

htseq-count -f bam -r name hisat2/SRR957679_nsorted.bam genome/gencode.v19.annotation.gtf >SRR957679.count

[Fri Oct 23 10:23:24 2020]
localrule all:
input: SRR957677.1_1_fastqc.zip, SRR957678.1_1_fastqc.zip, SRR957679.1_1_fastqc.zip, SRR957680.1_1_fastqc.zip, SRR957677.count, SRR957678.count, SRR957679.count, SRR957680.count
jobid: 0

Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
4 htseq
4 samtools_sort
21
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Suppose we change {sample_name}.count in rule all to {sample_name}.sam: the workflow then only runs through rule hisat2 and skips everything after it. Tracing backward from the top, the target {sample_name}.sam is the output of rule hisat2, so hisat2 is the last rule to run and the downstream rules are pruned.

(SAMPLES,) = glob_wildcards("{sample}.1_1.fastq.gz")
rule all:
    input:
        expand('{sample_name}.1_1_fastqc.zip', sample_name=SAMPLES),
        expand('{sample_name}.sam', sample_name=SAMPLES)

Do a dry run; add the -p flag so the terminal prints the shell commands:

$ snakemake -np -s rnaseqflow.py 
Building DAG of jobs...
Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
13

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_fastqc.zip
log: SRR957679.1_1.log
jobid: 3
wildcards: sample_name=SRR957679

fastqc SRR957679.1_1.fastq.gz -o qc 1>SRR957679.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_fastqc.zip
log: SRR957680.1_1.log
jobid: 4
wildcards: sample_name=SRR957680

fastqc SRR957680.1_1.fastq.gz -o qc 1>SRR957680.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_clean.fastq.gz
log: SRR957678.fastp.log
jobid: 10
wildcards: sample_name=SRR957678

fastp -i SRR957678.1_1.fastq.gz -o fastp/SRR957678.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957680.1_1.fastq.gz
output: SRR957680.1_1_clean.fastq.gz
log: SRR957680.fastp.log
jobid: 12
wildcards: sample_name=SRR957680

fastp -i SRR957680.1_1.fastq.gz -o fastp/SRR957680.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957678.1_1.fastq.gz
output: SRR957678.1_1_fastqc.zip
log: SRR957678.1_1.log
jobid: 2
wildcards: sample_name=SRR957678

fastqc SRR957678.1_1.fastq.gz -o qc 1>SRR957678.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastqc:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_fastqc.zip
log: SRR957677.1_1.log
jobid: 1
wildcards: sample_name=SRR957677

fastqc SRR957677.1_1.fastq.gz -o qc 1>SRR957677.1_1.log 2>&1

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957677.1_1.fastq.gz
output: SRR957677.1_1_clean.fastq.gz
log: SRR957677.fastp.log
jobid: 9
wildcards: sample_name=SRR957677

fastp -i SRR957677.1_1.fastq.gz -o fastp/SRR957677.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule fastp:
input: SRR957679.1_1.fastq.gz
output: SRR957679.1_1_clean.fastq.gz
log: SRR957679.fastp.log
jobid: 11
wildcards: sample_name=SRR957679

fastp -i SRR957679.1_1.fastq.gz -o fastp/SRR957679.1_1_clean.fastq.gz

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957678.1_1_clean.fastq.gz
output: SRR957678.sam
log: SRR957678.hisat2.log
jobid: 6
wildcards: sample_name=SRR957678

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957678.1_1_clean.fastq.gz -S hisat2/SRR957678.sam

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957680.1_1_clean.fastq.gz
output: SRR957680.sam
log: SRR957680.hisat2.log
jobid: 8
wildcards: sample_name=SRR957680

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957680.1_1_clean.fastq.gz -S hisat2/SRR957680.sam

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957677.1_1_clean.fastq.gz
output: SRR957677.sam
log: SRR957677.hisat2.log
jobid: 5
wildcards: sample_name=SRR957677

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957677.1_1_clean.fastq.gz -S hisat2/SRR957677.sam

[Fri Oct 23 10:43:36 2020]
rule hisat2:
input: SRR957679.1_1_clean.fastq.gz
output: SRR957679.sam
log: SRR957679.hisat2.log
jobid: 7
wildcards: sample_name=SRR957679

hisat2 -x index/hg19/genome -p 10 -U fastp/SRR957679.1_1_clean.fastq.gz -S hisat2/SRR957679.sam

[Fri Oct 23 10:43:36 2020]
localrule all:
input: SRR957677.1_1_fastqc.zip, SRR957678.1_1_fastqc.zip, SRR957679.1_1_fastqc.zip, SRR957680.1_1_fastqc.zip, SRR957677.sam, SRR957678.sam, SRR957679.sam, SRR957680.sam
jobid: 0

Job counts:
count jobs
1 all
4 fastp
4 fastqc
4 hisat2
13
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

6. Generating a Workflow Diagram

Snakemake can export the whole workflow as a diagram (in combination with the dot command) in formats such as PNG or PDF. The command is:

$ snakemake --dag -s rnaseqflow.py| dot -Tpdf > test.pdf
Building DAG of jobs...

References

1. Snakemake: a simple and handy tool for building bioinformatics pipelines

2. A beginner's handbook: easily set up bioinformatics analysis pipelines with Snakemake
