Removing host bacterial sequences

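A bacteriophage (phage) is a virus that infects bacteria, and it can also act as genetic material that confers biological traits on its host. Phages can only replicate inside living bacteria and show strict host specificity, which is determined by the molecular structure and complementarity of the phage's adsorption apparatus and the receptors on the bacterial surface. Phage sequencing data are usually contaminated with reads from the host bacterium, so host sequences should be removed before assembly.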

I. bowtie2

The working directory looks like this:

├── index
├── SG15.fasta
├── SMP1_R1.clean.fq.gz
└── SMP1_R2.clean.fq.gz

1 directory, 3 files

1. Build the index

bowtie2-build SG15.fasta index/SG15
$ ls index/
SG15.1.bt2 SG15.2.bt2 SG15.3.bt2 SG15.4.bt2 SG15.rev.1.bt2 SG15.rev.2.bt2
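
Before aligning, it can be worth confirming that the index is complete. The sketch below is a minimal check, assuming a small index made of the six `.bt2` files listed above (large genomes produce `.bt2l` files instead). The function name `check_bt2_index` is hypothetical and is not part of bowtie2.

```shell
# check_bt2_index PREFIX
# Hypothetical helper: succeeds only if all six small-index files
# (PREFIX.1.bt2 ... PREFIX.rev.2.bt2) exist and are non-empty.
check_bt2_index() (
    prefix=$1
    rc=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -s "${prefix}.${suffix}.bt2" ]; then
            echo "missing or empty: ${prefix}.${suffix}.bt2" >&2
            rc=1
        fi
    done
    return "$rc"
)
```

For the layout used here, this would be called as `check_bt2_index index/SG15 && echo "index looks complete"`.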

2. Align the reads

bowtie2 -x index/SG15 -1 SMP1_R1.clean.fq.gz -2 SMP1_R2.clean.fq.gz -S smp1.sam -p 30
samtools view -bS smp1.sam > smp1.bam  # convert SAM to BAM

3. Discard the reads that aligned to the host

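samtools view -b -f 12 -F 256 smp1.bam > smp1.unmapped.bam
# -f keeps only reads with all of the given flag bits set; 12 = 4 (read unmapped) + 8 (mate unmapped), so neither mate of the pair aligned
# -F drops reads with any of the given flag bits set; 256 marks secondary (non-primary) alignments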
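
To make the flag arithmetic concrete, here is a small sketch that decodes the SAM flag bits relevant to this step and mimics the `-f 12 -F 256` test. The function names `describe_flag` and `keeps_read` are hypothetical helpers for illustration; only the bit values come from the SAM format.

```shell
# describe_flag FLAG: print the names of the relevant bits set in FLAG.
describe_flag() (
    flag=$1
    out=""
    [ $(( flag & 1 ))   -ne 0 ] && out="$out paired"
    [ $(( flag & 4 ))   -ne 0 ] && out="$out read_unmapped"
    [ $(( flag & 8 ))   -ne 0 ] && out="$out mate_unmapped"
    [ $(( flag & 64 ))  -ne 0 ] && out="$out first_in_pair"
    [ $(( flag & 128 )) -ne 0 ] && out="$out second_in_pair"
    [ $(( flag & 256 )) -ne 0 ] && out="$out secondary"
    echo "${out# }"
)

# keeps_read FLAG: succeed if a record with this flag would pass
# "-f 12 -F 256" (both bits 4 and 8 set, bit 256 not set).
keeps_read() {
    [ $(( $1 & 12 )) -eq 12 ] && [ $(( $1 & 256 )) -eq 0 ]
}

describe_flag 77   # paired read_unmapped mate_unmapped first_in_pair
describe_flag 69   # paired read_unmapped first_in_pair
```

A record with flag 77 passes the filter, whereas flag 69 (read unmapped but mate mapped) does not, because `-f` requires every bit in 12 to be set.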

4. Convert BAM to FASTQ

samtools sort -n smp1.unmapped.bam -O BAM -o smp1.unmapped.sort.bam  # sort by read name so that mates are adjacent
bedtools bamtofastq -i smp1.unmapped.sort.bam -fq smp1_remove_host_1.fastq -fq2 smp1_remove_host_2.fastq  # convert BAM to paired FASTQ with bedtools
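
A quick sanity check after the conversion is to confirm that both mate files contain the same number of reads. The sketch below assumes uncompressed FASTQ with exactly four lines per record and a trailing newline; `count_reads` and `pairs_in_sync` are hypothetical helper names.

```shell
# count_reads FILE: number of records in an uncompressed FASTQ file,
# assuming exactly 4 lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# pairs_in_sync FILE1 FILE2: succeed only if both files hold the
# same number of reads.
pairs_in_sync() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}
```

For the files produced above, this would be `pairs_in_sync smp1_remove_host_1.fastq smp1_remove_host_2.fastq || echo "mate files differ in length" >&2`. Equal counts do not prove that the reads are in the same order, so this is only a first check.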

5. Use bowtie2's --un-conc option directly

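bowtie2 -p 30 -x index/SG15 -1 SMP1_R1.clean.fq.gz -2 SMP1_R2.clean.fq.gz -S sample1.sam --un-conc uncon_bowtie/sample1.fq
# --un-conc writes the pairs that failed to align concordantly to uncon_bowtie/sample1.1.fq and uncon_bowtie/sample1.2.fq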

II. kneaddata

kneaddata -t 20 --input SMP1_R1.clean.fq.gz --input SMP1_R2.clean.fq.gz -db ./index/SG15 --output kneaddata/ --bypass-trim --remove-intermediate-output
# --remove-intermediate-output deletes the intermediate files
# -db specifies the bowtie2 index
ls
SMP1_R1.clean_kneaddata.log SMP1_R1.clean_kneaddata_SG15_bowtie2_paired_contam_1.fastq SMP1_R1.clean_kneaddata_SG15_bowtie2_unmatched_2_contam.fastq
SMP1_R1.clean_kneaddata_paired_1.fastq SMP1_R1.clean_kneaddata_SG15_bowtie2_paired_contam_2.fastq SMP1_R1.clean_kneaddata_unmatched_1.fastq
SMP1_R1.clean_kneaddata_paired_2.fastq SMP1_R1.clean_kneaddata_SG15_bowtie2_unmatched_1_contam.fastq SMP1_R1.clean_kneaddata_unmatched_2.fastq
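
The listing mixes host reads and cleaned reads, so it can help to sort the files by role. The sketch below classifies a file purely by the naming pattern seen in the listing above; it assumes, as a reading of that pattern rather than a documented guarantee, that `contam` files hold reads matching the host index, `paired` files hold clean pairs, and `unmatched` files hold clean reads whose mate was removed. `classify_kneaddata_file` is a hypothetical helper name.

```shell
# classify_kneaddata_file PATH: print the assumed role of a kneaddata
# output file, based only on its file name.
classify_kneaddata_file() {
    case "$(basename "$1")" in
        *.log)                            echo "log" ;;
        *_bowtie2_*contam*)               echo "host" ;;
        *_kneaddata_paired_[12].fastq)    echo "clean_paired" ;;
        *_kneaddata_unmatched_[12].fastq) echo "clean_unmatched" ;;
        *)                                echo "other" ;;
    esac
}
```

Looping over the output directory, for example `for f in kneaddata/*; do echo "$(classify_kneaddata_file "$f") $f"; done`, then shows which files to pass on to assembly (the `clean_paired` ones under this assumption).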

The official documentation describes the output files as follows:

kneaddata --input seq1.fastq --input seq2.fastq -db bact_rrna_db -db human_rna_db --output seq_out

This will output files in the folder seq_out named:

Files for just the bact_rrna_db database:

  • seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_1.fastq: Reads from the first mate in situation (1) above that were identified as belonging to the bact_rrna_db database.
  • seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_2.fastq: Reads from the second mate in situation (1) above that were identified as belonging to the bact_rrna_db database.
  • seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_1.fastq: Reads from the first mate in situation (1) above that were identified as NOT belonging to the bact_rrna_db database.
  • seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_2.fastq: Reads from the second mate in situation (1) above that were identified as NOT belonging to the bact_rrna_db database.

