I am new to microbiome analysis and QIIME 2. I found published paired-end data (https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA601757) and would like to generate an OTU table with QIIME 2 for downstream analysis.
My first step is to import the data using qiime tools import:
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path manifest_file.tsv \
--output-path paired-end-demux.qza \
--input-format PairedEndFastqManifestPhred33V2
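For context, manifest_file.tsv for the PairedEndFastqManifestPhred33V2 format is a tab-separated file with the columns sample-id, forward-absolute-filepath and reverse-absolute-filepath. The following is only a sketch of how such a file could be built. It assumes the downloaded reads are named <sample>_1.fastq.gz and <sample>_2.fastq.gz in one directory (a hypothetical naming scheme), and make_manifest is an illustrative helper, not part of QIIME 2.

```shell
# Sketch: build a manifest for the PairedEndFastqManifestPhred33V2 format.
# Assumption: reads are named <sample>_1.fastq.gz / <sample>_2.fastq.gz.
# make_manifest is a hypothetical helper, not a QIIME 2 command.
make_manifest() (
  # The manifest needs absolute paths, so resolve the directory first.
  dir=$(cd "$1" && pwd) || return 1
  printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
  for fwd in "$dir"/*_1.fastq.gz; do
    [ -e "$fwd" ] || continue            # no match: skip the literal glob
    sample=$(basename "$fwd" _1.fastq.gz)
    rev="$dir/${sample}_2.fastq.gz"
    # Skip samples that lack a reverse read instead of writing a broken row.
    [ -e "$rev" ] || { echo "missing reverse read for $sample" >&2; continue; }
    printf '%s\t%s\t%s\n' "$sample" "$fwd" "$rev"
  done
)
```

It would be used as: make_manifest /path/to/fastq_dir > manifest_file.tsv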
Following the “Moving Pictures” tutorial, I would like to use DADA2 to process the data. The DADA2 website mentions that non-biological nucleotides (e.g. primers, adapters, linkers) need to be removed. When I checked the sequences in R, I saw that primers are present in the data. I therefore decided to run qiime cutadapt trim-paired before qiime dada2 denoise-paired:
qiime cutadapt trim-paired \
--i-demultiplexed-sequences paired-end-demux.qza \
--p-front-f CCTACGGG \
--p-front-r GACTAC \
--o-trimmed-sequences trimmed-seqs.qza \
--verbose
qiime demux summarize \
--i-data trimmed-seqs.qza \
--o-visualization demux.qzv
qiime dada2 denoise-paired \
--i-demultiplexed-seqs trimmed-seqs.qza \
--p-trim-left-f 0 \
--p-trunc-len-f 251 \
--p-trim-left-r 0 \
--p-trunc-len-r 244 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats stats.qza \
--verbose
qiime metadata tabulate \
--m-input-file stats.qza \
--o-visualization stats.qzv
I chose 251 and 244 from the Quality Plot in demux.qzv so that the "Middle of Box" scores are all above 20. However, in stats.qzv the "percentage of input passed filter" values are all very low, and there are only around 80 OTUs in the final OTU table.
If I use paired-end-demux.qza in DADA2 directly, without qiime cutadapt trim-paired, the result is very different:
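One thing that may be worth checking here: the documentation for --p-trunc-len-f and --p-trunc-len-r says that reads shorter than the truncation value are discarded, and trimming primers shortens the reads. The sketch below is a rough way to look at this on a single FASTQ file. It assumes an uncompressed FASTQ with exactly four lines per record, and both helper names (count_prefix, count_min_len) are hypothetical.

```shell
# Sketch: quick checks on an uncompressed FASTQ file. The sequence is taken
# to be the 2nd line of each 4-line record (assumption: no wrapped lines).
# Both helper names are hypothetical.

# Count reads whose sequence starts with a given prefix, e.g. a primer start.
count_prefix() {    # usage: count_prefix reads.fastq CCTACGGG
  awk -v p="$2" 'NR % 4 == 2 { n++; if (index($0, p) == 1) hit++ }
                 END { printf "%d of %d reads start with %s\n", hit, n, p }' "$1"
}

# Count reads that are at least min_len long. Reads shorter than the
# truncation length cannot pass the DADA2 filter.
count_min_len() {   # usage: count_min_len reads.fastq 251
  awk -v m="$2" 'NR % 4 == 2 { n++; if (length($0) >= m) ok++ }
                 END { printf "%d of %d reads are >= %d nt\n", ok, n, m }' "$1"
}
```

The per-sample FASTQ files can be obtained by running qiime tools export on trimmed-seqs.qza or paired-end-demux.qza and decompressing one file with gunzip -c before passing it to these helpers.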
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trim-left-f 0 \
--p-trunc-len-f 251 \
--p-trim-left-r 0 \
--p-trunc-len-r 244 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats stats.qza \
--verbose
qiime metadata tabulate \
--m-input-file stats.qza \
--o-visualization stats.qzv
I think I should set "--p-trim-left-f 8" and "--p-trim-left-r 6" to remove the primers. But even without removing the primers, the "percentage of input passed filter" improves a lot in stats.qzv, and there are around 650 OTUs in the final OTU table.
Does anyone know why the results are so different with and without cutadapt? Should I avoid using cutadapt before DADA2? The DADA2 website says that primers need to be removed before running DADA2.
In addition, the following are my steps to generate and export the OTU table using the Greengenes reference data, version 13.8, at a 97% similarity threshold. I got the Greengenes reference from: ftp://greengenes.microbio.me/greengenes_release/gg_13_8_otus/
Could you also check whether my steps are correct?
### match reference library
qiime vsearch cluster-features-closed-reference \
--i-table table.qza \
--i-sequences rep-seqs.qza \
--i-reference-sequences 97_otus-GG.qza \
--p-perc-identity 0.97 \
--o-clustered-table tbl-gg-97_nasal.qza \
--o-clustered-sequences rep-seqs-gg-97_nasal.qza \
--o-unmatched-sequences unmatched-gg-97_nasal.qza
### export to biom file
qiime tools export \
--input-path tbl-gg-97_nasal.qza \
--output-path feature-table
### convert to txt file
biom convert -i ~/nasal_all/chk_primer/feature-table/feature-table.biom -o feature-table.txt --to-tsv
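To get the OTU counts quoted above, the exported feature-table.txt can be summarized directly. This is only a sketch: it assumes the usual layout written by biom convert --to-tsv (comment lines starting with "#", the last of which is the "#OTU ID" header row), and table_dims is a hypothetical helper name.

```shell
# Sketch: count features (rows) and samples (columns) in a TSV written by
# `biom convert --to-tsv`. table_dims is a hypothetical helper.
# Assumption: lines starting with "#" are comments or the "#OTU ID" header.
table_dims() {   # usage: table_dims feature-table.txt
  awk -F '\t' '/^#/ { if (NF > 1) cols = NF - 1; next }  # header: columns minus the ID
               NF   { rows++ }                           # each non-empty data row is one feature
               END  { printf "%d features x %d samples\n", rows, cols }' "$1"
}
```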
Thank you very much for any insight and apologies for the long question.