Over 120K "unique" sequences called in Microbiome dataset

  • Version of QIIME 2 you are running: qiime2/2020.11, installed via the environment module command.
  • What is the exact command or commands you ran? Copy and paste please.

##Start QIIME2 environment:
module load qiime2/2020.11

##CasavaOneEightSingleLanePerSampleDirFmt - the format of data that comes off the Illumina platform.
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-format CasavaOneEightSingleLanePerSampleDirFmt --input-path /global/scratch/neempatel/BS_16S_Raw/All_16S/Data/ --output-path /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.qza
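##For reference, this import expects per-sample files named per the Casava 1.8 convention, e.g. (hypothetical sample name shown):
##  sample1_S1_L001_R1_001.fastq.gz
##  sample1_S1_L001_R2_001.fastq.gz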

##qiime demux summarize --help

##to visualize the data
qiime demux summarize --i-data /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.qza --p-n 10000 --o-visualization /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.qzv

##Trim primers + spacer, pad, index, and linker - trimming at the primer sequence should remove all of them.

qiime cutadapt trim-paired --i-demultiplexed-sequences /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.qza --p-cores 4 --p-front-f CCTACGGGNBGCASCAG --p-front-r GACTACNVGGGTATCTAATCC --o-trimmed-sequences /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.trimmed.qza --verbose

##To denoise and generate the feature (ASV) table:
qiime dada2 denoise-paired --i-demultiplexed-seqs /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.trimmed.qza --p-trunc-len-f 0 --p-trunc-len-r 0 --p-trim-left-f 225 --p-trim-left-r 150 --o-table /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.table.qza --o-representative-sequences /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.rep-seqs.qza --o-denoising-stats /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.stats.qza

##Turning the stats into qzv objects for visualization:
qiime feature-table summarize --i-table /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.table.qza --o-visualization /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.table.qzv

qiime feature-table tabulate-seqs --i-data /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.rep-seqs.qza --o-visualization /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.rep-seqs.qzv

qiime metadata tabulate --m-input-file /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.stats.qza --o-visualization /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.stats.qzv

##Taxonomy classification/assignment:
##qiime feature-classifier --help
##classify-sklearn
qiime feature-classifier classify-sklearn --i-classifier /global/scratch/neempatel/Databases/silva-138-99-nb-classifier.qza --i-reads /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.rep-seqs.qza --o-classification /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.taxonomy.qza

##To visualize:
qiime metadata tabulate --m-input-file /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.taxonomy.qza --o-visualization /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.taxonomy.qzv

  • What is the exact error message, if applicable? No error message.

  • Specific issue/question:
    Essentially, with this script I receive output suggesting that approximately 560 of my 664 sample communities are completely unique from one another, which is impossible considering they were collected from the same four field sites. At minimum, roughly 25% of the communities should resemble one another, even if they are distinct from the other 75% of the dataset.

I've consulted all the resources available to me, including the posts on this forum, the internet at large, and folks in my community who have used QIIME 2 in their own work. Thanks to that help I was able to work out some of the issues with my code, so this output is already a big improvement over my previous results, which called all 664 of my sample communities completely unique, with over 150K unique sequences. I am happy to provide any outputs or additional information that would be helpful. Thank you so much in advance for your time and help; I truly appreciate it!

Hi @neempatel,

There may very well be that many unique sequences! However, here is a quick drive-by comment :racing_car:.

I noticed your cutadapt command above. I would highly suggest you set the following additional parameters (full example command below):

 --p-match-adapter-wildcards \
 --p-discard-untrimmed \

Using --p-match-adapter-wildcards will ensure that the ambiguous IUPAC bases in your primer sequences correctly match the corresponding bases in your reads. For example, your forward primer CCTACGGGNBGCASCAG contains the codes N (any base), B (C, G, or T), and S (C or G); without this flag those positions are treated as literal characters and counted as mismatches.

Using --p-discard-untrimmed will discard any read pair in which either primer sequence cannot be detected. I consider this an essential QA/QC step. For example, if the quality of the sequence where the primer resides is poor (for R1 and/or R2), cutadapt may be unable to detect the primer, leaving that portion of the sequence within the read, and ultimately within your dataset, unless you choose to discard the read pair - which is what I'd suggest.

Otherwise, if these untrimmed reads are retained, you'll likely keep longer reads in the output (relative to the trimmed reads), which in turn artificially inflates the number of sequence variants. That is, we are trying to obtain single-base resolution here, so any indels will likely be called as distinct sequence variants.
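For example, your full command would then look something like this (same paths and primers as your original command; only the two flags are added):

qiime cutadapt trim-paired --i-demultiplexed-sequences /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.qza --p-cores 4 --p-front-f CCTACGGGNBGCASCAG --p-front-r GACTACNVGGGTATCTAATCC --p-match-adapter-wildcards --p-discard-untrimmed --o-trimmed-sequences /global/scratch/neempatel/BS_16S_Raw/All_16S/QIIMEobjects/demux.trimmed.qza --verbose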

-Cheers!
-Mike

Thank you so much, Mike! This is extremely helpful; I will run a few tests with these additional parameters and report back. I really appreciate your help!
