Issues finding sampling depth for Rarefaction

lucky_endophyte · May 10, 2024, 8:34pm

Hi everyone,

I'm starting my alpha and beta diversity analyses, but I'm a little perplexed by sampling depth. It seems my rarefaction curve never plateaus, no matter the sampling depth. This is a smaller sample set, but I plan on doing a rarefaction curve for 3 more sample sets and then combining all my sample sets together (~100 samples) and analyzing their alpha/beta diversities all together as well.
I have a sample that is very much an outlier (over 2 million frequencies!) and some that are somewhat lower in frequency count as well. Should I drop the samples with too low a frequency AND the one with the highest frequency count, since it's so dissimilar?

I've taken a look at the "moving pictures" tutorial and I'm having trouble getting my rarefaction curve to look something like theirs.

If anyone can point me in a direction of what I'm doing wrong (or missing), that would be greatly appreciated!
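For readers following along, here is a minimal sketch of what a rarefaction curve measures: subsample each sample to a fixed depth without replacement and count how many distinct features remain. This is a toy illustration, not QIIME 2's implementation; the function name `rarefy_observed` and the toy samples are hypothetical. It also shows why depth choice decides which samples survive: any sample with fewer reads than the chosen depth is dropped.

```python
import random

def rarefy_observed(counts, depth, seed=0):
    """Subsample `depth` reads without replacement and return the number of
    distinct features observed, or None if the sample has fewer reads than
    `depth` (such a sample would be dropped at that depth)."""
    # Expand the count vector into one entry per read (fine for toy data).
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rng = random.Random(seed)
    return len(set(rng.sample(reads, depth)))

# Hypothetical toy samples, not taken from the thread.
even = {f"otu{i}": 10 for i in range(50)}  # 500 reads spread over 50 features
shallow = {"otu0": 30, "otu1": 20}         # only 50 reads in total

for depth in (10, 100, 500):
    print(depth, rarefy_observed(even, depth), rarefy_observed(shallow, depth))
```

A curve that keeps climbing as `depth` grows means new features are still turning up with every additional read, which is the behavior described above.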

colinbrislawn · May 10, 2024, 10:44pm

Hello Alisa,

Some of these samples are super diverse. If you are able to share, what kind of environment did they come from?

How many features are shared between samples?

(I ask because if there is a barcode sneaking into the features themselves, this will inflate diversity. And this would also cause features to be found in only one sample each.)
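One way to put a number on "how many features are shared" is to compute the fraction of features that occur in exactly one sample. The sketch below assumes the feature table has already been loaded into a plain nested dictionary; the function name and toy table are hypothetical.

```python
def single_sample_fraction(table):
    """table maps sample_id -> {feature_id: count}.
    Returns the fraction of features with a nonzero count in exactly one sample."""
    presence = {}
    for counts in table.values():
        for feature, n in counts.items():
            if n > 0:
                presence[feature] = presence.get(feature, 0) + 1
    if not presence:
        return 0.0
    singles = sum(1 for k in presence.values() if k == 1)
    return singles / len(presence)

# Hypothetical toy table: only feature "a" is shared between samples.
toy = {
    "S1": {"a": 5, "b": 1},
    "S2": {"a": 3, "c": 2},
    "S3": {"a": 1, "c": 0, "d": 4},
}
print(single_sample_fraction(toy))
```

A value close to 1.0 would be consistent with something sample-specific (such as a leftover barcode) being part of the feature sequences.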

lucky_endophyte · May 11, 2024, 1:52am

These are environmental leaf samples that have been surface sterilized. The environment is quite diverse/dusty chaparral. These are sequences of the fungal ITS1 gene.

I also did quite a bit of filtering on these samples beforehand. Specifically this after dereplication / OTU clustering:

qiime feature-table filter-features \
  --i-table table-native.qza \
  --p-min-frequency 8 \
  --p-min-samples 2 \
  --o-filtered-table filtered_table-native.qza

and then Chimera filtering using qiime vsearch uchime-denovo.
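As a rough illustration of what the two thresholds in that filter mean, the sketch below keeps a feature only if its total count across all samples is at least `min_frequency` and it is observed in at least `min_samples` samples. It is a toy re-implementation on a nested dictionary, not the plugin's code, and the table contents are made up.

```python
def filter_features(table, min_frequency=8, min_samples=2):
    """table maps sample_id -> {feature_id: count}.
    Keep features whose summed count is >= min_frequency AND that have a
    nonzero count in >= min_samples samples."""
    totals, n_samples = {}, {}
    for counts in table.values():
        for feature, n in counts.items():
            if n > 0:
                totals[feature] = totals.get(feature, 0) + n
                n_samples[feature] = n_samples.get(feature, 0) + 1
    keep = {
        f for f in totals
        if totals[f] >= min_frequency and n_samples[f] >= min_samples
    }
    return {
        sample: {f: n for f, n in counts.items() if f in keep}
        for sample, counts in table.items()
    }

# Hypothetical toy table.
toy = {
    "S1": {"a": 6, "b": 20, "c": 1},
    "S2": {"a": 3, "c": 2},
}
print(filter_features(toy))
```

Here "b" is abundant but seen in only one sample, and "c" is seen in two samples but with only 3 reads in total, so only "a" survives.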

These samples were pre-processed in Trim Galore and all adapters/primers were stripped off.

When you say barcoding, do you mean possible NGS index contamination?

Here is what the taxonomy histogram looks like:

Is there any way to find out if it's barcode contamination or something else inflating the feature count? Thanks so much!

colinbrislawn · May 11, 2024, 4:47pm

Well, that means that each feature is in more than one sample, so it may not be barcode contamination like I originally suspected.

Yes. Because the barcodes/indexes are different for every sample, if they end up in the feature sequence they will make the same feature appear unique to each sample. So instead of having 2k features across 100 samples, you get 2k features x 100 indexes = 200k apparent features.
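The arithmetic can be checked with a small sketch. The sequences and index strings below are placeholders (not real ITS reads or Illumina indexes); the point is only that prefixing each read with a per-sample index turns every shared sequence into a sample-specific one.

```python
def count_features(reads_by_sample):
    """Number of distinct sequences across all samples."""
    return len({seq for reads in reads_by_sample.values() for seq in reads})

# Placeholder data: 2,000 "biological" sequences present in all 100 samples.
biological = [f"SEQ{i:04d}" for i in range(2000)]
indexes = [f"IDX{j:03d}" for j in range(100)]

# Indexes trimmed off correctly: every sample holds the same sequences.
clean = {idx: list(biological) for idx in indexes}
# Indexes left attached to the reads: each sequence becomes sample-specific.
contaminated = {idx: [idx + seq for seq in biological] for idx in indexes}

print(count_features(clean), count_features(contaminated))
```

With the indexes left on, the same 2,000 sequences are counted as 200,000 distinct features.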

"These are sequences of the fungal ITS1 gene."

Thank you for mentioning this.

What paper introduced these primers?
How many features did they see when they tested the primers?

lucky_endophyte · May 13, 2024, 6:44pm

Hi Colin,
I'm actually using primer pools of ITS1 (forward) and ITS2 (rev) as outlined in Illumina's ITS protocol:
https://www.illumina.com/content/dam/illumina-marketing/documents/products/appnotes/its-metagenomics-app-note-1270-2018-001-web.pdf and https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/metagenomic/fungal-metagenomic-demonstrated-protocol-1000000064940-01.pdf

But to answer your question about the high number of features, having over 20k features isn't uncommon in endophyte community research, at least from what I've found, although this sample set with the high frequencies is definitely making me scratch my head to say the least.
Here is the list of observed features for this dataset:

I've checked my pipeline and it doesn't seem I've made a mistake during dereplication either:

qiime vsearch dereplicate-sequences \
  --i-sequences demux-merged-native.qza \
  --o-dereplicated-table table-native.qza \
  --o-dereplicated-sequences rep-seqs-native.qza

# Clustering with 97% identity:

qiime vsearch cluster-features-de-novo \
  --i-table table-native.qza \
  --i-sequences rep-seqs-native.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-trim-97-native.qza \
  --o-clustered-sequences rep-seqs-trim-97-native.qza

After this step I further filtered the OTU table as outlined above and removed chimeras using the vsearch plugin. I even went as far as filtering out some irrelevant taxa after assigning taxonomy with vsearch and the UNITE eukaryote database.
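For anyone unfamiliar with the dereplication step, here is a toy sketch of the idea: identical reads are collapsed into one feature with a count per sample. This is only an illustration using exact string matching on made-up reads, not how vsearch is implemented.

```python
from collections import Counter

def dereplicate(reads_by_sample):
    """Collapse identical reads into per-sample counts keyed by exact sequence."""
    return {sample: Counter(reads) for sample, reads in reads_by_sample.items()}

# Hypothetical toy reads.
toy_reads = {
    "S1": ["ACGT", "ACGT", "ACGA"],
    "S2": ["ACGT", "TTTT"],
}
table = dereplicate(toy_reads)
n_features = len(set().union(*table.values()))
print(n_features, dict(table["S1"]))
```

Because matching is exact, a single mismatch (for example a sequencing error) creates a new feature at this stage, which is part of why the feature count before clustering and chimera removal can be very large.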

Any input is helpful and your time is greatly appreciated!

colinbrislawn · May 13, 2024, 9:20pm

I am out of ideas!

Maybe your samples are just very diverse.

I suppose you can cautiously continue with analysis and see if reviewer three complains about it.

lucky_endophyte · May 16, 2024, 1:47am

Hi everyone,
I apologize for posting so much on this forum, I'm a newbie at bioinformatics and having issues with my MiSeq paired-end data.

I've been having issues with my alpha-rarefaction curve not plateauing despite trying different sampling depths. I've since backtracked, found some adapter contamination, and removed it with Trim Galore, but I'm still getting the same rarefaction issues.

Right now, I've been experimenting with the UCHIME de novo chimera removal. So far I've kept the borderline chimeras; if I remove the borderline chimeric sequences as well, I'm left with very few OTUs.

Can anyone point me in any kind of direction here? The DADA2 filter ends up filtering out all of my sequences so we're trying not to use that (we want both R1 and R2 files).

So far my pipeline looks like this (after importing and merging via vsearch and dereplicating):

qiime vsearch cluster-features-de-novo \
  --i-table table-native2.qza \
  --i-sequences rep-seqs-native2.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-trim-97-native2.qza \
  --o-clustered-sequences rep-seqs-trim-97-native2.qza

# Remove low-abundance OTUs

qiime feature-table filter-features \
  --i-table table-native2.qza \
  --p-min-frequency 5 \
  --p-min-samples 3 \
  --o-filtered-table filtered_table-native2.qza

# Filter chimeric sequences

qiime vsearch uchime-denovo \
  --i-table filtered_table-native2.qza \
  --i-sequences filtered-rep-seqs-native2.qza \
  --output-dir uchime_output

qiime feature-table filter-features \
  --i-table filtered_table-native2.qza \
  --m-metadata-file uchime_output/chimeras.qza \
  --p-exclude-ids \
  --o-filtered-table uchime_output/table-nonchimeric-native.qza

qiime feature-table filter-seqs \
  --i-data filtered-rep-seqs-native2.qza \
  --m-metadata-file uchime_output/chimeras.qza \
  --p-exclude-ids \
  --o-filtered-data uchime_output/rep-seqs-nonchimeric-natives.qza

qiime feature-table summarize \
  --i-table uchime_output/table-nonchimeric-native.qza \
  --o-visualization uchime_output/table-nonchimeric-native.qzv

Here is the result when borderline chimeras are included:

And here is the result after filtering out both chimeras and borderline chimeras:

colinbrislawn · May 21, 2024, 12:24am

Hello again Alisa,

(Looks like your new thread was merged back into our old thread. Always helpful to keep related questions together.)

Now that I've seen your full pipeline, I have a few more suggestions about the number of features.

Yes, many OTUs are going to be chimeric, so filtering those out should help.

Zooming out, OTU inflation has always been a problem. See

The modern denoising methods were built to solve this exact problem, and they work great!

Would you like to post those results and try to get DADA2 working?

Let's try to get DADA2 working!

lucky_endophyte · May 23, 2024, 6:14pm

Hi again Colin!

I basically scrapped my whole process and started using the DADA2 filter on my R1 reads only. I'm getting much better results. Turns out, I had some PhiX contamination from my Illumina run! Thanks for all your help in my previous posts!
