Too many features?

Hi,

I received demultiplexed paired-end 16S rRNA V4 sequences and denoised them with DADA2, truncating at 250 bp, to obtain a feature table. However, I am worried that I have too many features.
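For context, the denoising command was along these lines (the file names are placeholders rather than my exact paths):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 250 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza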

As you can see in the screenshot, I have about 28,000 features. My samples come from a rhizobiome, and when inspecting the feature detail I can see that more than 75% of the features are present in fewer than 10 samples, at a frequency lower than 100. Is this reasonable, or should I filter for low variance/low frequency?

Cheers,
Pablo

Hi @Pablo_V,
The most likely culprit here is that barcodes, degenerate primers, or some other non-biological sequences are still in your reads. Can you double-check that and try removing them before running DADA2 again?

Hi @Mehrbod_Estaki,

Thank you for your reply.

Is there any way I can check that myself? I received the samples demultiplexed, i.e. paired-end reads for each sample, and I assumed they were ready to use with DADA2.

Cheers,
Pablo

Hi @Pablo_V,
You can either ask your sequencing facility directly what has or hasn't been done with the reads, or you can look at the raw sequencing files yourself. For example, take a look at a few records in one of the fastq files to see what the sequences look like, then check them against your primers/barcodes, etc.
You can also run those sequences through the q2-cutadapt trim-paired action with your primers; its output will tell you how many reads did or did not contain that target.
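For example, something like this prints the first two records of a gzipped R1 file (the file name is just a placeholder; point it at one of your own samples):

zcat sample_L001_R1_001.fastq.gz | head -8

If the primers are still there, the sequence lines (the second line of each four-line record) will start with them.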

Hi @Mehrbod_Estaki,

I have contacted the sequencing facility, and they said the demultiplexed V4 files were delivered without any non-biological sequences.

In any case, I have checked the raw reads, and this is the first record of an R1 file:
@M00984:174:000000000-CC7H8:1:1101:18679:1972 1:N:0:1
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCTTTTACGTCTGGCGTGTAATCCCGGGGCTCAACCCCGGGTCGGCCTGGGATCTGGGCGACCTGAGTGCCGAAGAGGGGAGTGGAATTTCCCGTGTTGGGGTGGAATGCGGAGGTTTGTGGAGGAAACACAGGGGCGGAGGGGACTCTCTGGGCTGTAAATGACGCTGAGGCGCGGAAGCGTGGGGGGCAGACCG
+
AABBBFFBBADB2FFGGGGGGGHGGGGGH5FGGGGHGGGGHH5EEAEE0EEGEGGG//3B44??FG2F33>E/<BF2F22CDGGGG/FGHHGG/--.>---.....</.0<CH.----;///0000;---.;C.99-.:;/;/F/B//;.;....-;99-;B///.----../;..;/.;.;///;.;..;---;-@--.-.////9BF....;//////...9.9A.9---;;..-9;@FF-9--.;..-

The sequence is about 250 bp long, so I think my reads do indeed contain only biological sequence.
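In case it helps, the length distribution across a whole file can be checked with a one-liner like this (file name is a placeholder):

zcat sample_L001_R1_001.fastq.gz | awk 'NR % 4 == 2 { print length($0) }' | sort -n | uniq -c

This prints the number of reads at each observed length.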

What, then, could be the issue with DADA2 producing so many features?

Cheers,
Pablo

Hi @Pablo_V,
Thanks for the update. A couple of additional questions:
What is the expected diversity of these samples based on the existing literature? Is it possible that you actually do have an environment this rich? Soil samples can truly be this rich.
How do the taxonomy assignments look for these reads? Are they being classified at a decent resolution, or are you getting lots of "Unassigned" and domain-only assignments?

Hi @Mehrbod_Estaki,

According to the literature, the expected diversity of the samples is a Shannon index of about 3-4.

The samples come from a rhizobiome environment, but it is not soil; it is a "soilless" culture (specifically, rockwool).

The taxonomy of the reads looks quite acceptable: I get very good resolution and not many unassigned features.

Moreover, after filtering out samples with fewer than 1,000 sequences and rare ASVs (below 0.1% of the mean sample depth), following a QIIME 2 SOP from the Langille Lab, I end up with about 800 features. However, when I instead filter for low counts (minimum count of 4, with prevalence in at least 20% of samples) and low variance (removing the bottom 10% based on interquartile range), following Dhariwal et al. 2017, I end up with about 180 features. Which of these two feature sets should I keep, 800 or 180?
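For reference, the first (Langille Lab-style) filtering corresponds to QIIME 2 commands roughly like these; the file names are placeholders, and the min-frequency of 30 for features stands in for whatever 0.1% of the mean sample depth works out to in your run:

qiime feature-table filter-samples \
  --i-table table.qza \
  --p-min-frequency 1000 \
  --o-filtered-table table-min1000.qza

qiime feature-table filter-features \
  --i-table table-min1000.qza \
  --p-min-frequency 30 \
  --o-filtered-table table-filtered.qza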

Cheers,
Pablo

Hi @Pablo_V,
Thanks for the update. If you're confident that all non-biological reads and primers have been removed, then I see no reason to doubt the results. As for filtering, that depends entirely on your data and your biological question, and unfortunately I can't make that choice for you. Some things to consider: removing rare or low-prevalence features can be sensible for some analyses (differential abundance tests, for example, can benefit from it), but some metrics, such as richness, Chao1, or presence/absence-based beta diversity metrics like Jaccard and unweighted UniFrac, are strongly influenced by it. So you may in fact be better off using both a filtered and an unfiltered table.

Others may disagree with this, but personally, when I have that many true singletons, I tend to remove them as a first step, even before running diversity analyses that rely on rare features. But that decision makes sense in my line of work, which deals with more homogeneous sample types (gut, skin, oral, etc.). If your samples originate from a wide range of diverse environments and their compositions are expected to be vastly different from each other, then perhaps the singletons carry more true information and should be kept.
Ultimately, one would hope that the main results would not change, as the distribution of true vs. false features should be similar across all samples.
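If you do go that route, a minimal sketch of dropping true singletons (features with a total count of 1 across the whole table) looks like this, with the table name as a placeholder:

qiime feature-table filter-features \
  --i-table table.qza \
  --p-min-frequency 2 \
  --o-filtered-table table-no-singletons.qza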

Hi @Mehrbod_Estaki,

Actually, I have not trimmed the 16S rRNA V4 primers from the demultiplexed reads. I will do that and see if there is any significant change in the number of features recognized.

Thank you for the insight on filtering. In my case, the samples come from considerably homogeneous environments, and I expect the bacterial communities to be more or less similar.

Cheers,
Pablo

Hi @Mehrbod_Estaki,

I have trimmed the 16S rRNA V4 primers from the demultiplexed reads, and the denoising results were bad, with only 15 features recognized. So I believe the primers had already been trimmed.

I therefore think that the number of features found in my sample set is plausible.

Thank you for your support!!!

Cheers,
Pablo

Hi @Pablo_V,
Hmm, that is strange behavior. Can you provide the exact cutadapt command you ran?
Do you mean that you ran q2-cutadapt and it worked as intended, removing primers from the reads? Or that there were no primers to remove?
If they were removed, that is good, and DADA2 should actually work better, so being left with only 15 features is probably the result of some other issue. Could you also provide the summary stats from your DADA2 run, please?

Hi @Mehrbod_Estaki,

I ran this command:
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end.qza \
  --p-cores 2 \
  --p-front-f CYGCGGTAATTCCAGCTC \
  --p-front-r AYGGTATCTRATCRTCTTYG \
  --p-discard-untrimmed \
  --p-no-indels \
  --o-trimmed-sequences reads_trimmed.qza

And I obtained this output:
reads-trimmed-summary.qzv (344.6 KB)

I mean that the primers were removed from only a very few samples, so I think the primers had already been removed by the sequencing center. I am waiting for them to confirm this.

This is the summary of the DADA2 stats:
dada2-table-summary.qzv (422.6 KB)

Cheers,
Pablo

Hi @Pablo_V,

Sorry, I meant the visualized artifact from DADA2's --o-denoising-stats output, not the table summary. BUT I think I know the problem even without it. Since you set the --p-discard-untrimmed flag in your cutadapt step, any sequences that don't contain the primer are discarded. And since your primers had already been removed by the sequencing facility, this essentially leaves you with a blank table going into DADA2. So this is of course not the right table to use, but it does confirm that your primers had in fact been removed from the reads.
This suggests to me that your high feature count is in all likelihood a true biological signal.
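If you ever want to confirm this directly on the raw reads, you can count how many R1 sequences still begin with your forward primer. Expanding the Y ambiguity code to C/T, something like this (file name is a placeholder) should print a number at or near zero:

zcat sample_L001_R1_001.fastq.gz | awk 'NR % 4 == 2' | grep -c -E '^C[CT]GCGGTAATTCCAGCTC'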

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.