Hello! I am looking for help with rarefying my data from Fungal ITS1 amplicon sequencing.
I have demultiplexed, paired-end reads in .fastq format. I imported my data into QIIME2 (installed on a VirtualBox), cut the primers using cutadapt, joined the forward and reverse reads using vsearch join-pairs, and quality filtered them using quality-filter q-score. I dereplicated my sequences using vsearch dereplicate-sequences. I used qiime tools view to determine appropriate sampling depth and used this to rarefy during diversity core-metrics. I see I can also generate a stand-alone rarefied feature table using feature-table rarefy, but I would like to know if there is a way I can rarefy my sequences themselves.
I am interested in evaluating the diversity within one specific taxonomic ranking (the fungal order Pleosporales). To look at relative abundance of fungal taxa, I normally export from QIIME2 my processed representative sequences, then BLAST them against a UNITE database, and run the output through FHiTINGS. My understanding is that my sequences need to be rarefied to an equal sequence depth in order to look at the diversity of a specific taxon. Is there a way I can rarefy the sequences I have stored in my quality-filtered artifact?
I think there may be a terminology gap around the word 'rarify'. I use this word to mean 'subsample to an equal depth', which is one way to address differences in sequencing depth (a.k.a. sampling effort) so that samples with 10k reads can be reasonably compared to samples with 200k reads. Without some way to account for depth, many metrics including those related to alpha and beta diversity will conclude that depth itself is a primary driver of community composition and can mask/overwhelm biological trends.
I'm not sure your sequences in your .fastq file have a 'depth'. Like, they have a length and a quality and a G/C composition... the concept of sampling effort only relates to their counts.
Am I making sense here? What sort of bias are you trying to control through rarifying your sequences that would not be controlled by rarifying your sequence counts?
Thank you for your response and apologies for the confusion! You are right, I am trying to subsample all my reads to an equal depth or in effect rarefy my sequence counts. In QIIME2, I see that I can rarefy a feature table using the feature-table rarefy plugin.
However, I think I need to rarefy the sequence counts in my samples themselves (samples-qfilter.qza), not just in the feature table (table.qza). Since the feature table retains no information about the Do you know of a plugin that will allow be to rarefy the artifact containing my samples and their representative sequences? Would the following commands work?
So after you subsample, it's possible that some features will no longer appear in the output table, and that filter-seqs command serves to remove those sequences.
You can perform another filter to select just the features in the order Pleosporales.
At this point you have a file like Pleosporales-only-rarefied-feature-table.qza
which you could use to calculate Shannon diversity on a per-sample basis.
(This table does not include any sequences. Is that OK?)
What do you mean by samples? Have you opened up that file to see what's really inside of it? It may hold less than you think...
Oh okay, thank you for clarifying the different tables!
I am still a little confused about producing these tables. Maybe it would be helpful if my paste my workflow below. Here are 5 steps I follow to obtain relative abundance of all fungal taxa in each of my samples.
After the 4th step, I use the tools extract command to export my sequences, separated into different fastq.gz files corresponding to their original sample file IDs. Then, I BLAST each sample file against a UNITE database, and pass the blasted.out files to FHiTINGS to parse. I do not normally use a deblur denoise or vsearch dereplicate-sequences step.
My understanding of deblur is that it was originally designed for bacterial 16S and now better accommodates 18S through deblur denoise-other. However, I am working with fungal ITS. Is it appropriate to use denoise here? I am concerned about potential data loss.
Instead, I have chosen to use vsearch dereplicate-sequences after step 4 to produce my feature tables and begin by diversity analysis. Am I correct in thinking that the only way to examine diversity metrics is by creating a feature table and the only ways to create feature tables are through 1) DADA2; 2) deblur; 3) vsearch dereplicate-sequences?
Thanks for the update. I'll start with specific questions. (We can discuss your pipeline later, if you would like.)
Deblur makes use of a reference, which is both a pro and a con; your results are biased by / consistent with that reference database. You could denoise with dada2, which does not use a reference. Or you could cluster reads into OTUs using vsearch cluster-features-de-novo after dereplicating them, also without a reference.
Yes, those are you 3 options within Qiime2 at the moment. Using vsearch dereplicate-sequencesis usually followed by some sort of OTU clustering or counting, but just dereplication does give you a feature table.
This is correct, and this is a key point! To get the Shannon diversity of reads, you first have to count those reads, and the same goes for ASVs / OTUs / mammals , etc.
The plugin vsearch dereplicate-sequences counts reads. DADA2 will denoises reads into ASVs and count those. The plugin vsearch dereplicate-sequences + cluster-features-de-novo will cluster reads into de novo OTUs, and report counts of OTUs.