Rarefy representative sequences not feature-table

Hello! I am looking for help with rarefying my data from Fungal ITS1 amplicon sequencing.

I have demultiplexed, paired-end reads in .fastq format. I imported my data into QIIME2 (installed on a VirtualBox), cut the primers using cutadapt, joined the forward and reverse reads using vsearch join-pairs, and quality filtered them using quality-filter q-score. I dereplicated my sequences using vsearch dereplicate-sequences. I used qiime tools view to determine appropriate sampling depth and used this to rarefy during diversity core-metrics. I see I can also generate a stand-alone rarefied feature table using feature-table rarefy, but I would like to know if there is a way I can rarefy my sequences themselves.

I am interested in evaluating the diversity within one specific taxonomic ranking (the fungal order Pleosporales). To look at relative abundance of fungal taxa, I normally export from QIIME2 my processed representative sequences, then BLAST them against a UNITE database, and run the output through FHiTINGS. My understanding is that my sequences need to be rarefied to an equal sequence depth in order to look at the diversity of a specific taxon. Is there a way I can rarefy the sequences I have stored in my quality-filtered artifact?

Thanks!
-Sam

Hello Sam,

I think there may be a terminology gap around the word 'rarify'. I use this word to mean 'subsample to an equal depth', which is one way to address differences in sequencing depth (a.k.a. sampling effort) so that samples with 10k reads can be reasonably compared to samples with 200k reads. Without some way to account for depth, many metrics including those related to alpha and beta diversity will conclude that depth itself is a primary driver of community composition and can mask/overwhelm biological trends.

I'm not sure your sequences in your .fastq file have a 'depth'. Like, they have a length and a quality and a G/C composition... the concept of sampling effort only relates to their counts.

Am I making sense here? What sort of bias are you trying to control through rarifying your sequences that would not be controlled by rarifying your sequence counts?

1 Like

Hi Colin,

Thank you for your response and apologies for the confusion! You are right, I am trying to subsample all my reads to an equal depth or in effect rarefy my sequence counts. In QIIME2, I see that I can rarefy a feature table using the feature-table rarefy plugin.

However, I think I need to rarefy the sequence counts in my samples themselves (samples-qfilter.qza), not just in the feature table (table.qza). Since the feature table retains no information about the Do you know of a plugin that will allow be to rarefy the artifact containing my samples and their representative sequences? Would the following commands work?

qiime feature-table rarefy \
--i-table feature-table.qza \
--p-sampling-depth 9557 \
--o-rarefied-table rarefied-feature-table.qza

qiime feature-table filter-seqs \
--i-data samples-qfiltered.qza \
--i-table rarefied-feature-table.qza \
--o-filtered-data samples-rarefied.qza

Once each sample has an equal sampling depth, I want to extract the sequences for each sample, then analyze the shannon alpha diversity of the fungal order Pleosporales in each sample.

Thanks,
Sam

Ah, OK.

So after you subsample, it's possible that some features will no longer appear in the output table, and that filter-seqs command serves to remove those sequences.

You can perform another filter to select just the features in the order Pleosporales.

At this point you have a file like
Pleosporales-only-rarefied-feature-table.qza
which you could use to calculate Shannon diversity on a per-sample basis.
(This table does not include any sequences. Is that OK?)

What do you mean by samples? Have you opened up that file to see what's really inside of it? It may hold less than you think...


Let's zoom out
:telescope: :whale2:

Take a look at how denoising works within Qiime2, and how it makes two feature tables.

FeatureTable[Frequency]

FeatureID:SampleID S1 S2 S3 S4
ASV1 2 2 4 5
ASV2 2 2 40 5
ASV3 22 24 4 5

This table includes counts, so you can rarefy them, and also calculate diversity measures.

FeatureTable[Sequence]

FeatureID Sequence
ASV1 ATGCTA
ASV2 CGTATT
ASV3 CGGCTA

This table includes no counts, so it's not related to rarefying or diversity.

Also note that the samples (S1, S2...) are not in the same table as your sequences!

These are two different artifacts. You match them up based on shared FeatureID (ASV1, ASV2...).

2 Likes

Oh okay, thank you for clarifying the different tables!

I am still a little confused about producing these tables. Maybe it would be helpful if my paste my workflow below. Here are 5 steps I follow to obtain relative abundance of all fungal taxa in each of my samples.

After the 4th step, I use the tools extract command to export my sequences, separated into different fastq.gz files corresponding to their original sample file IDs. Then, I BLAST each sample file against a UNITE database, and pass the blasted.out files to FHiTINGS to parse. I do not normally use a deblur denoise or vsearch dereplicate-sequences step.

My understanding of deblur is that it was originally designed for bacterial 16S and now better accommodates 18S through deblur denoise-other. However, I am working with fungal ITS. Is it appropriate to use denoise here? I am concerned about potential data loss.

Instead, I have chosen to use vsearch dereplicate-sequences after step 4 to produce my feature tables and begin by diversity analysis. Am I correct in thinking that the only way to examine diversity metrics is by creating a feature table and the only ways to create feature tables are through 1) DADA2; 2) deblur; 3) vsearch dereplicate-sequences?

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]'\
--input-path /absolutepath/manifest_file.txt \
--output-path demux.qza \
--input-format PairedEndFastqManifestPhred33V2
qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux.qza \
--p-front-f CTTGGTCATTTAGAGGAAGTAA \
--p-front-r GCTGCGTTCTTCATCGATGC \
--p-match-read-wildcards \
--p-match-adapter-wildcards \
--p-discard-untrimmed \
--o-trimmed-sequences trimmed.qza
qiime vsearch join-pairs \
--i-demultiplexed-seqs trimmed.qza \
--p-allowmergestagger \
--o-joined-sequences joined.qza
qiime quality-filter q-score \
--i-demux joined.qza \
--p-min-quality 30 \
--p-quality-window 3 \
--o-filtered-sequences qfilter-seqs.qza \
--o-filter-stats qfilter-stats.qza
mkdir extracted-reads
qiime tools extract \
--input-path qfilter-seqs.qza \
--output-path extracted-reads \

Good morning, Sam,

Thanks for the update. I'll start with specific questions. (We can discuss your pipeline later, if you would like.)

Deblur makes use of a reference, which is both a pro and a con; your results are biased by / consistent with that reference database. You could denoise with dada2, which does not use a reference. Or you could cluster reads into OTUs using vsearch cluster-features-de-novo after dereplicating them, also without a reference.

Yes, those are you 3 options within Qiime2 at the moment. Using vsearch dereplicate-sequencesis usually followed by some sort of OTU clustering or counting, but just dereplication does give you a feature table.

This is correct, and this is a key point! To get the Shannon diversity of reads, you first have to count those reads, and the same goes for ASVs / OTUs / mammals :pig2: :mouse2: :goat: :bat: , etc.

The plugin vsearch dereplicate-sequences counts reads. DADA2 will denoises reads into ASVs and count those. The plugin vsearch dereplicate-sequences + cluster-features-de-novo will cluster reads into de novo OTUs, and report counts of OTUs.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.