Processing 18S metatranscriptomic reads extracted using SortMeRNA

Hi everybody!

I am trying to assign taxonomy to 18S rRNA metatranscriptomic reads that were originally extracted using SortMeRNA. I have created the following pipeline, and am just generally looking for input on whether or not this is the best approach. A concern I have, is that when I run through this it shows I have nearly the same amount of fungal reads in each sample (which it not expected at all), so this is originally what made me believe something is off. Please let me know, any input helps!

Step 1: Import the paired-end sequencing data in CasavaOneEightSingleLanePerSampleDirFmt format (example of format listed below)

sample1_S1_L001_R1_001.fastq.gz # Forward reads

sample1_S1_L001_R2_001.fastq.gz # Reverse reads

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path PE_seqs
--input-format CasavaOneEightSingleLanePerSampleDirFmt
--output-path PE.qza

Step 2: Generate a summary visualization of the demultiplexed sequences to assess data quality.

qiime demux summarize
--i-data PE.qza
--o-visualization PE.qzv

Step 3: Join paired-end reads to obtain complete sequences.

qiime vsearch merge-pairs
--i-demultiplexed-seqs PE.qza
--o-merged-sequences PE-joined.qza

Step 4: Dereplicate the joined sequences to obtain a table of unique sequences and their representative sequences.

qiime vsearch dereplicate-sequences
--i-sequences PE-joined.qza
--o-dereplicated-table PE-table.qza
--o-dereplicated-sequences PE-rep-seqs.qza

Step 5: Export the feature table in BIOM format and convert it to a tab-separated format.

qiime tools export
--input-path PE-table.qza
--output-path ASVTABLE_export

biom convert
-i ASVTABLE_export/feature-table.biom
-o ASVTABLE_export/otu_table.tsv
--to-tsv

Step 6: Remove the header from the exported table and replace a specific string (#OTU ID).

cd ASVTABLE_export
sed -i '1d' otu_table.tsv
sed -i 's/#OTU ID//' otu_table.tsv
cd ..

Step 7: Import the reference taxonomy data and reference sequences that will be used for downstream processing

Here, we are using the PR2 databased (most recent version as of 7/20/2023, https://pr2-database.org/). THIS DATABASE ALSO INCLUDES SOME BACTERIA AND PLASTIDS -> they will be removed during later processing points but it is important to keep this in mind as if a different database is used, step 16 will not apply.

qiime tools import --type 'FeatureData[Taxonomy]'
--input-format HeaderlessTSVTaxonomyFormat
--input-path pr2_version_5.0.0_SSU_mothur.tax
--output-path PR2-ref-tax.qza

qiime tools import --type 'FeatureData[Sequence]'
--input-path pr2_version_5.0.0_SSU_mothur.fasta
--output-path PR2-ref-seqs.qza

Step 8: Perform closed-reference clustering of the features against the reference database with a sequence identity threshold of 100%.

qiime vsearch cluster-features-closed-reference
--i-table PE-table.qza
--i-sequences PE-rep-seqs.qza
--i-reference-sequences PR2-ref-seqs.qza
--p-perc-identity 1
--p-threads 4
--o-clustered-table closed-ref-table.qza
--o-clustered-sequences closed-ref-seqs.qza
--o-unmatched-sequences closed-ref-unmatched-seqs.qza

Step 9: Generate a summary visualization of the closed-reference clustered feature table.

qiime feature-table summarize
--i-table closed-ref-table.qza
--o-visualization table.qzv
--m-sample-metadata-file sample_metadata.tsv

Step 10: Generate a visualization of the closed-reference clustered representative sequences.

qiime feature-table tabulate-seqs
--i-data closed-ref-seqs.qza
--o-visualization rep-seqs.qzv

Step 11: Filter samples from the feature table based on a minimum frequency of 1200.

qiime feature-table filter-samples
--i-table closed-ref-table.qza
--p-min-frequency 15000
--o-filtered-table filtered-table.qza

Step 12: Filter features (sequences) from the feature table based on a minimum frequency of 10.

qiime feature-table filter-features
--i-table filtered-table.qza
--p-min-frequency 10
--o-filtered-table feature-frequency-filtered-table.qza

Step 13: Classify the closed-reference clustered sequences using VSEARCH and the PR2 reference database with a sequence identity threshold of 99%.

qiime feature-classifier classify-consensus-vsearch
--i-query closed-ref-seqs.qza
--i-reference-reads PR2-ref-seqs.qza
--i-reference-taxonomy PR2-ref-tax.qza
--p-perc-identity
--o-classification taxonomy1.qza
--o-search-results search.results.qza
--p-threads 4

Hello @syrenawhitner :wave:

I've also made pipelines over the years, and I know validating them can be hard.

Before we dive into the pipeline, can you share with me what data you have for evaluation?

So, what was expected?


Let's go to the pipeline:

For my own notes, it looks like your pipeline goes
vsearch merge-pairs -> dereplicate-sequences -> cluster-features-closed-reference

This is classic closed-ref clustering counting with the common speed optimizations.

This method has well-understood tradeoffs, which makes it viable in some contexts despite it being endlessly criticized compared to newer methods.

One of the reasons it's easy to hate on closed-ref is that it includes a number of thresholds that are hard to justify.
--p-perc-identity 1, why not 99%?
--p-min-frequency 15000, why not 14K?
--p-min-frequency 10, why not lucky number 7 ?? :four_leaf_clover:

The counterargument is that none of this really matters;
closed-ref provides results that are similar to other methods
(You can make this argument using positive controls with known composition.)