Dada2. Filtering steps discards most of my reads

Hello,
so I am working with a subset of my dataset (10 libraries out of 200). I have demultiplexed them using the process_shortreads from the "Stacks" pipeline. Afterwards, I uploaded them into Qiime, and I cut the primers using Cutadapt. Then I denoised the sequences with DADA2. After checking the dada2 stast I see that most of my reads did not pass the quality filtering steps and I ended up with very few rep-seqs and low total feature count. The commands I used were the following ones:

  • The visualization for the demux-seqs is the following one:
    demux-ITS-2.qzv (311.6 KB)

  • Then I trimmed the primer with cutadapt with the following command:
    qiime cutadapt trim-paired
    --i-demultiplexed-sequences demux-ITS-2.qza
    --p-adapter-f ^CAWCGATGAAGAACGYAGC...GAAYCATCGARTCTTTGAACGC
    --p-adapter-r ^RGTTTCTTTTCCTCCGCTTA...CCTTGTTACGACTTYTMCTTCC
    --o-trimmed-sequences demultiplexed_trimmed_sequences-ITS-2.qza

  • The visualization after trimming is this:
    demultiplexed_trimmed_sequences-ITS-2.qzv (317.3 KB)

  • Then I denoised the sequences with the following command:
    qiime dada2 denoise-paired
    --i-demultiplexed-seqs demultiplexed_trimmed_sequences-ITS-2.qza
    --p-trunc-len-f 0
    --p-trunc-len-r 0
    --p-pooling-method "pseudo"
    --p-chimera-method "consensus"
    --p-n-threads 10
    --o-table feature-table-ITS-2-dada_2.qza
    --o-representative-sequences rep-seqs-ITS-2-dada_2.qza
    --o-denoising-stats dada-2-stats-ITS-2.qza

  • The outputs from dada2 are the following ones:
    dada-2-stats-ITS-2.qzv (1.2 MB)
    feature-table-ITS-2-dada_2.qzv (405.8 KB)
    rep-seqs-ITS-2-dada_2.qzv (206.3 KB)

I have seen other post related to this question, however they were dealing with the trunc length of the sequences. So I am wondering if there are other ways to solve the problem since I am not truncating the sequences. I believe that changing the default values for "--p-max-ee-f", "--p-max-ee-r", and "--p-trunc-q" would help me. However, I do not fully understand their meaning.

Can someone help with this situation?

Thank you very much in advance.

In case you have not found them already, here is the documentation for qiime dada2 denoise-paired, which describes those settings.

"--p-max-ee-f", "--p-max-ee-r", are for your Forward and Reverse reads.
ee is Expected Error, described here.

"--p-trunc-q" may not help very much, see here.


I think this may be the most important part to explore for your sequences as well!

Your quality drops a lot by 140 in the forward read and by 160 in the reverse read.
Because of this, you may want to truncate your reads at this location
--p-trunc-len-f 140
--p-trunc-len-r 160

Try some of these settings, especially the --p-trunc-len-* settings, and let us know what you find!

1 Like

Hi @colinbrislawn,

Thank you for your feedback. I will try to apply the truncation of the reads and see the what the results are!! I will also read the information you provided me with regarding the expected error parameters. I will keep you posted!! :grinning: :grinning:

1 Like

Hello @colinbrislawn after applying the truncation of the seqs the results of DADA2 seem to be better. See the qzv of the dada2 stats
dada-2-stats-ITS-2.qzv (1.2 MB)
feature-table-ITS-2-dada_2.qzv (436.5 KB)
rep-seqs-ITS-2-dada_2.qzv (272.8 KB)

The code for dada2 was the following one:

qiime dada2 denoise-paired
--i-demultiplexed-seqs demux-trimmed.qza
--p-trunc-len-f 150
--p-trunc-len-r 170
--p-pooling-method "pseudo"
--p-chimera-method "consensus"
--p-n-threads 10
--p-max-ee-f 3
--p-max-ee-r 3
--p-trunc-q 1
--o-table feature-table-ITS-2-dada_2.qza
--o-representative-sequences rep-seqs-ITS-2-dada_2.qza
--o-denoising-stats dada-2-stats-ITS-2.qza

However, I have incurred into another problem when performing the taxonomic assignment. Firstly I downloaded all the ITS-2 sequences for Magnoliopsida from genebank with Rescript. The command I used was he following:

qiime rescript get-ncbi-data
--p-query 'txid3398[ORGN] AND (ITS2 OR Internal Transcribed Spacer 2) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]'
--p-n-jobs 5
--o-sequences ncbi-refseqs-unfiltered_magnoliopsida.qza
--o-taxonomy ncbi-refseqs-taxonomy-unfiltered_magnoliopsida.qza

Then, I continued to taxonomic assignment with the following command:

qiime feature-classifier classify-consensus-vsearch
--i-query rep-seqs-ITS-2-dada_2.qza
--i-reference-reads ref_seqs_magnoliopsida.qza
--i-reference-taxonomy taxonomy_magnoliopsida.qza
--p-perc-identity 0.97
--p-no-top-hits-only "True"
--p-threads 10
--o-classification /mnt/c/Users/Guill/Desktop/Illumina_trials/ITS-2/Cutadapt_pipeline/Taxonomy/taxonomic_asingment.qza
--o-search-results /mnt/c/Users/Guill/Desktop/Illumina_trials/ITS-2/Cutadapt_pipeline/Taxonomy/taxo_results.qza
--verbose > taxonomic_assignment.txt

Everything ran perfectly, however the resolution fo the taxonomic assignment was really poor. Most of the representative sequences were just assigned to the level of order.

taxonomy.qzv (1.2 MB)

Do you have any idea why this could be happening ?

Agreed! About 20% of the reads cannot be joined, but this is pretty good given the quality.

ITS classification is harder than 16S classification for a number of reasons. For a first run with a custom database (that you just made!), I think getting to order is pretty good!

Yes. It's all about classify-consensus-vsearch.

Aftering finding hits to the database with vsearch, the taxonomy of these hits are compaired. For each level where >50% of the hits agree a classification is given, until you reach a level where there is not longer a consensus.

For your data, the level in which hits no longer agree is mostly Order. :person_shrugging:


Your progress here is awesome. If I were you, I would keep these results and continue with the paper. If reviewer 3 thinks there is a better way to classify ITS-2, they will certainly tell you.

EDIT: If you do want to improve the database, try using RESCRIPt's 'extract-seq-segments', or adding a few "out-group" taxa to your database in case you amplified non-Magnoliopsida taxa.

2 Likes

@colinbrislawn
Thanks for your feedback! I will have a look to what you suggest and come back to you with the results. I hope I can get further than order since it is essential to our study to reach to the genus level at least!! Luckily for us we used two markers that can be used in combination to reach further taxonomic levels. We are studying plant-bird interactions so stopping at the order level it is a bit meaningless.

Thank a lot again!! I am very grateful to Qiime2 community and specially to you moderators

1 Like