Poor results: please, I need support

Good morning,

I'm getting poor results in my analysis: about 500 reference sequences, and only 30% of them are matched in the taxonomy assignment. How can I obtain more ref seqs/ASVs and improve the matching ratio? What if I remove the --p-discard-untrimmed option from the trim command? Thank you in advance!

I start from 20 FASTQ files, paired-end Illumina, 300 bp.
Here are the commands I entered, in order:

qiime cutadapt trim-paired --i-demultiplexed-sequences demux-paired-end.qza --p-adapter-f ' CCTACGGGNGGCWGCAG... GGATTAGAWACCCBNGTAGTC ' --p-adapter-r ' GACTACNVGGGTWTCTAATCC... CTGCWGCCNCCCGTAGG ' --p-cores 4 --o-trimmed-sequences trimmed-seqs.qza --p-discard-untrimmed --verbose

=== Summary ===
Total read pairs processed: 58,375
Read 1 with adapter: 50,754 (86.9%)
Read 2 with adapter: 48,590 (83.2%)
Pairs that were too short: 82 (0.1%)
Pairs written (passing filters): 43,664 (74.8%)

qiime demux summarize --i-data trimmed-seqs.qza --o-visualization demux.qzv

qiime dada2 denoise-paired --i-demultiplexed-seqs trimmed-seqs.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 284 --p-trunc-len-r 236 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

qiime feature-classifier classify-consensus-vsearch --i-query rep-seqs.qza --i-reference-reads silva-138-99-seqs.qza --i-reference-taxonomy silva-138-99-tax.qza --p-perc-identity 0.99 --p-threads 4 --o-classification assigned_taxa.qza --verbose

Matching query sequences: 162 of 526 (30.80%)

demux.qzv (321.1 KB)

Hi @leandro_di_gloria welcome to :qiime2:!

I think you'll greatly improve your results if you simply add the following flag to your command: --p-match-adapter-wildcards. Otherwise the ambiguous IUPAC bases like BWN... will count as mismatches and increase the chances of being unable to find and trim the primers.

Given that you are providing the primers in both the normal and reverse-complemented forms, I assume you are expecting your sequence data in mixed orientation? As in this case:

Here is the updated command with other minor modifications. Note that I removed the ' quotes and the ...:

qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux-paired-end.qza \
    --p-adapter-f CCTACGGGNGGCWGCAG GGATTAGAWACCCBNGTAGTC \
    --p-adapter-r GACTACNVGGGTWTCTAATCC CTGCWGCCNCCCGTAGG \
    --p-cores 4 \
    --o-trimmed-sequences trimmed-seqs.qza \
    --p-discard-untrimmed \
    --p-match-adapter-wildcards \
    --verbose

Note, if your reads are in mixed orientation you'll want to orient your reads in the same direction using another tool, like the RESCRIPt plugin's rescript orient-seqs command.
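That step could be sketched like this (a sketch only, assuming RESCRIPt is installed; the option names are what I'd expect from the plugin, so verify them against `qiime rescript orient-seqs --help` in your version):

```shell
# Orient all sequences to match the reference so downstream primer
# trimming only has to search one direction.
qiime rescript orient-seqs \
    --i-sequences your-seqs.qza \
    --i-reference-sequences silva-138-99-seqs.qza \
    --o-oriented-seqs oriented-seqs.qza \
    --o-unmatched-seqs unmatched-seqs.qza
```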

If you are not expecting reads in mixed orientation you can run the command like so:

qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux-paired-end.qza \
    --p-adapter-f  CCTACGGGNGGCWGCAG  \
    --p-adapter-r GACTACNVGGGTWTCTAATCC  \
    --p-cores 4 \
    --o-trimmed-sequences trimmed-seqs.qza \
    --p-discard-untrimmed \
    --p-match-adapter-wildcards \
    --verbose

Keep us informed. :slight_smile:
-Mike

Hi Mike, thank you for your quick reply!

I added the reverse complement primer just in case; indeed, only some reads (about 10) had the reverse too. I've seen that adding it does not affect the trimming of reads that have only the normal primers (am I wrong?), so I wrote it too.

Anyway, trying with "--p-match-adapter-wildcards" I get the following summary:
Total read pairs processed: 58,375
Read 1 with adapter: 50,917 (87.2%)
Read 2 with adapter: 48,717 (83.5%)
Pairs that were too short: 55,256 (94.7%)
Pairs written (passing filters): 27 (0.0%)

I partially resolved my problem with a more severe cut at denoising: too many reads were being filtered out, and only about 1.1% of the input ended up merged. Now, truncating at nt 236 and 203, far fewer reads are filtered and about 50% of the input reads merge. I'm still interested in your advice on the cutadapt trim, though: how can I improve its output further? And why am I now getting so many "pairs too short"?

Thank you so much

No problem. :slight_smile:

It will not hurt to throw it in, but you'll have to do the work of orienting the reads. That is, if these are indeed in mixed orientation. Your sequencing facility should have provided you with all of the details on how your data was generated.

When I see low values like this, it is often due to using the incorrect primer sequences. If you read through this post you'll see that I inquired as to which V3V4 primer was used. For example, your reverse primer is slightly different from two other commonly used sequences:

- Your 805R primer: GACTACNVGGGTWTCTAATCC
- Alternate 805R primer: GACTACHVGGGTATCTAATCC
- 785R primer: GACTACHVGGGTATCTAAKCC

I suspect that the reverse primer you are using is incorrect and might be one of the other two? :thinking:
Perhaps give one of those other ones a try, or add all of them to your cutadapt command.
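The "add all of them" option could look like the sketch below. I'm assuming here that q2-cutadapt's list parameters can be repeated on the command line, and I'm using the 5'-end `--p-front` options rather than `--p-adapter`, since these primers sit at the start of the reads; check `qiime cutadapt trim-paired --help` before relying on either assumption:

```shell
# Search for the forward primer plus all three reverse-primer variants
# in one pass; reads where no variant is found are discarded.
qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux-paired-end.qza \
    --p-front-f CCTACGGGNGGCWGCAG \
    --p-front-r GACTACNVGGGTWTCTAATCC \
    --p-front-r GACTACHVGGGTATCTAATCC \
    --p-front-r GACTACHVGGGTATCTAAKCC \
    --p-cores 4 \
    --p-discard-untrimmed \
    --o-trimmed-sequences trimmed-seqs.qza \
    --verbose
```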

They may be detected at the 3' end due to potential sequence read-through, not because they exist at the 5' end. Though I would not expect this to be the case for the V3V4 region, only for very short amplicon data. But if much of your data is shorter than the typical length of the V3V4 region (~460 bp), then this could be why. :man_shrugging:

Unless you know for sure that your reads are in mixed orientation, I'd stick with searching for primers in the correct orientation from the 5' end. Otherwise you'll risk spurious sequence trimming, which may be what is occurring here.

Have you tried without using the reverse complement sequence? It is weird that the "pairs too short" count drastically increased. That is, it could be spuriously finding / trimming a primer towards the 3' end of the sequence (the reverse complement), and when it trims... it removes everything from that primer sequence back towards the 5' end (i.e. the beginning) of that read.

Perhaps manually inspect several of the sequences and see where these primers (and their reverse complements) reside within your sequences.
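A quick way to do that inspection from the shell, expanding the IUPAC codes by hand into grep character classes (the example read here is made up for illustration):

```shell
# Hypothetical read with the forward primer at its 5' end.
read="CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCG"

# Forward primer CCTACGGGNGGCWGCAG (N -> [ACGT], W -> [AT]);
# -o prints the match, -b its byte offset within the read.
echo "$read" | grep -obE 'CCTACGGG[ACGT]GGC[AT]GCAG'
# -> 0:CCTACGGGAGGCAGCAG  (found at position 0, the 5' end)

# Reverse complement of the 805R primer, GGATTAGAWACCCBNGTAGTC
# (W -> [AT], B -> [CGT], N -> [ACGT]):
if echo "$read" | grep -qE 'GGATTAGA[AT]ACCC[CGT][ACGT]GTAGTC'; then
    echo "reverse complement found"
else
    echo "reverse complement not found"
fi
```

If the reverse complement shows up well inside real reads, that would explain cutadapt trimming them down to almost nothing.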

See this thread about minimum read overlap required to merge reads.

Good morning Mike,

yes, I have tried without the reverse complement sequences, but the "pairs too short" count was still high. After many tries, I've seen that writing the sequences without the ' and the ... gives me this problem, so I wrote them with those symbols.
Moreover, I discovered that --p-match-adapter-wildcards is set to True by default.

Here are my attempts with different R primers:

My R primer
Total read pairs processed: 58,375
Read 1 with adapter: 48,183 (82.5%)
Read 2 with adapter: 45,777 (78.4%)
Pairs that were too short: 81 (0.1%)
Pairs written (passing filters): 38,463 (65.9%)

Alternate 805R primer
Total read pairs processed: 58,375
Read 1 with adapter: 48,183 (82.5%)
Read 2 with adapter: 45,746 (78.4%)
Pairs that were too short: 81 (0.1%)
Pairs written (passing filters): 38,441 (65.9%)

785R primer
Total read pairs processed: 58,375
Read 1 with adapter: 48,183 (82.5%)
Read 2 with adapter: 47,525 (81.4%)
Pairs that were too short: 81 (0.1%)
Pairs written (passing filters): 39,847 (68.3%)

My R primer with reverse complementary primers
Total read pairs processed: 58,375
Read 1 with adapter: 50,754 (86.9%)
Read 2 with adapter: 48,590 (83.2%)
Pairs that were too short: 82 (0.1%)
Pairs written (passing filters): 43,664 (74.8%) ******* the best one for the moment *****

785 R primer with reverse complementary primers
Total read pairs processed: 58,375
Read 1 with adapter: 50,754 (86.9%)
Read 2 with adapter: 47,789 (81.9%)
Pairs that were too short: 82 (0.1%)
Pairs written (passing filters): 42,567 (72.9%)

As regards merging, I think I should increase my truncation value on the forward read just a little, because 236 + 201 (reads) + 17 + 21 (primers) - 460 = 15 nt of overlap... a shame, because I got a lot of reads merged with this setting! In this preprint https://www.biorxiv.org/content/10.1101/843144v1.full.pdf I have read that the amplicon size is 444 bp... is this correct?
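That arithmetic can be double-checked with a quick shell sketch (variable names are mine; it assumes cutadapt already removed the primers, so the reads span only the insert between them):

```shell
# Expected DADA2 read overlap = trunc_f + trunc_r - insert length,
# where insert = amplicon - both primer lengths.
trunc_f=236; trunc_r=201
primer_f=17; primer_r=21
amplicon=460    # typical V3V4 length mentioned above
insert=$((amplicon - primer_f - primer_r))
echo "overlap: $((trunc_f + trunc_r - insert)) nt"    # overlap: 15 nt

# With the 444 bp amplicon size reported in the linked preprint:
echo "overlap: $((trunc_f + trunc_r - (444 - primer_f - primer_r))) nt"    # overlap: 31 nt
```

So if the 444 bp figure is right, there is noticeably more overlap margin than the 15 nt computed from 460 bp.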

What do you think?

Thank you again

Errg... this is indeed frustrating, no?

Oh, right! I keep confusing this with --p-match-read-wildcards :man_facepalming:

:thinking: You could take a combination of your 805R primer and the 785R primer and use this primer sequence:
GACTACNVGGGTWTCTAAKCC

I'm not sure if there are other versions of the forward primer though...

But in all honesty, it is not unusual to lose ~15% of your data, though I typically see a loss of around 4-7%.

Looks like it.

I think increasing your trim value is worth a shot.

Another option... is to analyze the data using only the forward reads. I've had to do this on several occasions when my reverse reads were really poor. :frowning_face:
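A forward-only run could be sketched like this (a sketch only; I'm assuming denoise-single accepts the paired-end artifact and uses just the forward reads, and the 236 truncation value is simply carried over from above; check `qiime dada2 denoise-single --help` for your version):

```shell
# Denoise using only the forward reads, sidestepping the merge step
# entirely when the reverse reads are too poor to merge reliably.
qiime dada2 denoise-single \
    --i-demultiplexed-seqs trimmed-seqs.qza \
    --p-trim-left 0 \
    --p-trunc-len 236 \
    --o-table table-single.qza \
    --o-representative-sequences rep-seqs-single.qza \
    --o-denoising-stats denoising-stats-single.qza
```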

Hi Mike, I tried with that primer sequence too, but the initial one still gives the best outcome. If it's not unusual to lose 15%, then I'll maintain the setting with 805R. Thank you anyway.

As regards the filtering step with DADA2, I tried to trim more while adding the option --p-min-overlap 20, in order to test the real length of my amplicon, and I've seen that I was already around my overlap limit with 241 and 201 (fewer reads filtered, but far fewer merged too!). Before switching to only the forward reads, what do you think of these denoising stats? Are they acceptable?

This is my last message on this post, thank you so much for all your help.

Leandro

denoising-stats.qzv (1.2 MB)

Tough call. In this case I often compare a forward-only analysis with the merged-pairs analysis, then pick the one that best answers my question(s). Very often I end up using only the forward reads for V3V4.

One sanity check to consider: perhaps forgo cutadapt (on the off-chance that it is causing problems) and just use DADA2's trim options to remove the region where the primers would reside.

You're welcome! Good luck and keep us posted!