Trimming fungal demultiplexed paired-end sequences before dada2

einamart · October 26, 2018, 12:54pm

Dear QIIME developers! First, thank you for the amazing job you are doing here at the forum.

Background: I am a newly educated medical doctor that started on my PhD project mid September. Bioinformatics and QIIME are completely new to me, but hopefully my questions aren´t too basic. My data consists of human samples, and we aim to examine fungal DNA. We have amplified the ITS-region using ITS1-30F/ITS1-217R as primers, which sequence is GTCCCTGCCCTTTGTACACA and TTTCGCTGCGTTCTTCATCG. Illumina MiSeq was used for 250-bp paired-end sequencing, and reads were already demultiplexed when I received them. Current analyses are based on a subset of our samples (performed on 6 samples).

First issue: We want to use dada2, and it seems essential to trim our primers before entering dada2. We know that our primers consist of 20 base pairs, so why can´t we just use --p-trim-left-f 20 \ --p-trim-left-r 20 in qiime dada2 denoise-paired?

Second issue: We are also concerned about the “read-through” issue (the complement of reverse primer showing up in forward read if it reaches the reverse primer, i.e. with short ITS amplicons), so we wanted to use qiime cutadapt to remove both primers and both complements. Our suggestion:

qiime cutadapt trim-paired \ --i-demultiplexed-sequences import.qza \ --p-adapter-f CGATGAAGAACGCAGCGAAA \ --p-front-f GTCCCTGCCCTTTGTACACA \ --p-adapter-r TGTGTACAAAGGGCAGGGAC \ --p-front-r TTTCGCTGCGTTCTTCATCG \ --o-trimmed-sequences trimmed_paired_end_cutadapt.qza

Searching for primer sequences with its complements in the fastq-files using BBEdit give several hits, and most are removed by the cutadapt. Still, 1 or 2 hits appear in the representative sequences, even though I tried every combination of ^ and $. Do you have any explanation why? Or some advice? We are also aware of the Trimmomatic tool. Do you have any experience with that, and is it better to solve the current task?

Third issue: Our representative sequences often start with base pairs pretty similiar to our primers, e.g. 15 of 20 bases exactly the same, and then some minor differences. The same applies for its complement. Is this some kind of sequencing error? Is it possible to fix this using error rate in cutadapt, or is it any other method to remove the bases (if it should be removed)? Do cutadapt accept any form of variation in primers, and if so, how do I implement it?

Thanks!

Nicholas_Bokulich · October 26, 2018, 1:07pm

That's fine — the readthrough is the bigger issue.

q2-cutadapt should work for this, and I am not sure why you are getting a few that miss unless if there are mismatches. You could try the --p-anywhere parameter if your primers are not flush with the 5' end.

You could also try q2-itsxpress, which will trim out just the ITS region.

Sounds like maybe degenerate bases are causing this issue. As far as I know, cutadapt can operate on degenerate bases but it looks like your primer sequences are non-degenerate. Increasing the error-rate, as you suggested, would also be another way to capture these.

Let us know if any of that works!

einamart · October 31, 2018, 2:27pm

Thank you for your suggestions!

I still found the (exact bases of the) forward primer and the complement of the reverse primer 2 times (in the same 2 sequences), and the reverse primer and the complement of the forward primer 1 time (the same sequence) in my rep-seqs. However, using ITSxpress, no primers or complement were found!

I did ITSxpress and further analyses on 5 samples, and compared it to cutadapt with --p-anywhere. The ITSxpress followed by dada2 resulted in 13 features and a total frequency of 81,583 compared to 688 and 143,645 from the cutadapt-method. The taxonomy from ITSxpress were for most cases classified at least to genus-level. Cutadapt yielded a lot of Fungi at Kingdom level (almost none from ITSxpress), and additionally gave some fungal classes not appearing in the ITSxpress method. I am somewhat confused – why is it such difference in feature and frequence number? And should I be able to reveal more from the ITSxpress method?

We actually are using degenerate primers. But that won´t change anything, I guess? Increasing the error rate to 0.5 resulted in fewer of these “almost-primers”, but some are still there. ITSxpress did not give “almost-primers”.

Regarding ITSxpress – using the following on 6 samples (the 5 from above + 1 more):

qiime itsxpress trim-pair-output-unmerged \ --i-per-sample-sequences import.qza \ --p-region ITS1 \ --p-taxa F \ --p-cluster-id 1.0 \ --p-threads 2 \ --o-trimmed trimmed_data_itsxpress.qza

I get the following error:

Filename_R2.fastq.gz is not a(n) FastqGzFormat file
Missing sequence for record beginning on line 17

I have read this topic:

suggesting using the cutadapt to remove 0-length reads, but I don´t understand how to find the read of interest. Any suggestion?

thermokarst · October 31, 2018, 6:22pm

Have you tried the anchoring syntax for cutadapt?

https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types

Nicholas_Bokulich · October 31, 2018, 8:57pm

That sounds strange. The itsxpress analysis sounds like it is yielding too few features. ccing: @Adam_Rivers in hopes he may provide some insight.

Excellent! That is what you should see

That is not so good, and pretty much a confirmation that cutadapt is not trimming adapters and read-through. This is clearly a read-through issue, since non-biological sequence data will not be classifiable.

Why cutadapt is not trimming properly is another question — see @thermokarst's advice:

Something is wrong with the format of that file. Unzip that file and look at line 17. E.g.,

gunzip -c Filename_R2.fastq.gz | head -n 20 | tail -n 4

I hope that helps!

einamart · November 1, 2018, 11:40am

Yes, I have, and then I found the reverse primer and the complement of the forward primer one time in rep-seqs (in same sequence). I rerun the analysis now with an increased error rate of 0.5, and both primers and complements are then trimmed! However, taxonomy looks quite similar though, with a lot of Fungi at kingdom level, so I guess I should read more about taxonomic assignment.

Adam_Rivers · November 1, 2018, 1:51pm

This itsxpress issue is the same is the issue posted here: q2-ITSxpress generates empty sequences? - #6 by ebolyen. If you send me a link to download your your import.qza i will look into it.

bsen2018 · November 6, 2018, 3:04am

We also came across similar results while using ITSxpress plugin followed by dada2, i.e. low no. of features compared with dada2 alone. I wish @Adam_Rivers could throw some light on why there is retrieval of low no. of features with ITSxpress. We are not sure which workflow would be technically correct and scientifically valid. Thanks!

BS

einamart · November 6, 2018, 12:55pm

@Adam_Rivers, sorry to keep you waiting.

import.qza (375.8 KB)

I generated this file using Casava 1.8 paired-end demultiplexed fastq.
I thought it might be a Casava-problem, so I tried the manifest method as well, but it just resulted in the same error. Thank you for looking into this!

Adam_Rivers · November 6, 2018, 2:24pm

Thanks for sending me the file this helps a lot. Does your imported library only have a single sample names 385 with 1302 reads in it?

einamart · November 6, 2018, 3:05pm

I´m sorry, but I´m not quite sure what you mean. The import-file was made using 1 sample (2 fastq-files) named 385, and they had 1302 reads. I choose to use them since 385 made the error. I do not know how to look into the import-file, if that is what you meant.

Hope this helps! If not, let my know

Adam_Rivers · November 6, 2018, 3:07pm

The error is occuring because there are no sequences left after the trimming stage for this sample.

I looked at the sequences in your sample using BBtools Sendsketch and they seem to be mostly contaminants. Blasn on selected reads also turned up human and mammalian hits, that are likely contamination. See the report below.

Using BBtools Sendsketch to get a quick taxonomic profile 
(qiime2-2018.8) ARSMSSGB2A17007:data rivers$ sendsketch.sh in=385_S30_L001_R1_001.fastq.gz in2=385_S30_L001_R2_001.fastq.gz size=100000
Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
Adding /Users/rivers/anaconda3/envs/qiime2-2018.8/opt/bbmap-38.22-0/resources/blacklist_refseq_species_250.sketch to blacklist.
0.021 seconds.
Loaded 1 sketch in 	0.195 seconds.

`Query: SN7001210:479:HJG7HBCX2:1:1102:7451:6694 1:N:0:TAAGGCGA+CTCTCTAT	DB: RefSeq	SketchLen: 2258	Seqs: 1302 	Bases: 325500	gSize: 226127	Quality: 0.9414	File: 385_S30_L001_R1_001.fastq.gz`

WKID	KID	ANI	Complt	Contam	Matches	Unique	noHit	TaxID	gSize	gSeqs	taxName
94.44%	0.01%	99.81%	0.01%	0.00%	17	0	1	9606	2423912913	24687	Homo sapiens
83.33%	0.01%	99.39%	0.01%	11.11%	15	0	1	9598	2434823895	4771	Pan troglodytes
77.78%	0.01%	99.15%	0.01%	16.67%	14	0	1	9595	2382094051	40754	Gorilla gorilla gorilla
77.78%	0.01%	99.15%	0.01%	16.67%	14	0	1	9597	2368990487	11016	Pan paniscus
72.22%	0.01%	98.90%	0.01%	22.22%	13	0	1	9601	2456659792	5513	Pongo abelii
55.56%	0.01%	98.03%	0.01%	38.89%	10	0	1	61853	2298840064	17597	Nomascus leucogenys
44.44%	0.01%	97.29%	0.01%	50.00%	8	0	1	60711	2400969651	2073	Chlorocebus sabaeus
38.89%	0.00%	96.85%	0.01%	55.56%	7	0	1	591936	2426739424	47644	Piliocolobus tephrosceles
38.89%	0.00%	96.85%	0.01%	55.56%	7	0	1	9565	2414004218	15330	Theropithecus gelada
38.89%	0.00%	96.85%	0.01%	55.56%	7	0	1	336983	2357094801	13126	Colobus angolensis palliatus
37.50%	0.00%	96.73%	0.00%	56.25%	6	0	1	9545	2498927350	9736	Macaca nemestrina
33.33%	0.00%	96.34%	0.00%	61.11%	6	0	1	9544	2457872871	286409	Macaca mulatta
33.33%	0.00%	96.34%	0.00%	61.11%	6	0	1	9555	2426007380	71205	Papio anubis
33.33%	0.00%	96.34%	0.00%	61.11%	6	0	1	61622	2419943618	135514	Rhinopithecus roxellana
33.33%	0.00%	96.34%	0.00%	61.11%	6	0	1	61621	2408714928	105034	Rhinopithecus bieti
33.33%	0.00%	96.34%	0.00%	61.11%	6	0	1	9541	2402786168	7649	Macaca fascicularis
33.33%	0.00%	96.34%	0.00%	61.11%	6	0	1	9568	2362113300	12823	Mandrillus leucophaeus
27.78%	0.00%	95.74%	0.00%	66.67%	5	0	1	9531	2447756662	11434	Cercocebus atys
22.22%	0.00%	95.01%	0.00%	72.22%	4	0	1	37293	2357890694	28970	Aotus nancymaae
20.00%	0.00%	94.67%	0.00%	70.00%	4	0	2	29073	2072199437	23819	Ursus maritimus

I will add a check in the software to let users know that one of their files does not contain any fungal reads rather than crashing with a cryptic message about line 17!

einamart · November 6, 2018, 3:11pm

Ah! That´ll explain a lot. Thank you so much

If you could find time to answer my (and bsen2018) question regarding features as well, I would be very grateful!!

Adam_Rivers · November 6, 2018, 3:21pm

I can look into this more systematically if you want me to look at your larger dataset. Itsxpress uses expect value cutoffs for matching the conserved fungal regions flanking the ITS so it discards both sequencing artifacts from read through and real ITS reads that do not come from fungi but will still yield sequence variants in Dada2 (because they are real sequence variants, just not from fungi).

I did some testing with the dataset in the itsxpress paper after we submitted it and ITSxpress yielded fewer sequence variants but they were also more identifiable. I would like to dig into it more shortly.

system · December 7, 2018, 9:21pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.