I'm using QIIME 2 installed with conda and need some help troubleshooting where I could be going wrong. When I try to use the output of the filter-features command, I get an error indicating that the file is empty, so for now I'm simply not using the filter-features output.
So my questions are:
1. Any ideas where I'm going wrong?
2. It seems I don't strictly need to filter features, so am I OK to ignore this and carry on? Or does it mean something else might be going wrong that is affecting the results?
QIIME 2 version: 2023.9.2
My data is 16S V4 sequences, split between forward and reverse read FASTQ files.
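For context, my filter-features call looks roughly like this, following the Greengenes2 tutorial (file names are placeholders for my actual artifacts):

```
# Keep only the ASVs present in the Greengenes2 2022.10 reference
qiime greengenes2 filter-features \
  --i-feature-table table.qza \
  --i-reference 2022.10.taxonomy.asv.nwk.qza \
  --o-filtered-feature-table filtered-table.qza
```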
Hello!
I am not sure that I am right, but it looks like you did not remove primers from the sequences with cutadapt after importing and before DADA2. Could you confirm that?
Since ASV IDs are hashes of the sequences themselves, your IDs may differ from the IDs in the Greengenes database if your sequences still contain primers.
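You can check this yourself: ASV IDs from DADA2 are the MD5 digest of the exact sequence string, so as a quick illustration (the sequence here is made up):

```
# The MD5 of the sequence is the ASV ID; a single leftover primer base
# at the 5' end changes the digest (and thus the ID) completely.
printf '%s' "TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGC" | md5sum
```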
Could you try removing primers with cutadapt before DADA2?
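Something along these lines, assuming the EMP V4 primer pair (515F/806R; substitute your lab's actual primers and file names):

```
# Trim the 515F/806R primers off the paired-end reads before denoising
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f GTGCCAGCMGCCGCGGTAA \
  --p-front-r GGACTACHVGGGTWTCTAAT \
  --o-trimmed-sequences demux-trimmed.qza \
  --verbose
```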
In the 2022.10 tree, we primarily placed 90, 100, and 150 nt forward reads relative to the EMP 16S V4 rRNA sequencing protocol. Reads not of those lengths, and not relative to 515F, are unlikely to be represented. We did not place any stitched reads either; if a phylogenetic taxonomy is desired, I recommend using only the forward reads. Alternatively, reads can be recruited closed-reference using the non-v4-16s action. If you are just interested in taxonomy, the pre-trained naive Bayes model can be used (available under Data Resources), although we observed lower correlation to paired shotgun samples relative to the phylogenetic taxonomy (https://www.nature.com/articles/s41587-023-01845-1/figures/2).
It is necessary to filter or map the reads if the intention is to use either the phylogeny or taxonomy.
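For reference, the closed-reference recruitment looks roughly like this (file names are placeholders; the backbone artifact is available under Data Resources):

```
# Recruit reads that are not EMP-style V4 (e.g., stitched or oddly
# trimmed) closed-reference against the Greengenes2 backbone
qiime greengenes2 non-v4-16s \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-backbone 2022.10.backbone.full-length.fna.qza \
  --o-mapped-table gg2-table.qza \
  --o-representatives gg2-rep-seqs.qza
```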
Hi Timur,
Thanks so much for the suggestion! I didn't need to remove the primers, as the lab told me the FASTQ files they provided are demultiplexed and do not contain primer sequences.
That is exactly what I hear from sequencing centers before I find and remove primers...
If the problem is not yet solved, I would search for the primers with cutadapt and discard any sequences without primers (there is a cutadapt setting for this). If the output file is very small, that means the primers were indeed already removed. If the output file is big enough and contains most of the reads, that means the primers were still in the sequences.
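With the q2-cutadapt plugin and the forward reads, that could look roughly like this (515F shown; adjust the primer and file names as needed):

```
# --p-discard-untrimmed drops reads in which the primer was not found,
# so the size of the output tells you whether primers are still present
qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux.qza \
  --p-front GTGCCAGCMGCCGCGGTAA \
  --p-discard-untrimmed \
  --o-trimmed-sequences primer-checked.qza \
  --verbose
```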
Hi Timur!
Thanks for the follow-up. I ran it and it didn't find much to remove, but the filter-features output was still empty, like before. Below is the verbose output. I ran it on the forward reads only, as I've also started experimenting with Daniel's suggestion.
Running external command line application. This may print messages to stdout and/or stderr.
The commands to be run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

This is cutadapt 4.5 with Python 3.8.15
Command line parameters: --cores 1 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 -o /data/tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-uvhs5hb4/BS12-MTV84_0_L001_R1_001.fastq.gz --front GTGCCAGCMGCCGCGGTAA /data/tmp/qiime2/ec2-user/data/b9b59679-edf7-4e1d-85e4-bb6abe24eba7/data/BS12-MTV84_0_L001_R1_001.fastq.gz

Processing single-end reads on 1 core ...
Finished in 2.249 s (21.939 µs/read; 2.73 M reads/minute).

=== Summary ===

Total reads processed:          102,499
Reads with adapters:                 55 (0.1%)

== Read fate breakdown ==
Reads that were too short:            0 (0.0%)
Reads written (passing filters):  102,499 (100.0%)

Total basepairs processed:   15,477,349 bp
Quality-trimmed:                      0 bp (0.0%)
Total written (filtered):    15,477,180 bp (100.0%)

=== Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 55 times

Overview of removed sequences
length  count  expect  max.err  error counts
3       53     1601.5  0        53
5       2      100.1   0        2
Hi Daniel,
Thanks so much for your help.
It sounds like the phylogenetic taxonomy would be the better option, although at this point I am assigning just the taxonomy; I will look into the phylogenetic taxonomy next.
I tried using just the forward reads as you suggested, but it didn't make a difference; filter-features still results in an empty output.
I also experimented with --p-trunc-len 150 in DADA2, but it didn't help.
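For reference, the forward-read-only run looked roughly like this (file names are placeholders):

```
# Denoise the forward reads only, truncating at 150 nt
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```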
The V4 representation in Greengenes2 2022.10 spans over 300,000 samples, representing over 20,000,000 ASVs, and includes the EMP / EMP500 sets among many other environmental sample sets. It would be unexpected for the data not to have reasonable representation, and concerning that there is no representation at all. I suspect there is a processing detail that we haven't isolated yet, and which we (i.e., the Greengenes2 team) need to communicate better.
I was using DADA2 to truncate to 145 nt. I've now set it to 151, hoping it would make a difference, since per the FastQC output the quality seemed OK up to 151. I only tried cutadapt after it was suggested, and that didn't introduce the issue I have; the results below are now without cutadapt, and the output is still empty. I used 145 because that's what I understood my lab to be recommending, but I might be misunderstanding.
This is what it looks like now that I've set it to 151. Is this looking better? I had previously tried 150, as you mentioned having placed ASVs of 150 nt length.
Something is odd, as you were able to search the database, and the content of the website is a reflection of the files on the FTP. Could you share the 150 nt FeatureData[Sequence] artifact?
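While preparing that, one quick sanity check on the read lengths (assuming the standard export layout; the awk one-liner just tallies sequence lengths) would be:

```
# Export the FeatureData[Sequence] artifact and count ASVs per length
qiime tools export --input-path rep-seqs.qza --output-path rep-seqs-export
awk '!/^>/ { print length($0) }' rep-seqs-export/dna-sequences.fasta | sort -n | uniq -c
```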