Hello! This is my first time posting in the QIIME2 forum, which means my experience with QIIME2 has been smooth so far. Recently I have been processing some Nanopore 16S V1–V9 reads, and although I struggled a bit at first, getting taxonomy results with the MetONTIIME pipeline was almost a breeze.
But now I am running into headaches with the diversity analysis (qiime phylogeny align-to-tree-mafft-fasttree). Initially MAFFT threw memory errors on my rep-seqs.qza, which I worked around by subsampling to 5000 reads per sample with seqtk. Now the pipeline is running FastTreeMP, but even with this reduced number of reads it has been running for more than half a day, which I didn't expect.
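For reference, the commands I'm running look roughly like the following (a simplified sketch; the file names and thread count are placeholders rather than my exact setup):

```
# Subsample each sample's reads to 5000 with seqtk (fixed seed for reproducibility)
seqtk sample -s 100 sample01.fastq 5000 > sample01.sub5000.fastq

# Build the phylogeny from the representative sequences
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza \
  --p-n-threads 4
```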
While waiting, I went through many previous posts in this forum on this long-read phylogeny issue, only to find that many have run into it before me, and I also read the paper by Dr. Maestri (doi:10.3390/genes10060468), which reports that randomly sampling more than 200 reads did not further improve accuracy. If that is true, my 5000 reads are definitely overkill, but reflecting on typical alpha-rarefaction results, this feels a little counterintuitive.
If anyone has suggestions or experience with this, I would appreciate the help. Thanks!
Thanks for circling back on your error and doing some digging on your end. Let us know if you are still running into issues after subsampling to 500 reads per sample, and we'll be happy to troubleshoot with you!
With my ~50 samples, 500 reads/sample on just 1 vCPU still doesn't work (with 94 GB of available memory). I'm now trying 100 reads/sample, but I suspect that will still not be enough of a reduction. The rep-seqs.qza artifacts weigh in at 484, 412, and 376 MB for the 3000, 500, and 100 reads/sample data, respectively.
If I understand it correctly, the align-to-tree-mafft-fasttree pipeline runs the two steps below prior to the actual tree generation.
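Run individually, those two steps (MAFFT alignment and alignment masking) would look roughly like this, as far as I can tell from the plugin documentation (placeholder file names):

```
# Step 1: multiple sequence alignment with MAFFT
qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza

# Step 2: mask highly variable/noisy alignment columns before tree building
qiime alignment mask \
  --i-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza
```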
Could it be a denoising/clustering issue stemming from chimeras, singletons, or rare features, as has been discussed several times in the forum? Those threads seem a bit outdated and did not reach any clear consensus beyond food for thought, unfortunately.
Have you tried applying the --p-parttree flag when running qiime phylogeny align-to-tree-mafft-fasttree ...?
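Something along these lines, for example (just a sketch; adjust the file and directory names to your own):

```
# Same pipeline call, but letting MAFFT use PartTree for very large inputs
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --p-parttree \
  --p-n-threads 4 \
  --output-dir phylogeny-parttree
```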
What is the error rate of your Nanopore reads? Or how many ambiguous IUPAC bases (e.g. N, M, W, Y, R, ...) are there in each of your sequences?
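If it helps, here is one rough way to count them per sequence (just a sketch; it assumes you have exported your reads to an unwrapped FASTA, e.g. with qiime tools export, and that the file is named dna-sequences.fasta):

```
# Count non-ACGT (i.e. ambiguous IUPAC) bases per sequence in an unwrapped FASTA
awk '/^>/ {id=substr($0,2); next} {print id "\t" gsub(/[^ACGTacgt]/, "")}' dna-sequences.fasta
```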
Even a moderate amount of ambiguous bases will often result in highly error-prone alignments, which makes phylogenetic inference quite difficult. I suspect the authors of the paper you linked were aware of this, given the following from their methods:
Starting from pre-processed MinION reads, the ONTrack pipeline was applied to each sample consecutively and consisted of the following steps. First, VSEARCH v2.4.4_linux_x86_64 was used to cluster reads at 70% identity and only reads in the most abundant cluster were retained for subsequent analysis in order to remove contaminating sequences. Of those, 200 reads were randomly sampled using Seqtk sample...
@SoilRotifer Thank you for your comment, it gave me more insight to carry on with the troubleshooting. Regarding the MAFFT parameters, MAFFT mandates the parttree option when the sequence count exceeds one million, so I had to use it from the start just to get it to run. After days of looking into the problem, I increasingly feel that the error rates you mentioned, together with a pile of other quality-control issues, might be the culprit (an HPC trial with 100 vCPUs and 300 GB of memory also failed to run).
As I lack the knowledge to manually polish the reads at the moment, I'm looking into existing software options such as nanopolish, NanoFilt, fastp, etc. The Perl script remove_long_short.pl from ONTrack looks fancy ("...removing reads shorter than mean - 2SD and longer than mean + 2SD"), but I don't have much familiarity with Perl either, so I may as well try simple filtering first, possibly keeping only reads between 1250 and 1750 bp, considering the 16S V1–V9 amplicon length (~1.5 kb).
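If I go the simple filtering route, I'm thinking of something like the following (just a sketch, assuming NanoFilt's minimum/maximum length options and a placeholder input file name):

```
# Keep only reads between 1250 and 1750 bp (roughly the full-length 16S V1-V9 amplicon)
NanoFilt --length 1250 --maxlength 1750 < reads.fastq > reads.len-filtered.fastq
```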
After some QC attempts, the computational burden did seem to decrease, but it is still much heavier than with Illumina and Ion Torrent reads. I now presume this is simply to be expected, given the inherent increase in read length and/or error rate of Nanopore reads. I will soon get my hands on some PacBio HiFi reads from the same gDNA samples, so I'll be able to compare the two long-read platforms in that respect.
In the end, I had to subsample to 50k and 15k ONT reads for the taxonomic and phylogenetic analyses, respectively, with the workstation resources at my disposal (16-core CPU, 96 GB memory). The whole journey led my team to purchase an HPC server (128-core CPU, 1 TB memory), and I'm looking forward to further long-read sequencing work. Given this experience, I suppose I will also have to test and incorporate tools outside the QIIME2 ecosystem (e.g., kraken2, minimap2, MMseqs2).
@SoilRotifer Thank you again for your valuable insight!