Phylogenetic analysis of Nanopore 16S reads. How many subsampled reads will do?

Hello! :person_raising_hand: It is my first time posting in the QIIME2 forum, which means my experience with QIIME2 has been only great so far without much ado. Recently I'm processing some Nanopore 16S V1–V9 reads and although I struggled a bit initially, getting the taxonomy results using the MetONTIIME pipeline was almost a breeze.

But now I am getting some headaches with diversity analysis (qiime phylogeny align-to-tree-mafft-fasttree). Initially mafft spewed out memory errors with my rep-seqs.qza, and I circumvented it by subsampling by 5000 with seqtk. Now it is running FastTreeMP, but even with this low number of reads it is taking more than half a day, which I didn't expect.

While waiting, I went through many previous posts in this forum on this long-read phylogeny issue only to know many have been before me, and the paper by Dr. Maestri (doi:10.3390/genes10060468), which claims more than 200 randomly sampled reads did not improve the accuracy. If that is true, my 5000 reads are definitely an overkill, but reflecting upon the usual alpha-rarefaction results, this feels a little bit counterintuitive.

If anyone has suggestions or experiences with this, I will appreciate the help. Thanks! :innocent:

1 Like

And it just now kernel-panicked :person_facepalming: I'll go try with subsampling by 500 this time.

1 Like

Hi @hoohugokim,

Welcome to the :qiime2: forum!

Thanks for circling back on your error and doing some digging on your end. Let us know if you are still running into issues after trying to subsample by 500, and we'll be happy to troubleshoot with you!

Cheers :lizard:

Hi,

With my ~50 samples, 500 reads/sample at just 1 vCPU still doesn't work (with 94GB of available memory). I'm now trying with 100, but it is highly likely to be not enough, I assume. The rep-seqs.qza artifacts weigh at 484, 412, 376MB for 3000, 500, 100 reads/sample data.

The phylogeny diversity workflow runs the below two steps prior to the tree generation.

qiime vsearch dereplicate-sequences
qiime vsearch cluster-features-de-novo

Could it be denoising/clustering issue from chimeras/singletons/rare features as were discussed several times in the forum? They seem to be a bit outdated and did not reach any clear consensus other than food for thought, unfortunately.

I am checking the chimeras (Identifying and filtering chimeric feature sequences with q2-vsearch — QIIME 2 2022.2.0 documentation), as it looks like to me the step is omitted in the MetONTIIME workflow.

The troubleshooting continues...

Hi @hoohugokim,

Have you tried applying the --p-parttree flag when running qiime phylogeny align-to-tree-mafft-fasttree ...?

What is the error rate of your Nanopore reads? Or how many ambiguous IUPAC bases (e.g. N, M, W, Y, R, ... are there for each of your sequences?

Having a moderate amount of ambiguous bases will often result in highly error-prone alignments. Which makes it quite difficult for phylogenetic inference. I suspect that the authors of the paper you linked were aware of this given:

Starting from pre-processed MinION reads, the ONTrack pipeline was applied to each sample consecutively and consisted of the following steps. First, VSEARCH v2.4.4_linux_x86_64 was used to cluster reads at 70% identity and only reads in the most abundant cluster were retained for subsequent analysis in order to remove contaminating sequences. Of those, 200 reads were randomly sampled using Seqtk sample...