Hi, I have 180 samples with approximately 40,000 - 60,000 sequences each.
Unfortunately, since my PC is not very powerful, the align-to-tree-mafft-fasttree command fails (an error occurs during progressive alignment; making the distance matrix and constructing the UPGMA tree both completed).
Now I am wondering whether I can solve this problem by reducing the total number of sequences.
Is there an option for random resampling from the rep-seqs and table (for example, randomly select 20,000 sequences, then make a new rep-seqs file)?
Or if anybody has another idea for solving this error, please help me.
A couple of solutions come to mind, though I'm not sure if either will help.
One would be to make sure you have singletons and chimeras filtered out of your sequences and table before building the tree. I think Deblur does this automatically; I'm not sure which denoising/clustering method you used.
The solution you’re proposing is essentially rarefaction of your table, which is somewhat controversial, although appropriate for diversity calculations. So it’s an option, but perhaps not the best one.
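For what it’s worth, I believe the random-resampling idea maps onto `qiime feature-table rarefy --p-sampling-depth`, after which `qiime feature-table filter-seqs` can sync the rep-seqs to the rarefied table. Conceptually, rarefaction is just sampling reads without replacement down to a fixed depth. A minimal Python sketch of that idea (the `rarefy` helper and the ASV counts are hypothetical, not your data or QIIME’s implementation):

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a {feature: count} table to a fixed depth without replacement."""
    # Expand the table into one entry per read, then draw `depth` of them.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sampling depth exceeds the sample's total reads")
    drawn = random.Random(seed).sample(pool, depth)
    out = {}
    for feature in drawn:
        out[feature] = out.get(feature, 0) + 1
    return out

# hypothetical per-sample feature counts
sample = {"ASV1": 12000, "ASV2": 7000, "ASV3": 1000}
rarefied = rarefy(sample, depth=5000)
print(sum(rarefied.values()))  # 5000
```

Note that any feature whose reads all miss the draw disappears entirely, which is part of why rarefaction is controversial outside of diversity analyses.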
You might also want to look for additional computational resources, depending on your data privacy/ethics rules. (If you’re only building the tree, you’re dealing with a set of de-identified sequences with no sample data attached, so this is probably fine, but it’s always better to check if you’re not sure.) These might include:
- A local or national computational resource. Many universities have supercomputers or supercomputer consortium agreements that you can apply to use. A bioinformatics core or IT department might be able to help you locate yours.
- Buying time on a cloud-based platform like Amazon Web Services or Microsoft Azure. It’s essentially a supercomputer for hire. Depending on what you’re doing, it has historically cost in the neighborhood of a few dollars a day for a fair bit of processing power, and you don’t need to keep things running long term, which can help limit your costs. I’m not sure whether there’s a QIIME installation already available there, and I can’t give more advice on which setup makes sense.
- Finally, if you’re willing/able to have your data hosted on someone else’s server, Qiita is a database developed by Rob Knight’s group at UC San Diego. It runs a standardised platform for denoising (and clustering) using what their team of expert developers - including some of the people behind QIIME and QIIME 2 - consider the best pipeline. I think it outputs QIIME 2 artefacts at the end of processing.
Hope that’s helpful.
Thank you very much for your kind suggestion.
I used DADA2 for denoising and clustering the sequences. I think chimeras were automatically removed from the rep-seqs.
When I removed low-abundance features (<10 across all samples), the tree was successfully built!! Thank you very much.
I am not sure whether this threshold value (10) is reasonable. Probably 1 or 2 would be better; I will try that later.
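For reference, I believe this kind of filtering is what `qiime feature-table filter-features --p-min-frequency N` does, followed by `qiime feature-table filter-seqs --i-table` to drop the matching rep-seqs. The operation itself is just a threshold on each feature’s total count; a minimal Python sketch with hypothetical totals (not your data):

```python
def filter_features(totals, min_frequency):
    """Keep features whose total count across all samples meets the threshold."""
    return {feature: n for feature, n in totals.items() if n >= min_frequency}

# hypothetical total counts per feature, summed across all samples
totals = {"ASV1": 25_000, "ASV2": 9, "ASV3": 120, "ASV4": 1}
kept = filter_features(totals, min_frequency=10)
print(sorted(kept))  # ['ASV1', 'ASV3']
```

With a long tail of rare ASVs, even a small threshold like this can remove the majority of features while discarding relatively few reads, which is why it shrinks the alignment so dramatically.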
And I will check whether I can use a supercomputer service at my university.
I have another question.
After the taxonomic analysis, I will use the data for LEfSe or PICRUSt analysis.
At that point, although the table will be converted into relative abundance data, the table was prepared from samples with different sampling depths.
Is this OK? Or should I prepare the taxonomic table after standardizing the sampling depths?
In my case, the minimum sequence count is approximately 20,000 and the maximum is 200,000…
In the qiime taxa barplot command, I could not find an option for controlling sampling depth.
I’m glad it worked. Singletons and rare features often make up most of the sequences in your table, so sometimes removing the singletons or super low abundance features can solve this problem.
I think the second part of this question might need to be split into another topic. Weiss et al. (linked above) show that you need to rarefy for diversity calculations. However, rarefaction is not recommended for the relative-abundance calculations used in feature-based analyses.
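To make that distinction concrete: converting each sample’s counts to proportions normalizes away the difference in sampling depth without throwing reads away, which is why rarefying first isn’t needed for relative-abundance analyses. A minimal sketch with made-up counts (a 10x depth difference, like the 20,000 vs 200,000 reads mentioned above):

```python
def relative_abundance(counts):
    """Convert one sample's {feature: count} table to proportions summing to 1."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}

shallow = {"ASV1": 15_000, "ASV2": 5_000}   # ~20,000 reads total
deep = {"ASV1": 150_000, "ASV2": 50_000}    # ~200,000 reads total

# both samples yield the same proportions despite the 10x depth difference
print(relative_abundance(shallow))  # {'ASV1': 0.75, 'ASV2': 0.25}
print(relative_abundance(deep))     # {'ASV1': 0.75, 'ASV2': 0.25}
```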
This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.