Representative sequences file too large after combining studies

gilberto_solano · May 2, 2018, 12:09pm

Hello,

I have been working to train a classifier using the qiime alignment mafft plugin, and have repeatedly received errors because the process takes too much memory. Most recently it has been running for over 12 hours. I asked about this in another topic: Mafft error for alignment of rep-seqs

I believe the problem is that my rep-seqs file is too large. I created it by combining rep-seq files generated over several studies in QIITA. This was one study with 6 plates that were each given an individual prep template in qiita, so to analyze these data I downloaded each deblurred table and rep-seqs file, then combined them using qiime feature-table merge-seqs. I am concerned that it simply added the 6 tables together instead of truly combining the information, because my rep-seqs file now contains over 120,000 representative sequences. In other words, I think there may be redundant sequences listed, and that is why my file is so large. Is there a better way to merge the files, or is there a way to reduce the number of sequences so I can use this for further processing?

I tried to attach the rep-seqs.qzv file, but it was too large.

Thank you!

ebolyen · May 4, 2018, 11:32pm

Hi @gilberto_solano,

That merge command won't record the same sequence twice. However, having that many representative sequences (from, I assume, similar plates of the same environment and amplicon) does suggest that something is wrong. How many sequences were found in a single plate? I would expect the number of unique sequences to still be in the same ballpark after merging (it sounds like this is not the case?).
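As a rough illustration of that expectation, here is a minimal sketch. It assumes each feature is identified by its exact sequence and that merging behaves like a set union; it is not the actual QIIME 2 implementation, and the plate names and sequences are hypothetical toy data.

```python
# Toy model of merging representative sequences (assumption: a feature is
# identified by its exact sequence, so merging acts like a set union).

def merge_rep_seqs(plates):
    """Return the union of all per-plate sequence sets."""
    merged = set()
    for seqs in plates.values():
        merged |= seqs  # an identical sequence is only kept once
    return merged

def summarize(plates):
    """Compare per-plate counts with the merged unique count."""
    merged = merge_rep_seqs(plates)
    per_plate = {name: len(seqs) for name, seqs in plates.items()}
    return {
        "per_plate": per_plate,
        "sum_of_plates": sum(per_plate.values()),
        "merged_unique": len(merged),
    }

# Hypothetical plates that share two of their three sequences.
plates = {
    "plate1": {"ACGTAC", "ACGTTT", "GGGTAC"},
    "plate2": {"ACGTAC", "ACGTTT", "TTTTAC"},
}
summary = summarize(plates)
print(summary)
```

If the merged unique count is close to the sum of the per-plate counts, the plates share very few exact sequences, which would be surprising for plates from the same study and amplicon.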

Are each of these plates for the same amplicon, and were they all filtered/trimmed/denoised to the same length? If the lengths don't match then you won't be able to compare across plates.
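One way to check the trim lengths is to tabulate sequence lengths per plate. The sketch below assumes each plate's representative sequences are available as FASTA text; the helper names and the toy records are hypothetical, and in-memory strings stand in for real files.

```python
from collections import Counter
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def length_counts(handle):
    """Count how many sequences of each length a plate contains."""
    return Counter(len(seq) for _, seq in read_fasta(handle))

def lengths_consistent(per_plate_counts):
    """True only if every sequence in every plate has the same length."""
    all_lengths = set()
    for counts in per_plate_counts.values():
        all_lengths.update(counts)
    return len(all_lengths) == 1

# Hypothetical plates trimmed to different lengths (8 nt vs 6 nt).
plate_a = io.StringIO(">a1\nACGTACGT\n>a2\nTTGTACGA\n")
plate_b = io.StringIO(">b1\nACGTAC\n>b2\nTTGTAC\n")
counts = {
    "plate_a": length_counts(plate_a),
    "plate_b": length_counts(plate_b),
}
print(counts)
print(lengths_consistent(counts))
```

In this toy case the same two amplicons appear in both plates, but because they were trimmed to different lengths no sequence matches exactly, so a merge would count them as four features instead of two.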

Does that make sense?

jessicalmetcalf · May 8, 2018, 3:54am

Hi Evan, Gilberto has been trying to analyze a published QIITA data set as part of a class. The one he chose (as a dairy systems student) is Qiita. The challenge is that the study has 6 prep templates and it is not clear how to combine them for a final rep_seqs. It seems something has gone wrong. Perhaps you could walk through the workflow of pulling data off QIITA when prep templates within a single study need to be combined?

antgonza · May 8, 2018, 11:34am

Hi @jessicalmetcalf and @gilberto_solano,

To merge multiple preparations from a single study or from multiple studies in Qiita, you need to create an analysis. There are a couple of guides for this: one in the tutorial and another in the help; I suggest taking a look. Once you have created an analysis, you can download the merged biom and QIIME mapping file. Note that an analysis can also be public, like Qiita's main paper analysis, which combines 7 preparations from 7 different studies.

Hope this helps.

jessicalmetcalf · May 9, 2018, 6:11am

Hi @antgonza Here is the public qiita meta-analysis Qiita. One issue we ran into here is that the only phylogenetic tree option for alpha and beta tree-based metrics looked like a closed ref 97% similarity gg tree. Is that correct, or is the tree created using the deblur rep_set sequences? I am trying to download the support_files.zip to see if a tree is in there, but I am at a hotel with very slowwwwww internet. @gilberto_solano.

antgonza · May 9, 2018, 4:51pm

Currently in Qiita we only support phylogenetic analyses with closed reference, because with other techniques (in this case deblur) we would need to create a new phylogeny from the combined datasets. Note that this is something we are actively working on, and it should be available in the system soon.

Now, you can combine your deblur bioms from all studies and preparations and then download the resulting files to continue the analysis on your local machine. I suggest checking out these posts:

Hope this helps.