OTU-Picking/ASV and Silva Import

Hi,

I have several separate but related questions. First, I imported a seqs.fna file that was demultiplexed using the QIIME 1 pipeline, and I used vsearch to dereplicate my sequences. Both steps were successful, but now I'm wondering about the downstream analysis. I've browsed the forum and gotten more and more confused about which way is the proper way. I would ultimately like to run both an ASV clustering step and a 97% OTU picking step for comparison purposes, but I'm not sure when I can use each, so here are my questions.

  1. If I have dereplicated the sequences using vsearch, can I now use dada2 denoise? Or did I go one step too far and need to step back and use the seqs.qza file created just after import? I'm guessing I did go too far, because the dereplicating step created a table and dada2 does that as well, so I probably should use the seqs.qza file? Here are the two data types I have after dereplicating:
    qiime tools peek rep-seqs.qza
    UUID: ba8776d0-8285-48f2-8e66-2d08f482c904
    Type: FeatureData[Sequence]
    Data format: DNASequencesDirectoryFormat

    qiime tools peek table.qza
    UUID: 05bafc22-ee20-48b1-96a0-4e253d7ceb9e
    Type: FeatureTable[Frequency]
    Data format: BIOMV210DirFmt

  2. If I do use the seqs.qza file that comes directly from importing the fna file, do I need to do any other steps to "clean" things up before using dada2 denoise?

  3. When I want to use OTU picking, I know I'm at the correct spot to keep going after dereplicating, but I'm having trouble importing my Silva database. I had previously imported the pre-trained full-length Silva classifier from here to do dada2 on a smaller dataset that was already demultiplexed. However, when I try to use it for vsearch open-reference clustering, QIIME 2 throws an error that says "Argument to parameter 'reference_sequences' is not a subtype of FeatureData[Sequence]." So what I'm wondering is: where/how can I get the properly formatted/aligned Silva set, and/or how can I import it properly? Here is my current Silva qza type:

    qiime tools peek silva-132-99-nb-classifier.qza
    UUID: ba91648e-8216-45a0-b37e-304ef7531f9c
    Type: TaxonomicClassifier
    Data format: TaxonomicClassiferTemporaryPickleDirFmt

Thank you!
Alicia

Hi Alicia,

I will try to answer, sticking to the structure of your questions:

  1. You are correct: DADA2 has its own dereplicating step, so you will have to step back to the .qza file from just after import. In fact, as a general rule, you should only dereplicate your sequences after filtering low-quality reads (otherwise good and bad reads will be treated as identical after dereplication). DADA2 will take your naive qza sequences file, filter low-quality reads, and dereplicate afterwards.
  2. No, they are good to go. I would just advise you to summarize your imported file and look at the quality of the bases of your reads. Then you can set parameters when calling DADA2 and it will trim low-quality 5'/3' bases and/or primers (if needed) for you.
  3. As far as I understood, you downloaded only the Naive Bayes classifier trained on Silva. You won't need that file until the taxonomic assignment step. Open-reference OTU picking demands the reference database itself, which you can download from the Silva website. That will be a fasta file of reference sequences, which can be imported as FeatureData[Sequence], just like QIIME 2 is asking you (see the sketch below).
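To make those points concrete, here is a minimal sketch (seqs.qza is your post-import file; the Silva file name, the trunc-len value, and the output names are placeholders to adapt):

    # Point 2: inspect per-base read quality before choosing DADA2 parameters
    # (assumes the imported artifact retains quality scores)
    qiime demux summarize \
      --i-data seqs.qza \
      --o-visualization seqs-summary.qzv

    # Point 1: DADA2 filters and dereplicates internally, so feed it the
    # post-import artifact; pick --p-trunc-len from the quality plots
    qiime dada2 denoise-single \
      --i-demultiplexed-seqs seqs.qza \
      --p-trim-left 0 \
      --p-trunc-len 250 \
      --output-dir dada2-out

    # Point 3: import the Silva reference fasta so open-reference clustering
    # gets the FeatureData[Sequence] it expects (file name is a placeholder)
    qiime tools import \
      --type 'FeatureData[Sequence]' \
      --input-path silva_reference.fasta \
      --output-path silva-ref-seqs.qza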

Sorry if I lost anything in translation.
Hope it was helpful!


Thank you for your help @vheidrich. This is really useful.

I do have one question, though. I've previously only used the pre-trained classifier, so I know I will need to download the reference file for Silva. If I go here, which file is the one I need? There are a million to choose from. I'm guessing I need this one, "SILVA_132_SSURef_tax_silva.fasta.gz," but I see that it needs to be fully aligned for OTU picking, so maybe I need this one, "SILVA_132_SSURef_tax_silva_full_align_trunc.fasta.gz"?

Any advice?

Thanks,
Alicia

Whoops, I realize those above are fasta files. Is that what I need, or do I need the actual taxonomy files like those found here? If so, is this the correct one: "tax_slv_ssu_132.txt"?

Use the 132 release here.

Use the unaligned fasta (rep_set/rep_set_16S_only/99/silva_132_99_16S.fna) and corresponding taxonomy files (e.g., taxonomy/16S_only/99/majority_taxonomy_7_levels.txt) for training your own classifier.
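For reference, a minimal sketch of that training step, using the file names above (output names are placeholders; note that on releases newer than 2018.4 the --source-format flag is spelled --input-format):

    # Import the unaligned reference sequences
    qiime tools import \
      --type 'FeatureData[Sequence]' \
      --input-path silva_132_99_16S.fna \
      --output-path silva-132-99-seqs.qza

    # Import the matching taxonomy (a headerless two-column TSV)
    qiime tools import \
      --type 'FeatureData[Taxonomy]' \
      --source-format HeaderlessTSVTaxonomyFormat \
      --input-path majority_taxonomy_7_levels.txt \
      --output-path silva-132-99-tax.qza

    # Train the Naive Bayes classifier on those references
    qiime feature-classifier fit-classifier-naive-bayes \
      --i-reference-reads silva-132-99-seqs.qza \
      --i-reference-taxonomy silva-132-99-tax.qza \
      --o-classifier silva-132-99-classifier.qza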


Sorry for the 20 questions. Do I need to extract the reads down to 250 bp? I like using the full sequences when possible, but your tutorial suggests trimming them down. Is that something you think is necessary, or can I just skip the truncating part and train the classifier?

Not necessary — see the notes in the tutorial. Trimming will improve accuracy slightly, but it is not a game changer.
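If you do decide to trim, a minimal sketch of the extraction step (the 515F/806R primer pair shown is only an example; substitute the primers your amplicons were actually generated with):

    # Extract the amplified region from the full-length references
    # (primers shown are the common 515F/806R pair, as an example only)
    qiime feature-classifier extract-reads \
      --i-sequences silva-132-99-seqs.qza \
      --p-f-primer GTGYCAGCMGCCGCGGTAA \
      --p-r-primer GGACTACNVGGGTWTCTAAT \
      --p-trunc-len 250 \
      --o-reads silva-132-99-extracted-seqs.qza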


Great! Thank you so much for the help.


One last question just to be sure I’m on the correct path.

I've got my .fna file imported, I've dereplicated, and I'm training my 97% Silva classifier. Do I need to join my reads and complete chimera removal before taxonomically classifying anything?

So basically, would I do:
Import
Dereplicate
Join (using vsearch join-pairs) or some other method?
Chimera removal (using qiime vsearch uchime-denovo)
Filter chimeras from table and seqs
Cluster seqs using vsearch at 97% and the trained classifier

Or have I put the dereplicating step in way too early? It seems like perhaps it goes after filtering chimeras and before clustering the seqs? Sorry for so many questions, but the tutorial for importing EMP paired-end reads followed by the ASV clustering steps is much clearer; it's harder to figure out the pipeline for any other type of import and clustering.

Thanks,
Alicia

Hi Alicia,

Import
Join (vsearch join-pairs)
Dereplicate
Cluster using vsearch
Chimera removal
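In command form, a minimal sketch of that order (all file names are placeholders; join-pairs requires reads with quality scores):

    # Join paired-end reads
    qiime vsearch join-pairs \
      --i-demultiplexed-seqs demux.qza \
      --o-joined-sequences joined.qza

    # Dereplicate the joined sequences
    qiime vsearch dereplicate-sequences \
      --i-sequences joined.qza \
      --o-dereplicated-table table.qza \
      --o-dereplicated-sequences rep-seqs.qza

    # Cluster at 97% against the Silva references
    qiime vsearch cluster-features-open-reference \
      --i-table table.qza \
      --i-sequences rep-seqs.qza \
      --i-reference-sequences silva-132-99-seqs.qza \
      --p-perc-identity 0.97 \
      --o-clustered-table table-or-97.qza \
      --o-clustered-sequences rep-seqs-or-97.qza \
      --o-new-reference-sequences new-ref-seqs-or-97.qza

    # De novo chimera checking on the clustered output
    qiime vsearch uchime-denovo \
      --i-table table-or-97.qza \
      --i-sequences rep-seqs-or-97.qza \
      --o-chimeras chimeras.qza \
      --o-nonchimeras nonchimeras.qza \
      --o-stats chimera-stats.qza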

That should all be covered in this tutorial.

Good luck!

Thank you.

That's the tutorial I was using, but it doesn't mention joining before dereplicating, so I just wanted to be sure. Y'all are so helpful.


I may have mixed up that tutorial with this one (which shows deblur... just replace deblur with derep/clustering).

So when I tried to use this command:
qiime vsearch join-pairs --i-demultiplexed-seqs smallseqs.qza --o-joined-sequences demux-joined.qza

I got this error:
Argument to parameter 'demultiplexed_seqs' is not a subtype of SampleData[PairedEndSequencesWithQuality]

I checked the data type of the file that comes from importing my seqs.fna file from the qiime1 pipeline:
qiime tools peek smallseqs.qza
UUID: 6dc74cfe-a11f-4da7-aa0c-54bf2fcdbcaa
Type: SampleData[Sequences]
Data format: QIIME1DemuxDirFmt

Am I missing something important, or does the join-pairs command not accept this type of data? Is there another command I can utilize?

You imported your sequences as SampleData[Sequences]; you should import as SampleData[PairedEndSequencesWithQuality] if they are paired-end sequences (as sketched below).
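A minimal sketch of that import, assuming you have per-sample fastq files and write a manifest for them (manifest.csv and the output name are placeholders; on releases newer than 2018.4 the flag is --input-format):

    # Import paired-end fastq files listed in a manifest
    # (each manifest row maps sample-id,absolute-filepath,direction)
    qiime tools import \
      --type 'SampleData[PairedEndSequencesWithQuality]' \
      --input-path manifest.csv \
      --source-format PairedEndFastqManifestPhred33 \
      --output-path demux.qza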

Hey Nicholas,

I tried to import as you suggested above and got this error.
Fri Feb 22 13:58:13 CST 2019

/ddnB/work/areige1/FNAtest/temp /ddnB/work/areige1/FNAtest/temp

Traceback (most recent call last):
  File "/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/q2cli/tools.py", line 116, in import_data
    view_type=source_format)
  File "/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/qiime2/sdk/result.py", line 218, in import_data
    return cls.from_view(type, view, view_type, provenance_capture)
  File "/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/qiime2/sdk/result.py", line 242, in _from_view
    recorder=recorder)
  File "/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/qiime2/core/transform.py", line 59, in make_transformation
    (self._view_type, other._view_type))
Exception: No transformation from <class 'q2_types.per_sample_sequences._format.QIIME1DemuxFormat'> to <class 'q2_types.per_sample_sequences._format.SingleLanePerSamplePairedEndFastqDirFmt'>

An unexpected error has occurred:

No transformation from <class 'q2_types.per_sample_sequences._format.QIIME1DemuxFormat'> to <class 'q2_types.per_sample_sequences._format.SingleLanePerSamplePairedEndFastqDirFmt'>

My .fna file was created using the QIIME 1 pipeline. I was having a lot of problems with the demultiplexing process in QIIME 2 because my reads weren't in the correct format, so the sequencing facility I work with demultiplexed them using their own QIIME 1 pipeline, which is why I don't have the actual fastq files.

I see… I was not aware that you were importing qiime1 sequence files. I assume those sequences should already be joined.

If they are not already joined, I recommend using qiime1 or vsearch directly to merge these paired-end reads, then importing into QIIME 2 (vsearch is installed as part of your QIIME 2 installation and should have a way to join these reads, but you will need to check the vsearch docs to learn how). That seems like the only way forward if you cannot import as SampleData[PairedEndSequencesWithQuality].
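For what it's worth, a minimal sketch of merging directly with vsearch, assuming you can get the raw fastq files from your facility (merging needs the quality scores, so it will not work on the fasta alone; file names are placeholders):

    # Merge paired-end fastq reads and write a fasta ready for import
    vsearch --fastq_mergepairs forward.fastq \
      --reverse reverse.fastq \
      --fastaout joined.fna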

Hi Nicholas,

I’m working on your above suggestion, but also simultaneously trying to use QIIME2 to demultiplex these files (hoping one method will work).

I'm wondering if you have run into anyone having trouble with memory/space while running the demux command? I have a fw and a rv file, each about 132 GB (they are HiSeq reads), and I'm running them on our supercomputer (a 256 GB node with 16 threads, though I don't think demux can use multiple threads) with unlimited file storage for the final files, and I keep getting errors like this:

"could not write to '/tmp/qiime2-archive-qrnrhgmu/f3ecc142-5e4d-40c6-8899-c09f062b396d/data/S22_119_L001_R2_001.fastq.gz': No space left on device"

This is after 81 hrs of run time, which is super frustrating.

I'm wondering if there is something I can do to alleviate the temp storage requirement and/or the runtime. I'm worried about how long and how much memory steps like dada2 and classification will take if this is just the demux step.

Thanks!

Hi @reige012,

Your temp directory is running out of space (it probably has less than 132 GB available); this is not a memory error.

You should be able to change the temp directory location, but you should discuss with your system admins how/where to do that on your supercomputer.
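One common approach on HPC systems, assuming your scheduler allows it, is to point TMPDIR at a scratch path with enough room before running the command (the path here is a placeholder):

    # Redirect QIIME 2's temporary files to a large scratch area
    export TMPDIR=/path/to/big/scratch/tmp
    mkdir -p "$TMPDIR"
    # ...then re-run the demux command in the same shell session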

Good luck!

I just emailed them. I had run into space issues before and worked with them to set up my temp directory in a workspace that has unlimited storage, so this error is surprising to me. I'll wait to see what they say.

Thanks again.

