Best approach for analyzing Sanger sequences from ITS2 region

jhines1 · January 18, 2019, 9:30pm

I am new to QIIME 2 & bioinformatics in general. I'm in need of assistance to figure out how I need to alter my pipeline.

I have 430 fungal ITS2 Sanger sequences produced through a clone library & I want to analyze the community structure through ESVs as opposed to OTUs. I have found that I need to use q2-itsxpress, as my primers amplified from the beginning of the 5.8S gene all the way to the LSU. I figure I can compare those ITS2 isolates to the UNITE database, then run the result through q2-ghost-tree, as recommended here.

So, I have imported my trimmed & QC'd (via Geneious 10.2.6) data as a .FASTA file, as described in this Q2 forum post. I have been looking at the Fungal ITS analysis tutorial, but after building their UNITE classifier database & importing the mock community data, they denoise those sequences with DADA2. From what I understand, Sanger data isn't supported by DADA2 or Deblur. Additionally, those require .FASTQ files. Can I still get through the pipeline with .FASTA files? Also, is denoising an important step in the pipeline, or would obtaining the artifact from the q2-ghost-tree plugin give me the necessary file type to move forward with diversity analyses?

My apologies if this is a really basic question, but I'm a bit lost in the pipeline &, again, bioinformatics is still quite new to me.

Thanks!!

Nicholas_Bokulich · January 21, 2019, 3:14pm

Hi @jhines1!

That's correct: DADA2 and Deblur do not support Sanger data. If you do not want to use OTU clustering, you can just use qiime vsearch dereplicate-sequences to dereplicate; that will give you a feature table and a representative sequences artifact.

Yes, you can still get through the pipeline with FASTA files; see this tutorial. Just follow the import and dereplicate-sequences steps in that section of that tutorial, then proceed through the ITS tutorial.

The feature table you get from dereplicate-sequences is all you need to proceed to diversity analyses. Denoising is important for correcting errors from next-generation sequencing methods, but since you have Sanger data and are doing a Sanger-specific QC protocol with Geneious, then denoising should not be necessary anyway.
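To make the idea of dereplication concrete, here is a minimal Python sketch of what the step does conceptually: identical full-length sequences are collapsed into one feature, and occurrences are counted per sample. This is only an illustration, not the q2-vsearch implementation. The function name, the toy sequences, and the rule that the sample ID is everything before the last underscore of the sequence ID are assumptions made for the example.

```python
import hashlib
from collections import Counter, defaultdict


def dereplicate(records):
    """Collapse identical sequences into features and count them per sample.

    `records` is an iterable of (sequence_id, sequence) pairs. As an
    assumption for this sketch, the sample ID is taken to be everything
    before the last underscore in sequence_id.
    """
    table = defaultdict(Counter)  # feature_id -> {sample_id: count}
    rep_seqs = {}                 # feature_id -> representative sequence
    for seq_id, seq in records:
        seq = seq.upper()
        # Use a hash of the sequence as the feature ID, so identical
        # sequences always map to the same feature.
        feature_id = hashlib.sha1(seq.encode("ascii")).hexdigest()
        sample_id = seq_id.rsplit("_", 1)[0]
        table[feature_id][sample_id] += 1
        rep_seqs.setdefault(feature_id, seq)
    return table, rep_seqs


# Hypothetical toy input: two identical reads and one distinct read.
records = [
    ("JH_LCl_51", "ACGTACGT"),
    ("JH_LCl_52", "ACGTACGT"),
    ("JH_HCl_73", "ACGTTTTT"),
]
table, rep_seqs = dereplicate(records)
```

With this toy input, the two identical reads end up as a single feature counted twice in sample JH_LCl, and the third read becomes its own feature in sample JH_HCl.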

I hope that helps!

jhines1 · January 25, 2019, 7:53pm

Thank you @Nicholas_Bokulich for your reply! So sorry for the delay in response.

I have imported my data & have tried to dereplicate the sequences with vsearch. However, when I try to run that command I get an error:

Plugin error from vsearch:

list index out of range

I'm not sure why I would get that error message. I searched the forum for previous reports & found that another user had hit the same error for a different reason, which they fixed with a simple character replacement. I don't seem to have that problem, however. Here are the first couple of records from my .FASTA file:

JH_LCl_51
ANCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTATCCCTACCTGATCCNAGGTCAATCTGGGGTGGTTTGCTTACTGGTAAGCCCCTTTCTTCGGTGCGTCCCGCAAATTTGCTGCGTTCANTGCCGATAAGGGAGCTGCCAACTACTTTTGAGGCGAGTCCGCNCGCGGAGGCGGGACNNACGCCNATCACCAAGCTNAGCTTGAATTTTGAAATGACGCTCNAACAGGCATGCCCTAAGGAATACCAAAGGGCGCAATGTGCGTTCANNNATTCNATGATTCACTGAATTCTGCAATTCACACTACTTATCGCATTTCGCTGCGTTCTTCATCGATGCCANAGCCAAGAGATCCATTGCCGAAAGTT
JH_HCl_73
AGCCTCCGCTTATTGATATGCTTAAGTTCAGCGGGTATCCCTACCTGATCCNAGGTCAACCTGATAAAATGGGGGGTTGTTGGCAAGCAACCACCGAGACCCTATAGCGAGAAAATTTACTACGCTCANAGCTCGATGGCACCGCCACTGAGTTTAGGGGCTGCGAGACCGCANGCTCCAATACCAAGCGTGAGCTTGAGGGGTTGTAATGACGCTCGAACAGGCATGCCCCGCGNAATACCACGGGGCGCAATGTGCGTTCAAAGATTCNATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCTGCGTTCTTCATCGATGCCANAACCAAGTGATCCGTTGTCAAAAGTT

As you can see, I do not have those character issues with my headers. Any ideas as to what else it could be?

Nicholas_Bokulich · January 28, 2019, 1:58pm

Hi @jhines1,
could you please report the full error traceback? Use the --verbose flag in your command to print the full traceback in your terminal, and please post that here, along with the full command.

Do all of your sequence IDs fit that same pattern? You should review all IDs to make sure there are not IDs with special characters or an unusual pattern.
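One way to review every ID at once is a short Python scan of the header lines. This is a rough sketch under stated assumptions: headers are assumed to start with ">", and the pattern (letters, digits, dots, hyphens and underscores, ending in an underscore plus a number) is a guess based on the two IDs posted above, not an official QIIME 2 rule. The function name and the sample lines are hypothetical.

```python
import re

# Assumed ID pattern, modeled on IDs such as JH_LCl_51; adjust as needed.
ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]+_\d+")


def unusual_ids(fasta_lines, pattern=ID_PATTERN):
    """Return (line_number, header) pairs whose header does not match `pattern`."""
    problems = []
    for number, line in enumerate(fasta_lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # Strip only the newline, so that any other trailing character
        # (for example a stray carriage return) is still reported.
        header = line[1:].rstrip("\n")
        if not pattern.fullmatch(header):
            problems.append((number, header))
    return problems


# Hypothetical example lines, not taken from the real file.
lines = [
    ">JH_LCl_51\n", "ACGT\n",
    ">JH HCl#73\n", "ACGT\n",
    ">JH_HCl_74\r\n", "ACGT\n",
]
problems = unusual_ids(lines)
```

For a real file, the lines could come from `open(path, newline="")`, which keeps the original line endings intact so that nothing is hidden by newline translation.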

jhines1 · January 28, 2019, 5:32pm

@Nicholas_Bokulich

(qiime2-2018.11) bash-3.2$ qiime vsearch dereplicate-sequences \
> --i-sequences Hines_SeqData.qza \
> --o-dereplicated-table Hines_table.qza \
> --o-dereplicated-sequences Hines_rep_seqs.qza \
> --verbose

Running external command line application. This may print messages to stdout and/or stderr.

The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_fulllength /var/folders/x1/8vlx8m5n2zn5hqc34w_3b8m40000gq/T/qiime2-archive-oj1cuzld/9334ad92-b7c1-4f4a-80c6-0135d76906a7/data/seqs.fna --output /var/folders/x1/8vlx8m5n2zn5hqc34w_3b8m40000gq/T/q2-DNAFASTAFormat-zg8dy1a9 --relabel_sha1 --relabel_keep --uc /var/folders/x1/8vlx8m5n2zn5hqc34w_3b8m40000gq/T/tmp7e33ao9j --qmask none --xsize

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 4 cores

Reading file /var/folders/x1/8vlx8m5n2zn5hqc34w_3b8m40000gq/T/qiime2-archive-oj1cuzld/9334ad92-b7c1-4f4a-80c6-0135d76906a7/data/seqs.fna 100%

126219 nt in 342 seqs, min 90, max 945, avg 369

Dereplicating 100%

Sorting 100%

311 unique sequences, avg cluster 1.1, median 1, max 3

Writing output file 100%

Writing uc file, first part 100%

Writing uc file, second part 100%

Traceback (most recent call last):
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
    results = action(**arguments)
  File "<decorator-gen-128>", line 2, in dereplicate_sequences
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
    output_types, provenance)
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 362, in callable_executor
    output_views = self._callable(**view_args)
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_vsearch/_cluster_sequences.py", line 134, in dereplicate_sequences
    table = _parse_uc(out_uc)
  File "/Users/haselkornlab/anaconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_vsearch/_cluster_sequences.py", line 70, in _parse_uc
    observation_id = fields[9].split()[0]
IndexError: list index out of range

Plugin error from vsearch:

list index out of range

See above for debug info.

jhines1 · January 28, 2019, 5:32pm

@Nicholas_Bokulich Something just occurred to me. Some of my sequences have 'N' in place of normal bases. Would this give vsearch, or Q2 in general, a reason to throw errors?

thermokarst · January 28, 2019, 10:10pm

Hey there @jhines1!

I don't think that is the problem here (but I could be wrong...)

One of the two list lookups in fields[9].split()[0] (the [9] or the [0]) is failing because the element it asks for doesn't exist.
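As a rough illustration of how either lookup can fail, here is a small Python sketch that applies the same expression from the traceback to a few made-up tab-separated lines. The lines are hypothetical and are not taken from the real intermediate file, and the splitting on tabs is an assumption about how the fields are produced.

```python
def lookup(line):
    """Apply the expression from the traceback to one tab-separated line."""
    fields = line.split("\t")  # assumption: fields are tab-separated
    return fields[9].split()[0]


def fails(line):
    """Return True if the lookup raises IndexError for this line."""
    try:
        lookup(line)
    except IndexError:
        return True
    return False


# Hypothetical records for illustration only.
complete = "H\t0\t369\t100.0\t+\t0\t0\t*\tJH_LCl_52\tJH_LCl_51"
too_short = "H\t0\t369\t100.0\t+\t0\t0\t*\tJH_LCl_52"        # only 9 fields
blank_last = "H\t0\t369\t100.0\t+\t0\t0\t*\tJH_LCl_52\t \r"  # 10th field is whitespace

results = [fails(complete), fails(too_short), fails(blank_last)]
```

In the second line the [9] lookup fails because there are only nine fields; in the third, the tenth field exists but contains only whitespace, so split() returns an empty list and the [0] lookup fails.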

The FASTA file snippet you provided above might not satisfy the format spec --- what happens when you run the following?

qiime tools validate Hines_SeqData.qza

Thanks!

thermokarst · January 28, 2019, 10:23pm

Ah bummer, I just double-checked, that validation code hasn't quite been implemented yet. Would you be able to share a link to this file in a DM?

jhines1 · January 28, 2019, 10:39pm

@thermokarst Yes, but I will have to get back to the lab before I can get to the files. I was working on them a bit earlier but haven't saved them to an external source yet (e.g. cloud/thumb drive). I am out of the lab for the day, but will be back first thing tomorrow morning. I will send them over then.

Thanks so much!

thermokarst · January 29, 2019, 7:13pm

Hey there @jhines1! It was a problem with file line endings: your FASTA file had a mix of LF and CRLF, which vsearch wasn't cleaning out, so when q2-vsearch parsed the intermediate files created by vsearch, there was all kinds of crazy going on in the file. I converted the file to use only Unix-style line endings and have returned it to you in a DM. Happy QIIMEing!
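For anyone who wants to check their own files, here is a minimal Python sketch for detecting mixed line endings and converting everything to LF. It is one possible approach, not necessarily how the file in this thread was converted; the function names and the sample bytes are made up for the example.

```python
def count_line_endings(data: bytes) -> dict:
    """Count LF, CRLF and bare CR line endings in raw file contents."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF not preceded by CR
    cr = data.count(b"\r") - crlf  # CR not followed by LF
    return {"LF": lf, "CRLF": crlf, "CR": cr}


def to_unix(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical file contents with a mix of CRLF and LF endings.
mixed = b">JH_LCl_51\r\nACGT\n>JH_HCl_73\nACGT\r\n"
before = count_line_endings(mixed)
after = count_line_endings(to_unix(mixed))
```

To apply this to a real file, read it in binary mode (`open(path, "rb")`) so that Python does not translate the line endings before they can be counted, and write the converted bytes back with `open(path, "wb")`.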

jhines1 · January 29, 2019, 7:18pm

Awesome! Thank you so much!!

Interesting about the line-break types. I'll have to look into that more & pass on this info to others in the lab that are beginning their Q2 journey.

Thanks again!!