Pipline for already demultiplexed, non-barcoded samples

Hi All,

I have already demultiplexed, paired-end .fastq.gz files from a Miseq PE300 run. There doesnt seem to be a specific tutorial covering this setup, but I did find some help on the forums. I am running into an error when trying to map my reads and am unsure if its due to an single problem with the step, or previous steps.

Ill do my best to outline what I have done:
This run on a Ubuntu server running Linux with q2cli version 2020.6.0.
Files from the seq facility, there are 106 samples:
ED1127vert_R1_001.fastq.gz ED1372vert_R1_001.fastq.gz
ED1127vert_R2_001.fastq.gz ED1372vert_R2_001.fastq.gz
ED1128vert_R1_001.fastq.gz ED1373vert_R1_001.fastq.gz
ED1128vert_R2_001.fastq.gz ED1373vert_R2_001.fastq.gz

Here is the pipeline that I've gleaned from several forum posts

#paired R1 and R2
qiime tools import
--type SampleData[PairedEndSequencesWithQuality]
--input-path mts_fastq_manifest_paired
--output-path ./import_paired/paired-end-demux.qza
--input-format PairedEndFastqManifestPhred33

qiime demux summarize
--i-data ./import_paired/paired-end-demux.qza
--o-visualization ./import_paired/paired-demuxed.qzv
#I get an output that shows the following for F reads (can only upload one image)

#import custom reference
qiime tools import
--input-path 16s_ref_nodupes.fasta
--output-path 16s_ref_nodupes.qza
--type 'FeatureData[Sequence]'

#questionable on the trimming and truncating values. Primers are 20 and 21 bases for F and R.
qiime dada2 denoise-paired
--i-demultiplexed-seqs ./import_paired/paired-end-demux.qza
--p-trim-left-f 20
--p-trunc-len-f 300
--p-trim-left-r 21
--p-trunc-len-r 300
--o-representative-sequences ./import_paired/clean_300/rep-seqs-dada2.qza
--o-denoising-stats ./import_paired/clean_300/stats-dada2.qza
--o-table ./import_paired/clean_300/table-dada2.qza

#map to reference
qiime vsearch cluster-features-closed-reference
--i-sequences ./clean_300/rep-seqs-dada2.qza
--i-table ./clean_300/table-dada2.qza
--i-reference-sequences ./../16s_ref_nodupes.qza
--p-perc-identity 0.95
--output-dir ./map_95/map_95.qza
--o-unmatched-sequences ./map_95/unmatched_95.qza
--p-strand both

#here is the error after running the vsearch script:
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/tmphjr4ecyt --id 0.95 --db /tmp/qiime2-archive-i5k0jgis/3b8b748e-ce9d-4e73-b11c-385f454aa85c/data/dna-sequences.fasta --uc /tmp/tmpeqoqp1kj --strand both --qmask none --notmatched /tmp/tmpbvuaibot --threads 1 --minseqlength 1 --fasta_width 0

vsearch v2.7.0_linux_x86_64, 125.9GB RAM, 80 cores

Reading file /tmp/qiime2-archive-i5k0jgis/3b8b748e-ce9d-4e73-b11c-385f454aa85c/data/dna-sequences.fasta 0%

Fatal error: Invalid FASTA - header must be terminated with newline
Traceback (most recent call last):
File "/home/tangled/miniconda2/envs/qiime2-2020.6/lib/python3.6/site-packages/q2cli/commands.py", line 329, in call
results = action(**arguments)
File "", line 2, in cluster_features_closed_reference
File "/home/tangled/miniconda2/envs/qiime2-2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/home/tangled/miniconda2/envs/qiime2-2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/home/tangled/miniconda2/envs/qiime2-2020.6/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 263, in cluster_features_closed_reference
File "/home/tangled/miniconda2/envs/qiime2-2020.6/lib/python3.6/site-packages/q2_vsearch/_cluster_features.py", line 33, in run_command
subprocess.run(cmd, check=True)
File "/home/tangled/miniconda2/envs/qiime2-2020.6/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['vsearch', '--usearch_global', '/tmp/tmphjr4ecyt', '--id', '0.95', '--db', '/tmp/qiime2-archive-i5k0jgis/3b8b748e-ce9d-4e73-b11c-385f454aa85c/data/dna-sequences.fasta', '--uc', '/tmp/tmpeqoqp1kj', '--strand', 'both', '--qmask', 'none', '--notmatched', '/tmp/tmpbvuaibot', '--threads', '1', '--minseqlength', '1', '--fasta_width', '0']' returned non-zero exit status 1.

Plugin error from vsearch:

Command '['vsearch', '--usearch_global', '/tmp/tmphjr4ecyt', '--id', '0.95', '--db', '/tmp/qiime2-archive-i5k0jgis/3b8b748e-ce9d-4e73-b11c-385f454aa85c/data/dna-sequences.fasta', '--uc', '/tmp/tmpeqoqp1kj', '--strand', 'both', '--qmask', 'none', '--notmatched', '/tmp/tmpbvuaibot', '--threads', '1', '--minseqlength', '1', '--fasta_width', '0']' returned non-zero exit status 1.

See above for debug info.

So, it says "Fatal error: Invalid FASTA - header must be terminated with newline" which I am unsure is due to how my original samples were imported or the reference.qza. I had no errors importing either, so I would think that I would have had the header error prior to the vsearch step.

Please let me know if you think other steps are necessary, should be swapped, or have an error. I have already read through many of the forum help posts about this and the links to the tutorials, so I need a bit more specific help.

1 Like

Hi @Max7,
Welcome to the forum! And thanks for providing all the details to your issue and looked up the forum first.
Before we troubleshoot the error message specific to vsearch, I just want to mention something first. Can you confirm that your feature-table following DADA2 is actually completed at least somewhat reasonably? I ask because your truncating parameters of 300, meaning essentially no truncating of poor quality tails, is bound to give some rather bad feature-tables. You may want to look up the forum on how to select DADA2 parameters and optimize this step first. But my guess is that even if were to resolve the main error, you would still have to redo this step to get a workable table.

As for your custom reference database error, can you provide us with a sample of a few lines from head and tail of this fasta file, will be helpful for us to see how this looks exactly. Sounds like there is something off with the FASTA file formatting. You can also try using another reference database, say with greenegenes (if this is 16S data) to see it works, that way we’ll at least confirm the error is originating from your custom reference.
Any additional info about this custom reference database would also be helpful, as in how it was made, if it has been validated elsewhere etc.

Hi @Mehrbod_Estaki,

Ill look into the DADA2 truncating more. We can certainly choose a more restrictive truncating step, but went with 300 to test the pipeline out. I was having trouble finding a way to visualize/check those results though after looking for through the tutorial. Any suggestions on which script/tool to use?

As to the to the custom database, below is the last lines of the .fasta file before being imported as a .qza. I had to use ‘vi’ to be able to get the text as ‘tail’ wasn’t working. If I view the The custom database was made from selected 16s data for many different species from Genbank and grouped into the one file. I woudnt doubt there to be a formatting issue, but no error came up after doing the ‘qiime tools import’ step. I can test out a greenegenes data base to see if the script works in the meantime.

I had to add quotes to the front b/c ‘>’ was creating blockquotes instead.

From the ‘vi’ command:

Here is also a copy-paste from the text file. Oddly, there are some accessions that are missing their sequence info… so maybe thats the issue. Not sure how it still imported with those issues.


Thank you for your help!

1 Like

Hi @Max7,
Thanks for the updates.

Did you see the feature-table summarize action in this section of the tutorial? This will give a summary results of your feature-table following DADA2. In the same tutorial, under the DADA2 section:

qiime metadata tabulate \
  --m-input-file stats-dada2.qza \
  --o-visualization stats-dada2.qzv

This will also give you a summary of the DADA2 stats results.

Yes please, this will be quite useful for a quick check, but keep in mind that GG is for 16S only, and it looks like you are dealing with non-bacteria, fish? So GG won’t work here, you’ll want to use a different database that has your marker gene. For example if its 18S, Silva will have one that should work.

Your reference database looks rather funky and messy to me, but I can’t tell if that’s just the formatting when you pasted the data over or that is in fact what it looks like. If you want you can share your database with us and we can have better look that way, or better yet, have you seen this tutorial on making your own custom reference database with RESCRIPt? This might be the cleanest way for you to build one. You can also search through the forum regarding custom reference database and see some of the common issues with those, maybe this, as a starting point…
Hope this helps.

1 Like

I plan on getting back on this task soon.
Thanks for the suggestions. I will let you know how it turns out.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.