Importing paired end demultiplexed MiSeq data

David_Bradshaw · February 13, 2018, 4:50pm

Dear Whom It May Concern,

I need to import paired-end demultiplexed sequences into QIIME2 and I have been having trouble. My sequences do not have barcodes or primers. The file names for one sample look like the following, with the run number, sample name, forward primer name, reverse primer name, and direction:
4185-1-515wF-806bR_R1.fastq
4185-1-515wF-806bR_R2.fastq

What import script should I use to import a folder containing these pairs of sequences? I will also have to combine runs at some point, will that be complicated without having any barcodes attached to them? Thank you for your time and help.

Sincerely,

David Bradshaw

Mehrbod_Estaki · February 13, 2018, 8:04pm

Hi @David_Bradshaw,

You can import your demultiplexed, unjoined paired-end reads using the manifest approach described here.
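For a folder of files named like the ones above, the manifest can be generated rather than typed by hand. This is only a minimal sketch: `make_manifest` is a hypothetical helper name, and it assumes that the sample ID is simply the file name with the `_R1.fastq` suffix removed and that every forward file has a matching `_R2.fastq` mate. The header row follows the fastq manifest layout (sample ID, absolute file path, read direction).

```shell
# Build a paired-end fastq manifest from a folder of *_R1.fastq / *_R2.fastq files.
# Assumption: sample ID = file name minus the _R1.fastq suffix.
make_manifest() {
    dir=$(cd "$1" && pwd)                  # the manifest needs absolute paths
    echo "sample-id,absolute-filepath,direction"
    for fwd in "$dir"/*_R1.fastq; do
        [ -e "$fwd" ] || continue          # folder had no matching files
        sample=$(basename "$fwd" _R1.fastq)
        rev="$dir/${sample}_R2.fastq"
        if [ ! -e "$rev" ]; then
            echo "missing reverse read for $sample" >&2
            continue
        fi
        echo "$sample,$fwd,forward"
        echo "$sample,$rev,reverse"
    done
}

# Example: make_manifest /path/to/fastq-folder > manifest.csv
```

To tell runs or sampling periods apart later, the sample IDs can be edited in the resulting CSV before importing.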

As for combining runs, could you provide a bit more information about the nature of these runs? Are these the same samples that have been sequenced multiple times, perhaps using different primers for the same target? Or do they share an experimental characteristic, like originating from the same environment (group), or will they simply be compared to each other? The good news is that it's completely doable in qiime2 without too many complications, but what you are trying to achieve may affect how you approach this.
You can explore these options under the feature table plugins here, where your options are merging tables, sequences, or taxonomies.
Hope that gets you started!

David_Bradshaw · February 14, 2018, 5:21pm

Dear Mehrbod Estaki,

Thank you very much. They are samples of the same areas taken a couple of times over a period of time. I just made the manifest file with appropriate sample labels delineating sampling periods, and it seems to be working fine.

My next step is to follow Analyzing paired end reads in QIIME 2 through the quality filtering step and then move on to de novo chimera checking and then OTU picking using the latest SILVA database (128). I also have to train the classifier, correct? Would I just use the following plugins?
uchime-denovo: De novo chimera filtering with vsearch. - QIIME 2 2017.12.0 documentation
Training feature classifiers with q2-feature-classifier - QIIME 2 2017.12.0 documentation
cluster-features-de-novo: De novo clustering of features. - QIIME 2 2017.12.0 documentation

Thank you for your time and help,

Sincerely,

David Bradshaw

Mehrbod_Estaki · February 14, 2018, 10:42pm

Hi @David_Bradshaw,

Glad you've got your data imported into qiime properly!
While the steps you've outlined in your post are correct (i.e. OTU picking using vsearch, training your classifier etc.), I am wondering if you've had a chance to check out the denoising methods in qiime2, DADA2 or Deblur, both of which are used in the 'Moving Pictures' tutorial. Without going into too much detail about these newer methods, they outperform the more traditional OTU picking methods such as vsearch. Unless you have specifically chosen vsearch for a reason, for example comparing to previous data with the same methodology, I think you will greatly benefit from one of the other 2 methods.
If you choose to do DADA2, you want to make sure you perform the denoising step separately on each of your runs and merge the feature tables afterwards, as the error model it creates is run-specific. In addition, DADA2 performs its own chimera removal and merging of your paired-end reads. As such, you want to make sure you avoid doing any of that prior to dada2.
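A rough sketch of the per-run denoising described above, assuming each run has already been imported into its own demultiplexed artifact. The names `denoise_each_run`, `QIIME`, `TRUNC_F` and `TRUNC_R` are hypothetical and exist only for this example; the truncation lengths have to come from your own quality plots and should be the same for every run so that the resulting features are comparable when the tables are merged.

```shell
# Run DADA2 separately on each run's demultiplexed artifact.
# QIIME, TRUNC_F and TRUNC_R are illustrative variables, not QIIME 2 settings;
# set QIIME=echo to print the commands instead of executing them.
denoise_each_run() {
    for demux in "$@"; do
        run=$(basename "$demux" .qza)
        ${QIIME:-qiime} dada2 denoise-paired \
            --i-demultiplexed-seqs "$demux" \
            --p-trunc-len-f "${TRUNC_F:?set TRUNC_F from your quality plots}" \
            --p-trunc-len-r "${TRUNC_R:?set TRUNC_R from your quality plots}" \
            --output-dir "dada2-$run"
    done
}

# Example (placeholder truncation lengths):
#   TRUNC_F=240 TRUNC_R=200 denoise_each_run run1-demux.qza run2-demux.qza
```

The per-run tables and representative sequences can then be combined with the merge actions in the feature-table plugin; check `qiime feature-table merge --help` for the input option names in your release.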

One of the qiime experts will have to confirm this but if you do still choose to use vsearch, I believe you want to merge your runs prior to OTU picking as failing to do so can create some artificial clustering based on the arbitrary OTUs.

Happy qiiming

Nicholas_Bokulich · February 16, 2018, 3:44pm

Hi @David_Bradshaw,

@Mehrbod_Estaki has offered some expert advice, and he is correct: denoising and merging paired-end data using dada2 or deblur is much more accurate than OTU clustering, and we recommend this approach. Moving Pictures covers single-end data; see the Atacama soils tutorial for an example with paired-end reads.

Since you are using 515f/806r primers, you can use the pre-trained classifiers here. No need to train your own — but for future reference a tutorial on training your own classifiers is here.

Correct again — it would be better to merge your datasets (e.g., importing all per-sample fastq files together in a single manifest file) prior to OTU clustering if you go that route.

I hope that helps! Good luck. And if you have further questions about downstream steps, please search/open a new thread to make those questions easier for other users to find.

David_Bradshaw · February 16, 2018, 5:52pm

Dear Mehrbod Estaki and Nicholas Bokulich,

Thank you both very much. I will test out both methods, Deblur and OTU picking. My PIs are still proponents of the 97% OTUs, so we will see how that goes. So I can use the pre-trained classifier even though I am planning on using the 97% SILVA 128 release? I only need to train one if I am using a different primer? Thank you again very much.

Sincerely,

David

Nicholas_Bokulich · February 16, 2018, 6:20pm

Looks like the current pre-trained classifiers are SILVA 119 release... so if you want to classify against the 128 release you will need to train your own using the tutorial I linked to above. It is a straightforward but memory-intensive process (so you may get something like a MemoryError — scan the forum for previous posts on how to deal with this).
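For reference, the training tutorial boils down to two steps once the reference sequences and taxonomy have been imported as artifacts: extract the amplified region, then fit the classifier. The sketch below is not the tutorial itself. `train_v4_classifier`, the `QIIME` variable and the output file names are hypothetical, and the primers are the EMP 515F/806R sequences quoted later in this thread.

```shell
# Train a naive Bayes classifier on the V4 region of a reference database.
# $1 = imported reference sequences (FeatureData[Sequence] artifact)
# $2 = imported reference taxonomy  (FeatureData[Taxonomy] artifact)
# QIIME is an illustrative variable; set QIIME=echo for a dry run.
train_v4_classifier() {
    ref_seqs=$1
    ref_tax=$2
    ${QIIME:-qiime} feature-classifier extract-reads \
        --i-sequences "$ref_seqs" \
        --p-f-primer GTGYCAGCMGCCGCGGTAA \
        --p-r-primer GGACTACNVGGGTWTCTAAT \
        --o-reads ref-seqs-v4.qza &&
    ${QIIME:-qiime} feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads ref-seqs-v4.qza \
        --i-reference-taxonomy "$ref_tax" \
        --o-classifier classifier-v4.qza
}

# Example: train_v4_classifier silva-128-97-seqs.qza silva-128-97-tax.qza
```

The fitting step is the memory-hungry one, so that is where a MemoryError would show up if it is going to.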

That too — though you could use the full-length 16S pre-trained classifiers if you want to be domain-agnostic (trimming to primer sites improves accuracy a bit but not so much that you should be afraid to use full-length if you need to, e.g., if memory errors constrain you from training your own classifiers).

Good luck!

David_Bradshaw · February 20, 2018, 4:59pm

Dear Nicholas Bokulich,

Thank you for the information. Honestly, it went very quickly, only an hour, and no MemoryError. I hope that is not a sign that things did not go well. How can I check if the classifier worked? I did the visualization and got 485 assigned Feature IDs. metadata.tsv (84.5 KB)
Thanks again,

David Bradshaw

Nicholas_Bokulich · February 20, 2018, 5:09pm

No, sounds like everything went very well (lots of users report memory errors when attempting to train a SILVA classifier, usually in virtualbox or other memory-constrained situations so I was just forewarning in case that error came up).

It looks like it worked well... you have classifications, many at species level. Not all of these classifications will be 100% accurate at species level (in our tests, V4 classifications with SILVA are something like 50%-60% accurate at species level, but genus should be very accurate). Unless your taxonomic compositions have unexpected results, I don't think there is any need to worry about whether it worked. It did!

David_Bradshaw · February 20, 2018, 5:33pm

Dear Nicholas Bokulich,

Thank you very much that is helpful. I will be working through each of the three (97% OTU Picking, Deblur, and Dada2). I needed to double check if I interpreted my quality plots correctly, they honestly seem a little wonky. I am using the modified 515Fp (GTGYCAGCMGCCGCGGTAA) and 806R primers (GGACTACNVGGGTWTCTAAT) from EMP for the V4 region. My quality plots are attached. paired-end-demux.qzv (289.0 KB)
demux-joined.qzv (290.6 KB). I seem to have some really short reads. I tried Deblur at 291 and lost a lot of sequences, retrying at 250. I just want to get help interpreting the plots this time so that I get it right now and can do it myself in the future. Thank you for your help. Sorry that this thread has veered away from original topic.

Sincerely,

David Bradshaw

Nicholas_Bokulich · February 21, 2018, 6:12pm

This tutorial discusses interpretation of joined-read quality plots.

It does look like you have lots of really short reads — and it seems like this might be in the raw (unjoined) reads, judging from your quality plots. You should double-check the raw reads and review any procedures you used prior to importing (e.g., did you trim these reads somehow?) Something is not looking right about these data. Why are your reads so short?
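One quick way to double-check the raw reads outside of QIIME 2 is to tabulate read lengths straight from a FASTQ file. This is a minimal sketch with a hypothetical helper name (`read_length_hist`); it assumes standard four-line FASTQ records and decompresses `.gz` input first.

```shell
# Print "length count" pairs for the reads in one FASTQ file, shortest first.
# Assumes 4 lines per record (header, sequence, +, qualities).
read_length_hist() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR % 4 == 2 { n[length($0)]++ }
                END { for (len in n) print len, n[len] }' | sort -n
}

# Example: read_length_hist 4185-1-515wF-806bR_R1.fastq | head
```

If the short reads are already present here, they came from the sequencing provider's processing and not from anything done after import.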

David_Bradshaw · February 21, 2018, 7:15pm

Thanks, that tutorial is how I determined that I should use 291, since the number of sequences used to estimate it dropped off there. I have since tried 250 and 150 and got to keep more sequences, but then I run the risk of losing data.

Agreed, I figured something did not look right. I have not done anything to the sequences besides following that tutorial honestly. I imported my raw sequences (from the sequencing company we use) using a manifest file and then just joined the sequences using the vsearch script.

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path '/home/microbiology/Downloads/manifest2.csv' \
  --output-path paired-end-demux.qza \
  --source-format PairedEndFastqManifestPhred33

qiime vsearch join-pairs \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --o-joined-sequences demux-joined.qza
Saved SampleData[JoinedSequencesWithQuality] to: demux-joined.qza

qiime demux summarize \
  --i-data demux-joined.qza \
  --o-visualization demux-joined.qzv
Saved Visualization to: demux-joined.qzv

David_Bradshaw · February 22, 2018, 8:16pm

Dear Nicholas,

It seems likely that those short sequences are just low-quality trim-backs or primer dimers, based upon the info from my sequencer. I would imagine that they will be removed in quality filtering steps moving forward, though they do mess up how I am supposed to interpret my trimming protocols for deblur and dada2, which I am honestly unsure how to interpret even with the tutorial (sorry, new to this). I do not have any primers or anything else attached to the sequences, so I do not need to use the --p-trim-left-f or --p-trim-left-r options, correct?

Sincerely,

David

Nicholas_Bokulich · February 23, 2018, 12:45am

Makes sense — thanks for confirming!

Yes, they would be removed by dada2 or deblur (part of the low yield you mentioned). If you are concerned about interpreting quality plots, you could take a first pass to remove these sequences using q2-quality-filter (which should precede both deblur and OTU picking anyway).

Correct.

I hope that helps!

David_Bradshaw · February 26, 2018, 5:09pm

Dear Nicholas Bokulich,

Thanks that does help a lot. I was definitely planning on using quality filter joined prior to deblur and OTU picking. I know there is an option to remove sequences below a particular size in deblur, but is there a way in OTU picking, or does that happen automatically?

Also, apparently I was mistaken about my primers: they are still in there. I understand how to trim them using DADA2. I read the Deblur vs DADA2 Questions - #8 by benjjneb thread, which indicated this would be a problem in OTU picking and Deblur. Is there currently a way to trim sequences for those two analyses in QIIME2?

I am truly appreciative of all the help. Still working and learning this process as a part of my PhD, no PI here really has experience with 16S sequencing so the Forum is my best way to learn.

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path '/home/microbiology/Downloads/manifest2.csv' --output-path paired-end-demux.qza --source-format PairedEndFastqManifestPhred33

qiime vsearch join-pairs --i-demultiplexed-seqs paired-end-demux.qza --o-joined-sequences demux-joined.qza
qiime demux summarize --i-data demux-joined.qza --o-visualization demux-joined.qzv

Current Deblur Path
qiime quality-filter q-score-joined --i-demux demux-joined.qza --o-filtered-sequences demux-joined-filtered.qza --o-filter-stats demux-joined-filter-stats.qza

qiime deblur denoise-16S --i-demultiplexed-seqs demux-joined-filtered.qza --p-trim-length 240 --o-representative-sequences 240-rep-seqs.qza --o-table 240-table.qza --p-sample-stats --o-stats deblur-stats.qza

Current 97% OTU Path
qiime quality-filter q-score-joined --i-demux demux-joined.qza --p-min-quality 30 --o-filtered-sequences demux-joined-filtered2.qza --o-filter-stats demux-joined-filter-stats2.qza

qiime vsearch dereplicate-sequences --i-sequences demux-joined-filtered2.qza --o-dereplicated-table otutable.qza --o-dereplicated-sequences oturep-seqs.qza

qiime vsearch cluster-features-de-novo --i-table '/home/microbiology/97OTU/otutable.qza' --i-sequences '/home/microbiology/97OTU/oturep-seqs.qza' --o-clustered-table 97OTU/table-dn-97.qza --o-clustered-sequences rep-seqs-dn-97.qza --p-perc-identity 0.97 --p-threads 8 --verbose

qiime vsearch uchime-denovo --i-table otutable.qza --i-sequences oturep-seqs.qza --output-dir uchime-dn-out

etc... based upon Identifying and filtering chimeric feature sequences with q2-vsearch — QIIME 2 2018.2.0 documentation

Nicholas_Bokulich · February 27, 2018, 1:07am

This does not happen automatically. You could use the quality-filter plugin to filter these out, and you really should use that plugin prior to OTU picking anyway, to remove short and high-error seqs, as described here. (This is not necessary with deblur or dada2, since those perform trimming/qc steps on their own.) I see that you are using q-score-joined... that is correct, and you can adjust the --p-min-length-fraction (default 0.75) to trim out shorter sequences if desired.

dada2 can trim primers from seqs with the trim-left parameter. I don't think that deblur has a way to do this currently. For OTU picking and deblur, you can use the trim commands in q2-cutadapt.

We are all happy to help, glad to be of service. Go forth and spread the word.

Your workflows look good. I hope this helps solve the issues that you were having!

David_Bradshaw · February 27, 2018, 3:28pm

Dear Nicholas Bokulich,

Thank you very much. So, for the q2-cutadapt script: most of my sequences have the forward primer at the front of R1 and the reverse primer at the front of R2, as these were run on 2x250. One run was done on 2x300, which means that some of those reads will actually have both primers in the sequence. So I would run the following scripts?

qiime cutadapt trim-paired --i-demultiplexed-sequences '/home/microbiology/Deblur/paired-end-demux.qza' --p-cores 8 --p-front-f GTGYCAGCMGCCGCGGTAA --p-front-r GGACTACNVGGGTWTCTAAT --output-dir trimmed-primers --verbose

qiime cutadapt trim-paired --i-demultiplexed-sequences 'trimmed-primers.qza' --p-cores 8 --p-adapter-f GTGYCAGCMGCCGCGGTAA...ATTAGAWACCCBNGTAGTCC --p-adapter-r GGACTACNVGGGTWTCTAAT...TTACCGCGGCKGCTGRCAC --output-dir all-trimmed-primers --verbose

I am using the EMP modified 515-806, so I would have to run both these scripts to remove both sets of possible primers?
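The linked adapters in the second command pair each primer with the reverse complement of the other one. A small sketch for deriving those reverse complements, including the IUPAC ambiguity codes in the EMP primers (`revcomp` is just a helper name for this example, and it only handles uppercase input):

```shell
# Reverse-complement a primer, including IUPAC ambiguity codes.
# Pairs: A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W and N are unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBVDH' 'TGCAYRMKVBHD'
}

revcomp GGACTACNVGGGTWTCTAAT   # 806R -> ATTAGAWACCCBNGTAGTCC
revcomp GTGYCAGCMGCCGCGGTAA    # 515F -> TTACCGCGGCKGCTGRCAC
```

Both results match the sequences used after the `...` in the linked-adapter options above.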

Thank you for your time and help.

Sincerely,

David

Nicholas_Bokulich · February 27, 2018, 5:43pm

Wow, that complicates matters. Yes, I believe you may need to run both commands on the 2x300 if a "framed" sequence is not anchored (see here for details) if you intend to keep the full sequence (e.g., for dada2 and deblur you may trim the sequences based on quality profiles, so the reverse primers would not in fact impact this depending on where you trim).

However, with that in mind (that you will be doing this for deblur and OTU clustering workflows), you could just do this on the joined reads, in which case you would use trim-single instead of trim-paired (culpa mea), and you needn't specify primer sequences in both directions. But yes, what you are doing looks correct otherwise.

Let us know if that works! (if you run into an unreported error with cutadapt you may want to start a new post to make it easier to search) Thanks!

David_Bradshaw · February 27, 2018, 8:52pm

Dear Nicholas Bokulich,

Thank you very much. The suggestion to do the trimming on the joined sequences before deblur and OTU clustering seems like a great idea and would overcome the random 2x300 run, but I get the following error. It does not seem to like using SampleData[JoinedSequencesWithQuality] artifacts. (Should I put this into a new thread?) I have my 515f primer in the front option and my 806r primer in the adapter option, reverse-complemented relative to my actual primer.

(qiime2-2018.2) microbiology@willow:~$ qiime cutadapt trim-single --i-demultiplexed-sequences '/home/microbiology/Deblur/demux-joined.qza' --p-cores 8 --p-adapter ATTAGAWACCCBNGTAGTCC --p-front ^GTGYCAGCMGCCGCGGTAA --output-dir trimmed-primers-joined --verbose
Plugin error from cutadapt:

Argument to parameter 'demultiplexed_sequences' is not a subtype of SampleData[SequencesWithQuality].

Nicholas_Bokulich · February 27, 2018, 9:20pm

arg... you are right... sorry, I overlooked that. Indeed, for now you will need to use trim-paired but I have opened this issue to track adding support for joined reads in trim-single.

So you need to trim, then join.

In that case, the commands that you provided previously look like the correct workflow to accomplish this.
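Put together, the "trim, then join" order might look like the sketch below. `trim_then_join`, the `QIIME` variable and the output file names are hypothetical. Only the 5' primer pass is shown, so reads from the 2x300 run that extend into the opposite primer would still need the second, linked-adapter pass discussed above.

```shell
# Trim 5' primers from the unjoined paired-end reads, then join the pairs.
# $1 = demultiplexed paired-end artifact.
# QIIME is an illustrative variable; set QIIME=echo for a dry run.
trim_then_join() {
    demux=$1
    ${QIIME:-qiime} cutadapt trim-paired \
        --i-demultiplexed-sequences "$demux" \
        --p-front-f GTGYCAGCMGCCGCGGTAA \
        --p-front-r GGACTACNVGGGTWTCTAAT \
        --o-trimmed-sequences trimmed-demux.qza &&
    ${QIIME:-qiime} vsearch join-pairs \
        --i-demultiplexed-seqs trimmed-demux.qza \
        --o-joined-sequences trimmed-joined.qza
}

# Example: trim_then_join paired-end-demux.qza
```

The joined artifact can then go into quality-filter q-score-joined and on to deblur or OTU clustering as in the workflows earlier in the thread.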