Hi, I have three sequencing fastq files generated with Illumina from a sequencing center. R1, R2 and R3 files. How can I import the files and transform it into .qza file?
Should I use qiime demux emp-paired or qiime cutadapt demux-paired to demultiplex the files prior to dada2?
I checked the EMP protocol which using the demux emp-paired plugin and noticed that the current primers format has changed, where the 12-base barcode is now in the forward primer construct. However, my data is generated using the primers of old format. How should I import and demultiplex the files? Can I use qiime demux emp-paired or qiime cutadapt demux-paired to demultiplex? Do I need to let say pre-process the files first due to the barcode and primers format difference?
Refer: http://press.igsb.anl.gov/earthmicrobiome/protocols-and-standards/16s/
Copied from the EMP protocol:
The primer sequences without linker, pad, barcode, or adapter are as follows:
Current, 2015-present (fwd-barcoded: 515FB-806RB):
FWD:GTGYCAGCMGCCGCGGTAA; REV:GGACTACNVGGGTWTCTAAT
Original, pre-2015 (rev-barcoded: 515F-806R):
FWD:GTGCCAGCMGCCGCGGTAA; REV:GGACTACHVGGGTWTCTAAT
Which primers were used doesn't matter to this step, what's more important is that your data has the non-biological data already stripped. In other words, the linker, primer, barcode, etc, are not in R1 or R2 (barcodes should be in R3). It will use the order of R3 to work out which sequences from R1 and R2 belongs to which samples.
qiime cutadapt demux-paired is for when your barcodes are still in the sequences and need to both be identified and trimmed out. This shouldn't be the case with an EMP-like protocol.
Thank you very much @ebolyen for your help!! I tried to run it. I also checked the barcode in R3 file, which are in reverse complement (I guess because the barcodes were at the reverse primer construct). Hence, I used --p-rev-comp-mapping-barcodes \
#View the representative sequence file for all samples
qiime feature-table tabulate-seqs
--i-data rep_seqs_dada2.qza
--o-visualization rep_seqs_dada2.qzv
#summary of identified features
qiime feature-table summarize
--i-table feature_table_dada2.qza
--o-visualization feature_table_dada2.qzv
--m-sample-metadata-file 16s_metadata.txt
During the dada2, I got the message "Not all sequences were the same length". Is that an issue? I'm not sure how different is this from OTU picking in QIIME1. I used to get about 15,000 OTU previously. With dada2, I got the output number of features = 2651. I'm not sure if this is due to the trim/truncation I didn't do correctly ( --p-trunc-len-f 140) during the dada2?
With the feature table, should I still perform OTU picking?
Not necessarily, that can happen if you have variable length adapters. From your quality plots it looks like all of the positions are blue (that I can see), so it's probably just the last few positions that are missing on a few reads (which can be hard to see in the plot). This shouldn't present any issue in practice as far as I'm aware.
Also, the feature counts are consistent with our experiences. The older OTU-based methods seem to dramatically inflate the number of features due to sequencing error, whereas DADA2 and Deblur are able to correct it, resulting in drastically fewer (and higher quality) ASVs.
That's up to you, but it's technically not necessary. Closed reference could still be advantageous if you are dealing with multiple amplicon targets (as an example for why you might want to still cluster).
Regarding your comment: "From your quality plots it looks like all of the positions are blue (that I can see), so it’s probably just the last few positions that are missing on a few reads (which can be hard to see in the plot)." Yes, my plots only have blue bars. I'm not sure what does the blue bar and red bar mean. Does it help to decide the truncation position?
Is there any specific cut-off quality score to use for truncation/trimming the sequence? Like bottom of box of the QC (or middle of box etc) bar below certain score let say 20? I went through the Moving Picture tutorial and discussion in other related topic in this forum. If I'm right, it seems that there isn't any specific cut off value base on QC score.
Since it is not necessary to do OTU picking, I suppose the next I can do is assigning the features to a taxa using qiime feature-classifier classify-sklearn and view the taxonomic composition with interactive bar plots qiime taxa barplot . I suppose I can then download the csv file (taxa counts) and import into R for other analysis, such sPLS-DA and Lefse. In QIIME1, I used 97_otus.fasta but I read in QIIME2 forum where 99_otus.fasta seems to be preferable for feature-classifier (if I'm right).
Besides that, I suppose I can use the feature table generated with DADA2 for closed reference OTU picking against the 97_otus.fasta (or 99_otus.fasta) using qiime vsearch cluster-features-closed-reference, which I intend to use the closed reference feature table for Picrust analysis next. Just that, I'm not sure based on what, I can choose the percent identity at which clustering should be performed --p-perc-identity?
No they just indicate that some of the sampled reads for the visualization are shorter than the rest (when they are red). Since you don't really have any visible, I wouldn't worry about it.
There is not. It's very much a judgement call. Some people do median q-30, others wait for a visually obvious drop, others try to keep as much as they can.
If you are talking about training the classifier, yes the more information you can provide to the classifier during training, the better it should perform (ideally). We also already have some pre-trained ones for 16S if that's the amplicon you are using: Data resources — QIIME 2 2018.2.0 documentation
You should have --p-perc-identity match the OTUs you are using (so 0.97 for 97_otus.fasta). Otherwise that should work great, I've done a similar thing with PICRUSt before as well (although I really only needed the functional profile table, so matching that to closed reference was very easy).