I am combining datasets from the NCBI SRA database with my own data to make comparisons. After looking through the forum, I am still struggling to determine how combining datasets will affect specific analysis steps. I would appreciate any thoughts, resources, and advice you have on how to go about this.
This post explained how to trim the data to a certain region, which would be optimal for my analysis because I am combining V4, V4-V5, and V3-V4 paired-end sequences. My plan is to use the qiime cutadapt trim-paired command as you described it in this post using the V4 primer sequences. I have imported all of the runs separately and I am planning on merging after the DADA2 step.
If I use cutadapt to remove the V4 primer sequences before running DADA2, how will that affect the DADA2 steps? Should I avoid trimming the sequences regardless of quality in order to maintain the same length across the different runs?
Would cutadapt affect how the forward and reverse reads are joined or is that irrelevant after you import them as an artifact?
Additionally, how would the use of cutadapt affect the qiime feature-classifier extract-reads step? What would you use for the --p-f-primer and --p-r-primer options if you have already removed the primers? Could I just skip the cutadapt step and instead use the V4 primers during the extract-reads step?
Hello @JadeS,
Sorry for the late response!
Welcome to the QIIME community and what a great post!
It looks like you have done a lot of research on this and seem to have a good plan!
Using cutadapt trim-paired should not affect your DADA2 steps. I do not believe that your reads have to be the same length across the different runs.
DADA2 should join your reads for you, and cutadapt should not be an issue.
From what I understand, you cannot use qiime feature-classifier extract-reads without primers, so you couldn't use it after using cutadapt to remove the primers. I am unsure what would happen if you skipped the cutadapt step and instead used the V4 primers during extract-reads. Maybe some other people can chime in on that portion?
Thanks so much for your reply! I tried using cutadapt on all of my sequences, but when I summarized the output into visualizations, it was nearly identical to the initial demultiplexed file after importing. All of the lengths and quality scores were the same even though it should have trimmed off a whole hypervariable region. Does that mean that it didn’t work or is there another way to check the results that I’m missing?
When I used DADA2 on the cutadapt output, I realized that you need to include the --p-trunc-len-f and --p-trunc-len-r arguments. What should I use for those if cutadapt has (hopefully) made every run the same length?
So just to clarify, I could skip the extract-reads step altogether?
I would recommend running cutadapt with the --verbose flag and taking a look at the output. Check out this section of cutadapt's docs for guidance on interpreting the output (and feel free to share that here). Then you will have an idea of whether or not you are getting the expected results with cutadapt.
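As a quick sanity check alongside the verbose output, here is a minimal Python sketch of what primer trimming does to a read. It assumes the V4 forward primer is 515F (`GTGCCAGCMGCCGCGGTAA`), which may not be the primer you used, and it only accepts exact matches to the degenerate primer, whereas cutadapt itself tolerates some mismatches. The helper names (`primer_to_regex`, `trim_front`) are made up for illustration.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_to_regex(primer):
    """Compile a degenerate primer into a regex (exact match, no mismatches)."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_front(read, primer):
    """Return (trimmed_read, found); drops the primer and everything before it."""
    match = primer_to_regex(primer).search(read)
    if match is None:
        return read, False  # primer not found: the read is left unchanged
    return read[match.end():], True


# Assumption: 515F as the V4 forward primer.
V4_FORWARD = "GTGCCAGCMGCCGCGGTAA"

# Toy reads, invented for this example.
reads = [
    "ACGTACGT" + "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGGTGCAAGCG",  # primer present
    "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGG",                      # primer absent
]
results = [trim_front(read, V4_FORWARD) for read in reads]
fraction_trimmed = sum(found for _, found in results) / len(results)
print(f"fraction of reads with the primer: {fraction_trimmed}")
```

If that fraction is close to zero on real reads, the primer is probably not in the reads at all (or not where the trimming expects it). Cutadapt keeps untrimmed reads by default, so the output would then look the same as the input, which would match what you are seeing.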
Hello @JadeS,
I am not sure what is going on with your cutadapt. I agree with @andrewsanchez that using the --verbose flag might help you figure out the error. If that doesn't work, would you mind uploading the before and after visualizations and the verbose output so I can get a better idea of what the problem might be?
If your data are of good enough quality, you can simply truncate at the full sequence length (so you cut nothing off). If your data are noisy near the ends of the reads, you will need to truncate the noisy portion off.
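To make the truncation idea concrete, here is a rough Python sketch of one possible heuristic: scan positions from the start of the read and stop at the first one where the median quality drops below a threshold. This is not how DADA2 works internally. The function name `suggest_trunc_len`, the threshold of 25, and the toy quality scores are all assumptions for illustration. In practice you would read the cutoff off the interactive quality plot from the demux summary.

```python
from statistics import median


def suggest_trunc_len(quality_by_read, min_median_q=25):
    """Count leading positions whose median quality is >= min_median_q.

    Scans left to right and stops at the first position below the threshold.
    """
    max_len = max(len(scores) for scores in quality_by_read)
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        column = [scores[pos] for scores in quality_by_read if len(scores) > pos]
        if median(column) < min_median_q:
            return pos
    return max_len


# Toy per-read quality scores (hypothetical, 6 bases per read).
clean_run = [[38, 38, 37, 36, 35, 30], [38, 37, 37, 35, 34, 28], [37, 38, 36, 36, 33, 29]]
noisy_run = [[38, 37, 36, 20, 15, 10], [38, 38, 35, 22, 12, 8], [37, 36, 34, 18, 14, 9]]

clean_len = suggest_trunc_len(clean_run)
noisy_len = suggest_trunc_len(noisy_run)
print(clean_len, noisy_len)
```

Two things to keep in mind when choosing the real values: DADA2 discards reads that are shorter than the truncation length, and a value of 0 means no truncation is performed. For paired-end data the truncated forward and reverse reads still need to overlap enough to be merged.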
And yes, you could completely skip the extract-reads step. Once your data is in the feature table that DADA2 outputs, you should be golden to follow any QIIME 2 tutorial for analysis.
If you really wanted to use qiime feature-classifier extract-reads, you could skip the cutadapt step and try to extract reads with the V4 primers instead. I have to be honest though, I don't know if that will work. I have never seen qiime feature-classifier extract-reads used like that, but it is a really cool idea. You would just have to be sure to check the sensitivity of this step, because missing reads or trimming to different lengths could affect downstream analysis.
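For anyone curious what "extracting the V4 region with the primers" means at the sequence level, here is a small Python sketch of the idea. It is not what extract-reads actually runs. It assumes the 515F/806R primer pair, requires exact matches to the degenerate primers, and uses toy sequences invented for the example. The function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a (possibly degenerate) nucleotide sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def to_regex(primer):
    """Turn a degenerate primer into a regex fragment (exact match only)."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_region(seq, fwd_primer, rev_primer):
    """Return the sequence between the two primers, or None if either is missing."""
    pattern = re.compile(
        to_regex(fwd_primer) + "(.*?)" + to_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None


# Assumption: 515F / 806R as the V4 primer pair.
FWD, REV = "GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT"

# Toy "V3-V4" sequence: upstream bases, forward primer, V4 insert,
# reverse-complemented reverse primer, downstream bases.
v4_insert = "TACGGAGGGTGCAAGCGTTAATCGG"
long_seq = ("CCTACGGGAGGCAGCAG" + "GTGCCAGCAGCCGCGGTAA" + v4_insert
            + "ATTAGAAACCCCGGTAGTCC" + "GGG")

extracted = extract_region(long_seq, FWD, REV)
missing = extract_region(v4_insert, FWD, REV)  # no primers in this sequence
print(extracted, missing)
```

Sequences that return None here are the "missing reads" to watch for: checking how many sequences lose the region is one way to gauge the sensitivity of the step.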
I agree with @cherman2's theory that the feature classifier should work. But I think you might have to use its output as metadata in a group function so that your feature IDs map to the trimmed sequence. I'm not sure if anyone has tried doing this, but I've done something similar using a different library and it worked. I think grouping would be the right approach here, because multiple V3-V4 sequences could map to the same V4 region.
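Here is a minimal Python sketch of the grouping idea, assuming you already have a mapping from each feature ID to its trimmed V4 sequence. The feature IDs, sequences, and counts are invented, and `group_counts` is a hypothetical helper, not a QIIME 2 function. It just shows the effect of summing counts for features that collapse to the same region.

```python
from collections import defaultdict


def group_counts(feature_counts, feature_to_region):
    """Sum counts of features that share the same trimmed region."""
    grouped = defaultdict(int)
    for feature_id, count in feature_counts.items():
        region = feature_to_region.get(feature_id)
        if region is None:
            continue  # features with no extracted region are dropped
        grouped[region] += count
    return dict(grouped)


# Hypothetical per-feature counts for one sample.
counts = {"asv1": 10, "asv2": 5, "asv3": 7, "asv4": 2}
# Hypothetical mapping: asv1 and asv2 differ only outside V4; asv4 had no match.
regions = {"asv1": "TACGGAGG", "asv2": "TACGGAGG", "asv3": "TACGTAGG"}

grouped = group_counts(counts, regions)
print(grouped)
```

Note that features without a mapped region silently drop out in this sketch, so it is worth comparing total counts before and after grouping.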