problem with cutadapt trim-single

eysung · July 1, 2024, 9:20am

Hi,
I'm having a problem running this command:
qiime cutadapt trim-single --p-cores 8 --p-front ACTCCTACGGGAGGCWGCAG --p-adapter CGTATTACCGCGGCTGCTGG --i-demultiplexed-sequences s_pre_PRJEB45207/re/import/IT/import_IT_B_V3.qza --o-trimmed-sequences s_pre_PRJEB45207/re/recut/IT_B_V3.qza

The data are 16S rRNA sequences generated on an Ion Torrent platform and downloaded from NCBI.
I ran the command above to remove the primer pair: forward ACTCCTACGGGAGGCWGCAG and reverse CGTATTACCGCGGCTGCTGG.

After running the command, I exported the FASTQ file with qiime tools export and checked for the primer sequences.
However, when I visualized the output with qiime demux summarize (or truncated it with qiime dada2 denoise-single), the sequence length was greater than in the raw data.

raw data sequence base: 301
after cutadapt(export): 321

My questions are:

1. Is it correct that --p-front takes the forward primer and --p-adapter takes the reverse primer?
2. If --p-adapter does take the reverse primer, should I enter it as the reverse complement sequence?
3. When I put the forward primer in --p-adapter and the reverse complement primer sequence in --p-adapter, I noticed that only the --p-adapter works and the --p-adapter is not removed. Can you briefly explain how it works?

If I'm doing something wrong, please advise!
EY

gregcaporaso · July 1, 2024, 10:02pm

Hi @eysung,
Could you provide the output of qiime demux summarize for both the input (s_pre_PRJEB45207/re/import/IT/import_IT_B_V3.qza) and output (s_pre_PRJEB45207/re/recut/IT_B_V3.qza) for that command? I'd like to look at the issue you're describing in the context of your provenance.

1. They're not necessarily primers, but it is correct that --p-front looks at the 5' end of the sequences (though there can be bases before the match, unless you anchor it by adding the ^ character at the beginning of your sequence - see here) and --p-adapter looks at the 3' end of the sequences.
2. I just looked this up in the cutadapt docs, and they say: "By default, Cutadapt expects adapters to be given in the same orientation (5’ to 3’) as the reads. That is, Cutadapt considers neither the reverse complement of the reads nor of the adapters." (source: Cutadapt 5.0 documentation, User guide) If you're not sure whether the sequence you're providing is the reverse complement with respect to your reads, I recommend just trying it both ways.
3. Do you have a typo in there? Is one of the --p-adapter references supposed to be --p-front? I suspect that if you put the reverse primer in as the --p-front parameter, the reads will be trimmed at that match before --p-adapter is searched for, so the forward primer won't be in the read anymore (since --p-front removes the matched sequence and anything preceding it).
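
For anyone who wants to try it both ways, here is a minimal Python sketch of the two ideas above. It is not how cutadapt is implemented: the helper names (reverse_complement, trim_front, trim_adapter) are made up for illustration, matching is exact only (no mismatches, no IUPAC wildcard matching), and the toy read and toy adapters are invented. It only shows how to get the reverse complement of a primer and why the 5'/3' roles matter.

```python
# IUPAC-aware complement table (W, S and N are their own complements).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence with IUPAC codes."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def trim_front(read: str, adapter: str) -> str:
    """Toy 5' trim: drop the adapter and everything before it."""
    i = read.find(adapter)
    return read if i == -1 else read[i + len(adapter):]


def trim_adapter(read: str, adapter: str) -> str:
    """Toy 3' trim: drop the adapter and everything after it."""
    i = read.find(adapter)
    return read if i == -1 else read[:i]


if __name__ == "__main__":
    # Primers from the original post, in both orientations.
    for name, primer in [("forward", "ACTCCTACGGGAGGCWGCAG"),
                         ("reverse", "CGTATTACCGCGGCTGCTGG")]:
        print(name, primer, reverse_complement(primer))

    # Invented read: 2 leading bases, 5' adapter AAAC, insert TTGTTG,
    # 3' adapter CCCA, 2 trailing bases.
    read = "GG" + "AAAC" + "TTGTTG" + "CCCA" + "TT"

    # Intended roles: 5' sequence as "front", 3' sequence as "adapter".
    print(trim_adapter(trim_front(read, "AAAC"), "CCCA"))

    # Swapped roles: the 3' sequence given as "front" removes everything
    # up to and including it, so the insert is lost.
    print(trim_front(read, "CCCA"))
```

With the roles swapped, the toy read loses its insert entirely, which is the same effect described in point 3.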

eysung · July 2, 2024, 12:33am

Hi @gregcaporaso, thank you for your reply!

Here are the files you requested:
import_IT_B_V3.qzv (296.4 KB)
re_IT_B_V3.qzv (301.6 KB)

And thank you for answering my questions!
If you need any other information, please let me know.

EY

gregcaporaso · July 2, 2024, 8:22pm

Hi @eysung,
Thanks for sharing these files.

When I compare the Demultiplexed sequence length summary tables from the second tab side-by-side (see screenshot below), the results look like what I would expect - basically, post-trimming, your median sequence length is shorter (172 nucleotides) relative to pre-trimming (192 nucleotides).

I suspect that you're seeing some outliers (in terms of sequence length) in the post-trimming quality score plot (at the top of the second tab) that don't show up in the pre-trimming quality score plot since they're infrequent. You're losing about 40% of your sequences during trimming, and those plots are based on a random sampling of 10000 sequences, so it may just be more likely for those outliers to be shown in the plot when you have fewer sequences.

So I think everything is good here. If you want to be really certain, you could run qiime demux summarize again, and have it generate the quality plot from all of the sequences by providing --p-n 596480 (where 596,480 is the total number of sequences in your pre-trimming data) and then confirm that you see the longer sequences in there. That might take a little while to run, but I suspect not too long. I'll be interested to hear how it works out if you try that.
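
To put rough numbers on the sampling argument, here is a small Python sketch. The pre-trimming total (596,480) and the plot sample size (10,000) come from this thread. The post-trimming total is only an estimate from the "about 40%" loss mentioned above, and the number of unusually long reads is a made-up placeholder, so treat the output as an illustration rather than a measurement.

```python
def prob_sampled(total: int, sample_size: int, n_outliers: int) -> float:
    """Probability that at least one of n_outliers reads lands in a random
    sample of sample_size reads drawn without replacement from total reads."""
    sample_size = min(sample_size, total)
    p_none = 1.0
    for i in range(n_outliers):
        # Chance that the i-th outlier is also missed, given earlier misses.
        p_none *= max(0, total - sample_size - i) / (total - i)
    return 1.0 - p_none


if __name__ == "__main__":
    pre_total = 596_480                   # from the thread
    post_total = round(pre_total * 0.6)   # ASSUMPTION: about 40% of reads lost
    sample_size = 10_000                  # sample size used for the quality plot
    n_outliers = 5                        # HYPOTHETICAL count of very long reads

    print("pre-trimming :", prob_sampled(pre_total, sample_size, n_outliers))
    print("post-trimming:", prob_sampled(post_total, sample_size, n_outliers))
    # Sampling every read (as with --p-n 596480) always includes the outliers.
    print("all reads    :", prob_sampled(pre_total, pre_total, n_outliers))
```

The same few long reads are more likely to show up in the plot once the total is smaller, and they are guaranteed to show up when every read is sampled.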

Good luck with the rest of your analysis!
