Hello,
I am a microbiology student teaching myself bioinformatics. I have no way to verify whether my analysis is appropriate, so I am asking this forum for help.
My question is about trimming adapters and primers from demultiplexed sample-specific fastq files.
Here is some background on my analysis: I performed amplicon sequencing of the 16S rRNA gene V3-V4 region on a MiSeq instrument. The primer sequences I used are as follows:
- 341F: CCTACGGGNGGCWGCAG
- 805R: GACTACHVGGGTATCTAATCC
I used the Nextera XT kit for library preparation. The full library construct, with adapters, index positions, and sequencing primer sites, is as follows (both strands shown):
- 5'- AATGATACGGCGACCACCGAGATCTACAC[i5]TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[locus-specific sequence]-CTGTCTCTTATACACATCTCCGAGCCCACGAGAC[i7]ATCTCGTATGCCGTCTTCTGCTTG -3'
- 3'- TTACTATGCCGCTGGTGGCTCTAGATGTG[i5]AGCAGCCGTCGCAGTCTACACATATTCTCTGTC-[locus-specific sequence]-GACAGAGAATATGTGTAGAGGCTCGGGTGCTCTG[i7]TAGAGCATACGGCAGAAGACGAAC -5'
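As a sanity check on the construct above, the read-through sequence expected at the 3' end of the reverse read can be derived from the top strand: it should be the reverse complement of the segment immediately to the left of the locus-specific sequence. The forward read runs along the top strand, so its read-through sequence is simply the segment to the right of the locus-specific sequence. The snippet below is a minimal sketch of that derivation. It assumes `rev` and `tr` are available and that the input is uppercase A/C/G/T only; `revcomp` is a hypothetical helper name, not part of any tool.

```shell
# Reverse-complement a DNA sequence given as uppercase A/C/G/T.
# revcomp is a hypothetical helper name used only for this sketch.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# Top-strand segment located just 5' of the locus-specific sequence.
revcomp TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
# prints CTGTCTCTTATACACATCTGACGCTGCCGACGA
```

Comparing the printed sequence character by character with the adapter passed to a trimming tool is a quick way to catch copy-and-paste errors.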
When I checked the fastq files for one sample, most of the sequences in the _1 (forward) file started with the 341F primer, and most of the sequences in the _2 (reverse) file started with the 805R primer (though not all reads did). Additionally, a few reads had primer and adapter sequences at the ends. Therefore, I planned to use cutadapt with the following command:
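# INPUT_DIR is a placeholder for the directory holding the demultiplexed reads
cutadapt \
  -j 14 \
  -a CTGTCTCTTATACACATCTCCGAGCCCACGAGAC \
  -g CCTACGGGNGGCWGCAG \
  -A CTGTCTCTTATACACATCTGACGCTGCCGACGA \
  -G GACTACHVGGGTATCTAATCC \
  -o "${OUTPUT_DIR}/primer_${base_name}_1.fastq.gz" \
  -p "${OUTPUT_DIR}/primer_${base_name}_2.fastq.gz" \
  -m 50 \
  -q 20 \
  "${INPUT_DIR}/${base_name}_1.fastq.gz" \
  "${INPUT_DIR}/${base_name}_2.fastq.gz"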
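The manual check described before the command (how many reads begin with the expected primer) can also be quantified rather than judged by eye. The following is a rough sketch, not a validated tool. It assumes gzip-compressed FASTQ files with exactly four lines per record and uppercase bases, and it only expands the degenerate codes that occur in these two primers (N, W, H, V). The function names `iupac_to_regex` and `count_primer_starts` are hypothetical.

```shell
# Expand the IUPAC codes used in 341F/805R into regex character classes.
# Only N, W, H and V are handled; other ambiguity codes pass through unchanged.
iupac_to_regex() {
  printf '%s\n' "$1" | sed -e 's/N/[ACGT]/g' -e 's/W/[AT]/g' \
                           -e 's/H/[ACT]/g' -e 's/V/[ACG]/g'
}

# Print "<reads starting with primer>/<total reads>" for a gzipped FASTQ file.
# $1 = gzipped FASTQ file, $2 = primer already converted to a regex.
count_primer_starts() {
  gzip -dc "$1" | awk -v re="^$2" '
    NR % 4 == 2 { n++; if ($0 ~ re) hit++ }   # sequence lines only
    END { printf "%d/%d\n", hit, n }'
}

# Example call (sample_1.fastq.gz is a hypothetical file name):
# count_primer_starts sample_1.fastq.gz "$(iupac_to_regex CCTACGGGNGGCWGCAG)"
```

This counts exact matches only, so reads with a sequencing error inside the primer are reported as not starting with it. The true fraction of primer-initiated reads is therefore likely to be somewhat higher than the printed value.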
In summary, I would like to know:
- Is my script suitable for trimming my data?
- How should a researcher decide on the parameters for minimum length (-m) and quality score (-q)?
- Are there any other considerations I should be aware of?
If more information is needed to address these questions, I am happy to provide it.
I will also provide images of some of my fastq files for reference.
Thank you in advance!