Analysing V4/V5 16S rRNA illumina Miseq paired-end reads.

Analysing V4/V5 16S rRNA Miseq paired-end reads (2x300). I amplified my eDNA with 51F-y and 926R primer sets and sent the samples for Illumina Miseq sequencing. I received the sequences as FASTQ.gz file. Do i need to trim the primers or adapters? I received the sequesnces as separate samples meaning they did the demultiplexing.I do not know the adapters they used (probably the common illumina adapters). They shared FASTQC report which insicate that all reads were 301 bp (illumina added 1 more base pair) both the reverse and the forward. However I tried to run this script

primer sequences used during pcr

FWD_PRIMER="GTGYCAGCMGCCGCGGTAA"
REV_PRIMER="CCGYCAATTYMTTTRAGTT"

Looping through all forward reads (_R1.fastq.gz)

for R1 in "INPUT_DIR"/*_R1.fastq.gz; do R2="{R1/_R1.fastq.gz/_R2.fastq.gz}" # Find corresponding R2 read
SAMPLE=$(basename "$R1" _R1.fastq.gz) # Extract sample name

echo "Processing sample: $SAMPLE"

# Running Cutadapt
cutadapt \
    -j 4 \
    -a "$FWD_PRIMER" -A "$REV_PRIMER" \
    --minimum-length 100 \
    -o "$OUTPUT_DIR/${SAMPLE}_trimmed_R1.fastq.gz" \
    -p "$OUTPUT_DIR/${SAMPLE}_trimmed_R2.fastq.gz" \
    "$R1" "$R2"

done

echo "Trimming complete!"

I get the following output

=== Summary ===

Total read pairs processed: 1,156,333
Read 1 with adapter: 536,849 (46.4%)
Read 2 with adapter: 552,446 (47.8%)

== Read fate breakdown ==
Pairs that were too short: 551,224 (47.7%)
Pairs written (passing filters): 605,109 (52.3%)

Total basepairs processed: 696,112,466 bp
Read 1: 348,056,233 bp
Read 2: 348,056,233 bp
Total written (filtered): 364,232,687 bp (52.3%)
Read 1: 182,134,893 bp
Read 2: 182,097,794 bp

=== First read: Adapter 1 ===

Sequence: GTGYCAGCMGCCGCGGTAA; Type: regular 3'; Length: 19; Trimmed: 536849 times

However, when I run this script using the illumina adapeters instead of specifying the primers used the primers used during initial pcr amplifying region of interest

Load necessary modules

module load Miniconda3/23.5.2-0

Activate the conda environment with Cutadapt installed

source activate cutadapt_env

Loop over all *_R1.fastq.gz files and trim them using find

find /cluster/projects/nn8999k/hosea/ACTINOMETABA/ -name "*_R1_001.fastq.gz" | while read r1; do

r2="${r1/_R1_001.fastq.gz/_R2_001.fastq.gz}" # Assuming R2 files have the same base name

out_r1="/cluster/projects/nn8999k/hosea/ACTINOMETABA/trimmed_reads/$(basename $r1 _R1_001.fastq.gz)_trimmed_R1.fastq.gz"

out_r2="/cluster/projects/nn8999k/hosea/ACTINOMETABA/trimmed_reads/$(basename $r1 _R1_001.fastq.gz)_trimmed_R2.fastq.gz"

cutadapt -j 4 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --minimum-length 100 -o $out_r1 -p $out_r2 $r1 $r2

done

I get the following outcome

Total read pairs processed: 1,156,333
Read 1 with adapter: 18,279 (1.6%)
Read 2 with adapter: 18,126 (1.6%)

== Read fate breakdown ==
Pairs that were too short: 5,731 (0.5%)
Pairs written (passing filters): 1,150,602 (99.5%)

Total basepairs processed: 696,112,466 bp
Read 1: 348,056,233 bp
Read 2: 348,056,233 bp
Total written (filtered): 690,074,068 bp (99.1%)
Read 1: 345,032,569 bp
Read 2: 345,041,499 bp

I am confused how do I go about this?

Hi @hosismas,

This depends on how they have been sequenced. Did the sequencing protocol sequence through the PCR primers? Or are they constructed in such a way that the PCR primers will not be in your resulting sequence, e.g. the EMP sequencing protocol. If the former... then yes, you need to use cutadapt to remove the PCR primers. Otherwise, if the latter.. then you generally do not need to trim, unless they are low quality bases.

Is there a reason why you are not using qiime cutadapt trim-paired?

Anyway, according to the cutadapt paired-end docs, I think the issue has more to do with the flags being used. Specifically, you should be using -g / -G to find and remove from the 5' end and not -a / -A, which removes from the 3' end. For example, you should use:

-g GTGYCAGCMGCCGCGGTAA` -G CCGYCAATTYMTTTRAGTT

-Hope this helps. :slight_smile:

1 Like

The primers with no illumina adapters attached to it were used to amplify the eDNA for the V4/V5 using the primers mentioned above. Thereafter, samples we sent to a service provider. As to how they did the sequencing apart from knowing it was 2x300 paired.-end reads, I have no such information.

All you need to do is ask the provider.

Otherwise, just run cutadapt using the flags I mentioned above. If the primers are contained within your sequences, then you should see more than 80-90% of the reads with primers trimmed. If you see that then your reads did have the primers... Otherwise, I'd suspect less than a few percent will stochastically have anything resembling a primer sequence.

1 Like

I tried runnning this code

Define primer and adapter sequences

FWD_PRIMER="GTGYCAGCMGCCGCGGTAA" # Forward primer
REV_PRIMER="CCGYCAATTYMTTTRAGTT" # Reverse primer
ILLUMINA_FWD_ADAPTER="AGATCGGAAGAG" # Illumina forward adapter
ILLUMINA_REV_ADAPTER="AGATCGGAAGAG" # Illumina reverse adapter

Import FASTQ files into QIIME2

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path "$INPUT_DIR"
--input-format CasavaOneEightSingleLanePerSampleDirFmt
--output-path "$QIIME2_DIR/demux.qza"

Run cutadapt trim-paired

qiime cutadapt trim-paired
--i-demultiplexed-sequences "$QIIME2_DIR/demux.qza"
--p-front-f "$FWD_PRIMER"
--p-front-r "$REV_PRIMER"
--p-adapter-f "$ILLUMINA_FWD_ADAPTER"
--p-adapter-r "$ILLUMINA_REV_ADAPTER"
--p-cores 8
--verbose
--o-trimmed-sequences "$TRIMMED_DIR/trimmed_demux.qza"

I got this summary
=== Summary ===

Total read pairs processed: 885,056
Read 1 with adapter: 446,149 (50.4%)
Read 2 with adapter: 448,683 (50.7%)

== Read fate breakdown ==
Pairs that were too short: 0 (0.0%)
Pairs written (passing filters): 885,056 (100.0%)

Total basepairs processed: 532,803,712 bp
Read 1: 266,401,856 bp
Read 2: 266,401,856 bp
Quality-trimmed: 0 bp (0.0%)
Read 1: 0 bp
Read 2: 0 bp
Total written (filtered): 512,396,267 bp (96.2%)
Read 1: 256,175,429 bp
Read 2: 256,220,838 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAG; Type: regular 3'; Length: 12; Trimmed: 13231 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-12 bp: 1

Does this now work well?

1 Like

Hi @hosismas,

Please rerun without using the --p-adapter-f & --p-adapter-fr flags. Depending on long the reads are... the adapters may not even be within the sequences. Thus, you run the risk of erroneously discarding reads.

1 Like

All reads are 301 bp long

1 Like

Yes, I was just making a general statement. Your amplicon V4V5 should be much longer than that... so there should not be any adapters at the 3' end, at least I would suspect there would be only a small amount.

Again, rerun with the options I mentioned above but use the --p-discard-untrimmed flag. This way you discard any pairs that do not contain the --p-front-* sequences. This will also directly tell you what is being written to file based on the presence of the --p-front-* sequences in your reads

1 Like

I tried to the following conditions

  1. Primers alone

Define primer sequences

FWD_PRIMER="GTGYCAGCMGCCGCGGTAA" # Forward primer
REV_PRIMER="CCGYCAATTYMTTTRAGTT" # Reverse primer

Import FASTQ files into QIIME2 (paired-end data format)

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path "$INPUT_DIR"
--input-format CasavaOneEightSingleLanePerSampleDirFmt
--output-path "$QIIME4_DIR/demux.qza"

Run cutadapt trim-paired WITHOUT --p-discard-untrimmed

qiime cutadapt trim-paired
--i-demultiplexed-sequences "$QIIME4_DIR/demux.qza"
--p-front-f "$FWD_PRIMER"
--p-front-r "$REV_PRIMER"
--p-cores 8
--verbose
--o-trimmed-sequences "$TRIMMED4_DIR/trimmed_demux.qza"
=== Summary ===

Total read pairs processed: 885,056
Read 1 with adapter: 432,984 (48.9%)
Read 2 with adapter: 436,750 (49.3%)

== Read fate breakdown ==
Pairs that were too short: 0 (0.0%)
Pairs written (passing filters): 885,056 (100.0%)

Total basepairs processed: 532,803,712 bp
Read 1: 266,401,856 bp
Read 2: 266,401,856 bp
Quality-trimmed: 0 bp (0.0%)
Read 1: 0 bp
Read 2: 0 bp
Total written (filtered): 516,300,854 bp (96.9%)
Read 1: 258,191,864 bp
Read 2: 258,108,990 bp

=== First read: Adapter 1 ===

Sequence: GTGYCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 432984 times

  1. Adapters alone

Define Illumina adapter sequences

ILLUMINA_FWD_ADAPTER="AGATCGGAAGAG" # Illumina forward adapter
ILLUMINA_REV_ADAPTER="AGATCGGAAGAG" # Illumina reverse adapter

Import FASTQ files into QIIME2 (assuming paired-end Casava format)

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path "$INPUT_DIR"
--input-format CasavaOneEightSingleLanePerSampleDirFmt
--output-path "$QIIME5_DIR/demux.qza"

Run cutadapt trim-paired to remove Illumina adapters

qiime cutadapt trim-paired
--i-demultiplexed-sequences "$QIIME5_DIR/demux.qza"
--p-front-f "$ILLUMINA_FWD_ADAPTER"
--p-front-r "$ILLUMINA_REV_ADAPTER"
--p-cores 8
--verbose
--o-trimmed-sequences "$TRIMMED5_DIR/trimmed_demux.qza"
=== Summary ===

Total read pairs processed: 885,056
Read 1 with adapter: 16,166 (1.8%)
Read 2 with adapter: 16,224 (1.8%)

== Read fate breakdown ==
Pairs that were too short: 70 (0.0%)
Pairs written (passing filters): 884,986 (100.0%)

Total basepairs processed: 532,803,712 bp
Read 1: 266,401,856 bp
Read 2: 266,401,856 bp
Quality-trimmed: 0 bp (0.0%)
Read 1: 0 bp
Read 2: 0 bp
Total written (filtered): 527,816,322 bp (99.1%)
Read 1: 263,914,740 bp
Read 2: 263,901,582 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAG; Type: regular 5'; Length: 12; Trimmed: 16166 times

Thanks @hosismas,

Do you know if other amplicon types were sequenced along with the V4V5 primers? I find it awfully suspicious that consistently ~50% of the R1 and R2 reads have the primers trimmed.

FYI, if you only want to trim adapters you'd need to do so like this:

qiime cutadapt trim-paired
    --i-demultiplexed-sequences "$QIIME5_DIR/demux.qza"
    --p-adapter-f "$ILLUMINA_FWD_ADAPTER"
    --p-adapter-r "$ILLUMINA_REV_ADAPTER"
    --p-cores 8
    --o-trimmed-sequences "$TRIMMED5_DIR/trimmed_demux.qza \
    --verbose"

Here again, even regardless of using both primers & adapters, that only 45-50% of the reads are trimmed. This makes me think there is another amplicon with different primers contained within this run. Do you know if this is true?

If so, then I highly recommend that you make use of the --p-discard-untrimmed. This way you are only writing out reads that are from the targeted amplicon. See here.

I had 10 samples. Five were sequenced for V4/V5 region and the other 5 samples were sequenced for ITS2. I guess after they were barcoded, they might have been pooled together before sequencing. I received the sequenced which were already demultiplexed. These primers I am trying to trim, were initially not adapeter tagged.

Okay.. that likely explains it.

Basically, you can use cutadapt to separate out your ITS and 16S rRNA gene sequences via --p-discard-untrimmed. Assuming your ITS constructs are similar, you just need to run cutadapt with the ITS PCR primers as you did with the 16S primers. Then you'll have two amplicon data sets to use for analysis.

Or.. if each amplicon is setup as separate samples... then all you need to do is import the 16S rRNA gene samples separately from the ITS samples, then process each with cutadapt... etc...

:slight_smile:

When using the `--p-discard-untrimmed flag, 52% of the reads are discarded these reads did not contain the expected primer sequences.

=== Summary ===

Total read pairs processed: 885,056
Read 1 with adapter: 432,984 (48.9%)
Read 2 with adapter: 436,750 (49.3%)

== Read fate breakdown ==
Pairs that were too short: 0 (0.0%)
Pairs discarded as untrimmed: 456,173 (51.5%)
Pairs written (passing filters): 428,883 (48.5%)

Total basepairs processed: 532,803,712 bp
Read 1: 266,401,856 bp
Read 2: 266,401,856 bp
Quality-trimmed: 0 bp (0.0%)
Read 1: 0 bp
Read 2: 0 bp
Total written (filtered): 241,905,666 bp (45.4%)
Read 1: 120,961,075 bp
Read 2: 120,944,591 bp

=== First read: Adapter 1 ===

Sequence: GTGYCAGCMGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 432984 times

That makes sense given that ~half your samples are ITS and will not contain the 16S rRNA gene primers. Correct?

I will try to run the same sample and see if I can identify ITS primers for any possible mixture which I highly doubt.

1 Like

The above results are from v4/v5. I was testing the pipeline for only two samples for v4/v5 and not a combination of both amplicons (ITS).This is only for 16S

1 Like