Frequency per feature is 1: each read is treated as a unique feature

Dear,
I am writing to get input from experienced users. I have nanopore data and am using the q2ONT command-line pipeline to process my 16S rRNA gene sequencing data. After demultiplexing, adapter removal, and trimming the reads to a length of 1400, I imported my sequencing data into QIIME 2.
I used these commands to dereplicate the sequences and to obtain the feature table, the feature sequences, and the feature table summary.

Dereplication of sequences
qiime vsearch dereplicate-sequences \
  --i-sequences 4_single-end-demux.qza \
  --o-dereplicated-table 5_derep-table.qza \
  --o-dereplicated-sequences 5_derep-seqs.qza

Visualization files
qiime feature-table tabulate-seqs \
  --i-data 5_derep-seqs.qza \
  --o-visualization 5_derep-seqs.qzv

qiime feature-table summarize \
  --i-table 5_derep-table.qza \
  --o-visualization 5_derep-table.qzv

After these steps, I got two files, 5_derep-seqs.qzv and 5_derep-table.qzv.

Upon checking 5_derep-table.qzv in QIIME 2 View, I realized that something might have gone wrong, as the frequency per feature is 1 for every feature. A photo is attached.


I later re-ran the vsearch command with --verbose; the output is below.
vsearch v2.7.0_linux_x86_64, 15.3GB RAM, 16 cores

Reading file /tmp/q2-QIIME1DemuxDirFmt-hvrikd0d/seqs.fna 100%
90956600 nt in 64969 seqs, min 1400, max 1400, avg 1400
Dereplicating 100%
Sorting 100%
64969 unique sequences, avg cluster 1.0, median 1, max 1
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%

Although I have many unique features, each one has a count of 1. Why is the number of unique features equal to the number of reads in my samples? Does it mean that dereplication is treating each read as a unique feature? If so, how can I resolve this?

Could you please provide insight into what could have gone wrong, or is it normal to get such results when processing nanopore sequencing data?
Thank you

Yeah. And this matches the other number you reported here.

64969 is also the total number of unique features in the second picture.

Two thoughts:

Nanopore reads are much longer than Illumina reads, so finding more unique reads is more common than on Illumina, especially with Nanopore's error rate.

Finding everything to be unique is probably a problem. I've had this happen when my barcodes were still in my reads, so each barcode-plus-read combination counted as a distinct sequence, which inflated the number of unique reads.

Fixing barcode removal solved this for me. See if you can roll back to the demultiplexing, barcode removal, and adapters removal steps and make sure those are working.
(You can grep for barcodes in your sequences)
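
If grep is awkward on Windows, the same check can be sketched in Python. This is only a minimal illustration: it counts reads that contain a motif (or its reverse complement) within the first or last few bases. The barcode string is a made-up placeholder, not a real SQK-16S024 barcode, and the function names and the 100-nt window are my own assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_reads_with_motif(fasta_text, motif, window=100):
    """Count reads with the motif (either strand) near either end."""
    motif = motif.upper()
    motifs = {motif, reverse_complement(motif)}
    hits = 0
    for _, seq in parse_fasta(fasta_text):
        # "N" separator stops a motif from matching across the two ends.
        ends = seq[:window] + "N" + seq[-window:]
        if any(m in ends for m in motifs):
            hits += 1
    return hits


# Hypothetical placeholder barcode; replace with the kit's real sequences.
PLACEHOLDER_BARCODE = "GATTACAGATTACA"

toy_fasta = (
    ">r1\nGATTACAGATTACACGGCGATGCTTAAC\n"   # barcode left at the start
    ">r2\nGGCGGCGTGCCTAACACATG\n"           # clean read
    ">r3\nCGGCGATGCTGTAATCTGTAATC\n"        # reverse complement at the end
)
n_hits = count_reads_with_motif(toy_fasta, PLACEHOLDER_BARCODE)
print(n_hits, "of 3 toy reads still carry the placeholder barcode")
```

To try it on real data, first export the FASTA from the artifact (for example with qiime tools export) and pass the file's text to count_reads_with_motif. If a large share of reads match a real barcode, trimming did not work as intended.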

Dear,
Thank you for sharing your experience. I demultiplexed my files with Guppy and used the --enable_trim_barcodes, --trim_adapters and --trim_primers options to remove these from the reads.

I am not sure what the barcode/adapter sequences in my files are. Could you provide any input on this? I used the SQK-16S024 nanopore kit for the sequencing.
Thanks

I know more about Illumina than nanopore, but hopefully I can offer a little help.

From the SQK-16S024 store page, it looks like 24 sample kits are for the Flongle. Is that the machine and kit you used?

When running Guppy, did you enable barcode removal as discussed here?

I might ask the sequencing core or Nanopore directly if they have experience with barcodes ending up in the reads after running Guppy.

Dear,
Yes, I used the SQK-16S024 kit with 24 barcodes.
I used the following Guppy command:
"C:\Program Files\OxfordNanopore\ont-guppy-cpu\bin\guppy_barcoder.exe" --input_path C:\Users\u0150736\Pictures\Nanopore\run1 --save_path C:\Users\u0150736\Pictures\Nanopore\guppy --trim_adapters --trim_primers --detect_mid_strand_adapter --detect_mid_strand_barcodes --enable_trim_barcodes --barcode_kits SQK-16S024

I checked the Nanopore community, and users there have mentioned similar commands.

Regards
Muhammad

Yeah, that should do it...

Do you know what the barcodes are?

Would you be willing to post the 5_derep-seqs.qza so I can take a look inside for barcodes in the reads?

Dear,
To optimize the workflow, I am now working with a few samples. Attached are the qza and qzv files for these samples, which contain 6500+ reads. I hope it works.
5_derep-seqs.qza (2.3 MB)
5_derep-seqs.qzv (559.5 KB)

I do not know the barcodes. I tried very hard but could not find them.

Here are the first three reads from 5_derep-seqs.qza.

Looks like these are from sample BC01

>4a4f58480dffc1a95a58d33026cc59b7e0374ffd BC01_0
CGGCGATGCTTAACACATACAAGTCGAACGGAGCACCCTTGACACATAATTCGGCCAAATGATAGGAATACTTAGTGGCGGACTGGTGGTAACGCGTGAGGAACCTGCACTTCAGAGGACGCAGTTGGAAACGACTGCTAATACCGCATGATATATTTGAGGGCATCCTTGGATATCAAAGATTATATCGCTGGAAGATGGCCTCGCGTCTGATTAGTAGATGGTGGGTAACGGCCCACCATGGCGACGATCAGTAGCCGGACTGAAGGTTGACCGGCCACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCATCAGTGGGGAATATTGGGCAATGGACGCAAGTCTGACCCAGCAACGCCGCGTGAAGGAAGAAACAGGTTAACTTCTTTTGTCAGGGAACAGTAGAAGGGTACACAGCGAATAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAGGCATAATTGTCCGGATTTACTGGGTGTAAAAGGGCGTGCAGCCGGGCCGGCAAGTCAGATGTGAAATCTGGGCTTAACCTCCAAACTGCATTTAGAAACTATTTGGGTCTTGGTACCGGAGAGGTTATCGGAATTCCTTGTGTGGCGGTGATGCGTAGATATAAGGGAAGAACACCAGTGGCGAGGCGGATAACTGGACGGCAACTGACGGTGAGGCACCCCAGCGTGGGGAATAAACGGGGATTAGATACCCTGGTGATTACGCTGTAAACGATGGATACTAGGTGTGCGGGGACTGACCCCCTGCGTGCCGCAGTTAACACAATAATTATCACCTGGTGATCGCAAGGTTGAAACTCAAAAGGAATTGACGAGGACGCACAAGCGGTGGATTATGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGGCTTGACATCCTACTAACAGTAGAATACATTGGTGCTGGAAGGTAGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCCTATTGTTAGTTGCTACGCAGGCACTCTAGCGAGACTGCCGTTGACAAAAACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCCCTTATGTCCCTGGGCTACACGTAATACAATGGCGGTCAACAGAGGAGGCAAAGCCGCGAGGCAGAGAGCAAACCCCCAAAAGCCGTCCAGTTCGGATCGCGGGGCTGCAACCCGCCTGCGTGAAAGTCAAGTCGCTAGTAATCGCGAGATCCAGCATGCCGCGGTGAATACGTTCGGGCCTTGTACACCATACGTCACACCATGAGAGTCGGGAACACCCGAAGTCCGTAGCCCCAACCGCAAGAGCGCGGCCGAAGGTGGGTTCCGATAATTGGGGTAGCGAAGTCGTAACAAGGTAACCGCAC
>958c009e73210442d0791a3dabbd26a5fdba3d56 BC01_1
GGCGGCGTGCCTAACACATGCAAGTCGAACGGAGGATATTTGGAGGAGCTTGCTTTGGATATCTTAGTGGCGGACGGGTGAGTAACGCGTGGAAGTAACCTGCCTCTCAGAGGGGATAACGTTCTGAAAGAACGCTATTACCGCATGACATTGCGAAACCGCATGGTTTTGCAATCAAAGAACAATCCGCTGAGATGGACTCGCGTCCGATTAGCCAGTTGGCGGGGTAACGGCCCGCAAAGCGACGATCGGTAGCCGGACTGAGAGGTTACGATGACCACATCGGGACTGAGACGGCCCAGACTCCTACGGGGAGGCGGCAGTGGGGGATATTGCACAATGGGGAAACCCTGATGCAGCAACGCCGCGTGTGGGAAGAAGGTTTTCAGTTGGCAAACCACTGTTCTCAGGGACGATAATGACGGTATATGAGAGAAAGCCGGCTAACTACGTGCCAGCGGCGCGGTAATACGTAGGAGCGAGCGTTGTCCGGATTTACTGGGTGTAAAGGGGTGCGTGAGCGGCTCGCGCAAGTCAGTCGTGAAAACCATGGGCTCAACCCGTGGACTGCGATTGAAACTGTGGAACTTGGTGAAGTAGAGGCCAGGCGGAATTCCCATTGTGGCGGTGCGGAAATGCATAGAGATCGGGAGGAACACCGAGCCGAGAGCGGCCTGCAGGGCTGGCGACGCTGAGGCACGAAAGTATGGGTAACCAAACGGGATTAGATACCCTGGTAGTCCCATACCGTAAACGATGATTACTAAGGTGTGGGTCTGACCCCTCCGTGCAGGTTAACACAATAAGTAATCCACCTGGGAGTACAGCCCGCAGGTGCAAACTCAAAGGAATTGACGGGGGGCCCCGCACAAGCAGTGGAATTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGATCCAACTAACGAAGTAGAGATACATTAGGTGCCCTTCAGGGGAAAAGTTGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCATTGAGATGTTGGGTTAAGTCCGCAACGACGCAACCCTATGATTAGTTGCTACGCAGAGCACTCTAATCGAACTGCCGTTGACAAAACGGAGGAAGGTGGGGACGACGTCAAATCATCATGCCTGTGACCTGGAGCTACACCGCTACAATGGCCGTCAACAGAGGAGAGCAAAACCGCGAGGCGAAGCCAAAACCCCAAAACGGTCCAGTTCGGATTACGGGCTGCAACCCGCCTGCATGGATTGGAATTGCTGGTAATCGCGGATCCAGAATGCCGCGGTAGGGCCGTTCCCGGGCCTTGCACACCGCCGTCCACCGTGGGAGCCGAGTGTACCCAGTCAGTAGCCTAACCGCAAGGGAGGCGCTGCCGAAGGT
>4fad7a3b17fcbfc077d8bcb3c87d1676fda7c872 BC01_10
ATCAGACCTGTGACGCTCCTCCTTTCGGTTGGGGTCACTGGCTTCGGGCATTTCGACTCCCATGGTGTGACGGGCGGTGATGACCCTGCAAGACCGAGACGTATTCACCGCGACATTCTGATTCGCGATTACTAGCGATTCCAGCTTCCTTGTAGTCAGTTACAGACTACAATCCGAACGAGACGTTATTTTGAGATTCGCAGGTCTCCCTCTCGCTTCCCTTTGTTTACGCCATTGTAGCACGTGTGTAGCCCAAATCGCAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCAGGTTATCCTGGCAGTCTCCTTAGAGGTACCCGGCTTTATCGCTGGCTACTAAAATACGGGTTGCGCTCGTTGCGGGACCTTAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTCTATGACGCCCCGAGAGGGAACGGTTAGTTCCGGTCGTCACGATGTCAAGACTTGGTAGGTTCTTCGCGTTGCTTCGGGTAAACCACATGCTCCACCGCTTGTTGCGGTCCCGTCAATTCCTTTGAGTTTCATTCTTGCGAACATGCTCCCAGGTGGATACTTACTGCGTTTGCGGCAGCATCGATACGCTTTGCGCACAACACCTAGTATTCATCGTTTCGGCGTGGACTACGGAATTGTCTAATCATATTCCTCCCCCTTTCGAACCTCAACGTCAGTTACTTGTCCAACGAAACCGCCTTCGCCACTGGTGTTCCTCTAATATCTACGCATTTCACTGCCCCATGGGAATTCCGCTTGCCTCTCCAGCACTCCAGCAACAGTTTCCAAAGCAGTTCCCAGGTTGAGCCGGGTATTTCACCGGACATATGCCGTCTACGCTCCCTTTACACCGGTAAATCGGATAACAGCGCCCCCACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGGGGCTGTAGTCGGGGCTGCGTCATTCTCTTCCCTGCTGATGAAGCTTTACGCGCGAAATACTTCTTCACTCACGCGGCGTCGCTGCATCCAGGGTTCCCCCCCCATTGTGCAATATTCCCACTGCTGCCTCCCGTAGAGTTTGGGCCGGTGTCTCCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCATGGTGGGCCGTTACTCACCAACTAGCTAATCAGACGCAGGTCCATCTCAGCCGCCACCGGAGTTTTTCACGCAAGCATGCGCTTCCGTGCGCTTATGCGGTATTAGCAGTCATTTAACTGTTATCCCCCTGTATGAGGGTAGGTTACCCACGCGTTGCCTCACCGTCCGCCACTCAGTCAATTTGACTTCCATCCGAAAACTTCCGTCAATCGCTTCGTTCGACTTGCATGTGTTAAACGCCGCCAGCATTCATCCCTGA

I also ran these through MAFFT but did not see any regions with 100% identity, which is what I would have expected to find, because these reads are from the same region.

I also notice they vary in length. This is surprising, because you mentioned trimming the reads to a length of 1400.

I wonder if trimming first and then removing barcodes could explain the variable length.

The variable length is another thing that would make reads unique!
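
To illustrate why both length differences and sequencing errors defeat exact-match dereplication, here is a small sketch. It is not how vsearch is implemented, just the same idea (identical strings collapse, anything else stays separate) using a Counter, plus a back-of-the-envelope estimate of how often a 1400-nt read would be error-free. The 5% per-base error rate is an assumed illustrative value, not something measured from this run, and the toy sequence is made up.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads; return a mapping of sequence -> count."""
    return Counter(reads)


def fraction_error_free(read_length, per_base_error):
    """Probability a read has no errors, assuming independent per-base errors."""
    return (1.0 - per_base_error) ** read_length


true_seq = "ACGTTGCAAGGCTA"  # made-up toy amplicon
reads = [
    true_seq,                            # exact copy
    true_seq,                            # exact copy
    true_seq[:-1],                       # same read, one base shorter
    true_seq[:5] + "T" + true_seq[6:],   # same read, one substitution
]

counts = dereplicate(reads)
print(len(counts), "unique sequences from", len(reads), "reads")

# Assumed 5% per-base error rate, for illustration only.
p_clean = fraction_error_free(1400, 0.05)
print("chance a 1400-nt read is error-free:", p_clean)
```

Only the two exact copies collapse; the shorter read and the read with one substitution each become their own feature. With errors spread over 1400 bases, the chance of two reads being identical is vanishingly small under these assumptions, so an average cluster size of 1.0 is not surprising by itself.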

Thank you for your patience, Muhammad. I'm not an expert on Oxford Nanopore data, so I appreciate your time while I work to understand this data set.

Does anyone have more advice about working with ONT data?

Dear,
Thank you for your time and valuable insights. I also opened an issue on GitHub for the q2ONT pipeline. This is what I received from the developer.

Okay!

Thank you for posting that.

Based on the developer discussion, it sounds like everything may be okay.

Have you done step 6?

  6. Chimeric sequences are screened for and filtered out of the workflow. Subsequently, OTUs are clustered via the open-reference option using vsearch at 85% identity. This can also be changed manually by digging into the script. However, due to the high error rate of the ONT platform, it is advised to cluster OTUs at 85% similarity or even less*.
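
For intuition about why clustering at 85% identity can help where dereplication does not, here is a rough sketch of greedy centroid clustering. It is only an illustration: vsearch computes identity from real pairwise alignments, whereas this uses Python's difflib.SequenceMatcher ratio as a crude stand-in, and the function names and toy sequences are my own.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; NOT vsearch's alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.85):
    """Put each sequence in the first cluster whose centroid it matches,
    otherwise start a new cluster with that sequence as the centroid."""
    clusters = []  # list of (centroid, members) tuples
    for seq in seqs:
        for centroid, members in clusters:
            if similarity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


toy_reads = [
    "ACGTACGTACGTACGTACGT",  # becomes the first centroid
    "ACGTACGTACCTACGTACGT",  # one substitution
    "ACGTACGTACGTACGTACG",   # one base shorter
    "TTGGCCAATTGGCCAATTGG",  # unrelated sequence
]

clusters = greedy_cluster(toy_reads)
print(len(set(toy_reads)), "unique reads after exact dereplication")
print(len(clusters), "clusters at the 0.85 threshold")
for centroid, members in clusters:
    print(centroid, "->", len(members), "member(s)")
```

All four toy reads are unique under exact matching, but the three near-identical ones fall into a single cluster. In QIIME 2, open-reference clustering is exposed as qiime vsearch cluster-features-open-reference, where --p-perc-identity sets the threshold.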
