How to know the sample sequence IDs that each ASV represents?

11131 · July 30, 2022, 1:25pm

Hi there,
I just finished the DADA2 process and obtained an ASV table and sequences with MD5 hash headers. Is there any way to find the original sequence IDs that each ASV corresponds to? Thanks!

colinbrislawn · July 30, 2022, 2:50pm

Each ASV represents many sequences from many samples. All these sequences have unique IDs.

Are you looking for a table like this?

Example:

ASV ID:	Seq IDs:
HASH1234	Sample_1;read_4, Sample_7;read_245, Sample_9;read_4781, ...
HASH5678	Sample_1;read_38, Sample_2;read_113, Sample_9;read_5, ...

11131 · July 30, 2022, 3:07pm

Yes, that's exactly what I want. Do you have any ideas?

colinbrislawn · August 1, 2022, 4:56pm

There's no easy way to do this within DADA2 or Qiime2.

Here's the code that makes the hashes from input sequences:

github.com/qiime2/q2-dada2/blob/dev/q2_dada2/_denoise.py#L155-L160

          # only calculate percentage of input primer-removed if ccs
          if 'primer-removed' in df:
              PASSED_PRIMERREMOVE = 'percentage of input primer-removed'
              round_cols[PASSED_PRIMERREMOVE] = 2
              df[PASSED_PRIMERREMOVE] = df['primer-removed'] / df['input'] * 100
              col_order.insert(1, 'primer-removed')

To make that table, you would want to preserve both the original SequenceID and the new MD5 hash it receives after it is renamed.
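As a rough illustration of keeping both identifiers at the renaming step, here is a minimal Python sketch. It assumes the feature ID is the MD5 hex digest of the denoised sequence string (worth checking against the linked source for your version), so what gets preserved here is the sequence behind each hash; read-level SeqIDs would still have to be tracked separately. `md5_id` and `rename_with_hashes` are hypothetical helpers, not part of q2-dada2.

```python
import hashlib


def md5_id(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence string (assumed ID scheme)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def rename_with_hashes(sequences):
    """Map each hashed ID back to the sequence it was computed from.

    Hypothetical helper: keeping the pre-hash value makes the renaming reversible.
    """
    return {md5_id(seq): seq for seq in sequences}


if __name__ == "__main__":
    asvs = ["ACGTACGT", "TTGACCGA"]  # made-up sequences for illustration
    for hashed, seq in rename_with_hashes(asvs).items():
        print(hashed, seq)
```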

It may be easier to make this table another way: List all reads in your study along with their unique MD5 hashes, then map your original reads and their SeqIDs against this list. I think this can be done using vsearch, but that would take some custom scripting.
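The exact-match version of this idea can be sketched in Python. Treat it as an approximation under stated assumptions: it only finds reads whose first bases are identical to an ASV, so reads that DADA2 error-corrected into an ASV are missed, and merged paired-end ASVs will not match a single read this way. A vsearch-style search that allows mismatches would be needed for fuller coverage. The function names and toy reads are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_id(sequence):
    """MD5 hex digest of a sequence string (assumed feature ID scheme)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def map_reads_to_asvs(asv_seqs, reads):
    """Group read IDs under the hashed ID of the ASV they start with.

    asv_seqs: iterable of ASV sequences (strings).
    reads: iterable of (read_id, read_sequence) pairs.
    Only exact prefix matches are reported; other reads are skipped.
    """
    by_seq = {seq: md5_id(seq) for seq in asv_seqs}
    lengths = sorted({len(seq) for seq in by_seq}, reverse=True)
    mapping = defaultdict(list)
    for read_id, read_seq in reads:
        for n in lengths:  # try the longest ASV length first
            hit = by_seq.get(read_seq[:n])
            if hit is not None:
                mapping[hit].append(read_id)
                break
    return dict(mapping)


if __name__ == "__main__":
    asvs = ["ACGTAC", "TTGACC"]  # made-up ASV sequences
    reads = [
        ("Sample_1;read_4", "ACGTACGG"),
        ("Sample_7;read_245", "ACGTACTT"),
        ("Sample_9;read_5", "TTGACCAA"),
        ("Sample_2;read_1", "GGGGGGGG"),  # matches no ASV, so it is dropped
    ]
    for asv_id, read_ids in map_reads_to_asvs(asvs, reads).items():
        print(asv_id, ", ".join(read_ids))
```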

gregcaporaso · August 1, 2022, 5:07pm

@11131, can you clarify what you'd like to do with that information? As @colinbrislawn mentioned, we don't generate that mapping of identifiers directly, but there may be some other way to get you the information you need. For example, if you're interested in the sequence associated with each hashed identifier, that is generated by the DADA2 denoise methods: it is the rep-seqs-dada2.qza file generated in the denoising step of the Moving Pictures tutorial, and you can turn it into a visualization that you can explore using qiime feature-table tabulate-seqs.
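For looking up the sequence behind a hashed identifier outside the visualization, a small Python sketch along these lines may help. It assumes the representative sequences have already been exported from the .qza to a plain FASTA file whose headers are the hashed IDs; `read_fasta` and the toy records are hypothetical and only for illustration.

```python
def read_fasta(lines):
    """Parse FASTA lines into {header_id: sequence}.

    The ID is the first whitespace-delimited token of each header line;
    sequences wrapped over several lines are joined.
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


# Toy input standing in for an exported FASTA file (made-up IDs and sequences).
example = """>hash_a
ACGTACGT
ACGT
>hash_b
TTGACCGA
""".splitlines()

for feature_id, seq in read_fasta(example).items():
    print(feature_id, len(seq), seq)
```

For a real file, pass an open file handle instead of the toy list; comparing sequence lengths and starting bases across IDs is a quick way to spot ASVs that do not begin where you expect.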

11131 · August 5, 2022, 8:06am

Great! Thanks for your help.

11131 · August 5, 2022, 8:13am

Yes. I'm trying to reanalyze a published V3-5 sequencing dataset, but I found that the ASV sequences do not start at the V3 primer position, even though the parameter was --trim-left 0. So I just want to trace back the original sequences that the shorter ASVs represent.
I've now found a way to recover the original sequences. Many thanks!