Split_libraries file sample ids format with multiple sample files

Hi,

I have externally demultiplexed and cleaned my data, so I now have multiple FASTA files, each containing the reads of one sample. I am trying to import them into QIIME 2 as explained in the "Clustering sequences into OTUs" tutorial, but I am running into the error: ‘file’ is not a(n) QIIME1DemuxFormat file. I think this has something to do with how my sample IDs are formatted.

As all the sequences in the file come from the same sample, they all contain the same sample id:

>2_18S TT5967UF7ANPX02 orig_bc=CTAGGTGA new_bc=CTAGGTGA bc_diffs=0
ATGCAGGTCTTAGTATAAACTTGAAAAAAGTGAAACCGCGAATGGCTCATTACATCAG
>2_18S UIBDKT7CHWJ6X9Q orig_bc=CTAGGTGA new_bc=CTAGGTGA bc_diffs=0
ATGCATGTCTAAGTACAGGCTTTAATAAAGTGAAACCGCGAATGGCTCATTAAATCAG
>2_18S GAO6GKWSP2F935Y orig_bc=CTAGGTGA new_bc=CTAGGTGA bc_diffs=0
ATGCATGTCTAAGTACAGGCTTTAATAAAGTGAAACCGCGAATGGCTCATTAAATCAG

(I’ve shortened the sequences here to make it a bit prettier)

I tried combining multiple samples, so there are a variety of ids in the file in case that was the issue:

>2_18S 15LOC6N60FD1Q7F orig_bc=CTAGGTGA new_bc=CTAGGTGA bc_diffs=0
ATGCATGTCTAAGTATAAATCTTTTACTTTGAAACTGCGAACGGCTCATTATATCAGTTATAG
>2_18S 8HYOD7O4D1V43CO orig_bc=CTAGGTGA new_bc=CTAGGTGA bc_diffs=0
ATGCATGTCTAAGTATAAGTAGTATACAGCGAAACTGCGAATGGCTCATTAAAACAGTTATA
>4_18S 9VYWL2QSR22MB7Y ACGACTTG bc_diffs=0 ACGACTTG
ATGCATGTCTAAGTACACACTGTGGCACAGTGAAACCGCGAATGGCTCATTAAATCAGTT
>4_18S O222PMF38H92TE7 ACGACTTG bc_diffs=0 ACGACTTG
ATGCATGTCTAAGTATAAACTGCTTTATACTGTGAAACTGCGAATGGCTCATTAAATCAGTT

This also didn’t work (:sob:)

On the QIIME 1 file format page I notice that sample IDs are in the format PC.634_1, PC.634_2, PC.354_3, PC.481_4.
If I edit my sample IDs into something similar (i.e. 2_18S_1, 2_18S_2, 2_18S_3, etc.), the import won’t work (could this be the multiple underscores?).

However, if I add .1, .2, .3 (2_18S.1, 2_18S.2, 2_18S.3 etc) the import will work (:grin:).
But I can’t find any information on how this affects downstream analysis. As in, does this cause QIIME to view 2_18S.1 and 2_18S.2 as coming from different samples?

Furthermore, I’m aware that importing the files this way will give a separate artifact for each sample. I can’t work out if this will make it difficult to compare samples downstream - my next step (per the OTU clustering tutorial) is dereplication, and presumably this is per sample, so keeping the files separate should work? Or is it better to combine everything in one file and process all the samples together?

I hope I’ve given enough information and I’d be grateful for any advice at all, thank you!

Hi @niamh55,
I think I got to the bottom of your issue.

You suspected the ID formatting, and that is the issue. The header ID must be unique for each sequence. The ID contains the sample ID plus a sequence (read) number. So in the QIIME 1 format example you mentioned, the header IDs are:

PC.634_1
PC.634_2
PC.354_3
PC.481_4

But the sample IDs are:

PC.634
PC.354
PC.481

So in your case the issue is that the sequence number information is missing. The header IDs within a sample are all identical (2_18S). Mixing in reads from other samples does not work because there are still duplicates (e.g., every header ID is either 2_18S or 4_18S).
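If you want to check a file before importing, here is a minimal Python sketch that lists header IDs occurring more than once. It assumes the header ID is everything between ">" and the first whitespace; the function name `duplicate_header_ids` is just illustrative, and the sequences are shortened versions of the ones you posted.

```python
from collections import Counter


def duplicate_header_ids(fasta_text):
    """Return a dict of header IDs that occur more than once, with their counts.

    Assumption: the header ID is the text between '>' and the first whitespace.
    """
    ids = [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]
    counts = Counter(ids)
    return {header_id: n for header_id, n in counts.items() if n > 1}


# Shortened records from the question: two reads share the header ID 2_18S.
example = """\
>2_18S TT5967UF7ANPX02 orig_bc=CTAGGTGA new_bc=CTAGGTGA bc_diffs=0
ATGCAGGTCTTAG
>2_18S UIBDKT7CHWJ6X9Q orig_bc=CTAGGTGA new_bc=CTAGGTGA bc_diffs=0
ATGCATGTCTAAG
>4_18S 9VYWL2QSR22MB7Y ACGACTTG bc_diffs=0 ACGACTTG
ATGCATGTCTAAG
"""

print(duplicate_header_ids(example))  # {'2_18S': 2}
```

An empty result means every header ID in the file is unique.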

Adding .1, .2, .3 makes the import work because now the sequence ID portion is unique.

QIIME should not treat 2_18S.1 and 2_18S.2 as different samples: the 18S.1 and 18S.2 parts belong to the sequence ID portion, and the sample ID portion (2) is unchanged, so I think this should work downstream.
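To make the sample ID / sequence ID split concrete, here is a small Python sketch. It assumes the header ID is divided at its last underscore, which matches the reading above; that rule is my assumption here, so please check it against the format documentation. `split_header_id` is a hypothetical helper, not a QIIME function.

```python
def split_header_id(header_id):
    """Split '<sample-id>_<seq-id>' at the last underscore (assumed rule)."""
    sample_id, sep, seq_id = header_id.rpartition("_")
    if not sep or not sample_id or not seq_id:
        raise ValueError(f"no '<sample-id>_<seq-id>' structure in {header_id!r}")
    return sample_id, seq_id


for header_id in ["PC.634_1", "2_18S", "2_18S.1", "2_18S.2"]:
    print(header_id, "->", split_header_id(header_id))
# PC.634_1 -> ('PC.634', '1')
# 2_18S -> ('2', '18S')
# 2_18S.1 -> ('2', '18S.1')
# 2_18S.2 -> ('2', '18S.2')
```

Under this assumption, 2_18S.1 and 2_18S.2 share the sample ID 2 and differ only in the sequence ID.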

Note, however, that several different methods may have issues with sample IDs that just consist of numbers. So you may want to name your samples S1, S2, etc, instead of just 2.

I believe that the qiime1 format expects everything to be together in a single file. So you should just be able to concatenate all of these files together and import before proceeding. The alternative is a lot more hassle — you need to process each sequence file separately, then merge later on. Just concatenate now!
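If concatenating by hand is awkward, something like the following Python sketch could combine the per-sample records and rewrite each header as `<sample-id>_<n>` with a running counter, so every header ID is unique. The sample names S2 and S4 and the function name `concatenate_samples` are illustrative assumptions, and the sequences are shortened.

```python
def concatenate_samples(samples):
    """Combine per-sample FASTA lines into one list with unique header IDs.

    samples: mapping of sample ID -> iterable of FASTA lines for that sample.
    Each header becomes '<sample-id>_<n>', where n is a running counter
    across all samples; the rest of the original header is kept.
    """
    combined = []
    counter = 0
    for sample_id, lines in samples.items():
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Drop the old ID (first token), keep the remaining description.
                parts = line[1:].split(maxsplit=1)
                description = parts[1] if len(parts) > 1 else ""
                counter += 1
                line = f">{sample_id}_{counter} {description}".rstrip()
            combined.append(line)
    return combined


# Hypothetical sample names mapped to shortened records from the question.
samples = {
    "S2": [
        ">2_18S TT5967UF7ANPX02 orig_bc=CTAGGTGA",
        "ATGCAGGTCTTAG",
        ">2_18S UIBDKT7CHWJ6X9Q orig_bc=CTAGGTGA",
        "ATGCATGTCTAAG",
    ],
    "S4": [
        ">4_18S 9VYWL2QSR22MB7Y ACGACTTG",
        "ATGCATGTCTAAG",
    ],
}

for line in concatenate_samples(samples):
    print(line)
```

With real data, each value in `samples` could be an open file handle for one sample's FASTA file, and the combined lines would be written to a single file for import.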

I hope that helps!

Thank you so much for your help and such a clear answer!
