I should probably give you more information:
I downloaded sequences from Japan for the Insecta class on the BOLD platform.
I removed the records that didn't correspond to COI-5P from both the taxonomy and sequence files, and renamed the duplicates.
Then, I proceeded as stipulated in the tutorial "Building a COI database from BOLD references".
Setting the --p-rank-handles as 'disable' leads to the same error message as before.
Sending you my files right now! Thank you so much!
I've looked through the files you sent me. The IDs are not identical between your FASTA file and your taxonomy file. For example, there is a sequence with the label ISSIK304-14 in your FASTA file but not in your taxonomy file. There are similar IDs in your taxonomy file, such as ISSIK304-14-1 and ISSIK304-14_2.
I am not sure how you originally obtained these data, but there is clearly a disconnect between the contents of the two files. For example there are a little over 10,000 sequences in the FASTA file, but over 26,000 entries in the taxonomy file.
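If it helps, here is a minimal sketch of how you could check for this kind of mismatch yourself before running the dereplication step. It assumes a plain FASTA file and a tab-separated taxonomy file with a header row and the ID in the first column; the function names, the `ABCDE001-10` record, and the inline example data are made up for illustration and are not part of any QIIME 2 or RESCRIPt API.

```python
import csv
import io


def fasta_ids(handle):
    """Return the set of IDs (first token after '>') found in a FASTA file."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def taxonomy_ids(handle, has_header=True):
    """Return the set of IDs in the first column of a tab-separated taxonomy file."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: the first row is a header
    return {row[0].strip() for row in reader if row and row[0].strip()}


# Toy data standing in for the real files (hypothetical content).
fasta_text = ">ISSIK304-14\nACGTACGT\n>ABCDE001-10\nACGTTTGA\n"
taxonomy_text = (
    "Feature ID\tTaxon\n"
    "ISSIK304-14-1\tk__Animalia\n"
    "ISSIK304-14_2\tk__Animalia\n"
    "ABCDE001-10\tk__Animalia\n"
)

seq_ids = fasta_ids(io.StringIO(fasta_text))
tax_ids = taxonomy_ids(io.StringIO(taxonomy_text))

# Sequences with no taxonomy entry: these are what trigger the error.
missing_taxonomy = sorted(seq_ids - tax_ids)
# Taxonomy entries with no sequence: harmless, but worth knowing about.
unused_taxonomy = sorted(tax_ids - seq_ids)

print("In FASTA but not in taxonomy:", missing_taxonomy)
print("In taxonomy but not in FASTA:", unused_taxonomy)
```

With real files you would pass `open("your_seqs.fasta")` and `open("your_taxonomy.tsv")` instead of the `io.StringIO` objects. Any ID reported in the first list needs a matching taxonomy row.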
The error is basically telling you that it found a sequence with the ID ISSIK304-14, but could not find a corresponding taxonomy in the taxonomy file. Have you looked through the following threads:
I had some issues with my data indeed. I managed to get through the dereplicate process by downloading new sequences and starting over with these.
It was, however, normal to see a difference between the number of IDs in the FASTA file and in the taxonomy file, since the FASTA file had gone through some filtering steps.
I looked through these threads yes. I am currently trying to follow the "Building a COI database from BOLD references" pipeline.
However, I am stuck at the primer trimming step. I used the ZBJ-Art primers and am supposed to get an amplicon of 157 bp after trimming the sequences at the primer coordinates, but the closest I can get to this length is 147 bp. I tried about 50 parameter combinations but mostly got amplicons of 147 bp. The tutorial says that this step is pretty random, so I will keep trying with different parameters. But I find it suspicious.
My guess is @Ander is referring to this part of the BOLD tutorial, when he mentions that a portion of the workflow "is pretty random":
In the quote above, and the step @Ander is working on in this question, we're trying to find a region that aligns between a primer set and some set of sequences. It's not clear to me from these posts thus far how the user went about subsetting their data to try the alignment of the new primer set (ZBJ-Art) to their own BOLD sequence collection. In the tutorial, I point out that you should start with a subset of your samples, filtering both for sequence length and taxonomic completeness (see here). In that example, I mention filtering these gapped sequences by a minimum length of 660bp; even though the expected sequence length is usually a bit less (around 658bp), remember that the expected length is that of the degapped sequence. In the case of the ZBJ-Art primers, you might expect a degapped amplicon of 157bp, but you should probably set a threshold at least that long (or maybe a bit longer). If you pre-filter your gapped alignment file in such a manner, I would be really surprised to see you getting degapped lengths of less than 150bp as routinely as you're suggesting.
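To make the gapped versus degapped distinction concrete, here is a rough sketch of that kind of length pre-filter in plain Python. It is not the command used in the tutorial; the function names and the toy alignment are hypothetical, and it assumes gaps are written as `-` or `.`. The threshold is tiny here only so the example fits on screen; for real data you would use something like the 660bp mentioned above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def degapped_length(seq, gap_chars="-."):
    """Count the characters in an aligned sequence that are not gap symbols."""
    return sum(1 for base in seq if base not in gap_chars)


def filter_by_degapped_length(records, min_len):
    """Keep aligned records whose degapped length is at least min_len."""
    return [(h, s) for h, s in records if degapped_length(s) >= min_len]


# Toy gapped alignment (hypothetical records); every row is 8 columns wide,
# but the number of real bases differs.
alignment = """>seqA
AC--GTAC
>seqB
A------C
>seqC
ACGT.ACG
""".splitlines()

kept = filter_by_degapped_length(read_fasta(alignment), min_len=5)
for header, seq in kept:
    print(header, degapped_length(seq))
```

The point is that all three rows have the same aligned width, so filtering on the raw line length would keep everything; only counting non-gap characters removes the mostly-gap record.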
For what it's worth, most BOLD sequences should be around that ~650bp length, and most were generated using primer sets that bind outside the region the ZBJ-Art primers flank - see this supplementary figure that indicates where these primers fall on COI (from Jusino et al.'s 2018 paper describing their development of the ANML primers). Filtering for a length of just 150bp might be a bad idea, and filtering for the more complete BOLD sequences gives you some indication that they aren't junk... no guarantees though.
It would be helpful to get some clarity on this statement from @Ander too:
50 combinations of what parameters? The length of sequences? The taxonomic completeness?
One other thought would be to try a very specific subset of sequences. Say, take only a particular family of moths or beetles or any other frequently represented insect order, filter for just those records with some minimum length, filter for those records with complete names, then see if you can get your primers to align where you know they should be aligning. If you happen to have a copy of the original Zeale paper, Table 3 highlights that they were able to recover a lot of Lepidopteran samples from bat fecal samples that matched to a BOLD reference - point being, if you're going to test out your primers on some subset of your specific BOLD references, moths are probably a good place to start for this particular primer set.
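As a sketch of what that subsetting could look like, the snippet below picks out the record IDs that belong to one order and have a name at every rank. It assumes semicolon-separated taxonomy strings with `k__`, `p__`, ... `s__` prefixes, which may not match how your taxonomy file is formatted; the record IDs, taxonomy strings, and function names are all invented for illustration.

```python
REQUIRED_RANKS = ("k", "p", "c", "o", "f", "g", "s")


def parse_taxonomy(tax_string):
    """Split 'k__Animalia;p__Arthropoda;...' into a {rank: name} dict."""
    ranks = {}
    for part in tax_string.split(";"):
        part = part.strip()
        if "__" in part:
            rank, name = part.split("__", 1)
            ranks[rank] = name.strip()
    return ranks


def is_complete(ranks):
    """True if every required rank is present and has a non-empty name."""
    return all(ranks.get(rank) for rank in REQUIRED_RANKS)


def select_ids(taxonomy, order):
    """Return sorted IDs whose taxonomy is complete and matches the given order."""
    selected = []
    for seq_id, tax_string in taxonomy.items():
        ranks = parse_taxonomy(tax_string)
        if ranks.get("o") == order and is_complete(ranks):
            selected.append(seq_id)
    return sorted(selected)


# Hypothetical records: one complete moth, one moth missing its species
# name, and one complete beetle.
taxonomy = {
    "REC001": "k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;"
              "f__Noctuidae;g__Agrotis;s__Agrotis ipsilon",
    "REC002": "k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;"
              "f__Noctuidae;g__Agrotis;s__",
    "REC003": "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;"
              "f__Carabidae;g__Carabus;s__Carabus japonicus",
}

moth_ids = select_ids(taxonomy, order="Lepidoptera")
print(moth_ids)
```

You would then keep only the sequences whose IDs appear in `moth_ids`, apply your minimum length filter to those, and try the primer alignment on that smaller set.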