Trouble generating SeppReferenceDatabase qza

dethlefs · September 28, 2020, 5:31am

Hi!

I'm new to qiime2, so I'm not sure if I'm having problems because I couldn't find instructions for importing/creating a SeppReferenceDatabase.

Goal:
Use a customized subset of Silva 138 NR99 for sepp fragment insertion.

What I've done:
Using arb, exported the aligned fasta for the high quality seqs I want to use, after trimming to just the relevant 16S region, and exported the corresponding tree.

Followed Siavash's steps to clean up the tree, resolve polytomies, and re-estimate branch lengths with RAxML.

I know that a 'SeppReferenceDatabase' must include the alignment, tree, and info about the tree obtained from RAxML, and by unzipping sepp-refs-gg-13-8.qza (and after a couple errors) I deduced that those 3 elements should be named 'aligned-dna-sequences.fasta', 'tree.nwk', and 'raxml-info.txt'. I also deduced that the info should be the binary info file produced by RAxML, not the human-readable version. So I made a directory with those files:
$ ls -lh Sepp_Ref_Data_Files/
total 532M
-rw-r--r-- 1 dethlefs microbio 519M Sep 27 16:15 aligned-dna-sequences.fasta
-rw-r--r-- 1 dethlefs microbio 49K Sep 27 16:22 raxml-info.txt
-rw-r--r-- 1 dethlefs microbio 13M Sep 27 16:14 tree.nwk

Command I tried:
$qiime tools import --type SeppReferenceDatabase --input-path Sepp_Ref_Data_Files/ --output-path Sil138_BAhq_SeppRefDB.qza

Error I got:
$ qiime tools import --type SeppReferenceDatabase --input-path Sepp_Ref_Data_Files/ --output-path Sil138_BAhq_SeppRefDB.qza
Traceback (most recent call last):
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/q2cli/builtin/tools.py", line 158, in import_data
view_type=input_format)
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/sdk/result.py", line 241, in import_data
validate_level='max')
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/sdk/result.py", line 267, in _from_view
result = transformation(view, validate_level)
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/core/transform.py", line 68, in transformation
self.validate(view, validate_level)
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/core/transform.py", line 143, in validate
view.validate(level)
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 171, in validate
getattr(self, field)._validate_members(collected_paths, level)
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/plugin/model/directory_format.py", line 101, in _validate_members
self.format(path, mode='r').validate(level)
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/plugin/model/file_format.py", line 25, in validate
self.validate(level)
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/q2_fragment_insertion/_format.py", line 63, in validate
info = self.path.read_text()
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/pathlib.py", line 1197, in read_text
return f.read()
File "/home/dethlefs/miniconda3/envs/qiime2-2020.8/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 9608: invalid start byte

An unexpected error has occurred:

'utf-8' codec can't decode byte 0x8a in position 9608: invalid start byte

See above for debug info.

###---### END ERROR MSG ###---###

I'm running a new conda install of version 2020.8 on a CentOS workstation with 80 cores and 1.5TB memory. (Helpful for RAxML.)

I wonder if the error just implies a corrupted file. But before I do the rather lengthy re-running of the RAxML steps in Sriavash's instructions, I thought I'd make sure that these steps would be expected to work.

Thanks!!
Les

dethlefs · September 28, 2020, 11:31pm

Update to my own post: I think I am making newbie mistakes, and for the help of anyone who finds this thread in the future, it is the human-readable raxml info, not binary, that should be submitted.

Furthermore, this info file must have been run through Siavash's reformat-info.py script (bundled with sepp) which (perhaps among other changes) adds a line starting 'Base frequencies:' that lists (wait for it) base frequencies. I'd gotten an error earlier that told me 'base frequencies not found in raxml-info' or something like that, even though that line was present, which is why I thought qiime tools import wanted the binary version of the info. In reality, the error was very likely my mistake of copying the pre-reformat human-readable info file to 'raxml-info.txt'.

So now I'm on to a different error (that also might wind up being another stupid mistake):
An unexpected error has occurred:

Invalid character in sequence: b'U'.
Valid characters: ['A', 'Y', 'V', 'D', 'R', 'N', 'M', '.', 'B', 'K', 'W', 'H', 'T', 'G', 'S', 'C', '-']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.

Checking my aligned fasta file now...

dethlefs · September 29, 2020, 12:20am

Duh...error message says it right there: From the Silva Arb db, I've got 'U's intead of 'T's.

Okay, for future reference, this worked to generate a Sepp Reference Database (version 2020.8):

qiime tools import --type SeppReferenceDatabase --input-path Sepp_Ref_Data_Files/ --output-path SeppRefDB.qza

where Sepp_Ref_Data_Files is a directory containing 3 files:

aligned-dna-sequences.fasta
raxml-info.txt
tree.nwk

Identifiers in fasta need to exactly match names of leaves in tree (obviously...and have T's instead of U's, duh).
The raxml info file is the human readable version generated during the branch length re-estimation step of Sriavash's instructions (here and here), that has subsequently been reformated (see above).
The tree is the raxml result file from the branch length re-estimation.

Dear kind Qiime moderators: please feel free to close this topic.

Les

Mehrbod_Estaki · September 29, 2020, 3:39am

Hi @dethlefs,
Thanks for the thorough details about the problem, updates, and solutions
The post will automatically close after 30 days so we don't really close topics.

btw, just curious, how much computational resource did you end up using on this?

For other users who are not able to recruit powerful machines, there is currently a sepp reference database based on SILVA 128 on the data resource page for use.
Perhaps @dethlefs would be willing to share the new constructed database as well?

dethlefs · September 29, 2020, 5:20pm

I didn't track memory use, but I had used 70 cores on a high-end server (not cluster) and the re-estimation of branch lengths via RAxML took ~12 hours (using the AVX2-compiled version 8.2.12). This is with a high-quality subset of bacteria & archaea from the full Silva 138 NR, about 225k sequences, and with aligned reference seqs trimmed to only the V4 region. (Theoretically, I think it's preferable to trim the reference alignment to just the fragment region being placed, but I doubt it will matter much in terms of placement accuracy. It might help a lot in terms of runtime, but I haven't tested that.)

I'll need to write up a README that explains how I derived the Silva subset, but yeah...I could share the Silva sepp reference. Come to think of it, in addition to a README file, it makes sense to have that info in the provenance section of the qza. I know usually qiime does all the work of recording the history all artifacts, but in this case it's the pre-artifact history that should get incorporated. Can I edit the provenance files manually to put a copy of the README in there?

Mehrbod_Estaki · September 29, 2020, 8:13pm

Hi @dethlefs,
Wonderful, thanks for the update! I actually expected this to take even longer based on some other attempts people had made.

Good question, I'm not sure to be honest. Perhaps the plugin's developer @Stefan might have some insight on this.

The question I believe isn't so much as can you do it, anything is possible, but rather should you do it! 100% agree about the information being important to retain somehow, but I'm guessing the Q2 developers will advise against any instance of manually editing the provenance or anything else in the artifacts.
They may have some alternative solution, though at the moment they are all preoccupied with the upcoming Q2 workshop so might have to wait a bit.

Stefan · September 29, 2020, 8:33pm

Hi @dethlefs, thanks for your great work and effort to document your journey. This will be for sure helpful for others.

A few weeks ago I chatted with Siavash exactly about this idea. We think accuracy should remain untouched, but runtime is expected to decrease significantly. Not sure when I will have the time to dig into this, but if you could provide both of your custom made reference sets it would be a wonderful second test set, besides GG13.8 99%.

Best,
Stefan

hugh · October 9, 2020, 12:08am

I have been trying a similar approach to build a sepp database with another gene region (catalase-peroxidase, katG). Thanks to @dethlefs for posting their efforts; they really helped me sort out some issues. I was able to import the files as a SeppReferenceDatabase type, and then ran the sepp plugin successfully. I did not find how to look at the results, especially the placements json. (I exported the tree as a newick.) Here are a couple of things I learned, if it helps anyone:

The files in the directory that is imported have to have those exact names, as in @dethlefs post above. The alignment has to be called aligned-dna-sequences.fasta etc.
I got an error when I tried to import the alignment with lowercase bases. I had to convert these to uppercase. I don't know how you would deal with soft-masked input.
You cannot have underscores in your reference sequence names. The newick format reads these as spaces so then you get an error that it doesn't match the alignment, where the underscores are maintained. I have to go back and change the sequence id names, but was able to run it with some quick find/replace in the input files.

Thanks again for the discussion. I look forward to trying the classify-otus plugin, to compare against NB taxonomy classification.

cheers,
hugh

hugh · October 9, 2020, 12:10am

One other thing that is confusing me. Regarding 'trimming the reference alignment to just the fragment region being placed'. I don't understand. Isn't the whole point of fragment insertion that phylogenies made with short fragments not accurate? Doesn't that defeat the whole purpose of sepp and fragment insertion? Sorry if I misread.

cheers,
Hugh

dethlefs · October 12, 2020, 9:59pm

Hi @hugh,

Sorry for the slow reply. You're right that phylogenies made with short fragments are likely to be less accurate, and maybe downright misleading. With SEPP neither the output tree nor the reference tree is calculated de novo from short fragments...the topology of the tree is taken from the reference. In this case, it's a trimmed version the Silva 138 SSU 16S guide tree, which has a complicated history that I wouldn't be able to explain as well as the Silva folks, although it's derived from a full-length 16S maximum likelihood tree and has received expert curation from specialists in certain taxa.

Given that you have a reference tree that you want to use, and an alignment of the sequences associated with the leaves of that tree, SEPP's job is to make the best placement of new fragments into that tree. It has to align fragments to the existing alignment, and for that step having the reference alignment trimmed is helpful. While @Stefan and Siavash think alignment accuracy will be good for fragments against the full length alignment (and I generally agree), my experience with automated aligners in this context is that bases near the end of the fragment can get spread out to columns outside the intended region.

Imagine that a fragment has GCA for the first 3 bases, and the first 3 columns that we know to be appropriate (based on our PCR primers) are consistently GCC in the most similar reference seqs to that fragment. Imagine further that the triplet just prior to those columns in the similar reference seqs is GCA. The aligner doesn't know what PCR primers we used to obtain the fragments, and its mismatch and gap penalties might result in an optimal score when the fragment GCA is aligned just outside the appropriate area, with a gap spanning the next three columns. In effect, trimming the reference alignment is telling the aligner what we know about our PCR primers, in this example forcing it to chose GCA aligned to GCC with a mismatch inside the appropriate area of the reference alignment.

But that's not likely to have a huge impact since it will make a small difference in the alignment scores of probably a small number of fragments, which might or might not even affect their placement. @Stefan and Siavash point out the more important benefit for those lacking free access to supercomputers: if your aligner doesn't have to waste time for each fragment testing alignments that include the majority of columns outside the appropriate region, it can more quickly find the best alignment for each fragment within the appropriate region.

I've glossed over some steps...prior to fragment placement, the Silva guide tree (and perhaps most big reference trees) will have to have polytomies cleaned up, and the branch lengths of the guide tree will have to be re-estimated, along with the evolutionary parameters that go into maximum likelihood modeling of the phylogeny. I suppose if someone were to use the SEPP output tree to make branch length dependent arguments about something or other, or treat the associated ML evolutionary parameters as precise estimates, that could be bad. (And would be foolish.)

But usually, all we want is the most likely placement of our fragments into a tree topology that we trust (or at least are familiar with). I'd argue that the re-estimated branch lengths and parameters (following trimming) provide the best chance of getting the fragments placed accurately on the fixed-topology tree (because they're the best estimates for fragment alignment columns), even if estimates of average mutation rates and transition probabilities etc. are expected to be more precise when using longer sequences.

Regards,
Les

P.S. I haven't forgotten about sharing my work...still haven't gotten around to generating a README file, but I hope to soon.

hugh · October 12, 2020, 10:44pm

Hi @dethlefs,

Thank you very much for the explanation. I see the difference now. So, the tree topology is from the full alignment, but then you advise to trim the alignment down to the fragment size for creating the SEPP database and inserting the fragments? Your reasoning makes good sense. In my particular case the fragments are actually the result of hybrid capture of probes designed from across a 2200 bp gene, so any fragment can belong to any part of the gene. I have trained a NB classifier in Qiime to try to sort out the correct species ID of these captured reads. This is getting off topic so I would be happy to create a new post to discuss this.

However, I do have other amplicon data with which I will try using SEPP. For creating the database, I did follow the steps for building a Sepp ref SEPP Github buildref page, as you mention.

Cheers,
Hugh

system · November 15, 2020, 1:22am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.