Open-reference clustering; difference in output files

nricks · January 31, 2018, 7:07pm

The open-reference clustering command produces two sequence outputs, --o-clustered-sequences and --o-new-reference-sequences.
What are the differences between these files, and which one should I use for MAFFT analysis?

jakereps · January 31, 2018, 7:43pm

Answering without direct knowledge of how the pipeline works, so someone correct me if I am wrong, but interpreting the documentation:

  --o-clustered-sequences ARTIFACT PATH FeatureData[Sequence]
                                  Sequences representing clustered features.
                                  [required if not passing --output-dir]
  --o-new-reference-sequences ARTIFACT PATH FeatureData[Sequence]
                                  The new reference sequences. This can be
                                  used for subsequent runs of open-reference
                                  clustering for consistent definitions of
                                  features across open-reference feature
                                  tables.  [required if not passing --output-
                                  dir]

It appears that --o-clustered-sequences holds the representative sequences for the features found in your input data, and it is the one to use for further analysis (MAFFT, etc.). The --o-new-reference-sequences looks to be an updated reference set: your old reference database plus the de novo features from this run. Passing it as the reference in later runs lets those runs cluster reads against the representative sequences from this specific result.
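
To make the distinction concrete, here is a minimal Python sketch of how the two outputs relate, assuming the interpretation above is right. It is a toy model, not QIIME 2's implementation: the function name `merge_outputs`, the feature IDs, and the sequences are all made up for illustration.

```python
def merge_outputs(reference, matched_ids, de_novo):
    """Toy model of the two sequence outputs of open-reference clustering.

    reference:   dict of reference ID -> sequence (the input reference database)
    matched_ids: reference IDs that reads clustered against in this run
    de_novo:     dict of new feature ID -> sequence for reads with no reference hit
    """
    # Clustered sequences: one sequence per feature observed in THIS run.
    clustered = {ref_id: reference[ref_id] for ref_id in matched_ids}
    clustered.update(de_novo)

    # New reference: the whole old reference plus this run's de novo features.
    new_reference = dict(reference)
    new_reference.update(de_novo)
    return clustered, new_reference


# Hypothetical data: three reference sequences, one of which was matched.
reference = {"ref1": "ACGTACGT", "ref2": "TTGGCCAA", "ref3": "GGGGCCCC"}
clustered, new_reference = merge_outputs(
    reference, matched_ids=["ref1"], de_novo={"denovo1": "ATATATAT"}
)

print(sorted(clustered))      # ['denovo1', 'ref1']
print(sorted(new_reference))  # ['denovo1', 'ref1', 'ref2', 'ref3']
```

In this toy case the clustered sequences contain only what was seen in the run, which is why they are the natural input for alignment, while the new reference also keeps the unmatched reference entries.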

Nicholas_Bokulich · February 1, 2018, 2:46pm

@nricks,

@jakereps is absolutely correct: you want to use the clustered seqs for downstream analyses, e.g., MAFFT alignment.

The purpose of new-reference-sequences is to serve as the reference sequences in future analyses that you wish to compare to this one. For example, imagine that you are conducting a longitudinal study over the course of a year, split across multiple sequencing runs, and you are analyzing your data batch by batch as it's ready. As you process each batch, you would use the new-reference-sequences from the previous batch as the reference sequences, so that the same OTU IDs are assigned and all batches can be compared against one another.
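
The batch-by-batch idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: exact sequence matching stands in for percent-identity clustering, and the function `cluster_batch`, the `denovoN` ID scheme, and the sequences are hypothetical rather than anything QIIME 2 produces.

```python
def cluster_batch(reads, reference):
    """Toy open-reference step: exact matching stands in for identity clustering.

    reads:     list of read sequences from one sequencing batch
    reference: dict of feature ID -> sequence
    Returns (feature ID assigned to each read, new reference dict).
    """
    seq_to_id = {seq: fid for fid, seq in reference.items()}
    new_reference = dict(reference)
    assigned = []
    for read in reads:
        if read not in seq_to_id:
            # No reference hit: create a de novo feature and add it to the reference.
            fid = f"denovo{len(new_reference)}"  # hypothetical ID scheme
            seq_to_id[read] = fid
            new_reference[fid] = read
        assigned.append(seq_to_id[read])
    return assigned, new_reference


reference = {"ref1": "ACGT"}

# Batch 1 runs against the original reference.
ids_batch1, ref_after_batch1 = cluster_batch(["ACGT", "TTTT"], reference)

# Batch 2 runs against the new reference produced by batch 1.
ids_batch2, ref_after_batch2 = cluster_batch(["TTTT", "GGGG"], ref_after_batch1)

print(ids_batch1)  # ['ref1', 'denovo1']
print(ids_batch2)  # ['denovo1', 'denovo2']
```

The read `TTTT` was de novo in batch 1, but because batch 2 used the reference produced by batch 1, it gets the same ID in both batches, which is what makes the resulting tables comparable.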

I hope that helps! (and thank you @jakereps for the excellent answer!)
