Mixed orientation data

I am combining samples from 4 different studies and have handled all data in the same manner up until classification. I then merged the feature tables and rep seqs together and tried to classify them all together. I found that all samples from one of the studies just could not get classified past bacteria, and figured out that this is probably because they are not in the right orientation.
I’m trying to figure out how I can analyze this data with a classifier (as opposed to just closed reference OTU picking) but can’t figure out if this is doable… For taxonomic summary I could classify them separately and merge afterwards and that might be fine, but I have a feeling this would be an issue for diversity analyses, especially those involving making a tree… not sure if I can make a tree with data in different orientations?
I would prefer to start from my already demultiplexed, quality filtered, deblurred reads, but can start from the raw reads if that’ll open up more possibilities.
I can live with picking closed OTUs if that’s the only option, but thought I’d see if anyone else has thoughts - I’d appreciate any suggestions!

Hi @c.older,

A couple questions come to mind, which may or may not be appropriate.

First, are you sure that all your reads are from the same hypervariable region? I’m not sure if cross classifying would lead to the same issue, but I would double check, especially if the data is coming from someone/somewhere else. If they are mixed regions, then my best current advice is to use closed reference picking if you plan to do direct feature-to-feature comparisons.

If they’re mixed orientation, and you could get them to align from the same starting/ending positions, you might be able to reverse complement the reads and handle them that way. But, that’s a lot of ifs!
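
A minimal sketch of that reverse-complement step, assuming plain four-line FASTQ records (no wrapped sequence lines) and nothing beyond standard `awk`. The function name `revcomp_fastq` is made up for illustration. The sequence line is reversed and complemented, and the quality line is reversed so the scores stay attached to the right bases.

```shell
# Reverse-complement every record in a four-line-per-record FASTQ stream.
# Assumption: records are not line-wrapped. IUPAC codes other than ACGTN
# are passed through unchanged (a dedicated tool would complement those too).
revcomp_fastq() {
  awk '
    function rev(s,    i, out) {
      out = ""
      for (i = length(s); i > 0; i--) out = out substr(s, i, 1)
      return out
    }
    function revcomp(s,    i, c, p, c2, out) {
      out = ""
      for (i = length(s); i > 0; i--) {
        c = substr(s, i, 1)
        p = index("ACGTNacgtn", c)
        c2 = p ? substr("TGCANtgcan", p, 1) : c
        out = out c2
      }
      return out
    }
    NR % 4 == 2 { print revcomp($0); next }  # sequence line
    NR % 4 == 0 { print rev($0); next }      # quality line: reverse only
    { print }                                # header and "+" lines
  '
}
```

Usage would be something like `revcomp_fastq < SAMPLE_1.fastq > SAMPLE_1.rc.fastq` (hypothetical file names), after which the corrected files can be imported again.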

You could do taxonomic comparisons, and you’ll probably get more bang for your buck comparing at the genus level. Sometimes individual ASVs replicate, and sometimes they don’t. (You always hope they do, but…) But, you’re still relying on that classification and you will struggle with phylogenetic-based analyses, which I think might give you more information in the meta-analysis context.

I’m not sure there is one best answer, but that’s my two cents.

Best,
Justine


Hi Justine,

Thanks for responding!

I think maybe I figured out what’s wrong based on your two cents :slight_smile:

Since you asked if I was sure they were the same region, I went ahead and manually checked sequences (probably should’ve done this to begin with) and they are from the same hypervariable region. However, it looks like the sequences in the files that I would’ve thought were the reverse reads (file names like “SAMPLE1_2”; what I thought were forward reads were “SAMPLE1_1”) are actually the forward reads… so that’s interesting. But when I look at my demux summary, it looks like qiime2 was able to figure out which ones were forward and reverse, since I would expect the quality on the reverse to fall off earlier.

I had originally imported this data intending to use joined sequences, but had issues with joining given how much I needed to trim based on quality, so I decided to continue the analysis with single reads only (first with the non-paired version of q-score, which does accept paired end data), using the paired end data I had imported (shown in the demux). I’m thinking my issue was at this step, because when I pull a sequence from the q-score output and align it to the raw “forward” and “reverse” reads, it aligns to the “forward” (incorrect) read. Instead of using the sequences that are actually forward reads but not labelled accordingly in the file names (e.g. “SAMPLE_2”), it used the sequences labelled as forward (e.g. “SAMPLE_1”) that are actually reverse sequences. This doesn’t make sense to me since it looked like demux summary knew which files were forward and reverse…

So I have a few questions:

  1. when I import paired end data and do a demux summary, how does qiime2 determine which are forward and which are reverse reads? Just based off file names, where R1/1 would be assumed to be forward?
  2. Is there a way in using the q-score plugin to specify which reads are forward and reverse?
  3. Does this issue make sense or is this unlikely to be the issue?

I’d love to hear your (or anyone else’s!) thoughts on this. I think if this is the issue, the best way to deal with this data is re-import just the actual forward reads and process these. I’m not able to work with this data until next week, but figured I’d go ahead and post now in case anyone has some ideas before I start working with this again. :grimacing:
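
A rough sketch of what re-importing only the true forward reads could look like, using a single-end manifest so that direction is not inferred from file names. The `_2.fastq` suffix, the directory name, the output names, and the helper `write_single_end_manifest` are all assumptions for illustration; the import command is shown as a comment and is not run by the snippet itself.

```shell
# Build a single-end manifest (tab-separated, V2 style header) that lists
# only the files holding the reads in the true forward orientation.
# Arguments: <input dir> <file suffix to keep> <manifest path>
write_single_end_manifest() {
  in_dir=$1
  suffix=$2
  manifest=$3
  printf 'sample-id\tabsolute-filepath\n' > "$manifest"
  for fq in "$in_dir"/*"$suffix"; do
    [ -e "$fq" ] || continue                      # no matching files
    sample=$(basename "$fq" "$suffix")            # e.g. SAMPLE1
    abs=$(cd "$(dirname "$fq")" && pwd)/$(basename "$fq")
    printf '%s\t%s\n' "$sample" "$abs" >> "$manifest"
  done
}

# Example (hypothetical paths, not run here):
#   write_single_end_manifest Brad _2.fastq Brad_manifest.tsv
#   qiime tools import \
#     --type 'SampleData[SequencesWithQuality]' \
#     --input-path Brad_manifest.tsv \
#     --input-format SingleEndFastqManifestPhred33V2 \
#     --output-path Brad_demux-true-forward.qza
```

Listing the files explicitly makes it easy to double-check which file each sample ID points at before anything is imported.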

Just to explain why classification did work on the reverse complement of my sequences, I had made/trained a classifier for the whole region that would’ve been covered by the joined paired ends (~500 bp), so would still cover either the forward or reverse reads. Made this classifier prior to deciding to use just the forward reads and haven’t gotten around to trimming it to just the forward read length/region :slight_smile:


@c.older,

The fact that they’re the same hypervariable region helps!

The naming thing is weird!

It depends on how you’re importing. I’m assuming you used the Casava import, in which case it does infer direction based on R1/R2 in the file names. If you’re using a manifest, you specify the direction, and then it renames the files and infers position again based on an R1/R2 format.

It seems odd to me that the q-score plug-in would behave that way, but maybe @thermokarst or @Nicholas_Bokulich could shed more light on this behavior? It seems surprising to me, though!

This would probably be my approach!

Best,
Justine


Hi @c.older, I’m not really following what you posted above (sorry!), any chance you could provide the following info:

  1. The version of QIIME 2 you are using
  2. The exact command or commands you ran (copy and paste)
  3. The exact error you saw (copy and paste) when run with --verbose flag

If you can send any data my way (a link in a DM to me works if you don’t want to post publicly), that would really help, too.

Thanks!

:qiime2:

Hi,

Sorry it’s so confusing - there’s a lot going on with this issue!

2019.7

Importing my paired end sequences
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path Brad/ --input-format CasavaOneEightSingleLanePerSampleDirFmt --output-path Brad_demux-paired-end.qza
Quality filtering just with forward sequences
qiime quality-filter q-score --i-demux Brad_demux-paired-end.qza --o-filtered-sequences FBrad_filtered_seqs.qza --o-filter-stats FBrad_filtered_stats.qza
Deblur-ing; EDIT: forgot that I did do a trim
qiime deblur denoise-16S --i-demultiplexed-seqs FBrad_q_filtered_seqs.qza --o-table FBrad_deblur_table.qza --o-representative-sequences FBrad_deblur_rep_seqs.qza --o-stats FBrad_deblur_stats.qza --p-trim-length 280

I did this for sequencing data from 4 different studies and then merged the tables (qiime feature-table merge) and seqs (qiime feature-table merge-seqs) to give files “All_rep-seqs.qza” and “All_table.qza”.

Then classified with classify-sklearn on a greengenes classifier. I’m actually working with one specific for my region, but I get the same issue with the full length greengenes classifier and do not change any of the command other than the classifier filename
qiime feature-classifier classify-sklearn --i-classifier …/gg-13-8-99-nb-classifier.qza --i-reads All_rep-seqs.qza --o-classification All_taxonomy.qza

From this, I get classification for data from 3 of my 4 studies, with the last study getting classifications of “Bacteria”, “Unclassified” and the occasional phylum. When I classify that data set by itself, though, I get reasonable classifications, so I think it has to do with the orientation of this particular data set.

So I’m not getting any error message, it’s just an obvious error with the data.

Sending you the data via DM. Thanks for taking the time to look into this! :slightly_smiling_face:


Hi @c.older, thanks for this info!

What I am having a hard time following is the business about sample labels / orientations being swapped.

Since you used a manifest format QIIME 2 doesn’t need to “figure out” the direction — it uses the direction you specified in your manifest! At first glance, I think if there are issues with fwd/rev files being swapped, I would triple-check the manifest to make sure there aren’t any mistakes there.

The reason it aligns to the forward read here is because this method only operates on single-end reads. That means that if you pass in paired-end reads, the reverse reads will be discarded.

In your example are SAMPLE_1 and SAMPLE_2 one sample (fwd and rev) or two samples (one direction)?

Can you provide an example of this that illustrates the issue?


I am wondering if you are running into another variation of the bug reported here:

but, I need a bit more information to be sure (requested above). In the meantime, can you share some manifest or demux seqs in the DM you opened up with me? Thanks! :t_rex:

I checked to make sure I did not make a mistake here and it looks correct (unless the files are named incorrectly).

This would be an example of one sample with fwd as “_1” and rev as “_2”

I don’t have a super great way to show this, but here’s an alignment showing some of the sequences that I think are in the wrong orientation. “SRR…R” is one of the sequences from the file I would’ve thought would be the rev reads (original file name “SRR…_2.fastq”) and the “SRR…FRC” is a reverse complement of one of the sequences from the file I would’ve thought would be the fwd reads (original file name like “SRR…_1.fastq”). Bano_staph is a representative seq from one of the studies (the other studies I pulled sequences from have this same orientation). You can see the “rev seq” looks more like what would be expected from a forward seq and vice versa for the “fwd seq”.



I’m convinced I just did something wrong, but perhaps it could be that they were originally uploaded or named incorrectly? Not sure… sharing files via DM and hopefully you can see if I’m missing something :+1:t4:


This is still on my radar — playing catch-up from last week’s Thanksgiving holiday — I will take a look some time next week! :spiral_calendar:


Hey @c.older,

Based on some tinkering with BLAST, I think the original data may be just in the wrong orientation like you suspect:

I don’t see anything in your analysis that suggests any error. I grabbed a few reads out of SRR2923680_1.fastq and they seem to align well to the reverse complement of a variety of Staphylococcus 16S sequences (Plus/Minus strand orientation).
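
One way to repeat this kind of strand check outside of BLAST, as a rough sketch: search a handful of reads against a 16S reference on both strands and tally which strand each hit came from. The commented `vsearch` call, the identity threshold, and the file names `reads.fasta`, `ref_16S.fasta`, and `hits.tsv` are assumptions for illustration; only the tallying function runs as written.

```shell
# Assumed upstream step (not run here; file names are hypothetical):
#   vsearch --usearch_global reads.fasta --db ref_16S.fasta --id 0.9 \
#     --strand both --userout hits.tsv --userfields query+target+id+qstrand
#
# Tally the strand column (4th tab-separated field) of that output.
# Mostly "-" hits would suggest the reads are reverse complemented
# relative to the reference.
tally_strands() {
  awk -F '\t' '
    $4 == "+" { plus++ }
    $4 == "-" { minus++ }
    END { printf "plus=%d minus=%d\n", plus, minus }
  ' "$1"
}
```

Running `tally_strands hits.tsv` on the files for each study separately should make a study with flipped reads stand out.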

I don’t have a great explanation for why your quality plots look like the correct orientation, other than the possibility that the original data had the V3 RC primer as its “forward” and V1 as the “reverse”. In which case the quality scores would look like we expect (and what we see). This is only a guess, however; I don’t think I’ve ever heard of anyone doing this.

Do others think this seems like a consistent explanation?


Hi Catlin,
I asked about the mixed orientation data problem before, and Nicholas helped me solve it. Please check the bottom part of this thread

In my case, based on the taxonomy results from the mixed orientation data, I knew which samples were in the correct orientation and which were in the reverse orientation. So I used the vsearch --fastx_revcomp command to correct those samples and then moved them into QIIME 2 again.
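
The per-sample correction described here could be scripted roughly as follows. This is only a sketch: it assumes `vsearch` is on the PATH and that the FASTQ files for the mis-oriented samples have been collected into one directory. The function name and directory names are hypothetical.

```shell
# Reverse-complement every FASTQ file in one directory with vsearch,
# writing the corrected files under the same names into another directory.
# Arguments: <dir with mis-oriented FASTQ files> <output dir>
revcomp_with_vsearch() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for fq in "$in_dir"/*.fastq; do
    [ -e "$fq" ] || continue   # directory had no .fastq files
    vsearch --fastx_revcomp "$fq" --fastqout "$out_dir/$(basename "$fq")"
  done
}

# Example (hypothetical directories, not run here):
#   revcomp_with_vsearch misoriented_study corrected_study
```

The corrected files would then be imported into QIIME 2 and processed the same way as the other studies before merging.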

Hope these steps also work for you.


@ebolyen Thanks for checking out the data. Such a weird situation but glad you agree that it just looks like it’s in the wrong orientation and not some other underlying issue…

@Lei Thanks for sharing your thread and suggestion! I’ll give this a try and see if it works.

I’ll make sure to update you all on whether this solves my problem! Sorry in advance if there’s not an update prior to the holidays :grimacing:
