Large portion of Cyanobacteria

I need help, please. I assigned taxonomy using my own trained classifier for the V3/V4 region.
The problem is that after I removed the primers, most of my reads were retained as non-chimeric, which is good, but Cyanobacteria now dominate my taxonomy file. In my previous taxonomy file Proteobacteria were dominant, which makes more sense to me.

Here is the final taxonomy file I generated:
atra-taxa-bar-plots.qzv (833.5 KB)

Here is the earlier taxonomy file I generated, although far fewer reads were included there because almost half of my sequences were flagged as chimeras:
taxa-bar-plots.qzv (570.2 KB)

What could I have done wrong to get that much Cyanobacteria which makes no sense to me?

Please always create a new topic if your question is no longer relevant to the original one. It helps moderators resolve the issue faster, and it will be more visible to other users with similar questions.

In your case, Cyanobacteria are dominant because of a large amount of Chloroplast. I guess you are working with plant material. Here is a tutorial on how to get rid of organelle DNA.

After filtering, dominance will probably return to the Proteobacteria. But not necessarily: your first run was biased by a large portion of false-chimeric reads.


Actually, it is mainly sediment and gut samples, not plant material.

Still, you can remove organelle DNA from the dataset.

I have several different Cyanobacteria classifications in my taxonomy, as follows, and I am unsure whether I should exclude all of them:
d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Chloroplast;f__Chloroplast;g__Chloroplast | 19.457%





So should I exclude all the orders of Cyanobacteria, or just Chloroplast?

What confuses me is that their presence in sediment or seawater samples could be normal, but a high abundance in midgut samples does not seem normal to me. How did they get there?

No, you do not need to exclude all Cyanobacteria. Just chloroplast and mitochondria, as shown in the tutorial.

I know that a lot of chloroplast can be found in human mouth samples just because the subject ate some salad that day. Could they be coming from the food?


Do I then just use the new filtered-seq.qza file to create new taxa barplots? But then I would still be using the same unfiltered feature table. Is this okay, or should I also filter the feature table.qza?

First filter your feature table, and then (optionally) filter the rep-seq.qza file based on the filtered feature table.
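The two filtering steps described above could be sketched roughly as follows, using the same pattern as the QIIME 2 filtering tutorial. The function name `filter_organelles` and the output file names are placeholders chosen for illustration, not files from this thread.

```shell
# Sketch only: filter_organelles and the *-no-organelles.qza names are
# hypothetical placeholders; adjust them to your own files.
filter_organelles() {
  table="$1"      # feature table artifact
  taxonomy="$2"   # taxonomy artifact (left unfiltered)
  seqs="$3"       # representative sequences artifact

  # Step 1: drop features annotated as mitochondria or chloroplast.
  qiime taxa filter-table \
    --i-table "$table" \
    --i-taxonomy "$taxonomy" \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-table table-no-organelles.qza || return 1

  # Step 2 (optional): keep only sequences still present in the filtered table.
  qiime feature-table filter-seqs \
    --i-data "$seqs" \
    --i-table table-no-organelles.qza \
    --o-filtered-data rep-seqs-no-organelles.qza
}
```

It would be called as `filter_organelles table.qza taxonomy.qza rep-seqs.qza`. Only the chloroplast and mitochondria annotations are excluded, so other Cyanobacteria stay in the table.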


qiime taxa barplot \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv

So, finally, when we generate taxa barplots after filtering out chloroplast, we use the filtered table.qza but still the unfiltered taxonomy file. Is this correct, given that we are only filtering table.qza and seq.qza but not taxonomy.qza?

Then, I believe, I should run phylogeny, alpha and beta diversity again using the new filtered table.qza and seq.qza files, correct?

Thank you so much for your patience with me.

There is no need to filter the taxonomy.qza file. It is not a problem when ASVs present in the taxonomy file are missing from the feature table and representative sequences. The problem is when ASVs from the feature table are missing from the rep-seqs or taxonomy files.

That's correct!
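As a rough sketch of that rerun: the filtered table and filtered sequences go in, while the taxonomy stays unfiltered. The function name `rerun_downstream`, the output names, and the sampling depth argument are assumptions for illustration. The depth should be picked again from a summary of the filtered table, because filtering changes the per-sample read counts.

```shell
# Sketch only: rerun_downstream and all output names are hypothetical.
rerun_downstream() {
  table="$1"      # filtered feature table
  seqs="$2"       # filtered representative sequences
  taxonomy="$3"   # original, unfiltered taxonomy
  metadata="$4"   # sample metadata file
  depth="$5"      # sampling depth chosen from the filtered table

  # Barplot: filtered table plus unfiltered taxonomy.
  qiime taxa barplot \
    --i-table "$table" \
    --i-taxonomy "$taxonomy" \
    --m-metadata-file "$metadata" \
    --o-visualization taxa-bar-plots-no-organelles.qzv || return 1

  # Rebuild the tree from the filtered sequences.
  qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences "$seqs" \
    --o-alignment aligned-rep-seqs.qza \
    --o-masked-alignment masked-aligned-rep-seqs.qza \
    --o-tree unrooted-tree.qza \
    --o-rooted-tree rooted-tree.qza || return 1

  # Recompute alpha and beta diversity on the filtered table.
  qiime diversity core-metrics-phylogenetic \
    --i-phylogeny rooted-tree.qza \
    --i-table "$table" \
    --p-sampling-depth "$depth" \
    --m-metadata-file "$metadata" \
    --output-dir core-metrics-results-no-organelles
}
```

For example: `rerun_downstream table-no-organelles.qza rep-seqs-no-organelles.qza taxonomy.qza sample-metadata.tsv 1000`, where 1000 is only a stand-in value.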

No worries, I am learning here as well by helping others.


Hi again,

After filtering my feature table and removing chloroplast and mitochondria, I still have Cyanobacteria dominating my midgut samples, which makes no sense to me.

atra-filtered-no-mitochondria-no-chloroplast-taxa-bar-plots.qzv (828.2 KB)

How could I proceed with this without ruining my data? I also have Cyanobacteria in the seawater and sediment samples, but in the midgut samples it is not logical.

Where could that weird classification come from?

Hi @Sabrin,

That is fine, as not all cyanobacteria are photosynthetic. Many are known to live in the gut too. See:

Also, if these are gut samples, it is entirely possible these could be dietary 'bycatch', that is, cyanobacteria living on surfaces that the host is eating. Possibly contamination too, but hard to say without knowing more about your study system.



Thank you so much for your quick response.

I am sampling the guts of wild sea cucumbers and dividing each gut into three compartments: foregut, midgut and hindgut. I am also sampling seawater and sediment.

It would make sense to me that these cyanobacteria come from feeding on sediment if the foregut were also dominated by them, but that is not the case. So this means these cyanobacteria are enriched in the midgut, but if they are photosynthetic cyanobacteria then that makes no sense.

How could I assess their metabolic activity? Any suggestions? I cannot find them in the literature in any gut study.

Not necessarily true. The foregut may simply not be the right environment for them to "stay around", so they just keep moving through the gut until they find a region where they can do well and become enriched there (midgut). Or perhaps it was indeed diet, and you happened to sample at a point in time when the DNA from the Cyanobacteria had moved to the midgut. Of course, these are just a few thoughts. Sounds like you may have an interesting research question blooming.

Remember that trying to identify very specific taxa at the genus and species level can be difficult with amplicon sequencing data. It could be that there are no good representatives of non-photosynthetic cyanobacteria within the reference database. I've not looked very thoroughly myself, though you can look here. Note, I did not find any Melainabacteria within SILVA. So they may be present under another name or may simply not have been included yet. Thus, it is possible that the assigned taxonomy is simply a spurious result of the query sequence consistently mapping to the closest, but unrelated, taxon within the reference database. Also, keep in mind that taxonomy is always changing, and new sequence data is being generated.

You can always use the various tools within QIIME 2 and RESCRIPt to fetch and append any 16S rRNA gene data of Melainabacteria from GenBank into your existing SILVA or other reference database (assuming the taxonomy labelling scheme is consistent between the files; GenBank and SILVA differ in some respects, but you can use qiime rescript edit-taxonomy ... to help). Then you can merge the new data into the existing reference database by using the qiime feature-table merge-seqs ... and qiime feature-table merge-taxa ... commands. Then you can train your new classifier.
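A minimal sketch of the merge-and-retrain part of that workflow is given here. It assumes the extra sequences and taxonomy have already been fetched and relabelled to match the base reference. The function name `append_and_train` and the merged-ref-* file names are hypothetical.

```shell
# Sketch only: append_and_train and the merged-ref-* names are hypothetical.
append_and_train() {
  base_seqs="$1"    # existing reference sequences (e.g. SILVA)
  base_tax="$2"     # existing reference taxonomy
  extra_seqs="$3"   # additional sequences (e.g. fetched from GenBank)
  extra_tax="$4"    # taxonomy for the additional sequences, already relabelled

  # Combine the two sequence sets into one reference.
  qiime feature-table merge-seqs \
    --i-data "$base_seqs" \
    --i-data "$extra_seqs" \
    --o-merged-data merged-ref-seqs.qza || return 1

  # Combine the matching taxonomy annotations.
  qiime feature-table merge-taxa \
    --i-data "$base_tax" \
    --i-data "$extra_tax" \
    --o-merged-data merged-ref-tax.qza || return 1

  # Train a new naive Bayes classifier on the merged reference.
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads merged-ref-seqs.qza \
    --i-reference-taxonomy merged-ref-tax.qza \
    --o-classifier merged-ref-classifier.qza
}
```

The rank prefixes and depth of the added taxonomy strings need to match the base reference before merging, which is where the edit-taxonomy step mentioned above comes in.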


@SoilRotifer Thank you so much for your thoughts.

I have a question, please, regarding training my classifier, as I fear it does not fit my sequences.

Here, in this RESCRIPt step:

qiime rescript filter-seqs-length-by-taxon \
  --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
  --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 1400 \
  --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
  --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza

Does --p-min-lens 900 1200 1400 mean I am losing any reference sequences shorter than 1200 bp for bacterial 16S? Could that be limiting my classifier to full-length 16S references? Would lowering it improve my chances by including shorter, amplicon-length reference sequences?

I am trying to understand why all my sequences are classified as bacteria and I have no unclassified results at all.

This is explained in the RESCRIPt documentation here, as well as the help text, which you can access by:

qiime rescript filter-seqs-length-by-taxon --help

Potentially. You can play around with the settings. If you do not need to differentially length-filter the sequences by taxonomy, you can simply use qiime rescript filter-seqs-length ... instead. This way you can apply a single length cutoff to everything, for example discarding any sequence shorter than 900 bases, or whatever length you choose.
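A rough sketch of that single-cutoff alternative follows. The function name `filter_ref_by_length` and the output names are placeholders. `--p-global-min` is, as far as I know, the RESCRIPt option for a minimum length applied to all sequences, but it is worth confirming against `qiime rescript filter-seqs-length --help` for your installed version.

```shell
# Sketch only: filter_ref_by_length and the output names are hypothetical.
filter_ref_by_length() {
  seqs="$1"      # reference sequences artifact
  min_len="$2"   # minimum length to keep, e.g. 900

  # One cutoff for every sequence, regardless of taxonomy.
  qiime rescript filter-seqs-length \
    --i-sequences "$seqs" \
    --p-global-min "$min_len" \
    --o-filtered-seqs ref-seqs-filt.qza \
    --o-discarded-seqs ref-seqs-discard.qza
}
```

For example: `filter_ref_by_length silva-138.1-ssu-nr99-seqs-cleaned.qza 900`. Sequences below the cutoff end up in the discarded artifact rather than being shortened.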

@SoilRotifer Hi again. As I still have doubts about my taxonomy, I am assigning taxonomy against the full-length classifier rather than the V3V4-specific classifier, just to be sure.

Are these steps correct, please, especially step 2, for assigning directly against silva-138-ssu-nr99-classifier.qza?

  1. qiime feature-classifier fit-classifier-naive-bayes \
       --i-reference-reads /user/asga9989/Taxonomy_output/training-feature-classifiers/silva-138-ssu-nr99-seqs-derep-uniq.qza \
       --i-reference-taxonomy /user/asga9989/Taxonomy_output/training-feature-classifiers/silva-138-ssu-nr99-tax-derep-uniq.qza \
       --o-classifier /user/asga9989/Taxonomy_output/training-feature-classifiers/silva-138-ssu-nr99-classifier.qza

  2. qiime feature-classifier classify-sklearn \
       --i-classifier /user/asga9989/Taxonomy_output/training-feature-classifiers/silva-138-ssu-nr99-classifier.qza \
       --i-reads /user/asga9989/atra-rep-seqs.qza \
       --o-classification /user/asga9989/atra-vs-full-classifier-taxonomy.qza

  3. qiime metadata tabulate \
       --m-input-file /user/asga9989/atra-vs-full-classifier-taxonomy.qza \
       --o-visualization /user/asga9989/atra-vs-full-classifier-taxonomy.qzv

Is there a full-length classifier, silva-138-ssu-nr99-classifier.qza, available that I could use directly? Just to check that my own full-length classifier file is not corrupted.

Best Regards,

@SoilRotifer I found Melainabacteria within SILVA, here it is:

But there are only four entries, which may still not be enough?

You can find a few pre-made files on the Data resources page.

Likely not, mainly because these labels only exist as organism names that have been used as the species labels. It is often quite difficult to obtain species-level classification with amplicon reads.

I now want to add the Melainabacteria from NCBI to my classifier, but I am stuck on how to start.
How do I add only Melainabacteria and not the whole set of 16S references, or should I add the whole 16S set?

This command fetches the whole 16S set:

qiime rescript get-ncbi-data \
  --p-query '33175[BioProject] OR 33317[BioProject]' \
  --o-sequences ncbi-refseqs-unfiltered.qza \
  --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza

Thanks in advance.