Re-obtaining rep-seqs and feature tables after filtering for barcoding

pedrolafarguem · July 5, 2019, 12:22am

Hi, I have obtained my taxonomic assignments from the classifier and now I need to discard some rep-seqs. I followed previous posts on how to filter out cyanobacteria, mitochondria, and chloroplast sequences, but I don't know whether I did it correctly or how to check.
I am confused and have the following questions.

The inputs are table.qza (my feature table?) and rep-seqs.qza. Both files are DADA2 outputs. How can I regenerate the rep-seqs so that they no longer include the contaminants?

Should the new output files be fed back into this step to restart the process?
qiime feature-table summarize
--i-table full-table.qza <---- table-no-mitochondria-chloroplast.qza
--m-sample-metadata-file
--o-visualization full-table.qzv

and here,

#Visualization of the feature table and representative seqs
qiime feature-table tabulate-seqs
--i-data rep-seqs.qza <---- rep-seq-no-cyanobacteria.qza
--o-visualization rep-seqs.qzv

This is what I used:
# --i-table: table.qza (feature-table-filtered.qza)
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file taxonomy.qza \
  --p-where "Taxon NOT LIKE '%Cyanobacteria%'" \
  --p-exclude-ids \
  --o-filtered-table feature-table-sans-cyanobacteria.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file taxonomy.qza \
  --p-where "Taxon NOT LIKE '%Cyanobacteria%'" \
  --p-exclude-ids \
  --o-filtered-data rep-seq-no-cyanobacteria.qza

qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-no-mitochondria-chloroplast.qza

This might be a conceptual confusion, but I would appreciate the help.
Many thanks in advance.

Mehrbod_Estaki · July 5, 2019, 1:07am

Hi @pedrolafarguem,
Welcome to the forum!
You are correct. When you filter your feature-table and rep-seqs files, a new version of each file is produced, so you'll have to run the visualizations on the newly created artifacts. The original table and rep-seqs remain untouched.

pedrolafarguem · July 5, 2019, 2:11pm

Hi @Mehrbod_Estaki, thanks for your reply. Could you please give me some feedback here?

I filtered the table as mentioned before and used it as input.

# --i-table: table.qza (feature-table-filtered.qza)
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file taxonomy.qza \
  --p-where "Taxon NOT LIKE '%Cyanobacteria%'" \
  --p-exclude-ids \
  --o-filtered-table feature-table-sans-cyanobacteria.qza
## (Concept confusion: which feature table should I carry forward for further analysis, table.qza or full-table.qza?)

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file taxonomy.qza \
  --p-where "Taxon NOT LIKE '%Cyanobacteria%'" \
  --p-exclude-ids \
  --o-filtered-data rep-seq-no-cyanobacteria.qza

qiime feature-table tabulate-seqs \
  --i-data rep-seq-no-cyanobacteria.qza \
  --o-visualization rep-seq-no-cyanobacteria.qzv
#Saved Visualization to: rep-seq-no-cyanobacteria.qzv

After this, can I use the sample_metadata file again in the downstream analysis?
I want to obtain the unique seqs that do not contain Cyanobacteria and then rerun core diversity and taxonomy assignment.
I have written the following commands, but the number of unique seqs drops too sharply, from 1000 to 20, which I don't think makes sense. How can I control this?

qiime feature-table summarize \
  --i-table feature-table-sans-cyanobacteria.qza \
  --m-sample-metadata-file '/media/pedro/Shared/2792_data_training/sample_metadata.csv' \
  --o-visualization feature-table-sans-cyanobacteria.qzv
#Saved Visualization to: feature-table-sans-cyanobacteria.qzv

#Generate a tree for phylogenetic diversity analyses, using SEPP
qiime fragment-insertion sepp \
  --i-representative-sequences rep-seq-no-cyanobacteria.qza \
  --o-tree insertion-tree-noCyano.qza \
  --o-placements insertion-placements-noCyano.qza

#Filter our feature table with our fragments that are featured in the tree (table.qza) and those without (removed-table.qza)
qiime fragment-insertion filter-features \
  --i-table feature-table-sans-cyanobacteria.qza \
  --i-tree insertion-tree-noCyano.qza \
  --o-filtered-table table-sans-cyanobacteria.qza \
  --o-removed-table removed-table-sans-cyanobacteria.qza

qiime feature-table summarize \
  --i-table feature-table-sans-cyanobacteria.qza \
  --m-sample-metadata-file '/media/pedro/Shared/2792_data_training sample_metadata.csv' \
  --o-visualization table-sans-cyanobacteria.qzv

qiime tools view table-sans-cyanobacteria.qzv

qiime feature-table summarize ######did not pass
-i-table removed-table-sans-cyanobacteria.qza
--m-sample-metadata-file '/media/pedro/Shared/2792_data_training/sample_metadata.csv'
--o-visualization removed-table-sans-cyanobacteria.qzv

Saved Visualization to:

qiime tools view removed-table-sans-cyanobacteria.qzv

There was an issue with merging QIIME 2 Metadata:

Cannot merge metadata with overlapping columns. The following columns overlap: 'anonymized_name', 'sample_country_origin', 'sample_fermentation_location', 'sample_continent', 'dna_extracted', 'sample_type', 'description', 'scientific_name', 'collection_year', 'predominant_genetics', '16S _V3V4_r1', '16S _V3V4_r2', '2RPOBAP_289-308F_r1', '2RPOBAP_622-644R_r2', '3DNAKAP_275-294F_r1', '3DNAKAP_617-636R_r2', '1GROELAP _71-92F_r1', '1GROELAP_377-398R_r2', '3GROELLF_231-252F_r1', '3GROELLF_536-557R_r2', 'Index N7_ID', 'Index_N7', 'Index_N5_ID', 'Index_N5', 'well_id', 'barcodes'

CORE DIVERSITY WITHOUT CYANO
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny insertion-tree-noCyano.qza \
  --i-table table-sans-cyanobacteria.qza \
  --m-metadata-file '/media/pedro/Shared/2792_data_training/sample_metadata.csv' \
  --p-sampling-depth 1000 \
  --output-dir core-metrics-results-nocyano

ALPHA diversity
qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results-nocyano/faith_pd_vector.qza \
  --m-metadata-file '/media/pedro/Shared/2792_data_training/sample_metadata.csv' \
  --o-visualization core-metrics-results-nocyano/faith_pd_group-significance.qzv

qiime diversity alpha-group-significance \
  --i-alpha-diversity core-metrics-results-nocyano/evenness_vector.qza \
  --m-metadata-file '/media/pedro/Shared/2792_data_training/sample_metadata.csv' \
  --o-visualization core-metrics-results-nocyano/evenness-group-significance.qzv

Plugin error from diversity:
float division by zero

Finally, when reviewing core-metrics-results-nocyano/faith_pd_group-significance, I see this message:

The following pairwise group comparisons have been omitted because the Kruskal-Wallis test could not be completed. This can occur if the two groups being compared each have a sample size (n) of 1 and contain the same single value.

Note: I was following the Moving Pictures, FMT, and Filtering Data tutorials.

Could you please guide me through this?

Many thanks again.
Pedro

Mehrbod_Estaki · July 5, 2019, 8:26pm

Hi @pedrolafarguem,
I believe your initial filter step is wrong if you are trying to remove Cyanobacteria. The way you have it set up I think will actually only keep those that ARE Cyanobacteria. Can you confirm this?
I think the command you want to use would be

qiime feature-table filter-features \
  --i-table feature-table-filtered.qza \
  --m-metadata-file taxonomy.qza \
  --p-where "Taxon NOT LIKE '%Cyanobacteria%'" \
  --o-filtered-table feature-table-filtered-without-Cyanobacteria.qza

without --p-exclude-ids, OR keep --p-exclude-ids and remove the 'NOT' from your --p-where line. Combining NOT LIKE with --p-exclude-ids inverts the selection twice, which probably explains why you lose all but 20 of your rep-seqs.
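
To make the double inversion concrete, here is a minimal Python sketch. It assumes that --p-where is evaluated as a SQLite WHERE clause over the taxonomy metadata and that --p-exclude-ids discards the matching IDs instead of keeping them. The taxonomy strings, the `tax` table, and the helper names are made up for illustration and are not part of QIIME 2.

```python
import sqlite3

# Hypothetical toy taxonomy: feature ID -> Taxon string (illustration only).
TAXONOMY = {
    "asv1": "k__Bacteria; p__Cyanobacteria; c__Chloroplast",
    "asv2": "k__Bacteria; p__Firmicutes",
    "asv3": "k__Bacteria; p__Proteobacteria; f__mitochondria",
    "asv4": "k__Bacteria; p__Cyanobacteria",
}


def select_ids(where):
    """Return the feature IDs whose Taxon satisfies a SQLite WHERE clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tax (id TEXT, Taxon TEXT)")
    conn.executemany("INSERT INTO tax VALUES (?, ?)", list(TAXONOMY.items()))
    rows = conn.execute(f"SELECT id FROM tax WHERE {where}").fetchall()
    conn.close()
    return {row[0] for row in rows}


def filter_features(where, exclude_ids=False):
    """Keep the matching IDs, or drop them when exclude_ids is True."""
    matched = select_ids(where)
    return set(TAXONOMY) - matched if exclude_ids else matched


# NOT LIKE alone keeps the non-Cyanobacteria features.
kept_ok = filter_features("Taxon NOT LIKE '%Cyanobacteria%'")
# NOT LIKE plus exclude_ids keeps ONLY the Cyanobacteria features.
kept_inverted = filter_features("Taxon NOT LIKE '%Cyanobacteria%'", exclude_ids=True)
# LIKE plus exclude_ids is the other valid way to remove Cyanobacteria.
kept_alt = filter_features("Taxon LIKE '%Cyanobacteria%'", exclude_ids=True)
print(sorted(kept_ok), sorted(kept_inverted), sorted(kept_alt))
```

In this toy example the first and third calls keep the same features, while the middle call keeps only the Cyanobacteria, which mirrors the behaviour described above.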

On which table to carry forward: you start out with a table called table.qza and filter it using the filter-features method. The output, which you call feature-table-sans-cyanobacteria.qza, is a new table; your original table.qza remains unchanged. The same goes for rep-seqs.qza: you create a new file called rep-seq-no-cyanobacteria.qza while the original remains unchanged.
Moving downstream, you'll want to use these new filtered files.

On reusing the sample metadata: if you are referring to the .csv file, then yes, because the sample metadata file has nothing to do with your filtering steps; it just contains information about your samples, for example group designation, collection dates, etc.

On the drop in unique sequences: see my comment above regarding the proper use of the filtering command.

Try that first and let us know if you run into any more issues, we can troubleshoot from there.

pedrolafarguem · July 12, 2019, 10:03am

Hi @Mehrbod_Estaki,
Many thanks for the feedback. It is working now, and I have obtained the table-without-chloroplast and applied the other filters.

To streamline this, I am wondering whether I can remove multiple contaminants in one line. I tried --p-where "Taxon NOT LIKE '%Chloroplast,Mitocondria%'" but it only recognizes Chloroplast, not mito. Any suggestions? Or do I have to filter the table-without-chloroplast again, applying one filter to each new table in turn?

What I am trying to do is find out whether the ASVs are exclusive to a specific group, so that I can then sum the samples and merge the three runs. Should I use my new contaminant-free tables for this?
Can an ASV be a mutation or variant of another ASV? Is this determined in the DADA2 step, or will a separate ASV be produced even if there is just a 1 bp difference?

As I am working on barcoding, I am interested in finding the unique or exclusive ASVs per sample and then per group. In my understanding, if an ASV is not present in the duplicate sample, it is a PCR error or mutation. How could I compare the samples as an initial filter, instead of having to run beta diversity per sample as the last step of the pipeline? I was following your other comment in the forum (Filtering for private ASVs - #3 by Nicholas_Bokulich) about using core-features and combining different tables. Is there any example or notebook that you could share?

--p-min-fraction FLOAT (the ASV in one sample?)
--p-max-fraction FLOAT ?

Also, I would like to extract the ASVs present in each sample so that I have a table of features by samples, with the abundance of each feature in each sample. Any thoughts?

Many thanks and bests,

Nicholas_Bokulich · July 15, 2019, 12:33pm

Have you tried the qiime taxa filter-seqs and filter-table methods? These accept multiple include/exclude options.
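
As a side note on the --p-where attempt quoted above, the following Python sketch shows how SQLite treats a comma inside a LIKE pattern. It assumes --p-where is evaluated as a SQLite WHERE clause; the taxonomy strings, the `tax` table, and the helper name are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical toy taxonomy: feature ID -> Taxon string (illustration only).
TAXONOMY = {
    "asv1": "k__Bacteria; p__Cyanobacteria; c__Chloroplast",
    "asv2": "k__Bacteria; p__Firmicutes",
    "asv3": "k__Bacteria; p__Proteobacteria; f__mitochondria",
}


def ids_matching(where):
    """Return the feature IDs whose Taxon satisfies a SQLite WHERE clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tax (id TEXT, Taxon TEXT)")
    conn.executemany("INSERT INTO tax VALUES (?, ?)", list(TAXONOMY.items()))
    rows = conn.execute(f"SELECT id FROM tax WHERE {where}").fetchall()
    conn.close()
    return {row[0] for row in rows}


# A comma inside the pattern is a literal character, not a list separator:
# no Taxon contains the text "Chloroplast,Mitochondria", so nothing is removed.
literal = ids_matching("Taxon NOT LIKE '%Chloroplast,Mitochondria%'")

# Two conditions joined with AND remove both kinds of contaminant.
# SQLite's LIKE is case-insensitive for ASCII letters by default.
combined = ids_matching(
    "Taxon NOT LIKE '%Chloroplast%' AND Taxon NOT LIKE '%Mitochondria%'"
)
print(sorted(literal), sorted(combined))
```

The taxa filter-table and filter-seqs methods avoid this by accepting a comma-separated list directly, as in the --p-exclude mitochondria,chloroplast call earlier in the thread.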

On whether an ASV can be a variant of another ASV: theoretically, only if it is a true biological variant (e.g., due to multi-copy heterogeneity). The purpose of DADA2 is to identify and correct PCR/sequencing errors that lead to spurious variants. But no method is perfect.

I agree duplicate samples should share all ASVs, but strange things do happen. I would trust dada2 to identify errors/mutations more than I would an arbitrary rule like this.

On filtering as an initial step: yes, use qiime feature-table filter-features --p-min-samples 2 to filter out features that are unique to a single sample (this does not look at duplicates specifically, but would accomplish roughly what you are after).
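
To illustrate what a minimum-samples filter does, here is a small Python sketch over an in-memory table. The table contents and the function name are hypothetical; this is not how QIIME 2 implements the filter, only the same idea on toy data.

```python
# Hypothetical toy feature table: feature ID -> {sample ID: read count}.
TABLE = {
    "asv1": {"S1a": 10, "S1b": 12, "S2a": 0},
    "asv2": {"S1a": 3, "S1b": 0, "S2a": 0},
    "asv3": {"S1a": 0, "S1b": 5, "S2a": 7},
}


def filter_min_samples(table, min_samples):
    """Keep features observed (count > 0) in at least min_samples samples."""
    kept = {}
    for feature, counts in table.items():
        n_present = sum(1 for count in counts.values() if count > 0)
        if n_present >= min_samples:
            kept[feature] = counts
    return kept


shared = filter_min_samples(TABLE, 2)
# Features dropped by the filter are those private to a single sample.
private = set(TABLE) - set(shared)
print(sorted(shared), sorted(private))
```

Note that asv3 survives even though it is absent from S1a, because the filter counts samples overall and does not know which samples are duplicates of each other.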

On extracting a per-sample table of ASVs and abundances: see the exporting tutorial.
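
Once a table has been exported, reshaping it into one row per sample and feature is straightforward. The sketch below assumes the counts have already been loaded into a plain dict of dicts; the data and the function name are hypothetical, and the export step itself is not shown.

```python
# Hypothetical toy feature table: feature ID -> {sample ID: read count}.
TABLE = {
    "asv1": {"S1a": 10, "S1b": 12, "S2a": 0},
    "asv2": {"S1a": 3, "S1b": 0, "S2a": 0},
    "asv3": {"S1a": 0, "S1b": 5, "S2a": 7},
}


def to_long_format(table):
    """Return sorted (sample, feature, count) rows for every non-zero count."""
    rows = []
    for feature, counts in table.items():
        for sample, count in counts.items():
            if count > 0:
                rows.append((sample, feature, count))
    return sorted(rows)


long_rows = to_long_format(TABLE)
for sample, feature, count in long_rows:
    print(sample, feature, count, sep="\t")
```

Each output row lists one ASV observed in one sample together with its abundance, which is the shape of table asked about above.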

Good luck!
