Filtering data table using CSV file

Hello qiime2 team!
I'm trying to filter a taxonomy table using a CSV or TSV file rather than listing out the bacteria using the "--p-exclude" command. Is this possible? If so, what are the steps? I'm running qiime2-amplicon-2023.9. The filtering data page (Filtering data — QIIME 2 2024.5.0 documentation) suggests the below code where I would have to individually list out the bacteria. I might have missed it, but I didn't see an option for filtering taxa based on another file. I have a curated CSV file that contains a list of 163 contaminants, specific to the dataset with which I'm working, that I would like to filter out.
Many thanks in advance, Caroline

qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-mode exact \
  --p-exclude "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria" \
  --o-filtered-table table-no-mitochondria-exact.qza

Hello!

Using a file for this kind of thing is possible in some QIIME 2 plugins (e.g. RESCRIPt allows you to use a tabular file with your replacements when editing taxonomy). In this case, after reading the documentation, I'm afraid you cannot do what you want by directly feeding QIIME 2 your CSV path. But that does not mean you can't use some Bash magic to make it work automatically.

First of all, your CSV file. I will assume you have a CSV file called contaminants.csv that contains only one column: the taxonomy of the contaminants.¹ I'll also assume this file has no headers.²

You can add the content of your file to a Bash variable:

# If CSV file has Linux newlines (\n):
contaminants=$(tr '\n' ',' < contaminants.csv | sed 's/,$//')

# If CSV file has Windows newlines (\r\n):
contaminants=$(tr -s '\r\n' ',' < contaminants.csv | sed 's/,$//')

The tr command replaces newlines with commas (the default value of --p-query-delimiter in qiime taxa filter-table). You can use another delimiter if you prefer, as long as you pass the same one to --p-query-delimiter. The sed command removes the trailing comma.
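
If you are not sure which kind of newlines your file has, a single variant that handles both is possible. The following is a minimal sketch on a throwaway sample file (the file contents and the tmpdir variable are made up for illustration): it deletes carriage returns first, which does nothing on a Linux-style file, and then joins the lines as described above.

```shell
# Hypothetical sample file with Windows (\r\n) line endings, for illustration only.
tmpdir=$(mktemp -d)
printf 'g__Ralstonia\r\ng__Bradyrhizobium\r\nf__mitochondria\r\n' > "$tmpdir/contaminants.csv"

# Delete carriage returns (a no-op for Linux files), turn newlines into commas,
# then strip the trailing comma.
contaminants=$(tr -d '\r' < "$tmpdir/contaminants.csv" | tr '\n' ',' | sed 's/,$//')

# Print the resulting query string for inspection.
printf '%s\n' "$contaminants"
rm -r "$tmpdir"
```

Printing the variable before passing it to QIIME 2 is a cheap way to confirm it is one comma-separated line.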

Once you have this, you can run your command as follows:

qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-mode exact \
  --p-exclude "$contaminants" \
  --o-filtered-table table_no_contaminants.qza

Best,

Sergio

--

¹ What if my file has more than one column?

There are a lot of things you can do (like manually creating another CSV with only your column of interest). For the sake of completeness, I'll provide one possible command line solution:

desired_column=$(awk -F',' '{print $3}' contaminants.csv)
# Translate only newline characters (\r, \n) so spaces inside taxon names are preserved.
contaminants=$(echo "$desired_column" | tr -s '\r\n' ',' | sed 's/,$//')

The first command keeps only one column and stores it in the variable desired_column. Here I assume the column of interest is the third ($3), but you can adapt it to your needs. I also assume the CSV field separator is a comma (-F',').

The second command creates the contaminants variable in a similar way as before, but instead of using the CSV as input we use the desired_column variable.

Now you are ready to run QIIME 2 with $contaminants using the command I wrote above.
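
As an alternative sketch of the same idea, cut and paste can do the column selection and the joining in one pipeline. The sample file, its three-column layout, and the tmpdir variable are assumptions for illustration; this also assumes no field contains a quoted comma, which a plain cut cannot handle.

```shell
tmpdir=$(mktemp -d)
# Hypothetical three-column file; the taxon is assumed to be in column 3.
printf '1,soil,g__Ralstonia\n2,water,g__Bradyrhizobium\n' > "$tmpdir/contaminants.csv"

# cut keeps field 3, tr drops any carriage returns, and
# paste -s joins all lines with commas (so there is no trailing comma to strip).
contaminants=$(cut -d',' -f3 "$tmpdir/contaminants.csv" | tr -d '\r' | paste -sd, -)

printf '%s\n' "$contaminants"
rm -r "$tmpdir"
```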

² What if my file has headers?

Again, I will give one of the multiple possible command line solutions, although you could simply open the CSV file and manually remove the first row.

Assuming you have already created the contaminants variable (either with the method in the post or with the method in footnote 1), all you have to do before running QIIME 2 is:

contaminants=$(echo "$contaminants" | cut -d',' -f2-)

This removes the first comma-separated value of the contaminants variable (that is, the column header).
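
Another option is to skip the header while reading the file, before the variable is built. A minimal sketch, assuming a one-column file whose first line is the header (the sample contents and the tmpdir variable are made up for illustration):

```shell
tmpdir=$(mktemp -d)
# Hypothetical one-column file with a header row.
printf 'taxon\ng__Ralstonia\ng__Bradyrhizobium\n' > "$tmpdir/contaminants.csv"

# tail -n +2 starts output at the second line, which drops the header.
contaminants=$(tail -n +2 "$tmpdir/contaminants.csv" | tr -d '\r' | paste -sd, -)

printf '%s\n' "$contaminants"
rm -r "$tmpdir"
```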

Hi @salias !
This is extremely helpful! Thank you so much!! I do have a follow up question: Does the qiime taxa filter-table command allow for pattern matching? For example, in my contaminants file, I want to remove anything that matches "g__Ralstonia" however in the dataset Ralstonia appears as "d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Burkholderiaceae;g__Ralstonia". "g__Ralstonia" is just one example of over 100 contaminants. Many thanks in advance, Caroline

Hello again @crw ,

I'm glad you got it working.

Not exactly pattern matching but substring matching. The --p-mode contains option is what you are looking for. From the documentation:

--p-mode TEXT Choices('exact', 'contains')
                         Mode for determining if a search term matches a
                         taxonomic annotation. "contains" requires that the
                         annotation has the term as a substring; "exact"
                         requires that the annotation is a perfect match to a
                         search term.                    [default: 'contains']

The default value is contains, so you can simply remove --p-mode exact from the command, or replace it with --p-mode contains if you want to be explicit.
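
To make the difference between the two modes concrete, here is a plain-shell illustration of exact versus substring comparison, using the Ralstonia annotation from the question. It does not call QIIME 2 and is not meant to show how the plugin is implemented; it only mirrors the semantics described in the help text.

```shell
annotation='d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Burkholderiaceae;g__Ralstonia'
term='g__Ralstonia'

# "exact": the whole annotation must equal the search term.
if [ "$annotation" = "$term" ]; then exact=match; else exact=no-match; fi

# "contains": the search term only has to appear somewhere in the annotation.
case "$annotation" in
  *"$term"*) contains=match ;;
  *)         contains=no-match ;;
esac

printf 'exact: %s\ncontains: %s\n' "$exact" "$contains"
```

With a short term like g__Ralstonia, only the substring comparison finds the full annotation, which is why exact mode left those features in the table.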

Cheers!

Sergio

Hi @salias !
Thanks so much for your quick reply! Unfortunately this does not seem to be working. My results still contain the contaminants I'm trying to filter out. Below is the code I'm running. I've tried making the contaminants variable with commas or semicolons because semicolons are what is used in the qiime2 filtering data tutorial. The code successfully runs; however, the contaminants contained in the "contam" csv are not removed. When I try to look at the "contam" variable using echo $contam or printf $contam only ",s__Clostridium polyendosporumpes group" is returned. Many thanks for your continued help!

contam=$(tr '\n' ',' < qiime/fulldata/contaminants_exclude_20240723.csv | sed 's/,$//')

qiime taxa filter-table \
  --i-table qiime/fulldata/denoising-feature-table.qza \
  --i-taxonomy qiime/fulldata/taxonomy-classification.qza \
  --p-mode contains \
  --p-exclude $contam \
  --o-filtered-table qiime/fulldata/table-no-contam.qza
Saved FeatureTable[Frequency] to: qiime/fulldata/table-no-contam.qza

echo $contam returns
,s__Clostridium polyendosporumpes group

Hi!

It looks like the CSV is structured differently from what I assumed. Would you mind sharing your contaminants file, or at least the first rows, so I can see the structure and play around with it to debug the problem?

Another quick note:

Semicolons are not separating different taxa in the example. They separate the taxonomic levels of a single taxonomic assignment. So in:

--p-exclude "k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rickettsiales;f__mitochondria"

We are excluding only one taxon: the one that is kingdom Bacteria, phylum Proteobacteria, class Alphaproteobacteria, order Rickettsiales and family mitochondria. Spaces between taxonomic levels are probably a typo in the tutorial, so I removed them.
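
A quick way to see how many search terms a query string holds is to split it on the query delimiter only. This is a small sketch with a made-up query that combines the mitochondria example with a second, hypothetical term; the variable names are for illustration.

```shell
# Two search terms: commas separate terms, semicolons stay inside a term.
query='k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rickettsiales;f__mitochondria,g__Ralstonia'

# Put each comma-separated term on its own line and count the lines.
terms=$(printf '%s\n' "$query" | tr ',' '\n')
n_terms=$(printf '%s\n' "$terms" | wc -l | tr -d ' ')

printf 'Number of search terms: %s\n' "$n_terms"
printf '%s\n' "$terms"
```

If the count is lower than the number of contaminants in your file, the separators in the variable are probably not what you intended.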

Hi @salias ! Thanks so much for your reply and continued help! I've attached a CSV that contains a subset of the contaminants.
contaminants_exclude_example.csv (340 Bytes)

Hello!

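The problem is that your file has Windows newlines (\r\n) instead of Linux newlines (\n), so the tr command was not working as expected: it left a carriage return (\r) after every taxon. That is most likely also why echo seemed to show only the end of the variable, since each carriage return sends the terminal cursor back to the start of the line and later entries overwrite earlier ones. I edited my first answer to include both options.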
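
If you want to check a file for Windows newlines yourself, counting the lines that contain a carriage return is one way to do it. A minimal sketch on a throwaway sample file (the file contents and variable names are assumptions for illustration):

```shell
tmpdir=$(mktemp -d)
# Hypothetical sample file saved with Windows (\r\n) line endings.
printf 'g__Ralstonia\r\ng__Bradyrhizobium\r\n' > "$tmpdir/contaminants.csv"

# Store a literal carriage return, then count the lines that contain one.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$tmpdir/contaminants.csv" || true)

if [ "$crlf_lines" -gt 0 ]; then
  printf '%s line(s) contain a carriage return; use the Windows variant of the tr command.\n' "$crlf_lines"
else
  printf 'No carriage returns found; the Linux variant is fine.\n'
fi
rm -r "$tmpdir"
```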

One last thing: I see you have taxa in your file like s__Clostridium polyendosporum, with a space. You will need to replace that space with an underscore: s__Clostridium_polyendosporum.
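
With many entries, the replacement can be done on the variable instead of by hand. This is a sketch under the assumption that every space in your list should become an underscore; it would also change a space that follows a semicolon, so check the result if your entries contain full lineages. The sample value is made up for illustration.

```shell
# Hypothetical variable containing one taxon with a space in its name.
contam='g__Ralstonia,s__Clostridium polyendosporum'

# Replace every space with an underscore.
contam_fixed=$(printf '%s\n' "$contam" | tr ' ' '_')

printf '%s\n' "$contam_fixed"
```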

Cheers!

Sergio

Hi @salias !
Thank you so very much for all of your help! I was successfully able to filter out the contaminants.
All the best, Caroline
