Best way to filter and merge the data in my case: follow-up discussion

This is a follow-up discussion to the topic "Best way to filter and merge the data in my case".

I have now decided to use the QIIME 2 workflow. A little background: I have two sequencing batches, a 1st and a 2nd. The 2nd batch is basically a resequencing of the samples that failed the 1st time.

I am following Nick’s method:

  1. demultiplex each run separately
  2. use qiime demux filter-samples to remove samples with low read counts after demultiplexing
  3. denoise each run separately with dada2
  4. merge the feature tables and sequences
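The four steps above might look roughly like this on the command line. This is only a sketch: all artifact names are placeholders, `keep-ids.tsv` is a hypothetical metadata file listing the samples to keep, and the DADA2 truncation lengths are arbitrary example values you would choose from your own quality plots.

```shell
# Step 1 (demultiplexing, e.g. qiime demux emp-paired) is run per batch and not shown.

# Step 2: remove low-read-count samples from one run's demultiplexed data.
# keep-ids.tsv is a hypothetical file with a "sample-id" header and one ID per line.
qiime demux filter-samples \
  --i-demux run1-demux.qza \
  --m-metadata-file keep-ids.tsv \
  --o-filtered-demux run1-demux-filtered.qza

# Step 3: denoise each run separately (repeat for run 2).
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs run1-demux-filtered.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --o-table run1-table.qza \
  --o-representative-sequences run1-rep-seqs.qza \
  --o-denoising-stats run1-stats.qza

# Step 4: merge the per-run feature tables and representative sequences.
qiime feature-table merge \
  --i-tables run1-table.qza \
  --i-tables run2-table.qza \
  --o-merged-table merged-table.qza
qiime feature-table merge-seqs \
  --i-data run1-rep-seqs.qza \
  --i-data run2-rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza
```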

My way is a little different:

Step 1> I demultiplex each run separately.
Step 2> I run the DADA2 workflow on each run separately and build two feature tables and sequence files (2nd.bacteria.sequences.qza and 1st.sequences.qza).
Step 3> I want to filter the two feature tables and sequence files; this is where I am stuck.

Questions:

A> My order is a little different from Nick's method. Can I do it in this order? If you think this order would cause a lot of trouble later, I would redo everything.
B> For both the 1st and 2nd batches, I want to filter by sample ID, because I already know which samples to remove and which to keep.

I only need to remove 1 or 2 samples from each batch. I followed the instructions here (https://docs.qiime2.org/2019.10/tutorials/filtering/). It seems odd that the mapping file can only list the samples I want to keep. Can I instead filter by the samples I don't want to keep?

C> Do I need to filter the sequence files before I merge the feature tables? If so, do I filter the full sequence files or the representative sequence files?

If I need to do this, which commands should I use? Can I filter them by the sample IDs that I don't want?

Thanks a lot


Hi @moonlight,

Sure. It sounds like the only difference from Nick's approach is that he removes low-count samples before DADA2. That has the added advantage of reducing some computation time, but probably only minimally. If you didn't do this and are planning on doing it afterwards, that is totally fine; it won't do any harm.

Sounds like you have a few specific samples in mind that you want to remove, so yes.

Sure, just set the --p-exclude-ids parameter to True in your filter-samples action. Then only the samples you provide get removed.

  1. Once you have removed those unwanted samples, merge your tables with merge.
  2. Merge your representative sequences with merge-seqs.
  3. Use the merged table as input (as metadata) in filter-seqs to remove any sequences from the merged sequences file that are not represented in the merged table. It's very likely that nothing will get filtered at this step, but it's good to do anyway.
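The three steps above might look something like this. This is a sketch, not a definitive recipe: I'm assuming the per-batch tables have already been sample-filtered, and the output names are placeholders.

```shell
# 1. Merge the two (already sample-filtered) feature tables.
qiime feature-table merge \
  --i-tables 1st.table.filtered.qza \
  --i-tables 2nd.table.filtered.qza \
  --o-merged-table merged-table.qza

# 2. Merge the representative sequences from both runs.
qiime feature-table merge-seqs \
  --i-data 1st.sequences.qza \
  --i-data 2nd.bacteria.sequences.qza \
  --o-merged-data merged-rep-seqs.qza

# 3. Keep only sequences that are still present in the merged table.
qiime feature-table filter-seqs \
  --i-data merged-rep-seqs.qza \
  --i-table merged-table.qza \
  --o-filtered-data merged-rep-seqs-filtered.qza
```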

Hello Mehrbod,

Thank you very much!

I am confused about using this "--p-exclude-ids" and not sure how to use it. I am also confused about the --m-metadata-file format.

You know, I also tried Nick's method of filtering the demux file first; however, using --p-exclude-ids didn't do anything.

See my post here: --p-exclude-ids doesn't work

What should the metadata file header look like for the filter commands? Does it look like the master metadata file that I use at the very beginning? Like this:

sample-id barcode-sequence LinkerPrimerSequence ReversePrimer BarcodeName Description Plant
#q2:types categorical categorical categorical categorical categorical categorical
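For filtering you likely don't need all of those columns. A minimal hypothetical metadata file with just the IDs to exclude could look like this (sample names are made up):

```
sample-id
sample-17
sample-42
```

Passed with --p-exclude-ids, a file like this would remove only those two samples.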

Also, after I finish filtering, do I need to make a new master metadata file?

For example, suppose I have 100 samples and I filter 20 out. Should I update my master metadata file and delete those 20 samples (names, descriptions, etc.)? If I use the old one with 100 samples, would it cause an error in later analyses?


Hi @moonlight,
Your first issue regarding the use of --p-exclude-ids is answered by @colinbrislawn in the other thread, so following his instructions will take care of that.

If you are referring to filtering your samples before DADA2, then this makes sense, because the --p-exclude-ids parameter only works when you provide a list of sample IDs to remove, whereas at that step you are simply trying to remove samples that don't have a minimum number of sequences.

The metadata file you use with filter-samples is the same as your regular metadata file, and after you have filtered samples out of your feature table you don't need to remake it. For most downstream analyses the extra rows will just be ignored. I say "most" because I'm not sure this behavior is consistent across all non-core plugins, so it might be a good idea to remove them, but I don't think you will run into any problems.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.