q2-data-augment: QIIME2 plugin for data augmentation using rarefaction (Rarefy for Augment)

yxia · January 15, 2021, 5:44pm

Data augmentation is a very useful and widely used method in data science (see: Data augmentation - Wikipedia). In particular, it can increase the size of the training set for machine learning models.

Rarefaction can serve as an effective and trustworthy method for data augmentation, for the following reasons:

Biological sample collection and sequencing are essentially random sampling processes that capture microbes from an unknown population. Rarefaction is just another random sampling process: it draws a fixed number of reads from the same population, in the same way that sample collection and sequencing do.
Under the hypothesis that "rarefaction" = "biological sample collection and sequencing", each iteration of rarefaction on a sequenced sample in fact generates a new sequencing sub-sample.
This new sub-sample contains a subset of the reads of the original sample and comes from the same population as the original sample.
Preliminary (unpublished) results suggest that using rarefaction for data augmentation can significantly improve machine learning classification results.
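
To make the "rarefaction is just another random sampling process" argument concrete, here is a minimal sketch of rarefying one sample. It assumes subsampling without replacement and uses a toy dict of feature counts; the function name `rarefy` and the taxon names are made up for illustration, and this is not the plugin's actual code.

```python
import random
from collections import Counter

def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample.

    counts: dict mapping feature ID -> read count (toy format, an assumption).
    Returns a new dict of counts summing to `depth`, or None if the
    sample holds fewer than `depth` reads.
    """
    # Expand counts into one entry per read, e.g. {"A": 2} -> ["A", "A"].
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    picked = rng.sample(reads, depth)  # sampling without replacement
    return dict(Counter(picked))

rng = random.Random(42)
sample = {"taxon_A": 500, "taxon_B": 300, "taxon_C": 200}

# Two iterations give two sub-samples drawn from the same original sample.
sub1 = rarefy(sample, 100, rng)
sub2 = rarefy(sample, 100, rng)
```

Each call returns a different random subset of the original reads, which is what lets repeated rarefaction act as augmentation.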

This method, named Rarefy for Augment, is very simple. Run random rarefaction N times. Each time, rename the rarefied samples and their corresponding metadata entries, then concatenate them with the table and metadata accumulated from the previous iterations. In the end, the number of samples is N times the original.
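
The procedure described above can be sketched roughly as follows. This is only an illustration on toy dicts, not the plugin's implementation: the function names, the 1-based "_i" suffix, and the choice to skip samples shallower than the depth are all assumptions.

```python
import random
from collections import Counter

def rarefy(counts, depth, rng):
    # Subsample `depth` reads without replacement from one sample.
    reads = [f for f, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(reads, depth)))

def augment(table, metadata, depth, times, seed=0):
    """Rarefy every sample `times` times, renaming each copy.

    table: dict sample ID -> {feature ID: count} (toy format).
    metadata: dict sample ID -> dict of metadata columns.
    """
    rng = random.Random(seed)
    aug_table, aug_meta = {}, {}
    for i in range(1, times + 1):
        for sample_id, counts in table.items():
            if sum(counts.values()) < depth:
                continue  # assumption: samples below the depth are dropped
            new_id = f"{sample_id}_{i}"  # assumed 1-based suffix
            aug_table[new_id] = rarefy(counts, depth, rng)
            # The renamed sample keeps the metadata of its original.
            aug_meta[new_id] = dict(metadata[sample_id])
    return aug_table, aug_meta

table = {"S1": {"A": 30, "B": 20}, "S2": {"A": 5, "B": 45}}
metadata = {"S1": {"group": "case"}, "S2": {"group": "control"}}
aug_table, aug_meta = augment(table, metadata, depth=10, times=3)
```

With 2 input samples and `times=3`, the result holds 6 renamed samples, and the augmented metadata has exactly the same IDs as the augmented table.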

Installing

conda activate qiime2-2020.11
pip install git+https://github.com/yxia0125/q2-data-augment.git

Type "qiime data-augment" to test if the installation is successful.

Uninstalling

pip uninstall q2-data-augment

Using

qiime data-augment augment --i-table raw_table.qza \
                           --m-raw-metadata-file raw_metadata.tsv \
                           --p-sampling-depth 2000 \
                           --p-augment-times 10 \
                           --p-output-path-metadata augmented_meta.tsv \
                           --o-augmented-table augmented_table.qza

"raw_table.qza" and "raw_metadata.tsv" are the input raw feature table and metadata. --p-sampling-depth sets the rarefaction depth. Setting --p-augment-times to 10 repeats the rarefaction 10 times (i.e., enlarges the sample size 10 times). "augmented_table.qza" is the augmented feature table: it contains 10 times as many samples as "raw_table.qza", and the new rarefied samples end with "_X" (where X is the index of the rarefaction iteration). "augmented_meta.tsv" is the augmented metadata, whose sample names match those in "augmented_table.qza".
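
Since every augmented sample carries an "_X" suffix, it is possible to trace each one back to its original sample, which is handy when checking which rows derive from the same source. Below is a small sketch of that idea. The helper name `original_id` is made up, and it assumes X is always a plain integer.

```python
def original_id(augmented_id):
    """Strip a trailing "_<integer>" suffix, if there is one."""
    base, sep, suffix = augmented_id.rpartition("_")
    if sep and suffix.isdigit():
        return base
    return augmented_id

# Toy sample IDs; "gut_07_3" shows an original ID that itself contains "_".
ids = ["S1_1", "S1_2", "gut_07_3", "S2"]

# Group augmented IDs by the original sample they were rarefied from.
groups = {}
for aug_id in ids:
    groups.setdefault(original_id(aug_id), []).append(aug_id)
```

Note that this cannot tell an augmented ID apart from an un-augmented ID that happens to end in "_<integer>", so it should only be applied to IDs known to come from the augmented table.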

Note: Augment only the training set.

Citing

If you use this method, please include the following citation:

Yao Xia, q2-data-augment: QIIME2 plugin for data augmentation using rarefaction (Rarefy for Augment), (2021), GitHub repository, https://github.com/yxia0125/q2-repeat-rarefy.

Mechah · April 9, 2021, 7:44am

First of all thanks a lot for developing this useful q2-plugin!
I augmented my feature table 10 times at a depth of 1000 and used it for metadata predictions with the q2-sample-classifier plugin. I was pretty surprised by the result: while I only got prediction accuracies of around 50-60% for the original data set, data augmentation with q2-data-augment improved this to prediction accuracies of up to 100%. However, I'm still a bit sceptical about this improvement... Since I do not have any prior knowledge of data augmentation, I would be interested in any comments or links to web resources or discussions on how to judge such an improvement for supervised learning methods. Looking forward to the discussion!

yxia · April 29, 2021, 12:24pm

Hi Mechah,

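Thanks for your feedback. Data augmentation should only be applied to the training data rather than the whole feature table. As far as I know, the q2-sample-classifier plugin automatically does 5-fold cross-validation on the whole feature table. If the whole table is augmented first, rarefied copies of the same original sample can end up in both the training and test folds, which inflates the accuracy. You should split the feature table manually and augment only the training table.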
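
A rough sketch of the "split first, then augment" order, working only on sample IDs. The function name `split_ids`, the 20% test fraction, and the 1-based "_i" suffix are assumptions for illustration. In QIIME 2 itself, the two ID lists could then be used to filter the table, for example with `qiime feature-table filter-samples`, before running the augmentation on the training table only.

```python
import random

def split_ids(sample_ids, test_fraction, seed=0):
    """Randomly split ORIGINAL sample IDs into (train, test) lists."""
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

original_ids = [f"S{i}" for i in range(1, 11)]
train_ids, test_ids = split_ids(original_ids, test_fraction=0.2)

# Only the training samples get augmented copies; the test samples are
# left untouched, so no rarefied copy of a test sample is seen in training.
times = 3
augmented_train_ids = [
    f"{sid}_{i}" for sid in train_ids for i in range(1, times + 1)
]
```

Because the split happens on the original IDs, all copies of a given sample stay on one side of the split.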

I will check this problem soon and try to make it compatible with q2-sample-classifier.