Hello,
I'm new to metagenomic pipelines. I currently have 500 fecal samples from Crohn's disease patients and 467 fecal samples from healthy individuals. All of the sample files are in FASTA format, as follows:
GVJ07HB01AWJS1_cs_nbp_rc cs_nbp=28-337 sample=C_4022_01_S1 rbarcode=GACTCTGA primer=V1-V2 subject=4022 body_site=stool center=UPENNBL barcode_mismatch=0 primer_mismatch=0
ACTAGGCGTTAACACATAGCAAGCGAGGGGACGAGCATCATCAAAGCTTGCTTTGATGGATGGCGACCGGCGGCACGGTGAGTAACACGTATCCAACCTGCCGACAACACTGGGATAGCCTTTCGAAAGAAAGATTAATACCGGATGGCATAATTTTCCCGCATGGGATAATTATTAAAGAATTTCGGTTGTCGATGGGGGATGCGTTCCATTAGGCAGTTGGCGGGGTAACGGCCCACCAAGACAACGATGGATAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACTGAGACACGGTCCAAACTCC
GVJ07HB01C9SIH_cs_nbp_rc cs_nbp=28-381 sample=C_4022_01_S1 rbarcode=GACTCTGA primer=V1-V2 subject=4022 body_site=stool center=UPENNBL barcode_mismatch=0 primer_mismatch=0
TGGTAAGAAGTTTGTAGTCCTGGCGTCAGGATGAACGCTGCGGCGTGCCTAACACATGCAAGTCGAGCGTAAGCGGTTTTAGGAAGTTTTCGGATGGATTAAACTGACTGAGCGGCGGACGGGTGAGTAACGCGTGGGTACCTGCCTCATACAGGGGGATAACAGTTAGAAATGGCTGCTAATACCGCATAAGCACACAGCTTCGCATGGAGCAGTGTGAAAAACTCCGGTGGTATGAGATGGACCCGCGTCTGATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCGACGATCAGTAGCCGACCTGAGAGGGTGACCGGCCACATTGGGACTGAGACACGGCCCAAACTCC
After reading the forum and the QIIME 2 website, I came to the conclusion that I can combine the 967 FASTA files into a single FASTA file using the QIIME 1 script add_qiime_labels.py, run in a virtual machine.
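For reference, the relabel-and-concatenate step I have in mind can be sketched in plain Python. This is not the add_qiime_labels.py script itself, only my understanding of the idea. The function names (read_fasta, relabel, combine), the use of the file name as the sample ID, and the `<sample>_<n>` label layout are all assumptions for illustration, and the sketch assumes each header line starts with ">" as in standard FASTA.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def relabel(sample_id: str, records: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Label each record '<sample_id>_<n>' and keep the old header as a description."""
    for n, (header, seq) in enumerate(records):
        yield f">{sample_id}_{n} {header}\n{seq}\n"


def combine(fasta_dir: str, out_path: str) -> int:
    """Merge every *.fasta file in fasta_dir into out_path; return the record count.

    Assumption: the file name without its extension is the sample ID.
    """
    out_file = Path(out_path).resolve()
    # Collect inputs first and never read the output file back in.
    inputs = [p for p in sorted(Path(fasta_dir).glob("*.fasta"))
              if p.resolve() != out_file]
    total = 0
    with open(out_file, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for record in relabel(path.stem, read_fasta(handle)):
                    out.write(record)
                    total += 1
    return total
```

Calling combine("samples", "combined_seqs.fna") would then write one merged file, where "samples" is a hypothetical directory holding the per-sample FASTA files.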
Reading further on the forum, I understand that I should then analyze the combined file by following the steps in the "Clustering sequences into OTUs using q2-vsearch" tutorial (QIIME 2 2021.2.0 documentation).
This is where I have questions. Please also let me know whether the logic behind the pipeline above is correct.
The de novo step in the OTU clustering tutorial essentially clusters sequences by similarity. However, this tutorial from UCLA (https://qcb.ucla.edu/wp-content/uploads/sites/14/2017/12/QCB_W11-Metagenomics-Analysis_BS_day2.pdf ) suggests I need to know a few things, such as which region the data is based on, before choosing between the different OTU clustering methods. I don't know how to find the region for my data (the data is from HMP).
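One place the region might be recorded is the FASTA header itself: the records above each carry a primer=V1-V2 field, which presumably names the amplified region, though I have not confirmed that. Below is a minimal sketch for tallying that field across headers so I can check whether all files agree. It assumes every header follows the same "read ID followed by key=value fields" layout as my examples; parse_header and count_primers are names I made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_header(header: str) -> Dict[str, str]:
    """Split a header into its read ID and its key=value fields."""
    tokens = header.lstrip(">").split()
    if not tokens:
        return {}
    parsed = {"read_id": tokens[0]}
    for field in tokens[1:]:
        key, sep, value = field.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            parsed[key] = value
    return parsed


def count_primers(headers: Iterable[str]) -> Counter:
    """Count how often each primer value appears; 'unknown' if the field is absent."""
    return Counter(parse_header(h).get("primer", "unknown") for h in headers)


header = ("GVJ07HB01AWJS1_cs_nbp_rc cs_nbp=28-337 sample=C_4022_01_S1 "
          "rbarcode=GACTCTGA primer=V1-V2 subject=4022 body_site=stool "
          "center=UPENNBL barcode_mismatch=0 primer_mismatch=0")
print(parse_header(header)["primer"])  # prints V1-V2
```

If count_primers returned more than one value over the whole dataset, that would suggest the samples were not all amplified with the same primers.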
Is there a filtering step that I'm missing?
The end product of the OTU step, via any of the three methods, would be a rep-seqs file and a feature table artifact, which would then feed into the "Moving Pictures" tutorial at the FeatureTable and FeatureData summaries step. Would this pipeline be suitable for a reliable analysis?
Thanks!