I have a multiplexed .fastq file containing full-length 16S rRNA sequencing data generated by the MinION Mk1B platform. The structure of the file is shown below:
I’m unsure whether QIIME2 can handle this kind of multiplexed data where the barcodes are not in standard base format (A/C/T/G). Given this setup, is it possible to process and analyze the data using QIIME2?
Indeed, your data is already demultiplexed, as indicated by the presence of "barcode=barcodeXX" in each read header. So in this case, things are pretty straightforward. You just need to run the following command to split the data into one fastq file per sample:
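A minimal sketch of one way to do such a split with awk, assuming strict four-line FASTQ records and a header tag of the form barcode=barcodeXX. The function name and the paths are illustrative and are not part of PRONAME or QIIME 2:

```shell
# Split a multiplexed FASTQ into one file per barcode, using the
# "barcode=barcodeXX" tag found in each read header.
# Assumption: every record spans exactly 4 lines (no wrapped sequences).
split_by_barcode() {
    input_fastq="$1"
    output_dir="$2"
    mkdir -p "$output_dir"
    awk -v outdir="$output_dir" '
        NR % 4 == 1 {
            # Header line: pull out the value after "barcode="
            if (match($0, /barcode=[A-Za-z0-9_]+/)) {
                bc = substr($0, RSTART + 8, RLENGTH - 8)
            } else {
                bc = "unclassified"
            }
            out = outdir "/" bc ".fastq"
        }
        # Write all four lines of the record to the file chosen above
        { print > out }
    ' "$input_fastq"
}

# Usage (hypothetical paths): split_by_barcode your_file.fastq per_sample
```

Reads whose header carries no barcode tag end up in unclassified.fastq, so they can be inspected rather than silently dropped.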
You won’t need this since your data is already demultiplexed, but for reference, here is the command you would use to demultiplex ONT data using Dorado:
# Adjust --kit-name according to the barcoding kit used
dorado demux \
--kit-name SQK-NBD114-24 \
--output-dir demux_output \
your_file.fastq \
--emit-fastq
Once you have one fastq file per sample, you can proceed with the sequence analysis using PRONAME, which enables steps like data curation, error correction, taxonomic assignment, and more.
I tried running PRONAME steps 0–3 on my Apple M1 laptop (Docker 4.42.0, I’ve unchecked Rosetta for x86_64/amd64 emulation), but ran into a couple of issues:
Here is my code:
docker run -it --name 16S_MCI_27e -v /Users/liuchenyu/Desktop/16S_MCI:/16S_MCI benn888/proname:v2.0.1-arm64
cd 16S_MCI
proname_import --inputpath MCI_27e_RAW --duplex no --trimadapters no --sequencingkit SQK-RBK114.96 --trimprimers no
proname_filter --datatype simplex --filtminlen 200 --filtmaxlen 5000 --filtminqual 9 --inputpath MCI_27e_RAW
proname_refine --clusterid 0.97 --inputpath MCI_27e_RAW --medakamodel r1041_e82_400bps_sup_v5.0.0 --chimeradb /opt/db/Silva138_full16S/silva-138-99-seqs.fasta --qiime2import yes
I successfully ran steps 0–2 on the HPC and obtained the corresponding figure outputs. However, for step 3 (refine), the resulting OTU table contained only zeros, and it could not be imported into QIIME 2 automatically. Please see the attached screenshots for reference:
To investigate further, I attempted to classify taxonomy manually using the .qza file generated from step 3. The classification appeared successful in Qiime2 and I was able to retrieve bacterial names.
I'm not entirely sure which step might have gone wrong, so I’ve attached all the relevant code here for reference.
export SIF=/.../proname_v2.0.1-amd64.sif
export HOST_DIR=/.../16SPerSample/MIC_27e_target
cd $HOST_DIR
singularity exec \
--bind "${HOST_DIR}:${HOST_DIR}" \
"${SIF}" \
proname_import \
--inputpath Persample \
--duplex no \
--trimadapters no \
--sequencingkit SQK-RBK114.96 \
--trimprimers yes \
--fwdprimer AGAGTTTGATCMTGGCTCAG \
--revprimer TACGGYTACCTTGTTAYGACTT
Thank you for sharing all the details and screenshots.
From your outputs, the main issue seems to be that the feature table generated by proname_refine contains only zeros, and the QIIME2 import fails due to duplicate Feature IDs.
To help you troubleshoot, could you please run the following diagnostic commands inside your output directory and send me the results?
# Check for duplicate Feature IDs in the table
awk 'NR>1 {print $1}' rep_table.tsv | sort | uniq -d
# Check if any sample has non-zero counts
awk '{for(i=2;i<=NF;i++) if($i>0) print $i}' rep_table.tsv | wc -l
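One caveat when reading the second count: awk compares non-numeric fields as strings, so header cells such as sample names also satisfy $i>0 and are included in the total. A sketch of a variant that skips the header and forces a numeric comparison, assuming a single header line and feature IDs in the first column (the function name is hypothetical):

```shell
# Count strictly positive cells in a feature table, ignoring the header row.
# Assumptions: exactly one header line, feature IDs in column 1,
# whitespace-separated columns as in the commands above.
count_nonzero_cells() {
    awk 'NR > 1 {
             for (i = 2; i <= NF; i++)
                 if ($i + 0 > 0) n++    # "+ 0" forces a numeric comparison
         }
         END { print n + 0 }' "$1"
}

# Usage: count_nonzero_cells rep_table.tsv
```

A result of 0 here would confirm that every count in the table is zero.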
It could also be useful to run additional diagnostic commands on files that are normally deleted at the end of proname_refine execution. To keep them, you could run the command sed -i 's/^\([[:space:]]*\)rm\([[:space:]-]\)/\1# rm\2/' /opt/scripts/proname_refine before re-running your proname_refine command. This comments out the rm lines in the script, so intermediate files are no longer deleted, which makes troubleshooting easier.
Also note that PRONAME was developed to be run using Docker. Using Singularity instead may cause some issues, especially if some files are not written due to permission issues or because some directories are not correctly mounted or writable inside the Singularity container.
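To preview what that substitution would change before editing the script in place, the same expression can be run without -i so that the result is only printed. A small sketch; the function name is hypothetical:

```shell
# Print a script with its leading "rm" commands commented out,
# without modifying the file itself (no -i flag).
preview_rm_commenting() {
    sed 's/^\([[:space:]]*\)rm\([[:space:]-]\)/\1# rm\2/' "$1"
}

# Usage: preview_rm_commenting /opt/scripts/proname_refine | grep '# rm'
```

Only lines that start with rm (after optional indentation) are affected; commands such as rmdir, or rm appearing later in a line, are left unchanged.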
One last point regarding your read filtering, in case it helps: your current filters are quite permissive. Since the full-length 16S rRNA gene is about 1.5 kb, a minimum length of 200 bp seems very low. In terms of quality, a Q score of 9 is also quite low and could reduce the efficiency of downstream error correction. We generally set this threshold in the range of Q15 to Q20.
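Before tightening the length filter, it can help to check how many reads a candidate window would keep. A rough sketch, assuming four-line FASTQ records; the function name and the thresholds in the usage line are illustrative, not recommended values:

```shell
# Report how many reads in a FASTQ file fall inside a length window.
# Assumption: every record spans exactly 4 lines, so line 2 of each
# record is the full sequence.
reads_in_length_window() {
    awk -v min="$2" -v max="$3" '
        NR % 4 == 2 {
            total++
            len = length($0)
            if (len >= min && len <= max) kept++
        }
        END { printf "%d of %d reads kept\n", kept, total }
    ' "$1"
}

# Usage (hypothetical file and thresholds):
# reads_in_length_window barcode01.fastq 1200 1800
```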
I ran the first command, but it returned nothing. The second command returned 14, which matches the number of barcodes I have.
I tried using Docker again, but as I mentioned two days ago, I encountered numerous errors. Even after installing cutadapt, matplotlib, pyabpoa, and search, I still encountered the following error at Step 3: qemu-x86_64: Could not open '/lib64/ld-linux-x86-64.so.2': No such file or directory.
I set the minimum length to 200 bp because most of the reads do not reach 1500 bp. Here is the plot from step 1 for your reference: