Issues with FPKM Calculation in RNA-seq Analysis (Apodemus agrarius)

SingeunOh · February 27, 2025, 7:05am

Dear all,

I am currently working on RNA-seq analysis for Apodemus agrarius using a genome FASTA file and a reference GTF annotation (created in 2024; the related article has not yet been published).

To obtain FPKM values, I used Cufflinks with a BAM file, running the following command:

cufflinks -o cufflinks_output \
  -g ref.gtf \
  -b genome.fa \
  -u \
  ref.bam

However, I am encountering an issue with the results. Of the 75,000 features, almost 60,000 are assigned Cufflinks-generated gene IDs (e.g., CUFF.1, CUFF.2), while only 15,000 match Entrez IDs. In addition, the raw read counts contain 27,000 Entrez IDs, but only 15,000 of them are retained after assignment.

I am wondering what might be causing this discrepancy and how I can address it. Has anyone encountered a similar issue, or can you provide any suggestions for resolving this?

Any help or insights would be greatly appreciated!

Best regards,

P.S. I found the genes.fpkm_tracking and isoforms.fpkm_tracking files in the Cufflinks output directory.

Is it right to use isoforms.fpkm_tracking? (It contains more matched Entrez IDs than genes.fpkm_tracking.)
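
One way to compare the two files is to tally how many rows carry a Cufflinks-generated ID versus a reference ID. The following is only a minimal sketch: it assumes the tracking file is tab-delimited with a header row that includes a `gene_id` column, and that Cufflinks-generated IDs start with `CUFF.`. The function name `tally_gene_ids` and the demo rows are made up for illustration and are not real data.

```python
import csv
import io
from collections import Counter


def tally_gene_ids(handle, column="gene_id", novel_prefix="CUFF."):
    """Count rows whose ID is Cufflinks-generated versus taken from the reference.

    `handle` is any open text file (or file-like object) holding a
    tab-delimited table with a header row. `column` and `novel_prefix`
    are assumptions about the file layout, not verified against the data.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        kind = "novel" if row[column].startswith(novel_prefix) else "reference"
        counts[kind] += 1
    return counts


# Toy table standing in for genes.fpkm_tracking (illustrative values only).
demo = io.StringIO(
    "tracking_id\tgene_id\tFPKM\n"
    "CUFF.1\tCUFF.1\t3.2\n"
    "gene101\tgene101\t10.5\n"
    "CUFF.2\tCUFF.2\t0.0\n"
)
print(tally_gene_ids(demo))
```

For a real file, the same function could be called as `tally_gene_ids(open("cufflinks_output/genes.fpkm_tracking"))` and again on isoforms.fpkm_tracking, to see where the reference IDs are lost.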

llenzi · February 27, 2025, 9:03am

Hi @SingeunOh
I would say this is a question for the forum of the database from which you downloaded your reference. They may be able to explain the discrepancy.

On my side, the only thing I can suggest is to try a different method to count the total for each gene.
We use:
https://htseq.readthedocs.io/en/latest/

With HTSeq, you can specify whether to count only reads aligned to exons (taking the gene's orientation on the genome into account) or any reads mapped within the known gene region.
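
If FPKM values are still needed after counting, they can be derived from per-gene counts with the standard definition: fragments per kilobase of transcript per million mapped fragments. This is only a rough sketch with toy numbers. The function name `fpkm` is hypothetical, and which gene length to use (for example the summed exon length) and which library total to use are assumptions that need to be chosen for the actual data. Cufflinks applies its own model, so its values will not match this simple formula exactly.

```python
def fpkm(fragments, length_bp, total_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragments       -- fragments counted for one gene
    length_bp       -- gene (or transcript) length in base pairs
    total_fragments -- total mapped fragments in the library
    """
    if length_bp <= 0 or total_fragments <= 0:
        raise ValueError("length_bp and total_fragments must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (length_bp * total_fragments)


# Toy example, not real data: 500 fragments on a 2,000 bp gene
# in a library of 10 million mapped fragments.
print(fpkm(500, 2000, 10_000_000))  # 25.0
```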
On your Cufflinks question: honestly, it has been so long since I last used it that I don't remember, but there may be a Cufflinks forum where you can ask too.

Cheers
Luca