The lengths of a gene, its transcripts, and its CDS can differ considerably.
First, we need to understand these three terms:
gene: The broadest of the three concepts; a gene contains both its transcripts and its CDS.
transcript: A gene can have multiple transcript forms (a.k.a. isoforms), which can vary in length.
CDS: The coding region, made up of exon sequence but excluding the 5' and 3' UTRs that are also part of the exons. A complete CDS starts with an AUG codon and ends with a stop codon (UAG, UAA, or UGA).
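To make the difference concrete, here is a small sketch in base R that computes all three lengths for a toy gene model. The coordinates are made up for illustration, and the model assumes a single transcript on the forward strand, with the stop codon counted as part of the CDS.

```r
# Toy gene model on the forward strand (all coordinates are made up).
exons <- data.frame(start = c(1000, 2000, 4000),
                    end   = c(1500, 2600, 5000))
cds_start <- 1200  # first base of the start codon
cds_end   <- 4399  # last base of the stop codon

# Gene length: the genomic span, introns included.
gene_length <- max(exons$end) - min(exons$start) + 1

# Transcript length: exons only (UTRs included, introns excluded).
transcript_length <- sum(exons$end - exons$start + 1)

# CDS length: the part of each exon that lies inside the coding interval.
coding_start <- pmax(exons$start, cds_start)
coding_end   <- pmin(exons$end, cds_end)
cds_length   <- sum(pmax(coding_end - coding_start + 1, 0))

c(gene = gene_length, transcript = transcript_length, cds = cds_length)
```

For this toy model the gene is 4001 bp, the transcript 2103 bp, and the CDS 1302 bp (434 codons), so the three numbers can differ by a wide margin for the same locus.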
The job I need to do is to add a gene function annotation (a brief description) and the CDS length to a list of genes in
Arabidopsis thaliana. I found the
biomaRt package in Bioconductor quite helpful.
Below is my script to retrieve this information from BioMart within the R environment.
library(biomaRt)
mart <- biomaRt::useMart(biomart = "plants_mart", dataset …
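Because a gene can have several isoforms, a per-transcript table will usually contain more than one CDS length per gene. The sketch below shows one possible way to reduce such a table to one row per gene and attach it to a gene list, using base R only. The data frame `annot`, its column names, and the gene IDs are hypothetical stand-ins rather than real query results, and keeping the longest CDS per gene is just one reasonable choice, not necessarily what the original script does.

```r
# Genes of interest (placeholder IDs in the TAIR style, not real results).
genes <- c("AT1G00001", "AT1G00002", "AT1G00003")

# Hypothetical per-transcript table, standing in for a query result.
annot <- data.frame(
  gene_id     = c("AT1G00001", "AT1G00001", "AT1G00002"),
  description = c("example protein A", "example protein A", "example protein B"),
  cds_length  = c(900, 1200, 600),
  stringsAsFactors = FALSE
)

# Sort so the longest CDS comes first within each gene,
# then keep only the first row per gene.
annot   <- annot[order(annot$gene_id, -annot$cds_length), ]
longest <- annot[!duplicated(annot$gene_id), ]

# Left join: genes without any annotation are kept, with NA values.
result <- merge(data.frame(gene_id = genes, stringsAsFactors = FALSE),
                longest, by = "gene_id", all.x = TRUE)
result
```

Using `all.x = TRUE` keeps every gene from the input list in the output, which makes it easy to spot IDs that returned no annotation.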