Manual:Gene nomenclature
From VectorBase Development
Gene nomenclature for VectorBase organisms
Species identification - 4 characters are assigned: provisionally 3 for the species and 1 for strain information where appropriate. Because the field length is fixed at four, all four characters must be used, so for any project that has no appropriate strain/isolate we will assign the fourth letter arbitrarily.
Project examples:
- AAEL - Aedes aegypti Liverpool
- AGAP - Anopheles gambiae PEST
- CPIJ - Culex pipiens quinquefasciatus JHB
- ISCW - Ixodes scapularis Wikel
Ordinal assignment - 6 digits are assigned. Although we probably will not need the sixth digit, it is a valid safeguard. Five digits should give us enough namespace for 4-5 copies of the full dataset, but semi-automated gene builds have an inherently higher churn rate (Anopheles is currently in the mid-30,000s and used several thousand new identifiers for the latest AgamP3 assembly). Further, it would be good to have a stock of identifiers that the greater VectorBase can assign to community annotation/manual confirmation of genes prior to a regular genebuild.
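As a minimal sketch of how the two fields combine, the Python function below joins a 4-character species/strain prefix with a 6-digit ordinal. The function name make_gene_id and the zero-padding of small ordinals are assumptions made for illustration; the text above fixes only the field lengths.

```python
import re

# Four uppercase letters: 3 for the species plus 1 for the strain (or arbitrary).
PREFIX_RE = re.compile(r"[A-Z]{4}")


def make_gene_id(prefix: str, ordinal: int) -> str:
    """Build a gene/locus identifier from a prefix and an ordinal.

    Hypothetical helper: zero-padding to 6 digits is an assumption.
    """
    if not PREFIX_RE.fullmatch(prefix):
        raise ValueError(f"prefix must be 4 uppercase letters: {prefix!r}")
    if not 0 <= ordinal <= 999_999:
        raise ValueError(f"ordinal must fit in 6 digits: {ordinal}")
    # Pad the ordinal so the identifier always has a fixed width of 10.
    return f"{prefix}{ordinal:06d}"


print(make_gene_id("AAEL", 100007))
```

Keeping the width fixed means identifiers can be validated with a single pattern and sort consistently as plain strings.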
Example
The following is a worked example for the Aedes aegypti Liverpool strain:
Gene/Locus name: AAEL100007
Isoform #1
- Transcript: AAEL100007-RA
- Translation: AAEL100007-PA
Isoform #2
- Transcript: AAEL100007-RB
- Translation: AAEL100007-PB
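The suffix pattern in this example (-R plus a letter for transcripts, -P plus the same letter for translations) can be sketched in Python as follows. The helper name isoform_ids is hypothetical, and the assumption that isoform letters run A to Z in isoform order (so at most 26 isoforms) is inferred from the two isoforms shown, not stated by the manual.

```python
import re
import string

# A gene/locus identifier: 4 uppercase letters followed by 6 digits.
GENE_RE = re.compile(r"[A-Z]{4}\d{6}")


def isoform_ids(gene_id: str, isoform_number: int):
    """Return (transcript_id, translation_id) for a 1-based isoform number.

    Hypothetical helper: assumes isoform 1 maps to 'A', isoform 2 to 'B', etc.
    """
    if not GENE_RE.fullmatch(gene_id):
        raise ValueError(f"not a valid gene identifier: {gene_id!r}")
    if not 1 <= isoform_number <= 26:
        raise ValueError("this sketch only supports isoforms 1 to 26")
    letter = string.ascii_uppercase[isoform_number - 1]
    # -R marks the transcript, -P marks the translation.
    return f"{gene_id}-R{letter}", f"{gene_id}-P{letter}"


for n in (1, 2):
    transcript, translation = isoform_ids("AAEL100007", n)
    print(n, transcript, translation)
```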
These identifiers will be used as the systematic name for a gene/locus, submitted to GenBank/EMBL as the /locus_tag qualifier, and used as the canonical name in the EnsEMBL/CHADO database (i.e. the identifier used to navigate into the browser from user queries, BLAST results, and links from INSD pages and the BRC).
