Manual:Gene nomenclature

From VectorBase Development



Gene nomenclature for VectorBase organisms

Species identification - 4 characters are assigned: provisionally 3 for the species and 1 for strain information where appropriate. Because the field length is fixed at four, all four characters must be used, so for any project without an appropriate strain/isolate the fourth letter is assigned arbitrarily.

Project examples:

  • AAEL - Aedes aegypti Liverpool
  • AGAP - Anopheles gambiae PEST
  • CPIJ - Culex pipiens quinquefasciatus JHB
  • ISCW - Ixodes scapularis Wikel

Ordinal assignment - 6 digits are assigned. We probably will not need the sixth digit, but it is a valid safeguard. Five digits should give us enough namespace for 4-5 copies of the full dataset, but semi-automated gene builds have an inherently higher churn rate (Anopheles is currently in the mid-30,000s and used several thousand new identifiers for the latest AgamP3 assembly). Further, it would be good to keep a stock of identifiers for use by the wider VectorBase when assigning identifiers to community annotations or manually confirmed genes prior to a regular genebuild.
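As a rough illustration of the two rules above, the following Python sketch composes a gene identifier from a four-character project code and a six-digit, zero-padded ordinal, and splits one back into its parts. The function names are hypothetical, and the restriction to uppercase ASCII letters is an assumption inferred from the project examples, not something this manual states.

```python
import re
from typing import Tuple

# Assumption: project codes are exactly four uppercase ASCII letters,
# as in the examples AAEL, AGAP, CPIJ and ISCW.
_PROJECT = re.compile(r"[A-Z]{4}")
# Four-letter project code followed by exactly six digits.
_GENE_ID = re.compile(r"([A-Z]{4})(\d{6})")


def make_gene_id(project: str, ordinal: int) -> str:
    """Build an identifier such as 'AAEL000007' (hypothetical helper)."""
    if _PROJECT.fullmatch(project) is None:
        raise ValueError(f"project code must be 4 uppercase letters: {project!r}")
    if not 0 <= ordinal <= 999_999:
        raise ValueError(f"ordinal does not fit in 6 digits: {ordinal}")
    # Zero-pad the ordinal so every identifier has the same length.
    return f"{project}{ordinal:06d}"


def parse_gene_id(gene_id: str) -> Tuple[str, int]:
    """Split an identifier into (project code, ordinal) (hypothetical helper)."""
    match = _GENE_ID.fullmatch(gene_id)
    if match is None:
        raise ValueError(f"not a valid gene identifier: {gene_id!r}")
    return match.group(1), int(match.group(2))
```

For instance, `make_gene_id("AGAP", 123)` yields `AGAP000123`, and `parse_gene_id("AAEL100007")` yields `("AAEL", 100007)`.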

Example

The following is a worked example for the Aedes aegypti Liverpool strain:

 Gene/Locus name:       AAEL100007
	
 Isoform #1        Transcript:  AAEL100007-RA       Translation: AAEL100007-PA
 Isoform #2        Transcript:  AAEL100007-RB       Translation: AAEL100007-PB
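The pattern in the worked example (isoform 1 maps to the suffix letter A, isoform 2 to B) can be sketched in Python as below. The helper name `isoform_ids` is hypothetical, and because the manual does not say what happens beyond 26 isoforms, this sketch simply rejects that case rather than guessing.

```python
import string
from typing import Tuple


def isoform_ids(gene_id: str, isoform: int) -> Tuple[str, str]:
    """Return (transcript ID, translation ID) for a 1-based isoform number.

    Hypothetical helper; only isoforms 1..26 (letters A..Z) are handled,
    since the naming beyond Z is not specified in the manual.
    """
    if not 1 <= isoform <= 26:
        raise ValueError(f"isoform number out of supported range: {isoform}")
    letter = string.ascii_uppercase[isoform - 1]
    # '-R' marks the transcript and '-P' the translation, as in the example.
    return f"{gene_id}-R{letter}", f"{gene_id}-P{letter}"
```

With the gene from the example, `isoform_ids("AAEL100007", 2)` gives `("AAEL100007-RB", "AAEL100007-PB")`.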

These identifiers will be used as the systematic name for a gene/locus, submitted to GenBank/EMBL as the /locus_tag qualifier, and used as the canonical name in the EnsEMBL/CHADO database (i.e. the identifier used to navigate into the browser from user queries, BLAST results, and links from INSD pages and the BRC).
