LRG Frequently Asked Questions (FAQs)
The most frequently asked FAQs
Variant Reporting Standards
Obtaining existing LRGs and requesting new ones
Software support for LRGs
Specifications and standards
Locus Reference Genomic (LRG) represents both a collaboration and the set of sequences resulting from that collaboration. The collaboration is among sequence providers (the European Bioinformatics Institute, EBI, and the National Center for Biotechnology Information, NCBI), GEN2PHEN, locus-specific databases (LSDBs), and research/diagnostic laboratories. The resultant sequences are stable human genomic sequences, annotated with a set of exons, transcripts and proteins, to be used as a reference standard for reporting disease-causing variants in human genes. LRGs are based on principles already established for NCBI's RefSeq and RefSeqGene, but are independent of any one stakeholder.
Why do LRGs have two layers?
LRGs have two layers to separate data that are fixed from data that will necessarily change with time. The fixed layer contains the human genomic sequences, the exon annotations, the transcripts and the proteins. The updatable layer contains information that maps the sequences in the fixed layer onto the current genome assembly. It also holds other types of information, such as commonly used alternative exon and amino acid numbering schemes, where these exist.
Which transcripts will be represented in LRGs?
It is well recognised that the majority of human genes are alternatively spliced for demonstrably functional reasons. There is also considerable evidence for the existence of additional transcripts whose provenance and function are much less certain. Our policy is to only include transcripts for which there is currently good biological understanding AND that are required for the unambiguous reporting of disease-causing variants.
Is there a published account of LRGs that I can read?
The LRG standard, and why it is needed, is described in the publication "Locus Reference Genomic sequences: an improved basis for describing human DNA variants", Dalgleish R et al., Genome Med. :24, available via open access from the Genome Medicine website. See also the editorial "Conventional wisdom" in the May 2010 issue of Nature Genetics, where the use and benefits of LRGs are discussed.
Where can I get more information about LRGs?
Why do we need LRGs when RefSeqGene records already exist?
LRG and RefSeqGene are collaborative resources. Until the sequence and annotation of an LRG are established by all stakeholders, the RefSeqGene can be used. When an LRG is established for any gene, the RefSeqGene and its annotation will be 'frozen' to match that of the LRG. The LRG sequence identifiers [LRG, transcript (t) and protein (p) numbers] are included in the RefSeqGene, so the correspondence is unambiguous.
LRG accessions don't have versions and RefSeqGene records do. Why?
The RefSeq project, following the convention of the International Nucleotide Sequence Database Collaboration (http://www.insdc.org/), assigns sequence identifiers as a combination of a stable component (the accession) and a version. Any revision of the sequence results in the incrementing of the version number. The version number is indicated after the decimal point at the end of the accession number (e.g. NM_000088.3). Unfortunately, the version of a sequence is often not reported when a variant description is presented in a publication. Thus uncertainty can result when trying to interpret the consequence of any variant if the current version of the reference sequence is greater than 1. This problem is avoided in the LRG accessioning system by not having versions. Once an LRG is created, the sequence data are never changed.
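As an illustration of the accession-plus-version convention, the following minimal Python sketch splits an identifier into its stable accession and its version. The function name `split_accession` is hypothetical and is not part of any LRG or RefSeq tool; an unversioned identifier such as an LRG simply yields no version.

```python
def split_accession(identifier):
    """Split an identifier such as 'NM_000088.3' into (accession, version).

    Returns (identifier, None) when no numeric version suffix is present,
    as is the case for LRG identifiers.
    """
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


if __name__ == "__main__":
    for example in ("NM_000088.3", "LRG_1"):
        print(example, "->", split_accession(example))
```

A publication that cites only "NM_000088" gives the first element but not the second, which is exactly the ambiguity that unversioned LRG identifiers are designed to avoid.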
Does this mean that additional transcripts cannot be added?
It's inevitable that new transcripts of biological importance will be discovered for genes for which an LRG already exists. Such transcripts can be added to the updatable section of an LRG. New transcripts can only be added to the fixed section if they are essential for reporting clinically relevant variants. A compelling case can be made for their inclusion if the new transcripts encode different proteins of clear clinical importance and variants cannot be meaningfully described in terms of the current transcripts in the LRG. Requests for the addition of new transcripts will be considered on a case-by-case basis.
Does this not re-create the versioning problem?
Versioning is an issue with traditional reference sequence records simply because the actual sequences differ from version to version for records with the same accession number. The LRG sequence data for the genomic DNA, the transcripts and their translation products will never be changed. Consequently, a variant description such as LRG_13:g.8290C>A will always remain valid and will never be subject to misinterpretation. The user simply needs to ensure that the LRG contains all of the necessary transcripts for the intended task.
How will sequence corrections be made to LRGs?
No changes to the sequences in an LRG will be permitted. If it's no longer possible to describe a sequence variant in terms of an existing LRG, it might be necessary to create a totally new LRG with a uniquely different number (e.g. LRG_1275 instead of the existing LRG_89). The original LRG will not be "retired" and it will remain valid to describe variants with respect to that sequence record. Creation of additional LRGs for an existing gene or genomic region will only be considered in the most exceptional circumstances.
What about copy number variation?
Copy number variation (CNV) will certainly be an issue, but LRGs are no less well suited to the task of variant description than existing reference sequence records. Requests will be considered for the creation of an LRG representing a particular allele with respect to CNV, and we will work with the requesting party to achieve the best practicable solution to represent the allele.
Who is the final arbiter of LRG content?
LRGs are created for the benefit of the biomedical community and so must meet its needs. We welcome discussion about whether or not individual LRGs fulfil specific needs and we will work with the community to ensure that these needs are met. In the end, we will take the authoritative advice of the community.
Will LRGs replace RefSeq and RefSeqGene records?
How can I use an LRG record?
Once an LRG record has been created, you can, for example:
- View your LRG and all known variants in browsers supporting LRGs:
Once an LRG has been released, it is integrated into the next Ensembl release, normally within two months. This means that the wealth of data available in Ensembl can be viewed in the context of that LRG sequence. Ensembl projects all variants from the database (that is, all the information in dbSNP and much more) onto the LRG sequence, which you can view in the browser. For example: LRG_1 (COL1A1)
Reports of your submissions are displayed at NCBI in two formats. One is the tabular display via Variation Viewer, which is accessed by adding the HGNC official symbol at the end of the URL, e.g. http://www.ncbi.nlm.nih.gov/sites/varvu?gene=COL1A1. This display currently provides the HGVS expression for each variation in terms of the RefSeqGene; representation based on the LRG will be added soon. The other display is based on the annotated RefSeqGene/LRG sequence. It is most readily accessed from the new RefSeqGene interface: enter the LRG number or the gene symbol in the filter box, click on submit, and then click on 'graphic' in the appropriate Views column.
- See the effects of variants on the LRG:
You can use the Variant Effect Predictor to upload known and novel variant positions in a simple tab-delimited format to find out the effect of each variant and whether it is known or novel.
- Submit your variant data into a public archive:
Ensembl and the NCBI are happy to accept variant submissions into the public variation archives (dbSNP for SNPs and indels, or dbVar/DGVa for structural variants) using LRG coordinates. Please see this page for more information.
- Use external software:
Mutalyzer 2.0 supports LRGs for HGVS nomenclature validation (http://www.mutalyzer.nl/2.0/)
Variant Reporting Standards
Will I have to learn a new variant nomenclature?
No, the standard HGVS Nomenclature will still be used. Since the LRG is a sequence accession, the numbering of the altered DNA bases remains the same when writing the variant description in terms of an LRG reference sequence or its annotated transcripts and proteins. These transcripts are numbered sequentially t1, t2, t3 etc. and their corresponding translation proteins are numbered in the same fashion: p1, p2, p3 etc.
Can you give me an example?
The COL1A1 gene is represented by LRG record LRG_1 which has a single transcript (t1) and a single corresponding protein (p1). The frequently reported disease-causing variant NG_007400.1:g.9595G>A can also be reported as NM_000088.3:c.769G>A, and as NP_000079.2:p.Gly257Arg using the current RefSeqGene and RefSeq mRNA and protein reference sequences. Since LRGs contain the genomic DNA, mRNA and protein sequences within a single record, the three corresponding descriptions are LRG_1:g.9595G>A, LRG_1t1:c.769G>A and LRG_1p1:p.Gly257Arg.
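| | RefSeqGene or RefSeq | LRG |
| Genomic DNA | NG_007400.1:g.9595G>A | LRG_1:g.9595G>A |
| mRNA | NM_000088.3:c.769G>A | LRG_1t1:c.769G>A |
| Protein | NP_000079.2:p.Gly257Arg | LRG_1p1:p.Gly257Arg |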
What about an example for a gene with more than one transcript?
The calcitonin gene (CALCA) encodes two peptide hormones, calcitonin and calcitonin gene-related peptide (CGRP), that have no amino acid sequence in common. These hormones are derived by enzymatic cleavage of the translation products of two alternatively spliced mRNAs that exclusively contain exon 4 (calcitonin) or exons 5 and 6 (CGRP). Consequently, a SNP in the first base of exon 4 (rs5241) affects only the mRNA that encodes calcitonin.
Using HGVS nomenclature, the variant can be described as NM_001033952.2:c.228C>A using the calcitonin RefSeq mRNA as the reference sequence. The corresponding protein-level description is NP_001029124.1:p.Ser76Arg. Alternatively, it can be described with respect to the RefSeqGene genomic DNA sequence as NG_015960.1:g.8290C>A.
The LRG for the CALCA gene (LRG_13) contains information for both of the major alternatively spliced forms of the gene's transcripts. Calcitonin and CGRP are represented by transcripts t2 and t1 respectively. Consequently, the SNP can be described at the DNA level as LRG_13:g.8290C>A or LRG_13t2:c.228C>A. The corresponding protein-level description is LRG_13p2:p.Ser76Arg.
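| | RefSeqGene or RefSeq | LRG |
| Genomic DNA | NG_015960.1:g.8290C>A | LRG_13:g.8290C>A |
| mRNA | NM_001033952.2:c.228C>A | LRG_13t2:c.228C>A |
| Protein | NP_001029124.1:p.Ser76Arg | LRG_13p2:p.Ser76Arg |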
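For readers who handle such descriptions in software, here is a minimal Python sketch that splits an LRG-based description into its parts, following the pattern used in the examples above (LRG identifier, optional t or p suffix, then g., c. or p. and the change). The regular expression and the function name `parse_lrg_description` are our own illustrative assumptions, not an official parser; the sketch does not validate the change itself against HGVS rules.

```python
import re

# Assumed pattern, inferred from the examples in this FAQ:
# LRG_<number>, optional t<n> or p<n> suffix, a colon, then g./c./p. and the change.
LRG_DESCRIPTION = re.compile(r"^(LRG_\d+)([tp]\d+)?:\s*([gcp])\.(\S+)$")


def parse_lrg_description(text):
    """Return the components of a description such as 'LRG_13t2:c.228C>A'."""
    match = LRG_DESCRIPTION.match(text.strip())
    if match is None:
        raise ValueError(f"not an LRG-based description: {text!r}")
    lrg_id, feature, level, change = match.groups()
    # 'feature' is None for genomic descriptions that name no transcript or protein.
    return {"lrg": lrg_id, "feature": feature, "level": level, "change": change}


if __name__ == "__main__":
    for example in ("LRG_13:g.8290C>A", "LRG_13t2:c.228C>A", "LRG_13p2:p.Ser76Arg"):
        print(parse_lrg_description(example))
```

Keeping the LRG identifier separate from the t or p suffix makes it straightforward to check that the named transcript or protein is actually present in the LRG record being used.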
Obtaining existing LRGs and requesting new ones
How do I find out if an LRG already exists for my gene of interest?
If none yet exists, how do I request the creation of an LRG?
How do I use the search function on the LRG website?
You can search by, for example, LRG identifier, HGNC gene name, LSDB name, NCBI and Ensembl accession numbers, gene synonym or LRG status. Wildcards and logical expressions are accepted. Example searches: LRG_1, COL*, Osteogenesis, (NM_000088.3 OR NM_000089.3), collagen, pending. A batch search can be carried out by entering a list of LRG identifiers separated by a pipe symbol, e.g. LRG_1|LRG_3|LRG_45.
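If you keep your LRG identifiers in a list, a batch search string of the kind described above can be assembled with a few lines of Python. This is only a sketch: the helper name `batch_query` is hypothetical, and the identifier check assumes the LRG_<number> form seen in the examples.

```python
import re

# Assumed identifier form, based on the examples (LRG_1, LRG_3, LRG_45).
LRG_ID = re.compile(r"LRG_\d+")


def batch_query(identifiers):
    """Join LRG identifiers with a pipe symbol for pasting into the search box."""
    cleaned = [identifier.strip() for identifier in identifiers]
    invalid = [identifier for identifier in cleaned if not LRG_ID.fullmatch(identifier)]
    if invalid:
        raise ValueError(f"unexpected identifiers: {invalid}")
    return "|".join(cleaned)


if __name__ == "__main__":
    print(batch_query(["LRG_1", "LRG_3", "LRG_45"]))
```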
What do I need to view an LRG?
Can I download and view LRGs locally?
Yes, you do not need to view them from the FTP site. To ensure that the LRGs display correctly in your browser, you must also download the files lrg2html.xsl and place these in the same directory as the downloaded LRG file. Without these extra files, your browser will display the XML code of the LRG file rather than the nicely formatted version that you see when viewing LRGs from the FTP site.
Can I view the sequences in an LRG in any other format?
Within the browser view of an LRG it's possible to display the individual sequences (genomic DNA, transcripts and their translated protein sequences) in FASTA format. This allows copying and pasting of sequences into other applications that support that format. From NCBI, try the 'graphics' display (http://www.ncbi.nlm.nih.gov/nuccore/LRG_1?report=graph).
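For reference, FASTA is a simple text layout: a header line beginning with ">" followed by the sequence wrapped over several lines. The Python sketch below writes a sequence in that layout; the function name `to_fasta`, the 60-character line width and the placeholder sequence are illustrative choices of ours, not something prescribed by the LRG browser.

```python
def to_fasta(name, sequence, width=60):
    """Format a sequence as FASTA text: a '>' header, then fixed-width lines."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sequence for demonstration only; not taken from any LRG record.
    print(to_fasta("LRG_demo_t1", "ACGT" * 20), end="")
```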
Are there any other ways to view LRGs?
The NGRL Universal Browser has preliminary support for browsing of LRGs with co-integration of data from Ensembl, dbSNP and the LSDB (if available) for the gene in question. The LRG sequences will be available in Ensembl later in 2010.
Software support for LRGs
Do you offer programmatic access to LRG data?
Yes, some of the information from LRG records is available through different web services, implemented using the XML-RPC protocol. See this page for more information.
What free software currently supports LRGs?
There is no such support at present. However, future versions of the LOVD genetic variant database systems will support LRGs. In addition, the Mutalyzer sequence variant nomenclature checker will also be updated. In principle, the current version of LOVD can already be used with LRGs. However, it will not be possible to check variant descriptions using the connection module to Mutalyzer until that package is updated.
Are there any third-party applications that support the LRG format?
LRGs are a recent innovation and we are not aware of any third-party applications that yet support the format. We believe that Interactive Biosoftware intend to provide support for LRGs in their flagship Alamut mutation-interpretation application.
Can I write my own application?
Anybody can write an application to handle and manipulate sequence data in the LRGs. The LRG format is open and the record schema is freely available (see below). Technical support for the schema is available at email@example.com. We would encourage you to make your software free and open and to let us know about it so that we can provide links to your application.
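As a starting point, the sketch below reads a tiny, made-up record with Python's standard `xml.etree.ElementTree` module. The element names used here (`lrg`, `fixed_annotation`, `updatable_annotation`, `id`, `sequence`, `transcript`) are assumptions for illustration only, as are the sample record and the function name `summarise`; please check the LRG schema for the authoritative structure before relying on them.

```python
import xml.etree.ElementTree as ET

# Made-up miniature record; element names are assumed, not copied from the schema.
SAMPLE = """<lrg>
  <fixed_annotation>
    <id>LRG_demo</id>
    <sequence>ATGGCCTAA</sequence>
    <transcript name="t1"/>
    <transcript name="t2"/>
  </fixed_annotation>
  <updatable_annotation/>
</lrg>"""


def summarise(xml_text):
    """Return the identifier, genomic sequence length and transcript names."""
    root = ET.fromstring(xml_text)
    fixed = root.find("fixed_annotation")
    if fixed is None:
        raise ValueError("no fixed_annotation element found")
    return {
        "id": fixed.findtext("id"),
        "length": len(fixed.findtext("sequence", default="")),
        "transcripts": [t.get("name") for t in fixed.findall("transcript")],
    }


if __name__ == "__main__":
    print(summarise(SAMPLE))
```

Reading only from the fixed layer, as here, reflects the fact that it is the part of a record that does not change over time.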
Are there pre-written support modules for programming languages?
We are not aware of any at present. However, we hope that there will eventually be modules to support LRGs in the commonly used scripting languages such as Perl, Python, PHP, Ruby etc.
Specifications and standards
Where can I get a copy of the LRG specification?
Are LRGs formatted as GenBank records or as EMBL records?
LRGs are created in extensible markup language (XML: http://www.w3.org/XML/) as this provides many technical advantages.
If LRGs are written in XML, are they not difficult to read?
Style sheets have been created to transform the LRG XML files into a human-readable format.
Is the LRG XML schema available?
Can I create my own LRG sequence records?
For an LRG to be an international standard, it must be accessioned by the collaborating groups. If you would like to request an LRG, please contact firstname.lastname@example.org
Can I request that new features be added to the LRG schema?
We are keen to carefully control the LRG XML schema and to only allow the addition of new features if a good practical case can be made. Requests for changes to the schema should be made to email@example.com
Who has responsibility for creating LRG sequences?
What is the role of GEN2PHEN in LRGs?
What will happen when the GEN2PHEN project funding ends?
Although funding for GEN2PHEN ends in December 2012, it's only the initial development of the LRG concept that is supported by GEN2PHEN. EBI and NCBI are fully committed to maintaining the LRG sequence record format beyond 2012.