Frequently Asked Questions

What are LRGs?

What is an LRG?

A Locus Reference Genomic (LRG) record is a manually curated record containing stable, and therefore un-versioned, reference sequences designed specifically for reporting sequence variants with clinical implications.

Why do we need LRGs?

Accurate and unambiguous reporting of variants requires internationally recognized reference sequences that do not change over time. The use of multiple sequences for a given locus as well as confusion over versions has resulted in inconsistent variant reporting in the past. The LRG project was created to avoid these problems.

What is contained in an LRG record?

Each LRG contains a stable “fixed” section and a regularly updated “updatable” section. The fixed section contains the stable genomic DNA sequence for the region of interest, the transcripts and proteins deemed essential for reporting variants, and an LRG-specific exon numbering system. The updatable section contains mapping information, annotation of all transcripts and overlapping genes in the region, and other relevant information submitted by the community (e.g. legacy exon numbering).

How are the transcripts selected?

During the LRG creation process, LRG curators review the transcript submitted by the requester, as well as all other transcripts in the region of interest, for potential inclusion in the record. As part of this review, curators perform alignments, review the supporting evidence for each transcript and analyze expression data. Only transcripts that are currently well understood biologically and that are required for the unambiguous reporting of disease-causing variants will be included.

What is the difference between RefSeqGene and LRG records?

LRG and RefSeqGene are collaborative resources. The advantages of using LRGs for variant reporting are:

  • LRGs are created specifically for reporting clinically relevant variants and therefore cover loci with clinical implications.
  • LRGs are stable and therefore not versioned, which reduces ambiguity when reporting variants.
    When an LRG is established for any gene, the RefSeqGene and its annotation will be “frozen” to match that of the LRG.

Why are there mismatches between LRG records and the reference genome assembly?

The sequences in the LRG record do not necessarily perfectly match the reference genome assemblies (GRCh37/38). LRG sequences are based on RefSeqGene sequences, which, if possible, represent prevalent “standard” alleles at each locus. Therefore, there are a number of cases where the RefSeqGene sequences may differ from the reference genome assembly.

  • If the current reference assembly is not well supported, an alternate sequence is selected in consultation with gene-specific experts, as available. When feasible, RefSeqGene sequences will be derived from a single clone, based on the assumption that no sequence errors were introduced in cloning, and that a single insert represents an example of a naturally occurring haplotype.

  • The default implementation of ‘standard allele’ is the sequence from the public reference genome assembly. If, however, there is published evidence, evidence from locus-specific databases, or evidence from clinical testers, that the sequence in the reference genome assembly is not standard, the RefSeqGene sequence can be constructed from an alternate source sequence, or locally modified.

During the curation of an LRG record, our curators review all mismatches between the RefSeqGene and the reference assemblies (GRCh37 and GRCh38).

Read more about the RefSeqGene project, and its relationship with the LRG.

Who is the final arbiter of LRG content?

LRGs are created for the benefit of the biomedical community and so must meet its needs. We welcome discussion about whether or not individual LRGs fulfil specific needs and we will work with the community to ensure that these needs are met. In the end, we will take the authoritative advice of the community.

How can I use an LRG record?

Once an LRG record has been created, you can, for example:

  • View your LRG and all known variants in genome browsers supporting LRGs:

    • Ensembl
      Once an LRG has been released, it is integrated into the next Ensembl release, normally within two months. Public LRGs can be viewed in a set of dedicated pages (for example, LRG_1) whereas pending LRGs can be viewed by following the Ensembl link in the updatable section of each LRG (for example, LRG_750).

    • NCBI
      The records are available from NCBI’s Nucleotide and Gene databases and can be found in the RefSeqGene browser page and FTP site (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/). More information on ways to find LRGs at NCBI is available at https://www.ncbi.nlm.nih.gov/refseq/rsg/lrg/.

    • UCSC
      The records are available in the UCSC Genome Browser by entering the LRG identifier or gene symbol in the search box or by turning on the LRG track under Genes and Gene Predictions.

  • Check your variant data using your LRG:

    • Mutalyzer: the “Name Checker”, “Syntax Checker”, and “Name Generator” tools check that variants described using LRG sequences follow HGVS guidelines.
    • VariantValidator
  • Submit your variant data into a public archive:

  • Interpret your variant data using your LRG:

How is the LRG-specific exon numbering determined?

The LRG-specific exon numbering system included in each LRG is based on the transcript(s) included in the fixed section. Each exon is numbered consecutively 5′ to 3′; the numbering is then applied to individual transcripts.
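
One way to read this rule is sketched below in Python. The sketch assumes that exons are given as (start, end) coordinates on the LRG genomic sequence, that identical coordinates denote the same exon, and that sorting by start coordinate gives the 5′ to 3′ order. The function name, the input layout and the coordinates are illustrative only; actual records may handle exons with differing boundaries in another way.

```python
def number_exons(transcripts):
    """Assign consecutive exon numbers across all fixed-section transcripts.

    transcripts: dict mapping a transcript name (e.g. "t1") to a list of
    (start, end) exon coordinates on the LRG genomic sequence.
    Returns a dict mapping each transcript name to its list of exon numbers.
    """
    # Collect every distinct exon, ordered 5' to 3' along the LRG sequence.
    all_exons = sorted({exon for exons in transcripts.values() for exon in exons})
    numbering = {exon: i for i, exon in enumerate(all_exons, start=1)}
    # Apply the shared numbering to each individual transcript.
    return {
        name: [numbering[exon] for exon in sorted(exons)]
        for name, exons in transcripts.items()
    }


# Hypothetical coordinates for a gene with two alternatively spliced transcripts.
example = {
    "t1": [(100, 200), (300, 400), (700, 800)],
    "t2": [(100, 200), (300, 400), (500, 600)],
}
numbers = number_exons(example)
print(numbers)  # {'t1': [1, 2, 4], 't2': [1, 2, 3]}
```

Because the numbering is shared, an exon keeps the same number in every transcript that contains it, and a transcript may skip numbers for exons it does not use.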

How many LRG records have been created?

Over LRGs have been created, of which are public. The aim is to create an LRG for every locus with clinical implications. To request an LRG please look at the LRG request page.

Where can I get more information about LRGs?

The LRG web site http://www.lrg-sequence.org/ and the partner RefSeqGene site http://www.ncbi.nlm.nih.gov/RefSeq/RSG/ maintain current information about the LRG project and available sequences.

Is there a published account of LRGs that I can read?

The LRG standard and why it is needed is described in the following publication:

“Locus Reference Genomic: reference sequences for the reporting of clinically relevant sequence variants”
MacArthur JA et al., Nucleic Acids Res. 2014 Jan (doi: 10.1093/nar/gkt1198).

See also “Locus Reference Genomic sequences: an improved basis for describing human DNA variants” (Dalgleish R et al., Genome Med. 2010, 2:24), and the editorial “Conventional wisdom” in Nature Genetics 2010, 42, p.363.

Stability issues

How will sequence corrections be made to LRGs?

No changes to the sequences in an LRG will be permitted.
If it’s no longer possible to describe a sequence variant in terms of an existing LRG, it might be necessary to create a totally new LRG with a uniquely different number (e.g. LRG_1275 instead of the existing LRG_89). The original LRG will not be “retired” and it will remain valid to describe variants with respect to that sequence record. Creation of additional LRGs for an existing gene or genomic region will only be considered in the most exceptional circumstances.

LRG records don't have versions and RefSeqGene records do. Why?

The RefSeq project, following the convention of the International Nucleotide Sequence Database Collaboration (http://www.insdc.org/), assigns sequence identifiers as a combination of a stable component (the accession) and a version. Any revision of the sequence increments the version number. The version number is given after the decimal point at the end of the accession number (e.g. NM_000088.3).
Unfortunately, the version of a sequence is often not reported when a variant description is presented in a publication. Thus, uncertainty can result when trying to interpret the consequence of any variant if the current version of the reference sequence is greater than 1. This problem is avoided in the LRG accessioning system by not having versions. Once an LRG is created, the sequence data are never changed.
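
The accession and version convention can be illustrated with a short Python sketch. The function name is hypothetical; the sketch only separates the two components of an identifier and shows that an LRG identifier has no version component to lose.

```python
def split_accession(identifier):
    """Split 'NM_000088.3' into ('NM_000088', 3).

    The version is None when the identifier carries no '.N' suffix,
    as is always the case for LRG identifiers.
    """
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


print(split_accession("NM_000088.3"))  # ('NM_000088', 3)
print(split_accession("NM_000088"))    # ('NM_000088', None)
print(split_accession("LRG_1"))        # ('LRG_1', None)
```

A RefSeq identifier reported without its version (the second call) cannot be tied to one specific sequence, whereas the LRG identifier is complete as written.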

Can additional transcripts be added to an LRG?

It’s inevitable that new transcripts of biological importance will be discovered for genes for which an LRG already exists. Such transcripts will be included in the updatable section of the record. However, new transcripts can only be added to the fixed section if they are essential for reporting clinically relevant variants. A compelling case can be made for their inclusion if the new transcripts encode different proteins of clear clinical importance and variants cannot be meaningfully described in terms of the current transcripts in the LRG. Consideration of requests for the addition of new transcripts will be on a case-by-case basis.

Does the addition of new transcripts not re-create the versioning problem?

Versioning is an issue with traditional reference sequence records simply because the actual sequences differ from version to version for records with the same accession number.
The LRG sequence data for the genomic DNA, the transcripts and their translation products will never be changed.
Consequently, a variant description such as LRG_13:g.8290C>A will always remain valid and will never be subject to misinterpretation. The user simply needs to ensure that the LRG contains all of the necessary transcripts for the intended task.

What will happen when a new genome build is released?

Once a new assembly is released, the mapping information and annotation of all LRGs will be updated to the new assembly. Mapping of the LRG genomic sequence to both the current and penultimate assembly will be included in each LRG.

Variant Reporting Standards

Will I have to learn a new variant nomenclature?

No, the standard HGVS Nomenclature will still be used. HGVS and EMQN best practice guidelines have endorsed LRGs.
The stable identifiers of the genomic, transcript, and protein sequences in the fixed section of an LRG (transcripts: “t1”, “t2”, etc.; proteins: “p1”, “p2”, etc.) can be used for stable reporting of variants. See “Can I see an example of variant nomenclature?” for more details.

Can I see an example of variant nomenclature?

The COL1A1 gene is represented by LRG record LRG_1 which has a single transcript (t1) and a single corresponding protein (p1).
The frequently reported disease-causing variant NG_007400.1:g.9595G>A can also be reported as NM_000088.3:c.769G>A, and as NP_000079.2:p.Gly257Arg using the current RefSeqGene and RefSeq mRNA and protein reference sequences. Since LRGs contain the genomic DNA, mRNA and protein sequences within a single record, the three corresponding descriptions are LRG_1:g.9595G>A, LRG_1t1:c.769G>A and LRG_1p1:p.Gly257Arg.

Description level   RefSeqGene or RefSeq       LRG
Gene                NG_007400.1:g.9595G>A      LRG_1:g.9595G>A
mRNA                NM_000088.3:c.769G>A       LRG_1t1:c.769G>A
Protein             NP_000079.2:p.Gly257Arg    LRG_1p1:p.Gly257Arg
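
The structure of these LRG-based descriptions can be made explicit with a small Python sketch. The regular expression and function name are illustrative and not part of any LRG or HGVS tool; the pattern covers only the g., c. and p. descriptions used in this FAQ, not the full HGVS nomenclature.

```python
import re

# LRG identifier, optional transcript (t) or protein (p) suffix, then the
# HGVS coordinate type (g, c or p) and the description of the change.
LRG_HGVS = re.compile(
    r"^(?P<lrg>LRG_\d+)(?:(?P<kind>[tp])(?P<number>\d+))?"
    r":(?P<level>[gcp])\.(?P<change>.+)$"
)


def parse_lrg_description(description):
    """Split an LRG-based variant description into its components."""
    match = LRG_HGVS.match(description)
    if match is None:
        raise ValueError(f"not an LRG-based description: {description!r}")
    parts = match.groupdict()
    if parts["number"] is not None:
        parts["number"] = int(parts["number"])
    return parts


for text in ("LRG_1:g.9595G>A", "LRG_1t1:c.769G>A", "LRG_1p1:p.Gly257Arg"):
    print(parse_lrg_description(text))
```

The genomic description has no suffix after the LRG identifier, while transcript and protein descriptions carry the "t" or "p" suffix and its number.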

Can I see an example of a variant description for a gene with more than one transcript?

The calcitonin gene (CALCA) encodes two peptide hormones, calcitonin and calcitonin gene-related peptide (CGRP), that have no amino acid sequence in common. These hormones are derived by enzymatic cleavage of the translation products of two alternatively spliced mRNAs that exclusively contain exon 4 (calcitonin) or exons 5 and 6 (CGRP). Consequently, a SNP in the first base of exon 4 (rs5241) affects only the mRNA that encodes calcitonin.

Using HGVS nomenclature, the variant can be described as NM_001033952.2:c.228C>A using the calcitonin RefSeq mRNA as the reference sequence. The corresponding protein-level description is NP_001029124.1:p.Ser76Arg.
Alternatively, it can be described with respect to the RefSeqGene genomic DNA sequence as NG_015960.1:g.8290C>A.

The LRG for the CALCA gene (LRG_13) contains both major alternatively spliced transcripts of the gene. Calcitonin and CGRP are represented by transcripts t2 and t1, respectively. Consequently, the SNP can be described at the genomic DNA level as LRG_13:g.8290C>A or at the transcript level as LRG_13t2:c.228C>A.
The corresponding protein-level description is LRG_13p2:p.Ser76Arg.

Description level   RefSeqGene or RefSeq         LRG
Gene                NG_015960.1:g.8290C>A        LRG_13:g.8290C>A
mRNA                NM_001033952.2:c.228C>A      LRG_13t2:c.228C>A
Protein             NP_001029124.1:p.Ser76Arg    LRG_13p2:p.Ser76Arg

I have been using reference sequences not included in the LRG record. Is there a tool that can help map all my variants to LRG coordinates?

Yes, several tools can be used to convert variant coordinates from other reference sequences into LRG coordinates.

How do I report intronic variants in LRGs?

Because an LRG links genomic, transcript and protein sequences, the intronic sequences are present too.
Intronic variants can therefore be reported using HGVS nomenclature; for example, rs750106647 is LRG_1t1:c.4005+11T>C and rs778417218 is LRG_1t1:c.4005+5G>A.
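
How such a position is composed can be shown with a brief Python sketch. It assumes simple substitutions at positive coding positions only (no UTR positions, deletions or insertions), and the pattern and function name are illustrative rather than taken from any existing tool.

```python
import re

# Coding-DNA position: the nearest exonic base, then an optional signed
# offset into the intron ("+" counts forward from the last base of an
# exon, "-" counts back from the first base of the next exon).
POSITION = re.compile(
    r"^c\.(?P<anchor>\d+)(?P<offset>[+-]\d+)?(?P<change>[ACGT]>[ACGT])$"
)


def parse_position(description):
    """Return (reference, exonic anchor, intronic offset, change)."""
    reference, _, variant = description.partition(":")
    match = POSITION.match(variant)
    if match is None:
        raise ValueError(f"unsupported description: {description!r}")
    offset = int(match["offset"]) if match["offset"] else 0
    return reference, int(match["anchor"]), offset, match["change"]


print(parse_position("LRG_1t1:c.4005+11T>C"))  # ('LRG_1t1', 4005, 11, 'T>C')
print(parse_position("LRG_1t1:c.769G>A"))      # ('LRG_1t1', 769, 0, 'G>A')
```

An offset of zero marks an exonic variant; a non-zero offset places the variant in the intron next to the anchoring exonic base.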

How to deal with copy number variation?

Copy number variation (CNV) will certainly be an issue, but LRGs are no less well suited to describing such variants than existing reference sequence records. Requests to create an LRG representing a particular allele with respect to CNV will be considered, and we will work with the requesting party to find the best practicable way to represent the allele.

Obtaining existing LRGs and requesting new ones

How do I find out if an LRG already exists for my gene of interest?

To find out whether an LRG already exists, you can use the search function on this page or search the list of all LRGs. You can also search for LRGs in the Ensembl and NCBI browsers.

How do I request the creation of an LRG if none exists yet for my gene of interest?

You can request the creation of an LRG for your gene of interest by contacting us at contact@lrg-sequence.org. We suggest that beforehand you read the Request an LRG page.

Viewing LRGs

How do I use the search function on the LRG website?

You can search by, for example, LRG identifier, HGNC gene name, NCBI or Ensembl accession number (gene or transcript), gene synonym or LRG status.
Autocomplete suggestions appear as you start typing. You can search for NCBI and Ensembl accession numbers without the version number, e.g. NM_000088 instead of NM_000088.3.
To run a batch search, enter a list of identifiers separated by semicolons, e.g. LRG_1;LRG_3;NM_007294;LRG_45.
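
A batch search string can be assembled from a list of identifiers with a few lines of Python. This is a minimal sketch: the function name is hypothetical, the input list is made up for illustration, and dropping the version suffix is optional because the search accepts accession numbers with or without it. It assumes every entry is an LRG identifier or an accession number.

```python
def batch_query(identifiers):
    """Join identifiers with semicolons, dropping any '.N' version suffix."""
    cleaned = []
    for identifier in identifiers:
        accession = identifier.strip().split(".", 1)[0]
        # Skip empty entries and duplicates while keeping the input order.
        if accession and accession not in cleaned:
            cleaned.append(accession)
    return ";".join(cleaned)


print(batch_query(["LRG_1", "LRG_3", "NM_007294.3", "LRG_45"]))
# LRG_1;LRG_3;NM_007294;LRG_45
```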

How can I view an LRG record?

All that you need is a web browser such as Firefox, Chrome, Safari, etc. to view the LRGs that are available at the LRG FTP site.
LRGs can also be viewed in the Ensembl, NCBI and UCSC genome browsers.

Can I view the LRG sequences in any other format?

Within the browser view of an LRG it is possible to display the individual sequences (genomic DNA, transcripts and their translated protein sequences) in FASTA format. This allows copying and pasting of sequences into other applications that support that format.
From NCBI, try the “graphics” display http://www.ncbi.nlm.nih.gov/nuccore/LRG_1?report=graph.

Can I download and view LRGs locally?

Yes, each LRG can be downloaded from its page on the LRG website or by following the instructions described on the Download page. If you want to display the downloaded LRG(s) locally on your web browser with the same layout as the LRG website, you need to download the files:

and place these in the same directory as the downloaded LRG file(s).
Without these extra files, your web browser will display the LRG data as raw XML rather than the formatted version that you see when viewing LRGs from the FTP site.

Software support for LRGs

Do you offer programmatic access to LRG data?

Yes, some of the information from LRG records is available through web services provided by the EMBL-EBI EB-eye RESTful service.
See the LRG Web services page for more detailed information.

Are LRGs supported by external software?

Here is the list of external software supporting LRGs:

In addition, the LOVD DNA variation database system supports LRGs.

Can I write my own application using LRG?

Anybody can write an application to handle and manipulate sequence data in the LRGs: the LRG format is open and the record schema is freely available.
We would encourage you to make your software free and open and to let us know about it so that we can provide links to your application.
Technical support for the schema is available at contact@lrg-sequence.org.

Specifications and standards

How are LRGs formatted?

LRGs are created in extensible markup language (XML) format. Each XML file is highly structured and contains all the information pertaining to a single LRG.
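
Because each record is a single structured XML file, it can be read with any standard XML parser. The Python sketch below uses a heavily simplified, made-up record: the element and attribute names (fixed_annotation, updatable_annotation, id, sequence, transcript) are assumptions chosen to mirror the fixed and updatable sections described in this FAQ and should be checked against the published schema before use.

```python
import xml.etree.ElementTree as ET

# Heavily simplified, hypothetical record: element and attribute names are
# assumptions for illustration and should be checked against the LRG schema.
SAMPLE = """\
<lrg schema_version="1.9">
  <fixed_annotation>
    <id>LRG_1</id>
    <sequence>ACGTACGTAC</sequence>
    <transcript name="t1"/>
  </fixed_annotation>
  <updatable_annotation/>
</lrg>
"""

root = ET.fromstring(SAMPLE)
fixed = root.find("fixed_annotation")

lrg_id = fixed.findtext("id")
sequence = fixed.findtext("sequence")
transcripts = [t.get("name") for t in fixed.findall("transcript")]

print(lrg_id, len(sequence), transcripts)
```

With a downloaded record, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE).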

Is the LRG XML schema available?

The LRG XML schema was created in RELAX NG schema description language and can be downloaded from http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG.rnc. The schema has a date stamp and any changes would be accompanied by a change in the date.
The current version of the schema is Schema 1.9 and its documentation can be found on the LRG FTP site.

Where can I get a copy of the LRG specification?

The current version of the technical specification document is available at http://ftp.ebi.ac.uk/pub/databases/lrgex/docs/LRG.pdf.
There is a version number and date stamp in the bottom margin to help in tracking changes to the specification.

Can I create my own LRG sequence records?

For an LRG to be an international standard, it must be accessioned by the collaborating groups. If you would like to request an LRG, please contact contact@lrg-sequence.org.
We would encourage you not to create your own LRG records.

Administrative Issues

Who has responsibility for creating LRG sequences?

The creation of LRGs is the joint responsibility of the European Bioinformatics Institute (EMBL-EBI) and the National Center for Biotechnology Information (NCBI).

What is the role of GEN2PHEN in LRG?

The LRG concept was developed as a project within the remit of the GEN2PHEN project and was funded for 5 years under the European Community’s Seventh Framework Programme (FP7).

What will happen now that the GEN2PHEN project funding has ended?

Although funding for GEN2PHEN ended in June 2013, EMBL-EBI and NCBI are fully committed to maintaining the LRG project.