Stakeholder review

Document review and submission decision

This page shows what the system understood from the submitted DOCX file, what it could normalize, and whether the document currently passes review, needs warning-level follow-up, or should be rejected and corrected in Word first.

Document metadata

EukaryoticAnnotationPipeline

ID: 138
Parsed title: Eukaryotic Genome Annotation Pipeline
Template profile: NCBI Bookshelf Template
Strict template mode: Yes
Source file: EukaryoticAnnotationPipeline.docx
Stored path: docx/138/master.docx
Status: parsed
Notes: Parsed 43 headings, 177 blocks, 17 references, 25 citations, 0 citation issues, 1 preflight issues.
Uploaded: 2026-06-23 13:31:37

Extracted front matter Metadata candidates

document_information Document Information

Document Information

Eukaryotic Genome Annotation Pipeline | 2013-11-14 | | | | chapter | pdf

author_information Author Information

Author Information

affiliations Affiliations

Affiliations

Label | Organization

1 | NCBI

Parsed summary

Block counts

Headings: 43
Paragraphs: 149
Tables: 2
References: 17
Citations: 25
Matched citations: 25
Open issues: 0
Blocking pre-flight issues: 0
Warning pre-flight issues: 1

Validation status

Pre-flight decision

Warning

Blocking issues: 0 | warnings: 1 | total: 1

Pre-flight issues Acceptance and rejection guidance

warning missing_keywords

No keywords metadata group was detected.

If keywords are required for this content type, add them using the expected Word style so they can be promoted into canonical metadata.

Structured blocks Heading, paragraph, and table order

Use this section to confirm heading fidelity, spot spacing artifacts, inspect first-class table content, and understand how nearby paragraphs and tables were sequenced during parsing. Matched in-text citations now jump to their reference entries, while unresolved citations route to open citation issues.

paragraph Order 1 Style Comment word/document.xml:/w:document[1]/w:body[1]/w:p[1]

Use this form to tell us important information about this document, then start the text on the following page. All information you give in this form will appear in the document.

paragraph Order 7 Style Comment word/document.xml:/w:document[1]/w:body[1]/w:p[5]

Use one row for each author. List authors in order of appearance in the document. Add rows to add more authors.

paragraph Order 8 Style Comment word/document.xml:/w:document[1]/w:body[1]/w:p[6]

NCBI Authors: For Affiliation, use NCBI.

paragraph Order 10 Style Comment word/document.xml:/w:document[1]/w:body[1]/w:p[7]

Use one row for each affiliation. Link affiliations by label to the respective author. Add rows to add more affiliations.

heading Order 14 Level 1 Style Heading1 word/document.xml:/w:document[1]/w:body[1]/w:p[11]

H1. Eukaryotic Genome Annotation Pipeline

heading Order 15 Level 2 Style Heading2 word/document.xml:/w:document[1]/w:body[1]/w:p[12]

H2. Scope

paragraph Order 16 word/document.xml:/w:document[1]/w:body[1]/w:p[13]

The NCBI Eukaryotic Genome Annotation Pipeline is an automated pipeline producing annotation of coding and non-coding genes, transcripts, and proteins on finished and unfinished public genome assemblies. It provides content for various NCBI resources including Nucleotide, Protein, BLAST, Gene, and the Map Viewer genome browser. The pipeline uses a modular framework for the execution of all annotation tasks from the fetching of raw and curated data from public repositories (sequence and Assembly databases) through the alignment of sequences and the prediction of genes, to the submission of the accessioned and named annotation products to public databases.

paragraph Order 17 word/document.xml:/w:document[1]/w:body[1]/w:p[14]

Core components of the pipeline are the alignment programs Splign (1) and ProSplign, and Gnomon, a gene prediction program combining information from alignments of experimental evidence and from models produced ab initio with an HMM-based algorithm.

paragraph Order 18 word/document.xml:/w:document[1]/w:body[1]/w:p[15]

The annotation pipeline produces comprehensive sets of genes, transcripts, and proteins derived from multiple sources, depending on the data available. In order of preference, the following sources are used:

list Order 19 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[16]

RefSeq curated annotated genomic sequences (2), such as the human beta globin gene cluster located on chromosome 11 (NG_000007.3)

list Order 20 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[17]

Known RefSeq transcripts (2)

list Order 21 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[18]

Gnomon-predicted models

paragraph Order 23 word/document.xml:/w:document[1]/w:body[1]/w:p[20]

The output of the annotation pipeline comprises both the set of genes and the placements of those genes on the genomic sequences.

heading Order 24 Level 3 Style Heading3 word/document.xml:/w:document[1]/w:body[1]/w:p[21]

H3. Organisms in scope

paragraph Order 25 word/document.xml:/w:document[1]/w:body[1]/w:p[22]

The eukaryotic organisms annotated by NCBI span a wide range of taxa among invertebrates, vertebrates, and plants. Annotation priorities are based on several considerations, including:

list Order 26 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[23]

National Institutes of Health (NIH) priorities: Mammals are important to the NIH, so high-quality genome assemblies for new mammalian species are given a higher priority for annotation

list Order 27 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[24]

Biological or economic importance: highly studied organisms or organisms with agricultural (e.g., crops) or industrial use

list Order 28 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[25]

Community interest/requests: requests from research communities, communicated in person or in writing through the NCBI Support Center. To write to the NCBI Support Center, click on the “Support Center” link in the bottom right corner of any NCBI web page.

paragraph Order 30 word/document.xml:/w:document[1]/w:body[1]/w:p[27]

The annotation process depends heavily on the availability of transcript or protein evidence for the species. Some annotation plans for high-priority organisms may be put on hold pending submission and public availability of transcriptome data.

heading Order 32 Level 3 Style Heading3 word/document.xml:/w:document[1]/w:body[1]/w:p[29]

H3. Assemblies in scope

paragraph Order 33 word/document.xml:/w:document[1]/w:body[1]/w:p[30]

Only genomes with assemblies that are public in the International Nucleotide Sequence Database Collaboration (INSDC) (DNA Data Bank of Japan, European Nucleotide Archive or GenBank) are considered for annotation. These assemblies are available in the Assembly resource. Assemblies with assembled chromosomes are preferred, but assemblies made of unplaced scaffolds only may also be annotated. Assemblies for which only contigs are available are not annotated.

paragraph Order 34 word/document.xml:/w:document[1]/w:body[1]/w:p[31]

Assemblies with high contig and scaffold N50 are prioritized. No single quality metric is used as a strict threshold, but assemblies that have a contig N50 above 50,000 bases and/or a scaffold N50 above 2,000,000 bases are preferred, as more complete gene sets are generally produced for assemblies with higher N50 statistics. NCBI may decide not to annotate assemblies that are extremely fragmented, even if they meet other criteria.
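
As an illustration of the N50 statistic mentioned above, the following minimal Python sketch computes an N50 from a list of sequence lengths and checks the two soft preferences quoted in the text. The function names are hypothetical, and the boolean check is only a convenience for illustration: the text states that no single metric is used as a strict threshold.

```python
from typing import Iterable

# Soft preferences quoted in the text; they are not strict cutoffs.
CONTIG_N50_PREFERRED = 50_000
SCAFFOLD_N50_PREFERRED = 2_000_000


def n50(lengths: Iterable[int]) -> int:
    """Return the length L such that sequences of length >= L together
    contain at least half of all bases. Returns 0 for empty input."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def meets_n50_preference(contig_lengths: Iterable[int],
                         scaffold_lengths: Iterable[int]) -> bool:
    """Hypothetical helper: True if either N50 exceeds its preferred value
    (the text says "and/or")."""
    return (n50(contig_lengths) > CONTIG_N50_PREFERRED
            or n50(scaffold_lengths) > SCAFFOLD_N50_PREFERRED)


if __name__ == "__main__":
    print(n50([10, 20, 30, 40]))
    print(meets_n50_preference([60_000, 60_000, 60_000], [100_000]))
```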

paragraph Order 35 word/document.xml:/w:document[1]/w:body[1]/w:p[32]

If multiple assemblies are available for the same organism, NCBI will annotate the higher quality assembly as the reference. Alternate assemblies of lower quality may also be included. This decision depends on the quality of the alternate assemblies, their importance to the community, as well as the estimated gain from annotating extra assemblies (number of extra genes identified, compensation of low-quality regions in the reference by higher-quality regions in the alternate assembly, value to variation studies).

paragraph Order 36 word/document.xml:/w:document[1]/w:body[1]/w:p[33]

Some assemblies are submitted to INSDC with annotation. NCBI may elect to propagate this annotation onto RefSeq sequences. This is typically the case for model organism assemblies with well-curated annotation, such as Drosophila melanogaster (maintained by FlyBase), Saccharomyces cerevisiae (maintained by the Saccharomyces Genome Database) or Caenorhabditis elegans (maintained by WormBase) but annotation propagation from GenBank to RefSeq records may also be done for other organisms (e.g., Sorghum bicolor). For some organisms with annotation submitted to INSDC (e.g., Ailuropoda melanoleuca), NCBI may opt to annotate RefSeq copies of the assemblies, primarily to provide a more consistent RefSeq dataset across organisms of interest to the NIH.

heading Order 37 Level 2 Style Heading2 word/document.xml:/w:document[1]/w:body[1]/w:p[34]

H2. History

paragraph Order 38 word/document.xml:/w:document[1]/w:body[1]/w:p[35]

NCBI’s original Eukaryotic Genome Annotation Pipeline began development in the year 2000 to annotate draft versions of the human genome assembly produced by the Human Genome Project. NCBI's annotation process has grown over the last 13 years to accommodate non-human organisms. It has also become an automated pipeline that annotates more feature types using a wider range of input data and new or improved algorithms.

paragraph Order 39 word/document.xml:/w:document[1]/w:body[1]/w:p[36]

In its infancy, NCBI’s Eukaryotic Genome Annotation Pipeline was a semi-manual process to annotate known genes by aligning mRNAs from GenBank and RefSeq to the genome using BLAST (3), and to generate ab initio gene model predictions in the spaces between the known genes with GenomeScan (4) guided by protein alignments. One early advance was to use EST alignments to produce model transcripts that represented EST and mRNA chains that shared introns. Another major improvement came in 2003 when Gnomon, a gene prediction program developed at NCBI based on GenScan (5), replaced GenomeScan. Gnomon allowed us to generate gene models using a combination of mRNA, EST and protein alignments as evidence, supplemented by ab initio prediction where evidence was lacking. The next major advance was the development and incorporation of splicing-aware alignment algorithms capable of placing transcripts and proteins independently while following established rules of eukaryotic splicing. NCBI's first splicing-aware transcript alignment program, Spidey (6), was developed as a research project but this program did not scale to very large data sets and it was not sufficiently robust for routine use in our annotation pipeline. Splign (1) was developed as a replacement for Spidey and was incorporated into the annotation pipeline in 2004. Splign allowed accurate placement of transcripts and aided efforts to identify problematic areas of both the genome and the transcript set. ProSplign, NCBI's splicing-aware protein alignment program, was incorporated into the annotation pipeline in 2006 to improve the accuracy of the protein-to-genomic sequence alignments used as evidence in the Gnomon gene model prediction process. In 2013, NCBI made another major enhancement to the annotation process that allowed effective use of RNA-Seq data as evidence for making transcript models. This greatly improved the quality of the annotation for many organisms that have little or no mRNA or EST data in GenBank.

paragraph Order 40 word/document.xml:/w:document[1]/w:body[1]/w:p[37]

As the rate at which new genome assemblies were deposited in GenBank increased, deficiencies in the annotation pipeline that limited our ability to scale the process beyond a small number of organisms became more apparent. In parallel with the improvements to the annotation algorithms described above, we twice re-engineered the existing process to create a new framework for parallel execution that also provides extensibility, robustness, tracking, and reproducibility. By 2009, development of the re-engineered pipeline was sufficiently advanced to switch production annotation runs from the old pipeline to the new framework. Further refinements to the process and more automation continue to improve throughput. In 2011, we annotated twice as many eukaryotic genomes as in any previous year, and as of the second half of 2013 we are releasing an average of 8 eukaryotic genome annotations per month.

heading Order 41 Level 2 Style Heading2 word/document.xml:/w:document[1]/w:body[1]/w:p[38]

H2. Dataflow

heading Order 42 Level 3 Style Heading3 word/document.xml:/w:document[1]/w:body[1]/w:p[39]

H3. Methods

heading Order 43 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[40]

H4. Alignments

paragraph Order 44 word/document.xml:/w:document[1]/w:body[1]/w:p[41]

Both Splign (1) and ProSplign are global alignment tools that enable alignment of transcripts and proteins with high resolution of splice sites. The computational cost of these algorithms requires that approximate placements of the query sequences (transcripts or proteins) on the target (genome) be first identified with a local alignment tool, such as BLAST. Since a query often aligns at multiple locations, the BLAST hits are analyzed by the Compart algorithm to identify compartments prior to running Splign or ProSplign.

heading Order 46 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[43]

H5. BLAST

paragraph Order 47 word/document.xml:/w:document[1]/w:body[1]/w:p[44]

See the BLAST chapter.

heading Order 49 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[46]

H5. Compart algorithm

paragraph Order 50 word/document.xml:/w:document[1]/w:body[1]/w:p[47]

A compartment is defined as a sequence of compatible hits. Two BLAST hits are said to be compatible if they follow the natural flow of the target sequence. On a given strand, the relative position of the hits should be the same on both the query sequence and the genome. Compatible hits may overlap but may not be contained within one another. This definition of compatibility is transitive.
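
The compatibility rule can be sketched in a few lines of Python. This is a minimal illustration, not the Compart implementation: the `Hit` structure, the half-open coordinate convention, and the assumption that both hits are already on the same strand are all choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A local alignment hit on one strand (hypothetical structure).

    Coordinates are half-open: [q_start, q_end) on the query and
    [s_start, s_end) on the genome.
    """
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def compatible(a: Hit, b: Hit) -> bool:
    """True if the two hits appear in the same relative order on the query
    and on the genome, and neither hit is contained within the other.
    Partial overlaps are allowed."""
    first, second = (a, b) if a.q_start <= b.q_start else (b, a)
    query_ok = first.q_start < second.q_start and first.q_end < second.q_end
    genome_ok = first.s_start < second.s_start and first.s_end < second.s_end
    return query_ok and genome_ok


if __name__ == "__main__":
    a = Hit(0, 100, 1000, 1100)
    b = Hit(90, 200, 5000, 5110)   # overlaps a on the query, same order
    c = Hit(20, 50, 1020, 1050)    # contained within a
    print(compatible(a, b), compatible(a, c))
```

Requiring both the start and the end of the second hit to lie strictly after those of the first is what allows overlap while ruling out containment.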

paragraph Order 51 word/document.xml:/w:document[1]/w:body[1]/w:p[48]

The Compart algorithm finds all non-overlapping compact compartments on the genome for a given query using a maximal coverage algorithm. Each compartment is assigned coverage, $\Phi_c$, which is a measure of how well it represents the target sequence:

paragraph Order 52 word/document.xml:/w:document[1]/w:body[1]/w:p[49]

$$\Phi_c = \sum_{h} w_h L^{\mathrm{eff}}_h$$

paragraph Order 53 word/document.xml:/w:document[1]/w:body[1]/w:p[50]

In this equation, the sum runs over the hits $h$ of the compartment, and $L^{\mathrm{eff}}_h$ is the effective length of hit $h$. It is usually the hit length, but if the hit overlaps a neighboring hit, its effective length is decreased by half of the overlap.

paragraph Order 54 word/document.xml:/w:document[1]/w:body[1]/w:p[51]

For cDNA alignments, where most useful hits are of very high identity, the weight $w_h$ equals the identity of the hit and the coverage $\Phi_c$ is the number of matches. For protein alignments, the weight is a constant equal to 1. In this case the coverage $\Phi_c$ is simply the target sequence length covered by the hits.
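
The coverage of a single compartment can be illustrated with a short Python sketch. It assumes that overlap is measured on the query between hits that are adjacent after sorting by query start, and that identity is stored as a fraction; the `Hit` structure and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    q_start: int      # half-open query interval [q_start, q_end)
    q_end: int
    identity: float   # fraction in [0, 1]


def compartment_coverage(hits: List[Hit], protein: bool = False) -> float:
    """Sum of weight * effective length over the hits of one compartment.

    The weight is the hit identity for cDNA queries and 1 for proteins.
    Each overlap between adjacent hits is split evenly between them.
    """
    ordered = sorted(hits, key=lambda h: h.q_start)
    eff = [float(h.q_end - h.q_start) for h in ordered]
    for i in range(len(ordered) - 1):
        overlap = max(0, ordered[i].q_end - ordered[i + 1].q_start)
        eff[i] -= overlap / 2
        eff[i + 1] -= overlap / 2
    return sum((1.0 if protein else h.identity) * length
               for h, length in zip(ordered, eff))


if __name__ == "__main__":
    hits = [Hit(0, 100, 1.0), Hit(80, 200, 0.9)]
    # Effective lengths are 90 and 110 because the 20-base overlap is halved.
    print(compartment_coverage(hits))
    print(compartment_coverage(hits, protein=True))
```

With a weight of 1, the two effective lengths add up to the 200 query positions covered by the hits, which matches the statement that protein coverage is the covered length.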

paragraph Order 55 word/document.xml:/w:document[1]/w:body[1]/w:p[52]

When there is more than one compartment, the query sequence is covered multiple times, and to a certain extent finding all compartments is equivalent to maximizing the total coverage. In the case of exon duplication events, the additional hits should be ignored rather than turned into additional compartments. Since typically only a relatively small portion of the gene is duplicated, we introduce a penalty $P_{\mathrm{new}}$ for an additional compartment. This penalty ensures that a new compartment is created only if there is enough gene material for it. The value of this parameter is usually 25%–40% of the target sequence length. The maximal coverage algorithm therefore finds the compartment configuration that maximizes the following total coverage:

paragraph Order 56 word/document.xml:/w:document[1]/w:body[1]/w:p[53]

$$\Phi = \sum_{c} \left(\Phi_c - P_{\mathrm{new}}\right)$$

paragraph Order 57 word/document.xml:/w:document[1]/w:body[1]/w:p[54]

The optimization is performed efficiently using dynamic programming.
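
The effect of the new-compartment penalty can be seen by evaluating the total coverage for two candidate configurations. The sketch below only evaluates the objective; it does not reproduce the dynamic programming search, and the numbers are invented for illustration.

```python
from typing import Sequence


def total_coverage(compartment_coverages: Sequence[float], p_new: float) -> float:
    """Total coverage: each compartment contributes its own coverage
    minus the new-compartment penalty."""
    return sum(phi_c - p_new for phi_c in compartment_coverages)


if __name__ == "__main__":
    # Invented example: a 1000-base query with a penalty of 30% of its length.
    p_new = 300
    one_compartment = [950]        # full-length placement only
    two_compartments = [950, 120]  # plus a short duplicated-exon placement
    print(total_coverage(one_compartment, p_new))
    print(total_coverage(two_compartments, p_new))
```

Because the second candidate compartment covers less than the penalty, adding it lowers the total, so the configuration with a single compartment is preferred.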

heading Order 59 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[56]

H5. Splign – Transcript alignment

paragraph Order 60 word/document.xml:/w:document[1]/w:body[1]/w:p[57]

Splign (1) is a tool for aligning spliced cDNA sequences against their genomic counterparts using pre-computed compartments. The program produces accurate spliced alignments by optimizing a score $S$ formulated specifically to account for splice signals and introns:

paragraph Order 62 word/document.xml:/w:document[1]/w:body[1]/w:p[59]

$$S = B_m N_m - P_{\mathrm{mis}} N_{\mathrm{mis}} - \sum_{\mathrm{gaps}} \left(P_{\mathrm{gopen}} + P_{\mathrm{gextend}}\, l\right) - \sum_{\mathrm{introns}} \left(P_{\mathrm{iopen}} + P_{\mathrm{iextend}}\, l\right)$$

paragraph Order 63 word/document.xml:/w:document[1]/w:body[1]/w:p[60]

In this formula, $B_m$ and $N_m$ are the bonus for a match and the number of matches, $P_{\mathrm{mis}}$ and $N_{\mathrm{mis}}$ are the penalty for a mismatch and the number of mismatches, $P_{\mathrm{gopen}}$ and $P_{\mathrm{gextend}}$ are the penalties for opening and extending a gap, and $l$ is the length of the gap or intron. These parameters are similar to the ones used in Blastn. The introns are accounted for by the introduction of a special type of gap, with $P_{\mathrm{iopen}}$ and $P_{\mathrm{iextend}}$ as the penalties for opening and extending an intron. The formulation discriminates between the most frequent consensus (GT/AG), less frequent consensus (GC/AG, AT/AC), and non-consensus donor/acceptor sites by giving different values to $P_{\mathrm{iopen}}$.
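
To make the structure of this score concrete, the following Python sketch evaluates it for an alignment summarized by its counts of matches and mismatches, its gap lengths, and its introns. All numeric parameter values are placeholders chosen for this example; the text does not give the values used by Splign, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

# Placeholder intron-opening penalties per splice-site class (assumed values).
INTRON_OPEN = {"consensus": 10.0, "minor": 15.0, "non_consensus": 25.0}


def splice_class(donor: str, acceptor: str) -> str:
    """Classify a donor/acceptor dinucleotide pair as described in the text."""
    pair = (donor.upper(), acceptor.upper())
    if pair == ("GT", "AG"):
        return "consensus"
    if pair in {("GC", "AG"), ("AT", "AC")}:
        return "minor"
    return "non_consensus"


def splign_like_score(n_match: int,
                      n_mismatch: int,
                      gap_lengths: Iterable[int],
                      introns: Iterable[Tuple[str, str, int]],
                      match_bonus: float = 1.0,
                      mismatch_penalty: float = 1.0,
                      gap_open: float = 5.0,
                      gap_extend: float = 2.0,
                      intron_extend: float = 0.0) -> float:
    """Match bonus minus mismatch, gap, and intron penalties.

    `introns` holds (donor, acceptor, length) tuples. Default parameter
    values are illustrative only.
    """
    score = match_bonus * n_match - mismatch_penalty * n_mismatch
    for length in gap_lengths:
        score -= gap_open + gap_extend * length
    for donor, acceptor, length in introns:
        score -= INTRON_OPEN[splice_class(donor, acceptor)] + intron_extend * length
    return score


if __name__ == "__main__":
    print(splign_like_score(100, 2, [3], [("GT", "AG", 1000)]))
    print(splign_like_score(100, 2, [3], [("AA", "TT", 1000)]))
```

With these placeholder values, the same alignment scores lower when its intron has non-consensus splice sites, which is the discrimination the formulation is designed to provide.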

paragraph Order 64 word/document.xml:/w:document[1]/w:body[1]/w:p[61]

Since the complexity of solving the global sequence alignment problem is proportional to the product of the lengths of the sequences, the hits are arranged into compartments as described above, and the dynamic programming matrix is split into smaller blocks by seeding the global alignment with the high-identity portions of the hits (Figure 1).

paragraph Order 65 word/document.xml:/w:document[1]/w:body[1]/w:p[62]

For each compartment, its genomic search space is expanded by the length of query cDNA ends not covered by local alignments. This allows detecting the end exons if they are missed by the local alignment tool for reasons such as the alignment length being shorter than the word size or the exon residing in a masked region. Each hit may correspond to an exon, a part of an exon, or even a number of exons. Therefore, it is important to be conservative when using local alignments for alignment seeding. Within each compartment, parts of alignments that overlap on the query are dropped. From the remaining alignments, the longest perfectly matching diagonals are extracted, and the cores are used to seed the global alignment.

paragraph Order 66 word/document.xml:/w:document[1]/w:body[1]/w:p[63]

Hits comprising compartments determine whether the query and the subject sequence align on the same strand. Most mRNA sequences have a natural biological orientation, so the positive strand can be assumed when aligning them. In contrast, EST and frequently RNA-Seq sequences are not oriented, so both the original sequence and its reverse complement have to be aligned, and the strand is determined by comparing the resulting alignments.
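
For unoriented sequences, the two-orientation comparison can be sketched as follows. The `align` argument is a hypothetical caller-supplied function that returns an alignment score; it stands in for a real aligner, and the toy scorer in the example is only there to make the sketch runnable.

```python
from typing import Callable, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def choose_strand(query: str, align: Callable[[str], float]) -> Tuple[str, float]:
    """Align the query in both orientations and keep the better-scoring one.

    Returns ("+", score) or ("-", score); ties go to the original orientation.
    """
    plus_score = align(query)
    minus_score = align(reverse_complement(query))
    if plus_score >= minus_score:
        return "+", plus_score
    return "-", minus_score


if __name__ == "__main__":
    genome = "TTTGACCA"
    # Toy scorer: 1.0 if the sequence occurs exactly in the genome, else 0.0.
    toy_align = lambda s: 1.0 if s in genome else 0.0
    print(choose_strand("TGGTCA", toy_align))
```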

heading Order 67 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[64]

H5. ProSplign – Protein alignment

paragraph Order 68 word/document.xml:/w:document[1]/w:body[1]/w:p[65]

Protein alignments are produced by ProSplign. Like Splign, ProSplign is a global alignment tool, in this case protein-to-genome, that produces accurate spliced alignments from pre-computed compartments. ProSplign uses a modified Needleman-Wunsch-type global alignment algorithm (7). ProSplign scores the target protein sequence against the translation of the genomic sequence using the following score:

paragraph Order 70 word/document.xml:/w:document[1]/w:body[1]/w:p[67]

$$S = \sum_{\mathrm{diag}} S_{\mathrm{diag}} - \sum_{\mathrm{gaps}} \left(P_{\mathrm{gopen}} + P_{\mathrm{gextend}}\, l\right) - \sum_{\mathrm{introns}} \left(P_{\mathrm{iopen}} + P_{\mathrm{iextend}}\, l\right)$$

paragraph Order 72 word/document.xml:/w:document[1]/w:body[1]/w:p[69]

where $S_{\mathrm{diag}}$ is the score for an ungapped part of the alignment calculated using a BLOSUM62 matrix (8). Insertions and deletions for which the length is a multiple of three are scored with the default Blastp gap penalties $P_{\mathrm{gopen}}$ and $P_{\mathrm{gextend}}$. Gaps for which the length is not a multiple of three are frameshifts and have a much higher opening penalty $P_{\mathrm{gopen}}$. The introns are scored as a special type of gap with a very small extension cost and an opening cost that differs between the most frequent consensus splices (GT/AG), less frequent consensus splices (GC/AG, AT/AC), and non-consensus splice sites.
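
The distinction between in-frame gaps and frameshifts can be sketched as a small penalty function. The values below are assumptions: 11 and 1 are used as the opening and extension costs on the understanding that these are the usual blastp defaults for BLOSUM62, the frameshift opening penalty is an arbitrary larger number, and charging the extension cost per nucleotide is a simplification made for this example.

```python
def is_frameshift(gap_length: int) -> bool:
    """A gap whose length in nucleotides is not a multiple of three
    shifts the reading frame."""
    return gap_length % 3 != 0


def gap_penalty(gap_length: int,
                gap_open: int = 11,         # assumed blastp-style default
                gap_extend: int = 1,        # assumed blastp-style default
                frameshift_open: int = 30   # placeholder, "much higher"
                ) -> int:
    """Opening cost plus extension cost times length, with a higher
    opening cost when the gap is a frameshift."""
    if gap_length <= 0:
        raise ValueError("gap_length must be positive")
    opening = frameshift_open if is_frameshift(gap_length) else gap_open
    return opening + gap_extend * gap_length


if __name__ == "__main__":
    print(gap_penalty(6))  # in-frame gap of two codons
    print(gap_penalty(4))  # frameshifting gap
```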

paragraph Order 73 word/document.xml:/w:document[1]/w:body[1]/w:p[70]

Unlike Splign, ProSplign doesn’t use seeds because BLAST hits for cross-species proteins do not give reliable information about seeds. Instead, ProSplign aligns the protein against a slightly extended genomic region identified by Compart as the compartment.

paragraph Order 74 word/document.xml:/w:document[1]/w:body[1]/w:p[71]

Not all parts of a protein are conserved well enough to provide a reliable alignment. In fact, some parts may not correspond to anything on the genome. The global alignment algorithm will align the whole protein, rendering a very low-identity alignment for the non-conserved portions of the protein. These unreliable and often misleading pieces of the alignment are filtered out by ProSplign during a post-processing step.

heading Order 76 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[73]

H4. Gene prediction

paragraph Order 77 word/document.xml:/w:document[1]/w:body[1]/w:p[74]

Gnomon is a two-step gene prediction program maintained by NCBI. The Chainer algorithm assembles overlapping alignments into “chains” and is followed by the ab initio prediction step which extends these chains into complete models and creates full ab initio models, using a Hidden Markov Model (HMM).

heading Order 78 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[75]

H5. Chainer

paragraph Order 79 word/document.xml:/w:document[1]/w:body[1]/w:p[76]

Spliced alignments obtained using Splign and ProSplign are likely partial, either because the aligned sequences are partial or, in the case of protein alignments, because only conserved portions of the protein could be aligned. Chainer analyzes and assembles these partial alignments to provide longer gene models and additional information about alternative variants.

paragraph Order 80 word/document.xml:/w:document[1]/w:body[1]/w:p[77]

Because of their short length and high redundancy, RNA-Seq alignments with identical introns are first combined into single alignments with larger weights (Figure 2). Boundaries of these “micro-chains” don’t cross splices known from other alignments and their extension is limited to 20 bp.
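
The first step, combining RNA-Seq alignments that have identical introns, can be sketched as a grouping operation. This minimal example represents each alignment only by its ordered tuple of intron coordinates and uses the number of collapsed alignments as the weight; both choices are assumptions made for illustration. It does not model the boundary rules (splices known from other alignments, the 20 bp extension limit).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Intron = Tuple[int, int]           # (start, end) of one intron on the genome
IntronChain = Tuple[Intron, ...]   # all introns of one alignment, in order


def collapse_by_introns(alignments: Iterable[IntronChain]) -> Dict[IntronChain, int]:
    """Group spliced alignments with identical introns and count them.

    Unspliced alignments (no introns) are skipped because they have no
    intron chain to group on.
    """
    weights: Counter = Counter()
    for introns in alignments:
        if introns:
            weights[tuple(introns)] += 1
    return dict(weights)


if __name__ == "__main__":
    reads = [
        ((200, 300),),
        ((200, 300),),
        ((200, 300), (450, 500)),
        (),  # unspliced read
    ]
    print(collapse_by_introns(reads))
```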

paragraph Order 81 word/document.xml:/w:document[1]/w:body[1]/w:p[78]

These “micro-chains” are then combined by Chainer with cDNA and protein alignments based on their exon structure compatibility using a modified version of the Maximal Transcript Alignment algorithm (9) based on frame compatibility of the coding regions. For protein and annotated full-length cDNA alignments, the coding regions can be inferred. For other cDNA alignments, possible coding regions are predicted and scored using a 3-periodic fifth-order Markov model for coding propensity and Weight Matrix Method (WMM) models for splice signals and translation initiation and termination signals (10). All cDNAs with coding sequence (CDS) scores above a given threshold are marked as coding, and the CDS information is used when assembling chains. In many cases, this process determines the orientation of an EST if it was unknown before. RNA-Seq and some EST alignments are too short to score above the threshold and, if they are not spliced, their orientation is often also unknown. For these alignments, Chainer will consider that these sequences can be part of the 5’ end and harbor a start codon, or be part of the 3’ end and harbor a stop codon, or be internal to the CDS or to an untranslated region (UTR), and select the scenario that contributes to the longest CDS.

paragraph Order 82 word/document.xml:/w:document[1]/w:body[1]/w:p[79]

Afterward, UTRs are added if the necessary translation initiation or termination signals are present. There are no restrictions on the extension of a 5’-UTR other than the exon-intron structure compatibility.

paragraph Order 83 word/document.xml:/w:document[1]/w:body[1]/w:p[80]

The assembled full-length chains that share splices or CDS are combined into genes with alternative isoforms. Among the partial chains for a gene, the variant with the longest CDS is selected for extension by ab initio prediction.

heading Order 85 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[82]

H5. HMM-based prediction

paragraph Order 86 word/document.xml:/w:document[1]/w:body[1]/w:p[83]

The core algorithm of the ab initio prediction capability of Gnomon is based on Genscan (5), which uses a 3-periodic fifth-order HMM for the coding propensity score and incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. The most important distinction of Gnomon from Genscan and other ab initio prediction programs is its ability to conform to the supplied alignments and extend and complement them when necessary.

paragraph Order 87 word/document.xml:/w:document[1]/w:body[1]/w:p[84]

Mathematically, an HMM-based ab initio prediction is a search in the gene configuration space for the gene that provides the maximal score. If all configurations that are not compatible with the available alignments are excluded from the search space, then the optimization process in the resulting collapsed space will yield a gene configuration that is possibly suboptimal from the ab initio point of view but exactly follows the experimental information available. This approach allows extension or connections of partial alignments (Figure 3). Untranslated regions, if present in the alignments, are also included in the gene model.

paragraph Order 89 word/document.xml:/w:document[1]/w:body[1]/w:p[86]

Gnomon recognizes coding exons and introns on both strands, and intergenic sequences, as HMM states. Translational and splice signals are described using WMM (10) and WAM (11) models. A 12-bp WMM model, beginning 6 bp prior to the initiation codon, is used for the translation initiation signal (12). A 6-bp first-order WAM model starting at the stop codon is used for the translation termination signal. The donor splice signal is described by a 9-bp second-order WAM model, and the acceptor splice signal is described by a 43-bp second-order WAM model. Both donor and acceptor models include 3 bp of the coding exon. Coding portions of exons are modeled using an inhomogeneous 3-periodic fifth-order Markov model (13). The noncoding states are modeled using a homogeneous fifth-order Markov model.

heading Order 90 Level 3 Style Heading3 word/document.xml:/w:document[1]/w:body[1]/w:p[87]

H3. Input data

heading Order 91 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[88]

H4. Assemblies

paragraph Order 92 word/document.xml:/w:document[1]/w:body[1]/w:p[89]

The Eukaryotic Annotation Pipeline can annotate one or multiple assemblies at once (see below). All assemblies must be publicly available in the Assembly database. Since the INSDC sequence records constituting the submitted assemblies are owned by submitters and may not be modified by NCBI, all annotation is done on RefSeq copies of the INSDC assemblies. Prior to the annotation process, RefSeq accessions are assigned to the assembly’s scaffolds and chromosomes. These RefSeq sequences are based on the sequences in the INSDC records, but their records will bear the NCBI annotation. Note also that a new assembly accession, with the prefix GCF_ , is given to the assembly which contains the RefSeq sequences.

heading Order 94 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[91]

H4. Source of evidence

paragraph Order 95 word/document.xml:/w:document[1]/w:body[1]/w:p[92]

The evidence used to predict gene models is selected from available public data. Same-species transcripts, proteins, and short reads are included and, if these are not sufficient, transcripts and proteins from closely related species are added.

paragraph Order 96 word/document.xml:/w:document[1]/w:body[1]/w:p[93]

More specifically the following sets of transcripts are included:

list Order 97 word/document.xml:/w:document[1]/w:body[1]/w:p[94]

Known RefSeq transcripts: coding and non-coding RefSeq transcripts, with NM_ or NR_ prefixes respectively. These are generated by NCBI staff based on automatic processes, manual curation, or data from collaborating groups (for more details, see the RefSeq chapter and reference 2)

list Order 98 word/document.xml:/w:document[1]/w:body[1]/w:p[95]

Other long transcripts

list Order 99 word/document.xml:/w:document[1]/w:body[1]/w:p[96]

GenBank transcripts from the taxonomically relevant GenBank divisions, and the Third-Party Annotation (TPA), High-throughput cDNA (HTC) and Transcriptome Shotgun Assembly (TSA) divisions

list Order 100 word/document.xml:/w:document[1]/w:body[1]/w:p[97]

ESTs from dbEST

list Order 101 word/document.xml:/w:document[1]/w:body[1]/w:p[98]

Long RNA-Seq sequences (e.g., from the GS FLX TITANIUM 454 platform) from the Sequence Read Archive (SRA)

list Order 102 word/document.xml:/w:document[1]/w:body[1]/w:p[99]

Short read RNA-Seq data available in SRA

paragraph Order 103 word/document.xml:/w:document[1]/w:body[1]/w:p[100]

And the following proteins:

list Order 104 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[101]

Known RefSeq proteins, with NP_ prefixes

list Order 105 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[102]

INSDC proteins derived from transcripts (as much as possible, conceptual translations are excluded)

paragraph Order 106 word/document.xml:/w:document[1]/w:body[1]/w:p[103]

In addition, if available for the annotated organism, curated RefSeq genomic sequences are used. These sequences have accessions with NG_ prefixes and represent non-transcribed pseudogenes, manually annotated gene clusters that are difficult to annotate via automated methods, or human RefSeqGene records (2).

heading Order 107 Level 3 Style Heading3 word/document.xml:/w:document[1]/w:body[1]/w:p[104]

H3. Process flow

paragraph Order 108 word/document.xml:/w:document[1]/w:body[1]/w:p[105]

Figure 4 provides an overview of the annotation pipeline. Transcripts from RefSeq, GenBank, and the Sequence Read Archive, proteins, and, if available, RefSeq curated genomic sequences are aligned to the masked genome. Gene models are predicted by Gnomon based on these alignments, and searched against the curated database UniProtKB/SwissProt. The final set of models is then chosen from among the Gnomon predictions (model RefSeq) and the known and curated RefSeq. Names and types of loci and GeneIDs are assigned to model RefSeq and retrieved from the Gene database for known RefSeq. In the final steps, the annotation is formatted, submitted to the sequence databases, and published.

heading Order 110 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[107]

H4. Fetching of inputs

paragraph Order 111 word/document.xml:/w:document[1]/w:body[1]/w:p[108]

All evidence identifiers are retrieved from Entrez at the very beginning of the annotation run. The date of sequence retrieval is tracked and reported as the annotation run “freeze” date; any sequence added to archival databases after that date is not used.
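
The freeze-date rule can be illustrated with a minimal sketch. The `Evidence` record, the `filter_by_freeze_date` helper, and the example identifiers and dates are all hypothetical, chosen for illustration; they are not the pipeline's actual data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record: an identifier and the date it entered an archive."""
    identifier: str
    added_on: date


def filter_by_freeze_date(records, freeze_date):
    """Keep only evidence added on or before the freeze date."""
    return [r for r in records if r.added_on <= freeze_date]


# Illustrative identifiers and dates only.
records = [
    Evidence("transcript_A", date(2013, 5, 1)),
    Evidence("transcript_B", date(2013, 6, 15)),
    Evidence("transcript_C", date(2013, 6, 16)),
]
freeze = date(2013, 6, 15)
usable = filter_by_freeze_date(records, freeze)
print([r.identifier for r in usable])
```

In this sketch a sequence added on the freeze date itself is still used, which matches the statement that only sequences added after that date are excluded.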

heading Order 113 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[110]

H4. Genome sequence masking

paragraph Order 114 word/document.xml:/w:document[1]/w:body[1]/w:p[111]

The assemblies are retrieved from the Assembly resource and masked using either WindowMasker (14) or RepeatMasker (15). RepeatMasker is generally used for organisms for which a comprehensive repeat library is available.

heading Order 116 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[113]

H4. Alignment of curated RefSeq genomic sequences

paragraph Order 117 word/document.xml:/w:document[1]/w:body[1]/w:p[114]

If available for the organism of interest, curated RefSeq genomic sequences are aligned to the masked genome using BLAST. The alignments are ranked and filtered based on identity, coverage, and placement information kept in a RefSeq tracking in-house database. The features annotated on the alignments passing the filter are then projected onto the genomic sequences and evaluated with the other aligning evidence when choosing the best model.

heading Order 119 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[116]

H4. Alignment of protein and transcript evidence

paragraph Order 120 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[117]

After retrieval, sequences are aligned to the masked genome following this general strategy. Sequences are first aligned locally to the genome using BLAST. Based on the BLAST hits, Compart identifies genomic compartments to which query sequences are re-aligned globally. This second round of alignments is necessary for accurate determination of splice sites and for the identification of small terminal exons that may be missed by BLAST. The global alignments are performed by Splign for transcripts and ProSplign for proteins. Resulting alignments are then ranked based on coverage and identity and filtered before hand-off to downstream tasks. Adjustments to the alignment and filtering parameters, and variations on this general dataflow, are made based on the source and characteristics of the evidence and are described below.

heading Order 121 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[118]

H5. Alignment of known RefSeq transcripts

paragraph Order 122 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[119]

Since many of the known RefSeq sequences are curated (most notably for vertebrates) and, as such, are high-value targets when annotating a genome, special attention is given to their proper placement. Masking may interfere with the alignment process, so RefSeq transcripts for which all alignments on the masked genome are under a coverage threshold may be re-aligned to the unmasked genome.

paragraph Order 123 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[120]

The alignments are ranked and filtered based on adjustable criteria (such as coverage, identity, rank) as well as location information contained in the RefSeq tracking database. Typically, only the best-placed alignment for a given query is selected for use in downstream steps.
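
A minimal sketch of ranking and filtering by coverage and identity follows. The `Alignment` tuple, the `best_placed` helper, the threshold values, and the sort order (coverage first, then identity) are assumptions made for illustration. The location information from the RefSeq tracking database is not modeled here.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    query: str        # transcript identifier
    location: str     # hypothetical label for the genomic placement
    coverage: float   # fraction of the query covered, 0..1
    identity: float   # fraction of identical positions, 0..1


def best_placed(alignments, min_coverage=0.9, min_identity=0.95):
    """Return the rank-1 alignment per query after filtering.

    The thresholds are assumed values for illustration only.
    """
    by_query = defaultdict(list)
    for aln in alignments:
        if aln.coverage >= min_coverage and aln.identity >= min_identity:
            by_query[aln.query].append(aln)
    best = {}
    for query, alns in by_query.items():
        # Rank by coverage, then identity; rank 1 is the first element.
        ranked = sorted(alns, key=lambda a: (a.coverage, a.identity), reverse=True)
        best[query] = ranked[0]
    return best


alignments = [
    Alignment("T1", "chr1:100-900", 0.99, 0.98),
    Alignment("T1", "chr7:50-700", 0.92, 0.97),
    Alignment("T2", "chr2:10-400", 0.60, 0.99),
]
result = best_placed(alignments)
print(result)
```

Here `T1` keeps only its higher-coverage placement, and `T2` is dropped because its single alignment falls under the assumed coverage threshold.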

heading Order 124 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[121]

H5. Alignment of non-RefSeq transcripts

paragraph Order 125 word/document.xml:/w:document[1]/w:body[1]/w:p[122]

INSDC mRNAs, ESTs and 454 sequences are first screened against a database of mitochondrial sequences, cloning vectors, adaptors, bacterial IS-elements and repetitive sequences, and excluded from further processing if a large portion of their sequence hits a contaminant. In addition, transcripts identified as low-quality by curation staff are screened out.

paragraph Order 126 word/document.xml:/w:document[1]/w:body[1]/w:p[123]

Following this initial screen, the sequences are aligned with BLAST and Splign, as explained above, and ranked and filtered. For a given transcript, typically only the best-placed alignment (rank 1) is selected. For sequences that cannot be oriented (e.g., unspliced ESTs), alignments to both strands are passed downstream. If used, cross-species transcripts are aligned with more stringent criteria than same-species transcripts to ensure that only the most likely ortholog transcript is passed downstream.

heading Order 127 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[124]

H5. Alignment of proteins

paragraph Order 128 word/document.xml:/w:document[1]/w:body[1]/w:p[125]

Similarly to transcripts, proteins are first screened against a database of repeats and the curated list of low-quality transcripts. Proteins are then aligned to the masked genome with BLAST and ProSplign. The alignments are further ranked and filtered and passed to the gene prediction step.

heading Order 129 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[126]

H5. Alignment of short reads

paragraph Order 130 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[127]

Short reads (RNA-Seq) available in the SRA can be used for gene prediction. A specific dataflow was engineered to handle the large volume and short length of sequences produced by next-generation sequencing technologies.

paragraph Order 132 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[129]

RNA-Seq data from so-called next-generation sequencing platforms present several challenges for use in gene prediction. First, the reads are substantially shorter than conventional transcript data such as ESTs and mRNAs, so an individual read contains relatively little information. For example, typically only 5-25% of reads from the Illumina platform span an intron, and intron-spanning reads are the most useful data for building gene models. Second, the reads are extremely numerous and redundant, with highly expressed genes being represented by tens of millions of reads, which presents a challenge for throughput. Third, the depth of coverage results in apparent background expression in most of the genome that is not desirable to represent in the final gene models.

paragraph Order 134 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[131]

The annotation pipeline addresses these issues in several ways to reduce the complexity of the RNA-Seq data and convert it to a form useful for gene predictions:

list Order 135 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[132]

Datasets and associated metadata are obtained from the SRA and BioSample databases, enabling robust tracking of evidence.

list Order 136 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[133]

The reads are "uniquified" so that 100% identical sequences are aligned only once.

list Order 137 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[134]

Unique reads are aligned, ranked, and filtered for high identity and coverage alignments.

list Order 138 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[135]

Alignments with the same splice structure and the same or similar start and end points are collapsed into a single representative alignment. The number of reads from each SRA run is tracked for each collapsed alignment.

list Order 139 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[136]

Alignments containing rare introns or that represent apparent noise or background are filtered from the dataset.

paragraph Order 141 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[138]

Taken together, these steps reduce the size and complexity of a typical RNA-Seq dataset by 100-1000x. The resulting collapsed alignments can be used by themselves or combined with transcript and/or protein alignments for the gene prediction step.
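
The uniquifying and collapsing steps listed above can be sketched as follows. The function names, the data layout, the run labels, and the `tolerance` used to define "similar" start and end points are assumptions for illustration, not the pipeline's actual parameters.

```python
from collections import Counter, defaultdict


def uniquify(reads):
    """Map each distinct read sequence to its per-run read counts.

    `reads` is an iterable of (run_id, sequence) pairs.
    """
    counts = defaultdict(Counter)
    for run_id, seq in reads:
        counts[seq][run_id] += 1
    return counts


def collapse(alignments, read_counts, tolerance=5):
    """Collapse alignments sharing the same introns and similar end points.

    `alignments` maps a unique sequence to (start, end, introns), where
    `introns` is a tuple of (donor, acceptor) coordinates. `tolerance`
    (in bases) is an assumed definition of "similar".
    """
    collapsed = []  # each entry: [start, end, introns, Counter of reads per run]
    for seq, (start, end, introns) in sorted(alignments.items(), key=lambda kv: kv[1]):
        for rep in collapsed:
            if (rep[2] == introns
                    and abs(rep[0] - start) <= tolerance
                    and abs(rep[1] - end) <= tolerance):
                rep[3].update(read_counts[seq])  # add this read's counts
                break
        else:
            collapsed.append([start, end, introns, Counter(read_counts[seq])])
    return collapsed


# Toy data: run labels and sequences are illustrative only.
reads = [("run_A", "ACGT"), ("run_A", "ACGT"), ("run_B", "ACGT"),
         ("run_A", "ACGA"), ("run_B", "TTTT")]
counts = uniquify(reads)
alignments = {
    "ACGT": (100, 200, ((130, 170),)),
    "ACGA": (102, 198, ((130, 170),)),
    "TTTT": (500, 560, ()),
}
collapsed = collapse(alignments, counts)
print(len(counts), len(collapsed))
```

Five reads reduce to three unique sequences and then to two representative alignments, while the per-run read counts are preserved on each representative.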

heading Order 143 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[140]

H4. Gene prediction by Gnomon

paragraph Order 144 word/document.xml:/w:document[1]/w:body[1]/w:p[141]

Protein, transcript, and short read alignments are passed to Gnomon for gene prediction. Chainer assembles alignments with the same exon structure and with coding regions in compatible frames into putative models. Gnomon then extends the models missing a start or a stop codon or internal exon(s) using an HMM-based algorithm. Gnomon additionally creates pure ab initio predictions where open reading frames of sufficient length but with no supporting alignment are detected (see Methods).

paragraph Order 145 word/document.xml:/w:document[1]/w:body[1]/w:p[142]

This first set of predictions is further refined by alignment against a subset of the nr (non-redundant) database of protein sequences. The additional alignments are added to the initial alignments and the chaining and ab initio extension steps are repeated. The results constitute the set of Gnomon predictions.

paragraph Order 146 word/document.xml:/w:document[1]/w:body[1]/w:p[143]

Alternate variants, complete or partial, may be produced for each gene.

paragraph Order 147 word/document.xml:/w:document[1]/w:body[1]/w:p[144]

Frameshifts, indels, and stop codons may occur in the resulting Gnomon predictions. They reflect sequence differences between the input transcript and protein alignments and the genome assembly.

heading Order 149 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[146]

H4. Annotation of small RNA

paragraph Order 150 word/document.xml:/w:document[1]/w:body[1]/w:p[147]

tRNAs are annotated using tRNAscan-SE (16). Other small RNAs are annotated by placement of same-species curated RefSeq transcripts, so they are part of the annotation only if they were incorporated in the RefSeq set for the organism being annotated. The RefSeq set may include small RNAs identified by curation, collaboration, or external sources; the external sources are currently limited to microRNAs obtained from miRBase (17).

heading Order 152 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[149]

H4. Choosing the best model(s)

paragraph Order 153 word/document.xml:/w:document[1]/w:body[1]/w:p[150]

The final set of annotated features comprises, in order of preference, pre-existing known RefSeq sequences and a subset of well-supported Gnomon-predicted models. It is built by evaluating together at each locus the known RefSeq transcripts, the features projected from the curated RefSeq genomic alignments, and the models predicted by Gnomon.

heading Order 154 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[151]

H5. Models based on known and curated RefSeq

paragraph Order 155 word/document.xml:/w:document[1]/w:body[1]/w:p[152]

RefSeq transcripts are given precedence over overlapping Gnomon models with the same splice pattern. Alignments of known same-species RefSeq transcripts or curated genomic sequences are used directly to annotate the gene, RNA, and CDS features on the genome. Since the RefSeq sequence may not align perfectly or completely to the genomic sequence, a consequence of this rule is that the annotated product may differ from the conceptual translation of the genome.

heading Order 157 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[154]

H5. Models based on Gnomon predictions

paragraph Order 158 word/document.xml:/w:document[1]/w:body[1]/w:p[155]

Gnomon predictions are included in the final set of annotations if they do not share all splice sites with a RefSeq transcript and if they meet certain quality thresholds including:

list Order 159 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[156]

Only fully- or partially-supported Gnomon predictions, or pure ab initio Gnomon predictions with high coverage hits to UniProtKB/SwissProt proteins are selected.

list Order 160 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[157]

When multiple fully-supported transcript variants are predicted for a gene, only the Gnomon predictions supported in their entirety by a single long alignment (e.g., a full-length mRNA) or by RNA-Seq reads from a single BioSample are selected.

list Order 161 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[158]

Poorly-supported Gnomon predictions conflicting with better-supported models annotated on the opposite strand are excluded from the final set of models.

list Order 162 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[159]

Gnomon predictions with high homology to transposable or retro-transposable elements are excluded from the final set of models.

heading Order 164 Level 5 Style Heading5 word/document.xml:/w:document[1]/w:body[1]/w:p[161]

H5. Integrating RefSeq and Gnomon annotations

paragraph Order 165 word/document.xml:/w:document[1]/w:body[1]/w:p[162]

As a result of the model selection process, a gene may be represented by multiple splice variants, with some of them known RefSeq and others model RefSeq (originating from Gnomon predictions).

paragraph Order 166 word/document.xml:/w:document[1]/w:body[1]/w:p[163]

Gnomon predictions selected for the final annotation set are assigned model RefSeq accessions with XM_ or XR_ prefixes for protein-coding and non-coding transcripts, respectively, and XP_ prefixes for proteins to distinguish them from known RefSeq with NM_/NR_ and NP_ prefixes. Model RefSeq can be searched in Entrez with the query “srcdb_refseq_model[properties]” while known RefSeq sequences can be obtained with the query “srcdb_refseq_known[properties]”.
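
The prefix convention described above lends itself to a small lookup. This is a sketch only: the table covers just the prefixes named in this paragraph, the `classify_accession` helper is hypothetical, and the example accessions are placeholders rather than real records.

```python
# Prefix table restricted to the prefixes named in the paragraph above.
REFSEQ_PREFIXES = {
    "NM_": ("known", "protein-coding transcript"),
    "NR_": ("known", "non-coding transcript"),
    "NP_": ("known", "protein"),
    "XM_": ("model", "protein-coding transcript"),
    "XR_": ("model", "non-coding transcript"),
    "XP_": ("model", "protein"),
}


def classify_accession(accession):
    """Return (status, molecule) for an accession, or None if the prefix is not listed."""
    return REFSEQ_PREFIXES.get(accession[:3])


# Placeholder accessions for illustration.
for acc in ("XM_000000001.1", "NP_000000002.1", "AB_000003"):
    print(acc, classify_accession(acc))
```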

heading Order 168 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[165]

H4. Locus typing and protein naming

paragraph Order 169 word/document.xml:/w:document[1]/w:body[1]/w:p[166]

Genes are categorized into different locus types according to the type and quality of the model and based on orthology information.

list Order 170 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[167]

Known RefSeq features are annotated according to their locus type (e.g., protein-coding vs. pseudogene) established before the annotation run.

list Order 171 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[168]

Most Gnomon models with insertions, deletions, or frameshifts are labeled as pseudogenes and annotated without a CDS feature or protein product.

list Order 172 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[169]

Gnomon models that appear to be single-exon retrocopies of protein-coding genes may also be annotated as pseudogenes.

list Order 173 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[170]

Gnomon models with insertions, deletions, or frameshifts may be considered coding if they have a strong unique hit to the SwissProt database or appear to be orthologs of known protein-coding genes. Titles for these models are prefixed with “PREDICTED: LOW QUALITY PROTEIN.” There may be defects in the assembly and/or the model in these cases.

list Order 174 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[171]

Gnomon models that have no predicted CDS or a short CDS with no supporting alignments may be annotated as non-coding models or removed from the annotation.

list Order 175 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[172]

When multiple assemblies are annotated, a partial or imperfect model may be called coding because a complete model exists at the corresponding locus on one of the other annotated assemblies.

paragraph Order 177 word/document.xml:/w:document[1]/w:body[1]/w:p[174]

Gene and protein names are assigned based on the locus type, protein homology, and orthology information, and data from the Gene database, which may in turn be based on nomenclature from an external group such as the HUGO Gene Nomenclature Committee (HGNC). Predicted genes are evaluated for orthology to genes in a reference species using a pairwise comparison process based on protein alignments and local synteny information.

list Order 178 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[175]

If a likely ortholog can be determined, the gene symbol and name are transferred from the reference species, if applicable.

list Order 179 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[176]

If an ortholog cannot be determined, predicted genes are named based on the name of the most similar SwissProt protein, adding the suffix ‘-like’ to indicate the putative nature of the assignment.

list Order 180 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[177]

Predicted genes for which no name can be determined are assigned a generic gene and protein name of the form “uncharacterized LOC” plus the GeneID.
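
The three naming rules above amount to a fallback chain, sketched below. The `assign_name` function, its parameters, and the example names are hypothetical; how orthologs and the most similar SwissProt protein are actually determined is outside the scope of this sketch.

```python
def assign_name(gene_id, ortholog_name=None, best_swissprot_name=None):
    """Choose a gene name following the fallback order described above."""
    if ortholog_name:
        # A likely ortholog was found: transfer its name.
        return ortholog_name
    if best_swissprot_name:
        # No ortholog: name after the most similar SwissProt protein.
        return f"{best_swissprot_name}-like"
    # Nothing to go on: generic name built from the GeneID.
    return f"uncharacterized LOC{gene_id}"


# Example names and GeneIDs are made up for illustration.
print(assign_name(1001, ortholog_name="example kinase 1"))
print(assign_name(1002, best_swissprot_name="example transporter"))
print(assign_name(1003))
```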

heading Order 182 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[179]

H4. Assignment of GeneIDs

paragraph Order 183 word/document.xml:/w:document[1]/w:body[1]/w:p[180]

Genes in the final set of models are assigned GeneIDs in the Gene database.

list Order 184 word/document.xml:/w:document[1]/w:body[1]/w:p[181]

A gene represented by at least one known RefSeq transcript receives the GeneID of the RefSeq transcript(s).

list Order 185 word/document.xml:/w:document[1]/w:body[1]/w:p[182]

Genes mapped from a previous annotation (see Re-annotation below) are assigned the same GeneIDs as in the previous annotation.

list Order 186 word/document.xml:/w:document[1]/w:body[1]/w:p[183]

Genes that are not mapped from a previous annotation and genes that are represented by Gnomon models only are assigned new GeneIDs.

list Order 187 word/document.xml:/w:document[1]/w:body[1]/w:p[184]

Genes mapped to equivalent locations on co-annotated assemblies are assigned the same GeneIDs (see Annotation of multiple assemblies).
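
A minimal sketch of the first three rules follows, assuming that a known RefSeq GeneID takes precedence over a GeneID mapped from a previous annotation (the text does not state the order explicitly). The dictionary keys, the `assign_gene_id` helper, and the starting value for new GeneIDs are hypothetical. Sharing GeneIDs across co-annotated assemblies is not modeled.

```python
import itertools


def assign_gene_id(gene, new_ids):
    """Pick a GeneID for one gene in the final model set.

    `gene` is a hypothetical dict with optional keys:
      - "refseq_gene_id": GeneID of a known RefSeq transcript for the gene
      - "previous_gene_id": GeneID of the matching gene in the previous annotation
    `new_ids` is an iterator yielding unused GeneIDs.
    """
    if gene.get("refseq_gene_id") is not None:
        return gene["refseq_gene_id"]
    if gene.get("previous_gene_id") is not None:
        return gene["previous_gene_id"]
    return next(new_ids)


new_ids = itertools.count(900000001)  # arbitrary starting point for the sketch
genes = [
    {"refseq_gene_id": 101},
    {"previous_gene_id": 202},
    {},
    {},
]
assigned = [assign_gene_id(g, new_ids) for g in genes]
print(assigned)
```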

heading Order 188 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[185]

H4. Packaging of the annotation

paragraph Order 189 word/document.xml:/w:document[1]/w:body[1]/w:p[186]

The output of the annotation pipeline is labelled with an Annotation Release number. For a given annotation, the combination of organism and Annotation Release number (e.g., NCBI Homo sapiens Annotation Release 105) is used throughout NCBI as a way to uniquely identify annotation products originating from the same annotation run.

paragraph Order 191 word/document.xml:/w:document[1]/w:body[1]/w:p[188]

The annotation pipeline output is composed of the scaffolds and the chromosomes of the assembled genome(s) annotated with the genes, RNAs, and proteins as features, together with the RNAs and proteins themselves. The RefSeq scaffolds are assigned accessions with NW_ or NT_ prefixes and the chromosomes are assigned accessions with NC_ prefixes; both are submitted to the Nucleotide database with the features annotated. Sequences submitted to the sequence databases are labelled with the Annotation Release (Figure 5).

paragraph Order 194 word/document.xml:/w:document[1]/w:body[1]/w:p[191]

The annotated products may include known RefSeq transcripts and proteins, Gnomon-predicted models, and tRNA genes predicted by tRNAscan-SE. The Gnomon models retained by the best-model selection process are submitted to the Nucleotide, Protein, and Gene databases, and the tRNA genes are submitted to Gene. The known RefSeq features are updated independently of the annotation process and are not re-submitted to the sequence or Gene databases (see Access section below). The origin of the annotation can be deduced from the /note qualifier on the feature annotated on the genomic sequences (Table 1).

paragraph Order 196 word/document.xml:/w:document[1]/w:body[1]/w:p[193]

For transcripts and proteins produced by Gnomon, the sequence records provide the level of support for predicted models. For low-quality proteins, the records also detail the difference between the model and the genomic sequences that was introduced to compensate for a possible error in the assembly (Figure 6).

paragraph Order 197 word/document.xml:/w:document[1]/w:body[1]/w:p[194]

As explained above, a known RefSeq transcript may not align perfectly to the genome but may be selected as a gene representative in the set of annotation products. These discrepancies are noted on the genomic sequence records (Figure 7).

heading Order 200 Level 3 Style Heading3 word/document.xml:/w:document[1]/w:body[1]/w:p[197]

H3. Special considerations

heading Order 201 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[198]

H4. Annotation of multiple assemblies

paragraph Order 202 word/document.xml:/w:document[1]/w:body[1]/w:p[199]

When multiple assemblies of good quality are available for a given organism, the annotation of all is done in coordination. To ensure that matching regions in multiple assemblies are annotated consistently, assemblies are mapped to each other using a BLAST-based process prior to the annotation. The reciprocal best hits are used to pair corresponding regions on two assemblies.
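
Pairing regions by reciprocal best hits can be sketched as below. The hit tuples, region labels, and scores are invented for illustration, and the helper names are hypothetical; the actual BLAST-based process and its scoring are not described in this chapter at this level of detail.

```python
def best_hits(hits):
    """Return the highest-scoring target for each query.

    `hits` is an iterable of (query, target, score) tuples.
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pair regions that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hits between assembly A and assembly B (labels and scores are made up).
a_vs_b = [
    ("A_region1", "B_region1", 950),
    ("A_region1", "B_region2", 400),
    ("A_region2", "B_region2", 800),
    ("A_region3", "B_region2", 700),
]
b_vs_a = [
    ("B_region1", "A_region1", 940),
    ("B_region2", "A_region2", 810),
    ("B_region2", "A_region3", 690),
]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data `A_region3` has a best hit on `B_region2`, but `B_region2` prefers `A_region2`, so `A_region3` is left unpaired.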

paragraph Order 203 word/document.xml:/w:document[1]/w:body[1]/w:p[200]

As shown in Figure 8, these paired regions allow the coordinated ranking of the alignments of a given transcript on both assemblies.

paragraph Order 204 word/document.xml:/w:document[1]/w:body[1]/w:p[201]

This strategy ensures that mapped regions are annotated the same way and that the same genes are assigned the same GeneID and locus type on both assemblies. It reduces the redundancy in the Gene set for a given organism and helps navigation between multiple assemblies. Note that for Gnomon models, although a single GeneID represents the locus in multiple assemblies, a different transcript and protein accession is instantiated for each individual assembly.

paragraph Order 205 word/document.xml:/w:document[1]/w:body[1]/w:p[202]

For more on the assembly-assembly alignment process, see the Remapping Service chapter.

heading Order 207 Level 4 Style Heading4 word/document.xml:/w:document[1]/w:body[1]/w:p[204]

H4. Re-annotation

paragraph Order 208 word/document.xml:/w:document[1]/w:body[1]/w:p[205]

Special attention is given to tracking of models and genes from one release of the annotation to the next. Previous and current models annotated at overlapping genomic locations are identified and locus type and GeneID of the previous models are taken into consideration when assigning GeneIDs to the new models. If the assembly was updated between the two rounds of annotation, the assemblies are aligned to each other and the alignments used to match previous and current models in mapped regions.

heading Order 209 Level 2 Style Heading2 word/document.xml:/w:document[1]/w:body[1]/w:p[206]

H2. Access

paragraph Order 210 word/document.xml:/w:document[1]/w:body[1]/w:p[207]

The status of annotation runs in progress or completed recently is updated nightly on the Eukaryotic Genome Annotation Pipeline public page:

paragraph Order 211 word/document.xml:/w:document[1]/w:body[1]/w:p[208]

http://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/

paragraph Order 212 word/document.xml:/w:document[1]/w:body[1]/w:p[209]

This page provides links to the resources where data for a specific Annotation Release is available (Figure 9).

paragraph Order 215 word/document.xml:/w:document[1]/w:body[1]/w:p[212]

Products of NCBI’s eukaryotic annotation pipeline are available in several resources (Table 2) including:

list Order 216 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[213]

In the Nucleotide and Protein databases

list Order 217 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[214]

In the Gene database

list Order 218 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[215]

On the FTP site in GFF, FASTA, GenBank flat file, and ASN.1 formats

list Order 219 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[216]

As Map Viewer tracks

list Order 220 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[217]

In BLAST databases available from organism-specific BLAST pages

list Order 221 Style ListParagraph word/document.xml:/w:document[1]/w:body[1]/w:p[218]

In the Consensus CDS project (CCDS)

heading Order 222 Level 3 Style Heading3 word/document.xml:/w:document[1]/w:body[1]/w:p[219]

H3. Future development: annotation reports

paragraph Order 223 word/document.xml:/w:document[1]/w:body[1]/w:p[220]

The quality of the end-products produced by the Eukaryotic Genome Annotation Pipeline is highly dependent on the quality of the assembly and on the amount and quality of same-species or close cross-species evidence.

paragraph Order 224 word/document.xml:/w:document[1]/w:body[1]/w:p[221]

To facilitate the users' understanding of the annotation process and provide context for the annotation results, NCBI will start publishing reports for each annotation run by the end of 2013. These reports will include a description of the assemblies that were annotated and summary counts of the products of the annotation. Additionally, intermediate statistics summarizing which transcripts and protein sets were used and how well the evidence aligned to the genomes will be provided.

heading Order 226 Level 2 Style Heading2 word/document.xml:/w:document[1]/w:body[1]/w:p[223]

H2. References

reference Order 227 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[224]

1. Kapustin Y, Souvorov A, Tatusova T and Lipman D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct. 2008 May 21;3:20.

reference Order 228 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[225]

2. Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012 Jan;40(Database issue):D130-5.

reference Order 229 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[226]

3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.

reference Order 230 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[227]

4. Yeh RF, Lim LP, Burge CB. Computational inference of homologous gene structures in the human genome. Genome Res. 2001 May;11(5):803-16.

reference Order 231 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[228]

5. Burge C and Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997 Apr 25;268(1):78-94.

reference Order 232 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[229]

6. Wheelan SJ, Church DM and Ostell JM. Spidey: a tool for mRNA-to-genomic alignments. Genome Res. 2001 Nov;11(11):1952-7.

reference Order 233 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[230]

7. Needleman SB and Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970 Mar;48(3):443-53.

reference Order 234 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[231]

8. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15;89(22):10915-9

reference Order 235 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[232]

9. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003 Oct 1;31(19):5654-66.

reference Order 236 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[233]

10. Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 2):505-19.

reference Order 237 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[234]

11. Zhang MQ and Marr TG. A weight array method for splicing signal analysis. Comput Appl Biosci. 1993 Oct;9(5):499-509.

reference Order 238 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[235]

12. Kozak M. Compilation and analysis of sequences upstream from the translational start site in eukaryotic mRNAs. Nucleic Acids Res. 1984 Jan 25;12(2):857-72.

reference Order 239 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[236]

13. Borodovsky M and McIninch J. GenMark: Parallel gene recognition for both DNA strands. Computers & Chemistry. 1993;17(2):123-33.

reference Order 240 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[237]

14. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006 Jan 15;22(2):134-41

reference Order 241 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[238]

15. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org

reference Order 242 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[239]

16. Lowe TM and Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64.

reference Order 243 Style Reference word/document.xml:/w:document[1]/w:body[1]/w:p[240]

17. Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D109-11.

heading Order 245 Level 2 Style FiguresTablesBoxesSectionHeading word/document.xml:/w:document[1]/w:body[1]/w:p[242]

H2. [figs-and-tables] Figures, Tables and Boxes Appendix (do not delete)

paragraph Order 246 Style Comment word/document.xml:/w:document[1]/w:body[1]/w:p[243]

Place numbered figures, tables and boxes (referred to from the main text) below.

paragraph Order 247 Style Comment word/document.xml:/w:document[1]/w:body[1]/w:p[244]

“In-line” figures (e.g. equations) and tables should be placed within the main text in their desired final location.

paragraph Order 248 Style Comment word/document.xml:/w:document[1]/w:body[1]/w:p[245]

Boxes can have a single level of sections; the titles for these sections should be marked up in “Box subhead” style.

paragraph Order 250 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[247]

figure Order 251 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[248]

Figure. Figure 1. Splign reduces the computational complexity by using the high identity portions of the hits (dark blue) for the bulk of the alignment and realigning only small portions of the transcript (light blue).

paragraph Order 253 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[250]

figure Order 254 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[251]

Figure. Figure 2. Combining the alignments with the same introns into one alignment (micro chaining) reduces the computational complexity.

paragraph Order 256 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[253]

figure Order 257 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[254]

Figure. Figure 3. Partial chains a and b produced by Chainer may be combined into one chain, c, by addition of the HMM prediction of missing coding sequence. In blue: coding sequence. In green: untranslated region.

paragraph Order 259 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[256]

figure Order 260 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[257]

Figure. Figure 4. Overview of the process flow in the Eukaryotic Genome Annotation Pipeline. In grey: genomic sequence preparation; in blue: alignments of transcripts; in green: alignment of proteins; in orange: alignment of short reads; in pink: alignment of curated genomic sequences (if available); in brown: gene prediction based on all available alignments; in red: internal tracking database of RefSeq sequences; in purple: selection of the best models and protein naming; in yellow: formatting of annotation sets for deployment to public resources.

paragraph Order 262 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[259]

figure Order 263 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[260]

Figure. Figure 5. Typical RefSeq record for a scaffold annotated by the Eukaryotic Genome Annotation Pipeline. (A) Links to the RefSeq BioProject and RefSeq assembly. (B) The comment field is prefixed with REFSEQ INFORMATION, and provides a link to the GenBank sequence on which the record is based. (C) The Genome Annotation structured comment provides the Annotation Release number and other information relating to the annotation process. ‘Annotation Status:: Full annotation’ and ‘Annotation Method:: Best-placed RefSeq; Gnomon’ indicate that the annotation used the placement of RefSeq sequences and Gnomon prediction as the source for the annotation.

paragraph Order 265 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[262]

figure Order 266 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[263]

Figure. Figure 6. Example of a RefSeq record for a transcript model predicted by Gnomon. (A) The title in the DEFINITION line is prefixed with PREDICTED. (B) The comment field is prefixed with MODEL REFSEQ, indicates the gene prediction method, and refers to the genomic sequence on which the model is annotated. (C) The note on the gene indicates the type and amount of supporting evidence for the model. (D) The note on the CDS describes the modification that was made relative to the genomic sequence to produce the model. (E) The product name is prefixed with LOW QUALITY PROTEIN.

paragraph Order 268 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[265]

figure Order 269 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[266]

Figure. Figure 7. Example of a known RefSeq transcript annotated on a genomic scaffold. (A) The note on the gene indicates that the gene was annotated by projection of a best-placed RefSeq transcript on the genome. (B) The inference identifies the RefSeq transcript from which the annotation is inferred. (C) The note describes the alignment of the known transcript to the genomic sequence.

paragraph Order 271 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[268]

figure Order 272 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[269]

Figure. Figure 8. Ranking of alignments across multiple assemblies. Alignments of a given transcript are represented in red to Assembly 1 and in green to Assembly 2. If a genomic alignment exists between two regions harboring a transcript alignment (light blue parallelograms), the alignments in the paired regions are placed in the same group (Group 1 and Group 3). All alignments in a given group are given the same rank, different from the rank of other groups, based on the quality of the alignments.
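
The grouping described in the Figure 8 caption can be sketched as follows. This is a minimal illustration and not the pipeline's implementation: the function name `rank_alignment_groups`, the numeric quality scores, and the rule that a group is ordered by its best-scoring member are all assumptions made for the example. The caption only states that alignments in regions paired by a genomic alignment share a group, and that each group receives a single rank based on alignment quality.

```python
from collections import defaultdict


def rank_alignment_groups(alignments, genomic_links):
    """Group transcript alignments and assign one rank per group.

    alignments: dict mapping an alignment id to a quality score
        (hypothetical scale; higher is assumed to be better).
    genomic_links: iterable of (id_a, id_b) pairs whose regions are
        connected by a genomic alignment between the two assemblies.
    Returns a dict mapping alignment id to rank (1 = best group).
    """
    # Union-find: every alignment starts in its own group.
    parent = {a: a for a in alignments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Alignments in paired regions are merged into the same group.
    for a, b in genomic_links:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for a in alignments:
        groups[find(a)].append(a)

    # Assumption: a group is ordered by its best member score;
    # ties are broken by the smallest member id for determinism.
    ordered = sorted(
        groups.values(),
        key=lambda g: (-max(alignments[m] for m in g), min(g)),
    )
    return {m: rank for rank, g in enumerate(ordered, start=1) for m in g}


# Placeholder ids and scores, for illustration only.
scores = {"asm1_x": 0.99, "asm2_x": 0.97, "asm1_y": 0.90, "asm2_y": 0.95, "asm1_z": 0.80}
links = [("asm1_x", "asm2_x"), ("asm1_y", "asm2_y")]
ranks = rank_alignment_groups(scores, links)
```

In this example the two linked pairs form two groups and the unlinked alignment forms a third, so every member of a pair shares its partner's rank.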

paragraph Order 274 Style Figuregraphic word/document.xml:/w:document[1]/w:body[1]/w:p[271]

figure Order 275 Style Figurenumberandcaption word/document.xml:/w:document[1]/w:body[1]/w:p[272]

Figure. Figure 9. Public report of annotation runs in progress (A) and recently completed annotation runs (B). Information in the tables is linked to the Taxonomy database (Species), the Assembly database (RefSeq Assemblies), and resources where the data is available (Links). For each annotation run, the name of the Annotation Release, the Freeze date when the input data used for the annotation was fetched, and the Release date when the annotation was first made public are also provided.

table Order 283 word/document.xml:/w:document[1]/w:body[1]/w:tbl[4]

Table caption. Table 1. Guide to the features annotated on scaffolds and chromosomes. The note provides information on the origin of the feature. *For predicted models, the note is also on the records of individual annotation products.

Annotated product	Accession prefix	Origin of the product	Note provided for the feature annotated on scaffold and chromosome records*
Known transcripts/proteins	NM_, NR_, NP_	Curated RefSeq genomic alignment	Derived by automated computational analysis using gene prediction method: Curated Genomic
Known transcripts/proteins	NM_, NR_, NP_	Known RefSeq transcript alignment	Derived by automated computational analysis using gene prediction method: BestRefseq
Model transcripts/proteins	XM_, XR_, XP_	Gnomon	Derived by automated computational analysis using gene prediction method: Gnomon
tRNAs	no accession	tRNAscan-SE	tRNA features were annotated by tRNAscan-SE
RefSeq non-transcribed pseudogenes	no accession	Curated RefSeq genomic alignment	Derived by automated computational analysis using gene prediction method: Curated Genomic
Gnomon non-transcribed pseudogenes	no accession	Gnomon	Derived by automated computational analysis using gene prediction method: Gnomon
Full set of Gnomon predictions	no accession	Gnomon	Not in the sequence database. Available on the FTP site and as BLAST databases.

Table footprint: 8 rows, 32 cells

Attached table caption from preceding Word paragraph
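
The accession prefixes in Table 1 lend themselves to a simple lookup. The sketch below is illustrative only: the function name `classify_accession` and the "unknown" fallback are hypothetical, and the accessions in the usage lines are placeholders rather than real records. Note that the prefix alone cannot distinguish the two origins listed for known transcripts and proteins (curated RefSeq genomic alignment versus known RefSeq transcript alignment); that distinction is carried by the note on the feature.

```python
# Prefixes and category names copied from Table 1.
PREFIX_TO_PRODUCT = {
    "NM_": "Known transcripts/proteins",
    "NR_": "Known transcripts/proteins",
    "NP_": "Known transcripts/proteins",
    "XM_": "Model transcripts/proteins",
    "XR_": "Model transcripts/proteins",
    "XP_": "Model transcripts/proteins",
}


def classify_accession(accession: str) -> str:
    """Return the Table 1 product category for an accession.

    Falls back to "unknown" (an assumption of this sketch) when the
    three-character prefix is not listed in Table 1.
    """
    return PREFIX_TO_PRODUCT.get(accession[:3].upper(), "unknown")


# Placeholder accessions, for illustration only.
known = classify_accession("NM_000000.1")
model = classify_accession("XP_000000.1")
other = classify_accession("ABC12345")
```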

table Order 286 word/document.xml:/w:document[1]/w:body[1]/w:tbl[5]

Table caption. Table 2. Availability of annotation products in NCBI resources.

Annotation products	In sequence databases	In Gene	In BLAST database	On the FTP site	In a Map Viewer track
Chromosomes	Yes	Yes	Yes	Yes	Yes
Scaffolds	Yes	No	Yes	Yes	Yes
Curated RefSeq transcripts and proteins	Yes	Yes	Yes	Yes	Yes
Predicted transcripts and proteins	Yes	Yes	Yes	Yes	Yes
tRNA	No	Yes	No	Yes	Yes
Ab initio Gnomon models	No	No	Yes	Yes	Yes

Table footprint: 7 rows, 42 cells

Attached table caption from preceding Word paragraph

References Reference list and validation

Review the reference list in document order, confirm whether each entry is cited in text, and inspect local citation matching plus DOI or PubMed-backed validation details from one place.

Normalized reference fields are shown for authors, year, title, DOI, PMID, and external validation status.

Local citation coverage

Counts below summarize citations resolved to these reference entries.

Validated: 8 | Needs review: 17 | Not found: 0 | Not applicable: 0

External outcomes

PubMed and DOI validation outcomes are listed on each reference row below.

Validated: 1 | Probable: 0 | Ambiguous: 0 | Not found: 17 | Not applicable: 7 | Pending: 0

reference Order 1 Matched in text not found

Kapustin Y, Souvorov A, Tatusova T and Lipman D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct. 2008 May 21;3:20.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (4)

Citation I Needs review

1 | Style: numeric | Normalized: 1

Detected in block order 17: Core components of the pipeline are the alignment programs Splign (1) and ProSplign, an...

Citation VIII Needs review

1 | Style: numeric | Normalized: 1

Detected in block order 39: In its infancy, NCBI’s Eukaryotic Genome Annotation Pipeline was a semi-manual proces...

Citation IX Needs review

1 | Style: numeric | Normalized: 1

Detected in block order 44: Both Splign (1) and ProSplign are global alignment tools that enable alignment of trans...

Citation X Needs review

1 | Style: numeric | Normalized: 1

Detected in block order 60: Splign (1) is a tool for aligning spliced cDNA sequences against their genomic counterp...

Authors: Kapustin Y, Souvorov A, Tatusova T and Lipman D | Year: 2008 | Title: Splign: algorithms for computing spliced alignments with identification of paralogs

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 2 Matched in text not found

Pruitt KD, Tatusova T, Brown GR, Maglott DR. Nucleic Acids Res. 2012 Jan;40(Database issue):D130-5

At least one in-text citation resolves to this reference entry.

Matched in-text citations (3)

Citation II Needs review

2 | Style: numeric | Normalized: 2

Detected in block order 19: RefSeq curated annotated genomic sequences (2), such as the human beta globin gene clus...

Citation III Needs review

2 | Style: numeric | Normalized: 2

Detected in block order 20: Known RefSeq transcripts (2)

Citation XX Needs review

2 | Style: numeric | Normalized: 2

Detected in block order 106: In addition, if available for the annotated organism, curated RefSeq genomic sequences...

Authors: Pruitt KD, Tatusova T, Brown GR, Maglott DR | Year: 2012 | Title: Nucleic Acids Res

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 3 Matched in text validated

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation IV Validated

3 | Style: numeric | Normalized: 3

Detected in block order 39: In its infancy, NCBI’s Eukaryotic Genome Annotation Pipeline was a semi-manual proces...

Authors: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ | Year: 1990 | Title: Basic local alignment search tool

DOI: n/a | PMID: 2231712

External validation: validated via title_author_year_search

A PubMed-backed source record was validated for this matched reference.

Basic local alignment search tool. | Altschul SF, Gish W, Miller W, et al | 1990 | PMID 2231712

Parsed author summary differs from the returned PubMed authors.

reference Order 4 Matched in text not found

Yeh RF, Lim LP, Burge CB. Computational inference of homologous gene structures in the human genome. Genome Res. 2001 May;11(5):803–816.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation V Needs review

4 | Style: numeric | Normalized: 4

Detected in block order 39: In its infancy, NCBI’s Eukaryotic Genome Annotation Pipeline was a semi-manual proces...

Authors: Yeh RF, Lim LP, Burge CB | Year: 2001 | Title: Computational inference of homologous gene structures in the human genome

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 5 Matched in text not applicable

Burge C and Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997 Apr 25;268(1):78-94.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (2)

Citation VI Validated

5 | Style: numeric | Normalized: 5

Detected in block order 39: In its infancy, NCBI’s Eukaryotic Genome Annotation Pipeline was a semi-manual proces...

Citation XV Validated

5 | Style: numeric | Normalized: 5

Detected in block order 86: The core algorithm of the ab initio prediction capability of Gnomon is based on Genscan...

Authors: n/a | Year: 1997 | Title: Burge C and Karlin S

DOI: n/a | PMID: n/a

External validation: not applicable via skipped

This matched reference is not currently expected to validate against PubMed.

External validation was recorded without a normalized source summary

Reference looks non-PubMed-oriented or lacks enough normalized metadata for lookup.

reference Order 6 Matched in text not found

Wheelan SJ, Church DM and Ostell JM. Spidey: A Tool for mRNA-to-Genomic Alignments. Genome Res. 2001 Nov;11(11):1952–1957.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation VII Needs review

6 | Style: numeric | Normalized: 6

Detected in block order 39: In its infancy, NCBI’s Eukaryotic Genome Annotation Pipeline was a semi-manual proces...

Authors: Wheelan SJ, Church DM and Ostell JM | Year: 2001 | Title: Spidey: A Tool for mRNA-to-Genomic Alignments

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 7 Matched in text not applicable

Needleman SB and Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970 Mar;48(3):443-53.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (2)

Citation XI Validated

7 | Style: numeric | Normalized: 7

Detected in block order 68: Protein alignments are produced by ProSplign. Similarly to Splign, ProSplign is a globa...

Citation XXV Validated

7 | Style: numeric | Normalized: 7

Detected in block order 269: Figure 7. Example of a known RefSeq transcript annotated on a genomic scaffold. (A) The...

Authors: n/a | Year: 1970 | Title: Needleman SB and Wunsch CD

DOI: n/a | PMID: n/a

External validation: not applicable via skipped

This matched reference is not currently expected to validate against PubMed.

External validation was recorded without a normalized source summary

Reference looks non-PubMed-oriented or lacks enough normalized metadata for lookup.

reference Order 8 Matched in text not found

Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15;89(22):10915-9

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XII Needs review

8 | Style: numeric | Normalized: 8

Detected in block order 72: where is the score for an ungapped part of the alignment calculated using a BLOSUM62 ma...

Authors: Henikoff S, Henikoff JG | Year: 1992 | Title: Amino acid substitution matrices from protein blocks

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 9 Matched in text not found

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003 Oct 1;31(19):5654-66.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XIII Needs review

9 | Style: numeric | Normalized: 9

Detected in block order 81: These “micro-chains” are then combined by Chainer with cDNA and protein alignments...

Authors: Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O | Year: 2003 | Title: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 10 Matched in text not found

Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 2):505-19.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (2)

Citation XIV Needs review

10 | Style: numeric | Normalized: 10

Detected in block order 81: These “micro-chains” are then combined by Chainer with cDNA and protein alignments...

Citation XVI Needs review

10 | Style: numeric | Normalized: 10

Detected in block order 89: Gnomon recognizes as HMM states coding exons and introns on both strands and intergenic...

Authors: Staden R | Year: 1984 | Title: Computer methods to locate signals in nucleic acid sequences

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 11 Matched in text not applicable

Zhang MQ and Marr TG. A weight array method for splicing signal analysis. Comput Appl Biosci. 1993 Oct;9(5):499-509.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XVII Validated

11 | Style: numeric | Normalized: 11

Detected in block order 89: Gnomon recognizes as HMM states coding exons and introns on both strands and intergenic...

Authors: n/a | Year: 1993 | Title: Zhang MQ and Marr TG

DOI: n/a | PMID: n/a

External validation: not applicable via skipped

This matched reference is not currently expected to validate against PubMed.

External validation was recorded without a normalized source summary

Reference looks non-PubMed-oriented or lacks enough normalized metadata for lookup.

reference Order 12 Matched in text not found

Kozak M. Compilation and analysis of sequences upstream from the translational start site in eukaryotic mRNAs. Nucleic Acids Res. 1984 Jan 25;12(2):857-72.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XVIII Needs review

12 | Style: numeric | Normalized: 12

Detected in block order 89: Gnomon recognizes as HMM states coding exons and introns on both strands and intergenic...

Authors: Kozak M | Year: 1984 | Title: Compilation and analysis of sequences upstream from the translational start site in eukaryotic mRNAs

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 13 Matched in text not applicable

Borodovsky M and McIninch J. GenMark: Parallel gene recognition for both DNA strands. Computers & Chemistry. 1993;17(2):123-33.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XIX Validated

13 | Style: numeric | Normalized: 13

Detected in block order 89: Gnomon recognizes as HMM states coding exons and introns on both strands and intergenic...

Authors: n/a | Year: 1993 | Title: Borodovsky M and McIninch J

DOI: n/a | PMID: n/a

External validation: not applicable via skipped

This matched reference is not currently expected to validate against PubMed.

External validation was recorded without a normalized source summary

Reference looks non-PubMed-oriented or lacks enough normalized metadata for lookup.

reference Order 14 Matched in text not found

Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006 Jan 15;22(2):134-41

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XXI Needs review

14 | Style: numeric | Normalized: 14

Detected in block order 114: The assemblies are retrieved from the Assembly resource and masked using either WindowM...

Authors: Morgulis A, Gertz EM, Schäffer AA, Agarwala R | Year: 2006 | Title: WindowMasker: window-based masker for sequenced genomes

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 15 Matched in text not found

Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XXII Needs review

15 | Style: numeric | Normalized: 15

Detected in block order 114: The assemblies are retrieved from the Assembly resource and masked using either WindowM...

Authors: Smit AFA, Hubley R, Green P | Year: 2004 | Title: RepeatMasker Open-3.0

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

reference Order 16 Matched in text not applicable

Lowe TM and Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XXIII Validated

16 | Style: numeric | Normalized: 16

Detected in block order 150: tRNAs are annotated using tRNAScan-SE (16). Other small RNAs are annotated by placement...

Authors: n/a | Year: 1997 | Title: Lowe TM and Eddy SR

DOI: n/a | PMID: n/a

External validation: not applicable via skipped

This matched reference is not currently expected to validate against PubMed.

External validation was recorded without a normalized source summary

Reference looks non-PubMed-oriented or lacks enough normalized metadata for lookup.

reference Order 17 Matched in text not found

Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D109-11.

At least one in-text citation resolves to this reference entry.

Matched in-text citations (1)

Citation XXIV Needs review

17 | Style: numeric | Normalized: 17

Detected in block order 150: tRNAs are annotated using tRNAScan-SE (16). Other small RNAs are annotated by placement...

Authors: Griffiths-Jones S | Year: 2004 | Title: The microRNA Registry

DOI: n/a | PMID: n/a

External validation: not found via title_author_year_search

No PubMed-backed source match was found for this matched reference.

External validation was recorded without a normalized source summary

No PubMed search candidates were returned for the normalized reference metadata.

Citation issues Open citation issues

No open citation issues were recorded for this document.