GTF and GFF file format

GFF file format:

Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

<seqname>

The name of the sequence. Having an explicit sequence name allows a feature file to be prepared for a data set of multiple sequences. Normally the seqname will be the identifier of the sequence in an accompanying fasta format file. An alternative is that <seqname> is the identifier for a sequence in a public database, such as an EMBL/Genbank/DDBJ accession number. Which is the case, and which file or database to use, should be explained in accompanying information.

<source>

The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc.

<feature>

The feature type name. We hope to suggest a standard set of features, to facilitate import/export, comparison, etc. Of course, people are free to define new ones as needed. For example, Genie splice detectors account for a region of DNA, and multiple detectors may be available for the same site, as shown above.

We would like to enforce a standard nomenclature for common GFF features. This does not forbid the use of other features; it only means that if a feature is obviously described in the standard list, the standard label should be used. For this standard table we propose to fall back on the international public standards for genomic database feature annotation, specifically the DDBJ/EMBL/GenBank feature table documentation.

<start>, <end>

Integers. <start> must be less than or equal to <end>. Sequence numbering starts at 1, so these numbers should be between 1 and the length of the relevant sequence, inclusive. (Version 2 change: version 2 condones values of <start> and <end> that extend outside the reference sequence. This is often more natural when dumping from acedb, rather than clipping. It means that some software using the files may need to clip for itself.)
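The coordinate convention above (1-based, inclusive, with out-of-range values tolerated in version 2) can be sketched in Python as follows. This is only an illustration; the function names are hypothetical and not part of the GFF specification.

```python
from typing import Tuple


def feature_length(start: int, end: int) -> int:
    """Length of a feature given 1-based, inclusive GFF coordinates."""
    if start > end:
        raise ValueError("<start> must be less than or equal to <end>")
    return end - start + 1


def clip_to_sequence(start: int, end: int, seq_len: int) -> Tuple[int, int]:
    """Clip coordinates that extend outside the sequence (allowed in version 2)."""
    return max(start, 1), min(end, seq_len)


def extract(sequence: str, start: int, end: int) -> str:
    """Return the bases covered by a feature, clipping to the sequence first."""
    start, end = clip_to_sequence(start, end, len(sequence))
    # 1-based inclusive [start, end] corresponds to the Python slice [start-1:end].
    return sequence[start - 1:end]
```

For example, a feature from 380 to 401 covers 22 bases, and extract("ACGTACGT", 2, 4) returns "CGT".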

<score>

A floating point value. When there is no score (i.e. for a sensor that just records the possible presence of a signal, as for the EMBL features above) you should use '.'. (Version 2 change: in version 1 of GFF you had to write 0 in such circumstances.)

<strand>

One of '+', '-' or '.'. '.' should be used when strand is not relevant, e.g. for dinucleotide repeats. Version 2 change: This field is set to '.' (i.e. left empty) for RNA and protein features.

<frame>

One of '0', '1', '2' or '.'. '0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is value of <end>, because the corresponding coding region will run from <end> to <start> on the reverse strand. As with <strand>, if the frame is not relevant then set <frame> to '.'. It has been pointed out that "phase" might be a better descriptor than "frame" for this field.

Version 2 change: This field is set to '.' (i.e. left empty) for RNA and protein features.
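As a rough illustration of the frame rule, the following Python sketch computes the coordinate of the first base of the first complete codon in a region. The function name is hypothetical; the logic simply restates the description above, including the reverse-strand case.

```python
def first_codon_base(start: int, end: int, strand: str, frame: int) -> int:
    """Coordinate of the first base of the first complete codon in a region."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    if strand == "-":
        # On the reverse strand the region is read from <end> down to <start>,
        # so the extra bases are skipped from the high-coordinate side.
        return end - frame
    return start + frame
```

With frame 2 on the forward strand, a region starting at 501 has its first complete codon at 503; on the reverse strand, a region ending at 199752 with frame 2 has it at 199750.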

[attributes]

From version 2 onwards, the attribute field must have a tag-value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Tags must be standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must be quoted with double quotes. Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc.) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t'). As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but acedb is not required). Examples of these would be:

seq1     BLASTX  similarity   101  235 87.1 + 0  Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"

The semantics of tags in attribute field tag-value pairs have intentionally not been formalized. Two useful guidelines are to use DDBJ/EMBL/GenBank feature 'qualifiers' (see DDBJ/EMBL/GenBank feature table documentation), or the features that ACEDB generates when it dumps GFF.
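A minimal Python sketch of a parser for the version 2 attribute syntax is given here, assuming that semicolons inside double-quoted values are not separators. The function name and the decision to keep every value as a string are choices made for this illustration, not part of the specification.

```python
import re

# One token is a double-quoted string (with backslash escapes), a lone
# semicolon, or a run of characters without whitespace, quotes or semicolons.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(;)|([^\s;"]+)')


def parse_gff2_attributes(text):
    """Parse a GFF2 attribute field into {tag: [value, ...]}."""
    attributes = {}
    tag = None
    for match in _TOKEN.finditer(text):
        quoted, semicolon, bare = match.groups()
        if semicolon:
            tag = None                      # the next token starts a new tag
        elif tag is None:
            if quoted is not None:
                raise ValueError("a tag must be an unquoted identifier")
            tag = bare
            attributes.setdefault(tag, [])
        elif quoted is not None:
            attributes[tag].append(quoted)  # free text, escapes kept as written
        else:
            attributes[tag].append(bare)    # numbers are left as strings
    return attributes
```

Applied to the first example above, this yields the tag Target with the values "HBA_HUMAN", "11" and "55", and the tag E_value with the value "0.0003". Converting numeric values is left to the caller.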

Version 1 note: In version 1 the attribute field was called the group field, with the following specification:

An optional string-valued field that can be used as a name to group together a set of records. Typical uses might be to group the introns and exons in one gene prediction (or experimentally verified gene structure), or to group multiple regions of match to another sequence, such as an EST or a protein.

All of the above described fields should be separated by TAB characters ('\t'). All values of the mandatory fields should not include whitespace (i.e. the strings for <seqname>, <source> and <feature> fields).
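The field layout can be sketched in Python as below. This is a hedged example, not a reference parser: the GffRecord type and parse_gff_line name are invented for illustration, and the attribute text is left unparsed.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the field is '.'
    strand: str
    frame: str
    attributes: str          # raw text; empty if the optional field is absent


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-separated GFF feature line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    if strand not in ("+", "-", "."):
        raise ValueError("bad strand: %r" % strand)
    if frame not in ("0", "1", "2", "."):
        raise ValueError("bad frame: %r" % frame)
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand, frame,
        fields[8] if len(fields) > 8 else "",
    )
```

Comment lines and "##" meta lines are not feature lines and would need to be filtered out before calling this function.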

Version 1 note: In version 1 each string had to be under 256 characters long, and the whole line should be under 32k long. This was to make things easier for guaranteed conforming parsers, but seemed unnecessary given modern languages.

Comments

Comments are allowed, starting with "#" as in Perl, awk etc. Everything following # until the end of the line is ignored. Effectively this can be used in two ways. Either it must be at the beginning of the line (after any whitespace), to make the whole line a comment, or the comment could come after all the required fields on the line.
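One possible way to strip comments is sketched below in Python. Treating a "#" inside a double-quoted attribute value as ordinary text is an assumption of this sketch (the description above does not address that case), and the function name is hypothetical.

```python
def strip_comment(line):
    """Remove a '#' comment, ignoring any '#' inside a double-quoted value."""
    in_quotes = False
    escaped = False
    for i, ch in enumerate(line):
        if escaped:
            escaped = False            # this character was backslash-escaped
        elif ch == "\\" and in_quotes:
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes:
            return line[:i].rstrip()
    return line.rstrip()
```

A line that is entirely a comment comes back as an empty string. Lines starting with "##" also come back empty, so any meta-information lines should be examined before this function is applied.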

## comment lines for meta information

There is a set of standardised (i.e. parsable) ## line types that can be used optionally at the top of a gff file. The philosophy is a little like the special set of %% lines at the top of postscript files, used for example to give the BoundingBox for EPS files.

Current proposed ## lines are:

gff-version

##gff-version 2

GFF version - in case it is a real success and we want to change it. The current default version is 2, so if this line is not present version 2 is assumed.

source-version

##source-version <source> <version text>

So that people can record what version of a program or package was used to make the data in this file. I suggest the version is text without whitespace. That allows things like 1.3, 4a etc. There should be at most one source-version line per source.

date

##date <date>

The date the file was made, or perhaps that the prediction programs were run. We suggest to use astronomical format: 1997-11-08 for 8th November 1997, first because these sort properly, and second to avoid any US/European bias.
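A small Python sketch of reading such a line, and of the sorting property mentioned above, might look like this. The function name is hypothetical; date.fromisoformat is part of the Python standard library (3.7 and later).

```python
from datetime import date


def parse_date_line(line):
    """Return the date from a '##date <date>' line written as YYYY-MM-DD."""
    prefix = "##date"
    if not line.startswith(prefix):
        raise ValueError("not a ##date line")
    return date.fromisoformat(line[len(prefix):].strip())


stamps = ["1997-11-08", "1997-02-15", "1996-12-31"]
# Plain string sorting already gives chronological order for this format.
chronological = sorted(stamps)
```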

Type

##Type <type> [<seqname>]

The type of host sequence described by the features. Standard types are 'DNA', 'Protein' and 'RNA'. The optional <seqname> allows multiple ##Type definitions describing multiple GFF sets in one file, each of which have a distinct type. If the name is not provided, then all the features in the file are of the given type. Thus, with this meta-comment, a single file could contain DNA, RNA and Protein features, for example, representing a single genomic locus or 'gene', alongside type-specific features of its transcribed mRNA and translated protein sequences. If no ##Type meta-comment is provided for a given GFF file, then the type is assumed to be DNA.

DNA


 ##DNA <seqname>
 ##acggctcggattggcgctggatgatagatcagacgac
 ##...
 ##end-DNA

To give a DNA sequence. Several people have pointed out that it may be convenient to include the sequence in the file. It should not become mandatory to do so, and in our experience this has been very little used. Often the seqname will be a well-known identifier, and the sequence can easily be retrieved from a database, or an accompanying file.

RNA


 ##RNA <seqname>
 ##acggcucggauuggcgcuggaugauagaucagacgac
 ##...
 ##end-RNA

Similar to DNA. Creates an implicit ##Type RNA <seqname> directive.

Protein


 ##Protein <seqname>
 ##MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF
 ##...
 ##end-Protein

Similar to DNA. Creates an implicit ##Type Protein <seqname> directive.
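The embedded sequence blocks described above could be collected with a sketch like the following. The function name is hypothetical, and the code assumes that every line between the opening and closing directive carries sequence text after the "##" prefix (the "##..." lines in the samples are only placeholders for omitted sequence).

```python
def read_embedded_sequences(lines):
    """Collect ##DNA / ##RNA / ##Protein blocks into {seqname: (type, sequence)}."""
    sequences = {}
    current = None            # (type, seqname) while inside a block
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("##"):
            continue          # feature lines, comments and blank lines
        body = line[2:].strip()
        if current is None:
            parts = body.split()
            if len(parts) == 2 and parts[0] in ("DNA", "RNA", "Protein"):
                current = (parts[0], parts[1])
                chunks = []
        elif body == "end-" + current[0]:
            sequences[current[1]] = (current[0], "".join(chunks))
            current = None
        else:
            chunks.append(body)
    return sequences
```

The stored type can also serve as the implicit ##Type for that sequence name.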

sequence-region

##sequence-region <seqname> <start> <end>

To indicate that this file only contains entries for the specified subregion of a sequence.

Please feel free to propose new ## lines.
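As a hedged illustration, the single-line "##" directives listed above can be gathered with a few lines of Python. The function names and the layout of the returned dictionary are assumptions of this sketch; unknown directives are simply ignored, as a naive program would do.

```python
def parse_meta_line(line):
    """Return (name, arguments) for a '##' meta line, or None for other lines."""
    if not line.startswith("##"):
        return None
    parts = line[2:].split()
    if not parts:
        return None
    return parts[0], parts[1:]


def read_header(lines):
    """Collect single-line '##' directives; the gff-version defaults to 2."""
    header = {"gff-version": 2, "source-version": {}, "sequence-region": {}}
    for line in lines:
        meta = parse_meta_line(line)
        if meta is None:
            continue
        name, args = meta
        if name == "gff-version" and args:
            header["gff-version"] = int(args[0])
        elif name == "source-version" and len(args) >= 2:
            header["source-version"][args[0]] = " ".join(args[1:])
        elif name == "date" and args:
            header["date"] = args[0]
        elif name == "sequence-region" and len(args) == 3:
            header["sequence-region"][args[0]] = (int(args[1]), int(args[2]))
    return header
```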

The ## line proposal came out of some discussions including Anders Krogh, David Haussler, people at the Newton Institute on 1997-10-29 and some email from Suzanna Lewis. Of course, naive programs can ignore all of these...

File Naming

We propose that the format is called "GFF", with conventional file name ending ".gff".

Semantics

We have intentionally avoided overspecifying the semantics of the format. For example, we have not restricted the items expressible in GFF to a specified set of feature types (splice sites, exons etc.) with defined semantics. Therefore, in order for the information in a gff file to be useful to somebody else, the person producing the features must describe the meaning of the features.

In the example given above the feature "splice5" indicates that there is a candidate 5' splice site between positions 172 and 173. The "sp5-20" feature is a prediction based on a window of 20 bp for the same splice site. To use either of these, you must know the position within the feature of the predicted splice site. This only needs to be given once, possibly in comments at the head of the file, or in a separate document.

Another example is the scoring scheme; we ourselves would like the score to be a log-odds likelihood score in bits to a defined null model, but that is not required, because different methods take different approaches.

Avoiding a prespecified feature set also leaves open the possibility for GFF to be used for new feature types, such as CpG islands, hypersensitive sites, promoter/enhancer elements, etc.

GTF2 (gene transfer format) file format:

Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Here is a simple example with 3 translated exons. Order of rows is not important.

AB000381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";

The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.
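To turn a whitespace-aligned row like those in the example into a tab-separated line, a sketch along these lines may help. It assumes the first eight fields contain no whitespace; the function name is hypothetical.

```python
def normalize_gtf_line(line):
    """Rewrite a whitespace-aligned example row with single TABs between fields."""
    # Split on runs of whitespace at most 8 times so the ninth piece keeps
    # the attribute text (which legitimately contains spaces) in one field.
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 8 fixed fields followed by attributes")
    return "\t".join(fields)
```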

<seqname>
The FPC contig ID from the Golden Path.

<source>
The source column should be a unique label indicating where the annotations came from --- typically the name of either a prediction program or a public database.

<feature>
The following feature types are required: "CDS", "start_codon", "stop_codon". The feature "exon" is optional, since this project will not evaluate predicted splice sites outside of protein coding regions. All other features will be ignored.

CDS represents the coding sequence starting with the first translated codon and proceeding to the last translated codon. Unlike Genbank annotation, the stop codon is not included in the CDS for the terminal exon.
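Because the stop codon is excluded from the CDS, it should sit immediately 3' of the last CDS segment. A minimal Python check is sketched below, under the simplifying assumption that the stop codon is a single contiguous 3 bp feature; the function name is hypothetical.

```python
def stop_codon_follows_cds(cds_list, stop_codon, strand):
    """True if a contiguous 3 bp stop codon lies immediately 3' of the last CDS."""
    stop_start, stop_end = stop_codon
    if stop_end - stop_start + 1 != 3:
        return False
    if strand == "-":
        # On the reverse strand the 3' end of the coding region is its
        # smallest coordinate, so the stop codon ends one base below it.
        return stop_end == min(start for start, _ in cds_list) - 1
    return stop_start == max(end for _, end in cds_list) + 1
```

For the forward-strand example above, the CDS segments end at 707 and the stop codon occupies 708 to 710, so the check succeeds.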

<start> <end>
Integer start and end coordinates of the feature relative to the beginning of the sequence named in <seqname>. <start> must be less than or equal to <end>. Sequence numbering starts at 1. Values of <start> and <end> that extend outside the reference sequence are technically acceptable, but they are discouraged for purposes of this project.

<score>
The score field will not be used for this project, so you can either provide a meaningful float or replace it by a dot.

<frame>
0 indicates that the first whole codon of the reading frame is located at the 5'-most base. 1 means that there is one extra base before the first codon and 2 means that there are two extra bases before the first codon. Note that the frame is not the length of the CDS mod 3.

Here are the details excised from the GFF spec. Important: Note comment on reverse strand.

'0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is value of <end>, because the corresponding coding region will run from <end> to <start> on the reverse strand.

[attributes]
All four features have the same two mandatory attributes at the end of the record:

gene_id value; A globally unique identifier for the genomic source of the transcript
transcript_id value; A globally unique identifier for the predicted transcript.

These attributes are designed for handling multiple transcripts from the same genomic region. Any other attributes or comments must appear after these two and will be ignored.

Attributes must end in a semicolon which must then be separated from the start of any subsequent attribute by exactly one space character (NOT a tab character).

Textual attributes should be surrounded by doublequotes.
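The attribute rules above can be illustrated with a short Python sketch. It is deliberately lenient about the amount of whitespace between a tag and its value, stops at the first piece of text that does not look like an attribute (for example a trailing comment), and uses a hypothetical function name.

```python
import re

# tag, whitespace, then a double-quoted string or a bare value, then ';'
_ATTRIBUTE = re.compile(
    r'\s*([A-Za-z][A-Za-z0-9_]*)\s+(?:"([^"]*)"|([^\s;"]+))\s*;'
)


def parse_gtf_attributes(text):
    """Parse 'gene_id "x"; transcript_id "y";' into a dict of strings."""
    attributes = {}
    position = 0
    while position < len(text):
        match = _ATTRIBUTE.match(text, position)
        if match is None:
            break                      # trailing comments are ignored
        tag, quoted, bare = match.groups()
        attributes[tag] = quoted if quoted is not None else bare
        position = match.end()
    for required in ("gene_id", "transcript_id"):
        if required not in attributes:
            raise ValueError("missing mandatory attribute: " + required)
    return attributes
```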

Here is an example of a gene on the negative strand. Larger coordinates are 5' of smaller coordinates. Thus, the start codon is the 3 bp with the largest coordinates among all those bp that fall within the CDS regions. Similarly, the stop codon is the 3 bp with coordinates just less than the smallest coordinates within the CDS regions.

AB000123    Twinscan     CDS    193817    194022    .    -    2    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    199645    199752    .    -    2    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    200369    200508    .    -    1    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    215991    216028    .    -    0    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     start_codon   216026    216028    .    -    .    gene_id    "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     stop_codon    193814    193816    .    -    .    gene_id    "AB000123.1"; transcript_id "AB00123.1.2";

Note the frames of the coding exons. For example:

The first CDS (from 216028 to 215991) always has frame zero.
Frame of the 1st CDS = 0, length = 38. (frame - length) % 3 = 1, the frame of the 2nd CDS.
Frame of the 2nd CDS = 1, length = 140. (frame - length) % 3 = 2, the frame of the 3rd CDS.
Frame of the 3rd CDS = 2, length = 108. (frame - length) % 3 = 2, the frame of the terminal CDS.
Alternatively, the frame of terminal CDS can be calculated without the rest of the gene. Length of the terminal CDS=206. length % 3 =2, the frame of the terminal CDS.
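The frame arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical. Note that the rule relies on a modulo that never returns a negative number, which is how Python's % behaves for a positive modulus; in languages where % can return a negative remainder (C, for example), the result would need to be adjusted.

```python
def next_frame(frame, length):
    """Frame of the following CDS, given the frame and length of this one."""
    # (0 - 38) % 3 is 1 in Python, matching the worked example.
    return (frame - length) % 3


def cds_frames(cds_list, strand):
    """Assign frames to (start, end) CDS segments, in transcription order."""
    # On the reverse strand the 5'-most segment has the largest coordinates.
    ordered = sorted(cds_list, reverse=(strand == "-"))
    frames = []
    frame = 0                       # the first CDS always has frame zero
    for start, end in ordered:
        frames.append((start, end, frame))
        frame = next_frame(frame, end - start + 1)
    return frames
```

Run on the four CDS segments of the negative-strand example, this assigns frames 0, 1, 2 and 2 in transcription order, in agreement with the frame column of the records.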

Here is an example in which the "exon" feature is used. It is a 5 exon gene with 3 translated exons.

AB000381 Twinscan exon         150   200   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         300   401   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          380   401   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         501   650   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          501   650   .   +   2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         700   800   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          700   707   .   +   2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         900 1000   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan start_codon 380   382   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan stop_codon   708   710   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";

Attention: the content above is taken from the websites of the organizations that maintain these formats; please credit the source when necessary.

posted on 2011-11-29 15:26 by ewre. Category: Bioinformatics
