GBrowse Quick-Guide
Generic Genome Browser: A Tutorial
Author: Lincoln Stein, June 16, 2003 (revised 5 May 2006)
Here we take a brief look at the main features of GBrowse.
To follow the steps below, the basic setup (Perl, GD, BioPerl, and so on) must already be complete.
This guide covers a file-based database. Using MySQL is discussed later.
The steps assume that the Volvox genome annotation data is available.
URL to run : http://localhost/cgi-bin/gbrowse/volvox
If you look in the folder where gbrowse is installed, you will find a data_files folder (files to use as the database) and a conf_files folder (gbrowse configuration files). The data in these folders is used here.
GBrowse database directory : /var/www/html/gbrowse/databases
GBrowse configuration directory : /etc/httpd/conf/gbrowse.conf
To run GBrowse, you must set the permissions on the database folder and the configuration folder.
>su
Password: ******
>chown your_login_name /var/www/html/gbrowse/databases
>chown your_login_name /etc/httpd/conf/gbrowse.conf
>exit
As mentioned above, we are working with the volvox example,
so first create a volvox folder.
>cd /var/www/html/gbrowse/databases
>mkdir volvox
>chmod go+rwx volvox
Second, copy the example file, volvox1.gff, into the folder created above.
>cd /var/www/html/gbrowse/
>cp tutorial/data_files/volvox1.gff databases/volvox
Third, copy the configuration file, which determines how the database file above will be displayed.
>cp tutorial/conf_files/volvox.conf /etc/httpd/conf/gbrowse.conf
Now go and check the result. URL : http://your_IP_address/cgi-bin/gbrowse/volvox
You should see the figure below.

Figure 1: volvox1.gff data with volvox.conf config file.
If there is a problem, check the path to the configuration file.
Database file description
ctgA example contig 1 50000 . . . Contig ctgA
ctgA example my_feature 1659 1984 . + . My_feature f07
ctgA example my_feature 3014 6130 . + . My_feature f06
ctgA example my_feature 4715 5968 . - . My_feature f05
ctgA example my_feature 13280 16394 . + . My_feature f08
...
The lines above are an excerpt from volvox1.gff.
The file consists of nine columns, described in detail below.
The 9 columns are as follows:
- reference sequence
This is the name of the feature that will be used to establish the coordinate system for the annotation. This is usually the name of a chromosome, a clone, or a contig. In our example, the reference sequence is "ctgA". A single GFF file can refer to multiple reference sequences.
- source
The source of the annotation. This field describes how the feature was derived. In the example, the source is "example" for want of a better description. Many people use the source as a way of distinguishing between similar features that were derived by different methods, for example, gene calls derived from different prediction software. You can leave this column blank by replacing the source with a single dot (".").
- type
This column describes the feature type. You can choose anything you like to describe the feature type, but common names are "gene", "repeat", "exon", and "CDS." A good source of commonly recognized names is the Sequence Ontology Lite, located at http://song.sourceforge.net. For lack of a better name, the features in the volvox example are of type "my_feature."
- start position
The position that the feature starts at, relative to the reference sequence. The first base of the reference sequence is position 1.
- end position
The end of the feature, again relative to the reference sequence. End is always greater than or equal to start.
- score
For features that have a numeric score, such as sequence similarities, this field holds the score. Score units are arbitrary, but most people use the expectation value for similarity features. You can leave it blank by replacing the column with a dot.
- strand
For features that are strand-specific, this field is the strand on which the annotation resides. It is "+" for the forward strand, "-" for the reverse strand, or "." for annotations that are not stranded. If you are unsure of whether a feature is stranded, it won't hurt to use a "+" here.
- phase
For CDS features that encode proteins, this field describes where the next codon starts. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature in order to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the next codon begins at the third base of this region. This information is used by the "cds" glyph to show how the reading frame changes across splice sites. For all other feature types, use a dot here.
- group
The ninth and last column has multiple purposes. Its main use is to give features a name for searching and display. It can also be used to group related features together by giving them a common name. We'll see later how the exons of a gene can be grouped together with this field.
The format of the group field is "class name", where class describes the class of the feature and name describes its name. The class and the name are separated by a space not a tab. The feature class is a funny concept, because it is almost, but not quite, the same as the feature type. The idea is to distinguish among features that might share the same name by giving them a distinctive prefix. For example, the class distinguishes "Transcript M1.2" from "Gene M1.2". Most people find this confusing, and a proposed update to the GFF format promises to do away with the class entirely. For now, I suggest that you reuse the type field here. In the examples, I've used an initial capital letter in the class field in order to distinguish when I'm talking about the class field from when I'm talking about the type field.
Other good stuff can go into the group field, as we shall see later.
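To make the nine columns concrete, here is a minimal Python sketch that splits one GFF line into named fields. It assumes the columns are tab-delimited, as in real GFF files (the excerpt above shows them separated by spaces). The names GffRecord and parse_gff_line are made up for this illustration and are not part of GBrowse or BioPerl.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """The nine GFF columns, in file order (hypothetical helper type)."""
    ref: str
    source: str
    feature_type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    group: str


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-delimited GFF line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    ref, source, ftype, start, end, score, strand, phase, group = cols
    start_i, end_i = int(start), int(end)
    if end_i < start_i:
        raise ValueError("end must be greater than or equal to start")
    return GffRecord(
        ref, source, ftype, start_i, end_i,
        None if score == "." else float(score),  # "." means no score
        strand,
        None if phase == "." else int(phase),    # "." means no phase
        group,
    )


record = parse_gff_line(
    "ctgA\texample\tmy_feature\t1659\t1984\t.\t+\t.\tMy_feature f07"
)
print(record.feature_type, record.start, record.end, record.group)
```

Keeping the dot as None rather than 0 avoids confusing "no score" with a real score of zero.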
Configuration file description
The volvox.conf file is used as the example.
### TRACK CONFIGURATION ###
[ExampleFeatures]
feature = my_feature
glyph = generic
stranded = 1
bgcolor = blue
height = 10
key = Example Features
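The stanza above holds the settings needed to draw a single track.
First, ExampleFeatures is the internal name of the track. It is used in URLs on the web, and a description of the track can be added to it.
On screen, a checkbox determines whether the track is displayed.
feature : corresponds to the third column of the database file; it is the term that selects which features are drawn (my_feature, motif, ...).
glyph : determines the shape of the bar drawn on screen.
stranded : whether to draw arrows on the ends of the bar
bgcolor : the color of the bar
height : the size of the bar
key : the name of the track
Example) bgcolor = orange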

Figure 2: A Feature of a Different Color
Example) height = 5, key = Skinny features, glyph = primers
Figure 3: Using the primers Glyph
To see the list of glyphs, check the CONFIGURE_HOWTO.txt file in the doc folder.
Next is a description of searching. (Important)
Now we'll look at the interaction between feature names and GBrowse's search box. If you look through the volvox1.gff data file, you'll see that all the example features are named, and that their class is "My_feature."
GBrowse has a very flexible search feature. You can type in the name of a reference sequence, such as "ctgA", and it will display the entire thing, or you can type in a range in the format "ctgA:start..stop". Try "ctgA:5000..8000" to see this at work.
In addition, GBrowse can search for features by name. By default, the name of the object must be preceded by its class in the format Class:name. For example, if we are searching for My_feature "f07", we could type "My_feature:f07" into the search box. Try this now.
You probably don't want to remember to enter the class of the object to search for a feature. Fortunately, it is easy to declare one or more classes "automatic" and specify the order in which GBrowse will search for them. To do this with our example database, open up the volvox.conf config file, find the option named "automatic classes", and change it to read:
automatic classes = My_feature
This tells GBrowse that when the user types in an unqualified feature name, it should search the My_feature class for a match. You can now type "f07" directly into the search field and GBrowse will find and display it. If you wish, you may list several (or many!) automatic classes on this line. Just separate them by spaces:
automatic classes = My_feature Gene Transcript Contig Chromosome
For fun, try searching for the following:
- f1*
- f07:-5000..5000
- *3
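The search forms described above can be told apart mechanically. The following Python sketch is only an illustration of that idea, not GBrowse's actual search code. The function name interpret_query and the "range", "qualified" and "unqualified" labels are invented here. Wildcards such as "f1*" are passed through untouched, and a bare reference name such as "ctgA" is simply treated as an unqualified name.

```python
import re

# name:start..stop, where start may be negative (as in "f07:-5000..5000")
RANGE = re.compile(r"^(?P<name>[^:]+):(?P<start>-?\d+)\.\.(?P<stop>-?\d+)$")


def interpret_query(query, automatic_classes):
    """Return a (kind, details) pair describing a search-box string."""
    m = RANGE.match(query)
    if m:
        return ("range", (m["name"], int(m["start"]), int(m["stop"])))
    if ":" in query:
        # Explicit Class:name form
        cls, name = query.split(":", 1)
        return ("qualified", [(cls, name)])
    # Unqualified name: try each automatic class in the configured order.
    return ("unqualified", [(cls, query) for cls in automatic_classes])


classes = ["My_feature", "Gene"]
for q in ["ctgA:5000..8000", "My_feature:f07", "f07", "f07:-5000..5000"]:
    print(q, "->", interpret_query(q, classes))
```

The order of the candidate list mirrors the order of the automatic classes line, which is why that order matters when two classes contain the same name.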
To attach information beyond the nine basic columns, shown when a feature (bar) is clicked, use ";".
As the example below shows, an attribute named Note is added together with its value. (It can also be searched.)
ctgA example motif 11911 15561 . + . Motif m11 ; Note "kinase"
ctgA example motif 13801 14007 . - . Motif m05 ; Note "helix loop helix"
ctgA example motif 14731 17239 . - . Motif m14 ; Note "kinase"
ctgA example motif 15396 16159 . + . Motif m03 ; Note "zinc finger"
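As a rough illustration of how the ninth column in the lines above breaks down, the Python sketch below splits it into a class, a name, and a dictionary of extra attributes. The function name parse_group is made up for this example. It assumes that ";" never appears inside a quoted value, which holds for the tutorial data but is not guaranteed in general.

```python
def parse_group(column9):
    """Split a GFF group column into (class, name, attributes)."""
    parts = [p.strip() for p in column9.split(";") if p.strip()]
    # The first part is "Class name", separated by a single space.
    feature_class, name = parts[0].split(" ", 1)
    attributes = {}
    for part in parts[1:]:
        tag, value = part.split(" ", 1)
        # A tag may repeat, so collect values in a list; drop the quotes.
        attributes.setdefault(tag, []).append(value.strip().strip('"'))
    return feature_class, name, attributes


print(parse_group('Motif m05 ; Note "helix loop helix"'))
# ('Motif', 'm05', {'Note': ['helix loop helix']})
```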
Example) Copy the volvox2.gff file from the data_files folder into the database folder. Then add the following to volvox.conf.
[Motifs]
feature = motif
glyph = span
height = 5
description = 1
key = Example motifs
Doing this creates a new track.
The description option displays the Note.
Result)

Figure 4: Showing the Notes attribute
Many features are discontinuous. Examples include spliced transcripts, and gapped sequence similarity alignments, such as the alignment of cDNAs to the genome. GBrowse can deal with such features easily provided that you take a little care in setting them up.
The data file volvox3.gff contains a simulated data set of a series of gapped nucleotide alignments. An excerpt from the file is here:
ctgA example match 6885 8999 . - . Match seg03
ctgA example HSP 6885 7241 . - . Match seg03
ctgA example HSP 7410 7737 . - . Match seg03
ctgA example HSP 8055 8080 . - . Match seg03
ctgA example HSP 8306 8999 . - . Match seg03
ctgA example match 5233 9825 . - . Match seg04
ctgA example HSP 5233 5302 . - . Match seg04
ctgA example HSP 5800 6101 . - . Match seg04
ctgA example HSP 6442 6854 . - . Match seg04
ctgA example HSP 7106 7211 . - . Match seg04
ctgA example HSP 7695 8177 . - . Match seg04
ctgA example HSP 8545 8783 . - . Match seg04
ctgA example HSP 8869 8935 . - . Match seg04
ctgA example HSP 9404 9825 . - . Match seg04
Each segmented feature is represented by several lines in the GFF file that share the same feature name. Each feature has a single GFF line of type "match" whose start and end coordinates correspond to the full length of the alignment. Following this are one or more lines of type "HSP" with start and end coordinates indicating a section of the match. You will recognize this terminology from the standard BLAST report.
For example "Match seg03" starts at position 6885 and ends at 8999. It has four subsegments, one from 6885..7241, another from 7410..7737, and so forth.
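The relationship between a "match" line and its "HSP" lines can be sketched in a few lines of Python. This is an illustration of the grouping idea only, not the code GBrowse runs. The function name assemble and the tuple layout are assumptions made for the example, and the data is the seg03 excerpt from volvox3.gff.

```python
from collections import defaultdict

# (type, start, end, group) tuples for "Match seg03" from the excerpt.
rows = [
    ("match", 6885, 8999, "Match seg03"),
    ("HSP", 6885, 7241, "Match seg03"),
    ("HSP", 7410, 7737, "Match seg03"),
    ("HSP", 8055, 8080, "Match seg03"),
    ("HSP", 8306, 8999, "Match seg03"),
]


def assemble(rows, main_type="match", sub_type="HSP"):
    """Collect the full-length span and the subparts that share a group name."""
    features = defaultdict(lambda: {"span": None, "parts": []})
    for ftype, start, end, group in rows:
        if ftype == main_type:
            features[group]["span"] = (start, end)
        elif ftype == sub_type:
            features[group]["parts"].append((start, end))
    for feature in features.values():
        feature["parts"].sort()
    return dict(features)


seg03 = assemble(rows)["Match seg03"]
print(seg03["span"], len(seg03["parts"]))
```

The shared name in column 9 is the only thing tying the five lines together, which is why every line of a segmented feature must carry exactly the same group.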
The types "match" and "HSP" are not arbitrary, but are needed to tell the GBrowse database what the relationship between the full-length feature and its subparts is. The specific type names expected are mediated by a series of "aggregators" -- code modules that are loaded when GBrowse starts up. We will see later in this section how to manipulate these aggregators and to define custom ones.
Add volvox3.gff into the volvox database by copying it into the volvox database directory. Then edit volvox.conf to add the following track definition:
[Alignments]
feature = match
glyph = segments
key = Example alignments
This is declaring a new track named "Alignments" which displays features of type "match" using a glyph named "segments". The segments glyph is specialized for displaying objects that have multiple similar subparts.
Save the modified config file and reload the page in the browser. Disappointingly, you'll see something like Figure 5. Instead of showing multi-segmented features, the track shows a single solid box that spans the entire length of the feature.
Figure 5: Without activating an aggregator, multisegmented features are not displayed properly
To make multipart features display correctly, you must activate or define an appropriate aggregator. This is very easy for the similarity/match relationship, because there's already a predefined aggregator named "match." Reopen the volvox.conf configuration file, and find the option line near the top of the file that reads "aggregators = ". Change this to read as follows:
aggregators = match
This is telling GBrowse to turn on the "match" aggregator. Now reload. You should see a much-improved image similar to Figure 6.

Figure 6: Turning on the "match" aggregator allows GBrowse to recognize that the match feature has subparts
There are several predefined aggregators, each of which expects a particular combination of feature type names. The table below summarizes the most useful ones:
| Aggregator name | Main type | Subtype(s) | Purpose |
|---|---|---|---|
| alignment | (none) | similarity | nucleotide and protein alignments where the full extent of the match is unknown |
| coding | mRNA | CDS | Used in concert with the "cds" glyph to display the reading frame used by the coding portion of each exon. |
| clone | (none) | Clone_left_end Clone_right_end | Used for cases in which clone ends have been mapped to the genome, but one of the ends may be missing. |
| match | match | similarity, HSP | nucleotide and protein alignments |
| processed_transcript | mRNA | CDS, UTR, 5'-UTR, 3'-UTR, transcription_start_site, polyA_site | the canonical spliced gene |
| transcript | transcript | exon TSS PolyA | a spliced transcript that expects exon features |
To use any of these aggregators, follow this recipe:
- Give your features and their subparts the specific type names expected by the aggregators.
- Add the aggregator to the list of aggregators in the config file, e.g. aggregators = match processed_transcript clone.
- In the appropriate track definition, use the aggregator's name as the argument for feature. For example, feature = processed_transcript.
GBrowse can display protein-coding genes in various shapes and styles. The easiest way to set this up is to use the "processed_transcript" aggregator and its companion glyph also called "processed_transcript." Take a look at the file volvox4.gff, which defines a gene named EDEN, and its three spliced forms named EDEN.1, EDEN.2 and EDEN.3. Here are the contents of the file:
ctgA example gene 1050 9000 . + . Gene EDEN ; Note "protein kinase"
ctgA example mRNA 1050 9000 . + . mRNA EDEN.1 ; Gene EDEN
ctgA example 5'-UTR 1050 1200 . + . mRNA EDEN.1
ctgA example CDS 1201 1500 . + 0 mRNA EDEN.1
ctgA example CDS 3000 3902 . + 0 mRNA EDEN.1
ctgA example CDS 5000 5500 . + 0 mRNA EDEN.1
ctgA example CDS 7000 7608 . + 0 mRNA EDEN.1
ctgA example 3'-UTR 7609 9000 . + . mRNA EDEN.1
ctgA example mRNA 1050 9000 . + . mRNA EDEN.2 ; Gene EDEN
ctgA example 5'-UTR 1050 1200 . + . mRNA EDEN.2
ctgA example CDS 1201 1500 . + 0 mRNA EDEN.2
ctgA example CDS 5000 5500 . + 0 mRNA EDEN.2
ctgA example CDS 7000 7608 . + 0 mRNA EDEN.2
ctgA example 3'-UTR 7609 9000 . + . mRNA EDEN.2
ctgA example mRNA 1300 9000 . + . mRNA EDEN.3 ; Gene EDEN
ctgA example 5'-UTR 1300 1500 . + . mRNA EDEN.3
ctgA example 5'-UTR 3000 3300 . + . mRNA EDEN.3
ctgA example CDS 3301 3902 . + 0 mRNA EDEN.3
ctgA example CDS 5000 5500 . + 1 mRNA EDEN.3
ctgA example CDS 7000 7600 . + 1 mRNA EDEN.3
ctgA example 3'-UTR 7601 9000 . + . mRNA EDEN.3
The first line of the file defines the gene as a whole, starting at position 1050 of ctgA and extending to position 9000. Following this, there are three sets of lines that define the structure of the spliced forms EDEN.1, EDEN.2, and EDEN.3. By convention, the whole transcript is represented as type "mRNA". It has subparts named "5'-UTR", CDS, and "3'-UTR", where the UTR features are the 5' and 3' untranslated regions, respectively, and CDS is the coding region. Note how the CDS is split by splicing among multiple discontinuous locations on the reference sequence. The UTRs can be split in this way too.
Each mRNA and its subparts are grouped together under a common name in the ninth column ("mRNA EDEN.1", "mRNA EDEN.2", and so forth). In addition, each mRNA has a Gene attribute that ties it to the EDEN gene itself ("Gene EDEN"). Although this isn't required for the display, doing this will identify the various alternative transcripts as belonging to the same gene should you wish to use the GBrowse database for data mining. It will also show the user what gene the transcript belongs to when he or she clicks on it for details.
HINT: If you prefer not to distinguish between 5' and 3' UTRs, you can simply use "UTR" as the type. If you don't know where the UTRs are, just leave them blank. If you'd rather think in terms of exons and introns, then check out the "transcript" aggregator and its corresponding "transcript" glyph.
Go ahead and add volvox4.gff to the database. Then make the following changes to volvox.conf:
- Change the aggregators line to read as follows:
aggregators = match
              processed_transcript
- Add the following new stanza to the bottom of the file:
[Transcripts]
feature = processed_transcript
glyph = processed_transcript
bgcolor = peachpuff
description = 1
key = Protein-coding genes
The updated aggregators option loads the processed_transcript aggregator, which knows how to put CDS and UTR features together to form a spliced transcript. The new Transcripts track associates aggregated processed_transcript features with the like-named glyph, sets its background color to peachpuff (yes, there really is a color by this name!), turns on the description lines, and sets the human readable track name to "Protein-coding genes."
The aggregators option demonstrates that GBrowse config file options can continue across multiple lines provided that each additional line is indented.
Upon reloading the page, turning on the new "Protein-coding genes" track, and viewing the region around 1..10K, you'll see this:
Figure 7: The canonical processed_transcript glyph
This image is nice, but we can make it even better. One problem is that the gene description (the Note in the EDEN GFF line) isn't being displayed, because the description is attached to the gene and not to the individual mRNAs. To fix this, we simply tell GBrowse to display features of type "gene" as well as those of type "processed_transcript". Modify volvox.conf so the last stanza looks like this:
[Transcripts]
feature = processed_transcript gene
glyph = processed_transcript
bgcolor = peachpuff
description = 1
key = Protein-coding genes
The only change is that there are now two types listed for the feature option, "processed_transcript" and "gene." This is telling GBrowse to place both feature types in the same track. If we reload the page, it now looks like this:
Figure 8: Showing the gene as well as its transcripts
The processed_transcript glyph has a number of options that you can use to customize its appearance:
| Option Name | Possible values | Description |
|---|---|---|
| thin_utr | 0 (false), 1 (true) | If true, makes UTRs half-height. |
| utr_color | a color name ("gray" by default) | Changes the UTR color. |
| decorate_introns | 0 (false), 1 (true) | If true, puts little arrowheads on the introns to indicate direction of transcription. |
Using these options, we can make the track look like the UCSC Genome Browser (Figure 9).
[Transcripts]
feature = processed_transcript gene
glyph = processed_transcript
height = 8
bgcolor = black
utr_color = black
thin_utr = 1
decorate_introns = 1
description = 1
key = Protein-coding genes
Figure 9: A UCSC Genome Browser lookalike
Continuing with the example from the last section, the third exon of EDEN.1 is shared with EDEN.3. But is the reading frame preserved? The "coding" aggregator used in concert with the "cds" glyph creates a display that will visualize each CDS's reading frame.
To see this work, add the predefined "coding" aggregator to the list of aggregators:
aggregators = match processed_transcript coding
The "coding" aggregator is similar to processed_transcript, except that it only pays attention to the CDS parts of the transcript. It was designed to work hand-in-hand with the "cds" glyph. (For historical reasons, the glyph is called "cds" rather than "coding.")
Now add the following short stanza to the bottom of the configuration file:
[CDS]
feature = coding
glyph = cds
key = Frame usage
When you reload the page and turn this track on, you'll see a "musical staff" representation of the frame usage (Figure 10). From this we can see that the alternative splicing in fact changes the reading frame of the second exon.
Figure 10: The "cds" glyph shows the reading frame using a musical staff notation
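The frame change can also be checked by hand from the phase column of volvox4.gff. The Python sketch below recomputes the phase of each CDS segment from the length and phase of the segment before it, using the definition of phase given in the column descriptions. The function names are invented for this example, and the sketch handles forward-strand features only (for the reverse strand the segments would have to be walked from the highest coordinate down).

```python
def next_phase(start, end, phase):
    """Phase of the following CDS segment, given this segment's span and phase."""
    length = end - start + 1  # GFF coordinates are 1-based and inclusive
    # Bases left over after the last complete codon must be completed
    # by the first bases of the next segment.
    return (3 - (length - phase) % 3) % 3


def phases(segments, first_phase=0):
    """Return the phase of each (start, end) segment in transcript order."""
    result, phase = [], first_phase
    for start, end in segments:
        result.append(phase)
        phase = next_phase(start, end, phase)
    return result


eden1 = [(1201, 1500), (3000, 3902), (5000, 5500), (7000, 7608)]
eden3 = [(3301, 3902), (5000, 5500), (7000, 7600)]
print(phases(eden1))  # [0, 0, 0, 0]
print(phases(eden3))  # [0, 1, 1]
```

Both results agree with column 8 of the file: the segment 5000..5500 has phase 0 in EDEN.1 but phase 1 in EDEN.3, so the same exon is read in a different frame.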
If none of the predefined aggregators meets your needs, it is simple to define a custom one of your own. For example, say you wanted to display a feature of type "BAC", whose subparts are of type "left_end_read" and "right_end_read" (possibly corresponding to a BAC clone mapping experiment). Here is a GFF representation of this:
ctgA example BAC 1000 20000 . . . BAC b101.2 ; Note "Fingerprinted BAC with end reads"
ctgA example left_end_read 1000 1500 . + . BAC b101.2
ctgA example right_end_read 19500 20000 . - . BAC b101.2
This is the contents of volvox5.gff. Go ahead and add this into the database now. To visualize this you will:
- Create a custom aggregator that explains the relationship between the three feature types.
- Create a new stanza that uses the custom aggregator.
To define the custom aggregator, open volvox.conf and add the following to the aggregators line:
aggregators = match
              processed_transcript
              coding
              BAC{left_end_read,right_end_read/BAC}
The thing named BAC{left_end_read,right_end_read/BAC} is the custom aggregator definition. Its format is aggregator_name{subtype1,subtype2,subtype3.../main_type}. Here we're defining an aggregator of type "BAC" which has subparts of type "left_end_read" and "right_end_read" (separated by commas) and top-level type of "BAC" (separated from the subparts by a slash). Although it's not necessary to use the same name for both the main feature type and the aggregator, it's often convenient to do so.
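To make the definition format easier to read, here is a small Python sketch that takes one aggregator entry apart. It is an illustration of the syntax described above and not GBrowse's own parser. The names parse_aggregator and AGGREGATOR are invented. The sketch treats the /main_type part as optional and treats a bare word such as "match" as the name of a predefined aggregator.

```python
import re

# aggregator_name{subtype1,subtype2,.../main_type}
AGGREGATOR = re.compile(
    r"^(?P<name>\w+)\{(?P<subtypes>[^/}]*)(?:/(?P<main>[^}]+))?\}$"
)


def parse_aggregator(definition):
    """Split one entry of the aggregators option into its components."""
    m = AGGREGATOR.match(definition)
    if m is None:
        # No braces: a predefined aggregator referenced by name only.
        return {"name": definition, "subtypes": None, "main_type": None}
    subtypes = [s for s in m["subtypes"].split(",") if s]
    return {"name": m["name"], "subtypes": subtypes, "main_type": m["main"]}


print(parse_aggregator("BAC{left_end_read,right_end_read/BAC}"))
print(parse_aggregator("match"))
```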
Now add the appropriate stanza to the bottom of volvox.conf:
[Clones]
feature = BAC
glyph = segments
bgcolor = yellow
strand_arrow = 1
description = 1
key = Fingerprinted BACs
With this new track turned on, look at ctgA:1..24200. It will show that GBrowse has correctly picked up and rendered the relationship between the whole BAC and its two end reads (Figure 11).
Figure 11: The glyph produced by a custom BAC aggregator
For your convenience, the configuration file with all the modifications made up through this point of the tutorial can be found in volvox3.conf.
GBrowse can plot quantitative data such as alignment scores, confidence scores from gene prediction programs, and microarray intensity data. The data can be displayed either with glyphs that change color to indicate score levels (see the "heterogeneous_segments", "graded_segments" and "redgreen_box" glyphs), or using a general-purpose XY-plot glyph.
Congratulations, Affymetrix has built a transcriptional profiling chip for Volvox! There's now a transcriptional profile for volvox, with an intensity reading every 100 bp across all of ctgA. The simulated data for this is in the file volvox6.gff, an excerpt of which is shown here:
ctgA affy tlevel 1 100 281 . . Affy Expt1
ctgA affy tlevel 101 200 183 . . Affy Expt1
ctgA affy tlevel 201 300 213 . . Affy Expt1
ctgA affy tlevel 301 400 191 . . Affy Expt1
ctgA affy tlevel 401 500 288 . . Affy Expt1
...
The file contains 500 features, each of which is exactly 100 bp long. The features are of type "tlevel" ("transcriptional level") and of source "affy." Each one has a score (column 6) between 0 and 1000, where higher scores mean more transcriptional activity. This is the first time we've used the score column.
All of the 500 features share the same group (column 9) of "Affy Expt1." They are grouped in this way because the entire set of 500 features represents a single transcriptional profiling experiment. If we had multiple experiments to show, they would be named Expt1, Expt2 and so on.
We would like to generate a line graph that shows the transcriptional profile level across the current region. To do this, we will first create an aggregator that will bring all the individual tlevel features together into a single feature named "tprofile." This is done as described in the previous section. Modify the configuration file's aggregators option to read as follows:
aggregators = match
              BAC{left_end_read,right_end_read/BAC}
              processed_transcript
              coding
              tprofile{tlevel}
The last line is declaring an aggregated feature named "tprofile" whose parts consist of individual "tlevel" features. This is similar to the BAC aggregator, except that in this case there is no top-level feature that goes from end to end, so we just leave out the /main_type part of the aggregator definition.
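Conceptually, the tprofile aggregator turns many small scored features into one series of points per experiment. The Python sketch below shows that idea on the five tlevel lines from the excerpt. It is not how GBrowse implements aggregation. The function name build_profiles is invented, and placing each point at the midpoint of its feature is an assumption made purely for illustration.

```python
from collections import defaultdict

# (start, end, score, group) taken from the volvox6.gff excerpt.
rows = [
    (1, 100, 281, "Affy Expt1"),
    (101, 200, 183, "Affy Expt1"),
    (201, 300, 213, "Affy Expt1"),
    (301, 400, 191, "Affy Expt1"),
    (401, 500, 288, "Affy Expt1"),
]


def build_profiles(rows):
    """Group scored features into one sorted (position, score) list per group."""
    profiles = defaultdict(list)
    for start, end, score, group in rows:
        midpoint = (start + end) / 2  # assumed x position of the point
        profiles[group].append((midpoint, score))
    return {group: sorted(points) for group, points in profiles.items()}


profile = build_profiles(rows)["Affy Expt1"]
print(profile[0], profile[-1])  # (50.5, 281) (450.5, 288)
```

A second experiment would simply show up as a second key, which matches the way the group column separates Expt1 from Expt2.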
We now need to use this aggregated feature in a track stanza. Create the following section:
[TransChip]
feature = tprofile
glyph = xyplot
graph_type = line
height = 50
min_score = 0
max_score = 1000
scale = right
key = Transcriptional Profile
The options shown here create a track named TransChip to display the tprofile feature with the xyplot glyph. The "graph_type", "height", "scale", "min_score", and "max_score" options all configure various aspects of the xyplot glyph's appearance.
You can read all about xyplot's options using perldoc Bio::Graphics::Glyph::xyplot
When you reload the page and turn on the Transcriptional Profile track, you should see something like that shown in Figure 12.
Figure 12: A transcriptional profile rendered with the xyplot glyph
Using the info that perldoc provides, play around with the xyplot options a bit. For example, see what happens when you change graph_type to "boxes."
GBrowse can take advantage of DNA sequence data in several ways:
- It can display a GC content graph of the reference sequence at low magnifications and the DNA sequence itself at higher magnifications.
- It can display three and six-frame translations of the reference sequence DNA.
- It can display the protein translation of coding regions.
- It can display aligned nucleotide sequences, creating a poor man's multiple alignment.
So we've been working with feature coordinates, but no actual DNA sequence has been loaded into the volvox database. We will again rebuild the database, this time loading in a simulated DNA file in fasta format. Download the file volvox.fa, and copy it into the volvox database directory. At this point in the tutorial, when you do a directory listing of the volvox database directory (with "ls" on unix systems, or "dir/w" on Windows systems) it should look like this:
% ls /var/www/html/gbrowse/databases/volvox/
volvox.fa    volvox2.gff  volvox4.gff  volvox6.gff
volvox1.gff  volvox3.gff  volvox5.gff
After copying the .fa file into the volvox database directory, you will need to change the configuration file very slightly to tell GBrowse to look for and load the FASTA file. At the top of the config file, change the db_args section to look like this:
db_args = -adaptor memory -dir '/var/www/html/gbrowse/databases/volvox'
Previously the -gff argument told GBrowse to load all GFF files in the directory. The new -dir argument says to load both GFF files and FASTA sequence files. (There's also a -fasta argument to load sequence without features, but this is not much use with GBrowse.)
This is all you need to do to load the DNA. To see that the DNA is indeed being loaded, add two new stanzas to the volvox.conf configuration file:
[DNA]
glyph = dna
global feature = 1
height = 40
do_gc = 1
fgcolor = red
axis_color = blue
strand = both
key = DNA/GC Content

[Translation]
glyph = translation
global feature = 1
height = 40
fgcolor = purple
start_codons = 0
stop_codons = 1
translation = 6frame
key = 6-frame translation
The "DNA" track uses a specialized glyph called "dna". At low magnifications (zoomed way out), this glyph draws a GC content plot. At high magnifications (zoomed way in), this glyph draws the dna. Of the various options given in the example stanza, the most important one is "global feature", which is set to a true value (1). This tells GBrowse that the stanza doesn't correspond to a specific feature type, but should be displayed globally. Other options control whether to draw one or both strands, whether to draw the GC content histogram, and what colors to use.
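For readers unfamiliar with GC content plots, the quantity being drawn is just the fraction of G and C bases in successive stretches of sequence. The Python sketch below computes it over non-overlapping windows. The window size and the function name gc_content are assumptions for this example; how the dna glyph actually chooses its windows is not covered here.

```python
def gc_content(sequence, window=100):
    """Return (start, fraction_gc) pairs, one per non-overlapping window."""
    sequence = sequence.upper()
    points = []
    for offset in range(0, len(sequence), window):
        chunk = sequence[offset:offset + window]
        gc = chunk.count("G") + chunk.count("C")
        # Report 1-based start positions, matching GFF coordinates.
        points.append((offset + 1, gc / len(chunk)))
    return points


print(gc_content("GGCCATATGGCC", window=4))
# [(1, 1.0), (5, 0.0), (9, 1.0)]
```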
Similarly, the "Translation" track uses a glyph called "translation", which draws three or six-frame conceptual translations. At low magnifications (zoomed way out), this glyph draws little symbols indicating where start and stop codons are. At high magnifications, the actual amino acid sequence comes into view. Again, the most important option is "global feature", which is set to a true value to tell GBrowse that the track isn't attached to a particular feature type, but is to be generated automatically. Other options control the height of the glyph, whether to draw start and/or stop codon symbols, and whether to generate a 3frame or 6frame translation.
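A six-frame conceptual translation reads the sequence in three offsets on the forward strand and three on the reverse complement. The Python sketch below shows the idea using the standard genetic code. It is an independent illustration, not the translation glyph's code, and the frame labels "+1" to "-3" and the function names are choices made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the translations of all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for label, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(label, protein)
```

Stop codons appear as "*" in the output, which corresponds to the stop symbols that the track draws when stop_codons is turned on.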
Figures 13a and 13b show the browser at low and high magnification, with both tracks activated. Notice that the coding track ("cds" glyph) detects that the DNA is available and generates the transcripts' protein translations automatically!
Figure 13: Viewing DNA/GC content and 6-frame translation. (a) low magnification; (b) high magnification
If you happen to do a listing of the volvox database directory after adding the DNA file, you might notice that a new file named "directory.index" has appeared. This index file is created automatically by GBrowse in order to speed up access to the .fa file and to reduce memory requirements. If the database directory is not writable by all users, GBrowse will not be able to create this file, and the display will be somewhat slower whenever a DNA track is turned on.
This section will lead you through creating a plausible EST track, and show you how grouping of 5' and 3' EST reads works.
We'll start with a simple data set containing information on three pairs of EST reads. You'll find this data set in volvox7.gff. Here is the first pair described in the data file:
ctgA est match 1050 3202 . + . EST agt830.5
ctgA est HSP 1050 1500 . + . EST agt830.5
ctgA est HSP 3000 3202 . + . EST agt830.5
ctgA est match 5410 7503 . - . EST agt830.3
ctgA est HSP 5410 5500 . - . EST agt830.3
ctgA est HSP 7000 7503 . - . EST agt830.3
...
What's going on here is the same as the alignments shown in volvox3.gff. There are two EST reads named agt830.5 (the 5' read) and agt830.3 (the 3' read). Each of them matches the ctgA genome in two discontinuous regions because, presumably, they cross a splice site. As in the earlier example, we represent each EST as a single "match" feature that spans the entire region, plus a series of "HSP" features that correspond to the aligned regions. The last column is used to group the match and HSP features together using the class and name of the feature, where the class is arbitrarily chosen to be "EST."
There are two other things to notice. One is that the source field (column 2) is "est". All previous examples used "example" here. This is because we need to distinguish this set of alignments from the generic alignments in volvox3.gff. The second item of interest is that the strand field (column 7) is + for the 5' EST and - for the 3' EST, indicating that the 3' EST aligned to the reverse complement of ctgA.
Add this file to the volvox database directory, and add the following to the configuration file:
[EST]
feature = match:est
glyph = segments
height = 6
bgcolor = orange
key = ESTs
This will give a display similar to that shown in Figure 14.
Figure 14: A simple representation of EST matches.
Notice that the feature option reads "match:est" rather than simply "match." This is to distinguish the EST matches from the example matches that we loaded previously. When needed, you can use the source field (column 2) to distinguish different features of the same type, using the format "type:source". You can use this equally well with ordinary types (e.g. "my_feature:example") or with aggregated types ("processed_transcript:genscan").
This display is OK, but it could be better. One problem is that the relationship between the 5' and 3' EST read pairs is not shown. We'd like to place the two members of the pair together on the same line, and connect them with a dotted line to show that they are the two ends of the same cDNA clone. Recall that we did something similar to this with the custom BAC aggregator. Unfortunately, there's a problem with the ESTs because we are already using the "match" aggregator to perform one level of grouping, and the GFF load format only allows one level of grouping at a time (this is changing in a proposed new version of the format).
For the time being, we can work around this problem using a "hack." Change the [EST] track configuration to look like this:
[EST]
feature = match:est
glyph = segments
bgcolor = orange
group_pattern = /\.[53]$/
key = ESTs
The new group_pattern option tells GBrowse to use a Perl regular expression pattern matching operation to find and group related EST matches based on their names. It helps to understand how Perl regular expressions work, but basically the pattern match breaks down this way:
/ begin the pattern match
\. match a dot
[53] match either the numbers 5 or 3
$ match the end of the string
/ end the pattern match
What this is saying is to look for pairs of EST names that are similar except for the terminal .5 or .3, and pair them. When we reload the page, we get Figure 15.
Figure 15: The group_pattern option allows EST pairs to be grouped
Here are regular expressions that will work for other common EST pairing schemes:
| 5' EST | 3' EST | group_pattern |
|---|---|---|
| agt123f | agt123r | /[fr]$/ |
| agt123p | agt123q | /[pq]$/ |
| f.agt123 | r.agt123 | /^[fr]\./ |
| 5.agt123 | 3.agt123 | /^[53]\./ |
| agt123.for | agt123.rev | /\.(for|rev)$/ |
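If you want to try a pattern before putting it in the configuration file, the following Python sketch may help. It assumes that grouping works by deleting the part of the name that the pattern matches and pairing names whose remainders are identical; that is an interpretation of the behavior described above, not GBrowse's actual code, and `group_by_pattern` is a hypothetical helper.

```python
import re
from collections import defaultdict


def group_by_pattern(names, pattern):
    """Group names that become identical once the pattern match is removed."""
    groups = defaultdict(list)
    for name in names:
        base = re.sub(pattern, "", name, count=1)  # strip the .5/.3 style tag
        groups[base].append(name)
    return dict(groups)


# The pattern from the [EST] stanza, written without the surrounding slashes.
pairs = group_by_pattern(["agt830.5", "agt830.3", "agt221.5"], r"\.[53]$")
print(pairs)
```

The same helper can be used to check the other patterns in the table, for example `r"^[fr]\."` for names such as f.agt123 and r.agt123.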
Another nice enhancement would be to give the 5' and 3' ESTs different colors so as to distinguish one from another. This can be accomplished using a Perl callback. Open up volvox.conf once more, and find the bgcolor option in the [EST] track. Replace it with this (you may want to cut and paste from here in order to avoid introducing any typos):
bgcolor = sub {
my $feature = shift;
my $name = $feature->display_name;
if ($name =~ /\.5$/) {
return 'red';
} else {
return 'orange';
}
}
You'll need to know the basics of the Perl programming language in order to do this type of thing yourself. Suffice it to say that instead of hard-coding the color "orange" into the bgcolor option, we are asking GBrowse to run a Perl subroutine each time it needs to render an EST. The subroutine is passed the feature that is about to be drawn. It asks the feature for its human-readable name (display_name) and assigns that name to a variable named $name. It then performs a pattern match on the name to see if it ends in ".5". If the name matches, the subroutine returns the color "red" to GBrowse. Otherwise it returns the color "orange."
The effect is shown in Figure 16.
Figure 16: Using a callback to distinguish 5' and 3' ESTs
The last thing we'll do with the EST data set is to add DNA to the ESTs so that at high magnification GBrowse will show the multiple alignment. This information is also used by the "dump alignments" plugin to generate a text-based multiple alignment.
NOTE: Currently only nucleotide to nucleotide alignments can be displayed at the level of individual nucleotides (e.g. BLASTN, BLAT, Exonerate). Protein to nucleotide alignments, such as those produced by Genewise or BLASTX, are not supported at the residue level.
To make this work, we need to add two additional pieces of information to the EST alignment data:
- The DNA sequences of the volvox ESTs.
- The alignment positions in EST coordinates.
ctgA 1050 gattgccattgaccttggccattggccaagctgaa 1086
|||||||||| ||||||| ||||||||||||||||
agt830.5 1 gattgccattcaccttgggcattggccaagctgaa 135
What we currently have in the GFF file are the source genomic positions of the alignments (in ctgA-relative coordinates). We need to add the target positions in agt830.5-relative coordinates in order for GBrowse to fetch and display the appropriate segments of the EST DNA.
The fasta file ests.fa provides the DNA sequences for the six EST reads. The GFF load file volvox8.gff contains the revised coordinates. If you look at this file you'll see that it differs from previous load files in its last column:
ctgA est match 1050 3202 . + . Target EST:agt830.5 1 554
ctgA est HSP 1050 1500 . + . Target EST:agt830.5 1 451
ctgA est HSP 3000 3202 . + . Target EST:agt830.5 452 654
ctgA est match 5410 7503 . - . Target EST:agt830.3 1 595
ctgA est HSP 5410 5500 . - . Target EST:agt830.3 505 596
ctgA est HSP 7000 7503 . - . Target EST:agt830.3 1 504
The first eight columns are identical to what we've been using before, but the ninth column follows a new convention used for nucleotide to nucleotide and protein to nucleotide alignments. There is now a special class name, "Target", that tells GBrowse that the group field represents the combination of a target sequence and its coordinates. Following Target is the name EST:agt830.5, which is a composite of the "real" class name ("EST") and the name of the EST. The two are separated by a colon in the format "class:name". Following this are two numbers indicating the start and end of the alignment in EST coordinates.
There are a couple of subtleties to notice here. First of all, notice that the "match" features extend all the way across the matched area of the genome (1050 to 3202 in the case of agt830.5), and all the way across the matched area of the target (1 to 554). Because one or both of the matched regions may contain gaps, the source and target regions do not have to be the same length. On the other hand, each HSP covers an ungapped contiguous region: the first HSP covers 1050..1500 in genome coordinates and 1..451 in target (EST) coordinates; the second covers 3000..3202 in genome coordinates and 452..654 in target coordinates. The HSPs should have the same alignment length in both genome and target coordinates, or at least very close lengths. If the lengths are close but not identical, GBrowse will realign the segments, introducing small gaps where necessary.
NOTE: GBrowse uses a simple but slow segment realigner. If there are numerous gaps, it is better to break them into a set of smaller colinear HSPs than to rely on the realigner to do it for you.
The second subtlety to notice is that for the minus strand ESTs, the target coordinates are not reversed, that is, the start position is less than the end position. For example, for the first agt830.3 HSP, we are told that genomic region 5410..5500 aligns to EST region 505..596. The strand field is used to determine the direction of the alignment.
Note that this contradicts the historical implementation of GFF, but the current use is more internally consistent and is the method for target annotation in the proposed revision of GFF.
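To check a load file against the rule that each HSP should have the same (or nearly the same) length in genome and target coordinates, a small script is enough. The sketch below is only an illustration: `hsp_lengths` is a hypothetical helper, and it assumes the ninth column has exactly the four whitespace-separated fields shown in this tutorial ("Target", "class:name", start, end).

```python
def hsp_lengths(gff_line):
    """Return (genome_length, target_length) for a GFF line with a Target group."""
    cols = gff_line.split(None, 8)
    start, end = int(cols[3]), int(cols[4])
    # Column 9 looks like: "Target EST:agt830.5 1 451"
    tag, _target, t_start, t_end = cols[8].split()
    if tag != "Target":
        raise ValueError("not a Target alignment line")
    # Both coordinate systems are 1-based and inclusive.
    return end - start + 1, int(t_end) - int(t_start) + 1


# The four HSP lines as they are printed in this tutorial.
hsps = [
    "ctgA est HSP 1050 1500 . + . Target EST:agt830.5 1 451",
    "ctgA est HSP 3000 3202 . + . Target EST:agt830.5 452 654",
    "ctgA est HSP 5410 5500 . - . Target EST:agt830.3 505 596",
    "ctgA est HSP 7000 7503 . - . Target EST:agt830.3 1 504",
]
lengths = [hsp_lengths(line) for line in hsps]
for line, (genome_len, target_len) in zip(hsps, lengths):
    print(genome_len, target_len, line)
```

With the lines as printed here, three HSPs have equal lengths and the first agt830.3 HSP differs by one base, which is the "close but not identical" case that the realigner is described as handling.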
Since this data file contains a revised version of volvox7.gff, remove volvox7.gff from the database directory and replace it with volvox8.gff. Also copy ests.fa into the database directory. If you perform a directory listing, it should look like this:
directory.index volvox.fa volvox2.gff volvox4.gff volvox6.gff ests.fa volvox1.gff volvox3.gff volvox5.gff volvox8.gff
NOTE: If you see doubled EST features after this point, make sure that you have removed volvox7.gff. Another thing to watch out for is that some sort of bug in the BioPerl layer (up through at least version 1.4) causes the EST DNA display to get messed up at this point on Windows systems. To fix the latter problem, go to the volvox database directory and remove the files directory.dir and directory.pag. These are automatically-generated DNA file indexes that GBrowse develops, and will be regenerated for you the next time you access a page.
We're not done with making configuration file changes, but volvox4.conf contains all configuration file enhancements up to this point. If you like, you can copy it over the live volvox.conf. It contains the following version of the [EST] track:
[EST]
feature = match:est
glyph = segments
height = 6
draw_target = 1
show_mismatch = 1
canonical_strand = 1
bgcolor = sub {
my $feature = shift;
my $name = $feature->display_name;
if ($name =~ /\.5$/) {
return 'red';
} else {
return 'orange';
}
}
group_pattern = /\.[53]$/
key = ESTs
The key additions to this track configuration are the "draw_target", "show_mismatch" and "canonical_strand" options. All three are true/false flags, where 0 means false and 1 means true. draw_target tells the segments glyph to draw the DNA sequence of the target ESTs when the magnification allows. show_mismatch instructs the glyph to highlight mismatches between the genome and the EST in pink. canonical_strand instructs the glyph to display the plus strand sequence even when the EST matches the minus strand.
To see this work, reload the page, turn on the EST track and search for region "ctgA:1065..1165". This will show the aligned 5' ends of agt221.5, agt830.5 and agt767.5 (Figure 17). Notice that one of the T's towards the beginning of agt830.5 is highlighted in pink, to show that it doesn't match the corresponding genomic base.
Figure 17: Multiple alignments at the DNA level
If you don't see the EST sequence appearing, make sure that ests.fa is in the volvox database directory and is world readable. If it still isn't working, you may need to "touch" the file in order to update its modification date. This tells GBrowse that it is new and needs to be reindexed. In Unix:
% touch /var/www/html/gbrowse/databases/volvox/ests.fa
If you have sequence trace information (in SCF format) associated with the reference sequence, this can be displayed in gbrowse using the trace glyph. To use this glyph, you must have installed:
- The Staden io-lib package: staden.sourceforge.net
- zlib: www.zlib.net
- The Bio::SCF perl module: available from CPAN
The data file volvox9.gff contains an example trace entry.
ctgA example trace 44401 45925 . + . name trace; trace volvox_trace.scf
This aligns the full trace sequence to the reference sequence. The trace file in this case is named "volvox_trace.scf". Due to sequence quality, the first few bases of a trace file usually don't align. Even so, these bases need to be included in the GFF file. For instance, if bases 10-700 of the trace file align to bases 100-800 of the reference sequence, the feature would be 90-800 to account for the first 10 bases (starting at base 0).
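The start-coordinate arithmetic in that example can be written out as a one-line calculation. The sketch below only restates the example from the text; `trace_feature_start` is a hypothetical name, and it assumes, as the example does, that trace bases are counted from 0.

```python
def trace_feature_start(ref_align_start, trace_align_start):
    """Shift the feature start left so the unaligned leading trace bases are covered."""
    return ref_align_start - trace_align_start


# Trace base 10 aligns to reference base 100, so the feature starts at 90.
start = trace_feature_start(ref_align_start=100, trace_align_start=10)
print(start)
```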
NOTE: The trace glyph currently doesn't deal with insertions or deletions. If an indel occurs, the alignment after the indel will be off.
To display this, first copy the following into volvox.conf (or copy volvox5.conf over the volvox.conf file).
[Traces]
feature = trace
glyph = trace
fgcolor = black
bgcolor = orange
strand_arrow = 1
height = 6
description = 1
a_color = green
c_color = blue
g_color = black
t_color = red
trace_height = 80
trace_prefix = http://localhost/gbrowse/tutorial/data_files/
show_border = 1
key = Traces
The fgcolor, bgcolor, strand_arrow and height control the bar that shows the location and directionality of the trace.
The trace_prefix option is important because it gives the path to the trace files. This is prepended to the trace file name defined in the gff file. It can be a direct path to the directory (eg "/usr/local/trace_files/") or a web address (as above).
The a/c/g/t_color options allow configuration of the base colors. The trace_height refers to the height of the trace itself. Play around with it to find a height that you like.
If show_border is set to 1, a black box will be drawn around the trace.
After configuring the trace glyph, reload the browser page and enable traces. Zoomed out you will see:
Figure 18: The trace glyph zoomed out.
Zooming in will show you the trace diagram:
Figure 19: The trace glyph zoomed in.
Is all the effort to load the genomic and EST DNAs worth it? Yes, if you want to take advantage of two popular plugins, RestrictionAnnotator and Aligner. The first generates a track of restriction sites. The second dumps a text-based multiple alignment of the current region on view.
To see these plugins at work, first make sure that the database files are up to date with this position in the tutorial. If you are in any doubt, remove the current contents of the volvox database directory and replace them with the files volvox_all.gff and volvox_all.fa.
Now find the option "plugins=" at the top of volvox.conf, and modify it to activate the Aligner and RestrictionAnnotator plugins:
plugins = Aligner RestrictionAnnotator
When you reload the page, you will see a new popup menu appear under the image labeled "Dumps, searches and other operations." You will also see an automatic track labeled "plugin:Restriction Sites" appear in the track list. When you turn on this track, you will be presented with a restriction map (Figure 20). You can then adjust which restriction sites are shown by selecting "Annotate Restriction Sites" from the popup menu and pressing the "Configure" button.
Figure 20: The RestrictionAnnotator Plugin
To see the Aligner at work, center your view on a region that contains the EST alignments (for example, ctgA:1000..5000), select "Dump Alignments" from the plugin popup menu, and press "Go". This will return a text-based multiple alignment of the genome and the EST tracks.
The Aligner plugin has some additional configuration that you can perform. We'll look at this now as an example of how to configure plugins. Open up volvox.conf and add the following configuration section:
########################
# Plugin configuration
########################
[Aligner:plugin]
alignable_tracks = EST
upcase_tracks = CDS Motifs
upcase_default = CDS
It doesn't matter where the section goes, but it is probably a good idea to place this towards the middle of the file after the [GENERAL] section (at the top) and before the [TRACK DEFAULTS] section. Otherwise it is easy for you or someone else maintaining the configuration file to mistake this for some sort of track configuration.
Plugin configuration sections are distinguished from track configuration by having names of the format PluginName:plugin. In this case, the three configuration options are applied to the Aligner plugin. For the Aligner plugin, the configuration options are:
| Option | Description |
|---|---|
| alignable_tracks | Space-delimited list of tracks to include in the multiple alignment. The genome is always included. If this option is not present, then GBrowse will automatically include any track that has the "draw_target" option set. |
| upcase_tracks | Space-delimited list of tracks that will be used to UPCASE the genomic DNA. This is very useful if you want to embed the positions of coding regions or other features inside the multiple alignment. Uppercasing will not be turned on by default. The user must press the "Configure" button, and select which of the uppercase tracks are to be activated from a list of checkboxes. |
| upcase_default | A space-delimited list of tracks that will be uppercased by default unless the user turns them off during configuration. |
| ragged_default | A small integer indicating that the aligner should include some unaligned bases from the end of each sequence. This is useful for seeing the sequencing primer or cloning site in ESTs. |
With the changes in place, select the aligner from the popup menu and press Configure. Turn on uppercasing of the coding region track and see how it affects the display (Figure 21).
Figure 21: The Aligner plugin produces multiple alignments.
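To get a feel for what the upcase_tracks option does to the genomic line of the alignment, here is a minimal Python sketch of the idea. It is not the plugin's implementation; `upcase_regions` is a hypothetical helper and the feature coordinates below are made up for illustration.

```python
def upcase_regions(dna, region_start, features):
    """Uppercase the parts of a DNA string covered by (start, end) features.

    Coordinates are 1-based and inclusive, as in GFF; region_start is the
    genomic position of dna[0].
    """
    bases = list(dna.lower())
    for start, end in features:
        lo = max(start - region_start, 0)              # clip to the region
        hi = min(end - region_start + 1, len(bases))
        for i in range(lo, hi):
            bases[i] = bases[i].upper()
    return "".join(bases)


# A hypothetical feature covering genomic positions 1053..1056.
marked = upcase_regions("gattgccattgacc", region_start=1050, features=[(1053, 1056)])
print(marked)
```

Features that fall partly or wholly outside the region are clipped, so only bases in the current view are changed.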
Only a few of the plugins are currently well-documented, but this situation is being rectified. To view their documentation, if any, find the plugin file, which usually lives under gbrowse.conf/plugins, and run the perldoc command with the -F ("file") option:
% perldoc -F Aligner.pm
Here's the list of plugins that come with the standard distribution:
| Plugin | Description |
|---|---|
| Aligner | Dump multiple alignments |
| BatchDumper | Allows the user to cut and paste a series of landmarks on the genome and dumps out all overlapping features using a variety of formats (e.g. GenBank format) |
| FastaDumper | Produce pretty-printed FASTA dumps of the current region, with selected features highlighted with colors or font styles. |
| GFFDumper | Dump out the current region in GFF format (redundant with BatchDumper). |
| OligoFinder | Lets the user search for landmarks on the basis of unique 11-mers or greater. |
| RestrictionAnnotator | Creates restriction maps. |
| SequenceDumper | The same functionality as BatchDumper, but just shows features that overlap the current region on view. |
Although the example that we've been working with only has a single reference sequence (the infamous "ctgA"), many projects will have multiple references. Reference sequences can be anything that acts as a convenient landmark: sequenced clones, contigs, scaffolds, golden path segments, or whole chromosomes.
There are just a few rules to be aware of when setting up the load files:
- All reference sequences must share the same class name. In our example, the class for the reference sequence was Contig.
- Each reference sequence must have its own feature entry in the GFF load file. This feature entry must use itself as the reference sequence, start at position 1, and extend the length of the sequence. The source and type for the feature are arbitrary, and do not have to be the same across all reference sequences.
- The class chosen for reference sequences must be noted in the configuration file under the general option "reference class."
Let us review these three criteria for the volvox example. If you look at the top of the initial load file, volvox1.gff, you'll see the very first line is:
ctgA example contig 1 50000 . . . Contig ctgA
1. The class for the name "ctgA" is Contig. If we were using other reference sequences as landmarks, then they too would have to be identified as Contigs.
2. This line describes "ctgA" as a feature relative to itself. The feature starts at position 1 (it has to!) and ends at position 50,000. The source is "example" and the type is "contig." If there are other reference sequences in the database, they do not have to share the "contig" type. This allows you to refer to other types of landmarks, such as clones.
3. If you examine the top few lines of the volvox.conf configuration file, you'll see this line:
reference class = Contig
This line is required for GBrowse to find and render features located on the reference sequence.
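The three rules can be checked mechanically before loading a file. The following Python sketch is one way to do it; `check_reference_line` is a hypothetical helper, and it assumes the two-field "Class name" form of column 9 used throughout this tutorial.

```python
def check_reference_line(gff_line, reference_class):
    """Return a list of problems with a reference-sequence GFF line (empty if none)."""
    cols = gff_line.split(None, 8)
    seqid, start = cols[0], int(cols[3])
    feature_class, name = cols[8].split(None, 1)
    problems = []
    if name != seqid:
        problems.append("feature must use itself as the reference sequence")
    if start != 1:
        problems.append("feature must start at position 1")
    if feature_class != reference_class:
        problems.append("class does not match 'reference class'")
    return problems


# First line of volvox1.gff, checked against "reference class = Contig".
ok = check_reference_line("ctgA example contig 1 50000 . . . Contig ctgA", "Contig")
print(ok)
```

An empty list means the line satisfies all three rules; the source and type columns are deliberately not checked, since they are arbitrary.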
If you find this confusing, it might help to choose "Reference" as the class. Then you can write GFF load files like this:
ctgA example contig 1 50000 . . . Reference ctgA
chr22 example chromosome 1 1150000 . . . Reference chr22
5p3.2 example band 1 830000 . . . Reference 5p3.2
Don't forget to update "reference class" in the config file!
Alternatively, you can use the word "Sequence" as the class name for reference sequences. For historical reasons, "Sequence" is recognized as the default classname. This means you don't have to have a "reference class" option in the config file at all.
One of the cooler features of GBrowse is its ability to support semantic zooming. Semantic zooming is a feature in which objects show different levels of detail depending on the level of magnification. We've already seen this behavior in the "dna" and "segments" glyphs, which show the DNA sequence only when there's sufficient room to display it.
GBrowse has several types of semantic zooming:
- glyph-based, automatic: The dna and segments glyphs, and others that support semantic zooming out of the box. This happens automatically and can't be modified.
- semantic labeling: When there's sufficient room, GBrowse will print the label and descriptions next to the glyphs. The threshold at which this happens is under your control.
- semantic bumping: When there's sufficient room, GBrowse will "bump" features to prevent them from colliding on the screen. When this would cause the display to become too high, bumping is suppressed. This threshold is also under your control.
- semantic options: You can set track configuration sections up so that when a preset size threshold is exceeded, one configuration replaces another.
The thresholds for labeling and bumping are set by configuration options named "label density" and "bump density" respectively. The standard values can be found in the defaults track named [TRACK DEFAULTS]. They are originally set so that labels are suppressed when there are more than 25 features per track, and bumping is suppressed when there are more than 100 features per track. You can change these values globally by editing them in [TRACK DEFAULTS], or you can add "label density" and/or "bump density" options to individual track configuration sections in order to override the settings for specific tracks.
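The effect of the two thresholds can be summarized in a few lines of Python. This is only a sketch of the rule as stated above ("more than N features" turns the behavior off), not GBrowse's rendering code; `display_flags` is a hypothetical name.

```python
def display_flags(feature_count, label_density=25, bump_density=100):
    """Decide whether a track labels and bumps its features."""
    return {
        "label": feature_count <= label_density,  # suppressed above the threshold
        "bump": feature_count <= bump_density,
    }


print(display_flags(20))    # few features: labels and bumping both on
print(display_flags(60))    # labels suppressed, bumping still on
print(display_flags(150))   # both suppressed
```

A per-track override corresponds to calling the function with a different threshold, for example `display_flags(60, label_density=80)`.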
The process of setting up semantic options is a bit more interesting. To illustrate, we will create semantic zooming for the [Alignments] track ("Example Alignments"). We would like the track to shift from showing the individual segments to showing solid rectangles when the user is zoomed out to 30K and beyond, and turn bumping off when the user is zoomed out to 45K and beyond. The process is simple. Beneath the [Alignments] stanza, we add a stanza qualified for zoomlevels of >= 30,000 and another stanza qualified for zoomlevels of >= 45,000:
[Alignments]
feature = match
glyph = segments
key = Example alignments

[Alignments:30000]
glyph = box
label = 0

[Alignments:45000]
glyph = box
bump = 0
label = 0
The format for semantic options is [Trackname:distance], where Trackname must be the same as the non-qualified track, and distance is the length of the region at which the semantic options will kick in. Only options that are different from the non-qualified track need to be listed. According to the configuration given above, when the user is looking at a region 30,000 bp or longer, the glyph option will change to "box," which is a solid rectangle that doesn't show any internal details. All other options, such as feature and key, will be inherited from the [Alignments] track.
At 45,000 bp, the glyph is again set to box, and in addition the "bump" option is set to zero, turning off collision control. Notice that options are inherited from the unqualified track stanza, and not from the previous semantic zoom level. If we had neglected to specify the glyph option in [Alignments:45000], the glyph would have reverted to "segments."
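The inheritance rule is easier to see in code. The sketch below resolves the options for a given region length by starting from the unqualified stanza and overlaying the single qualified stanza with the largest distance that does not exceed the region length. It models the behavior described above and is not taken from GBrowse; `effective_options` is a hypothetical helper.

```python
def effective_options(stanzas, track, region_length):
    """Merge the base stanza with the best-matching [track:distance] stanza."""
    options = dict(stanzas[track])  # always start from the unqualified stanza
    best = None
    for name in stanzas:
        prefix, _, distance = name.partition(":")
        if prefix == track and distance.isdigit():
            d = int(distance)
            if d <= region_length and (best is None or d > best):
                best = d
    if best is not None:
        options.update(stanzas[f"{track}:{best}"])
    return options


stanzas = {
    "Alignments": {"feature": "match", "glyph": "segments", "key": "Example alignments"},
    "Alignments:30000": {"glyph": "box", "label": "0"},
    "Alignments:45000": {"glyph": "box", "bump": "0", "label": "0"},
}

for length in (20000, 40000, 50000):
    print(length, effective_options(stanzas, "Alignments", length))
```

Because each level overlays the base stanza only, removing glyph from the 45000 entry in this model would bring back "segments" at 50K, matching the warning in the text.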
Make these changes to volvox.conf, turn on the "Example Alignments" track, and view the contig at 20K, 40K and 50K. At 40K, you'll see the alignments lose their internal structure and be replaced by solid boxes (Figure 22). At 50K they'll begin to overlap and the feature labels will be suppressed.
Figure 22: Semantically zoomed alignments at 40K
The overview is the scale that appears at the top of the detailed image. In addition to acting as a reference point and navigation tool, you can place tracks in it. These tracks will always be displayed, and can serve as reference points for well-known genes, cytogenetic bands, or genetic markers.
We will illustrate how to do this by placing a copy of the Motifs track into the overview. Add the following to the bottom of the volvox.conf configuration file:
[Motifs:overview]
feature = motif
glyph = span
height = 5
description = 0
label = 1
key = Motifs
This stanza is identical to the [Motifs] track that we created earlier, except that its name is qualified with ":overview". This tells GBrowse that this is not an ordinary track to be placed in the detail image, but one that should be placed in the overview.
We also want the overview motifs track to be displayed by default, so go to the top of the configuration file, and modify the "default features" option to look like this:
# list of tracks to turn on by default
default features = ExampleFeatures Motifs:overview
Reload the page. Voilà! See Figure 23.
Figure 23: Any number of tracks can be placed in the overview
You can add as many tracks to the overview as you like. The main warning is that if you add lots of features to the overview it can get pretty crowded in there. Performance can also suffer, since each feature must be fetched and rendered each time the overview is displayed.
The next topic we'll cover in this tutorial is configuring GBrowse's outgoing links. When the user clicks on a glyph in the details image, he will be taken to another page by following a URL. The URL to follow is generated from the link option. The default link option is located in the [TRACK DEFAULTS] section of the config file; you can specify track-specific links by placing a link option in one or more of the individual track stanzas.
The volvox.conf track defaults looks like this:
[TRACK DEFAULTS]
glyph = generic
height = 10
bgcolor = lightgrey
fgcolor = black
font2color = blue
label density = 25
bump density = 100
# where to link to when user clicks in detailed view
link = AUTO
In this case, we've been using a special link URL of "AUTO." This generates an automatic link to a helper script named "gbrowse_details." If you click on some of the features in the current volvox page you'll get an idea of what this script displays. Try clicking on a motif, a spliced transcript, the EDEN gene, and an EST. When you click on the spliced transcript, notice that the content of the "Gene" attribute is displayed. By adding attributes like this one, you can build up a very modest web-browsable database of facts about your features.
We're going to override the default link rule for the motif track. There's nothing sensible to link to, so we'll link to Google using first the motif's name, and then the motif's description.
Go to the [Motifs] stanza in the volvox.conf config file and modify it so that it looks like this:
[Motifs]
feature = motif
glyph = span
height = 5
description = 1
link = http://www.google.com/search?q=$name
key = Example motifs
The only change we've made is to add a "link" option to the stanza, where the value is a Google search URL. "$name" is a Perl variable. GBrowse will fill in this variable with the name of the motif. Reload the page and click on a motif to see that this works as advertised ("m01," "m02" and the other example motifs are similar to the names for galactic clusters, so be prepared for some astronomy hits).
It would be more sensible to link to the description of the motif, for example "helix loop helix." Fortunately we can do that too. Just change the link option to:
link = http://www.google.com/search?q=$description
There are a large number of possible variables that you can use inside link rules. See the CONFIGURE_HOWTO document in the GBrowse distribution for the full list. You can also construct links using Perl callbacks as described in the section on displaying ESTs. This gives you the ability to generate any arbitrary URL.
If you want nothing to happen when the user clicks on a feature, just set link to empty ("link = ").
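To picture how a link rule turns into a URL, the following Python sketch substitutes $name-style variables from a feature's fields. It is an illustration only: `expand_link` is a hypothetical helper, the feature is modeled as a plain dictionary, and URL-escaping the value with `quote_plus` is a choice made for this sketch rather than documented GBrowse behavior.

```python
import re
from urllib.parse import quote_plus


def expand_link(template, feature):
    """Replace $name-style variables in a link template with URL-escaped values."""
    def substitute(match):
        key = match.group(1)
        # Leave unknown variables untouched rather than guessing a value.
        return quote_plus(feature[key]) if key in feature else match.group(0)
    return re.sub(r"\$(\w+)", substitute, template)


# Motif m06 with its description, modeled as a dictionary.
motif = {"name": "m06", "description": "SUSHI repeat"}
url = expand_link("http://www.google.com/search?q=$description", motif)
print(url)
```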
The last thing we'll do is to change the behavior of the [Motifs] track so that:
- a new window pops up with the google search rather than replacing the contents of the current window
- when the user mouses over a motif, a hints box will appear telling him that clicking there will initiate a google search
These changes are easy:
[Motifs]
feature = motif
glyph = span
height = 5
description = 1
link = http://www.google.com/search?q=$description
link_target = _blank
title = Search Google for $description.
key = Example motifs
There's now a link_target option. This contains the name of a browser window in which to load the content when the user clicks on the feature. If there's no window of that name, the browser will create a new window and give it the desired name. Choose an ordinary name like "Google" if you want the Google content to be loaded into the same window each time, or choose "_blank" as we've done here in order to pop up a new fresh window each time the user clicks.
The title option contains a bit of text that will be displayed whenever the user hovers the mouse over the feature for a second or two. The same variable substitution rules apply, so when the user mouses over feature "m06", a hints window will pop up that says "Search Google for SUSHI repeat." Give it a try!
This section will show you how to add two nice user interface enhancements to the volvox database.
Adding a "region" panel
With larger genomes, you may want to add a "region panel" that is intermediate in size between the overview panel and the detail panel. The region panel can contain tracks of its own and is useful for displaying features that are too numerous for the overview panel and too large for the detail panel.
Open the volvox.conf configuration file and add the following line to the [GENERAL] section. A good place is near the "max segment" and "default segment" sections:
# max and default segment sizes for detailed view
max segment = 50000
default segment = 5000

# size of the "region panel"
region segment = 20000
Now when you reload the volvox page, you will see an intermediate panel labeled "region", as shown in Figure 24:
Figure 24: The "region" panel shows a region intermediate in size between the overview and the detail panel.
You can declare region panel tracks in exactly the same way that you declare overview tracks, by declaring stanzas qualified by ":region":
[TransChip:region]
feature = tprofile
glyph = xyplot
graph_type = boxes
height = 50
min_score = 0
max_score = 1000
bgcolor = blue
scale = right
key = Profile
Figure 25 shows what the region looks like with its "Profile" track turned on.
Figure 25: You can add any number of tracks to the region panel, just as you would for the overview panel.
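By this point the tutorial has used several kinds of qualified section names: ":plugin", ":overview", ":region", and numeric distances for semantic zooming. The Python sketch below sorts section names into those kinds. It is a reading aid based on the naming conventions described in this tutorial, not GBrowse's parser, and `classify_stanza` is a hypothetical name.

```python
def classify_stanza(name):
    """Classify a config section name by its ':qualifier' suffix."""
    if name in ("GENERAL", "TRACK DEFAULTS"):
        return ("special", name)
    base, sep, qualifier = name.partition(":")
    if not sep:
        return ("track", base)            # ordinary detail-panel track
    if qualifier == "plugin":
        return ("plugin", base)
    if qualifier in ("overview", "region"):
        return (qualifier, base)          # track placed in that panel
    if qualifier.isdigit():
        return ("semantic zoom", base)    # options for regions >= that length
    return ("unknown", base)


for section in ("EST", "Aligner:plugin", "Motifs:overview",
                "TransChip:region", "Alignments:30000"):
    print(section, classify_stanza(section))
```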
Grouping Tracks
The bottom of the GBrowse window contains an expandable set of checkboxes that allows the users to turn tracks on and off. By default, the tracks are grouped into sections corresponding to tracks belonging to the overview panel, those belonging to the region panel, tracks created by external (third-party) annotations, and tracks created by plugins. All other tracks are grouped together in a catch-all section named "General."
You can easily define new track groups to make navigation easier. To do so, just add a "category" option to each of the track stanzas. This option defines the name of the category. Tracks that belong to the same category will be grouped together, regardless of the order in which the track definitions appear in the configuration file. For example, we can place the [Motifs] and the [Translation] tracks into a section named "Proteins" by modifying their stanzas to look like this:
[Motifs]
feature = motif
glyph = span
height = 5
description = 1
category = Proteins
key = Example motifs

[Translation]
glyph = translation
global feature = 1
height = 40
fgcolor = purple
start_codons = 0
stop_codons = 1
category = Proteins
translation = 6frame
key = 6-frame translation
In this way we can create sections named "Alignments," "Examples," "Genes" and "Proteins" and assign the appropriate tracks to them. The Tracks control section will look something like Figure 26:
Figure 26: The track control section with tracks grouped into categories.
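The grouping rule itself is simple enough to sketch in Python. The snippet below groups track names by their "category" option and puts tracks without one into "General", as described above. It is a model of the rule, not GBrowse code; `group_tracks` is a hypothetical helper and the stanzas are reduced to the options that matter here.

```python
def group_tracks(tracks):
    """Group track names by category; uncategorized tracks go to 'General'."""
    groups = {}
    for name, options in tracks.items():
        groups.setdefault(options.get("category", "General"), []).append(name)
    return groups


tracks = {
    "Motifs": {"glyph": "span", "category": "Proteins"},
    "Alignments": {"glyph": "segments"},
    "Translation": {"glyph": "translation", "category": "Proteins"},
}
print(group_tracks(tracks))
```

Note that Motifs and Translation end up together even though another stanza sits between them, which mirrors the statement that file order does not matter.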
The file volvox_final.conf contains the final configuration file with all the modifications we've made during the course of this tutorial. The data files volvox_all.gff and volvox_all.fa likewise contain the entirety of the feature and DNA data.
The in-memory database is great for smaller data sets, and can handle GFF files of up to about 20,000 features (more if you have lots of memory). For larger data sets, however, you'll want to use a database management system. GBrowse handles a number of DBMS through its "database adaptor" system. This section shows how to use the Bio::DB::GFF berkeleydb adaptor that comes for free when you install BioPerl; this will enable you to create databases of 10 million or more features. The next section shows you how to install a MySQL relational database that will support even larger data sets. You may skip these sections and move on to working with third-party annotations if you do not wish to install a berkeleydb-based server at this time.
The Berkeleydb database adaptor comes with BioPerl 1.51 or higher (still under development at the time this tutorial was written). If you have an older version of BioPerl, GBrowse will install the adaptor for you. As its name implies, this adaptor uses the Berkeleydb database system (http://www.sleepycat.com) to create indexed database files from GFF feature files. The adaptor also requires the Perl DB_File interface to Berkeleydb. If you are using a Linux or Mac OSX system, you almost certainly have both Berkeleydb and DB_File already installed. For Windows users of ActiveState Perl, you should confirm that DB_File is installed by running the following command:
C:\> perl -MDB_File -e "print $DB_File::VERSION"
If this prints out a number, then you are golden. If you get an error, you should reinstall DB_File by running the PPM tool:
C:\> ppm
PPM interactive shell (2.1) - type 'help' for available commands.
PPM> install DB_File
It is an extremely simple task to convert an existing in-memory database to use the Berkeleydb database. We will now convert the Volvox example database to Berkeleydb.
Take the most recent version of the volvox.conf configuration file and edit its top few lines so that they look like this:
[GENERAL]
description = Volvox Berkeleydb Database
db_adaptor = Bio::DB::GFF
db_args = -adaptor berkeleydb
          -dir '/var/www/html/gbrowse/databases/volvox'
We made just two changes. First, we changed the description of the database to "Volvox Berkeleydb Database" to distinguish it from the in-memory database. Second, we changed the value of the -adaptor option from "memory" to "berkeleydb".
Now reload the volvox page in your browser. There will be a slight delay as the Berkeleydb adaptor constructs its indexes, and then the page will reappear. You should now be able to browse and search the database exactly as before. Depending on how fast the memory adaptor was to begin with, you may not notice a speed improvement; however, with large GFF files, the performance improvement will be very marked.
If you look in the volvox database directory, you will see a series of newly-created index files named "bdb_features.btree", "bdb_features.data", etc. These are automatically created when needed and updated whenever the underlying GFF or FASTA files are changed.
If you get an "Internal Server Error" or similar message, check the server error log file for messages that explain what went wrong. The most common problem is that the volvox database directory is not writeable by the web server user. As described earlier, this directory must be "world writeable" in order to allow the web server to create and maintain the databases.
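Before reloading the page, it can help to confirm the directory permissions from the command line. The following is a minimal sketch, assuming a POSIX shell and an `ls -ld` that prints the usual ten-character mode string. The function name `is_world_writable` is invented for this example and is not part of GBrowse.

```shell
#!/bin/sh
# Hypothetical helper (not part of GBrowse): succeed if the given directory
# is writable by "other" users, which the automatic Berkeleydb indexing
# relies on when the web server is neither the owner nor in the group.
is_world_writable() {
    dir=$1
    [ -d "$dir" ] || return 2    # not a directory: report a distinct status
    # ls -ld prints a mode string such as "drwxrwxrwx"; character 9 is the
    # write bit for "other".
    mode=$(ls -ld "$dir" | cut -c1-10)
    [ "$(printf '%s' "$mode" | cut -c9)" = "w" ]
}
```

For example, `is_world_writable /var/www/html/gbrowse/databases/volvox || echo "not world writeable"` prints a warning when the permissions set earlier with chmod are missing.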
Creating a Berkeleydb database using bp_load_gff.pl
Although it is convenient to maintain the Berkeleydb indexes automatically, this mechanism has a number of disadvantages. One disadvantage is that this mechanism requires the database directory to be world writeable (or at least writeable by the web user), which may not be acceptable in some installations. Another disadvantage is that the indexing may take a long time, up to 10 minutes for a GFF database containing a million lines. Some web servers will time out during this process. For large databases, it is better to explicitly create the database index files using the bp_load_gff.pl program.
bp_load_gff.pl is a BioPerl utility that is described in more detail in Setting up a MySQL database. It takes as its input a series of GFF and FASTA files and creates the appropriate database files. To see how to use it, we will create a fresh database directory. Go to the GBrowse database directory located at /var/www/html/gbrowse/databases and create a new subdirectory called "volvox_bdb":
% cd /var/www/html/gbrowse/databases
% mkdir volvox_bdb
On Windows systems you can use the file manager to create this new folder.
You do not have to make this directory world writeable, but it should be readable and executable by the user that the web server runs as. Now enter the tutorial data files directory (/var/www/html/gbrowse/tutorial/data_files) and load the GFF and sequence files using the following command:
% bp_load_gff.pl -c -a berkeleydb -d /var/www/html/gbrowse/databases/volvox_bdb volvox_all.fa volvox_all.gff
volvox_all.gff: loading...
738 records loaded
volvox_all.gff: 738 records loaded
Loading fasta file volvox_all.fa
volvox_all.fa: 7 records loaded
The arguments to bp_load_gff.pl are:
| -a | Use the berkeleydb database adaptor. |
| -c | clear (initialize) the database |
| -d /var/www/html/gbrowse/databases/volvox_bdb | Load the data into the indicated database directory. |
| volvox_all.fa volvox_all.gff | The data files to load. |
If all goes well, this will create the index files in /var/www/html/gbrowse/databases/volvox_bdb. If you look in that directory now, you'll see a series of bdb_* index files.
The last step is to modify the volvox.conf to point to this directory. Open it in a text editor and modify the top part so that it looks like this:
[GENERAL]
description = Volvox Berkeleydb Database
db_adaptor = Bio::DB::GFF
db_args = -adaptor berkeleydb
          -dsn '/var/www/html/gbrowse/databases/volvox_bdb'
The change here is to replace the -dir argument with -dsn ("data source name"). This tells the Berkeleydb adaptor that pre-made index files can be found in the indicated directory. It will not attempt to update the index files automatically.
If you wish to update the indexes with new GFF or sequence data, you should run the bp_load_gff.pl script again to update the indexes. Using the -c flag will reinitialize the indexes from scratch, erasing whatever was there before. Without this flag, the provided GFF and/or sequence data will be incrementally added to the indexes.
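The difference between a full rebuild and an incremental update comes down to whether -c is passed. As a rough sketch, the helper below only prints the command it would run, so it can be inspected before anything is touched. The function name `bdb_load_command` and its `rebuild`/`append` keywords are invented for this example; the -c, -a and -d flags are the ones described above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the bp_load_gff.pl command for either a
# full rebuild (-c) or an incremental update (no -c). Nothing is executed.
bdb_load_command() {
    if [ $# -lt 3 ]; then
        echo "usage: bdb_load_command rebuild|append DIR FILE..." >&2
        return 1
    fi
    mode=$1
    dir=$2
    shift 2
    case $mode in
        rebuild) flags="-c -a berkeleydb" ;;   # erase and recreate the indexes
        append)  flags="-a berkeleydb" ;;      # add to the existing indexes
        *) echo "unknown mode: $mode" >&2
           return 1 ;;
    esac
    echo "bp_load_gff.pl $flags -d $dir $*"
}
```

Calling `bdb_load_command append /var/www/html/gbrowse/databases/volvox_bdb new_features.gff` (with `new_features.gff` standing in for your own file) prints the incremental form of the command; pipe the output to `sh` once it looks right.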
The Bio::DB::GFF MySQL adaptor is an interface to the open source MySQL database management system. Its performance is similar to that of the Berkeleydb adaptor, but it has better provisions for error recovery and is safe to use in environments where multiple users write to the database simultaneously. In addition, the MySQL adaptor has been tested much more extensively than the Berkeleydb adaptor and is highly recommended for production environments. This section describes how to set up GBrowse to use the MySQL adaptor. If you are not interested in this, you may skip to the next section that describes loading third-party annotations.
First you'll have to install MySQL. Although it is installed by default in most Linux systems, it will not be present on Windows or Macintosh OSX systems. Go to www.mysql.com and follow the instructions to download and install the database. Come back here when this is done.
Next, you'll need to install the Perl interface to MySQL. On a Windows system using ActiveState Perl, use the ppm tool:
C:\Windows> ppm
ppm> install DBD::mysql
ppm> quit
On a Unix, Linux or Mac OSX system, use the perl CPAN installer (this may need to be done with root/superuser privileges):
% perl -MCPAN -e shell
cpan> install DBD::mysql
cpan> quit
Now you're ready to create the MySQL version of the volvox database. First you'll set up a new empty database named "volvox." Using the mysql command-line tool, create the database, grant yourself read/write privileges, and grant the "nobody" user read privileges:
% mysql -uroot -p
Enter password: *********
mysql> create database volvox;
Query OK, 1 row affected (0.04 sec)
mysql> grant all privileges on volvox.* to lstein@localhost;
Query OK, 0 rows affected (0.00 sec)
mysql> grant select on volvox.* to nobody@localhost;
Query OK, 0 rows affected (0.00 sec)
mysql> quit
Bye
Depending on how mysql was installed, you may not need to provide a password, in which case just type "mysql -uroot" without the "-p" argument. When granting privileges to yourself, replace "lstein" with your own login name. If you are on a Windows system, you may be able to skip this step entirely.
You'll now load the .gff and .fa files into this newly created database. There are actually two steps needed. The first is to "initialize" the database with all the data definitions needed to hold genomic feature data, and the second is to actually load the data. Fortunately, both these steps are handled by the same command-line tool, bp_load_gff.pl, which is part of the BioPerl suite.
Copy the files volvox_all.gff and volvox_all.fa to some convenient place. Then run the following command from the command line:
% bp_load_gff.pl -c -d volvox volvox_all.fa volvox_all.gff
volvox_all.gff: loading...
volvox_all.gff: 738 records loaded
Loading fasta file volvox_all.fa
volvox_all.fa: 7 records loaded
The arguments to bp_load_gff.pl are:
| -c | clear (initialize) the database |
| -d volvox | Load into the database named volvox |
| volvox_all.fa volvox_all.gff | The data files to load. |
The MySQL database is all ready to go. Now, in order to tell GBrowse to start using the MySQL database rather than the in-memory database, you need to make a small change to the volvox.conf configuration file. Find the first few lines of the file and change them to look like this:
[GENERAL]
description = Volvox Example Database
db_adaptor = Bio::DB::GFF
db_args = -adaptor dbi::mysql
          -dsn volvox
          -user nobody
          -pass
The -adaptor argument is telling GBrowse to use the "dbi::mysql" database adaptor, which is the BioPerl interface to MySQL databases. The -dsn argument tells GBrowse to use the data source name "volvox".
When you reload the web page, GBrowse will now be using MySQL. Depending on the speed of your CPU and disk, you might notice that it seems a bit snappier than the in-memory version. See CONFIGURE_HOWTO.txt for more information on configuring GBrowse to use relational databases. Also see the following perldoc manual pages:
- perldoc Bio::DB::GFF::Adaptor::dbi::mysql
  The MySQL adaptor.
- perldoc Bio::DB::GFF::Adaptor::dbi::oracle
  The Oracle adaptor.
- perldoc Bio::DB::GFF::Adaptor::dbi::pg
  The PostgreSQL adaptor.
- perldoc Bio::DB::GFF::Adaptor::dbi::biofetch
  An adaptor that will fetch data automatically from GenBank/EMBL and load it into a local MySQL database.
- perldoc Bio::DB::GFF::Adaptor::memory
  An adaptor for in-memory databases running off files.
- perldoc Bio::DB::Das::Chado
  An adaptor for PostgreSQL databases using the Chado schema (see the Chado home page).
- perldoc Bio::DB::Das::BioSQL
  An adaptor for PostgreSQL and MySQL databases using the BioSQL schema (see www.biosql.org).
It is often useful to have independent annotation data sets that can be visualized together but updated separately. For example, you may be working on a genome that has a core set of stable annotations that everyone shares, such as the set of protein-coding genes, and independent sets of annotations that change frequently, such as promoter predictions and experimental data.
GBrowse provides several mechanisms for making this type of modular annotation possible. You can:
- Upload one or more files of annotations temporarily, and view them in the context of the core annotations. These annotations will be private to the user who uploads the annotations; others cannot see the data.
- Put one or more GFF files in a web-accessible location, such as an FTP or Web site, and point GBrowse at it. These annotations will be accessible to anyone who knows the correct URLs.
- Point one GBrowse at another GBrowse. All the tracks in the second instance of GBrowse will be available to the first GBrowse. This method uses the Distributed Annotation System (DAS) and can handle very large data sets.
This section will lead you through the various ways to view third party annotations on top of GBrowse. The examples are somewhat contrived since we only have one computer to work with, and by necessity both the main data and the third-party feature data will have to reside on the same computer. Don't be confused by this, and keep in mind that in the real world, GBrowse will be running on one computer, and the third-party annotation data will be loaded from another network-accessible computer.
Instead of using the artificial volvox data, we will now use some real genome annotations from the C. elegans genome project. This is a region around C. elegans cosmid C01F4. The core data that we'll be using is contained in the files elegans_core.gff and elegans.fa.
Refer back to the beginning of the tutorial now and create a GBrowse database directory named "elegans_core". Then copy elegans_core.gff and elegans.fa into it. The configuration file to use is elegans_core.conf. Place it in /etc/httpd/conf/gbrowse.conf/.
Confirm that you can browse the database. Figure 27 is a picture of the entire data set with all core tracks turned on.
Figure 27: The core C. elegans dataset.
Uploading an Annotation File
We will now add some third-party annotations to the display. These are contained in the files "elegans_acceptor.gff", "elegans_expression.gff", "elegans_sts.gff", "elegans_deletion.gff", and "elegans_repeats.gff":
| elegans_acceptor.gff | Annotations of C. elegans spliced leader acceptor sites. |
| elegans_expression.gff | Positions assayed for gene expression level in C. elegans microarrays. |
| elegans_sts.gff | Primer pairs available for the region produced by the C. elegans ORFeome project. |
| elegans_deletion.gff | Deletion endpoints from a targeted gene knockout project. |
| elegans_repeats.gff | Complex repetitive elements found using the RepeatMasker program. |
We can load each of these files to private storage located on the server using the file upload feature. Copy these five files to your home directory where you can find them easily. Go to the section marked Upload your own annotations and choose the "Browse..." button. Select one of the annotation files, and then press the "Upload" button to upload the file to the server. The annotations contained in the file should now appear on the display. If you now do this for all five of the annotation files, you will eventually get a display like that shown in Figure 28.
Figure 28: After uploading four annotation files.
NOTE: This upload function works even if the gbrowse you are uploading to is located on a remote server. The uploaded files are stored in a private directory on the server away from the main data set. Other users cannot see your data.
Although this display is functional, there is no difference between the appearance of each of the tracks. Fortunately, we can customize the uploaded files quite easily. Let us change the "elegans_sts.gff" file so that the primer pairs use the "primers" glyph. We can either do this by deleting the uploaded file, making the appropriate modification to our local version and then reuploading it, or by editing the file in place. We'll take the latter course.
Scroll to the bottom of the browser window, find the uploaded file named "elegans_sts.gff", and choose "Edit File...".
Figure 29: The uploaded files can be edited in place by clicking the "Edit File..." button.
This will take you to a simple text editor window. At the top of the window, add the following configuration stanza:
# edited elegans_sts.gff file
[reagent]
glyph = primers
height = 6
key = ORFeome project primer pairs
When you are done, press "Submit Changes..." and the display will be updated to show the track with a more readable track name and the primers glyph.
If you like, you can customize each of the files. Here is a suggested set of customizations:
# for the file elegans_repeats.gff
[repeat]
bgcolor = white
key = Complex repeats

# for the file elegans_acceptor.gff
[trans-splice_acceptor]
glyph = diamond
bgcolor = red
key = Trans-splice Acceptors

# for the file elegans_deletion.gff
[Deletion_allele]
glyph = span
key = Gene knockouts

# for the file elegans_expression.gff
[Expression]
bgcolor = orange
height = 4
key = Microarray expression probe
With this combination of configurations, the display will now look as shown in Figure 30:
Figure 30: After customizing the annotation files.
NOTE: Be aware of an important difference between the track configuration of the uploaded files and of the main GBrowse configuration files. In GBrowse, the [STANZA] heading is the symbolic name of the track, and particular feature types are added to the track using the feature= option. In uploaded files, the [STANZA] heading is the feature type itself. This means that each track can only contain one feature type. However, any uploaded GFF file can contain multiple feature types, and each feature type can have its own configuration stanza.
The other important difference between the uploaded file configuration and the GBrowse main configuration is that, for security reasons, Perl subroutines are not allowed in the configuration sections of uploaded files. However, links and link patterns are allowed.
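Because the stanza heading in an uploaded file must match a feature type, it is useful to know which types a GFF file actually contains before writing stanzas for it. The sketch below lists the distinct values of the third GFF column. The function name `gff_feature_types` is invented for this example, and the rule used to recognise feature lines (integer start and end columns) is a simplifying assumption.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct feature types (GFF column 3) in a file.
# A line counts as a feature only when columns 4 and 5 are integers, which
# skips "##" directives, "#" comments and any [stanza] or "key = value" lines.
gff_feature_types() {
    awk '!/^#/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { print $3 }' "$1" | sort -u
}
```

Running `gff_feature_types elegans_sts.gff` should print the type to use as the stanza heading for that file, one type per line if there are several.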
There is no particular reason why the annotation sets have to be split into separate files. We could easily combine them into a single GFF file, just as is done for the core annotations.
Sharing an Annotation File
Once you have an uploaded annotation file set up the way you like it, you might want to share it with others. You can do this easily if you have access to an anonymous FTP or web server (if you are reading this tutorial, it is fair to assume that you do!)
To watch this in action, we will place one of the annotation files onto the local web server and then load it from within the local GBrowse. This contrived example doesn't make much sense until you realize that the same trick will work when the GBrowse server and the web-accessible annotation file are on separate machines halfway across the world.
We will demonstrate using the elegans_sts.gff file. Please use a version that has been edited to place the [reagent] configuration stanza at the top. Then copy this file to the directory "/var/www/html". This will place it at the top of the Web server document tree, but outside the location of GBrowse databases. Check that the file is correctly installed on your web server by fetching this URL: http://localhost/elegans_sts.gff. If the file is correctly installed on the Web server, you will see this:
[reagent]
glyph = primers
height = 6
key = ORFeome project primer pairs
##gff-version 2
##date Tue Feb 24 06:39:41 2004
##sequence-region C01F4 1 40000
##source gbrowse GFFDumper plugin
##NOTE: Selected features dumped.
C01F4 Orfeome_project reagent 3319 17668 . + . PCR_product mv_ZK783.1 ; Amplified 0
C01F4 Orfeome_project reagent 18584 20445 . - . PCR_product mv_G_YK5686 ; Amplified 1
C01F4 Orfeome_project reagent 24509 25425 . - . PCR_product mv_ZK783.3 ; Amplified 1
C01F4 Orfeome_project reagent 26525 33359 . - . PCR_product mv_ZK783.4 ; Amplified 0
C01F4 Orfeome_project reagent 38660 49506 . + . PCR_product mv_C18H2.1 ; Amplified 1
Now go back to your browser, and delete all the uploaded files. (This is to prevent the list of tracks from getting too long!) You can do this by scrolling to the bottom of the browser window and pressing "Delete File" for each of the annotation files that you previously uploaded. This should return you to the display of the core gene models and EST alignments that we began with.
Now we'll reload the STS annotations by using their URL. Scroll to the bottom of the window, find the text field labeled "Enter Remote Annotation URL", type in http://localhost/elegans_sts.gff, and press "Update URLs." The "ORFeome project primer pairs" track will reappear.
In order to make this process even simpler, you can create a popup menu containing the URLs of frequently-accessed remote annotation files. To make this more interesting, first copy the elegans_expression.gff file to the "/var/www/html" directory in the way described earlier. Now elegans_sts.gff and elegans_expression.gff will be available as the URLs http://localhost/elegans_sts.gff and http://localhost/elegans_expression.gff, respectively.
Open up the GBrowse configuration file, "/etc/httpd/conf/gbrowse.conf/elegans_core.conf", and insert the following lines right after the "plugins =" line:
# remote GFF files to make available for optional loading
remote sources = "ORFeome STSs" http://localhost/elegans_sts.gff
                 "Expression probes" http://localhost/elegans_expression.gff
When you reload the web page, you will see a popup menu appear next to the remote annotation URL textfield (Figure 31). The menu will contain options to load "ORFeome STSs" and "Expression probes", and selecting a menu item will have exactly the same effect as typing in the URL manually.
The neat thing about all this is that it works across the Internet. Send the URL of the annotation files to your colleagues (being sure to replace "localhost" with the hostname of your web server!) and they'll be able to load this URL into any GBrowse that uses the same core annotations. You can also use this mechanism within your laboratory or department to share annotation sets without having to give everyone write access to the web server's /var/www/html directory.
Figure 31: The preset remote annotation URL popup menu.
To remove a URL from the list of loaded URLs, just delete it from its text field and reload.
The Distributed Annotation System protocol (DAS; http://www.biodas.org) is a system for exchanging genomic annotations across the Internet. It works similarly to the idea of sharing the URLs of web-accessible GFF files, except that it is designed to support large data sets. When a client application needs to fetch just a subset of the data, such as a small piece of a chromosomal arm, the DAS protocol allows only the relevant annotations to be retrieved, rather than the whole data set.
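To make the idea of fetching only a subset concrete, here is a small sketch that builds the URL of a DAS/1 "features" request restricted to one segment. The `segment=REF:START,STOP` query syntax comes from the DAS/1 specification. The function name `das_features_url` is invented for this example, and the server, data source and coordinates in the usage line are placeholders matching the local setup used in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper: build a DAS/1 "features" request URL for one segment,
# so that only annotations overlapping REF:START,STOP are returned.
das_features_url() {
    server=$1   # DAS base URL, e.g. http://localhost/cgi-bin/das
    dsn=$2      # data source name, e.g. elegans_core
    ref=$3      # reference sequence, e.g. C01F4
    start=$4    # first base of the segment
    stop=$5     # last base of the segment
    printf '%s/%s/features?segment=%s:%s,%s\n' "$server" "$dsn" "$ref" "$start" "$stop"
}
```

For instance, `das_features_url http://localhost/cgi-bin/das elegans_core C01F4 1 10000` prints a URL that asks for the first 10,000 bases of C01F4 only, rather than the whole data set.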
To take advantage of DAS functionality, you will have to install the Perl Bio::Das module. This is available from CPAN (the Comprehensive Perl Archive Network, http://www.cpan.org) or from the GMOD PPM repository. Unix users can install Bio::Das with this command:
% perl -MCPAN -e 'install Bio::Das'
Windows users can use the PPM tool:
C:\Windows> ppm
ppm> install Bio::Das
ppm> quit
You may need to issue the command "rep add gmod http://www.gmod.org/ggb/ppm" if PPM complains that it cannot find Bio::Das.
When you installed GBrowse, you also installed a CGI script that enables your web server to act as a DAS server. The CGI script is named "/var/www/cgi-bin/das", and it runs off the same configuration files as GBrowse itself. Only a very small bit of extra configuration is required to enable full DAS server functionality. In this part of the tutorial we will first turn on the DAS server, and then use it to serve out annotations on the C. elegans database.
To start, open the elegans_core.conf configuration file and add the following line to the configuration file. It can go anywhere before the start of the track definition stanzas, but it is probably a good idea to place it towards the top between "plugins" and "default features."
# DAS reference server
das mapmaster = SELF
This line declares to the DAS system that our server is authoritative for the coordinates of the current C. elegans genome example. This is appropriate if you are setting up a genome for the first time. If, however, you want to annotate against an existing set of genome coordinates, you should replace SELF with the URL of the DAS reference server that serves that genome. For example, release hg16 of the human genome at UCSC corresponds to the DAS URL http://genome.cse.ucsc.edu/cgi-bin/das. A list of reference servers for various model organisms can be found at http://www.biodas.org.
The next step is to go through the configured tracks and add a "das category" to each of them. DAS uses the idea of the "category" of a feature in order to filter sets of features by their purpose. Categories include:
| Category | Description |
|---|---|
| transcription | features that have to do with RNA transcription |
| translation | features that have to do with protein translation and function |
| variation | mutations, deletions, polymorphisms |
| structural | contigs, clones, reads, PCR primers |
| repeat | repetitive elements |
| experimental | a catch-all for experimental data |
| miscellaneous | anything that doesn't fit in one of the other categories |
Find the [Transcripts] stanza and modify it to have a das category of "transcription" as shown here:
[Transcripts]
feature = processed_transcript
glyph = processed_transcript
height = 8
bgcolor = blue
description = 1
das category = transcription
key = Protein-coding genes
Similarly, modify the [Alignments] track to have a das category of "similarity." You do not need to add a category to the DNA track, as it is treated specially by DAS. You're all done! Be sure to save the configuration file before you try the next step.
Using a web browser, fetch the URL http://localhost/cgi-bin/das/dsn. This will return an XML document giving information about each of the data sources that you have configured.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE DASDSN SYSTEM "http://www.biodas.org/dtd/dasdsn.dtd">
<DASDSN>
<DSN>
<SOURCE id="elegans_core">elegans_core</SOURCE>
<MAPMASTER>http://localhost/cgi-bin/das/elegans_core</MAPMASTER>
<DESCRIPTION>C. elegans Core Annotations</DESCRIPTION>
</DSN>
</DASDSN>
This is showing that there is one configured DAS source, the "elegans_core" data set.
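When several data sources are configured, it can be quicker to list them from the command line than to read the XML by eye. The following sketch extracts the `id` attribute of each `<SOURCE>` element with sed. It assumes one `<SOURCE>` element per line, as in the response shown above, and the function name `das_source_ids` is invented for this example; a real XML parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: read a DAS "dsn" XML document on standard input and
# print the id attribute of every <SOURCE> element, one per line.
# Assumes at most one <SOURCE> element per input line.
das_source_ids() {
    sed -n 's/.*<SOURCE id="\([^"]*\)".*/\1/p'
}
```

If curl is installed, `curl -s http://localhost/cgi-bin/das/dsn | das_source_ids` lists the configured sources directly from the running server.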
Next, test that the DAS "types" request is working. This request returns all the feature types that the database knows about. Using a web browser, fetch the URL http://localhost/cgi-bin/das/elegans_core/types. This should return another short document confirming that the "processed_transcript" and "match:BLAT_EST_BEST" feature types are available.
The final test that the DAS server is performing correctly is to browse to the elegans_core database and to turn off all the tracks except for DNA/GC content. This should give you an empty details panel. Now scroll down to the first empty URL entry field and type in http://localhost/cgi-bin/das/elegans_core and press "Update URLs." The page should now reload and display the gene models and the EST alignments. However, the data is now not coming directly from the local database, but from the database via the DAS protocol.
Combining Databases with DAS
We can now use DAS to integrate the core gene model and EST alignment annotations with the STSs, expression data, trans-splice acceptors and other third party annotations. To do this, we will create a GBrowse database that contains the third party annotations, but not the core data. This new database will be used as a DAS source.
Create a new database directory called elegans_extra in the "/var/www/html/gbrowse/databases" directory, and add to it a copy of the file elegans_extra.gff. This GFF file is simply the result of concatenating together the individual annotation files we looked at earlier (elegans_sts.gff, etc.), and removing the redundant comment lines from the top of the file. Now copy the configuration file elegans_extra.conf into the /etc/httpd/conf/gbrowse.conf/ directory. Have a look at this config file, and note that it contains the appropriate "das mapmaster" and "das category" configuration objects.
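The tutorial ships elegans_extra.gff ready-made, but the concatenation step described above could be reproduced roughly as follows. This is a sketch under one assumption: the "redundant comment lines" are the `##` directive lines, which are kept from the first file and dropped from the rest. The function name `merge_gff` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: concatenate several GFF files, keeping the "##"
# directive lines of the first file and dropping them from later files.
merge_gff() {
    awk 'FNR == 1 { file_index++ }            # a new input file starts
         file_index > 1 && /^##/ { next }     # skip repeated directives
         { print }' "$@"
}
```

A call such as `merge_gff elegans_sts.gff elegans_expression.gff > my_extra.gff` (where `my_extra.gff` is a placeholder output name) writes the combined annotations to a single file.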
Once the config file is installed, confirm that you can browse the extra annotations by fetching http://localhost/cgi-bin/gbrowse/elegans_extra.
Now we're ready to layer the extra annotations onto the core annotations using DAS. Open up a browser window on the



Figure 8: Showing the gene as well as its transcripts 





















