A THEORETICAL APPROACH TO IDENTIFY PUTATIVE SIGNAL SEQUENCES:
PROVIDING EVIDENCE FOR SECRETED GENE PRODUCTS BY PURE COMPUTATIONAL ANALYSIS.

George Ruban¹ and Joachim Reidl²

  1. 67 Thurston Str, Somerville, MA 02145, USA.
    WWW: http://www.geocities.com/SiliconValley/Vista/2013/
    e-mail: gmn@csa.bu.edu
  2. Research Center for Infectious Disease, University of Würzburg, Röntgenring 11, 97070 Würzburg, Germany. Tel.: 0049 931 59509
    e-mail: joachim.reidl@rzroe.uni-wuerzburg.de

Based on the weight-matrices of von Heijne¹, we have developed a computer program, written in C, that screens whole chromosomal DNA fragments for putative signal sequences of exported or secreted proteins. The program automatically translates large DNA fragments (e.g. small chromosomes) in all six reading frames, using the universal genetic code. Subsequent automated analysis of the N-terminal signal peptide yields a compiled sub-library for further analysis with the "fasta" program. Using the complete Haemophilus influenzae Rd chromosome², we demonstrate that numerous experimentally characterized secreted or exported proteins can be detected by this purely computational approach. This indicates that the tool could be used to detect exported gene products (e.g. virulence factors) based on DNA sequence information only.

  1. von Heijne, G. A new method for predicting signal sequence cleavage sites. 1986 Nucleic Acids Res. 14: 4683-4690.
  2. Fleischmann et al., Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. 1995 Science 269: 496-512.

Introduction

To initiate protein export across the inner membrane (in prokaryotes) or the endoplasmic reticulum (in eukaryotes), most secretory proteins carry an N-terminal signal sequence. The specific features of a typical signal peptide, well investigated for prokaryotes, are reflected by its secondary structure:

  1. a positively charged amino-terminal region, followed by
  2. a stretch of 11 to 15 hydrophobic residues, and finally
  3. a cleavage site recognized by a specific leader peptidase (Lpp, or LppII for lipoproteins).

After translation, a precursor protein is recognized by the secretory (sec) machinery via its N-terminal signal sequence, processed, and subsequently released as mature protein into the periplasm. Some proteins are then further processed to reach their final destination, e.g. the outer membrane or the extracellular environment. Besides numerous exported proteins responsible for cell wall synthesis, structural integrity, and substrate binding, a variety of virulence factors and toxins (e.g. surface-exposed virulence factors, entero- and exotoxins) are initially secreted across the inner membrane by the signal sequence dependent secretory pathway.

Noting the increasing amount of information from DNA sequencing of complete chromosomes, especially of pathogenic prokaryotes, we set out to develop a computational approach based on the well-established signal sequence evaluation matrix of von Heijne¹. The resulting program provides a method to screen for putative exported proteins that depend on a signal sequence. Using the completed DNA sequence of the H. influenzae chromosome², we have verified the program and show that it facilitates the detection of naturally secreted proteins.

Results and Discussion

Program function

In essence, the program converts a DNA file into a file of likely proteins. As illustrated in figure 1, the program first translates a given DNA file in all six possible reading frames. Second, a sliding window of N residues is fitted to each possible protein, optionally in a loop, to calculate the best value (significance value) from the given von Heijne weight-matrix¹. Third, each recognized N-terminal signal sequence is written, along with the corresponding reading frame, location information, protein length, and weight factor, into a formatted data base (Fasta format).
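
The first of these steps can be sketched in C as follows. This is a minimal illustration and not the original source code: the function names `translate_codon`, `reverse_complement`, and `translate_frame` are hypothetical, and the sketch assumes the standard genetic code and input consisting of the letters A, C, G, and T. The three forward frames are obtained by calling `translate_frame` with frame 0, 1, and 2; the three reverse frames are obtained by doing the same on the reverse complement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard genetic code, indexed as 16*b1 + 4*b2 + b3
   with T=0, C=1, A=2, G=3. '*' marks a stop codon. */
static const char CODON_TABLE[65] =
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG";

static int base_index(char b)
{
    switch (b) {
    case 'T': case 't': return 0;
    case 'C': case 'c': return 1;
    case 'A': case 'a': return 2;
    case 'G': case 'g': return 3;
    default:            return -1;   // ambiguous or unknown base
    }
}

/* Translate one codon (three bases); returns 'X' for unknown bases. */
char translate_codon(const char *codon)
{
    int i = base_index(codon[0]);
    int j = base_index(codon[1]);
    int k = base_index(codon[2]);
    if (i < 0 || j < 0 || k < 0)
        return 'X';
    return CODON_TABLE[16 * i + 4 * j + k];
}

/* Write the reverse complement of dna[0..n-1] into out (n+1 bytes). */
void reverse_complement(const char *dna, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++) {
        switch (dna[n - 1 - i]) {
        case 'A': case 'a': out[i] = 'T'; break;
        case 'C': case 'c': out[i] = 'G'; break;
        case 'G': case 'g': out[i] = 'C'; break;
        case 'T': case 't': out[i] = 'A'; break;
        default:            out[i] = 'N'; break;
        }
    }
    out[n] = '\0';
}

/* Translate one forward frame (0, 1, or 2) of dna[0..n-1].
   out must hold at least n/3 + 1 bytes. Returns the residue count. */
size_t translate_frame(const char *dna, size_t n, int frame, char *out)
{
    assert(frame >= 0 && frame <= 2);
    size_t k = 0;
    for (size_t i = (size_t)frame; i + 3 <= n; i += 3)
        out[k++] = translate_codon(dna + i);
    out[k] = '\0';
    return k;
}
```

For example, `translate_frame("ATGAAATAA", 9, 0, buf)` writes `MK*` into `buf`, i.e. a start methionine, a lysine, and a stop.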

Program Algorithm:

  • Input DNA fragment (e.g. the 1,830,137 bp H. influenzae chromosome)

  • Translation into six reading frames, delimited by start (Met) and stop codons (using the universal genetic code)
  • Comparison of the N-terminus with the parameters of the input options and the von Heijne weight-matrix¹
  • Sorting of the analyzed orfs according to their position and reading frame in the corresponding DNA fragment, and storage on disk in fasta readable format

  • Result: editable preformatted fasta library containing the orfs that encode putative signal sequences, as deduced from the DNA segment

Figure 1
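
The storage step can be illustrated with a short C sketch. The record layout and the header format (`>orf frame=... start=... length=... weight=...`) are assumptions made for this example; the text states only that reading frame, location, protein length, and weight factor are stored with each sequence. The names `OrfRecord` and `format_fasta` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one candidate orf. */
typedef struct {
    int frame;            // +1..+3 forward strand, -1..-3 reverse strand
    long start;           // 1-based nucleotide position of the start codon
    int score;            // weight factor of the best window
    const char *protein;  // deduced amino acid sequence
} OrfRecord;

/* Format one record as a FASTA entry with 60 residues per line.
   Returns the number of characters written, or -1 if out is too small. */
int format_fasta(const OrfRecord *r, char *out, size_t cap)
{
    assert(r != NULL && r->protein != NULL && out != NULL);
    size_t len = strlen(r->protein);
    int n = snprintf(out, cap, ">orf frame=%+d start=%ld length=%zu weight=%d\n",
                     r->frame, r->start, len, r->score);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    size_t used = (size_t)n;
    for (size_t i = 0; i < len; i += 60) {
        size_t chunk = (len - i < 60) ? len - i : 60;
        if (used + chunk + 2 > cap)   // chunk + newline + terminator
            return -1;
        memcpy(out + used, r->protein + i, chunk);
        used += chunk;
        out[used++] = '\n';
    }
    out[used] = '\0';
    return (int)used;
}
```

Keeping the frame, position, and weight in the header line means the resulting library stays editable by hand and can still be sorted or filtered after it has been written.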

Figure 2 illustrates the basic function, using the blaM-encoded beta-lactamase of E. coli as an example. In step 1, translation of the open reading frame begins at the initiator codon ATG. In step 2, a sliding window of N=15 amino acids is compared against the weight-matrix¹; in a loop, successive window positions are screened for the highest score above the minimum significance value parameter. In step 3, the best candidates are saved as described above.

Figure 2
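
The sliding-window step can be sketched in C as shown here. The sketch assumes that a window is scored by summing one weight per position and residue, as is usual for a weight matrix; the exact scoring used by the program is not spelled out in the text. The published von Heijne values are not reproduced: the matrix is left for the caller to fill in, and the names `WeightMatrix`, `window_score`, and `best_window_score` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW 15   // window length N used in the figure 2 example
#define N_AA   20

static const char AA_ORDER[] = "ACDEFGHIKLMNPQRSTVWY";

/* One weight per window position and amino acid; values supplied by caller. */
typedef struct {
    int weight[WINDOW][N_AA];
} WeightMatrix;

static int aa_index(char aa)
{
    const char *p = (aa != '\0') ? strchr(AA_ORDER, aa) : NULL;
    return p ? (int)(p - AA_ORDER) : -1;
}

/* Sum the weights of the WINDOW residues starting at seq[start].
   Residues outside the 20 standard amino acids contribute 0. */
int window_score(const WeightMatrix *m, const char *seq, size_t start)
{
    int score = 0;
    for (int pos = 0; pos < WINDOW; pos++) {
        int a = aa_index(seq[start + pos]);
        if (a >= 0)
            score += m->weight[pos][a];
    }
    return score;
}

/* Slide the window over offsets 0..max_steps from the N-terminus and
   return the best score; the winning offset is stored in *best_offset. */
int best_window_score(const WeightMatrix *m, const char *seq,
                      size_t max_steps, size_t *best_offset)
{
    size_t len = strlen(seq);
    assert(len >= WINDOW);
    int best = window_score(m, seq, 0);
    *best_offset = 0;
    for (size_t off = 1; off <= max_steps && off + WINDOW <= len; off++) {
        int s = window_score(m, seq, off);
        if (s > best) {
            best = s;
            *best_offset = off;
        }
    }
    return best;
}
```

A candidate would then be kept only if the returned score reaches the minimum significance value chosen by the user.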

Figure 3 shows the parameter input. The program accepts DNA files in ASCII format. The user can then choose several options:

  1. the weight-matrix evaluation table;
  2. a limit on the length of the deduced reading frames (e.g. to search for smaller secreted proteins);
  3. the number of steps by which the sliding window may be shifted away from the methionine start codon at the N-terminus;
  4. as an optional feature, selected amino acids that must not occur in the signal sequence;
  5. finally, a minimum weight factor (depending on the weight-matrix) as a threshold for selecting proteins.

Parameter Input:

  • Input DNA fragment file
  • Choose weight-matrix (eukaryotes/prokaryotes)
  • Select a minimum orf length
  • Set maximum number of steps to slide the sliding window at the N-terminus
  • Deselect signal sequences containing certain amino acid residues (e.g. basic, acidic, polar, etc.)
  • Choose minimum weight factor (1-195 for prokaryotes, 1-774 for eukaryotes) according to the von Heijne weight-matrix¹.

Figure 3
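
One way to picture how these options act together is the following C sketch. It is an illustration under stated assumptions, not the program's actual data structures: the struct `ScreenParams`, its field names, and the function `accept_candidate` are hypothetical, and the sketch assumes that a candidate must pass all three tests (minimum orf length, minimum weight factor, no excluded residue in the window).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { MATRIX_PROKARYOTE, MATRIX_EUKARYOTE } MatrixKind;

/* Hypothetical container for the user options listed in figure 3. */
typedef struct {
    const char *dna_path;     // input DNA fragment file (ASCII)
    MatrixKind matrix;        // which weight-matrix to use
    size_t min_orf_len;       // minimum orf length in amino acids
    size_t max_slide_steps;   // maximum shifts of the sliding window
    const char *excluded;     // residues not allowed in the window, "" for none
    int min_weight;           // minimum weight factor (threshold)
} ScreenParams;

/* Return 1 if a scored window passes all user-selected filters, else 0. */
int accept_candidate(const ScreenParams *p, const char *window,
                     size_t orf_len, int score)
{
    assert(p != NULL && window != NULL && p->excluded != NULL);
    if (orf_len < p->min_orf_len)
        return 0;
    if (score < p->min_weight)
        return 0;
    if (p->excluded[0] != '\0' && strpbrk(window, p->excluded) != NULL)
        return 0;
    return 1;
}
```

With `excluded` set to `"DE"`, for instance, any window containing an aspartate or glutamate residue would be rejected regardless of its score.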

The source code for the program was written in ANSI C, using only the "//" comment style from C++. It should therefore compile under most C/C++ compilers and operating systems without modification. The executable used here was compiled with Borland Turbo C++ for Windows 3.1, and has been run under Windows 3.1 and on a Power Macintosh running SoftWindows.

Approaching the H. influenzae genome

As reported recently², the complete chromosome of H. influenzae has been sequenced and is made available by TIGR on the World Wide Web (http://www.tigr.org). To evaluate the properties of the program, we processed the complete H. influenzae chromosome with different input parameters. The 1,830,137 bp chromosome encodes 1743 predicted open reading frames². With the minimum weight parameter set to zero and the minimum deduced protein length set to twenty amino acids, the chromosome has the theoretical capacity to encode 12421 open reading frames (orfs), regardless of numerous possible start sites located within existing genes or in transcriptional control regions. As can be seen in figure 4, the proteins are sorted into sub-libraries depending on weight factor (0-160) and sliding window (4-15). With weight factor zero, 12421 hypothetical orfs are generated and saved into a sub-library. As the weight factor increases (80, 90, 100, 120, 160), the number of orfs containing possible signal sequences decreases significantly in the respective sub-libraries. Remarkably, if the minimum length of the deduced orfs is set to 100 amino acids, the output library with weight factor 0 contains about 1708 possible reading frames, very close to the actual deduced number of 1743 orfs. It can also be observed that the respective sub-libraries for minimal lengths of 20 or 100 amino acids do not differ significantly in the number of saved orfs.
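
The effect of the minimum length parameter on the orf count can be illustrated with a small C sketch that operates on one translated reading frame. It assumes that an orf runs from the first methionine after a stop codon to the next stop codon and that its length is counted in amino acids without the stop; how the original program treats additional internal start sites is not stated, so this rule is an assumption. The function name `count_orfs` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count orfs in a translated frame ('*' marks stop codons).
   An orf starts at the first 'M' of a stop-delimited segment and must
   contain at least min_len residues before the closing stop. */
size_t count_orfs(const char *protein, size_t min_len)
{
    assert(protein != NULL);
    size_t count = 0;
    const char *p = protein;
    const char *stop;
    while ((stop = strchr(p, '*')) != NULL) {
        const char *met = memchr(p, 'M', (size_t)(stop - p));
        if (met != NULL && (size_t)(stop - met) >= min_len)
            count++;
        p = stop + 1;   // continue after this stop codon
    }
    return count;
}
```

On the toy frame `AAMKK*MA*LLMKKKK*`, a minimum length of 2 yields three orfs, a minimum of 3 yields two, and a minimum of 5 yields one, which mirrors on a small scale how raising the length threshold shrinks the weight factor 0 library.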

Figure 4

To verify the obtained sub-libraries, we investigated whether a defined subset of experimentally characterized and predicted secreted proteins can actually be identified. To this end, we combined the characterized secreted proteins of H. influenzae (HI0693 e(P4), HI0401 P1, HI0139 P2, HI1164 P5, HI0381 P6, HI0689 Hpd, HI0990 IgA-protease, HI0994 transferrin bdg. Tbp1, HI0995 transferrin bdg. Tbp2, HI0251 TonB, HI0113 HxuC, HI0263 HxuB) and predicted precursors (HI1111 XylBP, HI0504 RibBP, HI1579 Lpp, HI0620 HlpA, HI0302 CutE, HI1567 IroA, HI0703 LppB, HI0256 Lipo-34) into a test-set, and asked how frequently such proteins are found in a variety of sub-libraries generated by the program. As shown in figure 5, a 60- to 70-fold relative accumulation of the test-set can be achieved by using a sliding window of 11 to 15 and weight factors of 100 to 120. The results are expressed as the relative accumulation of test-set proteins among the total orfs of a sub-library, normalized to the sub-library with weight factor 0.

For example, the relative accumulation of the test-set among the total orfs of the sub-library with weight factor 0 (window 4-15) equals 1, and with weight factor 120 (window 11 to 15) it equals 65 and 74, respectively.
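
The relative accumulation can be reproduced with a few lines of C. The sketch assumes that it is the fraction of test-set proteins among the orfs of a sub-library, divided by the same fraction in the weight factor 0 library, so that the weight factor 0 library has the value 1 by construction. The function name `relative_accumulation` is hypothetical. The example figures are taken from the text: 20 test-set proteins, 12421 orfs at weight factor 0, and a sub-library of 48 orfs that contains 25% of the test-set (5 proteins); that all 20 test-set proteins are present in the weight factor 0 library is an assumption.

```c
#include <assert.h>

/* Fraction of test-set hits in a sub-library, relative to the same
   fraction in a reference library (here: weight factor 0). */
double relative_accumulation(unsigned hits_sub, unsigned orfs_sub,
                             unsigned hits_ref, unsigned orfs_ref)
{
    assert(orfs_sub > 0 && hits_ref > 0 && orfs_ref > 0);
    double frac_sub = (double)hits_sub / orfs_sub;
    double frac_ref = (double)hits_ref / orfs_ref;
    return frac_sub / frac_ref;
}
```

Under these assumptions, `relative_accumulation(5, 48, 20, 12421)` evaluates to about 64.7, in line with the value of 65 quoted for weight factor 120.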

Figure 5

To obtain specific information about the generated sub-libraries, we examined the content of one sub-library, containing 48 orfs, generated with weight factor 120, sliding window 15, and a minimum protein length of 20 amino acids. This sub-library contained 25% of the test-set proteins. To further characterize its content, we submitted each peptide sequence to a Blast search on the NCBI network server. We found that 35 orfs encode putative or experimentally characterized precursor proteins, 4 are homologues of membrane-associated transport proteins, and 9 correspond to proteins with no defined export characteristic or gave no data base hit. The results are summarized in table 1.

Table 1: Identification of Sub-Library Content

Number  Identification  Function
1       HI0052          hypothetical protein, precursor
2       HI0066          AmiB, N-acetylmuramoyl-L-Ala amidase
3       HI0131          AfuA, iron uptake outer membrane
4*      HI0139          P2, outer membrane protein
5       HI0146          hypothetical protein, precursor
6       HI0147          hypothetical protein, precursor
7       HI0206          UshA, 5' nucleotidase, precursor
8       HI0066          FimA, adhesin B, precursor
9*      HI0131          P6, Pal precursor
10      HI0139          TolB, precursor
11*     HI0401          P1, outer membrane protein
12*     HI0504          RbsB, ribose bdg.-protein, periplasm
13      HI0507          hypothetical signal sequence
14      -               potE, putative putrescine antiporter
15      -               macB, sigma E homologue
16      -               Rpl7 50S
17      HI0661          HhuA, hemoglobin bdg., outer membrane
18      HI0698          hypothetical protein, precursor
19      -               L-asparaginase, precursor
20      -               MglB, methyl-galactoside bdg., precursor
21      HI0825          hypothetical protein, precursor
22      HI0852          possible drug translocase
23      -               DsbD, C-type cytochrome biog., precursor
24      -               KefC, potassium efflux system, precursor
25      -               export factor
26      HI1019          thiamin-bdg.-prot., precursor
27      -               MerP, mercury scavenger prot., precursor
28      HI1090          hemin export protein
29      -               C-cytochrome biog.-prot., precursor
30      -               no hit
31      -               CysK, cysteine synthase
32      -               OppA, oligo peptide bdg.-prot., precursor
33      -               FtsI, penicillin bdg.-prot., precursor
34      HI1149          hypothetical protein, precursor
35*     HI1161          P5, outer membrane protein, precursor
36      -               arginine bdg.-prot., precursor
37      HI1182          hypothetical protein, precursor
38      -               DsbC, precursor
39      -               dihydrolipoamide acetyltransferase
40      -               no hit
41      -               transhydrogenase
42      HI1466          FhuA homologue, precursor
43      HI1586          hypothetical protein, no precursor
44      HI1591          outer membrane lipoprotein carrier
45      HI1601          hypothetical protein, precursor
46      HI1624          hypothetical protein, precursor
47      HI1693          molybdate bdg.-prot., precursor
48      HI1709          hypothetical protein, precursor

Identification was determined by homology analysis, using the Blast search engine of the NCBI network server. HI identifiers are included where available. Underlined results indicate proteins that are potentially not secreted (see text). Asterisks (*) mark orfs that belong to the test-set.

Conclusion

In summary, by processing the H. influenzae genome, we have demonstrated that this program is a simple tool for a first-step analysis to detect signal sequence dependent secreted proteins based on DNA information only. The tool makes it possible to skip the time-consuming step of precisely dissecting the deduced coding regions of large bacterial chromosomes before they become accessible for further characterization, for example in a search for secreted proteins. Furthermore, the program can be used to generate user-friendly data bases of individually composed sub-libraries, which can subsequently serve as fasta formatted libraries for fast homology searches for suspected homologues of already characterized proteins or secreted proteins (e.g. virulence factors).

