SciELO - Scientific Electronic Library Online

 

South African Journal of Science

Online version ISSN 1996-7489
Print version ISSN 0038-2353

S. Afr. j. sci. vol.108 no.1-2 Pretoria Jan. 2012

 

RESEARCH ARTICLE

 

First fungal genome sequence from Africa: a preliminary analysis

 

 

Brenda D. Wingfield (I); Emma T. Steenkamp (II); Quentin C. Santana (I); Martin P.A. Coetzee (I); Stefan Bam (I); Irene Barnes (I); Chrizelle W. Beukes (II); Wai Yin Chan (II); Lieschen de Vos (I); Gerda Fourie (II); Melanie Friend (I); Thomas R. Gordon (III); Darryl A. Herron (II); Carson Holt (IV); Ian Korf (V); Marija Kvas (II); Simon H. Martin (I); X. Osmond Mlonyeni (I); Kershney Naidoo (I); Mmatshepho M. Phasha (II); Alisa Postma (I); Oleg Reva (VI); Heidi Roos (I); Melissa Simpson (I); Stephanie Slinski (III); Bernard Slippers (I); Rene Sutherland (II); Nicolaas A. van der Merwe (I); Magriet A. van der Nest (I); Stephanus N. Venter (II); Pieter M. Wilken (I); Mark Yandell (IV); Renate Zipfel (I); Mike J. Wingfield (I)

(I) Department of Genetics, Forestry and Agricultural Biotechnology Institute, University of Pretoria, Pretoria, South Africa
(II) Department of Microbiology and Plant Pathology, Forestry and Agricultural Biotechnology Institute, University of Pretoria, Pretoria, South Africa
(III) Department of Plant Pathology, University of California, Davis, California, USA
(IV) Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah, USA
(V) Genome Centre, University of California, Davis, California, USA
(VI) Department of Biochemistry, Bioinformatics Unit, University of Pretoria, Pretoria, South Africa

Correspondence to

 

 


ABSTRACT

Some of the most significant breakthroughs in the biological sciences this century will emerge from the development of next generation sequencing technologies. The ease of availability of DNA sequence made possible through these new technologies has given researchers opportunities to study organisms in a manner that was not possible with Sanger sequencing. Scientists will, therefore, need to embrace genomics, as well as develop and nurture the human capacity to sequence genomes and utilise the 'tsunami' of data that emerge from genome sequencing. In response to these challenges, we sequenced the genome of Fusarium circinatum, a fungal pathogen of pine that causes pitch canker, a disease of great concern to the South African forestry industry. The sequencing work was conducted in South Africa, making F. circinatum the first eukaryotic organism for which the complete genome has been sequenced locally. Here we report on the process that was followed to sequence, assemble and perform a preliminary characterisation of the genome. Furthermore, details of the computer annotation and manual curation of this genome are presented. The F. circinatum genome was found to be nearly 44 million bases in size, which is similar to that of four other Fusarium genomes that have been sequenced elsewhere. The genome contains just over 15 000 open reading frames, which is fewer than in the related species Fusarium oxysporum, but more than in Fusarium verticillioides. Amongst the various putative gene clusters identified in F. circinatum, those encoding the secondary metabolites fumonisin and fusarin appeared to harbour evidence of gene translocation. It is anticipated that similar comparisons of other loci will provide insights into the genetic basis for pathogenicity of the pitch canker pathogen.
Perhaps more importantly, this project has engaged a relatively large group of scientists including students in a significant genome project that is certain to provide a platform for growth in this important area of research in the future.


 

 

Introduction

The target genome

The Ascomycete fungus Fusarium circinatum is the causal agent of pitch canker, which is a serious disease that affects numerous Pinus species worldwide.1 The term 'pitch canker' refers to the large resinous cankers that develop on roots, trunks, branches and reproductive organs of established or mature Pinus hosts (Figure 1). On seedlings, the pathogen mainly causes root and collar rot, which are also the symptoms that were observed in South Africa when this pathogen was first detected in 1990.2,3 In contrast to the situation in other parts of the world, F. circinatum remained a nursery pathogen since this first outbreak, and it was only in 2007 that it emerged as a major pathogen in plantations planted to susceptible Pinus species.4 Apart from the losses associated with the plantation outbreaks of pitch canker, F. circinatum-related mortality during plantation establishment has been estimated to exceed R10 million annually.5 The pitch canker fungus thus represents a serious threat to the future of the pine forestry industry in this country.

Relatively little is known regarding the genetics of F. circinatum, with the bulk of knowledge at this level relating to its phylogeny and diagnostics,6,7 as well as to its population biology.8,9 Previous studies have, for example, shown that F. circinatum is a heterothallic fungus capable of both sexual and asexual reproduction. Unlike many other Ascomycete pathogens, sexual and asexual reproduction of F. circinatum have been shown in regions of the world where F. circinatum has been introduced relatively recently.10,11,12 Furthermore, studies have also shown that the fungus probably originated in Mexico or Central America and that it has been accidentally introduced into pine-growing regions around the world.13 In all cases, however, these previous DNA-based studies have utilised information from either housekeeping loci or microsatellite-rich regions, which in most cases represent small or limited portions of the pathogen's genome.

Whole-genome analysis procedures such as genetic linkage mapping and genome sequence comparisons have increased our understanding of the genetic basis of various biological phenomena in fungi. Well-known examples include the development of spores in Pleurotus pulmonarius14 and the development of ectomycorrhizal symbiosis in Laccaria bicolor.15 Such whole-genome approaches have also shed light on the evolution of fungal pathogenicity,16,17 particularly for Fusarium species such as Fusarium oxysporum, Fusarium verticillioides and Fusarium graminearum.18,19 The fact that the genomic data for these Fusarium species are in the public domain, and that a framework map is available for F. circinatum,20 therefore presents ideal opportunities to understand the genetic basis for pathogenicity in the pitch canker fungus.

The aim of this study was to sequence, assemble and annotate the genome of F. circinatum. In addition, we present a preliminary analysis of putative gene clusters that are unique to F. circinatum and we compare three loci of this genome with the genomes of three close relatives: F. oxysporum, F. verticillioides and F. graminearum (Figure 2). From a South African perspective, this study will have significant impact - not only because the pitch canker pathogen is the first eukaryotic organism for which the entire genome has been sequenced in Africa, but also because the project strongly promotes human capacity development in the field of genome sequencing on the African continent. Furthermore, data emerging from this sequence will promote many studies concerning the pathogen and potentially lead to innovative approaches to reduce the losses that the pathogen is causing in South Africa and elsewhere in the world.

 

 

The sequence: Genome sequencing, assembly and integrity

In this study we specifically targeted a F. circinatum isolate (FSP34) for which a genetic linkage map based on amplified fragment length polymorphisms is available from a previous study.20 The availability of this framework map would thus provide some higher level structure for the final genome assembly. High quality DNA was isolated23 and then sequenced on a Roche 454 GS FLX system (Life Sciences, Connecticut, USA) using the titanium chemistry by Inqaba Biotechnologies (Pretoria, South Africa). This sequencing generated a total of 500 megabases (Mb) of DNA sequence, which comprised 1 655 231 reads (Table 1). De novo assembly of these sequences with the 454 GS assembler software package, Newbler,24 resulted in 4509 contigs, the largest of which was 129 667 base pairs (bp) in length. Accordingly, the total size of the genome for F. circinatum isolate FSP34 is estimated at 43.97 Mb, which falls within the range of what has been reported for other Fusarium species. The genomes of F. graminearum, F. verticillioides and F. oxysporum are 36 Mb, 40 Mb and 60 Mb, respectively.25

 

 

In order to confirm the integrity of the assembled F. circinatum genome, we interrogated the assembly for the presence and order of the open reading frames (ORFs) known to be encoded at the mating type (MAT) locus of this fungus. From previous research it is known that the mating type of F. circinatum isolate FSP34 is MAT-1.20,26 Within the F. circinatum assembly we thus expected to find three MAT-1 ORFs (MAT 1.1.1, MAT 1.1.2 and MAT 1.1.3) and the entire region to be flanked by genes encoding a cytoskeleton assembly control protein (SLA1) and a DNA lyase (APN1).25,27 Local Basic Local Alignment Search Tool (BLAST) analysis of the assembly indicated that a single contig (Contig00012) contained MAT-1 sequences. Examination of this 25 000 bp contig confirmed the presence of the genes, in both the same orientation and order as those found in other Fusarium species (Figure 3). This process of verification was repeated on two additional contigs (data not shown) containing genes that were of interest and also confirmed the accuracy of the assembly.
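The locus check described above can be illustrated with a minimal Python sketch. It assumes the BLAST hits on a contig have already been reduced to gene names with start and end coordinates; the `Hit` type, the `is_flanked` helper and all coordinates below are hypothetical and are not taken from the study.

```python
from typing import Iterable, NamedTuple, Sequence, Tuple


class Hit(NamedTuple):
    """One gene located on a contig (start < end, contig coordinates)."""
    gene: str
    start: int
    end: int


def is_flanked(hits: Iterable[Hit],
               inner: Sequence[str],
               flanks: Tuple[str, str]) -> bool:
    """Return True if every gene in `inner` lies between the two flanking genes."""
    by_gene = {h.gene: h for h in hits}
    required = set(inner) | set(flanks)
    if not required.issubset(by_gene):
        return False  # at least one expected gene was not found on the contig
    # Order the two flanking genes by position so strand/orientation of the
    # contig does not matter for this simple test.
    left, right = sorted((by_gene[f] for f in flanks), key=lambda h: h.start)
    return all(left.end <= by_gene[g].start and by_gene[g].end <= right.start
               for g in inner)


# Invented coordinates on a 25 000 bp contig, for illustration only.
EXAMPLE_HITS = [
    Hit("SLA1", 2000, 6000),
    Hit("MAT 1.1.3", 7000, 7800),
    Hit("MAT 1.1.2", 8500, 9800),
    Hit("MAT 1.1.1", 10500, 11700),
    Hit("APN1", 12500, 14500),
]
MAT_ORFS = ("MAT 1.1.1", "MAT 1.1.2", "MAT 1.1.3")
FLANKS = ("SLA1", "APN1")

print(is_flanked(EXAMPLE_HITS, MAT_ORFS, FLANKS))
```

This sketch tests only presence and flanking; comparing gene orientation and exact order against other Fusarium species, as was done for Figure 3, would need strand information and a reference order as additional inputs.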

The completeness of the F. circinatum genome sequence was determined by subjecting the sequence to the CEGMA (Core Eukaryotic Genes Mapping Approach) pipeline.28 A defined set of conserved protein families known to occur in all eukaryotes was used for the analysis.28 This procedure also allows for the production of an initial set of reliable gene annotations in a eukaryotic genome, even in a draft form. The analysis revealed that the F. circinatum genome sequence assembly included the large majority (95%) of the genes common to other eukaryotes. The assembled F. circinatum genome was thus at least 95% complete. Future studies will seek to verify whether the missing genes are indeed not encoded by the pitch canker fungus.

Fusarium circinatum is a haploid fungus and the isolate sequenced was established from a single spore. Therefore, as opposed to diploid or polyploid organisms, only a single allele would be found at any particular locus in the genome. This simplifies the genome assembly process for haploid species, which generally require less sequence coverage to produce an accurate assembly. Based on the estimated size of the F. circinatum genome and the amount of sequence information generated, approximately 11X sequence coverage was obtained. We were, therefore, confident that the genome of F. circinatum had been sequenced close to completeness and that the accuracy and integrity of the assembly were as good as could reasonably be expected.
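The coverage figure follows directly from the numbers reported for the sequencing run. The short Python sketch below reproduces the arithmetic; the function and constant names are illustrative, and the 500 Mb total is the rounded value quoted in the text, so the results are approximate.

```python
# Figures quoted in the text (the 500 Mb total is a rounded value).
TOTAL_BASES = 500_000_000      # raw sequence generated, in bases
READ_COUNT = 1_655_231         # number of 454 reads
GENOME_SIZE = 43_970_000       # estimated genome size, in bases


def mean_read_length(total_bases: int, read_count: int) -> float:
    """Average number of bases per read."""
    return total_bases / read_count


def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Average number of times each genome position was sequenced."""
    return total_bases / genome_size


print(f"mean read length: {mean_read_length(TOTAL_BASES, READ_COUNT):.0f} bp")
print(f"fold coverage:    {fold_coverage(TOTAL_BASES, GENOME_SIZE):.1f}X")
```

With these inputs the mean read length is about 302 bp and the coverage about 11.4X, consistent with the 11X stated above.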

Gene annotation and curation

Although computer annotation of genomes has progressed substantially in the last decade, the robustness of genome annotations is still dependent on 'gene calling' programs, each of which has inherent strengths and weaknesses. Most are also designed for animals or plants with genome and gene architectures that are significantly different from those of fungi. In this study, the MAKER annotation pipeline29 was used because it is designed particularly to deal with eukaryotic genomes smaller than 100 Mb. For ab initio ORF predictions, MAKER utilised the programs Genemark ES,30 Augustus31 and SNAP.32 To streamline the ORF prediction process, the MAKER pipeline also used genome data available for F. verticillioides, F. oxysporum and F. graminearum. In addition, some expressed sequence tag (EST) sequence data were included (data not shown) to refine the accuracy of identifying the intron-exon boundaries. After several rounds of annotation to train MAKER and thereby improve its gene calling, approximately 15 000 ORFs were identified in the F. circinatum assembly (Table 2).

Whilst computer annotation programs have become substantially more sophisticated, final annotations typically need to be done manually, which currently presents the most substantial obstacle for all genome projects.33,34 In this study, we used the program Apollo35 to manually annotate and curate the F. circinatum genome. Apollo can directly utilise the sequence output from MAKER and this program also has the advantage of being relatively user friendly for biologists not familiar with computer programming.

In addition to utilising manual curation for the F. circinatum annotation, we followed the novel strategy of engaging students as annotators in the process. This approach was adopted because the skills required for curating a simple eukaryotic genome require little more than a basic degree in the biological sciences with some molecular biology focus. By following this approach we were able to achieve our second aim of promoting human capacity in the field of genome annotation in South Africa. A team of 20 graduate student volunteers was identified for this study. The students were then exposed to a 2-day training course in which the theoretical background involved in gene and genome structure was reinforced and the basic concepts and requirements of the annotation process were learned.

All the annotators were supplied with a number of contigs to curate and a support programme was implemented to assist those annotators who encountered problems. In most cases the learning curve for members of the annotation team was considerable, but tackling the annotation process in this way clearly highlighted the value of genome sequences to a biological sciences programme. The project not only fostered an appreciation of the methodologies and approaches associated with genome sequencing projects, but also provided a large number of graduate students with the opportunity to become experienced in the process of genome sequence annotation.

During the curation, each predicted ORF was compared with the predicted genes from the genomes of F. verticillioides, F. oxysporum and F. graminearum. What was immediately obvious was that about 70% of the F. circinatum ORFs were most similar to those of F. verticillioides, which is consistent with the fact that these two fungi are more closely related to one another than to the other two species (Figure 2). In many cases, when a F. circinatum ORF was not most similar to one in F. verticillioides, the dissimilarity was found to result from differences in intron prediction between the two genomes. Although the ORFs in F. verticillioides have been annotated using FGENESH,36 which also utilises a hidden Markov model-based algorithm to find genes, the genome of this fungus has not been subject to much manual annotation. Also, the numbers of predicted ORFs in F. circinatum and F. verticillioides differed considerably. Compared to F. circinatum, which has about 15 000 ORFs, F. verticillioides contains only about 13 500 ORFs. Of the ORFs apparently missing in F. verticillioides, a significant proportion had, in fact, not been annotated, despite the availability of EST evidence in many cases. This absence suggests that the annotation of F. verticillioides as presented on the Broad Institute website25 requires additional analyses, which would probably increase the number of predicted ORFs in this genome by as much as 5%.

Inspection of the annotated output for the F. circinatum assembly revealed further discrepancies amongst the results of the different prediction programs employed by MAKER. For example, Genemark predicted 15 713 ORFs, whilst Augustus predicted 14 210 ORFs. By manually curating the annotation, it was thus possible to evaluate the various ORF prediction outputs of the pipeline in terms of intron-exon boundaries and EST evidence for F. circinatum and the other Fusarium species. After the manual curation, the F. circinatum assembly contained 15 049 predicted ORFs, with an accuracy of at least 90% for the combined gene prediction of these two programs.
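To put the discrepancy between the two predictors in perspective, the following Python sketch compares each raw ORF count with the curated total. It uses only the three counts quoted above; the names are illustrative. It is a simple count comparison and is not the basis of the 90% accuracy figure, whose calculation is not described in the text.

```python
CURATED_ORFS = 15_049                      # ORFs after manual curation
PREDICTED = {"Genemark": 15_713,           # raw ab initio predictions
             "Augustus": 14_210}


def relative_difference(predicted: int, curated: int) -> float:
    """Signed difference of a prediction count relative to the curated count."""
    return (predicted - curated) / curated


for name, count in PREDICTED.items():
    diff = count - CURATED_ORFS
    rel = relative_difference(count, CURATED_ORFS)
    print(f"{name}: {diff:+d} ORFs ({rel:+.1%}) relative to the curated set")
```

On these counts Genemark over-predicts by 664 ORFs (about 4.4%) and Augustus under-predicts by 839 ORFs (about 5.6%), so the curated total lies between the two raw predictions.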

From the curation it was also observed that most often the contigs terminated in intergenic regions. Although this could be ascribed to the reduced ability of the gene prediction programs to find ORFs in the absence of 3' or 5' gene signatures, the CEGMA output indicated that more than 95% of the core eukaryotic genes were present in the F. circinatum assembly. A more likely explanation is that the assembly program Newbler was not able to assemble across DNA repeat regions, which are most often found in the intergenic regions.

Analysis of unique gene clusters

Reciprocal BLAST analyses were used to compare the predicted ORFs in the F. circinatum genome to those of the other Fusarium species. Within the resulting set of 2599 ORFs unique to F. circinatum (i.e. present in F. circinatum and absent from one or more of the other three Fusarium genomes) we identified 1031 ORFs that occurred next to each other in clusters of 4 or more. The BLAST function of the cDNA Annotation System (dCAS) v1.4.3 was then used to compare our 'unique' set of 1031 ORFs to the Pfam database (