In: Kempf, A. & Saarenmaa, H. (Editors). Internet Applications and Electronic Information Resources in Forestry and Environmental Sciences. Workshop at the European Forest Institute, Joensuu, Finland, August 1-5, 1995. EFI Proceedings 3 (in print).
Forest Research Institute, Unioninkatu 40A, FIN-00170 Helsinki, Finland
*) European Forest Institute, Torikatu 34, FIN-80100 Joensuu, Finland
Addressing the ongoing loss of biodiversity and accelerating biodiversity surveys requires improvements in taxonomic information management, especially for the most numerous group, insects. Currently, only scattered entomological biodiversity databases are available. The Internet, and its World Wide Web (WWW) protocol for distributed multimedia hyperdocuments, provides an efficient meta-model of how information on biological webs and taxonomies can be organised. Moreover, object-oriented modelling provides the tools for modelling complex biological domains such as taxonomy and biological field data realistically. We demonstrate a combination of object-oriented databases and WWW front-ends for managing taxonomic biodiversity information. One WWW page for each taxon, which is dynamically connected to the taxonomic hierarchy, allows users to manage taxonomic information efficiently. The pages and their summaries are created on demand out of the contents of the database. We argue that a new generation of WWW servers with special object-oriented modelling capabilities should be created for managing taxonomic information globally and on a distributed basis.
The total number of species on Earth has been estimated to be on the order of 10^7 [1], of which 1.7 million are known. Loss of this biological diversity currently runs at 10^3 times its natural pace [2], eradicating 10^4 species per year. Hence, over the next 30 years, 10^6 species are bound to disappear even before they become known. Currently, 10^4 new species are discovered annually, a rate which has not changed during the past 15 years [1] despite an overall increase in scientific publishing. Insects comprise more than half of all species, and also the majority of the loss. There is obviously an urgent need to accelerate biodiversity surveys, especially in the area of tropical entomology.
One particular cause of the slow accumulation of entomological data is outdated information management. Outside the Australian National Insect Collection (URL = http://www.ento.csiro.au/research/natres/anic.htm) and the Swedish Museum of Natural History (URL = ftp://ftp.nrm.se/), comprehensive on-line information sources for entomological taxonomy are not currently publicly available. Taxonomic information is maintained separately in each country in off-line files and publications which may not be compatible with each other. Studies require travel and access to single, valuable type specimens. A solution is needed for managing and disseminating taxonomic biodiversity information globally, on-line, synchronously, and on a distributed basis.
Molecular biology has enjoyed such services through GenBank [3] and the EMBL Data Library [4] since the early 1980s. Moving the GenBank databases to a distributed structure with public access reduced the backlog of data entries from 2 years to a few days [5], [6]. Use of on-line databases is now the norm in molecular biology. Taxonomic databases are also succeeding for plants. The International Organization for Plant Information (IOPI) is constructing a world plant check-list and database [7]. Its Taxonomic Databases Working Group has prepared a set of standards by which the different IOPI sites operate. Another case in point is the Biodiversity Information Network / Agenda 21 (BIN21), a special interest network [8] on the Internet that aims at organising biodiversity information into a network of distributed, public domain databases [9]. If BIN21 succeeds, it will bring the rest of bioinformatics to the level of molecular genetics in information management. However, there are neither universally accepted data models nor publicly available database applications, which has caused delays in this critical area and incurred further costs to many institutions. In the following, we present the results of our experiments aiming at an architecture and an application suite for taxonomic biodiversity management on the Internet.
World Wide Web as a meta-model for biodiversity
The Internet, which has more than 25 million users on 2.5
million computers connected to it [10], [11], is now a viable
communication medium for global taxonomy research.
The revolution that turned the Internet from a
network of computers into a
network of information
began with the World Wide Web
(WWW) protocol [12], [13].
WWW works through application programs such as NCSA Mosaic and
Netscape, which launch queries guided by uniform resource
locators (URLs) and are able to find and retrieve documents
anywhere on the Internet's WWW servers.
A WWW hypermedia document can contain text, images, sound,
and video.
Equally important, a WWW page can also contain forms, menus, and
buttons which, through properly configured gateways, turn it into a
universal client to interactive server programs such as databases.
The cross-linked architecture of WWW is an efficient meta-model for biodiversity information as it can be used to mimic ecological webs and biological taxonomies. There are a few existing systems that use WWW for taxonomic information management. Information about Australian flora can be accessed with WWW and its SQL-gateway to relational databases [9]. ACEDB is a classification-based database management system for managing biological information [14] which also has a WWW interface. The Tree of Life (URL = http://phylogeny.arizona.edu/tree/phylogeny.html) is a recently announced phylogenetic navigator representing each taxon by a page of information (Maddison, W. & D., personal communication). When completed, the Tree of Life should span multiple sites and cover all taxa. However, the Tree of Life is entirely hard-wired, and there is neither a query language nor a data model that would address synonym resolution, the management of observational data, or the geographical redundancy that arises when multiple sites provide information about the same taxa.
Managing complexity with object-oriented techniques
The complex nature of taxonomic information necessitates the use
of software that can grasp that complexity.
Object-oriented (OO) modelling,
which includes OO systems analysis, software design, and programming,
has lately gained popularity in other domains where complex data prevails
[15], [16], [17].
OO modelling is based on certain concepts that make it possible to build
real-world models into software that can mimic even the most complex
biological structures.
First, objects are encapsulated entities that combine data and
behaviour (program code) under the same structure.
Objects can be either classes (templates for instances) or
actual instances of these classes
or, ideally, even both at the same time.
Second, objects can be classified into hierarchies where
specialised objects inherit and add to the features of their more
general prototypes higher in the hierarchy.
This mirrors the Linnaean classification, except that objects can
inherit from many ancestors where necessary.
Third, objects communicate by sending messages to each other.
They can behave polymorphically because different objects can react
differently even to the same messages.
Fourth, objects can be associated with each other -
they can even contain and consist of other objects.
An object model shows the relationships listed above as a diagram. Such a model, constructed in a formal methodology [15], [18], [19], should be understandable by domain specialists, by information system architects and by the computer. OO databases are systems that store and manage objects the same way as, for instance, relational databases manage tables [20]. OO databases are often an order of magnitude faster than relational databases because they operate on real-world models and do not require normalised and, hence, arbitrarily split tables.
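The four concepts described in this section (encapsulation, inheritance, polymorphic message passing, and association) can be illustrated with a minimal Python sketch. All class names, the `legs` attribute, and the placeholder organism names are assumptions made for this illustration; they are not taken from the system described in this paper.

```python
class Organism:
    """Encapsulation: data (name) and behaviour (describe) in one structure."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is an organism"


class Insect(Organism):
    """Inheritance: a specialised class adds to its more general ancestor."""

    legs = 6

    def describe(self):
        # Polymorphism: the same message gets a class-specific answer.
        return f"{self.name} is an insect with {self.legs} legs"


class Parasite:
    """A second, independent ancestor used for multiple inheritance."""

    host = None

    def attach(self, host):
        # Association: this object keeps a reference to another object.
        self.host = host


class ParasiticInsect(Insect, Parasite):
    """Inherits from two ancestors at once."""

    def describe(self):
        text = super().describe()
        if self.host is not None:
            text += f", hosted by {self.host.name}"
        return text


moth = Insect("Scopula ternata")
wasp = ParasiticInsect("unnamed parasitoid")  # placeholder, not a real taxon
wasp.attach(moth)

# One message, three different reactions.
for organism in (Organism("unnamed organism"), moth, wasp):
    print(organism.describe())
```

The loop sends the same `describe` message to three objects and each answers according to its own class, which is the polymorphic behaviour referred to above.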
The taxonomic database
Our system is designed around the generic class Taxon
which has subclasses for each of the 12 taxonomic categories
(Fig. 1).
Each subclass is associated with the one above it, defining
a complete chain of taxa.
This chain may be of varying length because not all subclasses (e.g.,
Tribus) are defined for every taxonomy.
It is practical to understand actual taxa (e.g., Scopula ternata)
as the instances of the subclasses for the taxon categories (e.g., species;
dotted arrows in Fig. 1).
However, by definition a taxon is an arbitrary human construct,
a class [21], [22],
and its instances are the physical individual organisms that can be
found out in the wild (dashed arrows in Fig. 1).
The implication of this paradox is that an ideal taxonomic database,
which is used as a reference also to field observations, should support
meta-data, that is, runtime access to the class definitions, and even
simultaneous roles of objects as classes and instances.
We have built a prototype for each of these approaches.
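The first approach, in which actual taxa are instances of per-category subclasses of Taxon, can be sketched as follows. This is an illustration only: it shows four of the taxonomic categories rather than all 12, and the `parent` attribute and `lineage` method are hypothetical names, not the interface of the implemented system.

```python
class Taxon:
    """Generic taxon; each instance is linked to the taxon above it."""

    rank = "taxon"

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def lineage(self):
        """Return (rank, name) pairs from the topmost taxon down to this one."""
        chain = []
        node = self
        while node is not None:
            chain.append((node.rank, node.name))
            node = node.parent
        return list(reversed(chain))


# One subclass per taxonomic category (illustrative subset).
class Family(Taxon):
    rank = "family"


class Tribus(Taxon):
    rank = "tribus"


class Genus(Taxon):
    rank = "genus"


class Species(Taxon):
    rank = "species"


# A check-list that does not define the tribus level ...
family = Family("Geometridae")
short = Species("Scopula ternata", Genus("Scopula", family))

# ... and one that does; the chain is simply one link longer.
tribus = Tribus("Scopulini", family)
long = Species("Scopula ternata", Genus("Scopula", tribus))

print(short.lineage())
print(long.lineage())
```

Because every taxon only refers to the one above it, a category that is absent from a given taxonomy needs no placeholder object.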
One particularly important point in the object model is the definition of names. We declare them as link attributes between the taxa, the type specimens, and the publications which deal with these types. It follows that scientific names or their derivatives cannot be used as object identifiers (keys). Ordered numbers cannot be used either, since they would be vulnerable to changes in systematic order. Therefore, we have to resort to hidden arbitrary surrogates as identifiers. The Taxon class must define the methods and message handlers which enable the taxa to identify themselves to queries properly.
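A minimal sketch of this naming scheme is given below, assuming Python dataclasses and random UUIDs as the hidden surrogates. The class layout, the field names, the `valid` flag, and the names and citations used as data are all hypothetical placeholders; the paper does not specify how the surrogates are generated.

```python
import uuid
from dataclasses import dataclass, field


def surrogate():
    """Arbitrary identifier that carries no name and no systematic order."""
    return uuid.uuid4().hex


@dataclass
class Taxon:
    oid: str = field(default_factory=surrogate)


@dataclass
class TypeSpecimen:
    repository: str
    oid: str = field(default_factory=surrogate)


@dataclass
class Publication:
    citation: str
    oid: str = field(default_factory=surrogate)


@dataclass
class Name:
    """A name is an attribute of the taxon / type specimen / publication link."""
    text: str
    taxon: Taxon
    type_specimen: TypeSpecimen
    publication: Publication
    valid: bool = True


def names_of(taxon, names):
    """Let a taxon identify itself: all names linked to it, valid ones first."""
    linked = [n for n in names if n.taxon is taxon]
    return sorted(linked, key=lambda n: not n.valid)


taxon = Taxon()
names = [
    Name("Exemplum primum", taxon,
         TypeSpecimen("hypothetical museum A"), Publication("hypothetical citation A")),
    Name("Exemplum alterum", taxon,
         TypeSpecimen("hypothetical museum B"), Publication("hypothetical citation B")),
]

# A revision demotes the first name to a synonym; the taxon's identity is untouched.
oid_before = taxon.oid
names[0].valid = False
print([n.text for n in names_of(taxon, names)], taxon.oid == oid_before)
```

Since the identifier is independent of both the name and the systematic position, a nomenclatural revision changes only the link objects.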
The model was implemented as an OO database application (URL = http://www.metla.fi/biodiversity/taxon-object/) with the Objectivity/DB(tm) database product [23] and the C++ language on a DEC Alpha/OSF platform. WWW documents that are generated entirely on demand out of the contents of the database constitute the user interface of the system. There is one page per taxon (Fig. 2). The page has subheadings for the fields of the database that the user wishes to retrieve. Taxonomic hierarchies can be followed through menus and buttons. Other dependencies of the taxa, such as host / parasite webs, can be followed in the same way. For our testing, we loaded the database with the check-list of Finnish Geometroidea and the complete contents of a scanned field guide [24]. Both can be re-created on paper out of the contents of the database. New taxonomic objects can be created and the existing ones manipulated with a WWW-based editor.
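The idea of one page per taxon, generated on demand, can be sketched in a few lines of Python. This is not the C++ and Objectivity/DB implementation described above: the in-memory dictionary stands in for the database, and the record layout, the surrogate keys, the `/taxon/<id>` link scheme, and the function name are assumptions made for illustration.

```python
from html import escape

# Hypothetical stand-in for the object database: surrogate id -> record.
DB = {
    "t1": {"rank": "family", "name": "Geometridae", "parent": None},
    "t2": {"rank": "genus", "name": "Scopula", "parent": "t1"},
    "t3": {"rank": "species", "name": "Scopula ternata", "parent": "t2"},
}


def render_taxon_page(db, oid, fields=("rank",)):
    """Build the page for one taxon from the current database contents."""
    record = db[oid]
    parts = [f"<h1>{escape(record['name'])}</h1>"]

    # One subheading per field the user asked for.
    for name in fields:
        parts.append(f"<h2>{escape(name)}</h2>")
        parts.append(f"<p>{escape(str(record[name]))}</p>")

    # Link upwards in the hierarchy.
    parent_id = record["parent"]
    if parent_id is not None:
        parent_name = escape(db[parent_id]["name"])
        parts.append(f'<p>Up: <a href="/taxon/{parent_id}">{parent_name}</a></p>')

    # Links downwards, found by scanning for records that point here.
    for child_id, child in db.items():
        if child["parent"] == oid:
            child_name = escape(child["name"])
            parts.append(f'<p>Down: <a href="/taxon/{child_id}">{child_name}</a></p>')

    return "\n".join(parts)


print(render_taxon_page(DB, "t2"))
```

Nothing is stored as a static document; a change to a record shows up the next time its page, or a neighbour's page, is requested.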
We also experimented with another, semantically more correct, system where individual taxa were classes instead of instances. The classes were created at run time where needed. This system was implemented as a Kappa(tm) application because it supports run-time metadata [25]. However, we found storing the new class definitions in a database inconvenient, and we no longer pursue this avenue.
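The second approach can be illustrated with Python metaclasses, which allow an object to be a class and an instance at the same time. The sketch is not the Kappa application; `TaxonMeta`, `define_taxon`, the `rank` and `locality` attributes, and the sample data are hypothetical names chosen for the example.

```python
class TaxonMeta(type):
    """Each taxon is a class and, at the same time, an instance of TaxonMeta."""

    def lineage(cls):
        # Walk the inheritance chain, keeping only taxon classes.
        return [c.__name__ for c in reversed(cls.__mro__)
                if isinstance(c, TaxonMeta)]


class Taxon(metaclass=TaxonMeta):
    """Root class; its instances are individual organisms."""

    def __init__(self, locality):
        self.locality = locality


def define_taxon(name, parent=Taxon, **attrs):
    """Create a new taxon class at run time."""
    return TaxonMeta(name, (parent,), dict(attrs))


Geometridae = define_taxon("Geometridae", rank="family")
Scopula = define_taxon("Scopula", Geometridae, rank="genus")
Scopula_ternata = define_taxon("Scopula_ternata", Scopula, rank="species")

# A field observation is an instance of the species class.
specimen = Scopula_ternata(locality="hypothetical locality")

print(Scopula_ternata.lineage())
print(isinstance(specimen, Geometridae), isinstance(Scopula_ternata, TaxonMeta))
```

Here the class definitions exist only in memory; persisting them is the step that the authors report as inconvenient.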
Conclusion
Our experience shows that OO databases are a viable alternative for
managing complex taxonomic information.
However, none of the existing commercial products support all
the features required from an ideal system.
Therefore, and because of the need to reduce costs at the user end,
a public-domain OO database system similar to ACEDB should be built.
That system should also possess built-in WWW server capabilities,
and employ a community system [13] of automated agents
[26], [27] which can retrieve information from neighbour
servers in order to generate, for example, a distribution map or cross-check
nomenclature.
Availability of such capabilities should change the way taxonomic
check-lists are envisaged, nomenclature is maintained, and specimens
are identified.
Biodiversity information on-demand, automated network-aware collections,
and virtual museums should be a part of the global information infrastructure,
as recently proposed [28].
Fig. 1.
Object model of the taxonomic database.
Notation according to [15].
Details of classes are hidden, except for Taxon.
This diagram is one-third of a larger object model which, in addition,
describes field observations.
Fig. 2. The browser of the taxonomic hierarchy on the World Wide Web.