One of the projects I am working on right now is the construction of OE, the Ontology of Evolution. OE is a project of the Darwin Digital Library of Evolution, which is itself a special project of the Biodiversity Heritage Library and the American Museum of Natural History Library. The immediate aim of producing OE is to provide an intelligent search capability for the library catalog of the Darwin Digital Library, which is still in its early design stages. The broader aim of OE is to provide a computation and classification tool for researchers in evolutionary biology and those who work in other fields but would like such a tool for application to their own disciplines.
Although the word “ontology” is familiar to philosophers, I do not intend it here in the usual philosophical sense. Rather, I am using it in the sense that librarians and information scientists use the term: to indicate the formal description of the concepts used in some discipline of study in a manner suitable for use as a classification system, or for computation.
OE will be modeled in OWL-DL, a sublanguage of OWL based on description logic, which is a decidable fragment of first-order logic. OWL, the Web Ontology Language, is designed for use in semantic web applications. OWL ontologies such as OE can be created, viewed and manipulated using ontology editors and browsers such as Protégé or GrOWL.
As a classification scheme, an ontology is richer than a thesaurus or a traditional subject index. Thesauri generally only indicate relationships of synonymy; traditional subject indexes typically only indicate whether one term is narrower than, broader than, or related to another term. An ontology describes the kinds of things in the domain of study, and also the relationships between them that are posited by the concepts being modeled. These relationships go beyond synonymy and the narrower, broader and related links of thesauri and traditional key word lists. For instance, consider the following relationship between geographical isolation and allopatric speciation.
Geographical isolation
-> Is a component sub-process of: allopatric speciation
It would not be possible to formulate this relationship in a thesaurus or traditional key word index. “Geographical isolation” is not synonymous with “allopatric speciation.” Geographical isolation is not a kind of allopatric speciation, and so it would be incorrect to represent the former as narrower than the latter. At best, a traditional key word index would indicate that the two terms are related; it would not provide any indication of how they are related.
To further illustrate the power of an ontology, consider the following terms and relationships:
natural selection
-> Is a cause of: adaptation
-> Is a component subprocess of: speciation
-> Has a kind: balancing selection
-> Has a kind: directional selection
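One way to make relationships like these concrete is to store them as subject, relation, object triples. The following is only a minimal Python sketch, not OE's actual encoding (OE is to be modeled in OWL-DL); the relation names is_cause_of, is_component_subprocess_of and has_kind, and the helper function related, are hypothetical names chosen for this illustration.

```python
# Hypothetical triple store for the relations listed above.
# Each entry is (subject, relation, object); the relation names are assumptions.
TRIPLES = [
    ("natural selection", "is_cause_of", "adaptation"),
    ("natural selection", "is_component_subprocess_of", "speciation"),
    ("natural selection", "has_kind", "balancing selection"),
    ("natural selection", "has_kind", "directional selection"),
    ("geographical isolation", "is_component_subprocess_of", "allopatric speciation"),
]


def related(term, relation=None):
    """Return (relation, object) pairs for a term, optionally filtered by relation."""
    return [
        (r, o)
        for s, r, o in TRIPLES
        if s == term and (relation is None or r == relation)
    ]


# Unlike a "related term" link in a key word index, each answer says how
# the two terms are related.
for relation, obj in related("natural selection"):
    print(f"natural selection -> {relation}: {obj}")
```

Because every link carries its relation, a query can ask for only the kinds of natural selection, or only the processes of which it is a component, which a bare "related to" link cannot express.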
To use OE as a classification and literature search system, terms describing processes, objects, particular individuals and other aspects of the natural world studied by evolutionary biologists would be applied to information resources. Searching a database of such terms and the corresponding resources, researchers would be led from resource to resource by moving among those tagged with related terms, and from term to term by looking for known papers of interest, and searching with key word tags on those papers. The process of resource discovery would be much richer than in the case of a thesaurus or traditional key word index, because the ontology-driven keywording system directs the researcher along paths generated by a rich description of the relationships of objects in the domain of knowledge. Researchers could also browse the ontology directly; looking for papers categorized under a given term, a researcher would find others on related topics, and also would gain some understanding of the nature of their relationship.
Building OE requires starting from scratch, for the most part. MeSH, the only controlled vocabulary that would be expected to serve as a source of key word terms about evolutionary biology, is greatly impoverished. It contains few terms describing evolution, and many of them are incorrectly defined, or occupy places in the MeSH hierarchy that do not accurately represent their relative positions.
OE can also function as an addition to the semantic web. The semantic web has an advantage over the free-text searches of web pages provided by Google or other search engines: it provides a mechanism for distinguishing among terms and phrases that differ in meaning even though they have the same morphology. For instance, searching the open Internet at Google for “Darwin” returns many hits concerning the operating system created by Apple Computer, as well as hits concerning evolution. A semantic web search would group the operating system results apart from those having to do with the evolutionist, Darwin.
For the same reasons that OE can bring some organization to the Internet, it can bring organization to databases of literature in a way that citation indexing and free-text searching in article titles, abstracts, and author-assigned key words cannot. These searches are unable to detect differences between words that have a common morphology but differ in meaning. They are also unable to detect similarities between phrases and terms that share no common syntactic elements, but have the same meaning. The controlled vocabulary that will make up OE, together with the conceptual structure it represents, will facilitate both targeted searches and exploratory browsing among the linguistically heterogeneous literature of evolutionary biology.
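The two failures of free-text search described above (one string with several meanings, and several strings with one meaning) can be illustrated with a small controlled vocabulary. This is a sketch only: the identifiers beginning with oe:, the variant labels, and the function lookup are all assumptions made up for this example, not part of OE.

```python
# Hypothetical controlled vocabulary. Each concept has its own identifier,
# so identical strings can point to different concepts and different
# strings can point to the same concept.
CONCEPTS = {
    "oe:CharlesDarwin": {
        "label": "Charles Darwin (naturalist)",
        "variants": ["darwin", "charles darwin"],
    },
    "oe:DarwinOS": {
        "label": "Darwin (operating system)",
        "variants": ["darwin", "darwin os"],
    },
    "oe:AllopatricSpeciation": {
        "label": "allopatric speciation",
        "variants": ["allopatric speciation", "geographic speciation"],
    },
}


def lookup(text):
    """Return the identifiers of every concept that has `text` as a variant label."""
    key = text.strip().lower()
    return sorted(cid for cid, c in CONCEPTS.items() if key in c["variants"])


# Same morphology, different meanings: the searcher is offered a choice.
print(lookup("Darwin"))
# Different morphology, same meaning: both strings reach one concept.
print(lookup("geographic speciation") == lookup("Allopatric speciation"))
```

Resources tagged with an identifier rather than a bare string can then be grouped by meaning, whichever wording their authors happened to use.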
(In an article in the High Energy Physics Libraries Webzine, Arturo Montejo Ráez and Ralf Steinberger discuss the value of keywording in a large, heterogeneous literature; my discussion in the previous two paragraphs has been strongly influenced by them. They do not argue for ontologies, but because ontologies are a kind of key word indexing strategy, they bring the same benefits to users as do thesauri and traditional subject indexes.)
Intelligent searching in a database of information resources, be they web pages on the semantic web or articles in a digital library, represents a computational use of the ontology. For instance, suppose that a researcher used “speciation” as a search key. An artificial reasoner searching the ontology-driven database of information resources would find papers about geographical isolation, because the reasoner would “infer” that such papers are relevant: in the ontology, geographical isolation is described as a component sub-process of allopatric speciation.
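The kind of inference described in this example can be approximated by following relations in the ontology back to the search term. The sketch below uses a plain breadth-first traversal as a stand-in for a real description logic reasoner, which would do considerably more; the triples, relation names, paper identifiers, and the functions expand and search are hypothetical.

```python
from collections import deque

# Hypothetical fragment of the ontology: (subject, relation, object).
TRIPLES = [
    ("allopatric speciation", "is_kind_of", "speciation"),
    ("sympatric speciation", "is_kind_of", "speciation"),
    ("geographical isolation", "is_component_subprocess_of", "allopatric speciation"),
]

# Hypothetical index of papers by the term each was tagged with.
PAPERS = {
    "speciation": ["paper-A"],
    "allopatric speciation": ["paper-B"],
    "geographical isolation": ["paper-C"],
    "adaptation": ["paper-D"],
}


def expand(term):
    """Map the search term, and every term that leads to it, to its distance in links."""
    found = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for subject, _relation, obj in TRIPLES:
            if obj == current and subject not in found:
                found[subject] = found[current] + 1
                queue.append(subject)
    return found


def search(term):
    """Return (paper, matched term, distance) tuples, nearest terms first."""
    hits = []
    for matched, distance in sorted(expand(term).items(), key=lambda kv: (kv[1], kv[0])):
        for paper in PAPERS.get(matched, []):
            hits.append((paper, matched, distance))
    return hits


for hit in search("speciation"):
    print(hit)
```

Here the paper tagged only with geographical isolation is returned for a "speciation" search, two links away, while the paper on adaptation is not, because no relation in this fragment connects it to speciation.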
Representing the results of such intelligent searching by showing the degree and kind of relationship between terms would help researchers locate the results of greatest interest. For instance, results from the “speciation” search should show the kinds of speciation (allopatric, sympatric, peripatric, etc.) as more closely related to speciation than geographical isolation is. The latter is a sub-process of one kind of speciation process; a researcher may not want to see all papers on that topic, preferring instead to steer toward those on sympatry. A “tree” view that can be expanded or closed, and that represents the number of papers on a given “branch,” is probably a good way to represent the results of this kind of intelligent search.
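A tree view of this sort might be sketched as follows. The hierarchy, the relation labels, the paper counts, and the functions branch_total and render are all invented for illustration; a real interface would draw them from the ontology and the tagged database.

```python
# Hypothetical hierarchy: each term maps to (relation, child term) pairs.
CHILDREN = {
    "speciation": [
        ("has kind", "allopatric speciation"),
        ("has kind", "sympatric speciation"),
        ("has kind", "peripatric speciation"),
    ],
    "allopatric speciation": [
        ("has component sub-process", "geographical isolation"),
    ],
}

# Hypothetical numbers of papers tagged directly with each term.
DIRECT = {
    "speciation": 4,
    "allopatric speciation": 7,
    "sympatric speciation": 3,
    "peripatric speciation": 1,
    "geographical isolation": 12,
}


def branch_total(term):
    """Count papers tagged with the term or with anything below it on the branch."""
    below = sum(branch_total(child) for _rel, child in CHILDREN.get(term, []))
    return DIRECT.get(term, 0) + below


def render(term, relation=None, depth=0):
    """Return the branch as indented lines, each with its paper count."""
    label = f" [{relation}]" if relation else ""
    lines = ["  " * depth + f"{term} ({branch_total(term)}){label}"]
    for rel, child in CHILDREN.get(term, []):
        lines.extend(render(child, rel, depth + 1))
    return lines


print("\n".join(render("speciation")))
```

Showing the relation beside each branch lets a researcher see at a glance that geographical isolation is a sub-process rather than a kind of speciation, and skip that branch in favor of sympatric speciation.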
There are probably other computational uses of OE. For instance, OE might be used for hypothesis discovery, like other ontologies such as the Gene Ontology or those provided by the Science Environment for Ecological Knowledge (SEEK) project. It also probably has important uses in bibliometrics.

2 comments
November 17, 2006 at 1:37 am
Chris Mungall
I think your use cases point more towards a thesaurus than an
ontology. Can you think of uses other than searching the
literature?
An ontology should concern relations between types that obtain by
virtue of the relations that hold between the underlying instances.
For example, you have:
Geographical isolation
-> Is a component sub-process of: allopatric speciation
It seems you are representing more of a conceptual connection
here. This is fine for doing various kinds of neighbourhood graph
analysis for literature search and index but I think in order to
actually associate instance data and perform reasoning you will need
to make some definitions clearer.
Is “Geographic isolation” here a kind of process? or is it a kind of
_quality or property_ that inheres in a population of individual
organisms of a species?
If it is a quality or property then subprocess-of would not appear to
be the correct relation. If it is a process, then what exactly do we
mean by sub-process-of? Do you mean that all instances of geographic
isolation processes are part of some larger allopatric speciation
process (surely not)? Or that all instances of allopatric speciation
have as a part a process of geographic isolation?
If you’re serious about an ontology which would form the substrate for
advanced reasoning then these kinds of distinctions should be made
clear from the outset, or you may have a lot of reengineering to do
later on.
I’d encourage you to look at the OBO Foundry principles
(http://www.obofoundry.org), especially w.r.t. providing Aristotelian
or genus-differentia definitions. This is good definitional practice,
leads to clear ontologies, and when encoded in a computer can enable
powerful reasoning.
Here is a strawman ontology of allopatric speciation; it may be
biologically naive:
population_biological_process =def
biological_process WHICH has_participant population_of_individual_organisms
speciation =def
population_biological_process WHICH
all participants belong to species S at the beginning of the process &
all participants belong to species S1 and S2 at the end of the process &
S1 != S2
allopatric_speciation =def
speciation WHICH occurs_during: process_of_geographic_isolation
process_of_geographic_isolation =def
population_biological_process WHICH
the participant population has the quality of geographic_isolation at the end of the process &
the participant population lacks the quality of geographic_isolation at the beginning of the process
geographic_isolation =def
a property WHICH
inheres in a collection of individuals in a single species S
by virtue of all individuals in that collection forming two or more
maximally geographic spatially connected wholes
These definitions may not be robust for all uses (eg species that
reproduce via sporulation that can occur over geographically
disconnected wholes).
The relations used could come from the OBO Relation ontology
http://www.obofoundry.org/ro
Some of the definitions above can easily be expressed in OWL; eg
Class(allopatric_speciation complete
  speciation
  restriction(RO:during
    someValuesFrom(process_of_geographic_isolation)))
Other definitions involving time will be harder to do in OWL. In this
case formal natural language or other languages like CommonLogic may be
required.
I deliberately chose a simple example – providing a formal ontological
definition of species or character descent will prove difficult I think.
Why go to all this bother? I’m not familiar enough with efforts like
DarwinCore, species databases, geographical databases etc to give
convincing use cases. If this field does face the same data mining
challenges we face in other areas of biology (which it presumably will
as ecosystems are sequenced..) then I think it will be possible to
generate some use cases.
Anyway, I hope something gets kickstarted soon that’s more expressive
than MeSH!
March 23, 2007 at 12:36 am
Neocles Leontis
My ontology doesn’t have this currently, and I don’t know of others that do. Sorry!
Right now I am mainly focusing on theories of speciation.
-Adam