MULTEXT/EAGLES -
Document LSD 1. Main text. Version 1.0. Last modified 28 April 1996.
In recent years, numerous data collection initiatives and corpus projects have
collected large amounts of electronic texts. At the same time, the TEI,
MULTEXT and EAGLES are making substantial progress towards interchangeability
and reusability of
these resources. In addition, there is increasing demand for multi-lingual
processing, especially in Europe, which mono-lingual tools alone cannot serve. As a result, the need for generally
available, flexible linguistic software is substantially greater than ever in
the past.
Unfortunately, the linguistic software that exists at present (see for example
the lists at the Natural Language Software Registry) only begins to cover
growing needs. Industrial
software is often expensive or unavailable, and usually hard to adapt or
extend. On the other hand, the substantial body of natural language processing
academic software is often experimental and hard to get, hard to install,
under-documented, and sometimes unreliable. In both cases tools are typically
embedded in large, non-adaptable systems which are fundamentally incompatible.
Although efforts to develop standards for data representation are underway,
little effort has been made to develop standards for linguistic software, and
software reusability is virtually non-existent.
As a result, there is a serious lack of generally usable tools to manipulate
and analyze the text and speech corpora and collections that are now becoming
widely available. Worse, there is enormous duplication of effort: it is not at
all uncommon for researchers to develop tailor-made systems that replicate much
of the functionality of other systems and in turn create programs that cannot
be re-used by others, and so on in an endless software waste cycle. The
reusability of data is a much-discussed topic these days; similarly, software
reusability is needed, to avoid the re-inventing of the wheel characteristic of
much language-analytic research in the past three decades.
MULTEXT joined efforts with the EAGLES sub-group on Tools, established in spring 1995 (in the last phase
of the EAGLES project), to address this need by working toward the establishment
of Guidelines for Linguistic Software Development. This document
reports the work of the sub-group to date. It is intended to serve as a basis
for the EAGLES Corpus Workgroup workshop in Madrid, Spain, in January 1996, where it
will be possible to confront views, summarize problems and, where necessary,
revise this proposal for future action. This document will also be circulated in
and outside of EAGLES, in order to gather feedback from the larger research
community which can in turn contribute to the continued effort to develop
linguistic software standards.
The long-term goal of the MULTEXT/EAGLES effort is to establish a set of universally
accepted Guidelines for Linguistic Software Development (LSD). These guidelines
will specify a general lingware development environment, including recommended
standards for all aspects of software development, data representation,
linguistic annotation, etc. The establishment of such a set of guidelines will
enable the interchange of tools and data among researchers and sites,
compatibility among tools with potentially diverse functionality, and will in
general contribute to the creation of reliable, high quality tools.
The scope of the work required to develop the guidelines is extensive and in
some ways, undefinable at this time. There are few if any established practices
for the design and development of linguistic software; and, given the recent
changes in the availability of data, the amounts of data typically stored and
processed, and the development of new formats for data representation, these
practices remain in a state of flux. In addition, platforms, systems,
programming methodologies, and data structuring techniques are evolving
rapidly. Another factor is the proliferation of standards for all aspects of
software development and data representation, and the fact that information
about these standards is scattered in a variety of unconnected and dissimilar
locations. To aggravate the problem, many of these standards are in draft or
partial form and are themselves in a state of flux.
MULTEXT and the EAGLES Tools sub-group have laid the
groundwork for the future design of the guidelines. This task has involved:
- assessment of existing software,
- analysis of the areas important for reusability,
- survey of existing and emerging standards relevant to each of these areas,
- determination of general principles for reusable software development,
- development of a model for and draft content of the guidelines.
In general it is the academic community which suffers most from the lack of
standardization of linguistic software, especially smaller research teams with
few resources. Substantial time and effort are often wasted within academic
research teams in writing software whose functionality has been implemented
multiple times, simply because this requires less effort than attempting to
adapt existing software--if such software exists. At the same time, the
computer science expertise required to develop robust, reusable, and
extensible code is often lacking in small teams specializing in linguistic
analysis. Even when computer scientists are available for code development, it
is not always certain that their programming techniques are aimed at ensuring
maximum reusability. This is especially true since some of the factors enabling
reusability are based on linguistic grounds, rather than computer science
concerns. It is therefore clear that academic research teams have
much to gain from the development of linguistic software standards and the
resulting libraries of reusable linguistic software that can be developed on their
basis.
Even large academic teams who have developed tools for various tasks can
benefit from the establishment of standards for software development, since it
is almost invariably the case that such tools function well for the language
for which they were designed but are not readily applicable to other languages.
Given the increasing shift toward multi-lingual applications and environments,
the need for flexible tools that can generalize across multiple languages is
evident. It is very clear, for instance, that linguistic centers involved in
large European consortia comprising several teams from a variety of countries
have the need for common multi-lingual tools. As new methods and technologies
are developed, it is ever more essential to avoid their re-implementation in
each language; common tools which can be extended once and for all for all
languages are an increasing necessity.
Although industrial sites often have sufficient resources to develop
custom software, there is increasing pressure within the industrial community
to cut costs and avoid duplication of effort. Therefore, many of the benefits
of linguistic software standards cited above apply here as well. This is
especially true for SMEs.
There is no competitive advantage to be gained from re-implementing basic tools
and state-of-the-art methods. Companies should be able to acquire good standard
libraries of linguistic functions and basic tools, and spend their efforts developing advanced methodologies. The existence of standards for the development of
linguistic software will benefit industry by providing guidelines for code
development geared specifically toward ensuring reusability of linguistic
software, and by leading to the availability of portable libraries of
linguistic tools.
Linguistic software reusability comprises several aspects. We discuss the major
ones below. They are listed in the order in which they should be implemented,
since as we go down the list, each depends on and builds upon the previous one.
Reusability implies usability as a starting point. The current most obvious
obstacles to usability are factors such as poor documentation, unreliability,
lack of robustness, etc., which serve as the prime reasons why freely-available
software is not more widely used. The rectification of these problems is fairly
straightforward, and serves as a first step in working toward reusability.
Portability concerns the capability for tools developed at one site to be
immediately usable at other sites. At present, it is nearly a given that
software developed at one site demands considerable tweaking to run at another
site, especially if the environments are not perfectly identical. This leads to
substantial investments of time and resources, just to get to the point of
being able to run software acquired from other sites.
Ideally, we should aim for portability across platforms (Windows, MacOs, Unix),
but this is a long-term goal which will require substantial work. In the short
term, we can achieve portability between similar environments, e.g., between
different versions of UNIX.
Compatibility concerns the capability for tools developed independently to
inter-operate in the same environment, in order to perform complex tasks. This
demands, first, that tools can communicate--that is, that results produced by
one tool are usable by another; and, second, that their functionalities are
complementary and coherent. It is also essential that tools are designed for
compatibility with data and other resources (e.g., lexicons) in common
formats.
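As a minimal sketch of the first requirement, the following Python fragment shows two independently written tools that inter-operate only through a shared line-based format. The format (one token per line, tab-separated fields), the function names and the toy tagging rule are all illustrative assumptions, not part of any MULTEXT or EAGLES specification.

```python
# Hypothetical exchange format: one token per line, fields separated by tabs.
# Field 1 is the word form; later fields are annotations added by other tools.

def tokenize(text):
    """Tool A: split raw text into exchange-format lines (word form only)."""
    return [word for word in text.split()]

def tag(lines):
    """Tool B: read exchange-format lines and append a toy category field."""
    tagged = []
    for line in lines:
        fields = line.split("\t")
        form = fields[0]
        # Toy rule for illustration only: capitalised forms are marked "NP".
        category = "NP" if form[:1].isupper() else "X"
        tagged.append("\t".join(fields + [category]))
    return tagged

def read_records(lines):
    """Any third tool can parse the same format without knowing A or B."""
    return [tuple(line.split("\t")) for line in lines]

records = read_records(tag(tokenize("Madrid hosts the workshop")))
```

Neither tool calls the other directly; each depends only on the agreed format, so either could be replaced by a different implementation that reads and writes the same fields.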
At present, the proliferation of different implementations of basic linguistic
tools such as part-of-speech taggers is not only confusing to the user who may
want to apply such tools, but also renders comparison of their results
virtually meaningless. Standard methods for software design and development
will make comparison of results possible.
Extensibility involves the capability to adapt tools to fit particular needs,
to add pieces to existing tools, to replace pieces, etc. One important
feature for linguistic purposes is the capacity to use the same tools on
different languages. At the moment, most linguistic software exists in the form
of integrated systems performing multiple functions, often with little or no
access to individually functioning modules. Thus adaptation or extension,
either for functionality or to accommodate other languages, etc., is virtually
impossible.
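One way to picture the use of the same tool on different languages is to keep language-specific knowledge in data that the program merely consults. The sketch below is an illustration under stated assumptions: the resource table, its contents and the function name are hypothetical, and a real tool would load such resources from external files.

```python
# Hypothetical per-language resources, kept apart from the program logic.
# In practice these would be loaded from external files, not hard-coded.
LANGUAGE_DATA = {
    "en": {"abbreviations": {"e.g.", "etc.", "Dr."}},
    "fr": {"abbreviations": {"p.ex.", "etc.", "M."}},
}

def split_tokens(text, language):
    """Language-independent code: separate a trailing period from a word
    unless the word is a known abbreviation in the selected language."""
    abbreviations = LANGUAGE_DATA[language]["abbreviations"]
    tokens = []
    for word in text.split():
        if word.endswith(".") and len(word) > 1 and word not in abbreviations:
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens

english = split_tokens("See Dr. Smith.", "en")
french = split_tokens("Voir M. Dupont.", "fr")
```

Under this arrangement, covering a further language means adding an entry to the resource table; the function itself is untouched.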
Standards exist or are being developed in many areas relevant to linguistic
software, including
- character sets
- document encoding
- language and country codes
- application program interfaces
- programming languages
- internationalization and localization of programs
- etc.
Each of these standards covers a small piece of what would serve
as a general lingware development environment, but none has been developed with
an eye toward the overall coherence of such an environment. Annex 1 gives a
list of relevant standards.
The goal of the MULTEXT/EAGLES LSD Guidelines is to bring together existing or emerging de
jure or de facto standards sufficient to address the scope of an
entire Linguistic Software Development system. This approach is similar to that
of the POSIX Open System Environment (see
Overview of the IEEE P1003.0 Guide to the POSIX Open Software Environment
by Charles Severance).
We propose to call the resulting environment the GLOSIX Open System
Environment (GL stands for "General Lingware"; in Greek, glôsa
means language).
The existence of a coherent open system environment will enable the
distributed development of linguistic software, to which anyone,
including individuals and small teams as well as large research teams, can
contribute. Distribution of effort means that there is no large investment of
development time as there is for integrated systems, which in turn means that
software can be distributed cheaply, or for free. In addition, this approach
enables what we call software evolution. There is no need to envision the entire functionality of the system at the outset; instead, because
extension is trivial (and can re-use existing functions), the system can easily
grow as and when demand arises. This is especially important in a field
where special and one-time use tools are common, and new functions are likely
to be desired.
We see the following steps to develop the MULTEXT/EAGLES LSD Guidelines:
- identify the areas that need to be covered by the LSD to fully
specify a lingware development environment;
- for each area, select and recommend appropriate standards;
- identify the gaps not covered by existing standards, or covered
inconsistently;
- propose solutions for bridging these gaps.
To implement these steps, we propose to adopt the following overall strategy:
- Emergency recommendations first. The full specification of the MULTEXT/EAGLES LSD Guidelines will require time and study to accomplish. However, there are a number of areas that can and should be addressed almost immediately, since they comprise an important core of the considerations that will ultimately be involved in the full specification. Recommendations can be provided relatively quickly for each of these areas, and can serve as the first pass at specifying the entire environment.
- Committees for sub-problems. There are a number of identifiable
sub-areas that will demand detailed consideration and an amassing of
information, which is in most cases scattered about in ftp sites, the WWW,
documents from various sources, etc. This kind of task is best handled by a
small committee which can focus its attention on a particular area. Therefore,
a next step is to identify the relevant areas and name the people to work on them.
- Testing through associate projects. Our overall approach involves
an incremental process. It seems impracticable to attempt to specify a set of
guidelines for linguistic software development a priori, without
implementation and testing. We therefore intend to work in a cyclic manner,
involving by turns the establishment of specifications, testing and
further refinement, more testing, etc. We propose that
several projects be associated with the MULTEXT/EAGLES effort. Their role will be
to attempt to implement the specifications as they are developed, and to provide feedback based on their experiences which will inform the development work. We envision a close working relationship between the
developers and the project representatives.
- Collaboration with relevant international organisations. It will be
essential to have links to relevant projects, initiatives, standards-making
bodies, etc., such as other EAGLES work groups, the TEI, ISO and IEEE
committees, the Free Software Foundation, the X/Open Foundation, etc. This kind
of liaison requires considerable effort, and its demands on time and energy
should not be underestimated.
- Promote free software that implements the Guidelines. We believe that the MULTEXT/EAGLES LSD Guidelines can be successful only if they are implemented in
software systems and libraries that are freely available for research use. We
therefore see as an essential step the promotion of the development of such
free software.
For each of the aspects of reusability listed above in the section What is linguistic software reusability?, the subgroup has listed a number of topics to be addressed in the Guidelines. It should be noted that these lists are not necessarily complete but are intended to serve as a basis for
continued work.
For usability, the Guidelines should include recommendations for the following:
- software reliability
- guidelines for development (programming style and conventions, avoidance
of unreliable constructs, etc.)
- development of test suites and testing tools
- development of standard libraries for common algorithms
- standardization of command line interfaces and options
- quality and standardization of documentation
- guidelines for "man" pages, user manuals, tutorials, on-line help, etc.
- development of tools and utilities for documentation development and
maintenance
- internationalization and localization of programs
- translation of program messages and reports
- linguistic and cultural conventions in user interface
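To illustrate the item on translation of program messages, here is a minimal sketch in Python of a tool that looks its messages up in per-language catalogs instead of embedding them in the code. The catalog layout, the message identifier and the function name are hypothetical, chosen only for this example; established mechanisms such as message catalogs provided by the platform would serve the same purpose.

```python
# Hypothetical message catalogs: message identifiers mapped to translations.
CATALOGS = {
    "en": {"file_not_found": "Cannot open file: {name}"},
    "fr": {"file_not_found": "Impossible d'ouvrir le fichier : {name}"},
}
DEFAULT_LANGUAGE = "en"

def message(msg_id, language, **values):
    """Return the message in the requested language, falling back to the
    default language when no translation is available."""
    catalog = CATALOGS.get(language, {})
    template = catalog.get(msg_id, CATALOGS[DEFAULT_LANGUAGE][msg_id])
    return template.format(**values)

french = message("file_not_found", "fr", name="corpus.sgm")
fallback = message("file_not_found", "de", name="corpus.sgm")
```

Because the program refers to messages only by identifier, a translator can add a catalog for a new language without reading or changing the code.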
In the long term we aim towards full compatibility across diverse platforms
such as Windows - Unix - MacOs. In the short term, we focus on portability
among different brands of Unix, which is a common and widely accepted platform
for linguistic software development.
For portability, the Guidelines should include recommendations for:
- recommendations on the preferred environments
- shells & scripting languages
- programming languages and compilers
- graphic environments
- miscellaneous utilities
- standardization of installation procedures
- structure of packages, etc.
- development of standard scripts
- recommendations for source code (C / C++)
- format of source (typographical conventions, comments, etc.)
- constructs to avoid
- development of portable libraries (e.g., for character sets and string
handling, basic language engineering functions, etc.)
For compatibility, the Guidelines should include recommendations for:
- standardization of SGML import/export formats between tools
- libraries of DTDs for import/export formats
- standardization of internal formats for direct inter-connection of tools,
and tools and resources
- input/output formats
- hashed tables
- directory architectures
- standard user interface look and feel
For extensibility, the Guidelines should include recommendations for:
- modularity in order to reuse pieces, chain them to accomplish more
complex operations, etc.
- specifications for maximum atomicity
- libraries and APIs
- language-independence
- separation of code and linguistic data
- parameterizability to accommodate different languages
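The modularity and atomicity items can be pictured with a short Python sketch in which each function performs one small operation and a generic helper chains them into a larger tool. The three steps and the helper name `chain` are illustrative assumptions, not functions defined by the Guidelines.

```python
from functools import reduce

def lowercase(tokens):
    """Atomic step: normalise case."""
    return [t.lower() for t in tokens]

def drop_punctuation(tokens):
    """Atomic step: remove tokens that contain no alphanumeric character."""
    return [t for t in tokens if any(c.isalnum() for c in t)]

def count(tokens):
    """Atomic step: build a frequency table."""
    table = {}
    for t in tokens:
        table[t] = table.get(t, 0) + 1
    return table

def chain(*steps):
    """Combine atomic steps into one tool; each step's output feeds the next."""
    return lambda data: reduce(lambda value, step: step(value), steps, data)

frequency_tool = chain(lowercase, drop_punctuation, count)
result = frequency_tool(["The", "tools", ",", "the", "data", "."])
```

Each step remains usable on its own, and a different tool can be assembled by chaining the same pieces in another order or replacing one of them.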
MULTEXT and the EAGLES sub-group on Tools joined efforts to tackle the growing problem of the lack of reusability and the resulting waste of time and effort by setting out to draft a standard for the development of linguistic software. This task involves not only considering ways to ensure the reusability of individual tools, but also acquiring a thorough knowledge of different software platforms, operating systems and programming paradigms, as well as of current practice and standards for programming, programming environments, data representation, etc. Because the relevant standards have proliferated (completed standards, but also drafts, discussions, etc.) and are widely scattered and, in some cases, difficult to access, one of the most difficult challenges for the Tools sub-group was simply to compile the relevant information, which until now has never been gathered in a single place (see Annex 1, "Relevant standards", Annex 2, "International organizations", and Annex 3, "Bibliography").
This compilation, together with the outlining of a set of basic principles to guide the further development of linguistic software development standards, comprises the first phase of the joint work of MULTEXT and the EAGLES Tools sub-group. This work is intended to serve as a basis for discussion and further development, and in particular to serve as input to discussions at the EAGLES workshop in Madrid, Spain, January 1996. All comments and feedback on the contents of this document or any aspect of linguistic software standardization are invited and encouraged.
Copyright (c) Centre National de la Recherche Scientifique, 1995.