MULTEXT/EAGLES -
Document LSD 2. Part 0. Version 0.5. Last modified 28 April 1996.
GLOSIX Part 0. Overview
The MULTEXT project and the EAGLES sub-group on Tools have joined efforts to
address the need for reusable linguistic software by working toward the
establishment of Guidelines for Linguistic Software Development (LSD
Guidelines). The document "Considerations for Linguistic Software
Reusability" (document MUL/EAG-LSD 1) outlines the
general principles upon which the Guidelines are to be based. This document is
a follow-on which makes a first sketch of the Guidelines, and is intended to
provide the basis upon which the full specifications will be developed.
The development of the Guidelines will require considerable input from and
discussion within the language engineering community. We therefore intend to
achieve our goal by a process of stepwise refinement, consisting of a cycle of
specification, testing and feedback, and refinement. It is for this reason that
we provide here a preliminary sketch which is intended to:
- address the immediate issues relevant to achieving better usability and
portability,
- begin mapping the territory, assessing the range and magnitude of
problems, etc.
Linguistic software reusability comprises several aspects. We discuss the major
ones below. They are listed in the order in which they should be implemented,
since as we go down the list, each depends and builds upon the previous one.
Reusability implies usability as a starting point. The most obvious current
obstacles to usability are factors such as poor documentation, unreliability,
lack of robustness, etc., which serve as the prime reasons why freely-available
software is not more widely used. The rectification of these problems is fairly
straightforward, and serves as a first step in working toward reusability.
Portability concerns the capability for tools developed at one site to be
immediately usable at other sites. At present, it is nearly a given that
software developed at one site demands considerable tweaking to run at another
site, especially if the environments are not perfectly identical. This leads to
substantial investments of time and resources, just to get to the point of
being able to run software acquired from other sites.
Ideally, we should aim for portability across platforms (Windows, MacOS, Unix),
but this is a long-term goal which will require substantial work. In the short
term, we can achieve portability between similar environments, e.g., between
different versions of Unix.
Compatibility concerns the capability for tools developed independently to
inter-operate in the same environment, in order to perform complex tasks. This
demands, first, that tools can communicate, that is, that results produced by
one tool are usable by another; and, second, that their functionalities are
complementary and coherent. It is also essential that tools are designed for
compatibility with data and other resources (e.g., lexicons) in common
formats.
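As a minimal illustration of the first requirement, the sketch below shows two
independently written tools that cooperate only through a shared exchange
format. Everything in it is an assumption made for illustration: the tool names
(tokenize, tag), the tab-separated one-token-per-line record layout, and the
toy lexicon are hypothetical and are not prescribed by the Guidelines.

```python
# Hypothetical sketch: two tools that know nothing about each other's
# internals and communicate only through a common record format
# (one token per line, fields separated by tabs).

def tokenize(text):
    """Tool 1: split raw text into records of the form 'index<TAB>token'."""
    return [f"{i}\t{tok}" for i, tok in enumerate(text.split(), start=1)]

def tag(records, lexicon):
    """Tool 2: read Tool 1's records and append a tag field from a lexicon."""
    out = []
    for rec in records:
        idx, tok = rec.split("\t")
        # Unknown words receive the placeholder tag 'UNK'.
        out.append(f"{idx}\t{tok}\t{lexicon.get(tok.lower(), 'UNK')}")
    return out

# Toy lexicon standing in for a shared resource in a common format.
LEXICON = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}

def pipeline(text):
    """Chain the two tools; the output of one is the input of the other."""
    return tag(tokenize(text), LEXICON)
```

Because the tagger depends only on the record layout and not on the tokenizer
itself, either tool could be replaced by another implementation that respects
the same format.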
At present, the proliferation of different implementations of basic linguistic
tools such as part-of-speech taggers is not only confusing to the user who may
want to apply such tools, but also renders comparison of their results
virtually meaningless. Standard methods for software design and development
will make comparison of results possible.
Extensibility involves the capability to adapt tools to fit particular needs,
to add pieces to existing tools, to replace pieces, etc. One important
feature for linguistic purposes is the capacity to use the same tools on
different languages. At the moment, most linguistic software exists in the form
of integrated systems performing multiple functions, often with little or no
access to individually functioning modules. Thus adaptation or extension,
either for functionality or to accommodate other languages, etc., is virtually
impossible.
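The contrast with a modular design can be sketched as follows. This is only an
illustrative assumption about what "individually functioning modules" might
look like: the stage names, the run helper, and the simplified treatment of
French elision are all hypothetical and do not come from the Guidelines.

```python
# Hypothetical sketch: a tool assembled from small, individually usable
# modules, so that one module can be replaced without touching the rest.

def split_on_whitespace(text):
    """Default segmenter: tokens are whitespace-delimited chunks."""
    return text.split()

def split_french_elisions(text):
    """Alternative segmenter: also separates elided forms such as "l'"."""
    tokens = []
    for chunk in text.split():
        head, sep, tail = chunk.partition("'")
        # Split only when an apostrophe is followed by more material.
        tokens.extend([head + sep, tail] if sep and tail else [chunk])
    return tokens

def lowercase(tokens):
    """Language-independent module reused unchanged by both configurations."""
    return [t.lower() for t in tokens]

def run(stages, data):
    """Apply each module in turn, feeding its output to the next one."""
    for stage in stages:
        data = stage(data)
    return data

# Adapting the tool to another language means swapping one module only.
ENGLISH = [split_on_whitespace, lowercase]
FRENCH = [split_french_elisions, lowercase]
```

Here run(FRENCH, "L'ami dort") reuses the lowercase module as-is and replaces
only the segmenter, which is the kind of adaptation an integrated system with
no access to its modules cannot offer.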
The MULTEXT/EAGLES LSD Guidelines will be based on existing or emerging standards. However,
there is an enormous proliferation of standards relevant to the full
specification of the Guidelines, including areas such as character sets,
document encoding, programming languages, operating systems, etc. There exist
in some cases multiple standards for the same phenomenon, as well as drafts,
discussions among technical groups, etc., since many standards are currently in
various stages of the definition process.
The MULTEXT/EAGLES LSD Guidelines are intended to provide a selection among relevant standards
that best suit the needs of linguistic software development, in order to define
a coherent "open environment" for developers, the GLOSIX Open System
Environment. In addition, it will be necessary to fill or at least
determine the gaps among existing standards. To develop the MULTEXT/EAGLES LSD Guidelines, it
will be necessary first of all to compile a list of the relevant standards,
drafts, etc., and then to examine each closely, in order to determine their
relations, overlaps, compatibilities, etc. This is a formidable task, which can
only be accomplished by taking careful steps toward fuller and fuller
specification.
There exist similar integration efforts, such as the IEEE PASC Committee
(POSIX) or the X/Open Company. While it will be possible to build upon and
adapt from these efforts, the development of the MULTEXT/EAGLES LSD Guidelines differs from
them in the following ways:
- Our effort is expressly concerned with handling natural languages. The
above-mentioned efforts do not deal with natural languages apart from the
limited sub-areas of character sets and the internationalisation of software.
They are not necessarily concerned with the processing of linguistic objects.
- Documentation of the guidelines intended for reference by end users and
developers must be cheap, since a proportion of the intended audience consists
of small research teams with limited resources. The supporting documentation
for efforts such as POSIX is intended primarily for use by industry and is,
as a result, priced well outside the range of most academic teams.
- For the same reasons, recommended support tools and utilities necessary to
implement the MULTEXT/EAGLES LSD Guidelines must be cheap or free. A factor in the choice of
standards, etc. will therefore be the availability of software such as that of
the Free Software Foundation.
- Documentation must also be accessible to users and developers who may not
be highly trained computer scientists. The audience for the MULTEXT/EAGLES LSD Guidelines
includes linguists, humanists, etc. Therefore, it will be necessary to provide
tutorials and similar support materials for use by a potentially non-technical
audience.
The current document addresses primarily the issues of usability and
portability, which are seen to be the first and most basic issues to be
considered in working toward full development of the MULTEXT/EAGLES LSD Guidelines. Other
important concerns, such as compatibility and extensibility, will be taken up in
the continuing work of the Tools sub-group.
The MULTEXT/EAGLES LSD Guidelines will cover the following topics:
- GLOSIX Part 1. Data representation
- GLOSIX Part 2. Programming languages
- GLOSIX Part 3. Tool environment
- GLOSIX Part 4. Tool interfaces
- GLOSIX Part 5. Tool documentation
- GLOSIX Part 6. Tool compatibility
- GLOSIX Part 7. Tool testing
- GLOSIX Part 8. Tool distribution
Because the work of the Tools
sub-group is preliminary, this document is subject to the following limitations:
- The work of the sub-group so far applies only to the Unix environment,
which is a common platform for the development of linguistic software in
language engineering research. The work will be extended to apply to other
operating systems such as DOS, Windows, etc. in later phases.
- This document is a first sketch in which there are necessarily gaps and
omissions in the filling out of specifications within each of the topics listed
above, to be filled in as the work of the sub-group progresses. In some
instances no specific recommendations are made, but rather, an outline of
problems and a list of options is provided.
- There exists the potential for inconsistencies, inaccuracies, etc. which
may not come to light until the guidelines are more fully developed.
- The recommendations made here are preliminary and subject to change as the
MULTEXT/EAGLES LSD Guidelines evolve. Our intention here is to make a first attempt which can
serve as a catalyst for discussion and feedback.
- FAQ: Frequently Asked Questions, a compilation of the most frequently asked
questions, regularly posted on USENET newsgroups.
- LSD: Linguistic Software Development
Copyright (c) Centre National de la Recherche Scientifique, 1995-1996.