MULTEXT/EAGLES - Document LSD 1. Main text. Version 1.0. Last modified 28 April 1996.

Introduction

In recent years, numerous data collection initiatives and corpus projects have collected large amounts of electronic texts. At the same time, the TEI, MULTEXT and EAGLES are making substantial progress towards interchangeability and reusability of these resources. In addition, there is increasing demand for multi-lingual processing, especially in Europe, rendering mono-lingual tools obsolete. As a result, the need for generally available, flexible linguistic software is substantially greater than ever in the past.

Unfortunately, the linguistic software that exists at present (see for example the lists at the Natural Language Software Registry) only begins to cover growing needs. Industrial software is often expensive or unavailable, and usually hard to adapt or extend. On the other hand, the substantial body of academic natural language processing software is often experimental, hard to obtain, hard to install, under-documented, and sometimes unreliable. In both cases tools are typically embedded in large, non-adaptable systems which are fundamentally incompatible with one another. Although efforts to develop standards for data representation are underway, little effort has been made to develop standards for linguistic software, and software reusability is virtually non-existent.

As a result, there is a serious lack of generally usable tools to manipulate and analyze the text and speech corpora and collections that are now becoming widely available. Worse, there is enormous duplication of effort: it is not at all uncommon for researchers to develop tailor-made systems that replicate much of the functionality of other systems and in turn create programs that cannot be re-used by others, and so on in an endless software waste cycle. The reusability of data is a much-discussed topic these days; similarly, software reusability is needed, to avoid the re-inventing of the wheel characteristic of much language-analytic research in the past three decades.

MULTEXT joined efforts with the EAGLES sub-group on Tools, established in spring 1995 (in the last phase of the EAGLES project), to address this need by working toward the establishment of Guidelines for Linguistic Software Development. This document reports the work of the sub-group to date. It is intended to serve as a basis for the EAGLES Corpus Workgroup workshop in Madrid, Spain, in January 1996, where it will be possible to confront views, summarize problems and, where necessary, revise this proposal for future action. This document will also be circulated in and outside of EAGLES, in order to gather feedback from the larger research community which can in turn contribute to the continued effort to develop linguistic software standards.


Goals

The long-term goal of the MULTEXT/EAGLES effort is to establish a set of universally accepted Guidelines for Linguistic Software Development (LSD). These guidelines will specify a general lingware development environment, including recommended standards for all aspects of software development, data representation, linguistic annotation, etc. The establishment of such a set of guidelines will enable the interchange of tools and data among researchers and sites, ensure compatibility among tools with potentially diverse functionality, and in general contribute to the creation of reliable, high quality tools.

The scope of the work required to develop the guidelines is extensive and, in some ways, undefinable at this time. There are few if any established practices for the design and development of linguistic software; and, given the recent changes in the availability of data, the amounts of data typically stored and processed, and the development of new formats for data representation, these practices remain in a state of flux. In addition, platforms, systems, programming methodologies, and data structuring techniques are evolving rapidly. Another factor is the proliferation of standards for all aspects of software development and data representation, and the fact that information about these standards is scattered in a variety of unconnected and dissimilar locations. To aggravate the problem, many of these standards are in draft or partial form and are themselves in a state of flux.

MULTEXT and the EAGLES Tool sub-group have laid the groundwork for the future design of the guidelines. This task has involved:


Whose needs will be served?

Academic community

In general it is the academic community which suffers most from the lack of standardization of linguistic software, especially smaller research teams with few resources. Substantial time and effort are often wasted within academic research teams in writing software whose functionality has been implemented multiple times, simply because this effort is less intensive than attempting to adapt existing software--if such software exists. At the same time, the computer science expertise required to develop robust, reusable, and extensible code is often lacking in small teams specializing in linguistic analysis. Even when computer scientists are available for code development, it is not always certain that their programming techniques are aimed at ensuring maximum reusability. This is especially true since some of the factors enabling reusability rest on linguistic grounds rather than on computer science concerns. It is therefore beyond argument that academic research teams have much to gain from the development of linguistic software standards and from the libraries of reusable linguistic software that can be developed on their basis.

Even large academic teams who have developed tools for various tasks can benefit from the establishment of standards for software development, since it is almost invariably the case that such tools function well for the language for which they were designed but are not readily applicable to other languages. Given the increasing shift toward multi-lingual applications and environments, the need for flexible tools that can generalize across multiple languages is evident. It is very clear, for instance, that linguistic centers involved in large European consortia comprising several teams from a variety of countries need common multi-lingual tools. As new methods and technologies are developed, it is ever more essential to avoid their re-implementation in each language; common tools which can be extended once and for all for all languages are an increasing necessity.

Industry

Although industrial sites often have sufficient resources to develop custom software, there is increasing pressure within the industrial community to cut costs and avoid duplication of effort. Therefore, many of the benefits of linguistic software standards cited above apply here as well. This is especially true for SMEs.

There is no competitive advantage to be gained from re-implementing basic tools and state-of-the-art methods. Companies should be able to acquire good standard libraries of linguistic functions and basic tools, and spend their efforts developing advanced methodologies. The existence of standards for the development of linguistic software will benefit industry by providing guidelines for code development geared specifically toward ensuring reusability of linguistic software, and by leading to the availability of portable libraries of linguistic tools.


What is linguistic software reusability?

Linguistic software reusability comprises several aspects. We discuss the major ones below. They are listed in the order in which they should be implemented, since as we go down the list, each depends and builds upon the previous one.

Usability

Reusability implies usability as a starting point. The most obvious current obstacles to usability are factors such as poor documentation, unreliability, lack of robustness, etc., which are the prime reasons why freely-available software is not more widely used. The rectification of these problems is fairly straightforward, and serves as a first step in working toward reusability.

Portability

Portability concerns the capability for tools developed at one site to be immediately usable at other sites. At present, it is nearly a given that software developed at one site demands considerable tweaking to run at another site, especially if the environments are not perfectly identical. This leads to substantial investments of time and resources, just to get to the point of being able to run software acquired from other sites.

Ideally, we should aim for portability across platforms (Windows, MacOs, Unix), but this is a long-term goal which will require substantial work. In the short term, we can achieve portability between similar environments, e.g., between different versions of UNIX.

Compatibility

Compatibility concerns the capability for tools developed independently to inter-operate in the same environment, in order to perform complex tasks. This demands, first, that tools can communicate--that is, that results produced by one tool be usable by another; and, second, that their functionalities are complementary and coherent. It is also essential that tools are designed for compatibility with data and other resources (e.g., lexicons) in common formats.

At present, the proliferation of different implementations of basic linguistic tools such as part-of-speech taggers is not only confusing to the user who may want to apply such tools, but also renders comparison of their results virtually meaningless. Standard methods for software design and development will make comparison of results possible.

Extensibility

Extensibility involves the capability to adapt tools to fit particular needs, to add pieces to existing tools, to replace pieces, etc. One important feature for linguistic purposes is the capacity to use the same tools on different languages. At the moment, most linguistic software exists in the form of integrated systems performing multiple functions, often with little or no access to individually functioning modules. Thus adaptation or extension, whether for functionality or to accommodate other languages, is virtually impossible.
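As a purely illustrative sketch of the modular alternative described under Compatibility and Extensibility (it is not taken from the Guidelines), the following Python fragment shows small tools that exchange results in one common token format, with the language-specific resource (a lexicon) passed in as data rather than built into the code. All names here (tokenize, make_tagger, run_pipeline, the "form" and "pos" keys, and the toy lexicons) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List

# Assumed common record format: one dict per token. The key names
# ("form", "pos") are illustrative, not a proposed standard.
Token = Dict[str, str]
Tool = Callable[[List[Token]], List[Token]]


def tokenize(text: str) -> List[Token]:
    """Whitespace tokenizer; a stand-in for a real segmentation tool."""
    return [{"form": word} for word in text.split()]


def make_tagger(lexicon: Dict[str, str], default: str = "UNK") -> Tool:
    """Build a lookup tagger from a lexicon supplied as data.

    Switching language means supplying another lexicon,
    not rewriting the tool.
    """
    def tag(tokens: List[Token]) -> List[Token]:
        return [
            {**tok, "pos": lexicon.get(tok["form"].lower(), default)}
            for tok in tokens
        ]
    return tag


def run_pipeline(tokens: List[Token], tools: Iterable[Tool]) -> List[Token]:
    """Apply each tool in turn; every tool reads and writes the same format."""
    for tool in tools:
        tokens = tool(tokens)
    return tokens


# Toy lexicons, invented for this example.
english = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}
french = {"le": "DET", "chat": "NOUN", "dort": "VERB"}

tagged_en = run_pipeline(tokenize("The cat sleeps"), [make_tagger(english)])
tagged_fr = run_pipeline(tokenize("Le chat dort"), [make_tagger(french)])
```

Because each tool consumes and produces the same record type, a tool can be replaced or a new one appended to the list without touching the others, which is the property an integrated system lacks.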


General principles

Standards exist or are being developed in many areas relevant to linguistic software. Each of these standards covers a small piece of what would serve as a general lingware development environment, but none has been developed with an eye toward the overall coherence of such an environment. Annex 1 gives a list of relevant standards.

The goal of the MULTEXT/EAGLES LSD Guidelines is to bring together existing or emerging de jure or de facto standards sufficient to address the scope of an entire Linguistic Software Development system. This approach is similar to that of the POSIX Open System Environment (see Overview of the IEEE P1003.0 Guide to the POSIX Open Software Environment by Charles Severance).

We propose to call the resulting environment GLOSIX Open System Environment (GL stands for "General Lingware"; in Greek, glôsa means language).

The existence of a coherent open system environment will enable the distributed development of linguistic software, to which anyone, including individuals and small teams as well as large research teams, can contribute. Distribution of effort means that no single large investment of development time is needed, as it is for integrated systems, which in turn means that software can be distributed cheaply, or for free. In addition, this approach enables what we call software evolution. There is no need to envision the entire functionality of the system at the outset; instead, because extension is trivial (and can re-use existing functions), the system can easily grow as and when demand arises. This is especially important in a field where special and one-time use tools are common, and new functions are likely to be desired.

We see the following steps to develop the MULTEXT/EAGLES LSD Guidelines:

To implement these steps, we propose to adopt the following overall strategy:

Topics to be addressed

For each of the aspects of reusability listed above in the section What is linguistic software reusability?, the subgroup has listed a number of topics to be addressed in the Guidelines. It should be noted that these lists are not necessarily complete but are intended to serve as a basis for continued work.

Usability

The Guidelines should include recommendations for the following:

Portability

In the long term we aim towards full compatibility across diverse platforms such as Windows, Unix, and MacOs. In the short term, we focus on portability among different brands of Unix, which is a common and widely accepted platform for linguistic software development.

The Guidelines should include recommendations for:

Compatibility

The Guidelines should include recommendations for:

Extensibility

The Guidelines should include recommendations for:

Conclusion

MULTEXT and the EAGLES sub-group on Tools joined efforts to tackle the growing problem of the lack of reusability and the resulting waste of time and effort by setting out to draft a standard for the development of linguistic software. This task involves not only considering ways to ensure the reusability of individual tools, but also a thorough knowledge and consideration of different software platforms, operating systems, programming paradigms, as well as current practice and standards for programming, programming environments, data representation, etc. Due to the proliferation of standards, including not only completed standards but also drafts, discussions, etc. which are widely scattered and, in some cases, difficult to access, one of the most difficult challenges for the Tools sub-group was simply to compile the relevant information, which until now has never been gathered in a single place (see Annex 1, "Relevant standards", Annex 2, "International organizations", and Annex 3, "Bibliography").

This compilation, together with the outlining of a set of basic principles to guide the further development of linguistic software development standards, constitutes the first phase of the joint work of MULTEXT and the EAGLES Tools sub-group. This work is intended to serve as a basis for discussion and further development, and in particular to serve as input to discussions at the EAGLES workshop in Madrid, Spain, January 1996. All comments and feedback on the contents of this document or any aspect of linguistic software standardization are invited and encouraged.




Copyright (c) Centre National de la Recherche Scientifique, 1995.