EAGLES Text Corpora Working Group Workshop "Issues in Corpus Work"
Hotel Novotel, Madrid, Spain
Chairperson: J. Veronis
In recent years, numerous data collection initiatives and corpus projects have collected large amounts of electronic text. At the same time, the TEI, MULTEXT and EAGLES are making substantial progress towards interchangeability and reusability of these resources. As a result, the need for generally available, flexible text-analytic software tools is substantially greater than ever before.
Unfortunately, the software for corpus analysis that exists at present only begins to cover the growing needs of research. Industrial software is often expensive or unavailable, and usually hard to adapt or extend. On the other hand, the substantial body of academic natural language processing software is often experimental, hard to obtain, hard to install, under-documented, and sometimes unreliable. In both cases tools are typically embedded in large, non-adaptable systems which are fundamentally incompatible. Although efforts to develop standards for data representation are underway, little effort has been made to develop standards for software, and software reusability is virtually non-existent.
As a result, there is a serious lack of generally usable tools to manipulate and analyze the text corpora and collections that are now becoming widely available. Worse, there is enormous duplication of effort: it is not at all uncommon for researchers to develop tailor-made systems that replicate much of the functionality of other systems and in turn create programs that cannot be re-used by others, and so on in an endless SOFTWARE WASTE CYCLE. The reusability of data is a much-discussed topic these days; similarly, SOFTWARE REUSABILITY is needed, to avoid the re-inventing of the wheel characteristic of much language-analytic research in the past three decades.
The EAGLES subgroup on Tools is addressing these issues, with the initial goal of improving the reusability of software among corpus projects. However, given the lack of experience and background in linguistic software standardization and the limited amount of time and effort allocated to the subgroup, we could not develop complete guidelines and standards for corpus software. Therefore, the work of this group has been to take the first necessary steps toward software standardization, by assessing the needs for reliable and reusable corpus software, examining what exists, and on this basis proposing a set of possible approaches to software standardization. This work has been performed in collaboration with MULTEXT and PAROLE.
In the course of this work, the subgroup has identified a number of critical needs for the development of reusable corpus software. These are set out in a document that provides a basis for the development of linguistic software standards. This document is intended to serve as a basis for the EAGLES Corpus Workgroup workshop in Madrid in January 1996, where it will be possible to compare views, summarize problems and draft a proposal for future action.
The workshop will be organized as follows. The EAGLES Tools subgroup document will be circulated to all participants at the end of December, and it will be assumed at the workshop that the participants are familiar with it. During the first few minutes of the two-hour time period allotted to Tools in the workshop program, the subgroup chair will summarize the work accomplished in the subgroup and provide an overview of problems and potential solutions. Participants should be prepared to respond by proposing concrete recommendations, with the goal of producing at the end of the workshop a document outlining a precise policy for future action.
The session is intended to involve only the named participants, in order to promote a technical discussion which can lead to specific and concrete conclusions and recommendations.