Alice in the Wonderland of SGML: streamlining text entry in the CELEX databases
CEC, JMO C2/25 Bâtiment Jean Monnet, Plateau de Kirchberg, L-2920 Luxembourg
Abstract: This article describes the system used for the introduction of textual data into the full-text document databases. The solution implemented is based on the establishment of a text production database for the management and temporary storage of texts before introducing them into CELEX dissemination databases, and on the management of structured documents described with the help of SGML.
The explosive evolution of office automation systems has provided users with very powerful tools for the production and manipulation of documents. The concept of the ‘document’ itself has evolved into a multimedia entity (incorporating text, image and sound), while new ways of describing the structure of documents have been developed (e.g. Office Document Architecture, ODA, and Standard Generalized Markup Language, SGML). Today we refer to structured documents all too often.
On the other hand the evolution, if any, in the world of information retrieval has not been so spectacular. Commercially available software for the management of document databases, especially large ones, is based on the same concepts as ten years ago. However, with the integration of information retrieval functions into office automation systems, one can expect document packages capable of storing and retrieving structured documents to become available in the very near future.
Today, many (if not all) documents published are produced in an electronic format and some of them, for instance legal texts, are loaded onto databases. Although these texts are available on magnetic media, the format used is a more or less publishing-oriented one; furthermore, the organization of the dissemination databases into which these texts are to be introduced is not oriented towards the easy updating of textual information. This is the reason why texts originating in publishing-oriented environments should be passed through production-preparation systems, independent of the dissemination systems into which they are to be loaded.
In this paper we present the architecture of an operational, already decentralised production system used for loading textual data into dissemination databases operating in a multilingual environment. The system constitutes an application of the concept of structured documents based on the SGML standard (ISO 8879).
CELEX (Communitatis Europeae LEX) is the computerized documentation system for European Community law. CELEX is produced as an interinstitutional system (Commission of the European Communities, Council of Ministers, European Parliament, Court of Justice of the European Communities, Economic and Social Committee, Court of Auditors) and is made available to officials of Community institutions as well as to the public. (For information on CELEX, please contact the European Commission, http://ec.europa.eu). CELEX is a multilingual system and is available in French, English, German, Dutch, Italian, Danish, Greek and Spanish; the Portuguese version is under preparation.
Technically speaking, the most important features of CELEX in its current form are those given below:
Since the system went into operation in the mid-seventies, its structure has not evolved in a radical way. Consequently, management and maintenance have become progressively more expensive and complex. In order to address this situation the Commission decided to launch a project aimed at the modernization of the CELEX databases. The two major objectives of this project are:
To attain the objectives set, the management decided to split the project into several sub-projects. The first of them was to provide the procedure for introducing textual data and was designated TEXTERFACE (TEXT intERFACE).
3. Architecture of the ALICE-TEXTERFACE system
The main function of ALICE-TEXTERFACE is to process texts from different sources in order to produce the textual part of CELEX documents. The system accepts different formats as input and generates files, grouped by language version, ready to be introduced into the dissemination databases. ALICE-TEXTERFACE is implemented on a local Unix mini-computer, while the dissemination databases run on a powerful central mainframe.
The most important of the input sources (corresponding to 80% of the database coverage) is the Official Journal of the European Communities (OJ) produced by the Office for Official Publications of the European Communities (OPOCE). The files used for the publication of the OJ are transformed from their original FORMEX format to an SGML-based format (FORMEX-SGML). The FORMEX format (Guittet 1984) was defined by the OPOCE in order:
“to provide a detailed and structured method for recording information about the OPOCE’s publications in computer-readable bibliographic record, for exchange purposes between two or more computer-based systems”.
FORMEX attempts to unify two different approaches to the interchange of textual data, namely the CCF (Common Communication Format - based on ISO 2709) (UNESCO 1984) and SGML (ISO 8879).
Other accepted SGML-based formats include CJ-SGML (for data originating from the Court of Justice), EP-SGML (for data from the European Parliament), as well as the internal format CLX-SGML which is used for storing documents in the ALICE system.
These formats use different character sets. The FORMEX format, and consequently FORMEX-SGML, uses a character set based on ISO 6937 with an extension mechanism in line with ISO 2022. CJ-SGML uses a proprietary non-ambiguous character set (EBCDIL), which is a multilingual variant of EBCDIC.
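To illustrate the kind of character-set conversion this implies, the following is a minimal sketch, not the converter used in ALICE-TEXTERFACE. It assumes input in ISO 6937, where a non-spacing diacritic byte precedes its base letter, and converts it to Unicode, where the combining mark follows the letter. Only a small subset of the diacritic table is included, and the function name `iso6937_to_unicode` is hypothetical. The ISO 2022 extension mechanism and EBCDIL are not handled.

```python
import unicodedata

# Illustrative subset of the ISO 6937 non-spacing diacritic bytes,
# mapped to the corresponding Unicode combining marks.
ISO6937_DIACRITICS = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC8: "\u0308",  # diaeresis
    0xCB: "\u0327",  # cedilla
}


def iso6937_to_unicode(data: bytes) -> str:
    """Convert a byte string using the subset above to a Unicode string."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        mark = ISO6937_DIACRITICS.get(byte)
        if mark is not None and i + 1 < len(data) and data[i + 1] < 0x80:
            # ISO 6937: diacritic first, base letter second.
            # Unicode: base letter first, combining mark second.
            out.append(chr(data[i + 1]) + mark)
            i += 2
        elif byte < 0x80:
            out.append(chr(byte))  # plain ASCII passes through unchanged
            i += 1
        else:
            out.append("\ufffd")   # byte outside this sketch's subset
            i += 1
    # Compose base letter + mark into a single precomposed character.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, the bytes `b"d\xc2ecision"` would be converted to the string "décision". A production converter would need the complete ISO 6937 table and a policy for unmapped bytes.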
Figure 1 shows the architecture of the system.
Figure 1: Architecture of the ALICE-TEXTERFACE system
The most important modules are explained below.
3.1 Text production database (TPDB)
The TPDB is the heart of the system. Its main purpose is the management and temporary storage of texts in the different language versions; these texts are to be validated, verified and eventually modified before they are definitively introduced into the CELEX dissemination databases.
The TPDB is structured as follows:
The link between the structured (TPDB-R) and the textual files (TPDB-T) is achieved by storing the name of the file containing the text of the document in a field of the TPDB-R.
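The link described above can be sketched as follows. This is an illustration under stated assumptions, not the actual TPDB schema: the article's system used ORACLE, whereas the sketch swaps in Python's built-in `sqlite3` module so that it is self-contained, and the table name `tpdb_r`, its columns, the `.sgm` file naming and the function names are all hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_tpdb_r(conn: sqlite3.Connection) -> None:
    """Create a hypothetical reference table (TPDB-R stand-in)."""
    conn.execute(
        "CREATE TABLE tpdb_r ("
        " doc_id TEXT NOT NULL,"
        " language TEXT NOT NULL,"
        " text_file TEXT NOT NULL,"  # name of the file holding the text
        " PRIMARY KEY (doc_id, language))"
    )


def store_document(conn, text_dir: Path, doc_id: str, language: str, sgml_text: str) -> None:
    """Write the text to its own file (TPDB-T) and record the file name (TPDB-R)."""
    file_name = f"{doc_id}_{language}.sgm"
    (text_dir / file_name).write_text(sgml_text, encoding="utf-8")
    conn.execute(
        "INSERT INTO tpdb_r (doc_id, language, text_file) VALUES (?, ?, ?)",
        (doc_id, language, file_name),
    )


def load_text(conn, text_dir: Path, doc_id: str, language: str) -> str:
    """Follow the file-name field from the reference record to the text."""
    row = conn.execute(
        "SELECT text_file FROM tpdb_r WHERE doc_id = ? AND language = ?",
        (doc_id, language),
    ).fetchone()
    if row is None:
        raise KeyError((doc_id, language))
    return (text_dir / row[0]).read_text(encoding="utf-8")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_tpdb_r(conn)
    with tempfile.TemporaryDirectory() as tmp:
        store_document(conn, Path(tmp), "DOC1", "EN", "<TX>Sample text</TX>")
        print(load_text(conn, Path(tmp), "DOC1", "EN"))
```

Keeping only the file name in the reference record means the text can be opened with any editor without going through the database.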
3.2 Text preparation
This module performs several functions.
3.3 User interface
Users (i.e. members of the CELEX management team) are equipped with a menu-oriented interface containing the usual functions associated with the management of a database: creation, display, modification, validation, and deletion. Through this interface it is possible to access the reference files (TPDB-R) as well as the textual files (TPDB-T) of a document.
Access to the textual files is possible either via a simple screen editor or through a sophisticated word-processing package. Conversions between the character sets used in each case are provided for automatically.
Users can also introduce documents directly into the system. In this case, they have to introduce the markers themselves, in line with the CLX-SGML format. In the future it will be possible to replace the screen editor and the word-processing package with an SGML syntax-oriented editor that will make the online validation of the structure of documents possible.
All documents stored in the database remain there until they have been validated. Validation consists of the verification of the structure and the assignment of a CELEX number, the key to the whole system and the link with the analytical files.
3.4 DBA interface
This module provides the specific functions linked with the management of the system (backup, restore, elimination of documents already loaded into the dissemination databases, etc.).
3.5 Interface between text production and dissemination databases
Documents stored in the TPDB and already validated must be transferred to the dissemination databases. The module that performs this function is made up of two parts.
3.6 Management of consistency between local and central databases
The system automatically checks for consistency between the text production database and the dissemination databases. This check must ensure that all documents are accounted for, while providing data for the clean-up, at local level, of documents already loaded onto the dissemination databases. The mechanism put in place uses dates linked with each stage of the production process of a document, as well as a process that consolidates the results of updating the dissemination databases in the local system.
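A date-based check of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the mechanism actually implemented: the stage names (`validated_on`, `transferred_on`, `loaded_on`), the shape of the load report returned by the central system, and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass
class LocalDocument:
    """Local record carrying one date per production stage (hypothetical stages)."""
    doc_id: str
    validated_on: Optional[date] = None
    transferred_on: Optional[date] = None
    loaded_on: Optional[date] = None  # filled in by consolidation


def consolidate(local_docs: List[LocalDocument], load_report: Dict[str, date]) -> None:
    """Copy the load dates reported by the central system into the local records."""
    for doc in local_docs:
        if doc.doc_id in load_report:
            doc.loaded_on = load_report[doc.doc_id]


def check_consistency(local_docs: List[LocalDocument]) -> Tuple[List[str], List[str]]:
    """Return (ids safe to clean up locally, ids transferred but not yet confirmed)."""
    cleanup = [d.doc_id for d in local_docs if d.loaded_on is not None]
    pending = [
        d.doc_id
        for d in local_docs
        if d.transferred_on is not None and d.loaded_on is None
    ]
    return cleanup, pending
```

In this sketch, a document that was transferred but never appears in a load report stays in the pending list, which is how unaccounted-for documents would surface.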
4. Discussion of the solution adopted
The solution that has been implemented uses several concepts and techniques that should be discussed further.
4.1 Use of a production database for the management of texts
This choice is seen as practical and justified as a management tool for the following reasons.
4.2 Separation of references (stored in a relational production database) and texts
This choice was not simply a means of overcoming the lack of powerful packages that can treat long texts in a Unix environment; it was a sound technical choice for several reasons:
4.3 Use of SGML
Without making a check-list of the advantages and disadvantages of SGML, it must be emphasized that its most interesting feature is its simplicity and the ease of definition of document types. Thus, SGML is very well adapted to the exchange of information in a documentary context.
The use of SGML makes the definition of unrestricted exchange formats very easy. In the context of these formats, different logical elements of information are identified by markers. This is extremely interesting as far as the evolution of the application is concerned (new markers can be created without modifying the existing ones) while the difficulties associated with the treatment of other fixed formats are avoided altogether.
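The point about new markers can be illustrated with a small sketch. It assumes a heavily simplified, flat marker syntax (`<TAG>content</TAG>`, no nesting, no attributes, no tag omission), which is far less than full SGML, and the marker names `TI`, `TX` and `NOTE` are invented for the example rather than taken from CLX-SGML.

```python
import re
from typing import Dict, List, Set, Tuple

# Matches one flat element: <TAG>content</TAG>. Real SGML allows much more.
ELEMENT = re.compile(
    r"<(?P<tag>[A-Za-z][A-Za-z0-9]*)>(?P<body>.*?)</(?P=tag)>",
    re.DOTALL,
)


def extract_elements(text: str) -> List[Tuple[str, str]]:
    """Return (marker, content) pairs in document order."""
    return [
        (m.group("tag").upper(), m.group("body").strip())
        for m in ELEMENT.finditer(text)
    ]


def select_known(elements: List[Tuple[str, str]], known: Set[str]) -> Dict[str, str]:
    """Keep only the markers a given consumer understands; ignore the rest."""
    return {tag: body for tag, body in elements if tag in known}


if __name__ == "__main__":
    old_doc = "<TI>Council Regulation</TI><TX>Article 1 ...</TX>"
    new_doc = old_doc + "<NOTE>Added later</NOTE>"  # a newly introduced marker
    known = {"TI", "TX"}
    # A consumer written before NOTE existed sees the same data in both cases.
    print(select_known(extract_elements(old_doc), known))
    print(select_known(extract_elements(new_doc), known))
```

Because the consumer selects elements by marker name rather than by position, introducing a new marker does not disturb programs that only know the older ones, which is the difficulty fixed-position formats run into.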
In the ALICE system, the SGML standard is extensively used for the definition not only of input formats (FORMEX-SGML, CJ-SGML, EP-SGML) but also for the basic format under which the documents are stored (CLX-SGML), as well as for exchange between the production system and the dissemination databases.
The use of an SGML parser is certainly an important advantage and facilitates the development process because it enables the format to be extended without problems. However, the absence of such a parser by no means constitutes a limitation with regard to the application of the SGML concept.
4.4 Open architecture independent of the DBMS used for the dissemination databases
The design and development of ALICE were guided by the need to implement a system capable of running on every Unix platform (POSIX systems in general) and to eliminate, or at least minimise, dependence on any specific technology. The critical point of the CELEX system has always been the DBMS used for the dissemination databases. With the system in place and the definition of a basic format that is application-dependent and not DBMS-dependent, the generation of input files for other, different, packages is easily achievable and will not compromise the structure of the data input procedures.
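One way to picture this independence is a set of interchangeable writers that all start from the same basic representation. The sketch below is purely illustrative: the document is modelled as a plain mapping from marker to content, and the two target formats (`"sgml"` and `"tagged"`), the writer functions and the marker names are hypothetical, not the load formats of MISTRAL or of any other package.

```python
from typing import Callable, Dict

Document = Dict[str, str]  # marker name -> content, standing in for the basic format


def to_sgml(doc: Document) -> str:
    """Render the document as flat SGML-style elements."""
    return "".join(f"<{tag}>{body}</{tag}>" for tag, body in doc.items())


def to_tagged_lines(doc: Document) -> str:
    """Render the document as one 'TAG=content' line per element."""
    return "\n".join(f"{tag}={body}" for tag, body in doc.items())


# Supporting another target means adding one entry here;
# the data input procedures upstream are untouched.
WRITERS: Dict[str, Callable[[Document], str]] = {
    "sgml": to_sgml,
    "tagged": to_tagged_lines,
}


def export(doc: Document, target: str) -> str:
    """Generate an input file body for the named target package."""
    try:
        writer = WRITERS[target]
    except KeyError:
        raise ValueError(f"no writer registered for target {target!r}") from None
    return writer(doc)
```

The design choice being illustrated is that only the writers know anything about the target package; replacing the dissemination DBMS would mean writing a new writer, not changing how documents are entered and stored.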
5. Implementation and problems encountered
ALICE-TEXTERFACE was developed and runs on an NCR Tower 850 Unix mini-computer. The text production database was implemented under the ORACLE relational DBMS, while the textual files are accessible through the VI editor or the Q-one word-processing package. The dissemination databases run under MISTRAL, a document full-text DBMS, on a DPS 90 Honeywell–Bull mainframe under the GCOS 8 operating system. The SGML parser used was the MARK-IT package from SEMA GROUP S.A.
The first version of ALICE-TEXTERFACE was installed in September 1990, and the second is now available. The system permitted the introduction of all 1988 and 1989 texts that were missing from CELEX. It proved to be extremely user-friendly and flexible. Using the system, it was possible to process more than one OJ daily (all nine language versions). However, in order to accelerate the introduction of missing texts, it was decided to introduce only limited corrections in texts; because of this, the full capabilities of the system for correcting texts were not utilised.
It should be noted that the intrinsic complexity of the system, especially that of FORMEX, as well as the fact that some of the error messages of the MARK-IT parser were not quite explicit, made the detection of problems a cumbersome process.
More specifically, the major problems encountered can be summarised as follows.
Applying the concepts of open systems in the development of a decentralised application is not always easy, especially in the world of full-text databases. An independent production system for text entry can be made ‘open’ easily, through the extensive use of SGML-based formats for data interchange between the different sources of data as well as between the production and dissemination systems. The ALICE system will be further developed in order to cover the production of bibliographic files, as well as the introduction of special treatments for full-text (e.g. morphological analysis).
We would like to express our thanks to the following persons and organisations who helped us attain the objectives set: Victoria Bensch, DBM CELEX and Project Manager, Commission of the European Communities, for her support throughout the whole project; Gilbert Joulain and Martine Renneson, programmers, Commission of the European Communities; Jean-Claude Xheunemont, SEMA Group Brussels, for developing and adapting the programs; and Emmanuel Albanese, Database Administrator of the ALICE-TEXTERFACE system, CELEX team, Commission of the European Communities, for his valuable remarks on the use of the system and his intelligent implementation of the tools developed. Finally, we thank the Directorate of Informatics, Commission of the European Communities, for making the resources necessary for this project available.
The Electronic Library, Vol. 9, No. 3 June 1991, pages 155-159