Creating the Greek CELEX database,
technical or managerial challenge?
CELEX is the interinstitutional computerized documentation system for European Community law. It is a database that contains acts of Community legislation and case law with their full text, as well as bibliographical data on preparatory acts and parliamentary questions (see ref. 1).
CELEX has been established as a multilingual system because the Treaties provide for nine official working languages. This multilingual aspect of CELEX is not a mere luxury but a matter of primary importance for the European citizen: in many cases Community law supersedes national law, which is why CELEX, the main distribution channel for Community law, is considered a pillar of the united Europe of 1993.
CELEX exists already in French, English, German, Dutch and Italian. The Danish and Greek versions are under preparation while the Spanish and the Portuguese versions are to follow.
2. THE TECHNICAL CHALLENGE OF MULTILINGUALISM
The existing versions of CELEX do not contain special Latin characters (i.e. accented French or German letters); the code used for these versions is the basic alphabet (ISO 646). Texts with special characters are transposed into standard ASCII. In addition texts are further transposed into upper case (“appauvrissement”). Some characters are transliterated as one character and some as two (see Table 1).
TABLE 1: TRANSLITERATION OF SPECIAL LATIN CHARACTERS IN CELEX DOCUMENTS (appauvrissement).
(upper case letters are similarly transliterated as one or two capital letters, or as one capital and one lowercase, if they are in the beginning of a word)
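The impoverishment step described above can be sketched in a few lines of Python. The mapping below is an illustrative assumption, not the actual CELEX table, and the function name `impoverish` is hypothetical; it only shows how some characters become one letter and others two before everything is forced to upper case.

```python
# Illustrative mapping only: the real CELEX transliteration table may differ.
TRANSLITERATION = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",          # two-character cases
    "é": "e", "è": "e", "ê": "e", "à": "a", "ç": "c",    # one-character cases
}


def impoverish(text: str) -> str:
    """Reduce a text to basic upper-case letters (hypothetical helper)."""
    out = []
    for ch in text:
        # Look up the lower-case form so that "Ä" and "ä" share one entry;
        # characters without an entry are kept as they are.
        out.append(TRANSLITERATION.get(ch.lower(), ch))
    # The final step of the impoverishment: everything in upper case.
    return "".join(out).upper()


print(impoverish("Règlement"))
print(impoverish("Übergangsmaßnahmen"))
```

Under these assumptions the two calls print `REGLEMENT` and `UEBERGANGSMASSNAHMEN`: the text is searchable on any basic terminal, but the original spelling can no longer be recovered.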
The creation of a Greek CELEX database in such a “poor” environment is possible. One way would be to replace lower case Latin letters with upper case Greek ones, or to replace the Latin alphabet with the Greek. Serious problems would arise if either solution were adopted: legal texts written only in upper case letters are not recognized as binding in Greece. Moreover, the Community’s legal texts in Greek do contain words in Latin script (“sui generis” decisions, “ad hoc” committees, the “ESPRIT” programme) that cannot be translated or transliterated into Greek and can in fact constitute useful search terms in the context of a Greek database. Last but not least, the accents can in no way be omitted from Greek texts. In Greek, accents mark the stressed syllable and can play an important role in making the meaning of a word clear. Words may be spelled the same but have different meanings, depending on the syllable that is stressed (see Table 2, ref. 2).
TABLE 2: POLYSEMY IN GREEK WORDS (table adapted from ref. 4)
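A minimal Python sketch shows why dropping the accents loses information. The two word pairs are common examples chosen for illustration and are not necessarily those of Table 2; the helper `strip_accents` is a hypothetical name.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Remove combining marks (such as the Greek accent) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


# Pairs that differ only in the position of the stress.
PAIRS = [
    ("νόμος", "νομός"),  # "law" / "prefecture"
    ("πότε", "ποτέ"),    # "when" / "never"
]

for first, second in PAIRS:
    # Distinct as written, indistinguishable once the accents are removed.
    print(first != second, strip_accents(first) == strip_accents(second))
```

Both lines of output read `True True`: a search in an unaccented base could not tell a text about a law from one about a prefecture.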
All these considerations led to the conclusion that an extended character set was needed for the Greek base. This set would provide for the basic Latin characters plus the Greek accented and non-accented letters. In fact, what was needed was an 8-bit standard for the Greek-Latin alphabet similar to the one used for western European languages (ISO 8859/1). The standard would have to be international because that is what the Commission is expected to prefer (see ref. 3).
On the other hand, the standard would have to be implemented in real world products, particularly in:
The Commission’s Informatics environment made the challenge still greater by stipulating conformity with the then newly established Informatics Architecture (see ref. 4). That was quite logical. It would be unthinkable to introduce a special multilingual terminal (including Greek) into the Commission - especially the Translation divisions – in order to have them interrogate CELEX, without it being possible to connect such a terminal to the word processing system on the departmental computer, to EURODICAUTOM or to other bases outside the Commission. The existing situation where three different terminals are used for the abovementioned tasks (VT 100s for accessing internal and external databases, ETS 2010s for word processing and SIEMENS 9751s for the Greek EURODICAUTOM) is not very rational and could not serve as an example.
3. MEETING THE CHALLENGE
In January 1986, when an official was given the task of setting up a Greek version of CELEX, it was thought originally that the Greek CELEX base could be created with the same procedures used for the existing language versions. By mid-1986, however, it was realized that a special project and additional resources would be needed.
Before the project could advance, standards had to be drawn up. Special contacts were established between ELOT (Hellenic Standards Organization) and the convenor of the corresponding Working Group of ISO (who was working at the time as an expert at the Commission’s DG XIII). As a result, an 8-bit Greek standard was established jointly by ELOT and ISO in June 1986 (ELOT 928) and it soon became an ECMA and ISO standard (ISO 8859/7).
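The practical effect of such an 8-bit Greek-Latin set can be illustrated with Python’s built-in `iso8859_7` codec, used here as a modern stand-in for the standard named above. The sample string is invented for the example.

```python
CODEC = "iso8859_7"  # Python's name for its ISO 8859-7 codec

# A Greek word with an accent followed by Latin words, as in Community texts.
text = "Επιτροπή ad hoc"
encoded = text.encode(CODEC)

# One byte per character, for Greek and Latin letters alike.
print(len(text), len(encoded))

# Latin letters keep their 7-bit values; Greek letters sit in the upper half.
latin_bytes = [b for ch, b in zip(text, encoded) if ch.isascii()]
greek_bytes = [b for ch, b in zip(text, encoded) if not ch.isascii()]
print(all(b < 0x80 for b in latin_bytes), all(b >= 0xA0 for b in greek_bytes))

# Decoding restores the original text, accents and case included.
print(encoded.decode(CODEC) == text)
```

Because the lower half of the code table is unchanged, Latin search terms such as “ad hoc” survive untouched next to accented Greek.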
Meanwhile, contacts were established with industry (see ref. 5), to provide for terminals incorporating Greek characters according to the standard while conforming with the Informatics Architecture (VT 220 compatibility). An interim solution via DRCS (dynamically redefinable character sets) was also envisaged but was not put into practice at the time because of lack of resources.
Following established procedures (see ref. 5), a new feasibility study was adopted by Commission departments in January 1987. Progress of the project was conditioned by the availability of terminals on the list of approved hardware and software products for use by the Commission and by the availability of a DBMS (MISTRAL V5) supporting a multilingual environment (including Greek).
By autumn 1987, a multilingual terminal was submitted for testing in the Commission’s Informatics Workshop (SCRIBEL terminal by CREL; the company was later renamed TIL Technologies and resubmitted the product under the name of ALTAIR). Although the layout of the Greek keyboard was not ergonomically correct, a series of tests with the DBMS proved conclusive; MISTRAL V5 was accepted in February 1988 as capable of supporting a multilingual environment. The acceptance was based on the creation of a test base in Greek containing 5 documents; it was possible to interrogate that base successfully from Greece (using 8-bit transparent access).
In parallel, following an ever increasing political interest in the proper introduction of Greek into the Commission’s computer systems and looking ahead to the Greek presidency (July 1988), special attention was given to the implementation of a correct version of Q-one, the Commission’s standard word processing package under UNIX. Until then, Q-one had been using an ambiguous coding for Greek and that had caused serious problems, e.g. in file transfer, alphabetical ordering etc. A new table for Greek was implemented, and the appropriate keycaps, printcaps, termcaps and collate files created. It thus became possible by June 1988 to produce and print Greek texts on Q-one, using standard VT 220 terminals (WYSE 65) and laser printers. Of course, a DRCS solution was implemented. Some problems that still remain will be solved through the use of true Greek-Latin VT 220 terminals and the implementation of special Greek laser printer fonts. A special program (still to be developed) will allow data transfer from the database towards Q-one (downloading) as well as from Q-one to the database (uploading – data input). It is hoped that the upgraded Q-one will go into general use by October 1988. It will be accompanied by an updated version of ILS, the Commission’s pivot transcodification system. ILS (INSEM Local Server) permits the transfer of texts between a series of word-processing systems (Olivetti ETS 2010, Philips, etc.) and Q-one.
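The alphabetical ordering problem mentioned above can be illustrated with a short Python sketch. It works on Unicode strings rather than on the Q-one coding, and `greek_sort_key` is a hypothetical helper, not a reconstruction of the Q-one collate files; the point is only that sorting by raw character codes misplaces accented letters.

```python
import unicodedata


def greek_sort_key(word: str) -> tuple:
    """Compare base letters first; use the accented form only to break ties."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    base = base.replace("ς", "σ")  # final sigma is ordered as sigma
    return (base, word)


words = ["ώρα", "όνομα", "ψάρι", "άνθρωπος", "βάση"]

naive = sorted(words)                          # raw character-code order
collated = sorted(words, key=greek_sort_key)   # dictionary-style order

print(naive)
print(collated)
```

In the naive result “όνομα” is placed after “ψάρι”, because the accented omicron has a higher code than every unaccented letter; the keyed sort restores the expected order.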
Other products properly supporting multilingualism have also been submitted for testing in the Informatics Workshop (EURO-PC by SIEMENS) and are expected to figure on the Commission’s lists of approved products soon.
4. THE GREEK CELEX ITSELF
CELEX documents consist of two main parts: the analytical part (keywords classifying documents by type or subject matter, relevant dates, names of authors, relations to other documents etc.) and the textual part (title, full text). Data of the analytical part are fed in a coded form into a special file (called ARCHIVE). By the use of special multilingual tables, the ARCHIVE codes are machine-translated into the respective languages before being introduced into the corresponding CELEX bases. Titles in all languages are also introduced into the ARCHIVE in order to offset the fact that texts arrive later.
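The table-driven translation of ARCHIVE codes can be pictured roughly as follows. This is a sketch under stated assumptions: the field names, codes and labels are invented for illustration and do not reproduce the real CELEX tables, and `expand_record` is a hypothetical function.

```python
# Hypothetical multilingual tables: field -> code -> language -> label.
TABLES = {
    "TYPE": {
        "R": {"EN": "Regulation", "FR": "Règlement", "EL": "Κανονισμός"},
        "D": {"EN": "Decision", "FR": "Décision", "EL": "Απόφαση"},
    },
    "AUTHOR": {
        "CONS": {"EN": "Council", "FR": "Conseil", "EL": "Συμβούλιο"},
    },
}


def expand_record(record: dict, lang: str) -> dict:
    """Replace every coded field of a record by its label in `lang`."""
    expanded = {}
    for field, code in record.items():
        table = TABLES.get(field)
        # Fields without a table (dates, numbers) are copied unchanged.
        expanded[field] = table[code][lang] if table else code
    return expanded


record = {"TYPE": "R", "AUTHOR": "CONS", "DATE": "1988-06-30"}
print(expand_record(record, "EN"))
print(expand_record(record, "EL"))
```

With this arrangement, adding a language means adding one column of labels to each table, while the coded records themselves stay untouched.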
Therefore, for a Greek database, three types of data are to be considered:
The CELEX tables can be completed for Greek very easily through the use of the THELEM data entry package.
The titles of CELEX documents are to be introduced via Q-one or by the use of data, already existing on magnetic media (e.g. the tapes used to produce the Greek version of the “Directory of Community Legislation in force”).
A large proportion of the texts (legislation, case law) is available, mainly from the Office of Official Publications on magnetic media of different formats. Simple transcodification and formatting programs are to be developed in order to feed these data into the base.
Now that the basic building blocks (terminals, DBMS, data entry software) are almost in place and (almost) linked together, writing the above-mentioned programs that feed the data into the base can be considered a rather routine task. If the necessary resources (6 programmer months) are allocated as expected in July 1988, the infrastructure should be completed by early 1989; the base will then be loaded with analytical data and texts and will be opened to the public in the course of 1989.
5. SOME INTERESTING PROSPECTS
When a general problem is solved, minor specific sub-problems become easier to solve too. The introduction of the Greek script into the Commission’s Informatics Architecture – through the Greek CELEX project – means that all the Latin languages can be treated correctly too. The “Latin” versions of CELEX can now contain special characters (“rich” versions) and these versions can be incorporated into the Commission’s office automation environment. The special transfer programs developed for Greek merely need adapting. The new modernization plan for CELEX, the adoption of which is pending, provides for the introduction of the full texts of Commission proposals into CELEX. When the final text of these proposals is adopted by the Council, translators will need only to download the proposal into their word-processor, introduce amendments as necessary and send off the final text to the relevant authorities (electronically or on paper).
As far as the Greek CELEX itself is concerned, there are further prospects. It will be the first full-text database to be available on-line in Greece. This means that it will serve as a pilot for the opening up of an information market in Greece, in line with the Community programme for the development of a specialized information market in Europe (see ref. 6).
On the other hand, the Greek CELEX can serve as a benchmark for a more advanced morphological study of the Greek language. Last but not least, all language versions of CELEX can serve as test beds for new machine translation packages. In fact CELEX documents are identified by a single document number – the same in all its language versions. So, if one finds a decision or a regulation in the English base, one can find the German version in seconds using its document number.
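The use of a shared document number to collect parallel texts can be sketched as below. The in-memory dictionaries stand in for the language bases, and the document number and texts are invented for illustration; `parallel_versions` is a hypothetical helper.

```python
# Hypothetical stand-ins for the language bases, keyed by document number.
BASES = {
    "EN": {"31988R0001": "Sample regulation (English text)"},
    "DE": {"31988R0001": "Sample regulation (German text)"},
    "EL": {"31988R0001": "Sample regulation (Greek text)"},
}


def parallel_versions(doc_number: str, languages: list) -> dict:
    """Fetch the same document from several language bases."""
    # A missing version is reported as None rather than raising an error.
    return {lang: BASES[lang].get(doc_number) for lang in languages}


pair = parallel_versions("31988R0001", ["EN", "DE"])
print(pair)
```

Pairs obtained in this way give a machine translation package a source text together with an official translation against which to compare its output.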
Special thanks are due to the system supplier, Mr. Jose MARIN-NAVARRO of the Service for the Development of Applications; special assistance was provided through valuable advice by Mr. Marios RAISSIS of the Computing Center, Ms Georgia EFTHYMIOPOULOU of DG XIII and Messrs Michael BALTSAVIAS and Nikolas PAPADIMITRIOU of the Translation Directorate. Q-one was adapted by Ms Monique VINCENT of TER/Brussels. Thanks are also due to all persons in and outside the Commission who supported this project in any way (critical, moral or political).
The creation of the Greek CELEX base is a challenge linked to the wider introduction of multilingualism into the Communities’ computer systems. The establishment of standards for the Greek alphabet, their incorporation into tangible products (terminals, software packages) and the linking together of all the building blocks so that they form an integrated system: this is what will have been achieved by 1989, when the Greek base opens to the public.
The positive effects of this project include the possibility of introducing the special Latin characters into the system, the promotion of an information market in Greece and the provision of a test bed for a first, more substantial study of the morphology of the Greek language. The system could also be used for testing prototype machine translation systems. Above all, however, the Greek CELEX will help to keep lawyers, businessmen and managers in the public and private sectors better informed about Community law.
Terminologie et Traduction (No 1, pp. 11-21)