Technical information

Data Structure
Encoding and entry
Display
Processing
Storage

The DMLBS is prepared in XML according to customized XSD schemas using the Oxygen XML editor, which enables visual editing using custom-built CSS. All data is held in Unicode encoding. Processing and checking of the data is accomplished by the use of XSLT and XQuery. The architecture for the whole XML-based system (including the full suite of schemas, stylesheets and transformations) and the associated workflow were designed and built from scratch by Richard Ashdowne. This page gives some more technical information about the project's use of XML encoding and other digital technology.

Data structure

At the heart of the DMLBS XML workflow sit the data schemas which describe and are used to constrain the structure of the data. The DMLBS uses XSD schemas.

The DMLBS has two parallel basic schemas for dictionary text, which each also import a number of shared external schemas (such as a schema defining common metadata elements relating to progress through the editorial workflow). These two basic schemas represent the dictionary text in:

the form in which it is printed (with the definitions within a sense grouped into a paragraph separated from their quotations in another paragraph)
the form in which it is initially drafted (in which quotations are grouped with their associated definitions)

XML data is readily transformable and can be output and displayed as required, which might suggest that a single underlying structure would suffice. In practice it is easier, more efficient and more accurate to work on the data in the shape best suited to each stage, since it can easily be restructured afterwards. In our case a printed draft has to be generated in the format of the printed dictionary and circulated to readers, who then provide manuscript comments on it. The data is therefore first prepared in a form optimized for drafting (it allows an editor to reorder subsenses and quotations easily and accurately) and is then restructured into the form that mirrors the print dictionary (so that any consequent changes can be made quickly and, above all, accurately).

During the transition to electronic drafting, and for the capture of the existing printed text, it was in any case crucial for efficient and accurate data entry that the schema should allow the data to be captured directly in the order in which it appeared, rather than requiring reordering by those keying or scanning it (with the consequent risks of inconsistency or inaccuracy). Once production of the print dictionary is complete, we plan to restructure the full dataset into a single canonical form resembling the present drafting schema.
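As a rough illustration of the restructuring between the two forms, the following Python sketch regroups a drafting-style sense (quotations held with their definitions) into a print-style sense (definitions in one block, quotations in another). The project itself does this with XSLT; Python is used here only for illustration, and the element and attribute names (sense, subsense, def, quote, label) are hypothetical, not those of the DMLBS schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical drafting-form markup: each subsense keeps its quotations with it.
DRAFT = """<sense n="1">
  <subsense label="a"><def>definition A</def><quote>quotation 1</quote><quote>quotation 2</quote></subsense>
  <subsense label="b"><def>definition B</def><quote>quotation 3</quote></subsense>
</sense>"""


def to_print_form(draft_sense):
    """Regroup a drafting-form sense into one block of definitions and one of quotations."""
    printed = ET.Element("sense", dict(draft_sense.attrib))
    definitions = ET.SubElement(printed, "definitions")
    quotations = ET.SubElement(printed, "quotations")
    for sub in draft_sense.findall("subsense"):
        # Carry the subsense label over so each quotation can still be tied to its definition.
        label = sub.get("label", "")
        for d in sub.findall("def"):
            ET.SubElement(definitions, "def", {"label": label}).text = d.text
        for q in sub.findall("quote"):
            ET.SubElement(quotations, "quote", {"label": label}).text = q.text
    return printed


if __name__ == "__main__":
    printed = to_print_form(ET.fromstring(DRAFT))
    print(ET.tostring(printed, encoding="unicode"))
```

Because the label travels with each definition and quotation, the same information could in principle be regrouped in the opposite direction, which is the property that makes working in two shapes safe.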

In addition to the basic Dictionary schemas, there is a specific schema for the Dictionary's complex bibliography, which is also held in XML form. This bibliography is used for validating references in the Dictionary text and to provide editors with a regularly updated HTML version that allows them quickly and easily to identify authors and works, to follow links to online resources, and to copy XML-tagged references to the system clipboard ready to be pasted directly into the entry being drafted.
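Validating references against the bibliography can be pictured along the following lines. This is a minimal Python sketch, not the project's own XSLT or XQuery; the markup (work elements with an id attribute, ref elements with a work attribute) and the identifiers are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography and entry markup; the identifiers are placeholders.
BIBLIOGRAPHY = """<bibliography>
  <work id="AuthorA.Work1"/>
  <work id="AuthorB.Work2"/>
</bibliography>"""

ENTRY = """<entry>
  <quote><ref work="AuthorA.Work1"/>quotation text</quote>
  <quote><ref work="AuthorC.Work9"/>quotation text</quote>
</entry>"""


def unknown_references(entry_xml, bibliography_xml):
    """Return the work identifiers cited in the entry that the bibliography does not list."""
    known = {w.get("id") for w in ET.fromstring(bibliography_xml).iter("work")}
    cited = [r.get("work") for r in ET.fromstring(entry_xml).iter("ref")]
    return [work for work in cited if work not in known]


if __name__ == "__main__":
    for work in unknown_references(ENTRY, BIBLIOGRAPHY):
        print("Not in bibliography:", work)
```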

The schemas in use were custom-built for the DMLBS in order to match our very specific needs. They ensure that the drafted or captured text always complies with the long-standing structures and conventions of the printed dictionary by requiring, allowing or prohibiting as necessary. Although the use of TEI encoding was seriously considered, it was clear from initial exploration that the level of customization and optimization required to bring the TEI in line with the practical production needs of the dictionary was too great to be feasible. (In being intentionally flexible in its ability to cope with a wide range of dictionary types, the TEI dictionary module, which was not in fact yet well developed, would have required too much reengineering to provide the formal constraints that we needed; it was also essentially incompatible with crucial aspects of the DMLBS structure and conventions, which would therefore have meant including significant non-TEI encoding in any case; finally, using the TEI encoding for data entry was itself assessed as impracticable, being far more labour-intensive and possibly open to error.) It is a long-term aspiration of the project, however, to investigate building a suitable XSL transformation and TEI customization to be able to generate TEI-compliant XML for archival purposes.

Data encoding and entry

The encoding chosen for all DMLBS data is Unicode. This reflects the need to handle the complex mixture of character sets used throughout the dictionary. In addition to the Roman alphabet, with the full range of diacritics (including the macron and breve to mark vowel length), the Dictionary regularly uses Anglo-Saxon letters (such as thorn, wynn, and yogh) and polytonic Greek, along with assorted other letters and symbols. Unicode allows these to be represented unambiguously. The dictionary also has one particular historical quirk that Unicode enables us to deal with easily, namely the use of the two-dot ellipsis (U+2025) to indicate the omission of words in a quotation (instead of the three-dot version in more common use today); this has been used since the first printed fascicule of the dictionary as a means of saving space in print.
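To make the character repertoire concrete, the short Python sketch below prints the Unicode code point and name of a few characters of the kinds just mentioned, and flags a three-dot ellipsis where the two-dot form would be expected. The sample selection and the helper has_three_dot_ellipsis are illustrative assumptions, not part of the DMLBS tooling.

```python
import unicodedata

# An illustrative selection: a with macron, e with breve, thorn, wynn, yogh,
# Greek alpha with tonos, and the two-dot leader used as the ellipsis.
SAMPLES = ["\u0101", "\u0115", "\u00fe", "\u01bf", "\u021d", "\u03ac", "\u2025"]


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def has_three_dot_ellipsis(text):
    """Hypothetical check: True if the text uses U+2026 or '...' instead of U+2025."""
    return "\u2026" in text or "..." in text


if __name__ == "__main__":
    for ch in SAMPLES:
        print(ch, describe(ch))
```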

The project has custom-built keyboard layouts for Windows to enable editorial staff to access the most frequently used 'unusual' characters directly from the keyboard, rather than by time-consuming alternative routes. (The DMLBS XML system makes limited use of internal DTDs alongside the external schema to provide named entities to support some of the important less frequently used unusual characters such as the symbols representing uncia and drachma.)
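The way an internal DTD can supply a named entity for an unusual character can be sketched as follows. The entity name uncia and its mapping to U+2125 (OUNCE SIGN) are assumptions made for this example, as is the surrounding markup; the actual DMLBS entity declarations are not reproduced here. Python's standard XML parser expands such internally declared entities when reading the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares a named entity,
# which the entry text then uses in place of the literal character.
DOC = """<!DOCTYPE entry [
  <!ENTITY uncia "&#x2125;">
]>
<entry><quote>&uncia; ij</quote></entry>"""


def quote_text(xml_text):
    """Parse the document and return the text of its first quote element."""
    root = ET.fromstring(xml_text)
    return root.find("quote").text


if __name__ == "__main__":
    text = quote_text(DOC)
    print(text, [f"U+{ord(c):04X}" for c in text])
```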

Data display

For on-screen editing purposes, the project uses a single font, Gentium, that can display almost all the characters the dictionary uses. This has the advantage of making display in Oxygen considerably easier to handle. (Oxygen uses CSS to style elements, and so it is able to assign a different font for display only at the element level rather than at the character level; from the encoding perspective it was clearly desirable to avoid 'typographic' individual tagging of 'unusual' characters simply because they happen to be 'unusual'.)

Data processing

The DMLBS makes considerable use of XSLT (1.0 and 2.0) transformations and XQuery to process the data at various stages in the workflow. During drafting, assistant editors may enter quotations provisionally together with notes relating to the follow-up action required, and so they use transformations to extract reports on these (e.g. ordered lists of quotations to be checked in the National Archives or British Library). They also use transformations to carry out searches across the full data for the dictionary, looking within headwords, definitions or quotation text as appropriate. In the transition between the two schemas several transformations are carried out, which also ensure that the data is structurally valid and that the content conforms to the correct conventions. During the revision stages transformations are used to check that the entries are in correct alphabetical order (it's surprisingly difficult for human beings to get this right beyond the first two letters of a word!), that the senses are correctly numbered and/or lettered, that the references are valid (with respect to the published bibliography) and so on. Transformations are also used for project management, to keep track of the progress of material through the workflow, measuring the weekly addition of quotations and the completion of milestones for each portion of material.
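Two of the revision-stage checks described above, alphabetical order and sense numbering, can be sketched in a few lines. This is a Python illustration rather than the project's XSLT; the sort key (case-insensitive, letters only) is an assumption and may not match the Dictionary's actual ordering conventions, and the headword list is just sample input.

```python
def sort_key(headword):
    """Assumed ordering: case-insensitive, ignoring anything that is not a letter."""
    return "".join(c for c in headword.casefold() if c.isalpha())


def out_of_order(headwords):
    """Return (previous, current) pairs where the current headword sorts before the previous one."""
    return [
        (prev, cur)
        for prev, cur in zip(headwords, headwords[1:])
        if sort_key(cur) < sort_key(prev)
    ]


def misnumbered(labels):
    """Return the positions at which sense labels depart from the sequence 1, 2, 3, ..."""
    return [i for i, label in enumerate(labels) if label != str(i + 1)]


if __name__ == "__main__":
    print(out_of_order(["abbas", "abbatia", "abbatissa", "abbatialis"]))
    print(misnumbered(["1", "2", "4"]))
```

The first call reports the pair ("abbatissa", "abbatialis"), a slip that only shows up at the seventh letter, and the second reports position 2, where "4" appears instead of "3".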

Data storage

Data is held as plain-text XML files within a Subversion version control repository on a VisualSVN Server, with a single read-only working copy checked out to a shared network drive. After validation, updated data is integrated into this and a new version is committed to the repository.

Since October 2011 the DMLBS project has also made use of an Editorial Management System (a customized implementation of the Redmine project management and issue/bug-tracking system) for recording any addenda and corrigenda relating to the printed text that come to light during the course of editing. The EMS is also used for tracking the editorial workflow.

Further information

Further information on any aspect of the DMLBS's XML schemas, stylesheets, transformations and processes can be obtained by contacting the project. We especially welcome enquiries from and contact with other research projects that are engaged in or planning XML-based work, particularly in the humanities.
