MASTER: A gentle technical introduction

This document was written for the Centre for Technology and the Arts (CTA), the predecessor to the Centre for Textual Studies, in 2001. Its aim was to provide for experts on manuscript description, who may know little of computers or of computer encoding, an outline introduction to the technical aspects of the MASTER proposals that were, by then, being developed in consort with the Text Encoding Initiative (TEI) workgroup on manuscript descriptions.

By the end of this document, a reader should:

History of these proposals

Many scholars, working now over several decades, have made suggestions concerning the making of computer-readable manuscript descriptions, and many different computer systems (almost all using database technology) have been implemented to permit creation and retrieval of these descriptions. One may mention here the initiatives of the IRHT in the 70s and of the Italian Censimento in the early 90s. A convenient summary of several such systems can be found in the volume Bibliographic Access to Medieval and Renaissance Manuscripts: A Survey of Computerised Data Bases and Information Service, ed. Wesley Stevens, Primary Sources and Original Works, Volume 1, Numbers 3/4, 1991. It is a characteristic of such systems that their focus was to use computing technology to achieve precision of cataloguing for a particular domain of manuscripts, and not to create a generic system of cataloguing suitable for many different kinds of manuscript and manuscript catalogue. In 1991 a meeting of LIBER passed a resolution that a standard should be sought for first-level descriptions, but was unable to find resources to achieve this.

The rise of the web and advances in digital imaging of manuscripts gave a new impetus to this work in the mid-90s. It became clear that an online union catalogue of manuscripts was indeed technically feasible, and that digital imaging could offer access to the manuscripts in a manner never before imagined. Machine-readable manuscript descriptions capable of access in a single online system are vital to the realisation of these aims. This could only be achieved by agreement about the descriptions themselves: how they are to be structured and accessed. No such agreement existed, and several projects sought to address this. A key event was the funding, by the Mellon Foundation, of two interlocking American projects concerning manuscript descriptions: the Electronic Access to Medieval Manuscripts project (EAMMS) and Digital Scriptorium. Both these projects began in 1996, and brought together American and European experts on computer systems and manuscript cataloguing. Their inception provoked a group of European scholars to create a European counterpart.

A meeting at Studley Priory, near Oxford, in November 1996 planned what became the MASTER project (for: Manuscript Access through Standards for Electronic Records). After a long process of negotiation with the European Union, MASTER began (with funding from the Fourth Framework program) on 1st January 1999. The MASTER funded partners include the Royal Library, The Hague, the National Library of the Czech Republic, and L'Institut de recherche et d'histoire des textes, Paris. Unfunded partners include the Vatican Library, British Library, Bodleian Library, and the Biblioteca Ambrosiana. MASTER is very closely linked to EAMMS, and EAMMS representatives have been present at all MASTER meetings.

The present proposals are based on a different approach to that of the database systems implementations characteristic of the methods developed in the 1980s and described in the Stevens volume referred to above. Firstly, the various database systems developed for describing and distributing manuscript descriptions were just that: various indeed, with multiple problems relating to compatibility of data across systems and the binding of particular databases to specific (and perishable) combinations of software and hardware. Secondly, there is real doubt as to the efficacy of any database system for description of materials as elaborately and multifariously structured as manuscript descriptions. By the time of the inception of the EAMMS and other projects (from 1996 on) an alternative to a database implementation of manuscript descriptions had emerged: the application of Standard Generalised Markup Language (SGML) encoding to the task. Since 1996, the emergence and increased acceptance of Extensible Markup Language (XML) has added further force to this alternative. The work of the Text Encoding Initiative (TEI) since 1988 had created a set of guidelines for the encoding of a very wide range of humanities materials. These guidelines have become widely used in the scholarly community, and were instrumental in the explosion of electronic texts, which is reshaping many areas of scholarly work.

Something like a consensus developed among the scholars, archivists and computer experts involved in the EAMMS, Digital Scriptorium and MASTER projects: that SGML/XML encoding, rather than database systems, offered the most promising route towards a scheme for computer-readable manuscript descriptions which would add the new digital virtues of searchability and access to the strengths of the traditional, deeply-considered scholarly description. This would not mean discarding database systems entirely: there are too many records invested in such systems, and some of the systems themselves have such virtue, for this to be desirable or possible. Accordingly, all these projects have a database component, and EAMMS (which aimed from the beginning at both database and SGML/XML approaches) has been particularly successful in devising a MARC-based format for manuscript descriptions. However, at each end of the scale the possibilities of SGML/XML encoding recommend themselves. At one end of the scale: SGML/XML encoding of very simply structured records accompanied by manuscript images would lend itself to immediate web mounting. At the other end: the most complex manuscript descriptions could be housed in SGML/XML encoding, without distortion of the description and with no compromise of scholarly detail. Similarly, the development of a universal interchange format in SGML/XML might permit decanting of the many manuscript records from the many databases into a single, and therefore more malleable, form.

Several further factors contributed to the choice of the TEI implementation of SGML/XML as the base for the proposals here outlined. Firstly, there was the perceived success of the TEI guidelines in addressing many of the issues which would have to be resolved in any scheme of computer-readable manuscript descriptions. Such a scheme would need to establish encoding for transcriptions of text from the manuscripts: the TEI had already set forward guidelines for transcription of primary textual material, and these guidelines had been well tested. A vocabulary would be needed for encoding of various kinds of textual division, for encoding of metadata about the description, and more: the TEI had already done much of this work. Secondly, the TEI workgroup system offered a well-proven means of bringing together domain experts and encoding experts, and harnessing their various skills towards a consensus which could form the basis of a widely-accepted standard. Thirdly, several of the people active in framing the TEI guidelines had strong interests in medieval manuscripts, and so were involved in the projects named above from their beginning.

The Studley Priory meeting which effectively inaugurated MASTER was followed by meetings of the EAMMS group at Hill Monastic Manuscript Library in December, 1996, and a year later, in November 1997, by a meeting at Columbia University. This brought together many of the participants in the EAMMS, Digital Scriptorium and MASTER projects with other manuscript experts. Following this meeting, a TEI workgroup was established, with Consuelo Dutschke and Ambrogio Piazzoni as co-chairs. This workgroup first met in July 1998. The guidelines here described are the result of this meeting and of a series of meetings involving members of EAMMS, MASTER and the TEI workgroup: for example, in New York in January 1999 (members of the TEI workgroup); in Paris in February 1999 (the MASTER group); in Rome in March 1999 (the TEI workgroup). Around these meetings, vigorous discussions by email among the EAMMS and MASTER communities have helped shape these proposals.

A standard becomes a standard not simply as a result of a pure intellectual effort on the part of those who frame it: it must be accepted and used by the community. For this reason, the proposers of this standard wish to involve as many people as possible in the development of these proposals. Already, the members and associates of the EAMMS, Digital Scriptorium and MASTER groups include some of the most significant manuscript libraries, and some of the most respected manuscript scholars, in North America and Europe. An independent expert group (Dr Ian Doyle, Durham University Library [chair]; Professor Peter Gumbert, Leiden; Dr Gilbert Ouy, Paris) has met twice to consider these proposals, and many of their comments are incorporated in the standard. Through this introductory document, and through the separate full technical documentation, the proposers would like to make possible the broadest possible discussion of the standard.

What is SGML/XML?

This section gives a very brief account of SGML (Standard Generalised Markup Language), focussing on the areas most relevant to these proposals. If you are familiar with SGML already, skip to the next section. This basic description is also valid for Extensible Markup Language, XML.

The basic unit of SGML/XML is the element. An element can be a stream of text surrounded by a tag, as follows:

<p>This is a paragraph</p>

Here, the encoding <p> indicates that this is the beginning of a p element, and the encoding </p> indicates that this is the end of a p element. The content of the p element is the text ‘This is a paragraph’.

An SGML/XML document could just consist of a series of elements, one after another, as follows:

<head>This is a head element</head>

<p>This is a paragraph</p>

<note>This is a note</note>

One could compare this to a set of database fields: where a database could put this information into fields named ‘head’, ‘p’ and ‘note’, SGML/XML encoding puts the information into elements named <head>, <p> and <note>.

However, SGML/XML has six features which distinguish it from database systems, and (in the proposers’ opinion) make it especially suitable for encoding information as rich and as various as manuscript descriptions. Firstly, one can add attributes to elements, and give those attributes values. For example, one could specify that a paragraph is centred as follows:

<p justification="centre">This is a centred paragraph</p>

Or, for a left-justified paragraph:

<p justification="left">This is a left-justified paragraph</p>

One can combine different attributes and values. Here, we use an n attribute to indicate that this is the first paragraph, and the justification attribute to indicate that it is left-justified:

<p justification="left" n="1">The first left-justified paragraph</p>

The ability to add information to encoding by use of attributes and their values, singly and in combination with one another, can permit encoding of very complex information.

The second feature, which suits SGML/XML to our purposes, is that it permits elements to contain other elements (or, to ‘nest’ within one another). Here, we create an outer <chapter> element, which contains a <head> and a sequence of <p> or paragraph elements:

<chapter>

<head>This is the heading for chapter 1</head>

<p>The first paragraph</p>

<p>The second paragraph</p>

</chapter>

Or, instead of <chapter> one could use a generic <div> element (marking any kind of textual division) and use a type attribute to indicate that it is a chapter, thus:

<div type="chapter">

<head>This is the heading for chapter 1</head>

<p>The first paragraph</p>

<p>The second paragraph</p>

</div>

The third feature distinguishing SGML/XML is that one may use a Document Type Definition (DTD) to insist that a document must contain certain elements, and to specify that these elements must come in a certain order, and to define what these elements must contain.

In the preceding example, we could use a DTD to force every <div> element to have a type attribute, and to force every <div> element to contain at least one <head> element followed by at least one <p> element. In the proposals that follow, we use a DTD to enforce a rule that every manuscript description must contain an identifier for the manuscript. Through this mechanism, one can create a DTD to define a standard, and then ensure that documents do indeed conform to the standard by ‘parsing’ them against the DTD.
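As a minimal sketch of the rule just described (and not a fragment of the MASTER DTD itself), the declarations below, written in XML syntax as an internal DTD subset, require a type attribute on every <div> and a content of one or more <head> elements followed by one or more <p> elements:

```xml
<?xml version="1.0"?>
<!DOCTYPE div [
  <!-- a div must hold at least one head, then at least one p -->
  <!ELEMENT div  (head+, p+)>
  <!-- the type attribute is mandatory on every div -->
  <!ATTLIST div  type CDATA #REQUIRED>
  <!-- head and p hold plain text only in this sketch -->
  <!ELEMENT head (#PCDATA)>
  <!ELEMENT p    (#PCDATA)>
]>
<div type="chapter">
  <head>This is the heading for chapter 1</head>
  <p>The first paragraph</p>
  <p>The second paragraph</p>
</div>
```

A validating parser should accept this document, and should report an error if the type attribute were removed or if a <p> were placed before the <head>.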

The fourth feature distinguishing SGML/XML is its use of what are called ‘entity references’. In brief, an ‘entity reference’ is a mechanism to allow substitution of one string by another. For example: one could define an entity called ‘eacute’. Instead of ‘é’, one types the entity reference ‘&eacute;’ into an SGML/XML document. The advantage of this is that one can replace the entity reference ‘&eacute;’ by whatever computer character represents é on the computer system being used to read the document. Thus, this system can be used to make sure that text is correctly displayed whatever computer system is in use. Indeed, this scheme can be used to cope with different character sets, different writing systems, and any combination of languages, characters and writing systems in a single document.
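The following sketch shows how such an entity might be declared and used in XML. The declaration maps ‘eacute’ to the character é by its numeric code; the sample sentence is purely illustrative.

```xml
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ELEMENT p (#PCDATA)>
  <!-- define the entity once: 233 is the decimal code for e-acute -->
  <!ENTITY eacute "&#233;">
]>
<!-- the reference &eacute; is replaced by the character when the document is parsed -->
<p>The French word &eacute;criture means writing.</p>
```

Changing the single declaration would change every occurrence of the reference in the document, which is what makes the mechanism useful across systems and character sets.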

The fifth feature of SGML/XML stems from its use of a basic character set to describe elements, in combination with the use of entity references to represent characters outside this basic character set. The effect of these is to make SGML/XML documents platform independent: that is, to ensure that SGML/XML documents will appear exactly the same regardless of the computer hardware and software in use. A concrete example of this is the world wide web: the ‘HTML’ (Hypertext Markup Language) used by the web is an instance of SGML/XML, and employs these SGML/XML techniques to ensure that HTML documents look and behave identically whether viewed on a Macintosh or Windows machine, and whether viewed by Internet Explorer or Netscape.

The final feature of SGML which fits it for our use is that it is an international standard. Thus, it is defined and maintained by an International Standards Organisation committee, as ISO 8879: 1986. This definition of SGML as an international standard ensures the stability and consistency of SGML. Because SGML is an international standard and because it is system-independent (as explained above) documents encoded in SGML may be considered as an archival medium. This is obviously attractive in the context of manuscript descriptions: the objects described are often centuries, even millennia, old, and one could expect descriptions of these to be useful for decades, at least. (XML, as distinct from SGML, is not yet an ISO standard).

It is often objected that SGML/XML is very verbose: the ‘text’ of the document disappears in a forest of tags, and it is consequently difficult to make and to read the SGML/XML. There are several answers to this. Firstly, there is efficient editing software available which assists the creation of the SGML/XML by various techniques (macros, automated conversion from optical character recognition input or from typesetting files, and the like). Secondly, there are well-developed software readers and display systems which permit the reader to see the text without all the SGML/XML encoding, and indeed use the SGML/XML (as it should be used) to create attractive and highly-searchable realisations of the text. The widespread acceptance of SGML/XML encoding in the commercial, government, and academic worlds means that there are a very great many such systems, both as freeware and as commercial applications.

However, it is true that the open nature of SGML/XML means that considerable adaptation of the software might be needed to tune it for a particular task, such as making and reading of machine-readable manuscript descriptions. Accordingly, several of the MASTER workpackages were devoted to preparing software for implementation of the standard. In particular, the MASTER partners prepared an implementation of the easy-to-use Windows text editor NoteTab permitting efficient making of manuscript records.

Design principles for these proposals

The massive advantage gained by unification of many different sets of data within a single system, permitting the development of efficient common tools for the record maker and efficient retrieval for the record user, makes standard development worthwhile. For this, a standard encoding of the many sets of data is required. Accordingly, a standard system of encoding which permits encoding of one set of data, in one way alone, and which cannot be used for different sets of data from the same intellectual domain, is no standard at all. However, a standard which tries to do all things for all sets of data may end by serving none of them well, and so again is no standard at all.

It is the tension between these extremes which makes standard design difficult. The solution of the TEI, from the beginning, has been to create standards which are ‘broad churches’: which would accommodate within themselves many different kinds of records, from the most simple to the most complex. The case for this is particularly cogent for manuscript descriptions, where each country, each intellectual domain, even each cataloguer for each different cataloguing project, has developed a distinct cataloguing style. We have therefore sought, above all, for flexibility and for perfectibility in these proposals. For flexibility: these proposals permit, on the one hand, the making of simple manuscript inventories which contain no more than a list of manuscript identifiers, to which may be added just a few words of description of each manuscript, or an image of the manuscript. On the other hand, the proposals also aim to enable the most highly-formalised descriptions, with elaborate structural mark-up distinguishing the various elements within the description and containing complete manuscript transcriptions and complete digital facsimiles of the manuscript. The distinction is not just of length. In this model, a lengthy description might be just a manuscript identifier accompanied by a series of unstructured paragraphs (that is: plain paragraph elements with no encoding within them discriminating their content); or a short description might be highly structured. For perfectibility: it is inherent in this design that one might begin by making a simple inventory, and elaborate this by progressive addition of information. Or, one might begin by importing information from an existing manuscript database into unstructured paragraphs, which later workers might reformulate to distinguish statements of date, place, provenance, or description.

Accordingly, these proposals do not offer rigid definitions of what might constitute a 'short' or 'first-level' record, as against a 'long' or 'full' record. Indeed, these proposals mandate one thing, and one thing only: that a manuscript description (<msDescription>) must contain a manuscript identifier (<msIdentifier>). Beyond this, the proposals are not of themselves prescriptive. We suggest that when assessing these proposals you ask, first: do they permit you to make the manuscript descriptions which you would like to make? If the answer is 'yes', then they will serve; if the answer is 'no', the proposers would like to know of the deficiency. We suggest you ask, secondly: do the names of the elements proposed and the relationship of the elements to each other seem transparent, logical and elegant? We have sought overall for a lucid naming and structuring of the descriptive element, so that manuscript experts and users would say, instantly: 'this feels right'. If something seems illogical or misnamed, and you have a concrete suggestion as to how it might be improved, we should like to hear about it.
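By way of illustration, the barest record permitted under this rule might look like the sketch below: a <msDescription> holding nothing but a <msIdentifier>. The sub-elements of <msIdentifier> follow the fuller example given later in this document; everything else is simply left out, to be added later if wanted.

```xml
<!-- the smallest possible record: an identifier and nothing else -->
<msDescription>
  <msIdentifier>
    <country>Great Britain</country>
    <settlement>Oxford</settlement>
    <repository>Corpus Christi College</repository>
    <idNo>MS 198</idNo>
  </msIdentifier>
</msDescription>
```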

A design principle which emerged in the often-intense discussions surrounding these proposals concerned their architecture. Should we aim for an architecture which would permit the cataloguer to say almost anything, in almost any way imaginable? Or should we aim for a more formal prescription, which declares: if you want to say something about the binding of the manuscript (for example) you can only say it in a <binding> element within a <physDesc> element. The advantage of the first approach is that it permits the cataloguer near complete freedom; its danger is that the heterogeneity of descriptions might be at the cost of efficient retrieval based on predictable use of agreed descriptors: the prime justification for making the computer-readable records in the first place. The disadvantage of the second approach is that over-rigid formalism would lead to frustration among cataloguers, and (rather quickly) refusal to adopt the standard.

The solution which emerged was this: we would do both. We would allow at various points within the description paragraph (<p>) elements which would allow the cataloguer to say almost anything they wish, but in a relatively informal and unstructured manner. But we would also provide formal and precisely-defined elements for distinct and identifiable manuscript phenomena. Thus: a cataloguer is free to speak of any aspect of the manuscript binding, as it bears upon the history or intellectual content of the manuscript, within the <p> elements provided within the <history> and <msContents> elements. But if you want to make a formally-structured statement about the binding (its material, its date) you must do this within the <binding> element provided within the <physDesc> element. The alternative, to permit the encoder to use the <binding> element at any point, would likely lead to cataloguers feeling obliged to surround every reference to binding with the <binding> tag, regardless of context and content. This would encourage a superfluity of effort which would lead to so many different kinds of information being contained in <binding> as to render the element semantically ill-defined and therefore useless.
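As a rough illustration of this two-track design, the sketch below places an informal remark touching on the binding in a <p> within <history>, and reserves the formal statement for <binding> within <physDesc>. The bracketed text is a placeholder rather than real data, and the use of a <p> inside <binding> is an assumption made for this sketch, not something the account above confirms.

```xml
<msDescription>
  <msIdentifier>
    <settlement>Oxford</settlement>
    <repository>Corpus Christi College</repository>
    <idNo>MS 198</idNo>
  </msIdentifier>
  <physDesc>
    <!-- formal, retrievable statement about the binding: here and only here -->
    <binding><p>[material and date of the binding]</p></binding>
  </physDesc>
  <history>
    <!-- informal prose may mention the binding where it bears on the history -->
    <p>Quires [14, 15, and 28] were disordered in the previous binding.</p>
  </history>
</msDescription>
```

A search confined to <binding> would then retrieve only the formal statement, while the passing mention in <history> remains ordinary prose.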

The danger of an open standard, such as this is designed to be, is that it may be misused. We do not encourage, for example, manuscript descriptions which consist of no more than an identifier and a lengthy prose description with no formal distinctions through markup of statements of date, origin, provenance, and the like. However, the standard itself cannot be used to prescribe that descriptions must conform to this or that model: it can only be used to enable the various models.

These proposals seek to create a framework which can accommodate what we know of existing standards, and to enable (in time) greater precision and more efficient retrieval of detail in cataloguing. It is the task of various domain experts to recommend how the standard should be applied in their area. The proposers will begin this process, in the documentation which here follows and which will accompany the completed standard, but it will be the responsibility of you, the users of the standard, to further best practice.

Finally, because the background and interests of all the participants in the primary groups involved in preparation of these proposals (EAMMS, Digital Scriptorium, MASTER, and the TEI workgroup) are in western European medieval manuscripts, these proposals have been prepared only with those manuscripts in mind. However, it is intended that these should in time be generalisable, at some level or other, to other manuscripts from different cultures, and we welcome reaction from experts in other manuscript areas.

The TEI/MASTER DTD

We propose that a new TEI element be created, <msDescription>, and that each <msDescription> element may contain one, and only one, description of a manuscript. A composite manuscript is regarded as a single manuscript, with the distinct manuscripts which it contains each being held in distinct <msPart> elements, contained within the <msDescription> element.

A single <msDescription> may contain the following elements:

<msIdentifier>: this contains the identifier of the manuscript: where it is, and its exact shelfmark or other identifier. This element is mandatory.

<msHeading>: this can be used to give basic information about the manuscript, such as might appear in a summary catalogue (for example: author, title of work, date, place, language).

<msContents>: this contains a description of the intellectual content of the manuscript, either as a series of paragraphs or structured into defined sub-elements.

<physDesc>: this contains a description of the physical aspect of the manuscript, structured into defined sub-elements.

<history>: this contains a history of the manuscript, structured into defined sub-elements.

<additional>: this contains additional information related to the manuscript but not part of its formal description (e.g.: bibliography; availability of facsimiles or images; curatorial information).

<msPart>: used for composite manuscripts. It may contain all the elements listed above except for <summary> (<summary> is intended for description of the entire manuscript alone). This element is required when dealing with composite manuscripts.
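The list above could be summarised in a content model along the following lines. This is a hypothetical, simplified sketch and not the published MASTER DTD: the order of the children and the use of ANY for their content are assumptions made to keep the example short. It shows only the one firm rule stated here, that <msIdentifier> is required and everything else is optional.

```xml
<?xml version="1.0"?>
<!DOCTYPE msDescription [
  <!-- Hypothetical, simplified content model: NOT the published MASTER DTD. -->
  <!-- msIdentifier is the only required child; the others may be omitted. -->
  <!ELEMENT msDescription (msIdentifier, msHeading?, msContents?, physDesc?,
                           history?, additional?, msPart*)>
  <!-- Children are left open (ANY) to keep the sketch short. -->
  <!ELEMENT msIdentifier ANY>
  <!ELEMENT msHeading    ANY>
  <!ELEMENT msContents   ANY>
  <!ELEMENT physDesc     ANY>
  <!ELEMENT history      ANY>
  <!ELEMENT additional   ANY>
  <!ELEMENT msPart       ANY>
]>
<msDescription>
  <!-- plain text here only because the sketch declares no sub-elements -->
  <msIdentifier>Oxford, Corpus Christi College, MS 198</msIdentifier>
</msDescription>
```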

Here is an example of a rather simple manuscript description:

Oxford, Corpus Christi College, MS 198
Geoffrey Chaucer The Canterbury Tales. c. 1400
Folios 1r-266v. The Canterbury Tales. A274-I290. Defective at beginning and end.
Parchment, trimmed. 33.5 x 22.5 cm. Quires [14, 15, and 28] were disordered in the previous binding. They have been reordered and refoliated, with the old foliation being the uppermost. Two consecutive folios are numbered '64a' and '64'
Written by the scribe identified by Doyle and Parkes as 'Hand d'
Dated c. 1400 (personal communication, Malcolm Parkes). On fol. 146r is the name 'Burle' in drypoint, in the margin next to E1396. Cp came to the College as a bequest of William Fulman, according to a note on fol. 1r : 'Liber C.C.C.Oxon Ex dono Gulielmi Fulman A.M. hujus Collegii quondam socius.'

In a summary catalogue, one could encode it using the MASTER DTD as follows, stating only the place, shelfmark, and basic information:

       <msDescription>
       <msIdentifier n="1">
<country reg="GB">Great Britain</country>
<settlement>Oxford</settlement>
<repository>Corpus Christi College</repository>
<idNo>MS 198</idNo>
</msIdentifier>
       <msHeading>
<title>The Canterbury Tales</title>
<author>Geoffrey Chaucer</author>
<origPlace>?</origPlace>
<origDate notBefore="1395" notAfter="1420">c. 1400</origDate>
<textLang langKey="ENM">Middle English</textLang>
</msHeading>
       </msDescription>

Observe the possibilities for precise searches which this encoding allows. We could find this manuscript as dated between 1395 and 1420; as being written in Middle English (with language key "ENM"), and more.

Or, one could encode all the information in the description thus:

       <msDescription>
       <msIdentifier n="1">
<country reg="GB">Great Britain</country>
<settlement>Oxford</settlement>
<repository>Corpus Christi College</repository>
<idNo>MS 198</idNo>
</msIdentifier>
       <msHeading>
<title>The Canterbury Tales</title>
<author>Geoffrey Chaucer</author>
<origPlace>?</origPlace>
<origDate notBefore="1395" notAfter="1420">c. 1400</origDate>
<textLang langKey="ENM">Middle English</textLang>
</msHeading>
       <msContents>
<msItem n="1" defective="yes">
<locus from="1" to="266">Folios 1r-266v</locus>
<title type="uniform">The Canterbury Tales</title>
<bibl><biblScope>A274-I290</biblScope></bibl>
<note>Defective at beginning and end</note>
</msItem>
</msContents>
       <physDesc>
<form><p>Codex.</p></form>
<support><p>Parchment, trimmed.</p></support>
<extent>266.<dimensions type="leaf" scope="all"><height>33.5</height><width>22.5</width></dimensions></extent>
<collation><p>Quires [14, 15, and 28] were disordered in the previous binding. They have been reordered and refoliated with the old foliation being the uppermost. Two consecutive folios are numbered '64a' and '64'</p></collation>
<msWriting hands="1">
<handDesc scribe="Hand D (Doyle/Parkes)" script="Anglicana" medium="ink" scope="sole"><p>Written by the scribe identified by Doyle and Parkes as '<name type="person" role="scribe" key="DPhandD">Hand d</name>'</p></handDesc>
</msWriting>
</physDesc>
       <history>
<origin notBefore="1395" notAfter="1420" certainty="high" evidence="conjecture"><p>Dated c. <origDate>1400</origDate> (personal communication, Malcolm Parkes).</p></origin>
<provenance><p>On fol. 146r is the name 'Burle' in drypoint, in the margin next to E1396. </p></provenance>
<acquisition><p>Cp came to the College as a bequest of William Fulman, according to a note on fol. 1r : <q>'Liber C.C.C. Oxon Ex dono Gulielmi Fulman A.M. hujus Collegii quondam socius.'</q></p></acquisition>
</history>
       </msDescription>
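The kind of precise search noted above for the summary record (by date range and by language key) could be expressed in many ways. One possibility, offered only as a sketch and not as part of the MASTER software, is a short XSLT 1.0 stylesheet. It assumes a file holding any number of <msDescription> elements encoded as in the examples above; the parameter names ‘earliest’ and ‘latest’ are invented for the illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- hypothetical search window; adjust as needed -->
  <xsl:param name="earliest" select="1395"/>
  <xsl:param name="latest" select="1420"/>

  <xsl:template match="/">
    <!-- keep descriptions in Middle English dated wholly within the window -->
    <xsl:for-each select="//msDescription[
        msHeading/textLang/@langKey = 'ENM' and
        msHeading/origDate/@notBefore &gt;= $earliest and
        msHeading/origDate/@notAfter &lt;= $latest]">
      <!-- print place, repository and shelfmark, one manuscript per line -->
      <xsl:value-of select="msIdentifier/settlement"/>
      <xsl:text>, </xsl:text>
      <xsl:value-of select="msIdentifier/repository"/>
      <xsl:text>, </xsl:text>
      <xsl:value-of select="msIdentifier/idNo"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Run against either encoding of Corpus Christi College MS 198 given above, a stylesheet of this kind should list that manuscript, since its notBefore and notAfter values fall within the window and its language key is "ENM".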

A full account of how these descriptions can be made, with further discussion of the elements within these descriptions, is given in the account of the customized text-editor prepared by the MASTER partners.

Next steps . . .

The MASTER project and its partners have put together various resources to help make it easier for cataloguers to learn how to make manuscript descriptions according to the encoding we have developed. The resources page of the MASTER site gives a description of these resources. These include: