Words that perhaps should be in <gi> or <ident>

These tables list the 15,996 words (in 9,287 text nodes) that match the @ident of some *Spec.
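As a rough illustration of the kind of scan that could produce such counts, here is a minimal Python sketch. It is not the process that generated these tables. The tokenizing rule (`WORD`), the function name `count_matches`, and the decision to skip text already inside <gi> or <ident> are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed tokenizer: a letter followed by letters, digits, '.' or '_'.
# The rule the report actually uses is not stated in the source.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def count_matches(xml_text, idents, skip=("gi", "ident")):
    """Return (matching_words, text_nodes_with_a_match).

    Text directly inside elements named in `skip` is ignored, on the
    assumption that it is already tagged. Nodes are not visited in
    strict document order, which does not affect the counts.
    """
    root = ET.fromstring(xml_text)
    words = nodes = 0
    for elem in root.iter():
        # The tail follows the element, so it belongs to the parent's text.
        chunks = [elem.tail]
        if local_name(elem.tag) not in skip:
            chunks.append(elem.text)
        for chunk in chunks:
            if not chunk:
                continue
            # Drop a sentence-final '.' before comparing with the idents.
            hits = sum(1 for w in WORD.findall(chunk) if w.rstrip(".") in idents)
            if hits:
                words += hits
                nodes += 1
    return words, nodes
```

Called with the text of one specification file and the set of all @ident values, this returns a pair comparable in kind to the totals quoted above.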

The first column is simply position(), included as a reference so that any given table can be re-sorted back into its original order. The second column is either the @ident of the *Spec or the @xml:id of the closest ancestor. The links often do not work: I do not know how to generate a proper link consistently, and it is not easy to test whether the target document is available, because doc-available() always fails because of the about:legacy-compat.
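To show how the position column lets a table be restored after re-sorting, here is a small Python sketch. It assumes each row is flattened to "position ident text" separated by single spaces, as in the tables below. The names `Row`, `parse_row`, and `restore_order` are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    position: int  # value of position() in the original table
    ident: str     # @ident of the *Spec, or closest ancestor @xml:id
    text: str      # the matching text node


def parse_row(line):
    """Split a flattened row into its three columns (assumed layout)."""
    pos, ident, text = line.split(" ", 2)
    return Row(int(pos), ident, text)


def restore_order(rows):
    """Return the rows in their original order, using the first column."""
    return sorted(rows, key=lambda r: r.position)


rows = [
    parse_row("13 pubPlace contains the name of the place where a bibliographic item was published."),
    parse_row("2 pubPlace publication place"),
]
by_text = sorted(rows, key=lambda r: r.text)  # some other ordering
original = restore_order(by_text)             # back to position() order
```

Sorting on the integer value rather than the string matters here, since "13" sorts before "2" as text.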

specifications (i.e., https://svn.code.sf.net/p/tei/code/trunk/P5/Source/Specs/)

factuality.xml#13000

# id text
4 factuality describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world.
27 factuality categorizes the factuality of the text.
46 factuality the text is to be regarded as entirely imaginative
62 factuality the text is to be regarded as entirely informative or factual
78 factuality the text contains a mixture of fact and fiction
94 factuality the fiction/fact distinction is not regarded as helpful or appropriate to this text
147 factuality Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose
149 factuality For many literary texts, a simple binary opposition between
155 factuality are in any sense

collection.xml#13000

# id text
4 collection contains the name of a collection of manuscripts, not necessarily located within a single repository.

damage.xml#13000

# id text
4 damage contains an area of damage to the text witness.
40 damage Since damage to text witnesses frequently makes them harder to read, the
46 damage attribute may be used to group together several related

pubPlace.xml#13000

# id text
2 pubPlace publication place
13 pubPlace contains the name of the place where a bibliographic item was published.

cond.xml#13000

# id text
2 cond conditional feature-structure constraint
14 cond defines a conditional feature-structure constraint; the consequent and the antecedent are specified as feature structures or feature-structure collections; the constraint is satisfied if both the antecedent and the consequent subsume a given feature structure, or if the antecedent does not.

classRef.xml#13000

# id text
16 classRef the identifier used for the required class within the source indicated.
23 classRef indicates how references to this class within a content model should be interpreted.
31 classRef a single occurrence of all members of the class may appear in sequence
35 classRef a single occurrence of one or more members of the class may appear in sequence
43 classRef one or more occurrences of all members of the class may appear in sequence
52 classRef c
53 classRef , then a reference to the class within a content model is understood as being a reference to
55 classRef when
57 classRef has the value
61 classRef when it has the value
62 classRef sequence
65 classRef when it has the value
67 classRef ; to (a*,b*, c*) when it has the value
69 classRef ; or to (a+,b+,c+) when it has the value
77 classRef supplies a list of class members which are to be included in the schema being defined.
84 classRef supplies a list of class members which are to be excluded from the schema being defined.
105 classRef Attribute and model classes are identified by the name supplied as value for the
109 classRef element in which they are declared. All TEI names are unique; attribute class names conventionally begin with the letters

event.xml#13012

# id text
4 event contains data relating to any kind of significant event associated with a person, place, or organization.
60 event indicates the location of an event by pointing to a

model.pLike.front.xml#13000

# id text
2 model.pLike.front groups paragraph-like elements which can occur as direct constituents of front matter.

principal.xml#13000

# id text
2 principal principal researcher
16 principal supplies the name of the principal researcher responsible for the creation of an electronic text.

biblScope.xml#13095

# id text
14 biblScope defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.
79 biblScope . For example, if the citation has

gap.xml#13012

# id text
4 gap indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, invisible, or inaudible.
125 gap in the case of text omitted from the transcription because of deliberate deletion by an identifiable hand, indicates the hand which made the deletion.
144 gap in the case of text omitted because of damage, categorizes the cause of the damage, if it can be identified.
163 gap damage results from rubbing of the leaf edges
179 gap damage results from mildew on the leaf surface
195 gap damage results from smoke
262 gap core tag elements may be closely allied in use with the
266 gap elements, available when using the additional tagset for transcription of primary sources. See section
271 gap tag simply signals the editor's decision to omit or inability to transcribe a span of text. Other information, such as the interpretation that text was deliberately erased or covered, should be indicated using the relevant tags, such as
273 gap in the case of deliberate deletion.

surrogates.xml#13000

# id text
4 surrogates contains information about any representations of the manuscript being described which may exist in the holding institution or elsewhere.

head.xml#13000

# id text
14 head contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc.
52 head may be rather longer than usual in modern works. If a section has an explicit ending as well as a heading, it should be marked as a
165 head element is used for headings at all levels; software which treats (e.g.) chapter headings, section headings, and list titles differently must determine the proper processing of a
169 head occurring as the first element of a list is the title of that list; one occurring as the first element of a
171 head is the title of that chapter or section.

stress.xml#13000

# id text
4 stress contains the stress pattern for a dictionary headword, if given separately.
36 stress Usually stress information is included within pronunciation information.

listForest.xml#13000

# id text
15 listForest identifies the type of the forest group.

eLeaf.xml#13000

# id text
2 eLeaf leaf or terminal node of an embedding tree
14 eLeaf provides explicitly for a leaf of an embedding tree, which may also be encoded with the eTree element.
48 eLeaf indicates the value of an embedding leaf, which is a feature structure or other analytic element.
86 eLeaf tag may be used if the encoder does not wish to distinguish by name between nonleaf and leaf nodes in embedding trees; they are distinguished by their arrangement.

att.declaring.xml#13000

# id text
2 att.declaring provides attributes for elements which may be independently associated with a particular declarable element within the header, thus overriding the inherited default for that element.
50 att.declaring The rules governing the association of declarable elements with individual parts of a TEI text are fully defined in chapter

authority.xml#13000

# id text
2 authority release authority
16 authority supplies the name of a person or other agency responsible for making a work available, other than a publisher or distributor.

undo.xml#13000

# id text
35 undo This encoding represents the following sequence of events:
37 undo At stage s2, "just some sample text, we need" is deleted by overstriking, and "not" is added
38 undo At stage s3, parts of the deletion are cancelled by underdotting, thus reinstating the words "just some" and "text".

entryFree.xml#13000

# id text
2 entryFree unstructured entry
13 entryFree contains a single unstructured entry in any kind of lexical resource, such as a dictionary or lexicon.

alternate.xml#13056

# id text
14 alternate The alternate element must have at least two child elements
26 alternate This example content model permits either a

climate.xml#13242

# id text
4 climate contains information about the physical climate of a place.

when.xml#13000

# id text
2 when indicates a point in time either relative to other elements in the same timeline tag, or absolutely.
28 when supplies an absolute value for the time.
75 when specifies the unit of time in which the
77 when value is expressed, if this is not inherited from the parent
172 when specifies a time interval either as a number or as one of the keywords defined by the datatype data.interval
191 when identifies the reference point for determining the time of the current
193 when element, which is obtained by adding the interval to the time of the reference point.
227 when . If no value is supplied, and the
229 when attribute is also unspecified, then the reference point is understood to be the origin of the enclosing
272 when attribute must be supplied to specify an identifier for this point in time. The value used may be chosen freely provided that it is unique within the document and is a syntactically valid name. There is no requirement for values containing numbers to be in sequence.

titlePage.xml#13000

# id text
2 titlePage title page
16 titlePage contains the title page of a text, appearing within the front or back matter.
54 titlePage classifies the title page according to any convenient typology.
74 titlePage This attribute allows the same element to be used for volume title pages, series title pages, etc., as well as for the
76 titlePage title page of a work.

substJoin.xml#13000

# id text
2 substJoin substitution join
6 substJoin identifies a series of possibly fragmented additions, deletions or other revisions on a manuscript that combine to make up a single intervention in the text

stage.xml#13000

# id text
2 stage stage direction
14 stage contains any kind of stage direction within a dramatic text or fragment.
39 stage indicates the kind of stage direction.
106 stage describes stage business.
122 stage is a narrative, motivating stage direction.
303 stage attribute may be used to indicate more precisely the person or persons participating in the action described by the stage direction.

data.truthValue.xml#13000

# id text
20 data.truthValue The possible values of this datatype are
30 data.truthValue This datatype applies only for cases where uncertainty is inappropriate; if the attribute concerned may have a value other than true or false, e.g.

model.headLike.xml#13000

# id text
2 model.headLike groups elements used to provide a title or heading at the start of a text division.

mood.xml#13000

# id text
4 mood contains information about the grammatical mood of verbs (e.g. indicative, subjunctive, imperative).
88 mood gram type="mood"

dimensions.xml#13000

# id text
68 dimensions dimensions relate to one or more leaves (e.g. a single leaf, a gathering, or a separately bound part)
84 dimensions dimensions relate to the area of a leaf which has been ruled in preparation for writing.
100 dimensions dimensions relate to the area of a leaf which has been pricked out in preparation for ruling (used where this differs significantly from the ruled area, or where the ruling is not measurable).
116 dimensions dimensions relate to the area of a leaf which has been written, with the height measured from the top of the minims on the top line of writing, to the bottom of the minims on the bottom line of writing.
132 dimensions dimensions relate to the miniatures within the manuscript
148 dimensions dimensions relate to the binding in which the codex or manuscript is contained
164 dimensions dimensions relate to the box or other container in which the manuscript is stored.
241 dimensions This element may be used to record the dimensions of any text-bearing object, not necessarily a codex. For example:
257 dimensions When simple numeric quantities are involved, they may be expressed on the
278 dimensions Contains no more than one of each of the specialized elements used to express a three-dimensional object's height, width, and depth, combined with any number of other kinds of dimensional specification.

damageSpan.xml#13000

# id text
2 damageSpan damaged span of text
12 damageSpan marks the beginning of a longer sequence of text which is damaged in some way but still legible.
85 damageSpan Both the beginning and ending of the damaged sequence must be marked: the beginning by the
89 damageSpan attribute: if no other element available, the
93 damageSpan The damaged text must be at least partially legible, in order for the encoder to be able to transcribe it. If it is not legible at all, the
99 damageSpan element should be employed, with the value of the

lbl.xml#13000

# id text
2 lbl label
14 lbl contains a label for a form, example, translation, or other piece of information, e.g. abbreviation for, contraction of, literally, approximately, synonyms:, etc.
39 lbl classifies the label using any convenient typology.

model.lLike.xml#13000

# id text
2 model.lLike groups elements representing metrical components such as verse lines.

reg.xml#13000

# id text
41 reg If all that is desired is to call attention to the fact that the copy text has been regularized,

xr.xml#13000

# id text
14 xr contains a phrase, sentence, or icon referring the reader to some other location in this or another text.
130 xr related or similar term
316 xr This element encloses both the actual indication of the location referred to, which may be tagged using the
320 xr elements, and any accompanying material which gives more information about why the reader is being referred there.

att.datable.iso.xml#13000

# id text
2 att.datable.iso provides attributes for normalization of elements that contain datable events using the ISO 8601 standard.
19 att.datable.iso supplies the value of a date or time in a standard form.
35 att.datable.iso The following are examples of ISO date, time, and date & time formats that are
125 att.datable.iso is a valid time with respect to the W3C
133 att.datable.iso specifies the earliest possible date for the event in standard form, e.g. yyyy-mm-dd.
152 att.datable.iso specifies the latest possible date for the event in standard form, e.g. yyyy-mm-dd.
211 att.datable.iso The value of these attributes should be a normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by ISO 8601, using the Gregorian calendar.
239 att.datable.iso are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. That is,
240 att.datable.iso indicates the same time period as
245 att.datable.iso form, no claim is made that the form in the source text is incorrect; the regularized form is simply that chosen as the main form for purposes of unifying variant forms under a single heading.

triangle.xml#13000

# id text
2 triangle underspecified embedding tree, so called because of its characteristic shape when drawn
14 triangle provides for an underspecified eTree, that is, an eTree with information left out.
51 triangle supplies a value for the triangle, in the form of the identifier of a feature structure or other analytic element.
95 triangle An optional label followed by zero or more embedding trees, triangles, or embedding leafs.

foreign.xml#13012

# id text
12 foreign identifies a word or phrase as belonging to some language other than that of the surrounding text.
61 foreign attribute should be supplied for this element to identify the language of the word or phrase marked. As elsewhere, its value should be a language tag as defined in
66 foreign attribute should be used in preference to this element where it is intended to mark the language of the whole of some text element.

code.xml#13000

# id text
2 code contains literal code from some formal language such as a programming language.
25 code formal language
35 code a name identifying the formal language in which the code is expressed

anchor.xml#13000

# id text
2 anchor anchor point
69 anchor attribute must be supplied to specify an identifier for the point at which this element occurs within a document. The value used may be chosen freely provided that it is unique within the document and is a syntactically valid name. There is no requirement for values containing numbers to be in sequence.

rendition.xml#13000

# id text
4 rendition supplies information about the rendition or appearance of one or more elements in the source text.
38 rendition styling applies to the first line of the target element
46 rendition styling should be applied immediately before the content of the target element
50 rendition styling should be applied immediately after the content of the target element
71 rendition The present release of these Guidelines does not specify the content of this element in any further detail. It may be used to hold a description of the default rendition to be associated with the specified element, expressed in running prose, or in some more formal language such as CSS.

list.xml#13046

# id text
4 list contains any sequence of items organized as a list.
88 list The content of a "gloss" list should include a sequence of one or more pairs of a label element followed by an item element
103 list each list item glosses some term or concept, which is given by a label element preceding the list item.
121 list each list item is an entry in an index such as the alphabetical topical index at the back of a print volume.
125 list each list item is a step in a sequence of instructions, as in a recipe.
129 list each list item is one of a sequence of petitions, supplications or invocations, typically in a religious ritual.
133 list each list item is part of an argument consisting of two or more propositions and a final conclusion derived from them.
142 list to encode the rendering or appearance of a list (whether it was bulleted, numbered, etc.). The current recommendation is to use the
148 list for the more appropriate task of characterizing the nature of the content of a list.
155 list list type="gloss"
336 list The following example treats the short numbered clauses of Anglo-Saxon legal codes as lists of items. The text is from an ordinance of King Athelstan (924–939):
366 list Note that nested lists have been used so the tagging mirrors the structure indicated by the two-level numbering of the clauses. The clauses could have been treated as a one-level list with irregular numbering, if desired.
385 list May contain an optional heading followed by a series of items, or a series of label and item pairs, the latter being optionally preceded by one or two specialized headings.

interleave.xml#

# id text
22 interleave This example content model permits either a

docTitle.xml#13000

# id text
2 docTitle document title
16 docTitle contains the title of a document, including all its constituents, as given on a title page.

num.xml#13000

# id text
2 num number
38 num indicates the type of numeric value.
135 num supplies the value of the number in standard form.
152 num a numeric value.
157 num The standard form used is defined by the TEI datatype data.numeric.
211 num Detailed analyses of quantities and units of measure in historical documents may also use the feature structure mechanism described in chapter

orig.xml#13092

# id text
2 orig original form
119 orig will be combined with a regularized form within a

transpose.xml#13000

# id text
2 transpose describes a single textual transposition as an ordered list of at least two pointers specifying the order in which the elements indicated should be re-combined.
30 transpose Transposition is usually indicated in a document by a metamark such as a wavy line or numbering.

catDesc.xml#13000

# id text
2 catDesc category description
16 catDesc describes some category within a taxonomy or text typology, either in the form of a brief prose description or in terms of the situational parameters used by the TEI formal textDesc.

item.xml#13000

# id text
85 item May contain simple prose or a sequence of chunks.
87 item Whatever string of characters is used to label a list item in the copy text may be used as the value of the global
95 item element to record the enumerator of the list item. In glossary lists, however, the term being defined should be given with the

district.xml#13000

# id text
4 district contains the name of any kind of subdivision of a settlement, such as a parish, ward, or other administrative or geographic unit.

postCode.xml#13000

# id text
2 postCode postal code
14 postCode contains a numerical or alphanumeric code used as part of a postal address to simplify sorting or delivery of mail.
72 postCode The position and nature of postal codes is highly country-specific; the conventions appropriate to the country concerned should be used.

fs.xml#13000

# id text
16 fs , that is, a collection of feature-value pairs organized as a structural unit.

model.teiHeaderPart.xml#13000

# id text
2 model.teiHeaderPart groups high level elements which may appear more than once in a TEI header.

att.pointing.xml#13229

# id text
2 att.pointing defines a set of attributes used by all elements which point to other elements by means of one or more URI references.
18 att.pointing specifies the language of the content to be found at the destination referenced by
21 att.pointing language tag
33 att.pointing if @target is specified.
52 att.pointing The value must conform to BCP 47. If the value is a private use code (i.e., starts with
58 att.pointing element with a matching value for its
60 att.pointing attribute should be supplied in the TEI header to document this value. Such documentation may also optionally be supplied for non-private-use codes, though these must remain consistent with their
96 att.pointing specifies the intended meaning when the target of a pointer is itself a pointer.
115 att.pointing if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found which is not a pointer.
131 att.pointing if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer.
164 att.pointing If no value is given, the application program is responsible for deciding (possibly on the basis of user input) how far to trace a chain of pointers.

att.naming.xml#13000

# id text
21 att.naming may be used to specify further information about the entity referenced by this name in the form of a set of whitespace-separated values, for example the occupation of a person, or the status of a place.
28 att.naming reference to the canonical name
38 att.naming provides a means of locating the canonical form (
39 att.naming nym
64 att.naming The value must point directly to one or more XML elements by means of one or more URIs, separated by whitespace. If more than one is supplied, the implication is that the name is associated with several distinct canonical names.

form.xml#13000

# id text
2 form form information group
14 form groups all the information on the written and spoken forms of one headword.
49 form classifies form as simple, compound, etc.
68 form single free lexical item
100 form a variant form
148 form word in other than usual dictionary form
164 form multiple-word lexical item

usg.xml#13012

# id text
101 usg domain
111 usg domain or subject matter (e.g. scientific, literary etc.)
181 usg language
191 usg name of a language mentioned in etymological or other linguistic discussion.
405 usg unclassifiable piece of information to guide sense choice

notesStmt.xml#13000

# id text
16 notesStmt collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description.

langKnown.xml#13000

# id text
2 langKnown language known
14 langKnown summarizes the state of a person's linguistic competence, i.e., knowledge of a single language.
38 langKnown supplies a valid language tag for the language concerned.
56 langKnown The value for this attribute should be a language
57 langKnown tag
79 langKnown a code indicating the person's level of knowledge for this language

att.placement.xml#13000

# id text
2 att.placement provides attributes for describing where on the source page or object a textual element appears.
18 att.placement specifies where this item is placed
27 att.placement below the line
97 att.placement on the other side of the leaf
113 att.placement above the line
145 att.placement within the body of the text.

adminInfo.xml#13000

# id text
14 adminInfo contains information about the present custody and availability of the manuscript, and also about the record description itself.

att.handFeatures.xml#13000

# id text
16 att.handFeatures gives a name or other identifier for the scribe believed to be responsible for this hand.
34 att.handFeatures points to a full description of the scribe concerned, typically supplied by a
43 att.handFeatures characterizes the particular script or writing style used by this hand, for example
105 att.handFeatures points to a full description of the script or writing style used by this hand, typically supplied by a
116 att.handFeatures , or other writing medium, e.g.

location.xml#13242

# id text
4 location defines the location of a place as a set of geographical coordinates, in terms of other named geo-political entities, or as an address.

listOrg.xml#13000

# id text
2 listOrg list of organizations
12 listOrg contains a list of elements, each of which provides information about an identifiable organization.
80 listOrg The type attribute may be used to distinguish lists of organizations of a particular type if convenient.

text.xml#13000

# id text
4 text contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
156 text The body of a text may be replaced by a group of nested texts, as in the following schematic:
175 text This element should not be used to represent a text which is inserted at an arbitrary point within the structure of another, for example as in an embedded or quoted narrative; the

metDecl.xml#13000

# id text
13 metDecl documents the notation employed to represent a metrical pattern when this is specified as the value of a
19 metDecl attribute on any structural element of a metrical text (e.g.
145 metDecl indicates whether the notation conveys the abstract metrical form, its actual prosodic realization, or the rhyme scheme, or some combination thereof.
188 metDecl declaration applies to the abstract metrical form recorded on the
266 metDecl declaration applies to the rhyme scheme recorded on the
294 metDecl element documents the notation used for metrical pattern and realization. It may also be used to document the notation used for rhyme scheme information; if not otherwise documented, the rhyme scheme notation defaults to the traditional
327 metDecl specifies a regular expression defining any value that is legal for this notation.
346 metDecl The value must be a valid regular expression per the World Wide Web Consortium's
370 metDecl This example is intended for the far more restricted case typified by the Shakespearean iambic pentameter. Only metrical patterns containing exactly ten syllables, alternately stressed and unstressed, (except for the first two which may be in either order) to each metrical line can be expressed using this notation.
405 metDecl may contain either a sequence of
407 metDecl elements or, alternately, a series of paragraphs or other components. If the
411 metDecl elements are used, then all the codes appearing within the
415 metDecl Only usable within the header if the verse module is used.

metSym.xml#13000

# id text
2 metSym metrical notation symbol
13 metSym documents the intended significance of a particular character or character sequence within a metrical notation, either explicitly or in terms of other symbol elements in the same metDecl.
39 metSym specifies the character or character sequence being documented.
60 metSym specifies whether the symbol is defined in terms of other symbols (
62 metSym is set to
66 metSym is set to
146 metSym The value
148 metSym indicates that the element contains a prose definition of its meaning; the value

colloc.xml#13000

# id text
14 colloc contains any sequence of words that co-occur with the headword with significant frequency.

data.name.xml#13000

# id text
20 data.name Attributes using this datatype must contain a single word which follows the rules defining a legal XML name (see

relatedItem.xml#13000

# id text
2 relatedItem contains or references some other bibliographic item which is related to the present one in some specified manner, for example as a constituent or alternative version of it.
32 relatedItem is used, the relatedItem element must be empty
34 relatedItem A relatedItem element should have either a 'target' attribute or a child element to indicate the related bibliographic item

cell.xml#13000

# id text
4 cell contains one cell of a table.

node.xml#13000

# id text
31 node provides the value of a node, which is a feature structure or other analytic element.
69 node initial node in a transition network
85 node final node in a transition network
265 node attributes when the graph is undirected and vice versa if the graph is directed.
286 node gives the in degree of the node, the number of nodes which are adjacent from the given node.
323 node gives the out degree of the node, the number of nodes which are adjacent to the given node.
360 node gives the degree of the node, the number of arcs with which the node is incident.
400 node attributes when the graph is undirected and vice versa if the graph is directed.
442 node provides a label for the arc; the second provides a second label for the arc, and should be used if a transducer is being encoded whose actions are associated with nodes rather than with arcs.

span.xml#13000

# id text
2 span associates an interpretative annotation directly with a span of text.
27 span Only one of the attributes @target and @from may be supplied on
34 span Only one of the attributes @target and @to may be supplied on
41 span If @to is supplied on
42 span , @from must be supplied as well
49 span may each contain only a single value
55 span gives the identifier of the node which is the starting point of the span of text being annotated; if not accompanied by a
57 span attribute, gives the identifier of the node of the entire span of text being annotated.
88 span gives the identifier of the node which is the end-point of the span of text being annotated.

listPrefixDef.xml#13000

# id text
2 listPrefixDef list of prefix definitions
4 listPrefixDef contains a list of definitions of prefixing schemes used in
23 listPrefixDef In this example, two private URI scheme prefixes are defined and patterns are provided for dereferencing them. Each prefix is also supplied with a human-readable explanation in a

att.datable.custom.xml#13227

# id text
2 att.datable.custom provides attributes for normalization of elements that contain datable events to a custom dating system (i.e. other than the Gregorian used by W3 and ISO).
6 att.datable.custom supplies the value of a date or time in some custom standard form.
12 att.datable.custom The following are examples of custom date or time formats that are
34 att.datable.custom Not all custom date formulations will have Gregorian equivalents.
38 att.datable.custom attribute and other custom dating are not constrained to a datatype by the TEI, but individual projects are recommended to regularize and document their dating formats.
43 att.datable.custom specifies the earliest possible date for the event in some custom standard form.
50 att.datable.custom specifies the latest possible date for the event in some custom standard form.
81 att.datable.custom supplies a pointer to some location defining a named point in time with reference to which the datable item is understood to have occurred
104 att.datable.custom element for the Julian calendar, specifying that the text content of the
108 att.datable.custom attribute also points to the Julian calendar to indicate that the content of the
110 att.datable.custom attribute value is Julian too.
122 att.datable.custom In this example, a date is given in a Mediaeval text measured "from the creation of the world", which is normalised (in
126 att.datable.custom ) to a machine-actionable, numeric version of the date from the Creation.
135 att.datable.custom ) defines the calendar or dating system to which the date described by the parent element is normalized (i.e. in the
141 att.datable.custom the calendar of the original date in the element.

faith.xml#13000

# id text
4 faith specifies the faith, religion, or belief set of a person.

listRelation.xml#13000

# id text
4 listRelation provides information about relationships identified amongst people, places, and organizations, either informally as prose or as formally expressed relation links.
102 listRelation May contain a prose description organized as paragraphs, or a sequence of

moduleSpec.xml#13000

# id text
12 moduleSpec documents the structure, content, and purpose of a single module, i.e. a named and externally visible group of declarations.

div.xml#13000

# id text
2 div text division
16 div contains a subdivision of the front, body, or back of a text.

attRef.xml#13000

# id text
14 attRef points to the definition of an attribute or group of attributes.
36 attRef the name of the attribute class
43 attRef the name of the attribute

index.xml#13000

# id text
2 index index entry
14 index marks a location to be indexed for whatever purpose.
45 index a single word which follows the rules defining a legal XML name (see
46 index ), supplying a name to specify which index (of several) the index entry belongs to.

valItem.xml#13000

# id text
2 valItem documents a single value in a predefined list of values.
30 valItem specifies the value concerned.

space.xml#13092

# id text
4 space indicates the location of a significant space in the text.
45 space indicates whether the space is horizontal or vertical.
64 space the space is horizontal.
80 space the space is vertical.
97 space For irregular shapes in two dimensions, the value for this attribute should reflect the more important of the two dimensions. In conventional left-right scripts, a space with both vertical and horizontal components should be classed as
116 space (responsible party) indicates the individual responsible for identifying and measuring the space
141 space This element should be used wherever it is desired to record an unusual space in the source text, e.g. space left for a word to be filled in later, for later rubrication, etc. It is not intended to be used to mark normal inter-word space or the like.

att.combinable.xml#13000

# id text
26 att.combinable add
46 att.combinable if present already, the whole of the declaration for this object is removed from the current setup
62 att.combinable this declaration changes the declaration of the same name in the current definition
78 att.combinable this declaration replaces the declaration of the same name in the current definition
100 att.combinable add
102 att.combinable add
103 att.combinable mode); raise an error if an object with the same identifier already exists
109 att.combinable do not process this object or any existing object with the same identifier; raise an error if any new children supplied
110 att.combinable change
112 att.combinable change

pron.xml#13120

# id text
60 pron full form
111 pron indicates what notation is used for the pronunciation, if more than one occurs in the machine-readable dictionary.
195 pron The values used to specify the notation may be taken from any appropriate project-defined list of values. Typical values might be

soCalled.xml#13000

# id text
2 soCalled contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics.

msContents.xml#13000

# id text
12 msContents describes the intellectual content of a manuscript or manuscript part, either as a series of paragraphs or as a series of structured manuscript items.
60 msContents identifies the text types or classifications applicable to this object by pointing to other elements or resources defining the classification concerned.
328 msContents . This constraint is not currently enforced by the schema.

name.xml#13000

# id text
58 name , when the TEI module for names and dates is included.

byline.xml#13000

# id text
4 byline contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
132 byline The byline on a title page may include either the name or a description for the document's author. Where the name is included, it may optionally be tagged using the

model.msItemPart.xml#13000

# id text
2 model.msItemPart groups elements which can appear within a manuscript item description.

calendar.xml#13000

# id text
8 calendar describes a calendar or dating system used in a dating formula in the text.

data.certainty.xml#13000

# id text
35 data.certainty . The value

cRefPattern.xml#13000

# id text
47 cRefPattern The result of the substitution may be either an absolute or a relative URI reference. In the latter case it is combined with the value of
49 cRefPattern in force at the place where the
51 cRefPattern attribute occurs to form an absolute URI in the usual manner as prescribed by

seal.xml#13000

# id text
4 seal contains a description of one seal or similar attachment applied to a manuscript.
35 seal specifies whether or not the seal is contemporary with the item to which it is affixed

graph.xml#13000

# id text
4 graph encodes a graph, which is a collection of nodes, and arcs which connect the nodes.
83 graph undirected graph
99 graph directed graph
115 graph a directed graph with distinguished initial and final nodes
131 graph a transition network with up to two labels on each arc
152 graph , then the distinction between the
158 graph tag is neutralized. Also, the
168 graph (or any other value which implies directionality), then the
239 graph states the order of the graph, i.e., the number of its nodes.
258 graph states the size of the graph, i.e., the number of its arcs.

recordHist.xml#13000

# id text
2 recordHist recorded history
13 recordHist provides information about the source and revision status of the parent manuscript description itself.

tree.xml#13189

# id text
4 tree encodes a tree, which is made up of a root, internal nodes, leaves, and arcs from root to leaves.
46 tree gives the maximum number of children of the root and internal nodes of the tree.
75 tree indicates whether or not the tree is ordered, or if it is partially ordered.
95 tree indicates that all of the branching nodes of the tree are ordered.
111 tree indicates that some of the branching nodes of the tree are ordered and some are unordered.
127 tree indicates that all of the branching nodes of the tree are unordered.
145 tree gives the order of the tree, i.e., the number of its nodes.
163 tree The size of a tree is always one less than its order, hence there is no need for both a
305 tree A root, and zero or more internal nodes and leaves, but if there is an internal node, there must also be at least one leaf.
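
Rows 145 and 163 together say that the order of a tree is its node count and that its size is always one less. A minimal sketch of that relationship follows, assuming (as in the graph table) that size means the number of arcs, and assuming a hypothetical representation of the tree as a root name plus (parent, child) pairs.

```python
def tree_order_and_size(root, arcs):
    """Return (order, size) for a tree given as a root and (parent, child) arcs.

    Order is the number of distinct nodes; size is the number of arcs.
    The input representation is an assumption made for illustration.
    """
    nodes = {root}
    for parent, child in arcs:
        nodes.add(parent)
        nodes.add(child)
    return len(nodes), len(arcs)


# A root "r" with children "a" and "b", where "a" has one leaf "c".
order, size = tree_order_and_size("r", [("r", "a"), ("r", "b"), ("a", "c")])
```

For any well-formed tree the two numbers differ by exactly one, which is why recording only one of them is enough.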

iff.xml#13000

# id text
2 iff if and only if
13 iff separates the condition from the consequence in a bicond element.

att.media.xml#13000

# id text
9 att.media Where the media are displayed, indicates the display width
16 att.media Where the media are displayed, indicates the display height
23 att.media Where the media are displayed, indicates a scale factor to be applied when generating the desired display size

geogName.xml#13000

# id text
2 geogName geographical name
14 geogName identifies a name associated with some geographical feature such as Windrush Valley or Mount Sinai.

additional.xml#13000

# id text
4 additional groups additional information, combining bibliographic information about a manuscript or surrogate copies of it with curatorial or administrative information.

figure.xml#13000

# id text
4 figure groups elements representing or containing graphic information such as an illustration, formula, or figure.

listRef.xml#13000

# id text
2 listRef list of references
14 listRef supplies a list of significant references to places where this element is discussed, in the current document or elsewhere.

binaryObject.xml#13000

# id text
2 binaryObject provides encoded binary data representing an inline graphic, audio, video or other object.
30 binaryObject The encoding used to encode the binary data. If not specified, this is assumed to be

scriptStmt.xml#13000

# id text
16 scriptStmt contains a citation giving details of the script used for a spoken text.

f.xml#13000

# id text
15 f feature value specification
16 f , that is, the association of a name with a value of any of several different types.
55 f A feature value cannot contain both text and element content
59 f A feature value can contain only one child element
66 f a single word which follows the rules defining a legal XML name (see
67 f ), providing a name for the feature.
86 f feature value
96 f references any element which can be used to represent the value of a feature.
114 f If this attribute is supplied as well as content, the value referenced is to be unified with that contained.
152 f If the element is empty then a value must be supplied for the
154 f attribute. The content of
156 f may also be textual, with the assumption that the data type of the feature value is determined by the schema—this is the approach used in many language-technology-oriented projects and recommendations.

data.pattern.xml#13000

# id text
28 data.pattern , is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. For example, the set containing the three strings
36 data.pattern (or alternatively, it is said that the pattern

material.xml#13000

# id text
4 material contains a word or phrase describing the material of which the object being described is composed.
61 material attribute may be used to point to one or more items within a taxonomy of types of material, defined either internally or externally.

shift.xml#13000

# id text
4 shift marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes.
28 shift The @new attribute should always be supplied; use the special value "normal" to indicate that the feature concerned ceases to be remarkable at this point.
101 shift tension or stress pattern.
151 shift specifies the new state of the paralinguistic feature specified.
172 shift . The special value
174 shift should be used to indicate that the feature concerned ceases to be remarkable at this point. In earlier versions of these Guidelines, a null value for this attribute was understood to have the same effect: this practice is now deprecated and will be removed at a future release.
208 shift is spoken loudly, the words

measureGrp.xml#13000

# id text
2 measureGrp measure group
12 measureGrp contains a group of dimensional specifications which relate to the same object, for example the height and width of a manuscript page.

provenance.xml#13092

# id text
4 provenance contains any descriptive or other information concerning a single identifiable episode during the history of a manuscript or manuscript part, after its creation but before its acquisition.

application.xml#13000

# id text
2 application provides information about an application which has acted upon the document.
37 application supplies an identifier for the application, independent of its version number or display name.
54 application supplies a version number for the application, independent of its identifier or display name.
82 application This example shows an appInfo element documenting the fact that version 1.5 of the Image Markup Tool1 application has an interest in two parts of a document which was last saved on June 6 2006. The parts concerned are accessible at the URLs given as target for the two

att.fragmentable.xml#13251

# id text
6 att.fragmentable specifies whether or not its parent element is fragmented in some way, typically by some other overlapping structure: for example a speech which is divided between two or more verse stanzas, a paragraph which is split across a page division, a verse line which is divided between two speakers.

mapping.xml#13000

# id text
2 mapping character mapping
14 mapping contains one or more characters which are related to the parent character or glyph in some respect, as specified by the

att.divLike.xml#13000

# id text
32 att.divLike specifies how the content of the division is organized.
53 att.divLike no claim is made about the sequence in which the immediate contents of this division are to be processed, or their inter-relationships.
87 att.divLike indicates whether this division is a sample of the original source and if so, from which part.
108 att.divLike division lacks material present at end in source.
124 att.divLike division lacks material at start and end.
140 att.divLike division lacks material at start.
156 att.divLike position of sampled material within original unknown.

model.divBottom.xml#13000

# id text
2 model.divBottom groups elements appearing at the end of a text division

tagsDecl.xml#13122

# id text
56 tagsDecl TEI recommended practice is to specify this attribute. When the
60 tagsDecl are used to list each of the element types in the associated
62 tagsDecl , the value should be given as
68 tagsDecl are used to provide usage information or default renditions for only a subset of the element types within the associated
70 tagsDecl , the value should be

att.timed.xml#13000

# id text
21 att.timed indicates the location within a temporal alignment at which this element begins.
39 att.timed If no value is supplied, the element is assumed to follow the immediately preceding element at the same hierarchic level.
56 att.timed indicates the location within a temporal alignment at which this element ends.
74 att.timed If no value is supplied, the element is assumed to precede the immediately following element at the same hierarchic level.

charProp.xml#13000

# id text
14 charProp provides a name and value for some property of the parent character or glyph.
76 charProp If the property is a Unicode Normative Property, then its
78 charProp must be supplied. Otherwise, its name must be specified by means of a
82 charProp At a later release, additional constraints will be defined on possible value/name combinations using Schematron rules

att.typed.xml#13017

# id text
67 att.typed attribute is present on a number of elements, not all of which are members of
76 att.typed provides a sub-categorization of the element, if needed
96 att.typed attribute may be used to provide any sub-classification for the element additional to that provided by its
128 att.typed When appropriate, values from an established typology should be used. Alternatively a typology may be defined in the associated TEI header. If values are to be taken from a project-specific list, this should be defined using the

model.correspContextPart.xml#13042

# id text
2 model.correspContextPart groups elements which may appear as part of the correspContext element

model.certLike.xml#13000

# id text
2 model.certLike groups elements which are used to indicate uncertainty or precision of other elements.

distinct.xml#13000

# id text
44 distinct specifies how the phrase is distinct diachronically
63 distinct specifies how the phrase is distinct diatopically
82 distinct specifies how the phrase is distinct diastatically

textNode.xml#13058

# id text
2 textNode indicates the presence of a text node in a content model

model.featureVal.single.xml#13000

# id text
2 model.featureVal.single groups elements used to represent atomic feature values in feature structures.

availability.xml#13000

# id text
4 availability supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, any licence applying to it, etc.
36 availability supplies a code identifying the current availability of the text.
59 availability the text is freely available.
75 availability the status of the text is unknown.
91 availability the text is not freely available.

att.editLike.xml#13092

# id text
2 att.editLike provides attributes describing the nature of an encoded scholarly intervention or interpretation of any kind.
41 att.editLike there is internal evidence to support the intervention.
57 att.editLike there is external evidence to support the intervention.
73 att.editLike the intervention or interpretation has been made by the editor, cataloguer, or scholar on the basis of their expertise.
101 att.editLike The members of this attribute class are typically used to represent any kind of editorial intervention in a text, for example a correction or interpretation, or to date or localize manuscripts etc.
106 att.editLike (if present) corresponding to a witness or witness group should reference a bibliographic citation such as a
112 att.editLike element, or another external bibliographic citation, documenting the source concerned.

region.xml#13000

# id text
4 region contains the name of an administrative unit such as a state, province, or county, larger than a settlement, but smaller than a country.

socecStatus.xml#13000

# id text
40 socecStatus identifies the classification system or taxonomy in use, for example by pointing to a locally-defined
61 socecStatus identifies a status code defined within the classification system or taxonomy defined by the
122 socecStatus The content of this element may be used as an alternative to the more formal specification made possible by its attributes; it may also be used to supplement the formal specification with commentary or clarification.

vAlt.xml#13000

# id text
2 vAlt value alternation
14 vAlt represents the value part of a feature-value specification which contains a set of values, only one of which can be valid.

front.xml#13123

# id text
2 front front matter
16 front contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found at the start of a document, before the main body.
212 front Because cultural conventions differ as to which elements are grouped as front matter and which as back matter, the content models for the

variantEncoding.xml#13000

# id text
50 variantEncoding apparatus uses line numbers or other canonical reference scheme referenced in a base text.
82 variantEncoding alternate readings of a passage are given in parallel in the text; no notion of a base text is necessary.
99 variantEncoding The value
118 variantEncoding indicates whether the apparatus appears within the running text or external to it.
140 variantEncoding The @location value "external" is inconsistent with the parallel-segmentation method of apparatus markup.
180 variantEncoding The value

ex.xml#13000

# id text
14 ex contains a sequence of letters added by an editor or transcriber when expanding an abbreviation.

resp.xml#13074

# id text
13 resp contains a phrase describing the nature of a person's intellectual responsibility, or an organization's role in the production or distribution of a work.
76 resp ) to a standardized list of responsibility types, such as that maintained by a naming authority, for example the list maintained at

argument.xml#13000

# id text
4 argument contains a formal list or prose description of the topics addressed by a subdivision of a text.
69 argument Often contains either a list or a paragraph

interpGrp.xml#13000

# id text
2 interpGrp interpretation group
15 interpGrp collects together a set of related interpretations which share responsibility or type.
109 interpGrp Any number of

model.personPart.xml#13000

# id text
2 model.personPart groups elements which form part of the description of a person.

condition.xml#13000

# id text
4 condition contains a description of the physical condition of the manuscript.

samplingDecl.xml#13000

# id text
16 samplingDecl contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
59 samplingDecl This element records all information about systematic inclusion or omission of portions of the text, whether a reflection of sampling procedures in the pure sense or of systematic omission of material deemed either too difficult to transcribe or not of sufficient interest.

licence.xml#13221

# id text
2 licence contains information about a licence or other legal agreement applicable to the text.
48 licence element should be supplied for each licence agreement applicable to the text in question. The
60 licence attributes may be used in combination to indicate the date or dates of applicability of the licence.

att.global.analytic.xml#13000

# id text
2 att.global.analytic provides additional global attributes for associating specific analyses or interpretations with appropriate portions of a text.
18 att.global.analytic analysis

att.styleDef.xml#13000

# id text
2 att.styleDef groups elements which specify the name of a formal definition language used to provide formatting or rendition information.
6 att.styleDef identifies the language used to describe the rendition.
51 att.styleDef Informal free text description
65 att.styleDef A user-defined rendition description language
80 att.styleDef If no value for the @scheme attribute is provided, then the default assumption should be that CSS is in use.
85 att.styleDef supplies a version number for the style language provided in
95 att.styleDef @schemeVersion can only be used if @scheme is specified.
103 att.styleDef is used, then
105 att.styleDef should also appear, with a value other than
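
Rows 80 and 95 describe a default and a dependency between @scheme and @schemeVersion. A rough sketch of how a processor might apply them is below. The function name and the dictionary representation are hypothetical, and the literal value "css" for the default is an assumption (the row only says that CSS is assumed to be in use).

```python
def effective_scheme(attrs):
    """Return the style language in effect for a dict of attributes.

    Raises ValueError when @schemeVersion is given without @scheme (row 95).
    Falls back to "css" when @scheme is absent (row 80); the exact token
    "css" is an assumption made for illustration.
    """
    if "schemeVersion" in attrs and "scheme" not in attrs:
        raise ValueError("@schemeVersion can only be used if @scheme is specified")
    return attrs.get("scheme", "css")
```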

docDate.xml#13000

# id text
2 docDate document date
16 docDate contains the date of a document, as given on a title page or in a dateline.
43 docDate gives the value of the date in standard form, i.e. YYYY-MM-DD.
61 docDate attribute should give the Gregorian or proleptic Gregorian date in one of the formats specified in
113 docDate element in the core tag set. This specialized element is provided for convenience in marking and processing the date of the document, since it is likely to require specialized handling for many applications. It should be used only for the date of the entire document, not for any subset or part of it.

offset.xml#13000

# id text
4 offset marks that part of a relative temporal or spatial expression which indicates the direction of the offset between the two place names, dates, or times involved in the expression.

dataSpec.xml#

# id text
2 dataSpec datatype specification

trait.xml#13242

# id text
4 trait contains a description of some status or quality attributed to a person, place, or organization, typically, but not necessarily, independent of the volition or action of the holder and usually not at some specific time or for a specific date range.
96 trait the more general purpose element
98 trait should be used even for unchanging characteristics. If you wish to distinguish between characteristics that are generally perceived to be time-bound states and those assumed to be fixed traits, then
102 trait element encodes characteristics which are sometimes assumed to change, often at specific times or over a date range, whereas the

att.ranging.xml#13000

# id text
6 att.ranging gives a minimum estimated value for the approximate measurement.
15 att.ranging gives a maximum estimated value for the approximate measurement.
24 att.ranging where the measurement summarizes more than one observation or a range, supplies the minimum value observed.
33 att.ranging where the measurement summarizes more than one observation or a range, supplies the maximum value observed.
42 att.ranging specifies the degree of statistical confidence (between zero and one) that a value falls within the range specified by

epigraph.xml#13000

# id text
2 epigraph contains a quotation, anonymous or attributed, appearing at the start or end of a section or on a title page.

tagUsage.xml#13000

# id text
38 tagUsage specifies the name (generic identifier) of the element indicated by the tag, within the namespace indicated by the parent
61 tagUsage specifies the number of occurrences of this element within the text.
92 tagUsage specifies the number of occurrences of this element within the text which bear a distinct value for the global
131 tagUsage element which defines how this element was rendered in the source text.

handNote.xml#13000

# id text
2 handNote note on hand

gramGrp.xml#13000

# id text
2 gramGrp grammatical information group

if.xml#13000

# id text
2 if defines a conditional default value for a feature; the condition is specified as a feature structure, and is met if it subsumes the feature structure in the text for which a default value is sought.

div5.xml#13000

# id text
2 div5 level-5 text division
16 div5 contains a fifth-level subdivision of the front, body, or back of a text.
187 div5 any sequence of low-level structural elements, possibly grouped into lower subdivisions.

place.xml#13000

# id text
4 place contains data about a geographic location

roleName.xml#13000

# id text
4 roleName contains a name component which indicates that the referent has a particular role or position in society, such as an official title or rank.

link.xml#13000

# id text
4 link defines an association or hypertextual link among elements or passages, of some type not more precisely specifiable by other elements.
57 link The location of this element within a document has no significance, unless it is included within a
59 link , in which case it may inherit the value of the
61 link attribute from the value given on the

biblFull.xml#13000

# id text
13 biblFull contains a fully-structured bibliographic citation, in which all components of the TEI file description are present.

bloc.xml#13012

# id text
4 bloc contains the name of a geo-political unit consisting of two or more nation states or countries.

elementRef.xml#13000

# id text
16 elementRef the identifier used for the required element within the source indicated.
29 elementRef available from the current default source.
38 elementRef available from the TEI P5 1.2.1 release.
42 elementRef Elements are identified by the name supplied as value for the
46 elementRef element in which they are declared. TEI element names are unique.

distributor.xml#13000

# id text
4 distributor supplies the name of a person or other agency responsible for the distribution of a text.

bicond.xml#13000

# id text
2 bicond bi-conditional feature-structure constraint
14 bicond defines a biconditional feature-structure constraint; both consequent and antecedent are specified as feature structures or groups of feature structures; the constraint is satisfied if both subsume a given feature structure, or if both do not.

layout.xml#13000

# id text
4 layout describes how text is laid out on the page, including information about any ruling, pricking, or other evidence of page-preparation techniques.
28 layout specifies the number of columns per page
45 layout If a single number is given, all pages have this number of columns. If two numbers are given, the number of columns per page varies between the values supplied.
51 layout specifies the number of ruled lines per column
68 layout If a single number is given, all columns have this number of ruled lines. If two numbers are given, the number of ruled lines per column varies between the values supplied.
74 layout specifies the number of written lines per column
91 layout If a single number is given, all columns have this number of written lines. If two numbers are given, the number of written lines per column varies between the values supplied.
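
Rows 45, 68 and 91 all use the same convention: one number means a fixed count, two numbers mean a range. A small sketch of reading such a value is below. The function name is hypothetical, and sorting the two numbers is my own choice, since the rows do not say in which order they are given.

```python
def parse_count_range(value):
    """Parse a one- or two-number count into a (low, high) pair.

    "2" gives (2, 2): every page (or column) has that count.
    "1 2" gives (1, 2): the count varies between the two values.
    """
    numbers = [int(token) for token in value.split()]
    if len(numbers) == 1:
        return numbers[0], numbers[0]
    if len(numbers) == 2:
        # Assumption: the order of the two values is not significant.
        return min(numbers), max(numbers)
    raise ValueError(f"expected one or two numbers, got {value!r}")
```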

model.divBottomPart.xml#13000

# id text
2 model.divBottomPart groups elements which can occur only at the end of a text division.

model.publicationStmtPart.agency.xml#13000

# id text
4 model.publicationStmtPart.agency element of the TEI header that indicate an authorising agent.
32 model.publicationStmtPart.agency child elements, while not required, are required if one of the

del.xml#13000

# id text
14 del contains a letter, word, or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, or a previous annotator or corrector.
77 del element should be used for longer sequences of text, for those containing structural subdivisions, and for those containing overlapping additions and deletions.
79 del The text deleted must be at least partially legible in order for the encoder to be able to transcribe it (unless it is restored in a
81 del tag). Illegible or lost text within a deletion may be marked using the
83 del tag to signal that text is present but has not been transcribed, or is no longer visible. Attributes on the
85 del element may be used to indicate how much text is omitted, the reason for omitting it, etc. If text is not fully legible, the
87 del element (available when using the additional tagset for transcription of primary sources) should be used to signal the areas of text which cannot be read with confidence in a similar way.
94 del There is a clear distinction in the TEI between
104 del indicates a deletion present in the source being transcribed, which states the author's or a later scribe's intent to cancel or remove text.
106 del indicates material present in the source being transcribed which should have been so deleted, but which is not in fact.
110 del , by contrast, signal an editor's or encoder's decision to omit something or their inability to read the source text. See sections

data.language.xml#13000

# id text
2 data.language defines the range of attribute values used to identify a particular combination of human language and writing system.
23 data.language The values for this attribute are language
30 data.language language tag
31 data.language , per BCP 47, is assembled from a sequence of components or
35 data.language , U+002D). The tag is made of the following subtags, in the following order. Every subtag except the first is optional. If present, each occurs only once, except the fourth and fifth components (variant and extension), which are repeatable.
36 data.language language
37 data.language The IANA-registered code for the language. This is almost always the same as the ISO 639 2-letter language code if there is one. The list of available registered language subtags can be found at
38 data.language . It is recommended that this code be written in lower case.
40 data.language The ISO 15924 code for the script. These codes consist of 4 letters, and it is recommended they be written with an initial capital, the other three letters in lower case. The canonical list of codes is maintained by the Unicode Consortium, and is available at
41 data.language . The IETF recommends this code be omitted unless it is necessary to make a distinction you need.
42 data.language region
43 data.language Either an ISO 3166 country code or a UN M.49 region code that is registered with IANA (not all such codes are registered, e.g. UN codes for economic groupings or codes for countries for which there is already an ISO 3166 2-letter code are not registered). The former consist of 2 letters, and it is recommended they be written in upper case. The list of codes can be found at
44 data.language . The latter consist of 3 digits; the list of codes can be found at
48 data.language are used to indicate additional, well-recognized variations that define a language or its dialects that are not covered by other available subtags
51 data.language An extension has the format of a single letter followed by a hyphen followed by additional subtags. These exist to allow for future extension to BCP 47, but as of this writing no such extensions are in use.
59 data.language element must be present in the TEI header.
62 data.language There are two exceptions to the above format. First, there are language tags in the
68 data.language Second, an entire language tag can consist of only a private use subtag. These tags start with
70 data.language , and do not need to follow any further rules established by the IETF and endorsed by these Guidelines. Like all language tags that make use of private use subtags, the language in question must be documented in a corresponding
72 data.language element in the TEI header.
82 data.language English as spoken in Sierra Leone
86 data.language Spanish as spoken in Mexico
88 data.language Spanish as spoken in Latin America
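
The subtag order described in the data.language rows above (language, then optional script, region, and repeatable variants) can be illustrated with a minimal Python sketch. It is not a conformant BCP 47 parser: the function name `split_language_tag` and the `LanguageTag` type are hypothetical, the variant pattern (5 to 8 alphanumerics, or 4 characters starting with a digit) is taken from BCP 47 rather than from these rows, and everything from the first single-character subtag onward (extensions, private use) is simply kept unparsed. The tags used in the usage note (`en-SL`, `es-MX`, `es-419`) are assumed to correspond to the three descriptions in rows 82 to 88; the tags themselves do not appear in this table.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]
    rest: Tuple[str, ...]  # extension / private-use subtags, left unparsed


def split_language_tag(tag: str) -> LanguageTag:
    """Split a hyphen-separated language tag into its leading subtags."""
    subtags = tag.split("-")
    language = subtags[0].lower()  # recommended: lower case
    i = 1
    script = region = None
    # Script: 4 letters, recommended with an initial capital.
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        script = subtags[i].capitalize()
        i += 1
    # Region: 2 letters (recommended upper case) or 3 digits.
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[i]):
        region = subtags[i].upper()
        i += 1
    # Variants are repeatable (pattern assumed from BCP 47).
    variants = []
    while i < len(subtags) and re.fullmatch(
        r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtags[i]
    ):
        variants.append(subtags[i].lower())
        i += 1
    return LanguageTag(language, script, region, tuple(variants), tuple(subtags[i:]))
```

For example, `split_language_tag("es-419")` yields a language of `es` and a region of `419`, while `split_language_tag("zh-hant-tw")` normalizes the case to `zh`, `Hant`, `TW`.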

u.xml#13000

# id text
14 u contains a stretch of speech usually preceded and followed by silence or by a change of speaker.
76 u this utterance begins without unusual pause or rapidity.
92 u this utterance begins with a markedly shorter pause than normal.
185 u will be delimited by pause or change of speaker,
187 u is not required to represent a turn or any communicative event, nor to be bounded by pauses or change of speaker. At a minimum, a

att.global.linking.xml#13000

# id text
2 att.global.linking defines a set of attributes for hypertextual linking.
76 att.global.linking . The language is indicated using
78 att.global.linking , whose value is inherited; both the tag with the
80 att.global.linking and the tag pointed to by the
82 att.global.linking inherit the value from their immediate parent.
118 att.global.linking elements in a literary personography. This correspondence represents a slightly looser relationship than the one in the preceding example; there is no sense in which an allegorical character could be substituted for the physical city, or vice versa, but there is obviously a correspondence between them.
189 att.global.linking Any content of the current element should be ignored. Its true content is that of the element being pointed at.
269 att.global.linking selects one or more alternants; if one alternant is selected, the ambiguity or uncertainty is marked as resolved. If more than one alternant is selected, the degree of ambiguity or uncertainty is marked as reduced by the number of alternants not selected.

castItem.xml#13000

# id text
2 castItem cast list item
14 castItem contains a single entry within a cast list, describing either a single role or a list of non-speaking roles.
61 castItem role
65 castItem the item describes a single role.
81 castItem the item describes a list of non-speaking roles.

fileDesc.xml#13000

# id text
97 fileDesc The major source of information for those seeking to create a catalogue entry or bibliographic citation for an electronic file. As such, it provides a title and statements of responsibility together with details of the publication or distribution of the file, of any series to which it belongs, and detailed bibliographic notes for matters not addressed elsewhere in the header. It also contains a full bibliographic description for the source or sources from which the electronic text was derived.

caption.xml#13000

# id text
4 caption contains the text of a caption or other text displayed as part of a film script or screenplay.
85 caption A specialized form of stage direction.

editionStmt.xml#13000

# id text
2 editionStmt edition statement
16 editionStmt groups information relating to one edition of a text.

att.lexicographic.xml#13000

# id text
2 att.lexicographic defines a set of global attributes available on elements in the base tag set for dictionaries.
23 att.lexicographic gives an expanded form of information presented more concisely in the dictionary
68 att.lexicographic gives a normalized form of information given by the source text in a non-normalized form
105 att.lexicographic gives the list of split values for a merged form
126 att.lexicographic gives a value which lacks any realization in the printed source text.
151 att.lexicographic gives the original string or is the empty string when the element does not appear in the source text.
174 att.lexicographic element typically elsewhere in the document, but possibly in another document, which is the original location of this component.

signatures.xml#13077

# id text
4 signatures contains discussion of the leaf or quire signatures found within a codex.

model.frontPart.drama.xml#13000

# id text
2 model.frontPart.drama groups elements which appear at the level of divisions within front or back matter of performance texts only.

caesura.xml#13000

# id text
2 caesura marks the point at which a metrical line may be divided.

lb.xml#13000

# id text
2 lb line break
14 lb marks the start of a new (typographic) line in some edition or version of a text.
40 lb This example shows typographical line breaks within metrical lines, where they occur at different places in different editions:
74 lb This example encodes typographical line breaks as a means of preserving the visual appearance of a title page. The
76 lb attribute is used to show that the line break does not (as elsewhere) mark the start of a new word.
100 lb elements should appear at the point in the text where a new line starts. The
102 lb attribute, if used, indicates the number or other value associated with the text between this point and the next
104 lb element, typically the sequence number of the line within the page, or other appropriate unit. This element is intended to be used for marking actual line breaks on a manuscript or printed page, at the point where they occur; it should not be used to tag structural units such as lines of verse (for which the
110 lb attribute may be used to characterize the line break in any respect. The more specialized attributes
116 lb should be preferred when the intent is to indicate whether or not the line break is word-breaking, or to note the source from which it derives.

att.transcriptional.xml#13017

# id text
2 att.transcriptional provides attributes specific to elements encoding authorial or scribal intervention in a text when transcribing manuscript or similar sources.
38 att.transcriptional indicates the effect of the intervention, for example in the case of a deletion, strikeouts which include too much or too little text, or in the case of an addition, an insertion which duplicates some of the text already present.
58 att.transcriptional all of the text indicated as an addition duplicates some text that is in the original, whether the duplication is word-for-word or less exact.
72 att.transcriptional part of the text indicated as an addition duplicates some text that is in the original
86 att.transcriptional some text at the beginning of the deletion is marked as deleted even though it clearly should not be deleted.
100 att.transcriptional some text at the end of the deletion is marked as deleted even though it clearly should not be deleted.
114 att.transcriptional some text at the beginning of the deletion is not marked as deleted even though it clearly should be.
128 att.transcriptional some text at the end of the deletion is not marked as deleted even though it clearly should be.
142 att.transcriptional some text in the deletion is not marked as deleted even though it clearly should be.
171 att.transcriptional Status information on each deletion is needed rather rarely except in critical editions from authorial manuscripts; status information on additions is even less common.
173 att.transcriptional Marking a deletion or addition as faulty is inescapably an interpretive act; the usual test applied in practice is the linguistic acceptability of the text with and without the letters or words in question.
203 att.transcriptional repeated for the purpose of fixation
207 att.transcriptional repeated to clarify a previously illegible or badly written text or mark
214 att.transcriptional sequence
224 att.transcriptional assigns a sequence number related to the order in which the encoded features carrying this attribute are believed to have occurred.

orgName.xml#13000

# id text
2 orgName organization name

figDesc.xml#13000

# id text
2 figDesc description of figure
14 figDesc contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying it.
58 figDesc This element is intended for use as an alternative to the content of its parent
60 figDesc element; for example, to display when the image is required but the equipment in use cannot display graphic images. It may also be used for indexing or documentary purposes.

note.xml#13129

# id text
2 note contains a note or annotation.
30 note indicates whether the copy text shows the exact place of reference for the note.
48 note In modern texts, notes are usually anchored by means of explicit footnote or endnote symbols. An explicit indication of the phrase or line annotated may however be used instead (e.g.
52 note attribute indicates whether any explicit location is given, whether by symbol or by prose cross-reference. The value
54 note indicates that such an explicit location is indicated in the copy text; the value
56 note indicates that the copy text does not indicate a specific place of attachment for the note. If the specific symbols used in the copy text at the location the note is anchored are to be recorded, use the
77 note points to the end of the span to which the note is attached, if the note is not embedded in the text at that point.
93 note This attribute is retained for backwards compatibility; it may be removed at a subsequent release of the Guidelines. The recommended way of pointing to a span of elements is by means of the
109 note In the following example, the translator has supplied a footnote containing an explanation of the term translated as "painterly":
120 note For this example to be valid, the code
122 note must be defined elsewhere, for example by means of a responsibility statement in the associated TEI header:
160 note attribute may be used to supply the symbol or number used to mark the note's point of attachment in the source text, as in the following example:
166 note However, if notes are numbered in sequence and their numbering can be reconstructed automatically by processing software, it may well be considered unnecessary to record the note numbers.

docAuthor.xml#13000

# id text
2 docAuthor document author
16 docAuthor contains the name of the author of the document, as given on the title page (often but not always contained in a byline).
71 docAuthor The document author's name often occurs within a byline, but the

lem.xml#13000

# id text
56 lem The term
58 lem is used in text criticism to describe the reading in the text itself (as opposed to those in the apparatus); this usage is distinct from that of mathematics (where a lemma is a major step in a proof) and natural-language processing (where a lemma is the dictionary form associated with an inflected form in the running text).

macroSpec.xml#13000

# id text
62 macroSpec indicates which type of entity should be generated, when an ODD processor is generating a module using XML DTD syntax.
91 macroSpec datatype entity

s.xml#13000

# id text
161 s You may not nest one s element within another: use seg instead
196 s element may be used to mark orthographic sentences, or any other segmentation of a text, provided that the segmentation is end-to-end, complete, and non-nesting. For segmentation which is partial or recursive, the
202 s attribute may be used to indicate the type of segmentation intended, according to any convenient typology.

model.settingPart.xml#13000

# id text
2 model.settingPart groups elements used to describe the setting of a linguistic interaction.

model.hiLike.xml#13000

# id text
2 model.hiLike groups phrase-level elements which are typographically distinct but to which no specific function can be attributed.

postBox.xml#13000

# id text
14 postBox contains a number or other identifier for some postal delivery point other than a street address.
66 postBox The position and nature of postal codes is highly country-specific; the conventions appropriate to the country concerned should be used.

att.textCritical.xml#13092

# id text
2 att.textCritical defines a set of attributes common to all elements representing variant readings in text critical work.
105 att.textCritical variant sequence
115 att.textCritical provides a number indicating the position of this reading in a sequence, when there is reason to presume a sequence to the variants.
133 att.textCritical Different variant sequences could be coded with distinct number trails: 1-2-3 for one sequence, 5-6-7 for another. More complex variant sequences, with (for example) multiple branchings from single readings, may be expressed through the

history.xml#13000

# id text
4 history groups elements describing the full history of a manuscript or manuscript part.

model.titlepagePart.xml#13000

# id text
2 model.titlepagePart groups elements which can occur as direct constituents of a title page, such as

TEI.xml#13163

# id text
2 TEI TEI document
16 TEI contains a single TEI-conformant document, containing a single TEI header, a single text, one or more members of the model.resourceLike class, or a combination of these. A series of
18 TEI elements may be combined together to form a
80 TEI specifies the major version number of the TEI Guidelines against which this document is valid.
100 TEI The major version number is historically prefixed by a P (for Proposal), and is distinct from the version number used for individual releases of the Guidelines, as used by (for example) the
222 TEI This element is required. It is customary to specify the TEI namespace

filiation.xml#13006

# id text
5 filiation filiation
107 filiation includes a link to some other manuscript description which has the identifier

data.probability.xml#13000

# id text
25 data.probability Probability is expressed as a real number between 0 and 1; 0 representing
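
As a rough illustration of the constraint in row 25, the following Python sketch checks that a value is a real number between 0 and 1. The function name `parse_probability` is hypothetical, and treating both endpoints as allowed is an assumption based on the row's mention of 0 as a meaningful value.

```python
def parse_probability(text: str) -> float:
    """Return text as a float, rejecting anything outside [0, 1]."""
    value = float(text)
    # NaN fails this comparison too, so it is rejected as well.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"probability out of range: {text!r}")
    return value
```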

death.xml#13012

# id text
4 death contains information about a person's death, such as its date and place.

l.xml#13000

# id text
2 l verse line
14 l contains a single, possibly incomplete, line of verse.

retrace.xml#13000

# id text
2 retrace contains a sequence of writing which has been retraced, for example by over-inking, to clarify or fix it.
24 retrace within another. In principle, a retrace differs from a substitution in that second and subsequent rewrites do not materially alter the content of an element. Where minor changes have been made during the retracing action, however, these may be marked up using
28 retrace , etc. with an appropriate value for the

sponsor.xml#13000

# id text
4 sponsor specifies the name of a sponsoring organization or institution.
55 sponsor Sponsors give their intellectual authority to a project; they are to be distinguished from

vRange.xml#13000

# id text
2 vRange value range
14 vRange defines the range of allowed values for a feature, in the form of an
18 vRange , or primitive value; for the value of an
20 vRange to be valid, it must be subsumed by the specified range; if the
24 vRange attribute), then each value must be subsumed by the

epilogue.xml#13000

# id text
4 epilogue contains the epilogue to a drama, typically spoken by an actor out of character, possibly in association with a particular performance or venue.
142 epilogue Contains optional headings, a sequence of one or more component-level elements, and an optional sequence of closing material.

att.repeatable.xml#13168

# id text
2 att.repeatable supplies attributes for the elements which define component parts of a content model.
7 att.repeatable supplies an XPath identifying a context within which this component of a content model must be found
14 att.repeatable minimum number of occurrences
26 att.repeatable indicates the smallest number of times this component may occur.
35 att.repeatable maximum number of occurrences
47 att.repeatable indicates the largest number of times this component may occur.

att.global.facs.xml#13000

# id text
21 att.global.facs facsimile
29 att.global.facs points to all or part of an image which corresponds with the content of the element.

altIdentifier.xml#13000

# id text
92 altIdentifier An identifying number of some kind must be supplied if known; if it is not known, this should be stated.

listNym.xml#13000

# id text
2 listNym list of canonical names
12 listNym contains a list of nyms, that is, standardized names for any thing.
117 listNym The type attribute may be used to distinguish lists of names of a particular type if convenient.

notatedMusic.xml#13000

# id text
2 notatedMusic encodes the presence of music notation in a text
31 notatedMusic It is possible to describe the content of the notation using elements from the
35 notatedMusic . It is possible to specify the location of digital objects representing the notated music in other media such as images or audio-visual files. The encoder's interpretation of the correspondence between the notated music and these digital objects is not encoded explicitly. We recommend the use of graphic and binaryObject mainly as a fallback mechanism when the notated music format is not displayable by the application using the encoding. The alignment of encoded notated music, images carrying the notation, and audio files is a complex matter for which we refer the encoder to other formats and specifications such as MPEG-SMR.

view.xml#13000

# id text
133 view A view is a particular form of stage direction.

funder.xml#13000

# id text
2 funder funding body
16 funder specifies the name of an individual, institution, or organization responsible for the funding of a project or text.
73 funder Funders provide financial support for a project; they are distinct from
75 funder , who provide intellectual support and authority.

dataNode.xml#13058

# id text
2 dataNode defines possible values for a data node, usually as part of an attribute's datatype
11 dataNode supplies the name of a predefined datatype in the datatype library specified by the
18 dataNode points to the datatype library in which the name specified by the
26 dataNode The default source is the list of datatypes provided by
32 dataNode supplies a string representing a regular expression providing additional constraints on the strings used to represent values conforming to this datatype

handNotes.xml#13000

# id text
4 handNotes elements documenting the different hands identified within the source texts.

vDefault.xml#13000

# id text
2 vDefault value default
14 vDefault declares the default value to be supplied when a feature structure does not contain an instance of
16 vDefault for this name; if unconditional, it is specified as one (or, depending on the value of the
22 vDefault elements or primitive values; if conditional, it is specified as one or more
24 vDefault elements; if no default is specified, or no condition matches, the value
99 vDefault May contain a legal feature value, or a series of

persName.xml#13000

# id text
2 persName personal name

model.phrase.xml#13067

# id text
17 model.phrase This class of elements can occur within paragraphs, list items, lines of verse, etc.

setting.xml#13000

# id text
2 setting describes one particular setting in which a language interaction takes place.
79 setting attribute is not supplied, the setting is assumed to be that of all participants in the language interaction.

roleDesc.xml#13000

# id text
2 roleDesc role description
14 roleDesc describes a character's role in a drama.

depth.xml#13000

# id text
5 depth width
41 depth If used to specify the width of a non text-bearing portion of some object, for example a monument, this element conventionally refers to the axis facing the observer, and perpendicular to that indicated by the
42 depth width

floatingText.xml#13000

# id text
4 floatingText contains a single text of any kind, whether unitary or composite, which interrupts the text containing it at any point and after which the surrounding text resumes.
132 floatingText A floating text has the same content as any other and may thus be interrupted by another floating text, or contain a group of tessellated texts.

model.divPart.spoken.xml#13000

# id text
2 model.divPart.spoken groups elements structurally analogous to paragraphs within spoken texts.

orth.xml#13000

# id text
2 orth orthographic form
14 orth gives the orthographic form of a dictionary headword.
58 orth gives the extent of the orthographic information provided.
79 orth full form

purpose.xml#13000

# id text
2 purpose characterizes a single purpose or communicative function of the text.
109 purpose specifies the extent to which this purpose predominates.
129 purpose this purpose is predominant
131 purpose this purpose is intermediate
133 purpose this purpose is weak
135 purpose extent unknown
180 purpose Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose

idno.xml#13128

# id text
16 idno supplies any form of identifier used to identify some object, such as a bibliographic item, a person, a title, an organization, etc. in a standardized way.

macro.phraseSeq.xml#13000

# id text
2 macro.phraseSeq phrase sequence
14 macro.phraseSeq defines a sequence of character data and phrase-level elements.

div1.xml#13000

# id text
2 div1 level-1 text division
16 div1 contains a first-level subdivision of the front, body, or back of a text.
150 div1 any sequence of low-level structural elements, possibly grouped into lower subdivisions.

join.xml#13000

# id text
41 join specifies the name of an element which this aggregation may be understood to represent.
77 join root
83 join attribute are joined, each subtree becomes a child of the virtual element created by the join
169 join attribute. The value
170 join root
322 join is specified with the value of
324 join to indicate that the virtual list being constructed is to be made by taking the lists indicated by the

nationality.xml#13000

# id text
4 nationality contains an informal description of a person's present or past nationality or citizenship.

att.milestoneUnit.xml#13000

# id text
6 att.milestoneUnit provides a conventional name for the kind of section changing at this milestone.
69 att.milestoneUnit line breaks (synonymous with the
145 att.milestoneUnit changes of speaker or narrator.
253 att.milestoneUnit If the milestone marks the beginning of a piece of text not present in the reference edition, the special value
255 att.milestoneUnit may be used as the value of
257 att.milestoneUnit . The normal interpretation is that the reference edition does not contain the text which follows, until the next
259 att.milestoneUnit tag for the edition in question is encountered.

att.edition.xml#13000

# id text
2 att.edition provides attributes identifying the source edition from which some encoded feature derives.
8 att.edition edition
12 att.edition supplies a sigil or other arbitrary identifier for the source edition in which the associated feature (for example, a page, column, or line break) occurs at this point in the text.
21 att.edition edition reference
23 att.edition provides a pointer to the source edition in which the associated feature (for example, a page, column, or line break) occurs at this point in the text.

prologue.xml#13000

# id text
4 prologue contains the prologue to a drama, typically spoken by an actor out of character, possibly in association with a particular performance or venue.

handShift.xml#13092

# id text
4 handShift marks the beginning of a sequence of text written in a new hand, or the beginning of a scribal stint.
71 handShift element may be used either to denote a shift in the document hand (as from one scribe to another, or one writing style to another), or to indicate a shift within a document hand, such as a change of writing style, character, or ink. Like other milestone elements, it should appear at the point of transition from some other state to the state which it describes.

value.xml#13012

# id text
12 value contains a single value for some property, attribute, or other analysis.

appInfo.xml#13000

# id text
2 appInfo application information
12 appInfo records information about an application which has edited the TEI file.

projectDesc.xml#13000

# id text
16 projectDesc describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.

binding.xml#13000

# id text
35 binding specifies whether or not the binding is contemporary with the majority of its contents
53 binding The value
55 binding indicates that the binding is contemporaneous with its contents; the value
57 binding that it is not. The value
59 binding should be used when the date of either binding or manuscript is unknown

model.contentPart.xml#13001

# id text
2 model.contentPart groups elements which may appear as part of the content element.

att.namespaceable.xml#13000

# id text
2 att.namespaceable provides an attribute indicating the target namespace for an object being created
6 att.namespaceable namespace
18 att.namespaceable specifies the namespace to which this element belongs

redo.xml#13000

# id text
32 redo This encoding represents the following sequence of events:

settingDesc.xml#13000

# id text
2 settingDesc setting description
14 settingDesc describes the setting or settings within which a language interaction takes place, or other places otherwise referred to in a text, edition, or metadata.
74 settingDesc May contain a prose description organized as paragraphs, or a series of
76 settingDesc elements. If used to record other places mentioned in the text, rather than settings of language interactions, then

docEdition.xml#13000

# id text
2 docEdition document edition
16 docEdition contains an edition statement as presented on a title page of a document.
61 docEdition element of bibliographic citation. As usual, the shorter name has been given to the more frequent element.

eTree.xml#13000

# id text
2 eTree embedding tree
14 eTree provides an alternative to the tree element for representing ordered rooted tree structures.
52 eTree provides the value of an embedding tree, which is a feature structure or other analytic element.
144 eTree an optional label followed by zero or more embedding trees, triangles, or embedding leafs.

macro.specialPara.xml#13000

# id text
2 macro.specialPara 'special' paragraph content
14 macro.specialPara defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements.

data.version.xml#13000

# id text
2 data.version defines the range of attribute values which may be used to specify a TEI or Unicode version number.
13 data.version The value of this attribute follows the pattern specified by the Unicode consortium for its version number (
14 data.version ). A version number contains digits and fullstop characters only. The first number supplied identifies the major version number. A second and third number, for minor and sub-minor version numbers, may also be supplied.
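
The pattern described in rows 13 and 14 (digits and full stops only, with a major number optionally followed by minor and sub-minor numbers) could be checked with a sketch like the one below. The function name `parse_version` and the exact regular expression are assumptions made for illustration, not the pattern used in the specification itself.

```python
import re

# One to three dot-separated runs of ASCII digits: major[.minor[.sub-minor]]
_VERSION = re.compile(r"[0-9]+(?:\.[0-9]+){0,2}")


def parse_version(text: str) -> tuple:
    """Return the version as a tuple of ints, e.g. (major, minor, sub_minor)."""
    if not _VERSION.fullmatch(text):
        raise ValueError(f"not a valid version number: {text!r}")
    return tuple(int(part) for part in text.split("."))
```

Returning a tuple of integers means versions compare numerically, so `parse_version("4.10")` sorts after `parse_version("4.9")`.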

facsimile.xml#13000

# id text
2 facsimile contains a representation of some written source in the form of a set of images rather than as transcribed or encoded text.

altIdent.xml#13000

# id text
2 altIdent alternate identifier
12 altIdent supplies the recommended XML name for an element, class, attribute, etc. in some language.
48 altIdent All documentation elements in ODD have a canonical name, supplied as the value for their
52 altIdent element is used to supply an alternative name for the corresponding XML object, perhaps in a different language.

textClass.xml#13000

# id text
2 textClass text classification
16 textClass groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.

data.point.xml#13000

# id text
22 data.point A point is defined by two numeric values, which may be expressed in any notation permitted.
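
A minimal sketch of reading such a point in Python follows. Row 22 does not say how the two values are separated; the comma used here is an assumption, as is the function name `parse_point`.

```python
from typing import Tuple


def parse_point(text: str) -> Tuple[float, float]:
    """Parse 'x,y' into a pair of floats (comma separator is assumed)."""
    # Unpacking raises ValueError unless there are exactly two parts.
    x_text, y_text = text.split(",")
    return float(x_text), float(y_text)
```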

email.xml#13000

# id text
2 email electronic mail address
13 email contains an email address identifying a location to which email messages can be delivered.
47 email The format of a modern Internet email address is defined in

model.ptrLike.xml#13129

# id text
2 model.ptrLike groups elements used for purposes of location and reference.

styleDefDecl.xml#13000

# id text
2 styleDefDecl style definition language declaration
4 styleDefDecl specifies the name of the formal language in which style or renditional information is supplied elsewhere in the document. The specific version of the scheme may also be supplied.

sound.xml#13000

# id text
4 sound describes a sound effect or musical sequence specified within a screenplay or radio script.
27 sound categorizes the sound in some respect, e.g. as music, special effect, etc.
46 sound indicates whether the sound overlaps the surrounding speeches or interrupts them.
66 sound The value
68 sound indicates that the sound is heard between the surrounding speeches; the value
70 sound indicates that the sound overlaps one or more of the surrounding speeches.
169 sound A specialized form of stage direction.

trailer.xml#13000

# id text
2 trailer contains a closing title or footer appearing at the end of a division of a text.

normalization.xml#13000

# id text
4 normalization indicates the extent of normalization or regularization of the original source carried out in converting it to electronic form.
33 normalization indicates a bibliographic description or other resource documenting the principles underlying the normalization carried out.
77 normalization normalization made silently
93 normalization normalization represented using markup

zone.xml#13000

# id text
42 zone indicates the amount by which this zone has been rotated clockwise, with respect to the normal orientation of the parent
44 zone element as implied by the dimensions given in the
48 zone itself. The orientation is expressed in arc degrees.
82 zone The position of every zone for a given surface is always defined by reference to the coordinate system defined for that surface.
84 zone A graphic element contained by a zone represents the whole of the zone.
86 zone A zone may be of any shape. The attribute

pause.xml#13000

# id text
4 pause marks a pause either between or within utterances.

listWit.xml#13000

# id text
2 listWit witness list
58 listWit May contain a series of
68 listWit Situations commonly arise where there are many more or less fragmentary witnesses, such that there may be quite distinct groups of witnesses for different parts of a text or collection of texts. Such groups may be given separately, or nested within a single
77 listWit Note however that a given witness can only be defined once, and can therefore only appear within a single

content.xml#13168

# id text
2 content content model
14 content contains the text of a declaration for the schema documented.
47 content controls whether or not pattern names generated in the corresponding Relax NG schema source are automatically prefixed to avoid potential name clashes.
56 content Each name referenced in e.g. a
58 content element within a content model is automatically prefixed by the value of the
66 content No prefixes are added: any prefix required by the value of the
70 content must therefore be supplied explicitly, as appropriate.
87 content element defines a content model allowing either a sequence of paragraphs or a series of msItem elements optionally preceded by a summary:
102 content This content model allows either a sequence of paragraphs or a series of msItem elements optionally preceded by a summary:
165 content As the example shows, content models may be expressed using the RELAX NG syntax directly. To avoid ambiguity when schemas using elements from different namespaces are created, the name supplied for an element in a content model will be automatically prefixed by a short string, as specified by the
174 content macro.schemaPattern
175 content defines which elements may be used to define content models. Alternatively, a content model may be expressed using the TEI

charName.xml#13000

# id text
2 charName character name
14 charName contains the name of a character, expressed following Unicode conventions.
48 charName The name must follow Unicode conventions for character naming. Projects working in similar fields are recommended to coordinate and publish their list of
50 charName s to facilitate data exchange.

broadcast.xml#13000

# id text
4 broadcast describes a broadcast used as the source of a spoken text.

att.translatable.xml#13000

# id text
18 att.translatable specifies the date on which the source text was extracted and sent to the translator
38 att.translatable attribute can be used to determine whether a translation might need to be revisited, by comparing the modification date on the containing file with the
40 att.translatable value on the translation. If the file has changed, changelogs can be checked to see whether the source text has been modified since the translation was made.
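
The date comparison described in rows 38 and 40 could be sketched roughly as follows. This is a minimal illustration, not part of the specification: the function name is hypothetical, the attribute value is assumed to be an ISO date string (YYYY-MM-DD), and the file system modification time stands in for "the modification date on the containing file".

```python
from datetime import date, datetime
from pathlib import Path


def translation_may_be_stale(source_file: Path, version_date: str) -> bool:
    """Return True if source_file was modified after version_date.

    version_date is assumed to be an ISO date (YYYY-MM-DD) taken from the
    date attribute on the translation; the function name is hypothetical.
    """
    # Use the file system modification time as the file's modification date.
    modified = datetime.fromtimestamp(source_file.stat().st_mtime).date()
    return modified > date.fromisoformat(version_date)
```

A True result only says the file changed after the text was sent for translation; as row 40 notes, the changelog still has to be checked to see whether the source text itself was modified.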

data.numeric.xml#13000

# id text
2 data.numeric defines the range of attribute values used for numeric values.
27 data.numeric Any numeric value, represented as a decimal number, in floating point format, or as a ratio.
33 data.numeric , may be used. In this format, the value is expressed as two numbers separated by the letter E. The first number, the significand (sometimes called the mantissa) is given in decimal format, while the second is an integer. The value is obtained by multiplying the mantissa by 10 the number of times indicated by the integer. Thus the value represented in decimal notation as 1000.0 might be represented in scientific notation as 1E3.
35 data.numeric A value expressed as a ratio is represented by two integer values separated by a solidus (/) character. Thus, the value represented in decimal notation as 0.5 might be represented as a ratio by the string 1/2.
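
The three notations described in rows 27 to 35 (decimal, scientific, ratio) can be illustrated with a small parser. This is a sketch under stated assumptions: the function name and the regular expression are hypothetical, and it does not reproduce the actual datatype pattern used by the TEI schemas.

```python
import re
from decimal import Decimal
from fractions import Fraction

# Hypothetical pattern for the ratio form: two integers separated by a solidus.
_RATIO = re.compile(r"^(-?\d+)/(-?\d+)$")


def parse_numeric(value: str) -> Fraction:
    """Parse a decimal, scientific-notation, or ratio string into an exact Fraction."""
    text = value.strip()
    match = _RATIO.match(text)
    if match:
        numerator, denominator = (int(part) for part in match.groups())
        return Fraction(numerator, denominator)
    # Decimal accepts both plain decimals ("250.0") and E-notation ("2.5E2").
    return Fraction(Decimal(text))
```

Returning a Fraction keeps the comparison exact, so equivalent values written in different notations, such as 0.5 and 1/2, compare equal without floating point rounding.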

precision.xml#13239

# id text
2 precision indicates the numerical accuracy or precision associated with some aspect of the text markup.
23 precision indicates the degree of precision to be assigned as a value between 0 (none) and 1 (optimally precise)
30 precision characterizes the precision of the element or attribute pointed to by the
39 precision supplies a standard deviation associated with the value in question

unicodeName.xml#13000

# id text
2 unicodeName Unicode property name
14 unicodeName contains the name of a registered Unicode normative or informative property.
37 unicodeName specifies the version number of the Unicode Standard in which this property name is defined.
73 unicodeName A definitive list of current Unicode property names is provided in The Unicode Standard.

arc.xml#13000

# id text
4 arc encodes an arc, the connection from one node to another in a graph.
31 arc gives the identifier of the node which is adjacent from this arc.
50 arc gives the identifier of the node which is adjacent to this arc.
102 arc element must be used if the arcs are labeled. Otherwise, arcs can be encoded using the
118 arc provides a label for the arc; the second provides a second label for the arc, and should be used if a transducer is being encoded.

correspAction.xml#13224

# id text
2 correspAction contains a structured description of the place, the name of a person/organization and the date related to the sending/receiving of a message or any other action related to the correspondence

witDetail.xml#13092

# id text
2 witDetail witness detail
49 witDetail indicates the sigil or sigla identifying the witness or witnesses to which the detail refers.
109 witDetail note type='witnessDetail'
112 witDetail attribute, which permits an application to extract all annotation concerning a particular witness or witnesses from the apparatus. It also differs in that the location of a

source.xml#13000

# id text
4 source describes the original source for the information contained within a manuscript description.

valDesc.xml#13000

# id text
2 valDesc value description
14 valDesc specifies any semantic or syntactic constraint on the value that an attribute may take, additional to the information carried by the

tag.xml#13000

# id text
4 tag contains text of a complete start- or end-tag, possibly including attribute specifications, but excluding the opening and closing markup delimiter characters.
27 tag indicates the type of XML tag intended
88 tag supplies the name of the schema in which this tag is defined.
105 tag TEI
109 tag text encoding initiative
113 tag This tag is defined as part of the TEI scheme.
133 tag this tag is part of the Docbook scheme.
159 tag this tag is part of an unknown scheme.

number.xml#13000

# id text
5 number indicates grammatical number associated with a form, as given in a dictionary.
83 number gram type="num"

teiHeader.xml#13064

# id text
2 teiHeader TEI header
16 teiHeader supplies the descriptive and declarative information making up an electronic title page for every TEI-conformant document.
48 teiHeader specifies the kind of document to which the header is attached, for example whether it is a corpus or individual text.
67 teiHeader text
71 teiHeader the header is attached to a single text.
87 teiHeader the header is attached to a corpus.
307 teiHeader One of the few elements unconditionally required in any TEI document.

author.xml#13000

# id text
4 author in a bibliographic reference, contains the name(s) of an author, personal or corporate, of a work; for example in the same form as that provided by a recognized bibliographic name authority.
69 author Particularly where cataloguing is likely to be based on the content of the header, it is advisable to use a generally recognized name authority file to supply the content for this element. The attributes
75 author In the case of a broadcast, use this element for the name of the company or network responsible for making the broadcast.
77 author Where an author is unknown or unspecified, this element may contain text such as
81 author . When the appropriate TEI modules are in use, it may also contain detailed tagging of the names used for people, organizations or places, in particular where multiple names are given.

objectDesc.xml#13000

# id text
41 objectDesc a short project-specific name identifying the physical form of the carrier, for example as a codex, roll, fragment, partial leaf, cutting etc.

fsdLink.xml#13000

# id text
2 fsdLink feature structure declaration link
12 fsdLink associates the name of a typed feature structure with a feature structure declaration for it.
32 fsdLink identifies the type of feature structure to be documented; this will be the value of the

origDate.xml#13000

# id text
2 origDate origin date
13 origDate contains any form of date, used to identify the date of origin for a manuscript or manuscript part.

re.xml#13000

# id text
2 re related entry
14 re contains a dictionary entry for a lexical item related to the headword, such as a compound phrase or derived form, embedded inside a larger entry.
51 re shows a single related entry for which no definition is given, since its meaning is held to be readily derivable from the root entry:
350 re shows a number of related entries embedded in the main entry. The original entry resembles the following:
367 re One encoding for this entry would be:
443 re s in its main entry for
447 re This entry may be encoded thus:
513 re May contain character data mixed with any other elements defined in the dictionary tag set.
517 re tag, and used where a dictionary has embedded information inside one entry which could have formed a separate entry. Some authorities distinguish related entries, run-on entries, and various other types of degenerate entries; no such typology is attempted here.

body.xml#13000

# id text
2 body text body
16 body contains the whole body of a single unitary text, excluding any front or back matter.

linkGrp.xml#13000

# id text
2 linkGrp link group
13 linkGrp defines a collection of associations or hypertextual links.
124 linkGrp A web or link group is an administrative convenience, which should be used to collect a set of links together for any purpose, not simply to supply a default value for the

tech.xml#13000

# id text
2 tech technical stage direction
14 tech describes a special-purpose stage direction that is not meant for the actors.
37 tech categorizes the technical stage direction.
72 tech a sound cue
122 tech performance
134 tech elements documenting the performance or performances to which this technical direction applies.

att.ascribed.xml#13016

# id text
18 att.ascribed indicates the person, or group of people, to whom the element content is ascribed.
38 att.ascribed ) in the body of the play are linked to

hi.xml#13000

# id text
14 hi marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made.

namespace.xml#13000

# id text
4 namespace supplies the formal name of the namespace to which the elements documented by its children belong.
30 namespace specifies the full formal name of the namespace concerned.

locus.xml#13000

# id text
4 locus defines a location within a manuscript or manuscript part, usually as a (possibly discontinuous) sequence of folio references.
30 locus identifies the foliation scheme in terms of which the location is being specified by pointing to some
53 locus specifies the starting point of the location in a normalized form, typically a page number.
74 locus specifies the end-point of the location in a normalized form, typically as a page number.
189 locus attribute is available globally when the
211 locus attribute should only be used to point to elements that contain or indicate a transcription of the locus being described, as in the first example above. To associate a
253 locus When the location being defined consists of a single page, use the
261 locus . For example, if the manuscript description being transcribed has

att.resourced.xml#13000

# id text
2 att.resourced provides attributes by which a resource (such as an externally held media file) may be located.
16 att.resourced specifies the URL from which the media concerned may be obtained.

addrLine.xml#13000

# id text
2 addrLine address line
13 addrLine contains one line of a postal
86 addrLine Addresses may be encoded either as a sequence of lines, or using any sequence of component elements from the
92 addrLine if they form part of the printed address in some source text.

attList.xml#13000

# id text
4 attList contains documentation for all the attributes associated with this element, as a series of
52 attList specifies whether all the attributes in the list are available (org="group") or only one of them (org="choice")
69 attList group

respons.xml#13092

# id text
14 respons identifies the individual(s) responsible for some aspect of the content or markup of particular element(s).
64 respons responsibility is being assigned concerning the name of the element or attribute used.
76 respons responsibility is being assigned concerning the location of the element concerned.
80 respons responsibility is being assigned concerning the content (for an element) or the value (for an attribute)
210 respons element is designed for cases in which fine-grained information about specific aspects of the markup of a text is desirable for whatever reason. Global responsibility for certain aspects of markup is usually more simply indicated in the TEI header, using the
212 respons element within the title statement, edition statement, or change log.

revisionDesc.xml#13000

# id text
16 revisionDesc summarizes the revision history for a file.
80 revisionDesc to record the status at the time of that change. Conventionally change elements should be given in reverse date order, with the most recent change at the start of the list.

occupation.xml#13000

# id text
30 occupation indicates the classification system or taxonomy in use, for example by supplying the identifier of a
63 occupation identifies an occupation code defined within the classification system or taxonomy defined by the
138 occupation The content of this element may be used as an alternative to the more formal specification made possible by its attributes; it may also be used to supplement the formal specification with commentary or clarification.

add.xml#13000

# id text
14 add contains letters, words, or phrases inserted in the source text by an author, scribe, or a previous annotator or corrector.
45 add In a diplomatic edition attempting to represent an original source, the
47 add element should not be used for additions to the current TEI electronic edition made by editors or encoders. In these cases, either the
53 add In a TEI edition of a historical text with previous editorial emendations in which such additions or reconstructions are considered part of the source text, the use of

att.declarable.xml#13000

# id text
2 att.declarable provides attributes for those elements in the TEI header which may be independently selected by means of the special purpose
32 att.declarable indicates whether or not this element is selected by default when its parent is selected.
53 att.declarable This element is selected if its parent is selected
69 att.declarable This element can only be selected explicitly, unless it is the only one of its kind, in which case it is selected if its parent is selected.
88 att.declarable The rules governing the association of declarable elements with individual parts of a TEI text are fully defined in chapter
91 att.declarable attribute with a value of

recording.xml#13000

# id text
2 recording recording event
16 recording provides details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast.
70 recording audio recording
86 recording audio and video recording

institution.xml#13000

# id text
4 institution contains the name of an organization such as a university or library, with which a manuscript is identified, generally its holding institution.

model.stageLike.xml#13000

# id text
2 model.stageLike groups elements containing stage directions or similar things defined by the module for performance texts.

spGrp.xml#13000

# id text
2 spGrp speech group
4 spGrp contains a group of speeches or songs in a performance text presented in a source as constituting a single unit or
5 spGrp number

correction.xml#13000

# id text
2 correction correction principles
44 correction indicates the degree of correction applied to the text.
67 correction the text has been thoroughly checked and proofread.
83 correction the text has been checked at least once.
99 correction the text has not been checked.
115 correction the correction status of the text is unknown.
213 correction May be used to note the results of proofreading the text against its original, indicating (for example) whether discrepancies have been silently rectified, or recorded using the editorial tags described in section

specGrpRef.xml#13000

# id text
2 specGrpRef reference to a specification group
51 specGrpRef points at the specification group which logically belongs here.
132 specGrpRef usually produces a comment indicating that a set of declarations printed in another section will be inserted at this point in the
138 specGrpRef The specification group identified by the

desc.xml#13000

# id text
15 desc contains a brief description of the object documented by its parent element, including its intended usage, purpose, or application where this is appropriate.
58 desc TEI convention requires that this be expressed as a finite clause, beginning with an active verb.

publisher.xml#13000

# id text
4 publisher provides the name of the organization responsible for the publication or distribution of a bibliographic item.
63 publisher Use the full form of the name by which a company is usually referred to, rather than any abbreviation of it which may appear on a title page

placeName.xml#13000

# id text
5 placeName contains an absolute or relative place name.

eg.xml#13000

# id text
64 eg If the example contains material in XML markup, either it must be enclosed within a CDATA marked section, or character entity references must be used to represent the markup delimiters. If the example contains well-formed XML, it should be marked using the more specific

castGroup.xml#13000

# id text
2 castGroup cast list grouping
14 castGroup groups one or more individual castItem elements within a cast list.
125 castGroup Note that in this example the role description

att.readFrom.xml#13000

# id text
6 att.readFrom specifies the source from which declarations and definitions for the components of the object being defined may be obtained.
12 att.readFrom The context indicated must provide a set of TEI-conformant specifications in a form directly usable by an ODD processor. By default, this will be the location of the current release of the TEI Guidelines.
14 att.readFrom The source may be specified in the form of a private URI, for which the form recommended is
20 att.readFrom for the 1.5.1 release of TEI P5 or (as a special case)

gi.xml#13000

# id text
2 gi element name
14 gi contains the name (generic identifier) of an element.
37 gi supplies the name of the scheme in which this name is defined.
54 gi TEI
58 gi this element is part of the TEI scheme.
143 gi This example shows the use of both a namespace prefix and the schema attribute as alternative ways of indicating that the gi in question is not a TEI element name: in practice only one method should be adopted.

model.global.xml#13000

# id text
2 model.global groups elements which may appear at any point within a TEI text.

recordingStmt.xml#13000

# id text
2 recordingStmt recording statement
16 recordingStmt describes a set of recordings used as the basis for transcription of a spoken text.

explicit.xml#13000

# id text
5 explicit explicit
6 explicit of a manuscript item, that is, the closing words of the text proper, exclusive of any rubric or colophon which might follow it.

sealDesc.xml#13000

# id text
2 sealDesc seal description
13 sealDesc describes the seals or other external items attached to a manuscript, either as a series of paragraphs or as a series of distinct
15 sealDesc elements, possibly with additional

model.entryPart.top.xml#13000

# id text
2 model.entryPart.top groups high level elements within a structured dictionary entry
17 model.entryPart.top Members of this class typically contain related parts of a dictionary entry which form a coherent subdivision, for example a particular sense, homonym, etc.

org.xml#13075

# id text
67 org specifies a primary role or classification for the organization.
83 org Values for this attribute may be locally defined by a project, using arbitrary keywords such as
88 org family group

keywords.xml#13000

# id text
4 keywords contains a list of keywords or phrases identifying the topic or nature of a text.
33 keywords identifies the controlled vocabulary or classification scheme within which the set of keywords concerned is defined, for example by a
109 keywords Each individual keyword (including compound subject headings) should be supplied as a
121 keywords If no control list exists for the keywords used, then no value should be supplied for the

classSpec.xml#13000

# id text
13 classSpec contains reference information for a TEI element class; that is a group of elements which appear together in content models, or which share some common attribute, or both.
81 classSpec content model
91 classSpec members of this class appear in the same content models
135 classSpec indicates which alternation and sequence instantiations of a model class may be referenced. By default, all variations are permitted.
170 classSpec members of the class are to be provided in sequence
218 classSpec members of the class may be provided one or more times, in sequence

language.xml#13000

# id text
4 language characterizes a single language or sublanguage used within a text.
38 language Supplies a language code constructed as defined in
40 language which is used to identify the language documented by this element, and which is referenced by the global
93 language specifies the approximate percentage (by volume) of the text which uses this language.
154 language Particularly for sublanguages, an informal prose characterization should be supplied as content for the element.

model.resourceLike.xml#13000

# id text
2 model.resourceLike groups non-textual elements which may appear together with a header and a text to constitute a TEI document.

street.xml#13000

# id text
2 street contains a full street address including any name or number identifying a building as well as the name of the street or route on which it is located.
63 street The order and presentation of house names and numbers and street names, etc., may vary considerably in different countries. The encoding should reflect the order which is appropriate in the country concerned.

seg.xml#13092

# id text
14 seg represents any segmentation of text below the
137 seg element may be used at the encoder's discretion to mark any segments of the text of interest for processing. One use of the element is to mark text features for which no appropriate markup is otherwise defined. Another use is to provide an identifier for some segment which is to be pointed at by some other element—i.e. to provide a target, or a part of a target, for a

gb.xml#13000

# id text
30 gb attribute indicates the number or other value used to identify this gathering in a collation.

langKnowledge.xml#13242

# id text
2 langKnowledge language knowledge
12 langKnowledge summarizes the state of a person's linguistic knowledge, either as prose or by a list of
61 langKnowledge supplies one or more valid language tags for the languages specified
79 langKnowledge This attribute should be supplied only if the element contains no
81 langKnowledge children. Its values are language

macro.anyXML.xml#13000

# id text
2 macro.anyXML defines a content model within which any XML elements are permitted
11 macro.anyXML egXML

set.xml#13000

# id text
2 set setting
13 set contains a description of the setting, time, locale, appearance, etc., of the action of a play, typically found in the front matter of a printed performance text (not a stage direction).
167 set This element should not be used outside the front matter; for similar contextual descriptions within the body of the text, use the

settlement.xml#13000

# id text
4 settlement contains the name of a settlement such as a city, town, or village identified as a single geo-political or administrative unit.

metamark.xml#13000

# id text
2 metamark contains or describes any kind of graphic or written signal within a document the function of which is to determine how it should be read rather than forming part of the actual content of the document.
23 metamark identifies one or more elements to which the function indicated by the metamark applies.

publicationStmt.xml#13000

# id text
133 publicationStmt classes rather than one or more paragraphs or anonymous blocks, care should be taken to ensure that the repeated elements are presented in a meaningful order. It is a conformance requirement that elements supplying information about publication place, address, identifier, availability, and date be given following the name of the publisher, distributor, or authority concerned, and preferably in that order.

age.xml#13012

# id text
4 age specifies the age of a person.
29 age supplies a numeric code representing the age or age group
47 age This attribute may be used to complement a more detailed discussion of a person's age in the content of the element
79 age As with other culturally-constructed traits such as sex, the way in which this concept is described in different cultural contexts may vary. The normalizing attributes are provided as a means of simplifying that variety to Western European norms and should not be used where that is inappropriate. The content of the element may be used to describe the intended concept in more detail, using plain text.

model.oddRef.xml#13000

# id text
2 model.oddRef groups elements which reference declarations in some markup language in ODD documents.

joinGrp.xml#13000

# id text
2 joinGrp join group
14 joinGrp groups a collection of join elements and possibly pointers.
50 joinGrp supplies the default value for the
92 joinGrp Any number of

data.percentage.xml#13221

# id text
11 data.percentage Any non-negative integer value less than 100.

personGrp.xml#13000

# id text
2 personGrp personal group
14 personGrp describes a group of individuals treated as a single person for analytic purposes.
48 personGrp specifies the role of this group of participants in the interaction.
66 personGrp Values for this attribute may be locally defined by a project, using arbitrary keywords such as
80 personGrp specifies the sex of the participant group.
98 personGrp Values for this attribute may be locally defined by a project, or may refer to an external standard, such as vCard's sex property
123 personGrp . For a mixed group, a value such as "mixed" may also be supplied.
128 personGrp specifies the age group of the participants.
146 personGrp Values for this attribute may be locally defined by a project, using arbitrary keywords such as
162 personGrp describes informally the size or approximate size of the group for example by means of a number and an indication of accuracy e.g.
194 personGrp May contain a prose description organized as paragraphs, or any sequence of demographic elements in any combination.
198 personGrp attribute should be used to identify each speaking participant in a spoken text if the

w.xml#13000

# id text
53 w provides a lemma for the word, such as an uninflected dictionary entry form.

decoNote.xml#13000

# id text
2 decoNote note on decoration
13 decoNote contains a note describing either a decorative component of a manuscript, or a fairly homogeneous class of such components.

ref.xml#13000

# id text
13 ref defines a reference to another location, possibly modified by additional text or comment.
41 ref Only one of the attributes @target and @cRef may be supplied on

wit.xml#13000

# id text
4 wit contains a list of one or more sigla of witnesses attesting a given reading, in a textual variation.
54 wit attribute of the reading; it may be used to record the exact form of the sigla given in the source edition, when that is of interest.

fsConstraints.xml#13000

# id text
14 fsConstraints specifies constraints on the content of valid feature structures.
55 fsConstraints May contain a series of conditional or biconditional elements.

macro.limitedContent.xml#13000

# id text
2 macro.limitedContent paragraph content
12 macro.limitedContent defines the content of prose elements that are not used for transcription of extant materials.

closer.xml#13000

# id text
4 closer groups together salutations, datelines, and similar phrases appearing as a final group at the end of a division, especially of a letter.

back.xml#13123

# id text
2 back back matter
203 back Because cultural conventions differ as to which elements are grouped as back matter and which as front matter, the content models for the

channel.xml#13000

# id text
2 channel primary channel
14 channel describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, email, etc.; for a spoken one, radio, telephone, face-to-face, etc.
37 channel specifies the mode of this channel with respect to speech and writing.
58 channel spoken
78 channel spoken to be written
104 channel written to be spoken

model.labelLike.xml#13000

# id text
2 model.labelLike groups elements used to gloss or explain other parts of a document.

change.xml#13013

# id text
2 change documents a change or set of changes made during the production of a source document, or during the revision of an electronic file.
123 change element elsewhere in the header, identifying the person responsible for the change and their role in making it.
127 change attribute may be used to indicate the status of a document following the change documented.

performance.xml#13000

# id text
4 performance contains a section of front or back matter describing how a dramatic piece is to be performed in general or how it was performed on some specific occasion.
151 performance contains paragraphs and an optional cast list only.

handDesc.xml#13000

# id text
13 handDesc contains a description of all the different kinds of writing used in a manuscript.
50 handDesc specifies the number of distinct hands identified within the manuscript

att.global.change.xml#13000

# id text
10 att.global.change elements documenting a state or revision campaign to which the element bearing this attribute and its children have been assigned by the encoder.

profileDesc.xml#13062

# id text
134 profileDesc Although the content model permits it, it is rarely meaningful to supply multiple occurrences for any of the child elements of

formula.xml#13000

# id text
33 formula names the notation used for the content of the element.

symbol.xml#13000

# id text
2 symbol symbolic value
14 symbol represents the value part of a feature-value specification which contains one of a finite list of symbols.
38 symbol supplies a symbolic value for the feature, one of a finite list that may be specified in a feature declaration.

root.xml#13000

# id text
2 root root node
14 root represents the root node of a tree.
38 root identifies the root node of the network by pointing to a feature structure or other analytic element.
57 root identifies the elements which are the children of the root node.
75 root If the root has no children (i.e., the tree is
77 root ), then the
110 root indicates whether or not the root is ordered.
128 root The value
130 root indicates that the children of the root are ordered, whereas
134 root Use if and only if
140 root element and the root has more than one child.
177 root gives the out degree of the root, the number of its children.
195 root The in degree of the root is always 0.

constraintSpec.xml#13229

# id text
2 constraintSpec constraint on schema
4 constraintSpec contains a constraint, expressed in some formal syntax, which cannot be expressed in the structural content model
28 constraintSpec Rules in the Schematron 1.* language must be inside a constraintSpec with a value of 'schematron' on the scheme attribute
37 constraintSpec Rules in the ISO Schematron language must be inside a constraintSpec with a value of 'isoschematron' on the scheme attribute
46 constraintSpec Rules in XSLT must be inside a constraintSpec with a value of 'isoschematron' on the scheme attribute
54 constraintSpec An ISO Schematron constraint specification for a macro should not have an 'assert' or 'report' element without a parent 'rule' element
61 constraintSpec supplies the name of the language in which the constraints are defined
80 constraintSpec private constraint language
87 constraintSpec This constraint uses Schematron to enforce the presence of the
120 constraintSpec This constraint uses a language which is not expressed in XML to check whether the title and author are identical:

origin.xml#13000

# id text
4 origin contains any descriptive or other information concerning the origin of a manuscript or manuscript part.

stdVals.xml#13000

# id text
16 stdVals specifies the format used when standardized date or number values are supplied.

model.measureLike.xml#13000

# id text
2 model.measureLike groups elements which denote a number, a quantity, a measurement, or similar piece of text that conveys some numerical meaning.

model.ptrLike.form.xml#13000

# id text
2 model.ptrLike.form groups elements used for purposes of location of particular orthographic or pronunciation forms within a dictionary entry.

equiv.xml#13000

# id text
37 equiv a single word which follows the rules defining a legal XML name (see
86 equiv references an external script which contains a method to transform instances of this element to canonical TEI
109 equiv hi rend='bold'
177 equiv attribute should be used to supply the MIME media type of the filter script specified by the

msName.xml#13000

# id text
2 msName alternative name
14 msName contains any form of unstructured alternative name used for a manuscript, such as an

att.source.xml#13000

# id text
2 att.source provides attributes for pointing to the source of a bibliographic reference.
8 att.source provides a pointer to the bibliographical source from which a quotation or citation is drawn.

respStmt.xml#13000

# id text
14 respStmt supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply. May also be used to encode information about individuals or organizations which have played a role in the production or distribution of a bibliographic work.

colophon.xml#13000

# id text
5 colophon colophon

macro.phraseSeq.limited.xml#13000

# id text
2 macro.phraseSeq.limited limited phrase sequence
12 macro.phraseSeq.limited defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents.

entry.xml#13000

# id text
4 entry contains a single structured entry in any kind of lexical resource, such as a dictionary or lexicon.
122 entry s; one convenient method is to use the orthographic form of the headword, appending a disambiguating number where necessary. Identification codes are sometimes included on machine-readable tapes of dictionaries for in-house use.
126 entry element even for an entry that has only one sense to group together all parts of the definition relating to the word sense since this leads to more consistent encoding across entries.

data.pointer.xml#13000

# id text
28 data.pointer (IRIs) mapping to URIs. For example,

glyphName.xml#13000

# id text
2 glyphName character glyph name
14 glyphName contains the name of a glyph, expressed following Unicode conventions for character names.
47 glyphName For characters of non-ideographic scripts, a name following the conventions for Unicode names should be chosen. For ideographic scripts, an
49 glyphName (IDS) as described in Chapter 10.1 of the Unicode Standard is recommended where possible. Projects working in similar fields are recommended to coordinate and publish their list of
51 glyphName s to facilitate data exchange.

def.xml#13000

# id text
14 def contains definition text in a dictionary entry.

att.interpLike.xml#13092

# id text
2 att.interpLike provides attributes for elements which represent a formal analysis or interpretation.
116 att.interpLike points to instances of the analysis or interpretation represented by the current element.
134 att.interpLike The current element should be an analytic one. The element pointed at should be a textual one.

valList.xml#13000

# id text
2 valList value list
53 valList specifies the extensibility of the list of values specified.

model.divTop.xml#13000

# id text
2 model.divTop groups elements appearing at the beginning of a text division

model.nameLike.xml#13129

# id text
2 model.nameLike groups elements which name or refer to a person, place, or organization.

att.enjamb.xml#13000

# id text
49 att.enjamb indicates that the end of a verse line is marked by enjambement.
68 att.enjamb the line is end-stopped
84 att.enjamb the line in question runs on into the next
100 att.enjamb the line is weakly enjambed
116 att.enjamb the line is strongly enjambed
133 att.enjamb The usual practice will be to give the value
135 att.enjamb to this attribute when enjambement is being marked, or the values
139 att.enjamb if degrees of enjambement are of interest; if no value is given, however, the attribute does not default to a value of
141 att.enjamb ; this allows the attribute to be omitted entirely when enjambement is not of particular interest.

then.xml#13000

# id text
2 then separates the condition from the default in an

data.outputMeasurement.xml#13000

# id text
47 data.outputMeasurement These values map directly onto the values used by XSL-FO and CSS. For definitions of the units see those specifications; at the time of this writing the most complete list is in the

incipit.xml#13000

# id text
3 incipit incipit
4 incipit of a manuscript item, that is the opening words of the text proper, exclusive of any
5 incipit rubric
6 incipit which might precede it, of sufficient length to identify the work uniquely; such incipits were, in former times, frequently used as a means of reference to a work, in place of a title.

biblStruct.xml#13042

# id text
76 biblStruct WARNING: use of deprecated method — the use of the idno element as a direct child of the biblStruct element will be removed from the TEI on 2016-09-18

heraldry.xml#13000

# id text
4 heraldry contains a heraldic formula or phrase, typically found as part of a blazon, coat of arms, etc.

term.xml#13000

# id text
88 term This element is used to supply the form under which an index entry is to be made for the location of a parent
94 term element may be used to mark any of these. No position is taken on the philosophical issue of what a term can be; the looser definition simply allows the
100 term class, instances of this element occurring in a text may be associated with a canonical definition, either by means of a URI (using the
102 term attribute), or by means of some system-specific code value (using the

msItemStruct.xml#13000

# id text
2 msItemStruct structured manuscript item
13 msItemStruct contains a structured description for an individual work or item within the intellectual content of a manuscript or manuscript part.
98 msItemStruct identifies the text types or classifications applicable to this item by pointing to other elements or resources defining the classification concerned.

constitution.xml#13000

# id text
4 constitution describes the internal composition of a text or text sample, for example as fragmentary, complete, etc.
27 constitution specifies how the text was constituted.
48 constitution a single complete text
64 constitution a text made by combining several smaller items, each individually complete
90 constitution a text made by combining several smaller, not necessarily complete, items

rubric.xml#13000

# id text
4 rubric contains the text of any
5 rubric rubric
6 rubric or heading attached to a particular manuscript item, that is, a string of words through which a manuscript signals the beginning of a text division, often with an assertion as to its author and title, which is in some way set off from the text itself, usually in red ink, or by use of different size or type of script, or some other such visual device.

calendarDesc.xml#13000

# id text
2 calendarDesc calendar description
10 calendarDesc contains a description of the calendar system used in any dating expression found in the text.
196 calendarDesc s are from W3 guidelines at

att.damaged.xml#13017

# id text
2 att.damaged provides attributes describing the nature of any physical damage affecting a reading.
19 att.damaged in the case of damage (deliberate defacement, inking out, etc.) assignable to a distinct hand, signifies the hand responsible for the damage by pointing to one of the hand identifiers declared in the document header (see section
37 att.damaged categorizes the cause of the damage, if it can be identified.
54 att.damaged damage results from rubbing of the leaf edges
68 att.damaged damage results from mildew on the leaf surface
82 att.damaged damage results from smoke
98 att.damaged provides a coded representation of the degree of damage, either as a number between 0 (undamaged) and 1 (very extensively damaged), or as one of the codes
110 att.damaged attribute should only be used where the text may be read with some confidence; text supplied from other sources should be tagged as
161 att.damaged element is appropriate where it is desired to record the fact of damage although this has not affected the readability of the text, for example a weathered inscription. Where the damage has rendered the text more or less illegible either the
163 att.damaged tag (for partial illegibility) or the
165 att.damaged tag (for complete illegibility, with no text supplied) should be used, with the information concerning the damage given in the attribute values of these tags. See section
223 att.damaged assigns an arbitrary number to each stretch of damage regarded as forming part of the same physical phenomenon.

iNode.xml#13000

# id text
2 iNode intermediate (or internal) node
14 iNode represents an intermediate (or internal) node of a tree.
38 iNode indicates an intermediate node, which is a feature structure or other analytic element.
57 iNode provides a list of identifiers of the elements which are the children of the intermediate node.
105 iNode indicates whether or not the internal node is ordered.
123 iNode The value
125 iNode indicates that the children of the intermediate node are ordered, whereas
129 iNode Use if and only if
135 iNode element and the intermediate node has more than one child.
172 iNode provides the identifier of an element which this node follows.
190 iNode If the tree is unordered or partially ordered, this attribute has the property of fixing the relative order of the intermediate node and the element which is the value of the attribute.
203 iNode gives the out degree of an intermediate node, the number of its children.
221 iNode The in degree of an intermediate node is always 1.

langUsage.xml#13000

# id text
2 langUsage language usage

model.persStateLike.xml#13000

# id text
2 model.persStateLike groups elements describing changeable characteristics of a person which have a definite duration, for example occupation, residence, or name.

att.msExcerpt.xml#13000

# id text
52 att.msExcerpt In the case of an incipit, indicates whether the incipit as given is defective, i.e. the first words of the text as preserved, as opposed to the first words of the work itself. In the case of an explicit, indicates whether the explicit as given is defective, i.e. the final words of the text as preserved, as opposed to what the closing words would have been had the text of the work been whole.

table.xml#13000

# id text
4 table contains text displayed in tabular form, in rows and columns.
58 table indicates the number of rows in the table.
76 table If no number is supplied, an application must calculate the number of rows.
101 table indicates the number of columns in each row of the table.
119 table If no number is supplied, an application must calculate the number of columns.
283 table Contains an optional heading and a series of rows.
285 table Any rendition information should be supplied using the global
287 table attribute, at the table, row, or cell level as appropriate.
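
Rows 76 and 119 above say that an application must calculate the number of rows and columns when none is supplied. A minimal Python sketch of that calculation follows. It assumes the table is in the TEI namespace, that its rows are direct `row` children holding `cell` children, and that a cell may span several columns through a `cols` attribute; none of these details is stated in the rows above, and the function name `table_dimensions` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def table_dimensions(table_xml):
    """Return (rows, cols) for a TEI table, preferring supplied attributes."""
    table = ET.fromstring(table_xml)
    rows = table.findall(f"{{{TEI_NS}}}row")

    # Use the declared counts when present; otherwise calculate them.
    declared_rows = table.get("rows")
    declared_cols = table.get("cols")

    n_rows = int(declared_rows) if declared_rows is not None else len(rows)

    if declared_cols is not None:
        n_cols = int(declared_cols)
    else:
        # Width of the widest row; a cell counts once unless it declares a span.
        widths = [
            sum(int(cell.get("cols", "1"))
                for cell in row.findall(f"{{{TEI_NS}}}cell"))
            for row in rows
        ]
        n_cols = max(widths, default=0)
    return n_rows, n_cols


SAMPLE = """
<table xmlns="http://www.tei-c.org/ns/1.0">
  <head>Sample</head>
  <row><cell>a</cell><cell>b</cell><cell>c</cell></row>
  <row><cell cols="2">d</cell><cell>e</cell></row>
</table>
"""

print(table_dimensions(SAMPLE))
```

Taking the widest row as the column count is one possible reading of "the number of columns in each row"; a stricter application might instead report rows whose widths disagree.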

listEvent.xml#13000

# id text
2 listEvent list of events
6 listEvent contains a list of descriptions, each of which provides information about an identifiable event.

altGrp.xml#13000

# id text
2 altGrp alternation group
14 altGrp groups a collection of
51 altGrp states whether the alternations gathered in this collection are exclusive or inclusive.
167 altGrp Any number of alternations, pointers or extended pointers.

att.duration.iso.xml#13000

# id text
2 att.duration.iso provides attributes for recording normalized temporal durations.
56 att.duration.iso are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. In order to represent a time range by a duration and its ending time the
62 att.duration.iso form, no claim is made that the form in the source text is incorrect; the regularized form is simply that chosen as the main form for purposes of unifying variant forms under a single heading.

att.witnessed.xml#13000

# id text
7 att.witnessed witness or witnesses
17 att.witnessed contains a space-delimited list of one or more pointers indicating the witnesses which attest to a given reading.
37 att.witnessed This attribute may occur both within an apparatus gathering variant readings in the transcription of an individual witness and within an apparatus gathering readings from different witnesses.
39 att.witnessed Additional descriptions or alternative versions of the sigla referenced may be supplied as the content of a child

att.personal.xml#13000

# id text
14 att.personal common attributes for those elements which form part of a name usually, but not necessarily, a personal name.
33 att.personal indicates whether the name component is given in full, as an abbreviation or simply as an initial.
56 att.personal the name component is spelled out in full.
82 att.personal the name component is given in an abbreviated form.
108 att.personal the name component is indicated only by one initial.
128 att.personal specifies the sort order of the name component in relation to others within the name.

moduleRef.xml#13000

# id text
40 moduleRef are only allowed when an external module is being loaded
47 moduleRef specifies a default prefix which will be prepended to all patterns from the imported module
62 moduleRef Use of this attribute avoids name collisions (and thus invalid schemas) when the external schema being mixed in with TEI uses a name the TEI or some other included external schema already uses for a pattern.
68 moduleRef supplies a list of the elements which are to be copied from the specified module into the schema being defined.
75 moduleRef supplies a list of the elements which are not to be copied from the specified module into the schema being defined.
84 moduleRef the name of a TEI module
105 moduleRef refers to a non-TEI module of RELAX NG code by external location
123 moduleRef This includes all objects available from the linking module.
139 moduleRef This includes all elements available from the linking module except for the
154 moduleRef elements from the linking module.
169 moduleRef A TEI module is identified by the name supplied as value for the
175 moduleRef attribute may be used to specify an online source from which the specification of that module may be read. A URI may alternatively be supplied in the case of a non-TEI module, and this is expected to be written as a RELAX NG schema.

per.xml#13000

# id text
3 per person
15 per contains an indication of the grammatical person (1st, 2nd, 3rd, etc.) associated with a given inflected form in a dictionary.
99 per gram type="person"

monogr.xml#13023

# id text
14 monogr contains bibliographic elements describing an item (e.g. a book or journal) published as an independent item (i.e. as a separate physical object).

sp.xml#13000

# id text
15 sp contains an individual speech in a performance text, or a passage presented as such in a prose or verse text.
140 sp Lines or paragraphs, stage directions, and phrase-level elements.

rhyme.xml#13000

# id text
26 rhyme provides a label (usually a single letter) to identify which part of a rhyme scheme this rhyming string instantiates.
47 rhyme elements with the same value for their
49 rhyme attribute are assumed to rhyme with each other. The scope is defined by the nearest ancestor element for which the
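
Rows 26 to 49 above describe rhyming strings that carry a label, where strings with the same label are assumed to rhyme with each other. A minimal Python sketch of that grouping follows. It assumes the labelled strings have already been extracted from a single scope as (text, label) pairs; the function name `rhyme_groups` and the sample data are hypothetical.

```python
from collections import defaultdict


def rhyme_groups(labelled_words):
    """Group rhyming strings by label, keeping document order within a group."""
    groups = defaultdict(list)
    for word, label in labelled_words:
        groups[label].append(word)
    return dict(groups)


# Hypothetical rhyme words from one stanza with the scheme a b a b.
words = [("night", "a"), ("day", "b"), ("light", "a"), ("way", "b")]
print(rhyme_groups(words))
```

Because the scope is set by an ancestor element, a real application would run this once per scope rather than once per document.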

defaultVal.xml#13000

# id text
2 defaultVal default value
13 defaultVal specifies the default declared value for an attribute.
52 defaultVal any legal declared value or TEI-defined keyword

expan.xml#13000

# id text
72 expan The content of this element should usually be a complete word or phrase. The
76 expan module may be used to mark up sequences of letters supplied within such an expansion.

msItem.xml#13000

# id text
2 msItem manuscript item
13 msItem describes an individual work or item within the intellectual content of a manuscript or manuscript part.
56 msItem identifies the text types or classifications applicable to this item by pointing to other elements or resources defining the classification concerned.

punctuation.xml#13008

# id text
2 punctuation specifies editorial practice adopted with respect to punctuation marks in the original.
16 punctuation indicates whether or not punctuation marks have been retained as content within the text.
23 punctuation no punctuation marks have been retained
27 punctuation some punctuation marks have been retained
31 punctuation all punctuation marks have been retained
44 punctuation punctuation marks are captured inside adjacent elements
48 punctuation punctuation marks are captured outside adjacent elements

hyphenation.xml#13000

# id text
4 hyphenation summarizes the way in which hyphenation in a source text has been treated in an encoded version of it.
42 hyphenation indicates whether or not end-of-line hyphenation has been retained in a text.
65 hyphenation all end-of-line hyphenation has been retained, even though the lineation of the original may not have been.
81 hyphenation end-of-line hyphenation has been retained in some cases.
97 hyphenation all soft end-of-line hyphenation has been removed: any remaining end-of-line hyphenation should be retained.
113 hyphenation all end-of-line hyphenation has been removed: any remaining hyphenation occurred within the line.

time.xml#13000

# id text
4 time contains a phrase defining a time of day in any format.

titlePart.xml#13000

# id text
2 titlePart contains a subsection or division of the title of a work, as indicated on a title page.
28 titlePart specifies the role of this subdivision of the title.
51 titlePart main title of the work
95 titlePart alternate
107 titlePart alternative title of the work
123 titlePart abbreviated form of title

att.deprecated.xml#13001

# id text
6 att.deprecated provides a date before which the construct being defined will not be removed.
24 att.deprecated The value of this attribute should represent a date (in standard
26 att.deprecated format) which is later than the date on which the attribute is added to an ODD. Technically, this attribute asserts only the intent to leave a construct in future releases of the markup language being defined up to at least the specified date, and makes no assertion about what happens past that date. In practice, the expectation is that the construct will be removed from future releases of the markup language being defined sometime shortly after the
32 att.deprecated date that is in the past. An ODD processor will typically warn users about constructs which have a
34 att.deprecated date that is in the future. E.g., the documentation for such a construct might include the phrase
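
Rows 24 to 34 above distinguish constructs whose removal date lies in the past from those whose date is still in the future, for which a processor typically issues a warning. A minimal Python sketch of that comparison follows. It assumes the date is written as yyyy-mm-dd; the function name `deprecation_status`, its parameter names, and the three status strings are hypothetical.

```python
from datetime import date


def deprecation_status(valid_until, today):
    """Classify a construct by comparing its removal date with today's date."""
    if valid_until is None:
        # No date supplied: the construct is not marked as deprecated.
        return "current"
    deadline = date.fromisoformat(valid_until)
    if deadline < today:
        # The guaranteed period has ended; removal may follow at any time.
        return "expired"
    # Still guaranteed to exist, but users should be warned.
    return "deprecated"


print(deprecation_status("2016-09-18", date(2016, 1, 1)))
```

A construct whose date equals today is still reported as deprecated here, since the date is described as one before which the construct will not be removed.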

domain.xml#13000

# id text
2 domain domain of use
14 domain describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc.
37 domain categorizes the domain of use.
104 domain business and work place
120 domain education
202 domain Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose.
204 domain The list presented here is primarily for illustrative purposes.

listPlace.xml#13000

# id text
2 listPlace list of places
12 listPlace contains a list of places, optionally followed by a list of relationships (other than containment) defined amongst them.

addSpan.xml#13000

# id text
2 addSpan added span of text
14 addSpan marks the beginning of a longer sequence of text added by an author, scribe, annotator or corrector (see also
95 addSpan Both the beginning and the end of the added material must be marked; the beginning by the

binary.xml#13000

# id text
2 binary binary value
14 binary represents the value part of a feature-value specification which can contain either of exactly two possible values.
40 binary supplies a binary value.
57 binary This attribute has a datatype of data.truthValue, which may be represented by the values
91 binary The value attribute may take any value permitted for attributes of the W3C datatype Boolean: this includes for example the strings

att.duration.xml#13000

# id text
3 att.duration provides attributes for normalization of elements that contain datable events.
28 att.duration class. In general, the possible values of attributes restricted to the W3C datatypes form a subset of those values available via the ISO 8601 standard. However, the greater expressiveness of the ISO datatypes is rarely needed, and there exists much greater software support for the W3C datatypes.

choice.xml#13000

# id text
4 choice groups a number of alternative encodings for the same point in a text.
79 choice element all represent alternative ways of encoding the same sequence, it is natural to think of them as mutually exclusive. However, there may be cases where a full representation of a text requires the alternative encodings to be considered as parallel.
85 choice Where the purpose of an encoding is to record multiple witnesses of a single work, rather than to identify multiple possible encoding decisions at a given point, the

vMerge.xml#13000

# id text
2 vMerge merged collection of values
14 vMerge represents a feature value which is the result of merging together the feature values contained by its children, using the organization specified by the
133 vMerge This example returns a list, concatenating the indeterminate value with the set of values masculine, neuter and feminine.

rs.xml#13092

# id text
2 rs referencing string
14 rs contains a general purpose name or referring string.

group.xml#13000

# id text
4 group contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc.

att.pointing.group.xml#13000

# id text
2 att.pointing.group defines a set of attributes common to all elements which enclose groups of pointer elements.
40 att.pointing.group If this attribute is supplied every element specified as a target must be contained within the element or elements named by it. An application may choose whether or not to report failures to satisfy this constraint as errors, but may not access an element of the right identifier but in the wrong context. If this attribute is not supplied, then target elements may appear anywhere within the target document.
134 att.pointing.group The number of separate values must match the number of values in the
144 att.pointing.group element may be needed to accomplish this). It should also match the number of values in the
146 att.pointing.group attribute, of the current element, if one has been specified.

citedRange.xml#13000

# id text
42 citedRange . For example, if the citation has

att.docStatus.xml#13000

# id text
6 att.docStatus describes the status of a document either currently or, when associated with a dated element, at the time indicated.

move.xml#13000

# id text
89 move character moves on stage
105 move specifies the direction of a stage movement.
134 move stage left
160 move stage right
186 move centre stage
206 move upper stage left
226 move performance
236 move identifies the performance or performances in which this movement occurred as specified by pointing to one or more

case.xml#13000

# id text
4 case contains grammatical case information given by a dictionary for a given form.
109 case May contain character data and phrase-level elements. Typical values will be of the form
120 case gram type="case"

div3.xml#13000

# id text
2 div3 level-3 text division
16 div3 contains a third-level subdivision of the front, body, or back of a text.
162 div3 any sequence of low-level structural elements, possibly grouped into lower subdivisions.

datatype.xml#13000

# id text
4 datatype specifies the declared value for an attribute, by referring to any datatype defined by the chosen schema language.
32 datatype minimum number of occurrences
44 datatype indicates the minimum number of times this datatype may occur in the specification of the attribute being defined
65 datatype maximum number of occurrences
77 datatype indicates the maximum number of times this datatype may occur in the specification of the attribute being defined
151 datatype The encoding in the following example requires that the attribute being defined contain at least two URIs in its value, as is the case for the
164 datatype In the TEI scheme, most datatypes are expressed using pre-defined TEI macros, which map a name in the form

encodingDesc.xml#13000

# id text
16 encodingDesc documents the relationship between an electronic text and the source or sources from which it was derived.

att.entryLike.xml#13000

# id text
18 att.entryLike indicates type of entry, in dictionaries with multiple types.
39 att.entryLike a main entry (default).
99 att.entryLike a reduced entry whose only function is to point to another main entry (e.g. for forms of an irregular verb or for variant spellings:
163 att.entryLike an entry for a prefix, infix, or suffix.
189 att.entryLike an entry for an abbreviation.
205 att.entryLike a supplemental entry (for use in dictionaries which issue supplements to their main work in which they include updated information about entries).
221 att.entryLike an entry for a foreign word in a monolingual dictionary.

delSpan.xml#13000

# id text
2 delSpan deleted span of text
14 delSpan marks the beginning of a longer sequence of text deleted, marked as deleted, or otherwise signaled as superfluous or spurious by an author, scribe, annotator, or corrector.
95 delSpan Both the beginning and ending of the deleted sequence must be marked: the beginning by the
101 delSpan The text deleted must be at least partially legible, in order for the encoder to be able to transcribe it. If it is not legible at all, the
103 delSpan tag should not be used. Rather, the
105 delSpan tag should be employed to signal that text cannot be transcribed, with the value of the
109 delSpan element should be used to signal the areas of text which cannot be read with confidence. See further sections
112 delSpan tag with the
125 delSpan tag should not be used for deletions made by editors or encoders. In these cases, either the
127 delSpan tag or the
129 delSpan tag should be used.

fvLib.xml#13000

# id text
14 fvLib assembles a library of reusable feature value elements (including complete feature structures).
62 fvLib A feature value library may include any number of values of any kind, including multiple occurrences of identical values such as
65 fvLib default
66 fvLib . The only thing guaranteed unique in a feature value library is the set of labels used to identify the values.

glyph.xml#13000

# id text
2 glyph character glyph
14 glyph provides descriptive information about a character glyph

data.word.xml#13000

# id text
23 data.word Attributes using this datatype must contain a single
25 data.word which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace.

data.enumerated_data-dot-name.xml#13071

# id text
2 data.enumerated defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities.
20 data.enumerated Attributes using this datatype must contain a single
22 data.enumerated matching the rules for XML names: i.e., a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops.
24 data.enumerated Typically, the list of documented possibilities will be provided (or exemplified) by a value list in the associated attribute specification, expressed with a
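
Row 22 above paraphrases the rules for XML names. A minimal Python sketch of a check along those lines follows. It is an ASCII-only approximation: it assumes the "few punctuation characters" allowed at the start are the underscore and the colon, and it ignores the many non-ASCII letters that the XML Specification also admits. The names `XML_NAME` and `is_xml_name` are hypothetical.

```python
import re

# ASCII-only approximation of an XML name: a letter, underscore, or colon,
# followed by letters, digits, hyphens, underscores, colons, or full stops.
XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_xml_name(token):
    """Return True if the whole token matches the approximate pattern."""
    return XML_NAME.fullmatch(token) is not None


for candidate in ["isoschematron", "att.global", "2nd", "two words"]:
    print(candidate, is_xml_name(candidate))
```

A production check would normally rely on the validating schema or an XML library rather than a hand-written pattern.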

model.lPart.xml#13000

# id text
2 model.lPart groups phrase-level elements which may appear within verse only.

interpretation.xml#13000

# id text
4 interpretation describes the scope of any analytic or interpretive information added to the text in addition to the transcription.

att.global.rendition.xml#

# id text
2 att.global.rendition provides rendering attributes common to all elements in the TEI encoding scheme.
7 att.global.rendition rendition
17 att.global.rendition indicates how the element in question was rendered or presented in the source text.
58 att.global.rendition These Guidelines make no binding recommendations for the values of the
60 att.global.rendition attribute; the characteristics of visual presentation vary too much from text to text and the decision to record or ignore individual characteristics varies too much from project to project. Some potentially useful conventions are noted from time to time at appropriate points in the Guidelines. The values of the
62 att.global.rendition attribute are a set of sequence-indeterminate individual tokens separated by whitespace.
85 att.global.rendition contains an expression in some formal style definition language which defines the rendering or presentation used for this element in the source text
107 att.global.rendition attribute may contain whitespace. This attribute is intended for recording inline stylistic information concerning the source, not any particular output.
109 att.global.rendition The formal language in which values for this attribute are expressed may be specified using the
111 att.global.rendition element in the TEI header.
116 att.global.rendition points to a description of the rendering or presentation used for this element in the source text.
168 att.global.rendition attribute defined for XHTML but with the important distinction that its function is to describe the appearance of the source text, not necessarily to determine how that text should be presented on screen or paper.
178 att.global.rendition element defining the intended rendition in terms of some appropriate style language, as indicated by the
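
Row 62 above describes attribute values that are a set of sequence-indeterminate tokens separated by whitespace, so two values that differ only in token order or spacing mean the same thing. A minimal Python sketch of such a comparison follows; the function names `rendition_tokens` and `same_rendition` and the sample values are hypothetical.

```python
def rendition_tokens(value):
    """Split a whitespace-separated value into an unordered set of tokens."""
    return frozenset(value.split())


def same_rendition(a, b):
    """Compare two values while ignoring token order and extra whitespace."""
    return rendition_tokens(a) == rendition_tokens(b)


print(same_rendition("bold italic", "italic   bold"))
```

This treatment does not suit the attribute described in rows 85 to 107, whose value is a single expression that may itself contain whitespace.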

fLib.xml#13000

# id text
78 fLib attribute may be used to supply an informal name to categorize the library's contents.

data.xmlName.xml#13221

# id text
8 data.xmlName The rules defining an XML name form a part of the XML Specification.

terrain.xml#13242

# id text
4 terrain contains information about the physical terrain of a place.

att.datable.xml#13228

# id text
2 att.datable provides attributes for normalization of elements that contain dates, times, or datable events.
23 att.datable indicates the system or calendar to which the date represented by the content of this element belongs.
43 att.datable @calendar indicates the system or calendar to which the date represented by the content of this element belongs, but this
69 att.datable ) defines the calendar system of the date in the original material defined by the parent element,
71 att.datable the calendar to which the date is normalized.
75 att.datable supplies a pointer to some location defining a named period of time within which the datable item is understood to have occurred.
101 att.datable classes. In general, the possible values of attributes restricted to the W3C datatypes form a subset of those values available via the ISO 8601 standard. However, the greater expressiveness of the ISO datatypes may not be needed, and there exists much greater software support for the W3C datatypes.

specList.xml#13000

# id text
2 specList specification list
12 specList marks where a list of descriptions is to be inserted into the prose documentation.

schemaSpec.xml#13000

# id text
52 schemaSpec specifies entry points to the schema, i.e. which patterns may be used as the root of documents conforming to it.
69 schemaSpec TEI
73 schemaSpec specifies a default prefix which will be prepended to all patterns relating to TEI elements, unless otherwise stated.
94 schemaSpec Use of this attribute allows an external schema which has an element with the same local name as a TEI element to be mixed in.
107 schemaSpec target language
117 schemaSpec specifies which language to use when creating the objects in a schema if names for elements or attributes are available in more than one language
136 schemaSpec documentation language
146 schemaSpec specifies which languages to use when creating documentation if the description for an element, attribute, class or macro is available in more than one language
180 schemaSpec combines references to modules, individual element or macro declarations, and specification groups together to form a unified schema. The processing of the

macro.schemaPattern.xml#13000

# id text
2 macro.schemaPattern provides a pattern to match elements from the chosen schema language

tns.xml#13000

# id text
15 tns indicates the grammatical tense associated with a given inflected form in a dictionary.
99 tns gram type="tense"

data.count.xml#13000

# id text
2 data.count defines the range of attribute values used for a non-negative integer value used as a count.

hyph.xml#13000

# id text
2 hyph hyphenation
14 hyph contains a hyphenated form of a dictionary headword, or hyphenation information in some other form.

att.spanning.xml#13000

# id text
2 att.spanning provides attributes for elements which delimit a span of text by pointing mechanisms rather than by enclosing it.
18 att.spanning indicates the end of a span initiated by the element bearing this attribute.
50 att.spanning The span is defined as running in document order from the start of the content of the pointing element to the end of the content of the element pointed to by the
52 att.spanning attribute (if any). If no value is supplied for the attribute, the assumption is that the span is coextensive with the pointing element. If no content is present, the assumption is that the starting point of the span is immediately following the element itself.

watermark.xml#13000

# id text
4 watermark contains a word or phrase describing a watermark or similar device.

macro.xtext.xml#13000

# id text
2 macro.xtext extended text
14 macro.xtext defines a sequence of character data and gaiji elements.

surplus.xml#13000

# id text
4 surplus marks text present in the source which the editor believes to be superfluous or redundant.
18 surplus one or more words indicating why this text is believed to be superfluous, e.g.

sourceDoc.xml#13000

# id text
2 sourceDoc contains a transcription or other representation of a single source document potentially forming part of a
4 sourceDoc or collection of sources.
49 sourceDoc for TEI documents containing only page images, or for documents containing both images and transcriptions. Transcriptions may be provided within the
51 sourceDoc elements making up a source document, in parallel with them as part of a
53 sourceDoc element, or in both places if the encoder wishes to distinguish these two modes of transcription.

row.xml#13000

# id text
4 row contains one row of a table.

att.tableDecoration.xml#13000

# id text
20 att.tableDecoration indicates the kind of information held in this cell or in each cell of this row.
74 att.tableDecoration When this attribute is specified on a row, its value is the default for all cells in this row. When specified on a cell, its value overrides any default specified by the
109 att.tableDecoration indicates the number of rows occupied by this cell or row.
129 att.tableDecoration A value greater than one indicates that this cell
130 att.tableDecoration spans several rows. Where several cells span multiple rows, it may be more convenient to use nested tables.
157 att.tableDecoration indicates the number of columns occupied by this cell or row.
177 att.tableDecoration A value greater than one indicates that this cell or row spans several columns. Where an initial cell spans an entire row, it may be better treated as a heading.
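
The override rule in the rows above (a value given on a row is the default for its cells, and a value given on a cell wins) can be sketched as follows. This is a minimal illustration, not part of the specification: the function names, the `DEFAULT_ROLE` fallback, and the list-of-spans representation are all assumptions made for the example.

```python
# Hypothetical fallback used when neither the row nor the cell states a role.
DEFAULT_ROLE = "data"


def effective_role(row_role=None, cell_role=None):
    """Return the role that applies to a single cell.

    A role on the cell overrides the row-level default; a role on the
    row applies to every cell that does not state its own.
    """
    if cell_role is not None:
        return cell_role
    if row_role is not None:
        return row_role
    return DEFAULT_ROLE


def columns_occupied(col_spans):
    """Sum the column counts of the cells in one row.

    Each entry is the number of columns a cell occupies, so a value
    greater than one means that cell spans several columns.
    """
    return sum(col_spans)
```

A row marked as labels with one cell marked as data would resolve to `effective_role("label", "data")` for that cell and `effective_role("label")` for the others.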

divGen.xml#13000

# id text
2 divGen automatically generated text division
14 divGen indicates the location at which a textual division generated automatically by a text-processing application is to appear.
40 divGen specifies what type of generated text division (e.g. index, table of contents, etc.) is to appear.
59 divGen an index is to be generated and inserted at this point.
77 divGen a table of contents
91 divGen a list of figures
107 divGen a list of tables
138 divGen One use for this element is to allow document preparation software to generate an index and insert it in the appropriate place in the output. The example below assumes that the
142 divGen elements in the text has been used to specify index entries for the two generated indexes, named NAMES and THINGS:
234 divGen is to specify the location of an automatically produced table of contents:
250 divGen This element is intended primarily for use in document production or manipulation, rather than in the transcription of pre-existing materials; it makes it easier to specify the location of indices, tables of contents, etc., to be generated by text preparation or word processing software.

data.duration.w3c.xml#13000

# id text
2 data.duration.w3c defines the range of attribute values available for representation of a duration in time using W3C datatypes.
60 data.duration.w3c A duration is expressed as a sequence of number-letter pairs, preceded by the letter P; the letter gives the unit and may be Y (year), M (month), D (day), H (hour), M (minute), or S (second), in that order. The numbers are all unsigned integers, except for the
64 data.duration.w3c as the decimal point). If any number is
66 data.duration.w3c , then that number-letter pair may be omitted. If any of the H (hour), M (minute), or S (second) number-letter pairs are present, then the separator
69 data.duration.w3c time
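
The duration syntax described in the rows above can be checked with a regular expression. The sketch below is an assumption-laden illustration rather than the datatype's formal definition: it relies on the XML Schema convention that only the seconds value may carry a decimal fraction and that a `T` separator precedes the time part, and it ignores negative durations. The name `parse_duration` is invented for the example.

```python
import re

# P, then optional date pairs in the order Y, M, D, then an optional
# T followed by time pairs in the order H, M, S.
DURATION = re.compile(
    r"P(?=.)"                               # at least one pair must follow P
    r"(?:(?P<years>\d+)Y)?"
    r"(?:(?P<months>\d+)M)?"
    r"(?:(?P<days>\d+)D)?"
    r"(?:T(?=\d)"                           # T needs at least one time pair
    r"(?:(?P<hours>\d+)H)?"
    r"(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?"     # only seconds may be fractional
    r")?"
)


def parse_duration(value):
    """Return the number-letter pairs as a dict, or None if invalid."""
    match = DURATION.fullmatch(value)
    if match is None:
        return None
    return {
        unit: float(number) if unit == "seconds" else int(number)
        for unit, number in match.groupdict().items()
        if number is not None            # omitted pairs are simply absent
    }
```

The position of `M` relative to `T` decides its meaning: `P15M` is fifteen months, while `PT15M` is fifteen minutes.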

att.citing.xml#13001

# id text
2 att.citing provides attributes for specifying the specific part of a bibliographic item being cited.
70 att.citing the element contains a page number or page range.
86 att.citing the element contains a line number or line range.

titleStmt.xml#13000

# id text
2 titleStmt title statement
16 titleStmt groups information about the title of a work and those responsible for its content.

g.xml#13000

# id text
2 g character or glyph
38 g points to a description of the character or glyph intended.
99 g The medieval brevigraph per could similarly be considered as an individual glyph, defined in a
102 g per
108 g The name
111 g gaiji
112 g , which is the Japanese term for a non-standardized character or glyph.

am.xml#13000

# id text
12 am contains a sequence of letters or signs present in an abbreviation which are omitted or replaced in the expanded form of the abbreviation.

spanGrp.xml#13000

# id text
2 spanGrp span group
14 spanGrp collects together span tags.

att.patternReplacement.xml#13000

# id text
82 att.patternReplacement etc. are references to the corresponding group in the regular expression specified by
84 att.patternReplacement (counting open parentheses, left to right). Processors are expected to replace them with whatever matched the corresponding group in the regular expression.
86 att.patternReplacement If a digit preceded by a dollar sign is needed in the actual replacement pattern (as opposed to being used as a back reference), the dollar sign must be written as
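
The back-reference behaviour described above can be sketched as follows. This is an illustration only: the function and parameter names are invented, Python's `re` dialect stands in for whatever regular expression language a real processor uses, and the escaped form of a literal dollar sign is not handled because the row above does not show it.

```python
import re


def expand_replacement(match_pattern, replacement_pattern, text):
    """Fill $1, $2, ... in replacement_pattern from the groups of a match.

    Groups are numbered by their opening parenthesis, left to right.
    Returns None when match_pattern does not match anywhere in text.
    """
    match = re.search(match_pattern, text)
    if match is None:
        return None

    def group_value(reference):
        # reference.group(1) is the digit that followed the dollar sign.
        value = match.group(int(reference.group(1)))
        return value if value is not None else ""

    return re.sub(r"\$(\d)", group_value, replacement_pattern)
```

With two groups capturing a chapter and a section number, a replacement pattern such as `#chapter$1-section$2` is filled in from whatever those groups matched.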

data.temporal.w3c.xml#13000

# id text
39 data.temporal.w3c If it is likely that the value used is to be compared with another, then a time zone indicator should always be included, and only the dateTime representation should be used.
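
The reason for this advice is that two values can only be compared reliably when each one fixes an instant. A minimal sketch, assuming ISO-style dateTime strings with numeric offsets and using Python's standard `datetime` module; the function name `same_instant` is invented for the example.

```python
from datetime import datetime


def same_instant(first, second):
    """Report whether two dateTime strings denote the same moment.

    Both values must carry a time zone indicator; without one the
    instant is ambiguous, so the comparison is refused.
    """
    a = datetime.fromisoformat(first)
    b = datetime.fromisoformat(second)
    if a.tzinfo is None or b.tzinfo is None:
        raise ValueError("both values need a time zone indicator")
    return a == b
```

Noon at `+02:00` and ten o'clock at `+00:00` on the same day compare as equal here, even though the strings differ.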

constraint.xml#13117

# id text
2 constraint constraint rules
4 constraint the formal rules of a constraint

div4.xml#13000

# id text
2 div4 level-4 text division
16 div4 contains a fourth-level subdivision of the front, body, or back of a text.
158 div4 any sequence of low-level structural elements, possibly grouped into lower subdivisions.

meeting.xml#13000

# id text
2 meeting contains the formalized descriptive title for a meeting or conference, for use in a bibliographic description for an item derived from such a meeting, or as a heading or preamble to publications emanating from it.

extent.xml#13000

# id text
4 extent describes the approximate size of a text stored on some carrier medium or of some other object, digital or non-digital, specified in any convenient units.
40 extent element may be used to supply normalised or machine tractable versions of the size or sizes concerned.

specGrp.xml#13236

# id text
2 specGrp specification group
99 specGrp A specification group is referenced by means of its

att.identified.xml#13000

# id text
36 att.identified : the value of the module attribute ("
37 att.identified ") should correspond to an existing module, via a moduleSpec or moduleRef
91 att.identified supplies a name for the module in which this object is to be declared.
110 att.identified indicates the current status of the object identified with respect to the current version of the TEI Guidelines.
119 att.identified the item is not recommended for use, and may be withdrawn at a future release.
123 att.identified the item is new and still under review.
127 att.identified the item has changed significantly since the preceding version.
131 att.identified the item has not recently changed and is not expected to do so except for correction of any errors.

actor.xml#13000

# id text
2 actor contains the name of an actor appearing within a cast list.
60 actor This element should be used only to mark the name of the actor as given in the source. Chapter

model.persNamePart.xml#13000

# id text
2 model.persNamePart groups elements which form part of a personal name.

data.sex.xml#13000

# id text
20 data.sex Values for attributes using this datatype may be locally defined by a project, or may refer to an external standard, such as vCard's sex property

numeric.xml#13000

# id text
2 numeric numeric value
14 numeric represents the value part of a feature-value specification which contains a numeric value or range.
38 numeric supplies a lower bound for the numeric value represented, and also (if
71 numeric supplies an upper bound for the numeric value represented.
90 numeric specifies whether the value represented should be truncated to give an integer value.
113 numeric This represents the numeric value 42.
148 numeric attribute had the value FALSE, this example would represent any of the infinite number of numeric values between 42.45 and 50.0
154 numeric attribute in the absence of a value for the

said.xml#13000

# id text
12 said indicates passages thought or spoken aloud, whether explicitly indicated in the source or not, whether directly or indirectly reported, whether by real people or fictional characters.
61 said The value
63 said indicates the encoded passage was expressed outwardly (whether spoken, signed, sung, screamed, chanted, etc.); the value
127 said The value
129 said indicates the speech or thought is represented directly; the value

geogFeat.xml#13000

# id text
2 geogFeat geographical feature name

restore.xml#13000

# id text
4 restore indicates restoration of text to an earlier state by cancellation of an editorial or authorial marking or instruction.
36 restore attribute categorizes the way in which the cancelled intervention has been indicated, for example by means of a marginal note, over-inking, additional markup, etc.

elementSpec.xml#13043

# id text
13 elementSpec documents the structure, content, and purpose of a single element type.
69 elementSpec specifies a default prefix which will be prepended to all patterns relating to the element, unless otherwise stated.

decoDesc.xml#13000

# id text
13 decoDesc contains a description of the decoration of a manuscript, either as a sequence of paragraphs, or as a sequence of topically organized

quote.xml#13000

# id text
2 quote quotation
14 quote contains a phrase or passage attributed by the narrator or author to some agency external to the text.
61 quote If a bibliographic citation is supplied for the source of a quotation, the two may be grouped using the

particDesc.xml#13000

# id text
14 particDesc describes the identifiable speakers, voices, or other participants in any kind of text or other persons named or otherwise referred to in a text, edition, or metadata.
83 particDesc This example shows both a very simple person description, and a very detailed one, using some of the more specialized elements from the module for Names and Dates.
161 particDesc May contain a prose description organized as paragraphs, or a structured list of persons and person groups, with an optional formal specification of any relationships amongst them.

model.global.spoken.xml#13000

# id text
2 model.global.spoken groups elements which may appear globally within spoken texts.

attDef.xml#13189

# id text
72 attDef should have a closed valList or a datatype
79 attDef It does not make sense to make "
80 attDef " the default value of @
97 attDef the default value of the @
98 attDef attribute is not among the closed list of possible values
108 attDef the default value of the @
109 attDef attribute is not among the closed list of possible values
181 attDef namespace
193 attDef specifies the namespace to which this attribute belongs

additions.xml#13000

# id text
4 additions contains a description of any significant additions found within a manuscript, such as marginalia or other annotations.

catRef.xml#13000

# id text
2 catRef category reference
16 catRef specifies one or more defined categories within some taxonomy or text typology.
41 catRef identifies the classification scheme within which the set of categories concerned is defined, for example by a
125 catRef The scheme attribute need be supplied only if more than one taxonomy has been declared.

att.rdgPart.xml#13000

# id text
18 att.rdgPart witness or witnesses
28 att.rdgPart contains a space-delimited list of one or more sigla indicating the witnesses to this reading beginning or ending at this point.

fw.xml#13000

# id text
14 fw contains a running head (e.g. a header, footer), catchword, or similar material appearing on the current page.
38 fw classifies the material encoded according to some useful typology.
57 fw a running title at the top of the page
73 fw a running title at the bottom of the page
89 fw page number
99 fw a page number or foliation symbol
115 fw line number
125 fw a line number, either of prose or poetry
147 fw a signature or gathering symbol
214 fw element is intended for cases where the running head changes from page to page, or where details of page layout and the internal structure of the running heads are of paramount importance.

abstract.xml#13092

# id text
2 abstract contains a summary or formal abstract prefixed to an existing source document by the encoder.
28 abstract The abstract for a born digital document should be located within the
30 abstract ; this element is provided for cases where no abstract is available in the original source.

model.dimLike.xml#13000

# id text
2 model.dimLike groups elements which describe a measurement forming part of the physical dimensions of some object.

segmentation.xml#13000

# id text
4 segmentation describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.

data.enumerated.xml#13071

# id text
2 data.enumerated defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities.
20 data.enumerated Attributes using this datatype must contain a single word matching the pattern defined for this datatype: for example it cannot include whitespace but may begin with digits.
22 data.enumerated Typically, the list of documented possibilities will be provided (or exemplified) by a value list in the associated attribute specification, expressed with a

classCode.xml#13000

# id text
2 classCode classification code
14 classCode contains the classification code used for this text in some standard classification system.

default.xml#13000

# id text
2 default default feature value
14 default represents the value part of a feature-value specification which contains a defaulted value.

accMat.xml#13000

# id text
2 accMat accompanying material
14 accMat contains details of any significant additional material which may be closely associated with the manuscript being described, such as non-contemporaneous documents or fragments bound in with the manuscript at some earlier historical period.

att.coordinated.xml#13000

# id text
16 att.coordinated indicates the element within a transcription of the text containing at least the start of the writing represented by this zone or surface.
25 att.coordinated gives the x coordinate value for the upper left corner of a rectangular space.
42 att.coordinated gives the y coordinate value for the upper left corner of a rectangular space.
59 att.coordinated gives the x coordinate value for the lower right corner of a rectangular space.
76 att.coordinated gives the y coordinate value for the lower right corner of a rectangular space.
93 att.coordinated identifies a two dimensional area within the bounding box specified by the other attributes by means of a series of pairs of numbers, each of which gives the x,y coordinates of a point on a line enclosing the area.
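
The relationship between the bounding box and the series of coordinate pairs can be sketched as below. This is a hedged illustration: the parameter names `ulx`, `uly`, `lrx`, and `lry` are chosen here to mirror the upper-left and lower-right corners described above, and the space-separated `x,y` input format is an assumption of the example.

```python
def parse_points(points):
    """Turn a string of 'x,y' pairs separated by whitespace into tuples."""
    pairs = []
    for token in points.split():
        x, y = token.split(",")
        pairs.append((float(x), float(y)))
    return pairs


def inside_box(pairs, ulx, uly, lrx, lry):
    """Check that every point lies within the rectangular bounding box.

    (ulx, uly) is the upper left corner and (lrx, lry) the lower right
    corner, with y increasing downwards.
    """
    return all(ulx <= x <= lrx and uly <= y <= lry for x, y in pairs)
```

A triangle given as `10,10 50,10 50,40` fits inside a box running from (0, 0) to (100, 100), but not inside one whose right edge is at x = 40.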

date.xml#13000

# id text
4 date contains a date in any format.

att.dimensions.xml#13000

# id text
70 att.dimensions lines of text
92 att.dimensions characters of text
125 att.dimensions indicates the size of the object concerned using a project-specific vocabulary combining quantity and units in a single string of words.
144 att.dimensions characterizes the precision of the values specified by the other attributes.

model.common.xml#13000

# id text
17 model.common This class defines the set of chunk- and inter-level elements; it is used in many content models, including those for textual divisions.

att.global.responsibility.xml#13093

# id text
2 att.global.responsibility provides attributes indicating the agency responsible for some aspect of the text, the markup or something asserted by the markup, and the degree of certainty associated with it.
8 att.global.responsibility certainty
18 att.global.responsibility signifies the degree of certainty associated with the intervention or interpretation.
47 att.global.responsibility indicates the agency responsible for the intervention or interpretation, for example an editor or transcriber.
67 att.global.responsibility pointing to a person or organization is likely to be somewhat ambiguous with regard to the nature of the responsibility. For this reason, we recommend that
79 att.global.responsibility or similar element which clarifies the exact role played by the agent. Pointing to multiple
81 att.global.responsibility s allows the encoder to specify clearly each of the roles played in part of a TEI file (creating, transcribing, encoding, editing, proofing etc.).

vocal.xml#13000

# id text
52 vocal The value
54 vocal indicates that the vocal effect is repeated several times rather than just occurring once.

att.datcat.xml#13000

# id text
6 att.datcat attributes which are used to align XML elements or attributes with the appropriate Data Categories (DCs) defined by the ISO 12620:2009 standard and stored in the Web repository called ISOcat at
19 att.datcat contains a PID (persistent identifier) that aligns the content of the given element or the value of the given attribute with the appropriate simple Data Category (or categories) in ISOcat.
29 att.datcat relates the feature name to the data category "partOfSpeech" and
31 att.datcat the feature value to the data category "commonNoun". Both these data categories reside in the ISOcat DCR at
42 att.datcat ISO 12620:2009 is a standard describing the data model and procedures for a Data Category Registry (DCR). Data categories are defined as elementary descriptors in a linguistic structure. In the DCR data model each data category gets assigned a unique Persistent IDentifier (PID), i.e., a URI. Linguistic resources or preferably their schemas that make use of data categories from a DCR should refer to them using this PID. For XML-based resources, like TEI documents, ISO 12620:2009 normative Annex A gives a small Data Category Reference XML vocabulary (also available online at

nameLink.xml#13000

# id text
2 nameLink name link
6 nameLink contains a connecting phrase or link used within a name but not regarded as part of it, such as

listApp.xml#13000

# id text
2 listApp list of apparatus entries
6 listApp contains a list of apparatus entries.
31 listApp In the following example from the exegetical Yasna, the base text is encoded in the

activity.xml#13000

# id text
4 activity contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.
44 activity For more fine-grained description of participant activities during a spoken text, the

div7.xml#13000

# id text
2 div7 level-7 text division
16 div7 contains the smallest possible subdivision of the front, body or back of a text, larger than a paragraph.
133 div7 any sequence of low-level structural elements, e.g., paragraphs (

c.xml#13000

# id text
74 c element, or a sequence of graphemes to be treated as a single character. The
79 c punctuation

textDesc.xml#13000

# id text
2 textDesc text description
14 textDesc provides a description of a text in terms of its situational parameters.

geo.xml#13000

# id text
12 geo contains any expression of a set of geographic coordinates, representing a point, line, or area on the surface of the earth in some notation.
67 geo element supplied in the TEI header, using the
69 geo attribute. If no such link is made, the assumption is that the content of each

val.xml#13000

# id text
2 val value

population.xml#13242

# id text
4 population contains information about the population of a place.

ptr.xml#13000

# id text
41 ptr Only one of the attributes @target and @cRef may be supplied on

locusGrp.xml#13000

# id text
4 locusGrp groups a number of locations which together form a distinct but discontinuous item within a manuscript or manuscript part, according to a specific foliation.
21 locusGrp identifies the foliation scheme in terms of which all the locations contained by the group are specified by pointing to some

ab.xml#13000

# id text
56 ab element may be used at the encoder's discretion to mark any component-level elements in a text for which no other more specific appropriate markup is defined.

etym.xml#13000

# id text
146 etym May contain character data mixed with any other elements defined in the dictionary tag set.

model.offsetLike.xml#13000

# id text
2 model.offsetLike groups elements which can appear only as part of a place name.

gen.xml#13000

# id text
86 gen May contain character data and phrase-level elements. Typical content will be
95 gen gram type="gender"

finalRubric.xml#13000

# id text
4 finalRubric contains the string of words that denotes the end of a text division, often with an assertion as to its author and title, usually set off from the text itself by red ink, by a different size or type of script, or by some other such visual device.

cit.xml#13000

# id text
2 cit cited quotation
13 cit contains a quotation from some other document, together with a bibliographic reference to its source. In a dictionary it may contain an example text with at least one occurrence of the word form, used in the sense being described, or a translation of the headword, or an example.

textLang.xml#13000

# id text
2 textLang text language
13 textLang describes the languages and writing systems identified within the bibliographic work being described, rather than its description.
49 textLang main language
60 textLang supplies a code which identifies the chief language used in the bibliographic work.
128 textLang This element should not be used to document the languages or writing systems used for the bibliographic or manuscript description itself: as for all other TEI elements, such information should be provided by means of the global
133 textLang language tag
136 textLang . Additional documentation for the language may be provided by a
138 textLang element in the TEI Header.

summary.xml#13000

# id text
2 summary contains an overview of the available information concerning some aspect of an item (for example, its intellectual content, history, layout, typography etc.) as a complement or alternative to the more detailed information carried by more specific elements.

genName.xml#13000

# id text
2 genName generational name component
13 genName contains a name component used to distinguish otherwise similar names on the basis of the relative ages or generations of the persons named.

subst.xml#13000

# id text
12 subst groups one or more deletions with one or more additions when the combination is to be regarded as a single intervention in the text.
40 subst must have at least one child add and at least one child del

data.key.xml#13000

# id text
2 data.key defines the range of attribute values expressing a coded value by means of an arbitrary identifier, typically taken from a set of externally-defined possibilities.
20 data.key Information about the set of possible values for an attribute using this datatype may (but need not) be documented in the document header. Externally defined constraints, for example that values should be legal keys in an external database system, cannot usually be enforced by a TEI system. Similarly, because the key is externally defined, no constraint other than a requirement that it consist of Unicode characters is possible.

editor.xml#13000

# id text
3 editor contains a secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution or organization (or of several such) acting as editor, compiler, translator, etc.
52 editor Particularly where cataloguing is likely to be based on the content of the header, it is advisable to use generally recognized authority lists for the exact form of personal names.

person.xml#13000

# id text
4 person provides information about an identifiable individual, for example a participant in a language interaction, or a person referred to in a historical source.
39 person specifies a primary role or classification for the person.
57 person Values for this attribute may be locally defined by a project, using arbitrary keywords such as
62 person author
73 person specifies the sex of the person.
91 person Values for this attribute may be locally defined by a project, or may refer to an external standard, such as vCard's sex property
121 person specifies an age group for the person.
139 person Values for this attribute may be locally defined by a project, using arbitrary keywords such as
250 person May contain either a prose description organized as paragraphs, or a sequence of more specific demographic elements drawn from the

classes.xml#13000

# id text
4 classes specifies all the classes of which the documented element or class is a member or subclass.
49 classes this declaration changes the declaration of the same name in the current definition
63 classes this declaration replaces the declaration of the same name in the current definition

att.xml#13000

# id text
14 att contains the name of an attribute appearing within running text.
39 att supplies an identifier for the scheme in which this name is defined.
56 att TEI
60 att text encoding initiative
70 att this attribute is part of the TEI scheme.
135 att the attribute is part of the XHTML language
137 att the attribute is part of the XML language
211 att A namespace prefix may be used in order to specify the scheme as an alternative to specifying it via the scheme attribute: it takes precedence

height.xml#13000

# id text
38 height If used to specify the height of a non text-bearing portion of some object, for example a monument, this element conventionally refers to the axis perpendicular to the surface of the earth.

att.canonical.xml#13000

# id text
2 att.canonical provides attributes which can be used to associate a representation such as a name or title with canonical information about the object being named or referenced.
8 att.canonical provides an externally-defined means of identifying the entity (or entities) being named, using a coded value of some kind.
32 att.canonical The value may be a unique identifier from a database, or any other externally-defined string identifying the referent.
36 att.canonical attribute, since its form will depend entirely on practice within a given project. For the same reason, this attribute is not recommended in data interchange, since there is no way of ensuring that the values used by one project are distinct from those used by another. In such a situation, a preferable approach for magic tokens which follows standard practice on the Web is to use a
38 att.canonical attribute whose value is a tag URI as defined in
53 att.canonical provides an explicit means of locating a full definition for the entity being named by means of one or more URIs.
67 att.canonical The value must point directly to one or more XML elements or other resources by means of one or more URIs, separated by whitespace. If more than one is supplied the implication is that the name identifies several distinct entities.

data.text.xml#13000

# id text
2 data.text defines the range of attribute values used to express some kind of identifying string as a single sequence of Unicode characters possibly including whitespace.
10 data.text Attributes using this datatype must contain a single
12 data.text in which whitespace and other punctuation characters are permitted.

edition.xml#13012

# id text
14 edition describes the particularities of one edition of a text.

writing.xml#13000

# id text
4 writing contains a passage of written text revealed to participants in the course of a spoken text.
31 writing indicates whether the writing is revealed all at once or gradually.
49 writing The value
51 writing indicates the writing is revealed gradually; the value
53 writing that the writing is revealed all at once.
100 writing element will usually be short and most simply transcribed as a character string; the content model also allows a sequence of paragraphs and paragraph-level elements, in case the writing has enough internal structure to warrant such markup. In either case the usual phrase-level tags for written text are available.

teiCorpus.xml#13000

# id text
2 teiCorpus contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.
45 teiCorpus The version of the TEI scheme
144 teiCorpus Must contain one TEI header for the corpus, and a series of
148 teiCorpus This element is mandatory when applicable.

addName.xml#13000

# id text
2 addName additional name
14 addName contains an additional name component, such as a nickname, epithet, or alias, or any other descriptive phrase used within a personal name.

geoDecl.xml#13000

# id text
12 geoDecl documents the notation and the datum used for geographic coordinates expressed as content of the
44 geoDecl supplies a commonly used code name for the datum employed.
97 geoDecl the values supplied are geospatial entity object codes, based on
119 geoDecl the value supplied is to be interpreted as a British National Grid Reference.
143 geoDecl the value supplied is to be interpreted as latitude followed by longitude according to the European Datum coordinate system.

lacunaStart.xml#13000

# id text
4 lacunaStart indicates the beginning of a lacuna in the text of a mostly complete textual witness.

model.addressLike.xml#13129

# id text
2 model.addressLike groups elements used to represent a postal or email address.

witEnd.xml#13000

# id text
2 witEnd fragmented witness end
13 witEnd indicates the end, or suspension, of the text of a fragmentary witness.

castList.xml#13000

# id text
2 castList cast list
14 castList contains a single cast list or dramatis personae.

equipment.xml#13000

# id text
4 equipment provides technical details of the equipment and media used for an audio or video recording used as the source for a spoken text.

speaker.xml#13000

# id text
7 speaker contains a specialized form of heading or label, giving the name of one or more speakers in a dramatic text or fragment.

sex.xml#13000

# id text
4 sex specifies the sex of a person.
29 sex supplies a coded value for sex
35 sex Values for this attribute may be locally defined by a project, or may refer to an external standard, such as vCard's sex property
114 sex As with other culturally-constructed traits such as age, the way in which this concept is described in different cultural contexts may vary. The normalizing attributes are provided only as an optional means of simplifying that variety to one or more external standards for purposes of interoperability, or project-internal taxonomies for consistency, and should not be used where that is inappropriate or unhelpful. The content of the element may be used to describe the intended concept in more detail, using plain text.

sense.xml#13000

# id text
2 sense groups together all information relating to one word sense in a dictionary entry, for example definitions, examples, and translation equivalents.
35 sense gives the nesting depth of this sense.
111 sense May contain character data mixed with any other elements defined in the dictionary tag set.

layoutDesc.xml#13000

# id text
2 layoutDesc layout description
13 layoutDesc collects the set of layout descriptions applicable to a manuscript.

localName.xml#13000

# id text
2 localName locally-defined property name
14 localName contains a locally defined name for some property.
52 localName No definitive list of local names is proposed. However, the name
54 localName is recommended as a means of naming the property identifying the recommended character entity name for this character or glyph.

gram.xml#13000

# id text
14 gram within an entry in a dictionary or a terminological data file, contains grammatical information relating to a term, word, or form.
38 gram classifies the grammatical information given according to some convenient typology—in the case of terminological information, preferably the dictionary of data element types specified in
79 gram any of the word classes to which a word may be assigned in a given language, based on form, meaning, or a combination of features, e.g. noun, verb, adjective, etc.
121 gram number
180 gram A much fuller list of values for the
182 gram attribute may be generated from the data category registry accessible from

interp.xml#13000

# id text
2 interp interpretation
15 interp summarizes a specific interpretative annotation which can be linked to a span of text.
67 interp attribute. This permits the encoder to explicitly associate the interpretation represented by the content of an
77 interp attribute which points to one or more textual elements to which the analysis represented by the content of the

custEvent.xml#13000

# id text
2 custEvent custodial event
13 custEvent describes a single event during the custodial history of a manuscript.

repository.xml#13000

# id text
4 repository contains the name of a repository within which manuscripts are stored, possibly forming part of an institution.

state.xml#13242

# id text
4 state contains a description of some status or quality attributed to a person, place, or organization often at some specific time or for a specific date range.
132 state the more general purpose element
134 state should be used even for unchanging characteristics. If you wish to distinguish between characteristics that are generally perceived to be time-bound states and those assumed to be fixed traits, then
138 state element encodes characteristics which are sometimes assumed to change, often at specific times or over a date range, whereas the

listBibl.xml#13000

# id text
2 listBibl citation list
14 listBibl contains a list of bibliographic citations of any kind.

title.xml#13000

# id text
4 title contains a title for any kind of work.
50 title analytic
60 title the title applies to an analytic item, such as an article, poem, or other work published as part of a larger item.
86 title the title applies to a monograph such as a book or other item considered to be a distinct publication, including single volumes of multi-volume works
110 title the title applies to any serial or periodical publication such as a journal, magazine, or newspaper
126 title series
136 title the title applies to a series of otherwise distinct publications such as a collection
156 title the title applies to any unpublished material (including theses and dissertations unless published by a commercial press)
173 title The level of a title is sometimes implied by its context: for example, a title appearing directly within an
182 title s
185 title attribute is not required in contexts where its value can be unambiguously inferred. Where it is supplied in such contexts, its value should not contradict the value implied by its parent element.
249 title classifies the title according to some convenient typology.
268 title main title
294 title subtitle, title of part
310 title alternate
320 title alternate title, often in another language, by which the work is also known
336 title abbreviated form of title
362 title descriptive paraphrase of the work functioning as a title
483 title may be used to indicate the canonical form for the title; the former, by supplying (for example) the identifier of a record in some external library system; the latter by pointing to an XML element somewhere containing the canonical form of the title.

witness.xml#13000

# id text
4 witness contains either a description of a single witness referred to within the critical apparatus, or a list of witnesses which is to be referred to by a single sigil.
39 witness The content of the
41 witness element may give bibliographic information about the witness or witness group, or it may be empty.

att.ptrLike.form.xml#13000

# id text
2 att.ptrLike.form form pointers
32 att.ptrLike.form identifies the orthographic form or pronunciation referred to.

prefixDef.xml#13000

# id text
21 prefixDef supplies a name which functions as the prefix for an abbreviated pointing scheme such as a private URI scheme. The prefix constitutes the text preceding the first colon.
39 prefixDef The abbreviated pointer may be dereferenced to produce either an absolute or a relative URI reference. In the latter case it is combined with the value of
41 prefixDef in force at the place where the pointing attribute occurs to form an absolute URI in the usual manner as prescribed by

milestone.xml#13000

# id text
4 milestone marks a boundary point separating any kind of section of a text, typically but not necessarily indicating a point at which some part of a standard reference system changes, where the change is not represented by a structural element.
49 milestone attribute indicates the new number or other value for the unit which changes at this milestone. The special value
53 milestone The order in which milestone elements are given at a given point is not normally significant.

gloss.xml#13000

# id text
4 gloss identifies a phrase or word used to provide a gloss or definition for some other word or phrase.

specDesc.xml#13000

# id text
156 specDesc The description is usually displayed as a label and an item.
157 specDesc The list of attributes may include some which are inherited by virtue of an element's class membership; descriptions for such attributes may also be retrieved using another
159 specDesc , this time pointing at the relevant class.

cb.xml#13000

# id text
14 cb marks the beginning of a new column of a text on a multi-column page.
79 cb attribute indicates the number or other value associated with the column which follows the point of insertion of this
81 cb element. Encoders should adopt a clear and consistent policy as to whether the numbers associated with column breaks relate to the physical sequence number of the column in the whole text, or whether columns are numbered within the page. The
83 cb element is placed at the head of the column to which it refers.

supplied.xml#13103

# id text
4 supplied signifies text supplied by the transcriber or editor for any reason; for example because the original cannot be read due to physical damage, or because of an obvious omission by the author or scribe.
28 supplied one or more words indicating why the text has had to be supplied, e.g.

model.frontPart.xml#13000

# id text
2 model.frontPart groups elements which appear at the level of divisions within front or back matter.

data.interval.xml#13221

# id text
19 data.interval Any value greater than zero or any one of the values

corr.xml#13000

# id text
2 corr correction
12 corr contains the correct form of a passage apparently erroneous in the copy text.
37 corr If all that is desired is to call attention to the fact that the copy text has been corrected,

rdgGrp.xml#13000

# id text
2 rdgGrp reading group

model.availabilityPart.xml#13000

# id text
2 model.availabilityPart groups elements such as licences and paragraphs of text which may appear as part of an availability statement

div6.xml#13000

# id text
2 div6 level-6 text division
16 div6 contains a sixth-level subdivision of the front, body, or back of a text.
155 div6 any sequence of low-level structural elements, possibly grouped into lower subdivisions.

relation.xml#13165

# id text
14 relation describes any kind of relationship or linkage amongst a specified group of places, events, persons, objects or other items.
43 relation One of the attributes 'name', 'ref' or 'key' must be supplied
49 relation Only one of the attributes @active and @mutual may be supplied
55 relation the attribute 'passive' may be supplied only if the attribute 'active' is supplied
62 relation supplies a name for the kind of relationship of which this is an instance.
97 relation supplies a list of participants amongst all of whom the relationship holds equally.
148 relation This indicates that the person with identifier p1 is supervisor of persons p2, p3, and p4.
183 relation This example records a relationship, defined by the SAWS ontology, between a passage of text identified by a CTS URN, and a variant passage of text in the Perseus Digital Library, and assigns the identification of the relationship to a particular editor (all using resolvable URIs).
193 relation may be supplied only if the attribute
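
The three constraints in rows 43, 49, and 55 above can be read as a small rule set. What follows is only a minimal sketch in Python, not the Schematron the TEI actually uses. The function name `check_relation` and the plain-dictionary representation of the attributes are assumptions made for illustration.

```python
from typing import Dict, List


def check_relation(attrs: Dict[str, str]) -> List[str]:
    """Return the constraint violations for one relation's attributes.

    `attrs` is a hypothetical stand-in for the attributes of a single
    <relation> element, e.g. {"name": "supervisor", "active": "#p1"}.
    """
    problems = []
    # Row 43: one of 'name', 'ref' or 'key' must be supplied.
    if not any(a in attrs for a in ("name", "ref", "key")):
        problems.append("one of 'name', 'ref' or 'key' must be supplied")
    # Row 49: only one of @active and @mutual may be supplied.
    if "active" in attrs and "mutual" in attrs:
        problems.append("only one of 'active' and 'mutual' may be supplied")
    # Row 55: 'passive' may be supplied only if 'active' is supplied.
    if "passive" in attrs and "active" not in attrs:
        problems.append("'passive' may be supplied only if 'active' is supplied")
    return problems
```

An empty list means the attribute set satisfies all three rules; the supervisor example from row 148 would pass with `name`, `active`, and `passive` all present.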

egXML.xml#13116

# id text
15 egXML element itself functions as the root element.
59 egXML the example is intended to be fully valid, assuming that its root element, or a provided root element, could have been used as a possible root element in the schema concerned.
63 egXML the example could be transformed into a valid document by inserting any number of valid attributes and child elements anywhere within it; or it is valid against a version of the schema concerned in which the provision of character data, list, element, or attribute values has been made optional.
131 egXML In the source of the TEI Guidelines, this element declares itself and its content as belonging to the namespace
133 egXML . This enables the content of the element to be validated independently against the TEI scheme. Where this element is used outside this context, a different namespace or none at all may be preferable. The content must however be a well-formed XML fragment or document: where this is not the case, the more general
135 egXML element should be used in preference. In a TEI context use of the
137 egXML attribute in the TEI namespace, as opposed to the TEI Examples namespace, enables recording of rendition information.

att.measurement.xml#13061

# id text
20 att.measurement indicates the units used for the measurement, usually using the standard symbol for the desired units.
109 att.measurement SI base unit of time
165 att.measurement SI unit of pressure or stress
302 att.measurement 10⁻¹⁰ m
490 att.measurement If the measurement being represented is not expressed in a particular unit, but rather is a number of discrete items, the unit
496 att.measurement Wherever appropriate, a recognized SI unit name should be used (see further
498 att.measurement ). The list above is indicative rather than exhaustive.
543 att.measurement specifies the number of the specified units that comprise the measurement
582 att.measurement In general, when the commodity is made of discrete entities, the plural form should be used, even when the measurement is of only one of them.

refState.xml#13000

# id text
2 refState reference state
14 refState specifies one component of a canonical reference defined by the milestone method.
55 refState When constructing a reference, if the reference component found is of numeric type, the length is made up by inserting leading zeros; if it is not, by inserting trailing blanks. In either case, reference components are truncated if necessary at the right hand side.
57 refState When seeking a reference, the length indicates the number of characters which should be compared. Values longer than this will be regarded as matching, if they start correctly. If no value is provided, the length is unlimited and goes to the next delimiter or to the end of the value.
90 refState supplies a delimiting string following the reference component.
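
The padding and matching rules in rows 55 and 57 above can be sketched in Python. This is an illustration under stated assumptions, not TEI-supplied code: the function names `format_component` and `matches_component` are hypothetical, "numeric type" is taken to mean a string of digits, and the "next delimiter" case for an unlimited length is simplified to a whole-string comparison.

```python
from typing import Optional


def format_component(value: str, length: int) -> str:
    """Pad or truncate one reference component to a fixed length."""
    if value.isdigit():
        padded = value.rjust(length, "0")  # numeric: insert leading zeros
    else:
        padded = value.ljust(length, " ")  # otherwise: insert trailing blanks
    return padded[:length]                 # truncate at the right-hand side


def matches_component(candidate: str, sought: str,
                      length: Optional[int] = None) -> bool:
    """Compare only the first `length` characters of two components.

    With length=None (no value provided) the whole strings are compared,
    which is a simplification of "to the next delimiter or the end".
    """
    if length is None:
        return candidate == sought
    return candidate[:length] == sought[:length]
```

With a length of 4, a value such as "Matthew" would match a sought component "Matt", since longer values count as matching when they start correctly.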

att.scoping.xml#13239

# id text
27 att.scoping which identifies a set of nodes, selected within the context identified by the
29 att.scoping attribute if this is supplied, or within the context of the element bearing this attribute if it is not.
42 att.scoping The expression of certainty applies to the nodeset identified by the value of the
44 att.scoping attribute, possibly modified additionally by the value of the
46 att.scoping attribute. If neither attribute is present, the expression of certainty applies to the context of the
50 att.scoping Note that the value of the

data.xTruthValue.xml#13000

# id text
2 data.xTruthValue extended truth value
7 data.xTruthValue defines the range of attribute values used to express a truth value which may be unknown.
31 data.xTruthValue In cases where uncertainty is inappropriate, use the datatype

nym.xml#13000

# id text
2 nym canonical name
14 nym contains the definition for a canonical name or name component of any kind.

dictScrap.xml#13000

# id text
13 dictScrap encloses a part of a dictionary entry in which other phrase-level dictionary elements are freely combined.
102 dictScrap This element is used to mark part of a dictionary entry in which lower level dictionary elements appear, but which does not itself form an identifiable structural unit.

model.catDescPart.xml#13000

# id text
2 model.catDescPart groups component elements of the TEI header Category Description.

string.xml#13000

# id text
2 string string value
14 string represents the value part of a feature-value specification which contains a string.

model.nameLike.agent.xml#13000

# id text
20 model.nameLike.agent This class is used in the content model of elements which reference names of people or organizations.

data.temporal.iso.xml#13000

# id text
46 data.temporal.iso If it is likely that the value used is to be compared with another, then a time zone indicator should always be included, and only the dateTime representation should be used.

model.textDescPart.xml#13000

# id text
3 model.textDescPart groups elements used to categorize a text, for example in terms of its situational parameters.

model.castItemPart.xml#13000

# id text
2 model.castItemPart groups component elements of an entry in a cast list, such as dramatic role or actor's name.

timeline.xml#13012

# id text
12 timeline provides a set of ordered points in time which can be linked to elements of a spoken text to create a temporal alignment of that text.
37 timeline designates the origin of the timeline, i.e. the time at which it begins.
55 timeline If this attribute is not supplied, the implication is that the time of origin is not known. If it is supplied, it must point either to one of the
68 timeline specifies the unit of time corresponding to the
70 timeline value of the timeline or of its constituent points in time.
151 timeline specifies a time interval either as a positive integral value or using one of a set of predefined codes.
169 timeline The value
171 timeline indicates uncertainty about all the intervals in the timeline; the value
173 timeline indicates that all the intervals are evenly spaced, but the size of the intervals is not known; numeric values indicate evenly spaced values of the size specified. If individual points in time in the timeline are given different values for the
175 timeline attribute, those values locally override the value given in the timeline.

model.pPart.edit.xml#13000

# id text
2 model.pPart.edit groups phrase-level elements for simple editorial correction and transcription.

lg.xml#13000

# id text
2 lg line group
14 lg contains one or more verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc.
70 lg An lg element must contain at least one child l, lg or gap element.
157 lg contains verse lines or nested line groups only, possibly prefixed by a heading.

model.publicationStmtPart.detail.xml#13000

# id text
4 model.publicationStmtPart.detail element of the TEI header.

dataRef.xml#

# id text
2 macroRef identifies the datatype of an attribute value, either by referencing an item in an externally defined datatype library, or by pointing to a TEI-defined data specification
16 macroRef the identifier used for this datatype specification
23 macroRef the name of a datatype in the list provided by
32 macroRef a pointer to a datatype defined in some datatype library
40 macroRef supplies a string representing a regular expression providing additional constraints on the strings used to represent values of this datatype

headLabel.xml#13000

# id text
2 headLabel heading for list labels
14 headLabel contains the heading for the label or term column in a glossary list or similar structured list.
69 headLabel element may appear only if each item in the list is preceded by a

model.segLike.xml#13000

# id text
21 model.segLike The principles on which segmentation is carried out, and any special codes or attribute values used, should be defined explicitly in the
25 model.segLike within the associated TEI header.

att.cReferencing.xml#13000

# id text
20 att.cReferencing element in the TEI header
50 att.cReferencing The value of
52 att.cReferencing should be constructed so that when the algorithm for the resolution of canonical references (described in section

alt.xml#13000

# id text
14 alt identifies an alternation or a set of choices among elements or passages.
44 alt states whether the alternations gathered in this collection are exclusive or inclusive.
202 alt , the sum of weights must be in the range from 0 to the number of alternants.

origPlace.xml#13000

# id text
2 origPlace origin place
13 origPlace contains any form of place name, used to identify the place of origin for a manuscript or manuscript part.
60 origPlace origin
61 origPlace , for example original place of publication, as opposed to original place of printing.

custodialHist.xml#13000

# id text
2 custodialHist custodial history
13 custodialHist contains a description of a manuscript's custodial history, either as running prose or as a series of dated custodial events.

att.sortable.xml#13016

# id text
6 att.sortable supplies the sort key for this element in an index, list or group which contains it.
24 att.sortable The sort key is used to determine the sequence and grouping of entries in an index. It provides a sequence of characters which, when sorted with the other values, will produce the desired order; specifics of sort key construction are application-dependent
26 att.sortable Dictionary order often differs from the collation sequence of machine-readable character sets; in English-language dictionaries, an entry for
40 att.sortable may all appear in numeric order
46 att.sortable . The sort key is required if the orthography of the dictionary entry does not suffice to determine its location.

att.global.xml#13092

# id text
2 att.global provides attributes common to all elements in the TEI encoding scheme.
83 att.global number
93 att.global gives a number (or other label) for an element, which is not necessarily unique within the document.
111 att.global The value of this attribute is always understood to be a single token, even if it contains space or other punctuation characters, and need not be composed of numbers only. It is typically used to specify the numbering of chapters, sections, list items, etc.; it may also be used in the specification of a standard reference system for the text.
134 att.global language
144 att.global indicates the language of the element content using a
145 att.global tag
189 att.global The xml:lang value will be inherited from the immediately enclosing element, or from its parent, and so on up the document hierarchy. It is generally good practice to specify xml:lang at the highest appropriate level, noticing that a different default may be needed for the teiHeader from that needed for the associated resource element or elements, and that a single TEI document may contain texts in many languages.
191 att.global The authoritative list of registered language subtags is maintained by IANA and is available at
192 att.global . For a good general overview of the construction of language tags, see
196 att.global The value used must conform with BCP 47. If the value is a private use code (i.e., starts with
202 att.global element with a matching value for its
204 att.global attribute should be supplied in the TEI header to document this value. Such documentation may also optionally be supplied for non-private-use codes, though these must remain consistent with their
357 att.global signals an intention about how white space should be managed by applications.
372 att.global signals that the application's default white-space processing modes are acceptable
376 att.global indicates the intent that applications preserve all white space
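
The inheritance rule for xml:lang described in row 189 above (take the value from the nearest enclosing element that carries one) can be sketched with Python's standard `xml.etree.ElementTree`. The helper name `effective_lang` and the sample document are assumptions for illustration; the sample omits the TEI namespace to keep it short.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(elem, parents):
    """Walk up from `elem` until an element with xml:lang is found."""
    while elem is not None:
        lang = elem.get(XML_LANG)
        if lang is not None:
            return lang
        elem = parents.get(elem)  # the root has no entry, so this ends the loop
    return None


# Hypothetical sample: the language is set once on the root, overridden on one <p>.
doc = ET.fromstring(
    '<TEI xml:lang="en"><text><body>'
    '<p>English by inheritance</p>'
    '<p xml:lang="la">Latine</p>'
    '</body></text></TEI>'
)

# ElementTree keeps no parent pointers, so build a child-to-parent map.
parents = {child: parent for parent in doc.iter() for child in parent}
paras = doc.findall(".//p")
```

Here the first paragraph resolves to the root's value and the second to its own, which is why setting xml:lang at the highest appropriate level is usually enough.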

listTranspose.xml#13000

# id text
2 listTranspose supplies a list of transpositions, each of which is indicated at some point in a document typically by means of metamarks.
23 listTranspose This example might be used for a source document which indicates in some way that the elements identified by
25 listTranspose and code

ident.xml#13000

# id text
12 ident contains an identifier or name for an object of some kind in a formal language.

surface.xml#13000

# id text
2 surface defines a written surface as a two-dimensional coordinate space, optionally grouping one or more graphic representations of that space, zones of interest within that space, and transcriptions of the writing within them.
47 surface describes the method by which this surface is or was connected to the main surface
54 surface glued in place
58 surface pinned or stapled in place
62 surface sewn in place
68 surface indicates whether the surface is attached and folded in such a way as to provide two writing surfaces
87 surface element represents any two-dimensional space on some physical surface forming part of the source material, such as a piece of paper, a face of a monument, a billboard, a scroll, a leaf etc.
89 surface The coordinate space defined by this element may be thought of as a grid
101 surface element may contain graphic representations or transcriptions of written zones, or both. The coordinate values used by every

camera.xml#13000

# id text
4 camera describes a particular camera angle or viewpoint in a screen play.

superEntry.xml#13000

# id text
4 superEntry groups a sequence of entries within any kind of lexical resource, such as a dictionary or lexicon, which function as a single unit, for example a set of homographs.

lang.xml#13000

# id text
2 lang language name
14 lang contains the name of a language mentioned in etymological or other linguistic discussion.

imprimatur.xml#13000

# id text
2 imprimatur contains a formal statement authorizing the publication of a work, sometimes required to appear on a title page or its verso.

listState.xml#13000

# id text
2 listState list of states and/or traits
4 listState contains a list of various kinds of characteristics of people, places, and organizations.
30 listState attribute may be used to distinguish lists of characteristics of a particular type if convenient.

listPerson.xml#13000

# id text
2 listPerson list of persons
13 listPerson contains a list of descriptions, each of which provides information about an identifiable person or a group of people, for example the participants in a language interaction, or the people referred to in a historical source.
79 listPerson The type attribute may be used to distinguish lists of people of a particular type if convenient.

objectType.xml#13000

# id text
58 objectType attribute may be used to point to one or more items within a taxonomy of types of object, defined either internally or externally.

msPart.xml#13005

# id text
13 msPart contains information about an originally distinct manuscript or part of a manuscript, now forming part of a composite manuscript.
70 msPart children if needed) should be used instead of an
77 msPart WARNING: use of deprecated method — the use of the altIdentifier element as a direct child of the msPart element will be removed from the TEI on 2016-09-09
137 msPart As this last example shows, for compatibility reasons the identifier of a manuscript part may be supplied as a simple

memberOf.xml#13000

# id text
58 memberOf add
92 memberOf supplies the maximum number of times the element can occur in elements which use this model class in their content model
99 memberOf supplies the minimum number of times the element must occur in elements which use this model class in their content model
111 memberOf This element will appear in any content model which references
137 memberOf Elements or classes which are members of multiple (unrelated) classes will have more than one
141 memberOf element. If an element is a member of a class C1, which is itself a subclass of a class C2, there is no need to state this, other than in the documentation for class C1.
143 memberOf Any additional comment or explanation of the class membership may be provided as content for this element.

series.xml#13000

# id text
2 series series information
14 series contains information about the series in which a book or other bibliographic item has appeared.

model.emphLike.xml#13000

# id text
2 model.emphLike groups phrase-level elements which are typographically distinct and to which a specific function can be attributed.

derivation.xml#13000

# id text
4 derivation describes the nature and extent of originality of this text.
27 derivation categorizes the derivation of the text.
46 derivation text is original
62 derivation text is a revision of some other text
78 derivation text is a translation of some other text
94 derivation text is an abridged version of some other text
110 derivation text is plagiarized from some other text
126 derivation text has no obvious source but is one of a number derived from some common ancestor
160 derivation For derivative texts, details of the ancestor may be included in the source description.

kinesic.xml#13000

# id text
52 kinesic The value
54 kinesic indicates that the kinesic is repeated several times rather than occurring only once.

att.metrical.xml#13000

# id text
2 att.metrical defines a set of attributes which certain elements may use to represent metrical information.
46 att.metrical The pattern may be specified by means of either a standard term for the kind of metrical unit (e.g.
99 att.metrical The pattern may be specified by means of either a standard term for the kind of metrical unit (e.g.
128 att.metrical rhyme scheme
138 att.metrical specifies the rhyme scheme applicable to a group of verse lines.
156 att.metrical By default, the rhyme scheme is expressed as a string of alphabetic characters each corresponding with a rhyming line. Any non-rhyming lines should be represented by a hyphen or an X. Alternative notations may be defined as for
160 att.metrical element in the TEI header.
162 att.metrical When the default notation is used, it does not make sense to specify this attribute on any unit smaller than a line. Nor does the default notation provide any way to record internal rhyme, or to specify non-conventional rhyming practice. These extensions would require user-defined alternative notations.
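
The default rhyme-scheme notation in row 156 above (one alphabetic character per line, with a hyphen or an X for non-rhyming lines) is simple enough to sketch in Python. The function name `rhyme_groups` is hypothetical, and treating only an uppercase X as the non-rhyming marker is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def rhyme_groups(scheme: str) -> Dict[str, List[int]]:
    """Group 1-based line numbers by the rhyme letter assigned to them."""
    groups = defaultdict(list)
    for line_no, symbol in enumerate(scheme, start=1):
        if symbol in ("-", "X"):  # non-rhyming line: belongs to no group
            continue
        groups[symbol].append(line_no)
    return dict(groups)
```

For a quatrain given as "abab", lines 1 and 3 fall into one group and lines 2 and 4 into another. As row 162 notes, internal rhyme cannot be expressed in this notation.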

label.xml#13000

# id text
4 label contains any label or heading used to identify part of a text, typically but not exclusively in a list or glossary.
28 label Labels are commonly used for the headwords in glossary lists; note the use of the global
30 label attribute to set the default language of the glossary list to Middle English, and identify the glosses and headings as modern English or Latin:
296 label Labels may also be used to record explicitly the numbers or letters which mark list items in ordered lists, as in this extract from Gibbon's
315 label Labels may also be used for other structured list items, as in this extract from the journal of Edward Gibbon:
343 label rather than as its sibling. Though syntactically valid, this usage is not recommended TEI practice.
347 label Labels may also be used to represent a label or heading attached to a paragraph or sequence of paragraphs not treated as a structural division, or to a group of verse lines. Note that, in this case, the
373 label In this example the text of the label appears in the right hand margin of the original source, next to the paragraph it describes, but approximately in the middle of it.

div2.xml#13000

# id text
2 div2 level-2 text division
16 div2 contains a second-level subdivision of the front, body, or back of a text.
195 div2 any sequence of low-level structural elements, possibly grouped into lower subdivisions.

msIdentifier.xml#13000

# id text
58 msIdentifier An msIdentifier must contain either a repository or location of some type, or a manuscript name

model.divTopPart.xml#13000

# id text
2 model.divTopPart groups elements which can occur only at the beginning of a text division.

birth.xml#13012

# id text
4 birth contains information about a person's birth, such as its date and place.

vLabel.xml#13000

# id text
2 vLabel value label
14 vLabel represents the value part of a feature-value specification which appears at more than one point in a feature structure.
39 vLabel supplies a name identifying the sharing point.

fsDecl.xml#13000

# id text
46 fsDecl gives a name for the type of feature structure being declared.
65 fsDecl gives the name of one or more typed feature structures from which this type inherits feature specifications and constraints; if this type includes a feature specification with the same name as that of any of those specified by this attribute, or if more than one specification of the same name is inherited, then the set of possible values is defined by unification. Similarly, the set of constraints applicable is derived by combining those specified explicitly within this element with those implied by the
69 fsDecl attribute is specified, no feature specification or constraint is inherited.
113 fsDecl The process of combining constraints may result in a contradiction, for example if two specifications for the same feature specify disjoint ranges of values, and at least one such specification is mandatory. In such a case, there is no valid representative for the type being defined.
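
The contradiction case in row 113 can be sketched in a few lines. This is only an illustration under stated assumptions, not the TEI processing model: it assumes each feature's range is a finite set of values and models unification as set intersection. The function name `unify_ranges` and the sample values are hypothetical.

```python
from typing import Optional


def unify_ranges(own: frozenset, inherited: frozenset) -> Optional[frozenset]:
    """Combine two value ranges declared for the same feature.

    Assumption: a range is a finite set of permitted values and
    unification is modelled as set intersection. Returns None when the
    ranges are disjoint, i.e. no value satisfies both specifications.
    """
    common = own & inherited
    return common if common else None


# Overlapping ranges unify to the values they share.
number = unify_ranges(frozenset({"singular", "plural"}),
                      frozenset({"plural", "dual"}))

# Disjoint ranges cannot be unified; if the feature is mandatory,
# the type being defined has no valid representative.
clash = unify_ranges(frozenset({"singular"}), frozenset({"plural"}))
```

Returning `None` rather than an empty set keeps the "no valid representative" case distinct from any legitimate result.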

att.datable.w3c.xml#13000

# id text
2 att.datable.w3c provides attributes for normalization of elements that contain datable events conforming to the W3C
21 att.datable.w3c supplies the value of the date or time in a standard form, e.g. yyyy-mm-dd.
37 att.datable.w3c Examples of W3C date, time, and date & time formats.
133 att.datable.w3c specifies the earliest possible date for the event in standard form, e.g. yyyy-mm-dd.
152 att.datable.w3c specifies the latest possible date for the event in standard form, e.g. yyyy-mm-dd.
216 att.datable.w3c The value of these attributes should be a normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by
220 att.datable.w3c The most commonly-encountered format for the date portion of a temporal attribute is
232 att.datable.w3c may also be used. For the time part, the form
236 att.datable.w3c Note that this format does not currently permit use of the value
238 att.datable.w3c to represent the year 1 BCE; instead the value
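
As a rough illustration of the yyyy-mm-dd form mentioned in rows 21, 133 and 152, the following sketch checks that a value is a calendar-valid date in that one format. It deliberately ignores times, time zones and the other W3C formats; the function name `is_plain_w3c_date` is hypothetical and not part of any TEI tooling.

```python
import re
from datetime import date

# Only the plain yyyy-mm-dd form is checked here; times, time zones and
# the other W3C formats are out of scope for this sketch.
_YMD = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_plain_w3c_date(value: str) -> bool:
    """Return True if value is a calendar-valid date written yyyy-mm-dd."""
    if not _YMD.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        # Right shape, but not a real calendar date (e.g. 30 February).
        return False
    return True
```

For example, `is_plain_w3c_date("1994-05-16")` is true, while `"1994-02-30"` and `"16 May 1994"` are both rejected.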

model.placeNamePart.xml#13000

# id text
2 model.placeNamePart groups elements which form part of a place name.

stamp.xml#13000

# id text
4 stamp contains a word or phrase describing a stamp or similar device.

locale.xml#13000

# id text
3 locale contains a brief informal description of the kind of place concerned, for example: a room, a restaurant, a park bench, etc.

country.xml#13012

# id text
4 country contains the name of a geo-political unit, such as a nation, country, colony, or commonwealth, larger than or administratively superior to a region and smaller than a bloc.
47 country The recommended source for codes to represent coded country names is ISO 3166.

analytic.xml#13000

# id text
2 analytic analytic level
14 analytic contains bibliographic elements describing an item (e.g. an article or poem) published within a monograph or journal and not as an independent publication.
77 analytic , where its use is mandatory for the description of an analytic level bibliographic item.

headItem.xml#13000

# id text
2 headItem heading for list items
14 headItem contains the heading for the item or gloss column in a glossary list or similar structured list.
88 headItem element may appear only if each item in the list is preceded by a

pVar.xml#13120

# id text
49 pVar indicates what notation is used for the pronunciation, if more than one occurs in the machine-readable dictionary.

witStart.xml#13000

# id text
2 witStart fragmented witness start
13 witStart indicates the beginning, or resumption, of the text of a fragmentary witness.

line.xml#13000

# id text
2 line contains the transcription of a topographic line in the source document
59 line This element should be used only to mark up writing which is topographically organized as a series of lines, horizontal or vertical. It should not be used to mark lines of verse (for which use
61 line ) nor to mark linebreaks within text which has been encoded using structural elements such as

data.duration.iso.xml#13000

# id text
2 data.duration.iso defines the range of attribute values available for representation of a duration in time using ISO 8601 standard formats
64 data.duration.iso A duration is expressed as a sequence of number-letter pairs, preceded by the letter P; the letter gives the unit and may be Y (year), M (month), D (day), H (hour), M (minute), or S (second), in that order. The numbers are all unsigned integers, except for the last, which may have a decimal component (using either
68 data.duration.iso as the decimal point; the latter is preferred). If any number is
70 data.duration.iso , then that number-letter pair may be omitted. If any of the H (hour), M (minute), or S (second) number-letter pairs are present, then the separator
73 data.duration.iso time
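
The number-letter structure described in rows 64 to 70 can be sketched with a regular expression. This is a simplified reading, not a full ISO 8601 parser: it assumes the time separator is the letter T (as in ISO 8601), allows a decimal part only on the seconds value, and accepts either `.` or `,` as the decimal point. The function name `parse_duration` is hypothetical.

```python
import re

# Date pairs come first, then an optional T followed by the time pairs.
# Simplification: only the seconds value may carry a decimal part.
_DURATION = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:[.,]\d+)?)S)?)?"
)


def parse_duration(value: str) -> dict:
    """Split a duration such as 'P1Y2M' or 'PT1,5S' into named parts."""
    match = _DURATION.fullmatch(value)
    # A bare 'P' or a trailing 'T' carries no number-letter pair.
    if match is None or value == "P" or value.endswith("T"):
        raise ValueError(f"not a duration: {value!r}")
    return {
        unit: float(text.replace(",", "."))
        for unit, text in match.groupdict().items()
        if text is not None
    }
```

The separator is what distinguishes the two uses of M: `parse_duration("P2M")` yields months, while `parse_duration("PT2M")` yields minutes.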

fDecl.xml#13000

# id text
14 fDecl declares a single feature, specifying its name, organization, range of allowed values, and optionally its default value.
45 fDecl a single word which follows the rules defining a legal XML name (see
46 fDecl ), indicating the name of the feature being declared; matches the
93 fDecl indicates whether or not the value of this feature may be present.
113 fDecl If a feature is marked as optional, it is possible for it to be omitted from a feature structure. If an obligatory feature is omitted, then it is understood to have a default value, either explicitly declared, or, if no default is supplied, the special value
115 fDecl . If an optional feature is omitted, then it is understood to be missing and any possible value (including the default) is ignored.

macro.paraContent.xml#13067

# id text
2 macro.paraContent paragraph content
14 macro.paraContent defines the content of paragraphs and similar elements.

model.global.meta.xml#13000

# id text
20 model.global.meta Elements in this class are typically used to hold groups of links or of abstract interpretations, or to provide indications of certainty etc. It may be convenient to localize all metadata elements, for example to contain them within the same division as the elements that they relate to; or to locate them all in a division of their own. They may however appear at any point in a TEI text.

support.xml#13000

# id text
4 support contains a description of the materials etc. which make up the physical support for the written part of a manuscript.

quotation.xml#13000

# id text
4 quotation specifies editorial practice adopted with respect to quotation marks in the original.
39 quotation quotation marks
49 quotation indicates whether or not quotation marks have been retained as content within the text.
70 quotation no quotation marks have been retained
86 quotation some quotation marks have been retained
102 quotation all quotation marks have been retained

graphic.xml#13000

# id text
2 graphic indicates the location of an inline graphic, illustration, or figure.
65 graphic attribute should be used to supply the MIME media type of the image specified by the

att.duration.w3c.xml#13000

# id text
2 att.duration.w3c provides attributes for recording normalized temporal durations.
54 att.duration.w3c are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. In order to represent a time range by a duration and its ending time the
60 att.duration.w3c form, no claim is made that the form in the source text is incorrect; the regularized form is simply that chosen as the main form for purposes of unifying variant forms under a single heading.

abbr.xml#13092

# id text
91 abbr the abbreviation comprises a special symbol or mark.
107 abbr the abbreviation includes writing above the line.
139 abbr the abbreviation is for a title of address (Dr, Ms, Mr, …)
155 abbr the abbreviation is for the name of an organization.
190 abbr attribute is provided for the sake of those who wish to classify abbreviations at their point of occurrence; this may be useful in some circumstances, though usually the same abbreviation will have the same type in all occurrences. As the sample values make clear, abbreviations may be classified by the method used to construct them, the method of writing them, or the referent of the term abbreviated; the typology used is up to the encoder and should be carefully planned to meet the needs of the expected use. For a typology of Middle English abbreviations, see
269 abbr tag is not required; if appropriate, the encoder may transcribe abbreviations in the source text silently, without tagging them. If abbreviations are not transcribed directly but
271 abbr silently, then the TEI header should so indicate.

model.orgPart.xml#13000

# id text
2 model.orgPart groups elements which form part of the description of an organization.

seriesStmt.xml#13000

# id text
2 seriesStmt series statement
16 seriesStmt groups information about the series, if any, to which a publication belongs.

leaf.xml#13000

# id text
28 leaf provides a pointer to a feature structure or other analytic element.
66 leaf provides an identifier of an element which this leaf follows.
84 leaf If the tree is unordered or partially ordered, this attribute has the property of fixing the relative order of the leaf and the element which is the value of the attribute.
114 leaf The in degree of a leaf is always 1, its out degree always 0.
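
The degree property in row 114 can be checked on a small example. The tree below is hypothetical and stored as a plain mapping from each node to its children; nothing here reflects how TEI tree markup is actually processed.

```python
from collections import Counter

# Hypothetical tree: each key is a node, each value lists its children.
children = {
    "root": ["np", "vp"],
    "np": ["the", "cat"],
    "vp": ["sleeps"],
}

# Out degree: number of children. In degree: number of parents.
out_degree = {node: len(kids) for node, kids in children.items()}
in_degree = Counter(kid for kids in children.values() for kid in kids)

# A leaf is a node that appears as a child but has no children itself.
leaves = [node for node in in_degree if out_degree.get(node, 0) == 0]
```

Every node in `leaves` has an in degree of 1 and an out degree of 0, matching the statement in the table.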

app.xml#13000

# id text
2 app apparatus entry
14 app contains one entry in a critical apparatus, with an optional lemma and usually one or more readings or notes on the relevant passage.
118 app This attribute should be used when either the double-end point method of apparatus markup, or the location-referenced method with a URL rather than canonical reference, is used.
149 app This attribute is only used when the double-end point method of apparatus markup is used, when the encoded apparatus is not embedded
168 app location
178 app indicates the location of the variation, when the location-referenced method of apparatus markup is used.
196 app This attribute is used only when the location-referenced encoding method is used. It supplies a string containing a canonical reference for the passage to which the variation applies.

sequence.xml#13221

# id text
2 sequence sequence of references
14 sequence The sequence element must have at least two child elements
20 sequence if true, indicates that the order in which component elements of a sequence appear in a document must correspond to the order in which they are given in the content model.
37 sequence This example content model matches a sequence consisting of either a
41 sequence followed by nothing, or by a sequence of up to five

width.xml#13000

# id text
40 width If used to specify the width of a non text-bearing portion of some object, for example a monument, this element conventionally refers to the axis facing the observer, and perpendicular to that indicated by the
41 width depth

model.oddDecl.xml#13000

# id text
2 model.oddDecl groups elements which generate declarations in some markup language in ODD documents.

affiliation.xml#13012

# id text
4 affiliation contains an informal description of a person's present or past affiliation with some organization, for example an employer or sponsor.
64 affiliation If included, the name of an organization may be tagged using either the

certainty.xml#13092

# id text
2 certainty indicates the degree of certainty associated with some aspect of the text markup.
32 certainty certainty
42 certainty signifies the degree of certainty associated with the object pointed to by the
51 certainty indicates more exactly the aspect concerning which certainty is being expressed: specifically, whether the markup is correctly located, whether the correct element or attribute name has been used, or whether the content of the element or attribute is correct, etc.
70 certainty uncertainty concerns whether the name of the element or attribute used is correctly applied.
86 certainty uncertainty concerns the content (for an element) or the value (for an attribute)
92 certainty provides an alternative value for the aspect of the markup in question—an alternative generic identifier, transcription, or attribute value, or the identifier of an
100 certainty ; if none is given, it applies to the markup in the text.
233 certainty The envisioned typical value of this attribute would be the identifier of another
235 certainty element or a list of such identifiers. It may thus be possible to construct probability networks by chaining
239 certainty elements (with no value for
241 certainty ). The semantics of this chaining would be understood in this way: if a
243 certainty element is specified, via a reference, as the assumption, then it is not the attribution of uncertainty that is the assumption, but rather the assertion itself. For instance, in the example above, the first

opener.xml#13000

# id text
4 opener groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter.

pb.xml#13000

# id text
52 pb A page break may be associated with a facsimile image of the page it introduces by means of the
76 pb attribute indicates the number or other value associated with this page. This will normally be the page number or signature printed on it, since the physical sequence number is implicit in the presence of the

docImprint.xml#13000

# id text
2 docImprint document imprint
16 docImprint contains the imprint statement (place and date of publication, publisher name), as given (usually) at the foot of a title page.
130 docImprint element of bibliographic citations. As with title, author, and editions, the shorter name is reserved for the element likely to be used more often.

vColl.xml#13000

# id text
2 vColl collection of values
14 vColl represents the value part of a feature-value specification which contains multiple values organized as a set, bag, or list.
54 vColl indicates organization of given value or values as

preparedness.xml#13000

# id text
4 preparedness describes the extent to which a text may be regarded as prepared or spontaneous.
78 preparedness follows a predefined set of conventions

q.xml#13000

# id text
14 q contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.
39 q may be used to indicate whether the offset passage is spoken or thought, or to characterize it more finely.
90 q quotation from a written source
128 q linguistically distinct
138 q technical term
215 q May be used to indicate that a passage is distinguished from the surrounding text for reasons concerning which no claim is made. When used in this manner,
219 q with a value of
221 q that indicates the use of such mechanisms as quotation marks.

creation.xml#13000

# id text
4 creation contains information about the creation of a text.
85 creation element may be used to record details of a text's creation, e.g. the date and place it was composed, if these are of interest.
91 creation element, which records date and place of publication.

bindingDesc.xml#13000

# id text
2 bindingDesc binding description
13 bindingDesc describes the present and former bindings of a manuscript, either as a series of paragraphs or as a series of distinct
15 bindingDesc elements, one for each binding of the manuscript.

listChange.xml#13000

# id text
2 listChange groups a number of change descriptions associated with either the creation of a source text or the revision of an encoded text.
62 listChange element it documents the set of revision campaigns or stages identified during the evolution of the original text. When it appears within the

model.rdgLike.xml#13000

# id text
20 model.rdgLike element to be easily created via TEI customizations.

model.pPart.transcriptional.xml#13000

# id text
2 model.pPart.transcriptional groups phrase-level elements used for editorial transcription of pre-existing source materials.

sourceDesc.xml#13000

# id text
2 sourceDesc source description
13 sourceDesc describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as "born digital" for a text which has no previous existence.

sic.xml#13092

# id text
14 sic contains text reproduced although apparently incorrect or inaccurate.

role.xml#13000

# id text
4 role contains the name of a dramatic role, as given in a cast list.

supportDesc.xml#13012

# id text
2 supportDesc support description
13 supportDesc groups elements describing the physical support for the written part of a manuscript.
58 supportDesc a short project-defined name for the material composing the majority of the support

vNot.xml#13000

# id text
2 vNot value negation
14 vNot represents a feature value which is the negation of its content.

macroRef.xml#13000

# id text
15 macroRef the identifier used for the required pattern within the source indicated.
32 macroRef Patterns or macros are identified by the name supplied as value for the
36 macroRef element in which they are declared. All TEI macro names are unique.

att.internetMedia.xml#13000

# id text
16 att.internetMedia MIME media type
24 att.internetMedia specifies the applicable Multipurpose Internet Mail Extensions (MIME) media type
47 att.internetMedia is used to indicate that the URL points to a TEI XML file encoded in UTF-8.
54 att.internetMedia This attribute class provides an attribute for describing a computer resource, typically available over the internet, using a value taken from a standard taxonomy. At present only a single taxonomy is supported, the Multipurpose Internet Mail Extensions (MIME) Media Type system. This typology of media types is defined by the Internet Engineering Task Force in
57 att.internetMedia list of types
60 att.internetMedia attribute must have a value taken from this list.
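
For readers unfamiliar with MIME media types, the sketch below uses Python's standard `mimetypes` module to guess one from a file extension. It is only an illustration of what such values look like; the helper name `guess_media_type` and the fallback value are choices made for this example, and the module's table is not the authoritative list the rows above refer to.

```python
import mimetypes


def guess_media_type(url: str) -> str:
    """Guess a MIME media type from the file extension of a URL or path.

    Falls back to the generic 'application/octet-stream' when the
    extension is unknown; that fallback is a choice made for this sketch.
    """
    media_type, _encoding = mimetypes.guess_type(url)
    return media_type or "application/octet-stream"
```

For example, `guess_media_type("images/page-001.png")` gives `image/png`.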

iType.xml#13000

# id text
38 iType indicates the type of indicator used to specify the inflection class, when it is necessary to distinguish between the usual abbreviated indications (e.g.
83 iType coded reference to a table of verbs
101 iType gram type='inflectional type'
148 iType May contain character data and phrase-level elements. Typical content will be

interaction.xml#13000

# id text
4 interaction describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary, etc.
28 interaction specifies the degree of interaction between active and passive participants in the text.
47 interaction no interaction of any kind, e.g. a monologue
63 interaction some degree of interaction, e.g. a monologue with set responses
95 interaction this parameter is inappropriate or inapplicable in this case
113 interaction specifies the number of active participants (or
194 interaction number of addressors unknown or unspecifiable
212 interaction specifies the number of passive participants (or
214 interaction ) to whom a text is directed or in whose presence it is created or performed.
243 interaction text is addressed to the originator e.g. a diary
259 interaction text is addressed to one other person e.g. a personal letter
275 interaction text is addressed to a countable number of others e.g. a conversation in which all participants are identified
291 interaction text is addressed to an undefined but fixed number of participants e.g. a lecture
307 interaction text is addressed to an undefined and indeterminately large number e.g. a published book

media.xml#13000

# id text
2 media indicates the location of any form of external media such as an audio or video clip etc.
61 media The attributes available for this element are not appropriate in all cases. For example, it makes no sense to specify the temporal duration of a graphic. Such errors are not currently detected.
65 media attribute must be used to specify the MIME media type of the resource specified by the

pc.xml#13000

# id text
2 pc punctuation character
4 pc contains a character or string of characters regarded as constituting a single punctuation mark.
26 pc indicates the extent to which this punctuation mark conventionally separates words or phrases
33 pc the punctuation mark is a word separator
37 pc the punctuation mark is not a word separator
41 pc the punctuation mark may or may not be a word separator
47 pc provides a name for the kind of unit delimited by this punctuation mark.
54 pc indicates whether this punctuation mark precedes or follows the unit it delimits.

unclear.xml#13000

# id text
4 unclear contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source.
29 unclear indicates why the material is hard to transcribe.
92 unclear Where the difficulty in transcription arises from damage, categorizes the cause of the damage, if it can be identified.
111 unclear damage results from rubbing of the leaf edges
127 unclear damage results from mildew on the leaf surface
143 unclear damage results from smoke
187 unclear The same element is used for all cases of uncertainty in the transcription of element content, whether for written or spoken material. For other aspects of certainty, uncertainty, and reliability of tagging and transcription, see chapter

chapters ('en') (i.e., https://svn.code.sf.net/p/tei/code/trunk/P5/Source/Guidelines/en/)

COL-Colophon.xml#12020

# id text
4 COL The text of this manual was prepared electronically on a variety of systems. Most sections were originally drafted by members of the work groups and working committees of the TEI; all have been revised by the editors to achieve greater uniformity of style and greater consistency in the tag set.
8 COL Almost every available SGML and XML editor or processing program has been used at one time or another by the TEI; but without the open source implementations of XML parsers, editors and XSLT engines by James Clark, Richard Stallman, Michael Kay, and Daniel Veillard, the TEI could not survive, and we thank these individuals. We would also like to thank the staff at Syncrosoft, creators of the oXygen editor, for their support for the TEI during the creation of P5.
10 COL Many volunteers contributed to the preparation of this release of the Guidelines; we particularly note the work of Sabine Krott, Eva Radermacher and Arianna Ciula in structuring the bibliographies.
12 COL The production and release process for TEI P5 was managed by Sebastian Rahtz for the TEI Technical Council.

PrefatoryNote.xml#12945

# id text
5 PREFS prefixed to each revision of the TEI Guidelines since its first publication in 1994.
9 p4pf02 The primary goal of this revision has been to make available a new and corrected version of the TEI Guidelines which:
13 p4pf02 generates a set of DTD fragments that can be combined together to form either SGML or XML document type definitions;
17 p4pf02 can be processed and maintained using readily available XML tools instead of the special-purpose ad hoc software originally used for TEI P3.
21 p4pf02 A second major design goal of this revision has been to ensure that the DTD fragments generated would not break existing documents: in other words, that any document conforming to the original TEI P3 SGML DTD would also conform to the new XML version of it. Although full backwards compatibility cannot be guaranteed, we believe our implementation is consistent with that goal.
23 p4pf02 In most respects, the TEI Guidelines have stood the test of time remarkably well. The present edition makes no substantial attempt to rewrite those few parts of them which have now been rendered obsolete by changes since their first publication, though an indication is given in the text of where such rewriting is now considered necessary. Neither does the present version attempt to address any of the many possible new areas of digital activity in which the TEI approach to standardization may have something to offer. Both these tasks require the existence of an informed and active TEI Council to direct and validate such extension and maintenance work, in response to the changing needs and priorities of the TEI user community.
29 p4pf02 workgroup chaired by Christian Wittern, which undertook to provide expert advice and correction at very short notice, in the latter task.
31 p4pf02 The preparation of this new version relied extensively on preliminary work carried out by the former North American editor of the TEI Guidelines, C.M. Sperberg-McQueen. In a TEI working paper written in 1999
32 p4pf02 TEI ED W69
33 p4pf02 , available from the TEI web site at
35 p4pf02 he sketched out a precise blueprint for the conversion of the TEI from SGML to XML, which we have implemented, with only slight modification.
37 p4pf02 The Editors would also like to express thanks to the team of volunteers from the TEI community who helped us with the task of proofreading the first draft during the summer of 2001; and to Sebastian Rahtz of Oxford University Computing Services, without whose skill and enthusiasm this new edition would not have been possible.
39 p4pf02 A substantial proportion of the work of preparing this new edition was funded with the assistance of a grant from the US National Endowment for the Humanities, whose continued support of the TEI has also been crucial to the effort of setting up the TEI Consortium.
41 p4pf02 Finally, we would like to thank all our colleagues on the interim management board of the TEI Consortium, in particular its Chairman John Unsworth, for their continued support of the TEI's work, and their willingness to devote effort to the difficult task of overseeing its transition to a new organizational infrastructure.
52 p4pf01 To complete the work started in June of this year, the TEI Editors asked for volunteers from the TEI community to proofread the preliminary XML version. 24 volunteers responded to this call during August, and gave invaluable help both by identifying a number of previously un-noticed errors, and by suggesting areas in which more substantial revision should be undertaken in the future. The Editors gratefully acknowledge the assistance of the following individuals during this exercise:
56 p4pf01 In addition to error correction, and clear delineation of those sections in which substantial revision is yet to be undertaken for TEI P5, the present draft differs from earlier ones in the following respects:
58 p4pf01 Formal Public Identifiers have been introduced as a means of constructing TEI DTDs and an SGML Open Catalog is now included with the standard release;
62 p4pf01 The chapters on obtaining the TEI DTDs and WSDs have been brought up to date; the chapter on modification has been expanded to include a discussion of the TEI Lite customization;
74 PPF2 This is a preliminary version of a revised and fully XML-compliant edition of the TEI Guidelines. Although work on revising and correcting the text of the document is incomplete, by making available this preliminary version we hope to facilitate testing of the XML document type declarations which it describes by as wide a range of TEI users as possible.
76 PPF2 The primary goal of this revision is to make available the corrected (May 1999) edition of the Guidelines in a new version which:
80 PPF2 generates a set of XML DTD fragments that can be combined together in the same way as the existing TEI (P3) SGML DTD fragments to form true TEI XML DTD fragments without loss of functionality;
82 PPF2 can be processed and maintained using readily available XML tools instead of the special-purpose ad hoc software originally used for TEI P3.
84 PPF2 As noted elsewhere, a number of errors were corrected in the May 1999 edition. A (much) smaller number of errors have also been corrected in this edition, but no new material has been added. We expect the expansion and modification of the Guidelines to become a real possibility in the context of the newly formed TEI Consortium, which has funded the preparation of this present edition.
86 PPF2 A major design goal of both this and the previous revision has been to ensure that the DTD fragments generated would not break existing documents: in other words, that any document conforming to the original TEI P3 SGML DTD would also conform to the new XML version of it. Although full backwards compatibility cannot be guaranteed, we believe our implementation is consistent with that goal.
88 PPF2 In making this new version, we relied extensively on preliminary work carried out by the outgoing North American editor of the TEI Guidelines, Michael Sperberg-McQueen. In a TEI working paper written in 1999, TEI ED W69, Michael sketched out a precise blueprint for the conversion of the TEI from SGML to XML, which we have implemented, with only slight modification. The current TEI editors wish to express here our admiration for the detailed care put into that paper, without which our task would have been forbiddingly difficult, if not impossible. We would also like to express our thanks to Sebastian Rahtz of Oxford University Computing Services, for his invaluable assistance in preparing this new edition.
90 PPF2 We list here in summary form all the changes made in the present edition. Full technical details are provided in documents TEI EDW69 and TEI EDW70, available from the TEI web site.
94 PPF2 has been added. By setting its value to
96 PPF2 , rather than the default
100 PPF2 The content models of all elements have been checked, and, where necessary, changed so that they are equally valid as SGML or as XML;
102 PPF2 The declared value for all attributes has been changed to a form which is equally valid as SGML or as XML;
109 PPF2 tag omissibility
114 PPF2 used within element declarations in the DTD. When XML is to be generated, the parameter entities concerned are redeclared with the null string as their value.
116 PPF2 The second change was achieved by removing SGML-specific features (ampersand connectors, inclusion and exclusion exceptions, various types of attribute content) from the DTD and revising the syntax of the DTD to conform to XML requirements (specifically in the representation of mixed-content models, and by removing redundant parentheses). In making these changes, we took care to ensure that the resulting content model would continue to accept existing valid documents, though in the nature of things it could not be guaranteed to reject the same set of documents. As further discussed in EDW69 and EDW70, some constraints (exclusion exceptions, for example) which could be carried out by a generic SGML parser using TEI P3 will have to be implemented by a special purpose TEI validator using TEI P4.
118 PPF2 Much work remains to be done, firstly in testing the new DTD fragments against as wide a range of TEI materials as possible, secondly in revising the discussion of markup theory and practice within the text to reflect current thinking. A few sections of the current text (the Gentle Introduction to SGML and the discussion of Extended Pointer syntax are two examples) will need substantial rewriting. For the most part, however, we think the Guidelines have stood the test of time well and can be recommended to a new generation of text encoders scarcely born at the time they were first formulated.
128 ppf No work of the size and complexity of the TEI
130 ppf could reasonably be expected to be error-free on publication, nor to remain long uncorrected. It has however taken rather longer than might have been anticipated to complete production of the present corrected reprint of the first edition, for which we present our apologies, both to the many individuals and institutions whose enthusiastic adoption and promotion of the TEI encoding scheme have ensured its continued survival in the rapidly changing world of digital scholarship, and also to the many helpfully critical users whose assiduous uncovering and reporting of our errors have made possible the present revision.
132 ppf At its first meeting in Bergen, in June 1996, the TEI Technical Review Committee (TRC) approved the setting up of a small working committee to oversee the production of a revised edition of the TEI
134 ppf , to include corrections of as many as possible of the `corrigible errors' notified to the editors since publication of the first edition in May 1994, the bulk of which are summarized in a TEI working paper (TEI EDW67, available from the TEI web site).
138 ppf The work of making the corrections and regenerating the text proceeded rather fitfully during 1998 and 1999, largely because of increasing demands on the editors' time from their other responsibilities. With the establishment of the new TEI Consortium, it is to be hoped that maintenance of the Guidelines will be placed on a more secure footing. Some specific areas in which we anticipate future revisions being carried out are listed below.
144 ppf-tcm examples of TEI markup throughout the text were all checked against the relevant DTD fragment and an embarrassingly large number of tagging errors corrected;
150 ppf-tcm listed in working paper TEI EDW67 were all corrected: some of these required specific changes to the DTD which are listed in the next section.
157 ppf-spc A major goal of this revision was to avoid changes which might invalidate existing data, even where existing constructs seemed erroneous in retrospect. To that end, wherever changes have been made in content models for existing elements, they have as far as possible been made so that the DTD will now accept a superset of what was previously legal. Only one new element (
161 ppf-spc Where possible, a few content models have been changed in such a way as to facilitate conversion to XML, but XML compatibility is
204 ppf-spc ; this class was then added to the global inclusion class
281 ppf-spc for use in simplification of the content model for
291 ppf-spc corrected an error whereby global attributes were not properly defined for elements specifying a non-default value for any of the
313 ppf-spc changed content models to permit empty
319 ppf-spc changed content model for
323 ppf-spc changed content model for
335 ppf-spc changed content model for
341 ppf-spc changed content model for
351 ppf-spc changed content models for
363 ppf-spc A number of content models were changed with a view to easing the creation of an XML compatible version of the Guidelines. Specifically:
374 ppf-spc changed the mixed content models for
397 ppf-err A small number of other known problems remain uncorrected in this version and are briefly listed below. Please watch the TEI mailing list for announcements of their correction.
410 ppf-err need to be addressed systematically; in particular, the treatment of list items or notes which contain several paragraphs continues to surprise many users: no whitespace is allowed between the paragraphs;
419 ppf-err Our next priority however will be the production of a fully XML-compliant version of the TEI DTD, work on which is already well advanced.
429 PF The impetus for the project came from the humanities computing community, which sought a common encoding scheme for complex textual structures in order to reduce the diversity of existing encoding practices, simplify processing by machine, and encourage the sharing of electronic texts. It soon became apparent that a sufficiently flexible scheme could provide solutions for text encoding problems generally. The scope of the TEI was therefore broadened to meet the varied encoding requirements of any discipline or application. Thus, the TEI became the only systematized attempt to develop a fully general text encoding model and set of encoding conventions based upon it, suitable for processing and analysis of any type of text, in any language, and intended to serve the increasing range of existing (and potential) applications and use.
431 PF What is published here is a major milestone in this effort. It provides a single, coherent framework for all kinds of text encoding which is hardware-, software- and application-independent. Within this framework, it specifies encoding conventions for a number of key text types and features. The ongoing work of the TEI is to extend the scheme presented here to cover additional text types and features, as well as to continue to refine its encoding recommendations on the basis of extensive experience with their actual application and use.
433 PF We therefore offer these Guidelines to the user community for use in the same spirit of active collaboration and cooperation with which they have so far been developed. The TEI is committed to actively supporting the wide-spread and large-scale use of the Guidelines which, with the publication of this volume, is now for the first time possible. In addition, we anticipate that users of the TEI Guidelines will in some instances adapt and extend them as necessary to suit particular needs; we invite such users to engage in the further development of the Guidelines by working with us as they do so.
435 PF Like any standard which is actually used, these Guidelines do not represent a static finished work, but rather one which will evolve over time with the active involvement of its community of users. We invite and encourage the participation of the user community in this process, in order to ensure that the TEI Guidelines become and remain useful in all sorts of work with machine-readable texts.
437 PF This document was made possible in part by financial support from the U.S. National Endowment for the Humanities, an independent federal agency; Directorate General XIII of the Commission of the European Communities; the Andrew W. Mellon Foundation; and the Social Science and Humanities Research Council of Canada. Direct and indirect support has also been received from the University of Illinois at Chicago, the Oxford University Computing Services, the University of Arizona, the University of Oslo and Queen's University (Kingston, Ont.), Bellcore (Bell Communications Research), the Istituto di Linguistica Computazionale (C.N.R.) Pisa, the British Academy, and Ohio State University, as well as the employers and host institutions of the members of the TEI working committees and work groups listed in the acknowledgments.
439 PF The production of this document has been greatly facilitated by the willingness of many software vendors to provide us with evaluation versions of their products. Most parts of this text have been processed at some time by almost every currently available SGML-aware software system. In particular, we gratefully acknowledge the assistance of the following vendors:
456 PF Details of the software actually used to produce the current document are given in the colophon at the end of the work.
461 WG Many people have given of their time, energy, expertise, and support in the creation of this document; it is unfortunately not possible to thank them all adequately. Below are listed those who have served as formal members of the TEI's Work Groups and Working Committees during its six-year history; others not so officially enfranchised also contributed much to the quality of the result.
467 WGWC TEI Working Committees (1990-1993)
495 WGWC In addition, the two TEI editors served ex officio on each committee.
497 WGWC Following publication of the first draft of the TEI Guidelines (P1) in November 1990, a number of specialist work groups were charged with responsibility for drafting revisions and extensions, which, together with material already presented in P1, constitute the basis of the present work.
499 WGWC In addition, many members of the work groups listed below met on three occasions to review the emerging proposals in detail at technical review meetings convened by the TEI Steering Committee. These meetings, held in Myrdal, Norway (November 1991), Chicago (May 1992) and Oxford (May 1993), were largely responsible for the technical content and organization of the present work. Attendants at these meetings are starred in the list below.
521 WGWC TR11 Drama and performance texts
530 WGWC AI2 Spoken text
542 WGWC AI5 Print dictionaries
554 WGAB Members of the TEI Advisory Board during the lifetime of the project are listed below, grouped under the name of the organization represented.
603 WGSC Members of the Steering Committee of the TEI during the preparation of this work were:

DI-PrintDictionaries.xml#13091

# id text
4 DI This chapter defines a module for encoding lexical resources of all kinds, in particular human-oriented monolingual and multilingual dictionaries, glossaries, and similar documents. The elements described here may also be useful in the encoding of computational lexica and similar resources intended for use by language-processing software; they may also be used to provide a rich encoding for wordlists, lexica, glossaries, etc. included within other documents. Dictionaries are most familiar in their printed form; however, increasing numbers of dictionaries exist also in electronic forms which are independent of any particular printed form, but from which various displays can be produced.
6 DI Both typographically and structurally, print dictionaries are extremely complex. Such lexical resources are moreover of interest to many communities with different and sometimes conflicting goals. As a result, many general problems of text encoding are particularly pronounced here, and more compromises and alternatives within the encoding scheme may be required in the future.
21 DI dictionaries; encoding guidelines should include these structural principles. We therefore define two distinct elements for dictionary entries, one (
34 DI Second, since so much of the information in printed dictionaries is implicit or highly compressed, their encoding requires clear thought about whether it is to capture the precise typographic form of the source text or the underlying structure of the information it presents. Since both of these views of the dictionary may be of interest, it proves necessary to develop methods of recording both, and of recording the interrelationship between them as well. Users interested mainly in the printed format of the dictionary will require an encoding to be faithful to an original printed version. However, other users will be interested primarily in capturing the lexical information in a dictionary in a form suitable for further processing, which may demand the expansion or rearrangement of the information contained in the printed form. Further, some users wish to encode
36 DI of these views of the data, and retain the links between related elements of the two encodings. Problems of recording these two different views of dictionary data are discussed in section
37 DI , together with mechanisms for retaining both views when this is desired.
39 DI To deal with this complexity, and in particular to account for the wide variety of linguistic contexts within which a dictionary may be designed, it can be necessary to customize or change the schema by providing more restriction or possibly alternate content models for the elements defined in this chapter. Section
40 DI illustrates this with the provision of a closed set of values for grammatical descriptors.
42 DI This chapter contains a large number of examples taken from existing print dictionaries; in each case, the original source is identified. In presenting such examples, we have tried to retain the original typographic appearance of the example as well as presenting a suggested encoding for it. Where this has not been possible (for example in the display of pronunciation) we have adopted the transliteration found in the electronic edition of the
44 DI . Also, the middle dot in quoted entries is rendered with a full stop, while within the sample transcriptions hyphenation and syllabification points are indicated by a vertical bar |, regardless of their appearance in the source text.
49 DIBO Overall, dictionaries have the same structure of front matter, body, and back matter familiar from other texts. In addition, this module defines
55 DIBO as component-level elements which can occur directly within a text division or the text body.
68 DIBO As members of the classes
82 DIBO The front and back matter of a dictionary may well contain specialized material such as lists of common and proper nouns, grammatical tables, gazetteers, a
84 DIBO , etc. These should be tagged using elements defined elsewhere in these Guidelines, chiefly in the core module (chapter
89 DIBO element consists of a set of
93 DIBO elements. These text divisions might, for example, correspond to sections for different letters of the alphabet, or to sections for different languages in a bilingual dictionary, as in the following example:
118 DIBO In a print dictionary, the entries are typically typographically distinct entities, each headed by some morphological form of the lexical item described (the
120 DIBO ), and sorted in alphabetical order or (especially for non-alphabetic scripts) in some other conventional sequence. Dictionary entries should be encoded as distinct successive items, each marked as an
128 DIBO Some dictionaries provide distinct entries for homographs, on the basis of etymology, part-of-speech, or both, and typically provide a numeric superscript on the headword identifying the homograph number. In these cases each homograph should be encoded as a separate entry; the
130 DIBO element may optionally be used to group such successive homograph entries. In addition to a series of
136 DIBO group (see section
137 DIBO ) when information about hyphenation, pronunciation, etc., is given only once for two or more homograph entries. If the homograph number is to be recorded, the global attribute
139 DIBO may be used for this purpose. In some dictionaries, homographs are treated in distinct parts of the same entry; in these cases, they may be separated by use of the
146 DIBO attribute, is often required for superentries and entries, especially in cases where the order of entries does not follow the local character-set collating sequence (as, for example, when an entry for
148 DIBO appears at the place where
210 DIEN A simple dictionary entry may contain information about the form of the word treated, its grammatical characterization, its definition, synonyms, or translation equivalents, its etymology, cross-references to other entries, usage information, and examples. These we refer to as the
224 DIEN In addition, however, dictionary entries often have a complex hierarchical structure. For example, an entry may consist of two or more sub-parts, each corresponding to information for a different part-of-speech homograph of the headword. The entry (or part-of-speech homographs, if the entry is split this way) may also consist of senses, each of which may in turn be composed of two or more sub-senses, etc. Each sub-part, homograph entry, sense, or sub-sense we call a
232 DIENHI The outermost structural level of an entry is marked with the elements
242 DIENHI element even for an entry that has only one sense to group together all parts of the definition relating to the word sense since this leads to more consistent encoding across entries. All of these levels may each contain any of the constituent parts of an entry. A special case of hierarchical structure is represented by the
247 DIENHI may be used at any point in the hierarchy to delimit parts of the dictionary entry which are structurally anomalous, as further discussed in section
257 DIENHI For example, an entry with two senses will have the following structure:
265 DIENHI An entry with two homographs, the first with two senses and the second with three (one of which has two sub-senses), may have a structure like this:
326 DIENHI The hierarchic structure of a dictionary entry is enforced by the structures defined in this module. The content model for
328 DIENHI specifies that entries do not nest, that homographs nest within entries, and that senses nest within entries, homographs, or senses, and may be nested to any depth to reflect the embedding of sub-senses. Any of the top-level constituents (
352 DIENGP information about the form of the word treated (orthography, pronunciation, hyphenation, etc.)
356 DIENGP definitions or translations into another language
395 DIENGP In a simple entry with no internal hierarchy, all top-level constituents can appear as children of
403 DIENGP n person who competes.
432 DIENGP Any top-level constituent can appear at any level when the hierarchical structure of the entry is more complex. The most obvious examples are
438 DIENGP level when several senses or translations exist:
481 DIENGP n cry of an ass; sound of a trumpet. ∙ vt [VP2A] make a cry or sound of this kind.
518 DIENGP Information of the same kind can appear at different levels within the same entry; here, grammatical information occurs both at entry and homograph level.
582 DIENGP 2 n [U] the state when one's feelings and actions are uncontrolled; freedom from control...
677 DITPFO Dictionary entries most often begin with information about the form of the word to which the entry applies. Typically, the orthographic form of the word, sometimes marked for syllabification or hyphenation, is the first item in an entry. Other information about the word, including variant or alternate forms, inflected forms, pronunciation, etc., is also often given.
712 DITPFO gen, number, case
723 DITPFO when describing that particular form of the word.
725 DITPFO Different dictionaries use different means to mark hyphenation, syllabification, and stress, and they often use some unusual glyphs (e.g., the
728 DITPFO . When transcribing representations of pronunciation the International Phonetic Alphabet should be used. It may be convenient (as has been done in the text of this chapter) to use a simple transliteration scheme for this; such a scheme should however be properly documented in the header.
753 DITPFO For a variety of reasons including ease of processing, it may be desired to split into separate elements information which is collapsed into a single element in the source text; orthography and hyphenation may for example be transcribed as separate elements, although given together in the source text. For a discussion of the issues involved, and of methods for retaining both the presentation form and the interpreted form, see section
797 DITPFO Or the inflectional pattern may be indicated by reference to a table of paradigms, as here:
820 DITPFO Explanatory labels may be attached to alternate forms:
825 DITPFO mean time between failures.
866 DITPFO element is repeated to associate the first orthographic form explicitly with the first pronunciation, and the second orthographic form with the second pronunciation:
894 DITPFO element can preserve relations among elements that are implicit in the text. For example, in the CED entry for
962 DITPGR , or any other element containing content about which there is grammatical information. For example, in the entry
977 DITPGR , the elements for morphological information are simply shorthand for the general purpose
979 DITPGR element. Consider this entry for the French word
987 DITPGR This entry can be tagged using specialized grammatical elements:
1120 DITPSE Dictionaries may describe the meanings of words in a wide variety of different ways—by means of synonyms, paraphrases, translations into other languages, formal definitions in various highly stylized forms, etc. No attempt is made here to distinguish all the different forms which sense information may take; all of them may be tagged using the
1125 DITPSE As a special case it is frequently desirable to distinguish the provision of translation equivalents in other languages from other forms of sense information; the use of
1126 DITPSE cit type="translation"
1127 DITPSE (which groups a translation equivalent with related information such as its grammatical description) for this purpose is described in section
1134 DITPDE Dictionary definitions are those pieces of prose in a dictionary entry that describe the meaning of some lexical item. Most often, definitions describe the headword of the entry; in some cases, they describe translated texts, examples, etc.; see
1135 DITPDE cit type="translation"
1138 DITPDE cit type="example"
1142 DITPDE element directly contains the text of the definition; unlike
1146 DITPDE , it does not serve solely to group a set of smaller elements. The close analysis of definition text, such as the tagging of hypernyms, typical objects, etc., is not covered by these Guidelines.
1148 DITPDE Definitions may occur directly within an entry; when multiple definitions are given, they are typically identified as belonging to distinct senses, as here:
1228 DITPTR Multilingual dictionaries contain information about translations of a given word in some source language for one or more target languages. Minimally, the dictionary provides the corresponding translation in the target language; other material, such as morphological information (gender, case), various kinds of usage restrictions, etc., may also be given. If translation equivalents are to be distinguished from other kinds of sense information, they may be encoded using
1229 DITPTR cit type="translation"
1236 DITPTR element is used in multilingual dictionaries to group information (forms, grammatical information, usage, translation(s), etc.) about a given sense of a word where necessary. Information about the individual translation equivalents within a sense is grouped using
1237 DITPTR cit type="translation"
1238 DITPTR . This information may include the translation text (tagged
1260 DITPTR Note how in the following example, different translation equivalents are grouped into the same or different senses, following the punctuation of the source and the usage labels:
1389 DITPTR cit type="translation"
1390 DITPTR may also be used in monolingual dictionaries when a translation is given for a foreign word:
1437 DITPET marks a block of etymological information. Etymologies may contain highly structured lists of words in an order indicating their descent from each other, but often also include related words and forms outside the direct line of descent, for comparison. Not infrequently, etymologies include commentary of various sorts, and can grow into short (or long!) essays with prose-like structure. This variation in structure makes it impracticable to define tags which capture the entire intellectual structure of the etymology or record the precise interrelation of all the words mentioned. It is, however, feasible to mark some of the more obvious phrase-level elements frequently found in etymologies, using tags defined in the core module or elsewhere in this chapter. Of particular relevance for the markup of etymologies are:
1449 DITPET As in other prose, individual word forms mentioned in an etymological description are tagged with
1459 DITPET element may be used to identify a particular language name where it appears, in addition to using the
1545 DITPEG cit type="example"
1546 DITPEG element contains usage examples and associated information; the example text itself should be enclosed in a
1552 DITPEG element associates a quotation with a bibliographic reference to its source.
1571 DITPEG adj tech having many parts: the multiplex eye of the fly.
1578 DITPEG Or when one wants a more comprehensive representation of examples:
1679 DITPEG When a source is indicated, the example should be marked with a
1710 DITPUS Most dictionaries provide restrictive labels and phrases indicating the usage of given words or particular senses. Other phrases, not necessarily related to usage, may also be attached to forms, translations, cross-references, and examples. The following elements are provided to mark up such labels:
1717 DITPUS element may be used for any kind of significative phrase or label within the text. The
1733 DITPUS Many dictionaries provide an explanation and/or a list of such usage labels in a preface or appendix. The type of the usage information may be indicated in the
1740 DITPUS geo
1746 DITPUS time
1759 DITPUS domain
1762 DITPUS reg
1790 DITPUS lang
1793 DITPUS language for foreign words, spellings, pronunciations, etc.
1796 DITPUS gram
1801 DITPUS In addition to this kind of information, multilingual dictionaries often provide
1803 DITPUS to help the user determine the right sense of a word in the source language (and hence the correct translation). These include synonyms, concept subdivisions, typical subjects and objects, typical verb complements, etc. These labels may also be marked with the
1822 DITPUS colloc
1855 DITPUS unclassifiable piece of information to guide sense choice
1961 DITPUS When the usage label is hard to classify, it may be described as a
1994 DITPXR Dictionary entries frequently refer to information in other entries, often using extremely dense notations to convey the headword of the entry to be sought, the particular part of the entry being referred to, and the nature of the information to be sought there (synonyms, antonyms, usage notes, etymology, an illustration, etc.)
1996 DITPXR Cross-references may be tagged in dictionaries using the
2000 DITPXR elements defined in the core module (section
2003 DITPXR element may be used to group all the information relating to a cross-reference.
2015 DITPXR ) is used to tag the cross-reference target proper (in dictionaries, usually the headword, possibly accompanied by a homograph number, a sense number, or other further restriction specifying what portion of the target entry is being referred to). The
2017 DITPXR element is used to group the target with any accompanying phrases or symbols used to label the cross-reference; the cross-reference label itself may be tagged as a
2057 DITPXR to mark the cross-reference label, the two examples differ in another way. The former assumes that the first sense of
2061 DITPXR , and that the specific form of the reference in the source volume can be reconstructed, if needed, from that information. The latter does not require the first sense of
2063 DITPXR to have an identifier, and retains the print form of the cross-reference; by omitting the
2069 DITPXR and find the location referred to, or else that such processing will not be necessary.
2075 DITPXR element may be used to indicate what kind of cross-reference is being made, using any convenient typology. Since different dictionaries may label the same kind of cross-reference in different ways, it may be useful to give normalized indications in the
2131 DITPXR Strictly speaking, the reference above is not to the entry for
2133 DITPXR , but to the list of synonyms found within that entry.
2135 DITPXR In some cases, the cross-reference is to a particular subset of the meanings of the entry in question:
2167 DITPXR The asterisk signals a reference to the entry for
2175 DITPXR In some cases, the form in the definition is inflected, and thus
2226 DITPNO am not, is not, are not, have not
2232 DITPNO Although the interrogative form
2235 DITPNO am I not?
2236 DITPNO , it is generally avoided in spoken English and never used in formal English.
2291 DITPRE element encloses a degenerate entry which appears in the body of another entry for some purpose. Many dictionaries include related entries for direct derivatives or inflected forms of the entry word, or for compound words, phrases, collocations, and idioms containing the entry word.
2372 DIHW Examples, definitions, etymologies, and occasionally other elements such as cross-references, orthographic forms, etc., often contain a shortened or iconic reference to the headword, rather than repeating the headword itself. The references may be to the orthographic form or to the pronunciation, to the form given or to a variant of that form. The following elements are used to encode such iconic references to a headword:
2382 DIHW which may optionally be used to resolve any ambiguity about the headword form being referred to.
2390 DIHW indicates a reference to the full form of the headword
2410 DIHW gives the initial of the word followed by a full stop, to indicate reference to the full form of the headword
2414 DIHW refers to a capitalized form of the headword
2420 DIHW element should be used for iconic or shortened references to the orthographic form(s) of the headword itself. It is an empty element and replaces, rather than enclosing, the reference. Note that the reference to a headword is not necessarily a simple string replacement. In the example
2426 DIHW , the tilde stands for either headword form (
2520 DIHW attribute to refer to a specific form of the headword:
2525 DIHW comb form … : vagus nerve <
2625 DIHW In many cases the reference is not to the orthographic form of the headword, but rather to another form of the headword—usually to an inflected form. In these cases, the element
2627 DIHW should be used; this element takes as its content the string as it appears in the text.
2666 DIHW , which are defined in the additional module for linking, segmentation, and alignment (see chapter
2689 DIHW In addition, some dictionaries make reference to the pronunciation of the headword in the pronunciation of related entries, variants, or examples. The
2746 DIHW Since existing printed dictionaries use different conventions for headword references (swung dash, first letter abbreviated form, capitalization, or italicization of the word, etc.) the exact method used should be documented in the header.
2764 DIMV typographic view
2765 DIMV —the two-dimensional printed page, including information about line and page breaks and other features of layout
2768 DIMV editorial view
2769 DIMV —the one-dimensional sequence of tokens which can be seen as the input to the typesetting process; the wording and punctuation of the text and the sequencing of items are visible in this view, but specifics of the typographic realization are not
2772 DIMV lexical view
2773 DIMV —this view includes the underlying information represented in a dictionary, without concern for its exact textual form
2777 DIMV For example, a domain indication in a dictionary entry might be broken over a line and therefore hyphenated (
2781 DIMV ); the typographic view of the dictionary preserves this information. In a purely editorial view, the particular form in which the domain name is given in the particular dictionary (as
2787 DIMV , etc.) would be preserved, but the fact of the line break would not. Font shifts might plausibly be included in either a strictly typographic or an editorial view. In the lexical view, the only information preserved concerning domain would be some standard symbol or string representing the nautical domain (e.g.
2789 DIMV ) regardless of the form in which it appears in the printed dictionary.
2795 DIMV , the fonts in which different types of information are to be rendered, etc.), and then the typographic view, which is tied to a specific printed rendering. Computational linguists and philologists often begin with the typographic view and analyse it to obtain the editorial and/or lexical views. Some users may ultimately be concerned with retaining only the lexical view, or they may wish to preserve the typographic or editorial views as a reference text, perhaps as a guard against the loss or misinterpretation of information in the translation process. Some researchers may wish to retain all three views, and study their interrelations, since research questions may well span all three views.
2797 DIMV In general, an electronic encoding of a text will allow the recovery of at least one view of that text (the one which guided the encoding); if editorial and typographic practices are consistently applied in the production of a printed dictionary, or if exceptions to the rules are consistently recorded in the electronic encoding, then it is
2799 DIMV possible to recover the editorial view from an encoding of the lexical view, and the typographic view from an encoding of the editorial view. In practice, of course, the severe compression of information in dictionaries, the variety of methods by which this compression is achieved, the complexity of formulating completely explicit rules for editorial and typographic practice, and the relative rarity of complete consistency in the application of such rules, all make the mechanical transformation of information from one view into another something of a vexed question.
2801 DIMV This section describes some principles which may be useful in capturing one or the other of these views as consistently and completely as possible, and describes some methods of attempting to capture more than one view in a single encoding. Only the editorial and lexical views are explicitly treated here; for methods of recording the physical or typographic details of a text, see chapter
2806 DIMV attributes to link feature structures to a transcription of the editorial view of a dictionary, are not discussed here (for feature structures, see chapter
2807 DIMV . For linkage of textual form and underlying information, see chapter
2813 DIMVTV Common practice in encoding texts of all sorts relies on principles such as the following, which can be used successfully to capture the editorial view when encoding a dictionary:
2815 DIMVTV All characters of the source text should be retained, with the possible exception of
2816 DIMVTV rendition text
2819 DIMVTV Characters appearing in the source text should typically be given as character data content in the document, rather than as the value of an attribute; again, rendition text may optionally be excepted from this rule.
2821 DIMVTV Apart from the characters or graphics in the source text, nothing else should appear as content in the document, although it may be given in attribute values.
2823 DIMVTV The material in the source text should appear in the encoding in the same order. Complications of the character sequence caused by footnotes, marginal notes, text wrapping around illustrations, etc., may be dealt with by the usual means (for notes, see section
2825 DIMVTV Complications of sequence caused by marginal or interlinear insertions and deletions, which are frequent in manuscripts, or by unconventional page layouts, as in concrete poetry, magazines with imaginative graphic designers, and texts about the nature of typography as a medium, typically do not occur in dictionaries, and so are not discussed here.
2830 DIMVTV In a very conservative transcription of the editorial view of a text,
2831 DIMVTV rendition characters
2833 DIMVTV rendition text
2834 DIMVTV (for example, conjunctions joining alternate headwords, etc.) are typically retained. Removing the tags from such a transcription will leave all and only the characters of the source text, in their original sequence.
2835 DIMVTV This is a slight oversimplification. Even in conservative transcriptions, it is common to omit page numbers, signatures of gatherings, running titles and the like. The simple description above also elides, for the sake of simplicity, the difficulties of assigning a meaning to the phrase
2836 DIMVTV original sequence
2837 DIMVTV when it is applied to the printed characters of a source text; the
2838 DIMVTV original sequence
2839 DIMVTV retained or recovered from a conservative transcription of the editorial view is, of course, the one established during the transcription by the encoder.
2849 DIMVTV . a feather, wing, fin, or similarly shaped part. 3. another name for
2853 DIMVTV A conservative encoding of the editorial view of this entry, which retains all rendition text, might resemble the following:
2916 DIMVTV A somewhat simplified encoding of the editorial view of this entry might exploit the fact that rendition text is often systematically recoverable. For example, parentheses consistently appear around pronunciation in this dictionary, and thus are effectively implied by the start- and end-tags for
2919 DIMVTV The omission of rendition text is particularly common in systems for document production; it is considered good practice there, since automatic generation of rendition text is more reliable and more consistent than attempting to maintain it manually in the electronic text.
2920 DIMVTV In such an encoding, removing the tags should exactly reproduce the sequence of characters in the source, minus rendition text. The original character sequence can be recovered fully by replacing tags with any rendition text they imply.
2924 DIMVTV element in the header would be used to record the following patterns of rendition text:
2934 DIMVTV appears before alternate forms
2940 DIMVTV , inflection information, and sense numbers
2942 DIMVTV senses are numbered in sequence unless otherwise specified using the global
3006 DIMVTV When rendition text is omitted, it is recommended that the means to regenerate it be fully documented, using the
3008 DIMVTV element of the TEI header.
3010 DIMVTV If rendition text is used systematically in a dictionary, with only a few mistakes or exceptions, the global attribute
3012 DIMVTV may be used on any tag to flag exceptions to the normal treatment. The values of the
3020 DIMVTV element in the TEI header.
3052 DIMVLV If the text to be interchanged retains only the lexical view of the text, there may be no concern for the recoverability of the editorial (not to speak of the typographic) view of the text. However, it is strongly recommended that the TEI header be used to document fully the nature of all alterations to the original data, such as normalization of domain names, expansion of inflected forms, etc.
3054 DIMVLV In an encoding of the lexical view of a text, there are degrees of departure from the original data: normalizing inconsistent forms like
3068 DIMVLV reorganizing the order of elements in an entry to show their relationship, as in
3073 DIMVLV where in a strictly lexical view one might wish to group
3079 DIMVLV splitting an entry into two separate entries, as in
3082 DIMVLV /"selIb@sI/ n [U] state of living unmarried, esp as a religious obligation. celi.bate /"selIb@t/ n [C] unmarried person (esp a priest who has taken a vow not to marry).
3084 DIMVLV For some purposes, this entry might usefully be split into an entry for
3086 DIMVLV and a separate entry for
3092 DIMVLV An encoding which captures the lexical view of the example given in the previous section might look something like the following. In this encoding:
3161 DIMVLV Whether the given dictionary encoding focusses on the lexical view and thus approaches the status of lexical databases, or uses the typographic/editorial view approach and needs to communicate the sometimes informally stated values for the particular descriptive features, the issue of
3163 DIMVLV of the content and of the container objects becomes relevant, in view of the growing tendency to interlink pieces of information across Internet resources. In such cases, it becomes crucial to be able to encode the fact that, whether the information on, for instance, the value of the grammatical category of Number is provided as "sg.", "sing.", "Singular", or equivalently "poj." in Polish, or "Ez." in German, etc., what is actually referred to is always the same grammatical value that can be rendered with a plethora of markers, depending on the publisher, language, or lexicographic tradition. In order to signal that this variety of surface markers in fact indicates the same underlying value, it is possible to align them with an external inventory of standardized values. The TEI provides means to align grammatical categories as well as their content with the ISOcat reference, which is a Web implementation of
3167 DIMVLV In the example below, a fragment of the entry for
3174 DIMVLV ). Depending on the status and extent of the dictionary, various strategies may be used to reduce the redundancy of the repeated ISOcat references.
3193 DIMVBO It is sometimes desirable to retain both the lexical and the editorial view, in which case a potential conflict exists between the two. When there is a conflict between the encodings for the lexical and editorial views, the principles described in the following sections may be applied.
3198 DIMVAV If the order of the data is the same in both views, then both views may be captured by encoding one
3200 DIMVAV view in the character data content of the document, and encoding the other using attribute values on the appropriate elements. If all tags were to be removed, the remaining characters would be those of the dominant view of the text.
3204 DIMVAV is used to provide attributes for use in encoding multiple views of the same dictionary entry. These attributes are available for use on all elements defined in this chapter when the base module for dictionaries is selected.
3206 DIMVAV When the editorial view is dominant, the following attributes may be used to capture the lexical view:
3211 DIMVAV When the lexical view is dominant, the following attributes may be used to record the editorial view:
3221 DIMVAV For example, if the source text had the domain label
3223 DIMVAV , it might be encoded as follows. With the editorial view dominant:
3227 DIMVAV The lexical view of the same label would transcribe the normalized form as content of the
3229 DIMVAV element, the typographic form as an attribute value:
3235 DIMVAV If the source text gives inflectional information for the verb
3241 DIMVAV . An encoding of the editorial view might take this form:
3259 DIMVAV tag with null content, to enable the representation of implicit information even though it has no print realization.
3261 DIMVAV The lexical view might be encoded thus:
3284 DIMVAV A particular problem may be posed by the common practice of presenting two alternate forms of a word in a single string, by marking some parts of the word as optional in some forms. The following entry is for a word which can be spelled either
3292 DIMVAV With the editorial view dominant, this entry might begin thus:
3300 DIMVAV With the lexical view dominant, however, two
3349 DIMVAV attribute is recommended, however, when long spans of text are involved, or when the optional part contains embedded tags.
3362 DIMVAV A simple encoding solution would be to leave the definition text unanalysed, but this might be felt inadequate since it does not show that there are two definitions. A possible alternative encoding would be:
3372 DIMVAV This transcribes some characters of the source text twice, however, which deviates from the usual practice. The following encoding records both the editorial and lexical views:
3388 DIMVOL The attributes described in the previous section are useful only when the order of material is the same in both the editorial and the lexical view. When the two views impose different orders on the data, the standard linking mechanisms may be used to show the original location of material transposed in an encoding of the lexical view.
3392 DIMVOL element may be used to mark the original location of the material, and the
3394 DIMVOL attribute may be used on the lexical encoding of that material to indicate its original location(s). Like those in the preceding section, this attribute is defined for the attribute class
3562 DIFR The content model for the
3564 DIFR element provides an entry structure suitable for many average dictionaries, as well as many regular entries in more exotic dictionaries. However, the structure of some dictionaries does not allow the restrictions imposed by the content model for
3570 DIFR elements are provided to support much wider variation in entry structure. The
3572 DIFR element offers less freedom, in that it can only contain phrase level elements, but it can itself appear at any point within a dictionary entry where any of the structural components of a dictionary entry are permitted. As such, it acts as a container for otherwise anomalous parts of an entry.
3588 DIFR element. For example, in the following entry from a dictionary already in electronic form, it is necessary to include a
3592 DIFR . This is not permitted in the content model for
3629 DIFR ) elements—that is, using no grouping elements at all. This can be desirable if the encoder wants a completely
3631 DIFR view, with no indication of or commitment to the association of one element with another. The following encoding uses no grouping elements, and keeps all rendition text:
3659 DIFR Here is an alternative way of representing the same structure, this time using
3697 DI The selection and combination of modules to form a TEI schema is described in
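
The rows above on the editorial view (DIMVTV) state that, when rendition text is omitted, removing the tags yields the source characters minus rendition text, and that the full sequence can be recovered by replacing tags with the rendition text they imply. A minimal Python sketch of that idea follows. The entry fragment, the pronunciation string, and the rule table `RENDITION` are invented for illustration; they are not taken from the Guidelines or from any TEI tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical rendition rules: element name -> (text before, text after).
# Here we assume pronunciations are always printed in parentheses.
RENDITION = {"pron": ("(", ")")}


def render(elem, rules):
    """Return the character data of elem with implied rendition text re-inserted."""
    before, after = rules.get(elem.tag, ("", ""))
    parts = [before, elem.text or ""]
    for child in elem:
        parts.append(render(child, rules))
        parts.append(child.tail or "")  # text following the child's end-tag
    parts.append(after)
    return "".join(parts)


# Invented entry fragment encoded without the parentheses around <pron>.
entry = ET.fromstring(
    "<entry><form><orth>pinna</orth> <pron>pIn@</pron></form></entry>"
)

plain = "".join(entry.itertext())   # tags removed, rendition text absent
restored = render(entry, RENDITION)  # rendition text regenerated from tags

print(plain)
print(restored)
```

In this sketch the rule table plays the role that the header documentation of rendition patterns plays in the text: the encoding stays free of parentheses, and they are regenerated only when the printed form is needed.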

CC-LanguageCorpora.xml#13064

# id text
3 CC The term
4 CC language corpus
5 CC is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (for example, written, spoken, signed, or multimodal), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.
7 CC These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed
8 CC core
9 CC language. A corpus may be made up of whole texts or of fragments or text samples. It may be a
15 CC corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section
23 CC ). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI modules.
25 CC Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. Therefore, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter
27 CC ). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter
28 CC ), drama (chapter
29 CC ), transcriptions of spoken text (chapter
31 CC should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application. In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators.
33 CC This chapter does however include some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section
35 CC describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional module for language corpora proper. Section
36 CC discusses a mechanism by which individual parts of the TEI header may be associated with different parts of a TEI-conformant text. Section
37 CC reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section
55 CCDEF ); this section discusses their application to composite texts in particular.
58 CCDEF text
59 CCDEF refers to any stretch of discourse, whether complete or incomplete, unitary or composite, which the encoder chooses (perhaps merely for purposes of analytic convenience) to regard as a unit. The term
60 CCDEF composite text
63 CCDEF language corpora
67 CCDEF poem cycles and epistolary works (novels or essays written in the form of collections or series of letters)
70 CCDEF The elements listed above may be combined to encode each of these varieties of composite text in different ways.
72 CCDEF In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.
76 CCDEF element is intended for the encoding of language corpora, though it may also be useful in encoding newspapers, electronic anthologies, and other disparate collections of material. The individual samples in the corpus are encoded as separate
78 CCDEF elements, and the entire corpus is enclosed in a
88 CCDEF element, in which the corpus as a whole, and encoding practices common to multiple samples may be described. The overall structure of a TEI-conformant corpus is thus:
105 CCDEF Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the
107 CCDEF element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section
112 CCDEF In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The
114 CCDEF element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged
118 CCDEF If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the
121 CCDEF . The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section
122 CCDEF , and those for associating declarations with corpus components described in section
123 CCDEF . These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and misclassification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.
125 CCDEF Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study.
127 CCDEF Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: in both cases, the body of the text is composed largely of other texts.
133 CCDEF element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter
140 CCDEF elements. The embedded text itself may be encoded using the
145 CCDEF All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts may be encoded using the same module, then no problem arises. If however they require different modules, then these must be included in the schema. This process is described in more detail in section
150 CCAH Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex, and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication data of a newspaper; the topic, register or factuality of an extract from a textbook. Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics).
152 CCAH Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI header, as described in chapter
153 CCAH . In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section
157 CCAH , which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section
159 CCAH ); more detailed descriptive information about the creation and content of the corpus, such as the languages used within it and any descriptive classification system used (see section
160 CCAH ); and version information documenting any changes made in the electronic text (see section
164 CCAH , several other elements can be used in the TEI header if the additional module defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus this module is referred to as being
165 CCAH for language corpora
168 CCAH When the module defined in this chapter is included in a schema, a number of additional elements become available within the
170 CCAH element of the TEI header (discussed in section
187 CCAHTD element provides a full description of the situation within which a text was produced or experienced, and thus characterizes it in a way relatively independent of any
191 CCAHTD . The description is organized as a set of values and optional prose descriptions for the following eight
200 CCAHTD By default, a text description will contain each of the above elements, supplied in the order specified. Except for the
202 CCAHTD element, which may be repeated to indicate multiple purposes, no element should appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content
206 CCAHTD Texts may be described along many dimensions, according to many different taxonomies. No generally accepted consensus as to how such taxonomies should be defined has yet emerged, despite the best efforts of many corpus linguists, text linguists, sociolinguists, rhetoricians, and literary theorists over the years. Rather than attempting the task of proposing a single taxonomy of
208 CCAHTD (or the equally impossible one of enumerating all those which have been proposed previously), the closed set of
220 CCAHTD it is equally applicable to spoken, written, or signed texts
222 CCAHTD Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section
224 CCAHTD . A second approach is to develop an application-specific set of
232 CCAHTD Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a
234 CCAHTD in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a
235 CCAHTD taxonomy
243 CCAHTD element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in sections
308 CCAHPA element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. When the detailed elements provided by the
311 CCAHPA are included in a schema, this element can contain detailed demographic or descriptive information about individual speakers or groups of speakers, such as their names or other personal characteristics. Individually identified persons may also be identified by a code which can then be used elsewhere within the encoded text, for example as the value of a
316 CCAHPA speaker
321 CCAHPA within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written, spoken, or signed texts.
325 CCAHPA contains a description of the participants in an interaction, which may be supplied as straightforward prose, possibly containing a list of names, encoded using the usual
341 CCAHPA Alternatively, when the
365 CCAHPA An identified character in a drama or a novel may also be regarded as a participant in this sense, and encoded using the same techniques:
366 CCAHPA It is particularly useful to define participants in a dramatic text in this way, since it enables the
368 CCAHPA attribute to be used to link
393 CCAHSE element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings.
395 CCAHSE Each distinct setting is described by means of a
405 CCAHSE . If this attribute is not specified, the setting details provided are assumed to apply to all participants represented in the language interaction. Note however that it is not possible to encode different settings for the same participant: a participant is deemed to be a person within a specific setting.
409 CCAHSE element may contain either a prose description or a selection of elements from the classes
415 CCAHSE . By default, when the module defined by this chapter is included in a schema, these classes thus provide the following elements:
426 CCAHSE may also be available if the
430 CCAHSE The following example demonstrates the kind of background information often required to support transcriptions of language interactions, first encoded as a simple prose narrative:
471 CCAHSE Again, a more detailed encoding for places is feasible if the
473 CCAHSE module is included in the schema. The above examples assume that only the general purpose
475 CCAHSE element supplied in the core module is available.
484 CCAS This section discusses the association of the contextual information held in the header with the individual elements making up a TEI text or corpus. Contextual information is held in elements of various kinds within the TEI header, as discussed elsewhere in this section and in chapter
485 CCAS . Here we consider what happens when different parts of a document need to be associated with different contextual information of the same type, for example when one part of a document uses a different encoding practice from another, or where one part relates to a different setting from another. In such situations, there will be more than one instance of a header element of the relevant type.
487 CCAS The TEI scheme allows for the following possibilities:
489 CCAS A given element may appear in the corpus header only, in the header of one or more texts only, or in both places
491 CCAS There may be multiple occurrences of certain elements in either corpus or text header.
498 CCAS1 A TEI-conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus-header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone.
502 CCAS1 for a corpus text is understood to be prefixed by the
504 CCAS1 given in the corpus header. All other optional elements of the
506 CCAS1 should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented below. This facility makes it possible to state once for all in the corpus header each piece of contextual information which is common to the whole of the corpus, while still allowing for individual texts to vary from this common denominator.
508 CCAS1 For example, the following schematic shows the structure of a corpus comprising three texts, the first and last of which share the same encoding description. The second one has its own encoding description.
555 CCAS2 Certain of the elements which can appear within a TEI header are known as
557 CCAS2 . These elements have in common the fact that they may be linked explicitly with a particular part of a text or corpus by means of a
559 CCAS2 attribute on that element. This linkage is used to over-ride the default association between declarations in the header and a corpus or corpus text. The only header elements which may be associated in this way are those which would not otherwise be meaningfully repeatable.
570 CCAS2 An alphabetically ordered list of declarable elements follows:
611 CCAS2 . Since there are two, one of them (in this case
629 CCAS2 For texts associated with the header in which this declaration appears, correction method
631 CCAS2 will be assumed, unless they explicitly state otherwise. Here is the structure for a text which does state otherwise:
641 CCAS2 In this case, the contents of the divisions D1 and D3 will both use correction policy
643 CCAS2 , and those of division D2 will use correction policy
657 CCAS2 , as well as smaller structural units, down to the level of paragraphs in prose, individual utterances in spoken texts, and entries in dictionaries. However, TEI recommended practice is to limit the number of multiple declarable elements used by a document as far as possible, for simplicity and ease of processing.
663 CCAS2 An identifier specifying an element which contains multiple instances of one or more other elements should be interpreted as if it explicitly identified the elements identified as the default in each such set of repeated elements
665 CCAS2 Each element specified, explicitly or implicitly, by the list of identifiers must be of a different kind.
708 CCAS2 applies, correction method C1A and normalization method N1 apply, since these are the specified defaults within
710 CCAS2 . In the same way, for a text specifying
714 CCAS2 , correction C2A, and normalization N2B will apply.
716 CCAS2 A finer-grained approach is also possible. A text might specify
717 CCAS2 text decls='C2B N2A'
720 CCAS2 declarations as required. A tag such as
721 CCAS2 text decls='ED1 ED2'
722 CCAS2 would (obviously) be illegal, since it includes two elements of the same type; a tag such as
723 CCAS2 text decls='ED2 C1A'
728 CCAS2 , resulting in a list that identifies two correction elements (C1A and C2A).
734 CCAS3 If there is a single occurrence of a given declarable element in a corpus header, then it applies by default to all elements within the corpus.
736 CCAS3 If there is a single occurrence of a given declarable element in the text header, then it applies by default to all elements of that text irrespective of the contents of the corpus header.
738 CCAS3 Where there are multiple occurrences of declarable elements within either corpus or text header,
740 CCAS3 each must have a unique value specified as the value of its
746 CCAS3 attribute with the value
754 CCAS3 An association made by one element applies by default to all of its descendants.
759 CCAN Language corpora often include analytic encodings or annotations, designed to support a variety of different views of language. The present Guidelines do not advocate any particular approach to linguistic annotation (or
761 CCAN ); instead a number of general analytic facilities are provided which support the representation of most forms of annotation in a standard and self-documenting manner. Analytic annotation is of importance in many fields, not only in corpus linguistics, and is therefore discussed in general terms elsewhere in the Guidelines.
766 CCAN The present section presents informally some particular applications of these general mechanisms to the specific practice of corpus linguistics.
772 CCAN1 we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre, or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in these Guidelines, for example in chapters
774 CCAN1 . The contextual properties of a TEI text are fully documented in the TEI header, which is discussed in chapter
778 CCAN1 Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.
780 CCAN1 The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual, or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the
782 CCAN1 element within the encoding description of the TEI header, as described in section
783 CCAN1 . Where different parts of a corpus have used different annotation methods, the
788 CCAN1 An extended example of one form of linguistic analysis commonly practised in corpus linguistics is given in section
794 CCREC These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one language corpus, however large and ambitious. The reasoning behind this catholic approach is further discussed in chapter
795 CCREC . For most large-scale corpus projects, it will therefore be necessary to determine a subset of TEI recommended elements appropriate to the anticipated needs of the project, as further discussed in chapter
796 CCREC ; these mechanisms include the ability to exclude selected element types, add new element types, and change the names of existing elements. A discussion of the implications of such changes for TEI conformance is provided in chapter
799 CCREC Because of the high cost of identifying and encoding many textual features, and the difficulty in ensuring consistent practice across very large corpora, encoders may find it convenient to divide the set of elements to be encoded into the following four categories:
802 CCREC texts included within the corpus will always encode textual features in this category, should they exist in the text
805 CCREC textual features in this category will be encoded wherever economically and practically feasible; where such features are present but not encoded, a note to that effect should be made in the header.
808 CCREC textual features in this category may or may not be encoded; no conclusion about the absence of such features can be inferred from the absence of the corresponding element in a given text.
812 CCREC textual features in this category are deliberately not encoded; they may be transcribed as unmarked up text, or represented as
833 CC The selection and combination of modules to form a TEI schema is described in
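
The corpus/text header rule stated in the CCAS1 rows above (an element in the corpus header applies to every text; an element in a text header supplements or overrides it for that text alone) can be sketched as follows. This is only an illustration and not part of the TEI Guidelines or of this report: headers are modelled as plain dictionaries, and the function name `effective_header` and all values are hypothetical placeholders. The special handling of particular elements mentioned in rows 502 to 506 is not modelled.

```python
from typing import Dict


def effective_header(corpus_header: Dict[str, str],
                     text_header: Dict[str, str]) -> Dict[str, str]:
    """Combine a corpus header with one text header (hypothetical model).

    Every corpus-header element is treated as if it appeared in the text
    header; an element present in the text header supplements the corpus
    header or overrides the same-named element, for that text alone.
    """
    merged = dict(corpus_header)   # start from the corpus-wide declarations
    merged.update(text_header)     # text-level declarations take precedence
    return merged


# Three texts: the first and last rely on the corpus encoding description,
# the second supplies its own (placeholder values, not real TEI content).
corpus = {"encodingDesc": "ED-corpus", "publicationStmt": "PS-corpus"}
texts = {
    "T1": {},
    "T2": {"encodingDesc": "ED-T2"},
    "T3": {},
}

resolved = {name: effective_header(corpus, header)
            for name, header in texts.items()}
```

Because the merge builds a new dictionary for each text, the override in `T2` leaves the corpus header and the other texts unchanged, which mirrors the "for that text alone" wording of row 498.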

FS-FeatureStructures.xml#12945

# id text
6 FS is a general purpose data structure which identifies and groups together individual
8 FS , each of which associates a name with one or more values. Because of the generality of feature structures, they can be used to represent many different kinds of information, but they are of particular usefulness in the representation of linguistic analyses, especially where such analyses are partial, or
29 FSor binary
34 FSor numeric
36 FSor string
43 FSor set
47 FSor list
49 FSor discusses how the operations of alternation, negation, and collection of feature values may be represented. Section
62 FSBI The fundamental elements used to represent a feature structure analysis are
74 FSBI attribute which may be used to represent typed feature structures, and may contain any number of
81 FSBI value
82 FSBI . The value may be simple: that is, a single binary, numeric, symbolic (i.e. taken from a restricted set of legal values), or string value, or a collection of such values, organized in various ways, for example, as a list; or it may be complex, that is, it may itself be a feature structure, thus providing a degree of recursion. Values may be under-specified or defaulted in various ways. These possibilities are all described in more detail in this and the following sections.
86 FSBI . The components of such libraries may then be referenced from other feature or feature-value representations, using the
92 FSBI We begin by considering the simple case of a feature structure which contains binary-valued features only. The following three XML elements are needed to represent such a feature structure:
101 FSBI are not discussed in this section: they provide an alternative way of indicating the content of an element, as further discussed in section
108 FSBI elements with binary values can be straightforwardly used to encode the
145 FSBI attribute to indicate the name of the feature. Feature structures need not be typed, but features must be named. Similarly, the
153 FSBI to a binary value) requires additional validation, as does any restriction on the features available within a feature structure of a particular type (e.g. whether a feature structure of type
157 FSBI ). Such validation may be carried out at the document level, using special purpose processing, at the schema level using additional validation rules, or at the declarative level, using an additional mechanism such as the
162 FSBI Although we have used the term
163 FSBI binary
172 FSBI ), it should be noted that such values are not restricted to propositional assertions. As this example shows, this kind of value is intended for use with any binary-valued feature.
181 FSSY numeric values
183 FSSY string values
184 FSSY . The module defined by this chapter allows for the specification of additional datatypes if necessary, by extending the underlying class
194 FSSY element is used for the value of a feature when that feature can have any of a small, finite set of possible values, representable as character strings. For example, the following might be used to represent the claim that the Latin noun form
210 FSSY case
214 FSSY number
215 FSSY ) are used to define morpho-syntactic properties of a word. Each of these features can take one of a small number of values (for example, case can be
225 FSSY elements. Note that, instead of using a symbolic value for grammatical number, one could have named the feature
229 FSSY and given it an appropriate binary value, as in the following example:
234 FSSY Whether one uses a binary or symbolic value in situations like this is largely a matter of taste.
238 FSSY element is used for the value of a feature when that value is a string drawn from a very large or potentially unbounded set of possible strings of characters, so that it would be impractical or impossible to use the
240 FSSY element. The string value is expressed as the content of the
242 FSSY element, rather than as an attribute value. For example, one might encode a street address as follows:
250 FSSY element is used when the value of a feature is a numeric value, or a range of such values. For example, one might wish to regard the house number and the street name as different features, using an encoding like the following:
257 FSSY If the numeric value to be represented falls within a specific range (for example an address that spans several numbers), the
266 FSSY It is also possible to specify that the numeric value (or values) represented should (or should not) be truncated. For example, assuming that the daily rainfall in mm is a feature of interest for some address, one might represent this by an encoding like the following:
269 FSSY This represents any of the infinite number of numeric values falling between 0 and 1.3; by contrast
274 FSSY Some communities of practice, notably those with a strong computer-science bias, prefer to dissociate the information on the value of the given feature from the specification of the data type that this value represents. In such cases, feature values can be provided directly as textual content of
281 FSSY As noted above, additional processing is necessary to ensure that appropriate values are supplied for particular features, for example to ensure that the feature
283 FSSY is not given a value such as
284 FSSY symbol value="feminine"/
285 FSSY . There are two ways of attempting to ensure that only certain combinations of feature names and values are used. First, if the total number of legal combinations is relatively small, one can predefine all of them in a construct known as a
287 FSSY , and then reference the combination required using the
292 FSSY feature value library
293 FSSY (so called, since a feature structure may be the value of a feature). A total of 30 feature structures (5 × 3 × 2) is required to enumerate all the possible combinations of individual case, gender and number values in the preceding illustration. We discuss the use of such libraries and their representation in XML further in section
301 FSSY Whether at the level of feature-system declarations, feature- and feature-value libraries, or individual features, it is possible to align both feature names and their values with standardized external data category repositories such as ISOcat.
306 FSSY and its value
321 FSFL As the examples in the preceding section suggest, the direct encoding of feature structures can be verbose. Moreover, it is often the case that particular feature-value combinations, or feature structures composed of them, are re-used in different analyses. To reduce the size and complexity of the task of encoding feature structures, one may use the
337 FSFL ). If a feature has as its value a feature structure or other value which is predefined in this way, the
344 FSFL For example, suppose a feature library for phonological feature specifications is set up as follows.
391 FSFL Then the feature structures that represent the analysis of the phonological segments (phonemes)
405 FSFL The preceding are but four of the 128 logically possible fully specified phonological segments using the seven binary features listed in the feature library. Presumably not all combinations of features correspond to phonological segments (there are no strident vowels, for example). The legal combinations, however, can be collected together, each one represented as an identifiable
423 FSFL attribute; for example, one might use them in a feature value pair such as:
427 FSFL Feature structures stored in this way may also be associated with the text which they are intended to annotate, either by a link from the text (for example, using the TEI global
429 FSFL attribute), or by means of stand-off annotation techniques (for example, using the TEI
434 FSFL Note that when features or feature structures are linked to in this way, the result is effectively as if the item linked to were copied into the place from which it is linked. This form of linking should be distinguished from the phenomenon of
444 FSST Features may have complex values as well as atomic ones; the simplest such complex value is represented by supplying a
446 FSST element as the content of an
450 FSST element as the value for the
464 FSST To illustrate the use of complex values, consider the following simple model of a word, as a structure combining surface form information, a syntactic category, and semantic information. Each word analysis is represented as a
465 FSST fs type='word'
467 FSST surface
472 FSST . The first of these has an atomic string value, but the other two have complex values, represented as nested feature structures of types
473 FSST category
492 FSST This analysis does not tell us much about the meaning of the symbols
514 FSST element, as a number of
516 FSST elements. Alternatively, the relevant features may be referenced by their identifiers, supplied as the value of the
532 FSST With such libraries in place, and assuming the availability of similarly predefined feature structures for transitivity and semantics, the preceding example could be considerably simplified:
556 FSVAR Sometimes the same feature value is required at multiple places within a feature structure, in particular where the value is only partially specified at one or more places. The
563 FSVAR For example, suppose one wishes to represent noun-verb agreement as a single feature structure. Within the representation, the feature indicating (say) number appears more than once. To represent the fact that each occurrence is another appearance of the same feature (rather than a copy) one could use an encoding like the following:
590 FSVAR vLabel
595 FSVAR The scope of the names used to label re-entrancy points is that of the outermost
597 FSVAR element in which they appear. When a feature structure is imported from a feature value library, or referenced from elsewhere (for example by using the
599 FSVAR attribute) the names of any sharing points it may contain are implicitly prefixed by the identifier used for the imported feature structure, to avoid name clashes. Thus, if some other feature structure were to reference the
602 FSVAR then the labelled points in the example would be interpreted as if they had the name
616 FSSS A feature whose value is regarded as a set, bag, or list may have any positive number of values as its content, or none at all (thus allowing for representation of the empty set, bag, or list). The items in a list are ordered, and need not be distinct. The items in a set are not ordered, and must be distinct. The items in a bag are neither ordered nor distinct. Sets and bags are thus distinguished from lists in that the order in which the values are specified does not matter for the former, but does matter for the latter, while sets are distinguished from bags and lists in that repetitions of values do not count for the former but do count for the latter.
618 FSSS If no value is specified for the
622 FSSS defines a list of values. If the
628 FSSS attribute, suppose that a feature structure analysis is used to represent a genealogical tree, with the information about each individual treated as a single feature structure, like this:
654 FSSS element is first used to supply a list of
655 FSSS name
658 FSSS feature. Other features are defined by reference to values which we assume are held in some external feature value library (not shown here). For example, the
660 FSSS element is used a second time to indicate that the person's siblings should be regarded as constituting a set rather than a list. Each sibling is represented by a feature structure: in this example, each feature structure is a copy of one specified in the feature value library.
662 FSSS If a specific feature contains only a single feature structure as its value, the component features of which are organized as a set, bag, or list, it may be more convenient to represent the value as a
666 FSSS . For example, consider the following encoding of the English verb form
670 FSSS feature whose value is a feature structure which contains
671 FSSS person
673 FSSS number
714 FSSS element is also useful in cases where an analysis has several components. In the following example, the French word
716 FSSS has a two-part analysis, represented as a list of two values. The first specifies that the word contains a preposition; the second that it contains a masculine plural relative pronoun:
736 FSSS The set, bag, or list which has no members is known as the null (or empty) set, bag, or list. A
738 FSSS element with no content and with no value for its
740 FSSS attribute is interpreted as referring to the null set, bag, or list, depending on the value of its
755 FSSS elements, if, for example, one of the members of a set is itself a set, or if two lists are concatenated together. Note that such collections pay no attention to the contents of the nested
757 FSSS elements: if it is desired to produce the union of two sets, the
759 FSSS element discussed below should be used to make a new collection from the two sets.
764 FVE It is sometimes desirable to express the value of a feature as the result of an operation over some other value (for example, as
768 FVE , or as the concatenation of two collections). Three special purpose elements are provided to represent disjunctive alternation, negation, and collection of values:
779 FVALT element can be used wherever a feature value can appear. It contains two or more feature values, any one of which is to be understood as the value required. Suppose, for example, that we are using a feature system to describe residential property, using such features as
781 FVALT . In a particular case, we might wish to represent uncertainty as to whether a house has two or three bathrooms. As we have already shown, one simple way to represent this would be with a numeric maximum:
791 FVALT element represents alternation over feature values, not feature-value pairs. If therefore the uncertainty relates to two or more feature value specifications, each must be represented as a feature structure, since a feature structure can always appear where a value is required. For example, if it is uncertain whether the house being described has two bathrooms or two bedrooms, a structure like the following may be used:
805 FVALT : in the case above, the implication is that having two bathrooms excludes the possibility of having two bedrooms and vice versa. If inclusive alternation is required, a
824 FVALT This analysis indicates that the property may have two bathrooms, two bedrooms, or both two bathrooms and two bedrooms.
830 FVALT to describe items that are mentioned to enhance a property's sales value, such as whether it has a pool or a good view. Now suppose for a particular listing, the selling points include an alarm system and a good view, and either a pool or a jacuzzi (but not both). This situation could be represented, using the
870 FVALT If a large number of ambiguities or uncertainties need to be represented, involving a relatively small number of features and values, it is recommended that a stand-off technique, for example using the general-purpose
883 FVNOT element can be used wherever a feature value can appear. It contains any feature value and returns the complement of its contents. For example, the feature
885 FVNOT in the following example has any whole numeric value other than 2:
892 FVNOT element is to provide the complement of the feature values it contains, rather than their negation. If a feature system declaration is available which defines the possible values for the associated feature, then it is possible to say more about the negated value. For example, suppose that the available values for the feature
893 FVNOT case
894 FVNOT are declared to be nominative, genitive, dative, or accusative, whether in a TEI feature system declaration or by some other means. Then the following two specifications are equivalent:
906 FVNOT If however no such system declaration is available, all that one can say about a feature specified via negation is that its value is something other than the negated value.
908 FVNOT Negation is always applied to a feature value, rather than to a feature-value pair. The negation of an atomic value is the set of all other values which are possible for the feature.
910 FVNOT Any kind of value can be negated, including collections (represented by a
914 FVNOT elements). The negation of any complex value is understood to be the set of values which cannot be unified with it. Thus, for example, the negation of the feature structure F is understood to be the set of feature structures which are not unifiable with F. In the absence of a constraint mechanism such as the Feature System Declaration, the negation of a collection is anything that is not unifiable with it, including collections of different types and atomic values. It will generally be more useful to require that the organization of the negated value be the same as that of the original value, for example that a negated set is understood to mean the set which is a complement of the set, but such a requirement cannot be enforced in the absence of a constraint mechanism.
921 FVCOLL element can be used wherever a feature value can appear. It contains two or more feature values, all of which are to be collected together. The organization of the resulting collection is specified by the value of the
923 FVCOLL attribute, which need not necessarily be the same as that of its constituent values if these are collections. For example, one can change a list to a set, or vice versa.
940 FVCOLL Suppose, however, that we discover that for some language it is necessary to add a new possible value, and to treat the value of the feature as a list rather than as a set. The
961 FSBO The value of a feature may be underspecified in a number of different ways. It may be null, unknown, or uncertain with respect to a range of known possibilities, as well as being defined as a negation or an alternation. As previously noted, the specification of the range of known possibilities for a given feature is not part of the current specification: in the TEI scheme, this information is conveyed by the
963 FSBO . Using this, or some other system, we might specify (for example) that the range of values for an element includes symbols for masculine, feminine, and neuter, and that the default value is neuter. With such definitions available to us, it becomes possible to say that some feature takes the default value, or some unspecified value from the list. The following special element is provided for this purpose:
968 FSBO The value of an empty
982 FSBO If, however, the value is explicitly stated to be the default one, using the
984 FSBO element, then the following two representations are equivalent:
992 FSBO Similarly, if the value is stated to be the negation of the default, then the following two representations are equivalent:
1007 FSLINK Text elements can be linked with feature structures using any of the linking methods discussed elsewhere in the Guidelines (see for example sections
1121 FSLINK element is used to link selected characters in the text
1168 FSLINK It would then be possible to link each word to its intended annotation in the feature library quoted above, as follows:
1183 FD The Feature System Declaration (FSD) is intended for use in conjunction with a TEI-conforming text that makes use of
1187 FD It provides a mechanism by which the encoder can list all of the feature names and feature values and give a prose description as to what each represents.
1193 FD It provides a mechanism by which the encoder can define the intended interpretation of underspecified feature structures. This involves defining default values (whether literal or computed) for missing features.
1196 FD . This chapter relies upon, but does not reproduce, formal definitions and descriptions presented more thoroughly in the ISO standard, which should be consulted in case of ambiguity or uncertainty.
1198 FD The FSD serves an important function in documenting precisely what the encoder intended by the system of feature structure markup used in an XML-encoded text. The FSD is also an important resource which standardizes the rules of inference used by software to validate the feature structure markup in a text, and to infer the full interpretation of underspecified feature structures.
1200 FD The reader should be aware that the terminology used in this document does not always closely follow conventional practice in formal logic, and may also diverge from practice in some linguistic applications of typed feature structures. In particular, the term
1201 FD interpretation
1202 FD when applied to a feature structure is not an interpretation in the model-theoretic sense, but is instead a minimally informative (or equivalently, most general) extension
1203 FD of that feature structure that is consistent with a set of constraints declared by an FSD. In linguistic applications, such a system of constraints is the principal means by which the grammar of some natural language is expressed. There is a great deal of disagreement as to what, if any, model-theoretic interpretation feature structures have in such applications, but the status of this formal kind of interpretation is not germane to the present document. Similarly, the term
1205 FD is used here as elsewhere in these Guidelines to identify the syntactic state of well-formedness in the sense defined by the logic of typed feature structures itself, as distinct from and in addition to the
1209 FD We begin by describing how an encoded text is associated with one or more feature system declarations. The second, third, and fourth sections describe the overall structure of a feature system declaration and give details of how to encode its components. The final section offers a full example; fuller discussion of the reasoning behind FSDs and another complete example are provided in
1213 FDLK Linking a TEI Text to Feature System Declarations
1215 FDLK In order for application software to use feature system declarations to aid in the automatic interpretation of encoded texts, or even for human readers to find the appropriate declarations which document the feature system used in markup, there must be a formal link from the encoded texts to the declarations. However, the schema which declares the syntax of the Feature System itself should be kept distinct from the feature structure schema, which is an application of that system.
1219 FDLK element for each distinct type of feature structure used must be provided and associated with the type, which is the value used within each feature structure for its
1230 FDLK element may be supplied either within the header of a standard TEI document, or as a standalone document in its own right. It contains one or more
1245 FDLK element for each within the header attached to the document as follows:
1274 FDLK In this case there is an implicit link between the
1278 FDLK element because they share the same value for their
1280 FDLK attribute and appear within the same document. This is a short cut for the more general case which requires a more explicit link provided by means of the
1285 FDLK Ways of pointing to components of a TEI document without using an XML identifier are discussed in
1286 FDLK way of accomplishing this is to add an XML identifier to each
1301 FDLK (Although in this case the XML identifier is simply an uppercase version of the type name, there is no necessary connection between the two names. The only requirement is that the XML identifier conform to the standards required for identifiers, and that it be unique within the document containing it.)
1332 FDLK there is no requirement for the local name for a given type of feature structure to be the same as that used by
1348 FDLK element of a TEI document containing typed feature structures. Alternatively, it may appear independently of any feature structures, as a document in its own right, possibly with its own
1362 FDLK value specified on a
1371 FDOV A feature system declaration contains one or more feature structure declarations, each of which has up to three parts: an optional description (which gives a prose comment on what that type of feature structure encodes), an obligatory set of feature declarations (which specify range constraints and default values for the features in that type of structure), and optional feature structure constraints (which specify co-occurrence restrictions on feature values).
1380 FDOV element may name one or more
1385 FDOV fsDecl type="Basic"
1387 FDOV fDecl name="One"
1389 FDOV fDecl name="Two"
1391 FDOV fsDecl type="Derived" baseTypes="Basic"
1393 FDOV fDecl name="Three"
1395 FDOV fs type="Derived"
1397 FDOV fsDecl type="Derived"
1399 FDOV fsDecl type="Basic"
1400 FDOV when it specifies a base type of
1422 FDOV gives the name of one or more types from which this type inherits feature specifications and constraints; if this type includes a feature specification with the same name as one inherited from any of the types specified by this attribute, or if more than one specification of the same name is inherited, then the possible values of that feature are determined by unification. Similarly, the set of constraints applicable is derived by conjoining those specified explicitly within this element with those implied by the
1424 FDOV attribute. When no base type is specified, no feature specification or constraint is inherited.
1426 FDOV Although the present standard does provide for default feature values, feature inheritance is defined to be monotonic.
1427 FDOV The process of combining constraints may result in a contradiction, for example if two specifications for the same feature specify disjoint ranges of values, and at least one such specification is mandatory. In such a case, there is no valid feature structure of the type being defined.
1432 FDOV fsDecl type="Sub" baseTypes="Super1 Super2"
1455 FDFD has three parts: an optional prose description (which should explain what the feature and its values represent), an obligatory range specification (which declares what values the feature is allowed to have), and an optional default specification (which declares what default value should be supplied when the named feature does not appear in an
1460 FDFD has no value provided, or the value
1466 FDFD either has no default specified, or has conditional defaults, none of the conditions on which is met,
1468 FDFD then the value of this feature in the feature structure's most general valid extension is the most general value provided in its
1470 FDFD , in the case of a unit organization, or the singleton set, bag, or list containing that element, in the case of a complex organization. If the feature:
1473 FDFD has no value provided, or the value
1477 FDFD either has a default specified, or has conditional defaults, one of the conditions on which is met,
1479 FDFD then this feature does have a value in the feature structure's most general valid extension when it exists, namely the default value that pertains.
1481 FDFD It is possible that a feature structure will not have a valid extension because the default value that pertains to a feature is not consistent with that feature's declared range. Additional tools are required for the enforcement of such criteria.
1492 FDFD The logic for validating feature values and for matching the conditions for supplying default values is based on the operation of
1506 FDFD containing the value
1510 FDFD . The negation of a value
1515 FDFD ) subsumes any value that is not
1519 FDFD subsumes any numeric value other than zero.
1520 FDFD The value
1521 FDFD fs type="X"/
1524 FDFD , even if it is not valid.
1534 FDFD The INV feature, which encodes whether or not a sentence is inverted, allows only the values plus (+) and minus (-). If the feature is not specified, then the default rule (FSD 1 above) says that a value of minus is always assumed. The feature declaration for this feature would be encoded as follows:
1544 FDFD The value range is specified as an alternation (more precisely, an exclusive disjunction), which can be represented by the
1546 FDFD feature value. That is, the value must be either true or false, but cannot be both or neither.
1548 FDFD The CONJ feature indicates the surface form of the conjunction used in a construction. The ~ in the default rule (see FSD 2 above) represents negation. This means that by default the feature is not applicable, in other words, no conjunction is taking place. Note that CONJ not being present is distinct from CONJ being present but having the NIL value allowed in the value range. In their analysis, NIL means that the phenomenon of conjunction is taking place but there is no explicit conjunction in the surface form of the sentence. The feature declaration for this feature would be encoded as follows:
1568 FDFD is not strictly necessary in this case, since the binary value of
1572 FDFD The COMP feature indicates the surface form of the complementizer used in a construction. In value range, it is analogous to CONJ. However, its default rule (see FSD 9 above) is conditional. It says that if the verb form is infinitival (the VFORM feature is not mentioned in the rule since it is the only feature that can take INF as a value), and the construction has a subject, then a
1598 FDFD The AGR feature stores the features relevant to subject-verb agreement. Gazdar et al. specify the range of this feature as CAT. This means that the value is a
1599 FDFD category
1600 FDFD , which is their term for a feature structure. This is actually too weak a statement. Not just any feature structure is allowable here; it must be a feature structure for agreement (which is defined in the complete example at the end of the chapter to contain the features of person and number). The following feature declaration encodes this constraint on the value range:
1605 FDFD That is, the value must be a feature structure of type
1608 FDFD fsDecl type="Agreement"
1610 FDFD fDecl name="PERS"
1612 FDFD fDecl name="NUM"
1615 FDFD The PFORM feature indicates the surface form of the preposition used in a construction. Since PFORM is specified above as an open set,
1626 FDFD subsumes any string that is not the empty string.
1646 FDFS Ensuring the validity of feature structures may require much more than simply specifying the range of allowed values for each feature. There may be constraints on the co-occurrence of one feature value with the value of another feature in the same feature structure or in an embedded feature structure.
1648 FDFS Such constraints on valid feature structures are expressed as a series of conditional and biconditional tests in the
1652 FDFS . A particular feature structure is valid only if it meets all the constraints. The
1654 FDFS element encodes the conventional if-then conditional of boolean logic which succeeds when both the antecedent and consequent are true, or whenever the antecedent is false. The
1656 FDFS element encodes the biconditional (if and only if) operation of boolean logic. It succeeds only when the corresponding if-then conditionals in both directions are true.
1657 FDFS In feature structure constraints the antecedent and consequent are expressed as feature structures; they are considered true if they
1660 FDFS ) the feature structure in question, but in the case of consequents, this truth is asserted rather than simply tested. That is to say, a conditional is enforced by determining that the antecedent does not (and will never) subsume the given feature structure, or by determining that the antecedent does subsume the given feature structure, and then unifying the consequent with it (the result of which, if successful, will be subsumed by the consequent). In practice, the enforcement of such constraints can result in periods in which the truth of a constraint with respect to a given feature structure is simply not known; in this case, the constraint must be persistently monitored as the feature structure becomes more informative until either its truth value is determined or computation fails for some other reason.
1675 FDFS The first constraint says that if a construction is inverted, it must also have an auxiliary and a finite verb form. That is,
1683 FDFS The second constraint says that if a construction has a BAR value of zero (i.e., it is a sentence), then it must have a value for the features N, V, and SUBCAT. By the same token, because it is a biconditional, if it has values for N, V, and SUBCAT, it must have BAR='0'. That is,
1694 FDFS The final constraint says that if a construction has a BAR value of 1 (i.e., it is a phrase), then the SUBCAT feature should be absent (~). This is not biconditional, since there are other instances under which the SUBCAT feature is inappropriate. That is,
1830 FSDEF The elements discussed in this chapter constitute a module of the TEI scheme which is formally defined as follows:
1844 FSDEF The selection and combination of modules to form a TEI schema is described in
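
The conditional and biconditional tests described in the FDFS rows above can be sketched in Python. This is a minimal illustration under stated assumptions, not the TEI processing model: feature structures are modelled as flat dictionaries, subsumption is simplified to "every feature in the pattern is present with the same value", and the constraint is only tested, not enforced by unification. The helper names (subsumes, cond, bicond) and the feature name AUX are hypothetical.

```python
def subsumes(pattern, fs):
    """Simplified subsumption: every feature named in `pattern`
    must appear in `fs` with an identical value."""
    return all(name in fs and fs[name] == value
               for name, value in pattern.items())


def cond(antecedent, consequent, fs):
    """If-then test: succeeds when the antecedent does not apply,
    or when both antecedent and consequent apply."""
    return (not subsumes(antecedent, fs)) or subsumes(consequent, fs)


def bicond(left, right, fs):
    """If-and-only-if test: both if-then directions must succeed."""
    return cond(left, right, fs) and cond(right, left, fs)


# Hypothetical encoding of the first constraint discussed above:
# an inverted construction must have an auxiliary and a finite verb form.
inverted = {"INV": "plus"}
aux_finite = {"AUX": "plus", "VFORM": "FIN"}

ok = {"INV": "plus", "AUX": "plus", "VFORM": "FIN"}
bad = {"INV": "plus", "AUX": "minus", "VFORM": "FIN"}
not_inverted = {"INV": "minus"}

for fs in (ok, bad, not_inverted):
    print(fs, cond(inverted, aux_finite, fs))
```

A fuller implementation would have to handle embedded feature structures, negated values, and the case where the truth of a constraint is not yet known, none of which this sketch attempts.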

CE-CertaintyResponsibility.xml#13217

# id text
3 CE Encoders of text often find it useful to indicate that some aspects of the encoded text are problematic or uncertain, and to indicate who is responsible for various aspects of the markup of the electronic text. These Guidelines provide several methods of recording uncertainty about the text or its markup:
8 CE may be used with a value of
9 CE certainty
20 CE element defined in this chapter may be used to record the accuracy with which some numerical value (such as a date or quantity) is provided by some other element or attribute.
24 CE element defined in the module for linking and segmentation may be used to provide alternative encodings for parts of a text, as described in section
28 CE the TEI header records who is responsible for an electronic text by means of the
48 CE element may be used with a value of
49 CE resp
63 CE elements, since they are defined in the core module and header respectively. The
65 CE element is only available when the module for linking has been selected, as described in chapter
72 CE elements, the module for certainty and responsibility must be selected.
81 CE These attributes enable statements about certainty, precision, or responsibility to be made with respect to the whole of a document, or any part or parts of it which can be identified using standard XML location methods. Several examples are given in the discussion of the
91 CECERT a given tag may or may not correctly apply (e.g. a given word may be a personal name, or perhaps not)
95 CECERT the value given for an attribute is uncertain
97 CECERT the content given for an element is unreliable for any reason.
105 CECERT the numerical precision associated with a number or date (for this use the
110 CECERT the content of the document being transcribed is identifiable, but may be read or understood in different ways (for this use the transcriptional elements such as
115 CECERT a transcriber, editor, or author wishes to indicate a level of confidence in a factual assertion made in the text (for this use the interpretative mechanisms discussed in
123 CECENO The simplest way of recording uncertainty about markup is to attach a note to the element or location about which one is unsure. In the following (invented) paragraph, for example, an encoder might be uncertain whether to mark
125 CECENO as a place name or a personal name, since both might be plausible in the given context:
140 CECENO Using the normal mechanisms, the note may be associated unambiguously with specific elements of the text, thus:
166 CECECE is in fact a place name, as it is tagged, we use the
171 CECECE name
180 CECECE element is placed in a document; it may be placed adjacent to the target element, or elsewhere in the same or another document. Its position is however significant when the
186 CECECE really is a place name here. The
190 CECECE element, expressed as a number between 0 and 1:
193 CECECE This expresses the point of view that there is a 60 percent chance of
195 CECECE being a place name here, and hence a 40 percent chance of its being a personal name. We can use two
197 CECECE elements to indicate the two probabilities independently. Both elements indicate the same location in the text, but the second provides an alternative choice of name identifier (in this case
199 CECECE ), which is given as the value of the
210 CECECE In the simplest case, it is also possible to place the
218 CECECE is specified, by default the proposed certainty applies to its parent element, in this case the
230 CEconcon attribute to list the identifiers of
256 CEconcon element is interpreted as claiming a given degree of confidence in a particular markup given the assertional content of the
258 CEconcon elements indicated. That is, a conjectural assertion is being made solely on the assumption that the interpretation indicated by the element named by the
266 CEconcon as a personal name or a place name, assigning a 60 percent probability to the former. If it is a place name, there may be a 50 percent chance that the place name actually in question is
270 CEconcon , while if it is correctly tagged as a personal name, it is much more likely (say, 90 percent certain) that the name is
272 CEconcon . Hence there is uncertainty about the correct location for the markup as well as about which markup to use. This state of affairs can be expressed using the
296 CEconcon Multiplying the numeric values out, this markup may be interpreted as assigning specific probabilities to three different ways of marking up the sentence:
304 CEconcon The probabilities do not add up to 1.00 because the markup indicates that if
306 CEconcon is (part of) a personal name, there is a 10 percent likelihood that the element should start somewhere other than the place indicated, without however giving an alternative location; there is thus a 6 percent chance (0.1 × 0.6) that none of the alternatives given is correct.
313 CECECE attribute may be used to supply a pattern identifying the portion of a document concerning which certainty is being expressed. The value of the
324 CECECE has been supplied here, and so by default the
326 CECECE expressed would therefore apply to the parent element. However, in this case the XPath supplied as the value for
328 CECECE returns a set of all the
347 CECECE value of
352 CECECE If an element in a document is matched by more than one match expression, then the most specific pattern applies.
355 CECECE As a simple case, if both the preceding
360 CECECE div type="checked"
361 CECECE element would potentially match both pattern expressions. However, because the second pattern is more specific than the first, it is the only one that applies. If multiple patterns match and have the same priority, then the first one (in document order) is applied. Only those statements of certainty which have matched in this sense are available for conditional application using the
363 CECECE attribute mentioned above.
367 CECECE attribute is processed, the namespace bindings in force are those in effect at that point in the document. For example,
373 CECECE might be used to indicate a high degree of certainty about the content of any elements taken from the namespace associated with the prefix
375 CECECE . This namespace prefix must be associated with an appropriate namespace definition, either on the
382 CECECE Doubts about whether the content of an element is correct may also be expressed by assigning to
384 CECECE the value
385 CECECE value
386 CECECE . For example, if the source is hard to read and so the transcription is uncertain:
404 CECECE attribute should be used to provide an alternative value for whatever aspect of the markup is in doubt: an alternative name, or the identifier of an alternative starting or ending point, as already shown, an alternative attribute value, or alternative element content, as in this example:
412 CECECE attribute is not generally useful for specifying alternative transcriptions; it cannot for example be used if the alternative reading contains markup of any kind. More robust methods of handling uncertainties of transcription are the
421 CECECE element allows for indications of uncertainty to be structured with at least as much detail and clarity as appears to be currently required in most ongoing text projects.
430 CECECE data.pointer
431 CECECE as values and may thus also contain an XPath expression of arbitrary complexity. Because full support for XPath is not provided by current processors, the use of such expressions is not generally recommended TEI practice. There are however some simple cases in which XPath syntax is to be preferred, notably those in which the
437 CECECE attribute has the value
447 CECECE value (expressed as a URI) and a
449 CECECE value (expressed as an XPath). The former defines the context within which the latter is to be evaluated. As previously noted, if no value is supplied for
451 CECECE , the context within which the value of
457 CECECE A typical case where it may be convenient to specify both
461 CECECE is that where we wish to indicate that the value of an attribute on some specific element is uncertain. In this case, the
463 CECECE attribute takes the value
464 CECECE value
465 CECECE . For example, supposing there is only a 50 percent chance that the question was spoken by participant A:
477 CECECE attributes together provide a powerful mechanism which can be used to indicate precision for a large number of assertions throughout an encoded document in an economical way. Some further examples follow:
480 CECECE This encoding indicates that there is only a 0.2 certainty that the boundaries of all
487 CECECE This encoding indicates that there is only a 0.2 certainty that the boundaries of the
491 CECECE value
499 CECECE This encoding indicates that there is only a 0.2 certainty that the value for the
508 CECECE This encoding indicates that there is only a 0.2 certainty that any value for the
514 CECECE This encoding indicates that there is only a 0.2 certainty that the value for the
522 CECECE This encoding indicates that there is only a 0.2 certainty that the content of any element the
524 CECECE attribute of which has the value
530 CECECE element and the other TEI mechanisms for indicating uncertainty provide a range of methods of graduated complexity. Simple expressions of uncertainty may be made by using the
536 CECECE element, and in cases where highly structured certainty information must be given, it is recommended that the
550 CEPREC As noted above, certainty about the accuracy of an encoding or its content is not the same thing as the
551 CEPREC precision
552 CEPREC with which a value is specified. In the case of a date or a quantity, for example, we might be certain that the value given is imprecise, or uncertain about whether or not the value given is correct. The latter possibility would be represented by the
558 CEPREC The elements concerning which statements of precision are to be made are identified using the same
570 CEPREC several ways of indicating ranges of values were introduced. For example, if we know that a date falls between 1930 and 1935, without being certain exactly where, this fact may be encoded using attributes
578 CEPREC Equally, if we know that every page of a manuscript has a width of at least 10 cm but no more than 30, we can use the attributes
586 CEPREC Suppose however that the precision with which the value of such an attribute can be specified is variable. For example, suppose an event is dated
587 CEPREC about fifty years after the death of Augustus
588 CEPREC . In this case, the precision of one end of the range (the death of Augustus) is higher than the other, assuming we know when Augustus died. We can say that the latest possible date is probably 50 years after that, but with less confidence than we can attach to the earliest possible date.
592 CEPREC element allows us to indicate the two attributes concerned and attach different levels of precision to them, using a mechanism similar to that provided for the
601 CEPREC In much the same way, we may wish to indicate different levels of precision about the dating of either end of a historical period. For example, the elements defined for encoding personal data all bear a similar set of attributes to indicate normalized values for earliest or latest dates, etc. (see section
602 CEPREC ); the precision of these attribute values may be indicated in exactly the same way. For example,
608 CEPREC It may also be useful to indicate that the precisions given for minimum and maximum quanta differ. For example, to indicate that all pages measure at least 10 cm wide, and at most
621 CEPREC might be used to record the average number of characters per line in a typescript. If in addition we wish to record the standard deviation for the values summarized by that average, this would require an additional
632 CERESP In general, attribution of responsibility for the transcription and markup of an electronic text is made by
634 CERESP elements within the header: specifically, within the title statement, the edition statement(s), and the revision history.
636 CERESP In some cases, however, more detailed element-by-element information may be desired. For example, an encoder may wish to distinguish between the individuals responsible for transcribing the content and those responsible for determining that a given word or phrase constitutes a proper noun. Where such fine-grained attribution of responsibility is required, the
665 CERESP element at the location indicated:
676 CERESP Similarly, in the following example, we indicate that RC is responsible for proposing the value of the
688 CE The module described in this chapter makes available the following additional elements:
699 CE The selection and combination of modules to form a TEI schema is described in
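
The arithmetic behind the conditional certainty example in the CEconcon rows above can be checked with a short Python sketch. The figures 0.6, 0.5, 0.9 and the 6 percent remainder come from the text; the variable names are hypothetical, and the sketch assumes that the two place-name readings share the 0.4 place-name probability equally, which is what the stated 50 percent figure and the 6 percent remainder imply.

```python
from fractions import Fraction

# Probabilities stated in the text (exact fractions avoid rounding noise).
p_person = Fraction(6, 10)          # the name is a personal name
p_place = 1 - p_person              # otherwise it is a place name

# Conditional certainties given each reading.
p_span_given_person = Fraction(9, 10)   # personal name starts where tagged
p_alt_span_given_place = Fraction(5, 10)  # place name uses the alternative span

# The three ways of marking up the sentence.
person_as_tagged = p_person * p_span_given_person
place_alt_span = p_place * p_alt_span_given_place
place_as_tagged = p_place * (1 - p_alt_span_given_place)  # assumed split

total = person_as_tagged + place_alt_span + place_as_tagged
unaccounted = 1 - total

print(float(person_as_tagged), float(place_alt_span), float(place_as_tagged))
print(float(total), float(unaccounted))
```

The remainder equals 0.1 × 0.6: the share of the personal-name reading for which no alternative starting point is offered.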

PH-PrimarySources.xml#13092

# id text
5 PH provides elements for the encoding of digital facsimiles or images of such materials, while the remainder of the chapter discusses ways of encoding detailed transcriptions of such materials. This module may also be useful in the preparation of critical editions, but the module defined here is distinct from that defined in chapter
7 PH , but again the present module may be used independently if such data is not required.
13 PH to the encoding of printed matter or indeed any form of written source, including monumental inscriptions. Similarly, where in the following descriptions terms such as
16 PH author
18 PH editor
25 PH plays a role analogous to the
27 PH , while in an authorial manuscript, the author and the scribe are the same person.
32 PHFAX These Guidelines are mostly concerned with the preparation of digital texts in which pre-existing sources are transcribed or otherwise converted into character form, and marked up in XML. However, it is also very common practice to make a different form of
33 PHFAX digital text
34 PHFAX , which is instead composed of digital images of the original source, typically one per page, or other written surface. We call such a resource a
35 PHFAX digital facsimile
36 PHFAX . A digital facsimile may, in the simplest case, just consist of a collection of images, with some metadata to identify them and the source materials portrayed. It may sometimes contain a variety of images of the same source pages, perhaps of different resolutions, or of different kinds. Such a collection may form part of any kind of document, for example a commentary of a codicological or paleographic nature, where there is a need to align explanatory text with image data. It may also be complemented by a transcribed or encoded version of the original source, which may be linked to the page images. In this section we present elements designed to support these various possibilities and discuss the associated mechanisms provided by these Guidelines.
56 PHFAX In the simple case where a digital text is composed of page images, the
74 PHFAX attribute represents the whole of the text following the
78 PHFAX element. Any convenient milestone element (see further
79 PHFAX ) could be used in the same way; for example if the images represent individual columns, the
81 PHFAX element might be used. Though simple, this method has some drawbacks. It does not scale well to more complex cases where, for example, the images do not correspond exactly with transcribed pages, or where the intention is to align specific marked up elements with detailed images, or parts of images. The management of information about the images may become more difficult if references to them are scattered through many files rather than being concentrated in a single identifiable location. Nevertheless, this solution may be adequate for many straightforward
97 PHFAX , which are also provided by this module. These elements make it possible to accommodate multiple images of each page, as well as to record the position and relative size of elements identified on any kind of written surface and to link such elements with digital facsimile images of them. Typical applications include the provision of full text search in
98 PHFAX digital facsimile editions
99 PHFAX , and ways of annotating graphics, for example so as to identify individuals appearing in group portraits and link them to data about the people represented.
114 PHFAX elements may be used to represent a digital facsimile. Either may appear within a TEI document along with, or instead of, the
119 PHFAX element is designed for the case where the digital facsimile contains only images, whereas the
121 PHFAX element is for use in the case where such images are complemented by a documentary transcription. In this section, we first discuss the simpler case, returning to the use of the
124 PHFAX below. When this module is selected, therefore, a legal TEI document may comprise any of the following:
126 PHFAX a TEI header and a text element
128 PHFAX a TEI header and a facsimile element
130 PHFAX a TEI header and a sourceDoc element
132 PHFAX a TEI header, a facsimile element, and a text element
134 PHFAX a TEI header, one or more sourceDoc or facsimile elements, and a text element
150 PHFAX In the simplest case, a facsimile just contains a series of
169 PHFAX In this simple case, the four page images are understood to represent the complete facsimile, and are to be read in the sequence given. Suppose, however, that the second page of this particular work is available both as an ordinary photograph and as an infra-red image, or in two different resolutions. The
171 PHFAX element may be used to group the two image files, since these correspond with the same area of the work:
186 PHFAX element provides a way of indicating that the two images of page2 represent the same surface within the source material. A
187 PHFAX surface
188 PHFAX might be one side of a piece of paper or parchment, an opening in a codex treated as a single surface by the writer, a face of a monument, a billboard, a membrane of a scroll, or indeed any two-dimensional surface, of any size.
209 PHFAX Simply grouping related graphics is not however the main purpose of the
211 PHFAX element: rather it is to help identify the location and size of the various two-dimensional spaces constituting the digital facsimile. Note that the actual dimensions of the object represented are not provided by the
215 PHFAX element defines an abstract coordinate space which may be used to address parts of the image. Four attributes supplied by the
223 PHFAX By default, the same coordinate space is used for a
226 PHFAX The coordinate space may be thought of as a grid superimposed on a rectangular space. Rectangular areas of the grid are defined as four numbers
227 PHFAX a b c d
232 PHFAX points from the origin along the
236 PHFAX points from the origin along the
239 PHFAX It may be most convenient to derive a coordinate space from a digital image of the surface in question such that each pixel in the image corresponds with a whole number of units (typically 1) in the coordinate space. In other cases it may be more convenient to use units such as millimetres. Neither practice implies any specific mapping between the coordinate system used and the actual dimensions of the physical object represented.
245 PHFAX elements, each of which represents a region or
247 PHFAX defined in terms of the same coordinate space as that of its parent
249 PHFAX element. A zone may be rectangular or non-rectangular: a rectangular zone is defined by a sequence of four coordinates in the same way as a surface; a non-rectangular zone is defined using the attribute
251 PHFAX , which provides a sequence of coordinates, each of which specifies a point on the perimeter of the zone.
256 PHFAX in the same form as that required by the
263 PHFAX A zone may be used to define any region of interest, such as a detail or illustration, or some part of the surface which is to be aligned with a particular text element, or otherwise distinguished from the rest of the surface. A surface establishes a coordinate system which may be used to address parts or the whole of some digital representation of a written surface. A zone, by contrast, defines any arbitrary area of interest relative to that surface, using the same coordinate system. It might be bigger or smaller than its parent surface, or might overlap its boundaries. The only constraint is that it must be defined using the same coordinate system.
265 PHFAX When an image of some kind is supplied within either a zone or a surface, the implication is that the image represents the whole of the zone or surface concerned. In the simple case therefore, we might imagine a surface defining a page, within which there is a graphic representing the whole of that page, and a number of zones defining parts of the page, each with its own graphic, each representing a part of the page. If however one of those graphics actually represents an area larger than the page (for example to include a binding or the surface of a desk on which the page rests), then it will be enclosed by a zone with coordinates larger than those of the parent surface.
273 PHFAX This is an image of a two page spread from a manuscript in the Badische Landesbibliothek, Karlsruhe. We have no information as to the dimensions of the original object, but the low resolution image displayed here contains 500 pixels horizontally and 321 pixels vertically. For convenience, we might map each pixel to one cell of the coordinate space.
274 PHFAX The coordinate space used here is based on pixels, but the mapping between pixels and units in the coordinate space need not be one-to-one; it might be convenient to define a more delicate grid, to enable us to address much smaller parts of the image. This can be done simply by supplying appropriate values for the attributes which define the coordinate space; for example doubling them all would map each pixel to two grid points in the coordinate space.
279 PHFAX element corresponding with the area of the image which represents the whole of the two page spread and embed the graphic within it:
315 PHFAX elements may be used to identify parts of a surface for analytical purposes.
317 PHFAX The relationship between zone and surface can be quite complex: for example, it may be appropriate to treat the whole of a two page spread as a single written surface, perhaps because particular written zones span both pages. A zone may contain a nested surface, if for example a page has an additional scrap of paper attached to it. A zone may be of any shape, not simply rectangular. Discussion of these and other cases is provided in section
320 PHFAX In the following extended example, we discuss a hypothetical digital edition of an early 16th century French work, Charles de Bovelles'
323 PHFAX The image is taken from the collection at
329 PHFAX element used to contain the whole set of pages, we define a
340 PHFAX We can now identify distinct zones within the page image using the coordinate scale defined for the surface. In the following figure
348 facs-fig1 Detail of p 49r from Bovelles
351 PHFAX The following encoding defines each of the four zones identified in the figure above.
365 PHFAX Note that the location of each zone is defined independently but using the same coordinate system.
381 PHFAX element has been associated directly with the surface of the page rather than nesting it within a zone. However, it is also possible to include multiple
385 PHFAX element, if for example a detailed image is available. Since all
389 PHFAX ), there is no need to demonstrate enclosure of one zone within another by means of nesting. To continue the current example, supposing that we have an additional image called
391 PHFAX containing an additional image of the figure in the third zone above, we might encode that zone as follows:
402 PH-transcr A digitized source document may contain nothing more than page images and a small amount of metadata. It may also contain an encoded transcription of the pages represented, which may either be
406 PH-transcr element, or supplied in parallel with a
410 PH-transcr If the transcription is regarded as a text in its own right, organized and structured independently of its physical realization in the document or documents represented by the facsimile, then the recommended practice is to use the
419 PH-transcr below. Alternatively, if the transcription is intended not to prioritize representation of the final text so much as the process by which the document came to take its present form, or the physical disposition of its component parts, it may be preferable to present it as an embedding transcription, as further described in section
425 PH-bov Suppose now that we wish to align a transcription of the page discussed in the preceding section with particular zones. We begin by giving each relevant part of the facsimile an identifier:
492 PH-bov attribute, which supplies the identifier of the element containing at least the start of the transcribed text found within the surface or zone concerned. Thus, another way of linking this page with its transcription would be simply
546 PHZLAB When supplied within a
548 PHZLAB element, these elements may contain transcriptions of the written content of a source in addition to or as an alternative to digital images of them. Such transcription may be placed directly within the
552 PHZLAB elements, for cases where the writing is linear, in the sense that it is composed of discrete tokens organized physically into groups, typically organized in a sequence corresponding with the way they are intended to be read. Depending on the directionality of the writing system used, this might be any combination of top-down and left to right, or vice versa. The element
554 PHZLAB may be used to hold a complete group of such tokens. Where, however, the lineation is not considered significant, any group of tokens may be indicated using the
565 PHZLAB Returning to the preceding example, we might transcribe the content of the zone to which we gave the identifier
598 PHZLAB As mentioned above, some or all of the written surfaces being transcribed may be composed of physically distinct scraps. In the following example, taken from the Walt Whitman Archive, two pieces of newsprint have been glued to a piece of blue paper on which a poem is being drafted:
601 sleeprs Single leaf of notes possibly related to the poem eventually titled Sleepers. From the Walt Whitman Archive (Duke 258).
603 PHZLAB The two pieces of newsprint might simply be regarded as special kinds of zone, but they are also new surfaces, since they might contain additional written zones themselves (such as the numbers in this case).
650 PHZLAB elements identified in the transcription. The encoder may choose to complement a transcription with graphic representations of its source at whatever level is considered effective, or not at all. Equally, the encoder may choose to provide only graphics without any transcription, to provide only a structured (non-embedded) transcription, or to provide any combination of the three.
654 PHZLAB element they are to be found, other than the reading order implicit in their sequence. Such information could be added if desired by specifying a coordinate system on the outermost
656 PHZLAB element, and then indicating values within that system for each of the two fragments, as was discussed above. We discuss this in further detail in section
666 PHST transcription or a critical edition. In either case they may also wish to include other editorial material, such as comments on the status or possible origin of particular readings, corrections, or text supplied to fill lacunae.
672 PHST of writing in one or more documents. Transcriptions of this kind are closely focussed on the physical appearance of specific documents, needing to distinguish the traces of different writing activities on them, such as additions and deletions but also other indications of how the writing is to be read, such as indications of transposition, re-affirmation of writing which has been deleted, and so on. Such distinctions are considered of particular importance when dealing with authorial manuscripts, but are also relevant in the case of historical sources such as charters or other legal documents.
674 PHST In either case, it is customary in transcriptions to register certain features of the source, such as ornamentation, underlining, deletion, areas of damage and lacunae. This chapter provides ways of encoding such information:
676 PHST methods of recording editorial or other alterations to the text, such as expansion of abbreviations, corrections, conjectures, etc. (section
679 PHST methods of describing important extra-linguistic phenomena in the source: unusual spaces, lines, page and line breaks, changes of manuscript hand, etc. (section
685 PHST methods of representing aspects of layout such as spacing or lines
688 PHST methods of representing material such as running heads, catch-words, and the like (section
696 PHST , etc. are used to mark writing traces and their functions within the document. Each such element can be assigned to one or more editorially-defined modification groups, termed a
697 PHST change
700 PHST attribute, which references a definition for the modification group concerned, typically provided within the TEI header
717 PHST These recommendations are not intended to meet every transcriptional circumstance likely to be faced by any scholar. Rather, they should be regarded as a base which can be elaborated if necessary by different scholars in different disciplines
720 PHST As a rule, all elements which may be used in the course of a transcription of a single witness may also be used in a critical apparatus, i.e. within the elements proposed in chapter
721 PHST . This can generally be achieved by nesting a particular reading containing tagged elements from a particular witness within the
727 PHST Just as a critical apparatus may contain transcriptional elements within its record of variant readings in various witnesses, one may record variant readings in an individual witness by use of the apparatus mechanisms
737 PHCH In the detailed transcription of any source, it may prove necessary to record various types of actual or potential alteration of the text: expansion of abbreviations, correction of the text (either by author, scribe, or later hand, or by previous or current editors or scholars), addition, deletion, or substitution of material, and similar matters. The sections below describe how such phenomena may be encoded using either elements defined in the core module (defined in chapter
738 PHCH ) or specialized elements available only when the module described in this chapter is available.
757 PHCO All of these elements bear additional attributes for specifying who is responsible for the interpretation represented by the markup, and the associated certainty. In addition, some of them bear an attribute allowing the markup to be categorized by type and source.
766 PHCO The following sections describe how the core elements just named may be used in the transcription of primary source materials.
772 PHAB The writing of manuscripts by hand lends itself to the use of abbreviation to shorten scribal labour. Commonly occurring letters, groups of letters, words, or even whole phrases, may be represented by significant marks. This phenomenon of manuscript abbreviation is so widespread and so various that no taxonomy of it is here attempted. Instead, methods are shown which allow abbreviations to be encoded using the core elements mentioned above.
774 PHAB A manuscript abbreviation may be viewed in two ways. One may transcribe it as a particular sequence of letters or marks upon the page: thus, a
775 PHAB p with a bar through the descender
781 PHAB per
783 PHAB re
788 PHAB In many cases the glyph found in the manuscript source also exists in the Unicode character set: for example the common Latin brevigraph ⁊, standing for
792 PHAB can be directly represented in any XML document as the Unicode character with code point
803 PHAB These two methods of coding abbreviation may also be combined. An encoder may record, for any abbreviation, both the sequence of letters or marks which constitutes it, and its sense, that is, the letter or letters for which it is believed to stand. For example, in the following fragment the phrase
805 PHAB is represented by a sequence of abbreviated characters:
826 PHAB Note that in each case the
859 PHAB When abbreviated forms such as these are expanded, two processes are carried out: some characters not present in the abbreviation are added (always), and some characters or glyphs present in the abbreviation are omitted or replaced (often). For example, when the abbreviation
871 PHAB element surrounds characters or signs such as tittles or tildes, used to indicate the presence of an abbreviation, which are typically removed or replaced by other characters in the expanded form of the abbreviation:
887 PHAB The content of the
905 PHAB As implied in the preceding discussion, making decisions about which of these various methods of representing abbreviation to use will form an important part of an encoder's practice. As a rule, the
909 PHAB elements should be preferred where it is wished to signify that the content of the element is an abbreviation, without necessarily indicating what the abbreviation may stand for. The
913 PHAB elements should be used where it is wished to signify that the content of the element is not present in the source but has been supplied by the transcriber, without necessarily indicating the abbreviation used in the original. The decision as to which course of action is appropriate may vary from abbreviation to abbreviation; there is no requirement that the same system be used throughout a transcription, although doing so will generally simplify processing. The choice is likely to be a matter of editorial policy. If the highest priority is to transcribe the text
915 PHAB (letter by letter), while indicating the presence of abbreviations, the choice will be to use
919 PHAB throughout. If the highest priority is to present a reading transcription, while indicating that some letters or words are not actually present in the original, the choice will be to use
934 PHAB , a note is attached to an editorial expansion of the tail on the final d of
951 PHAB The editor might declare a degree of certainty for this expansion, based on the OED examples, and state the responsibility for the expansion:
955 PHAB The value supplied for the
957 PHAB attribute should point to the name of the editor responsible for this and possibly other interventions; an appropriate element therefore might be a
959 PHAB element in the header like the following:
972 PHAB element only to indicate confidence in the content of the element (i.e. the expansion), and responsibility for suggesting this expansion respectively.
984 PHAB If it is desired to express aspects of certainty and responsibility for some other aspect of the use of these elements, then the mechanisms discussed in chapter
986 PHAB for discussion of the issues of certainty and responsibility in the context of transcription.
1025 PHCC and its correction
1038 PHCC element is used to provide a corrected form which is
1040 PHCC present in the source; in the case of a correction made in the source itself, whether scribal, authorial, or by some other hand, the
1053 PHCC element indicates the transcriber's correction of them. Where the transcriber considers that one or more words have been erroneously omitted in the original source and corrects this omission, the
1058 PHCC . Thus, in the following example, from George Moore's draft of additional materials for
1072 PHCC , the choice as to whether to record simply that there is an apparent error, or simply that a correction has been applied, or to record both possible readings within a
1074 PHCC element is left to the encoder. The decision is likely to be a matter of editorial policy, which might be applied consistently throughout or decided case by case. If the highest priority is to present an uncorrected transcription while noting perceived errors in the original, the choice will typically be to use only
1076 PHCC throughout. If the highest priority is to present a reading transcription, while indicating that perceived errors in the original have been corrected, the choice will be to use only
1119 PHCC is used to indicate who is responsible for the proposed emendation. Its value is a pointer, which will typically indicate a
1123 PHCC element in the header of the transcribed document, but can point anywhere, for example to some online authority file. Using these two attributes, the
1154 PHCC element. However, if the number of corrections is large and the number of notes is small, it may well be both more practical and more appropriate to regard the collection of annotations as constituting a typology and then use the
1156 PHCC attribute. Suppose that the note given above is one of half a dozen possible kinds of corrected phenomena identified in a given text; others might include, say,
1157 PHCC repetition of a word from the preceding line
1162 PHCC element can be used to specify an arbitrary code for the particular kind of correction (or other editorial intervention) identified within it. This code can be chosen freely and is not treated as a pointer.
1175 PHCC In addition, the conscientious encoder will provide documentation explaining the circumstances in which particular codes are judged appropriate. A suitable location for this might be within the
1196 PHCC choice type="substitution" subtype="graphicResemblance"
1203 PHCC attributes automatically. This is easily done but requires customization of the TEI system using techniques described in
1207 PHCC When making a correction in a source which forms part of a textual tradition attested by many witnesses, a textual editor will sometimes use a reading from one witness to correct the reading of the source text. In the general case, such encoding is best achieved with the mechanisms provided by the module for textual criticism described in chapter
1214 PHCC mentioned above, Parkes proposes to emend the problematic word
1223 PHCC The value of the
1225 PHCC attribute here is, like the value of the
1227 PHCC attribute, a pointer, in this case indicating the manuscript used as a witness. Elsewhere in the transcribed text, a list of witnesses used in this text will be given, one of which has an identifier
1229 PHCC . Each witness will be represented either by a
1266 PHCC attribute were supplied on the
1268 PHCC element, it would indicate the person responsible for asserting that the manuscript indicated has this reading, who is not necessarily the same as the person responsible for asserting that this reading should be used to correct the others. Editorial intervention elements such as
1272 PHCC to provide this additional information:
1283 PHCC found in Gg is regarded as a correction by Parkes.
1295 PHCC element, these attributes indicate confidence in and responsibility for identifying the reading within the sources specified; when used on the
1297 PHCC element they indicate confidence in and responsibility for the use of the reading to correct the base text. If no other source is indicated (either by the
1303 PHCC ), the reading supplied within a
1305 PHCC has been provided by the person indicated by the
1309 PHCC If it is desired to express certainty of or responsibility for some other aspect of the use of these elements, then the mechanisms discussed in chapter
1311 PHCC for further discussion of the issues of certainty and responsibility in the context of transcription.
1317 PHAD Additions and deletions observed in a source text may be described using the following elements:
1327 PHAD are included in the core module, while
1331 PHAD are available only when using the module defined in this chapter. These particular elements are members of the
1338 PHAD Further characteristics of each addition and deletion, such as the hand used, its effect (complete or incomplete, for example), or its position in a sequence of such operations may conveniently be recorded as attributes of these elements, all of which are members of the
1384 PHAD attribute may be useful to indicate the classification; when they are classified by the manner in which they were effected, or by their appearance, however, this will lead to a certain arbitrariness in deciding whether to use the
1392 PHAD attribute be reserved for higher level or more abstract classifications.
1396 PHAD attribute is also available to indicate the location of an addition. For example, consider the following passage from a draft letter by Robert Graves:
1420 PHAD above the line, and then deletes it. This may be encoded similarly:
1426 PHAD has been added and then deleted:
1434 PHAD , and then changed it; it may be that he inserted other punctuation marks between the letters before replacing them with the centre dots used elsewhere to represent this acronym. We do not deal with these possibilities here, and mention them only to indicate that any encoding of manuscript material of this complexity will need to make decisions about what is and is not worth mentioning.
1442 PHAD , then deletes
1462 PHAD elements defined in the core module suffice only for the description of additions and deletions which fit within the structure of the text being transcribed, that is, where each deletion or addition is completely contained by the structural element (paragraph, line, division) within which it occurs. Where this is not the case, for example because an individual addition or deletion involves several distinct structural subdivisions, such as poems or prose items, or otherwise crosses a structural boundary in the text being encoded, special treatment is needed. The
1476 PHAD element is first declared, within the header of the document, to associate the identifier
1478 PHAD with Helgi. Each of the added poems is encoded as a distinct
1480 PHAD element. In the body of the text, an
1482 PHAD element is placed to mark the beginning of the span of added text, and an
1506 PHAD several occasions where sequences of whole lines are marked for deletion, either by boxes or by being struck out. If the encoder is marking up individual verse lines with the
1528 PHAD It is also often the case that deletions and additions may themselves contain other deletions and additions. For example, in Thomas Moore's autograph of the second version of
1543 PHAD In this case the
1551 PHAD The text deleted must be at least partially legible, in order for the encoder to be able to transcribe it. If all or part of it is not legible, the
1553 PHAD element should be used to indicate where text has not been transcribed, because it could not be. The
1556 PHAD may be used to indicate areas of text which cannot be read with confidence. See further section
1566 PHSU As we have shown, the simplest method of recording a substitution is simply to record both the addition and the deletion. However, when the module defined by this chapter is in use, additional elements are available to indicate that the encoder believes the addition and the deletion to be part of the same intervention: a substitution.
1580 PHSU Since the purpose of this element is solely to group its child elements together, the order in which they are presented is not significant. When both deletion and addition are present, it may not always be clear which occurs first: using the
1590 PHSU and this is then replaced by
1594 PHSU This may be encoded as follows, representing the two changes as a sequence of additions and deletions:
1606 PHSU to record text first added, then deleted in the source. The numbers assigned by the
1608 PHSU attribute may be used to identify the order in which the various additions and deletions are believed by the encoder to have been carried out, and thus provide a simple method of supporting the kind of
1617 PHSU The case of a single substitution or scribal correction that involves non-contiguous addition and deletion can be handled by using the
1619 PHSU element to make an explicit connection between one or more
1627 PHSU to group this
1633 PHSU allows the encoder to indicate that additions and deletions separated in this way are part of a single scribal intervention:
1688 PHSU in the last line is simply marked as a deletion;
1695 PHSU provides similar facilities, by treating each state of the text as a distinct reading. The
1717 PHCD An author or scribe may mark a word or phrase in some way, and then on reflection decide to cancel the marking. For example, text may be marked for deletion and the deletion then cancelled, thus restoring the deleted text. Such cancellation may be indicated by the
1723 PHCD This element bears the same attributes as the other transcriptional elements. These may be used to supply further information such as the hand in which the restoration is carried out, the type of restoration, and the person responsible for identifying the restoration as such, in the same way as elsewhere.
1725 PHCD Presume that Lawrence decided to restore
1730 PHCD For I hate this my body
1733 PHCD first deleted then restored by writing
1740 PHCD Another feature commonly encountered in manuscripts is the use of circles, lines, or arrows to indicate transposition of material from one point in the text to another. No specific markup for this phenomenon is proposed at this time. Such cases are most simply encoded as additions at the point of insertion and deletions at the point of encirclement or other marking.
1746 PHOM Where text is not transcribed, whether because of damage to the original, or because it is illegible, or for some other reason such as editorial policy, the
1748 PHOM core element may be used to register the omission; where such text is transcribed, but the editor wishes to indicate that they consider it to be superfluous, for example because it is an inadvertent scribal repetition, the
1750 PHOM element may be used in preference. Where text not present in the source is supplied (whether conjecturally or from other witnesses) to fill an apparent gap in the text, the
1760 PHOM element has no content. It marks a point in the text where nothing at all can be read, whether because of authorial or scribal erasure, physical damage, or any other form of illegibility. Its attributes allow the encoder to specify the amount of text which is illegible in this way at this point, using any convenient units, where this can be determined. For example, in the Beerbohm manuscript of
1762 PHOM cited above, the author has erased a passage amounting to about 10 cm in length by inking over it completely:
1769 PHOM The degree of precision attempted when measuring the size of a gap will vary with the purpose of the encoding and the nature of the material: no particular recommendation is made here.
1773 PHOM element should only be used where text has not been transcribed. If partially legible text has been transcribed, one of the elements
1778 PHOM ); if the text is legible and has been transcribed, but the editor wishes to indicate that they regard it as superfluous or redundant, then the element
1780 PHOM may be used in preference to the core element
1782 PHOM used to indicate text regarded as erroneous.
1784 PHOM Amongst the many examples cited in Hans Krummrey & Silvio Panciera's classic text on the editing of epigraphic inscriptions is the following. In a late classical inscription, the form
1786 PHOM is encountered. The editor may choose any of the following three possibilities:
1789 PHOM mark this as an erroneous form
1794 PHOM additionally supply a corrected form
1802 PHOM indicate that the erroneous form contains surplus characters which the editor wishes to suppress
1825 PHOM here are metrically inconsistent with the rest and have been marked by the editor as such.
1827 PHOM If some part of the source text is completely illegible or missing, an encoder may sometimes wish to supply new (conjectural) material to replace it. This conjectural reading is analogous to a correction in that it contains text provided by the encoder and not attested in the source. It is not, however, a correction, since no error is necessarily present in the original; for that reason a different element
1830 PHOM I am dear Sir your very humble Servt Sydney Smith
1831 PHOM , the text illegible in the autograph might be supplied in the transcription:
1839 PHOM attributes are used, as elsewhere, to indicate respectively the sigil of a manuscript from which the supplied reading has been taken, and the identifier of the person responsible for deciding to supply the text. If the
1841 PHOM attribute is not supplied, the implication is that the encoder (or whoever is indicated by the value of the
1843 PHOM attribute) has supplied the missing reading. Both
1859 PHPH This section discusses in more detail the representation of aspects of responsibility perceived or to be recorded for the writing of a primary source. These include points at which one scribe takes over from another, or at which ink, pen, or other characteristics of the writing change. A discussion of the usage of the
1870 PHDH For many text-critical purposes it is important to signal the person responsible (the
1872 PHDH ) for the writing of a whole document, a stretch of text within a document, or a particular feature within the document. A hand, as the name suggests, need not necessarily be identified with a particular known (or unknown) scribe or author; it may simply indicate a particular combination of writing features recognized within one or more documents. The examples given above of the use of the
1874 PHDH attribute with coding of additions and deletions illustrate this.
1887 PHDH attribute, may appear in either of two places in the TEI header, depending on which modules are included in a schema. When the
1893 PHDH element of the TEI header, to hold one or more
1901 PHDH also becomes available as part of a structured manuscript description. The encoder may choose to place
1903 PHDH elements identifying individual hands in either location without affecting their accessibility since the element is always addressed by means of its
1907 PHDH element may be more appropriate when a full cataloguing of each manuscript is required; the
1909 PHDH element if only a brief characterization of each hand is needed. It is also possible to use the two elements together if, for example, the
1911 PHDH element contains a single summary describing all the hands discursively, while the
1913 PHDH element gives specific details of each. The choice will depend on individual encoders' priorities.
1917 PHDH attribute is available on several elements to indicate the hand in which the content of the element (usually a deletion or addition) is carried out. The
1919 PHDH element may also be used within the body of a transcription to indicate where a change of hand is detected for whatever reason.
1935 PHDH A single hand may employ different writing styles and inks within a document, or may change character. For example, the writing style might shift from
1939 PHDH , or the ink from blue to brown, or the character of the hand may change. Simple changes of this kind may be indicated by assigning a new value to the appropriate attribute within the
1941 PHDH element. It is for the encoder to decide whether a change in these properties of the writing style is so marked as to require treatment as a distinct hand.
1943 PHDH Where such a change is to be identified, the
1945 PHDH attribute indicates the hand applicable to the material following the
1947 PHDH . The sequence of such
1949 PHDH elements will often, but not necessarily, correspond with the order in which the material was originally written. Where this is not the case, the facilities described in section
1952 PHDH As might be expected, a single hand may also vary renditions within the same writing style; for example, medieval scribes often indicate a structural division by emboldening all the words within a line. Such changes should be indicated by use of the
1958 PHDH In the following example there is a change of ink within a single hand. This is simply indicated by a new value for the
1969 PHDH In the following example, the encoder has identified two distinct hands within the document and given them identifiers
1973 PHDH , by means of the following declarations included in the document's TEI header:
1983 PHDH Then the change of hand is indicated in the text:
1987 PHDH When a more precise or nuanced discussion of the writing in a manuscript is required, the
2004 PHHR attributes have similar, but not identical, meanings. Observe their distinctive uses in the following encoding of the William James passage mentioned above in section
2009 PHHR , and the consequent editorial correction of
2034 PHHR should be reserved for indicating the hand of any form of marking—here, addition but also deletion, correction, annotation, underlining, etc.—within the primary text being transcribed. The scribal or authorial responsibility for this marking may be inferred from the value of the
2036 PHHR attribute. The value of the
2038 PHHR attribute should be a pointer to a hand identifier, typically declared in the document header but potentially in another document or repository (see section
2043 PHHR attribute, by contrast, indicates the person responsible for deciding to mark up this part of the text with this particular element. In the case of the
2049 PHHR attribute is supplied) to which hand it should be attributed. In this case, Bowers is credited with identifying the hand as that of William James. In the case of the
2053 PHHR attribute indicates who is responsible for supplying the intellectual content of the correction reported in the transcription: here, Bowers' correction of
2057 PHHR . In the case of a deletion, the
2067 PHHR attributes are defined for a particular element, the two attributes refer to the same aspect of the markup. The one indicates who is intellectually responsible for some item of information, the other indicates the degree of confidence in the information. Thus, for a correction, the
2069 PHHR attribute signifies the person responsible for supplying the correction, while the
2073 PHHR attribute signifies the person responsible for supplying the expansion and the
2081 PHHR attributes with each element is intended to provide for the most frequent circumstances in which encoders might wish to make unambiguous statements regarding the responsibility for and certainty of aspects of their encoding. The
2085 PHHR attributes, as so defined, give a convenient mechanism for this. However, there will be cases where it is desirable to state responsibility for and certainty concerning other aspects of the encoding. For example, one may wish in the case of an apparent addition to state the responsibility for the use of the
2087 PHHR element, rather than the responsibility for identifying the hand of the addition. It may also be that one editor makes an electronic transcription of another editor's printed transcription of a manuscript text; here, one will wish to assign layers of responsibility, so as to allow the reader to determine exactly what in the final transcription was the responsibility of each editor. In these complex cases of divided editorial responsibility for and certainty concerning the content, attributes, and application of a particular element, the more general mechanisms for representing certainty and responsibility described in chapter
2091 PHHR It should be noted that the certainty and responsibility mechanisms described in chapter
2100 PHHR in line 117 of Chaucer's
2113 PHHR Exactly the same information could be conveyed using the certainty and responsibility mechanisms, as follows:
2119 PHHR The choice of which mechanism to use is left to the encoder. In transcriptions where only such statements of responsibility and certainty are made as can be accommodated within the
2127 PHHR attributes of those elements. Where many statements of responsibility and certainty are made which cannot be so accommodated, it may be economical to use the
2133 PHHR The above discussion supposes that in each case an encoder is able to specify exactly what it is that one wishes to state responsibility for and certainty about. Situations may arise when an encoder wishes to make a statement concerning certainty or responsibility but is unable or unwilling to specify so precisely the domain of the certainty or responsibility. In these cases, the
2137 PHHR attribute set to
2140 PHHR resp
2141 PHHR and the content of the note giving a prose description of the state of affairs.
2148 PHDAMCON The carrier medium of a primary source may often sustain physical damage which makes parts of it hard or impossible to read. In this section we discuss elements which may be used to represent such situations and give recommendations about how these should be used in conjunction with the other related elements introduced previously in this chapter.
2158 PHDA ) should be used with appropriate attributes where the degree of damage or illegibility in a text is such that nothing can be read and the text must be either omitted or supplied conjecturally or from one or more other sources. In many cases, however, despite damage or illegibility, the text may yet be read with reasonable confidence. In these cases, the following elements should be used:
2181 PHDA inherits the following additional attribute:
2190 PHDA In the first line of this leaf, the transcriber may believe that the last three letters of
2198 PHDA If, as is often the case, the damage crosses structural divisions, so that the
2225 PHDA element, since it is the whole of the leaf (the text between the two
2230 PHDA If, as is also likely, the damage affects several disjoint parts of the text, each such part must be marked with a separate
2236 PHDA attribute may be used as in the following example. In this (imaginary) text of Fitzgerald's translation from Omar Khayam, water damage has affected an area covering parts of several lines:
2255 PHDA which may be used to link together arbitrary elements of any kind in the transcription. Here, several phenomena of illegibility and conjecture all result from a single cause: an area of damage to the text caused by rubbing at various points. The damage is not continuous, and affects the text at irregular points. In cases such as this, the join element may be used to indicate which tagged features are part of the same physical phenomenon.
2257 PHDA If the damage has been so severe as to render parts of the text only imperfectly legible, the
2285 PHDA element may if desired be enclosed within a
2304 PHDA Where elements are nested in this way, information about agency, etc. is by default inherited. In the following imaginary example, there is a smoke-damaged part within which two stretches can be read with some difficulty, and a third stretch which cannot be read at all:
2355 PHCOMB elements may be closely allied in their use. For example, an area of damage in a primary source might be encoded with any one of the first four of these elements, depending on how far the damage has affected the readability of the text. Further, certain of the elements may nest within one another. The examples given in the last sections illustrate something of how these elements are to be distinguished in use. This may be formulated as follows:
2357 PHCOMB where the text has been rendered completely illegible by deletion or damage and no text is supplied by the editor in place of what is lost: place an empty
2361 PHCOMB attribute to state the cause (damage, deletion, etc.) of the loss of text.
2363 PHCOMB where the text has been rendered completely illegible by deletion or damage and text is supplied by the editor in place of what is lost: surround the text supplied at the point of deletion or damage with the
2367 PHCOMB attribute to state the cause (damage, deletion, etc.) of the loss of text leading to the need to supply the text.
2369 PHCOMB where the text has been rendered partly illegible by deletion or damage so that the text can be read but without perfect confidence: transcribe the text and surround it with the
2373 PHCOMB attribute to state the cause (damage, deletion, etc.) of the uncertainty in transcription and the
2377 PHCOMB where there is deletion or damage but at least some of the text can be read with perfect confidence: transcribe the text and surround it with the
2387 PHCOMB where there is an area of deletion or damage and parts of the text within that area can be read with perfect confidence, other parts with less confidence, other parts not at all: in transcription, surround the whole area with the
2395 PHCOMB element. Places within the damaged area where the text has been rendered completely illegible and no text is supplied by the editor may be marked with the
2397 PHCOMB element. For each element, one may use appropriate attribute values to indicate the cause and type of deletion or damage and the certainty of the reading.
2404 PHCOMB elements, and for the interpretation of such combinations, are similar:
2407 PHCOMB if one
2413 PHCOMB ), then the addition
2424 PHCOMB if one
2435 PHCOMB if a
2439 PHCOMB element, the normal interpretation will be that an addition was made within a passage which was later deleted in its entirety:
2444 PHCOMB if an
2448 PHCOMB element, the normal interpretation will be that a deletion was made from a passage which had earlier been added:
2459 alterations Modifications of various kinds (correction, addition, deletion, etc.) are frequently found within a single document, and may also be inferred when different documents are compared, although it may be an open question as to whether inter-document discrepancies
2462 alterations In this section we discuss a number of elements which may be useful when attempting to record traces of the writing process within a document.
2467 PH-mod Most, if not all, transcriptional elements imply a certain level of semantic interpretation. For instance, using the
2469 PH-mod element to encode a word or phrase that occupies interlinear space involves a decision that it has been deliberately inserted as an addition rather than an alternative, and indeed a judgment that it was written after, rather than before, the other lines. Where it is felt desirable to keep the recording of
2472 PH-mod what is the editor’s interpretation
2484 PH-mod attribute, but they provide no further interpretation of the function or intention of the passage so marked up. The
2486 PH-mod attribute may be used to indicate the end of a modified passage if this extends across the boundaries of some other XML element, for example from the middle of one line tagged as a
2515 PH-meta metamark
2516 PH-meta we mean marks such as numbers, arrows, crosses, or other symbols introduced by the writer into a document expressly for the purpose of indicating how the text is to be read. Such marks thus constitute a kind of markup of the document, rather than forming part of the text.
2521 PH-meta Unlike marginal notes or other additions to the text, metamarks are used by the writer to indicate a deliberate alteration of the writing itself, such as
2522 PH-meta move this passage over there
2523 PH-meta . An addition or annotation by contrast would typically concern some property of the passage other than its intended location or status within the text flow. A metamark may contain text, or some other graphic which the encoder wishes to represent, or it may simply consist of arrows, dots, lines etc. which the encoder simply describes.
2540 PH-meta . The passage to which the metamark applies may be indicated in either of two ways: the
2546 PH-meta itself must be supplied at the position in the document where the passage concerned begins; in the former case it may be supplied at any convenient point. The two attributes should not both be supplied.
2560 PH-meta . It is thought to function as a metamark, indicating that this sentence forms part of the regulations. A further sentence was then added, while at some later stage the text and also the metamark were deleted. We might encode this as follows:
2596 PH-meta deletion symbol to left and right of the section. The deletion itself might be encoded by using the normal
2602 PH-meta element. This is quite a different case from that of the next example, in which the writer does not intend to suppress the content, but only to mark that it has been copied to another manuscript or reused.
2607 PH-meta From "I am that halfgrown angry boy" (MS q 25), David M. Rubenstein Rare Book & Manuscript Library, Duke University.
2613 PH-meta signalled by the larger of the two single vertical lines, which shows that the written material has been transferred or re-used, not deleted.
2648 PH-meta In this example, we class as metamarks both the long vertical line and the annotation
2651 PH-meta Both metamarks are assumed to indicate that the whole of the written zone with identifier
2659 PH-fix A writer may sometimes rewrite material a second time without significant change and in the same place. We consider this a distinct activity from addition as usually defined because no new textual material results; instead the status of existing material is reaffirmed. We may distinguish two variants of this:
2674 PH-fix hastily, and then returned to it to make the letter
2675 PH-fix l
2719 PH-fix element is used only for cases where text has been written multiple times. When metamarks and other markup-like strokes have been rewritten multiple times, the
2740 undo ) is provided for the comparatively simple case where a simple deletion is marked as having been subsequently cancelled. The
2742 undo element discussed here is more widely applicable and may be used for any kind of cancellation. It points to the element or elements which are being cancelled. These components need not be contiguous, provided that the cancellation is clearly a single act; each distinct act of cancellation requires a distinct
2755 undo We hypothesize that the text has gone through three states or changes, as follows:
2765 undo This sequence of events might be encoded as follows:
2781 undo attribute, to delimit the two parts of the deletion which were reverted at change s3. Note that in this case, since
2791 undo to delimit the two sequences whose deletion is being reverted, and then use the
2817 transpo occurs when metamarks are found in a document indicating that passages should be moved to a different position. Typically this may be done using arrows, asterisks or numbers, or other means. By definition the result of a transposition is not present in the document, and should not therefore be encoded, if the intention is to represent the actual appearance of the document. Instead, the following elements may be used to indicate the intended reordering:
2851 transpo element to identify the sections of text being transposed. When (as in the following example) the whole of a line is to be transposed, there is no need to delimit the sections concerned:
2878 transpo elements may be supplied either embedded within the text or in the
2896 alter In this example two alternative readings are provided, but no preference is indicated. While the author apparently first composed the line
2902 alter . The manuscript supplies no indication of which word Moore favours at this point, although in fact, in the first printed edition of
2912 alter module gives a simple way of encoding the state of this manuscript, as follows:
2946 instantcorr necessarily implies that the modifications they indicate were made at some time after the original writing. An exception to this is where a false start or
2948 instantcorr correction has been identified: the author starts to write, and then immediately corrects what has been written.
2954 instantcorr class to modify this default assumption. When the value of
2956 instantcorr is set to
2958 instantcorr , the addition or deletion is considered to belong to the same change as its parent element, while
2960 instantcorr means some change later than that of its parent.
2962 instantcorr An example of false start or instant correction can be seen in the following line:
2966 instantcorr [I am a curse]
2970 instantcorr in which we can detect the following sequence of events:
2974 instantcorr is written and then immediately deleted
2983 instantcorr is then deleted
2991 instantcorr To indicate that the first of these acts must have taken place during the main act of writing, before the other deletion and additions, we might encode this revision campaign as follows:
3023 PH-surfzone element is both to identify a specific area containing writing and to provide a two dimensional set of coordinates which can be used to position and provide dimensions for sub-parts of it. Furthermore, surfaces may nest within other surfaces, as in the case of
3025 PH-surfzone or other written materials attached to the main writing surface. In the general case, the position and dimensions of such nested surfaces will be defined using the same coordinate system as that supplied by the parent
3038 PH-surfzone when given on the
3040 PH-surfzone element define the coordinate scheme, rather than specifying the location of that surface. We must therefore introduce an additional
3067 PH-surfzone element that contains it. This zone, and the preceding one, which contains a sequence of
3073 PH-surfzone elements occupy a rectangle with coordinates (1,1,10,10), while the nested surface occupies a rectangle with coordinates (4,4,20,20).
3075 PH-surfzone Now suppose that we wish to define a finer scale grid for the newspaper patch, perhaps because we wish to localize zones within it with greater accuracy. To do this we will need to specify the position of the nested surface as in the previous example, but also to define the new coordinate system. We accomplish this as follows:
3091 PH-surfzone As before, the second zone defines the position and size of the newspaper patch itself in terms of a coordinate system running from 0 to 50 on both X and Y axes. The nested
3093 PH-surfzone element however defines a new scale for all of its components, running from 0 to 100 on both X and Y axes. The position of the nested zone containing the text
3099 PH-surfzone attribute may be used to define non-rectangular zones as a series of points. For example, in the last of the Whitman examples discussed in section
3100 PH-surfzone above, we might wish to record the exact shape of the zone containing the metamark
3104 PH-surfzone attribute to indicate the points defining a polygon which contains it. The values used are expressed in terms of a coordinate space running from 0 to 229 in the X dimension, and 0 to 160 in the Y dimension.
3112 PH-surfzone In exactly the same way, we may wish to identify the curved zone in the following image containing the word
3119 PH-surfzone This curved zone might be encoded in the following way:
3129 PH-surfzone does not need to be entirely contained within the two-dimensional space defined by its parent surface. For example, we might wish to encode the example in
3130 PH-surfzone above not as a surface representing the whole of the two page spread, but as a surface representing only the written part of this opening. The written part appears 50 units from the left of the image and 20 units from the top, while the bottom right corner of the written part appears 400 units from the left of the image, and 280 units from the top. We therefore define the written surface within this image as follows:
3135 PH-surfzone To describe the whole image, we will now need to define a zone of interest which represents an area larger than this surface. Using the same coordinate system as that defined for the surface, its coordinates are
3137 PH-surfzone . This zone of interest can be defined by a
3139 PH-surfzone element, within which we can place the uncropped
3153 PHLAY The following methods are available to capture general aspects of the layout of material on a page where this is considered important. Within the
3184 PHLAY s corresponding with each two page opening, for example where it is clear that the writer regarded each such opening as a single writing surface, with written zones or other features crossing the page divide. An example is shown here:
3193 PHLAY The coloured lines added to this image indicate a number of zones of writing, colour coded to indicate the order in which they were written (purple, then green, then red). For example, the zone marked in red on the left contains a note referring to the purple zone on the right.
3196 PHLAY This approach assumes that the transcription will primarily be organized in the same way as the physical layout of the source, using embedded transcription elements. Alternatively, where a non-embedded transcription has been provided, using the
3198 PHLAY element, it is still possible to record gathering breaks, page breaks, column breaks, line breaks, etc. in the source, using the elements described in section
3199 PHLAY . Detailed metadata about the physical make-up of a source will usually be summarized by the
3209 PHSP The author or scribe may have left space for a word, or for an initial capital, and for some reason the word or capital was never supplied and the space left empty. The presence of significant space in the text being transcribed may be indicated by the
3214 PHSP Note that this element should not be used to mark normal inter-word space or the like.
3216 PHSP In line 694 of Chaucer's
3218 PHSP in the Holkham manuscript the scribe has left a space for a word where other manuscripts read
3225 PHSP element discussed in the previous section may be used to supply the text presumed missing:
3229 PHSP Here, the fact of the space within the manuscript is indicated by the value of the
3231 PHSP attribute. The source of the supplied text is shown by the value of the
3233 PHSP attribute as the Hengwrt manuscript; the transcriber responsible for supplying the text is ES.
3239 PHLN One of the more common forms of modification encountered in written documents of any kind is the presence of lines written under, beside, or through the text. Such lines may be of various types: they may be solid, dashed or dotted, doubled or tripled, wavy or straight, or a combination of these and other renderings. The line may be used for emphasis, or to mark a foreign or technical term, or to signal a quotation or a title, etc.: the elements
3249 PHLN may be used for these. Where the line has a clear paratextual function the
3251 PHLN element may be considered more appropriate. Frequently, a scholar may judge that a line is used to delete text: the
3274 PHLN The above examples presume the common case where a single word or phrase is marked by a line, with no doubt as to where the marking begins or ends and with no overlapping of the area of text with other marked areas of text. Where there is doubt, the
3287 PHLN Where the area of text marked overlaps other areas of text, for example crossing a structural division, one of the spanning mechanisms mentioned above must be used; for example where the line is thought to mark a deletion, the
3289 PHLN element may be used. Where it is desired simply to record the marking of a span of text in circumstances where it is not possible to surround the text with a
3299 PHLN More work needs to be done on clarifying the treatment of other textual features marked by lines which might so overlap or nest. For example, in many Middle English manuscripts (e.g. the Jesus and Digby verse collections), marginal sidebars may indicate metrical structure: couplets may be linked in pairs, with the pairs themselves linked into stanzas. Or, marginal sidebars may indicate emphasis, or may point out a region of text on which there is some annotation: in many manuscripts of Chaucer's
3307 PHLN element, containing a prose description of the manuscript at this point, enhanced by a link to a visual representation (or facsimile) of the feature in question. For example, in the Chaucer example just cited, one may wish to record that the
3325 PHSK Such information as page numbers, signatures, or catchwords may be recorded in a specialized
3327 PHSK element provided for that purpose. Although the name derives from the term
3333 PHSK element may be used for such features of any document, written or printed. Note that the purpose of this element is to record page numbers etc.
3346 PHSK : since this information is usually provided by the encoder, it is not subject to the constraint that it should be present only if textually present in the source being encoded. In text-critical situations it may be useful to provide both a normalized version of the pagination and a representation of the catch-word or numbering, especially when the latter presents a variant reading, or is significant for compositor identification.
3361 PHSK other material repeated from page to page, which falls outside the stream of the text
3386 PH-changes A major purpose of genetic editing is the identification of
3390 PH-changes . An editor may wish to assign a set of alterations (deletions, additions, substitutions, transpositions, etc.) or any other act of writing to a particular change, to indicate both that one or more of such phenomena preceded or followed another and also to indicate that they are related in some way, for example that one is a consequence of the other. They might also wish to group together certain revisions, regardless of when they might have occurred, based on a variety of other shared characteristics (e.g., corrections of factual errors or revisions that incorporate suggestions made by a given reader). To document this we need:
3392 PH-changes a system to assign phenomena to a particular change
3394 PH-changes a way to characterize a change, in itself and in relation to other changes.
3399 PH-changes (within the TEI header profile description) contains all information relating to the genesis or production of a text. It may contain a
3401 PH-changes element which contains a number of
3409 PH-changes In the following example an editor has identified four distinct changes:
3435 PH-changes (the default). The attribute specifies whether the order of child elements signifies a temporal order for the revision campaigns which they document. In the example above, the editor has asserted that the four stages distinguished are ordered chronologically according to the order of the
3440 PH-changes elements can be nested hierarchically. This may be helpful in two cases. Firstly one can build up hypotheses about related revisions step-by-step, starting with stages of smaller coverage, whose members are certainly related, and then in a subsequent pass grouping these stages in turn, thereby extending their reach.
3481 PH-changes In addition to the possibility of ordering text stages in relation to each other,
3483 PH-changes elements may carry a number of attributes from the
3497 PH-changes ) which allow each stage to be dated as exactly or inexactly as necessary, in the same way as is currently possible for the TEI
3542 PH-changes element, apart from declaring a distinct change in the creation of the document, may also contain references to other annotations contained within the
3544 PH-changes or in the document (as shown in the previous example). Such references, along with the textual content, are purely documentary and do not affect the textual stage associated with any element thus referred to. The association of a textual component with a change is always made explicitly, either by using the
3554 PH-changes element is associated with some element, it is also associated with all of that element's children, unless otherwise indicated, for example by a new value for the
3558 PH-changes In the following simple example, the text at one stage read
3570 PH-changes In this example, however, the text originally read
3584 PH-changes Note that in this case both the deletion and the addition are associated with the second stage. The word
3594 PH-changes and the like carry an implied semantics concerning the order in which events in the writing of a document were carried out: something which is deleted must have been written before it was deleted; something which is added must have been added at a later stage of the writing. Even when a combination of such elements is used, the chronology can usually be inferred (see further
3595 PH-changes ). Explicit indication of the stage to which some modification belongs is mostly useful in situations where all the alterations identified in a document are to be grouped, for example chronologically.
3599 PH-changes The interpretation of change assignments for a particular text passage is based on a number of implicit assumptions and constraints which have the effect of minimizing the amount of tagging necessary. The system is also flexible enough to support an explicit distinction between acts of writing and textual alterations, since either of these can be associated with changes described in the encoding. The following example shows an encoding in which the same passage is transcribed twice, once from a documentary perspective, and once from a textual one:
3655 PH-changes The documentary transcription stresses the writing process, while the textual transcription emphasizes textual alterations. In either case, the change of writing activity associated with a particular feature in the transcript is explicitly indicated. From the documentary perspective, by assigning particular modifications to a specific change, we describe the writing process, in that these assignments specify which segment has been written when
3656 PH-changes . From the textual perspective, the markup concentrates simply on the existence of textual alterations and makes no explicit claims about the order of writing.
3663 PHTRXX We repeat the advice given at the beginning of this chapter, that these recommendations are not intended to meet every transcriptional circumstance ever likely to be faced by any scholar. They are intended rather as a base to enable encoding of the most common phenomena found in the course of scholarly transcription of primary source materials. These guidelines particularly do not address the encoding of physical description of textual witnesses: the materials of the carrier, the medium of the inscribing implement, the organisation of the carrier materials themselves (as quiring, collation, etc.), authorial instructions or scribal markup, etc., except insofar as these are involved in the broader question of manuscript description, as addressed by the
3688 PH The selection and combination of modules to form a TEI schema is described in

HD-Header.xml#13139

# id text
2 HD The TEI Header
4 HD This chapter addresses the problems of describing an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented. Such documentation is equally necessary for scholars using the texts, for software processing them, and for cataloguers in libraries and archives. Together these descriptions and declarations provide an electronic analogue to the title page attached to a printed work. They also constitute an equivalent for the content of the code books or introductory manuals customarily accompanying electronic data sets.
6 HD Every TEI-conformant text must carry such a set of descriptions, prefixed to it and encoded as described in this chapter. The set is known as the
7 HD TEI header
16 HD , containing a full bibliographical description of the computer file itself, from which a user of the text could derive a proper bibliographic citation, or which a librarian or archivist could use in creating a catalogue entry recording its presence within a library or archive. The term
18 HD here is to be understood as referring to the whole entity or document described by the header, even when this is stored in several distinct operating system files. The file description also includes information about the source or sources from which the electronic document was derived. The TEI elements used to encode the file description are described in section
25 HD , which describes the relationship between an electronic text and its source or sources. It allows for detailed description of whether (or how) the text was normalized during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, and similar matters. The TEI elements used to encode the encoding description are described in section
29 HD text profile
32 HD , containing classificatory and contextual information about the text, such as its subject matter, the situation in which it was produced, the individuals described by or participating in producing it, and so forth. Such a text profile is of particular use in highly structured composite texts such as corpora or language collections, where it is often highly desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin. The text profile may however be of use in any form of automatic text processing. The TEI elements used to encode the profile description are described in section
36 HD revision history
39 HD , which allows the encoder to provide a history of changes made during the development of the electronic text. The revision history is important for
41 HD and for resolving questions about the history of a file. The TEI elements used to encode the revision description are described in section
45 HD A TEI header can be a very large and complex object, or it may be a very simple one. Some application areas (for example, the construction of language corpora and the transcription of spoken texts) may require more specialized and detailed information than others. The present proposals therefore define both a
46 HD core
47 HD set of elements (all of which may be used without formality in any TEI header) and some additional elements which become available within the header as the result of including additional specialized modules within the schema. When the module for language corpora (described in chapter
48 HD ) is in use, for example, several additional elements are available, as further detailed in that chapter.
50 HD The next section of the present chapter briefly introduces the overall structure of the header and the kinds of data it may contain. This is followed by a detailed description of all the constituent elements which may be used in the core header. Section
51 HD , at the end of the present chapter, discusses the recommended content of a minimal TEI header and its relation to standard library cataloguing practices.
53 HD1 Organization of the TEI Header
55 HD11 The TEI Header and Its Components
61 HD11 front matter
62 HD11 of the text itself (for which see section
63 HD11 ). A composite text, such as a corpus or collection, may contain several headers, as further discussed below. In the general case, however, a TEI-conformant text will contain a single
71 HD11 The header element has the following description:
76 HD11 element has four principal components:
81 HD11 element is required in all TEI headers; the others are optional. Only one of the four components of the TEI header (the
84 HD11 below. The smallest possible valid TEI Header thus looks like this:
94 HD11 The content of the elements making up a TEI header may be given in any language, not necessarily that of the text to which the header applies, and not necessarily English. As elsewhere, the
96 HD11 attribute should be used at an appropriate level to specify the language. For example, in the following schematic example, an English text has been given a French header:
106 HD11 In the case of language corpora or collections, it may be desirable to record header information either at the level of the individual components in the corpus or collection, or at the level of the corpus or collection itself (more details concerning the tagging of composite texts are given in section
109 HD11 attribute may be used to indicate whether the header applies to a corpus or a single text. A corpus may thus take the form:
144 HD12 Types of Content in the TEI Header
146 HD12 The elements occurring within the TEI header may contain several types of content; the following list indicates how these types of content are described in the following sections:
151 HD12 should be understood to imply a series of paragraphs, each marked as a
165 HD12 ) usually enclose a group of specialized elements recording some structured information. In the case of the bibliographic elements, the suffix
171 HD12 . On the relation between the TEI proposals and other standards for bibliographic description, see further section
173 HD12 In most cases grouping elements may contain prose descriptions as an alternative to the set of specialized elements, thus allowing the encoder to choose whether or not the information concerned should be presented in a structured form or in prose.
182 HD12 ) enclose information about specific encoding practices applied in the electronic text; often these practices are described in coded form. Typically, such information takes the form of a series of declarations, identifying a code with some more complex structure or description. A declaration which applies to more than one text or division of a text need not be repeated in the header of each such text or subdivision. Instead, the
184 HD12 attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section
197 HD1 Model Classes in the TEI Header
199 HD1 The TEI header provides a very rich collection of metadata categories, but makes no claim to be exhaustive. It is certainly the case that individual projects may wish to record specialized metadata which either does not fit within one of the predefined categories identified by the TEI header or requires a more specialized element structure than is proposed here. To overcome this problem, the encoder may elect to define additional elements using the customization methods discussed in
200 HD1 . The TEI class system makes such customizations simpler to effect and easier to use in interchange.
202 HD1 These classes are specific to parts of the header:
224 HD2 The bibliographic description of a machine-readable or digital text resembles in structure that of a book, an article, or any other kind of textual object. The file description element of the TEI header has therefore been closely modelled on existing standards in library cataloguing; it should thus provide enough information to allow users to give standard bibliographic references to the electronic text, and to allow cataloguers to catalogue it. Bibliographic citations occurring elsewhere in the header, and also in the text itself, are derived from the same model (on bibliographic citations in general, see further section
228 HD2 The bibliographic description of an electronic text should be supplied by the mandatory
288 HD21 It contains the title given to the electronic work, together with one or more optional
295 HD21 element contains the chief name of the electronic work, including any alternative title or subtitles it may have. It may be repeated, if the work has more than one title (perhaps in different languages) and takes whatever form is considered appropriate by its creator. Where the electronic work is derived from an existing source text, it is strongly recommended that the title for the former should be derived from the latter, but clearly distinguishable from it, for example by the addition of a phrase such as
298 HD21 a digital edition
300 HD21 This will distinguish the electronic work from the source text in citations and in catalogues which contain descriptions of both types of material.
302 HD21 The electronic work will also have an external name (its
305 HD21 data set name
306 HD21 ) or reference number on the computer system where it resides at any time. This name is likely to change frequently, as new copies of the file are made on the computer system. Its form is entirely dependent on the particular computer system in use and thus cannot always easily be transferred from one system to another. Moreover, a given work may be composed of many files. For these reasons, these Guidelines strongly recommend that such names should
329 HD21 which identify the person(s) responsible for the intellectual or artistic content of an item and any corporate bodies from which it emanates.
331 HD21 Any number of such statements may occur within the title statement. At a minimum, identify the author of the text and (where appropriate) the creator of the file. If the bibliographic description is for a corpus, identify the creator of the corpus.
332 HD21 Optionally include also names of others involved in the transcription or elaboration of the text, sponsors, and funding agencies. The name of the person responsible for physical data input need not normally be recorded, unless that person is also intellectually responsible for some aspect of the creation of the file.
334 HD21 Where the person whose responsibility is to be documented is not an author, sponsor, funding body, or principal researcher, the
340 HD21 element indicating the nature of the responsibility. No specific recommendations are made at this time as to appropriate content for the
344 HD21 Names given may be personal names or corporate names. Give all names in the form in which the persons or bodies wish to be publicly cited. This would usually be the fullest form of the name, including first names.
345 HD21 Agencies compiling catalogues of machine-readable files are recommended to use available authority lists, such as the Library of Congress Name Authority List, for all common personal names.
400 HD22 It contains either phrases or more specialized elements identifying the edition and those responsible for it:
404 HD22 edition
405 HD22 applies to the set of all the identical copies of an item produced from one master copy and issued by a particular publishing agency or a group of such agencies. A change in the identity of the distributing body or bodies does not normally constitute a change of edition, while a change in the master copy does.
409 HD22 is not entirely appropriate, since they are far more easily copied and modified than printed ones; nonetheless the term
410 HD22 edition
411 HD22 may be used for a particular state of a machine-readable text at which substantive changes are made and fixed. Synonymous terms used in these Guidelines are
424 HD22 changes have to be before they are regarded as producing a new edition, rather than a simple update. The general principle proposed here is that the production of a new edition entails a significant change in the intellectual content of the file, rather than its encoding or appearance. The addition of analytic coding to a text would thus constitute a new edition, while automatic conversion from one coded representation to another would not. Changes relating to the character code or physical storage details, corrections of misspellings, simple changes in the arrangement of the contents and changes in the output format do not normally constitute a new edition, whereas the addition of new information (e.g. a linguistic analysis expressed in part-of-speech tagging, sound or graphics, referential links to external data sets) almost always does.
426 HD22 Clearly, there will always be borderline cases and the matter is somewhat arbitrary. The simplest rule is: if you think that your file is a new edition, then call it such. An edition statement is optional for the first release of a computer file; it is mandatory for each later release, though this requirement cannot be enforced by the parser.
430 HD22 changes in a file considered significant, whether or not they are regarded as constituting a new edition or simply a new revision, should be independently noted in the revision description section of the file header (see section
435 HD22 element should contain phrases describing the edition or version, including the word
436 HD22 edition
439 HD22 , or equivalent, together with a number or date, or terms indicating difference from other editions such as
440 HD22 new edition
442 HD22 revised edition
443 HD22 etc. Any dates that occur within the edition statement should be marked with the
453 HD22 elements may also be used to supply statements of responsibility for the edition in question. These may refer to individuals or corporate bodies and can indicate functions such as that of a reviser, or can name the person or body responsible for the provision of supplementary matter, of appendices, etc., in a new edition. For further detail on the
487 HD23 For printed books, information about the carrier, such as the kind of medium used and its size, is of great importance in cataloguing procedures. The print-oriented rules for bibliographic description of an item's medium and extent need some re-interpretation when applied to electronic media. An electronic file exists as a distinct entity quite independently of its carrier and remains the same intellectual object whether it is stored on a magnetic tape, a CD-ROM, a set of floppy disks, or as a file on a mainframe computer. Since, moreover, these Guidelines are specifically aimed at facilitating transparent document storage and interchange, any purely machine-dependent information should be irrelevant as far as the file header is concerned.
497 HD23 Although it is equally system-dependent, some measure of the size of the computer file may be of use for cataloguing and other practical purposes. Because the measurement and expression of file size are fraught with difficulties, only very general recommendations are possible; the element
543 HD23 Note that when more than one
545 HD23 is supplied in a single
558 HD24 element and is mandatory. Its function is to name the agency by which a resource is made available (for example, a publisher or distributor) and to supply any additional information about the way in which it is made available such as licensing conditions, identifying numbers, etc.
562 HD24 These elements form the
564 HD24 class; if the agency making the resource available is unknown, but other structured information about it is available, an explicit statement such as
565 HD24 publisher unknown
569 HD24 publisher
570 HD24 is the person or institution by whose authority a given edition of the file is made public. The
571 HD24 distributor
572 HD24 is the person or institution from whom copies of the text may be obtained. Where a text is not considered formally published, but is nevertheless made available for circulation by some individual or organization, this person or institution is termed the
573 HD24 release authority
576 HD24 Whichever of these elements is chosen, it may be followed by one or more of the following elements, which together form the
596 HD24 elements all supply additional information relating to the publisher, distributor, or release authority immediately preceding them. In the following example, Benson is identified as responsible for distribution of some resource at the date and place cited:
605 HD24 A resource may have (for example) both a publisher and a distributor, or more than one publisher each using different identifiers for the same resource, and so on. For this reason, the sequence of at least one
611 HD24 The following example shows a resource published by one agency (Sigma Press) at one address and date, which is also distributed by another (Oxford Text Archive), with a specified identifier and a different date:
641 HD24 always refers to the date of publication, first distribution, or initial release. If the text was created at some other date, this may be recorded using the
645 HD24 element. Other useful dates (such as dates of collection of data) may be given using a note in the
663 HD24 attribute to point to a location from which the licence document itself may be obtained. Alternatively, the licence document may simply be contained within the
680 HD26 series
683 HD26 A group of separate items related to one another by the fact that each item bears, in addition to its own title proper, a collective title applying to the group as a whole. The individual items may or may not be numbered.
687 HD26 A separately numbered sequence of volumes within a series or serial.
695 HD26 may be used to supply any identifying number associated with the item, including both standard numbers such as an ISSN and particular issue numbers. (Arabic numerals separated by punctuation are recommended for this purpose:
701 HD26 attribute is used to categorize the number further, taking the value
737 HD27 the nature, scope, artistic form, or purpose of the file; also the genre or other intellectual category to which it may belong: e.g.
744 HD27 an abstract or summary of the content of a document which has been supplied by the encoder because no such abstract forms part of the content of the source. This should be supplied in the
751 HD27 summary description providing a factual, non-evaluative account of the subject content of the file: e.g.
758 HD27 bibliographic details relating to the source or sources of an electronic text: e.g.
759 HD27 Transcribed from the Norton facsimile of the 1623 Folio
765 HD27 further information relating to publication, distribution, or release of the text, including sources from which the text may be obtained, any restrictions on its use or formal terms on its availability. These should be placed in the appropriate division of the
771 HD27 ICPSR study number 1803
773 HD27 Oxford Text Archive text number 1243
785 HD27 dates, when they are relevant to the content or condition of the computer file: e.g.
790 HD27 names of persons or bodies connected with the technical production, administration, or consulting functions of the effort which produced the file, if these are not named in statements of responsibility in the title or edition statements of the file description: e.g.
793 HD27 availability of the file in an additional medium or information not already recorded about the availability of documentation: e.g.
796 HD27 language of work and abstract, if not encoded in the
801 HD27 The unique name assigned to a serial by the International Serials Data System (ISDS), if not encoded in an
804 HD27 lists of related publications, either describing the source itself, or concerned with the creation or use of the electronic work, e.g.
808 HD27 Each such item of information may be tagged using the general-purpose
819 HD27 There are advantages, however, to encoding such information with more precise elements elsewhere in the TEI header, when such elements are available. For example, the notes above might be encoded as follows:
847 HD3 element. It is a mandatory element and is used to record details of the source or sources from which a computer file is derived. This might be a printed text or manuscript, another computer file, an audio or video recording of some kind, or a combination of these. An electronic file may also have no source, if what is being catalogued is an original text created in electronic form.
852 HD3 element may contain little more than a simple prose description, or a brief note stating that the document has no source:
864 HD3 These classes make available by default a range of ways of providing bibliographic citations which specify the provenance of the text. For written or printed sources, the source may be described in the same way as any other bibliographic citation, using one of the following elements:
871 HD3 . Using them, a source might be described in very simple terms:
896 HD3 When the header describes a text derived from some pre-existing TEI-conformant or other digital document, it may be simpler to use the following element, which is designed specifically for documents derived from texts which were
912 HD3 class also makes available additional elements when additional modules are included. For example, when the
916 HD3 element may also include the following special-purpose elements, intended for cases where an electronic text is derived from a spoken text rather than a written one:
920 HD3 A single electronic text may be derived from multiple source documents, in whole or in part. The
935 HD3 may be used to associate parts of the encoded text with the bibliographic element from which it derives in either case.
937 HD3 The source description may also include lists of names, persons, places, etc. when these are considered to form part of the source for an encoded document. When such information is recorded using the specialized elements discussed in the
956 HD31 If a computer file (call it B) is derived not from a printed source but from another computer file (call it A) which includes a TEI file header, then the source text of computer file B is another computer file, A. The four sections of A's file header will need to be incorporated into the new header for B in slightly differing ways, as listed below:
957 HD31 fileDesc
964 HD31 profileDesc
969 HD31 encodingDesc
971 HD31 A's encoding practice may or (more likely) may not be the same as B's. Since the object of the encoding description is to define the relationship between the current file and its source, in principle only changes in encoding practice between A and B need be documented in B. The relationship between A and its source(s) is then only recoverable from the original header of A. In practice it may be more convenient to create a new complete
974 HD31 revisionDesc
988 HD5 element is the second major subdivision of the TEI header. It specifies the methods and editorial principles which governed the transcription or encoding of the text in hand and may also include sets of coded definitions used by other components of the header. Though not formally required, its use is highly recommended.
1022 HD51 element may be used to describe, in prose, the purpose for which a digital resource was created, together with any other relevant information concerning the process by which it was assembled or collected. This is of particular importance for corpora or miscellaneous collections, but may be of use for any text, for example to explain why one kind of encoding practice has been followed rather than another.
1048 HD52 the underlying population being sampled
1059 HD52 It may also include a simple description of any parts of the source text included or excluded.
1064 HD52 A sampling declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the
1066 HD52 attribute of each text (or subdivision of the text) to which the sampling declaration applies may be used to supply a cross-reference to it, as further described in section
1079 HD53 It may contain a prose description only, or one or more of a set of specialized elements, members of the TEI
1083 HD53 Some of these policy elements carry attributes to support automated processing of certain well-defined editorial decisions; all of them contain a prose description of the editorial principles adopted with respect to the particular feature concerned. Examples of the kinds of questions which these descriptions are intended to answer are given in the list below.
1091 HD53 Was the text corrected during or after data capture? If so, were corrections made silently or are they marked using the tags described in section
1092 HD53 ? What principles have been adopted with respect to omissions, truncations, dubious corrections, alternate readings, false starts, repetitions, etc.?
1099 HD53 Was the text normalized, for example by regularizing any non-standard spellings, dialect forms, etc.? If so, were normalizations performed silently or are they marked using the tags described in section
1100 HD53 ? What authority was used for the regularization? Also, what principles were used when normalizing numbers to provide the standard values for the
1110 HD53 How were quotation marks processed? Are apostrophes and quotation marks distinguished? How? Are quotation marks retained as content in the text or replaced by markup? Are there any special conventions regarding for example the use of single or double quotation marks when nested? Is the file consistent in its practice or has this not been checked? See section
1111 HD53 for discussion of ways in which quotation marks may be encoded.
1122 HD53 hyphens? What principle has been adopted with respect to end-of-line hyphenation where source lineation has not been retained? Have soft hyphens been silently removed, and if so what is the effect on lineation and pagination? See section
1123 HD53 for discussion of ways in which hyphenation may be encoded.
1130 HD53 How is the text segmented? If
1134 HD53 segmentation units have been used to divide up the text for analysis, how are they marked and how was the segmentation arrived at?
1153 HD53 Has any analytic or
1155 HD53 information been provided—that is, information which is felt to be non-obvious, or potentially contentious? If so, how was it generated? How was it encoded? If feature-structure analysis has been used, are
1166 HD53 How has the encoding of punctuation marks present in the original source been treated? For example, has it been normalised, or suppressed in favour of descriptive markup? If it has been retained, is it located within or around elements such as
1170 HD53 Any information about the editorial principles applied not falling under one of the above headings should be recorded in a distinct list of items. Experience shows that a full record should be kept of decisions relating to editorial principles and encoding practice, both for future users of the text and for the project which produced the text in the first instance. Some simple examples follow:
1202 HD53 An editorial practices declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the
1204 HD53 attribute of each text (or subdivision of the text) to which it applies may be used to supply a cross-reference to it, as further described in section
1213 HD57 the namespace to which elements appearing within the transcribed text belong.
1215 HD57 how often particular elements appear within the text, so that a recipient can validate the integrity of a text during interchange.
1219 HD57 a default rendition applicable to all instances of an element.
1230 HD57 element consists of an optional sequence of
1232 HD57 elements, each of which must bear a unique identifier, followed by an optional sequence of one or more
1234 HD57 elements, each of which contains a series of
1236 HD57 elements, up to one for each element type from that namespace occurring within the associated
1249 HD57-1 element allows the encoder to specify how one or more elements are rendered in the original source in any of the following ways:
1253 HD57-1 using a standard stylesheet language such as CSS or XSL-FO
1255 HD57-1 using a project-defined formal language
1264 HD57-1 element may be used to indicate a default rendition for all occurrences of the named element
1268 HD57-1 attribute may be used on any element to indicate its rendition, overriding or complementing any supplied default value
1279 HD57-1 elements are by default to be rendered using one set of specifications identified as
1306 HD57-1 As noted above, the content of a
1308 HD57-1 element may describe the appearance of the source material using prose, a project-defined formal language, or any standard languages such as the Cascading Stylesheet Language (
1313 HD57-1 ) may be supplied within the
1327 HD57-1 First we define a rendition element for each aspect of the source page rendition that we wish to retain. Details of CSS are given in
1328 HD57-1 ; we use it here simply to provide a vocabulary with which to describe such aspects as font size and style, letter and line spacing, colour, etc. Note that the purpose of this encoding is to describe the original, rather than specify how it should be reproduced, although the two are obviously closely linked.
1355 HD57-1 attribute can now be used to specify on any element which of the above rendition features apply to it. For example, a title page might be encoded as follows:
1393 HD57-1 pseudo-elements can be used, often in conjunction with the "content" property, to add characters which need to appear before or after the element content to make it more closely resemble the appearance of the source.
1395 HD57-1 For example, assuming that a text has been encoded using the
1397 HD57-1 element to enclose passages in quotation marks, but the quotation marks themselves have been routinely omitted from the encoding, a set of renditions such as the following:
1409 HD57-1 element is actually rendered in the source with initial and final quotation marks, it may then be encoded as follows:
1420 HD57-2 element, if present, should contain up to one occurrence of a
1422 HD57-2 element for each element type from the given namespace that occurs within the outermost
1427 HD57-2 In the case of a TEI corpus (
1430 HD57-2 in a corpus header will describe tag usage across the whole corpus, while one in an individual text header will describe tag usage for the individual text concerned.
1433 HD57-2 element may be used to supply a count of the number of occurrences of this element within the text, which is given as the value of its
1435 HD57-2 attribute. It may also be used to hold any additional usage information, which is supplied as running prose within the element itself.
1447 HD57-2 attribute may optionally be used to specify how many of the occurrences of the element in question bear a value for the global
1455 HD57-2 The content of the
1461 HD57-2 attributes, but if it does, then the counts provided must correspond with the number of such elements present in the associated
1474 HD57-1a The content of the
1476 HD57-1a element and the value of the
1478 HD57-1a attribute are expressed using one of a small number of formally defined style definition languages. For ease of processing, it is strongly recommended to use a single such language throughout an encoding project, although the TEI system permits a mixture.
1484 HD57-1a element, is used to supply the name of the default style definition language. The name is supplied as the value of the
1490 HD57-1a Informal free text description
1499 HD57-1a A user-defined formal description language
1503 HD57-1a attribute may be used to supply the precise version of the style definition language used, and the content of this element, if any, may supply additional information.
1507 HD57-1a attribute is used, its value must always be expressed using whichever default style definition language is in force. If more than one occurrence of the
1509 HD57-1a is provided, there will be more than one default available, and the
1522 HD54 It may contain either a series of prose paragraphs or the following specialized elements:
1527 HD54 Note that not all possible referencing schemes are equally easily supported by current software systems. A choice must be made between the convenience of the encoder and the likely efficiency of the particular software applications envisaged, in this context as in many others. For a more detailed discussion of referencing systems supported by these Guidelines, see section
1534 HD54 as a series of pairs of regular expressions and XPaths
1537 HD54 milestone
1538 HD54 s
1545 HD54 element can be included in the header if more than one canonical reference scheme is to be used in the same document, but the current proposals do not check for mutual inconsistency.
1551 HD54P by a simple prose description. Such a description should indicate which elements carry identifying information, and whether this information is represented as attribute values or as content. Any special rules about how the information is to be interpreted when reading or generating a reference string should also be specified here. Such a prose description cannot be processed automatically, and this method of specifying the structure of a canonical reference system is therefore not recommended for automatic processing.
1592 HD54M This method is appropriate when only
1593 HD54M milestone
1597 HD54M A reference based on milestone tags concatenates the values specified by one or more such tags. Since each tag marks the point at which a value changes, it may be regarded as specifying the
1598 HD54M refState
1599 HD54M of a variable. A reference declaration using this method therefore specifies the individual components of the canonical reference as a sequence of
1608 HD54M might be thought of as representing the state of three variables: the
1610 HD54M variable is in state
1614 HD54M variable is in state
1618 HD54M variable is in state
1620 HD54M . If milestone tagging has been used, there should be a tag marking the point in the text at which each of the above
1625 HD54M tag itself, what are here referred to as
1634 HD54M therefore an application must scan left to right through the text, monitoring changes in the state of each of these three variables as it does so. When all three are simultaneously in the required state, the desired point will have been reached. There may of course be several such points.
1642 HD54M tags in the text are to be checked for state-changes. A state-change is signalled whenever a new
1644 HD54M tag is found with
1650 HD54M element in question. The value for the new state may be given explicitly by the
1654 HD54M element, or it may be implied, if the
1658 HD54M For example, for canonical references in the form
1662 HD54M represents the page number in the first edition, and
1664 HD54M the line number within this page, a reference system declaration such as the following would be appropriate:
1668 HD54M This implies that milestone tags of the form
1670 HD54M will be found throughout the text, marking the positions at which page and line numbers change. Note that no value has been specified for the
1672 HD54M attribute on the second milestone tag above; this implies that its value at each state change is monotonically increased. For more detail on the use of milestone tags, see section
1677 HD54M The milestone referencing scheme, though conceptually simple, is not supported by a generic XML parser. Its use places a correspondingly greater burden of verification and accuracy on the encoder.
1687 HD54M A reference system declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the
1689 HD54M attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section
1695 HD55 element is used to group together definitions or sources for any descriptive classification schemes used by other parts of the header. Each such scheme is represented by a
1705 HD55 element has two slightly different, but related, functions. For well-recognized and documented public classification schemes, such as Dewey or other published descriptive thesauri, it contains simply a bibliographic citation indicating where a full description of a particular taxonomy may be found.
1715 HD55 element contains a description of the taxonomy itself as well as an optional bibliographic citation. The description consists of a number of
1717 HD55 elements, each defining a single category within the given typology. The category is defined by the contents of a nested
1719 HD55 element, which may contain either a phrase describing the category, or any number of elements from the
1721 HD55 class. When the corpus module is included in a schema, this class provides the
1723 HD55 element whose components allow the definition of a text type in terms of a set of
1726 HD55 ; if the corpus module is not included in a schema, this class is empty and the
1730 HD55 If the category is subdivided, each subdivision is represented by a nested
1732 HD55 element, having the same structure. Categories may be nested to an arbitrary depth in order to reflect the hierarchical structure of the taxonomy. Each
1766 HD55 Linkage between a particular text and a category within such a taxonomy is made by means of the
1771 HD55 . Where the taxonomy permits of classification along more than one dimension, more than one category will be referenced by a particular
1773 HD55 , as in the following example, which identifies a text with the sub-categories
1779 HD55 within the category
1787 HD55 child, when for example the category is described in more than one language, as in the following example:
1821 HDGDECL The following element is provided to indicate (within the header of a document, or in an external location) that a particular coordinate notation, or a particular datum, has been employed in a text. The default notation is a string containing two real numbers separated by whitespace, of which the first indicates latitude and the second longitude according to the 1984 World Geodetic System (WGS84).
1833 HDSCHSPEC , it allows embedding of a schema inside a TEI header; alternatively, this element may be used in the
1840 HDSCHSPEC element contains all the information needed to generate schemas for a particular TEI customization, and the ODD documentation elements, by reference to the TEI, are more succinct than the schemas derived from them. Therefore you may find it convenient to make a copy of the
1844 HDSCHSPEC itself, in addition to supplying an external schema and/or ODD file; if the XML file becomes separated from its schema, the schema can be regenerated at any time using the information in the
1864 HDAPP to allow an application to discover that it has previously opened or edited a file, and what version of itself was used to do that;
1866 HDAPP to show (through a date) which application last edited the file, to allow for diagnosis of any problems that might have been caused by that application;
1868 HDAPP to allow users to discover information about an application used to edit the file
1870 HDAPP to allow the application to declare an interest in elements of the file which it has edited, so that other applications or human editors may be more wary of making changes to those sections of the file.
1886 HDAPP element identifies the current state of one software application with regard to the current file. This element is a member of the
1888 HDAPP class, which provides a variety of attributes for associating this state with a date and time, or a temporal range. The
1892 HDAPP attributes should be used to uniquely identify the application and its major version number (for example,
1894 HDAPP ). It is not intended that an application should add a new
1896 HDAPP each time it touches the file.
1898 HDAPP The following example shows how these elements might be used to document the fact that version 1.5 of an application called
1916 HDENCOTH The elements discussed so far are available to any schema. When the schema in use includes some of the more specialized TEI modules, these make available other more module-specific components of the encoding description. These are discussed fully in the documentation for the module in question, but are also noted briefly here for convenience.
1919 HDENCOTH element is available only when the
1921 HDENCOTH module is included in a schema. Its purpose is to document the
1924 HDENCOTH ) underlying any analytic
1927 HDENCOTH ) present in the text documented by this header.
1930 HDENCOTH element is available only when the
1932 HDENCOTH module is included in a schema. Its purpose is to document any metrical notation scheme used in the text, as further discussed in section
1933 HDENCOTH . It consists either of a prose description or a series of
1938 HDENCOTH element is available only when the
1940 HDENCOTH module is included in a schema. Its purpose is to document the method used to encode textual variants in the text, as discussed in section
1949 HD4 element is the third major subdivision of the TEI header. It is an optional element, the purpose of which is to enable information characterizing various descriptive aspects of a text or a corpus to be recorded within a single unified framework.
1952 HD4 In principle, almost any component of the header might be of importance as a means of characterizing a text. The author of a written text, its title or its date of publication, may all be regarded as characterizing it at least as strongly as any of the parameters discussed in this section. The rule of thumb applied has been to exclude from discussion here most of the information which generally forms part of a standard bibliographic style description, if only because such information has already been included elsewhere in the TEI header.
1958 HD4 element, followed by any number of additional elements taken from the
1960 HD4 class. The default members of this class are the following:
1991 HD4 . Its purpose is to group together a number of
1995 HD4 element can also appear within a structured manuscript description, when the
2000 HD4 element is actually declared within the header module, but is only accessible to a schema when one or other of the
2020 HD4C element contains phrases describing the origin of the text, e.g. the date and place of its composition.
2023 HD4C The date and place of composition are often of particular importance for studies of linguistic variation; since such information cannot be inferred with confidence from the bibliographic description of the copy text, the
2025 HD4C element may be used to provide a consistent location for this information:
2044 HD41 elements, each of which provides information about a single language, notably the quantity of that language present in the text. Note that this element should
2056 HD41 element may be supplied for each different language used in a document. If used, its
2058 HD41 attribute should specify an appropriate language identifier, as further discussed in section
2059 HD41 . This is particularly important if extended language identifiers have been used as the value of
2079 HD43 element is used to classify a text in some way.
2087 HD43 by providing a set of keywords, as provided for example by British Library or Library of Congress Cataloguing in Publication data
2089 HD43 by referencing any other taxonomy of text categories recognized in the field concerned, or peculiar to the material in hand; this may include one based on recurring sets of values for the situational parameters defined in section
2101 HD43 element simply categorizes an individual text by supplying a list of keywords which may describe its topic or subject matter, its form, date, etc. In some schemes, the order of items in the list is significant, for example, from major topic to minor; in others, the list has an organized substructure of its own. No recommendations are made here as to which method is to be preferred. Wherever possible, such keywords should be taken from a recognized source, such as the British Library/Library of Congress Cataloguing in Publication data in the case of printed books, or a published thesaurus appropriate to the field.
2105 HD43 attribute is used to indicate the source of the keywords used, in the case where such a source exists. If the keywords are taken from an externally defined authority which is available online, this attribute should point directly to it, as in the following examples:
2125 HD43 If the authority file is not available online, but is generally recognized and commonly cited, a bibliographic description for it should be supplied within the
2130 HD43 attribute may then reference that
2154 HD43 If no authority file exists, perhaps because the keywords used were assigned directly by an author, the
2158 HD43 Alternatively, if the keyword vocabulary itself is locally defined, the
2172 HD43 element also categorizes an individual text, by supplying a numerical or other code rather than descriptive terms. Such codes constitute a recognized classification scheme, such as the Dewey Decimal Classification. On this element, the
2174 HD43 attribute is required; it indicates the source of the classification scheme in the same way as for keywords: this may be a pointer of any kind, either to a TEI element, possibly in the current document, as in the
2176 HD43 examples above, or to some canonical source for the scheme, as in the following example:
2183 HD43 element categorizes an individual text by pointing to one or more
2192 HD43 ) holds information about a particular classification or category within a given taxonomy. Each such category must have a unique identifier, which may be supplied as the value of the
2196 HD43 elements which are regarded as falling within the category indicated.
2198 HD43 A text may, of course, fall into more than one category, in which case more than one identifier may be supplied as the value for the
2205 HD43 attribute may be supplied to specify the taxonomy to which the categories identified by the target attribute belong, if this is not adequately conveyed by the resource pointed to. For example,
2207 HD43 Here the same text has been classified as of categories
2213 HD43 ), and as of category
2219 HD43 with multiple identifiers in the value of
2223 HD43 elements, each with a single identifier in the value of
2225 HD43 . However, note that maintenance of a TEI document with a large number of values within a single
2233 HD43 elements is that the values used as identifying codes are exhaustively enumerated for the former, typically within the TEI header. In the latter case, however, the values use any externally-defined scheme, and therefore may be taken from a more open-ended descriptive classification system.
2240 HD4ABS The main purpose of the
2242 HD4ABS element is to supply a brief resume or abstract for an article which was originally published without such a component. An abstract or summary forming part of the document at its creation should usually appear in the front matter (
2265 HD4ABS The same element may be used to provide other summary information supplied by the encoder, perhaps grouped together into a list of discrete items:
2310 HD44 Each such element contains one or more paragraphs of description for the calendar system concerned, and also supplies an identifying code for it as the value of its
2324 HD44 This identifying code may then be referenced from any element supplying a date expressed using that calendar system:
2348 HD44CD This information is complementary to the detailed descriptions of physical objects (such as letters) associated with correspondence activities, which are typically provided by the sourceDesc element.
2367 HD44CD element is used to group references relevant to the item of correspondence being described, typically to other items such as the item to which it is a reply, or the item which replies to it:
2394 HD44CD to describe the sending of a letter by Adelbert von Chamisso from Vertus on 29 January 1807 to Louis de La Foye at Caen. The date of reception is unknown:
2414 HD44CD to provide a normalized form of the date. The content of the
2416 HD44CD element may also be omitted, since no underlying source is being transcribed.
2420 HD44CD if the action is considered to apply to them all acting as a single group. In the following example two people are considered to have received the communication.
2459 HD44CD The same person may be associated with many actions. For example, it will often be the case that the author and sender of a message are identical, and that many individual letters will need to be associated with the same person. The
2462 HD44CD may be used to indicate that the same name applies to many actions. Its value will usually be the identifier of an element defining the person or name concerned, which is supplied elsewhere in the document.
2470 HD44CD It is assumed that each correspondence action applies to a single act of communication. It may however be the case that the same physical object is involved in several such acts, if for example person A sends a letter to person B, who then annotates it and sends it on to person C, or if persons A and B both use the same document to convey quite different messages. In such situations, multiple
2472 HD44CD elements should be supplied, one for each communication. In the following example, the same document contains distinct messages, sent by two different people to the same destination:
2520 HD6 The final sub-element of the TEI header, the
2522 HD6 element, provides a detailed change log in which each change made to a text may be recorded. Its use is optional but highly recommended. It provides essential information for the administration of large numbers of files which are being updated, corrected, or otherwise modified as well as extremely useful documentation for files being passed from researcher to researcher or system to system. Without change logs, it is easy to confuse different versions of a file, or to remain unaware of small but important changes made in the file by some earlier link in the chain of distribution. No significant change should be made in any TEI-conformant file without corresponding entries being made in the change log.
2529 HD6 The main purpose of the revision description is to record changes in the text to which a header is prefixed. However, it is recommended TEI practice to include entries also for significant changes in the header itself (other than the revision description itself, of course). At the very least, an entry should be supplied indicating the date of creation of the header.
2531 HD6 The log consists of a list of entries, one for each change. Changes may be grouped and organized using either the
2537 HD6 . Alternatively, a simple sequence of
2543 HD6 may be supplied for each
2545 HD6 element to indicate its date and the person responsible for it respectively. The description of the change itself can range from a simple phrase to a series of paragraphs. If a number is to be associated with one or more changes (for example, a revision number), the global
2628 HD7 The TEI header allows for the provision of a very large amount of information concerning the text itself, its source, its encodings, and revisions of it, as well as a wealth of descriptive information such as the languages it uses and the situation(s) in which it was produced, together with the setting and identity of participants within it. This diversity and richness reflects the diversity of uses to which it is envisaged that electronic texts conforming to these Guidelines will be put. It is emphatically
2630 HD7 intended that all of the elements described above should be present in every TEI Header.
2632 HD7 The amount of encoding in a header will depend both on the nature and the intended use of the text. At one extreme, an encoder may expect that the header will be needed only to provide a bibliographic identification of the text adequate to local needs. At the other, wishing to ensure that their texts can be used for the widest range of applications, encoders will want to document as explicitly as possible both bibliographic and descriptive information, in such a way that no prior or ancillary knowledge about the text is needed in order to process it. The header in such a case will be very full, approximating to the kind of documentation often supplied in the form of a manual. Most texts will lie somewhere between these extremes; textual corpora in particular will tend more to the latter extreme. In the remainder of this section we demonstrate first the minimal, and next a commonly recommended, level of encoding for the bibliographic information held by the TEI header.
2634 HD7 Supplying only the minimal level of encoding required, the TEI header of a single text might look like the following example:
2656 HD7 The only mandatory component of the TEI header is the
2664 HD7 are all required constituents. Within the title statement, a title is required, and an author should be specified, even if it is
2666 HD7 , as should some additional statement of responsibility, here given by the
2670 HD7 , a publisher, distributor, or other agency responsible for the file must be specified. Finally, the source description should contain at the least a loosely structured bibliographic citation identifying the source of the electronic text if (as is usually the case) there is one.
2672 HD7 We now present the same example header, expanded to include additionally recommended information, adequate to most bibliographic purposes, in particular to allow for the creation of an
2674 HD7 -conformant bibliographic record. We have also added information about the encoding principles used in this (imaginary) encoding, about the text itself (in the form of Library of Congress subject headings), and about the revision of the file.
2848 HD7 Many other examples of recommended usage for the elements discussed in this chapter are provided here, in the reference index and in the associated tutorials.
2852 HD8 A strong motivation in preparing the material in this chapter was to provide in the TEI header a viable chief source of information for cataloguing computer files. The TEI header is not a library catalogue record, and so will not make all of the distinctions essential in standard library work. It also includes much information generally excluded from standard bibliographic descriptions. It is the intention of the developers, however, to ensure that the information required for a catalogue record be retrievable from the TEI file header, and moreover that the mapping from the one to the other be as simple and straightforward as possible. Where the correspondence is not obvious, it may prove useful to consult one of the works which were influential in developing the content of the TEI header. These include:
2856 HD8 is an international standard setting out what information should be recorded in a description of a bibliographical item. Until a consolidated edition was published in 2011, there was a general standard called ISBD(G) and separate ISBDs covering different types of material, e.g. ISBD(M) for monographs, ISBD(ER) for electronic resources. These separate ISBDs follow the same general scheme as the main ISBD(G), but provide appropriate interpretations for the specific materials under consideration.
2862 HD8 were published in 1978, with revisions appearing periodically through 2005. AACR2 provides guidelines for the construction of catalogues in general libraries in the English-speaking world. AACR2 is explicitly based on the general framework of the ISBD(G) and the subsidiary ISBDs: it gives a description of how to describe bibliographic items and how to create access points such as subject or name headings and uniform titles. Other national cataloguing codes exist as well, including the Z44 series of standards issued by the Association française de normalisation (AFNOR),
2865 HD8 Regole italiane di catalogazione per autore
2876 HD8 Since the TEI file description elements are based on the ISBD areas, it should be possible to use the content of file description as the basis for a catalogue record for a TEI document. However, cataloguers should be aware that the permissive nature of the TEI Guidelines may lead to divergences between practice in using the TEI file description and the comparatively strict recommendations of AACR2 and other national cataloguing codes. Such divergences as the following may preclude automatic generation of catalogue records from TEI headers:
2878 HD8 The TEI Guidelines do not require that text be transcribed from the
2879 HD8 chief source of information
2880 HD8 using normalized capitalization and punctuation
2883 HD8 The TEI title statement may not categorize constituent titles in the same way as prescribed by a national cataloguing code.
2885 HD8 The TEI title statement contains authors, editors, and other responsible parties in separate elements, with names which may not have been normalized; it does not necessarily contain a single statement of responsibility
2888 HD8 There is no specific place in a TEI header to specify the
2889 HD8 main entry
2893 HD8 name or title headings under which a catalogue record is filed
2896 HD8 The TEI header does not require use of a particular vocabulary for subject headings nor require the use of subject headings.
2900 HD The TEI Header Module
2904 header The TEI Header
2913 HD The selection and combination of modules to form a TEI schema is described in

TD-DocumentationElements.xml#13168

# id text
4 TD This chapter describes a module which may be used for the documentation of the XML elements and element classes which make up any markup scheme, in particular that described by the TEI Guidelines, and also for the automatic generation of schemas or DTDs conforming to that documentation. It should be used also by those wishing to customize or modify these Guidelines in a conformant manner, as further described in chapters
6 TD and may also be useful in the documentation of any other comparable encoding scheme, even though it contains some aspects which are specific to the TEI and may not be generally applicable.
13 TD , and was the name invented by the original TEI Editors for the predecessor of the system currently used for this purpose. See further
16 TD Like any other piece of XML software, an ODD processor may be instantiated in many ways: the current system uses a number of XSLT stylesheets which are freely available from the TEI, but this specification makes no particular assumptions about the tools which will be used to provide an ODD processing environment.
18 TD As the name suggests, an ODD processor uses a single XML document to generate multiple outputs. These outputs will include:
23 TD detailed descriptive documentation, embedding some parts of the formal reference documentation, such as the tag description lists provided in this and other chapters of these Guidelines;
25 TD declarative code for one or more XML schema languages, such as RELAX NG, W3C Schema, ISO Schematron, or DTD.
30 TD The input required to generate these outputs consists of running prose, and special purpose elements documenting the components (elements, classes, etc.) which are to be declared in the chosen schema language. All of this input is encoded in XML using elements defined in this chapter. In order to support more than one schema language, these elements constitute a comparatively high-level model which can then be mapped by an ODD processor to the specific constructs appropriate for the schema language in use. Although some modern schema languages such as RELAX NG or W3C Schema natively support self-documentary features of this kind, we have chosen to retain the ODD model, if only for reasons of compatibility with earlier versions of these Guidelines. For reasons of backwards compatibility, the ISO standard XML schema language RELAX NG (
31 TD ) may be used as a means of declaring content models and datatypes, but it is also possible to express content models using native TEI XML constructs. We also use the ISO Schematron language to define additional constraints beyond those expressed in the content model, as further discussed in
34 TD In the TEI system, a
38 TD and has an identifier unique across the whole TEI scheme. For convenience, these specifications are grouped into a number of discrete
40 TD , which can also be combined more or less as required. Each major chapter of these Guidelines defines a distinct module. Each module declares a number of
43 TD classes
44 TD . All classes are available globally, irrespective of the module in which they are declared; particular modules extend the meaning of a class by adding elements or attributes to it. Wherever possible, element content models are defined in terms of classes rather than in terms of specific elements. Modules can also declare particular
46 TD , which act as short-cuts for commonly used content models or class references.
48 TD In the present chapter, we discuss the components needed to support this system. In addition, section
49 TD discusses some general purpose elements which may be useful in any kind of technical documentation, wherever there is need to talk about technical features of an XML encoding such as element names and attributes. Section
54 TD provides a summary overview of the elements provided by this module.
62 TDphraseTE In any kind of technical documentation, the following phrase-level elements may be found useful for marking up strings of text which need to be distinguished from the running text because they come from some formal language:
66 TDphraseTE Like other phrase-level elements used to indicate the semantics of a typographically distinct string, these are members of the
68 TDphraseTE class. They are available anywhere that running prose is permitted when the module defined by this chapter is included in a schema.
74 TDphraseTE elements are intended for use when citing brief passages in some formal language such as a programming language, as in the following example:
91 TDphraseTE A further group of similar phrase-level elements is also defined for the special case of representing parts of an XML document:
101 TDphraseTE . They are also available anywhere that running prose is permitted when the module defined by this chapter is included in a schema.
103 TDphraseTE As an example of the recommended use of these elements, we quote from an imaginary TEI working paper:
131 TDphraseTE element may be used to enclose any kind of example, which will typically be rendered as a distinct block, possibly using particular formatting conventions, when the document is processed. It is a specialized form of the more general
133 TDphraseTE element provided by the TEI core module. In documents containing examples of XML markup, the
136 TDphraseTE , since the content of this element can be checked for well-formedness.
140 TDphraseTE when this module is included in a schema. That class is a part of the general
152 TDphraseEA Within the body of a document using this module, the following elements may be used to reference parts of the specification elements discussed in section
159 TDphraseEA TEI practice recommends that a
161 TDphraseEA listing the elements under discussion introduce each subsection of a module's documentation. The source for the present section, for example, begins as follows:
178 TDphraseEA element in this example, an ODD processor might simply generate the section number and title of the section referred to, perhaps additionally inserting a link to the section. In a similar way, when processing the
184 TDphraseEA in this case) from their associated declaration elements: typically, the details recovered will include a brief description of the element and its attributes. These, and other data, will be stored in a specification element elsewhere within the current document, or they may be supplied by the ODD processor in some other way, for example from a database. For this reason, the link to the required specification element is always made using a TEI-defined key rather than an XML IDREF value. The ODD processor uses this key as a means of accessing the specification element required. There is no requirement that this be performed using the XML ID/IDREF mechanism, but there is an assumption that the identifier be unique.
213 TDmodules As mentioned above, the primary purpose of this module is to facilitate the documentation and creation of an XML schema derived from the TEI Guidelines. The following elements are provided for this purpose:
217 TDmodules is a convenient way of grouping together element and other declarations, and of associating an externally-visible name with the resulting group. A
218 TDmodules specification group
219 TDmodules performs essentially the same function, but the resulting group is not accessible outside the scope of the ODD document in which it is defined, whereas a module can be accessed by name from any TEI schema specification. Elements, and their attributes, element classes, and patterns are all individually documented using further elements described in section
220 TDmodules below; part of that specification includes the name of the module to which the component belongs.
224 TDmodules element found. For example, the chapter documenting the TEI module for names and dates contains a module specification like the following:
241 TDmodules attribute, the value of which is
242 TDmodules namesdates
245 TDmodules element above can thus generate a schema fragment for the TEI
249 TDmodules In most realistic applications, it will be desirable to combine more than one module together to form a complete
251 TDmodules . A schema consists of references to one or more modules or specification groups, and may also contain explicit declarations or redeclarations of elements (see further
253 TDmodules The distinction between base and additional tagsets in earlier versions of the TEI scheme has not been carried forward into P5.
256 TDmodules A schema can combine references to TEI modules with references to other (non-TEI) modules using different namespaces, for example to include mathematical markup expressed using MathML in a TEI document. By default, the effect of combining modules is to allow all of the components declared by the constituent modules to coexist (where this is syntactically possible; where it is not, for example because of name clashes, a schema cannot be generated). It is also possible to override declarations contained by a module, as further discussed in section
264 TDmodules attribute, and may then be referenced from any point in an ODD document using the
266 TDmodules element. This is useful if, for example, it is desired to describe particular groups of elements in a specific sequence. Note however that the order in which element declarations appear within the schema code generated from an ODD file is not in general affected by the order of declarations within a
270 TDmodules An ODD processor will generate a piece of schema code corresponding with the declarations contained by a
272 TDmodules element in the documentation being output, and a cross-reference to such a piece of schema code when processing a
274 TDmodules . For example, if the input text reads
285 TDmodules then the output documentation will replace the two
287 TDmodules elements above with a representation of the schema code declaring the elements
297 TDmodules respectively. Similarly, if the input text contains elsewhere a passage such as
304 TDmodules then the
306 TDmodules elements may be replaced by an appropriate piece of reference text such as
331 TDcrystals Unlike most elements in the TEI scheme, each of these
333 TDcrystals has a fairly rigid internal structure consisting of a large number of child elements which are always presented in the same order.
334 TDcrystals Furthermore, since these elements all describe markup objects in broadly similar ways, they have several child elements in common. In the remainder of this chapter, we discuss first the elements which are common to all the specification elements, and then those which are specific to a particular type.
338 TDcrystals element, but the specification element for any particular component may only appear once (except in the case where a modification is being defined; see further
339 TDcrystals ). The order in which they appear will not affect the order in which they are presented within any schema module generated from the document. In documentation mode, however, an ODD processor will output the schema declarations corresponding with a specification element at the point in the text where they are encountered, provided that they are contained by a
342 TDcrystals as discussed in the previous section. An ODD processor will also associate all declarations found with the nominated module, thus including them within the schema code generated for that module, and it will also generate a full reference description for the object concerned in a catalogue of markup objects. These latter two actions always occur irrespective of whether or not the declaration is included in a
355 TDcrystalsCE This section discusses the child elements common to all of the specification elements; some of these are defined in the core module (
373 TDcrystalsCEdc element may be used to provide a brief explanation for the name of the object if this is not self-explanatory. For example, the specification for the element
375 TDcrystalsCEdc used to mark arbitrary blocks of text begins as follows:
382 TDcrystalsCEdc may also be supplied for an attribute name or an attribute value in similar circumstances:
400 TDcrystalsCEdc element is needed to explain the significance of the identifier for an item only when this is not apparent, for example because it is abbreviated, as in the above example. It should not be used to provide a full description of the intended meaning (this is the function of the
402 TDcrystalsCEdc element), nor to comment on equivalent values in other schemes (this is the purpose of the
406 TDcrystalsCEdc attribute value in other languages (this is the purpose of the
412 TDcrystalsCEdc element provide a brief characterization of the intended function of the object being documented in a form that permits its quotation out of context, as in the following example:
428 TDcrystalsCEdc Where specifications are supplied in multiple languages, the elements
432 TDcrystalsCEdc may be repeated as often as needed. Each such description or gloss should carry both an
436 TDcrystalsCEdc attribute to indicate the language used and the date on which the translated text was last checked against its source.
442 TDcrystalsCEdc attribute is used to supply a pointer to some location where such external concepts are defined. For example, to indicate that the TEI
444 TDcrystalsCEdc element corresponds to the concept defined by the CIDOC CRM category E69, the declaration for the former might begin as follows:
458 TDcrystalsCEdc attributes to point to an implementation of the mapping. This is useful when a TEI customization (see
461 TDcrystalsCEdc for convenience of data entry or markup readability. For example, suppose that in some TEI customization an element
464 TDcrystalsCEdc hi rend='bold'
467 TDcrystalsCEdc element can be converted to canonical TEI by obtaining a filter from the URI specified, and running the procedure with the name
471 TDcrystalsCEdc attribute specifies the language (in this case XSL) in which the filter is written:
484 TDcrystalsCEdc element is used to provide an alternative name for an object, for example using a different natural language. Thus, the following might be used to indicate that the
496 TDcrystalsCEdc may also be referred to using the alternate identifier
512 TDcrystalsCEdc of a component is identical to the value of its
518 TDcrystalsCEdc element contains any additional commentary about how the item concerned may be used, details of implementation-related issues, suggestions for other ways of treating related information etc., as in the following example:
534 TDcrystalsCEdc A specification element will usually conclude with a list of references, each tagged using the standard
538 TDcrystalsCEdc element: in the case of the
540 TDcrystalsCEdc element discussed above, the list is as follows:
545 TDcrystalsCEdc where the value
570 TDeg attribute may be used on either element to indicate the source from which an example is taken, typically by means of a pointer to an entry in an associated bibliography, as in the following example:
576 TDeg element should be used. In such a case, it will clearly be necessary to distinguish the markup within the example from the markup of the document itself. In an XML environment, this is easily done by using a different name space for the content of the
592 TDeg If the XML contained in an example is not well-formed then it must either be enclosed in a CDATA marked section, or
606 TDeg element should not be used to tag non-XML examples: the general purpose
616 TDcrystalsCEcl In the TEI scheme elements are assigned to one or more
617 TDcrystalsCEcl classes
630 TDcrystalsCEcl element. It specifies the classes of which the element or class concerned is a member by means of one or more
679 DEFCON may have three different kinds of content. It may express a content model directly using the TEI elements discussed in the remainder of this section. Alternatively, it may use a schema language of some kind, as defined by a pattern called
680 DEFCON macro.schemaPattern
682 DEFCON below. As a third possibility, the legal content for an element may be exhaustively specified using the
687 DEFCON The following elements are used to define a content model:
707 DEFCON provides the name of an element which may appear at a certain point in a content model. A
709 DEFCON provides the name of a class, members of which may appear at a certain point in a content model. A
711 DEFCON provides the name of a predefined macro, the expansion of which may be inserted at a certain point in a content model.
718 DEFCON Finally, two wrapper elements are provided to indicate whether the components of a content model form a sequence or an alternation:
731 DEFCON This is the content model for the macro
733 DEFCON , which is defined as containing any number (including zero) of elements from the
745 DEFCON This is the content model for the
747 DEFCON element, which is defined as a sequence of components, firstly a mandatory
749 DEFCON , followed by any number (including zero) of elements from the
759 TDTAGCONT Alternatively, element content models may be defined using RELAX NG patterns, or by expressions in some other schema language, depending on the value of the
760 TDTAGCONT macro.schemaPattern
769 TDTAGCONT element appears will have a content model which is expressed in RELAX NG as
770 TDTAGCONT text
771 TDTAGCONT , using the RELAX NG namespace. This model will be copied unchanged to the output when RELAX NG schemas are being generated. When an XML DTD is being generated, an equivalent declaration (in this case
787 TDTAGCONT This is the content model for the
793 TDTAGCONT The RELAX NG language does not formally distinguish element names, attribute names, class names, or macro names: all names are patterns which are handled in the same way, as the above example shows. Within the TEI scheme, however, different naming conventions are used to distinguish amongst the objects being named. Unqualified names (
794 TDTAGCONT fileDesc
796 TDTAGCONT revisionDesc
805 TDTAGCONT ) are always class names. In DTD language, classes are represented by parameter entities (
810 TDTAGCONT The RELAX NG pattern names generated by an ODD processor by default include a special prefix, the default value for which is set using the
815 TDTAGCONT The purpose of this is to ensure that the pattern name generated is uniquely identified as belonging to a particular schema, and thus avoid name clashes. For example, in a RELAX NG schema combining the TEI element
822 TDTAGCONT ident
823 TDTAGCONT . Most of the time, this behaviour is entirely transparent to the user; the one occasion when it is not will be where a content model (expressed using RELAX NG syntax) needs explicitly to reference either the TEI
829 TDTAGCONT may be used. For example, suppose that we wish to define a content model for
831 TDTAGCONT which permits either a TEI
835 TDTAGCONT defined by some other vocabulary. A suitable content model would be generated from the following
850 TDTAGCONS element, a set of general
854 TDTAGCONS attribute) in order that a TEI customization may override, delete or change them individually. Each
863 TDTAGCONS assertion language
864 TDTAGCONS , together with a RELAX NG schema to validate it. The Schematron assertion language provides a powerful way of expressing constraints on the content of any XML document in addition to those provided by other schema languages. Such constraints can be embedded within a TEI schema specification using the methods exemplified in this chapter. An ODD processor will typically process any
866 TDTAGCONS elements in a TEI specification whose
870 TDTAGCONS The TEI Guidelines include some additional constraints which are expressed using the ISO Schematron language. A conformant TEI document should respect these constraints, although automatic validation of them may not be possible for all processors. A TEI customization may likewise specify additional constraints using this mechanism. Some examples of what is possible using the Schematron language are given below.
872 TDTAGCONS Constraints are generally used to model local rules which may be outside the scope of the target schema language. For example, in earlier versions of these Guidelines several constraints on the usage of the attributes of the TEI element
881 TDTAGCONS may be supplied only if the attribute
884 TDTAGCONS . Few schema languages support co-occurrence constraints such as the latter. In the current version of the Guidelines, constraint specifications expressed as Schematron rules have been added, as follows:
906 TDTAGCONS The constraints in the preceding example all relate to attributes in the empty namespace, and the Schematron rules therefore did not need to define a TEI namespace prefix. The Schematron language
908 TDTAGCONS element should be used to do this when a constraint needs to refer to a TEI element, as in the following example, which models the constraint that a TEI
921 TDTAGCONS Schematron rules are also useful where an application needs to enforce rules on attribute values, as in the following examples which check that various types of
939 TDTAGCONS As a further example, Schematron may be used to enforce rules applicable to a TEI document which is going to be rendered into accessible HTML, for example to check that some sort of content is available from which the
956 TDTAGCONS Schematron rules can also be used to enforce other HTML accessibility rules about tables; note here the use of a report and an assertion within one pattern:
973 TDTAGCONS Constraints can be expressed using any convenient language. The following example uses a pattern matching language called SPITBOL to express the requirement that title and author should be different. Implementing private schemes of this kind will, however, generally be more problematic than simply adopting a widely-deployed system such as ISO Schematron.
988 TDATT element is used to document information about a collection of attributes, either within an
992 TDATT . An attribute list can be organized either as a group of attribute definitions, all of which are understood to be available, or as a choice of attribute definitions, of which only one is understood to be available. An attribute list may thus contain nested attribute lists.
998 TDATT elements are all to be made available, or whether only one of them may be used. For example, the attribute list for the element
1000 TDATT contains a nested attribute list to indicate that either the
1020 TDATT element is used to document a single attribute, using an appropriate selection from the common elements already mentioned and the following which are specific to attributes:
1034 TDATT is used to specify only the attributes which are specific to that particular element. Instances of the element may carry other attributes which are declared by the classes of which the element is a member. These extra attributes, which are shared by other elements, or by all elements, are specified by an
1046 TD-datatypes element is used to state what kind of value an attribute may have. The TEI defines a number of datatype macros, each with an identifier beginning
1048 TD-datatypes , which are used in preference to the datatypes available natively from the target schema, since the facilities provided by different schema languages vary so widely. The available TEI datatypes are described in section
1051 TD-datatypes A TEI schema specification using RELAX NG may choose to define datatypes directly using RELAX NG syntax, for example
1054 TD-datatypes permits any string of Unicode characters not containing markup, and is thus the equivalent of
1058 TD-datatypes The RELAX NG language also provides support for a number of more complex cases such as choices or lists.
1059 TD-datatypes Such usages are permitted by the scheme documented here, but are not recommended when it is desired to remain independent of a particular schema language, since the full generality of one schema language cannot readily be converted to that of another. In the TEI abstract model, datatyping should preferably be carried out either by explicit enumeration of permitted values (using the TEI-specific
1061 TD-datatypes element described below), by reference to an existing datatype macro, or by definition of a new datatype, using the
1070 TD-datatypes are provided for the case where an attribute may take more than one value of the type specified. The
1083 TD-datatypes attribute may take any number of values, each being of the type defined by the TEI
1085 TD-datatypes macro. As is usual in XML, multiple values for a single attribute are separated by one or more white space characters. Hence, values such as
1098 TDATTvs element may be used to describe constraints on data content in an informal way: for example
1115 TDATTvs must take positive integer values less than 150, the datatype
1155 TDATTvs Where all the possible values for an attribute can be enumerated, the datatype
1173 TDATTvs element here to explain the otherwise less than obvious meaning of the codes used for these values. Since this value list specifies that it is of type
1181 TDATTvs attribute will have the value
1212 TDATTvs The datatype will be
1220 TDATTvs element) to put constraints on the permitted content of an element, as noted at
1221 TDATTvs . This use is not however supported by all schema languages, and is therefore not recommended if support for non-RELAX NG systems is a consideration.
1246 TDCLA A model class specification does not list all of its members. Instead, its members declare that they belong to it by means of a
1252 TDCLA element for each class of which the relevant element is a member, supplying the name of the relevant class. For example, the
1280 TDCLA The function of a model class declaration is to provide another way of referring to a group of elements. It does not confer any other properties on the elements which constitute its membership.
1288 TDCLA classes. In the case of attribute classes, the attributes provided by membership in the class are documented by an
1292 TDCLA . In the case of model classes, no further information is needed to define the class beyond its description, its identifier, and optionally any classes of which it is a member.
1294 TDCLA When a model class is referenced in the content model of an element (i.e. in the
1298 TDCLA ), its meaning will depend on the name used to reference the class.
1300 TDCLA If the reference simply takes the form of the class name, it is interpreted to mean an alternated list of all the current members of the class. For example, suppose that the members of the class
1308 TDCLA . Then a content model such as
1312 TDCLA would be equivalent to the explicit content model:
1322 TDCLA ). However, a content model referencing the class as
1324 TDCLA would be equivalent to the following explicit content model:
1334 TDCLA The following suffixes, appended with an underscore, can be given to a class name when it is referenced in a content model:
1340 TDCLA sequence
1342 TDCLA members of the class are to be provided in sequence
1354 TDCLA members of the class must be provided one or more times, in sequence
1360 TDCLA in a content model would be equivalent to:
1384 TDCLA sequence
1385 TDCLA in which members of a class appear in a content model when one of the sequence options is used is that in which the elements are declared.
1391 TDCLA attribute, which can be used to say that this particular model may only be referenced in a content model with the suffixes it specifies. For example, if the
1395 TDCLA took the form
1396 TDCLA classSpec ident="model.hiLike" generate="sequence sequenceOptional"
1397 TDCLA then a content model referring to (say)
1411 TDCLA defines a small set of attributes common to all elements which are members of that class: those attributes are listed by the
1423 TDCLA , to which some modules contribute additional attributes when they are included in a schema.
1453 TDENT element may be used to select a specific named pattern from those available. Patterns are used as a shorthand chiefly to describe common content models and datatypes, but may be used for any purpose. The following elements are used to represent patterns:
1488 TDbuild specification elements also have an attribute which determines the namespace to which the object being created will belong. In the case of
1490 TDbuild , this namespace is inherited by all the elements created in the schema, unless they have their own
1496 TDbuild These attributes are used by an ODD processor to determine how declarations are to be combined to form a schema or DTD, as further discussed in this section.
1498 TDbuild As noted above, a TEI schema is defined by a
1500 TDbuild element containing an arbitrary mixture of explicit declarations for objects (i.e. elements, classes, patterns, or macro specifications) and references to other objects containing such declarations (i.e. references to specification groups, or to modules). A major purpose of this mechanism is to simplify the process of defining user customizations, by providing a formal method for the user to combine new declarations with existing ones, or to modify particular parts of existing declarations.
1506 TDbuild An ODD processor, given such a document, should combine the declarations which belong to the named modules, and deliver the result as a schema of the requested type. It may also generate documentation for the elements declared by those modules. No source is specified for the modules, and the schema will therefore combine the declarations found in the most recent release version of the TEI Guidelines known to the ODD processor in use.
1508 TDbuild The value specified for the
1510 TDbuild attribute, when it is supplied as a URL, specifies any convenient location from which the relevant ODD files may be obtained. For the current release of the TEI Guidelines, a URL in the form
1516 TDbuild . Alternatively, if the ODD files are locally installed, it may be more convenient to supply a value such as
1520 TDbuild The value for the
1522 TDbuild attribute may be any form of URI. A set of TEI-conformant specifications in a form directly usable by an ODD processor must be available at the location indicated. When no
1524 TDbuild value is supplied, an ODD processor may either raise an error or assume that the location of the current release of the TEI Guidelines is intended.
1526 TDbuild If the source is specified in the form of a private URI, the form recommended is
1530 TDbuild is a prefix indicating the markup language in use, and
1534 TDbuild should be used to reference release 1.2.1 of the current TEI Guidelines. When such a URI is used, it will usually be necessary to translate it before such a file can be used in blind interchange.
1542 TDbuild which allow the encoder to supply explicit lists of elements from the stated module which are to be included or excluded respectively. For example:
1546 TDbuild The schema specified here will include all the elements supplied by the core module except for
1558 TDbuild elements from the linking module.
1567 TDbuild Note that in this last case, there is no need to specify the name of the module in which the two element declarations are to be found; in the TEI scheme, element names are unique across all modules. The module is simply a convenient way of grouping together a number of related declarations.
1578 TDbuild , which is not defined in the TEI scheme, will be added to the output schema. This element will also be added to the existing TEI class
1580 TDbuild , and will thus be available in TEI conformant documents.
1590 TDbuild The effect of this is to redefine the content model for the element
1600 TDbuild which appear both in the original specification and in the new specification supplied above:
1602 TDbuild in this example. Note that if the value for
1610 TDbuild A schema may not contain more than two declarations for any given component. The value of the
1612 TDbuild attribute is used to determine exactly how the second declaration (and its constituents) should be combined with the first. The following table summarizes how a processor should resolve duplicate declarations; the term
1619 TDbuild mode value
1627 TDbuild add
1631 TDbuild add new declaration to schema; process its children in add mode
1635 TDbuild add
1659 TDbuild change
1667 TDbuild change
1671 TDbuild process identifiable children according to their modes; process unidentifiable children in replace mode; retain existing children where no replacement or change is provided
1694 ST-aliens Combining TEI and Non-TEI Modules
1696 ST-aliens In the simplest case, all that is needed to include a non-TEI module in a schema is to reference its RELAX NG source using the
1702 ST-aliens (defining Scalable Vector Graphics) are included. To avoid any risk of name clashes, the schema specifies that all TEI patterns generated should be prefixed by the string "TEI_".
1712 ST-aliens This specification generates a single schema which might be used to validate either a TEI document (with the root element
1714 ST-aliens ), or an SVG document (with a root element
1718 ST-aliens validate a TEI document containing
1722 ST-aliens element must become a member of a TEI model class (
1723 ST-aliens ), so that it may be referenced by other TEI elements. To achieve this, we modify the last
1735 ST-aliens This states that when the declarations from the
1739 ST-aliens in the TEI module should be extended to include the element
1741 ST-aliens as an alternative. This has the effect that elements in the TEI scheme which define their content model in terms of that element class (notably
1743 ST-aliens ) can now include it. A RELAX NG schema generated from such a specification can be used to validate documents in which the TEI
1763 TD-LinkingSchemas This example includes a standard RELAX NG schema, a Schematron schema which might be used for checking that all pointing attributes point at existing targets, and also a link to the TEI ODD file from which the RELAX NG schema was generated. See also
1764 TD-LinkingSchemas for details of another method of linking an ODD specification into your file by including a
1778 tagdocs Documentation of TEI modules
1787 TDformal The selection and combination of modules to form a TEI schema is described in
1808 TDformal ). All of these classes are declared along with the other general TEI classes, in the basic structure module documented in
1815 TDformal macro.schemaPattern
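
The TDCLA rows above (positions 1300 to 1397) describe how a reference to a model class expands in a content model: a bare class name means an alternation of the current members, while a name with an underscore suffix such as "sequence" or "sequenceOptional" means the members in declaration order. The following is only a rough sketch of that rule, not how any ODD processor actually implements it. The function name expand_class_ref, the member names hi, it and bo, and the RELAX NG compact-style output are assumptions made for illustration. The suffixes "sequenceRepeatable" and "sequenceOptionalRepeatable" do not appear in the rows above and are included on the assumption that they match the TEI Guidelines.

```python
# Hypothetical helper: expand a model class reference into an explicit
# content model, written in a RELAX NG compact-like notation.

# Suffix -> occurrence indicator appended to each member.
# Only "sequence" and "sequenceOptional" are named in the rows above;
# the two "Repeatable" variants are assumed from the TEI Guidelines.
SUFFIXES = {
    "sequence": "",                     # each member exactly once, in order
    "sequenceOptional": "?",            # each member at most once, in order
    "sequenceRepeatable": "+",          # each member one or more times, in order
    "sequenceOptionalRepeatable": "*",  # each member any number of times, in order
}


def expand_class_ref(ref, members):
    """Return the explicit content model for a class reference.

    ref     -- class name, optionally followed by "_" and a suffix
    members -- member element names, in the order they are declared
    """
    name, sep, suffix = ref.partition("_")
    if not sep:
        # Bare class name: an alternation of all current members.
        return "(" + " | ".join(members) + ")"
    if suffix not in SUFFIXES:
        raise ValueError("unknown class reference suffix: " + suffix)
    indicator = SUFFIXES[suffix]
    return "(" + ", ".join(m + indicator for m in members) + ")"


if __name__ == "__main__":
    members = ["hi", "it", "bo"]  # illustrative member names only
    print(expand_class_ref("model.hiLike", members))
    print(expand_class_ref("model.hiLike_sequenceOptional", members))
```

A processor honouring the generate restriction quoted at position 1396 could check the requested suffix against the permitted list before calling such a helper.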

BIB-Bibliography_hold_temporarily_until_XXXX_is_deemed_OK.xml#12280

# id text
23 VEMEana-eg-23 Doglia mi reca ne lo core ardire
79 TSSASE-eg-20 Structures of social action: Studies in conversation analysis
343 NDPER-eg-17 membrane 5, entry 154
441 VEST-eg-4 2nd edition
566 DIC-CP Collins Pocket Dictionary of the English language
586 SA-BIBL-2 Orbis Pictus: a facsimile of the first English edition of 1659
603 PHegsurp2 Poeti del Duecento
853 COEDADD-eg-89 The waste land: a facsimile and transcript of the original drafts including the annotations of Ezra Pound
883 DS-eg-05 Is there a text in this class? The authority of interpretive communities
922 FTGRA-eg-18 2nd edition
1006 COHQU-eg-43 Natural language processing in Prolog
1257 DRSTA-eg-40 Everyman's library: the drama
1289 COBICOR-eg-248 ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure
1473 COHQQ-eg-33 note 12
1600 DRPRO-eg-7 epilogue
1634 STGA-eg-9 Crofts American history series
1703 TSBA-eg-19 The approach of the Text Encoding Initiative to the encoding of spoken discourse
1723 MS-eg-001 A summary catalogue of western manuscripts in the Bodleian Library at Oxford which have not hitherto been catalogued ...
1733 MS-eg-001 P5-MS: A general purpose tagset for manuscript description
1762 STGA-eg-10 Crofts American history series
1931 TSSASE-eg-37 Report on the compatibility of J P French's spoken corpus transcription conventions with the TEI guidelines for transcription of spoken texts
1958 GDFT-eg-12 Partial family tree for Bertrand Russell
2322 DSBACK-eg-83 index to vol. 1
2556 WHITMS1 "[I am a curse]" in
2562 WHITMS2 Single leaf of Notes for a poem about night "visions," possibly related to the untitled 1855 poem that Whitman eventually titled "The Sleepers." Fragments of an unidentified newspaper clipping about the Puget Sound area have been pasted to the leaf. The Trent Collection of Walt Whitman Manuscripts, Duke University Rare Book, Manuscript, and Special Collections Library.
3666 BIB Works cited elsewhere in the text of the Guidelines
3752 Burnard1995b The Design of the TEI Encoding Scheme
4361 SG-BIBL-2 Refining our notion of what text really is: the problem of overlapping hierarchies
4630 CO-BIBL-1 An international handbook of the science of language and society
4767 TS-BIBL-3 TEI document TEI AI2 W1
4912 DI-BIBL-3 TEI working paper TEI AIW20
5015 DI-BIBL-6 Principles for Encoding machine readable dictionaries
5069 DI-BIBL-8 Electronic dictionary encoding: customizing the TEI Guidelines
5609 NH-BIBL-7 The layered markup and annotation language
5661 FS-BIBL-01 A rationale for the TEI recommendations for feature-structure markup,
5728 ISO-690 ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure
5740 ISO-12620 ISO 12620:2009: Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources
5750 RICA Istituto Centrale per il Catalogo Unico
5752 RICA Regole italiane di catalogazione per autori
5819 BIB-RDG Reading list
5821 BIB-RDG The following lists of readings in markup theory and the TEI derive from work originally prepared by Susan Schreibman and Kevin Hawkins for the TEI Education Special Interest Group, recoded in TEI P5 by Sabine Krott and Eva Radermacher. They should be regarded only as a snapshot of work in progress, to which further contributions and corrections are welcomed (see further
6297 Burnard1999 Closing plenary address at the XML Europe Conference, Granada, May 1999
6375 Burnard2001a Dalle «Due Culture» Alla Cultura Digitale: La Nascita del Demotico Digitale
6491 Burnard2005b Metadata for corpus work
7448 Pichler1995 Culture and Value: Philosophy and the Cultural Sciences. Beiträge des 18. Internationalen Wittgenstein Symposiums 13–20. August 1995 Kirchberg am Wechsel
7451 Pichler1995 Kirchberg am Wechsel
8364 Unsworthetaleds2004 TEI Consortium
8502 BIB-RDG TEI
8617 BaumanandCatapano1999 TEI and the Encoding of the Physical Structure of Books
8647 Bauman2005 TEI HORSEing Around
8729 Burnard1993 Rolling your own with the TEI
8845 Burnard1997 Prepared for a seminar on Etiquetación y extracción de información de grandes corpus textuales within the Curso Industrias de la Lengua (14–18 de Julio de 1997). Sponsored by the Fundacion Duques de Soria.
8862 BurnardandPopham1999 Putting Our Headers Together: A Report on the TEI Header Meeting 12 September 1997.
8925 Ciottied2005 Il Manuale TEI Lite: Introduzione Alla Codifica Elettronica Dei Testi Letterari
8945 Chang2001 The Implications of TEI
8991 DigitalLibraryFederation1998 TEI and XML in Digital Libraries: Meeting June 30 and July 1, 1998, Library of Congress, Summary/Proceedings
9007 DigitalLibraryFederation2007 TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices
9105 Loiseaunodate Introduction à la TEI
9129 MarkoandKelleher2001 Descriptive Metadata Strategy for TEI Headers: A University of Michigan Library Case Study
9159 Mertz2003 XML Matters: TEI — the Text Encoding Initiative
9273 Rahtz2003 Building TEI DTDs and Schemas on demand
9305 Rahtzetal2004 A unified model for text markup: TEI, Docbook, and beyond
9365 Robinsonnodate Making a Digital Edition with TEI and Anastasia
9383 Seaman1995 The Electronic Text Center Introduction to TEI and Guide to Document Preparation
9403 Simons1999 Using Architectural Forms to Map TEI Data into an Object-Oriented Database
9433 Smith1999 Textual Variation and Version Control in the TEI
9565 Vanhoutte2004 An Introduction to the TEI and the TEI Consortium

VE-Verse.xml#13191

# id text
4 VE This module is intended for use when encoding texts which are entirely or predominantly in verse, and for which the elements for encoding verse structure already provided by the core module are inadequate.
7 VE include elements for the encoding of verse lines and line groups such as stanzas: these are available for any TEI document, irrespective of the module it uses. Like the modules for prose and for drama, the module for verse additionally makes use of the module defined in chapter
16 VE The module for verse extends the facilities provided by these modules in the following ways:
18 VE a special purpose
20 VE element is provided, to allow for segmentation of the verse line (see section
23 VE a set of attributes is provided for the encoding of rhyme scheme and metrical information (see sections
27 VE a special purpose
29 VE element is provided to support simple analysis of rhyming words (see section
36 VEST Like other kinds of text, texts written in verse may be of widely differing lengths and structures. A complete poem, no matter how short, may be treated as a free-standing text, and encoded in the same way as a distinct prose text. A group of poems functioning as a single unit may be encoded either as a
40 VEST , depending on the encoder's view of the text. For further discussion, including an example encoding for a verse anthology, see chapter
90 VEST Often, however, lines are grouped, formally or informally, into stanzas, verse paragraphs, etc. The
92 VEST element defined in the core tag set (in section
124 VEST It may also be used to mark the verse paragraphs into which longer poems are often divided, as in the following example from Samuel Taylor Coleridge's
161 VEST element, where a verse line is broken between two line groups, as discussed in section
166 VEST element is used to mark the highly regular line groups which characterize stanzaic and similar verse forms, as in the following example from Chaucer:
191 VEST elements may be nested hierarchically. For example, one particularly common English stanzaic form consists of a quatrain or sestet followed by a couplet. The
220 VEST attribute to name the type of unit encoded by the
232 VEST attribute is intended solely for conventional names of different classes of text block. For systematic analysis of metrical and rhyme schemes, use the
239 VEST As a further example, consider the Shakespearean sonnet. This may be divided into two parts: a concluding couplet, and a body of twelve lines, itself subdivided into three quatrains:
292 VEST each of which contains a prologue followed by twelve
294 VEST . Each prologue and each canto consists of nine-line
348 VESE It is often convenient for various kinds of analysis to encode subdivisions of verse lines. The general purpose
350 VESE element defined in the tag set for segmentation and alignment (section
355 VESE To use this element together with the module for verse, the module for segmentation and alignment must also be enabled as further described in section
358 VESE In Old and Middle English alliterative verse, individual verse lines are typically split into half lines. The
385 VESE element, down to whatever level of detailed structure is required. In the following example, the line has been divided into
392 VESE attribute) this example will still require additional processing, since whitespace should be retained for the lower level
395 VESE syll
426 VESE element may be used to identify any subcomponent of a line which has content; its type attribute may characterize such units in any way appropriate to the needs of the encoder. For the specific case of labeling each foot with its formal type (
447 VESE ). If both kinds of segmentation are required, the
491 VESE element, it might be simpler just to mark the point at which the caesura occurs. An additional element is provided for analyses of this kind, in which what is to be marked are points
493 VESE , which have some significance within a verse line:
497 VESE caesura
500 VESE , which occurs on a foot boundary (not to be confused with the division of a diphthong into two syllables, or the diacritic symbol used to indicate such division, each of which is also termed
502 VESE ). This distinction is rarely made nowadays, the term
503 VESE caesura
510 VESE element, we refer again to the example from Langland. An encoder might choose simply to record the location of the caesura within each line, rather than encoding each half-line as a segment in its own right, as follows:
524 VESE Logically, the opposite of caesura might be considered to be
528 VESE module is included in a schema, an additional class called
537 VESE elements and the syntactic structure of the verse (a discrepancy of some significance in some schools of verse):
552 VESA It is possible that certain textual structures may span multiple lines of verse, either by incorporating more than one, or by crossing line hierarchy. This is common, for example, when lines contain reported thought or speech (i.e.
554 VESA ), or other forms of quotation (i.e.
606 VEME When the module for verse is in use, the following additional attributes are available to record information about rhyme and metrical form:
617 VEME , etc. In general, the attributes should be specified at the highest level possible; they may not however be specifiable at the highest level if some of the subdivisions of a text are in prose and others in verse. All these attributes may also be attached to the
621 VEME elements, but the default notation for the
623 VEME attribute has no defined meaning when specified on
627 VEME . The value for these attributes may take any form desired by the encoder, but the nature of the notation used will determine how well the attribute values can be processed by automatic means.
631 VEME attribute, as further discussed below. A simple mechanism is also provided for recording the actual realization of a rhyme pattern; see
662 VEMEsamp This text is written entirely in
664 VEMEsamp ; each line is an iambic pentameter (which, using a common notation, can be described with the formula
674 VEMEsamp a line-end), and the couplets rhyme (which can be represented with the conventional formula
678 VEMEsamp Because both rhyme pattern and metrical form are consistent throughout the poem, they may be conveniently specified on the
690 VEMEsamp attributes is user-defined, no binding description can be given of its details or of how its interpretation must proceed. (A default notation is provided for the
693 VEMEsamp .) It is expected, however, that software should be able to support these attributes in useful ways; the more intelligent the software is, and the more knowledge of metrics is built into it, the better it will be able to support these attributes. In the extract given above, for example, the
703 VEMEsamp value specifies the metrical form of a single verse line, the structure of the
705 VEMEsamp as a whole is understood to involve as many repetitions of the pattern as there are lines in the verse paragraph. The same attribute value, when inherited in turn by the
709 VEMEsamp to repeat. With sufficiently sophisticated software, segments within the line might even be understood as inheriting precisely that portion of the formula which applies to the segment in question; this will, however, be easier to accomplish for some languages than for others.
713 VEMEsamp attribute in this example uses the default notation to specify a rhyme scheme applicable only to pairs of lines. As elsewhere, the default notation for the
715 VEMEsamp attribute has no meaning for metrical units at the line level or below. In verse forms where line-internal rhyme is structurally significant, e.g. in some skaldic poetry, the default notation is incapable of expressing the required information, since the rhyme pattern may need to be specified for units smaller than the line. In such cases, a user-specified rhyme notation must be substituted for the default notation, or else the rhyme pattern must be described using some alternative method (e.g. by using the
723 VEMEsamp attribute, when user-specified notations are used.
725 VEMEsamp A formal definition of the significance of each component of the pattern given as the value of the
731 VEMEsamp element in the TEI header (see section
732 VEMEsamp ). The encoder is free to invent any notation appropriate to his or her analytic needs, provided that it is adequately documented in this element. The notation may define metrical components using invented or traditional names (such as
746 VEMEsamp attribute has the same value as the
748 VEMEsamp attribute on the same element; it is only necessary to provide an explicit value when the realization differs in some way from the abstract metrical pattern. The tension between conventional metrical pattern and its realization may thus be recorded explicitly. For example, many readers of the above passage would stress the word
750 VEMEsamp at the beginning of the third line rather than the word
757 VEMEsamp attribute is used to over-ride the default or conventional metrical pattern, it applies only to the element on which it is specified. The default pattern for any subsequent lines is unaffected.
770 VEMEsamp attribute, the encoder is required to determine whether the change is a systematic or conventional one (as in this example) or an occasional variation, perhaps for local effect. In the following example, from Goethe's
811 VEMElevels The examples given so far have encoded information about the realization of metrical conventions at the level of the whole verse-line. This has obvious advantages of simplicity, but the disadvantage that any deviation from metrical convention is not marked at its precise point of occurrence in the text. Greater precision may be achieved, but only at the cost of marking deviant metrical units explicitly. This may be done with the
813 VEMElevels element, giving the variant realization as the value of the
827 VEMElevels The marking of the foot boundaries with the symbol
831 VEMElevels attribute value of the
833 VEMElevels element allows the human reader, or a sufficiently intelligent software program, to isolate the correct portion of that attribute value as the default value for the same attribute on the
841 VEMElevels here, and whether or not also to tag the feet in the line in which there is no deviation from the metrical convention. The ability of software to infer which foot is being marked, if not all are tagged, will depend heavily on the language of the text and the knowledge of prosody built into the software; the fuller and more explicit the markup, the easier it will be for software to handle it. It may prove useful, however, to mark metrical deviations in the manner shown, even if the available software is not sufficiently intelligent to scan lines without aid from the markup. Human readers who are interested in prosody may well be able to exploit the markup in useful ways even with less sophisticated software.
847 VEMElevels . If we wish to identify the exact location of the different types of foot in the first line of Virgil's
849 VEMElevels , the text could be encoded as follows (for simplicity's sake the caesura has been omitted):
862 VEMElevels An appropriate value of the
864 VEMElevels attribute might also be supplied on the enclosing
868 VEMElevels at the level of the foot may be considered a series of local variations on this fundamental pattern; in cases like this, of course, the local variations may also be considered aspects of realization rather than of convention, in which case the
872 VEMElevels , if desired.
878 VEMEana The method described above may be used to encode quite complex verse forms, for instance various kinds of fixed-form stanzas. Let us take one of Dante's canzoni, in which each stanza except the last has the same combination of eleven-syllable and seven-syllable lines, and the same rhyme scheme:
894 VEMEana attribute specifies a rhyme scheme for each stanza, in the same way.
898 VEMEana represents a line containing nine syllables which may or may not be metrically prominent, a tenth which is prominent and an optional non-prominent eleventh syllable. The letter
900 VEMEana is used to represent a line containing five syllables which may or may not be metrically prominent, a sixth which is prominent and an optional non-prominent seventh syllable. A suitable definition for this notation might be given by a
928 VEMEana attribute on the eighth stanza itself, which will override the default value inherited from parent
949 VEMEana . Moreover, although it is quite regular (in the sense that the last stanza of each
962 VERH attribute is used to specify the rhyme pattern of a verse form. It should not be confused with the
974 VERH element in the TEI header. Unlike
978 VERH attribute has a default notation; if this default notation is used, no
982 VERH The default notation for rhyme offers the ability to record patterns of rhyming lines, using the traditional notation in which distinct letters stand for rhyming lines. For a work in rhyming couplets, like the Pope example above, the
986 VERH , indicating that pairs of adjacent lines rhyme with each other. For a slightly more complex scheme, applicable to groups of four lines, in which lines 1 and 3 rhyme, as do lines 2 and 4, this attribute would have the value
990 VERH , indicating that within each nine line stanza, lines 1 and 3 rhyme with each other, as do lines 2, 4, 5 and 7, and lines 6, 8 and 9.
992 VERH Non-rhyming lines within such a group may be represented using a hyphen or an x, as in the following example:
1007 VERH element may be used to mark the words (or parts of words) which rhyme according to a predefined pattern:
1020 VERH attribute is used to specify which parts of a rhyme scheme a given set of rhyming words represent:
1057 VERH elements with the same value for their
1059 VERH attribute are assumed to rhyme with each other: thus, in the above example, the two rhymes labelled
1061 VERH in the first stanza rhyme with each other, but not necessarily with those labelled
1069 VERH element can appear anywhere within a verse line, and not necessarily around a single word. It can thus be used to mark quite complex internal rhyming schemes, as in the following example:
1097 VERH This mechanism, although reasonably simple for simple cases, may not be appropriate for more complex applications. In general, rhyme may be considered as a special form of
1099 VERH , and hence encoded using the mechanisms defined for that purpose in section
1129 VERH Now that each rhyming word, or part-word, has been tagged and allocated an arbitrary identifier, the general purpose
1154 VERH class when the module defined by this chapter is included in a schema.
1162 HDMN element of the TEI header to document the metrical notation used in marking up a text.
1167 HDMN As with other components of the header, metrical notation may be specified either formally or informally. In a formal specification, every symbol used in the metrical notation must be documented by a corresponding
1173 HDMN if
1177 HDMN if any
1179 HDMN is defined, then any notation using undefined symbols should be regarded as invalid
1181 HDMN if both pattern and symbol are defined, then every symbol appearing explicitly within pattern must be defined
1190 HDMN As a simple example, consider the case of the notation in which metrical prominence, metrical feet, and line boundaries are all to be encoded. Legal specifications in this notation may be written for any sequence of metrically prominent or non-prominent features, optionally separated by foot or metrical line boundaries at arbitrary points. Assuming that the symbol
1198 HDMN for line boundary, then the following declaration achieves this object:
1219 HDMN attribute values within the text which use this metrical notation.
1223 HDMN attribute should be used to indicate for a given symbol whether or not it may be re-defined in terms of other symbols used within the same notation. For example, here is a notation for encoding classical metres, in which symbols are provided for the most common types of foot.
1244 HDMN attribute to supply an additional name for the symbols being documented.
1250 HDMN , each supplied with an
1254 HDMN attribute may be used in the text of the document to specify which
1258 HDMN s are defined in the header, one with an English verse pattern and one with a French pattern. In the body of the document, there are two
1306 VEETC A number of procedures that may be of particular concern to encoders of verse texts are dealt with elsewhere in these guidelines. Some aspects of layout and physical appearance, especially important in the case of free verse, are dealt with in chapter
1307 VEETC . Some initial recommendations for the encoding of phonetic or prosodic transcripts, which may be helpful in the analysis of sound structures in poetry, are to be found in chapter
1311 VEETC contains much which will be found useful for the aligning of multiple levels of commentary and structure within verse analysis. Encoders of verse (as of other types of literary text) will frequently wish to attach identifying labels to portions of text that are not part of a system of hierarchical divisions, may overlap with one another, and/or may be discontinuous; for instance passages associated with particular characters, themes, images, allusions, topoi, styles, or modes of narration. Much of the computerized analysis of verse seems likely to require dividing texts up into blocks in this way. The
1315 VEETC , provide a powerful means of encoding a wide variety of aspects of verse literature, including not only the metrical structures discussed above, but also such stylistic and rhetorical features as metaphor.
1317 VEETC For other features it must for the time being be left to encoders to devise their own terminology. Elements such as
1321 VEETC might well suggest themselves; but given the problems of definition involved, and the great richness of modern metaphor theory, it is clear that any such format, if predefined by these Guidelines, would have seemed objectionable to some and excessively restrictive to many. Leaving the choice of tagging terminology to individual encoders carries with it one vital corollary, however: the encoder must be utterly explicit, in the TEI header, about the methods of tagging used and the criteria and definitions on which they rest. Where no formal elements are currently proposed, such information may readily be given as simple prose description within the
1346 VESTR The selection and combination of modules to form a TEI schema is described in

MS-ManuscriptDescription.xml#12922

# id text
10 msov This chapter is based on the work of the European MASTER (Manuscript Access through Standards for Electronic Records) project, funded by the European Union from January 1999 to June 2001, and led by Peter Robinson, then at the Centre for Technology and the Arts at De Montfort University, Leicester (UK). Significant input also came from a TEI Workgroup headed by Consuelo W. Dutschke of the Rare Book and Manuscript Library, Columbia University (USA) and Ambrogio Piazzoni of the Biblioteca Apostolica Vaticana (IT) during 1998-2000.
11 msov defines a special purpose element which can be used to provide detailed descriptive information about handwritten primary sources. Although originally developed to meet the needs of cataloguers and scholars working with medieval manuscripts in the European tradition, the scheme presented here is general enough that it can also be extended to other traditions and materials, and is potentially useful for any kind of inscribed artefact.
13 msov The scheme described here is also intended to accommodate the needs of many different classes of encoders. On the one hand, encoders may be engaged in
16 msov ex nihilo
17 msov , that is, creating new detailed descriptions for materials never before catalogued. Some may be primarily concerned to represent accurately the description itself, as opposed to the ideas and interpretations the description represents; others may have entirely opposite priorities. At one extreme, a project may simply wish to capture an existing catalogue in a form that can be displayed on the Web, and which can be searched for literal strings, or for such features as titles, authors and dates; at the other, a project may wish to create, in highly structured and encoded form, a detailed database of information about the physical characteristics, history, interpretation, etc. of the material, able to support practitioners of
21 msov To cater for this diversity, here as elsewhere, these Guidelines propose a flexible strategy, in which encoders must choose for themselves the approach appropriate to their needs, and are provided with a choice of encoding mechanisms to support those differing degrees.
31 msdesc element of the header of a TEI-conformant document, where the document being encoded is a digital representation of some manuscript original, whether as an encoded transcription, as a collection of digital images (as described in
32 msdesc ), or as some combination of the two. However, in cases where the document being encoded is essentially a collection of manuscript descriptions, the
40 msdesc ) making up the TEI element class
50 msdesc element has the following components, which provide more detailed information under a number of headings. Each of these component elements is further described in the remainder of this chapter.
66 msdesc ), and then either one or more paragraphs, marked up as a series of
80 msdesc ). These elements are all optional, but if used they must appear in the order given here. Finally, in the case of a composite manuscript, a full description may also contain one or more
95 msdesc The simplest way of digitizing this catalogue entry would be to key in the text, tagging the relevant parts of it which make up the mandatory
118 msdesc and add some of the additional phrase-level elements available when this module is in use:
160 msdesc Note that in this version the text has been slightly reorganized, but no actual rewriting has been necessary. The encoding now allows the user to search for such features as title, material, and date and place of origin; it is also possible to distinguish quoted material from descriptive passages and to search within descriptions relating to a particular topic (for example, history as distinct from material).
162 msdesc This process could be continued further, restructuring the whole entry so as to take full advantage of many more of the encoding possibilities provided by the module described in this chapter:
279 msphrase Within a manuscript description, many other standard TEI phrase level elements are available, notably those described in the Core module (
297 msdates elements respectively, used to indicate specifically the date and place of origin of a manuscript or manuscript part. Such information would normally be encoded within the
304 msdates can also be used to identify the place or date of origin of any aspect of the manuscript, such as its decoration or binding, when these are not of the same date or from the same location as the rest of the manuscript. Both these elements are members of the
312 msdates class, and may thus also carry additional attributes giving normalized values for the associated dating.
320 msmat element can be used to tag any specific term used for the physical material of which a manuscript (or binding, seal, etc.) is composed. The
322 msmat element may be used to tag any term specifying the type of object or manuscript upon which the text is written.
327 msmat These elements may appear wherever a term regarded as significant by the encoder occurs, as in the following examples:
356 mswat These elements may appear wherever a term regarded as significant by the encoder occurs. The
369 mswat element will typically appear when text from the source is being transcribed, for example within a rubric in the following case:
385 mswat If, as here, any text contained by a stamp is included in its description it should be clearly distinguished from that description. The element
395 msdim element can be used to specify the size of some aspect of the manuscript, and thus may be thought of as a specialized form of the existing TEI
403 msdim element will normally occur within the element describing the particular feature or aspect of a manuscript whose dimensions are being given; thus the size of the leaves would be specified within the
410 msdim ), while the dimensions of other specific parts of a manuscript, such as accompanying materials, binding, etc., would be given in other parts of the description, as appropriate.
438 msdim are used only when the measurement applies to several items, for example the size of all leaves in a manuscript; attributes
442 msdim are used when the measurement applies to a single item, for example the size of a specific codex, but has had to be estimated. Attribute
444 msdim is used when the measurement can be given exactly, and applies to a single item; this is the usual situation. In this case, the units in which dimensions are measured may be specified using the
446 msdim attribute, which will normally take its value from a closed set of values appropriate to the project, using standard units of measurement wherever possible, such as the following values:
453 msdim line
455 msdim char
456 msdim . If however the only data available for the measurement uses some other unit, or it is preferred to normalize it in some other way, then it may be supplied as a string value by means of the
464 msdim More usually, the measurement will be normalized into a value and an appropriate SI unit:
466 msdim Where the exact value is uncertain, the attributes
474 msdim It is often convenient to supply a measurement which applies to a number of discrete observations: for example, the number of ruled lines on the pages of a manuscript (which may not all be the same), or the diameter of an object like a bell, which will differ depending where it is measured. In such cases, the
488 msdim element may be repeated as often as necessary, with appropriate attribute values to indicate the nature and scope of the measurement concerned. For example, in the following case the leaf size and ruled space of the leaves of the manuscript are specified:
498 msdim This indicates that for most leaves of the manuscript being described the ruled space is 90 mm high and 48 mm wide, while the leaves throughout are between 157 and 160 mm in height and 105 mm in width.
502 msdim element is provided for cases where some measurement other than height, width, or depth is required. Its
514 msdim element may be supplied is not constrained.
525 msloc element, used to indicate a location, or sequence of locations, within a manuscript.
532 msloc element is used to reference a single location within a manuscript, typically to specify the location occupied by the element within which it appears. If, for example, it is used as the first component of a
537 msloc below) then it is understood to specify the location (or locations) of that item within the manuscript being described.
543 msloc element can be used to identify any reference to one or more folios within a manuscript, wherever such a reference is appropriate. Locations are conventionally specified as a sequence of folio or page numbers, but may also be a discontinuous list, or a combination of the two. This specification should be given as the content of the
553 msloc A normalized form of the location can also be supplied, using special purpose attributes on the
563 msloc When the item concerned occupies a discontinuous sequence of pages, this may simply be indicated in the body of the
572 msloc Alternatively, if it is desired to indicate normalized values for each part of the sequence, a sequence of
587 msloc Finally, the content of the
589 msloc element may be omitted if a formatting application can construct it automatically from the values of the
609 msloc attribute can also be used to associate a location within a manuscript with facsimile images of that location, using the
611 msloc attribute, or with a transcription of the text occurring at that location. The former association is effected by means of the
619 msloc is available only when the
640 msloc attribute uses a URI reference to point directly to images of the relevant pages. This method may be found cumbersome when many images are to be associated with a single location. It is of most use when specific pages are referenced within a description, as in the following example:
690 msloc When (as in this example) a sequence of elements is to be supplied as target value, it may be given explicitly as above, or using the XPointer range() syntax defined at
691 msloc . Note however that support for this pointer mechanism is not widespread in current XML processing systems.
695 msloc attribute should only be used to point to elements that contain or indicate a transcription of the locus being described. To associate a
706 msloc attribute may be used to distinguish them. For example, MS 65 Corpus Christi College, Cambridge contains two fly leaves bearing music. These leaves have modern foliation 135 and 136 respectively, but are also marked with an older foliation. This may be preserved in an encoding such as the following:
721 msloc attribute should be supplied on the
742 msnames The standard TEI element
769 msnames name
770 msnames , not the person, place, or organization to which that name refers. In the last example above, the
772 msnames attribute is used to associate the name with a more detailed description of the person named. This is provided by means of the
774 msnames element, which becomes available when the
777 msnames is included in a schema. An element such as the following might then be used to provide detailed information about the person indicated by the name:
792 msnames element must be provided for each distinct
794 msnames value specified. For example, in the case above, the value
800 msnames element; the same value will be used as the
808 msnames attribute may be used to supply a unique identifying code for the person referenced by the name independently of both the existence of a
810 msnames element and the use of the standard URI reference mechanism. If, for example, a project maintains as its authority file some non-digital resource, or uses a database which cannot readily be integrated with other digital resources for this purpose, the unique codes used by such
815 msnames , interchange is improved by use of tag URIs in
823 msnames elements referenced by a particular document set should be collected together within a
826 msnames element, located in the TEI header. This functions as a kind of prosopography for all the people referenced by the set of manuscripts being described, in much the same way as a
828 msnames element in the back matter may be used to hold bibliographic information for all the works referenced.
843 msmisc element is used to describe one method by which correct ordering of the quires of a codex is ensured. Typically, this takes the form of a word or phrase written in the lower margin of the last leaf verso of a gathering, which provides a preview of the first recto leaf of the successive gathering. This may be a simple phrase such as the following:
859 msmisc element can be used for either leaf signatures, or a combination of quire and leaf signatures, whether the marking is alphabetic, alphanumeric, or some ad hoc system, as in the following more complex example:
869 msmisc ) taken from a specific known point in a codex (for example the first few words on the second leaf). Since these words will differ from one copy of a text to another, the practice originated in the Middle Ages of using them when cataloguing a manuscript in order to distinguish individual copies of a work in a way which its opening words could not.
878 mshera Descriptions of heraldic arms, supporters, devices, and mottos may appear at various points in the description of a manuscript, usually in the context of ownership information, binding descriptions, or detailed accounts of illustrations. A full description may also contain a detailed account of the heraldic components of a manuscript independently considered. Frequently, however, heraldic descriptions will be cited as short phrases within other parts of the record. The phrase level element
919 msid element is intended to provide an unambiguous means of uniquely identifying a particular manuscript. This may be done in a structured way, by providing information about the holding institution and the call number, shelfmark, or other identifier used to indicate its location within that institution. Alternatively, or in addition, a manuscript may be identified simply by a commonly used name.
923 msid A manuscript's actual physical location may occasionally be different from its place of ownership; at Cambridge University, for example, manuscripts owned by various colleges are kept in the central University Library. Normally, it is the ownership of the manuscript which should be specified in the manuscript identifier, while additional or more precise information on the physical location of the manuscript can be given within the
938 msid These elements are all structurally equivalent to the standard TEI
940 msid element with an appropriate value for its
948 msid and they must, if present, appear in the order given.
958 msid to reference a single standardized source of information about the entity named.
969 msid Major manuscript repositories will usually have a preferred form of citation for manuscript shelfmarks, including rules about punctuation, spacing, abbreviation, etc., which should be adhered to. Where such a format also contains information which might additionally be supplied as a distinct subcomponent of the
971 msid , for example a collection name, a decision must be taken as to whether to use the more specific element, or to include such information within the
1012 msid In the former example, the preferred form of the identifier can be retrieved by prefixing the content of the
1028 msid might be considered helpful in some circumstances (if, for example, some of the items in the Ellesmere collection had shelfmarks which did not begin
1032 msid In some cases the shelfmark may contain no information about the collection; in other cases, the item may be regarded as belonging to more than one collection. The
1070 msid Note in the latter case the use of the
1072 msid element to provide a common name other than the shelfmark by which a manuscript is known. Where a manuscript has several such names, more than one of these elements may be used, as in the following example:
1090 msid attribute has been used to specify the language of the alternative names.
1092 msid In very rare cases a repository may have only one manuscript (or only one of any significance), which will have no shelfmark as such but will be known by a particular name or names. In such circumstances, the
1094 msid element may be omitted, and the manuscript identified by the name or names used for it, using one or more
1111 msid Where manuscripts have moved from one institution to another, or even within the same institution, they may have identifiers additional to the ones currently used, such as former shelfmarks, which are sometimes retained even after they have been officially superseded. In such cases it may be useful to supply an alternative identifier, with a detailed structure similar to that of the
1115 msid in the collection of the Duque de Osuna, but which now has the shelfmark
1139 msid , except in cases where a manuscript is likely still to be referred to or known by its former identifier. For example, an institution may have changed its call number system but still wish to retain a record of the earlier number, perhaps because the manuscript concerned is frequently cited in print under its previous number:
1153 msid Where (as in this example) no repository is specified for the
1157 msid . Where the holding institution has only one preferred form of citation but wishes to retain the other for internal administrative purposes, the secondary could be given within
1159 msid with an appropriate value on the
1182 msid , substantial parts of which are to be found in three separate repositories, in Ljubljana, Warsaw, and St. Petersburg. This should be represented using three distinct
1184 msid elements, using an appropriate value on the type attribute to indicate that these three identifiers are not alternate ways of referring to the same physical object, but three parts of the same entity.
1217 msid As mentioned above, the smallest possible description is one that contains only the element
1241 msdo . This will often have been enough to identify a manuscript in a small collection because the identity of the author is implicit. Where a title does not imply the author, and is thus insufficient to identify the main text of a manuscript, the author should be stated explicitly (e.g.
1245 msdo ). Many inventories of manuscripts consist of no more than an author and title, with some form of copy-specific identifier, such as a shelfmark or
1253 msdo ); information on date and place of writing will sometimes also be included. The standard TEI element
1258 msdo In this way the cataloguer or scholar can supply in one place a minimum of essential information, such as might be displayed or printed as the heading of a full description. For example:
1276 msdo element is intended principally to contain a heading. More structured information concerning the contents, physical form, or history of the manuscript should be given within the specialized elements described below,
1284 msdo element may also be used to supply an unstructured collection of such information, as in the example given above (
1293 msco element is used to describe the intellectual content of a manuscript or manuscript part. It comprises
1295 msco a series of informal prose paragraphs
1297 msco a series of
1301 msco elements, each of which provides a more detailed description of a single item contained within the manuscript. These may be prefaced, if desired, by a
1325 msco This description may of course be expanded to include any of the TEI elements generally available within a
1394 msco elements if it is desired to provide both a general summary of the contents of a manuscript and more detail about some or all of the individual items within it. It may not however be used within an individual
1419 mscoit Each discrete item in a manuscript or manuscript part can be described within a distinct
1464 mscoit is that in the former, the order and number of child elements is not constrained; any element, in other words, may be given in any order, and repeated as often as is judged necessary. In the latter, however, the sub-elements, if used, must be given in the order specified above and only some of them may be repeated; specifically,
1480 mscoit may contain untagged running text, both permit an unstructured description to be provided in the form of one or more paragraphs of text. They differ in this respect also: if paragraphs are supplied as the content of an
1482 mscoit , then none of the other component elements listed above is permitted; in the
1490 mscoit elements may also nest, where a number of separate items in a manuscript are grouped under a single title or rubric, as is the case, for example, with a work like
1549 mscoit ; they are available only when the
1563 msat element should be used to supply a regularized form of the item's title, as distinct from any rubric quoted from the manuscript. If the item concerned has a standardized distinctive title, e.g.
1565 msat , then this should be the form given as content of the
1567 msat element, with the value of the
1571 msat . If no uniform title exists for an item, or none has been yet identified, or if one wishes to provide a general designation of the contents, then a
1572 msat supplied
1573 msat title can be given, e.g.
1575 msat , in which case the
1579 msat should be given the value
1580 msat supplied
1583 msat Similarly, if used within a manuscript description, the
1585 msat element should always contain the normalized form of an author's name, irrespective of how (or whether) this form of the name is cited in the manuscript. If it is desired to retain the form of the author's name as given in the manuscript, this may be tagged as a distinct
1587 msat element, within the text at the point where it occurs.
1594 msat element carrying full details of the person concerned (see further
1599 msat element can be used to supply the name and role of a person other than the author who is responsible for some aspect of the intellectual content of the manuscript:
1612 msat element can also be used where there is a discrepancy between the author of an item as given in the manuscript and the accepted scholarly view, as in the following example:
1622 msat Note that such attributions of authorship, both correct and incorrect, are frequently found in the rubric or final rubric (and occasionally also elsewhere in the text), and can therefore be transcribed and included in the description, if desired, using the
1633 mscorie It is customary in a manuscript description to record the opening and closing words of a text as well as any headings or colophons it might have, and the specialized elements
1647 mscorie , for recording other bits of the text not covered by these elements. Each of these elements has the same substructure, containing a mixture of phrase-level elements and plain text. A
1649 mscorie element can be included within each, in order to specify the location of the component, as in the following example:
1667 mscorie In the following example, standard TEI elements for the transcription of primary sources have been used to mark the expansion of abbreviations and other features present in the original:
1702 mscorie to indicate that the text begins and ends defectively.
1716 mscorie may always be used to identify the language of the text quoted, if this is different from the default language specified by the
1750 msclass One or more text classification or text-type codes may be specified, either for the whole of the
1779 msclass The value used for the
1791 msclass element of the TEI header (
1820 mslangs element should be used to provide information about the languages used within a manuscript item. It may take the form of a simple note, as in the following example:
1825 mslangs Where, for validation and indexing purposes, it is thought convenient to add keywords identifying the particular languages used, the
1836 mslangs A manuscript item will sometimes contain material in more than one language. The
1846 mslangs Since Old Church Slavonic may be written in either Cyrillic or Glagolitic scripts, and even occasionally in both within the same manuscript, it might be preferable to use a more explicit identifier:
1851 mslangs The form and scope of language identifiers recommended by these Guidelines is based on the IANA standard described at
1852 mslangs and should be followed throughout. Where additional detail is needed to describe a language correctly, or to discuss its deployment in a given text, this should be done using the
1854 mslangs element in the TEI header, within which individual
1861 mslangs element defines a particular combination of human language and writing system. Only one
1863 mslangs element may be supplied for each such combination. Standard TEI practice also allows this element to be referenced by any element using the global
1865 mslangs attribute in order to specify the language applicable to the content of that element. For example, assuming that
1902 msph we subsume a large number of different aspects generally regarded as useful in the description of a given manuscript. These include:
1904 msph aspects of the form, support, extent, and quire structure of the manuscript object and of the way in which the text is laid out on the page (
1910 msph and discussion of its binding, seals, and any accompanying material (
1914 msph Most manuscript descriptions touch on several of these categories of information though few include them all, and not all distinguish them as clearly as we propose here. In particular, it is often the case that an existing description will include information for which we propose distinct elements within a single paragraph, or even sentence. The encoder must then decide whether to rewrite the description using the structure proposed here, or to retain the existing prose, marked up simply as a series of
1922 msph element may thus be used in either of two distinct ways. It may contain a series of paragraphs addressing topics listed above and similar ones. Alternatively, it may act as a container for any choice of the more specialized elements described in the remainder of this section, each of which itself contains a series of paragraphs, and may also have more specific attributes.
1926 msph element will normally contain either a series of
1928 msph elements, or a sequence of specialized elements from the
1932 msph the description already exists in a prose form where some of the specialized topics are treated together in paragraphs of prose, but others are treated distinctly;
1955 msph The order in which specific elements may appear is also constrained by the content model; again this is for simplicity of processing. They may of course be processed or displayed in any desired order, but for ease of validation, they must be given in the order specified below.
1961 msph1 element is used to group together those parts of the physical description which relate specifically to the text-bearing object, its format, constitution, layout, etc. The
1963 msph1 attribute is used to indicate the specific type of writing vehicle being described, for example, as a codex, roll, tablet, etc. If used it must appear first in the sequence of specialized elements. The
1966 msph1 support
1967 msph1 , i.e. the physical carrier on which the text is inscribed; and a description of the
1968 msph1 layout
1969 msph1 , i.e. the way text is organized on the carrier.
1971 msph1 Taking these in turn, the description of the support is tagged using the following elements, each of which is discussed in more detail below:
1981 msph1 ), may be used to tag specific terms of interest if so desired.
2007 msph1sup element groups together information about the physical carrier. Typically, for western manuscripts, this will entail discussion of the material (parchment, paper, or a combination of the two) written on. For paper, a discussion of any watermarks present may also be useful. If this discussion makes reference to standard catalogues of such items, these may be tagged using the standard
2030 msph1ext element, defined in the TEI header, may also be used in a manuscript description to specify the number of leaves a manuscript contains, as in the following example:
2070 msph1col element, which is provided when the
2121 msphfo element may be used to indicate the scheme, medium or location of folio, page, column, or line numbers written in the manuscript, frequently including a statement about when and, if known, by whom, the numbering was done.
2129 msphfo Where a manuscript contains traces of more than one foliation, each should be recorded as a distinct
2131 msphfo element and optionally given a distinct value for its
2136 msphfo can then indicate which foliation scheme is being cited by means of its
2155 msphco element is used to summarize the overall physical state of a manuscript, in particular where such information is not recorded elsewhere in the description. It should not, however, be used to describe changes or repairs to a manuscript, as these are more appropriately described as a part of its custodial history (see
2156 msphco ). It should be supplied within the
2158 msphco element, if it discusses the condition of the physical support of the manuscript; within the
2163 msphco ) if it discusses only the condition of the binding or bindings concerned; or within the
2165 msphco element if it discusses the condition of any seal attached to the manuscript.
2187 msphla of the manuscript, that is the way in which text and illumination are arranged on the page, specifying for example the number of written, ruled, or pricked lines and columns per page, size of margins, distinct blocks such as glosses, commentaries, etc. This may be given as a simple series of paragraphs. Alternatively, one or more different layouts may be identified within a single manuscript, each described by its own
2196 msphla element is used, the layout will often be sufficiently regular for the attributes on this element to convey all that is necessary; more usually however a more detailed treatment will be required. The attributes are provided as a convenient shorthand for commonly occurring cases, and should not be used except where the layout is regular. The value
2198 msphla (not-applicable) should be used for cases where the layout is either very irregular, or where it cannot be characterized simply in terms of lines and columns, for example, where blocks of commentary and text are arranged in a regular but complex pattern on each page
2217 msphla elements within the content of the element, as in the following example:
2239 msph2 The second group of elements within a structured physical description concerns aspects of the writing, illumination, or other notation (notably, music) found in a manuscript, including additions made in later hands—the
2240 msph2 text
2259 msphwr element can contain a short description of the general characteristics of the writing observed in a manuscript, as in the following example:
2276 msphwr Where several distinct hands have been identified, this fact can be registered by using the
2318 msphwr can be used to link the relevant parts of the transcription to the appropriate
2321 msphwr <handShift new="#Eirsp-2"/>
2334 msphwr element can simply provide a summary description:
2357 msphwr elements should be supplied. Similarly, in the following example, the source text is a typescript with extensive handwritten annotation:
2391 msphdec It can be difficult to draw a clear distinction between aspects of a manuscript which are purely physical and those which form part of its intellectual content. This is particularly true of illuminations and other forms of decoration in a manuscript. We propose the following elements for the purpose of delimiting discussion of these aspects within a manuscript description, and for convenience locate them all within the physical description, despite the fact that the illustrative features of a manuscript will in many cases also be seen as constituting part of its intellectual content.
2401 msphdec Alternatively, it may contain a series of more specific typed
2428 msphdec Where more exact indexing of the decorative content of a manuscript is required, the standard TEI elements
2470 msphmu element may be used to describe the form of notation employed, as in the following example:
2486 mspham element can be used to list or describe any additions to the manuscript, such as marginalia, scribblings, doodles, etc., which are considered to be of interest or importance. Such topics may also be discussed or referenced elsewhere in a description, for example in the
2590 msph3 The third major component of the physical description relates to supporting but distinct physical components, such as bindings, seals and accompanying material. These may be described using the following specialist elements:
2602 msphbi element contains a description of the state of the present and former bindings of a manuscript, including information about its material, any distinctive marks, and provenance information. This may be given as a series of paragraphs if only one binding is being described, or as a series of distinct
2604 msphbi elements, each describing a distinct binding where these are separately described. For example:
2612 msphbi Within a binding description, the elements
2639 msphbi for paragraphs concerned exclusively with the condition of a binding, where this has not been supplied as part of the physical description.
2679 msadac The circumstance may arise where material not originally part of a manuscript is bound into or otherwise kept with a manuscript. In some cases this material would best be treated in a separate
2682 msadac below). There are, however, cases where the additional matter is not self-evidently a distinct manuscript: it might, for example, be a set of notes by a later scholar, or a file of correspondence relating to the manuscript. The
2688 msadac Here is an example of the use of this element, describing a note by the Icelandic manuscript collector Árni Magnússon which has been bound with the manuscript:
2734 mshy The following elements are used to record information about the history of a manuscript:
2752 mshy Information about the origins of the manuscript, its place and date of writing, should be given as one or more paragraphs contained by a single
2754 mshy element; following this, any available information on distinct stages in the history of the manuscript before its acquisition by its current holding institution should be included as paragraphs within one or more
2802 mshy elements where distinct periods of ownership for the manuscript have been identified:
2841 msad Three categories of additional information are provided for by the scheme described here, grouped together within the
2852 msad is required. If any is supplied, it may appear once only; furthermore, the order in which elements are supplied should be as specified above.
2862 msadad element is used to hold information relating to the curation and management of a manuscript. This may be supplied as a note using the global
2875 msrh element may contain simply a series of paragraphs. Alternatively it may contain a
2877 msrh element, followed by an optional series of
2886 msrh element is used to document the primary source of information for the record containing it, in a similar way to the standard TEI
2888 msrh element within a TEI Header. If the record is a new one, made without reference to anything other than the manuscript itself, then it may simply contain a
2895 msrh Frequently, however, the record will be derived from some previously existing description, which may be specified using the standard TEI
2907 msrh If, as is likely, a full bibliographic description of the source from which cataloguing information was taken is included within the
2911 msrh element, or elsewhere in the current document, then it need not be repeated here. Instead, it should be referenced using the standard TEI
2947 msrh element of the standard TEI header; its use here is intended to signal the similarity of function between the two container elements. Where the TEI header should be used to document the revision history of the whole electronic file to which it is prefixed, the
2960 msadch element is another element also available in the TEI header, which should be used here to supply any information concerning access to the current manuscript, such as its physical location (where this is not implicit in its identifier), any restrictions on access, information about copyright, etc.
2977 msadch record is used to describe the custodial history of a manuscript, recording any significant events noted during the period that it has been located within its holding institution. It may contain either a series of
2979 msadch elements, or a series of
2981 msadch elements, each describing a distinct incident or event, further specified by a
3018 msadsu element is used to provide information about surrogates, such as photographs or other representations of the manuscript, which may exist within the holding institution or elsewhere.
3028 msadsu element. However, it is often also convenient to record information such as negative numbers or digital identifiers for unpublished collections of manuscript images maintained within the holding institution, as well as to provide more detailed descriptive information about the surrogate itself. Such information may be provided as prose paragraphs, within which identifying information about particular surrogates may be presented using the standard TEI
3056 msadsu Note the use of the specialized form of title (
3057 msadsu general material designation
3060 msadsu At a later revision, the content of the
3062 msadsu element is likely to be expanded to include elements more specifically intended to provide detailed information such as technical details of the process by which a digital or photographic image was made. For information about the inclusion of digital facsimile images within a TEI document, refer also to
3137 MSref The selection and combination of modules to form a TEI schema is described in

SA-LinkingSegmentationAlignment.xml#13230

# id text
4 SA This chapter discusses a number of ways in which encoders may represent analyses of the structure of a text which are not necessarily linear or hierarchic. The module defined by this chapter provides for the following common requirements:
6 SA to link disparate elements using the
11 SA to link disparate elements without using the
17 SA to segment text into elements convenient for the encoder and to mark arbitrary points within documents (section
20 SA to represent correspondence or alignment among groups of text elements, both those with content and those which are empty (section
22 SA We use the term
24 SA as a special case for the more general notion of correspondence. Using A as a short form for
27 SA set to the value
29 SA , and suppose elements A1, A2, and A3 occur in that order and form one group, while elements B1, B2, and B3 occur in that order and form another group. Then a relation in which A1 corresponds to B1, A2 corresponds to B2, and A3 corresponds to B3 is an alignment. On the other hand, a relation in which A1 corresponds to B2, B1 to C2, and C1 to A2 is not an alignment.
31 SA to synchronize elements of a text, that is to represent temporal correspondences and alignments among text elements (section
32 SA ) and also to align them with specific points in time (section
35 SA to specify that one text element is identical to or a copy of another (section
47 SA to associate segments of a text with interpretations or analyses of their significance (section
51 SA These facilities all use the same set of techniques based on the W3C XPointer framework (
63 SA is extended to include eight additional attributes to support the various kinds of linking listed above. Each of these attributes is introduced in the appropriate section below. In addition, for many of the topics discussed, a choice of methods of encoding is offered, ranging from simple but less general ones, which use attribute values only, to more elaborate and more general ones, which use specialized elements.
70 SAPT to others if the first has an attribute whose value is a reference to the others: such an element is called a
80 SAPT . These elements all indicate an association between one place in the document (the location of the pointer itself) and one or more others (the elements whose identifiers are specified by the pointer's
83 SAPT link
100 SAPTL element, which represents an association between two (or more) locations by specifying each location explicitly. Its own location is irrelevant to the intended linkage. All three elements use the attribute
104 SAPTL class as a means of indicating the location or locations referenced or pointed to.
114 SAPTL between an element (which, in the case of a pure pointer, is simply a location in a document), and one or more others, known collectively as its
121 SAPTL point, conceptually, at a single target, even if that target may be discontinuous in the document. The
126 SAPTL These three elements also share a common set of attributes, derived from the
141 SAPTL element. All that is required is that the value of the
143 SAPTL (or other pointing) attribute of the one be the value of the
161 SAPTL attribute may take as its value one or more URI references. In the simplest case, each such reference will indicate an element in the current document (or in some other document), for example by supplying the value used for its global
163 SAPTL attribute. It may however carry as value any form of URI, such as a URL pointing to some other document or location on the Internet. Pointing or linking to external documents, and pointing and linking where identifiers are not available, are described below in section
170 SAPTEG As an example of the use of mechanisms which establish connections among elements, consider the practice (common in 18th century English verse and elsewhere) of providing footnotes citing parallel passages from classical authors.
172 POPE The figure shows the original page of Pope's Dunciad which is discussed in the text.
178 SAPTEG attribute, placed adjacent to the passage to which the note refers:
181 SAPTEG attribute on the note is used to classify the notes using the typology established in the Advertisement to the work:
185 SAPTEG In the source text, the text of the poem shares the page with two sets of notes, one headed
214 SAPTEG implicit linking
215 SAPTEG ). It relies on the juxtaposition of the note to the text being commented on for the connection to be understood. If it is felt that the mere juxtaposition of the note to the text does not make it sufficiently clear exactly what text segment is being commented on (for example, is it the immediately preceding line, or the immediately preceding two lines, or what?), or if it is decided to place the note at some distance from the text, then the pointing or the linking must be made explicit. We now consider various methods for doing that.
219 SAPTEG element might be placed at an appropriate point within the text to link it with the annotation:
242 SAPTEG ) to enable it to be specified as the target of the pointer element. Because there is nothing in the text to signal the existence of the annotation, the
244 SAPTEG attribute has been given the value
254 SAPTEG attribute has been supplied for the associated text:
264 SAPTEG Given this encoding of the text itself, we can now link the various notes to it. In this case, the note itself contains a pointer to the place in the text which it is annotating; this could be encoded using a
268 SAPTEG attribute of its own and contains a (slightly misquoted) extract from the text marked as a
292 SAPTEG a pointer within one line indicates the note
294 SAPTEG the note indicates the line
296 SAPTEG a pointer within the note indicates the line
298 SAPTEG Note that we do not have any way of pointing from the line itself to the note: the association is implied by containment of the pointer. We do not as yet have a true double link between text and note. To achieve that we will need to supply identifiers for the annotations as well as for the verse lines, and use a
331 SAPTEG element here bears the identifier of the note followed by that of the verse line. We could also allocate an identifier to the reference within the note and encode the association between it and the verse line in the same way:
346 SAPTEG s could be combined into one, as follows:
352 SAPTLG Clearly, there are many reasons for which an encoder might wish to represent a link or association between different elements. For some of them, specific elements are provided in these Guidelines; some of these are discussed elsewhere in the present chapter. The
354 SAPTLG element is a general purpose element which may be used for any kind of association. The element
356 SAPTLG may be used to group links of a particular type together in a single part of the document; such a collection may be used to represent what is sometimes referred to in the literature of Hypertext as a
358 SAPTLG , a term introduced by the Brown University FRESS project in 1969, and not to be confused with the World Wide Web.
373 SAPTLG element provides a convenient way of establishing a default for the
375 SAPTLG attribute on a group of links of the same type: by default, the
379 SAPTLG element has the same value as that given for
385 SAPTLG Typical software might hide a web entirely from the user, but use it as a source of information about links, which are displayed independently at their referenced locations. Alternatively, software might provide a direct view of the link collection, along with added functions for manipulating the collection, as by filtering, sorting, and so on. To continue our previous example, this text contains many other notes of a kind similar to the one shown above. Here are a few more of the lines to which annotations have to be attached, followed by the annotations themselves:
426 SAPTLG attribute can be used to identify the text elements within which the individual targets of the links are to be found. Suppose that the text under discussion is organized into a
428 SAPTLG element, containing the text of the poem, and a
432 SAPTLG attribute can have as its value the identifiers of the
436 SAPTLG , to enable an application to verify that the link targets are in fact contained by appropriate elements, or to limit its search space:
448 SAPTLG domain
449 SAPTLG ; if some notes are contained by a section with identifier
460 SAPTLG attribute can be used to provide further information about the role or function of the various targets specified for each link in the group. The value of the
462 SAPTLG attribute is a list of names (formally, name tokens), one for each of the targets in the link; these names can be chosen freely by the encoder, but their significance should be documented in the encoding description in the header.
463 SAPTLG Since no special element is provided for this purpose in the present version of these Guidelines, the information should be supplied as a series of paragraphs at the end of the
467 SAPTLG In the current example, we might think of the note as containing the
468 SAPTLG source
469 SAPTLG of the imitation and the verse line as containing the
489 SAPTIP In the preceding examples, we have shown various ways of linking an annotation and a single verse line. However, the example cited in fact requires us to encode an association between the note and a
491 SAPTIP of verse lines (lines 284 and 285); we call these two lines a
492 SAPTIP span
495 SAPTIP There are a number of possible ways of correcting this error: one could use the
497 SAPTIP attribute to indicate one end of the span and the special purpose
501 SAPTIP element to point to the other. Another possibility might be to create an element which represents the whole span itself and assign that an
503 SAPTIP attribute, which can then be linked to the
531 SAPTIP then provides an identifier which can be linked to the
540 SAPTIP value of
546 SAPTIP had the value
548 SAPTIP , the link target would be the pointer itself, rather than the objects it points to.
552 SAPTIP element is used to group a collection of
565 SAXP This section introduces more formally the pointing mechanisms available in the TEI. In addition to those discussed so far, the TEI provides methods of pointing:
575 SAXP at arbitrary content in any XML document using TEI-defined XPointer schemes.
579 SAXP All TEI attributes used to point at something else are declared as having the datatype
599 SAUR Like the ubiquitous if misnamed XHTML pointing attribute
601 SAUR , the TEI pointing attributes can point to a document that is not the current document (the one that contains the pointing element) whether it is in the same local filesystem as the current document, or on a different system entirely. In either case, the pointing can be accomplished absolutely (using the entire address of the target document) or relatively (using an address relative to the current base URI in force). The
605 SAUR . If there is none, the base URI is that of the current document. In common practice the current base URI in force is likely to be the value of the
616 SAUR This example points explicitly to a location on the Web, accessible via HTTP
617 SAUR . Suppose however that we wish to access a document stored locally in a file. Again we will supply an absolute URI reference, but this time using a different protocol:
631 SAUR is specified here, the location of the resource
635 SAUR In the following example, however, we first change the current base URI by setting a new value for
637 SAUR . The resource required is then identified by means of a relative URI:
691 SABN Because the default base URI is the current document, a pointer that is specified as a
692 SABN bare name
694 SABN In more recent W3C documents, the term
695 SABN bare name
696 SABN is deprecated in favour of the more explicit
720 SABN of the target element as a bare name only (e.g.,
722 SABN ) is the simplest and often the best approach where it can be applied, i.e. where both the source element and target element are in the same XML document, and where the target element carries an identifier. It is the method used extensively in previous sections of this chapter and elsewhere in these Guidelines.
729 SAPU is a useful way of handling the repeated use of long external URIs. However, it is less convenient when your text contains many references to a variety of different sources in different locations. Even in the case of relative links on the local file system,
733 SAPU attributes may become quite lengthy and make XML code difficult to read. To deal with this problem, the TEI provides a useful method of using abbreviated pointers and documenting a way to dereference them automatically.
735 SAPU Imagine a project which has a large collection of XML documents organized like this:
765 SAPU If you want to link a
773 SAPU file, the link will look like this:
777 SAPU If there are many names to tag in a single paragraph, the XML encoding will be congested, and such lengthy links are prone to typographical error. In addition, if the project organization is changed, every relative link will have to be found and altered.
787 SAPU element in the TEI header, as described in
788 SAPU . However, such a link cannot be mechanically processed by an external system that does not know how to interpret it; a human will have to read the header explanation and write code explicitly to reconstruct the intended link.
794 SAPU , and can therefore be used as the value of any attribute which has that datatype, such as
798 SAPU . Such a scheme consists of a prefix with a colon, and then a value. You might, for example, use the prefix
800 SAPU (for "person"), and structure your name tags like this:
806 SAPU ? Essentially, it isn't, except that TEI provides a structured method of dereferencing it (turning it into a computable path, such as
810 SAPU in the TEI header, using the elements and attributes for prefix declaration:
831 SAPU value is constructed with a
837 SAPU , and it contains any number of
847 SAPU provides the string which will be used as a replacement. In this example, using
849 SAPU , the value
853 SAPU , and also captured (through the parentheses in the regular expression); it would then be replaced by the value
869 SAPU in the header to see if there is an available expansion for it, and if there is, it can automatically provide the expansion and generate a full or relative URI.
873 SAPU element in the personography file, it might also be useful to point to an external source which is available on the network, representing the same information in a different way. So there might be a second
881 SAPU Any number of
883 SAPU elements may be provided for the same prefix. A processor may decide to process one or all of them; if it processes only one, it should choose the first one with the correct
891 SAPU When creating private URI schemes, it is recommended that you avoid using any existing registered prefix. A list of registered prefixes is maintained by IANA at
906 SATS TEI XPointer Schemes
908 SATS The pointing schemes described in this chapter are part of a number of such schemes envisaged by the W3C, which together constitute a framework for addressing data within XML documents, known as the XPointer Framework (
912 SATS . The W3C has predefined a set of such schemes, and maintains a register for their expansion.
917 SATS . These Guidelines also define six other pointer schemes, which provide access to parts of an XML document such as points within data content or stretches of data content. These additional TEI pointer schemes are defined in sections
921 SATSin Introduction to TEI Pointers
923 SATSin Before discussing the TEI pointer schemes, we introduce slightly more formally the terminology used to define them. So far, we have discussed only ways of pointing at components of the XML information set node such as elements and attributes. However, there is often a need in text analysis to address additional types of location such as the
931 SATSin that may arbitrarily cross the boundaries of nodes in a document. The content of an XML document is organized sequentially as well as hierarchically, and it makes sense to consider ranges of characters within a document independently of the nodes to which they belong. From the perspective of most of the pointer schemes discussed below, a TEI document is a tree structure superimposed upon a character stream. Nodes are entities available only in the tree, while points are available only in the stream. For this reason, the schemes below that rely upon character positions (
937 SATSin ) cannot take nodes into account. Similarly, XPath, being a method for locating nodes in the tree, treats those nodes as atomic, and is unable to address parts of nodes in their document context.
939 SATSin The TEI pointer scheme thus distinguishes the following kinds of object:
943 SATSin A node is an instance of one of the node kinds defined in the
945 SATSin . It represents a single item in the XML information set for a document. For pointing purposes, the only nodes that are of interest are Text Nodes, Element Nodes, and Attribute nodes.
949 SATSin A Sequence follows the definition in the XPath 2.0 Data Model, with one alteration. A Sequence is an ordered collection of zero or more items, where an item is either a node or a partial text node.
953 SATSin A Text Stream is the concatenation of the text nodes in a document and behaves as though all tags had been removed. A text stream begins at a reference node and encompasses all of the text inside that node (if any) and all the text following it in document order. In XPath terms, this would encompass all of the text nodes beginning at a particular node, and following it on the
959 SATSin A Point represents a dimensionless point between nodes or characters in a document. Every point is adjacent to either characters or elements, and never to another point. Points can only be referenced in relation to an element or text node in the document (i.e. something addressable by either an XPath or a fragment identifier). Points occur either immediately before or after an element, or at a numbered position inside a text stream. Position zero in the stream would be immediately before the first character. Note that points within attribute values cannot mark the beginning or end of a range extending beyond the attribute value, because points indicate a position within a document. Since attribute nodes are by definition un-ordered, they cannot be said to have a fixed position.
963 SATSin The TEI recommends the following seven pointer schemes:
967 SATSin Addresses a node or nodeset using the XPath syntax. (
974 SATSin addresses the point before (left) or after (right) a node or node set (
980 SATSin addresses a point inside a text node (
994 SATSin addresses a range which matches a specified string within a node (
1001 SATSin scheme refers to the existing XPath specification which is adopted with one modification: the default namespace for any XPath used as a parameter to this scheme is assumed to be the TEI namespace
1007 SATSin draft, but are individually much simpler. At the time of this writing, there is no current or scheduled activity at the W3C towards revising this draft or issuing it as a recommendation.
1009 SATSin A note on namespaces
1014 SATSin ) which when prepended to a resolvable pointer allows for the definition of namespace prefixes to be used in XPaths in subsequent pointers. TEI Pointer schemes assume that un-prefixed element names in TEI Pointer XPaths are in the TEI namespace,
1018 SATSin is thus optional, provided no new prefixes need to be defined. If the schemes described here are used to address non-TEI elements, then any new prefixes to be used in pointer XPaths may be defined using the
1030 SATSXP scheme locates a node within an XML Information Set. The single argument
1038 SATSXP scheme because they represent extracted values rather than locations in the source document. XPath expressions that address attribute nodes are only advisable in the
1042 SATSXP The example below, and all subsequent examples in this section, refer to the following TEI fragment
1075 SATSXP A TEI Pointer that referenced the "normalized" form in the
1076 SATSXP choice
1077 SATSXP in line 1 of the example might look like:
1081 SATSXP When an XPath is interpreted by a TEI processor, the information set of the referenced document is interpreted without any additional information supplied by any schema processing that may or may not be present. In particular this means that no whitespace normalization is applied to a document before the XPath is interpreted.
1087 SATSXP pointers more robust than the other mechanisms discussed in this section even if the designated document changes. For durability in the presence of editing, use of
1089 SATSXP is always recommended when possible.
1101 SATSL scheme locates the point immediately preceding the node addressed by its argument, which is either an
1105 SATSL , the value of an
1112 SATSL lb
1114 SATSL gap
1134 SATSR scheme locates the point immediately following the node addressed by its argument.
1139 SATSR lb
1156 SATSSI scheme locates a point based on character positions in a text stream relative to the node identified by the IDREF or XPATH parameter. The
1160 SATSSI . An offset of 0 represents the position immediately before the first character in either the first text node descendant of the node addressed in the first parameter or the first following text node, if the addressed element contains no text node descendants.
1165 SATSSI s
1170 SATSSI in line 2.
1184 SATSRN s, which are each members of the set
1196 SATSRN locates a (possibly non-contiguous) sequence beginning at the first POINTER parameter and ending at the last. If the POINTER locates a node (i.e. is an XPATH or IDREF), then that node is a member of the addressed sequence. If a sequence addressed by a range pointer overlaps, but does not wholly contain, an element (i.e. it contains only the start but not the end tag or vice-versa), then that element is not part of the sequence.
1199 SATSRN s may address sequences of non-contiguous nodes. For example, a range() might select text beginning before an
1201 SATSRN , encompassing the content of a single
1210 SATSRN line 4
1219 SATSRN indicates the sequence
1225 SATSRN indicates the non-contiguous sequence
1237 SATSSR The string-range() scheme locates a sequence based on character positions in a text stream relative to the node identified by the first parameter. The location of the beginning of the addressed sequence is determined precisely as for
1245 SATSSR parameter is a positive integer that denotes the length of the text stream captured by the sequence. As with
1247 SATSSR , the addressed sequence may contain text nodes and/or elements. The
1249 SATSSR scheme, can accept multiple OFFSET, LENGTH pairs to address a non-contiguous sequence in much the same way that range() can accept multiple pairs of pointers.
1251 SATSSR Because string-range() addresses points in the text stream, tags are invisible to it. For example, if an empty tag like
1253 SATSSR is encountered while processing a string-range(), it will be included in the resulting sequence, but the LENGTH count will not increment when it is captured.
1258 SATSSR line 5
1259 SATSSR from the text immediately following the
1260 SATSSR lb
1262 SATSSR ab
1267 SATSSR indicates the sequence
1273 SATSSR indicates the non-contiguous sequence
1285 SATSMA The match scheme locates a sequence based on matching the REGEX parameter against a text stream relative to the reference node identified by the first parameter. REGEX is a regular expression as defined by
1299 SATSMA are assumed to operate in multi-line mode. The end of the string to be matched against is either the end of the text contained by the element in the first parameter or the end of the document, if that parameter indicates an empty element. The meta-character
1301 SATSMA therefore matches the beginning of the text stream inside or following the reference node, and the meta-character
1305 SATSMA The optional INDEX parameter is an integer greater than 0 which specifies which match should be chosen when there is more than one possibility. If omitted, the first match in the text stream will be used.
1315 SATSMA indicates the sequence
1318 SATSMA line 5
1326 SATSMA unclear
1329 SATSMA , just their text children.
1343 SACR , chapter 5, verse 7.
1344 SACR They might then wish to translate the string
1357 SACR Several elements in the TEI scheme (
1367 SACR , just for this purpose. Using the system described in this section, an encoder may specify references to canonical works in a discipline-familiar format, and expect software to derive a complete URI from it. The value of the
1369 SACR attribute is processed as described in this section, and the resulting URI reference is treated as if it were the value of the
1379 SACR attribute to function as required, a mechanism is needed to define the mapping between (for example)
1385 SACR in the TEI header, which contains an algorithm for translating a canonical reference string (like
1421 SACR When an application encounters a canonical reference as the value of
1423 SACR attribute, it might follow this sequence of specific steps to transform it into a URI reference:
1436 SACR match the value of the
1438 SACR attribute to the regular expression found as the value of the
1442 SACR if the value of the
1446 SACR take the value of the
1448 SACR attribute and substitute the back references ($1, $2, etc.) with the corresponding matched substrings
1450 SACR the result is taken as if it were a relative or absolute URI reference specified on the
1454 SACR attribute value as usual
1456 SACR no further processing of this value of the
1460 SACR should take place
1464 SACR if, however, the value of the
1466 SACR attribute does not match the regular expression specified in the value of the
1478 SACR The regular expression language used as the value of the
1486 SACR tei
1487 SACR matches any string that contains
1488 SACR tei
1489 SACR , in the W3C language it only matches the string
1490 SACR tei
1492 SACR The value of the
1498 SACR are replaced by the corresponding substring match. Note that since a maximum of nine substring matches are permitted, the string
1501 SACR the value of the first matched substring followed by the character
1505 SACR . If there is a need for an actual string including a dollar sign followed by a digit that is not supposed to be replaced, the dollar sign should be written as
1519 SACRWE above, an application comes across a
1521 SACRWE value of
1529 SACRWE . The application would first apply the regular expression
1539 SACRWE . The application would then apply these substrings to the pattern
1549 SACRWE If, however, the input string had been
1551 SACRWE , the first regular expression would not have matched. The application would have then tried the second,
1557 SACRWE . It would then have substituted those matched substrings into the pattern
1559 SACRWE to produce a fragment identifier, which when appended to the
1565 SACRWE If the input string had been
1567 SACRWE , neither the first nor the second regular expressions would have successfully matched. The application would have then tried the third,
1586 SACRex In the above example, the value of
1639 SACRmu Canonical reference pointers are intended for use by TEI encoders. However, this specification might be useful to the development of a process for recognizing canonical references in non-TEI documents (such as plain text documents), possibly as part of their conversion to TEI.
1647 SASE In this section, we discuss three general-purpose elements which may be used to mark and categorize both a span of text and a point within one. These elements have several uses, most notably to provide elements which can be given identifiers for use when aligning or linking to parts of a document, as discussed elsewhere in this chapter. They also provide a convenient way of extending the semantics of the TEI markup scheme in a theory-neutral manner, by providing for two neutral or
1649 SASE elements to which the encoder can add any meaning not supplied by other TEI defined elements.
1690 SASE , it is useful where multiple views of a document are to be combined, for example, when a logical view based on paragraphs or verse lines is to be mapped on to a physical view based on manuscript lines. Like those elements, it is a member of the class
1692 SASE and can therefore appear anywhere within a document when the module defined by this chapter is included in a schema. Unlike the other elements in its class, the
1695 SASE , rather than as a means of marking segment boundaries for some arbitrary segmentation of a text.
1697 SASE For example, suppose that we wish to mark the end of the fifth word following each occurrence of some term in a particular text, perhaps to assist with some collocational analysis. This can most easily be done with the help of the
1712 SASE element may be used at the encoder's discretion to mark almost any segment of the text of interest for processing. One use of the element is to mark text features for which no appropriate markup is otherwise defined, i.e. as a simple extension mechanism. Another use is to provide an identifier for some segment which is to be pointed at by some other element, i.e. to provide a target, or a part of a target, for a
1720 SASE as a means of marking segments significant in a metrical or rhyming analysis (see section
1723 SASE as a means of marking typographic lines in drama (see section
1724 SASE ) or title pages (see section
1735 SASE element simply delimits the extent of a stutter, a textual feature for which no element is provided in these Guidelines.
1759 SASE elements may be nested directly within one another, to any degree of analysis considered appropriate. This is taken a little further in the following example, where the
1802 SASE to facilitate this particular kind of analysis. These allow for the explicit markup of units called
1829 SASE attribute of these specialized elements now carries the value carried by the
1833 SASE element. For an analysis not using these traditional linguistic categories however, the
1837 SASE In language corpora and similar material, the
1839 SASE element may be used to provide an end-to-end segmentation as an alternative to the more specific
1848 SASE element can then be used to mark both features within s-units and segments composed of s-units, as in the following example:
1850 SASE , where the text from which this fragment is taken is analyzed.
1864 SASE tag must be properly enclosed within other elements. Thus, a single
1866 SASE element can be used to group together words in different sentences only if the sentences are not themselves tagged. The first of the following two encodings is legal, but the second is not.
1890 SASE element has the same content as a paragraph in prose: it can therefore be used to group together consecutive sequences of
1892 SASE class elements, such as lists, quotations, notes, stage directions, etc. as well as to contain sequences of phrase-level elements. It cannot however be used to group together sequences of paragraphs or similar text units such as verse lines; for this purpose, the encoder should use intermediate pointers, as described in section
1894 SASE . It is particularly important that the encoder provide a clear description of the principles by which a text has been segmented, and the way in which that segmentation is represented. This should include a description of the method used and the significance of any categorization codes. The description should be provided as a series of paragraphs within the
1896 SASE element of the encoding description in the TEI header, as described in section
1901 SASE element may also be used to encode simultaneous or mutually exclusive variants of a text when the more special purpose elements for simple editorial changes, abbreviation and expansion, addition and deletion, or for a critical apparatus are not appropriate. In these circumstances, one
1903 SASE is encoded for each possible variant, and the set of them is enclosed in a
1907 SASE For example, if one were writing dual-platform instructions for installation of software, it might be useful to use
1916 SASE Elsewhere in this chapter we provide a number of examples where the
1924 SASE element, but is used for portions of the text which occur not within paragraphs or other component-level elements, but at the component level themselves. It is therefore a member of the
1930 SASE element may be used, for example, to tag the canonical verse divisions of Biblical texts:
1948 SASE In other cases, where the text clearly indicates paragraph divisions containing one or more verses, the
1950 SASE element may be used to tag the paragraphs, and the
1978 SASE element is also useful for marking dramatic speeches when it is not clear whether the speech is to be regarded as prose or verse. If, for example, an encoder does not wish to express an opinion as to whether the opening lines of Shakespeare's
2027 SACS , which is a special kind of correspondence involving an ordered set of correspondences. Both cases may be represented using the
2032 SACS . We also discuss the special case of alignment in time or
2034 SACS , for which special purpose elements are proposed in section
2040 SACS1 A common requirement in text analysis is to represent correspondences between two or more parts of a single document, or between places in different documents. Provided that explicit elements are available to represent the parts or places to be linked, then the global linking attribute
2055 SACS1 element should be used, if no other element is available. Where the correspondence is between
2059 SACS1 element should be used, if no other element is available.
2063 SACS1 attribute with spans of content is illustrated by the following example:
2081 SACS1 attributes. This mechanism is simple to apply, but has the drawback that it is not possible to specify more exactly what kind of correspondence is intended. Where this attribute is used, therefore, encoders are encouraged to specify their intent in the associated encoding description in the TEI header.
2139 SACSAL One very important application area for the alignment of parallel texts is multilingual corpora. Consider, for example, the need to align
2141 SACSAL of sentences drawn from a corpus such as the Canadian Hansard, in which each sentence is given in both English and French. Concerning this problem, Gale and Church write:
2142 SACSAL Most English sentences match exactly one French sentence, but it is possible for an English sentence to match two or more French sentences. The first two English sentences [in the example below] illustrate a particularly hard case where two English sentences align to two French sentences. No smaller alignments are possible because the clause
2144 SACSAL in the first English sentence corresponds to (part of) the second French sentence. The next two alignments ... illustrate the more typical case where one English sentence aligns with exactly one French sentence. The final alignment matches two English sentences to a single French sentence. These alignments [which were produced by a computer program] agreed with the results produced by a human judge.
2146 SACSAL , from which the example in the text is taken.
2148 SACSAL The alignment produced by Gale and Church's program can be expressed in four different ways. The encoder must first decide whether to represent the alignment in terms of points within each text (using the
2152 SACSAL element. To some extent the choice will depend on the process by which the software works out where alignment occurs, and the intention of the encoder. Secondly, the encoder may elect to represent the actual encoding using either
2183 SACSAL attribute be specified in both English and French texts, since (as noted above) this attribute is defined as representing a mutual association. However, it may simplify processing to do so, and also avoids giving the impression that the English is translating the French, or vice versa. More seriously, this encoding does not make explicit that it is in fact the entire stretch of text between the anchors which is being aligned, not simply the points themselves. If for example one text contained material omitted from the other, this approach would not be appropriate.
2239 SACSXA The preceding encoding of the alignment of parallel passages from two texts requires that those texts and the alignment all be part of the same document. If the texts are in separate documents, then complete URIs, whether absolute or relative (section
2240 SACSXA ), will be required. These external pointers may appear anywhere within the document, but if they are created solely for use in encoding links, they may for convenience be grouped within the
2250 SACSXA Each topic covered in this work has three parts: a picture, a prose text in Latin describing the topic, and a carefully-aligned translation of the Latin into English, German, or some other vernacular. Key terms in the two texts are typographically distinct, and are linked to the picture by numbers, which appear in the two texts and within the picture as well.
2252 SACSXA First, we consider the text portions. The English and Latin portions have been encoded as distinct
2299 SACSXA Next we consider the non-textual parts of the page. Encoding this requires providing two distinct components: firstly a digitized rendering of the page itself, and secondly a representation of the areas within that image which are to be aligned. In section
2309 SACSXA This example of SVG defines two rectangles at the locations with the specified x and y coordinates. A view is defined on these, enabling them to be mapped by an SVG processor to the image found at the URL specified (
2312 SACSXA ; for further discussion of using non-TEI XML vocabularies such as SVG within a TEI document, see section
2315 SACSXA As printed, the Comenius text exhibits three kinds of alignment.
2321 SACSXA Particular words or phrases are marked as terms in the two languages by a change of rendition: the English text, which otherwise uses black letter type throughout, has the words
2339 SACSXA Numbered labels appear within the text portions, linking keywords to each other and to sections of the picture. These labels, which have been left out of the above encoding, are attached to the first, third, and last segments in each language quoted below, and also appear (rather indistinctly) within the picture itself. Thus, the images of the study, the student, and his books are each aligned with the correct term for them in the two languages.
2375 SACSXA This map, of course, only aligns whole segments and image portions, since these are the only parts of our encoding which bear identifiers and can therefore be pointed to. To add to it the alignment between the typographically distinct words mentioned above, new elements must be defined, either within the text itself or externally by using stand-off techniques. Encoding these word pairs as
2379 SACSXA , although intuitively obvious, requires a non-trivial decision as to whether the Latin text is glossing the English, or vice versa. Tagging all the marked words as
2381 SACSXA avoids the difficult decision, but might be thought by some encoders to convey the wrong information about the words in question. Simply tagging them as additional embedded
2385 SACSXA These solutions all require the addition of further markup to the text. This may pose no problems, or it may be infeasible, for example because the text is held on a read-only medium. If it is not feasible to add more markup to the original text, some form of stand-off markup will be needed. Any item within the text that can be pointed to using the various pointer schemes discussed in this chapter may be used, not simply those which rely on the existence of an
2410 SACSXA To express the same alignment mentioned above, we could use an XPath expression to identify the required
2422 SACSXA correspond, we might express the link between them as follows:
2429 SASY In the previous section we discussed two particular kinds of alignment: alignment of parallel texts in different languages; and alignment of texts and portions of an image. In this section we address another specialized form of alignment: synchronization. The need to mark the relative positions of text components with respect to time arises most naturally and frequently in transcribed spoken texts, but it may arise in any text in which quoted speech occurs, or events are described within a time frame. The methods described here are also generalizable for other kinds of alignment (for example, alignment of text elements with respect to space).
2434 SASYNC Provided that explicit elements are available to represent the parts or places to be synchronized, then the global linking attribute
2443 SASYNC elements may be used to make explicit the fact that the synchronous elements are aligned.
2445 SASYNC To illustrate the use of these mechanisms for marking synchrony, consider the following representation of a spoken text:
2447 SASYNC B: The first time in twenty five years, we've cooked Christmas (unclear) for a blooming great load of people. A: So you're [1] (unclear) [2] B: [1] It will be [2] nice in a way, but, [3] be strange. [4] A: [3] Yeah [4], yeah, cos it, it's [5] the [6] B: [5] not [6]
2456 SASYNC To encode this we use the spoken texts module, described in chapter
2516 SASYNC As with other forms of alignment, synchronization may be expressed between stretches of speech as well as between points. When complete utterances are synchronous, for example, if one person says
2529 SASYNC (where one speaker starts speaking before another has finished) is thus to use the
2548 SASYNC element and the content of a
2550 SASYNC element, and between the content of an
2563 SASYMP A synchronous alignment specifies which points in a spoken text occur at the same time, and the order in which they occur, but does not say at what time those points actually occur. If that information is available to the encoder it can be represented by means of the
2573 SASYMP attribute, whose value is a string which specifies a particular time, or indirectly by means of the
2579 SASYMP is used, then the
2583 SASYMP attributes should also be used to indicate the amount of time that has elapsed since the time specified by the element pointed to by the
2585 SASYMP attribute; the value
2591 SASYMP elements are uniformly spaced in time, then the
2599 SASYMP elements. If the intervals vary, but the units are all the same, then the
2615 SASYMP element which specifies the reference or origin for the timings within the
2617 SASYMP ; this must, of course, specify its position in time absolutely. If the origin of a timeline is unknown, then this attribute may be omitted.
2643 SASYMP To avoid the need for two distinct link groups (one marking the synchronization of anchors with each other, and the other marking their alignment with points on the time line) it would be better to link the
2656 SASYMP Finally, suppose that a digitized audio recording is also available, and an XML file that assigns identifiers to the various temporal spans of sound is available. For example, the following Synchronized Multimedia Integration Language (SMIL, pronounced "smile") fragment:
2682 SAIE , that is, an element which is not explicitly present in a text, but the presence of which an application can infer from the encoding supplied. In this section, we are concerned with virtual elements made by simply cloning existing elements. In the next section (
2685 SAIE Provided that explicit elements are available to represent the parts or places to be linked, then the global linking attributes
2694 SAIE It is useful to be able to represent the fact that one element of text is identical to others, for analytical purposes, or (especially if the elements have lengthy content) to obviate the need to repeat the content. For example, consider the repetition of the
2708 SAIE element above has identical content to the first. The
2710 SAIE attribute is provided for this purpose. Using it, we could recode the last line of the above example as follows:
2716 SAIE attribute may be used to document the fact that two elements have identical content. It may be regarded as a special kind of link. It should only be attached to an element with identical content to that which it targets, or to one the content of which clearly designates it as a repetition, such as the word
2720 SAIE in the representation of the chorus of a song, the second time it is to be sung. The relation specified by the
2722 SAIE attribute is symmetric: if a chorus is repeated three times and each repetition bears a
2728 SAIE attribute is used in a similar way to indicate that the content of the element bearing it is identical to that of another. The difference is that the content is not itself repeated. The effect of this attribute is thus to create a
2730 SAIE of the element indicated. Using this attribute, the repeated date in the first example above could be recoded as follows:
2732 SAIE An application program should replace whatever is the actual content of an element bearing a
2734 SAIE attribute with the content of the element specified by it. If the content of the element specified includes other elements, these will become embedded within the element bearing the attribute. Care must be taken to ensure that the document is valid both before and after this embedding takes place. If, for example, the element bearing a
2736 SAIE attribute requires a mandatory sub-component, then this component must be present (though possibly empty), even though it will be replaced by the content of the targeted element.
2790 SAAG Because of the strict hierarchical organization of elements, or for other reasons, it may not always be possible or desirable to include all the parts of a possibly fragmented text segment within a single element. In section
2791 SAAG we introduced the notion of an intermediate pointer as a way of pointing to discontinuous segments of this kind. In this section we first describe another way of linking the parts of a discontinuous whole, using a set of linking attributes, which are made available for any tag by following the procedure described at the beginning of this chapter. We then describe how the
2795 SAAG element, which is a special-purpose linking element specifically for representing the aggregation of parts, and the
2801 SAAG The linking attributes for aggregation are
2814 SAAG Here is the material on which we base our first illustration of the use of these mechanisms. Our problem is to represent the s-units identified below as
2844 SAAG attributes, we can link the s-units with identifiers
2854 SAAG Double linking of the two s-units, as illustrated by the last of these encodings, is equivalent to specifying a
2862 SAAG attribute with a value of
2863 SAAG join
2864 SAAG to specify that the link is to be understood as joining its targets into a single aggregate.
2871 SAAG join
2883 SAAG element within a text is significant: it must be supplied at a position where the element indicated by its
2893 SAAG As a further example, consider the following list of authors' names. The object of the
2895 SAAG element here is to provide another list, composed of those authors from the larger list who happen to come from Heidelberg:
2917 SAAG can be used to reconstruct a text cited in fragments presented out of order. The poem being remembered (an unusual translation of a well-known poem by Basho) runs
2958 SAAG is available for use when a number of
2964 SAAG if they are all of the same type, and also allows us to restrict the domain within which their target elements are to be found, in the same way as for
2971 SAAG may appear only where the elements represented by its contents are legal. Thus if we had created many
2973 SAAG tags of the sort just described, we could group them together, and require that their components are all contained by an element with the identifier
2985 SAAG ). It may also be used as a convenient way of representing a variety of analytic units, like the
2998 SAAG And then he added,
3011 SAAG Suppose now that we wish to represent an interpretation of the above passage in which we distinguish between the various
3015 SAAG attribute has been used for this purpose; its value on each occasion supplies a pointer to the
3017 SAAG to which each speech is attributed. (For convenience in this example, we use simply the first occurrence of the names used for each voice as the target for these pointers.) Note also that we add
3019 SAAG attributes to each distinct speech fragment, which we can then use to link the material spoken by each voice:
3060 SAAG s making up the
3068 SAAG value for them.
3147 SAAT if any of those elements could be present in a text, but one and only one of them is; in addition, we say that those elements are
3151 SAAT if at least one (and possibly more) of them is present. The elements that are in alternation may also be called
3155 SAAT The need to mark exclusive alternation arises frequently in text encoding. A common situation is one in which it can be determined that exactly one of several different words appears in a given location, but it cannot be determined which one. One way to mark such an exclusive alternation is to use the linking attribute
3157 SAAT . Having marked an exclusive alternation, it can sometimes later be determined which of the alternants actually appears in the given location. To preserve the fact that an alternation was posited, one can add the linking attribute
3159 SAAT to a tag which hierarchically encompasses the alternants, which points to the one which actually appears. To assign responsibility and degree of certainty to the choice, one can use the
3161 SAAT tag described in chapter
3162 SAAT . Also see that chapter for further discussion of certainty in general.
3172 SAAT A more general way to mark alternation, encompassing both exclusive and inclusive alternation, is to use the linking element
3174 SAAT . The description and attributes of this tag and of the associated grouping tag
3180 SAAT To take a simple hypothetical example, suppose in transcribing a spoken text, we encounter an utterance that we can understand either as
3193 SAAT If it is then determined that the speaker said
3197 SAAT , the encoder could amend the text by deleting the alternant containing
3203 SAAT value to the
3205 SAAT attribute value on the
3225 SAAT seg type="word"
3227 SAAT seg type="character"
3252 SAAT , but is certain that if it is
3254 SAAT , then the other uncertain word is definitely
3290 SAAT The value of the
3292 SAAT attribute is defined as a list of identifiers; hence it can also be used to narrow down the range of alternants, as in:
3302 SAAT element tag appears, and is thus equivalent to just the alternation of those two tags:
3311 SAAT attribute can also be used in case there is uncertainty about the tag that appears in a certain position. For example, the occurrence of the word
3315 SAAT can be interpreted, in the absence of other information, either as a person's name or as a date. The uncertainty can be rendered as follows, using the
3326 SAAT ; this avoids having to repeat the content of the element whose correct tagging is in doubt.
3341 SAAT element in the body of a document, or as the first
3358 SAAT attribute, if used, would appear on the
3384 SAAT Now we define the specialized linking element
3407 SAAT , which is to be used if one wishes to assign
3409 SAAT to the targets (alternants). Its value is a list of numbers, corresponding to the targets, expressing the probability that each target appears.
3410 SAAT If the alternants are mutually exclusive, then the weights must sum to 1.0.
3467 SAAT alt mode="incl"
3472 SAAT is the number of targets. If the sum is 0%, then the alternation is equivalent to exclusive alternation; if the sum is (100 x k)%, then all of the alternants must appear, and the situation is better encoded without an
3486 SAAT attribute defaults to the value
3498 SAAT , but that if the first word is
3500 SAAT , then the third word is
3502 SAAT . Now suppose we have the following additional information: if
3504 SAAT occurs, then the probability that
3508 SAAT occurs is 50%; if
3510 SAAT occurs, then the probability that
3530 SAAT As noted above, when the
3534 SAAT has the value
3536 SAAT , then each weight states the probability that the corresponding alternative occurs, given that at least one of the other alternatives occurs.
3546 SAAT Another very similar example is the following regarding the text of a Broadway song. In three different versions of the song, the same line reads
3552 SAAT The variant readings are found in the commercial sheet music, the performance score, and the Broadway cast recording.
3564 SAAT Let us extend the example with a further (imaginary) variation, supposing for the sake of the argument that the next line is variously given as
3570 SAAT element, we can express the conviction that if the first choice for the second line is correct, then the probability that the first line contains
3572 SAAT is 90%, and each of the others 5%; whereas if the second choice for the second line is correct, then the probability that the first line contains
3616 SASOin Most of the mechanisms defined in this chapter rely to a greater or lesser extent on the fact that tags in a marked-up document can both assert a property for a span of text which they enclose, and assert the existence of an association between themselves and some other span of text elsewhere. In stand-off markup, there is a clear separation of these two behaviours: the markup does not directly contain any part of the text, but instead includes it by reference. One specific mechanism recommended by these Guidelines for this purpose is the standard XInclude mechanism defined by the W3C; another is to use pointers as demonstrated elsewhere in this chapter.
3618 SASOin There are many reasons for using stand-off markup: the source text might be read-only so that additional markup cannot be added, or a single text may need to be marked up according to several hierarchically incompatible schemes, or a single scheme may need to accommodate multiple hierarchical ambiguities, so that a single markup tree is not the most faithful representation of the source material.
3628 SASOin source document
3631 SASOin a document to which the stand-off markup refers (a source document can be either XML or plain text); there may be more than one source document.
3637 SASOin markup that is already present in an XML source document
3643 SASOin markup that is either outside of the source document and points in to it to the data it describes, or alternatively is in another part of the source document and points elsewhere within the document to the data it describes
3649 SASOin a document that contains stand-off markup that points to a different, source document
3655 SASOin the action of creating a new XML document with external markup and data integrated with the source document data, and possibly some source document markup as well
3661 SASOin a process applied to markup from a pre-existing XML document, which splits it into two documents, an XML (external) document containing some of the markup of the original document, and another (source) XML document containing whatever text content and markup has not been extracted into the stand-off document; if all markup has been externalized from a document, the new source may be a plain text document
3667 SASOin any valid TEI markup can be either internal or external,
3669 SASOin external markup can be internalized by applying it to the document content by either substituting the existing markup or adding to it, to form a valid TEI document, and
3679 SASOov Stand-off markup which relies on the inclusion of virtual content is adequately supported by the W3C XInclude recommendation, which is also recommended for use by these Guidelines.
3680 SASOov The version on which this text is based is the
3685 SASOov XInclude defines a namespace (
3695 SASOov discussed elsewhere in this chapter to point to the actual fragments of text to be internalized. Although XInclude only requires support for the
3700 SASOov XInclude is a W3C recommendation which specifies a syntax for the inclusion within an XML document of data fragments placed in different resources. Included resources can be either plain text or XML. XInclude instructions within an XML document are meant to be replaced by a resource targeted by a URI, possibly augmented by an XPointer that identifies the exact subresource to be included.
3706 SASOov attribute to specify the location of the resource to be included; its value is a URI containing, if necessary, an XPointer. Additionally, it uses the
3709 SASOov text
3712 SASOov ) to specify whether the included content is plain text or an XML fragment, and the
3714 SASOov attribute to provide a hint, when the included fragment is text, of the character encoding of the fragment. An optional
3718 SASOov ; it specifies alternative content to be used when the external resource cannot be fetched for some reason. Its use is not however recommended for stand-off markup.
3722 SASOso Stand-off Markup in TEI
3726 SASOso internalization of one or more source documents' content into a stand-off document. TEI use of XInclude for stand-off markup enables use of XInclude-conformant software to perform this useful operation. However, internalization is not clearly defined for all stand-off files, because the structure of the internal and external markup trees may overlap. In particular, when an external markup document selects a range that overlaps partial elements in the source document, it is not clear how the semantics of internalization (inclusion) should work, since partial elements are not XML objects.
3728 SASOso XInclude defines a semantics for this case that involves only complete elements.
3730 SASOso When a range selection partially overlaps a number of elements in a source document, XInclude specifies that the partially overlapping elements should be included as well as all completely overlapping elements and characters (partially overlapping characters are not possible). The effect of this is that elements that straddle the start or end of a selected range will be included as wrappers for those of their children that are completely or partially selected by the range. For example, given the following source document:
3746 SASOso The result of the inclusion is two paragraph elements, while the original range designated in the source document overlapped two paragraph fragments.
3747 SASOso The semantics of XInclude require the creation of well-formed XML results even though the pointing mechanisms it uses do not necessarily respect the hierarchical structure of XML documents, as in this case. While this is a good way to ensure that internalization is always possible, it has implications for the use of XInclude as a notation for the
3751 SASOso When overlapping hierarchies need to be represented for a single document, each hierarchy must be represented by a separate set of XInclude tags pointing to a common source document. This sort of structure corresponds to common practice in work with linguistic text corpora. In such corpora, each potentially overlapping hierarchy of elements for the text is represented as a separate stream of stand-off markup. Generally the source text contains markup for the smallest significant units of analysis in the corpus, such as words or morphemes, this information and its markup representing a layer of common information that is shared by all the various hierarchies. As a way of organizing the representation of complex data, this technique generally allows a large number of
3753 SASOso attributes to be attached to the shared elements, providing robust anchors for links and facilitating adjustments to the source document without breaking external documents that reference it.
3756 SASOso Any tag can be externalized by
3757 SASOso removing its content and replacing it with an
3761 SASOso For instance the following portion of a TEI document:
3777 SASOso can be externalized by placing the actual text in a separate document, and providing exactly the same markup with the
3793 SASOso Please note that this specification requires that the XInclude namespace declaration is present in all cases. The
3795 SASOso element contains text or XML fragments to be placed in the document if the inclusion fails for any reason (for instance due to inaccessibility of an external resource). The
3797 SASOso element is optional; if it is not present an XInclude processor must signal a fatal error when a resource is not found. This is the preferred behaviour for use with stand-off markup. These Guidelines recommend against the use of
3805 SASOva The whole source fragment identified by an XInclude element, as well as any markup therein contained, is inserted in the position specified, and an XInclude processor is required to ensure that the resulting internalized document is well-formed. This has obvious implications when the external document contains XML markup. A plain text source document will always create a well-formed internalized document.
3807 SASOva While a TEI customization may permit
3809 SASOva elements in various places in a TEI document instance, in general these Guidelines suggest that validity be verified after the resolution of all the
3817 SASOfr When the source text is plain text the overall form of the XPointer pointing to it is of minimal importance. The form of the XPointer matters considerably, on the other hand, when the source document is XML.
3819 SASOfr In this case, it is rather important to distinguish whether we intend to substitute the source XML with the new one, or just to add new markup to it. The XPointers used in the references can express both cases.
3851 SASOfr will select the whole poem, text content
3857 SASOfr hypertext links (NB: in XPointer whitespace-only text nodes count).
3863 SASOfr will only select the text of the poem, with no markup inside.
3881 SAAN and elsewhere, provision is made for analytic and interpretive markup to be represented outside of textual markup, either in the same document or in a different document. The elements in these separate domains can be connected, either with the pointing attributes
3884 SAAN analysis
3904 linking Linking, segmentation and alignment
3913 SAref The selection and combination of modules to form a TEI schema is described in
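
Note on rows 3409, 3410 and 3472 (SAAT) above: those rows state that the weights are a list of numbers, one per target, that they must sum to 1.0 when the alternants are mutually exclusive, and that in the inclusive case the sum can range from 0% up to (100 x k)% for k targets. The following is a minimal Python sketch of that check, not part of the TEI tooling. The function name, the tolerance, and the mode string "excl" are assumptions made for illustration; only "incl" appears in the table (row 3467).

```python
import math


def weights_plausible(weights, mode="excl", tol=1e-9):
    """Hypothetical helper: check a list of alternation weights.

    Each weight is treated as a probability in [0, 1].
    mode="excl" (assumed name): the weights must sum to 1.0.
    mode="incl": the sum may lie anywhere between 0 and k,
    where k is the number of targets.
    """
    # Every individual weight must itself be a valid probability.
    if any(w < 0.0 or w > 1.0 for w in weights):
        return False
    total = sum(weights)
    if mode == "excl":
        # Compare with a tolerance, since the weights are floats.
        return math.isclose(total, 1.0, abs_tol=tol)
    # Inclusive alternation: bounded by the number of targets.
    return 0.0 <= total <= len(weights)
```

For instance, the 90% / 5% / 5% split mentioned in row 3572 would be passed as [0.9, 0.05, 0.05] and accepted under the exclusive reading.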

BIB-Bibliography_first_move_ptr_try.xml#12280

# id text
23 VEMEana-eg-23 Doglia mi reca ne lo core ardire
79 TSSASE-eg-20 Structures of social action: Studies in conversation analysis
343 NDPER-eg-17 membrane 5, entry 154
441 VEST-eg-4 2nd edition
566 DIC-CP Collins Pocket Dictionary of the English language
586 SA-BIBL-2 Orbis Pictus: a facsimile of the first English edition of 1659
603 PHegsurp2 Poeti del Duecento
853 COEDADD-eg-89 The waste land: a facsimile and transcript of the original drafts including the annotations of Ezra Pound
883 DS-eg-05 Is there a text in this class? The authority of interpretive communities
922 FTGRA-eg-18 2nd edition
1006 COHQU-eg-43 Natural language processing in Prolog
1257 DRSTA-eg-40 Everyman's library: the drama
1289 COBICOR-eg-248 ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure
1473 COHQQ-eg-33 note 12
1600 DRPRO-eg-7 epilogue
1634 STGA-eg-9 Crofts American history series
1703 TSBA-eg-19 The approach of the Text Encoding Initiative to the encoding of spoken discourse
1723 MS-eg-001 A summary catalogue of western manuscripts in the Bodleian Library at Oxford which have not hitherto been catalogued ...
1733 MS-eg-001 P5-MS: A general purpose tagset for manuscript description
1762 STGA-eg-10 Crofts American history series
1931 TSSASE-eg-37 Report on the compatibility of J P French's spoken corpus transcription conventions with the TEI guidelines for transcription of spoken texts
1958 GDFT-eg-12 Partial family tree for Bertrand Russell
2322 DSBACK-eg-83 index to vol. 1
2556 WHITMS1 "[I am a curse]" in
2562 WHITMS2 Single leaf of Notes for a poem about night "visions," possibly related to the untitled 1855 poem that Whitman eventually titled "The Sleepers." Fragments of an unidentified newspaper clipping about the Puget Sound area have been pasted to the leaf. The Trent Collection of Walt Whitman Manuscripts, Duke University Rare Book, Manuscript, and Special Collections Library.
3666 BIB Works cited elsewhere in the text of the Guidelines
3752 Burnard1995b The Design of the TEI Encoding Scheme
4361 SG-BIBL-2 Refining our notion of what text really is: the problem of overlapping hierarchies
4630 CO-BIBL-1 An international handbook of the science of language and society
4767 TS-BIBL-3 TEI document TEI AI2 W1
4912 DI-BIBL-3 TEI working paper TEI AIW20
5015 DI-BIBL-6 Principles for Encoding machine readable dictionaries
5069 DI-BIBL-8 Electronic dictionary encoding: customizing the TEI Guidelines
5609 NH-BIBL-7 The layered markup and annotation language
5661 FS-BIBL-01 A rationale for the TEI recommendations for feature-structure markup,
5728 ISO-690 ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure
5740 ISO-12620 ISO 12620:2009: Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources
5750 RICA Istituto Centrale per il Catalogo Unico
5752 RICA Regole italiane di catalogazione per autori
5819 BIB-RDG Reading list
5821 BIB-RDG The following lists of readings in markup theory and the TEI derive from work originally prepared by Susan Schreibman and Kevin Hawkins for the TEI Education Special Interest Group, recoded in TEI P5 by Sabine Krott and Eva Radermacher. They should be regarded only as a snapshot of work in progress, to which further contributions and corrections are welcomed (see further
6296 Burnard1999 Closing plenary address at the XML Europe Conference, Granada, May 1999
6374 Burnard2001a Dalle «Due Culture» Alla Cultura Digitale: La Nascita del Demotico Digitale
6490 Burnard2005b Metadata for corpus work
7447 Pichler1995 Culture and Value: Philosophy and the Cultural Sciences. Beiträge des 18. Internationalen Wittgenstein Symposiums 13–20. August 1995 Kirchberg am Wechsel
7450 Pichler1995 Kirchberg am Wechsel
8357 Unsworthetaleds2004 TEI Consortium
8495 BIB-RDG TEI
8609 BaumanandCatapano1999 TEI and the Encoding of the Physical Structure of Books
8639 Bauman2005 TEI HORSEing Around
8720 Burnard1993 Rolling your own with the TEI
8836 Burnard1997 Prepared for a seminar on Etiquetación y extracción de información de grandes corpus textuales within the Curso Industrias de la Lengua (14–18 de Julio de 1997). Sponsored by the Fundacion Duques de Soria.
8853 BurnardandPopham1999 Putting Our Headers Together: A Report on the TEI Header Meeting 12 September 1997.
8916 Ciottied2005 Il Manuale TEI Lite: Introduzione Alla Codifica Elettronica Dei Testi Letterari
8936 Chang2001 The Implications of TEI
8982 DigitalLibraryFederation1998 TEI and XML in Digital Libraries: Meeting June 30 and July 1, 1998, Library of Congress, Summary/Proceedings
8998 DigitalLibraryFederation2007 TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices
9096 Loiseaunodate Introduction à la TEI
9120 MarkoandKelleher2001 Descriptive Metadata Strategy for TEI Headers: A University of Michigan Library Case Study
9150 Mertz2003 XML Matters: TEI — the Text Encoding Initiative
9264 Rahtz2003 Building TEI DTDs and Schemas on demand
9296 Rahtzetal2004 A unified model for text markup: TEI, Docbook, and beyond
9356 Robinsonnodate Making a Digital Edition with TEI and Anastasia
9374 Seaman1995 The Electronic Text Center Introduction to TEI and Guide to Document Preparation
9394 Simons1999 Using Architectural Forms to Map TEI Data into an Object-Oriented Database
9424 Smith1999 Textual Variation and Version Control in the TEI
9556 Vanhoutte2004 An Introduction to the TEI and the TEI Consortium

USE.xml#13163

# id text
2 USE Using the TEI
4 USE This section discusses some technical topics concerning the deployment of the TEI markup scheme documented elsewhere in these Guidelines.
6 USE we discuss the scope and variety of the TEI customization mechanisms, distinguishing between
8 USE modifications, which result in a schema that supports a subset of the distinctions made in the full TEI system, on the one hand, from
12 USE TEI Conformance
13 USE , distinguishing documents which are algorithmically TEI-conformant ("TEI-conformable") from those which are intrinsically conformant ("TEI-conformant"); we also define the concept of a TEI extension. Since the ODD markup description language defined in chapter
14 USE is fundamental to the way conformance and customization are handled in the TEI system, these two definitional sections are followed by a section (
20 MEDIATYPE Serving TEI files with the TEI Media Type
22 MEDIATYPE In February 2011, the media type
28 MEDIATYPE ). We recommend that any XML file whose root element is in the TEI namespace be served with the media type
30 MEDIATYPE to enable and encourage automated recognition and processing of TEI files by external applications.
33 DT Obtaining the TEI Schemas
36 DT , the modules making up the TEI scheme are generated from a single set of XML source files. Schemas can be generated for TEI customizations in each of XML DTD language, W3C schema language, and RELAX NG schema language. In the body of the Guidelines, only the last of these forms is presented, using the compact syntax.
38 DT The TEI schemas and Guidelines are widely available over the Internet and elsewhere. The canonical home for the TEI source, the schema fragments generated from it, and example modifications, is the TEI repository at
39 DT ; versions are also available in other formats, along with copies of the Guidelines and related materials, from the TEI web site at
46 MD These Guidelines provide an encoding scheme suitable for encoding a very wide range of texts, and capable of supporting a wide variety of applications. For this reason, the TEI scheme supports a variety of different approaches to solving similar problems, and also defines a much richer set of elements than is likely to be necessary in any given project. Furthermore, the TEI scheme may be extended in well-defined and documented ways for texts that cannot be conveniently or appropriately encoded using what is provided. For these reasons, it is almost impossible to use the TEI scheme without customizing or personalizing it in some way.
48 MD This section describes how the TEI encoding scheme may be customized, and should be read in conjunction with chapter
49 MD , which describes how a specific application of the TEI encoding scheme should be documented. The documentation system described in that chapter is, like the rest of the TEI scheme, independent of any particular schema or document type definition language.
51 MD Formally speaking, these Guidelines provide both syntactic rules about how elements and attributes may be used in valid documents and semantic recommendations about what interpretation should be attached to a given syntactic construct. In this sense, they provide both a
56 MD TEI Abstract Model
57 MD , which defines a set of related concepts, and the
58 MD TEI schema
59 MD which defines a set of syntactic rules and constraints. Many (though not all) of the semantic recommendations are provided solely as informal descriptive prose, though some of them are also enforced by means of such constructs as datatypes (see
62 MD them in the sense of attaching slightly variant semantics to them.
68 MD which can take on arbitrary string values, depending on how it is used in a document. A new type of
69 MD note
70 MD , therefore, requires no change in the existing model. On the other hand, for many applications, it may be desirable to constrain the possible values for the
72 MD attribute to a small set of possibilities. A schema modified in this way would no longer necessarily regard as valid the same set of documents as the corresponding unmodified TEI schema, but would remain faithful to the same conceptual model.
74 MD This section explains how the TEI scheme can be customized by suppressing elements, modifying classes of elements, adding elements, and renaming elements. Documents which validate against an application of the TEI scheme which has been customized in this way may or may not be considered
79 MD The TEI scheme is designed to support modification and customization in a documented way that can be validated by an XML processor. This is achieved by writing a small TEI-conformant document, from which an appropriate processor can generate both human-readable documentation, and a schema expressed in a language such as RELAX NG or DTD. The mechanisms used to instantiate a TEI schema differ for different schema languages, and are therefore not defined here. In XML DTDs, for example, extensive use is made of parameter entities, while in RELAX NG schemas, extensive use is made of patterns. In either case, the names of elements and, wherever possible, their attributes and content models are defined indirectly. The syntax used to implement this indirection also varies with the schema language used, but the underlying constructs in the TEI Abstract Model are given the same names.
82 MD , the TEI encoding scheme comprises a set of class and macro declarations, and a number of
84 MD . Each module is made up of element and attribute declarations, and a schema is made by combining a particular set of modules together. In the absence of any other kind of personalization, when modules are combined together:
88 MD each such element is identified by the canonical name given it in these Guidelines;
90 MD the content model of each such element is as defined by these Guidelines;
94 MD the elements comprising element classes and the meaning of macro declarations expressed in terms of element classes are determined by the particular combination of modules selected.
95 MD The TEI personalization mechanisms allow the user to control this behaviour as follows:
97 MD particular elements may be suppressed, removing them from any classes in which they are members, and also from any generated schema;
99 MD within certain limits, the name (generic identifier) associated with an element may be changed, without changing the semantic or syntactic properties of the element;
101 MD new elements may be added to an existing class, thus making them available in macros or content models defined in terms of those classes;
103 MD additional attributes, or attribute values, may be specified for an individual element or for classes of elements;
105 MD within certain limits, attributes, or attribute values, may also be removed either from an individual element or from classes of elements;
107 MD the characteristics inherited by one class from another class may be modified by modifying its class membership: all members of the class then inherit the changed characteristics;
109 MD the set of values legal for an attribute or attribute class may be constrained or relaxed by supplying or modifying a value list, or by modifying its datatype.
114 MD ; in the remainder of this section we give specific examples to illustrate how that system may be applied. An ODD processor, such as the Roma application supported by the TEI, or any other comparable set of stylesheets, will use the declarations provided by an ODD to generate appropriate sets of declarations in a specific schema language such as RELAX NG or the XML DTD language. We do not discuss in detail here how this should be done, since the details are schema language-specific; some background information about the methods used for XML DTD and RELAX NG schema generation is however provided in section
115 MD . Several example ODD files are also provided as part of the standard TEI release: see further section
126 MDMD modification of content models;
135 MDMD Each kind of modification changes the set of documents that will be considered valid according to the resulting schema. Any combination of unchanged TEI modules may be thought of as defining a certain set of documents. Each schema resulting from a modified combination of TEI modules will define a different set of documents. The set of documents valid according to the unmodified schema may or may not be properly contained in the set of documents considered to be valid according to the modified schema. We use the term
137 MDMD to describe a modification which regards as valid a subset of the documents considered valid by the same combination of TEI modules unmodified. Alternatively, the set of documents considered valid by the original schema might be disjoint from the set of documents considered valid by the modified schema, with neither being properly contained by the other. Modifications that have this result are called
141 MDMD Cleanliness can only be assessed with reference to elements in the TEI namespace.
145 MDMDSU The simplest way to modify the supplied modules is to suppress one or more of the supplied elements. This is simply done by setting the
153 MDMDSU For example, if the
158 MDMDSU attribute here supplies the canonical name of the element to be deleted, the
162 MDMDSU attribute specifies what is to be done with it. Note that the module name must be supplied explicitly, and that the schema specification in which this declaration appears must also contain a reference to the module itself. The full specification for a schema in which this modification is applied would thus be something like the following:
169 MDMDSU In most cases, deletion is a clean modification, since most elements are optional. Documents that are valid with respect to the modified schema are also valid according to the unmodified schema. To say this another way, the set of documents matching the new schema is contained by the set of documents matching the original schema.
171 MDMDSU There are however some elements in the TEI scheme which have mandatory children; for example, the element
185 MDMDSU In general, whenever the element deleted by a modification is mandatory within the content model of some other (undeleted) element, the result is an unclean modification, and may also break the TEI Abstract Model (
186 MDMDSU ). However, the parent of a mandatory child can be safely removed if it is itself optional.
188 MDMDSU To determine whether or not an element is mandatory in a given context, the user must inspect the content model of the element concerned. In most cases, content models are expressed in terms of model classes rather than elements; hence, removing an element will generally be a clean modification, since there will generally be other members of the class available. If a class is completely depopulated by a modification, then the cleanliness of the modification will depend upon whether the class reference is mandatory or optional, in the same way as for an individual element.
193 MDMDNM Every element and other named markup construct in the TEI scheme has a
194 MDMDNM canonical name
195 MDMDNM , usually in the English language: this name is supplied as the value of the
205 MDMDNM used to define it. The element or attribute declaration used within a schema generated from that specification may however be different, thus permitting schemas to be written using elements with generic identifiers from a different language, or otherwise modified. There may be many alternative identifiers for the same markup construct, and an ODD processor may choose which of them to use for a given purpose. Each such alternative name is supplied by means of an
220 MDMDNM now takes the value
221 MDMDNM change
222 MDMDNM to indicate that those parts of the element specification not supplied are to be inherited from the standard definition. The content of the
224 MDMDNM element will be used in place of the canonical
226 MDMDNM value in the schema generated.
230 MDMDNM modification. Although it is an inherently unclean modification (because the set of documents matched by the resulting schema is disjoint from the set matched by its unmodified equivalent), the process of converting any document in which elements have been renamed into an exactly equivalent document using canonical names is completely deterministic, requiring only access to the ODD in which the renaming has been specified. This assumes that the renamed elements used are not placed in the TEI namespace but either use a null namespace or some user-defined namespace, as further discussed in
231 MDMDNM ; if this is not the case, care must be taken to avoid name collision between the new name and all existing TEI names. Furthermore, unclean modifications which do not specify a namespace are not conformant (see further
234 MDMDNM The TEI provides a systematic set of renamings into languages other than English. These all use a language-specific namespace.
239 MDMDCM The content model for an element in the TEI scheme is defined by means of a
243 MDMDCM which specifies it. As shown elsewhere in these Guidelines, the content model is defined using RELAX NG syntax, whether the resulting schema is expressed in RELAX NG or in some other schema language.
254 MDMDCM This indicates that the content model contains declarations taken from the RELAX NG namespace, and that it consists of a reference to a pattern called
256 MDMDCM . Further examination shows that this pattern in turn expands to an optional repeatable alternation of text (
258 MDMDCM ) with references to three other classes (
264 MDMDCM ). For some particular application it might be preferable to insist that
276 MDMDCM This is a clean modification which does not change the meaning of a TEI element; there is therefore no need to assign the element to some other namespace than that of the TEI, though it may be considered good practice; see further
279 MDMDCM A change of this kind, which simplifies the possible content of an element by reducing its model to one of its existing components, is always clean, because the set of documents matched by the resulting schema is a subset of the set of documents which would have been matched by the unmodified schema.
281 MDMDCM Note that content models are generally defined (as far as possible) in terms of references to model classes, rather than to explicit elements. This means that the need to modify content models is greatly reduced: if an element is deleted or modified, for example, then the deletion or modification will be available for every content model which references that element via its class, as well as those which reference it explicitly. For this reason it is not (in general) good practice to replace class references by explicit element references, since this may have unintended side effects.
283 MDMDCM An unqualified reference to an element class within a content model generates a content model which is equivalent to an alternation of all the members of the class referenced. Thus, a content model which refers to the model class
285 MDMDCM will generate a content model in which any one of the members of that class is equally acceptable. It is also possible to reference predefined content model fragments based on classes, such as
288 MDMDCM a sequence containing no more than one of each member of the class
292 MDMDCM Content model changes which are not simple restrictions on an existing model should be undertaken with caution. The set of documents matching the schema which results from such changes is likely to be disjoint from the set of documents matching the unmodified schema, and such changes are therefore regarded as unclean. When content models are changed or extended, care should be taken to respect the existing semantics of the element concerned as stated in the Guidelines. For example, the element
294 MDMDCM is defined as containing a line of verse. It would not therefore make sense to redefine its content model so that it could also include members of the class
296 MDMDCM : such a modification, although syntactically feasible, would not be regarded as TEI-conformant because it breaks the TEI Abstract Model.
307 MDMDAL element. To add a new attribute to an element, the schema builder should therefore first check to see whether this attribute is already defined by some existing attribute class. If it is, then the simplest method of adding it will be to make the element in question a member of that class, as further discussed below. If this is not possible, then a new
320 MDMDAL content
331 MDMDAL Suppose, for example, that we wish to add two attributes to the
345 MDMDAL element in fact has no local attributes defined for it at all: we will therefore need to add not only an
365 MDMDAL The value supplied for the
370 MDMDAL add
371 MDMDAL ; if this attribute already existed on the element we are modifying this should generate an error, since a specification cannot have more than one attribute of the same name. If the attribute is already present, we can replace the whole of the existing declaration by supplying
373 MDMDAL as the value for
375 MDMDAL ; alternatively, we can change some parts of an existing declaration only by supplying just the new parts, and setting
376 MDMDAL change
377 MDMDAL as the value for
381 MDMDAL Because the new attribute is not defined by the TEI, we must specify a namespace for it on the
391 MDMDAL The canonical name for the new attribute is
393 MDMDAL , and is supplied on the
397 MDMDAL element. In this simple example, we supply only a description and datatype for the new attribute; the former is given by the
402 MDMDAL ). The content of the
406 MDMDAL element, uses patterns from the RELAX NG namespace, in this case to select one of the predefined TEI datatypes (
409 MDMDAL It is often desirable to constrain the possible values for an attribute to a greater extent than is possible by simply supplying a TEI datatype for it. This facility is provided by the
413 MDMDAL element. Suppose for example that, rather than supplying them as pointers to a bibliography, all that we wish to indicate about the source of our examples is that each comes from one of three predefined sources, which we call A, B, and C. A declaration like the following might be appropriate:
442 MDMDAL supplied as part of any attribute in the TEI scheme.
444 MDMDAL Depending on the modification, the set of documents matched by a schema generated from an ODD modified in this way may or may not be a subset of the set of documents matched by the unmodified schema. As such, it is difficult to tell in principle whether such modifications are intrinsically unclean.
449 MDMDCL The concept of element classes was introduced in
450 MDMDCL ; an understanding of it is fundamental to successful use of the TEI scheme. As noted there, we distinguish
451 MDMDCL model classes
453 MDMDCL attribute classes
454 MDMDCL , the members of which simply share a set of attributes.
458 MDMDCL . All classes to which the element belongs must be specified within this, using a
462 MDMDCL To add an element to a class in which it is not already a member, all that is needed is to supply a new
466 MDMDCL element for the element concerned. For example, to add an element to the
477 MDMDCL element is set to
478 MDMDCL change
479 MDMDCL (rather than its default value of
483 MDMDCL element retains its membership of the two classes (
493 MDMDCL defined in the core module is a member of two attribute classes,
510 MDMDCL If the intention is to change the class membership of an element completely, rather than simply add or remove it to or from one or more classes, the value of
514 MDMDCL can be set to
516 MDMDCL (which is the default if no value is specified), indicating that the memberships indicated by its child
531 MDMDCL attribute is set to
532 MDMDCL change
537 MDMDCL To change or remove attributes inherited from an attribute class for all members of the class (as opposed to specific members of that class), it is also possible to modify the class specification itself. For example, the class
561 MDMDCL defining the attributes inherited through membership of this class has the value
562 MDMDCL change
567 MDMDCL The classes used in the TEI scheme are further discussed in chapter
568 MDMDCL . Note in particular that classes are themselves classified: the attributes inherited by a member of attribute class A may come to it directly from that class, or from another class of which A is itself a member. For example, the class
570 MDMDCL is itself a member of the classes
574 MDMDCL . By default, these two classes are predefined as empty. However, if (for example) the
576 MDMDCL module is included in a schema, a number of attributes (
584 MDMDCL will then inherit these new attributes (see further section
593 MDMDCL Such global changes should be undertaken with caution: removing existing non-mandatory attributes from a class will always be a clean modification, in the same way as removing non-mandatory elements. Adding a new attribute to a class, however, can be a clean modification only if the new attribute is labelled as belonging to some namespace other than the TEI.
595 MDMDCL The same mechanisms are available for modification of model classes. Care should be taken when modifying the model class membership of existing elements since model class membership is what determines the content model of most elements in the TEI scheme, and a small change may have unintended consequences.
600 MDMDNE To add a completely new element into a schema involves providing a complete element specification for it, the
602 MDMDNE element of which includes a reference to at least one TEI model class. Without such a reference, the new element will not be referenced by the content model of any other TEI element, and will therefore be inaccessible within a TEI document.
612 MDMDNE . To add a fourth member (say
622 MDMDNE The other parts of this declaration will typically include a description for the new element and information about its content model, its attributes, etc., as further described in
629 MDNS All the elements defined by the TEI scheme are labelled as belonging to a single
630 MDNS namespace
631 MDNS , maintained by the TEI and with the URI
636 MDNS used to represent TEI examples has its own namespace,
639 MDNS Only elements which are unmodified or which have undergone a clean modification may use this namespace. In a TEI-conformant document, it is assumed that all attributes not explicitly labelled with a namespace (such as, for example
641 MDNS ) also belong to the TEI namespace, and are defined by the TEI.
643 MDNS This implies that any other modification (including a renaming or reversible modification) must either specify a different namespace or specify no namespace at all. The
653 MDNS Suppose, for example, that we wish to add a new attribute
655 MDNS to the existing TEI element
657 MDNS . In the absence of namespace considerations, this would be an unclean modification, since
659 MDNS does not currently have such an attribute. The most appropriate action is to explicitly attach the new attribute to a new namespace by a declaration such as the following:
678 MDNS is explicitly labelled as belonging to something other than the TEI namespace, we regard the modification which introduced it as clean. A namespace-aware processor will be able to validate those elements in the TEI namespace against the unmodified schema.
679 MDNS Full namespace support does not exist in the DTD language, and therefore these techniques are available only to users of more modern schema languages such as RELAX NG or W3C Schema.
681 MDNS Similar considerations apply when modification is made to the content model or some other aspect of an element, or when a new element is declared. Clean modification requires that all such changes be explicitly labelled as belonging to some non-TEI namespace or to no namespace at all.
685 MDNS attribute is supplied on a
687 MDNS element, it identifies the namespace applicable to all components of the schema being specified. Even if such a schema includes unmodified modules from the TEI namespace, the elements contained by such modules will now be regarded as belonging to the namespace specified on the
689 MDNS . This can be useful if it is desired simply to avoid namespace processing. For example, the following schema specification results in a schema called
691 MDNS which has no namespace, even though it comprises declarations from the TEI
698 MDNS In addition to the TEI canonical namespace mentioned above, the TEI may also define namespaces for approved translations of the TEI scheme into other languages. These may be used as appropriate to indicate that a customization uses a standardized set of renamings. The namespace for such translations is the same as that for the canonical namespace, suffixed by the appropriate ISO language identifier (
699 MDNS ). A schema specification using the Chinese translation, for example, would use the namespace
705 MDDO The elements used to define a TEI customization (
711 MDDO , etc.) will typically be used within a TEI document which supplies further information about the intended use of the new schema, the meaning and application of any new or modified elements within it, and so on. This document will typically conform to a TEI (or other) schema which includes the module described in chapter
715 MDDO Where the customization to be documented simply consists in a selection of modules, perhaps with some deletion of unwanted elements or attributes, the documentation need not specify anything further. Even here however it may be considered worthwhile to replace some of the semantic information provided by the unmodified TEI specification. For example, the
717 MDDO element of an unmodified TEI
732 MDDO elements are not required, or in which any other rule stated in these Guidelines is either not enforced or not enforceable. In fact, the mechanism, if used in an extreme way, permits replacement of all that the TEI has to say about every component of its scheme. Such revisions would result in documents that are not TEI-conformant in even the broadest sense, and it is not intended that encoders use the mechanism in this way. We discuss exactly what is meant by the concept of
733 MDDO TEI conformance
739 MDlite Several examples of customizations of the TEI are provided as part of the standard release. They include the following:
743 MDlite The schema generated from this customization is the minimum needed for TEI Conformance. It provides only a handful of elements.
747 MDlite The schema generated from this customization combines all available TEI modules, providing
752 MDlite The schema generated from this customization combines all available TEI modules with three other non-TEI vocabularies, specifically MathML, SVG, and XInclude.
756 MDlite It is unlikely that any project would wish to use any of these extremes unchanged. However, they form a useful starting point for customization, whether by removing modules from tei_all or tei_allPlus, or by replacing elements deleted from tei_bare. They also demonstrate how an ODD document may be constructed to provide a basic reference manual to accompany schemas generated from it.
758 MDlite Shortly after publication of the first edition of these Guidelines, as a demonstration of how the TEI encoding scheme might be adopted to meet 90% of the needs of 90% of the TEI user community, the TEI editors produced a brief tutorial defining one specific
760 MDlite modification of the TEI scheme, which they called TEI Lite. This tutorial and its associated DTD became very popular and are still available from the TEI web site at
761 MDlite . The tutorial and associated schema specification are also included as one of the exemplars provided with TEI P5.
763 MDlite The exemplars provided with TEI P5 also include a customization file from which a schema for the validation of other customization files may be generated. This ODD, called tei_odds, combines the four basic modules with the tagdocs, dictionaries, gaiji, linking, and figures modules as well as including the (non-TEI) module defining the RELAX NG language. This enables schemas derived from this customization file to validate examples contained within them in a number of ways, further described within the document.
771 CF TEI Conformance
772 CF is intended to assist in the description of the format and contents of a particular XML document instance or set of documents. It may be found useful in such situations as:
780 CF specifying the form of documents to be produced by or for a given project.
782 CF It is not intended to provide any other evaluation, for example of scholarly merit, intellectual integrity, or value for money. A document may be of major intellectual importance and yet not be TEI-conformant; a TEI-conformant document may be of no scholarly value whatsoever.
784 CF In this section we explore several aspects of conformance, and in particular attempt to define how the term
786 CF should be used. The terminology defined here should be considered normative: users and implementors of the TEI Guidelines should use the phrases
791 CF TEI Extension
796 CF if it:
802 CF TEI Schema
803 CF , that is, a schema derived from the TEI Guidelines (
806 CF conforms to the TEI Abstract Model (
810 CF TEI Namespace
817 CF ) which refers to the TEI Guidelines
821 CF A document is said to be
823 CF if it is a well-formed XML document which can be transformed algorithmically and automatically into a TEI-conformant document as defined above without loss of information. Such a document may informally be described as TEI-conformant; the terms
829 CF A document is said to use a
830 CF TEI Extension
831 CF if it is a well-formed XML document which is valid against a TEI Schema which contains additional distinctions, representing concepts not present in the TEI Abstract Model, and therefore not documented in these Guidelines. Such a document cannot, in general, be algorithmically conformant since it cannot be automatically transformed without loss of information. However, since one of the goals of the TEI is to support extensions and modifications, it should not be assumed that no TEI document can include extensions: an extension which is expressed by means of the recommended mechanisms is also a TEI document provided that those parts of it which are not extensions are TEI-conformant, or -conformable.
833 CF A TEI-conformant (or -conformable) document is said to follow
834 CF TEI Recommended Practice
844 CFWF . Other ways of representing the concepts of the TEI Abstract Model are possible, and other representations may be considered appropriate for use in particular situations (for example, for data capture, or project-internal processing). But such alternative representations are at best
851 CFWF A TEI-conformant document must use the TEI namespace, and therefore must also include an XML-conformant namespace declaration, as defined below (
854 CFWF The use of XML greatly reduces the need to consider hardware or software differences between processing environments when exchanging data. No special packing or interchange format is required for an XML document, beyond that defined by the W3C recommendations, and no special
856 CFWF format is therefore proposed by these Guidelines. For discussion of encoding issues that may arise in the processing of special character sets or non-standard writing systems, see further chapter
861 CFWF document, as being a well-formed document which matches a specific set of rules or syntactic constraints, defined by a
863 CFWF . As noted above, TEI conformance implies that the schema used to determine validity of a given document should be derived from the present Guidelines, by means of an ODD which references and documents the schema fragments which the Guidelines define.
870 CFVL documents must validate against a schema file that has been derived from the published TEI Guidelines, combined and documented in the manner described in section
872 CFVL TEI Schema
875 CFVL The TEI does not mandate use of any particular schema language, only that this schema
880 CFVL TEI ODD file
881 CFVL that references the TEI Guidelines. Currently available tools permit the expression of schemas in any or all of the XML DTD language, W3C XML Schema, and RELAX NG (both compact and XML formats). Some of what is syntactically possible using the ODD formalism cannot be represented by all schema languages; and there are some features of some schema languages which have no counterpart in ODD. No single schema language fully captures all the constraints implied by conformance to the TEI Abstract Model. A document which is valid according to a TEI schema represented using one schema language may not be valid against the same schema expressed in other languages; in particular the DTD language does not fully support namespaces. Features which cannot be represented in all schema languages are documented in chapters
886 CFVL , many varieties of TEI schema are possible and not all of them are necessarily
888 CFVL ; derivation from an ODD is a necessary but not a sufficient condition for TEI Conformance.
892 CFAM Conformance to the TEI Abstract Model
895 CFAM TEI Abstract Model
896 CFAM is the conceptual schema instantiated by the TEI Guidelines. These Guidelines define, both formally and informally, a set of abstract concepts such as
902 CFAM s do not contain
904 CFAM s. These Guidelines also define classes of elements, which have both semantic and structural properties in common. Those semantic and structural properties are also a part of the TEI Abstract Model; the class membership of an existing TEI element cannot therefore be changed without changing the model. Elements can however be removed from a class by deletion, and new non-TEI elements within their own namespaces can be added to existing TEI classes.
908 CFAMsc It is an important condition of TEI conformance that elements defined in the TEI Guidelines as having one specific meaning should not be used with another. For example, the element
910 CFAMsc is defined in the TEI Guidelines as containing a line of verse. A schema in which it is redefined to mean a typographic line, or an ordered queue of objects of some kind, cannot therefore be TEI-conformant, whatever its other properties.
912 CFAMsc The semantics of elements defined in the TEI Guidelines are conveyed in a number of ways, ranging from formally verifiable datatypes to informal descriptive prose. In addition, a mapping between TEI elements and concepts in other conceptual models may be provided by the
916 CFAMsc A schema which shares equivalent concepts to those of the TEI conceptual model may be mappable to the TEI Schema by means of such a mechanism. For example, the concept of paragraph expressed in the TEI scheme by the
920 CFAMsc element. In this respect (though not in others) a DocBook-conformant document might therefore be considered to be TEI-conformable. Such areas of overlap facilitate interoperability, because elements from one namespace may be readily integrated with those from another, but do not affect the definition of conformance.
922 CFAMsc A document is said to conform to the
923 CFAMsc TEI Abstract Model
924 CFAMsc if features for which an encoding is proposed by the TEI Guidelines are encoded within it using the markup and other syntactic properties defined by means of a valid
926 CFAMsc schema. Hence, even though the names of elements or attributes may vary, a TEI-conformant document must respect the TEI Semantic Model, and be valid with respect to a TEI-conformant Schema. Although it may be possible to transform a document which follows the
927 CFAMsc TEI Abstract Model
934 CFAMmc Mandatory Components of a TEI Document
958 CFAMmc in the case of a corpus or collection, a single overall
960 CFAMmc element followed by a series of
973 CFAMmc This should include the title of the TEI document expressed using a
979 CFAMmc This should include the place and date of publication or distribution of the TEI document, expressed using the
994 CFNS TEI Namespace
997 CFNS ) provides a way for an XML document to combine markup from different vocabularies without risking name collision and consequent processing difficulties. While the scope of the TEI is large, there are many areas in which it makes no particular recommendation, or where it recommends that other defined markup schemes should be adopted, such as graphics or mathematics. It is also considered desirable that users of other markup schemes should be able to integrate documents using TEI markup with their own system. To meet these objectives without compromising the reliability of its encoding, a TEI-conformant document is required to make appropriate use of the TEI namespace.
999 CFNS Essentially all elements in a TEI Schema which represent concepts from the TEI Abstract Model belong to the TEI namespace,
1001 CFNS , maintained by the TEI. A TEI-conformant document is required to declare the namespace for all the elements it contains whether these come from the TEI namespace or from other schemes.
1003 CFNS A TEI Schema may be created which assigns TEI elements to some other namespace, or to no namespace at all. A document using such a schema must be regarded as a TEI extension and cannot be considered TEI-conformant, though it may be TEI-conformable. A document which places non-TEI elements or attributes within the TEI namespace cannot be TEI-conformant; such practices are strongly deprecated as they may lead to serious difficulties for processing or interchange.
1010 CFOD above, a TEI Schema can only be generated from a TEI ODD, which also serves to document the semantics of the elements defined by it. A TEI-conformant document should therefore always be accompanied by (or refer to) a valid
1011 CFOD TEI ODD file
1012 CFOD specifying which modules, elements, classes, etc. are in use together with any modifications or renamings applied, and from which a TEI Schema can be generated to validate the document. The TEI supplies a number of predefined
1013 CFOD TEI Customization exemplar ODD files
1015 CFOD ), but most projects will typically need to customize the TEI beyond what these examples provide. It is assumed, for example, that most projects will customize the TEI scheme by removing those elements that are not needed for the texts they are encoding, and by providing further constraints on the attribute values and element content models the TEI provides. All such customizations must be specified by means of a valid
1016 CFOD TEI ODD
1019 CFOD As different sorts of customization have different implications for the interchange and interoperability of TEI documents, it cannot be assumed that every customization will necessarily result in a schema that validates only TEI-conformant documents. The ODD language permits modifications which conflict with the TEI Abstract Model, even though observing this model is a requirement for TEI Conformance. The ODD language can in fact be used to describe many kinds of markup scheme, including schemes which have nothing to do with the TEI at all.
1021 CFOD Equally, it is possible to construct a TEI Schema which is identical to that derived from a given TEI ODD file without using the ODD scheme. A schema can be constructed simply by combining the predefined schema language fragments corresponding to the required set of TEI modules and other statements in the relevant schema language. The status of such a schema with respect to the
1023 CFOD schema cannot however be determined, in general; it may therefore be impossible to determine whether such a schema represents a clean modification or an extension. This is one reason for making the presence of a TEI ODD file a requirement for conformance.
1027 CFCATSCH Varieties of TEI Conformance
1031 CFCATSCH Is it a valid XML document, for which a TEI Schema exists? If not, then the document cannot be considered TEI-conformant in any sense.
1033 CFCATSCH Is the document accompanied by a TEI-conformant ODD specification describing its markup scheme and intended semantics? If not, then the document can only be considered TEI-conformant if it validates against a predefined TEI Schema and conforms to the TEI abstract model.
1035 CFCATSCH Does the markup in the document correctly represent the TEI abstract model? Though difficult to assess, this is essential to TEI conformance.
1037 CFCATSCH Does the document claim that all of its elements come from some namespace other than the TEI (or no namespace)? If so, the document cannot be TEI-conformant.
1039 CFCATSCH If the document claims to use the TEI namespace, in part or wholly, do the elements associated with that namespace in fact belong to it? If not, the document cannot be TEI-conformant; if so, and if all non-TEI elements and attributes are correctly associated with other namespaces, then the document may be TEI-conformant.
1041 CFCATSCH Is the document valid according to a schema made by combining all TEI modules as well as valid according to the schema derived from its associated ODD specification? If so, the document is TEI-conformant.
1045 CFCATSCH ? If so, the document uses a TEI extension.
1049 CFCATSCH , using only information supplied in the accompanying ODD and without loss of information? If so, the document is TEI-conformable.
1075 tab-conformance Conforms to TEI Abstract Model
1135 tab-conformance Uses TEI and other namespaces correctly
1176 tab-conformance Document can be converted automatically to a form which is valid as a subset of
1200 CFCATSCH The document in column A is TEI-conformant. Its tagging follows the TEI Abstract Model, both as regards syntactic constraints (its
1206 CFCATSCH elements appear to contain verse lines rather than typographic ones). It is accompanied by a valid ODD which documents exactly how it uses the TEI. All the TEI-defined elements and attributes in the document are placed in the TEI namespace. The schema against which it is valid is a
1212 CFCATSCH The document in column B is not a TEI document. Although it is accompanied by a valid TEI ODD, the resulting schema includes some
1214 CFCATSCH modifications, and represents some concepts from the TEI Abstract Model using non-TEI elements; for example, it re-defines the content model of
1220 CFCATSCH which appears to have the same meaning as the existing TEI
1222 CFCATSCH element, but the equivalence is not made explicit in the ODD. It uses the TEI namespace correctly to identify the TEI elements it contains, but the ODD does not contain enough information to convert its non-TEI elements into TEI equivalents automatically.
1224 CFCATSCH The document in column C is TEI-conformable. It is almost the same as the document in column A, except that the names of the elements used are not those specified by the TEI namespace. Because the ODD accompanying it contains an exact mapping for each element name (using the
1226 CFCATSCH element) and there are no name conflicts, it is possible to make an automatic conversion of this document.
1228 CFCATSCH The document in column D is a TEI Extension. It combines elements from its own namespace with unmodified TEI elements in the TEI namespace. Its usage of TEI elements conforms to the TEI Abstract Model. Its ODD defines a new
1230 CFCATSCH element which has no exact TEI equivalent, but which is assigned to an existing TEI class; consequently its schema is not a clean subset of
1232 CFCATSCH . If the associated ODD provided a way of mapping this element to an existing TEI element, then this would be TEI-conformable.
1234 CFCATSCH The document in column E is superficially similar to document D, but because it does not use any namespace declarations (or, equivalently, it assigns unmodified TEI elements to its own namespace), it may contain name collisions; there is no way of knowing whether a
1238 CFCATSCH or has some other meaning. The accompanying ODD file may be used to provide the human reader with information about equivalently named elements in the TEI namespace, and hence to determine whether the document is valid with respect to the TEI Abstract Model, but this is not an automatable process. In particular, cases of apparent conflict (for example use of an element
1240 CFCATSCH to represent a concept not in the TEI Abstract Model but in the abstract model of some other system, whose namespace has been removed as well) cannot be reliably resolved. By our current definition, therefore, this is not a TEI document.
1244 CFCATSCH which is used in this document is a specialization of an existing TEI element, and the ODD in which it is defined specifies the mapping (a
1252 CFCATSCH ; if it does not, this would also be a case of TEI Extension.
1254 CFCATSCH The document in column G is not a TEI document. Its structure is fully documented by a valid TEI ODD, but it does not claim to represent the TEI Abstract Model, does not use the TEI namespace, and is not intended to validate against any TEI schema.
1256 CFCATSCH The document in column H is very like that in column A, but it lacks an accompanying ODD. Instead, the schema used to validate it is produced simply by combining TEI schema fragments in the same way as an ODD processor would, given the ODD. If the resulting schema is a clean subset of
1258 CFCATSCH , such a document is indistinguishable from a TEI-conformant one, but there is no way of determining (without inspection) whether this is the case if any modification or extension has been applied. Its status is therefore, like that of Text E, impossible to determine.
1268 IM The specifications in this section are illustrative but not normative. Their function is to further illustrate the intended scope and application of the elements documented in chapter
1269 IM , since it is believed that these may have application beyond the areas directly addressed by the TEI.
1271 IM An ODD processing system has to accomplish two main tasks. A set of selections, deletions, changes, and additions supplied by an ODD customization (as described in
1272 IM ) must first be merged with the published TEI P5 ODD specifications. Next, the resulting unified ODD must be processed to produce the desired outputs.
1274 IM An ODD processor is not required to do these two stages in sequence, but that may well be the simplest approach; the ODD processing tools currently provided by the TEI Consortium, which are also used to process the source of these Guidelines, adopt this approach.
1288 IM-unified attribute. This provides a name for the generated schema, which other components of the processing system may use to refer to the schema being generated, e.g. in issuing error messages or as part of the generated output schema file or files. The
1290 IM-unified attribute may be used to specify the default namespace within which elements valid against the resulting schema belong, as discussed in
1295 IM-unified element contains an unordered series of specialized elements, each of which is of one of the following four types:
1301 IM-unified (by default
1315 IM-unified add
1317 IM-unified If the value of
1320 IM-unified add
1321 IM-unified , then the object is simply copied to the output, but if it is
1322 IM-unified change
1327 IM-unified , then it will be looked at by other parts of the process.
1336 IM-unified element, in turn, groups together a set of ODD specifications (among other things, including further
1360 IM-unified references to TEI Modules
1365 IM-unified attributes refer to components of the TEI. The value of the
1371 IM-unified element defining a TEI module. The
1373 IM-unified must be dereferenced by some means, such as reading an XML file with the TEI ODD specification (either from the local hard drive or off the Web), or looking up the reference in an XML database (again, locally or remotely); whatever means is used, it should return a stream of XML containing the element, class, and macro specifications collected together in the specified module. These specification elements are then processed in the same way as if they had been supplied directly within the
1383 IM-unified attribute; the content of such modules, which must be available in the RELAX NG XML syntax, is passed directly and without modification to the output schema when that is created.
1387 IM-unified Each object obtained from the TEI ODD specification using
1395 IM-unified if there is an object in the ODD customization with the same value for the
1399 IM-unified value of
1401 IM-unified , then the object from the module is ignored;
1403 IM-unified if there is an object in the ODD customization with the same value for the
1407 IM-unified value of
1409 IM-unified , then the object from the module is ignored, and the one from the ODD customization is used in its place;
1411 IM-unified if there is an object in the ODD customization with the same value for the
1415 IM-unified value of
1416 IM-unified change
1417 IM-unified , then the two objects must be merged, as described below;
1419 IM-unified if there is an object in the ODD customization with the same value for the
1423 IM-unified value of
1424 IM-unified add
1425 IM-unified , then an error condition should be raised;
1441 IM-unified elements). If such a component is found in the ODD customization, it will be copied to the output; if it is not found there, but is present in the TEI ODD specification, then that will be copied to the output.
1447 IM-unified , for example); these are always copied to the output, and their children are then processed following the rules given in this list.
1481 IM-unified elements. These should be copied from both the TEI ODD specification and the ODD customization, and all occurrences included in the output.
1522 IM-unified This means that when
1523 IM-unified memberOf key="att.typed"/
1524 IM-unified is processed, that class is looked up, each attribute which it defines is examined in turn, and the customization is searched for an override. If the modification is of the attribute class itself, work proceeds as usual; if, however, the modification is at the element level, the class reference is deleted and a series of
1526 IM-unified elements is added to the element, one for each attribute inherited from the class. Since attribute classes can themselves be members of other attribute classes, membership must be followed recursively.
1542 IM-unified to provide an alternate description in another language. Nothing prevents the user from supplying
1554 IM-unified In the processing of the content models of elements and the content of macros, deleted elements may require special attention.
1555 IM-unified The carthago program behind the Pizza Chef application, written by Michael Sperberg-McQueen for TEI P3 and P4, went to very great efforts to get this right. The XSLT transformations used by the P5 Roma application are not as sophisticated, partly because the RELAX NG language is more forgiving than DTDs.
1556 IM-unified A content model like this:
1575 IM-unified requires no special treatment because everything is expressed in terms of model classes; if deletions result in
1577 IM-unified having no members, then
1581 IM-unified . An ODD processor may or may not elect to simplify the resulting choice between nothing and
1585 IM-unified element. However, such simplification may be considerably more complex in the general case (if for example the
1591 IM-unified ), and an ODD processor is therefore likely to be more successful in carrying out such simplification as a distinct stage during processing of ODD sources.
1614 IM-unified Note that deletion of required elements will cause the schema specification to accept as valid documents which cannot be TEI-conformant, since they no longer conform to the TEI Abstract Model; conformance topics are addressed in more detail in
1622 IM-unified which contains a complete and internally consistent set of element, class, and macro specifications, possibly also including
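The override rules listed earlier in this section, for objects fetched from a TEI module, can be sketched in Python. This is a minimal illustration under stated assumptions and not the TEI Consortium's implementation: the function name `merge_specs`, the use of dictionaries to stand for specifications, and the shallow merge applied in change mode are all hypothetical, and the mode names delete and replace are taken from the ODD vocabulary because the rows above omit them.

```python
def merge_specs(module_specs, custom_specs):
    """Merge module specifications with an ODD customization (hypothetical helper).

    Each specification is a dict with an "ident" key; customization
    entries may also carry a "mode" key, which defaults to "add".
    """
    overrides = {spec["ident"]: spec for spec in custom_specs}
    unified = []

    def strip_mode(spec):
        # The mode only steers the merge; it is not copied to the output.
        return {key: value for key, value in spec.items() if key != "mode"}

    for spec in module_specs:
        override = overrides.pop(spec["ident"], None)
        if override is None:
            unified.append(dict(spec))       # no override: copy the module object
            continue
        mode = override.get("mode", "add")
        if mode == "delete":
            continue                         # module object is ignored
        elif mode == "replace":
            unified.append(strip_mode(override))
        elif mode == "change":
            # Assumption: a shallow merge in which customization values win.
            unified.append({**spec, **strip_mode(override)})
        elif mode == "add":
            raise ValueError(f"{spec['ident']!r} is already defined in the module")
        else:
            raise ValueError(f"unknown mode {mode!r}")

    # Customization objects with no module counterpart: copy those in add mode.
    # Other modes are not covered by the rules above; this sketch drops them.
    for override in overrides.values():
        if override.get("mode", "add") == "add":
            unified.append(strip_mode(override))
    return unified
```

A real processor would merge child components recursively in change mode rather than overwriting whole keys; the shallow merge here only shows where that step belongs.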
1632 IMGS Assuming that any modifications have been resolved, as outlined in the previous section, making a schema is now a four-stage process:
1634 IMGS all datatype and other macro specifications must be collected together and declared at the start of the output schema;
1636 IMGS all classes must be declared in the right order (since some classes reference others, the order is significant);
1646 IMGS Working in this order gives the best chance of successfully supporting all the schema languages. However, there are a number of obstacles to overcome along the way.
1648 IMGS An ODD processor may use any desired schema language or languages for its schema output. The TEI ODD specification uses RELAX NG to express content models, and is therefore biased towards this language. However, the current TEI ODD processing system is capable of producing schema output in the three main schema languages, as follows:
1650 IMGS A RELAX NG (XML) schema is generated by creating wrappers around the content models taken directly from the ODD specification; a version re-expressed in the RELAX NG compact syntax is generated using James Clark's
1654 IMGS A DTD schema is generated by converting the RELAX NG content models to DTD language, often simplifying them to suit the less expressive output language.
1656 IMGS A W3C Schema schema is created by generating a RELAX NG schema and then using James Clark's
1666 IMGS Secondly, it is possible to create two rather different styles of schema. On the one hand, the schema can try to maintain all the flexibility of ODD by using the facilities of the schema language for parameterization; on the other, it can remove all customization features and produce a flat result which is not suitable for further manipulation. The TEI project currently generates both styles of schema; the first as a set of schema fragments in DTD and RELAX NG languages, which can be included as modules in other schemas, and customized further; the second as the output from a processor such as Roma, in which many of the parameterization features have been removed.
1702 IMGS performance = element performance { (model.divTop | model.global)*, (model.common, model.global*)+, (model.divBottom, model.global*)*, att.global.attribute.xmlspace, att.global.attribute.xmlid, att.global.attribute.n, att.global.attribute.xmllang, att.global.attribute.rend, att.global.attribute.xmlbase, att.global.linking.attribute.corresp, att.global.linking.attribute.synch, att.global.linking.attribute.sameAs, att.global.linking.attribute.copyOf, att.global.linking.attribute.next, att.global.linking.attribute.prev, att.global.linking.attribute.exclude, att.global.linking.attribute.select }
1705 IMGS ) would have no effect, since references to such classes have been expanded to reference their constituent attributes.
1708 IMGS performance = element performance { performance.content, performance.attributes } performance.content = (model.divTop | model.global)*, (model.common, model.global*)+, (model.divBottom, model.global*)* performance.attributes = att.global.attributes, empty
1711 IMGS is provided via an explicit reference (
1713 IMGS ), and can therefore be redefined. Moreover, the attributes are separated from the content model, allowing either to be overridden.
1719 IMGS are used to distinguish the two schema types. An ODD processor is not required to support both, though the simple schema output is generally preferable for most applications.
1744 IMGS class. What happens if
1762 IMGS it is impossible to be sure which rule is being used. This situation is not detected when RELAX NG is used, since the language is able to cope with non-deterministic content models of this kind and does not require that only a single rule be used.
1764 IMGS Finally, an application will need some method of associating the schema with the document instances that use it. The TEI does not mandate any particular method of doing this, since different schema languages and processors vary considerably in their requirements, although ODD processors may wish to build in support for some of these methods. The TEI does, however, suggest that those which are already part of XML (the DOCTYPE declaration for DTDs) and W3C Schema (the
1770 IMGS attribute to be valid when a document is validated against either a DTD or a RELAX NG schema, ODD processors may wish to add declarations for this attribute and its namespace to the root element, even though these are not part of the TEI
1771 IMGS per se
1774 IMGS to the list of attributes on the root element, which permits the non-namespace-aware DTD language to recognize the
1776 IMGS notation. For RELAX NG, the namespace and attribute would be declared in the usual way:
1777 IMGS namespace xsi = "http://www.w3.org/2001/XMLSchema-instance"
1779 IMGS attribute xsi:schemaLocation { list { data.namespace, data.pointer }+ }
1780 IMGS inside the root element declaration.
1784 IMGS attribute in a W3C Schema schema is not permitted. Therefore, if W3C Schemas are being generated by converting the RELAX NG schema (for example, with
1798 IM-naming If a RELAX NG pattern or DTD parameter entity is being created, its name is the value of the corresponding
1800 IM-naming attribute, prefixed by the value of any
1804 IM-naming . This allows for elements from an external schema to be mixed in without risk of name clashes, since all TEI elements can be given a distinctive prefix such as
1814 IM-naming tei_sp = element sp { ... }
1817 IM-naming If an element or attribute is being created, its default name is the value of the
1819 IM-naming attribute, but if there is an
1821 IM-naming child, its content is used instead.
1827 IM-naming should be copied into the generated schema. If there is only one occurrence of either of these elements, it should be used regardless, but if there are several, local processing rules will need to be applied. For example, if there are several with different values of
1829 IM-naming , a locale indication in the processing environment might be used to decide which to use. For example,
1843 IM-naming might generate a RELAX NG schema fragment like the following, if the locale is determined to be French:
1844 IM-naming head =
  ## en-tête
  element head { head.content, head.attributes }
1847 IM-naming Alternatively, a selection might be made on the basis of the value of the
1853 IM-naming In addition, there are three conventions about naming patterns relating to classes; ODD processors need not follow them, but those reading the schemas generated by the TEI project will find it necessary to understand them:
1855 IM-naming when a pattern for an attribute class is created, it is named after the attribute class identifier (as above) suffixed by
1861 IM-naming when a pattern for an attribute is created, it is named after the attribute class identifier (as above) suffixed by
1863 IM-naming and then the identifier of the attribute (e.g.
1868 IM-naming when a parameterized schema is created, each element generates patterns for its attributes and its contents separately, suffixing respectively
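The naming conventions in this section can be sketched as a few Python helpers. The suffixes used here (`.attributes`, `.attribute.`, `.content`) are read off the generated patterns quoted elsewhere in these rows, such as `att.global.attributes`, `att.global.attribute.xmlspace` and `performance.content`; the function names are hypothetical, and dropping the colon from a namespaced attribute name is an assumption inferred from those same examples.

```python
def pattern_name(ident, prefix=""):
    """Name of the pattern or parameter entity: the identifier plus any prefix."""
    return f"{prefix}{ident}"


def attribute_class_pattern(class_ident):
    """Pattern that gathers all attributes of an attribute class."""
    return f"{class_ident}.attributes"


def attribute_pattern(class_ident, attribute_ident):
    """Pattern for one attribute of a class.

    Assumption: a colon in the attribute name (xml:space) is removed,
    as in att.global.attribute.xmlspace.
    """
    return f"{class_ident}.attribute.{attribute_ident.replace(':', '')}"


def element_patterns(ident):
    """Content and attribute patterns of an element in a parameterized schema."""
    return f"{ident}.content", f"{ident}.attributes"


print(pattern_name("sp", prefix="tei_"))
print(attribute_pattern("att.global", "xml:space"))
print(element_patterns("performance"))
```

Keeping the prefix out of the class-related helpers is a simplification; a processor that prefixes element patterns would presumably need to apply the same prefix consistently wherever those patterns are referenced.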
1890 IMRN element defining which elements can occur as the root of a document. The ODD
1896 IMRN . A pattern normally corresponds to an element name, but if a prefix (see above,
1897 IMRN ) is supplied for an element, the pattern consists of the prefix name with the element name.
1902 IMMA An ODD macro generates a corresponding RELAX NG pattern simply by copying the body of the
1930 IMMA Although some versions of these Guidelines show the RELAX NG output in the compact syntax, both the content of the
1932 IMMA element and the unified ODD specification generated by the TEI ODD processing software always store RELAX NG in the more verbose XML syntax. However, the two formats are interchangeable.
1952 IMCL if the elements
1958 IMCL are included. Depending on the value of the
1962 IMCL , it may also generate a set of sequences as well as alternation patterns. Thus we may also generate the
2010 IMCL where the pattern name is created by appending an underscore and the name of the generation sequence to the class name.
2012 IMCL Attribute classes work by producing a pattern containing definitions of the appropriate attributes. So
2063 IMCL Since the processor may have expanded the attribute classes already, separate patterns are generated for each attribute in the class as well as one for the class itself. This allows an element to refer directly to a member of a class. Notice that the
2065 IMCL element is used to add an
2073 IMCL Naturally, this behaviour is not mandatory; other ODD processors may create documentation in other ways, or ignore those parts of the ODD specifications when creating schemas.
2084 IMCL attribute in the namespace
2088 IMCL . The body of the attribute is taken from the
2094 IMCL value of
2096 IMCL . In that case an
2146 IMCL namespace to provide default values and documentation.
2156 IMEL pattern by which other elements can refer to it, and then it must generate an
2158 IMEL with the content model and attributes. It may be convenient to make two separate patterns, one for the element's attributes and one for its content model.
2160 IMEL The content model is created simply by copying the body of the
2171 IM-makeDTD . A DTD may not refer to an entity which has not yet been declared. Since both macros and classes generate DTD parameter entities, the TEI Guidelines are constructed so that they can be declared in the right order. A processor must therefore work in the following order:
2173 IM-makeDTD declare all model classes which have a
2175 IM-makeDTD value of
2180 IM-makeDTD value of
2183 IM-makeDTD declare all other classes
2209 IM-makeDTD <!ENTITY % faith 'INCLUDE' > <![ %faith; [ <!--doc:specifies the faith, religion, or belief set of a person. --> <!ELEMENT %n.faith; %om.RR; %macro.phraseSeq;> <!ATTLIST %n.faith; xmlns CDATA "http://www.tei-c.org/ns/1.0"> <!ATTLIST %n.faith; %att.global.attributes; %att.editLike.attributes; %att.datable.attributes; > ]]>
2211 IM-makeDTD ), the element name is parameterized (see
2216 IM-makeDTD . Note the additional attribute which provides a default
2218 IM-makeDTD declaration for the element; the effect of this is that if the document is processed by a DTD-aware XML processor, the namespace declaration will be present automatically without the document author even being aware of it.
2220 IM-makeDTD A simpler rendition for a flattened DTD generated from a customization will result in the following, with no containing marked section, and no parameterized name:
2221 IM-makeDTD <!ELEMENT faith %macro.phraseSeq;> <!ATTLIST faith xmlns CDATA "http://www.tei-c.org/ns/1.0"> <!ATTLIST faith %att.global.attribute.xmlspace; %att.global.attribute.xmlid; %att.global.attribute.n; %att.global.attribute.xmllang; %att.global.attribute.rend; %att.global.attribute.xmlbase; %att.global.linking.attribute.corresp; %att.global.linking.attribute.synch; %att.global.linking.attribute.sameAs; %att.global.linking.attribute.copyOf; %att.global.linking.attribute.next; %att.global.linking.attribute.prev; %att.global.linking.attribute.exclude; %att.global.linking.attribute.select; %att.editLike.attribute.cert; %att.editLike.attribute.resp; %att.editLike.attribute.evidence; %att.datable.w3c.attribute.period; %att.datable.w3c.attribute.when; %att.datable.w3c.attribute.notBefore; %att.datable.w3c.attribute.notAfter; %att.datable.w3c.attribute.from; %att.datable.w3c.attribute.to;>
2222 IM-makeDTD Here the attributes from classes have been expanded into individual entity references.
2241 IMGD The generated documentation may be of two forms. On the one hand, we may document the customization itself, that is, only those elements (etc.) which differ in their specification from that provided by the TEI reference documentation. Alternatively, we may generate reference documentation for the complete subset of the TEI which results from applying the customization. The TEI Roma tools take the latter approach, and operate on the result of the first stage processing described in
2252 IMGD for each element, by tracing which other elements have them as possible members of their content models.
2270 STPE Using TEI Parameterized Schema Fragments
2272 STPE The TEI parameterized DTD and RELAX NG fragments make use of parameter entities and patterns for several purposes. In this section we describe their interface for the user. In general we recommend use of ODD instead of this technique.
2276 STPED Special-purpose parameter entities are used to specify which modules are to be combined into a TEI DTD. They take the form
2280 STPED is the name of the module as given in table
2286 STPED . All such parameter entities are declared by default with the value
2288 STPED : to select a module, therefore, the encoder declares the appropriate parameter entities with the value
2292 STPED For XML DTD fragments, note that some modules generate two DTD fragments: for example the
2298 STPED . This is because the declarations they contain are needed at different points in the creation of an XML DTD.
2314 STPED If TEI.linking has its default value of IGNORE, neither declaration has any effect. If however it has the value INCLUDE, then the content of each marked section is acted upon: the parameter entities
2318 STPED are referenced, which has the effect of embedding the content of the files they represent at the appropriate point in the DTD.
2327 STPEEX The TEI DTD fragments also use marked sections and parameter entity references to allow users to exclude the definitions of individual elements, in order either to make the elements illegal in a document or to allow the element to be redefined. The parameter entities used for this purpose have exactly the same name as the generic identifier of the element concerned. The default definition for these parameter entities is
2331 STPEEX in order to exclude the standard element and attribute definition list declarations from the DTD.
2335 STPEEX , for example, are preceded by a definition for a parameter entity with the name
2340 STPEEX <!ENTITY % p 'INCLUDE' > <![ %p; [ <!-- element and attribute list declaration for p here --> ]]>
2350 STPEEX <!ENTITY % p 'IGNORE' >
2351 STPEEX is added earlier in the DTD than the default (see further
2354 STPEEX Similarly, in the parameterized RELAX NG schemas, every element is defined by a pattern named after the element. To undefine an element therefore all that is necessary is to add a declaration like the following:
2355 STPEEX p = notAllowed
2360 STPEGI In the TEI DTD fragments, elements are not referred to directly by their generic identifiers; instead, the DTD fragments refer to parameter entities which expand to the standard generic identifiers. This allows users to rename elements by redefining the appropriate parameter entity. Parameter entities used for this purpose are formed by taking the standard generic identifier of the element and attaching the string
2372 STPEGI These declarations are generated by an ODD processor when TEI DTD fragments are created.
2374 STPEGI In the RELAX NG schemas, all elements are normally defined using a pattern with the same name as the element (as described in
2376 STPEGI abbr = element abbr { abbr.content, abbr.attributes }
2378 STPEGI abbr = element abbrev { abbr.content, abbr.attributes }
2379 STPEGI More complex revisions, such as redefining the content of the element (defined by the pattern
2383 STPEGI ) can be accomplished in a similar way, using the features of the RELAX NG language. The recommended method of carrying out such modifications is however to use the ODD language as further described in section
2389 STOVLO Any local modifications to a DTD (i.e. changes to a schema other than simple inclusion or exclusion of modules) are made by declarations stored in one of two local extension files, one containing modifications to the TEI parameter entities, and the other new or changed declarations of elements and their attributes. Entity declarations must be made which associate the names of these two files with the appropriate parameter entity so that the declarations they contain can be embedded within the TEI DTD at an appropriate point.
2393 STOVLO file to embed portions of the TEI DTD fragments or locally developed extensions.
2396 STOVLO identifies a local file containing extensions to the TEI parameter entities
2400 STOVLO identifies a local file containing extensions to the TEI module
2403 STOVLO For example, if the relevant files are called
2407 STOVLO , then declarations like the following would be appropriate:
2410 STOVLO When an entity is declared more than once, the first declaration is binding and the others are ignored. The local modifications to parameter entities should therefore be handled before the standard parameter entities themselves are declared in
2414 STOVLO is referred to before any TEI declarations are handled, to allow the user's declarations to take priority. If the user does not provide a
2418 STOVLO For example the encoder might wish to add two phrase-level elements
2423 STOVLO hi rend='italics'
2425 STOVLO hi rend='bold'
2427 STOVLO , this involves two distinct steps: one to define the new elements, and the other to ensure that they are placed into the TEI document structure at the right place.
2429 STOVLO Creating the new declarations is done in the same way for user-defined elements as for any other; the same parameter entities need to be defined so that they may be referenced by other elements. The content models of these new elements may also reference other parameter entities, which is why they need to be declared after other declarations.
2433 STOVLO should be modified to include the generic identifiers for the new elements we wish to create. The declaration for each modifiable parameter entity in the DTD includes a reference to an additional parameter entity with the same name prefixed by an
2435 STOVLO ; these entities are declared by default as the null string. However, in the file containing local declarations they may be redeclared to include references to the new class members:
2437 STOVLO and this declaration will take precedence over the default when the declaration for macro.phraseSeq is evaluated.
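The precedence rule relied on throughout this section, that the first declaration of an entity is binding and later ones are ignored, can be illustrated with a short Python simulation. It does not parse DTD syntax: the function name `bind_entities` and the representation of declarations as (name, value) pairs are assumptions made for the sketch, while the entity names `p` and `TEI.linking` and their default values are those quoted in the rows above.

```python
def bind_entities(*declaration_files):
    """Resolve parameter entity declarations in reading order.

    Each argument is a list of (name, value) pairs standing for one file
    of declarations. The first declaration seen for a name is kept and
    every later declaration of the same name is ignored.
    """
    bound = {}
    for declarations in declaration_files:
        for name, value in declarations:
            bound.setdefault(name, value)   # later duplicates have no effect
    return bound


# Local modifications are read before the TEI defaults, so they win.
local_declarations = [("p", "IGNORE")]
tei_defaults = [("p", "INCLUDE"), ("TEI.linking", "IGNORE")]

bound = bind_entities(local_declarations, tei_defaults)
print(bound["p"])            # the local value, because it was declared first
print(bound["TEI.linking"])  # the default, because nothing overrode it
```

Reversing the order of the two arguments would leave the default binding in force, which is why the local extension file has to be referenced before any TEI declarations are handled.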

AI-AnalyticMechanisms.xml#13092

# id text
3 AI This chapter describes a module for associating simple analyses and interpretations with text elements. We use the term
4 AI analysis
5 AI here to refer to any kind of semantic or syntactic interpretation which an encoder wishes to attach to all or part of a text. Examples discussed in this chapter include familiar linguistic categorizations (such as
19 AI introduces elements which can be used to characterize text segments according to the familiar linguistic categories of
34 AI punctuation mark
41 AI introduces an additional global attribute which allows passages of text to be associated with specialized elements representing their interpretation. These
48 AI . They allow the encoder to specify an analysis as a series of names and associated values,
51 AI ; this term should not be confused, however, with XML attributes and their values, which are similar in concept but distinct in their formal definitions.
52 AI each such pair being linked to one or more stretches of text, either directly, in the case of spans, or indirectly, in the case of interpretations.
55 AI revisits the topic of linguistic analysis, and illustrates how these interpretative mechanisms may be used to associate simple linguistic analysis with text segments.
60 AILC linguistic segment category
61 AILC elements which may be used to represent the segmentation of a text into the traditional linguistic categories of
74 AILC punctuation marks
99 AILCW . They may thus appear anywhere that text is permitted within a document, when the module defined by this chapter is included in a schema.
103 AILCW element may be used simply to segment a text end-to-end into a series of non-overlapping segments, referred to here and elsewhere as
115 AILCW element is more restricted both in its content and its usage than the generic
132 AILCW Neither this constraint, nor the requirement that the whole of the text be segmented by
134 AILCW elements is enforced by the current TEI schemas; such constraints may however be introduced in a later version of these Guidelines.
137 AILCW element is intended for use as a generic segmentation element, the specific function of which may be indicated by its
146 AILCW seg type="s-unit"
148 AILCW seg type="clause"
150 AILCW seg type="phrase"
195 AILCW elements in the same way. A text may be segmented directly into clauses, or into phrases, with no need to include segmentation at a higher level as well.
197 AILCW For verse texts, the overlapping of metrical and syntactic structure requires that special care be given to representing both using an element hierarchy. One simple approach is to split the syntactic phrases into fragments when they cross verse boundaries, reuniting them with the
222 AILCW attributes defined in the additional module for linking (chapter
234 AILCW attribute on linguistic segment categories can be used to provide additional interpretative information about the category. The
240 AILCW elements can be used to provide additional information about the function of the category. Legal values for these two attributes are not defined by these Guidelines, but should be documented in the
244 AILCW element within the document's header. A general approach to the encoding of linguistic categories for parts of a text is discussed in section
263 AILCW Segmentation into clauses and phrases can, of course, be combined. Such detailed encodings as the following may, however, require careful formatting if they are to be easily readable.
329 AILCW This style of markup may introduce spurious new lines and blanks into the text. If the original layout is important, it should be explicitly encoded, using such facilities as the
348 AILCW w
350 AILCW m
352 AILCW c
355 AILCW is permitted to occur. However, their content is more constrained than
377 AILCW elements should contain only plain text, most often only a single character or a sequence of graphemes to be treated as a single character. Consequently, while these more specific elements can be translated directly into typed
381 AILCW The restriction on the content of the
383 AILCW element in particular requires that a certain care be exercised when using it, especially in relation to the use of other tags that one may think of as
393 AILCW element is not part of the content model of the
417 AILCW carries additional attributes which may be of use in many indexing or analytic applications. The
421 AILCW , that is the head- or uninflected form of an inflected verb or noun, for example:
437 AILCW pointer attribute than to supply an explicit uninflected form. This attribute assumes the existence of a list of uninflected forms, for example in an online lexicon, with which individual
438 AILCW w
439 AILCW entries can be associated using the usual TEI pointer mechanisms. Assuming that a standardized lexicon for Latin is available at the location
458 AIPC element is used to mark up morphologically identified segmentation below the word level. Analogous to the
467 AIPC base form
500 AIPC There is a substantial linguistic difference between characters like letters or diacritics and punctuation marks. The former are used to construct meaningful units like morphemes or words. The latter are functionally independent units acting at the level of syntactic units. A word may consist of a single letter (for example
553 AIPC use to mark non-lexical punctuation marks is deprecated, since the
559 AIPC (punctuation character) element should be used to mark up characters which are specifically regarded as providing punctuation, rather than constituting parts of a word. It may be particularly useful when transcribing older written materials, in which an encoding of the original punctuation may be useful for interpretive or analytic purposes, in much the same way as an encoding of the original orthography may be. For example, in the following extract from a Bodleian Library musical manuscript
562 AIPC two different punctuation marks are used to distinguish kinds of pause in the text. The
583 AIPC element carries special attributes to record analyses of the functional behaviour or classification of the punctuation mark it contains. The
587 AIPC element to name the kind of unit which the punctuation mark delimits, for example a paragraph or section. The
589 AIPC attribute may be used to indicate whether the punctuation precedes or follows the unit it delimits. The
591 AIPC attribute indicates the strength of the association between the punctuation mark and its adjacent word.
593 AIPC In the following example, the paragraph marker (¶) has been tagged as a strong punctuation mark, preceding the unit it marks, which is named
610 AIPC elements can be used together to give a fairly detailed low-level grammatical analysis of text. For example, consider the following segmentation of the English S-unit
635 AIPC . A further advantage of segmenting the text down to this level is that it becomes relatively simple to associate each such segment with a more detailed formal analysis, for example by providing a baseform, or morphological analysis at whichever level is appropriate. This matter is taken up in detail in section
651 AIATTS When the module described by this chapter is selected, an additional attribute is defined for all elements:
654 AIATTS attribute may be specified for any element. Its effect is to associate the element with one or more others representing an analysis or interpretation of it. Its target should be one of the elements described in the section
669 AISP The simplest mechanisms for attaching analytic notes in some structured vocabulary to particular passages of text are provided by the
695 AISP elements may be used to indicate that the annotations are of specific types, for example thematic or structural. The annotation itself is supplied as the content of the
699 AISP element. In the case of the
701 AISP element, the span of text being annotated is indicated by values of the
709 AISP attribute is supplied, then the span is coterminous with the element indicated by its value; if both
713 AISP are supplied, the span runs from the start of the element indicated by the
717 AISP attribute; if the
719 AISP attribute is used, the span is defined by aggregating the contents of the (possibly non-contiguous) elements pointed to by its values. It is an error to supply only the
721 AISP attribute; to supply more than one pointer value for either
727 AISP attribute. In the case of
729 AISP (see below), the span is indicated by a pointer from a
747 AISP Here the two components of the span follow each other, so the
763 AISP This second approach might be cumbersome if the number of components to be combined is very large. It is however essential if the components do not follow each other, as in this example:
801 AISP element may, as in this example, be placed in the text near the textual span it is associated with. Alternatively, it may be placed elsewhere in the same or a different document. Where several
805 AISP elements share the same attributes, for example having the same responsibility or type, it may be convenient to group them within a
816 AISP Spans may also be used to represent structural divisions within a narrative, particularly when these do not coincide with the structure implied by the element structure. Consider the following narrative:
819 AISP The rule marks spaces left for the missing name in the manuscript.
820 AISP And when he came home, Borghild asked him to go away, but Sigmund offered her weregild, and she was obliged to accept it. At the funeral feast Borghild was serving beer. She took poison, a big drinking horn full, and brought it to Sinfiotli. When Sinfiotli looked into the horn, he saw that poison was in it, and said to Sigmund
822 AISP Sigmund took the horn and drank it off. It is said that Sigmund was hardy and that poison did him no harm, inside or out. And all his sons could tolerate poison on their skin. Borghild brought another horn to Sinfiotli, and asked him to drink, and everything happened as before. And a third time she brought him a horn, and reproachful words as well, if he didn't drink from it. He spoke again to Sigmund as before. He said
826 AISP Sigmund carried him a long way in his arms and came to a long, narrow fjord, and there was a small boat there and a man in it. He offered to ferry Sigmund over the fjord. But when Sigmund carried the body out to the boat, it was fully laden. The man said Sigmund should go around the fjord inland. The man pushed the boat out and then suddenly vanished.
828 AISP King Sigmund lived a long time in Denmark in the kingdom of Borghild, after he married her. Then he went south to Frankish lands, to the kingdom he had there. Then he married Hiordis, the daughter of King Eylimi. Their son was Sigurd. King Sigmund fell in a battle with the sons of Hunding. And then Hiordis married Alf, the son of King Hialprec. Sigurd grew up there as a boy.
833 AISP A structural analysis of this text, dividing it into narrative units in a pattern shared with other texts from the same literature, might look like this:
880 AISP unit which is normally part of the narrative pattern but which is not realized in the text shown.
883 AISP The same analysis may be expressed with the
887 AISP element; this element provides attributes for recording an interpretive category and its value, as well as the identity of the interpreter, but does not itself indicate which passage of text is being interpreted; the same interpretive structures can thus be associated with many passages of the text. The association between text passages and
889 AISP elements must be made either by pointing from the text to the
894 AISP , or by pointing at both text and interpretation from a
901 AISP , it is necessary to create a text element which contains—or corresponds to—the third, fourth, and fifth orthographic sentences (S-units) in the paragraph. This can be done either with the
907 AISP . The resulting element can then be associated with the
938 AISP tags in a similar manner. The interpretation itself can be expressed in an
960 AISP elements may be linked to the text either by means of the
968 AISP elements introduced specifically for this purpose), the text would be encoded as follows:
1001 AISP element, whose content is a set of
1003 AISP elements which point to each interpretive element and its corresponding text unit. This method does not require the use of the
1005 AISP attribute on the text units.
1019 AISP elements for the Sigmund text is that the
1026 AISP elements may require the creation of special text elements not otherwise needed (e.g. the
1045 AILA we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in this chapter and in chapters
1047 AILA , and the various chapters of Part III. The contextual properties of a TEI text are fully documented in the TEI header, which is discussed in chapter
1051 AILA Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete and non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.
1053 AILA The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the
1055 AILA element within the encoding description of the TEI header, as described in section
1056 AILA . Where different parts of a language corpus have used different annotation methods, the
1075 AILA This may be easily transformed into an equivalent TEI XML representation:
1116 AILA , etc.) they are arbitrary codes, used in this case as pointers to other elements which define their significance more precisely. If the codes are considered to be
1118 AILA , then the
1153 AILA ), then this compositionality may be most clearly expressed using a mechanism based on the
1158 AILA This approach requires the text to be fully segmented, using the linguistic segment elements described in section
1161 AILA attribute used to point to each interpretation is clearly defined. A further analysis into phrase and clause elements can be superimposed on the word and morpheme tagging in the preceding illustration. For example, CLAWS provides the following constituent analysis of the sample sentence (the word class codes have been deleted):
1165 AILA Treating the labels on the brackets as phrase or clause interpretations, this analysis of the structure of the example sentence can be combined with the word class analysis and represented as follows (the symbol
1258 AILA element. In this case, each linguistic segment must be supplied with its own
1307 AILA Each linguistic segment so far discussed has been well-behaved with respect to the basic document hierarchy, having only a single parent. Moreover, the segmentation has been complete, in that each part of the text is accounted for by some segment at each level of analysis, without discontinuities or overlap. This state of affairs does not of course apply in all types of analysis, and these Guidelines provide a number of mechanisms to support the representation of discontinuities or multiple analyses. A brief overview of these facilities is provided in chapter
1311 AILA The mechanisms proposed in this chapter may also be used to encode analyses of an entirely different kind, for example discourse function. Here is an application of the span technique to record details of a sales transaction in a spoken text.
1337 AILA (utterance) element and other elements recommended for transcriptions of spoken language, see chapter
1346 analysis Simple analytic mechanisms
1355 AI The selection and combination of modules to form a TEI schema is described in
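
Note on the AISP rows: the pointing mechanism they describe (a text element refers to one or more interpretation elements) can be sketched as a lookup. This is a minimal sketch, not the Guidelines' own example: the sample sentences, the xml:id values, and the function name interpretations are invented, and it assumes that each pointer is a local "#id" reference carried by the ana attribute.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: two S-units pointing at interpretations via @ana.
SAMPLE = f"""<body xmlns="{TEI_NS}">
  <p>
    <s xml:id="s1" ana="#int-setup">Borghild brought a horn.</s>
    <s xml:id="s2" ana="#int-death #int-setup">Sinfiotli drank from it.</s>
  </p>
  <interpGrp type="structuralunit">
    <interp xml:id="int-setup">introduction</interp>
    <interp xml:id="int-death">death</interp>
  </interpGrp>
</body>"""


def interpretations(root):
    """Map the xml:id of each element carrying @ana to its interp labels."""
    interps = {
        el.get(XML_ID): (el.text or "").strip()
        for el in root.iter(f"{{{TEI_NS}}}interp")
    }
    result = {}
    for el in root.iter():
        ana = el.get("ana")
        if ana is None:
            continue
        # @ana holds whitespace-separated pointers; keep only local "#id" ones.
        ids = [p[1:] for p in ana.split() if p.startswith("#")]
        result[el.get(XML_ID)] = [interps[i] for i in ids if i in interps]
    return result


labels = interpretations(ET.fromstring(SAMPLE))
print(labels)
```

Because the interp elements carry no reference back to the text, the same interpretation can be reused by any number of passages, as s1 and s2 do here.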

NH-Non-hierarchical.xml#12945

# id text
4 NH XML employs a strongly hierarchical document model. At various points, these Guidelines discuss problems that arise when using XML to encode textual features that either do not naturally lend themselves to representation in a strictly hierarchical form or conflict with other hierarchies represented in the markup. Examples of such situations include:
11 NH Conflict between a verse text's metrical structure (e.g., its arrangement in stanzas and metrical lines) and its rhetorical or linguistic structure (e.g., phrases, sentences, and, for plays, acts, scenes, and speeches).
15 NH Conflict between metrical, rhetorical, or linguistic structure and the representation of direct speech, especially if the quoted speech is interrupted by other elements (e.g.,
23 NH Conflict between different analytical views or descriptions of a text or document, e.g., markup intended to encode diplomatic information about a word's appearance in a manuscript with markup intended to describe its morphology or pronunciation.
30 NH These Guidelines support several methods for handling non-hierarchical information:
75 NH at the back of one certain man and asked me,
88 NH , encodes the text according to its metrical features: line divisions (as here), stanzas or cantos in larger poems, and perhaps prosodic features like stress or syllable patterns, alliteration, or rhyme. A second view, which we might describe as the
94 NH we will encode only metrical lines and line groups; for the
98 NH , we will distinguish only direct quotation from other narration.
103 NHME Conceptually, the simplest method of disentangling two (or more) conflicting hierarchical views of the same information is to encode it twice (or more), each time capturing a single view.
124 NHME would be encoded by taking the same text and replacing the metrical markup with information about its sentence structure:
185 NHME This method is TEI-conformant. Its advantages are that each way of looking at the information is explicitly represented in the data and that the individual views are simple to process. The disadvantages are that the method requires the maintenance of multiple copies of identical textual content (an invitation to inconsistency) and that there is no explicit indication that the various views, which might be in separate files, are related to each other: it might prove difficult to combine the views or access information from one view while processing the file that contains the encoding of another.
186 NHME It has been shown, however, that it is possible to relate the different annotations in an indirect way: if the textual content of the annotations is identical, the very text can serve as a means for linking the different annotations, as described in
193 NHBM A second method for accommodating non-hierarchical objects in an XML document involves marking the start and end points of the non-nesting material. This prevents textual features that fall outside the privileged hierarchy from invalidating the document while identifying their beginnings and ends for further processing. The disadvantage of this method is that no single XML element represents the non-nesting material and, as a result, processing with XML technologies is significantly more difficult.
201 NHBM For some common structural features, the TEI provides milestone elements that can be used to mark the beginning of a textual feature. These include
228 NHBM The use of these elements is by definition TEI-conformant. Care should be taken, however, that the meaning of the milestone elements is preserved: semantically, for example,
230 NHBM is used to mark the start of a new (typographical) line. While in much modern poetry, typographical and metrical line divisions correspond,
232 NHBM does not itself make a metrical claim: in encoding verse from sources, such as Old English manuscripts, where physical line breaks are not used to indicate metrical lineation, the correspondence would break down entirely.
236 NHBM element. Attributes can then be used to indicate the type of feature being delimited and whether a given instance opens or closes the feature.
257 NHBM Another approach is to design custom elements that provide richer information about the feature being delimited or its boundaries. This information can be included as attribute values or as part of the element name itself: e.g.,
288 NHBM If the custom elements can be replaced by TEI elements and attributes without loss of information, this method is TEI-conformable (see
289 NHBM ); if the custom elements introduce information or distinctions that cannot be captured using standard TEI elements, the method is an extension.
297 NHBM , etc.) can be adapted so that they serve as empty segment boundary delimiters when the features they encode cross hierarchical boundaries. Additional attributes (
323 NHBM The method is TEI-conformable if the modified elements are placed in a distinct, non-TEI namespace (see
324 NHBM ), and if the modified elements and attributes can be mapped without loss of information to existing TEI markup structures such as milestone or anchor elements automatically (see
327 NHBM The method represents an Extension if the modified elements are placed in a distinct, non-TEI namespace, but contain information or distinctions that cannot be algorithmically translated to existing TEI elements without loss of information (see
330 NHBM The method is non-conformant—and indeed strongly deprecated—if the modified elements and attributes are not placed in a distinct, non-TEI namespace (see
334 NHBM In each of the above examples (except the last), the relationship between the start and end delimiters (where these exist) of a given feature is implicit: it is assumed that "end" delimiters close the nearest preceding "start" delimiter, or, in the case of milestones, that the milestone marks both the end of the preceding example and the beginning of the next. Complications arise, however, when the non-nesting text overlaps with other non-nesting text of the same type, as, for example, in a grammatical analysis of the various possible interpretations of the
379 NHBM tag with the
381 NHBM value
385 NHBM with the same value on
395 NHBM tag with the
397 NHBM value
401 NHBM tag that has the same value on
405 NHBM Despite their advantages, segment boundary delimiters incur the disadvantage of cumbersome processing: since the elements of the analysis (e.g., the sentences in the poems, or phrases in the above example) are not uniformly represented by nodes in the document tree, they must be reconstituted by software in an ad hoc fashion, which is likely to be difficult and may be error prone.
407 NHBM Most important for some encoders, the method also disguises the relationship between the beginning and the ending of each logical element. This makes it impossible for standard validation software to provide the same kind of validation possible elsewhere in the encoding. When using grammar-based schema languages it is not possible to define a content model for the range limited by empty elements.
408 NHBM Grammar-based schema languages (e.g., DTD, W3C Schema, and RELAX NG) are used to define markup languages (e.g., XHTML or TEI). Rule-based schema languages (e.g., Schematron) can be used to define further constraints. Such a rule-based schema language permits a sequence of certain elements between empty elements to be legitimized or prohibited.
414 NHVE A third method involves breaking what might be considered a single logical (but non-nesting) element into multiple smaller structural elements that fit within the dominant hierarchy but can be reconstituted virtually. For example, if a passage of direct discourse begins in the middle of one paragraph and continues for several more paragraphs, one could encode the passage as a series of
418 NHVE element. The resulting encoding is valid XML, but the text in each
424 NHVE In the case of our selection from Pinsky's poem, for example, the second passage of direct quotation, which crosses a line boundary and is broken up by a
425 NHVE She said
478 NHVE marks seven spans of text using
490 NHVE is a string corresponding to no single grammatical category.
492 NHVE Taken together, these problems can make automatic analysis of the fragmented features difficult. An analysis that intended to count the number of sentences in Wordsworth's poem, for example, would arrive at an inflated figure if it understood the
494 NHVE elements to represent complete rhetorical sentences; if it wanted to do an analysis of his syntax, it would not be able to assume that
498 NHVE The technique of fragmentation is often complemented by the technique of virtual joins. Virtual joins may be used to combine objects in the text into a new hierarchy. Here is
500 NHVE again; this time the relationship between the parts of the fragmented sentences is indicated explicitly using the
545 NHVE attribute with the value
577 NHVE This method is TEI-conformant and simple to use. Its disadvantage is that it does not work well for cases of self-overlap, or if there are nested occurrences of the same element type, as it can become difficult to ascertain which initial, medial, or final partial element should be combined with which others or in which order. This problem becomes evident if we attempt to combine a detailed grammatical view of the Pinsky example with its metrical encoding:
705 NHVE The major advantage of fragmentation and virtual joins is that it allows all the hierarchies in the text to be handled explicitly: both the privileged one directly represented and the alternate hierarchy that has been split up and rejoined. The major disadvantages are that (like most of the other methods described here) it privileges one hierarchy over the others, requires special processing to reconstitute the elements of the other hierarchies, and, except in the case of
713 NHSO Most markup is characterized by the embedding of elements in the text. An alternative approach separates the text and the elements used to describe it. This approach is known as stand-off markup (see section
714 NHSO ). It establishes a new hierarchy by building a new tree whose nodes are XML elements that do not contain textual content, but rather links to another
717 NHSO a node in another XML document or a span of text
718 NHSO . This approach can be subdivided according to different criteria. A first distinction concerns the link base, i.e. the content to which annotations are to be applied. Sometimes the link target contains markup that can be referred to explicitly, as in the following example where the offset markup uses the
724 NHSO A fake namespace is given for XInclude here, to avoid the markup being interpreted literally during processing.
798 NHSO Note that the layer that uses XInclude to build another hierarchy might well be in another document, in which case the value of
802 NHSO would need to be the URL of the document that contains the base layer, in this case the
810 NHSO elements, and that there exists off-the-shelf software that will perform appropriate processing. Stand-off markup may be used even when the base text being annotated is plain text, i.e. does not have any XML encoding. In this case, the range of text to be marked up is indicated by character offsets (see
812 NHSO ). Another distinction concerns the number of files which can serve as link targets. Often, one (dedicated) annotation is used as the link target of all the other annotations. It is also possible to freely interlink several layers.
814 NHSO It has been noted that stand-off markup has several advantages over embedded annotations. In particular, it is possible to produce annotations of a text even when the source document is read-only. Furthermore, annotation files can be distributed without distributing the source text. Further advantages mentioned in the literature are that discontinuous segments of text can be combined in a single annotation, that independent parallel coders can produce independent annotations, and that different annotation files can contain different layers of information. Lastly, it has also been noted that this approach is elegant.
818 NHSO Inasmuch as it uses elements not included in the TEI namespace, stand-off markup involves an extension of the TEI.
824 NHNX There exist many non-XML methods of encoding a text that either solve or do not suffer the problem of the inability to encode overlapping hierarchies. These include, but are not limited to, the following proposals.
830 NHNX Designing a form of document representation in which several trees share all or part of the same frontier, and in which each individual view of the document has the form of a tree (see
836 NHNX ), which stores a body of information as a set of intertwined XML trees. This approach eliminates unnecessary redundancy and makes the database readily updatable, while allowing the user to exploit different hierarchical access paths.
850 NHNX proposal. This offers alternatives to the basic XML linear form as well as its data and processing models. It uses an alternative notation to XML and a data structure based on Core Range Algebra (
858 NHNX . This provides a notation (TexMECS) and a data structure (Goddag) as well as a draft constraint language for the representation of non-hierarchical structures; see
862 NHNX These approaches are based either on non-standard XML processing or data models, or not based on XML at all. Since TEI is currently based on XML, they are not described any further in these Guidelines. Use of these methods with the TEI will certainly involve extensions; in most cases the documents will also be non-conformant.
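
Note on the NHVE rows: the "special processing" needed to reconstitute a fragmented element can be sketched as a walk along a chain of pointers. This is a minimal sketch under stated assumptions: the verse lines are invented (they are not the poems quoted in the Guidelines), the markup is left un-namespaced for brevity, the fragments are assumed to be chained with next and prev attributes holding local "#id" pointers, and join_fragments is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: one quotation split into three fragments across lines.
SAMPLE = """<lg>
  <l><seg xml:id="q1" next="#q2">Tell me</seg>, she said,</l>
  <l><seg xml:id="q2" prev="#q1" next="#q3">where the road</seg></l>
  <l><seg xml:id="q3" prev="#q2">ends</seg>.</l>
</lg>"""


def join_fragments(root):
    """Rebuild the text of each chain of fragments linked by @next."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    joined = []
    for el in by_id.values():
        # A chain starts at a fragment that has @next but no @prev.
        if el.get("prev") is not None or el.get("next") is None:
            continue
        parts, seen, cur = [], set(), el
        while cur is not None and cur.get(XML_ID) not in seen:
            seen.add(cur.get(XML_ID))  # guard against circular pointers
            parts.append("".join(cur.itertext()).strip())
            nxt = cur.get("next")
            cur = by_id.get(nxt.lstrip("#")) if nxt else None
        joined.append(" ".join(parts))
    return joined


sentences = join_fragments(ET.fromstring(SAMPLE))
print(sentences)
```

The sketch also shows the weakness noted in row 577: with nested or self-overlapping fragments of the same element type, the chain alone does not say how the rebuilt units relate to one another.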

CO-CoreElements.xml#13243

# id text
2 CO Elements Available in All TEI Documents
4 CO This chapter describes elements which may appear in any kind of text and the tags used to mark them in all TEI documents. Most of these elements are freely floating phrases, which can appear at any point within the textual structure, although they must generally be contained by a higher-level element of some kind (such as a paragraph). A few of the elements described in this chapter (for example, bibliographic citations and lists) have a comparatively well-defined internal structure, but most of them have no consistent inner structure of their own. In the general case, they contain only a few words, and are often identifiable in a conventionally printed text by the use of typographic conventions such as shifts of font, use of quotation or other punctuation marks, or other changes in layout.
8 CO tag used to mark paragraphs, the prototypical formal unit for running text in many TEI modules. This is followed, in section
9 CO , by a discussion of some specific problems associated with the interpretation of conventional punctuation, and the methods proposed by the Guidelines for resolving ambiguities therein.
12 CO ) describes a number of phrase-level elements commonly marked by typographic features (and thus well-represented in conventional markup languages). These include features commonly marked by font shifts (section
13 CO ) and features commonly marked by quotation marks (section
18 CO introduces some phrase-level elements which may be used to record simple editorial interventions, such as emendation or correction of the encoded text. The elements described here constitute a simple subset of the full mechanisms for encoding such information (described in full in chapter
22 CO ) describes several phrase-level and inter-level elements which, although often of interest for analysis or processing, are rarely explicitly identified in conventional printing. These include names (section
35 CO , describe two kinds of quasi-structural elements: lists and notes. These may appear either within chunk-level elements such as paragraphs, or between them. Several kinds of lists are catered for, of an arbitrary complexity. The section on notes discusses both notes found in the source and simple mechanisms for adding annotations of an interpretive nature during the encoding; again, only a subset of the facilities described in full elsewhere (specifically, in chapter
39 CO introduces some simple ways of representing graphic or other non-textual content found in a text. A fuller discussion of the multimedia facilities supported by these Guidelines may be found in chapters
44 CO , describes methods of encoding within a text the conventional system or systems used when making references to the text. Some reference systems have attained canonical authority and must be recorded to make the text useable in normal work; in other cases, a convenient reference system must be created by the creator or analyst of an electronic text.
49 CO Additional elements for the encoding of passages of verse or drama (whether prose or verse) are discussed in section
53 CO , describing the structure of the TEI document type definition.
57 COPA The paragraph is the fundamental organizational unit for all prose texts, being the smallest regular unit into which prose can be divided. Prose can appear in all TEI texts, even those that are primarily of another genre (e.g., verse); thus the paragraph is described here, as an element which can appear in any kind of text.
59 COPA Paragraphs can contain any of the other elements described within this chapter, as well as some other elements which are specific to individual text types. We distinguish
70 COPA Because paragraphs may appear in different base or additional tag sets, their possible contents may differ in different kinds of documents. In particular, additional elements not listed in this chapter may appear in paragraphs in certain kinds of text. However, the elements described in this chapter are always by default available in all kinds of text.
86 COPA Since paragraphs are usually explicitly marked in Western texts, typically by indentation, the application of the
88 COPA tag usually presents few problems.
90 COPA In some cases, the body of a text may comprise but a single paragraph:
107 COPA The following extract from a Russian fairy tale demonstrates how other phrase level elements (in this case
139 COPU Punctuation marks cause two distinct classes of problem for text markup: the marks may not be available in the character set used, and they may be significantly ambiguous. To some extent, the availability of the Unicode character set addresses the first of these problems, since it provides specific code points for most punctuation marks, and also the second to the extent that it distinguishes glyphs (such as stop, comma, and hyphen) which are used with different functions.
140 COPU Where punctuation itself is the subject of study, the element
143 COPU . Where the character used for a punctuation mark is not available in Unicode, the
150 COPU-1 Punctuation is itself a form of markup, historically introduced to provide the reader with an indication about how the text should be read. As such, it is unsurprising that encoders will often wish to encode directly the purpose for which punctuation was provided, as well as, or even instead of, the punctuation itself. We discuss some typical cases below.
157 COPU-1 respectively. However, there are independent reasons for tagging these, whether or not they are marked by full stops, and the polysemy of the full stop itself is perhaps no different from that of any other character in the writing system.
163 COPU-1 usually mark the end of orthographic sentences, but may also be used as a mid-sentence comment by the author (
167 COPU-1 to query a word or expression or mark a sentence as dubious in linguistic discussion). Such usages may be distinguished by marking S-units, in which case the mid-sentence uses of these punctuation marks may be left unmarked, or tagged using the
173 COPU-1 are used for a variety of purposes: as a mark of omission, insertion, or interruption; to show where a new speaker takes over (in dialogue); or to introduce a list item. In the latter two cases particularly, it is clearly desirable to mark the function as well as its rendition using the elements
182 COPU-1 may be removed from text contained by
186 COPU-1 elements on editorial grounds, or they may be marked in a variety of ways; see the discussion of quotation and related features in section
190 COPU-1 must be distinguished from single quote marks. As with hyphens, this disambiguation is best performed by selecting the appropriate Unicode character, though it may also be represented by using appropriate XML markup for quotations as suggested above. However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and (occasionally) plural forms. Full disambiguation of these uses belongs to the level of linguistic analysis and interpretation.
193 COPU-1 and other marks of suspension such as dashes or ellipses are often used to signal information about the syntactic structure of a text fragment. Full disambiguation of their uses also belongs to the level of linguistic analysis and interpretation, and will therefore need to use the mechanisms discussed in chapter
196 COPU-1 Where punctuation marks are disambiguated by tagging their assumed function in the text (for example, quotation), it may be debated whether they should be excluded or left as part of the text. In the case of quotation marks, it may be more convenient to distinguish opening from closing marks simply by using the appropriate Unicode character than to use the
200 COPU-1 Where segmentation of a text is performed automatically, the accuracy of the result may be considerably enhanced by a first pass in which the function of different punctuation characters is explicitly marked. This need not be done for all cases, but only where the structural function of the punctuation markup (for example as a word or phrase delimiter) is ambiguous. Thus, dots indicating abbreviation might be distinguished from dots indicating sentence end, and exclamation or question marks internal to a sentence distinguished from those which terminate one. Furthermore, when encoding historical materials, it may be considered essential to retain the original punctuation, whether by using an appropriate character code, if this is available (or using the
202 COPU-1 element where it is not) or by an explicit encoding using
204 COPU-1 . The particular method adopted will vary depending upon the feature concerned and upon the purpose of the project.
209 COPU-2 Hyphenation as a phenomenon is generally of most concern when producing formatted text for display in print or on screen: different languages and systems have developed quite sophisticated sets of rules about where hyphens may be introduced and for what reason. These generally do not concern the text encoder, since they belong to the domain of formatting and will generally be handled by the rendition software in use. In this section, we discuss issues arising from the appearance of hyphens in pre-existing formatted texts which are being re-encoded for analysis or other processing. Unicode distinguishes four characters visually similar to the hyphen, including the undifferentiated hyphen-minus (U+002D) which is retained for compatibility reasons. The hard hyphen (U+2010) is distinguished from the minus sign (U+2212) which is for use in mathematical expressions, and also from the soft hyphen (U+00AD) which may appear in
211 COPU-2 documents to indicate places where it is acceptable to insert a hyphen when the document is formatted.
213 COPU-2 Historically, the hard hyphen has been used in printed or manuscript documents for two distinct purposes. In many languages, it is used between words to show that they function as a single syntactic or lexical unit. For example, in French,
219 COPU-2 etc. It may also have an important role in disambiguation (for example, by distinguishing say a
223 COPU-2 ). Such usages, although possibly problematic when a linguistic analysis is undertaken, are not generally of concern to text encoders: the hyphen character is usually retained in the text, because it may be regarded as part of the way a compound or other lexical item is spelled. Deciding whether a compound is to be decomposed into its constituent parts, and if so how, is a different question, involving consideration of many other phenomena in addition to the simple presence of a hyphen.
225 COPU-2 When it appears at the end of a printed or written line however, the hard hyphen generally indicates that—contrary to what might be expected—a word is not yet complete, but continues on the next line (or over the next page or column or other boundary). The hyphen character is not, in this case, part of the word, but just a signal that the word continues over the break. Unfortunately, few languages distinguish these two cases visually, which necessarily poses a problem for text encoders. Suppose, for example, that we wish to investigate a diachronic English corpus for occurrences of "tea-pot" and "teapot", to find evidence for the point at which this compound becomes lexicalized. Any case where the word is hyphenated across a linebreak, like this:
231 COPU-2 They may decide simply to remove any end-of-line hyphenation from the encoded text, on the grounds that its presence is purely a secondary matter of formatting. This will obviously apply also if line endings are themselves regarded as unimportant.
233 COPU-2 Alternatively, they may decide to record the presence of the hyphen, perhaps on the grounds that it provides useful morphological information; perhaps in order to retain information about the visual appearance of the original source. In either case, they need to decide whether to record it explicitly, by including an appropriate punctuation character in the text data, or implicitly by supplying an appropriate symbolic value for one or more of the attributes on the
235 COPU-2 or other milestone element used to record the fact of the line division. If the hyphen is included in the character data of the TEI document, it might be marked up using the
242 COPU-2 A similar range of possibilities applies equally to the representation of other common punctuation marks, notably quotation marks, as discussed in
246 COPU-2 text data
249 COPU-2 , even if those units are not explicitly indicated by the XML markup. The ambiguity of the end-of-line hyphen also causes problems in the way a processor identifies such tokens in the absence of explicit markup. If token boundaries are not explicitly marked (for example using the
253 COPU-2 elements), for most languages a processor will rely on character class information to determine where they are to be found: some punctuation characters are considered to be word-breaking, while others are not. In XML, the newline character in text data is a kind of whitespace, and is therefore word breaking. However, it is generally unsafe to assume that whitespace adjacent to markup tags will always be preserved, and it is decidedly unsafe to assume that markup tags themselves are equivalent to whitespace.
261 COPU-2 elements are notable exceptions to this general rule, since their function is precisely to represent (or replace) line, page, or column breaks, which, as noted above, are generally considered to be equivalent to whitespace. These elements provide a more reliable way of preserving the lineation, pagination, etc. of a source document, since the encoder should not assume that (untagged) line breaks etc. in an XML source file will necessarily be preserved.
269 COPU-2 to indicate whether or not the element corresponds with a token boundary. The value
271 COPU-2 is also available, for cases where the encoder does not wish (or is unable) to determine whether the orthographic token concerned is broken by the line ending.
273 COPU-2 As a final complication, it should be noted that in some languages, particularly German and Dutch, the spelling of a word may be altered in the presence of end-of-line hyphenation. For example, in Dutch, the word
277 COPU-2 ), occurring at the end of a line may be hyphenated as
279 COPU-2 , with a single letter a. An encoder wishing to preserve the original form of this orthographic token in a printed text while at the same time facilitating its recognition as the word
281 COPU-2 will therefore need to rely on a more sophisticated process than simply removing the hyphen. This is however essentially the same as any other form of normalization accompanying the recognition of variations in spelling or morphology: as such it may be encoded using the
284 COPU-2 , or the more sophisticated mechanisms for linguistic analysis discussed in chapter
291 COHQ This section deals with a variety of textual features, all of which have in common that they are frequently realized in conventional printing practice by the use of such features as underlining, italic fonts, or quotation marks, collectively referred to here as
293 COHQ . After an initial discussion of this phenomenon and alternate approaches to encoding it, this section describes ways of encoding the following textual features, all of which are conventionally rendered using some kind of highlighting:
295 COHQ emphasis, foreign words and other linguistically distinct uses of highlighting
308 COHQW typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings.
309 COHQW Although the way in which a spoken text is performed (for example, the voice quality, loudness, etc.) might be regarded as analogous to
311 COHQW in this sense, these Guidelines recommend distinct elements for the encoding of such
313 COHQW in spoken texts. See further section
315 COHQW The purpose of highlighting is generally to draw the reader's attention to some feature or characteristic of the passage highlighted; this section describes the elements recommended by these Guidelines for the encoding of such textual features.
319 COHQW distinct in some way—as foreign, dialectal, archaic, technical, etc.
321 COHQW emphatic, and which would for example be stressed when spoken
323 COHQW not part of the body of the text, for example cross-references, titles, headings, labels, etc.
325 COHQW identified with a distinct narrative stream, for example an internal monologue or commentary.
327 COHQW attributed by the narrator to some other agency, either within the text or outside it: for example, direct speech or quotation.
329 COHQW set apart from the text in some other way: for example, proverbial phrases, words mentioned but not used, names of persons and places in older texts, editorial corrections or additions, etc.
332 COHQW The textual functions indicated by highlighting may not be rendered consistently in different parts of a text or in different texts. (For example, a foreign word may appear in italics if the surrounding text is in roman, but in roman if the surrounding text is in italics.) For this reason, these Guidelines distinguish between the encoding of rendering itself and the encoding of the underlying feature expressed by it.
341 COHQW ). This allows the encoder both to specify the function of a highlighted phrase or word, by selecting the appropriate element described here or elsewhere in the Guidelines, and to further describe the way in which it is highlighted, by means of an attribute. If the encoder wishes to offer no interpretation of the feature underlying the use of highlighting in the source text, then the
343 COHQW element may be used, which indicates only that the text so tagged was highlighted in some way.
354 COHQW attribute are not formally defined in this version of the Guidelines. It may be used to document any peculiarity of the way a given segment of text was rendered in the original source text, and may thus express a very large range of typographic or other features, by no means restricted to typeface, type size, etc. The
356 COHQW attribute, by contrast, defines the way the source text was rendered using a formally defined style language, such as the W3C standard Cascading Stylesheet Language (
359 COHQW attribute is used to point to one or more fragments expressed using such a language which have been predefined in the TEI header using the
370 COHQW for analytic purposes, it is in general more useful to know the intended function of a highlighted phrase than simply that it is distinct.
373 COHQW In many, if not most, cases the underlying function of a highlighted phrase will be obvious and non-controversial, since the distinctions indicated by a change of highlighting correspond with distinctions discussed elsewhere in these Guidelines. The elements available to record such distinctions are, for the most part, members of the
377 COHQW class mentioned above constitute the
381 COHQW The distinction between the two classes is simple, and typified by the two elements
385 COHQW : the former marks simply that a passage is typographically distinct in some way, while the latter asserts that a passage is linguistically emphasized for some purpose. These two properties, though often combined, are not identical. It should be recognized, however, that cases do exist in which it is not economically feasible to mark the underlying function (e.g. in the preparation of large text corpora), as well as cases in which it is not intellectually appropriate (as in the transcription of some older materials, or in the preparation of material for the study of typographic practice). In such cases, the
408 COHQHF Words or phrases which are not in the main language of the text should be tagged as such, at least where the fact is indicated in the text. Where the word or phrase concerned is already distinguished from the rest of the text by virtue of its function (for example, because it is a name, a technical term, a quotation, a mentioned word, etc.) then the global
410 COHQHF attribute should be used to specify additionally that its language distinguishes it from the surrounding text. Any element in the TEI scheme may take a
412 COHQHF attribute, which specifies both the writing system and the language used by its content (see sections
430 COHQHF element should not be used to represent foreign words which are mentioned or glossed within the text: for these use the appropriate element from section
444 COHQHF Elements which do not explicitly state the language of their content by means of an
446 COHQHF attribute are understood to inherit a value for it from their parent element. In the general case, therefore, it is recommended practice to supply a default value for
448 COHQHF on the root
468 COHQHE element. In printed works, emphasis is generally indicated by devices such as the use of an italic font, a large typeface, or extra wide letter spacing; in manuscripts and typescripts, it is usually indicated by the use of underlining. As the following examples demonstrate, an encoder may choose whether or not to make explicit the particular type of rendition associated with the emphasis. If a source text consistently renders a particular feature (e.g. emphasis or words in foreign languages) in a particular way, the rendering associated with that feature may be described in the TEI header using the
476 COHQHE attributes may then be used to describe examples which deviate from the norm. For example, assuming that the TEI header has defined a default rendering for the
483 COHQHE If on the other hand no such default has been defined for the element, the encoder may specify it informally using the
489 COHQHE If the encoder wishes to express information about the rendition used in the source using a formal language such as CSS, then the
497 COHQHE In cases where the rendition of a source needs to be indicated several times in a document, it may be more convenient to provide a default value using the
499 COHQHE element in the header. If a small number of distinct values are required, it may also be convenient to define them all by means of a series of
501 COHQHE elements which can then be referenced from the elements in question by means of the global
528 COHQHE attribute, as discussed above, without however taking a position as to the function of the highlighting. This may also be useful if the text is to be processed in two stages: representing simply typographic distinctions during a first pass, and then replacing the
554 COHQHE in the sense
574 COHQHD element is provided for this purpose. Its attributes allow for additional information characterizing the nature of the linguistic distinction to be made in two distinct ways: the
576 COHQHD attribute simply assigns a user-defined code of some kind to the word or phrase which assigns it to some register, sub-language, etc. No recommendations as to the set of values for this attribute are provided at this time, as little consensus exists in the field.
578 COHQHD Alternatively, the remaining three attributes may be used in combination to place a word or phrase on a three-dimensional scale sometimes used in descriptive linguistics, as for example in
598 COHQHD that is, with respect to a social classification, for example as technical, polite, impolite, restricted, etc. Again, no recommendations are made for the values of these attributes at this time; the encoder should provide a description of the scheme used in the appropriate section of the header (see section
614 COHQHD should be preferred to these simple characterizations. It may also be preferable to record the kinds of analysis suggested here by means of the simple annotation element
628 COHQQ One form of presentational variation found particularly frequently in written and printed texts is the use of quotation marks. As with the typographic variations discussed in the preceding section, it is generally helpful to separate the encoding of the underlying textual feature (for example, a quotation or a piece of direct speech) from the encoding of its rendering (for example, the use of a particular style of quotation marks).
630 COHQQ This section discusses the following elements, all of which are often rendered by the use of quotation marks:
663 COHQQ The most common and important use of quotation marks is, of course, to mark
664 COHQQ quotation
665 COHQQ , by which we mean simply any part of the text which the author or narrator wishes to attribute to some agency other than the narrative voice. The
667 COHQQ element may be used if no further distinction beyond this is judged necessary. If it is felt necessary to distinguish such passages further, for example to indicate whether they are regarded as speech, writing, or thought, either the
673 COHQQ for words or phrases represented as being spoken or thought by people or characters within the current work. The
675 COHQQ element is used for cases where the author or narrator distances him or herself from the words in question without however attributing them to any other voice in particular. The
677 COHQQ element is appropriate for a case where a word or phrase is being discussed in the body of a text rather than forming part of the text directly.
679 COHQQ As noted above, if the distinction among these various reasons why a passage is offset from surrounding text cannot be made reliably, or is not of interest, then any representation of speech, thought, or writing may simply be marked using the
683 COHQQ Quotation may be indicated in a printed source by changes in type face, by special punctuation marks (single or double or angled quotes, dashes, etc.) and by layout (indented paragraphs, etc.), or it may not be explicitly represented at all. If these characteristics are of interest, one or other of the global
690 COHQQ Quotation marks themselves may, like other punctuation marks, be felt for some purposes to be worth retaining within a text, quite independently of their description by the
692 COHQQ attribute. This should generally be done using the appropriate Unicode character, or, if this is not possible, a numeric character reference (see
693 COHQQ ). If the encoder decides both to retain the quotation marks and to represent their function by means of an explicit tag such as
695 COHQQ , the quotation marks should be included within the element, rather than outside it, as in the first example below:
703 COHQQ Alternatively, since this use of the leading mdash is very common typographic practice, it may be considered unnecessary to retain it in the encoding. Its presence in the source might instead be signalled using one of the attributes
711 COHQQ element, which can then be referenced using the
729 COHQQ element provided in the TEI header (see
730 COHQQ ) to indicate that quotation marks have not been retained in the encoding; their presence in the source is implied by the
734 COHQQ Whether or not the quotation marks are suppressed, their presence and nature may be described using some appropriate set of conventions in the
748 COHQQ . If the rendition of passages tagged as
750 COHQQ is uniform throughout a text, then the
754 COHQQ element in the header may be used to specify a default rendering, in which case the same section might simply be tagged:
779 COHQQ This may be used to make explicit who is speaking:
794 COHQQ attribute may be supplied whether or not an indication of the speaker is given explicitly in the text. It may take the form (as above) of a normalized form of the speaker's name, but its role is to act as a pointer to a location elsewhere in the text, or another document, where data about each speaker may be supplied. While this attribute could point to any source of information about the speaker available by a URI, the most appropriate place for such information is within the
796 COHQQ component of the TEI header, as further discussed in
797 COHQQ but for simple cases like the above, a simple list of speakers located in the front or back matter of the text may suffice.
799 COHQQ It may also be useful to distinguish representations of speech from representations of thought, in modern printed texts often indicated by a change of typeface. The
809 COHQQ Quoted matter may be embedded within quoted matter, as when one speaker reports the speech of another:
822 COHQQ Direct speech nested in this way is treated in the same way as elsewhere: a change of rendition may occur, but the same element should be used. An encoder may however choose to distinguish between direct speech which contains quotations from extra-textual matter and direct speech itself, as in the following example:
839 COHQQ element may be used to group together the quotation and its associated bibliographic reference, which should be encoded using the elements for bibliographic references discussed in section
860 COHQQ Like other bibliographic references, the citation associated with a quotation may be represented simply by a cross-reference, as in this example:
869 COHQQ impractical. In such circumstances, the quotation can be linked to a bibliographical reference using
883 COHQQ Unlike most of the other elements discussed in this chapter, direct speech and quotations may frequently contain other high-level elements such as paragraphs or verse lines, as well as being themselves contained by such elements. Three possible solutions exist for this well-known structural problem:
885 COHQQ the quotation is broken into segments, each of which is entirely contained within a paragraph
887 COHQQ the quotation is marked up using stand-off markup
889 COHQQ the quotation boundaries are represented by empty segment boundary delimiter elements
896 COHQQ is provided for all cases in which quotation marks are used to distance the quoted text from the narrator or speaker. Common examples include the
932 COHQU This section describes a set of textual elements which are used to provide a gloss, alternate identification, or description of something.
934 COHQU Technical terms are often italicized or emboldened upon first mention in printed texts; an explanation or gloss is sometimes given in quotation marks. Linguistic analyses conventionally cite words in languages under discussion in italics, providing a gloss immediately following marked with single quotation marks. Other texts in which individual words or phrases are
935 COHQU mentioned
943 COHQU may mark them either with italics or with quotation marks, and will gloss them less regularly.
957 COHQU is present, it may be linked to the term it is glossing by means of its
961 COHQU value to the
965 COHQU element and provide that id as the value of the
999 COHQU For technical terminology in particular, and generally in terminological studies, it may be useful to associate an instance of a term within a text with a canonical definition for it, which is stored either elsewhere in the same text (for example in a glossary of terms) or externally, for example in a database, authority file, or published standard. The attributes
1008 COHQU Another group of elements is used to supply different kinds of names for objects described by the TEI. Examples of this are documentation of elements, attributes, classes (and also attribute values where appropriate), and description of glyphs.
1015 COHQU element mentioned above, these elements constitute the
1039 COHQHEG This encoding would, however, lose the important distinction between an italicized title and an italicized foreign phrase. Many other phrases might also be italicized in the text, and a retrieval program seeking to identify foreign terms (for example) would not be able to produce reliable results by simply looking for italicized words. Where economic and intellectual constraints permit, therefore, it would be preferable to encode both the function of the highlighted phrases and their appearance, as follows:
1049 COHQHEG debatings. She says I am
1068 COHQHEG ; the former is emphasized, while the latter is proverbial. It also provides an ironic gloss for the words
1074 COHQHEG . The glossed phrases are not, however, technical terms or cited words, but quoted phrases, as if the writer were putting words into her own and her mother's mouths. Finally, the words
1111 COED As in editing a printed text, so in encoding a text in electronic form, it may be necessary to accommodate editorial comment on the text and to render account of any changes made to the text in preparing it. The tags described in this section may be used to record such editorial interventions, whether made by the encoder, by the editor of a printed edition used as a copy text, by earlier editors, or by the copyists of manuscripts.
1117 COED . The examples given here illustrate only simple cases of editorial intervention; in particular, they permit economical encoding of a simple set of alternative readings of a short span of text. To encode multiple views of large or heterogeneous spans of text, the mechanisms described in chapter
1123 COED , that is, a code indicating the person or agency responsible for making the editorial intervention in question, and also an indication of the degree of
1124 COED certainty
1138 COED Many of the elements discussed here can be used in two ways. Their primary purpose is to indicate that the text encoded as the element's content represents an editorial intervention (or non-intervention) of a specific kind, indicated by the element itself. However, pairs or other meaningful groupings of such elements can also be supplied, wrapped within a special purpose
1143 COED This element enables the encoder to represent for example a text in its
1145 COED uncorrected and unaltered form, alongside the same text in one or more
1148 COED view
1149 COED of a text and another, so that (for example) a stylesheet may be set to display either the text in its original form or after the application of editorial interventions of particular kinds.
1153 COED class. The default members of this class are
1177 COED indication or correction of apparent errors
1188 COEDCOR When the copy text is manifestly faulty, an encoder or transcriber may elect simply to correct it without comment, although for scholarly purposes it will often be more generally useful to record both the correction and the original state of the text. The elements described here enable all three approaches, and allow the last to be done in such a way as to make it easy for software to present either the original or the correction.
1193 COEDCOR The following examples show alternative treatment of the same material. The copy text reads:
1194 COEDCOR Another property of computer-assisted historical research is that data modelling must permit any one textual feature or part of a textual feature to be a part of more than one information model and to allow the researcher to draw on several such models simultaneously, for example, to select from a machine-readable text those marginal comments which indicate that the date's mentioned in the main body of the text are incorrect.
1196 COEDCOR An encoder may choose to correct the typographic error, either silently or with an indication that a correction has been made, as follows:
1206 COEDCOR If the encoder elects both to record the original source text and to provide a correction for the sake of word-search and other programs, both
1226 COEDCOR If it is desired to indicate the person or edition responsible for the emendation, this might be done as follows:
1243 COEDCOR attribute has been used to indicate responsibility for the correction. Its value (
1250 COEDCOR element within the TEI header, but any element might be indicated in this way, including for example a
1269 COEDCOR Where, as here, the correction takes the form of adding text not otherwise present in the text being encoded, the encoder should use the
1271 COEDCOR element. Where the correction is present in the text being encoded, and consists of some combination of visible additions and deletions, the elements
1276 COEDCOR below. Where the correction takes the form of addition of material not present in the original because of physical damage or illegibility, the
1279 COEDCOR correction
1282 COEDCOR element may be used. These and other elements to support the detailed encoding of authorial or scribal interventions of this kind are all provided by the module described in chapter
1292 COEDREG When the source text makes extensive use of variant forms or non-standard spellings, it may be desirable for a number of reasons to
1299 COEDREG In some contexts, the term
1304 COEDREG As with other such changes to the copy text, the changes may be made silently (in which case the TEI header should specify the types of silent changes made) or may be explicitly marked using the following elements:
1340 COEDREG Alternatively, the encoder may elect to record both old and new spellings, so that (for example) the same electronic text may serve as the basis of an old- or new-spelling edition:
1369 COEDADD The following elements are used to indicate when words or phrases have been omitted from, added to, or marked for deletion from, a text. Like the other editorial elements, they allow for a wide range of editorial practices:
1376 COEDADD Encoders may choose to omit parts of the copy text for reasons ranging from illegibility of the source or impossibility of transcribing it, to editorial policy, e.g. a systematic exclusion of poetry or prose from an encoding. The full details of the policy decisions concerned should be documented in the TEI header (see section
1377 COEDADD ). Each place in the text at which omission has taken place should be marked with a
1379 COEDADD element, optionally with further information about the reason for the omission, its extent, and the person or agency responsible for it, as in the following examples:
1380 COEDADD Note that the extent of the gap may be marked precisely using attributes
1386 COEDADD attribute. Other, more detailed, options are also available for representing dimensions of any kind; see further
1391 COEDADD element may be used to supply a description of the material omitted, where that is considered useful:
1407 COEDADD elements may be used to record where words or phrases have been added or deleted in the copy text. They are not appropriate where longer passages have been added or deleted, which span several elements; for these, the elements
1414 COEDADD Additions to a text may be recorded for a number of reasons. Sometimes they are marked in a distinctive way in the source text, for example by brackets or insertion above the line (
1417 COEDADD additions
1429 COEDADD element should not be used to mark editorial changes, such as supplying a word omitted by mistake from the source text or a passage present in another version. In these cases, either the
1438 COEDADD element is used to mark passages in the original which cannot be read with confidence, or about which the transcriber is uncertain for other reasons, as for example when transcribing a partially inaudible or illegible source. Its
1444 COEDADD element, to indicate the cause of uncertainty and the person responsible for the conjectured reading.
1450 COEDADD or from a spoken text:
1456 COEDADD Where the material affected is entirely illegible or inaudible, the
1462 COEDADD element is used to mark material which is deleted in the source but which can still be read with some degree of confidence, as opposed to material which has been omitted by the encoder or transcriber either because it is entirely illegible or for some other reason. This is of particular importance in transcribing manuscript material, though deletion is also found in printed texts, sometimes for humorous purposes:
1476 COEDADD attribute may be used to distinguish different methods of deletion in manuscript or typescript material, as in this line from the typescript of Eliot's
1492 COEDADD provides a way of grouping additions and deletions of this kind.
1496 COEDADD element should not be used where the deletion is such that material cannot be read with confidence, or read at all, or where the material has been omitted by the transcriber or editor for some other reason. Where the material deleted cannot be read with confidence, the
1498 COEDADD tag should be used with the
1500 COEDADD attribute indicating that the difficulty of transcription is due to deletion. Where material has been omitted by the transcriber or editor, this may be indicated by use of the
1506 COEDADD element. Text supplied or marked as unnecessary by an editor should be marked with the
1515 COEDADD . These two sets of elements allow the encoder to distinguish editorial changes from those visible in the source text.
1525 CONA This section describes a number of textual features which it is often convenient to distinguish from their surrounding text. Names, dates, and numbers are likely to be of particular importance to the scholar treating a text as source for a database; distinguishing such items from the surrounding text is however equally important to the scholar primarily interested in lexis.
1534 CONARS referring string
1571 CONARS element may be used for any reference to a person, place, etc., not only to references in the form of a proper noun or noun phrase.
1580 CONARS element by contrast is provided for the special case of referencing strings which consist only of proper nouns; it may be used synonymously with the
1582 CONARS element, or nested within it if a referring string contains a mixture of common and proper nouns. The following example shows an alternative way of encoding the short sentence from
1594 CONARS As the following example shows, a proper name may be nested within a referring string:
1599 CONARS Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as
1603 CONARS may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer.
1605 CONARS Two issues arise in this context: firstly, there may be a need to encode a regularized form of a name, distinct from the actual form in the source to hand; secondly, there may be a need to identify the particular person, place, etc. referred to by the name, irrespective of whether the name itself is normalized or not. The element
1623 CONARS A very useful application for them is as a means of gathering together all references to the same individual or location scattered throughout a document:
1641 CONARS The value of the
1643 CONARS attribute may be an unexpanded code, as in the examples above, with no particular significance. More usually however, it will be an externally defined code of some kind, as provided by a standard reference source.
1649 CONARS The standard reference source should be documented using a
1651 CONARS element in the TEI header.
1655 CONARS attribute can be used to point directly to some other resource providing more information about the entity named by the element, such as an authority record in a database, an encyclopaedia entry, another element in the same or a different document, etc.
1663 CONARS (regularization) element to provide the standard form of a referring string, as in this example:
1673 CONARS attribute, since its form will depend entirely on practice within a given project. For the same reason, this attribute is not recommended in data interchange, since there is no way of ensuring that the values used by one project are distinct from those used by another. In such a situation, a preferable approach for magic tokens which follows standard practice on the Web is to use a
1675 CONARS attribute whose value is a tag URI as defined in
1684 CONARS The inclusion of the domain name of the party responsible for tagging (
1686 CONARS ), as specified in RFC 4151, helps ensure uniqueness of magic token values across TEI encoding projects, allowing for improved interchange of TEI documents.
1691 CONARS may be used if it is desired to record both a normalized form of a name and the name used in the source being encoded:
1707 CONARS may be more appropriate if the function of the regularization is to provide a consistent index:
1713 CONARS Although adequate for many simple applications, these methods have two inconveniences: if the name occurs many times, then its regularized form must be repeated many times; and the burden of additional XML markup in the body of the text may be inconvenient to maintain and complex to process. For applications such as onomastics, relating to persons or places named rather than the name itself, or wherever a detailed analysis of the component parts of a name is needed, the specialized elements described in chapter
1730 CONAAD elements; for other kinds of address this class may be extended by adding new elements if necessary.
1732 CONAAD These Guidelines provide no particular means for encoding the substructure of an email address (for example, distinguishing the local part from the domain part), nor for distinguishing personal email addresses from generic or fictitious ones.
1738 CONAAD The simplest way of encoding a postal address is to regard it as a series of distinct lines, just as they might be written on an envelope. The following element supports this view:
1739 CONAAD Here is an example of a postal address encoded using this approach:
1749 CONAAD Alternatively, an address may be encoded as a structure of more semantically rich elements. The class
1751 CONAAD element class identifies a number of such possible components:
1756 CONAAD Any number of elements from the
1758 CONAAD class may appear within an address and in any order. None of them is required.
1760 CONAAD Where code letters are commonly used in addresses (for example, to identify regions or countries) a useful practice is to supply the full name of the region or country as the content of the element, but to supply the abbreviatory code as the value of the global
1762 CONAAD attribute, so that (for example) an application preparing formatted labels can readily find the required information. Other components of addresses may be represented using the general-purpose
1764 CONAAD element or (when the additional module for names and dates is included) the more specialized elements provided for that purpose.
1766 CONAAD Using just the elements defined by the core module, the above address could thus be represented as follows:
1778 CONAAD The order of elements within an address is highly culture-specific, and is therefore unconstrained:
1792 CONAAD A telephone number (normally outside of the
1798 CONAAD , with the number itself appearing in the
1806 CONAAD . A full postal address may also include the name of the addressee, tagged as above using the general purpose
1811 CONAAD , a large number of more specific elements such as
1817 CONAAD . The above example might then be encoded as follows:
1861 CONANU element provides a convenient method of distinguishing numbers from the surrounding text. For other kinds of application, numbers are only useful if normalized: here the
1883 CONANU ; less frequently the number may be recognisable linguistically as such but may use a notation with which the encoder is unfamiliar. To help in these situations, the
1893 CONANU measure
1894 CONANU consists of a number, a phrase expressing units of measure and a phrase expressing the commodity being measured, though not all of these components need be present in every case. It may be helpful to distinguish measures from surrounding text for two reasons. Firstly, a measure may be expressed using a particular notation or system of abbreviations which the encoder does not wish to regard as lexical. Secondly, a quantitative application may wish to distinguish and normalize the internal components of a measure, in order to perform calculations on them.
1896 CONANU Consider, as an example of the first case, the following list of Celia's charms, in which the encoder has chosen to make explicit the measurements:
1931 CONANU In general, normalization of a measure will require specification of one or more of its three parts: the quantity, the units, and possibly also the commodity being measured. This is accomplished by supplying values for the three attributes
1937 CONANU , which are supplied by the
1946 CONANU Such techniques are particularly useful when representing historical data such as inventories:
1962 CONANU element is provided as a means of grouping several related measurements together, either because the measurement involves several dimensions (for example height and width) or to avoid the need to repeat all the normalizing attributes:
1983 CONADA Dates and times, like numbers, can appear in widely varying culture- and language-dependent forms, and can pose similar problems in automatic language processing. Such elements constitute the
1985 CONADA class, of which the default members are:
1989 CONADA These elements have some additional attributes by virtue of being members of the
1993 CONADA classes which, in turn, are members of the
2017 CONADA attribute by simply omitting a part of the value supplied. Imprecise dates or times (for example
2020 CONADA some time after ten and before twelve
2021 CONADA ) may be expressed as date or time ranges.
2023 CONADA These mechanisms are useful primarily for fully specified dates or times known with certainty. If component parts of dates or times are to be marked up, or if a more complex analysis of the meaning of a temporal expression is required, the techniques described in chapter
2026 CONADA Where the certainty (i.e. reliability) of the date or time is in question, the encoder should record this fact using the mechanisms discussed in chapter
2027 CONADA . The same chapter also discusses various methods of recording the precision of numerical or temporal assertions.
2040 CONADA attribute always supplies a normalized representation of the date given as content of the
2047 CONADA date
2059 CONADA time
2063 CONADA There is one exception: these Guidelines permit a time to be expressed as only a number of hours, or as a number of hours and minutes, as per ISO 8601:2004 sections 4.2.2.3 and 4.3.3. The W3C
2067 CONADA datatypes require that the minutes and seconds be included in the normalized value if they are to be correctly processed, for example when sorting.
2086 CONADA Note in the last example the use of a normalized representation for the date string which includes a time: this example could thus equally well be tagged using the
2109 CONADA attribute may be used to specify a date in any calendar system; if the
2111 CONADA attribute is also supplied, it should specify the equivalent date in the Gregorian calendar.
2121 CONAAB It is sometimes desirable to mark abbreviations in the copy text, whether to trigger special processing for them, to provide the full form of the word or phrase abbreviated, or to allow for different possible expansions of the abbreviation. Abbreviations may be transcribed as they stand, or expanded; they may be left unmarked, or marked using these tags:
2181 CONAAB Abbreviation is a particularly important feature of manuscript and other source materials, the transcription of which needs more detailed treatment than is possible using these simple elements. A more detailed set of recommendations is discussed in
2182 CONAAB , which includes additional elements made available for the purpose by the
2192 COXR Cross-references or links between one location in a document and one or more other locations, either in the same or different XML documents, may be encoded using the elements
2198 COXR from one location in a document, the place that the element itself appears, to another (or to several), specified by means of a
2200 COXR attribute, supplied by the
2208 COXR The value of the
2212 COXR mechanism. This permits a range of complexity, from the very simple (a reference to the value of the target element's
2214 COXR attribute) to the more complex usage of a full URI with embedded XPointers. For example, the source of the following paragraph looks something like this:
2226 COXR Alternatively, if no explicit link is to be encoded, but it is simply required to mark the phrase as a cross-reference, the
2237 COXR ; for a discussion of TEI schemes for XPointer, see
2247 COXR are the default members of the phrase-level model class
2249 COXR . As members of the classes
2267 COXR element may contain phrases specifying, or describing more exactly, the target of a cross-reference, which form the content of the element. Since its content thus serves as a human-readable pointer, in the simplest case a
2279 COXR attribute, so that processing software can access it directly, for example to implement a linkage, to generate an appropriate reference, or to give an error message if it cannot be found. Assuming that section 12 in the previous example has been tagged
2282 COXR then the same cross-reference might more exactly be encoded as
2288 COXR If the cross-reference itself is to be generated according to a fixed pattern, or if no text is to appear in the body of the cross-reference, the
2300 COXR ); the definition it provides is used to translate the value of the
2302 COXR attribute into a conventional pointer value, such as one that might be supplied by the
2312 COXR attribute is used, a cross reference may point to any number of locations simultaneously, simply by giving more than one identifier as the value of its
2314 COXR attribute. This may be particularly useful where an analytic index is to be encoded, as in the following example:
2328 COXR , etc. have been provided in the body of the text, for example as page breaks
2337 COXR A similar method may be used to link annotations on a text with the sigla used to encode their points of attachment in a text. For example:
2358 COXR The value
2364 COXR element here might be used to indicate that the object being referenced here is a bibliographic entry rather than a simple cross-reference to an illustration, as is the first
2366 COXR . In either case, the value of the
2373 COXR elements have many applications in addition to the simple cross-referencing facilities illustrated in this section. In conjunction with the analytic tools discussed in chapters
2376 COXR , they may be used to link analyses of a text to their object, to combine corresponding segments of a text, or to align segments of a text with a temporal or other axis or with each other.
2406 COLI list
2407 COLI : numbered, lettered, bulleted, or unmarked. Lists formatted as such in the copy text should in general be encoded using this element, with an appropriate value for the
2425 COLI Some of these values may of course be combined; a list may be inline, but also be rendered with numbers. An example appears below. For more sophisticated and detailed description of list rendering, consider using the
2431 COLI Each distinct item in the list should be encoded as a distinct
2433 COLI element. If the numbering or other identification for the items in a list is unremarkable and may be reconstructed by any processing program, no enumerator need be specified. If however an enumerator is retained in the encoded text, it may be supplied either by using the
2457 COLI The two styles may not be mixed in the same list: if one item is preceded by a label, all must be.
2459 COLI A list need not necessarily be displayed in list format. For example, the following is a reasonable encoding of a list which (in the original) is simply printed as a single paragraph:
2492 COLI A list may be given a heading or title, for which the
2496 COLI element to mark a tabular or glossary list in which each item is associated with a word or phrase rather than a numeric or alphabetic enumerator:
2522 COLI In such a list, the individual items have internal structure. In complex cases, where list items contain many components, the list is better treated as a
2523 COLI table
2528 COLI . A particularly important instance of the simple two-column table is the
2529 COLI glossary list
2530 COLI , which should be marked by the tag
2531 COLI list type="gloss"
2534 COLI element contains a term and each
2536 COLI its gloss; it is a semantic error for a list tagged with
2567 COLI might be used to make explicit the role that each column in the glossary list has, as follows:
2608 COLI ) element what language the term is from. For further discussion of the
2617 COLI element used to supply a title or heading for the whole list, headings for the two columns of a glossary-style list may be specified using the two special elements
2662 COLI , including other lists. In this example, a glossary list contains two items, each of which is itself a simple list:
2705 CONONO The following element is provided for the encoding of discursive notes, whether already present in the copy text or supplied by the encoder:
2708 CONONO A note is any additional comment found in a text, marked in some way as being out of the main textual stream. All notes should be marked using the same tag,
2710 CONONO , whether they appear as block notes in the main text area, at the foot of the page, at the end of the chapter or volume, in the margin, or in some other place.
2714 CONONO A note is usually attached to a specific point or span within a text, which we term here its
2718 CONONO When encoding such a text, it is conventional to replace this siglum by the content of the annotation, duly marked up with a
2720 CONONO element. This may not always be possible, for example with marginal notes, which may not be anchored to an exact location. For ease of processing, it may be adequate to position marginal notes before the relevant paragraph or other element. In printed texts, it is sometimes conventional to group notes together at the foot of the page on which their points of attachment appear. This practice is not generally recommended for TEI-encoded texts, since the pagination of a particular printed text is unlikely to be of structural significance. In some cases, however, it may be desirable to transcribe notes not at their point of attachment to the text but at their point of appearance, typically at the end of the volume, or the end of the chapter. In such cases, the
2728 CONONO element, pointing from that to the body of the
2732 CONONO In cases where the note is applied not to a point but to a span of text, not itself represented as a TEI element, the
2736 CONONO function to specify the span of attachment.
2743 CONONO attribute is used to categorise the note as a gloss:
2757 CONONO element, we may infer that its point of attachment is in the margin adjacent to the line in question. In the following version of the same text, however, it may be inferred that the note applies to the whole of the stanza:
2770 CONONO This type of annotation, very common in the early printed texts which Coleridge may be presumed to be imitating in this case, may also be regarded as providing a heading or descriptive label for the passage concerned. The encoder may therefore prefer to use the
2785 CONONO In the following example, a note which appears at the foot of the page in the printed source is given at its point of attachment within the text. The global
2787 CONONO attribute is used to indicate the note number:
2801 CONONO In addition to transcribing notes already present in the copy text, researchers may wish to add their own notes or comments to it. The
2811 CONONO attribute may be used to point to a definition of the person or other agency responsible for the content of the note.
2813 CONONO As a simple example, an edition of the
2829 CONONO ; thus in this case, the TEI header for this text might contain a title statement like the following:
2840 CONONO When annotating the electronic text by means of analytic notes in some structured vocabulary, e.g. to specify the topics or themes of a text, the
2844 CONONO elements may be more effective than the free form
2846 CONONO element; these elements are available when the module for simple analysis is selected (see section
2852 CONOIX The indexing of scholarly texts is a skilled activity, involving substantial amounts of human judgment and analysis. It should not therefore be assumed that simple searching and information retrieval software will be able to meet all the needs addressed by a well-crafted manual index, although it may complement them, for example by providing free text search. The role of an index is to provide access via keywords and phrases which are not necessarily present in the text itself, but must be added by the skill of the indexer.
2856 CONOIXpre When encoding a pre-existing text, therefore, if such an index is present it may be advisable to retain it along with the text, rather than attempt to regenerate it automatically. Elements discussed elsewhere in these Guidelines may be used for this purpose. For example, the
2860 CONOIXpre element may be used to mark the section of the text containing the index and the
2862 CONOIXpre element might be used to mark the index itself, each entry being represented by an
2864 CONOIXpre element, possibly containing within it a series of
2896 CONOIXpre Note that this simple representation does not capture the nested structure of the first of these index entries. A more accurate representation might entail the use of nested lists like the following:
2924 CONOIXpre elements above, might also include direct links to the appropriate location in the encoded text, using (for example) a target attribute to supply the identifier of an associated page break element:
2932 CONOIXpre . Note that similar methods may also be used to encode a table of contents, as further exemplified in section
2938 CONOIXgen It can also be useful, however, to generate a new index from a machine-readable text, whether the text is being written for the first time with the tags here defined, or as an addition to a text transcribed from some other source. Depending on the complexity of the text and its subject matter, such an automatically-generated index may not in itself satisfy all the needs of scholarly users. However, it can assist a professional indexer to construct a fully adequate index, which might then be post-edited into the digital text, marked up along the lines already suggested for preserving pre-existing index material.
2948 CONOIXgen this element may be used simply to provide a descriptive or interpretive label of some kind for any location within a text, to be processed in any way by analytic software, but its main purpose is to facilitate the generation of an index for a printed version of the text. An
2950 CONOIXgen element may be placed anywhere within a text, between or within other elements. The headwords to be used when making up this index are given by the
2954 CONOIXgen element. The location of the generated index might be specified by means of a processing instruction within the text, such as the following (the exact form of the PI is of course dependent on the application software in use):
2956 CONOIXgen Alternatively, the special purpose
2960 CONOIXgen In the simplest case, a single headword is supplied by an
2972 CONOIXgen The effect of this is to document an index entry for the term
2974 CONOIXgen , which when processed could reference the location of the original
2978 CONOIXgen If the subject of Arabic lemmatization is treated at length in a text, then the index entry generated may need to reference a sequence of locations (e.g. page numbers). In such a case it will be necessary to identify the end of the relevant span of text as well as its starting point. This is most conveniently done by supplying an empty
2994 CONOIXgen This would generate the same index entries as the previous example, but the reference would be to the whole span of text between the location of the
2996 CONOIXgen element and the location of the element identified by the code
2998 CONOIXgen , rather than a single point, and thus might (for example) include a sequence of page numbers.
3002 CONOIXgen element in the text provides the target location that will be specified in the generated index entry, no part of the text itself is used to construct that entry. Index terms appearing in the entry come solely from the content of
3004 CONOIXgen elements, which consequently may have to repeat words or phrases from the text proper. This need not be done verbatim, thus giving scope for normalization of spelling (as in the example above) or other modifications which may assist generation of an index in a desired form or sequence.
3006 CONOIXgen Sometimes, for example when index terms are taken from a different language or consist of mathematical formulae or other expressions, even a normalized form of an index term may be insufficient for an application to order it exactly as desired. The
3008 CONOIXgen attribute may be used to address this problem, as in the following example:
3012 CONOIXgen Here, an entry for the symbol @ will appear in the index, but will be sorted alphabetically as if it were the string
3014 CONOIXgen . This technique is also useful when an index entry is to contain some non-Unicode character or glyph represented by the
3017 CONOIXgen . In the following example, we assume that somewhere a definition for this glyph has been provided using the elements described in chapter
3018 CONOIXgen , and given the code
3027 CONOIXgen Note that if no value is supplied for the sortKey attribute, a sorting application should always use the content of the
3031 CONOIXgen It is common practice to compile more than one index for a given text. A biography of a poet, for example, may offer an index of references to poems by the subject of the study, another index of works by other writers, an index of places or historical personages, etc. The indexName attribute is used to assign index terms and locations to one or more specific indexes:
3039 CONOIXgen TEI
3042 CONOIXgen , an index may contain structured entries like
3043 CONOIXgen TEI, markup practices, index terms
3044 CONOIXgen , where a top level entry
3045 CONOIXgen TEI
3046 CONOIXgen is followed by a number of second-level subcategories, any or all of which may have a third-level list attached to them and so on. In order to reflect such a hierarchical index listing,
3048 CONOIXgen elements may be nested to the required depth. For example, suppose that we wish to make a structured index entry for
3054 CONOIXgen , etc. The example at the start of this section might then be encoded with nested
3067 CONOIXgen The index entry from Burton's
3069 CONOIXgen quoted above might be generated in a similar way. To generate such an entry, the body of the text might include, at page 193, an
3081 CONOIXgen . Similarly, page 601 of the body text would include an
3109 CONOIXgen elements, the duplication required to make the structure explicit will normally be removed, so as to produce entries like those quoted above. However, this is not required by the encoding recommended here.
3113 CONOIXgen element may be used to mark the place at which an index generated from
3115 CONOIXgen elements should be inserted into the output of a processing program; typically but not necessarily this will be at some point within the back matter of the document. If the
3117 CONOIXgen element is used, then the
3119 CONOIXgen attribute should be used to specify which kind of index is to be generated, and its value should correspond with that of the
3140 CONOIXgen attribute may also be used to specify a name or identifier for the generated index itself in the usual way. Any additional headings etc. required for the generated index must be specified as content of the
3152 CONOIXgen If a processing instruction is used, then these parameters for the generated index may be supplied in some other way.
3154 CONOIXgen One final feature frequently found in manually-created indexes to printed works cannot readily be encoded by the means provided here, namely cross-references internal to the index term listing. For example, if all references to the TEI in a text have been indexed using the index term
3156 CONOIXgen , it may also be helpful to include an entry under the term
3157 CONOIXgen TEI
3158 CONOIXgen containing some text such as
3171 COGR Graphics, such as illustrations or diagrams, appear in many different kinds of text, and often with different purposes. Audio or video clips may also appear. In some cases, such media form an integral part of a text (indeed, some texts—comic books for example—may be almost entirely graphic); in others the graphic or video may be a kind of optional extra. In some cases, the text may be incomprehensible unless the media is included; in others, the presence of the media adds little to the sense of the work. It will therefore be a matter of encoding policy as to whether or how media found in a source text are transferred to a new encoded version of the same. In documents which are
3173 COGR , media such as graphics and other non-textual components may be particularly salient, but their inclusion in an archival form of the document concerned remains an editorial decision.
3175 COGR Considered as structural components, media may be anchored to a particular point in the text, or they may
3177 COGR either completely freely, or within some defined scope, such as a chapter or section. Time-based media such as audio or video may need to be synchronized with particular parts of a written text. Media of all kinds often contain associated text such as a heading or label. These Guidelines provide the following different elements to indicate their appearance within a text:
3185 COGR Media files may be encoded in a number of different ways:
3187 COGR in some non-XML or binary format such as PNG, JPEG, MP3, MP4 etc.
3191 COGR in a TEI XML format such as the notation for graphs and trees described in
3193 COGR In the last two cases, the presence of the graphic will be indicated by an appropriate XML element, drawn from the SVG namespace in the second case, and its content will fully define the graphic to be produced. In the first case, however, one of the elements
3197 COGR is used to mark the presence of the graphic only and the visual content itself is stored outside the XML document at a location referenced by means of an
3201 COGR class. Alternatively, if it is small, the media information may be embedded directly within the document using some suitable binary format such as Base64; in this case the
3213 COGR when this module is included in a schema. These elements are also members of the class
3220 COGR For example, the following passage indicates that a copy of the image found in the source text may be recovered from the URL
3228 COGR The media elements are phrase level elements which may be used anywhere that textual content is permitted, within but not between paragraphs or headings. In the following example, the encoder has decided to treat a specific printer's ornament as a heading:
3235 COGR provides additional capabilities, for example the ability to combine a number of images into a hierarchically organized structure or a block of images. The
3239 COGR attribute, which can be used to distinguish different kinds of graphic component within a single work, for example, maps as opposed to illustrations. It also provides the ability to associate an image with additional information such as a heading or a description.
3250 CORS we mean the system by which names or references are associated with particular passages of a text (e.g.
3252 CORS for the third verse of Psalm 23 or
3256 CORS , book 2, poem 10, line 7). Such names make it possible to mark a place within a text and enable other readers to find it again. A reference system may be based on structural units (chapters, paragraphs, sentences; stanza and verse), typographic units (page and line numbers), or divisions created specifically for reference purposes (chapter and verse in Biblical texts). Where one exists, the traditional reference system for a text should be preserved in an electronic transcript of it, if only to make it easier to compare electronic and non-electronic versions of the text.
3260 CORS where a reference system exists, and is based on the same logical structure as that of the text's markup, the reference for a passage may be recorded as the value of the global
3274 CORS where a reference system exists which is not based on the same logical structure as that of the text's markup (for example, one based on the page and line numbers of particular editions of the text rather than on the structural divisions of it), any of a variety of methods for encoding the logical structure representing the reference system may be employed, as described in chapter
3277 CORS where a reference system exists which does not correspond to any particular logical structure, or where the logical structure concerned is of no interest to the encoder except as a means of supporting the referencing system, then references may be encoded by means of
3279 CORS elements, which simply mark points in the text at which values in the reference system change, as described below in section
3281 CORS The specific method used to record traditional or new reference systems for a text should be declared in the TEI header, as further described in section
3285 CORS When a text has no pre-existing associated reference system of any kind, these Guidelines recommend as a minimum that at least the page boundaries of the source text be marked using one of the methods outlined in this section. Retaining page breaks in the markup is also recommended for texts which have a detailed reference system of their own. Line breaks in prose texts may be, but need not be, tagged.
3286 CORS Many encoders find it convenient to retain the line breaks of the original during data entry, to simplify proofreading, but this may be done without inserting a tag for each line break of the original.
3294 CORS1 When traditional reference schemes represent a hierarchical structuring of the text which mirrors that of the marked-up document, the
3298 CORS1 attribute may also be used to record the numbering of sections or list items in the copy text if the copy-text numbering is important for some reason, for example because the numbers are out of sequence.
3304 CORS1 —book 2, poem 10, line 7. Book, poem, and line are structural units of the work and will therefore be tagged in any case. (See chapter
3305 CORS1 for a discussion of structural units in verse collections.) In such cases, it is convenient to record traditional reference numbers of the structural units using the
3328 CORS1 One may also place the entire standard reference for each portion of the text into the appropriate value for the
3330 CORS1 attribute, though for obvious reasons this takes more space in the file:
3347 CORS1 If the names used by the traditional reference system can be formulated as identifiers, then the references can be given as values for the
3353 CORS1 attribute must be unique throughout the document. Our example then looks like this:
3370 CORS1 To document the usage and to allow automatic processing of these standard references, it is recommended that the TEI header be used to declare whether standard references are recorded in the
3379 CORS1 attribute one can specify only a single standard referencing system, a limitation not without problems, since some editions may define structural units differently and thus create alternative reference systems. For example, another edition of the
3381 CORS1 considers poem 10 a continuation of poem 9, and therefore would specify the same line as
3388 CORS2 If a text has no canonical reference system of its own, a new custom reference system may be used.
3402 CORS2 Determining a referencing system for a TEI encoding depends on many factors that may either be derived from textual structure, or influenced by extra-textual contingencies such as project and file management concerns. It is important, therefore, that the attribute used, the elements which can bear standard reference identifiers, and the method for constructing standard reference identifiers, should all be declared in the header as described in section
3410 CORS2-1 A new referencing system may be derived from the structure of the electronic text, specifically from the markup of the text. As with any reference system intended for long-term use, it is important to see the reference as an established, unchanging point in the text. Should the text be revised or rearranged, the reference-system identifiers associated with any section of text must stay with that section of text, even if it means the reference numbers fall out of sequence. (A new reference system may always be created beside the old one if out-of-sequence numbers must be avoided.)
3417 CORS2-1 domain-style address
3418 CORS2-1 comprising a series of components separated by full stops, with one component for each level of the document hierarchy. Two methods may be used. In the
3420 CORS2-1 form of identifier, each component in the identifier takes the form of an element identifier, a hyphen, and a number, for example
3422 CORS2-1 . The element name specifies what type of element is to be sought, and the number specifies which occurrence of that element type is to be selected. (The hyphen and number may be omitted if there is only one element of the given type.) In the
3424 CORS2-1 form of identifier, each component consists of a number, indicating which element in the sequence of nodes at each level is to be selected. To make the resulting identifier a valid XML identifier, it may need to be prefixed with an unchanging alphabetic letter.
3434 CORS2-1 element may be taken as a starting point only if identifiers need to be generated for the
3438 CORS2-1 element as a root would prevent assignment of identifiers for the front and back matter. The component corresponding to the root element can be omitted from identifiers, if no confusion will result. In collections and corpora, the component corresponding to the root may be replaced by the unique identifier assigned to the text or sample.
3446 CORS2-1 value; the latter are prefixed with the string
3490 CORS2-1 attribute is used to record the reference identifiers generated, each value should record the entire path. If the
3492 CORS2-1 attribute is used, each value may record either the entire path or only the subpath from the parent element. The attribute used, the elements which can bear standard reference identifiers, and the method for constructing standard reference identifiers, should all be declared in the header as described in section
3501 CORS2-2 attributes. Every convention will have strengths and weaknesses and it is left to encoders to make a decision that enables them to locate information in their TEI document.
3503 CORS2-2 Here are some examples of referencing systems that have been used in TEI projects:
3506 CORS2-2 identifiers constructed with a number of characters from the main document title, followed by an incremental number. E.g. HOL001, HOL002, etc. using a fixed number of digits; or without fixed digits: HOL1, HOL2, etc.
3509 CORS2-2 identifiers constructed from the markup itself, as described in the previous section. To facilitate uniqueness in a corpus, each identifier may be prefixed with the identifier of the root
3518 CORS2-2 XML well-formedness requires only that xml:id attributes be unique within a single document. However, it is also worth keeping in mind that for operating with referencing systems across a corpus of TEI files it is helpful (or even necessary in some circumstances) to have unique identifiers across the whole corpus.
3522 CORS2-2 may be populated either computationally or manually. In the latter case, it is advisable to put measures in place to avoid human error. Custom data types and Schematron rules may be defined in a customization ODD, and a check digit may be added to prevent unwanted changes.
3523 CORS2-2 A check digit is computed from the value of an identifier and appended to the value itself. If the identifier is later changed, the check digit no longer matches and the change is detected.
3530 CORS5 milestone
3534 CORS5 These elements simply mark the points in a text at which some category in a reference system changes. They have no content but subdivide the text into regions, rather in the same way as milestones mark points along a road, thus implicitly dividing it into segments. The elements
3542 CORS5 are specialized types of milestone, marking gathering, page, column, and line boundaries respectively. The global
3544 CORS5 attribute is used in each case to provide a value for the particular unit associated with this milestone (for example, the page or line number). Since it is not structural, validation of a reference system based on
3546 CORS5 s cannot readily be checked by an XML parser, so it will be the responsibility of the encoder or the application software to ensure that they are given in the correct order.
3548 CORS5 Milestone elements are often used as a simple means of capturing the original appearance of an early printed text, which will rarely coincide exactly with structural units, but they are generally useful wherever a text has two or more competing structures. For example, many English novels were first published as serial works, individual parts of which do not always contain a whole number of chapters. An encoder might decide to represent the chapter-based structure using
3603 CORS5 Similarly, when tagging dramatic verse one may wish to privilege stanzas and lines over speeches and speakers, particularly where speeches cross line and line group boundaries. One might also wish to mark changes in narrative voice in a prose text. In either case, a milestone tag may be used to indicate change of speaker:
3614 CORS5 Milestone tags also make it possible to record the reference systems used in a number of different editions of the same work. The reference system of any one edition can be recreated from a text in which all are marked by simply ignoring all elements that do not specify that edition on their
3618 CORS5 As a simple example, assuming that edition E1 of some collection of poems regards the first two poems as constituting the first book, while edition E2 regards the first poem as prefatory, a markup scheme like the following might be adopted:
3629 CORS5 In this case no
3631 CORS5 value is specified, since the numbers rise predictably and the application can keep a count from the start of the document, if desired.
3633 CORS5 The value of the
3649 CORS5 tags, line numbers may be supplied for every line or only periodically (every fifth, every tenth line). The latter may be simpler; the former is more reliable.
3659 CORS5 could have been used equally well if preferred. The special value
3661 CORS5 should be reserved for marking sections of text which fall outside the normal numbering system (e.g. chapter heads, poem numbers, titles, or speaker attributions in a verse drama).
3663 CORS5 By default, there are no constraints on the values supplied for the
3666 CORS5 may be used, for example to specify that the attribute must specify one of a predefined set of values.
3671 CORS5 Milestone elements may be used to mark any kind of shift in the properties associated with a piece of text, whether or not it would normally be considered a reference system. For example, they may be used to mark changes in narrative voice in a prose text, or changes of speaker in a dramatic text, where these are not marked using structural elements such as
3677 CORS5 above, milestone elements such as
3681 CORS5 represent whitespace and are therefore by default assumed to occur between orthographic tokens in the text, where these are not otherwise indicated. By default it is reasonable to assume that words are not broken across page or line boundaries, and that therefore a sequence such as
3694 CORS5 attribute is provided to change the default assumption. To make explicit that
3699 CORS5 Where hyphenation appears before a line or page break, the encoder may or may not choose to record the fact, either explicitly using an appropriate Unicode character, or descriptively for example by means of the
3714 CORS6 Whatever kind of reference system is used in an electronic text, it is recommended that the TEI header contain a description of its construction in the
3734 CORS6 tags. The header section for such an encoding should look something like this:
3807 CORS6 tags, but giving the reference string in full on each tag. If canonical references are made only to lines, the reference system could be declared as follows:
3810 CORS6 Since the entire regular expression is enclosed as a parenthetical subgroup, the entire canonical reference string is sought as the value of the
3820 CORS6 This declaration indicates that the entire reference string must be sought as the value of the
3832 CORS6 The third example encodes the same reference system, this time giving the entire reference string as the value of the
3837 CORS6 although in general there seems to be little advantage in this case: it is no more difficult to use a standard relative URI reference as the value of
3841 CORS6 Reference systems recorded by means of milestone tags can also be declared; the following prose description could be used to declare the example given in section
3846 CORS6 Or in this way, using a formal declaration for this reference scheme derived from edition
3859 COBI Bibliographic references (that is, full descriptions of bibliographic items such as books, articles, films, broadcasts, songs, etc.) or pointers to them may appear at various places in a TEI text. They are required at several points within the TEI header's source description, as discussed in section
3860 COBI ; they may also appear within the body of a text, either singly (for example within a footnote), or collected together in a list as a distinct part of a text; detailed bibliographic descriptions of manuscript or other source materials may also be required. These Guidelines propose a number of specialized elements to encode such descriptions, which together constitute the
3869 COBI In printed texts, the individual constituents of a bibliographic reference are conventionally marked off from each other and from the flow of text by such features as bracketing, italics, special punctuation conventions, underlining, etc. In electronic texts, such distinctions are also important, whether in order to produce acceptably formatted output or to facilitate intelligent retrieval processing,
3872 COBI as an author's name from
3874 COBI as a place of publication or as a component of a title.
3877 COBI It should be emphasized that for references as for other textual features, the primary or sole consideration is not how the text should be formatted when it is printed. The distinctions permitted by the scheme outlined here may not necessarily be all that particular formatters or bibliographic styles require, although they should prove adequate to the needs of many such commonly used software systems.
3882 COBI structures, though the nature of their design prevents a simple one-to-one mapping from their data elements to TEI elements. For further information, see section
3885 COBI ) constitute a set which has been useful for a wide range of bibliographic purposes and in many applications, and which moreover corresponds to a great extent with existing bibliographic and library cataloguing practice. For a fuller account of that practice as applied to electronic texts see section
3901 COBI element; instead, the presence and order of child elements must be used to reconstruct the punctuation required by a particular style.
3905 COBI allows for considerable flexibility in that it can include both delimiting punctuation and unmarked-up text; and its constituents can also be ordered in any way. This makes it suitable for marking up bibliographies in existing documents, where it is considered important to preserve the form of references in the original document, while also distinguishing important pieces of information such as authors, dates, publishers, and so on.
3907 COBI may also be useful when encoding
3909 COBI documents which require use of a specific style guide when rendering the content; its flexibility makes it easier to provide all the information for a reference in the exact sequence required by the target rendering, including any necessary punctuation and linking words, rather than using an XSLT stylesheet or similar to reorder and punctuate the data.
3915 COBI , has a content model based on the
3917 COBI element of the TEI header. Both are based on the International Standard for Bibliographic Description (ISBD), which forms the basis of several national standards for bibliographic citations. The order of child elements in both
3938 COBI resource identifier and terms of availability area
3941 COBI , used with its child elements and without delimiting punctuation, provides an appropriate granularity of encoding with elements that can easily be rendered for the reader. However, it is important to note that some ISBD-derived citation formats (such as ANSI/NISO Z39.29 and ГОСТ 7.1) are not entirely conformant to ISBD either, since they may begin with a statement of authorship that does not map to the ISBD statement of responsibility.
3947 COBITY class all share a number of possible component sub-elements. For the
3957 COBITY Different levels of specific tagging may be appropriate in different situations. In some cases, it may be felt necessary to mark just the extent of the reference itself, with perhaps a few distinctions being made within it (for example, between the part of the reference which identifies a title or author and the rest). Such references, containing a mixture of text with specialized bibliographic elements, are regarded as
3970 COBITY Some bibliographic references are extremely elliptical, often only a string of the form
3972 COBITY . If no further details of Baxter's book are given in the source text and none is supplied by the encoder, then the reference thus given should be tagged as a
4032 COBITY element defined in the TEI header module. This element is provided as a means of embedding the file description of one existing digital text within that of another (see further section
4053 COBITY A list of bibliographic items, of whatever kind, may be treated in the same way as any other list (see section
4068 COBITY may contain only bibliographic elements, optionally preceded by a heading and a series of introductory paragraphs. For most purposes, good practice would usually require that a
4145 COBITY s and
4149 COBITY items, the key information is marked up, but it is presented in an order which makes it suitable for direct rendering, with the punctuation included.
4207 COBICO analytic
4211 COBICO series
4216 COBICO information relating to the publication, pagination, etc. of an item (most of these constitute the default members of the
4227 COBICO class, other phrase-level elements, and plain text may be combined without other constraint; within the latter, such of these elements as exist for a given reference must be distinguished, and must also be presented in a specific order, discussed further below (section
4232 COBICOL In common library practice a clear distinction is made between an individual item within a larger collection and a free-standing book, journal, or collection. Similarly a book in a series is distinguished sharply from the series within which it appears. An article forming part of a collection which itself appears in a series thus has a bibliographic description with three quite distinct levels of information:
4235 COBICOL analytic
4243 COBICOL series
4244 COBICOL level, giving the title of the series, possibly the names of its editors, etc., and the number of the volume within that series.
4245 COBICOL In the same way, an article in a journal requires at least two levels of information: the analytic level describing the article itself, and the monographic level describing the journal.
4247 COBICOL A different identifying number may be supplied for any of these three items, that is, for the analytic item, the monographic item, or the series.
4284 COBICOL , the levels are distinguished by the use of the following distinct elements:
4287 COBICOL For purposes of TEI encoding, journals and anthologies are both treated as monographs; a journal title should thus be tagged as a
4288 COBICOL title level="j"
4292 COBICOL analytic
4301 COBICOL element. (Whether reprints of an article are treated in the same bibliographic reference or a separate one varies among different styles. Library lists typically use a different entry for each publication, while academic footnoting practice typically treats all publications of the same article in a single entry.)
4305 COBICOL element is used to supply further information about the location of some part of a bibliographic reference. It specifies where to find the component in which it appears within the immediately preceding component of a different level.
4311 COBICOL , which was itself the second of four volumes published together under the title
4313 COBICOL ; this last title constituted the 38th volume in the series of
4350 COBICOL In the following example, the article cited has been published twice, once in a journal (where it appeared in volume 40, on pages 3-46 of the issue of October 1986) and once as a free-standing item, which appeared as number 11 of a German-language series.
4407 COBICOL The practice of analytic vs. monographic citation, as described here, should be distinguished from the practice of including within one citation a reference to another work, which the encoder considers to be related to it in some way: see further
4410 COBICOL If an identifier is available for the analytic item, it should be represented by means of an
4414 COBICOL element, as in the following example where a DOI (Digital Object Identifier) is supplied for the article in question.
4462 COBICOL Punctuation must not appear between the elements within a structured bibliographic entry encoded with
4510 COBICOL , with all the relevant data items marked up appropriately. This markup approach can provide easy rendering, if only one style guide is targeted, or an original source document uses a specific style guide, while still allowing for automated recovery of key data items such as names of authors, titles, etc.
4519 COBICOR Bibliographic references typically include the title of the work being cited and the names of those intellectually responsible for it. For articles in journals or collections, such statements should appear both for the analytic and for the monographic level. The following elements are provided for tagging such elements:
4545 COBICOR are the default members of the
4553 COBICOR In bibliographic references, all titles should be tagged as such, whether analytic, monographic, or series titles. The single element
4567 COBICOR It is a semantic error to give a value for the
4571 COBICOR value
4573 COBICOR implies the analytic level; the values
4574 COBICOR m
4578 COBICOR u
4579 COBICOR imply the monographic level; the value
4580 COBICOR s
4581 COBICOR implies the series level. Note, however, that the semantic error occurs only if the nested title is directly enclosed by the
4587 COBICOR element; if it is enclosed only indirectly (i.e., nested more deeply), no semantic error need be present. For example, the analytic title may contain a monographic title, as in the following example:
4615 COBICOR In this case, the analytic title
4622 COBICOR element; the monographic title contained within it,
4632 COBICOR The following reference, from a national standard for bibliographic references, illustrates this type of analysis with its distinction between main and subordinate titles. Note that this uses the more flexible
4636 COBICOR element: consequently, there is no requirement to tag all the components of the reference (notably the authors).
4653 COBICOR Slightly more complex is the distinction made below among main, subordinate, and parallel titles, in an example from the same source (p. 63). The punctuation and the bibliographic analysis are those given in ANSI Z39.29-1977; the punctuation is in the style prescribed by the International Standard Bibliographic Description (ISBD).
4654 COBICOR The analysis is not wholly unproblematic: as the text of the standard points out, the first subordinate title is subordinate only to the parallel title in French, while the second is subordinate to both the English main title and the French parallel title, without this relationship being made clear, either in the markup given in the example or in the reference structure offered by the standard.
4659 COBICOR , that specific punctuation may be included between the component elements of the reference.
4678 COBICOR element should be used for the person or agency with primary responsibility for a work's intellectual content, and the element
4681 COBICOR editor
4683 COBICOR author
4684 COBICOR of a broadcast, for example, while the author of a government report will usually be the agency which produced it. A translator, illustrator, or compiler may, however, be marked by means of the
4690 COBICOR Many bibliographic and Linked Data applications require disambiguation of author names using unique identifiers. Both the
4696 COBICOR elements, to supply such identifiers. Alternatively, if only a single identifier is to be recorded, the
4735 COBICOR element may also be used for editors, if it is desired to record the specific terms in which their role is described.
4749 COBICOR element may also occur. When one of these elements precedes or immediately follows a title, it applies to that title; when it follows an
4751 COBICOR element or occurs within an edition statement, it applies to the edition in question.
4797 COBICOR This example retains the original punctuation and editorial conventions of the source (ISO 690: 1987) and is therefore encoded using the
4803 COBICOR element applies to the edition, and not to the collection
4804 COBICOR per se
4807 COBICOR element, the component elements have been reordered from their appearance on the title page of the volume in order to ensure the correct relationship of the collection title, the edition statement, and the statement of responsibility.
4848 COBICOR The party with a particular responsibility for the intellectual content may vary over time. Likewise, a given individual's responsibility or role may change over time. These situations may be recorded with the
4850 COBICOR element. For example, the following could be used when one proofreader took over for another.
4868 COBICOR Another form of
4870 COBICOR arises when a work is published as the outcome of a conference, workshop or similar meeting. The
4932 COBICOD identifiers of various types because they do not include a statement of the title and the names of those intellectually responsible for it. The following elements may be used for such purposes:
4940 COBICOD For example, a citation to a patent typically includes a country or organization code (a two-character code identifying a patent authority) and a serial number for the patent (whose structure varies by patent authority). The citation might also contain a
4941 COBICOD kind code
4942 COBICOD (which characterizes a particular publication for the patent and which corresponds to a specific stage in the patent procedure) and the date when the patent was filed with or published by the issuing authority. For bibliographic references to patents, the above elements may be used as follows:
4947 COBICOD , may be used to contain the code of the patent authority. The
4949 COBICOD attribute may be used to specify the type of patent authority (such as a national patent office or a supra-national patent organization).
4952 COBICOD may be used to contain the serial number assigned by the corresponding patent authority.
4955 COBICOD may be used to contain the kind code of the patent document.
4958 COBICOD may be used to contain the date of the patent document. The
4960 COBICOD attribute may be used to specify whether this corresponds to the filing date of a patent application or the publication date of a patent publication.
4988 COBICOI imprint
4989 COBICOI is meant all the information relating to the publication of a work: the person or organization by whose authority and in whose name a bibliographic entity such as a book is made public or distributed (whether a commercial publisher or some other organization), the place and the date of publication. It may also include a full address for the publisher or organization. A full bibliographic reference will usually also specify the number of pages in a print publication (or equivalent information for non-print materials), and possibly also the specific location of the material being cited within its containing publication. The following elements are provided to hold this information:
4998 COBICOI Members of the model classes
5004 COBICOI element in a specific location within a
5014 COBICOI For bibliographic purposes, usually only the place (or places) of publication are required, possibly including the name of the country, rather than a full address; the element
5016 COBICOI is provided for this purpose. Where however the full postal address is likely to be of importance in identifying or locating the bibliographic item concerned, it may be supplied and tagged using the
5019 COBICOI . Alternatively, if desired, the
5024 COBICOI may be used; this involves no claim that the information given is either a full address or the name of a city.
5026 COBICOI The name of the publisher of an item should be marked using the
5028 COBICOI element even if the item is made public (
5030 COBICOI ) by an organization other than a conventional publisher, as is frequently the case with technical reports:
5094 COBICOI When an item has been reprinted, especially reprinted without change from a specific earlier edition, the reprint may appear in a
5098 COBICOI and other details of the reprint. In the following example, a microform reprint has been issued without any change in the title or authorship. The series statement here applies only to the second
5141 COBICOI This encoding can be extended to the case of patent documents, where the same patent application is published, with or without changes, at different stages of the patenting procedure. In this case, the kind code and, optionally, the publication date characterize different publications of the same patent application during the procedure. For example:
5167 COBICOI The above bibliographic reference discloses different publications of the patent EP1558513 during the patenting procedure. The first publication from 3 August 2005 has the kind code "A1" indicating that it is a published patent application comprising the European search report issued after carrying out the search at the European Patent Office, whereas the second publication from 9 September 2009 has the kind code "B1" indicating that it was published after the patent application had been granted.
5178 COBICOB Many bibliographic citations contain data limiting the citation to one or more volumes, issues, or pages, or to a name or number of a subdivision of the host work. These come in two varieties:
5188 COBICOB Where it is desired to distinguish different classes of such information (volume number, page number, chapter number, etc.), the
5310 COBICOB On the other hand, a cited range encodes that the author
5312 COBICOB defined by this range. For example, a footnote following a quotation from page 378 of
5360 COBICOS element. The title of the series may be tagged
5361 COBICOS title level="s"
5362 COBICOS , the volume number
5363 COBICOS biblScope unit="vol"
5364 COBICOS , and responsibility statements for the series (e.g. the name and affiliation of the editor, as in the example in section
5369 COBICOS . Any identifier associated with the series itself should be marked using the
5376 COBIRI related item
5377 COBIRI is any bibliographic item which, though related to that being defined, is distinct from it. The distinction between analytic and monographic items made above may be thought of as a special case of this kind of
5379 COBIRI item. More usually however, the term is applied to such items as translations, continuations, different versions, parts, etc.
5389 COBIRI describes a facsimile edition, and the second describes the work of which it is a facsimile. The relation between the facsimile and its source is represented by means of a
5439 COBIRI may contain any form of bibliographic reference. For example, one of the examples quoted above might also be encoded as follows:
5484 COBIRI attribute should be used to indicate the relationship between the bibliographic item and any
5526 COBIRI In this example, a full bibliographic description of the edition used as source for the translation is provided within the content of the
5528 COBIRI . Alternatively this might be provided by means of a link, in which case the
5547 COBICON Explanatory notes about the publication of unusual items, the form of an item (e.g.
5551 COBICON ), or its provenance (e.g.
5555 COBICON element. The same element may be used for any descriptive annotation of a bibliographic entry in a database.
5575 COBICON This element can take the form of a simple note such as:
5581 COBICON attribute to record the chief language of the bibliographic item, and optionally the
5593 COBICON attributes should both provide language identifiers in the same form as used for
5596 COBICON . Where additional detail is needed correctly to describe a language, or to discuss its deployment in a given text, this should be done using the
5598 COBICON element in the TEI header, within which individual
5625 COBICOO element, if it occurs, must come first, followed by one or more
5631 COBICOO element comes first), and then zero or more of the following in any order:
5647 COBICOO , the title(s), author(s), editor(s), and other statements of responsibility may appear in any order; it is recommended that all forms of the title be given together. Within
5649 COBICOO , the author, editor, and statements of responsibility may either come first or else follow the monographic title(s). Following these, the elements listed below, if present, must appear in the following order:
5652 COBICOO s on the publication (and
5654 COBICOO elements describing the conference, in the case of a proceedings volume)
5674 COBICOO , the sequence of elements is not constrained.
5688 COBIXR ). As discussed in that section, cross-referencing within TEI texts is in general represented by means of
5694 COBIXR attribute on these elements is used to supply an identifying value for the target of the cross-reference, which should be, in the case of bibliographic elements, a bibliographic reference of some kind. Where the form of the reference itself is unimportant, or may be reconstructed mechanically, or is not to be encoded, the
5701 COBIXR Where the form of the reference is important, or contains additional qualifying information which is to be kept but distinguished from the surrounding text, the
5707 COBIXR It may be important to distinguish between the short form of a bibliographic reference and some qualifying or additional information. The latter should not appear within the scope of the
5709 COBIXR element when this is the case, as for example in an application concerned to normalize bibliographic references:
5717 COBIXR element may also be used to provide a reference to a copy of the bibliographic item itself, particularly if this is available online, as in the following example:
5753 COBIOT The BibTeX scheme is intentionally compatible with that of Scribe, although it omits some fields used by Scribe. Hence only one list of fields is given here.
5756 COBIOT address
5758 COBIOT tag as
5765 COBIOT tag as
5768 COBIOT author
5770 COBIOT tag as
5775 COBIOT tag as
5776 COBIOT title level="m"
5784 COBIOT tag as
5785 COBIOT biblScope unit="chap"
5787 COBIOT date
5789 COBIOT used only to record date entry was made in the bibliographic database; not supported
5791 COBIOT edition
5793 COBIOT tag as
5796 COBIOT editor
5798 COBIOT tag as
5805 COBIOT tag as multiple
5829 COBIOT name type="org"
5833 COBIOT tag as
5835 COBIOT , possibly using the form
5836 COBIOT note place="inline"
5838 COBIOT institution
5840 COBIOT used only for issuer of technical reports; tag as
5845 COBIOT tag as
5846 COBIOT title level="j"
5854 COBIOT used to specify an alternate sort key for the bibliographic item, for use instead of author's or editor's name; not supported
5856 COBIOT meeting
5858 COBIOT tag as
5867 COBIOT ; if the date is not in a trivially parseable form, use the
5872 COBIOT note
5874 COBIOT tag as
5877 COBIOT number
5879 COBIOT tag as
5880 COBIOT biblScope unit="issue"
5882 COBIOT biblScope unit="number"
5884 COBIOT idno type="docno"
5888 COBIOT used only for sponsor of conference; use
5889 COBIOT name type="org"
5898 COBIOT tag as
5899 COBIOT biblScope unit="pp"
5901 COBIOT publisher
5903 COBIOT tag as
5908 COBIOT used only for institutions at which thesis work is done; tag as
5911 COBIOT series
5913 COBIOT tag as
5914 COBIOT title level="s"
5920 COBIOT title
5922 COBIOT tag as
5926 COBIOT value
5930 COBIOT tag as
5931 COBIOT biblScope unit="vol"
5935 COBIOT tag as
5937 COBIOT ; if the date is not in a trivially parseable form, use the
5945 CODV The following elements are included in the core module for the convenience of those encoding texts which include mixtures of prose, verse and drama.
5948 CODV Full details of other, more specialized, elements for the encoding of texts which are predominantly verse or drama are described in the appropriate chapter of part three (for verse, see the verse base described in chapter
5949 CODV ; for performance texts, see the drama base described in chapter
5950 CODV ). In this section, we describe only the elements listed above, all of which can appear in any text, whichever of the three modes prose, verse, or drama may predominate in it.
5954 COVE Like other written texts, verse texts or poems may be hierarchically subdivided, for example into books or cantos. These structural subdivisions should be encoded using the general purpose
5960 COVE . The fundamental unit of a verse text is the verse line rather than the paragraph, however.
5964 COVE element is used to mark up verse lines, that is metrical rather than typographic lines. In some modern or free verse, it may be hard to decide whether the typographic line is to be regarded as a verse line or not, but the distinction is quite clear for verse following regular metrical patterns. Where a metrical line is interrupted by a typographic line break, the encoder may choose to ignore the fact entirely or to use the empty
5967 COVE . By convention, the start of a metrical line implies the start of a typographic line; hence there is no need to introduce an
5969 COVE tag at the start of every
5971 COVE element, but only at places where a new typographic line starts within a metrical line, as in the following example:
5986 COVE In the original copy text, the presence of an ornamental capital at the start of the poem means that the measure is not wide enough to print the first four lines on four lines; instead each metrical line occupies two typographic lines, with a break at the point indicated. Note that this encoding makes no attempt to preserve information about the whitespace or indentation associated with either kind of line; if regarded as essential, this information would be recorded using the
5994 COVE element should not be used to represent typographic lines in non-verse materials: if the line-breaking points in a prose text are considered important for analysis, they should be marked with the
5996 COVE element. Alternatively, a neutral segmentation element such as
6011 COVE In some verse forms, regular groupings of lines are regarded as units of some kind, often identified by a regular verse scheme. In stichic verse and couplets, groups of lines analogous to paragraphs are often indicated by indentation. In other verse forms, lines are grouped into irregular sequences indicated simply by whitespace. The
6013 COVE or line group element may be used to mark any such grouping of elements from the
6020 COVE which may be used to further categorize the line group where this is felt desirable, as in the following example. This example also demonstrates the
6022 COVE attribute to indicate whether or not a line is indented.
6048 COVE For some kinds of analysis, it may be useful to identify different kinds of line group within the same piece of verse. Such line groups may self-nest, in much the same way as the un-numbered
6093 COVE It is often the case that verse line boundaries conflict with the boundaries of other structural elements. In the following example, the single verse line
6095 COVE is interrupted by a stage direction:
6119 COVE The same technique may be used where verse lines are collected together into units such as verse paragraphs:
6142 COVE element to indicate that it is incomplete, for example because it forms part of a group that is divided between two speakers, as in the following example:
6164 COVE For alternative methods of aligning groups of lines which do not form simple hierarchic groups, or which are discontinuous, see the more detailed discussion in chapter
6174 CODR performance texts
6175 CODR such as cinema or TV scripts are often hierarchically organized, for example into acts and scenes. These structural subdivisions should be encoded using the general purpose
6181 CODR . Within these divisions, the body of a performance text typically consists of
6183 CODR , often prefixed by a phrase indicating who is speaking, and occasionally interspersed with stage directions of various kinds.
6210 CODR In the following example, each speech consists of a sequence of verse lines, some of them being marked as metrically incomplete:
6266 CODR , the printed speaker attributions need to be supplemented by use of the
6312 CODR By contrast with the preceding examples, the following encodes an early printed edition without making any assumption about which parts are prose or verse:
6354 CODR elements should also be used to mark parts of a text otherwise in prose which are presented as if they were dialogue in a play. The following example is taken from a 19th century novel in which passages of narrative and passages of dialogue are mixed within the same chapter:
6401 core Elements common to all TEI documents
6410 COOV The selection and combination of modules to form a TEI schema is described in

WD-NonStandardCharacters.xml#12945

# id text
6 WD introduced the fundamental notions of language identification and character representation in an encoded TEI document. In this chapter we discuss some additional issues relating to the way that written language is represented in a TEI document. In sections
8 WD we introduce markup which may be used to represent and document non-standard characters, that is, written symbols for which no codepoint exists in Unicode. The same markup may be used to annotate existing characters according to their visual or other properties, and thus process them as distinct glyphs (see section
12 WD we discuss ways of documenting the writing mode used in a source text, that is, the directionality of the script, the orientation of individual characters, and related questions.
16 WDNE Despite the availability of Unicode, text encoders still sometimes find that the published repertoire of available characters is inadequate to their needs. This is particularly the case when dealing with ancient languages, for which encoding standards do not yet exist, or where an encoder wishes to represent variant forms of a character or
34 WDNE , and the associated character code charts. Alternatively, users can check the latest published version of
38 WDNE ), though the web site is often more up to date than the printed version, and should be checked for preference.
42 WDNE ) in the Unicode code charts are only meant to be representative, not definitive. If a specific form of an already encoded character is required for a project, refer to the guidelines contained below under
44 WDNE . Remember that your encoded document may be rendered on a system which has different fonts from yours: if the specific form of a character is important to you, then you should document it.
47 WDNE ) to see whether the character is in line for approval.
49 WDNE Ask on the Unicode email list (
54 WDNE Since there are now close to 100,000 characters in Unicode, chances are good that what you need is already there, but it might not be easy to find, since it might have a different name in Unicode. Look again, this time at other sites, for example
55 WDNE , which also provide searches based on scripts and languages. Take care, however, that all the properties of what seems to be a relevant character are consistent with those of the character you are looking for. For example, if your character is definitely a digit, but the properties of the best match you can find for it say that it is a letter, you may have a character not yet defined in Unicode.
59 WDNE However, if the character you are looking for is being used in a notation (rather than as part of the orthography of a language) then it is quite acceptable to select characters from the Mathematical Operators block, provided that they have the appropriate properties (i.e.
69 WDNE If, however, no suitable form of your character seems to exist, the next question will be:
70 WDNE Does the graphical unit in question represent a variant form of a known character, or does it represent a completely unencoded character?
74 WDNE These guidelines will help you proceed once you have identified a given graphical unit as either a variant or an unencoded character. Determining this will require knowledge of the contents of the document that you have. The first case will be called
76 WDNE of a character, while the second case will be called
82 WDNE While there is some overlap between these requirements, distinct specialized markup constructs have been created for each of these cases. These constructs are presented in section
91 D25-20 numeric character reference
94 D25-20 (A-umlaut). The encoder can also restrict the range of characters which are represented directly in a document (or part of it) by adding a suitable encoding declaration. For example, if a document begins with the declaration
96 D25-20 any Unicode characters which are not in the ISO-8859-1 character set must be represented by NCRs.
99 D25-20 gaiji
104 D25-20 .) This allows the encoder to distinguish characters and glyphs which Unicode regards as identical, to add new nonstandard characters or glyphs, and to represent Unicode characters not available in the document encoding by an alternative means.
122 D25-20 When the gaiji module is included in a schema, the
130 D25-20 The Unicode standard defines properties for all the characters it defines in the Unicode Character Database, knowledge of which is usually built into text processing systems. If the character represented by the
132 D25-20 element does not exist in Unicode at all, its properties are not available. If the character represented is an existing Unicode character, but is not available in the document character set recognized by a given text processing system, it may also be convenient to have access to its properties in the same way. The
136 D25-20 The list of attributes (properties) for characters is modelled on those in the Unicode Character Database, which distinguishes
140 D25-20 character properties. Additional, non-Unicode, properties may also be supplied. Since the list of properties will vary with different versions of the Unicode Standard, there may not be an exact correspondence between them and the list of properties defined in these Guidelines.
144 D25-20 . The gaiji module itself is formally defined in section
145 D25-20 below. It declares the following additional elements:
155 D25-20 when this module is included in a schema. The
159 D25-20 : this class is referenced as an alternative to plain text in almost every element which contains plain text, thus permitting the
161 D25-20 element also to appear at such places when this module is included in a schema.
182 D25-20 element) by providing a specific glyph that shows how a character appeared in the original document. This is necessary since Unicode code points refer not to a single, specific glyph shape of a character, but rather to a set of glyphs, any of which may be used to render the code point in question; in some cases they can differ considerably.
186 D25-20 element is provided for cases where the encoder wants to specify a specific glyph (or family of glyphs) out of all possible glyphs. Unfortunately, due to the way Unicode has been defined, there are cases where several glyphs that logically belong together have been given separate code points, especially in the blocks defining East Asian characters. In such cases,
188 D25-20 elements can also be used to express the view that these apparently distinct characters are to be regarded as instances of the same character (see further
191 D25-20 The Unicode Standard recommends naming conventions which should be followed strictly where the intention is to annotate an existing Unicode character, and which may also be used as a model when creating new names for characters or glyphs
192 D25-20 It should be noted, however, that this naming convention cannot meaningfully be applied to East Asian characters; the typical Unicode descriptions for these characters take the form
197 D25-20 is simply the Unicode code point value of the character in question. In cases where no Unicode code point exists, there is little hope of finding a name that helps to identify the character. Names should therefore be constructed in a way meaningful to local practice, for example by using a reference number from a well-known character dictionary or a project-specific serial number.
198 D25-20 . For convenience of processing, the following distinct elements are proposed for naming characters and glyphs:
225 D25-20 ) are defined by other TEI modules, and their usage here is no different from their usage elsewhere. The
227 D25-20 element, however, is used here only to link to an image of the character or glyph under discussion, or to contain a representation of it in SVG. The
239 D25-20 element is similar to the standard TEI
241 D25-20 element. While the latter is used to express correspondence relationships between TEI concepts or elements and those in other systems or ontologies, the former is used to express any kind of relationship between the character or glyph under discussion and characters or glyphs defined elsewhere. It may contain any Unicode character, or a
276 D25-20 The mapping element may also be used to represent a mapping of the character or (more likely) glyph under discussion onto a character from the private use area as in this example:
289 D25-20 A more precise documentation of the properties of any character or glyph may be supplied using the generic
297 ucsprops characters, defined by reference to a number of
299 ucsprops (or attribute-value pairs) which they are said to possess. For example, a lowercase letter is said to have the value
305 ucsprops properties (i.e. properties which form part of the definition of a given character), and
308 ucsprops additional
330 ucsprops For convenience, we list here some of the normative character properties and their values. For full information, refer to chapter 4 of
336 ucsprops The general category (described in the Unicode Standard chapter 4 section 5) is an assignment to some major classes and subclasses of characters. Suggested values for this property are listed here:
384 ucsprops Punctuation, initial quote
387 ucsprops Punctuation, final quote
405 ucsprops Separator, space
408 ucsprops Separator, line
432 ucsprops This property applies to all Unicode characters. It governs the application of the algorithm for bi-directional behaviour, as further specified in Unicode Annex 9,
518 ucsprops Start of fixed position classes
521 ucsprops End of fixed position classes
583 ucsprops This property is defined for characters which may be decomposed, for example to a canonical form plus a typographic variation of some kind. For such characters the Unicode standard specifies both a decomposition type and a decomposition mapping (i.e. another Unicode character to which this one may be mapped in the way specified by the decomposition type). The following types of mapping are defined in the Unicode Standard:
589 ucsprops A no-break version of a space or hyphen
592 ucsprops An initial presentation form (Arabic)
595 ucsprops A medial presentation form (Arabic)
598 ucsprops A final presentation form (Arabic)
601 ucsprops An isolated presentation form (Arabic)
604 ucsprops An encircled form
607 ucsprops A superscript form
610 ucsprops A subscript form
613 ucsprops A vertical layout presentation form
622 ucsprops A small variant form (CNS compatibility)
628 ucsprops A vulgar fraction form
637 ucsprops This property applies to any character which expresses any kind of numeric value. Its value is the intended value in decimal notation.
643 ucsprops independent of the text direction: it has the value
650 ucsprops The Unicode Standard also defines a set of informative (but non-normative) properties for Unicode characters. If encoders want to provide such properties, they may be included using the suggested Unicode name, tagged using the
654 ucsprops element to distinguish them. If a Unicode name exists for a given property, it should however always be preferred to a locally defined name. Locally defined names should be used only for properties which are not specified by the Unicode Standard.
661 D25-30 Annotation of a character becomes necessary when it is desired to distinguish it on the basis of certain aspects (typically, its graphical appearance) only. In a manuscript, for example, where distinctly different forms of the letter "r" can be recognized, it might be useful to distinguish them for analytic purposes, quite distinct from the need to provide an accurate representation of the page. A digital facsimile, particularly one linked to a transcribed and encoded version of the text, will always provide a superior visual representation (for information on how to link a digital facsimile to a transcribed text see
662 D25-30 ), but cannot be used to support arguments based on the distribution of such different forms. Character annotation as described here provides a solution to this problem.
663 D25-30 It should be kept in mind that any kind of text encoding is an abstraction and an interpretation of the text at hand, which will not necessarily be useful in reproducing an exact facsimile of the appearance of a manuscript.
666 D25-30 Assuming that we wish to distinguish the variant glyphs from the standard representation for the character concerned, we will need to define distinct
693 D25-30 With these definitions in place, occurrences of these two special "r"s in the text can be annotated using the element
708 D25-30 element will be interpreted as an annotation on the content of the element
734 D25-30 ligature; the encoder may however prefer not to use it in order to simplify other text processing operations, such as indexing).
745 D25-30 which would enable the same material to be encoded as follows:
749 D25-30 The same technique may be used to represent particular abbreviation marks as well as to represent other characters or glyphs. For example, if we believe that the r-with-one-funny-stroke is being used as an abbreviation for
755 D25-30 Note however that this technique employs markup objects to provide a link between a character in the document and some annotation on that character. Therefore, it cannot be used in places where such markup constructs are not allowed, notably in attribute values.
757 D25-30 Since the need to use these constructs to annotate or define characters occurs frequently in Chinese, Korean, and Japanese documents, here are some issues that are specific to these documents. There are two slightly different versions of the problem. In the first case, due to the way Unicode is defined, there are occasions when more than one glyph is defined for a character. On such an occasion, one might want to retain the character as used, but add information in a way so that a normalizer (for search or indexing operations) could take advantage of this information. To achieve this, we simply define within a
777 D25-30 , simply maps our glyph to the code point where Unicode defined it. The other one, of type
779 D25-30 , encodes the fact that in our view, this glyph is a variation of the standard character given in the content of the element. We could then use this
783 D25-30 to refer to it from within a text as follows.
789 D25-30 A slightly different, but related problem occurs when we have multiple variants, none of which has been defined in Unicode. In this case, we need to define one as a new character using
808 D25-30 element then defines a variant glyph of this newly defined character. Additional properties should be specified in order to make these both identifiable.
814 D25-40 The creation of additional characters for use in text encoding is quite similar to the annotation of existing characters. The same element
816 D25-40 is used to provide a link from the character instance in the text to a character definition provided within the
818 D25-40 element. This character definition takes the form of a
822 D25-40 itself will usually be empty, but could contain a code point from the Private Use Area (PUA) of the Unicode Standard, which is an area set aside for the very purpose of privately adding new characters to a document. Recommendations on how to use such PUA characters are given in the following section.
824 D25-40 In some circumstances, it may be desirable to provide a single precomposed form of a character that is encoded in Unicode only as a sequence of code points. For example, in Medieval Nordic material, a character looking like a lowercase letter Y with a dot and an acute-accent above it may be encountered so frequently that the encoder wishes to treat it as a single precomposed character with one single coded value. In the transcription concerned, the encoder enters this letter as
826 D25-40 , which when the transcription is processed can then be expanded in one of three ways, depending on the mapping in force. The entity reference might be translated into the sequence of corresponding Unicode code points or into some locally-defined PUA character (say
828 D25-40 ) for local processing only. Both these options have disadvantages; the former loses the fact that the sequence of composed characters is regarded as a single object; the second is not reliably portable. Therefore, the recommended representation is to use the
831 D25-40 . This makes it possible for the encoder to provide useful documentation for the particular character or glyph so referenced:
845 D25-40 This definition specifies the mapping between this composed character and the individual Unicode-defined code points which make it up. It also supplies a single locally-defined property (
847 D25-40 ) for the character concerned, the purpose of which is to supply a recommended character entity name for the character.
849 D25-40 Under certain circumstances, Chinese Han characters can be written within a circle. Rather than considering this as simply an aspect of the rendering, an encoder may wish to treat such circled characters as entirely distinct derived characters. For a given character (say that represented by the numeric-character reference
880 D25-40 . The two mappings indicate firstly that the standard form of this character is the character
884 D25-40 . For convenience of local processing this PUA character may in fact appear as content of the
894 D25-50 The developers of the Unicode Standard have set aside an area of the codespace for the private use of software vendors, user groups, or individuals. As of this writing (Unicode 5.0), there are around 137,000 code points available in this area, which should be enough for most needs. No code point assignments will be made to this area by standards bodies and only some very basic default properties have been assigned (which may be overridden where necessary by the mechanism outlined in this chapter). Therefore, unlike all other code points defined by the Unicode Standard, PUA code points should
898 D25-50 In the two previous examples, we mentioned that the variant characters concerned might well be assigned specific code points from the PUA. This might, for example, facilitate the use of a particular font which displays the desired character at this code point in the local processing environment. Since however this assignment would be valid only on the local site, documents containing such code points are unsuitable for blind interchange. During the process of preparing such documents for interchange, any PUA code points should be replaced by an appropriate use of the
901 D25-50 g ref="#xxxx"
907 D25-50 , or retained as content of the
909 D25-50 element. However, since there is no requirement that the same PUA character be used to represent it at the receiving site, and since it may well be the case that this other site has already made an assignment of some other character to the original PUA code point, it is best practice to remove the locally-defined PUA character. It is to be expected that a further translation into the local processing environment at the receiving site will be necessary to handle such characters, during which variant letters can be converted to hitherto unused code points on the basis of the information provided in the
913 D25-50 This mechanism is rather weak in cases where DOM trees or parsed XML fragments are exchanged, which may increasingly be the case. The best an application can do here is to treat any occurrence of a PUA character only in the context of the local document and use the properties provided through the
917 D25-50 In the fullness of time, a character may become standardized, and thus assigned a specific code point outside the PUA. Documents which have been encoded using the mechanism must at the least ensure that this changed code point is recorded within the relevant
929 WDWM The scripts used for writing human languages vary not only in the glyphs they use, but also in the way (or ways) that those glyphs are arranged on the writing surface. For the majority of modern languages, writing is arranged as a series of lines which are to be read from top to bottom. Within each line, individual characters are frequently presented from left to right (English, Russian, Greek), but there are also several widely-used scripts which run right-to-left (Arabic, Hebrew). Writing in which the lines of glyphs are presented vertically and read from right to left is also often encountered, notably in older East Asian scripts (Sinitic characters, Japanese Kana, Korean Hangul, Vietnamese chữ nôm). In many cases, a language normally uses the same
930 WDWM writing mode
931 WDWM (we use this term to refer to the orientation of individual glyphs within a line and the order in which glyphs and lines should be read), but there are exceptions in which the same language may appear in different modes, for example either vertically or horizontally. Many East Asian scripts were traditionally written from top to bottom within the line, with their lines sequenced from right to left. Although modern Japanese, Chinese, and Korean are often written horizontally, the traditional vertical writing mode is still widely used. There are also comparatively rare cases of ancient scripts written with lines running left to right, each line being read top to bottom (Ancient Uighur, classical Mongolian and Manchu), or scripts such as Ogham where the writing direction may start from the bottom left and run around the edge of an inscribed object.
933 WDWM When different languages are combined, it is possible that different writing modes will be needed: for example, in Hebrew text, running right to left, sequences of Latin digits still run left to right. When different writing modes are available for the same language, it may be that different glyphs will be preferred when the script is used in different modes. For example, when Japanese is written horizontally, the Unicode character U+3001, the
935 WDWM , is used in preference to Unicode character U+FE11, the vertical mode comma. This ensures that the comma appears in the correct position relative to the surrounding glyphs. Even for scripts which are usually written in exactly the same way, different writing modes may be encountered in particular contexts; for example when a language using Roman script is embedded within vertically-organized Chinese text, it may sometimes be displayed vertically and sometimes horizontally. The writing mode may also vary in response to layout constraints such as those imposed by a complex table, where column or row labels may be written vertically or diagonally to make the most effective use of available space, just as it may vary in response to the size and shape of the carrier in the case of a monumental inscription.
937 WDWM For many, perhaps most, TEI documents there may be no need to encode the writing mode explicitly, even in so-called "mixed mode" texts containing passages written in languages which use different writing modes. Modern printed texts in most European languages, for instance, may be expected to use left-to-right/top-to-bottom directionality; while Arabic or Hebrew texts are expected to run right-to-left/top-to-bottom. In a TEI document, language and script are explicitly stated in the markup using the attribute
939 WDWM ; this indication will usually imply a particular default writing mode. Even where this attribute is not used, passages in different scripts will use different Unicode characters, and will thus imply a particular default writing mode.
941 WDWM Consider the case of an English text containing a few Arabic words:
943 WDWM The Arabic term قلم رصاص means "pencil".
945 WDWM A correct TEI encoding might read as follows:
954 WDWM attribute with value
956 WDWM that causes processing software to display the Arabic from right to left, but in fact, this is not the case. The order in which the Arabic characters appear when rendered would be the same, even if the markup were not present:
961 WDWM This is because Arabic glyphs are always displayed right to left, even when they appear within a left-to-right English sentence. Like most other codepoints in the Unicode standard, they have a specific directionality setting which helps any rendering software determine how they should be ordered. The Latin glyph "a" has a strong left-to-right bidirectionality setting; the Hebrew א (alef) is strongly right-to-left. The digits 0 to 9 are classed as European numbers, a weak type, although the digits within a number are still displayed left to right. Of course, some glyphs (common punctuation marks such as the period or comma for example) have weak or neutral settings because they may appear in several contexts.
965 WDWM ) defines a number of rules enabling software to render sequences of characters which have differing directionality properties in a predictable and reliable way, using only those properties.
966 WDWM Because this algorithm may not always give the desired result, Unicode also provides a set of "directional formatting characters" (
967 WDWM ). These additional codepoints can be used to signal to rendering software that a specific directionality setting should be turned on or off. However, in the case of documents encoded in XML, there is no need to use such characters, and in fact the W3C explicitly advises against it. "In (X)HTML and XML do not use the paired Unicode bidi formatting code characters where equivalent markup is available." (
969 WDWM . It should be remembered however that individual sequences of characters are always stored in a file in the order in which they should be read, irrespective of the order in which the characters making up a sequence should be displayed or rendered. For example, in an RTL language such as Hebrew, the first character in a file will be that which is displayed at the rightmost end of the first line of text.
971 WDWM An encoder wishing to document or to control the order in which sequences of characters in a TEI document are displayed will usually do so by segmenting the text into sequences presented in the desired order and specifying an appropriate language code for each. In situations where this approach may result in ambiguity or lack of precision, or if the encoder wishes to record directional information explicitly in their encoding, we recommend using the global @style attribute to supply detail about the writing mode applicable to the content of any element. The
975 WDWM At the time of writing, this W3C module has the status of a candidate recommendation: see further
978 WDWM which permits direct specification of a number of useful properties associated with writing modes, notably
1004 WDWM The global TEI
1010 WDWM and then point to them using the global
1013 WDWM . Although the CSS specifications are mainly used to provide instructions for software when rendering a digital text, they also provide a useful means of describing the visual properties of a pre-existing document in a formal and standardized way.
1015 WDWM The next section presents some examples of how CSS can be used to describe a variety of writing modes. A full description of the appearance of a document will probably include many other properties of course.
1021 WDWMEG The CSS recommendations provide several properties which can be used to encode aspects of the "writing mode". The most useful of these is the property "writing-mode" which may be used to specify a reading-order for both characters within a single line and lines within a single block of text. The property "text-orientation" may also be used to indicate the orientation of individual characters with respect to the line, and the property "direction" to determine the reading order of characters within a line only. We give some examples of each below.
1028 WDWMEG1 property is particularly useful for languages which can be written in different writing modes, such as Chinese and Japanese. Its possible values include
1034 WDWMEG1 . Each value has two components:
1038 WDWMEG1 specifies the inline writing direction, while the second component specifies the direction in which lines in a block, and blocks in a sequence, are arranged: from top to bottom (as in most European languages, in which lines and paragraphs are arranged from top to bottom on a page), from right to left (as in the case of Japanese written vertically), or left-to-right (as in the case of Mongolian).
1088 WDWMEG1 to supply a value of
1092 WDWMEG1 attribute specifies a horizontal writing mode; this may seem superfluous, but vertically-written romaji is not unknown.
1098 WDWMEG2 When Japanese is written vertically, the glyph orientation remains the same as when it is written horizontally. In other words, glyphs are not rotated (although as noted above some different glyphs may be used for some characters, in particular for punctuation which needs to be positioned differently in vertical and in horizontal text). However, it is very common for languages written vertically to have embedded runs of text from languages which are normally written horizontally. This raises the issue of the orientation of the glyphs from the horizontal language. Are they written upright, as they would normally appear in horizontal text runs, or are they rotated? Consider this fragment from a Japanese article about the Indonesian language, which takes the form of a glossary list:
1108 WDWMEG2 The text-orientation property allows us to indicate whether or not glyphs are rotated. In the following example, we have indicated that the list uses a
1110 WDWMEG2 writing mode, but that the orientation of individual glyphs may vary:
1126 WDWMEG2 characters from horizontal-only scripts are set sideways, i.e. 90° clockwise from their standard orientation in horizontal text. Characters from vertical scripts are set with their intrinsic orientation
1129 WDWMEG2 ). Since the default value for
1133 WDWMEG2 , this rule is not strictly required. However, if the Indonesian glyphs (which are roman characters) had been set vertically, like this:
1142 WDWMEG2 then an encoding like the following could be used to make this explicit:
1158 WDWMEG2 characters from horizontal-only scripts are rendered upright, i.e. in their standard horizontal orientation. Characters from vertical scripts are set with their intrinsic orientation and shaped normally
1169 WDWMEG3 It is not unusual to see text from horizontal languages written vertically even where no vertically-written script is involved. This example is a fragment from a table of information about agricultural development on Vancouver Island, written in 1855:
1180 WDWMEG3 Four of the subheading cells in this fragment contain English text written vertically, bottom-to-top, to conserve space on the page. To describe this sort of phenomenon, we can use the
1189 WDWMEG3 causes text to be set as if in a horizontal layout, but rotated 90° counter-clockwise.
1190 WDWMEG3 We might encode the third of the four cells containing vertical text like this:
1200 WDWMEG3 property captures the fact that the script is written vertically, and its lines are to be read from left to right (so the line containing
1203 WDWMEG3 Cash value
1206 WDWMEG3 value encodes the orientation (rotated 90° counter-clockwise). We might also add
1208 WDWMEG3 to the style, to express the fact that the text is centrally-aligned.
1214 WDWMEG4 Of the rather small number of scripts which appear to be written bottom-to-top, perhaps the best-known is Ogham, an alphabet used mainly to write Archaic Irish. Ogham is typically found inscribed along the edge of a standing stone, starting at its base. The CSS Writing Modes specification does not explicitly distinguish between vertical scripts which are written top-to-bottom and those which are written bottom-to-top. Instead, such bottom-to-top scripts are best treated as left-to-right horizontal scripts, oriented vertically because of the constraints of the medium on which they are inscribed. Such scripts are analogous to the vertical English text-runs in the table cells in the example above, and can be handled in exactly the same manner (
1216 WDWMEG4 ). In cases where writing follows a curved path (such as Ogham running around the edge of a stone), a meticulous encoder might resort to the use of SVG to describe the path, rather than treating the phenomenon as a writing mode.
1225 WDWMEG5 The Arabic term قلم رصاص means "pencil".
1238 WDWMEG5 property to record the observed directionality of the text is unambiguous, even though it is (as we noted above) superfluous. The use of the
1240 WDWMEG5 property here may require some explanation. By default this property has the value
1242 WDWMEG5 , the effect of which in this context would be to ignore any value supplied for the direction property. The CSS Writing Modes specification stipulates that the direction property
1243 WDWMEG5 has no effect on bidi reordering when specified on inline boxes whose
1245 WDWMEG5 property’s value is
1247 WDWMEG5 , because the element does not open an additional level of embedding with respect to the bidirectional algorithm.
1250 WDWMEG5 Mixed horizontal directionality is very common in languages such as Arabic and Hebrew, particularly when numbers (which are always given LTR) or phrases from LTR languages are embedded. It is not impossible, though quite unusual, for ambiguities to arise in such situations, which may give rise to the parts of a document being displayed in unexpected ways that do not correspond to the natural reading order. A more detailed discussion of this issue from an HTML perspective is provided by a W3C Internationalization Working Group report
1251 WDWMEG5 Inline markup and bidirectional text in HTML
1260 WDWMEG For most texts, information about text directionality need not be explicitly encoded in a TEI text, either because it follows unambiguously from
1262 WDWMEG values, or because it can be expected to be handled unequivocally by the Unicode Bidi Algorithm. Where it is considered important to encode such information, properties and values taken from the CSS Writing Modes module may be used by means of the global TEI
1264 WDWMEG attribute (or using the TEI
1275 WDWMTT In what follows, we examine a range of textual phenomena which in some ways appear very similar to those examined above, and even overlap with them. We can categorize these as text transformation features, and suggest some strategies for encoding them based on the properties detailed in the
1286 WDWMTT Here a block of text has been rotated around its z-axis. This is clearly not a
1287 WDWMTT writing mode
1288 WDWMTT ; the writing mode for this text is horizontal, left to right. Furthermore, even if we wished to treat this as a writing mode, we could not do so, because there is no way to use writing-mode properties to describe a text orientation which is angled at 45 degrees; no human languages are consistently written in this orientation. It is more appropriate to treat this as a rotational transformation. We can do this using two properties:
1292 WDWMTT . (Both of these properties have quite complex value sets, and we will not look at all of them here. See the
1298 WDWMTT property takes as its value one or more of the transform functions, one of which is the function
1304 WDWMTT Any rotation must take place clockwise around an axis positioned relative to the element being rotated, and the
1306 WDWMTT property can be used to specify the pivot point. By default, the value of
1310 WDWMTT , the point at the centre of the element, but these values can be changed to reflect rotation around a different origin point. (The TEI
1316 WDWMTT A block of text may also be rotated about either of its other axes. For example, this shows rotation around the Y (vertical) axis:
1330 WDWMTT which are both normally printed in a rotated form so that they represent a pair of wings:
1351 WDWMTT We might also argue that this is in fact a vertical writing mode by supplying
1353 WDWMTT as the value for the
1357 WDWMTT Rotation is also useful as a method of handling a true writing mode which is not covered by the CSS Writing Modes:
1359 WDWMTT . This is a writing mode common in inscriptions in Latin, Greek and other languages, in which alternate lines run from left to right and from right to left
1360 WDWMTT The name is taken from the Greek βουστροφηδόν, meaning
1364 WDWMTT ); that is, turning as an ox does when pulling a plough.
1366 WDWMTT mirror writing
1389 WDWMTT The 180-degree rotation around the Y (vertical) axis here describes what is happening in the RTL line in boustrophedon; the order of glyphs is reversed, and so is their individual orientation (in fact, we see them
1390 WDWMTT from the back
1395 WDWMTT in the sense of poetic lines; the text is continuous prose, and linebreaks are incidental.
1397 WDWMTT There are obviously some unsatisfactory aspects of this manner of encoding boustrophedon. In the inscription above, some words run across linebreaks, so if we wished to tag both words and the right-to-left phenomena, one hierarchy would have to be privileged over the other. By using a transform function rather than a writing mode property, we are apparently suggesting that boustrophedon is not in fact a writing mode, whereas it clearly is. But the CSS Writing Modes specification does not provide support for boustrophedon, because it is a rather obscure historical phenomenon; using a rotational transform is one practical alternative.
1405 WDCAV ; the language is designed to describe how an HTML document should be formatted. This is not, of course, the case for the TEI, which lacks any explicit processing or formatting model, and attempts to define objects as far as possible without consideration of their visual appearance. As long as the properties and values from the CSS Transforms module are used as a convenient, well-specified descriptive language to capture features of a text, without any expectation of using them directly and reliably for rendering, this is not particularly problematic. CSS provides a useful and well-defined vocabulary to describe many aspects of the appearance of source texts, benefitting particularly from the clarity of definition provided by the specification. However, if there is any expectation of using this information to render a text in a predictable and accurate way, it will be essential to provide enough styling information throughout the document hierarchy to resolve all ambiguities with regard to size, positioning, block status, etc. before any element undergoes a transform operation.
1410 WSD-DEF The gaiji module described in this chapter makes available the following components:
1413 gaiji Character and glyph documentation
1422 WSD-DEF The selection and combination of modules to form a TEI schema is described in
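
The rows above mention two kinds of Unicode character property: membership of the Private Use Area (rows D25-40 and D25-50) and the bidirectional class that drives the Unicode Bidi Algorithm (row 961 WDWM). The following is a minimal sketch, not part of the source text, of how both can be inspected with Python's standard `unicodedata` module. The helper names `is_private_use`, `pua_size`, and `bidi_class` are hypothetical and introduced here for illustration only; the three ranges are those the Unicode Standard reserves for private use.

```python
import unicodedata

# Private-use ranges reserved by the Unicode Standard (inclusive bounds).
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch lies in a private-use range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def pua_size() -> int:
    """Total number of private-use code points across all three ranges."""
    return sum(hi - lo + 1 for lo, hi in PUA_RANGES)


def bidi_class(ch: str) -> str:
    """Bidirectional class of ch as recorded in the Unicode Character Database."""
    return unicodedata.bidirectional(ch)


if __name__ == "__main__":
    print(pua_size())  # matches the "around 137,000" figure quoted above
    for ch in ["a", "\u05D0", "\u0642", "1", ","]:
        # L = strong left-to-right, R / AL = strong right-to-left,
        # EN = European number (weak), CS = common separator (weak)
        print(f"U+{ord(ch):04X}", bidi_class(ch), is_private_use(ch))
```

A check of this kind could be run over a transcription before interchange to flag any private-use characters that remain in the text. The Unicode Character Database reports the digits as class EN (a weak type) rather than as strong left-to-right.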

TS-TranscriptionsofSpeech.xml#12961

# id text
4 TS The module described in this chapter is intended for use with a wide variety of transcribed spoken material. It should be stressed, however, that the present proposals are not intended to support unmodified every variety of research undertaken upon spoken material now or in the future; some discourse analysts, some phonologists, and doubtless others may wish to extend the scheme presented here to express more precisely the set of distinctions they wish to draw in their transcriptions. Speech regarded as a purely acoustic phenomenon may well require different methods from those outlined here, as may speech regarded solely as a process of social interaction.
6 TS This chapter begins with a discussion of some of the problems commonly encountered in transcribing spoken language (section
8 TS documents some additional TEI header elements which may be used to document the recording or other source from which transcribed text is taken. Section
10 TS of this chapter reviews further problems specific to the encoding of spoken language, demonstrating how mechanisms and elements discussed elsewhere in these Guidelines may be applied to them.
21 TSOV of speech. Speech varies according to a large number of dimensions, many of which have no counterpart in writing (for example, tempo, loudness, pitch, etc.). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. Spoken material may be transcribed in the course of linguistic, acoustic, anthropological, psychological, ethnographic, journalistic, or many other types of research. Even in the same field, the interests and theoretical perspectives of different transcribers may lead them to prefer different levels of detail in the transcript and different styles of visual display. The production and comprehension of speech are intimately bound up with the situation in which speech occurs, far more so than is the case for written texts. A speech transcript must therefore include some contextual features; determining which are relevant is not always simple. Moreover, the ethical problems in recording and making public what was produced in a private setting and intended for a limited audience are more frequently encountered in dealing with spoken texts than with written ones.
23 TSOV Speech also poses difficult structural problems. Unlike a written text, a speech event takes place in time. Its beginning and end may be hard to determine and its internal composition difficult to define. Most researchers agree that the utterances or
25 TSOV of individual speakers form an important structural component in most kinds of speech, but these are rarely as well-behaved (in the structural sense) as paragraphs or other analogous units in written texts: speakers frequently interrupt each other, use gestures as well as words, leave remarks unfinished and so on. Speech itself, though it may be represented as words, frequently contains items such as vocalized pauses which, although only semi-lexical, have immense importance in the analysis of spoken text. Even non-vocal elements such as gestures may be regarded as forming a component of spoken text for some analytic purposes. Below the level of the individual utterance, speech may be segmented into units defined by phonological, prosodic, or syntactic phenomena; no clear agreement exists, however, even as to appropriate names for such segments.
27 TSOV Spoken texts transcribed according to the guidelines presented here are organized as follows. The overall structure of a TEI spoken text is identical to that of any other TEI text: the
29 TSOV element for a spoken text contains a
33 TSOV element. Even texts primarily composed of transcribed speech may also include conventional front and back matter, and may even be organized into divisions like printed texts.
39 TSOV as organizing unit for spoken material
40 TSOV A spoken
42 TSOV might typically be a conversation between a small number of people, a lecture, a broadcast TV item, or a similar event. Each such unit has associated with it a
44 TSOV providing detailed contextual information such as the source of the transcript, the identity of the participants, whether the speech is scripted or spontaneous, the physical and social setting in which the discourse takes place and a range of other aspects. Details of the header in general are provided in chapter
45 TSOV ; the particular elements it provides for use with spoken texts are described below (
46 TSOV ). Details concerning additional elements which may be used for the documentation of participant and contextual information are given in
49 TSOV Defining the bounds of a spoken text is frequently a matter of arbitrary convention or convenience. In public or semi-public contexts, a text may be regarded as synonymous with, for example, a
52 TSOV broadcast item
54 TSOV meeting
55 TSOV , etc. In informal or private contexts, a text may be simply a conversation involving a specific group of participants. Alternatively, researchers may elect to define spoken texts solely in terms of their duration in time or length in words. By default, these Guidelines assume of a text only that:
61 TSOV it represents a single stretch of time with no significant discontinuities.
66 TSOV element may take the value
68 TSOV to specify that the components of the text are discrete) but is not recommended.
72 TSOV it may be necessary to identify subdivisions of various kinds, if only for convenience of handling. The neutral
79 TSOV A spoken text may contain any of the following components:
87 TSOV kinesic (non-verbal, non-lexical) phenomena such as gestures
91 TSOV writing, regarded as a special class of incident in that it can be transcribed, for example captions or overheads displayed during a lecture
93 TSOV shifts or changes in vocal quality
96 TSOV Elements to represent all of these features of spoken language are discussed in section
101 TSOV ) may contain lexical items interspersed with pauses and non-lexical vocal sounds; during an utterance, non-linguistic incidents may occur and written materials may be presented. The
107 TSOV A spoken text itself may be without substructure, that is, it may consist simply of units such as utterances or pauses, not grouped together in any way, or it may be subdivided. If the notion of what constitutes a
108 TSOV text
109 TSOV in spoken discourse is inevitably rather an arbitrary one, the notion of formal subdivisions within such a
110 TSOV text
112 TSOV text
119 TSOV , provided only that the set of all such divisions is coextensive with the text.
121 TSOV Each such division of a spoken text should be represented by the numbered or unnumbered
124 TSOV . For some detailed kinds of analysis a hierarchy of such divisions may be found useful; nested
126 TSOV elements may be used for this purpose, as in the following example showing how a collection made up of transcribed
127 TSOV sound bites
128 TSOV taken from speeches given by a politician on different occasions might be encoded. Each extract is regarded as a distinct
148 TSOV attribute, for use where the divisions of a text do not all share the same set of the contextual declarations specified in the TEI header. (See further section
154 HD32 Where a computer file is derived from a spoken text rather than a written one, it will usually be desirable to record additional information about the recording or broadcast which constitutes its source. Several additional elements are provided for this purpose within the source description component of the TEI header:
168 HD32 Note that detailed information about the participants or setting of an interview or other transcript of spoken language should be recorded in the appropriate division of the profile description, discussed in chapter
169 HD32 , rather than as part of the source description. The source description is used to hold information only about the source from which the transcribed speech was taken, for example, any script being read and any technical details of how the recording was produced. If the source was a previously-created transcript, it should be treated in the same way as any other source text.
173 HD32 element should be used where it is known that one or more of the participants in a spoken text is speaking from a previously prepared script. The script itself should be documented in the same way as any other written text, using one of the three citation tags mentioned above. Utterances or groups of utterances may be linked to the script concerned by means of the
192 HD32 is used to group together information relating to the recordings from which the spoken text was transcribed. The element may contain either a prose description or, more helpfully, one or more
194 HD32 elements, each corresponding with a particular recording. The linkage between utterances or groups of utterances and the relevant recording statement is made by means of the
201 HD32 element should be used to provide a description of how and by whom a recording was made. This information may be provided in the form of a prose description, within which such items as statements of responsibility, names, places, and dates may be identified using the appropriate phrase-level tags. Alternatively, a selection of elements from the
212 HD32 Specialized collections may wish to add further sub-elements to these major components. These elements should be used only for information relating to the recording process itself; information about the setting or participants (for example) is recorded elsewhere: see sections
251 HD32 When a recording has been made from a public broadcast, details of the broadcast itself should be supplied within the
255 HD32 element. A broadcast is closely analogous to a publication and the
263 HD32 . The broadcasting agency responsible for a broadcast is regarded as its author, while other participants (for example interviewers, interviewees, script writers, directors, producers, etc.) should be specified using the
294 HD32 When a broadcast contains several distinct recordings (for example a compilation), additional
318 TSBA The following elements characterize spoken texts, transcribed according to these Guidelines:
323 TSBA element may appear directly within a spoken text, and may contain any of the others; the others may also appear directly (for example, a
327 TSBA element. In terms of the basic TEI model, therefore, we regard the
367 TSBA (for sounds produced by the human vocal apparatus), and
377 TSBA incident
383 TSBA kinesic
389 TSBA vocal
406 TSBA vocal events
408 TSBA usually involuntary noises. Equally, the distinction between utterances and vocals is not always clear, although for many analytic purposes it will be convenient to regard them as distinct. Individual scholars may differ in the way borderlines are drawn and should declare their definitions in the
410 TSBA element of the header (see
413 TSBA The following short extract exemplifies several of these elements. It is recoded from a text originally transcribed in the CHILDES format.
424 TSBA ). Non-verbal vocal effects such as the child's meowing are indicated either with orthographic transcriptions or with the
426 TSBA element, and entirely non-linguistic but significant incidents such as the sound of the toy cat are represented by the
470 TSBA This example also uses some elements common to all TEI texts, notably the
472 TSBA tag for editorial regularization. Unusually stressed syllables have been encoded with the
479 TSBA Contextual information is of particular importance in spoken texts, and should be provided by the TEI header of a text. In general, all of the information in a header is understood to be relevant to the whole of the associated text. The element
490 TSBAUT Each distinct
492 TSBAUT in a spoken text is represented by a
500 TSBAUT attribute to associate the utterance with a particular speaker is recommended but not required. Its use implies as a further requirement that all speakers be identified by a
504 TSBAUT element in the TEI header (see section
505 TSBAUT ), but it may also point to another external source of information about the speaker. Where utterances or other parts of the transcription cannot be attributed with confidence to any particular participant or group of participants, the encoder may choose to create
513 TSBAUT , and perhaps give the root
517 TSBAUT value of
519 TSBAUT , then point to those as appropriate using
526 TSBAUT . The value specified applies to the transition from the preceding utterance into the utterance bearing the attribute. For example:
527 TSBAUT For the most part, the examples in this chapter use no sentence punctuation except to mark the rising intonation often found in interrogative statements; for further discussion, see section
541 TSBAUT , while there is a marked pause between
552 TSBAUT An utterance may contain either running text, or text within which other basic structural elements are nested. Where such nesting occurs, the
562 TSBAUT ; that is, a pause or shift (etc.) within an utterance is regarded as being produced by that speaker only, while a pause between utterances applies to all speakers.
564 TSBAUT Occasionally, an utterance may seem to contain other utterances, for example where one speaker interrupts himself, or when another speaker produces a
566 TSBAUT while they are still speaking. The present version of these Guidelines does not support nesting of one
568 TSBAUT element within another. The transcriber must therefore decide whether such interruptions constitute a change of utterance, or whether other elements may be used. In the case of self-interruption, the
570 TSBAUT element may be used to show that the speaker has changed the quality of their speech:
589 TSBAUT Where this is not possible, it is simplest to regard the back-channel as a distinct utterance.
594 TSBAPA Speakers differ very much in their rhythm and in particular in the amount of time they leave between words. The following element is provided to mark occasions where the transcriber judges that speech has been paused, irrespective of the actual amount of silence:
595 TSBAPA A pause contained by an utterance applies to the speaker of that utterance. A pause between utterances applies to all speakers. The
607 TSBAPA If detailed synchronization of pausing with other vocal phenomena is required, the alignment mechanism defined at section
610 TSBAPA attribute mentioned in the previous section may also be used to characterize the degree of pausing between (but not within) utterances.
619 TSBAVO attribute should be used to specify the person or group responsible for a
625 TSBAVO which is contained within an utterance, if this differs from that of the enclosing utterance. The attribute must be supplied for a
635 TSBAVO attribute may be used to indicate that the vocal, kinesic, or incident is repeated, for example
641 TSBAVO , where what is being encoded is a shift in voice quality. For this last case, the
662 TSBAVO element of the TEI header.
694 TSBAVO The extent to which encoding of incidents or kinesics is included in a transcription will depend entirely on the purpose for which the transcription was made. As elsewhere, this will depend on the particular research agenda and the extent to which their presence is felt to be significant for the interpretation of spoken interactions.
698 TSBAWR Written text may also be encountered when speech is transcribed, for example in a television broadcast or cinema performance, or where one participant shows written text to another. The
700 TSBAWR element may be used to distinguish such written elements from the spoken text in which they are embedded.
702 TSBAWR For example, if speaker A in the breakfast table conversation in section
703 TSBAWR above had simply shown the newspaper passage to her interlocutor instead of reading it, the interaction might have been encoded as follows:
712 TSBAWR If the source of the writing being displayed is known, bibliographic information about it may be stored in a
716 TSBAWR element of the TEI header, and then pointed to using the
739 TSBATI As noted above, utterances, vocals, pauses, kinesics, incidents, and writing elements all inherit attributes providing information about their position in time from the classes
743 TSBATI . These attributes can be used to link parts of the transcription very exactly with points on a timeline, or simply to indicate their duration. Note that if
749 TSBATI elements whose temporal distance from each other is specified in a timeline, then
756 TSBATI ) may be used as an alternative means of aligning the start and end of timed elements, and is required when the temporal alignment involves points within an element.
764 TSSASH A common requirement in transcribing spoken language is to mark positions at which a variety of prosodic features change. Many paralinguistic features (pitch, prominence, loudness, etc.) characterize stretches of speech which are not co-extensive with utterances or any of the other units discussed so far. One simple method of encoding such units is to mark their boundaries. An empty element called
769 TSSASH element may appear within an utterance or a segment to mark a significant change in the particular feature defined by its attributes, which is then understood to apply to all subsequent utterances for the same speaker, unless changed by a new shift for the same feature in the same speaker. Intervening utterances by other speakers do not normally carry the same feature. For example:
779 TSSASH is spoken loudly, the words
791 TSSASH ); this list may be revised or supplemented using the methods outlined in section
796 TSSASH attribute specifies the new state of the feature following the shift. If this attribute has the special value
800 TSSASH A list of suggested values for each of the features proposed follows:
814 TSSASH l
825 TSSASH f
834 TSSASH p
860 TSSASH desc
888 TSSASH legato, every syllable receiving more or less equal stress
949 TSSASH A full definition of the sense of the values provided for each feature should be provided in the encoding description section of the text header (see section
965 TSSA This section describes the following features characteristic of spoken texts for which elements are defined elsewhere in these Guidelines:
967 TSSA segmentation below the utterance level
972 TSSA The elements discussed here are not provided by the module for spoken texts. Some of them are included in the core module and others are contained in the modules for linking and for analysis respectively. The selection of modules and their combination to define a TEI schema is discussed in section
977 TSSASE For some analytic purposes it may be desirable to subdivide the divisions of a spoken text into units smaller than the individual utterance or turn. Segmentation may be performed for a number of different purposes and in terms of a variety of speech phenomena. Common examples include units defined both prosodically (by intonation, pausing, etc.) and syntactically (clauses, phrases, etc.). The term
979 TSSASE has been used by a number of researchers to define units peculiar to speech transcripts.
980 TSSASE The term was apparently first proposed by
982 TSSASE A text can be analysed as a sequence of segments which are internally connected by a network of syntactic relations and externally delimited by the absence of such relations with respect to neighbouring segments. Such a segment is a syntactic unit called a macrosyntagm
992 TSSASE attribute to specify the kind of segmentation applicable to a particular segment, if more than one is possible in a text. A full definition of the segmentation scheme or schemes used should be provided in the
996 TSSASE element in the TEI header (see
999 TSSASE In the first example below, an utterance has been segmented according to a notion of syntactic completeness not necessarily marked by the speech, although in this case a pause has been recorded between the two sentence-like units. In the second, the segments are defined prosodically (an acute accent has been used to mark the position immediately following the syllable bearing the primary accent or stress), and may be thought of as
1017 TSSASE element in the header of the text should specify the principles adopted to define the segments marked in this way.
1022 TSSASE may be used, either as an alternative or in addition to the more general purpose
1059 TSSASE In this example, recoded from a corpus of language-impaired speech prepared by Fletcher and Garman, the speaker's utterance has been fully segmented into clausal (
1077 TSSASE has been used to define a particular characteristic of this corpus for which no element exists in the TEI scheme. See further chapter
1078 TSSASE for a discussion of the way in which this kind of user-defined extension of the TEI scheme may be performed and chapter
1081 TSSASE This example also uses the core elements
1088 TSSASE It is often the case that the desired segmentation does not respect utterance boundaries; for example, syntactic units may cross utterance boundaries. For a detailed discussion of this problem, and the various methods proposed by these Guidelines for handling it, see chapter
1091 TSSASE milestone
1094 TSSASE tag discussed in section
1097 TSSASE where several discontinuous segments are to be grouped together to form a syntactic unit (e.g. a phrasal verb with interposed complement), the
1104 TSSAPA A major difference between spoken and written texts is the importance of the temporal dimension to the former. As a very simple example, consider the following, first as it might be represented in a playscript:
1126 TSSAPA However, this does not allow us to indicate the extent to which Stig's utterance is overlapped, nor does it show that there are in fact three things which are synchronous: the end of Jane's utterance, Stig's whole utterance, and Lou's kinesic. To overcome these problems, more sophisticated techniques, employing the mechanisms for pointing and alignment discussed in detail in section
1127 TSSAPA , are needed. If the module for linking has been enabled (as described in section
1137 TSSAPA should be consulted. The rest of the present section, which should be read in conjunction with that more detailed discussion, presents a number of ways in which these mechanisms may be applied to the specific problem of representing temporal alignment, synchrony, or overlap in transcribing spoken texts.
1145 TSSAPA attribute associated with this anchor point specifies the identifiers of the other two elements which are to be synchronized with it: specifically, the second utterance (
1147 TSSAPA ) and the kinesic (k1). Note that one of these elements has content and the other is empty.
1149 TSSAPA This example demonstrates only a way of indicating a point within one utterance at which it can be synchronized with another utterance and a kinesic. For more complex kinds of alignment, involving possibly multiple synchronization points, an additional element is provided, known as a
1151 TSSAPA . This consists of a series of
1161 TSSAPA This timeline represents four points in time, named TS-P1, TS-P2, TS-P6, and TS-P3 (as with all attributes named
1163 TSSAPA in the TEI scheme, the names must be unique within the document but have no other significance). TS-P1 is located absolutely, at 12:20:01:01 BST. TS-P2 is 4.5 seconds later than TS-P1 (i.e. at approximately 12:20:05.5). TS-P6 is at some unspecified time later than TS-P2 and previous to TS-P3 (this is implied by its position within the timeline, as no attribute values have been specified for it). The fourth point, TS-P3, is 1.5 seconds later than TS-P6.
1165 TSSAPA One or more such timelines may be specified within a spoken text, to suit the encoder's convenience. If more than one is supplied, the
1177 TSSAPA elements in a timeline are a fixed distance apart.
1179 TSSAPA Three methods are available for aligning points or elements within a spoken text with the points in time defined by the
1185 TSSAPA element as the value of one of the
1207 TSSAPA For example, using the timeline given above:
1269 TSSAPA Such conventions have the drawback that they are hard to generalize or to extend beyond the very simple case presented here. Their reliance on the accidentals of physical layout may also make them difficult to transport and to process computationally. These Guidelines recommend the following mechanisms to encode this.
1297 TSSAPA (Note that if only the ordering or sequencing of utterances is needed, then specific timing information shown here in
1326 TSSAPA To avoid deciding whether to point from the timeline to the text or vice versa, a
1377 TSREG When speech is transcribed using ordinary orthographic notation, as is customary, some compromise must be made between the sounds produced and conventional orthography. Particularly when dealing with informal, dialectal, or other varieties of language, the transcriber will frequently have to decide whether a particular sound is to be treated as a distinct vocabulary item or not. For example, while in a given project
1379 TSREG may not be worth distinguishing as a vocabulary item from
1389 TSREG One rule of thumb might be to allow such variation only where a generally accepted orthographic form exists, for example, in published dictionaries of the language register being encoded; this has the disadvantage that such dictionaries may not exist. Another is to maintain a controlled (but extensible) set of normalized forms for all such words; this has the advantage of enforcing some degree of consistency among different transcribers. Occasionally, as for example when transcribing abbreviations or acronyms, it may be felt necessary to depart from conventional spelling to distinguish between cases where the abbreviation is spelled out letter by letter (e.g.
1397 TSREG ). Similar considerations might apply to pronunciation of foreign words (e.g.
1403 TSREG In general, use of punctuation, capitalization, etc., in spoken transcripts should be carefully controlled. It is important to distinguish the transcriber's intuition as to what the punctuation should be from the marking of prosodic features such as pausing, intonation, etc.
1411 TSTPPR In the absence of conventional punctuation, the marking of prosodic features assumes paramount importance, since these structure and organize the spoken message. Indeed, such prosodic features as points of primary or secondary stress may be represented by specialized punctuation marks, or other characters such as those provided by the Unicode Spacing Modifier Letters block. Pauses have already been dealt with in section
1412 TSTPPR ; while tone units (or intonational phrases) can be indicated by the segmentation tag discussed in section
1418 TSTPPR In a more detailed phonological transcript, it is common practice to include a number of conventional signs to mark prosodic features of the surrounding or (more usually) preceding speech. Such signs may be used to record, for example, particular intonation patterns, truncation, vowel quality (long or short) etc. These signs may be preserved in a transcript either by using conventional punctuation or by marking their presence by
1426 TSTPPR of the TEI header
1441 TSTPPR These declarations might additionally provide information about how the characters concerned should be rendered, their equivalent IPA form, etc. In the transcript itself references to them can then be included as follows:
1493 TSTPPR This example, which is taken from a corpus of bookshop service encounters,
1499 TSTPPR . Where words are so unclear that only their extent can be recorded, the empty
1506 TSTPPR For more detailed work, involving a detailed phonological transcript including representation of stress and pitch patterns, it is probably best to maintain the prosodic description in parallel with the conventional written transcript, rather than attempt to embed detailed prosodic information within it. The two parallel streams may be aligned with each other and with other streams, for example an acoustic encoding, using the general alignment mechanisms discussed in section
1515 TSTPSM above), or to transcribe them using IPA or some other transcription system. To simplify analysis of the lexical features of a speech transcript, it may be felt useful to
1518 TSTPSM , to make explicit the extent of regularization or normalization performed by the transcriber.
1544 TSTPSM element may be used to indicate both the original and a corrected form of it:
1554 TSTPSM , where a speaker switches from one language to another, may easily be represented in a transcript by using the
1556 TSTPSM element provided by the core tagset:
1571 TSTPAC The recommendations made here only concern the establishment of a basic text. Where a more sophisticated analysis is needed, more sophisticated methods of markup will also be appropriate, for example, using stand-off markup to indicate multiple segmentation of the stream of discourse, or complex alignment of several segments within it. Where additional annotations (sometimes called
1575 TSTPAC ) are used to represent such features as linguistic word class (noun, verb, etc.), type of speech act (imperative, concessive, etc.), or information status (theme/rheme, given/new, active/semi-active/new), etc., a selection from the general purpose analytic tools discussed in chapters
1597 TS The selection and combination of modules to form a TEI schema is described in
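
The timeline arithmetic described in row 1163 above can be made concrete with a small script. What follows is only a sketch under stated assumptions: the sample markup is an illustrative reconstruction of a four-point timeline (it is not quoted from this report), the function name resolve_offsets is hypothetical, and every interval attribute is assumed to hold a plain number of seconds. Each point's offset from the origin is computed by following its since pointer; points that carry no timing attributes are left unresolved, since only their order is known.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative sample only (an assumption, not text taken from the report).
SAMPLE = """
<timeline xmlns="http://www.tei-c.org/ns/1.0" unit="s" origin="#TS-P1">
  <when xml:id="TS-P1" absolute="12:20:01:01"/>
  <when xml:id="TS-P2" interval="4.5" since="#TS-P1"/>
  <when xml:id="TS-P6"/>
  <when xml:id="TS-P3" interval="1.5" since="#TS-P6"/>
</timeline>
"""


def resolve_offsets(timeline_xml):
    """Return {xml:id: offset from the origin in seconds, or None if unknown}."""
    root = ET.fromstring(timeline_xml.strip())
    origin = root.get("origin", "").lstrip("#")
    whens = root.findall(f"{{{TEI_NS}}}when")

    # The origin sits at offset 0.0; every other point starts out unknown.
    offsets = {}
    for when in whens:
        ident = when.get(XML_ID)
        offsets[ident] = 0.0 if ident == origin else None

    # Repeatedly resolve points whose 'since' target already has an offset.
    changed = True
    while changed:
        changed = False
        for when in whens:
            ident = when.get(XML_ID)
            since = when.get("since", "").lstrip("#")
            interval = when.get("interval")
            if offsets[ident] is not None or interval is None:
                continue
            if offsets.get(since) is None:
                continue
            try:
                offsets[ident] = offsets[since] + float(interval)
            except ValueError:
                continue  # non-numeric interval: leave the point unresolved
            changed = True
    return offsets


if __name__ == "__main__":
    for ident, offset in resolve_offsets(SAMPLE).items():
        print(ident, offset)
```

For the sample, TS-P1 resolves to 0.0 and TS-P2 to 4.5, while TS-P6 and TS-P3 stay unresolved: TS-P3 is defined relative to TS-P6, whose position is known only by its order in the timeline.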

FT-TablesFormulaeGraphics.xml#12973

# id text
5 FT In addition to graphic images, documents often contain material presented in graphical or tabular format. In such materials, details of layout and presentation may also be of comparatively greater significance or complexity than they are for running text. Indeed, it may often be difficult to make a clear distinction between details relating purely to the rendition of information and those relating to the information itself.
13 FT As with text markup in general, many incompatible formats have been proposed for the representation of graphics, formulæ, and tables in electronic form. Unfortunately, no single format as effective as XML in the domain of text has yet emerged for their interchange, to some extent because of the difficulty of representing the information these data formats convey independently of the way it is rendered.
15 FT The module described in this chapter defines special purpose
20 FT . Specific recommendations for the encoding of graphic figures may be found in section
21 FT . The rest of the chapter is devoted to general problems of encoding graphic information.
23 FT There is at the time of writing no consensus on formats for graphical images, and such formats vary in many ways. We therefore provide (in section
25 FT ) a list of formal names for those representations most popular at this time. Each one includes a very brief description. These Guidelines recommend a few particular representations as being the most widely supported and understood.
29 FTTAB A table is the least
30 FTTAB graphic
31 FTTAB of the elements discussed in this chapter. Almost any text structure can be presented as a series of rows and columns: one might, for example, choose to show a glossary or other form of list in tabular form, without necessarily regarding it as a table. In such cases, the global
33 FTTAB attribute is an appropriate way of indicating that some element is being presented in tabular format, for example by using an appropriate display property in CSS. When tabular presentation is regarded as of less intrinsic importance, it is correspondingly simpler to encode descriptive or functional information about the contents of the table, for example to identify one cell as containing a name and another as containing a date, though the two methods may be combined.
35 FTTAB When, however, particular elements are required to encode the tabular arrangement itself, then one or other of the various
36 FTTAB table schemas
37 FTTAB now available may be preferable. The schemas in common use generally view a table as a special text element, made up of row elements, themselves composed of cells.
38 FTTAB Table cells generally appear in row-major order, with the first row from left to right, then the second row, and so on. Details of appearance such as column widths, border lines, and alignment are generally encoded by numerous attributes. Beyond this, however, such schemas differ greatly. This section begins by describing a table schema of this kind; a brief summary of some other widely available table schemas is also provided in section
41 FTTAB1 TEI Tables
43 FTTAB1 For encoding tables of low to moderate complexity, these Guidelines provide the following special purpose elements:
52 FTTAB1 It is to a large extent arbitrary whether a table should be regarded as a series of rows or as a series of columns. For compatibility with currently available systems, however, these Guidelines require a row-by-row description of a table. It is also possible to describe a table simply as a series of cells; this may be useful for tabular material which is not presented as a simple matrix.
58 FTTAB1 may be used to indicate the size of a table, or to indicate that a particular cell or row of a table spans more than one row or column. For both tables and cells, rows and columns are always given in top-to-bottom, left-to-right order, although formatting properties such as those provided by CSS may be used to specify that they should be displayed differently. These Guidelines do not require that the size of a table be specified; for most formatting and many other applications, it will be necessary to process the whole table in two passes in any case.
60 FTTAB1 Where cells span more than one column or row, the encoder must determine whether this is a purely presentational effect (in which case the
62 FTTAB1 attribute may be more appropriate), whether the part of the table affected would be better treated as a nested table, or whether to use the spanning attributes listed above.
66 FTTAB1 attribute may be used to categorize a single cell, or set a default for all the cells in a given row. The present Guidelines distinguish the roles of
67 FTTAB1 label
73 FTTAB1 numeric
85 FTTAB1 The following simple example demonstrates how the data presented as a labelled list in section
128 FTTAB1 The following example demonstrates how a simple statistical table may be represented using this scheme:
184 FTTAB1 Note the use of a blank cell in the first row to ensure that the column labels are correctly aligned with the data. Again, this encoding does not explicitly represent the alignment between column and row labels and the data to which they apply. Where the primary emphasis of an encoding is on the semantic content of a table, a more explicit mechanism for the representation of structured information such as that provided by the feature structure mechanism described in chapter
185 FTTAB1 may be preferred. Alternatively, the general purpose linkage and alignment mechanisms described in chapter
188 FTTAB1 The content of a table cell need not be simply character data. It may also contain any sequence of the phrase-level elements described in chapter
189 FTTAB1 , thus allowing for the encoding of potentially more useful semantic information, as in the following example, where the fact that one cell contains a number and the other contains a place name has been explicitly recorded:
255 FTTAB1 The content of table elements is not limited to
269 FTTAB1 provide options for including text which is clearly part of the table, but outside the actual tabular layout. This example shows the use of
308 FTTAB2 Many authoring systems include built-in support for their own or for public table schemas. These provide an enhanced user interface and good formatting capabilities, but are often product-specific, despite their use of an XML markup language.
310 FTTAB2 The DTD developed by the Association of American Publishers (AAP) and standardized in ANSI Z39.59 provided a very simple encoding for correspondingly simple tables. This has been further developed, together with the table DTD documented in ISO Technical Report 9537, and now forms part of ISO 12083. The TEI table model described above has functionality very similar to that defined by ISO 12083.
312 FTTAB2 For more complex tables, the most effective publicly-available DTD is probably that developed by the US Department of Defense CALS project. This supports vertical and horizontal spanning and various kinds of text rotation and justification within cells and is also directly supported by a number of existing XML software systems.
314 FTTAB2 The CALS table model is much too complex to describe fully here; for historical background see
316 FTTAB2 . As with any other XML vocabulary, the XML version of the CALS model may readily be included in a TEI schema, using the techniques described in
321 FTTAB2 The XHTML table model (
322 FTTAB2 ) is based on the HTML table model (
323 FTTAB2 ). Both models support arrangement of arbitrary data into rows and columns of cells. Table rows and columns may be grouped to convey additional structural information and may be rendered by user agents in ways that emphasize this structure. Support for incremental rendering of tables and for rendering on
327 FTTAB2 ). Stylesheets provide a far more effective means of controlling layout and other visual characteristics in both HTML and XML documents.
332 FTFOR Mathematical and chemical formulæ pose problems similar to those posed by tables in that rendition may be of great significance and hard to disentangle from content. They also require access to a wide range of special characters, for most of which standard entity names already exist in the documented ISO entity sets (see further chapters
338 FTFOR The AAP and ISO standards mentioned in section
339 FTFOR above both provide DTDs for equations as well as for tables, which now form part of ISO 12083. The European Mathematical Trust, an organization set up specifically to enhance research support for European mathematicians, has also defined a general purpose mathematical DTD known as EuroMath (
342 FTFOR Most if not all of the functionality provided by these DTDs can now be found in the OpenMath and MathML XML-based systems briefly described below.
344 FTFOR As with tables, in all the XML solutions a tension exists between the need to encode the way a formula is written (its appearance) and the need to represent its semantics. If the object of the encoding is purely to act as an interchange format among different formatting programs, then there is no need to represent the mathematical meaning of an expression. If however the object is to use the encoding as input to an algebraic manipulation system (such as Mathematica or Maple) or a database system, clearly simply representing superscripts and subscripts will be inadequate.
346 FTFOR The present Guidelines make no attempt to add to the number of available DTDs for representing formulæ. Instead, we recommend that the user make an informed choice from those already available. The module described in this chapter makes available only the following element, which should be used to encode any formula, no matter what notation is employed:
357 FTFOR must be escaped with entity references or numeric character references, e.g.
361 FTFOR If desired, the content of the
366 FTFOR When the content of a
377 FTFOR attribute supplies the name of a notation (
389 FTFOR structure of an expression. Most of its content elements correspond with the range of operators, relations, and named functions typically found at the high-school level of mathematics. The tortoise example given above in TeX can be re-expressed in MathML as
443 FTFOR MathML 2.0 provides support for a
463 FTFOR Encodings, both binary (
467 FTFOR OpenMath and MathML have certain common aspects. They both use prefix operators, both are XML-based and they both construct their objects by applying certain rules recursively. Such similarities facilitate mapping between the two standards. There are also some key differences between MathML and OpenMath. OpenMath does not provide support for presentation of mathematical objects and its scope of semantically-oriented elements is much broader than that of MathML, with the expressive power to cover virtually all areas of computational mathematics. In fact, a particular set of Content Dictionaries, the
472 FTFOR ) is an extension of the OpenMath standard that supplies markup for structures such as axioms, theorems, proofs, definitions, texts (mixing formal content with mathematical text).
474 FTFOR In-line versus block placement for an equation can be distinguished if desired, via the global
480 FTFOR attributes may also be used to label or identify the formula, as in the following example:
525 FTNM Music, like many other art forms, is often mentioned, discussed and described in writings of various kinds. This applies to both historical and contemporary documents, even though methods of notating music have changed considerably in western history. In most cases, music notation enters the text flow in a way similar to figures, images or graphs. On other occasions, elements of music notation are treated as inline characters in running text.
528 FTNM provides a way to signal the presence of music notation in text, but defers to other representations, which are not covered by the TEI guidelines, to describe the music notation itself. In fact, several commercial, academic and standards bodies have developed digital representations of music notation, and given the topic's complexity, these representations often focus on different aspects and adopt different methodologies. Therefore,
530 FTNM only defines a container element to encode the occurrence of music notation and allows linking to the data format preferred by the encoder. (Note:
553 FTNM can be used to indicate the location of a representation of the music notation.
556 FTNM supplies the MIME type of the data format, when available.
566 FTNM can be used to indicate the location of a graphical representation of the music notation.
570 FTNM provides encoded binary data which constitutes another representation of the music notation (e.g. audio).
581 FTNM supplies the MIME type of the data format when available. For example:
597 FTNM It is possible to link to any kind of music notation data format. However, when a MIME type is not available, it is recommended that the format be specified in the description. See the following examples.
620 FTNM It is possible to specify the location of digital objects representing the notated music in other media such as images or audio-visual files. The interpretation of the correspondence between the notated music and these digital objects is not encoded explicitly. We recommend the use of
624 FTNM mainly as a fallback mechanism when the notated music format is not displayable by the application using the encoding. The alignment of encoded notated music, images carrying the notation, and audio files is a complex matter for which we refer the reader to other formats and specifications such as
634 FTNM In modern printing, music notation positioned between blocks of text for illustrative purposes is usually referred to as a
635 FTNM figure
674 FTGRA The following special purpose elements are used to indicate the presence of graphic images within a document:
685 FTGRA elements form part of the common core module, and are discussed in section
694 FTGRA attribute provides the location of an image. For example:
696 FTGRA Three kinds of content may be supplied inside a
700 FTGRA may be used to transcribe (or supply) a descriptive heading or title for the graphic itself as in this example:
703 FTGRA Figures are often accompanied not only by a title or heading (a caption), but by a paragraph or so of commentary (a legend) following the caption. One or more
708 FTGRA may be used to transcribe any commentary on the figure in the source:
718 FTGRA Here, the figure contains a heading
722 FTGRA . Both of these are transcribed from the source, while the description is provided by the encoder, for use by applications which cannot display the graphic directly. In documents created in electronic form with the needs of print-handicapped readers in mind, the
724 FTGRA element may be provided by the author rather than a subsequent encoder.
731 FTGRA Where the graphic itself contains large amounts of text, perhaps with a complex structure, and perhaps difficult to distinguish from the graphic, the encoder should choose whether to regard the graphic as containing the text (in which case, a nested
735 FTGRA element) or to regard the enclosed text as being a separate division of the
737 FTGRA element in which the graphic appears. In this latter case, an appropriate
741 FTGRA (etc.) element may be used for the text represented within the graphic, and the
743 FTGRA element embedded within it. The choice will depend to a large degree on the encoder's understanding of the relationship between the graphic and the surrounding text.
745 FTGRA A figure which is internally divided, or contains sub-figures, may be encoded with nested
766 FTGRA Like any other element in the TEI scheme, figures may be given identifiers so that they can be aligned with other elements, and linked to or from them, as described in chapter
771 FTGRA version which, when selected by the user, causes the other, high resolution, version to be accessed. In TEI terms, the thumbnail image acts as a
773 FTGRA to the other. Supposing that a thumbnail version of the figure discussed above is available as
786 FTGRA . When the module for transcription is included in a schema, specific attributes for parts of a text and parts (or all) of a digital image are available; these are discussed in
792 FTGRA with chapter two of some text, and another portion of it with chapter three. The application may be thought of as a hypertext browser in which the user selects from a graphic image which part of a text to read next, but the mechanism is independent of this particular application.
794 FTGRA The first requirement is some way of identifying and hence pointing to sub-parts of a graphic image. This may be done by pointing into an XML graphic representation, for example an SVG file. Thus
815 FTGRA The next requirement is some way of identifying the parts of the document to which a link is to be made. The most obvious way of doing this is to use the global
824 FTGRA Now, all that is needed to link these areas to the relevant chapters is a
833 FTGRA In this example, the SVG representation of the graphic is stored externally to the TEI document and linked by means of a pointer. It is also possible to embed the SVG representation directly within the TEI by extending the content model of the
837 FTGRA from the SVG namespace. Like other customizations of the TEI scheme, this is carried out using the techniques documented in section
848 FTGROV The first major distinction in graphic representation is that between raster graphics and vector graphics. A
850 FTGROV is a list of points, or dots. Scanners, fax machines and other simple devices easily produce digital raster images, and such images are therefore quite common. A
852 FTGROV , in contrast, is a list of geometrical objects, such as lines, circles, arcs, or even cubes. These are much more difficult to produce, and so are mainly encountered as the output of sophisticated systems such as architectural and engineering CAD programs.
854 FTGROV Raster images are difficult to modify because by definition they only encode single points: a line, for example, cannot grow or shrink as such, since it is not identified as such. Only its component parts are identified, and only they can be manipulated. Therefore the resolution or dot-size of a raster image is important, which is not the case with vector images. It is also far more difficult to convert raster images to vector images than to perform the opposite conversion. Raster images generally require more storage space than vector images, and a wide variety of methods exists for compressing them; the variation in these methods leads to corresponding variations in representations for storage and transmission of raster images.
856 FTGROV Motion video usually consists of a long series of raster images. Data compression is even more effective on video than on single raster images (mainly owing to redundancy which arises from the usual similarity of adjacent frames). Notations for representing full-motion video are hotly debated at this time, and any user of these Guidelines would do well to obtain up-to-date expert advice before undertaking a project using them.
864 FTGROV save space by discarding a small portion of the image's detail, such as fine distinctions of shading. When decompressed, therefore, such an image will be only a close approximation of the original. In contrast,
866 FTGROV guarantees that the exact uncompressed image will be reproducible from the compressed form: only truly redundant information is removed. In general, therefore, lossless compression does not save quite so much space as lossy compression, though it does guarantee fidelity to the original uncompressed image.
870 FTGROV , which is the number of dots per inch used to represent the image. Doubling the resolution will give a more precise image, but also quadruple the storage requirement (before compression), and affect processing time for any operations to be performed, such as displaying an image for a reader. Motion video also has resolution in time: the number of frames to be shown per second. Encoders should consider carefully what resolution(s) and frame rate(s) to use for particular applications; these Guidelines express no recommendation in this matter, save the universal ones of consistency and documentation.
872 FTGROV Within any image, it is typical to refer to locations via Cartesian coordinate axes: values for x, y, and sometimes z and/or time. However, graphic notations vary in whether coordinates count from left-to-right and top-to-bottom, or another way. They also vary in whether coordinates are considered real (inches, millimeters, and so on), or virtual (dots). These Guidelines do not recommend any of these methods over another, but all decisions made should be applied consistently, and documented in the
874 FTGROV section of the TEI header.
875 FTGROV Since no special purpose element is provided for this purpose by the current version of the Guidelines, such information should be provided as one or more distinct paragraphs at the end of the
880 FTGROV Methods of aligning images and text are discussed in
885 FTGROV images, each point is rendered in some shade of gray, the number of shades varying from system to system. In true polychrome images, points are rendered in different hues, again with varying limitations affecting the number of distinct shades and the means by which they are displayed.
889 FTGRNO As noted above, there exists a wide variety of different graphics formats, and the following list is in no way exhaustive. Moreover, inclusion of any format in this list should not be taken as indicating endorsement by the TEI of this format or any products associated with it. Some of the formats listed here are proprietary to a greater or lesser extent and cannot therefore be regarded as standards in any meaningful sense. They are however widely used by many different vendors.
920 FTGRNO Brief descriptions of all the above are given below. Where possible, current addresses or other contact information are shown for the originator of each format. Many formal standards, especially those promulgated by ISO and many related national organizations (ANSI, DIN, BSI, and many more), are available from those national organizations. Addresses may be found in any standard organizational directory for the country in question.
930 FTGRAVGF SVG is a language for describing two-dimensional vector and mixed vector or raster graphics in XML. It is defined by the Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, 04 September 2001, and is available at
946 FTGRARGF Currently the most widely supported raster image format, especially for black and white images, TIFF is also one of the few formats commonly supported on more than one operating system. The drawback to TIFF is that it actually is a wrapper for several formats, and some TIFF-supporting software does not support all variants. TIFF files may use LZW, CCITT Group 4, or PackBits compression methods, or may use no compression at all. Also, TIFF files may be monochrome, grayscale, or polychromatic. All such options should be specified in prose at the end of the
948 FTGRARGF section of the TEI header for any document including TIFF images. TIFF is owned by Aldus Corporation. Documentation on TIFF is available from them at Craigcook Castle, Craigcook Road, Edinburgh EH4 3UH, Scotland, or 411 First Avenue South, Seattle, Washington 98104 USA.
954 FTGRARGF PBM files are easy to process, eschewing all compression in favor of transparency of file format. PBM files can, of course, be compressed by generic file-compression tools for storage and transfer. Public domain software exists which will convert many other formats to and from PBM. Documentation on PBM is copyright by Jeff Poskanzer, and is available widely on the Internet.
970 FTGRAMPEG This standard is sponsored by CCITT and by ISO. It is ISO/IEC Draft International Standard 10918-1, and CCITT T.81. It handles monochrome and polychromatic images with a variety of compression techniques. JPEG per se, like CCITT Group IV, must be encapsulated before transmission; this can be done via TIFF, or via the JPEG File Interchange Format (JFIF), as commonly done for Internet delivery.
982 FTGRAMPEG SMIL is a W3C Recommendation which supports the integration of independent multimedia objects into a synchronized multimedia presentation. It provides multimedia authors with easily-defined basic timing relationships, fine-tuned synchronization, spatial layout, direct inclusion of non-text and non-image media objects, hyperlink support for time-based media, and adaptiveness to varying user and system characteristics. SMIL 1.0 (
983 FTGRAMPEG ) became a W3C Recommendation on June 15, 1998, and was further developed in SMIL 2.0. SMIL 2.0 adds native support for transitions, animation, event-based interaction, extended layout facilities, and more sophisticated timing and synchronization primitives to the SMIL 1.0 language. It also allows reuse of SMIL syntax and semantics in other XML-based languages, in particular those which need to represent timing and synchronization. For example, SMIL 2.0 components are used for integrating timing into XHTML Document Types and into SVG. SMIL 2.0 also provides recommendations for Document Types based on SMIL 2.0 Modules (
985 FTGRAMPEG ). It contains support for all of the major SMIL 2.0 features including animation, content control, layout, linking, media object, meta-information, structure, timing, and transition effects and is designed for Web clients that support direct playback from SMIL 2.0 markup. SMIL 2.0 (
986 FTGRAMPEG ) became a W3C Recommendation on August 7, 2001, becoming the first vocabulary to provide XML Schema support and to have reached such status.
997 figures Tables, formulæ, notated music, and figures
1009 FT The selection and combination of modules to form a TEI schema is described in
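
Row 870 (FTGROV) above says that doubling the resolution quadruples the storage requirement before compression. The following is only a sketch of that arithmetic, not anything taken from the Guidelines: the function name raster_bytes, the 8 x 10 inch page, and the default of 1 bit per dot (a black and white scan) are all assumptions made for illustration.

```python
def raster_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a raster image.

    Hypothetical helper, not part of the TEI Guidelines or of any library.
    """
    dots_wide = round(width_in * dpi)    # dots across the image
    dots_high = round(height_in * dpi)   # dots down the image
    total_bits = dots_wide * dots_high * bits_per_pixel
    return (total_bits + 7) // 8         # round up to whole bytes


low = raster_bytes(8, 10, 300)    # 8 x 10 inch page at 300 dots per inch
high = raster_bytes(8, 10, 600)   # the same page at twice the resolution
print(low, high, high / low)
```

The dot count grows in both dimensions at once, which is why the ratio between the two sizes is 4 rather than 2. Compression, colour depth and any container overhead are deliberately left out.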

AB-About.xml#12945

# id text
7 AB They make recommendations about suitable ways of representing those features of textual resources which need to be identified explicitly in order to facilitate processing by computer programs. In particular, they specify a set of markers (or
9 AB ) which may be inserted in the electronic representation of the text, in order to mark the text structure and other features of interest. Many, or most, computer programs depend on the presence of such explicit markers for their functionality, since without them a digitized text appears to be nothing but a sequence of undifferentiated bits. The success of the World Wide Web, for example, is partly a consequence of its use of such markup to indicate such features as headings and lists on individual pages, and to indicate links between pages. The process of inserting such explicit markers for implicit textual features is often called
13 AB ; the term
15 AB is also used informally. We use the term
18 AB markup language
19 AB to denote the complete set of rules associated with the use of markup in a given context; we use the term
21 AB for the specific set of markers or named distinctions employed by a given encoding scheme. Thus, this work both describes the TEI encoding scheme, and documents the TEI markup vocabulary.
23 AB The TEI encoding scheme is of particular usefulness in facilitating the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software. Since they contain an inventory of the features most often deployed for computer-based text processing, the Guidelines are also useful as a starting point for those designing new systems and creating new materials, even where interchange of information is not a primary objective.
25 AB These Guidelines apply to texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content. They treat both continuous materials (
26 AB running text
27 AB ) and discontinuous materials such as dictionaries and linguistic corpora. Though principally directed to the needs of the scholarly research community, the Guidelines are not restricted to esoteric academic applications. They are also useful for librarians maintaining and documenting electronic materials, and for publishers and others creating or distributing electronic texts. Although they focus on problems of representing in electronic form texts which already exist in traditional media, these Guidelines are also applicable to textual material which is
31 AB The rules and recommendations made in these Guidelines are expressed in terms of what is currently the most widely-used markup language for digital resources of all kinds: the Extensible Markup Language (XML), as defined by the World Wide Web Consortium's XML Recommendation. However, the TEI encoding scheme itself does not depend on this language; it was originally formulated in terms of SGML (the ISO Standard Generalized Markup Language), a predecessor of XML, and may in future years be re-expressed in other ways as the field of markup develops and matures. For more information on markup languages see chapter
35 AB This document provides the authoritative and complete statement of the requirements and usage of the TEI encoding scheme. As such, although it includes numerous small examples, it must be stressed that this work is intended to be a reference manual rather than a tutorial guide.
37 AB The remainder of this chapter comprises three sections. The first gives an overview of the structure and notational conventions used throughout these Guidelines. The second enumerates the design principles underlying the TEI scheme and the application environments in which it may be found useful. Finally, the third section gives a brief account of the origins and development of the Text Encoding Initiative itself.
41 ABSTRUNC The remaining two sections of the front matter to the Guidelines provide background tutorial material for those unfamiliar with basic markup technologies. Following the present introductory section, we present a detailed introduction to XML itself, intended to cover in a relatively painless manner as much as the novice user of the TEI scheme needs to know about markup languages in general and XML in particular. This is followed by a discussion of the general principles underlying current practice in the representation of different languages and writing systems in digital form. This chapter is largely intended for the user unfamiliar with the Unicode encoding systems, though the expert may also find its historical overview of interest.
43 ABSTRUNC The body of this edition of the Guidelines proper contains 23 chapters arranged in increasing order of specialist interest. The first five chapters discuss in depth matters likely to be of importance to anyone intending to apply the TEI scheme to virtually any kind of text. The next seven focus on particular kinds of text: verse, drama, spoken text, dictionaries, and manuscript materials. The next nine chapters deal with a wide range of topics, one or more of which are likely to be of interest in specialist applications of various kinds. The last two chapters deal with the XML encoding used to represent the TEI scheme itself, and provide technical information about its implementation. The last chapter also defines the notion of TEI conformance and its implications for interchange of materials produced according to these Guidelines.
45 ABSTRUNC As noted above, this is a reference work, and is not intended to be read through from beginning to end. However, the reader wishing to understand the full potential of the TEI scheme will need a thorough grasp of the material covered by the first four chapters and the last two. Beyond that, the reader is recommended to select according to their specific interests: one of the strengths of the TEI architecture is its modular nature.
47 ABSTRUNC As far as possible, extensive cross referencing is provided wherever related topics are dealt with; these are particularly effective in the online version of the Guidelines. In addition, a series of technical appendixes provide detailed formal definitions for every element, every class, and every macro discussed in the body of the work; these are also cross linked as appropriate. Finally, a detailed bibliography is provided, which identifies the source of many examples cited in the text as well as documenting works referred to, and listing other relevant publications.
49 ABSTRUNC As an aid to the reader, most chapters of these Guidelines follow the same basic organization. The chapter begins with an overview of the subjects treated within it, linked to the following subsections. Within each section where new elements are described, a summary table is first given, which provides their names and a brief description of their intended usage. This is then followed where appropriate by further discussion of each element, including wherever possible usage examples taken somewhat eclectically from a variety of real sources. These examples are not intended to be exhaustive, but rather to suggest typical ways in which the elements concerned may usefully be applied. Where appropriate, a link to a statement of the source for most examples is provided in the online version. Within the examples, use of whitespace such as newlines or indentation is simply intended to aid legibility, and is not prescriptive or normative.
51 ABSTRUNC Wherever TEI elements or classes are mentioned in the text, they are linked in the online version to the relevant reference specification for the element or class concerned. Element names are always given in the form
54 ABSTRUNC name
61 ABSTRUNC include a closing slash to distinguish them wherever they are discussed. References to attributes take the form
65 ABSTRUNC is the name of the attribute. References to classes are also presented as links, for example
73 AB-namecon TEI Naming Conventions
75 AB-namecon These Guidelines use a more or less consistent set of conventions in the naming of XML elements and classes. This section summarizes those conventions.
80 AB-namecon An unadorned name such as
82 AB-namecon is the name of a TEI element or attribute.
83 AB-namecon During generation of TEI RelaxNG schema fragments, the patterns corresponding with these TEI names are given a prefix
84 AB-namecon tei
85 AB-namecon to allow them to co-exist with names from other XML namespaces. This prefix is not visible to the end user, and is not used in TEI documentation. When generating multi-namespace schemas, however, the user needs to be aware of it.
88 AB-namecon The following conventions apply to the choice of names:
94 AB-namecon Where an element name contains more than one token, the first letter of the second token, and of any subsequent ones, is capitalized, as in for example
104 AB-namecon The specification for an element or attribute whose name contains abbreviations generally also includes a
106 AB-namecon element providing the expanded sense of the name.
110 AB-namecon element; this is not however generally done in TEI P5.
116 AB-namecon att
120 AB-namecon bibl
126 AB-namecon category, especially as used in text classification
128 AB-namecon char
134 AB-namecon document: this usually refers to the original source document which is being encoded,
138 AB-namecon declaration: has a specific sense in the TEI Header, as discussed in
140 AB-namecon desc
142 AB-namecon description: has a specific sense in the TEI header, as discussed in
147 AB-namecon group. In TEI usage, a group is distinguished from a list in that the former associates several objects which act as a single entity, while the latter does not. For example, a
153 AB-namecon simply lists a number of otherwise unrelated
157 AB-namecon interp
159 AB-namecon interpretation or analysis
161 AB-namecon lang
162 AB-namecon (natural) language
167 AB-namecon org
169 AB-namecon organization, that is, a named group of people or legal entity
171 AB-namecon rdg
173 AB-namecon reading or version found in a specific witness
175 AB-namecon ref
176 AB-namecon reference or link
184 AB-namecon statement: used in a specific sense in the TEI header, as discussed in
188 AB-namecon structured: that is, containing a specific set of named elements rather than
189 AB-namecon mixed content
191 AB-namecon val
195 AB-namecon wit
207 AB-namecon is an additional name, not the name of an addition. Such inconsistencies are relatively few in number, and it is hoped to remove them in subsequent revisions of the Guidelines.
219 AB-namecon (division) etc. We do not specifically list such elements here: as noted above, an expansion of each such abbreviated name is provided within the documentation using the
240 ABSTRUNC att.global
244 ABSTRUNC model.biblPart
248 ABSTRUNC macro.paraContent
252 ABSTRUNC data.pointer
257 ABSTRUNC . Here we simply note some conventions about their naming.
261 ABSTRUNC Attribute class names take the form
265 ABSTRUNC is typically an adjective, or a series of adjectives separated by dots, describing a property common to the attributes which make up the class.
267 ABSTRUNC Attributes with the same name are considered to have the same semantics, whether the attribute is inherited from a class, or locally defined.
273 ABSTRUNC Model classes have names beginning
276 ABSTRUNC root name
279 ABSTRUNC A root name may be the name of an element, generally the prototypical parent or sibling for elements which are members of the class.
283 ABSTRUNC , if the class members are all children of the element named rootname; or
285 ABSTRUNC , if the class members are all siblings of the element named
291 ABSTRUNC is used to indicate that class members are permitted anywhere in a TEI document.
297 ABSTRUNC For example, the class of elements which can form part of a
301 ABSTRUNC . This class includes as a subclass the elements which can form part of a
303 ABSTRUNC in a spoken text, which is named
309 ABTEI2 Because of its roots in the humanities research community, the TEI scheme is driven by its original goal of serving the needs of research, and is therefore committed to providing a maximum of comprehensibility, flexibility, and extensibility. More specific design goals of the TEI have been that the Guidelines should:
315 ABTEI2 support the encoding of all kinds of features of all kinds of texts studied by researchers
317 ABTEI2 be application independent
318 ABTEI2 This has led to a number of important design decisions, such as:
320 ABTEI2 the choice of XML and Unicode
322 ABTEI2 the provision of a large predefined tag set
324 ABTEI2 encodings for different views of text
331 ABTEI2 The goal of creating a common interchange format which is application independent requires the definition of a specific markup syntax as well as the definition of a large set of elements or concepts. The syntax of the recommendations made in this document conforms to the World Wide Web Consortium's XML Recommendation (
334 ABTEI2 The goal of providing guidance for text encoding suggests that recommendations be made as to what textual features should be recorded in various situations. However, when selecting certain features for encoding in preference to others, these Guidelines have tended to prefer generic solutions to specific ones, and to avoid areas where no consensus exists, while attempting to accommodate as many diverse views as feasible. Consequently, the TEI Guidelines make (with relatively rare exceptions) no suggestions or restrictions as to the relative importance of textual features. The philosophy of the Guidelines is
335 ABTEI2 if you want to encode this feature, do it this way
338 ABTEI2 The requirement to support all kinds of materials likely to be of interest in research has largely conditioned the development of the TEI into a very flexible and modular system. The development of other XML vocabularies or standards is typically motivated by the desire to create a single fully specified encoding scheme for use in a well-defined application domain. By contrast, the TEI is intended for use in a large number of rather ill-defined and often overlapping domains. It achieves its generality by means of the modular architecture described in
341 ABTEI2 The Guidelines have been written largely with a focus on text capture (i.e. the representation in electronic form of an already existing copy text in another medium) rather than text creation (where no such copy text exists). Hence the frequent use of terms like
346 ABTEI2 copy text
347 ABTEI2 , etc. However, the Guidelines are equally applicable to text creation, although certain elements, such as
350 ABTEI2 the rendition indicators
353 ABTEI2 Concerning text capture, the TEI Guidelines do not specify a particular approach to the problem of fidelity to the source text and recoverability of the original; such a choice is the responsibility of the text encoder. The current version of these Guidelines, however, provides a more fully elaborated set of tags for markup of rhetorical, linguistic, and simple typographic characteristics of the text than for detailed markup of page layout or for fine distinctions among type fonts or manuscript hands. It should be noted also that, with the present version of the Guidelines, it is no longer necessarily the case that an unmediated version of the source text can be recovered from an encoded text simply by removing the markup.
362 ABTEI2 interpretation
363 ABTEI2 . These distinctions, though widely made and often useful in narrow, well-defined contexts, are perhaps best interpreted as distinctions between issues on which there is a scholarly consensus and issues where no such consensus exists. Such consensus has been, and no doubt will be, subject to change. The TEI Guidelines do not make suggestions or restrictions as to which of these features should be encoded. The use of the terms
367 ABTEI2 about different types of encoding in the Guidelines is not intended to support any particular view on these theoretical issues. Historically, it reflects a purely practical division of responsibility amongst the original working committees (see further
370 ABTEI2 In general, the accuracy and the reliability of the encoding and the appropriateness of the interpretation are for the individual user of the text to determine. The Guidelines provide a means of documenting the encoding in such a way that a user of the text can know the reasoning behind that encoding, and the general interpretive decisions on which it is based. The TEI header may be used to document and justify many such aspects of the encoding, but the choice of TEI elements for a particular feature is in itself a statement about the interpretation reached by the encoder.
372 ABTEI2 In many situations more than one view of a text is needed since no absolute recommendation to embody one specific view of text can apply to all texts and all approaches to them. Within limits, the syntax of XML ensures that some encodings can be ignored for some purposes. To enable encoding multiple views, these Guidelines not only treat a variety of textual features, but sometimes provide several alternative encodings for what appear to be identical textual phenomena. These Guidelines offer the possibility of encoding many different views of the text, simultaneously if necessary. Where different views of the formal structure of a text are required, as opposed to different annotations on a single structural view, however, the formal syntax of XML (which requires a single hierarchical view of text structure) poses some problems; recommendations concerning ways of overcoming or circumventing that restriction are discussed in chapter
375 ABTEI2 In brief, the TEI Guidelines define a general-purpose encoding scheme which makes it possible to encode different views of text, possibly intended for different applications, serving the majority of scholarly purposes of text studies in the humanities. Because no predefined encoding scheme can possibly serve all research purposes, the TEI scheme is designed to facilitate both selection from a wide range of predefined markup choices, and the addition of new (non-TEI) markup options. By providing a formally verifiable means of extending the TEI recommendations, the TEI makes it simple for such user-identified modifications to be incorporated into future releases of the Guidelines as they evolve. The underlying mechanisms which support these aspects of the scheme are introduced in chapter
383 ABAPP guidance for individual or local practice in text creation and data capture;
385 ABAPP support of data interchange;
387 ABAPP support of application-independent local processing.
388 ABAPP These three functions are so thoroughly interwoven in practice that it is hardly possible to address any one without addressing the others. However, the distinction provides a useful framework for discussing the possible role of the Guidelines in work with electronic texts.
394 ABAPP1 Problems specific to text creation or text
396 ABAPP1 have not been considered explicitly in this document. These Guidelines are not concerned with the process by which a digital text comes into being: it can be typed by hand, scanned from a printed book or typescript, read from a typesetter's tape, or acquired from another researcher who may have used another markup scheme (or no explicit markup at all).
400 ABAPP1 XML can appear distressingly verbose, particularly when (as in these Guidelines) the names of tags and attributes are chosen for clarity and not for brevity. Editor macros and keyboard shortcuts can allow a typist to enter frequently used tags with single keystrokes. It is often possible to transform word-processed or scanned text automatically. Markup-aware software can help with maintaining the hierarchical structure of the document, and display the document with visual formatting rather than raw tags.
403 ABAPP1 may be used to develop simpler data capture TEI-conformant schemas, for example with limited numbers of elements, or with shorter names for the tags being used most often. Documents created with such schemas may then be automatically converted to a more elaborated TEI form.
408 ABAPP2 The TEI format may simply be used as an interchange format, permitting projects to share resources even when their local encoding schemes differ. If there are
414 ABAPP2 such mappings are needed. However, for such translations to be carried out without loss of information, the interchange format chosen must be as expressive (in a formal sense) as any of the target formats; this is a further reason for the TEI's provision of both highly abstract or generic encodings and highly specific ones.
422 ABAPP2 creating a suitable set of mappings.
425 ABAPP2 For example, to translate from encoding scheme X into the TEI scheme:
427 ABAPP2 Make a list of all the textual features distinguished in X.
429 ABAPP2 Identify the corresponding feature in the TEI scheme. There are three possibilities for each feature:
431 ABAPP2 the feature exists in both X and the TEI scheme;
433 ABAPP2 X has a feature which is absent from the TEI scheme;
435 ABAPP2 X has a feature which corresponds with more than one feature in the TEI scheme.
436 ABAPP2 The first case is a trivial renaming. The second will require an extension to the TEI scheme, as described in chapter
437 ABAPP2 . The third is more problematic, but not impossible, provided that a consistent choice can be made (and documented) amongst the alternatives.
442 ABAPP2 Translating from the TEI into scheme X follows the same pattern, except that if a TEI feature has no equivalent in X, and X cannot be extended, information must be lost in translation.
447 ABAPP2 The TEI
448 ABAPP2 abstract model
449 ABAPP2 (that is, the set of categorical distinctions which it defines) must be respected. The correspondence between a tag X and the semantic function assigned to it by these Guidelines may not be changed; such changes are known as
450 ABAPP2 tag abuse
453 ABAPP2 A TEI document must be expressed as a valid XML-conformant document which uses the TEI namespace appropriately. If, for example, the document encodes features not provided by the Guidelines, such extensions may not be associated with the TEI namespace.
455 ABAPP2 It must be possible to validate a TEI document against a schema derived from these Guidelines, possibly with extensions provided in the recommended manner.
461 ABAPP3 Machine-readable text can be manipulated in many ways; some users:
465 ABAPP3 edit, display, and link texts in hypertext systems
475 ABAPP3 perform content analysis on texts
485 ABAPP3 scan verse texts metrically
487 ABAPP3 link text and images
490 ABAPP3 These applications cover a wide range of likely uses but are by no means exhaustive. The aim has been to make the TEI Guidelines useful for encoding the same texts for different purposes. We have avoided anything which would restrict the use of the text for other applications. We have also tried not to omit anything essential to any single application.
492 ABAPP3 Because the TEI format is expressed using XML, almost any modern text processing system is able to process it, and new TEI-aware software systems are able to build on a solid base of existing software libraries.
497 ABTEI The Text Encoding Initiative grew out of a planning conference sponsored by the Association for Computers and the Humanities (ACH) and funded by the U.S. National Endowment for the Humanities (NEH), which was held at Vassar College in November 1987. At this conference some thirty representatives of text archives, scholarly societies, and research projects met to discuss the feasibility of a standard encoding scheme and to make recommendations for its scope, structure, content, and drafting. During the conference, the Association for Computational Linguistics and the Association for Literary and Linguistic Computing agreed to join ACH as sponsors of a project to develop the Guidelines. The outcome of the conference was a set of principles (the
504 ABTEI The Text Encoding Initiative project began in June 1988 with funding from the NEH, soon followed by further funding from the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. Four working committees, composed of distinguished scholars and researchers from both Europe and North America, were named to deal with problems of text documentation,
505 ABTEI text representation, text analysis and interpretation,
515 ABTEI ) of the Guidelines was distributed in July 1990 under the title
518 ABTEI Extensive public comment and further work on areas not covered in this version resulted in the drafting of a revised version, TEI P2, distribution of which began in April 1992. This version included substantial amounts of new material, resulting from work carried out by several specialist working groups, set up in 1990 and 1991 to propose extensions and revisions to the text of P1. The overall organization, both of the draft itself and of the scheme it describes, was entirely revised and reorganized in response to public comment on the first draft.
520 ABTEI In June 1993 an Advisory Board met to review the current state of the TEI Guidelines, and recommended the formal publication of the work done to that time. That version of the TEI Guidelines, TEI P3, consolidated the work published as parts of TEI P2, along with some additional new material, and was finally published in May of 1994 without the label
525 ABTEI XML was originally developed as a way of publishing on the World Wide Web richly encoded documents such as those for which the TEI was designed. Several TEI participants contributed heavily to the development of XML, most notably XML's senior co-editor C. M. Sperberg-McQueen, who served as the North American editor for the TEI Guidelines from their inception until 1999.
526 ABTEI Following the rapid take-up of this new standard metalanguage, it became evident that the TEI Guidelines (which had been published originally as an SGML application) needed to be re-expressed in this new formalism if they were to survive. The TEI editors, with abundant assistance from others who had developed and used TEI, developed an update plan, and made tentative decisions on relevant syntactic issues.
528 ABTEI In January of 1999, the University of Virginia and the University of Bergen formally proposed the creation of an international membership organization, to be known as the TEI Consortium, which would maintain, develop, and promote the TEI. Shortly thereafter, two further institutions with longstanding ties to the TEI (Brown University and Oxford University) joined them in formulating an Agreement to Establish a Consortium for the Maintenance of the Text Encoding Initiative (
529 ABTEI ), on which basis the TEI Consortium was eventually established and incorporated as a not-for-profit legal entity at the end of the year 2000. The first members of the new TEI Board took office during January of 2001.
531 ABTEI The TEI Consortium was established in order to maintain a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, non-profit organization. In addition, the TEI Consortium was intended to foster a broad-based user community with sustained involvement in the future development and widespread use of the TEI Guidelines (
534 ABTEI To oversee and manage the revision process in collaboration with the TEI Editors, the TEI Board formed a Technical Council, with a membership elected from the TEI user community. The Council met for the first time in January 2002 at King's College London. Its first task was to oversee production of an XML version of the TEI Guidelines, updating P3 to enable users to work with the emerging XML toolset. This, the P4 version of the Guidelines, was published in June 2002. It was essentially an XML version of P3, making no substantive changes to the constraints expressed in the schemas apart from those necessitated by the shift to XML, and changing only corrigible errors identified in the prose of the P3 Guidelines. However, given that P3 had by this time been in steady use since 1994, it was clear that a substantial revision of its content was necessary, and work began immediately on the P5 version of the Guidelines. This was planned as a thorough overhaul, involving a public call for features and new development in a number of important areas not previously addressed including character encoding, graphics, manuscript description, biographical and geographical data, and the encoding language in which the TEI Guidelines themselves are written.
536 ABTEI The members of the TEI Council and its associated workgroups are listed in
537 ABTEI . In preparing this edition, they have been attentive to the requirements and practice of the widest possible range of TEI users, who are now to be found in many different research communities across the world, and have been largely instrumental in transforming the TEI from a grant-supported international research project into a self-sustaining community-based effort. One effect of the incorporation of the TEI has been the legal requirement to hold an annual meeting of the Consortium members; these meetings have emerged as an invaluable opportunity to sustain and reinforce that sense of community.
544 ABTEI4 The encoding recommended by this document may be used without fear that future versions of the TEI scheme will be inconsistent with it in fundamental ways. The TEI will be sensitive, in revising these Guidelines, to the possible problems which revision might pose for those who are already using this version of the Guidelines.
546 ABTEI4 With TEI P5, a version numbering system is introduced following
548 ABTEI4 : the first digit identifies a major version number, the second digit a minor version number, and the third digit a sub-minor version number. The TEI undertakes that no change will be made to the formal expression of these Guidelines (that is, a TEI schema, as defined in
549 ABTEI4 ) such that documents conformant to a given major numbered release cease to be compatible with a subsequent release of the same major number. Moreover, as far as possible, new minor releases will be made only for the purpose of adding new compatible features, or of correcting errors in existing features.
551 ABTEI4 The Guidelines are currently maintained as an open source project on the Sourceforge site
554 ABTEI4 for information on how to find specific versions of TEI releases (Guidelines, schemas etc.). Notice of errors detected and enhancements requested may be submitted at
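
The ABTEI4 rows above describe a three-part version number (major, minor, sub-minor) and an undertaking that documents conformant to one release stay compatible with later releases of the same major number. A minimal Python sketch of that rule follows. It is an illustration only, not part of any TEI tool: the names Version, parse_version and stays_compatible are hypothetical, the version strings are made-up examples, and treating "subsequent release" as "not older than the document's release" is an assumption.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Hypothetical holder for a major.minor.sub-minor version number."""
    major: int
    minor: int
    sub_minor: int


def parse_version(text: str) -> Version:
    """Parse a string such as '4.7.0' into its three numeric parts."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.sub-minor, got {text!r}")
    return Version(*(int(part) for part in parts))


def stays_compatible(document_release: str, later_release: str) -> bool:
    """Apply the stated undertaking, as interpreted here (an assumption):
    a document written against one release is expected to remain compatible
    with a release that has the same major number and is not older."""
    doc = parse_version(document_release)
    rel = parse_version(later_release)
    # NamedTuple instances compare element by element, like plain tuples.
    return doc.major == rel.major and rel >= doc
```

With these definitions, stays_compatible("4.0.0", "4.7.0") holds, while a change of major number or an older release does not satisfy the check.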

GD-GraphsNetworksTrees.xml#12945

# id text
21 GD The treatment here is largely based on the characterizations of graph types in
24 GD , which typically plot data in two or more dimensions, including plots with orthogonal or radial axes, bar charts, pie charts, and the like. These can be described using the elements defined in the module for figures and graphics; see chapter
36 GDGR . An undirected graph is a set of
40 GDGR ) together with a set of pairs of those vertices, called
44 GDGR . Each node in an arc of an undirected graph is said to be
45 GDGR incident
46 GDGR with that arc, and the two vertices (nodes) which make up an arc are said to be
48 GDGR . A directed graph is like an undirected graph except that the arcs are
50 GDGR of nodes. In the case of directed graphs, the term
52 GDGR is not used; moreover, each arc in a directed graph is said to be
54 GDGR the node from which the arc emanates, and
56 GDGR the node to which the arc is directed. We use the element
69 GDGR Before proceeding, some additional terminology may be helpful. We define a
71 GDGR in a graph as a sequence of nodes n1, ..., nk such that there is an arc from each ni to ni+1 in the sequence. A
75 GDGR is a path leading from a particular node back to itself. A graph that contains at least one cycle is said to be
79 GDGR . We say, finally, that a graph is
81 GDGR if there is a path from some node to every other node in the graph; any graph that is not connected is said to be
128 GDGR to record a label for the graph; similarly, the
138 GDGR element record the number of nodes and number of arcs in the graph respectively; these values are optional (since they can be computed from the rest of the graph), but if they are supplied, they must be consistent with the rest of the encoding. They can thus be used to help check that the graph has been encoded and transmitted correctly. The
142 GDGR elements record the number of arcs that are incident with that node. It is optional (because redundant), but can be used to help in validity checking: if a value is given, it must be consistent with the rest of the information in the graph. Finally, the
148 GDGR elements provide pointers to the nodes connected by those arcs. Since the graph is undirected, no directionality is implied by the use of the
152 GDGR attributes; the values of these attributes could be interchanged in each arc without changing the graph.
195 GDGR Note that each arc is represented twice in this encoding of the graph. For example, the existence of the arc from LAX to LVG can be inferred from each of the first two
197 GDGR elements in the graph. This redundancy, however, is not required: it suffices to describe an arc in any one of the three places it can be described (either adjacent node, or in a separate
226 GDGR element is redundant (since arcs can be described using the adjacency attributes of their adjacent nodes), it has nevertheless been included in this module, in order to allow the convenient specification of identifiers, display or rendition information, and labels for each arc (using the attributes
234 GDGR Next, let us modify the preceding graph by adding directionality to the arcs. Specifically, we now think of the arcs as specifying selected routes from one airport to another, as indicated by the direction of the arrowheads in the following diagram.
272 GDGR indicate the number of nodes which are adjacent to and from the node concerned respectively.
303 GDGR If we wish to label the arcs, say with flight numbers, then
370 GDTN ) of the network are distinguished. It can be understood as accepting the set of strings obtained by traversing it from its initial node to its final node, and concatenating the labels.
407 GDTN A finite state transducer has two labels on each arc, and can be thought of as representing a mapping from one sequence of labels to the other. The following example represents a transducer for translating the English strings accepted by the network in the preceding example into French. The nodes have been annotated with numbers, for convenience.
502 GDFT The next example provides an encoding of a portion of a family tree
503 GDFT The family tree is that of the mathematician and philosopher Bertrand Russell, whose third wife was commonly known as Peter. The information presented here is taken from
621 GDHI For our final example, we represent graphically the relationships among various geographic areas mentioned in a seventeenth-century Scottish document. The document itself is a
627 GDHI Item instrument of Sasine given the said Hector Mcneil confirmed and dated 28 May 1632 [...] at Edinburgh upon the 15 June 1632
629 GDHI Item ane charter granted by Archibald late earl of Argyle and Donald McNeill of Gallachalzie wh makes mention that ... the said late Earl yields and grants to the said Donald MacNeill ...
631 GDHI All and hail the two merk land of old extent of Gallachalzie with the pertinents by and in the lordship of Knapdale within the sherrifdome of Argyll
638 GDHI the two merk land of old extent of Gallachalzie with the pertinents by and in the lordship of Knapdale within the sherrifdom of Argyll
652 GDHI We will represent these geographic entities as nodes in a graph. Arcs in the graph will represent the following relationships among them:
656 GDHI location within (IN)
665 GDHI , for example, are inverses of each other: the Earl of Argyll's land includes the parcel in Gallachalzie, and the parcel is therefore in the Earl of Argyll's land. Given an explicit set of inference rules, an appropriate application could use the graph we are constructing to infer the logical consequences of the relationships we identify.
667 GDHI Let us assume that feature-structure analyses are available which describe Gallachalzie, Knapdale, and Argyll. We will link to those feature structures using the
675 GDHI That is, the three syntactic interpretations of the clause are mutually exclusive. The notion that the pertinents are in Argyll is clearly not inconsistent with the notion that both the land in Gallachalzie and the pertinents are in Argyll. The graph given here describes the possible interpretations of the clause itself, not the sets of inferences derivable from each syntactic interpretation, for which it would be convenient to use the facilities described in chapter
678 GDHI We represent the graph and its encoding as follows, where the dotted lines in the graph indicate the mutually exclusive arcs; in the encoding, we use the
683 GDHI The graph formalizes the following relationships:
704 GDHI We encode the graph thus:
774 GDTR tree
775 GDTR is a connected acyclic graph. That is, it is possible in a tree graph to follow a path from any vertex to any other vertex, but there are no paths that lead from any vertex to itself. A rooted tree is a directed graph based on a tree; that is, the arcs in the graph correspond to the arcs of a tree such that there is exactly one node, called the
776 GDTR root
777 GDTR , for which there is a path from that node to all other nodes in the graph. For our purposes, we may ignore all trees except for rooted trees, and hence we shall use the
781 GDTR element for its root. The nodes adjacent to a given node are called its
783 GDTR , and the node adjacent from a given node is called its
789 GDTR element. A node with no children is tagged as a
791 GDTR . If the children of a node are ordered from left to right, then we say that that node is
793 GDTR . If all the nodes of a tree are ordered, then we say that the tree is an
794 GDTR ordered tree
795 GDTR . If some of the nodes of a tree are ordered and others are not, then the tree is a
796 GDTR partially ordered tree
797 GDTR . The ordering of nodes and trees may be specified by an attribute; we take the default ordering for trees to be ordered, that roots inherit their ordering from the trees in which they occur, and internal nodes inherit their ordering from their parents. Finally, we permit a node to be specified as following other nodes, which (when its parent is ordered) it would be assumed to precede, giving rise to crossing arcs. The elements used for the encoding of trees have the following descriptions and attributes.
809 GDTR ) are applied in evaluating the arithmetic formula
811 GDTR . In drawing the graph, the root is placed on the far right, and directionality is presumed to be to the left.
873 GDTR of the tree, which is the greatest value of the
879 GDTR , we say that the tree is a
880 GDTR binary
885 GDTR nodes does not affect the arithmetic result in this case, we could represent in this tree all of the arithmetically equivalent formulas involving its leaves, by specifying the attribute
972 GDTR Linguistic phrase structure is very commonly represented by trees. Here is an example of phrase structure represented by an ordered tree with its root at the top, and a possible encoding.
1010 GDTR Finally, here is an example of an ordered tree, in which a particular node which ordinarily would precede another is specified as following it. In the drawing, the
1012 GDTR symbol indicates that the arc from VB to PT crosses the arc from VP to PN.
1059 GDAT , which is based on the observation that any node of such a tree can be thought of as the root of the subtree that it dominates. Thus subtrees can be thought of as the same type as the trees they are embedded in, hence the designation
1062 GDAT embedding tree
1199 GDAT Ambiguity involving alternative tree structures associated with the same terminal sequence can be encoded relatively conveniently using a combination of the
1207 GDAT may be part of the content of exactly one of two different
1225 GDAT . This ambiguity is indicated in the sketch of the ambiguous tree by means of the dotted-line arcs. The markup using the
1316 GDAT the attachment of a modifier may require the creation of an intermediate node which is not required when the attachment is not made, as shown in the following diagram. A possible encoding of this ambiguous structure immediately follows the diagram.
1417 GDAT derivation
1418 GDAT in a generative grammar is often thought of as a set of trees. To encode such a derivation, one may use the
1428 GDAT attribute may be used to specify what kind of derivation it is. Here is an example of a two-tree forest, involving application of the
1430 GDAT transformation in the derivation of
1442 GDAT empty category
1527 GDAT attributes to provide virtual copies of elements in the tree representing the second stage of the derivation that also occur in the first stage, and the
1530 GDAT ) to link those elements in the second stage with corresponding elements in the first stage that are not copies of them.
1532 GDAT If a group of forests (e.g. a full grammatical derivation including syntactic, semantic, and phonological subderivations) is to be articulated, the grouping element
1549 GDstem ) is a tree-like graphic structure that has become traditional in manuscript studies for representing textual transmission. Consider the following hypothetical stemma:
1554 GDstem The nodes in this stemma represent manuscripts; each has a label (a letter) which identifies it and also distinguishes whether the manuscript is extant, lost, or hypothetical. Extant manuscripts are identified by uppercase Latin letters or words beginning with uppercase Latin letters, e.g.,
1556 GDstem , shown as aqua in this example; manuscripts no longer existing, but providing readings which are attested e.g. by note or copy made before their disappearance, are identified by lowercase Latin letters, e.g.,
1564 GDstem share textual material that is not shared with other manuscripts (represented in this case by
1566 GDstem ) even though no physical manuscript attesting this stage in the textual transmission has ever been identified.
1568 GDstem Manuscripts are copied from other manuscripts. The preceding stemma represents the hypothesis that all manuscripts go back to a common ancestor (
1570 GDstem ), that the tradition split after that stage into two (
1576 GDstem is the earliest common hypothetical stage that can be reconstructed, and all nodes below
1578 GDstem have a single parent, that is, were copied from a single other stage in the tradition.
1580 GDstem This familiar tree model is complicated because manuscripts sometimes show the influence of more than one ancestor. They may have been produced by a scribe who checked the text in one manuscript of the same work whilst copying from another, or perhaps made changes from his memory of a slightly different version of the text that he had read elsewhere. Alternatively, perhaps scribe A copied a manuscript from one source, scribe B made changes in it in the margins or between the lines (either by consulting another source directly or from memory), and another scribe then copied that manuscript, incorporating the changes into the body. Whatever the specific scenario, it is not uncommon for a manuscript to be based primarily on one source, but to incorporate features of another branch of the tradition. This mixed result is called
1598 GDstem element introduced in this chapter can be used to represent a closed tradition in a straightforward manner. Each non-terminal node is represented by a typed
1600 GDstem element and each terminal node by an
1608 GDstem attributes. For example, the closed part of the tradition headed by the label δ may be encoded as follows:
1622 GDstem To complete this representation, we need to show that the node labelled A is not derived solely from its parent node (labelled ε) but also demonstrates contamination from the node labelled γ. The easiest way to accomplish this is to include an appropriately-typed
1624 GDstem element within the node in question, the
1626 GDstem of which points to the node labelled γ. This requires that this latter node be supplied with a value for its
1677 GDstem In any substantial codicological project, it is likely that significantly more data will be required about the individual witnesses than indicated in the simple structures above. These Guidelines provide a rich variety of additional elements for representing such information: see in particular chapters
1698 GD The selection and combination of modules to form a TEI schema is described in
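
The GDGR rows above state that the optional counts of nodes and arcs, and the per-node count of incident arcs, must agree with the rest of the encoding when they are supplied, and they define a connected graph as one in which there is a path from some node to every other node. A minimal Python sketch of such a consistency check for an undirected graph follows. It is not TEI software: the function names, the representation of arcs as plain pairs, and the node identifiers used in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def incident_counts(arcs):
    """For an undirected graph, count the arcs incident with each node."""
    counts = defaultdict(int)
    for first, second in arcs:
        counts[first] += 1
        counts[second] += 1
    return dict(counts)


def check_graph(nodes, arcs, order=None, size=None, declared_degrees=None):
    """Return a list of mismatches between declared values and the arcs.

    order, size and declared_degrees are optional, mirroring the text:
    the values can be computed, but if given they must be consistent.
    """
    problems = []
    if order is not None and order != len(nodes):
        problems.append(f"declared {order} nodes, found {len(nodes)}")
    if size is not None and size != len(arcs):
        problems.append(f"declared {size} arcs, found {len(arcs)}")
    computed = incident_counts(arcs)
    for node, declared in (declared_degrees or {}).items():
        actual = computed.get(node, 0)
        if declared != actual:
            problems.append(f"node {node}: declared {declared}, found {actual}")
    return problems


def is_connected(nodes, arcs):
    """Breadth-first search: is every node reachable from the first one?"""
    if not nodes:
        return True
    neighbours = defaultdict(set)
    for first, second in arcs:
        neighbours[first].add(second)
        neighbours[second].add(first)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for other in neighbours[current] - seen:
            seen.add(other)
            queue.append(other)
    return seen == set(nodes)
```

For example, with the hypothetical nodes ["a", "b", "c"] and arcs [("a", "b"), ("b", "c")], check_graph reports no problems for order=3, size=2 and declared degrees of 1, 2 and 1, and is_connected returns True. Adding an isolated node "d" makes the graph disconnected.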

SG-GentleIntroduction.xml#12945

# id text
4 SG The encoding scheme defined by these Guidelines is formulated as an application of the Extensible Markup Language (XML) (
5 SG ). XML is widely used for the definition of device-independent, system-independent methods of storing and processing texts in electronic form. It is now also the interchange and communication format used by many applications on the World Wide Web. In the present chapter we informally introduce some of its basic concepts and attempt to explain to the reader encountering them for the first time how and why they are used in the TEI scheme. More detailed technical accounts of TEI practice in this respect are provided in chapters
12 SG , that is, a language used to describe other languages, in this case,
16 SG has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special codes inserted into electronic texts to govern formatting, printing, or other processing.
22 SG , as any means of making explicit an interpretation of a text. Of course, all printed texts are implicitly encoded (or marked up) in this sense: punctuation marks, capitalization, disposition of letters around the page, even the spaces between words all might be regarded as a kind of markup, the purpose of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from
25 SG continuous writing
27 SG ; it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted.
30 SG markup language
31 SG we mean a set of markup conventions used together for encoding texts. A markup language must specify how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means. XML provides the means for doing the first three; documentation such as these Guidelines is required for the last.
52 SG11 These three aspects are discussed briefly below, and then in more depth in the remainder of this chapter.
54 SG11 XML is frequently compared with HTML, the language in which web pages have generally been written, which shares some of the above characteristics. Compared with HTML, however, XML has some other important features:
57 SG11 : it does not consist of a fixed set of tags;
77 SG111 the following item is a paragraph
79 SG111 this is the end of the most recently begun list
83 SG111 move the left margin 2 quads left, move the right margin 2 quads right, skip down one line, and go to the new left margin,
84 SG111 etc. In XML, the instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the markup used to describe it.
86 SG111 Usually, the markup or other information needed to process a document will be maintained separately from the document itself, typically in a distinct document called a
88 SG111 , though it may do much more than simply define the rendition or visual appearance of a document.
94 SG111 When descriptive markup is used, the same document can readily be processed in many different ways, using only those parts of it which are considered relevant. For example, a content analysis program might disregard entirely the footnotes embedded in an annotated text, while a formatting program might extract and collect them all together for printing at the end of each chapter. Different kinds of processing can be carried out with the same part of a file. For example, one program might extract names of persons and places from a document to create an index or database, while another, operating on the same text, but using a different stylesheet, might print names of persons and places in a distinctive typeface.
105 SG112 title
107 SG112 author
109 SG112 abstract
110 SG112 and a sequence of one or more
112 SG112 . Anything lacking a title, according to this formal definition, would not formally be a report, and neither would a sequence of paragraphs followed by an abstract, whatever other report-like characteristics these might have for the human reader.
123 SG113 A basic design goal of XML is to ensure that documents encoded according to its provisions can move from one hardware and software environment to another without loss of information. The two features discussed so far both address this requirement at an abstract level; the third feature addresses it at the level of the strings of data characters that make up a document. All XML documents, whatever languages or writing systems they employ, use the same underlying character encoding (that is, the same method of representing as binary data those graphic forms making up a particular writing system).
132 SG113 which is implemented by a universal character set maintained by an industry group called the Unicode Consortium, and known as Unicode.
134 SG113 Unicode provides a standardized way of representing any of the many thousands of discrete symbols making up the world's writing systems, past and present.
137 SG113 Most modern computing systems now support Unicode directly; for those which do not, XML provides a mechanism for the indirect representation of single characters by means of their character number, known as
146 SG12 A text is not an undifferentiated sequence of words, much less of bytes. For different purposes, it may be divided into many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages.
148 SG12 Structural units of this kind are most often used to identify specific locations or refer to points within a text (
151 SG12 canto 10, line 1234
154 SG12 , etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes (
160 SG12 ). Other structural units are more clearly analytic, in that they characterize a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. Such an analysis is less useful for locating parts of the text (
164 SG12 In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument, etc.), passages of different authorship and so forth. And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, line breaks, use of whitespace and so forth.
166 SG12 These textual structures overlap with one another in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organization of the book and the logical structure of the work it contains. Many great works (Sterne's
168 SG12 for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and presentational ones (such as page divisions). For many types of research, the interplay among different levels of analysis is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology.
176 SG131 The technical term used in XML for a textual unit, viewed as a structural component, is
186 SG131 of textual elements, because these are considered to be application dependent. It is up to the creators of XML vocabularies (such as these Guidelines) to choose intelligible element names and to define their intended use in text markup. That is the chief purpose of documents such as the TEI Guidelines. From the need to choose element names indicative of function comes the technical term for the name of an element type, which is
190 SG131 Within a marked-up text (a
192 SG131 ), each element must be explicitly marked or tagged in some way. This is done by inserting a tag at the beginning of the element (a
196 SG131 ). The start- and end-tag pair are used to bracket off element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, a quotation element in a text might be tagged as follows:
200 SG131 As this example shows, a start-tag takes the form
201 SG131 quote
203 SG131 quote
209 SG131 The material between the start-tag and the end-tag (the string of words
212 SG131 content
213 SG131 of the element. Sometimes there may be nothing between the start-tag and the end-tag; in this case the two may optionally be merged together into a single composite tag with the solidus at the end, like this:
221 SG132 , that is, it may have no content at all, or it may contain just a sequence of characters with no other elements. Often, however, elements of one type will be
229 SG132 , and it consists of a series of
235 SG132 , each stanza having embedded within it a number of
236 SG132 line
237 SG132 elements. Fully marked up, a text conforming to this model might appear as follows:
270 SG132 a valid TEI document.
271 SG132 The element names here have been chosen for clarity of exposition; there is, however, a TEI element corresponding to each, so that this example may be regarded as TEI-conformable in the sense that this term is defined in
273 SG132 It will, however, serve as an introduction to the basic notions of XML. Whitespace and line breaks have been added to the example for the sake of visual clarity only; they have no particular significance in the XML encoding itself. Also, the line
284 SG132 root element
289 SG132 each element is completely contained by the root element, or by an element that is so contained; elements do not partially overlap one another;
291 SG132 a tag explicitly marks the start and end of each element.
295 SG132 A well-formed XML document can be processed in a number of useful ways. A simple indexing program could extract only the relevant text elements in order to make a list of headings, first lines, or words used in the poem text; a simple formatting program could insert blank lines between stanzas, perhaps indenting the first line of each, or inserting a stanza number. Different parts of each poem could be typeset in different ways. A more ambitious analytic program could relate the use of punctuation marks to stanzaic and metrical divisions.
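The kind of simple indexing program mentioned above can be sketched in a few lines of Python using the standard library's ElementTree parser. This is only an illustration: the sample document, its placeholder text, and the function names `index_poems` and `is_well_formed` are invented here; only the element names (anthology, poem, heading, stanza, line) come from the surrounding text.

```python
import xml.etree.ElementTree as ET

# Invented sample document; only the element names come from the text.
SAMPLE = """
<anthology>
  <poem>
    <heading>First poem</heading>
    <stanza>
      <line>line one</line>
      <line>line two</line>
    </stanza>
    <stanza>
      <line>line three</line>
      <line>line four</line>
    </stanza>
  </poem>
  <poem>
    <heading>Second poem</heading>
    <stanza>
      <line>another opening line</line>
    </stanza>
  </poem>
</anthology>
"""


def index_poems(xml_text):
    """Return one (heading, first line) pair per poem."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    entries = []
    for poem in root.iter("poem"):
        heading = poem.findtext("heading", default="")
        # find-style calls return the first match in document order
        first_line = poem.findtext(".//line", default="")
        entries.append((heading, first_line))
    return entries


def is_well_formed(xml_text):
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

A document in which elements partially overlap, such as `<stanza><line>x</stanza></line>`, is rejected by `is_well_formed`, which corresponds to the second of the well-formedness conditions listed above.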
298 SG132 Scholars wishing to see the implications of changing the stanza or line divisions chosen by the editor of this poem can do so simply by altering the position of the tags. And of course, the text as presented above can be transported from one computer to another and processed by any program (or person) capable of making sense of the tags embedded within it, with no need for the sort of transformations and translations needed for files which have been saved in one or other of the proprietary formats preferred by most word-processing programs.
300 SG132 As we noted above, one of the attractions of XML is that it enables us to make up our own names for the elements rather than requiring us always to use names predefined by other agencies. Clearly, however, if we wish to exchange our poems with others, or to include poems others have marked up in our anthology, we will need to know a bit more about the names used for the tags. The means that XML provides for this is called a
301 SG132 namespace
303 SG132 qualified name
304 SG132 , that is, a name with an optional prefix identifying the set of names to which it belongs. For example, we have defined an element
306 SG132 for the purpose of marking lines of verse. Another person might, however, define an element called
308 SG132 for the purpose of marking typographic lines, or drawn lines. Because of these different meanings, if we wish to share data it will be necessary to distinguish the two
309 SG132 line
311 SG132 namespace prefix
314 SG132 This feature is particularly important if we have different definitions of what a
315 SG132 line
316 SG132 is, of course, but there are many occasions when it is useful to distinguish groups of tags belonging to different
319 SG132 ). One particularly useful namespace prefix is predefined for XML: it is
323 SG132 Namespaces allow us to represent the fact that a name belongs to a group of names, but don't allow us to do much more by way of checking the integrity or accuracy of our tagging. Simple well-formedness alone is not enough for the full range of what might be useful in marking up a document. It might well be useful if, in the process of preparing our digital anthology, a computer system could check some basic rules about how stanzas, lines, and headings can sensibly co-occur in a document. It would be even more useful if the system could check that stanzas are always tagged
331 SG132 document, and the ability to perform such validation is one of the key advantages of using XML. To carry this out, some way of formally stating the criteria for successful validation is necessary: in XML this formal statement is provided by an additional document known as a
338 SG132 , both abbreviated as DTD, may also be encountered. Throughout these Guidelines we use the term
346 SG14 The design of a schema may be as lax or as restrictive as the occasion warrants. A balance must be struck between the convenience of following simple rules and the complexity of handling real texts. This is particularly the case when the rules being defined relate to texts that already exist: the designer may have only the haziest of notions as to an ancient text's original purpose or meaning and hence find it very difficult to specify consistent rules about its structure. On the other hand, where a new text is being prepared to an exact specification, for entry into a textual database of some kind for example, the more precisely stated the rules, the better they can be enforced. Even in the case where an existing text is being marked up, it may be beneficial to define a restrictive set of rules relating to one particular view or hypothesis about the text—if only as a means of testing the usefulness of that view or hypothesis. A schema designed for use by a small project or team is likely to take a different position on such issues than one intended for use by a large and possibly fragmented community. It is important to remember that every schema results from an interpretation of a text. There is no single schema encompassing the absolute truth about any text, although it may be convenient to privilege some schemas above others for particular types of analysis.
348 SG14 XML is widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross-references should be properly resolved and so forth. In such situations, documents are seen as raw material to match against predefined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of accurately tagging elements of less rigidly constrained texts. By making these rules explicit, the scholar reduces his or her own burdens in marking up and verifying the electronic text, while also being forced to make explicit an interpretation of the structure and significant particularities of the text being encoded.
353 SG141bis A schema can be expressed in a number of different ways; frequently-encountered methods include the Document Type Definition (DTD) language which XML inherited from SGML; the XML Schema language (
354 SG141bis ) defined by the W3C; and the RELAX NG language (
359 SG141bis of RELAX NG, but the specifications within these Guidelines are expressed in a way that is largely independent of the specific language in which a schema generated from them is expressed.
362 SG141bis . In practice, the only part of a TEI element specification not expressed using TEI-defined syntax is the content model for an element, which is expressed using the RELAX NG schema language for reasons of processing convenience. RELAX NG uses its own XML vocabulary to define content models, which is adopted by the TEI for the same purpose.
366 SG141bis anthology_p = element anthology { poem_p+ } poem_p = element poem { heading_p?, stanza_p+ } stanza_p = element stanza {line_p+} heading_p = element heading { text } line_p = element line { text } start = anthology_p
376 SG141bis ; that is, it defines a number of named patterns, each of which acts as a kind of template against which an input document can be matched. The meaning of a pattern is expressed in a schema by reference to other patterns, or to a small number of built-in fundamental concepts, as we shall see. In the example above, the word to the left of the equals sign is the pattern's name, and the material following it declares a meaning for the pattern. Patterns may also be of particular types; the ones that interest us here are called
380 SG141bis . In this example we see definitions for five element patterns. Note that we have used similar names for the pattern and the element which the pattern describes: so, for example, the line
384 SG141bis , the value of which defines an element called
386 SG141bis . These naming conventions are arbitrary; we could use the same name for the pattern as for the element, since the two are syntactically quite distinct. The name, or
391 SG141bis content model
394 SG141bis The last line of the schema above tells a RELAX NG validator which element (or elements) in a document can be used as the root element: in our case only
397 SG141bis entry point
423 SG141x ; the root element of a TEI-conformant document is
434 SG143 content model
435 SG143 of the element being defined, because it specifies what may legitimately be contained within it. In RELAX NG, the content model is defined in terms of other patterns, either by embedding them, or (as in our examples above) by naming or referring to them. The RELAX NG compact syntax also uses a small number of reserved words to identify other possible contents for an element, of which by far the most commonly encountered is
436 SG143 text
439 SG143 ), then almost always, following the branches of the tree downwards (for example, from
450 SG143 text
455 SG143 are so defined, since their content models say
456 SG143 text
457 SG143 only and name no embedded elements.
467 SG144 may be repeated. There are three occurrence indicators: the plus sign, the question mark, and the asterisk or star. The plus sign means that the pattern can match one or more times; the question mark means that it may match at most once but is not mandatory; the star means that the pattern concerned is not mandatory, but may match more than once. Thus, if the content model for
483 SG145 The content model
491 SG145 (the comma) used between its components. The comma connector indicates that the patterns concerned must appear in the sequence given. Another commonly encountered connector is the vertical bar, representing alternation. If the comma in this example were replaced by a vertical bar, then a
497 SG146 In our example so far, the components of each content model have been either single patterns or
498 SG146 text
499 SG146 . It is quite permissible, however, to define content models in which the components are lists of patterns, combined by connectors. Such lists may also be modified by occurrence indicators and themselves combined by connectors. To demonstrate these facilities, let us expand our example to include non-stanzaic types of verse. For the sake of demonstration, we will categorize poems as one of the following:
507 SG146 ). A blank-verse poem consists simply of lines (we ignore the possibility of verse paragraphs for the moment),
508 SG146 It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; see further section
510 SG146 so no additional elements need be defined for it. A couplet is defined as a
524 SG146 (which are distinguished to enable studies of rhyme scheme, for example
525 SG146 This is however a rather artificial example; XPath, for example, provides ways of distinguishing elements in an XML structure by their position without the need to give them distinct names.
526 SG146 ); these will have exactly the same content model as the existing
528 SG146 element. We will therefore add the following two lines to our example schema:
530 SG146 Next, we can change the declaration for the
536 SG146 The second version, by applying the occurrence indicator to the group rather than to each element within it, would allow a single poem to contain a mixture of stanzas, couplets, and lines.
538 SG146 A group of this kind can contain
539 SG146 text
541 SG146 mixed content
542 SG146 , allows for elements in which the sub-components appear with intervening stretches of character data. For example, if we wished to mark place names wherever they appear inside our verse lines, then, assuming we have also added a pattern for the
544 SG146 element, we could change the definition for
547 SG146 line_p = element line { (text | name_p )* }
550 SG146 Some XML schema languages place no constraints on the way that mixed content models may be defined, but in the XML DTD language, when
551 SG146 text
552 SG146 appears with other elements in a content model, it must always appear as the first option in an alternation; it may appear once only, and in the outermost model group; and if the group containing it is repeated, the star operator must be used. Although these constraints do not apply to (for example) schemas expressed in the RELAX NG language, all TEI content models currently obey them.
554 SG146 Quite complex models can easily be built up in this way, to match the structural complexity of many types of text. As a further example, consider the case of stanzaic verse in which a refrain or chorus appears. Like a stanza, a refrain consists of repetitions of the line element. A refrain can appear at the start of a poem only, or as an optional addition following each stanza. This could be expressed by a pattern such as the following:
556 SG146 That is, a poem consists of an optional heading, followed by either a sequence of lines or an unnamed group, which starts with an optional refrain and is followed by one or more occurrences of another group, each member of which is composed of a stanza followed by an optional refrain. A sequence such as
558 SG146 follows this pattern, as does the sequence
560 SG146 . The sequence
562 SG146 does not, however, and neither does the sequence
564 SG146 Among other conditions made explicit by this content model are the requirements that at least one stanza must appear in a poem, if it is not composed simply of lines, and that if there is both a heading and a stanza they must appear in that order.
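The content model described in prose above (an optional heading, followed by either a sequence of lines or a group made of an optional refrain and one or more stanzas, each optionally followed by a refrain) can be approximated outside a schema language. The following Python sketch is not how a RELAX NG validator works; it simply encodes the same model as a regular expression over the names of a poem's child elements. The function name `matches_poem_model` and the test sequences are invented for illustration.

```python
import re

# heading?, ( line+ | ( refrain?, ( stanza, refrain? )+ ) )
# Each child element name is followed by one space so that names
# cannot run into one another.
POEM_MODEL = re.compile(
    r"(heading )?"
    r"((line )+|(refrain )?(stanza (refrain )?)+)"
)


def matches_poem_model(children):
    """children: names of a poem's child elements, in document order."""
    encoded = "".join(name + " " for name in children)
    return POEM_MODEL.fullmatch(encoded) is not None
```

Under these assumptions, `matches_poem_model(["heading", "stanza", "refrain", "stanza"])` is accepted, whereas two refrains in succession, or a heading placed after a stanza, are rejected.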
576 SG152 In the simple cases described so far, we have assumed that one can identify the immediate constituents of every element in a textual structure. A poem consists of stanzas, and an anthology consists of poems. Stanzas do not float around unattached to poems or combined into some other unrelated element; a poem cannot contain an anthology. All the elements of a given document type may be arranged into a hierarchic structure like a family tree, with a single ancestor at one end and many children (mostly the elements containing simple text) at the other. For example, we could represent an anthology containing two poems, the first of which contains two four-line stanzas and the second a single stanza, by a tree structure like the following figure:
580 SG152 This graphic representation of the structure of an XML document is close to the abstract model implicit in most XML processing systems. Most such systems now use a standardized way of accessing parts of an XML document called
587 SG152 XPath gives us a non-graphical way of referring to any part of an XML document: for example, we might refer to the last line of Blake's poem as
589 SG152 . The square brackets here indicate a numerical selection: we are talking about the fourth line in the second stanza of the first poem in the anthology. If we left out all the square-bracketed selections, the corresponding XPath expression would refer to all lines contained by stanzas contained by poems contained by anthologies. An XPath expression can refer to any collection of elements: for example, the expression
595 SG152 The solidus within an XPath expression behaves in much the same way as the solidus or backslash in a filename specification: it indicates that the item to the left directly contains the item to the right of it. In XPath it is also possible to indicate that any number of other items may intervene by repeating the solidus. For example, the XPath expression
597 SG152 will refer to the first line of each poem in the anthology, irrespective of whether it is in a stanza.
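These selections can be tried out with Python's standard ElementTree module, which supports a limited subset of XPath. The sketch below is illustrative only: the path strings are ElementTree paths written relative to the root element, not the expressions given in the Guidelines, and the generated line texts and the two-line length of the second poem's stanza are assumptions made so that the result is easy to inspect.

```python
import xml.etree.ElementTree as ET


def build_anthology():
    """Two poems: the first has two four-line stanzas, the second one stanza."""
    anthology = ET.Element("anthology")
    # Stanza sizes per poem; the (2,) for the second poem is an assumption.
    for poem_no, stanza_sizes in enumerate([(4, 4), (2,)], start=1):
        poem = ET.SubElement(anthology, "poem")
        for stanza_no, size in enumerate(stanza_sizes, start=1):
            stanza = ET.SubElement(poem, "stanza")
            for line_no in range(1, size + 1):
                line = ET.SubElement(stanza, "line")
                line.text = f"p{poem_no} s{stanza_no} l{line_no}"
    return anthology


root = build_anthology()  # root is the anthology element itself

# Fourth line of the second stanza of the first poem.
last_line = root.find("./poem[1]/stanza[2]/line[4]")

# Dropping the bracketed selections widens the match to every line.
all_lines = root.findall("./poem/stanza/line")

# A doubled solidus lets any number of levels intervene; find() then
# returns the first matching line of each poem in document order.
first_lines = [poem.find(".//line").text for poem in root.findall("poem")]
```

Here `last_line.text` is `"p1 s2 l4"`, `all_lines` holds ten elements, and `first_lines` holds the opening line of each of the two poems.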
599 SG152 Clearly, there are many such trees that might be drawn to describe the structure of this or other anthologies. Some of them might be representable as further subdivisions of this tree: for example, we might subdivide the lines into individual words, since in our simple example no word crosses a line boundary. Surprisingly perhaps, this grossly simplified view of what text is (memorably termed an
600 SG152 ordered hierarchy of content objects
601 SG152 (OHCO) view of text by Renear
605 SG152 ) turns out to be very effective for a large number of purposes. It is not, however, adequate for the full complexity of real textual structures, for which more complex mechanisms need to be employed. There are many other trees that might be drawn which do
609 SG152 In the OHCO model of text, representation of cases where different elements overlap so that several different trees may be identified in the same document is generally problematic. All the elements marked up in a document, no matter what namespace they belong to, must fit within a single hierarchy. To represent overlapping structures, therefore, a single hierarchy must be chosen, and the points at which other hierarchies intersect with it marked. For example, we might choose the verse structure as our primary hierarchy, and then mark the pagination by means of empty elements inserted at the boundary points between one page and the next. Or we could represent alternative hierarchies by means of the pointing and linking mechanisms described in chapter
619 SG16 , like some other words, has a specific technical sense. It is used to describe information that is in some sense descriptive of a specific element occurrence but not regarded as part of its content. For example, you might wish to add a
621 SG16 attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an
625 SG16 Although different elements may have attributes with the same name (for example, in the TEI scheme, every element is defined as having an attribute named
627 SG16 ), they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as
631 SG16 The order in which attribute-value pairs are supplied inside a tag has no significance; they must, however, be separated by at least one whitespace (blank, newline, or tab) character. The value part must always be given inside matching quotation marks, either single or double
632 SG16 In the unlikely event that both kinds of quotation marks are needed within the quoted string, either or both can also be presented in escaped form, using the predefined character entities
652 SG16 attribute has the value
656 SG16 attribute has the value
662 SG16 attribute has the value
664 SG16 might be formatted differently from one in which the same attribute has the value
668 SG16 attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross-reference purposes, as discussed further below (
673 SG-att Attributes are declared in a schema in the same way as elements. As well as specifying an attribute's name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute.
679 SG-att , whose value is an attribute pattern defining an attribute named
681 SG-att . Attribute names are subject to the same restrictions as other names in XML; they need not be unique across the whole schema, however, but only within the list of attributes for a given element.
683 SG-att A pattern defining the possible values for this attribute is given within the curly braces, in just the same way as a content model is given for an element pattern. In this case, the attribute's value must be one of the strings presented explicitly above.
689 SG-att In RELAX NG, an element pattern simply includes any attribute patterns applicable to it along with its other constituents, as shown above. Attribute patterns can also be grouped and alternated in the same way as element patterns, though this particular feature is not widely used in the TEI scheme, since it is not available to the same extent in all schema languages. Because a question mark follows the reference to the
697 SG-att Instead of supplying a list of explicit values, an attribute pattern can specify that the attribute must have a value of a particular type, for example a text string, a numeric value, a normalized date, etc. This is accomplished by supplying a pattern that refers to a
698 SG-att datatype
699 SG-att . In the example above, because a list of acceptable values is predefined, a parser can check that no
711 SG-att a parser would accept almost any unbroken string of characters (
717 SG-att ) as valid for this attribute. Sometimes, of course, the set of possible values cannot be predefined. Where it can, as in this case, it is generally better to do so.
719 SG-att Schema languages vary widely in the extent to which they support validation of attribute values. Some languages predefine a small set of possibilities. Others allow the schema designer to use values from a predefined
721 SG-att of possible datatypes, or to add their own definitions, possibly of great complexity. A
722 SG-att datatype
723 SG-att might be something fairly general (any positive integer), something very specific or idiosyncratic (any four-character string ending with "T"), or somewhere between the two. In the RELAX NG schemas used by the TEI, general patterns have been defined for about half a dozen datatypes (using the W3C Schema
726 SG-att ). In addition to the two possibilities already mentioned—plain text or an explicit list of possible strings—other datatypes likely to be encountered include the following:
732 SG-att numeric
734 SG-att values must represent a numeric quantity of some kind
736 SG-att date
738 SG-att values must represent a possible date and time in some calendar
751 SG-id see note 6
754 SG-id . When a text is being produced the actual numbers associated with the notes or chapters may not be certain. If we are using descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked-up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications). XML therefore predefines an attribute that may be used to provide any element occurrence with a special identifier, a kind of label, which may be used to refer to it from anywhere else: since it is defined in the XML namespace, the name of this attribute is
756 SG-id and it is used throughout the TEI schema. Because it is intended to act as an identifier, its values must be unique within a given document. The cross-reference itself will be supplied by an element bearing an attribute of a specific kind, which must also be declared in the schema.
758 SG-id Suppose, for example, we wish to include a reference within the notes on one poem that refers to another poem. We will first need to provide some way of attaching a label to each poem: this is easily done using the
772 SG-id Next we need to define a new element for the cross-reference itself. This will not have any content—it is only a pointer—but it has an attribute, the value of which will be the identifier of the element pointed at. This is achieved by the following definition:
780 SG-id . The value of this attribute must be a pointer or web reference of type
787 SG-id (URI) may be supplied here. The accepted syntax for URIs is an Internet Standard, defined in
792 SG-id defined by the W3C Schema datatype library.
793 SG-id furthermore, because there is no indication of optionality on the attribute pattern, it must be supplied on each occurrence—a
807 SG-id A processor may take any number of actions when it encounters a link encoded in this way: a formatter might construct an exact page and line reference for the location of the poem in the current document and insert it, or just quote the poem's title or first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to, for example by displaying it in a new window. Note, however, that the purpose of the XML markup is simply to indicate that a cross-reference exists: it does not necessarily determine what the processor is to do with it.
813 SG-id attribute of datatype URI:
814 SG-id graphic_p = element graphic {att.url, empty} att.url = attribute url {anyURI}
815 SG-id With these additions to the schema, we can now represent the location of the illustration within our text like this:
817 SG-id By providing a location from which a reproduction of the required image can be downloaded, this encoding makes it possible for appropriate software to display the image as well as record its existence.
819 SG-id Attributes form part of the structure of an XML document in the same way as elements, and can therefore be accessed using XPath. For example, to refer to all the poems in our anthology whose
821 SG-id attribute has the value
833 SG-oth In addition to the elements and attributes so far discussed, an XML document can contain a few other formally distinct things. An XML document may contain references to predefined strings of data that a validator must resolve before attempting to validate the document's structure; these are called
837 SG-oth text or representing character data which cannot easily be keyboarded. An XML document may also contain arbitrary signals or flags for use when the document is processed in a particular way by some class of processor (a common example in document production is the need to force a formatter to start a new page at some specific point in a document); such flags are called
840 SG-oth namespace
845 SG-er As mentioned above, all XML documents use the same internal character encoding. Since not all computer systems currently support this encoding directly, a special syntax is defined that can be used to represent individual characters from the Unicode character set in a portable way by providing their numeric value, in decimal or hexadecimal notation.
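As a small illustration of numeric character references, the Python sketch below parses the same character written in decimal and in hexadecimal notation, and then goes the other way by rewriting non-ASCII characters as hexadecimal references. The sample word and the helper name `to_char_refs` are invented here; the helper is deliberately minimal and does not escape markup characters such as the ampersand.

```python
import xml.etree.ElementTree as ET

# The same character (U+00E9) given in decimal and in hexadecimal notation.
decimal = ET.fromstring("<p>caf&#233;</p>").text
hexadecimal = ET.fromstring("<p>caf&#xE9;</p>").text


def to_char_refs(text):
    """Replace every non-ASCII character by a hexadecimal character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text
    )
```

Both parsed strings come out identical, since a parser resolves either notation to the same Unicode character.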
849 SG-er is represented within an XML document as the Unicode character with hexadecimal value
851 SG-er . If such a document is being prepared on (or exported to) a system using a different character set in which this character is not available, it may instead be represented by the character reference
859 SG-er To aid legibility, however, it is also possible to use a mnemonic name (such as
861 SG-er ) for such character references, provided that each such name is mapped to the required Unicode value by means of a construct known as an
863 SG-er . A reference to a named character entity always takes the form of an ampersand, followed by the name, followed by a semicolon. For example, an XML document containing the string
869 SG-er There is a small set of such character entity references that do not have to be declared because they form part of the definition of XML. These include the names used for characters such as the ampersand (
873 SG-er ), which could not easily otherwise be included in an XML document without ambiguity. Other predeclared entity names are those for quotation marks (
881 SG-er For all other named character entities, a set of entity declarations must be provided to an XML processor before the document referring to them can be validated. The declaration itself uses a non-XML syntax inherited from SGML; for example, to define an entity named
883 SG-er with the replacement value é, the declaration could have any of the following forms:
892 SG-er string substitution
893 SG-er purposes, where the same text needs to be repeated uniformly throughout a text. For example, if a declaration such as
894 SG-er <!ENTITY TEI "Text Encoding Initiative">
895 SG-er is included with a document, then references such as
897 SG-er may be used within it, each of which will be expanded in the same way and replaced by the string
899 SG-er before the text is validated.
904 SG-pi Although one of the aims of using XML is to remove any information specific to the processing of a document from the document itself, it is occasionally very convenient to be able to include such information—if only so that it can be clearly distinguished from the structure of the document. As suggested above, one common example is the need, when processing an XML document for printed output, to include a suggestion that the formatting processor might use to determine where to begin a new page of output. Page-breaking decisions are usually best made by the formatting engine alone, but there will always be occasions when it may be necessary to override these. An XML processing instruction inserted into the document is one very simple and effective way of doing this without interfering with other aspects of the markup.
912 SG-pi . In between are two space-separated strings: by convention, the first is the name of some processor (
914 SG-pi in the above example) and the second is some data intended for the use of that processor (in this case, the instruction to start a new page). The only constraint placed by XML on the strings is that the first one must be a valid XML name; the other can be any arbitrary sequence of characters, not including the closing character-sequence
920 SG-pi which can be supplied at the beginning of an XML document, for example:
922 SG-pi The XML declaration specifies the version number of the XML Recommendation applicable to the document it introduces (in this case, version 1.0), and optionally also the character encoding used to represent the Unicode characters within it. By default an XML document uses the character encoding UTF-8 or UTF-16; in this case, the 16-bit characters of Unicode have been mapped to the 8-bit character set known as ISO 8859-1; any characters present in the document but not available in the target character set will therefore need to be represented as character references (
923 SG-pi ). The XML declaration is purely documentary, but if it is wrong many XML-aware processors will be unable to process the associated text.
933 SGname namespace
934 SGname was introduced into the XML language as a means of addressing these and related problems. If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. Just as a document can contain words taken from different languages, so a well-formed XML document can include elements taken from different namespaces. A namespace resembles a schema in that we may say that a given set of elements
938 SGname a given schema. However, a schema is a set of element definitions, whereas a namespace is really only a property of a collection of elements: the only tangible form it takes in an XML document is its distinctive
941 SGname name
944 SGname Suppose for example that we wish to extend our anthology to include a complex diagram. We might start by considering whether or not to extend our simple schema to include XML markup for such features as arcs, polygons, and other graphical elements. XML can be used to represent any kind of structure, not simply text, and there are clear advantages to having our text and our diagrams all expressed in the same way.
946 SGname Fortunately we do not need to invent a schema for the representation of graphical components such as diagrams; it already exists in the shape of the Scalable Vector Graphics (SVG) language defined by the W3C.
949 SGname SVG is a widely used and rich XML vocabulary for representing all kinds of two-dimensional graphics; it is also well supported by existing software. Using an SVG-aware drawing package, we can easily draw our diagram and save it in XML format for inclusion within our anthology. When we do so, we need to indicate that this part of the document contains elements taken from the SVG namespace, if only to ensure that processing software does not confuse our
955 SGname An XML document need not specify any namespace: it is then said to use the
957 SGname namespace. Alternatively, the root element of a document may supply a default namespace, understood to apply to all elements which have no namespace prefix. This is the function of the
959 SGname attribute which provides a unique name for the default namespace, in the form of a URI:
964 SGname In exactly the same way, on the root element for each part of our document which uses the SVG language, we might introduce the SVG namespace name:
973 SGname Although a namespace name usually uses the URI (Uniform Resource Identifier) syntax, it is not treated as an online address and an XML processor regards it just as a string, providing a longer name for the namespace.
977 SGname attribute can also be used to associate a short prefix name with the namespace it defines. This is very useful if we want to mingle elements from different namespaces within the same document, since the prefix can be attached to any element, overriding the implicit namespace for itself (but not its children):
988 SGname There is no limit on the number of namespaces that a document can use. Provided that each is uniquely identified, an XML processor can identify those that are relevant, and validate them appropriately. To extend our example further, we might decide to add a linguistic analysis to each of the poems, using a set of elements such as
1016 SG-ms We mentioned above that the syntax of XML requires the encoder to take special action if characters with a syntactic meaning in XML (such as the left angle bracket or ampersand) are to be used in a document to stand for themselves, rather than to signal the start of a tag or an entity reference respectively. The predefined entities
1022 SG-ms provide one method of dealing with this problem, if the number of occurrences of such things is small. Other methods may be considered when the number is large, as in an XML document like the present Guidelines, which contains hundreds of examples of XML markup. One is to label the XML examples as belonging to a different namespace from that of the document itself, which is the approach taken in the present Guidelines. Another and simpler approach is provided by one of the features inherited by XML from its parent SGML: the
1026 SG-ms A marked section is a block of text within an XML document introduced by the characters
1030 SG-ms . Between these rather strange brackets, markup recognition is turned off, and any tags or entity references encountered are therefore treated as if they were plain text. For example, when we come to write the users' manual for our anthology, we may find ourselves often producing text like the following:
1043 SG18 if a document contains entity references that must be processed before the document can be validated, where are those entities defined?
1045 SG18 an XML document instance may be stored in a number of different operating system files; how should they be assembled together?
1047 SG18 how does a processor determine which stylesheets it should use when processing an XML document, or how to interpret any processing instructions it contains?
1053 SG18 Different schema languages and different XML processing systems take very different positions on all of these topics, since none of them is explicitly addressed in the XML specification itself. Consequently, the best answer is likely to be specific to a particular software environment and schema language. Since this chapter is concerned with XML considered independently of its processing environment, we only address them in summary detail here.
1060 SG-ass1 , which XML inherited from SGML. Different schema languages vary in the ways they make a collection of such definitions available to an XML processor, but fortunately there is one method that all current schema languages support.
1065 SG-ass1 statement. This declarative statement has been inherited by XML from SGML; in its full form it provides a large number of facilities, but we are here concerned only with the small subset of those facilities recognized by all schema languages.
1069 SG-ass1 Any XML processor encountering this statement will use it to add the two named entities it defines to those already predefined for XML. Before the document instance itself is validated, any references to these entities will be expanded to the character string given. Thus, wherever in the document instance the string
1072 SG-ass1 And, indeed, for those responsible for deciding the licensing conditions if they change their minds later.
1075 SG-ass1 following the string DOCTYPE in this example is, of course, the name of the root element of the document to which this declaration is prefixed; however, only an XML DTD processor will take note of this fact.
1088 SG-assoc points to the location of the schema. This is the only mandatory pseudo-attribute, but others can be added to give more information about the schema:
1094 SG-assoc This example includes a standard schema in XML Schema format, along with a Schematron schema which might be used for checking the format and linking of names.
1098 SG-assoc Any modern XML processing software tool will provide convenient methods of validating documents which are appropriate to the particular schema language chosen. In the interests of maximizing portability of document instances, they should contain as little processing-specific information as possible.
1103 SG-mult As we have already indicated, a single XML document may be made up of several different operating system files that need to be pulled together by a processor before the whole document can be validated. The XML DTD language defines a special kind of entity (a
1105 SG-mult ) that can be used to embed references to whole files into a document for this purpose, in much the same way as the character or string entities discussed in
1112 SG-mult defines a generic mechanism for this purpose, which is supported by an increasing number of XML processors.
1116 SG-style As mentioned above, the processing of an XML document will usually involve the use of one or more stylesheets, often but not exclusively to provide specific details of how the document should be displayed or rendered. In general, there is no reason to associate a document instance with any specific stylesheet and the schema languages we have discussed so far do not therefore make any special provision for such association. The association is made when the stylesheet processor is invoked, and is thus entirely application-specific.
1118 SG-style However, since one very common application for XML documents is to serve them as browsable documents over the Web, the W3C has defined a procedure and a syntax for associating a document instance with its stylesheet (see
1119 SG-style ). This Recommendation allows a document to supply a link to a default stylesheet and also to categorize the stylesheet according to its
1121 SG-style , for example to indicate whether the stylesheet is written in CSS or XSLT, using a specialized form of processing instruction.
1125 SG-style which is available from the same location as the anthology itself, we could make it available over the Web simply by adding a processing instruction like the following to the anthology:
1128 SG-style Multiple stylesheets can be defined for the same document, and options are available to specify how a web browser should select amongst them. For example, if the document also contained a directive:
1132 SG-style could be used when rendering the document on a handheld device such as a mobile phone.
1134 SG-style Most modern web browsers support CSS (although the extent of their implementation varies), and some of them support XSLT.
1138 SG-val As we noted above, most schema languages provide some degree of datatype validation for attribute values (
1139 SG-val ). They vary greatly in the validation facilities they offer for the content of elements, other than the syntactic constraints already discussed. Thus, while we may very easily check that our
1145 SG-val elements contain between five and 500 correctly-spelled English words, should we wish to constrain our poetry in such a way. Also, because attributes and elements are treated differently, it is difficult or impossible to express co-occurrence constraints: for example, if the
1153 SG-val The XML DTD language offers very little beyond syntactic checking of element content. By contrast, a major impetus behind the design and development of the W3C schema language was the addition of a much more general and powerful constraint language to the existing structural constraints of XML DTDs. In RELAX NG the opposite approach was taken, in that all datatype validation, whether of attributes or element content, is regarded as external to the schema language. For attributes, as we have seen, RELAX NG makes use of the W3C Schema Datatype Library (but permits use of others). Because RELAX NG treats both elements and attributes as special cases of patterns, the same datatype validation facilities are available for element content as for attribute values; it is unlike other schema languages in this respect. In addition, for content validation, a different component of DSDL known as Schematron can be used. Schematron is a pattern matching (rather than a grammar-based) language, which allows us to test the components of a document against templates that express constraints such as those mentioned above.
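
The SG-er, SGname and SG-ms rows above describe character references, declared string entities, namespaces and marked sections. The following is a minimal sketch of how a non-validating parser treats each of them, using only Python's standard library. The sample document and the anthology namespace name are hypothetical and are not taken from the Guidelines; only the SVG namespace name is real.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace name for the anthology; the SVG name is the real one.
ANTHOLOGY_NS = "http://www.example.org/ns/anthology"
SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative document: an internal DTD subset declares the TEI entity,
# the root element sets a default namespace and binds the svg prefix.
SAMPLE = f"""<!DOCTYPE anthology [
  <!ENTITY TEI "Text Encoding Initiative">
]>
<anthology xmlns="{ANTHOLOGY_NS}" xmlns:svg="{SVG_NS}">
  <p>&TEI;: caf&#xE9; = caf&#233;, &amp; and &lt;</p>
  <p><![CDATA[<poem> & <line> are plain text here]]></p>
  <svg:svg><title>diagram</title></svg:svg>
</anthology>"""

root = ET.fromstring(SAMPLE)

# ElementTree reports element names as "{namespace-name}local-name".
paras = root.findall(f"{{{ANTHOLOGY_NS}}}p")

# Entity and character references are expanded before we see the text.
expanded = paras[0].text

# Inside the marked section, markup recognition is switched off.
cdata_text = paras[1].text

# The prefix applies to the element that carries it, not to its children:
# the unprefixed <title> stays in the default (anthology) namespace.
svg = root.find(f"{{{SVG_NS}}}svg")
child_tag = svg[0].tag

print(expanded)
print(cdata_text)
print(svg.tag, child_tag)
```

The hexadecimal and decimal references resolve to the same character, and the namespace names are compared purely as strings; nothing is fetched from either URI.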

BIB-Bibliography.xml#13216

# id text
23 VEMEana-eg-23 Doglia mi reca ne lo core ardire
79 TSSASE-eg-20 Structures of social action: Studies in conversation analysis
358 NDPER-eg-17 membrane 5, entry 154
472 VEST-eg-4 2nd edition
597 DIC-CP Collins Pocket Dictionary of the English language
617 SA-BIBL-2 Orbis Pictus: a facsimile of the first English edition of 1659
634 PHegsurp2 Poeti del Duecento
888 COEDADD-eg-89 The waste land: a facsimile and transcript of the original drafts including the annotations of Ezra Pound
918 DS-eg-05 Is there a text in this class? The authority of interpretive communities
957 FTGRA-eg-18 2nd edition
1041 COHQU-eg-43 Natural language processing in Prolog
1292 DRSTA-eg-40 Everyman's library: the drama
1324 COBICOR-eg-248 ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure
1508 COHQQ-eg-33 note 12
1637 DRPRO-eg-7 epilogue
1671 STGA-eg-9 Crofts American history series
1740 TSBA-eg-19 The approach of the Text Encoding Initiative to the encoding of spoken discourse
1760 MS-eg-001 A summary catalogue of western manuscripts in the Bodleian Library at Oxford which have not hitherto been catalogued ...
1770 MS-eg-001 P5-MS: A general purpose tagset for manuscript description
1800 STGA-eg-10 Crofts American history series
1968 TSSASE-eg-37 Report on the compatibility of J P French's spoken corpus transcription conventions with the TEI guidelines for transcription of spoken texts
1995 GDFT-eg-12 Partial family tree for Bertrand Russell
2366 DSBACK-eg-83 index to vol. 1
2600 WHITMS1 "[I am a curse]" in
2606 WHITMS2 Single leaf of Notes for a poem about night "visions," possibly related to the untitled 1855 poem that Whitman eventually titled "The Sleepers." Fragments of an unidentified newspaper clipping about the Puget Sound area have been pasted to the leaf. The Trent Collection of Walt Whitman Manuscripts, Duke University Rare Book, Manuscript, and Special Collections Library.
3818 Burnard1995b The Design of the TEI Encoding Scheme
4487 SG-BIBL-2 Refining our notion of what text really is: the problem of overlapping hierarchies
4756 CO-BIBL-1 An international handbook of the science of language and society
4923 TS-BIBL-3 TEI document TEI AI2 W1
5068 DI-BIBL-3 TEI working paper TEI AIW20
5171 DI-BIBL-6 Principles for Encoding machine readable dictionaries
5225 DI-BIBL-8 Electronic dictionary encoding: customizing the TEI Guidelines
5769 NH-BIBL-7 The layered markup and annotation language
5821 FS-BIBL-01 A rationale for the TEI recommendations for feature-structure markup,
5888 ISO-690 ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure
5900 ISO-12620 ISO 12620:2009: Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources
5923 RICA Istituto Centrale per il Catalogo Unico
5925 RICA Regole italiane di catalogazione per autori
5994 BIB-RDG The following lists of readings in markup theory and the TEI derive from work originally prepared by Susan Schreibman and Kevin Hawkins for the TEI Education Special Interest Group, recoded in TEI P5 by Sabine Krott and Eva Radermacher. They should be regarded only as a snapshot of work in progress, to which further contributions and corrections are welcomed (see further
6469 Burnard1999 Closing plenary address at the XML Europe Conference, Granada, May 1999
6547 Burnard2001a Dalle «Due Culture» Alla Cultura Digitale: La Nascita del Demotico Digitale
6663 Burnard2005b Metadata for corpus work
7623 Pichler1995 Culture and Value: Philosophy and the Cultural Sciences. Beiträge des 18. Internationalen Wittgenstein Symposiums 13–20. August 1995 Kirchberg am Wechsel
7626 Pichler1995 Kirchberg am Wechsel
8533 Unsworthetaleds2004 TEI Consortium
8670 BIB-RDG TEI
8780 BaumanandCatapano1999 TEI and the Encoding of the Physical Structure of Books
8810 Bauman2005 TEI HORSEing Around
8889 Burnard1993 Rolling your own with the TEI
9005 Burnard1997 Prepared for a seminar on Etiquetación y extracción de información de grandes corpus textuales within the Curso Industrias de la Lengua (14–18 de Julio de 1997). Sponsored by the Fundacion Duques de Soria.
9022 BurnardandPopham1999 Putting Our Headers Together: A Report on the TEI Header Meeting 12 September 1997
9084 Ciottied2005 Il Manuale TEI Lite: Introduzione Alla Codifica Elettronica Dei Testi Letterari
9104 Chang2001 The Implications of TEI
9150 DigitalLibraryFederation1998 TEI and XML in Digital Libraries: Meeting June 30 and July 1, 1998, Library of Congress, Summary/Proceedings
9167 DigitalLibraryFederation2007 TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices
9264 Loiseaunodate Les standards : autour d'XML et de la TEI
9288 MarkoandKelleher2001 Descriptive Metadata Strategy for TEI Headers: A University of Michigan Library Case Study
9318 Mertz2003 XML Matters: TEI — the Text Encoding Initiative
9431 Rahtz2003 Building TEI DTDs and Schemas on demand
9462 Rahtzetal2004 A unified model for text markup: TEI, Docbook, and beyond
9521 Robinsonnodate Making a Digital Edition with TEI and Anastasia
9539 Seaman1995 The Electronic Text Center Introduction to TEI and Guide to Document Preparation
9558 Simons1999 Using Architectural Forms to Map TEI Data into an Object-Oriented Database
9588 Smith1999 Textual Variation and Version Control in the TEI
9720 Vanhoutte2004 An Introduction to the TEI and the TEI Consortium

CH-LanguagesCharacterSets.xml#13235

# id text
4 CH The documents which users of these Guidelines may wish to encode encompass all kinds of material, potentially expressed in the full range of written and spoken human languages, including the extinct, the non-existent, and the conjectural. Because of this wide scope, special attention has been paid to two particular aspects of the representation of linguistic information often taken for granted: language identification and character encoding.
6 CH Even within a single document, material in many different languages may be encountered. Human culture, and the texts which embody it, is intrinsically multilingual, and shows no sign of ceasing to be so. Traditional philologists and modern computational linguists alike work in a polyglot world, in which code-switching (in the linguistic sense) and accurate representation of differing language systems constitute the norm, not the exception. The current increased interest in studies of linguistic diversity, most notably in the recording and documentation of endangered languages, is one aspect of this long standing tradition. Because of their historical importance, the needs of endangered and even extinct languages must be taken into account when formulating Guidelines and recommendations such as these.
8 CH Beyond the sheer number and diversity of human languages, it should be remembered that in their written forms they may deploy a huge variety of scripts or writing systems. These scripts are in turn composed of smaller units, which for simplicity we term here characters. A primary goal when encoding a text should be to capture enough information for subsequent users of it to identify correctly the language, script, and constituent characters. In this chapter we address this requirement, and propose recommended mechanisms to indicate the languages, scripts and characters used in a document or a part thereof.
10 CH Identification of language is dealt with in
11 CH . In summary, it recommends the use of pre-defined identifiers for a language where these are available, as they increasingly are, in part as a result of the twin pressures of an increasing demand for language-specific software and an increased interest in language documentation. Where such identifiers are not available or not standardized, these Guidelines recommend a way of documenting language identifiers and their significance, in the same way as other metadata is documented in the TEI header.
13 CH Standardization of the means available to represent characters and scripts has moved on considerably since the publication of the first version of these Guidelines. At that time, it was essential to explicitly document the characters and encoded character sets used by almost any digital resource if it was to have any chance of being usable across different computer platforms or environments, but this is no longer the case. With the availability of the Unicode standard, more than 110,000 different characters representing almost all of the world's current writing systems are available and usable in any XML processing environment without formality. Nevertheless, however large the number of standardized characters, there will always be a need to encode documents which use non-standard characters and glyphs, particularly but not exclusively in historical material. Furthermore, the full potential of Unicode is not yet realized in all software which users of the Guidelines are likely to encounter. The second part of this chapter therefore discusses in some detail the concepts and practice underlying this standard, and also introduces the methods available for extending beyond it, which are more fully discussed in
18 CHSH Identification of the language a document or part thereof is written in is a crucial requirement for many envisioned usages of an electronic document. The TEI therefore accommodates this need in the following way:
22 CHSH is defined for all TEI elements. Its value identifies the language and writing system used.
24 CHSH The TEI header has a section set aside for the information about the languages used in a document: see further
28 CHSH The value of the attribute
30 CHSH identifies the language using a coded value. For maximal compatibility with existing processes, modelling this value in the following way is recommended (this parallels the modelling of
34 CHSH The identifier for the language should be constructed as in
41 CHSH element in the TEI header, if one is present.
46 CHSH , and proposes the following mechanism for constructing an identifier (tag) for languages as administered by the Internet Assigned Numbers Authority (IANA). The tag is assembled from a sequence of subtags separated by the hyphen (-, U+002D) character. It gives the language (possibly further identified with a sublanguage), a script and a region for this language, each possibly followed by a variant subtag.
48 CHSH The authoritative list of registered subtags is maintained by IANA and is available at
49 CHSH . For a good general overview of the construction of language tags, see
53 CHSH In addition to the list of registered subtags, both BCP 47 and ISO 639-2 provide extensions that can be employed by private convention. The constructs provided can thus be used to generate identifiers for any language, past and present, used in any area of the world. If such private extensions are used within the context of the TEI, they should be documented within the
55 CHSH element of the TEI header, which might also provide a prose description of the language described by the language tag.
57 CHSH While language, region and script can be adequately identified using this mechanism, there is only very rough provision to express a dimension of time for the language of a document; those codes provided (e.g.
61 CHSH in ISO 639-2) might not reflect the segments appropriate for a text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in BCP 47 and relate that to a
65 CHSH section of the TEI header.
67 CHSH Equivalences to language identifiers by other authorities can be given in the
71 CHSH The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the
73 CHSH attribute, including all elements and all attributes where a language might apply.
74 CHSH This will exclude all attributes where a non-textual datatype has been specified, for example tokens, boolean values or predefined value lists.
81 CH All document encoding has to do with representing one thing by another in an agreed and systematic way. Applied to the smallest distinctive units in any given writing system, which for the moment we may loosely call
88 D4-41 When the first methods of representing text for storage or transmission by machines were devised, long before the development of computers, the overriding aim was to identify the smallest set of symbols needed to convey the essential semantic content, and to encode that symbol set in the most economical way that the storage or transmission media allowed. The initial outcomes were systems that encoded only such content as could be expressed in uppercase letters in the Latin script, plus a few punctuation marks and some
92 D4-41 For many years after the invention of computers, the way they represented text continued to be constrained by the imperative to use expensive resources with maximal efficiency. Even when storage and processing costs began their dramatic fall, the Anglo-centric outlook of most hardware designers and software engineers hampered initiatives to devise a more generous and flexible model for text representation. The wish to retain compatibility with
94 D4-41 data was an additional disincentive. Eventually, tension in East Asia between commitment to technological progress and the inability of existing computers to cope with local writing systems led to decisive developments. Japanese, Korean and Chinese standards bodies, who long before the advent of computers had been engaged in the specification of character sets, joined with computer manufacturers and software houses to devise ways of mapping those character sets to numeric encodings and processing the resulting text data.
96 D4-41 Unfortunately, in the early years there was little or no co-ordination among either the national standards bodies or the manufacturers concerned, so that although commercial necessity dictated that these various local standards were all compatible with the representation of US-American English, they were not straightforwardly compatible with one another. Even within Japan itself there emerged a number of mutually incompatible systems, thanks to a mixture of commercial rivalry, disagreements about how best to manage certain intractable problems, and the fact that such pioneering work inevitably involved some false starts, leading to incompatibilities even between successive products of the same bodies. Roughly at the same time, and for similar reasons, multiple and incompatible ways of representing languages that use Cyrillic scripts were devised, along with methods of encoding ancient writing systems which inevitably could not aim for compatibility with other writing systems apart from basic Latin script. Many of the earliest projects that fed into the TEI were shaped in this developmental phase of the computerized representation of texts, and it was also the context in which SGML was devised and finalized.
98 D4-41 SGML had of necessity to offer ways of coping with multiple writing systems in multiple representations; or rather, it provided a framework within which SGML-compliant applications capable of handling such multiple representations might be developed by those with sufficient financial and personnel resources (such as are seldom found in academia). Earlier editions of these Guidelines offered advice on character set and writing system issues addressed to the condition of those for whom SGML was the only feasible option. That advice must now be substantially altered because of two closely-related developments: the availability of the ISO/Unicode character set as an international standard, and the emergence of XML and related technologies which are committed to the theory and practice of character representation which Unicode embodies.
118 D4-42 will not of itself take us very far towards greater terminological precision. It tends to be used to refer indiscriminately both to the visible symbol on a page and to the letter or ideograph which that symbol represents, two things that it is essential to keep conceptually distinct. The visible symbol obviously has some aspects by which we interpret it as representing one character rather than another; but its appearance may also be significantly determined by features that have no effect on our notion of which character in a writing system it represents. A familiar instance is the lowercase
122 D4-42 symbol (
123 D4-42 cf. figure 1
127 D4-42 figure 1
129 D4-42 abstract character
136 D4-42 in a serif typeface has additional strokes that are absent from the same letter when printed using a sans-serif typeface, so that once again we have differing glyphs standing for the same abstract character. In
137 D4-42 there is even a font, Capitals Regular, in which the glyph for the lowercase letter
139 D4-42 looks like a typical glyph for the character uppercase
141 D4-42 . The distinction between abstract characters and glyphs is fundamental to all machine processing of documents.
143 D4-42 In most scholarly encoding projects, the accurate recording of the abstract characters which make up the text is of prime importance, because it is the essential prerequisite of digitizing and processing the document without semantic loss. In many cases (though there are important exceptions, to be touched on shortly) it may not be necessary to encode the specific glyphs used to render those abstract characters in the original document. An encoding that faithfully registers the abstract characters of a document allows us to search and analyse our document's content, language and structure and access its full semantics. That same encoding, however, may not contain sufficient information to allow an exact visual representation of the glyphs in the source text or manuscript to be recreated.
145 D4-42 The importance of this distinction between information content and its visual representation is not always immediately apparent to people unused to the specific complexities of text handling by machine. Such users tend to ask first what (in order of conceptual priority) should actually be their very last question: how do I get a physical image that looks like character x in my source document to appear on the screen or the output page? Their first question should in fact be: how can I get an abstract representation of character x into my encoded document in a way that will be universally and unambiguously identifiable, no matter what it happens to look like in printout or on any particular display? And occasionally the response they receive as a result of their misguided initial question is a custom
147 D4-42 that satisfies their immediate rendering wishes at the price of making their underlying document unintelligible to other users (or even to the original user in other times and places) because it encodes the abstract character in an idiosyncratic way.
149 D4-42 That said, there will certainly be documents or projects where it is a matter of scholarly significance that the compositor or scribe chose to represent a given abstract character using one particular glyph or set of strokes rather than a semantically-equivalent but visually distinct alternative, and in that case the specific appearance of the form will have to be encoded in one way or another. But that encoding need not (and in most cases will not) involve a notation that visually resembles the original, any more than italicized text in an original document will be represented by the use of italic characters in the encoded version.
151 D4-42 A collection of the abstract characters needed to represent documents in a given writing system is known as a
152 D4-42 character set
153 D4-42 , and the character set or
155 D4-42 of a processing or rendering device is the set of abstract characters that it is equipped to recognize and handle appropriately. There is, however, a subtle distinction between these two parallel uses of the same term, involving one more key concept which it is essential to grasp. The character set of a document (or the writing system in which it is recorded) is purely a collection of abstract characters. But the character set of a computing device is a set of abstract characters which have been mapped in a well-defined way to a set of numbers or
156 D4-42 code points
157 D4-42 by which the device represents those abstract characters internally. It can therefore be referred to as a
158 D4-42 coded character set
159 D4-42 , meaning a set of abstract characters each of which has been assigned a numerical code point (or in some instances a sequence of code points) which unambiguously identifies the character concerned.
161 D4-42 It is now possible to use this terminology to say what Unicode is: it is a coded character set, devised and actively maintained by an international public body, where each abstract character is identified by a unique name and assigned a distinctive code point.
162 D4-42 Although only Unicode is mentioned here explicitly, it should be noted that the character repertoire and assigned code points of Unicode and the ISO standard 10646 are identical and maintained in a way that ensures this continues to be the case.
163 D4-42 Unicode is distinguished from other, earlier and co-existing coded character sets by its (current and potential) size and scope; its built-in provision for (in practical terms) limitless expansion; the range and quality of linguistic and computational expertise on which it draws; the commitment in principle (and to an increasing degree in practice) to implement it by all important providers of hardware and software worldwide; and the stability, authority and accessibility it derives from its status as an international public standard.
169 D4-43 The distinction between abstract characters and glyphs can be crucial when devising an encoding scheme. Users performing text retrieval, searching or concordancing will expect the system to recognize and treat different glyphs as instances of the same character; but when perusing the text itself they may well expect to see glyph variants preserved and rendered. When encoding a pre-existing text, the encoder must determine whether a particular letter or symbol is a character or a glyphic variant. A detailed model of the relationship between characters and glyphs has been developed within the Unicode Consortium and an ISO work group (ISO/IEC JTC1 SC2/WG2). Its report (
171 D4-43 ) will form the base for much future standards work.
173 D4-43 The model makes explicit the distinction between two different properties of the components of written language:
175 D4-43 their content, i.e. their meaning and phonetic value (represented by a character)
181 D4-43 When searching for information, a system generally operates on the content aspects of characters, with little or no attention to their appearance. A layout or formatting process, on the other hand, must of necessity be concerned with the exact appearance of characters. Of course, some operations (hyphenation for example) require attention to both kinds of feature, but in general the kind of text encoding described in these Guidelines tends to focus on content rather than appearance (see further
186 D4-43 the level of character encoding, using an appropriate Unicode code point to represent the glyph concerned
188 D4-43 the markup level, with the glyph indicated via appropriate elements and/or attributes
192 D4-43 The encoding practice adopted may be guided by, among other things, an assessment of the most frequent uses to which the encoded text will be put. For example, if recognition of identical characters represented by a variety of glyphs is the main priority, it may be advisable to represent the glyph variations at markup level, so that the character value can be immediately exposed to the indexing and retrieval software. Plainly, an encoding project will need to consider such issues carefully and embody the outcome of its deliberations in local manuals of procedure to ensure encoding consistency. Using Unicode code points to represent glyph information requires that such choices be documented in the TEI header. Such documentation cannot of itself guarantee proper display of the desired glyph but at least makes the intention of the encoder discoverable.
194 D4-43 At present the Unicode Standard does not offer detailed specifications for the encoding of glyph variations. These Guidelines do give some recommendations; some discussion of related matters is given in
204 D4-44 (IMEs) commonly used for the entry of logographic characters. This is most likely to be convenient where the display used for text entry and/or the printer used to produce output for proofreading purposes is capable of rendering the characters concerned using correct and readily identifiable glyphs. Where such easily checkable rendering is not available, or where there is no suitable method of inputting certain characters directly, they may be input by one of two possible forms of indirect notation or
208 D4-44 The first form of reference is a
210 D4-44 (NCR), which takes the general form
214 D4-44 is an integer representing the code point of the character in base 10, or
218 D4-44 is the code point in hexadecimal notation. This has the advantage that no declaration of what this notation means is required anywhere in the document instance or its associated schema. Every XML processor is capable of recognising NCRs and replacing them with the required code point value without needing access to any additional data. The disadvantage of NCRs as a means of entering, representing and proofing character data is that most human beings find them anything but
222 D4-44 The second form of reference is a
226 D4-44 that could be distinctively recognized by a processing system). Character entity references can (and indeed should) have names whose significance is apparent to humans, but each and every entity name has to be associated with its replacement (which as explained below should be a character value, possibly in the form of an NCR) via a formal declaration in the document's internal or external subset. This, however, is not needed for Character Entities defined by the XML standard, namely & (&amp;), > (&gt;), < (&lt;), ' (&apos;), and " (&quot;). For a large number of characters defined by Unicode and commonly used in documents, there are ISO entity sets declaring mnemonic names which should be used wherever feasible: XML compatible character entity declarations using ISO names and suitable for inclusion into the subset are available on the TEI web sites.
228 D4-44 Where characters are not defined in Unicode and so have to be assigned both a local code point and a local entity name of the project's choosing (see
229 D4-44 below) it is highly desirable to follow the same nomenclature principles as ISO and to emulate the practice in the ISO character entity declarations of appending a string giving the character a unique descriptive name as a comment to the actual entity declaration. In addition, where different groups or projects are working on texts with geographical, historical, linguistic or other similarities that give rise to common issues of character encoding, it is highly advisable in the interests of consistency that they should consult one another when devising entity names. The TEI mailing list may provide a suitable first point of contact for such consultations. Further advice on the matter of locally-defined characters is contained in
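The behaviour of numeric character references described in the rows above can be checked with a minimal sketch. It assumes Python's standard-library ElementTree parser as a stand-in for "any XML processor"; the sample string and variable names are illustrative only and do not come from the Guidelines.

```python
import xml.etree.ElementTree as ET

# Two references to the same code point, U+00E9:
# one in base 10 (&#233;) and one in hexadecimal (&#xE9;).
doc = "<p>caf&#233; and caf&#xE9;</p>"

# No declaration is needed: the parser replaces each NCR with the character itself.
text = ET.fromstring(doc).text

decimal_form, hex_form = text.split(" and ")
same = decimal_form == hex_form == "caf\u00e9"
```

Both notations resolve to the same abstract character, so the choice between them is purely a matter of convenience for the person entering or proofing the text.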
237 D4-45a Rendering of the encoded text is a complicated process that depends largely on the purpose, external requirements, local equipment and so forth; it is thus outside the scope of these Guidelines.
239 D4-45a It might nevertheless be helpful to put some of the terminology used for the rendering process in the context of the discussion of this chapter. As was mentioned above, Unicode encodes abstract characters, not specific glyphs. For any process that makes characters visible, however, concrete, specifically designed glyph shapes have to be used. For a printing process, for example, these shapes describe exactly at which point ink has to be put on the paper and which areas have to be left blank. If we want to print a character from the Latin script, this process requires, besides the selection of the overall glyph shape, the selection of a specific weight of the font, a specific size, and the degree to which the shape should be slanted. Beyond individual characters, the overall typesetting process also follows specific rules of how to calculate the distance between characters, how much whitespace occurs between words, at which points line breaks might occur and so forth.
241 D4-45a If we concern ourselves only with the rendering process of the characters themselves, leaving out all these other parameters, we will realize that of all the information required for this process, only a small amount will be drawn from the encoded text itself. This information is the code point used to encode the character in the document. With this information, the font selected for printing will be queried to provide a glyph shape for this character. Some modern font formats (e.g. OpenType) do implement a sophisticated mapping from a code point to the glyph selected, which might take into account surrounding characters (to create ligatures where necessary) and the language or even area this character is printed for to accommodate different typesetting traditions and differences in the usage of glyphs.
243 D4-45a A TEI document might provide some of the information that is required for this process, for example by identifying the linguistic context with the
245 D4-45a attribute. The selection of fonts and sizes is usually done in a stylesheet, while the actual layout of a page is determined by the typesetting system used. Similarly, if a document is rendered for publication on the Web, information of this kind can be shipped with the document in a stylesheet.
252 D4-45b The devisers of the XML standard took the view that Unicode should be the only means of representing abstract characters which conformant XML processors were obliged to support. That certainly does not preclude the use of other character encoding schemes or character sets in documents which are to be handled by XML processors, but it does mean that all the abstract characters which are encoded as characters (as distinct from being represented indirectly via markup) in an XML document must either possess an assigned code point within the public Unicode standard, or be assigned a code point devised by and specific to the local project, taken from a reserved range set aside by the standard expressly for this purpose, the so-called
254 D4-45b or PUAs. For the vast majority of projects to which these Guidelines are applicable, the Unicode standard will already offer code points for all the abstract characters their documents employ, and so the requirement that all such characters should be resolvable by XML processors to Unicode code points will not involve any representation via markup or use of PUA code points. Indeed, such projects are not obliged by their choice of XML to use Unicode in their documents. Provided they correctly declare at the requisite points any non-Unicode coded character set they may use, ensure that all their XML processors support their declared encoding, and then consistently employ that encoding in strict conformity with their declarations, they need not consciously concern themselves with Unicode unless and until they feel it is appropriate to do so.
259 D4-45-1 There are, however, strict limits to the way conformant XML processors handle documents whose character set is not Unicode, and unless these limits are understood it is likely that projects not yet ready to commit to Unicode across the board will run into unexpected and baffling problems as they attempt to operate with their legacy character encodings. First, it must be repeated that nothing in the XML standard
261 D4-45-1 conformant processors to handle non-Unicode documents. But even if there were any actual processors which on that basis refused to process non-Unicode documents, that would not limit their usefulness as severely as might at first appear. The reason is that there is a way of internally representing Unicode code points (explained further in
262 D4-45-1 below) where there is no detectable difference between a document which is actually encoded in ASCII employing only 7-bit values and one which is encoded in Unicode but which happens to contain only the abstract characters encompassed by the 7-bit ASCII standard. And the XML standard specifies that this way of representing Unicode is the one which processors must assume as the default for any document that does not explicitly declare an encoding. At a stroke, this provision ensures that all pure 7-bit ASCII encoded documents can be processed without further ado by all conformant XML processors. Add to this the provision, also within the XML standard, that allows any Unicode code point to be indirectly specified using only 7-bit ASCII characters via a Numeric Character Reference (NCR), and the upshot is that all documents in non-Unicode encodings which can be pre-processed to rewrite any characters outside the 7-bit ASCII range as Unicode code points in NCR notation (a simple batch procedure for which software is readily available) can be handled even by processors which have no inbuilt support for any encoding other than Unicode.
266 D4-45-1 To avoid confusion when taking advantage of such encoding support, it is first of all essential to grasp that an encoding declaration in an XML document is indeed simply a declaration: it is not an incantation that magically converts the document that follows into the encoding concerned. It is a common error to think that simply declaring a document's encoding to be, say, ISO-8859-1 (or for that matter UTF-8 or UTF-16, the representations of Unicode for which support is mandatory) is sufficient to
268 D4-45-1 . Such a declaration is useless unless the document that follows actually is encoded strictly in conformance with the declaration. Some of the circumstances in which that may not in fact be the case are outlined in
269 D4-45-1 below. Secondly, an encoding declaration does not somehow switch an XML processor into a mode where it works entirely in the declared encoding for as long as the declaration is in scope. On the contrary, all it does is instruct the processor to pass its input through a filter that immediately converts all the code points in the declared encoding into their Unicode counterparts; from that point onwards the document as seen by all subsequent stages of processing is actually in Unicode, even though that may not be apparent to the user. Thirdly, this invariable internal conversion has a crucial consequence: the fact that a processor can successfully accept a document in a non-Unicode encoding does not mean that it will necessarily convert any output it may produce back into the declared input encoding. Internally, the document has been converted to and processed in Unicode, and there is nothing in the XML standard that requires the reverse conversion to be performed at the output stage. Most processors go beyond the standard by offering a facility to output in various encodings: but whether it is available and how to use it must be ascertained from the processor's documentation. Should it be unavailable or unreliable, the output may need to be post-processed through a character converter to restore the original encoding, and again such software is freely available and easy to use.
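The three points made above about encoding declarations can be illustrated with a short sketch. It assumes Python's standard-library ElementTree (built on the expat parser) as one example of a processor; the sample documents and variable names are hypothetical, and other processors may expose output encoding differently.

```python
import xml.etree.ElementTree as ET

# A document that really is encoded in ISO-8859-1, and says so.
declared_latin1 = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")

root = ET.fromstring(declared_latin1)
internal_text = root.text  # already Unicode; the input encoding is no longer visible

# The output encoding is a separate choice made at serialization time.
as_utf8 = ET.tostring(root, encoding="utf-8")
as_ascii = ET.tostring(root)  # this library's default is us-ascii, using NCRs

# A declaration does not convert anything: the byte 0xE9 below is ISO-8859-1,
# but the document claims to be UTF-8, so the parser rejects it.
mislabelled = b'<?xml version="1.0" encoding="UTF-8"?><p>caf\xe9</p>'
try:
    ET.fromstring(mislabelled)
    mislabelled_ok = True
except ET.ParseError:
    mislabelled_ok = False
```

In this sketch the default serialization happens to produce pure 7-bit ASCII with a numeric character reference for the accented letter, which is the same form the pre-processing step discussed earlier would yield.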
275 D4-45-2 In the cases considered in the preceding section, there was a suitable Unicode code point corresponding to each abstract character contained in the non-Unicode character set of the input document. In such instances, the mandatory internal conversion to Unicode carried out by the processor can be more or less transparent to a user who wishes to continue to work with a non-Unicode character set. Things become rather different when the non-Unicode character set contains abstract characters for which there is no code point in the Unicode standard, or when a project that is attempting to work in Unicode throughout finds that it needs to represent abstract characters not currently provided for in the Unicode standard. Here, a significant difference between SGML and XML emerges in a rather troublesome way.
277 D4-45-2 Following their agenda to devise a subset of SGML that would be significantly easier to implement, the authors of the XML specification decided that one particular type of entity available in SGML, known as an internal SDATA entity, should not be carried over into XML. It would be idle to question that decision here, but its consequences for the handling of abstract characters for which there is no Unicode definition were significant.
279 D4-45-2 The procedures recommended in earlier versions of these Guidelines for encoding, processing and exchanging what we might call locally defined abstract characters were reliant on the availability of entities declared as of type SDATA, but that type is not supported in XML, and there is therefore no ready equivalent for XML-based projects to the recommendations previously offered.
280 D4-45-2 In essence, when an SGML parser encounters a reference to an entity of type SDATA, it supplies to the application which it is servicing the name of that entity, as found in the document, plus a pointer to a location somewhere on the local system, and what is present at that location may in turn allow or instruct the application to do one of a number of things, including looking up the entity name in a table and deriving information about the referenced entity which can trigger specific behaviours in the application appropriate to the processing of that abstract character. There is however no way to make an XML parser do anything of the kind in response to an entity reference.
281 D4-45-2 Entities in XML are really only of two basic types, parsed and unparsed. Unparsed entities are of no relevance here. References to parsed entities in an XML document result in only one kind of behaviour: when they appear in the parser's input stream, the parser expects to be able to resolve them by locating a declaration in the document's internal or external subset which maps the entity name to its replacement text. The parser then inserts that replacement text into the document in place of the entity reference, which is discarded without trace. The act of replacement is not notified to the application, except where it fails because the entity is undeclared or the declaration is in some way defective (in which case the parser signals a fatal error and stops).
283 D4-45-2 Though for explanatory convenience much XML-related documentation, including these Guidelines, refers specifically to Character Entities and Character Entity References, a character entity in XML is not a distinct
285 D4-45-2 in the sense that
287 D4-45-2 is understood in Computer Science terminology, for example when referring to the type of an attribute. Hence there is no way in which editing or other software can check that the replacement to be inserted is indeed a single character or its equivalent rather than an arbitrary chunk of text, possibly including markup. A character entity is simply a general entity whose replacement text happens to be declared as a character value or an NCR representing that value. This has two important consequences if it is proposed to use such an entity reference to stand for a character that has no Unicode equivalent. First, the entity name reference will disappear at an early stage in the parse and be replaced by the declared value of the entity, so that no processing which requires access in the parsed document to the entity reference as originally entered is possible. Secondly, if a character entity is to be used as a true equivalent to a normal character, and consequently be employed at all points in a document where a single character could legitimately occur (apart from in element and attribute names, where no references of any kind are allowed) then it is essential that its replacement value indeed be pure character data. If the replacement value of the entity were to contain any markup, or a processing instruction, there would be many places in a document where simple character data would be legitimate, but where the substitution of markup or some other replacement could cause the document to become invalid or malformed. Taken together, these considerations mean that the transparent use of a CER to stand for a non-Unicode character in an XML document is simply not possible.
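The way a parsed entity reference disappears during parsing can be seen in a small sketch. It assumes Python's standard-library ElementTree as the parser; the entity name eacute is chosen here to mirror the ISO naming convention, and the two sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal subset, with an NCR as its replacement text.
declared = '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'
resolved_text = ET.fromstring(declared).text  # the reference itself is gone

# The same reference with no declaration is a fatal error for this parser.
undeclared = "<p>caf&eacute;</p>"
try:
    ET.fromstring(undeclared)
    undeclared_parsed = True
except ET.ParseError:
    undeclared_parsed = False
```

After parsing, the application sees only the replacement character and has no record that an entity reference was ever present, which is the first of the two consequences noted above.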
299 D4-46-1 The principles of Unicode are judiciously tempered with pragmatism. This means, among other things, that the actual repertoire of characters which the standard encodes, especially those parts dating from its earlier days, includes a number of items which on a strict interpretation of the Unicode Consortium's theoretical approach should not have been regarded as abstract characters in their own right. Some of these characters are grouped
302 D4-46-1 . Ligatures are a case in point. Ligatures (e.g. the joining of adjacent lowercase letters
303 D4-46-1 s
307 D4-46-1 f
310 D4-46-1 in Latin scripts, whether produced by a scribal practice of not lifting the pen between strokes or dictated by the aesthetics of a type design) are representational features with no added semantic value beyond that of the two letters they unite (though for historians of typography their presence and form in a given edition may be of scholarly significance). However, by the time the Unicode standard was first being debated, it had become common practice to include single glyphs representing the more common ligatures in the repertoires of some typesetting devices and high-end printers, and for the coded character sets built into those devices to use a single code point for such glyphs, even though they represent two distinct abstract characters. So as to increase the acceptance of Unicode among the makers and users of such devices, it was agreed that some such pseudo-characters should be incorporated into the standard as compatibility characters. Nevertheless, if a project requires the presence of such ligatured forms to be encoded, this should normally be done via markup, not by the use of a compatibility character. That way, the presence of the ligature can still be identified (and, if desired, rendered visually) where appropriate, but indexing and retrieval software will treat the code points in the document as a simple sequential occurrence of the two constituent characters concerned and so correctly align their semantics with non-ligatured equivalents. Such ligatures should not be confused with digraphs (usually) indicating diphthongs, as in the French word "cœur". A digraph is an atomic orthographic unit representing an abstract character in its own right, not purely an amalgamation of glyphs, and indexing and retrieval software must treat it as such. Where a digraph occurs in a source text, it should normally be encoded using the appropriate code point for the single abstract character which it represents, either by direct entry of the character concerned or through the appropriate CER or NCR.
316 D4-46-2 The treatment of characters with diacritical marks within Unicode shows a similar combination of rigour and pragmatism. It is obvious enough that it would be feasible to represent many characters with diacritical marks in Latin and some other scripts by a sequence of code points, where one code point designated the base character and the remainder represented one or more diacritical marks that were to be combined with the base character to produce an appropriate glyphic rendering of the abstract character concerned. From its earliest phase, the Unicode Consortium espoused this view in theory but was prepared in practice to compromise by assigning single code points to
318 D4-46-2 characters which were already commonly assigned a single distinctive code point in existing encoding schemes. This means, however, that for quite a large number of commonly-occurring abstract characters, Unicode has two different, but logically and semantically equivalent encodings: a
320 D4-46-2 single code point, and a code point sequence of a base character plus one or more
323 D4-46-2 normalization
324 D4-46-2 of Unicode documents. Normalization is the process of ensuring that a given abstract character is represented in one way only in a given Unicode document or document collection. The Unicode Consortium provides four standard normalization forms, of which the Normalization Form C (NFC) seems to be most appropriate for text encoding projects. The NFC, as far as possible, defines conversions for all base characters followed by one or more combining characters into the corresponding precomposed characters. The World Wide Web Consortium has produced a document entitled
328 D4-46-2 , which among other things discusses normalization issues and outlines some relevant principles. An authoritative reference is Unicode Standard Annex #15
331 D4-46-2 . Individual projects will have to decide how far their decisions on normalization need be influenced by the fact that at present, by no means all hardware or software can correctly render (or even consistently identify) abstract characters encoded using combining symbols.
333 D4-46-2 It is important that every Unicode-based project should agree on, consistently implement and fully document a comprehensive and coherent normalization practice. As well as ensuring data integrity within a given project, a consistently implemented and properly documented normalization policy is essential for successful document interchange.
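The equivalence between precomposed and combining-sequence encodings, and the effect of Normalization Form C, can be sketched as follows. The example uses Python's standard unicodedata module; the choice of the letter e with acute accent is an illustrative assumption, not an example taken from the Guidelines.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # base letter e followed by COMBINING ACUTE ACCENT

# The two encodings are semantically equivalent but compare as different strings.
equal_before = precomposed == decomposed

# NFC converts the base-plus-combining sequence to the precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)
equal_after = nfc == precomposed
```

A project that normalizes all its documents to a single form before indexing or comparison avoids the mismatch shown by the first comparison.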
339 D4-46-3 In addition to the Universal Character Set itself, the Unicode Consortium maintains a database of additional character semantics
340 D4-46-3 . This includes names for each character code point and normative properties for it. Character properties, as given in this database, determine the semantics and thus the intended use of a code point or character. It also contains information that might be needed for correctly processing this character for different purposes. This database is an important reference in determining which Unicode code point to use to encode a certain character.
342 D4-46-3 In addition to the printed documentation and lists made available by the Unicode Consortium, the information it contains may also be accessed by a number of search systems over the Web (e.g.
343 D4-46-3 ). Examples of character properties included in the database include case, numeric value, directionality, and, where applicable, status as a
349 D4-46-3 . Where a project undertakes local definition of characters with code points in the PUA, it is desirable that any relevant additional information about the characters concerned should be recorded in an analogous way, as further discussed under
357 D4-47 An important difference between SGML and XML is that the latter allows for the processing of non-validated documents. Since validity and validation are central TEI concerns, it is unlikely that documents prepared according to these Guidelines will ever be designed or implemented as merely well-formed in the XML sense. However, in the domain of XML technologies, even where a document invokes a DTD or schema, it is not always necessarily the case that an XML processor will perform a full validation of it. XSLT transformation is a common case in point. By the workflow stage at which a document is handed off to an XSLT process for transformation, it is likely that its associated DTD or schema will already have fulfilled its role of integrity assurance and quality control, and so it may be undesirable to add validation to the processing overhead. For this reason, most XSLT processors do not attempt validation by default, even if a DTD or schema is declared and accessible. This can, however, create a problem where parsed entities (and character entities in particular in the present context) are referenced. A validating parser reads all entity declarations from the DTD (including those for character entities) in the initial phase of processing, so that they can be resolved as and when required. However, where no validation takes place, it cannot automatically be assumed that the parser will be able to resolve such entities in all circumstances. The XML standard requires a non-validating parser to read and act on entity declarations only if they are located within the document's internal subset (which does not, of course, mean that the entity declarations have to be manually merged into the document instance in advance of processing: character entity sets, for instance, count as being in the internal subset if they are placed there via a parameter entity, as is normal TEI practice). Some parsers when in non-validating mode will also access entity declarations in the external subset, but this behaviour is not mandated by the standard and should not be relied upon. Provided these facts are borne in mind, the presence of character entities in a document when parser validation is switched off should not cause any difficulties.
363 D4-48 In theory it should not be necessary for encoders to have any knowledge of the various ways in which Unicode code points can be represented internally within a document or in the memory of a processing system, but experience shows that problems frequently arise in this area because of mistaken practice or defective software, and in order to recognize the resulting symptoms and correct their causes an outline knowledge of certain aspects of Unicode internal representation is desirable.
368 D4-48-1 The code points assigned by Unicode 3.0 and later are notionally 32-bit integers, and the most straightforward way to represent each such integer in computer storage would be to use four eight-bit bytes. However, many of the code points for characters most commonly used in Latin scripts can be represented in one byte only, and the vast majority of the remainder which are in common use (including those assigned from the most frequently used PUA range) can be expressed in two bytes alone. This accounts for the use of UTF-8 and UTF-16 and their special place in the XML standard. UTF-8 and UTF-16 are economical ways of representing 32-bit code points.
369 D4-48-1 UTF-8 is a variable-length encoding: the more significant bits there are in the underlying code point (or in everyday terminology the bigger the number used to represent the character), the more bytes UTF-8 uses to encode it. What makes UTF-8 particularly attractive for representing Latin scripts, explaining its status as the default encoding in XML documents, is that all code points that can be expressed in seven or fewer bits (the 128 values in the original ASCII character set) are also encoded as the same seven or fewer bits (and therefore in a single byte) in UTF-8. That is why a document which is actually encoded in pure 7-bit ASCII can be fed to an XML processor without alteration and without its encoding being explicitly declared: the processor will regard it as being in the UTF-8 representation of Unicode and be able to handle it correctly on that basis.
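The variable-length behaviour described above can be observed directly. The following is a minimal sketch in Python (not part of the Guidelines); the sample characters and the helper name utf8_length are chosen purely for illustration.

```python
# Sample characters with progressively larger code points (illustrative choice):
# U+0041 (ASCII letter), U+00E9 (accented Latin letter),
# U+20AC (euro sign), U+1D11E (musical symbol beyond the 16-bit range).
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


# Map each sample character to its UTF-8 byte count.
lengths = {ch: utf8_length(ch) for ch in samples}

# Pure 7-bit ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```

Here lengths maps the four sample characters to 1, 2, 3 and 4 bytes respectively, and same_bytes is true, which is why an undeclared ASCII document is accepted as UTF-8.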
371 D4-48-1 However, even within the domain of Latin-based scripts, some projects have documents which use characters from 8-bit extensions to ASCII, e.g. those in the ISO-8859-n series of encodings, and the way characters which under ISO-8859-n use all eight bits are encoded in UTF-8 is significantly different, giving rise to puzzling errors. Abstract characters that have a
373 D4-48-1 byte code point where the highest bit is set (that is, they have a decimal numeric representation between 128 and 255) are encoded in ISO-8859-n as a
375 D4-48-1 byte with the same value as the code point. But in UTF-8 code-point values inside that range are expressed as a
377 D4-48-1 byte sequence. That is to say, the abstract character in question is no longer represented in the file or in memory by the same number as its code-point value: it is
379 D4-48-1 (hence the T in UTF) into a sequence of two different numbers. Now as a side-effect of the way such UTF-8 sequences are derived from the underlying code-point value, many of the single-byte eight-bit values employed in ISO-8859-n encodings are illegal in UTF-8.
381 D4-48-1 This complicated situation has a simple consequence which can cause great bewilderment. XML processors will effortlessly handle character data in pure 7-bit ASCII without that encoding needing to be declared to the parser, and will similarly accept documents encoded in an undeclared ISO-8859-n encoding if they happen to use no characters outside the strict ASCII subset of the ISO character sets; but the parse will immediately fail if an eight-bit character from an ISO-8859-n set is encountered in the input stream, unless the document's encoding has been explicitly and correctly declared. Explicitly declaring the encoding ought to solve the problem, and if the file is correctly encoded throughout, it will do so. But since text editors and word processors are currently acquiring different degrees of Unicode support at different rates, projects are likely to find that they have to deal with some files encoded in UTF-8 along with others in, say, ISO-8859-1. Such encoding differences may go unnoticed, especially if the proportion of characters where the internal encodings are distinguishable is relatively small (for example in a long English text with a smattering of French words). If in the process of document preparation two such files have been merged, or intermixed via
389 D4-48-1 Where erroneously mixed encodings are the source of such an error, altering the encoding declaration will not solve the problem, though it may obfuscate it. Eight-bit character codes in a file declared as UTF-8 will always stop the parser. More insidiously, UTF-8 sequences in a file declared as ISO-8859-1 will not halt the parse, but will cause data corruption, because the parser will silently but erroneously convert each byte in every UTF-8 sequence into a spurious separate character, introducing semantic errors which may not become apparent until much later in the processing chain.
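Both failure modes described above can be reproduced in a few lines. This is an illustrative Python sketch only; the sample word and the helper name decodes_as_utf8 are assumptions made for the example, and a real XML parser reports the first case as a parse error rather than a Python exception.

```python
# One word containing a single accented letter (U+00E9).
text = "caf\u00e9"

latin1_bytes = text.encode("iso-8859-1")  # one byte (0xE9) for the accented letter
utf8_bytes = text.encode("utf-8")         # two bytes (0xC3 0xA9) for the accented letter


def decodes_as_utf8(data: bytes) -> bool:
    """Report whether the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: an eight-bit ISO-8859-1 byte in data treated as UTF-8 is rejected.
latin1_rejected = not decodes_as_utf8(latin1_bytes)

# Case 2: UTF-8 bytes treated as ISO-8859-1 decode without any error,
# but each byte of the sequence becomes a spurious separate character.
garbled = utf8_bytes.decode("iso-8859-1")
```

In the second case the four-character word silently becomes five characters, which is the kind of corruption that tends to surface only later in the processing chain.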
391 D4-48-1 In projects that routinely handle documents in non-Latin scripts, everyone is well aware of the need to ensure correct and consistent encoding, so in such places mixed encoding problems seldom arise, and when they do are readily identified and remedied. Real confusion tends to arise, however, in projects which have a low awareness of the issues because they employ predominantly unaccented Latin characters, with only thinly-distributed instances of accented letters, or other
394 D4-48-1 non-breaking space
395 D4-48-1 ). Even, or especially, if such projects view themselves as concerned only with English documents, the close relationship between XML and Unicode means they will need to acquire an understanding of these encoding issues and develop procedures which assure consistency and integrity of encoding and its correct declaration, including the use of appropriate software for transcoding and verification.
401 D4-48-2 The advantages of UTF-8 as an internal representation of Unicode code points outlined above do not obtain where documents are in scripts other than Latin, Cyrillic or Hebrew. Where characters with code points in the sixteen-bit range (two-byte) predominate, UTF-8 is inappropriate, because it requires three or more bytes to represent each abstract character. Here the preferred representation of Unicode (which all XML-conformant parsers must support) is UTF-16, where each code point corresponding to an abstract character is represented in two eight-bit bytes
404 D4-48-2 values to represent code points beyond the 16-bit range is passed over here, since it adds a complication that does not affect the key points at issue
405 D4-48-2 . This encoding presents a different hazard, especially while support for Unicode in editing software is relatively uneven and immature. Because the code points are represented as sixteen-bit integers stored (in most popular computers) in two separate bytes, the order in which those bytes are stored becomes important. This is dependent on the underlying hardware. In the realm of desktop computing, Macintosh machines, for example, store (on disk as well as in memory) byte pairs representing 16-bit integers with the higher-value byte first, whereas PCs using Intel processors store the bytes in the reverse order (this is often referred to with Swiftian nomenclature as
409 D4-48-2 byte order). This means that if a semantically identical plain text file encoded in UTF-16 is prepared on a Macintosh and on a PC, and the two files are then saved to disk, each byte pair in one file will be in the reverse order from the corresponding byte pair in the other file. To avoid the obvious incompatibility problems, the XML standard requires that all documents whose declared encoding is UTF-16 must begin with a special pseudo-character which is not itself part of the document, but merely a Byte Order Mark (BOM) from which the processor can determine the byte order of the document that follows. Now the insertion of a correct BOM and the consistent maintenance of the byte order throughout the file ought to be taken care of transparently by software, but experience, especially from environments where work is distributed across big-endian and little-endian hardware, shows that this cannot always be taken for granted in the current state of software development. As with mixed encoding problems involving UTF-8, inconsistent byte order in UTF-16 files seems to be the result of merging or cutting and pasting between files using software which does not correctly enforce byte order integrity, and out of misconceived
411 D4-48-2 which conceals byte-order inconsistencies from the user. Once more, the result can be files which look correct in an editor, but which the XML parser either rejects outright or silently passes on in a seriously garbled form. Again, to avoid the consequent errors, projects need to cultivate an informed awareness of relevant encoding issues and devise policies to avoid them in the first place or detect them at an early stage.
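The effect of byte order and of the BOM can be sketched as follows. This is a small Python illustration, not a procedure prescribed by the Guidelines; the sample string is arbitrary.

```python
import codecs

text = "TEI"

# The same three characters serialized with each byte order (no BOM).
big_endian = text.encode("utf-16-be")     # higher-value byte first
little_endian = text.encode("utf-16-le")  # lower-value byte first

# Prefixing the appropriate BOM lets a reader work out the byte order.
with_bom_be = codecs.BOM_UTF16_BE + big_endian
with_bom_le = codecs.BOM_UTF16_LE + little_endian

# The generic "utf-16" codec inspects the BOM, picks the matching byte
# order, and drops the BOM from the decoded text.
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

# Reading big-endian data with the wrong byte order raises no error here:
# it silently yields three unrelated characters.
garbled = big_endian.decode("utf-16-le")
```

Both BOM-prefixed forms decode back to the original string, whereas the mismatched decode produces garbage without complaint, which is why consistent byte order within a file matters.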

ST-Infrastructure.xml#13092

# id text
2 ST The TEI Infrastructure
9 ST The TEI encoding scheme consists of a number of
12 ST classes
13 ST . Another part defines its possible content and attributes with reference to these classes. This indirection gives the TEI system much of its strength and its flexibility. Elements may be combined more or less freely to form a
15 ST appropriate to a particular set of requirements. It is also easy to add new elements which reference existing classes or elements to a schema, as it is to exclude some of the elements provided by any module included in a schema.
17 ST In principle, a TEI schema may be constructed using any combination of modules. However, certain TEI modules are of particular importance, and should always be included in all but exceptional circumstances: the module
25 ST provides declarations for the metadata elements and attributes constituting the TEI header, a component which is required for TEI conformance, while the
30 ST The specification for a TEI schema is itself a TEI document, using elements from the module described in chapter
40 ST The bulk of this chapter describes the TEI infrastructure module itself. Although it may be skipped at a first reading, an understanding of the topics addressed here is essential for anyone planning to take full advantage of the TEI customization techniques described in chapter
43 ST The chapter begins by briefly characterizing each of the modules available in the TEI scheme. Section
44 ST describes in general terms the method of constructing a TEI schema in a specific schema language such as XML DTD language, RELAX NG, or W3C Schema.
46 ST The next and largest part of the chapter introduces the attribute and element classes used to define groups of elements and their characteristics (section
52 ST , which are used to express some commonly used content models, and lists the
54 ST used to constrain the range of legal values for TEI attributes (section
58 STMA TEI Modules
64 STMA a formal declaration, expressed using a special-purpose XML vocabulary defined by these Guidelines in combination with elements taken from the ISO schema language RELAX NG
69 STMA Each chapter of the Guidelines presents a group of related elements, and also defines a corresponding set of declarations, which we call a
71 STMA . All the definitions are collected together in the reference sections provided as an appendix. Formal declarations for a given chapter are collected together within the corresponding module. For convenience, each element is assigned to a single module, typically for use in some specific application area, or to support a particular kind of usage. A module is thus simply a convenient way of grouping together a number of associated element declarations. In the simple case, a TEI schema is made by combining together a small number of modules, as further described in section
74 STMA The following table lists the modules defined by the current release of the Guidelines:
78 tab-mods Module name
86 tab-mods analysis
93 tab-mods certainty
100 tab-mods core
107 tab-mods corpus
115 tab-mods dictionaries
122 tab-mods drama
129 tab-mods figures
136 tab-mods gaiji
143 tab-mods header
150 tab-mods iso-fs
157 tab-mods linking
164 tab-mods msdescription
171 tab-mods namesdates
178 tab-mods nets
185 tab-mods spoken
192 tab-mods tagdocs
199 tab-mods tei
201 tab-mods TEI Infrastructure
207 tab-mods textcrit
214 tab-mods textstructure
221 tab-mods transcr
228 tab-mods verse
236 STMA For each module listed above, the corresponding chapter gives a full description of the classes, elements, and macros which it makes available when it is included in a schema. Other chapters of these Guidelines explore other aspects of using the TEI scheme.
240 STIN Defining a TEI Schema
243 STIN . For a valid TEI document, this schema must be a conformant TEI schema, as further defined in chapter
246 STIN be made explicit. The method of doing this recommended by these Guidelines is to provide explicitly or by reference a TEI schema specification against which the document may be validated.
248 STIN A TEI-conformant schema is a specific combination of TEI modules, possibly also including additional declarations that modify the element and attribute declarations contained by each module, for example to suppress or rename some elements. The TEI provides an application-independent way of specifying a TEI schema by means of the
251 STIN . The same system may also be used to specify a schema which extends the TEI by adding new elements explicitly, or by reference to other XML vocabularies. In either case, the specification may be processed to generate a formal schema, expressed in a variety of specific schema languages, such as XML DTD language, RELAX NG, or W3C Schema. These output schemas can then be used by an XML processor such as a validator or editor to validate or otherwise process documents. Further information about the processing of a TEI formal specification is given in chapter
257 STINsimpleExample The simplest customization of the TEI scheme combines just the four recommended modules mentioned above. In ODD format, this schema specification takes this form:
272 STINsimpleExample ). An ODD processor will generate an appropriate schema from this set of declarations, expressed using the XML DTD language, the ISO RELAX NG language, the W3C Schema language, or in principle any other adequately powerful schema language. The resulting schema may then be associated with the document instance by one of a number of different mechanisms, as further described in chapter
273 STINsimpleExample . The start point (or root element) of document instances to be validated against the schema is specified by means of the
282 STINlargerExample These Guidelines introduce each of the modules making up the TEI scheme one by one, and therefore, for clarity of exposition, each chapter focusses on elements drawn from a single module. In reality, of course, the markup of a text will draw on elements taken from many different modules, partly because texts are heterogeneous objects, and partly because encoders have different goals. Some examples of this heterogeneity include:
284 STINlargerExample a text may be a collection of other texts of different types: for example, an anthology of prose, verse, and drama;
286 STINlargerExample a text may contain other smaller, embedded texts: for example, a poem or song included in a prose narrative;
288 STINlargerExample some sections of a text may be written in one form, and others in a different form: for example, a novel where some chapters are in prose, others take the form of dictionary entries, and still others the form of scenes in a play;
290 STINlargerExample an encoded text may include detailed analytic annotation, for example of rhetorical or linguistic features;
292 STINlargerExample an encoded text may combine a literal transcription with a diplomatic edition of the same or different sources;
294 STINlargerExample the description of a text may require additional specialized metadata elements, for example when describing manuscript material in detail.
297 STINlargerExample The TEI provides mechanisms to support all of these and many other use cases. The architecture permits elements and attributes from any combination of modules to co-exist within a single schema. Within particular modules, elements and attributes are provided to support differing views of the
301 STINlargerExample a definition of a corpus or collection as a series of
303 STINlargerExample documents, sharing a common TEI header (see chapter
306 STINlargerExample a definition of composite texts which combine optional front- and back-matter with a group of collected texts, themselves possibly composite (see section
317 STINlargerExample Subsequent chapters of these Guidelines describe in detail markup constructs appropriate for these and many other possible features of interest. The markup constructs can be combined as needed for any given set of applications or project.
319 STINlargerExample For example, a project aiming to produce an ambitious digital edition of a collection of manuscript materials, to include detailed metadata about each source, digital images of the content, along with a detailed transcription of each source, and a supporting biographical and geographical database might need a schema combining several modules, as follows:
348 STINlargerExample The TEI architecture also supports more detailed customization beyond the simple selection of modules. A schema may suppress elements from a module, suppress some of their attributes, change their names, or even add new elements and attributes. Detailed discussion of the kind of modification possible in this way is provided in
349 STINlargerExample and conformance rules relating to their application are discussed in
350 STINlargerExample . These facilities are available for any schema language (though some features may not be available in all languages). The ODD language also makes it possible to combine TEI and non-TEI modules into a single schema, provided that the non-TEI module is expressed using the RELAX NG schema language (see further
356 STEC The TEI Class System
358 STEC The TEI scheme distinguishes about five hundred different elements. To aid comprehension, modularity, and modification, the majority of these elements are formally classified in some way. Classes are used to express two distinct kinds of commonality among elements. The elements of a class may share some set of attributes, or they may appear in the same locations in a content model. A class is known as an
360 STEC if its members share attributes, and as a
362 STEC if its members appear in the same locations. In either case, an element is said to
364 STEC properties from any classes of which it is a member.
372 STEC A basic understanding of the classes into which the TEI scheme is organized is strongly recommended and is essential for any successful customization of the system.
377 STECAT An attribute class groups together elements which share some set of common attributes. Attribute classes are given names composed of the prefix
385 STECAT attribute, both of which are inherited from their membership in the class rather than individually defined for each element. These attributes are said to be defined by (or inherited from) the
387 STECAT class. If another element were to be added to the TEI scheme for which these attributes were considered useful, the simplest way to provide them would be to make the new element a member of the
389 STECAT class. Note also that this method ensures that the attributes in question are always defined in the same way, taking the same default values etc., no matter which element they are attached to.
391 STECAT Some attribute classes are defined within the
393 STECAT infrastructural module and are thus globally available. Other attribute classes are specific to particular modules and thus defined in other chapters. Attributes defined by such classes will not be available unless the module concerned is included in a schema.
439 STECAT when the
441 STECAT module is included in a schema. If, however, this module is not included in a schema, then the
447 STECAT , is common to all modules, and is therefore described in some detail in the next section. A full list of all attribute classes is given in
453 STGA The following attributes are defined for every TEI element.
458 STGA These attributes are optionally available for any TEI element; none of them is required. Their usage is discussed in the following subsections.
463 STGAid The value supplied for the
466 STGAid name
472 STGAid The colon is also by default a valid name character; however, it has a specific purpose in XML (to indicate namespace prefixes), and may not therefore be used in any other way within a name.
476 STGAid in an XML TEI document) uppercase and lowercase letters are distinguished, and thus
493 STGAid attribute also provides an identifying name or number for an element, but in this case the information need not be a legal
495 STGAid value. Its value may be any string of characters; typically it is a number or other similar enumerator or label. For example, the numbers given to the items of a numbered list may be recorded with the
497 STGAid attribute; this would make it possible to record errors in the numeration of the original, as in this list of chapters, transcribed from a faulty original in which the number 10 is used twice, and 11 is omitted:
521 STGAid As noted above there is no requirement to record a value for either the
525 STGAid attribute. Any XML processor can identify the sequential position of one element within another in an XML document without any additional tagging. An encoding in which each line of a long poem is explicitly labelled with its numerical sequence such as the following
539 STGAla attribute indicates the natural language and writing system applicable to the content of a given element. If it is not specified, the value is inherited from that of the immediately enclosing element. As a rule, therefore, it is simplest to specify the base language of the text on the
541 STGAla element, and allow most elements to take the default value for
543 STGAla ; the language of an element then need be explicitly specified only for elements in languages other than the base language. For this reason, it is recommended practice to supply a default value for the
547 STGAla root element, or on both the
551 STGAla element. The latter is appropriate in the not uncommon case where the text element in a TEI document uses a different default language from that of the TEI header attached to it. Other language shifts in the source should be explicitly identified by use of the
555 STGAla In the following example schematic, an English language TEI header is attached to an English language text:
565 STGAla The same effect would be obtained by specifying the default language for both header and text:
575 STGAla The latter approach is necessary in the case where the two differ: for example, where an English language header is applied to a French text:
585 STGAla The same principle applies at any hierarchic level. In the following example, the default language of the text is French, but one section of it is in German:
614 STGAla element, by contrast, because it is in the same language as its parent.
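The inheritance rule for the language attribute can be sketched with a small tree walk. The following Python illustration assumes a deliberately simplified, hypothetical document (no TEI namespace, invented content) in which the default language is French and one division is in German; the function name effective_langs is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified document: French by default, one German division.
doc = ET.fromstring(
    '<text xml:lang="fr"><body>'
    "<div><p>Texte</p></div>"
    '<div xml:lang="de"><p>Text</p></div>'
    "</body></text>"
)


def effective_langs(elem, inherited=None, out=None):
    """Collect (tag, language) pairs, inheriting from the enclosing element."""
    if out is None:
        out = []
    # An explicit xml:lang wins; otherwise the parent's value applies.
    lang = elem.get(XML_LANG, inherited)
    out.append((elem.tag, lang))
    for child in elem:
        effective_langs(child, lang, out)
    return out


langs = effective_langs(doc)
```

Only the German division carries its own attribute; the paragraph inside it needs none, because it is in the same language as its parent.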
622 STGAla Note that in cases where it is advisable or necessary to identify the language of the text that is pointed at, the (non-global) attribute
625 STGAla the pointer references text written in French.
634 STGAla Additional information about a particular language may be supplied in the
636 STGAla element within the header (see section
649 STGAre attributes are all used to give information about the physical presentation of the text in the source. In the following example,
651 STGAre is used to indicate that both the emphasized word and the proper name are printed in italics:
669 STGAre elements are rendered in the text by italics, it will be more convenient to register that fact in the TEI header once and for all (using the
675 STGAre value only for any elements which deviate from the stated rendition.
681 STGAre is that the value used for the former may contain one or more tokens from any vocabulary devised by the encoder, separated by space characters, whereas the value used for the latter must be a single string taken from a formally-defined style definition language such as CSS. The
683 STGAre attribute values are a sequence-indeterminate set of whitespace-separated tokens, whereas
685 STGAre values allow whitespace and sequence relationships as part of the formally-defined style definition language.
692 STGAre element can then be associated with any element, either by default, or by means of the global
724 STGAre elements, each of which defines some aspect of the rendering or appearance of the text in its original form. These details may most conveniently be described using a formal style definition language, such as CSS (
726 STGAre ); in some other formal language developed for a specific project; or even informally in running prose. Although languages such as CSS and XSL-FO are generally used to describe document output to screen or print, they nonetheless provide formal and precise mechanisms for describing the appearance of source documents, especially print documents, but also many aspects of manuscript documents. For example, both CSS and XSL-FO provide mechanisms for describing typefaces, weight, and styles; character and line spacing; and so on.
730 STGAre attribute is provided for encoders wishing to describe the appearance of individual source elements using a language such as CSS directly rather than by reference to a
732 STGAre element. Its value may be any expression in the chosen formal style definition language.
734 STGAre Formal definition languages such as CSS typically identify a series of
738 STGAre are specified. A sequence of such property-value pairs makes up a stylesheet. The TEI uses such languages simply to describe the appearance of a source document, rather than to control how it should be formatted.
740 STGAre In the TEI scheme, it is possible to supply information about the appearance of elements within a source document in the following distinct ways:
742 STGAre One or more properties may be specified as the default for all elements of a given type, using the
750 STGAre attribute with any convenient set of one or more sequence-indeterminate tokens;
758 STGAre One or more properties may be supplied explicitly for individual element occurrences, using the
764 STGAre If the same property is specified in more than one of the above ways, the one with the highest number in the list above is understood to be applicable. The resulting properties from each way are then combined to provide the full set of property-value pairs applicable to the given element, and (by default) to all of its children.
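The combination rule just stated (later-listed sources override earlier ones for the same property, and the remaining properties are merged) can be sketched as a simple dictionary merge. This Python illustration is an assumption about how a processor might implement the rule; the property names and values are hypothetical CSS-like examples, not taken from the Guidelines.

```python
def combine_properties(*sources):
    """Merge property dictionaries given in list order (lowest number first).

    A property supplied by a later source replaces the same property
    supplied by an earlier one; all other properties are kept.
    """
    combined = {}
    for props in sources:
        combined.update(props)
    return combined


# Hypothetical property sets, ordered as in the numbered list above.
default_for_element_type = {"font-style": "italic", "color": "black"}
from_referenced_definition = {"font-style": "normal"}
explicit_on_occurrence = {"font-weight": "bold"}

result = combine_properties(
    default_for_element_type,
    from_referenced_definition,
    explicit_on_occurrence,
)
```

In this example the font-style from the higher-numbered source replaces the default, while the colour and weight survive unchanged.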
768 STGAre attribute to indicate a different language for one or more
772 STGAre attribute, if this is used in combination with either
778 STGAre Note that these TEI attributes always describe the rendition or appearance of the source document,
786 STGAba Several TEI elements carry attributes whose values are defined as
788 STGAba , meaning that such attributes supply a link or pointer, typically expressed as a URL. Like other XML applications, the TEI allows use of a special attribute to set the context within which relative URLs are to be evaluated. The global attribute
790 STGAba is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. We do not describe it in detail here: reference information about
797 STGAba is used to set a context for all relative URLs within the scope of the element on which it is specified. For example:
816 STGAba which supplies a value for
824 STGAba which does not change the default context, and its target is therefore some element within the current document with the value
828 STGAba attribute. Further discussion of this element and its effect on TEI linking methods is provided in chapter
837 STGAxs provides a mechanism for indicating to systems processing an XML file how they should treat whitespace, that is, any sequences of consecutive tab (#x09), space (#x20), carriage return (#x0D) or linefeed (#x0A) characters. Like
839 STGAxs this attribute is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. Complete information about this attribute is provided by
841 STGAxs ; here we provide a summary of how its use affects users of the TEI scheme.
848 STGAxs default
849 STGAxs . The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is when the document is processed. The second (which is implied when the attribute is not supplied), indicates that whitespace should be handled
853 STGAxs These Guidelines assume one of two different ways of processing whitespace will apply in a given case, depending on an element's content model. For an element that can contain only other elements with no intervening non-whitespace characters, whitespace is considered to have no semantic significance, and should therefore be discarded by a processor. For example, in a
863 STGAxs since non-whitespace text is not permitted between the
875 STGAxs element has a content model containing only elements: any punctuation or whitespace required between the lines of an address must therefore be supplied by the processor, as any whitespace present in the input document will be ignored.
877 STGAxs Elements with content models of this type are comparatively unusual in the TEI: a list of them is provided in the TEI release file
883 STGAxs Most TEI elements permit what is known as mixed content: that is, they can contain both text and other elements. Here the assumption of these Guidelines is that whitespace will be normalized. This means that all space, carriage return, linefeed, and tab characters are converted into spaces, all consecutive spaces are then deleted and replaced by one space, and then space immediately after a start-tag or immediately before an end-tag is deleted. The result is that this encoding,
899 STGAxs . The space before his name has been removed, a space is included between his forenames, the comma is preserved, and the newlines within his name have all been removed.
902 STGAxs If the default treatment described above is not appropriate for a mixed content element, the processing required may be described in the
904 STGAxs element of the TEI header, but generic XML processing tools may not take note of this.
908 STGAxs attribute may be supplied with a value of
910 STGAxs in order to indicate that every space, tab, carriage return and linefeed character found within that element in the document being processed is significant. Typically, the result of that processing will be to retain the whitespace characters in the output. Thus if the above example began
911 STGAxs persName xml:space="preserve"
912 STGAxs , the resulting text would most likely be rendered over five lines, indented, and with a blank line following.
916 STGAxs attribute is rarely used in TEI documents because such layout features are generally captured with less risk and more precision by using native TEI elements such as
983 STECCM As noted above, the members of a given TEI model class share the property that they can all appear in the same location within a document. Wherever possible, the content model of a TEI element is expressed not directly in terms of specific elements, but indirectly in terms of particular model classes. This makes content models simpler and more consistent; it also makes them much easier to understand and to modify.
985 STECCM Like attribute classes, model classes may have subclasses or superclasses. Just as elements inherit from a class the ability to appear in certain locations of a document (wherever the class can appear), so all members of a subclass inherit the ability to appear wherever any superclass can appear. To some extent, the class system thus provides a way of reducing the whole TEI galaxy of elements into a tidy hierarchy. This is, however, not entirely the case.
987 STECCM In fact, the nature of a given class of elements can be considered along two dimensions: as noted, it defines a set of places where the class members are permitted within the document hierarchy; it also implies a semantic grouping of some kind. For example, the very large class of elements which can appear within a paragraph comprises a number of other classes, all of which have the same structural property, but which differ in their field of application. Some are related to highlighting, while others relate to names or places, and so on. In some cases, the
988 STECCM set of places where class members are permitted
989 STECCM is very constrained: it may just be within one specific element, or one class of element, for example. In other cases, elements may be permitted to appear in very many places, or in more than one such set of places.
991 STECCM These factors are reflected in the way that model classes are named. If a model class has a name containing
997 STECCM then it is primarily defined in terms of its structural location. For example, those elements (or classes of element) which appear as content of a
1001 STECCM class; those which appear as content of a
1005 STECCM class. If, however, a model class has a name containing
1011 STECCM , the implication is that its members all have some additional semantic property in common, for example containing a bibliographic description, or containing some form of name, respectively. These semantically-motivated classes often provide a useful way of dividing up large structurally-motivated classes: for example, the very general structural class
1014 STECCM data elements that form part of a paragraph
1015 STECCM ) has four semantically-motivated member classes (
1025 STECCM Although most classes are defined by the
1029 STECCM , but instead gain their members as a consequence of individual elements' declaration of their membership. The same class may therefore contain different members, depending on which modules are active. Consequently, the content model of a given element (being expressed in terms of model classes) may differ depending on which modules are active.
1031 STECCM Some classes contain only a single member, even when all modules are loaded. One reason for declaring such a class is to make it easier for a customization to add new member elements in a specific place, particularly in areas where the TEI does not make fully elaborated proposals. For example, the TEI class
1035 STECCM module to include just the TEI
1037 STECCM element. A project wishing to add an alternative way of structuring text-critical information could do so by defining their own elements and adding them to this class.
1039 STECCM Another reason for declaring single-member classes is where the class members are not needed in all documents, but appear in the same place as elements which are very frequently required. For example, the specialized element
1041 STECCM used to represent a non-Unicode character or glyph is provided as the only member of the
1043 STECCM class when the
1045 STECCM module is added to a schema. References to this class are included in almost every content model, since if it is used at all the
1047 STECCM must be available wherever text is available; however these references have no effect unless the gaiji module is loaded.
1049 STECCM At the other end of the scale, a few of the classes predefined by the tei module are subsequently populated with very many members. For example, the class
1051 STECCM groups all the classes of element for simple editorial correction and transcription which can appear within a
1061 STECCM element is one of the basic building blocks of a TEI document, it is not surprising that each module will need to add elements to it. The class system here provides a very convenient way of controlling the resulting complexity. Typically, elements are not added directly to these very general classes, but via some intermediate semantically-motivated class.
1063 STECCM Just as there are a few classes which have a single member, so there are some classes which are used only once in the TEI architecture. These classes, which have no superclass and therefore do not fit into the class hierarchy defined here, are a convenient way of maintaining elements which are highly structured internally, but which appear from the outside to be uniform objects like others at the same level.
1067 STECCM Members of such classes can only ever appear within one element, or one class of elements. For example, the class
1069 STECCM is used only to express the content model for the element
1071 STECCM ; it references some other classes of elements, which can appear elsewhere, and also some elements which can only appear inside an address.
1076 STBTC Most TEI elements may also be informally classified as belonging to one of the following groupings:
1080 STBTC high level, possibly self-nesting, major divisions of texts. These elements populate such classes as
1084 STBTC , and typically form the largest component units of a text.
1091 STBTC , either directly or by means of other classes such as
1105 STBTC means any string of characters, and can apply to individual words, parts of words, and groups of words indifferently; it does not refer only to linguistically-motivated phrasal units. This may cause confusion for readers accustomed to applying the word in a more restrictive sense.
1109 STBTC The TEI also identifies two further groupings derived from these three:
1121 STBTC classes but rather a distinct grouping of elements which are both chunk-like and phrase-like. However, the classes
1132 STBTC elements which can appear directly within texts or text divisions; this is a combination of the inter- and chunk- level elements defined above. These elements populate the class
1134 STBTC , which is defined as a superset of the classes
1142 STBTC Broadly speaking, the front, body, and back of a text each comprises a series of components, optionally grouped into divisions.
1144 STBTC As noted above, some elements do not belong to any model class, and some model classes are not readily associated with any of the above informal groupings. However, over two-thirds of the
1145 STBTC elements defined in the present edition of these Guidelines are classified in this way, and future editions of these recommendations will extend and develop this classification scheme.
1147 STBTC A complete alphabetical list of all model classes is provided in
1269 STmacros The infrastructure module defined by this chapter also declares a number of
1271 STmacros , or shortcut names for frequently occurring parts of other declarations. Macros are used in two ways in the TEI scheme: to stand for frequently-encountered content models, or parts of content models (
1278 STECST As far as possible, the TEI schemas use the following set of frequently-encountered content models to help achieve consistency among different elements.
1290 STECST The present version of the TEI Guidelines includes some
1292 STECST shows, in descending order of frequency, the seven most commonly used content models.
1306 DTYPES The values which attributes may take in a TEI schema are defined, for the most part, by reference to a TEI
1307 DTYPES datatype
1308 DTYPES . Each such datatype is defined in terms of other primitive datatypes, derived mostly from
1310 DTYPES , literal values, or other datatypes. This indirection makes it possible for a TEI application to set constraints either globally or in individual cases, by redefining the datatype definition or the reference to it respectively. In some cases, the TEI datatype includes additional usage constraints which cannot be enforced by existing schema languages, although a TEI-compliant processor should attempt to validate them (see further discussion in chapter
1313 DTYPES Where literal values or name tokens are used in a datatype definition, an associated value list supplies definitions for the significance of suggested or (in the case of closed lists) all possible values.
1316 DTYPES TEI-defined datatypes may be grouped into those which define normalized values for numeric quantities, probabilities, or temporal expressions, those which define various kinds of shorthand codes or keys, and those which define pointers or links.
1330 DTYPES datatype include
1377 DTYPES in the case of durations, times, and dates; W3C Schema datatypes in the case of truth values; BCP 47 in the case of language; and ISO 5218 in the case of sex.
1410 DTYPES By far the largest number of TEI attributes take values which are coded values or names of some kind. These values may be constrained or defined in a number of different ways, each of which is given a different name, as follows:
1431 DTYPES , are used to supply an identifier expressed as any kind of single token or word. The TEI places a few constraints on the characters which may be used for this purpose: only Unicode characters classified as letters, digits, punctuation characters, or symbols can appear in an attribute value of this kind. Note in particular that such values cannot include whitespace characters. Legal values include
1445 DTYPES Where identifiers are defined externally, for example as part of a database or file system, the inability to include whitespace or other special characters in a value may be problematic. In other cases, it may also be simply more convenient to supply a short sequence of natural language words including spaces as a single value. For these reasons, we also provide a datatype
1459 DTYPES . This datatype should be used with care since XML will not normalize whitespace characters within it: for example the values
1463 DTYPES (three spaces) would be considered distinct. This case should be distinguished from that of an attribute permitting multiple values, each of which may be separated by whitespace which
1472 DTYPES , but with the additional constraint that they must be legal XML identifiers, as defined by the XML 1.0 specification, or successors. Hence, they may not begin with digits or punctuation characters. Legal identifiers include
1494 DTYPES supplied by
1498 DTYPES above, with the added constraint that the word supplied is taken from a specific list of possibilities. In each case, the element or class specification which includes the definition for the attribute will also contain a list of possible values, together with a prose description of their intended significance. This list may be open (in which case the list is advisory), or closed (in which case it determines the range of legal values). In this latter case, the datatype will not be
1500 DTYPES , but an explicit list of the possible values.
1515 DTYPES An attribute may, of course, take more than one value of a given type, for example a list of pointer values, or a list of words. In the TEI scheme, this information is regarded as a property of the
1517 DTYPES element used to document the attribute in question rather than as a distinct
1518 DTYPES datatype
1525 STOV The TEI Infrastructure Module
1529 STOV module defined by this chapter is a required component of any TEI schema. It provides declarations for all datatypes, and initial declarations for the attribute classes, model classes, and macros used by other modules in the TEI scheme. Its components are listed below in alphabetical order:
1531 tei TEI Infrastructure
1533 tei Declarations for classes, datatypes, and macros available to all TEI modules
1547 STOV The order in which declarations are made within the infrastructure module is critical, since several class declarations refer to others, which must therefore precede them. Other constraints on the order of declarations derive from the way in which the modularity of the TEI scheme is implemented in different schema languages. The XML DTD fragment implementing this TEI module makes extensive use of
1551 STOV to effect a kind of conditional construction; the RELAX NG schema fragment similarly predeclares a number of patterns with null (
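
The DTYPES rows above (positions 1431 to 1463) describe two behaviours that are easy to confuse: a single-token value, which may contain only letters, digits, punctuation, or symbols and never whitespace, and a free-text value, in which XML does not normalize whitespace, so that runs of spaces remain significant. The following is a minimal Python sketch of that distinction, not the TEI's own definition. The function names are hypothetical, and mapping "letters, digits, punctuation characters, or symbols" onto the Unicode major classes L, N, P, and S is an assumption made for illustration.

```python
import unicodedata

# Assumption: "letters, digits, punctuation, symbols" is read here as the
# Unicode major classes L, N, P and S. Whitespace (class Z) and control
# characters (class C) are therefore rejected.
ALLOWED_MAJOR_CLASSES = {"L", "N", "P", "S"}


def is_single_token(value):
    """Return True if value is non-empty and contains no whitespace or
    other character outside the allowed Unicode major classes."""
    return bool(value) and all(
        unicodedata.category(ch)[0] in ALLOWED_MAJOR_CLASSES for ch in value
    )


def split_value_list(value):
    """Split a multi-valued attribute on runs of whitespace, the way a
    whitespace-separated list of tokens would be read."""
    return value.split()


def same_free_text(a, b):
    """Compare two free-text values exactly, with no whitespace
    normalization: differing runs of spaces make the values distinct."""
    return a == b
```

Under these assumptions, `is_single_token("chum ley")` is false because of the embedded space, while `split_value_list` treats one space and three spaces between two tokens identically; `same_free_text` does not.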

FM1-IntroductoryNote.xml#13134

# id text
4 FM1 This publication constitutes the fifth distinct version of the
6 FM1 , and the first complete revision since the appearance of P3 in 1994. It includes substantial amounts of new material and a major revision of the underlying technical infrastructure. With this version, the Guidelines enter a new stage in their development as a community-maintained open source project. This edition is the first version to have benefitted from the close overview and oversight of an elected TEI Technical Council. The editors are therefore particularly pleased to acknowledge with gratitude the hard work and dedication put into this project by the Council over the last five years.
8 FM1 The Chair of the TEI Board sits on the Technical Council, and the Board appoints the Chair of the Technical Council and one other member of the Council. Other Council members are all elected by the Consortium membership, and serve periods of up to two years at a time. The names and affiliations of all Council members who served during the production of this edition of the Guidelines are listed below.
40 FM1 Members Appointed by the TEI Board
144 FM1 The bulk of the Council's work has been carried out by email and by regular telephone conference. In addition, the Council has held many two-day face-to-face meetings. During production of P5, these meetings were generously hosted by the following institutions:
181 FM1 During the production of TEI P5, the Council chartered a number of smaller workgroups and similar activities, each of which made a significant contribution to the intellectual content of the work. Active members of these are listed below:
186 FM1 Active between July 2001 and January 2005, this group revised and developed the recommendations now forming chapters
194 FM1 Active between February 2003 and February 2005, this group developed the material now forming
201 FM1 Active between February 2002 and January 2006, this group reviewed and expanded the material now largely forming part of
207 FM1 Active between February 2003 and December 2005, this group reviewed and finalised the material now forming
208 FM1 . It was chaired by Matthew Driscoll and comprised David Birnbaum and Merrillee Proffitt, in addition to the TEI Editors.
213 FM1 Active between January 2006 and May 2007, this group formulated the new material now forming part of
220 FM1 Active between January 2003 and August 2007, this group reviewed the material now presented in
224 FM1 From 2000 to 2008 the TEI had two appointed Editors, Lou Burnard (University of Oxford) and Syd Bauman (Brown University), who served
225 FM1 ex officio
228 FM1 The Council also oversees an Internationalization and Localization project, led by Sebastian Rahtz and with funding from the ALLC. This activity, ongoing since October 2005, is engaged in translating key parts of the P5 source into a variety of languages.
255 FM1 Anyone who works closely with the TEI Guidelines, whether as translator, editor, or reader, is constantly reminded of the ambitious scope and exceptionally high editorial standards set by the original project, now approaching twenty years ago. It is appropriate therefore to retain a sense of the history of this document, as it has evolved since its first appearance in 1990, and to acknowledge with gratitude the contributions made to that evolution by very many individuals and institutions around the world. The original prefatory notes to each major edition of the Guidelines recording these names are therefore preserved in an appendix to the current edition (see

ND-NamesDates.xml#13218

# id text
5 ND it was noted that the elements provided in the core module allow an encoder to specify that a given text segment is a proper noun, or a
6 ND referring string
7 ND , and to specify the kind of object named or referred to only by supplying a value for the
11 ND This module also provides elements for the representation of information about the person, place, or organization to which a given name is understood to refer and to represent the name itself, independently of its application. In simple terms, where the core module allows one simply to represent that a given piece of text is a
12 ND name
14 ND personal name
16 ND person
18 ND canonical name
23 ND ), place names (section
35 NDATTS have specialized attributes which support linkage of a naming element with the entity (person, place, organization) being named; members of the class
37 NDATTS have specialized attributes which support a number of ways of normalizing the date or time of the data encoded by the element concerned.
46 NDATTSnr As discussed elsewhere, these attributes provide two different ways of associating any sort of name with its referent. For cases where all that is required is to provide some minimal information about the person named, for example their occupation or status, the
50 NDATTSnr attribute. It also provides an additional attribute, which allows the name itself to be associated with a base or canonical form:
57 NDATTSnr attribute should be used wherever it is possible to supply a direct link such as a URI to indicate the location of canonical information about the referent.
71 NDATTSnr More than one URI may be supplied if the name refers to more than one person. For example, assuming the existence of another
85 NDATTSnr attribute is provided for cases where no such direct link is required: for example because resolution of the reference is carried out by some local convention, or because the encoder judges that no such resolution is necessary. As an example of the first case, a project might maintain its own local database system containing canonical information about persons and places, each entry in which is accessed by means of some system-specific identifier constructed in a project-specific way from the value supplied for the
89 NDATTSnr a similar method is used to link element descriptions to the modules or classes to which they belong, for example.
90 NDATTSnr As an example of the second case, consider the use of well-established codifications such as country or airport codes, which it is probably unnecessary for an encoder to expand further:
98 NDATTSnr , interchange is improved by use of tag URIs in
106 NDATTSnr attribute has a more specialized use, where it is the name itself which is of interest rather than the person, place, or organization being named. See section
129 NDATTSda attribute is used to specify a normalized form for any temporal expression, independently of how it is represented in the text, as in the following example:
138 NDATTSda attribute provides a convenient way of associating an event or date with a named period. Its value is a pointer which should indicate some other element where the period concerned is more precisely defined. A convenient location for such definitions is the
144 NDATTSda of a TEI Header. A
146 NDATTSda may contain simply a bibliographic reference to an external definition for it. More usefully, it may also contain a series of
148 NDATTSda elements, each with an identifier and a description. The identifier can then be used as the target for a
150 NDATTSda attribute. For example, a taxonomy of named periods might be defined as follows:
186 NDATTSda The other dating attributes provided by this class support a wide range of methods of specifying temporal information in a normalized form. Some simple examples follow:
204 NDATTSda Normalization of date and time values permits the efficient processing of data (for example, to determine whether one event precedes or follows another). These examples all use the W3C standard format for representation of dates and times. Further examples, and discussion of some alternative approaches to normalization are given in section
214 NDPER The core
218 NDPER elements can distinguish names in a text but are insufficiently powerful to mark their internal components or structure. To conduct nominal record linkage or even to create an alphabetically sorted list of personal names, it is important to distinguish between a family name, a forename and an honorary title. Similarly, when confronted with a string such as
220 NDPER , the analyst will often wish to distinguish amongst the various constituent elements present, since they provide additional information about the status, occupation, or residence of the person to whom the name belongs. The following elements are provided for these and related purposes:
225 NDPER attributes mentioned above, all of the above elements are members of the class
234 NDPER element irrespective of whether or not the components of the personal name are also to be marked.
238 NDPER name type="person"
241 NDPER attribute allows for further subcategorization of the personal name itself, for example as a
244 NDPER birth
277 NDPER elements because distinctive name components occurring within it can be marked as such.
280 NDPER surname
281 NDPER and additional personal names, often known as
311 NDPER elements to provide further culture- or project-specific detail about the name component, for example:
340 NDPER attribute are not constrained, and may be chosen as appropriate to the encoding needs of the project. They may be used to distinguish different kinds of forename or surname, as well as to indicate the function a name component fills within the whole. In this example, we indicate that a surname is toponymic, and also point to the specific place name from which it is derived:
353 NDPER The value
355 NDPER was suggested above for the not uncommon case where the whole of a surname is composed of several other surname elements. These nested surnames may be individually tagged as well, together with appropriate type values:
369 NDPER attribute may be used to indicate whether a name is an abbreviation, initials, or given in full:
403 NDPER Alternatively, it may be felt more appropriate to mark a patronymic as a distinct kind of name, neither a forename nor a surname, using the
429 NDPER class; its effect is to state the sequence in which
433 NDPER elements should be combined when constructing a sort key for the name.
471 NDPER It is also often convenient to distinguish phrases (historically similar to the generational labels mentioned above) used to link parts of a name together, such as
477 NDPER etc. It is often a matter of arbitrary choice whether such components are regarded as part of the surname or not; the
499 NDPER elements are used to mark all name components other than those already listed. The distinction between them is that a
501 NDPER encloses an associated name component such as an aristocratic or official title which exists in some sense independently of its bearer. The distinction is not always a clear one. As elsewhere, the
506 NDPER An inherited or life-time title of nobility such as
515 NDPER An academic or other honorific prefixed to a name e.g.
542 NDPER role
543 NDPER a person has in a given context (such as
544 NDPER witness
549 NDPER element, since this is intended to mark roles which function as part of a person's name, not the role of the person bearing the name in general. Information about roles, occupations, etc. of a person is encoded within the
588 NDPER A name may have any combination of the above elements:
606 NDPER Although highly flexible, these mechanisms for marking personal name components will not cater for every personal name, nor for every processing need. Where the internal structure of personal names is highly complex or where name components are particularly ambiguous, feature structures are recommended as the most appropriate mechanism to mark and analyze them, as further discussed in chapter
609 NDPER White space is allowed and therefore significant between elements within
631 NDORG In these Guidelines, we use the term
633 NDORG for any named collection of people regarded as a single unit. Typical examples include institutions such as
645 NDORG . Giving a loosely-defined group of individuals a name often serves a particular political or social agenda and an analysis of the way such phrases are constructed and used may therefore be of considerable importance to the social historian, even where the objective existence of an
647 NDORG in this sense is harder to demonstrate than that of (say) a named person. In the case of businesses or other formally constituted institutions, the component parts of an organizational name may help to characterize the organization in terms of its perceived geographical location, ownership, likely number of employees, management structure, etc.
656 NDORG This element is a member of the same attribute classes as
663 NDORG element may be used to mark up any form of organizational name:
690 NDORG attribute should be used to characterize the name (rather than the organization), for example as an acronym:
716 NDORG The components of an organization's name may include place names as well as personal names:
724 NDORG or role names:
760 NDPLAC Like other proper nouns or noun phrases used as names, place names can simply be marked up with the
764 NDPLAC element. For cartographers and historical geographers, however, the component parts of a place name provide important information about the relation between the name and some spot in space and time. They also provide important evidence in historical linguistics.
766 NDPLAC These Guidelines distinguish three ways of referring to places. A place name (represented using the
769 NDPLAC ). A place named simply in terms of geographical features such as mountains or rivers is represented using the
772 NDPLAC ). Finally, an expression consisting of phrases expressing spatial or other kinds of relationship between other kinds of named place may itself be regarded as a way of referring to a place, and hence as a kind of named place (see section
785 NDPLAC mentioned above. These attributes are primarily useful as a means of linking a place name with information about a specific place. Recommendations for the encoding of information about a place, as distinct from its name, are provided in
794 NDPLAC name type="place"
796 NDPLAC rs type="place"
798 NDPLAC Strictly, a suitable value such as
800 NDPLAC should be added to the two place names which are presented periphrastically in the second version of this example. This would preserve the distinction indicated by the choice of
827 NDPLGU A place name may contain text with no indication of its internal structure:
829 NDPLGU More usually however, a place name of this kind will be further analysed in terms of its constitutive geo-political or administrative units. These may be arranged in ascending sequence according to their size or administrative importance, for example:
845 NDPLGU class, members of which may be used anywhere that text is permitted, including within each other as in the following examples:
924 NDPLGF element for this component of the name and then point to it using the
932 NDPLR All the place name specifications so far discussed are
934 NDPLR , in the sense that they define only one place. A place may however be specified in terms of its relationship to another place, for example
939 NDPLR relative place names
940 NDPLR will contain a place name which acts as a referent (e.g.
944 NDPLR ). They will also contain a word or phrase indicating the position of the place being named in relation to the referent (e.g.
948 NDPLR ). A distance, possibly only vaguely specified, between the referent place and the place being indicated may also be present (e.g.
954 NDPLR Relative place names may be encoded using the following elements in combination with either a
959 NDPLR Some examples of relative place names are:
995 NDPLR The internal structure of place names is like that of personal names—complex and subject to an enormous amount of variation across time and different cultures. The recommendations in this section should however be adequate for a majority of users and applications; they may be extended using the mechanisms described in chapter
996 NDPLR to add new elements to the existing classes. When the focus of interest is on the name components themselves, as in place name studies for example, the elements discussed in
1019 NDPERS This module defines a number of special purpose elements which can be used to mark up biographical, historical, and prosopographical data. We envisage a number of users and uses for these elements. For example, an encoder may be interested in creating or converting a set of biographical records, for example of the type found in a Dictionary of National Biography. Another use is the creation or conversion of a database-like collection of information about a group of people, such as the people referenced in a marked-up collection of documents, or persons who have served as informants in the creation of spoken corpora. It is also appropriate to use these elements to register information relating to those who have taken part in the creation of a TEI document.
1021 NDPERS To cater for this diversity, these Guidelines propose a flexible strategy, in which encoders may choose for themselves the approach appropriate to their needs. If one were interested, for example, in converting existing DNB-type records, and wanted to preserve the text as is, the
1024 NDPERS ) could simply contain the text of an article, placed within
1030 NDPERS to mark up features of that text. For a more structured entry, however, one would extract the data and place information contained in the text, and encode it directly using the more specific elements described in this section.
1035 NDPERSbp Information about people, places, and organizations, of whatever type, essentially comprises a series of statements or assertions relating to:
1039 NDPERSbp which do not, by and large, change over time
1043 NDPERSbp which hold true only at a specific time
1046 NDPERSbp or incidents which may lead to a change of state or, less frequently, trait.
1052 NDPERSbp are typically independent of an individual's volition or action and can be either physical, such as sex or hair and eye colour, or cultural, such as ethnicity, caste, or faith. The distinction is not entirely straightforward, however: while sex is fairly obviously a physical trait, gender should rather be regarded as culturally determined, and the division of mankind into different
1054 NDPERSbp , proposed by early (white European) anthropologists on the basis of physical characteristics such as skin colour, hair type and skull measurements, is now considered to be more a social or mental construct. Furthermore, while some characteristics will obviously change over time, hair colour for example, none, in principle—not even sex—is immutable.
1057 NDPERSbp include, for example, marital status, place of residence and position or occupation. Such states have a definite duration, that is, they have a beginning and an end and are typically a consequence of the individual's own action or that of others.
1060 NDPERSbp changes in state
1061 NDPERSbp are meant the events in a person's life such as birth, marriage, or appointment to office; such events will normally be associated with a specific date or a fairly narrow date-range. Changes in states can also cause or be caused by changes in characteristics. Any statement or assertion on any of these aspects of a person's life will be based on some source, possibly multiple sources, possibly contradictory. Taking all this into account it follows that each such statement or assertion needs to be able to be documented, put into a time frame and be relatable to other statements or assertions of the same or any of the other types.
1063 NDPERSbp The elements defined by the module described in this chapter may, for the most part, all be regarded as specializations of one or other of the above three classes. Generic elements for state, trait, and event are also defined:
1076 NDPERSE Information about a person, as distinct from references to a person, for example by name, is grouped together within a
1078 NDPERSE element. Information about a group of people regarded as a single entity (for example
1082 NDPERSE element. Note however that information about a group of people with a distinct identity (for example a named theatrical troupe) should be recorded using the
1097 NDPERSE elements may be supplied within the
1101 NDPERSE element of a TEI header (see
1104 NDPERSE can also appear within the body of a text when the module defined by this chapter is included in a schema.
1130 NDPERSE element carries several attributes. As a member of the classes
1141 NDPERSE In addition, a small number of very commonly used personal properties may be recorded using attributes specific to
1149 NDPERSE These attributes are intended for use where only a small amount of data is to be encoded in a more or less normalized form, possibly for many person elements, for example when encoding basic facts about respondents to a questionnaire. When however a more detailed encoding is required for all kinds of information about a person, for example in a historical gazetteer, then it will be more appropriate to use the elements
1157 NDPERSE attribute is not intended to record the person's age expressed in years, months, or other temporal unit. Rather it is intended to record into which age bracket, for the purposes of some analysis, the person falls. A simple (perhaps too simple to be useful) binary classification of age brackets would be
1161 NDPERSE . The actual age brackets useful to various projects are likely to be varied and idiosyncratic, and thus these Guidelines make no particular recommendation as to possible values. Instead, individual projects are recommended to define the values they use in their own customization file, using a declaration like the following:
1201 NDPERSE element may contain many sub-elements, each specifying a different property of the person being described. The remainder of this section describes these more specific elements. For convenience, these elements are grouped into three classes, corresponding with the tripartite division outlined above: one for traits, one for states and one for events. Each class contains both specific elements for common types of biographical information, and a generic element for other, user-defined, types of information.
1203 NDPERSE All the elements in these three classes belong to the attribute class
1234 NDPERSEpc , allow content of ordinary prose containing phrase-level elements.
1241 NDPERSEpc The meanings of concepts such as sex, nationality, or age are highly culturally dependent, and encoders should take particular care to be explicit about any assumptions underlying their use of these concepts. For example, when recording personal age in different cultures, there may be different assumptions about the point from which age is reckoned. A statement of the practice adopted in a given encoding may usefully be provided in the
1248 NDPERSEpc element contains either paragraphs or a number of
1253 NDPERSEpc tag
1254 NDPERSEpc s for the languages. The
1258 NDPERSEpc attribute, which indicates the language with the same kind of
1259 NDPERSEpc language tag
1261 NDPERSEpc language tags
1291 NDPERSEpc attribute to give values from a project-internal taxonomy, or an external standard, such as vCard's sex property
1317 NDPERSEpc As elsewhere, these coded values may be used as an alternative to or normalization of the actual descriptive text contained in the element. The previous example might equally well be given as
1330 NDPERSEpc These elements can be used to extend the range of information supplied about an individual's personal characteristics. Either may contain an optional
1332 NDPERSEpc element, used to provide a human-readable specification for the characteristic concerned and a description of the feature itself supplied within a
1354 NDPERSEpc These elements are provided as a simple means of extending the set of descriptive features available in a standardized way. For example, there are no predefined elements for such features as eye or hair colour. If these are to be recorded, they may simply be added as new types of trait:
1370 NDPERSEpc If none of the more specialized elements listed above is appropriate, then a choice must be made between the two generic elements
1378 NDPERSEpc for the latter. It may also be helpful to note that traits are typically, but not necessarily, independent of the volition or action of the holder. If the distinction between state and trait is not considered relevant or useful, use
1384 NDPERSEpc element is repeatable and can, like all TEI elements, take the attribute
1386 NDPERSEpc to indicate the language of the content of the element, as well as a
1388 NDPERSEpc attribute to indicate the type of name, whether a nickname, maiden or birth name, alternative form, etc. This is useful in cases where, for example, a person is known by a Latin name and also by any number of vernacular names, many or all of which may have claims to
1390 NDPERSEpc . In order to ensure uniformity, the method generally employed in the library world has been to accept the form found in some authority file, for example that of the American Library of Congress, as the
1396 NDPERSEpc an overtly foreign form of the name of their local saint or hero. Within the
1398 NDPERSEpc element any number of variant forms of a name can be given, with no prioritization, and hence less likelihood of offence. The Icelandic scholar and manuscript collector Árni Magnússon, to give his name in standard modern Icelandic spelling, is known in Danish as Arne Magnusson, the form which he himself, as a long-term resident of Denmark, generally used; there is also a Latinized form, Arnas Magnæus, which he used in his scholarly writings. All three forms can be given, and in any order:
1410 NDPERSEpc At the other extreme, a person may be named periphrastically as in the following example:
1484 NDPERSEpe has a similar content model to that of
1490 NDPERSEpe element to identify the name of the place where the event occurred. It is used to describe any event in the life of an individual or organization.
1492 NDPERSEpe In the following example, we give a brief summary of the wedding of Jane Burden to the English writer, designer, and socialist William Morris, encoded as an
1496 NDPERSEpe element used to record data about Morris, though we could equally well have embedded the event within the
1568 NDPERSEpe elements point either to an external source or to a
1570 NDPERSEpe element within which other information about the person named may be found. As further discussed below (
1573 NDPERSEpe element may then be used to link them in a more meaningful way:
1580 NDPERSEpe As mentioned above, all these elements, both the specific and the generic, are members of the
1582 NDPERSEpe attribute class, which means they can be limited in terms of time. The following encoding, for example, demonstrates that the person named David Jones changed his name in 1966 to David Bowie:
1596 NDPERSEpe classes. These classes make available the attributes
1604 NDPERSEpe , a pointer to a resource from which the information derives. In this way it is possible, in the case of multiple and conflicting sources, to provide more than one view of what happened, as in the following example:
1626 NDPERSREL attributes in the usual way. The value specified for either attribute on a
1634 NDPERSREL , as defined here, may be any kind of describable link between specified participants. A participant (in this sense) might be a person, a place, or an organization. In the case of persons, therefore, a relationship might be a social relationship (such as employer/employee), a personal relationship (such as sibling, spouse, etc.) or something less precise such as
1640 NDPERSREL relationship); or it may not be if participants are not identical with respect to their role in the relationship (for example, the
1642 NDPERSREL relationship). For non-mutual relationships, only two kinds of role are currently supported; they are named
1648 NDPERSREL , in the sense that they are most readily described by a transitive verb, or a verb phrase of the form
1687 NDPERSREL This example defines the relationships amongst a number of people not further described here; we assume however that each person has been allocated an identifier such as
1695 NDPERSREL , etc. Then the above set of
1729 ND-org elements discussed elsewhere in this chapter, that is to provide a unique wrapper element for information about an entity, distinct from references to that entity which are typically encoded using a naming element such as
1730 ND-org name type="org"
1733 ND-org . The content of a naming element will represent the way an organization is named in a given context; the content of an
1737 ND-org An organization is not the same thing as a list or group of people because it has an identity of its own. That identity may be expressed solely in the existence of a name (for example
1739 ND-org ), but is likely to consist in the combination of that name with a number of events, traits, or states which are considered to apply to the organization itself, rather than any of its members. For example, a sports team might be described in terms of its membership (a
1743 ND-org ), its geographical affiliation (a
1747 ND-org attribute. However, it is the name of the sports team alone which identifies it.
1749 ND-org The content model for
1776 ND-org The names of the people making up an organization can also change over time (if they are known at all). For example:
1843 ND-org element to group together a number of
1906 NDGEOG we discuss various ways of naming places such as towns, countries, etc. In much the same way as these Guidelines distinguish between the encoding of names for people and the encoding of other data about people, so they also distinguish between the encoding of names for places and the encoding of other data about places. In this section we present elements which may be used to record in a structured way data about places of any kind which might be named or referenced within a text. Such data may be useful as a way of normalizing or standardizing references to particular places, as the raw material for a gazetteer or similar reference document associated with a particular text or set of texts, or in conjunction with any form of geographical information system.
1916 NDGEOG class contains elements describing characteristics of a place which have a definite duration, such as its name. Any member of the
1924 NDGEOG For example, the modern city of Lyon in France was in Roman times known as Lugdunum. Although the modern and the Roman city are not physically co-extensive, they have significant areas which overlap, and we may therefore wish to regard them as the same place, while supplying both names with an indication of the time period during which each was current.
1926 NDGEOG A place is defined, however, by its physical location, which does not typically change over time. Locations may be specified in a number of ways: as a set of coordinates defining a point or an area on the surface of the earth, or by providing a description of how the place may be found, usually in terms of other place names. For example, we can identify the location of the Canadian city of London, either by specifying its latitude and longitude, or by specifying that we mean the city called London located in the province called Ontario within the country called Canada.
1928 NDGEOG In addition we may wish to supply a brief characterization of the place identified, for example to state that it is a city, an administrative area such as a country, or a landmark of some kind such as a monument or a battlefield. If our typology of places is simple, the open-ended
1931 NDGEOG place type="city"
1933 NDGEOG place type="battlefield"
1938 NDGEOG element, the following elements may be used to provide more information about specific aspects of the place in a structured form:
1946 NDGEOGva A location may be specified in one or more of the following ways:
1948 NDGEOGva by supplying a string representing its coordinates in some standardized way within a
1952 NDGEOGva by supplying one or more place name component elements (e.g.
1956 NDGEOGva etc.) to place it within a geo-political context
1970 NDGEOGva The simplest method of specifying a location is by means of its geographic coordinates, supplied within the
1974 NDGEOGva ) used for the coordinate system itself. The default recommended by these Guidelines is to supply a string containing two real numbers separated by whitespace, of which the first indicates latitude and the second longitude according to the 1984 World Geodetic System (WGS84); this is the system currently used by most GPS applications which TEI users are likely to encounter.
1977 NDGEOGva We might therefore record the information about the place known as
1991 NDGEOGva Identifying Lyon by its geo-political status as a settlement within a country forming part of a larger political entity, we might represent the same
1992 NDGEOGva place
2014 NDGEOGva We may use the same procedure to represent the location of smaller places, such as a street or even an individual building:
2031 NDGEOGva attribute to categorize more precisely both the kind of place concerned (a building) and the kind of name used to locate it, for example by characterizing the generic
2053 NDGEOGva sometimes resembles a set of instructions for finding a place, rather than a name:
2073 NDGEOGva may also be used to identify a location in terms of its postal or other address:
2095 NDGEOGva When, as here, the same place is given multiple locations, the
2097 NDGEOGva attribute should be used to characterize the kind of location, as a means of indicating that these are alternative ways of identifying the same place, rather than that the place is spread across several locations.
2101 NDGEOGva element may thus identify a place to a greater or lesser degree of precision, using a variety of means: a name, a set of names, or a set of coordinates. The
2103 NDGEOGva element introduced earlier is by default understood to supply a value expressed in a specific (and widely used) notation. If a
2107 NDGEOGva , this is interpreted as being really the same place in the universe, but with different systems used to refer to it. If there is a lack of consensus about the location (of, for example, Camelot), more than one
2113 NDGEOGva By default, the content of
2117 NDGEOGva Firstly, the content of the
2140 NDGEOGva In the following example, we have defined the location of the place
2165 NDGEOGva to indicate the source of the location information.
2181 NDGEOGmp A place may contain other places. This containment relation can be directly modelled in XML: thus we can say that the towns of Vilnius and Kaunas are both in a place called Lithuania (or Lietuva) as follows:
2204 NDGEOGmp As a further example, the islands of Mauritius, Réunion, and Rodrigues are collectively known as the Mascarene Islands. Grouped together with Mauritius there are also several smaller offshore islands, with rather picturesque French names. These offshore islands do not however constitute an identifiable place as a whole. One way of representing this is as follows:
2234 NDGEOGmp Here is a more complex example, showing the variety of names associated at different times and in different languages with a set of hierarchically grouped places—the settlement of Carmarthen Castle, within the town of Carmarthen, within the administrative county of Carmarthenshire, Wales.
2277 NDGEOGmp place
2284 NDGEOGmp elements should be distinguished from the (possibly simpler) case where a number of places with some property in common are being grouped together for convenience, for example, in a gazetteer. The
2286 NDGEOGmp element is provided as a means of grouping places together where there is no implication that the grouped elements constitute a distinct place. For example:
2322 NDGEOGste There are many different kinds of information which it might be considered useful to record for a place in addition to its name and location, and the categories selected are likely to be very project-specific. As with persons, therefore, these Guidelines make no claim to comprehensiveness in this context. Instead, the generic
2330 NDGEOGste attribute. These are complemented by a small number of predefined elements of general utility:
2339 NDGEOGste element. This element may be used for almost any kind of event in the life of a place; no specialized version of this element is proposed, nor do we attempt to enumerate the possible values which might be appropriate for the
2456 NDGEOGste attribute are to be understood as cumulatively inherited, as elsewhere in the TEI scheme (for example on
2462 NDGEOGste element concerns the squirrel population between the dates given. This is then broken down into red and gray squirrel populations, and within that into male and female:
2480 NDGEOGste attribute: responsibility is not an additive property, and therefore an element either states it explicitly, or inherits it from its nearest ancestor. Dating is slightly different again, in that a child element may specify a date more precisely than its parent, as in the example above
2482 NDGEOGste Events may also be subdivided into other events. For example, a two part meeting might be represented as follows:
2500 NDGEOGste element is usually used to record information about a place, or a person; for this reason the element usually appears as content of a
2504 NDGEOGste . However, it is also possible to describe events independently of either a person or a place. This may be useful in such applications as chronologies, lists of significant events such as battles, legislation, etc.
2564 place-rel element may also be used to express relationships of various kinds between places, or between places and persons, in much the same way as it is used to express relationships between persons alone. Returning to the Mascarene Islands example cited above, we might define the island group and its constituents separately, but indicate the relationship by means of a
2594 place-rel style of representation has the advantage that we can now also represent the fact that a place may be a
2596 place-rel more than one other place; for example, Réunion is part of France, as well as part of the Mascarenes. If we add a declaration for France to the list above:
2653 NDNYM So far we have discussed ways in which a name or referring string encountered in running text may be resolved by considering the object that the name refers to: in the case of a personal name, the name refers to a person; in the case of a place name, to a place, for example. The resolution of this reference is effected by means of the
2675 NDNYM in Russian might all be regarded as existing independently of any person to which they are attached, and also independently of any variant forms that might be attested in different sources (such as Jon or Johnny in English, or Jehan or Jojo in French). We use the term
2676 NDNYM nym
2677 NDNYM to refer to the canonical or normalized form of a name regarded in such a way, and provide the following elements to encode it:
2687 NDNYM to indicate the nym with which it corresponds. Thus, given the following
2689 NDNYM for the name
2699 NDNYM an occurrence of this name in running text might be encoded as follows:
2705 NDNYM The person identified by this particular Tony may however be indicated independently using the
2707 NDNYM attribute, either on the forename or on the whole name component:
2726 NDNYM , etc. For example, we may show that the canonical form for a given nym has two orthographic variants in this way:
2790 NDNYM element used here is provided by the TEI
2792 NDNYM module, which would therefore also need to be included in a schema built to validate such markup. Other possibilities for more detailed linguistic analysis are provided by elements included in that and the
2802 NDNYM might be regarded as a nym in its own right:
2812 NDNYM Within running text, a name can specify all the nyms associated with it:
2818 NDNYM is used to indicate its constituent parts, where these have been identified as distinct nyms:
2828 NDNYM element may also combine a number of other
2830 NDNYM elements together, where it is intended to show that they are all regarded as variations on the same root. Thus the different forms of the name John, all being derived from the same root, may be represented as a hierarchic structure like this:
2898 NDDATE describes a date or time with reference to some other (absolute) temporal expression, and thus may contain an
2934 NDDATER after the lamented death of the Doctor
2937 NDDATER have two distinct components. As well as the absolute temporal expression or event to which reference is made (e.g.
2942 NDDATER the death of the Doctor
2947 NDDATER between the time or date which is indicated and the referent expression (e.g.
2954 NDDATER offset
2955 NDDATER describing the direction of the distance between the time or date indicated and the referent expression (e.g.
2974 NDDATER offset
3013 NDDATER and the cited date are parts of the same temporal expression, and hence to disambiguate the phrase
3039 NDDATER Where more complex or ambiguous expressions are involved, and where it is desirable to make more explicit the interpretive processes required, the feature structure notation described in chapter
3054 NDDATER ). It is used here to link the temporal phrase with an interpretation of it. Like most traditional fairs and market days, the Glasgow Fair was established by local custom and could vary from year to year. Consequently, in order to provide such an interpretation, it is necessary to draw upon additional information which may or may not be located in the particular text in question. In this case, it is necessary at least to know the spatial and temporal context (year and place) of the fair referred to. These and other features required for the analysis of this particular temporal expression may be combined together as one feature structure of type
3081 NDDATEA It may be useful to categorize a temporal expression which is given in terms of a named event, such as a public holiday, or a named time such as
3082 NDDATEA tea time
3123 NDDATEISO The attributes for normalization of dates and times so far described use a standard format defined by
3127 NDDATEISO . The full ISO standard provides formats not available in the W3C recommendation, for example, the capability to refer to a date by its ordinal date or week date, or to refer to a century. It also provides ways of indicating duration and range.
3129 NDDATEISO When this module is included in a schema, the following additional attributes are provided:
3133 NDDATEISO These attributes may be used in preference to their W3C equivalent when it is necessary to provide a normalized value in some form not supported by the W3C attributes. For example, a century date in the W3C format must be expressed as a range, using the
3146 NDDATEISO , however, it is possible to express the same normalized value in any of the following additional ways:
3170 NDDATECUSTOM All date-related encoding described above makes use of the Gregorian calendar, on which both the ISO and W3C datetime formats are based. However, historical texts often pre-date the invention of the Gregorian calendar in the 16th century, or its adoption in Europe over the following centuries, and many other calendars are used in texts from other cultures and contexts. Non-Gregorian dates can be encoded using methods described below.
3172 NDDATECUSTOM First, a Calendar Description element needs to be supplied in the
3199 NDDATECUSTOM element in the header which defines and describes the calendar used.
3203 NDDATECUSTOM attribute is used to specify the calendar used in the
3204 NDDATECUSTOM text content
3211 NDDATECUSTOM etc. to provide more precise expressions of dates and times in a constrained and computable form, it is often necessary to express a date or a date-range from a non-Gregorian calendar in a more precise manner. The attributes whose names end in
3215 NDDATECUSTOM is used to identify the calendar used in the content of these attributes:
3224 NDDATECUSTOM attribute specifies the calendar used in the text content of the
3228 NDDATECUSTOM attribute signifies that the calendar used in the
3230 NDDATECUSTOM attribute is also Julian. The schema could be customized in order to constrain the content of custom attributes in a manner similar to the constraints provided on regular Gregorian dating attributes such as
3236 NDDATECUSTOM , providing the Gregorian calendar equivalent of the Julian date:
3259 ND The selection and combination of modules to form a TEI schema is described in

DS-DefaultTextStructure.xml#13163

# id text
4 DS This chapter describes the default high-level structure for TEI documents. A full TEI document combines metadata describing it, represented by a
10 DS class, or the two in combination. This group of elements makes up a
23 DS , is also defined for the representation of language corpora, or other collections of encoded texts. A
33 DS . This permits the encoder to distinguish metadata applicable to the whole collection of encoded texts, which is represented by the outermost
37 DS elements within the corpus. Further information about the organization and encoding of language corpora is given in chapter
40 DS In summary, when the default structure module is included in a schema, the following elements are available for the representation of the outermost structure of a TEI document:
51 DS ). A TEI document may also contain elements from the
53 DS class (such as a collection of facsimile images, or a feature system declaration) if the appropriate module is included in a schema (see further
61 DS are available as major parts of a TEI document. These three elements are provided by the
70 DS TEI texts may be regarded either as
74 DS that is, consisting of several components which are in some important sense independent of each other. The distinction is not always entirely obvious: for example a collection of essays might be regarded as a single item in some circumstances, or as a number of distinct items in others. In such borderline cases, the encoder must choose whether to treat the text as unitary or composite; each may have advantages and disadvantages in a given situation.
76 DS Whether unitary or composite, the text is marked with the
78 DS tag and may contain front matter, a text body, and back matter. In unitary texts, the text body is tagged
80 DS ; in composite texts, where the text body consists of a series of subordinate texts or groups, it is tagged
85 DS The overall structure of a unitary text is:
102 DS The overall structure of a composite text made up of two unitary texts is:
137 DS element is provided for the case where one text is embedded within another, but does not contribute to its hierarchical organization, for example because it interrupts it, or is simply quoted within it. This is useful in such common literary contexts as the
157 DS elements, used for more complex or composite text structures, are further discussed in section
159 DS , in the case of elements which can appear in any kind of document, or elsewhere in the case of elements specific to particular kinds of document.
163 DSDIV In some texts, the body consists simply of a sequence of low-level structural items, referred to here as
168 DSDIV ). Examples in prose texts include paragraphs or lists; in dramatic texts, speeches and stage directions; in dictionaries, dictionary entries. In other cases sequences of such elements will be grouped together hierarchically into textual divisions and subdivisions, such as chapters or sections. The names used for these structural subdivisions of texts vary with the genre and period of the text, or even at the whim of the author, editor, or publisher. For example, a major subdivision of an epic or of the Bible is generally called a
176 DSDIV —unless it is an epistolary novel, in which case it may be called a
178 DSDIV . Even texts which are not organized as linear prose narratives, or not as narratives at all, will frequently be subdivided in a similar way: a drama into
202 DSDIV , etc., where the number indicates the depth of this particular division within the hierarchy, the largest such division being
203 DSDIV div1
205 DSDIV div2
207 DSDIV div3
225 DSDIV1 , this element has the following additional attributes:
228 DSDIV1 Using this style, the body of a text containing two parts, each composed of two chapters, might be represented as follows:
266 DSDIV2 these elements all bear the following additional attributes:
269 DSDIV2 The largest possible subdivision of the body is
279 DSDIV2 Using this style, the body of a text containing two parts, each composed of two chapters, might be represented as follows:
338 DSDIV3 The choice between numbered and un-numbered divisions will depend to some extent on the complexity of the material: un-numbered divisions allow for an arbitrary depth of nesting, while numbered divisions limit the depth of the tree which can be constructed. Where divisions at different levels should be processed differently (for example to ensure that chapters, but not sections, begin on a new page), numbered divisions slightly simplify the task of defining the desired processing for each level, though this distinction could also be made by supplying this information on the
342 DSDIV3 . Some software may find numbered divisions easier to process, as there is no need to maintain knowledge of the whole document structure in order to know the level at which a division occurs; such software may, however, find it difficult to cope with some other aspects of the TEI scheme. On the other hand, in a collection of many works it may prove difficult or impossible to ensure that the same numbered division always corresponds with the same type of textual feature: a
360 DSDIV3 class may be used to provide a name or description for the division. Typical values might be
368 DSDIV3 , or (for verse texts)
448 DSDIV3 ), etc. For example, suppose that the body of a text consists of a series of diary entries, each of which is potentially divided into entries for the morning and the afternoon. This might be represented in any of the following ways. First, using the un-numbered style:
535 DSDIV3X (etc.) elements will be both complete and identically organized with reference to the original source. For some purposes however, in particular where dealing with unusually large or unusually small texts, encoders may find it convenient to present as textual divisions sequences of text which are incomplete with reference to the original text, or which are in fact an ad hoc agglomeration of tiny texts. Moreover, in some kinds of texts it is difficult or impossible to determine the order in which individual subdivisions should be combined to form the next higher level of subdivision, as noted below.
537 DSDIV3X To overcome these problems, the following additional attributes are defined for all elements in the
552 DSDIV3X represents a number for the chapter, and the
554 DSDIV3X attribute takes the value
556 DSDIV3X to indicate that this division is incomplete in some respect. Other possible values for this attribute indicate whether material has been omitted initially (I), finally (F), or in the middle (M) of the division, while the
559 DSDIV3X ) may be used to indicate exactly where material has been omitted:
568 DSDIV3X element in the TEI header should also be used to record the principles underlying the selection of incomplete samples, as further described in section
604 DSDIV3X , are really quite independent of each other, although they are all marked as subdivisions of the whole group. They can be read in any order without affecting the sense of the piece; indeed, in some cases, divisions of this nature are printed in such a way as to make it impossible to determine the order in which they are intended to be read. Individual stories can be added or removed without affecting the existing components.
611 DSDTB The divisions of any kind of text may sometimes begin with a brief heading or descriptive title, with or without a byline, an epigraph or brief quotation, or a salutation such as one finds at the start of a letter. They may also conclude with a brief trailer, byline, postscript, or signature. Many of these (e.g. a byline) may appear either at the start or at the end of a text division proper.
613 DSDTB To support this heterogeneity, the TEI architecture defines five classes, all of which are populated by this module:
635 DSHD Unlike some other markup schemes, the TEI scheme does
655 DSHD is the sole member to include other such elements if required.
657 DSHD In certain kinds of text (notably newspapers), there may be a need to categorize individual headings within the sequence at the start of a division, for example as
700 DSHD may be longer than in modern works. When heading-like material appears in the middle of a text, the encoder must decide whether or not to treat it as the start of a new division. If the phrase in question appears to be more closely connected with what follows than with what precedes it, then it may be regarded as a heading and tagged as the
706 DSHD often found in newspapers or magazines, then the
740 DSOC In addition to headings of various kinds, divisions sometimes include more or less formulaic opening or closing passages, typically conveying such information as the name and address of the person to whom the division is addressed, the place or time of its production, a salutation or exhortation to the reader, and so on. Divisions in epistolary form are particularly liable to include such features. Additional elements for the detailed encoding of personal names, dates, and places are provided in chapter
753 DSOC elements are used to encode headings which identify the authorship and provenance of a division. Although the terminology derives from newspaper usage, there is no implication that
777 DSOC Where a sequence of such elements appears together, either at the beginning or end of an element, it may be convenient to group them together using one of the following elements:
844 DSAE element may be used to encode the prefatory list of topics sometimes found at the start of a chapter or other division. It is most conveniently encoded as a list, since this allows each item to be distinguished, but may also simply be presented as a paragraph. The following are thus both equally valid ways of encoding the same argument:
881 DSAE epigraph
882 DSAE is a quotation from some other work, a saying, or a motto, appearing on a title page, or at the start of a division. It may be encoded using the special-purpose
894 DSAE When an epigraph contains a quotation, this may often be associated with a bibliographic reference. In such cases, it is recommended additionally to group the quotation and its source together using the
915 DSAE postscript
916 DSAE is a passage added after the signature of a letter or, less frequently, the main portion of the body of a book, article, or essay. In English a postscript is often abbreviated as
975 DSCO classes, every textual division (numbered or un-numbered) consists of a sequence of ungrouped
978 DSCO ). The actual elements available will depend on the modules in use; in all cases, at least the component-level structural elements defined in the core will be available (paragraphs, lists, dramatic speeches, verse lines and line groups etc.). If the drama module has been selected, then other component- or phrase- level items specialized for performance texts (for example, cast lists or camera angles) will be available, as defined in chapter
979 DSCO . If the dictionary module is in use, then dictionary entries, related entries, etc. (as defined in chapter
980 DSCO ) will also be available; if the module for transcribed speech is in use, then utterances, pauses, vocals, kinesics, etc., as defined in chapter
983 DSCO Where a text contains low-level elements from more than one module these may appear at any point; there is no requirement that elements from the same module be kept together.
1004 DSGRPF should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes. The
1007 DSGRPF should be used to represent an independent text which interrupts the text containing it at any point but after which the surrounding text resumes.
1014 DSGRP element include anthologies and other collections. The presence of common front matter referring to the whole collection, possibly in addition to front matter relating to each individual text, is a good indication that a given text might usefully be encoded in this way; this structure may be found useful in other circumstances too.
1016 DSGRP For example, the overall structure of a collection of short stories might be encoded as follows:
1091 DSGRP A text which is a member of a group may itself contain groups. This is quite common in collections of verse, but may happen in any kind of text. As an example, consider the overall structure of a typical collection, such as the
1093 DSGRP edition of Crashaw's poetry. Following a critical introduction and table of contents, this work contains the following major sections:
1096 DSGRP (a collection of verse first published in 1648)
1105 DSGRP I (a collection of fragments all taken from a single manuscript)
1108 DSGRP II (a further collection of fragments, taken from a different manuscript)
1111 DSGRP Each of the three collections published in Crashaw's lifetime has a reasonable claim to be considered as a text in its own right, and may therefore be encoded as such. It is rather more arbitrary as to whether the two posthumous collections should be treated as two groups, following the practice of the
1113 DSGRP edition. An encoder might elect to combine the two into a single group or simply to treat each fragment as an ungrouped unitary text.
1117 DSGRP edition reprints the whole of each of the three original collections, including their original front matter (title pages, dedications etc.). These should be encoded using the
1120 DSGRP ), while the body of each collection should be encoded as a single
1122 DSGRP element. Each individual poem within the collections should be encoded as a distinct
1124 DSGRP element. The beginning of the whole collection would thus appear as follows (for further discussion of the use of the elements
1237 DSGRP element may be used in this way to encode any kind of collection of which the constituents are regarded by the encoder as texts in their own right. Examples include anthologies or collections of verse or prose by multiple authors, florilegia, or commonplace books, journals, day books, etc. As a fairly typical example, we consider
1254 DSGRP Each titled section listed above comprises a group of extracts or complete texts from writers of a given historical period, preceded by an introductory essay. For example, the second group listed above contains, inter alia, the following:
1268 DSGRP Each group of writings by a single author is preceded by a brief biographical notice. Some of the extracts are quite lengthy, containing several chapters or other divisions; others are quite short. As the above list indicates, the texts included range across all kinds of material: verse, prose, journals and letters.
1270 DSGRP The easiest way of encoding such an anthology is to treat each individual extract as a text in its own right. A sequence of texts by a single author, together with the biographical note preceding it, can then be treated as a single
1274 DSGRP formed by the section. The sequence of single or composite texts making up a single section of the work is likewise treated, together with its prefatory essay, as a single
1345 DSGRP Note that the editor's introductory essays on each author may be treated as texts in their own right (as the essays on Lady Mary Wortley Montagu and Alexander Pope have been treated above), or as front matter to the embedded text, as the essay on Swift has been. The treatment in the example is intentionally inconsistent, to allow comparison of the two approaches. Consistency can be imposed either by treating the Swift section as a
1347 DSGRP containing one text by Swift and one by the editor, or by treating the Montagu and Pope sections as
1349 DSGRP elements containing the editor's essays as front matter. Marked in the second way, the Pope section of the book would look like this:
1370 DSGRP front
1377 DSGRP Where, as in this case, an anthology contains different kinds of text (for example, mixtures of prose and drama, or transcribed speech and dictionary entries, or letters and verse), the elements to be encoded will of course be drawn from more than one module. The elements provided by the core module described in chapter
1378 DSGRP should however prove adequate for most simple purposes, where prose, drama, and verse are combined in a single collection.
1380 DSGRP For anthologies of short extracts such as commonplace books, it may often be preferable to regard each extract not as a text in its own right but simply as a quotation or
1385 DSGRP which appears in the front matter of Melville's
1432 DSFLT An important characteristic of the unitary or composite text structures discussed so far is that they can be regarded as forming what is mathematically known as a
1434 DSFLT covering the whole of the available text (or text division) at each hierarchic level. Just as an XML document has a single root element containing a single tree, each node of which forms a properly nested sub-tree, so it seems natural to think of the internal structure of a text as decomposable hierarchically into subparts, each of which is a properly nested subtree. While this is undoubtedly true of a large number of documents, it is not true of all. In particular, it is not true of texts which are only partly tessellated at a given level. For example, if a text A is contained by text B in such a way that part of B precedes A and part follows it, we cannot tessellate the whole of B. In such a case, we say that text A is a
1446 DSFLT might be regarded as containing many floating texts embedded within another single text, the framing narrative, rather than as groups of discrete texts in which the fragments of framing narrative are regarded as front or back matter.
1448 DSFLT As an example, we consider an 18th century text
1451 DSFLT , by Jane Barker (1726). This lengthy narrative contains nearly a hundred distinct
1453 DSFLT embedded (as the title suggests) in a single patchwork. The work begins by introducing the central character, Galecia, but within a few pages launches into a distinct narrative, the story of Captain Manly:
1504 DSFLT In other multi-narrative texts, the individual nested tales may have greater significance than the framing narratives, and it may therefore be preferable to treat the fragments of framing narrative as front or back matter associated with each nested tale. This is commonly done, for example, in texts such as Chaucer's
1506 DSFLT , where each tale is typically presented with front matter in which the teller of the tale is introduced, and back matter in which the pilgrims comment on it.
1514 DSFLT suggest that its content derives from a source external to the current text,
1516 DSFLT carries no such implication and is simply used whenever the richer content model that it provides is required to support the markup of a part of a text that is presented as a discrete
1518 DSFLT In some cases, such inclusions could be considered external (e.g., enclosures, attachments, etc.); often however, as in the examples above, the included text bears no signs of emanating from outside.
1523 DSFLT may be used in combination. For a text with rich internal structure that is quoted at length,
1536 DSVIRT Where the whole of a division can be automatically generated, for example because it is derived from another part of this or another document, an encoder may prefer not to represent it explicitly but instead simply mark its location by means of a processing instruction, or by using the special purpose
1559 DSVIRT For example, if the table of contents (toc) for a given work is simply derived by copying the first
1564 DSVIRT Similarly, in a digital edition combining a transcribed version of some text with a translated version of it, it may be desired to represent the transcript, the translation, and an aligned version of the two as three distinct divisions. This could be achieved by an encoding like the following:
1568 DSVIRT The processing to be carried out when a
1570 DSVIRT element is rendered will be determined by the application program or stylesheet in use: the function of the TEI markup is simply to identify the location at which the virtual division is to be generated, and also to provide some information about the kind of division to be generated. As such it may be regarded as a special kind of processing instruction, and could equally well be represented by one.
1576 DSFRONT front matter
1577 DSFRONT we mean distinct sections of a text (usually, but not necessarily, a printed one), prefixed to it by way of introduction or identification as a part of its production. Features such as title pages or prefaces are clear examples; a less definite case might be the prologue attached to a play. The front matter of an encoded text should not be confused with the TEI header described in chapter
1578 DSFRONT , which serves as a kind of front matter for the computer file itself, not the text it encodes.
1580 DSFRONT An encoder may choose simply to ignore the front matter in a text, if the original presentation of the work is of no interest, or for other reasons; alternatively some or all components of the front matter may be thought worth including with the text as components of the
1586 DSFRONT With the exception of the title page (on which see section
1587 DSFRONT ), front matter should be encoded using the same elements as the rest of a text. As with the divisions of the text body, no other specific tags are proposed here for the various kinds of subdivision which may appear within front matter: instead either numbered or un-numbered
1592 DSFRONT for attributes, it is recommended that software written to handle TEI-conformant texts be prepared to recognize and handle these values when they occur, without limiting the user to the values in this list.
1595 DSFRONT attribute may be used to distinguish various kinds of division characteristic of front matter:
1598 DSFRONT A foreword or preface addressed to the reader in which the author or publisher explains the content, purpose, or origin of the text.
1601 DSFRONT A formal declaration of acknowledgment by the author in which persons and institutions are thanked for their part in the creation of a text.
1604 DSFRONT A formal offering or dedication of a text to one or more persons or institutions by the author.
1605 DSFRONT abstract
1607 DSFRONT A summary of the content of a text as continuous prose.
1610 DSFRONT A table of contents, specifying the structure of a work and listing its constituents. The
1618 DSFRONT The following extended example demonstrates how various parts of the front matter of a text may be encoded. The front part begins with a title page, which is presented in section
1619 DSFRONT below. This is followed by a dedication and a preface, each of which is encoded as a distinct
1647 DSFRONT The front matter concludes with another
1649 DSFRONT element, shown in the next example, this time containing a table of contents, which contains a
1654 DSFRONT element to provide page-references: the implication here is that the target identifiers supplied (fish1, fish2, etc.) will correspond with identifiers used for the
1656 DSFRONT elements containing chapters of the text itself. (For the
1688 DSFRONT Alternatively, the pointers in the index might link to the page breaks at which a chapter begins, assuming that these have been included in the markup:
1702 DSFRONT The following example uses numbered divisions to mark up the front matter of a medieval text. Note that in this case no title page in the modern sense occurs; the title is simply given as a heading at the start of the front matter. Note also the use of the
1751 DSFRONT If, however, the table of contents can be automatically generated from the remainder of the text, it may be preferable simply to mark its presence, either by means of an empty
1758 DSTITL Detailed analysis of the title page and other
1760 DSTITL of older printed books and manuscripts is of major importance in descriptive bibliography and the cataloguing of printed books; such analysis may require a rather more detailed module than that proposed here.
1761 DSTITL The following elements are suggested as a means of encoding the major features of most title pages:
1782 DSTITL class. Any number of elements from this class can appear grouped together within a
1786 DSTITL element is included so as to enable encoders to record the presence of complex non-textual material on a title page. For simple cases such as printers' ornaments or illustrations the
1797 DSTITL element without any need to group them together and encode a complete title page.
1799 DSTITL Encoders wishing to add new elements to either class may do so using the methods described in section
1800 DSTITL . Two examples of the use of these elements follow. First, the title page of the work discussed earlier in this section:
1822 DSTITL tag to mark the line breaks of the original where necessary:
1868 DSTITL Where, as here, it is considered important to encode salient features of the way a title page was originally rendered, the techniques exemplified in
1873 DSTITL Where title pages are encoded, their physical rendition is often of considerable importance. One approach to this requirement would be to use the
1876 DSTITL , to segment the typographic content of each part of the title page, and then use the global
1888 DSBACK Conventions vary as to which elements are grouped as back matter and which as front. For example, some books place the table of contents at the front, and others at the back. Even title pages may appear at the back of a book as well as at the front. The content model for
1896 DSBACK attribute on all division elements, in order to distinguish various kinds of division characteristic of back matter:
1899 DSBACK An ancillary self-contained section of a work, often providing additional but in some sense extra-canonical text.
1902 DSBACK A list of terms associated with definition texts (
1905 DSBACK list type="gloss"
1913 DSBACK A list of bibliographic citations: this should be encoded as a
1917 DSBACK index
1919 DSBACK Any form of index to the work.
1920 DSBACK colophon
1925 DSBACK No additional elements are proposed for the encoding of back matter at present. Some characteristic examples follow; first, an index (for the case in which a printed index is of sufficient interest to merit transcription):
1958 DSBACK Note that if the page breaks in the original source have also been explicitly encoded, and given identifiers, the references to them in the above index can more usefully be recorded as links. For example, assuming that the encoding of page 461 of the original source starts like this:
1959 DSBACK then the last item above might be encoded more usefully in either of the following forms:
1984 DSBACK And finally, a list of corrigenda and addenda with pseudo-epistolary features:
2022 textstructure Default text structure
2037 DSSTRUC The selection and combination of modules to form a TEI schema is described in

TitlePageVerso.xml#12020

# id text
2 TitlePageVerso Releases of the TEI Guidelines

TC-CriticalApparatus.xml#13092

# id text
6 TC to the text. Witnesses to a text may include authorial or other manuscripts, printed editions of the work, early translations, or quotations of a work in other texts. Information concerning variant readings of a text may be accumulated in highly structured form in a critical apparatus of variants. This chapter defines a module for use in encoding such an apparatus of variants, which may be used in conjunction with any of the modules defined in these Guidelines. It also defines an element class which provides extra attributes for some elements of the core tag set when this module is selected.
8 TC Information about variant readings (whether or not represented by a critical apparatus in the source text) may be recorded in a series of
10 TC , each entry documenting one
12 TC , or set of readings, in the text. Elements for the apparatus entry and readings, and for the documentation of the witnesses whose readings are included in the apparatus, are described in section
14 TC . The available methods for embedding the apparatus in the rest of the text, or for linking an external apparatus to the base text, are described in section
15 TC . Finally, several extra attributes for some tags of the core tag set, made available when the additional tag set for text criticism is selected, are documented in section
18 TC Many examples given in this chapter refer to the following texts of the opening (usually just line 1) of Chaucer's
56 TCAPLL methods of identifying which witnesses support a particular reading, and for describing the witnesses included in the apparatus: see section
59 TCAPLL elements for indicating which portions of a text are covered by fragmentary witnesses: see section
65 TCAPLL element is in one sense a more sophisticated and complex version of the
68 TCAPLL as a way of marking points where the encoding of a passage in a single source may be carried out in more than one way. Unlike
79 TCAPEN element, which groups together all the readings constituting the variation. The identification of discrete textual variations or apparatus entries is not a purely mechanical process; different editors may group readings differently. No rules are given here as to how to group readings into apparatus entries; the tags given here may be used to group readings in whatever way the editor finds most perspicuous or useful.
81 TCAPEN The individual apparatus entry is encoded with the
93 TCAPEN , are used to link the apparatus entry to the base text, if present. In such cases, several methods may be used for such linkage, each involving a slightly different usage for these attributes. Linkage between text and apparatus is described below in section
103 TCAPEN or other elements, as described in the next section. A very simple partial apparatus for the first line of the
105 TCAPEN might take a form something like this:
115 TCAPEN , to indicate a preference for one reading, etc. The following sections on readings, subvariation, and witness information describe some of the more important complications which can arise.
124 TCAPLR Individual readings are the crucial elements in any critical apparatus of variants. The following elements should be used to tag individual readings within an apparatus entry:
128 TCAPLR N.B. the term
130 TCAPLR is used here in the text-critical sense of
131 TCAPLR the reading accepted as that of the original or of the base text
132 TCAPLR . This sense differs from that in which the word is used elsewhere in the Guidelines, for example as in the attribute
134 TCAPLR where the intended sense is
135 TCAPLR the root form of an inflected word
137 TCAPLR the heading of an entry in a reference book, especially a dictionary
140 TCAPLR In recording readings within an apparatus entry, the
152 TCAPLR element may also be used to record the base text of the source edition, to mark the readings of a base witness, to indicate the preference of an editor or encoder for a particular reading, or (e.g. in the case of an external apparatus) to indicate precisely to which portion of the main text the variation applies. Those who prefer to work without the notion of a base text or who are not using the parallel segmentation method may prefer not to use it at all. How it is used depends in part on the method chosen for linking the apparatus to the text; for more information, see section
160 TCAPLR As members of the attribute classes
174 TCAPLR As elsewhere, these attributes may be used to indicate the person responsible for the editorial decision being recorded, and also the degree of certainty associated with that decision by the person carrying out the encoding.
178 TCAPLR attribute identifies the witnesses which have the reading in question. It is required if the apparatus gathers together readings from different witnesses, but may be omitted in an apparatus recording the readings of only one witness, e.g. substitutions, divergent opinions on what is in the witness or on how to expand abbreviations, etc. Even in such a one-witness apparatus, however, the
180 TCAPLR attribute may still be useful when it is desired to record the occurrence of a particular reading in some other witness. For other methods of identifying the witnesses to a reading, see section
204 TCAPLR attributes may be used to convey information on the sequence and cause of variation. In the following apparatus fragment, the reading
209 TCAPLR per
244 TCAPLR Similarly, if a witness is hard to decipher, it may be desired to indicate responsibility for the claim that a particular reading is supported by a particular witness. In line 2212a of
246 TCAPLR , for example, the manuscript is read in different ways by different scholars; the editor Klaeber prints one text, using parentheses to indicate his expansion, and records in the apparatus two different accounts of the manuscript reading, by Zupitza and Chambers:
268 TCAPLR attributes are intelligible only on an element recording a reading from a single witness, and should not be used if more than one witness is given on the same
272 TCAPLR element. If more than one witness is given for the reading, they are undefined. To convey this information when the witness is one among several, the
277 TCAPLR Where there is a greater weight of editorial discussion and interpretation than can conveniently be expressed through the attributes provided on these elements (for example where there are multiple witnesses for a single reading or multiple editorial responsibility for an emendation) this information can be attached to the apparatus in a note, or recorded in the feature structure notation defined in chapter
278 TCAPLR . In particular, such recurring text-critical situations as palaeographic confusion of particular letters, or homœoarchy or homœoteleuton involving specific character groups, may lend themselves to feature structure treatment. Information concerning these recurrent situations may be encoded into database-like fragments within the text which would then be available to sophisticated computer-assisted analysis. Further work remains to be done on such mechanisms, however, and so no examples are given here of the use of feature structures in text-critical apparatus.
282 TCAPLR element may also be used to record the specific wording of notes in the apparatus of the source edition, as here in a transcription of Friedrich Klaeber's note on
293 TCAPLR Notes providing details of the reading of one particular witness should be encoded using the specialized
298 TCAPLR Encoders should be aware of the distinct fields of use of the attribute values
310 TCAPLR indicates the scholar responsible for asserting the existence of that reading in that physical entity. In some cases, the categories may blur: a scholar may produce an edition introducing readings for which he or she is responsible; that edition may itself become a witness in a later critical apparatus. Thus, readings introduced as corrections in the earlier edition will be seen in the later apparatus as witnessed by the earlier edition. As observed in the discussion concerning the discrimination of
328 TCAPSU element may be used to group readings, either because they have identical values on one or more attributes, or because they are seen as forming a self-contained variant sequence, or for some other reason. This grouping of readings is entirely optional: no such grouping of readings is required.
356 TCAPSU To indicate that both Hg and La vary only orthographically from the lemma, one might tag both readings
357 TCAPSU rdg type='orthographic'
373 TCAPSU may be used to organize the substantive variants of an apparatus entry. Editors may need to indicate that each of a group of witnesses may be taken as all supporting a particular reading, even though there may be variation concerning the exact form of that reading in, or the degree of support offered by, those witnesses. For example: one may identify three substantive variants on the first word of Chaucer's
381 TCAPSU . In fact, the manuscripts display many different spellings of these words, and a scholar may wish both to show that the manuscripts have all these variant spellings and that these variant spellings actually support only the three regularized spelling forms. One may term these variant spellings as
387 TCAPSU element by gathering the readings into three groups according to the normalized form of their reading. All the readings within each group may be accounted subvariants of the main reading for the group, which may be indicated by tagging it as a
390 TCAPSU rdg type='groupBase'
428 TCAPSU is supported by Ra2, even though the form differs in that manuscript. Accordingly, an application which recognizes that these apparatus entries show subvariation may then assign all the witnesses instanced as attesting the sub-variants on that lemma as actually supporting the reading of the lemma itself at a higher level of classification. Thus, Ha4 here supports the reading
434 TCAPSU element might also be used to group readings in the same way. The example above is substantially identical to the following, which uses
465 TCAPSU This expresses even more clearly than the previous encoding of this material that at the highest level of classification (apparatus entry A1), this variation has three normalized readings, and that the first of these is supported by manuscripts El, Hg, and Ha4; the second by Cp, Ld1, and La; and the third by Ra2. Some encoders may find the use of nested apparatus entries less intuitive than the use of reading groups, however, so both methods of classifying the readings of a variation are allowed.
467 TCAPSU Reading groups may also be used to bring together variants which form an apparent developmental sequence, and to make clear that other readings are not part of that sequence, as in the following example, which makes clear that the variant sequence
506 TCAPLW A given reading is associated with the set of witnesses attesting it by listing the witnesses in the
514 TCAPLW element. Special mechanisms, described in the following sections, are needed to associate annotation on a reading with one specific witness among several (section
515 TCAPLW ), to transcribe witness information verbatim from a source edition (section
516 TCAPLW ), and to identify the formal lists of witnesses typically provided in the front matter of critical editions (section
522 TCAPWD When it is desired to give additional information about a particular witness or witnesses for the reading, the information may be given in a
524 TCAPWD element. This is a specialized form of note, which can be linked both to a reading and to one or more of the witnesses for that reading. The former linkage is effected by the
541 TCAPWD cannot be included in the text at the point of attachment; it must point to the reading(s) being annotated by means of its
543 TCAPWD attribute. To indicate, on the authority of editor PR, that the Ellesmere manuscript has an ornamental capital in the word
555 TCAPWD This encoding makes clear that the ornamental capital mentioned is in the Ellesmere manuscript, and not in Hengwrt or Ha4. The
563 TCAPWD may be used to record the specific wording of information in the source text, even when the information itself is captured in some more formal way elsewhere. The example from the
566 TCAPWD ), for example, might be extended thus, to record the wording of the note explaining the variant:
590 TCAPWD Observe that a single witness detail element may be linked to several different readings (noting, for example, a recurrent phenomenon in a particular manuscript) by having the
592 TCAPWD attribute point at all the readings in question. Similarly, feature structures containing information about the text in a witness (whether retroversion, regularization, or other) can also be linked to specific
606 TCSCWL In the transcription of printed critical editions, it may be desirable to retain for future reference the exact form in which the source edition records the witnesses to a particular reading; this is particularly important in cases of ambiguity in the information, or uncertainty as to the correct interpretation. The
613 TCSCWL list may appear following a
619 TCSCWL element in any apparatus entry, and should be used only to transcribe the witness information in the form found in the source.
626 TCSCWL The advantage of holding witness information in the
633 TCSCWL an application can check that every sigil
634 TCSCWL We use the term sigil as the English equivalent of the Latin term
639 TCSCWL attribute has declared datatype of one or more
641 TCSCWL values, a check can be made that readings are assigned only to witness sigla which have been identified (using the
646 TCSCWL ). Such checking is more difficult for witness sigla held as the content of a
649 TCSCWL For this reason, it is recommended that encoders always hold witness information in the
655 TCSCWL , where possible. Thus, as in the examples below, even when a reference to a witness is exactly reproduced in the
657 TCSCWL element, the corresponding sigil for that witness can be written into the
663 TCSCWL . However, in cases where it is uncertain how the witness reference contained in the
665 TCSCWL element should be interpreted, or where no witness exists, the
703 TCSCWL Of course, the sigil used for a particular witness in the source, as recorded in the
705 TCSCWL element, may well differ from that used to indicate the same witness in the
707 TCSCWL attribute, as shown particularly in the apparatus for the second line of the poem (Diet.1.2).
716 TCAPWL A list of all identified witnesses should normally be supplied in the front matter of the edition, or in the
723 TCAPWL element, which contains a series of
727 TCAPWL element may contain a brief characterization of the witness, given as one or more prose paragraphs. If more detailed information about a manuscript witness is available, it should be represented using the
737 TCAPWL Whether information about a particular witness is supplied by means of a
743 TCAPWL element, a unique sigil for this source should always be supplied, using the global
745 TCAPWL attribute. This identifier can then be used elsewhere to refer to this particular witness.
753 TCAPWL The minimal information provided by a witness list is thus the set of sigla for all the witnesses named in the apparatus. For example, the witnesses referenced by the examples of this chapter might simply be listed thus:
770 TCAPWL It is more helpful, however, for witness lists to be somewhat more informative: each
781 TCAPWL As the last example shows, the witness description here may be complemented by a reference to a full description of the manuscript supplied elsewhere, typically as the content of a
821 TCAPWL . Note also that if the witnesses being recorded are not manuscripts but printed works, it may be preferable to document them using the standard
838 TCAPWL In text-critical work it is customary to refer to frequently occurring groups of witnesses by means of a single common sigil. Such sigla may be documented as pseudo-witnesses in their own right by including a nested witness list within the witness list, which uses the sigil for the group as its identifier, and supplies a fuller name for the group in its optional child
869 TCAPWL Note that a single witness cannot appear more than once in a witness list, and therefore cannot be assigned to more than one group of witnesses.
871 TCAPWL Situations commonly arise where there are many more or less fragmentary witnesses, such that there may be quite distinct groups of witnesses for different parts of a text or collection of texts. One may treat this with distinct
875 TCAPWL element at the beginning of the file or in its header listing all the witnesses, partial and complete, for the text, with the attestation of fragmentary witnesses indicated within the apparatus by use of the
882 TCAPWL If a witness list is provided, it may be unnecessary to give, in each apparatus entry, an exhaustive list of the witnesses which agree with the base text. An application program can, in principle, compare the witnesses given for each variant found with those given in the full list of witnesses, subtracting from this list all the witnesses not active at this point (perhaps because of a lacuna, or because they contain a variation on a different, overlapping lemma), and thence calculate all the manuscripts agreeing with the base text. In practice, encoders may find it less error-prone to list all witnesses explicitly in each apparatus entry.
893 TCAPMI If a witness is incomplete (whether a single fragment, a series of fragments, or a relatively complete text with one or more lacunae), it is usually desirable to record explicitly where its preserved portions begin and end. The following empty tags, which may occur within any
897 TCAPMI element, indicate the beginning or end of a fragmentary witness or of a lacuna within a witness:
909 TCAPMI when the module defined by this chapter is included in a schema.
913 TCAPMI has a physical lacuna, and the text of the manuscript begins with
933 TCAPMI both appear in witness X. In some cases, the apparatus in the source may commence recording the readings for a particular witness without its being clear whether the previous absence of readings for this witness is due to a lacuna, or to some other reason. The
955 TCAPLK Three different methods may be used to link a critical apparatus to the text:
961 TCAPLK the parallel segmentation method.
968 TCAPLK apparatus, the former dispersed within the base text, the latter held in some separate location, within or outside the document with the base text. The parallel segmentation method does not use the concept of a base text and may only be used for in-line apparatus.
975 TCAPLK element provides a useful means of grouping together a series of
993 TCAPLK element of its TEI header, thus:
1000 TCAPLO The location-referenced method of encoding apparatus provides a convenient method for encoding printed apparatus; in this method, as in most printed editions, the apparatus is linked to the base text by indicating explicitly only the block of text on which there is a variant (noted usually by a canonical reference scheme, or by line number in the edition, such as
1003 TCAPLO Page 15 line 1
1006 TCAPLO If the location-referenced method is used for an apparatus stored externally to the base text, the TEI header must have the declaration:
1010 TCAPLO of the document, the base text (here El) will appear:
1034 TCAPLO If the same text is encoded using in-line storage, the apparatus is dispersed through the base text block to which it refers. In this case, the location of the variant can be read from the line in which it occurs.
1047 TCAPLO Since the location is not required to be exact, the apparatus for a line might also appear at the end of the line:
1057 TCAPLO When the apparatus is linked to the text by means of location references, as shown here, it is not possible to find automatically the precise portion of text varied by the readings. In order to show explicitly what portion of the base text is replaced by the variant readings, the
1071 TCAPLO base text reading
1072 TCAPLO and requiring no qualification, but it may optionally carry the normal attributes, as shown here. Some text critics prefer to abbreviate or elide the lemma, in order to save space or trouble; such practice is not forbidden by these Guidelines, but no recommendations are made for conventions of abbreviating the lemma, whether abbreviation of each word, or suppression of all but the first and last word, etc.
1080 TCAPDE In the double end-point attachment method, the beginning and end of the lemma in the base text are both explicitly indicated. It thus differs from the location-referenced method, in which only the larger span of text containing the lemma is indicated. Double end-point attachment permits unambiguous matching of each variant reading against its lemma. Either it or the parallel segmentation method should be used in all cases where this is desired, for example where the apparatus is intended to enable full reconstruction of the text, or of the substantives, of every witness.
1091 TCAPDE . In cases where it is not possible to insert anchors within the base text (e.g. where the text is on a read-only medium) the beginning and end of the lemma may be indicated by using the
1096 TCAPDE The double end-point attachment method may be used with in-line or external apparatus. In the latter case, the base text (here El) will appear with
1098 TCAPDE elements inserted at every place where a variant begins or ends (unless some element with an identifier already begins or ends at that point):
1120 TCAPDE attribute can use the identifier for the line as a whole; the lemma is assumed to run from the beginning of the element indicated by the
1124 TCAPDE attribute. If no value is given for
1149 TCAPDE element in this method, as it may be extracted reliably from the base text. If an exhaustive list of witnesses is available, it will also not be necessary to specify just which manuscripts agree with the base text to enable reconstruction of witnesses. An application will be able to determine the manuscripts that witness the base reading, by noting which witnesses are attested as having a variant reading, and inferring the base text reading for all others after adjusting for fragmentary witnesses and for witnesses carrying overlapping variant readings.
1151 TCAPDE Alternatively, if it is desired to make an explicit record of the attestation of the base text, the
1166 TCAPDE . For example, at line 117 of the Wife of Bath's Prologue, the manuscripts Hg (Hengwrt), El (Ellesmere), and Ha4 (British Library Harleian 7334) read:
1206 TCAPDE The parallel segmentation method, to be discussed next, cannot handle overlaps among variants, and would require the individual variants to be split into pieces.
1208 TCAPDE Because creation and interpretation of double end-point attachment apparatus will be lengthy and difficult, it is likely that they will usually be created and examined by scholars only with mechanical assistance.
1214 TCAPPS This method differs from the double end-point attachment method in that all variants at any point of the text are expressed as variants on one another. In this method, no two variations can overlap, although they may nest. Thus, the concepts of a base text and of a lemma become unnecessary: the texts compared are divided into matching segments all synchronized with one another. This permits direct comparison of any span of text in any witness with that in any other witness. It is also very easy with this method for an application to extract the full text of any one witness from the apparatus.
1216 TCAPPS This method will (by definition) always be satisfactory when there are just two texts for comparison (assuming they are in the same language and script). It will also be useful where editors do not wish to privilege a text as the
1218 TCAPPS or when editors wish to present parallel texts. It will become less convenient as traditions become more complex and tension develops between the need to segment on the largest variation found and the need to express the finest detail of agreement between witnesses.
1220 TCAPPS In the parallel segmentation method, each segment of text on which there is variation is marked by an
1224 TCAPPS element; if it is desired to single out one reading as preferred, it may be tagged
1239 TCAPPS This method cannot be used with external apparatus: it must be used in-line. Note that apparatus encoded with this method may be translated into the double end-point attachment method and back without loss of information. Where double end-point attachment encodings have no overlapping lemmata, translation of these to the parallel segmentation encoding and back will also be possible without loss of information.
1241 TCAPPS For economy, the witnesses to the reading most widely attested need not be stated. Since all manuscripts must be represented in all apparatus entries, it will be possible for an application to read a
1243 TCAPPS declaring all the witnesses to the text and then calculate which witnesses have not been named. In the example below, only La and Ra2 are identified explicitly with a reading; an application might successfully infer from this that
1260 TCAPPS As noted, apparatus entries may nest in this method: if an imaginary fifth manuscript of the text read
1262 TCAPPS , the variation on the individual words of the line would nest within that for the line as a whole:
1293 TCAPPS Parallel segmentation cannot, however, deal very gracefully with variants which overlap without nesting: such variants must be broken up into pieces in order to keep all witnesses synchronized.
1300 TCAPLN When an apparatus is provided it does not need to be given at the location in the transcription where the variation, emendation, attribution, or other apparatus observation occurs. Instead it may be stored in a separate place in the same file, or indeed in another file, and point to the location at which it is meant to be used. Storing apparatus entries separately can be beneficial when encoding multiple competing, potentially overlapping, interpretations of the same point in the source texts.
1302 TCAPLN The location-referenced method can be used to point to a position in a text using the
1310 TCAPLN or other element at the location where the apparatus observation takes place. The contents of an element pointed to are understood to be equivalent to a
1312 TCAPLN if none exists in the
1314 TCAPLN , and if a
1322 TCAPLN datatype and thus contains a URI as a value. This means that it can point directly to an
1353 TCAPLN is not provided in the source file.
1355 TCAPLN In addition, URLs can contain XPointer schemes including xpath(), range(), and string-range() which can be used in providing the location of an
1357 TCAPLN that is stored separately from the text to which it applies. Both
1361 TCAPLN can be used, as in the double end-point attachment method, to identify the starting and ending location for an apparatus using XPointer schemes described in
1362 TCAPLN section to more precisely identify this location where beneficial.
1379 TCAPLN attribute is provided then it should be understood that this supplies the location of the textual variance that the apparatus documents. If the
1381 TCAPLN attribute contains an XPointer scheme that identifies a range of text (or elements) then this is understood to record the starting and ending of the range as in the double end-point attachment method. In such a case a @to attribute is unnecessary.
1390 TCTR element. An application may then construct different
1398 TCTR element. Consider, for example, the three different transcriptions given below of line 105 of the Hengwrt manuscript of Chaucer's
1400 TCTR . The last word of the line
1407 TCTR u
1413 TCTR u
1428 TCTR This example uses special purpose elements
1456 TCTR In most cases, elements used to indicate features of a primary textual source may be represented within an
1464 TCTR elements in the example just given. However, in cases where the tagged feature extends across a span of text which might itself contain variant readings which it is desired to represent by
1466 TCTR structures, some adaptation of the tagging may be necessary. For example, a span of text may be marked in the transcription of the primary source as a single deletion but it may be desirable to represent just a few words from this source as individual deletions within the context of a critical apparatus drawing together readings from this and several other witnesses. In this case, the tagging of the span of words as one deletion may need to be decomposed into a series of one-word deletions for encoding within the apparatus. If it is important to record the fact that all were deleted by the same act, the markup may use the
1495 TC The selection and combination of modules to form a TEI schema is described in
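
The witness-inference procedure described in row 882 (TCAPWL) above can be sketched as follows. This is only an illustrative sketch, not part of the TEI Guidelines or of any TEI tool: the function name `infer_base_witnesses`, its parameters, and the sample sigla lists are all hypothetical, and the sigla are assumed to have been extracted already from the witness list and from the readings of one apparatus entry.

```python
def infer_base_witnesses(all_witnesses, variant_witnesses, inactive=()):
    """Return the sigla presumed to agree with the base text.

    all_witnesses     -- every sigil declared in the witness list (hypothetical input)
    variant_witnesses -- one list of sigla per variant reading in this apparatus entry
    inactive          -- sigla not active at this point (lacuna, overlapping lemma)
    """
    # Collect every witness explicitly attested for some variant reading.
    attested = set()
    for sigla in variant_witnesses:
        attested.update(sigla)
    excluded = attested | set(inactive)
    # Whatever remains is inferred to carry the base text reading;
    # the order of the witness list is preserved.
    return [w for w in all_witnesses if w not in excluded]


# Illustrative data only: five sigla, two of them attached to variant readings.
witness_list = ["El", "Hg", "Ha4", "La", "Ra2"]
readings = [["La"], ["Ra2"]]
base = infer_base_witnesses(witness_list, readings)
base_with_lacuna = infer_base_witnesses(witness_list, readings, inactive=["Hg"])
```

The sketch treats the inference as plain set subtraction; a real application would also have to decide how fragmentary witnesses and overlapping lemmata are detected, which is left here to the caller through the `inactive` argument.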

DR-PerformanceTexts.xml#13128

# id text
3 DR This module is intended for use when encoding printed dramatic texts, screen plays or radio scripts, and written transcriptions of any other form of performance.
6 DR discusses elements such as cast lists, which can appear only in the front or back matter of printed dramatic texts. Section
7 DR discusses the structural components of performance texts: these include major structural divisions such as acts and scenes (section
10 DR ); stage directions (section
14 DR discusses a small number of additional elements characteristic of screen plays and radio or television scripts, as well as some elements for representing technical stage directions such as lighting or blocking.
16 DR The default structure for dramatic texts is similar to that defined by chapter
20 DR Two element classes are used by this module. The
22 DR class supplies specialized elements which can appear only in the front or back matter of performance texts. The
24 DR class supplies a set of elements for stage directions and similar items such as camera movements, which can occur between or within speeches.
31 DRFAB In dramatic texts, as in all TEI-conformant documents, the header element is followed by a
33 DRFAB element, which contains optional front and back matter, and either a
46 DRFAB elements are most likely to be of use when encoding preliminary materials in published performance texts. When the module defined by this chapter is included in a schema, the following additional elements not generally found in other forms of text become available as part of the front or back matter:
49 DRFAB Elements for encoding each of these specific kinds of front matter are discussed in the remainder of this section, in the order given above. In addition, the front matter of dramatic texts may include the same elements as that of any other kind of text, notably title pages and various kinds of text division, as discussed in section
51 DRFAB div type="performance"
53 DRFAB div1 type="set"
56 DRFAB Most other material in the front matter of a performance text will be marked with the default text structure elements described in chapter
57 DRFAB . For example, the title page, dedication, other commendatory material, preface, etc., in a printed text should be encoded using
61 DRFAB elements, containing headings, paragraphs, and other core tags.
70 DRSET A special form of note describing the setting of a dramatic text (that is, the time and place of its action) is sometimes found in the front matter.
71 DRSET Descriptions of the setting may also appear as initial stage directions in the body of the play, but such descriptions should be marked as stage directions, not
75 DRSET element should be used only where the description forms part of the front matter, as in the following examples:
125 DRPRO Many plays in the Western tradition include in their front matter a prologue, spoken by an actor, generally not in character. Similar speeches often also occur at the end of the play, as epilogues. The elements
129 DRPRO are provided for the encoding of such features within the front or back matter, where appropriate.
130 DRPRO A prologue may be encoded just like a distinct poem, as in the following example:
164 DRPRO A prologue or epilogue may also be encoded as a speech, using the
167 DRPRO . This is particularly appropriate where stage directions, etc., are involved, as in the following example:
203 DRPRO In cases where the prologue or epilogue is clearly a significant part of the dramatic action, it may be preferable to include it in the body of a text, rather than in the front or back matter. In such cases, the encoder (and theatrical tradition) will determine whether or not to regard it as a new scene or division, or simply the final speech in the play. In the First Folio version of Shakespeare's
205 DRPRO , for example, Prospero's final speech is clearly marked off as a distinct textual unit by the headings and layout of the page, and might therefore be encoded as back matter:
294 DRPERF Performance texts are not only printed in books to be read; they are also performed. It is common practice therefore to include within the front matter of a printed dramatic text some brief account of particular performances, using the following element:
297 DRPERF element may be used to group any and all information relating to the actual performance of a play or screenplay, whether it specifies how the play should be performed in general or how it was performed in practice on some occasion.
299 DRPERF Performance information may include complex structures such as cast lists, or paragraphs describing the date and location of a performance, details about the setting portrayed in the performance and so forth. (See the discussion of these specialized structures in section
300 DRPERF above.) If information for more than one performance is being recorded, then more than one
304 DRPERF Names of persons, places, and dates of particular significance within the performance record may be explicitly marked using the general purpose
307 DRPERF rs type="place"
401 DRCAST cast list
402 DRCAST is a specialized form of list, conventionally found at the start or end of a play, usually listing all the speaking and non-speaking roles in the play, often with additional description (
404 DRCAST ) or the name of an actor or actress (
406 DRCAST ). Cast lists may be encoded with the general purpose
426 DRCAST A cast list relating to a specific performance may be accompanied by notes about the time or place of that performance, indicating (for example) the name of the theatre where the play was first presented, the name of the producer or director, and so forth. When the cast list relates to a specific performance, it should be embedded within a
460 DRCAST . For example, the second cast item above might be encoded as follows:
472 DRCAST element, where it is desired to link speeches within the text explicitly to the role, using the
477 DRCAST The occasionally lengthy descriptions of a role sometimes found in written play scripts may be marked using the
500 DRCAST When a list of such minor roles is given together, the
504 DRCAST should indicate that it contains more than one role, by taking a value such as
505 DRCAST list
520 DRCAST A group of cast items forming a distinct subdivision of a cast list may be marked as such by using the special purpose
524 DRCAST attribute may be used to indicate whether this grouping is indicated in the text by layout alone (i.e. the use of whitespace), by long braces or by some other means. A
528 DRCAST element) followed by a series of
551 DRCAST as a role description, and encode the above example as follows:
569 DRCAST This version has the advantage that all role descriptions are treated alike, rather than in some cases being treated as headings. On the other hand there are also cases, such as the following, where the role description does function more like a heading:
660 DRBOD The body of a performance text may be divided into structural units, variously called acts, scenes, stasima, entr'actes, etc. All such formal divisions should be encoded using an appropriate text-division element (
667 DRBOD . Whether divided up into such units or not, all performance texts consist of sequences of speeches (see
668 DRBOD ) and stage directions (see
670 DRBOD number
672 DRBOD ). Speeches will generally consist of a sequence of
674 DRBOD -level items: paragraphs, verse lines, stanzas, or (in case of uncertainty as to whether something is verse or prose)
679 DRBOD The boundaries of formal units such as verse lines or paragraphs do not always coincide with speech boundaries. Units such as songs may be discontinuous or shared among several speakers. As described below in section
685 DRDIV Large divisions in drama such as acts, scenes, stasima, or entr'actes are indicated by numbered or unnumbered
692 DRDIV attributes may be used to define the type of division being marked, and to provide a name or number for it, as in the following example:
704 DRDIV Where the largest divisions of a performance text are themselves subdivided, most obviously in the case of plays traditionally divided into acts and scenes, further nested text-division elements may be used, as in this example:
741 DRDIV convention, (where the entrance of each new set of characters is marked as a distinct unit in the text) and the
743 DRDIV element to represent the acts into which the play is divided. The elements chosen are determined only by the hierarchic position of these units in the text as a whole. If the text had no acts, but only scenes, then the scenes might be represented by
745 DRDIV elements. Equally, if a play is divided only into
747 DRDIV , with no smaller subdivisions, then the
751 DRDIV should be used, as above, to make explicit the name associated with a particular category of subdivision.
755 DRDIV . The second act in the above example would then be represented as follows:
773 DRSP The following elements are used to identify speeches and speakers in a performance text:
775 DRSP As noted above, the structure of many performance texts may be analysed as multiply hierarchic: a scene of a verse play, for example, may be divided into speeches and, at the same time, into verse lines. The end of a line may or may not coincide with the end of a speech, and vice versa. Other structures, such as songs, may be discontinuous or split up over several speeches. For some purposes it will be appropriate to regard the verse-structure as the fundamental organizing principle of the text, and for others the speech structure; in some cases, the choice between the two may be arbitrary. The discussion in the remainder of this chapter assumes that it is the speech-based hierarchy which most prominently determines the structure of performance texts, but the same mechanisms could be employed to encode a view of a performance text in which individual speeches were entirely subordinate to the formal units of prose and verse. For more detailed discussion and examples of various treatments of this fundamental issue, refer to chapter
782 DRSP element are both used to indicate the speaker or speakers of a speech, but in rather different ways. The
784 DRSP element is used to encode the word or phrase actually used within the source text to indicate the speaker: it may contain any string or prefix, and may be thought of as a highly specialized form of stage direction. The
788 DRSP element in the TEI header
791 DRSP element in the cast list
792 DRSP , or even to some external source such as an online handbook of dramatic roles. The most usual case is that the pointer value supplied (prefixed by a sharp sign) corresponds with the value of an
846 DRSP If the speaker attributions are completely regular (and may thus be reconstructed mechanically from the values given for the
848 DRSP attribute), or are of no interest for the encoder of the text (as might be the case with editorially supplied attributions in older texts), then the
850 DRSP element need not be used; the former example above then might look like this:
866 DRSP More than one identifier may be listed as value for the
868 DRSP attribute if the speech is spoken by more than one person, as in the following example:
887 DRSP elements are both declared within the core module (see section
892 DRSPG This module makes available the following additional element for handling groups of speeches:
896 DRSPG element is intended for cases where the characters in a performance launch into something which might be regarded almost as a kind of separate structural division, typically associated with its own heading or numbering system, but which
898 DRSPG in the text, at the same hierarchic level as speeches preceding or following it. Such units are often numbered, titled, and visually presented as distinct objects within the text. Here is a typical example from a well-known American musical comedy:
961 DRSTA Both between and within the speeches of a written performance text, it is normal practice to include a wide variety of descriptive directions to indicate non-verbal action. The following elements are provided to represent these:
966 DRSTA A satisfactory typology of stage directions is difficult to define. Certain basic types such as
971 DRSTA setting
974 DRSTA , are easily identified. But the list is not a closed one, and it is not uncommon to mix types within a single direction. No closed set of values for the
976 DRSTA attribute is therefore proposed at the present time, though some suggested values are indicated in the list below, which also indicates the range of possibilities.
1005 DRSTA element of the TEI header (described in section
1085 DRSTA element may also be used in non-theatrical texts, to mark sound effects or musical effects, etc., as further discussed in section
1090 DRSTA element is intended to help overcome the fact that the stage directions of a printed text may often not provide full information about either the intended or the actual movement of actors on stage. It may be used to keep track of entrances and exits in detail, so as to know which characters are on stage at which time. Its attributes permit a relatively formal specification for movements of characters, using user-defined codes to identify the characters involved (the
1094 DRSTA attribute), and optionally which part of the stage is involved (
1098 DRSTA attribute is also provided; this allows the recording of different
1104 DRSTA element should be located at the position in the text where the move is presumed to take place. This will often coincide with a stage direction, as in the following simple example:
1113 DRSTA element can however appear independently of a stage direction, as in the following example:
1133 DRPAL The actual speeches of a dramatic text may be composed of running text, which must be formally organized into paragraphs, in the case of prose (see section
1134 DRPAL ), verse lines or line groups in that of verse (see section
1137 DRPAL elements, in case of doubt as to whether the material should be treated as verse or prose. The following elements, all of which are defined in the core, are particularly useful when marking units of prose or verse within speeches:
1139 DRPAL Like other milestone elements, the element
1152 DRPAL As a member of the classes
1170 DRPAL also gain additional attributes through their membership of the class
1174 DRPAL In many texts, prose and verse may be inextricably mingled; particularly in earlier printed texts, prose may be printed as verse or verse as prose, or it may be impossible to distinguish the two. In cases of doubt, an encoder may prefer to tag the dubious material consistently as verse, to tag it all as prose, to follow the typography of the source text, or to use the neutral
1180 DRPAL element of the header may be used to record explicitly what policy has been adopted.
1184 DRPAL ) and verse (marked as
1198 DRPAL class provides one simple way of indicating where the boundaries of a speech and of a verse line or line group do not coincide. The encoder may simply indicate that a line or line group is metrically incomplete by specifying the value
1221 DRPAL Alternatively, where the fragments of the line or line group are consecutive in the text (though possibly interrupted by stage directions), the values
1249 DRPAL or line group element is most often of use for the encoding of songs and other stanzaic material. Line groups may be fragmented across speakers in the same way as individual lines, and the same set of attributes may be used to record this fact. The element
1251 DRPAL is provided in order to simplify the situation, very common in performances, where performance of a single entity, such as a song, is shared amongst several performers, as in the following example:
1279 DRPAL This encoding however does not indicate that the three lines of Sir Joseph's song and the two lines following it together constitute a single verse stanza. This can be indicated by using the
1314 DREMB Although primarily composed of speeches, performance texts often contain other structural units such as songs or strophes which are shared among different speakers. More generally, complex nested structures of plays within plays, interpolated masques, or interludes are far from uncommon. In more modern material, comparably complex structural devices such as flashback or nested playback are equally frequent. In all kinds of performance material, it may be necessary to indicate several actions which are happening simultaneously.
1316 DREMB A number of different devices are available within the TEI scheme to support these complexities in the general case. Texts may be composite or self-nesting (see section
1318 DREMB ). The TEI encoding scheme provides a variety of linking mechanisms, which may be used to indicate temporal alignment and aggregation of fragmented structures. In this section we provide a few specific examples of the application of these techniques to performance texts:
1334 DREMB attributes on fragments of embedded structures to join them into a larger whole
1343 DREMB When the whole of a song appears within a single speech, it may require no special treatment if it is considered to form a part of the speech:
1368 DREMB If however, the song is to be regarded as forming a distinct item, perhaps with its own front and back matter, it may be better to regard it as a floating text:
1396 DREMB element, each of its constituent parts must be regarded as a distinct fragment; the problem then facing the encoder is to reconstitute the interrupted whole in some way.
1400 DREMB element may be used to group together consecutive speeches which are grouped together in some way, for example constituting a single song. Alternatively the
1404 DREMB element contains a partial, not a complete, verse line, may also be used on the
1406 DREMB element, to indicate that the line group is partial rather than complete, thus:
1429 DREMB When the fragments of a song are separated by other intervening dialogue, or even when not, they may be linked together with the
1434 DREMB . For example, the line groups making up Ophelia's song might be encoded as follows:
1502 DREMB : they form part of the module for alignment and linking; this module must therefore be included in a schema if they are to be used, as further discussed in section
1510 DREMB element is specifically intended to encode the fact that several discontiguous elements of the text together form one
1571 DREMB The location of the
1581 DREMB element requires the additional module for linking, which is selected as shown above.
1585 DRSIM In printed or written versions of performance texts, a variety of techniques may be used to indicate the temporal alignment of speeches or actions. Speeches may be printed vertically aligned on the page, or braced together; stage directions (e.g.
1586 DRSIM Speaking at the same time
1643 DRSIM In the original, the stage direction
1645 DRSIM is printed opposite a brace grouping all four speeches, indicating that all four characters speak at once, and that the stage direction applies to all of them. Rather than attempting to represent the appearance of the source, this example encoding represents its presumed meaning: the
1651 DRSIM attribute is used to specify the fact that the four speeches were grouped by the brace in the copy text. Producing a readable version of the text which simulates the original printed effect may however require more complex markup and processing.
1654 DRSIM . These would be appropriate for encodings the focus of which is on the actual performance of a text rather than its structure or formal properties. The module described in that chapter includes a large number of other detailed proposals for the encoding of such features as voice quality, prosody, etc., which might be relevant to such a treatment of performance texts.
1658 DROTH Most of the elements and structures identified thus far are derived from traditional theatrical texts. Although other performance texts, such as screenplays or radio scripts, have not been discussed specifically, they can be encoded using the elements and structures listed above. Encoders may however find it convenient to use, as well, the additional specialized elements discussed in this section. For scripts containing very detailed technical information, the
1663 DROTH Like other texts, screenplays and television or radio scripts may be divided into text divisions marked with
1673 DROTH , each associated with a single camera angle and setting. Shots and sequences should be encoded using an appropriate text-division element (i.e., a
1675 DROTH element if numbered division elements are in use and the next largest unit is a
1679 DROTH element if un-numbered divisions are in use) specifying
1680 DROTH sequence
1683 DROTH as the value of the
1687 DROTH It is normal practice in screenplays and radio scripts to distinguish directions concerning camera angles, sound effects, etc., from other forms of stage direction. Such texts also generally include far more detailed specifications of what the audience actually sees: descriptions of actions and background, etc. Scripts derived from cinema and television productions may also include texts displayed as captions superimposed on the action. All of these may be encoded using the general purpose
1701 DROTH Where particular words or phrases within a direction are emphasized (by change of typeface or use of capital letters), an appropriate phrase-level element may be used to indicate the fact, as in the following examples, where certain words in the original are given in small capitals:
1723 DROTH All of these elements, like other stage directions, can appear both within and between speeches.
1780 DRTEC Traditional stage scripts may contain additional technical information about such production-related factors as lighting,
1785 DRTEC . Alternatively, they may be formally distinguished from other stage directions by using the specialized
1790 DRTEC Like stage directions,
1815 DR The selection and combination of modules to form a TEI schema is described in