These tables list the 15,996 words (in 9,287 text nodes) that match the @ident of some *Spec.
The first column is simply position(); it serves as a reference and lets any given table be re-sorted back to its original order. The second column is either the @ident of the *Spec or the @xml:id of the closest ancestor. The links often don't work: I don't know how to generate a proper link consistently, and I can't easily test whether the target document is available, because doc-available() always fails on account of the about:legacy-compat.
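
As a rough illustration of how the first column can be used, the sketch below parses rows in the `position | id | text |` layout of these tables and restores their original order. The names `Row`, `parse_row`, and `restore_order` are hypothetical helpers invented for this example; they are not part of whatever stylesheet produced the tables.

```python
from typing import Iterable, List, NamedTuple


class Row(NamedTuple):
    position: int  # the position() value from the first column
    ident: str     # @ident of the *Spec, or closest ancestor @xml:id
    text: str      # the matched text node


def parse_row(line: str) -> Row:
    """Parse one 'position | id | text |' row (hypothetical helper)."""
    # Split on the first two pipes only: the text column may itself
    # contain '|' characters, e.g. in a content model like (a|b|c).
    pos, ident, text = line.split("|", 2)
    text = text.strip()
    if text.endswith("|"):
        # Drop the closing pipe that terminates each row.
        text = text[:-1].rstrip()
    return Row(int(pos.strip()), ident.strip(), text)


def restore_order(rows: Iterable[Row]) -> List[Row]:
    """Return the rows sorted back into original document order."""
    return sorted(rows, key=lambda row: row.position)
```
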
# | id | text |
---|---|---|
4 | factuality | describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world. |
27 | factuality | categorizes the factuality of the text. |
46 | factuality | the text is to be regarded as entirely imaginative |
62 | factuality | the text is to be regarded as entirely informative or factual |
78 | factuality | the text contains a mixture of fact and fiction |
94 | factuality | the fiction/fact distinction is not regarded as helpful or appropriate to this text |
147 | factuality | Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose |
149 | factuality | For many literary texts, a simple binary opposition between |
155 | factuality | are in any sense |
# | id | text |
---|---|---|
4 | collection | contains the name of a collection of manuscripts, not necessarily located within a single repository. |
# | id | text |
---|---|---|
4 | damage | contains an area of damage to the text witness. |
40 | damage | Since damage to text witnesses frequently makes them harder to read, the |
46 | damage | attribute may be used to group together several related |
# | id | text |
---|---|---|
2 | pubPlace | publication place |
13 | pubPlace | contains the name of the place where a bibliographic item was published. |
# | id | text |
---|---|---|
2 | cond | conditional feature-structure constraint |
14 | cond | defines a conditional feature-structure constraint; the consequent and the antecedent are specified as feature structures or feature-structure collections; the constraint is satisfied if both the antecedent and the consequent subsume a given feature structure, or if the antecedent does not. |
# | id | text |
---|---|---|
16 | classRef | the identifier used for the required class within the source indicated. |
23 | classRef | indicates how references to this class within a content model should be interpreted. |
31 | classRef | a single occurrence of all members of the class may appear in sequence |
35 | classRef | a single occurrence of one or more members of the class may appear in sequence |
43 | classRef | one or more occurrences of all members of the class may appear in sequence |
52 | classRef | c |
53 | classRef | , then a reference to the class within a content model is understood as being a reference to |
55 | classRef | when |
57 | classRef | has the value |
61 | classRef | when it has the value |
62 | classRef | sequence |
65 | classRef | when it has the value |
67 | classRef | ; to (a*,b*, c*) when it has the value |
69 | classRef | ; or to (a+,b+,c+) when it has the value |
77 | classRef | supplies a list of class members which are to be included in the schema being defined. |
84 | classRef | supplies a list of class members which are to be excluded from the schema being defined. |
105 | classRef | Attribute and model classes are identified by the name supplied as value for the |
109 | classRef | element in which they are declared. All TEI names are unique; attribute class names conventionally begin with the letters |
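
The classRef fragments above describe how a class reference expands into a content model for a class with members a, b, and c. The sketch below illustrates that expansion. The attribute value names used as dictionary keys (`alternate`, `sequence`, `sequenceOptional`, `sequenceOptionalRepeatable`, `sequenceRepeatable`) are my recollection of the TEI Guidelines; only `sequence` appears in the rows above, so treat the others as assumptions to check against the classRef specification. The function name `expand_class_ref` is hypothetical.

```python
from typing import Sequence

# Occurrence suffix appended to each member for the sequence-style values.
# Value names other than "sequence" are assumed, not taken from the rows above.
_SEQUENCE_SUFFIX = {
    "sequence": "",                     # (a,b,c)
    "sequenceOptional": "?",            # (a?,b?,c?)
    "sequenceOptionalRepeatable": "*",  # (a*,b*,c*)
    "sequenceRepeatable": "+",          # (a+,b+,c+)
}


def expand_class_ref(members: Sequence[str], expand: str) -> str:
    """Render a class reference as a DTD-style content model string."""
    if not members:
        raise ValueError("a class needs at least one member")
    if expand == "alternate":
        # Any one member may appear: (a|b|c)
        return "(" + "|".join(members) + ")"
    if expand not in _SEQUENCE_SUFFIX:
        raise ValueError(f"unknown expand value: {expand!r}")
    suffix = _SEQUENCE_SUFFIX[expand]
    return "(" + ",".join(member + suffix for member in members) + ")"
```
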
# | id | text |
---|---|---|
4 | event | contains data relating to any kind of significant event associated with a person, place, or organization. |
60 | event | indicates the location of an event by pointing to a |
# | id | text |
---|---|---|
2 | model.pLike.front | groups paragraph-like elements which can occur as direct constituents of front matter. |
# | id | text |
---|---|---|
2 | principal | principal researcher |
16 | principal | supplies the name of the principal researcher responsible for the creation of an electronic text. |
# | id | text |
---|---|---|
14 | biblScope | defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work. |
79 | biblScope | . For example, if the citation has |
# | id | text |
---|---|---|
4 | gap | indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, invisible, or inaudible. |
125 | gap | in the case of text omitted from the transcription because of deliberate deletion by an identifiable hand, indicates the hand which made the deletion. |
144 | gap | in the case of text omitted because of damage, categorizes the cause of the damage, if it can be identified. |
163 | gap | damage results from rubbing of the leaf edges |
179 | gap | damage results from mildew on the leaf surface |
195 | gap | damage results from smoke |
262 | gap | core tag elements may be closely allied in use with the |
266 | gap | elements, available when using the additional tagset for transcription of primary sources. See section |
271 | gap | tag simply signals the editor's decision to omit or inability to transcribe a span of text. Other information, such as the interpretation that text was deliberately erased or covered, should be indicated using the relevant tags, such as |
273 | gap | in the case of deliberate deletion. |
# | id | text |
---|---|---|
4 | surrogates | contains information about any representations of the manuscript being described which may exist in the holding institution or elsewhere. |
# | id | text |
---|---|---|
14 | head | contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc. |
52 | head | may be rather longer than usual in modern works. If a section has an explicit ending as well as a heading, it should be marked as a |
165 | head | element is used for headings at all levels; software which treats (e.g.) chapter headings, section headings, and list titles differently must determine the proper processing of a |
169 | head | occurring as the first element of a list is the title of that list; one occurring as the first element of a |
171 | head | is the title of that chapter or section. |
# | id | text |
---|---|---|
4 | stress | contains the stress pattern for a dictionary headword, if given separately. |
36 | stress | Usually stress information is included within pronunciation information. |
# | id | text |
---|---|---|
15 | listForest | identifies the type of the forest group. |
# | id | text |
---|---|---|
2 | eLeaf | leaf or terminal node of an embedding tree |
14 | eLeaf | provides explicitly for a leaf of an embedding tree, which may also be encoded with the eTree element. |
48 | eLeaf | indicates the value of an embedding leaf, which is a feature structure or other analytic element. |
86 | eLeaf | tag may be used if the encoder does not wish to distinguish by name between nonleaf and leaf nodes in embedding trees; they are distinguished by their arrangement. |
# | id | text |
---|---|---|
2 | att.declaring | provides attributes for elements which may be independently associated with a particular declarable element within the header, thus overriding the inherited default for that element. |
50 | att.declaring | The rules governing the association of declarable elements with individual parts of a TEI text are fully defined in chapter |
# | id | text |
---|---|---|
2 | authority | release authority |
16 | authority | supplies the name of a person or other agency responsible for making a work available, other than a publisher or distributor. |
# | id | text |
---|---|---|
35 | undo | This encoding represents the following sequence of events: |
37 | undo | At stage s2, "just some sample text, we need" is deleted by overstriking, and "not" is added |
38 | undo | At stage s3, parts of the deletion are cancelled by underdotting, thus reinstating the words "just some" and "text". |
# | id | text |
---|---|---|
2 | entryFree | unstructured entry |
13 | entryFree | contains a single unstructured entry in any kind of lexical resource, such as a dictionary or lexicon. |
# | id | text |
---|---|---|
14 | alternate | The alternate element must have at least two child elements |
26 | alternate | This example content model permits either a |
# | id | text |
---|---|---|
4 | climate | contains information about the physical climate of a place. |
# | id | text |
---|---|---|
2 | when | indicates a point in time either relative to other elements in the same timeline tag, or absolutely. |
28 | when | supplies an absolute value for the time. |
75 | when | specifies the unit of time in which the |
77 | when | value is expressed, if this is not inherited from the parent |
172 | when | specifies a time interval either as a number or as one of the keywords defined by the datatype data.interval |
191 | when | identifies the reference point for determining the time of the current |
193 | when | element, which is obtained by adding the interval to the time of the reference point. |
227 | when | . If no value is supplied, and the |
229 | when | attribute is also unspecified, then the reference point is understood to be the origin of the enclosing |
272 | when | attribute must be supplied to specify an identifier for this point in time. The value used may be chosen freely provided that it is unique within the document and is a syntactically valid name. There is no requirement for values containing numbers to be in sequence. |
# | id | text |
---|---|---|
2 | titlePage | title page |
16 | titlePage | contains the title page of a text, appearing within the front or back matter. |
54 | titlePage | classifies the title page according to any convenient typology. |
74 | titlePage | This attribute allows the same element to be used for volume title pages, series title pages, etc., as well as for the |
76 | titlePage | title page of a work. |
# | id | text |
---|---|---|
2 | substJoin | substitution join |
6 | substJoin | identifies a series of possibly fragmented additions, deletions or other revisions on a manuscript that combine to make up a single intervention in the text |
# | id | text |
---|---|---|
2 | stage | stage direction |
14 | stage | contains any kind of stage direction within a dramatic text or fragment. |
39 | stage | indicates the kind of stage direction. |
106 | stage | describes stage business. |
122 | stage | is a narrative, motivating stage direction. |
303 | stage | attribute may be used to indicate more precisely the person or persons participating in the action described by the stage direction. |
# | id | text |
---|---|---|
20 | data.truthValue | The possible values of this datatype are |
30 | data.truthValue | This datatype applies only for cases where uncertainty is inappropriate; if the attribute concerned may have a value other than true or false, e.g. |
# | id | text |
---|---|---|
2 | model.headLike | groups elements used to provide a title or heading at the start of a text division. |
# | id | text |
---|---|---|
4 | mood | contains information about the grammatical mood of verbs (e.g. indicative, subjunctive, imperative). |
88 | mood | gram type="mood" |
# | id | text |
---|---|---|
68 | dimensions | dimensions relate to one or more leaves (e.g. a single leaf, a gathering, or a separately bound part) |
84 | dimensions | dimensions relate to the area of a leaf which has been ruled in preparation for writing. |
100 | dimensions | dimensions relate to the area of a leaf which has been pricked out in preparation for ruling (used where this differs significantly from the ruled area, or where the ruling is not measurable). |
116 | dimensions | dimensions relate to the area of a leaf which has been written, with the height measured from the top of the minims on the top line of writing, to the bottom of the minims on the bottom line of writing. |
132 | dimensions | dimensions relate to the miniatures within the manuscript |
148 | dimensions | dimensions relate to the binding in which the codex or manuscript is contained |
164 | dimensions | dimensions relate to the box or other container in which the manuscript is stored. |
241 | dimensions | This element may be used to record the dimensions of any text-bearing object, not necessarily a codex. For example: |
257 | dimensions | When simple numeric quantities are involved, they may be expressed on the |
278 | dimensions | Contains no more than one of each of the specialized elements used to express a three-dimensional object's height, width, and depth, combined with any number of other kinds of dimensional specification. |
# | id | text |
---|---|---|
2 | damageSpan | damaged span of text |
12 | damageSpan | marks the beginning of a longer sequence of text which is damaged in some way but still legible. |
85 | damageSpan | Both the beginning and ending of the damaged sequence must be marked: the beginning by the |
89 | damageSpan | attribute: if no other element available, the |
93 | damageSpan | The damaged text must be at least partially legible, in order for the encoder to be able to transcribe it. If it is not legible at all, the |
99 | damageSpan | element should be employed, with the value of the |
# | id | text |
---|---|---|
2 | lbl | label |
14 | lbl | contains a label for a form, example, translation, or other piece of information, e.g. abbreviation for, contraction of, literally, approximately, synonyms:, etc. |
39 | lbl | classifies the label using any convenient typology. |
# | id | text |
---|---|---|
2 | model.lLike | groups elements representing metrical components such as verse lines. |
# | id | text |
---|---|---|
41 | reg | If all that is desired is to call attention to the fact that the copy text has been regularized, |
# | id | text |
---|---|---|
14 | xr | contains a phrase, sentence, or icon referring the reader to some other location in this or another text. |
130 | xr | related or similar term |
316 | xr | This element encloses both the actual indication of the location referred to, which may be tagged using the |
320 | xr | elements, and any accompanying material which gives more information about why the reader is being referred there. |
# | id | text |
---|---|---|
2 | att.datable.iso | provides attributes for normalization of elements that contain datable events using the ISO 8601 standard. |
19 | att.datable.iso | supplies the value of a date or time in a standard form. |
35 | att.datable.iso | The following are examples of ISO date, time, and date & time formats that are |
125 | att.datable.iso | is a valid time with respect to the W3C |
133 | att.datable.iso | specifies the earliest possible date for the event in standard form, e.g. yyyy-mm-dd. |
152 | att.datable.iso | specifies the latest possible date for the event in standard form, e.g. yyyy-mm-dd. |
211 | att.datable.iso | The value of these attributes should be a normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by ISO 8601, using the Gregorian calendar. |
239 | att.datable.iso | are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. That is, |
240 | att.datable.iso | indicates the same time period as |
245 | att.datable.iso | form, no claim is made that the form in the source text is incorrect; the regularized form is simply that chosen as the main form for purposes of unifying variant forms under a single heading. |
# | id | text |
---|---|---|
2 | triangle | underspecified embedding tree, so called because of its characteristic shape when drawn |
14 | triangle | provides for an underspecified eTree, that is, an eTree with information left out. |
51 | triangle | supplies a value for the triangle, in the form of the identifier of a feature structure or other analytic element. |
95 | triangle | An optional label followed by zero or more embedding trees, triangles, or embedding leafs. |
# | id | text |
---|---|---|
12 | foreign | identifies a word or phrase as belonging to some language other than that of the surrounding text. |
61 | foreign | attribute should be supplied for this element to identify the language of the word or phrase marked. As elsewhere, its value should be a language tag as defined in |
66 | foreign | attribute should be used in preference to this element where it is intended to mark the language of the whole of some text element. |
# | id | text |
---|---|---|
2 | code | contains literal code from some formal language such as a programming language. |
25 | code | formal language |
35 | code | a name identifying the formal language in which the code is expressed |
# | id | text |
---|---|---|
2 | anchor | anchor point |
69 | anchor | attribute must be supplied to specify an identifier for the point at which this element occurs within a document. The value used may be chosen freely provided that it is unique within the document and is a syntactically valid name. There is no requirement for values containing numbers to be in sequence. |
# | id | text |
---|---|---|
4 | rendition | supplies information about the rendition or appearance of one or more elements in the source text. |
38 | rendition | styling applies to the first line of the target element |
46 | rendition | styling should be applied immediately before the content of the target element |
50 | rendition | styling should be applied immediately after the content of the target element |
71 | rendition | The present release of these Guidelines does not specify the content of this element in any further detail. It may be used to hold a description of the default rendition to be associated with the specified element, expressed in running prose, or in some more formal language such as CSS. |
# | id | text |
---|---|---|
4 | list | contains any sequence of items organized as a list. |
88 | list | The content of a "gloss" list should include a sequence of one or more pairs of a label element followed by an item element |
103 | list | each list item glosses some term or concept, which is given by a label element preceding the list item. |
121 | list | each list item is an entry in an index such as the alphabetical topical index at the back of a print volume. |
125 | list | each list item is a step in a sequence of instructions, as in a recipe. |
129 | list | each list item is one of a sequence of petitions, supplications or invocations, typically in a religious ritual. |
133 | list | each list item is part of an argument consisting of two or more propositions and a final conclusion derived from them. |
142 | list | to encode the rendering or appearance of a list (whether it was bulleted, numbered, etc.). The current recommendation is to use the |
148 | list | for the more appropriate task of characterizing the nature of the content of a list. |
155 | list | list type="gloss" |
336 | list | The following example treats the short numbered clauses of Anglo-Saxon legal codes as lists of items. The text is from an ordinance of King Athelstan (924–939): |
366 | list | Note that nested lists have been used so the tagging mirrors the structure indicated by the two-level numbering of the clauses. The clauses could have been treated as a one-level list with irregular numbering, if desired. |
385 | list | May contain an optional heading followed by a series of items, or a series of label and item pairs, the latter being optionally preceded by one or two specialized headings. |
# | id | text |
---|---|---|
22 | interleave | This example content model permits either a |
# | id | text |
---|---|---|
2 | docTitle | document title |
16 | docTitle | contains the title of a document, including all its constituents, as given on a title page. |
# | id | text |
---|---|---|
2 | num | number |
38 | num | indicates the type of numeric value. |
135 | num | supplies the value of the number in standard form. |
152 | num | a numeric value. |
157 | num | The standard form used is defined by the TEI datatype data.numeric. |
211 | num | Detailed analyses of quantities and units of measure in historical documents may also use the feature structure mechanism described in chapter |
# | id | text |
---|---|---|
2 | orig | original form |
119 | orig | will be combined with a regularized form within a |
# | id | text |
---|---|---|
2 | transpose | describes a single textual transposition as an ordered list of at least two pointers specifying the order in which the elements indicated should be re-combined. |
30 | transpose | Transposition is usually indicated in a document by a metamark such as a wavy line or numbering. |
# | id | text |
---|---|---|
2 | catDesc | category description |
16 | catDesc | describes some category within a taxonomy or text typology, either in the form of a brief prose description or in terms of the situational parameters used by the TEI formal textDesc. |
# | id | text |
---|---|---|
85 | item | May contain simple prose or a sequence of chunks. |
87 | item | Whatever string of characters is used to label a list item in the copy text may be used as the value of the global |
95 | item | element to record the enumerator of the list item. In glossary lists, however, the term being defined should be given with the |
# | id | text |
---|---|---|
4 | district | contains the name of any kind of subdivision of a settlement, such as a parish, ward, or other administrative or geographic unit. |
# | id | text |
---|---|---|
2 | postCode | postal code |
14 | postCode | contains a numerical or alphanumeric code used as part of a postal address to simplify sorting or delivery of mail. |
72 | postCode | The position and nature of postal codes is highly country-specific; the conventions appropriate to the country concerned should be used. |
# | id | text |
---|---|---|
16 | fs | , that is, a collection of feature-value pairs organized as a structural unit. |
# | id | text |
---|---|---|
2 | model.teiHeaderPart | groups high level elements which may appear more than once in a TEI header. |
# | id | text |
---|---|---|
2 | att.pointing | defines a set of attributes used by all elements which point to other elements by means of one or more URI references. |
18 | att.pointing | specifies the language of the content to be found at the destination referenced by |
21 | att.pointing | language tag |
33 | att.pointing | if @target is specified. |
52 | att.pointing | The value must conform to BCP 47. If the value is a private use code (i.e., starts with |
58 | att.pointing | element with a matching value for its |
60 | att.pointing | attribute should be supplied in the TEI header to document this value. Such documentation may also optionally be supplied for non-private-use codes, though these must remain consistent with their |
96 | att.pointing | specifies the intended meaning when the target of a pointer is itself a pointer. |
115 | att.pointing | if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found which is not a pointer. |
131 | att.pointing | if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer. |
164 | att.pointing | If no value is given, the application program is responsible for deciding (possibly on the basis of user input) how far to trace a chain of pointers. |
# | id | text |
---|---|---|
21 | att.naming | may be used to specify further information about the entity referenced by this name in the form of a set of whitespace-separated values, for example the occupation of a person, or the status of a place. |
28 | att.naming | reference to the canonical name |
38 | att.naming | provides a means of locating the canonical form ( |
39 | att.naming | nym |
64 | att.naming | The value must point directly to one or more XML elements by means of one or more URIs, separated by whitespace. If more than one is supplied, the implication is that the name is associated with several distinct canonical names. |
# | id | text |
---|---|---|
2 | form | form information group |
14 | form | groups all the information on the written and spoken forms of one headword. |
49 | form | classifies form as simple, compound, etc. |
68 | form | single free lexical item |
100 | form | a variant form |
148 | form | word in other than usual dictionary form |
164 | form | multiple-word lexical item |
# | id | text |
---|---|---|
101 | usg | domain |
111 | usg | domain or subject matter (e.g. scientific, literary etc.) |
181 | usg | language |
191 | usg | name of a language mentioned in etymological or other linguistic discussion. |
405 | usg | unclassifiable piece of information to guide sense choice |
# | id | text |
---|---|---|
16 | notesStmt | collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description. |
# | id | text |
---|---|---|
2 | langKnown | language known |
14 | langKnown | summarizes the state of a person's linguistic competence, i.e., knowledge of a single language. |
38 | langKnown | supplies a valid language tag for the language concerned. |
56 | langKnown | The value for this attribute should be a language |
57 | langKnown | tag |
79 | langKnown | a code indicating the person's level of knowledge for this language |
# | id | text |
---|---|---|
2 | att.placement | provides attributes for describing where on the source page or object a textual element appears. |
18 | att.placement | specifies where this item is placed |
27 | att.placement | below the line |
97 | att.placement | on the other side of the leaf |
113 | att.placement | above the line |
145 | att.placement | within the body of the text. |
# | id | text |
---|---|---|
14 | adminInfo | contains information about the present custody and availability of the manuscript, and also about the record description itself. |
# | id | text |
---|---|---|
16 | att.handFeatures | gives a name or other identifier for the scribe believed to be responsible for this hand. |
34 | att.handFeatures | points to a full description of the scribe concerned, typically supplied by a |
43 | att.handFeatures | characterizes the particular script or writing style used by this hand, for example |
105 | att.handFeatures | points to a full description of the script or writing style used by this hand, typically supplied by a |
116 | att.handFeatures | , or other writing medium, e.g. |
# | id | text |
---|---|---|
4 | location | defines the location of a place as a set of geographical coordinates, in terms of other named geo-political entities, or as an address. |
# | id | text |
---|---|---|
2 | listOrg | list of organizations |
12 | listOrg | contains a list of elements, each of which provides information about an identifiable organization. |
80 | listOrg | The type attribute may be used to distinguish lists of organizations of a particular type if convenient. |
# | id | text |
---|---|---|
4 | text | contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample. |
156 | text | The body of a text may be replaced by a group of nested texts, as in the following schematic: |
175 | text | This element should not be used to represent a text which is inserted at an arbitrary point within the structure of another, for example as in an embedded or quoted narrative; the |
# | id | text |
---|---|---|
13 | metDecl | documents the notation employed to represent a metrical pattern when this is specified as the value of a |
19 | metDecl | attribute on any structural element of a metrical text (e.g. |
145 | metDecl | indicates whether the notation conveys the abstract metrical form, its actual prosodic realization, or the rhyme scheme, or some combination thereof. |
188 | metDecl | declaration applies to the abstract metrical form recorded on the |
266 | metDecl | declaration applies to the rhyme scheme recorded on the |
294 | metDecl | element documents the notation used for metrical pattern and realization. It may also be used to document the notation used for rhyme scheme information; if not otherwise documented, the rhyme scheme notation defaults to the traditional |
327 | metDecl | specifies a regular expression defining any value that is legal for this notation. |
346 | metDecl | The value must be a valid regular expression per the World Wide Web Consortium's |
370 | metDecl | This example is intended for the far more restricted case typified by the Shakespearean iambic pentameter. Only metrical patterns containing exactly ten syllables, alternately stressed and unstressed, (except for the first two which may be in either order) to each metrical line can be expressed using this notation. |
405 | metDecl | may contain either a sequence of |
407 | metDecl | elements or, alternately, a series of paragraphs or other components. If the |
411 | metDecl | elements are used, then all the codes appearing within the |
415 | metDecl | Only usable within the header if the verse module is used. |
# | id | text |
---|---|---|
2 | metSym | metrical notation symbol |
13 | metSym | documents the intended significance of a particular character or character sequence within a metrical notation, either explicitly or in terms of other symbol elements in the same metDecl. |
39 | metSym | specifies the character or character sequence being documented. |
60 | metSym | specifies whether the symbol is defined in terms of other symbols ( |
62 | metSym | is set to |
66 | metSym | is set to |
146 | metSym | The value |
148 | metSym | indicates that the element contains a prose definition of its meaning; the value |
# | id | text |
---|---|---|
14 | colloc | contains any sequence of words that co-occur with the headword with significant frequency. |
# | id | text |
---|---|---|
20 | data.name | Attributes using this datatype must contain a single word which follows the rules defining a legal XML name (see |
# | id | text |
---|---|---|
2 | relatedItem | contains or references some other bibliographic item which is related to the present one in some specified manner, for example as a constituent or alternative version of it. |
32 | relatedItem | is used, the relatedItem element must be empty |
34 | relatedItem | A relatedItem element should have either a 'target' attribute or a child element to indicate the related bibliographic item |
# | id | text |
---|---|---|
4 | cell | contains one cell of a table. |
# | id | text |
---|---|---|
31 | node | provides the value of a node, which is a feature structure or other analytic element. |
69 | node | initial node in a transition network |
85 | node | final node in a transition network |
265 | node | attributes when the graph is undirected and vice versa if the graph is directed. |
286 | node | gives the in degree of the node, the number of nodes which are adjacent from the given node. |
323 | node | gives the out degree of the node, the number of nodes which are adjacent to the given node. |
360 | node | gives the degree of the node, the number of arcs with which the node is incident. |
400 | node | attributes when the graph is undirected and vice versa if the graph is directed. |
442 | node | provides a label for the arc; the second provides a second label for the arc, and should be used if a transducer is being encoded whose actions are associated with nodes rather than with arcs. |
# | id | text |
---|---|---|
2 | span | associates an interpretative annotation directly with a span of text. |
27 | span | Only one of the attributes @target and @from may be supplied on |
34 | span | Only one of the attributes @target and @to may be supplied on |
41 | span | If @to is supplied on |
42 | span | , @from must be supplied as well |
49 | span | may each contain only a single value |
55 | span | gives the identifier of the node which is the starting point of the span of text being annotated; if not accompanied by a |
57 | span | attribute, gives the identifier of the node of the entire span of text being annotated. |
88 | span | gives the identifier of the node which is the end-point of the span of text being annotated. |
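
The attribute constraints listed for span above can be read as a small rule set. The following is a minimal sketch, not the TEI schema's own implementation: the function name and the dict-based input are hypothetical, and it assumes the "single value" rule in row 49 refers to @from and @to, whose names are elided in that row.

```python
def check_span_attributes(attrs):
    """Return a list of constraint violations for a span's attributes.

    `attrs` is a plain dict of attribute name -> value (hypothetical input shape).
    """
    problems = []
    if "target" in attrs and "from" in attrs:
        problems.append("only one of @target and @from may be supplied")
    if "target" in attrs and "to" in attrs:
        problems.append("only one of @target and @to may be supplied")
    if "to" in attrs and "from" not in attrs:
        problems.append("if @to is supplied, @from must be supplied as well")
    # Assumption: the single-value rule applies to @from and @to.
    for name in ("from", "to"):
        if name in attrs and len(attrs[name].split()) != 1:
            problems.append(f"@{name} may contain only a single value")
    return problems


print(check_span_attributes({"from": "#w1", "to": "#w5"}))
print(check_span_attributes({"to": "#w5"}))
```

An empty list means that none of the four rules was violated.
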
# | id | text |
---|---|---|
2 | listPrefixDef | list of prefix definitions |
4 | listPrefixDef | contains a list of definitions of prefixing schemes used in |
23 | listPrefixDef | In this example, two private URI scheme prefixes are defined and patterns are provided for dereferencing them. Each prefix is also supplied with a human-readable explanation in a |
# | id | text |
---|---|---|
2 | att.datable.custom | provides attributes for normalization of elements that contain datable events to a custom dating system (i.e. other than the Gregorian used by W3 and ISO). |
6 | att.datable.custom | supplies the value of a date or time in some custom standard form. |
12 | att.datable.custom | The following are examples of custom date or time formats that are |
34 | att.datable.custom | Not all custom date formulations will have Gregorian equivalents. |
38 | att.datable.custom | attribute and other custom dating are not constrained to a datatype by the TEI, but individual projects are recommended to regularize and document their dating formats. |
43 | att.datable.custom | specifies the earliest possible date for the event in some custom standard form. |
50 | att.datable.custom | specifies the latest possible date for the event in some custom standard form. |
81 | att.datable.custom | supplies a pointer to some location defining a named point in time with reference to which the datable item is understood to have occurred |
104 | att.datable.custom | element for the Julian calendar, specifying that the text content of the |
108 | att.datable.custom | attribute also points to the Julian calendar to indicate that the content of the |
110 | att.datable.custom | attribute value is Julian too. |
122 | att.datable.custom | In this example, a date is given in a Mediaeval text measured "from the creation of the world", which is normalised (in |
126 | att.datable.custom | ) to a machine-actionable, numeric version of the date from the Creation. |
135 | att.datable.custom | ) defines the calendar or dating system to which the date described by the parent element is normalized (i.e. in the |
141 | att.datable.custom | the calendar of the original date in the element. |
# | id | text |
---|---|---|
4 | faith | specifies the faith, religion, or belief set of a person. |
# | id | text |
---|---|---|
4 | listRelation | provides information about relationships identified amongst people, places, and organizations, either informally as prose or as formally expressed relation links. |
102 | listRelation | May contain a prose description organized as paragraphs, or a sequence of |
# | id | text |
---|---|---|
12 | moduleSpec | documents the structure, content, and purpose of a single module, i.e. a named and externally visible group of declarations. |
# | id | text |
---|---|---|
2 | div | text division |
16 | div | contains a subdivision of the front, body, or back of a text. |
# | id | text |
---|---|---|
14 | attRef | points to the definition of an attribute or group of attributes. |
36 | attRef | the name of the attribute class |
43 | attRef | the name of the attribute |
# | id | text |
---|---|---|
2 | index | index entry |
14 | index | marks a location to be indexed for whatever purpose. |
45 | index | a single word which follows the rules defining a legal XML name (see |
46 | index | ), supplying a name to specify which index (of several) the index entry belongs to. |
# | id | text |
---|---|---|
2 | valItem | documents a single value in a predefined list of values. |
30 | valItem | specifies the value concerned. |
# | id | text |
---|---|---|
4 | space | indicates the location of a significant space in the text. |
45 | space | indicates whether the space is horizontal or vertical. |
64 | space | the space is horizontal. |
80 | space | the space is vertical. |
97 | space | For irregular shapes in two dimensions, the value for this attribute should reflect the more important of the two dimensions. In conventional left-right scripts, a space with both vertical and horizontal components should be classed as |
116 | space | (responsible party) indicates the individual responsible for identifying and measuring the space |
141 | space | This element should be used wherever it is desired to record an unusual space in the source text, e.g. space left for a word to be filled in later, for later rubrication, etc. It is not intended to be used to mark normal inter-word space or the like. |
# | id | text |
---|---|---|
26 | att.combinable | add |
46 | att.combinable | if present already, the whole of the declaration for this object is removed from the current setup |
62 | att.combinable | this declaration changes the declaration of the same name in the current definition |
78 | att.combinable | this declaration replaces the declaration of the same name in the current definition |
100 | att.combinable | add |
102 | att.combinable | add |
103 | att.combinable | mode); raise an error if an object with the same identifier already exists |
109 | att.combinable | do not process this object or any existing object with the same identifier; raise an error if any new children supplied |
110 | att.combinable | change |
112 | att.combinable | change |
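
The mode descriptions above (add, remove, change, replace) can be illustrated on a plain dictionary of declarations. This is only a sketch under assumptions: the mode names "delete" and "replace" do not appear in the rows and are assumed here, "change" is simplified to a shallow merge, and `apply_declaration` is a hypothetical helper rather than part of any TEI tool.

```python
def apply_declaration(current, ident, declaration, mode="add"):
    """Return a new setup after applying one declaration in the given mode."""
    result = dict(current)  # leave the caller's setup untouched
    if mode == "add":
        # raise an error if an object with the same identifier already exists
        if ident in result:
            raise ValueError(f"{ident!r} is already declared")
        result[ident] = declaration
    elif mode == "delete":
        # if present already, the whole declaration is removed
        result.pop(ident, None)
    elif mode == "change":
        # simplification: supplied parts override, the rest is kept
        result[ident] = {**result.get(ident, {}), **declaration}
    elif mode == "replace":
        # the new declaration replaces the one of the same name
        result[ident] = declaration
    else:
        raise ValueError(f"unknown mode {mode!r}")
    return result


setup = {"p": {"usage": "opt"}}
print(apply_declaration(setup, "p", {"desc": "paragraph"}, mode="change"))
print(apply_declaration(setup, "p", {"desc": "paragraph"}, mode="replace"))
```
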
# | id | text |
---|---|---|
60 | pron | full form |
111 | pron | indicates what notation is used for the pronunciation, if more than one occurs in the machine-readable dictionary. |
195 | pron | The values used to specify the notation may be taken from any appropriate project-defined list of values. Typical values might be |
# | id | text |
---|---|---|
2 | soCalled | contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics. |
# | id | text |
---|---|---|
12 | msContents | describes the intellectual content of a manuscript or manuscript part, either as a series of paragraphs or as a series of structured manuscript items. |
60 | msContents | identifies the text types or classifications applicable to this object by pointing to other elements or resources defining the classification concerned. |
328 | msContents | . This constraint is not currently enforced by the schema. |
# | id | text |
---|---|---|
58 | name | , when the TEI module for names and dates is included. |
# | id | text |
---|---|---|
4 | byline | contains the primary statement of responsibility given for a work on its title page or at the head or end of the work. |
132 | byline | The byline on a title page may include either the name or a description for the document's author. Where the name is included, it may optionally be tagged using the |
# | id | text |
---|---|---|
2 | model.msItemPart | groups elements which can appear within a manuscript item description. |
# | id | text |
---|---|---|
8 | calendar | describes a calendar or dating system used in a dating formula in the text. |
# | id | text |
---|---|---|
35 | data.certainty | . The value |
# | id | text |
---|---|---|
47 | cRefPattern | The result of the substitution may be either an absolute or a relative URI reference. In the latter case it is combined with the value of |
49 | cRefPattern | in force at the place where the |
51 | cRefPattern | attribute occurs to form an absolute URI in the usual manner as prescribed by |
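
The rows above say that a relative result is combined with the base value in force to form an absolute URI "in the usual manner". One way to sketch that step is with Python's standard `urljoin`. The function name and the example URIs are hypothetical, and the base is simply passed in because the attribute name is elided in the rows.

```python
from urllib.parse import urljoin


def resolve_reference(substitution_result, base_uri):
    """Resolve a possibly relative URI reference against a base URI."""
    # urljoin returns absolute references unchanged and resolves relative ones
    return urljoin(base_uri, substitution_result)


base = "http://example.org/texts/index.xml"  # hypothetical base in force
print(resolve_reference("chapter1.xml#p3", base))
print(resolve_reference("http://other.example/doc.xml", base))
```
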
# | id | text |
---|---|---|
4 | seal | contains a description of one seal or similar attachment applied to a manuscript. |
35 | seal | specifies whether or not the seal is contemporary with the item to which it is affixed |
# | id | text |
---|---|---|
4 | graph | encodes a graph, which is a collection of nodes, and arcs which connect the nodes. |
83 | graph | undirected graph |
99 | graph | directed graph |
115 | graph | a directed graph with distinguished initial and final nodes |
131 | graph | a transition network with up to two labels on each arc |
152 | graph | , then the distinction between the |
158 | graph | tag is neutralized. Also, the |
168 | graph | (or any other value which implies directionality), then the |
239 | graph | states the order of the graph, i.e., the number of its nodes. |
258 | graph | states the size of the graph, i.e., the number of its arcs. |
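
The definitions of order (number of nodes) and size (number of arcs) above can be checked mechanically. The sketch below assumes a namespace-free sample in which nodes and arcs are direct children named `node` and `arc`; the sample document and the function name are illustrative only.

```python
import xml.etree.ElementTree as ET

# Illustrative sample: three nodes connected by two arcs.
SAMPLE = """
<graph>
  <node xml:id="n1"/>
  <node xml:id="n2"/>
  <node xml:id="n3"/>
  <arc from="#n1" to="#n2"/>
  <arc from="#n2" to="#n3"/>
</graph>
"""


def graph_order_and_size(xml_text):
    """Return (order, size): the number of nodes and the number of arcs."""
    graph = ET.fromstring(xml_text)
    order = len(graph.findall("node"))
    size = len(graph.findall("arc"))
    return order, size


print(graph_order_and_size(SAMPLE))
```
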
# | id | text |
---|---|---|
2 | recordHist | recorded history |
13 | recordHist | provides information about the source and revision status of the parent manuscript description itself. |
# | id | text |
---|---|---|
4 | tree | encodes a tree, which is made up of a root, internal nodes, leaves, and arcs from root to leaves. |
46 | tree | gives the maximum number of children of the root and internal nodes of the tree. |
75 | tree | indicates whether or not the tree is ordered, or if it is partially ordered. |
95 | tree | indicates that all of the branching nodes of the tree are ordered. |
111 | tree | indicates that some of the branching nodes of the tree are ordered and some are unordered. |
127 | tree | indicates that all of the branching nodes of the tree are unordered. |
145 | tree | gives the order of the tree, i.e., the number of its nodes. |
163 | tree | The size of a tree is always one less than its order, hence there is no need for both a |
305 | tree | A root, and zero or more internal nodes and leaves, but if there is an internal node, there must also be at least one leaf. |
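
Row 163 states that the size of a tree is always one less than its order. A small sketch can make that concrete. It assumes a namespace-free sample whose nodes are `root`, `iNode`, and `leaf` elements, each listing its children as whitespace-separated pointers in a `children` attribute; these names are assumptions for illustration, since they are not shown in the rows above.

```python
import xml.etree.ElementTree as ET

# Illustrative sample: a root with two children, one of which has one child.
SAMPLE_TREE = """
<tree>
  <root xml:id="r" children="#a #b"/>
  <iNode xml:id="a" children="#c"/>
  <leaf xml:id="b"/>
  <leaf xml:id="c"/>
</tree>
"""


def tree_order_and_size(xml_text):
    """Return (order, size): the number of nodes and the number of arcs."""
    tree = ET.fromstring(xml_text)
    nodes = [el for el in tree if el.tag in ("root", "iNode", "leaf")]
    order = len(nodes)
    # each pointer in a children attribute is one arc from parent to child
    size = sum(len(el.get("children", "").split()) for el in nodes)
    return order, size


order, size = tree_order_and_size(SAMPLE_TREE)
print(order, size, size == order - 1)
```
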
# | id | text |
---|---|---|
2 | iff | if and only if |
13 | iff | separates the condition from the consequence in a bicond element. |
# | id | text |
---|---|---|
9 | att.media | Where the media are displayed, indicates the display width |
16 | att.media | Where the media are displayed, indicates the display height |
23 | att.media | Where the media are displayed, indicates a scale factor to be applied when generating the desired display size |
# | id | text |
---|---|---|
2 | geogName | geographical name |
14 | geogName | identifies a name associated with some geographical feature such as Windrush Valley or Mount Sinai. |
# | id | text |
---|---|---|
4 | additional | groups additional information, combining bibliographic information about a manuscript or surrogate copies of it with curatorial or administrative information. |
# | id | text |
---|---|---|
4 | figure | groups elements representing or containing graphic information such as an illustration, formula, or figure. |
# | id | text |
---|---|---|
2 | listRef | list of references |
14 | listRef | supplies a list of significant references to places where this element is discussed, in the current document or elsewhere. |
# | id | text |
---|---|---|
2 | binaryObject | provides encoded binary data representing an inline graphic, audio, video or other object. |
30 | binaryObject | The encoding used to encode the binary data. If not specified, this is assumed to be |
# | id | text |
---|---|---|
16 | scriptStmt | contains a citation giving details of the script used for a spoken text. |
# | id | text |
---|---|---|
15 | f | feature value specification |
16 | f | , that is, the association of a name with a value of any of several different types. |
55 | f | A feature value cannot contain both text and element content |
59 | f | A feature value can contain only one child element |
66 | f | a single word which follows the rules defining a legal XML name (see |
67 | f | ), providing a name for the feature. |
86 | f | feature value |
96 | f | references any element which can be used to represent the value of a feature. |
114 | f | If this attribute is supplied as well as content, the value referenced is to be unified with that contained. |
152 | f | If the element is empty then a value must be supplied for the |
154 | f | attribute. The content of |
156 | f | may also be textual, with the assumption that the data type of the feature value is determined by the schema—this is the approach used in many language-technology-oriented projects and recommendations. |
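
The content rules for f above (no mixing of text and element content, at most one child element, and an attribute value required when the element is empty) can be sketched as a simple check. The function is hypothetical. Because the attribute name is elided in row 152, it is passed in as a parameter, and the `"fVal"` used in the usage lines is a placeholder assumption.

```python
import xml.etree.ElementTree as ET


def check_feature(f, value_attr):
    """Return a list of violations of the content rules for one f element."""
    problems = []
    children = list(f)
    # text may sit before the first child or after any child (its tail)
    has_text = bool((f.text or "").strip()) or any(
        (child.tail or "").strip() for child in children
    )
    if has_text and children:
        problems.append("a feature value cannot contain both text and element content")
    if len(children) > 1:
        problems.append("a feature value can contain only one child element")
    if not has_text and not children and f.get(value_attr) is None:
        problems.append(f"an empty element needs a value for @{value_attr}")
    return problems


print(check_feature(ET.fromstring('<f name="case">nominative</f>'), "fVal"))
print(check_feature(ET.fromstring('<f name="case"/>'), "fVal"))
```
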
# | id | text |
---|---|---|
28 | data.pattern | , is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. For example, the set containing the three strings |
36 | data.pattern | (or alternatively, it is said that the pattern |
# | id | text |
---|---|---|
4 | material | contains a word or phrase describing the material of which the object being described is composed. |
61 | material | attribute may be used to point to one or more items within a taxonomy of types of material, defined either internally or externally. |
# | id | text |
---|---|---|
4 | shift | marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes. |
28 | shift | The @new attribute should always be supplied; use the special value "normal" to indicate that the feature concerned ceases to be remarkable at this point. |
101 | shift | tension or stress pattern. |
151 | shift | specifies the new state of the paralinguistic feature specified. |
172 | shift | . The special value |
174 | shift | should be used to indicate that the feature concerned ceases to be remarkable at this point. In earlier versions of these Guidelines, a null value for this attribute was understood to have the same effect: this practice is now deprecated and will be removed at a future release. |
208 | shift | is spoken loudly, the words |
# | id | text |
---|---|---|
2 | measureGrp | measure group |
12 | measureGrp | contains a group of dimensional specifications which relate to the same object, for example the height and width of a manuscript page. |
# | id | text |
---|---|---|
4 | provenance | contains any descriptive or other information concerning a single identifiable episode during the history of a manuscript or manuscript part, after its creation but before its acquisition. |
# | id | text |
---|---|---|
2 | application | provides information about an application which has acted upon the document. |
37 | application | supplies an identifier for the application, independent of its version number or display name. |
54 | application | supplies a version number for the application, independent of its identifier or display name. |
82 | application | This example shows an appInfo element documenting the fact that version 1.5 of the Image Markup Tool application has an interest in two parts of a document which was last saved on June 6 2006. The parts concerned are accessible at the URLs given as target for the two |
# | id | text |
---|---|---|
6 | att.fragmentable | specifies whether or not its parent element is fragmented in some way, typically by some other overlapping structure: for example a speech which is divided between two or more verse stanzas, a paragraph which is split across a page division, a verse line which is divided between two speakers. |
# | id | text |
---|---|---|
2 | mapping | character mapping |
14 | mapping | contains one or more characters which are related to the parent character or glyph in some respect, as specified by the |
# | id | text |
---|---|---|
32 | att.divLike | specifies how the content of the division is organized. |
53 | att.divLike | no claim is made about the sequence in which the immediate contents of this division are to be processed, or their inter-relationships. |
87 | att.divLike | indicates whether this division is a sample of the original source and if so, from which part. |
108 | att.divLike | division lacks material present at end in source. |
124 | att.divLike | division lacks material at start and end. |
140 | att.divLike | division lacks material at start. |
156 | att.divLike | position of sampled material within original unknown. |
# | id | text |
---|---|---|
2 | model.divBottom | groups elements appearing at the end of a text division |
# | id | text |
---|---|---|
56 | tagsDecl | TEI recommended practice is to specify this attribute. When the |
60 | tagsDecl | are used to list each of the element types in the associated |
62 | tagsDecl | , the value should be given as |
68 | tagsDecl | are used to provide usage information or default renditions for only a subset of the element types within the associated |
70 | tagsDecl | , the value should be |
# | id | text |
---|---|---|
21 | att.timed | indicates the location within a temporal alignment at which this element begins. |
39 | att.timed | If no value is supplied, the element is assumed to follow the immediately preceding element at the same hierarchic level. |
56 | att.timed | indicates the location within a temporal alignment at which this element ends. |
74 | att.timed | If no value is supplied, the element is assumed to precede the immediately following element at the same hierarchic level. |
# | id | text |
---|---|---|
14 | charProp | provides a name and value for some property of the parent character or glyph. |
76 | charProp | If the property is a Unicode Normative Property, then its |
78 | charProp | must be supplied. Otherwise, its name must be specified by means of a |
82 | charProp | At a later release, additional constraints will be defined on possible value/name combinations using Schematron rules |
# | id | text |
---|---|---|
67 | att.typed | attribute is present on a number of elements, not all of which are members of |
76 | att.typed | provides a sub-categorization of the element, if needed |
96 | att.typed | attribute may be used to provide any sub-classification for the element additional to that provided by its |
128 | att.typed | When appropriate, values from an established typology should be used. Alternatively a typology may be defined in the associated TEI header. If values are to be taken from a project-specific list, this should be defined using the |
# | id | text |
---|---|---|
2 | model.correspContextPart | groups elements which may appear as part of the correspContext element |
# | id | text |
---|---|---|
2 | model.certLike | groups elements which are used to indicate uncertainty or precision of other elements. |
# | id | text |
---|---|---|
44 | distinct | specifies how the phrase is distinct diachronically |
63 | distinct | specifies how the phrase is distinct diatopically |
82 | distinct | specifies how the phrase is distinct diastatically |
# | id | text |
---|---|---|
2 | textNode | indicates the presence of a text node in a content model |
# | id | text |
---|---|---|
2 | model.featureVal.single | groups elements used to represent atomic feature values in feature structures. |
# | id | text |
---|---|---|
4 | availability | supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, any licence applying to it, etc. |
36 | availability | supplies a code identifying the current availability of the text. |
59 | availability | the text is freely available. |
75 | availability | the status of the text is unknown. |
91 | availability | the text is not freely available. |
# | id | text |
---|---|---|
2 | att.editLike | provides attributes describing the nature of an encoded scholarly intervention or interpretation of any kind. |
41 | att.editLike | there is internal evidence to support the intervention. |
57 | att.editLike | there is external evidence to support the intervention. |
73 | att.editLike | the intervention or interpretation has been made by the editor, cataloguer, or scholar on the basis of their expertise. |
101 | att.editLike | The members of this attribute class are typically used to represent any kind of editorial intervention in a text, for example a correction or interpretation, or to date or localize manuscripts etc. |
106 | att.editLike | (if present) corresponding to a witness or witness group should reference a bibliographic citation such as a |
112 | att.editLike | element, or another external bibliographic citation, documenting the source concerned. |
# | id | text |
---|---|---|
4 | region | contains the name of an administrative unit such as a state, province, or county, larger than a settlement, but smaller than a country. |
# | id | text |
---|---|---|
40 | socecStatus | identifies the classification system or taxonomy in use, for example by pointing to a locally-defined |
61 | socecStatus | identifies a status code defined within the classification system or taxonomy defined by the |
122 | socecStatus | The content of this element may be used as an alternative to the more formal specification made possible by its attributes; it may also be used to supplement the formal specification with commentary or clarification. |
# | id | text |
---|---|---|
2 | vAlt | value alternation |
14 | vAlt | represents the value part of a feature-value specification which contains a set of values, only one of which can be valid. |
# | id | text |
---|---|---|
2 | front | front matter |
16 | front | contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found at the start of a document, before the main body. |
212 | front | Because cultural conventions differ as to which elements are grouped as front matter and which as back matter, the content models for the |
# | id | text |
---|---|---|
50 | variantEncoding | apparatus uses line numbers or other canonical reference scheme referenced in a base text. |
82 | variantEncoding | alternate readings of a passage are given in parallel in the text; no notion of a base text is necessary. |
99 | variantEncoding | The value |
118 | variantEncoding | indicates whether the apparatus appears within the running text or external to it. |
140 | variantEncoding | The @location value "external" is inconsistent with the parallel-segmentation method of apparatus markup. |
180 | variantEncoding | The value |
# | id | text |
---|---|---|
14 | ex | contains a sequence of letters added by an editor or transcriber when expanding an abbreviation. |
# | id | text |
---|---|---|
13 | resp | contains a phrase describing the nature of a person's intellectual responsibility, or an organization's role in the production or distribution of a work. |
76 | resp | ) to a standardized list of responsibility types, such as that maintained by a naming authority, for example the list maintained at |
# | id | text |
---|---|---|
4 | argument | contains a formal list or prose description of the topics addressed by a subdivision of a text. |
69 | argument | Often contains either a list or a paragraph |
# | id | text |
---|---|---|
2 | interpGrp | interpretation group |
15 | interpGrp | collects together a set of related interpretations which share responsibility or type. |
109 | interpGrp | Any number of |
# | id | text |
---|---|---|
2 | model.personPart | groups elements which form part of the description of a person. |
# | id | text |
---|---|---|
4 | condition | contains a description of the physical condition of the manuscript. |
# | id | text |
---|---|---|
16 | samplingDecl | contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection. |
59 | samplingDecl | This element records all information about systematic inclusion or omission of portions of the text, whether a reflection of sampling procedures in the pure sense or of systematic omission of material deemed either too difficult to transcribe or not of sufficient interest. |
# | id | text |
---|---|---|
2 | licence | contains information about a licence or other legal agreement applicable to the text. |
48 | licence | element should be supplied for each licence agreement applicable to the text in question. The |
60 | licence | attributes may be used in combination to indicate the date or dates of applicability of the licence. |
# | id | text |
---|---|---|
2 | att.global.analytic | provides additional global attributes for associating specific analyses or interpretations with appropriate portions of a text. |
18 | att.global.analytic | analysis |
# | id | text |
---|---|---|
2 | att.styleDef | groups elements which specify the name of a formal definition language used to provide formatting or rendition information. |
6 | att.styleDef | identifies the language used to describe the rendition. |
51 | att.styleDef | Informal free text description |
65 | att.styleDef | A user-defined rendition description language |
80 | att.styleDef | If no value for the @scheme attribute is provided, then the default assumption should be that CSS is in use. |
85 | att.styleDef | supplies a version number for the style language provided in |
95 | att.styleDef | @schemeVersion can only be used if @scheme is specified. |
103 | att.styleDef | is used, then |
105 | att.styleDef | should also appear, with a value other than |
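
Two of the rules above lend themselves to a short sketch: CSS is assumed when no @scheme is given, and @schemeVersion is only allowed alongside @scheme. The function name and dict input are hypothetical, and the token spellings `"css"` and `"free"` are assumptions, since the rows list only the descriptions of the values.

```python
def effective_scheme(attrs):
    """Return the style language in effect for a dict of attributes."""
    if "schemeVersion" in attrs and "scheme" not in attrs:
        raise ValueError("@schemeVersion can only be used if @scheme is specified")
    # Assumption: "css" is the token that stands for CSS
    return attrs.get("scheme", "css")


print(effective_scheme({}))
print(effective_scheme({"scheme": "free"}))
```
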
# | id | text |
---|---|---|
2 | docDate | document date |
16 | docDate | contains the date of a document, as given on a title page or in a dateline. |
43 | docDate | gives the value of the date in standard form, i.e. YYYY-MM-DD. |
61 | docDate | attribute should give the Gregorian or proleptic Gregorian date in one of the formats specified in |
113 | docDate | element in the core tag set. This specialized element is provided for convenience in marking and processing the date of the documents, since it is likely to require specialized handling for many applications. It should be used only for the date of the entire document, not for any subset or part of it. |
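
Row 43 gives YYYY-MM-DD as the standard form of the date. A minimal check for that one form might look like the sketch below. It deliberately ignores any other formats the truncated row 61 may allow, and the function name is hypothetical. Python's `date` type uses the proleptic Gregorian calendar, which matches the calendar named in row 61.

```python
import re
from datetime import date

_FULL_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_standard_date(value):
    """Return a date for a full YYYY-MM-DD value, or None if it is not one."""
    match = _FULL_DATE.fullmatch(value)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # right shape, but not a real calendar date (e.g. 30 February)
        return None


print(parse_standard_date("1854-07-01"))
print(parse_standard_date("1 July 1854"))
```
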
# | id | text |
---|---|---|
4 | offset | marks that part of a relative temporal or spatial expression which indicates the direction of the offset between the two place names, dates, or times involved in the expression. |
# | id | text |
---|---|---|
2 | dataSpec | datatype specification |
# | id | text |
---|---|---|
4 | trait | contains a description of some status or quality attributed to a person, place, or organization typically, but not necessarily, independent of the volition or action of the holder and usually not at some specific time or for a specific date range. |
96 | trait | the more general purpose element |
98 | trait | should be used even for unchanging characteristics. If you wish to distinguish between characteristics that are generally perceived to be time-bound states and those assumed to be fixed traits, then |
102 | trait | element encodes characteristics which are sometimes assumed to change, often at specific times or over a date range, whereas the |
# | id | text |
---|---|---|
6 | att.ranging | gives a minimum estimated value for the approximate measurement. |
15 | att.ranging | gives a maximum estimated value for the approximate measurement. |
24 | att.ranging | where the measurement summarizes more than one observation or a range, supplies the minimum value observed. |
33 | att.ranging | where the measurement summarizes more than one observation or a range, supplies the maximum value observed. |
42 | att.ranging | specifies the degree of statistical confidence (between zero and one) that a value falls within the range specified by |
# | id | text |
---|---|---|
2 | epigraph | contains a quotation, anonymous or attributed, appearing at the start or end of a section or on a title page. |
# | id | text |
---|---|---|
38 | tagUsage | specifies the name (generic identifier) of the element indicated by the tag, within the namespace indicated by the parent |
61 | tagUsage | specifies the number of occurrences of this element within the text. |
92 | tagUsage | specifies the number of occurrences of this element within the text which bear a distinct value for the global |
131 | tagUsage | element which defines how this element was rendered in the source text. |
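
The occurrence count described in row 61 is straightforward to compute for a parsed document. The sketch below counts every element by name, including the root. It assumes a namespace-free sample, and the function name and sample text are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_element_usage(xml_text):
    """Count how many times each element name occurs in the document."""
    root = ET.fromstring(xml_text)
    # iter() walks the root and all of its descendants
    return Counter(element.tag for element in root.iter())


usage = count_element_usage("<text><p>a</p><p>b<hi>c</hi></p></text>")
print(usage["p"], usage["hi"])
```
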
# | id | text |
---|---|---|
2 | handNote | note on hand |
# | id | text |
---|---|---|
2 | gramGrp | grammatical information group |
# | id | text |
---|---|---|
2 | if | defines a conditional default value for a feature; the condition is specified as a feature structure, and is met if it subsumes the feature structure in the text for which a default value is sought. |
# | id | text |
---|---|---|
2 | div5 | level-5 text division |
16 | div5 | contains a fifth-level subdivision of the front, body, or back of a text. |
187 | div5 | any sequence of low-level structural elements, possibly grouped into lower subdivisions. |
# | id | text |
---|---|---|
4 | place | contains data about a geographic location |
# | id | text |
---|---|---|
4 | roleName | contains a name component which indicates that the referent has a particular role or position in society, such as an official title or rank. |
# | id | text |
---|---|---|
4 | link | defines an association or hypertextual link among elements or passages, of some type not more precisely specifiable by other elements. |
57 | link | The location of this element within a document has no significance, unless it is included within a |
59 | link | , in which case it may inherit the value of the |
61 | link | attribute from the value given on the |
# | id | text |
---|---|---|
13 | biblFull | contains a fully-structured bibliographic citation, in which all components of the TEI file description are present. |
# | id | text |
---|---|---|
4 | bloc | contains the name of a geo-political unit consisting of two or more nation states or countries. |
# | id | text |
---|---|---|
16 | elementRef | the identifier used for the required element within the source indicated. |
29 | elementRef | available from the current default source. |
38 | elementRef | available from the TEI P5 1.2.1 release. |
42 | elementRef | Elements are identified by the name supplied as value for the |
46 | elementRef | element in which they are declared. TEI element names are unique. |
# | id | text |
---|---|---|
4 | distributor | supplies the name of a person or other agency responsible for the distribution of a text. |
# | id | text |
---|---|---|
2 | bicond | bi-conditional feature-structure constraint |
14 | bicond | defines a biconditional feature-structure constraint; both consequent and antecedent are specified as feature structures or groups of feature structures; the constraint is satisfied if both subsume a given feature structure, or if both do not. |
# | id | text |
---|---|---|
4 | layout | describes how text is laid out on the page, including information about any ruling, pricking, or other evidence of page-preparation techniques. |
28 | layout | specifies the number of columns per page |
45 | layout | If a single number is given, all pages have this number of columns. If two numbers are given, the number of columns per page varies between the values supplied. |
51 | layout | specifies the number of ruled lines per column |
68 | layout | If a single number is given, all columns have this number of ruled lines. If two numbers are given, the number of ruled lines per column varies between the values supplied. |
74 | layout | specifies the number of written lines per column |
91 | layout | If a single number is given, all columns have this number of written lines. If two numbers are given, the number of written lines per column varies between the values supplied. |
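
The three counting attributes above share one convention: a single number applies throughout, while two numbers give the limits between which the count varies. A minimal sketch of reading such a value follows. The function name is hypothetical, and returning the pair in ascending order is a choice made here, not something the rows require.

```python
def parse_count_range(value):
    """Return (low, high) for a value holding one or two whole numbers."""
    parts = [int(part) for part in value.split()]
    if len(parts) == 1:
        # one number: every page or column has exactly this count
        return parts[0], parts[0]
    if len(parts) == 2:
        # two numbers: the count varies between these values
        return min(parts), max(parts)
    raise ValueError("expected one or two numbers")


print(parse_count_range("2"))
print(parse_count_range("24 32"))
```
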
# | id | text |
---|---|---|
2 | model.divBottomPart | groups elements which can occur only at the end of a text division. |
# | id | text |
---|---|---|
4 | model.publicationStmtPart.agency | element of the TEI header that indicate an authorising agent. |
32 | model.publicationStmtPart.agency | child elements, while not required, are required if one of the |
# | id | text |
---|---|---|
14 | del | contains a letter, word, or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, or a previous annotator or corrector. |
77 | del | element should be used for longer sequences of text, for those containing structural subdivisions, and for those containing overlapping additions and deletions. |
79 | del | The text deleted must be at least partially legible in order for the encoder to be able to transcribe it (unless it is restored in a |
81 | del | tag). Illegible or lost text within a deletion may be marked using the |
83 | del | tag to signal that text is present but has not been transcribed, or is no longer visible. Attributes on the |
85 | del | element may be used to indicate how much text is omitted, the reason for omitting it, etc. If text is not fully legible, the |
87 | del | element (available when using the additional tagset for transcription of primary sources) should be used to signal the areas of text which cannot be read with confidence in a similar way. |
94 | del | There is a clear distinction in the TEI between |
104 | del | indicates a deletion present in the source being transcribed, which states the author's or a later scribe's intent to cancel or remove text. |
106 | del | indicates material present in the source being transcribed which should have been so deleted, but which is not in fact. |
110 | del | , by contrast, signal an editor's or encoder's decision to omit something or their inability to read the source text. See sections |
# | id | text |
---|---|---|
2 | data.language | defines the range of attribute values used to identify a particular combination of human language and writing system. |
23 | data.language | The values for this attribute are language |
30 | data.language | language tag |
31 | data.language | , per BCP 47, is assembled from a sequence of components or |
35 | data.language | , U+002D). The tag is made of the following subtags, in the following order. Every subtag except the first is optional. If present, each occurs only once, except the fourth and fifth components (variant and extension), which are repeatable. |
36 | data.language | language |
37 | data.language | The IANA-registered code for the language. This is almost always the same as the ISO 639 2-letter language code if there is one. The list of available registered language subtags can be found at |
38 | data.language | . It is recommended that this code be written in lower case. |
40 | data.language | The ISO 15924 code for the script. These codes consist of 4 letters, and it is recommended they be written with an initial capital, the other three letters in lower case. The canonical list of codes is maintained by the Unicode Consortium, and is available at |
41 | data.language | . The IETF recommends this code be omitted unless it is necessary to make a distinction you need. |
42 | data.language | region |
43 | data.language | Either an ISO 3166 country code or a UN M.49 region code that is registered with IANA (not all such codes are registered, e.g. UN codes for economic groupings or codes for countries for which there is already an ISO 3166 2-letter code are not registered). The former consist of 2 letters, and it is recommended they be written in upper case. The list of codes can be found at |
44 | data.language | . The latter consist of 3 digits; the list of codes can be found at |
48 | data.language | are used to indicate additional, well-recognized variations that define a language or its dialects that are not covered by other available subtags |
51 | data.language | An extension has the format of a single letter followed by a hyphen followed by additional subtags. These exist to allow for future extension to BCP 47, but as of this writing no such extensions are in use. |
59 | data.language | element must be present in the TEI header. |
62 | data.language | There are two exceptions to the above format. First, there are language tags in the |
68 | data.language | Second, an entire language tag can consist of only a private use subtag. These tags start with |
70 | data.language | , and do not need to follow any further rules established by the IETF and endorsed by these Guidelines. Like all language tags that make use of private use subtags, the language in question must be documented in a corresponding |
72 | data.language | element in the TEI header. |
82 | data.language | English as spoken in Sierra Leone |
86 | data.language | Spanish as spoken in Mexico |
88 | data.language | Spanish as spoken in Latin America |
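
The subtag order described in rows 35 to 51 can be sketched as a small classifier. This is a simplified illustration, not a conforming BCP 47 parser: it assumes hyphen-separated subtags, recognizes each subtag by shape only, ignores extended-language subtags and the registry of irregular tags, and does not check any value against the IANA registry. The function name `split_language_tag` and the dictionary keys are hypothetical.

```python
import re


def split_language_tag(tag):
    """Classify the subtags of a language tag by shape (simplified)."""
    subtags = tag.split("-")
    result = {"language": None, "script": None, "region": None,
              "variants": [], "extensions": [], "private_use": None}
    if subtags[0].lower() == "x":
        # An entire tag may consist of only a private use subtag.
        result["private_use"] = tag
        return result
    result["language"] = subtags[0].lower()  # lower case is recommended
    i = 1
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        result["script"] = subtags[i].title()  # initial capital recommended
        i += 1
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[i]):
        result["region"] = subtags[i].upper()  # 2 letters or 3 digits
        i += 1
    while i < len(subtags):
        sub = subtags[i]
        if sub.lower() == "x":
            result["private_use"] = "-".join(subtags[i:])
            break
        if len(sub) == 1:
            # Extension: a single letter followed by further subtags.
            j = i + 1
            while j < len(subtags) and len(subtags[j]) > 1:
                j += 1
            result["extensions"].append("-".join(subtags[i:j]))
            i = j
        else:
            result["variants"].append(sub)
            i += 1
    return result
```

With this sketch, `split_language_tag("es-419")` reports the language `es` and the three-digit region `419`, while `split_language_tag("es-MX")` reports the two-letter region `MX`.
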
# | id | text |
---|---|---|
14 | u | contains a stretch of speech usually preceded and followed by silence or by a change of speaker. |
76 | u | this utterance begins without unusual pause or rapidity. |
92 | u | this utterance begins with a markedly shorter pause than normal. |
185 | u | will be delimited by pause or change of speaker, |
187 | u | is not required to represent a turn or any communicative event, nor to be bounded by pauses or change of speaker. At a minimum, a |
# | id | text |
---|---|---|
2 | att.global.linking | defines a set of attributes for hypertextual linking. |
76 | att.global.linking | . The language is indicated using |
78 | att.global.linking | , whose value is inherited; both the tag with the |
80 | att.global.linking | and the tag pointed to by the |
82 | att.global.linking | inherit the value from their immediate parent. |
118 | att.global.linking | elements in a literary personography. This correspondence represents a slightly looser relationship than the one in the preceding example; there is no sense in which an allegorical character could be substituted for the physical city, or vice versa, but there is obviously a correspondence between them. |
189 | att.global.linking | Any content of the current element should be ignored. Its true content is that of the element being pointed at. |
269 | att.global.linking | selects one or more alternants; if one alternant is selected, the ambiguity or uncertainty is marked as resolved. If more than one alternant is selected, the degree of ambiguity or uncertainty is marked as reduced by the number of alternants not selected. |
# | id | text |
---|---|---|
2 | castItem | cast list item |
14 | castItem | contains a single entry within a cast list, describing either a single role or a list of non-speaking roles. |
61 | castItem | role |
65 | castItem | the item describes a single role. |
81 | castItem | the item describes a list of non-speaking roles. |
# | id | text |
---|---|---|
97 | fileDesc | The major source of information for those seeking to create a catalogue entry or bibliographic citation for an electronic file. As such, it provides a title and statements of responsibility together with details of the publication or distribution of the file, of any series to which it belongs, and detailed bibliographic notes for matters not addressed elsewhere in the header. It also contains a full bibliographic description for the source or sources from which the electronic text was derived. |
# | id | text |
---|---|---|
4 | caption | contains the text of a caption or other text displayed as part of a film script or screenplay. |
85 | caption | A specialized form of stage direction. |
# | id | text |
---|---|---|
2 | editionStmt | edition statement |
16 | editionStmt | groups information relating to one edition of a text. |
# | id | text |
---|---|---|
2 | att.lexicographic | defines a set of global attributes available on elements in the base tag set for dictionaries. |
23 | att.lexicographic | gives an expanded form of information presented more concisely in the dictionary |
68 | att.lexicographic | gives a normalized form of information given by the source text in a non-normalized form |
105 | att.lexicographic | gives the list of split values for a merged form |
126 | att.lexicographic | gives a value which lacks any realization in the printed source text. |
151 | att.lexicographic | gives the original string or is the empty string when the element does not appear in the source text. |
174 | att.lexicographic | element typically elsewhere in the document, but possibly in another document, which is the original location of this component. |
# | id | text |
---|---|---|
4 | signatures | contains discussion of the leaf or quire signatures found within a codex. |
# | id | text |
---|---|---|
2 | model.frontPart.drama | groups elements which appear at the level of divisions within front or back matter of performance texts only. |
# | id | text |
---|---|---|
2 | caesura | marks the point at which a metrical line may be divided. |
# | id | text |
---|---|---|
2 | lb | line break |
14 | lb | marks the start of a new (typographic) line in some edition or version of a text. |
40 | lb | This example shows typographical line breaks within metrical lines, where they occur at different places in different editions: |
74 | lb | This example encodes typographical line breaks as a means of preserving the visual appearance of a title page. The |
76 | lb | attribute is used to show that the line break does not (as elsewhere) mark the start of a new word. |
100 | lb | elements should appear at the point in the text where a new line starts. The |
102 | lb | attribute, if used, indicates the number or other value associated with the text between this point and the next |
104 | lb | element, typically the sequence number of the line within the page, or other appropriate unit. This element is intended to be used for marking actual line breaks on a manuscript or printed page, at the point where they occur; it should not be used to tag structural units such as lines of verse (for which the |
110 | lb | attribute may be used to characterize the line break in any respect. The more specialized attributes |
116 | lb | should be preferred when the intent is to indicate whether or not the line break is word-breaking, or to note the source from which it derives. |
# | id | text |
---|---|---|
2 | att.transcriptional | provides attributes specific to elements encoding authorial or scribal intervention in a text when transcribing manuscript or similar sources. |
38 | att.transcriptional | indicates the effect of the intervention, for example in the case of a deletion, strikeouts which include too much or too little text, or in the case of an addition, an insertion which duplicates some of the text already present. |
58 | att.transcriptional | all of the text indicated as an addition duplicates some text that is in the original, whether the duplication is word-for-word or less exact. |
72 | att.transcriptional | part of the text indicated as an addition duplicates some text that is in the original |
86 | att.transcriptional | some text at the beginning of the deletion is marked as deleted even though it clearly should not be deleted. |
100 | att.transcriptional | some text at the end of the deletion is marked as deleted even though it clearly should not be deleted. |
114 | att.transcriptional | some text at the beginning of the deletion is not marked as deleted even though it clearly should be. |
128 | att.transcriptional | some text at the end of the deletion is not marked as deleted even though it clearly should be. |
142 | att.transcriptional | some text in the deletion is not marked as deleted even though it clearly should be. |
171 | att.transcriptional | Status information on each deletion is needed rather rarely except in critical editions from authorial manuscripts; status information on additions is even less common. |
173 | att.transcriptional | Marking a deletion or addition as faulty is inescapably an interpretive act; the usual test applied in practice is the linguistic acceptability of the text with and without the letters or words in question. |
203 | att.transcriptional | repeated for the purpose of fixation |
207 | att.transcriptional | repeated to clarify a previously illegible or badly written text or mark |
214 | att.transcriptional | sequence |
224 | att.transcriptional | assigns a sequence number related to the order in which the encoded features carrying this attribute are believed to have occurred. |
# | id | text |
---|---|---|
2 | orgName | organization name |
# | id | text |
---|---|---|
2 | figDesc | description of figure |
14 | figDesc | contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying it. |
58 | figDesc | This element is intended for use as an alternative to the content of its parent |
60 | figDesc | element; for example, to display when the image is required but the equipment in use cannot display graphic images. It may also be used for indexing or documentary purposes. |
# | id | text |
---|---|---|
2 | note | contains a note or annotation. |
30 | note | indicates whether the copy text shows the exact place of reference for the note. |
48 | note | In modern texts, notes are usually anchored by means of explicit footnote or endnote symbols. An explicit indication of the phrase or line annotated may however be used instead (e.g. |
52 | note | attribute indicates whether any explicit location is given, whether by symbol or by prose cross-reference. The value |
54 | note | indicates that such an explicit location is indicated in the copy text; the value |
56 | note | indicates that the copy text does not indicate a specific place of attachment for the note. If the specific symbols used in the copy text at the location the note is anchored are to be recorded, use the |
77 | note | points to the end of the span to which the note is attached, if the note is not embedded in the text at that point. |
93 | note | This attribute is retained for backwards compatibility; it may be removed at a subsequent release of the Guidelines. The recommended way of pointing to a span of elements is by means of the |
109 | note | In the following example, the translator has supplied a footnote containing an explanation of the term translated as "painterly": |
120 | note | For this example to be valid, the code |
122 | note | must be defined elsewhere, for example by means of a responsibility statement in the associated TEI header: |
160 | note | attribute may be used to supply the symbol or number used to mark the note's point of attachment in the source text, as in the following example: |
166 | note | However, if notes are numbered in sequence and their numbering can be reconstructed automatically by processing software, it may well be considered unnecessary to record the note numbers. |
# | id | text |
---|---|---|
2 | docAuthor | document author |
16 | docAuthor | contains the name of the author of the document, as given on the title page (often but not always contained in a byline). |
71 | docAuthor | The document author's name often occurs within a byline, but the |
# | id | text |
---|---|---|
56 | lem | The term |
58 | lem | is used in text criticism to describe the reading in the text itself (as opposed to those in the apparatus); this usage is distinct from that of mathematics (where a lemma is a major step in a proof) and natural-language processing (where a lemma is the dictionary form associated with an inflected form in the running text). |
# | id | text |
---|---|---|
62 | macroSpec | indicates which type of entity should be generated, when an ODD processor is generating a module using XML DTD syntax. |
91 | macroSpec | datatype entity |
# | id | text |
---|---|---|
161 | s | You may not nest one s element within another: use seg instead |
196 | s | element may be used to mark orthographic sentences, or any other segmentation of a text, provided that the segmentation is end-to-end, complete, and non-nesting. For segmentation which is partial or recursive, the |
202 | s | attribute may be used to indicate the type of segmentation intended, according to any convenient typology. |
# | id | text |
---|---|---|
2 | model.settingPart | groups elements used to describe the setting of a linguistic interaction. |
# | id | text |
---|---|---|
2 | model.hiLike | groups phrase-level elements which are typographically distinct but to which no specific function can be attributed. |
# | id | text |
---|---|---|
14 | postBox | contains a number or other identifier for some postal delivery point other than a street address. |
66 | postBox | The position and nature of postal codes is highly country-specific; the conventions appropriate to the country concerned should be used. |
# | id | text |
---|---|---|
2 | att.textCritical | defines a set of attributes common to all elements representing variant readings in text critical work. |
105 | att.textCritical | variant sequence |
115 | att.textCritical | provides a number indicating the position of this reading in a sequence, when there is reason to presume a sequence to the variants. |
133 | att.textCritical | Different variant sequences could be coded with distinct number trails: 1-2-3 for one sequence, 5-6-7 for another. More complex variant sequences, with (for example) multiple branchings from single readings, may be expressed through the |
# | id | text |
---|---|---|
4 | history | groups elements describing the full history of a manuscript or manuscript part. |
# | id | text |
---|---|---|
2 | model.titlepagePart | groups elements which can occur as direct constituents of a title page, such as |
# | id | text |
---|---|---|
2 | TEI | TEI document |
16 | TEI | contains a single TEI-conformant document, containing a single TEI header, a single text, one or more members of the model.resourceLike class, or a combination of these. A series of |
18 | TEI | elements may be combined together to form a |
80 | TEI | specifies the major version number of the TEI Guidelines against which this document is valid. |
100 | TEI | The major version number is historically prefixed by a P (for Proposal), and is distinct from the version number used for individual releases of the Guidelines, as used by (for example) the |
222 | TEI | This element is required. It is customary to specify the TEI namespace |
# | id | text |
---|---|---|
5 | filiation | filiation |
107 | filiation | includes a link to some other manuscript description which has the identifier |
# | id | text |
---|---|---|
25 | data.probability | Probability is expressed as a real number between 0 and 1; 0 representing |
# | id | text |
---|---|---|
4 | death | contains information about a person's death, such as its date and place. |
# | id | text |
---|---|---|
2 | retrace | contains a sequence of writing which has been retraced, for example by over-inking, to clarify or fix it. |
24 | retrace | within another. In principle, a retrace differs from a substitution in that second and subsequent rewrites do not materially alter the content of an element. Where minor changes have been made during the retracing action however these may be marked up using |
28 | retrace | , etc. with an appropriate value for the |
# | id | text |
---|---|---|
4 | sponsor | specifies the name of a sponsoring organization or institution. |
55 | sponsor | Sponsors give their intellectual authority to a project; they are to be distinguished from |
# | id | text |
---|---|---|
2 | vRange | value range |
14 | vRange | defines the range of allowed values for a feature, in the form of an |
18 | vRange | , or primitive value; for the value of an |
20 | vRange | to be valid, it must be subsumed by the specified range; if the |
24 | vRange | attribute), then each value must be subsumed by the |
# | id | text |
---|---|---|
4 | epilogue | contains the epilogue to a drama, typically spoken by an actor out of character, possibly in association with a particular performance or venue. |
142 | epilogue | Contains optional headings, a sequence of one or more component-level elements, and an optional sequence of closing material. |
# | id | text |
---|---|---|
2 | att.repeatable | supplies attributes for the elements which define component parts of a content model. |
7 | att.repeatable | supplies an XPath identifying a context within which this component of a content model must be found |
14 | att.repeatable | minimum number of occurrences |
26 | att.repeatable | indicates the smallest number of times this component may occur. |
35 | att.repeatable | maximum number of occurrences |
47 | att.repeatable | indicates the largest number of times this component may occur. |
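
The two bounds in rows 26 and 47 amount to a simple range check on how often a component occurs. A minimal sketch, assuming both bounds default to 1 and that the string `"unbounded"` stands for "no upper limit"; both assumptions, and the name `occurrence_ok`, are illustrative and are not taken from the rows above.

```python
def occurrence_ok(count, min_occurs=1, max_occurs=1):
    """Return True if `count` lies between the two bounds, inclusive."""
    # Assumption: "unbounded" means there is no upper limit.
    upper = float("inf") if max_occurs == "unbounded" else int(max_occurs)
    return int(min_occurs) <= count <= upper
```
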
# | id | text |
---|---|---|
21 | att.global.facs | facsimile |
29 | att.global.facs | points to all or part of an image which corresponds with the content of the element. |
# | id | text |
---|---|---|
92 | altIdentifier | An identifying number of some kind must be supplied if known; if it is not known, this should be stated. |
# | id | text |
---|---|---|
2 | listNym | list of canonical names |
12 | listNym | contains a list of nyms, that is, standardized names for any thing. |
117 | listNym | The type attribute may be used to distinguish lists of names of a particular type if convenient. |
# | id | text |
---|---|---|
2 | notatedMusic | encodes the presence of music notation in a text |
31 | notatedMusic | It is possible to describe the content of the notation using elements from the |
35 | notatedMusic | . It is possible to specify the location of digital objects representing the notated music in other media such as images or audio-visual files. The encoder's interpretation of the correspondence between the notated music and these digital objects is not encoded explicitly. We recommend the use of graphic and binaryObject mainly as a fallback mechanism when the notated music format is not displayable by the application using the encoding. The alignment of encoded notated music, images carrying the notation, and audio files is a complex matter for which we refer the encoder to other formats and specifications such as MPEG-SMR. |
# | id | text |
---|---|---|
133 | view | A view is a particular form of stage direction. |
# | id | text |
---|---|---|
2 | funder | funding body |
16 | funder | specifies the name of an individual, institution, or organization responsible for the funding of a project or text. |
73 | funder | Funders provide financial support for a project; they are distinct from |
75 | funder | , who provide intellectual support and authority. |
# | id | text |
---|---|---|
2 | dataNode | defines possible values for a data node, usually as part of an attribute's datatype |
11 | dataNode | supplies the name of a predefined datatype in the datatype library specified by the |
18 | dataNode | points to the datatype library in which the name specified by the |
26 | dataNode | The default source is the list of datatypes provided by |
32 | dataNode | supplies a string representing a regular expression providing additional constraints on the strings used to represent values conforming to this datatype |
# | id | text |
---|---|---|
4 | handNotes | elements documenting the different hands identified within the source texts. |
# | id | text |
---|---|---|
2 | vDefault | value default |
14 | vDefault | declares the default value to be supplied when a feature structure does not contain an instance of |
16 | vDefault | for this name; if unconditional, it is specified as one (or, depending on the value of the |
22 | vDefault | elements or primitive values; if conditional, it is specified as one or more |
24 | vDefault | elements; if no default is specified, or no condition matches, the value |
99 | vDefault | May contain a legal feature value, or a series of |
# | id | text |
---|---|---|
2 | persName | personal name |
# | id | text |
---|---|---|
17 | model.phrase | This class of elements can occur within paragraphs, list items, lines of verse, etc. |
# | id | text |
---|---|---|
2 | setting | describes one particular setting in which a language interaction takes place. |
79 | setting | attribute is not supplied, the setting is assumed to be that of all participants in the language interaction. |
# | id | text |
---|---|---|
2 | roleDesc | role description |
14 | roleDesc | describes a character's role in a drama. |
# | id | text |
---|---|---|
5 | depth | width |
41 | depth | If used to specify the depth of a non text-bearing portion of some object, for example a monument, this element conventionally refers to the axis facing the observer, and perpendicular to that indicated by the |
42 | depth | width |
# | id | text |
---|---|---|
4 | floatingText | contains a single text of any kind, whether unitary or composite, which interrupts the text containing it at any point and after which the surrounding text resumes. |
132 | floatingText | A floating text has the same content as any other and may thus be interrupted by another floating text, or contain a group of tessellated texts. |
# | id | text |
---|---|---|
2 | model.divPart.spoken | groups elements structurally analogous to paragraphs within spoken texts. |
# | id | text |
---|---|---|
2 | orth | orthographic form |
14 | orth | gives the orthographic form of a dictionary headword. |
58 | orth | gives the extent of the orthographic information provided. |
79 | orth | full form |
# | id | text |
---|---|---|
2 | purpose | characterizes a single purpose or communicative function of the text. |
109 | purpose | specifies the extent to which this purpose predominates. |
129 | purpose | this purpose is predominant |
131 | purpose | this purpose is intermediate |
133 | purpose | this purpose is weak |
135 | purpose | extent unknown |
180 | purpose | Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose |
# | id | text |
---|---|---|
16 | idno | supplies any form of identifier used to identify some object, such as a bibliographic item, a person, a title, an organization, etc. in a standardized way. |
# | id | text |
---|---|---|
2 | macro.phraseSeq | phrase sequence |
14 | macro.phraseSeq | defines a sequence of character data and phrase-level elements. |
# | id | text |
---|---|---|
2 | div1 | level-1 text division |
16 | div1 | contains a first-level subdivision of the front, body, or back of a text. |
150 | div1 | any sequence of low-level structural elements, possibly grouped into lower subdivisions. |
# | id | text |
---|---|---|
41 | join | specifies the name of an element which this aggregation may be understood to represent. |
77 | join | root |
83 | join | attribute are joined, each subtree becomes a child of the virtual element created by the join |
169 | join | attribute. The value |
170 | join | root |
322 | join | is specified with the value of |
324 | join | to indicate that the virtual list being constructed is to be made by taking the lists indicated by the |
# | id | text |
---|---|---|
4 | nationality | contains an informal description of a person's present or past nationality or citizenship. |
# | id | text |
---|---|---|
6 | att.milestoneUnit | provides a conventional name for the kind of section changing at this milestone. |
69 | att.milestoneUnit | line breaks (synonymous with the |
145 | att.milestoneUnit | changes of speaker or narrator. |
253 | att.milestoneUnit | If the milestone marks the beginning of a piece of text not present in the reference edition, the special value |
255 | att.milestoneUnit | may be used as the value of |
257 | att.milestoneUnit | . The normal interpretation is that the reference edition does not contain the text which follows, until the next |
259 | att.milestoneUnit | tag for the edition in question is encountered. |
# | id | text |
---|---|---|
2 | att.edition | provides attributes identifying the source edition from which some encoded feature derives. |
8 | att.edition | edition |
12 | att.edition | supplies a sigil or other arbitrary identifier for the source edition in which the associated feature (for example, a page, column, or line break) occurs at this point in the text. |
21 | att.edition | edition reference |
23 | att.edition | provides a pointer to the source edition in which the associated feature (for example, a page, column, or line break) occurs at this point in the text. |
# | id | text |
---|---|---|
4 | prologue | contains the prologue to a drama, typically spoken by an actor out of character, possibly in association with a particular performance or venue. |
# | id | text |
---|---|---|
4 | handShift | marks the beginning of a sequence of text written in a new hand, or the beginning of a scribal stint. |
71 | handShift | element may be used either to denote a shift in the document hand (as from one scribe to another, or one writing style to another), or to indicate a shift within a document hand, such as a change of writing style, character, or ink. Like other milestone elements, it should appear at the point of transition from some other state to the state which it describes. |
# | id | text |
---|---|---|
12 | value | contains a single value for some property, attribute, or other analysis. |
# | id | text |
---|---|---|
2 | appInfo | application information |
12 | appInfo | records information about an application which has edited the TEI file. |
# | id | text |
---|---|---|
16 | projectDesc | describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected. |
# | id | text |
---|---|---|
35 | binding | specifies whether or not the binding is contemporary with the majority of its contents |
53 | binding | The value |
55 | binding | indicates that the binding is contemporaneous with its contents; the value |
57 | binding | that it is not. The value |
59 | binding | should be used when the date of either binding or manuscript is unknown |
# | id | text |
---|---|---|
2 | model.contentPart | groups elements which may appear as part of the content element. |
# | id | text |
---|---|---|
2 | att.namespaceable | provides an attribute indicating the target namespace for an object being created |
6 | att.namespaceable | namespace |
18 | att.namespaceable | specifies the namespace to which this element belongs |
# | id | text |
---|---|---|
32 | redo | This encoding represents the following sequence of events: |
# | id | text |
---|---|---|
2 | settingDesc | setting description |
14 | settingDesc | describes the setting or settings within which a language interaction takes place, or other places otherwise referred to in a text, edition, or metadata. |
74 | settingDesc | May contain a prose description organized as paragraphs, or a series of |
76 | settingDesc | elements. If used to record not settings of language interactions, but other places mentioned in the text, then |
# | id | text |
---|---|---|
2 | docEdition | document edition |
16 | docEdition | contains an edition statement as presented on a title page of a document. |
61 | docEdition | element of bibliographic citation. As usual, the shorter name has been given to the more frequent element. |
# | id | text |
---|---|---|
2 | eTree | embedding tree |
14 | eTree | provides an alternative to tree element for representing ordered rooted tree structures. |
52 | eTree | provides the value of an embedding tree, which is a feature structure or other analytic element. |
144 | eTree | an optional label followed by zero or more embedding trees, triangles, or embedding leafs. |
# | id | text |
---|---|---|
2 | macro.specialPara | 'special' paragraph content |
14 | macro.specialPara | defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements. |
# | id | text |
---|---|---|
2 | data.version | defines the range of attribute values which may be used to specify a TEI or Unicode version number. |
13 | data.version | The value of this attribute follows the pattern specified by the Unicode consortium for its version number ( |
14 | data.version | ). A version number contains digits and fullstop characters only. The first number supplied identifies the major version number. A second and third number, for minor and sub-minor version numbers, may also be supplied. |
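
The version-number shape described in row 14 (digits and full stops only, with one to three numeric parts) can be checked with a regular expression. A minimal sketch; the pattern and the name `parse_version` are written for illustration from the prose above and are not copied from the Guidelines.

```python
import re

# One to three runs of ASCII digits separated by full stops.
_VERSION = re.compile(r"([0-9]+)(?:\.([0-9]+))?(?:\.([0-9]+))?")


def parse_version(value):
    """Return (major[, minor[, sub_minor]]) as a tuple of ints."""
    match = _VERSION.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid version number: {value!r}")
    return tuple(int(part) for part in match.groups() if part is not None)
```

Returning a tuple of integers means versions compare numerically, so `parse_version("4.10")` sorts after `parse_version("4.9")`.
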
# | id | text |
---|---|---|
2 | facsimile | contains a representation of some written source in the form of a set of images rather than as transcribed or encoded text. |
# | id | text |
---|---|---|
2 | altIdent | alternate identifier |
12 | altIdent | supplies the recommended XML name for an element, class, attribute, etc. in some language. |
48 | altIdent | All documentation elements in ODD have a canonical name, supplied as the value for their |
52 | altIdent | element is used to supply an alternative name for the corresponding XML object, perhaps in a different language. |
# | id | text |
---|---|---|
2 | textClass | text classification |
16 | textClass | groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc. |
# | id | text |
---|---|---|
22 | data.point | A point is defined by two numeric values, which may be expressed in any notation permitted. |
# | id | text |
---|---|---|
2 | email | electronic mail address |
13 | email | contains an email address identifying a location to which email messages can be delivered. |
47 | email | The format of a modern Internet email address is defined in |
# | id | text |
---|---|---|
2 | model.ptrLike | groups elements used for purposes of location and reference. |
# | id | text |
---|---|---|
2 | styleDefDecl | style definition language declaration |
4 | styleDefDecl | specifies the name of the formal language in which style or renditional information is supplied elsewhere in the document. The specific version of the scheme may also be supplied. |
# | id | text |
---|---|---|
4 | sound | describes a sound effect or musical sequence specified within a screen play or radio script. |
27 | sound | categorizes the sound in some respect, e.g. as music, special effect, etc. |
46 | sound | indicates whether the sound overlaps the surrounding speeches or interrupts them. |
66 | sound | The value |
68 | sound | indicates that the sound is heard between the surrounding speeches; the value |
70 | sound | indicates that the sound overlaps one or more of the surrounding speeches. |
169 | sound | A specialized form of stage direction. |
# | id | text |
---|---|---|
2 | trailer | contains a closing title or footer appearing at the end of a division of a text. |
# | id | text |
---|---|---|
4 | normalization | indicates the extent of normalization or regularization of the original source carried out in converting it to electronic form. |
33 | normalization | indicates a bibliographic description or other resource documenting the principles underlying the normalization carried out. |
77 | normalization | normalization made silently |
93 | normalization | normalization represented using markup |
# | id | text |
---|---|---|
42 | zone | indicates the amount by which this zone has been rotated clockwise, with respect to the normal orientation of the parent |
44 | zone | element as implied by the dimensions given in the |
48 | zone | itself. The orientation is expressed in arc degrees. |
82 | zone | The position of every zone for a given surface is always defined by reference to the coordinate system defined for that surface. |
84 | zone | A graphic element contained by a zone represents the whole of the zone. |
86 | zone | A zone may be of any shape. The attribute |
# | id | text |
---|---|---|
4 | pause | marks a pause either between or within utterances. |
# | id | text |
---|---|---|
2 | listWit | witness list |
58 | listWit | May contain a series of |
68 | listWit | Situations commonly arise where there are many more or less fragmentary witnesses, such that there may be quite distinct groups of witnesses for different parts of a text or collection of texts. Such groups may be given separately, or nested within a single |
77 | listWit | Note however that a given witness can only be defined once, and can therefore only appear within a single |
# | id | text |
---|---|---|
2 | content | content model |
14 | content | contains the text of a declaration for the schema documented. |
47 | content | controls whether or not pattern names generated in the corresponding Relax NG schema source are automatically prefixed to avoid potential name clashes. |
56 | content | Each name referenced in e.g. a |
58 | content | element within a content model is automatically prefixed by the value of the |
66 | content | No prefixes are added: any prefix required by the value of the |
70 | content | must therefore be supplied explicitly, as appropriate. |
87 | content | element defines a content model allowing either a sequence of paragraphs or a series of msItem elements optionally preceded by a summary: |
102 | content | This content model allows either a sequence of paragraphs or a series of msItem elements optionally preceded by a summary: |
165 | content | As the example shows, content models may be expressed using the RELAX NG syntax directly. To avoid ambiguity when schemas using elements from different namespaces are created, the name supplied for an element in a content model will be automatically prefixed by a short string, as specified by the |
174 | content | macro.schemaPattern |
175 | content | defines which elements may be used to define content models. Alternatively, a content model may be expressed using the TEI |
# | id | text |
---|---|---|
2 | charName | character name |
14 | charName | contains the name of a character, expressed following Unicode conventions. |
48 | charName | The name must follow Unicode conventions for character naming. Projects working in similar fields are recommended to coordinate and publish their list of |
50 | charName | s to facilitate data exchange. |
# | id | text |
---|---|---|
4 | broadcast | describes a broadcast used as the source of a spoken text. |
# | id | text |
---|---|---|
18 | att.translatable | specifies the date on which the source text was extracted and sent to the translator |
38 | att.translatable | attribute can be used to determine whether a translation might need to be revisited, by comparing the modification date on the containing file with the |
40 | att.translatable | value on the translation. If the file has changed, changelogs can be checked to see whether the source text has been modified since the translation was made. |
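
The comparison described in the table above can be sketched as follows. This is a minimal illustration only: the function name `may_need_revisiting` and both parameter names are hypothetical, and the sketch assumes both dates are already available as `datetime.date` values.

```python
from datetime import date


def may_need_revisiting(file_modified: date, translation_date: date) -> bool:
    """Return True when the containing file changed after the date
    recorded on the translation (hypothetical parameter names).

    A True result only means the changelogs are worth checking; it does
    not prove that the source text itself was modified.
    """
    return file_modified > translation_date
```

A `True` result is a prompt to inspect the changelogs, since the file may have changed for reasons unrelated to the translated source text.
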
# | id | text |
---|---|---|
2 | data.numeric | defines the range of attribute values used for numeric values. |
27 | data.numeric | Any numeric value, represented as a decimal number, in floating point format, or as a ratio. |
33 | data.numeric | , may be used. In this format, the value is expressed as two numbers separated by the letter E. The first number, the significand (sometimes called the mantissa) is given in decimal format, while the second is an integer. The value is obtained by multiplying the mantissa by 10 the number of times indicated by the integer. Thus the value represented in decimal notation as 1000.0 might be represented in scientific notation as 10E2. |
35 | data.numeric | A value expressed as a ratio is represented by two integer values separated by a solidus (/) character. Thus, the value represented in decimal notation as 0.5 might be represented as a ratio by the string 1/2. |
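
The three notations described in the table above (decimal, scientific with the letter E, and ratio with a solidus) can be read with a short helper. This is a sketch under stated assumptions: the function name `parse_numeric` is hypothetical, and Python's built-in `float()` is more permissive than the datatype described here (it also accepts strings such as `inf`), so this is not a strict validator.

```python
from fractions import Fraction


def parse_numeric(value: str) -> float:
    """Convert a decimal, scientific-notation, or ratio string to a float.

    Ratios are two integers separated by a solidus, e.g. "1/2".
    Anything else is handed to float(), which understands both plain
    decimals and the significand-E-exponent form.
    """
    text = value.strip()
    if "/" in text:
        numerator, denominator = text.split("/")
        return float(Fraction(int(numerator), int(denominator)))
    return float(text)
```

With this helper, the ratio `1/2` and the decimal `0.5` produce the same value, and `1.5E2` is read as the significand 1.5 multiplied by 10 twice.
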
# | id | text |
---|---|---|
2 | precision | indicates the numerical accuracy or precision associated with some aspect of the text markup. |
23 | precision | indicates the degree of precision to be assigned as a value between 0 (none) and 1 (optimally precise) |
30 | precision | characterizes the precision of the element or attribute pointed to by the |
39 | precision | supplies a standard deviation associated with the value in question |
# | id | text |
---|---|---|
2 | unicodeName | unicode property name |
14 | unicodeName | contains the name of a registered Unicode normative or informative property. |
37 | unicodeName | specifies the version number of the Unicode Standard in which this property name is defined. |
73 | unicodeName | A definitive list of current Unicode property names is provided in The Unicode Standard. |
# | id | text |
---|---|---|
4 | arc | encodes an arc, the connection from one node to another in a graph. |
31 | arc | gives the identifier of the node which is adjacent from this arc. |
50 | arc | gives the identifier of the node which is adjacent to this arc. |
102 | arc | element must be used if the arcs are labeled. Otherwise, arcs can be encoded using the |
118 | arc | provides a label for the arc; the second provides a second label for the arc, and should be used if a transducer is being encoded. |
# | id | text |
---|---|---|
2 | correspAction | contains a structured description of the place, the name of a person/organization and the date related to the sending/receiving of a message or any other action related to the correspondence |
# | id | text |
---|---|---|
2 | witDetail | witness detail |
49 | witDetail | indicates the sigil or sigla identifying the witness or witnesses to which the detail refers. |
109 | witDetail | note type='witnessDetail' |
112 | witDetail | attribute, which permits an application to extract all annotation concerning a particular witness or witnesses from the apparatus. It also differs in that the location of a |
# | id | text |
---|---|---|
4 | source | describes the original source for the information contained within a manuscript description. |
# | id | text |
---|---|---|
2 | valDesc | value description |
14 | valDesc | specifies any semantic or syntactic constraint on the value that an attribute may take, additional to the information carried by the |
# | id | text |
---|---|---|
4 | tag | contains text of a complete start- or end-tag, possibly including attribute specifications, but excluding the opening and closing markup delimiter characters. |
27 | tag | indicates the type of XML tag intended |
88 | tag | supplies the name of the schema in which this tag is defined. |
105 | tag | TEI |
109 | tag | text encoding initiative |
113 | tag | This tag is defined as part of the TEI scheme. |
133 | tag | this tag is part of the Docbook scheme. |
159 | tag | this tag is part of an unknown scheme. |
# | id | text |
---|---|---|
5 | number | indicates grammatical number associated with a form, as given in a dictionary. |
83 | number | gram type="num" |
# | id | text |
---|---|---|
2 | teiHeader | TEI header |
16 | teiHeader | supplies the descriptive and declarative information making up an electronic title page for every TEI-conformant document. |
48 | teiHeader | specifies the kind of document to which the header is attached, for example whether it is a corpus or individual text. |
67 | teiHeader | text |
71 | teiHeader | the header is attached to a single text. |
87 | teiHeader | the header is attached to a corpus. |
307 | teiHeader | One of the few elements unconditionally required in any TEI document. |
# | id | text |
---|---|---|
4 | author | in a bibliographic reference, contains the name(s) of an author, personal or corporate, of a work; for example in the same form as that provided by a recognized bibliographic name authority. |
69 | author | Particularly where cataloguing is likely to be based on the content of the header, it is advisable to use a generally recognized name authority file to supply the content for this element. The attributes |
75 | author | In the case of a broadcast, use this element for the name of the company or network responsible for making the broadcast. |
77 | author | Where an author is unknown or unspecified, this element may contain text such as |
81 | author | . When the appropriate TEI modules are in use, it may also contain detailed tagging of the names used for people, organizations or places, in particular where multiple names are given. |
# | id | text |
---|---|---|
41 | objectDesc | a short project-specific name identifying the physical form of the carrier, for example as a codex, roll, fragment, partial leaf, cutting etc. |
# | id | text |
---|---|---|
2 | fsdLink | feature structure declaration link |
12 | fsdLink | associates the name of a typed feature structure with a feature structure declaration for it. |
32 | fsdLink | identifies the type of feature structure to be documented; this will be the value of the |
# | id | text |
---|---|---|
2 | origDate | origin date |
13 | origDate | contains any form of date, used to identify the date of origin for a manuscript or manuscript part. |
# | id | text |
---|---|---|
2 | re | related entry |
14 | re | contains a dictionary entry for a lexical item related to the headword, such as a compound phrase or derived form, embedded inside a larger entry. |
51 | re | shows a single related entry for which no definition is given, since its meaning is held to be readily derivable from the root entry: |
350 | re | shows a number of related entries embedded in the main entry. The original entry resembles the following: |
367 | re | One encoding for this entry would be: |
443 | re | s in its main entry for |
447 | re | This entry may be encoded thus: |
513 | re | May contain character data mixed with any other elements defined in the dictionary tag set. |
517 | re | tag, and used where a dictionary has embedded information inside one entry which could have formed a separate entry. Some authorities distinguish related entries, run-on entries, and various other types of degenerate entries; no such typology is attempted here. |
# | id | text |
---|---|---|
2 | body | text body |
16 | body | contains the whole body of a single unitary text, excluding any front or back matter. |
# | id | text |
---|---|---|
2 | linkGrp | link group |
13 | linkGrp | defines a collection of associations or hypertextual links. |
124 | linkGrp | A web or link group is an administrative convenience, which should be used to collect a set of links together for any purpose, not simply to supply a default value for the |
# | id | text |
---|---|---|
2 | tech | technical stage direction |
14 | tech | describes a special-purpose stage direction that is not meant for the actors. |
37 | tech | categorizes the technical stage direction. |
72 | tech | a sound cue |
122 | tech | performance |
134 | tech | elements documenting the performance or performances to which this technical direction applies. |
# | id | text |
---|---|---|
18 | att.ascribed | indicates the person, or group of people, to whom the element content is ascribed. |
38 | att.ascribed | ) in the body of the play are linked to |
# | id | text |
---|---|---|
14 | hi | marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made. |
# | id | text |
---|---|---|
4 | namespace | supplies the formal name of the namespace to which the elements documented by its children belong. |
30 | namespace | specifies the full formal name of the namespace concerned. |
# | id | text |
---|---|---|
4 | locus | defines a location within a manuscript or manuscript part, usually as a (possibly discontinuous) sequence of folio references. |
30 | locus | identifies the foliation scheme in terms of which the location is being specified by pointing to some |
53 | locus | specifies the starting point of the location in a normalized form, typically a page number. |
74 | locus | specifies the end-point of the location in a normalized form, typically as a page number. |
189 | locus | attribute is available globally when the |
211 | locus | attribute should only be used to point to elements that contain or indicate a transcription of the locus being described, as in the first example above. To associate a |
253 | locus | When the location being defined consists of a single page, use the |
261 | locus | . For example, if the manuscript description being transcribed has |
# | id | text |
---|---|---|
2 | att.resourced | provides attributes by which a resource (such as an externally held media file) may be located. |
16 | att.resourced | specifies the URL from which the media concerned may be obtained. |
# | id | text |
---|---|---|
2 | addrLine | address line |
13 | addrLine | contains one line of a postal |
86 | addrLine | Addresses may be encoded either as a sequence of lines, or using any sequence of component elements from the |
92 | addrLine | if they form part of the printed address in some source text. |
# | id | text |
---|---|---|
4 | attList | contains documentation for all the attributes associated with this element, as a series of |
52 | attList | specifies whether all the attributes in the list are available (org="group") or only one of them (org="choice") |
69 | attList | group |
# | id | text |
---|---|---|
14 | respons | identifies the individual(s) responsible for some aspect of the content or markup of particular element(s). |
64 | respons | responsibility is being assigned concerning the name of the element or attribute used. |
76 | respons | responsibility is being assigned concerning the location of the element concerned. |
80 | respons | responsibility is being assigned concerning the content (for an element) or the value (for an attribute) |
210 | respons | element is designed for cases in which fine-grained information about specific aspects of the markup of a text is desirable for whatever reason. Global responsibility for certain aspects of markup is usually more simply indicated in the TEI header, using the |
212 | respons | element within the title statement, edition statement, or change log. |
# | id | text |
---|---|---|
16 | revisionDesc | summarizes the revision history for a file. |
80 | revisionDesc | to record the status at the time of that change. Conventionally change elements should be given in reverse date order, with the most recent change at the start of the list. |
# | id | text |
---|---|---|
30 | occupation | indicates the classification system or taxonomy in use, for example by supplying the identifier of a |
63 | occupation | identifies an occupation code defined within the classification system or taxonomy defined by the |
138 | occupation | The content of this element may be used as an alternative to the more formal specification made possible by its attributes; it may also be used to supplement the formal specification with commentary or clarification. |
# | id | text |
---|---|---|
14 | add | contains letters, words, or phrases inserted in the source text by an author, scribe, or a previous annotator or corrector. |
45 | add | In a diplomatic edition attempting to represent an original source, the |
47 | add | element should not be used for additions to the current TEI electronic edition made by editors or encoders. In these cases, either the |
53 | add | In a TEI edition of a historical text with previous editorial emendations in which such additions or reconstructions are considered part of the source text, the use of |
# | id | text |
---|---|---|
2 | att.declarable | provides attributes for those elements in the TEI header which may be independently selected by means of the special purpose |
32 | att.declarable | indicates whether or not this element is selected by default when its parent is selected. |
53 | att.declarable | This element is selected if its parent is selected |
69 | att.declarable | This element can only be selected explicitly, unless it is the only one of its kind, in which case it is selected if its parent is selected. |
88 | att.declarable | The rules governing the association of declarable elements with individual parts of a TEI text are fully defined in chapter |
91 | att.declarable | attribute with a value of |
# | id | text |
---|---|---|
2 | recording | recording event |
16 | recording | provides details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast. |
70 | recording | audio recording |
86 | recording | audio and video recording |
# | id | text |
---|---|---|
4 | institution | contains the name of an organization such as a university or library, with which a manuscript is identified, generally its holding institution. |
# | id | text |
---|---|---|
2 | model.stageLike | groups elements containing stage directions or similar things defined by the module for performance texts. |
# | id | text |
---|---|---|
2 | spGrp | speech group |
4 | spGrp | contains a group of speeches or songs in a performance text presented in a source as constituting a single unit or |
5 | spGrp | number |
# | id | text |
---|---|---|
2 | correction | correction principles |
44 | correction | indicates the degree of correction applied to the text. |
67 | correction | the text has been thoroughly checked and proofread. |
83 | correction | the text has been checked at least once. |
99 | correction | the text has not been checked. |
115 | correction | the correction status of the text is unknown. |
213 | correction | May be used to note the results of proof reading the text against its original, indicating (for example) whether discrepancies have been silently rectified, or recorded using the editorial tags described in section |
# | id | text |
---|---|---|
2 | specGrpRef | reference to a specification group |
51 | specGrpRef | points at the specification group which logically belongs here. |
132 | specGrpRef | usually produces a comment indicating that a set of declarations printed in another section will be inserted at this point in the |
138 | specGrpRef | The specification group identified by the |
# | id | text |
---|---|---|
15 | desc | contains a brief description of the object documented by its parent element, including its intended usage, purpose, or application where this is appropriate. |
58 | desc | TEI convention requires that this be expressed as a finite clause, beginning with an active verb. |
# | id | text |
---|---|---|
4 | publisher | provides the name of the organization responsible for the publication or distribution of a bibliographic item. |
63 | publisher | Use the full form of the name by which a company is usually referred to, rather than any abbreviation of it which may appear on a title page |
# | id | text |
---|---|---|
5 | placeName | contains an absolute or relative place name. |
# | id | text |
---|---|---|
64 | eg | If the example contains material in XML markup, either it must be enclosed within a CDATA marked section, or character entity references must be used to represent the markup delimiters. If the example contains well-formed XML, it should be marked using the more specific |
# | id | text |
---|---|---|
2 | castGroup | cast list grouping |
14 | castGroup | groups one or more individual castItem elements within a cast list. |
125 | castGroup | Note that in this example the role description |
# | id | text |
---|---|---|
6 | att.readFrom | specifies the source from which declarations and definitions for the components of the object being defined may be obtained. |
12 | att.readFrom | The context indicated must provide a set of TEI-conformant specifications in a form directly usable by an ODD processor. By default, this will be the location of the current release of the TEI Guidelines. |
14 | att.readFrom | The source may be specified in the form of a private URI, for which the form recommended is |
20 | att.readFrom | for the 1.5.1 release of TEI P5 or (as a special case) |
# | id | text |
---|---|---|
2 | gi | element name |
14 | gi | contains the name (generic identifier) of an element. |
37 | gi | supplies the name of the scheme in which this name is defined. |
54 | gi | TEI |
58 | gi | this element is part of the TEI scheme. |
143 | gi | This example shows the use of both a namespace prefix and the schema attribute as alternative ways of indicating that the gi in question is not a TEI element name: in practice only one method should be adopted. |
# | id | text |
---|---|---|
2 | model.global | groups elements which may appear at any point within a TEI text. |
# | id | text |
---|---|---|
2 | recordingStmt | recording statement |
16 | recordingStmt | describes a set of recordings used as the basis for transcription of a spoken text. |
# | id | text |
---|---|---|
5 | explicit | explicit |
6 | explicit | of a manuscript item, that is, the closing words of the text proper, exclusive of any rubric or colophon which might follow it. |
# | id | text |
---|---|---|
2 | sealDesc | seal description |
13 | sealDesc | describes the seals or other external items attached to a manuscript, either as a series of paragraphs or as a series of distinct |
15 | sealDesc | elements, possibly with additional |
# | id | text |
---|---|---|
2 | model.entryPart.top | groups high level elements within a structured dictionary entry |
17 | model.entryPart.top | Members of this class typically contain related parts of a dictionary entry which form a coherent subdivision, for example a particular sense, homonym, etc. |
# | id | text |
---|---|---|
67 | org | specifies a primary role or classification for the organization. |
83 | org | Values for this attribute may be locally defined by a project, using arbitrary keywords such as |
88 | org | family group |
# | id | text |
---|---|---|
4 | keywords | contains a list of keywords or phrases identifying the topic or nature of a text. |
33 | keywords | identifies the controlled vocabulary or classification scheme within which the set of keywords concerned is defined, for example by a |
109 | keywords | Each individual keyword (including compound subject headings) should be supplied as a |
121 | keywords | If no control list exists for the keywords used, then no value should be supplied for the |
# | id | text |
---|---|---|
13 | classSpec | contains reference information for a TEI element class; that is a group of elements which appear together in content models, or which share some common attribute, or both. |
81 | classSpec | content model |
91 | classSpec | members of this class appear in the same content models |
135 | classSpec | indicates which alternation and sequence instantiations of a model class may be referenced. By default, all variations are permitted. |
170 | classSpec | members of the class are to be provided in sequence |
218 | classSpec | members of the class may be provided one or more times, in sequence |
# | id | text |
---|---|---|
4 | language | characterizes a single language or sublanguage used within a text. |
38 | language | Supplies a language code constructed as defined in |
40 | language | which is used to identify the language documented by this element, and which is referenced by the global |
93 | language | specifies the approximate percentage (by volume) of the text which uses this language. |
154 | language | Particularly for sublanguages, an informal prose characterization should be supplied as content for the element. |
# | id | text |
---|---|---|
2 | model.resourceLike | groups non-textual elements which may appear together with a header and a text to constitute a TEI document. |
# | id | text |
---|---|---|
2 | street | contains a full street address including any name or number identifying a building as well as the name of the street or route on which it is located. |
63 | street | The order and presentation of house names and numbers and street names, etc., may vary considerably in different countries. The encoding should reflect the order which is appropriate in the country concerned. |
# | id | text |
---|---|---|
14 | seg | represents any segmentation of text below the |
137 | seg | element may be used at the encoder's discretion to mark any segments of the text of interest for processing. One use of the element is to mark text features for which no appropriate markup is otherwise defined. Another use is to provide an identifier for some segment which is to be pointed at by some other element—i.e. to provide a target, or a part of a target, for a |
# | id | text |
---|---|---|
30 | gb | attribute indicates the number or other value used to identify this gathering in a collation. |
# | id | text |
---|---|---|
2 | langKnowledge | language knowledge |
12 | langKnowledge | summarizes the state of a person's linguistic knowledge, either as prose or by a list of |
61 | langKnowledge | supplies one or more valid language tags for the languages specified |
79 | langKnowledge | This attribute should be supplied only if the element contains no |
81 | langKnowledge | children. Its values are language |
# | id | text |
---|---|---|
2 | macro.anyXML | defines a content model within which any XML elements are permitted |
11 | macro.anyXML | egXML |
# | id | text |
---|---|---|
2 | set | setting |
13 | set | contains a description of the setting, time, locale, appearance, etc., of the action of a play, typically found in the front matter of a printed performance text (not a stage direction). |
167 | set | This element should not be used outside the front matter; for similar contextual descriptions within the body of the text, use the |
# | id | text |
---|---|---|
4 | settlement | contains the name of a settlement such as a city, town, or village identified as a single geo-political or administrative unit. |
# | id | text |
---|---|---|
2 | metamark | contains or describes any kind of graphic or written signal within a document the function of which is to determine how it should be read rather than forming part of the actual content of the document. |
23 | metamark | identifies one or more elements to which the function indicated by the metamark applies. |
# | id | text |
---|---|---|
133 | publicationStmt | classes rather than one or more paragraphs or anonymous blocks, care should be taken to ensure that the repeated elements are presented in a meaningful order. It is a conformance requirement that elements supplying information about publication place, address, identifier, availability, and date be given following the name of the publisher, distributor, or authority concerned, and preferably in that order. |
# | id | text |
---|---|---|
4 | age | specifies the age of a person. |
29 | age | supplies a numeric code representing the age or age group |
47 | age | This attribute may be used to complement a more detailed discussion of a person's age in the content of the element |
79 | age | As with other culturally-constructed traits such as sex, the way in which this concept is described in different cultural contexts may vary. The normalizing attributes are provided as a means of simplifying that variety to Western European norms and should not be used where that is inappropriate. The content of the element may be used to describe the intended concept in more detail, using plain text. |
# | id | text |
---|---|---|
2 | model.oddRef | groups elements which reference declarations in some markup language in ODD documents. |
# | id | text |
---|---|---|
2 | joinGrp | join group |
14 | joinGrp | groups a collection of join elements and possibly pointers. |
50 | joinGrp | supplies the default value for the |
92 | joinGrp | Any number of |
# | id | text |
---|---|---|
11 | data.percentage | Any non-negative integer value less than 100. |
# | id | text |
---|---|---|
2 | personGrp | personal group |
14 | personGrp | describes a group of individuals treated as a single person for analytic purposes. |
48 | personGrp | specifies the role of this group of participants in the interaction. |
66 | personGrp | Values for this attribute may be locally defined by a project, using arbitrary keywords such as |
80 | personGrp | specifies the sex of the participant group. |
98 | personGrp | Values for this attribute may be locally defined by a project, or may refer to an external standard, such as vCard's sex property |
123 | personGrp | . For a mixed group, a value such as "mixed" may also be supplied. |
128 | personGrp | specifies the age group of the participants. |
146 | personGrp | Values for this attribute may be locally defined by a project, using arbitrary keywords such as |
162 | personGrp | describes informally the size or approximate size of the group, for example by means of a number and an indication of accuracy, e.g. |
194 | personGrp | May contain a prose description organized as paragraphs, or any sequence of demographic elements in any combination. |
198 | personGrp | attribute should be used to identify each speaking participant in a spoken text if the |
# | id | text |
---|---|---|
53 | w | provides a lemma for the word, such as an uninflected dictionary entry form. |
# | id | text |
---|---|---|
2 | decoNote | note on decoration |
13 | decoNote | contains a note describing either a decorative component of a manuscript, or a fairly homogenous class of such components. |
# | id | text |
---|---|---|
13 | ref | defines a reference to another location, possibly modified by additional text or comment. |
41 | ref | Only one of the attributes @target and @cRef may be supplied on |
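
The mutual-exclusion rule quoted in the last row (only one of the target and cRef attributes on a single element) could be checked along the following lines. This is a minimal sketch, not the Schematron rule itself: the function name `target_cref_conflict` and the sample elements are hypothetical, and namespaces are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def target_cref_conflict(ref: ET.Element) -> bool:
    """Return True when both target and cRef are present on the element."""
    return "target" in ref.attrib and "cRef" in ref.attrib


# Hypothetical sample elements, written without a namespace for brevity.
ok = ET.fromstring('<ref target="#sec1">see above</ref>')
bad = ET.fromstring('<ref target="#sec1" cRef="Gen 1:1">see above</ref>')
```

Here `target_cref_conflict(bad)` flags the second element, while the first passes.
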
# | id | text |
---|---|---|
4 | wit | contains a list of one or more sigla of witnesses attesting a given reading, in a textual variation. |
54 | wit | attribute of the reading; it may be used to record the exact form of the sigla given in the source edition, when that is of interest. |
# | id | text |
---|---|---|
14 | fsConstraints | specifies constraints on the content of valid feature structures. |
55 | fsConstraints | May contain a series of conditional or biconditional elements. |
# | id | text |
---|---|---|
2 | macro.limitedContent | paragraph content |
12 | macro.limitedContent | defines the content of prose elements that are not used for transcription of extant materials. |
# | id | text |
---|---|---|
4 | closer | groups together salutations, datelines, and similar phrases appearing as a final group at the end of a division, especially of a letter. |
# | id | text |
---|---|---|
2 | back | back matter |
203 | back | Because cultural conventions differ as to which elements are grouped as back matter and which as front matter, the content models for the |
# | id | text |
---|---|---|
2 | channel | primary channel |
14 | channel | describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, email, etc.; for a spoken one, radio, telephone, face-to-face, etc. |
37 | channel | specifies the mode of this channel with respect to speech and writing. |
58 | channel | spoken |
78 | channel | spoken to be written |
104 | channel | written to be spoken |
# | id | text |
---|---|---|
2 | model.labelLike | groups elements used to gloss or explain other parts of a document. |
# | id | text |
---|---|---|
2 | change | documents a change or set of changes made during the production of a source document, or during the revision of an electronic file. |
123 | change | element elsewhere in the header, identifying the person responsible for the change and their role in making it. |
127 | change | attribute may be used to indicate the status of a document following the change documented. |
# | id | text |
---|---|---|
4 | performance | contains a section of front or back matter describing how a dramatic piece is to be performed in general or how it was performed on some specific occasion. |
151 | performance | contains paragraphs and an optional cast list only. |
# | id | text |
---|---|---|
13 | handDesc | contains a description of all the different kinds of writing used in a manuscript. |
50 | handDesc | specifies the number of distinct hands identified within the manuscript |
# | id | text |
---|---|---|
10 | att.global.change | elements documenting a state or revision campaign to which the element bearing this attribute and its children have been assigned by the encoder. |
# | id | text |
---|---|---|
134 | profileDesc | Although the content model permits it, it is rarely meaningful to supply multiple occurrences for any of the child elements of |
# | id | text |
---|---|---|
33 | formula | names the notation used for the content of the element. |
# | id | text |
---|---|---|
2 | symbol | symbolic value |
14 | symbol | represents the value part of a feature-value specification which contains one of a finite list of symbols. |
38 | symbol | supplies a symbolic value for the feature, one of a finite list that may be specified in a feature declaration. |
# | id | text |
---|---|---|
2 | root | root node |
14 | root | represents the root node of a tree. |
38 | root | identifies the root node of the network by pointing to a feature structure or other analytic element. |
57 | root | identifies the elements which are the children of the root node. |
75 | root | If the root has no children (i.e., the tree is |
77 | root | ), then the |
110 | root | indicates whether or not the root is ordered. |
128 | root | The value |
130 | root | indicates that the children of the root are ordered, whereas |
134 | root | Use if and only if |
140 | root | element and the root has more than one child. |
177 | root | gives the out degree of the root, the number of its children. |
195 | root | The in degree of the root is always 0. |
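
The degree statements in this table can be made concrete with a small sketch. It assumes that the children of the root are given as a whitespace-separated list of pointers, which the listing does not spell out; the function names are hypothetical.

```python
def out_degree(children: str) -> int:
    """Count the whitespace-separated pointers in a children-style value."""
    return len(children.split())


def root_degrees(children: str) -> dict:
    """Return in and out degree for a root node; its in degree is always 0."""
    return {"in": 0, "out": out_degree(children)}
```

A root with no children (an empty list) has out degree 0 under this reading.
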
# | id | text |
---|---|---|
2 | constraintSpec | constraint on schema |
4 | constraintSpec | contains a constraint, expressed in some formal syntax, which cannot be expressed in the structural content model |
28 | constraintSpec | Rules in the Schematron 1.* language must be inside a constraintSpec with a value of 'schematron' on the scheme attribute |
37 | constraintSpec | Rules in the ISO Schematron language must be inside a constraintSpec with a value of 'isoschematron' on the scheme attribute |
46 | constraintSpec | Rules in XSLT must be inside a constraintSpec with a value of 'isoschematron' on the scheme attribute |
54 | constraintSpec | An ISO Schematron constraint specification for a macro should not have an 'assert' or 'report' element without a parent 'rule' element |
61 | constraintSpec | supplies the name of the language in which the constraints are defined |
80 | constraintSpec | private constraint language |
87 | constraintSpec | This constraint uses Schematron to enforce the presence of the |
120 | constraintSpec | This constraint uses a language which is not expressed in XML to check whether the title and author are identical: |
# | id | text |
---|---|---|
4 | origin | contains any descriptive or other information concerning the origin of a manuscript or manuscript part. |
# | id | text |
---|---|---|
16 | stdVals | specifies the format used when standardized date or number values are supplied. |
# | id | text |
---|---|---|
2 | model.measureLike | groups elements which denote a number, a quantity, a measurement, or similar piece of text that conveys some numerical meaning. |
# | id | text |
---|---|---|
2 | model.ptrLike.form | groups elements used for purposes of location of particular orthographic or pronunciation forms within a dictionary entry. |
# | id | text |
---|---|---|
37 | equiv | a single word which follows the rules defining a legal XML name (see |
86 | equiv | references an external script which contains a method to transform instances of this element to canonical TEI |
109 | equiv | hi rend='bold' |
177 | equiv | attribute should be used to supply the MIME media type of the filter script specified by the |
# | id | text |
---|---|---|
2 | msName | alternative name |
14 | msName | contains any form of unstructured alternative name used for a manuscript, such as an |
# | id | text |
---|---|---|
2 | att.source | provides attributes for pointing to the source of a bibliographic reference. |
8 | att.source | provides a pointer to the bibliographical source from which a quotation or citation is drawn. |
# | id | text |
---|---|---|
14 | respStmt | supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply. May also be used to encode information about individuals or organizations which have played a role in the production or distribution of a bibliographic work. |
# | id | text |
---|---|---|
5 | colophon | colophon |
# | id | text |
---|---|---|
2 | macro.phraseSeq.limited | limited phrase sequence |
12 | macro.phraseSeq.limited | defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents. |
# | id | text |
---|---|---|
4 | entry | contains a single structured entry in any kind of lexical resource, such as a dictionary or lexicon. |
122 | entry | s; one convenient method is to use the orthographic form of the headword, appending a disambiguating number where necessary. Identification codes are sometimes included on machine-readable tapes of dictionaries for in-house use. |
126 | entry | element even for an entry that has only one sense to group together all parts of the definition relating to the word sense since this leads to more consistent encoding across entries. |
# | id | text |
---|---|---|
28 | data.pointer | (IRIs) mapping to URIs. For example, |
# | id | text |
---|---|---|
2 | glyphName | character glyph name |
14 | glyphName | contains the name of a glyph, expressed following Unicode conventions for character names. |
47 | glyphName | For characters of non-ideographic scripts, a name following the conventions for Unicode names should be chosen. For ideographic scripts, an |
49 | glyphName | (IDS) as described in Chapter 10.1 of the Unicode Standard is recommended where possible. Projects working in similar fields are recommended to coordinate and publish their list of |
51 | glyphName | s to facilitate data exchange. |
# | id | text |
---|---|---|
14 | def | contains definition text in a dictionary entry. |
# | id | text |
---|---|---|
2 | att.interpLike | provides attributes for elements which represent a formal analysis or interpretation. |
116 | att.interpLike | points to instances of the analysis or interpretation represented by the current element. |
134 | att.interpLike | The current element should be an analytic one. The element pointed at should be a textual one. |
# | id | text |
---|---|---|
2 | valList | value list |
53 | valList | specifies the extensibility of the list of values specified. |
# | id | text |
---|---|---|
2 | model.divTop | groups elements appearing at the beginning of a text division |
# | id | text |
---|---|---|
2 | model.nameLike | groups elements which name or refer to a person, place, or organization. |
# | id | text |
---|---|---|
49 | att.enjamb | indicates that the end of a verse line is marked by enjambement. |
68 | att.enjamb | the line is end-stopped |
84 | att.enjamb | the line in question runs on into the next |
100 | att.enjamb | the line is weakly enjambed |
116 | att.enjamb | the line is strongly enjambed |
133 | att.enjamb | The usual practice will be to give the value |
135 | att.enjamb | to this attribute when enjambement is being marked, or the values |
139 | att.enjamb | if degrees of enjambement are of interest; if no value is given, however, the attribute does not default to a value of |
141 | att.enjamb | ; this allows the attribute to be omitted entirely when enjambement is not of particular interest. |
# | id | text |
---|---|---|
2 | then | separates the condition from the default in an |
# | id | text |
---|---|---|
47 | data.outputMeasurement | These values map directly onto the values used by XSL-FO and CSS. For definitions of the units see those specifications; at the time of this writing the most complete list is in the |
# | id | text |
---|---|---|
3 | incipit | incipit |
4 | incipit | of a manuscript item, that is the opening words of the text proper, exclusive of any |
5 | incipit | rubric |
6 | incipit | which might precede it, of sufficient length to identify the work uniquely; such incipits were, in former times, frequently used as a means of reference to a work, in place of a title. |
# | id | text |
---|---|---|
76 | biblStruct | WARNING: use of deprecated method — the use of the idno element as a direct child of the biblStruct element will be removed from the TEI on 2016-09-18 |
# | id | text |
---|---|---|
4 | heraldry | contains a heraldic formula or phrase, typically found as part of a blazon, coat of arms, etc. |
# | id | text |
---|---|---|
88 | term | This element is used to supply the form under which an index entry is to be made for the location of a parent |
94 | term | element may be used to mark any of these. No position is taken on the philosophical issue of what a term can be; the looser definition simply allows the |
100 | term | class, instances of this element occurring in a text may be associated with a canonical definition, either by means of a URI (using the |
102 | term | attribute), or by means of some system-specific code value (using the |
# | id | text |
---|---|---|
2 | msItemStruct | structured manuscript item |
13 | msItemStruct | contains a structured description for an individual work or item within the intellectual content of a manuscript or manuscript part. |
98 | msItemStruct | identifies the text types or classifications applicable to this item by pointing to other elements or resources defining the classification concerned. |
# | id | text |
---|---|---|
4 | constitution | describes the internal composition of a text or text sample, for example as fragmentary, complete, etc. |
27 | constitution | specifies how the text was constituted. |
48 | constitution | a single complete text |
64 | constitution | a text made by combining several smaller items, each individually complete |
90 | constitution | a text made by combining several smaller, not necessarily complete, items |
# | id | text |
---|---|---|
4 | rubric | contains the text of any |
5 | rubric | rubric |
6 | rubric | or heading attached to a particular manuscript item, that is, a string of words through which a manuscript signals the beginning of a text division, often with an assertion as to its author and title, which is in some way set off from the text itself, usually in red ink, or by use of different size or type of script, or some other such visual device. |
# | id | text |
---|---|---|
2 | calendarDesc | calendar description |
10 | calendarDesc | contains a description of the calendar system used in any dating expression found in the text. |
196 | calendarDesc | s are from W3 guidelines at |
# | id | text |
---|---|---|
2 | att.damaged | provides attributes describing the nature of any physical damage affecting a reading. |
19 | att.damaged | in the case of damage (deliberate defacement, inking out, etc.) assignable to a distinct hand, signifies the hand responsible for the damage by pointing to one of the hand identifiers declared in the document header (see section |
37 | att.damaged | categorizes the cause of the damage, if it can be identified. |
54 | att.damaged | damage results from rubbing of the leaf edges |
68 | att.damaged | damage results from mildew on the leaf surface |
82 | att.damaged | damage results from smoke |
98 | att.damaged | provides a coded representation of the degree of damage, either as a number between 0 (undamaged) and 1 (very extensively damaged), or as one of the codes |
110 | att.damaged | attribute should only be used where the text may be read with some confidence; text supplied from other sources should be tagged as |
161 | att.damaged | element is appropriate where it is desired to record the fact of damage although this has not affected the readability of the text, for example a weathered inscription. Where the damage has rendered the text more or less illegible either the |
163 | att.damaged | tag (for partial illegibility) or the |
165 | att.damaged | tag (for complete illegibility, with no text supplied) should be used, with the information concerning the damage given in the attribute values of these tags. See section |
223 | att.damaged | assigns an arbitrary number to each stretch of damage regarded as forming part of the same physical phenomenon. |
# | id | text |
---|---|---|
2 | iNode | intermediate (or internal) node |
14 | iNode | represents an intermediate (or internal) node of a tree. |
38 | iNode | indicates an intermediate node, which is a feature structure or other analytic element. |
57 | iNode | provides a list of identifiers of the elements which are the children of the intermediate node. |
105 | iNode | indicates whether or not the internal node is ordered. |
123 | iNode | The value |
125 | iNode | indicates that the children of the intermediate node are ordered, whereas |
129 | iNode | Use if and only if |
135 | iNode | element and the intermediate node has more than one child. |
172 | iNode | provides the identifier of an element which this node follows. |
190 | iNode | If the tree is unordered or partially ordered, this attribute has the property of fixing the relative order of the intermediate node and the element which is the value of the attribute. |
203 | iNode | gives the out degree of an intermediate node, the number of its children. |
221 | iNode | The in degree of an intermediate node is always 1. |
# | id | text |
---|---|---|
2 | langUsage | language usage |
# | id | text |
---|---|---|
2 | model.persStateLike | groups elements describing changeable characteristics of a person which have a definite duration, for example occupation, residence, or name. |
# | id | text |
---|---|---|
52 | att.msExcerpt | In the case of an incipit, indicates whether the incipit as given is defective, i.e. the first words of the text as preserved, as opposed to the first words of the work itself. In the case of an explicit, indicates whether the explicit as given is defective, i.e. the final words of the text as preserved, as opposed to what the closing words would have been had the text of the work been whole. |
# | id | text |
---|---|---|
4 | table | contains text displayed in tabular form, in rows and columns. |
58 | table | indicates the number of rows in the table. |
76 | table | If no number is supplied, an application must calculate the number of rows. |
101 | table | indicates the number of columns in each row of the table. |
119 | table | If no number is supplied, an application must calculate the number of columns. |
283 | table | Contains an optional heading and a series of rows. |
285 | table | Any rendition information should be supplied using the global |
287 | table | attribute, at the table, row, or cell level as appropriate. |
# | id | text |
---|---|---|
2 | listEvent | list of events |
6 | listEvent | contains a list of descriptions, each of which provides information about an identifiable event. |
# | id | text |
---|---|---|
2 | altGrp | alternation group |
14 | altGrp | groups a collection of |
51 | altGrp | states whether the alternations gathered in this collection are exclusive or inclusive. |
167 | altGrp | Any number of alternations, pointers or extended pointers. |
# | id | text |
---|---|---|
2 | att.duration.iso | provides attributes for recording normalized temporal durations. |
56 | att.duration.iso | are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. In order to represent a time range by a duration and its ending time the |
62 | att.duration.iso | form, no claim is made that the form in the source text is incorrect; the regularized form is simply that chosen as the main form for purposes of unifying variant forms under a single heading. |
# | id | text |
---|---|---|
7 | att.witnessed | witness or witnesses |
17 | att.witnessed | contains a space-delimited list of one or more pointers indicating the witnesses which attest to a given reading. |
37 | att.witnessed | This attribute may occur both within an apparatus gathering variant readings in the transcription of an individual witness and within an apparatus gathering readings from different witnesses. |
39 | att.witnessed | Additional descriptions or alternative versions of the sigla referenced may be supplied as the content of a child |
# | id | text |
---|---|---|
14 | att.personal | common attributes for those elements which form part of a name, usually, but not necessarily, a personal name. |
33 | att.personal | indicates whether the name component is given in full, as an abbreviation or simply as an initial. |
56 | att.personal | the name component is spelled out in full. |
82 | att.personal | the name component is given in an abbreviated form. |
108 | att.personal | the name component is indicated only by one initial. |
128 | att.personal | specifies the sort order of the name component in relation to others within the name. |
# | id | text |
---|---|---|
40 | moduleRef | are only allowed when an external module is being loaded |
47 | moduleRef | specifies a default prefix which will be prepended to all patterns from the imported module |
62 | moduleRef | Use of this attribute avoids name collisions (and thus invalid schemas) when the external schema being mixed in with TEI uses a name the TEI or some other included external schema already uses for a pattern. |
68 | moduleRef | supplies a list of the elements which are to be copied from the specified module into the schema being defined. |
75 | moduleRef | supplies a list of the elements which are not to be copied from the specified module into the schema being defined. |
84 | moduleRef | the name of a TEI module |
105 | moduleRef | refers to a non-TEI module of RELAX NG code by external location |
123 | moduleRef | This includes all objects available from the linking module. |
139 | moduleRef | This includes all elements available from the linking module except for the |
154 | moduleRef | elements from the linking module. |
169 | moduleRef | A TEI module is identified by the name supplied as value for the |
175 | moduleRef | attribute may be used to specify an online source from which the specification of that module may be read. A URI may alternatively be supplied in the case of a non-TEI module, and this is expected to be written as a RELAX NG schema. |
# | id | text |
---|---|---|
3 | per | person |
15 | per | contains an indication of the grammatical person (1st, 2nd, 3rd, etc.) associated with a given inflected form in a dictionary. |
99 | per | gram type="person" |
# | id | text |
---|---|---|
14 | monogr | contains bibliographic elements describing an item (e.g. a book or journal) published as an independent item (i.e. as a separate physical object). |
# | id | text |
---|---|---|
15 | sp | contains an individual speech in a performance text, or a passage presented as such in a prose or verse text. |
140 | sp | Lines or paragraphs, stage directions, and phrase-level elements. |
# | id | text |
---|---|---|
26 | rhyme | provides a label (usually a single letter) to identify which part of a rhyme scheme this rhyming string instantiates. |
47 | rhyme | elements with the same value for their |
49 | rhyme | attribute are assumed to rhyme with each other. The scope is defined by the nearest ancestor element for which the |
# | id | text |
---|---|---|
2 | defaultVal | default value |
13 | defaultVal | specifies the default declared value for an attribute. |
52 | defaultVal | any legal declared value or TEI-defined keyword |
# | id | text |
---|---|---|
72 | expan | The content of this element should usually be a complete word or phrase. The |
76 | expan | module may be used to mark up sequences of letters supplied within such an expansion. |
# | id | text |
---|---|---|
2 | msItem | manuscript item |
13 | msItem | describes an individual work or item within the intellectual content of a manuscript or manuscript part. |
56 | msItem | identifies the text types or classifications applicable to this item by pointing to other elements or resources defining the classification concerned. |
# | id | text |
---|---|---|
2 | punctuation | specifies editorial practice adopted with respect to punctuation marks in the original. |
16 | punctuation | indicates whether or not punctuation marks have been retained as content within the text. |
23 | punctuation | no punctuation marks have been retained |
27 | punctuation | some punctuation marks have been retained |
31 | punctuation | all punctuation marks have been retained |
44 | punctuation | punctuation marks are captured inside adjacent elements |
48 | punctuation | punctuation marks are captured outside adjacent elements |
# | id | text |
---|---|---|
4 | hyphenation | summarizes the way in which hyphenation in a source text has been treated in an encoded version of it. |
42 | hyphenation | indicates whether or not end-of-line hyphenation has been retained in a text. |
65 | hyphenation | all end-of-line hyphenation has been retained, even though the lineation of the original may not have been. |
81 | hyphenation | end-of-line hyphenation has been retained in some cases. |
97 | hyphenation | all soft end-of-line hyphenation has been removed: any remaining end-of-line hyphenation should be retained. |
113 | hyphenation | all end-of-line hyphenation has been removed: any remaining hyphenation occurred within the line. |
# | id | text |
---|---|---|
4 | time | contains a phrase defining a time of day in any format. |
# | id | text |
---|---|---|
2 | titlePart | contains a subsection or division of the title of a work, as indicated on a title page. |
28 | titlePart | specifies the role of this subdivision of the title. |
51 | titlePart | main title of the work |
95 | titlePart | alternate |
107 | titlePart | alternative title of the work |
123 | titlePart | abbreviated form of title |
# | id | text |
---|---|---|
6 | att.deprecated | provides a date before which the construct being defined will not be removed. |
24 | att.deprecated | The value of this attribute should represent a date (in standard |
26 | att.deprecated | format) which is later than the date on which the attribute is added to an ODD. Technically, this attribute asserts only the intent to leave a construct in future releases of the markup language being defined up to at least the specified date, and makes no assertion about what happens past that date. In practice, the expectation is that the construct will be removed from future releases of the markup language being defined sometime shortly after the |
32 | att.deprecated | date that is in the past. An ODD processor will typically warn users about constructs which have a |
34 | att.deprecated | date that is in the future. E.g., the documentation for such a construct might include the phrase |
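
The distinction drawn above between a date still in the future (the construct is deprecated and a processor typically warns) and a date already in the past can be sketched as follows. This is only an illustration under stated assumptions: the date is taken to be in YYYY-MM-DD form, the function name and the returned labels are hypothetical, and what a real ODD processor does with a past date is not shown in this listing.

```python
from datetime import date


def deprecation_status(valid_until: str, today: date) -> str:
    """Classify a construct by comparing its date with today's date."""
    # Assumption: valid_until is a YYYY-MM-DD string.
    limit = date.fromisoformat(valid_until)
    if limit < today:
        # Date is in the past: removal may already have happened.
        return "expired"
    # Date is today or later: the construct is deprecated but still present.
    return "warn"
```

Called with the date quoted in the biblStruct warning elsewhere in these tables, `deprecation_status("2016-09-18", date(2016, 1, 1))` falls in the "warn" case.
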
# | id | text |
---|---|---|
2 | domain | domain of use |
14 | domain | describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc. |
37 | domain | categorizes the domain of use. |
104 | domain | business and work place |
120 | domain | education |
202 | domain | Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose. |
204 | domain | The list presented here is primarily for illustrative purposes. |
# | id | text |
---|---|---|
2 | listPlace | list of places |
12 | listPlace | contains a list of places, optionally followed by a list of relationships (other than containment) defined amongst them. |
# | id | text |
---|---|---|
2 | addSpan | added span of text |
14 | addSpan | marks the beginning of a longer sequence of text added by an author, scribe, annotator or corrector (see also |
95 | addSpan | Both the beginning and the end of the added material must be marked; the beginning by the |
# | id | text |
---|---|---|
2 | binary | binary value |
14 | binary | represents the value part of a feature-value specification which can contain either of exactly two possible values. |
40 | binary | supplies a binary value. |
57 | binary | This attribute has a datatype of data.truthValue, which may be represented by the values |
91 | binary | The value attribute may take any value permitted for attributes of the W3C datatype Boolean: this includes for example the strings |
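
For readers who want to test values against the W3C Boolean datatype mentioned in the last row, here is a minimal sketch. It relies on the lexical forms defined by W3C XML Schema for boolean (`true`, `false`, `1`, `0`); the function name is hypothetical, and returning `None` for anything else is a choice made for this example.

```python
from typing import Optional

_TRUE_FORMS = {"true", "1"}
_FALSE_FORMS = {"false", "0"}


def parse_xsd_boolean(text: str) -> Optional[bool]:
    """Map an XML Schema boolean lexical form to a Python bool, else None."""
    token = text.strip()  # surrounding whitespace is ignored in this sketch
    if token in _TRUE_FORMS:
        return True
    if token in _FALSE_FORMS:
        return False
    return None
```
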
# | id | text |
---|---|---|
3 | att.duration | provides attributes for normalization of elements that contain datable events. |
28 | att.duration | class. In general, the possible values of attributes restricted to the W3C datatypes form a subset of those values available via the ISO 8601 standard. However, the greater expressiveness of the ISO datatypes is rarely needed, and there exists much greater software support for the W3C datatypes. |
# | id | text |
---|---|---|
4 | choice | groups a number of alternative encodings for the same point in a text. |
79 | choice | element all represent alternative ways of encoding the same sequence, it is natural to think of them as mutually exclusive. However, there may be cases where a full representation of a text requires the alternative encodings to be considered as parallel. |
85 | choice | Where the purpose of an encoding is to record multiple witnesses of a single work, rather than to identify multiple possible encoding decisions at a given point, the |
# | id | text |
---|---|---|
2 | vMerge | merged collection of values |
14 | vMerge | represents a feature value which is the result of merging together the feature values contained by its children, using the organization specified by the |
133 | vMerge | This example returns a list, concatenating the indeterminate value with the set of values masculine, neuter and feminine. |
# | id | text |
---|---|---|
2 | rs | referencing string |
14 | rs | contains a general purpose name or referring string. |
# | id | text |
---|---|---|
4 | group | contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc. |
# | id | text |
---|---|---|
2 | att.pointing.group | defines a set of attributes common to all elements which enclose groups of pointer elements. |
40 | att.pointing.group | If this attribute is supplied every element specified as a target must be contained within the element or elements named by it. An application may choose whether or not to report failures to satisfy this constraint as errors, but may not access an element of the right identifier but in the wrong context. If this attribute is not supplied, then target elements may appear anywhere within the target document. |
134 | att.pointing.group | The number of separate values must match the number of values in the |
144 | att.pointing.group | element may be needed to accomplish this). It should also match the number of values in the |
146 | att.pointing.group | attribute, of the current element, if one has been specified. |
# | id | text |
---|---|---|
42 | citedRange | . For example, if the citation has |
# | id | text |
---|---|---|
6 | att.docStatus | describes the status of a document either currently or, when associated with a dated element, at the time indicated. |
# | id | text |
---|---|---|
89 | move | character moves on stage |
105 | move | specifies the direction of a stage movement. |
134 | move | stage left |
160 | move | stage right |
186 | move | centre stage |
206 | move | upper stage left |
226 | move | performance |
236 | move | identifies the performance or performances in which this movement occurred as specified by pointing to one or more |
# | id | text |
---|---|---|
4 | case | contains grammatical case information given by a dictionary for a given form. |
109 | case | May contain character data and phrase-level elements. Typical values will be of the form |
120 | case | gram type="case" |
# | id | text |
---|---|---|
2 | div3 | level-3 text division |
16 | div3 | contains a third-level subdivision of the front, body, or back of a text. |
162 | div3 | any sequence of low-level structural elements, possibly grouped into lower subdivisions. |
# | id | text |
---|---|---|
4 | datatype | specifies the declared value for an attribute, by referring to any datatype defined by the chosen schema language. |
32 | datatype | minimum number of occurrences |
44 | datatype | indicates the minimum number of times this datatype may occur in the specification of the attribute being defined |
65 | datatype | maximum number of occurrences |
77 | datatype | indicates the maximum number of times this datatype may occur in the specification of the attribute being defined |
151 | datatype | The encoding in the following example requires that the attribute being defined contain at least two URIs in its value, as is the case for the |
164 | datatype | In the TEI scheme, most datatypes are expressed using pre-defined TEI macros, which map a name in the form |
# | id | text |
---|---|---|
16 | encodingDesc | documents the relationship between an electronic text and the source or sources from which it was derived. |
# | id | text |
---|---|---|
18 | att.entryLike | indicates type of entry, in dictionaries with multiple types. |
39 | att.entryLike | a main entry (default). |
99 | att.entryLike | a reduced entry whose only function is to point to another main entry (e.g. for forms of an irregular verb or for variant spellings: |
163 | att.entryLike | an entry for a prefix, infix, or suffix. |
189 | att.entryLike | an entry for an abbreviation. |
205 | att.entryLike | a supplemental entry (for use in dictionaries which issue supplements to their main work in which they include updated information about entries). |
221 | att.entryLike | an entry for a foreign word in a monolingual dictionary. |
# | id | text |
---|---|---|
2 | delSpan | deleted span of text |
14 | delSpan | marks the beginning of a longer sequence of text deleted, marked as deleted, or otherwise signaled as superfluous or spurious by an author, scribe, annotator, or corrector. |
95 | delSpan | Both the beginning and ending of the deleted sequence must be marked: the beginning by the |
101 | delSpan | The text deleted must be at least partially legible, in order for the encoder to be able to transcribe it. If it is not legible at all, the |
103 | delSpan | tag should not be used. Rather, the |
105 | delSpan | tag should be employed to signal that text cannot be transcribed, with the value of the |
109 | delSpan | element should be used to signal the areas of text which cannot be read with confidence. See further sections |
112 | delSpan | tag with the |
125 | delSpan | tag should not be used for deletions made by editors or encoders. In these cases, either the |
127 | delSpan | tag or the |
129 | delSpan | tag should be used. |
# | id | text |
---|---|---|
14 | fvLib | assembles a library of reusable feature value elements (including complete feature structures). |
62 | fvLib | A feature value library may include any number of values of any kind, including multiple occurrences of identical values such as |
65 | fvLib | default |
66 | fvLib | . The only thing guaranteed unique in a feature value library is the set of labels used to identify the values. |
# | id | text |
---|---|---|
2 | glyph | character glyph |
14 | glyph | provides descriptive information about a character glyph |
# | id | text |
---|---|---|
23 | data.word | Attributes using this datatype must contain a single |
25 | data.word | which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace. |
# | id | text |
---|---|---|
2 | data.enumerated | defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities. |
20 | data.enumerated | Attributes using this datatype must contain a single |
22 | data.enumerated | matching the rules for XML names: i.e., a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops. |
24 | data.enumerated | Typically, the list of documented possibilities will be provided (or exemplified) by a value list in the associated attribute specification, expressed with a |
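
The naming rule described in the rows above can be approximated with a short check. This is only a sketch: it is restricted to ASCII, whereas the XML Specification also admits many non-ASCII letters, and the function name `is_xml_name_ascii` is hypothetical.

```python
import re

# ASCII-only approximation of an XML name (an assumption of this sketch):
# starts with a letter, underscore, or colon; continues with letters,
# digits, hyphens, underscores, colons, or full stops.
XML_NAME_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*")


def is_xml_name_ascii(token):
    """Return True if the whole token looks like an (ASCII) XML name."""
    return XML_NAME_ASCII.fullmatch(token) is not None


if __name__ == "__main__":
    for candidate in ["commonNoun", "2nd", "two words", "_x-1.y"]:
        print(candidate, is_xml_name_ascii(candidate))
```

A token with whitespace or a leading digit is rejected because `fullmatch` requires the pattern to cover the entire string.
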
# | id | text |
---|---|---|
2 | model.lPart | groups phrase-level elements which may appear within verse only. |
# | id | text |
---|---|---|
4 | interpretation | describes the scope of any analytic or interpretive information added to the text in addition to the transcription. |
# | id | text |
---|---|---|
2 | att.global.rendition | provides rendering attributes common to all elements in the TEI encoding scheme. |
7 | att.global.rendition | rendition |
17 | att.global.rendition | indicates how the element in question was rendered or presented in the source text. |
58 | att.global.rendition | These Guidelines make no binding recommendations for the values of the |
60 | att.global.rendition | attribute; the characteristics of visual presentation vary too much from text to text and the decision to record or ignore individual characteristics varies too much from project to project. Some potentially useful conventions are noted from time to time at appropriate points in the Guidelines. The values of the |
62 | att.global.rendition | attribute are a set of sequence-indeterminate individual tokens separated by whitespace. |
85 | att.global.rendition | contains an expression in some formal style definition language which defines the rendering or presentation used for this element in the source text |
107 | att.global.rendition | attribute may contain whitespace. This attribute is intended for recording inline stylistic information concerning the source, not any particular output. |
109 | att.global.rendition | The formal language in which values for this attribute are expressed may be specified using the |
111 | att.global.rendition | element in the TEI header. |
116 | att.global.rendition | points to a description of the rendering or presentation used for this element in the source text. |
168 | att.global.rendition | attribute defined for XHTML but with the important distinction that its function is to describe the appearance of the source text, not necessarily to determine how that text should be presented on screen or paper. |
178 | att.global.rendition | element defining the intended rendition in terms of some appropriate style language, as indicated by the |
# | id | text |
---|---|---|
78 | fLib | attribute may be used to supply an informal name to categorize the library's contents. |
# | id | text |
---|---|---|
8 | data.xmlName | The rules defining an XML name form a part of the XML Specification. |
# | id | text |
---|---|---|
4 | terrain | contains information about the physical terrain of a place. |
# | id | text |
---|---|---|
2 | att.datable | provides attributes for normalization of elements that contain dates, times, or datable events. |
23 | att.datable | indicates the system or calendar to which the date represented by the content of this element belongs. |
43 | att.datable | @calendar indicates the system or calendar to which the date represented by the content of this element belongs, but this |
69 | att.datable | ) defines the calendar system of the date in the original material defined by the parent element, |
71 | att.datable | the calendar to which the date is normalized. |
75 | att.datable | supplies a pointer to some location defining a named period of time within which the datable item is understood to have occurred. |
101 | att.datable | classes. In general, the possible values of attributes restricted to the W3C datatypes form a subset of those values available via the ISO 8601 standard. However, the greater expressiveness of the ISO datatypes may not be needed, and there exists much greater software support for the W3C datatypes. |
# | id | text |
---|---|---|
2 | specList | specification list |
12 | specList | marks where a list of descriptions is to be inserted into the prose documentation. |
# | id | text |
---|---|---|
52 | schemaSpec | specifies entry points to the schema, i.e. which patterns may be used as the root of documents conforming to it. |
69 | schemaSpec | TEI |
73 | schemaSpec | specifies a default prefix which will be prepended to all patterns relating to TEI elements, unless otherwise stated. |
94 | schemaSpec | Use of this attribute allows an external schema which has an element with the same local name as a TEI element to be mixed in. |
107 | schemaSpec | target language |
117 | schemaSpec | specifies which language to use when creating the objects in a schema if names for elements or attributes are available in more than one language |
136 | schemaSpec | documentation language |
146 | schemaSpec | specifies which languages to use when creating documentation if the description for an element, attribute, class or macro is available in more than one language |
180 | schemaSpec | combines references to modules, individual element or macro declarations, and specification groups together to form a unified schema. The processing of the |
# | id | text |
---|---|---|
2 | macro.schemaPattern | provides a pattern to match elements from the chosen schema language |
# | id | text |
---|---|---|
15 | tns | indicates the grammatical tense associated with a given inflected form in a dictionary. |
99 | tns | gram type="tense" |
# | id | text |
---|---|---|
2 | data.count | defines the range of attribute values used for a non-negative integer value used as a count. |
# | id | text |
---|---|---|
2 | hyph | hyphenation |
14 | hyph | contains a hyphenated form of a dictionary headword, or hyphenation information in some other form. |
# | id | text |
---|---|---|
2 | att.spanning | provides attributes for elements which delimit a span of text by pointing mechanisms rather than by enclosing it. |
18 | att.spanning | indicates the end of a span initiated by the element bearing this attribute. |
50 | att.spanning | The span is defined as running in document order from the start of the content of the pointing element to the end of the content of the element pointed to by the |
52 | att.spanning | attribute (if any). If no value is supplied for the attribute, the assumption is that the span is coextensive with the pointing element. If no content is present, the assumption is that the starting point of the span is immediately following the element itself. |
# | id | text |
---|---|---|
4 | watermark | contains a word or phrase describing a watermark or similar device. |
# | id | text |
---|---|---|
2 | macro.xtext | extended text |
14 | macro.xtext | defines a sequence of character data and gaiji elements. |
# | id | text |
---|---|---|
4 | surplus | marks text present in the source which the editor believes to be superfluous or redundant. |
18 | surplus | one or more words indicating why this text is believed to be superfluous, e.g. |
# | id | text |
---|---|---|
2 | sourceDoc | contains a transcription or other representation of a single source document potentially forming part of a |
4 | sourceDoc | or collection of sources. |
49 | sourceDoc | for TEI documents containing only page images, or for documents containing both images and transcriptions. Transcriptions may be provided within the |
51 | sourceDoc | elements making up a source document, in parallel with them as part of a |
53 | sourceDoc | element, or in both places if the encoder wishes to distinguish these two modes of transcription. |
# | id | text |
---|---|---|
4 | row | contains one row of a table. |
# | id | text |
---|---|---|
20 | att.tableDecoration | indicates the kind of information held in this cell or in each cell of this row. |
74 | att.tableDecoration | When this attribute is specified on a row, its value is the default for all cells in this row. When specified on a cell, its value overrides any default specified by the |
109 | att.tableDecoration | indicates the number of rows occupied by this cell or row. |
129 | att.tableDecoration | A value greater than one indicates that this cell |
130 | att.tableDecoration | spans several rows. Where several cells span multiple rows, it may be more convenient to use nested tables. |
157 | att.tableDecoration | indicates the number of columns occupied by this cell or row. |
177 | att.tableDecoration | A value greater than one indicates that this cell or row spans several columns. Where an initial cell spans an entire row, it may be better treated as a heading. |
# | id | text |
---|---|---|
2 | divGen | automatically generated text division |
14 | divGen | indicates the location at which a textual division generated automatically by a text-processing application is to appear. |
40 | divGen | specifies what type of generated text division (e.g. index, table of contents, etc.) is to appear. |
59 | divGen | an index is to be generated and inserted at this point. |
77 | divGen | a table of contents |
91 | divGen | a list of figures |
107 | divGen | a list of tables |
138 | divGen | One use for this element is to allow document preparation software to generate an index and insert it in the appropriate place in the output. The example below assumes that the |
142 | divGen | elements in the text have been used to specify index entries for the two generated indexes, named NAMES and THINGS: |
234 | divGen | is to specify the location of an automatically produced table of contents: |
250 | divGen | This element is intended primarily for use in document production or manipulation, rather than in the transcription of pre-existing materials; it makes it easier to specify the location of indices, tables of contents, etc., to be generated by text preparation or word processing software. |
# | id | text |
---|---|---|
2 | data.duration.w3c | defines the range of attribute values available for representation of a duration in time using W3C datatypes. |
60 | data.duration.w3c | A duration is expressed as a sequence of number-letter pairs, preceded by the letter P; the letter gives the unit and may be Y (year), M (month), D (day), H (hour), M (minute), or S (second), in that order. The numbers are all unsigned integers, except for the |
64 | data.duration.w3c | as the decimal point). If any number is |
66 | data.duration.w3c | , then that number-letter pair may be omitted. If any of the H (hour), M (minute), or S (second) number-letter pairs are present, then the separator |
69 | data.duration.w3c | time |
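
The number-letter format described in the rows above can be illustrated with a small parser. This is a simplified sketch, not a full implementation of the W3C datatype: it assumes that only the seconds field may carry a decimal fraction and it does not handle a leading sign. The names `DURATION_RE` and `parse_duration` are hypothetical.

```python
import re

# Simplified pattern for durations such as "P1Y2M3DT4H5M6.5S".
# Date pairs come first; time pairs are only allowed after the "T" separator.
DURATION_RE = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?"
)


def parse_duration(text):
    """Return a dict of the number-letter pairs present, or None if malformed."""
    match = DURATION_RE.fullmatch(text)
    if match is None:
        return None
    # "P" alone, or a "T" with no time pairs after it, is not a usable duration.
    if text == "P" or text.endswith("T"):
        return None
    return {
        name: float(value) if name == "seconds" else int(value)
        for name, value in match.groupdict().items()
        if value is not None
    }


if __name__ == "__main__":
    print(parse_duration("P1Y2M3DT4H5M6.5S"))
    print(parse_duration("PT30M"))  # minutes, because the pair follows "T"
    print(parse_duration("P30M"))   # months, because no "T" precedes the pair
```

The last two calls show why the separator matters: the same letter M is read as months before the separator and as minutes after it.
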
# | id | text |
---|---|---|
2 | att.citing | provides attributes for specifying the specific part of a bibliographic item being cited. |
70 | att.citing | the element contains a page number or page range. |
86 | att.citing | the element contains a line number or line range. |
# | id | text |
---|---|---|
2 | titleStmt | title statement |
16 | titleStmt | groups information about the title of a work and those responsible for its content. |
# | id | text |
---|---|---|
2 | g | character or glyph |
38 | g | points to a description of the character or glyph intended. |
99 | g | The medieval brevigraph per could similarly be considered as an individual glyph, defined in a |
102 | g | per |
108 | g | The name |
111 | g | gaiji |
112 | g | , which is the Japanese term for a non-standardized character or glyph. |
# | id | text |
---|---|---|
12 | am | contains a sequence of letters or signs present in an abbreviation which are omitted or replaced in the expanded form of the abbreviation. |
# | id | text |
---|---|---|
82 | att.patternReplacement | etc. are references to the corresponding group in the regular expression specified by |
84 | att.patternReplacement | (counting open parentheses, left to right). Processors are expected to replace them with whatever matched the corresponding group in the regular expression. |
86 | att.patternReplacement | If a digit preceded by a dollar sign is needed in the actual replacement pattern (as opposed to being used as a back reference), the dollar sign must be written as |
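
The back-reference behaviour described in the rows above can be sketched as follows. Several things here are assumptions: Python's `re` module stands in for whatever regular expression dialect a real processor uses, only single-digit references are expanded, escaped dollar signs are not handled, and the function name, pattern, and strings are all hypothetical.

```python
import re


def apply_replacement(match_pattern, replacement_pattern, text):
    """Expand $1, $2, ... in replacement_pattern with the groups matched in text.

    Returns None when text does not match match_pattern as a whole.
    """
    match = re.fullmatch(match_pattern, text)
    if match is None:
        return None
    # Substitute each $N with the text captured by group N
    # (an empty string if that group did not take part in the match).
    return re.sub(
        r"\$(\d)",
        lambda ref: match.group(int(ref.group(1))) or "",
        replacement_pattern,
    )


if __name__ == "__main__":
    # Hypothetical pattern: a work name, a full stop, then a line number.
    print(apply_replacement(r"(\w+)\.(\d+)", "#$1-line$2", "hamlet.12"))
```

Groups are numbered by the position of their opening parenthesis, so `$1` picks up the work name and `$2` the line number in this example.
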
# | id | text |
---|---|---|
39 | data.temporal.w3c | If it is likely that the value used is to be compared with another, then a time zone indicator should always be included, and only the dateTime representation should be used. |
# | id | text |
---|---|---|
2 | constraint | constraint rules |
4 | constraint | the formal rules of a constraint |
# | id | text |
---|---|---|
2 | div4 | level-4 text division |
16 | div4 | contains a fourth-level subdivision of the front, body, or back of a text. |
158 | div4 | any sequence of low-level structural elements, possibly grouped into lower subdivisions. |
# | id | text |
---|---|---|
2 | meeting | contains the formalized descriptive title for a meeting or conference, for use in a bibliographic description for an item derived from such a meeting, or as a heading or preamble to publications emanating from it. |
# | id | text |
---|---|---|
4 | extent | describes the approximate size of a text stored on some carrier medium or of some other object, digital or non-digital, specified in any convenient units. |
40 | extent | element may be used to supply normalised or machine-tractable versions of the size or sizes concerned. |
# | id | text |
---|---|---|
2 | specGrp | specification group |
99 | specGrp | A specification group is referenced by means of its |
# | id | text |
---|---|---|
36 | att.identified | : the value of the module attribute (" |
37 | att.identified | ") should correspond to an existing module, via a moduleSpec or moduleRef |
91 | att.identified | supplies a name for the module in which this object is to be declared. |
110 | att.identified | indicates the current status of the object identified with respect to the current version of the TEI Guidelines. |
119 | att.identified | the item is not recommended for use, and may be withdrawn at a future release. |
123 | att.identified | the item is new and still under review. |
127 | att.identified | the item has changed significantly since the preceding version. |
131 | att.identified | the item has not recently changed and is not expected to do so except for correction of any errors. |
# | id | text |
---|---|---|
2 | actor | contains the name of an actor appearing within a cast list. |
60 | actor | This element should be used only to mark the name of the actor as given in the source. Chapter |
# | id | text |
---|---|---|
2 | model.persNamePart | groups elements which form part of a personal name. |
# | id | text |
---|---|---|
20 | data.sex | Values for attributes using this datatype may be locally defined by a project, or may refer to an external standard, such as vCard's sex property |
# | id | text |
---|---|---|
2 | numeric | numeric value |
14 | numeric | represents the value part of a feature-value specification which contains a numeric value or range. |
38 | numeric | supplies a lower bound for the numeric value represented, and also (if |
71 | numeric | supplies an upper bound for the numeric value represented. |
90 | numeric | specifies whether the value represented should be truncated to give an integer value. |
113 | numeric | This represents the numeric value 42. |
148 | numeric | attribute had the value FALSE, this example would represent any of the infinite number of numeric values between 42.45 and 50.0 |
154 | numeric | attribute in the absence of a value for the |
# | id | text |
---|---|---|
12 | said | indicates passages thought or spoken aloud, whether explicitly indicated in the source or not, whether directly or indirectly reported, whether by real people or fictional characters. |
61 | said | The value |
63 | said | indicates the encoded passage was expressed outwardly (whether spoken, signed, sung, screamed, chanted, etc.); the value |
127 | said | The value |
129 | said | indicates the speech or thought is represented directly; the value |
# | id | text |
---|---|---|
2 | geogFeat | geographical feature name |
# | id | text |
---|---|---|
4 | restore | indicates restoration of text to an earlier state by cancellation of an editorial or authorial marking or instruction. |
36 | restore | attribute categorizes the way in which the cancelled intervention has been indicated, for example by means of a marginal note, over-inking, additional markup, etc. |
# | id | text |
---|---|---|
13 | elementSpec | documents the structure, content, and purpose of a single element type. |
69 | elementSpec | specifies a default prefix which will be prepended to all patterns relating to the element, unless otherwise stated. |
# | id | text |
---|---|---|
13 | decoDesc | contains a description of the decoration of a manuscript, either as a sequence of paragraphs, or as a sequence of topically organized |
# | id | text |
---|---|---|
2 | quote | quotation |
14 | quote | contains a phrase or passage attributed by the narrator or author to some agency external to the text. |
61 | quote | If a bibliographic citation is supplied for the source of a quotation, the two may be grouped using the |
# | id | text |
---|---|---|
14 | particDesc | describes the identifiable speakers, voices, or other participants in any kind of text or other persons named or otherwise referred to in a text, edition, or metadata. |
83 | particDesc | This example shows both a very simple person description, and a very detailed one, using some of the more specialized elements from the module for Names and Dates. |
161 | particDesc | May contain a prose description organized as paragraphs, or a structured list of persons and person groups, with an optional formal specification of any relationships amongst them. |
# | id | text |
---|---|---|
2 | model.global.spoken | groups elements which may appear globally within spoken texts. |
# | id | text |
---|---|---|
72 | attDef | should have a closed valList or a datatype |
79 | attDef | It does not make sense to make " |
80 | attDef | " the default value of @ |
97 | attDef | the default value of the @ |
98 | attDef | attribute is not among the closed list of possible values |
108 | attDef | the default value of the @ |
109 | attDef | attribute is not among the closed list of possible values |
181 | attDef | namespace |
193 | attDef | specifies the namespace to which this attribute belongs |
# | id | text |
---|---|---|
4 | additions | contains a description of any significant additions found within a manuscript, such as marginalia or other annotations. |
# | id | text |
---|---|---|
2 | catRef | category reference |
16 | catRef | specifies one or more defined categories within some taxonomy or text typology. |
41 | catRef | identifies the classification scheme within which the set of categories concerned is defined, for example by a |
125 | catRef | The scheme attribute need be supplied only if more than one taxonomy has been declared. |
# | id | text |
---|---|---|
18 | att.rdgPart | witness or witnesses |
28 | att.rdgPart | contains a space-delimited list of one or more sigla indicating the witnesses to this reading beginning or ending at this point. |
# | id | text |
---|---|---|
14 | fw | contains a running head (e.g. a header, footer), catchword, or similar material appearing on the current page. |
38 | fw | classifies the material encoded according to some useful typology. |
57 | fw | a running title at the top of the page |
73 | fw | a running title at the bottom of the page |
89 | fw | page number |
99 | fw | a page number or foliation symbol |
115 | fw | line number |
125 | fw | a line number, either of prose or poetry |
147 | fw | a signature or gathering symbol |
214 | fw | element is intended for cases where the running head changes from page to page, or where details of page layout and the internal structure of the running heads are of paramount importance. |
# | id | text |
---|---|---|
2 | abstract | contains a summary or formal abstract prefixed to an existing source document by the encoder. |
28 | abstract | The abstract for a born digital document should be located within the |
30 | abstract | ; this element is provided for cases where no abstract is available in the original source. |
# | id | text |
---|---|---|
2 | model.dimLike | groups elements which describe a measurement forming part of the physical dimensions of some object. |
# | id | text |
---|---|---|
4 | segmentation | describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc. |
# | id | text |
---|---|---|
2 | data.enumerated | defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities. |
20 | data.enumerated | Attributes using this datatype must contain a single word matching the pattern defined for this datatype: for example it cannot include whitespace but may begin with digits. |
22 | data.enumerated | Typically, the list of documented possibilities will be provided (or exemplified) by a value list in the associated attribute specification, expressed with a |
# | id | text |
---|---|---|
2 | classCode | classification code |
14 | classCode | contains the classification code used for this text in some standard classification system. |
# | id | text |
---|---|---|
2 | default | default feature value |
14 | default | represents the value part of a feature-value specification which contains a defaulted value. |
# | id | text |
---|---|---|
2 | accMat | accompanying material |
14 | accMat | contains details of any significant additional material which may be closely associated with the manuscript being described, such as non-contemporaneous documents or fragments bound in with the manuscript at some earlier historical period. |
# | id | text |
---|---|---|
16 | att.coordinated | indicates the element within a transcription of the text containing at least the start of the writing represented by this zone or surface. |
25 | att.coordinated | gives the x coordinate value for the upper left corner of a rectangular space. |
42 | att.coordinated | gives the y coordinate value for the upper left corner of a rectangular space. |
59 | att.coordinated | gives the x coordinate value for the lower right corner of a rectangular space. |
76 | att.coordinated | gives the y coordinate value for the lower right corner of a rectangular space. |
93 | att.coordinated | identifies a two dimensional area within the bounding box specified by the other attributes by means of a series of pairs of numbers, each of which gives the x,y coordinates of a point on a line enclosing the area. |
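
The relationship between the bounding box and the enclosed area described in the rows above can be checked with a few lines of code. This sketch assumes the attribute names `ulx`, `uly`, `lrx`, `lry`, and `points` (they do not appear in the listing itself), a points value written as whitespace-separated `x,y` pairs, and coordinates that increase to the right and downwards. The function names are hypothetical.

```python
def parse_points(points):
    """Turn a string such as "10,10 50,10 30,40" into a list of (x, y) tuples."""
    return [tuple(float(n) for n in pair.split(",")) for pair in points.split()]


def within_box(points, ulx, uly, lrx, lry):
    """Return True if every point lies inside the rectangle, edges included."""
    return all(
        ulx <= x <= lrx and uly <= y <= lry
        for x, y in parse_points(points)
    )


if __name__ == "__main__":
    print(within_box("10,10 50,10 30,40", 0, 0, 100, 100))
    print(within_box("10,10 150,10", 0, 0, 100, 100))
```
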
# | id | text |
---|---|---|
4 | date | contains a date in any format. |
# | id | text |
---|---|---|
70 | att.dimensions | lines of text |
92 | att.dimensions | characters of text |
125 | att.dimensions | indicates the size of the object concerned using a project-specific vocabulary combining quantity and units in a single string of words. |
144 | att.dimensions | characterizes the precision of the values specified by the other attributes. |
# | id | text |
---|---|---|
17 | model.common | This class defines the set of chunk- and inter-level elements; it is used in many content models, including those for textual divisions. |
# | id | text |
---|---|---|
2 | att.global.responsibility | provides attributes indicating the agency responsible for some aspect of the text, the markup or something asserted by the markup, and the degree of certainty associated with it. |
8 | att.global.responsibility | certainty |
18 | att.global.responsibility | signifies the degree of certainty associated with the intervention or interpretation. |
47 | att.global.responsibility | indicates the agency responsible for the intervention or interpretation, for example an editor or transcriber. |
67 | att.global.responsibility | pointing to a person or organization is likely to be somewhat ambiguous with regard to the nature of the responsibility. For this reason, we recommend that |
79 | att.global.responsibility | or similar element which clarifies the exact role played by the agent. Pointing to multiple |
81 | att.global.responsibility | s allows the encoder to specify clearly each of the roles played in part of a TEI file (creating, transcribing, encoding, editing, proofing etc.). |
# | id | text |
---|---|---|
52 | vocal | The value |
54 | vocal | indicates that the vocal effect is repeated several times rather than just occurring once. |
# | id | text |
---|---|---|
6 | att.datcat | attributes which are used to align XML elements or attributes with the appropriate Data Categories (DCs) defined by the ISO 12620:2009 standard and stored in the Web repository called ISOcat at |
19 | att.datcat | contains a PID (persistent identifier) that aligns the content of the given element or the value of the given attribute with the appropriate simple Data Category (or categories) in ISOcat. |
29 | att.datcat | relates the feature name to the data category "partOfSpeech" and |
31 | att.datcat | the feature value to the data category "commonNoun". Both these data categories reside in the ISOcat DCR at |
42 | att.datcat | ISO 12620:2009 is a standard describing the data model and procedures for a Data Category Registry (DCR). Data categories are defined as elementary descriptors in a linguistic structure. In the DCR data model each data category is assigned a unique Persistent IDentifier (PID), i.e., a URI. Linguistic resources, or preferably their schemas, that make use of data categories from a DCR should refer to them using this PID. For XML-based resources, like TEI documents, ISO 12620:2009 normative Annex A gives a small Data Category Reference XML vocabulary (also available online at |
# | id | text |
---|---|---|
2 | nameLink | name link |
6 | nameLink | contains a connecting phrase or link used within a name but not regarded as part of it, such as |
# | id | text |
---|---|---|
2 | listApp | list of apparatus entries |
6 | listApp | contains a list of apparatus entries. |
31 | listApp | In the following example from the exegetical Yasna, the base text is encoded in the |
# | id | text |
---|---|---|
4 | activity | contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything. |
44 | activity | For more fine-grained description of participant activities during a spoken text, the |
# | id | text |
---|---|---|
2 | div7 | level-7 text division |
16 | div7 | contains the smallest possible subdivision of the front, body or back of a text, larger than a paragraph. |
133 | div7 | any sequence of low-level structural elements, e.g., paragraphs ( |
# | id | text |
---|---|---|
74 | c | element, or a sequence of graphemes to be treated as a single character. The |
79 | c | punctuation |
# | id | text |
---|---|---|
2 | textDesc | text description |
14 | textDesc | provides a description of a text in terms of its situational parameters. |
# | id | text |
---|---|---|
12 | geo | contains any expression of a set of geographic coordinates, representing a point, line, or area on the surface of the earth in some notation. |
67 | geo | element supplied in the TEI header, using the |
69 | geo | attribute. If no such link is made, the assumption is that the content of each |
# | id | text |
---|---|---|
2 | val | value |
# | id | text |
---|---|---|
4 | population | contains information about the population of a place. |
# | id | text |
---|---|---|
41 | ptr | Only one of the attributes @target and @cRef may be supplied on |
# | id | text |
---|---|---|
4 | locusGrp | groups a number of locations which together form a distinct but discontinuous item within a manuscript or manuscript part, according to a specific foliation. |
21 | locusGrp | identifies the foliation scheme in terms of which all the locations contained by the group are specified by pointing to some |
# | id | text |
---|---|---|
56 | ab | element may be used at the encoder's discretion to mark any component-level elements in a text for which no other more specific appropriate markup is defined. |
# | id | text |
---|---|---|
146 | etym | May contain character data mixed with any other elements defined in the dictionary tag set. |
# | id | text |
---|---|---|
2 | model.offsetLike | groups elements which can appear only as part of a place name. |
# | id | text |
---|---|---|
86 | gen | May contain character data and phrase-level elements. Typical content will be |
95 | gen | gram type="gender" |
# | id | text |
---|---|---|
4 | finalRubric | contains the string of words that denotes the end of a text division, often with an assertion as to its author and title, usually set off from the text itself by red ink, by a different size or type of script, or by some other such visual device. |
# | id | text |
---|---|---|
2 | cit | cited quotation |
13 | cit | contains a quotation from some other document, together with a bibliographic reference to its source. In a dictionary it may contain an example text with at least one occurrence of the word form, used in the sense being described, or a translation of the headword, or an example. |
# | id | text |
---|---|---|
2 | textLang | text language |
13 | textLang | describes the languages and writing systems identified within the bibliographic work being described, rather than its description. |
49 | textLang | main language |
60 | textLang | supplies a code which identifies the chief language used in the bibliographic work. |
128 | textLang | This element should not be used to document the languages or writing systems used for the bibliographic or manuscript description itself: as for all other TEI elements, such information should be provided by means of the global |
133 | textLang | language tag |
136 | textLang | . Additional documentation for the language may be provided by a |
138 | textLang | element in the TEI Header. |
# | id | text |
---|---|---|
2 | summary | contains an overview of the available information concerning some aspect of an item (for example, its intellectual content, history, layout, typography etc.) as a complement or alternative to the more detailed information carried by more specific elements. |
# | id | text |
---|---|---|
2 | genName | generational name component |
13 | genName | contains a name component used to distinguish otherwise similar names on the basis of the relative ages or generations of the persons named. |
# | id | text |
---|---|---|
12 | subst | groups one or more deletions with one or more additions when the combination is to be regarded as a single intervention in the text. |
40 | subst | must have at least one child add and at least one child del |
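
The constraint in row 40 can be checked mechanically. The following is a minimal sketch, assuming the element has already been parsed with Python's standard `xml.etree.ElementTree`; the helper names `local_name` and `subst_is_valid` are hypothetical and not part of any TEI tooling.

```python
import xml.etree.ElementTree as ET

def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def subst_is_valid(subst: ET.Element) -> bool:
    """Return True if the element has at least one child add and one child del."""
    child_names = {local_name(child.tag) for child in subst}
    return "add" in child_names and "del" in child_names

# Illustrative fragments (invented for this sketch, not taken from the Guidelines)
good = ET.fromstring("<subst><del>teh</del><add>the</add></subst>")
bad = ET.fromstring("<subst><add>the</add></subst>")
```

Only direct children are inspected, which matches the wording "child add" and "child del" in the row.
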
# | id | text |
---|---|---|
2 | data.key | defines the range of attribute values expressing a coded value by means of an arbitrary identifier, typically taken from a set of externally-defined possibilities. |
20 | data.key | Information about the set of possible values for an attribute using this datatype may (but need not) be documented in the document header. Externally defined constraints, for example that values should be legal keys in an external database system, cannot usually be enforced by a TEI system. Similarly, because the key is externally defined, no constraint other than a requirement that it consist of Unicode characters is possible. |
# | id | text |
---|---|---|
3 | editor | contains a secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution or organization (or of several such) acting as editor, compiler, translator, etc. |
52 | editor | Particularly where cataloguing is likely to be based on the content of the header, it is advisable to use generally recognized authority lists for the exact form of personal names. |
# | id | text |
---|---|---|
4 | person | provides information about an identifiable individual, for example a participant in a language interaction, or a person referred to in a historical source. |
39 | person | specifies a primary role or classification for the person. |
57 | person | Values for this attribute may be locally defined by a project, using arbitrary keywords such as |
62 | person | author |
73 | person | specifies the sex of the person. |
91 | person | Values for this attribute may be locally defined by a project, or may refer to an external standard, such as vCard's sex property |
121 | person | specifies an age group for the person. |
139 | person | Values for this attribute may be locally defined by a project, using arbitrary keywords such as |
250 | person | May contain either a prose description organized as paragraphs, or a sequence of more specific demographic elements drawn from the |
# | id | text |
---|---|---|
4 | classes | specifies all the classes of which the documented element or class is a member or subclass. |
49 | classes | this declaration changes the declaration of the same name in the current definition |
63 | classes | this declaration replaces the declaration of the same name in the current definition |
# | id | text |
---|---|---|
14 | att | contains the name of an attribute appearing within running text. |
39 | att | supplies an identifier for the scheme in which this name is defined. |
56 | att | TEI |
60 | att | text encoding initiative |
70 | att | this attribute is part of the TEI scheme. |
135 | att | the attribute is part of the XHTML language |
137 | att | the attribute is part of the XML language |
211 | att | A namespace prefix may be used in order to specify the scheme as an alternative to specifying it via the scheme attribute: it takes precedence |
# | id | text |
---|---|---|
38 | height | If used to specify the height of a non text-bearing portion of some object, for example a monument, this element conventionally refers to the axis perpendicular to the surface of the earth. |
# | id | text |
---|---|---|
2 | att.canonical | provides attributes which can be used to associate a representation such as a name or title with canonical information about the object being named or referenced. |
8 | att.canonical | provides an externally-defined means of identifying the entity (or entities) being named, using a coded value of some kind. |
32 | att.canonical | The value may be a unique identifier from a database, or any other externally-defined string identifying the referent. |
36 | att.canonical | attribute, since its form will depend entirely on practice within a given project. For the same reason, this attribute is not recommended in data interchange, since there is no way of ensuring that the values used by one project are distinct from those used by another. In such a situation, a preferable approach for magic tokens which follows standard practice on the Web is to use a |
38 | att.canonical | attribute whose value is a tag URI as defined in |
53 | att.canonical | provides an explicit means of locating a full definition for the entity being named by means of one or more URIs. |
67 | att.canonical | The value must point directly to one or more XML elements or other resources by means of one or more URIs, separated by whitespace. If more than one is supplied the implication is that the name identifies several distinct entities. |
# | id | text |
---|---|---|
2 | data.text | defines the range of attribute values used to express some kind of identifying string as a single sequence of Unicode characters possibly including whitespace. |
10 | data.text | Attributes using this datatype must contain a single |
12 | data.text | in which whitespace and other punctuation characters are permitted. |
# | id | text |
---|---|---|
14 | edition | describes the particularities of one edition of a text. |
# | id | text |
---|---|---|
4 | writing | contains a passage of written text revealed to participants in the course of a spoken text. |
31 | writing | indicates whether the writing is revealed all at once or gradually. |
49 | writing | The value |
51 | writing | indicates the writing is revealed gradually; the value |
53 | writing | that the writing is revealed all at once. |
100 | writing | element will usually be short and most simply transcribed as a character string; the content model also allows a sequence of paragraphs and paragraph-level elements, in case the writing has enough internal structure to warrant such markup. In either case the usual phrase-level tags for written text are available. |
# | id | text |
---|---|---|
2 | teiCorpus | contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text. |
45 | teiCorpus | The version of the TEI scheme |
144 | teiCorpus | Must contain one TEI header for the corpus, and a series of |
148 | teiCorpus | This element is mandatory when applicable. |
# | id | text |
---|---|---|
2 | addName | additional name |
14 | addName | contains an additional name component, such as a nickname, epithet, or alias, or any other descriptive phrase used within a personal name. |
# | id | text |
---|---|---|
12 | geoDecl | documents the notation and the datum used for geographic coordinates expressed as content of the |
44 | geoDecl | supplies a commonly used code name for the datum employed. |
97 | geoDecl | the values supplied are geospatial entity object codes, based on |
119 | geoDecl | the value supplied is to be interpreted as a British National Grid Reference. |
143 | geoDecl | the value supplied is to be interpreted as latitude followed by longitude according to the European Datum coordinate system. |
# | id | text |
---|---|---|
4 | lacunaStart | indicates the beginning of a lacuna in the text of a mostly complete textual witness. |
# | id | text |
---|---|---|
2 | model.addressLike | groups elements used to represent a postal or email address. |
# | id | text |
---|---|---|
2 | witEnd | fragmented witness end |
13 | witEnd | indicates the end, or suspension, of the text of a fragmentary witness. |
# | id | text |
---|---|---|
2 | castList | cast list |
14 | castList | contains a single cast list or dramatis personae. |
# | id | text |
---|---|---|
4 | equipment | provides technical details of the equipment and media used for an audio or video recording used as the source for a spoken text. |
# | id | text |
---|---|---|
7 | speaker | contains a specialized form of heading or label, giving the name of one or more speakers in a dramatic text or fragment. |
# | id | text |
---|---|---|
4 | sex | specifies the sex of a person. |
29 | sex | supplies a coded value for sex |
35 | sex | Values for this attribute may be locally defined by a project, or may refer to an external standard, such as vCard's sex property |
114 | sex | As with other culturally-constructed traits such as age, the way in which this concept is described in different cultural contexts may vary. The normalizing attributes are provided only as an optional means of simplifying that variety to one or more external standards for purposes of interoperability, or project-internal taxonomies for consistency, and should not be used where that is inappropriate or unhelpful. The content of the element may be used to describe the intended concept in more detail, using plain text. |
# | id | text |
---|---|---|
2 | sense | groups together all information relating to one word sense in a dictionary entry, for example definitions, examples, and translation equivalents. |
35 | sense | gives the nesting depth of this sense. |
111 | sense | May contain character data mixed with any other elements defined in the dictionary tag set. |
# | id | text |
---|---|---|
2 | layoutDesc | layout description |
13 | layoutDesc | collects the set of layout descriptions applicable to a manuscript. |
# | id | text |
---|---|---|
2 | localName | locally-defined property name |
14 | localName | contains a locally defined name for some property. |
52 | localName | No definitive list of local names is proposed. However, the name |
54 | localName | is recommended as a means of naming the property identifying the recommended character entity name for this character or glyph. |
# | id | text |
---|---|---|
14 | gram | within an entry in a dictionary or a terminological data file, contains grammatical information relating to a term, word, or form. |
38 | gram | classifies the grammatical information given according to some convenient typology—in the case of terminological information, preferably the dictionary of data element types specified in |
79 | gram | any of the word classes to which a word may be assigned in a given language, based on form, meaning, or a combination of features, e.g. noun, verb, adjective, etc. |
121 | gram | number |
180 | gram | A much fuller list of values for the |
182 | gram | attribute may be generated from the data category registry accessible from |
# | id | text |
---|---|---|
2 | interp | interpretation |
15 | interp | summarizes a specific interpretative annotation which can be linked to a span of text. |
67 | interp | attribute. This permits the encoder to explicitly associate the interpretation represented by the content of an |
77 | interp | attribute which points to one or more textual elements to which the analysis represented by the content of the |
# | id | text |
---|---|---|
2 | custEvent | custodial event |
13 | custEvent | describes a single event during the custodial history of a manuscript. |
# | id | text |
---|---|---|
4 | repository | contains the name of a repository within which manuscripts are stored, possibly forming part of an institution. |
# | id | text |
---|---|---|
4 | state | contains a description of some status or quality attributed to a person, place, or organization often at some specific time or for a specific date range. |
132 | state | the more general purpose element |
134 | state | should be used even for unchanging characteristics. If you wish to distinguish between characteristics that are generally perceived to be time-bound states and those assumed to be fixed traits, then |
138 | state | element encodes characteristics which are sometimes assumed to change, often at specific times or over a date range, whereas the |
# | id | text |
---|---|---|
2 | listBibl | citation list |
14 | listBibl | contains a list of bibliographic citations of any kind. |
# | id | text |
---|---|---|
4 | title | contains a title for any kind of work. |
50 | title | analytic |
60 | title | the title applies to an analytic item, such as an article, poem, or other work published as part of a larger item. |
86 | title | the title applies to a monograph such as a book or other item considered to be a distinct publication, including single volumes of multi-volume works |
110 | title | the title applies to any serial or periodical publication such as a journal, magazine, or newspaper |
126 | title | series |
136 | title | the title applies to a series of otherwise distinct publications such as a collection |
156 | title | the title applies to any unpublished material (including theses and dissertations unless published by a commercial press) |
173 | title | The level of a title is sometimes implied by its context: for example, a title appearing directly within an |
182 | title | s |
185 | title | attribute is not required in contexts where its value can be unambiguously inferred. Where it is supplied in such contexts, its value should not contradict the value implied by its parent element. |
249 | title | classifies the title according to some convenient typology. |
268 | title | main title |
294 | title | subtitle, title of part |
310 | title | alternate |
320 | title | alternate title, often in another language, by which the work is also known |
336 | title | abbreviated form of title |
362 | title | descriptive paraphrase of the work functioning as a title |
483 | title | may be used to indicate the canonical form for the title; the former, by supplying (for example) the identifier of a record in some external library system; the latter by pointing to an XML element somewhere containing the canonical form of the title. |
# | id | text |
---|---|---|
4 | witness | contains either a description of a single witness referred to within the critical apparatus, or a list of witnesses which is to be referred to by a single sigil. |
39 | witness | The content of the |
41 | witness | element may give bibliographic information about the witness or witness group, or it may be empty. |
# | id | text |
---|---|---|
2 | att.ptrLike.form | form pointers |
32 | att.ptrLike.form | identifies the orthographic form or pronunciation referred to. |
# | id | text |
---|---|---|
21 | prefixDef | supplies a name which functions as the prefix for an abbreviated pointing scheme such as a private URI scheme. The prefix constitutes the text preceding the first colon. |
39 | prefixDef | The abbreviated pointer may be dereferenced to produce either an absolute or a relative URI reference. In the latter case it is combined with the value of |
41 | prefixDef | in force at the place where the pointing attribute occurs to form an absolute URI in the usual manner as prescribed by |
# | id | text |
---|---|---|
4 | milestone | marks a boundary point separating any kind of section of a text, typically but not necessarily indicating a point at which some part of a standard reference system changes, where the change is not represented by a structural element. |
49 | milestone | attribute indicates the new number or other value for the unit which changes at this milestone. The special value |
53 | milestone | The order in which milestone elements are given at a given point is not normally significant. |
# | id | text |
---|---|---|
4 | gloss | identifies a phrase or word used to provide a gloss or definition for some other word or phrase. |
# | id | text |
---|---|---|
156 | specDesc | The description is usually displayed as a label and an item. |
157 | specDesc | The list of attributes may include some which are inherited by virtue of an element's class membership; descriptions for such attributes may also be retrieved using another |
159 | specDesc | , this time pointing at the relevant class. |
# | id | text |
---|---|---|
14 | cb | marks the beginning of a new column of a text on a multi-column page. |
79 | cb | attribute indicates the number or other value associated with the column which follows the point of insertion of this |
81 | cb | element. Encoders should adopt a clear and consistent policy as to whether the numbers associated with column breaks relate to the physical sequence number of the column in the whole text, or whether columns are numbered within the page. The |
83 | cb | element is placed at the head of the column to which it refers. |
# | id | text |
---|---|---|
4 | supplied | signifies text supplied by the transcriber or editor for any reason; for example because the original cannot be read due to physical damage, or because of an obvious omission by the author or scribe. |
28 | supplied | one or more words indicating why the text has had to be supplied, e.g. |
# | id | text |
---|---|---|
2 | model.frontPart | groups elements which appear at the level of divisions within front or back matter. |
# | id | text |
---|---|---|
19 | data.interval | Any value greater than zero or any one of the values |
# | id | text |
---|---|---|
2 | corr | correction |
12 | corr | contains the correct form of a passage apparently erroneous in the copy text. |
37 | corr | If all that is desired is to call attention to the fact that the copy text has been corrected, |
# | id | text |
---|---|---|
2 | rdgGrp | reading group |
# | id | text |
---|---|---|
2 | model.availabilityPart | groups elements such as licences and paragraphs of text which may appear as part of an availability statement |
# | id | text |
---|---|---|
2 | div6 | level-6 text division |
16 | div6 | contains a sixth-level subdivision of the front, body, or back of a text. |
155 | div6 | any sequence of low-level structural elements, possibly grouped into lower subdivisions. |
# | id | text |
---|---|---|
14 | relation | describes any kind of relationship or linkage amongst a specified group of places, events, persons, objects or other items. |
43 | relation | One of the attributes 'name', 'ref' or 'key' must be supplied |
49 | relation | Only one of the attributes @active and @mutual may be supplied |
55 | relation | the attribute 'passive' may be supplied only if the attribute 'active' is supplied |
62 | relation | supplies a name for the kind of relationship of which this is an instance. |
97 | relation | supplies a list of participants amongst all of whom the relationship holds equally. |
148 | relation | This indicates that the person with identifier p1 is supervisor of persons p2, p3, and p4. |
183 | relation | This example records a relationship, defined by the SAWS ontology, between a passage of text identified by a CTS URN, and a variant passage of text in the Perseus Digital Library, and assigns the identification of the relationship to a particular editor (all using resolvable URIs). |
193 | relation | may be supplied only if the attribute |
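
The three attribute constraints quoted in rows 43, 49 and 55 can be expressed as a small check. This is a sketch only, assuming the attributes of a relation element are available as a plain dictionary; the function name `relation_errors` and the message strings are hypothetical.

```python
def relation_errors(attrs: dict) -> list:
    """Return a list of constraint violations for a relation's attributes."""
    errors = []
    # Row 43: one of 'name', 'ref' or 'key' must be supplied
    if not any(name in attrs for name in ("name", "ref", "key")):
        errors.append("one of name, ref or key must be supplied")
    # Row 49: only one of @active and @mutual may be supplied
    if "active" in attrs and "mutual" in attrs:
        errors.append("only one of active and mutual may be supplied")
    # Row 55: 'passive' may be supplied only if 'active' is supplied
    if "passive" in attrs and "active" not in attrs:
        errors.append("passive requires active")
    return errors

# Modelled on row 148: p1 is supervisor of p2, p3 and p4
supervisor = {"name": "supervisor", "active": "#p1", "passive": "#p2 #p3 #p4"}
```

Calling `relation_errors(supervisor)` yields an empty list; an attribute set containing only `passive` violates two of the three constraints.
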
# | id | text |
---|---|---|
15 | egXML | element itself functions as the root element. |
59 | egXML | the example is intended to be fully valid, assuming that its root element, or a provided root element, could have been used as a possible root element in the schema concerned. |
63 | egXML | the example could be transformed into a valid document by inserting any number of valid attributes and child elements anywhere within it; or it is valid against a version of the schema concerned in which the provision of character data, list, element, or attribute values has been made optional. |
131 | egXML | In the source of the TEI Guidelines, this element declares itself and its content as belonging to the namespace |
133 | egXML | . This enables the content of the element to be validated independently against the TEI scheme. Where this element is used outside this context, a different namespace or none at all may be preferable. The content must however be a well-formed XML fragment or document: where this is not the case, the more general |
135 | egXML | element should be used in preference. In a TEI context use of the |
137 | egXML | attribute in the TEI namespace, as opposed to the TEI Examples namespace, enables recording of rendition information. |
# | id | text |
---|---|---|
20 | att.measurement | indicates the units used for the measurement, usually using the standard symbol for the desired units. |
109 | att.measurement | SI base unit of time |
165 | att.measurement | SI unit of pressure or stress |
302 | att.measurement | 10⁻¹⁰ m |
490 | att.measurement | If the measurement being represented is not expressed in a particular unit, but rather is a number of discrete items, the unit |
496 | att.measurement | Wherever appropriate, a recognized SI unit name should be used (see further |
498 | att.measurement | ). The list above is indicative rather than exhaustive. |
543 | att.measurement | specifies the number of the specified units that comprise the measurement |
582 | att.measurement | In general, when the commodity is made of discrete entities, the plural form should be used, even when the measurement is of only one of them. |
# | id | text |
---|---|---|
2 | refState | reference state |
14 | refState | specifies one component of a canonical reference defined by the milestone method. |
55 | refState | When constructing a reference, if the reference component found is of numeric type, the length is made up by inserting leading zeros; if it is not, by inserting trailing blanks. In either case, reference components are truncated if necessary at the right hand side. |
57 | refState | When seeking a reference, the length indicates the number of characters which should be compared. Values longer than this will be regarded as matching, if they start correctly. If no value is provided, the length is unlimited and goes to the next delimiter or to the end of the value. |
90 | refState | supplies a delimiting string following the reference component. |
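
The padding and comparison rules in rows 55 and 57 can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: "numeric type" is taken to mean a string consisting only of digits, the delimiter handling mentioned in row 57 is left out, and the names `pad_component` and `component_matches` are hypothetical.

```python
from typing import Optional

def pad_component(value: str, length: int) -> str:
    """Pad a reference component to a fixed length, as described in row 55."""
    if value.isdigit():
        padded = value.zfill(length)   # numeric: insert leading zeros
    else:
        padded = value.ljust(length)   # otherwise: insert trailing blanks
    return padded[:length]             # truncate at the right-hand side

def component_matches(candidate: str, sought: str,
                      length: Optional[int] = None) -> bool:
    """Compare only the first `length` characters, as described in row 57."""
    if length is None:
        # Assumption: with no length given, the whole value is compared
        return candidate == sought
    return candidate[:length] == sought[:length]
```

For example, `pad_component("7", 3)` gives `"007"`, and `component_matches("0071", "007", 3)` is true because the longer value starts correctly.
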
# | id | text |
---|---|---|
27 | att.scoping | which identifies a set of nodes, selected within the context identified by the |
29 | att.scoping | attribute if this is supplied, or within the context of the element bearing this attribute if it is not. |
42 | att.scoping | The expression of certainty applies to the nodeset identified by the value of the |
44 | att.scoping | attribute, possibly modified additionally by the value of the |
46 | att.scoping | attribute. If neither attribute is present, the expression of certainty applies to the context of the |
50 | att.scoping | Note that the value of the |
# | id | text |
---|---|---|
2 | data.xTruthValue | extended truth value |
7 | data.xTruthValue | defines the range of attribute values used to express a truth value which may be unknown. |
31 | data.xTruthValue | In cases where uncertainty is inappropriate, use the datatype |
# | id | text |
---|---|---|
2 | nym | canonical name |
14 | nym | contains the definition for a canonical name or name component of any kind. |
# | id | text |
---|---|---|
13 | dictScrap | encloses a part of a dictionary entry in which other phrase-level dictionary elements are freely combined. |
102 | dictScrap | This element is used to mark part of a dictionary entry in which lower level dictionary elements appear, but which does not itself form an identifiable structural unit. |
# | id | text |
---|---|---|
2 | model.catDescPart | groups component elements of the TEI header Category Description. |
# | id | text |
---|---|---|
2 | string | string value |
14 | string | represents the value part of a feature-value specification which contains a string. |
# | id | text |
---|---|---|
20 | model.nameLike.agent | This class is used in the content model of elements which reference names of people or organizations. |
# | id | text |
---|---|---|
46 | data.temporal.iso | If it is likely that the value used is to be compared with another, then a time zone indicator should always be included, and only the dateTime representation should be used. |
# | id | text |
---|---|---|
3 | model.textDescPart | groups elements used to categorize a text for example in terms of its situational parameters. |
# | id | text |
---|---|---|
2 | model.castItemPart | groups component elements of an entry in a cast list, such as dramatic role or actor's name. |
# | id | text |
---|---|---|
12 | timeline | provides a set of ordered points in time which can be linked to elements of a spoken text to create a temporal alignment of that text. |
37 | timeline | designates the origin of the timeline, i.e. the time at which it begins. |
55 | timeline | If this attribute is not supplied, the implication is that the time of origin is not known. If it is supplied, it must point either to one of the |
68 | timeline | specifies the unit of time corresponding to the |
70 | timeline | value of the timeline or of its constituent points in time. |
151 | timeline | specifies a time interval either as a positive integral value or using one of a set of predefined codes. |
169 | timeline | The value |
171 | timeline | indicates uncertainty about all the intervals in the timeline; the value |
173 | timeline | indicates that all the intervals are evenly spaced, but the size of the intervals is not known; numeric values indicate evenly spaced values of the size specified. If individual points in time in the timeline are given different values for the |
175 | timeline | attribute, those values locally override the value given in the timeline. |
# | id | text |
---|---|---|
2 | model.pPart.edit | groups phrase-level elements for simple editorial correction and transcription. |
# | id | text |
---|---|---|
2 | lg | line group |
14 | lg | contains one or more verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc. |
70 | lg | An lg element must contain at least one child l, lg or gap element. |
157 | lg | contains verse lines or nested line groups only, possibly prefixed by a heading. |
# | id | text |
---|---|---|
4 | model.publicationStmtPart.detail | element of the TEI header. |
# | id | text |
---|---|---|
2 | macroRef | identifies the datatype of an attribute value, either by referencing an item in an externally defined datatype library, or by pointing to a TEI-defined data specification |
16 | macroRef | the identifier used for this datatype specification |
23 | macroRef | the name of a datatype in the list provided by |
32 | macroRef | a pointer to a datatype defined in some datatype library |
40 | macroRef | supplies a string representing a regular expression providing additional constraints on the strings used to represent values of this datatype |
# | id | text |
---|---|---|
2 | headLabel | heading for list labels |
14 | headLabel | contains the heading for the label or term column in a glossary list or similar structured list. |
69 | headLabel | element may appear only if each item in the list is preceded by a |
# | id | text |
---|---|---|
21 | model.segLike | The principles on which segmentation is carried out, and any special codes or attribute values used, should be defined explicitly in the |
25 | model.segLike | within the associated TEI header. |
# | id | text |
---|---|---|
20 | att.cReferencing | element in the TEI header |
50 | att.cReferencing | The value of |
52 | att.cReferencing | should be constructed so that when the algorithm for the resolution of canonical references (described in section |
# | id | text |
---|---|---|
14 | alt | identifies an alternation or a set of choices among elements or passages. |
44 | alt | states whether the alternations gathered in this collection are exclusive or inclusive. |
202 | alt | , the sum of weights must be in the range from 0 to the number of alternants. |
# | id | text |
---|---|---|
2 | origPlace | origin place |
13 | origPlace | contains any form of place name, used to identify the place of origin for a manuscript or manuscript part. |
60 | origPlace | origin |
61 | origPlace | , for example original place of publication, as opposed to original place of printing. |
# | id | text |
---|---|---|
2 | custodialHist | custodial history |
13 | custodialHist | contains a description of a manuscript's custodial history, either as running prose or as a series of dated custodial events. |
# | id | text |
---|---|---|
6 | att.sortable | supplies the sort key for this element in an index, list or group which contains it. |
24 | att.sortable | The sort key is used to determine the sequence and grouping of entries in an index. It provides a sequence of characters which, when sorted with the other values, will produce the desired order; specifics of sort key construction are application-dependent. |
26 | att.sortable | Dictionary order often differs from the collation sequence of machine-readable character sets; in English-language dictionaries, an entry for |
40 | att.sortable | may all appear in numeric order |
46 | att.sortable | . The sort key is required if the orthography of the dictionary entry does not suffice to determine its location. |
# | id | text |
---|---|---|
2 | att.global | provides attributes common to all elements in the TEI encoding scheme. |
83 | att.global | number |
93 | att.global | gives a number (or other label) for an element, which is not necessarily unique within the document. |
111 | att.global | The value of this attribute is always understood to be a single token, even if it contains space or other punctuation characters, and need not be composed of numbers only. It is typically used to specify the numbering of chapters, sections, list items, etc.; it may also be used in the specification of a standard reference system for the text. |
134 | att.global | language |
144 | att.global | indicates the language of the element content using a |
145 | att.global | tag |
189 | att.global | The xml:lang value will be inherited from the immediately enclosing element, or from its parent, and so on up the document hierarchy. It is generally good practice to specify xml:lang at the highest appropriate level, bearing in mind that a different default may be needed for the teiHeader from that needed for the associated resource element or elements, and that a single TEI document may contain texts in many languages. |
191 | att.global | The authoritative list of registered language subtags is maintained by IANA and is available at |
192 | att.global | . For a good general overview of the construction of language tags, see |
196 | att.global | The value used must conform with BCP 47. If the value is a private use code (i.e., starts with |
202 | att.global | element with a matching value for its |
204 | att.global | attribute should be supplied in the TEI header to document this value. Such documentation may also optionally be supplied for non-private-use codes, though these must remain consistent with their |
357 | att.global | signals an intention about how white space should be managed by applications. |
372 | att.global | signals that the application's default white-space processing modes are acceptable |
376 | att.global | indicates the intent that applications preserve all white space |
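
The att.global rows above state that an xml:lang value is inherited from the immediately enclosing element and so on up the hierarchy. The sketch below illustrates that lookup with Python's standard `xml.etree.ElementTree`; the helper name `effective_languages` and the simplified, namespace-free sample document are assumptions made for illustration, not part of the Guidelines.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(root):
    """Map every element to the xml:lang value in force for it.

    An element without its own xml:lang inherits the value of its
    nearest ancestor that has one (None if no ancestor declares it).
    """
    result = {}

    def walk(element, inherited):
        lang = element.get(XML_LANG, inherited)
        result[element] = lang
        for child in element:
            walk(child, lang)

    walk(root, None)
    return result


# Hypothetical miniature document: a header in English, a text in
# Middle English, and one note in Latin.
SAMPLE = """<TEI xml:lang="en">
  <teiHeader><title>Glossary</title></teiHeader>
  <text xml:lang="enm">
    <p>Whan that Aprille</p>
    <note xml:lang="la">nota bene</note>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)
by_tag = {el.tag: lang for el, lang in effective_languages(root).items()}
```

Declaring the language once on the outermost element and overriding it only where needed mirrors the practice the rows recommend.
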
# | id | text |
---|---|---|
2 | listTranspose | supplies a list of transpositions, each of which is indicated at some point in a document typically by means of metamarks. |
23 | listTranspose | This example might be used for a source document which indicates in some way that the elements identified by |
25 | listTranspose | and code |
# | id | text |
---|---|---|
12 | ident | contains an identifier or name for an object of some kind in a formal language. |
# | id | text |
---|---|---|
2 | surface | defines a written surface as a two-dimensional coordinate space, optionally grouping one or more graphic representations of that space, zones of interest within that space, and transcriptions of the writing within them. |
47 | surface | describes the method by which this surface is or was connected to the main surface |
54 | surface | glued in place |
58 | surface | pinned or stapled in place |
62 | surface | sewn in place |
68 | surface | indicates whether the surface is attached and folded in such a way as to provide two writing surfaces |
87 | surface | element represents any two-dimensional space on some physical surface forming part of the source material, such as a piece of paper, a face of a monument, a billboard, a scroll, a leaf etc. |
89 | surface | The coordinate space defined by this element may be thought of as a grid |
101 | surface | element may contain graphic representations or transcriptions of written zones, or both. The coordinate values used by every |
# | id | text |
---|---|---|
4 | camera | describes a particular camera angle or viewpoint in a screen play. |
# | id | text |
---|---|---|
4 | superEntry | groups a sequence of entries within any kind of lexical resource, such as a dictionary or lexicon which function as a single unit, for example a set of homographs. |
# | id | text |
---|---|---|
2 | lang | language name |
14 | lang | contains the name of a language mentioned in etymological or other linguistic discussion. |
# | id | text |
---|---|---|
2 | imprimatur | contains a formal statement authorizing the publication of a work, sometimes required to appear on a title page or its verso. |
# | id | text |
---|---|---|
2 | listState | list of states and/or traits |
4 | listState | contains a list of various kinds of characteristics of people, places, and organizations. |
30 | listState | attribute may be used to distinguish lists of characteristics of a particular type if convenient. |
# | id | text |
---|---|---|
2 | listPerson | list of persons |
13 | listPerson | contains a list of descriptions, each of which provides information about an identifiable person or a group of people, for example the participants in a language interaction, or the people referred to in a historical source. |
79 | listPerson | The type attribute may be used to distinguish lists of people of a particular type if convenient. |
# | id | text |
---|---|---|
58 | objectType | attribute may be used to point to one or more items within a taxonomy of types of object, defined either internally or externally. |
# | id | text |
---|---|---|
13 | msPart | contains information about an originally distinct manuscript or part of a manuscript, now forming part of a composite manuscript. |
70 | msPart | children if needed) should be used instead of an |
77 | msPart | WARNING: use of deprecated method — the use of the altIdentifier element as a direct child of the msPart element will be removed from the TEI on 2016-09-09 |
137 | msPart | As this last example shows, for compatibility reasons the identifier of a manuscript part may be supplied as a simple |
# | id | text |
---|---|---|
58 | memberOf | add |
92 | memberOf | supplies the maximum number of times the element can occur in elements which use this model class in their content model |
99 | memberOf | supplies the minimum number of times the element must occur in elements which use this model class in their content model |
111 | memberOf | This element will appear in any content model which references |
137 | memberOf | Elements or classes which are members of multiple (unrelated) classes will have more than one |
141 | memberOf | element. If an element is a member of a class C1, which is itself a subclass of a class C2, there is no need to state this, other than in the documentation for class C1. |
143 | memberOf | Any additional comment or explanation of the class membership may be provided as content for this element. |
# | id | text |
---|---|---|
2 | series | series information |
14 | series | contains information about the series in which a book or other bibliographic item has appeared. |
# | id | text |
---|---|---|
2 | model.emphLike | groups phrase-level elements which are typographically distinct and to which a specific function can be attributed. |
# | id | text |
---|---|---|
4 | derivation | describes the nature and extent of originality of this text. |
27 | derivation | categorizes the derivation of the text. |
46 | derivation | text is original |
62 | derivation | text is a revision of some other text |
78 | derivation | text is a translation of some other text |
94 | derivation | text is an abridged version of some other text |
110 | derivation | text is plagiarized from some other text |
126 | derivation | text has no obvious source but is one of a number derived from some common ancestor |
160 | derivation | For derivative texts, details of the ancestor may be included in the source description. |
# | id | text |
---|---|---|
52 | kinesic | The value |
54 | kinesic | indicates that the kinesic is repeated several times rather than occurring only once. |
# | id | text |
---|---|---|
2 | att.metrical | defines a set of attributes which certain elements may use to represent metrical information. |
46 | att.metrical | The pattern may be specified by means of either a standard term for the kind of metrical unit (e.g. |
99 | att.metrical | The pattern may be specified by means of either a standard term for the kind of metrical unit (e.g. |
128 | att.metrical | rhyme scheme |
138 | att.metrical | specifies the rhyme scheme applicable to a group of verse lines. |
156 | att.metrical | By default, the rhyme scheme is expressed as a string of alphabetic characters each corresponding with a rhyming line. Any non-rhyming lines should be represented by a hyphen or an X. Alternative notations may be defined as for |
160 | att.metrical | element in the TEI header. |
162 | att.metrical | When the default notation is used, it does not make sense to specify this attribute on any unit smaller than a line. Nor does the default notation provide any way to record internal rhyme, or to specify non-conventional rhyming practice. These extensions would require user-defined alternative notations. |
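
The default rhyme-scheme notation described in the att.metrical rows (one alphabetic character per line, with a hyphen or an X for a non-rhyming line) can be read mechanically. The following Python sketch groups line numbers by rhyme letter; the function name `rhyme_groups` is hypothetical, and treating only `-` and uppercase `X` as non-rhyming markers is an assumption drawn from the wording of the rows.

```python
from collections import defaultdict

# Markers for non-rhyming lines, as named in the rows above.
NON_RHYMING = {"-", "X"}


def rhyme_groups(scheme):
    """Group 1-based line numbers by their rhyme letter.

    Each character of `scheme` stands for one verse line. Lines marked
    as non-rhyming are left out of the result.
    """
    groups = defaultdict(list)
    for line_number, symbol in enumerate(scheme, start=1):
        if symbol in NON_RHYMING:
            continue
        if not symbol.isalpha():
            raise ValueError(f"unexpected symbol {symbol!r} at line {line_number}")
        groups[symbol].append(line_number)
    return dict(groups)


quatrain = rhyme_groups("abab")
mixed = rhyme_groups("aXbb-a")
```

Because each character maps to a whole line, this notation cannot express internal rhyme, which matches the limitation noted in the rows.
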
# | id | text |
---|---|---|
4 | label | contains any label or heading used to identify part of a text, typically but not exclusively in a list or glossary. |
28 | label | Labels are commonly used for the headwords in glossary lists; note the use of the global |
30 | label | attribute to set the default language of the glossary list to Middle English, and identify the glosses and headings as modern English or Latin: |
296 | label | Labels may also be used to record explicitly the numbers or letters which mark list items in ordered lists, as in this extract from Gibbon's |
315 | label | Labels may also be used for other structured list items, as in this extract from the journal of Edward Gibbon: |
343 | label | rather than as its sibling. Though syntactically valid, this usage is not recommended TEI practice. |
347 | label | Labels may also be used to represent a label or heading attached to a paragraph or sequence of paragraphs not treated as a structural division, or to a group of verse lines. Note that, in this case, the |
373 | label | In this example the text of the label appears in the right hand margin of the original source, next to the paragraph it describes, but approximately in the middle of it. |
# | id | text |
---|---|---|
2 | div2 | level-2 text division |
16 | div2 | contains a second-level subdivision of the front, body, or back of a text. |
195 | div2 | any sequence of low-level structural elements, possibly grouped into lower subdivisions. |
# | id | text |
---|---|---|
58 | msIdentifier | An msIdentifier must contain either a repository or location of some type, or a manuscript name |
# | id | text |
---|---|---|
2 | model.divTopPart | groups elements which can occur only at the beginning of a text division. |
# | id | text |
---|---|---|
4 | birth | contains information about a person's birth, such as its date and place. |
# | id | text |
---|---|---|
2 | vLabel | value label |
14 | vLabel | represents the value part of a feature-value specification which appears at more than one point in a feature structure. |
39 | vLabel | supplies a name identifying the sharing point. |
# | id | text |
---|---|---|
46 | fsDecl | gives a name for the type of feature structure being declared. |
65 | fsDecl | gives the name of one or more typed feature structures from which this type inherits feature specifications and constraints; if this type includes a feature specification with the same name as that of any of those specified by this attribute, or if more than one specification of the same name is inherited, then the set of possible values is defined by unification. Similarly, the set of constraints applicable is derived by combining those specified explicitly within this element with those implied by the |
69 | fsDecl | attribute is specified, no feature specification or constraint is inherited. |
113 | fsDecl | The process of combining constraints may result in a contradiction, for example if two specifications for the same feature specify disjoint ranges of values, and at least one such specification is mandatory. In such a case, there is no valid representative for the type being defined. |
# | id | text |
---|---|---|
2 | att.datable.w3c | provides attributes for normalization of elements that contain datable events conforming to the W3C |
21 | att.datable.w3c | supplies the value of the date or time in a standard form, e.g. yyyy-mm-dd. |
37 | att.datable.w3c | Examples of W3C date, time, and date & time formats. |
133 | att.datable.w3c | specifies the earliest possible date for the event in standard form, e.g. yyyy-mm-dd. |
152 | att.datable.w3c | specifies the latest possible date for the event in standard form, e.g. yyyy-mm-dd. |
216 | att.datable.w3c | The value of these attributes should be a normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by |
220 | att.datable.w3c | The most commonly-encountered format for the date portion of a temporal attribute is |
232 | att.datable.w3c | may also be used. For the time part, the form |
236 | att.datable.w3c | Note that this format does not currently permit use of the value |
238 | att.datable.w3c | to represent the year 1 BCE; instead the value |
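
The att.datable.w3c rows above describe a normalized date value together with earliest-possible and latest-possible bounds, all in a standard form such as yyyy-mm-dd. As a rough illustration, the Python sketch below checks whether a candidate date falls inside such a window. The function and parameter names (`check_date_window`, `earliest`, `latest`) are hypothetical stand-ins, and only the full yyyy-mm-dd form is handled, not the other W3C formats the rows mention.

```python
from datetime import date


def check_date_window(when, earliest=None, latest=None):
    """Return True if `when` lies within the optional bounds (inclusive).

    All arguments are strings in yyyy-mm-dd form; a malformed or
    impossible date raises ValueError from date.fromisoformat.
    """
    value = date.fromisoformat(when)
    if earliest is not None and value < date.fromisoformat(earliest):
        return False
    if latest is not None and value > date.fromisoformat(latest):
        return False
    return True


inside = check_date_window("1850-06-15", earliest="1850-01-01", latest="1850-12-31")
too_early = check_date_window("1849-12-31", earliest="1850-01-01")
```
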
# | id | text |
---|---|---|
2 | model.placeNamePart | groups elements which form part of a place name. |
# | id | text |
---|---|---|
4 | stamp | contains a word or phrase describing a stamp or similar device. |
# | id | text |
---|---|---|
3 | locale | contains a brief informal description of the kind of place concerned, for example: a room, a restaurant, a park bench, etc. |
# | id | text |
---|---|---|
4 | country | contains the name of a geo-political unit, such as a nation, country, colony, or commonwealth, larger than or administratively superior to a region and smaller than a bloc. |
47 | country | The recommended source for codes to represent coded country names is ISO 3166. |
# | id | text |
---|---|---|
2 | analytic | analytic level |
14 | analytic | contains bibliographic elements describing an item (e.g. an article or poem) published within a monograph or journal and not as an independent publication. |
77 | analytic | , where its use is mandatory for the description of an analytic level bibliographic item. |
# | id | text |
---|---|---|
2 | headItem | heading for list items |
14 | headItem | contains the heading for the item or gloss column in a glossary list or similar structured list. |
88 | headItem | element may appear only if each item in the list is preceded by a |
# | id | text |
---|---|---|
49 | pVar | indicates what notation is used for the pronunciation, if more than one occurs in the machine-readable dictionary. |
# | id | text |
---|---|---|
2 | witStart | fragmented witness start |
13 | witStart | indicates the beginning, or resumption, of the text of a fragmentary witness. |
# | id | text |
---|---|---|
2 | line | contains the transcription of a topographic line in the source document |
59 | line | This element should be used only to mark up writing which is topographically organized as a series of lines, horizontal or vertical. It should not be used to mark lines of verse (for which use |
61 | line | ) nor to mark linebreaks within text which has been encoded using structural elements such as |
# | id | text |
---|---|---|
2 | data.duration.iso | defines the range of attribute values available for representation of a duration in time using ISO 8601 standard formats |
64 | data.duration.iso | A duration is expressed as a sequence of number-letter pairs, preceded by the letter P; the letter gives the unit and may be Y (year), M (month), D (day), H (hour), M (minute), or S (second), in that order. The numbers are all unsigned integers, except for the last, which may have a decimal component (using either |
68 | data.duration.iso | as the decimal point; the latter is preferred). If any number is |
70 | data.duration.iso | , then that number-letter pair may be omitted. If any of the H (hour), M (minute), or S (second) number-letter pairs are present, then the separator |
73 | data.duration.iso | time |
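
The data.duration.iso rows above describe a duration as number-letter pairs after a leading P, in the order Y, M, D, H, M, S, with a decimal part allowed only on the last number. A minimal Python sketch of a parser for that shape follows. The name `parse_duration` is hypothetical, and two details are taken from ISO 8601 itself rather than from the truncated rows: the letter T as the separator before the time pairs, and `.` or `,` as the decimal mark.

```python
import re
from decimal import Decimal

# One unsigned number, optionally with a decimal part.
_NUM = r"(\d+(?:[.,]\d+)?)"
_PATTERN = re.compile(
    rf"P(?:{_NUM}Y)?(?:{_NUM}M)?(?:{_NUM}D)?"
    rf"(?:T(?:{_NUM}H)?(?:{_NUM}M)?(?:{_NUM}S)?)?"
)
_UNITS = ("years", "months", "days", "hours", "minutes", "seconds")
_TIME_UNITS = {"hours", "minutes", "seconds"}


def parse_duration(text):
    """Parse a duration such as 'P1Y2M3DT4H5M6,5S' into a dict of Decimals."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    present = [(u, raw) for u, raw in zip(_UNITS, match.groups()) if raw is not None]
    if not present:
        raise ValueError("a duration needs at least one number-letter pair")
    if "T" in text and not any(u in _TIME_UNITS for u, _ in present):
        raise ValueError("the time separator must be followed by a time pair")
    # Only the last number may carry a decimal component.
    for unit, raw in present[:-1]:
        if "." in raw or "," in raw:
            raise ValueError(f"only the last number may be fractional, not {unit}")
    return {u: Decimal(raw.replace(",", ".")) for u, raw in present}


full = parse_duration("P1Y2M3DT4H5M6,5S")
short = parse_duration("PT10M")
```

The separator is what lets the parser tell `P2M` (two months) from `PT2M` (two minutes), since both units share the letter M.
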
# | id | text |
---|---|---|
14 | fDecl | declares a single feature, specifying its name, organization, range of allowed values, and optionally its default value. |
45 | fDecl | a single word which follows the rules defining a legal XML name (see |
46 | fDecl | ), indicating the name of the feature being declared; matches the |
93 | fDecl | indicates whether or not the value of this feature may be present. |
113 | fDecl | If a feature is marked as optional, it is possible for it to be omitted from a feature structure. If an obligatory feature is omitted, then it is understood to have a default value, either explicitly declared, or, if no default is supplied, the special value |
115 | fDecl | . If an optional feature is omitted, then it is understood to be missing and any possible value (including the default) is ignored. |
# | id | text |
---|---|---|
2 | macro.paraContent | paragraph content |
14 | macro.paraContent | defines the content of paragraphs and similar elements. |
# | id | text |
---|---|---|
20 | model.global.meta | Elements in this class are typically used to hold groups of links or of abstract interpretations, or to provide indications of certainty etc. It may be convenient to localize all metadata elements, for example to contain them within the same division as the elements that they relate to; or to locate them all to a division of their own. They may however appear at any point in a TEI text. |
# | id | text |
---|---|---|
4 | support | contains a description of the materials etc. which make up the physical support for the written part of a manuscript. |
# | id | text |
---|---|---|
4 | quotation | specifies editorial practice adopted with respect to quotation marks in the original. |
39 | quotation | quotation marks |
49 | quotation | indicates whether or not quotation marks have been retained as content within the text. |
70 | quotation | no quotation marks have been retained |
86 | quotation | some quotation marks have been retained |
102 | quotation | all quotation marks have been retained |
# | id | text |
---|---|---|
2 | graphic | indicates the location of an inline graphic, illustration, or figure. |
65 | graphic | attribute should be used to supply the MIME media type of the image specified by the |
# | id | text |
---|---|---|
2 | att.duration.w3c | provides attributes for recording normalized temporal durations. |
54 | att.duration.w3c | are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. In order to represent a time range by a duration and its ending time the |
60 | att.duration.w3c | form, no claim is made that the form in the source text is incorrect; the regularized form is simply that chosen as the main form for purposes of unifying variant forms under a single heading. |
# | id | text |
---|---|---|
91 | abbr | the abbreviation comprises a special symbol or mark. |
107 | abbr | the abbreviation includes writing above the line. |
139 | abbr | the abbreviation is for a title of address (Dr, Ms, Mr, …) |
155 | abbr | the abbreviation is for the name of an organization. |
190 | abbr | attribute is provided for the sake of those who wish to classify abbreviations at their point of occurrence; this may be useful in some circumstances, though usually the same abbreviation will have the same type in all occurrences. As the sample values make clear, abbreviations may be classified by the method used to construct them, the method of writing them, or the referent of the term abbreviated; the typology used is up to the encoder and should be carefully planned to meet the needs of the expected use. For a typology of Middle English abbreviations, see |
269 | abbr | tag is not required; if appropriate, the encoder may transcribe abbreviations in the source text silently, without tagging them. If abbreviations are not transcribed directly but |
271 | abbr | silently, then the TEI header should so indicate. |
# | id | text |
---|---|---|
2 | model.orgPart | groups elements which form part of the description of an organization. |
# | id | text |
---|---|---|
2 | seriesStmt | series statement |
16 | seriesStmt | groups information about the series, if any, to which a publication belongs. |
# | id | text |
---|---|---|
28 | leaf | provides a pointer to a feature structure or other analytic element. |
66 | leaf | provides an identifier of an element which this leaf follows. |
84 | leaf | If the tree is unordered or partially ordered, this attribute has the property of fixing the relative order of the leaf and the element which is the value of the attribute. |
114 | leaf | The in degree of a leaf is always 1, its out degree always 0. |
# | id | text |
---|---|---|
2 | app | apparatus entry |
14 | app | contains one entry in a critical apparatus, with an optional lemma and usually one or more readings or notes on the relevant passage. |
118 | app | This attribute should be used when either the double-end point method of apparatus markup, or the location-referenced method with a URL rather than canonical reference, are used. |
149 | app | This attribute is only used when the double-end point method of apparatus markup is used, when the encoded apparatus is not embedded |
168 | app | location |
178 | app | indicates the location of the variation, when the location-referenced method of apparatus markup is used. |
196 | app | This attribute is used only when the location-referenced encoding method is used. It supplies a string containing a canonical reference for the passage to which the variation applies. |
# | id | text |
---|---|---|
2 | sequence | sequence of references |
14 | sequence | The sequence element must have at least two child elements |
20 | sequence | if true, indicates that the order in which component elements of a sequence appear in a document must correspond to the order in which they are given in the content model. |
37 | sequence | This example content model matches a sequence consisting of either a |
41 | sequence | followed by nothing, or by a sequence of up to five |
# | id | text |
---|---|---|
40 | width | If used to specify the depth of a non text-bearing portion of some object, for example a monument, this element conventionally refers to the axis facing the observer, and perpendicular to that indicated by the |
41 | width | depth |
# | id | text |
---|---|---|
2 | model.oddDecl | groups elements which generate declarations in some markup language in ODD documents. |
# | id | text |
---|---|---|
4 | affiliation | contains an informal description of a person's present or past affiliation with some organization, for example an employer or sponsor. |
64 | affiliation | If included, the name of an organization may be tagged using either the |
# | id | text |
---|---|---|
2 | certainty | indicates the degree of certainty associated with some aspect of the text markup. |
32 | certainty | certainty |
42 | certainty | signifies the degree of certainty associated with the object pointed to by the |
51 | certainty | indicates more exactly the aspect concerning which certainty is being expressed: specifically, whether the markup is correctly located, whether the correct element or attribute name has been used, or whether the content of the element or attribute is correct, etc. |
70 | certainty | uncertainty concerns whether the name of the element or attribute used is correctly applied. |
86 | certainty | uncertainty concerns the content (for an element) or the value (for an attribute) |
92 | certainty | provides an alternative value for the aspect of the markup in question—an alternative generic identifier, transcription, or attribute value, or the identifier of an |
100 | certainty | ; if none is given, it applies to the markup in the text. |
233 | certainty | The envisioned typical value of this attribute would be the identifier of another |
235 | certainty | element or a list of such identifiers. It may thus be possible to construct probability networks by chaining |
239 | certainty | elements (with no value for |
241 | certainty | ). The semantics of this chaining would be understood in this way: if a |
243 | certainty | element is specified, via a reference, as the assumption, then it is not the attribution of uncertainty that is the assumption, but rather the assertion itself. For instance, in the example above, the first |
# | id | text |
---|---|---|
4 | opener | groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter. |
# | id | text |
---|---|---|
52 | pb | A page break may be associated with a facsimile image of the page it introduces by means of the |
76 | pb | attribute indicates the number or other value associated with this page. This will normally be the page number or signature printed on it, since the physical sequence number is implicit in the presence of the |
# | id | text |
---|---|---|
2 | docImprint | document imprint |
16 | docImprint | contains the imprint statement (place and date of publication, publisher name), as given (usually) at the foot of a title page. |
130 | docImprint | element of bibliographic citations. As with title, author, and editions, the shorter name is reserved for the element likely to be used more often. |
# | id | text |
---|---|---|
2 | vColl | collection of values |
14 | vColl | represents the value part of a feature-value specification which contains multiple values organized as a set, bag, or list. |
54 | vColl | indicates organization of given value or values as |
# | id | text |
---|---|---|
4 | preparedness | describes the extent to which a text may be regarded as prepared or spontaneous. |
78 | preparedness | follows a predefined set of conventions |
# | id | text |
---|---|---|
14 | q | contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used. |
39 | q | may be used to indicate whether the offset passage is spoken or thought, or to characterize it more finely. |
90 | q | quotation from a written source |
128 | q | linguistically distinct |
138 | q | technical term |
215 | q | May be used to indicate that a passage is distinguished from the surrounding text for reasons concerning which no claim is made. When used in this manner, |
219 | q | with a value of |
221 | q | that indicates the use of such mechanisms as quotation marks. |
# | id | text |
---|---|---|
4 | creation | contains information about the creation of a text. |
85 | creation | element may be used to record details of a text's creation, e.g. the date and place it was composed, if these are of interest. |
91 | creation | element, which records date and place of publication. |
# | id | text |
---|---|---|
2 | bindingDesc | binding description |
13 | bindingDesc | describes the present and former bindings of a manuscript, either as a series of paragraphs or as a series of distinct |
15 | bindingDesc | elements, one for each binding of the manuscript. |
# | id | text |
---|---|---|
2 | listChange | groups a number of change descriptions associated with either the creation of a source text or the revision of an encoded text. |
62 | listChange | element it documents the set of revision campaigns or stages identified during the evolution of the original text. When it appears within the |
# | id | text |
---|---|---|
20 | model.rdgLike | element to be easily created via TEI customizations. |
# | id | text |
---|---|---|
2 | model.pPart.transcriptional | groups phrase-level elements used for editorial transcription of pre-existing source materials. |
# | id | text |
---|---|---|
2 | sourceDesc | source description |
13 | sourceDesc | describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as "born digital" for a text which has no previous existence. |
# | id | text |
---|---|---|
14 | sic | contains text reproduced although apparently incorrect or inaccurate. |
# | id | text |
---|---|---|
4 | role | contains the name of a dramatic role, as given in a cast list. |
# | id | text |
---|---|---|
2 | supportDesc | support description |
13 | supportDesc | groups elements describing the physical support for the written part of a manuscript. |
58 | supportDesc | a short project-defined name for the material composing the majority of the support |
# | id | text |
---|---|---|
2 | vNot | value negation |
14 | vNot | represents a feature value which is the negation of its content. |
# | id | text |
---|---|---|
15 | macroRef | the identifier used for the required pattern within the source indicated. |
32 | macroRef | Patterns or macros are identified by the name supplied as value for the |
36 | macroRef | element in which they are declared. All TEI macro names are unique. |
# | id | text |
---|---|---|
16 | att.internetMedia | MIME media type |
24 | att.internetMedia | specifies the applicable multimedia internet mail extension (MIME) media type |
47 | att.internetMedia | is used to indicate that the URL points to a TEI XML file encoded in UTF-8. |
54 | att.internetMedia | This attribute class provides an attribute for describing a computer resource, typically available over the internet, using a value taken from a standard taxonomy. At present only a single taxonomy is supported, the Multipurpose Internet Mail Extensions (MIME) Media Type system. This typology of media types is defined by the Internet Engineering Task Force in |
57 | att.internetMedia | list of types |
60 | att.internetMedia | attribute must have a value taken from this list. |
# | id | text |
---|---|---|
38 | iType | indicates the type of indicator used to specify the inflection class, when it is necessary to distinguish between the usual abbreviated indications (e.g. |
83 | iType | coded reference to a table of verbs |
101 | iType | gram type='inflectional type' |
148 | iType | May contain character data and phrase-level elements. Typical content will be |
# | id | text |
---|---|---|
4 | interaction | describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary, etc. |
28 | interaction | specifies the degree of interaction between active and passive participants in the text. |
47 | interaction | no interaction of any kind, e.g. a monologue |
63 | interaction | some degree of interaction, e.g. a monologue with set responses |
95 | interaction | this parameter is inappropriate or inapplicable in this case |
113 | interaction | specifies the number of active participants (or |
194 | interaction | number of addressors unknown or unspecifiable |
212 | interaction | specifies the number of passive participants (or |
214 | interaction | ) to whom a text is directed or in whose presence it is created or performed. |
243 | interaction | text is addressed to the originator e.g. a diary |
259 | interaction | text is addressed to one other person e.g. a personal letter |
275 | interaction | text is addressed to a countable number of others e.g. a conversation in which all participants are identified |
291 | interaction | text is addressed to an undefined but fixed number of participants e.g. a lecture |
307 | interaction | text is addressed to an undefined and indeterminately large number e.g. a published book |
# | id | text |
---|---|---|
2 | media | indicates the location of any form of external media such as an audio or video clip etc. |
61 | media | The attributes available for this element are not appropriate in all cases. For example, it makes no sense to specify the temporal duration of a graphic. Such errors are not currently detected. |
65 | media | attribute must be used to specify the MIME media type of the resource specified by the |
# | id | text |
---|---|---|
2 | pc | punctuation character |
4 | pc | contains a character or string of characters regarded as constituting a single punctuation mark. |
26 | pc | indicates the extent to which this punctuation mark conventionally separates words or phrases |
33 | pc | the punctuation mark is a word separator |
37 | pc | the punctuation mark is not a word separator |
41 | pc | the punctuation mark may or may not be a word separator |
47 | pc | provides a name for the kind of unit delimited by this punctuation mark. |
54 | pc | indicates whether this punctuation mark precedes or follows the unit it delimits. |
# | id | text |
---|---|---|
4 | unclear | contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source. |
29 | unclear | indicates why the material is hard to transcribe. |
92 | unclear | Where the difficulty in transcription arises from damage, categorizes the cause of the damage, if it can be identified. |
111 | unclear | damage results from rubbing of the leaf edges |
127 | unclear | damage results from mildew on the leaf surface |
143 | unclear | damage results from smoke |
187 | unclear | The same element is used for all cases of uncertainty in the transcription of element content, whether for written or spoken material. For other aspects of certainty, uncertainty, and reliability of tagging and transcription, see chapter |
# | id | text |
---|---|---|
4 | COL | The text of this manual was prepared electronically on a variety of systems. Most sections were originally drafted by members of the work groups and working committees of the TEI; all have been revised by the editors to achieve greater uniformity of style and greater consistency in the tag set. |
8 | COL | Almost every available SGML and XML editor or processing program has been used at one time or another by the TEI; but without the open source implementations of XML parsers, editors and XSLT engines by James Clark, Richard Stallman, Michael Kay, and Daniel Veillard, the TEI could not survive, and we thank these individuals. We would also like to thank the staff at Syncrosoft, creators of the oXygen editor, for their support for the TEI during the creation of P5. |
10 | COL | Many volunteers contributed to the preparation of this release of the Guidelines; we particularly thank Sabine Krott, Eva Radermacher and Arianna Ciula for their work in structuring the bibliographies. |
12 | COL | The production and release process for TEI P5 was managed by Sebastian Rahtz for the TEI Technical Council. |
# | id | text |
---|---|---|
5 | PREFS | prefixed to each revision of the TEI Guidelines since its first publication in 1994. |
9 | p4pf02 | The primary goal of this revision has been to make available a new and corrected version of the TEI Guidelines which: |
13 | p4pf02 | generates a set of DTD fragments that can be combined together to form either SGML or XML document type definitions; |
17 | p4pf02 | can be processed and maintained using readily available XML tools instead of the special-purpose ad hoc software originally used for TEI P3. |
21 | p4pf02 | A second major design goal of this revision has been to ensure that the DTD fragments generated would not break existing documents: in other words, that any document conforming to the original TEI P3 SGML DTD would also conform to the new XML version of it. Although full backwards compatibility cannot be guaranteed, we believe our implementation is consistent with that goal. |
23 | p4pf02 | In most respects, the TEI Guidelines have stood the test of time remarkably well. The present edition makes no substantial attempt to rewrite those few parts of them which have now been rendered obsolete by changes since their first publication, though an indication is given in the text of where such rewriting is now considered necessary. Neither does the present version attempt to address any of the many possible new areas of digital activity in which the TEI approach to standardization may have something to offer. Both these tasks require the existence of an informed and active TEI Council to direct and validate such extension and maintenance work, in response to the changing needs and priorities of the TEI user community. |
29 | p4pf02 | workgroup chaired by Christian Wittern, which undertook to provide expert advice and correction at very short notice, in the latter task. |
31 | p4pf02 | The preparation of this new version relied extensively on preliminary work carried out by the former North American editor of the TEI Guidelines, C.M. Sperberg-McQueen. In a TEI working paper written in 1999 |
32 | p4pf02 | TEI ED W69 |
33 | p4pf02 | , available from the TEI web site at |
35 | p4pf02 | he sketched out a precise blueprint for the conversion of the TEI from SGML to XML, which we have implemented, with only slight modification. |
37 | p4pf02 | The Editors would also like to express thanks to the team of volunteers from the TEI community who helped us with the task of proofreading the first draft during the summer of 2001; and to Sebastian Rahtz of Oxford University Computing Services, without whose skill and enthusiasm this new edition would not have been possible. |
39 | p4pf02 | A substantial proportion of the work of preparing this new edition was funded with the assistance of a grant from the US National Endowment for the Humanities, whose continued support of the TEI has also been crucial to the effort of setting up the TEI Consortium. |
41 | p4pf02 | Finally, we would like to thank all our colleagues on the interim management board of the TEI Consortium, in particular its Chairman John Unsworth, for their continued support of the TEI's work, and their willingness to devote effort to the difficult task of overseeing its transition to a new organizational infrastructure. |
52 | p4pf01 | To complete the work started in June of this year, the TEI Editors asked for volunteers from the TEI community to proofread the preliminary XML version. Twenty-four volunteers responded to this call during August, and gave invaluable help both by identifying a number of previously unnoticed errors, and by suggesting areas in which more substantial revision should be undertaken in the future. The Editors gratefully acknowledge the assistance of the following individuals during this exercise: |
56 | p4pf01 | In addition to error correction, and clear delineation of those sections in which substantial revision is yet to be undertaken for TEI P5, the present draft differs from earlier ones in the following respects: |
58 | p4pf01 | Formal Public Identifiers have been introduced as a means of constructing TEI DTDs and an SGML Open Catalog is now included with the standard release; |
62 | p4pf01 | The chapters on obtaining the TEI DTDs and WSDs have been brought up to date; the chapter on modification has been expanded to include a discussion of the TEI Lite customization; |
74 | PPF2 | This is a preliminary version of a revised and fully XML-compliant edition of the TEI Guidelines. Although work on revising and correcting the text of the document is incomplete, by making available this preliminary version we hope to facilitate testing of the XML document type declarations which it describes by as wide a range of TEI users as possible. |
76 | PPF2 | The primary goal of this revision is to make available the corrected (May 1999) edition of the Guidelines in a new version which: |
80 | PPF2 | generates a set of XML DTD fragments that can be combined together in the same way as the existing TEI (P3) SGML DTD fragments to form true TEI XML DTD fragments without loss of functionality; |
82 | PPF2 | can be processed and maintained using readily available XML tools instead of the special-purpose ad hoc software originally used for TEI P3. |
84 | PPF2 | As noted elsewhere, a number of errors were corrected in the May 1999 edition. A (much) smaller number of errors have also been corrected in this edition, but no new material has been added. We expect the expansion and modification of the Guidelines to become a real possibility in the context of the newly formed TEI Consortium, which has funded the preparation of this present edition. |
86 | PPF2 | A major design goal of both this and the previous revision has been to ensure that the DTD fragments generated would not break existing documents: in other words, that any document conforming to the original TEI P3 SGML DTD would also conform to the new XML version of it. Although full backwards compatibility cannot be guaranteed, we believe our implementation is consistent with that goal. |
88 | PPF2 | In making this new version, we relied extensively on preliminary work carried out by the outgoing North American editor of the TEI Guidelines, Michael Sperberg-McQueen. In a TEI working paper written in 1999, TEI ED W69, Michael sketched out a precise blueprint for the conversion of the TEI from SGML to XML, which we have implemented, with only slight modification. The current TEI editors wish to express here our admiration for the detailed care put into that paper, without which our task would have been forbiddingly difficult, if not impossible. We would also like to express our thanks to Sebastian Rahtz of Oxford University Computing Services, for his invaluable assistance in preparing this new edition. |
90 | PPF2 | We list here in summary form all the changes made in the present edition. Full technical details are provided in documents TEI EDW69 and TEI EDW70, available from the TEI web site. |
94 | PPF2 | has been added. By setting its value to |
96 | PPF2 | , rather than the default |
100 | PPF2 | The content models of all elements have been checked, and, where necessary, changed so that they are equally valid as SGML or as XML; |
102 | PPF2 | The declared value for all attributes has been changed to a form which is equally valid as SGML or as XML; |
109 | PPF2 | tag omissibility |
114 | PPF2 | used within element declarations in the DTD. When XML is to be generated, the parameter entities concerned are redeclared with the null string as their value. |
116 | PPF2 | The second change was achieved by removing SGML-specific features (ampersand connectors, inclusion and exclusion exceptions, various types of attribute content) from the DTD and revising the syntax of the DTD to conform to XML requirements (specifically in the representation of mixed-content models, and by removing redundant parentheses). In making these changes, we took care to ensure that the resulting content model would continue to accept existing valid documents, though in the nature of things it could not be guaranteed to reject the same set of documents. As further discussed in EDW69 and EDW70, some constraints (exclusion exceptions, for example) which could be carried out by a generic SGML parser using TEI P3 will have to be implemented by a special purpose TEI validator using TEI P4. |
118 | PPF2 | Much work remains to be done, firstly in testing the new DTD fragments against as wide a range of TEI materials as possible, secondly in revising the discussion of markup theory and practice within the text to reflect current thinking. A few sections of the current text (the Gentle Introduction to SGML and the discussion of Extended Pointer syntax are two examples) will need substantial rewriting. For the most part, however, we think the Guidelines have stood the test of time well and can be recommended to a new generation of text encoders scarcely born at the time they were first formulated. |
128 | ppf | No work of the size and complexity of the TEI |
130 | ppf | could reasonably be expected to be error-free on publication, nor to remain long uncorrected. It has however taken rather longer than might have been anticipated to complete production of the present corrected reprint of the first edition, for which we present our apologies, both to the many individuals and institutions whose enthusiastic adoption and promotion of the TEI encoding scheme have ensured its continued survival in the rapidly changing world of digital scholarship, and also to the many helpfully critical users whose assiduous uncovering and reporting of our errors have made possible the present revision. |
132 | ppf | At its first meeting in Bergen, in June 1996, the TEI Technical Review Committee (TRC) approved the setting up of a small working committee to oversee the production of a revised edition of the TEI |
134 | ppf | , to include corrections of as many as possible of the 'corrigible errors' notified to the editors since publication of the first edition in May 1994, the bulk of which are summarized in a TEI working paper (TEI EDW67, available from the TEI web site). |
138 | ppf | The work of making the corrections and regenerating the text proceeded rather fitfully during 1998 and 1999, largely because of increasing demands on the editors' time from their other responsibilities. With the establishment of the new TEI Consortium, it is to be hoped that maintenance of the Guidelines will be placed on a more secure footing. Some specific areas in which we anticipate future revisions being carried out are listed below. |
144 | ppf-tcm | examples of TEI markup throughout the text were all checked against the relevant DTD fragment and an embarrassingly large number of tagging errors corrected; |
150 | ppf-tcm | listed in working paper TEI EDW67 were all corrected: some of these required specific changes to the DTD which are listed in the next section. |
157 | ppf-spc | A major goal of this revision was to avoid changes which might invalidate existing data, even where existing constructs seemed erroneous in retrospect. To that end, wherever changes have been made in content models for existing elements, they have as far as possible been made so that the DTD will now accept a superset of what was previously legal. Only one new element ( |
161 | ppf-spc | Where possible, a few content models have been changed in such a way as to facilitate conversion to XML, but XML compatibility is |
204 | ppf-spc | ; this class was then added to the global inclusion class |
281 | ppf-spc | for use in simplification of the content model for |
291 | ppf-spc | corrected an error whereby global attributes were not properly defined for elements specifying a non-default value for any of the |
313 | ppf-spc | changed content models to permit empty |
319 | ppf-spc | changed content model for |
323 | ppf-spc | changed content model for |
335 | ppf-spc | changed content model for |
341 | ppf-spc | changed content model for |
351 | ppf-spc | changed content models for |
363 | ppf-spc | A number of content models were changed with a view to easing the creation of an XML compatible version of the Guidelines. Specifically: |
374 | ppf-spc | changed the mixed content models for |
397 | ppf-err | A small number of other known problems remain uncorrected in this version and are briefly listed below. Please watch the TEI mailing list for announcements of their correction. |
410 | ppf-err | need to be addressed systematically; in particular, the treatment of list items or notes which contain several paragraphs continues to surprise many users: no whitespace is allowed between the paragraphs; |
419 | ppf-err | Our next priority however will be the production of a fully XML-compliant version of the TEI DTD, work on which is already well advanced. |
429 | PF | The impetus for the project came from the humanities computing community, which sought a common encoding scheme for complex textual structures in order to reduce the diversity of existing encoding practices, simplify processing by machine, and encourage the sharing of electronic texts. It soon became apparent that a sufficiently flexible scheme could provide solutions for text encoding problems generally. The scope of the TEI was therefore broadened to meet the varied encoding requirements of any discipline or application. Thus, the TEI became the only systematized attempt to develop a fully general text encoding model and set of encoding conventions based upon it, suitable for processing and analysis of any type of text, in any language, and intended to serve the increasing range of existing (and potential) applications and use. |
431 | PF | What is published here is a major milestone in this effort. It provides a single, coherent framework for all kinds of text encoding which is hardware-, software- and application-independent. Within this framework, it specifies encoding conventions for a number of key text types and features. The ongoing work of the TEI is to extend the scheme presented here to cover additional text types and features, as well as to continue to refine its encoding recommendations on the basis of extensive experience with their actual application and use. |
433 | PF | We therefore offer these Guidelines to the user community for use in the same spirit of active collaboration and cooperation with which they have so far been developed. The TEI is committed to actively supporting the widespread and large-scale use of the Guidelines which, with the publication of this volume, is now for the first time possible. In addition, we anticipate that users of the TEI Guidelines will in some instances adapt and extend them as necessary to suit particular needs; we invite such users to engage in the further development of the Guidelines by working with us as they do so. |
435 | PF | Like any standard which is actually used, these Guidelines do not represent a static finished work, but rather one which will evolve over time with the active involvement of its community of users. We invite and encourage the participation of the user community in this process, in order to ensure that the TEI Guidelines become and remain useful in all sorts of work with machine-readable texts. |
437 | PF | This document was made possible in part by financial support from the U.S. National Endowment for the Humanities, an independent federal agency; Directorate General XIII of the Commission of the European Communities; the Andrew W. Mellon Foundation; and the Social Science and Humanities Research Council of Canada. Direct and indirect support has also been received from the University of Illinois at Chicago, the Oxford University Computing Services, the University of Arizona, the University of Oslo and Queen's University (Kingston, Ont.), Bellcore (Bell Communications Research), the Istituto di Linguistica Computazionale (C.N.R.) Pisa, the British Academy, and Ohio State University, as well as the employers and host institutions of the members of the TEI working committees and work groups listed in the acknowledgments. |
439 | PF | The production of this document has been greatly facilitated by the willingness of many software vendors to provide us with evaluation versions of their products. Most parts of this text have been processed at some time by almost every currently available SGML-aware software system. In particular, we gratefully acknowledge the assistance of the following vendors: |
456 | PF | Details of the software actually used to produce the current document are given in the colophon at the end of the work. |
461 | WG | Many people have given of their time, energy, expertise, and support in the creation of this document; it is unfortunately not possible to thank them all adequately. Below are listed those who have served as formal members of the TEI's Work Groups and Working Committees during its six-year history; others not so officially enfranchised also contributed much to the quality of the result. |
467 | WGWC | TEI Working Committees (1990-1993) |
495 | WGWC | In addition, the two TEI editors served ex officio on each committee. |
497 | WGWC | Following publication of the first draft of the TEI Guidelines (P1) in November 1990, a number of specialist work groups were charged with responsibility for drafting revisions and extensions, which, together with material already presented in P1, constitute the basis of the present work. |
499 | WGWC | In addition, many members of the work groups listed below met on three occasions to review the emerging proposals in detail at technical review meetings convened by the TEI Steering Committee. These meetings, held in Myrdal, Norway (November 1991), Chicago (May 1992) and Oxford (May 1993), were largely responsible for the technical content and organization of the present work. Attendees at these meetings are starred in the list below. |
521 | WGWC | TR11 Drama and performance texts |
530 | WGWC | AI2 Spoken text |
542 | WGWC | AI5 Print dictionaries |
554 | WGAB | Members of the TEI Advisory Board during the lifetime of the project are listed below, grouped under the name of the organization represented. |
603 | WGSC | Members of the Steering Committee of the TEI during the preparation of this work were: |
# | id | text |
---|---|---|
4 | DI | This chapter defines a module for encoding lexical resources of all kinds, in particular human-oriented monolingual and multilingual dictionaries, glossaries, and similar documents. The elements described here may also be useful in the encoding of computational lexica and similar resources intended for use by language-processing software; they may also be used to provide a rich encoding for wordlists, lexica, glossaries, etc. included within other documents. Dictionaries are most familiar in their printed form; however, increasing numbers of dictionaries exist also in electronic forms which are independent of any particular printed form, but from which various displays can be produced. |
6 | DI | Both typographically and structurally, print dictionaries are extremely complex. Such lexical resources are moreover of interest to many communities with different and sometimes conflicting goals. As a result, many general problems of text encoding are particularly pronounced here, and more compromises and alternatives within the encoding scheme may be required in the future. |
21 | DI | dictionaries; encoding guidelines should include these structural principles. We therefore define two distinct elements for dictionary entries, one ( |
34 | DI | Second, since so much of the information in printed dictionaries is implicit or highly compressed, their encoding requires clear thought about whether it is to capture the precise typographic form of the source text or the underlying structure of the information it presents. Since both of these views of the dictionary may be of interest, it proves necessary to develop methods of recording both, and of recording the interrelationship between them as well. Users interested mainly in the printed format of the dictionary will require an encoding to be faithful to an original printed version. However, other users will be interested primarily in capturing the lexical information in a dictionary in a form suitable for further processing, which may demand the expansion or rearrangement of the information contained in the printed form. Further, some users wish to encode |
36 | DI | of these views of the data, and retain the links between related elements of the two encodings. Problems of recording these two different views of dictionary data are discussed in section |
37 | DI | , together with mechanisms for retaining both views when this is desired. |
39 | DI | To deal with this complexity, and in particular to account for the wide variety of linguistic contexts within which a dictionary may be designed, it can be necessary to customize or change the schema by providing more restriction or possibly alternate content models for the elements defined in this chapter. Section |
40 | DI | illustrates this with the provision of a closed set of values for grammatical descriptors. |
42 | DI | This chapter contains a large number of examples taken from existing print dictionaries; in each case, the original source is identified. In presenting such examples, we have tried to retain the original typographic appearance of the example as well as presenting a suggested encoding for it. Where this has not been possible (for example in the display of pronunciation) we have adopted the transliteration found in the electronic edition of the |
44 | DI | . Also, the middle dot in quoted entries is rendered with a full stop, while within the sample transcriptions hyphenation and syllabification points are indicated by a vertical bar \|, regardless of their appearance in the source text. |
49 | DIBO | Overall, dictionaries have the same structure of front matter, body, and back matter familiar from other texts. In addition, this module defines |
55 | DIBO | as component-level elements which can occur directly within a text division or the text body. |
68 | DIBO | As members of the classes |
82 | DIBO | The front and back matter of a dictionary may well contain specialized material such as lists of common and proper nouns, grammatical tables, gazetteers, a |
84 | DIBO | , etc. These should be tagged using elements defined elsewhere in these Guidelines, chiefly in the core module (chapter |
89 | DIBO | element consists of a set of |
93 | DIBO | elements. These text divisions might, for example, correspond to sections for different letters of the alphabet, or to sections for different languages in a bilingual dictionary, as in the following example: |
118 | DIBO | In a print dictionary, the entries are typically typographically distinct entities, each headed by some morphological form of the lexical item described (the |
120 | DIBO | ), and sorted in alphabetical order or (especially for non-alphabetic scripts) in some other conventional sequence. Dictionary entries should be encoded as distinct successive items, each marked as an |
128 | DIBO | Some dictionaries provide distinct entries for homographs, on the basis of etymology, part-of-speech, or both, and typically provide a numeric superscript on the headword identifying the homograph number. In these cases each homograph should be encoded as a separate entry; the |
130 | DIBO | element may optionally be used to group such successive homograph entries. In addition to a series of |
136 | DIBO | group (see section |
137 | DIBO | ) when information about hyphenation, pronunciation, etc., is given only once for two or more homograph entries. If the homograph number is to be recorded, the global attribute |
139 | DIBO | may be used for this purpose. In some dictionaries, homographs are treated in distinct parts of the same entry; in these cases, they may be separated by use of the |
146 | DIBO | attribute, is often required for superentries and entries, especially in cases where the order of entries does not follow the local character-set collating sequence (as, for example, when an entry for |
148 | DIBO | appears at the place where |
210 | DIEN | A simple dictionary entry may contain information about the form of the word treated, its grammatical characterization, its definition, synonyms, or translation equivalents, its etymology, cross-references to other entries, usage information, and examples. These we refer to as the |
224 | DIEN | In addition, however, dictionary entries often have a complex hierarchical structure. For example, an entry may consist of two or more sub-parts, each corresponding to information for a different part-of-speech homograph of the headword. The entry (or part-of-speech homographs, if the entry is split this way) may also consist of senses, each of which may in turn be composed of two or more sub-senses, etc. Each sub-part, homograph entry, sense, or sub-sense we call a |
232 | DIENHI | The outermost structural level of an entry is marked with the elements |
242 | DIENHI | element even for an entry that has only one sense to group together all parts of the definition relating to the word sense since this leads to more consistent encoding across entries. All of these levels may each contain any of the constituent parts of an entry. A special case of hierarchical structure is represented by the |
247 | DIENHI | may be used at any point in the hierarchy to delimit parts of the dictionary entry which are structurally anomalous, as further discussed in section |
257 | DIENHI | For example, an entry with two senses will have the following structure: |
265 | DIENHI | An entry with two homographs, the first with two senses and the second with three (one of which has two sub-senses), may have a structure like this: |
326 | DIENHI | The hierarchic structure of a dictionary entry is enforced by the structures defined in this module. The content model for |
328 | DIENHI | specifies that entries do not nest, that homographs nest within entries, and that senses nest within entries, homographs, or senses, and may be nested to any depth to reflect the embedding of sub-senses. Any of the top-level constituents ( |
352 | DIENGP | information about the form of the word treated (orthography, pronunciation, hyphenation, etc.) |
356 | DIENGP | definitions or translations into another language |
395 | DIENGP | In a simple entry with no internal hierarchy, all top-level constituents can appear as children of |
403 | DIENGP | n person who competes. |
432 | DIENGP | Any top-level constituent can appear at any level when the hierarchical structure of the entry is more complex. The most obvious examples are |
438 | DIENGP | level when several senses or translations exist: |
481 | DIENGP | n cry of an ass; sound of a trumpet. ∙ vt [VP2A] make a cry or sound of this kind. |
518 | DIENGP | Information of the same kind can appear at different levels within the same entry; here, grammatical information occurs both at entry and homograph level. |
582 | DIENGP | 2 n [U] the state when one's feelings and actions are uncontrolled; freedom from control... |
677 | DITPFO | Dictionary entries most often begin with information about the form of the word to which the entry applies. Typically, the orthographic form of the word, sometimes marked for syllabification or hyphenation, is the first item in an entry. Other information about the word, including variant or alternate forms, inflected forms, pronunciation, etc., is also often given. |
712 | DITPFO | gen, number, case |
723 | DITPFO | when describing that particular form of the word. |
725 | DITPFO | Different dictionaries use different means to mark hyphenation, syllabification, and stress, and they often use some unusual glyphs (e.g., the |
728 | DITPFO | . When transcribing representations of pronunciation the International Phonetic Alphabet should be used. It may be convenient (as has been done in the text of this chapter) to use a simple transliteration scheme for this; such a scheme should however be properly documented in the header. |
753 | DITPFO | For a variety of reasons including ease of processing, it may be desired to split into separate elements information which is collapsed into a single element in the source text; orthography and hyphenation may for example be transcribed as separate elements, although given together in the source text. For a discussion of the issues involved, and of methods for retaining both the presentation form and the interpreted form, see section |
797 | DITPFO | Or the inflectional pattern may be indicated by reference to a table of paradigms, as here: |
820 | DITPFO | Explanatory labels may be attached to alternate forms: |
825 | DITPFO | mean time between failures. |
866 | DITPFO | element is repeated to associate the first orthographic form explicitly with the first pronunciation, and the second orthographic form with the second pronunciation: |
894 | DITPFO | element can preserve relations among elements that are implicit in the text. For example, in the CED entry for |
962 | DITPGR | , or any other element containing content about which there is grammatical information. For example, in the entry |
977 | DITPGR | , the elements for morphological information are simply shorthand for the general purpose |
979 | DITPGR | element. Consider this entry for the French word |
987 | DITPGR | This entry can be tagged using specialized grammatical elements: |
1120 | DITPSE | Dictionaries may describe the meanings of words in a wide variety of different ways—by means of synonyms, paraphrases, translations into other languages, formal definitions in various highly stylized forms, etc. No attempt is made here to distinguish all the different forms which sense information may take; all of them may be tagged using the |
1125 | DITPSE | As a special case it is frequently desirable to distinguish the provision of translation equivalents in other languages from other forms of sense information; the use of |
1126 | DITPSE | cit type="translation" |
1127 | DITPSE | (which groups a translation equivalent with related information such as its grammatical description) for this purpose is described in section |
1134 | DITPDE | Dictionary definitions are those pieces of prose in a dictionary entry that describe the meaning of some lexical item. Most often, definitions describe the headword of the entry; in some cases, they describe translated texts, examples, etc.; see |
1135 | DITPDE | cit type="translation" |
1138 | DITPDE | cit type="example" |
1142 | DITPDE | element directly contains the text of the definition; unlike |
1146 | DITPDE | , it does not serve solely to group a set of smaller elements. The close analysis of definition text, such as the tagging of hypernyms, typical objects, etc., is not covered by these Guidelines. |
1148 | DITPDE | Definitions may occur directly within an entry; when multiple definitions are given, they are typically identified as belonging to distinct senses, as here: |
1228 | DITPTR | Multilingual dictionaries contain information about translations of a given word in some source language for one or more target languages. Minimally, the dictionary provides the corresponding translation in the target language; other material, such as morphological information (gender, case), various kinds of usage restrictions, etc., may also be given. If translation equivalents are to be distinguished from other kinds of sense information, they may be encoded using |
1229 | DITPTR | cit type="translation" |
1236 | DITPTR | element is used in multilingual dictionaries to group information (forms, grammatical information, usage, translation(s), etc.) about a given sense of a word where necessary. Information about the individual translation equivalents within a sense is grouped using |
1237 | DITPTR | cit type="translation" |
1238 | DITPTR | . This information may include the translation text (tagged |
1260 | DITPTR | Note how in the following example, different translation equivalents are grouped into the same or different senses, following the punctuation of the source and the usage labels: |
1389 | DITPTR | cit type="translation" |
1390 | DITPTR | may also be used in monolingual dictionaries when a translation is given for a foreign word: |
1437 | DITPET | marks a block of etymological information. Etymologies may contain highly structured lists of words in an order indicating their descent from each other, but often also include related words and forms outside the direct line of descent, for comparison. Not infrequently, etymologies include commentary of various sorts, and can grow into short (or long!) essays with prose-like structure. This variation in structure makes it impracticable to define tags which capture the entire intellectual structure of the etymology or record the precise interrelation of all the words mentioned. It is, however, feasible to mark some of the more obvious phrase-level elements frequently found in etymologies, using tags defined in the core module or elsewhere in this chapter. Of particular relevance for the markup of etymologies are: |
1449 | DITPET | As in other prose, individual word forms mentioned in an etymological description are tagged with |
1459 | DITPET | element may be used to identify a particular language name where it appears, in addition to using the |
1545 | DITPEG | cit type="example" |
1546 | DITPEG | element contains usage examples and associated information; the example text itself should be enclosed in a |
1552 | DITPEG | element associates a quotation with a bibliographic reference to its source. |
1571 | DITPEG | adj tech having many parts: the multiplex eye of the fly. |
1578 | DITPEG | Or when one wants a more comprehensive representation of examples: |
1679 | DITPEG | When a source is indicated, the example should be marked with a |
1710 | DITPUS | Most dictionaries provide restrictive labels and phrases indicating the usage of given words or particular senses. Other phrases, not necessarily related to usage, may also be attached to forms, translations, cross-references, and examples. The following elements are provided to mark up such labels: |
1717 | DITPUS | element may be used for any kind of significative phrase or label within the text. The |
1733 | DITPUS | Many dictionaries provide an explanation and/or a list of such usage labels in a preface or appendix. The type of the usage information may be indicated in the |
1740 | DITPUS | geo |
1746 | DITPUS | time |
1759 | DITPUS | domain |
1762 | DITPUS | reg |
1790 | DITPUS | lang |
1793 | DITPUS | language for foreign words, spellings, pronunciations, etc. |
1796 | DITPUS | gram |
1801 | DITPUS | In addition to this kind of information, multilingual dictionaries often provide |
1803 | DITPUS | to help the user determine the right sense of a word in the source language (and hence the correct translation). These include synonyms, concept subdivisions, typical subjects and objects, typical verb complements, etc. These labels may also be marked with the |
1822 | DITPUS | colloc |
1855 | DITPUS | unclassifiable piece of information to guide sense choice |
1961 | DITPUS | When the usage label is hard to classify, it may be described as a |
1994 | DITPXR | Dictionary entries frequently refer to information in other entries, often using extremely dense notations to convey the headword of the entry to be sought, the particular part of the entry being referred to, and the nature of the information to be sought there (synonyms, antonyms, usage notes, etymology, an illustration, etc.). |
1996 | DITPXR | Cross-references may be tagged in dictionaries using the |
2000 | DITPXR | elements defined in the core module (section |
2003 | DITPXR | element may be used to group all the information relating to a cross-reference. |
2015 | DITPXR | ) is used to tag the cross-reference target proper (in dictionaries, usually the headword, possibly accompanied by a homograph number, a sense number, or other further restriction specifying what portion of the target entry is being referred to). The |
2017 | DITPXR | element is used to group the target with any accompanying phrases or symbols used to label the cross-reference; the cross-reference label itself may be tagged as a |
2057 | DITPXR | to mark the cross-reference label, the two examples differ in another way. The former assumes that the first sense of |
2061 | DITPXR | , and that the specific form of the reference in the source volume can be reconstructed, if needed, from that information. The latter does not require the first sense of |
2063 | DITPXR | to have an identifier, and retains the print form of the cross-reference; by omitting the |
2069 | DITPXR | and find the location referred to, or else that such processing will not be necessary. |
2075 | DITPXR | element may be used to indicate what kind of cross-reference is being made, using any convenient typology. Since different dictionaries may label the same kind of cross-reference in different ways, it may be useful to give normalized indications in the |
2131 | DITPXR | Strictly speaking, the reference above is not to the entry for |
2133 | DITPXR | , but to the list of synonyms found within that entry. |
2135 | DITPXR | In some cases, the cross-reference is to a particular subset of the meanings of the entry in question: |
2167 | DITPXR | The asterisk signals a reference to the entry for |
2175 | DITPXR | In some cases, the form in the definition is inflected, and thus |
2226 | DITPNO | am not, is not, are not, have not |
2232 | DITPNO | Although the interrogative form |
2235 | DITPNO | am I not? |
2236 | DITPNO | , it is generally avoided in spoken English and never used in formal English. |
2291 | DITPRE | element encloses a degenerate entry which appears in the body of another entry for some purpose. Many dictionaries include related entries for direct derivatives or inflected forms of the entry word, or for compound words, phrases, collocations, and idioms containing the entry word. |
2372 | DIHW | Examples, definitions, etymologies, and occasionally other elements such as cross-references, orthographic forms, etc., often contain a shortened or iconic reference to the headword, rather than repeating the headword itself. The references may be to the orthographic form or to the pronunciation, to the form given or to a variant of that form. The following elements are used to encode such iconic references to a headword: |
2382 | DIHW | which may optionally be used to resolve any ambiguity about the headword form being referred to. |
2390 | DIHW | indicates a reference to the full form of the headword |
2410 | DIHW | gives the initial of the word followed by a full stop, to indicate reference to the full form of the headword |
2414 | DIHW | refers to a capitalized form of the headword |
2420 | DIHW | element should be used for iconic or shortened references to the orthographic form(s) of the headword itself. It is an empty element and replaces, rather than enclosing, the reference. Note that the reference to a headword is not necessarily a simple string replacement. In the example |
2426 | DIHW | , the tilde stands for either headword form ( |
2520 | DIHW | attribute to refer to a specific form of the headword: |
2525 | DIHW | comb form … : vagus nerve < |
2625 | DIHW | In many cases the reference is not to the orthographic form of the headword, but rather to another form of the headword—usually to an inflected form. In these cases, the element |
2627 | DIHW | should be used; this element takes as its content the string as it appears in the text. |
2666 | DIHW | , which are defined in the additional module for linking, segmentation, and alignment (see chapter |
2689 | DIHW | In addition, some dictionaries make reference to the pronunciation of the headword in the pronunciation of related entries, variants, or examples. The |
2746 | DIHW | Since existing printed dictionaries use different conventions for headword references (swung dash, first letter abbreviated form, capitalization, or italicization of the word, etc.), the exact method used should be documented in the header. |
2764 | DIMV | typographic view |
2765 | DIMV | —the two-dimensional printed page, including information about line and page breaks and other features of layout |
2768 | DIMV | editorial view |
2769 | DIMV | —the one-dimensional sequence of tokens which can be seen as the input to the typesetting process; the wording and punctuation of the text and the sequencing of items are visible in this view, but specifics of the typographic realization are not |
2772 | DIMV | lexical view |
2773 | DIMV | —this view includes the underlying information represented in a dictionary, without concern for its exact textual form |
2777 | DIMV | For example, a domain indication in a dictionary entry might be broken over a line and therefore hyphenated ( |
2781 | DIMV | ); the typographic view of the dictionary preserves this information. In a purely editorial view, the particular form in which the domain name is given in the particular dictionary (as |
2787 | DIMV | , etc.) would be preserved, but the fact of the line break would not. Font shifts might plausibly be included in either a strictly typographic or an editorial view. In the lexical view, the only information preserved concerning domain would be some standard symbol or string representing the nautical domain (e.g. |
2789 | DIMV | ) regardless of the form in which it appears in the printed dictionary. |
2795 | DIMV | , the fonts in which different types of information are to be rendered, etc.), and then the typographic view, which is tied to a specific printed rendering. Computational linguists and philologists often begin with the typographic view and analyse it to obtain the editorial and/or lexical views. Some users may ultimately be concerned with retaining only the lexical view, or they may wish to preserve the typographic or editorial views as a reference text, perhaps as a guard against the loss or misinterpretation of information in the translation process. Some researchers may wish to retain all three views, and study their interrelations, since research questions may well span all three views. |
2797 | DIMV | In general, an electronic encoding of a text will allow the recovery of at least one view of that text (the one which guided the encoding); if editorial and typographic practices are consistently applied in the production of a printed dictionary, or if exceptions to the rules are consistently recorded in the electronic encoding, then it is |
2799 | DIMV | possible to recover the editorial view from an encoding of the lexical view, and the typographic view from an encoding of the editorial view. In practice, of course, the severe compression of information in dictionaries, the variety of methods by which this compression is achieved, the complexity of formulating completely explicit rules for editorial and typographic practice, and the relative rarity of complete consistency in the application of such rules, all make the mechanical transformation of information from one view into another something of a vexed question. |
2801 | DIMV | This section describes some principles which may be useful in capturing one or the other of these views as consistently and completely as possible, and describes some methods of attempting to capture more than one view in a single encoding. Only the editorial and lexical views are explicitly treated here; for methods of recording the physical or typographic details of a text, see chapter |
2806 | DIMV | attributes to link feature structures to a transcription of the editorial view of a dictionary, are not discussed here (for feature structures, see chapter |
2807 | DIMV | . For linkage of textual form and underlying information, see chapter |
2813 | DIMVTV | Common practice in encoding texts of all sorts relies on principles such as the following, which can be used successfully to capture the editorial view when encoding a dictionary: |
2815 | DIMVTV | All characters of the source text should be retained, with the possible exception of |
2816 | DIMVTV | rendition text |
2819 | DIMVTV | Characters appearing in the source text should typically be given as character data content in the document, rather than as the value of an attribute; again, rendition text may optionally be excepted from this rule. |
2821 | DIMVTV | Apart from the characters or graphics in the source text, nothing else should appear as content in the document, although it may be given in attribute values. |
2823 | DIMVTV | The material in the source text should appear in the encoding in the same order. Complications of the character sequence caused by footnotes, marginal notes, text wrapping around illustrations, etc., may be dealt with by the usual means (for notes, see section |
2825 | DIMVTV | Complications of sequence caused by marginal or interlinear insertions and deletions, which are frequent in manuscripts, or by unconventional page layouts, as in concrete poetry, magazines with imaginative graphic designers, and texts about the nature of typography as a medium, typically do not occur in dictionaries, and so are not discussed here. |
2830 | DIMVTV | In a very conservative transcription of the editorial view of a text, |
2831 | DIMVTV | rendition characters |
2833 | DIMVTV | rendition text |
2834 | DIMVTV | (for example, conjunctions joining alternate headwords, etc.) are typically retained. Removing the tags from such a transcription will leave all and only the characters of the source text, in their original sequence. |
2835 | DIMVTV | This is a slight oversimplification. Even in conservative transcriptions, it is common to omit page numbers, signatures of gatherings, running titles and the like. The simple description above also elides, for the sake of simplicity, the difficulties of assigning a meaning to the phrase |
2836 | DIMVTV | original sequence |
2837 | DIMVTV | when it is applied to the printed characters of a source text; the |
2838 | DIMVTV | original sequence |
2839 | DIMVTV | retained or recovered from a conservative transcription of the editorial view is, of course, the one established during the transcription by the encoder. |
2849 | DIMVTV | . a feather, wing, fin, or similarly shaped part. 3. another name for |
2853 | DIMVTV | A conservative encoding of the editorial view of this entry, which retains all rendition text, might resemble the following: |
2916 | DIMVTV | A somewhat simplified encoding of the editorial view of this entry might exploit the fact that rendition text is often systematically recoverable. For example, parentheses consistently appear around pronunciation in this dictionary, and thus are effectively implied by the start- and end-tags for |
2919 | DIMVTV | The omission of rendition text is particularly common in systems for document production; it is considered good practice there, since automatic generation of rendition text is more reliable and more consistent than attempting to maintain it manually in the electronic text. |
2920 | DIMVTV | In such an encoding, removing the tags should exactly reproduce the sequence of characters in the source, minus rendition text. The original character sequence can be recovered fully by replacing tags with any rendition text they imply. |
2924 | DIMVTV | element in the header would be used to record the following patterns of rendition text: |
2934 | DIMVTV | appears before alternate forms |
2940 | DIMVTV | , inflection information, and sense numbers |
2942 | DIMVTV | senses are numbered in sequence unless otherwise specified using the global |
3006 | DIMVTV | When rendition text is omitted, it is recommended that the means to regenerate it be fully documented, using the |
3008 | DIMVTV | element of the TEI header. |
3010 | DIMVTV | If rendition text is used systematically in a dictionary, with only a few mistakes or exceptions, the global attribute |
3012 | DIMVTV | may be used on any tag to flag exceptions to the normal treatment. The values of the |
3020 | DIMVTV | element in the TEI header. |
3052 | DIMVLV | If the text to be interchanged retains only the lexical view of the text, there may be no concern for the recoverability of the editorial (not to speak of the typographic) view of the text. However, it is strongly recommended that the TEI header be used to document fully the nature of all alterations to the original data, such as normalization of domain names, expansion of inflected forms, etc. |
3054 | DIMVLV | In an encoding of the lexical view of a text, there are degrees of departure from the original data: normalizing inconsistent forms like |
3068 | DIMVLV | reorganizing the order of elements in an entry to show their relationship, as in |
3073 | DIMVLV | where in a strictly lexical view one might wish to group |
3079 | DIMVLV | splitting an entry into two separate entries, as in |
3082 | DIMVLV | /"selIb@sI/ n [U] state of living unmarried, esp as a religious obligation. celi.bate /"selIb@t/ n [C] unmarried person (esp a priest who has taken a vow not to marry). |
3084 | DIMVLV | For some purposes, this entry might usefully be split into an entry for |
3086 | DIMVLV | and a separate entry for |
3092 | DIMVLV | An encoding which captures the lexical view of the example given in the previous section might look something like the following. In this encoding: |
3161 | DIMVLV | Whether the given dictionary encoding focusses on the lexical view and thus approaches the status of lexical databases, or uses the typographic/editorial view approach and needs to communicate the sometimes informally stated values for the particular descriptive features, the issue of |
3163 | DIMVLV | of the content and of the container objects becomes relevant, in view of the growing tendency to interlink pieces of information across Internet resources. In such cases, it becomes crucial to be able to encode the fact that, whether the value of the grammatical category of Number is given as "sg.", "sing.", "Singular", or equivalently "poj." in Polish or "Ez." in German, what is actually referred to is always the same grammatical value, which can be rendered with a plethora of markers depending on the publisher, language, or lexicographic tradition. In order to signal that these various surface markers in fact indicate the same underlying value, it is possible to align them with an external inventory of standardized values. The TEI provides means to align grammatical categories as well as their content with the ISOcat reference, which is a Web implementation of |
3167 | DIMVLV | In the example below, a fragment of the entry for |
3174 | DIMVLV | ). Depending on the status and extent of the dictionary, various strategies may be used to reduce the redundancy of the repeated ISOcat references. |
3193 | DIMVBO | It is sometimes desirable to retain both the lexical and the editorial view, in which case a potential conflict exists between the two. When there is a conflict between the encodings for the lexical and editorial views, the principles described in the following sections may be applied. |
3198 | DIMVAV | If the order of the data is the same in both views, then both views may be captured by encoding one |
3200 | DIMVAV | view in the character data content of the document, and encoding the other using attribute values on the appropriate elements. If all tags were to be removed, the remaining characters would be those of the dominant view of the text. |
3204 | DIMVAV | is used to provide attributes for use in encoding multiple views of the same dictionary entry. These attributes are available for use on all elements defined in this chapter when the base module for dictionaries is selected. |
3206 | DIMVAV | When the editorial view is dominant, the following attributes may be used to capture the lexical view: |
3211 | DIMVAV | When the lexical view is dominant, the following attributes may be used to record the editorial view: |
3221 | DIMVAV | For example, if the source text had the domain label |
3223 | DIMVAV | , it might be encoded as follows. With the editorial view dominant: |
3227 | DIMVAV | The lexical view of the same label would transcribe the normalized form as content of the |
3229 | DIMVAV | element, the typographic form as an attribute value: |
3235 | DIMVAV | If the source text gives inflectional information for the verb |
3241 | DIMVAV | . An encoding of the editorial view might take this form: |
3259 | DIMVAV | tag with null content, to enable the representation of implicit information even though it has no print realization. |
3261 | DIMVAV | The lexical view might be encoded thus: |
3284 | DIMVAV | A particular problem may be posed by the common practice of presenting two alternate forms of a word in a single string, by marking some parts of the word as optional in some forms. The following entry is for a word which can be spelled either |
3292 | DIMVAV | With the editorial view dominant, this entry might begin thus: |
3300 | DIMVAV | With the lexical view dominant, however, two |
3349 | DIMVAV | attribute is recommended, however, when long spans of text are involved, or when the optional part contains embedded tags. |
3362 | DIMVAV | A simple encoding solution would be to leave the definition text unanalysed, but this might be felt inadequate since it does not show that there are two definitions. A possible alternative encoding would be: |
3372 | DIMVAV | This transcribes some characters of the source text twice, however, which deviates from the usual practice. The following encoding records both the editorial and lexical views: |
3388 | DIMVOL | The attributes described in the previous section are useful only when the order of material is the same in both the editorial and the lexical view. When the two views impose different orders on the data, the standard linking mechanisms may be used to show the original location of material transposed in an encoding of the lexical view. |
3392 | DIMVOL | element may be used to mark the original location of the material, and the |
3394 | DIMVOL | attribute may be used on the lexical encoding of that material to indicate its original location(s). Like those in the preceding section, this attribute is defined for the attribute class |
3562 | DIFR | The content model for the |
3564 | DIFR | element provides an entry structure suitable for many average dictionaries, as well as many regular entries in more exotic dictionaries. However, the structure of some dictionaries does not allow the restrictions imposed by the content model for |
3570 | DIFR | elements are provided to support much wider variation in entry structure. The |
3572 | DIFR | element offers less freedom, in that it can only contain phrase level elements, but it can itself appear at any point within a dictionary entry where any of the structural components of a dictionary entry are permitted. As such, it acts as a container for otherwise anomalous parts of an entry. |
3588 | DIFR | element. For example, in the following entry from a dictionary already in electronic form, it is necessary to include a |
3592 | DIFR | . This is not permitted in the content model for |
3629 | DIFR | ) elements—that is, using no grouping elements at all. This can be desirable if the encoder wants a completely |
3631 | DIFR | view, with no indication of or commitment to the association of one element with another. The following encoding uses no grouping elements, and keeps all rendition text: |
3659 | DIFR | Here is an alternative way of representing the same structure, this time using |
3697 | DI | The selection and combination of modules to form a TEI schema is described in |
# | id | text |
---|---|---|
3 | CC | The term |
4 | CC | language corpus |
5 | CC | is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (for example, written, spoken, signed, or multimodal), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes, however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria. |
7 | CC | These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed |
8 | CC | core |
9 | CC | language. A corpus may be made up of whole texts or of fragments or text samples. It may be a |
15 | CC | corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section |
23 | CC | ). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI modules. |
25 | CC | Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. Therefore, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter |
27 | CC | ). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter |
28 | CC | ), drama (chapter |
29 | CC | ), transcriptions of spoken text (chapter |
31 | CC | should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application. In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators. |
33 | CC | This chapter does however include some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section |
35 | CC | describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional module for language corpora proper. Section |
36 | CC | discusses a mechanism by which individual parts of the TEI header may be associated with different parts of a TEI-conformant text. Section |
37 | CC | reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section |
55 | CCDEF | ); this section discusses their application to composite texts in particular. |
58 | CCDEF | text |
59 | CCDEF | refers to any stretch of discourse, whether complete or incomplete, unitary or composite, which the encoder chooses (perhaps merely for purposes of analytic convenience) to regard as a unit. The term |
60 | CCDEF | composite text |
63 | CCDEF | language corpora |
67 | CCDEF | poem cycles and epistolary works (novels or essays written in the form of collections or series of letters) |
70 | CCDEF | The elements listed above may be combined to encode each of these varieties of composite text in different ways. |
72 | CCDEF | In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status. |
76 | CCDEF | element is intended for the encoding of language corpora, though it may also be useful in encoding newspapers, electronic anthologies, and other disparate collections of material. The individual samples in the corpus are encoded as separate |
78 | CCDEF | elements, and the entire corpus is enclosed in a |
88 | CCDEF | element, in which the corpus as a whole, and encoding practices common to multiple samples may be described. The overall structure of a TEI-conformant corpus is thus: |
105 | CCDEF | Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the |
107 | CCDEF | element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section |
112 | CCDEF | In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The |
114 | CCDEF | element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged |
118 | CCDEF | If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the |
121 | CCDEF | . The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section |
122 | CCDEF | , and those for associating declarations with corpus components described in section |
123 | CCDEF | . These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and misclassification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications. |
125 | CCDEF | Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study. |
127 | CCDEF | Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: in both cases, the body of the text is composed largely of other texts. |
133 | CCDEF | element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter |
140 | CCDEF | elements. The embedded text itself may be encoded using the |
145 | CCDEF | All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts may be encoded using the same module, then no problem arises. If, however, they require different modules, then these must be included in the schema. This process is described in more detail in section |
150 | CCAH | Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex, and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication data of a newspaper; the topic, register or factuality of an extract from a textbook. Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics). |
152 | CCAH | Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI header, as described in chapter |
153 | CCAH | . In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section |
157 | CCAH | , which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section |
159 | CCAH | ); more detailed descriptive information about the creation and content of the corpus, such as the languages used within it and any descriptive classification system used (see section |
160 | CCAH | ); and version information documenting any changes made in the electronic text (see section |
164 | CCAH | , several other elements can be used in the TEI header if the additional module defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus this module is referred to as being |
165 | CCAH | for language corpora |
168 | CCAH | When the module defined in this chapter is included in a schema, a number of additional elements become available within the |
170 | CCAH | element of the TEI header (discussed in section |
187 | CCAHTD | element provides a full description of the situation within which a text was produced or experienced, and thus characterizes it in a way relatively independent of any |
191 | CCAHTD | . The description is organized as a set of values and optional prose descriptions for the following eight |
200 | CCAHTD | By default, a text description will contain each of the above elements, supplied in the order specified. Except for the |
202 | CCAHTD | element, which may be repeated to indicate multiple purposes, no element should appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content |
206 | CCAHTD | Texts may be described along many dimensions, according to many different taxonomies. No generally accepted consensus as to how such taxonomies should be defined has yet emerged, despite the best efforts of many corpus linguists, text linguists, sociolinguists, rhetoricians, and literary theorists over the years. Rather than attempting the task of proposing a single taxonomy of |
208 | CCAHTD | (or the equally impossible one of enumerating all those which have been proposed previously), the closed set of |
220 | CCAHTD | it is equally applicable to spoken, written, or signed texts |
222 | CCAHTD | Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section |
224 | CCAHTD | . A second approach is to develop an application-specific set of |
232 | CCAHTD | Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a |
234 | CCAHTD | in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a |
235 | CCAHTD | taxonomy |
243 | CCAHTD | element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in sections |
308 | CCAHPA | element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. When the detailed elements provided by the |
311 | CCAHPA | are included in a schema, this element can contain detailed demographic or descriptive information about individual speakers or groups of speakers, such as their names or other personal characteristics. Individually identified persons may also be identified by a code which can then be used elsewhere within the encoded text, for example as the value of a |
316 | CCAHPA | speaker |
321 | CCAHPA | within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written, spoken, or signed texts. |
325 | CCAHPA | contains a description of the participants in an interaction, which may be supplied as straightforward prose, possibly containing a list of names, encoded using the usual |
341 | CCAHPA | Alternatively, when the |
365 | CCAHPA | An identified character in a drama or a novel may also be regarded as a participant in this sense, and encoded using the same techniques: |
366 | CCAHPA | It is particularly useful to define participants in a dramatic text in this way, since it enables the |
368 | CCAHPA | attribute to be used to link |
393 | CCAHSE | element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings. |
395 | CCAHSE | Each distinct setting is described by means of a |
405 | CCAHSE | . If this attribute is not specified, the setting details provided are assumed to apply to all participants represented in the language interaction. Note however that it is not possible to encode different settings for the same participant: a participant is deemed to be a person within a specific setting. |
409 | CCAHSE | element may contain either a prose description or a selection of elements from the classes |
415 | CCAHSE | . By default, when the module defined by this chapter is included in a schema, these classes thus provide the following elements: |
426 | CCAHSE | may also be available if the |
430 | CCAHSE | The following example demonstrates the kind of background information often required to support transcriptions of language interactions, first encoded as a simple prose narrative: |
471 | CCAHSE | Again, a more detailed encoding for places is feasible if the |
473 | CCAHSE | module is included in the schema. The above examples assume that only the general purpose |
475 | CCAHSE | element supplied in the core module is available. |
484 | CCAS | This section discusses the association of the contextual information held in the header with the individual elements making up a TEI text or corpus. Contextual information is held in elements of various kinds within the TEI header, as discussed elsewhere in this section and in chapter |
485 | CCAS | . Here we consider what happens when different parts of a document need to be associated with different contextual information of the same type, for example when one part of a document uses a different encoding practice from another, or where one part relates to a different setting from another. In such situations, there will be more than one instance of a header element of the relevant type. |
487 | CCAS | The TEI scheme allows for the following possibilities: |
489 | CCAS | A given element may appear in the corpus header only, in the header of one or more texts only, or in both places |
491 | CCAS | There may be multiple occurrences of certain elements in either corpus or text header. |
498 | CCAS1 | A TEI-conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus-header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone. |
502 | CCAS1 | for a corpus text is understood to be prefixed by the |
504 | CCAS1 | given in the corpus header. All other optional elements of the |
506 | CCAS1 | should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented below. This facility makes it possible to state once for all in the corpus header each piece of contextual information which is common to the whole of the corpus, while still allowing for individual texts to vary from this common denominator. |
508 | CCAS1 | For example, the following schematic shows the structure of a corpus comprising three texts, the first and last of which share the same encoding description. The second one has its own encoding description. |
555 | CCAS2 | Certain of the elements which can appear within a TEI header are known as |
557 | CCAS2 | . These elements have in common the fact that they may be linked explicitly with a particular part of a text or corpus by means of a |
559 | CCAS2 | attribute on that element. This linkage is used to over-ride the default association between declarations in the header and a corpus or corpus text. The only header elements which may be associated in this way are those which would not otherwise be meaningfully repeatable. |
570 | CCAS2 | An alphabetically ordered list of declarable elements follows: |
611 | CCAS2 | . Since there are two, one of them (in this case |
629 | CCAS2 | For texts associated with the header in which this declaration appears, correction method |
631 | CCAS2 | will be assumed, unless they explicitly state otherwise. Here is the structure for a text which does state otherwise: |
641 | CCAS2 | In this case, the contents of the divisions D1 and D3 will both use correction policy |
643 | CCAS2 | , and those of division D2 will use correction policy |
657 | CCAS2 | , as well as smaller structural units, down to the level of paragraphs in prose, individual utterances in spoken texts, and entries in dictionaries. However, TEI recommended practice is to limit the number of multiple declarable elements used by a document as far as possible, for simplicity and ease of processing. |
663 | CCAS2 | An identifier specifying an element which contains multiple instances of one or more other elements should be interpreted as if it explicitly identified the elements identified as the default in each such set of repeated elements |
665 | CCAS2 | Each element specified, explicitly or implicitly, by the list of identifiers must be of a different kind. |
708 | CCAS2 | applies, correction method C1A and normalization method N1 apply, since these are the specified defaults within |
710 | CCAS2 | . In the same way, for a text specifying |
714 | CCAS2 | , correction C2A, and normalization N2B will apply. |
716 | CCAS2 | A finer grained approach is also possible. A text might specify |
717 | CCAS2 | text decls='C2B N2A' |
720 | CCAS2 | declarations as required. A tag such as |
721 | CCAS2 | text decls='ED1 ED2' |
722 | CCAS2 | would (obviously) be illegal, since it includes two elements of the same type; a tag such as |
723 | CCAS2 | text decls='ED2 C1A' |
728 | CCAS2 | , resulting in a list that identifies two correction elements (C1A and C2A). |
734 | CCAS3 | If there is a single occurrence of a given declarable element in a corpus header, then it applies by default to all elements within the corpus. |
736 | CCAS3 | If there is a single occurrence of a given declarable element in the text header, then it applies by default to all elements of that text irrespective of the contents of the corpus header. |
738 | CCAS3 | Where there are multiple occurrences of declarable elements within either corpus or text header, |
740 | CCAS3 | each must have a unique value specified as the value of its |
746 | CCAS3 | attribute with the value |
754 | CCAS3 | An association made by one element applies by default to all of its descendants. |
759 | CCAN | Language corpora often include analytic encodings or annotations, designed to support a variety of different views of language. The present Guidelines do not advocate any particular approach to linguistic annotation (or |
761 | CCAN | ); instead a number of general analytic facilities are provided which support the representation of most forms of annotation in a standard and self-documenting manner. Analytic annotation is of importance in many fields, not only in corpus linguistics, and is therefore discussed in general terms elsewhere in the Guidelines. |
766 | CCAN | The present section presents informally some particular applications of these general mechanisms to the specific practice of corpus linguistics. |
772 | CCAN1 | we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre, or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in these Guidelines, for example in chapters |
774 | CCAN1 | . The contextual properties of a TEI text are fully documented in the TEI header, which is discussed in chapter |
778 | CCAN1 | Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains. |
780 | CCAN1 | The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual, or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the |
782 | CCAN1 | element within the encoding description of the TEI header, as described in section |
783 | CCAN1 | . Where different parts of a corpus have used different annotation methods, the |
788 | CCAN1 | An extended example of one form of linguistic analysis commonly practised in corpus linguistics is given in section |
794 | CCREC | These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one language corpus, however large and ambitious. The reasoning behind this catholic approach is further discussed in chapter |
795 | CCREC | . For most large-scale corpus projects, it will therefore be necessary to determine a subset of TEI recommended elements appropriate to the anticipated needs of the project, as further discussed in chapter |
796 | CCREC | ; these mechanisms include the ability to exclude selected element types, add new element types, and change the names of existing elements. A discussion of the implications of such changes for TEI conformance is provided in chapter |
799 | CCREC | Because of the high cost of identifying and encoding many textual features, and the difficulty in ensuring consistent practice across very large corpora, encoders may find it convenient to divide the set of elements to be encoded into the following four categories: |
802 | CCREC | texts included within the corpus will always encode textual features in this category, should they exist in the text |
805 | CCREC | textual features in this category will be encoded wherever economically and practically feasible; where present but not encoded, a note in the header should be made. |
808 | CCREC | textual features in this category may or may not be encoded; no conclusion about the absence of such features can be inferred from the absence of the corresponding element in a given text. |
812 | CCREC | textual features in this category are deliberately not encoded; they may be transcribed as unmarked up text, or represented as |
833 | CC | The selection and combination of modules to form a TEI schema is described in |
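
The header inheritance rule stated in row 498 (CCAS1) of the table above can be sketched in Python. This is only an illustrative model under stated assumptions, not part of the Guidelines: headers are represented as plain dictionaries keyed by element name, and the function name `effective_header`, the text labels, and the sample values are all hypothetical. The sketch also ignores the special handling of file description elements mentioned in rows 502 to 506.

```python
def effective_header(corpus_header: dict, text_header: dict) -> dict:
    """Combine a corpus header with the header of one text (hypothetical model).

    - Every corpus-level element is treated as if it appeared in the text header.
    - An element present only in the text header supplements the result.
    - An element present in both is taken from the text header, for that text alone.
    """
    merged = dict(corpus_header)   # copy, so the corpus header is never modified
    merged.update(text_header)     # text-level elements supplement or override
    return merged


# Hypothetical corpus of three texts: the first and last rely on the corpus-level
# encoding description, while the second supplies its own.
corpus_header = {"encodingDesc": "ED-corpus", "langUsage": "en"}
text_headers = {
    "T1": {},
    "T2": {"encodingDesc": "ED-T2"},
    "T3": {},
}

effective = {
    label: effective_header(corpus_header, header)
    for label, header in text_headers.items()
}

for label, header in effective.items():
    print(label, header["encodingDesc"])
```

In this model T1 and T3 inherit the corpus-level encoding description and T2 overrides it for itself only, which mirrors the three-text schematic described in row 508. The corpus header dictionary is left unchanged by the merge.
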
# | id | text |
---|---|---|
6 | FS | is a general purpose data structure which identifies and groups together individual |
8 | FS | , each of which associates a name with one or more values. Because of the generality of feature structures, they can be used to represent many different kinds of information, but they are of particular usefulness in the representation of linguistic analyses, especially where such analyses are partial, or |
29 | FSor | binary |
34 | FSor | numeric |
36 | FSor | string |
43 | FSor | set |
47 | FSor | list |
49 | FSor | discusses how the operations of alternation, negation, and collection of feature values may be represented. Section |
62 | FSBI | The fundamental elements used to represent a feature structure analysis are |
74 | FSBI | attribute which may be used to represent typed feature structures, and may contain any number of |
81 | FSBI | value |
82 | FSBI | . The value may be simple: that is, a single binary, numeric, symbolic (i.e. taken from a restricted set of legal values), or string value, or a collection of such values, organized in various ways, for example, as a list; or it may be complex, that is, it may itself be a feature structure, thus providing a degree of recursion. Values may be under-specified or defaulted in various ways. These possibilities are all described in more detail in this and the following sections. |
86 | FSBI | . The components of such libraries may then be referenced from other feature or feature-value representations, using the |
92 | FSBI | We begin by considering the simple case of a feature structure which contains binary-valued features only. The following three XML elements are needed to represent such a feature structure: |
101 | FSBI | are not discussed in this section: they provide an alternative way of indicating the content of an element, as further discussed in section |
108 | FSBI | elements with binary values can be straightforwardly used to encode the |
145 | FSBI | attribute to indicate the name of the feature. Feature structures need not be typed, but features must be named. Similarly, the |
153 | FSBI | to a binary value) requires additional validation, as does any restriction on the features available within a feature structure of a particular type (e.g. whether a feature structure of type |
157 | FSBI | ). Such validation may be carried out at the document level, using special purpose processing, at the schema level using additional validation rules, or at the declarative level, using an additional mechanism such as the |
162 | FSBI | Although we have used the term |
163 | FSBI | binary |
172 | FSBI | ), it should be noted that such values are not restricted to propositional assertions. As this example shows, this kind of value is intended for use with any binary-valued feature. |
181 | FSSY | numeric values |
183 | FSSY | string values |
184 | FSSY | . The module defined by this chapter allows for the specification of additional datatypes if necessary, by extending the underlying class |
194 | FSSY | element is used for the value of a feature when that feature can have any of a small, finite set of possible values, representable as character strings. For example, the following might be used to represent the claim that the Latin noun form |
210 | FSSY | case |
214 | FSSY | number |
215 | FSSY | ) are used to define morpho-syntactic properties of a word. Each of these features can take one of a small number of values (for example, case can be |
225 | FSSY | elements. Note that, instead of using a symbolic value for grammatical number, one could have named the feature |
229 | FSSY | and given it an appropriate binary value, as in the following example: |
234 | FSSY | Whether one uses a binary or symbolic value in situations like this is largely a matter of taste. |
238 | FSSY | element is used for the value of a feature when that value is a string drawn from a very large or potentially unbounded set of possible strings of characters, so that it would be impractical or impossible to use the |
240 | FSSY | element. The string value is expressed as the content of the |
242 | FSSY | element, rather than as an attribute value. For example, one might encode a street address as follows: |
250 | FSSY | element is used when the value of a feature is a numeric value, or a range of such values. For example, one might wish to regard the house number and the street name as different features, using an encoding like the following: |
257 | FSSY | If the numeric value to be represented falls within a specific range (for example an address that spans several numbers), the |
266 | FSSY | It is also possible to specify that the numeric value (or values) represented should (or should not) be truncated. For example, assuming that the daily rainfall in mm is a feature of interest for some address, one might represent this by an encoding like the following: |
269 | FSSY | This represents any of the infinite number of numeric values falling between 0 and 1.3; by contrast |
274 | FSSY | Some communities of practice, notably those with a strong computer-science bias, prefer to dissociate the information on the value of the given feature from the specification of the data type that this value represents. In such cases, feature values can be provided directly as textual content of |
281 | FSSY | As noted above, additional processing is necessary to ensure that appropriate values are supplied for particular features, for example to ensure that the feature |
283 | FSSY | is not given a value such as |
284 | FSSY | symbol value="feminine"/ |
285 | FSSY | . There are two ways of attempting to ensure that only certain combinations of feature names and values are used. First, if the total number of legal combinations is relatively small, one can predefine all of them in a construct known as a |
287 | FSSY | , and then reference the combination required using the |
292 | FSSY | feature value library |
293 | FSSY | (so called, since a feature structure may be the value of a feature). A total of 30 feature structures (5 × 3 × 2) is required to enumerate all the possible combinations of individual case, gender and number values in the preceding illustration. We discuss the use of such libraries and their representation in XML further in section |
301 | FSSY | Whether at the level of feature-system declarations, feature- and feature-value libraries, or individual features, it is possible to align both feature names and their values with standardized external data category repositories such as ISOcat. |
306 | FSSY | and its value |
321 | FSFL | As the examples in the preceding section suggest, the direct encoding of feature structures can be verbose. Moreover, it is often the case that particular feature-value combinations, or feature structures composed of them, are re-used in different analyses. To reduce the size and complexity of the task of encoding feature structures, one may use the |
337 | FSFL | ). If a feature has as its value a feature structure or other value which is predefined in this way, the |
344 | FSFL | For example, suppose a feature library for phonological feature specifications is set up as follows. |
391 | FSFL | Then the feature structures that represent the analysis of the phonological segments (phonemes) |
405 | FSFL | The preceding are but four of the 128 logically possible fully specified phonological segments using the seven binary features listed in the feature library. Presumably not all combinations of features correspond to phonological segments (there are no strident vowels, for example). The legal combinations, however, can be collected together, each one represented as an identifiable |
423 | FSFL | attribute; for example, one might use them in a feature value pair such as: |
427 | FSFL | Feature structures stored in this way may also be associated with the text which they are intended to annotate, either by a link from the text (for example, using the TEI global |
429 | FSFL | attribute), or by means of stand-off annotation techniques (for example, using the TEI |
434 | FSFL | Note that when features or feature structures are linked to in this way, the result is effectively that a copy of the item linked to is placed at the point from which it is linked. This form of linking should be distinguished from the phenomenon of |
444 | FSST | Features may have complex values as well as atomic ones; the simplest such complex value is represented by supplying a |
446 | FSST | element as the content of an |
450 | FSST | element as the value for the |
464 | FSST | To illustrate the use of complex values, consider the following simple model of a word, as a structure combining surface form information, a syntactic category, and semantic information. Each word analysis is represented as a |
465 | FSST | fs type='word' |
467 | FSST | surface |
472 | FSST | . The first of these has an atomic string value, but the other two have complex values, represented as nested feature structures of types |
473 | FSST | category |
492 | FSST | This analysis does not tell us much about the meaning of the symbols |
514 | FSST | element, as a number of |
516 | FSST | elements. Alternatively, the relevant features may be referenced by their identifiers, supplied as the value of the |
532 | FSST | With such libraries in place, and assuming the availability of similarly predefined feature structures for transitivity and semantics, the preceding example could be considerably simplified: |
556 | FSVAR | Sometimes the same feature value is required at multiple places within a feature structure, in particular where the value is only partially specified at one or more places. The |
563 | FSVAR | For example, suppose one wishes to represent noun-verb agreement as a single feature structure. Within the representation, the feature indicating (say) number appears more than once. To represent the fact that each occurrence is another appearance of the same feature (rather than a copy) one could use an encoding like the following: |
590 | FSVAR | vLabel |
595 | FSVAR | The scope of the names used to label re-entrancy points is that of the outermost |
597 | FSVAR | element in which they appear. When a feature structure is imported from a feature value library, or referenced from elsewhere (for example by using the |
599 | FSVAR | attribute) the names of any sharing points it may contain are implicitly prefixed by the identifier used for the imported feature structure, to avoid name clashes. Thus, if some other feature structure were to reference the |
602 | FSVAR | then the labelled points in the example would be interpreted as if they had the name |
616 | FSSS | A feature whose value is regarded as a set, bag, or list may have any positive number of values as its content, or none at all (thus allowing for representation of the empty set, bag, or list). The items in a list are ordered, and need not be distinct. The items in a set are not ordered, and must be distinct. The items in a bag are neither ordered nor distinct. Sets and bags are thus distinguished from lists in that the order in which the values are specified does not matter for the former, but does matter for the latter, while sets are distinguished from bags and lists in that repetitions of values do not count for the former but do count for the latter. |
618 | FSSS | If no value is specified for the |
622 | FSSS | defines a list of values. If the |
628 | FSSS | attribute, suppose that a feature structure analysis is used to represent a genealogical tree, with the information about each individual treated as a single feature structure, like this: |
654 | FSSS | element is first used to supply a list of |
655 | FSSS | name |
658 | FSSS | feature. Other features are defined by reference to values which we assume are held in some external feature value library (not shown here). For example, the |
660 | FSSS | element is used a second time to indicate that the person's siblings should be regarded as constituting a set rather than a list. Each sibling is represented by a feature structure: in this example, each feature structure is a copy of one specified in the feature value library. |
662 | FSSS | If a specific feature contains only a single feature structure as its value, the component features of which are organized as a set, bag, or list, it may be more convenient to represent the value as a |
666 | FSSS | . For example, consider the following encoding of the English verb form |
670 | FSSS | feature whose value is a feature structure which contains |
671 | FSSS | person |
673 | FSSS | number |
714 | FSSS | element is also useful in cases where an analysis has several components. In the following example, the French word |
716 | FSSS | has a two-part analysis, represented as a list of two values. The first specifies that the word contains a preposition; the second that it contains a masculine plural relative pronoun: |
736 | FSSS | The set, bag, or list which has no members is known as the null (or empty) set, bag, or list. A |
738 | FSSS | element with no content and with no value for its |
740 | FSSS | attribute is interpreted as referring to the null set, bag, or list, depending on the value of its |
755 | FSSS | elements, if, for example, one of the members of a set is itself a set, or if two lists are concatenated together. Note that such collections pay no attention to the contents of the nested |
757 | FSSS | elements: if it is desired to produce the union of two sets, the |
759 | FSSS | element discussed below should be used to make a new collection from the two sets. |
764 | FVE | It is sometimes desirable to express the value of a feature as the result of an operation over some other value (for example, as |
768 | FVE | , or as the concatenation of two collections). Three special purpose elements are provided to represent disjunctive alternation, negation, and collection of values: |
779 | FVALT | element can be used wherever a feature value can appear. It contains two or more feature values, any one of which is to be understood as the value required. Suppose, for example, that we are using a feature system to describe residential property, using such features as |
781 | FVALT | . In a particular case, we might wish to represent uncertainty as to whether a house has two or three bathrooms. As we have already shown, one simple way to represent this would be with a numeric maximum: |
791 | FVALT | element represents alternation over feature values, not feature-value pairs. If therefore the uncertainty relates to two or more feature value specifications, each must be represented as a feature structure, since a feature structure can always appear where a value is required. For example, if it is uncertain whether the house being described has two bathrooms or two bedrooms, a structure like the following may be used: |
805 | FVALT | : in the case above, the implication is that having two bathrooms excludes the possibility of having two bedrooms and vice versa. If inclusive alternation is required, a |
824 | FVALT | This analysis indicates that the property may have two bathrooms, two bedrooms, or both two bathrooms and two bedrooms. |
830 | FVALT | to describe items that are mentioned to enhance a property's sales value, such as whether it has a pool or a good view. Now suppose for a particular listing, the selling points include an alarm system and a good view, and either a pool or a jacuzzi (but not both). This situation could be represented, using the |
870 | FVALT | If a large number of ambiguities or uncertainties need to be represented, involving a relatively small number of features and values, it is recommended that a stand-off technique, for example using the general-purpose |
883 | FVNOT | element can be used wherever a feature value can appear. It contains any feature value and returns the complement of its contents. For example, the feature |
885 | FVNOT | in the following example has any whole numeric value other than 2: |
892 | FVNOT | element is to provide the complement of the feature values it contains, rather than their negation. If a feature system declaration is available which defines the possible values for the associated feature, then it is possible to say more about the negated value. For example, suppose that the available values for the feature |
893 | FVNOT | case |
894 | FVNOT | are declared to be nominative, genitive, dative, or accusative, whether in a TEI feature system declaration or by some other means. Then the following two specifications are equivalent: |
906 | FVNOT | If however no such system declaration is available, all that one can say about a feature specified via negation is that its value is something other than the negated value. |
908 | FVNOT | Negation is always applied to a feature value, rather than to a feature-value pair. The negation of an atomic value is the set of all other values which are possible for the feature. |
910 | FVNOT | Any kind of value can be negated, including collections (represented by a |
914 | FVNOT | elements). The negation of any complex value is understood to be the set of values which cannot be unified with it. Thus, for example, the negation of the feature structure F is understood to be the set of feature structures which are not unifiable with F. In the absence of a constraint mechanism such as the Feature System Declaration, the negation of a collection is anything that is not unifiable with it, including collections of different types and atomic values. It will generally be more useful to require that the organization of the negated value be the same as that of the original value, for example that a negated set is understood to mean the set which is a complement of the set, but such a requirement cannot be enforced in the absence of a constraint mechanism. |
921 | FVCOLL | element can be used wherever a feature value can appear. It contains two or more feature values, all of which are to be collected together. The organization of the resulting collection is specified by the value of the |
923 | FVCOLL | attribute, which need not necessarily be the same as that of its constituent values if these are collections. For example, one can change a list to a set, or vice versa. |
940 | FVCOLL | Suppose however that we discover for some language it is necessary to add a new possible value, and to treat the value of the feature as a list rather than as a set. The |
961 | FSBO | The value of a feature may be underspecified in a number of different ways. It may be null, unknown, or uncertain with respect to a range of known possibilities, as well as being defined as a negation or an alternation. As previously noted, the specification of the range of known possibilities for a given feature is not part of the current specification: in the TEI scheme, this information is conveyed by the |
963 | FSBO | . Using this, or some other system, we might specify (for example) that the range of values for an element includes symbols for masculine, feminine, and neuter, and that the default value is neuter. With such definitions available to us, it becomes possible to say that some feature takes the default value, or some unspecified value from the list. The following special element is provided for this purpose: |
968 | FSBO | The value of an empty |
982 | FSBO | If, however, the value is explicitly stated to be the default one, using the |
984 | FSBO | element, then the following two representations are equivalent: |
992 | FSBO | Similarly, if the value is stated to be the negation of the default, then the following two representations are equivalent: |
1007 | FSLINK | Text elements can be linked with feature structures using any of the linking methods discussed elsewhere in the Guidelines (see for example sections |
1121 | FSLINK | element is used to link selected characters in the text |
1168 | FSLINK | It would then be possible to link each word to its intended annotation in the feature library quoted above, as follows: |
1183 | FD | The Feature System Declaration (FSD) is intended for use in conjunction with a TEI-conforming text that makes use of |
1187 | FD | It provides a mechanism by which the encoder can list all of the feature names and feature values and give a prose description as to what each represents. |
1193 | FD | It provides a mechanism by which the encoder can define the intended interpretation of underspecified feature structures. This involves defining default values (whether literal or computed) for missing features. |
1196 | FD | . This chapter relies upon, but does not reproduce, formal definitions and descriptions presented more thoroughly in the ISO standard, which should be consulted in case of ambiguity or uncertainty. |
1198 | FD | The FSD serves an important function in documenting precisely what the encoder intended by the system of feature structure markup used in an XML-encoded text. The FSD is also an important resource which standardizes the rules of inference used by software to validate the feature structure markup in a text, and to infer the full interpretation of underspecified feature structures. |
1200 | FD | The reader should be aware that the terminology used in this document does not always closely follow conventional practice in formal logic, and may also diverge from practice in some linguistic applications of typed feature structures. In particular, the term |
1201 | FD | interpretation |
1202 | FD | when applied to a feature structure is not an interpretation in the model-theoretic sense, but is instead a minimally informative (or equivalently, most general) extension |
1203 | FD | of that feature structure that is consistent with a set of constraints declared by an FSD. In linguistic application, such a system of constraints is the principal means by which the grammar of some natural language is expressed. There is a great deal of disagreement as to what, if any, model-theoretic interpretation feature structures have in such applications, but the status of this formal kind of interpretation is not germane to the present document. Similarly, the term |
1205 | FD | is used here as elsewhere in these Guidelines to identify the syntactic state of well-formedness in the sense defined by the logic of typed feature structures itself, as distinct from and in addition to the |
1209 | FD | We begin by describing how an encoded text is associated with one or more feature system declarations. The second, third, and fourth sections describe the overall structure of a feature system declaration and give details of how to encode its components. The final section offers a full example; fuller discussion of the reasoning behind FSDs and another complete example are provided in |
1213 | FDLK | Linking a TEI Text to Feature System Declarations |
1215 | FDLK | In order for application software to use feature system declarations to aid in the automatic interpretation of encoded texts, or even for human readers to find the appropriate declarations which document the feature system used in markup, there must be a formal link from the encoded texts to the declarations. However, the schema which declares the syntax of the Feature System itself should be kept distinct from the feature structure schema, which is an application of that system. |
1219 | FDLK | element for each distinct type of feature structure used must be provided and associated with the type, which is the value used within each feature structure for its |
1230 | FDLK | element may be supplied either within the header of a standard TEI document, or as a standalone document in its own right. It contains one or more |
1245 | FDLK | element for each within the header attached to the document as follows: |
1274 | FDLK | In this case there is an implicit link between the |
1278 | FDLK | element because they share the same value for their |
1280 | FDLK | attribute and appear within the same document. This is a short cut for the more general case which requires a more explicit link provided by means of the |
1285 | FDLK | Ways of pointing to components of a TEI document without using an XML identifier are discussed in |
1286 | FDLK | way of accomplishing this is to add an XML identifier to each |
1301 | FDLK | (Although in this case the XML identifier is simply an uppercase version of the type name, there is no necessary connection between the two names. The only requirement is that the XML identifier conform to the standards required for identifiers, and that it be unique within the document containing it.) |
1332 | FDLK | there is no requirement for the local name for a given type of feature structures to be the same as that used by |
1348 | FDLK | element of a TEI document containing typed feature structures. Alternatively, it may appear independently of any feature structures, as a document in its own right, possibly with its own |
1362 | FDLK | value specified on a |
1371 | FDOV | A feature system declaration contains one or more feature structure declarations, each of which has up to three parts: an optional description (which gives a prose comment on what that type of feature structure encodes), an obligatory set of feature declarations (which specify range constraints and default values for the features in that type of structure), and optional feature structure constraints (which specify co-occurrence restrictions on feature values). |
1380 | FDOV | element may name one or more |
1385 | FDOV | fsDecl type="Basic" |
1387 | FDOV | fDecl name="One" |
1389 | FDOV | fDecl name="Two" |
1391 | FDOV | fsDecl type="Derived" baseTypes="Basic" |
1393 | FDOV | fDecl name="Three" |
1395 | FDOV | fs type="Derived" |
1397 | FDOV | fsDecl type="Derived" |
1399 | FDOV | fsDecl type="Basic" |
1400 | FDOV | when it specifies a base type of |
1422 | FDOV | gives the name of one or more types from which this type inherits feature specifications and constraints; if this type includes a feature specification with the same name as one inherited from any of the types specified by this attribute, or if more than one specification of the same name is inherited, then the possible values of that feature are determined by unification. Similarly, the set of constraints applicable is derived by conjoining those specified explicitly within this element with those implied by the |
1424 | FDOV | attribute. When no base type is specified, no feature specification or constraint is inherited. |
1426 | FDOV | Although the present standard does provide for default feature values, feature inheritance is defined to be monotonic. |
1427 | FDOV | The process of combining constraints may result in a contradiction, for example if two specifications for the same feature specify disjoint ranges of values, and at least one such specification is mandatory. In such a case, there is no valid feature structure of the type being defined. |
1432 | FDOV | fsDecl type="Sub" baseTypes="Super1 Super2" |
1455 | FDFD | has three parts: an optional prose description (which should explain what the feature and its values represent), an obligatory range specification (which declares what values the feature is allowed to have), and an optional default specification (which declares what default value should be supplied when the named feature does not appear in an |
1460 | FDFD | has no value provided, or the value |
1466 | FDFD | either has no default specified, or has conditional defaults, none of the conditions on which is met, |
1468 | FDFD | then the value of this feature in the feature structure's most general valid extension is the most general value provided in its |
1470 | FDFD | , in the case of a unit organization, or the singleton set, bag, or list containing that element, in the case of a complex organization. If the feature: |
1473 | FDFD | has no value provided, or the value |
1477 | FDFD | either has a default specified, or has conditional defaults, one of the conditions on which is met, |
1479 | FDFD | then this feature does have a value in the feature structure's most general valid extension when it exists, namely the default value that pertains. |
1481 | FDFD | It is possible that a feature structure will not have a valid extension because the default value that pertains to a feature is not consistent with that feature's declared range. Additional tools are required for the enforcement of such criteria. |
1492 | FDFD | The logic for validating feature values and for matching the conditions for supplying default values is based on the operation of |
1506 | FDFD | containing the value |
1510 | FDFD | . The negation of a value |
1515 | FDFD | ) subsumes any value that is not |
1519 | FDFD | subsumes any numeric value other than zero. |
1520 | FDFD | The value |
1521 | FDFD | fs type="X"/ |
1524 | FDFD | , even if it is not valid. |
1534 | FDFD | The INV feature, which encodes whether or not a sentence is inverted, allows only the values plus (+) and minus (-). If the feature is not specified, then the default rule (FSD 1 above) says that a value of minus is always assumed. The feature declaration for this feature would be encoded as follows: |
1544 | FDFD | The value range is specified as an alternation (more precisely, an exclusive disjunction), which can be represented by the |
1546 | FDFD | feature value. That is, the value must be either true or false, but cannot be both or neither. |
1548 | FDFD | The CONJ feature indicates the surface form of the conjunction used in a construction. The ~ in the default rule (see FSD 2 above) represents negation. This means that by default the feature is not applicable, in other words, no conjunction is taking place. Note that CONJ not being present is distinct from CONJ being present but having the NIL value allowed in the value range. In their analysis, NIL means that the phenomenon of conjunction is taking place but there is no explicit conjunction in the surface form of the sentence. The feature declaration for this feature would be encoded as follows: |
1568 | FDFD | is not strictly necessary in this case, since the binary value of |
1572 | FDFD | The COMP feature indicates the surface form of the complementizer used in a construction. In value range, it is analogous to CONJ. However, its default rule (see FSD 9 above) is conditional. It says that if the verb form is infinitival (the VFORM feature is not mentioned in the rule since it is the only feature that can take INF as a value), and the construction has a subject, then a |
1598 | FDFD | The AGR feature stores the features relevant to subject-verb agreement. Gazdar et al. specify the range of this feature as CAT. This means that the value is a |
1599 | FDFD | category |
1600 | FDFD | , which is their term for a feature structure. This is actually too weak a statement. Not just any feature structure is allowable here; it must be a feature structure for agreement (which is defined in the complete example at the end of the chapter to contain the features of person and number). The following feature declaration encodes this constraint on the value range: |
1605 | FDFD | That is, the value must be a feature structure of type |
1608 | FDFD | fsDecl type="Agreement" |
1610 | FDFD | fDecl name="PERS" |
1612 | FDFD | fDecl name="NUM" |
1615 | FDFD | The PFORM feature indicates the surface form of the preposition used in a construction. Since PFORM is specified above as an open set, |
1626 | FDFD | subsumes any string that is not the empty string. |
1646 | FDFS | Ensuring the validity of feature structures may require much more than simply specifying the range of allowed values for each feature. There may be constraints on the co-occurrence of one feature value with the value of another feature in the same feature structure or in an embedded feature structure. |
1648 | FDFS | Such constraints on valid feature structures are expressed as a series of conditional and biconditional tests in the |
1652 | FDFS | . A particular feature structure is valid only if it meets all the constraints. The |
1654 | FDFS | element encodes the conventional if-then conditional of boolean logic which succeeds when both the antecedent and consequent are true, or whenever the antecedent is false. The |
1656 | FDFS | element encodes the biconditional (if and only if) operation of boolean logic. It succeeds only when the corresponding if-then conditionals in both directions are true. |
1657 | FDFS | In feature structure constraints the antecedent and consequent are expressed as feature structures; they are considered true if they |
1660 | FDFS | ) the feature structure in question, but in the case of consequents, this truth is asserted rather than simply tested. That is to say, a conditional is enforced by determining that the antecedent does not (and will never) subsume the given feature structure, or by determining that the antecedent does subsume the given feature structure, and then unifying the consequent with it (the result of which, if successful, will be subsumed by the consequent). In practice, the enforcement of such constraints can result in periods in which the truth of a constraint with respect to a given feature structure is simply not known; in this case, the constraint must be persistently monitored as the feature structure becomes more informative until either its truth value is determined or computation fails for some other reason. |
1675 | FDFS | The first constraint says that if a construction is inverted, it must also have an auxiliary and a finite verb form. That is, |
1683 | FDFS | The second constraint says that if a construction has a BAR value of zero (i.e., it is a sentence), then it must have a value for the features N, V, and SUBCAT. By the same token, because it is a biconditional, if it has values for N, V, and SUBCAT, it must have BAR='0'. That is, |
1694 | FDFS | The final constraint says that if a construction has a BAR value of 1 (i.e., it is a phrase), then the SUBCAT feature should be absent (~). This is not biconditional, since there are other instances under which the SUBCAT feature is inappropriate. That is, |
1830 | FSDEF | The elements discussed in this chapter constitute a module of the TEI scheme which is formally defined as follows: |
1844 | FSDEF | The selection and combination of modules to form a TEI schema is described in |
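
The distinction between list, set, and bag organizations described in row 616 of the table above can be illustrated with a small comparison routine. This is only a sketch, not part of the Guidelines: the function name `same_collection` and the organization labels `"list"`, `"bag"`, and `"set"` are assumptions chosen for illustration, and values are assumed to be simple hashable items such as strings.

```python
from collections import Counter


def same_collection(org, a, b):
    """Return True if a and b denote the same collection under organization org.

    Hypothetical helper: org is assumed to be one of "list", "bag", or "set".
    """
    if org == "list":
        # Lists: order matters and repetitions count.
        return list(a) == list(b)
    if org == "bag":
        # Bags: order does not matter, but repetitions count.
        return Counter(a) == Counter(b)
    if org == "set":
        # Sets: neither order nor repetition matters.
        return set(a) == set(b)
    raise ValueError(f"unknown organization: {org!r}")


# The same two sequences compared under each organization.
x = ["sg", "pl", "sg"]
y = ["pl", "sg"]
results = {org: same_collection(org, x, y) for org in ("list", "bag", "set")}
```

Here `x` and `y` differ as lists (different order and length) and as bags (`"sg"` occurs twice in `x`), but are the same when treated as sets.
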
# | id | text |
---|---|---|
3 | CE | Encoders of text often find it useful to indicate that some aspects of the encoded text are problematic or uncertain, and to indicate who is responsible for various aspects of the markup of the electronic text. These Guidelines provide several methods of recording uncertainty about the text or its markup: |
8 | CE | may be used with a value of |
9 | CE | certainty |
20 | CE | element defined in this chapter may be used to record the accuracy with which some numerical value (such as a date or quantity) is provided by some other element or attribute. |
24 | CE | element defined in the module for linking and segmentation may be used to provide alternative encodings for parts of a text, as described in section |
28 | CE | the TEI header records who is responsible for an electronic text by means of the |
48 | CE | element may be used with a value of |
49 | CE | resp |
63 | CE | elements, since they are defined in the core module and header respectively. The |
65 | CE | element is only available when the module for linking has been selected, as described in chapter |
72 | CE | elements, the module for certainty and responsibility must be selected. |
81 | CE | These attributes enable statements about certainty, precision, or responsibility to be made with respect to the whole of a document, or any part or parts of it which can be identified using standard XML location methods. Several examples are given in the discussion of the |
91 | CECERT | a given tag may or may not correctly apply (e.g. a given word may be a personal name, or perhaps not) |
95 | CECERT | the value given for an attribute is uncertain |
97 | CECERT | the content given for an element is unreliable for any reason. |
105 | CECERT | the numerical precision associated with a number or date (for this use the |
110 | CECERT | the content of the document being transcribed is identifiable, but may be read or understood in different ways (for this use the transcriptional elements such as |
115 | CECERT | a transcriber, editor, or author wishes to indicate a level of confidence in a factual assertion made in the text (for this use the interpretative mechanisms discussed in |
123 | CECENO | The simplest way of recording uncertainty about markup is to attach a note to the element or location about which one is unsure. In the following (invented) paragraph, for example, an encoder might be uncertain whether to mark |
125 | CECENO | as a place name or a personal name, since both might be plausible in the given context: |
140 | CECENO | Using the normal mechanisms, the note may be associated unambiguously with specific elements of the text, thus: |
166 | CECECE | is in fact a place name, as it is tagged, we use the |
171 | CECECE | name |
180 | CECECE | element is placed in a document; it may be placed adjacent to the target element, or elsewhere in the same or another document. Its position is however significant when the |
186 | CECECE | really is a place name here. The |
190 | CECECE | element, expressed as a number between 0 and 1: |
193 | CECECE | This expresses the point of view that there is a 60 percent chance of |
195 | CECECE | being a place name here, and hence a 40 percent chance of its being a personal name. We can use two |
197 | CECECE | elements to indicate the two probabilities independently. Both elements indicate the same location in the text, but the second provides an alternative choice of name identifier (in this case |
199 | CECECE | ), which is given as the value of the |
210 | CECECE | In the simplest case, it is also possible to place the |
218 | CECECE | is specified, by default the proposed certainty applies to its parent element, in this case the |
230 | CEconcon | attribute to list the identifiers of |
256 | CEconcon | element is interpreted as claiming a given degree of confidence in a particular markup given the assertional content of the |
258 | CEconcon | elements indicated. That is, a conjectural assertion is being made solely on the assumption that the interpretation indicated by the element named by the |
266 | CEconcon | as a personal name or a place name, assigning a 60 percent probability to the former. If it is a place name, there may be a 50 percent chance that the place name actually in question is |
270 | CEconcon | , while if it is correctly tagged as a personal name, it is much more likely (say, 90 percent certain) that the name is |
272 | CEconcon | . Hence there is uncertainty about the correct location for the markup as well as about which markup to use. This state of affairs can be expressed using the |
296 | CEconcon | Multiplying the numeric values out, this markup may be interpreted as assigning specific probabilities to three different ways of marking up the sentence: |
304 | CEconcon | The probabilities do not add up to 1.00 because the markup indicates that if |
306 | CEconcon | is (part of) a personal name, there is a 10 percent likelihood that the element should start somewhere other than the place indicated, without however giving an alternative location; there is thus a 6 percent chance (0.1 × 0.6) that none of the alternatives given is correct. |
313 | CECECE | attribute may be used to supply a pattern identifying the portion of a document concerning which certainty is being expressed. The value of the |
324 | CECECE | has been supplied here, and so by default the |
326 | CECECE | expressed would therefore apply to the parent element. However, in this case the XPath supplied as the value for |
328 | CECECE | returns a set of all the |
347 | CECECE | value of |
352 | CECECE | If an element in a document is matched by more than one match expression, then the most specific pattern applies. |
355 | CECECE | As a simple case, if both the preceding |
360 | CECECE | div type="checked" |
361 | CECECE | element would potentially match both pattern expressions. However, because the second pattern is more specific than the first, it is the only one that applies. If multiple patterns match and have the same priority, then the first one (in document order) is applied. Only those statements of certainty which have matched in this sense are available for conditional application using the |
363 | CECECE | attribute mentioned above. |
367 | CECECE | attribute is processed, the namespace bindings in force are those in effect at that point in the document. For example, |
373 | CECECE | might be used to indicate a high degree of certainty about the content of any elements taken from the namespace associated with the prefix |
375 | CECECE | . This namespace prefix must be associated with an appropriate namespace definition, either on the |
382 | CECECE | Doubts about whether the content of an element is correct may also be expressed by assigning to |
384 | CECECE | the value |
385 | CECECE | value |
386 | CECECE | . For example, if the source is hard to read and so the transcription is uncertain: |
404 | CECECE | attribute should be used to provide an alternative value for whatever aspect of the markup is in doubt: an alternative name, or the identifier of an alternative starting or ending point, as already shown, an alternative attribute value, or alternative element content, as in this example: |
412 | CECECE | attribute is not generally useful for specifying alternative transcriptions; it cannot for example be used if the alternative reading contains markup of any kind. More robust methods of handling uncertainties of transcription are the |
421 | CECECE | element allows for indications of uncertainty to be structured with at least as much detail and clarity as appears to be currently required in most ongoing text projects. |
430 | CECECE | data.pointer |
431 | CECECE | as values and may thus also contain an XPath expression of arbitrary complexity. Because full support for XPath is not provided by current processors, this is not generally recommended TEI practice. There are however some simple cases in which XPath syntax is to be preferred, notably those in which the |
437 | CECECE | attribute has the value |
447 | CECECE | value (expressed as a URI) and a |
449 | CECECE | value (expressed as an XPath). The former defines the context within which the latter is to be evaluated. As previously noted, if no value is supplied for |
451 | CECECE | , the context within which the value of |
457 | CECECE | A typical case where it may be convenient to specify both |
461 | CECECE | is that where we wish to indicate that the value of an attribute on some specific element is uncertain. In this case, the |
463 | CECECE | attribute takes the value |
464 | CECECE | value |
465 | CECECE | . For example, supposing there is only a 50 percent chance that the question was spoken by participant A: |
477 | CECECE | attributes together provide a powerful mechanism which can be used to indicate precision for a large number of assertions throughout an encoded document in an economical way. Some further examples follow: |
480 | CECECE | This encoding indicates that there is only a 0.2 certainty that the boundaries of all |
487 | CECECE | This encoding indicates that there is only a 0.2 certainty that the boundaries of the |
491 | CECECE | value |
499 | CECECE | This encoding indicates that there is only a 0.2 certainty that the value for the |
508 | CECECE | This encoding indicates that there is only a 0.2 certainty that any value for the |
514 | CECECE | This encoding indicates that there is only a 0.2 certainty that the value for the |
522 | CECECE | This encoding indicates that there is only a 0.2 certainty that the content of any element the |
524 | CECECE | attribute of which has the value |
530 | CECECE | element and the other TEI mechanisms for indicating uncertainty provide a range of methods of graduated complexity. Simple expressions of uncertainty may be made by using the |
536 | CECECE | element, and in cases where highly structured certainty information must be given, it is recommended that the |
550 | CEPREC | As noted above, certainty about the accuracy of an encoding or its content is not the same thing as the |
551 | CEPREC | precision |
552 | CEPREC | with which a value is specified. In the case of a date or a quantity, for example, we might be certain that the value given is imprecise, or uncertain about whether or not the value given is correct. The latter possibility would be represented by the |
558 | CEPREC | The elements concerning which statements of precision are to be made are identified using the same |
570 | CEPREC | several ways of indicating ranges of values were introduced. For example, if we know that a date falls between 1930 and 1935, without being certain exactly where, this fact may be encoded using attributes |
578 | CEPREC | Equally, if we know that every page of a manuscript has a width of at least 10 cm but no more than 30, we can use the attributes |
586 | CEPREC | Suppose however that the precision with which the value of such an attribute can be specified is variable. For example, suppose an event is dated |
587 | CEPREC | about fifty years after the death of Augustus |
588 | CEPREC | . In this case, the precision of one end of the range (the death of Augustus) is higher than that of the other, assuming we know when Augustus died. We can say that the latest possible date is probably 50 years after that, but with less confidence than we can attach to the earliest possible date. |
592 | CEPREC | element allows us to indicate the two attributes concerned and attach different levels of precision to them, using a mechanism similar to that provided for the |
601 | CEPREC | In much the same way, we may wish to indicate different levels of precision about the dating of either end of a historical period. For example, the elements defined for encoding personal data all bear a similar set of attributes to indicate normalized values for earliest or latest dates, etc. (see section |
602 | CEPREC | ); the precision of these attribute values may be indicated in exactly the same way. For example, |
608 | CEPREC | It may also be useful to indicate that the precisions given for minimum and maximum quanta differ. For example, to indicate that all pages measure at least 10 cm wide, and at most |
621 | CEPREC | might be used to record the average number of characters per line in a typescript. If in addition we wish to record the standard deviation for the values summarized by that average, this would require an additional |
632 | CERESP | In general, attribution of responsibility for the transcription and markup of an electronic text is made by |
634 | CERESP | elements within the header: specifically, within the title statement, the edition statement(s), and the revision history. |
636 | CERESP | In some cases, however, more detailed element-by-element information may be desired. For example, an encoder may wish to distinguish between the individuals responsible for transcribing the content and those responsible for determining that a given word or phrase constitutes a proper noun. Where such fine-grained attribution of responsibility is required, the |
665 | CERESP | element at the location indicated: |
676 | CERESP | Similarly, in the following example, we indicate that RC is responsible for proposing the value of the |
688 | CE | The module described in this chapter makes available the following additional elements: |
699 | CE | The selection and combination of modules to form a TEI schema is described in |
# | id | text |
---|---|---|
5 | PH | provides elements for the encoding of digital facsimiles or images of such materials, while the remainder of the chapter discusses ways of encoding detailed transcriptions of such materials. This module may also be useful in the preparation of critical editions, but the module defined here is distinct from that defined in chapter |
7 | PH | , but again the present module may be used independently if such data is not required. |
13 | PH | to the encoding of printed matter or indeed any form of written source, including monumental inscriptions. Similarly, where in the following descriptions terms such as |
16 | PH | author |
18 | PH | editor |
25 | PH | plays a role analogous to the |
27 | PH | , while in an authorial manuscript, the author and the scribe are the same person. |
32 | PHFAX | These Guidelines are mostly concerned with the preparation of digital texts in which pre-existing sources are transcribed or otherwise converted into character form, and marked up in XML. However, it is also very common practice to make a different form of |
33 | PHFAX | digital text |
34 | PHFAX | , which is instead composed of digital images of the original source, typically one per page, or other written surface. We call such a resource a |
35 | PHFAX | digital facsimile |
36 | PHFAX | . A digital facsimile may, in the simplest case, just consist of a collection of images, with some metadata to identify them and the source materials portrayed. It may sometimes contain a variety of images of the same source pages, perhaps of different resolutions, or of different kinds. Such a collection may form part of any kind of document, for example a commentary of a codicological or paleographic nature, where there is a need to align explanatory text with image data. It may also be complemented by a transcribed or encoded version of the original source, which may be linked to the page images. In this section we present elements designed to support these various possibilities and discuss the associated mechanisms provided by these Guidelines. |
56 | PHFAX | In the simple case where a digital text is composed of page images, the |
74 | PHFAX | attribute represents the whole of the text following the |
78 | PHFAX | element. Any convenient milestone element (see further |
79 | PHFAX | ) could be used in the same way; for example if the images represent individual columns, the |
81 | PHFAX | element might be used. Though simple, this method has some drawbacks. It does not scale well to more complex cases where, for example, the images do not correspond exactly with transcribed pages, or where the intention is to align specific marked up elements with detailed images, or parts of images. The management of information about the images may become more difficult if references to them are scattered through many files rather than being concentrated in a single identifiable location. Nevertheless, this solution may be adequate for many straightforward |
97 | PHFAX | , which are also provided by this module. These elements make it possible to accommodate multiple images of each page, as well as to record the position and relative size of elements identified on any kind of written surface and to link such elements with digital facsimile images of them. Typical applications include the provision of full text search in |
98 | PHFAX | digital facsimile editions |
99 | PHFAX | , and ways of annotating graphics, for example so as to identify individuals appearing in group portraits and link them to data about the people represented. |
114 | PHFAX | elements may be used to represent a digital facsimile. Either may appear within a TEI document along with, or instead of, the |
119 | PHFAX | element is designed for the case where the digital facsimile contains only images, whereas the |
121 | PHFAX | element is for use in the case where such images are complemented by a documentary transcription. In this section, we first discuss the simpler case, returning to the use of the |
124 | PHFAX | below. When this module is selected, therefore, a legal TEI document may comprise any of the following: |
126 | PHFAX | a TEI header and a text element |
128 | PHFAX | a TEI header and a facsimile element |
130 | PHFAX | a TEI header and a sourceDoc element |
132 | PHFAX | a TEI header, a facsimile element, and a text element |
134 | PHFAX | a TEI header, one or more sourceDoc or facsimile elements, and a text element |
150 | PHFAX | In the simplest case, a facsimile just contains a series of |
169 | PHFAX | In this simple case, the four page images are understood to represent the complete facsimile, and are to be read in the sequence given. Suppose, however, that the second page of this particular work is available both as an ordinary photograph and as an infra-red image, or in two different resolutions. The |
171 | PHFAX | element may be used to group the two image files, since these correspond with the same area of the work: |
186 | PHFAX | element provides a way of indicating that the two images of page2 represent the same surface within the source material. A |
187 | PHFAX | surface |
188 | PHFAX | might be one side of a piece of paper or parchment, an opening in a codex treated as a single surface by the writer, a face of a monument, a billboard, a membrane of a scroll, or indeed any two-dimensional surface, of any size. |
209 | PHFAX | Simply grouping related graphics is not however the main purpose of the |
211 | PHFAX | element: rather it is to help identify the location and size of the various two-dimensional spaces constituting the digital facsimile. Note that the actual dimensions of the object represented are not provided by the |
215 | PHFAX | element defines an abstract coordinate space which may be used to address parts of the image. Four attributes supplied by the |
223 | PHFAX | By default, the same coordinate space is used for a |
226 | PHFAX | The coordinate space may be thought of as a grid superimposed on a rectangular space. Rectangular areas of the grid are defined as four numbers |
227 | PHFAX | a b c d |
232 | PHFAX | points from the origin along the |
236 | PHFAX | points from the origin along the |
239 | PHFAX | It may be most convenient to derive a coordinate space from a digital image of the surface in question such that each pixel in the image corresponds with a whole number of units (typically 1) in the coordinate space. In other cases it may be more convenient to use units such as millimetres. Neither practice implies any specific mapping between the coordinate system used and the actual dimensions of the physical object represented. |
245 | PHFAX | elements, each of which represents a region or |
247 | PHFAX | defined in terms of the same coordinate space as that of its parent |
249 | PHFAX | element. A zone may be rectangular or non-rectangular: a rectangular zone is defined by a sequence of four coordinates in the same way as a surface; a non-rectangular zone is defined using the attribute |
251 | PHFAX | , which provides a sequence of coordinates, each of which specifies a point on the perimeter of the zone. |
256 | PHFAX | in the same form as that required by the |
263 | PHFAX | A zone may be used to define any region of interest, such as a detail or illustration, or some part of the surface which is to be aligned with a particular text element, or otherwise distinguished from the rest of the surface. A surface establishes a coordinate system which may be used to address parts or the whole of some digital representation of a written surface. A zone, by contrast, defines any arbitrary area of interest relative to that surface, using the same coordinate system. It might be bigger or smaller than its parent surface, or might overlap its boundaries. The only constraint is that it must be defined using the same coordinate system. |
265 | PHFAX | When an image of some kind is supplied within either a zone or a surface, the implication is that the image represents the whole of the zone or surface concerned. In the simple case therefore, we might imagine a surface defining a page, within which there is a graphic representing the whole of that page, and a number of zones defining parts of the page, each with its own graphic, each representing a part of the page. If however one of those graphics actually represents an area larger than the page (for example to include a binding or the surface of a desk on which the page rests), then it will be enclosed by a zone with coordinates larger than those of the parent surface. |
273 | PHFAX | This is an image of a two page spread from a manuscript in the Badische Landesbibliothek, Karlsruhe. We have no information as to the dimensions of the original object, but the low resolution image displayed here contains 500 pixels horizontally and 321 pixels vertically. For convenience, we might map each pixel to one cell of the coordinate space. |
274 | PHFAX | The coordinate space used here is based on pixels, but the mapping between pixels and units in the coordinate space need not be one-to-one; it might be convenient to define a more delicate grid, to enable us to address much smaller parts of the image. This can be done simply by supplying appropriate values for the attributes which define the coordinate space; for example doubling them all would map each pixel to two grid points in the coordinate space. |
279 | PHFAX | element corresponding with the area of the image which represents the whole of the two page spread and embed the graphic within it: |
315 | PHFAX | elements may be used to identify parts of a surface for analytical purposes. |
317 | PHFAX | The relationship between zone and surface can be quite complex: for example, it may be appropriate to treat the whole of a two page spread as a single written surface, perhaps because particular written zones span both pages. A zone may contain a nested surface, if for example a page has an additional scrap of paper attached to it. A zone may be of any shape, not simply rectangular. Discussion of these and other cases is provided in section |
320 | PHFAX | In the following extended example, we discuss a hypothetical digital edition of an early 16th century French work, Charles de Bovelles' |
323 | PHFAX | The image is taken from the collection at |
329 | PHFAX | element used to contain the whole set of pages, we define a |
340 | PHFAX | We can now identify distinct zones within the page image using the coordinate scale defined for the surface. In the following figure |
348 | facs-fig1 | Detail of p 49r from Bovelles |
351 | PHFAX | The following encoding defines each of the four zones identified in the figure above. |
365 | PHFAX | Note that the location of each zone is defined independently but using the same coordinate system. |
381 | PHFAX | element has been associated directly with the surface of the page rather than nesting it within a zone. However, it is also possible to include multiple |
385 | PHFAX | element, if for example a detailed image is available. Since all |
389 | PHFAX | ), there is no need to demonstrate enclosure of one zone within another by means of nesting. To continue the current example, supposing that we have an additional image called |
391 | PHFAX | containing an additional image of the figure in the third zone above, we might encode that zone as follows: |
402 | PH-transcr | A digitized source document may contain nothing more than page images and a small amount of metadata. It may also contain an encoded transcription of the pages represented, which may either be |
406 | PH-transcr | element, or supplied in parallel with a |
410 | PH-transcr | If the transcription is regarded as a text in its own right, organized and structured independently of its physical realization in the document or documents represented by the facsimile, then the recommended practice is to use the |
419 | PH-transcr | below. Alternatively, if the transcription is intended not to prioritize representation of the final text so much as the process by which the document came to take its present form, or the physical disposition of its component parts, it may be preferable to present it as an embedding transcription, as further described in section |
425 | PH-bov | Suppose now that we wish to align a transcription of the page discussed in the preceding section with particular zones. We begin by giving each relevant part of the facsimile an identifier: |
492 | PH-bov | attribute, which supplies the identifier of the element containing at least the start of the transcribed text found within the surface or zone concerned. Thus, another way of linking this page with its transcription would be simply |
546 | PHZLAB | When supplied within a |
548 | PHZLAB | element, these elements may contain transcriptions of the written content of a source in addition to or as an alternative to digital images of them. Such transcription may be placed directly within the |
552 | PHZLAB | elements, for cases where the writing is linear, in the sense that it is composed of discrete tokens organized physically into groups, typically organized in a sequence corresponding with the way they are intended to be read. Depending on the directionality of the writing system used, this might be any combination of top-down and left to right, or vice versa. The element |
554 | PHZLAB | may be used to hold a complete group of such tokens. Where, however, the lineation is not considered significant, any group of tokens may be indicated using the |
565 | PHZLAB | Returning to the preceding example, we might transcribe the content of the zone to which we gave the identifier |
598 | PHZLAB | As mentioned above, some or all of the written surfaces being transcribed may be composed of physically distinct scraps. In the following example, taken from the Walt Whitman Archive, two pieces of newsprint have been glued to a piece of blue paper on which a poem is being drafted: |
601 | sleeprs | Single leaf of notes possibly related to the poem eventually titled Sleepers. From the Walt Whitman Archive (Duke 258). |
603 | PHZLAB | The two pieces of newsprint might simply be regarded as special kinds of zone, but they are also new surfaces, since they might contain additional written zones themselves (such as the numbers in this case). |
650 | PHZLAB | elements identified in the transcription. The encoder may choose to complement a transcription with graphic representations of its source at whatever level is considered effective, or not at all. Equally, the encoder may choose to provide only graphics without any transcription, to provide only a structured (non-embedded) transcription, or to provide any combination of the three. |
654 | PHZLAB | element they are to be found, other than the reading order implicit in their sequence. Such information could be added if desired by specifying a coordinate system on the outermost |
656 | PHZLAB | element, and then indicating values within that system for each of the two fragments, as was discussed above. We discuss this in further detail in section |
666 | PHST | transcription or a critical edition. In either case they may also wish to include other editorial material, such as comments on the status or possible origin of particular readings, corrections, or text supplied to fill lacunae. |
672 | PHST | of writing in one or more documents. Transcriptions of this kind are closely focussed on the physical appearance of specific documents, needing to distinguish the traces of different writing activities on them, such as additions and deletions but also other indications of how the writing is to be read, such as indications of transposition, re-affirmation of writing which has been deleted, and so on. Such distinctions are considered of particular importance when dealing with authorial manuscripts, but are also relevant in the case of historical sources such as charters or other legal documents. |
674 | PHST | In either case, it is customary in transcriptions to register certain features of the source, such as ornamentation, underlining, deletion, areas of damage and lacunae. This chapter provides ways of encoding such information: |
676 | PHST | methods of recording editorial or other alterations to the text, such as expansion of abbreviations, corrections, conjectures, etc. (section |
679 | PHST | methods of describing important extra-linguistic phenomena in the source: unusual spaces, lines, page and line breaks, changes of manuscript hand, etc. (section |
685 | PHST | methods of representing aspects of layout such as spacing or lines |
688 | PHST | methods of representing material such as running heads, catch-words, and the like (section |
696 | PHST | , etc. are used to mark writing traces and their functions within the document. Each such element can be assigned to one or more editorially-defined modification groups, termed a |
697 | PHST | change |
700 | PHST | attribute, which references a definition for the modification group concerned, typically provided within the TEI header |
717 | PHST | These recommendations are not intended to meet every transcriptional circumstance likely to be faced by any scholar. Rather, they should be regarded as a base which can be elaborated if necessary by different scholars in different disciplines |
720 | PHST | As a rule, all elements which may be used in the course of a transcription of a single witness may also be used in a critical apparatus, i.e. within the elements proposed in chapter |
721 | PHST | . This can generally be achieved by nesting a particular reading containing tagged elements from a particular witness within the |
727 | PHST | Just as a critical apparatus may contain transcriptional elements within its record of variant readings in various witnesses, one may record variant readings in an individual witness by use of the apparatus mechanisms |
737 | PHCH | In the detailed transcription of any source, it may prove necessary to record various types of actual or potential alteration of the text: expansion of abbreviations, correction of the text (either by author, scribe, or later hand, or by previous or current editors or scholars), addition, deletion, or substitution of material, and similar matters. The sections below describe how such phenomena may be encoded using either elements defined in the core module (defined in chapter |
738 | PHCH | ) or specialized elements available only when the module described in this chapter is available. |
757 | PHCO | All of these elements bear additional attributes for specifying who is responsible for the interpretation represented by the markup, and the associated certainty. In addition, some of them bear an attribute allowing the markup to be categorized by type and source. |
766 | PHCO | The following sections describe how the core elements just named may be used in the transcription of primary source materials. |
772 | PHAB | The writing of manuscripts by hand lends itself to the use of abbreviation to shorten scribal labour. Commonly occurring letters, groups of letters, words, or even whole phrases, may be represented by significant marks. This phenomenon of manuscript abbreviation is so widespread and so various that no taxonomy of it is here attempted. Instead, methods are shown which allow abbreviations to be encoded using the core elements mentioned above. |
774 | PHAB | A manuscript abbreviation may be viewed in two ways. One may transcribe it as a particular sequence of letters or marks upon the page: thus, a |
775 | PHAB | p with a bar through the descender |
781 | PHAB | per |
783 | PHAB | re |
788 | PHAB | In many cases the glyph found in the manuscript source also exists in the Unicode character set: for example the common Latin brevigraph ⁊, standing for |
792 | PHAB | can be directly represented in any XML document as the Unicode character with code point |
803 | PHAB | These two methods of coding abbreviation may also be combined. An encoder may record, for any abbreviation, both the sequence of letters or marks which constitutes it, and its sense, that is, the letter or letters for which it is believed to stand. For example, in the following fragment the phrase |
805 | PHAB | is represented by a sequence of abbreviated characters: |
826 | PHAB | Note that in each case the |
859 | PHAB | When abbreviated forms such as these are expanded, two processes are carried out: some characters not present in the abbreviation are added (always), and some characters or glyphs present in the abbreviation are omitted or replaced (often). For example, when the abbreviation |
871 | PHAB | element surrounds characters or signs such as tittles or tildes, used to indicate the presence of an abbreviation, which are typically removed or replaced by other characters in the expanded form of the abbreviation: |
887 | PHAB | The content of the |
905 | PHAB | As implied in the preceding discussion, making decisions about which of these various methods of representing abbreviation to use will form an important part of an encoder's practice. As a rule, the |
909 | PHAB | elements should be preferred where it is wished to signify that the content of the element is an abbreviation, without necessarily indicating what the abbreviation may stand for. The |
913 | PHAB | elements should be used where it is wished to signify that the content of the element is not present in the source but has been supplied by the transcriber, without necessarily indicating the abbreviation used in the original. The decision as to which course of action is appropriate may vary from abbreviation to abbreviation; there is no requirement that the same system be used throughout a transcription, although doing so will generally simplify processing. The choice is likely to be a matter of editorial policy. If the highest priority is to transcribe the text |
915 | PHAB | (letter by letter), while indicating the presence of abbreviations, the choice will be to use |
919 | PHAB | throughout. If the highest priority is to present a reading transcription, while indicating that some letters or words are not actually present in the original, the choice will be to use |
934 | PHAB | , a note is attached to an editorial expansion of the tail on the final d of |
951 | PHAB | The editor might declare a degree of certainty for this expansion, based on the OED examples, and state the responsibility for the expansion: |
955 | PHAB | The value supplied for the |
957 | PHAB | attribute should point to the name of the editor responsible for this and possibly other interventions; an appropriate element therefore might be a |
959 | PHAB | element in the header like the following: |
972 | PHAB | element only to indicate confidence in the content of the element (i.e. the expansion), and responsibility for suggesting this expansion respectively. |
984 | PHAB | If it is desired to express aspects of certainty and responsibility for some other aspect of the use of these elements, then the mechanisms discussed in chapter |
986 | PHAB | for discussion of the issues of certainty and responsibility in the context of transcription. |
1025 | PHCC | and its correction |
1038 | PHCC | element is used to provide a corrected form which is |
1040 | PHCC | present in the source; in the case of a correction made in the source itself, whether scribal, authorial, or by some other hand, the |
1053 | PHCC | element indicates the transcriber's correction of them. Where the transcriber considers that one or more words have been erroneously omitted in the original source and corrects this omission, the |
1058 | PHCC | . Thus, in the following example, from George Moore's draft of additional materials for |
1072 | PHCC | , the choice as to whether to record simply that there is an apparent error, or simply that a correction has been applied, or to record both possible readings within a |
1074 | PHCC | element is left to the encoder. The decision is likely to be a matter of editorial policy, which might be applied consistently throughout or decided case by case. If the highest priority is to present an uncorrected transcription while noting perceived errors in the original, the choice will typically be to use only |
1076 | PHCC | throughout. If the highest priority is to present a reading transcription, while indicating that perceived errors in the original have been corrected, the choice will be to use only |
1119 | PHCC | is used to indicate who is responsible for the proposed emendation. Its value is a pointer, which will typically indicate a |
1123 | PHCC | element in the header of the transcribed document, but can point anywhere, for example to some online authority file. Using these two attributes, the |
1154 | PHCC | element. However, if the number of corrections is large and the number of notes is small, it may well be both more practical and more appropriate to regard the collection of annotations as constituting a typology and then use the |
1156 | PHCC | attribute. Suppose that the note given above is one of half a dozen possible kinds of corrected phenomena identified in a given text; others might include, say, |
1157 | PHCC | repetition of a word from the preceding line |
1162 | PHCC | element can be used to specify an arbitrary code for the particular kind of correction (or other editorial intervention) identified within it. This code can be chosen freely and is not treated as a pointer. |
1175 | PHCC | In addition, the conscientious encoder will provide documentation explaining the circumstances in which particular codes are judged appropriate. A suitable location for this might be within the |
1196 | PHCC | choice type="substitution" subtype="graphicResemblance" |
1203 | PHCC | attributes automatically. This is easily done but requires customization of the TEI system using techniques described in |
1207 | PHCC | When making a correction in a source which forms part of a textual tradition attested by many witnesses, a textual editor will sometimes use a reading from one witness to correct the reading of the source text. In the general case, such encoding is best achieved with the mechanisms provided by the module for textual criticism described in chapter |
1214 | PHCC | mentioned above, Parkes proposes to emend the problematic word |
1223 | PHCC | The value of the |
1225 | PHCC | attribute here is, like the value of the |
1227 | PHCC | attribute, a pointer, in this case indicating the manuscript used as a witness. Elsewhere in the transcribed text, a list of witnesses used in this text will be given, one of which has an identifier |
1229 | PHCC | . Each witness will be represented either by a |
1266 | PHCC | attribute were supplied on the |
1268 | PHCC | element, it would indicate the person responsible for asserting that the manuscript indicated has this reading, who is not necessarily the same as the person responsible for asserting that this reading should be used to correct the others. Editorial intervention elements such as |
1272 | PHCC | to provide this additional information: |
1283 | PHCC | found in Gg is regarded as a correction by Parkes. |
1295 | PHCC | element, these attributes indicate confidence in and responsibility for identifying the reading within the sources specified; when used on the |
1297 | PHCC | element they indicate confidence in and responsibility for the use of the reading to correct the base text. If no other source is indicated (either by the |
1303 | PHCC | ), the reading supplied within a |
1305 | PHCC | has been provided by the person indicated by the |
1309 | PHCC | If it is desired to express certainty of or responsibility for some other aspect of the use of these elements, then the mechanisms discussed in chapter |
1311 | PHCC | for further discussion of the issues of certainty and responsibility in the context of transcription. |
1317 | PHAD | Additions and deletions observed in a source text may be described using the following elements: |
1327 | PHAD | are included in the core module, while |
1331 | PHAD | are available only when using the module defined in this chapter. These particular elements are members of the |
1338 | PHAD | Further characteristics of each addition and deletion, such as the hand used, its effect (complete or incomplete, for example), or its position in a sequence of such operations may conveniently be recorded as attributes of these elements, all of which are members of the |
1384 | PHAD | attribute may be useful to indicate the classification; when they are classified by the manner in which they were effected, or by their appearance, however, this will lead to a certain arbitrariness in deciding whether to use the |
1392 | PHAD | attribute be reserved for higher level or more abstract classifications. |
1396 | PHAD | attribute is also available to indicate the location of an addition. For example, consider the following passage from a draft letter by Robert Graves: |
1420 | PHAD | above the line, and then deletes it. This may be encoded similarly: |
1426 | PHAD | has been added and then deleted: |
1434 | PHAD | , and then changed it; it may be that he inserted other punctuation marks between the letters before replacing them with the centre dots used elsewhere to represent this acronym. We do not deal with these possibilities here, and mention them only to indicate that any encoding of manuscript material of this complexity will need to make decisions about what is and is not worth mentioning. |
1442 | PHAD | , then deletes |
1462 | PHAD | elements defined in the core module suffice only for the description of additions and deletions which fit within the structure of the text being transcribed, that is, where each deletion or addition is completely contained by the structural element (paragraph, line, division) within which it occurs. Where this is not the case, for example because an individual addition or deletion involves several distinct structural subdivisions, such as poems or prose items, or otherwise crosses a structural boundary in the text being encoded, special treatment is needed. The |
1476 | PHAD | element is first declared, within the header of the document, to associate the identifier |
1478 | PHAD | with Helgi. Each of the added poems is encoded as a distinct |
1480 | PHAD | element. In the body of the text, an |
1482 | PHAD | element is placed to mark the beginning of the span of added text, and an |
1506 | PHAD | several occasions where sequences of whole lines are marked for deletion, either by boxes or by being struck out. If the encoder is marking up individual verse lines with the |
1528 | PHAD | It is also often the case that deletions and additions may themselves contain other deletions and additions. For example, in Thomas Moore's autograph of the second version of |
1543 | PHAD | In this case the |
1551 | PHAD | The text deleted must be at least partially legible, in order for the encoder to be able to transcribe it. If all or part of it is not legible, the |
1553 | PHAD | element should be used to indicate where text has not been transcribed, because it could not be. The |
1556 | PHAD | may be used to indicate areas of text which cannot be read with confidence. See further section |
1566 | PHSU | As we have shown, the simplest method of recording a substitution is simply to record both the addition and the deletion. However, when the module defined by this chapter is in use, additional elements are available to indicate that the encoder believes the addition and the deletion to be part of the same intervention: a substitution. |
1580 | PHSU | Since the purpose of this element is solely to group its child elements together, the order in which they are presented is not significant. When both deletion and addition are present, it may not always be clear which occurs first: using the |
1590 | PHSU | and this is then replaced by |
1594 | PHSU | This may be encoded as follows, representing the two changes as a sequence of additions and deletions: |
1606 | PHSU | to record text first added, then deleted in the source. The numbers assigned by the |
1608 | PHSU | attribute may be used to identify the order in which the various additions and deletions are believed by the encoder to have been carried out, and thus provide a simple method of supporting the kind of |
1617 | PHSU | The case of a single substitution or scribal correction that involves non-contiguous addition and deletion can be handled by using the |
1619 | PHSU | element to make an explicit connection between one or more |
1627 | PHSU | to group this |
1633 | PHSU | allows the encoder to indicate that additions and deletions separated in this way are part of a single scribal intervention: |
1688 | PHSU | in the last line is simply marked as a deletion; |
1695 | PHSU | provides similar facilities, by treating each state of the text as a distinct reading. The |
1717 | PHCD | An author or scribe may mark a word or phrase in some way, and then on reflection decide to cancel the marking. For example, text may be marked for deletion and the deletion then cancelled, thus restoring the deleted text. Such cancellation may be indicated by the |
1723 | PHCD | This element bears the same attributes as the other transcriptional elements. These may be used to supply further information such as the hand in which the restoration is carried out, the type of restoration, and the person responsible for identifying the restoration as such, in the same way as elsewhere. |
1725 | PHCD | Presume that Lawrence decided to restore |
1730 | PHCD | For I hate this my body |
1733 | PHCD | first deleted then restored by writing |
1740 | PHCD | Another feature commonly encountered in manuscripts is the use of circles, lines, or arrows to indicate transposition of material from one point in the text to another. No specific markup for this phenomenon is proposed at this time. Such cases are most simply encoded as additions at the point of insertion and deletions at the point of encirclement or other marking. |
1746 | PHOM | Where text is not transcribed, whether because of damage to the original, or because it is illegible, or for some other reason such as editorial policy, the |
1748 | PHOM | core element may be used to register the omission; where such text is transcribed, but the editor wishes to indicate that they consider it to be superfluous, for example because it is an inadvertent scribal repetition, the |
1750 | PHOM | element may be used in preference. Where text not present in the source is supplied (whether conjecturally or from other witnesses) to fill an apparent gap in the text, the |
1760 | PHOM | element has no content. It marks a point in the text where nothing at all can be read, whether because of authorial or scribal erasure, physical damage, or any other form of illegibility. Its attributes allow the encoder to specify the amount of text which is illegible in this way at this point, using any convenient units, where this can be determined. For example, in the Beerbohm manuscript of |
1762 | PHOM | cited above, the author has erased a passage amounting to about 10 cm in length by inking over it completely: |
1769 | PHOM | The degree of precision attempted when measuring the size of a gap will vary with the purpose of the encoding and the nature of the material: no particular recommendation is made here. |
1773 | PHOM | element should only be used where text has not been transcribed. If partially legible text has been transcribed, one of the elements |
1778 | PHOM | ); if the text is legible and has been transcribed, but the editor wishes to indicate that they regard it as superfluous or redundant, then the element |
1780 | PHOM | may be used in preference to the core element |
1782 | PHOM | used to indicate text regarded as erroneous. |
1784 | PHOM | Amongst the many examples cited in Hans Krummrey & Silvio Panciera's classic text on the editing of epigraphic inscriptions is the following. In a late classical inscription, the form |
1786 | PHOM | is encountered. The editor may choose any of the following three possibilities: |
1789 | PHOM | mark this as an erroneous form |
1794 | PHOM | additionally supply a corrected form |
1802 | PHOM | indicate that the erroneous form contains surplus characters which the editor wishes to suppress |
1825 | PHOM | here are metrically inconsistent with the rest and have been marked by the editor as such. |
1827 | PHOM | If some part of the source text is completely illegible or missing, an encoder may sometimes wish to supply new (conjectural) material to replace it. This conjectural reading is analogous to a correction in that it contains text provided by the encoder and not attested in the source. This is not however a correction, since no error is necessarily present in the original; for that reason a different element |
1830 | PHOM | I am dear Sir your very humble Servt Sydney Smith |
1831 | PHOM | , the text illegible in the autograph might be supplied in the transcription: |
1839 | PHOM | attributes are used, as elsewhere, to indicate respectively the sigil of a manuscript from which the supplied reading has been taken, and the identifier of the person responsible for deciding to supply the text. If the |
1841 | PHOM | attribute is not supplied, the implication is that the encoder (or whoever is indicated by the value of the |
1843 | PHOM | attribute) has supplied the missing reading. Both |
1859 | PHPH | This section discusses in more detail the representation of aspects of responsibility perceived or to be recorded for the writing of a primary source. These include points at which one scribe takes over from another, or at which ink, pen, or other characteristics of the writing change. A discussion of the usage of the |
1870 | PHDH | For many text-critical purposes it is important to signal the person responsible (the |
1872 | PHDH | ) for the writing of a whole document, a stretch of text within a document, or a particular feature within the document. A hand, as the name suggests, need not necessarily be identified with a particular known (or unknown) scribe or author; it may simply indicate a particular combination of writing features recognized within one or more documents. The examples given above of the use of the |
1874 | PHDH | attribute with coding of additions and deletions illustrate this. |
1887 | PHDH | attribute, may appear in either of two places in the TEI header, depending on which modules are included in a schema. When the |
1893 | PHDH | element of the TEI header, to hold one or more |
1901 | PHDH | also becomes available as part of a structured manuscript description. The encoder may choose to place |
1903 | PHDH | elements identifying individual hands in either location without affecting their accessibility since the element is always addressed by means of its |
1907 | PHDH | element may be more appropriate when a full cataloguing of each manuscript is required; the |
1909 | PHDH | element if only a brief characterization of each hand is needed. It is also possible to use the two elements together if, for example, the |
1911 | PHDH | element contains a single summary describing all the hands discursively, while the |
1913 | PHDH | element gives specific details of each. The choice will depend on individual encoders' priorities. |
1917 | PHDH | attribute is available on several elements to indicate the hand in which the content of the element (usually a deletion or addition) is carried out. The |
1919 | PHDH | element may also be used within the body of a transcription to indicate where a change of hand is detected for whatever reason. |
1935 | PHDH | A single hand may employ different writing styles and inks within a document, or may change character. For example, the writing style might shift from |
1939 | PHDH | , or the ink from blue to brown, or the character of the hand may change. Simple changes of this kind may be indicated by assigning a new value to the appropriate attribute within the |
1941 | PHDH | element. It is for the encoder to decide whether a change in these properties of the writing style is so marked as to require treatment as a distinct hand. |
1943 | PHDH | Where such a change is to be identified, the |
1945 | PHDH | attribute indicates the hand applicable to the material following the |
1947 | PHDH | . The sequence of such |
1949 | PHDH | elements will often, but not necessarily, correspond with the order in which the material was originally written. Where this is not the case, the facilities described in section |
1952 | PHDH | As might be expected, a single hand may also vary renditions within the same writing style; for example, medieval scribes often indicate a structural division by emboldening all the words within a line. Such changes should be indicated by use of the |
1958 | PHDH | In the following example there is a change of ink within a single hand. This is simply indicated by a new value for the |
1969 | PHDH | In the following example, the encoder has identified two distinct hands within the document and given them identifiers |
1973 | PHDH | , by means of the following declarations included in the document's TEI header: |
1983 | PHDH | Then the change of hand is indicated in the text: |
1987 | PHDH | When a more precise or nuanced discussion of the writing in a manuscript is required, the |
2004 | PHHR | attributes have similar, but not identical, meanings. Observe their distinctive uses in the following encoding of the William James passage mentioned above in section |
2009 | PHHR | , and the consequent editorial correction of |
2034 | PHHR | should be reserved for indicating the hand of any form of marking—here, addition but also deletion, correction, annotation, underlining, etc.—within the primary text being transcribed. The scribal or authorial responsibility for this marking may be inferred from the value of the |
2036 | PHHR | attribute. The value of the |
2038 | PHHR | attribute should be a pointer to a hand identifier, typically declared in the document header but potentially in another document or repository (see section |
2043 | PHHR | attribute, by contrast, indicates the person responsible for deciding to mark up this part of the text with this particular element. In the case of the |
2049 | PHHR | attribute is supplied) to which hand it should be attributed. In this case, Bowers is credited with identifying the hand as that of William James. In the case of the |
2053 | PHHR | attribute indicates who is responsible for supplying the intellectual content of the correction reported in the transcription: here, Bowers' correction of |
2057 | PHHR | . In the case of a deletion, the |
2067 | PHHR | attributes are defined for a particular element, the two attributes refer to the same aspect of the markup. The one indicates who is intellectually responsible for some item of information, the other indicates the degree of confidence in the information. Thus, for a correction, the |
2069 | PHHR | attribute signifies the person responsible for supplying the correction, while the |
2073 | PHHR | attribute signifies the person responsible for supplying the expansion and the |
2081 | PHHR | attributes with each element is intended to provide for the most frequent circumstances in which encoders might wish to make unambiguous statements regarding the responsibility for and certainty of aspects of their encoding. The |
2085 | PHHR | attributes, as so defined, give a convenient mechanism for this. However, there will be cases where it is desirable to state responsibility for and certainty concerning other aspects of the encoding. For example, one may wish in the case of an apparent addition to state the responsibility for the use of the |
2087 | PHHR | element, rather than the responsibility for identifying the hand of the addition. It may also be that one editor may make an electronic transcription of another editor's printed transcription of a manuscript text—here, one will wish to assign layers of responsibility, so as to allow the reader to determine exactly what in the final transcription was the responsibility of each editor. In these complex cases of divided editorial responsibility for and certainty concerning the content, attributes, and application of a particular element, the more general mechanisms for representing certainty and responsibility described in chapter |
2091 | PHHR | It should be noted that the certainty and responsibility mechanisms described in chapter |
2100 | PHHR | in line 117 of Chaucer's |
2113 | PHHR | Exactly the same information could be conveyed using the certainty and responsibility mechanisms, as follows: |
2119 | PHHR | The choice of which mechanism to use is left to the encoder. In transcriptions where only such statements of responsibility and certainty are made as can be accommodated within the |
2127 | PHHR | attributes of those elements. Where many statements of responsibility and certainty are made which cannot be so accommodated, it may be economical to use the |
2133 | PHHR | The above discussion supposes that in each case an encoder is able to specify exactly what it is that one wishes to state responsibility for and certainty about. Situations may arise when an encoder wishes to make a statement concerning certainty or responsibility but is unable or unwilling to specify so precisely the domain of the certainty or responsibility. In these cases, the |
2137 | PHHR | attribute set to |
2140 | PHHR | resp |
2141 | PHHR | and the content of the note giving a prose description of the state of affairs. |
2148 | PHDAMCON | The carrier medium of a primary source may often sustain physical damage which makes parts of it hard or impossible to read. In this section we discuss elements which may be used to represent such situations and give recommendations about how these should be used in conjunction with the other related elements introduced previously in this chapter. |
2158 | PHDA | ) should be used with appropriate attributes where the degree of damage or illegibility in a text is such that nothing can be read and the text must be either omitted or supplied conjecturally or from one or more other sources. In many cases, however, despite damage or illegibility, the text may yet be read with reasonable confidence. In these cases, the following elements should be used: |
2181 | PHDA | inherits the following additional attribute: |
2190 | PHDA | In the first line of this leaf, the transcriber may believe that the last three letters of |
2198 | PHDA | If, as is often the case, the damage crosses structural divisions, so that the |
2225 | PHDA | element, since it is the whole of the leaf (the text between the two |
2230 | PHDA | If, as is also likely, the damage affects several disjoint parts of the text, each such part must be marked with a separate |
2236 | PHDA | attribute may be used as in the following example. In this (imaginary) text of Fitzgerald's translation from Omar Khayam, water damage has affected an area covering parts of several lines: |
2255 | PHDA | which may be used to link together arbitrary elements of any kind in the transcription. Here, several phenomena of illegibility and conjecture all result from a single cause: an area of damage to the text caused by rubbing at various points. The damage is not continuous, and affects the text at irregular points. In cases such as this, the join element may be used to indicate which tagged features are part of the same physical phenomenon. |
2257 | PHDA | If the damage has been so severe as to render parts of the text only imperfectly legible, the |
2285 | PHDA | element may if desired be enclosed within a |
2304 | PHDA | Where elements are nested in this way, information about agency, etc. is by default inherited. In the following imaginary example, there is a smoke-damaged part within which two stretches can be read with some difficulty, and a third stretch which cannot be read at all: |
2355 | PHCOMB | elements may be closely allied in their use. For example, an area of damage in a primary source might be encoded with any one of the first four of these elements, depending on how far the damage has affected the readability of the text. Further, certain of the elements may nest within one another. The examples given in the last sections illustrate something of how these elements are to be distinguished in use. This may be formulated as follows: |
2357 | PHCOMB | where the text has been rendered completely illegible by deletion or damage and no text is supplied by the editor in place of what is lost: place an empty |
2361 | PHCOMB | attribute to state the cause (damage, deletion, etc.) of the loss of text. |
2363 | PHCOMB | where the text has been rendered completely illegible by deletion or damage and text is supplied by the editor in place of what is lost: surround the text supplied at the point of deletion or damage with the |
2367 | PHCOMB | attribute to state the cause (damage, deletion, etc.) of the loss of text leading to the need to supply the text. |
2369 | PHCOMB | where the text has been rendered partly illegible by deletion or damage so that the text can be read but without perfect confidence: transcribe the text and surround it with the |
2373 | PHCOMB | attribute to state the cause (damage, deletion, etc.) of the uncertainty in transcription and the |
2377 | PHCOMB | where there is deletion or damage but at least some of the text can be read with perfect confidence: transcribe the text and surround it with the |
2387 | PHCOMB | where there is an area of deletion or damage and parts of the text within that area can be read with perfect confidence, other parts with less confidence, other parts not at all: in transcription, surround the whole area with the |
2395 | PHCOMB | element. Places within the damaged area where the text has been rendered completely illegible and no text is supplied by the editor may be marked with the |
2397 | PHCOMB | element. For each element, one may use appropriate attribute values to indicate the cause and type of deletion or damage and the certainty of the reading. |
2404 | PHCOMB | elements, and for the interpretation of such combinations, are similar: |
2407 | PHCOMB | if one |
2413 | PHCOMB | ), then the addition |
2424 | PHCOMB | if one |
2435 | PHCOMB | if a |
2439 | PHCOMB | element, the normal interpretation will be that an addition was made within a passage which was later deleted in its entirety: |
2444 | PHCOMB | if an |
2448 | PHCOMB | element, the normal interpretation will be that a deletion was made from a passage which had earlier been added: |
2459 | alterations | Modifications of various kinds (correction, addition, deletion, etc.) are frequently found within a single document, and may also be inferred when different documents are compared, although it may be an open question as to whether inter-document discrepancies |
2462 | alterations | In this section we discuss a number of elements which may be useful when attempting to record traces of the writing process within a document. |
2467 | PH-mod | Most, if not all, transcriptional elements imply a certain level of semantic interpretation. For instance, using the |
2469 | PH-mod | element to encode a word or phrase that occupies interlinear space involves a decision that it has been deliberately inserted as an addition rather than an alternative, and indeed a judgment that it was written after, rather than before, the other lines. Where it is felt desirable to keep the recording of |
2472 | PH-mod | what is the editor’s interpretation |
2484 | PH-mod | attribute, but they provide no further interpretation of the function or intention of the passage so marked up. The |
2486 | PH-mod | attribute may be used to indicate the end of a modified passage if this extends across the boundaries of some other XML element, for example from the middle of one line tagged as a |
2515 | PH-meta | metamark |
2516 | PH-meta | we mean marks such as numbers, arrows, crosses, or other symbols introduced by the writer into a document expressly for the purpose of indicating how the text is to be read. Such marks thus constitute a kind of markup of the document, rather than forming part of the text. |
2521 | PH-meta | Unlike marginal notes or other additions to the text, metamarks are used by the writer to indicate a deliberate alteration of the writing itself, such as |
2522 | PH-meta | move this passage over there |
2523 | PH-meta | . An addition or annotation by contrast would typically concern some property of the passage other than its intended location or status within the text flow. A metamark may contain text, or some other graphic which the encoder wishes to represent, or it may simply consist of arrows, dots, lines etc. which the encoder simply describes. |
2540 | PH-meta | . The passage to which the metamark applies may be indicated in either of two ways: the |
2546 | PH-meta | itself must be supplied at the position in the document where the passage concerned begins; in the former case it may be supplied at any convenient point. The two attributes should not both be supplied. |
2560 | PH-meta | . It is thought to function as a metamark, indicating that this sentence forms part of the regulations. A further sentence was then added, while at some later stage the text and also the metamark were deleted. We might encode this as follows: |
2596 | PH-meta | deletion symbol to left and right of the section. The deletion itself might be encoded by using the normal |
2602 | PH-meta | element. This is quite a different case from that of the next example, in which the writer does not intend to suppress the content, but only to mark that it has been copied to another manuscript or reused. |
2607 | PH-meta | From "I am that halfgrown angry boy" (MS q 25), David M. Rubenstein Rare Book & Manuscript Library, Duke University. |
2613 | PH-meta | signalled by the larger of the two single vertical lines, which shows that the written material has been transferred or re-used, not deleted. |
2648 | PH-meta | In this example, we class as metamarks both the long vertical line and the annotation |
2651 | PH-meta | Both metamarks are assumed to indicate that the whole of the written zone with identifier |
2659 | PH-fix | A writer may sometimes rewrite material a second time without significant change and in the same place. We consider this a distinct activity from addition as usually defined because no new textual material results; instead the status of existing material is reaffirmed. We may distinguish two variants of this: |
2674 | PH-fix | hastily, and then returned to it to make the letter |
2675 | PH-fix | l |
2719 | PH-fix | element is used only for cases where text has been written multiple times. When metamarks and other markup-like strokes have been rewritten multiple times, the |
2740 | undo | ) is provided for the comparatively simple case where a simple deletion is marked as having been subsequently cancelled. The |
2742 | undo | element discussed here is more widely applicable and may be used for any kind of cancellation. It points to the element or elements which are being cancelled. These components need not be contiguous, provided that the cancellation is clearly a single act; each distinct act of cancellation requires a distinct |
2755 | undo | We hypothesize that the text has gone through three states or changes, as follows: |
2765 | undo | This sequence of events might be encoded as follows: |
2781 | undo | attribute, to delimit the two parts of the deletion which were reverted at change s3. Note that in this case, since |
2791 | undo | to delimit the two sequences whose deletion is being reverted, and then use the |
2817 | transpo | occurs when metamarks are found in a document indicating that passages should be moved to a different position. Typically this may be done using arrows, asterisks or numbers, or other means. By definition the result of a transposition is not present in the document, and should not therefore be encoded, if the intention is to represent the actual appearance of the document. Instead, the following elements may be used to indicate the intended reordering: |
2851 | transpo | element to identify the sections of text being transposed. When (as in the following example) the whole of a line is to be transposed, there is no need to delimit the sections concerned: |
2878 | transpo | elements may be supplied either embedded within the text or in the |
2896 | alter | In this example two alternative readings are provided, but no preference is indicated. While the author apparently first composed the line |
2902 | alter | . The manuscript supplies no indication of which word Moore favours at this point, although in fact, in the first printed edition of |
2912 | alter | module gives a simple way of encoding the state of this manuscript, as follows: |
2946 | instantcorr | necessarily implies that the modifications they indicate were made at some time after the original writing. An exception to this is where a false start or |
2948 | instantcorr | correction has been identified: the author starts to write, and then immediately corrects what has been written. |
2954 | instantcorr | class to modify this default assumption. When the value of |
2956 | instantcorr | is set to |
2958 | instantcorr | , the addition or deletion is considered to belong to the same change as its parent element, while |
2960 | instantcorr | means some change later than that of its parent. |
2962 | instantcorr | An example of false start or instant correction can be seen in the following line: |
2966 | instantcorr | [I am a curse] |
2970 | instantcorr | in which we can detect the following sequence of events: |
2974 | instantcorr | is written and then immediately deleted |
2983 | instantcorr | is then deleted |
2991 | instantcorr | To indicate that the first of these acts must have taken place during the main act of writing, before the other deletion and additions, we might encode this revision campaign as follows: |
3023 | PH-surfzone | element is both to identify a specific area containing writing and to provide a two dimensional set of coordinates which can be used to position and provide dimensions for sub-parts of it. Furthermore, surfaces may nest within other surfaces, as in the case of |
3025 | PH-surfzone | or other written materials attached to the main writing surface. In the general case, the position and dimensions of such nested surfaces will be defined using the same coordinate system as that supplied by the parent |
3038 | PH-surfzone | when given on the |
3040 | PH-surfzone | element define the coordinate scheme, rather than specifying the location of that surface. We must therefore introduce an additional |
3067 | PH-surfzone | element that contains it. This zone, and the preceding one, which contains a sequence of |
3073 | PH-surfzone | elements occupy a rectangle with coordinates (1,1,10,10), while the nested surface occupies a rectangle with coordinates (4,4,20,20). |
3075 | PH-surfzone | Now suppose that we wish to define a finer scale grid for the newspaper patch, perhaps because we wish to localize zones within it with greater accuracy. To do this we will need to specify the position of the nested surface as in the previous example, but also to define the new coordinate system. We accomplish this as follows: |
3091 | PH-surfzone | As before, the second zone defines the position and size of the newspaper patch itself in terms of a coordinate system running from 0 to 50 on both X and Y axes. The nested |
3093 | PH-surfzone | element however defines a new scale for all of its components, running from 0 to 100 on both X and Y axes. The position of the nested zone containing the text |
3099 | PH-surfzone | attribute may be used to define non-rectangular zones as a series of points. For example, in the last of the Whitman examples discussed in section |
3100 | PH-surfzone | above, we might wish to record the exact shape of the zone containing the metamark |
3104 | PH-surfzone | attribute to indicate the points defining a polygon which contains it. The values used are expressed in terms of a coordinate space running from 0 to 229 in the X dimension, and 0 to 160 in the Y dimension. |
3112 | PH-surfzone | In exactly the same way, we may wish to identify the curved zone in the following image containing the word |
3119 | PH-surfzone | This curved zone might be encoded in the following way: |
3129 | PH-surfzone | does not need to be entirely contained within the two-dimensional space defined by its parent surface. For example, we might wish to encode the example in |
3130 | PH-surfzone | above not as a surface representing the whole of the two page spread, but as a surface representing only the written part of this opening. The written part appears 50 units from the left of the image and 20 units from the top, while the bottom right corner of the written part appears 400 units from the left of the image, and 280 units from the top. We therefore define the written surface within this image as follows: |
3135 | PH-surfzone | To describe the whole image, we will now need to define a zone of interest which represents an area larger than this surface. Using the same coordinate system as that defined for the surface, its coordinates are |
3137 | PH-surfzone | . This zone of interest can be defined by a |
3139 | PH-surfzone | element, within which we can place the uncropped |
3153 | PHLAY | The following methods are available to capture general aspects of the layout of material on a page where this is considered important. Within the |
3184 | PHLAY | s corresponding with each two page opening, for example where it is clear that the writer regarded each such opening as a single writing surface, with written zones or other features crossing the page divide. An example is shown here: |
3193 | PHLAY | The coloured lines added to this image indicate a number of zones of writing, colour coded to indicate the order in which they were written (purple, then green, then red). For example, the zone marked in red on the left contains a note referring to the purple zone on the right. |
3196 | PHLAY | This approach assumes that the transcription will primarily be organized in the same way as the physical layout of the source, using embedded transcription elements. Alternatively, where a non-embedded transcription has been provided, using the |
3198 | PHLAY | element, it is still possible to record gathering breaks, page breaks, column breaks, line breaks, etc. in the source, using the elements described in section |
3199 | PHLAY | . Detailed metadata about the physical make-up of a source will usually be summarized by the |
3209 | PHSP | The author or scribe may have left space for a word, or for an initial capital, and for some reason the word or capital was never supplied and the space left empty. The presence of significant space in the text being transcribed may be indicated by the |
3214 | PHSP | Note that this element should not be used to mark normal inter-word space or the like. |
3216 | PHSP | In line 694 of Chaucer's |
3218 | PHSP | in the Holkham manuscript the scribe has left a space for a word where other manuscripts read |
3225 | PHSP | element discussed in the previous section may be used to supply the text presumed missing: |
3229 | PHSP | Here, the fact of the space within the manuscript is indicated by the value of the |
3231 | PHSP | attribute. The source of the supplied text is shown by the value of the |
3233 | PHSP | attribute as the Hengwrt manuscript; the transcriber responsible for supplying the text is ES. |
3239 | PHLN | One of the more common forms of modification encountered in written documents of any kind is the presence of lines written under, beside, or through the text. Such lines may be of various types: they may be solid, dashed or dotted, doubled or tripled, wavy or straight, or a combination of these and other renderings. The line may be used for emphasis, or to mark a foreign or technical term, or to signal a quotation or a title, etc.: the elements |
3249 | PHLN | may be used for these. Where the line has a clear paratextual function the |
3251 | PHLN | element may be considered more appropriate. Frequently, a scholar may judge that a line is used to delete text: the |
3274 | PHLN | The above examples presume the common case where a single word or phrase is marked by a line, with no doubt as to where the marking begins or ends and with no overlapping of the area of text with other marked areas of text. Where there is doubt, the |
3287 | PHLN | Where the area of text marked overlaps other areas of text, for example crossing a structural division, one of the spanning mechanisms mentioned above must be used; for example where the line is thought to mark a deletion, the |
3289 | PHLN | element may be used. Where it is desired simply to record the marking of a span of text in circumstances where it is not possible to surround the text with a |
3299 | PHLN | More work needs to be done on clarifying the treatment of other textual features marked by lines which might so overlap or nest. For example, in many Middle English manuscripts (e.g. the Jesus and Digby verse collections), marginal sidebars may indicate metrical structure: couplets may be linked in pairs, with the pairs themselves linked into stanzas. Or, marginal sidebars may indicate emphasis, or may point out a region of text on which there is some annotation: in many manuscripts of Chaucer's |
3307 | PHLN | element, containing a prose description of the manuscript at this point, enhanced by a link to a visual representation (or facsimile) of the feature in question. For example, in the Chaucer example just cited, one may wish to record that the |
3325 | PHSK | Such information as page numbers, signatures, or catchwords may be recorded in a specialized |
3327 | PHSK | element provided for that purpose. Although the name derives from the term |
3333 | PHSK | element may be used for such features of any document, written or printed. Note that the purpose of this element is to record page numbers etc. |
3346 | PHSK | : since this information is usually provided by the encoder, it is not subject to the constraint that it should be present only if textually present in the source being encoded. In text-critical situations it may be useful to provide both a normalized version of the pagination and a representation of the catch-word or numbering, especially when the latter presents a variant reading, or is significant for compositor identification. |
3361 | PHSK | other material repeated from page to page, which falls outside the stream of the text |
3386 | PH-changes | A major purpose of genetic editing is the identification of |
3390 | PH-changes | . An editor may wish to assign a set of alterations (deletions, additions, substitutions, transpositions, etc.) or any other act of writing to a particular change, to indicate both that one or more of such phenomena preceded or followed another and also to indicate that they are related in some way, for example that one is a consequence of the other. They might also wish to group together certain revisions, regardless of when they might have occurred, based on a variety of other shared characteristics (e.g., corrections of factual errors or revisions that incorporate suggestions made by a given reader). To document this we need: |
3392 | PH-changes | a system to assign phenomena to a particular change |
3394 | PH-changes | a way to characterize a change, in itself and in relation to other changes. |
3399 | PH-changes | (within the TEI header profile description) contains all information relating to the genesis or production of a text. It may contain a |
3401 | PH-changes | element which contains a number of |
3409 | PH-changes | In the following example an editor has identified four distinct changes: |
3435 | PH-changes | (the default). The attribute specifies whether the order of child elements signifies a temporal order for the revision campaigns which they document. In the example above, the editor has asserted that the four stages distinguished are ordered chronologically according to the order of the |
3440 | PH-changes | elements can be nested hierarchically. This may be helpful in two cases. Firstly one can build up hypotheses about related revisions step-by-step, starting with stages of smaller coverage, whose members are certainly related, and then in a subsequent pass grouping these stages in turn, thereby extending their reach. |
3481 | PH-changes | In addition to the possibility of ordering text stages in relation to each other, |
3483 | PH-changes | elements may carry a number of attributes from the |
3497 | PH-changes | ) which allow each stage to be dated as exactly or inexactly as necessary, in the same way as is currently possible for the TEI |
3542 | PH-changes | element, apart from declaring a distinct change in the creation of the document, may also contain references to other annotations contained within the |
3544 | PH-changes | or in the document (as shown in the previous example). Such references, along with the textual content, are purely documentary and do not affect the textual stage associated with any element thus referred to. The association of a textual component with a change is always made explicitly, either by using the |
3554 | PH-changes | element is associated with some element, it is also associated with all of that element's children, unless otherwise indicated, for example by a new value for the |
3558 | PH-changes | In the following simple example, the text at one stage read |
3570 | PH-changes | In this example, however, the text originally read |
3584 | PH-changes | Note that in this case both the deletion and the addition are associated with the second stage. The word |
3594 | PH-changes | and the like carry an implied semantics concerning the order in which events in the writing of a document were carried out: something which is deleted must have been written before it was deleted; something which is added must have been added at a later stage of the writing. Even when a combination of such elements is used, the chronology can usually be inferred (see further |
3595 | PH-changes | ). Explicit indication of the stage to which some modification belongs is mostly useful in situations where all the alterations identified in a document are to be grouped, for example chronologically. |
3599 | PH-changes | The interpretation of change assignments for a particular text passage is based on a number of implicit assumptions and constraints which have the effect of minimizing the amount of tagging necessary. The system is also flexible enough to support an explicit distinction between acts of writing and textual alterations, since either of these can be associated with changes described in the encoding. The following example shows an encoding in which the same passage is transcribed twice, once from a documentary perspective, and once from a textual one: |
3655 | PH-changes | The documentary transcription stresses the writing process, while the textual transcription emphasizes textual alterations. In either case, the change of writing activity associated with a particular feature in the transcript is explicitly indicated. From the documentary perspective, by assigning particular modifications to a specific change, we describe the writing process, in that they specify which segment has been written when |
3656 | PH-changes | . From the textual perspective, the markup concentrates simply on the existence of textual alterations and makes no explicit claims about the order of writing. |
3663 | PHTRXX | We repeat the advice given at the beginning of this chapter, that these recommendations are not intended to meet every transcriptional circumstance ever likely to be faced by any scholar. They are intended rather as a base to enable encoding of the most common phenomena found in the course of scholarly transcription of primary source materials. These guidelines particularly do not address the encoding of physical description of textual witnesses: the materials of the carrier, the medium of the inscribing implement, the organisation of the carrier materials themselves (as quiring, collation, etc.), authorial instructions or scribal markup, etc., except insofar as these are involved in the broader question of manuscript description, as addressed by the |
3688 | PH | The selection and combination of modules to form a TEI schema is described in |
# | id | text |
---|---|---|
2 | HD | The TEI Header |
4 | HD | This chapter addresses the problems of describing an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented. Such documentation is equally necessary for scholars using the texts, for software processing them, and for cataloguers in libraries and archives. Together these descriptions and declarations provide an electronic analogue to the title page attached to a printed work. They also constitute an equivalent for the content of the code books or introductory manuals customarily accompanying electronic data sets. |
6 | HD | Every TEI-conformant text must carry such a set of descriptions, prefixed to it and encoded as described in this chapter. The set is known as the |
7 | HD | TEI header |
16 | HD | , containing a full bibliographical description of the computer file itself, from which a user of the text could derive a proper bibliographic citation, or which a librarian or archivist could use in creating a catalogue entry recording its presence within a library or archive. The term |
18 | HD | here is to be understood as referring to the whole entity or document described by the header, even when this is stored in several distinct operating system files. The file description also includes information about the source or sources from which the electronic document was derived. The TEI elements used to encode the file description are described in section |
25 | HD | , which describes the relationship between an electronic text and its source or sources. It allows for detailed description of whether (or how) the text was normalized during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, and similar matters. The TEI elements used to encode the encoding description are described in section |
29 | HD | text profile |
32 | HD | , containing classificatory and contextual information about the text, such as its subject matter, the situation in which it was produced, the individuals described by or participating in producing it, and so forth. Such a text profile is of particular use in highly structured composite texts such as corpora or language collections, where it is often highly desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin. The text profile may however be of use in any form of automatic text processing. The TEI elements used to encode the profile description are described in section |
36 | HD | revision history |
39 | HD | , which allows the encoder to provide a history of changes made during the development of the electronic text. The revision history is important for |
41 | HD | and for resolving questions about the history of a file. The TEI elements used to encode the revision description are described in section |
45 | HD | A TEI header can be a very large and complex object, or it may be a very simple one. Some application areas (for example, the construction of language corpora and the transcription of spoken texts) may require more specialized and detailed information than others. The present proposals therefore define both a |
46 | HD | core |
47 | HD | set of elements (all of which may be used without formality in any TEI header) and some additional elements which become available within the header as the result of including additional specialized modules within the schema. When the module for language corpora (described in chapter |
48 | HD | ) is in use, for example, several additional elements are available, as further detailed in that chapter. |
50 | HD | The next section of the present chapter briefly introduces the overall structure of the header and the kinds of data it may contain. This is followed by a detailed description of all the constituent elements which may be used in the core header. Section |
51 | HD | , at the end of the present chapter, discusses the recommended content of a minimal TEI header and its relation to standard library cataloguing practices. |
53 | HD1 | Organization of the TEI Header |
55 | HD11 | The TEI Header and Its Components |
61 | HD11 | front matter |
62 | HD11 | of the text itself (for which see section |
63 | HD11 | ). A composite text, such as a corpus or collection, may contain several headers, as further discussed below. In the general case, however, a TEI-conformant text will contain a single |
71 | HD11 | The header element has the following description: |
76 | HD11 | element has four principal components: |
81 | HD11 | element is required in all TEI headers; the others are optional. Only one of the four components of the TEI header (the |
84 | HD11 | below. The smallest possible valid TEI Header thus looks like this: |
94 | HD11 | The content of the elements making up a TEI header may be given in any language, not necessarily that of the text to which the header applies, and not necessarily English. As elsewhere, the |
96 | HD11 | attribute should be used at an appropriate level to specify the language. For example, in the following schematic example, an English text has been given a French header: |
106 | HD11 | In the case of language corpora or collections, it may be desirable to record header information either at the level of the individual components in the corpus or collection, or at the level of the corpus or collection itself (more details concerning the tagging of composite texts are given in section |
109 | HD11 | attribute may be used to indicate whether the header applies to a corpus or a single text. A corpus may thus take the form: |
144 | HD12 | Types of Content in the TEI Header |
146 | HD12 | The elements occurring within the TEI header may contain several types of content; the following list indicates how these types of content are described in the following sections: |
151 | HD12 | should be understood to imply a series of paragraphs, each marked as a |
165 | HD12 | ) usually enclose a group of specialized elements recording some structured information. In the case of the bibliographic elements, the suffix |
171 | HD12 | . On the relation between the TEI proposals and other standards for bibliographic description, see further section |
173 | HD12 | In most cases grouping elements may contain prose descriptions as an alternative to the set of specialized elements, thus allowing the encoder to choose whether or not the information concerned should be presented in a structured form or in prose. |
182 | HD12 | ) enclose information about specific encoding practices applied in the electronic text; often these practices are described in coded form. Typically, such information takes the form of a series of declarations, identifying a code with some more complex structure or description. A declaration which applies to more than one text or division of a text need not be repeated in the header of each such text or subdivision. Instead, the |
184 | HD12 | attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section |
197 | HD1 | Model Classes in the TEI Header |
199 | HD1 | The TEI header provides a very rich collection of metadata categories, but makes no claim to be exhaustive. It is certainly the case that individual projects may wish to record specialized metadata which either does not fit within one of the predefined categories identified by the TEI header or requires a more specialized element structure than is proposed here. To overcome this problem, the encoder may elect to define additional elements using the customization methods discussed in |
200 | HD1 | . The TEI class system makes such customizations simpler to effect and easier to use in interchange. |
202 | HD1 | These classes are specific to parts of the header: |
224 | HD2 | The bibliographic description of a machine-readable or digital text resembles in structure that of a book, an article, or any other kind of textual object. The file description element of the TEI header has therefore been closely modelled on existing standards in library cataloguing; it should thus provide enough information to allow users to give standard bibliographic references to the electronic text, and to allow cataloguers to catalogue it. Bibliographic citations occurring elsewhere in the header, and also in the text itself, are derived from the same model (on bibliographic citations in general, see further section |
228 | HD2 | The bibliographic description of an electronic text should be supplied by the mandatory |
288 | HD21 | It contains the title given to the electronic work, together with one or more optional |
295 | HD21 | element contains the chief name of the electronic work, including any alternative title or subtitles it may have. It may be repeated, if the work has more than one title (perhaps in different languages) and takes whatever form is considered appropriate by its creator. Where the electronic work is derived from an existing source text, it is strongly recommended that the title for the former should be derived from the latter, but clearly distinguishable from it, for example by the addition of a phrase such as |
298 | HD21 | a digital edition |
300 | HD21 | This will distinguish the electronic work from the source text in citations and in catalogues which contain descriptions of both types of material. |
302 | HD21 | The electronic work will also have an external name (its |
305 | HD21 | data set name |
306 | HD21 | ) or reference number on the computer system where it resides at any time. This name is likely to change frequently, as new copies of the file are made on the computer system. Its form is entirely dependent on the particular computer system in use and thus cannot always easily be transferred from one system to another. Moreover, a given work may be composed of many files. For these reasons, these Guidelines strongly recommend that such names should |
329 | HD21 | which identify the person(s) responsible for the intellectual or artistic content of an item and any corporate bodies from which it emanates. |
331 | HD21 | Any number of such statements may occur within the title statement. At a minimum, identify the author of the text and (where appropriate) the creator of the file. If the bibliographic description is for a corpus, identify the creator of the corpus. |
332 | HD21 | Optionally include also names of others involved in the transcription or elaboration of the text, sponsors, and funding agencies. The name of the person responsible for physical data input need not normally be recorded, unless that person is also intellectually responsible for some aspect of the creation of the file. |
334 | HD21 | Where the person whose responsibility is to be documented is not an author, sponsor, funding body, or principal researcher, the |
340 | HD21 | element indicating the nature of the responsibility. No specific recommendations are made at this time as to appropriate content for the |
344 | HD21 | Names given may be personal names or corporate names. Give all names in the form in which the persons or bodies wish to be publicly cited. This would usually be the fullest form of the name, including first names. |
345 | HD21 | Agencies compiling catalogues of machine-readable files are recommended to use available authority lists, such as the Library of Congress Name Authority List, for all common personal names. |
400 | HD22 | It contains either phrases or more specialized elements identifying the edition and those responsible for it: |
404 | HD22 | edition |
405 | HD22 | applies to the set of all the identical copies of an item produced from one master copy and issued by a particular publishing agency or a group of such agencies. A change in the identity of the distributing body or bodies does not normally constitute a change of edition, while a change in the master copy does. |
409 | HD22 | is not entirely appropriate, since they are far more easily copied and modified than printed ones; nonetheless the term |
410 | HD22 | edition |
411 | HD22 | may be used for a particular state of a machine-readable text at which substantive changes are made and fixed. Synonymous terms used in these Guidelines are |
424 | HD22 | changes have to be before they are regarded as producing a new edition, rather than a simple update. The general principle proposed here is that the production of a new edition entails a significant change in the intellectual content of the file, rather than its encoding or appearance. The addition of analytic coding to a text would thus constitute a new edition, while automatic conversion from one coded representation to another would not. Changes relating to the character code or physical storage details, corrections of misspellings, simple changes in the arrangement of the contents and changes in the output format do not normally constitute a new edition, whereas the addition of new information (e.g. a linguistic analysis expressed in part-of-speech tagging, sound or graphics, referential links to external data sets) almost always does. |
426 | HD22 | Clearly, there will always be borderline cases and the matter is somewhat arbitrary. The simplest rule is: if you think that your file is a new edition, then call it such. An edition statement is optional for the first release of a computer file; it is mandatory for each later release, though this requirement cannot be enforced by the parser. |
430 | HD22 | changes in a file considered significant, whether or not they are regarded as constituting a new edition or simply a new revision, should be independently noted in the revision description section of the file header (see section |
435 | HD22 | element should contain phrases describing the edition or version, including the word |
436 | HD22 | edition |
439 | HD22 | , or equivalent, together with a number or date, or terms indicating difference from other editions such as |
440 | HD22 | new edition |
442 | HD22 | revised edition |
443 | HD22 | etc. Any dates that occur within the edition statement should be marked with the |
453 | HD22 | elements may also be used to supply statements of responsibility for the edition in question. These may refer to individuals or corporate bodies and can indicate functions such as that of a reviser, or can name the person or body responsible for the provision of supplementary matter, of appendices, etc., in a new edition. For further detail on the |
487 | HD23 | For printed books, information about the carrier, such as the kind of medium used and its size, is of great importance in cataloguing procedures. The print-oriented rules for bibliographic description of an item's medium and extent need some re-interpretation when applied to electronic media. An electronic file exists as a distinct entity quite independently of its carrier and remains the same intellectual object whether it is stored on a magnetic tape, a CD-ROM, a set of floppy disks, or as a file on a mainframe computer. Since, moreover, these Guidelines are specifically aimed at facilitating transparent document storage and interchange, any purely machine-dependent information should be irrelevant as far as the file header is concerned. |
497 | HD23 | Although it is equally system-dependent, some measure of the size of the computer file may be of use for cataloguing and other practical purposes. Because the measurement and expression of file size is fraught with difficulties, only very general recommendations are possible; the element |
543 | HD23 | Note that when more than one |
545 | HD23 | is supplied in a single |
558 | HD24 | element and is mandatory. Its function is to name the agency by which a resource is made available (for example, a publisher or distributor) and to supply any additional information about the way in which it is made available such as licensing conditions, identifying numbers, etc. |
562 | HD24 | These elements form the |
564 | HD24 | class; if the agency making the resource available is unknown, but other structured information about it is available, an explicit statement such as |
565 | HD24 | publisher unknown |
569 | HD24 | publisher |
570 | HD24 | is the person or institution by whose authority a given edition of the file is made public. The |
571 | HD24 | distributor |
572 | HD24 | is the person or institution from whom copies of the text may be obtained. Where a text is not considered formally published, but is nevertheless made available for circulation by some individual or organization, this person or institution is termed the |
573 | HD24 | release authority |
576 | HD24 | Whichever of these elements is chosen, it may be followed by one or more of the following elements, which together form the |
596 | HD24 | elements all supply additional information relating to the publisher, distributor, or release authority immediately preceding them. In the following example, Benson is identified as responsible for distribution of some resource at the date and place cited: |
605 | HD24 | A resource may have (for example) both a publisher and a distributor, or more than one publisher each using different identifiers for the same resource, and so on. For this reason, the sequence of at least one |
611 | HD24 | The following example shows a resource published by one agency (Sigma Press) at one address and date, which is also distributed by another (Oxford Text Archive), with a specified identifier and a different date: |
641 | HD24 | always refers to the date of publication, first distribution, or initial release. If the text was created at some other date, this may be recorded using the |
645 | HD24 | element. Other useful dates (such as dates of collection of data) may be given using a note in the |
663 | HD24 | attribute to point to a location from which the licence document itself may be obtained. Alternatively, the licence document may simply be contained within the |
680 | HD26 | series |
683 | HD26 | A group of separate items related to one another by the fact that each item bears, in addition to its own title proper, a collective title applying to the group as a whole. The individual items may or may not be numbered. |
687 | HD26 | A separately numbered sequence of volumes within a series or serial. |
695 | HD26 | may be used to supply any identifying number associated with the item, including both standard numbers such as an ISSN and particular issue numbers. (Arabic numerals separated by punctuation are recommended for this purpose: |
701 | HD26 | attribute is used to categorize the number further, taking the value |
737 | HD27 | the nature, scope, artistic form, or purpose of the file; also the genre or other intellectual category to which it may belong: e.g. |
744 | HD27 | an abstract or summary of the content of a document which has been supplied by the encoder because no such abstract forms part of the content of the source. This should be supplied in the |
751 | HD27 | summary description providing a factual, non-evaluative account of the subject content of the file: e.g. |
758 | HD27 | bibliographic details relating to the source or sources of an electronic text: e.g. |
759 | HD27 | Transcribed from the Norton facsimile of the 1623 Folio |
765 | HD27 | further information relating to publication, distribution, or release of the text, including sources from which the text may be obtained, any restrictions on its use or formal terms on its availability. These should be placed in the appropriate division of the |
771 | HD27 | ICPSR study number 1803 |
773 | HD27 | Oxford Text Archive text number 1243 |
785 | HD27 | dates, when they are relevant to the content or condition of the computer file: e.g. |
790 | HD27 | names of persons or bodies connected with the technical production, administration, or consulting functions of the effort which produced the file, if these are not named in statements of responsibility in the title or edition statements of the file description: e.g. |
793 | HD27 | availability of the file in an additional medium or information not already recorded about the availability of documentation: e.g. |
796 | HD27 | language of work and abstract, if not encoded in the |
801 | HD27 | The unique name assigned to a serial by the International Serials Data System (ISDS), if not encoded in an |
804 | HD27 | lists of related publications, either describing the source itself, or concerned with the creation or use of the electronic work, e.g. |
808 | HD27 | Each such item of information may be tagged using the general-purpose |
819 | HD27 | There are advantages, however, to encoding such information with more precise elements elsewhere in the TEI header, when such elements are available. For example, the notes above might be encoded as follows: |
847 | HD3 | element. It is a mandatory element and is used to record details of the source or sources from which a computer file is derived. This might be a printed text or manuscript, another computer file, an audio or video recording of some kind, or a combination of these. An electronic file may also have no source, if what is being catalogued is an original text created in electronic form. |
852 | HD3 | element may contain little more than a simple prose description, or a brief note stating that the document has no source: |
864 | HD3 | These classes make available by default a range of ways of providing bibliographic citations which specify the provenance of the text. For written or printed sources, the source may be described in the same way as any other bibliographic citation, using one of the following elements: |
871 | HD3 | . Using them, a source might be described in very simple terms: |
896 | HD3 | When the header describes a text derived from some pre-existing TEI-conformant or other digital document, it may be simpler to use the following element, which is designed specifically for documents derived from texts which were |
912 | HD3 | class also makes available additional elements when additional modules are included. For example, when the |
916 | HD3 | element may also include the following special-purpose elements, intended for cases where an electronic text is derived from a spoken text rather than a written one: |
920 | HD3 | A single electronic text may be derived from multiple source documents, in whole or in part. The |
935 | HD3 | may be used to associate parts of the encoded text with the bibliographic element from which it derives in either case. |
937 | HD3 | The source description may also include lists of names, persons, places, etc. when these are considered to form part of the source for an encoded document. When such information is recorded using the specialized elements discussed in the |
956 | HD31 | If a computer file (call it B) is derived not from a printed source but from another computer file (call it A) which includes a TEI file header, then the source text of computer file B is another computer file, A. The four sections of A's file header will need to be incorporated into the new header for B in slightly differing ways, as listed below: |
957 | HD31 | fileDesc |
964 | HD31 | profileDesc |
969 | HD31 | encodingDesc |
971 | HD31 | A's encoding practice may or (more likely) may not be the same as B's. Since the object of the encoding description is to define the relationship between the current file and its source, in principle only changes in encoding practice between A and B need be documented in B. The relationship between A and its source(s) is then only recoverable from the original header of A. In practice it may be more convenient to create a new complete |
974 | HD31 | revisionDesc |
988 | HD5 | element is the second major subdivision of the TEI header. It specifies the methods and editorial principles which governed the transcription or encoding of the text in hand and may also include sets of coded definitions used by other components of the header. Though not formally required, its use is highly recommended. |
1022 | HD51 | element may be used to describe, in prose, the purpose for which a digital resource was created, together with any other relevant information concerning the process by which it was assembled or collected. This is of particular importance for corpora or miscellaneous collections, but may be of use for any text, for example to explain why one kind of encoding practice has been followed rather than another. |
1048 | HD52 | the underlying population being sampled |
1059 | HD52 | It may also include a simple description of any parts of the source text included or excluded. |
1064 | HD52 | A sampling declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the |
1066 | HD52 | attribute of each text (or subdivision of the text) to which the sampling declaration applies may be used to supply a cross-reference to it, as further described in section |
1079 | HD53 | It may contain a prose description only, or one or more of a set of specialized elements, members of the TEI |
1083 | HD53 | Some of these policy elements carry attributes to support automated processing of certain well-defined editorial decisions; all of them contain a prose description of the editorial principles adopted with respect to the particular feature concerned. Examples of the kinds of questions which these descriptions are intended to answer are given in the list below. |
1091 | HD53 | Was the text corrected during or after data capture? If so, were corrections made silently or are they marked using the tags described in section |
1092 | HD53 | ? What principles have been adopted with respect to omissions, truncations, dubious corrections, alternate readings, false starts, repetitions, etc.? |
1099 | HD53 | Was the text normalized, for example by regularizing any non-standard spellings, dialect forms, etc.? If so, were normalizations performed silently or are they marked using the tags described in section |
1100 | HD53 | ? What authority was used for the regularization? Also, what principles were used when normalizing numbers to provide the standard values for the |
1110 | HD53 | How were quotation marks processed? Are apostrophes and quotation marks distinguished? How? Are quotation marks retained as content in the text or replaced by markup? Are there any special conventions regarding, for example, the use of single or double quotation marks when nested? Is the file consistent in its practice or has this not been checked? See section |
1111 | HD53 | for discussion of ways in which quotation marks may be encoded. |
1122 | HD53 | hyphens? What principle has been adopted with respect to end-of-line hyphenation where source lineation has not been retained? Have soft hyphens been silently removed, and if so what is the effect on lineation and pagination? See section |
1123 | HD53 | for discussion of ways in which hyphenation may be encoded. |
1130 | HD53 | How is the text segmented? If |
1134 | HD53 | segmentation units have been used to divide up the text for analysis, how are they marked and how was the segmentation arrived at? |
1153 | HD53 | Has any analytic or |
1155 | HD53 | information been provided—that is, information which is felt to be non-obvious, or potentially contentious? If so, how was it generated? How was it encoded? If feature-structure analysis has been used, are |
1166 | HD53 | How has the encoding of punctuation marks present in the original source been treated? For example, has it been normalized, or suppressed in favour of descriptive markup? If it has been retained, is it located within or around elements such as |
1170 | HD53 | Any information about the editorial principles applied not falling under one of the above headings should be recorded in a distinct list of items. Experience shows that a full record should be kept of decisions relating to editorial principles and encoding practice, both for future users of the text and for the project which produced the text in the first instance. Some simple examples follow: |
1202 | HD53 | An editorial practices declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the |
1204 | HD53 | attribute of each text (or subdivision of the text) to which it applies may be used to supply a cross-reference to it, as further described in section |
1213 | HD57 | the namespace to which elements appearing within the transcribed text belong. |
1215 | HD57 | how often particular elements appear within the text, so that a recipient can validate the integrity of a text during interchange. |
1219 | HD57 | a default rendition applicable to all instances of an element. |
1230 | HD57 | element consists of an optional sequence of |
1232 | HD57 | elements, each of which must bear a unique identifier, followed by an optional sequence of one or more |
1234 | HD57 | elements, each of which contains a series of |
1236 | HD57 | elements, up to one for each element type from that namespace occurring within the associated |
1249 | HD57-1 | element allows the encoder to specify how one or more elements are rendered in the original source in any of the following ways: |
1253 | HD57-1 | using a standard stylesheet language such as CSS or XSL-FO |
1255 | HD57-1 | using a project-defined formal language |
1264 | HD57-1 | element may be used to indicate a default rendition for all occurrences of the named element |
1268 | HD57-1 | attribute may be used on any element to indicate its rendition, overriding or complementing any supplied default value |
1279 | HD57-1 | elements are by default to be rendered using one set of specifications identified as |
1306 | HD57-1 | As noted above, the content of a |
1308 | HD57-1 | element may describe the appearance of the source material using prose, a project-defined formal language, or any standard languages such as the Cascading Stylesheet Language ( |
1313 | HD57-1 | ) may be supplied within the |
1327 | HD57-1 | First we define a rendition element for each aspect of the source page rendition that we wish to retain. Details of CSS are given in |
1328 | HD57-1 | ; we use it here simply to provide a vocabulary with which to describe such aspects as font size and style, letter and line spacing, colour, etc. Note that the purpose of this encoding is to describe the original, rather than specify how it should be reproduced, although the two are obviously closely linked. |
1355 | HD57-1 | attribute can now be used to specify on any element which of the above rendition features apply to it. For example, a title page might be encoded as follows: |
1393 | HD57-1 | pseudo-elements can be used, often in conjunction with the "content" property, to add characters before or after the element content so that it more closely resembles the appearance of the source. |
1395 | HD57-1 | For example, assuming that a text has been encoded using the |
1397 | HD57-1 | element to enclose passages in quotation marks, but the quotation marks themselves have been routinely omitted from the encoding, a set of renditions such as the following: |
1409 | HD57-1 | element is actually rendered in the source with initial and final quotation marks, it may then be encoded as follows: |
1420 | HD57-2 | element, if present, should contain up to one occurrence of a |
1422 | HD57-2 | element for each element type from the given namespace that occurs within the outermost |
1427 | HD57-2 | In the case of a TEI corpus ( |
1430 | HD57-2 | in a corpus header will describe tag usage across the whole corpus, while one in an individual text header will describe tag usage for the individual text concerned. |
1433 | HD57-2 | element may be used to supply a count of the number of occurrences of this element within the text, which is given as the value of its |
1435 | HD57-2 | attribute. It may also be used to hold any additional usage information, which is supplied as running prose within the element itself. |
1447 | HD57-2 | attribute may optionally be used to specify how many of the occurrences of the element in question bear a value for the global |
1455 | HD57-2 | The content of the |
1461 | HD57-2 | attributes, but if it does, then the counts provided must correspond with the number of such elements present in the associated |
1474 | HD57-1a | The content of the |
1476 | HD57-1a | element and the value of the |
1478 | HD57-1a | attribute are expressed using one of a small number of formally defined style definition languages. For ease of processing, it is strongly recommended to use a single such language throughout an encoding project, although the TEI system permits a mixture. |
1484 | HD57-1a | element, is used to supply the name of the default style definition language. The name is supplied as the value of the |
1490 | HD57-1a | Informal free text description |
1499 | HD57-1a | A user-defined formal description language |
1503 | HD57-1a | attribute may be used to supply the precise version of the style definition language used, and the content of this element, if any, may supply additional information. |
1507 | HD57-1a | attribute is used, its value must always be expressed using whichever default style definition language is in force. If more than one occurrence of the |
1509 | HD57-1a | is provided, there will be more than one default available, and the |
1522 | HD54 | It may contain either a series of prose paragraphs or the following specialized elements: |
1527 | HD54 | Note that not all possible referencing schemes are equally easily supported by current software systems. A choice must be made between the convenience of the encoder and the likely efficiency of the particular software applications envisaged, in this context as in many others. For a more detailed discussion of referencing systems supported by these Guidelines, see section |
1534 | HD54 | as a series of pairs of regular expressions and XPaths |
1537 | HD54 | milestone |
1538 | HD54 | s |
1545 | HD54 | element can be included in the header if more than one canonical reference scheme is to be used in the same document, but the current proposals do not check for mutual inconsistency. |
1551 | HD54P | by a simple prose description. Such a description should indicate which elements carry identifying information, and whether this information is represented as attribute values or as content. Any special rules about how the information is to be interpreted when reading or generating a reference string should also be specified here. Such a prose description cannot be processed automatically, and this method of specifying the structure of a canonical reference system is therefore not recommended for automatic processing. |
1592 | HD54M | This method is appropriate when only |
1593 | HD54M | milestone |
1597 | HD54M | A reference based on milestone tags concatenates the values specified by one or more such tags. Since each tag marks the point at which a value changes, it may be regarded as specifying the |
1598 | HD54M | refState |
1599 | HD54M | of a variable. A reference declaration using this method therefore specifies the individual components of the canonical reference as a sequence of |
1608 | HD54M | might be thought of as representing the state of three variables: the |
1610 | HD54M | variable is in state |
1614 | HD54M | variable is in state |
1618 | HD54M | variable is in state |
1620 | HD54M | . If milestone tagging has been used, there should be a tag marking the point in the text at which each of the above |
1625 | HD54M | tag itself, what are here referred to as |
1634 | HD54M | therefore an application must scan left to right through the text, monitoring changes in the state of each of these three variables as it does so. When all three are simultaneously in the required state, the desired point will have been reached. There may of course be several such points. |
1642 | HD54M | tags in the text are to be checked for state-changes. A state-change is signalled whenever a new |
1644 | HD54M | tag is found with |
1650 | HD54M | element in question. The value for the new state may be given explicitly by the |
1654 | HD54M | element, or it may be implied, if the |
1658 | HD54M | For example, for canonical references in the form |
1662 | HD54M | represents the page number in the first edition, and |
1664 | HD54M | the line number within this page, a reference system declaration such as the following would be appropriate: |
1668 | HD54M | This implies that milestone tags of the form |
1670 | HD54M | will be found throughout the text, marking the positions at which page and line numbers change. Note that no value has been specified for the |
1672 | HD54M | attribute on the second milestone tag above; this implies that its value increases monotonically at each state change. For more detail on the use of milestone tags, see section |
1677 | HD54M | The milestone referencing scheme, though conceptually simple, is not supported by a generic XML parser. Its use places a correspondingly greater burden of verification and accuracy on the encoder. |
1687 | HD54M | A reference system declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the |
1689 | HD54M | attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section |
1695 | HD55 | element is used to group together definitions or sources for any descriptive classification schemes used by other parts of the header. Each such scheme is represented by a |
1705 | HD55 | element has two slightly different, but related, functions. For well-recognized and documented public classification schemes, such as Dewey or other published descriptive thesauri, it contains simply a bibliographic citation indicating where a full description of a particular taxonomy may be found. |
1715 | HD55 | element contains a description of the taxonomy itself as well as an optional bibliographic citation. The description consists of a number of |
1717 | HD55 | elements, each defining a single category within the given typology. The category is defined by the contents of a nested |
1719 | HD55 | element, which may contain either a phrase describing the category, or any number of elements from the |
1721 | HD55 | class. When the corpus module is included in a schema, this class provides the |
1723 | HD55 | element whose components allow the definition of a text type in terms of a set of |
1726 | HD55 | ; if the corpus module is not included in a schema, this class is empty and the |
1730 | HD55 | If the category is subdivided, each subdivision is represented by a nested |
1732 | HD55 | element, having the same structure. Categories may be nested to an arbitrary depth in order to reflect the hierarchical structure of the taxonomy. Each |
1766 | HD55 | Linkage between a particular text and a category within such a taxonomy is made by means of the |
1771 | HD55 | . Where the taxonomy permits classification along more than one dimension, more than one category will be referenced by a particular |
1773 | HD55 | , as in the following example, which identifies a text with the sub-categories |
1779 | HD55 | within the category |
1787 | HD55 | child, when for example the category is described in more than one language, as in the following example: |
1821 | HDGDECL | The following element is provided to indicate (within the header of a document, or in an external location) that a particular coordinate notation, or a particular datum, has been employed in a text. The default notation is a string containing two real numbers separated by whitespace, of which the first indicates latitude and the second longitude according to the 1984 World Geodetic System (WGS84). |
1833 | HDSCHSPEC | , it allows embedding of a schema inside a TEI header; alternatively, this element may be used in the |
1840 | HDSCHSPEC | element contains all the information needed to generate schemas for a particular TEI customization, and the ODD documentation elements, by reference to the TEI, are more succinct than the schemas derived from them. Therefore you may find it convenient to make a copy of the |
1844 | HDSCHSPEC | itself, in addition to supplying an external schema and/or ODD file; if the XML file becomes separated from its schema, the schema can be regenerated at any time using the information in the |
1864 | HDAPP | to allow an application to discover that it has previously opened or edited a file, and what version of itself was used to do that; |
1866 | HDAPP | to show (through a date) which application last edited the file, so that any problems it may have caused can be diagnosed; |
1868 | HDAPP | to allow users to discover information about an application used to edit the file; |
1870 | HDAPP | to allow the application to declare an interest in elements of the file which it has edited, so that other applications or human editors may be more wary of making changes to those sections of the file. |
1886 | HDAPP | element identifies the current state of one software application with regard to the current file. This element is a member of the |
1888 | HDAPP | class, which provides a variety of attributes for associating this state with a date and time, or a temporal range. The |
1892 | HDAPP | attributes should be used to uniquely identify the application and its major version number (for example, |
1894 | HDAPP | ). It is not intended that an application should add a new |
1896 | HDAPP | each time it touches the file. |
1898 | HDAPP | The following example shows how these elements might be used to document the fact that version 1.5 of an application called |
1916 | HDENCOTH | The elements discussed so far are available to any schema. When the schema in use includes some of the more specialized TEI modules, these make available other more module-specific components of the encoding description. These are discussed fully in the documentation for the module in question, but are also noted briefly here for convenience. |
1919 | HDENCOTH | element is available only when the |
1921 | HDENCOTH | module is included in a schema. Its purpose is to document the |
1924 | HDENCOTH | ) underlying any analytic |
1927 | HDENCOTH | ) present in the text documented by this header. |
1930 | HDENCOTH | element is available only when the |
1932 | HDENCOTH | module is included in a schema. Its purpose is to document any metrical notation scheme used in the text, as further discussed in section |
1933 | HDENCOTH | . It consists either of a prose description or a series of |
1938 | HDENCOTH | element is available only when the |
1940 | HDENCOTH | module is included in a schema. Its purpose is to document the method used to encode textual variants in the text, as discussed in section |
1949 | HD4 | element is the third major subdivision of the TEI header. It is an optional element, the purpose of which is to enable information characterizing various descriptive aspects of a text or a corpus to be recorded within a single unified framework. |
1952 | HD4 | In principle, almost any component of the header might be of importance as a means of characterizing a text. The author of a written text, its title, or its date of publication may all be regarded as characterizing it at least as strongly as any of the parameters discussed in this section. The rule of thumb applied has been to exclude from discussion here most of the information which generally forms part of a standard bibliographic description, if only because such information has already been included elsewhere in the TEI header. |
1958 | HD4 | element, followed by any number of additional elements taken from the |
1960 | HD4 | class. The default members of this class are the following: |
1991 | HD4 | . Its purpose is to group together a number of |
1995 | HD4 | element can also appear within a structured manuscript description, when the |
2000 | HD4 | element is actually declared within the header module, but is only accessible to a schema when one or other of the |
2020 | HD4C | element contains phrases describing the origin of the text, e.g. the date and place of its composition. |
2023 | HD4C | The date and place of composition are often of particular importance for studies of linguistic variation; since such information cannot be inferred with confidence from the bibliographic description of the copy text, the |
2025 | HD4C | element may be used to provide a consistent location for this information: |
2044 | HD41 | elements, each of which provides information about a single language, notably the quantity of that language present in the text. Note that this element should |
2056 | HD41 | element may be supplied for each different language used in a document. If used, its |
2058 | HD41 | attribute should specify an appropriate language identifier, as further discussed in section |
2059 | HD41 | . This is particularly important if extended language identifiers have been used as the value of |
2079 | HD43 | element is used to classify a text in some way. |
2087 | HD43 | by providing a set of keywords, as provided for example by British Library or Library of Congress Cataloguing in Publication data |
2089 | HD43 | by referencing any other taxonomy of text categories recognized in the field concerned, or peculiar to the material in hand; this may include one based on recurring sets of values for the situational parameters defined in section |
2101 | HD43 | element simply categorizes an individual text by supplying a list of keywords which may describe its topic or subject matter, its form, date, etc. In some schemes, the order of items in the list is significant, for example, from major topic to minor; in others, the list has an organized substructure of its own. No recommendations are made here as to which method is to be preferred. Wherever possible, such keywords should be taken from a recognized source, such as the British Library/Library of Congress Cataloguing in Publication data in the case of printed books, or a published thesaurus appropriate to the field. |
2105 | HD43 | attribute is used to indicate the source of the keywords used, in the case where such a source exists. If the keywords are taken from an externally defined authority which is available online, this attribute should point directly to it, as in the following examples: |
2125 | HD43 | If the authority file is not available online, but is generally recognized and commonly cited, a bibliographic description for it should be supplied within the |
2130 | HD43 | attribute may then reference that |
2154 | HD43 | If no authority file exists, perhaps because the keywords used were assigned directly by an author, the |
2158 | HD43 | Alternatively, if the keyword vocabulary itself is locally defined, the |
2172 | HD43 | element also categorizes an individual text, by supplying a numerical or other code rather than descriptive terms. Such codes constitute a recognized classification scheme, such as the Dewey Decimal Classification. On this element, the |
2174 | HD43 | attribute is required; it indicates the source of the classification scheme in the same way as for keywords: this may be a pointer of any kind, either to a TEI element, possibly in the current document, as in the |
2176 | HD43 | examples above, or to some canonical source for the scheme, as in the following example: |
2183 | HD43 | element categorizes an individual text by pointing to one or more |
2192 | HD43 | ) holds information about a particular classification or category within a given taxonomy. Each such category must have a unique identifier, which may be supplied as the value of the |
2196 | HD43 | elements which are regarded as falling within the category indicated. |
2198 | HD43 | A text may, of course, fall into more than one category, in which case more than one identifier may be supplied as the value for the |
2205 | HD43 | attribute may be supplied to specify the taxonomy to which the categories identified by the target attribute belong, if this is not adequately conveyed by the resource pointed to. For example, |
2207 | HD43 | Here the same text has been classified as of categories |
2213 | HD43 | ), and as of category |
2219 | HD43 | with multiple identifiers in the value of |
2223 | HD43 | elements, each with a single identifier in the value of |
2225 | HD43 | . However, note that maintenance of a TEI document with a large number of values within a single |
2233 | HD43 | elements is that the values used as identifying codes are exhaustively enumerated for the former, typically within the TEI header. In the latter case, however, the values use any externally-defined scheme, and therefore may be taken from a more open-ended descriptive classification system. |
2240 | HD4ABS | The main purpose of the |
2242 | HD4ABS | element is to supply a brief resume or abstract for an article which was originally published without such a component. An abstract or summary forming part of the document at its creation should usually appear in the front matter ( |
2265 | HD4ABS | The same element may be used to provide other summary information supplied by the encoder, perhaps grouped together into a list of discrete items: |
2310 | HD44 | Each such element contains one or more paragraphs of description for the calendar system concerned, and also supplies an identifying code for it as the value of its |
2324 | HD44 | This identifying code may then be referenced from any element supplying a date expressed using that calendar system: |
2348 | HD44CD | This information is complementary to the detailed descriptions of physical objects (such as letters) associated with correspondence activities, which are typically provided by the sourceDesc element. |
2367 | HD44CD | element is used to group references relevant to the item of correspondence being described, typically to other items such as the item to which it is a reply, or the item which replies to it: |
2394 | HD44CD | to describe the sending of a letter by Adelbert von Chamisso from Vertus on 29 January 1807 to Louis de La Foye at Caen. The date of reception is unknown: |
2414 | HD44CD | to provide a normalized form of the date. The content of the |
2416 | HD44CD | element may also be omitted, since no underlying source is being transcribed. |
2420 | HD44CD | if the action is considered to apply to them all acting as a single group. In the following example two people are considered to have received the communication. |
2459 | HD44CD | The same person may be associated with many actions. For example, it will often be the case that the author and sender of a message are identical, and that many individual letters will need to be associated with the same person. The |
2462 | HD44CD | may be used to indicate that the same name applies to many actions. Its value will usually be the identifier of an element defining the person or name concerned, which is supplied elsewhere in the document. |
2470 | HD44CD | It is assumed that each correspondence action applies to a single act of communication. It may however be the case that the same physical object is involved in several such acts, if for example person A sends a letter to person B, who then annotates it and sends it on to person C, or if persons A and B both use the same document to convey quite different messages. In such situations, multiple |
2472 | HD44CD | elements should be supplied, one for each communication. In the following example, the same document contains distinct messages, sent by two different people to the same destination: |
2520 | HD6 | The final sub-element of the TEI header, the |
2522 | HD6 | element, provides a detailed change log in which each change made to a text may be recorded. Its use is optional but highly recommended. It provides essential information for the administration of large numbers of files which are being updated, corrected, or otherwise modified as well as extremely useful documentation for files being passed from researcher to researcher or system to system. Without change logs, it is easy to confuse different versions of a file, or to remain unaware of small but important changes made in the file by some earlier link in the chain of distribution. No significant change should be made in any TEI-conformant file without corresponding entries being made in the change log. |
2529 | HD6 | The main purpose of the revision description is to record changes in the text to which a header is prefixed. However, it is recommended TEI practice to include entries also for significant changes in the header itself (other than the revision description itself, of course). At the very least, an entry should be supplied indicating the date of creation of the header. |
2531 | HD6 | The log consists of a list of entries, one for each change. Changes may be grouped and organised using either the |
2537 | HD6 | . Alternatively, a simple sequence of |
2543 | HD6 | may be supplied for each |
2545 | HD6 | element to indicate its date and the person responsible for it respectively. The description of the change itself can range from a simple phrase to a series of paragraphs. If a number is to be associated with one or more changes (for example, a revision number), the global |
2628 | HD7 | The TEI header allows for the provision of a very large amount of information concerning the text itself, its source, its encodings, and revisions of it, as well as a wealth of descriptive information such as the languages it uses and the situation(s) in which it was produced, together with the setting and identity of participants within it. This diversity and richness reflects the diversity of uses to which it is envisaged that electronic texts conforming to these Guidelines will be put. It is emphatically |
2630 | HD7 | intended that all of the elements described above should be present in every TEI Header. |
2632 | HD7 | The amount of encoding in a header will depend both on the nature and the intended use of the text. At one extreme, an encoder may expect that the header will be needed only to provide a bibliographic identification of the text adequate to local needs. At the other, wishing to ensure that their texts can be used for the widest range of applications, encoders will want to document as explicitly as possible both bibliographic and descriptive information, in such a way that no prior or ancillary knowledge about the text is needed in order to process it. The header in such a case will be very full, approximating to the kind of documentation often supplied in the form of a manual. Most texts will lie somewhere between these extremes; textual corpora in particular will tend more to the latter extreme. In the remainder of this section we demonstrate first the minimal, and next a commonly recommended, level of encoding for the bibliographic information held by the TEI header. |
2634 | HD7 | Supplying only the minimal level of encoding required, the TEI header of a single text might look like the following example: |
2656 | HD7 | The only mandatory component of the TEI header is the |
2664 | HD7 | are all required constituents. Within the title statement, a title is required, and an author should be specified, even if it is |
2666 | HD7 | , as should some additional statement of responsibility, here given by the |
2670 | HD7 | , a publisher, distributor, or other agency responsible for the file must be specified. Finally, the source description should contain at the least a loosely structured bibliographic citation identifying the source of the electronic text if (as is usually the case) there is one. |
2672 | HD7 | We now present the same example header, expanded to include additionally recommended information, adequate to most bibliographic purposes, in particular to allow for the creation of an |
2674 | HD7 | -conformant bibliographic record. We have also added information about the encoding principles used in this (imaginary) encoding, about the text itself (in the form of Library of Congress subject headings), and about the revision of the file. |
2848 | HD7 | Many other examples of recommended usage for the elements discussed in this chapter are provided here, in the reference index and in the associated tutorials. |
2852 | HD8 | A strong motivation in preparing the material in this chapter was to provide in the TEI header a viable chief source of information for cataloguing computer files. The TEI header is not a library catalogue record, and so will not make all of the distinctions essential in standard library work. It also includes much information generally excluded from standard bibliographic descriptions. It is the intention of the developers, however, to ensure that the information required for a catalogue record be retrievable from the TEI file header, and moreover that the mapping from the one to the other be as simple and straightforward as possible. Where the correspondence is not obvious, it may prove useful to consult one of the works which were influential in developing the content of the TEI header. These include: |
2856 | HD8 | is an international standard setting out what information should be recorded in a description of a bibliographical item. Until a consolidated edition published in 2011, there was a general standard called ISBD(G) and separate ISBDs covering different types of material, e.g. ISBD(M) for monographs, ISBD(ER) for electronic resources. These separate ISBDs follow the same general scheme as the main ISBD(G), but provide appropriate interpretations for the specific materials under consideration. |
2862 | HD8 | were published in 1978, with revisions appearing periodically through 2005. AACR2 provides guidelines for the construction of catalogues in general libraries in the English-speaking world. AACR2 is explicitly based on the general framework of the ISBD(G) and the subsidiary ISBDs: it gives a description of how to describe bibliographic items and how to create access points such as subject or name headings and uniform titles. Other national cataloguing codes exist as well, including the Z44 series of standards issued by the Association française de normalisation (AFNOR), |
2865 | HD8 | Regole italiane di catalogazione per autore |
2876 | HD8 | Since the TEI file description elements are based on the ISBD areas, it should be possible to use the content of the file description as the basis for a catalogue record for a TEI document. However, cataloguers should be aware that the permissive nature of the TEI Guidelines may lead to divergences between practice in using the TEI file description and the comparatively strict recommendations of AACR2 and other national cataloguing codes. Such divergences as the following may preclude automatic generation of catalogue records from TEI headers: |
2878 | HD8 | The TEI Guidelines do not require that text be transcribed from the |
2879 | HD8 | chief source of information |
2880 | HD8 | using normalized capitalization and punctuation |
2883 | HD8 | The TEI title statement may not categorize constituent titles in the same way as prescribed by a national cataloguing code. |
2885 | HD8 | The TEI title statement contains authors, editors, and other responsible parties in separate elements, with names which may not have been normalized; it does not necessarily contain a single statement of responsibility |
2888 | HD8 | There is no specific place in a TEI header to specify the |
2889 | HD8 | main entry |
2893 | HD8 | name or title headings under which a catalogue record is filed |
2896 | HD8 | The TEI header does not require use of a particular vocabulary for subject headings nor require the use of subject headings. |
2900 | HD | The TEI Header Module |
2904 | header | The TEI Header |
2913 | HD | The selection and combination of modules to form a TEI schema is described in |
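
The HD7 rows above describe the minimal header: the file description is the only mandatory component, and within it the title statement, publication statement, and source description are all required. The following is a minimal sketch of that rule, not a substitute for schema validation. It uses only the Python standard library; the sample header content (title, names, archive, citation) and the helper name `missing_required` are hypothetical placeholders, not taken from the Guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical minimal header: every title, name, and citation is a placeholder.
MINIMAL_HEADER = f"""
<teiHeader xmlns="{TEI_NS}">
  <fileDesc>
    <titleStmt>
      <title>An example text: a machine-readable transcript</title>
      <author>Unknown</author>
      <respStmt>
        <resp>compiled by</resp>
        <name>A. N. Encoder</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <distributor>Example Text Archive</distributor>
    </publicationStmt>
    <sourceDesc>
      <bibl>An example text (Example City: Example Press, 1900)</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
"""

# Required constituents of the file description, as listed in the text above.
REQUIRED = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_required(header_xml):
    """Return the names of required header components that are absent.

    This is only a presence check (an assumption of this sketch); it does
    not validate content models or attribute values.
    """
    root = ET.fromstring(header_xml.strip())
    file_desc = root.find(f"{{{TEI_NS}}}fileDesc")
    if file_desc is None:
        return ["fileDesc", *REQUIRED]
    return [
        name
        for name in REQUIRED
        if file_desc.find(f"{{{TEI_NS}}}{name}") is None
    ]


if __name__ == "__main__":
    print(missing_required(MINIMAL_HEADER))
```

Calling `missing_required` on a header that lacks one of the three statements returns that statement's name, which makes the check usable as a quick pre-flight test before full validation against a TEI schema.
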
# | id | text |
---|---|---|
4 | TD | This chapter describes a module which may be used for the documentation of the XML elements and element classes which make up any markup scheme, in particular that described by the TEI Guidelines, and also for the automatic generation of schemas or DTDs conforming to that documentation. It should be used also by those wishing to customize or modify these Guidelines in a conformant manner, as further described in chapters |
6 | TD | and may also be useful in the documentation of any other comparable encoding scheme, even though it contains some aspects which are specific to the TEI and may not be generally applicable. |
13 | TD | , and was the name invented by the original TEI Editors for the predecessor of the system currently used for this purpose. See further |
16 | TD | Like any other piece of XML software, an ODD processor may be instantiated in many ways: the current system uses a number of XSLT stylesheets which are freely available from the TEI, but this specification makes no particular assumptions about the tools which will be used to provide an ODD processing environment. |
18 | TD | As the name suggests, an ODD processor uses a single XML document to generate multiple outputs. These outputs will include: |
23 | TD | detailed descriptive documentation, embedding some parts of the formal reference documentation, such as the tag description lists provided in this and other chapters of these Guidelines; |
25 | TD | declarative code for one or more XML schema languages, such as RELAX NG, W3C Schema, ISO Schematron, or DTD. |
30 | TD | The input required to generate these outputs consists of running prose, and special purpose elements documenting the components (elements, classes, etc.) which are to be declared in the chosen schema language. All of this input is encoded in XML using elements defined in this chapter. In order to support more than one schema language, these elements constitute a comparatively high-level model which can then be mapped by an ODD processor to the specific constructs appropriate for the schema language in use. Although some modern schema languages such as RELAX NG or W3C Schema natively support self-documentary features of this kind, we have chosen to retain the ODD model, if only for reasons of compatibility with earlier versions of these Guidelines. For reasons of backwards compatibility, the ISO standard XML schema language RELAX NG ( |
31 | TD | ) may be used as a means of declaring content models and datatypes, but it is also possible to express content models using natively TEI XML constructs. We also use the ISO Schematron language to define additional constraints beyond those expressed in the content model, as further discussed in |
34 | TD | In the TEI system, a |
38 | TD | and has an identifier unique across the whole TEI scheme. For convenience, these specifications are grouped into a number of discrete |
40 | TD | , which can also be combined more or less as required. Each major chapter of these Guidelines defines a distinct module. Each module declares a number of |
43 | TD | classes |
44 | TD | . All classes are available globally, irrespective of the module in which they are declared; particular modules extend the meaning of a class by adding elements or attributes to it. Wherever possible, element content models are defined in terms of classes rather than in terms of specific elements. Modules can also declare particular |
46 | TD | , which act as short-cuts for commonly used content models or class references. |
48 | TD | In the present chapter, we discuss the components needed to support this system. In addition, section |
49 | TD | discusses some general purpose elements which may be useful in any kind of technical documentation, wherever there is need to talk about technical features of an XML encoding such as element names and attributes. Section |
54 | TD | provides a summary overview of the elements provided by this module. |
62 | TDphraseTE | In any kind of technical documentation, the following phrase-level elements may be found useful for marking up strings of text which need to be distinguished from the running text because they come from some formal language: |
66 | TDphraseTE | Like other phrase-level elements used to indicate the semantics of a typographically distinct string, these are members of the |
68 | TDphraseTE | class. They are available anywhere that running prose is permitted when the module defined by this chapter is included in a schema. |
74 | TDphraseTE | elements are intended for use when citing brief passages in some formal language such as a programming language, as in the following example: |
91 | TDphraseTE | A further group of similar phrase-level elements is also defined for the special case of representing parts of an XML document: |
101 | TDphraseTE | . They are also available anywhere that running prose is permitted when the module defined by this chapter is included in a schema. |
103 | TDphraseTE | As an example of the recommended use of these elements, we quote from an imaginary TEI working paper: |
131 | TDphraseTE | element may be used to enclose any kind of example, which will typically be rendered as a distinct block, possibly using particular formatting conventions, when the document is processed. It is a specialized form of the more general |
133 | TDphraseTE | element provided by the TEI core module. In documents containing examples of XML markup, the |
136 | TDphraseTE | , since the content of this element can be checked for well-formedness. |
140 | TDphraseTE | when this module is included in a schema. That class is a part of the general |
152 | TDphraseEA | Within the body of a document using this module, the following elements may be used to reference parts of the specification elements discussed in section |
159 | TDphraseEA | TEI practice recommends that a |
161 | TDphraseEA | listing the elements under discussion introduce each subsection of a module's documentation. The source for the present section, for example, begins as follows: |
178 | TDphraseEA | element in this example, an ODD processor might simply generate the section number and title of the section referred to, perhaps additionally inserting a link to the section. In a similar way, when processing the |
184 | TDphraseEA | in this case) from their associated declaration elements: typically, the details recovered will include a brief description of the element and its attributes. These, and other data, will be stored in a specification element elsewhere within the current document, or they may be supplied by the ODD processor in some other way, for example from a database. For this reason, the link to the required specification element is always made using a TEI-defined key rather than an XML IDREF value. The ODD processor uses this key as a means of accessing the specification element required. There is no requirement that this be performed using the XML ID/IDREF mechanism, but there is an assumption that the identifier be unique. |
213 | TDmodules | As mentioned above, the primary purpose of this module is to facilitate the documentation and creation of an XML schema derived from the TEI Guidelines. The following elements are provided for this purpose: |
217 | TDmodules | is a convenient way of grouping together element and other declarations, and of associating an externally-visible name with the resulting group. A |
218 | TDmodules | specification group |
219 | TDmodules | performs essentially the same function, but the resulting group is not accessible outside the scope of the ODD document in which it is defined, whereas a module can be accessed by name from any TEI schema specification. Elements, and their attributes, element classes, and patterns are all individually documented using further elements described in section |
220 | TDmodules | below; part of that specification includes the name of the module to which the component belongs. |
224 | TDmodules | element found. For example, the chapter documenting the TEI module for names and dates contains a module specification like the following: |
241 | TDmodules | attribute, the value of which is |
242 | TDmodules | namesdates |
245 | TDmodules | element above can thus generate a schema fragment for the TEI |
249 | TDmodules | In most realistic applications, it will be desirable to combine more than one module together to form a complete |
251 | TDmodules | . A schema consists of references to one or more modules or specification groups, and may also contain explicit declarations or redeclarations of elements (see further |
253 | TDmodules | The distinction between base and additional tagsets in earlier versions of the TEI scheme has not been carried forward into P5. |
256 | TDmodules | A schema can combine references to TEI modules with references to other (non-TEI) modules using different namespaces, for example to include mathematical markup expressed using MathML in a TEI document. By default, the effect of combining modules is to allow all of the components declared by the constituent modules to coexist (where this is syntactically possible: where it is not—for example, because of name clashes—a schema cannot be generated). It is also possible to over-ride declarations contained by a module, as further discussed in section |
264 | TDmodules | attribute, and may then be referenced from any point in an ODD document using the |
266 | TDmodules | element. This is useful if, for example, it is desired to describe particular groups of elements in a specific sequence. Note however that the order in which element declarations appear within the schema code generated from an ODD file element is not in general affected by the order of declarations within a |
270 | TDmodules | An ODD processor will generate a piece of schema code corresponding with the declarations contained by a |
272 | TDmodules | element in the documentation being output, and a cross-reference to such a piece of schema code when processing a |
274 | TDmodules | . For example, if the input text reads |
285 | TDmodules | then the output documentation will replace the two |
287 | TDmodules | elements above with a representation of the schema code declaring the elements |
297 | TDmodules | respectively. Similarly, if the input text contains elsewhere a passage such as |
304 | TDmodules | then the |
306 | TDmodules | elements may be replaced by an appropriate piece of reference text such as |
331 | TDcrystals | Unlike most elements in the TEI scheme, each of these |
333 | TDcrystals | has a fairly rigid internal structure consisting of a large number of child elements which are always presented in the same order. |
334 | TDcrystals | Furthermore, since these elements all describe markup objects in broadly similar ways, they have several child elements in common. In the remainder of this chapter, we discuss first the elements which are common to all the specification elements, and then those which are specific to a particular type. |
338 | TDcrystals | element, but the specification element for any particular component may only appear once (except in the case where a modification is being defined; see further |
339 | TDcrystals | ). The order in which they appear will not affect the order in which they are presented within any schema module generated from the document. In documentation mode, however, an ODD processor will output the schema declarations corresponding with a specification element at the point in the text where they are encountered, provided that they are contained by a |
342 | TDcrystals | as discussed in the previous section. An ODD processor will also associate all declarations found with the nominated module, thus including them within the schema code generated for that module, and it will also generate a full reference description for the object concerned in a catalogue of markup objects. These latter two actions always occur irrespective of whether or not the declaration is included in a |
355 | TDcrystalsCE | This section discusses the child elements common to all of the specification elements; some of these are defined in the core module ( |
373 | TDcrystalsCEdc | element may be used to provide a brief explanation for the name of the object if this is not self-explanatory. For example, the specification for the element |
375 | TDcrystalsCEdc | used to mark arbitrary blocks of text begins as follows: |
382 | TDcrystalsCEdc | may also be supplied for an attribute name or an attribute value in similar circumstances: |
400 | TDcrystalsCEdc | element is needed to explain the significance of the identifier for an item only when this is not apparent, for example because it is abbreviated, as in the above example. It should not be used to provide a full description of the intended meaning (this is the function of the |
402 | TDcrystalsCEdc | element), nor to comment on equivalent values in other schemes (this is the purpose of the |
406 | TDcrystalsCEdc | attribute value in other languages (this is the purpose of the |
412 | TDcrystalsCEdc | element provide a brief characterization of the intended function of the object being documented in a form that permits its quotation out of context, as in the following example: |
428 | TDcrystalsCEdc | Where specifications are supplied in multiple languages, the elements |
432 | TDcrystalsCEdc | may be repeated as often as needed. Each such description or gloss should carry both an |
436 | TDcrystalsCEdc | attribute to indicate the language used and the date on which the translated text was last checked against its source. |
442 | TDcrystalsCEdc | attribute is used to supply a pointer to some location where such external concepts are defined. For example, to indicate that the TEI |
444 | TDcrystalsCEdc | element corresponds to the concept defined by the CIDOC CRM category E69, the declaration for the former might begin as follows: |
458 | TDcrystalsCEdc | attributes to point to an implementation of the mapping. This is useful when a TEI customization (see |
461 | TDcrystalsCEdc | for convenience of data entry or markup readability. For example, suppose that in some TEI customization an element |
464 | TDcrystalsCEdc | hi rend='bold' |
467 | TDcrystalsCEdc | element can be converted to canonical TEI by obtaining a filter from the URI specified, and running the procedure with the name |
471 | TDcrystalsCEdc | attribute specifies the language (in this case XSL) in which the filter is written: |
484 | TDcrystalsCEdc | element is used to provide an alternative name for an object, for example using a different natural language. Thus, the following might be used to indicate that the |
496 | TDcrystalsCEdc | may also be referred to using the alternate identifier |
512 | TDcrystalsCEdc | of a component is identical to the value of its |
518 | TDcrystalsCEdc | element contains any additional commentary about how the item concerned may be used, details of implementation-related issues, suggestions for other ways of treating related information etc., as in the following example: |
534 | TDcrystalsCEdc | A specification element will usually conclude with a list of references, each tagged using the standard |
538 | TDcrystalsCEdc | element: in the case of the |
540 | TDcrystalsCEdc | element discussed above, the list is as follows: |
545 | TDcrystalsCEdc | where the value |
570 | TDeg | attribute may be used on either element to indicate the source from which an example is taken, typically by means of a pointer to an entry in an associated bibliography, as in the following example: |
576 | TDeg | element should be used. In such a case, it will clearly be necessary to distinguish the markup within the example from the markup of the document itself. In an XML environment, this is easily done by using a different namespace for the content of the |
592 | TDeg | If the XML contained in an example is not well-formed then it must either be enclosed in a CDATA marked section, or |
606 | TDeg | element should not be used to tag non-XML examples: the general purpose |
616 | TDcrystalsCEcl | In the TEI scheme elements are assigned to one or more |
617 | TDcrystalsCEcl | classes |
630 | TDcrystalsCEcl | element. It specifies the classes of which the element or class concerned is a member by means of one or more |
679 | DEFCON | may have three different kinds of content. It may express a content model directly using the TEI elements discussed in the remainder of this section. Alternatively, it may use a schema language of some kind, as defined by a pattern called |
680 | DEFCON | macro.schemaPattern |
682 | DEFCON | below. As a third possibility, the legal content for an element may be exhaustively specified using the |
687 | DEFCON | The following elements are used to define a content model: |
707 | DEFCON | provides the name of an element which may appear at a certain point in a content model. A |
709 | DEFCON | provides the name of a class, members of which may appear at a certain point in a content model. A |
711 | DEFCON | provides the name of a predefined macro, the expansion of which may be inserted at a certain point in a content model. |
718 | DEFCON | Finally, two wrapper elements are provided to indicate whether the components of a content model form a sequence or an alternation: |
731 | DEFCON | This is the content model for the macro |
733 | DEFCON | , which is defined as containing any number (including zero) of elements from the |
745 | DEFCON | This is the content model for the |
747 | DEFCON | element, which is defined as a sequence of components, firstly a mandatory |
749 | DEFCON | , followed by any number (including zero) of elements from the |
759 | TDTAGCONT | Alternatively, element content models may be defined using RELAX NG patterns, or by expressions in some other schema language, depending on the value of the |
760 | TDTAGCONT | macro.schemaPattern |
769 | TDTAGCONT | element appears will have a content model which is expressed in RELAX NG as |
770 | TDTAGCONT | text |
771 | TDTAGCONT | , using the RELAX NG namespace. This model will be copied unchanged to the output when RELAX NG schemas are being generated. When an XML DTD is being generated, an equivalent declaration (in this case |
787 | TDTAGCONT | This is the content model for the |
793 | TDTAGCONT | The RELAX NG language does not formally distinguish element names, attribute names, class names, or macro names: all names are patterns which are handled in the same way, as the above example shows. Within the TEI scheme, however, different naming conventions are used to distinguish amongst the objects being named. Unqualified names ( |
794 | TDTAGCONT | fileDesc |
796 | TDTAGCONT | revisionDesc |
805 | TDTAGCONT | ) are always class names. In DTD language, classes are represented by parameter entities ( |
810 | TDTAGCONT | The RELAX NG pattern names generated by an ODD processor by default include a special prefix, the default value for which is set using the |
815 | TDTAGCONT | The purpose of this is to ensure that the pattern name generated is uniquely identified as belonging to a particular schema, and thus avoid name clashes. For example, in a RELAX NG schema combining the TEI element |
822 | TDTAGCONT | ident |
823 | TDTAGCONT | . Most of the time, this behaviour is entirely transparent to the user; the one occasion when it is not will be where a content model (expressed using RELAX NG syntax) needs explicitly to reference either the TEI |
829 | TDTAGCONT | may be used. For example, suppose that we wish to define a content model for |
831 | TDTAGCONT | which permits either a TEI |
835 | TDTAGCONT | defined by some other vocabulary. A suitable content model would be generated from the following |
850 | TDTAGCONS | element, a set of general |
854 | TDTAGCONS | attribute) in order that a TEI customization may override, delete or change them individually. Each |
863 | TDTAGCONS | assertion language |
864 | TDTAGCONS | , together with a RELAX NG schema to validate it. The Schematron assertion language provides a powerful way of expressing constraints on the content of any XML document in addition to those provided by other schema languages. Such constraints can be embedded within a TEI schema specification using the methods exemplified in this chapter. An ODD processor will typically process any |
866 | TDTAGCONS | elements in a TEI specification whose |
870 | TDTAGCONS | The TEI Guidelines include some additional constraints which are expressed using the ISO Schematron language. A conformant TEI document should respect these constraints, although automatic validation of them may not be possible for all processors. A TEI customization may likewise specify additional constraints using this mechanism. Some examples of what is possible using the Schematron language are given below. |
872 | TDTAGCONS | Constraints are generally used to model local rules which may be outside the scope of the target schema language. For example, in earlier versions of these Guidelines several constraints on the usage of the attributes of the TEI element |
881 | TDTAGCONS | may be supplied only if the attribute |
884 | TDTAGCONS | . Few schema languages support co-occurrence constraints such as the latter. In the current version of the Guidelines, constraint specifications expressed as Schematron rules have been added, as follows: |
906 | TDTAGCONS | The constraints in the preceding example all related to attributes in the empty namespace, and the Schematron rules did not therefore need to define a TEI namespace prefix. The Schematron language |
908 | TDTAGCONS | element should be used to do this when a constraint needs to refer to a TEI element, as in the following example, which models the constraint that a TEI |
921 | TDTAGCONS | Schematron rules are also useful where an application needs to enforce rules on attribute values, as in the following examples which check that various types of |
939 | TDTAGCONS | As a further example, Schematron may be used to enforce rules applicable to a TEI document which is going to be rendered into accessible HTML, for example to check that some sort of content is available from which the |
956 | TDTAGCONS | Schematron rules can also be used to enforce other HTML accessibility rules about tables; note here the use of a report and an assertion within one pattern: |
973 | TDTAGCONS | Constraints can be expressed using any convenient language. The following example uses a pattern matching language called SPITBOL to express the requirement that title and author should be different. Implementing private schemes of this kind will, however, generally be more problematic than simply adopting a widely deployed system such as ISO Schematron. |
988 | TDATT | element is used to document information about a collection of attributes, either within an |
992 | TDATT | . An attribute list can be organized either as a group of attribute definitions, all of which are understood to be available, or as a choice of attribute definitions, of which only one is understood to be available. An attribute list may thus contain nested attribute lists. |
998 | TDATT | elements are all to be made available, or whether only one of them may be used. For example, the attribute list for the element |
1000 | TDATT | contains a nested attribute list to indicate that either the |
1020 | TDATT | element is used to document a single attribute, using an appropriate selection from the common elements already mentioned and the following which are specific to attributes: |
1034 | TDATT | is used to specify only the attributes which are specific to that particular element. Instances of the element may carry other attributes which are declared by the classes of which the element is a member. These extra attributes, which are shared by other elements, or by all elements, are specified by an |
1046 | TD-datatypes | element is used to state what kind of value an attribute may have. The TEI defines a number of datatype macros, each with an identifier beginning |
1048 | TD-datatypes | , which are used in preference to the datatypes available natively from the target schema, since the facilities provided by different schema languages vary so widely. The available TEI datatypes are described in section |
1051 | TD-datatypes | A TEI schema specification using RELAX NG may choose to define datatypes directly using RELAX NG syntax, for example |
1054 | TD-datatypes | permits any string of Unicode characters not containing markup, and is thus the equivalent of |
1058 | TD-datatypes | The RELAX NG language also provides support for a number of more complex cases such as choices or lists. |
1059 | TD-datatypes | Such usages are permitted by the scheme documented here, but are not recommended when it is desired to remain independent of a particular schema language, since the full generality of one schema language cannot readily be converted to that of another. In the TEI abstract model, datatyping should preferably be carried out either by explicit enumeration of permitted values (using the TEI-specific |
1061 | TD-datatypes | element described below), by reference to an existing datatype macro, or by definition of a new datatype, using the |
1070 | TD-datatypes | are provided for the case where an attribute may take more than one value of the type specified. The |
1083 | TD-datatypes | attribute may take any number of values, each being of the type defined by the TEI |
1085 | TD-datatypes | macro. As is usual in XML, multiple values for a single attribute are separated by one or more white space characters. Hence, values such as |
1098 | TDATTvs | element may be used to describe constraints on data content in an informal way: for example |
1115 | TDATTvs | must take positive integer values less than 150, the datatype |
1155 | TDATTvs | Where all the possible values for an attribute can be enumerated, the datatype |
1173 | TDATTvs | element here to explain the otherwise less than obvious meaning of the codes used for these values. Since this value list specifies that it is of type |
1181 | TDATTvs | attribute will have the value |
1212 | TDATTvs | The datatype will be |
1220 | TDATTvs | element) to put constraints on the permitted content of an element, as noted at |
1221 | TDATTvs | . This use is not however supported by all schema languages, and is therefore not recommended if support for non-RELAX NG systems is a consideration. |
1246 | TDCLA | A model class specification does not list all of its members. Instead, its members declare that they belong to it by means of a |
1252 | TDCLA | element for each class of which the relevant element is a member, supplying the name of the relevant class. For example, the |
1280 | TDCLA | The function of a model class declaration is to provide another way of referring to a group of elements. It does not confer any other properties on the elements which constitute its membership. |
1288 | TDCLA | classes. In the case of attribute classes, the attributes provided by membership in the class are documented by an |
1292 | TDCLA | . In the case of model classes, no further information is needed to define the class beyond its description, its identifier, and optionally any classes of which it is a member. |
1294 | TDCLA | When a model class is referenced in the content model of an element (i.e. in the |
1298 | TDCLA | ), its meaning will depend on the name used to reference the class. |
1300 | TDCLA | If the reference simply takes the form of the class name, it is interpreted to mean an alternated list of all the current members of the class. For example, suppose that the members of the class |
1308 | TDCLA | . Then a content model such as |
1312 | TDCLA | would be equivalent to the explicit content model: |
1322 | TDCLA | ). However, a content model referencing the class as |
1324 | TDCLA | would be equivalent to the following explicit content model: |
1334 | TDCLA | The following suffixes, appended with an underscore, can be given to a class name when it is referenced in a content model: |
1340 | TDCLA | sequence |
1342 | TDCLA | members of the class are to be provided in sequence |
1354 | TDCLA | members of the class must be provided one or more times, in sequence |
1360 | TDCLA | in a content model would be equivalent to: |
1384 | TDCLA | sequence |
1385 | TDCLA | in which members of a class appear in a content model when one of the sequence options is used is that in which the elements are declared. |
1391 | TDCLA | attribute, which can be used to say that this particular model may only be referenced in a content model with the suffixes it specifies. For example, if the |
1395 | TDCLA | took the form |
1396 | TDCLA | classSpec ident="model.hiLike" generate="sequence sequenceOptional" |
1397 | TDCLA | then a content model referring to (say) |
1411 | TDCLA | defines a small set of attributes common to all elements which are members of that class: those attributes are listed by the |
1423 | TDCLA | , to which some modules contribute additional attributes when they are included in a schema. |
1453 | TDENT | element may be used to select a specific named pattern from those available. Patterns are used as a shorthand chiefly to describe common content models and datatypes, but may be used for any purpose. The following elements are used to represent patterns: |
1488 | TDbuild | specification elements also have an attribute which determines the namespace to which the object being created will belong. In the case of |
1490 | TDbuild | , this namespace is inherited by all the elements created in the schema, unless they have their own |
1496 | TDbuild | These attributes are used by an ODD processor to determine how declarations are to be combined to form a schema or DTD, as further discussed in this section. |
1498 | TDbuild | As noted above, a TEI schema is defined by a |
1500 | TDbuild | element containing an arbitrary mixture of explicit declarations for objects (i.e. elements, classes, patterns, or macro specifications) and references to other objects containing such declarations (i.e. references to specification groups, or to modules). A major purpose of this mechanism is to simplify the process of defining user customizations, by providing a formal method for the user to combine new declarations with existing ones, or to modify particular parts of existing declarations. |
1506 | TDbuild | An ODD processor, given such a document, should combine the declarations which belong to the named modules, and deliver the result as a schema of the requested type. It may also generate documentation for the elements declared by those modules. No source is specified for the modules, and the schema will therefore combine the declarations found in the most recent release version of the TEI Guidelines known to the ODD processor in use. |
1508 | TDbuild | The value specified for the |
1510 | TDbuild | attribute, when it is supplied as a URL, specifies any convenient location from which the relevant ODD files may be obtained. For the current release of the TEI Guidelines, a URL in the form |
1516 | TDbuild | . Alternatively, if the ODD files are locally installed, it may be more convenient to supply a value such as |
1520 | TDbuild | The value for the |
1522 | TDbuild | attribute may be any form of URI. A set of TEI-conformant specifications in a form directly usable by an ODD processor must be available at the location indicated. When no |
1524 | TDbuild | value is supplied, an ODD processor may either raise an error or assume that the location of the current release of the TEI Guidelines is intended. |
1526 | TDbuild | If the source is specified in the form of a private URI, the form recommended is |
1530 | TDbuild | is a prefix indicating the markup language in use, and |
1534 | TDbuild | should be used to reference release 1.2.1 of the current TEI Guidelines. When such a URI is used, it will usually be necessary to translate it before such a file can be used in blind interchange. |
1542 | TDbuild | which allow the encoder to supply an explicit list of elements from the stated module which are to be included or excluded respectively. For example: |
1546 | TDbuild | The schema specified here will include all the elements supplied by the core module except for |
1558 | TDbuild | elements from the linking module. |
1567 | TDbuild | Note that in this last case, there is no need to specify the name of the module in which the two element declarations are to be found; in the TEI scheme, element names are unique across all modules. The module is simply a convenient way of grouping together a number of related declarations. |
1578 | TDbuild | , which is not defined in the TEI scheme, will be added to the output schema. This element will also be added to the existing TEI class |
1580 | TDbuild | , and will thus be available in TEI-conformant documents. |
1590 | TDbuild | The effect of this is to redefine the content model for the element |
1600 | TDbuild | which appear both in the original specification and in the new specification supplied above: |
1602 | TDbuild | in this example. Note that if the value for |
1610 | TDbuild | A schema may not contain more than two declarations for any given component. The value of the |
1612 | TDbuild | attribute is used to determine exactly how the second declaration (and its constituents) should be combined with the first. The following table summarizes how a processor should resolve duplicate declarations; the term |
1619 | TDbuild | mode value |
1627 | TDbuild | add |
1631 | TDbuild | add new declaration to schema; process its children in add mode |
1635 | TDbuild | add |
1659 | TDbuild | change |
1667 | TDbuild | change |
1671 | TDbuild | process identifiable children according to their modes; process unidentifiable children in replace mode; retain existing children where no replacement or change is provided |
1694 | ST-aliens | Combining TEI and Non-TEI Modules |
1696 | ST-aliens | In the simplest case, all that is needed to include a non-TEI module in a schema is to reference its RELAX NG source using the |
1702 | ST-aliens | (defining Scalable Vector Graphics) are included. To avoid any risk of name clashes, the schema specifies that all TEI patterns generated should be prefixed by the string "TEI_". |
1712 | ST-aliens | This specification generates a single schema which might be used to validate either a TEI document (with the root element |
1714 | ST-aliens | ), or an SVG document (with a root element |
1718 | ST-aliens | validate a TEI document containing |
1722 | ST-aliens | element must become a member of a TEI model class ( |
1723 | ST-aliens | ), so that it may be referenced by other TEI elements. To achieve this, we modify the last |
1735 | ST-aliens | This states that when the declarations from the |
1739 | ST-aliens | in the TEI module should be extended to include the element |
1741 | ST-aliens | as an alternative. This has the effect that elements in the TEI scheme which define their content model in terms of that element class (notably |
1743 | ST-aliens | ) can now include it. A RELAX NG schema generated from such a specification can be used to validate documents in which the TEI |
1763 | TD-LinkingSchemas | This example includes a standard RELAX NG schema, a Schematron schema which might be used for checking that all pointing attributes point at existing targets, and also a link to the TEI ODD file from which the RELAX NG schema was generated. See also |
1764 | TD-LinkingSchemas | for details of another method of linking an ODD specification into your file by including a |
1778 | tagdocs | Documentation of TEI modules |
1787 | TDformal | The selection and combination of modules to form a TEI schema is described in |
1808 | TDformal | ). All of these classes are declared along with the other general TEI classes, in the basic structure module documented in |
1815 | TDformal | macro.schemaPattern |
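The TDCLA rows above describe how a model class reference in a content model expands: a bare class name stands for an alternation of the current members, while a suffix appended with an underscore requests a sequence. The following Python sketch illustrates that expansion under stated assumptions. The member list (`hi`, `it`, `bo`) is a hypothetical placeholder, the function name `expand_class_ref` is invented for this illustration, and the suffix names other than `sequence` and `sequenceOptional` are assumed rather than taken from the rows above. The output is a RELAX NG compact-style string, not the output of any real ODD processor.

```python
# Hypothetical class membership, for illustration only.
CLASSES = {"model.hiLike": ["hi", "it", "bo"]}

# Occurrence indicator appended to each member for each sequence suffix.
# Names other than "sequence" and "sequenceOptional" are assumptions.
SEQUENCE_SUFFIXES = {
    "sequence": "",                     # each member once, in order
    "sequenceOptional": "?",            # each member optional, in order
    "sequenceOptionalRepeatable": "*",  # each member zero or more times
    "sequenceRepeatable": "+",          # each member one or more times
}


def expand_class_ref(ref, classes):
    """Expand a class reference into an explicit content model string."""
    # Split on the last underscore; a bare class name has no suffix.
    name, sep, suffix = ref.rpartition("_")
    if not sep:
        name, suffix = ref, "alternation"
    members = classes[name]
    if suffix == "alternation":
        return "(" + " | ".join(members) + ")"
    if suffix in SEQUENCE_SUFFIXES:
        mark = SEQUENCE_SUFFIXES[suffix]
        return "(" + ", ".join(m + mark for m in members) + ")"
    raise ValueError("unknown class reference suffix: " + suffix)


print(expand_class_ref("model.hiLike", CLASSES))
print(expand_class_ref("model.hiLike_sequence", CLASSES))
print(expand_class_ref("model.hiLike_sequenceOptional", CLASSES))
```

In sequence mode the members appear in the order in which they are listed, mirroring the statement above that the order is that in which the elements are declared. A fuller sketch could also honour the restriction to a stated set of suffixes described in the rows above, which is omitted here.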
# | id | text |
---|---|---|
23 | VEMEana-eg-23 | Doglia mi reca ne lo core ardire |
79 | TSSASE-eg-20 | Structures of social action: Studies in conversation analysis |
343 | NDPER-eg-17 | membrane 5, entry 154 |
441 | VEST-eg-4 | 2nd edition |
566 | DIC-CP | Collins Pocket Dictionary of the English language |
586 | SA-BIBL-2 | Orbis Pictus: a facsimile of the first English edition of 1659 |
603 | PHegsurp2 | Poeti del Duecento |
853 | COEDADD-eg-89 | The waste land: a facsimile and transcript of the original drafts including the annotations of Ezra Pound |
883 | DS-eg-05 | Is there a text in this class? The authority of interpretive communities |
922 | FTGRA-eg-18 | 2nd edition |
1006 | COHQU-eg-43 | Natural language processing in Prolog |
1257 | DRSTA-eg-40 | Everyman's library: the drama |
1289 | COBICOR-eg-248 | ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure |
1473 | COHQQ-eg-33 | note 12 |
1600 | DRPRO-eg-7 | epilogue |
1634 | STGA-eg-9 | Crofts American history series |
1703 | TSBA-eg-19 | The approach of the Text Encoding Initiative to the encoding of spoken discourse |
1723 | MS-eg-001 | A summary catalogue of western manuscripts in the Bodleian Library at Oxford which have not hitherto been catalogued ... |
1733 | MS-eg-001 | P5-MS: A general purpose tagset for manuscript description |
1762 | STGA-eg-10 | Crofts American history series |
1931 | TSSASE-eg-37 | Report on the compatibility of J P French's spoken corpus transcription conventions with the TEI guidelines for transcription of spoken texts |
1958 | GDFT-eg-12 | Partial family tree for Bertrand Russell |
2322 | DSBACK-eg-83 | index to vol. 1 |
2556 | WHITMS1 | "[I am a curse]" in |
2562 | WHITMS2 | Single leaf of Notes for a poem about night "visions," possibly related to the untitled 1855 poem that Whitman eventually titled "The Sleepers." Fragments of an unidentified newspaper clipping about the Puget Sound area have been pasted to the leaf. The Trent Collection of Walt Whitman Manuscripts, Duke University Rare Book, Manuscript, and Special Collections Library. |
3666 | BIB | Works cited elsewhere in the text of the Guidelines |
3752 | Burnard1995b | The Design of the TEI Encoding Scheme |
4361 | SG-BIBL-2 | Refining our notion of what text really is: the problem of overlapping hierarchies |
4630 | CO-BIBL-1 | An international handbook of the science of language and society |
4767 | TS-BIBL-3 | TEI document TEI AI2 W1 |
4912 | DI-BIBL-3 | TEI working paper TEI AIW20 |
5015 | DI-BIBL-6 | Principles for Encoding machine readable dictionaries |
5069 | DI-BIBL-8 | Electronic dictionary encoding: customizing the TEI Guidelines |
5609 | NH-BIBL-7 | The layered markup and annotation language |
5661 | FS-BIBL-01 | A rationale for the TEI recommendations for feature-structure markup, |
5728 | ISO-690 | ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure |
5740 | ISO-12620 | ISO 12620:2009: Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources |
5750 | RICA | Istituto Centrale per il Catalogo Unico |
5752 | RICA | Regole italiane di catalogazione per autori |
5819 | BIB-RDG | Reading list |
5821 | BIB-RDG | The following lists of readings in markup theory and the TEI derive from work originally prepared by Susan Schreibman and Kevin Hawkins for the TEI Education Special Interest Group, recoded in TEI P5 by Sabine Krott and Eva Radermacher. They should be regarded only as a snapshot of work in progress, to which further contributions and corrections are welcomed (see further |
6297 | Burnard1999 | Closing plenary address at the XML Europe Conference, Granada, May 1999 |
6375 | Burnard2001a | Dalle «Due Culture» Alla Cultura Digitale: La Nascita del Demotico Digitale |
6491 | Burnard2005b | Metadata for corpus work |
7448 | Pichler1995 | Culture and Value: Philosophy and the Cultural Sciences. Beiträge des 18. Internationalen Wittgenstein Symposiums 13–20. August 1995 Kirchberg am Wechsel |
7451 | Pichler1995 | Kirchberg am Wechsel |
8364 | Unsworthetaleds2004 | TEI Consortium |
8502 | BIB-RDG | TEI |
8617 | BaumanandCatapano1999 | TEI and the Encoding of the Physical Structure of Books |
8647 | Bauman2005 | TEI HORSEing Around |
8729 | Burnard1993 | Rolling your own with the TEI |
8845 | Burnard1997 | Prepared for a seminar on Etiquetación y extracción de información de grandes corpus textuales within the Curso Industrias de la Lengua (14–18 de Julio de 1997). Sponsored by the Fundacion Duques de Soria. |
8862 | BurnardandPopham1999 | Putting Our Headers Together: A Report on the TEI Header Meeting 12 September 1997. |
8925 | Ciottied2005 | Il Manuale TEI Lite: Introduzione Alla Codifica Elettronica Dei Testi Letterari |
8945 | Chang2001 | The Implications of TEI |
8991 | DigitalLibraryFederation1998 | TEI and XML in Digital Libraries: Meeting June 30 and July 1, 1998, Library of Congress, Summary/Proceedings |
9007 | DigitalLibraryFederation2007 | TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices |
9105 | Loiseaunodate | Introduction à la TEI |
9129 | MarkoandKelleher2001 | Descriptive Metadata Strategy for TEI Headers: A University of Michigan Library Case Study |
9159 | Mertz2003 | XML Matters: TEI — the Text Encoding Initiative |
9273 | Rahtz2003 | Building TEI DTDs and Schemas on demand |
9305 | Rahtzetal2004 | A unified model for text markup: TEI, Docbook, and beyond |
9365 | Robinsonnodate | Making a Digital Edition with TEI and Anastasia |
9383 | Seaman1995 | The Electronic Text Center Introduction to TEI and Guide to Document Preparation |
9403 | Simons1999 | Using Architectural Forms to Map TEI Data into an Object-Oriented Database |
9433 | Smith1999 | Textual Variation and Version Control in the TEI |
9565 | Vanhoutte2004 | An Introduction to the TEI and the TEI Consortium |
# | id | text |
---|---|---|
4 | VE | This module is intended for use when encoding texts which are entirely or predominantly in verse, and for which the elements for encoding verse structure already provided by the core module are inadequate. |
7 | VE | include elements for the encoding of verse lines and line groups such as stanzas: these are available for any TEI document, irrespective of the module it uses. Like the modules for prose and for drama, the module for verse additionally makes use of the module defined in chapter |
16 | VE | The module for verse extends the facilities provided by these modules in the following ways: |
18 | VE | a special purpose |
20 | VE | element is provided, to allow for segmentation of the verse line (see section |
23 | VE | a set of attributes is provided for the encoding of rhyme scheme and metrical information (see sections |
27 | VE | a special purpose |
29 | VE | element is provided to support simple analysis of rhyming words (see section |
36 | VEST | Like other kinds of text, texts written in verse may be of widely differing lengths and structures. A complete poem, no matter how short, may be treated as a free-standing text, and encoded in the same way as a distinct prose text. A group of poems functioning as a single unit may be encoded either as a |
40 | VEST | , depending on the encoder's view of the text. For further discussion, including an example encoding for a verse anthology, see chapter |
90 | VEST | Often, however, lines are grouped, formally or informally, into stanzas, verse paragraphs, etc. The |
92 | VEST | element defined in the core tag set (in section |
124 | VEST | It may also be used to mark the verse paragraphs into which longer poems are often divided, as in the following example from Samuel Taylor Coleridge's |
161 | VEST | element, where a verse line is broken between two line groups, as discussed in section |
166 | VEST | element is used to mark the highly regular line groups which characterize stanzaic and similar verse forms, as in the following example from Chaucer: |
191 | VEST | elements may be nested hierarchically. For example, one particularly common English stanzaic form consists of a quatrain or sestet followed by a couplet. The |
220 | VEST | attribute to name the type of unit encoded by the |
232 | VEST | attribute is intended solely for conventional names of different classes of text block. For systematic analysis of metrical and rhyme schemes, use the |
239 | VEST | As a further example, consider the Shakespearean sonnet. This may be divided into two parts: a concluding couplet, and a body of twelve lines, itself subdivided into three quatrains: |
292 | VEST | each of which contains a prologue followed by twelve |
294 | VEST | . Each prologue and each canto consists of nine-line |
348 | VESE | It is often convenient for various kinds of analysis to encode subdivisions of verse lines. The general purpose |
350 | VESE | element defined in the tag set for segmentation and alignment (section |
355 | VESE | To use this element together with the module for verse, the module for segmentation and alignment must also be enabled as further described in section |
358 | VESE | In Old and Middle English alliterative verse, individual verse lines are typically split into half lines. The |
385 | VESE | element, down to whatever level of detailed structure is required. In the following example, the line has been divided into |
392 | VESE | attribute) this example will still require additional processing, since whitespace should be retained for the lower level |
395 | VESE | syll |
426 | VESE | element may be used to identify any subcomponent of a line which has content; its type attribute may characterize such units in any way appropriate to the needs of the encoder. For the specific case of labeling each foot with its formal type ( |
447 | VESE | ). If both kinds of segmentation are required, the |
491 | VESE | element, it might be simpler just to mark the point at which the caesura occurs. An additional element is provided for analyses of this kind, in which what is to be marked are points |
493 | VESE | , which have some significance within a verse line: |
497 | VESE | caesura |
500 | VESE | , which occurs on a foot boundary (not to be confused with the division of a diphthong into two syllables, or the diacritic symbol used to indicate such division, each of which is also termed |
502 | VESE | ). This distinction is rarely made nowadays, the term |
503 | VESE | caesura |
510 | VESE | element, we refer again to the example from Langland. An encoder might choose simply to record the location of the caesura within each line, rather than encoding each half-line as a segment in its own right, as follows: |
524 | VESE | Logically, the opposite of caesura might be considered to be |
528 | VESE | module is included in a schema, an additional class called |
537 | VESE | elements and the syntactic structure of the verse (a discrepancy of some significance in some schools of verse): |
552 | VESA | It is possible that certain textual structures may span multiple lines of verse, either by incorporating more than one, or by crossing the line hierarchy. This is common, for example, when lines contain reported thought or speech (i.e. |
554 | VESA | ), or other forms of quotation (i.e. |
606 | VEME | When the module for verse is in use, the following additional attributes are available to record information about rhyme and metrical form: |
617 | VEME | , etc. In general, the attributes should be specified at the highest level possible; they may not however be specifiable at the highest level if some of the subdivisions of a text are in prose and others in verse. All these attributes may also be attached to the |
621 | VEME | elements, but the default notation for the |
623 | VEME | attribute has no defined meaning when specified on |
627 | VEME | . The value for these attributes may take any form desired by the encoder, but the nature of the notation used will determine how well the attribute values can be processed by automatic means. |
631 | VEME | attribute, as further discussed below. A simple mechanism is also provided for recording the actual realization of a rhyme pattern; see |
662 | VEMEsamp | This text is written entirely in |
664 | VEMEsamp | ; each line is an iambic pentameter (which, using a common notation, can be described with the formula |
674 | VEMEsamp | a line-end), and the couplets rhyme (which can be represented with the conventional formula |
678 | VEMEsamp | Because both rhyme pattern and metrical form are consistent throughout the poem, they may be conveniently specified on the |
690 | VEMEsamp | attributes is user-defined, no binding description can be given of its details or of how its interpretation must proceed. (A default notation is provided for the |
693 | VEMEsamp | .) It is expected, however, that software should be able to support these attributes in useful ways; the more intelligent the software is, and the more knowledge of metrics is built into it, the better it will be able to support these attributes. In the extract given above, for example, the |
703 | VEMEsamp | value specifies the metrical form of a single verse line, the structure of the |
705 | VEMEsamp | as a whole is understood to involve as many repetitions of the pattern as there are lines in the verse paragraph. The same attribute value, when inherited in turn by the |
709 | VEMEsamp | to repeat. With sufficiently sophisticated software, segments within the line might even be understood as inheriting precisely that portion of the formula which applies to the segment in question; this will, however, be easier to accomplish for some languages than for others. |
713 | VEMEsamp | attribute in this example uses the default notation to specify a rhyme scheme applicable only to pairs of lines. As elsewhere, the default notation for the |
715 | VEMEsamp | attribute has no meaning for metrical units at the line level or below. In verse forms where line-internal rhyme is structurally significant, e.g. in some skaldic poetry, the default notation is incapable of expressing the required information, since the rhyme pattern may need to be specified for units smaller than the line. In such cases, a user-specified rhyme notation must be substituted for the default notation, or else the rhyme pattern must be described using some alternative method (e.g. by using the |
723 | VEMEsamp | attribute, when user-specified notations are used. |
725 | VEMEsamp | A formal definition of the significance of each component of the pattern given as the value of the |
731 | VEMEsamp | element in the TEI header (see section |
732 | VEMEsamp | ). The encoder is free to invent any notation appropriate to his or her analytic needs, provided that it is adequately documented in this element. The notation may define metrical components using invented or traditional names (such as |
746 | VEMEsamp | attribute has the same value as the |
748 | VEMEsamp | attribute on the same element; it is only necessary to provide an explicit value when the realization differs in some way from the abstract metrical pattern. The tension between conventional metrical pattern and its realization may thus be recorded explicitly. For example, many readers of the above passage would stress the word |
750 | VEMEsamp | at the beginning of the third line rather than the word |
757 | VEMEsamp | attribute is used to override the default or conventional metrical pattern, it applies only to the element on which it is specified. The default pattern for any subsequent lines is unaffected. |
770 | VEMEsamp | attribute, the encoder is required to determine whether the change is a systematic or conventional one (as in this example) or an occasional variation, perhaps for local effect. In the following example, from Goethe's |
811 | VEMElevels | The examples given so far have encoded information about the realization of metrical conventions at the level of the whole verse-line. This has obvious advantages of simplicity, but the disadvantage that any deviation from metrical convention is not marked at its precise point of occurrence in the text. Greater precision may be achieved, but only at the cost of marking deviant metrical units explicitly. This may be done with the |
813 | VEMElevels | element, giving the variant realization as the value of the |
827 | VEMElevels | The marking of the foot boundaries with the symbol |
831 | VEMElevels | attribute value of the |
833 | VEMElevels | element allows the human reader, or a sufficiently intelligent software program, to isolate the correct portion of that attribute value as the default value for the same attribute on the |
841 | VEMElevels | here, and whether or not also to tag the feet in the line in which there is no deviation from the metrical convention. The ability of software to infer which foot is being marked, if not all are tagged, will depend heavily on the language of the text and the knowledge of prosody built into the software; the fuller and more explicit the markup, the easier it will be for software to handle it. It may prove useful, however, to mark metrical deviations in the manner shown, even if the available software is not sufficiently intelligent to scan lines without aid from the markup. Human readers who are interested in prosody may well be able to exploit the markup in useful ways even with less sophisticated software. |
847 | VEMElevels | . If we wish to identify the exact location of the different types of foot in the first line of Virgil's |
849 | VEMElevels | , the text could be encoded as follows (for simplicity's sake the caesura has been omitted): |
862 | VEMElevels | An appropriate value of the |
864 | VEMElevels | attribute might also be supplied on the enclosing |
868 | VEMElevels | at the level of the foot may be considered a series of local variations on this fundamental pattern; in cases like this, of course, the local variations may also be considered aspects of realization rather than of convention, in which case the |
872 | VEMElevels | , if desired. |
878 | VEMEana | The method described above may be used to encode quite complex verse forms, for instance various kinds of fixed-form stanzas. Let us take one of Dante's canzoni, in which each stanza except the last has the same combination of eleven-syllable and seven-syllable lines, and the same rhyme scheme: |
894 | VEMEana | attribute specifies a rhyme scheme for each stanza, in the same way. |
898 | VEMEana | represents a line containing nine syllables which may or may not be metrically prominent, a tenth which is prominent and an optional non-prominent eleventh syllable. The letter |
900 | VEMEana | is used to represent a line containing five syllables which may or may not be metrically prominent, a sixth which is prominent and an optional non-prominent seventh syllable. A suitable definition for this notation might be given by a |
928 | VEMEana | attribute on the eighth stanza itself, which will override the default value inherited from parent |
949 | VEMEana | . Moreover, although it is quite regular (in the sense that the last stanza of each |
962 | VERH | attribute is used to specify the rhyme pattern of a verse form. It should not be confused with the |
974 | VERH | element in the TEI header. Unlike |
978 | VERH | attribute has a default notation; if this default notation is used, no |
982 | VERH | The default notation for rhyme offers the ability to record patterns of rhyming lines, using the traditional notation in which distinct letters stand for rhyming lines. For a work in rhyming couplets, like the Pope example above, the |
986 | VERH | , indicating that pairs of adjacent lines rhyme with each other. For a slightly more complex scheme, applicable to groups of four lines, in which lines 1 and 3 rhyme, as do lines 2 and 4, this attribute would have the value |
990 | VERH | , indicating that within each nine-line stanza, lines 1 and 3 rhyme with each other, as do lines 2, 4, 5 and 7, and lines 6, 8 and 9. |
992 | VERH | Non-rhyming lines within such a group may be represented using a hyphen or an x, as in the following example: |
1007 | VERH | element may be used to mark the words (or parts of words) which rhyme according to a predefined pattern: |
1020 | VERH | attribute is used to specify which parts of a rhyme scheme a given set of rhyming words represent: |
1057 | VERH | elements with the same value for their |
1059 | VERH | attribute are assumed to rhyme with each other: thus, in the above example, the two rhymes labelled |
1061 | VERH | in the first stanza rhyme with each other, but not necessarily with those labelled |
1069 | VERH | element can appear anywhere within a verse line, and not necessarily around a single word. It can thus be used to mark quite complex internal rhyming schemes, as in the following example: |
1097 | VERH | This mechanism, although reasonably simple for simple cases, may not be appropriate for more complex applications. In general, rhyme may be considered as a special form of |
1099 | VERH | , and hence encoded using the mechanisms defined for that purpose in section |
1129 | VERH | Now that each rhyming word, or part-word, has been tagged and allocated an arbitrary identifier, the general purpose |
1154 | VERH | class when the module defined by this chapter is included in a schema. |
1162 | HDMN | element of the TEI header to document the metrical notation used in marking up a text. |
1167 | HDMN | As with other components of the header, metrical notation may be specified either formally or informally. In a formal specification, every symbol used in the metrical notation must be documented by a corresponding |
1173 | HDMN | if |
1177 | HDMN | if any |
1179 | HDMN | is defined, then any notation using undefined symbols should be regarded as invalid |
1181 | HDMN | if both pattern and symbol are defined, then every symbol appearing explicitly within pattern must be defined |
1190 | HDMN | As a simple example, consider the case of the notation in which metrical prominence, metrical feet, and line boundaries are all to be encoded. Legal specifications in this notation may be written for any sequence of metrically prominent or non-prominent features, optionally separated by foot or metrical line boundaries at arbitrary points. Assuming that the symbol |
1198 | HDMN | for line boundary, then the following declaration achieves this object: |
1219 | HDMN | attribute values within the text which use this metrical notation. |
1223 | HDMN | attribute should be used to indicate for a given symbol whether or not it may be re-defined in terms of other symbols used within the same notation. For example, here is a notation for encoding classical metres, in which symbols are provided for the most common types of foot. |
1244 | HDMN | attribute to supply an additional name for the symbols being documented. |
1250 | HDMN | , each supplied with an |
1254 | HDMN | attribute may be used in the text of the document to specify which |
1258 | HDMN | s are defined in the header, one with an English verse pattern and one with a French pattern. In the body of the document, there are two |
1306 | VEETC | A number of procedures that may be of particular concern to encoders of verse texts are dealt with elsewhere in these guidelines. Some aspects of layout and physical appearance, especially important in the case of free verse, are dealt with in chapter |
1307 | VEETC | . Some initial recommendations for the encoding of phonetic or prosodic transcripts, which may be helpful in the analysis of sound structures in poetry, are to be found in chapter |
1311 | VEETC | contains much which will be found useful for the aligning of multiple levels of commentary and structure within verse analysis. Encoders of verse (as of other types of literary text) will frequently wish to attach identifying labels to portions of text that are not part of a system of hierarchical divisions, may overlap with one another, and/or may be discontinuous; for instance passages associated with particular characters, themes, images, allusions, topoi, styles, or modes of narration. Much of the computerized analysis of verse seems likely to require dividing texts up into blocks in this way. The |
1315 | VEETC | , provide a powerful means of encoding a wide variety of aspects of verse literature, including not only the metrical structures discussed above, but also such stylistic and rhetorical features as metaphor. |
1317 | VEETC | For other features it must for the time being be left to encoders to devise their own terminology. Elements such as |
1321 | VEETC | might well suggest themselves; but given the problems of definition involved, and the great richness of modern metaphor theory, it is clear that any such format, if predefined by these Guidelines, would have seemed objectionable to some and excessively restrictive to many. Leaving the choice of tagging terminology to individual encoders carries with it one vital corollary, however: the encoder must be utterly explicit, in the TEI header, about the methods of tagging used and the criteria and definitions on which they rest. Where no formal elements are currently proposed, such information may readily be given as simple prose description within the |
1346 | VESTR | The selection and combination of modules to form a TEI schema is described in |
# | id | text |
---|---|---|
10 | msov | This chapter is based on the work of the European MASTER (Manuscript Access through Standards for Electronic Records) project, funded by the European Union from January 1999 to June 2001, and led by Peter Robinson, then at the Centre for Technology and the Arts at De Montfort University, Leicester (UK). Significant input also came from a TEI Workgroup headed by Consuelo W. Dutschke of the Rare Book and Manuscript Library, Columbia University (USA) and Ambrogio Piazzoni of the Biblioteca Apostolica Vaticana (IT) during 1998-2000. |
11 | msov | defines a special purpose element which can be used to provide detailed descriptive information about handwritten primary sources. Although originally developed to meet the needs of cataloguers and scholars working with medieval manuscripts in the European tradition, the scheme presented here is general enough that it can also be extended to other traditions and materials, and is potentially useful for any kind of inscribed artefact. |
13 | msov | The scheme described here is also intended to accommodate the needs of many different classes of encoders. On the one hand, encoders may be engaged in |
16 | msov | ex nihilo |
17 | msov | , that is, creating new detailed descriptions for materials never before catalogued. Some may be primarily concerned to represent accurately the description itself, as opposed to the ideas and interpretations the description represents; others may have entirely opposite priorities. At one extreme, a project may simply wish to capture an existing catalogue in a form that can be displayed on the Web, and which can be searched for literal strings, or for such features as titles, authors and dates; at the other, a project may wish to create, in highly structured and encoded form, a detailed database of information about the physical characteristics, history, interpretation, etc. of the material, able to support practitioners of |
21 | msov | To cater for this diversity, here as elsewhere, these Guidelines propose a flexible strategy, in which encoders must choose for themselves the approach appropriate to their needs, and are provided with a choice of encoding mechanisms to support those differing degrees. |
31 | msdesc | element of the header of a TEI-conformant document, where the document being encoded is a digital representation of some manuscript original, whether as an encoded transcription, as a collection of digital images (as described in |
32 | msdesc | ), or as some combination of the two. However, in cases where the document being encoded is essentially a collection of manuscript descriptions, the |
40 | msdesc | ) making up the TEI element class |
50 | msdesc | element has the following components, which provide more detailed information under a number of headings. Each of these component elements is further described in the remainder of this chapter. |
66 | msdesc | ), and then either one or more paragraphs, marked up as a series of |
80 | msdesc | ). These elements are all optional, but if used they must appear in the order given here. Finally, in the case of a composite manuscript, a full description may also contain one or more |
95 | msdesc | The simplest way of digitizing this catalogue entry would simply be to key in the text, tagging the relevant parts of it which make up the mandatory |
118 | msdesc | and add some of the additional phrase-level elements available when this module is in use: |
160 | msdesc | Note that in this version the text has been slightly reorganized, but no actual rewriting has been necessary. The encoding now allows the user to search for such features as title, material, and date and place of origin; it is also possible to distinguish quoted material from descriptive passages and to search within descriptions relating to a particular topic (for example, history as distinct from material). |
162 | msdesc | This process could be continued further, restructuring the whole entry so as to take full advantage of many more of the encoding possibilities provided by the module described in this chapter: |
279 | msphrase | Within a manuscript description, many other standard TEI phrase level elements are available, notably those described in the Core module ( |
297 | msdates | elements respectively, used to indicate specifically the date and place of origin of a manuscript or manuscript part. Such information would normally be encoded within the |
304 | msdates | can also be used to identify the place or date of origin of any aspect of the manuscript, such as its decoration or binding, when these are not of the same date or from the same location as the rest of the manuscript. Both these elements are members of the |
312 | msdates | class, and may thus also carry additional attributes giving normalized values for the associated dating. |
320 | msmat | element can be used to tag any specific term used for the physical material of which a manuscript (or binding, seal, etc.) is composed. The |
322 | msmat | element may be used to tag any term specifying the type of object or manuscript upon which the text is written. |
327 | msmat | These elements may appear wherever a term regarded as significant by the encoder occurs, as in the following examples: |
356 | mswat | These elements may appear wherever a term regarded as significant by the encoder occurs. The |
369 | mswat | element will typically appear when text from the source is being transcribed, for example within a rubric in the following case: |
385 | mswat | If, as here, any text contained by a stamp is included in its description it should be clearly distinguished from that description. The element |
395 | msdim | element can be used to specify the size of some aspect of the manuscript, and thus may be thought of as a specialized form of the existing TEI |
403 | msdim | element will normally occur within the element describing the particular feature or aspect of a manuscript whose dimensions are being given; thus the size of the leaves would be specified within the |
410 | msdim | ), while the dimensions of other specific parts of a manuscript, such as accompanying materials, binding, etc., would be given in other parts of the description, as appropriate. |
438 | msdim | are used only when the measurement applies to several items, for example the size of all leaves in a manuscript; attributes |
442 | msdim | are used when the measurement applies to a single item, for example the size of a specific codex, but has had to be estimated. Attribute |
444 | msdim | is used when the measurement can be given exactly, and applies to a single item; this is the usual situation. In this case, the units in which dimensions are measured may be specified using the |
446 | msdim | attribute, which will normally take its value from a closed set of values appropriate to the project, using standard units of measurement wherever possible, such as the following values: |
453 | msdim | line |
455 | msdim | char |
456 | msdim | . If however the only data available for the measurement uses some other unit, or it is preferred to normalize it in some other way, then it may be supplied as a string value by means of the |
464 | msdim | More usually, the measurement will be normalized into a value and an appropriate SI unit: |
466 | msdim | Where the exact value is uncertain, the attributes |
474 | msdim | It is often convenient to supply a measurement which applies to a number of discrete observations: for example, the number of ruled lines on the pages of a manuscript (which may not all be the same), or the diameter of an object like a bell, which will differ depending on where it is measured. In such cases, the |
488 | msdim | element may be repeated as often as necessary, with appropriate attribute values to indicate the nature and scope of the measurement concerned. For example, in the following case the leaf size and ruled space of the leaves of the manuscript are specified: |
498 | msdim | This indicates that for most leaves of the manuscript being described the ruled space is 90 mm high and 48 mm wide, while the leaves throughout are between 157 and 160 mm in height and 105 mm in width. |
502 | msdim | element is provided for cases where some measurement other than height, width, or depth is required. Its |
514 | msdim | element may be supplied is not constrained. |
525 | msloc | element, used to indicate a location, or sequence of locations, within a manuscript. |
532 | msloc | element is used to reference a single location within a manuscript, typically to specify the location occupied by the element within which it appears. If, for example, it is used as the first component of a |
537 | msloc | below) then it is understood to specify the location (or locations) of that item within the manuscript being described. |
543 | msloc | element can be used to identify any reference to one or more folios within a manuscript, wherever such a reference is appropriate. Locations are conventionally specified as a sequence of folio or page numbers, but may also be a discontinuous list, or a combination of the two. This specification should be given as the content of the |
553 | msloc | A normalized form of the location can also be supplied, using special purpose attributes on the |
563 | msloc | When the item concerned occupies a discontinuous sequence of pages, this may simply be indicated in the body of the |
572 | msloc | Alternatively, if it is desired to indicate normalized values for each part of the sequence, a sequence of |
587 | msloc | Finally, the content of the |
589 | msloc | element may be omitted if a formatting application can construct it automatically from the values of the |
609 | msloc | attribute can also be used to associate a location within a manuscript with facsimile images of that location, using the |
611 | msloc | attribute, or with a transcription of the text occurring at that location. The former association is effected by means of the |
619 | msloc | is available only when the |
640 | msloc | attribute uses a URI reference to point directly to images of the relevant pages. This method may be found cumbersome when many images are to be associated with a single location. It is of most use when specific pages are referenced within a description, as in the following example: |
690 | msloc | When (as in this example) a sequence of elements is to be supplied as target value, it may be given explicitly as above, or using the XPointer range() syntax defined at |
691 | msloc | . Note however that support for this pointer mechanism is not widespread in current XML processing systems. |
695 | msloc | attribute should only be used to point to elements that contain or indicate a transcription of the locus being described. To associate a |
706 | msloc | attribute may be used to distinguish them. For example, MS 65 Corpus Christi College, Cambridge contains two fly leaves bearing music. These leaves have modern foliation 135 and 136 respectively, but are also marked with an older foliation. This may be preserved in an encoding such as the following: |
721 | msloc | attribute should be supplied on the |
742 | msnames | The standard TEI element |
769 | msnames | name |
770 | msnames | , not the person, place, or organization to which that name refers. In the last example above, the |
772 | msnames | attribute is used to associate the name with a more detailed description of the person named. This is provided by means of the |
774 | msnames | element, which becomes available when the |
777 | msnames | is included in a schema. An element such as the following might then be used to provide detailed information about the person indicated by the name: |
792 | msnames | element must be provided for each distinct |
794 | msnames | value specified. For example, in the case above, the value |
800 | msnames | element; the same value will be used as the |
808 | msnames | attribute may be used to supply a unique identifying code for the person referenced by the name independently of both the existence of a |
810 | msnames | element and the use of the standard URI reference mechanism. If, for example, a project maintains as its authority file some non-digital resource, or uses a database which cannot readily be integrated with other digital resources for this purpose, the unique codes used by such |
815 | msnames | , interchange is improved by use of tag URIs in |
823 | msnames | elements referenced by a particular document set should be collected together within a |
826 | msnames | element, located in the TEI header. This functions as a kind of prosopography for all the people referenced by the set of manuscripts being described, in much the same way as a |
828 | msnames | element in the back matter may be used to hold bibliographic information for all the works referenced. |
843 | msmisc | element is used to describe one method by which correct ordering of the quires of a codex is ensured. Typically, this takes the form of a word or phrase written in the lower margin of the last leaf verso of a gathering, which provides a preview of the first recto leaf of the successive gathering. This may be a simple phrase such as the following: |
859 | msmisc | element can be used for either leaf signatures, or a combination of quire and leaf signatures, whether the marking is alphabetic, alphanumeric, or some ad hoc system, as in the following more complex example: |
869 | msmisc | ) taken from a specific known point in a codex (for example the first few words on the second leaf). Since these words will differ from one copy of a text to another, the practice originated in the Middle Ages of using them when cataloguing a manuscript in order to distinguish individual copies of a work in a way which its opening words could not. |
878 | mshera | Descriptions of heraldic arms, supporters, devices, and mottos may appear at various points in the description of a manuscript, usually in the context of ownership information, binding descriptions, or detailed accounts of illustrations. A full description may also contain a detailed account of the heraldic components of a manuscript independently considered. Frequently, however, heraldic descriptions will be cited as short phrases within other parts of the record. The phrase level element |
919 | msid | element is intended to provide an unambiguous means of uniquely identifying a particular manuscript. This may be done in a structured way, by providing information about the holding institution and the call number, shelfmark, or other identifier used to indicate its location within that institution. Alternatively, or in addition, a manuscript may be identified simply by a commonly used name. |
923 | msid | A manuscript's actual physical location may occasionally be different from its place of ownership; at Cambridge University, for example, manuscripts owned by various colleges are kept in the central University Library. Normally, it is the ownership of the manuscript which should be specified in the manuscript identifier, while additional or more precise information on the physical location of the manuscript can be given within the |
938 | msid | These elements are all structurally equivalent to the standard TEI |
940 | msid | element with an appropriate value for its |
948 | msid | and they must, if present, appear in the order given. |
958 | msid | to reference a single standardized source of information about the entity named. |
969 | msid | Major manuscript repositories will usually have a preferred form of citation for manuscript shelfmarks, including rules about punctuation, spacing, abbreviation, etc., which should be adhered to. Where such a format also contains information which might additionally be supplied as a distinct subcomponent of the |
971 | msid | , for example a collection name, a decision must be taken as to whether to use the more specific element, or to include such information within the |
1012 | msid | In the former example, the preferred form of the identifier can be retrieved by prefixing the content of the |
1028 | msid | might be considered helpful in some circumstances (if, for example, some of the items in the Ellesmere collection had shelfmarks which did not begin |
1032 | msid | In some cases the shelfmark may contain no information about the collection; in other cases, the item may be regarded as belonging to more than one collection. The |
1070 | msid | Note in the latter case the use of the |
1072 | msid | element to provide a common name other than the shelfmark by which a manuscript is known. Where a manuscript has several such names, more than one of these elements may be used, as in the following example: |
1090 | msid | attribute has been used to specify the language of the alternative names. |
1092 | msid | In very rare cases a repository may have only one manuscript (or only one of any significance), which will have no shelfmark as such but will be known by a particular name or names. In such circumstances, the |
1094 | msid | element may be omitted, and the manuscript identified by the name or names used for it, using one or more |
1111 | msid | Where manuscripts have moved from one institution to another, or even within the same institution, they may have identifiers additional to the ones currently used, such as former shelfmarks, which are sometimes retained even after they have been officially superseded. In such cases it may be useful to supply an alternative identifier, with a detailed structure similar to that of the |
1115 | msid | in the collection of the Duque de Osuna, but which now has the shelfmark |
1139 | msid | , except in cases where a manuscript is likely still to be referred to or known by its former identifier. For example, an institution may have changed its call number system but still wish to retain a record of the earlier number, perhaps because the manuscript concerned is frequently cited in print under its previous number: |
1153 | msid | Where (as in this example) no repository is specified for the |
1157 | msid | . Where the holding institution has only one preferred form of citation but wishes to retain the other for internal administrative purposes, the secondary form could be given within |
1159 | msid | with an appropriate value on the |
1182 | msid | , substantial parts of which are to be found in three separate repositories, in Ljubljana, Warsaw, and St. Petersburg. This should be represented using three distinct |
1184 | msid | elements, using an appropriate value on the type attribute to indicate that these three identifiers are not alternate ways of referring to the same physical object, but three parts of the same entity. |
1217 | msid | As mentioned above, the smallest possible description is one that contains only the element |
1241 | msdo | . This will often have been enough to identify a manuscript in a small collection because the identity of the author is implicit. Where a title does not imply the author, and is thus insufficient to identify the main text of a manuscript, the author should be stated explicitly (e.g. |
1245 | msdo | ). Many inventories of manuscripts consist of no more than an author and title, with some form of copy-specific identifier, such as a shelfmark or |
1253 | msdo | ); information on date and place of writing will sometimes also be included. The standard TEI element |
1258 | msdo | In this way the cataloguer or scholar can supply in one place a minimum of essential information, such as might be displayed or printed as the heading of a full description. For example: |
1276 | msdo | element is intended principally to contain a heading. More structured information concerning the contents, physical form, or history of the manuscript should be given within the specialized elements described below, |
1284 | msdo | element may also be used to supply an unstructured collection of such information, as in the example given above ( |
1293 | msco | element is used to describe the intellectual content of a manuscript or manuscript part. It comprises |
1295 | msco | a series of informal prose paragraphs |
1297 | msco | a series of |
1301 | msco | elements, each of which provides a more detailed description of a single item contained within the manuscript. These may be prefaced, if desired, by a |
1325 | msco | This description may of course be expanded to include any of the TEI elements generally available within a |
1394 | msco | elements if it is desired to provide both a general summary of the contents of a manuscript and more detail about some or all of the individual items within it. It may not however be used within an individual |
1419 | mscoit | Each discrete item in a manuscript or manuscript part can be described within a distinct |
1464 | mscoit | is that in the former, the order and number of child elements is not constrained; any element, in other words, may be given in any order, and repeated as often as is judged necessary. In the latter, however, the sub-elements, if used, must be given in the order specified above and only some of them may be repeated; specifically, |
1480 | mscoit | may contain untagged running text, both permit an unstructured description to be provided in the form of one or more paragraphs of text. They differ in this respect also: if paragraphs are supplied as the content of an |
1482 | mscoit | , then none of the other component elements listed above is permitted; in the |
1490 | mscoit | elements may also nest, where a number of separate items in a manuscript are grouped under a single title or rubric, as is the case, for example, with a work like |
1549 | mscoit | ; they are available only when the |
1563 | msat | element should be used to supply a regularized form of the item's title, as distinct from any rubric quoted from the manuscript. If the item concerned has a standardized distinctive title, e.g. |
1565 | msat | , then this should be the form given as content of the |
1567 | msat | element, with the value of the |
1571 | msat | . If no uniform title exists for an item, or none has been yet identified, or if one wishes to provide a general designation of the contents, then a |
1572 | msat | supplied |
1573 | msat | title can be given, e.g. |
1575 | msat | , in which case the |
1579 | msat | should be given the value |
1580 | msat | supplied |
1583 | msat | Similarly, if used within a manuscript description, the |
1585 | msat | element should always contain the normalized form of an author's name, irrespective of how (or whether) this form of the name is cited in the manuscript. If it is desired to retain the form of the author's name as given in the manuscript, this may be tagged as a distinct |
1587 | msat | element, within the text at the point where it occurs. |
1594 | msat | element carrying full details of the person concerned (see further |
1599 | msat | element can be used to supply the name and role of a person other than the author who is responsible for some aspect of the intellectual content of the manuscript: |
1612 | msat | element can also be used where there is a discrepancy between the author of an item as given in the manuscript and the accepted scholarly view, as in the following example: |
1622 | msat | Note that such attributions of authorship, both correct and incorrect, are frequently found in the rubric or final rubric (and occasionally also elsewhere in the text), and can therefore be transcribed and included in the description, if desired, using the |
1633 | mscorie | It is customary in a manuscript description to record the opening and closing words of a text as well as any headings or colophons it might have, and the specialized elements |
1647 | mscorie | , for recording other bits of the text not covered by these elements. Each of these elements has the same substructure, containing a mixture of phrase-level elements and plain text. A |
1649 | mscorie | element can be included within each, in order to specify the location of the component, as in the following example: |
1667 | mscorie | In the following example, standard TEI elements for the transcription of primary sources have been used to mark the expansion of abbreviations and other features present in the original: |
1702 | mscorie | to indicate that the text begins and ends defectively. |
1716 | mscorie | may always be used to identify the language of the text quoted, if this is different from the default language specified by the |
1750 | msclass | One or more text classification or text-type codes may be specified, either for the whole of the |
1779 | msclass | The value used for the |
1791 | msclass | element of the TEI header ( |
1820 | mslangs | element should be used to provide information about the languages used within a manuscript item. It may take the form of a simple note, as in the following example: |
1825 | mslangs | Where, for validation and indexing purposes, it is thought convenient to add keywords identifying the particular languages used, the |
1836 | mslangs | A manuscript item will sometimes contain material in more than one language. The |
1846 | mslangs | Since Old Church Slavonic may be written in either Cyrillic or Glagolitic scripts, and even occasionally in both within the same manuscript, it might be preferable to use a more explicit identifier: |
1851 | mslangs | The form and scope of language identifiers recommended by these Guidelines is based on the IANA standard described at |
1852 | mslangs | and should be followed throughout. Where additional detail is needed correctly to describe a language, or to discuss its deployment in a given text, this should be done using the |
1854 | mslangs | element in the TEI header, within which individual |
1861 | mslangs | element defines a particular combination of human language and writing system. Only one |
1863 | mslangs | element may be supplied for each such combination. Standard TEI practice also allows this element to be referenced by any element using the global |
1865 | mslangs | attribute in order to specify the language applicable to the content of that element. For example, assuming that |
1902 | msph | we subsume a large number of different aspects generally regarded as useful in the description of a given manuscript. These include: |
1904 | msph | aspects of the form, support, extent, and quire structure of the manuscript object and of the way in which the text is laid out on the page ( |
1910 | msph | and discussion of its binding, seals, and any accompanying material ( |
1914 | msph | Most manuscript descriptions touch on several of these categories of information though few include them all, and not all distinguish them as clearly as we propose here. In particular, it is often the case that an existing description will include information for which we propose distinct elements within a single paragraph, or even sentence. The encoder must then decide whether to rewrite the description using the structure proposed here, or to retain the existing prose, marked up simply as a series of |
1922 | msph | element may thus be used in either of two distinct ways. It may contain a series of paragraphs addressing topics listed above and similar ones. Alternatively, it may act as a container for any choice of the more specialized elements described in the remainder of this section, each of which itself contains a series of paragraphs, and may also have more specific attributes. |
1926 | msph | element will normally contain either a series of |
1928 | msph | elements, or a sequence of specialized elements from the |
1932 | msph | the description already exists in a prose form where some of the specialized topics are treated together in paragraphs of prose, but others are treated distinctly; |
1955 | msph | The order in which specific elements may appear is also constrained by the content model; again this is for simplicity of processing. They may of course be processed or displayed in any desired order, but for ease of validation, they must be given in the order specified below. |
1961 | msph1 | element is used to group together those parts of the physical description which relate specifically to the text-bearing object, its format, constitution, layout, etc. The |
1963 | msph1 | attribute is used to indicate the specific type of writing vehicle being described, for example, as a codex, roll, tablet, etc. If used it must appear first in the sequence of specialized elements. The |
1966 | msph1 | support |
1967 | msph1 | , i.e. the physical carrier on which the text is inscribed; and a description of the |
1968 | msph1 | layout |
1969 | msph1 | , i.e. the way text is organized on the carrier. |
1971 | msph1 | Taking these in turn, the description of the support is tagged using the following elements, each of which is discussed in more detail below: |
1981 | msph1 | ), may be used to tag specific terms of interest if so desired. |
2007 | msph1sup | element groups together information about the physical carrier. Typically, for western manuscripts, this will entail discussion of the material (parchment, paper, or a combination of the two) written on. For paper, a discussion of any watermarks present may also be useful. If this discussion makes reference to standard catalogues of such items, these may be tagged using the standard |
2030 | msph1ext | element, defined in the TEI header, may also be used in a manuscript description to specify the number of leaves a manuscript contains, as in the following example: |
2070 | msph1col | element, which is provided when the |
2121 | msphfo | element may be used to indicate the scheme, medium or location of folio, page, column, or line numbers written in the manuscript, frequently including a statement about when and, if known, by whom, the numbering was done. |
2129 | msphfo | Where a manuscript contains traces of more than one foliation, each should be recorded as a distinct |
2131 | msphfo | element and optionally given a distinct value for its |
2136 | msphfo | can then indicate which foliation scheme is being cited by means of its |
2155 | msphco | element is used to summarize the overall physical state of a manuscript, in particular where such information is not recorded elsewhere in the description. It should not, however, be used to describe changes or repairs to a manuscript, as these are more appropriately described as a part of its custodial history (see |
2156 | msphco | ). It should be supplied within the |
2158 | msphco | element, if it discusses the condition of the physical support of the manuscript; within the |
2163 | msphco | ) if it discusses only the condition of the binding or bindings concerned; or within the |
2165 | msphco | element if it discusses the condition of any seal attached to the manuscript. |
2187 | msphla | of the manuscript, that is the way in which text and illumination are arranged on the page, specifying for example the number of written, ruled, or pricked lines and columns per page, size of margins, distinct blocks such as glosses, commentaries, etc. This may be given as a simple series of paragraphs. Alternatively, one or more different layouts may be identified within a single manuscript, each described by its own |
2196 | msphla | element is used, the layout will often be sufficiently regular for the attributes on this element to convey all that is necessary; more usually however a more detailed treatment will be required. The attributes are provided as a convenient shorthand for commonly occurring cases, and should not be used except where the layout is regular. The value |
2198 | msphla | (not-applicable) should be used for cases where the layout is either very irregular, or where it cannot be characterized simply in terms of lines and columns, for example, where blocks of commentary and text are arranged in a regular but complex pattern on each page |
2217 | msphla | elements within the content of the element, as in the following example: |
2239 | msph2 | The second group of elements within a structured physical description concerns aspects of the writing, illumination, or other notation (notably, music) found in a manuscript, including additions made in later hands—the |
2240 | msph2 | text |
2259 | msphwr | element can contain a short description of the general characteristics of the writing observed in a manuscript, as in the following example: |
2276 | msphwr | Where several distinct hands have been identified, this fact can be registered by using the |
2318 | msphwr | can be used to link the relevant parts of the transcription to the appropriate |
2321 | msphwr | handShift new="#Eirsp-2"/ |
2334 | msphwr | element can simply provide a summary description: |
2357 | msphwr | elements should be supplied. Similarly, in the following example, the source text is a typescript with extensive handwritten annotation: |
2391 | msphdec | It can be difficult to draw a clear distinction between aspects of a manuscript which are purely physical and those which form part of its intellectual content. This is particularly true of illuminations and other forms of decoration in a manuscript. We propose the following elements for the purpose of delimiting discussion of these aspects within a manuscript description, and for convenience locate them all within the physical description, despite the fact that the illustrative features of a manuscript will in many cases also be seen as constituting part of its intellectual content. |
2401 | msphdec | Alternatively, it may contain a series of more specific typed |
2428 | msphdec | Where more exact indexing of the decorative content of a manuscript is required, the standard TEI elements |
2470 | msphmu | element may be used to describe the form of notation employed, as in the following example: |
2486 | mspham | element can be used to list or describe any additions to the manuscript, such as marginalia, scribblings, doodles, etc., which are considered to be of interest or importance. Such topics may also be discussed or referenced elsewhere in a description, for example in the |
2590 | msph3 | The third major component of the physical description relates to supporting but distinct physical components, such as bindings, seals and accompanying material. These may be described using the following specialist elements: |
2602 | msphbi | element contains a description of the state of the present and former bindings of a manuscript, including information about its material, any distinctive marks, and provenance information. This may be given as a series of paragraphs if only one binding is being described, or as a series of distinct |
2604 | msphbi | elements, each describing a distinct binding where these are separately described. For example: |
2612 | msphbi | Within a binding description, the elements |
2639 | msphbi | for paragraphs concerned exclusively with the condition of a binding, where this has not been supplied as part of the physical description. |
2679 | msadac | The circumstance may arise where material not originally part of a manuscript is bound into or otherwise kept with a manuscript. In some cases this material would best be treated in a separate |
2682 | msadac | below). There are, however, cases where the additional matter is not self-evidently a distinct manuscript: it might, for example, be a set of notes by a later scholar, or a file of correspondence relating to the manuscript. The |
2688 | msadac | Here is an example of the use of this element, describing a note by the Icelandic manuscript collector Árni Magnússon which has been bound with the manuscript: |
2734 | mshy | The following elements are used to record information about the history of a manuscript: |
2752 | mshy | Information about the origins of the manuscript, its place and date of writing, should be given as one or more paragraphs contained by a single |
2754 | mshy | element; following this, any available information on distinct stages in the history of the manuscript before its acquisition by its current holding institution should be included as paragraphs within one or more |
2802 | mshy | elements where distinct periods of ownership for the manuscript have been identified: |
2841 | msad | Three categories of additional information are provided for by the scheme described here, grouped together within the |
2852 | msad | is required. If any is supplied, it may appear once only; furthermore, the order in which elements are supplied should be as specified above. |
2862 | msadad | element is used to hold information relating to the curation and management of a manuscript. This may be supplied as a note using the global |
2875 | msrh | element may contain simply a series of paragraphs. Alternatively it may contain a |
2877 | msrh | element, followed by an optional series of |
2886 | msrh | element is used to document the primary source of information for the record containing it, in a similar way to the standard TEI |
2888 | msrh | element within a TEI Header. If the record is a new one, made without reference to anything other than the manuscript itself, then it may simply contain a |
2895 | msrh | Frequently, however, the record will be derived from some previously existing description, which may be specified using the standard TEI |
2907 | msrh | If, as is likely, a full bibliographic description of the source from which cataloguing information was taken is included within the |
2911 | msrh | element, or elsewhere in the current document, then it need not be repeated here. Instead, it should be referenced using the standard TEI |
2947 | msrh | element of the standard TEI header; its use here is intended to signal the similarity of function between the two container elements. Where the TEI header should be used to document the revision history of the whole electronic file to which it is prefixed, the |
2960 | msadch | element, which is also available in the TEI header, should be used here to supply any information concerning access to the current manuscript, such as its physical location (where this is not implicit in its identifier), any restrictions on access, information about copyright, etc. |
2977 | msadch | record is used to describe the custodial history of a manuscript, recording any significant events noted during the period that it has been located within its holding institution. It may contain either a series of |
2979 | msadch | elements, or a series of |
2981 | msadch | elements, each describing a distinct incident or event, further specified by a |
3018 | msadsu | element is used to provide information about any representations of the manuscript, such as photographs, which may exist within the holding institution or elsewhere. |
3028 | msadsu | element. However, it is often also convenient to record information such as negative numbers or digital identifiers for unpublished collections of manuscript images maintained within the holding institution, as well as to provide more detailed descriptive information about the surrogate itself. Such information may be provided as prose paragraphs, within which identifying information about particular surrogates may be presented using the standard TEI |
3056 | msadsu | Note the use of the specialized form of title ( |
3057 | msadsu | general material designation |
3060 | msadsu | At a later revision, the content of the |
3062 | msadsu | element is likely to be expanded to include elements more specifically intended to provide detailed information such as technical details of the process by which a digital or photographic image was made. For information about the inclusion of digital facsimile images within a TEI document, refer also to |
3137 | MSref | The selection and combination of modules to form a TEI schema is described in |
# | id | text |
---|---|---|
4 | SA | This chapter discusses a number of ways in which encoders may represent analyses of the structure of a text which are not necessarily linear or hierarchic. The module defined by this chapter provides for the following common requirements: |
6 | SA | to link disparate elements using the |
11 | SA | to link disparate elements without using the |
17 | SA | to segment text into elements convenient for the encoder and to mark arbitrary points within documents (section |
20 | SA | to represent correspondence or alignment among groups of text elements, both those with content and those which are empty (section |
22 | SA | We use the term |
24 | SA | as a special case for the more general notion of correspondence. Using A as a short form for |
27 | SA | set to the value |
29 | SA | , and suppose elements A1, A2, and A3 occur in that order and form one group, while elements B1, B2, and B3 occur in that order and form another group. Then a relation in which A1 corresponds to B1, A2 corresponds to B2, and A3 corresponds to B3 is an alignment. On the other hand, a relation in which A1 corresponds to B2, B1 to C2, and C1 to A2 is not an alignment. |
31 | SA | to synchronize elements of a text, that is to represent temporal correspondences and alignments among text elements (section |
32 | SA | ) and also to align them with specific points in time (section |
35 | SA | to specify that one text element is identical to or a copy of another (section |
47 | SA | to associate segments of a text with interpretations or analyses of their significance (section |
51 | SA | These facilities all use the same set of techniques based on the W3C XPointer framework ( |
63 | SA | is extended to include eight additional attributes to support the various kinds of linking listed above. Each of these attributes is introduced in the appropriate section below. In addition, for many of the topics discussed, a choice of methods of encoding is offered, ranging from simple but less general ones, which use attribute values only, to more elaborate and more general ones, which use specialized elements. |
70 | SAPT | to others if the first has an attribute whose value is a reference to the others: such an element is called a |
80 | SAPT | . These elements all indicate an association between one place in the document (the location of the pointer itself) and one or more others (the elements whose identifiers are specified by the pointer's |
83 | SAPT | link |
100 | SAPTL | element, which represents an association between two (or more) locations by specifying each location explicitly. Its own location is irrelevant to the intended linkage. All three elements use the attribute |
104 | SAPTL | class as a means of indicating the location or locations referenced or pointed to. |
114 | SAPTL | between an element (which, in the case of a pure pointer, is simply a location in a document), and one or more others, known collectively as its |
121 | SAPTL | point, conceptually, at a single target, even if that target may be discontinuous in the document. The |
126 | SAPTL | These three elements also share a common set of attributes, derived from the |
141 | SAPTL | element. All that is required is that the value of the |
143 | SAPTL | (or other pointing) attribute of the one be the value of the |
161 | SAPTL | attribute may take as its value one or more URI references. In the simplest case, each such reference will indicate an element in the current document (or in some other document), for example by supplying the value used for its global |
163 | SAPTL | attribute. It may however carry as value any form of URI, such as a URL pointing to some other document or location on the Internet. Pointing or linking to external documents and pointing and linking where identifiers are not available is described below in section |
170 | SAPTEG | As an example of the use of mechanisms which establish connections among elements, consider the practice (common in 18th-century English verse and elsewhere) of providing footnotes citing parallel passages from classical authors. |
172 | POPE | The figure shows the original page of Pope's Dunciad which is discussed in the text. |
178 | SAPTEG | attribute, placed adjacent to the passage to which the note refers: |
181 | SAPTEG | attribute on the note is used to classify the notes using the typology established in the Advertisement to the work: |
185 | SAPTEG | In the source text, the text of the poem shares the page with two sets of notes, one headed |
214 | SAPTEG | implicit linking |
215 | SAPTEG | ). It relies on the juxtaposition of the note to the text being commented on for the connection to be understood. If it is felt that the mere juxtaposition of the note to the text does not make it sufficiently clear exactly what text segment is being commented on (for example, is it the immediately preceding line, or the immediately preceding two lines, or what?), or if it is decided to place the note at some distance from the text, then the pointing or the linking must be made explicit. We now consider various methods for doing that. |
219 | SAPTEG | element might be placed at an appropriate point within the text to link it with the annotation: |
242 | SAPTEG | ) to enable it to be specified as the target of the pointer element. Because there is nothing in the text to signal the existence of the annotation, the |
244 | SAPTEG | attribute has been given the value |
254 | SAPTEG | attribute has been supplied for the associated text: |
264 | SAPTEG | Given this encoding of the text itself, we can now link the various notes to it. In this case, the note itself contains a pointer to the place in the text which it is annotating; this could be encoded using a |
268 | SAPTEG | attribute of its own and contains a (slightly misquoted) extract from the text marked as a |
292 | SAPTEG | a pointer within one line indicates the note |
294 | SAPTEG | the note indicates the line |
296 | SAPTEG | a pointer within the note indicates the line |
298 | SAPTEG | Note that we do not have any way of pointing from the line itself to the note: the association is implied by containment of the pointer. We do not as yet have a true double link between text and note. To achieve that we will need to supply identifiers for the annotations as well as for the verse lines, and use a |
331 | SAPTEG | element here bears the identifier of the note followed by that of the verse line. We could also allocate an identifier to the reference within the note and encode the association between it and the verse line in the same way: |
346 | SAPTEG | s could be combined into one, as follows: |
352 | SAPTLG | Clearly, there are many reasons for which an encoder might wish to represent a link or association between different elements. For some of them, specific elements are provided in these Guidelines; some of these are discussed elsewhere in the present chapter. The |
354 | SAPTLG | element is a general purpose element which may be used for any kind of association. The element |
356 | SAPTLG | may be used to group links of a particular type together in a single part of the document; such a collection may be used to represent what is sometimes referred to in the literature of Hypertext as a |
358 | SAPTLG | , a term introduced by the Brown University FRESS project in 1969, and not to be confused with the World Wide Web. |
373 | SAPTLG | element provides a convenient way of establishing a default for the |
375 | SAPTLG | attribute on a group of links of the same type: by default, the |
379 | SAPTLG | element has the same value as that given for |
385 | SAPTLG | Typical software might hide a web entirely from the user, but use it as a source of information about links, which are displayed independently at their referenced locations. Alternatively, software might provide a direct view of the link collection, along with added functions for manipulating the collection, as by filtering, sorting, and so on. To continue our previous example, this text contains many other notes of a kind similar to the one shown above. Here are a few more of the lines to which annotations have to be attached, followed by the annotations themselves: |
426 | SAPTLG | attribute can be used to identify the text elements within which the individual targets of the links are to be found. Suppose that the text under discussion is organized into a |
428 | SAPTLG | element, containing the text of the poem, and a |
432 | SAPTLG | attribute can have as its value the identifiers of the |
436 | SAPTLG | , to enable an application to verify that the link targets are in fact contained by appropriate elements, or to limit its search space: |
448 | SAPTLG | domain |
449 | SAPTLG | ; if some notes are contained by a section with identifier |
460 | SAPTLG | attribute can be used to provide further information about the role or function of the various targets specified for each link in the group. The value of the |
462 | SAPTLG | attribute is a list of names (formally, name tokens), one for each of the targets in the link; these names can be chosen freely by the encoder, but their significance should be documented in the encoding description in the header. |
463 | SAPTLG | Since no special element is provided for this purpose in the present version of these Guidelines, the information should be supplied as a series of paragraphs at the end of the |
467 | SAPTLG | In the current example, we might think of the note as containing the |
468 | SAPTLG | source |
469 | SAPTLG | of the imitation and the verse line as containing the |
489 | SAPTIP | In the preceding examples, we have shown various ways of linking an annotation and a single verse line. However, the example cited in fact requires us to encode an association between the note and a |
491 | SAPTIP | of verse lines (lines 284 and 285); we call these two lines a |
492 | SAPTIP | span |
495 | SAPTIP | There are a number of possible ways of correcting this error: one could use the |
497 | SAPTIP | attribute to indicate one end of the span and the special purpose |
501 | SAPTIP | element to point to the other. Another possibility might be to create an element which represents the whole span itself and assign that an |
503 | SAPTIP | attribute, which can then be linked to the |
531 | SAPTIP | then provides an identifier which can be linked to the |
540 | SAPTIP | value of |
546 | SAPTIP | had the value |
548 | SAPTIP | , the link target would be the pointer itself, rather than the objects it points to. |
552 | SAPTIP | element is used to group a collection of |
565 | SAXP | This section introduces more formally the pointing mechanisms available in the TEI. In addition to those discussed so far, the TEI provides methods of pointing: |
575 | SAXP | at arbitrary content in any XML document using TEI-defined XPointer schemes. |
579 | SAXP | All TEI attributes used to point at something else are declared as having the datatype |
599 | SAUR | Like the ubiquitous if misnamed XHTML pointing attribute |
601 | SAUR | , the TEI pointing attributes can point to a document that is not the current document (the one that contains the pointing element) whether it is in the same local filesystem as the current document, or on a different system entirely. In either case, the pointing can be accomplished absolutely (using the entire address of the target document) or relatively (using an address relative to the current base URI in force). The |
605 | SAUR | . If there is none, the base URI is that of the current document. In common practice the current base URI in force is likely to be the value of the |
616 | SAUR | This example points explicitly to a location on the Web, accessible via HTTP |
617 | SAUR | . Suppose however that we wish to access a document stored locally in a file. Again we will supply an absolute URI reference, but this time using a different protocol: |
631 | SAUR | is specified here, the location of the resource |
635 | SAUR | In the following example, however, we first change the current base URI by setting a new value for |
637 | SAUR | . The resource required is then identified by means of a relative URI: |
691 | SABN | Because the default base URI is the current document, a pointer that is specified as a |
692 | SABN | bare name |
694 | SABN | In more recent W3C documents, the term |
695 | SABN | bare name |
696 | SABN | is deprecated in favour of the more explicit |
720 | SABN | of the target element as a bare name only (e.g., |
722 | SABN | ) is the simplest and often the best approach where it can be applied, i.e. where both the source element and target element are in the same XML document, and where the target element carries an identifier. It is the method used extensively in previous sections of this chapter and elsewhere in these Guidelines. |
729 | SAPU | is a useful way of handling the repeated use of long external URIs. However, it is less convenient when your text contains many references to a variety of different sources in different locations. Even in the case of relative links on the local file system, |
733 | SAPU | attributes may become quite lengthy and make XML code difficult to read. To deal with this problem, the TEI provides a useful method of using abbreviated pointers and documenting a way to dereference them automatically. |
735 | SAPU | Imagine a project which has a large collection of XML documents organized like this: |
765 | SAPU | If you want to link a |
773 | SAPU | file, the link will look like this: |
777 | SAPU | If there are many names to tag in a single paragraph, the XML encoding will be congested, and such lengthy links are prone to typographical error. In addition, if the project organization is changed, every relative link will have to be found and altered. |
787 | SAPU | element in the TEI header, as described in |
788 | SAPU | . However, such a link cannot be mechanically processed by an external system that does not know how to interpret it; a human will have to read the header explanation and write code explicitly to reconstruct the intended link. |
794 | SAPU | , and can therefore be used as the value of any attribute which has that datatype, such as |
798 | SAPU | . Such a scheme consists of a prefix with a colon, and then a value. You might, for example, use the prefix |
800 | SAPU | (for "person"), and structure your name tags like this: |
806 | SAPU | ? Essentially, it isn't, except that TEI provides a structured method of dereferencing it (turning it into a computable path, such as |
810 | SAPU | in the TEI header, using the elements and attributes for prefix declaration: |
831 | SAPU | value is constructed with a |
837 | SAPU | , and it contains any number of |
847 | SAPU | provides the string which will be used as a replacement. In this example, using |
849 | SAPU | , the value |
853 | SAPU | , and also captured (through the parentheses in the regular expression); it would then be replaced by the value |
869 | SAPU | in the header to see if there is an available expansion for it, and if there is, it can automatically provide the expansion and generate a full or relative URI. |
873 | SAPU | element in the personography file, it might also be useful to point to an external source which is available on the network, representing the same information in a different way. So there might be a second |
881 | SAPU | Any number of |
883 | SAPU | elements may be provided for the same prefix. A processor may decide to process one or all of them; if it processes only one, it should choose the first one with the correct |
891 | SAPU | When creating private URI schemes, it is recommended that you avoid using any existing registered prefix. A list of registered prefixes is maintained by IANA at |
906 | SATS | TEI XPointer Schemes |
908 | SATS | The pointing schemes described in this chapter are part of a number of such schemes envisaged by the W3C, which together constitute a framework for addressing data within XML documents, known as the XPointer Framework ( |
912 | SATS | . The W3C has predefined a set of such schemes, and maintains a register for their expansion. |
917 | SATS | . These Guidelines also define six other pointer schemes, which provide access to parts of an XML document such as points within data content or stretches of data content. These additional TEI pointer schemes are defined in sections |
921 | SATSin | Introduction to TEI Pointers |
923 | SATSin | Before discussing the TEI pointer schemes, we introduce slightly more formally the terminology used to define them. So far, we have discussed only ways of pointing at components of the XML information set node such as elements and attributes. However, there is often a need in text analysis to address additional types of location such as the |
931 | SATSin | that may arbitrarily cross the boundaries of nodes in a document. The content of an XML document is organized sequentially as well as hierarchically, and it makes sense to consider ranges of characters within a document independently of the nodes to which they belong. From the perspective of most of the pointer schemes discussed below, a TEI document is a tree structure superimposed upon a character stream. Nodes are entities available only in the tree, while points are available only in the stream. For this reason, the schemes below that rely upon character positions ( |
937 | SATSin | ) cannot take nodes into account. Similarly, XPath, being a method for locating nodes in the tree, treats those nodes as atomic, and is unable to address parts of nodes in their document context. |
939 | SATSin | The TEI pointer scheme thus distinguishes the following kinds of object: |
943 | SATSin | A node is an instance of one of the node kinds defined in the |
945 | SATSin | . It represents a single item in the XML information set for a document. For pointing purposes, the only nodes that are of interest are Text Nodes, Element Nodes, and Attribute Nodes. |
949 | SATSin | A Sequence follows the definition in the XPath 2.0 Data Model, with one alteration. A Sequence is an ordered collection of zero or more items, where an item is either a node or a partial text node. |
953 | SATSin | A Text Stream is the concatenation of the text nodes in a document and behaves as though all tags had been removed. A text stream begins at a reference node and encompasses all of the text inside that node (if any) and all the text following it in document order. In XPath terms, this would encompass all of the text nodes beginning at a particular node, and following it on the |
959 | SATSin | A Point represents a dimensionless point between nodes or characters in a document. Every point is adjacent to either characters or elements, and never to another point. Points can only be referenced in relation to an element or text node in the document (i.e. something addressable by either an XPath or a fragment identifier). Points occur either immediately before or after an element, or at a numbered position inside a text stream. Position zero in the stream would be immediately before the first character. Note that points within attribute values cannot mark the beginning or end of a range extending beyond the attribute value, because points indicate a position within a document. Since attribute nodes are by definition un-ordered, they cannot be said to have a fixed position. |
963 | SATSin | The TEI recommends the following seven pointer schemes: |
967 | SATSin | Addresses a node or nodeset using the XPath syntax. ( |
974 | SATSin | addresses the point before (left) or after (right) a node or node set ( |
980 | SATSin | addresses a point inside a text node ( |
994 | SATSin | addresses a range which matches a specified string within a node ( |
1001 | SATSin | scheme refers to the existing XPath specification which is adopted with one modification: the default namespace for any XPath used as a parameter to this scheme is assumed to be the TEI namespace |
1007 | SATSin | draft, but are individually much simpler. At the time of this writing, there is no current or scheduled activity at the W3C towards revising this draft or issuing it as a recommendation. |
1009 | SATSin | A note on namespaces |
1014 | SATSin | ) which when prepended to a resolvable pointer allows for the definition of namespace prefixes to be used in XPaths in subsequent pointers. TEI Pointer schemes assume that un-prefixed element names in TEI Pointer XPaths are in the TEI namespace, |
1018 | SATSin | is thus optional, provided no new prefixes need to be defined. If the schemes described here are used to address non-TEI elements, then any new prefixes to be used in pointer XPaths may be defined using the |
1030 | SATSXP | scheme locates a node within an XML Information Set. The single argument |
1038 | SATSXP | scheme because they represent extracted values rather than locations in the source document. XPath expressions that address attribute nodes are only advisable in the |
1042 | SATSXP | The example below, and all subsequent examples in this section, refer to the following TEI fragment |
1075 | SATSXP | A TEI Pointer that referenced the "normalized" form in the |
1076 | SATSXP | choice |
1077 | SATSXP | in line 1 of the example might look like: |
1081 | SATSXP | When an XPath is interpreted by a TEI processor, the information set of the referenced document is interpreted without any additional information supplied by any schema processing that may or may not be present. In particular this means that no whitespace normalization is applied to a document before the XPath is interpreted. |
1087 | SATSXP | pointers more robust than the other mechanisms discussed in this section even if the designated document changes. For durability in the presence of editing, use of |
1089 | SATSXP | is always recommended when possible. |
1101 | SATSL | scheme locates the point immediately preceding the node addressed by its argument, which is either an |
1105 | SATSL | , the value of an |
1112 | SATSL | lb |
1114 | SATSL | gap |
1134 | SATSR | scheme locates the point immediately following the node addressed by its argument. |
1139 | SATSR | lb |
1156 | SATSSI | scheme locates a point based on character positions in a text stream relative to the node identified by the IDREF or XPATH parameter. The |
1160 | SATSSI | . An offset of 0 represents the position immediately before the first character in either the first text node descendant of the node addressed in the first parameter or the first following text node, if the addressed element contains no text node descendants. |
1165 | SATSSI | s |
1170 | SATSSI | in line 2. |
1184 | SATSRN | s, which are each members of the set |
1196 | SATSRN | locates a (possibly non-contiguous) sequence beginning at the first POINTER parameter and ending at the last. If the POINTER locates a node (i.e. is an XPATH or IDREF), then that node is a member of the addressed sequence. If a sequence addressed by a range pointer overlaps, but does not wholly contain, an element (i.e. it contains only the start but not the end tag or vice versa), then that element is not part of the sequence. |
1199 | SATSRN | s may address sequences of non-contiguous nodes. For example, a range() might select text beginning before an |
1201 | SATSRN | , encompassing the content of a single |
1210 | SATSRN | line 4 |
1219 | SATSRN | indicates the sequence |
1225 | SATSRN | indicates the non-contiguous sequence |
1237 | SATSSR | The string-range() scheme locates a sequence based on character positions in a text stream relative to the node identified by the first parameter. The location of the beginning of the addressed sequence is determined precisely as for |
1245 | SATSSR | parameter is a positive integer that denotes the length of the text stream captured by the sequence. As with |
1247 | SATSSR | , the addressed sequence may contain text nodes and/or elements. The |
1249 | SATSSR | scheme can accept multiple OFFSET, LENGTH pairs to address a non-contiguous sequence in much the same way that range() can accept multiple pairs of pointers. |
1251 | SATSSR | Because string-range() addresses points in the text stream, tags are invisible to it. For example, if an empty tag like |
1253 | SATSSR | is encountered while processing a string-range(), it will be included in the resulting sequence, but the LENGTH count will not increment when it is captured. |
1258 | SATSSR | line 5 |
1259 | SATSSR | from the text immediately following the |
1260 | SATSSR | lb |
1262 | SATSSR | ab |
1267 | SATSSR | indicates the sequence |
1273 | SATSSR | indicates the non-contiguous sequence |
1285 | SATSMA | The match scheme locates a sequence based on matching the REGEX parameter against a text stream relative to the reference node identified by the first parameter. REGEX is a regular expression as defined by |
1299 | SATSMA | are assumed to operate in multi-line mode. The end of the string to be matched against is either the end of the text contained by the element in the first parameter or the end of the document, if that parameter indicates an empty element. The meta-character |
1301 | SATSMA | therefore matches the beginning of the text stream inside or following the reference node, and the meta-character |
1305 | SATSMA | The optional INDEX parameter is an integer greater than 0 which specifies which match should be chosen when there is more than one possibility. If omitted, the first match in the text stream will be used. |
1315 | SATSMA | indicates the sequence |
1318 | SATSMA | line 5 |
1326 | SATSMA | unclear |
1329 | SATSMA | , just their text children. |
1343 | SACR | , chapter 5, verse 7. |
1344 | SACR | They might then wish to translate the string |
1357 | SACR | Several elements in the TEI scheme ( |
1367 | SACR | , just for this purpose. Using the system described in this section, an encoder may specify references to canonical works in a discipline-familiar format, and expect software to derive a complete URI from it. The value of the |
1369 | SACR | attribute is processed as described in this section, and the resulting URI reference is treated as if it were the value of the |
1379 | SACR | attribute to function as required, a mechanism is needed to define the mapping between (for example) |
1385 | SACR | in the TEI header, which contains an algorithm for translating a canonical reference string (like |
1421 | SACR | When an application encounters a canonical reference as the value of |
1423 | SACR | attribute, it might follow this sequence of specific steps to transform it into a URI reference: |
1436 | SACR | match the value of the |
1438 | SACR | attribute to the regular expression found as the value of the |
1442 | SACR | if the value of the |
1446 | SACR | take the value of the |
1448 | SACR | attribute and substitute the back references ($1, $2, etc.) with the corresponding matched substrings |
1450 | SACR | the result is taken as if it were a relative or absolute URI reference specified on the |
1454 | SACR | attribute value as usual |
1456 | SACR | no further processing of this value of the |
1460 | SACR | should take place |
1464 | SACR | if, however, the value of the |
1466 | SACR | attribute does not match the regular expression specified in the value of the |
1478 | SACR | The regular expression language used as the value of the |
1486 | SACR | tei |
1487 | SACR | matches any string that contains |
1488 | SACR | tei |
1489 | SACR | , in the W3C language it only matches the string |
1490 | SACR | tei |
1492 | SACR | The value of the |
1498 | SACR | are replaced by the corresponding substring match. Note that since a maximum of nine substring matches are permitted, the string |
1501 | SACR | the value of the first matched substring followed by the character |
1505 | SACR | . If there is a need for an actual string including a dollar sign followed by a digit that is not supposed to be replaced, the dollar sign should be written as |
1519 | SACRWE | above, an application comes across a |
1521 | SACRWE | value of |
1529 | SACRWE | . The application would first apply the regular expression |
1539 | SACRWE | . The application would then apply these substrings to the pattern |
1549 | SACRWE | If, however, the input string had been |
1551 | SACRWE | , the first regular expression would not have matched. The application would have then tried the second, |
1557 | SACRWE | . It would then have substituted those matched substrings into the pattern |
1559 | SACRWE | to produce a fragment identifier, which when appended to the |
1565 | SACRWE | If the input string had been |
1567 | SACRWE | , neither the first nor the second regular expressions would have successfully matched. The application would have then tried the third, |
1586 | SACRex | In the above example, the value of |
1639 | SACRmu | Canonical reference pointers are intended for use by TEI encoders. However, this specification might be useful to the development of a process for recognizing canonical references in non-TEI documents (such as plain text documents), possibly as part of their conversion to TEI. |
1647 | SASE | In this section, we discuss three general-purpose elements which may be used to mark and categorize both a span of text and a point within one. These elements have several uses, most notably to provide elements which can be given identifiers for use when aligning or linking to parts of a document, as discussed elsewhere in this chapter. They also provide a convenient way of extending the semantics of the TEI markup scheme in a theory-neutral manner, by providing for two neutral or |
1649 | SASE | elements to which the encoder can add any meaning not supplied by other TEI defined elements. |
1690 | SASE | , it is useful where multiple views of a document are to be combined, for example, when a logical view based on paragraphs or verse lines is to be mapped on to a physical view based on manuscript lines. Like those elements, it is a member of the class |
1692 | SASE | and can therefore appear anywhere within a document when the module defined by this chapter is included in a schema. Unlike the other elements in its class, the |
1695 | SASE | , rather than as a means of marking segment boundaries for some arbitrary segmentation of a text. |
1697 | SASE | For example, suppose that we wish to mark the end of the fifth word following each occurrence of some term in a particular text, perhaps to assist with some collocational analysis. This can most easily be done with the help of the |
1712 | SASE | element may be used at the encoder's discretion to mark almost any segment of the text of interest for processing. One use of the element is to mark text features for which no appropriate markup is otherwise defined, i.e. as a simple extension mechanism. Another use is to provide an identifier for some segment which is to be pointed at by some other element, i.e. to provide a target, or a part of a target, for a |
1720 | SASE | as a means of marking segments significant in a metrical or rhyming analysis (see section |
1723 | SASE | as a means of marking typographic lines in drama (see section |
1724 | SASE | ) or title pages (see section |
1735 | SASE | element simply delimits the extent of a stutter, a textual feature for which no element is provided in these Guidelines. |
1759 | SASE | elements may be nested directly within one another, to any degree of analysis considered appropriate. This is taken a little further in the following example, where the |
1802 | SASE | to facilitate this particular kind of analysis. These allow for the explicit markup of units called |
1829 | SASE | attribute of these specialized elements now carries the value carried by the |
1833 | SASE | element. For an analysis not using these traditional linguistic categories, however, the |
1837 | SASE | In language corpora and similar material, the |
1839 | SASE | element may be used to provide an end-to-end segmentation as an alternative to the more specific |
1848 | SASE | element can then be used to mark both features within s-units and segments composed of s-units, as in the following example: |
1850 | SASE | , where the text from which this fragment is taken is analyzed. |
1864 | SASE | tag must be properly enclosed within other elements. Thus, a single |
1866 | SASE | element can be used to group together words in different sentences only if the sentences are not themselves tagged. The first of the following two encodings is legal, but the second is not. |
1890 | SASE | element has the same content as a paragraph in prose: it can therefore be used to group together consecutive sequences of |
1892 | SASE | class elements, such as lists, quotations, notes, stage directions, etc. as well as to contain sequences of phrase-level elements. It cannot however be used to group together sequences of paragraphs or similar text units such as verse lines; for this purpose, the encoder should use intermediate pointers, as described in section |
1894 | SASE | . It is particularly important that the encoder provide a clear description of the principles by which a text has been segmented, and the way in which that segmentation is represented. This should include a description of the method used and the significance of any categorization codes. The description should be provided as a series of paragraphs within the |
1896 | SASE | element of the encoding description in the TEI header, as described in section |
1901 | SASE | element may also be used to encode simultaneous or mutually exclusive variants of a text when the more special purpose elements for simple editorial changes, abbreviation and expansion, addition and deletion, or for a critical apparatus are not appropriate. In these circumstances, one |
1903 | SASE | is encoded for each possible variant, and the set of them is enclosed in a |
1907 | SASE | For example, if one were writing dual-platform instructions for installation of software, it might be useful to use |
1916 | SASE | Elsewhere in this chapter we provide a number of examples where the |
1924 | SASE | element, but is used for portions of the text which occur not within paragraphs or other component-level elements, but at the component level themselves. It is therefore a member of the |
1930 | SASE | element may be used, for example, to tag the canonical verse divisions of Biblical texts: |
1948 | SASE | In other cases, where the text clearly indicates paragraph divisions containing one or more verses, the |
1950 | SASE | element may be used to tag the paragraphs, and the |
1978 | SASE | element is also useful for marking dramatic speeches when it is not clear whether the speech is to be regarded as prose or verse. If, for example, an encoder does not wish to express an opinion as to whether the opening lines of Shakespeare's |
2027 | SACS | , which is a special kind of correspondence involving an ordered set of correspondences. Both cases may be represented using the |
2032 | SACS | . We also discuss the special case of alignment in time or |
2034 | SACS | , for which special purpose elements are proposed in section |
2040 | SACS1 | A common requirement in text analysis is to represent correspondences between two or more parts of a single document, or between places in different documents. Provided that explicit elements are available to represent the parts or places to be linked, then the global linking attribute |
2055 | SACS1 | element should be used, if no other element is available. Where the correspondence is between |
2059 | SACS1 | element should be used, if no other element is available. |
2063 | SACS1 | attribute with spans of content is illustrated by the following example: |
2081 | SACS1 | attributes. This mechanism is simple to apply, but has the drawback that it is not possible to specify more exactly what kind of correspondence is intended. Where this attribute is used, therefore, encoders are encouraged to specify their intent in the associated encoding description in the TEI header. |
2139 | SACSAL | One very important application area for the alignment of parallel texts is multilingual corpora. Consider, for example, the need to align |
2141 | SACSAL | of sentences drawn from a corpus such as the Canadian Hansard, in which each sentence is given in both English and French. Concerning this problem, Gale and Church write: |
2142 | SACSAL | Most English sentences match exactly one French sentence, but it is possible for an English sentence to match two or more French sentences. The first two English sentences [in the example below] illustrate a particularly hard case where two English sentences align to two French sentences. No smaller alignments are possible because the clause |
2144 | SACSAL | in the first English sentence corresponds to (part of) the second French sentence. The next two alignments ... illustrate the more typical case where one English sentence aligns with exactly one French sentence. The final alignment matches two English sentences to a single French sentence. These alignments [which were produced by a computer program] agreed with the results produced by a human judge. |
2146 | SACSAL | , from which the example in the text is taken. |
2148 | SACSAL | The alignment produced by Gale and Church's program can be expressed in four different ways. The encoder must first decide whether to represent the alignment in terms of points within each text (using the |
2152 | SACSAL | element. To some extent the choice will depend on the process by which the software works out where alignment occurs, and the intention of the encoder. Secondly, the encoder may elect to represent the actual encoding using either |
2183 | SACSAL | attribute be specified in both English and French texts, since (as noted above) this attribute is defined as representing a mutual association. However, it may simplify processing to do so, and also avoids giving the impression that the English is translating the French, or vice versa. More seriously, this encoding does not make explicit that it is in fact the entire stretch of text between the anchors which is being aligned, not simply the points themselves. If for example one text contained material omitted from the other, this approach would not be appropriate. |
2239 | SACSXA | The preceding encoding of the alignment of parallel passages from two texts requires that those texts and the alignment all be part of the same document. If the texts are in separate documents, then complete URIs, whether absolute or relative (section |
2240 | SACSXA | ), will be required. These external pointers may appear anywhere within the document, but if they are created solely for use in encoding links, they may for convenience be grouped within the |
2250 | SACSXA | Each topic covered in this work has three parts: a picture, a prose text in Latin describing the topic, and a carefully-aligned translation of the Latin into English, German, or some other vernacular. Key terms in the two texts are typographically distinct, and are linked to the picture by numbers, which appear in the two texts and within the picture as well. |
2252 | SACSXA | First, we consider the text portions. The English and Latin portions have been encoded as distinct |
2299 | SACSXA | Next we consider the non-textual parts of the page. Encoding this requires providing two distinct components: firstly a digitized rendering of the page itself, and secondly a representation of the areas within that image which are to be aligned. In section |
2309 | SACSXA | This example of SVG defines two rectangles at the locations with the specified x and y coordinates. A view is defined on these, enabling them to be mapped by an SVG processor to the image found at the URL specified ( |
2312 | SACSXA | ; for further discussion of using non-TEI XML vocabularies such as SVG within a TEI document, see section |
2315 | SACSXA | As printed, the Comenius text exhibits three kinds of alignment. |
2321 | SACSXA | Particular words or phrases are marked as terms in the two languages by a change of rendition: the English text, which otherwise uses black letter type throughout, has the words |
2339 | SACSXA | Numbered labels appear within the text portions, linking keywords to each other and to sections of the picture. These labels, which have been left out of the above encoding, are attached to the first, third, and last segments in each language quoted below, and also appear (rather indistinctly) within the picture itself. Thus, the images of the study, the student, and his books are each aligned with the correct term for them in the two languages. |
2375 | SACSXA | This map, of course, only aligns whole segments and image portions, since these are the only parts of our encoding which bear identifiers and can therefore be pointed to. To add to it the alignment between the typographically distinct words mentioned above, new elements must be defined, either within the text itself or externally by using stand-off techniques. Encoding these word pairs as |
2379 | SACSXA | , although intuitively obvious, requires a non-trivial decision as to whether the Latin text is glossing the English, or vice versa. Tagging all the marked words as |
2381 | SACSXA | avoids the difficult decision, but might be thought by some encoders to convey the wrong information about the words in question. Simply tagging them as additional embedded |
2385 | SACSXA | These solutions all require the addition of further markup to the text. This may pose no problems, or it may be infeasible, for example because the text is held on a read-only medium. If it is not feasible to add more markup to the original text, some form of stand-off markup will be needed. Any item within the text that can be pointed to using the various pointer schemes discussed in this chapter may be used, not simply those which rely on the existence of an |
2410 | SACSXA | To express the same alignment mentioned above, we could use an XPath expression to identify the required |
2422 | SACSXA | correspond, we might express the link between them as follows: |
2429 | SASY | In the previous section we discussed two particular kinds of alignment: alignment of parallel texts in different languages; and alignment of texts and portions of an image. In this section we address another specialized form of alignment: synchronization. The need to mark the relative positions of text components with respect to time arises most naturally and frequently in transcribed spoken texts, but it may arise in any text in which quoted speech occurs, or events are described within a time frame. The methods described here are also generalizable for other kinds of alignment (for example, alignment of text elements with respect to space). |
2434 | SASYNC | Provided that explicit elements are available to represent the parts or places to be synchronized, then the global linking attribute |
2443 | SASYNC | elements may be used to make explicit the fact that the synchronous elements are aligned. |
2445 | SASYNC | To illustrate the use of these mechanisms for marking synchrony, consider the following representation of a spoken text: |
2447 | SASYNC | B: The first time in twenty five years, we've cooked Christmas (unclear) for a blooming great load of people. A: So you're [1] (unclear) [2] B: [1] It will be [2] nice in a way, but, [3] be strange. [4] A: [3] Yeah [4], yeah, cos it, it's [5] the [6] B: [5] not [6] |
2456 | SASYNC | To encode this we use the spoken texts module, described in chapter |
2516 | SASYNC | As with other forms of alignment, synchronization may be expressed between stretches of speech as well as between points. When complete utterances are synchronous, for example, if one person says |
2529 | SASYNC | (where one speaker starts speaking before another has finished) is thus to use the |
2548 | SASYNC | element and the content of a |
2550 | SASYNC | element, and between the content of an |
2563 | SASYMP | A synchronous alignment specifies which points in a spoken text occur at the same time, and the order in which they occur, but does not say at what time those points actually occur. If that information is available to the encoder it can be represented by means of the |
2573 | SASYMP | attribute, whose value is a string which specifies a particular time, or indirectly by means of the |
2579 | SASYMP | is used, then the |
2583 | SASYMP | attributes should also be used to indicate the amount of time that has elapsed since the time specified by the element pointed to by the |
2585 | SASYMP | attribute; the value |
2591 | SASYMP | elements are uniformly spaced in time, then the |
2599 | SASYMP | elements. If the intervals vary, but the units are all the same, then the |
2615 | SASYMP | element which specifies the reference or origin for the timings within the |
2617 | SASYMP | ; this must, of course, specify its position in time absolutely. If the origin of a timeline is unknown, then this attribute may be omitted. |
2643 | SASYMP | To avoid the need for two distinct link groups (one marking the synchronization of anchors with each other, and the other marking their alignment with points on the time line) it would be better to link the |
2656 | SASYMP | Finally, suppose that a digitized audio recording is also available, along with an XML file that assigns identifiers to the various temporal spans of sound. For example, the following Synchronized Multimedia Integration Language (SMIL, pronounced "smile") fragment: |
2682 | SAIE | , that is, an element which is not explicitly present in a text, but the presence of which an application can infer from the encoding supplied. In this section, we are concerned with virtual elements made by simply cloning existing elements. In the next section ( |
2685 | SAIE | Provided that explicit elements are available to represent the parts or places to be linked, then the global linking attributes |
2694 | SAIE | It is useful to be able to represent the fact that one element of text is identical to others, for analytical purposes, or (especially if the elements have lengthy content) to obviate the need to repeat the content. For example, consider the repetition of the |
2708 | SAIE | element above has identical content to the first. The |
2710 | SAIE | attribute is provided for this purpose. Using it, we could recode the last line of the above example as follows: |
2716 | SAIE | attribute may be used to document the fact that two elements have identical content. It may be regarded as a special kind of link. It should only be attached to an element with identical content to that which it targets, or to one the content of which clearly designates it as a repetition, such as the word |
2720 | SAIE | in the representation of the chorus of a song, the second time it is to be sung. The relation specified by the |
2722 | SAIE | attribute is symmetric: if a chorus is repeated three times and each repetition bears a |
2728 | SAIE | attribute is used in a similar way to indicate that the content of the element bearing it is identical to that of another. The difference is that the content is not itself repeated. The effect of this attribute is thus to create a |
2730 | SAIE | of the element indicated. Using this attribute, the repeated date in the first example above could be recoded as follows: |
2732 | SAIE | An application program should replace whatever is the actual content of an element bearing a |
2734 | SAIE | attribute with the content of the element specified by it. If the content of the element specified includes other elements, these will become embedded within the element bearing the attribute. Care must be taken to ensure that the document is valid both before and after this embedding takes place. If, for example, the element bearing a |
2736 | SAIE | attribute requires a mandatory sub-component, then this component must be present (though possibly empty), even though it will be replaced by the content of the targeted element. |
2790 | SAAG | Because of the strict hierarchical organization of elements, or for other reasons, it may not always be possible or desirable to include all the parts of a possibly fragmented text segment within a single element. In section |
2791 | SAAG | we introduced the notion of an intermediate pointer as a way of pointing to discontinuous segments of this kind. In this section we first describe another way of linking the parts of a discontinuous whole, using a set of linking attributes, which are made available for any tag by following the procedure described at the beginning of this chapter. We then describe how the |
2795 | SAAG | element, which is a special-purpose linking element specifically for representing the aggregation of parts, and the |
2801 | SAAG | The linking attributes for aggregation are |
2814 | SAAG | Here is the material on which we base our first illustration of the use of these mechanisms. Our problem is to represent the s-units identified below as |
2844 | SAAG | attributes, we can link the s-units with identifiers |
2854 | SAAG | Double linking of the two s-units, as illustrated by the last of these encodings, is equivalent to specifying a |
2862 | SAAG | attribute with a value of |
2863 | SAAG | join |
2864 | SAAG | to specify that the link is to be understood as joining its targets into a single aggregate. |
2871 | SAAG | join |
2883 | SAAG | element within a text is significant: it must be supplied at a position where the element indicated by its |
2893 | SAAG | As a further example, consider the following list of authors' names. The object of the |
2895 | SAAG | element here is to provide another list, composed of those authors from the larger list who happen to come from Heidelberg: |
2917 | SAAG | can be used to reconstruct a text cited in fragments presented out of order. The poem being remembered (an unusual translation of a well-known poem by Basho) runs |
2958 | SAAG | is available for use when a number of |
2964 | SAAG | if they are all of the same type, and also allows us to restrict the domain within which their target elements are to be found, in the same way as for |
2971 | SAAG | may appear only where the elements represented by its contents are legal. Thus if we had created many |
2973 | SAAG | tags of the sort just described, we could group them together, and require that their components are all contained by an element with the identifier |
2985 | SAAG | ). It may also be used as a convenient way of representing a variety of analytic units, like the |
2998 | SAAG | And then he added, |
3011 | SAAG | Suppose now that we wish to represent an interpretation of the above passage in which we distinguish between the various |
3015 | SAAG | attribute has been used for this purpose; its value on each occasion supplies a pointer to the |
3017 | SAAG | to which each speech is attributed. (For convenience in this example, we use simply the first occurrence of the names used for each voice as the target for these pointers.) Note also that we add |
3019 | SAAG | attributes to each distinct speech fragment, which we can then use to link the material spoken by each voice: |
3060 | SAAG | s making up the |
3068 | SAAG | value for them. |
3147 | SAAT | if any of those elements could be present in a text, but one and only one of them is; in addition, we say that those elements are |
3151 | SAAT | if at least one (and possibly more) of them is present. The elements that are in alternation may also be called |
3155 | SAAT | The need to mark exclusive alternation arises frequently in text encoding. A common situation is one in which it can be determined that exactly one of several different words appears in a given location, but it cannot be determined which one. One way to mark such an exclusive alternation is to use the linking attribute |
3157 | SAAT | . Having marked an exclusive alternation, it can sometimes later be determined which of the alternants actually appears in the given location. To preserve the fact that an alternation was posited, one can add the linking attribute |
3159 | SAAT | to a tag which hierarchically encompasses the alternants, so that it points to the one which actually appears. To assign responsibility and degree of certainty to the choice, one can use the |
3161 | SAAT | tag described in chapter |
3162 | SAAT | . Also see that chapter for further discussion of certainty in general. |
3172 | SAAT | A more general way to mark alternation, encompassing both exclusive and inclusive alternation, is to use the linking element |
3174 | SAAT | . The description and attributes of this tag and of the associated grouping tag |
3180 | SAAT | To take a simple hypothetical example, suppose in transcribing a spoken text, we encounter an utterance that we can understand either as |
3193 | SAAT | If it is then determined that the speaker said |
3197 | SAAT | , the encoder could amend the text by deleting the alternant containing |
3203 | SAAT | value to the |
3205 | SAAT | attribute value on the |
3225 | SAAT | seg type="word" |
3227 | SAAT | seg type="character" |
3252 | SAAT | , but is certain that if it is |
3254 | SAAT | , then the other uncertain word is definitely |
3290 | SAAT | The value of the |
3292 | SAAT | attribute is defined as a list of identifiers; hence it can also be used to narrow down the range of alternants, as in: |
3302 | SAAT | element tag appears, and is thus equivalent to just the alternation of those two tags: |
3311 | SAAT | attribute can also be used in case there is uncertainty about the tag that appears in a certain position. For example, the occurrence of the word |
3315 | SAAT | can be interpreted, in the absence of other information, either as a person's name or as a date. The uncertainty can be rendered as follows, using the |
3326 | SAAT | ; this avoids having to repeat the content of the element whose correct tagging is in doubt. |
3341 | SAAT | element in the body of a document, or as the first |
3358 | SAAT | attribute, if used, would appear on the |
3384 | SAAT | Now we define the specialized linking element |
3407 | SAAT | , which is to be used if one wishes to assign |
3409 | SAAT | to the targets (alternants). Its value is a list of numbers, corresponding to the targets, expressing the probability that each target appears. |
3410 | SAAT | If the alternants are mutually exclusive, then the weights must sum to 1.0. |
3467 | SAAT | alt mode="incl" |
3472 | SAAT | is the number of targets. If the sum is 0%, then the alternation is equivalent to exclusive alternation; if the sum is (100 x k)%, then all of the alternants must appear, and the situation is better encoded without an |
3486 | SAAT | attribute defaults to the value |
3498 | SAAT | , but that if the first word is |
3500 | SAAT | , then the third word is |
3502 | SAAT | . Now suppose we have the following additional information: if |
3504 | SAAT | occurs, then the probability that |
3508 | SAAT | occurs is 50%; if |
3510 | SAAT | occurs, then the probability that |
3530 | SAAT | As noted above, when the |
3534 | SAAT | has the value |
3536 | SAAT | , then each weight states the probability that the corresponding alternative occurs, given that at least one of the other alternatives occurs. |
3546 | SAAT | Another very similar example is the following regarding the text of a Broadway song. In three different versions of the song, the same line reads |
3552 | SAAT | The variant readings are found in the commercial sheet music, the performance score, and the Broadway cast recording. |
3564 | SAAT | Let us extend the example with a further (imaginary) variation, supposing for the sake of the argument that the next line is variously given as |
3570 | SAAT | element, we can express the conviction that if the first choice for the second line is correct, then the probability that the first line contains |
3572 | SAAT | is 90%, and each of the others 5%; whereas if the second choice for the second line is correct, then the probability that the first line contains |
3616 | SASOin | Most of the mechanisms defined in this chapter rely to a greater or lesser extent on the fact that tags in a marked-up document can both assert a property for a span of text which they enclose, and assert the existence of an association between themselves and some other span of text elsewhere. In stand-off markup, there is a clear separation of these two behaviours: the markup does not directly contain any part of the text, but instead includes it by reference. One specific mechanism recommended by these Guidelines for this purpose is the standard XInclude mechanism defined by the W3C; another is to use pointers as demonstrated elsewhere in this chapter. |
3618 | SASOin | There are many reasons for using stand-off markup: the source text might be read-only so that additional markup cannot be added, or a single text may need to be marked up according to several hierarchically incompatible schemes, or a single scheme may need to accommodate multiple hierarchical ambiguities, so that a single markup tree is not the most faithful representation of the source material. |
3628 | SASOin | source document |
3631 | SASOin | a document to which the stand-off markup refers (a source document can be either XML or plain text); there may be more than one source document. |
3637 | SASOin | markup that is already present in an XML source document |
3643 | SASOin | markup that is either outside of the source document and points into it, to the data it describes, or alternatively is in another part of the source document and points elsewhere within the document to the data it describes |
3649 | SASOin | a document that contains stand-off markup that points to a different, source document |
3655 | SASOin | the action of creating a new XML document with external markup and data integrated with the source document data, and possibly some source document markup as well |
3661 | SASOin | a process applied to markup from a pre-existing XML document, which splits it into two documents, an XML (external) document containing some of the markup of the original document, and another (source) XML document containing whatever text content and markup has not been extracted into the stand-off document; if all markup has been externalized from a document, the new source may be a plain text document |
3667 | SASOin | any valid TEI markup can be either internal or external, |
3669 | SASOin | external markup can be internalized by applying it to the document content by either substituting the existing markup or adding to it, to form a valid TEI document, and |
3679 | SASOov | Stand-off markup which relies on the inclusion of virtual content is adequately supported by the W3C XInclude recommendation, whose use these Guidelines also recommend. |
3680 | SASOov | The version on which this text is based is the |
3685 | SASOov | XInclude defines a namespace ( |
3695 | SASOov | discussed elsewhere in this chapter to point to the actual fragments of text to be internalized. Although XInclude only requires support for the |
3700 | SASOov | XInclude is a W3C recommendation which specifies a syntax for the inclusion within an XML document of data fragments placed in different resources. Included resources can be either plain text or XML. XInclude instructions within an XML document are meant to be replaced by a resource targeted by a URI, possibly augmented by an XPointer that identifies the exact subresource to be included. |
3706 | SASOov | attribute to specify the location of the resource to be included; its value is a URI containing, if necessary, an XPointer. Additionally, it uses the |
3709 | SASOov | text |
3712 | SASOov | ) to specify whether the included content is plain text or an XML fragment, and the |
3714 | SASOov | attribute to provide a hint, when the included fragment is text, of the character encoding of the fragment. An optional |
3718 | SASOov | ; it specifies alternative content to be used when the external resource cannot be fetched for some reason. Its use is not, however, recommended for stand-off markup. |
3722 | SASOso | Stand-off Markup in TEI |
3726 | SASOso | internalization of one or more source documents' content into a stand-off document. TEI use of XInclude for stand-off markup enables use of XInclude-conformant software to perform this useful operation. However, internalization is not clearly defined for all stand-off files, because the structure of the internal and external markup trees may overlap. In particular, when an external markup document selects a range that overlaps partial elements in the source document, it is not clear how the semantics of internalization (inclusion) should work, since partial elements are not XML objects. |
3728 | SASOso | XInclude defines a semantics for this case that involves only complete elements. |
3730 | SASOso | When a range selection partially overlaps a number of elements in a source document, XInclude specifies that the partially overlapping elements should be included as well as all completely overlapping elements and characters (partially overlapping characters are not possible). The effect of this is that elements that straddle the start or end of a selected range will be included as wrappers for those of their children that are completely or partially selected by the range. For example, given the following source document: |
3746 | SASOso | The result of the inclusion is two paragraph elements, while the original range designated in the source document overlapped two paragraph fragments. |
3747 | SASOso | The semantics of XInclude require the creation of well-formed XML results even though the pointing mechanisms it uses do not necessarily respect the hierarchical structure of XML documents, as in this case. While this is a good way to ensure that internalization is always possible, it has implications for the use of XInclude as a notation for the |
3751 | SASOso | When overlapping hierarchies need to be represented for a single document, each hierarchy must be represented by a separate set of XInclude tags pointing to a common source document. This sort of structure corresponds to common practice in work with linguistic text corpora. In such corpora, each potentially overlapping hierarchy of elements for the text is represented as a separate stream of stand-off markup. Generally the source text contains markup for the smallest significant units of analysis in the corpus, such as words or morphemes; this information and its markup represent a layer of common information that is shared by all the various hierarchies. As a way of organizing the representation of complex data, this technique generally allows a large number of |
3753 | SASOso | attributes to be attached to the shared elements, providing robust anchors for links and facilitating adjustments to the source document without breaking external documents that reference it. |
3756 | SASOso | Any tag can be externalized by |
3757 | SASOso | removing its content and replacing it with an |
3761 | SASOso | For instance the following portion of a TEI document: |
3777 | SASOso | can be externalized by placing the actual text in a separate document, and providing exactly the same markup with the |
3793 | SASOso | Please note that this specification requires that the XInclude namespace declaration is present in all cases. The |
3795 | SASOso | element contains text or XML fragments to be placed in the document if the inclusion fails for any reason (for instance due to inaccessibility of an external resource). The |
3797 | SASOso | element is optional; if it is not present an XInclude processor must signal a fatal error when a resource is not found. This is the preferred behaviour for use with stand-off markup. These Guidelines recommend against the use of |
3805 | SASOva | The whole source fragment identified by an XInclude element, as well as any markup contained therein, is inserted in the position specified, and an XInclude processor is required to ensure that the resulting internalized document is well-formed. This has obvious implications when the external document contains XML markup. A plain text source document will always create a well-formed internalized document. |
3807 | SASOva | While a TEI customization may permit |
3809 | SASOva | elements in various places in a TEI document instance, in general these Guidelines suggest that validity be verified after the resolution of all the |
3817 | SASOfr | When the source text is plain text, the overall form of the XPointer pointing to it is of minimal importance. The form of the XPointer matters considerably, on the other hand, when the source document is XML. |
3819 | SASOfr | In this case, it is rather important to distinguish whether we intend to substitute the source XML with the new one, or just to add new markup to it. The XPointers used in the references can express both cases. |
3851 | SASOfr | will select the whole poem, text content |
3857 | SASOfr | hypertext links (NB: in XPointer whitespace-only text nodes count). |
3863 | SASOfr | will only select the text of the poem, with no markup inside. |
3881 | SAAN | and elsewhere, provision is made for analytic and interpretive markup to be represented outside of textual markup, either in the same document or in a different document. The elements in these separate domains can be connected, either with the pointing attributes |
3884 | SAAN | analysis |
3904 | linking | Linking, segmentation and alignment |
3913 | SAref | The selection and combination of modules to form a TEI schema is described in |
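
The weighting rules for alternation given in the SAAT rows above (weights are probabilities, one per target; for mutually exclusive alternants they must sum to 1.0; for inclusive alternation the sum may range from 0 up to the number of targets) can be illustrated with a small check. This is only a sketch under stated assumptions: the function name `check_weights` and the tolerance are hypothetical, the mode value `incl` is taken from the fragments above, and `excl` is assumed as its counterpart. It is not part of any TEI tool.

```python
import math


def check_weights(weights, mode="excl", tol=1e-9):
    """Return True if a list of weights is plausible for an alternation.

    Assumptions (not taken from any TEI software):
    - each weight is a probability in the range 0.0 to 1.0;
    - mode "excl" means mutually exclusive alternants, whose weights
      must sum to 1.0;
    - mode "incl" means inclusive alternation, where the sum may lie
      anywhere between 0 and k (k = number of targets), which already
      follows from each weight being in range.
    """
    if not weights:
        return False
    if any(w < 0.0 or w > 1.0 for w in weights):
        return False
    if mode == "excl":
        return math.isclose(sum(weights), 1.0, abs_tol=tol)
    if mode == "incl":
        return True
    raise ValueError(f"unknown alternation mode: {mode!r}")


# Three alternants weighted 90%, 5% and 5%, as in the song example.
print(check_weights([0.9, 0.05, 0.05]))
# Two alternants that may both occur.
print(check_weights([0.5, 0.5], mode="incl"))
```

A check of this kind could be run over every alternation in a document before publication; the floating-point tolerance is there because decimal weights such as 0.05 do not add up exactly in binary arithmetic.
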
# | id | text |
---|---|---|
23 | VEMEana-eg-23 | Doglia mi reca ne lo core ardire |
79 | TSSASE-eg-20 | Structures of social action: Studies in conversation analysis |
343 | NDPER-eg-17 | membrane 5, entry 154 |
441 | VEST-eg-4 | 2nd edition |
566 | DIC-CP | Collins Pocket Dictionary of the English language |
586 | SA-BIBL-2 | Orbis Pictus: a facsimile of the first English edition of 1659 |
603 | PHegsurp2 | Poeti del Duecento |
853 | COEDADD-eg-89 | The waste land: a facsimile and transcript of the original drafts including the annotations of Ezra Pound |
883 | DS-eg-05 | Is there a text in this class? The authority of interpretive communities |
922 | FTGRA-eg-18 | 2nd edition |
1006 | COHQU-eg-43 | Natural language processing in Prolog |
1257 | DRSTA-eg-40 | Everyman's library: the drama |
1289 | COBICOR-eg-248 | ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure |
1473 | COHQQ-eg-33 | note 12 |
1600 | DRPRO-eg-7 | epilogue |
1634 | STGA-eg-9 | Crofts American history series |
1703 | TSBA-eg-19 | The approach of the Text Encoding Initiative to the encoding of spoken discourse |
1723 | MS-eg-001 | A summary catalogue of western manuscripts in the Bodleian Library at Oxford which have not hitherto been catalogued ... |
1733 | MS-eg-001 | P5-MS: A general purpose tagset for manuscript description |
1762 | STGA-eg-10 | Crofts American history series |
1931 | TSSASE-eg-37 | Report on the compatibility of J P French's spoken corpus transcription conventions with the TEI guidelines for transcription of spoken texts |
1958 | GDFT-eg-12 | Partial family tree for Bertrand Russell |
2322 | DSBACK-eg-83 | index to vol. 1 |
2556 | WHITMS1 | "[I am a curse]" in |
2562 | WHITMS2 | Single leaf of Notes for a poem about night "visions," possibly related to the untitled 1855 poem that Whitman eventually titled "The Sleepers." Fragments of an unidentified newspaper clipping about the Puget Sound area have been pasted to the leaf. The Trent Collection of Walt Whitman Manuscripts, Duke University Rare Book, Manuscript, and Special Collections Library. |
3666 | BIB | Works cited elsewhere in the text of the Guidelines |
3752 | Burnard1995b | The Design of the TEI Encoding Scheme |
4361 | SG-BIBL-2 | Refining our notion of what text really is: the problem of overlapping hierarchies |
4630 | CO-BIBL-1 | An international handbook of the science of language and society |
4767 | TS-BIBL-3 | TEI document TEI AI2 W1 |
4912 | DI-BIBL-3 | TEI working paper TEI AIW20 |
5015 | DI-BIBL-6 | Principles for Encoding machine readable dictionaries |
5069 | DI-BIBL-8 | Electronic dictionary encoding: customizing the TEI Guidelines |
5609 | NH-BIBL-7 | The layered markup and annotation language |
5661 | FS-BIBL-01 | A rationale for the TEI recommendations for feature-structure markup, |
5728 | ISO-690 | ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure |
5740 | ISO-12620 | ISO 12620:2009: Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources |
5750 | RICA | Istituto Centrale per il Catalogo Unico |
5752 | RICA | Regole italiane di catalogazione per autori |
5819 | BIB-RDG | Reading list |
5821 | BIB-RDG | The following lists of readings in markup theory and the TEI derive from work originally prepared by Susan Schreibman and Kevin Hawkins for the TEI Education Special Interest Group, recoded in TEI P5 by Sabine Krott and Eva Radermacher. They should be regarded only as a snapshot of work in progress, to which further contributions and corrections are welcomed (see further |
6296 | Burnard1999 | Closing plenary address at the XML Europe Conference, Granada, May 1999 |
6374 | Burnard2001a | Dalle «Due Culture» Alla Cultura Digitale: La Nascita del Demotico Digitale |
6490 | Burnard2005b | Metadata for corpus work |
7447 | Pichler1995 | Culture and Value: Philosophy and the Cultural Sciences. Beiträge des 18. Internationalen Wittgenstein Symposiums 13–20. August 1995 Kirchberg am Wechsel |
7450 | Pichler1995 | Kirchberg am Wechsel |
8357 | Unsworthetaleds2004 | TEI Consortium |
8495 | BIB-RDG | TEI |
8609 | BaumanandCatapano1999 | TEI and the Encoding of the Physical Structure of Books |
8639 | Bauman2005 | TEI HORSEing Around |
8720 | Burnard1993 | Rolling your own with the TEI |
8836 | Burnard1997 | Prepared for a seminar on Etiquetación y extracción de información de grandes corpus textuales within the Curso Industrias de la Lengua (14–18 de Julio de 1997). Sponsored by the Fundacion Duques de Soria. |
8853 | BurnardandPopham1999 | Putting Our Headers Together: A Report on the TEI Header Meeting 12 September 1997. |
8916 | Ciottied2005 | Il Manuale TEI Lite: Introduzione Alla Codifica Elettronica Dei Testi Letterari |
8936 | Chang2001 | The Implications of TEI |
8982 | DigitalLibraryFederation1998 | TEI and XML in Digital Libraries: Meeting June 30 and July 1, 1998, Library of Congress, Summary/Proceedings |
8998 | DigitalLibraryFederation2007 | TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices |
9096 | Loiseaunodate | Introduction à la TEI |
9120 | MarkoandKelleher2001 | Descriptive Metadata Strategy for TEI Headers: A University of Michigan Library Case Study |
9150 | Mertz2003 | XML Matters: TEI — the Text Encoding Initiative |
9264 | Rahtz2003 | Building TEI DTDs and Schemas on demand |
9296 | Rahtzetal2004 | A unified model for text markup: TEI, Docbook, and beyond |
9356 | Robinsonnodate | Making a Digital Edition with TEI and Anastasia |
9374 | Seaman1995 | The Electronic Text Center Introduction to TEI and Guide to Document Preparation |
9394 | Simons1999 | Using Architectural Forms to Map TEI Data into an Object-Oriented Database |
9424 | Smith1999 | Textual Variation and Version Control in the TEI |
9556 | Vanhoutte2004 | An Introduction to the TEI and the TEI Consortium |
# | id | text |
---|---|---|
2 | USE | Using the TEI |
4 | USE | This section discusses some technical topics concerning the deployment of the TEI markup scheme documented elsewhere in these Guidelines. |
6 | USE | we discuss the scope and variety of the TEI customization mechanisms, distinguishing |
8 | USE | modifications, which result in a schema that supports a subset of the distinctions made in the full TEI system, on the one hand, from |
12 | USE | TEI Conformance |
13 | USE | , distinguishing documents which are algorithmically TEI-conformant ("TEI-conformable") from those which are intrinsically conformant ("TEI-conformant"); we also define the concept of a TEI extension. Since the ODD markup description language defined in chapter |
14 | USE | is fundamental to the way conformance and customization are handled in the TEI system, these two definitional sections are followed by a section ( |
20 | MEDIATYPE | Serving TEI files with the TEI Media Type |
22 | MEDIATYPE | In February 2011, the media type |
28 | MEDIATYPE | ). We recommend that any XML file whose root element is in the TEI namespace be served with the media type |
30 | MEDIATYPE | to enable and encourage automated recognition and processing of TEI files by external applications. |
33 | DT | Obtaining the TEI Schemas |
36 | DT | , the modules making up the TEI scheme are generated from a single set of XML source files. Schemas can be generated for TEI customizations in each of XML DTD language, W3C schema language, and RELAX NG schema language. In the body of the Guidelines, only the last of these forms is presented, using the compact syntax. |
38 | DT | The TEI schemas and Guidelines are widely available over the Internet and elsewhere. The canonical home for the TEI source, the schema fragments generated from it, and example modifications, is the TEI repository at |
39 | DT | ; versions are also available in other formats, along with copies of the Guidelines and related materials, from the TEI web site at |
46 | MD | These Guidelines provide an encoding scheme suitable for encoding a very wide range of texts, and capable of supporting a wide variety of applications. For this reason, the TEI scheme supports a variety of different approaches to solving similar problems, and also defines a much richer set of elements than is likely to be necessary in any given project. Furthermore, the TEI scheme may be extended in well-defined and documented ways for texts that cannot be conveniently or appropriately encoded using what is provided. For these reasons, it is almost impossible to use the TEI scheme without customizing or personalizing it in some way. |
48 | MD | This section describes how the TEI encoding scheme may be customized, and should be read in conjunction with chapter |
49 | MD | , which describes how a specific application of the TEI encoding scheme should be documented. The documentation system described in that chapter is, like the rest of the TEI scheme, independent of any particular schema or document type definition language. |
51 | MD | Formally speaking, these Guidelines provide both syntactic rules about how elements and attributes may be used in valid documents and semantic recommendations about what interpretation should be attached to a given syntactic construct. In this sense, they provide both a |
56 | MD | TEI Abstract Model |
57 | MD | , which defines a set of related concepts, and the |
58 | MD | TEI schema |
59 | MD | which defines a set of syntactic rules and constraints. Many (though not all) of the semantic recommendations are provided solely as informal descriptive prose, though some of them are also enforced by means of such constructs as datatypes (see |
62 | MD | them in the sense of attaching slightly variant semantics to them. |
68 | MD | which can take on arbitrary string values, depending on how it is used in a document. A new type of |
69 | MD | note |
70 | MD | , therefore, requires no change in the existing model. On the other hand, for many applications, it may be desirable to constrain the possible values for the |
72 | MD | attribute to a small set of possibilities. A schema modified in this way would no longer necessarily regard as valid the same set of documents as the corresponding unmodified TEI schema, but would remain faithful to the same conceptual model. |
74 | MD | This section explains how the TEI scheme can be customized by suppressing elements, modifying classes of elements, adding elements, and renaming elements. Documents which validate against an application of the TEI scheme which has been customized in this way may or may not be considered |
79 | MD | The TEI scheme is designed to support modification and customization in a documented way that can be validated by an XML processor. This is achieved by writing a small TEI-conformant document, from which an appropriate processor can generate both human-readable documentation, and a schema expressed in a language such as RELAX NG or DTD. The mechanisms used to instantiate a TEI schema differ for different schema languages, and are therefore not defined here. In XML DTDs, for example, extensive use is made of parameter entities, while in RELAX NG schemas, extensive use is made of patterns. In either case, the names of elements and, wherever possible, their attributes and content models are defined indirectly. The syntax used to implement this indirection also varies with the schema language used, but the underlying constructs in the TEI Abstract Model are given the same names. |
82 | MD | , the TEI encoding scheme comprises a set of class and macro declarations, and a number of |
84 | MD | . Each module is made up of element and attribute declarations, and a schema is made by combining a particular set of modules together. In the absence of any other kind of personalization, when modules are combined together: |
88 | MD | each such element is identified by the canonical name given it in these Guidelines; |
90 | MD | the content model of each such element is as defined by these Guidelines; |
94 | MD | the elements comprising element classes and the meaning of macro declarations expressed in terms of element classes are determined by the particular combination of modules selected. |
95 | MD | The TEI personalization mechanisms allow the user to control this behaviour as follows: |
97 | MD | particular elements may be suppressed, removing them from any classes in which they are members, and also from any generated schema; |
99 | MD | within certain limits, the name (generic identifier) associated with an element may be changed, without changing the semantic or syntactic properties of the element; |
101 | MD | new elements may be added to an existing class, thus making them available in macros or content models defined in terms of those classes; |
103 | MD | additional attributes, or attribute values, may be specified for an individual element or for classes of elements; |
105 | MD | within certain limits, attributes, or attribute values, may also be removed either from an individual element or from classes of elements; |
107 | MD | the characteristics inherited by one class from another class may be modified by modifying its class membership: all members of the class then inherit the changed characteristics; |
109 | MD | the set of values legal for an attribute or attribute class may be constrained or relaxed by supplying or modifying a value list, or by modifying its datatype. |
114 | MD | ; in the remainder of this section we give specific examples to illustrate how that system may be applied. An ODD processor, such as the Roma application supported by the TEI, or any other comparable set of stylesheets, will use the declarations provided by an ODD to generate appropriate sets of declarations in a specific schema language such as RELAX NG or the XML DTD language. We do not discuss in detail here how this should be done, since the details are schema language-specific; some background information about the methods used for XML DTD and RELAX NG schema generation is however provided in section |
115 | MD | . Several example ODD files are also provided as part of the standard TEI release: see further section |
126 | MDMD | modification of content models; |
135 | MDMD | Each kind of modification changes the set of documents that will be considered valid according to the resulting schema. Any combination of unchanged TEI modules may be thought of as defining a certain set of documents. Each schema resulting from a modified combination of TEI modules will define a different set of documents. The set of documents valid according to the unmodified schema may or may not be properly contained in the set of documents considered to be valid according to the modified schema. We use the term |
137 | MDMD | to describe a modification which regards as valid a subset of the documents considered valid by the same combination of TEI modules unmodified. Alternatively, the set of documents considered valid by the original schema might be disjoint from the set of documents considered valid by the modified schema, with neither being properly contained by the other. Modifications that have this result are called |
141 | MDMD | Cleanliness can only be assessed with reference to elements in the TEI namespace. |
145 | MDMDSU | The simplest way to modify the supplied modules is to suppress one or more of the supplied elements. This is simply done by setting the |
153 | MDMDSU | For example, if the |
158 | MDMDSU | attribute here supplies the canonical name of the element to be deleted, the |
162 | MDMDSU | attribute specifies what is to be done with it. Note that the module name must be supplied explicitly, and that the schema specification in which this declaration appears must also contain a reference to the module itself. The full specification for a schema in which this modification is applied would thus be something like the following: |
169 | MDMDSU | In most cases, deletion is a clean modification, since most elements are optional. Documents that are valid with respect to the modified schema are also valid according to the unmodified schema. To say this another way, the set of documents matching the new schema is contained by the set of documents matching the original schema. |
171 | MDMDSU | There are however some elements in the TEI scheme which have mandatory children; for example, the element |
185 | MDMDSU | In general, whenever the element deleted by a modification is mandatory within the content model of some other (undeleted) element, the result is an unclean modification, and may also break the TEI Abstract Model ( |
186 | MDMDSU | ). However, the parent of a mandatory child can be safely removed if it is itself optional. |
188 | MDMDSU | To determine whether or not an element is mandatory in a given context, the user must inspect the content model of the element concerned. In most cases, content models are expressed in terms of model classes rather than elements; hence, removing an element will generally be a clean modification, since there will generally be other members of the class available. If a class is completely depopulated by a modification, then the cleanliness of the modification will depend upon whether the class reference is mandatory or optional, in the same way as for an individual element. |
193 | MDMDNM | Every element and other named markup construct in the TEI scheme has a |
194 | MDMDNM | canonical name |
195 | MDMDNM | , usually in the English language: this name is supplied as the value of the |
205 | MDMDNM | used to define it. The element or attribute declaration used within a schema generated from that specification may however be different, thus permitting schemas to be written using elements with generic identifiers from a different language, or otherwise modified. There may be many alternative identifiers for the same markup construct, and an ODD processor may choose which of them to use for a given purpose. Each such alternative name is supplied by means of an |
220 | MDMDNM | now takes the value |
221 | MDMDNM | change |
222 | MDMDNM | to indicate that those parts of the element specification not supplied are to be inherited from the standard definition. The content of the |
224 | MDMDNM | element will be used in place of the canonical |
226 | MDMDNM | value in the schema generated. |
230 | MDMDNM | modification. Although it is an inherently unclean modification (because the set of documents matched by the resulting schema is disjoint from the set matched by its unmodified equivalent), the process of converting any document in which elements have been renamed into an exactly equivalent document using canonical names is completely deterministic, requiring only access to the ODD in which the renaming has been specified. This assumes that the renamed elements used are not placed in the TEI namespace but use either a null namespace or some user-defined namespace, as further discussed in |
231 | MDMDNM | ; if this is not the case, care must be taken to avoid name collision between the new name and all existing TEI names. Furthermore, unclean modifications which do not specify a namespace are not conformant (see further |
234 | MDMDNM | The TEI provides a systematic set of renamings into languages other than English. These all use a language-specific namespace. |
239 | MDMDCM | The content model for an element in the TEI scheme is defined by means of a |
243 | MDMDCM | which specifies it. As shown elsewhere in these Guidelines, the content model is defined using RELAX NG syntax, whether the resulting schema is expressed in RELAX NG or in some other schema language. |
254 | MDMDCM | This indicates that the content model contains declarations taken from the RELAX NG namespace, and that it consists of a reference to a pattern called |
256 | MDMDCM | . Further examination shows that this pattern in turn expands to an optional repeatable alternation of text ( |
258 | MDMDCM | ) with references to three other classes ( |
264 | MDMDCM | ). For some particular application it might be preferable to insist that |
276 | MDMDCM | This is a clean modification which does not change the meaning of a TEI element; there is therefore no need to assign the element to some other namespace than that of the TEI, though it may be considered good practice; see further |
279 | MDMDCM | A change of this kind, which simplifies the possible content of an element by reducing its model to one of its existing components, is always clean, because the set of documents matched by the resulting schema is a subset of the set of documents which would have been matched by the unmodified schema. |
281 | MDMDCM | Note that content models are generally defined (as far as possible) in terms of references to model classes, rather than to explicit elements. This means that the need to modify content models is greatly reduced: if an element is deleted or modified, for example, then the deletion or modification will be available for every content model which references that element via its class, as well as those which reference it explicitly. For this reason it is not (in general) good practice to replace class references by explicit element references, since this may have unintended side effects. |
283 | MDMDCM | An unqualified reference to an element class within a content model generates a content model which is equivalent to an alternation of all the members of the class referenced. Thus, a content model which refers to the model class |
285 | MDMDCM | will generate a content model in which any one of the members of that class is equally acceptable. It is also possible to reference predefined content model fragments based on classes, such as |
288 | MDMDCM | a sequence containing no more than one of each member of the class |
292 | MDMDCM | Content model changes which are not simple restrictions on an existing model should be undertaken with caution. The set of documents matching the schema which results from such changes is likely to be disjoint from the set of documents matching the unmodified schema, and such changes are therefore regarded as unclean. When content models are changed or extended, care should be taken to respect the existing semantics of the element concerned as stated in the Guidelines. For example, the element |
294 | MDMDCM | is defined as containing a line of verse. It would not therefore make sense to redefine its content model so that it could also include members of the class |
296 | MDMDCM | : such a modification, although syntactically feasible, would not be regarded as TEI-conformant because it breaks the TEI Abstract Model. |
307 | MDMDAL | element. To add a new attribute to an element, the schema builder should therefore first check to see whether this attribute is already defined by some existing attribute class. If it is, then the simplest method of adding it will be to make the element in question a member of that class, as further discussed below. If this is not possible, then a new |
320 | MDMDAL | content |
331 | MDMDAL | Suppose, for example, that we wish to add two attributes to the |
345 | MDMDAL | element in fact has no local attributes defined for it at all: we will therefore need to add not only an |
365 | MDMDAL | The value supplied for the |
370 | MDMDAL | add |
371 | MDMDAL | ; if this attribute already existed on the element we are modifying this should generate an error, since a specification cannot have more than one attribute of the same name. If the attribute is already present, we can replace the whole of the existing declaration by supplying |
373 | MDMDAL | as the value for |
375 | MDMDAL | ; alternatively, we can change some parts of an existing declaration only by supplying just the new parts, and setting |
376 | MDMDAL | change |
377 | MDMDAL | as the value for |
381 | MDMDAL | Because the new attribute is not defined by the TEI, we must specify a namespace for it on the |
391 | MDMDAL | The canonical name for the new attribute is |
393 | MDMDAL | , and is supplied on the |
397 | MDMDAL | element. In this simple example, we supply only a description and datatype for the new attribute; the former is given by the |
402 | MDMDAL | ). The content of the |
406 | MDMDAL | element uses patterns from the RELAX NG namespace, in this case to select one of the predefined TEI datatypes ( |
409 | MDMDAL | It is often desirable to constrain the possible values for an attribute to a greater extent than is possible by simply supplying a TEI datatype for it. This facility is provided by the |
413 | MDMDAL | element. Suppose for example that, rather than supplying them as pointers to a bibliography, all that we wish to indicate about the source of our examples is that each comes from one of three predefined sources, which we call A, B, and C. A declaration like the following might be appropriate: |
442 | MDMDAL | supplied as part of any attribute in the TEI scheme. |
444 | MDMDAL | Depending on the modification, the set of documents matched by a schema generated from an ODD modified in this way may or may not be a subset of the set of documents matched by the unmodified schema. As such, it is difficult to tell in principle whether such modifications are intrinsically unclean. |
449 | MDMDCL | The concept of element classes was introduced in |
450 | MDMDCL | ; an understanding of it is fundamental to successful use of the TEI scheme. As noted there, we distinguish |
451 | MDMDCL | model classes |
453 | MDMDCL | attribute classes |
454 | MDMDCL | , the members of which simply share a set of attributes. |
458 | MDMDCL | . All classes to which the element belongs must be specified within this, using a |
462 | MDMDCL | To add an element to a class in which it is not already a member, all that is needed is to supply a new |
466 | MDMDCL | element for the element concerned. For example, to add an element to the |
477 | MDMDCL | element is set to |
478 | MDMDCL | change |
479 | MDMDCL | (rather than its default value of |
483 | MDMDCL | element retains its membership of the two classes ( |
493 | MDMDCL | defined in the core module is a member of two attribute classes, |
510 | MDMDCL | If the intention is to change the class membership of an element completely, rather than simply to add it to or remove it from one or more classes, the value of |
514 | MDMDCL | can be set to |
516 | MDMDCL | (which is the default if no value is specified), indicating that the memberships indicated by its child |
531 | MDMDCL | attribute is set to |
532 | MDMDCL | change |
537 | MDMDCL | To change or remove attributes inherited from an attribute class for all members of the class (as opposed to specific members of that class), it is also possible to modify the class specification itself. For example, the class |
561 | MDMDCL | defining the attributes inherited through membership of this class has the value |
562 | MDMDCL | change |
567 | MDMDCL | The classes used in the TEI scheme are further discussed in chapter |
568 | MDMDCL | . Note in particular that classes are themselves classified: the attributes inherited by a member of attribute class A may come to it directly from that class, or from another class of which A is itself a member. For example, the class |
570 | MDMDCL | is itself a member of the classes |
574 | MDMDCL | . By default, these two classes are predefined as empty. However, if (for example) the |
576 | MDMDCL | module is included in a schema, a number of attributes ( |
584 | MDMDCL | will then inherit these new attributes (see further section |
593 | MDMDCL | Such global changes should be undertaken with caution: in general, removing existing non-mandatory attributes from a class will always be a clean modification, in the same way as removing non-mandatory elements. Adding a new attribute to a class, however, can be a clean modification only if the new attribute is labelled as belonging to some namespace other than the TEI. |
595 | MDMDCL | The same mechanisms are available for modification of model classes. Care should be taken when modifying the model class membership of existing elements since model class membership is what determines the content model of most elements in the TEI scheme, and a small change may have unintended consequences. |
600 | MDMDNE | Adding a completely new element to a schema involves providing a complete element specification for it, the |
602 | MDMDNE | element of which includes a reference to at least one TEI model class. Without such a reference, the new element will not be referenced by the content model of any other TEI element, and will therefore be inaccessible within a TEI document. |
612 | MDMDNE | . To add a fourth member (say |
622 | MDMDNE | The other parts of this declaration will typically include a description for the new element and information about its content model, its attributes, etc., as further described in |
629 | MDNS | All the elements defined by the TEI scheme are labelled as belonging to a single |
630 | MDNS | namespace |
631 | MDNS | , maintained by the TEI and with the URI |
636 | MDNS | used to represent TEI examples has its own namespace, |
639 | MDNS | Only elements which are unmodified or which have undergone a clean modification may use this namespace. In a TEI-conformant document, it is assumed that all attributes not explicitly labelled with a namespace (such as, for example |
641 | MDNS | ) also belong to the TEI namespace, and are defined by the TEI. |
643 | MDNS | This implies that any other modification (including a renaming or reversible modification) must either specify a different namespace or specify no namespace at all. The |
653 | MDNS | Suppose, for example, that we wish to add a new attribute |
655 | MDNS | to the existing TEI element |
657 | MDNS | . In the absence of namespace considerations, this would be an unclean modification, since |
659 | MDNS | does not currently have such an attribute. The most appropriate action is to explicitly attach the new attribute to a new namespace by a declaration such as the following: |
678 | MDNS | is explicitly labelled as belonging to something other than the TEI namespace, we regard the modification which introduced it as clean. A namespace-aware processor will be able to validate those elements in the TEI namespace against the unmodified schema. |
679 | MDNS | Full namespace support does not exist in the DTD language, and therefore these techniques are available only to users of more modern schema languages such as RELAX NG or W3C Schema. |
681 | MDNS | Similar considerations apply when modification is made to the content model or some other aspect of an element, or when a new element is declared. Clean modification requires that all such changes be explicitly labelled as belonging to some non-TEI namespace or to no namespace at all. |
685 | MDNS | attribute is supplied on a |
687 | MDNS | element, it identifies the namespace applicable to all components of the schema being specified. Even if such a schema includes unmodified modules from the TEI namespace, the elements contained by such modules will now be regarded as belonging to the namespace specified on the |
689 | MDNS | . This can be useful if it is desired simply to avoid namespace processing. For example, the following schema specification results in a schema called |
691 | MDNS | which has no namespace, even though it comprises declarations from the TEI |
698 | MDNS | In addition to the TEI canonical namespace mentioned above, the TEI may also define namespaces for approved translations of the TEI scheme into other languages. These may be used as appropriate to indicate that a customization uses a standardized set of renamings. The namespace for such translations is the same as that for the canonical namespace, suffixed by the appropriate ISO language identifier ( |
699 | MDNS | ). A schema specification using the Chinese translation, for example, would use the namespace |
705 | MDDO | The elements used to define a TEI customization ( |
711 | MDDO | , etc.) will typically be used within a TEI document which supplies further information about the intended use of the new schema, the meaning and application of any new or modified elements within it, and so on. This document will typically conform to a TEI (or other) schema which includes the module described in chapter |
715 | MDDO | Where the customization to be documented simply consists in a selection of modules, perhaps with some deletion of unwanted elements or attributes, the documentation need not specify anything further. Even here however it may be considered worthwhile to replace some of the semantic information provided by the unmodified TEI specification. For example, the |
717 | MDDO | element of an unmodified TEI |
732 | MDDO | elements are not required, or in which any other rule stated in these Guidelines is either not enforced or not enforceable. In fact, the mechanism, if used in an extreme way, permits replacement of all that the TEI has to say about every component of its scheme. Such revisions would result in documents that are not TEI-conformant in even the broadest sense, and it is not intended that encoders use the mechanism in this way. We discuss exactly what is meant by the concept of |
733 | MDDO | TEI conformance |
739 | MDlite | Several examples of customizations of the TEI are provided as part of the standard release. They include the following: |
743 | MDlite | The schema generated from this customization is the minimum needed for TEI Conformance. It provides only a handful of elements. |
747 | MDlite | The schema generated from this customization combines all available TEI modules, providing |
752 | MDlite | The schema generated from this customization combines all available TEI modules with three other non-TEI vocabularies, specifically MathML, SVG, and XInclude. |
756 | MDlite | It is unlikely that any project would wish to use any of these extremes unchanged. However, they form a useful starting point for customization, whether by removing modules from tei_all or tei_allPlus, or by replacing elements deleted from tei_bare. They also demonstrate how an ODD document may be constructed to provide a basic reference manual to accompany schemas generated from it. |
758 | MDlite | Shortly after publication of the first edition of these Guidelines, as a demonstration of how the TEI encoding scheme might be adopted to meet 90% of the needs of 90% of the TEI user community, the TEI editors produced a brief tutorial defining one specific |
760 | MDlite | modification of the TEI scheme, which they called TEI Lite. This tutorial and its associated DTD became very popular and are still available from the TEI web site at |
761 | MDlite | . The tutorial and associated schema specification are also included as one of the exemplars provided with TEI P5. |
763 | MDlite | The exemplars provided with TEI P5 also include a customization file from which a schema for the validation of other customization files may be generated. This ODD, called tei_odds, combines the four basic modules with the tagdocs, dictionaries, gaiji, linking, and figures modules as well as including the (non-TEI) module defining the RELAX NG language. This enables schemas derived from this customization file to validate examples contained within them in a number of ways, further described within the document. |
771 | CF | TEI Conformance |
772 | CF | is intended to assist in the description of the format and contents of a particular XML document instance or set of documents. It may be found useful in such situations as: |
780 | CF | specifying the form of documents to be produced by or for a given project. |
782 | CF | It is not intended to provide any other evaluation, for example of scholarly merit, intellectual integrity, or value for money. A document may be of major intellectual importance and yet not be TEI-conformant; a TEI-conformant document may be of no scholarly value whatsoever. |
784 | CF | In this section we explore several aspects of conformance, and in particular attempt to define how the term |
786 | CF | should be used. The terminology defined here should be considered normative: users and implementors of the TEI Guidelines should use the phrases |
791 | CF | TEI Extension |
796 | CF | if it: |
802 | CF | TEI Schema |
803 | CF | , that is, a schema derived from the TEI Guidelines ( |
806 | CF | conforms to the TEI Abstract Model ( |
810 | CF | TEI Namespace |
817 | CF | ) which refers to the TEI Guidelines |
821 | CF | A document is said to be |
823 | CF | if it is a well-formed XML document which can be transformed algorithmically and automatically into a TEI-conformant document as defined above without loss of information. Such a document may informally be described as TEI-conformant; the terms |
829 | CF | A document is said to use a |
830 | CF | TEI Extension |
831 | CF | if it is a well-formed XML document which is valid against a TEI Schema which contains additional distinctions, representing concepts not present in the TEI Abstract Model, and therefore not documented in these Guidelines. Such a document cannot, in general, be algorithmically conformant since it cannot be automatically transformed without loss of information. However, since one of the goals of the TEI is to support extensions and modifications, it should not be assumed that no TEI document can include extensions: an extension which is expressed by means of the recommended mechanisms is also a TEI document provided that those parts of it which are not extensions are TEI-conformant, or -conformable. |
833 | CF | A TEI-conformant (or -conformable) document is said to follow |
834 | CF | TEI Recommended Practice |
844 | CFWF | . Other ways of representing the concepts of the TEI Abstract Model are possible, and other representations may be considered appropriate for use in particular situations (for example, for data capture, or project-internal processing). But such alternative representations are at best |
851 | CFWF | A TEI-conformant document must use the TEI namespace, and therefore must also include an XML-conformant namespace declaration, as defined below ( |
854 | CFWF | The use of XML greatly reduces the need to consider hardware or software differences between processing environments when exchanging data. No special packing or interchange format is required for an XML document, beyond that defined by the W3C recommendations, and no special |
856 | CFWF | format is therefore proposed by these Guidelines. For discussion of encoding issues that may arise in the processing of special character sets or non-standard writing systems, see further chapter |
861 | CFWF | document, as being a well-formed document which matches a specific set of rules or syntactic constraints, defined by a |
863 | CFWF | . As noted above, TEI conformance implies that the schema used to determine validity of a given document should be derived from the present Guidelines, by means of an ODD which references and documents the schema fragments which the Guidelines define. |
870 | CFVL | documents must validate against a schema file that has been derived from the published TEI Guidelines, combined and documented in the manner described in section |
872 | CFVL | TEI Schema |
875 | CFVL | The TEI does not mandate use of any particular schema language, only that this schema |
880 | CFVL | TEI ODD file |
881 | CFVL | that references the TEI Guidelines. Currently available tools permit the expression of schemas in any or all of the XML DTD language, W3C XML Schema, and RELAX NG (both compact and XML formats). Some of what is syntactically possible using the ODD formalism cannot be represented by all schema languages; and there are some features of some schema languages which have no counterpart in ODD. No single schema language fully captures all the constraints implied by conformance to the TEI Abstract Model. A document which is valid according to a TEI schema represented using one schema language may not be valid against the same schema expressed in other languages; in particular the DTD language does not fully support namespaces. Features which cannot be represented in all schema languages are documented in chapters |
886 | CFVL | , many varieties of TEI schema are possible and not all of them are necessarily |
888 | CFVL | ; derivation from an ODD is a necessary but not a sufficient condition for TEI Conformance. |
892 | CFAM | Conformance to the TEI Abstract Model |
895 | CFAM | TEI Abstract Model |
896 | CFAM | is the conceptual schema instantiated by the TEI Guidelines. These Guidelines define, both formally and informally, a set of abstract concepts such as |
902 | CFAM | s do not contain |
904 | CFAM | s. These Guidelines also define classes of elements, which have both semantic and structural properties in common. Those semantic and structural properties are also a part of the TEI Abstract Model; the class membership of an existing TEI element cannot therefore be changed without changing the model. Elements can however be removed from a class by deletion, and new non-TEI elements within their own namespaces can be added to existing TEI classes. |
908 | CFAMsc | It is an important condition of TEI conformance that elements defined in the TEI Guidelines as having one specific meaning should not be used with another. For example, the element |
910 | CFAMsc | is defined in the TEI Guidelines as containing a line of verse. A schema in which it is redefined to mean a typographic line, or an ordered queue of objects of some kind, cannot therefore be TEI-conformant, whatever its other properties. |
912 | CFAMsc | The semantics of elements defined in the TEI Guidelines are conveyed in a number of ways, ranging from formally verifiable datatypes to informal descriptive prose. In addition, a mapping between TEI elements and concepts in other conceptual models may be provided by the |
916 | CFAMsc | A schema which shares equivalent concepts to those of the TEI conceptual model may be mappable to the TEI Schema by means of such a mechanism. For example, the concept of paragraph expressed in the TEI scheme by the |
920 | CFAMsc | element. In this respect (though not in others) a DocBook-conformant document might therefore be considered to be TEI-conformable. Such areas of overlap facilitate interoperability, because elements from one namespace may be readily integrated with those from another, but do not affect the definition of conformance. |
922 | CFAMsc | A document is said to conform to the |
923 | CFAMsc | TEI Abstract Model |
924 | CFAMsc | if features for which an encoding is proposed by the TEI Guidelines are encoded within it using the markup and other syntactic properties defined by means of a valid |
926 | CFAMsc | schema. Hence, even though the names of elements or attributes may vary, a TEI-conformant document must respect the TEI Semantic Model, and be valid with respect to a TEI-conformant Schema. Although it may be possible to transform a document which follows the |
927 | CFAMsc | TEI Abstract Model |
934 | CFAMmc | Mandatory Components of a TEI Document |
958 | CFAMmc | in the case of a corpus or collection, a single overall |
960 | CFAMmc | element followed by a series of |
973 | CFAMmc | This should include the title of the TEI document expressed using a |
979 | CFAMmc | This should include the place and date of publication or distribution of the TEI document, expressed using the |
994 | CFNS | TEI Namespace |
997 | CFNS | ) provides a way for an XML document to combine markup from different vocabularies without risking name collision and consequent processing difficulties. While the scope of the TEI is large, there are many areas in which it makes no particular recommendation, or where it recommends that other defined markup schemes should be adopted, such as graphics or mathematics. It is also considered desirable that users of other markup schemes should be able to integrate documents using TEI markup with their own system. To meet these objectives without compromising the reliability of its encoding, a TEI-conformant document is required to make appropriate use of the TEI namespace. |
999 | CFNS | Essentially all elements in a TEI Schema which represent concepts from the TEI Abstract Model belong to the TEI namespace, |
1001 | CFNS | , maintained by the TEI. A TEI-conformant document is required to declare the namespace for all the elements it contains whether these come from the TEI namespace or from other schemes. |
1003 | CFNS | A TEI Schema may be created which assigns TEI elements to some other namespace, or to no namespace at all. A document using such a schema must be regarded as a TEI extension and cannot be considered TEI-conformant, though it may be TEI-conformable. A document which places non-TEI elements or attributes within the TEI namespace cannot be TEI-conformant; such practices are strongly deprecated as they may lead to serious difficulties for processing or interchange. |
1010 | CFOD | above, a TEI Schema can only be generated from a TEI ODD, which also serves to document the semantics of the elements defined by it. A TEI-conformant document should therefore always be accompanied by (or refer to) a valid |
1011 | CFOD | TEI ODD file |
1012 | CFOD | specifying which modules, elements, classes, etc. are in use together with any modifications or renamings applied, and from which a TEI Schema can be generated to validate the document. The TEI supplies a number of predefined |
1013 | CFOD | TEI Customization exemplar ODD files |
1015 | CFOD | ), but most projects will typically need to customize the TEI beyond what these examples provide. It is assumed, for example, that most projects will customize the TEI scheme by removing those elements that are not needed for the texts they are encoding, and by providing further constraints on the attribute values and element content models the TEI provides. All such customizations must be specified by means of a valid |
1016 | CFOD | TEI ODD |
1019 | CFOD | As different sorts of customization have different implications for the interchange and interoperability of TEI documents, it cannot be assumed that every customization will necessarily result in a schema that validates only TEI-conformant documents. The ODD language permits modifications which conflict with the TEI Abstract Model, even though observing this model is a requirement for TEI Conformance. The ODD language can in fact be used to describe many kinds of markup scheme, including schemes which have nothing to do with the TEI at all. |
1021 | CFOD | Equally, it is possible to construct a TEI Schema which is identical to that derived from a given TEI ODD file without using the ODD scheme. A schema can be constructed simply by combining the predefined schema language fragments corresponding with the required set of TEI modules and other statements in the relevant schema language. The status of such a schema with respect to the |
1023 | CFOD | schema cannot however be determined, in general; it may therefore be impossible to determine whether such a schema represents a clean modification or an extension. This is one reason for making the presence of a TEI ODD file a requirement for conformance. |
1027 | CFCATSCH | Varieties of TEI Conformance |
1031 | CFCATSCH | Is it a valid XML document, for which a TEI Schema exists? If not, then the document cannot be considered TEI-conformant in any sense. |
1033 | CFCATSCH | Is the document accompanied by a TEI-conformant ODD specification describing its markup scheme and intended semantics? If not, then the document can only be considered TEI-conformant if it validates against a predefined TEI Schema and conforms to the TEI Abstract Model. |
1035 | CFCATSCH | Does the markup in the document correctly represent the TEI Abstract Model? Though difficult to assess, this is essential to TEI conformance. |
1037 | CFCATSCH | Does the document claim that all of its elements come from some namespace other than the TEI (or no namespace)? If so, the document cannot be TEI-conformant. |
1039 | CFCATSCH | If the document claims to use the TEI namespace, in part or wholly, do the elements associated with that namespace in fact belong to it? If not, the document cannot be TEI-conformant; if so, and if all non-TEI elements and attributes are correctly associated with other namespaces, then the document may be TEI-conformant. |
1041 | CFCATSCH | Is the document valid according to a schema made by combining all TEI modules as well as valid according to the schema derived from its associated ODD specification? If so, the document is TEI-conformant. |
1045 | CFCATSCH | ? If so, the document uses a TEI extension. |
1049 | CFCATSCH | , using only information supplied in the accompanying ODD and without loss of information? If so, the document is TEI-conformable. |
1075 | tab-conformance | Conforms to TEI Abstract Model |
1135 | tab-conformance | Uses TEI and other namespaces correctly |
1176 | tab-conformance | Document can be converted automatically to a form which is valid as a subset of |
1200 | CFCATSCH | The document in column A is TEI-conformant. Its tagging follows the TEI Abstract Model, both as regards syntactic constraints (its |
1206 | CFCATSCH | elements appear to contain verse lines rather than typographic ones). It is accompanied by a valid ODD which documents exactly how it uses the TEI. All the TEI-defined elements and attributes in the document are placed in the TEI namespace. The schema against which it is valid is a |
1212 | CFCATSCH | The document in column B is not a TEI document. Although it is accompanied by a valid TEI ODD, the resulting schema includes some |
1214 | CFCATSCH | modifications, and represents some concepts from the TEI Abstract Model using non-TEI elements; for example, it re-defines the content model of |
1220 | CFCATSCH | which appears to have the same meaning as the existing TEI |
1222 | CFCATSCH | element, but the equivalence is not made explicit in the ODD. It uses the TEI namespace correctly to identify the TEI elements it contains, but the ODD does not contain enough information to convert its non-TEI elements into TEI equivalents automatically. |
1224 | CFCATSCH | The document in column C is TEI-conformable. It is almost the same as the document in column A, except that the names of the elements used are not those specified by the TEI namespace. Because the ODD accompanying it contains an exact mapping for each element name (using the |
1226 | CFCATSCH | element) and there are no name conflicts, it is possible to make an automatic conversion of this document. |
1228 | CFCATSCH | The document in column D is a TEI Extension. It combines elements from its own namespace with unmodified TEI elements in the TEI namespace. Its usage of TEI elements conforms to the TEI Abstract Model. Its ODD defines a new |
1230 | CFCATSCH | element which has no exact TEI equivalent, but which is assigned to an existing TEI class; consequently its schema is not a clean subset of |
1232 | CFCATSCH | . If the associated ODD provided a way of mapping this element to an existing TEI element, then this would be TEI-conformable. |
1234 | CFCATSCH | The document in column E is superficially similar to document D, but because it does not use any namespace declarations (or, equivalently, it assigns unmodified TEI elements to its own namespace), it may contain name collisions; there is no way of knowing whether a |
1238 | CFCATSCH | or has some other meaning. The accompanying ODD file may be used to provide the human reader with information about equivalently named elements in the TEI namespace, and hence to determine whether the document is valid with respect to the TEI Abstract Model, but this is not an automatable process. In particular, cases of apparent conflict (for example, use of an element |
1240 | CFCATSCH | to represent a concept not in the TEI Abstract Model but in the abstract model of some other system, whose namespace has been removed as well) cannot be reliably resolved. By our current definition, therefore, this is not a TEI document. |
1244 | CFCATSCH | which is used in this document is a specialization of an existing TEI element, and the ODD in which it is defined specifies the mapping (a |
1252 | CFCATSCH | ; if it does not, this would also be a case of TEI Extension. |
1254 | CFCATSCH | The document in column G is not a TEI document. Its structure is fully documented by a valid TEI ODD, but it does not claim to represent the TEI Abstract Model, does not use the TEI namespace, and is not intended to validate against any TEI schema. |
1256 | CFCATSCH | The document in column H is very like that in column A, but it lacks an accompanying ODD. Instead, the schema used to validate it is produced simply by combining TEI schema fragments in the same way as an ODD processor would, given the ODD. If the resulting schema is a clean subset of |
1258 | CFCATSCH | , such a document is indistinguishable from a TEI-conformant one, but there is no way of determining (without inspection) whether this is the case if any modification or extension has been applied. Its status is therefore, like that of Text E, impossible to determine. |
1268 | IM | The specifications in this section are illustrative but not normative. Their function is to further illustrate the intended scope and application of the elements documented in chapter |
1269 | IM | , since it is believed that these may have application beyond the areas directly addressed by the TEI. |
1271 | IM | An ODD processing system has to accomplish two main tasks. A set of selections, deletions, changes, and additions supplied by an ODD customization (as described in |
1272 | IM | ) must first be merged with the published TEI P5 ODD specifications. Next, the resulting unified ODD must be processed to produce the desired outputs. |
1274 | IM | An ODD processor is not required to do these two stages in sequence, but that may well be the simplest approach; the ODD processing tools currently provided by the TEI Consortium, which are also used to process the source of these Guidelines, adopt this approach. |
1288 | IM-unified | attribute. This provides a name for the generated schema, which other components of the processing system may use to refer to the schema being generated, e.g. in issuing error messages or as part of the generated output schema file or files. The |
1290 | IM-unified | attribute may be used to specify the default namespace within which elements valid against the resulting schema belong, as discussed in |
1295 | IM-unified | element contains an unordered series of specialized elements, each of which is of one of the following four types: |
1301 | IM-unified | (by default |
1315 | IM-unified | add |
1317 | IM-unified | If the value of |
1320 | IM-unified | add |
1321 | IM-unified | , then the object is simply copied to the output, but if it is |
1322 | IM-unified | change |
1327 | IM-unified | , then it will be looked at by other parts of the process. |
1336 | IM-unified | element, in turn, groups together a set of ODD specifications (among other things, including further |
1360 | IM-unified | references to TEI Modules |
1365 | IM-unified | attributes refer to components of the TEI. The value of the |
1371 | IM-unified | element defining a TEI module. The |
1373 | IM-unified | must be dereferenced by some means, such as reading an XML file with the TEI ODD specification (either from the local hard drive or off the Web), or looking up the reference in an XML database (again, locally or remotely); whatever means is used, it should return a stream of XML containing the element, class, and macro specifications collected together in the specified module. These specification elements are then processed in the same way as if they had been supplied directly within the |
1383 | IM-unified | attribute; the content of such modules, which must be available in the RELAX NG XML syntax, is passed directly and without modification to the output schema when that is created. |
1387 | IM-unified | Each object obtained from the TEI ODD specification using |
1395 | IM-unified | if there is an object in the ODD customization with the same value for the |
1399 | IM-unified | value of |
1401 | IM-unified | , then the object from the module is ignored; |
1403 | IM-unified | if there is an object in the ODD customization with the same value for the |
1407 | IM-unified | value of |
1409 | IM-unified | , then the object from the module is ignored, and the one from the ODD customization is used in its place; |
1411 | IM-unified | if there is an object in the ODD customization with the same value for the |
1415 | IM-unified | value of |
1416 | IM-unified | change |
1417 | IM-unified | , then the two objects must be merged, as described below; |
1419 | IM-unified | if there is an object in the ODD customization with the same value for the |
1423 | IM-unified | value of |
1424 | IM-unified | add |
1425 | IM-unified | , then an error condition should be raised; |
1441 | IM-unified | elements). If such a component is found in the ODD customization, it will be copied to the output; if it is not found there, but is present in the TEI ODD specification, then that will be copied to the output. |
1447 | IM-unified | , for example); these are always copied to the output, and their children are then processed following the rules given in this list. |
1481 | IM-unified | elements. These should be copied from both the TEI ODD specification and the ODD customization, and all occurrences included in the output. |
1522 | IM-unified | This means that when |
1523 | IM-unified | memberOf key="att.typed"/ |
1524 | IM-unified | is processed, that class is looked up, each attribute which it defines is examined in turn, and the customization is searched for an override. If the modification is of the attribute class itself, work proceeds as usual; if, however, the modification is at the element level, the class reference is deleted and a series of |
1526 | IM-unified | elements is added to the element, one for each attribute inherited from the class. Since attribute classes can themselves be members of other attribute classes, membership must be followed recursively. |
1542 | IM-unified | to provide an alternate description in another language. Nothing prevents the user from supplying |
1554 | IM-unified | In the processing of the content models of elements and the content of macros, deleted elements may require special attention. |
1555 | IM-unified | The carthago program behind the Pizza Chef application, written by Michael Sperberg-McQueen for TEI P3 and P4, went to very great efforts to get this right. The XSLT transformations used by the P5 Roma application are not as sophisticated, partly because the RELAX NG language is more forgiving than DTDs. |
1556 | IM-unified | A content model like this: |
1575 | IM-unified | requires no special treatment because everything is expressed in terms of model classes; if deletions result in |
1577 | IM-unified | having no members, then |
1581 | IM-unified | . An ODD processor may or may not elect to simplify the resulting choice between nothing and |
1585 | IM-unified | element. However, such simplification may be considerably more complex in the general case (if for example the |
1591 | IM-unified | ), and an ODD processor is therefore likely to be more successful in carrying out such simplification as a distinct stage during processing of ODD sources. |
1614 | IM-unified | Note that deletion of required elements will cause the schema specification to accept as valid documents which cannot be TEI-conformant, since they no longer conform to the TEI Abstract Model; conformance topics are addressed in more detail in |
1622 | IM-unified | which contains a complete and internally consistent set of element, class, and macro specifications, possibly also including |
1632 | IMGS | Assuming that any modifications have been resolved, as outlined in the previous section, making a schema is now a four-stage process: |
1634 | IMGS | all datatype and other macro specifications must be collected together and declared at the start of the output schema; |
1636 | IMGS | all classes must be declared in the right order (since some classes reference others, the order is significant); |
1646 | IMGS | Working in this order gives the best chance of successfully supporting all the schema languages. However, there are a number of obstacles to overcome along the way. |
1648 | IMGS | An ODD processor may use any desired schema language or languages for its schema output. The TEI ODD specification uses RELAX NG to express content models, and is therefore biased towards this language. However, the current TEI ODD processing system is capable of producing schema output in the three main schema languages, as follows: |
1650 | IMGS | A RELAX NG (XML) schema is generated by creating wrappers around the content models taken directly from the ODD specification; a version re-expressed in the RELAX NG compact syntax is generated using James Clark's |
1654 | IMGS | A DTD schema is generated by converting the RELAX NG content models to DTD language, often simplifying them to allow for the less-sophisticated output language. |
1656 | IMGS | A W3C Schema schema is created by generating a RELAX NG schema and then using James Clark's |
1666 | IMGS | Secondly, it is possible to create two rather different styles of schema. On the one hand, the schema can try to maintain all the flexibility of ODD by using the facilities of the schema language for parameterization; on the other, it can remove all customization features and produce a flat result which is not suitable for further manipulation. The TEI project currently generates both styles of schema; the first as a set of schema fragments in DTD and RELAX NG languages, which can be included as modules in other schemas, and customized further; the second as the output from a processor such as Roma, in which many of the parameterization features have been removed. |
1702 | IMGS | performance = element performance { (model.divTop | model.global)*, (model.common, model.global*)+, (model.divBottom, model.global*)*, att.global.attribute.xmlspace, att.global.attribute.xmlid, att.global.attribute.n, att.global.attribute.xmllang, att.global.attribute.rend, att.global.attribute.xmlbase, att.global.linking.attribute.corresp, att.global.linking.attribute.synch, att.global.linking.attribute.sameAs, att.global.linking.attribute.copyOf, att.global.linking.attribute.next, att.global.linking.attribute.prev, att.global.linking.attribute.exclude, att.global.linking.attribute.select } |
1705 | IMGS | ) would have no effect, since references to such classes have been expanded to reference their constituent attributes. |
1708 | IMGS | performance = element performance { performance.content, performance.attributes } performance.content = (model.divTop | model.global)*, (model.common, model.global*)+, (model.divBottom, model.global*)* performance.attributes = att.global.attributes, empty |
1711 | IMGS | is provided via an explicit reference ( |
1713 | IMGS | ), and can therefore be redefined. Moreover, the attributes are separated from the content model, allowing either to be overridden. |
1719 | IMGS | are used to distinguish the two schema types. An ODD processor is not required to support both, though the simple schema output is generally preferable for most applications. |
1744 | IMGS | class. What happens if |
1762 | IMGS | it is impossible to be sure which rule is being used. This situation is not detected when RELAX NG is used, since the language is able to cope with non-deterministic content models of this kind and does not require that only a single rule be used. |
1764 | IMGS | Finally, an application will need to have some method of associating the schema with document instances that use it. The TEI does not mandate any particular method of doing this, since different schema languages and processors vary considerably in their requirements. ODD processors may wish to build in support for some of these methods. The TEI does, however, suggest that those which are already part of XML (the DOCTYPE declaration for DTDs) and W3C Schema (the |
1770 | IMGS | attribute to be valid when a document is validated against either a DTD or a RELAX NG schema, ODD processors may wish to add declarations for this attribute and its namespace to the root element, even though these are not part of the TEI |
1771 | IMGS | per se |
1774 | IMGS | to the list of attributes on the root element, which permits the non-namespace-aware DTD language to recognize the |
1776 | IMGS | notation. For RELAX NG, the namespace and attribute would be declared in the usual way: |
1777 | IMGS | namespace xsi = "http://www.w3.org/2001/XMLSchema-instance" |
1779 | IMGS | attribute xsi:schemaLocation { list { data.namespace, data.pointer }+ } |
1780 | IMGS | inside the root element declaration. |
1784 | IMGS | attribute in a W3C Schema schema is not permitted. Therefore, if W3C Schemas are being generated by converting the RELAX NG schema (for example, with |
1798 | IM-naming | If a RELAX NG pattern or DTD parameter entity is being created, its name is the value of the corresponding |
1800 | IM-naming | attribute, prefixed by the value of any |
1804 | IM-naming | . This allows for elements from an external schema to be mixed in without risk of name clashes, since all TEI elements can be given a distinctive prefix such as |
1814 | IM-naming | tei_sp = element sp { ... } |
1817 | IM-naming | If an element or attribute is being created, its default name is the value of the |
1819 | IM-naming | attribute, but if there is an |
1821 | IM-naming | child, its content is used instead. |
1827 | IM-naming | should be copied into the generated schema. If there is only one occurrence of either of these elements, it should be used regardless, but if there are several, local processing rules will need to be applied. For example, if there are several with different values of |
1829 | IM-naming | , a locale indication in the processing environment might be used to decide which to use. For example, |
1843 | IM-naming | might generate a RELAX NG schema fragment like the following, if the locale is determined to be French: |
1844 | IM-naming | head = ## en-tête element head { head.content, head.attributes } |
1847 | IM-naming | Alternatively, a selection might be made on the basis of the value of the |
1853 | IM-naming | In addition, there are three conventions about naming patterns relating to classes; ODD processors need not follow them, but those reading the schemas generated by the TEI project will find it necessary to understand them: |
1855 | IM-naming | when a pattern for an attribute class is created, it is named after the attribute class identifier (as above) suffixed by |
1861 | IM-naming | when a pattern for an attribute is created, it is named after the attribute class identifier (as above) suffixed by |
1863 | IM-naming | and then the identifier of the attribute (e.g. |
1868 | IM-naming | when a parameterized schema is created, each element generates patterns for its attributes and its contents separately, suffixing respectively |
1890 | IMRN | element defining which elements can occur as the root of a document. The ODD |
1896 | IMRN | . A pattern normally corresponds to an element name, but if a prefix (see above, |
1897 | IMRN | ) is supplied for an element, the pattern consists of the prefix name with the element name. |
1902 | IMMA | An ODD macro generates a corresponding RELAX NG pattern simply by copying the body of the |
1930 | IMMA | Although some versions of these Guidelines show the RELAX NG output in the compact syntax, both the content of the |
1932 | IMMA | element and the unified ODD specification generated by the TEI ODD processing software always store RELAX NG in the more verbose XML syntax. However, the two formats are interchangeable. |
1952 | IMCL | if the elements |
1958 | IMCL | are included. Depending on the value of the |
1962 | IMCL | , it may also generate a set of sequences as well as alternation patterns. Thus we may also generate the |
2010 | IMCL | where the pattern name is created by appending an underscore and the name of the generation sequence to the class name. |
2012 | IMCL | Attribute classes work by producing a pattern containing definitions of the appropriate attributes. So |
2063 | IMCL | Since the processor may have expanded the attribute classes already, separate patterns are generated for each attribute in the class as well as one for the class itself. This allows an element to refer directly to a member of a class. Notice that the |
2065 | IMCL | element is used to add an |
2073 | IMCL | Naturally, this behaviour is not mandatory; other ODD processors may create documentation in other ways, or ignore those parts of the ODD specifications when creating schemas. |
2084 | IMCL | attribute in the namespace |
2088 | IMCL | . The body of the attribute is taken from the |
2094 | IMCL | value of |
2096 | IMCL | . In that case an |
2146 | IMCL | namespace to provide default values and documentation. |
2156 | IMEL | pattern by which other elements can refer to it, and then it must generate an |
2158 | IMEL | with the content model and attributes. It may be convenient to make two separate patterns, one for the element's attributes and one for its content model. |
2160 | IMEL | The content model is created simply by copying the body of the |
2171 | IM-makeDTD | . A DTD may not refer to an entity which has not yet been declared. Since both macros and classes generate DTD parameter entities, the TEI Guidelines are constructed so that they can be declared in the right order. A processor must therefore work in the following order: |
2173 | IM-makeDTD | declare all model classes which have a |
2175 | IM-makeDTD | value of |
2180 | IM-makeDTD | value of |
2183 | IM-makeDTD | declare all other classes |
2209 | IM-makeDTD | <!ENTITY % faith 'INCLUDE' > <![ %faith; [ <!--doc:specifies the faith, religion, or belief set of a person. --> <!ELEMENT %n.faith; %om.RR; %macro.phraseSeq;> <!ATTLIST %n.faith; xmlns CDATA "http://www.tei-c.org/ns/1.0"> <!ATTLIST %n.faith; %att.global.attributes; %att.editLike.attributes; %att.datable.attributes; > ]]> |
2211 | IM-makeDTD | ), the element name is parameterized (see |
2216 | IM-makeDTD | . Note the additional attribute which provides a default |
2218 | IM-makeDTD | declaration for the element; the effect of this is that if the document is processed by a DTD-aware XML processor, the namespace declaration will be present automatically without the document author even being aware of it. |
2220 | IM-makeDTD | A simpler rendition for a flattened DTD generated from a customization will result in the following, with no containing marked section, and no parameterized name: |
2221 | IM-makeDTD | <!ELEMENT faith %macro.phraseSeq;> <!ATTLIST faith xmlns CDATA "http://www.tei-c.org/ns/1.0"> <!ATTLIST faith %att.global.attribute.xmlspace; %att.global.attribute.xmlid; %att.global.attribute.n; %att.global.attribute.xmllang; %att.global.attribute.rend; %att.global.attribute.xmlbase; %att.global.linking.attribute.corresp; %att.global.linking.attribute.synch; %att.global.linking.attribute.sameAs; %att.global.linking.attribute.copyOf; %att.global.linking.attribute.next; %att.global.linking.attribute.prev; %att.global.linking.attribute.exclude; %att.global.linking.attribute.select; %att.editLike.attribute.cert; %att.editLike.attribute.resp; %att.editLike.attribute.evidence; %att.datable.w3c.attribute.period; %att.datable.w3c.attribute.when; %att.datable.w3c.attribute.notBefore; %att.datable.w3c.attribute.notAfter; %att.datable.w3c.attribute.from; %att.datable.w3c.attribute.to;> |
2222 | IM-makeDTD | Here the attributes from classes have been expanded into individual entity references. |
2241 | IMGD | The generated documentation may be of two forms. On the one hand, we may document the customization itself, that is, only those elements (etc.) which differ in their specification from that provided by the TEI reference documentation. Alternatively, we may generate reference documentation for the complete subset of the TEI which results from applying the customization. The TEI Roma tools take the latter approach, and operate on the result of the first stage processing described in |
2252 | IMGD | for each element, by tracing which other elements have them as possible members of their content models. |
2270 | STPE | Using TEI Parameterized Schema Fragments |
2272 | STPE | The TEI parameterized DTD and RELAX NG fragments make use of parameter entities and patterns for several purposes. In this section we describe their interface for the user. In general we recommend use of ODD instead of this technique. |
2276 | STPED | Special-purpose parameter entities are used to specify which modules are to be combined into a TEI DTD. They take the form |
2280 | STPED | is the name of the module as given in table |
2286 | STPED | . All such parameter entities are declared by default with the value |
2288 | STPED | : to select a module, therefore, the encoder declares the appropriate parameter entities with the value |
2292 | STPED | For XML DTD fragments, note that some modules generate two DTD fragments: for example the |
2298 | STPED | . This is because the declarations they contain are needed at different points in the creation of an XML DTD. |
2314 | STPED | If TEI.linking has its default value of IGNORE, neither declaration has any effect. If however it has the value INCLUDE, then the content of each marked section is acted upon: the parameter entities |
2318 | STPED | are referenced, which has the effect of embedding the content of the files they represent at the appropriate point in the DTD. |
2327 | STPEEX | The TEI DTD fragments also use marked sections and parameter entity references to allow users to exclude the definitions of individual elements, in order either to make the elements illegal in a document or to allow the element to be redefined. The parameter entities used for this purpose have exactly the same name as the generic identifier of the element concerned. The default definition for these parameter entities is |
2331 | STPEEX | in order to exclude the standard element and attribute definition list declarations from the DTD. |
2335 | STPEEX | , for example, are preceded by a definition for a parameter entity with the name |
2340 | STPEEX | <!ENTITY % p 'INCLUDE' > <![ %p; [ <!-- element and attribute list declaration for p here --> ]]> |
2350 | STPEEX | <!ENTITY % p 'IGNORE' > |
2351 | STPEEX | is added earlier in the DTD than the default (see further |
2354 | STPEEX | Similarly, in the parameterized RELAX NG schemas, every element is defined by a pattern named after the element. To undefine an element, therefore, all that is necessary is to add a declaration like the following: |
2355 | STPEEX | p = notAllowed |
2360 | STPEGI | In the TEI DTD fragments, elements are not referred to directly by their generic identifiers; instead, the DTD fragments refer to parameter entities which expand to the standard generic identifiers. This allows users to rename elements by redefining the appropriate parameter entity. Parameter entities used for this purpose are formed by taking the standard generic identifier of the element and attaching the string |
2372 | STPEGI | These declarations are generated by an ODD processor when TEI DTD fragments are created. |
2374 | STPEGI | In the RELAX NG schemas, all elements are normally defined using a pattern with the same name as the element (as described in |
2376 | STPEGI | abbr = element abbr { abbr.content, abbr.attributes } |
2378 | STPEGI | abbr = element abbrev { abbr.content, abbr.attributes } |
2379 | STPEGI | More complex revisions, such as redefining the content of the element (defined by the pattern |
2383 | STPEGI | ) can be accomplished in a similar way, using the features of the RELAX NG language. The recommended method of carrying out such modifications is, however, to use the ODD language as further described in section |
2389 | STOVLO | Any local modifications to a DTD (i.e. changes to a schema other than simple inclusion or exclusion of modules) are made by declarations stored in one of two local extension files, one containing modifications to the TEI parameter entities, and the other new or changed declarations of elements and their attributes. Entity declarations must be made which associate the names of these two files with the appropriate parameter entity so that the declarations they contain can be embedded within the TEI DTD at an appropriate point. |
2393 | STOVLO | file to embed portions of the TEI DTD fragments or locally developed extensions. |
2396 | STOVLO | identifies a local file containing extensions to the TEI parameter entities |
2400 | STOVLO | identifies a local file containing extensions to the TEI module |
2403 | STOVLO | For example, if the relevant files are called |
2407 | STOVLO | , then declarations like the following would be appropriate: |
2410 | STOVLO | When an entity is declared more than once, the first declaration is binding and the others are ignored. The local modifications to parameter entities should therefore be handled before the standard parameter entities themselves are declared in |
2414 | STOVLO | is referred to before any TEI declarations are handled, to allow the user's declarations to take priority. If the user does not provide a |
2418 | STOVLO | For example the encoder might wish to add two phrase-level elements |
2423 | STOVLO | hi rend='italics' |
2425 | STOVLO | hi rend='bold' |
2427 | STOVLO | , this involves two distinct steps: one to define the new elements, and the other to ensure that they are placed into the TEI document structure at the right place. |
2429 | STOVLO | Creating the new declarations is done in the same way for user-defined elements as for any other; the same parameter entities need to be defined so that they may be referenced by other elements. The content models of these new elements may also reference other parameter entities, which is why they need to be declared after other declarations. |
2433 | STOVLO | should be modified to include the generic identifiers for the new elements we wish to create. The declaration for each modifiable parameter entity in the DTD includes a reference to an additional parameter entity with the same name prefixed by an |
2435 | STOVLO | ; these entities are declared by default as the null string. However, in the file containing local declarations they may be redeclared to include references to the new class members: |
2437 | STOVLO | and this declaration will take precedence over the default when the declaration for macro.phraseSeq is evaluated. |
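
The merge rules quoted in the IM-unified rows above (positions 1387 to 1425) can be sketched roughly as follows. This is only an illustration under stated assumptions: the two attribute values missing from those fragments are taken to be `delete` and `replace` (from the TEI Guidelines, not from the rows themselves), and the `Spec` class, the `merge_module` function, and the dictionary overlay used for `change` are hypothetical simplifications, not the way the TEI stylesheets actually work.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Hypothetical stand-in for an element, class, or macro specification."""
    ident: str
    mode: str = "add"          # assumed values: add, delete, change, replace
    parts: dict = field(default_factory=dict)


def merge_module(module_specs, custom_specs):
    """Merge specs read from a TEI module with those of an ODD customization."""
    overrides = {spec.ident: spec for spec in custom_specs}
    output = []
    for spec in module_specs:
        override = overrides.get(spec.ident)
        if override is None:
            # No customization for this object: copy the module's version.
            output.append(spec)
        elif override.mode == "delete":
            # The object from the module is ignored.
            continue
        elif override.mode == "replace":
            # The module's object is ignored; the customization's is used.
            output.append(Spec(override.ident, "add", dict(override.parts)))
        elif override.mode == "change":
            # Simplified merge: customization parts overlay the module parts.
            merged = {**spec.parts, **override.parts}
            output.append(Spec(spec.ident, "add", merged))
        elif override.mode == "add":
            # Adding something that already exists is an error condition.
            raise ValueError(f"{spec.ident!r} already exists in the module")
        else:
            raise ValueError(f"unknown mode {override.mode!r}")
    # Genuinely new objects (mode "add", no clash) are simply copied.
    known = {spec.ident for spec in module_specs}
    output.extend(s for s in custom_specs
                  if s.ident not in known and s.mode == "add")
    return output
```

A real processor merges a changed specification child by child rather than with a flat overlay, so the `change` branch is the part of this sketch that departs most from the description.
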
# | id | text |
---|---|---|
3 | AI | This chapter describes a module for associating simple analyses and interpretations with text elements. We use the term |
4 | AI | analysis |
5 | AI | here to refer to any kind of semantic or syntactic interpretation which an encoder wishes to attach to all or part of a text. Examples discussed in this chapter include familiar linguistic categorizations (such as |
19 | AI | introduces elements which can be used to characterize text segments according to the familiar linguistic categories of |
34 | AI | punctuation mark |
41 | AI | introduces an additional global attribute which allows passages of text to be associated with specialized elements representing their interpretation. These |
48 | AI | . They allow the encoder to specify an analysis as a series of names and associated values, |
51 | AI | ; this term should not be confused, however, with XML attributes and their values, which are similar in concept but distinct in their formal definitions. |
52 | AI | each such pair being linked to one or more stretches of text, either directly, in the case of spans, or indirectly, in the case of interpretations. |
55 | AI | revisits the topic of linguistic analysis, and illustrates how these interpretative mechanisms may be used to associate simple linguistic analysis with text segments. |
60 | AILC | linguistic segment category |
61 | AILC | elements which may be used to represent the segmentation of a text into the traditional linguistic categories of |
74 | AILC | punctuation marks |
99 | AILCW | . They may thus appear anywhere that text is permitted within a document, when the module defined by this chapter is included in a schema. |
103 | AILCW | element may be used simply to segment a text end-to-end into a series of non-overlapping segments, referred to here and elsewhere as |
115 | AILCW | element is more restricted both in its content and its usage than the generic |
132 | AILCW | Neither this constraint, nor the requirement that the whole of the text be segmented by |
134 | AILCW | elements is enforced by the current TEI schemas; such constraints may however be introduced in a later version of these Guidelines. |
137 | AILCW | element is intended for use as a generic segmentation element, the specific function of which may be indicated by its |
146 | AILCW | seg type="s-unit" |
148 | AILCW | seg type="clause" |
150 | AILCW | seg type="phrase" |
195 | AILCW | elements in the same way. A text may be segmented directly into clauses, or into phrases, with no need to include segmentation at a higher level as well. |
197 | AILCW | For verse texts, the overlapping of metrical and syntactic structure requires that special care be given to representing both using an element hierarchy. One simple approach is to split the syntactic phrases into fragments when they cross verse boundaries, reuniting them with the |
222 | AILCW | attributes defined in the additional module for linking (chapter |
234 | AILCW | attribute on linguistic segment categories can be used to provide additional interpretative information about the category. The |
240 | AILCW | elements can be used to provide additional information about the function of the category. Legal values for these two attributes are not defined by these Guidelines, but should be documented in the |
244 | AILCW | element within the document's header. A general approach to the encoding of linguistic categories for parts of a text is discussed in section |
263 | AILCW | Segmentation into clauses and phrases can, of course, be combined. Detailed encodings such as the following, however, may require careful formatting if they are to be easily readable. |
329 | AILCW | This style of markup may introduce spurious new lines and blanks into the text. If the original layout is important, it should be explicitly encoded, using such facilities as the |
348 | AILCW | w |
350 | AILCW | m |
352 | AILCW | c |
355 | AILCW | is permitted to occur. However, their content is more constrained than |
377 | AILCW | elements should contain only plain text, most often only a single character or a sequence of graphemes to be treated as a single character. Consequently, while these more specific elements can be translated directly into typed |
381 | AILCW | The restriction on the content of the |
383 | AILCW | element in particular requires that care be exercised when using it, especially in relation to the use of other tags that one may think of as |
393 | AILCW | element is not part of the content model of the |
417 | AILCW | carries additional attributes which may be of use in many indexing or analytic applications. The |
421 | AILCW | , that is the head- or uninflected form of an inflected verb or noun, for example: |
437 | AILCW | pointer attribute than to supply an explicit uninflected form. This attribute assumes the existence of a list of uninflected forms, for example in an online lexicon, with which individual |
438 | AILCW | w |
439 | AILCW | entries can be associated using the usual TEI pointer mechanisms. Assuming that a standardized lexicon for Latin is available at the location |
458 | AIPC | element is used to mark up morphologically identified segmentation below the word level. Analogous to the |
467 | AIPC | base form |
500 | AIPC | There is a substantial linguistic difference between characters like letters or diacritics and punctuation marks. The former are used to construct meaningful units like morphemes or words. The latter are functionally independent units acting at the level of syntactic units. A word may consist of a single letter (for example |
553 | AIPC | use to mark non-lexical punctuation marks is deprecated, since the |
559 | AIPC | (punctuation character) element should be used to mark up characters which are specifically regarded as providing punctuation, rather than constituting parts of a word. It may be particularly useful when transcribing older written materials, in which an encoding of the original punctuation may be useful for interpretive or analytic purposes, in much the same way as an encoding of the original orthography may be. For example, in the following extract from a Bodleian Library musical manuscript |
562 | AIPC | two different punctuation marks are used to distinguish kinds of pause in the text. The |
583 | AIPC | element carries special attributes to record analyses of the functional behaviour or classification of the punctuation mark it contains. The |
587 | AIPC | element to name the kind of unit which the punctuation mark delimits, for example a paragraph or section. The |
589 | AIPC | attribute may be used to indicate whether the punctuation precedes or follows the unit it delimits. The |
591 | AIPC | attribute indicates the strength of the association between the punctuation mark and its adjacent word. |
593 | AIPC | In the following example, the paragraph marker (¶) has been tagged as a strong punctuation mark, preceding the unit it marks, which is named |
610 | AIPC | elements can be used together to give a fairly detailed low-level grammatical analysis of text. For example, consider the following segmentation of the English S-unit |
635 | AIPC | . A further advantage of segmenting the text down to this level is that it becomes relatively simple to associate each such segment with a more detailed formal analysis, for example by providing a baseform, or morphological analysis at whichever level is appropriate. This matter is taken up in detail in section |
651 | AIATTS | When the module described by this chapter is selected, an additional attribute is defined for all elements: |
654 | AIATTS | attribute may be specified for any element. Its effect is to associate the element with one or more others representing an analysis or interpretation of it. Its target should be one of the elements described in the section |
669 | AISP | The simplest mechanisms for attaching analytic notes in some structured vocabulary to particular passages of text are provided by the |
695 | AISP | elements may be used to indicate that the annotations are of specific types, for example thematic or structural. The annotation itself is supplied as the content of the |
699 | AISP | element. In the case of the |
701 | AISP | element, the span of text being annotated is indicated by values of the |
709 | AISP | attribute is supplied, then the span is coterminous with the element indicated by its value; if both |
713 | AISP | are supplied, the span runs from the start of the element indicated by the |
717 | AISP | attribute; if the |
719 | AISP | attribute is used, the span is defined by aggregating the contents of the (possibly non-contiguous) elements pointed to by its values. It is an error to supply only the |
721 | AISP | attribute; to supply more than one pointer value for either |
727 | AISP | attribute. In the case of |
729 | AISP | (see below), the span is indicated by a pointer from a |
747 | AISP | Here the two components of the span follow each other, so the |
763 | AISP | This second approach might be cumbersome if the number of components to be combined is very large. It is however essential if the components do not follow each other, as in this example: |
801 | AISP | element may, as in this example, be placed in the text near the textual span it is associated with. Alternatively, it may be placed elsewhere in the same or a different document. Where several |
805 | AISP | elements share the same attributes, for example having the same responsibility or type, it may be convenient to group them within a |
816 | AISP | Spans may also be used to represent structural divisions within a narrative, particularly when these do not coincide with the structure implied by the element structure. Consider the following narrative: |
819 | AISP | The rule marks spaces left for the missing name in the manuscript. |
820 | AISP | And when he came home, Borghild asked him to go away, but Sigmund offered her weregild, and she was obliged to accept it. At the funeral feast Borghild was serving beer. She took poison, a big drinking horn full, and brought it to Sinfiotli. When Sinfiotli looked into the horn, he saw that poison was in it, and said to Sigmund |
822 | AISP | Sigmund took the horn and drank it off. It is said that Sigmund was hardy and that poison did him no harm, inside or out. And all his sons could tolerate poison on their skin. Borghild brought another horn to Sinfiotli, and asked him to drink, and everything happened as before. And a third time she brought him a horn, and reproachful words as well, if he didn't drink from it. He spoke again to Sigmund as before. He said |
826 | AISP | Sigmund carried him a long way in his arms and came to a long, narrow fjord, and there was a small boat there and a man in it. He offered to ferry Sigmund over the fjord. But when Sigmund carried the body out to the boat, it was fully laden. The man said Sigmund should go around the fjord inland. The man pushed the boat out and then suddenly vanished. |
828 | AISP | King Sigmund lived a long time in Denmark in the kingdom of Borghild, after he married her. Then he went south to Frankish lands, to the kingdom he had there. Then he married Hiordis, the daughter of King Eylimi. Their son was Sigurd. King Sigmund fell in a battle with the sons of Hunding. And then Hiordis married Alf, the son of King Hialprec. Sigurd grew up there as a boy. |
833 | AISP | A structural analysis of this text, dividing it into narrative units in a pattern shared with other texts from the same literature, might look like this: |
880 | AISP | unit which is normally part of the narrative pattern but which is not realized in the text shown. |
883 | AISP | The same analysis may be expressed with the |
887 | AISP | element; this element provides attributes for recording an interpretive category and its value, as well as the identity of the interpreter, but does not itself indicate which passage of text is being interpreted; the same interpretive structures can thus be associated with many passages of the text. The association between text passages and |
889 | AISP | elements must be made either by pointing from the text to the |
894 | AISP | , or by pointing at both text and interpretation from a |
901 | AISP | , it is necessary to create a text element which contains—or corresponds to—the third, fourth, and fifth orthographic sentences (S-units) in the paragraph. This can be done either with the |
907 | AISP | . The resulting element can then be associated with the |
938 | AISP | tags in a similar manner. The interpretation itself can be expressed in an |
960 | AISP | elements may be linked to the text either by means of the |
968 | AISP | elements introduced specifically for this purpose), the text would be encoded as follows: |
1001 | AISP | element, whose content is a set of |
1003 | AISP | elements which point to each interpretive element and its corresponding text unit. This method does not require the use of the |
1005 | AISP | attribute on the text units. |
1019 | AISP | elements for the Sigmund text is that the |
1026 | AISP | elements may require the creation of special text elements not otherwise needed (e.g. the |
1045 | AILA | we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in this chapter and in chapters |
1047 | AILA | , and the various chapters of Part III. The contextual properties of a TEI text are fully documented in the TEI header, which is discussed in chapter |
1051 | AILA | Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete and non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains. |
1053 | AILA | The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the |
1055 | AILA | element within the encoding description of the TEI header, as described in section |
1056 | AILA | . Where different parts of a language corpus have used different annotation methods, the |
1075 | AILA | This may be easily transformed into an equivalent TEI XML representation: |
1116 | AILA | , etc.) they are arbitrary codes, used in this case as pointers to other elements which define their significance more precisely. If the codes are considered to be |
1118 | AILA | , then the |
1153 | AILA | ), then this compositionality may be most clearly expressed using a mechanism based on the |
1158 | AILA | This approach requires the text to be fully segmented, using the linguistic segment elements described in section |
1161 | AILA | attribute used to point to each interpretation is clearly defined. A further analysis into phrase and clause elements can be superimposed on the word and morpheme tagging in the preceding illustration. For example, CLAWS provides the following constituent analysis of the sample sentence (the word class codes have been deleted): |
1165 | AILA | Treating the labels on the brackets as phrase or clause interpretations, this analysis of the structure of the example sentence can be combined with the word class analysis and represented as follows (the symbol |
1258 | AILA | element. In this case, each linguistic segment must be supplied with its own |
1307 | AILA | Each linguistic segment so far discussed has been well-behaved with respect to the basic document hierarchy, having only a single parent. Moreover, the segmentation has been complete, in that each part of the text is accounted for by some segment at each level of analysis, without discontinuities or overlap. This state of affairs does not of course apply in all types of analysis, and these Guidelines provide a number of mechanisms to support the representation of discontinuities or multiple analyses. A brief overview of these facilities is provided in chapter |
1311 | AILA | The mechanisms proposed in this chapter may also be used to encode analyses of an entirely different kind, for example discourse function. Here is an application of the span technique to record details of a sales transaction in a spoken text. |
1337 | AILA | (utterance) element and other elements recommended for transcriptions of spoken language, see chapter |
1346 | analysis | Simple analytic mechanisms |
1355 | AI | The selection and combination of modules to form a TEI schema is described in |
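
The rows above describe word-class codes that act as pointers to other elements defining their significance. The following Python fragment is a minimal sketch of how such pointers might be resolved. It assumes that the words are tagged with w elements carrying an ana attribute, and that each code is defined by an interp element identified by xml:id. The sample words, the code names (det, noun, sg, verb), and the function name resolve_analyses are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: the words and codes are illustrative only.
SAMPLE = """
<text>
  <s>
    <w ana="#det">The</w>
    <w ana="#noun #sg">victim</w>
    <w ana="#verb">spoke</w>
  </s>
  <interpGrp type="wordclass">
    <interp xml:id="det">determiner</interp>
    <interp xml:id="noun">noun</interp>
    <interp xml:id="sg">singular</interp>
    <interp xml:id="verb">verb</interp>
  </interpGrp>
</text>
"""


def resolve_analyses(root):
    """Pair each word with the labels of the interpretations it points to."""
    # Index every interpretation by its identifier.
    interps = {
        el.get(XML_ID): (el.text or "").strip() for el in root.iter("interp")
    }
    result = []
    for w in root.iter("w"):
        # The ana value is a whitespace-separated list of "#id" pointers.
        pointers = (w.get("ana") or "").split()
        # A pointer to an undefined id raises KeyError rather than passing silently.
        labels = [interps[p.lstrip("#")] for p in pointers]
        result.append((w.text, labels))
    return result


if __name__ == "__main__":
    for word, labels in resolve_analyses(ET.fromstring(SAMPLE)):
        print(word, labels)
```

Because the codes are resolved through an index of identifiers, the interpretations could equally be kept in a different part of the document from the words they describe. Only local "#id" pointers are handled here.
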
# | id | text |
---|---|---|
4 | NH | XML employs a strongly hierarchical document model. At various points, these Guidelines discuss problems that arise when using XML to encode textual features that either do not naturally lend themselves to representation in a strictly hierarchical form or conflict with other hierarchies represented in the markup. Examples of such situations include: |
11 | NH | Conflict between a verse text's metrical structure (e.g., its arrangement in stanzas and metrical lines) and its rhetorical or linguistic structure (e.g., phrases, sentences, and, for plays, acts, scenes, and speeches). |
15 | NH | Conflict between metrical, rhetorical, or linguistic structure and the representation of direct speech, especially if the quoted speech is interrupted by other elements (e.g., |
23 | NH | Conflict between different analytical views or descriptions of a text or document, e.g., markup intended to encode diplomatic information about a word's appearance in a manuscript with markup intended to describe its morphology or pronunciation. |
30 | NH | These Guidelines support several methods for handling non-hierarchical information: |
75 | NH | at the back of one certain man and asked me, |
88 | NH | , encodes the text according to its metrical features: line divisions (as here), stanzas or cantos in larger poems, and perhaps prosodic features like stress or syllable patterns, alliteration, or rhyme. A second view, which we might describe as the |
94 | NH | we will encode only metrical lines and line groups; for the |
98 | NH | , we will distinguish only direct quotation from other narration. |
103 | NHME | Conceptually, the simplest method of disentangling two (or more) conflicting hierarchical views of the same information is to encode it twice (or more), each time capturing a single view. |
124 | NHME | would be encoded by taking the same text and replacing the metrical markup with information about its sentence structure: |
185 | NHME | This method is TEI-conformant. Its advantages are that each way of looking at the information is explicitly represented in the data and that the individual views are simple to process. The disadvantages are that the method requires the maintenance of multiple copies of identical textual content (an invitation to inconsistency) and that there is no explicit indication that the various views, which might be in separate files, are related to each other: it might prove difficult to combine the views or access information from one view while processing the file that contains the encoding of another. |
186 | NHME | It has been shown, however, that it is possible to relate the different annotations in an indirect way: if the textual content of the annotations is identical, the very text can serve as a means for linking the different annotations, as described in |
193 | NHBM | A second method for accommodating non-hierarchical objects in an XML document involves marking the start and end points of the non-nesting material. This prevents textual features that fall outside the privileged hierarchy from invalidating the document while identifying their beginnings and ends for further processing. The disadvantage of this method is that no single XML element represents the non-nesting material and, as a result, processing with XML technologies is significantly more difficult. |
201 | NHBM | For some common structural features, the TEI provides milestone elements that can be used to mark the beginning of a textual feature. These include |
228 | NHBM | The use of these elements is by definition TEI-conformant. Care should be taken, however, that the meaning of the milestone elements is preserved: semantically, for example, |
230 | NHBM | is used to mark the start of a new (typographical) line. While in much modern poetry, typographical and metrical line divisions correspond, |
232 | NHBM | does not itself make a metrical claim: in encoding verse from sources, such as Old English manuscripts, where physical line breaks are not used to indicate metrical lineation, the correspondence would break down entirely. |
236 | NHBM | element. Attributes can then be used to indicate the type of feature being delimited and whether a given instance opens or closes the feature. |
257 | NHBM | Another approach is to design custom elements that provide richer information about the feature being delimited or its boundaries. This information can be included as attribute values or as part of the element name itself: e.g., |
288 | NHBM | If the custom elements can be replaced by TEI elements and attributes without loss of information, this method is TEI-conformable (see |
289 | NHBM | ); if the custom elements introduce information or distinctions that cannot be captured using standard TEI elements, the method is an extension. |
297 | NHBM | , etc.) can be adapted so that they serve as empty segment boundary delimiters when the features they encode cross hierarchical boundaries. Additional attributes ( |
323 | NHBM | The method is TEI-conformable if the modified elements are placed in a distinct, non-TEI namespace (see |
324 | NHBM | ), and if the modified elements and attributes can be mapped without loss of information to existing TEI markup structures such as milestone or anchor elements automatically (see |
327 | NHBM | The method represents an Extension if the modified elements are placed in a distinct, non-TEI namespace, but contain information or distinctions that cannot be algorithmically translated to existing TEI elements without loss of information (see |
330 | NHBM | The method is non-conformant—and indeed strongly deprecated—if the modified elements and attributes are not placed in a distinct, non-TEI namespace (see |
334 | NHBM | In each of the above examples (except the last), the relationship between the start and end delimiters (where these exist) of a given feature is implicit: it is assumed that "end" delimiters close the nearest preceding "start" delimiter, or, in the case of milestones, that the milestone marks both the end of the preceding example and the beginning of the next. Complications arise, however, when the non-nesting text overlaps with other non-nesting text of the same type, as, for example, in a grammatical analysis of the various possible interpretations of the |
379 | NHBM | tag with the |
381 | NHBM | value |
385 | NHBM | with the same value on |
395 | NHBM | tag with the |
397 | NHBM | value |
401 | NHBM | tag that has the same value on |
405 | NHBM | Despite their advantages, segment boundary delimiters incur the disadvantage of cumbersome processing: since the elements of the analysis (e.g., the sentences in the poems, or phrases in the above example) are not uniformly represented by nodes in the document tree, they must be reconstituted by software in an ad hoc fashion, which is likely to be difficult and may be error prone. |
407 | NHBM | Most important for some encoders, the method also disguises the relationship between the beginning and the ending of each logical element. This makes it impossible for standard validation software to provide the same kind of validation possible elsewhere in the encoding. When using grammar-based schema languages it is not possible to define a content model for the range limited by empty elements. |
408 | NHBM | Grammar-based schema languages (e.g., DTD, W3C Schema, and RELAX NG) are used to define markup languages (e.g., XHTML or TEI). Rule-based schema languages (e.g., Schematron) can be used to define further constraints. Such a rule-based schema language permits a sequence of certain elements between empty elements to be legitimized or prohibited. |
414 | NHVE | A third method involves breaking what might be considered a single logical (but non-nesting) element into multiple smaller structural elements that fit within the dominant hierarchy but can be reconstituted virtually. For example, if a passage of direct discourse begins in the middle of one paragraph and continues for several more paragraphs, one could encode the passage as a series of |
418 | NHVE | element. The resulting encoding is valid XML, but the text in each |
424 | NHVE | In the case of our selection from Pinsky's poem, for example, the second passage of direct quotation, which crosses a line boundary and is broken up by a |
425 | NHVE | She said |
478 | NHVE | marks seven spans of text using |
490 | NHVE | is a string corresponding to no single grammatical category. |
492 | NHVE | Taken together, these problems can make automatic analysis of the fragmented features difficult. An analysis that intended to count the number of sentences in Wordsworth's poem, for example, would arrive at an inflated figure if it understood the |
494 | NHVE | elements to represent complete rhetorical sentences; if it wanted to do an analysis of his syntax, it would not be able to assume that |
498 | NHVE | The technique of fragmentation is often complemented by the technique of virtual joins. Virtual joins may be used to combine objects in the text into a new hierarchy. Here is |
500 | NHVE | again; this time the relationship between the parts of the fragmented sentences is indicated explicitly using the |
545 | NHVE | attribute with the value |
577 | NHVE | This method is TEI-conformant and simple to use. Its disadvantage is that it does not work well for cases of self-overlap, or if there are nested occurrences of the same element type, as it can become difficult to ascertain which initial, medial, or final partial element should be combined with which others or in which order. This problem becomes evident if we attempt to combine a detailed Grammatical view of the Pinsky example with its metrical encoding: |
705 | NHVE | The major advantage of fragmentation and virtual joins is that it allows all the hierarchies in the text to be handled explicitly: both the privileged one directly represented and the alternate hierarchy that has been split up and rejoined. The major disadvantages are that (like most of the other methods described here) it privileges one hierarchy over the others, requires special processing to reconstitute the elements of the other hierarchies, and, except in the case of |
713 | NHSO | Most markup is characterized by the embedding of elements in the text. An alternative approach separates the text and the elements used to describe it. This approach is known as stand-off markup (see section |
714 | NHSO | ). It establishes a new hierarchy by building a new tree whose nodes are XML elements that do not contain textual content, but rather links to another |
717 | NHSO | a node in another XML document or a span of text |
718 | NHSO | . This approach can be subdivided according to different criteria. A first distinction concerns the link base, i.e. the content to which annotations are to be applied. Sometimes the link target contains markup that can be referred to explicitly, as in the following example where the offset markup uses the |
724 | NHSO | A fake namespace is given for XInclude here, to avoid the markup being interpreted literally during processing. |
798 | NHSO | Note that the layer that uses XInclude to build another hierarchy might well be in another document, in which case the value of |
802 | NHSO | would need to be the URL of the document that contains the base layer, in this case the |
810 | NHSO | elements, and that there exists off-the-shelf software that will perform appropriate processing. Stand-off markup may be used even when the base text being annotated is plain text, i.e. does not have any XML encoding. In this case, the range of text to be marked up is indicated by character offsets (see |
812 | NHSO | ). Another distinction concerns the number of files which can serve as link targets. Often, one (dedicated) annotation is used as the link target of all the other annotations. It is also possible to freely interlink several layers. |
814 | NHSO | It has been noted that stand-off markup has several advantages over embedded annotations. In particular, it is possible to produce annotations of a text even when the source document is read-only. Furthermore, annotation files can be distributed without distributing the source text. Further advantages mentioned in the literature are that discontinuous segments of text can be combined in a single annotation, that independent parallel coders can produce independent annotations, and that different annotation files can contain different layers of information. Lastly, it has also been noted that this approach is elegant. |
818 | NHSO | Inasmuch as it uses elements not included in the TEI namespace, stand-off markup involves an extension of the TEI. |
824 | NHNX | There exist many non-XML methods of encoding a text that either solve, or do not suffer from, the problem of encoding overlapping hierarchies. These include, but are not limited to, the following proposals. |
830 | NHNX | Designing a form of document representation in which several trees share all or part of the same frontier, and in which each individual view of the document has the form of a tree (see |
836 | NHNX | ), which stores a body of information as a set of intertwined XML trees. This approach eliminates unnecessary redundancy and makes the database readily updatable, while allowing the user to exploit different hierarchical access paths. |
850 | NHNX | proposal. This offers alternatives to the basic XML linear form as well as its data and processing models. It uses an alternative notation to XML and a data structure based on Core Range Algebra ( |
858 | NHNX | . This provides a notation (TexMECS) and a data structure (Goddag) as well as a draft constraint language for the representation of non-hierarchical structures; see |
862 | NHNX | These approaches are based either on non-standard XML processing or data models, or not based on XML at all. Since TEI is currently based on XML, they are not described any further in these Guidelines. Use of these methods with the TEI will certainly involve extensions; in most cases the documents will also be non-conformant. |
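
The rows on fragmentation and virtual joins note that fragmented elements need special processing to be reconstituted. The following Python fragment is a minimal sketch of that processing. It assumes the fragments are seg elements chained with the next and prev linking attributes and identified by xml:id. The sample text, the identifiers, and the function name join_fragments are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: one sentence split across three metrical lines,
# followed by a sentence that needed no fragmentation.
SAMPLE = """
<lg>
  <l><seg xml:id="s1" next="#s2">First part of a sentence</seg></l>
  <l><seg xml:id="s2" prev="#s1" next="#s3">that runs on</seg></l>
  <l><seg xml:id="s3" prev="#s2">across three lines.</seg>
     <seg xml:id="s4">Short one.</seg></l>
</lg>
"""


def join_fragments(root, tag="seg"):
    """Rebuild each logical unit by following its chain of next pointers."""
    frags = {el.get(XML_ID): el for el in root.iter(tag)}
    joined = []
    for el in root.iter(tag):
        if el.get("prev") is not None:
            continue  # a medial or final fragment, reached via its chain head
        parts, seen, cur = [], set(), el
        while cur is not None:
            ident = cur.get(XML_ID)
            if ident in seen:
                raise ValueError(f"circular chain at {ident!r}")
            seen.add(ident)
            parts.append("".join(cur.itertext()).strip())
            nxt = cur.get("next")
            # A next pointer to an unknown id simply ends the chain here.
            cur = frags.get(nxt.lstrip("#")) if nxt else None
        joined.append(" ".join(parts))
    return joined


if __name__ == "__main__":
    for sentence in join_fragments(ET.fromstring(SAMPLE)):
        print(sentence)
```

The sketch treats any fragment without a prev pointer as the head of a chain, so unfragmented segments pass through unchanged. It does not attempt the harder cases mentioned in the table, such as self-overlap or nested occurrences of the same element type.
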
# | id | text |
---|---|---|
2 | CO | Elements Available in All TEI Documents |
4 | CO | This chapter describes elements which may appear in any kind of text and the tags used to mark them in all TEI documents. Most of these elements are freely floating phrases, which can appear at any point within the textual structure, although they must generally be contained by a higher-level element of some kind (such as a paragraph). A few of the elements described in this chapter (for example, bibliographic citations and lists) have a comparatively well-defined internal structure, but most of them have no consistent inner structure of their own. In the general case, they contain only a few words, and are often identifiable in a conventionally printed text by the use of typographic conventions such as shifts of font, use of quotation or other punctuation marks, or other changes in layout. |
8 | CO | tag used to mark paragraphs, the prototypical formal unit for running text in many TEI modules. This is followed, in section |
9 | CO | , by a discussion of some specific problems associated with the interpretation of conventional punctuation, and the methods proposed by the Guidelines for resolving ambiguities therein. |
12 | CO | ) describes a number of phrase-level elements commonly marked by typographic features (and thus well-represented in conventional markup languages). These include features commonly marked by font shifts (section |
13 | CO | ) and features commonly marked by quotation marks (section |
18 | CO | introduces some phrase-level elements which may be used to record simple editorial interventions, such as emendation or correction of the encoded text. The elements described here constitute a simple subset of the full mechanisms for encoding such information (described in full in chapter |
22 | CO | ) describes several phrase-level and inter-level elements which, although often of interest for analysis or processing, are rarely explicitly identified in conventional printing. These include names (section |
35 | CO | , describe two kinds of quasi-structural elements: lists and notes. These may appear either within chunk-level elements such as paragraphs, or between them. Lists of several kinds, and of arbitrary complexity, are catered for. The section on notes discusses both notes found in the source and simple mechanisms for adding annotations of an interpretive nature during the encoding; again, only a subset of the facilities described in full elsewhere (specifically, in chapter |
39 | CO | introduces some simple ways of representing graphic or other non-textual content found in a text. A fuller discussion of the multimedia facilities supported by these Guidelines may be found in chapters |
44 | CO | , describes methods of encoding within a text the conventional system or systems used when making references to the text. Some reference systems have attained canonical authority and must be recorded to make the text useable in normal work; in other cases, a convenient reference system must be created by the creator or analyst of an electronic text. |
49 | CO | Additional elements for the encoding of passages of verse or drama (whether prose or verse) are discussed in section |
53 | CO | , describing the structure of the TEI document type definition. |
57 | COPA | The paragraph is the fundamental organizational unit for all prose texts, being the smallest regular unit into which prose can be divided. Prose can appear in all TEI texts, even those that are primarily of another genre (e.g., verse); thus the paragraph is described here, as an element which can appear in any kind of text. |
59 | COPA | Paragraphs can contain any of the other elements described within this chapter, as well as some other elements which are specific to individual text types. We distinguish |
70 | COPA | Because paragraphs may appear in different base or additional tag sets, their possible contents may differ in different kinds of documents. In particular, additional elements not listed in this chapter may appear in paragraphs in certain kinds of text. However, the elements described in this chapter are always by default available in all kinds of text. |
86 | COPA | Since paragraphs are usually explicitly marked in Western texts, typically by indentation, the application of the |
88 | COPA | tag usually presents few problems. |
90 | COPA | In some cases, the body of a text may comprise but a single paragraph: |
107 | COPA | The following extract from a Russian fairy tale demonstrates how other phrase-level elements (in this case |
139 | COPU | Punctuation marks cause two distinct classes of problem for text markup: the marks may not be available in the character set used, and they may be significantly ambiguous. To some extent, the availability of the Unicode character set addresses the first of these problems, since it provides specific code points for most punctuation marks, and also the second to the extent that it distinguishes glyphs (such as stop, comma, and hyphen) which are used with different functions. |
140 | COPU | Where punctuation itself is the subject of study, the element |
143 | COPU | . Where the character used for a punctuation mark is not available in Unicode, the |
150 | COPU-1 | Punctuation is itself a form of markup, historically introduced to provide the reader with an indication about how the text should be read. As such, it is unsurprising that encoders will often wish to encode directly the purpose for which punctuation was provided, as well as, or even instead of, the punctuation itself. We discuss some typical cases below. |
157 | COPU-1 | respectively. However, there are independent reasons for tagging these, whether or not they are marked by full stops, and the polysemy of the full stop itself is perhaps no different from that of any other character in the writing system. |
163 | COPU-1 | usually mark the end of orthographic sentences, but may also be used as a mid-sentence comment by the author ( |
167 | COPU-1 | to query a word or expression or mark a sentence as dubious in linguistic discussion). Such usages may be distinguished by marking S-units, in which case the mid-sentence uses of these punctuation marks may be left unmarked, or tagged using the |
173 | COPU-1 | are used for a variety of purposes: as a mark of omission, insertion, or interruption; to show where a new speaker takes over (in dialogue); or to introduce a list item. In the latter two cases particularly, it is clearly desirable to mark the function as well as its rendition using the elements |
182 | COPU-1 | may be removed from text contained by |
186 | COPU-1 | elements on editorial grounds, or they may be marked in a variety of ways; see the discussion of quotation and related features in section |
190 | COPU-1 | must be distinguished from single quote marks. As with hyphens, this disambiguation is best performed by selecting the appropriate Unicode character, though it may also be represented by using appropriate XML markup for quotations as suggested above. However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and (occasionally) plural forms. Full disambiguation of these uses belongs to the level of linguistic analysis and interpretation. |
193 | COPU-1 | and other marks of suspension such as dashes or ellipses are often used to signal information about the syntactic structure of a text fragment. Full disambiguation of their uses also belongs to the level of linguistic analysis and interpretation, and will therefore need to use the mechanisms discussed in chapter |
196 | COPU-1 | Where punctuation marks are disambiguated by tagging their assumed function in the text (for example, quotation), it may be debated whether they should be excluded or left as part of the text. In the case of quotation marks, it may be more convenient to distinguish opening from closing marks simply by using the appropriate Unicode character than to use the |
200 | COPU-1 | Where segmentation of a text is performed automatically, the accuracy of the result may be considerably enhanced by a first pass in which the function of different punctuation characters is explicitly marked. This need not be done for all cases, but only where the structural function of the punctuation markup (for example as a word or phrase delimiter) is ambiguous. Thus, dots indicating abbreviation might be distinguished from dots indicating sentence end, and exclamation or question marks internal to a sentence distinguished from those which terminate one. Furthermore, when encoding historical materials, it may be considered essential to retain the original punctuation, whether by using an appropriate character code, if this is available (or using the |
202 | COPU-1 | element where it is not) or by an explicit encoding using |
204 | COPU-1 | . The particular method adopted will vary depending upon the feature concerned and upon the purpose of the project. |
209 | COPU-2 | Hyphenation as a phenomenon is generally of most concern when producing formatted text for display in print or on screen: different languages and systems have developed quite sophisticated sets of rules about where hyphens may be introduced and for what reason. These generally do not concern the text encoder, since they belong to the domain of formatting and will generally be handled by the rendition software in use. In this section, we discuss issues arising from the appearance of hyphens in pre-existing formatted texts which are being re-encoded for analysis or other processing. Unicode distinguishes four characters visually similar to the hyphen, including the undifferentiated hyphen-minus (U+002D) which is retained for compatibility reasons. The hard hyphen (U+2010) is distinguished from the minus sign (U+2212) which is for use in mathematical expressions, and also from the soft hyphen (U+00AD) which may appear in |
211 | COPU-2 | documents to indicate places where it is acceptable to insert a hyphen when the document is formatted. |
213 | COPU-2 | Historically, the hard hyphen has been used in printed or manuscript documents for two distinct purposes. In many languages, it is used between words to show that they function as a single syntactic or lexical unit. For example, in French, |
219 | COPU-2 | etc. It may also have an important role in disambiguation (for example, by distinguishing, say, a |
223 | COPU-2 | ). Such usages, although possibly problematic when a linguistic analysis is undertaken, are not generally of concern to text encoders: the hyphen character is usually retained in the text, because it may be regarded as part of the way a compound or other lexical item is spelled. Deciding whether a compound is to be decomposed into its constituent parts, and if so how, is a different question, involving consideration of many other phenomena in addition to the simple presence of a hyphen. |
225 | COPU-2 | When it appears at the end of a printed or written line however, the hard hyphen generally indicates that—contrary to what might be expected—a word is not yet complete, but continues on the next line (or over the next page or column or other boundary). The hyphen character is not, in this case, part of the word, but just a signal that the word continues over the break. Unfortunately, few languages distinguish these two cases visually, which necessarily poses a problem for text encoders. Suppose, for example, that we wish to investigate a diachronic English corpus for occurrences of "tea-pot" and "teapot", to find evidence for the point at which this compound becomes lexicalized. Any case where the word is hyphenated across a linebreak, like this: |
231 | COPU-2 | They may decide simply to remove any end-of-line hyphenation from the encoded text, on the grounds that its presence is purely a secondary matter of formatting. This will obviously apply also if line endings are themselves regarded as unimportant. |
233 | COPU-2 | Alternatively, they may decide to record the presence of the hyphen, perhaps on the grounds that it provides useful morphological information; perhaps in order to retain information about the visual appearance of the original source. In either case, they need to decide whether to record it explicitly, by including an appropriate punctuation character in the text data, or implicitly by supplying an appropriate symbolic value for one or more of the attributes on the |
235 | COPU-2 | or other milestone element used to record the fact of the line division. If the hyphen is included in the character data of the TEI document, it might be marked up using the |
242 | COPU-2 | A similar range of possibilities applies equally to the representation of other common punctuation marks, notably quotation marks, as discussed in |
246 | COPU-2 | text data |
249 | COPU-2 | , even if those units are not explicitly indicated by the XML markup. The ambiguity of the end-of-line hyphen also causes problems in the way a processor identifies such tokens in the absence of explicit markup. If token boundaries are not explicitly marked (for example using the |
253 | COPU-2 | elements), for most languages a processor will rely on character class information to determine where they are to be found: some punctuation characters are considered to be word-breaking, while others are not. In XML, the newline character in text data is a kind of whitespace, and is therefore word-breaking. However, it is generally unsafe to assume that whitespace adjacent to markup tags will always be preserved, and it is decidedly unsafe to assume that markup tags themselves are equivalent to whitespace. |
261 | COPU-2 | elements are notable exceptions to this general rule, since their function is precisely to represent (or replace) line, page, or column breaks, which, as noted above, are generally considered to be equivalent to whitespace. These elements provide a more reliable way of preserving the lineation, pagination, etc. of a source document, since the encoder should not assume that (untagged) line breaks etc. in an XML source file will necessarily be preserved. |
269 | COPU-2 | to indicate whether or not the element corresponds with a token boundary. The value |
271 | COPU-2 | is also available, for cases where the encoder does not wish (or is unable) to determine whether the orthographic token concerned is broken by the line ending. |
273 | COPU-2 | As a final complication, it should be noted that in some languages, particularly German and Dutch, the spelling of a word may be altered in the presence of end-of-line hyphenation. For example, in Dutch, the word |
277 | COPU-2 | ), occurring at the end of a line may be hyphenated as |
279 | COPU-2 | , with a single letter a. An encoder wishing to preserve the original form of this orthographic token in a printed text while at the same time facilitating its recognition as the word |
281 | COPU-2 | will therefore need to rely on a more sophisticated process than simply removing the hyphen. This is however essentially the same as any other form of normalization accompanying the recognition of variations in spelling or morphology: as such it may be encoded using the |
284 | COPU-2 | , or the more sophisticated mechanisms for linguistic analysis discussed in chapter |
291 | COHQ | This section deals with a variety of textual features, all of which have in common that they are frequently realized in conventional printing practice by the use of such features as underlining, italic fonts, or quotation marks, collectively referred to here as |
293 | COHQ | . After an initial discussion of this phenomenon and alternate approaches to encoding it, this section describes ways of encoding the following textual features, all of which are conventionally rendered using some kind of highlighting: |
295 | COHQ | emphasis, foreign words and other linguistically distinct uses of highlighting |
308 | COHQW | typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings. |
309 | COHQW | Although the way in which a spoken text is performed (for example, the voice quality, loudness, etc.) might be regarded as analogous to |
311 | COHQW | in this sense, these Guidelines recommend distinct elements for the encoding of such |
313 | COHQW | in spoken texts. See further section |
315 | COHQW | The purpose of highlighting is generally to draw the reader's attention to some feature or characteristic of the passage highlighted; this section describes the elements recommended by these Guidelines for the encoding of such textual features. |
319 | COHQW | distinct in some way—as foreign, dialectal, archaic, technical, etc. |
321 | COHQW | emphatic, and which would for example be stressed when spoken |
323 | COHQW | not part of the body of the text, for example cross-references, titles, headings, labels, etc. |
325 | COHQW | identified with a distinct narrative stream, for example an internal monologue or commentary. |
327 | COHQW | attributed by the narrator to some other agency, either within the text or outside it: for example, direct speech or quotation. |
329 | COHQW | set apart from the text in some other way: for example, proverbial phrases, words mentioned but not used, names of persons and places in older texts, editorial corrections or additions, etc. |
332 | COHQW | The textual functions indicated by highlighting may not be rendered consistently in different parts of a text or in different texts. (For example, a foreign word may appear in italics if the surrounding text is in roman, but in roman if the surrounding text is in italics.) For this reason, these Guidelines distinguish between the encoding of rendering itself and the encoding of the underlying feature expressed by it. |
341 | COHQW | ). This allows the encoder both to specify the function of a highlighted phrase or word, by selecting the appropriate element described here or elsewhere in the Guidelines, and to further describe the way in which it is highlighted, by means of an attribute. If the encoder wishes to offer no interpretation of the feature underlying the use of highlighting in the source text, then the |
343 | COHQW | element may be used, which indicates only that the text so tagged was highlighted in some way. |
354 | COHQW | attribute are not formally defined in this version of the Guidelines. It may be used to document any peculiarity of the way a given segment of text was rendered in the original source text, and may thus express a very large range of typographic or other features, by no means restricted to typeface, type size, etc. The |
356 | COHQW | attribute, by contrast, defines the way the source text was rendered using a formally defined style language, such as the W3C standard Cascading Stylesheet Language ( |
359 | COHQW | attribute is used to point to one or more fragments expressed using such a language which have been predefined in the TEI header using the |
370 | COHQW | for analytic purposes, it is in general more useful to know the intended function of a highlighted phrase than simply that it is distinct. |
373 | COHQW | In many, if not most, cases the underlying function of a highlighted phrase will be obvious and non-controversial, since the distinctions indicated by a change of highlighting correspond with distinctions discussed elsewhere in these Guidelines. The elements available to record such distinctions are, for the most part, members of the |
377 | COHQW | class mentioned above constitute the |
381 | COHQW | The distinction between the two classes is simple, and typified by the two elements |
385 | COHQW | : the former marks simply that a passage is typographically distinct in some way, while the latter asserts that a passage is linguistically emphasized for some purpose. These two properties, though often combined, are not identical. It should be recognized, however, that cases do exist in which it is not economically feasible to mark the underlying function (e.g. in the preparation of large text corpora), as well as cases in which it is not intellectually appropriate (as in the transcription of some older materials, or in the preparation of material for the study of typographic practice). In such cases, the |
408 | COHQHF | Words or phrases which are not in the main language of the text should be tagged as such, at least where the fact is indicated in the text. Where the word or phrase concerned is already distinguished from the rest of the text by virtue of its function (for example, because it is a name, a technical term, a quotation, a mentioned word, etc.) then the global |
410 | COHQHF | attribute should be used to specify additionally that its language distinguishes it from the surrounding text. Any element in the TEI scheme may take a |
412 | COHQHF | attribute, which specifies both the writing system and the language used by its content (see sections |
430 | COHQHF | element should not be used to represent foreign words which are mentioned or glossed within the text: for these use the appropriate element from section |
444 | COHQHF | Elements which do not explicitly state the language of their content by means of an |
446 | COHQHF | attribute are understood to inherit a value for it from their parent element. In the general case, therefore, it is recommended practice to supply a default value for |
448 | COHQHF | on the root |
468 | COHQHE | element. In printed works, emphasis is generally indicated by devices such as the use of an italic font, a large typeface, or extra wide letter spacing; in manuscripts and typescripts, it is usually indicated by the use of underlining. As the following examples demonstrate, an encoder may choose whether or not to make explicit the particular type of rendition associated with the emphasis. If a source text consistently renders a particular feature (e.g. emphasis or words in foreign languages) in a particular way, the rendering associated with that feature may be described in the TEI header using the |
476 | COHQHE | attributes may then be used to describe examples which deviate from the norm. For example, assuming that the TEI header has defined a default rendering for the |
483 | COHQHE | If on the other hand no such default has been defined for the element, the encoder may specify it informally using the |
489 | COHQHE | If the encoder wishes to express information about the rendition used in the source using a formal language such as CSS, then the |
497 | COHQHE | In cases where the rendition of a source needs to be indicated several times in a document, it may be more convenient to provide a default value using the |
499 | COHQHE | element in the header. If a small number of distinct values are required, it may also be convenient to define them all by means of a series of |
501 | COHQHE | elements which can then be referenced from the elements in question by means of the global |
528 | COHQHE | attribute, as discussed above, without however taking a position as to the function of the highlighting. This may also be useful if the text is to be processed in two stages: representing simply typographic distinctions during a first pass, and then replacing the |
554 | COHQHE | in the sense |
574 | COHQHD | element is provided for this purpose. Its attributes allow for additional information characterizing the nature of the linguistic distinction to be made in two distinct ways: the |
576 | COHQHD | attribute simply assigns a user-defined code of some kind to the word or phrase which assigns it to some register, sub-language, etc. No recommendations as to the set of values for this attribute are provided at this time, as little consensus exists in the field. |
578 | COHQHD | Alternatively, the remaining three attributes may be used in combination to place a word or phrase on a three-dimensional scale sometimes used in descriptive linguistics, as for example in |
598 | COHQHD | that is, with respect to a social classification, for example as technical, polite, impolite, restricted, etc. Again, no recommendations are made for the values of these attributes at this time; the encoder should provide a description of the scheme used in the appropriate section of the header (see section |
614 | COHQHD | should be preferred to these simple characterizations. It may also be preferable to record the kinds of analysis suggested here by means of the simple annotation element |
628 | COHQQ | One form of presentational variation found particularly frequently in written and printed texts is the use of quotation marks. As with the typographic variations discussed in the preceding section, it is generally helpful to separate the encoding of the underlying textual feature (for example, a quotation or a piece of direct speech) from the encoding of its rendering (for example, the use of a particular style of quotation marks). |
630 | COHQQ | This section discusses the following elements, all of which are often rendered by the use of quotation marks: |
663 | COHQQ | The most common and important use of quotation marks is, of course, to mark |
664 | COHQQ | quotation |
665 | COHQQ | , by which we mean simply any part of the text which the author or narrator wishes to attribute to some agency other than the narrative voice. The |
667 | COHQQ | element may be used if no further distinction beyond this is judged necessary. If it is felt necessary to distinguish such passages further, for example to indicate whether they are regarded as speech, writing, or thought, either the |
673 | COHQQ | for words or phrases represented as being spoken or thought by people or characters within the current work. The |
675 | COHQQ | element is used for cases where the author or narrator distances him or herself from the words in question without however attributing them to any other voice in particular. The |
677 | COHQQ | element is appropriate for a case where a word or phrase is being discussed in the body of a text rather than forming part of the text directly. |
679 | COHQQ | As noted above, if the distinction among these various reasons why a passage is offset from surrounding text cannot be made reliably, or is not of interest, then any representation of speech, thought, or writing may simply be marked using the |
683 | COHQQ | Quotation may be indicated in a printed source by changes in typeface, by special punctuation marks (single or double or angled quotes, dashes, etc.) and by layout (indented paragraphs, etc.), or it may not be explicitly represented at all. If these characteristics are of interest, one or other of the global |
690 | COHQQ | Quotation marks themselves may, like other punctuation marks, be felt for some purposes to be worth retaining within a text, quite independently of their description by the |
692 | COHQQ | attribute. This should generally be done using the appropriate Unicode character, or, if this is not possible, a numeric character reference (see |
693 | COHQQ | ). If the encoder decides both to retain the quotation marks and to represent their function by means of an explicit tag such as |
695 | COHQQ | , the quotation marks should be included within the element, rather than outside it, as in the first example below: |
703 | COHQQ | Alternatively, since this use of the leading mdash is very common typographic practice, it may be considered unnecessary to retain it in the encoding. Its presence in the source might instead be signalled using one of the attributes |
711 | COHQQ | element, which can then be referenced using the |
729 | COHQQ | element provided in the TEI header (see |
730 | COHQQ | ) to indicate that quotation marks have not been retained in the encoding; their presence in the source is implied by the |
734 | COHQQ | Whether or not the quotation marks are suppressed, their presence and nature may be described using some appropriate set of conventions in the |
748 | COHQQ | . If the rendition of passages tagged as |
750 | COHQQ | is uniform throughout a text, then the |
754 | COHQQ | element in the header may be used to specify a default rendering, in which case the same section might simply be tagged: |
779 | COHQQ | This may be used to make explicit who is speaking: |
794 | COHQQ | attribute may be supplied whether or not an indication of the speaker is given explicitly in the text. It may take the form (as above) of a normalized form of the speaker's name, but its role is to act as a pointer to a location elsewhere in the text, or another document, where data about each speaker may be supplied. While this attribute could point to any source of information about the speaker available by a URI, the most appropriate place for such information is within the |
796 | COHQQ | component of the TEI header, as further discussed in |
797 | COHQQ | but for simple cases like the above, a simple list of speakers located in the front or back matter of the text may suffice. |
799 | COHQQ | It may also be useful to distinguish representations of speech from representations of thought, in modern printed texts often indicated by a change of typeface. The |
809 | COHQQ | Quoted matter may be embedded within quoted matter, as when one speaker reports the speech of another: |
822 | COHQQ | Direct speech nested in this way is treated in the same way as elsewhere: a change of rendition may occur, but the same element should be used. An encoder may however choose to distinguish between direct speech which contains quotations from extra-textual matter and direct speech itself, as in the following example: |
839 | COHQQ | element may be used to group together the quotation and its associated bibliographic reference, which should be encoded using the elements for bibliographic references discussed in section |
860 | COHQQ | Like other bibliographic references, the citation associated with a quotation may be represented simply by a cross-reference, as in this example: |
869 | COHQQ | impractical. In such circumstances, the quotation can be linked to a bibliographical reference using |
883 | COHQQ | Unlike most of the other elements discussed in this chapter, direct speech and quotations may frequently contain other high-level elements such as paragraphs or verse lines, as well as being themselves contained by such elements. Three possible solutions exist for this well-known structural problem: |
885 | COHQQ | the quotation is broken into segments, each of which is entirely contained within a paragraph |
887 | COHQQ | the quotation is marked up using stand-off markup |
889 | COHQQ | the quotation boundaries are represented by empty segment boundary delimiter elements |
896 | COHQQ | is provided for all cases in which quotation marks are used to distance the quoted text from the narrator or speaker. Common examples include the |
932 | COHQU | This section describes a set of textual elements which are used to provide a gloss, alternate identification, or description of something. |
934 | COHQU | Technical terms are often italicized or emboldened upon first mention in printed texts; an explanation or gloss is sometimes given in quotation marks. Linguistic analyses conventionally cite words in languages under discussion in italics, providing a gloss immediately following marked with single quotation marks. Other texts in which individual words or phrases are |
935 | COHQU | mentioned |
943 | COHQU | may mark them either with italics or with quotation marks, and will gloss them less regularly. |
957 | COHQU | is present, it may be linked to the term it is glossing by means of its |
961 | COHQU | value to the |
965 | COHQU | element and provide that id as the value of the |
999 | COHQU | For technical terminology in particular, and generally in terminological studies, it may be useful to associate an instance of a term within a text with a canonical definition for it, which is stored either elsewhere in the same text (for example in a glossary of terms) or externally, for example in a database, authority file, or published standard. The attributes |
1008 | COHQU | Another group of elements is used to supply different kinds of names for objects described by the TEI. Examples of this are documentation of elements, attributes, classes (and also attribute values where appropriate), and description of glyphs. |
1015 | COHQU | element mentioned above, these elements constitute the |
1039 | COHQHEG | This encoding would, however, lose the important distinction between an italicized title and an italicized foreign phrase. Many other phrases might also be italicized in the text, and a retrieval program seeking to identify foreign terms (for example) would not be able to produce reliable results by simply looking for italicized words. Where economic and intellectual constraints permit, therefore, it would be preferable to encode both the function of the highlighted phrases and their appearance, as follows: |
1049 | COHQHEG | debatings. She says I am |
1068 | COHQHEG | ; the former is emphasized, while the latter is proverbial. It also provides an ironic gloss for the words |
1074 | COHQHEG | . The glossed phrases are not, however, technical terms or cited words, but quoted phrases, as if the writer were putting words into her own and her mother's mouths. Finally, the words |
1111 | COED | As in editing a printed text, so in encoding a text in electronic form, it may be necessary to accommodate editorial comment on the text and to render account of any changes made to the text in preparing it. The tags described in this section may be used to record such editorial interventions, whether made by the encoder, by the editor of a printed edition used as a copy text, by earlier editors, or by the copyists of manuscripts. |
1117 | COED | . The examples given here illustrate only simple cases of editorial intervention; in particular, they permit economical encoding of a simple set of alternative readings of a short span of text. To encode multiple views of large or heterogeneous spans of text, the mechanisms described in chapter |
1123 | COED | , that is, a code indicating the person or agency responsible for making the editorial intervention in question, and also an indication of the degree of |
1124 | COED | certainty |
1138 | COED | Many of the elements discussed here can be used in two ways. Their primary purpose is to indicate that the text encoded as the element's content represents an editorial intervention (or non-intervention) of a specific kind, indicated by the element itself. However, pairs or other meaningful groupings of such elements can also be supplied, wrapped within a special purpose |
1143 | COED | This element enables the encoder to represent for example a text in its |
1145 | COED | uncorrected and unaltered form, alongside the same text in one or more |
1148 | COED | view |
1149 | COED | of a text and another, so that (for example) a stylesheet may be set to display either the text in its original form or after the application of editorial interventions of particular kinds. |
1153 | COED | class. The default members of this class are |
1177 | COED | indication or correction of apparent errors |
1188 | COEDCOR | When the copy text is manifestly faulty, an encoder or transcriber may elect simply to correct it without comment, although for scholarly purposes it will often be more generally useful to record both the correction and the original state of the text. The elements described here enable all three approaches, and allow the last to be done in such a way as to make it easy for software to present either the original or the correction. |
1193 | COEDCOR | The following examples show alternative treatment of the same material. The copy text reads: |
1194 | COEDCOR | Another property of computer-assisted historical research is that data modelling must permit any one textual feature or part of a textual feature to be a part of more than one information model and to allow the researcher to draw on several such models simultaneously, for example, to select from a machine-readable text those marginal comments which indicate that the date's mentioned in the main body of the text are incorrect. |
1196 | COEDCOR | An encoder may choose to correct the typographic error, either silently or with an indication that a correction has been made, as follows: |
1206 | COEDCOR | If the encoder elects both to record the original source text and to provide a correction for the sake of word-search and other programs, both |
1226 | COEDCOR | If it is desired to indicate the person or edition responsible for the emendation, this might be done as follows: |
1243 | COEDCOR | attribute has been used to indicate responsibility for the correction. Its value ( |
1250 | COEDCOR | element within the TEI header, but any element might be indicated in this way, including for example a |
1269 | COEDCOR | Where, as here, the correction takes the form of adding text not otherwise present in the text being encoded, the encoder should use the |
1271 | COEDCOR | element. Where the correction is present in the text being encoded, and consists of some combination of visible additions and deletions, the elements |
1276 | COEDCOR | below. Where the correction takes the form of addition of material not present in the original because of physical damage or illegibility, the |
1279 | COEDCOR | correction |
1282 | COEDCOR | element may be used. These and other elements to support the detailed encoding of authorial or scribal interventions of this kind are all provided by the module described in chapter |
1292 | COEDREG | When the source text makes extensive use of variant forms or non-standard spellings, it may be desirable for a number of reasons to |
1299 | COEDREG | In some contexts, the term |
1304 | COEDREG | As with other such changes to the copy text, the changes may be made silently (in which case the TEI header should specify the types of silent changes made) or may be explicitly marked using the following elements: |
1340 | COEDREG | Alternatively, the encoder may elect to record both old and new spellings, so that (for example) the same electronic text may serve as the basis of an old- or new-spelling edition: |
1369 | COEDADD | The following elements are used to indicate when words or phrases have been omitted from, added to, or marked for deletion from, a text. Like the other editorial elements, they allow for a wide range of editorial practices: |
1376 | COEDADD | Encoders may choose to omit parts of the copy text for reasons ranging from illegibility of the source or impossibility of transcribing it, to editorial policy, e.g. a systematic exclusion of poetry or prose from an encoding. The full details of the policy decisions concerned should be documented in the TEI header (see section |
1377 | COEDADD | ). Each place in the text at which omission has taken place should be marked with a |
1379 | COEDADD | element, optionally with further information about the reason for the omission, its extent, and the person or agency responsible for it, as in the following examples: |
1380 | COEDADD | Note that the extent of the gap may be marked precisely using attributes |
1386 | COEDADD | attribute. Other, more detailed, options are also available for representing dimensions of any kind; see further |
1391 | COEDADD | element may be used to supply a description of the material omitted, where that is considered useful: |
1407 | COEDADD | elements may be used to record where words or phrases have been added or deleted in the copy text. They are not appropriate where longer passages have been added or deleted, which span several elements; for these, the elements |
1414 | COEDADD | Additions to a text may be recorded for a number of reasons. Sometimes they are marked in a distinctive way in the source text, for example by brackets or insertion above the line ( |
1417 | COEDADD | additions |
1429 | COEDADD | element should not be used to mark editorial changes, such as supplying a word omitted by mistake from the source text or a passage present in another version. In these cases, either the |
1438 | COEDADD | element is used to mark passages in the original which cannot be read with confidence, or about which the transcriber is uncertain for other reasons, as for example when transcribing a partially inaudible or illegible source. Its |
1444 | COEDADD | element, to indicate the cause of uncertainty and the person responsible for the conjectured reading. |
1450 | COEDADD | or from a spoken text: |
1456 | COEDADD | Where the material affected is entirely illegible or inaudible, the |
1462 | COEDADD | element is used to mark material which is deleted in the source but which can still be read with some degree of confidence, as opposed to material which has been omitted by the encoder or transcriber either because it is entirely illegible or for some other reason. This is of particular importance in transcribing manuscript material, though deletion is also found in printed texts, sometimes for humorous purposes: |
1476 | COEDADD | attribute may be used to distinguish different methods of deletion in manuscript or typescript material, as in this line from the typescript of Eliot's |
1492 | COEDADD | provides a way of grouping additions and deletions of this kind. |
1496 | COEDADD | element should not be used where the deletion is such that material cannot be read with confidence, or read at all, or where the material has been omitted by the transcriber or editor for some other reason. Where the material deleted cannot be read with confidence, the |
1498 | COEDADD | tag should be used with the |
1500 | COEDADD | attribute indicating that the difficulty of transcription is due to deletion. Where material has been omitted by the transcriber or editor, this may be indicated by use of the |
1506 | COEDADD | element. Text supplied or marked as unnecessary by an editor should be marked with the |
1515 | COEDADD | . These two sets of elements allow the encoder to distinguish editorial changes from those visible in the source text. |
1525 | CONA | This section describes a number of textual features which it is often convenient to distinguish from their surrounding text. Names, dates, and numbers are likely to be of particular importance to the scholar treating a text as source for a database; distinguishing such items from the surrounding text is however equally important to the scholar primarily interested in lexis. |
1534 | CONARS | referring string |
1571 | CONARS | element may be used for any reference to a person, place, etc., not only to references in the form of a proper noun or noun phrase. |
1580 | CONARS | element by contrast is provided for the special case of referencing strings which consist only of proper nouns; it may be used synonymously with the |
1582 | CONARS | element, or nested within it if a referring string contains a mixture of common and proper nouns. The following example shows an alternative way of encoding the short sentence from |
1594 | CONARS | As the following example shows, a proper name may be nested within a referring string: |
1599 | CONARS | Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as |
1603 | CONARS | may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer. |
1605 | CONARS | Two issues arise in this context: firstly, there may be a need to encode a regularized form of a name, distinct from the actual form in the source to hand; secondly, there may be a need to identify the particular person, place, etc. referred to by the name, irrespective of whether the name itself is normalized or not. The element |
1623 | CONARS | A very useful application for them is as a means of gathering together all references to the same individual or location scattered throughout a document: |
1641 | CONARS | The value of the |
1643 | CONARS | attribute may be an unexpanded code, as in the examples above, with no particular significance. More usually however, it will be an externally defined code of some kind, as provided by a standard reference source. |
1649 | CONARS | The standard reference source should be documented using a |
1651 | CONARS | element in the TEI header. |
1655 | CONARS | attribute can be used to point directly to some other resource providing more information about the entity named by the element, such as an authority record in a database, an encyclopaedia entry, another element in the same or a different document, etc. |
1663 | CONARS | (regularization) element to provide the standard form of a referring string, as in this example: |
1673 | CONARS | attribute, since its form will depend entirely on practice within a given project. For the same reason, this attribute is not recommended in data interchange, since there is no way of ensuring that the values used by one project are distinct from those used by another. In such a situation, a preferable approach for magic tokens which follows standard practice on the Web is to use a |
1675 | CONARS | attribute whose value is a tag URI as defined in |
1684 | CONARS | The inclusion of the domain name of the party responsible for tagging ( |
1686 | CONARS | ), as specified in RFC 4151, helps ensure uniqueness of magic token values across TEI encoding projects, allowing for improved interchange of TEI documents. |
1691 | CONARS | may be used if it is desired to record both a normalized form of a name and the name used in the source being encoded: |
1707 | CONARS | may be more appropriate if the function of the regularization is to provide a consistent index: |
1713 | CONARS | Although adequate for many simple applications, these methods have two inconveniences: if the name occurs many times, then its regularized form must be repeated many times; and the burden of additional XML markup in the body of the text may be inconvenient to maintain and complex to process. For applications such as onomastics, relating to persons or places named rather than the name itself, or wherever a detailed analysis of the component parts of a name is needed, the specialized elements described in chapter |
1730 | CONAAD | elements; for other kinds of address this class may be extended by adding new elements if necessary. |
1732 | CONAAD | These Guidelines provide no particular means for encoding the substructure of an email address (for example, distinguishing the local part from the domain part), nor of distinguishing personal email addresses from generic or fictitious ones. |
1738 | CONAAD | The simplest way of encoding a postal address is to regard it as a series of distinct lines, just as they might be written on an envelope. The following element supports this view: |
1739 | CONAAD | Here is an example of a postal address encoded using this approach: |
1749 | CONAAD | Alternatively, an address may be encoded as a structure of more semantically rich elements. The |
1751 | CONAAD | element class identifies a number of such possible components: |
1756 | CONAAD | Any number of elements from the |
1758 | CONAAD | class may appear within an address and in any order. None of them is required. |
1760 | CONAAD | Where code letters are commonly used in addresses (for example, to identify regions or countries) a useful practice is to supply the full name of the region or country as the content of the element, but to supply the abbreviatory code as the value of the global |
1762 | CONAAD | attribute, so that (for example) an application preparing formatted labels can readily find the required information. Other components of addresses may be represented using the general-purpose |
1764 | CONAAD | element or (when the additional module for names and dates is included) the more specialized elements provided for that purpose. |
1766 | CONAAD | Using just the elements defined by the core module, the above address could thus be represented as follows: |
1778 | CONAAD | The order of elements within an address is highly culture-specific, and is therefore unconstrained: |
1792 | CONAAD | A telephone number (normally outside of the |
1798 | CONAAD | , with the number itself appearing in the |
1806 | CONAAD | . A full postal address may also include the name of the addressee, tagged as above using the general purpose |
1811 | CONAAD | , a large number of more specific elements such as |
1817 | CONAAD | . The above example might then be encoded as follows: |
1861 | CONANU | element provides a convenient method of distinguishing numbers from the surrounding text. For other kinds of application, numbers are only useful if normalized: here the |
1883 | CONANU | ; less frequently the number may be recognisable linguistically as such but may use a notation with which the encoder is unfamiliar. To help in these situations, the |
1893 | CONANU | measure |
1894 | CONANU | consists of a number, a phrase expressing units of measure and a phrase expressing the commodity being measured, though not all of these components need be present in every case. It may be helpful to distinguish measures from surrounding text for two reasons. Firstly, a measure may be expressed using a particular notation or system of abbreviations which the encoder does not wish to regard as lexical. Secondly, a quantitative application may wish to distinguish and normalize the internal components of a measure, in order to perform calculations on them. |
1896 | CONANU | Consider, as an example of the first case, the following list of Celia's charms, in which the encoder has chosen to make explicit the measurements: |
1931 | CONANU | In general, normalization of a measure will require specification of one or more of its three parts: the quantity, the units, and possibly also the commodity being measured. This is accomplished by supplying values for the three attributes |
1937 | CONANU | , which are supplied by the |
1946 | CONANU | Such techniques are particularly useful when representing historical data such as inventories: |
1962 | CONANU | element is provided as a means of grouping several related measurements together, either because the measurement involves several dimensions (for example height and width) or to avoid the need to repeat all the normalizing attributes: |
1983 | CONADA | Dates and times, like numbers, can appear in widely varying culture- and language-dependent forms, and can pose similar problems in automatic language processing. Such elements constitute the |
1985 | CONADA | class, of which the default members are: |
1989 | CONADA | These elements have some additional attributes by virtue of being members of the |
1993 | CONADA | classes which, in turn, are members of the |
2017 | CONADA | attribute by simply omitting a part of the value supplied. Imprecise dates or times (for example |
2020 | CONADA | some time after ten and before twelve |
2021 | CONADA | ) may be expressed as date or time ranges. |
2023 | CONADA | These mechanisms are useful primarily for fully specified dates or times known with certainty. If component parts of dates or times are to be marked up, or if a more complex analysis of the meaning of a temporal expression is required, the techniques described in chapter |
2026 | CONADA | Where the certainty (i.e. reliability) of the date or time is in question, the encoder should record this fact using the mechanisms discussed in chapter |
2027 | CONADA | . The same chapter also discusses various methods of recording the precision of numerical or temporal assertions. |
2040 | CONADA | attribute always supplies a normalized representation of the date given as content of the |
2047 | CONADA | date |
2059 | CONADA | time |
2063 | CONADA | There is one exception: these Guidelines permit a time to be expressed as only a number of hours, or as a number of hours and minutes, as per ISO 8601:2004 sections 4.2.2.3 and 4.3.3. The W3C |
2067 | CONADA | datatypes require that the minutes and seconds be included in the normalized value if they are to be correctly processed for example when sorting. |
2086 | CONADA | Note in the last example the use of a normalized representation for the date string which includes a time: this example could thus equally well be tagged using the |
2109 | CONADA | attribute may be used to specify a date in any calendar system; if the |
2111 | CONADA | attribute is also supplied, it should specify the equivalent date in the Gregorian calendar. |
2121 | CONAAB | It is sometimes desirable to mark abbreviations in the copy text, whether to trigger special processing for them, to provide the full form of the word or phrase abbreviated, or to allow for different possible expansions of the abbreviation. Abbreviations may be transcribed as they stand, or expanded; they may be left unmarked, or marked using these tags: |
2181 | CONAAB | Abbreviation is a particularly important feature of manuscript and other source materials, the transcription of which needs more detailed treatment than is possible using these simple elements. A more detailed set of recommendations is discussed in |
2182 | CONAAB | , which includes additional elements made available for the purpose by the |
2192 | COXR | Cross-references or links between one location in a document and one or more other locations, either in the same or different XML documents, may be encoded using the elements |
2198 | COXR | from one location in a document, the place that the element itself appears, to another (or to several), specified by means of a |
2200 | COXR | attribute, supplied by the |
2208 | COXR | The value of the |
2212 | COXR | mechanism. This permits a range of complexity, from the very simple (a reference to the value of the target element's |
2214 | COXR | attribute) to the more complex usage of a full URI with embedded XPointers. For example, the source of the following paragraph looks something like this: |
2226 | COXR | Alternatively, if no explicit link is to be encoded, but it is simply required to mark the phrase as a cross-reference, the |
2237 | COXR | ; for a discussion of TEI schemes for XPointer, see |
2247 | COXR | are the default members of the phrase-level model class |
2249 | COXR | . As members of the classes |
2267 | COXR | element may contain phrases specifying, or describing more exactly, the target of a cross-reference, which form the content of the element. Since its content thus serves as a human-readable pointer, in the simplest case a |
2279 | COXR | attribute, so that processing software can access it directly, for example to implement a linkage, to generate an appropriate reference, or to give an error message if it cannot be found. Assuming that section 12 in the previous example has been tagged |
2282 | COXR | then the same cross-reference might more exactly be encoded as |
2288 | COXR | If the cross-reference itself is to be generated according to a fixed pattern, or if no text is to appear in the body of the cross-reference, the |
2300 | COXR | ); the definition it provides is used to translate the value of the |
2302 | COXR | attribute into a conventional pointer value, such as one that might be supplied by the |
2312 | COXR | attribute is used, a cross reference may point to any number of locations simultaneously, simply by giving more than one identifier as the value of its |
2314 | COXR | attribute. This may be particularly useful where an analytic index is to be encoded, as in the following example: |
2328 | COXR | , etc. have been provided in the body of the text, for example as page breaks |
2337 | COXR | A similar method may be used to link annotations on a text with the sigla used to encode their points of attachment in a text. For example: |
2358 | COXR | The value |
2364 | COXR | element here might be used to indicate that the object being referenced here is a bibliographic entry rather than a simple cross-reference to an illustration, as is the first |
2366 | COXR | . In either case, the value of the |
2373 | COXR | elements have many applications in addition to the simple cross-referencing facilities illustrated in this section. In conjunction with the analytic tools discussed in chapters |
2376 | COXR | , they may be used to link analyses of a text to their object, to combine corresponding segments of a text, or to align segments of a text with a temporal or other axis or with each other. |
2406 | COLI | list |
2407 | COLI | : numbered, lettered, bulleted, or unmarked. Lists formatted as such in the copy text should in general be encoded using this element, with an appropriate value for the |
2425 | COLI | Some of these values may of course be combined; a list may be inline, but also be rendered with numbers. An example appears below. For more sophisticated and detailed description of list rendering, consider using the |
2431 | COLI | Each distinct item in the list should be encoded as a distinct |
2433 | COLI | element. If the numbering or other identification for the items in a list is unremarkable and may be reconstructed by any processing program, no enumerator need be specified. If however an enumerator is retained in the encoded text, it may be supplied either by using the |
2457 | COLI | The two styles may not be mixed in the same list: if one item is preceded by a label, all must be. |
2459 | COLI | A list need not necessarily be displayed in list format. For example, the following is a reasonable encoding of a list which (in the original) is simply printed as a single paragraph: |
2492 | COLI | A list may be given a heading or title, for which the |
2496 | COLI | element to mark a tabular or glossary list in which each item is associated with a word or phrase rather than a numeric or alphabetic enumerator: |
2522 | COLI | In such a list, the individual items have internal structure. In complex cases, where list items contain many components, the list is better treated as a |
2523 | COLI | table |
2528 | COLI | . A particularly important instance of the simple two-column table is the |
2529 | COLI | glossary list |
2530 | COLI | , which should be marked by the tag |
2531 | COLI | list type="gloss" |
2534 | COLI | element contains a term and each |
2536 | COLI | its gloss; it is a semantic error for a list tagged with |
2567 | COLI | might be used to make explicit the role that each column in the glossary list has, as follows: |
2608 | COLI | ) element what language the term is from. For further discussion of the |
2617 | COLI | element used to supply a title or heading for the whole list, headings for the two columns of a glossary-style list may be specified using the two special elements |
2662 | COLI | , including other lists. In this example, a glossary list contains two items, each of which is itself a simple list: |
2705 | CONONO | The following element is provided for the encoding of discursive notes, whether already present in the copy text or supplied by the encoder: |
2708 | CONONO | A note is any additional comment found in a text, marked in some way as being out of the main textual stream. All notes should be marked using the same tag, |
2710 | CONONO | , whether they appear as block notes in the main text area, at the foot of the page, at the end of the chapter or volume, in the margin, or in some other place. |
2714 | CONONO | A note is usually attached to a specific point or span within a text, which we term here its |
2718 | CONONO | When encoding such a text, it is conventional to replace this siglum by the content of the annotation, duly marked up with a |
2720 | CONONO | element. This may not always be possible, for example with marginal notes, which may not be anchored to an exact location. For ease of processing, it may be adequate to position marginal notes before the relevant paragraph or other element. In printed texts, it is sometimes conventional to group notes together at the foot of the page on which their points of attachment appear. This practice is not generally recommended for TEI-encoded texts, since the pagination of a particular printed text is unlikely to be of structural significance. In some cases, however, it may be desirable to transcribe notes not at their point of attachment to the text but at their point of appearance, typically at the end of the volume, or the end of the chapter. In such cases, the |
2728 | CONONO | element, pointing from that to the body of the |
2732 | CONONO | In cases where the note is applied not to a point but to a span of text, not itself represented as a TEI element, the |
2736 | CONONO | function to specify the span of attachment. |
2743 | CONONO | attribute is used to categorise the note as a gloss: |
2757 | CONONO | element, we may infer that its point of attachment is in the margin adjacent to the line in question. In the following version of the same text, however, it may be inferred that the note applies to the whole of the stanza: |
2770 | CONONO | This type of annotation, very common in the early printed texts which Coleridge may be presumed to be imitating in this case, may also be regarded as providing a heading or descriptive label for the passage concerned. The encoder may therefore prefer to use the |
2785 | CONONO | In the following example, a note which appears at the foot of the page in the printed source is given at its point of attachment within the text. The global |
2787 | CONONO | attribute is used to indicate the note number: |
2801 | CONONO | In addition to transcribing notes already present in the copy text, researchers may wish to add their own notes or comments to it. The |
2811 | CONONO | attribute may be used to point to a definition of the person or other agency responsible for the content of the note. |
2813 | CONONO | As a simple example, an edition of the |
2829 | CONONO | ; thus in this case, the TEI header for this text might contain a title statement like the following: |
2840 | CONONO | When annotating the electronic text by means of analytic notes in some structured vocabulary, e.g. to specify the topics or themes of a text, the |
2844 | CONONO | elements may be more effective than the free form |
2846 | CONONO | element; these elements are available when the module for simple analysis is selected (see section |
2852 | CONOIX | The indexing of scholarly texts is a skilled activity, involving substantial amounts of human judgment and analysis. It should not therefore be assumed that simple searching and information retrieval software will be able to meet all the needs addressed by a well-crafted manual index, although it may complement them, for example by providing free text search. The role of an index is to provide access via keywords and phrases which are not necessarily present in the text itself, but must be added by the skill of the indexer. |
2856 | CONOIXpre | When encoding a pre-existing text, therefore, if such an index is present it may be advisable to retain it along with the text, rather than attempt to regenerate it automatically. Elements discussed elsewhere in these Guidelines may be used for this purpose. For example, the |
2860 | CONOIXpre | element may be used to mark the section of the text containing the index and the |
2862 | CONOIXpre | element might be used to mark the index itself, each entry being represented by an |
2864 | CONOIXpre | element, possibly containing within it a series of |
2896 | CONOIXpre | Note that this simple representation does not capture the nested structure of the first of these index entries. A more accurate representation might entail the use of nested lists like the following: |
2924 | CONOIXpre | elements above, might also include direct links to the appropriate location in the encoded text, using (for example) a target attribute to supply the identifier of an associated page break element: |
2932 | CONOIXpre | . Note that similar methods may also be used to encode a table of contents, as further exemplified in section |
2938 | CONOIXgen | It can also be useful, however, to generate a new index from a machine-readable text, whether the text is being written for the first time with the tags here defined, or as an addition to a text transcribed from some other source. Depending on the complexity of the text and its subject matter, such an automatically-generated index may not in itself satisfy all the needs of scholarly users. However, it can assist a professional indexer to construct a fully adequate index, which might then be post-edited into the digital text, marked up along the lines already suggested for preserving pre-existing index material. |
2948 | CONOIXgen | this element may be used simply to provide a descriptive or interpretive label of some kind for any location within a text, to be processed in any way by analytic software, but its main purpose is to facilitate the generation of an index for a printed version of the text. An |
2950 | CONOIXgen | element may be placed anywhere within a text, between or within other elements. The headwords to be used when making up this index are given by the |
2954 | CONOIXgen | element. The location of the generated index might be specified by means of a processing instruction within the text, such as the following (the exact form of the PI is of course dependent on the application software in use): |
2956 | CONOIXgen | Alternatively, the special purpose |
2960 | CONOIXgen | In the simplest case, a single headword is supplied by an |
2972 | CONOIXgen | The effect of this is to document an index entry for the term |
2974 | CONOIXgen | , which when processed could reference the location of the original |
2978 | CONOIXgen | If the subject of Arabic lemmatization is treated at length in a text, then the index entry generated may need to reference a sequence of locations (e.g. page numbers). In such a case it will be necessary to identify the end of the relevant span of text as well as its starting point. This is most conveniently done by supplying an empty |
2994 | CONOIXgen | This would generate the same index entries as the previous example, but the reference would be to the whole span of text between the location of the |
2996 | CONOIXgen | element and the location of the element identified by the code |
2998 | CONOIXgen | , rather than a single point, and thus might (for example) include a sequence of page numbers. |
3002 | CONOIXgen | element in the text provides the target location that will be specified in the generated index entry, no part of the text itself is used to construct that entry. Index terms appearing in the entry come solely from the content of |
3004 | CONOIXgen | elements, which consequently may have to repeat words or phrases from the text proper. This need not be done verbatim, thus giving scope for normalization of spelling (as in the example above) or other modifications which may assist generation of an index in a desired form or sequence. |
3006 | CONOIXgen | Sometimes, for example when index terms are taken from a different language or consist of mathematical formulae or other expressions, even a normalized form of an index term may be insufficient for an application to order it exactly as desired. The |
3008 | CONOIXgen | attribute may be used to address this problem, as in the following example: |
3012 | CONOIXgen | Here, an entry for the symbol @ will appear in the index, but will be sorted alphabetically as if it were the string |
3014 | CONOIXgen | . This technique is also useful when an index entry is to contain some non-Unicode character or glyph represented by the |
3017 | CONOIXgen | . In the following example, we assume that somewhere a definition for this glyph has been provided using the elements described in chapter |
3018 | CONOIXgen | , and given the code |
3027 | CONOIXgen | Note that if no value is supplied for the sortKey attribute, a sorting application should always use the content of the |
3031 | CONOIXgen | It is common practice to compile more than one index for a given text. A biography of a poet, for example, may offer an index of references to poems by the subject of the study, another index of works by other writers, an index of places or historical personages, etc. The indexName attribute is used to assign index terms and locations to one or more specific indexes: |
3039 | CONOIXgen | TEI |
3042 | CONOIXgen | , an index may contain structured entries like |
3043 | CONOIXgen | TEI, markup practices, index terms |
3044 | CONOIXgen | , where a top level entry |
3045 | CONOIXgen | TEI |
3046 | CONOIXgen | is followed by a number of second-level subcategories, any or all of which may have a third-level list attached to them and so on. In order to reflect such a hierarchical index listing, |
3048 | CONOIXgen | elements may be nested to the required depth. For example, suppose that we wish to make a structured index entry for |
3054 | CONOIXgen | , etc. The example at the start of this section might then be encoded with nested |
3067 | CONOIXgen | The index entry from Burton's |
3069 | CONOIXgen | quoted above might be generated in a similar way. To generate such an entry, the body of the text might include, at page 193, an |
3081 | CONOIXgen | . Similarly, page 601 of the body text would include an |
3109 | CONOIXgen | elements, the duplication required to make the structure explicit will normally be removed, so as to produce entries like those quoted above. However, this is not required by the encoding recommended here. |
3113 | CONOIXgen | element may be used to mark the place at which an index generated from |
3115 | CONOIXgen | elements should be inserted into the output of a processing program; typically but not necessarily this will be at some point within the back matter of the document. If the |
3117 | CONOIXgen | element is used, then the |
3119 | CONOIXgen | attribute should be used to specify which kind of index is to be generated, and its value should correspond with that of the |
3140 | CONOIXgen | attribute may also be used to specify a name or identifier for the generated index itself in the usual way. Any additional headings etc. required for the generated index must be specified as content of the |
3152 | CONOIXgen | If a processing instruction is used, then these parameters for the generated index may be supplied in some other way. |
3154 | CONOIXgen | One final feature frequently found in manually-created indexes to printed works cannot readily be encoded by the means provided here, namely cross-references internal to the index term listing. For example, if all references to the TEI in a text have been indexed using the index term |
3156 | CONOIXgen | , it may also be helpful to include an entry under the term |
3157 | CONOIXgen | TEI |
3158 | CONOIXgen | containing some text such as |
3171 | COGR | Graphics, such as illustrations or diagrams, appear in many different kinds of text, and often with different purposes. Audio or video clips may also appear. In some cases, such media form an integral part of a text (indeed, some texts—comic books for example—may be almost entirely graphic); in others the graphic or video may be a kind of optional extra. In some cases, the text may be incomprehensible unless the media is included; in others, the presence of the media adds little to the sense of the work. It will therefore be a matter of encoding policy as to whether or how media found in a source text are transferred to a new encoded version of the same. In documents which are |
3173 | COGR | , media such as graphics and other non-textual components may be particularly salient, but their inclusion in an archival form of the document concerned remains an editorial decision. |
3175 | COGR | Considered as structural components, media may be anchored to a particular point in the text, or they may |
3177 | COGR | either completely freely, or within some defined scope, such as a chapter or section. Time-based media such as audio or video may need to be synchronized with particular parts of a written text. Media of all kinds often contain associated text such as a heading or label. These Guidelines provide the following different elements to indicate their appearance within a text: |
3185 | COGR | Media files may be encoded in a number of different ways: |
3187 | COGR | in some non-XML or binary format such as PNG, JPEG, MP3, MP4 etc. |
3191 | COGR | in a TEI XML format such as the notation for graphs and trees described in |
3193 | COGR | In the last two cases, the presence of the graphic will be indicated by an appropriate XML element, drawn from the SVG namespace in the second case, and its content will fully define the graphic to be produced. In the first case, however, one of the elements |
3197 | COGR | is used to mark the presence of the graphic only and the visual content itself is stored outside the XML document at a location referenced by means of an |
3201 | COGR | class. Alternatively, if it is small, the media information may be embedded directly within the document using some suitable binary format such as Base64; in this case the |
3213 | COGR | when this module is included in a schema. These elements are also members of the class |
3220 | COGR | For example, the following passage indicates that a copy of the image found in the source text may be recovered from the URL |
3228 | COGR | The media elements are phrase level elements which may be used anywhere that textual content is permitted, within but not between paragraphs or headings. In the following example, the encoder has decided to treat a specific printer's ornament as a heading: |
3235 | COGR | provides additional capabilities, for example the ability to combine a number of images into a hierarchically organized structure or a block of images. The |
3239 | COGR | attribute, which can be used to distinguish different kinds of graphic component within a single work, for example, maps as opposed to illustrations. It also provides the ability to associate an image with additional information such as a heading or a description. |
3250 | CORS | we mean the system by which names or references are associated with particular passages of a text (e.g. |
3252 | CORS | for the third verse of Psalm 23 or |
3256 | CORS | , book 2, poem 10, line 7). Such names make it possible to mark a place within a text and enable other readers to find it again. A reference system may be based on structural units (chapters, paragraphs, sentences; stanza and verse), typographic units (page and line numbers), or divisions created specifically for reference purposes (chapter and verse in Biblical texts). Where one exists, the traditional reference system for a text should be preserved in an electronic transcript of it, if only to make it easier to compare electronic and non-electronic versions of the text. |
3260 | CORS | where a reference system exists, and is based on the same logical structure as that of the text's markup, the reference for a passage may be recorded as the value of the global |
3274 | CORS | where a reference system exists which is not based on the same logical structure as that of the text's markup (for example, one based on the page and line numbers of particular editions of the text rather than on the structural divisions of it), any of a variety of methods for encoding the logical structure representing the reference system may be employed, as described in chapter |
3277 | CORS | where a reference system exists which does not correspond to any particular logical structure, or where the logical structure concerned is of no interest to the encoder except as a means of supporting the referencing system, then references may be encoded by means of |
3279 | CORS | elements, which simply mark points in the text at which values in the reference system change, as described below in section |
3281 | CORS | The specific method used to record traditional or new reference systems for a text should be declared in the TEI header, as further described in section |
3285 | CORS | When a text has no pre-existing associated reference system of any kind, these Guidelines recommend as a minimum that at least the page boundaries of the source text be marked using one of the methods outlined in this section. Retaining page breaks in the markup is also recommended for texts which have a detailed reference system of their own. Line breaks in prose texts may be, but need not be, tagged. |
3286 | CORS | Many encoders find it convenient to retain the line breaks of the original during data entry, to simplify proofreading, but this may be done without inserting a tag for each line break of the original. |
3294 | CORS1 | When traditional reference schemes represent a hierarchical structuring of the text which mirrors that of the marked-up document, the |
3298 | CORS1 | attribute may also be used to record the numbering of sections or list items in the copy text if the copy-text numbering is important for some reason, for example because the numbers are out of sequence. |
3304 | CORS1 | —book 2, poem 10, line 7. Book, poem, and line are structural units of the work and will therefore be tagged in any case. (See chapter |
3305 | CORS1 | for a discussion of structural units in verse collections.) In such cases, it is convenient to record traditional reference numbers of the structural units using the |
3328 | CORS1 | One may also place the entire standard reference for each portion of the text into the appropriate value for the |
3330 | CORS1 | attribute, though for obvious reasons this takes more space in the file: |
3347 | CORS1 | If the names used by the traditional reference system can be formulated as identifiers, then the references can be given as values for the |
3353 | CORS1 | attribute must be unique throughout the document. Our example then looks like this: |
3370 | CORS1 | To document the usage and to allow automatic processing of these standard references, it is recommended that the TEI header be used to declare whether standard references are recorded in the |
3379 | CORS1 | attribute one can specify only a single standard referencing system, a limitation not without problems, since some editions may define structural units differently and thus create alternative reference systems. For example, another edition of the |
3381 | CORS1 | considers poem 10 a continuation of poem 9, and therefore would specify the same line as |
3388 | CORS2 | If a text has no canonical reference system of its own, a new custom reference system may be used. |
3402 | CORS2 | Determining a referencing system for a TEI encoding depends on many factors that may either be derived from textual structure, or influenced by extra-textual contingencies such as project and file management concerns. It is important, therefore, that the attribute used, the elements which can bear standard reference identifiers, and the method for constructing standard reference identifiers, should all be declared in the header as described in section |
3410 | CORS2-1 | A new referencing system may be derived from the structure of the electronic text, specifically from the markup of the text. As with any reference system intended for long-term use, it is important to see the reference as an established, unchanging point in the text. Should the text be revised or rearranged, the reference-system identifiers associated with any section of text must stay with that section of text, even if it means the reference numbers fall out of sequence. (A new reference system may always be created beside the old one if out-of-sequence numbers must be avoided.) |
3417 | CORS2-1 | domain-style address |
3418 | CORS2-1 | comprising a series of components separated by full stops, with one component for each level of the document hierarchy. Two methods may be used. In the |
3420 | CORS2-1 | form of identifier, each component in the identifier takes the form of an element identifier, a hyphen, and a number, for example |
3422 | CORS2-1 | . The element name specifies what type of element is to be sought, and the number specifies which occurrence of that element type is to be selected. (The hyphen and number may be omitted if there is only one element of the given type.) In the |
3424 | CORS2-1 | form of identifier, each component consists of a number, indicating which element in the sequence of nodes at each level is to be selected. To make the resulting identifier a valid XML identifier, it may need to be prefixed with an unchanging alphabetic letter. |
3434 | CORS2-1 | element may be taken as a starting point only if identifiers need to be generated for the |
3438 | CORS2-1 | element as a root would prevent assignment of identifiers for the front and back matter. The component corresponding to the root element can be omitted from identifiers, if no confusion will result. In collections and corpora, the component corresponding to the root may be replaced by the unique identifier assigned to the text or sample. |
3446 | CORS2-1 | value; the latter are prefixed with the string |
3490 | CORS2-1 | attribute is used to record the reference identifiers generated, each value should record the entire path. If the |
3492 | CORS2-1 | attribute is used, each value may record either the entire path or only the subpath from the parent element. The attribute used, the elements which can bear standard reference identifiers, and the method for constructing standard reference identifiers, should all be declared in the header as described in section |
3501 | CORS2-2 | attributes. Every convention will have strengths and weaknesses and it is left to encoders to make a decision that enables them to locate information in their TEI document. |
3503 | CORS2-2 | Here are some examples of referencing systems that have been used in TEI projects: |
3506 | CORS2-2 | identifiers constructed with a number of characters from the main document title, followed by an incremental number. E.g. HOL001, HOL002, etc. using a fixed number of digits; or without fixed digits: HOL1, HOL2, etc. |
3509 | CORS2-2 | identifiers constructed on the markup itself, as described in the previous section. To facilitate uniqueness in a corpus, each identifier may be prefixed with the identifier of the root |
3518 | CORS2-2 | XML well-formedness requires only that xml:id attributes be unique within a single document. However, it is also worth keeping in mind that for operating with referencing systems across a corpus of TEI files it is helpful (or even necessary in some circumstances) to have unique identifiers across the whole corpus. |
3522 | CORS2-2 | may be populated either computationally or manually. In the latter case, it is advisable to put measures in place to avoid human error. Custom data types and Schematron rules may be defined in a customization ODD, and a check digit may be added to prevent unwanted changes. |
3523 | CORS2-2 | A check digit is computed from the value of an identifier and appended to the value itself. If the identifier is subsequently changed, the check digit will no longer match, and the identifier will therefore be shown to be invalid. |
3530 | CORS5 | milestone |
3534 | CORS5 | These elements simply mark the points in a text at which some category in a reference system changes. They have no content but subdivide the text into regions, rather in the same way as milestones mark points along a road, thus implicitly dividing it into segments. The elements |
3542 | CORS5 | are specialized types of milestone, marking gathering, page, column, and line boundaries respectively. The global |
3544 | CORS5 | attribute is used in each case to provide a value for the particular unit associated with this milestone (for example, the page or line number). Since it is not structural, validation of a reference system based on |
3546 | CORS5 | s cannot readily be checked by an XML parser, so it will be the responsibility of the encoder or the application software to ensure that they are given in the correct order. |
3548 | CORS5 | Milestone elements are often used as a simple means of capturing the original appearance of an early printed text, which will rarely coincide exactly with structural units, but they are generally useful wherever a text has two or more competing structures. For example, many English novels were first published as serial works, individual parts of which do not always contain a whole number of chapters. An encoder might decide to represent the chapter-based structure using |
3603 | CORS5 | Similarly, when tagging dramatic verse one may wish to privilege stanzas and lines over speeches and speakers, particularly where speeches cross line and line group boundaries. One might also wish to mark changes in narrative voice in a prose text. In either case, a milestone tag may be used to indicate change of speaker: |
3614 | CORS5 | Milestone tags also make it possible to record the reference systems used in a number of different editions of the same work. The reference system of any one edition can be recreated from a text in which all are marked by simply ignoring all elements that do not specify that edition on their |
3618 | CORS5 | As a simple example, assuming that edition E1 of some collection of poems regards the first two poems as constituting the first book, while edition E2 regards the first poem as prefatory, a markup scheme like the following might be adopted: |
3629 | CORS5 | In this case no |
3631 | CORS5 | value is specified, since the numbers rise predictably and the application can keep a count from the start of the document, if desired. |
3633 | CORS5 | The value of the |
3649 | CORS5 | tags, line numbers may be supplied for every line or only periodically (every fifth, every tenth line). The latter may be simpler; the former is more reliable. |
3659 | CORS5 | could have been used equally well if preferred. The special value |
3661 | CORS5 | should be reserved for marking sections of text which fall outside the normal numbering system (e.g. chapter heads, poem numbers, titles, or speaker attributions in a verse drama). |
3663 | CORS5 | By default, there are no constraints on the values supplied for the |
3666 | CORS5 | may be used, for example to specify that the attribute must specify one of a predefined set of values. |
3671 | CORS5 | Milestone elements may be used to mark any kind of shift in the properties associated with a piece of text, whether or not it would normally be considered a reference system. For example, they may be used to mark changes in narrative voice in a prose text, or changes of speaker in a dramatic text, where these are not marked using structural elements such as |
3677 | CORS5 | above, milestone elements such as |
3681 | CORS5 | represent whitespace and are therefore by default assumed to occur between orthographic tokens in the text, where these are not otherwise indicated. By default it is reasonable to assume that words are not broken across page or line boundaries, and that therefore a sequence such as |
3694 | CORS5 | attribute is provided to change the default assumption. To make explicit that |
3699 | CORS5 | Where hyphenation appears before a line or page break, the encoder may or may not choose to record the fact, either explicitly using an appropriate Unicode character, or descriptively for example by means of the |
3714 | CORS6 | Whatever kind of reference system is used in an electronic text, it is recommended that the TEI header contain a description of its construction in the |
3734 | CORS6 | tags. The header section for such an encoding should look something like this: |
3807 | CORS6 | tags, but giving the reference string in full on each tag. If canonical references are made only to lines, the reference system could be declared as follows: |
3810 | CORS6 | Since the entire regular expression is enclosed as a parenthetical subgroup, the entire canonical reference string is sought as the value of the |
3820 | CORS6 | This declaration indicates that the entire reference string must be sought as the value of the |
3832 | CORS6 | The third example encodes the same reference system, this time giving the entire reference string as the value of the |
3837 | CORS6 | although in general there seems to be little advantage in this case: it is no more difficult to use a standard relative URI reference as the value of |
3841 | CORS6 | Reference systems recorded by means of milestone tags can also be declared; the following prose description could be used to declare the example given in section |
3846 | CORS6 | Or in this way, using a formal declaration for this reference scheme derived from edition |
3859 | COBI | Bibliographic references (that is, full descriptions of bibliographic items such as books, articles, films, broadcasts, songs, etc.) or pointers to them may appear at various places in a TEI text. They are required at several points within the TEI header's source description, as discussed in section |
3860 | COBI | ; they may also appear within the body of a text, either singly (for example within a footnote), or collected together in a list as a distinct part of a text; detailed bibliographic descriptions of manuscript or other source materials may also be required. These Guidelines propose a number of specialized elements to encode such descriptions, which together constitute the |
3869 | COBI | In printed texts, the individual constituents of a bibliographic reference are conventionally marked off from each other and from the flow of text by such features as bracketing, italics, special punctuation conventions, underlining, etc. In electronic texts, such distinctions are also important, whether in order to produce acceptably formatted output or to facilitate intelligent retrieval processing, |
3872 | COBI | as an author's name from |
3874 | COBI | as a place of publication or as a component of a title. |
3877 | COBI | It should be emphasized that for references as for other textual features, the primary or sole consideration is not how the text should be formatted when it is printed. The distinctions permitted by the scheme outlined here may not necessarily be all that particular formatters or bibliographic styles require, although they should prove adequate to the needs of many such commonly used software systems. |
3882 | COBI | structures, though the nature of their design prevents a simple one-to-one mapping from their data elements to TEI elements. For further information, see section |
3885 | COBI | ) constitute a set which has been useful for a wide range of bibliographic purposes and in many applications, and which moreover corresponds to a great extent with existing bibliographic and library cataloguing practice. For a fuller account of that practice as applied to electronic texts see section |
3901 | COBI | element; instead, the presence and order of child elements must be used to reconstruct the punctuation required by a particular style. |
3905 | COBI | allows for considerable flexibility in that it can include both delimiting punctuation and unmarked-up text; and its constituents can also be ordered in any way. This makes it suitable for marking up bibliographies in existing documents, where it is considered important to preserve the form of references in the original document, while also distinguishing important pieces of information such as authors, dates, publishers, and so on. |
3907 | COBI | may also be useful when encoding |
3909 | COBI | documents which require use of a specific style guide when rendering the content; its flexibility makes it easier to provide all the information for a reference in the exact sequence required by the target rendering, including any necessary punctuation and linking words, rather than using an XSLT stylesheet or similar to reorder and punctuate the data. |
3915 | COBI | , has a content model based on the |
3917 | COBI | element of the TEI header. Both are based on the International Standard for Bibliographic Description (ISBD), which forms the basis of several national standards for bibliographic citations. The order of child elements in both |
3938 | COBI | resource identifier and terms of availability area |
3941 | COBI | , used with its child elements and without delimiting punctuation, provides an appropriate granularity of encoding with elements that can easily be rendered for the reader. However, it is important to note that some ISBD-derived citation formats (such as ANSI/NISO Z39.29 and ГОСТ 7.1) are not entirely conformant to ISBD either, since they may begin with a statement of authorship that does not map to the ISBD statement of responsibility. |
3947 | COBITY | class all share a number of possible component sub-elements. For the |
3957 | COBITY | Different levels of specific tagging may be appropriate in different situations. In some cases, it may be felt necessary to mark just the extent of the reference itself, with perhaps a few distinctions being made within it (for example, between the part of the reference which identifies a title or author and the rest). Such references, containing a mixture of text with specialized bibliographic elements, are regarded as |
3970 | COBITY | Some bibliographic references are extremely elliptical, often only a string of the form |
3972 | COBITY | . If no further details of Baxter's book are given in the source text and none is supplied by the encoder, then the reference thus given should be tagged as a |
4032 | COBITY | element defined in the TEI header module. This element is provided as a means of embedding the file description of one existing digital text within that of another (see further section |
4053 | COBITY | A list of bibliographic items, of whatever kind, may be treated in the same way as any other list (see section |
4068 | COBITY | may contain only bibliographic elements, optionally preceded by a heading and a series of introductory paragraphs. For most purposes, good practice would usually require that a |
4145 | COBITY | s and |
4149 | COBITY | items, the key information is marked up, but it is presented in an order which makes it suitable for direct rendering, with the punctuation included. |
4207 | COBICO | analytic |
4211 | COBICO | series |
4216 | COBICO | information relating to the publication, pagination, etc. of an item (most of these constitute the default members of the |
4227 | COBICO | class, other phrase-level elements, and plain text may be combined without other constraint; within the latter, such of these elements as exist for a given reference must be distinguished, and must also be presented in a specific order, discussed further below (section |
4232 | COBICOL | In common library practice a clear distinction is made between an individual item within a larger collection and a free-standing book, journal, or collection. Similarly a book in a series is distinguished sharply from the series within which it appears. An article forming part of a collection which itself appears in a series thus has a bibliographic description with three quite distinct levels of information: |
4235 | COBICOL | analytic |
4243 | COBICOL | series |
4244 | COBICOL | level, giving the title of the series, possibly the names of its editors, etc., and the number of the volume within that series. |
4245 | COBICOL | In the same way, an article in a journal requires at least two levels of information: the analytic level describing the article itself, and the monographic level describing the journal. |
4247 | COBICOL | A different identifying number may be supplied for any of these three items, that is, for the analytic item, the monographic item, or the series. |
4284 | COBICOL | , the levels are distinguished by the use of the following distinct elements: |
4287 | COBICOL | For purposes of TEI encoding, journals and anthologies are both treated as monographs; a journal title should thus be tagged as a |
4288 | COBICOL | title level="j" |
4292 | COBICOL | analytic |
4301 | COBICOL | element. (Whether reprints of an article are treated in the same bibliographic reference or a separate one varies among different styles. Library lists typically use a different entry for each publication, while academic footnoting practice typically treats all publications of the same article in a single entry.) |
4305 | COBICOL | element is used to supply further information about the location of some part of a bibliographic reference. It specifies where to find the component in which it appears within the immediately preceding component of a different level. |
4311 | COBICOL | , which was itself the second of four volumes published together under the title |
4313 | COBICOL | ; this last title constituted the 38th volume in the series of |
4350 | COBICOL | In the following example, the article cited has been published twice, once in a journal (where it appeared in volume 40, on pages 3-46 of the issue of October 1986) and once as a free-standing item, which appeared as number 11 of a German-language series. |
4407 | COBICOL | The practice of analytic vs. monographic citation, as described here, should be distinguished from the practice of including within one citation a reference to another work, which the encoder considers to be related to it in some way: see further |
4410 | COBICOL | If an identifier is available for the analytic item, it should be represented by means of an |
4414 | COBICOL | element, as in the following example where a DOI (Digital Object Identifier) is supplied for the article in question. |
4462 | COBICOL | Punctuation must not appear between the elements within a structured bibliographic entry encoded with |
4510 | COBICOL | , with all the relevant data items marked up appropriately. This markup approach can provide easy rendering, if only one style guide is targeted, or an original source document uses a specific style guide, while still allowing for automated recovery of key data items such as names of authors, titles etc. |
4519 | COBICOR | Bibliographic references typically include the title of the work being cited and the names of those intellectually responsible for it. For articles in journals or collections, such statements should appear both for the analytic and for the monographic level. The following elements are provided for tagging such elements: |
4545 | COBICOR | are the default members of the |
4553 | COBICOR | In bibliographic references, all titles should be tagged as such, whether analytic, monographic, or series titles. The single element |
4567 | COBICOR | It is a semantic error to give a value for the |
4571 | COBICOR | value |
4573 | COBICOR | implies the analytic level; the values |
4574 | COBICOR | m |
4578 | COBICOR | u |
4579 | COBICOR | imply the monographic level; the value |
4580 | COBICOR | s |
4581 | COBICOR | implies the series level. Note, however, that the semantic error occurs only if the nested title is directly enclosed by the |
4587 | COBICOR | element; if it is enclosed only indirectly (i.e., nested more deeply), no semantic error need be present. For example, the analytic title may contain a monographic title, as in the following example: |
4615 | COBICOR | In this case, the analytic title |
4622 | COBICOR | element; the monographic title contained within it, |
4632 | COBICOR | The following reference, from a national standard for bibliographic references, illustrates this type of analysis with its distinction between main and subordinate titles. Note that this uses the more flexible |
4636 | COBICOR | element: consequently, there is no requirement to tag all the components of the reference (notably the authors). |
4653 | COBICOR | Slightly more complex is the distinction made below among main, subordinate, and parallel titles, in an example from the same source (p. 63). The punctuation and the bibliographic analysis are those given in ANSI Z39.29-1977; the punctuation is in the style prescribed by the International Standard Bibliographic Description (ISBD). |
4654 | COBICOR | The analysis is not wholly unproblematic: as the text of the standard points out, the first subordinate title is subordinate only to the parallel title in French, while the second is subordinate to both the English main title and the French parallel title, without this relationship being made clear, either in the markup given in the example or in the reference structure offered by the standard. |
4659 | COBICOR | , that specific punctuation may be included between the component elements of the reference. |
4678 | COBICOR | element should be used for the person or agency with primary responsibility for a work's intellectual content, and the element |
4681 | COBICOR | editor |
4683 | COBICOR | author |
4684 | COBICOR | of a broadcast, for example, while the author of a government report will usually be the agency which produced it. A translator, illustrator, or compiler may, however, be marked by means of the |
4690 | COBICOR | Many bibliographic and Linked Data applications require disambiguation of author names using unique identifiers. Both the |
4696 | COBICOR | elements, to supply such identifiers. Alternatively, if only a single identifier is to be recorded, the |
4735 | COBICOR | element may also be used for editors, if it is desired to record the specific terms in which their role is described. |
4749 | COBICOR | element may also occur. When one of these elements precedes or immediately follows a title, it applies to that title; when it follows an |
4751 | COBICOR | element or occurs within an edition statement, it applies to the edition in question. |
4797 | COBICOR | This example retains the original punctuation and editorial conventions of the source (ISO 690: 1987) and is therefore encoded using the |
4803 | COBICOR | element applies to the edition, and not to the collection |
4804 | COBICOR | per se |
4807 | COBICOR | element, the component elements have been reordered from their appearance on the title page of the volume in order to ensure the correct relationship of the collection title, the edition statement, and the statement of responsibility. |
4848 | COBICOR | The party with a particular responsibility for the intellectual content may vary over time. Likewise, a given individual's responsibility or role may change over time. These situations may be recorded with the |
4850 | COBICOR | element. For example, the following could be used when one proofreader took over for another. |
4868 | COBICOR | Another form of |
4870 | COBICOR | arises when a work is published as the outcome of a conference, workshop or similar meeting. The |
4932 | COBICOD | identifiers of various types because they do not include a statement of the title and the names of those intellectually responsible for it. The following elements may be used for such purposes: |
4940 | COBICOD | For example, a citation to a patent typically includes a country or organization code (a two-character code identifying a patent authority) and a serial number for the patent (whose structure varies by patent authority). The citation might also contain a |
4941 | COBICOD | kind code |
4942 | COBICOD | (which characterizes a particular publication for the patent and which corresponds to a specific stage in the patent procedure) and the date when the patent was filed with or published by the issuing authority. For bibliographic references to patents, the above elements may be used as follows: |
4947 | COBICOD | , may be used to contain the code of the patent authority. The |
4949 | COBICOD | attribute may be used to specify the type of patent authority (such as a national patent office or a supra-national patent organization). |
4952 | COBICOD | may be used to contain the serial number assigned by the corresponding patent authority. |
4955 | COBICOD | may be used to contain the kind code of the patent document. |
4958 | COBICOD | may be used to contain the date of the patent document. The |
4960 | COBICOD | attribute may be used to specify whether this corresponds to the filing date of a patent application or the publication date of a patent publication. |
4988 | COBICOI | imprint |
4989 | COBICOI | is meant all the information relating to the publication of a work: the person or organization by whose authority and in whose name a bibliographic entity such as a book is made public or distributed (whether a commercial publisher or some other organization), the place and the date of publication. It may also include a full address for the publisher or organization. A full bibliographic reference will usually also specify the number of pages in a print publication (or equivalent information for non-print materials), and possibly also the specific location of the material being cited within its containing publication. The following elements are provided to hold this information: |
4998 | COBICOI | Members of the model classes |
5004 | COBICOI | element in a specific location within a |
5014 | COBICOI | For bibliographic purposes, usually only the place (or places) of publication are required, possibly including the name of the country, rather than a full address; the element |
5016 | COBICOI | is provided for this purpose. Where however the full postal address is likely to be of importance in identifying or locating the bibliographic item concerned, it may be supplied and tagged using the |
5019 | COBICOI | . Alternatively, if desired, the |
5024 | COBICOI | may be used; this involves no claim that the information given is either a full address or the name of a city. |
5026 | COBICOI | The name of the publisher of an item should be marked using the |
5028 | COBICOI | element even if the item is made public ( |
5030 | COBICOI | ) by an organization other than a conventional publisher, as is frequently the case with technical reports: |
5094 | COBICOI | When an item has been reprinted, especially reprinted without change from a specific earlier edition, the reprint may appear in a |
5098 | COBICOI | and other details of the reprint. In the following example, a microform reprint has been issued without any change in the title or authorship. The series statement here applies only to the second |
5141 | COBICOI | This encoding can be extended to the case of patent documents, where the same patent application is published, with or without changes, at different stages of the patenting procedure. In this case, the kind code and, optionally, the publication date characterize different publications of the same patent application during the procedure. For example: |
5167 | COBICOI | The above bibliographic reference discloses different publications of the patent EP1558513 during the patenting procedure. The first publication from 3 August 2005 has the kind code "A1" indicating that it is a published patent application comprising the European search report issued after carrying out the search at the European Patent Office, whereas the second publication from 9 September 2009 has the kind code "B1" indicating that it was published after the patent application had been granted. |
5178 | COBICOB | Many bibliographic citations contain data limiting the citation to one or more volumes, issues, or pages, or to a name or number of a subdivision of the host work. These come in two varieties: |
5188 | COBICOB | Where it is desired to distinguish different classes of such information (volume number, page number, chapter number, etc.), the |
5310 | COBICOB | On the other hand, a cited range encodes that the author |
5312 | COBICOB | defined by this range. For example, a footnote following a quotation from page 378 of |
5360 | COBICOS | element. The title of the series may be tagged |
5361 | COBICOS | title level="s" |
5362 | COBICOS | , the volume number |
5363 | COBICOS | biblScope unit="vol" |
5364 | COBICOS | , and responsibility statements for the series (e.g. the name and affiliation of the editor, as in the example in section |
5369 | COBICOS | . Any identifier associated with the series itself should be marked using the |
5376 | COBIRI | related item |
5377 | COBIRI | is any bibliographic item which, though related to that being defined, is distinct from it. The distinction between analytic and monographic items made above may be thought of as a special case of this kind of |
5379 | COBIRI | item. More usually however, the term is applied to such items as translations, continuations, different versions, parts, etc. |
5389 | COBIRI | describes a facsimile edition, and the second describes the work of which it is a facsimile. The relation between the facsimile and its source is represented by means of a |
5439 | COBIRI | may contain any form of bibliographic reference. For example, one of the examples quoted above might also be encoded as follows: |
5484 | COBIRI | attribute should be used to indicate the relationship between the bibliographic item and any |
5526 | COBIRI | In this example, a full bibliographic description of the edition used as source for the translation is provided within the content of the |
5528 | COBIRI | . Alternatively this might be provided by means of a link, in which case the |
5547 | COBICON | Explanatory notes about the publication of unusual items, the form of an item (e.g. |
5551 | COBICON | ), or its provenance (e.g. |
5555 | COBICON | element. The same element may be used for any descriptive annotation of a bibliographic entry in a database. |
5575 | COBICON | This element can take the form of a simple note such as: |
5581 | COBICON | attribute to record the chief language of the bibliographic item, and optionally the |
5593 | COBICON | attributes should both provide language identifiers in the same form as used for |
5596 | COBICON | . Where additional detail is needed to describe a language correctly, or to discuss its deployment in a given text, this should be done using the |
5598 | COBICON | element in the TEI header, within which individual |
5625 | COBICOO | element, if it occurs, must come first, followed by one or more |
5631 | COBICOO | element comes first), and then zero or more of the following in any order: |
5647 | COBICOO | , the title(s), author(s), editor(s), and other statements of responsibility may appear in any order; it is recommended that all forms of the title be given together. Within |
5649 | COBICOO | , the author, editor, and statements of responsibility may either come first or else follow the monographic title(s). Following these, the elements listed below, if present, must appear in the following order: |
5652 | COBICOO | s on the publication (and |
5654 | COBICOO | elements describing the conference, in the case of a proceedings volume) |
5674 | COBICOO | , the sequence of elements is not constrained. |
5688 | COBIXR | ). As discussed in that section, cross-referencing within TEI texts is in general represented by means of |
5694 | COBIXR | attribute on these elements is used to supply an identifying value for the target of the cross-reference, which should be, in the case of bibliographic elements, a bibliographic reference of some kind. Where the form of the reference itself is unimportant, or may be reconstructed mechanically, or is not to be encoded, the |
5701 | COBIXR | Where the form of the reference is important, or contains additional qualifying information which is to be kept but distinguished from the surrounding text, the |
5707 | COBIXR | It may be important to distinguish between the short form of a bibliographic reference and some qualifying or additional information. The latter should not appear within the scope of the |
5709 | COBIXR | element when this is the case, as for example in an application concerned to normalize bibliographic references: |
5717 | COBIXR | element may also be used to provide a reference to a copy of the bibliographic item itself, particularly if this is available online, as in the following example: |
5753 | COBIOT | The BibTeX scheme is intentionally compatible with that of Scribe, although it omits some fields used by Scribe. Hence only one list of fields is given here. |
5756 | COBIOT | address |
5758 | COBIOT | tag as |
5765 | COBIOT | tag as |
5768 | COBIOT | author |
5770 | COBIOT | tag as |
5775 | COBIOT | tag as |
5776 | COBIOT | title level="m" |
5784 | COBIOT | tag as |
5785 | COBIOT | biblScope unit="chap" |
5787 | COBIOT | date |
5789 | COBIOT | used only to record date entry was made in the bibliographic database; not supported |
5791 | COBIOT | edition |
5793 | COBIOT | tag as |
5796 | COBIOT | editor |
5798 | COBIOT | tag as |
5805 | COBIOT | tag as multiple |
5829 | COBIOT | name type="org" |
5833 | COBIOT | tag as |
5835 | COBIOT | , possibly using the form |
5836 | COBIOT | note place="inline" |
5838 | COBIOT | institution |
5840 | COBIOT | used only for issuer of technical reports; tag as |
5845 | COBIOT | tag as |
5846 | COBIOT | title level="j" |
5854 | COBIOT | used to specify an alternate sort key for the bibliographic item, for use instead of author's or editor's name; not supported |
5856 | COBIOT | meeting |
5858 | COBIOT | tag as |
5867 | COBIOT | ; if the date is not in a trivially parseable form, use the |
5872 | COBIOT | note |
5874 | COBIOT | tag as |
5877 | COBIOT | number |
5879 | COBIOT | tag as |
5880 | COBIOT | biblScope unit="issue" |
5882 | COBIOT | biblScope unit="number" |
5884 | COBIOT | idno type="docno" |
5888 | COBIOT | used only for sponsor of conference; use |
5889 | COBIOT | name type="org" |
5898 | COBIOT | tag as |
5899 | COBIOT | biblScope unit="pp" |
5901 | COBIOT | publisher |
5903 | COBIOT | tag as |
5908 | COBIOT | used only for institutions at which thesis work is done; tag as |
5911 | COBIOT | series |
5913 | COBIOT | tag as |
5914 | COBIOT | title level="s" |
5920 | COBIOT | title |
5922 | COBIOT | tag as |
5926 | COBIOT | value |
5930 | COBIOT | tag as |
5931 | COBIOT | biblScope unit="vol" |
5935 | COBIOT | tag as |
5937 | COBIOT | ; if the date is not in a trivially parseable form, use the |
5945 | CODV | The following elements are included in the core module for the convenience of those encoding texts which include mixtures of prose, verse and drama. |
5948 | CODV | Full details of other, more specialized, elements for the encoding of texts which are predominantly verse or drama are described in the appropriate chapter of part three (for verse, see the verse base described in chapter |
5949 | CODV | ; for performance texts, see the drama base described in chapter |
5950 | CODV | ). In this section, we describe only the elements listed above, all of which can appear in any text, whichever of the three modes prose, verse, or drama may predominate in it. |
5954 | COVE | Like other written texts, verse texts or poems may be hierarchically subdivided, for example into books or cantos. These structural subdivisions should be encoded using the general purpose |
5960 | COVE | . The fundamental unit of a verse text is the verse line rather than the paragraph, however. |
5964 | COVE | element is used to mark up verse lines, that is, metrical rather than typographic lines. In some modern or free verse, it may be hard to decide whether the typographic line is to be regarded as a verse line or not, but the distinction is quite clear for verse following regular metrical patterns. Where a metrical line is interrupted by a typographic line break, the encoder may choose to ignore the fact entirely or to use the empty |
5967 | COVE | . By convention, the start of a metrical line implies the start of a typographic line; hence there is no need to introduce an |
5969 | COVE | tag at the start of every |
5971 | COVE | element, but only at places where a new typographic line starts within a metrical line, as in the following example: |
5986 | COVE | In the original copy text, the presence of an ornamental capital at the start of the poem means that the measure is not wide enough to print the first four lines on four lines; instead each metrical line occupies two typographic lines, with a break at the point indicated. Note that this encoding makes no attempt to preserve information about the whitespace or indentation associated with either kind of line; if regarded as essential, this information would be recorded using the |
5994 | COVE | element should not be used to represent typographic lines in non-verse materials: if the line-breaking points in a prose text are considered important for analysis, they should be marked with the |
5996 | COVE | element. Alternatively, a neutral segmentation element such as |
6011 | COVE | In some verse forms, regular groupings of lines are regarded as units of some kind, often identified by a regular verse scheme. In stichic verse and couplets, groups of lines analogous to paragraphs are often indicated by indentation. In other verse forms, lines are grouped into irregular sequences indicated simply by whitespace. The |
6013 | COVE | or line group element may be used to mark any such grouping of elements from the |
6020 | COVE | which may be used to further categorize the line group where this is felt desirable, as in the following example. This example also demonstrates the |
6022 | COVE | attribute to indicate whether or not a line is indented. |
6048 | COVE | For some kinds of analysis, it may be useful to identify different kinds of line group within the same piece of verse. Such line groups may self-nest, in much the same way as the un-numbered |
6093 | COVE | It is often the case that verse line boundaries conflict with the boundaries of other structural elements. In the following example, the single verse line |
6095 | COVE | is interrupted by a stage direction: |
6119 | COVE | The same technique may be used where verse lines are collected together into units such as verse paragraphs: |
6142 | COVE | element to indicate that it is incomplete, for example because it forms part of a group that is divided between two speakers, as in the following example: |
6164 | COVE | For alternative methods of aligning groups of lines which do not form simple hierarchic groups, or which are discontinuous, see the more detailed discussion in chapter |
6174 | CODR | performance texts |
6175 | CODR | such as cinema or TV scripts are often hierarchically organized, for example into acts and scenes. These structural subdivisions should be encoded using the general purpose |
6181 | CODR | . Within these divisions, the body of a performance text typically consists of |
6183 | CODR | , often prefixed by a phrase indicating who is speaking, and occasionally interspersed with stage directions of various kinds. |
6210 | CODR | In the following example, each speech consists of a sequence of verse lines, some of them being marked as metrically incomplete: |
6266 | CODR | , the printed speaker attributions need to be supplemented by use of the |
6312 | CODR | By contrast with the preceding examples, the following encodes an early printed edition without making any assumption about which parts are prose or verse: |
6354 | CODR | elements should also be used to mark parts of a text otherwise in prose which are presented as if they were dialogue in a play. The following example is taken from a 19th century novel in which passages of narrative and passages of dialogue are mixed within the same chapter: |
6401 | core | Elements common to all TEI documents |
6410 | COOV | The selection and combination of modules to form a TEI schema is described in |
# | id | text |
---|---|---|
6 | WD | introduced the fundamental notions of language identification and character representation in an encoded TEI document. In this chapter we discuss some additional issues relating to the way that written language is represented in a TEI document. In sections |
8 | WD | we introduce markup which may be used to represent and document non-standard characters, that is, written symbols for which no codepoint exists in Unicode. The same markup may be used to annotate existing characters according to their visual or other properties, and thus process them as distinct glyphs (see section |
12 | WD | we discuss ways of documenting the writing mode used in a source text, that is, the directionality of the script, the orientation of individual characters, and related questions. |
16 | WDNE | Despite the availability of Unicode, text encoders still sometimes find that the published repertoire of available characters is inadequate to their needs. This is particularly the case when dealing with ancient languages, for which encoding standards do not yet exist, or where an encoder wishes to represent variant forms of a character or |
34 | WDNE | , and the associated character code charts. Alternatively, users can check the latest published version of |
38 | WDNE | ), though the web site is often more up to date than the printed version, and should be checked for preference. |
42 | WDNE | ) in the Unicode code charts are only meant to be representative, not definitive. If a specific form of an already encoded character is required for a project, refer to the guidelines contained below under |
44 | WDNE | . Remember that your encoded document may be rendered on a system which has different fonts from yours: if the specific form of a character is important to you, then you should document it. |
47 | WDNE | ) to see whether the character is in line for approval. |
49 | WDNE | Ask on the Unicode email list ( |
54 | WDNE | Since there are now close to 100,000 characters in Unicode, chances are good that what you need is already there, but it might not be easy to find, since it might have a different name in Unicode. Look again, this time at other sites, for example |
55 | WDNE | , which also provide searches based on scripts and languages. Take care, however, that all the properties of what seems to be a relevant character are consistent with those of the character you are looking for. For example, if your character is definitely a digit, but the properties of the best match you can find for it say that it is a letter, you may have a character not yet defined in Unicode. |
59 | WDNE | However, if the character you are looking for is being used in a notation (rather than as part of the orthography of a language) then it is quite acceptable to select characters from the Mathematical Operators block, provided that they have the appropriate properties (i.e. |
69 | WDNE | If, however, no suitable form of your character seems to exist, the next question will be: |
70 | WDNE | Does the graphical unit in question represent a variant form of a known character, or does it represent a completely unencoded character? |
74 | WDNE | These guidelines will help you proceed once you have identified a given graphical unit as either a variant or an unencoded character. Determining this will require knowledge of the contents of the document that you have. The first case will be called |
76 | WDNE | of a character, while the second case will be called |
82 | WDNE | While there is some overlap between these requirements, distinct specialized markup constructs have been created for each of these cases. These constructs are presented in section |
91 | D25-20 | numeric character reference |
94 | D25-20 | (A-umlaut). The encoder can also restrict the range of characters which are represented directly in a document (or part of it) by adding a suitable encoding declaration. For example, if a document begins with the declaration |
96 | D25-20 | any Unicode characters which are not in the ISO-8859-1 character set must be represented by NCRs. |
99 | D25-20 | gaiji |
104 | D25-20 | .) This allows the encoder to distinguish characters and glyphs which Unicode regards as identical, to add new nonstandard characters or glyphs, and to represent Unicode characters not available in the document encoding by an alternative means. |
122 | D25-20 | When the gaiji module is included in a schema, the |
130 | D25-20 | The Unicode standard defines properties for all the characters it defines in the Unicode Character Database, knowledge of which is usually built into text processing systems. If the character represented by the |
132 | D25-20 | element does not exist in Unicode at all, its properties are not available. If the character represented is an existing Unicode character, but is not available in the document character set recognized by a given text processing system, it may also be convenient to have access to its properties in the same way. The |
136 | D25-20 | The list of attributes (properties) for characters is modelled on those in the Unicode Character Database, which distinguishes |
140 | D25-20 | character properties. Additional, non-Unicode, properties may also be supplied. Since the list of properties will vary with different versions of the Unicode Standard, there may not be an exact correspondence between them and the list of properties defined in these Guidelines. |
144 | D25-20 | . The gaiji module itself is formally defined in section |
145 | D25-20 | below. It declares the following additional elements: |
155 | D25-20 | when this module is included in a schema. The |
159 | D25-20 | : this class is referenced as an alternative to plain text in almost every element which contains plain text, thus permitting the |
161 | D25-20 | element also to appear at such places when this module is included in a schema. |
182 | D25-20 | element) by providing a specific glyph that shows how a character appeared in the original document. This is necessary since Unicode code points refer not to a single, specific glyph shape of a character, but rather to a set of glyphs, any of which may be used to render the code point in question; in some cases they can differ considerably. |
186 | D25-20 | element is provided for cases where the encoder wants to specify a specific glyph (or family of glyphs) out of all possible glyphs. Unfortunately, due to the way Unicode has been defined, there are cases where several glyphs that logically belong together have been given separate code points, especially in the blocks defining East Asian characters. In such cases, |
188 | D25-20 | elements can also be used to express the view that these apparently distinct characters are to be regarded as instances of the same character (see further |
191 | D25-20 | The Unicode Standard recommends naming conventions which should be followed strictly where the intention is to annotate an existing Unicode character, and which may also be used as a model when creating new names for characters or glyphs |
192 | D25-20 | It should be noted, however, that this naming convention cannot meaningfully be applied to East Asian characters; the typical Unicode descriptions for these characters take the form |
197 | D25-20 | is simply the Unicode code point value of the character in question. In cases where no Unicode code point exists, there is little hope of finding a name that helps to identify the character. Names should therefore be constructed in a way meaningful to local practice, for example by using a reference number from a well-known character dictionary or a project-specific serial number. |
198 | D25-20 | . For convenience of processing, the following distinct elements are proposed for naming characters and glyphs: |
225 | D25-20 | ) are defined by other TEI modules, and their usage here is no different from their usage elsewhere. The |
227 | D25-20 | element, however, is used here only to link to an image of the character or glyph under discussion, or to contain a representation of it in SVG. The |
239 | D25-20 | element is similar to the standard TEI |
241 | D25-20 | element. While the latter is used to express correspondence relationships between TEI concepts or elements and those in other systems or ontologies, the former is used to express any kind of relationship between the character or glyph under discussion and characters or glyphs defined elsewhere. It may contain any Unicode character, or a |
276 | D25-20 | The mapping element may also be used to represent a mapping of the character or (more likely) glyph under discussion onto a character from the private use area as in this example: |
289 | D25-20 | A more precise documentation of the properties of any character or glyph may be supplied using the generic |
297 | ucsprops | characters, defined by reference to a number of |
299 | ucsprops | (or attribute-value pairs) which they are said to possess. For example, a lowercase letter is said to have the value |
305 | ucsprops | properties (i.e. properties which form part of the definition of a given character), and |
308 | ucsprops | additional |
330 | ucsprops | For convenience, we list here some of the normative character properties and their values. For full information, refer to chapter 4 of |
336 | ucsprops | The general category (described in the Unicode Standard chapter 4 section 5) is an assignment to some major classes and subclasses of characters. Suggested values for this property are listed here: |
384 | ucsprops | Punctuation, initial quote |
387 | ucsprops | Punctuation, final quote |
405 | ucsprops | Separator, space |
408 | ucsprops | Separator, line |
432 | ucsprops | This property applies to all Unicode characters. It governs the application of the algorithm for bi-directional behaviour, as further specified in Unicode Annex 9, |
518 | ucsprops | Start of fixed position classes |
521 | ucsprops | End of fixed position classes |
583 | ucsprops | This property is defined for characters which may be decomposed, for example to a canonical form plus a typographic variation of some kind. For such characters the Unicode Standard specifies both a decomposition type and a decomposition mapping (i.e. another Unicode character to which this one may be mapped in the way specified by the decomposition type). The following types of mapping are defined in the Unicode Standard: |
589 | ucsprops | A no-break version of a space or hyphen |
592 | ucsprops | An initial presentation form (Arabic) |
595 | ucsprops | A medial presentation form (Arabic) |
598 | ucsprops | A final presentation form (Arabic) |
601 | ucsprops | An isolated presentation form (Arabic) |
604 | ucsprops | An encircled form |
607 | ucsprops | A superscript form |
610 | ucsprops | A subscript form |
613 | ucsprops | A vertical layout presentation form |
622 | ucsprops | A small variant form (CNS compatibility) |
628 | ucsprops | A vulgar fraction form |
637 | ucsprops | This property applies to any character which expresses any kind of numeric value. Its value is the intended value in decimal notation. |
643 | ucsprops | independent of the text direction: it has the value |
650 | ucsprops | The Unicode Standard also defines a set of informative (but non-normative) properties for Unicode characters. If encoders want to provide such properties, they may be included using the suggested Unicode name, tagged using the |
654 | ucsprops | element to distinguish them. If a Unicode name exists for a given property, it should however always be preferred to a locally defined name. Locally defined names should be used only for properties which are not specified by the Unicode Standard. |
661 | D25-30 | Annotation of a character becomes necessary when it is desired to distinguish it on the basis of certain aspects (typically, its graphical appearance) only. In a manuscript, for example, where distinctly different forms of the letter "r" can be recognized, it might be useful to distinguish them for analytic purposes, quite distinct from the need to provide an accurate representation of the page. A digital facsimile, particularly one linked to a transcribed and encoded version of the text, will always provide a superior visual representation (for information on how to link a digital facsimile to a transcribed text see |
662 | D25-30 | ), but cannot be used to support arguments based on the distribution of such different forms. Character annotation as described here provides a solution to this problem. |
663 | D25-30 | It should be kept in mind that any kind of text encoding is an abstraction and an interpretation of the text at hand, which will not necessarily be useful in reproducing an exact facsimile of the appearance of a manuscript. |
666 | D25-30 | Assuming that we wish to distinguish the variant glyphs from the standard representation for the character concerned, we will need to define distinct |
693 | D25-30 | With these definitions in place, occurrences of these two special "r"s in the text can be annotated using the element |
708 | D25-30 | element will be interpreted as an annotation on the content of the element |
734 | D25-30 | ligature; the encoder may however prefer not to use it in order to simplify other text processing operations, such as indexing). |
745 | D25-30 | which would enable the same material to be encoded as follows: |
749 | D25-30 | The same technique may be used to represent particular abbreviation marks as well as to represent other characters or glyphs. For example, if we believe that the r-with-one-funny-stroke is being used as an abbreviation for |
755 | D25-30 | Note however that this technique employs markup objects to provide a link between a character in the document and some annotation on that character. Therefore, it cannot be used in places where such markup constructs are not allowed, notably in attribute values. |
757 | D25-30 | Since the need to use these constructs to annotate or define characters occurs frequently in Chinese, Korean, and Japanese documents, here are some issues that are specific to these documents. There are two slightly different versions of the problem. In the first case, due to the way Unicode is defined, there are occasions when more than one glyph is defined for a character. On such an occasion, one might want to retain the character as used, but add information in such a way that a normalizer (for search or indexing operations) could take advantage of it. To achieve this, we simply define within a |
777 | D25-30 | , simply maps our glyph to the code point where Unicode defined it. The other one, of type |
779 | D25-30 | , encodes the fact that in our view, this glyph is a variation of the standard character given in the content of the element. We could then use this |
783 | D25-30 | to refer to it from within a text as follows. |
789 | D25-30 | A slightly different, but related problem occurs when we have multiple variants, none of which has been defined in Unicode. In this case, we need to define one as a new character using |
808 | D25-30 | element then defines a variant glyph of this newly defined character. Additional properties should be specified in order to make these both identifiable. |
814 | D25-40 | The creation of additional characters for use in text encoding is quite similar to the annotation of existing characters. The same element |
816 | D25-40 | is used to provide a link from the character instance in the text to a character definition provided within the |
818 | D25-40 | element. This character definition takes the form of a |
822 | D25-40 | itself will usually be empty, but could contain a code point from the Private Use Area (PUA) of the Unicode Standard, which is an area set aside for the very purpose of privately adding new characters to a document. Recommendations on how to use such PUA characters are given in the following section. |
824 | D25-40 | In some circumstances, it may be desirable to provide a single precomposed form of a character that is encoded in Unicode only as a sequence of code points. For example, in Medieval Nordic material, a character looking like a lowercase letter Y with a dot and an acute accent above it may be encountered so frequently that the encoder wishes to treat it as a single precomposed character with one single coded value. In the transcription concerned, the encoder enters this letter as |
826 | D25-40 | , which when the transcription is processed can then be expanded in one of three ways, depending on the mapping in force. The entity reference might be translated into the sequence of corresponding Unicode code points or into some locally-defined PUA character (say |
828 | D25-40 | ) for local processing only. Both these options have disadvantages; the former loses the fact that the sequence of composed characters is regarded as a single object; the second is not reliably portable. Therefore, the recommended representation is to use the |
831 | D25-40 | . This makes it possible for the encoder to provide useful documentation for the particular character or glyph so referenced: |
845 | D25-40 | This definition specifies the mapping between this composed character and the individual Unicode-defined code points which make it up. It also supplies a single locally-defined property ( |
847 | D25-40 | ) for the character concerned, the purpose of which is to supply a recommended character entity name for the character. |
849 | D25-40 | Under certain circumstances, Chinese Han characters can be written within a circle. Rather than considering this as simply an aspect of the rendering, an encoder may wish to treat such circled characters as entirely distinct derived characters. For a given character (say that represented by the numeric-character reference |
880 | D25-40 | . The two mappings indicate firstly that the standard form of this character is the character |
884 | D25-40 | . For convenience of local processing this PUA character may in fact appear as content of the |
894 | D25-50 | The developers of the Unicode Standard have set aside an area of the codespace for the private use of software vendors, user groups, or individuals. As of this writing (Unicode 5.0), there are around 137,000 code points available in this area, which should be enough for most needs. No code point assignments will be made to this area by standard bodies and only some very basic default properties have been assigned (which may be overridden where necessary by the mechanism outlined in this chapter). Therefore, unlike all other code points defined by the Unicode Standard, PUA code points should |
898 | D25-50 | In the two previous examples, we mentioned that the variant characters concerned might well be assigned specific code points from the PUA. This might, for example, facilitate the use of a particular font which displays the desired character at this code point in the local processing environment. Since however this assignment would be valid only on the local site, documents containing such code points are unsuitable for blind interchange. During the process of preparing such documents for interchange, any PUA code points should be replaced by an appropriate use of the |
901 | D25-50 | g ref="#xxxx" |
907 | D25-50 | , or retained as content of the |
909 | D25-50 | element. However, since there is no requirement that the same PUA character be used to represent it at the receiving site, and since it may well be the case that this other site has already made an assignment of some other character to the original PUA code point, it is best practice to remove the locally-defined PUA character. It is to be expected that a further translation into the local processing environment at the receiving site will be necessary to handle such characters, during which variant letters can be converted to hitherto unused code points on the basis of the information provided in the |
913 | D25-50 | This mechanism is rather weak in cases where DOM trees or parsed XML fragments are exchanged, which may increasingly be the case. The best an application can do here is to treat any occurrence of a PUA character only in the context of the local document and use the properties provided through the |
917 | D25-50 | In the fullness of time, a character may become standardized, and thus assigned a specific code point outside the PUA. Documents which have been encoded using the mechanism must at the least ensure that this changed code point is recorded within the relevant |
929 | WDWM | The scripts used for writing human languages vary not only in the glyphs they use, but also in the way (or ways) that those glyphs are arranged on the writing surface. For the majority of modern languages, writing is arranged as a series of lines which are to be read from top to bottom. Within each line, individual characters are frequently presented from left to right (English, Russian, Greek), but there are also several widely-used scripts which run right-to-left (Arabic, Hebrew). Writing in which the lines of glyphs are presented vertically and read from right to left is also often encountered, notably in older East Asian scripts (Sinitic characters, Japanese Kana, Korean Hangul, Vietnamese chữ nôm). In many cases, a language normally uses the same |
930 | WDWM | writing mode |
931 | WDWM | (we use this term to refer to the orientation of individual glyphs within a line and the order in which glyphs and lines should be read), but there are exceptions in which the same language may appear in different modes, for example either vertically or horizontally. Many East Asian scripts were traditionally written from top to bottom within the line, with their lines sequenced from right to left. Although modern Japanese, Chinese, and Korean are often written horizontally, the traditional vertical writing mode is still widely used. There are also comparatively rare cases of ancient scripts written with lines running left to right, each line being read top to bottom (Ancient Uighur, classical Mongolian and Manchu), or scripts such as Ogham where the writing direction may start from the bottom left and run around the edge of an inscribed object. |
933 | WDWM | When different languages are combined, it is possible that different writing modes will be needed: for example, in Hebrew text, running right to left, sequences of Latin digits still run left to right. When different writing modes are available for the same language, it may be that different glyphs will be preferred when the script is used in different modes. For example, when Japanese is written horizontally, the Unicode character U+3001, the |
935 | WDWM | , is used in preference to Unicode character U+FE11, the vertical mode comma. This ensures that the comma appears in the correct position relative to the surrounding glyphs. Even for scripts which are usually written in exactly the same way, different writing modes may be encountered in particular contexts; for example when a language using Roman script is embedded within vertically-organized Chinese text, it may sometimes be displayed vertically and sometimes horizontally. The writing mode may also vary in response to layout constraints such as those imposed by a complex table, where column or row labels may be written vertically or diagonally to make the most effective use of available space, just as it may vary in response to the size and shape of the carrier in the case of a monumental inscription. |
937 | WDWM | For many, perhaps most, TEI documents there may be no need to encode the writing mode explicitly, even in so-called "mixed mode" texts containing passages written in languages which use different writing modes. Modern printed texts in most European languages, for instance, may be expected to use left-to-right/top-to-bottom directionality; while Arabic or Hebrew texts are expected to run right-to-left/top-to-bottom. In a TEI document, language and script are explicitly stated in the markup using the attribute |
939 | WDWM | ; this indication will usually imply a particular default writing mode. Even where this attribute is not used, passages in different scripts will use different Unicode characters, and will thus imply a particular default writing mode. |
941 | WDWM | Consider the case of an English text containing a few Arabic words: |
943 | WDWM | The Arabic term قلم رصاص means "pencil". |
945 | WDWM | A correct TEI encoding might read as follows: |
954 | WDWM | attribute with value |
956 | WDWM | that causes processing software to display the Arabic from right to left, but in fact, this is not the case. The order in which the Arabic characters appear when rendered would be the same, even if the markup were not present: |
961 | WDWM | This is because Arabic glyphs are always displayed right to left, even when they appear within a left-to-right English sentence. Like most other codepoints in the Unicode standard, they have a specific directionality setting which helps any rendering software determine how they should be ordered. The Latin glyph "a" has a strong left-to-right bidirectionality setting; the Hebrew א (alef) is strongly right-to-left. Of course, some glyphs (European digits and common punctuation marks such as the period or comma, for example) have weak or neutral settings because they may appear in several contexts. |
965 | WDWM | ) defines a number of rules enabling software to render sequences of characters which have differing directionality properties in a predictable and reliable way, using only those properties. |
966 | WDWM | Because this algorithm may not always give the desired result, Unicode also provides a set of "directional formatting characters" ( |
967 | WDWM | ). These additional codepoints can be used to signal to rendering software that a specific directionality setting should be turned on or off. However, in the case of documents encoded in XML, there is no need to use such characters, and in fact the W3C explicitly advises against it. "In (X)HTML and XML do not use the paired Unicode bidi formatting code characters where equivalent markup is available." ( |
969 | WDWM | . It should be remembered however that individual sequences of characters are always stored in a file in the order in which they should be read, irrespective of the order in which the characters making up a sequence should be displayed or rendered. For example, in an RTL language such as Hebrew, the first character in a file will be that which is displayed at the rightmost end of the first line of text. |
971 | WDWM | An encoder wishing to document or to control the order in which sequences of characters in a TEI document are displayed will usually do so by segmenting the text into sequences presented in the desired order and specifying an appropriate language code for each. In situations where this approach may result in ambiguity or lack of precision, or if the encoder wishes to record directional information explicitly in their encoding, we recommend using the global @style attribute to supply detail about the writing mode applicable to the content of any element. The |
975 | WDWM | At the time of writing, this W3C module has the status of a candidate recommendation: see further |
978 | WDWM | which permits direct specification of a number of useful properties associated with writing modes, notably |
1004 | WDWM | The global TEI |
1010 | WDWM | and then point to them using the global |
1013 | WDWM | . Although the CSS specifications are mainly used to provide instructions for software when rendering a digital text, they also provide a useful means of describing the visual properties of a pre-existing document in a formal and standardized way. |
1015 | WDWM | The next section presents some examples of how CSS can be used to describe a variety of writing modes. A full description of the appearance of a document will probably include many other properties of course. |
1021 | WDWMEG | The CSS recommendations provide several properties which can be used to encode aspects of the "writing mode". The most useful of these is the property "writing-mode" which may be used to specify a reading-order for both characters within a single line and lines within a single block of text. The property "text-orientation" may also be used to indicate the orientation of individual characters with respect to the line, and the property "direction" to determine the reading order of characters within a line only. We give some examples of each below. |
1028 | WDWMEG1 | property is particularly useful for languages which can be written in different writing modes, such as Chinese and Japanese. Its possible values include |
1034 | WDWMEG1 | . Each value has two components: |
1038 | WDWMEG1 | specifies the inline writing direction, while the second component specifies the direction in which lines in a block, and blocks in a sequence are arranged: from top to bottom (as in most European languages, in which lines and paragraphs are arranged from top to bottom on a page), from right to left (as in the case of Japanese written vertically), or left-to-right (as in the case of Mongolian). |
1088 | WDWMEG1 | to supply a value of |
1092 | WDWMEG1 | attribute specifies a horizontal writing mode; this may seem superfluous, but vertically-written romaji is not unknown. |
1098 | WDWMEG2 | When Japanese is written vertically, the glyph orientation remains the same as when it is written horizontally. In other words, glyphs are not rotated (although as noted above some different glyphs may be used for some characters, in particular for punctuation which needs to be positioned differently in vertical and in horizontal text). However, it is very common for languages written vertically to have embedded runs of text from languages which are normally written horizontally. This raises the issue of the orientation of the glyphs from the horizontal language. Are they written upright, as they would normally appear in horizontal text runs, or are they rotated? Consider this fragment from a Japanese article about the Indonesian language, which takes the form of a glossary list: |
1108 | WDWMEG2 | The text-orientation property allows us to indicate whether or not glyphs are rotated. In the following example, we have indicated that the list uses a |
1110 | WDWMEG2 | writing mode, but that the orientation of individual glyphs may vary: |
1126 | WDWMEG2 | characters from horizontal-only scripts are set sideways, i.e. 90° clockwise from their standard orientation in horizontal text. Characters from vertical scripts are set with their intrinsic orientation |
1129 | WDWMEG2 | ). Since the default value for |
1133 | WDWMEG2 | , this rule is not strictly required. However, if the Indonesian glyphs (which are roman characters) had been set vertically, like this: |
1142 | WDWMEG2 | then an encoding like the following could be used to make this explicit: |
1158 | WDWMEG2 | characters from horizontal-only scripts are rendered upright, i.e. in their standard horizontal orientation. Characters from vertical scripts are set with their intrinsic orientation and shaped normally |
1169 | WDWMEG3 | It is not unusual to see text from horizontal languages written vertically even where no vertically-written script is involved. This example is a fragment from a table of information about agricultural development on Vancouver Island, written in 1855: |
1180 | WDWMEG3 | Four of the subheading cells in this fragment contain English text written vertically, bottom-to-top, to conserve space on the page. To describe this sort of phenomenon, we can use the |
1189 | WDWMEG3 | causes text to be set as if in a horizontal layout, but rotated 90° counter-clockwise. |
1190 | WDWMEG3 | We might encode the third of the four cells containing vertical text like this: |
1200 | WDWMEG3 | property captures the fact that the script is written vertically, and its lines are to be read from left to right (so the line containing |
1203 | WDWMEG3 | Cash value |
1206 | WDWMEG3 | value encodes the orientation (rotated 90° counter-clockwise). We might also add |
1208 | WDWMEG3 | to the style, to express the fact that the text is centrally-aligned. |
1214 | WDWMEG4 | Of the rather small number of scripts which appear to be written bottom-to-top, perhaps the best-known is Ogham, an alphabet used mainly to write Archaic Irish. Ogham is typically found inscribed along the edge of a standing stone, starting at its base. The CSS Writing Modes specification does not explicitly distinguish between vertical scripts which are written top-to-bottom and those which are written bottom-to-top. Instead, such bottom-to-top scripts are best treated as left-to-right horizontal scripts, oriented vertically because of the constraints of the medium on which they are inscribed. Such scripts are analogous to the vertical English text-runs in the table cells in the example above, and can be handled in exactly the same manner ( |
1216 | WDWMEG4 | ). In cases where writing follows a curved path (such as Ogham running around the edge of a stone), a meticulous encoder might resort to the use of SVG to describe the path, rather than treating the phenomenon as a writing mode. |
1225 | WDWMEG5 | The Arabic term قلم رصاص means "pencil". |
1238 | WDWMEG5 | property to record the observed directionality of the text is unambiguous, even though it is (as we noted above) superfluous. The use of the |
1240 | WDWMEG5 | property here may require some explanation. By default this property has the value |
1242 | WDWMEG5 | , the effect of which in this context would be to ignore any value supplied for the direction property. The CSS Writing Modes specification stipulates that the direction property |
1243 | WDWMEG5 | has no effect on bidi reordering when specified on inline boxes whose |
1245 | WDWMEG5 | property’s value is |
1247 | WDWMEG5 | , because the element does not open an additional level of embedding with respect to the bidirectional algorithm. |
1250 | WDWMEG5 | Mixed horizontal directionality is very common in languages such as Arabic and Hebrew, particularly when numbers (which are always given LTR) or phrases from LTR languages are embedded. It is not impossible, though quite unusual, for ambiguities to arise in such situations, which may give rise to the parts of a document being displayed in unexpected ways that do not correspond to the natural reading order. A more detailed discussion of this issue from an HTML perspective is provided by a W3C Internationalization Working Group report |
1251 | WDWMEG5 | Inline markup and bidirectional text in HTML |
1260 | WDWMEG | For most texts, information about text directionality need not be explicitly encoded in a TEI text, either because it follows unambiguously from |
1262 | WDWMEG | values, or because it can be expected to be handled unequivocally by the Unicode Bidi Algorithm. Where it is considered important to encode such information, properties and values taken from the CSS Writing Modes module may be used by means of the global TEI |
1264 | WDWMEG | attribute (or using the TEI |
1275 | WDWMTT | In what follows, we examine a range of textual phenomena which in some ways appear very similar to those examined above, and even overlap with them. We can categorize these as text transformation features, and suggest some strategies for encoding them based on the properties detailed in the |
1286 | WDWMTT | Here a block of text has been rotated around its z-axis. This is clearly not a |
1287 | WDWMTT | writing mode |
1288 | WDWMTT | ; the writing mode for this text is horizontal, left to right. Furthermore, even if we wished to treat this as a writing mode, we could not do so, because there is no way to use writing modes properties to describe a text orientation which is angled at 45 degrees; no human languages are consistently written in this orientation. It is more appropriate to treat this as a rotational transformation. We can do this using two properties: |
1292 | WDWMTT | . (Both of these properties have quite complex value sets, and we will not look at all of them here. See the |
1298 | WDWMTT | property takes as its value one or more of the transform functions, one of which is the function |
1304 | WDWMTT | Any rotation must take place clockwise around an axis positioned relative to the element being rotated, and the |
1306 | WDWMTT | property can be used to specify the pivot point. By default, the value of |
1310 | WDWMTT | , the point at the centre of the element, but these values can be changed to reflect rotation around a different origin point. (The TEI |
1316 | WDWMTT | A block of text may also be rotated about either of its other axes. For example, this shows rotation around the Y (vertical) axis: |
1330 | WDWMTT | which are both normally printed in a rotated form so that they represent a pair of wings: |
1351 | WDWMTT | We might also argue that this is in fact a vertical writing mode by supplying |
1353 | WDWMTT | as the value for the |
1357 | WDWMTT | Rotation is also useful as a method of handling a true writing mode which is not covered by the CSS Writing Modes: |
1359 | WDWMTT | . This is a writing mode common in inscriptions in Latin, Greek and other languages, in which alternate lines run from left to right and from right to left |
1360 | WDWMTT | The name is taken from the Greek βουστροφηδόν, meaning |
1364 | WDWMTT | ); that is, turning as an ox does when pulling a plough. |
1366 | WDWMTT | mirror writing |
1389 | WDWMTT | The 180-degree rotation around the Y (vertical) axis here describes what is happening in the RTL line in boustrophedon; the order of glyphs is reversed, and so is their individual orientation (in fact, we see them |
1390 | WDWMTT | from the back |
1395 | WDWMTT | in the sense of poetic lines; the text is continuous prose, and linebreaks are incidental. |
1397 | WDWMTT | There are obviously some unsatisfactory aspects of this manner of encoding boustrophedon. In the inscription above, some words run across linebreaks, so if we wished to tag both words and the right-to-left phenomena, one hierarchy would have to be privileged over the other. By using a transform function rather than a writing mode property, we are apparently suggesting that boustrophedon is not in fact a writing mode, whereas it clearly is. But the CSS Writing Modes specification does not provide support for boustrophedon, because it is a rather obscure historical phenomenon; using a rotational transform is one practical alternative. |
1405 | WDCAV | ; the language is designed to describe how an HTML document should be formatted. This is not, of course, the case for the TEI, which lacks any explicit processing or formatting model, and attempts to define objects as far as possible without consideration of their visual appearance. As long as the properties and values from the CSS Transforms module are used as a convenient, well-specified descriptive language to capture features of a text, without any expectation of using them directly and reliably for rendering, this is not particularly problematic. CSS provides a useful and well-defined vocabulary to describe many aspects of the appearance of source texts, benefitting particularly from the clarity of definition provided by the specification. However, if there is any expectation of using this information to render a text in a predictable and accurate way, it will be essential to provide enough styling information throughout the document hierarchy to resolve all ambiguities with regard to size, positioning, block status, etc. before any element undergoes a transform operation. |
1410 | WSD-DEF | The gaiji module described in this chapter makes available the following components: |
1413 | gaiji | Character and glyph documentation |
1422 | WSD-DEF | The selection and combination of modules to form a TEI schema is described in |
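
The rows above rely on two Unicode character properties: the bidirectional class (rows with id WDWM) and membership of the Private Use Area (rows with id D25-50). The following is only a minimal sketch, assuming Python's standard `unicodedata` module is an acceptable way to inspect them; the function names `bidi_strength` and `private_use_positions` are hypothetical helpers introduced here for illustration and are not part of the TEI Guidelines or of any TEI tool.

```python
import unicodedata

# Bidi classes grouped as in the Unicode Bidirectional Algorithm (UAX #9).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_strength(ch: str) -> str:
    """Classify one character as 'strong', 'weak', 'neutral', or 'other'.

    'other' covers explicit formatting characters and code points for
    which the Unicode database reports no bidi class.
    """
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    return "other"


def private_use_positions(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every Private Use Area character.

    PUA characters have the general category 'Co'. Listing them is a
    plausible first step before replacing them for interchange.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


if __name__ == "__main__":
    for ch in ["a", "\u05D0", "\u0642", "1", ",", " "]:
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), bidi_strength(ch))
    print(private_use_positions("y\uE000z"))
```

A check of this kind could be run over a transcription before interchange to list any locally assigned code points that remain in the text; how each one is then replaced is a matter for the encoder.
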
# | id | text |
---|---|---|
4 | TS | The module described in this chapter is intended for use with a wide variety of transcribed spoken material. It should be stressed, however, that the present proposals are not intended to support unmodified every variety of research undertaken upon spoken material now or in the future; some discourse analysts, some phonologists, and doubtless others may wish to extend the scheme presented here to express more precisely the set of distinctions they wish to draw in their transcriptions. Speech regarded as a purely acoustic phenomenon may well require different methods from those outlined here, as may speech regarded solely as a process of social interaction. |
6 | TS | This chapter begins with a discussion of some of the problems commonly encountered in transcribing spoken language (section |
8 | TS | documents some additional TEI header elements which may be used to document the recording or other source from which transcribed text is taken. Section |
10 | TS | of this chapter reviews further problems specific to the encoding of spoken language, demonstrating how mechanisms and elements discussed elsewhere in these Guidelines may be applied to them. |
21 | TSOV | of speech. Speech varies according to a large number of dimensions, many of which have no counterpart in writing (for example, tempo, loudness, pitch, etc.). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. Spoken material may be transcribed in the course of linguistic, acoustic, anthropological, psychological, ethnographic, journalistic, or many other types of research. Even in the same field, the interests and theoretical perspectives of different transcribers may lead them to prefer different levels of detail in the transcript and different styles of visual display. The production and comprehension of speech are intimately bound up with the situation in which speech occurs, far more so than is the case for written texts. A speech transcript must therefore include some contextual features; determining which are relevant is not always simple. Moreover, the ethical problems in recording and making public what was produced in a private setting and intended for a limited audience are more frequently encountered in dealing with spoken texts than with written ones. |
23 | TSOV | Speech also poses difficult structural problems. Unlike a written text, a speech event takes place in time. Its beginning and end may be hard to determine and its internal composition difficult to define. Most researchers agree that the utterances or |
25 | TSOV | of individual speakers form an important structural component in most kinds of speech, but these are rarely as well-behaved (in the structural sense) as paragraphs or other analogous units in written texts: speakers frequently interrupt each other, use gestures as well as words, leave remarks unfinished and so on. Speech itself, though it may be represented as words, frequently contains items such as vocalized pauses which, although only semi-lexical, have immense importance in the analysis of spoken text. Even non-vocal elements such as gestures may be regarded as forming a component of spoken text for some analytic purposes. Below the level of the individual utterance, speech may be segmented into units defined by phonological, prosodic, or syntactic phenomena; no clear agreement exists, however, even as to appropriate names for such segments. |
27 | TSOV | Spoken texts transcribed according to the guidelines presented here are organized as follows. The overall structure of a TEI spoken text is identical to that of any other TEI text: the |
29 | TSOV | element for a spoken text contains a |
33 | TSOV | element. Even texts primarily composed of transcribed speech may also include conventional front and back matter, and may even be organized into divisions like printed texts. |
39 | TSOV | as organizing unit for spoken material |
40 | TSOV | A spoken |
42 | TSOV | might typically be a conversation between a small number of people, a lecture, a broadcast TV item, or a similar event. Each such unit has associated with it a |
44 | TSOV | providing detailed contextual information such as the source of the transcript, the identity of the participants, whether the speech is scripted or spontaneous, the physical and social setting in which the discourse takes place and a range of other aspects. Details of the header in general are provided in chapter |
45 | TSOV | ; the particular elements it provides for use with spoken texts are described below ( |
46 | TSOV | ). Details concerning additional elements which may be used for the documentation of participant and contextual information are given in |
49 | TSOV | Defining the bounds of a spoken text is frequently a matter of arbitrary convention or convenience. In public or semi-public contexts, a text may be regarded as synonymous with, for example, a |
52 | TSOV | broadcast item |
54 | TSOV | meeting |
55 | TSOV | , etc. In informal or private contexts, a text may be simply a conversation involving a specific group of participants. Alternatively, researchers may elect to define spoken texts solely in terms of their duration in time or length in words. By default, these Guidelines assume of a text only that: |
61 | TSOV | it represents a single stretch of time with no significant discontinuities. |
66 | TSOV | element may take the value |
68 | TSOV | to specify that the components of the text are discrete) but is not recommended. |
72 | TSOV | it may be necessary to identify subdivisions of various kinds, if only for convenience of handling. The neutral |
79 | TSOV | A spoken text may contain any of the following components: |
87 | TSOV | kinesic (non-verbal, non-lexical) phenomena such as gestures |
91 | TSOV | writing, regarded as a special class of incident in that it can be transcribed, for example captions or overheads displayed during a lecture |
93 | TSOV | shifts or changes in vocal quality |
96 | TSOV | Elements to represent all of these features of spoken language are discussed in section |
101 | TSOV | ) may contain lexical items interspersed with pauses and non-lexical vocal sounds; during an utterance, non-linguistic incidents may occur and written materials may be presented. The |
107 | TSOV | A spoken text itself may be without substructure, that is, it may consist simply of units such as utterances or pauses, not grouped together in any way, or it may be subdivided. If the notion of what constitutes a |
108 | TSOV | text |
109 | TSOV | in spoken discourse is inevitably rather an arbitrary one, the notion of formal subdivisions within such a |
110 | TSOV | text |
112 | TSOV | text |
119 | TSOV | , provided only that the set of all such divisions is coextensive with the text. |
121 | TSOV | Each such division of a spoken text should be represented by the numbered or unnumbered |
124 | TSOV | . For some detailed kinds of analysis a hierarchy of such divisions may be found useful; nested |
126 | TSOV | elements may be used for this purpose, as in the following example showing how a collection made up of transcribed |
127 | TSOV | sound bites |
128 | TSOV | taken from speeches given by a politician on different occasions might be encoded. Each extract is regarded as a distinct |
148 | TSOV | attribute, for use where the divisions of a text do not all share the same set of the contextual declarations specified in the TEI header. (See further section |
154 | HD32 | Where a computer file is derived from a spoken text rather than a written one, it will usually be desirable to record additional information about the recording or broadcast which constitutes its source. Several additional elements are provided for this purpose within the source description component of the TEI header: |
168 | HD32 | Note that detailed information about the participants or setting of an interview or other transcript of spoken language should be recorded in the appropriate division of the profile description, discussed in chapter |
169 | HD32 | , rather than as part of the source description. The source description is used to hold information only about the source from which the transcribed speech was taken, for example, any script being read and any technical details of how the recording was produced. If the source was a previously-created transcript, it should be treated in the same way as any other source text. |
173 | HD32 | element should be used where it is known that one or more of the participants in a spoken text is speaking from a previously prepared script. The script itself should be documented in the same way as any other written text, using one of the three citation tags mentioned above. Utterances or groups of utterances may be linked to the script concerned by means of the |
192 | HD32 | is used to group together information relating to the recordings from which the spoken text was transcribed. The element may contain either a prose description or, more helpfully, one or more |
194 | HD32 | elements, each corresponding with a particular recording. The linkage between utterances or groups of utterances and the relevant recording statement is made by means of the |
201 | HD32 | element should be used to provide a description of how and by whom a recording was made. This information may be provided in the form of a prose description, within which such items as statements of responsibility, names, places, and dates may be identified using the appropriate phrase-level tags. Alternatively, a selection of elements from the |
212 | HD32 | Specialized collections may wish to add further sub-elements to these major components. These elements should be used only for information relating to the recording process itself; information about the setting or participants (for example) is recorded elsewhere: see sections |
251 | HD32 | When a recording has been made from a public broadcast, details of the broadcast itself should be supplied within the |
255 | HD32 | element. A broadcast is closely analogous to a publication and the |
263 | HD32 | . The broadcasting agency responsible for a broadcast is regarded as its author, while other participants (for example interviewers, interviewees, script writers, directors, producers, etc.) should be specified using the |
294 | HD32 | When a broadcast contains several distinct recordings (for example a compilation), additional |
318 | TSBA | The following elements characterize spoken texts, transcribed according to these Guidelines: |
323 | TSBA | element may appear directly within a spoken text, and may contain any of the others; the others may also appear directly (for example, a |
327 | TSBA | element. In terms of the basic TEI model, therefore, we regard the |
367 | TSBA | (for sounds produced by the human vocal apparatus), and |
377 | TSBA | incident |
383 | TSBA | kinesic |
389 | TSBA | vocal |
406 | TSBA | vocal events |
408 | TSBA | usually involuntary noises. Equally, the distinction between utterances and vocals is not always clear, although for many analytic purposes it will be convenient to regard them as distinct. Individual scholars may differ in the way borderlines are drawn and should declare their definitions in the |
410 | TSBA | element of the header (see |
413 | TSBA | The following short extract exemplifies several of these elements. It is recoded from a text originally transcribed in the CHILDES format. |
424 | TSBA | ). Non-verbal vocal effects such as the child's meowing are indicated either with orthographic transcriptions or with the |
426 | TSBA | element, and entirely non-linguistic but significant incidents such as the sound of the toy cat are represented by the |
470 | TSBA | This example also uses some elements common to all TEI texts, notably the |
472 | TSBA | tag for editorial regularization. Unusually stressed syllables have been encoded with the |
479 | TSBA | Contextual information is of particular importance in spoken texts, and should be provided by the TEI header of a text. In general, all of the information in a header is understood to be relevant to the whole of the associated text. The element |
490 | TSBAUT | Each distinct |
492 | TSBAUT | in a spoken text is represented by a |
500 | TSBAUT | attribute to associate the utterance with a particular speaker is recommended but not required. Its use implies as a further requirement that all speakers be identified by a |
504 | TSBAUT | element in the TEI header (see section |
505 | TSBAUT | ), but it may also point to another external source of information about the speaker. Where utterances or other parts of the transcription cannot be attributed with confidence to any particular participant or group of participants, the encoder may choose to create |
513 | TSBAUT | , and perhaps give the root |
517 | TSBAUT | value of |
519 | TSBAUT | , then point to those as appropriate using |
526 | TSBAUT | . The value specified applies to the transition from the preceding utterance into the utterance bearing the attribute. For example: |
527 | TSBAUT | For the most part, the examples in this chapter use no sentence punctuation except to mark the rising intonation often found in interrogative statements; for further discussion, see section |
541 | TSBAUT | , while there is a marked pause between |
552 | TSBAUT | An utterance may contain either running text, or text within which other basic structural elements are nested. Where such nesting occurs, the |
562 | TSBAUT | ; that is, a pause or shift (etc.) within an utterance is regarded as being produced by that speaker only, while a pause between utterances applies to all speakers. |
564 | TSBAUT | Occasionally, an utterance may seem to contain other utterances, for example where one speaker interrupts himself, or when another speaker produces a |
566 | TSBAUT | while they are still speaking. The present version of these Guidelines does not support nesting of one |
568 | TSBAUT | element within another. The transcriber must therefore decide whether such interruptions constitute a change of utterance, or whether other elements may be used. In the case of self-interruption, the |
570 | TSBAUT | element may be used to show that the speaker has changed the quality of their speech: |
589 | TSBAUT | Where this is not possible, it is simplest to regard the back-channel as a distinct utterance. |
594 | TSBAPA | Speakers differ very much in their rhythm and in particular in the amount of time they leave between words. The following element is provided to mark occasions where the transcriber judges that speech has been paused, irrespective of the actual amount of silence: |
595 | TSBAPA | A pause contained by an utterance applies to the speaker of that utterance. A pause between utterances applies to all speakers. The |
607 | TSBAPA | If detailed synchronization of pausing with other vocal phenomena is required, the alignment mechanism defined at section |
610 | TSBAPA | attribute mentioned in the previous section may also be used to characterize the degree of pausing between (but not within) utterances. |
619 | TSBAVO | attribute should be used to specify the person or group responsible for a |
625 | TSBAVO | which is contained within an utterance, if this differs from that of the enclosing utterance. The attribute must be supplied for a |
635 | TSBAVO | attribute may be used to indicate that the vocal, kinesic, or incident is repeated, for example |
641 | TSBAVO | , where what is being encoded is a shift in voice quality. For this last case, the |
662 | TSBAVO | element of the TEI header. |
694 | TSBAVO | The extent to which encoding of incidents or kinesics is included in a transcription will depend entirely on the purpose for which the transcription was made. As elsewhere, this will depend on the particular research agenda and the extent to which their presence is felt to be significant for the interpretation of spoken interactions. |
698 | TSBAWR | Written text may also be encountered when speech is transcribed, for example in a television broadcast or cinema performance, or where one participant shows written text to another. The |
700 | TSBAWR | element may be used to distinguish such written elements from the spoken text in which they are embedded. |
702 | TSBAWR | For example, if speaker A in the breakfast table conversation in section |
703 | TSBAWR | above had simply shown the newspaper passage to her interlocutor instead of reading it, the interaction might have been encoded as follows: |
712 | TSBAWR | If the source of the writing being displayed is known, bibliographic information about it may be stored in a |
716 | TSBAWR | element of the TEI header, and then pointed to using the |
739 | TSBATI | As noted above, utterances, vocals, pauses, kinesics, incidents, and writing elements all inherit attributes providing information about their position in time from the classes |
743 | TSBATI | . These attributes can be used to link parts of the transcription very exactly with points on a timeline, or simply to indicate their duration. Note that if |
749 | TSBATI | elements whose temporal distance from each other is specified in a timeline, then |
756 | TSBATI | ) may be used as an alternative means of aligning the start and end of timed elements, and is required when the temporal alignment involves points within an element. |
764 | TSSASH | A common requirement in transcribing spoken language is to mark positions at which a variety of prosodic features change. Many paralinguistic features (pitch, prominence, loudness, etc.) characterize stretches of speech which are not co-extensive with utterances or any of the other units discussed so far. One simple method of encoding such units is simply to mark their boundaries. An empty element called |
769 | TSSASH | element may appear within an utterance or a segment to mark a significant change in the particular feature defined by its attributes, which is then understood to apply to all subsequent utterances for the same speaker, unless changed by a new shift for the same feature in the same speaker. Intervening utterances by other speakers do not normally carry the same feature. For example: |
779 | TSSASH | is spoken loudly, the words |
791 | TSSASH | ); this list may be revised or supplemented using the methods outlined in section |
796 | TSSASH | attribute specifies the new state of the feature following the shift. If this attribute has the special value |
800 | TSSASH | A list of suggested values for each of the features proposed follows: |
814 | TSSASH | l |
825 | TSSASH | f |
834 | TSSASH | p |
860 | TSSASH | desc |
888 | TSSASH | legato, every syllable receiving more or less equal stress |
949 | TSSASH | A full definition of the sense of the values given for each feature should be provided in the encoding description section of the text header (see section |
965 | TSSA | This section describes the following features characteristic of spoken texts for which elements are defined elsewhere in these Guidelines: |
967 | TSSA | segmentation below the utterance level |
972 | TSSA | The elements discussed here are not provided by the module for spoken texts. Some of them are included in the core module and others are contained in the modules for linking and for analysis respectively. The selection of modules and their combination to define a TEI schema is discussed in section |
977 | TSSASE | For some analytic purposes it may be desirable to subdivide the divisions of a spoken text into units smaller than the individual utterance or turn. Segmentation may be performed for a number of different purposes and in terms of a variety of speech phenomena. Common examples include units defined both prosodically (by intonation, pausing, etc.) and syntactically (clauses, phrases, etc.). The term |
979 | TSSASE | has been used by a number of researchers to define units peculiar to speech transcripts. |
980 | TSSASE | The term was apparently first proposed by |
982 | TSSASE | A text can be analysed as a sequence of segments which are internally connected by a network of syntactic relations and externally delimited by the absence of such relations with respect to neighbouring segments. Such a segment is a syntactic unit called a macrosyntagm |
992 | TSSASE | attribute to specify the kind of segmentation applicable to a particular segment, if more than one is possible in a text. A full definition of the segmentation scheme or schemes used should be provided in the |
996 | TSSASE | element in the TEI header (see |
999 | TSSASE | In the first example below, an utterance has been segmented according to a notion of syntactic completeness not necessarily marked by the speech, although in this case a pause has been recorded between the two sentence-like units. In the second, the segments are defined prosodically (an acute accent has been used to mark the position immediately following the syllable bearing the primary accent or stress), and may be thought of as |
1017 | TSSASE | element in the header of the text should specify the principles adopted to define the segments marked in this way. |
1022 | TSSASE | may be used, either as an alternative or in addition to the more general purpose |
1059 | TSSASE | In this example, recoded from a corpus of language-impaired speech prepared by Fletcher and Garman, the speaker's utterance has been fully segmented into clausal ( |
1077 | TSSASE | has been used to define a particular characteristic of this corpus for which no element exists in the TEI scheme. See further chapter |
1078 | TSSASE | for a discussion of the way in which this kind of user-defined extension of the TEI scheme may be performed and chapter |
1081 | TSSASE | This example also uses the core elements |
1088 | TSSASE | It is often the case that the desired segmentation does not respect utterance boundaries; for example, syntactic units may cross utterance boundaries. For a detailed discussion of this problem, and the various methods proposed by these Guidelines for handling it, see chapter |
1091 | TSSASE | milestone |
1094 | TSSASE | tag discussed in section |
1097 | TSSASE | where several discontinuous segments are to be grouped together to form a syntactic unit (e.g. a phrasal verb with interposed complement), the |
1104 | TSSAPA | A major difference between spoken and written texts is the importance of the temporal dimension to the former. As a very simple example, consider the following, first as it might be represented in a playscript: |
1126 | TSSAPA | However, this neither allows us to indicate the extent to which Stig's utterance is overlapped, nor shows that there are in fact three things which are synchronous: the end of Jane's utterance, Stig's whole utterance, and Lou's kinesic. To overcome these problems, more sophisticated techniques, employing the mechanisms for pointing and alignment discussed in detail in section |
1127 | TSSAPA | , are needed. If the module for linking has been enabled (as described in section |
1137 | TSSAPA | should be consulted. The rest of the present section, which should be read in conjunction with that more detailed discussion, presents a number of ways in which these mechanisms may be applied to the specific problem of representing temporal alignment, synchrony, or overlap in transcribing spoken texts. |
1145 | TSSAPA | attribute associated with this anchor point specifies the identifiers of the other two elements which are to be synchronized with it: specifically, the second utterance ( |
1147 | TSSAPA | ) and the kinesic (k1). Note that one of these elements has content and the other is empty. |
1149 | TSSAPA | This example demonstrates only a way of indicating a point within one utterance at which it can be synchronized with another utterance and a kinesic. For more complex kinds of alignment, involving possibly multiple synchronization points, an additional element is provided, known as a |
1151 | TSSAPA | . This consists of a series of |
1161 | TSSAPA | This timeline represents four points in time, named TS-P1, TS-P2, TS-P6, and TS-P3 (as with all attributes named |
1163 | TSSAPA | in the TEI scheme, the names must be unique within the document but have no other significance). TS-P1 is located absolutely, at 12:20:01:01 BST. TS-P2 is 4.5 seconds later than TS-P1 (i.e. at 12:20:46). TS-P6 is at some unspecified time later than TS-P2 and previous to TS-P3 (this is implied by its position within the timeline, as no attribute values have been specified for it). The fourth point, TS-P3, is 1.5 seconds later than TS-P6. |
1165 | TSSAPA | One or more such timelines may be specified within a spoken text, to suit the encoder's convenience. If more than one is supplied, the |
1177 | TSSAPA | elements in a timeline are a fixed distance apart. |
1179 | TSSAPA | Three methods are available for aligning points or elements within a spoken text with the points in time defined by the |
1185 | TSSAPA | element as the value of one of the |
1207 | TSSAPA | For example, using the timeline given above: |
1269 | TSSAPA | Such conventions have the drawback that they are hard to generalize or to extend beyond the very simple case presented here. Their reliance on the accidentals of physical layout may also make them difficult to transport and to process computationally. These Guidelines recommend the following mechanisms to encode this. |
1297 | TSSAPA | (Note that if only the ordering or sequencing of utterances is needed, then specific timing information shown here in |
1326 | TSSAPA | To avoid deciding whether to point from the timeline to the text or vice versa, a |
1377 | TSREG | When speech is transcribed using ordinary orthographic notation, as is customary, some compromise must be made between the sounds produced and conventional orthography. Particularly when dealing with informal, dialectal, or other varieties of language, the transcriber will frequently have to decide whether a particular sound is to be treated as a distinct vocabulary item or not. For example, while in a given project |
1379 | TSREG | may not be worth distinguishing as a vocabulary item from |
1389 | TSREG | One rule of thumb might be to allow such variation only where a generally accepted orthographic form exists, for example, in published dictionaries of the language register being encoded; this has the disadvantage that such dictionaries may not exist. Another is to maintain a controlled (but extensible) set of normalized forms for all such words; this has the advantage of enforcing some degree of consistency among different transcribers. Occasionally, as for example when transcribing abbreviations or acronyms, it may be felt necessary to depart from conventional spelling to distinguish between cases where the abbreviation is spelled out letter by letter (e.g. |
1397 | TSREG | ). Similar considerations might apply to pronunciation of foreign words (e.g. |
1403 | TSREG | In general, use of punctuation, capitalization, etc., in spoken transcripts should be carefully controlled. It is important to distinguish the transcriber's intuition as to what the punctuation should be from the marking of prosodic features such as pausing, intonation, etc. |
1411 | TSTPPR | In the absence of conventional punctuation, the marking of prosodic features assumes paramount importance, since these structure and organize the spoken message. Indeed, such prosodic features as points of primary or secondary stress may be represented by specialized punctuation marks, or other characters such as those provided by the Unicode Spacing Modifier Letters block. Pauses have already been dealt with in section |
1412 | TSTPPR | ; while tone units (or intonational phrases) can be indicated by the segmentation tag discussed in section |
1418 | TSTPPR | In a more detailed phonological transcript, it is common practice to include a number of conventional signs to mark prosodic features of the surrounding or (more usually) preceding speech. Such signs may be used to record, for example, particular intonation patterns, truncation, vowel quality (long or short) etc. These signs may be preserved in a transcript either by using conventional punctuation or by marking their presence by |
1426 | TSTPPR | of the TEI header |
1441 | TSTPPR | These declarations might additionally provide information about how the characters concerned should be rendered, their equivalent IPA form, etc. In the transcript itself references to them can then be included as follows: |
1493 | TSTPPR | This example, which is taken from a corpus of bookshop service encounters, |
1499 | TSTPPR | . Where words are so unclear that only their extent can be recorded, the empty |
1506 | TSTPPR | For more detailed work, involving a detailed phonological transcript including representation of stress and pitch patterns, it is probably best to maintain the prosodic description in parallel with the conventional written transcript, rather than attempt to embed detailed prosodic information within it. The two parallel streams may be aligned with each other and with other streams, for example an acoustic encoding, using the general alignment mechanisms discussed in section |
1515 | TSTPSM | above), or to transcribe them using IPA or some other transcription system. To simplify analysis of the lexical features of a speech transcript, it may be felt useful to |
1518 | TSTPSM | , to make explicit the extent of regularization or normalization performed by the transcriber. |
1544 | TSTPSM | element may be used to indicate both the original and a corrected form of it: |
1554 | TSTPSM | , where a speaker switches from one language to another, may easily be represented in a transcript by using the |
1556 | TSTPSM | element provided by the core tagset: |
1571 | TSTPAC | The recommendations made here only concern the establishment of a basic text. Where a more sophisticated analysis is needed, more sophisticated methods of markup will also be appropriate, for example, using stand-off markup to indicate multiple segmentation of the stream of discourse, or complex alignment of several segments within it. Where additional annotations (sometimes called |
1575 | TSTPAC | ) are used to represent such features as linguistic word class (noun, verb, etc.), type of speech act (imperative, concessive, etc.), or information status (theme/rheme, given/new, active/semi-active/new), etc., a selection from the general purpose analytic tools discussed in chapters |
1597 | TS | The selection and combination of modules to form a TEI schema is described in |
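
The timeline description in row 1163 (TSSAPA) of the table above can be illustrated with a small calculation. The following is only a sketch under stated assumptions: the tuple representation and the function name `resolve_offsets` are hypothetical and are not part of any TEI tool, and TS-P2 is assumed to be measured from TS-P1. It shows that a point whose interval is measured from an unlocated point (TS-P3, measured from TS-P6) cannot itself be located relative to the origin, even though its distance from that point is known.

```python
from typing import Dict, List, Optional, Tuple

# Each point is (identifier, identifier it is measured from, interval in seconds).
# This tuple layout is a hypothetical stand-in for the timeline points described
# in the text; it is not a TEI data structure.
Point = Tuple[str, Optional[str], Optional[float]]


def resolve_offsets(points: List[Point]) -> Dict[str, Optional[float]]:
    """Return each point's offset in seconds from the first point, or None if unknown."""
    offsets: Dict[str, Optional[float]] = {}
    for index, (ident, since, interval) in enumerate(points):
        if index == 0 and since is None:
            # The first point is taken as the origin (it is located absolutely).
            offsets[ident] = 0.0
        elif since is not None and interval is not None and offsets.get(since) is not None:
            # Known reference point plus a known interval gives a known offset.
            offsets[ident] = offsets[since] + interval
        else:
            # No interval given, or the reference point itself is unlocated.
            offsets[ident] = None
    return offsets


timeline: List[Point] = [
    ("TS-P1", None, None),     # located absolutely; used here as the origin
    ("TS-P2", "TS-P1", 4.5),   # assumption: measured from TS-P1
    ("TS-P6", None, None),     # unspecified; only its ordering is known
    ("TS-P3", "TS-P6", 1.5),   # 1.5 seconds after TS-P6
]

result = resolve_offsets(timeline)
for ident, offset in result.items():
    print(ident, offset)
```

Only TS-P1 and TS-P2 receive numeric offsets here; TS-P6 and TS-P3 remain unresolved, which matches the prose statement that TS-P6 is ordered but not timed.
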
# | id | text |
---|---|---|
5 | FT | In addition to graphic images, documents often contain material presented in graphical or tabular format. In such materials, details of layout and presentation may also be of comparatively greater significance or complexity than they are for running text. Indeed, it may often be difficult to make a clear distinction between details relating purely to the rendition of information and those relating to the information itself. |
13 | FT | As with text markup in general, many incompatible formats have been proposed for the representation of graphics, formulæ, and tables in electronic form. Unfortunately, no single format as effective as XML in the domain of text has yet emerged for their interchange, to some extent because of the difficulty of representing the information these data formats convey independently of the way it is rendered. |
15 | FT | The module described in this chapter defines special purpose |
20 | FT | . Specific recommendations for the encoding of graphic figures may be found in section |
21 | FT | . The rest of the chapter is devoted to general problems of encoding graphic information. |
23 | FT | There is at the time of writing no consensus on formats for graphical images, and such formats vary in many ways. We therefore provide (in section |
25 | FT | ) a list of formal names for those representations most popular at this time. Each one includes a very brief description. These Guidelines recommend a few particular representations as being the most widely supported and understood. |
29 | FTTAB | A table is the least |
30 | FTTAB | graphic |
31 | FTTAB | of the elements discussed in this chapter. Almost any text structure can be presented as a series of rows and columns: one might, for example, choose to show a glossary or other form of list in tabular form, without necessarily regarding it as a table. In such cases, the global |
33 | FTTAB | attribute is an appropriate way of indicating that some element is being presented in tabular format, for example by using an appropriate display property in CSS. When tabular presentation is regarded as of less intrinsic importance, it is correspondingly simpler to encode descriptive or functional information about the contents of the table, for example to identify one cell as containing a name and another as containing a date, though the two methods may be combined. |
35 | FTTAB | When, however, particular elements are required to encode the tabular arrangement itself, then one or other of the various |
36 | FTTAB | table schemas |
37 | FTTAB | now available may be preferable. The schemas in common use generally view a table as a special text element, made up of row elements, themselves composed of cells. |
38 | FTTAB | Table cells generally appear in row-major order, with the first row from left to right, then the second row, and so on. Details of appearance such as column widths, border lines, and alignment are generally encoded by numerous attributes. Beyond this, however, such schemas differ greatly. This section begins by describing a table schema of this kind; a brief summary of some other widely available table schemas is also provided in section |
41 | FTTAB1 | TEI Tables |
43 | FTTAB1 | For encoding tables of low to moderate complexity, these Guidelines provide the following special purpose elements: |
52 | FTTAB1 | It is to a large extent arbitrary whether a table should be regarded as a series of rows or as a series of columns. For compatibility with currently available systems, however, these Guidelines require a row-by-row description of a table. It is also possible to describe a table simply as a series of cells; this may be useful for tabular material which is not presented as a simple matrix. |
58 | FTTAB1 | may be used to indicate the size of a table, or to indicate that a particular cell or row of a table spans more than one row or column. For both tables and cells, rows and columns are always given in top-to-bottom, left-to-right order, although formatting properties such as those provided by CSS may be used to specify that they should be displayed differently. These Guidelines do not require that the size of a table be specified; for most formatting and many other applications, it will be necessary to process the whole table in two passes in any case. |
60 | FTTAB1 | Where cells span more than one column or row, the encoder must determine whether this is a purely presentational effect (in which case the |
62 | FTTAB1 | attribute may be more appropriate), whether the part of the table affected would be better treated as a nested table, or whether to use the spanning attributes listed above. |
66 | FTTAB1 | attribute may be used to categorize a single cell, or set a default for all the cells in a given row. The present Guidelines distinguish the roles of |
67 | FTTAB1 | label |
73 | FTTAB1 | numeric |
85 | FTTAB1 | The following simple example demonstrates how the data presented as a labelled list in section |
128 | FTTAB1 | The following example demonstrates how a simple statistical table may be represented using this scheme: |
184 | FTTAB1 | Note the use of a blank cell in the first row to ensure that the column labels are correctly aligned with the data. Again, this encoding does not explicitly represent the alignment between column and row labels and the data to which they apply. Where the primary emphasis of an encoding is on the semantic content of a table, a more explicit mechanism for the representation of structured information such as that provided by the feature structure mechanism described in chapter |
185 | FTTAB1 | may be preferred. Alternatively, the general purpose linkage and alignment mechanisms described in chapter |
188 | FTTAB1 | The content of a table cell need not be simply character data. It may also contain any sequence of the phrase-level elements described in chapter |
189 | FTTAB1 | , thus allowing for the encoding of potentially more useful semantic information, as in the following example, where the fact that one cell contains a number and the other contains a place name has been explicitly recorded: |
255 | FTTAB1 | The content of table elements is not limited to |
269 | FTTAB1 | provide options for including text which is clearly part of the table, but outside the actual tabular layout. This example shows the use of |
308 | FTTAB2 | Many authoring systems include built-in support for their own or for public table schemas. These provide an enhanced user interface and good formatting capabilities, but are often product-specific, despite their use of an XML markup language. |
310 | FTTAB2 | The DTD developed by the Association of American Publishers (AAP) and standardized in ANSI Z39.59 provided a very simple encoding for correspondingly simple tables. This has been further developed, together with the table DTD documented in ISO Technical Report 9537, and now forms part of ISO 12083. The TEI table model described above has functionality very similar to that defined by ISO 12083. |
312 | FTTAB2 | For more complex tables, the most effective publicly-available DTD is probably that developed by the US Department of Defense CALS project. This supports vertical and horizontal spanning and various kinds of text rotation and justification within cells and is also directly supported by a number of existing XML software systems. |
314 | FTTAB2 | The CALS table model is much too complex to describe fully here; for historical background see |
316 | FTTAB2 | . As with any other XML vocabulary, the XML version of the CALS model may readily be included in a TEI schema, using the techniques described in |
321 | FTTAB2 | The XHTML table model ( |
322 | FTTAB2 | ) is based on the HTML table model ( |
323 | FTTAB2 | ). Both models support arrangement of arbitrary data into rows and columns of cells. Table rows and columns may be grouped to convey additional structural information and may be rendered by user agents in ways that emphasize this structure. Support for incremental rendering of tables and for rendering on |
327 | FTTAB2 | ). Stylesheets provide a far more effective means of controlling layout and other visual characteristics in both HTML and XML documents. |
332 | FTFOR | Mathematical and chemical formulæ pose problems similar to those posed by tables in that rendition may be of great significance and hard to disentangle from content. They also require access to a wide range of special characters, for most of which standard entity names already exist in the documented ISO entity sets (see further chapters |
338 | FTFOR | The AAP and ISO standards mentioned in section |
339 | FTFOR | above both provide DTDs for equations as well as for tables, which now form part of ISO 12083. The European Mathematical Trust, an organization set up specifically to enhance research support for European mathematicians, has also defined a general purpose mathematical DTD known as EuroMath ( |
342 | FTFOR | Most if not all of the functionality provided by these DTDs can now be found in the OpenMath and MathML XML-based systems briefly described below. |
344 | FTFOR | As with tables, in all the XML solutions a tension exists between the need to encode the way a formula is written (its appearance) and the need to represent its semantics. If the object of the encoding is purely to act as an interchange format among different formatting programs, then there is no need to represent the mathematical meaning of an expression. If, however, the object is to use the encoding as input to an algebraic manipulation system (such as Mathematica or Maple) or a database system, then simply representing superscripts and subscripts will clearly be inadequate. |
346 | FTFOR | The present Guidelines make no attempt to add to the number of available DTDs for representing formulæ. Instead, we recommend that the user make an informed choice from those already available. The module described in this chapter makes available only the following element, which should be used to encode any formula, no matter what notation is employed: |
357 | FTFOR | must be escaped with entity references or numeric character references, e.g. |
361 | FTFOR | If desired, the content of the |
366 | FTFOR | When the content of a |
377 | FTFOR | attribute supplies the name of a notation ( |
389 | FTFOR | structure of an expression. Most of its content elements correspond with the range of operators, relations, and named functions typically found at the high-school level of mathematics. The tortoise example given above in TeX can be re-expressed in MathML as |
443 | FTFOR | MathML 2.0 provides support for a |
463 | FTFOR | Encodings, both binary ( |
467 | FTFOR | OpenMath and MathML have certain common aspects. They both use prefix operators, both are XML-based and they both construct their objects by applying certain rules recursively. Such similarities facilitate mapping between the two standards. There are also some key differences between MathML and OpenMath. OpenMath does not provide support for presentation of mathematical objects and its scope of semantically-oriented elements is much broader than that of MathML, with the expressive power to cover virtually all areas of computational mathematics. In fact, a particular set of Content Dictionaries, the |
472 | FTFOR | ) is an extension of the OpenMath standard that supplies markup for structures such as axioms, theorems, proofs, definitions, texts (mixing formal content with mathematical text). |
474 | FTFOR | In-line versus block placement for an equation can be distinguished if desired, via the global |
480 | FTFOR | attributes may also be used to label or identify the formula, as in the following example: |
525 | FTNM | Music, like many other art forms, is often mentioned, discussed and described in writings of various kinds. This applies to both historical and contemporary documents, even though methods of notating music have changed considerably in western history. In most cases, music notation enters the text flow in a way similar to figures, images or graphs. On other occasions, elements of music notation are treated as inline characters in running text. |
528 | FTNM | provides a way to signal the presence of music notation in text, but defers to other representations, which are not covered by the TEI Guidelines, to describe the music notation itself. In fact, several commercial, academic and standard bodies have developed digital representations of music notation, and given the topic's complexity, these representations often focus on different aspects and adopt different methodologies. Therefore, |
530 | FTNM | only defines a container element to encode the occurrence of music notation and allows linking to the data format preferred by the encoder. (Note: |
553 | FTNM | can be used to indicate the location of a representation of the music notation. |
556 | FTNM | supplies the MIME type of the data format, when available. |
566 | FTNM | can be used to indicate the location of a graphical representation of the music notation. |
570 | FTNM | provides encoded binary data which constitutes another representation of the music notation (e.g. audio). |
581 | FTNM | supplies the MIME type of the data format when available. For example: |
597 | FTNM | It is possible to link to any kind of music notation data format. However, when a MIME type is not available, it is recommended that the format be specified in the description. See the following examples. |
620 | FTNM | It is possible to specify the location of digital objects representing the notated music in other media such as images or audio-visual files. The interpretation of the correspondence between the notated music and these digital objects is not encoded explicitly. We recommend the use of |
624 | FTNM | mainly as a fallback mechanism when the notated music format is not displayable by the application using the encoding. The alignment of encoded notated music, images carrying the notation, and audio files is a complex matter for which we refer the reader to other formats and specifications such as |
634 | FTNM | In modern printing, music notation positioned between blocks of text for illustrative purposes is usually referred to as a |
635 | FTNM | figure |
674 | FTGRA | The following special purpose elements are used to indicate the presence of graphic images within a document: |
685 | FTGRA | elements form part of the common core module, and are discussed in section |
694 | FTGRA | attribute provides the location of an image. For example: |
696 | FTGRA | Three kinds of content may be supplied inside a |
700 | FTGRA | may be used to transcribe (or supply) a descriptive heading or title for the graphic itself as in this example: |
703 | FTGRA | Figures are often accompanied not only by a title or heading (a caption), but by a paragraph or so of commentary (a legend) following the caption. One or more |
708 | FTGRA | may be used to transcribe any commentary on the figure in the source: |
718 | FTGRA | Here, the figure contains a heading |
722 | FTGRA | . Both of these are transcribed from the source, while the description is provided by the encoder, for use by applications which cannot display the graphic directly. In documents created in electronic form with the needs of print-handicapped readers in mind, the |
724 | FTGRA | element may be provided by the author rather than a subsequent encoder. |
731 | FTGRA | Where the graphic itself contains large amounts of text, perhaps with a complex structure, and perhaps difficult to distinguish from the graphic, the encoder should choose whether to regard the graphic as containing the text (in which case, a nested |
735 | FTGRA | element) or to regard the enclosed text as being a separate division of the |
737 | FTGRA | element in which the graphic appears. In this latter case, an appropriate |
741 | FTGRA | (etc.) element may be used for the text represented within the graphic, and the |
743 | FTGRA | element embedded within it. The choice will depend to a large degree on the encoder's understanding of the relationship between the graphic and the surrounding text. |
745 | FTGRA | A figure which is internally divided, or contains sub-figures, may be encoded with nested |
766 | FTGRA | Like any other element in the TEI scheme, figures may be given identifiers so that they can be aligned with other elements, and linked to or from them, as described in chapter |
771 | FTGRA | version which, when selected by the user, causes the other, high resolution, version to be accessed. In TEI terms, the thumbnail image acts as a |
773 | FTGRA | to the other. Supposing that a thumbnail version of the figure discussed above is available as |
786 | FTGRA | . When the module for transcription is included in a schema, specific attributes for parts of a text and parts (or all) of a digital image are available; these are discussed in |
792 | FTGRA | with chapter two of some text, and another portion of it with chapter three. The application may be thought of as a hypertext browser in which the user selects from a graphic image which part of a text to read next, but the mechanism is independent of this particular application. |
794 | FTGRA | The first requirement is some way of identifying and hence pointing to sub-parts of a graphic image. This may be done by pointing into an XML graphic representation, for example an SVG file. Thus |
815 | FTGRA | The next requirement is some way of identifying the parts of the document to which a link is to be made. The most obvious way of doing this is to use the global |
824 | FTGRA | Now, all that is needed to link these areas to the relevant chapters is a |
833 | FTGRA | In this example, the SVG representation of the graphic is stored externally to the TEI document and linked by means of a pointer. It is also possible to embed the SVG representation directly within the TEI by extending the content model of the |
837 | FTGRA | from the SVG namespace. Like other customizations of the TEI scheme, this is carried out using the techniques documented in section |
848 | FTGROV | The first major distinction in graphic representation is that between raster graphics and vector graphics. A |
850 | FTGROV | is a list of points, or dots. Scanners, fax machines and other simple devices easily produce digital raster images, and such images are therefore quite common. A |
852 | FTGROV | , in contrast, is a list of geometrical objects, such as lines, circles, arcs, or even cubes. These are much more difficult to produce, and so are mainly encountered as the output of sophisticated systems such as architectural and engineering CAD programs. |
854 | FTGROV | Raster images are difficult to modify because by definition they only encode single points: a line, for example, cannot grow or shrink as such, since it is not identified as such. Only its component parts are identified, and only they can be manipulated. Therefore the resolution or dot-size of a raster image is important, which is not the case with vector images. It is also far more difficult to convert raster images to vector images than to perform the opposite conversion. Raster images generally require more storage space than vector images, and a wide variety of methods exists for compressing them; the variation in these methods leads to corresponding variations in representations for storage and transmission of raster images. |
856 | FTGROV | Motion video usually consists of a long series of raster images. Data compression is even more effective on video than on single raster images (mainly owing to redundancy which arises from the usual similarity of adjacent frames). Notations for representing full-motion video are hotly debated at this time, and any user of these Guidelines would do well to obtain up-to-date expert advice before undertaking a project using them. |
864 | FTGROV | save space by discarding a small portion of the image's detail, such as fine distinctions of shading. When decompressed, therefore, such an image will be only a close approximation of the original. In contrast, |
866 | FTGROV | guarantees that the exact uncompressed image will be reproducible from the compressed form: only truly redundant information is removed. In general, therefore, lossless compression does not save quite so much space as lossy compression, though it does guarantee fidelity to the original uncompressed image. |
870 | FTGROV | , which is the number of dots per inch used to represent the image. Doubling the resolution will give a more precise image, but also quadruple the storage requirement (before compression), and affect processing time for any operations to be performed, such as displaying an image for a reader. Motion video also has resolution in time: the number of frames to be shown per second. Encoders should consider carefully what resolution(s) and frame rate(s) to use for particular applications; these Guidelines express no recommendation in this matter, save the universal ones of consistency and documentation. |
872 | FTGROV | Within any image, it is typical to refer to locations via Cartesian coordinate axes: values for x, y, and sometimes z and/or time. However, graphic notations vary in whether coordinates count from left-to-right and top-to-bottom, or another way. They also vary in whether coordinates are considered real (inches, millimeters, and so on), or virtual (dots). These Guidelines do not recommend any of these methods over another, but all decisions made should be applied consistently, and documented in the |
874 | FTGROV | section of the TEI header. |
875 | FTGROV | Since no special purpose element is provided for this purpose by the current version of the Guidelines, such information should be provided as one or more distinct paragraphs at the end of the |
880 | FTGROV | Methods of aligning images and text are discussed in |
885 | FTGROV | images, each point is rendered in some shade of gray, the number of shades varying from system to system. In true polychrome images, points are rendered in different hues, again with varying limitations affecting the number of distinct shades and the means by which they are displayed. |
889 | FTGRNO | As noted above, there exists a wide variety of different graphics formats, and the following list is in no way exhaustive. Moreover, inclusion of any format in this list should not be taken as indicating endorsement by the TEI of this format or any products associated with it. Some of the formats listed here are proprietary to a greater or lesser extent and cannot therefore be regarded as standards in any meaningful sense. They are however widely used by many different vendors. |
920 | FTGRNO | Brief descriptions of all the above are given below. Where possible, current addresses or other contact information are shown for the originator of each format. Many formal standards, especially those promulgated by ISO and many related national organizations (ANSI, DIN, BSI, and many more), are available from those national organizations. Addresses may be found in any standard organizational directory for the country in question. |
930 | FTGRAVGF | SVG is a language for describing two-dimensional vector and mixed vector or raster graphics in XML. It is defined by the Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, 04 September 2001, and is available at |
946 | FTGRARGF | Currently the most widely supported raster image format, especially for black and white images, TIFF is also one of the few formats commonly supported on more than one operating system. The drawback to TIFF is that it actually is a wrapper for several formats, and some TIFF-supporting software does not support all variants. TIFF files may use LZW, CCITT Group 4, or PackBits compression methods, or may use no compression at all. Also, TIFF files may be monochrome, grayscale, or polychromatic. All such options should be specified in prose at the end of the |
948 | FTGRARGF | section of the TEI header for any document including TIFF images. TIFF is owned by Aldus Corporation. Documentation on TIFF is available from them at Craigcook Castle, Craigcook Road, Edinburgh EH4 3UH, Scotland, or 411 First Avenue South, Seattle, Washington 98104 USA. |
954 | FTGRARGF | PBM files are easy to process, eschewing all compression in favor of transparency of file format. PBM files can, of course, be compressed by generic file-compression tools for storage and transfer. Public domain software exists which will convert many other formats to and from PBM. Documentation on PBM is copyright by Jeff Poskanzer, and is available widely on the Internet. |
970 | FTGRAMPEG | This standard is sponsored by CCITT and by ISO. It is ISO/IEC Draft International Standard 10918-1, and CCITT T.81. It handles monochrome and polychromatic images with a variety of compression techniques. JPEG per se, like CCITT Group IV, must be encapsulated before transmission; this can be done via TIFF, or via the JPEG File Interchange Format (JFIF), as commonly done for Internet delivery. |
982 | FTGRAMPEG | SMIL is a W3C Recommendation which supports the integration of independent multimedia objects into a synchronized multimedia presentation. It provides multimedia authors with easily-defined basic timing relationships, fine-tuned synchronization, spatial layout, direct inclusion of non-text and non-image media objects, hyperlink support for time-based media, and adaptiveness to varying user and system characteristics. SMIL 1.0 ( |
983 | FTGRAMPEG | ) became a W3C Recommendation on June 15, 1998, and was further developed in SMIL 2.0. SMIL 2.0 adds native support for transitions, animation, event-based interaction, extended layout facilities, and more sophisticated timing and synchronization primitives to the SMIL 1.0 language. It also allows reuse of SMIL syntax and semantics in other XML-based languages, in particular those which need to represent timing and synchronization. For example, SMIL 2.0 components are used for integrating timing into XHTML Document Types and into SVG. SMIL 2.0 also provides recommendations for Document Types based on SMIL 2.0 Modules ( |
985 | FTGRAMPEG | ). It contains support for all of the major SMIL 2.0 features including animation, content control, layout, linking, media object, meta-information, structure, timing, and transition effects and is designed for Web clients that support direct playback from SMIL 2.0 markup. SMIL 2.0 ( |
986 | FTGRAMPEG | ) became a W3C Recommendation on August 7, 2001, becoming the first vocabulary to provide XML Schema support and to have reached such status. |
997 | figures | Tables, formulæ, notated music, and figures |
1009 | FT | The selection and combination of modules to form a TEI schema is described in |
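
The FTGROV rows above state that doubling the resolution of a raster image quadruples its storage requirement before compression. The arithmetic can be sketched as follows; this is only an illustration, and the function name `raw_raster_bytes`, the page size, and the bit depth are assumptions made for the example, not anything defined by the Guidelines.

```python
def raw_raster_bytes(width_in: float, height_in: float, dpi: int,
                     bits_per_pixel: int = 8) -> int:
    """Uncompressed size in bytes of a raster image sampled at `dpi` dots per inch.

    Hypothetical helper for illustration only: it ignores file headers,
    row padding, and compression, all of which vary by format.
    """
    width_px = round(width_in * dpi)    # dots across
    height_px = round(height_in * dpi)  # dots down
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8        # round up to whole bytes


# An assumed 8 x 10 inch grayscale page at two resolutions.
base = raw_raster_bytes(8.0, 10.0, dpi=300)
doubled = raw_raster_bytes(8.0, 10.0, dpi=600)
print(base, doubled, doubled / base)
```

Because both dimensions scale with the resolution, the dot count (and hence the raw size) grows with its square, which is why the choice of resolution deserves the careful consideration and documentation the text asks for.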
# | id | text |
---|---|---|
7 | AB | They make recommendations about suitable ways of representing those features of textual resources which need to be identified explicitly in order to facilitate processing by computer programs. In particular, they specify a set of markers (or |
9 | AB | ) which may be inserted in the electronic representation of the text, in order to mark the text structure and other features of interest. Many, or most, computer programs depend on the presence of such explicit markers for their functionality, since without them a digitized text appears to be nothing but a sequence of undifferentiated bits. The success of the World Wide Web, for example, is partly a consequence of its use of such markup to indicate such features as headings and lists on individual pages, and to indicate links between pages. The process of inserting such explicit markers for implicit textual features is often called |
13 | AB | ; the term |
15 | AB | is also used informally. We use the term |
18 | AB | markup language |
19 | AB | to denote the complete set of rules associated with the use of markup in a given context; we use the term |
21 | AB | for the specific set of markers or named distinctions employed by a given encoding scheme. Thus, this work both describes the TEI encoding scheme, and documents the TEI markup vocabulary. |
23 | AB | The TEI encoding scheme is of particular usefulness in facilitating the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software. Since they contain an inventory of the features most often deployed for computer-based text processing, the Guidelines are also useful as a starting point for those designing new systems and creating new materials, even where interchange of information is not a primary objective. |
25 | AB | These Guidelines apply to texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content. They treat both continuous materials ( |
26 | AB | running text |
27 | AB | ) and discontinuous materials such as dictionaries and linguistic corpora. Though principally directed to the needs of the scholarly research community, the Guidelines are not restricted to esoteric academic applications. They are also useful for librarians maintaining and documenting electronic materials, and for publishers and others creating or distributing electronic texts. Although they focus on problems of representing in electronic form texts which already exist in traditional media, these Guidelines are also applicable to textual material which is |
31 | AB | The rules and recommendations made in these Guidelines are expressed in terms of what is currently the most widely-used markup language for digital resources of all kinds: the Extensible Markup Language (XML), as defined by the World Wide Web Consortium's XML Recommendation. However, the TEI encoding scheme itself does not depend on this language; it was originally formulated in terms of SGML (the ISO Standard Generalized Markup Language), a predecessor of XML, and may in future years be re-expressed in other ways as the field of markup develops and matures. For more information on markup languages see chapter |
35 | AB | This document provides the authoritative and complete statement of the requirements and usage of the TEI encoding scheme. As such, although it includes numerous small examples, it must be stressed that this work is intended to be a reference manual rather than a tutorial guide. |
37 | AB | The remainder of this chapter comprises three sections. The first gives an overview of the structure and notational conventions used throughout these Guidelines. The second enumerates the design principles underlying the TEI scheme and the application environments in which it may be found useful. Finally, the third section gives a brief account of the origins and development of the Text Encoding Initiative itself. |
41 | ABSTRUNC | The remaining two sections of the front matter to the Guidelines provide background tutorial material for those unfamiliar with basic markup technologies. Following the present introductory section, we present a detailed introduction to XML itself, intended to cover in a relatively painless manner as much as the novice user of the TEI scheme needs to know about markup languages in general and XML in particular. This is followed by a discussion of the general principles underlying current practice in the representation of different languages and writing systems in digital form. This chapter is largely intended for the user unfamiliar with the Unicode encoding systems, though the expert may also find its historical overview of interest. |
43 | ABSTRUNC | The body of this edition of the Guidelines proper contains 23 chapters arranged in increasing order of specialist interest. The first five chapters discuss in depth matters likely to be of importance to anyone intending to apply the TEI scheme to virtually any kind of text. The next seven focus on particular kinds of text: verse, drama, spoken text, dictionaries, and manuscript materials. The next nine chapters deal with a wide range of topics, one or more of which are likely to be of interest in specialist applications of various kinds. The last two chapters deal with the XML encoding used to represent the TEI scheme itself, and provide technical information about its implementation. The last chapter also defines the notion of TEI conformance and its implications for interchange of materials produced according to these Guidelines. |
45 | ABSTRUNC | As noted above, this is a reference work, and is not intended to be read through from beginning to end. However, the reader wishing to understand the full potential of the TEI scheme will need a thorough grasp of the material covered by the first four chapters and the last two. Beyond that, the reader is recommended to select according to their specific interests: one of the strengths of the TEI architecture is its modular nature. |
47 | ABSTRUNC | As far as possible, extensive cross referencing is provided wherever related topics are dealt with; these are particularly effective in the online version of the Guidelines. In addition, a series of technical appendixes provide detailed formal definitions for every element, every class, and every macro discussed in the body of the work; these are also cross linked as appropriate. Finally, a detailed bibliography is provided, which identifies the source of many examples cited in the text as well as documenting works referred to, and listing other relevant publications. |
49 | ABSTRUNC | As an aid to the reader, most chapters of these Guidelines follow the same basic organization. The chapter begins with an overview of the subjects treated within it, linked to the following subsections. Within each section where new elements are described, a summary table is first given, which provides their names and a brief description of their intended usage. This is then followed where appropriate by further discussion of each element, including wherever possible usage examples taken somewhat eclectically from a variety of real sources. These examples are not intended to be exhaustive, but rather to suggest typical ways in which the elements concerned may usefully be applied. Where appropriate, a link to a statement of the source for most examples is provided in the online version. Within the examples, use of whitespace such as newlines or indentation is simply intended to aid legibility, and is not prescriptive or normative. |
51 | ABSTRUNC | Wherever TEI elements or classes are mentioned in the text, they are linked in the online version to the relevant reference specification for the element or class concerned. Element names are always given in the form |
54 | ABSTRUNC | name |
61 | ABSTRUNC | include a closing slash to distinguish them wherever they are discussed. References to attributes take the form |
65 | ABSTRUNC | is the name of the attribute. References to classes are also presented as links, for example |
73 | AB-namecon | TEI Naming Conventions |
75 | AB-namecon | These Guidelines use a more or less consistent set of conventions in the naming of XML elements and classes. This section summarizes those conventions. |
80 | AB-namecon | An unadorned name such as |
82 | AB-namecon | is the name of a TEI element or attribute. |
83 | AB-namecon | During generation of TEI RelaxNG schema fragments, the patterns corresponding with these TEI names are given a prefix |
84 | AB-namecon | tei |
85 | AB-namecon | to allow them to co-exist with names from other XML namespaces. This prefix is not visible to the end user, and is not used in TEI documentation. When generating multi-namespace schemas, however, the user needs to be aware of it. |
88 | AB-namecon | The following conventions apply to the choice of names: |
94 | AB-namecon | Where an element name contains more than one token, the first letter of the second token, and of any subsequent ones, is capitalized, as in, for example, |
104 | AB-namecon | The specification for an element or attribute whose name contains abbreviations generally also includes a |
106 | AB-namecon | element providing the expanded sense of the name. |
110 | AB-namecon | element; this is not however generally done in TEI P5. |
116 | AB-namecon | att |
120 | AB-namecon | bibl |
126 | AB-namecon | category, especially as used in text classification |
128 | AB-namecon | char |
134 | AB-namecon | document: this usually refers to the original source document which is being encoded, |
138 | AB-namecon | declaration: has a specific sense in the TEI Header, as discussed in |
140 | AB-namecon | desc |
142 | AB-namecon | description: has a specific sense in the TEI header, as discussed in |
147 | AB-namecon | group. In TEI usage, a group is distinguished from a list in that the former associates several objects which act as a single entity, while the latter does not. For example, a |
153 | AB-namecon | simply lists a number of otherwise unrelated |
157 | AB-namecon | interp |
159 | AB-namecon | interpretation or analysis |
161 | AB-namecon | lang |
162 | AB-namecon | (natural) language |
167 | AB-namecon | org |
169 | AB-namecon | organization, that is, a named group of people or legal entity |
171 | AB-namecon | rdg |
173 | AB-namecon | reading or version found in a specific witness |
175 | AB-namecon | ref |
176 | AB-namecon | reference or link |
184 | AB-namecon | statement: used in a specific sense in the TEI header, as discussed in |
188 | AB-namecon | structured: that is, containing a specific set of named elements rather than |
189 | AB-namecon | mixed content |
191 | AB-namecon | val |
195 | AB-namecon | wit |
207 | AB-namecon | is an additional name, not the name of an addition. Such inconsistencies are relatively few in number, and it is hoped to remove them in subsequent revisions of the Guidelines. |
219 | AB-namecon | (division) etc. We do not specifically list such elements here: as noted above, an expansion of each such abbreviated name is provided within the documentation using the |
240 | ABSTRUNC | att.global |
244 | ABSTRUNC | model.biblPart |
248 | ABSTRUNC | macro.paraContent |
252 | ABSTRUNC | data.pointer |
257 | ABSTRUNC | . Here we simply note some conventions about their naming. |
261 | ABSTRUNC | Attribute class names take the form |
265 | ABSTRUNC | is typically an adjective, or a series of adjectives separated by dots, describing a property common to the attributes which make up the class. |
267 | ABSTRUNC | Attributes with the same name are considered to have the same semantics, whether the attribute is inherited from a class, or locally defined. |
273 | ABSTRUNC | Model classes have names beginning |
276 | ABSTRUNC | root name |
279 | ABSTRUNC | A root name may be the name of an element, generally the prototypical parent or sibling for elements which are members of the class. |
283 | ABSTRUNC | , if the class members are all children of the element named rootname; or |
285 | ABSTRUNC | , if the class members are all siblings of the element named |
291 | ABSTRUNC | is used to indicate that class members are permitted anywhere in a TEI document. |
297 | ABSTRUNC | For example, the class of elements which can form part of a |
301 | ABSTRUNC | . This class includes as a subclass the elements which can form part of a |
303 | ABSTRUNC | in a spoken text, which is named |
309 | ABTEI2 | Because of its roots in the humanities research community, the TEI scheme is driven by its original goal of serving the needs of research, and is therefore committed to providing a maximum of comprehensibility, flexibility, and extensibility. More specific design goals of the TEI have been that the Guidelines should: |
315 | ABTEI2 | support the encoding of all kinds of features of all kinds of texts studied by researchers |
317 | ABTEI2 | be application independent |
318 | ABTEI2 | This has led to a number of important design decisions, such as: |
320 | ABTEI2 | the choice of XML and Unicode |
322 | ABTEI2 | the provision of a large predefined tag set |
324 | ABTEI2 | encodings for different views of text |
331 | ABTEI2 | The goal of creating a common interchange format which is application independent requires the definition of a specific markup syntax as well as the definition of a large set of elements or concepts. The syntax of the recommendations made in this document conforms to the World Wide Web Consortium's XML Recommendation ( |
334 | ABTEI2 | The goal of providing guidance for text encoding suggests that recommendations be made as to what textual features should be recorded in various situations. However, when selecting certain features for encoding in preference to others, these Guidelines have tended to prefer generic solutions to specific ones, and to avoid areas where no consensus exists, while attempting to accommodate as many diverse views as feasible. Consequently, the TEI Guidelines make (with relatively rare exceptions) no suggestions or restrictions as to the relative importance of textual features. The philosophy of the Guidelines is |
335 | ABTEI2 | if you want to encode this feature, do it this way |
338 | ABTEI2 | The requirement to support all kinds of materials likely to be of interest in research has largely conditioned the development of the TEI into a very flexible and modular system. The development of other XML vocabularies or standards is typically motivated by the desire to create a single fully specified encoding scheme for use in a well-defined application domain. By contrast, the TEI is intended for use in a large number of rather ill-defined and often overlapping domains. It achieves its generality by means of the modular architecture described in |
341 | ABTEI2 | The Guidelines have been written largely with a focus on text capture (i.e. the representation in electronic form of an already existing copy text in another medium) rather than text creation (where no such copy text exists). Hence the frequent use of terms like |
346 | ABTEI2 | copy text |
347 | ABTEI2 | , etc. However, the Guidelines are equally applicable to text creation, although certain elements, such as |
350 | ABTEI2 | the rendition indicators |
353 | ABTEI2 | Concerning text capture the TEI Guidelines do not specify a particular approach to the problem of fidelity to the source text and recoverability of the original; such a choice is the responsibility of the text encoder. The current version of these Guidelines, however, provides a more fully elaborated set of tags for markup of rhetorical, linguistic, and simple typographic characteristics of the text than for detailed markup of page layout or for fine distinctions among type fonts or manuscript hands. It should be noted also that, with the present version of the Guidelines, it is no longer necessarily the case that an unmediated version of the source text can be recovered from an encoded text simply by removing the markup. |
362 | ABTEI2 | interpretation |
363 | ABTEI2 | . These distinctions, though widely made and often useful in narrow, well-defined contexts, are perhaps best interpreted as distinctions between issues on which there is a scholarly consensus and issues where no such consensus exists. Such consensus has been, and no doubt will be, subject to change. The TEI Guidelines do not make suggestions or restrictions as to which of these features should be encoded. The use of the terms |
367 | ABTEI2 | about different types of encoding in the Guidelines is not intended to support any particular view on these theoretical issues. Historically, it reflects a purely practical division of responsibility amongst the original working committees (see further |
370 | ABTEI2 | In general, the accuracy and the reliability of the encoding and the appropriateness of the interpretation are for the individual user of the text to determine. The Guidelines provide a means of documenting the encoding in such a way that a user of the text can know the reasoning behind that encoding, and the general interpretive decisions on which it is based. The TEI header may be used to document and justify many such aspects of the encoding, but the choice of TEI elements for a particular feature is in itself a statement about the interpretation reached by the encoder. |
372 | ABTEI2 | In many situations more than one view of a text is needed since no absolute recommendation to embody one specific view of text can apply to all texts and all approaches to them. Within limits, the syntax of XML ensures that some encodings can be ignored for some purposes. To enable encoding multiple views, these Guidelines not only treat a variety of textual features, but sometimes provide several alternative encodings for what appear to be identical textual phenomena. These Guidelines offer the possibility of encoding many different views of the text, simultaneously if necessary. Where different views of the formal structure of a text are required, as opposed to different annotations on a single structural view, however, the formal syntax of XML (which requires a single hierarchical view of text structure) poses some problems; recommendations concerning ways of overcoming or circumventing that restriction are discussed in chapter |
375 | ABTEI2 | In brief, the TEI Guidelines define a general-purpose encoding scheme which makes it possible to encode different views of text, possibly intended for different applications, serving the majority of scholarly purposes of text studies in the humanities. Because no predefined encoding scheme can possibly serve all research purposes, the TEI scheme is designed to facilitate both selection from a wide range of predefined markup choices, and the addition of new (non-TEI) markup options. By providing a formally verifiable means of extending the TEI recommendations, the TEI makes it simple for such user-identified modifications to be incorporated into future releases of the Guidelines as they evolve. The underlying mechanisms which support these aspects of the scheme are introduced in chapter |
383 | ABAPP | guidance for individual or local practice in text creation and data capture; |
385 | ABAPP | support of data interchange; |
387 | ABAPP | support of application-independent local processing. |
388 | ABAPP | These three functions are so thoroughly interwoven in practice that it is hardly possible to address any one without addressing the others. However, the distinction provides a useful framework for discussing the possible role of the Guidelines in work with electronic texts. |
394 | ABAPP1 | Problems specific to text creation or text |
396 | ABAPP1 | have not been considered explicitly in this document. These Guidelines are not concerned with the process by which a digital text comes into being: it can be typed by hand, scanned from a printed book or typescript, read from a typesetter's tape, or acquired from another researcher who may have used another markup scheme (or no explicit markup at all). |
400 | ABAPP1 | XML can appear distressingly verbose, particularly when (as in these Guidelines) the names of tags and attributes are chosen for clarity and not for brevity. Editor macros and keyboard shortcuts can allow a typist to enter frequently used tags with single keystrokes. It is often possible to transform word-processed or scanned text automatically. Markup-aware software can help with maintaining the hierarchical structure of the document, and display the document with visual formatting rather than raw tags. |
403 | ABAPP1 | may be used to develop simpler data capture TEI-conformant schemas, for example with limited numbers of elements, or with shorter names for the tags being used most often. Documents created with such schemas may then be automatically converted to a more elaborated TEI form. |
408 | ABAPP2 | The TEI format may simply be used as an interchange format, permitting projects to share resources even when their local encoding schemes differ. If there are |
414 | ABAPP2 | such mappings are needed. However, for such translations to be carried out without loss of information, the interchange format chosen must be as expressive (in a formal sense) as any of the target formats; this is a further reason for the TEI's provision of both highly abstract or generic encodings and highly specific ones. |
422 | ABAPP2 | creating a suitable set of mappings. |
425 | ABAPP2 | For example, to translate from encoding scheme X into the TEI scheme: |
427 | ABAPP2 | Make a list of all the textual features distinguished in X. |
429 | ABAPP2 | Identify the corresponding feature in the TEI scheme. There are three possibilities for each feature: |
431 | ABAPP2 | the feature exists in both X and the TEI scheme; |
433 | ABAPP2 | X has a feature which is absent from the TEI scheme; |
435 | ABAPP2 | X has a feature which corresponds with more than one feature in the TEI scheme. |
436 | ABAPP2 | The first case is a trivial renaming. The second will require an extension to the TEI scheme, as described in chapter |
437 | ABAPP2 | . The third is more problematic, but not impossible, provided that a consistent choice can be made (and documented) amongst the alternatives. |
442 | ABAPP2 | Translating from the TEI into scheme X follows the same pattern, except that if a TEI feature has no equivalent in X, and X cannot be extended, information must be lost in translation. |
447 | ABAPP2 | The TEI |
448 | ABAPP2 | abstract model |
449 | ABAPP2 | (that is, the set of categorical distinctions which it defines) must be respected. The correspondence between a tag X and the semantic function assigned to it by these Guidelines may not be changed; such changes are known as |
450 | ABAPP2 | tag abuse |
453 | ABAPP2 | A TEI document must be expressed as a valid XML-conformant document which uses the TEI namespace appropriately. If, for example, the document encodes features not provided by the Guidelines, such extensions may not be associated with the TEI namespace. |
455 | ABAPP2 | It must be possible to validate a TEI document against a schema derived from these Guidelines, possibly with extensions provided in the recommended manner. |
461 | ABAPP3 | Machine-readable text can be manipulated in many ways; some users: |
465 | ABAPP3 | edit, display, and link texts in hypertext systems |
475 | ABAPP3 | perform content analysis on texts |
485 | ABAPP3 | scan verse texts metrically |
487 | ABAPP3 | link text and images |
490 | ABAPP3 | These applications cover a wide range of likely uses but are by no means exhaustive. The aim has been to make the TEI Guidelines useful for encoding the same texts for different purposes. We have avoided anything which would restrict the use of the text for other applications. We have also tried not to omit anything essential to any single application. |
492 | ABAPP3 | Because the TEI format is expressed using XML, almost any modern text processing system is able to process it, and new TEI-aware software systems are able to build on a solid base of existing software libraries. |
497 | ABTEI | The Text Encoding Initiative grew out of a planning conference sponsored by the Association for Computers and the Humanities (ACH) and funded by the U.S. National Endowment for the Humanities (NEH), which was held at Vassar College in November 1987. At this conference some thirty representatives of text archives, scholarly societies, and research projects met to discuss the feasibility of a standard encoding scheme and to make recommendations for its scope, structure, content, and drafting. During the conference, the Association for Computational Linguistics and the Association for Literary and Linguistic Computing agreed to join ACH as sponsors of a project to develop the Guidelines. The outcome of the conference was a set of principles (the |
504 | ABTEI | The Text Encoding Initiative project began in June 1988 with funding from the NEH, soon followed by further funding from the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. Four working committees, composed of distinguished scholars and researchers from both Europe and North America, were named to deal with problems of text documentation, |
505 | ABTEI | text representation, text analysis and interpretation, |
515 | ABTEI | ) of the Guidelines was distributed in July 1990 under the title |
518 | ABTEI | Extensive public comment and further work on areas not covered in this version resulted in the drafting of a revised version, TEI P2, distribution of which began in April 1992. This version included substantial amounts of new material, resulting from work carried out by several specialist working groups, set up in 1990 and 1991 to propose extensions and revisions to the text of P1. The overall organization, both of the draft itself and of the scheme it describes, was entirely revised and reorganized in response to public comment on the first draft. |
520 | ABTEI | In June 1993 an Advisory Board met to review the current state of the TEI Guidelines, and recommended the formal publication of the work done to that time. That version of the TEI Guidelines, TEI P3, consolidated the work published as parts of TEI P2, along with some additional new material and was finally published in May of 1994 without the label |
525 | ABTEI | XML was originally developed as a way of publishing on the World Wide Web richly encoded documents such as those for which the TEI was designed. Several TEI participants contributed heavily to the development of XML, most notably XML's senior co-editor C. M. Sperberg-McQueen, who served as the North American editor for the TEI Guidelines from their inception until 1999. |
526 | ABTEI | Following the rapid take-up of this new standard metalanguage, it became evident that the TEI Guidelines (which had been published originally as an SGML application) needed to be re-expressed in this new formalism if they were to survive. The TEI editors, with abundant assistance from others who had developed and used TEI, developed an update plan, and made tentative decisions on relevant syntactic issues. |
528 | ABTEI | In January of 1999, the University of Virginia and the University of Bergen formally proposed the creation of an international membership organization, to be known as the TEI Consortium, which would maintain, develop, and promote the TEI. Shortly thereafter, two further institutions with longstanding ties to the TEI (Brown University and Oxford University) joined them in formulating an Agreement to Establish a Consortium for the Maintenance of the Text Encoding Initiative ( |
529 | ABTEI | ), on which basis the TEI Consortium was eventually established and incorporated as a not-for-profit legal entity at the end of the year 2000. The first members of the new TEI Board took office during January of 2001. |
531 | ABTEI | The TEI Consortium was established in order to maintain a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, non-profit organization. In addition, the TEI Consortium was intended to foster a broad-based user community with sustained involvement in the future development and widespread use of the TEI Guidelines ( |
534 | ABTEI | To oversee and manage the revision process in collaboration with the TEI Editors, the TEI Board formed a Technical Council, with a membership elected from the TEI user community. The Council met for the first time in January 2002 at King's College London. Its first task was to oversee production of an XML version of the TEI Guidelines, updating P3 to enable users to work with the emerging XML toolset. This, the P4 version of the Guidelines, was published in June 2002. It was essentially an XML version of P3, making no substantive changes to the constraints expressed in the schemas apart from those necessitated by the shift to XML, and changing only corrigible errors identified in the prose of the P3 Guidelines. However, given that P3 had by this time been in steady use since 1994, it was clear that a substantial revision of its content was necessary, and work began immediately on the P5 version of the Guidelines. This was planned as a thorough overhaul, involving a public call for features and new development in a number of important areas not previously addressed including character encoding, graphics, manuscript description, biographical and geographical data, and the encoding language in which the TEI Guidelines themselves are written. |
536 | ABTEI | The members of the TEI Council and its associated workgroups are listed in |
537 | ABTEI | . In preparing this edition, they have been attentive to the requirements and practice of the widest possible range of TEI users, who are now to be found in many different research communities across the world, and have been largely instrumental in transforming the TEI from a grant-supported international research project into a self-sustaining community-based effort. One effect of the incorporation of the TEI has been the legal requirement to hold an annual meeting of the Consortium members; these meetings have emerged as an invaluable opportunity to sustain and reinforce that sense of community. |
544 | ABTEI4 | The encoding recommended by this document may be used without fear that future versions of the TEI scheme will be inconsistent with it in fundamental ways. The TEI will be sensitive, in revising these Guidelines, to the possible problems which revision might pose for those who are already using this version of the Guidelines. |
546 | ABTEI4 | With TEI P5, a version numbering system is introduced following |
548 | ABTEI4 | : the first digit identifies a major version number, the second digit a minor version number, and the third digit a sub-minor version number. The TEI undertakes that no change will be made to the formal expression of these Guidelines (that is, a TEI schema, as defined in |
549 | ABTEI4 | ) such that documents conformant to a given major numbered release cease to be compatible with a subsequent release of the same major number. Moreover, as far as possible, new minor releases will be made only for the purpose of adding new compatible features, or of correcting errors in existing features. |
551 | ABTEI4 | The Guidelines are currently maintained as an open source project on the SourceForge site |
554 | ABTEI4 | for information on how to find specific versions of TEI releases (Guidelines, schemas etc.). Notice of errors detected and enhancements requested may be submitted at |
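
The version-numbering undertaking in the ABTEI4 rows above can be illustrated with a minimal sketch. It assumes that a release number is written as three dot-separated integers (major, minor, sub-minor). The names `Version`, `parse_version`, and `stays_compatible` are hypothetical and are not part of any TEI tooling.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A release number: major, minor, and sub-minor parts (assumed layout)."""
    major: int
    minor: int
    sub_minor: int


def parse_version(text: str) -> Version:
    """Split a 'major.minor.sub-minor' string into three integers."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated numbers, got {text!r}")
    major, minor, sub_minor = (int(p) for p in parts)
    return Version(major, minor, sub_minor)


def stays_compatible(document_release: str, later_release: str) -> bool:
    """True when the later release shares the major number and is not older.

    This mirrors the stated undertaking only loosely: documents conformant
    to one release should remain compatible with later releases that carry
    the same major number.
    """
    doc = parse_version(document_release)
    later = parse_version(later_release)
    # NamedTuple instances compare element by element, like plain tuples.
    return doc.major == later.major and later >= doc
```

Under these assumptions, `stays_compatible("1.0.0", "1.4.1")` holds, while a change of major number, as in `stays_compatible("1.4.0", "2.0.0")`, does not.
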
# | id | text |
---|---|---|
21 | GD | The treatment here is largely based on the characterizations of graph types in |
24 | GD | , which typically plot data in two or more dimensions, including plots with orthogonal or radial axes, bar charts, pie charts, and the like. These can be described using the elements defined in the module for figures and graphics; see chapter |
36 | GDGR | . An undirected graph is a set of |
40 | GDGR | ) together with a set of pairs of those vertices, called |
44 | GDGR | . Each node in an arc of an undirected graph is said to be |
45 | GDGR | incident |
46 | GDGR | with that arc, and the two vertices (nodes) which make up an arc are said to be |
48 | GDGR | . A directed graph is like an undirected graph except that the arcs are |
50 | GDGR | of nodes. In the case of directed graphs, the term |
52 | GDGR | is not used; moreover, each arc in a directed graph is said to be |
54 | GDGR | the node from which the arc emanates, and |
56 | GDGR | the node to which the arc is directed. We use the element |
69 | GDGR | Before proceeding, some additional terminology may be helpful. We define a |
71 | GDGR | in a graph as a sequence of nodes n₁, ..., nₖ such that there is an arc from each nᵢ to nᵢ₊₁ in the sequence. A |
75 | GDGR | is a path leading from a particular node back to itself. A graph that contains at least one cycle is said to be |
79 | GDGR | . We say, finally, that a graph is |
81 | GDGR | if there is a path from some node to every other node in the graph; any graph that is not connected is said to be |
128 | GDGR | to record a label for the graph; similarly, the |
138 | GDGR | element record the number of nodes and number of arcs in the graph respectively; these values are optional (since they can be computed from the rest of the graph), but if they are supplied, they must be consistent with the rest of the encoding. They can thus be used to help check that the graph has been encoded and transmitted correctly. The |
142 | GDGR | elements record the number of arcs that are incident with that node. It is optional (because redundant), but can be used to help in validity checking: if a value is given, it must be consistent with the rest of the information in the graph. Finally, the |
148 | GDGR | elements provide pointers to the nodes connected by those arcs. Since the graph is undirected, no directionality is implied by the use of the |
152 | GDGR | attributes; the values of these attributes could be interchanged in each arc without changing the graph. |
195 | GDGR | Note that each arc is represented twice in this encoding of the graph. For example, the existence of the arc from LAX to LVG can be inferred from each of the first two |
197 | GDGR | elements in the graph. This redundancy, however, is not required: it suffices to describe an arc in any one of the three places it can be described (either adjacent node, or in a separate |
226 | GDGR | element is redundant (since arcs can be described using the adjacency attributes of their adjacent nodes), it has nevertheless been included in this module, in order to allow the convenient specification of identifiers, display or rendition information, and labels for each arc (using the attributes |
234 | GDGR | Next, let us modify the preceding graph by adding directionality to the arcs. Specifically, we now think of the arcs as specifying selected routes from one airport to another, as indicated by the direction of the arrowheads in the following diagram. |
272 | GDGR | indicate the number of nodes which are adjacent to and from the node concerned respectively. |
303 | GDGR | If we wish to label the arcs, say with flight numbers, then |
370 | GDTN | ) of the network are distinguished. It can be understood as accepting the set of strings obtained by traversing it from its initial node to its final node, and concatenating the labels. |
407 | GDTN | A finite state transducer has two labels on each arc, and can be thought of as representing a mapping from one sequence of labels to the other. The following example represents a transducer for translating the English strings accepted by the network in the preceding example into French. The nodes have been annotated with numbers, for convenience. |
502 | GDFT | The next example provides an encoding of a portion of a family tree |
503 | GDFT | The family tree is that of the mathematician and philosopher Bertrand Russell, whose third wife was commonly known as Peter. The information presented here is taken from |
621 | GDHI | For our final example, we represent graphically the relationships among various geographic areas mentioned in a seventeenth-century Scottish document. The document itself is a |
627 | GDHI | Item instrument of Sasine given the said Hector Mcneil confirmed and dated 28 May 1632 [...] at Edinburgh upon the 15 June 1632 |
629 | GDHI | Item ane charter granted by Archibald late earl of Argyle and Donald McNeill of Gallachalzie wh makes mention that ... the said late Earl yields and grants to the said Donald MacNeill ... |
631 | GDHI | All and hail the two merk land of old extent of Gallachalzie with the pertinents by and in the lordship of Knapdale within the sherrifdome of Argyll |
638 | GDHI | the two merk land of old extent of Gallachalzie with the pertinents by and in the lordship of Knapdale within the sherrifdom of Argyll |
652 | GDHI | We will represent these geographic entities as nodes in a graph. Arcs in the graph will represent the following relationships among them: |
656 | GDHI | location within (IN) |
665 | GDHI | , for example, are inverses of each other: the Earl of Argyll's land includes the parcel in Gallachalzie, and the parcel is therefore in the Earl of Argyll's land. Given an explicit set of inference rules, an appropriate application could use the graph we are constructing to infer the logical consequences of the relationships we identify. |
667 | GDHI | Let us assume that feature-structure analyses are available which describe Gallachalzie, Knapdale, and Argyll. We will link to those feature structures using the |
675 | GDHI | That is, the three syntactic interpretations of the clause are mutually exclusive. The notion that the pertinents are in Argyll is clearly not inconsistent with the notion that both the land in Gallachalzie and the pertinents are in Argyll. The graph given here describes the possible interpretations of the clause itself, not the sets of inferences derivable from each syntactic interpretation, for which it would be convenient to use the facilities described in chapter |
678 | GDHI | We represent the graph and its encoding as follows, where the dotted lines in the graph indicate the mutually exclusive arcs; in the encoding, we use the |
683 | GDHI | The graph formalizes the following relationships: |
704 | GDHI | We encode the graph thus: |
774 | GDTR | tree |
775 | GDTR | is a connected acyclic graph. That is, it is possible in a tree graph to follow a path from any vertex to any other vertex, but there are no paths that lead from any vertex to itself. A rooted tree is a directed graph based on a tree; that is, the arcs in the graph correspond to the arcs of a tree such that there is exactly one node, called the |
776 | GDTR | root |
777 | GDTR | , for which there is a path from that node to all other nodes in the graph. For our purposes, we may ignore all trees except for rooted trees, and hence we shall use the |
781 | GDTR | element for its root. The nodes adjacent to a given node are called its |
783 | GDTR | , and the node adjacent from a given node is called its |
789 | GDTR | element. A node with no children is tagged as a |
791 | GDTR | . If the children of a node are ordered from left to right, then we say that that node is |
793 | GDTR | . If all the nodes of a tree are ordered, then we say that the tree is an |
794 | GDTR | ordered tree |
795 | GDTR | . If some of the nodes of a tree are ordered and others are not, then the tree is a |
796 | GDTR | partially ordered tree |
797 | GDTR | . The ordering of nodes and trees may be specified by an attribute; we take the default ordering for trees to be ordered, that roots inherit their ordering from the trees in which they occur, and internal nodes inherit their ordering from their parents. Finally, we permit a node to be specified as following other nodes, which (when its parent is ordered) it would be assumed to precede, giving rise to crossing arcs. The elements used for the encoding of trees have the following descriptions and attributes. |
809 | GDTR | ) are applied in evaluating the arithmetic formula |
811 | GDTR | . In drawing the graph, the root is placed on the far right, and directionality is presumed to be to the left. |
873 | GDTR | of the tree, which is the greatest value of the |
879 | GDTR | , we say that the tree is a |
880 | GDTR | binary |
885 | GDTR | nodes does not affect the arithmetic result in this case, we could represent in this tree all of the arithmetically equivalent formulas involving its leaves, by specifying the attribute |
972 | GDTR | Linguistic phrase structure is very commonly represented by trees. Here is an example of phrase structure represented by an ordered tree with its root at the top, and a possible encoding. |
1010 | GDTR | Finally, here is an example of an ordered tree, in which a particular node which ordinarily would precede another is specified as following it. In the drawing, the |
1012 | GDTR | symbol indicates that the arc from VB to PT crosses the arc from VP to PN. |
1059 | GDAT | , which is based on the observation that any node of such a tree can be thought of as the root of the subtree that it dominates. Thus subtrees can be thought of as the same type as the trees they are embedded in, hence the designation |
1062 | GDAT | embedding tree |
1199 | GDAT | Ambiguity involving alternative tree structures associated with the same terminal sequence can be encoded relatively conveniently using a combination of the |
1207 | GDAT | may be part of the content of exactly one of two different |
1225 | GDAT | . This ambiguity is indicated in the sketch of the ambiguous tree by means of the dotted-line arcs. The markup using the |
1316 | GDAT | the attachment of a modifier may require the creation of an intermediate node which is not required when the attachment is not made, as shown in the following diagram. A possible encoding of this ambiguous structure immediately follows the diagram. |
1417 | GDAT | derivation |
1418 | GDAT | in a generative grammar is often thought of as a set of trees. To encode such a derivation, one may use the |
1428 | GDAT | attribute may be used to specify what kind of derivation it is. Here is an example of a two-tree forest, involving application of the |
1430 | GDAT | transformation in the derivation of |
1442 | GDAT | empty category |
1527 | GDAT | attributes to provide virtual copies of elements in the tree representing the second stage of the derivation that also occur in the first stage, and the |
1530 | GDAT | ) to link those elements in the second stage with corresponding elements in the first stage that are not copies of them. |
1532 | GDAT | If a group of forests (e.g. a full grammatical derivation including syntactic, semantic, and phonological subderivations) is to be articulated, the grouping element |
1549 | GDstem | ) is a tree-like graphic structure that has become traditional in manuscript studies for representing textual transmission. Consider the following hypothetical stemma: |
1554 | GDstem | The nodes in this stemma represent manuscripts; each has a label (a letter) which identifies it and also distinguishes whether the manuscript is extant, lost, or hypothetical. Extant manuscripts are identified by uppercase Latin letters or words beginning with uppercase Latin letters, e.g., |
1556 | GDstem | , shown as aqua in this example; manuscripts no longer existing, but providing readings which are attested e.g. by note or copy made before their disappearance, are identified by lowercase Latin letters, e.g., |
1564 | GDstem | share textual material that is not shared with other manuscripts (represented in this case by |
1566 | GDstem | ) even though no physical manuscript attesting this stage in the textual transmission has ever been identified. |
1568 | GDstem | Manuscripts are copied from other manuscripts. The preceding stemma represents the hypothesis that all manuscripts go back to a common ancestor ( |
1570 | GDstem | ), that the tradition split after that stage into two ( |
1576 | GDstem | is the earliest common hypothetical stage that can be reconstructed, and all nodes below |
1578 | GDstem | have a single parent, that is, were copied from a single other stage in the tradition. |
1580 | GDstem | This familiar tree model is complicated because manuscripts sometimes show the influence of more than one ancestor. They may have been produced by a scribe who checked the text in one manuscript of the same work whilst copying from another, or perhaps made changes from his memory of a slightly different version of the text that he had read elsewhere. Alternatively, perhaps scribe A copied a manuscript from one source, scribe B made changes in it in the margins or between the lines (either by consulting another source directly or from memory), and another scribe then copied that manuscript, incorporating the changes into the body. Whatever the specific scenario, it is not uncommon for a manuscript to be based primarily on one source, but to incorporate features of another branch of the tradition. This mixed result is called |
1598 | GDstem | element introduced in this chapter can be used to represent a closed tradition in a straightforward manner. Each non-terminal node is represented by a typed |
1600 | GDstem | element and each terminal node by an |
1608 | GDstem | attributes. For example, the closed part of the tradition headed by the label δ may be encoded as follows: |
1622 | GDstem | To complete this representation, we need to show that the node labelled A is not derived solely from its parent node (labelled ε) but also demonstrates contamination from the node labelled γ. The easiest way to accomplish this is to include an appropriately-typed |
1624 | GDstem | element within the node in question, the |
1626 | GDstem | of which points to the node labelled γ. This requires that this latter node be supplied with a value for its |
1677 | GDstem | In any substantial codicological project, it is likely that significantly more data will be required about the individual witnesses than indicated in the simple structures above. These Guidelines provide a rich variety of additional elements for representing such information: see in particular chapters |
1698 | GD | The selection and combination of modules to form a TEI schema is described in |
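
The GDstem rows above describe a closed tradition in which every node has a single parent, plus one case of contamination (node A derives from ε but also shows influence from γ). The following is a minimal Python sketch of that data model, offered only as an illustration and not as TEI markup. The class name `Witness`, its field names, and the placement of ε directly under δ are assumptions made for this example; they do not come from the Guidelines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Witness:
    """One node of a stemma: an extant, lost, or hypothetical manuscript."""
    label: str
    # The single manuscript this one was copied from (None for the top node).
    parent: Optional["Witness"] = None
    # Additional manuscripts that influenced this one (contamination).
    contaminated_by: List["Witness"] = field(default_factory=list)

    def sources(self) -> List[str]:
        """Return labels of all sources: the parent first, then contaminators."""
        labels = [self.parent.label] if self.parent else []
        labels.extend(w.label for w in self.contaminated_by)
        return labels

    def is_contaminated(self) -> bool:
        """True if the node draws on more than its single parent."""
        return bool(self.contaminated_by)


# Hypothetical fragment: only the relations stated in the prose are modelled.
gamma = Witness("γ")
delta = Witness("δ")
epsilon = Witness("ε", parent=delta)  # assumption: ε sits directly under δ
a = Witness("A", parent=epsilon, contaminated_by=[gamma])

print(a.sources())
print(a.is_contaminated())
```

Keeping the single parent separate from the list of contaminating nodes mirrors the distinction drawn in the prose between the tree structure proper and the extra pointer that records contamination.
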
# | id | text |
---|---|---|
4 | SG | The encoding scheme defined by these Guidelines is formulated as an application of the Extensible Markup Language (XML) ( |
5 | SG | ). XML is widely used for the definition of device-independent, system-independent methods of storing and processing texts in electronic form. It is now also the interchange and communication format used by many applications on the World Wide Web. In the present chapter we informally introduce some of its basic concepts and attempt to explain to the reader encountering them for the first time how and why they are used in the TEI scheme. More detailed technical accounts of TEI practice in this respect are provided in chapters |
12 | SG | , that is, a language used to describe other languages, in this case, |
16 | SG | has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special codes inserted into electronic texts to govern formatting, printing, or other processing. |
22 | SG | , as any means of making explicit an interpretation of a text. Of course, all printed texts are implicitly encoded (or marked up) in this sense: punctuation marks, capitalization, disposition of letters around the page, even the spaces between words all might be regarded as a kind of markup, the purpose of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from |
25 | SG | continuous writing |
27 | SG | ; it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted. |
30 | SG | markup language |
31 | SG | we mean a set of markup conventions used together for encoding texts. A markup language must specify how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means. XML provides the means for doing the first three; documentation such as these Guidelines is required for the last. |
52 | SG11 | These three aspects are discussed briefly below, and then in more depth in the remainder of this chapter. |
54 | SG11 | XML is frequently compared with HTML, the language in which web pages have generally been written, which shares some of the above characteristics. Compared with HTML, however, XML has some other important features: |
57 | SG11 | : it does not consist of a fixed set of tags; |
77 | SG111 | the following item is a paragraph |
79 | SG111 | this is the end of the most recently begun list |
83 | SG111 | move the left margin 2 quads left, move the right margin 2 quads right, skip down one line, and go to the new left margin, |
84 | SG111 | etc. In XML, the instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the markup used to describe it. |
86 | SG111 | Usually, the markup or other information needed to process a document will be maintained separately from the document itself, typically in a distinct document called a |
88 | SG111 | , though it may do much more than simply define the rendition or visual appearance of a document. |
94 | SG111 | When descriptive markup is used, the same document can readily be processed in many different ways, using only those parts of it which are considered relevant. For example, a content analysis program might disregard entirely the footnotes embedded in an annotated text, while a formatting program might extract and collect them all together for printing at the end of each chapter. Different kinds of processing can be carried out with the same part of a file. For example, one program might extract names of persons and places from a document to create an index or database, while another, operating on the same text, but using a different stylesheet, might print names of persons and places in a distinctive typeface. |
105 | SG112 | title |
107 | SG112 | author |
109 | SG112 | abstract |
110 | SG112 | and a sequence of one or more |
112 | SG112 | . Anything lacking a title, according to this formal definition, would not formally be a report, and neither would a sequence of paragraphs followed by an abstract, whatever other report-like characteristics these might have for the human reader. |
123 | SG113 | A basic design goal of XML is to ensure that documents encoded according to its provisions can move from one hardware and software environment to another without loss of information. The two features discussed so far both address this requirement at an abstract level; the third feature addresses it at the level of the strings of data characters that make up a document. All XML documents, whatever languages or writing systems they employ, use the same underlying character encoding (that is, the same method of representing as binary data those graphic forms making up a particular writing system). |
132 | SG113 | which is implemented by a universal character set maintained by an industry group called the Unicode Consortium, and known as Unicode. |
134 | SG113 | Unicode provides a standardized way of representing any of the many thousands of discrete symbols making up the world's writing systems, past and present. |
137 | SG113 | Most modern computing systems now support Unicode directly; for those which do not, XML provides a mechanism for the indirect representation of single characters by means of their character number, known as |
146 | SG12 | A text is not an undifferentiated sequence of words, much less of bytes. For different purposes, it may be divided into many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages. |
148 | SG12 | Structural units of this kind are most often used to identify specific locations or refer to points within a text ( |
151 | SG12 | canto 10, line 1234 |
154 | SG12 | , etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes ( |
160 | SG12 | ). Other structural units are more clearly analytic, in that they characterize a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. Such an analysis is less useful for locating parts of the text ( |
164 | SG12 | In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument, etc.), passages of different authorship and so forth. And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, line breaks, use of whitespace and so forth. |
166 | SG12 | These textual structures overlap with one another in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organization of the book and the logical structure of the work it contains. Many great works (Sterne's |
168 | SG12 | for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and presentational ones (such as page divisions). For many types of research, the interplay among different levels of analysis is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology. |
176 | SG131 | The technical term used in XML for a textual unit, viewed as a structural component, is |
186 | SG131 | of textual elements, because these are considered to be application dependent. It is up to the creators of XML vocabularies (such as these Guidelines) to choose intelligible element names and to define their intended use in text markup. That is the chief purpose of documents such as the TEI Guidelines. From the need to choose element names indicative of function comes the technical term for the name of an element type, which is |
190 | SG131 | Within a marked-up text (a |
192 | SG131 | ), each element must be explicitly marked or tagged in some way. This is done by inserting a tag at the beginning of the element (a |
196 | SG131 | ). The start- and end-tag pair are used to bracket off element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, a quotation element in a text might be tagged as follows: |
200 | SG131 | As this example shows, a start-tag takes the form |
201 | SG131 | quote |
203 | SG131 | quote |
209 | SG131 | The material between the start-tag and the end-tag (the string of words |
212 | SG131 | content |
213 | SG131 | of the element. Sometimes there may be nothing between the start and the end-tag; in this case the two may optionally be merged together into a single composite tag with the solidus at the end, like this: |
221 | SG132 | , that is, it may have no content at all, or it may contain just a sequence of characters with no other elements. Often, however, elements of one type will be |
229 | SG132 | , and it consists of a series of |
235 | SG132 | , each stanza having embedded within it a number of |
236 | SG132 | line |
237 | SG132 | elements. Fully marked up, a text conforming to this model might appear as follows: |
270 | SG132 | a valid TEI document. |
271 | SG132 | The element names here have been chosen for clarity of exposition; there is, however, a TEI element corresponding to each, so that this example may be regarded as TEI-conformable in the sense that this term is defined in |
273 | SG132 | It will, however, serve as an introduction to the basic notions of XML. Whitespace and line breaks have been added to the example for the sake of visual clarity only; they have no particular significance in the XML encoding itself. Also, the line |
284 | SG132 | root element |
289 | SG132 | each element is completely contained by the root element, or by an element that is so contained; elements do not partially overlap one another; |
291 | SG132 | a tag explicitly marks the start and end of each element. |
295 | SG132 | A well-formed XML document can be processed in a number of useful ways. A simple indexing program could extract only the relevant text elements in order to make a list of headings, first lines, or words used in the poem text; a simple formatting program could insert blank lines between stanzas, perhaps indenting the first line of each, or inserting a stanza number. Different parts of each poem could be typeset in different ways. A more ambitious analytic program could relate the use of punctuation marks to stanzaic and metrical divisions. |
298 | SG132 | Scholars wishing to see the implications of changing the stanza or line divisions chosen by the editor of this poem can do so simply by altering the position of the tags. And of course, the text as presented above can be transported from one computer to another and processed by any program (or person) capable of making sense of the tags embedded within it with no need for the sort of transformations and translations needed for files which have been saved in one or other of the proprietary formats preferred by most word-processing programs. |
300 | SG132 | As we noted above, one of the attractions of XML is that it enables us to make up our own names for the elements rather than requiring us always to use names predefined by other agencies. Clearly, however, if we wish to exchange our poems with others, or to include poems others have marked up in our anthology, we will need to know a bit more about the names used for the tags. The means that XML provides for this is called a |
301 | SG132 | namespace |
303 | SG132 | qualified name |
304 | SG132 | , that is, a name with an optional prefix identifying the set of names to which it belongs. For example, we have defined an element |
306 | SG132 | for the purpose of marking lines of verse. Another person might, however, define an element called |
308 | SG132 | for the purpose of marking typographic lines, or drawn lines. Because of these different meanings, if we wish to share data it will be necessary to distinguish the two |
309 | SG132 | line |
311 | SG132 | namespace prefix |
314 | SG132 | This feature is particularly important if we have different definitions of what a |
315 | SG132 | line |
316 | SG132 | is, of course, but there are many occasions when it is useful to distinguish groups of tags belonging to different |
319 | SG132 | ). One particularly useful namespace prefix is predefined for XML: it is |
323 | SG132 | Namespaces allow us to represent the fact that a name belongs to a group of names, but do not allow us to do much more by way of checking the integrity or accuracy of our tagging. Simple well-formedness alone is not enough for the full range of what might be useful in marking up a document. It might well be useful if, in the process of preparing our digital anthology, a computer system could check some basic rules about how stanzas, lines, and headings can sensibly co-occur in a document. It would be even more useful if the system could check that stanzas are always tagged |
331 | SG132 | document, and the ability to perform such validation is one of the key advantages of using XML. To carry this out, some way of formally stating the criteria for successful validation is necessary: in XML this formal statement is provided by an additional document known as a |
338 | SG132 | , both abbreviated as DTD, may also be encountered. Throughout these Guidelines we use the term |
346 | SG14 | The design of a schema may be as lax or as restrictive as the occasion warrants. A balance must be struck between the convenience of following simple rules and the complexity of handling real texts. This is particularly the case when the rules being defined relate to texts that already exist: the designer may have only the haziest of notions as to an ancient text's original purpose or meaning and hence find it very difficult to specify consistent rules about its structure. On the other hand, where a new text is being prepared to an exact specification, for entry into a textual database of some kind for example, the more precisely stated the rules, the better they can be enforced. Even in the case where an existing text is being marked up, it may be beneficial to define a restrictive set of rules relating to one particular view or hypothesis about the text—if only as a means of testing the usefulness of that view or hypothesis. A schema designed for use by a small project or team is likely to take a different position on such issues than one intended for use by a large and possibly fragmented community. It is important to remember that every schema results from an interpretation of a text. There is no single schema encompassing the absolute truth about any text, although it may be convenient to privilege some schemas above others for particular types of analysis. |
348 | SG14 | XML is widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross-references should be properly resolved, and so forth. In such situations, documents are seen as raw material to match against predefined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of accurately tagging elements of less rigidly constrained texts. By making these rules explicit, the scholar reduces his or her own burdens in marking up and verifying the electronic text, while also being forced to make explicit an interpretation of the structure and significant particularities of the text being encoded. |
353 | SG141bis | A schema can be expressed in a number of different ways; frequently encountered methods include the Document Type Definition (DTD) language which XML inherited from SGML; the XML Schema language (
354 | SG141bis | ) defined by the W3C; and the RELAX NG language ( |
359 | SG141bis | of RELAX NG, but the specifications within these Guidelines are expressed in a way that is largely independent of the specific language in which a schema generated from them is expressed. |
362 | SG141bis | . In practice, the only part of a TEI element specification not expressed using TEI-defined syntax is the content model for an element, which is expressed using the RELAX NG schema language for reasons of processing convenience. RELAX NG uses its own XML vocabulary to define content models, which is adopted by the TEI for the same purpose. |
366 | SG141bis | anthology_p = element anthology { poem_p+ } poem_p = element poem { heading_p?, stanza_p+ } stanza_p = element stanza { line_p+ } heading_p = element heading { text } line_p = element line { text } start = anthology_p |
376 | SG141bis | ; that is, it defines a number of named patterns, each of which acts as a kind of template against which an input document can be matched. The meaning of a pattern is expressed in a schema by reference to other patterns, or to a small number of built-in fundamental concepts, as we shall see. In the example above, the word to the left of the equals sign is the pattern's name, and the material following it declares a meaning for the pattern. Patterns may also be of particular types; the ones that interest us here are called |
380 | SG141bis | . In this example we see definitions for five element patterns. Note that we have used similar names for the pattern and the element which the pattern describes: so, for example, the line |
384 | SG141bis | , the value of which defines an element called |
386 | SG141bis | . These naming conventions are arbitrary; we could use the same name for the pattern as for the element, since the two are syntactically quite distinct. The name, or |
391 | SG141bis | content model |
394 | SG141bis | The last line of the schema above tells a RELAX NG validator which element (or elements) in a document can be used as the root element: in our case only |
397 | SG141bis | entry point |
423 | SG141x | ; the root element of a TEI-conformant document is |
434 | SG143 | content model |
435 | SG143 | of the element being defined, because it specifies what may legitimately be contained within it. In RELAX NG, the content model is defined in terms of other patterns, either by embedding them, or (as in our examples above) by naming or referring to them. The RELAX NG compact syntax also uses a small number of reserved words to identify other possible contents for an element, of which by far the most commonly encountered is |
436 | SG143 | text |
439 | SG143 | ), then almost always, following the branches of the tree downwards (for example, from |
450 | SG143 | text |
455 | SG143 | are so defined, since their content models say |
456 | SG143 | text |
457 | SG143 | only and name no embedded elements. |
467 | SG144 | may be repeated. There are three occurrence indicators: the plus sign, the question mark, and the asterisk or star. The plus sign means that the pattern can match one or more times; the question mark means that it may match at most once but is not mandatory; the star means that the pattern concerned is not mandatory, but may match any number of times. Thus, if the content model for |
483 | SG145 | The content model |
491 | SG145 | (the comma) used between its components. The comma connector indicates that the patterns concerned must appear in the sequence given. Another commonly encountered connector is the vertical bar, representing alternation. If the comma in this example were replaced by a vertical bar, then a |
497 | SG146 | In our example so far, the components of each content model have been either single patterns or |
498 | SG146 | text |
499 | SG146 | . It is quite permissible, however, to define content models in which the components are lists of patterns, combined by connectors. Such lists may also be modified by occurrence indicators and themselves combined by connectors. To demonstrate these facilities, let us expand our example to include non-stanzaic types of verse. For the sake of demonstration, we will categorize poems as one of the following: |
507 | SG146 | ). A blank-verse poem consists simply of lines (we ignore the possibility of verse paragraphs for the moment), |
508 | SG146 | The astute reader will have noticed that verse paragraphs need not start on a line boundary, which seriously complicates the issue; see further section |
510 | SG146 | so no additional elements need be defined for it. A couplet is defined as a |
524 | SG146 | (which are distinguished to enable studies of rhyme scheme, for example |
525 | SG146 | This is, however, a rather artificial example; XPath, for example, provides ways of distinguishing elements in an XML structure by their position without the need to give them distinct names. |
526 | SG146 | ); these will have exactly the same content model as the existing |
528 | SG146 | element. We will therefore add the following two lines to our example schema: |
530 | SG146 | Next, we can change the declaration for the |
536 | SG146 | The second version, by applying the occurrence indicator to the group rather than to each element within it, would allow a single poem to contain a mixture of stanzas, couplets, and lines. |
538 | SG146 | A group of this kind can contain |
539 | SG146 | text |
541 | SG146 | mixed content |
542 | SG146 | , allows for elements in which the sub-components appear with intervening stretches of character data. For example, if we wished to mark place names wherever they appear inside our verse lines, then, assuming we have also added a pattern for the |
544 | SG146 | element, we could change the definition for |
547 | SG146 | line_p = element line { (text | name_p )* } |
550 | SG146 | Some XML schema languages place no constraints on the way that mixed content models may be defined, but in the XML DTD language, when |
551 | SG146 | text |
552 | SG146 | appears with other elements in a content model, it must always appear as the first option in an alternation; it may appear once only, and in the outermost model group; and if the group containing it is repeated, the star operator must be used. Although these constraints do not apply to (for example) schemas expressed in the RELAX NG language, all TEI content models currently obey them. |
554 | SG146 | Quite complex models can easily be built up in this way, to match the structural complexity of many types of text. As a further example, consider the case of stanzaic verse in which a refrain or chorus appears. Like a stanza, a refrain consists of repetitions of the line element. A refrain can appear at the start of a poem only, or as an optional addition following each stanza. This could be expressed by a pattern such as the following: |
556 | SG146 | That is, a poem consists of an optional heading, followed by either a sequence of lines or an unnamed group, which starts with an optional refrain and is followed by one or more occurrences of another group, each member of which is composed of a stanza followed by an optional refrain. A sequence such as |
558 | SG146 | follows this pattern, as does the sequence |
560 | SG146 | . The sequence |
562 | SG146 | does not, however, and neither does the sequence |
564 | SG146 | Among other conditions made explicit by this content model are the requirements that at least one stanza must appear in a poem, if it is not composed simply of lines, and that if there is both a heading and a stanza they must appear in that order. |
576 | SG152 | In the simple cases described so far, we have assumed that one can identify the immediate constituents of every element in a textual structure. A poem consists of stanzas, and an anthology consists of poems. Stanzas do not float around unattached to poems or combined into some other unrelated element; a poem cannot contain an anthology. All the elements of a given document type may be arranged into a hierarchic structure like a family tree, with a single ancestor at one end and many children (mostly the elements containing simple text) at the other. For example, we could represent an anthology containing two poems, the first of which contains two four-line stanzas and the second a single stanza, by a tree structure like the following figure: |
580 | SG152 | This graphic representation of the structure of an XML document is close to the abstract model implicit in most XML processing systems. Most such systems now use a standardized way of accessing parts of an XML document called |
587 | SG152 | XPath gives us a non-graphical way of referring to any part of an XML document: for example, we might refer to the last line of Blake's poem as |
589 | SG152 | . The square brackets here indicate a numerical selection: we are talking about the fourth line in the second stanza of the first poem in the anthology. If we left out all the square-bracketed selections, the corresponding XPath expression would refer to all lines contained by stanzas contained by poems contained by anthologies. An XPath expression can refer to any collection of elements: for example, the expression |
595 | SG152 | The solidus within an XPath expression behaves in much the same way as the solidus or backslash in a filename specification: it indicates that the item to the left directly contains the item to the right of it. In XPath it is also possible to indicate that any number of other items may intervene by repeating the solidus. For example, the XPath expression |
597 | SG152 | will refer to the first line of each poem in the anthology, irrespective of whether it is in a stanza. |
599 | SG152 | Clearly, there are many such trees that might be drawn to describe the structure of this or other anthologies. Some of them might be representable as further subdivisions of this tree: for example, we might subdivide the lines into individual words, since in our simple example no word crosses a line boundary. Surprisingly perhaps, this grossly simplified view of what text is (memorably termed an |
600 | SG152 | ordered hierarchy of content objects |
601 | SG152 | (OHCO) view of text by Renear |
605 | SG152 | ) turns out to be very effective for a large number of purposes. It is not, however, adequate for the full complexity of real textual structures, for which more complex mechanisms need to be employed. There are many other trees that might be drawn which do |
609 | SG152 | The OHCO model of text generally has difficulty representing cases where different elements overlap, so that several different trees may be identified in the same document. All the elements marked up in a document, no matter what namespace they belong to, must fit within a single hierarchy. To represent overlapping structures, therefore, a single hierarchy must be chosen, and the points at which other hierarchies intersect with it marked. For example, we might choose the verse structure as our primary hierarchy, and then mark the pagination by means of empty elements inserted at the boundary points between one page and the next. Or we could represent alternative hierarchies by means of the pointing and linking mechanisms described in chapter |
619 | SG16 | , like some other words, has a specific technical sense. It is used to describe information that is in some sense descriptive of a specific element occurrence but not regarded as part of its content. For example, you might wish to add a |
621 | SG16 | attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an |
625 | SG16 | Although different elements may have attributes with the same name (for example, in the TEI scheme, every element is defined as having an attribute named |
627 | SG16 | ), they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as |
631 | SG16 | The order in which attribute-value pairs are supplied inside a tag has no significance; they must, however, be separated by at least one whitespace (blank, newline, or tab) character. The value part must always be given inside matching quotation marks, either single or double |
632 | SG16 | In the unlikely event that both kinds of quotation marks are needed within the quoted string, either or both can also be presented in escaped form, using the predefined character entities |
652 | SG16 | attribute has the value |
656 | SG16 | attribute has the value |
662 | SG16 | attribute has the value |
664 | SG16 | might be formatted differently from one in which the same attribute has the value |
668 | SG16 | attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross-reference purposes, as discussed further below ( |
673 | SG-att | Attributes are declared in a schema in the same way as elements. As well as specifying an attribute's name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute. |
679 | SG-att | , whose value is an attribute pattern defining an attribute named |
681 | SG-att | . Attribute names are subject to the same restrictions as other names in XML; they need not be unique across the whole schema, however, but only within the list of attributes for a given element. |
683 | SG-att | A pattern defining the possible values for this attribute is given within the curly braces, in just the same way as a content model is given for an element pattern. In this case, the attribute's value must be one of the strings presented explicitly above. |
689 | SG-att | In RELAX NG, an element pattern simply includes any attribute patterns applicable to it along with its other constituents, as shown above. Attribute patterns can also be grouped and alternated in the same way as element patterns, though this particular feature is not widely used in the TEI scheme, since it is not available to the same extent in all schema languages. Because a question mark follows the reference to the |
697 | SG-att | Instead of supplying a list of explicit values, an attribute pattern can specify that the attribute must have a value of a particular type, for example a text string, a numeric value, a normalized date, etc. This is accomplished by supplying a pattern that refers to a |
698 | SG-att | datatype |
699 | SG-att | . In the example above, because a list of acceptable values is predefined, a parser can check that no |
711 | SG-att | a parser would accept almost any unbroken string of characters ( |
717 | SG-att | ) as valid for this attribute. Sometimes, of course, the set of possible values cannot be predefined. Where it can, as in this case, it is generally better to do so. |
719 | SG-att | Schema languages vary widely in the extent to which they support validation of attribute values. Some languages predefine a small set of possibilities. Others allow the schema designer to use values from a predefined |
721 | SG-att | of possible datatypes, or to add their own definitions, possibly of great complexity. A |
722 | SG-att | datatype |
723 | SG-att | might be something fairly general (any positive integer), something very specific or idiosyncratic (any four-character string ending with "T"), or somewhere between the two. In the RELAX NG schemas used by the TEI, general patterns have been defined for about half a dozen datatypes (using the W3C Schema |
726 | SG-att | ). In addition to the two possibilities already mentioned—plain text or an explicit list of possible strings—other datatypes likely to be encountered include the following: |
732 | SG-att | numeric |
734 | SG-att | values must represent a numeric quantity of some kind |
736 | SG-att | date |
738 | SG-att | values must represent a possible date and time in some calendar |
751 | SG-id | see note 6 |
754 | SG-id | . When a text is being produced the actual numbers associated with the notes or chapters may not be certain. If we are using descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked-up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications). XML therefore predefines an attribute that may be used to provide any element occurrence with a special identifier, a kind of label, which may be used to refer to it from anywhere else: since it is defined in the XML namespace, the name of this attribute is |
756 | SG-id | and it is used throughout the TEI schema. Because it is intended to act as an identifier, its values must be unique within a given document. The cross-reference itself will be supplied by an element bearing an attribute of a specific kind, which must also be declared in the schema. |
758 | SG-id | Suppose, for example, we wish to include a reference within the notes on one poem that refers to another poem. We will first need to provide some way of attaching a label to each poem: this is easily done using the |
772 | SG-id | Next we need to define a new element for the cross-reference itself. This will not have any content—it is only a pointer—but it has an attribute, the value of which will be the identifier of the element pointed at. This is achieved by the following definition: |
780 | SG-id | . The value of this attribute must be a pointer or web reference of type |
787 | SG-id | (URI) may be supplied here. The accepted syntax for URIs is an Internet Standard, defined in |
792 | SG-id | defined by the W3C Schema datatype library. |
793 | SG-id | furthermore, because there is no indication of optionality on the attribute pattern, it must be supplied on each occurrence—a |
807 | SG-id | A processor may take any number of actions when it encounters a link encoded in this way: a formatter might construct an exact page and line reference for the location of the poem in the current document and insert it, or just quote the poem's title or first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to, for example by displaying it in a new window. Note, however, that the purpose of the XML markup is simply to indicate that a cross-reference exists: it does not necessarily determine what the processor is to do with it. |
813 | SG-id | attribute of datatype URI: |
814 | SG-id | graphic_p = element graphic {att.url, empty} att.url = attribute url {anyURI} |
815 | SG-id | With these additions to the schema, we can now represent the location of the illustration within our text like this: |
817 | SG-id | By providing a location from which a reproduction of the required image can be downloaded, this encoding makes it possible for appropriate software to display the image as well as record its existence. |
819 | SG-id | Attributes form part of the structure of an XML document in the same way as elements, and can therefore be accessed using XPath. For example, to refer to all the poems in our anthology whose |
821 | SG-id | attribute has the value |
833 | SG-oth | In addition to the elements and attributes so far discussed, an XML document can contain a few other formally distinct things. An XML document may contain references to predefined strings of data that a validator must resolve before attempting to validate the document's structure; these are called |
837 | SG-oth | text or representing character data which cannot easily be keyboarded. An XML document may also contain arbitrary signals or flags for use when the document is processed in a particular way by some class of processor (a common example in document production is the need to force a formatter to start a new page at some specific point in a document); such flags are called |
840 | SG-oth | namespace |
845 | SG-er | As mentioned above, all XML documents use the same internal character encoding. Since not all computer systems currently support this encoding directly, a special syntax is defined that can be used to represent individual characters from the Unicode character set in a portable way by providing their numeric value, in decimal or hexadecimal notation. |
849 | SG-er | is represented within an XML document as the Unicode character with hexadecimal value |
851 | SG-er | . If such a document is being prepared on (or exported to) a system using a different character set in which this character is not available, it may instead be represented by the character reference |
859 | SG-er | To aid legibility, however, it is also possible to use a mnemonic name (such as |
861 | SG-er | ) for such character references, provided that each such name is mapped to the required Unicode value by means of a construct known as an |
863 | SG-er | . A reference to a named character entity always takes the form of an ampersand, followed by the name, followed by a semicolon. For example an XML document containing the string |
869 | SG-er | There is a small set of such character entity references that do not have to be declared because they form part of the definition of XML. These include the names used for characters such as the ampersand ( |
873 | SG-er | ), which could not easily otherwise be included in an XML document without ambiguity. Other predeclared entity names are those for quotation marks ( |
881 | SG-er | For all other named character entities, a set of entity declarations must be provided to an XML processor before the document referring to them can be validated. The declaration itself uses a non-XML syntax inherited from SGML; for example, to define an entity named |
883 | SG-er | with the replacement value é, the declaration could have any of the following forms: |
892 | SG-er | string substitution |
893 | SG-er | purposes, where the same text needs to be repeated uniformly throughout a text. For example, if a declaration such as |
894 | SG-er | <!ENTITY TEI "Text Encoding Initiative"> |
895 | SG-er | is included with a document, then references such as |
897 | SG-er | may be used within it, each of which will be expanded in the same way and replaced by the string |
899 | SG-er | before the text is validated. |
904 | SG-pi | Although one of the aims of using XML is to remove any information specific to the processing of a document from the document itself, it is occasionally very convenient to be able to include such information—if only so that it can be clearly distinguished from the structure of the document. As suggested above, one common example is the need, when processing an XML document for printed output, to include a suggestion that the formatting processor might use to determine where to begin a new page of output. Page-breaking decisions are usually best made by the formatting engine alone, but there will always be occasions when it may be necessary to override these. An XML processing instruction inserted into the document is one very simple and effective way of doing this without interfering with other aspects of the markup. |
912 | SG-pi | . In between are two space-separated strings: by convention, the first is the name of some processor ( |
914 | SG-pi | in the above example) and the second is some data intended for the use of that processor (in this case, the instruction to start a new page). The only constraint placed by XML on the strings is that the first one must be a valid XML name; the other can be any arbitrary sequence of characters, not including the closing character-sequence |
920 | SG-pi | which can be supplied at the beginning of an XML document, for example: |
922 | SG-pi | The XML declaration specifies the version number of the XML Recommendation applicable to the document it introduces (in this case, version 1.0), and optionally also the character encoding used to represent the Unicode characters within it. By default an XML document uses the character encoding UTF-8 or UTF-16; in this case, the 16-bit characters of Unicode have been mapped to the 8-bit character set known as ISO 8859-1; any characters present in the document but not available in the target character set will therefore need to be represented as character references ( |
923 | SG-pi | ). The XML declaration is purely documentary, but if it is wrong many XML-aware processors will be unable to process the associated text. |
933 | SGname | namespace |
934 | SGname | was introduced into the XML language as a means of addressing these and related problems. If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. Just as a document can contain words taken from different languages, so a well-formed XML document can include elements taken from different namespaces. A namespace resembles a schema in that we may say that a given set of elements |
938 | SGname | a given schema. However, a schema is a set of element definitions, whereas a namespace is really only a property of a collection of elements: the only tangible form it takes in an XML document is its distinctive |
941 | SGname | name |
944 | SGname | Suppose for example that we wish to extend our anthology to include a complex diagram. We might start by considering whether or not to extend our simple schema to include XML markup for such features as arcs, polygons, and other graphical elements. XML can be used to represent any kind of structure, not simply text, and there are clear advantages to having our text and our diagrams all expressed in the same way. |
946 | SGname | Fortunately we do not need to invent a schema for the representation of graphical components such as diagrams; it already exists in the shape of the Scalable Vector Graphics (SVG) language defined by the W3C. |
949 | SGname | SVG is a widely used and rich XML vocabulary for representing all kinds of two-dimensional graphics; it is also well supported by existing software. Using an SVG-aware drawing package, we can easily draw our diagram and save it in XML format for inclusion within our anthology. When we do so, we need to indicate that this part of the document contains elements taken from the SVG namespace, if only to ensure that processing software does not confuse our |
955 | SGname | An XML document need not specify any namespace: it is then said to use the |
957 | SGname | namespace. Alternatively, the root element of a document may supply a default namespace, understood to apply to all elements which have no namespace prefix. This is the function of the |
959 | SGname | attribute which provides a unique name for the default namespace, in the form of a URI: |
964 | SGname | In exactly the same way, on the root element for each part of our document which uses the SVG language, we might introduce the SVG namespace name: |
973 | SGname | Although a namespace name usually uses the URI (Uniform Resource Identifier) syntax, it is not treated as an online address and an XML processor regards it just as a string, providing a longer name for the namespace. |
977 | SGname | attribute can also be used to associate a short prefix name with the namespace it defines. This is very useful if we want to mingle elements from different namespaces within the same document, since the prefix can be attached to any element, overriding the implicit namespace for itself (but not its children): |
988 | SGname | There is no limit on the number of namespaces that a document can use. Provided that each is uniquely identified, an XML processor can identify those that are relevant, and validate them appropriately. To extend our example further, we might decide to add a linguistic analysis to each of the poems, using a set of elements such as |
1016 | SG-ms | We mentioned above that the syntax of XML requires the encoder to take special action if characters with a syntactic meaning in XML (such as the left angle bracket or ampersand) are to be used in a document to stand for themselves, rather than to signal the start of a tag or an entity reference respectively. The predefined entities |
1022 | SG-ms | provide one method of dealing with this problem, if the number of occurrences of such things is small. Other methods may be considered when the number is large, as in an XML document like the present Guidelines, which contains hundreds of examples of XML markup. One is to label the XML examples as belonging to a different namespace from that of the document itself, which is the approach taken in the present Guidelines. Another and simpler approach is provided by one of the features inherited by XML from its parent SGML: the |
1026 | SG-ms | A marked section is a block of text within an XML document introduced by the characters |
1030 | SG-ms | . Between these rather strange brackets, markup recognition is turned off, and any tags or entity references encountered are therefore treated as if they were plain text. For example, when we come to write the users' manual for our anthology, we may find ourselves often producing text like the following: |
1043 | SG18 | if a document contains entity references that must be processed before the document can be validated, where are those entities defined? |
1045 | SG18 | an XML document instance may be stored in a number of different operating system files; how should they be assembled together? |
1047 | SG18 | how does a processor determine which stylesheets it should use when processing an XML document, or how to interpret any processing instructions it contains? |
1053 | SG18 | Different schema languages and different XML processing systems take very different positions on all of these topics, since none of them is explicitly addressed in the XML specification itself. Consequently, the best answer is likely to be specific to a particular software environment and schema language. Since this chapter is concerned with XML considered independently of its processing environment, we only address them in summary detail here. |
1060 | SG-ass1 | , which XML inherited from SGML. Different schema languages vary in the ways they make a collection of such definitions available to an XML processor, but fortunately there is one method that all current schema languages support. |
1065 | SG-ass1 | statement. This declarative statement has been inherited by XML from SGML; in its full form it provides a large number of facilities, but we are here concerned only with the small subset of those facilities recognized by all schema languages. |
1069 | SG-ass1 | Any XML processor encountering this statement will use it to add the two named entities it defines to those already predefined for XML. Before the document instance itself is validated, any references to these entities will be expanded to the character string given. Thus, wherever in the document instance the string |
1072 | SG-ass1 | And, indeed, for those responsible for deciding the licensing conditions if they change their minds later. |
1075 | SG-ass1 | following the string DOCTYPE in this example is, of course, the name of the root element of the document to which this declaration is prefixed; however, only an XML DTD processor will take note of this fact. |
1088 | SG-assoc | points to the location of the schema. This is the only mandatory pseudo-attribute, but others can be added to give more information about the schema: |
1094 | SG-assoc | This example includes a standard schema in XML Schema format, along with a schematron schema which might be used for checking the format and linking of names. |
1098 | SG-assoc | Any modern XML processing software tool will provide convenient methods of validating documents which are appropriate to the particular schema language chosen. In the interests of maximizing portability of document instances, they should contain as little processing-specific information as possible. |
1103 | SG-mult | As we have already indicated, a single XML document may be made up of several different operating system files that need to be pulled together by a processor before the whole document can be validated. The XML DTD language defines a special kind of entity (a |
1105 | SG-mult | ) that can be used to embed references to whole files into a document for this purpose, in much the same way as the character or string entities discussed in |
1112 | SG-mult | defines a generic mechanism for this purpose, which is supported by an increasing number of XML processors. |
1116 | SG-style | As mentioned above, the processing of an XML document will usually involve the use of one or more stylesheets, often but not exclusively to provide specific details of how the document should be displayed or rendered. In general, there is no reason to associate a document instance with any specific stylesheet and the schema languages we have discussed so far do not therefore make any special provision for such association. The association is made when the stylesheet processor is invoked, and is thus entirely application-specific. |
1118 | SG-style | However, since one very common application for XML documents is to serve them as browsable documents over the Web, the W3C has defined a procedure and a syntax for associating a document instance with its stylesheet (see |
1119 | SG-style | ). This Recommendation allows a document to supply a link to a default stylesheet and also to categorize the stylesheet according to its |
1121 | SG-style | , for example to indicate whether the stylesheet is written in CSS or XSLT, using a specialized form of processing instruction. |
1125 | SG-style | which is available from the same location as the anthology itself, we could make it available over the Web simply by adding a processing instruction like the following to the anthology: |
1128 | SG-style | Multiple stylesheets can be defined for the same document, and options are available to specify how a web browser should select amongst them. For example, if the document also contained a directive: |
1132 | SG-style | could be used when rendering the document on a handheld device such as a mobile phone. |
1134 | SG-style | Most modern web browsers support CSS (although the extent of their implementation varies), and some of them support XSLT. |
1138 | SG-val | As we noted above, most schema languages provide some degree of datatype validation for attribute values ( |
1139 | SG-val | ). They vary greatly in the validation facilities they offer for the content of elements, other than the syntactic constraints already discussed. Thus, while we may very easily check that our |
1145 | SG-val | elements contain between five and 500 correctly-spelled English words, should we wish to constrain our poetry in such a way. Also, because attributes and elements are treated differently, it is difficult or impossible to express co-occurrence constraints: for example, if the |
1153 | SG-val | The XML DTD language offers very little beyond syntactic checking of element content. By contrast, a major impetus behind the design and development of the W3C schema language was the addition of a much more general and powerful constraint language to the existing structural constraints of XML DTDs. In RELAX NG the opposite approach was taken, in that all datatype validation, whether of attributes or element content, is regarded as external to the schema language. For attributes, as we have seen, RELAX NG makes use of the W3C Schema Datatype Library (but permits use of others). Because RELAX NG treats both elements and attributes as special cases of patterns, the same datatype validation facilities are available for element content as for attribute values; it is unlike other schema languages in this respect. In addition, for content validation, a different component of DSDL known as Schematron can be used. Schematron is a pattern matching (rather than a grammar-based) language, which allows us to test the components of a document against templates that express constraints such as those mentioned above. |
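
Several rows in the table above (SG16, SG-id, SG-er) describe behaviour that is easy to check with an XML parser: the order of attribute-value pairs and the choice of quotation marks carry no significance, numeric character references may be given in decimal or hexadecimal, and an `xml:id` value can serve as the target of a cross-reference. The following is only a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `anthology`, `poem`, and `ptr` elements, the `status` and `target` attributes, and the helper `find_by_id` are assumptions made for illustration; they are not taken from the table.

```python
import xml.etree.ElementTree as ET

# The XML namespace is predefined; parsers bind the "xml" prefix to it automatically.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical fragment: double quotes, hexadecimal and decimal character references.
doc_a = (
    '<anthology>'
    '<poem xml:id="P1" status="draft">caf&#xE9; &amp; th&#233;</poem>'
    '<ptr target="#P1"/>'
    '</anthology>'
)

# Same content with the attributes reordered and single quotes used instead.
doc_b = (
    "<anthology>"
    "<poem status='draft' xml:id='P1'>caf&#233; &amp; th&#xE9;</poem>"
    "<ptr target='#P1'/>"
    "</anthology>"
)


def find_by_id(root, ident):
    """Return the first element whose xml:id equals ident, or None (illustrative helper)."""
    for element in root.iter():
        if element.get(f"{{{XML_NS}}}id") == ident:
            return element
    return None


root_a = ET.fromstring(doc_a)
root_b = ET.fromstring(doc_b)

poem_a = find_by_id(root_a, "P1")
poem_b = find_by_id(root_b, "P1")

# Attribute order and quote style do not affect the parsed result.
same_attributes = poem_a.attrib == poem_b.attrib

# Follow the pointer: strip the leading "#" from the target and look up the identifier.
target = root_a.find("ptr").get("target")
ident = target[1:] if target.startswith("#") else target
resolved = find_by_id(root_a, ident)
```

Here the two serializations parse to the same attribute set and the same text content, and the pointer resolves to the `poem` element. What a processor then does with the resolved element remains, as the SG-id rows note, a matter for the application.
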
# | id | text |
---|---|---|
23 | VEMEana-eg-23 | Doglia mi reca ne lo core ardire |
79 | TSSASE-eg-20 | Structures of social action: Studies in conversation analysis |
358 | NDPER-eg-17 | membrane 5, entry 154 |
472 | VEST-eg-4 | 2nd edition |
597 | DIC-CP | Collins Pocket Dictionary of the English language |
617 | SA-BIBL-2 | Orbis Pictus: a facsimile of the first English edition of 1659 |
634 | PHegsurp2 | Poeti del Duecento |
888 | COEDADD-eg-89 | The waste land: a facsimile and transcript of the original drafts including the annotations of Ezra Pound |
918 | DS-eg-05 | Is there a text in this class? The authority of interpretive communities |
957 | FTGRA-eg-18 | 2nd edition |
1041 | COHQU-eg-43 | Natural language processing in Prolog |
1292 | DRSTA-eg-40 | Everyman's library: the drama |
1324 | COBICOR-eg-248 | ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure |
1508 | COHQQ-eg-33 | note 12 |
1637 | DRPRO-eg-7 | epilogue |
1671 | STGA-eg-9 | Crofts American history series |
1740 | TSBA-eg-19 | The approach of the Text Encoding Initiative to the encoding of spoken discourse |
1760 | MS-eg-001 | A summary catalogue of western manuscripts in the Bodleian Library at Oxford which have not hitherto been catalogued ... |
1770 | MS-eg-001 | P5-MS: A general purpose tagset for manuscript description |
1800 | STGA-eg-10 | Crofts American history series |
1968 | TSSASE-eg-37 | Report on the compatibility of J P French's spoken corpus transcription conventions with the TEI guidelines for transcription of spoken texts |
1995 | GDFT-eg-12 | Partial family tree for Bertrand Russell |
2366 | DSBACK-eg-83 | index to vol. 1 |
2600 | WHITMS1 | "[I am a curse]" in |
2606 | WHITMS2 | Single leaf of Notes for a poem about night "visions," possibly related to the untitled 1855 poem that Whitman eventually titled "The Sleepers." Fragments of an unidentified newspaper clipping about the Puget Sound area have been pasted to the leaf. The Trent Collection of Walt Whitman Manuscripts, Duke University Rare Book, Manuscript, and Special Collections Library. |
3818 | Burnard1995b | The Design of the TEI Encoding Scheme |
4487 | SG-BIBL-2 | Refining our notion of what text really is: the problem of overlapping hierarchies |
4756 | CO-BIBL-1 | An international handbook of the science of language and society |
4923 | TS-BIBL-3 | TEI document TEI AI2 W1 |
5068 | DI-BIBL-3 | TEI working paper TEI AIW20 |
5171 | DI-BIBL-6 | Principles for Encoding machine readable dictionaries |
5225 | DI-BIBL-8 | Electronic dictionary encoding: customizing the TEI Guidelines |
5769 | NH-BIBL-7 | The layered markup and annotation language |
5821 | FS-BIBL-01 | A rationale for the TEI recommendations for feature-structure markup, |
5888 | ISO-690 | ISO 690:1987: Information and documentation – Bibliographic references – Content, form and structure |
5900 | ISO-12620 | ISO 12620:2009: Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources |
5923 | RICA | Istituto Centrale per il Catalogo Unico |
5925 | RICA | Regole italiane di catalogazione per autori |
5994 | BIB-RDG | The following lists of readings in markup theory and the TEI derive from work originally prepared by Susan Schreibman and Kevin Hawkins for the TEI Education Special Interest Group, recoded in TEI P5 by Sabine Krott and Eva Radermacher. They should be regarded only as a snapshot of work in progress, to which further contributions and corrections are welcomed (see further |
6469 | Burnard1999 | Closing plenary address at the XML Europe Conference, Granada, May 1999 |
6547 | Burnard2001a | Dalle «Due Culture» Alla Cultura Digitale: La Nascita del Demotico Digitale |
6663 | Burnard2005b | Metadata for corpus work |
7623 | Pichler1995 | Culture and Value: Philosophy and the Cultural Sciences. Beiträge des 18. Internationalen Wittgenstein Symposiums 13–20. August 1995 Kirchberg am Wechsel |
7626 | Pichler1995 | Kirchberg am Wechsel |
8533 | Unsworthetaleds2004 | TEI Consortium |
8670 | BIB-RDG | TEI |
8780 | BaumanandCatapano1999 | TEI and the Encoding of the Physical Structure of Books |
8810 | Bauman2005 | TEI HORSEing Around |
8889 | Burnard1993 | Rolling your own with the TEI |
9005 | Burnard1997 | Prepared for a seminar on Etiquetación y extracción de información de grandes corpus textuales within the Curso Industrias de la Lengua (14–18 de Julio de 1997). Sponsored by the Fundacion Duques de Soria. |
9022 | BurnardandPopham1999 | Putting Our Headers Together: A Report on the TEI Header Meeting 12 September 1997 |
9084 | Ciottied2005 | Il Manuale TEI Lite: Introduzione Alla Codifica Elettronica Dei Testi Letterari |
9104 | Chang2001 | The Implications of TEI |
9150 | DigitalLibraryFederation1998 | TEI and XML in Digital Libraries: Meeting June 30 and July 1, 1998, Library of Congress, Summary/Proceedings |
9167 | DigitalLibraryFederation2007 | TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices |
9264 | Loiseaunodate | Les standards : autour d'XML et de la TEI |
9288 | MarkoandKelleher2001 | Descriptive Metadata Strategy for TEI Headers: A University of Michigan Library Case Study |
9318 | Mertz2003 | XML Matters: TEI — the Text Encoding Initiative |
9431 | Rahtz2003 | Building TEI DTDs and Schemas on demand |
9462 | Rahtzetal2004 | A unified model for text markup: TEI, Docbook, and beyond |
9521 | Robinsonnodate | Making a Digital Edition with TEI and Anastasia |
9539 | Seaman1995 | The Electronic Text Center Introduction to TEI and Guide to Document Preparation |
9558 | Simons1999 | Using Architectural Forms to Map TEI Data into an Object-Oriented Database |
9588 | Smith1999 | Textual Variation and Version Control in the TEI |
9720 | Vanhoutte2004 | An Introduction to the TEI and the TEI Consortium |
# | id | text |
---|---|---|
4 | CH | The documents which users of these Guidelines may wish to encode encompass all kinds of material, potentially expressed in the full range of written and spoken human languages, including the extinct, the non-existent, and the conjectural. Because of this wide scope, special attention has been paid to two particular aspects of the representation of linguistic information often taken for granted: language identification and character encoding. |
6 | CH | Even within a single document, material in many different languages may be encountered. Human culture, and the texts which embody it, is intrinsically multilingual, and shows no sign of ceasing to be so. Traditional philologists and modern computational linguists alike work in a polyglot world, in which code-switching (in the linguistic sense) and accurate representation of differing language systems constitute the norm, not the exception. The current increased interest in studies of linguistic diversity, most notably in the recording and documentation of endangered languages, is one aspect of this long-standing tradition. Because of their historical importance, the needs of endangered and even extinct languages must be taken into account when formulating Guidelines and recommendations such as these. |
8 | CH | Beyond the sheer number and diversity of human languages, it should be remembered that in their written forms they may deploy a huge variety of scripts or writing systems. These scripts are in turn composed of smaller units, which for simplicity we term here characters. A primary goal when encoding a text should be to capture enough information for subsequent users of it to identify correctly the language, script, and constituent characters. In this chapter we address this requirement, and propose recommended mechanisms to indicate the languages, scripts and characters used in a document or a part thereof. |
10 | CH | Identification of language is dealt with in |
11 | CH | . In summary, it recommends the use of pre-defined identifiers for a language where these are available, as they increasingly are, in part as a result of the twin pressures of an increasing demand for language-specific software and an increased interest in language documentation. Where such identifiers are not available or not standardized, these Guidelines recommend a way of documenting language identifiers and their significance, in the same way as other metadata is documented in the TEI header. |
13 | CH | Standardization of the means available to represent characters and scripts has moved on considerably since the publication of the first version of these Guidelines. At that time, it was essential to explicitly document the characters and encoded character sets used by almost any digital resource if it was to have any chance of being usable across different computer platforms or environments, but this is no longer the case. With the availability of the Unicode standard, more than 110,000 different characters representing almost all of the world's current writing systems are available and usable in any XML processing environment without formality. Nevertheless, however large the number of standardized characters, there will always be a need to encode documents which use non-standard characters and glyphs, particularly but not exclusively in historical material. Furthermore, the full potential of Unicode is still not yet realized in all software which users of the Guidelines are likely to encounter. The second part of this chapter therefore discusses in some detail the concepts and practice underlying this standard, and also introduces the methods available for extending beyond it, which are more fully discussed in |
18 | CHSH | Identification of the language a document or part thereof is written in is a crucial requirement for many envisioned usages of an electronic document. The TEI therefore accommodates this need in the following way: |
22 | CHSH | is defined for all TEI elements. Its value identifies the language and writing system used. |
24 | CHSH | The TEI header has a section set aside for the information about the languages used in a document: see further |
28 | CHSH | The value of the attribute |
30 | CHSH | identifies the language using a coded value. For maximal compatibility with existing processes, modelling this value in the following way is recommended (this parallels the modelling of |
34 | CHSH | The identifier for the language should be constructed as in |
41 | CHSH | element in the TEI header, if one is present. |
46 | CHSH | , and proposes the following mechanism for constructing an identifier (tag) for languages as administered by the Internet Assigned Numbers Authority (IANA). The tag is assembled from a sequence of subtags separated by the hyphen (-, U+002D) character. It gives the language (possibly further identified with a sublanguage), a script and a region for this language, each possibly followed by a variant subtag. |
48 | CHSH | The authoritative list of registered subtags is maintained by IANA and is available at |
49 | CHSH | . For a good general overview of the construction of language tags, see |
53 | CHSH | In addition to the list of registered subtags, both BCP 47 and ISO 639-2 provide extensions that can be employed by private convention. The constructs provided can thus be used to generate identifiers for any language, past or present, used in any area of the world. If such private extensions are used within the context of the TEI, they should be documented within the |
55 | CHSH | element of the TEI header, which might also provide a prose description of the language described by the language tag. |
57 | CHSH | While language, region and script can be adequately identified using this mechanism, there is only very rough provision to express a dimension of time for the language of a document; those codes provided (e.g. |
61 | CHSH | in ISO 639-2) might not reflect the segments appropriate for a text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in BCP 47 and relate that to a |
65 | CHSH | section of the TEI header. |
67 | CHSH | Equivalences to language identifiers by other authorities can be given in the |
71 | CHSH | The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the |
73 | CHSH | attribute, including all elements and all attributes where a language might apply. |
74 | CHSH | This will exclude all attributes where a non-textual datatype has been specified, for example tokens, boolean values or predefined value lists. |
81 | CH | All document encoding has to do with representing one thing by another in an agreed and systematic way. Applied to the smallest distinctive units in any given writing system, which for the moment we may loosely call |
88 | D4-41 | When the first methods of representing text for storage or transmission by machines were devised, long before the development of computers, the overriding aim was to identify the smallest set of symbols needed to convey the essential semantic content, and to encode that symbol set in the most economical way that the storage or transmission media allowed. The initial outcomes were systems that encoded only such content as could be expressed in uppercase letters in the Latin script, plus a few punctuation marks and some |
92 | D4-41 | For many years after the invention of computers, the way they represented text continued to be constrained by the imperative to use expensive resources with maximal efficiency. Even when storage and processing costs began their dramatic fall, the Anglo-centric outlook of most hardware designers and software engineers hampered initiatives to devise a more generous and flexible model for text representation. The wish to retain compatibility with |
94 | D4-41 | data was an additional disincentive. Eventually, tension in East Asia between commitment to technological progress and the inability of existing computers to cope with local writing systems led to decisive developments. Japanese, Korean and Chinese standards bodies, who long before the advent of computers had been engaged in the specification of character sets, joined with computer manufacturers and software houses to devise ways of mapping those character sets to numeric encodings and processing the resulting text data. |
96 | D4-41 | Unfortunately, in the early years there was little or no co-ordination among either the national standards bodies or the manufacturers concerned, so that although commercial necessity dictated that these various local standards were all compatible with the representation of US-American English, they were not straightforwardly compatible with one another. Even within Japan itself there emerged a number of mutually incompatible systems, thanks to a mixture of commercial rivalry, disagreements about how best to manage certain intractable problems, and the fact that such pioneering work inevitably involved some false starts, leading to incompatibilities even between successive products of the same bodies. Roughly at the same time, and for similar reasons, multiple and incompatible ways of representing languages that use Cyrillic scripts were devised, along with methods of encoding ancient writing systems which inevitably could not aim for compatibility with other writing systems apart from basic Latin script. Many of the earliest projects that fed into the TEI were shaped in this developmental phase of the computerized representation of texts, and it was also the context in which SGML was devised and finalized. |
98 | D4-41 | SGML had of necessity to offer ways of coping with multiple writing systems in multiple representations; or rather, it provided a framework within which SGML-compliant applications capable of handling such multiple representations might be developed by those with sufficient financial and personnel resources (such as are seldom found in academia). Earlier editions of these Guidelines offered advice on character set and writing system issues addressed to the condition of those for whom SGML was the only feasible option. That advice must now be substantially altered because of two closely-related developments: the availability of the ISO/Unicode character set as an international standard, and the emergence of XML and related technologies which are committed to the theory and practice of character representation which Unicode embodies. |
118 | D4-42 | will not of itself take us very far towards greater terminological precision. It tends to be used to refer indiscriminately both to the visible symbol on a page and to the letter or ideograph which that symbol represents, two things that it is essential to keep conceptually distinct. The visible symbol obviously has some aspects by which we interpret it as representing one character rather than another; but its appearance may also be significantly determined by features that have no effect on our notion of which character in a writing system it represents. A familiar instance is the lowercase |
122 | D4-42 | symbol ( |
123 | D4-42 | cf. figure 1 |
127 | D4-42 | figure 1 |
129 | D4-42 | abstract character |
136 | D4-42 | in a serif typeface has additional strokes that are absent from the same letter when printed using a sans-serif typeface, so that once again we have differing glyphs standing for the same abstract character. In |
137 | D4-42 | there is even a font, Capitals Regular, in which the glyph for the lowercase letter |
139 | D4-42 | looks like a typical glyph for the character uppercase |
141 | D4-42 | . The distinction between abstract characters and glyphs is fundamental to all machine processing of documents. |
143 | D4-42 | In most scholarly encoding projects, the accurate recording of the abstract characters which make up the text is of prime importance, because it is the essential prerequisite of digitizing and processing the document without semantic loss. In many cases (though there are important exceptions, to be touched on shortly) it may not be necessary to encode the specific glyphs used to render those abstract characters in the original document. An encoding that faithfully registers the abstract characters of a document allows us to search and analyse our document's content, language and structure and access its full semantics. That same encoding, however, may not contain sufficient information to allow an exact visual representation of the glyphs in the source text or manuscript to be recreated. |
145 | D4-42 | The importance of this distinction between information content and its visual representation is not always immediately apparent to people unused to the specific complexities of text handling by machine. Such users tend to ask first what (in order of conceptual priority) should actually be their very last question: how do I get a physical image that looks like character x in my source document to appear on to the screen or the output page? Their first question should in fact be: how can I get an abstract representation of character x into my encoded document in a way that will be universally and unambiguously identifiable, no matter what it happens to look like in printout or on any particular display? And occasionally the response they receive as a result of their misguided initial question is a custom |
147 | D4-42 | that satisfies their immediate rendering wishes at the price of making their underlying document unintelligible to other users (or even to the original user in other times and places) because it encodes the abstract character in an idiosyncratic way. |
149 | D4-42 | That said, there will certainly be documents or projects where it is a matter of scholarly significance that the compositor or scribe chose to represent a given abstract character using one particular glyph or set of strokes rather than a semantically-equivalent but visually distinct alternative, and in that case the specific appearance of the form will have to be encoded in one way or another. But that encoding need not (and in most cases will not) involve a notation that visually resembles the original, any more than italicized text in an original document will be represented by the use of italic characters in the encoded version. |
151 | D4-42 | A collection of the abstract characters needed to represent documents in a given writing system is known as a |
152 | D4-42 | character set |
153 | D4-42 | , and the character set or |
155 | D4-42 | of a processing or rendering device is the set of abstract characters that it is equipped to recognize and handle appropriately. There is, however, a subtle distinction between these two parallel uses of the same term, involving one more key concept which it is essential to grasp. The character set of a document (or the writing system in which it is recorded) is purely a collection of abstract characters. But the character set of a computing device is a set of abstract characters which have been mapped in a well-defined way to a set of numbers or |
156 | D4-42 | code points |
157 | D4-42 | by which the device represents those abstract characters internally. It can therefore be referred to as a |
158 | D4-42 | coded character set |
159 | D4-42 | , meaning a set of abstract characters each of which has been assigned a numerical code point (or in some instances a sequence of code points) which unambiguously identifies the character concerned. |
161 | D4-42 | It is now possible to use this terminology to say what Unicode is: it is a coded character set, devised and actively maintained by an international public body, where each abstract character is identified by a unique name and assigned a distinctive code point. |
162 | D4-42 | Although only Unicode is mentioned here explicitly, it should be noted that the character repertoire and assigned code points of Unicode and the ISO standard 10646 are identical and maintained in a way that ensures this continues to be the case. |
163 | D4-42 | Unicode is distinguished from other, earlier and co-existing coded character sets by its (current and potential) size and scope; its built-in provision for (in practical terms) limitless expansion; the range and quality of linguistic and computational expertise on which it draws; the commitment in principle (and to an increasing degree in practice) to implement it by all important providers of hardware and software worldwide; and the stability, authority and accessibility it derives from its status as an international public standard. |
169 | D4-43 | The distinction between abstract characters and glyphs can be crucial when devising an encoding scheme. Users performing text retrieval, searching or concordancing will expect the system to recognize and treat different glyphs as instances of the same character; but when perusing the text itself they may well expect to see glyph variants preserved and rendered. When encoding a pre-existing text, the encoder must determine whether a particular letter or symbol is a character or a glyphic variant. A detailed model of the relationship between characters and glyphs has been developed within the Unicode Consortium and an ISO work group (ISO/IEC JTC1 SC2/WG2). Its report ( |
171 | D4-43 | ) will form the base for much future standards work. |
173 | D4-43 | The model makes explicit the distinction between two different properties of the components of written language: |
175 | D4-43 | their content, i.e. their meaning and phonetic value (represented by a character) |
181 | D4-43 | When searching for information, a system generally operates on the content aspects of characters, with little or no attention to their appearance. A layout or formatting process, on the other hand, must of necessity be concerned with the exact appearance of characters. Of course, some operations (hyphenation for example) require attention to both kinds of feature, but in general the kind of text encoding described in these Guidelines tends to focus on content rather than appearance (see further |
186 | D4-43 | the level of character encoding, using an appropriate Unicode code point to represent the glyph concerned |
188 | D4-43 | the markup level, with the glyph indicated via appropriate elements and/or attributes |
192 | D4-43 | The encoding practice adopted may be guided by, among other things, an assessment of the most frequent uses to which the encoded text will be put. For example, if recognition of identical characters represented by a variety of glyphs is the main priority, it may be advisable to represent the glyph variations at markup level, so that the character value can be immediately exposed to the indexing and retrieval software. Plainly, an encoding project will need to consider such issues carefully and embody the outcome of their deliberations in local manuals of procedure to ensure encoding consistency. Using Unicode code points to represent glyph information requires that such choices be documented in the TEI header. Such documentation cannot of itself guarantee proper display of the desired glyph but at least makes the intention of the encoder discoverable. |
194 | D4-43 | At present the Unicode Standard does not offer detailed specifications for the encoding of glyph variations. These Guidelines do give some recommendations; some discussion of related matters is given in |
204 | D4-44 | (IMEs) commonly used for the entry of logographic characters. This is most likely to be convenient where the display used for text entry and/or the printer used to produce output for proofreading purposes is capable of rendering the characters concerned using correct and readily identifiable glyphs. Where such easily checkable rendering is not available, or where there is no suitable method of inputting certain characters directly, they may be input by one of two possible forms of indirect notation or |
208 | D4-44 | The first form of reference is a |
210 | D4-44 | (NCR), which takes the general form |
214 | D4-44 | is an integer representing the code point of the character in base 10, or |
218 | D4-44 | is the code point in hexadecimal notation. This has the advantage that no declaration of what this notation means is required anywhere in the document instance or its associated schema. Every XML processor is capable of recognising NCRs and replacing them with the required code point value without needing access to any additional data. The disadvantage of NCRs as a means of entering, representing and proofing character data is that most human beings find them anything but |
222 | D4-44 | The second form of reference is a |
226 | D4-44 | that could be distinctively recognized by a processing system). Character entity references can (and indeed should) have names whose significance is apparent to humans, but each and every entity name has to be associated with its replacement (which as explained below should be a character value, possibly in the form of a NCR) via a formal declaration in the document's internal or external subset. This, however, is not needed for Character Entities defined by the XML standard, namely &amp; (&), &gt; (>), &lt; (<), &apos; ('), and &quot; ("). For a large number of characters defined by Unicode and commonly used in documents, there are ISO entity sets declaring mnemonic names which should be used wherever feasible: XML compatible character entity declarations using ISO names and suitable for inclusion into the subset are available on the TEI web sites. |
228 | D4-44 | Where characters are not defined in Unicode and so have to be assigned both a local code point and a local entity name of the project's choosing (see |
229 | D4-44 | below) it is highly desirable to follow the same nomenclature principles as ISO and to emulate the practice in the ISO character entity declarations of appending a string giving the character a unique descriptive name as a comment to the actual entity declaration. In addition, where different groups or projects are working on texts with geographical, historical, linguistic or other similarities that give rise to common issues of character encoding, it is highly advisable in the interests of consistency that they should consult one another when devising entity names. The TEI mailing list may provide a suitable first point of contact for such consultations. Further advice on the matter of locally-defined characters is contained in |
237 | D4-45a | Rendering of the encoded text is a complicated process that depends largely on the purpose, external requirements, local equipment and so forth; it is thus outside the scope of these Guidelines. |
239 | D4-45a | It might nevertheless be helpful to place some of the terminology used for the rendering process in the context of the discussion of this chapter. As was mentioned above, Unicode encodes abstract characters, not specific glyphs. For any process that makes characters visible, however, concrete, specifically designed glyph shapes have to be used. For a printing process, for example, these shapes describe exactly at which point ink has to be put on the paper and which areas have to be left blank. If we want to print a character from the Latin script, besides the selection of the overall glyph shape, this process also requires the selection of a specific font weight, a specific size, and the degree to which the shape should be slanted. Beyond individual characters, the overall typesetting process also follows specific rules of how to calculate the distance between characters, how much whitespace occurs between words, at which points line breaks might occur and so forth. |
241 | D4-45a | If we concern ourselves only with the rendering process of the characters themselves, leaving out all these other parameters, we will realize that of all the information required for this process, only a small amount will be drawn from the encoded text itself. This information is the code point used to encode the character in the document. With this information, the font selected for printing will be queried to provide a glyph shape for this character. Some modern font formats (e.g. OpenType) do implement a sophisticated mapping from a code point to the glyph selected, which might take into account surrounding characters (to create ligatures where necessary) and the language or even area this character is printed for to accommodate different typesetting traditions and differences in the usage of glyphs. |
243 | D4-45a | A TEI document might provide some of the information that is required for this process for example by identifying the linguistic context with the |
245 | D4-45a | attribute. The selection of fonts and sizes is usually done in a stylesheet, while the actual layout of a page is determined by the typesetting system used. Similarly, if a document is rendered for publication on the Web, information of this kind can be shipped with the document in a stylesheet. |
252 | D4-45b | The devisers of the XML standard took the view that Unicode should be the only means of representing abstract characters which conformant XML processors were obliged to support. That certainly does not preclude the use of other character encoding schemes or character sets in documents which are to be handled by XML processors, but it does mean that all the abstract characters which are encoded as characters (as distinct from being represented indirectly via markup) in an XML document must either possess an assigned code point within the public Unicode standard, or be assigned a code point devised by and specific to the local project, taken from a reserved range set aside by the standard expressly for this purpose, the so-called |
254 | D4-45b | or PUAs. For the vast majority of projects to which these Guidelines are applicable, the Unicode standard will already offer code points for all the abstract characters their documents employ, and so the requirement that all such characters should be resolvable by XML processors to Unicode code points will not involve any representation via markup or use of PUA code points. Indeed, such projects are not obliged by their choice of XML to use Unicode in their documents. Provided they correctly declare at the requisite points any non-Unicode coded character set they may use, ensure that all their XML processors support their declared encoding, and then consistently employ that encoding in strict conformity with their declarations, they need not consciously concern themselves with Unicode unless and until they feel it is appropriate to do so. |
259 | D4-45-1 | There are, however, strict limits to the way conformant XML processors handle documents whose character set is not Unicode, and unless these limits are understood it is likely that projects not yet ready to commit to Unicode across the board will run into unexpected and baffling problems as they attempt to operate with their legacy character encodings. First, it must be repeated that nothing in the XML standard |
261 | D4-45-1 | conformant processors to handle non-Unicode documents. But even if there were any actual processors which on that basis refused to process non-Unicode documents, that would not limit their usefulness as severely as might at first appear. The reason is that there is a way of internally representing Unicode code points (explained further in |
262 | D4-45-1 | below) where there is no detectable difference between a document which is actually encoded in ASCII employing only 7-bit values and one which is encoded in Unicode but which happens to contain only the abstract characters encompassed by the 7-bit ASCII standard. And the XML standard specifies that this way of representing Unicode is the one which processors must assume as the default for any document that does not explicitly declare an encoding. At a stroke, this provision ensures that all pure 7-bit ASCII encoded documents can be processed without further ado by all conformant XML processors. Add to this the provision, also within the XML standard, that allows any Unicode code point to be indirectly specified using only 7-bit ASCII characters via a Numeric Character Reference (NCR), and the upshot is that all documents in non-Unicode encodings which can be pre-processed to rewrite any characters outside the 7-bit ASCII range as Unicode code points in NCR notation (a simple batch procedure for which software is readily available) can be handled even by processors which have no inbuilt support for any encoding other than Unicode. |
266 | D4-45-1 | To avoid confusion when taking advantage of such encoding support, it is first of all essential to grasp that an encoding declaration in an XML document is indeed simply a declaration: it is not an incantation that magically converts the document that follows into the encoding concerned. It is a common error to think that simply declaring a document's encoding to be, say ISO-8859-1 (or for that matter UTF-8 or UTF-16, the representations of Unicode for which support is mandatory) is sufficient to |
268 | D4-45-1 | . Such a declaration is useless unless the document that follows actually is encoded strictly in conformance with the declaration. Some of the circumstances in which that may not in fact be the case are outlined in |
269 | D4-45-1 | below. Secondly, an encoding declaration does not somehow switch an XML processor into a mode where it works entirely in the declared encoding for as long as the declaration is in scope. On the contrary, all it does is instruct the processor to pass its input through a filter that immediately converts all the code points in the declared encoding into their Unicode counterparts; from that point onwards the document as seen by all subsequent stages of processing is actually in Unicode, even though that may not be apparent to the user. Thirdly, this invariable internal conversion has a crucial consequence: the fact that a processor can successfully accept a document in a non-Unicode encoding does not mean that it will necessarily convert any output it may produce back into the declared input encoding. Internally, the document has been converted to and processed in Unicode, and there is nothing in the XML standard that requires the reverse conversion to be performed at the output stage. Most processors go beyond the standard by offering a facility to output in various encodings: but whether it is available and how to use it must be ascertained from the processor's documentation. Should it be unavailable or unreliable, the output may need to be post-processed through a character convertor to restore the original encoding, and again such software is freely available and easy to use. |
275 | D4-45-2 | In the cases considered in the preceding section, there was a suitable Unicode code point corresponding to each abstract character contained in the non-Unicode character set of the input document. In such instances, the mandatory internal conversion to Unicode carried out by the processor can be more or less transparent to a user who wishes to continue to work with a non-Unicode character set. Things become rather different when the non-Unicode character set contains abstract characters for which there is no code point in the Unicode standard, or when a project that is attempting to work in Unicode throughout finds that it needs to represent abstract characters not currently provided for in the Unicode standard. Here, a significant difference between SGML and XML emerges in a rather troublesome way. |
277 | D4-45-2 | Following their agenda to devise a subset of SGML that would be significantly easier to implement, the authors of the XML specification decided that one particular type of entity available in SGML, known as an internal SDATA entity, should not be carried over into XML. It would be idle to question that decision here, but its consequences for the handling of abstract characters for which there is no Unicode definition were significant. |
279 | D4-45-2 | The procedures recommended in earlier versions of these Guidelines for encoding, processing and exchanging what we might call locally defined abstract characters were reliant on the availability of entities declared as of type SDATA, but that type is not supported in XML, and there is therefore no ready equivalent for XML-based projects to the recommendations previously offered. |
280 | D4-45-2 | In essence, when an SGML parser encounters a reference to an entity of type SDATA, it supplies to the application which it is servicing the name of that entity, as found in the document, plus a pointer to a location somewhere on the local system, and what is present at that location may in turn allow or instruct the application to do one of a number of things, including looking up the entity name in a table and deriving information about the referenced entity which can trigger specific behaviours in the application appropriate to the processing of that abstract character. There is however no way to make an XML parser do anything of the kind in response to an entity reference. |
281 | D4-45-2 | Entities in XML are really only of two basic types, parsed and unparsed. Unparsed entities are of no relevance here. References to parsed entities in an XML document result in only one kind of behaviour: when they appear in the parser's input stream, the parser expects to be able to resolve them by locating a declaration in the document's internal or external subset which maps the entity name to its replacement text. The parser then inserts that replacement text into the document in place of the entity reference, which is discarded without trace. The act of replacement is not notified to the application, except where it fails because the entity is undeclared or the declaration is in some way defective (in which case the parser signals a fatal error and stops). |
283 | D4-45-2 | Though for explanatory convenience much XML-related documentation, including these Guidelines, refers specifically to Character Entities and Character Entity References, a character entity in XML is not a distinct |
285 | D4-45-2 | in the sense that |
287 | D4-45-2 | is understood in Computer Science terminology, for example when referring to the type of an attribute. Hence there is no way in which editing or other software can check that the replacement to be inserted is indeed a single character or its equivalent rather than an arbitrary chunk of text, possibly including markup. A character entity is simply a general entity whose replacement text happens to be declared as a character value or a NCR representing that value. This has two important consequences if it is proposed to use such an entity reference to stand for a character that has no Unicode equivalent. First, the entity name reference will disappear at an early stage in the parse and be replaced by the declared value of the entity, so that no processing which requires access in the parsed document to the entity reference as originally entered is possible. Secondly, if a character entity is to be used as a true equivalent to a normal character, and consequently be employed at all points in a document where a single character could legitimately occur (apart from in element and attribute names, where no references of any kind are allowed) then it is essential that its replacement value indeed be pure character data. If the replacement value of the entity were to contain any markup, or a processing instruction, there would be many places in a document where simple character data would be legitimate, but where the substitution of markup or some other replacement could cause the document to become invalid or malformed. Taken together, these considerations mean that the transparent use of a CER to stand for a non-Unicode character in an XML document is simply not possible. |
299 | D4-46-1 | The principles of Unicode are judiciously tempered with pragmatism. This means, among other things, that the actual repertoire of characters which the standard encodes, especially those parts dating from its earlier days, includes a number of items which on a strict interpretation of the Unicode Consortium's theoretical approach should not have been regarded as abstract characters in their own right. Some of these characters are grouped |
302 | D4-46-1 | . Ligatures are a case in point. Ligatures (e.g. the joining of adjacent lowercase letters |
303 | D4-46-1 | s |
307 | D4-46-1 | f |
310 | D4-46-1 | in Latin scripts, whether produced by a scribal practice of not lifting the pen between strokes or dictated by the aesthetics of a type design) are representational features with no added semantic value beyond that of the two letters they unite (though for historians of typography their presence and form in a given edition may be of scholarly significance). However, by the time the Unicode standard was first being debated, it had become common practice to include single glyphs representing the more common ligatures in the repertoires of some typesetting devices and high-end printers, and for the coded character sets built into those devices to use a single code point for such glyphs, even though they represent two distinct abstract characters. So as to increase the acceptance of Unicode among the makers and users of such devices, it was agreed that some such pseudo-characters should be incorporated into the standard as compatibility characters. Nevertheless, if a project requires the presence of such ligatured forms to be encoded, this should normally be done via markup, not by the use of a compatibility character. That way, the presence of the ligature can still be identified (and, if desired, rendered visually) where appropriate, but indexing and retrieval software will treat the code points in the document as a simple sequential occurrence of the two constituent characters concerned and so correctly align their semantics with non-ligatured equivalents. Such ligatures should not be confused with digraphs (usually) indicating diphthongs, as in the French word "cœur". A digraph is an atomic orthographic unit representing an abstract character in its own right, not purely an amalgamation of glyphs, and indexing and retrieval software must treat it as such. Where a digraph occurs in a source text, it should normally be encoded using the appropriate code point for the single abstract character which it represents, either by direct entry of the character concerned or through the appropriate CER or NCR. |
316 | D4-46-2 | The treatment of characters with diacritical marks within Unicode shows a similar combination of rigour and pragmatism. It is obvious enough that it would be feasible to represent many characters with diacritical marks in Latin and some other scripts by a sequence of code points, where one code point designated the base character and the remainder represented one or more diacritical marks that were to be combined with the base character to produce an appropriate glyphic rendering of the abstract character concerned. From its earliest phase, the Unicode Consortium espoused this view in theory but was prepared in practice to compromise by assigning single code points to |
318 | D4-46-2 | characters which were already commonly assigned a single distinctive code point in existing encoding schemes. This means, however, that for quite a large number of commonly-occurring abstract characters, Unicode has two different, but logically and semantically equivalent encodings: a |
320 | D4-46-2 | single code point, and a code point sequence of a base character plus one or more |
323 | D4-46-2 | normalization |
324 | D4-46-2 | of Unicode documents. Normalization is the process of ensuring that a given abstract character is represented in one way only in a given Unicode document or document collection. The Unicode Consortium provides four standard normalization forms, of which the Normalization Form C (NFC) seems to be most appropriate for text encoding projects. The NFC, as far as possible, defines conversions for all base characters followed by one or more combining characters into the corresponding precomposed characters. The World Wide Web Consortium has produced a document entitled |
328 | D4-46-2 | , which among other things discusses normalization issues and outlines some relevant principles. An authoritative reference is Unicode Standard Annex #15 |
331 | D4-46-2 | . Individual projects will have to decide how far their decisions on normalization need be influenced by the fact that at present, by no means all hardware or software can correctly render (or even consistently identify) abstract characters encoded using combining symbols. |
333 | D4-46-2 | It is important that every Unicode-based project should agree on, consistently implement and fully document a comprehensive and coherent normalization practice. As well as ensuring data integrity within a given project, a consistently implemented and properly documented normalization policy is essential for successful document interchange. |
339 | D4-46-3 | In addition to the Universal Character Set itself, the Unicode Consortium maintains a database of additional character semantics |
340 | D4-46-3 | . This includes names for each character code point and normative properties for it. Character properties, as given in this database, determine the semantics and thus the intended use of a code point or character. It also contains information that might be needed for correctly processing this character for different purposes. This database is an important reference in determining which Unicode code point to use to encode a certain character. |
342 | D4-46-3 | In addition to the printed documentation and lists made available by the Unicode consortium, the information it contains may also be accessed by a number of search systems over the Web (e.g. |
343 | D4-46-3 | ). Character properties recorded in the database include case, numeric value, directionality, and, where applicable, status as a |
349 | D4-46-3 | . Where a project undertakes local definition of characters with code points in the PUA, it is desirable that any relevant additional information about the characters concerned should be recorded in an analogous way, as further discussed under |
357 | D4-47 | An important difference between SGML and XML is that the latter allows for the processing of non-validated documents. Since validity and validation are central TEI concerns, it is unlikely that documents prepared according to these Guidelines will ever be designed or implemented as merely well-formed in the XML sense. However, in the domain of XML technologies, even where a document invokes a DTD or schema, it is not always necessarily the case that an XML processor will perform a full validation of it. XSLT transformation is a common case in point. By the workflow stage at which a document is handed off to an XSLT process for transformation, it is likely that its associated DTD or schema will already have fulfilled its role of integrity assurance and quality control, and so it may be undesirable to add validation to the processing overhead. For this reason, most XSLT processors do not attempt validation by default, even if a DTD or schema is declared and accessible. This can, however, create a problem where parsed entities (and, in the present context, character entities in particular) are referenced. A validating parser reads all entity declarations from the DTD (including those for character entities) in the initial phase of processing, so that they can be resolved as and when required. However, where no validation takes place, it cannot automatically be assumed that the parser will be able to resolve such entities in all circumstances. The XML standard requires a non-validating parser to read and act on entity declarations only if they are located within the document's internal subset (which does not, of course, mean that the entity declarations have to be manually merged into the document instance in advance of processing: character entity sets, for instance, count as being in the internal subset if they are placed there via a parameter entity, as is normal TEI practice). Some parsers when in non-validating mode will also access entity declarations in the external subset, but this behaviour is not mandated by the standard and should not be relied upon. Provided these facts are borne in mind, the presence of character entities in a document when parser validation is switched off should not cause any difficulties. |
363 | D4-48 | In theory it should not be necessary for encoders to have any knowledge of the various ways in which Unicode code points can be represented internally within a document or in the memory of a processing system, but experience shows that problems frequently arise in this area because of mistaken practice or defective software, and in order to recognize the resulting symptoms and correct their causes an outline knowledge of certain aspects of Unicode internal representation is desirable. |
368 | D4-48-1 | The code points assigned by Unicode 3.0 and later are notionally 32-bit integers, and the most straightforward way to represent each such integer in computer storage would be to use 4 eight-bit bytes. However, many of the code points for characters most commonly used in Latin scripts can be represented in one byte only and the vast majority of the remainder which are in common use (including those assigned from the most frequently used PUA range) can be expressed in two bytes alone. This accounts for the use of UTF-8 and UTF-16 and their special place in the XML standard. UTF-8 and UTF-16 are ways of representing 32-bit code points in an economical way. |
369 | D4-48-1 | UTF-8 is a variable-length encoding: the more significant bits there are in the underlying code point (or in everyday terminology the bigger the number used to represent the character), the more bytes UTF-8 uses to encode it. What makes UTF-8 particularly attractive for representing Latin scripts, explaining its status as the default encoding in XML documents, is that all code points that can be expressed in seven or fewer bits (the 128 values in the original ASCII character set) are also encoded as the same seven or fewer bits (and therefore in a single byte) in UTF-8. That is why a document which is actually encoded in pure 7-bit ASCII can be fed to an XML processor without alteration and without its encoding being explicitly declared: the processor will regard it as being in the UTF-8 representation of Unicode and be able to handle it correctly on that basis. |
371 | D4-48-1 | However, even within the domain of Latin-based scripts, some projects have documents which use characters from 8 bit extensions to ASCII, e.g. those in the ISO-8859-n series of encodings, and the way characters which under ISO-8859-n use all eight bits are encoded in UTF-8 is significantly different, giving rise to puzzling errors. Abstract characters that have a |
373 | D4-48-1 | byte code point where the highest bit is set (that is, they have a decimal numeric representation between 128 and 255) are encoded in ISO-8859-n as a |
375 | D4-48-1 | byte with the same value as the code point. But in UTF-8 code-point values inside that range are expressed as a |
377 | D4-48-1 | byte sequence. That is to say, the abstract character in question is no longer represented in the file or in memory by the same number as its code-point value: it is |
379 | D4-48-1 | (hence the T in UTF) into a sequence of two different numbers. Now as a side-effect of the way such UTF-8 sequences are derived from the underlying code-point value, many of the single-byte eight-bit values employed in ISO-8859-n encodings are illegal in UTF-8. |
381 | D4-48-1 | This complicated situation has a simple consequence which can cause great bewilderment. XML processors will effortlessly handle character data in pure 7-bit ASCII without that encoding needing to be declared to the parser, and will similarly accept documents encoded in an undeclared ISO-8859-n encoding if they happen to use no characters outside the strict ASCII subset of the ISO character sets; but the parse will immediately fail if an eight-bit character from an ISO-8859-n set is encountered in the input stream, unless the document's encoding has been explicitly and correctly declared. Explicitly declaring the encoding ought to solve the problem, and if the file is correctly encoded throughout, it will do so. But since text editors and word processors are currently acquiring different degrees of Unicode support at different rates, projects are likely to find that they have to deal with some files encoded in UTF-8 along with others in, say, ISO-8859-1. Such encoding differences may go unnoticed, especially if the proportion of characters where the internal encodings are distinguishable is relatively small (for example in a long English text with a smattering of French words). If in the process of document preparation two such files have been merged, or intermixed via |
389 | D4-48-1 | Where erroneously mixed encodings are the source of such an error, altering the encoding declaration will not solve the problem, though it may obfuscate it. Eight-bit character codes in a file declared as UTF-8 will always stop the parser. More insidiously, UTF-8 sequences in a file declared as ISO-8859-1 will not halt the parse, but will cause data corruption, because the parser will silently but erroneously convert each byte in every UTF-8 sequence into a spurious separate character, introducing semantic errors which may not become apparent until much later in the processing chain. |
391 | D4-48-1 | In projects that routinely handle documents in non-Latin scripts, everyone is well aware of the need to ensure correct and consistent encoding, so in such places mixed encoding problems seldom arise, and when they do are readily identified and remedied. Real confusion tends to arise, however, in projects which have a low awareness of the issues because they employ predominantly unaccented Latin characters, with only thinly-distributed instances of accented letters, or other |
394 | D4-48-1 | non-breaking space |
395 | D4-48-1 | ). Even, or especially, if such projects view themselves as concerned only with English documents, the close relationship between XML and Unicode means they will need to acquire an understanding of these encoding issues and develop procedures which assure consistency and integrity of encoding and its correct declaration, including the use of appropriate software for transcoding and verification. |
401 | D4-48-2 | The advantages of UTF-8 as an internal representation of Unicode code points outlined above do not obtain where documents are in scripts other than Latin, Cyrillic or Hebrew. Where characters with code points in the sixteen-bit range (two-byte) predominate, UTF-8 is inappropriate, because it requires three or more bytes to represent each abstract character. Here the preferred representation of Unicode (which all XML-conformant parsers must support) is UTF-16, where each code point corresponding to an abstract character is represented in two eight-bit bytes |
404 | D4-48-2 | values to represent code points beyond the 16-bit range is passed over here, since it adds a complication that does not affect the key points at issue |
405 | D4-48-2 | . This encoding presents a different hazard, especially while support for Unicode in editing software is relatively uneven and immature. Because the code points are represented as sixteen-bit integers stored (in most popular computers) in two separate bytes, the order in which those bytes are stored becomes important. This is dependent on the underlying hardware. In the realm of desktop computing, Macintosh machines, for example, store (on disk as well as in memory) byte pairs representing 16-bit integers with the higher-value byte first, whereas PCs using Intel processors store the bytes in the reverse order (this is often referred to with Swiftian nomenclature as |
409 | D4-48-2 | byte order). This means that if a semantically identical plain text file encoded in UTF-16 is prepared on a Macintosh and on a PC, and the two files are then saved to disk, each byte pair in one file will be in the reverse order from the corresponding byte pair in the other file. To avoid the obvious incompatibility problems, the XML standard requires that all documents whose declared encoding is UTF-16 must begin with a special pseudo-character which is not itself part of the document, but merely a Byte Order Mark (BOM) from which the processor can determine the byte order of the document that follows. Now the insertion of a correct BOM and the consistent maintenance of the byte order throughout the file ought to be taken care of transparently by software, but experience, especially from environments where work is distributed across big-endian and little-endian hardware, shows that this cannot always be taken for granted in the current state of software development. As with mixed encoding problems involving UTF-8, inconsistent byte-order in UTF-16 files seems to be the result of merging or cutting and pasting between files using software which does not correctly enforce byte order integrity, and out of misconceived |
411 | D4-48-2 | which conceals byte-order inconsistencies from the user. Once more, the result can be files which look correct in an editor, but which the XML parser either rejects outright or silently passes on in a seriously garbled form. Again, to avoid the consequent errors, projects need to cultivate an informed awareness of relevant encoding issues and devise policies to avoid them in the first place or detect them at an early stage. |
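
The normalization and internal-representation points made in rows D4-46-2, D4-48-1 and D4-48-2 above can be checked with a short script. What follows is only a minimal sketch in Python, a language chosen for this illustration and not one the Guidelines prescribe. The sample word "café", the helper name `decodes_as_utf8` and all variable names are hypothetical. It shows, in order, that NFC maps a base character plus combining mark to the precomposed form, that UTF-8 uses more bytes for higher code points, what happens when ISO-8859-1 and UTF-8 bytes are mistaken for one another, and how a BOM lets a decoder recover the byte order of UTF-16 data.

```python
import codecs
import unicodedata

# 1. Normalization: two encodings of the same abstract character.
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT
# The raw strings differ, but NFC converts the sequence to the precomposed form.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# 2. UTF-8 is variable length: bigger code points need more bytes.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "\u00e9", "\u20ac")}

# 3. Mixed encodings: the same word stored as ISO-8859-1 and as UTF-8.
word = "caf\u00e9"
latin1_bytes = word.encode("iso-8859-1")   # the accented letter is one byte, 0xE9
utf8_bytes = word.encode("utf-8")          # the accented letter is two bytes, 0xC3 0xA9


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form a legal UTF-8 sequence (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Eight-bit ISO-8859-1 data read as UTF-8 is rejected outright ...
latin1_is_legal_utf8 = decodes_as_utf8(latin1_bytes)
# ... whereas UTF-8 data read as ISO-8859-1 is accepted but silently corrupted:
# each byte of the two-byte sequence becomes a separate spurious character.
garbled = utf8_bytes.decode("iso-8859-1")

# 4. UTF-16 byte order: the BOM tells the decoder which order was used.
big_endian = codecs.BOM_UTF16_BE + precomposed.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + precomposed.encode("utf-16-le")
from_be = big_endian.decode("utf-16")      # the "utf-16" codec consumes the BOM
from_le = little_endian.decode("utf-16")
# Without a BOM, reading the bytes in the wrong order yields a different code point.
swapped = precomposed.encode("utf-16-be").decode("utf-16-le")

if __name__ == "__main__":
    print(nfc_equal, utf8_lengths, latin1_is_legal_utf8)
    print(repr(garbled), repr(from_be), repr(from_le), repr(swapped))
```

A check of this kind, run over incoming files before they are merged, is one way a project might detect mixed encodings early. It is no substitute for an agreed and documented encoding policy.
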
# | id | text |
---|---|---|
2 | ST | The TEI Infrastructure |
9 | ST | The TEI encoding scheme consists of a number of |
12 | ST | classes |
13 | ST | . Another part defines its possible content and attributes with reference to these classes. This indirection gives the TEI system much of its strength and its flexibility. Elements may be combined more or less freely to form a |
15 | ST | appropriate to a particular set of requirements. It is as easy to add to a schema new elements which reference existing classes or elements as it is to exclude some of the elements provided by any module included in a schema. |
17 | ST | In principle, a TEI schema may be constructed using any combination of modules. However, certain TEI modules are of particular importance, and should always be included in all but exceptional circumstances: the module |
25 | ST | provides declarations for the metadata elements and attributes constituting the TEI header, a component which is required for TEI conformance, while the |
30 | ST | The specification for a TEI schema is itself a TEI document, using elements from the module described in chapter |
40 | ST | The bulk of this chapter describes the TEI infrastructure module itself. Although it may be skipped at a first reading, an understanding of the topics addressed here is essential for anyone planning to take full advantage of the TEI customization techniques described in chapter |
43 | ST | The chapter begins by briefly characterizing each of the modules available in the TEI scheme. Section |
44 | ST | describes in general terms the method of constructing a TEI schema in a specific schema language such as XML DTD language, RELAX NG, or W3C Schema. |
46 | ST | The next and largest part of the chapter introduces the attribute and element classes used to define groups of elements and their characteristics (section |
52 | ST | , which are used to express some commonly used content models, and lists the |
54 | ST | used to constrain the range of legal values for TEI attributes (section |
58 | STMA | TEI Modules |
64 | STMA | a formal declaration, expressed using a special-purpose XML vocabulary defined by these Guidelines in combination with elements taken from the ISO schema language RELAX NG |
69 | STMA | Each chapter of the Guidelines presents a group of related elements, and also defines a corresponding set of declarations, which we call a |
71 | STMA | . All the definitions are collected together in the reference sections provided as an appendix. Formal declarations for a given chapter are collected together within the corresponding module. For convenience, each element is assigned to a single module, typically for use in some specific application area, or to support a particular kind of usage. A module is thus simply a convenient way of grouping together a number of associated element declarations. In the simple case, a TEI schema is made by combining together a small number of modules, as further described in section |
74 | STMA | The following table lists the modules defined by the current release of the Guidelines: |
78 | tab-mods | Module name |
86 | tab-mods | analysis |
93 | tab-mods | certainty |
100 | tab-mods | core |
107 | tab-mods | corpus |
115 | tab-mods | dictionaries |
122 | tab-mods | drama |
129 | tab-mods | figures |
136 | tab-mods | gaiji |
143 | tab-mods | header |
150 | tab-mods | iso-fs |
157 | tab-mods | linking |
164 | tab-mods | msdescription |
171 | tab-mods | namesdates |
178 | tab-mods | nets |
185 | tab-mods | spoken |
192 | tab-mods | tagdocs |
199 | tab-mods | tei |
201 | tab-mods | TEI Infrastructure |
207 | tab-mods | textcrit |
214 | tab-mods | textstructure |
221 | tab-mods | transcr |
228 | tab-mods | verse |
236 | STMA | For each module listed above, the corresponding chapter gives a full description of the classes, elements, and macros which it makes available when it is included in a schema. Other chapters of these Guidelines explore other aspects of using the TEI scheme. |
240 | STIN | Defining a TEI Schema |
243 | STIN | . For a valid TEI document, this schema must be a conformant TEI schema, as further defined in chapter |
246 | STIN | be made explicit. The method of doing this recommended by these Guidelines is to provide explicitly or by reference a TEI schema specification against which the document may be validated. |
248 | STIN | A TEI-conformant schema is a specific combination of TEI modules, possibly also including additional declarations that modify the element and attribute declarations contained by each module, for example to suppress or rename some elements. The TEI provides an application-independent way of specifying a TEI schema by means of the |
251 | STIN | . The same system may also be used to specify a schema which extends the TEI by adding new elements explicitly, or by reference to other XML vocabularies. In either case, the specification may be processed to generate a formal schema, expressed in a variety of specific schema languages, such as XML DTD language, RELAX NG, or W3C Schema. These output schemas can then be used by an XML processor such as a validator or editor to validate or otherwise process documents. Further information about the processing of a TEI formal specification is given in chapter |
257 | STINsimpleExample | The simplest customization of the TEI scheme combines just the four recommended modules mentioned above. In ODD format, this schema specification takes this form: |
272 | STINsimpleExample | ). An ODD processor will generate an appropriate schema from this set of declarations, expressed using the XML DTD language, the ISO RELAX NG language, the W3C Schema language, or in principle any other adequately powerful schema language. The resulting schema may then be associated with the document instance by one of a number of different mechanisms, as further described in chapter |
273 | STINsimpleExample | . The start point (or root element) of document instances to be validated against the schema is specified by means of the |
282 | STINlargerExample | These Guidelines introduce each of the modules making up the TEI scheme one by one, and therefore, for clarity of exposition, each chapter focusses on elements drawn from a single module. In reality, of course, the markup of a text will draw on elements taken from many different modules, partly because texts are heterogeneous objects, and partly because encoders have different goals. Some examples of this heterogeneity include: |
284 | STINlargerExample | a text may be a collection of other texts of different types: for example, an anthology of prose, verse, and drama; |
286 | STINlargerExample | a text may contain other smaller, embedded texts: for example, a poem or song included in a prose narrative; |
288 | STINlargerExample | some sections of a text may be written in one form, and others in a different form: for example, a novel where some chapters are in prose, others take the form of dictionary entries, and still others the form of scenes in a play; |
290 | STINlargerExample | an encoded text may include detailed analytic annotation, for example of rhetorical or linguistic features; |
292 | STINlargerExample | an encoded text may combine a literal transcription with a diplomatic edition of the same or different sources; |
294 | STINlargerExample | the description of a text may require additional specialized metadata elements, for example when describing manuscript material in detail. |
297 | STINlargerExample | The TEI provides mechanisms to support all of these and many other use cases. The architecture permits elements and attributes from any combination of modules to co-exist within a single schema. Within particular modules, elements and attributes are provided to support differing views of the |
301 | STINlargerExample | a definition of a corpus or collection as a series of |
303 | STINlargerExample | documents, sharing a common TEI header (see chapter |
306 | STINlargerExample | a definition of composite texts which combine optional front- and back-matter with a group of collected texts, themselves possibly composite (see section |
317 | STINlargerExample | Subsequent chapters of these Guidelines describe in detail markup constructs appropriate for these and many other possible features of interest. The markup constructs can be combined as needed for any given set of applications or project. |
319 | STINlargerExample | For example, a project aiming to produce an ambitious digital edition of a collection of manuscript materials, to include detailed metadata about each source, digital images of the content, along with a detailed transcription of each source, and a supporting biographical and geographical database might need a schema combining several modules, as follows: |
348 | STINlargerExample | The TEI architecture also supports more detailed customization beyond the simple selection of modules. A schema may suppress elements from a module, suppress some of their attributes, change their names, or even add new elements and attributes. Detailed discussion of the kind of modification possible in this way is provided in |
349 | STINlargerExample | and conformance rules relating to their application are discussed in |
350 | STINlargerExample | . These facilities are available for any schema language (though some features may not be available in all languages). The ODD language also makes it possible to combine TEI and non-TEI modules into a single schema, provided that the non-TEI module is expressed using the RELAX NG schema language (see further |
356 | STEC | The TEI Class System |
358 | STEC | The TEI scheme distinguishes about five hundred different elements. To aid comprehension, modularity, and modification, the majority of these elements are formally classified in some way. Classes are used to express two distinct kinds of commonality among elements. The elements of a class may share some set of attributes, or they may appear in the same locations in a content model. A class is known as an |
360 | STEC | if its members share attributes, and as a |
362 | STEC | if its members appear in the same locations. In either case, an element is said to |
364 | STEC | properties from any classes of which it is a member. |
372 | STEC | A basic understanding of the classes into which the TEI scheme is organized is strongly recommended and is essential for any successful customization of the system. |
377 | STECAT | An attribute class groups together elements which share some set of common attributes. Attribute classes are given names composed of the prefix |
385 | STECAT | attribute, both of which are inherited from their membership in the class rather than individually defined for each element. These attributes are said to be defined by (or inherited from) the |
387 | STECAT | class. If another element were to be added to the TEI scheme for which these attributes were considered useful, the simplest way to provide them would be to make the new element a member of the |
389 | STECAT | class. Note also that this method ensures that the attributes in question are always defined in the same way, taking the same default values etc., no matter which element they are attached to. |
391 | STECAT | Some attribute classes are defined within the |
393 | STECAT | infrastructural module and are thus globally available. Other attribute classes are specific to particular modules and thus defined in other chapters. Attributes defined by such classes will not be available unless the module concerned is included in a schema. |
439 | STECAT | when the |
441 | STECAT | module is included in a schema. If, however, this module is not included in a schema, then the |
447 | STECAT | , is common to all modules, and is therefore described in some detail in the next section. A full list of all attribute classes is given in |
453 | STGA | The following attributes are defined for every TEI element. |
458 | STGA | These attributes are optionally available for any TEI element; none of them is required. Their usage is discussed in the following subsections. |
463 | STGAid | The value supplied for the |
466 | STGAid | name |
472 | STGAid | The colon is also by default a valid name character; however, it has a specific purpose in XML (to indicate namespace prefixes), and may not therefore be used in any other way within a name. |
476 | STGAid | in an XML TEI document) uppercase and lowercase letters are distinguished, and thus |
493 | STGAid | attribute also provides an identifying name or number for an element, but in this case the information need not be a legal |
495 | STGAid | value. Its value may be any string of characters; typically it is a number or other similar enumerator or label. For example, the numbers given to the items of a numbered list may be recorded with the |
497 | STGAid | attribute; this would make it possible to record errors in the numeration of the original, as in this list of chapters, transcribed from a faulty original in which the number 10 is used twice, and 11 is omitted: |
521 | STGAid | As noted above there is no requirement to record a value for either the |
525 | STGAid | attribute. Any XML processor can identify the sequential position of one element within another in an XML document without any additional tagging. An encoding in which each line of a long poem is explicitly labelled with its numerical sequence such as the following |
539 | STGAla | attribute indicates the natural language and writing system applicable to the content of a given element. If it is not specified, the value is inherited from that of the immediately enclosing element. As a rule, therefore, it is simplest to specify the base language of the text on the |
541 | STGAla | element, and allow most elements to take the default value for |
543 | STGAla | ; the language of an element then need be explicitly specified only for elements in languages other than the base language. For this reason, it is recommended practice to supply a default value for the |
547 | STGAla | root element, or on both the |
551 | STGAla | element. The latter is appropriate in the not uncommon case where the text element in a TEI document uses a different default language from that of the TEI header attached to it. Other language shifts in the source should be explicitly identified by use of the |
555 | STGAla | In the following example schematic, an English language TEI header is attached to an English language text: |
565 | STGAla | The same effect would be obtained by specifying the default language for both header and text: |
575 | STGAla | The latter approach is necessary in the case where the two differ: for example, where an English language header is applied to a French text: |
585 | STGAla | The same principle applies at any hierarchic level. In the following example, the default language of the text is French, but one section of it is in German: |
614 | STGAla | element, by contrast, because it is in the same language as its parent. |
622 | STGAla | Note that in cases where it is advisable or necessary to identify the language of the text that is pointed at, the (non-global) attribute |
625 | STGAla | the pointer references text written in French. |
634 | STGAla | Additional information about a particular language may be supplied in the |
636 | STGAla | element within the header (see section |
649 | STGAre | attributes are all used to give information about the physical presentation of the text in the source. In the following example, |
651 | STGAre | is used to indicate that both the emphasized word and the proper name are printed in italics: |
669 | STGAre | elements are rendered in the text by italics, it will be more convenient to register that fact in the TEI header once and for all (using the |
675 | STGAre | value only for any elements which deviate from the stated rendition. |
681 | STGAre | is that the value used for the former may contain one or more tokens from any vocabulary devised by the encoder, separated by space characters, whereas the value used for the latter must be a single string taken from a formally-defined style definition language such as CSS. The |
683 | STGAre | attribute values are a sequence-indeterminate set of whitespace-separated tokens, whereas |
685 | STGAre | values allow whitespace and sequence relationships as part of the formally-defined style definition language. |
692 | STGAre | element can then be associated with any element, either by default, or by means of the global |
724 | STGAre | elements, each of which defines some aspect of the rendering or appearance of the text in its original form. These details may most conveniently be described using a formal style definition language, such as CSS ( |
726 | STGAre | ); in some other formal language developed for a specific project; or even informally in running prose. Although languages such as CSS and XSL-FO are generally used to describe document output to screen or print, they nonetheless provide formal and precise mechanisms for describing the appearance of source documents, especially print documents, but also many aspects of manuscript documents. For example, both CSS and XSL-FO provide mechanisms for describing typefaces, weight, and styles; character and line spacing; and so on. |
730 | STGAre | attribute is provided for encoders wishing to describe the appearance of individual source elements using a language such as CSS directly rather than by reference to a |
732 | STGAre | element. Its value may be any expression in the chosen formal style definition language. |
734 | STGAre | Formal definition languages such as CSS typically identify a series of |
738 | STGAre | are specified. A sequence of such property-value pairs makes up a stylesheet. The TEI uses such languages simply to describe the appearance of a source document, rather than to control how it should be formatted. |
740 | STGAre | In the TEI scheme, it is possible to supply information about the appearance of elements within a source document in the following distinct ways: |
742 | STGAre | One or more properties may be specified as the default for all elements of a given type, using the |
750 | STGAre | attribute with any convenient set of one or more sequence-indeterminate tokens; |
758 | STGAre | One or more properties may be supplied explicitly for individual element occurrences, using the |
764 | STGAre | If the same property is specified in more than one of the above ways, the one with the highest number in the list above is understood to be applicable. The resulting properties from each way are then combined to provide the full set of property-value pairs applicable to the given element, and (by default) to all of its children. |
768 | STGAre | attribute to indicate a different language for one or more |
772 | STGAre | attribute, if this is used in combination with either |
778 | STGAre | Note that these TEI attributes always describe the rendition or appearance of the source document, |
786 | STGAba | Several TEI elements carry attributes whose values are defined as |
788 | STGAba | , meaning that such attributes supply a link or pointer, typically expressed as a URL. Like other XML applications, the TEI allows use of a special attribute to set the context within which relative URLs are to be evaluated. The global attribute |
790 | STGAba | is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. We do not describe it in detail here: reference information about |
797 | STGAba | is used to set a context for all relative URLs within the scope of the element on which it is specified. For example: |
816 | STGAba | which supplies a value for |
824 | STGAba | which does not change the default context, and its target is therefore some element within the current document with the value |
828 | STGAba | attribute. Further discussion of this element and its effect on TEI linking methods is provided in chapter |
837 | STGAxs | provides a mechanism for indicating to systems processing an XML file how they should treat whitespace, that is, any sequences of consecutive tab (#x09), space (#x20), carriage return (#x0D) or linefeed (#x0A) characters. Like |
839 | STGAxs | this attribute is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. Complete information about this attribute is provided by |
841 | STGAxs | ; here we provide a summary of how its use affects users of the TEI scheme. |
848 | STGAxs | default |
849 | STGAxs | . The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is when the document is processed. The second (which is implied when the attribute is not supplied), indicates that whitespace should be handled |
853 | STGAxs | These Guidelines assume one of two different ways of processing whitespace will apply in a given case, depending on an element's content model. For an element that can contain only other elements with no intervening non-whitespace characters, whitespace is considered to have no semantic significance, and should therefore be discarded by a processor. For example, in a |
863 | STGAxs | since non-whitespace text is not permitted between the |
875 | STGAxs | element has a content model containing only elements: any punctuation or whitespace required between the lines of an address must therefore be supplied by the processor, as any whitespace present in the input document will be ignored. |
877 | STGAxs | Elements with content models of this type are comparatively unusual in the TEI: a list of them is provided in the TEI release file |
883 | STGAxs | Most TEI elements permit what is known as mixed-content: that is, they can contain both text and other elements. Here the assumption of these Guidelines is that whitespace will be normalized. This means that all space, carriage return, linefeed, and tab characters are converted into spaces, all consecutive spaces are then deleted and replaced by one space, and then space immediately after a start-tag or immediately before an end-tag is deleted. The result is that this encoding, |
899 | STGAxs | . The space before his name has been removed, a space is included between his forenames, the comma is preserved, and the newlines within his name have all been removed. |
902 | STGAxs | If the default treatment described above is not appropriate for a mixed content element, the processing required may be described in the |
904 | STGAxs | element of the TEI header, but generic XML processing tools may not take note of this. |
908 | STGAxs | attribute may be supplied with a value of |
910 | STGAxs | in order to indicate that every space, tab, carriage return and linefeed character found within that element in the document being processed is significant. Typically, the result of that processing will be to retain the whitespace characters in the output. Thus if the above example began |
911 | STGAxs | persName xml:space="preserve" |
912 | STGAxs | , the resulting text would most likely be rendered over five lines, indented, and with a blank line following. |
916 | STGAxs | attribute is rarely used in TEI documents because such layout features are generally captured with less risk and more precision by using native TEI elements such as |
983 | STECCM | As noted above, the members of a given TEI model class share the property that they can all appear in the same location within a document. Wherever possible, the content model of a TEI element is expressed not directly in terms of specific elements, but indirectly in terms of particular model classes. This makes content models simpler and more consistent; it also makes them much easier to understand and to modify. |
985 | STECCM | Like attribute classes, model classes may have subclasses or superclasses. Just as elements inherit from a class the ability to appear in certain locations of a document (wherever the class can appear), so all members of a subclass inherit the ability to appear wherever any superclass can appear. To some extent, the class system thus provides a way of reducing the whole TEI galaxy of elements into a tidy hierarchy. This is, however, not entirely the case. |
987 | STECCM | In fact, the nature of a given class of elements can be considered along two dimensions: as noted, it defines a set of places where the class members are permitted within the document hierarchy; it also implies a semantic grouping of some kind. For example, the very large class of elements which can appear within a paragraph comprises a number of other classes, all of which have the same structural property, but which differ in their field of application. Some are related to highlighting, while others relate to names or places, and so on. In some cases, the |
988 | STECCM | set of places where class members are permitted |
989 | STECCM | is very constrained: it may just be within one specific element, or one class of element, for example. In other cases, elements may be permitted to appear in very many places, or in more than one such set of places. |
991 | STECCM | These factors are reflected in the way that model classes are named. If a model class has a name containing |
997 | STECCM | then it is primarily defined in terms of its structural location. For example, those elements (or classes of element) which appear as content of a |
1001 | STECCM | class; those which appear as content of a |
1005 | STECCM | class. If, however, a model class has a name containing |
1011 | STECCM | , the implication is that its members all have some additional semantic property in common, for example containing a bibliographic description, or containing some form of name, respectively. These semantically-motivated classes often provide a useful way of dividing up large structurally-motivated classes: for example, the very general structural class |
1014 | STECCM | data elements that form part of a paragraph |
1015 | STECCM | ) has four semantically-motivated member classes ( |
1025 | STECCM | Although most classes are defined by the |
1029 | STECCM | , but instead gain their members as a consequence of individual elements' declaration of their membership. The same class may therefore contain different members, depending on which modules are active. Consequently, the content model of a given element (being expressed in terms of model classes) may differ depending on which modules are active. |
1031 | STECCM | Some classes contain only a single member, even when all modules are loaded. One reason for declaring such a class is to make it easier for a customization to add new member elements in a specific place, particularly in areas where the TEI does not make fully elaborated proposals. For example, the TEI class |
1035 | STECCM | module to include just the TEI |
1037 | STECCM | element. A project wishing to add an alternative way of structuring text-critical information could do so by defining their own elements and adding them to this class. |
1039 | STECCM | Another reason for declaring single-member classes is where the class members are not needed in all documents, but appear in the same place as elements which are very frequently required. For example, the specialized element |
1041 | STECCM | used to represent a non-Unicode character or glyph is provided as the only member of the |
1043 | STECCM | class when the |
1045 | STECCM | module is added to a schema. References to this class are included in almost every content model, since if it is used at all the |
1047 | STECCM | must be available wherever text is available; however, these references have no effect unless the gaiji module is loaded. |
1049 | STECCM | At the other end of the scale, a few of the classes predefined by the tei module are subsequently populated with very many members. For example, the class |
1051 | STECCM | groups all the classes of element for simple editorial correction and transcription which can appear within a |
1061 | STECCM | element is one of the basic building blocks of a TEI document, it is not surprising that each module will need to add elements to it. The class system here provides a very convenient way of controlling the resulting complexity. Typically, elements are not added directly to these very general classes, but via some intermediate semantically-motivated class. |
1063 | STECCM | Just as there are a few classes which have a single member, so there are some classes which are used only once in the TEI architecture. These classes, which have no superclass and therefore do not fit into the class hierarchy defined here, are a convenient way of maintaining elements which are highly structured internally, but which appear from the outside to be uniform objects like others at the same level. |
1067 | STECCM | Members of such classes can only ever appear within one element, or one class of elements. For example, the class |
1069 | STECCM | is used only to express the content model for the element |
1071 | STECCM | ; it references some other classes of elements, which can appear elsewhere, and also some elements which can only appear inside an address. |
1076 | STBTC | Most TEI elements may also be informally classified as belonging to one of the following groupings: |
1080 | STBTC | high level, possibly self-nesting, major divisions of texts. These elements populate such classes as |
1084 | STBTC | , and typically form the largest component units of a text. |
1091 | STBTC | , either directly or by means of other classes such as |
1105 | STBTC | means any string of characters, and can apply to individual words, parts of words, and groups of words indifferently; it does not refer only to linguistically-motivated phrasal units. This may cause confusion for readers accustomed to applying the word in a more restrictive sense. |
1109 | STBTC | The TEI also identifies two further groupings derived from these three: |
1121 | STBTC | classes but rather a distinct grouping of elements which are both chunk-like and phrase-like. However, the classes |
1132 | STBTC | elements which can appear directly within texts or text divisions; this is a combination of the inter- and chunk- level elements defined above. These elements populate the class |
1134 | STBTC | , which is defined as a superset of the classes |
1142 | STBTC | Broadly speaking, the front, body, and back of a text each comprises a series of components, optionally grouped into divisions. |
1144 | STBTC | As noted above, some elements do not belong to any model class, and some model classes are not readily associated with any of the above informal groupings. However, over two-thirds of the |
1145 | STBTC | elements defined in the present edition of these Guidelines are classified in this way, and future editions of these recommendations will extend and develop this classification scheme. |
1147 | STBTC | A complete alphabetical list of all model classes is provided in |
1269 | STmacros | The infrastructure module defined by this chapter also declares a number of |
1271 | STmacros | , or shortcut names for frequently occurring parts of other declarations. Macros are used in two ways in the TEI scheme: to stand for frequently-encountered content models, or parts of content models ( |
1278 | STECST | As far as possible, the TEI schemas use the following set of frequently-encountered content models to help achieve consistency among different elements. |
1290 | STECST | The present version of the TEI Guidelines includes some |
1292 | STECST | shows, in descending order of frequency, the seven most commonly used content models. |
1306 | DTYPES | The values which attributes may take in a TEI schema are defined, for the most part, by reference to a TEI |
1307 | DTYPES | datatype |
1308 | DTYPES | . Each such datatype is defined in terms of other primitive datatypes, derived mostly from |
1310 | DTYPES | , literal values, or other datatypes. This indirection makes it possible for a TEI application to set constraints either globally or in individual cases, by redefining the datatype definition or the reference to it respectively. In some cases, the TEI datatype includes additional usage constraints which cannot be enforced by existing schema languages, although a TEI-compliant processor should attempt to validate them (see further discussion in chapter |
1313 | DTYPES | Where literal values or name tokens are used in a datatype definition, an associated value list supplies definitions for the significance of suggested or (in the case of closed lists) all possible values. |
1316 | DTYPES | TEI-defined datatypes may be grouped into those which define normalized values for numeric quantities, probabilities, or temporal expressions, those which define various kinds of shorthand codes or keys, and those which define pointers or links. |
1330 | DTYPES | datatype include |
1377 | DTYPES | in the case of durations, times, and dates; W3C Schema datatypes in the case of truth values; BCP 47 in the case of language; and ISO 5218 in the case of sex. |
1410 | DTYPES | By far the largest number of TEI attributes take values which are coded values or names of some kind. These values may be constrained or defined in a number of different ways, each of which is given a different name, as follows: |
1431 | DTYPES | , are used to supply an identifier expressed as any kind of single token or word. The TEI places a few constraints on the characters which may be used for this purpose: only Unicode characters classified as letters, digits, punctuation characters, or symbols can appear in an attribute value of this kind. Note in particular that such values cannot include whitespace characters. Legal values include |
1445 | DTYPES | Where identifiers are defined externally, for example as part of a database or file system, the inability to include whitespace or other special characters in a value may be problematic. In other cases, it may also be simply more convenient to supply a short sequence of natural language words including spaces as a single value. For these reasons, we also provide a datatype |
1459 | DTYPES | . This datatype should be used with care since XML will not normalize whitespace characters within it: for example the values |
1463 | DTYPES | (three spaces) would be considered distinct. This case should be distinguished from that of an attribute permitting multiple values, each of which may be separated by whitespace which |
1472 | DTYPES | , but with the additional constraint that they must be legal XML identifiers, as defined by the XML 1.0 specification, or successors. Hence, they may not begin with digits or punctuation characters. Legal identifiers include |
1494 | DTYPES | supplied by |
1498 | DTYPES | above, with the added constraint that the word supplied is taken from a specific list of possibilities. In each case, the element or class specification which includes the definition for the attribute will also contain a list of possible values, together with a prose description of their intended significance. This list may be open (in which case the list is advisory), or closed (in which case it determines the range of legal values). In this latter case, the datatype will not be |
1500 | DTYPES | , but an explicit list of the possible values. |
1515 | DTYPES | An attribute may, of course, take more than one value of a given type, for example a list of pointer values, or a list of words. In the TEI scheme, this information is regarded as a property of the |
1517 | DTYPES | element used to document the attribute in question rather than as a distinct |
1518 | DTYPES | datatype |
1525 | STOV | The TEI Infrastructure Module |
1529 | STOV | module defined by this chapter is a required component of any TEI schema. It provides declarations for all datatypes, and initial declarations for the attribute classes, model classes, and macros used by other modules in the TEI scheme. Its components are listed below in alphabetical order: |
1531 | tei | TEI Infrastructure |
1533 | tei | Declarations for classes, datatypes, and macros available to all TEI modules |
1547 | STOV | The order in which declarations are made within the infrastructure module is critical, since several class declarations refer to others, which must therefore precede them. Other constraints on the order of declarations derive from the way in which the modularity of the TEI scheme is implemented in different schema languages. The XML DTD fragment implementing this TEI module makes extensive use of |
1551 | STOV | to effect a kind of conditional construction; the RELAX NG schema fragment similarly predeclares a number of patterns with null ( |
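
The whitespace normalization described in the STGAxs rows above for mixed-content elements (convert tab, carriage return, and linefeed to spaces; collapse runs of spaces to one; delete space immediately after a start-tag or immediately before an end-tag) can be sketched in Python as follows. This is only an illustrative sketch, not a TEI-defined algorithm or tool: the function names `normalize_mixed` and `flatten` and the sample markup are hypothetical, every element is treated as mixed content (element-only content models are not special-cased), and an element carrying `xml:space="preserve"` is simply left untouched together with its descendants.

```python
import re
import xml.etree.ElementTree as ET

# Clark-notation key under which ElementTree exposes the xml:space attribute.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# One or more of the four XML whitespace characters.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def _collapse(text):
    """Replace every run of whitespace with a single space (None -> '')."""
    return _WS_RUN.sub(" ", text or "")


def normalize_mixed(elem):
    """Normalize whitespace in elem and its descendants, in place.

    Assumption: all elements are mixed content; subtrees marked
    xml:space="preserve" are skipped entirely.
    """
    if elem.get(XML_SPACE) == "preserve":
        return elem
    children = list(elem)
    # elem.text follows elem's start-tag, so leading space is deleted.
    text = _collapse(elem.text).lstrip(" ")
    if not children:
        # No children: the same text also precedes elem's end-tag.
        text = text.rstrip(" ")
    elem.text = text
    for index, child in enumerate(children):
        normalize_mixed(child)
        tail = _collapse(child.tail)
        if index == len(children) - 1:
            # The last child's tail precedes elem's end-tag.
            tail = tail.rstrip(" ")
        child.tail = tail
    return elem


def flatten(xml_source):
    """Parse, normalize, and return the concatenated text content."""
    root = normalize_mixed(ET.fromstring(xml_source))
    return "".join(root.itertext())


# Hypothetical sample, not taken from the Guidelines.
sample = "<p>\n  <hi>Hello</hi>\n  <hi>big</hi>  world,\t<name>  TEI </name>\n</p>"
print(flatten(sample))
```

Working on the parsed tree rather than on the raw string keeps the "after a start-tag" and "before an end-tag" cases explicit: in ElementTree, an element's `text` always follows its own start-tag, and the last child's `tail` always precedes the parent's end-tag.
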
# | id | text |
---|---|---|
4 | FM1 | This publication constitutes the fifth distinct version of the |
6 | FM1 | , and the first complete revision since the appearance of P3 in 1994. It includes substantial amounts of new material and a major revision of the underlying technical infrastructure. With this version, the Guidelines enter a new stage in their development as a community-maintained open source project. This edition is the first version to have benefitted from the close overview and oversight of an elected TEI Technical Council. The editors are therefore particularly pleased to acknowledge with gratitude the hard work and dedication put into this project by the Council over the last five years. |
8 | FM1 | The Chair of the TEI Board sits on the Technical Council, and the Board appoints the Chair of the Technical Council and one other member of the Council. Other Council members are all elected by the Consortium membership, and serve periods of up to two years at a time. The names and affiliations of all Council members who served during the production of this edition of the Guidelines are listed below. |
40 | FM1 | Members Appointed by the TEI Board |
144 | FM1 | The bulk of the Council's work has been carried out by email and by regular telephone conference. In addition, the Council has held many two-day face-to-face meetings. During production of P5, these meetings were generously hosted by the following institutions: |
181 | FM1 | During the production of TEI P5, the Council chartered a number of smaller workgroups and similar activities, each of which made a significant contribution to the intellectual content of the work. Active members of these are listed below: |
186 | FM1 | Active between July 2001 and January 2005, this group revised and developed the recommendations now forming chapters |
194 | FM1 | Active between February 2003 and February 2005, this group developed the material now forming |
201 | FM1 | Active between February 2002 and January 2006, this group reviewed and expanded the material now largely forming part of |
207 | FM1 | Active between February 2003 and December 2005, this group reviewed and finalised the material now forming |
208 | FM1 | . It was chaired by Matthew Driscoll and comprised David Birnbaum and Merrillee Proffitt, in addition to the TEI Editors. |
213 | FM1 | Active between January 2006 and May 2007, this group formulated the new material now forming part of |
220 | FM1 | Active between January 2003 and August 2007, this group reviewed the material now presented in |
224 | FM1 | From 2000 to 2008 the TEI had two appointed Editors, Lou Burnard (University of Oxford) and Syd Bauman (Brown University), who served |
225 | FM1 | ex officio |
228 | FM1 | The Council also oversees an Internationalization and Localization project, led by Sebastian Rahtz and with funding from the ALLC. This activity, ongoing since October 2005, is engaged in translating key parts of the P5 source into a variety of languages. |
255 | FM1 | Anyone who works closely with the TEI Guidelines, whether as translator, editor, or reader, is constantly reminded of the ambitious scope and exceptionally high editorial standards set by the original project, now approaching twenty years ago. It is appropriate therefore to retain a sense of the history of this document, as it has evolved since its first appearance in 1990, and to acknowledge with gratitude the contributions made to that evolution by very many individuals and institutions around the world. The original prefatory notes to each major edition of the Guidelines recording these names are therefore preserved in an appendix to the current edition (see |
# | id | text |
---|---|---|
5 | ND | it was noted that the elements provided in the core module allow an encoder to specify that a given text segment is a proper noun, or a |
6 | ND | referring string |
7 | ND | , and to specify the kind of object named or referred to only by supplying a value for the |
11 | ND | This module also provides elements for the representation of information about the person, place, or organization to which a given name is understood to refer and to represent the name itself, independently of its application. In simple terms, where the core module allows one simply to represent that a given piece of text is a |
12 | ND | name |
14 | ND | personal name |
16 | ND | person |
18 | ND | canonical name |
23 | ND | ), place names (section |
35 | NDATTS | have specialized attributes which support linkage of a naming element with the entity (person, place, organization) being named; members of the class |
37 | NDATTS | have specialized attributes which support a number of ways of normalizing the date or time of the data encoded by the element concerned. |
46 | NDATTSnr | As discussed elsewhere, these attributes provide two different ways of associating any sort of name with its referent. For cases where all that is required is to provide some minimal information about the person named, for example their occupation or status, the |
50 | NDATTSnr | attribute. It also provides an additional attribute, which allows the name itself to be associated with a base or canonical form: |
57 | NDATTSnr | attribute should be used wherever it is possible to supply a direct link such as a URI to indicate the location of canonical information about the referent. |
71 | NDATTSnr | More than one URI may be supplied if the name refers to more than one person. For example, assuming the existence of another |
85 | NDATTSnr | attribute is provided for cases where no such direct link is required: for example because resolution of the reference is carried out by some local convention, or because the encoder judges that no such resolution is necessary. As an example of the first case, a project might maintain its own local database system containing canonical information about persons and places, each entry in which is accessed by means of some system-specific identifier constructed in a project-specific way from the value supplied for the |
89 | NDATTSnr | a similar method is used to link element descriptions to the modules or classes to which they belong, for example. |
90 | NDATTSnr | As an example of the second case, consider the use of well-established codifications such as country or airport codes, which it is probably unnecessary for an encoder to expand further: |
98 | NDATTSnr | , interchange is improved by use of tag URIs in |
106 | NDATTSnr | attribute has a more specialized use, where it is the name itself which is of interest rather than the person, place, or organization being named. See section |
129 | NDATTSda | attribute is used to specify a normalized form for any temporal expression, independently of how it is represented in the text, as in the following example: |
138 | NDATTSda | attribute provides a convenient way of associating an event or date with a named period. Its value is a pointer which should indicate some other element where the period concerned is more precisely defined. A convenient location for such definitions is the |
144 | NDATTSda | of a TEI Header. A |
146 | NDATTSda | may contain simply a bibliographic reference to an external definition for it. More usefully, it may also contain a series of |
148 | NDATTSda | elements, each with an identifier and a description. The identifier can then be used as the target for a |
150 | NDATTSda | attribute. For example, a taxonomy of named periods might be defined as follows: |
186 | NDATTSda | The other dating attributes provided by this class support a wide range of methods of specifying temporal information in a normalized form. Some simple examples follow: |
204 | NDATTSda | Normalization of date and time values permits the efficient processing of data (for example, to determine whether one event precedes or follows another). These examples all use the W3C standard format for representation of dates and times. Further examples, and discussion of some alternative approaches to normalization are given in section |
214 | NDPER | The core |
218 | NDPER | elements can distinguish names in a text but are insufficiently powerful to mark their internal components or structure. To conduct nominal record linkage or even to create an alphabetically sorted list of personal names, it is important to distinguish between a family name, a forename and an honorary title. Similarly, when confronted with a string such as |
220 | NDPER | , the analyst will often wish to distinguish amongst the various constituent elements present, since they provide additional information about the status, occupation, or residence of the person to whom the name belongs. The following elements are provided for these and related purposes: |
225 | NDPER | attributes mentioned above, all of the above elements are members of the class |
234 | NDPER | element irrespective of whether or not the components of the personal name are also to be marked. |
238 | NDPER | name type="person" |
241 | NDPER | attribute allows for further subcategorization of the personal name itself, for example as a |
244 | NDPER | birth |
277 | NDPER | elements because distinctive name components occurring within it can be marked as such. |
280 | NDPER | surname |
281 | NDPER | and additional personal names, often known as |
311 | NDPER | elements to provide further culture- or project-specific detail about the name component, for example: |
340 | NDPER | attribute are not constrained, and may be chosen as appropriate to the encoding needs of the project. They may be used to distinguish different kinds of forename or surname, as well as to indicate the function a name component fills within the whole. In this example, we indicate that a surname is toponymic, and also point to the specific place name from which it is derived: |
353 | NDPER | The value |
355 | NDPER | was suggested above for the not uncommon case where the whole of a surname is composed of several other surname elements. These nested surnames may be individually tagged as well, together with appropriate type values: |
369 | NDPER | attribute may be used to indicate whether a name is an abbreviation, initials, or given in full: |
403 | NDPER | Alternatively, it may be felt more appropriate to mark a patronymic as a distinct kind of name, neither a forename nor a surname, using the |
429 | NDPER | class; its effect is to state the sequence in which |
433 | NDPER | elements should be combined when constructing a sort key for the name. |
471 | NDPER | It is also often convenient to distinguish phrases (historically similar to the generational labels mentioned above) used to link parts of a name together, such as |
477 | NDPER | etc. It is often a matter of arbitrary choice whether such components are regarded as part of the surname or not; the |
499 | NDPER | elements are used to mark all name components other than those already listed. The distinction between them is that a |
501 | NDPER | encloses an associated name component such as an aristocratic or official title which exists in some sense independently of its bearer. The distinction is not always a clear one. As elsewhere, the |
506 | NDPER | An inherited or life-time title of nobility such as |
515 | NDPER | An academic or other honorific prefixed to a name e.g. |
542 | NDPER | role |
543 | NDPER | a person has in a given context (such as |
544 | NDPER | witness |
549 | NDPER | element, since this is intended to mark roles which function as part of a person's name, not the role of the person bearing the name in general. Information about roles, occupations, etc. of a person is encoded within the |
588 | NDPER | A name may have any combination of the above elements: |
606 | NDPER | Although highly flexible, these mechanisms for marking personal name components will not cater for every personal name, nor for every processing need. Where the internal structure of personal names is highly complex or where name components are particularly ambiguous, feature structures are recommended as the most appropriate mechanism to mark and analyze them, as further discussed in chapter |
609 | NDPER | White space is allowed and therefore significant between elements within |
631 | NDORG | In these Guidelines, we use the term |
633 | NDORG | for any named collection of people regarded as a single unit. Typical examples include institutions such as |
645 | NDORG | . Giving a loosely-defined group of individuals a name often serves a particular political or social agenda and an analysis of the way such phrases are constructed and used may therefore be of considerable importance to the social historian, even where the objective existence of an |
647 | NDORG | in this sense is harder to demonstrate than that of (say) a named person. In the case of businesses or other formally constituted institutions, the component parts of an organizational name may help to characterize the organization in terms of its perceived geographical location, ownership, likely number of employees, management structure, etc. |
656 | NDORG | This element is a member of the same attribute classes as |
663 | NDORG | element may be used to mark up any form of organizational name: |
690 | NDORG | attribute should be used to characterize the name (rather than the organization), for example as an acronym: |
716 | NDORG | The components of an organization's name may include place names as well as personal names: |
724 | NDORG | or role names: |
760 | NDPLAC | Like other proper nouns or noun phrases used as names, place names can simply be marked up with the |
764 | NDPLAC | element. For cartographers and historical geographers, however, the component parts of a place name provide important information about the relation between the name and some spot in space and time. They also provide important evidence in historical linguistics. |
766 | NDPLAC | These Guidelines distinguish three ways of referring to places. A place name (represented using the |
769 | NDPLAC | ). A place named simply in terms of geographical features such as mountains or rivers is represented using the |
772 | NDPLAC | ). Finally, an expression consisting of phrases expressing spatial or other kinds of relationship between other kinds of named place may itself be regarded as a way of referring to a place, and hence as a kind of named place (see section |
785 | NDPLAC | mentioned above. These attributes are primarily useful as a means of linking a place name with information about a specific place. Recommendations for the encoding of information about a place, as distinct from its name, are provided in |
794 | NDPLAC | name type="place" |
796 | NDPLAC | rs type="place" |
798 | NDPLAC | Strictly, a suitable value such as |
800 | NDPLAC | should be added to the two place names which are presented periphrastically in the second version of this example. This would preserve the distinction indicated by the choice of |
827 | NDPLGU | A place name may contain text with no indication of its internal structure: |
829 | NDPLGU | More usually however, a place name of this kind will be further analysed in terms of its constitutive geo-political or administrative units. These may be arranged in ascending sequence according to their size or administrative importance, for example: |
845 | NDPLGU | class, members of which may be used anywhere that text is permitted, including within each other as in the following examples: |
924 | NDPLGF | element for this component of the name and then point to it using the |
932 | NDPLR | All the place name specifications so far discussed are |
934 | NDPLR | , in the sense that they define only one place. A place may however be specified in terms of its relationship to another place, for example |
939 | NDPLR | relative place names |
940 | NDPLR | will contain a place name which acts as a referent (e.g. |
944 | NDPLR | ). They will also contain a word or phrase indicating the position of the place being named in relation to the referent (e.g. |
948 | NDPLR | ). A distance, possibly only vaguely specified, between the referent place and the place being indicated may also be present (e.g. |
954 | NDPLR | Relative place names may be encoded using the following elements in combination with either a |
959 | NDPLR | Some examples of relative place names are: |
995 | NDPLR | The internal structure of place names is like that of personal names—complex and subject to an enormous amount of variation across time and different cultures. The recommendations in this section should however be adequate for a majority of users and applications; they may be extended using the mechanisms described in chapter |
996 | NDPLR | to add new elements to the existing classes. When the focus of interest is on the name components themselves, as in place name studies for example, the elements discussed in |
1019 | NDPERS | This module defines a number of special purpose elements which can be used to mark up biographical, historical, and prosopographical data. We envisage a number of users and uses for these elements. For example, an encoder may be interested in creating or converting a set of biographical records, for example of the type found in a Dictionary of National Biography. Another use is the creation or conversion of a database-like collection of information about a group of people, such as the people referenced in a marked-up collection of documents, or persons who have served as informants in the creation of spoken corpora. It is also appropriate to use these elements to register information relating to those who have taken part in the creation of a TEI document. |
1021 | NDPERS | To cater for this diversity, these Guidelines propose a flexible strategy, in which encoders may choose for themselves the approach appropriate to their needs. If one were interested, for example, in converting existing DNB-type records, and wanted to preserve the text as is, the |
1024 | NDPERS | ) could simply contain the text of an article, placed within |
1030 | NDPERS | to mark up features of that text. For a more structured entry, however, one would extract the data and place information contained in the text, and encode it directly using the more specific elements described in this section. |
1035 | NDPERSbp | Information about people, places, and organizations, of whatever type, essentially comprises a series of statements or assertions relating to: |
1039 | NDPERSbp | which do not, by and large, change over time |
1043 | NDPERSbp | which hold true only at a specific time |
1046 | NDPERSbp | or incidents which may lead to a change of state or, less frequently, trait. |
1052 | NDPERSbp | are typically independent of an individual's volition or action and can be either physical, such as sex or hair and eye colour, or cultural, such as ethnicity, caste, or faith. The distinction is not entirely straightforward, however: while sex is fairly obviously a physical trait, gender should rather be regarded as culturally determined, and the division of mankind into different |
1054 | NDPERSbp | , proposed by early (white European) anthropologists on the basis of physical characteristics such as skin colour, hair type and skull measurements, is now considered to be more a social or mental construct. Furthermore, while some characteristics will obviously change over time, hair colour for example, none, in principle—not even sex—is immutable. |
1057 | NDPERSbp | include, for example, marital status, place of residence and position or occupation. Such states have a definite duration, that is, they have a beginning and an end and are typically a consequence of the individual's own action or that of others. |
1060 | NDPERSbp | changes in state |
1061 | NDPERSbp | are meant the events in a person's life such as birth, marriage, or appointment to office; such events will normally be associated with a specific date or a fairly narrow date-range. Changes in states can also cause or be caused by changes in characteristics. Any statement or assertion on any of these aspects of a person's life will be based on some source, possibly multiple sources, possibly contradictory. Taking all this into account it follows that each such statement or assertion needs to be able to be documented, put into a time frame and be relatable to other statements or assertions of the same or any of the other types. |
1063 | NDPERSbp | The elements defined by the module described in this chapter may, for the most part, all be regarded as specializations of one or other of the above three classes. Generic elements for state, trait, and event are also defined: |
1076 | NDPERSE | Information about a person, as distinct from references to a person, for example by name, is grouped together within a |
1078 | NDPERSE | element. Information about a group of people regarded as a single entity (for example |
1082 | NDPERSE | element. Note however that information about a group of people with a distinct identity (for example a named theatrical troupe) should be recorded using the |
1097 | NDPERSE | elements may be supplied within the |
1101 | NDPERSE | element of a TEI header (see |
1104 | NDPERSE | can also appear within the body of a text when the module defined by this chapter is included in a schema. |
1130 | NDPERSE | element carries several attributes. As a member of the classes |
1141 | NDPERSE | In addition, a small number of very commonly used personal properties may be recorded using attributes specific to |
1149 | NDPERSE | These attributes are intended for use where only a small amount of data is to be encoded in a more or less normalized form, possibly for many person elements, for example when encoding basic facts about respondents to a questionnaire. When however a more detailed encoding is required for all kinds of information about a person, for example in a historical gazetteer, then it will be more appropriate to use the elements |
1157 | NDPERSE | attribute is not intended to record the person's age expressed in years, months, or other temporal unit. Rather it is intended to record into which age bracket, for the purposes of some analysis, the person falls. A simple (perhaps too simple to be useful) binary classification of age brackets would be |
1161 | NDPERSE | . The actual age brackets useful to various projects are likely to be varied and idiosyncratic, and thus these Guidelines make no particular recommendation as to possible values. Instead, individual projects are recommended to define the values they use in their own customization file, using a declaration like the following: |
1201 | NDPERSE | element may contain many sub-elements, each specifying a different property of the person being described. The remainder of this section describes these more specific elements. For convenience, these elements are grouped into three classes, corresponding with the tripartite division outlined above: one for traits, one for states and one for events. Each class contains both specific elements for common types of biographical information, and a generic element for other, user-defined, types of information. |
1203 | NDPERSE | All the elements in these three classes belong to the attribute class |
1234 | NDPERSEpc | , allow content of ordinary prose containing phrase-level elements. |
1241 | NDPERSEpc | The meanings of concepts such as sex, nationality, or age are highly culturally-dependent, and the encoder should take particular care to be explicit about any assumptions underlying their usage of them. For example, when recording personal age in different cultures, there may be different assumptions about the point from which age is reckoned. A statement of the practice adopted in a given encoding may usefully be provided in the |
1248 | NDPERSEpc | element contains either paragraphs or a number of |
1253 | NDPERSEpc | tag |
1254 | NDPERSEpc | s for the languages. The |
1258 | NDPERSEpc | attribute, which indicates the language with the same kind of |
1259 | NDPERSEpc | language tag |
1261 | NDPERSEpc | language tags |
1291 | NDPERSEpc | attribute to give values from a project-internal taxonomy, or an external standard, such as vCard's sex property |
1317 | NDPERSEpc | As elsewhere, these coded values may be used as an alternative to or normalization of the actual descriptive text contained in the element. The previous example might equally well be given as |
1330 | NDPERSEpc | These elements can be used to extend the range of information supplied about an individual's personal characteristics. Either may contain an optional |
1332 | NDPERSEpc | element, used to provide a human-readable specification for the characteristic concerned and a description of the feature itself supplied within a |
1354 | NDPERSEpc | These elements are provided as a simple means of extending the set of descriptive features available in a standardized way. For example, there are no predefined elements for such features as eye or hair colour. If these are to be recorded, they may simply be added as new types of trait: |
1370 | NDPERSEpc | If none of the more specialized elements listed above is appropriate, then a choice must be made between the two generic elements |
1378 | NDPERSEpc | for the latter. It may also be helpful to note that traits are typically, but not necessarily, independent of the volition or action of the holder. If the distinction between state and trait is not considered relevant or useful, use |
1384 | NDPERSEpc | element is repeatable and can, like all TEI elements, take the attribute |
1386 | NDPERSEpc | to indicate the language of the content of the element, as well as a |
1388 | NDPERSEpc | attribute to indicate the type of name, whether a nickname, maiden or birth name, alternative form, etc. This is useful in cases where, for example, a person is known by a Latin name and also by any number of vernacular names, many or all of which may have claims to |
1390 | NDPERSEpc | . In order to ensure uniformity, the method generally employed in the library world has been to accept the form found in some authority file, for example that of the American Library of Congress, as the |
1396 | NDPERSEpc | an overtly foreign form of the name of their local saint or hero. Within the |
1398 | NDPERSEpc | element any number of variant forms of a name can be given, with no prioritization, and hence less likelihood of offence. The Icelandic scholar and manuscript collector Árni Magnússon, to give his name in standard modern Icelandic spelling, is known in Danish as Arne Magnusson, the form which he himself, as a long-term resident of Denmark, generally used; there is also a Latinized form, Arnas Magnæus, which he used in his scholarly writings. All three forms can be given, and in any order: |
1410 | NDPERSEpc | At the other extreme, a person may be named periphrastically as in the following example: |
1484 | NDPERSEpe | has a similar content model to that of |
1490 | NDPERSEpe | element to identify the name of the place where the event occurred. It is used to describe any event in the life of an individual or organization. |
1492 | NDPERSEpe | In the following example, we give a brief summary of the wedding of Jane Burden to the English writer, designer, and socialist William Morris, encoded as an |
1496 | NDPERSEpe | element used to record data about Morris, though we could equally well have embedded the event within the |
1568 | NDPERSEpe | elements point either to an external source or to a |
1570 | NDPERSEpe | element within which other information about the person named may be found. As further discussed below ( |
1573 | NDPERSEpe | element may then be used to link them in a more meaningful way: |
1580 | NDPERSEpe | As mentioned above, all these elements, both the specific and the generic, are members of the |
1582 | NDPERSEpe | attribute class, which means they can be limited in terms of time. The following encoding, for example, demonstrates that the person named David Jones changed his name in 1966 to David Bowie: |
1596 | NDPERSEpe | classes. These classes make available the attributes |
1604 | NDPERSEpe | , a pointer to a resource from which the information derives. In this way it is possible, in the case of multiple and conflicting sources, to provide more than one view of what happened, as in the following example: |
1626 | NDPERSREL | attributes in the usual way. The value specified for either attribute on a |
1634 | NDPERSREL | , as defined here, may be any kind of describable link between specified participants. A participant (in this sense) might be a person, a place, or an organization. In the case of persons, therefore, a relationship might be a social relationship (such as employer/employee), a personal relationship (such as sibling, spouse, etc.) or something less precise such as |
1640 | NDPERSREL | relationship); or it may not be if participants are not identical with respect to their role in the relationship (for example, the |
1642 | NDPERSREL | relationship). For non-mutual relationships, only two kinds of role are currently supported; they are named |
1648 | NDPERSREL | , in the sense that they are most readily described by a transitive verb, or a verb phrase of the form |
1687 | NDPERSREL | This example defines the relationships amongst a number of people not further described here; we assume however that each person has been allocated an identifier such as |
1695 | NDPERSREL | , etc. Then the above set of |
1729 | ND-org | elements discussed elsewhere in this chapter, that is to provide a unique wrapper element for information about an entity, distinct from references to that entity which are typically encoded using a naming element such as |
1730 | ND-org | name type="org" |
1733 | ND-org | . The content of a naming element will represent the way an organization is named in a given context; the content of an |
1737 | ND-org | An organization is not the same thing as a list or group of people because it has an identity of its own. That identity may be expressed solely in the existence of a name (for example |
1739 | ND-org | ), but is likely to consist in the combination of that name with a number of events, traits, or states which are considered to apply to the organization itself, rather than any of its members. For example, a sports team might be described in terms of its membership (a |
1743 | ND-org | ), its geographical affiliation (a |
1747 | ND-org | attribute. However, it is the name of the sports team alone which identifies it. |
1749 | ND-org | The content model for |
1776 | ND-org | The names of the people making up an organization can also change over time (if they are known at all). For example: |
1843 | ND-org | element to group together a number of |
1906 | NDGEOG | we discuss various ways of naming places such as towns, countries, etc. In much the same way as these Guidelines distinguish between the encoding of names for people and the encoding of other data about people, so they also distinguish between the encoding of names for places and the encoding of other data about places. In this section we present elements which may be used to record in a structured way data about places of any kind which might be named or referenced within a text. Such data may be useful as a way of normalizing or standardizing references to particular places, as the raw material for a gazetteer or similar reference document associated with a particular text or set of texts, or in conjunction with any form of geographical information system. |
1916 | NDGEOG | class contains elements describing characteristics of a place which have a definite duration, such as its name. Any member of the |
1924 | NDGEOG | For example, the modern city of Lyon in France was in Roman times known as Lugdunum. Although the modern and the Roman city are not physically co-extensive, they have significant areas which overlap, and we may therefore wish to regard them as the same place, while supplying both names with an indication of the time period during which each was current. |
1926 | NDGEOG | A place is defined, however, by its physical location, which does not typically change over time. Locations may be specified in a number of ways: as a set of coordinates defining a point or an area on the surface of the earth, or by providing a description of how the place may be found, usually in terms of other place names. For example, we can identify the location of the Canadian city of London, either by specifying its latitude and longitude, or by specifying that we mean the city called London located in the province called Ontario within the country called Canada. |
1928 | NDGEOG | In addition we may wish to supply a brief characterization of the place identified, for example to state that it is a city, an administrative area such as a country, or a landmark of some kind such as a monument or a battlefield. If our typology of places is simple, the open-ended |
1931 | NDGEOG | place type="city" |
1933 | NDGEOG | place type="battlefield" |
1938 | NDGEOG | element, the following elements may be used to provide more information about specific aspects of the place in a structured form: |
1946 | NDGEOGva | A location may be specified in one or more of the following ways: |
1948 | NDGEOGva | by supplying a string representing its coordinates in some standardized way within a |
1952 | NDGEOGva | by supplying one or more place name component elements (e.g. |
1956 | NDGEOGva | etc.) to place it within a geo-political context |
1970 | NDGEOGva | The simplest method of specifying a location is by means of its geographic coordinates, supplied within the |
1974 | NDGEOGva | ) used for the coordinate system itself. The default recommended by these Guidelines is to supply a string containing two real numbers separated by whitespace, of which the first indicates latitude and the second longitude according to the 1984 World Geodetic System (WGS84); this is the system currently used by most GPS applications which TEI users are likely to encounter. |
1977 | NDGEOGva | We might therefore record the information about the place known as |
1991 | NDGEOGva | Identifying Lyon by its geo-political status as a settlement within a country forming part of a larger political entity, we might represent the same |
1992 | NDGEOGva | place |
2014 | NDGEOGva | We may use the same procedure to represent the location of smaller places, such as a street or even an individual building: |
2031 | NDGEOGva | attribute to categorize more precisely both the kind of place concerned (a building) and the kind of name used to locate it, for example by characterizing the generic |
2053 | NDGEOGva | sometimes resembles a set of instructions for finding a place, rather than a name: |
2073 | NDGEOGva | may also be used to identify a location in terms of its postal or other address: |
2095 | NDGEOGva | When, as here, the same place is given multiple locations, the |
2097 | NDGEOGva | attribute should be used to characterize the kind of location, as a means of indicating that these are alternative ways of identifying the same place, rather than that the place is spread across several locations. |
2101 | NDGEOGva | element may thus identify a place to a greater or lesser degree of precision, using a variety of means: a name, a set of names, or a set of coordinates. The |
2103 | NDGEOGva | element introduced earlier is by default understood to supply a value expressed in a specific (and widely used) notation. If a |
2107 | NDGEOGva | , this is interpreted as being really the same place in the universe, but with different systems used to refer to it. If there is a lack of consensus about the location (of, for example, Camelot), more than one |
2113 | NDGEOGva | By default, the content of |
2117 | NDGEOGva | Firstly, the content of the |
2140 | NDGEOGva | In the following example, we have defined the location of the place |
2165 | NDGEOGva | to indicate the source of the location information. |
2181 | NDGEOGmp | A place may contain other places. This containment relation can be directly modelled in XML: thus we can say that the towns of Vilnius and Kaunas are both in a place called Lithuania (or Lietuva) as follows: |
2204 | NDGEOGmp | As a further example, the islands of Mauritius, Réunion, and Rodrigues are collectively known as the Mascarene Islands. Grouped together with Mauritius there are also several smaller offshore islands, with rather picturesque French names. These offshore islands do not however constitute an identifiable place as a whole. One way of representing this is as follows: |
2234 | NDGEOGmp | Here is a more complex example, showing the variety of names associated at different times and in different languages with a set of hierarchically grouped places—the settlement of Carmarthen Castle, within the town of Carmarthen, within the administrative county of Carmarthenshire, Wales. |
2277 | NDGEOGmp | place |
2284 | NDGEOGmp | elements should be distinguished from the (possibly simpler) case where a number of places with some property in common are being grouped together for convenience, for example, in a gazetteer. The |
2286 | NDGEOGmp | element is provided as a means of grouping places together where there is no implication that the grouped elements constitute a distinct place. For example: |
2322 | NDGEOGste | There are many different kinds of information which it might be considered useful to record for a place in addition to its name and location, and the categories selected are likely to be very project-specific. As with persons, therefore, these Guidelines make no claim to comprehensiveness in this context. Instead, the generic |
2330 | NDGEOGste | attribute. These are complemented by a small number of predefined elements of general utility: |
2339 | NDGEOGste | element. This element may be used for almost any kind of event in the life of a place; no specialized version of this element is proposed, nor do we attempt to enumerate the possible values which might be appropriate for the |
2456 | NDGEOGste | attribute are to be understood as cumulatively inherited, as elsewhere in the TEI scheme (for example on |
2462 | NDGEOGste | element concerns the squirrel population between the dates given. This is then broken down into red and gray squirrel populations, and within that into male and female: |
2480 | NDGEOGste | attribute: responsibility is not an additive property, and therefore an element either states it explicitly, or inherits it from its nearest ancestor. Dating is slightly different again, in that a child element may specify a date more precisely than its parent, as in the example above |
2482 | NDGEOGste | Events may also be subdivided into other events. For example, a two part meeting might be represented as follows: |
2500 | NDGEOGste | element is usually used to record information about a place, or a person; for this reason the element usually appears as content of a |
2504 | NDGEOGste | . However, it is also possible to describe events independently of either a person or a place. This may be useful in such applications as chronologies, lists of significant events such as battles, legislation, etc. |
2564 | place-rel | element may also be used to express relationships of various kinds between places, or between places and persons, in much the same way as it is used to express relationships between persons alone. Returning to the Mascarene Islands example cited above, we might define the island group and its constituents separately, but indicate the relationship by means of a |
2594 | place-rel | style of representation has the advantage that we can now also represent the fact that a place may be a |
2596 | place-rel | more than one other place; for example, Réunion is part of France, as well as part of the Mascarenes. If we add a declaration for France to the list above: |
2653 | NDNYM | So far we have discussed ways in which a name or referring string encountered in running text may be resolved by considering the object that the name refers to: in the case of a personal name, the name refers to a person; in the case of a place name, to a place, for example. The resolution of this reference is effected by means of the |
2675 | NDNYM | in Russian might all be regarded as existing independently of any person to which they are attached, and also independently of any variant forms that might be attested in different sources (such as Jon or Johnny in English, or Jehan or Jojo in French). We use the term |
2676 | NDNYM | nym |
2677 | NDNYM | to refer to the canonical or normalized form of a name regarded in such a way, and provide the following elements to encode it: |
2687 | NDNYM | to indicate the nym with which it corresponds. Thus, given the following |
2689 | NDNYM | for the name |
2699 | NDNYM | an occurrence of this name in running text might be encoded as follows: |
2705 | NDNYM | The person identified by this particular Tony may however be indicated independently using the |
2707 | NDNYM | attribute, either on the forename or on the whole name component: |
2726 | NDNYM | , etc. For example, we may show that the canonical form for a given nym has two orthographic variants in this way: |
2790 | NDNYM | element used here is provided by the TEI |
2792 | NDNYM | module, which would therefore also need to be included in a schema built to validate such markup. Other possibilities for more detailed linguistic analysis are provided by elements included in that and the |
2802 | NDNYM | might be regarded as a nym in its own right: |
2812 | NDNYM | Within running text, a name can specify all the nyms associated with it: |
2818 | NDNYM | is used to indicate its constituent parts, where these have been identified as distinct nyms: |
2828 | NDNYM | element may also combine a number of other |
2830 | NDNYM | elements together, where it is intended to show that they are all regarded as variations on the same root. Thus the different forms of the name John, all being derived from the same root, may be represented as a hierarchic structure like this: |
2898 | NDDATE | describes a date or time with reference to some other (absolute) temporal expression, and thus may contain an |
2934 | NDDATER | after the lamented death of the Doctor |
2937 | NDDATER | have two distinct components. As well as the absolute temporal expression or event to which reference is made (e.g. |
2942 | NDDATER | the death of the Doctor |
2947 | NDDATER | between the time or date which is indicated and the referent expression (e.g. |
2954 | NDDATER | offset |
2955 | NDDATER | describing the direction of the distance between the time or date indicated and the referent expression (e.g. |
2974 | NDDATER | offset |
3013 | NDDATER | and the cited date are parts of the same temporal expression, and hence to disambiguate the phrase |
3039 | NDDATER | Where more complex or ambiguous expressions are involved, and where it is desirable to make more explicit the interpretive processes required, the feature structure notation described in chapter |
3054 | NDDATER | ). It is used here to link the temporal phrase with an interpretation of it. Like most traditional fairs and market days, the Glasgow Fair was established by local custom and could vary from year to year. Consequently, in order to provide such an interpretation, it is necessary to draw upon additional information which may or may not be located in the particular text in question. In this case, it is necessary at least to know the spatial and temporal context (year and place) of the fair referred to. These and other features required for the analysis of this particular temporal expression may be combined together as one feature structure of type |
3081 | NDDATEA | It may be useful to categorize a temporal expression which is given in terms of a named event, such as a public holiday, or a named time such as |
3082 | NDDATEA | tea time |
3123 | NDDATEISO | The attributes for normalization of dates and times so far described use a standard format defined by |
3127 | NDDATEISO | . The full ISO standard provides formats not available in the W3C recommendation, for example, the capability to refer to a date by its ordinal date or week date, or to refer to a century. It also provides ways of indicating duration and range. |
3129 | NDDATEISO | When this module is included in a schema, the following additional attributes are provided: |
3133 | NDDATEISO | These attributes may be used in preference to their W3C equivalents when it is necessary to provide a normalized value in some form not supported by the W3C attributes. For example, a century date in the W3C format must be expressed as a range, using the |
3146 | NDDATEISO | , however, it is possible to express the same normalized value in any of the following additional ways: |
3170 | NDDATECUSTOM | All date-related encoding described above makes use of the Gregorian calendar, on which both the ISO and W3C datetime formats are based. However, historical texts often pre-date the invention of the Gregorian calendar in the 16th century, or its adoption in Europe over the following centuries, and many other calendars are used in texts from other cultures and contexts. Non-Gregorian dates can be encoded using methods described below. |
3172 | NDDATECUSTOM | First, a Calendar Description element needs to be supplied in the |
3199 | NDDATECUSTOM | element in the header which defines and describes the calendar used. |
3203 | NDDATECUSTOM | attribute is used to specify the calendar used in the |
3204 | NDDATECUSTOM | text content |
3211 | NDDATECUSTOM | etc. to provide more precise expressions of dates and times in a constrained and computable form, it is often necessary to express a date or a date-range from a non-Gregorian calendar in a more precise manner. The attributes whose names end in |
3215 | NDDATECUSTOM | is used to identify the calendar used in the content of these attributes: |
3224 | NDDATECUSTOM | attribute specifies the calendar used in the text content of the |
3228 | NDDATECUSTOM | attribute signifies that the calendar used in the |
3230 | NDDATECUSTOM | attribute is also Julian. The schema could be customized in order to constrain the content of custom attributes in a manner similar to the constraints provided on regular Gregorian dating attributes such as |
3236 | NDDATECUSTOM | , providing the Gregorian calendar equivalent of the Julian date: |
3259 | ND | The selection and combination of modules to form a TEI schema is described in |
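
The NDNYM rows above describe pointing from a name in running text to its canonical nym. The following is a minimal sketch of how such pointers might be resolved, assuming the pointing attribute is `nymRef` and the canonical form sits in a `form` child of `nym`. The sample fragment, its identifier `N123`, and the helper name `resolve_nyms` are illustrative assumptions and have not been validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: one nym definition and one name that points to it.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <listNym>
    <nym xml:id="N123"><form>Antony</form></nym>
  </listNym>
  <p><persName><forename nymRef="#N123">Tony</forename> <surname>Blair</surname></persName></p>
</body>
"""

def resolve_nyms(xml_text):
    """Return (text in source, canonical form) pairs for every nymRef pointer."""
    root = ET.fromstring(xml_text)
    # Index each nym by its xml:id, keeping the text of its first <form> child.
    nyms = {}
    for nym in root.iter(f"{{{TEI_NS}}}nym"):
        form = nym.find(f"{{{TEI_NS}}}form")
        nyms[nym.get(XML_ID)] = form.text if form is not None else None
    resolved = []
    for el in root.iter():
        ref = el.get("nymRef")
        if ref:
            # The attribute may hold several whitespace-separated pointers.
            for pointer in ref.split():
                resolved.append((el.text, nyms.get(pointer.lstrip("#"))))
    return resolved

print(resolve_nyms(SAMPLE))
```

Pointers that do not match any local nym resolve to `None` here; a real processor would probably also need to follow pointers into other documents.
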
# | id | text |
---|---|---|
4 | DS | This chapter describes the default high-level structure for TEI documents. A full TEI document combines metadata describing it, represented by a |
10 | DS | class, or the two in combination. This group of elements makes up a |
23 | DS | , is also defined for the representation of language corpora, or other collections of encoded texts. A |
33 | DS | . This permits the encoder to distinguish metadata applicable to the whole collection of encoded texts, which is represented by the outermost |
37 | DS | elements within the corpus. Further information about the organization and encoding of language corpora is given in chapter |
40 | DS | In summary, when the default structure module is included in a schema, the following elements are available for the representation of the outermost structure of a TEI document: |
51 | DS | ). A TEI document may also contain elements from the |
53 | DS | class (such as a collection of facsimile images, or a feature system declaration) if the appropriate module is included in a schema (see further |
61 | DS | are available as major parts of a TEI document. These three elements are provided by the |
70 | DS | TEI texts may be regarded either as |
74 | DS | that is, consisting of several components which are in some important sense independent of each other. The distinction is not always entirely obvious: for example a collection of essays might be regarded as a single item in some circumstances, or as a number of distinct items in others. In such borderline cases, the encoder must choose whether to treat the text as unitary or composite; each may have advantages and disadvantages in a given situation. |
76 | DS | Whether unitary or composite, the text is marked with the |
78 | DS | tag and may contain front matter, a text body, and back matter. In unitary texts, the text body is tagged |
80 | DS | ; in composite texts, where the text body consists of a series of subordinate texts or groups, it is tagged |
85 | DS | The overall structure of a unitary text is: |
102 | DS | The overall structure of a composite text made up of two unitary texts is: |
137 | DS | element is provided for the case where one text is embedded within another, but does not contribute to its hierarchical organization, for example because it interrupts it, or is simply quoted within it. This is useful in such common literary contexts as the |
157 | DS | elements, used for more complex or composite text structures, are further discussed in section |
159 | DS | , in the case of elements which can appear in any kind of document, or elsewhere in the case of elements specific to particular kinds of document. |
163 | DSDIV | In some texts, the body consists simply of a sequence of low-level structural items, referred to here as |
168 | DSDIV | ). Examples in prose texts include paragraphs or lists; in dramatic texts, speeches and stage directions; in dictionaries, dictionary entries. In other cases sequences of such elements will be grouped together hierarchically into textual divisions and subdivisions, such as chapters or sections. The names used for these structural subdivisions of texts vary with the genre and period of the text, or even at the whim of the author, editor, or publisher. For example, a major subdivision of an epic or of the Bible is generally called a |
176 | DSDIV | —unless it is an epistolary novel, in which case it may be called a |
178 | DSDIV | . Even texts which are not organized as linear prose narratives, or not as narratives at all, will frequently be subdivided in a similar way: a drama into |
202 | DSDIV | , etc., where the number indicates the depth of this particular division within the hierarchy, the largest such division being |
203 | DSDIV | div1 |
205 | DSDIV | div2 |
207 | DSDIV | div3 |
225 | DSDIV1 | , this element has the following additional attributes: |
228 | DSDIV1 | Using this style, the body of a text containing two parts, each composed of two chapters, might be represented as follows: |
266 | DSDIV2 | these elements all bear the following additional attributes: |
269 | DSDIV2 | The largest possible subdivision of the body is |
279 | DSDIV2 | Using this style, the body of a text containing two parts, each composed of two chapters, might be represented as follows: |
338 | DSDIV3 | The choice between numbered and un-numbered divisions will depend to some extent on the complexity of the material: un-numbered divisions allow for an arbitrary depth of nesting, while numbered divisions limit the depth of the tree which can be constructed. Where divisions at different levels should be processed differently (for example to ensure that chapters, but not sections, begin on a new page), numbered divisions slightly simplify the task of defining the desired processing for each level, though this distinction could also be made by supplying this information on the |
342 | DSDIV3 | . Some software may find numbered divisions easier to process, as there is no need to maintain knowledge of the whole document structure in order to know the level at which a division occurs; such software may, however, find it difficult to cope with some other aspects of the TEI scheme. On the other hand, in a collection of many works it may prove difficult or impossible to ensure that the same numbered division always corresponds with the same type of textual feature: a |
360 | DSDIV3 | class may be used to provide a name or description for the division. Typical values might be |
368 | DSDIV3 | , or (for verse texts) |
448 | DSDIV3 | ), etc. For example, suppose that the body of a text consists of a series of diary entries, each of which is potentially divided into entries for the morning and the afternoon. This might be represented in any of the following ways. First, using the un-numbered style: |
535 | DSDIV3X | (etc.) elements will be both complete and identically organized with reference to the original source. For some purposes however, in particular where dealing with unusually large or unusually small texts, encoders may find it convenient to present as textual divisions sequences of text which are incomplete with reference to the original text, or which are in fact an ad hoc agglomeration of tiny texts. Moreover, in some kinds of texts it is difficult or impossible to determine the order in which individual subdivisions should be combined to form the next higher level of subdivision, as noted below. |
537 | DSDIV3X | To overcome these problems, the following additional attributes are defined for all elements in the |
552 | DSDIV3X | represents a number for the chapter, and the |
554 | DSDIV3X | attribute takes the value |
556 | DSDIV3X | to indicate that this division is incomplete in some respect. Other possible values for this attribute indicate whether material has been omitted initially (I), finally (F), or in the middle (M) of the division, while the |
559 | DSDIV3X | ) may be used to indicate exactly where material has been omitted: |
568 | DSDIV3X | element in the TEI header should also be used to record the principles underlying the selection of incomplete samples, as further described in section |
604 | DSDIV3X | , are really quite independent of each other, although they are all marked as subdivisions of the whole group. They can be read in any order without affecting the sense of the piece; indeed, in some cases, divisions of this nature are printed in such a way as to make it impossible to determine the order in which they are intended to be read. Individual stories can be added or removed without affecting the existing components. |
611 | DSDTB | The divisions of any kind of text may sometimes begin with a brief heading or descriptive title, with or without a byline, an epigraph or brief quotation, or a salutation such as one finds at the start of a letter. They may also conclude with a brief trailer, byline, postscript, or signature. Many of these (e.g. a byline) may appear either at the start or at the end of a text division proper. |
613 | DSDTB | To support this heterogeneity, the TEI architecture defines five classes, all of which are populated by this module: |
635 | DSHD | Unlike some other markup schemes, the TEI scheme does |
655 | DSHD | is the sole member to include other such elements if required. |
657 | DSHD | In certain kinds of text (notably newspapers), there may be a need to categorize individual headings within the sequence at the start of a division, for example as |
700 | DSHD | may be longer than in modern works. When heading-like material appears in the middle of a text, the encoder must decide whether or not to treat it as the start of a new division. If the phrase in question appears to be more closely connected with what follows than with what precedes it, then it may be regarded as a heading and tagged as the |
706 | DSHD | often found in newspapers or magazines, then the |
740 | DSOC | In addition to headings of various kinds, divisions sometimes include more or less formulaic opening or closing passages, typically conveying such information as the name and address of the person to whom the division is addressed, the place or time of its production, a salutation or exhortation to the reader, and so on. Divisions in epistolary form are particularly liable to include such features. Additional elements for the detailed encoding of personal names, dates, and places are provided in chapter |
753 | DSOC | elements are used to encode headings which identify the authorship and provenance of a division. Although the terminology derives from newspaper usage, there is no implication that |
777 | DSOC | Where a sequence of such elements appears together, either at the beginning or end of an element, it may be convenient to group them together using one of the following elements: |
844 | DSAE | element may be used to encode the prefatory list of topics sometimes found at the start of a chapter or other division. It is most conveniently encoded as a list, since this allows each item to be distinguished, but may also simply be presented as a paragraph. The following are thus both equally valid ways of encoding the same argument: |
881 | DSAE | epigraph |
882 | DSAE | is a quotation from some other work, a saying, or a motto, appearing on a title page, or at the start of a division. It may be encoded using the special-purpose |
894 | DSAE | When an epigraph contains a quotation, this may often be associated with a bibliographic reference. In such cases, it is recommended additionally to group the quotation and its source together using the |
915 | DSAE | postscript |
916 | DSAE | is a passage added after the signature of a letter or, less frequently, the main portion of the body of a book, article, or essay. In English a postscript is often abbreviated as |
975 | DSCO | classes, every textual division (numbered or un-numbered) consists of a sequence of ungrouped |
978 | DSCO | ). The actual elements available will depend on the modules in use; in all cases, at least the component-level structural elements defined in the core will be available (paragraphs, lists, dramatic speeches, verse lines and line groups etc.). If the drama module has been selected, then other component- or phrase-level items specialized for performance texts (for example, cast lists or camera angles) (as defined in chapter |
979 | DSCO | ) will be available. If the dictionary module is in use, then dictionary entries, related entries, etc. (as defined in chapter |
980 | DSCO | ) will also be available; if the module for transcribed speech is in use, then utterances, pauses, vocals, kinesics, etc., as defined in chapter |
983 | DSCO | Where a text contains low-level elements from more than one module these may appear at any point; there is no requirement that elements from the same module be kept together. |
1004 | DSGRPF | should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes. The |
1007 | DSGRPF | should be used to represent an independent text which interrupts the text containing it at any point but after which the surrounding text resumes. |
1014 | DSGRP | element include anthologies and other collections. The presence of common front matter referring to the whole collection, possibly in addition to front matter relating to each individual text, is a good indication that a given text might usefully be encoded in this way; this structure may be found useful in other circumstances too. |
1016 | DSGRP | For example, the overall structure of a collection of short stories might be encoded as follows: |
1091 | DSGRP | A text which is a member of a group may itself contain groups. This is quite common in collections of verse, but may happen in any kind of text. As an example, consider the overall structure of a typical collection, such as the |
1093 | DSGRP | edition of Crashaw's poetry. Following a critical introduction and table of contents, this work contains the following major sections: |
1096 | DSGRP | (a collection of verse first published in 1648) |
1105 | DSGRP | I (a collection of fragments all taken from a single manuscript) |
1108 | DSGRP | II (a further collection of fragments, taken from a different manuscript) |
1111 | DSGRP | Each of the three collections published in Crashaw's lifetime has a reasonable claim to be considered as a text in its own right, and may therefore be encoded as such. It is rather more arbitrary whether the two posthumous collections should be treated as two groups, following the practice of the |
1113 | DSGRP | edition. An encoder might elect to combine the two into a single group or simply to treat each fragment as an ungrouped unitary text. |
1117 | DSGRP | edition reprints the whole of each of the three original collections, including their original front matter (title pages, dedications etc.). These should be encoded using the |
1120 | DSGRP | ), while the body of each collection should be encoded as a single |
1122 | DSGRP | element. Each individual poem within the collections should be encoded as a distinct |
1124 | DSGRP | element. The beginning of the whole collection would thus appear as follows (for further discussion of the use of the elements |
1237 | DSGRP | element may be used in this way to encode any kind of collection of which the constituents are regarded by the encoder as texts in their own right. Examples include anthologies or collections of verse or prose by multiple authors, florilegia, or commonplace books, journals, day books, etc. As a fairly typical example, we consider |
1254 | DSGRP | Each titled section listed above comprises a group of extracts or complete texts from writers of a given historical period, preceded by an introductory essay. For example, the second group listed above contains, inter alia, the following: |
1268 | DSGRP | Each group of writings by a single author is preceded by a brief biographical notice. Some of the extracts are quite lengthy, containing several chapters or other divisions; others are quite short. As the above list indicates, the texts included range across all kinds of material: verse, prose, journals and letters. |
1270 | DSGRP | The easiest way of encoding such an anthology is to treat each individual extract as a text in its own right. A sequence of texts by a single author, together with the biographical note preceding it, can then be treated as a single |
1274 | DSGRP | formed by the section. The sequence of single or composite texts making up a single section of the work is likewise treated, together with its prefatory essay, as a single |
1345 | DSGRP | Note that the editor's introductory essays on each author may be treated as texts in their own right (as the essays on Lady Mary Wortley Montagu and Alexander Pope have been treated above), or as front matter to the embedded text, as the essay on Swift has been. The treatment in the example is intentionally inconsistent, to allow comparison of the two approaches. Consistency can be imposed either by treating the Swift section as a |
1347 | DSGRP | containing one text by Swift and one by the editor, or by treating the Montagu and Pope sections as |
1349 | DSGRP | elements containing the editor's essays as front matter. Marked in the second way, the Pope section of the book would look like this: |
1370 | DSGRP | front |
1377 | DSGRP | Where, as in this case, an anthology contains different kinds of text (for example, mixtures of prose and drama, or transcribed speech and dictionary entries, or letters and verse), the elements to be encoded will of course be drawn from more than one module. The elements provided by the core module described in chapter |
1378 | DSGRP | should however prove adequate for most simple purposes, where prose, drama, and verse are combined in a single collection. |
1380 | DSGRP | For anthologies of short extracts such as commonplace books, it may often be preferable to regard each extract not as a text in its own right but simply as a quotation or |
1385 | DSGRP | which appears in the front matter of Melville's |
1432 | DSFLT | An important characteristic of the unitary or composite text structures discussed so far is that they can be regarded as forming what is mathematically known as a |
1434 | DSFLT | covering the whole of the available text (or text division) at each hierarchic level. Just as an XML document has a single root element containing a single tree, each node of which forms a properly nested subtree, so it seems natural to think of the internal structure of a text as decomposable hierarchically into subparts, each of which is a properly nested subtree. While this is undoubtedly true of a large number of documents, it is not true of all. In particular, it is not true of texts which are only partly tessellated at a given level. For example, if a text A is contained by text B in such a way that part of B precedes A and part follows it, we cannot tessellate the whole of B. In such a case, we say that text A is a |
1446 | DSFLT | might be regarded as containing many floating texts embedded within another single text, the framing narrative, rather than as groups of discrete texts in which the fragments of framing narrative are regarded as front or back matter. |
1448 | DSFLT | As an example, we consider an 18th-century text |
1451 | DSFLT | , by Jane Barker (1726). This lengthy narrative contains nearly a hundred distinct |
1453 | DSFLT | embedded (as the title suggests) in a single patchwork. The work begins by introducing the central character, Galecia, but within a few pages launches into a distinct narrative, the story of Captain Manly: |
1504 | DSFLT | In other multi-narrative texts, the individual nested tales may have greater significance than the framing narratives, and it may therefore be preferable to treat the fragments of framing narrative as front or back matter associated with each nested tale. This is commonly done, for example, in texts such as Chaucer's |
1506 | DSFLT | , where each tale is typically presented with front matter in which the teller of the tale is introduced, and back matter in which the pilgrims comment on it. |
1514 | DSFLT | suggest that its content derives from a source external to the current text, |
1516 | DSFLT | carries no such implication and is simply used whenever the richer content model that it provides is required to support the markup of a part of a text that is presented as a discrete |
1518 | DSFLT | In some cases, such inclusions could be considered external (e.g., enclosures, attachments, etc.); often however, as in the examples above, the included text bears no signs of emanating from outside. |
1523 | DSFLT | may be used in combination. For a text with rich internal structure that is quoted at length, |
1536 | DSVIRT | Where the whole of a division can be automatically generated, for example because it is derived from another part of this or another document, an encoder may prefer not to represent it explicitly but instead simply mark its location by means of a processing instruction, or by using the special purpose |
1559 | DSVIRT | For example, if the table of contents (toc) for a given work is simply derived by copying the first |
1564 | DSVIRT | Similarly, in a digital edition combining a transcribed version of some text with a translated version of it, it may be desired to represent the transcript, the translation, and an aligned version of the two as three distinct divisions. This could be achieved by an encoding like the following: |
1568 | DSVIRT | The processing to be carried out when a |
1570 | DSVIRT | element is rendered will be determined by the application program or stylesheet in use: the function of the TEI markup is simply to identify the location at which the virtual division is to be generated, and also to provide some information about the kind of division to be generated. As such it may be regarded as a special kind of processing instruction, and could equally well be represented by one. |
1576 | DSFRONT | front matter |
1577 | DSFRONT | we mean distinct sections of a text (usually, but not necessarily, a printed one), prefixed to it by way of introduction or identification as a part of its production. Features such as title pages or prefaces are clear examples; a less definite case might be the prologue attached to a play. The front matter of an encoded text should not be confused with the TEI header described in chapter |
1578 | DSFRONT | , which serves as a kind of front matter for the computer file itself, not the text it encodes. |
1580 | DSFRONT | An encoder may choose simply to ignore the front matter in a text, if the original presentation of the work is of no interest, or for other reasons; alternatively some or all components of the front matter may be thought worth including with the text as components of the |
1586 | DSFRONT | With the exception of the title page (on which see section |
1587 | DSFRONT | ), front matter should be encoded using the same elements as the rest of a text. As with the divisions of the text body, no other specific tags are proposed here for the various kinds of subdivision which may appear within front matter: instead either numbered or un-numbered |
1592 | DSFRONT | for attributes, it is recommended that software written to handle TEI-conformant texts be prepared to recognize and handle these values when they occur, without limiting the user to the values in this list. |
1595 | DSFRONT | attribute may be used to distinguish various kinds of division characteristic of front matter: |
1598 | DSFRONT | A foreword or preface addressed to the reader in which the author or publisher explains the content, purpose, or origin of the text. |
1601 | DSFRONT | A formal declaration of acknowledgment by the author in which persons and institutions are thanked for their part in the creation of a text. |
1604 | DSFRONT | A formal offering or dedication of a text to one or more persons or institutions by the author. |
1605 | DSFRONT | abstract |
1607 | DSFRONT | A summary of the content of a text as continuous prose. |
1610 | DSFRONT | A table of contents, specifying the structure of a work and listing its constituents. The |
1618 | DSFRONT | The following extended example demonstrates how various parts of the front matter of a text may be encoded. The front part begins with a title page, which is presented in section |
1619 | DSFRONT | below. This is followed by a dedication and a preface, each of which is encoded as a distinct |
1647 | DSFRONT | The front matter concludes with another |
1649 | DSFRONT | element, shown in the next example, this time containing a table of contents, which contains a |
1654 | DSFRONT | element to provide page-references: the implication here is that the target identifiers supplied (fish1, fish2, etc.) will correspond with identifiers used for the |
1656 | DSFRONT | elements containing chapters of the text itself. (For the |
1688 | DSFRONT | Alternatively, the pointers in the index might link to the page breaks at which a chapter begins, assuming that these have been included in the markup: |
1702 | DSFRONT | The following example uses numbered divisions to mark up the front matter of a medieval text. Note that in this case no title page in the modern sense occurs; the title is simply given as a heading at the start of the front matter. Note also the use of the |
1751 | DSFRONT | If, however, the table of contents can be automatically generated from the remainder of the text, it may be preferable simply to mark its presence, either by means of an empty |
1758 | DSTITL | Detailed analysis of the title page and other |
1760 | DSTITL | of older printed books and manuscripts is of major importance in descriptive bibliography and the cataloguing of printed books; such analysis may require a rather more detailed module than that proposed here. |
1761 | DSTITL | The following elements are suggested as a means of encoding the major features of most title pages: |
1782 | DSTITL | class. Any number of elements from this class can appear grouped together within a |
1786 | DSTITL | element is included so as to enable encoders to record the presence of complex non-textual material on a title page. For simple cases such as printers' ornaments or illustrations the |
1797 | DSTITL | element without any need to group them together and encode a complete title page. |
1799 | DSTITL | Encoders wishing to add new elements to either class may do so using the methods described in section |
1800 | DSTITL | . Two examples of the use of these elements follow. First, the title page of the work discussed earlier in this section: |
1822 | DSTITL | tag to mark the line breaks of the original where necessary: |
1868 | DSTITL | Where, as here, it is considered important to encode salient features of the way a title page was originally rendered, the techniques exemplified in |
1873 | DSTITL | Where title pages are encoded, their physical rendition is often of considerable importance. One approach to this requirement would be to use the |
1876 | DSTITL | , to segment the typographic content of each part of the title page, and then use the global |
1888 | DSBACK | Conventions vary as to which elements are grouped as back matter and which as front. For example, some books place the table of contents at the front, and others at the back. Even title pages may appear at the back of a book as well as at the front. The content model for |
1896 | DSBACK | attribute on all division elements, in order to distinguish various kinds of division characteristic of back matter: |
1899 | DSBACK | An ancillary self-contained section of a work, often providing additional but in some sense extra-canonical text. |
1902 | DSBACK | A list of terms associated with definition texts ( |
1905 | DSBACK | list type="gloss" |
1913 | DSBACK | A list of bibliographic citations: this should be encoded as a |
1917 | DSBACK | index |
1919 | DSBACK | Any form of index to the work. |
1920 | DSBACK | colophon |
1925 | DSBACK | No additional elements are proposed for the encoding of back matter at present. Some characteristic examples follow; first, an index (for the case in which a printed index is of sufficient interest to merit transcription): |
1958 | DSBACK | Note that if the page breaks in the original source have also been explicitly encoded, and given identifiers, the references to them in the above index can more usefully be recorded as links. For example, assuming that the encoding of page 461 of the original source starts like this: |
1959 | DSBACK | then the last item above might be encoded more usefully in either of the following forms: |
1984 | DSBACK | And finally, a list of corrigenda and addenda with pseudo-epistolary features: |
2022 | textstructure | Default text structure |
2037 | DSSTRUC | The selection and combination of modules to form a TEI schema is described in |
# | id | text |
---|---|---|
2 | TitlePageVerso | Releases of the TEI Guidelines |
# | id | text |
---|---|---|
6 | TC | to the text. Witnesses to a text may include authorial or other manuscripts, printed editions of the work, early translations, or quotations of a work in other texts. Information concerning variant readings of a text may be accumulated in highly structured form in a critical apparatus of variants. This chapter defines a module for use in encoding such an apparatus of variants, which may be used in conjunction with any of the modules defined in these Guidelines. It also defines an element class which provides extra attributes for some elements of the core tag set when this module is selected. |
8 | TC | Information about variant readings (whether or not represented by a critical apparatus in the source text) may be recorded in a series of |
10 | TC | , each entry documenting one |
12 | TC | , or set of readings, in the text. Elements for the apparatus entry and readings, and for the documentation of the witnesses whose readings are included in the apparatus, are described in section |
14 | TC | . The available methods for embedding the apparatus in the rest of the text, or for linking an external apparatus to the base text, are described in section |
15 | TC | . Finally, several extra attributes for some tags of the core tag set, made available when the additional tag set for text criticism is selected, are documented in section |
18 | TC | Many examples given in this chapter refer to the following texts of the opening (usually just line 1) of Chaucer's |
56 | TCAPLL | methods of identifying which witnesses support a particular reading, and for describing the witnesses included in the apparatus: see section |
59 | TCAPLL | elements for indicating which portions of a text are covered by fragmentary witnesses: see section |
65 | TCAPLL | element is in one sense a more sophisticated and complex version of the |
68 | TCAPLL | as a way of marking points where the encoding of a passage in a single source may be carried out in more than one way. Unlike |
79 | TCAPEN | element, which groups together all the readings constituting the variation. The identification of discrete textual variations or apparatus entries is not a purely mechanical process; different editors may group readings differently. No rules are given here as to how to group readings into apparatus entries; the tags given here may be used to group readings in whatever way the editor finds most perspicuous or useful. |
81 | TCAPEN | The individual apparatus entry is encoded with the |
93 | TCAPEN | , are used to link the apparatus entry to the base text, if present. In such cases, several methods may be used for such linkage, each involving a slightly different usage for these attributes. Linkage between text and apparatus is described below in section |
103 | TCAPEN | or other elements, as described in the next section. A very simple partial apparatus for the first line of the |
105 | TCAPEN | might take a form something like this: |
115 | TCAPEN | , to indicate a preference for one reading, etc. The following sections on readings, subvariation, and witness information describe some of the more important complications which can arise. |
124 | TCAPLR | Individual readings are the crucial elements in any critical apparatus of variants. The following elements should be used to tag individual readings within an apparatus entry: |
128 | TCAPLR | N.B. the term |
130 | TCAPLR | is used here in the text-critical sense of |
131 | TCAPLR | the reading accepted as that of the original or of the base text |
132 | TCAPLR | . This sense differs from that in which the word is used elsewhere in the Guidelines, for example as in the attribute |
134 | TCAPLR | where the intended sense is |
135 | TCAPLR | the root form of an inflected word |
137 | TCAPLR | the heading of an entry in a reference book, especially a dictionary |
140 | TCAPLR | In recording readings within an apparatus entry, the |
152 | TCAPLR | element may also be used to record the base text of the source edition, to mark the readings of a base witness, to indicate the preference of an editor or encoder for a particular reading, or (e.g. in the case of an external apparatus) to indicate precisely to which portion of the main text the variation applies. Those who prefer to work without the notion of a base text or who are not using the parallel segmentation method may prefer not to use it at all. How it is used depends in part on the method chosen for linking the apparatus to the text; for more information, see section |
160 | TCAPLR | As members of the attribute classes |
174 | TCAPLR | As elsewhere, these attributes may be used to indicate the person responsible for the editorial decision being recorded, and also the degree of certainty associated with that decision by the person carrying out the encoding. |
178 | TCAPLR | attribute identifies the witnesses which have the reading in question. It is required if the apparatus gathers together readings from different witnesses, but may be omitted in an apparatus recording the readings of only one witness, e.g. substitutions, divergent opinions on what is in the witness or on how to expand abbreviations, etc. Even in such a one-witness apparatus, however, the |
180 | TCAPLR | attribute may still be useful when it is desired to record the occurrence of a particular reading in some other witness. For other methods of identifying the witnesses to a reading, see section |
204 | TCAPLR | attributes may be used to convey information on the sequence and cause of variation. In the following apparatus fragment, the reading |
209 | TCAPLR | per |
244 | TCAPLR | Similarly, if a witness is hard to decipher, it may be desired to indicate responsibility for the claim that a particular reading is supported by a particular witness. In line 2212a of |
246 | TCAPLR | , for example, the manuscript is read in different ways by different scholars; the editor Klaeber prints one text, using parentheses to indicate his expansion, and records in the apparatus two different accounts of the manuscript reading, by Zupitza and Chambers: |
268 | TCAPLR | attributes are intelligible only on an element recording a reading from a single witness, and should not be used if more than one witness is given on the same |
272 | TCAPLR | element. If more than one witness is given for the reading, they are undefined. To convey this information when the witness is one among several, the |
277 | TCAPLR | Where there is a greater weight of editorial discussion and interpretation than can conveniently be expressed through the attributes provided on these elements (for example where there are multiple witnesses for a single reading or multiple editorial responsibility for an emendation) this information can be attached to the apparatus in a note, or recorded in the feature structure notation defined in chapter |
278 | TCAPLR | . In particular, such recurring text-critical situations as palaeographic confusion of particular letters, or homœoarchy or homœoteleuton involving specific character groups, may lend themselves to feature structure treatment. Information concerning these recurrent situations may be encoded into database-like fragments within the text which would then be available to sophisticated computer-assisted analysis. Further work remains to be done on such mechanisms, however, and so no examples are given here of the use of feature structures in text-critical apparatus. |
282 | TCAPLR | element may also be used to record the specific wording of notes in the apparatus of the source edition, as here in a transcription of Friedrich Klaeber's note on |
293 | TCAPLR | Notes providing details of the reading of one particular witness should be encoded using the specialized |
298 | TCAPLR | Encoders should be aware of the distinct fields of use of the attribute values |
310 | TCAPLR | indicates the scholar responsible for asserting the existence of that reading in that physical entity. In some cases, the categories may blur: a scholar may produce an edition introducing readings for which he or she is responsible; that edition may itself become a witness in a later critical apparatus. Thus, readings introduced as corrections in the earlier edition will be seen in the later apparatus as witnessed by the earlier edition. As observed in the discussion concerning the discrimination of |
328 | TCAPSU | element may be used to group readings, either because they have identical values on one or more attributes, or because they are seen as forming a self-contained variant sequence, or for some other reason. This grouping of readings is entirely optional: no such grouping of readings is required. |
356 | TCAPSU | To indicate that both Hg and La vary only orthographically from the lemma, one might tag both readings |
357 | TCAPSU | rdg type='orthographic' |
373 | TCAPSU | may be used to organize the substantive variants of an apparatus entry. Editors may need to indicate that each of a group of witnesses may be taken as all supporting a particular reading, even though there may be variation concerning the exact form of that reading in, or the degree of support offered by, those witnesses. For example: one may identify three substantive variants on the first word of Chaucer's |
381 | TCAPSU | . In fact, the manuscripts display many different spellings of these words, and a scholar may wish both to show that the manuscripts have all these variant spellings and that these variant spellings actually support only the three regularized spelling forms. One may term these variant spellings as |
387 | TCAPSU | element by gathering the readings into three groups according to the normalized form of their reading. All the readings within each group may be accounted subvariants of the main reading for the group, which may be indicated by tagging it as a |
390 | TCAPSU | rdg type='groupBase' |
428 | TCAPSU | is supported by Ra2, even though the form differs in that manuscript. Accordingly, an application which recognizes that these apparatus entries show subvariation may then assign all the witnesses instanced as attesting the sub-variants on that lemma as actually supporting the reading of the lemma itself at a higher level of classification. Thus, Ha4 here supports the reading |
434 | TCAPSU | element might also be used to group readings in the same way. The example above is substantially identical to the following, which uses |
465 | TCAPSU | This expresses even more clearly than the previous encoding of this material that at the highest level of classification (apparatus entry A1), this variation has three normalized readings, and that the first of these is supported by manuscripts El, Hg, and Ha4; the second by Cp, Ld1, and La; and the third by Ra2. Some encoders may find the use of nested apparatus entries less intuitive than the use of reading groups, however, so both methods of classifying the readings of a variation are allowed. |
467 | TCAPSU | Reading groups may also be used to bring together variants which form an apparent developmental sequence, and to make clear that other readings are not part of that sequence, as in the following example, which makes clear that the variant sequence |
506 | TCAPLW | A given reading is associated with the set of witnesses attesting it by listing the witnesses in the |
514 | TCAPLW | element. Special mechanisms, described in the following sections, are needed to associate annotation on a reading with one specific witness among several (section |
515 | TCAPLW | ), to transcribe witness information verbatim from a source edition (section |
516 | TCAPLW | ), and to identify the formal lists of witnesses typically provided in the front matter of critical editions (section |
522 | TCAPWD | When it is desired to give additional information about a particular witness or witnesses for the reading, the information may be given in a |
524 | TCAPWD | element. This is a specialized form of note, which can be linked both to a reading and to one or more of the witnesses for that reading. The former linkage is effected by the |
541 | TCAPWD | cannot be included in the text at the point of attachment; it must point to the reading(s) being annotated by means of its |
543 | TCAPWD | attribute. To indicate, on the authority of editor PR, that the Ellesmere manuscript has an ornamental capital in the word |
555 | TCAPWD | This encoding makes clear that the ornamental capital mentioned is in the Ellesmere manuscript, and not in Hengwrt or Ha4. The |
563 | TCAPWD | may be used to record the specific wording of information in the source text, even when the information itself is captured in some more formal way elsewhere. The example from the |
566 | TCAPWD | ), for example, might be extended thus, to record the wording of the note explaining the variant: |
590 | TCAPWD | Observe that a single witness detail element may be linked to several different readings (noting, for example, a recurrent phenomenon in a particular manuscript) by having the |
592 | TCAPWD | attribute point at all the readings in question. Similarly, feature structures containing information about the text in a witness (whether retroversion, regularization, or other) can also be linked to specific |
606 | TCSCWL | In the transcription of printed critical editions, it may be desirable to retain for future reference the exact form in which the source edition records the witnesses to a particular reading; this is particularly important in cases of ambiguity in the information, or uncertainty as to the correct interpretation. The |
613 | TCSCWL | list may appear following a |
619 | TCSCWL | element in any apparatus entry, and should be used only to transcribe the witness information in the form found in the source. |
626 | TCSCWL | The advantage of holding witness information in the |
633 | TCSCWL | an application can check that every sigil |
634 | TCSCWL | We use the term sigil as the English equivalent of the Latin term |
639 | TCSCWL | attribute has declared datatype of one or more |
641 | TCSCWL | values, a check can be made that readings are assigned only to witness sigla which have been identified (using the |
646 | TCSCWL | ). Such checking is more difficult for witness sigla held as the content of a |
649 | TCSCWL | For this reason, it is recommended that encoders always hold witness information in the |
655 | TCSCWL | , where possible. Thus, as in the examples below, even when a reference to a witness is exactly reproduced in the |
657 | TCSCWL | element, the corresponding sigil for that witness can be written into the |
663 | TCSCWL | . However, in cases where it is uncertain how the witness reference contained in the |
665 | TCSCWL | element should be interpreted, or where no witness exists, the |
703 | TCSCWL | Of course, the sigil used for a particular witness in the source, as recorded in the |
705 | TCSCWL | element, may well differ from that used to indicate the same witness in the |
707 | TCSCWL | attribute, as shown particularly in the apparatus for the second line of the poem (Diet.1.2). |
716 | TCAPWL | A list of all identified witnesses should normally be supplied in the front matter of the edition, or in the |
723 | TCAPWL | element, which contains a series of |
727 | TCAPWL | element may contain a brief characterization of the witness, given as one or more prose paragraphs. If more detailed information about a manuscript witness is available, it should be represented using the |
737 | TCAPWL | Whether information about a particular witness is supplied by means of a |
743 | TCAPWL | element, a unique sigil for this source should always be supplied, using the global |
745 | TCAPWL | attribute. This identifier can then be used elsewhere to refer to this particular witness. |
753 | TCAPWL | The minimal information provided by a witness list is thus the set of sigla for all the witnesses named in the apparatus. For example, the witnesses referenced by the examples of this chapter might simply be listed thus: |
770 | TCAPWL | It is more helpful, however, for witness lists to be somewhat more informative: each |
781 | TCAPWL | As the last example shows, the witness description here may be complemented by a reference to a full description of the manuscript supplied elsewhere, typically as the content of a |
821 | TCAPWL | . Note also that if the witnesses being recorded are not manuscripts but printed works, it may be preferable to document them using the standard |
838 | TCAPWL | In text-critical work it is customary to refer to frequently occurring groups of witnesses by means of a single common sigil. Such sigla may be documented as pseudo-witnesses in their own right by including a nested witness list within the witness list, which uses the sigil for the group as its identifier, and supplies a fuller name for the group in its optional child |
869 | TCAPWL | Note that a single witness cannot appear more than once in a witness list, and therefore cannot be assigned to more than one group of witnesses. |
871 | TCAPWL | Situations commonly arise where there are many more or less fragmentary witnesses, such that there may be quite distinct groups of witnesses for different parts of a text or collection of texts. One may treat this with distinct |
875 | TCAPWL | element at the beginning of the file or in its header listing all the witnesses, partial and complete, for the text, with the attestation of fragmentary witnesses indicated within the apparatus by use of the |
882 | TCAPWL | If a witness list is provided, it may be unnecessary to give, in each apparatus entry, an exhaustive list of the witnesses which agree with the base text. An application program can—in principle—compare the witnesses given for each variant found with those given in the full list of witnesses, subtracting from this list all the witnesses not active at this point (perhaps because of lacuna, or because they contain a variation on a different, overlapping lemma) and thence calculate all the manuscripts agreeing with the base text. In practice, encoders may find it less error-prone to list all witnesses explicitly in each apparatus entry. |
893 | TCAPMI | If a witness is incomplete (whether a single fragment, a series of fragments, or a relatively complete text with one or more lacunae), it is usually desirable to record explicitly where its preserved portions begin and end. The following empty tags, which may occur within any |
897 | TCAPMI | element, indicate the beginning or end of a fragmentary witness or of a lacuna within a witness: |
909 | TCAPMI | when the module defined by this chapter is included in a schema. |
913 | TCAPMI | has a physical lacuna, and the text of the manuscript begins with |
933 | TCAPMI | both appear in witness X. In some cases, the apparatus in the source may commence recording the readings for a particular witness without its being clear whether the previous absence of readings for this witness is due to a lacuna, or to some other reason. The |
955 | TCAPLK | Three different methods may be used to link a critical apparatus to the text: |
961 | TCAPLK | the parallel segmentation method. |
968 | TCAPLK | apparatus, the former dispersed within the base text, the latter held in some separate location, within or outside the document with the base text. The parallel segmentation method does not use the concept of a base text and may only be used for in-line apparatus. |
975 | TCAPLK | element provides a useful means of grouping together a series of |
993 | TCAPLK | element of its TEI header, thus: |
1000 | TCAPLO | The location-referenced method of encoding apparatus provides a convenient method for encoding printed apparatus; in this method as in most printed editions, the apparatus is linked to the base text by indicating explicitly only the block of text on which there is a variant (noted usually by a canonical reference scheme, or by line number in the edition, such as |
1003 | TCAPLO | Page 15 line 1 |
1006 | TCAPLO | If the location-referenced method is used for an apparatus stored externally to the base text, the TEI header must have the declaration: |
1010 | TCAPLO | of the document, the base text (here El) will appear: |
1034 | TCAPLO | If the same text is encoded using in-line storage, the apparatus is dispersed through the base text block to which it refers. In this case, the location of the variant can be read from the line in which it occurs. |
1047 | TCAPLO | Since the location is not required to be exact, the apparatus for a line might also appear at the end of the line: |
1057 | TCAPLO | When the apparatus is linked to the text by means of location references, as shown here, it is not possible to find automatically the precise portion of text varied by the readings. In order to show explicitly what portion of the base text is replaced by the variant readings, the |
1071 | TCAPLO | base text reading |
1072 | TCAPLO | and requiring no qualification, but it may optionally carry the normal attributes, as shown here. Some text critics prefer to abbreviate or elide the lemma, in order to save space or trouble; such practice is not forbidden by these Guidelines, but no recommendations are made for conventions of abbreviating the lemma, whether abbreviation of each word, or suppression of all but the first and last word, etc. |
1080 | TCAPDE | In the double end-point attachment method, the beginning and end of the lemma in the base text are both explicitly indicated. It thus differs from the location-referenced method, in which only the larger span of text containing the lemma is indicated. Double end-point attachment permits unambiguous matching of each variant reading against its lemma. It or the parallel-segmentation method should be used in all cases where this is desired, for example where the apparatus is intended to enable full reconstruction of the text, or of the substantives, of every witness. |
1091 | TCAPDE | . In cases where it is not possible to insert anchors within the base text (e.g. where the text is on a read-only medium) the beginning and end of the lemma may be indicated by using the |
1096 | TCAPDE | The double end-point attachment method may be used with in-line or external apparatus. In the latter case, the base text (here El) will appear with |
1098 | TCAPDE | elements inserted at every place where a variant begins or ends (unless some element with an identifier already begins or ends at that point): |
1120 | TCAPDE | attribute can use the identifier for the line as a whole; the lemma is assumed to run from the beginning of the element indicated by the |
1124 | TCAPDE | attribute. If no value is given for |
1149 | TCAPDE | element in this method, as it may be extracted reliably from the base text. If an exhaustive list of witnesses is available, it will also not be necessary to specify just which manuscripts agree with the base text to enable reconstruction of witnesses. An application will be able to determine the manuscripts that witness the base reading, by noting which witnesses are attested as having a variant reading, and inferring the base text reading for all others after adjusting for fragmentary witnesses and for witnesses carrying overlapping variant readings. |
1151 | TCAPDE | Alternatively, if it is desired to make an explicit record of the attestation of the base text, the |
1166 | TCAPDE | . For example, at line 117 of the Wife of Bath's Prologue, the manuscripts Hg (Hengwrt), El (Ellesmere), and Ha4 (British Library Harleian 7334) read: |
1206 | TCAPDE | The parallel segmentation method, to be discussed next, cannot handle overlaps among variants, and would require the individual variants to be split into pieces. |
1208 | TCAPDE | Because creation and interpretation of double end-point attachment apparatus will be lengthy and difficult, it is likely that they will usually be created and examined by scholars only with mechanical assistance. |
1214 | TCAPPS | This method differs from the double end-point attachment method in that all variants at any point of the text are expressed as variants on one another. In this method, no two variations can overlap, although they may nest. Thus, the concepts of a base text and of a lemma become unnecessary: the texts compared are divided into matching segments all synchronized with one another. This permits direct comparison of any span of text in any witness with that in any other witness. It is also very easy with this method for an application to extract the full text of any one witness from the apparatus. |
1216 | TCAPPS | This method will (by definition) always be satisfactory when there are just two texts for comparison (assuming they are in the same language and script). It will also be useful where editors do not wish to privilege a text as the |
1218 | TCAPPS | or when editors wish to present parallel texts. It will become less convenient as traditions become more complex and tension develops between the need to segment on the largest variation found and the need to express the finest detail of agreement between witnesses. |
1220 | TCAPPS | In the parallel segmentation method, each segment of text on which there is variation is marked by an |
1224 | TCAPPS | element; if it is desired to single out one reading as preferred, it may be tagged |
1239 | TCAPPS | This method cannot be used with external apparatus: it must be used in-line. Note that apparatus encoded with this method may be translated into the double end-point attachment method and back without loss of information. Where double-end-point-attachment encodings have no overlapping lemmata, translation of these to the parallel segmentation encoding and back will also be possible without loss of information. |
1241 | TCAPPS | For economy, the witnesses to the reading most widely attested need not be stated. Since all manuscripts must be represented in all apparatus entries, it will be possible for an application to read a |
1243 | TCAPPS | declaring all the witnesses to the text and then calculate which witnesses have not been named. In the example below, only La and Ra2 are identified explicitly with a reading; an application might successfully infer from this that |
1260 | TCAPPS | As noted, apparatus entries may nest in this method: if an imaginary fifth manuscript of the text read |
1262 | TCAPPS | , the variation on the individual words of the line would nest within that for the line as a whole: |
1293 | TCAPPS | Parallel segmentation cannot, however, deal very gracefully with variants which overlap without nesting: such variants must be broken up into pieces in order to keep all witnesses synchronized. |
1300 | TCAPLN | When an apparatus is provided it does not need to be given at the location in the transcription where the variation, emendation, attribution, or other apparatus observation occurs. Instead it may be stored in a separate place in the same file, or indeed in another file, and point to the location at which it is meant to be used. Storing apparatus entries separately can be beneficial when encoding multiple competing, potentially overlapping, interpretations of the same point in the source texts. |
1302 | TCAPLN | The location-referenced method can be used to point to a position in a text using the |
1310 | TCAPLN | or other element at the location where the apparatus observation takes place. The contents of an element pointed to are understood to be equivalent to a |
1312 | TCAPLN | if none exists in the |
1314 | TCAPLN | , and if a |
1322 | TCAPLN | datatype and thus contains a URI as a value. This means that it can point directly to an |
1353 | TCAPLN | is not provided in the source file. |
1355 | TCAPLN | In addition, URLs can contain XPointer schemes including xpath(), range(), and string-range() which can be used in providing the location of an |
1357 | TCAPLN | that is stored separately from the text to which it applies. Both |
1361 | TCAPLN | can be used, as in the double end-point attachment method, to identify the starting and ending location for an apparatus using XPointer schemes described in |
1362 | TCAPLN | section to more precisely identify this location where beneficial. |
1379 | TCAPLN | attribute is provided then it should be understood that this supplies the location of the textual variance that the apparatus documents. If the |
1381 | TCAPLN | attribute contains an XPointer scheme that identifies a range of text (or elements) then this is understood to record the starting and ending of the range as in the double end-point attachment method. In such a case a @to attribute is unnecessary. |
1390 | TCTR | element. An application may then construct different |
1398 | TCTR | element. Consider, for example, the three different transcriptions given below of line 105 of the Hengwrt manuscript of Chaucer's |
1400 | TCTR | . The last word of the line |
1407 | TCTR | u |
1413 | TCTR | u |
1428 | TCTR | This example uses special purpose elements |
1456 | TCTR | In most cases, elements used to indicate features of a primary textual source may be represented within an |
1464 | TCTR | elements in the example just given. However, in cases where the tagged feature extends across a span of text which might itself contain variant readings which it is desired to represent by |
1466 | TCTR | structures, some adaptation of the tagging may be necessary. For example, a span of text may be marked in the transcription of the primary source as a single deletion but it may be desirable to represent just a few words from this source as individual deletions within the context of a critical apparatus drawing together readings from this and several other witnesses. In this case, the tagging of the span of words as one deletion may need to be decomposed into a series of one-word deletions for encoding within the apparatus. If it is important to record the fact that all were deleted by the same act, the markup may use the |
1495 | TC | The selection and combination of modules to form a TEI schema is described in |
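
The TCAPWL text in the table above (row 882) notes that an application can, in principle, work out which witnesses agree with the base text by subtracting from the full witness list every witness named on a variant reading and every witness not active at that point. A minimal Python sketch of that subtraction follows. The function name, its parameters, and the plain-list data layout are assumptions made for illustration and are not part of the TEI Guidelines; the sigla are borrowed from the examples mentioned in the table.

```python
def witnesses_agreeing_with_base(all_witnesses, variant_witnesses, inactive=()):
    """Infer which sigla agree with the base text at one apparatus entry.

    all_witnesses:     every sigil declared in the witness list
    variant_witnesses: one iterable of sigla per variant reading
    inactive:          sigla not active here (e.g. lacuna, overlapping lemma)

    Hypothetical helper: the data layout is an assumption for illustration.
    """
    # Collect every witness explicitly attested on some variant reading.
    attested = set().union(*variant_witnesses)
    # Whatever remains, minus the inactive witnesses, is inferred to
    # carry the base text reading.
    return sorted(set(all_witnesses) - attested - set(inactive))


sigla = ["El", "Hg", "Ha4", "Cp", "Ld1", "La", "Ra2"]
# Only La and Ra2 are named explicitly on variant readings.
print(witnesses_agreeing_with_base(sigla, [["La"], ["Ra2"]]))
```

As the same row observes, encoders may still find it less error-prone to list all witnesses explicitly in each apparatus entry, since this inference is only as reliable as the witness list and the record of fragmentary witnesses.
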
# | id | text |
---|---|---|
3 | DR | This module is intended for use when encoding printed dramatic texts, screen plays or radio scripts, and written transcriptions of any other form of performance. |
6 | DR | discusses elements such as cast lists, which can appear only in the front or back matter of printed dramatic texts. Section |
7 | DR | discusses the structural components of performance texts: these include major structural divisions such as acts and scenes (section |
10 | DR | ); stage directions (section |
14 | DR | discusses a small number of additional elements characteristic of screen plays and radio or television scripts, as well as some elements for representing technical stage directions such as lighting or blocking. |
16 | DR | The default structure for dramatic texts is similar to that defined by chapter |
20 | DR | Two element classes are used by this module. The |
22 | DR | class supplies specialized elements which can appear only in the front or back matter of performance texts. The |
24 | DR | class supplies a set of elements for stage directions and similar items such as camera movements, which can occur between or within speeches. |
31 | DRFAB | In dramatic texts, as in all TEI-conformant documents, the header element is followed by a |
33 | DRFAB | element, which contains optional front and back matter, and either a |
46 | DRFAB | elements are most likely to be of use when encoding preliminary materials in published performance texts. When the module defined by this chapter is included in a schema, the following additional elements not generally found in other forms of text become available as part of the front or back matter: |
49 | DRFAB | Elements for encoding each of these specific kinds of front matter are discussed in the remainder of this section, in the order given above. In addition, the front matter of dramatic texts may include the same elements as that of any other kind of text, notably title pages and various kinds of text division, as discussed in section |
51 | DRFAB | div type="performance" |
53 | DRFAB | div1 type="set" |
56 | DRFAB | Most other material in the front matter of a performance text will be marked with the default text structure elements described in chapter |
57 | DRFAB | . For example, the title page, dedication, other commendatory material, preface, etc., in a printed text should be encoded using |
61 | DRFAB | elements, containing headings, paragraphs, and other core tags. |
70 | DRSET | A special form of note describing the setting of a dramatic text (that is, the time and place of its action) is sometimes found in the front matter. |
71 | DRSET | Descriptions of the setting may also appear as initial stage directions in the body of the play, but such descriptions should be marked as stage directions, not |
75 | DRSET | element should be used only where the description forms part of the front matter, as in the following examples: |
125 | DRPRO | Many plays in the Western tradition include in their front matter a prologue, spoken by an actor, generally not in character. Similar speeches often also occur at the end of the play, as epilogues. The elements |
129 | DRPRO | are provided for the encoding of such features within the front or back matter, where appropriate. |
130 | DRPRO | A prologue may be encoded just like a distinct poem, as in the following example: |
164 | DRPRO | A prologue or epilogue may also be encoded as a speech, using the |
167 | DRPRO | . This is particularly appropriate where stage directions, etc., are involved, as in the following example: |
203 | DRPRO | In cases where the prologue or epilogue is clearly a significant part of the dramatic action, it may be preferable to include it in the body of a text, rather than in the front or back matter. In such cases, the encoder (and theatrical tradition) will determine whether or not to regard it as a new scene or division, or simply the final speech in the play. In the First Folio version of Shakespeare's |
205 | DRPRO | , for example, Prospero's final speech is clearly marked off as a distinct textual unit by the headings and layout of the page, and might therefore be encoded as back matter: |
294 | DRPERF | Performance texts are not only printed in books to be read, they are also performed. It is common practice therefore to include within the front matter of a printed dramatic text some brief account of particular performances, using the following element: |
297 | DRPERF | element may be used to group any and all information relating to the actual performance of a play or screenplay, whether it specifies how the play should be performed in general or how it was performed in practice on some occasion. |
299 | DRPERF | Performance information may include complex structures such as cast lists, or paragraphs describing the date and location of a performance, details about the setting portrayed in the performance and so forth. (See the discussion of these specialized structures in section |
300 | DRPERF | above.) If information for more than one performance is being recorded, then more than one |
304 | DRPERF | Names of persons, places, and dates of particular significance within the performance record may be explicitly marked using the general purpose |
307 | DRPERF | rs type="place" |
401 | DRCAST | cast list |
402 | DRCAST | is a specialized form of list, conventionally found at the start or end of a play, usually listing all the speaking and non-speaking roles in the play, often with additional description ( |
404 | DRCAST | ) or the name of an actor or actress ( |
406 | DRCAST | ). Cast lists may be encoded with the general purpose |
426 | DRCAST | A cast list relating to a specific performance may be accompanied by notes about the time or place of that performance, indicating (for example) the name of the theatre where the play was first presented, the name of the producer or director, and so forth. When the cast list relates to a specific performance, it should be embedded within a |
460 | DRCAST | . For example, the second cast item above might be encoded as follows: |
472 | DRCAST | element, where it is desired to link speeches within the text explicitly to the role, using the |
477 | DRCAST | The occasionally lengthy descriptions of a role sometimes found in written play scripts may be marked using the |
500 | DRCAST | When a list of such minor roles is given together, the |
504 | DRCAST | should indicate that it contains more than one role, by taking a value such as |
505 | DRCAST | list |
520 | DRCAST | A group of cast items forming a distinct subdivision of a cast list may be marked as such by using the special purpose |
524 | DRCAST | attribute may be used to indicate whether this grouping is indicated in the text by layout alone (i.e. the use of whitespace), by long braces or by some other means. A |
528 | DRCAST | element) followed by a series of |
551 | DRCAST | as a role description, and encode the above example as follows: |
569 | DRCAST | This version has the advantage that all role descriptions are treated alike, rather than in some cases being treated as headings. On the other hand there are also cases, such as the following, where the role description does function more like a heading: |
660 | DRBOD | The body of a performance text may be divided into structural units, variously called acts, scenes, stasima, entr'actes, etc. All such formal divisions should be encoded using an appropriate text-division element ( |
667 | DRBOD | . Whether divided up into such units or not, all performance texts consist of sequences of speeches (see |
668 | DRBOD | ) and stage directions (see |
670 | DRBOD | number |
672 | DRBOD | ). Speeches will generally consist of a sequence of |
674 | DRBOD | -level items: paragraphs, verse lines, stanzas, or (in case of uncertainty as to whether something is verse or prose) |
679 | DRBOD | The boundaries of formal units such as verse lines or paragraphs do not always coincide with speech boundaries. Units such as songs may be discontinuous or shared among several speakers. As described below in section |
685 | DRDIV | Large divisions in drama such as acts, scenes, stasima, or entr'actes are indicated by numbered or unnumbered |
692 | DRDIV | attributes may be used to define the type of division being marked, and to provide a name or number for it, as in the following example: |
704 | DRDIV | Where the largest divisions of a performance text are themselves subdivided, most obviously in the case of plays traditionally divided into acts and scenes, further nested text-division elements may be used, as in this example: |
741 | DRDIV | convention, (where the entrance of each new set of characters is marked as a distinct unit in the text) and the |
743 | DRDIV | element to represent the acts into which the play is divided. The elements chosen are determined only by the hierarchic position of these units in the text as a whole. If the text had no acts, but only scenes, then the scenes might be represented by |
745 | DRDIV | elements. Equally, if a play is divided only into |
747 | DRDIV | , with no smaller subdivisions, then the |
751 | DRDIV | should be used, as above, to make explicit the name associated with a particular category of subdivision. |
755 | DRDIV | . The second act in the above example would then be represented as follows: |
773 | DRSP | The following elements are used to identify speeches and speakers in a performance text: |
775 | DRSP | As noted above, the structure of many performance texts may be analysed as multiply hierarchic: a scene of a verse play, for example, may be divided into speeches and, at the same time, into verse lines. The end of a line may or may not coincide with the end of a speech, and vice versa. Other structures, such as songs, may be discontinuous or split up over several speeches. For some purposes it will be appropriate to regard the verse-structure as the fundamental organizing principle of the text, and for others the speech structure; in some cases, the choice between the two may be arbitrary. The discussion in the remainder of this chapter assumes that it is the speech-based hierarchy which most prominently determines the structure of performance texts, but the same mechanisms could be employed to encode a view of a performance text in which individual speeches were entirely subordinate to the formal units of prose and verse. For more detailed discussion and examples of various treatments of this fundamental issue, refer to chapter |
782 | DRSP | element are both used to indicate the speaker or speakers of a speech, but in rather different ways. The |
784 | DRSP | element is used to encode the word or phrase actually used within the source text to indicate the speaker: it may contain any string or prefix, and may be thought of as a highly specialized form of stage direction. The |
788 | DRSP | element in the TEI header |
791 | DRSP | element in the cast list |
792 | DRSP | , or even to some external source such as an online handbook of dramatic roles. The most usual case is that the pointer value supplied (prefixed by a sharp sign) corresponds with the value of an |
846 | DRSP | If the speaker attributions are completely regular (and may thus be reconstructed mechanically from the values given for the |
848 | DRSP | attribute), or are of no interest for the encoder of the text (as might be the case with editorially supplied attributions in older texts), then the |
850 | DRSP | element need not be used; the former example above then might look like this: |
866 | DRSP | More than one identifier may be listed as value for the |
868 | DRSP | attribute if the speech is spoken by more than one person, as in the following example: |
887 | DRSP | elements are both declared within the core module (see section |
892 | DRSPG | This module makes available the following additional element for handling groups of speeches: |
896 | DRSPG | element is intended for cases where the characters in a performance launch into something which might be regarded almost as a kind of separate structural division, typically associated with its own heading or numbering system, but which |
898 | DRSPG | in the text, at the same hierarchic level as speeches preceding or following it. Such units are often numbered, titled, and visually presented as distinct objects within the text. Here is a typical example from a well-known American musical comedy: |
961 | DRSTA | Both between and within the speeches of a written performance text, it is normal practice to include a wide variety of descriptive directions to indicate non-verbal action. The following elements are provided to represent these: |
966 | DRSTA | A satisfactory typology of stage directions is difficult to define. Certain basic types such as |
971 | DRSTA | setting |
974 | DRSTA | , are easily identified. But the list is not a closed one, and it is not uncommon to mix types within a single direction. No closed set of values for the |
976 | DRSTA | attribute is therefore proposed at the present time, though some suggested values are indicated in the list below, which also indicates the range of possibilities. |
1005 | DRSTA | element of the TEI header (described in section |
1085 | DRSTA | element may also be used in non-theatrical texts, to mark sound effects or musical effects, etc., as further discussed in section |
1090 | DRSTA | element is intended to help overcome the fact that the stage directions of a printed text may often not provide full information about either the intended or the actual movement of actors on stage. It may be used to keep track of entrances and exits in detail, so as to know which characters are on stage at which time. Its attributes permit a relatively formal specification for movements of characters, using user-defined codes to identify the characters involved (the |
1094 | DRSTA | attribute), and optionally which part of the stage is involved ( |
1098 | DRSTA | attribute is also provided; this allows the recording of different |
1104 | DRSTA | element should be located at the position in the text where the move is presumed to take place. This will often coincide with a stage direction, as in the following simple example: |
1113 | DRSTA | element can however appear independently of a stage direction, as in the following example: |
1133 | DRPAL | The actual speeches of a dramatic text may be composed of running text, which must be formally organized into paragraphs, in the case of prose (see section |
1134 | DRPAL | ), verse lines or line groups in that of verse (see section |
1137 | DRPAL | elements, in case of doubt as to whether the material should be treated as verse or prose. The following elements, all of which are defined in the core, are particularly useful when marking units of prose or verse within speeches: |
1139 | DRPAL | Like other milestone elements, the element |
1152 | DRPAL | As a member of the classes |
1170 | DRPAL | also gain additional attributes through their membership of the class |
1174 | DRPAL | In many texts, prose and verse may be inextricably mingled; particularly in earlier printed texts, prose may be printed as verse or verse as prose, or it may be impossible to distinguish the two. In cases of doubt, an encoder may prefer to tag the dubious material consistently as verse, to tag it all as prose, to follow the typography of the source text, or to use the neutral |
1180 | DRPAL | element of the header may be used to record explicitly what policy has been adopted. |
1184 | DRPAL | ) and verse (marked as |
1198 | DRPAL | class provides one simple way of indicating where the boundaries of a speech and of a verse line or line group do not coincide. The encoder may simply indicate that a line or line group is metrically incomplete by specifying the value |
1221 | DRPAL | Alternatively, where the fragments of the line or line group are consecutive in the text (though possibly interrupted by stage directions), the values |
1249 | DRPAL | or line group element is most often of use for the encoding of songs and other stanzaic material. Line groups may be fragmented across speakers in the same way as individual lines, and the same set of attributes may be used to record this fact. The element |
1251 | DRPAL | is provided in order to simplify the situation, very common in performances, where performance of a single entity, such as a song, is shared amongst several performers, as in the following example: |
1279 | DRPAL | This encoding however does not indicate that the three lines of Sir Joseph's song and the two lines following it together constitute a single verse stanza. This can be indicated by using the |
1314 | DREMB | Although primarily composed of speeches, performance texts often contain other structural units such as songs or strophes which are shared among different speakers. More generally, complex nested structures of plays within plays, interpolated masques, or interludes are far from uncommon. In more modern material, comparably complex structural devices such as flashback or nested playback are equally frequent. In all kinds of performance material, it may be necessary to indicate several actions which are happening simultaneously. |
1316 | DREMB | A number of different devices are available within the TEI scheme to support these complexities in the general case. Texts may be composite or self-nesting (see section |
1318 | DREMB | ). The TEI encoding scheme provides a variety of linking mechanisms, which may be used to indicate temporal alignment and aggregation of fragmented structures. In this section we provide a few specific examples of the application of these techniques to performance texts: |
1334 | DREMB | attributes on fragments of embedded structures to join them into a larger whole |
1343 | DREMB | When the whole of a song appears within a single speech, it may require no special treatment if it is considered to form a part of the speech: |
1368 | DREMB | If however, the song is to be regarded as forming a distinct item, perhaps with its own front and back matter, it may be better to regard it as a floating text: |
1396 | DREMB | element, each of its constituent parts must be regarded as a distinct fragment; the problem then facing the encoder is to reconstitute the interrupted whole in some way. |
1400 | DREMB | element may be used to group together consecutive speeches which are grouped together in some way, for example constituting a single song. Alternatively the |
1404 | DREMB | element contains a partial, not a complete, verse line, may also be used on the |
1406 | DREMB | element, to indicate that the line group is partial rather than complete, thus: |
1429 | DREMB | When the fragments of a song are separated by other intervening dialogue, or even when not, they may be linked together with the |
1434 | DREMB | . For example, the line groups making up Ophelia's song might be encoded as follows: |
1502 | DREMB | : they form part of the module for alignment and linking; this module must therefore be included in a schema if they are to be used, as further discussed in section |
1510 | DREMB | element is specifically intended to encode the fact that several discontiguous elements of the text together form one |
1571 | DREMB | The location of the |
1581 | DREMB | element requires the additional module for linking, which is selected as shown above. |
1585 | DRSIM | In printed or written versions of performance texts, a variety of techniques may be used to indicate the temporal alignment of speeches or actions. Speeches may be printed vertically aligned on the page, or braced together; stage directions (e.g. |
1586 | DRSIM | Speaking at the same time |
1643 | DRSIM | In the original, the stage direction |
1645 | DRSIM | is printed opposite a brace grouping all four speeches, indicating that all four characters speak at once, and that the stage direction applies to all of them. Rather than attempting to represent the appearance of the source, this example encoding represents its presumed meaning: the |
1651 | DRSIM | attribute is used to specify the fact that the four speeches were grouped by the brace in the copy text. Producing a readable version of the text which simulates the original printed effect may however require more complex markup and processing. |
1654 | DRSIM | . These would be appropriate for encodings the focus of which is on the actual performance of a text rather than its structure or formal properties. The module described in that chapter includes a large number of other detailed proposals for the encoding of such features as voice quality, prosody, etc., which might be relevant to such a treatment of performance texts. |
1658 | DROTH | Most of the elements and structures identified thus far are derived from traditional theatrical texts. Although other performance texts, such as screenplays or radio scripts, have not been discussed specifically, they can be encoded using the elements and structures listed above. Encoders may however find it convenient to use, as well, the additional specialized elements discussed in this section. For scripts containing very detailed technical information, the |
1663 | DROTH | Like other texts, screenplays and television or radio scripts may be divided into text divisions marked with |
1673 | DROTH | , each associated with a single camera angle and setting. Shots and sequences should be encoded using an appropriate text-division element (i.e., a |
1675 | DROTH | element if numbered division elements are in use and the next largest unit is a |
1679 | DROTH | element if un-numbered divisions are in use) specifying |
1680 | DROTH | sequence |
1683 | DROTH | as the value of the |
1687 | DROTH | It is normal practice in screenplays and radio scripts to distinguish directions concerning camera angles, sound effects, etc., from other forms of stage direction. Such texts also generally include far more detailed specifications of what the audience actually sees: descriptions of actions and background, etc. Scripts derived from cinema and television productions may also include texts displayed as captions superimposed on the action. All of these may be encoded using the general purpose |
1701 | DROTH | Where particular words or phrases within a direction are emphasized (by change of typeface or use of capital letters), an appropriate phrase-level element may be used to indicate the fact, as in the following examples, where certain words in the original are given in small capitals: |
1723 | DROTH | All of these elements, like other stage directions, can appear both within and between speeches. |
1780 | DRTEC | Traditional stage scripts may contain additional technical information about such production-related factors as lighting, |
1785 | DRTEC | . Alternatively, they may be formally distinguished from other stage directions by using the specialized |
1790 | DRTEC | Like stage directions, |
1815 | DR | The selection and combination of modules to form a TEI schema is described in |