
These tables list the 15,996 words (in 9,287 text nodes) that match the @ident of some *Spec.

The first column is simply position(); it serves as a reference and lets any given table be re-sorted into its original order. The second column is either the @ident of the *Spec or the @xml:id of the closest ancestor. The links often do not work: I do not know how to generate a proper link consistently, and I cannot easily test whether the target document is available, because doc-available() always fails on account of about:legacy-compat.
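As a rough illustration of how rows like the ones below could be produced, here is a minimal Python sketch. It is not the process that generated these tables (the mentions of position() and doc-available() suggest that was XSLT/XPath). The names `walk`, `label`, `matching_rows` and `SAMPLE` are hypothetical, and the sketch assumes that position() counts every text node in document order, whitespace-only ones included, and that a "word" is a run of word characters and full stops.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def walk(elem, chain=()):
    """Yield (text, ancestor chain) for each text node, in document order."""
    chain = chain + (elem,)
    if elem.text is not None:
        yield elem.text, chain
    for child in elem:
        yield from walk(child, chain)
        if child.tail is not None:
            # A tail is a text node that belongs to the parent element.
            yield child.tail, chain

def label(chain):
    """@ident of the closest *Spec ancestor, else the closest ancestor @xml:id."""
    for elem in reversed(chain):
        if elem.tag.endswith("Spec") and "ident" in elem.attrib:
            return elem.get("ident")
    for elem in reversed(chain):
        if XML_ID in elem.attrib:
            return elem.get(XML_ID)
    return ""

def matching_rows(root):
    """Return (position, id, text) for text nodes containing some *Spec @ident."""
    idents = {e.get("ident") for e in root.iter()
              if e.tag.endswith("Spec") and e.get("ident")}
    rows = []
    for pos, (text, chain) in enumerate(walk(root), start=1):
        # Keep dots inside identifiers such as att.declaring, drop sentence stops.
        words = {w.strip(".") for w in re.findall(r"[\w.]+", text)}
        if words & idents:
            rows.append((pos, label(chain), " ".join(text.split())))
    return rows

# Invented sample input, only loosely shaped like the TEI sources.
SAMPLE = """<div xml:id="DI">
  <p>Use the list element here.</p>
  <elementSpec ident="list"><desc>contains any sequence of items organized as a list.</desc></elementSpec>
  <elementSpec ident="item"><desc>contains one component.</desc></elementSpec>
</div>"""

if __name__ == "__main__":
    print("#\tid\ttext")
    for pos, ident, text in matching_rows(ET.fromstring(SAMPLE)):
        print(f"{pos}\t{ident}\t{text}")
```

Because the position is kept as the first field, a table that has been sorted by another column can be restored by sorting on that field again.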

specifications (i.e., https://svn.code.sf.net/p/tei/code/trunk/P5/Source/Specs/)

#	id	text
4	factuality	describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world.
27	factuality	categorizes the factuality of the text.
46	factuality	the text is to be regarded as entirely imaginative
62	factuality	the text is to be regarded as entirely informative or factual
78	factuality	the text contains a mixture of fact and fiction
94	factuality	the fiction/fact distinction is not regarded as helpful or appropriate to this text
147	factuality	Usually empty, unless some further clarification of the type attribute is needed, in which case it may contain running prose
149	factuality	For many literary texts, a simple binary opposition between
155	factuality	are in any sense

#	id	text
4	collection	contains the name of a collection of manuscripts, not necessarily located within a single repository.

#	id	text
4	damage	contains an area of damage to the text witness.
40	damage	Since damage to text witnesses frequently makes them harder to read, the
46	damage	attribute may be used to group together several related

#	id	text
2	pubPlace	publication place
13	pubPlace	contains the name of the place where a bibliographic item was published.

#	id	text
2	cond	conditional feature-structure constraint
14	cond	defines a conditional feature-structure constraint; the consequent and the antecedent are specified as feature structures or feature-structure collections; the constraint is satisfied if both the antecedent and the consequent subsume a given feature structure, or if the antecedent does not.

#	id	text
16	classRef	the identifier used for the required class within the source indicated.
23	classRef	indicates how references to this class within a content model should be interpreted.
31	classRef	a single occurrence of all members of the class may appear in sequence
35	classRef	a single occurrence of one or more members of the class may appear in sequence
43	classRef	one or more occurrences of all members of the class may appear in sequence
52	classRef	c
53	classRef	, then a reference to the class within a content model is understood as being a reference to
55	classRef	when
57	classRef	has the value
61	classRef	when it has the value
62	classRef	sequence
65	classRef	when it has the value
67	classRef	; to (a,b, c*) when it has the value
69	classRef	; or to (a+,b+,c+) when it has the value
77	classRef	supplies a list of class members which are to be included in the schema being defined.
84	classRef	supplies a list of class members which are to be excluded from the schema being defined.
105	classRef	Attribute and model classes are identified by the name supplied as value for the
109	classRef	element in which they are declared. All TEI names are unique; attribute class names conventionally begin with the letters

#	id	text
4	event	contains data relating to any kind of significant event associated with a person, place, or organization.
60	event	indicates the location of an event by pointing to a

#	id	text
2	model.pLike.front	groups paragraph-like elements which can occur as direct constituents of front matter.

#	id	text
2	principal	principal researcher
16	principal	supplies the name of the principal researcher responsible for the creation of an electronic text.

#	id	text
14	biblScope	defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.
79	biblScope	. For example, if the citation has

#	id	text
4	gap	indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, invisible, or inaudible.
125	gap	in the case of text omitted from the transcription because of deliberate deletion by an identifiable hand, indicates the hand which made the deletion.
144	gap	in the case of text omitted because of damage, categorizes the cause of the damage, if it can be identified.
163	gap	damage results from rubbing of the leaf edges
179	gap	damage results from mildew on the leaf surface
195	gap	damage results from smoke
262	gap	core tag elements may be closely allied in use with the
266	gap	elements, available when using the additional tagset for transcription of primary sources. See section
271	gap	tag simply signals the editor's decision to omit or inability to transcribe a span of text. Other information, such as the interpretation that text was deliberately erased or covered, should be indicated using the relevant tags, such as
273	gap	in the case of deliberate deletion.

#	id	text
4	surrogates	contains information about any representations of the manuscript being described which may exist in the holding institution or elsewhere.

#	id	text
14	head	contains any type of heading, for example the title of a section, or the heading of a list, glossary, manuscript description, etc.
52	head	may be rather longer than usual in modern works. If a section has an explicit ending as well as a heading, it should be marked as a
165	head	element is used for headings at all levels; software which treats (e.g.) chapter headings, section headings, and list titles differently must determine the proper processing of a
169	head	occurring as the first element of a list is the title of that list; one occurring as the first element of a
171	head	is the title of that chapter or section.

#	id	text
4	stress	contains the stress pattern for a dictionary headword, if given separately.
36	stress	Usually stress information is included within pronunciation information.

#	id	text
15	listForest	identifies the type of the forest group.

#	id	text
2	eLeaf	leaf or terminal node of an embedding tree
14	eLeaf	provides explicitly for a leaf of an embedding tree, which may also be encoded with the eTree element.
48	eLeaf	indicates the value of an embedding leaf, which is a feature structure or other analytic element.
86	eLeaf	tag may be used if the encoder does not wish to distinguish by name between nonleaf and leaf nodes in embedding trees; they are distinguished by their arrangement.

#	id	text
2	att.declaring	provides attributes for elements which may be independently associated with a particular declarable element within the header, thus overriding the inherited default for that element.
50	att.declaring	The rules governing the association of declarable elements with individual parts of a TEI text are fully defined in chapter

#	id	text
2	authority	release authority
16	authority	supplies the name of a person or other agency responsible for making a work available, other than a publisher or distributor.

#	id	text
35	undo	This encoding represents the following sequence of events:
37	undo	At stage s2, "just some sample text, we need" is deleted by overstriking, and "not" is added
38	undo	At stage s3, parts of the deletion are cancelled by underdotting, thus reinstating the words "just some" and "text".

#	id	text
2	entryFree	unstructured entry
13	entryFree	contains a single unstructured entry in any kind of lexical resource, such as a dictionary or lexicon.

#	id	text
14	alternate	The alternate element must have at least two child elements
26	alternate	This example content model permits either a

#	id	text
4	climate	contains information about the physical climate of a place.

#	id	text
2	when	indicates a point in time either relative to other elements in the same timeline tag, or absolutely.
28	when	supplies an absolute value for the time.
75	when	specifies the unit of time in which the
77	when	value is expressed, if this is not inherited from the parent
172	when	specifies a time interval either as a number or as one of the keywords defined by the datatype data.interval
191	when	identifies the reference point for determining the time of the current
193	when	element, which is obtained by adding the interval to the time of the reference point.
227	when	. If no value is supplied, and the
229	when	attribute is also unspecified, then the reference point is understood to be the origin of the enclosing
272	when	attribute must be supplied to specify an identifier for this point in time. The value used may be chosen freely provided that it is unique within the document and is a syntactically valid name. There is no requirement for values containing numbers to be in sequence.

#	id	text
2	titlePage	title page
16	titlePage	contains the title page of a text, appearing within the front or back matter.
54	titlePage	classifies the title page according to any convenient typology.
74	titlePage	This attribute allows the same element to be used for volume title pages, series title pages, etc., as well as for the
76	titlePage	title page of a work.

#	id	text
2	substJoin	substitution join
6	substJoin	identifies a series of possibly fragmented additions, deletions or other revisions on a manuscript that combine to make up a single intervention in the text

#	id	text
2	stage	stage direction
14	stage	contains any kind of stage direction within a dramatic text or fragment.
39	stage	indicates the kind of stage direction.
106	stage	describes stage business.
122	stage	is a narrative, motivating stage direction.
303	stage	attribute may be used to indicate more precisely the person or persons participating in the action described by the stage direction.

#	id	text
20	data.truthValue	The possible values of this datatype are
30	data.truthValue	This datatype applies only for cases where uncertainty is inappropriate; if the attribute concerned may have a value other than true or false, e.g.

#	id	text
2	model.headLike	groups elements used to provide a title or heading at the start of a text division.

#	id	text
4	mood	contains information about the grammatical mood of verbs (e.g. indicative, subjunctive, imperative).
88	mood	gram type="mood"

#	id	text
68	dimensions	dimensions relate to one or more leaves (e.g. a single leaf, a gathering, or a separately bound part)
84	dimensions	dimensions relate to the area of a leaf which has been ruled in preparation for writing.
100	dimensions	dimensions relate to the area of a leaf which has been pricked out in preparation for ruling (used where this differs significantly from the ruled area, or where the ruling is not measurable).
116	dimensions	dimensions relate to the area of a leaf which has been written, with the height measured from the top of the minims on the top line of writing, to the bottom of the minims on the bottom line of writing.
132	dimensions	dimensions relate to the miniatures within the manuscript
148	dimensions	dimensions relate to the binding in which the codex or manuscript is contained
164	dimensions	dimensions relate to the box or other container in which the manuscript is stored.
241	dimensions	This element may be used to record the dimensions of any text-bearing object, not necessarily a codex. For example:
257	dimensions	When simple numeric quantities are involved, they may be expressed on the
278	dimensions	Contains no more than one of each of the specialized elements used to express a three-dimensional object's height, width, and depth, combined with any number of other kinds of dimensional specification.

#	id	text
2	damageSpan	damaged span of text
12	damageSpan	marks the beginning of a longer sequence of text which is damaged in some way but still legible.
85	damageSpan	Both the beginning and ending of the damaged sequence must be marked: the beginning by the
89	damageSpan	attribute: if no other element available, the
93	damageSpan	The damaged text must be at least partially legible, in order for the encoder to be able to transcribe it. If it is not legible at all, the
99	damageSpan	element should be employed, with the value of the

#	id	text
2	lbl	label
14	lbl	contains a label for a form, example, translation, or other piece of information, e.g. abbreviation for, contraction of, literally, approximately, synonyms:, etc.
39	lbl	classifies the label using any convenient typology.

#	id	text
2	model.lLike	groups elements representing metrical components such as verse lines.

#	id	text
41	reg	If all that is desired is to call attention to the fact that the copy text has been regularized,

#	id	text
14	xr	contains a phrase, sentence, or icon referring the reader to some other location in this or another text.
130	xr	related or similar term
316	xr	This element encloses both the actual indication of the location referred to, which may be tagged using the
320	xr	elements, and any accompanying material which gives more information about why the reader is being referred there.

#	id	text
2	att.datable.iso	provides attributes for normalization of elements that contain datable events using the ISO 8601 standard.
19	att.datable.iso	supplies the value of a date or time in a standard form.
35	att.datable.iso	The following are examples of ISO date, time, and date & time formats that are
125	att.datable.iso	is a valid time with respect to the W3C
133	att.datable.iso	specifies the earliest possible date for the event in standard form, e.g. yyyy-mm-dd.
152	att.datable.iso	specifies the latest possible date for the event in standard form, e.g. yyyy-mm-dd.
211	att.datable.iso	The value of these attributes should be a normalized representation of the date, time, or combined date & time intended, in any of the standard formats specified by ISO 8601, using the Gregorian calendar.
239	att.datable.iso	are specified, the values should be interpreted as indicating a span of time by its starting time (or date) and duration. That is,
240	att.datable.iso	indicates the same time period as
245	att.datable.iso	form, no claim is made that the form in the source text is incorrect; the regularized form is simply that chosen as the main form for purposes of unifying variant forms under a single heading.

#	id	text
2	triangle	underspecified embedding tree, so called because of its characteristic shape when drawn
14	triangle	provides for an underspecified eTree, that is, an eTree with information left out.
51	triangle	supplies a value for the triangle, in the form of the identifier of a feature structure or other analytic element.
95	triangle	An optional label followed by zero or more embedding trees, triangles, or embedding leafs.

#	id	text
12	foreign	identifies a word or phrase as belonging to some language other than that of the surrounding text.
61	foreign	attribute should be supplied for this element to identify the language of the word or phrase marked. As elsewhere, its value should be a language tag as defined in
66	foreign	attribute should be used in preference to this element where it is intended to mark the language of the whole of some text element.

#	id	text
2	code	contains literal code from some formal language such as a programming language.
25	code	formal language
35	code	a name identifying the formal language in which the code is expressed

#	id	text
2	anchor	anchor point
69	anchor	attribute must be supplied to specify an identifier for the point at which this element occurs within a document. The value used may be chosen freely provided that it is unique within the document and is a syntactically valid name. There is no requirement for values containing numbers to be in sequence.

#	id	text
4	rendition	supplies information about the rendition or appearance of one or more elements in the source text.
38	rendition	styling applies to the first line of the target element
46	rendition	styling should be applied immediately before the content of the target element
50	rendition	styling should be applied immediately after the content of the target element
71	rendition	The present release of these Guidelines does not specify the content of this element in any further detail. It may be used to hold a description of the default rendition to be associated with the specified element, expressed in running prose, or in some more formal language such as CSS.

#	id	text
4	list	contains any sequence of items organized as a list.
88	list	The content of a "gloss" list should include a sequence of one or more pairs of a label element followed by an item element
103	list	each list item glosses some term or concept, which is given by a label element preceding the list item.
121	list	each list item is an entry in an index such as the alphabetical topical index at the back of a print volume.
125	list	each list item is a step in a sequence of instructions, as in a recipe.
129	list	each list item is one of a sequence of petitions, supplications or invocations, typically in a religious ritual.
133	list	each list item is part of an argument consisting of two or more propositions and a final conclusion derived from them.
142	list	to encode the rendering or appearance of a list (whether it was bulleted, numbered, etc.). The current recommendation is to use the
148	list	for the more appropriate task of characterizing the nature of the content of a list.
155	list	list type="gloss"
336	list	The following example treats the short numbered clauses of Anglo-Saxon legal codes as lists of items. The text is from an ordinance of King Athelstan (924–939):
366	list	Note that nested lists have been used so the tagging mirrors the structure indicated by the two-level numbering of the clauses. The clauses could have been treated as a one-level list with irregular numbering, if desired.
385	list	May contain an optional heading followed by a series of items, or a series of label and item pairs, the latter being optionally preceded by one or two specialized headings.

#	id	text
22	interleave	This example content model permits either a

#	id	text
2	docTitle	document title
16	docTitle	contains the title of a document, including all its constituents, as given on a title page.

#	id	text
2	num	number
38	num	indicates the type of numeric value.
135	num	supplies the value of the number in standard form.
152	num	a numeric value.
157	num	The standard form used is defined by the TEI datatype data.numeric.
211	num	Detailed analyses of quantities and units of measure in historical documents may also use the feature structure mechanism described in chapter

#	id	text
2	orig	original form
119	orig	will be combined with a regularized form within a

#	id	text
2	transpose	describes a single textual transposition as an ordered list of at least two pointers specifying the order in which the elements indicated should be re-combined.
30	transpose	Transposition is usually indicated in a document by a metamark such as a wavy line or numbering.

#	id	text
2	catDesc	category description
16	catDesc	describes some category within a taxonomy or text typology, either in the form of a brief prose description or in terms of the situational parameters used by the TEI formal textDesc.

#	id	text
85	item	May contain simple prose or a sequence of chunks.
87	item	Whatever string of characters is used to label a list item in the copy text may be used as the value of the global
95	item	element to record the enumerator of the list item. In glossary lists, however, the term being defined should be given with the

#	id	text
4	district	contains the name of any kind of subdivision of a settlement, such as a parish, ward, or other administrative or geographic unit.

chapters ('en') (i.e., https://svn.code.sf.net/p/tei/code/trunk/P5/Source/Guidelines/en/)

DI-PrintDictionaries.xml#13091

#	id	text
4	DI	This chapter defines a module for encoding lexical resources of all kinds, in particular human-oriented monolingual and multilingual dictionaries, glossaries, and similar documents. The elements described here may also be useful in the encoding of computational lexica and similar resources intended for use by language-processing software; they may also be used to provide a rich encoding for wordlists, lexica, glossaries, etc. included within other documents. Dictionaries are most familiar in their printed form; however, increasing numbers of dictionaries exist also in electronic forms which are independent of any particular printed form, but from which various displays can be produced.
6	DI	Both typographically and structurally, print dictionaries are extremely complex. Such lexical resources are moreover of interest to many communities with different and sometimes conflicting goals. As a result, many general problems of text encoding are particularly pronounced here, and more compromises and alternatives within the encoding scheme may be required in the future.
21	DI	dictionaries; encoding guidelines should include these structural principles. We therefore define two distinct elements for dictionary entries, one (
34	DI	Second, since so much of the information in printed dictionaries is implicit or highly compressed, their encoding requires clear thought about whether it is to capture the precise typographic form of the source text or the underlying structure of the information it presents. Since both of these views of the dictionary may be of interest, it proves necessary to develop methods of recording both, and of recording the interrelationship between them as well. Users interested mainly in the printed format of the dictionary will require an encoding to be faithful to an original printed version. However, other users will be interested primarily in capturing the lexical information in a dictionary in a form suitable for further processing, which may demand the expansion or rearrangement of the information contained in the printed form. Further, some users wish to encode
36	DI	of these views of the data, and retain the links between related elements of the two encodings. Problems of recording these two different views of dictionary data are discussed in section
37	DI	, together with mechanisms for retaining both views when this is desired.
39	DI	To deal with this complexity, and in particular to account for the wide variety of linguistic contexts within which a dictionary may be designed, it can be necessary to customize or change the schema by providing more restriction or possibly alternate content models for the elements defined in this chapter. Section
40	DI	illustrates this with the provision of a closed set of values for grammatical descriptors.
42	DI	This chapter contains a large number of examples taken from existing print dictionaries; in each case, the original source is identified. In presenting such examples, we have tried to retain the original typographic appearance of the example as well as presenting a suggested encoding for it. Where this has not been possible (for example in the display of pronunciation) we have adopted the transliteration found in the electronic edition of the
44	DI	. Also, the middle dot in quoted entries is rendered with a full stop, while within the sample transcriptions hyphenation and syllabification points are indicated by a vertical bar |, regardless of their appearance in the source text.
49	DIBO	Overall, dictionaries have the same structure of front matter, body, and back matter familiar from other texts. In addition, this module defines
55	DIBO	as component-level elements which can occur directly within a text division or the text body.
68	DIBO	As members of the classes
82	DIBO	The front and back matter of a dictionary may well contain specialized material such as lists of common and proper nouns, grammatical tables, gazetteers, a
84	DIBO	, etc. These should be tagged using elements defined elsewhere in these Guidelines, chiefly in the core module (chapter
89	DIBO	element consists of a set of
93	DIBO	elements. These text divisions might, for example, correspond to sections for different letters of the alphabet, or to sections for different languages in a bilingual dictionary, as in the following example:
118	DIBO	In a print dictionary, the entries are typically typographically distinct entities, each headed by some morphological form of the lexical item described (the
120	DIBO	), and sorted in alphabetical order or (especially for non-alphabetic scripts) in some other conventional sequence. Dictionary entries should be encoded as distinct successive items, each marked as an
128	DIBO	Some dictionaries provide distinct entries for homographs, on the basis of etymology, part-of-speech, or both, and typically provide a numeric superscript on the headword identifying the homograph number. In these cases each homograph should be encoded as a separate entry; the
130	DIBO	element may optionally be used to group such successive homograph entries. In addition to a series of
136	DIBO	group (see section
137	DIBO	) when information about hyphenation, pronunciation, etc., is given only once for two or more homograph entries. If the homograph number is to be recorded, the global attribute
139	DIBO	may be used for this purpose. In some dictionaries, homographs are treated in distinct parts of the same entry; in these cases, they may be separated by use of the
146	DIBO	attribute, is often required for superentries and entries, especially in cases where the order of entries does not follow the local character-set collating sequence (as, for example, when an entry for
148	DIBO	appears at the place where
210	DIEN	A simple dictionary entry may contain information about the form of the word treated, its grammatical characterization, its definition, synonyms, or translation equivalents, its etymology, cross-references to other entries, usage information, and examples. These we refer to as the
224	DIEN	In addition, however, dictionary entries often have a complex hierarchical structure. For example, an entry may consist of two or more sub-parts, each corresponding to information for a different part-of-speech homograph of the headword. The entry (or part-of-speech homographs, if the entry is split this way) may also consist of senses, each of which may in turn be composed of two or more sub-senses, etc. Each sub-part, homograph entry, sense, or sub-sense we call a
232	DIENHI	The outermost structural level of an entry is marked with the elements
242	DIENHI	element even for an entry that has only one sense to group together all parts of the definition relating to the word sense since this leads to more consistent encoding across entries. All of these levels may each contain any of the constituent parts of an entry. A special case of hierarchical structure is represented by the
247	DIENHI	may be used at any point in the hierarchy to delimit parts of the dictionary entry which are structurally anomalous, as further discussed in section
257	DIENHI	For example, an entry with two senses will have the following structure:
265	DIENHI	An entry with two homographs, the first with two senses and the second with three (one of which has two sub-senses), may have a structure like this:
326	DIENHI	The hierarchic structure of a dictionary entry is enforced by the structures defined in this module. The content model for
328	DIENHI	specifies that entries do not nest, that homographs nest within entries, and that senses nest within entries, homographs, or senses, and may be nested to any depth to reflect the embedding of sub-senses. Any of the top-level constituents (
352	DIENGP	information about the form of the word treated (orthography, pronunciation, hyphenation, etc.)
356	DIENGP	definitions or translations into another language
395	DIENGP	In a simple entry with no internal hierarchy, all top-level constituents can appear as children of
403	DIENGP	n person who competes.
432	DIENGP	Any top-level constituent can appear at any level when the hierarchical structure of the entry is more complex. The most obvious examples are
438	DIENGP	level when several senses or translations exist:
481	DIENGP	n cry of an ass; sound of a trumpet. ∙ vt [VP2A] make a cry or sound of this kind.
518	DIENGP	Information of the same kind can appear at different levels within the same entry; here, grammatical information occurs both at entry and homograph level.
582	DIENGP	2 n [U] the state when one's feelings and actions are uncontrolled; freedom from control...
677	DITPFO	Dictionary entries most often begin with information about the form of the word to which the entry applies. Typically, the orthographic form of the word, sometimes marked for syllabification or hyphenation, is the first item in an entry. Other information about the word, including variant or alternate forms, inflected forms, pronunciation, etc., is also often given.
712	DITPFO	gen, number, case
723	DITPFO	when describing that particular form of the word.
725	DITPFO	Different dictionaries use different means to mark hyphenation, syllabification, and stress, and they often use some unusual glyphs (e.g., the
728	DITPFO	. When transcribing representations of pronunciation the International Phonetic Alphabet should be used. It may be convenient (as has been done in the text of this chapter) to use a simple transliteration scheme for this; such a scheme should however be properly documented in the header.
753	DITPFO	For a variety of reasons including ease of processing, it may be desired to split into separate elements information which is collapsed into a single element in the source text; orthography and hyphenation may for example be transcribed as separate elements, although given together in the source text. For a discussion of the issues involved, and of methods for retaining both the presentation form and the interpreted form, see section
797	DITPFO	Or the inflectional pattern may be indicated by reference to a table of paradigms, as here:
820	DITPFO	Explanatory labels may be attached to alternate forms:
825	DITPFO	mean time between failures.
866	DITPFO	element is repeated to associate the first orthographic form explicitly with the first pronunciation, and the second orthographic form with the second pronunciation:
894	DITPFO	element can preserve relations among elements that are implicit in the text. For example, in the CED entry for
962	DITPGR	, or any other element containing content about which there is grammatical information. For example, in the entry
977	DITPGR	, the elements for morphological information are simply shorthand for the general purpose
979	DITPGR	element. Consider this entry for the French word
987	DITPGR	This entry can be tagged using specialized grammatical elements:
1120	DITPSE	Dictionaries may describe the meanings of words in a wide variety of different ways—by means of synonyms, paraphrases, translations into other languages, formal definitions in various highly stylized forms, etc. No attempt is made here to distinguish all the different forms which sense information may take; all of them may be tagged using the
1125	DITPSE	As a special case it is frequently desirable to distinguish the provision of translation equivalents in other languages from other forms of sense information; the use of
1126	DITPSE	cit type="translation"
1127	DITPSE	(which groups a translation equivalent with related information such as its grammatical description) for this purpose is described in section
1134	DITPDE	Dictionary definitions are those pieces of prose in a dictionary entry that describe the meaning of some lexical item. Most often, definitions describe the headword of the entry; in some cases, they describe translated texts, examples, etc.; see
1135	DITPDE	cit type="translation"
1138	DITPDE	cit type="example"
1142	DITPDE	element directly contains the text of the definition; unlike
1146	DITPDE	, it does not serve solely to group a set of smaller elements. The close analysis of definition text, such as the tagging of hypernyms, typical objects, etc., is not covered by these Guidelines.
1148	DITPDE	Definitions may occur directly within an entry; when multiple definitions are given, they are typically identified as belonging to distinct senses, as here:
1228	DITPTR	Multilingual dictionaries contain information about translations of a given word in some source language for one or more target languages. Minimally, the dictionary provides the corresponding translation in the target language; other material, such as morphological information (gender, case), various kinds of usage restrictions, etc., may also be given. If translation equivalents are to be distinguished from other kinds of sense information, they may be encoded using
1229	DITPTR	cit type="translation"
1236	DITPTR	element is used in multilingual dictionaries to group information (forms, grammatical information, usage, translation(s), etc.) about a given sense of a word where necessary. Information about the individual translation equivalents within a sense is grouped using
1237	DITPTR	cit type="translation"
1238	DITPTR	. This information may include the translation text (tagged
1260	DITPTR	Note how in the following example, different translation equivalents are grouped into the same or different senses, following the punctuation of the source and the usage labels:
1389	DITPTR	cit type="translation"
1390	DITPTR	may also be used in monolingual dictionaries when a translation is given for a foreign word:
1437	DITPET	marks a block of etymological information. Etymologies may contain highly structured lists of words in an order indicating their descent from each other, but often also include related words and forms outside the direct line of descent, for comparison. Not infrequently, etymologies include commentary of various sorts, and can grow into short (or long!) essays with prose-like structure. This variation in structure makes it impracticable to define tags which capture the entire intellectual structure of the etymology or record the precise interrelation of all the words mentioned. It is, however, feasible to mark some of the more obvious phrase-level elements frequently found in etymologies, using tags defined in the core module or elsewhere in this chapter. Of particular relevance for the markup of etymologies are:
1449	DITPET	As in other prose, individual word forms mentioned in an etymological description are tagged with
1459	DITPET	element may be used to identify a particular language name where it appears, in addition to using the
1545	DITPEG	cit type="example"
1546	DITPEG	element contains usage examples and associated information; the example text itself should be enclosed in a
1552	DITPEG	element associates a quotation with a bibliographic reference to its source.
1571	DITPEG	adj tech having many parts: the multiplex eye of the fly.
1578	DITPEG	Or when one wants a more comprehensive representation of examples:
1679	DITPEG	When a source is indicated, the example should be marked with a
1710	DITPUS	Most dictionaries provide restrictive labels and phrases indicating the usage of given words or particular senses. Other phrases, not necessarily related to usage, may also be attached to forms, translations, cross-references, and examples. The following elements are provided to mark up such labels:
1717	DITPUS	element may be used for any kind of significative phrase or label within the text. The
1733	DITPUS	Many dictionaries provide an explanation and/or a list of such usage labels in a preface or appendix. The type of the usage information may be indicated in the
1740	DITPUS	geo
1746	DITPUS	time
1759	DITPUS	domain
1762	DITPUS	reg
1790	DITPUS	lang
1793	DITPUS	language for foreign words, spellings, pronunciations, etc.
1796	DITPUS	gram
1801	DITPUS	In addition to this kind of information, multilingual dictionaries often provide
1803	DITPUS	to help the user determine the right sense of a word in the source language (and hence the correct translation). These include synonyms, concept subdivisions, typical subjects and objects, typical verb complements, etc. These labels may also be marked with the
1822	DITPUS	colloc
1855	DITPUS	unclassifiable piece of information to guide sense choice
1961	DITPUS	When the usage label is hard to classify, it may be described as a
1994	DITPXR	Dictionary entries frequently refer to information in other entries, often using extremely dense notations to convey the headword of the entry to be sought, the particular part of the entry being referred to, and the nature of the information to be sought there (synonyms, antonyms, usage notes, etymology, an illustration, etc.).
1996	DITPXR	Cross-references may be tagged in dictionaries using the
2000	DITPXR	elements defined in the core module (section
2003	DITPXR	element may be used to group all the information relating to a cross-reference.
2015	DITPXR	) is used to tag the cross-reference target proper (in dictionaries, usually the headword, possibly accompanied by a homograph number, a sense number, or other further restriction specifying what portion of the target entry is being referred to). The
2017	DITPXR	element is used to group the target with any accompanying phrases or symbols used to label the cross-reference; the cross-reference label itself may be tagged as a
2057	DITPXR	to mark the cross-reference label, the two examples differ in another way. The former assumes that the first sense of
2061	DITPXR	, and that the specific form of the reference in the source volume can be reconstructed, if needed, from that information. The latter does not require the first sense of
2063	DITPXR	to have an identifier, and retains the print form of the cross-reference; by omitting the
2069	DITPXR	and find the location referred to, or else that such processing will not be necessary.
2075	DITPXR	element may be used to indicate what kind of cross-reference is being made, using any convenient typology. Since different dictionaries may label the same kind of cross-reference in different ways, it may be useful to give normalized indications in the
2131	DITPXR	Strictly speaking, the reference above is not to the entry for
2133	DITPXR	, but to the list of synonyms found within that entry.
2135	DITPXR	In some cases, the cross-reference is to a particular subset of the meanings of the entry in question:
2167	DITPXR	The asterisk signals a reference to the entry for
2175	DITPXR	In some cases, the form in the definition is inflected, and thus
2226	DITPNO	am not, is not, are not, have not
2232	DITPNO	Although the interrogative form
2235	DITPNO	am I not?
2236	DITPNO	, it is generally avoided in spoken English and never used in formal English.
2291	DITPRE	element encloses a degenerate entry which appears in the body of another entry for some purpose. Many dictionaries include related entries for direct derivatives or inflected forms of the entry word, or for compound words, phrases, collocations, and idioms containing the entry word.
2372	DIHW	Examples, definitions, etymologies, and occasionally other elements such as cross-references, orthographic forms, etc., often contain a shortened or iconic reference to the headword, rather than repeating the headword itself. The references may be to the orthographic form or to the pronunciation, to the form given or to a variant of that form. The following elements are used to encode such iconic references to a headword:
2382	DIHW	which may optionally be used to resolve any ambiguity about the headword form being referred to.
2390	DIHW	indicates a reference to the full form of the headword
2410	DIHW	gives the initial of the word followed by a full stop, to indicate reference to the full form of the headword
2414	DIHW	refers to a capitalized form of the headword
2420	DIHW	element should be used for iconic or shortened references to the orthographic form(s) of the headword itself. It is an empty element and replaces, rather than enclosing, the reference. Note that the reference to a headword is not necessarily a simple string replacement. In the example
2426	DIHW	, the tilde stands for either headword form (
2520	DIHW	attribute to refer to a specific form of the headword:
2525	DIHW	comb form … : vagus nerve <
2625	DIHW	In many cases the reference is not to the orthographic form of the headword, but rather to another form of the headword—usually to an inflected form. In these cases, the element
2627	DIHW	should be used; this element takes as its content the string as it appears in the text.
2666	DIHW	, which are defined in the additional module for linking, segmentation, and alignment (see chapter
2689	DIHW	In addition, some dictionaries make reference to the pronunciation of the headword in the pronunciation of related entries, variants, or examples. The
2746	DIHW	Since existing printed dictionaries use different conventions for headword references (swung dash, first letter abbreviated form, capitalization, or italicization of the word, etc.), the exact method used should be documented in the header.
2764	DIMV	typographic view
2765	DIMV	—the two-dimensional printed page, including information about line and page breaks and other features of layout
2768	DIMV	editorial view
2769	DIMV	—the one-dimensional sequence of tokens which can be seen as the input to the typesetting process; the wording and punctuation of the text and the sequencing of items are visible in this view, but specifics of the typographic realization are not
2772	DIMV	lexical view
2773	DIMV	—this view includes the underlying information represented in a dictionary, without concern for its exact textual form
2777	DIMV	For example, a domain indication in a dictionary entry might be broken over a line and therefore hyphenated (
2781	DIMV	); the typographic view of the dictionary preserves this information. In a purely editorial view, the particular form in which the domain name is given in the particular dictionary (as
2787	DIMV	, etc.) would be preserved, but the fact of the line break would not. Font shifts might plausibly be included in either a strictly typographic or an editorial view. In the lexical view, the only information preserved concerning domain would be some standard symbol or string representing the nautical domain (e.g.
2789	DIMV	) regardless of the form in which it appears in the printed dictionary.
2795	DIMV	, the fonts in which different types of information are to be rendered, etc.), and then the typographic view, which is tied to a specific printed rendering. Computational linguists and philologists often begin with the typographic view and analyse it to obtain the editorial and/or lexical views. Some users may ultimately be concerned with retaining only the lexical view, or they may wish to preserve the typographic or editorial views as a reference text, perhaps as a guard against the loss or misinterpretation of information in the translation process. Some researchers may wish to retain all three views, and study their interrelations, since research questions may well span all three views.
2797	DIMV	In general, an electronic encoding of a text will allow the recovery of at least one view of that text (the one which guided the encoding); if editorial and typographic practices are consistently applied in the production of a printed dictionary, or if exceptions to the rules are consistently recorded in the electronic encoding, then it is
2799	DIMV	possible to recover the editorial view from an encoding of the lexical view, and the typographic view from an encoding of the editorial view. In practice, of course, the severe compression of information in dictionaries, the variety of methods by which this compression is achieved, the complexity of formulating completely explicit rules for editorial and typographic practice, and the relative rarity of complete consistency in the application of such rules, all make the mechanical transformation of information from one view into another something of a vexed question.
2801	DIMV	This section describes some principles which may be useful in capturing one or the other of these views as consistently and completely as possible, and describes some methods of attempting to capture more than one view in a single encoding. Only the editorial and lexical views are explicitly treated here; for methods of recording the physical or typographic details of a text, see chapter
2806	DIMV	attributes to link feature structures to a transcription of the editorial view of a dictionary, are not discussed here (for feature structures, see chapter
2807	DIMV	. For linkage of textual form and underlying information, see chapter
2813	DIMVTV	Common practice in encoding texts of all sorts relies on principles such as the following, which can be used successfully to capture the editorial view when encoding a dictionary:
2815	DIMVTV	All characters of the source text should be retained, with the possible exception of
2816	DIMVTV	rendition text
2819	DIMVTV	Characters appearing in the source text should typically be given as character data content in the document, rather than as the value of an attribute; again, rendition text may optionally be excepted from this rule.
2821	DIMVTV	Apart from the characters or graphics in the source text, nothing else should appear as content in the document, although it may be given in attribute values.
2823	DIMVTV	The material in the source text should appear in the encoding in the same order. Complications of the character sequence by footnotes, marginal notes, etc., text wrapping around illustrations, etc., may be dealt with by the usual means (for notes, see section
2825	DIMVTV	Complications of sequence caused by marginal or interlinear insertions and deletions, which are frequent in manuscripts, or by unconventional page layouts, as in concrete poetry, magazines with imaginative graphic designers, and texts about the nature of typography as a medium, typically do not occur in dictionaries, and so are not discussed here.
2830	DIMVTV	In a very conservative transcription of the editorial view of a text,
2831	DIMVTV	rendition characters
2833	DIMVTV	rendition text
2834	DIMVTV	(for example, conjunctions joining alternate headwords, etc.) are typically retained. Removing the tags from such a transcription will leave all and only the characters of the source text, in their original sequence.
2835	DIMVTV	This is a slight oversimplification. Even in conservative transcriptions, it is common to omit page numbers, signatures of gatherings, running titles and the like. The simple description above also elides, for the sake of simplicity, the difficulties of assigning a meaning to the phrase
2836	DIMVTV	original sequence
2837	DIMVTV	when it is applied to the printed characters of a source text; the
2838	DIMVTV	original sequence
2839	DIMVTV	retained or recovered from a conservative transcription of the editorial view is, of course, the one established during the transcription by the encoder.
2849	DIMVTV	. a feather, wing, fin, or similarly shaped part. 3. another name for
2853	DIMVTV	A conservative encoding of the editorial view of this entry, which retains all rendition text, might resemble the following:
2916	DIMVTV	A somewhat simplified encoding of the editorial view of this entry might exploit the fact that rendition text is often systematically recoverable. For example, parentheses consistently appear around pronunciation in this dictionary, and thus are effectively implied by the start- and end-tags for
2919	DIMVTV	The omission of rendition text is particularly common in systems for document production; it is considered good practice there, since automatic generation of rendition text is more reliable and more consistent than attempting to maintain it manually in the electronic text.
2920	DIMVTV	In such an encoding, removing the tags should exactly reproduce the sequence of characters in the source, minus rendition text. The original character sequence can be recovered fully by replacing tags with any rendition text they imply.
2924	DIMVTV	element in the header would be used to record the following patterns of rendition text:
2934	DIMVTV	appears before alternate forms
2940	DIMVTV	, inflection information, and sense numbers
2942	DIMVTV	senses are numbered in sequence unless otherwise specified using the global
3006	DIMVTV	When rendition text is omitted, it is recommended that the means to regenerate it be fully documented, using the
3008	DIMVTV	element of the TEI header.
3010	DIMVTV	If rendition text is used systematically in a dictionary, with only a few mistakes or exceptions, the global attribute
3012	DIMVTV	may be used on any tag to flag exceptions to the normal treatment. The values of the
3020	DIMVTV	element in the TEI header.
3052	DIMVLV	If the text to be interchanged retains only the lexical view of the text, there may be no concern for the recoverability of the editorial (not to speak of the typographic) view of the text. However, it is strongly recommended that the TEI header be used to document fully the nature of all alterations to the original data, such as normalization of domain names, expansion of inflected forms, etc.
3054	DIMVLV	In an encoding of the lexical view of a text, there are degrees of departure from the original data: normalizing inconsistent forms like
3068	DIMVLV	reorganizing the order of elements in an entry to show their relationship, as in
3073	DIMVLV	where in a strictly lexical view one might wish to group
3079	DIMVLV	splitting an entry into two separate entries, as in
3082	DIMVLV	/"selIb@sI/ n [U] state of living unmarried, esp as a religious obligation. celi.bate /"selIb@t/ n [C] unmarried person (esp a priest who has taken a vow not to marry).
3084	DIMVLV	For some purposes, this entry might usefully be split into an entry for
3086	DIMVLV	and a separate entry for
3092	DIMVLV	An encoding which captures the lexical view of the example given in the previous section might look something like the following. In this encoding:
3161	DIMVLV	Whether the given dictionary encoding focusses on the lexical view and thus approaches the status of lexical databases, or uses the typographic/editorial view approach and needs to communicate the sometimes informally stated values for the particular descriptive features, the issue of
3163	DIMVLV	of the content and of the container objects becomes relevant, in view of the growing tendency to interlink pieces of information across Internet resources. In such cases, it becomes crucial to be able to encode the fact that, whether the information on, for instance, the value of the grammatical category of Number is provided as "sg.", "sing.", "Singular", or equivalently "poj." in Polish, or "Ez." in German, etc., what is actually referred to is always the same grammatical value, which can be rendered with a plethora of markers, depending on the publisher, language, or lexicographic tradition. In order to signal that this variety of surface markers in fact indicates the same underlying value, it is possible to align them with an external inventory of standardized values. The TEI provides means to align grammatical categories as well as their content with the ISOcat reference, which is a Web implementation of
3167	DIMVLV	In the example below, a fragment of the entry for
3174	DIMVLV	). Depending on the status and extent of the dictionary, various strategies may be used to reduce the redundancy of the repeated ISOcat references.
3193	DIMVBO	It is sometimes desirable to retain both the lexical and the editorial view, in which case a potential conflict exists between the two. When there is a conflict between the encodings for the lexical and editorial views, the principles described in the following sections may be applied.
3198	DIMVAV	If the order of the data is the same in both views, then both views may be captured by encoding one
3200	DIMVAV	view in the character data content of the document, and encoding the other using attribute values on the appropriate elements. If all tags were to be removed, the remaining characters would be those of the dominant view of the text.
3204	DIMVAV	is used to provide attributes for use in encoding multiple views of the same dictionary entry. These attributes are available for use on all elements defined in this chapter when the base module for dictionaries is selected.
3206	DIMVAV	When the editorial view is dominant, the following attributes may be used to capture the lexical view:
3211	DIMVAV	When the lexical view is dominant, the following attributes may be used to record the editorial view:
3221	DIMVAV	For example, if the source text had the domain label
3223	DIMVAV	, it might be encoded as follows. With the editorial view dominant:
3227	DIMVAV	The lexical view of the same label would transcribe the normalized form as content of the
3229	DIMVAV	element, the typographic form as an attribute value:
3235	DIMVAV	If the source text gives inflectional information for the verb
3241	DIMVAV	. An encoding of the editorial view might take this form:
3259	DIMVAV	tag with null content, to enable the representation of implicit information even though it has no print realization.
3261	DIMVAV	The lexical view might be encoded thus:
3284	DIMVAV	A particular problem may be posed by the common practice of presenting two alternate forms of a word in a single string, by marking some parts of the word as optional in some forms. The following entry is for a word which can be spelled either
3292	DIMVAV	With the editorial view dominant, this entry might begin thus:
3300	DIMVAV	With the lexical view dominant, however, two
3349	DIMVAV	attribute is recommended, however, when long spans of text are involved, or when the optional part contains embedded tags.
3362	DIMVAV	A simple encoding solution would be to leave the definition text unanalysed, but this might be felt inadequate since it does not show that there are two definitions. A possible alternative encoding would be:
3372	DIMVAV	This transcribes some characters of the source text twice, however, which deviates from the usual practice. The following encoding records both the editorial and lexical views:
3388	DIMVOL	The attributes described in the previous section are useful only when the order of material is the same in both the editorial and the lexical view. When the two views impose different orders on the data, the standard linking mechanisms may be used to show the original location of material transposed in an encoding of the lexical view.
3392	DIMVOL	element may be used to mark the original location of the material, and the
3394	DIMVOL	attribute may be used on the lexical encoding of that material to indicate its original location(s). Like those in the preceding section, this attribute is defined for the attribute class
3562	DIFR	The content model for the
3564	DIFR	element provides an entry structure suitable for many average dictionaries, as well as many regular entries in more exotic dictionaries. However, the structure of some dictionaries does not allow the restrictions imposed by the content model for
3570	DIFR	elements are provided to support much wider variation in entry structure. The
3572	DIFR	element offers less freedom, in that it can only contain phrase level elements, but it can itself appear at any point within a dictionary entry where any of the structural components of a dictionary entry are permitted. As such, it acts as a container for otherwise anomalous parts of an entry.
3588	DIFR	element. For example, in the following entry from a dictionary already in electronic form, it is necessary to include a
3592	DIFR	. This is not permitted in the content model for
3629	DIFR	) elements—that is, using no grouping elements at all. This can be desirable if the encoder wants a completely
3631	DIFR	view, with no indication of or commitment to the association of one element with another. The following encoding uses no grouping elements, and keeps all rendition text:
3659	DIFR	Here is an alternative way of representing the same structure, this time using
3697	DI	The selection and combination of modules to form a TEI schema is described in
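
The rows for DIMVTV and DIMVAV above describe two properties of an encoding in which the editorial view is dominant: removing the tags leaves the characters of the source text in their original order, and attribute values can carry the lexical view alongside it. The following is a minimal Python sketch of that idea, not part of the extracted Guidelines text. The entry content is invented for illustration, the use of a norm attribute on the usage label is an assumption about how the normalized form would be recorded, and the TEI namespace is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry fragment, invented for illustration only.
# Real TEI documents are namespaced; the namespace is left out here.
# The norm attribute carrying the normalized domain name is an assumption.
ENTRY = (
    '<entry>'
    '<form><orth>bowsprit</orth></form> '
    '<usg type="dom" norm="nautical">naut.</usg> '
    '<def>a spar projecting from the bow of a vessel</def>'
    '</entry>'
)


def editorial_view(xml_text):
    """Return the character data in document order, i.e. the text with tags removed."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


def lexical_labels(xml_text):
    """Return the normalized value of each usage label, falling back to its content."""
    root = ET.fromstring(xml_text)
    return [usg.get("norm", usg.text or "") for usg in root.iter("usg")]


if __name__ == "__main__":
    print(editorial_view(ENTRY))
    print(lexical_labels(ENTRY))
```

Under these assumptions, editorial_view recovers the printed wording (including the abbreviated label), while lexical_labels reads the normalized value from the attribute, so a single encoding serves both views as long as the order of material is the same in each.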

CC-LanguageCorpora.xml#13064

#	id	text
3	CC	The term
4	CC	language corpus
5	CC	is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (for example, written, spoken, signed, or multimodal), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes, however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.
7	CC	These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed
8	CC	core
9	CC	language. A corpus may be made up of whole texts or of fragments or text samples. It may be a
15	CC	corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section
23	CC	). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI modules.
25	CC	Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. Therefore, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter
27	CC	). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter
28	CC	), drama (chapter
29	CC	), transcriptions of spoken text (chapter
31	CC	should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application. In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators.
33	CC	This chapter does however include some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section
35	CC	describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional module for language corpora proper. Section
36	CC	discusses a mechanism by which individual parts of the TEI header may be associated with different parts of a TEI-conformant text. Section
37	CC	reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section
55	CCDEF	); this section discusses their application to composite texts in particular.
58	CCDEF	text
59	CCDEF	refers to any stretch of discourse, whether complete or incomplete, unitary or composite, which the encoder chooses (perhaps merely for purposes of analytic convenience) to regard as a unit. The term
60	CCDEF	composite text
63	CCDEF	language corpora
67	CCDEF	poem cycles and epistolary works (novels or essays written in the form of collections or series of letters)
70	CCDEF	The elements listed above may be combined to encode each of these varieties of composite text in different ways.
72	CCDEF	In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.
76	CCDEF	element is intended for the encoding of language corpora, though it may also be useful in encoding newspapers, electronic anthologies, and other disparate collections of material. The individual samples in the corpus are encoded as separate
78	CCDEF	elements, and the entire corpus is enclosed in a
88	CCDEF	element, in which the corpus as a whole, and encoding practices common to multiple samples, may be described. The overall structure of a TEI-conformant corpus is thus:
105	CCDEF	Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the
107	CCDEF	element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section
112	CCDEF	In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The
114	CCDEF	element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged
118	CCDEF	If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the
121	CCDEF	. The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section
122	CCDEF	, and those for associating declarations with corpus components described in section
123	CCDEF	. These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and misclassification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.
125	CCDEF	Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study.
127	CCDEF	Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: in both cases, the body of the text is composed largely of other texts.
133	CCDEF	element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter
140	CCDEF	elements. The embedded text itself may be encoded using the
145	CCDEF	All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts may be encoded using the same module, then no problem arises. If however they require different modules, then these must be included in the schema. This process is described in more detail in section
150	CCAH	Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex, and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication data of a newspaper; the topic, register or factuality of an extract from a textbook. Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics).
152	CCAH	Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI header, as described in chapter
153	CCAH	. In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section
157	CCAH	, which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section
159	CCAH	); more detailed descriptive information about the creation and content of the corpus, such as the languages used within it and any descriptive classification system used (see section
160	CCAH	); and version information documenting any changes made in the electronic text (see section
164	CCAH	, several other elements can be used in the TEI header if the additional module defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus this module is referred to as being
165	CCAH	for language corpora
168	CCAH	When the module defined in this chapter is included in a schema, a number of additional elements become available within the
170	CCAH	element of the TEI header (discussed in section
187	CCAHTD	element provides a full description of the situation within which a text was produced or experienced, and thus characterizes it in a way relatively independent of any
191	CCAHTD	. The description is organized as a set of values and optional prose descriptions for the following eight
200	CCAHTD	By default, a text description will contain each of the above elements, supplied in the order specified. Except for the
202	CCAHTD	element, which may be repeated to indicate multiple purposes, no element should appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content
206	CCAHTD	Texts may be described along many dimensions, according to many different taxonomies. No generally accepted consensus as to how such taxonomies should be defined has yet emerged, despite the best efforts of many corpus linguists, text linguists, sociolinguists, rhetoricians, and literary theorists over the years. Rather than attempting the task of proposing a single taxonomy of
208	CCAHTD	(or the equally impossible one of enumerating all those which have been proposed previously), the closed set of
220	CCAHTD	it is equally applicable to spoken, written, or signed texts
222	CCAHTD	Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section
224	CCAHTD	. A second approach is to develop an application-specific set of
232	CCAHTD	Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a
234	CCAHTD	in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a
235	CCAHTD	taxonomy
243	CCAHTD	element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in sections
308	CCAHPA	element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. When the detailed elements provided by the
311	CCAHPA	are included in a schema, this element can contain detailed demographic or descriptive information about individual speakers or groups of speakers, such as their names or other personal characteristics. Individually identified persons may also be identified by a code which can then be used elsewhere within the encoded text, for example as the value of a
316	CCAHPA	speaker
321	CCAHPA	within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written, spoken, or signed texts.
325	CCAHPA	contains a description of the participants in an interaction, which may be supplied as straightforward prose, possibly containing a list of names, encoded using the usual
341	CCAHPA	Alternatively, when the
365	CCAHPA	An identified character in a drama or a novel may also be regarded as a participant in this sense, and encoded using the same techniques:
366	CCAHPA	It is particularly useful to define participants in a dramatic text in this way, since it enables the
368	CCAHPA	attribute to be used to link
393	CCAHSE	element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings.
395	CCAHSE	Each distinct setting is described by means of a
405	CCAHSE	. If this attribute is not specified, the setting details provided are assumed to apply to all participants represented in the language interaction. Note however that it is not possible to encode different settings for the same participant: a participant is deemed to be a person within a specific setting.
409	CCAHSE	element may contain either a prose description or a selection of elements from the classes
415	CCAHSE	. By default, when the module defined by this chapter is included in a schema, these classes thus provide the following elements:
426	CCAHSE	may also be available if the
430	CCAHSE	The following example demonstrates the kind of background information often required to support transcriptions of language interactions, first encoded as a simple prose narrative:
471	CCAHSE	Again, a more detailed encoding for places is feasible if the
473	CCAHSE	module is included in the schema. The above examples assume that only the general purpose
475	CCAHSE	element supplied in the core module is available.
484	CCAS	This section discusses the association of the contextual information held in the header with the individual elements making up a TEI text or corpus. Contextual information is held in elements of various kinds within the TEI header, as discussed elsewhere in this section and in chapter
485	CCAS	. Here we consider what happens when different parts of a document need to be associated with different contextual information of the same type, for example when one part of a document uses a different encoding practice from another, or where one part relates to a different setting from another. In such situations, there will be more than one instance of a header element of the relevant type.
487	CCAS	The TEI scheme allows for the following possibilities:
489	CCAS	A given element may appear in the corpus header only, in the header of one or more texts only, or in both places
491	CCAS	There may be multiple occurrences of certain elements in either corpus or text header.
498	CCAS1	A TEI-conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus-header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone.
502	CCAS1	for a corpus text is understood to be prefixed by the
504	CCAS1	given in the corpus header. All other optional elements of the
506	CCAS1	should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented below. This facility makes it possible to state once and for all in the corpus header each piece of contextual information which is common to the whole of the corpus, while still allowing for individual texts to vary from this common denominator.
508	CCAS1	For example, the following schematic shows the structure of a corpus comprising three texts, the first and last of which share the same encoding description. The second one has its own encoding description.
555	CCAS2	Certain of the elements which can appear within a TEI header are known as
557	CCAS2	. These elements have in common the fact that they may be linked explicitly with a particular part of a text or corpus by means of a
559	CCAS2	attribute on that element. This linkage is used to over-ride the default association between declarations in the header and a corpus or corpus text. The only header elements which may be associated in this way are those which would not otherwise be meaningfully repeatable.
570	CCAS2	An alphabetically ordered list of declarable elements follows:
611	CCAS2	. Since there are two, one of them (in this case
629	CCAS2	For texts associated with the header in which this declaration appears, correction method
631	CCAS2	will be assumed, unless they explicitly state otherwise. Here is the structure for a text which does state otherwise:
641	CCAS2	In this case, the contents of the divisions D1 and D3 will both use correction policy
643	CCAS2	, and those of division D2 will use correction policy
657	CCAS2	, as well as smaller structural units, down to the level of paragraphs in prose, individual utterances in spoken texts, and entries in dictionaries. However, TEI recommended practice is to limit the number of multiple declarable elements used by a document as far as possible, for simplicity and ease of processing.
663	CCAS2	An identifier specifying an element which contains multiple instances of one or more other elements should be interpreted as if it explicitly identified the elements identified as the default in each such set of repeated elements
665	CCAS2	Each element specified, explicitly or implicitly, by the list of identifiers must be of a different kind.
708	CCAS2	applies, correction method C1A and normalization method N1 apply, since these are the specified defaults within
710	CCAS2	. In the same way, for a text specifying
714	CCAS2	, correction C2A, and normalization N2B will apply.
716	CCAS2	A finer grained approach is also possible. A text might specify
717	CCAS2	text decls='C2B N2A'
720	CCAS2	declarations as required. A tag such as
721	CCAS2	text decls='ED1 ED2'
722	CCAS2	would (obviously) be illegal, since it includes two elements of the same type; a tag such as
723	CCAS2	text decls='ED2 C1A'
728	CCAS2	, resulting in a list that identifies two correction elements (C1A and C2A).
734	CCAS3	If there is a single occurrence of a given declarable element in a corpus header, then it applies by default to all elements within the corpus.
736	CCAS3	If there is a single occurrence of a given declarable element in the text header, then it applies by default to all elements of that text irrespective of the contents of the corpus header.
738	CCAS3	Where there are multiple occurrences of declarable elements within either corpus or text header,
740	CCAS3	each must have a unique value specified as the value of its
746	CCAS3	attribute with the value
754	CCAS3	An association made by one element applies by default to all of its descendants.
759	CCAN	Language corpora often include analytic encodings or annotations, designed to support a variety of different views of language. The present Guidelines do not advocate any particular approach to linguistic annotation (or
761	CCAN	); instead a number of general analytic facilities are provided which support the representation of most forms of annotation in a standard and self-documenting manner. Analytic annotation is of importance in many fields, not only in corpus linguistics, and is therefore discussed in general terms elsewhere in the Guidelines.
766	CCAN	The present section presents informally some particular applications of these general mechanisms to the specific practice of corpus linguistics.
772	CCAN1	we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre, or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in these Guidelines, for example in chapters
774	CCAN1	. The contextual properties of a TEI text are fully documented in the TEI header, which is discussed in chapter
778	CCAN1	Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.
780	CCAN1	The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual, or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the
782	CCAN1	element within the encoding description of the TEI header, as described in section
783	CCAN1	. Where different parts of a corpus have used different annotation methods, the
788	CCAN1	An extended example of one form of linguistic analysis commonly practised in corpus linguistics is given in section
794	CCREC	These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one language corpus, however large and ambitious. The reasoning behind this catholic approach is further discussed in chapter
795	CCREC	. For most large-scale corpus projects, it will therefore be necessary to determine a subset of TEI recommended elements appropriate to the anticipated needs of the project, as further discussed in chapter
796	CCREC	; these mechanisms include the ability to exclude selected element types, add new element types, and change the names of existing elements. A discussion of the implications of such changes for TEI conformance is provided in chapter
799	CCREC	Because of the high cost of identifying and encoding many textual features, and the difficulty in ensuring consistent practice across very large corpora, encoders may find it convenient to divide the set of elements to be encoded into the following four categories:
802	CCREC	texts included within the corpus will always encode textual features in this category, should they exist in the text
805	CCREC	textual features in this category will be encoded wherever economically and practically feasible; where present but not encoded, a note in the header should be made.
808	CCREC	textual features in this category may or may not be encoded; no conclusion about the absence of such features can be inferred from the absence of the corresponding element in a given text.
812	CCREC	textual features in this category are deliberately not encoded; they may be transcribed as unmarked-up text, or represented as
833	CC	The selection and combination of modules to form a TEI schema is described in
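
Note on rows 498-754 (CCAS1 to CCAS3): the header-association rules quoted there can be read as a small resolution procedure. The Python sketch below is illustrative only and is not part of the Guidelines. The identifiers ED1, ED2, C1A, C1B, C2A, C2B, N1, N2A and N2B come from the fragments above, but the nesting and the default flags are assumptions reconstructed from those fragments, and the names Decl, DECLS, effective_header and resolve_decls are hypothetical. The sketch also ignores the special prefixing rule mentioned in rows 502-504.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Decl:
    ident: str                    # identifier of the declarable element
    kind: str                     # element type, e.g. "correction"
    parent: Optional[str] = None  # identifier of the containing element
    default: bool = False         # assumed default within its parent


# Assumed layout: two encoding descriptions, each holding alternatives.
# N1 is the only normalization in ED1, so it is treated as its default.
DECLS = {d.ident: d for d in [
    Decl("ED1", "encodingDesc"),
    Decl("C1A", "correction", "ED1", True),
    Decl("C1B", "correction", "ED1"),
    Decl("N1", "normalization", "ED1", True),
    Decl("ED2", "encodingDesc"),
    Decl("C2A", "correction", "ED2", True),
    Decl("C2B", "correction", "ED2"),
    Decl("N2A", "normalization", "ED2"),
    Decl("N2B", "normalization", "ED2", True),
]}


def effective_header(corpus_header: Dict[str, str],
                     text_header: Dict[str, str]) -> Dict[str, str]:
    """Corpus header elements apply to every text unless the text header
    specifies the same element, which then overrides it for that text."""
    return {**corpus_header, **text_header}


def resolve_decls(decls: str) -> Dict[str, str]:
    """Map each element kind to the identifier selected by a decls value."""
    explicit = [DECLS[i] for i in decls.split()]
    kinds = [d.kind for d in explicit]
    if len(kinds) != len(set(kinds)):
        raise ValueError("two identified elements are of the same kind")
    resolved: Dict[str, str] = {}
    for d in explicit:
        children = [c for c in DECLS.values() if c.parent == d.ident]
        # A container stands for the defaults among its repeated children.
        targets = [c for c in children if c.default] if children else [d]
        for t in targets:
            if t.kind in resolved:
                raise ValueError("more than one " + t.kind + " selected")
            resolved[t.kind] = t.ident
    return resolved
```

Under these assumptions, resolve_decls("ED1") selects C1A and N1, resolve_decls("C2B N2A") selects exactly those two elements, and both "ED1 ED2" and "ED2 C1A" are rejected, which matches the cases described in rows 708-728.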

FS-FeatureStructures.xml#12945

#	id	text
6	FS	is a general purpose data structure which identifies and groups together individual
8	FS	, each of which associates a name with one or more values. Because of the generality of feature structures, they can be used to represent many different kinds of information, but they are of particular usefulness in the representation of linguistic analyses, especially where such analyses are partial, or
29	FSor	binary
34	FSor	numeric
36	FSor	string
43	FSor	set
47	FSor	list
49	FSor	discusses how the operations of alternation, negation, and collection of feature values may be represented. Section
62	FSBI	The fundamental elements used to represent a feature structure analysis are
74	FSBI	attribute which may be used to represent typed feature structures, and may contain any number of
81	FSBI	value
82	FSBI	. The value may be simple: that is, a single binary, numeric, symbolic (i.e. taken from a restricted set of legal values), or string value, or a collection of such values, organized in various ways, for example, as a list; or it may be complex, that is, it may itself be a feature structure, thus providing a degree of recursion. Values may be under-specified or defaulted in various ways. These possibilities are all described in more detail in this and the following sections.
86	FSBI	. The components of such libraries may then be referenced from other feature or feature-value representations, using the
92	FSBI	We begin by considering the simple case of a feature structure which contains binary-valued features only. The following three XML elements are needed to represent such a feature structure:
101	FSBI	are not discussed in this section: they provide an alternative way of indicating the content of an element, as further discussed in section
108	FSBI	elements with binary values can be straightforwardly used to encode the
145	FSBI	attribute to indicate the name of the feature. Feature structures need not be typed, but features must be named. Similarly, the
153	FSBI	to a binary value) requires additional validation, as does any restriction on the features available within a feature structure of a particular type (e.g. whether a feature structure of type
157	FSBI	). Such validation may be carried out at the document level, using special purpose processing, at the schema level using additional validation rules, or at the declarative level, using an additional mechanism such as the
162	FSBI	Although we have used the term
163	FSBI	binary
172	FSBI	), it should be noted that such values are not restricted to propositional assertions. As this example shows, this kind of value is intended for use with any binary-valued feature.
181	FSSY	numeric values
183	FSSY	string values
184	FSSY	. The module defined by this chapter allows for the specification of additional datatypes if necessary, by extending the underlying class
194	FSSY	element is used for the value of a feature when that feature can have any of a small, finite set of possible values, representable as character strings. For example, the following might be used to represent the claim that the Latin noun form
210	FSSY	case
214	FSSY	number
215	FSSY	) are used to define morpho-syntactic properties of a word. Each of these features can take one of a small number of values (for example, case can be
225	FSSY	elements. Note that, instead of using a symbolic value for grammatical number, one could have named the feature
229	FSSY	and given it an appropriate binary value, as in the following example:
234	FSSY	Whether one uses a binary or symbolic value in situations like this is largely a matter of taste.
238	FSSY	element is used for the value of a feature when that value is a string drawn from a very large or potentially unbounded set of possible strings of characters, so that it would be impractical or impossible to use the
240	FSSY	element. The string value is expressed as the content of the
242	FSSY	element, rather than as an attribute value. For example, one might encode a street address as follows:
250	FSSY	element is used when the value of a feature is a numeric value, or a range of such values. For example, one might wish to regard the house number and the street name as different features, using an encoding like the following:
257	FSSY	If the numeric value to be represented falls within a specific range (for example an address that spans several numbers), the
266	FSSY	It is also possible to specify that the numeric value (or values) represented should (or should not) be truncated. For example, assuming that the daily rainfall in mm is a feature of interest for some address, one might represent this by an encoding like the following:
269	FSSY	This represents any of the infinite number of numeric values falling between 0 and 1.3; by contrast
274	FSSY	Some communities of practice, notably those with a strong computer-science bias, prefer to dissociate the information on the value of the given feature from the specification of the data type that this value represents. In such cases, feature values can be provided directly as textual content of
281	FSSY	As noted above, additional processing is necessary to ensure that appropriate values are supplied for particular features, for example to ensure that the feature
283	FSSY	is not given a value such as
284	FSSY	symbol value="feminine"/
285	FSSY	. There are two ways of attempting to ensure that only certain combinations of feature names and values are used. First, if the total number of legal combinations is relatively small, one can predefine all of them in a construct known as a
287	FSSY	, and then reference the combination required using the
292	FSSY	feature value library
293	FSSY	(so called, since a feature structure may be the value of a feature). A total of 30 feature structures (5 × 3 × 2) is required to enumerate all the possible combinations of individual case, gender and number values in the preceding illustration. We discuss the use of such libraries and their representation in XML further in section
301	FSSY	Whether at the level of feature-system declarations, feature- and feature-value libraries, or individual features, it is possible to align both feature names and their values with standardized external data category repositories such as ISOcat.
306	FSSY	and its value
321	FSFL	As the examples in the preceding section suggest, the direct encoding of feature structures can be verbose. Moreover, it is often the case that particular feature-value combinations, or feature structures composed of them, are re-used in different analyses. To reduce the size and complexity of the task of encoding feature structures, one may use the
337	FSFL	). If a feature has as its value a feature structure or other value which is predefined in this way, the
344	FSFL	For example, suppose a feature library for phonological feature specifications is set up as follows.
391	FSFL	Then the feature structures that represent the analysis of the phonological segments (phonemes)
405	FSFL	The preceding are but four of the 128 logically possible fully specified phonological segments using the seven binary features listed in the feature library. Presumably not all combinations of features correspond to phonological segments (there are no strident vowels, for example). The legal combinations, however, can be collected together, each one represented as an identifiable
423	FSFL	attribute; for example, one might use them in a feature value pair such as:
427	FSFL	Feature structures stored in this way may also be associated with the text which they are intended to annotate, either by a link from the text (for example, using the TEI global
429	FSFL	attribute), or by means of stand-off annotation techniques (for example, using the TEI
434	FSFL	Note that when features or feature structures are linked to in this way, the result is effectively a copy of the linked item at the place from which it is linked. This form of linking should be distinguished from the phenomenon of
444	FSST	Features may have complex values as well as atomic ones; the simplest such complex value is represented by supplying a
446	FSST	element as the content of an
450	FSST	element as the value for the
464	FSST	To illustrate the use of complex values, consider the following simple model of a word, as a structure combining surface form information, a syntactic category, and semantic information. Each word analysis is represented as a
465	FSST	fs type='word'
467	FSST	surface
472	FSST	. The first of these has an atomic string value, but the other two have complex values, represented as nested feature structures of types
473	FSST	category
492	FSST	This analysis does not tell us much about the meaning of the symbols
514	FSST	element, as a number of
516	FSST	elements. Alternatively, the relevant features may be referenced by their identifiers, supplied as the value of the
532	FSST	With such libraries in place, and assuming the availability of similarly predefined feature structures for transitivity and semantics, the preceding example could be considerably simplified:
556	FSVAR	Sometimes the same feature value is required at multiple places within a feature structure, in particular where the value is only partially specified at one or more places. The
563	FSVAR	For example, suppose one wishes to represent noun-verb agreement as a single feature structure. Within the representation, the feature indicating (say) number appears more than once. To represent the fact that each occurrence is another appearance of the same feature (rather than a copy), one could use an encoding like the following:
590	FSVAR	vLabel
595	FSVAR	The scope of the names used to label re-entrancy points is that of the outermost
597	FSVAR	element in which they appear. When a feature structure is imported from a feature value library, or referenced from elsewhere (for example by using the
599	FSVAR	attribute) the names of any sharing points it may contain are implicitly prefixed by the identifier used for the imported feature structure, to avoid name clashes. Thus, if some other feature structure were to reference the
602	FSVAR	then the labelled points in the example would be interpreted as if they had the name
616	FSSS	A feature whose value is regarded as a set, bag, or list may have any positive number of values as its content, or none at all (thus allowing for representation of the empty set, bag, or list). The items in a list are ordered, and need not be distinct. The items in a set are not ordered, and must be distinct. The items in a bag are not ordered, and need not be distinct. Sets and bags are thus distinguished from lists in that the order in which the values are specified does not matter for the former, but does matter for the latter, while sets are distinguished from bags and lists in that repetitions of values do not count for the former but do count for the latter.
618	FSSS	If no value is specified for the
622	FSSS	defines a list of values. If the
628	FSSS	attribute, suppose that a feature structure analysis is used to represent a genealogical tree, with the information about each individual treated as a single feature structure, like this:
654	FSSS	element is first used to supply a list of
655	FSSS	name
658	FSSS	feature. Other features are defined by reference to values which we assume are held in some external feature value library (not shown here). For example, the
660	FSSS	element is used a second time to indicate that the person's siblings should be regarded as constituting a set rather than a list. Each sibling is represented by a feature structure: in this example, each feature structure is a copy of one specified in the feature value library.
662	FSSS	If a specific feature contains only a single feature structure as its value, the component features of which are organized as a set, bag, or list, it may be more convenient to represent the value as a
666	FSSS	. For example, consider the following encoding of the English verb form
670	FSSS	feature whose value is a feature structure which contains
671	FSSS	person
673	FSSS	number
714	FSSS	element is also useful in cases where an analysis has several components. In the following example, the French word
716	FSSS	has a two-part analysis, represented as a list of two values. The first specifies that the word contains a preposition; the second that it contains a masculine plural relative pronoun:
736	FSSS	The set, bag, or list which has no members is known as the null (or empty) set, bag, or list. A
738	FSSS	element with no content and with no value for its
740	FSSS	attribute is interpreted as referring to the null set, bag, or list, depending on the value of its
755	FSSS	elements, if, for example, one of the members of a set is itself a set, or if two lists are concatenated together. Note that such collections pay no attention to the contents of the nested
757	FSSS	elements: if it is desired to produce the union of two sets, the
759	FSSS	element discussed below should be used to make a new collection from the two sets.
764	FVE	It is sometimes desirable to express the value of a feature as the result of an operation over some other value (for example, as
768	FVE	, or as the concatenation of two collections). Three special purpose elements are provided to represent disjunctive alternation, negation, and collection of values:
779	FVALT	element can be used wherever a feature value can appear. It contains two or more feature values, any one of which is to be understood as the value required. Suppose, for example, that we are using a feature system to describe residential property, using such features as
781	FVALT	. In a particular case, we might wish to represent uncertainty as to whether a house has two or three bathrooms. As we have already shown, one simple way to represent this would be with a numeric maximum:
791	FVALT	element represents alternation over feature values, not feature-value pairs. If therefore the uncertainty relates to two or more feature value specifications, each must be represented as a feature structure, since a feature structure can always appear where a value is required. For example, if it is uncertain whether the house being described has two bathrooms or two bedrooms, a structure like the following may be used:
805	FVALT	: in the case above, the implication is that having two bathrooms excludes the possibility of having two bedrooms and vice versa. If inclusive alternation is required, a
824	FVALT	This analysis indicates that the property may have two bathrooms, two bedrooms, or both two bathrooms and two bedrooms.
830	FVALT	to describe items that are mentioned to enhance a property's sales value, such as whether it has a pool or a good view. Now suppose for a particular listing, the selling points include an alarm system and a good view, and either a pool or a jacuzzi (but not both). This situation could be represented, using the
870	FVALT	If a large number of ambiguities or uncertainties need to be represented, involving a relatively small number of features and values, it is recommended that a stand-off technique, for example using the general-purpose
883	FVNOT	element can be used wherever a feature value can appear. It contains any feature value and returns the complement of its contents. For example, the feature
885	FVNOT	in the following example has any whole numeric value other than 2:
892	FVNOT	element is to provide the complement of the feature values it contains, rather than their negation. If a feature system declaration is available which defines the possible values for the associated feature, then it is possible to say more about the negated value. For example, suppose that the available values for the feature
893	FVNOT	case
894	FVNOT	are declared to be nominative, genitive, dative, or accusative, whether in a TEI feature system declaration or by some other means. Then the following two specifications are equivalent:
906	FVNOT	If however no such system declaration is available, all that one can say about a feature specified via negation is that its value is something other than the negated value.
908	FVNOT	Negation is always applied to a feature value, rather than to a feature-value pair. The negation of an atomic value is the set of all other values which are possible for the feature.
910	FVNOT	Any kind of value can be negated, including collections (represented by a
914	FVNOT	elements). The negation of any complex value is understood to be the set of values which cannot be unified with it. Thus, for example, the negation of the feature structure F is understood to be the set of feature structures which are not unifiable with F. In the absence of a constraint mechanism such as the Feature System Declaration, the negation of a collection is anything that is not unifiable with it, including collections of different types and atomic values. It will generally be more useful to require that the organization of the negated value be the same as that of the original value, for example that a negated set is understood to mean the set which is a complement of the set, but such a requirement cannot be enforced in the absence of a constraint mechanism.
921	FVCOLL	element can be used wherever a feature value can appear. It contains two or more feature values, all of which are to be collected together. The organization of the resulting collection is specified by the value of the
923	FVCOLL	attribute, which need not necessarily be the same as that of its constituent values if these are collections. For example, one can change a list to a set, or vice versa.
940	FVCOLL	Suppose however that we discover that for some language it is necessary to add a new possible value, and to treat the value of the feature as a list rather than as a set. The
961	FSBO	The value of a feature may be underspecified in a number of different ways. It may be null, unknown, or uncertain with respect to a range of known possibilities, as well as being defined as a negation or an alternation. As previously noted, the specification of the range of known possibilities for a given feature is not part of the current specification: in the TEI scheme, this information is conveyed by the
963	FSBO	. Using this, or some other system, we might specify (for example) that the range of values for an element includes symbols for masculine, feminine, and neuter, and that the default value is neuter. With such definitions available to us, it becomes possible to say that some feature takes the default value, or some unspecified value from the list. The following special element is provided for this purpose:
968	FSBO	The value of an empty
982	FSBO	If, however, the value is explicitly stated to be the default one, using the
984	FSBO	element, then the following two representations are equivalent:
992	FSBO	Similarly, if the value is stated to be the negation of the default, then the following two representations are equivalent:
1007	FSLINK	Text elements can be linked with feature structures using any of the linking methods discussed elsewhere in the Guidelines (see for example sections
1121	FSLINK	element is used to link selected characters in the text
1168	FSLINK	It would then be possible to link each word to its intended annotation in the feature library quoted above, as follows:
1183	FD	The Feature System Declaration (FSD) is intended for use in conjunction with a TEI-conforming text that makes use of
1187	FD	It provides a mechanism by which the encoder can list all of the feature names and feature values and give a prose description as to what each represents.
1193	FD	It provides a mechanism by which the encoder can define the intended interpretation of underspecified feature structures. This involves defining default values (whether literal or computed) for missing features.
1196	FD	. This chapter relies upon, but does not reproduce, formal definitions and descriptions presented more thoroughly in the ISO standard, which should be consulted in case of ambiguity or uncertainty.
1198	FD	The FSD serves an important function in documenting precisely what the encoder intended by the system of feature structure markup used in an XML-encoded text. The FSD is also an important resource which standardizes the rules of inference used by software to validate the feature structure markup in a text, and to infer the full interpretation of underspecified feature structures.
1200	FD	The reader should be aware that the terminology used in this document does not always closely follow conventional practice in formal logic, and may also diverge from practice in some linguistic applications of typed feature structures. In particular, the term
1201	FD	interpretation
1202	FD	when applied to a feature structure is not an interpretation in the model-theoretic sense, but is instead a minimally informative (or equivalently, most general) extension
1203	FD	of that feature structure that is consistent with a set of constraints declared by an FSD. In linguistic applications, such a system of constraints is the principal means by which the grammar of some natural language is expressed. There is a great deal of disagreement as to what, if any, model-theoretic interpretation feature structures have in such applications, but the status of this formal kind of interpretation is not germane to the present document. Similarly, the term
1205	FD	is used here as elsewhere in these Guidelines to identify the syntactic state of well-formedness in the sense defined by the logic of typed feature structures itself, as distinct from and in addition to the
1209	FD	We begin by describing how an encoded text is associated with one or more feature system declarations. The second, third, and fourth sections describe the overall structure of a feature system declaration and give details of how to encode its components. The final section offers a full example; fuller discussion of the reasoning behind FSDs and another complete example are provided in
1213	FDLK	Linking a TEI Text to Feature System Declarations
1215	FDLK	In order for application software to use feature system declarations to aid in the automatic interpretation of encoded texts, or even for human readers to find the appropriate declarations which document the feature system used in markup, there must be a formal link from the encoded texts to the declarations. However, the schema which declares the syntax of the Feature System itself should be kept distinct from the feature structure schema, which is an application of that system.
1219	FDLK	element for each distinct type of feature structure used must be provided and associated with the type, which is the value used within each feature structure for its
1230	FDLK	element may be supplied either within the header of a standard TEI document, or as a standalone document in its own right. It contains one or more
1245	FDLK	element for each within the header attached to the document as follows:
1274	FDLK	In this case there is an implicit link between the
1278	FDLK	element because they share the same value for their
1280	FDLK	attribute and appear within the same document. This is a short cut for the more general case which requires a more explicit link provided by means of the
1285	FDLK	Ways of pointing to components of a TEI document without using an XML identifier are discussed in
1286	FDLK	way of accomplishing this is to add an XML identifier to each
1301	FDLK	(Although in this case the XML identifier is simply an uppercase version of the type name, there is no necessary connection between the two names. The only requirement is that the XML identifier conform to the standards required for identifiers, and that it be unique within the document containing it.)
1332	FDLK	there is no requirement for the local name for a given type of feature structures to be the same as that used by
1348	FDLK	element of a TEI document containing typed feature structures. Alternatively, it may appear independently of any feature structures, as a document in its own right, possibly with its own
1362	FDLK	value specified on a
1371	FDOV	A feature system declaration contains one or more feature structure declarations, each of which has up to three parts: an optional description (which gives a prose comment on what that type of feature structure encodes), an obligatory set of feature declarations (which specify range constraints and default values for the features in that type of structure), and optional feature structure constraints (which specify co-occurrence restrictions on feature values).
1380	FDOV	element may name one or more
1385	FDOV	fsDecl type="Basic"
1387	FDOV	fDecl name="One"
1389	FDOV	fDecl name="Two"
1391	FDOV	fsDecl type="Derived" baseTypes="Basic"
1393	FDOV	fDecl name="Three"
1395	FDOV	fs type="Derived"
1397	FDOV	fsDecl type="Derived"
1399	FDOV	fsDecl type="Basic"
1400	FDOV	when it specifies a base type of
1422	FDOV	gives the name of one or more types from which this type inherits feature specifications and constraints; if this type includes a feature specification with the same name as one inherited from any of the types specified by this attribute, or if more than one specification of the same name is inherited, then the possible values of that feature are determined by unification. Similarly, the set of constraints applicable is derived by conjoining those specified explicitly within this element with those implied by the
1424	FDOV	attribute. When no base type is specified, no feature specification or constraint is inherited.
1426	FDOV	Although the present standard does provide for default feature values, feature inheritance is defined to be monotonic.
1427	FDOV	The process of combining constraints may result in a contradiction, for example if two specifications for the same feature specify disjoint ranges of values, and at least one such specification is mandatory. In such a case, there is no valid feature structure of the type being defined.
1432	FDOV	fsDecl type="Sub" baseTypes="Super1 Super2"
1455	FDFD	has three parts: an optional prose description (which should explain what the feature and its values represent), an obligatory range specification (which declares what values the feature is allowed to have), and an optional default specification (which declares what default value should be supplied when the named feature does not appear in an
1460	FDFD	has no value provided, or the value
1466	FDFD	either has no default specified, or has conditional defaults, none of the conditions on which is met,
1468	FDFD	then the value of this feature in the feature structure's most general valid extension is the most general value provided in its
1470	FDFD	, in the case of a unit organization, or the singleton set, bag, or list containing that element, in the case of a complex organization. If the feature:
1473	FDFD	has no value provided, or the value
1477	FDFD	either has a default specified, or has conditional defaults, one of the conditions on which is met,
1479	FDFD	then this feature does have a value in the feature structure's most general valid extension when it exists, namely the default value that pertains.
1481	FDFD	It is possible that a feature structure will not have a valid extension because the default value that pertains to a feature is not consistent with that feature's declared range. Additional tools are required for the enforcement of such criteria.
1492	FDFD	The logic for validating feature values and for matching the conditions for supplying default values is based on the operation of
1506	FDFD	containing the value
1510	FDFD	. The negation of a value
1515	FDFD	) subsumes any value that is not
1519	FDFD	subsumes any numeric value other than zero.
1520	FDFD	The value
1521	FDFD	fs type="X"/
1524	FDFD	, even if it is not valid.
1534	FDFD	The INV feature, which encodes whether or not a sentence is inverted, allows only the values plus (+) and minus (-). If the feature is not specified, then the default rule (FSD 1 above) says that a value of minus is always assumed. The feature declaration for this feature would be encoded as follows:
1544	FDFD	The value range is specified as an alternation (more precisely, an exclusive disjunction), which can be represented by the
1546	FDFD	feature value. That is, the value must be either true or false, but cannot be both or neither.
1548	FDFD	The CONJ feature indicates the surface form of the conjunction used in a construction. The ~ in the default rule (see FSD 2 above) represents negation. This means that by default the feature is not applicable, in other words, no conjunction is taking place. Note that CONJ not being present is distinct from CONJ being present but having the NIL value allowed in the value range. In their analysis, NIL means that the phenomenon of conjunction is taking place but there is no explicit conjunction in the surface form of the sentence. The feature declaration for this feature would be encoded as follows:
1568	FDFD	is not strictly necessary in this case, since the binary value of
1572	FDFD	The COMP feature indicates the surface form of the complementizer used in a construction. In value range, it is analogous to CONJ. However, its default rule (see FSD 9 above) is conditional. It says that if the verb form is infinitival (the VFORM feature is not mentioned in the rule since it is the only feature that can take INF as a value), and the construction has a subject, then a
1598	FDFD	The AGR feature stores the features relevant to subject-verb agreement. Gazdar et al. specify the range of this feature as CAT. This means that the value is a
1599	FDFD	category
1600	FDFD	, which is their term for a feature structure. This is actually too weak a statement. Not just any feature structure is allowable here; it must be a feature structure for agreement (which is defined in the complete example at the end of the chapter to contain the features of person and number). The following feature declaration encodes this constraint on the value range:
1605	FDFD	That is, the value must be a feature structure of type
1608	FDFD	fsDecl type="Agreement"
1610	FDFD	fDecl name="PERS"
1612	FDFD	fDecl name="NUM"
1615	FDFD	The PFORM feature indicates the surface form of the preposition used in a construction. Since PFORM is specified above as an open set,
1626	FDFD	subsumes any string that is not the empty string.
1646	FDFS	Ensuring the validity of feature structures may require much more than simply specifying the range of allowed values for each feature. There may be constraints on the co-occurrence of one feature value with the value of another feature in the same feature structure or in an embedded feature structure.
1648	FDFS	Such constraints on valid feature structures are expressed as a series of conditional and biconditional tests in the
1652	FDFS	. A particular feature structure is valid only if it meets all the constraints. The
1654	FDFS	element encodes the conventional if-then conditional of boolean logic which succeeds when both the antecedent and consequent are true, or whenever the antecedent is false. The
1656	FDFS	element encodes the biconditional (if and only if) operation of boolean logic. It succeeds only when the corresponding if-then conditionals in both directions are true.
1657	FDFS	In feature structure constraints the antecedent and consequent are expressed as feature structures; they are considered true if they
1660	FDFS	) the feature structure in question, but in the case of consequents, this truth is asserted rather than simply tested. That is to say, a conditional is enforced by determining that the antecedent does not (and will never) subsume the given feature structure, or by determining that the antecedent does subsume the given feature structure, and then unifying the consequent with it (the result of which, if successful, will be subsumed by the consequent). In practice, the enforcement of such constraints can result in periods in which the truth of a constraint with respect to a given feature structure is simply not known; in this case, the constraint must be persistently monitored as the feature structure becomes more informative until either its truth value is determined or computation fails for some other reason.
1675	FDFS	The first constraint says that if a construction is inverted, it must also have an auxiliary and a finite verb form. That is,
1683	FDFS	The second constraint says that if a construction has a BAR value of zero (i.e., it is a sentence), then it must have a value for the features N, V, and SUBCAT. By the same token, because it is a biconditional, if it has values for N, V, and SUBCAT, it must have BAR='0'. That is,
1694	FDFS	The final constraint says that if a construction has a BAR value of 1 (i.e., it is a phrase), then the SUBCAT feature should be absent (~). This is not biconditional, since there are other instances under which the SUBCAT feature is inappropriate. That is,
1830	FSDEF	The elements discussed in this chapter constitute a module of the TEI scheme which is formally defined as follows:
1844	FSDEF	The selection and combination of modules to form a TEI schema is described in
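
The conditional and biconditional tests described in the FDFS rows above (1654 to 1675) can be illustrated with a minimal Python sketch. It is not part of the TEI Guidelines or of any TEI tool. It assumes that feature structures are flat dictionaries of atomic values, and that subsumption reduces to "every feature/value pair of the more general structure also appears in the more specific one". The function names `subsumes`, `cond_holds`, and `bicond_holds` are hypothetical. Unlike the enforcement procedure described in row 1660, the sketch only tests the consequent and does not unify it with the feature structure.

```python
def subsumes(general, specific):
    """Return True if every feature/value pair in `general` also appears in `specific`.

    Assumption: feature structures are flat dicts with atomic values.
    """
    return all(
        name in specific and specific[name] == value
        for name, value in general.items()
    )


def cond_holds(antecedent, consequent, fs):
    """If-then test: holds when the antecedent does not subsume `fs`,
    or when both the antecedent and the consequent subsume it."""
    return (not subsumes(antecedent, fs)) or subsumes(consequent, fs)


def bicond_holds(left, right, fs):
    """If-and-only-if test: the conditional must hold in both directions."""
    return cond_holds(left, right, fs) and cond_holds(right, left, fs)


# First constraint from row 1675: an inverted construction must also have
# an auxiliary and a finite verb form.
antecedent = {"INV": "+"}
consequent = {"AUX": "+", "VFORM": "FIN"}

inverted_ok = {"INV": "+", "AUX": "+", "VFORM": "FIN"}
inverted_bad = {"INV": "+", "AUX": "-", "VFORM": "FIN"}
not_inverted = {"INV": "-", "AUX": "-"}

for fs in (inverted_ok, inverted_bad, not_inverted):
    print(fs, cond_holds(antecedent, consequent, fs))
```

A structure that is not inverted passes the test vacuously, because the antecedent does not subsume it. This matches the statement in row 1654 that the conditional succeeds whenever the antecedent is false.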

PH-PrimarySources.xml#13092

#	id	text
5	PH	provides elements for the encoding of digital facsimiles or images of such materials, while the remainder of the chapter discusses ways of encoding detailed transcriptions of such materials. This module may also be useful in the preparation of critical editions, but the module defined here is distinct from that defined in chapter
7	PH	, but again the present module may be used independently if such data is not required.
13	PH	to the encoding of printed matter or indeed any form of written source, including monumental inscriptions. Similarly, where in the following descriptions terms such as
16	PH	author
18	PH	editor
25	PH	plays a role analogous to the
27	PH	, while in an authorial manuscript, the author and the scribe are the same person.
32	PHFAX	These Guidelines are mostly concerned with the preparation of digital texts in which pre-existing sources are transcribed or otherwise converted into character form, and marked up in XML. However, it is also very common practice to make a different form of
33	PHFAX	digital text
34	PHFAX	, which is instead composed of digital images of the original source, typically one per page, or other written surface. We call such a resource a
35	PHFAX	digital facsimile
36	PHFAX	. A digital facsimile may, in the simplest case, just consist of a collection of images, with some metadata to identify them and the source materials portrayed. It may sometimes contain a variety of images of the same source pages, perhaps of different resolutions, or of different kinds. Such a collection may form part of any kind of document, for example a commentary of a codicological or paleographic nature, where there is a need to align explanatory text with image data. It may also be complemented by a transcribed or encoded version of the original source, which may be linked to the page images. In this section we present elements designed to support these various possibilities and discuss the associated mechanisms provided by these Guidelines.
56	PHFAX	In the simple case where a digital text is composed of page images, the
74	PHFAX	attribute represents the whole of the text following the
78	PHFAX	element. Any convenient milestone element (see further
79	PHFAX	) could be used in the same way; for example if the images represent individual columns, the
81	PHFAX	element might be used. Though simple, this method has some drawbacks. It does not scale well to more complex cases where, for example, the images do not correspond exactly with transcribed pages, or where the intention is to align specific marked up elements with detailed images, or parts of images. The management of information about the images may become more difficult if references to them are scattered through many files rather than being concentrated in a single identifiable location. Nevertheless, this solution may be adequate for many straightforward
97	PHFAX	, which are also provided by this module. These elements make it possible to accommodate multiple images of each page, as well as to record the position and relative size of elements identified on any kind of written surface and to link such elements with digital facsimile images of them. Typical applications include the provision of full text search in
98	PHFAX	digital facsimile editions
99	PHFAX	, and ways of annotating graphics, for example so as to identify individuals appearing in group portraits and link them to data about the people represented.
114	PHFAX	elements may be used to represent a digital facsimile. Either may appear within a TEI document along with, or instead of, the
119	PHFAX	element is designed for the case where the digital facsimile contains only images, whereas the
121	PHFAX	element is for use in the case where such images are complemented by a documentary transcription. In this section, we first discuss the simpler case, returning to the use of the
124	PHFAX	below. When this module is selected, therefore, a legal TEI document may comprise any of the following:
126	PHFAX	a TEI header and a text element
128	PHFAX	a TEI header and a facsimile element
130	PHFAX	a TEI header and a sourceDoc element
132	PHFAX	a TEI header, a facsimile element, and a text element
134	PHFAX	a TEI header, one or more sourceDoc or facsimile elements, and a text element
150	PHFAX	In the simplest case, a facsimile just contains a series of
169	PHFAX	In this simple case, the four page images are understood to represent the complete facsimile, and are to be read in the sequence given. Suppose, however, that the second page of this particular work is available both as an ordinary photograph and as an infra-red image, or in two different resolutions. The
171	PHFAX	element may be used to group the two image files, since these correspond with the same area of the work:
186	PHFAX	element provides a way of indicating that the two images of page2 represent the same surface within the source material. A
187	PHFAX	surface
188	PHFAX	might be one side of a piece of paper or parchment, an opening in a codex treated as a single surface by the writer, a face of a monument, a billboard, a membrane of a scroll, or indeed any two-dimensional surface, of any size.
209	PHFAX	Simply grouping related graphics is not however the main purpose of the
211	PHFAX	element: rather it is to help identify the location and size of the various two-dimensional spaces constituting the digital facsimile. Note that the actual dimensions of the object represented are not provided by the
215	PHFAX	element defines an abstract coordinate space which may be used to address parts of the image. Four attributes supplied by the
223	PHFAX	By default, the same coordinate space is used for a
226	PHFAX	The coordinate space may be thought of as a grid superimposed on a rectangular space. Rectangular areas of the grid are defined as four numbers
227	PHFAX	a b c d
232	PHFAX	points from the origin along the
236	PHFAX	points from the origin along the
239	PHFAX	It may be most convenient to derive a coordinate space from a digital image of the surface in question such that each pixel in the image corresponds with a whole number of units (typically 1) in the coordinate space. In other cases it may be more convenient to use units such as millimetres. Neither practice implies any specific mapping between the coordinate system used and the actual dimensions of the physical object represented.
245	PHFAX	elements, each of which represents a region or
247	PHFAX	defined in terms of the same coordinate space as that of its parent
249	PHFAX	element. A zone may be rectangular or non-rectangular: a rectangular zone is defined by a sequence of four coordinates in the same way as a surface; a non-rectangular zone is defined using the attribute
251	PHFAX	, which provides a sequence of coordinates, each of which specifies a point on the perimeter of the zone.
256	PHFAX	in the same form as that required by the
263	PHFAX	A zone may be used to define any region of interest, such as a detail or illustration, or some part of the surface which is to be aligned with a particular text element, or otherwise distinguished from the rest of the surface. A surface establishes a coordinate system which may be used to address parts or the whole of some digital representation of a written surface. A zone, by contrast, defines any arbitrary area of interest relative to that surface, using the same coordinate system. It might be bigger or smaller than its parent surface, or might overlap its boundaries. The only constraint is that it must be defined using the same coordinate system.
265	PHFAX	When an image of some kind is supplied within either a zone or a surface, the implication is that the image represents the whole of the zone or surface concerned. In the simple case therefore, we might imagine a surface defining a page, within which there is a graphic representing the whole of that page, and a number of zones defining parts of the page, each with its own graphic, each representing a part of the page. If however one of those graphics actually represents an area larger than the page (for example to include a binding or the surface of a desk on which the page rests), then it will be enclosed by a zone with coordinates larger than those of the parent surface.
273	PHFAX	This is an image of a two page spread from a manuscript in the Badische Landesbibliothek, Karlsruhe. We have no information as to the dimensions of the original object, but the low resolution image displayed here contains 500 pixels horizontally and 321 pixels vertically. For convenience, we might map each pixel to one cell of the coordinate space.
274	PHFAX	The coordinate space used here is based on pixels, but the mapping between pixels and units in the coordinate space need not be one-to-one; it might be convenient to define a more delicate grid, to enable us to address much smaller parts of the image. This can be done simply by supplying appropriate values for the attributes which define the coordinate space; for example doubling them all would map each pixel to two grid points in the coordinate space.
279	PHFAX	element corresponding with the area of the image which represents the whole of the two page spread and embed the graphic within it:
315	PHFAX	elements may be used to identify parts of a surface for analytical purposes.
317	PHFAX	The relationship between zone and surface can be quite complex: for example, it may be appropriate to treat the whole of a two page spread as a single written surface, perhaps because particular written zones span both pages. A zone may contain a nested surface, if for example a page has an additional scrap of paper attached to it. A zone may be of any shape, not simply rectangular. Discussion of these and other cases is provided in section
320	PHFAX	In the following extended example, we discuss a hypothetical digital edition of an early 16th century French work, Charles de Bovelles'
323	PHFAX	The image is taken from the collection at
329	PHFAX	element used to contain the whole set of pages, we define a
340	PHFAX	We can now identify distinct zones within the page image using the coordinate scale defined for the surface. In the following figure
348	facs-fig1	Detail of p 49r from Bovelles
351	PHFAX	The following encoding defines each of the four zones identified in the figure above.
365	PHFAX	Note that the location of each zone is defined independently but using the same coordinate system.
381	PHFAX	element has been associated directly with the surface of the page rather than nesting it within a zone. However, it is also possible to include multiple
385	PHFAX	element, if for example a detailed image is available. Since all
389	PHFAX	), there is no need to demonstrate enclosure of one zone within another by means of nesting. To continue the current example, supposing that we have an additional image called
391	PHFAX	containing an additional image of the figure in the third zone above, we might encode that zone as follows:
402	PH-transcr	A digitized source document may contain nothing more than page images and a small amount of metadata. It may also contain an encoded transcription of the pages represented, which may either be
406	PH-transcr	element, or supplied in parallel with a
410	PH-transcr	If the transcription is regarded as a text in its own right, organized and structured independently of its physical realization in the document or documents represented by the facsimile, then the recommended practice is to use the
419	PH-transcr	below. Alternatively, if the transcription is intended not to prioritize representation of the final text so much as the process by which the document came to take its present form, or the physical disposition of its component parts, it may be preferable to present it as an embedding transcription, as further described in section
425	PH-bov	Suppose now that we wish to align a transcription of the page discussed in the preceding section with particular zones. We begin by giving each relevant part of the facsimile an identifier:
492	PH-bov	attribute, which supplies the identifier of the element containing at least the start of the transcribed text found within the surface or zone concerned. Thus, another way of linking this page with its transcription would be simply
546	PHZLAB	When supplied within a
548	PHZLAB	element, these elements may contain transcriptions of the written content of a source in addition to or as an alternative to digital images of them. Such transcription may be placed directly within the
552	PHZLAB	elements, for cases where the writing is linear, in the sense that it is composed of discrete tokens organized physically into groups, typically organized in a sequence corresponding with the way they are intended to be read. Depending on the directionality of the writing system used, this might be any combination of top-down and left to right, or vice versa. The element
554	PHZLAB	may be used to hold a complete group of such tokens. Where, however, the lineation is not considered significant, any group of tokens may be indicated using the
565	PHZLAB	Returning to the preceding example, we might transcribe the content of the zone to which we gave the identifier
598	PHZLAB	As mentioned above, some or all of the written surfaces being transcribed may be composed of physically distinct scraps. In the following example, taken from the Walt Whitman Archive, two pieces of newsprint have been glued to a piece of blue paper on which a poem is being drafted:
601	sleeprs	Single leaf of notes possibly related to the poem eventually titled Sleepers. From the Walt Whitman Archive (Duke 258).
603	PHZLAB	The two pieces of newsprint might simply be regarded as special kinds of zone, but they are also new surfaces, since they might contain additional written zones themselves (such as the numbers in this case).
650	PHZLAB	elements identified in the transcription. The encoder may choose to complement a transcription with graphic representations of its source at whatever level is considered effective, or not at all. Equally, the encoder may choose to provide only graphics without any transcription, to provide only a structured (non-embedded) transcription, or to provide any combination of the three.
654	PHZLAB	element they are to be found, other than the reading order implicit in their sequence. Such information could be added if desired by specifying a coordinate system on the outermost
656	PHZLAB	element, and then indicating values within that system for each of the two fragments, as was discussed above. We discuss this in further detail in section
666	PHST	transcription or a critical edition. In either case they may also wish to include other editorial material, such as comments on the status or possible origin of particular readings, corrections, or text supplied to fill lacunae.
672	PHST	of writing in one or more documents. Transcriptions of this kind are closely focussed on the physical appearance of specific documents, needing to distinguish the traces of different writing activities on them, such as additions and deletions but also other indications of how the writing is to be read, such as indications of transposition, re-affirmation of writing which has been deleted, and so on. Such distinctions are considered of particular importance when dealing with authorial manuscripts, but are also relevant in the case of historical sources such as charters or other legal documents.
674	PHST	In either case, it is customary in transcriptions to register certain features of the source, such as ornamentation, underlining, deletion, areas of damage and lacunae. This chapter provides ways of encoding such information:
676	PHST	methods of recording editorial or other alterations to the text, such as expansion of abbreviations, corrections, conjectures, etc. (section
679	PHST	methods of describing important extra-linguistic phenomena in the source: unusual spaces, lines, page and line breaks, changes of manuscript hand, etc. (section
685	PHST	methods of representing aspects of layout such as spacing or lines
688	PHST	methods of representing material such as running heads, catch-words, and the like (section
696	PHST	, etc. are used to mark writing traces and their functions within the document. Each such element can be assigned to one or more editorially-defined modification groups, termed a
697	PHST	change
700	PHST	attribute, which references a definition for the modification group concerned, typically provided within the TEI header
717	PHST	These recommendations are not intended to meet every transcriptional circumstance likely to be faced by any scholar. Rather, they should be regarded as a base which can be elaborated if necessary by different scholars in different disciplines
720	PHST	As a rule, all elements which may be used in the course of a transcription of a single witness may also be used in a critical apparatus, i.e. within the elements proposed in chapter
721	PHST	. This can generally be achieved by nesting a particular reading containing tagged elements from a particular witness within the
727	PHST	Just as a critical apparatus may contain transcriptional elements within its record of variant readings in various witnesses, one may record variant readings in an individual witness by use of the apparatus mechanisms
737	PHCH	In the detailed transcription of any source, it may prove necessary to record various types of actual or potential alteration of the text: expansion of abbreviations, correction of the text (either by author, scribe, or later hand, or by previous or current editors or scholars), addition, deletion, or substitution of material, and similar matters. The sections below describe how such phenomena may be encoded using either elements defined in the core module (defined in chapter
738	PHCH	) or specialized elements available only when the module described in this chapter is available.
757	PHCO	All of these elements bear additional attributes for specifying who is responsible for the interpretation represented by the markup, and the associated certainty. In addition, some of them bear an attribute allowing the markup to be categorized by type and source.
766	PHCO	The following sections describe how the core elements just named may be used in the transcription of primary source materials.
772	PHAB	The writing of manuscripts by hand lends itself to the use of abbreviation to shorten scribal labour. Commonly occurring letters, groups of letters, words, or even whole phrases, may be represented by significant marks. This phenomenon of manuscript abbreviation is so widespread and so various that no taxonomy of it is here attempted. Instead, methods are shown which allow abbreviations to be encoded using the core elements mentioned above.
774	PHAB	A manuscript abbreviation may be viewed in two ways. One may transcribe it as a particular sequence of letters or marks upon the page: thus, a
775	PHAB	p with a bar through the descender
781	PHAB	per
783	PHAB	re
788	PHAB	In many cases the glyph found in the manuscript source also exists in the Unicode character set: for example the common Latin brevigraph ⁊, standing for
792	PHAB	can be directly represented in any XML document as the Unicode character with code point
803	PHAB	These two methods of coding abbreviation may also be combined. An encoder may record, for any abbreviation, both the sequence of letters or marks which constitutes it, and its sense, that is, the letter or letters for which it is believed to stand. For example, in the following fragment the phrase
805	PHAB	is represented by a sequence of abbreviated characters:
826	PHAB	Note that in each case the
859	PHAB	When abbreviated forms such as these are expanded, two processes are carried out: some characters not present in the abbreviation are added (always), and some characters or glyphs present in the abbreviation are omitted or replaced (often). For example, when the abbreviation
871	PHAB	element surrounds characters or signs such as tittles or tildes, used to indicate the presence of an abbreviation, which are typically removed or replaced by other characters in the expanded form of the abbreviation:
887	PHAB	The content of the
905	PHAB	As implied in the preceding discussion, making decisions about which of these various methods of representing abbreviation to use will form an important part of an encoder's practice. As a rule, the
909	PHAB	elements should be preferred where it is wished to signify that the content of the element is an abbreviation, without necessarily indicating what the abbreviation may stand for. The
913	PHAB	elements should be used where it is wished to signify that the content of the element is not present in the source but has been supplied by the transcriber, without necessarily indicating the abbreviation used in the original. The decision as to which course of action is appropriate may vary from abbreviation to abbreviation; there is no requirement that the same system be used throughout a transcription, although doing so will generally simplify processing. The choice is likely to be a matter of editorial policy. If the highest priority is to transcribe the text
915	PHAB	(letter by letter), while indicating the presence of abbreviations, the choice will be to use
919	PHAB	throughout. If the highest priority is to present a reading transcription, while indicating that some letters or words are not actually present in the original, the choice will be to use
934	PHAB	, a note is attached to an editorial expansion of the tail on the final d of
951	PHAB	The editor might declare a degree of certainty for this expansion, based on the OED examples, and state the responsibility for the expansion:
955	PHAB	The value supplied for the
957	PHAB	attribute should point to the name of the editor responsible for this and possibly other interventions; an appropriate element therefore might be a
959	PHAB	element in the header like the following:
972	PHAB	element only to indicate confidence in the content of the element (i.e. the expansion), and responsibility for suggesting this expansion respectively.
984	PHAB	If it is desired to express aspects of certainty and responsibility for some other aspect of the use of these elements, then the mechanisms discussed in chapter
986	PHAB	for discussion of the issues of certainty and responsibility in the context of transcription.
1025	PHCC	and its correction
1038	PHCC	element is used to provide a corrected form which is
1040	PHCC	present in the source; in the case of a correction made in the source itself, whether scribal, authorial, or by some other hand, the
1053	PHCC	element indicates the transcriber's correction of them. Where the transcriber considers that one or more words have been erroneously omitted in the original source and corrects this omission, the
1058	PHCC	. Thus, in the following example, from George Moore's draft of additional materials for
1072	PHCC	, the choice as to whether to record simply that there is an apparent error, or simply that a correction has been applied, or to record both possible readings within a
1074	PHCC	element is left to the encoder. The decision is likely to be a matter of editorial policy, which might be applied consistently throughout or decided case by case. If the highest priority is to present an uncorrected transcription while noting perceived errors in the original, the choice will typically be to use only
1076	PHCC	throughout. If the highest priority is to present a reading transcription, while indicating that perceived errors in the original have been corrected, the choice will be to use only
1119	PHCC	is used to indicate who is responsible for the proposed emendation. Its value is a pointer, which will typically indicate a
1123	PHCC	element in the header of the transcribed document, but can point anywhere, for example to some online authority file. Using these two attributes, the
1154	PHCC	element. However, if the number of corrections is large and the number of notes is small, it may well be both more practical and more appropriate to regard the collection of annotations as constituting a typology and then use the
1156	PHCC	attribute. Suppose that the note given above is one of half a dozen possible kinds of corrected phenomena identified in a given text; others might include, say,
1157	PHCC	repetition of a word from the preceding line
1162	PHCC	element can be used to specify an arbitrary code for the particular kind of correction (or other editorial intervention) identified within it. This code can be chosen freely and is not treated as a pointer.
1175	PHCC	In addition, the conscientious encoder will provide documentation explaining the circumstances in which particular codes are judged appropriate. A suitable location for this might be within the
1196	PHCC	choice type="substitution" subtype="graphicResemblance"
1203	PHCC	attributes automatically. This is easily done but requires customization of the TEI system using techniques described in
1207	PHCC	When making a correction in a source which forms part of a textual tradition attested by many witnesses, a textual editor will sometimes use a reading from one witness to correct the reading of the source text. In the general case, such encoding is best achieved with the mechanisms provided by the module for textual criticism described in chapter
1214	PHCC	mentioned above, Parkes proposes to emend the problematic word
1223	PHCC	The value of the
1225	PHCC	attribute here is, like the value of the
1227	PHCC	attribute, a pointer, in this case indicating the manuscript used as a witness. Elsewhere in the transcribed text, a list of witnesses used in this text will be given, one of which has an identifier
1229	PHCC	. Each witness will be represented either by a
1266	PHCC	attribute were supplied on the
1268	PHCC	element, it would indicate the person responsible for asserting that the manuscript indicated has this reading, who is not necessarily the same as the person responsible for asserting that this reading should be used to correct the others. Editorial intervention elements such as
1272	PHCC	to provide this additional information:
1283	PHCC	found in Gg is regarded as a correction by Parkes.
1295	PHCC	element, these attributes indicate confidence in and responsibility for identifying the reading within the sources specified; when used on the
1297	PHCC	element they indicate confidence in and responsibility for the use of the reading to correct the base text. If no other source is indicated (either by the
1303	PHCC	), the reading supplied within a
1305	PHCC	has been provided by the person indicated by the
1309	PHCC	If it is desired to express certainty of or responsibility for some other aspect of the use of these elements, then the mechanisms discussed in chapter
1311	PHCC	for further discussion of the issues of certainty and responsibility in the context of transcription.
1317	PHAD	Additions and deletions observed in a source text may be described using the following elements:
1327	PHAD	are included in the core module, while
1331	PHAD	are available only when using the module defined in this chapter. These particular elements are members of the
1338	PHAD	Further characteristics of each addition and deletion, such as the hand used, its effect (complete or incomplete, for example), or its position in a sequence of such operations may conveniently be recorded as attributes of these elements, all of which are members of the
1384	PHAD	attribute may be useful to indicate the classification; when they are classified by the manner in which they were effected, or by their appearance, however, this will lead to a certain arbitrariness in deciding whether to use the
1392	PHAD	attribute be reserved for higher level or more abstract classifications.
1396	PHAD	attribute is also available to indicate the location of an addition. For example, consider the following passage from a draft letter by Robert Graves:
1420	PHAD	above the line, and then deletes it. This may be encoded similarly:
1426	PHAD	has been added and then deleted:
1434	PHAD	, and then changed it; it may be that he inserted other punctuation marks between the letters before replacing them with the centre dots used elsewhere to represent this acronym. We do not deal with these possibilities here, and mention them only to indicate that any encoding of manuscript material of this complexity will need to make decisions about what is and is not worth mentioning.
1442	PHAD	, then deletes
1462	PHAD	elements defined in the core module suffice only for the description of additions and deletions which fit within the structure of the text being transcribed, that is, where each deletion or addition is completely contained by the structural element (paragraph, line, division) within which it occurs. Where this is not the case, for example because an individual addition or deletion involves several distinct structural subdivisions, such as poems or prose items, or otherwise crosses a structural boundary in the text being encoded, special treatment is needed. The
1476	PHAD	element is first declared, within the header of the document, to associate the identifier
1478	PHAD	with Helgi. Each of the added poems is encoded as a distinct
1480	PHAD	element. In the body of the text, an
1482	PHAD	element is placed to mark the beginning of the span of added text, and an
1506	PHAD	several occasions where sequences of whole lines are marked for deletion, either by boxes or by being struck out. If the encoder is marking up individual verse lines with the
1528	PHAD	It is also often the case that deletions and additions may themselves contain other deletions and additions. For example, in Thomas Moore's autograph of the second version of
1543	PHAD	In this case the
1551	PHAD	The text deleted must be at least partially legible, in order for the encoder to be able to transcribe it. If all or part of it is not legible, the
1553	PHAD	element should be used to indicate where text has not been transcribed, because it could not be. The
1556	PHAD	may be used to indicate areas of text which cannot be read with confidence. See further section
1566	PHSU	As we have shown, the simplest method of recording a substitution is to record both the addition and the deletion. However, when the module defined by this chapter is in use, additional elements are available to indicate that the encoder believes the addition and the deletion to be part of the same intervention: a substitution.
1580	PHSU	Since the purpose of this element is solely to group its child elements together, the order in which they are presented is not significant. When both deletion and addition are present, it may not always be clear which occurs first: using the
1590	PHSU	and this is then replaced by
1594	PHSU	This may be encoded as follows, representing the two changes as a sequence of additions and deletions:
1606	PHSU	to record text first added, then deleted in the source. The numbers assigned by the
1608	PHSU	attribute may be used to identify the order in which the various additions and deletions are believed by the encoder to have been carried out, and thus provide a simple method of supporting the kind of
1617	PHSU	The case of a single substitution or scribal correction that involves non-contiguous addition and deletion can be handled by using the
1619	PHSU	element to make an explicit connection between one or more
1627	PHSU	to group this
1633	PHSU	allows the encoder to indicate that additions and deletions separated in this way are part of a single scribal intervention:
1688	PHSU	in the last line is simply marked as a deletion;
1695	PHSU	provides similar facilities, by treating each state of the text as a distinct reading. The
1717	PHCD	An author or scribe may mark a word or phrase in some way, and then on reflection decide to cancel the marking. For example, text may be marked for deletion and the deletion then cancelled, thus restoring the deleted text. Such cancellation may be indicated by the
1723	PHCD	This element bears the same attributes as the other transcriptional elements. These may be used to supply further information such as the hand in which the restoration is carried out, the type of restoration, and the person responsible for identifying the restoration as such, in the same way as elsewhere.
1725	PHCD	Presume that Lawrence decided to restore
1730	PHCD	For I hate this my body
1733	PHCD	first deleted then restored by writing
1740	PHCD	Another feature commonly encountered in manuscripts is the use of circles, lines, or arrows to indicate transposition of material from one point in the text to another. No specific markup for this phenomenon is proposed at this time. Such cases are most simply encoded as additions at the point of insertion and deletions at the point of encirclement or other marking.
1746	PHOM	Where text is not transcribed, whether because of damage to the original, or because it is illegible, or for some other reason such as editorial policy, the
1748	PHOM	core element may be used to register the omission; where such text is transcribed, but the editor wishes to indicate that they consider it to be superfluous, for example because it is an inadvertent scribal repetition, the
1750	PHOM	element may be used in preference. Where text not present in the source is supplied (whether conjecturally or from other witnesses) to fill an apparent gap in the text, the
1760	PHOM	element has no content. It marks a point in the text where nothing at all can be read, whether because of authorial or scribal erasure, physical damage, or any other form of illegibility. Its attributes allow the encoder to specify the amount of text which is illegible in this way at this point, using any convenient units, where this can be determined. For example, in the Beerbohm manuscript of
1762	PHOM	cited above, the author has erased a passage amounting to about 10 cm in length by inking over it completely:
1769	PHOM	The degree of precision attempted when measuring the size of a gap will vary with the purpose of the encoding and the nature of the material: no particular recommendation is made here.
1773	PHOM	element should only be used where text has not been transcribed. If partially legible text has been transcribed, one of the elements
1778	PHOM	); if the text is legible and has been transcribed, but the editor wishes to indicate that they regard it as superfluous or redundant, then the element
1780	PHOM	may be used in preference to the core element
1782	PHOM	used to indicate text regarded as erroneous.
1784	PHOM	Amongst the many examples cited in Hans Krummrey & Silvio Panciera's classic text on the editing of epigraphic inscriptions is the following. In a late classical inscription, the form
1786	PHOM	is encountered. The editor may choose any of the following three possibilities:
1789	PHOM	mark this as an erroneous form
1794	PHOM	additionally supply a corrected form
1802	PHOM	indicate that the erroneous form contains surplus characters which the editor wishes to suppress
1825	PHOM	here are metrically inconsistent with the rest and have been marked by the editor as such.
1827	PHOM	If some part of the source text is completely illegible or missing, an encoder may sometimes wish to supply new (conjectural) material to replace it. This conjectural reading is analogous to a correction in that it contains text provided by the encoder and not attested in the source. This is not however a correction, since no error is necessarily present in the original; for that reason a different element
1830	PHOM	I am dear Sir your very humble Servt Sydney Smith
1831	PHOM	, the text illegible in the autograph might be supplied in the transcription:
1839	PHOM	attributes are used, as elsewhere, to indicate respectively the sigil of a manuscript from which the supplied reading has been taken, and the identifier of the person responsible for deciding to supply the text. If the
1841	PHOM	attribute is not supplied, the implication is that the encoder (or whoever is indicated by the value of the
1843	PHOM	attribute) has supplied the missing reading. Both
1859	PHPH	This section discusses in more detail the representation of aspects of responsibility perceived or to be recorded for the writing of a primary source. These include points at which one scribe takes over from another, or at which ink, pen, or other characteristics of the writing change. A discussion of the usage of the
1870	PHDH	For many text-critical purposes it is important to signal the person responsible (the
1872	PHDH	) for the writing of a whole document, a stretch of text within a document, or a particular feature within the document. A hand, as the name suggests, need not necessarily be identified with a particular known (or unknown) scribe or author; it may simply indicate a particular combination of writing features recognized within one or more documents. The examples given above of the use of the
1874	PHDH	attribute with coding of additions and deletions illustrate this.
1887	PHDH	attribute, may appear in either of two places in the TEI header, depending on which modules are included in a schema. When the
1893	PHDH	element of the TEI header, to hold one or more
1901	PHDH	also becomes available as part of a structured manuscript description. The encoder may choose to place
1903	PHDH	elements identifying individual hands in either location without affecting their accessibility since the element is always addressed by means of its
1907	PHDH	element may be more appropriate when a full cataloguing of each manuscript is required; the
1909	PHDH	element if only a brief characterization of each hand is needed. It is also possible to use the two elements together if, for example, the
1911	PHDH	element contains a single summary describing all the hands discursively, while the
1913	PHDH	element gives specific details of each. The choice will depend on individual encoders' priorities.
1917	PHDH	attribute is available on several elements to indicate the hand in which the content of the element (usually a deletion or addition) is carried out. The
1919	PHDH	element may also be used within the body of a transcription to indicate where a change of hand is detected for whatever reason.
1935	PHDH	A single hand may employ different writing styles and inks within a document, or may change character. For example, the writing style might shift from
1939	PHDH	, or the ink from blue to brown, or the character of the hand may change. Simple changes of this kind may be indicated by assigning a new value to the appropriate attribute within the
1941	PHDH	element. It is for the encoder to decide whether a change in these properties of the writing style is so marked as to require treatment as a distinct hand.
1943	PHDH	Where such a change is to be identified, the
1945	PHDH	attribute indicates the hand applicable to the material following the
1947	PHDH	. The sequence of such
1949	PHDH	elements will often, but not necessarily, correspond with the order in which the material was originally written. Where this is not the case, the facilities described in section
1952	PHDH	As might be expected, a single hand may also vary renditions within the same writing style; for example, medieval scribes often indicate a structural division by emboldening all the words within a line. Such changes should be indicated by use of the
1958	PHDH	In the following example there is a change of ink within a single hand. This is simply indicated by a new value for the
1969	PHDH	In the following example, the encoder has identified two distinct hands within the document and given them identifiers
1973	PHDH	, by means of the following declarations included in the document's TEI header:
1983	PHDH	Then the change of hand is indicated in the text:
1987	PHDH	When a more precise or nuanced discussion of the writing in a manuscript is required, the
2004	PHHR	attributes have similar, but not identical, meanings. Observe their distinctive uses in the following encoding of the William James passage mentioned above in section
2009	PHHR	, and the consequent editorial correction of
2034	PHHR	should be reserved for indicating the hand of any form of marking—here, addition but also deletion, correction, annotation, underlining, etc.—within the primary text being transcribed. The scribal or authorial responsibility for this marking may be inferred from the value of the
2036	PHHR	attribute. The value of the
2038	PHHR	attribute should be a pointer to a hand identifier, typically declared in the document header but potentially in another document or repository (see section
2043	PHHR	attribute, by contrast, indicates the person responsible for deciding to mark up this part of the text with this particular element. In the case of the
2049	PHHR	attribute is supplied) to which hand it should be attributed. In this case, Bowers is credited with identifying the hand as that of William James. In the case of the
2053	PHHR	attribute indicates who is responsible for supplying the intellectual content of the correction reported in the transcription: here, Bowers' correction of
2057	PHHR	. In the case of a deletion, the
2067	PHHR	attributes are defined for a particular element, the two attributes refer to the same aspect of the markup. The one indicates who is intellectually responsible for some item of information, the other indicates the degree of confidence in the information. Thus, for a correction, the
2069	PHHR	attribute signifies the person responsible for supplying the correction, while the
2073	PHHR	attribute signifies the person responsible for supplying the expansion and the
2081	PHHR	attributes with each element is intended to provide for the most frequent circumstances in which encoders might wish to make unambiguous statements regarding the responsibility for and certainty of aspects of their encoding. The
2085	PHHR	attributes, as so defined, give a convenient mechanism for this. However, there will be cases where it is desirable to state responsibility for and certainty concerning other aspects of the encoding. For example, one may wish in the case of an apparent addition to state the responsibility for the use of the
2087	PHHR	element, rather than the responsibility for identifying the hand of the addition. It may also be that one editor makes an electronic transcription of another editor's printed transcription of a manuscript text; here, one will wish to assign layers of responsibility, so as to allow the reader to determine exactly what in the final transcription was the responsibility of each editor. In these complex cases of divided editorial responsibility for and certainty concerning the content, attributes, and application of a particular element, the more general mechanisms for representing certainty and responsibility described in chapter
2091	PHHR	It should be noted that the certainty and responsibility mechanisms described in chapter
2100	PHHR	in line 117 of Chaucer's
2113	PHHR	Exactly the same information could be conveyed using the certainty and responsibility mechanisms, as follows:
2119	PHHR	The choice of which mechanism to use is left to the encoder. In transcriptions where only such statements of responsibility and certainty are made as can be accommodated within the
2127	PHHR	attributes of those elements. Where many statements of responsibility and certainty are made which cannot be so accommodated, it may be economical to use the
2133	PHHR	The above discussion supposes that in each case an encoder is able to specify exactly what they wish to state responsibility for and certainty about. Situations may arise when an encoder wishes to make a statement concerning certainty or responsibility but is unable or unwilling to specify so precisely the domain of the certainty or responsibility. In these cases, the
2137	PHHR	attribute set to
2140	PHHR	resp
2141	PHHR	and the content of the note giving a prose description of the state of affairs.
2148	PHDAMCON	The carrier medium of a primary source may often sustain physical damage which makes parts of it hard or impossible to read. In this section we discuss elements which may be used to represent such situations and give recommendations about how these should be used in conjunction with the other related elements introduced previously in this chapter.
2158	PHDA	) should be used with appropriate attributes where the degree of damage or illegibility in a text is such that nothing can be read and the text must be either omitted or supplied conjecturally or from one or more other sources. In many cases, however, despite damage or illegibility, the text may yet be read with reasonable confidence. In these cases, the following elements should be used:
2181	PHDA	inherits the following additional attribute:
2190	PHDA	In the first line of this leaf, the transcriber may believe that the last three letters of
2198	PHDA	If, as is often the case, the damage crosses structural divisions, so that the
2225	PHDA	element, since it is the whole of the leaf (the text between the two
2230	PHDA	If, as is also likely, the damage affects several disjoint parts of the text, each such part must be marked with a separate
2236	PHDA	attribute may be used as in the following example. In this (imaginary) text of Fitzgerald's translation from Omar Khayyam, water damage has affected an area covering parts of several lines:
2255	PHDA	which may be used to link together arbitrary elements of any kind in the transcription. Here, several phenomena of illegibility and conjecture all result from a single cause: an area of damage to the text caused by rubbing at various points. The damage is not continuous, and affects the text at irregular points. In cases such as this, the join element may be used to indicate which tagged features are part of the same physical phenomenon.
2257	PHDA	If the damage has been so severe as to render parts of the text only imperfectly legible, the
2285	PHDA	element may if desired be enclosed within a
2304	PHDA	Where elements are nested in this way, information about agency, etc. is by default inherited. In the following imaginary example, there is a smoke-damaged part within which two stretches can be read with some difficulty, and a third stretch which cannot be read at all:
2355	PHCOMB	elements may be closely allied in their use. For example, an area of damage in a primary source might be encoded with any one of the first four of these elements, depending on how far the damage has affected the readability of the text. Further, certain of the elements may nest within one another. The examples given in the last sections illustrate something of how these elements are to be distinguished in use. This may be formulated as follows:
2357	PHCOMB	where the text has been rendered completely illegible by deletion or damage and no text is supplied by the editor in place of what is lost: place an empty
2361	PHCOMB	attribute to state the cause (damage, deletion, etc.) of the loss of text.
2363	PHCOMB	where the text has been rendered completely illegible by deletion or damage and text is supplied by the editor in place of what is lost: surround the text supplied at the point of deletion or damage with the
2367	PHCOMB	attribute to state the cause (damage, deletion, etc.) of the loss of text leading to the need to supply the text.
2369	PHCOMB	where the text has been rendered partly illegible by deletion or damage so that the text can be read but without perfect confidence: transcribe the text and surround it with the
2373	PHCOMB	attribute to state the cause (damage, deletion, etc.) of the uncertainty in transcription and the
2377	PHCOMB	where there is deletion or damage but at least some of the text can be read with perfect confidence: transcribe the text and surround it with the
2387	PHCOMB	where there is an area of deletion or damage and parts of the text within that area can be read with perfect confidence, other parts with less confidence, other parts not at all: in transcription, surround the whole area with the
2395	PHCOMB	element. Places within the damaged area where the text has been rendered completely illegible and no text is supplied by the editor may be marked with the
2397	PHCOMB	element. For each element, one may use appropriate attribute values to indicate the cause and type of deletion or damage and the certainty of the reading.
2404	PHCOMB	elements, and for the interpretation of such combinations, are similar:
2407	PHCOMB	if one
2413	PHCOMB	), then the addition
2424	PHCOMB	if one
2435	PHCOMB	if a
2439	PHCOMB	element, the normal interpretation will be that an addition was made within a passage which was later deleted in its entirety:
2444	PHCOMB	if an
2448	PHCOMB	element, the normal interpretation will be that a deletion was made from a passage which had earlier been added:
2459	alterations	Modifications of various kinds (correction, addition, deletion, etc.) are frequently found within a single document, and may also be inferred when different documents are compared, although it may be an open question as to whether inter-document discrepancies
2462	alterations	In this section we discuss a number of elements which may be useful when attempting to record traces of the writing process within a document.
2467	PH-mod	Most, if not all, transcriptional elements imply a certain level of semantic interpretation. For instance, using the
2469	PH-mod	element to encode a word or phrase that occupies interlinear space involves a decision that it has been deliberately inserted as an addition rather than an alternative, and indeed a judgment that it was written after, rather than before, the other lines. Where it is felt desirable to keep the recording of
2472	PH-mod	what is the editor’s interpretation
2484	PH-mod	attribute, but they provide no further interpretation of the function or intention of the passage so marked up. The
2486	PH-mod	attribute may be used to indicate the end of a modified passage if this extends across the boundaries of some other XML element, for example from the middle of one line tagged as a
2515	PH-meta	metamark
2516	PH-meta	we mean marks such as numbers, arrows, crosses, or other symbols introduced by the writer into a document expressly for the purpose of indicating how the text is to be read. Such marks thus constitute a kind of markup of the document, rather than forming part of the text.
2521	PH-meta	Unlike marginal notes or other additions to the text, metamarks are used by the writer to indicate a deliberate alteration of the writing itself, such as
2522	PH-meta	move this passage over there
2523	PH-meta	. An addition or annotation by contrast would typically concern some property of the passage other than its intended location or status within the text flow. A metamark may contain text, or some other graphic which the encoder wishes to represent, or it may simply consist of arrows, dots, lines etc. which the encoder simply describes.
2540	PH-meta	. The passage to which the metamark applies may be indicated in either of two ways: the
2546	PH-meta	itself must be supplied at the position in the document where the passage concerned begins; in the former case it may be supplied at any convenient point. The two attributes should not both be supplied.
2560	PH-meta	. It is thought to function as a metamark, indicating that this sentence forms part of the regulations. A further sentence was then added, while at some later stage the text and also the metamark were deleted. We might encode this as follows:
2596	PH-meta	deletion symbol to left and right of the section. The deletion itself might be encoded by using the normal
2602	PH-meta	element. This is quite a different case from that of the next example, in which the writer does not intend to suppress the content, but only to mark that it has been copied to another manuscript or reused.
2607	PH-meta	From "I am that halfgrown angry boy" (MS q 25), David M. Rubenstein Rare Book & Manuscript Library, Duke University.
2613	PH-meta	signalled by the larger of the two single vertical lines, which shows that the written material has been transferred or re-used, not deleted.
2648	PH-meta	In this example, we class as metamarks both the long vertical line and the annotation
2651	PH-meta	Both metamarks are assumed to indicate that the whole of the written zone with identifier
2659	PH-fix	A writer may sometimes rewrite material a second time without significant change and in the same place. We consider this a distinct activity from addition as usually defined because no new textual material results; instead the status of existing material is reaffirmed. We may distinguish two variants of this:
2674	PH-fix	hastily, and then returned to it to make the letter
2675	PH-fix	l
2719	PH-fix	element is used only for cases where text has been written multiple times. When metamarks and other markup-like strokes have been rewritten multiple times, the
2740	undo	) is provided for the comparatively simple case where a simple deletion is marked as having been subsequently cancelled. The
2742	undo	element discussed here is more widely applicable and may be used for any kind of cancellation. It points to the element or elements which are being cancelled. These components need not be contiguous, provided that the cancellation is clearly a single act; each distinct act of cancellation requires a distinct
2755	undo	We hypothesize that the text has gone through three states or changes, as follows:
2765	undo	This sequence of events might be encoded as follows:
2781	undo	attribute, to delimit the two parts of the deletion which were reverted at change s3. Note that in this case, since
2791	undo	to delimit the two sequences whose deletion is being reverted, and then use the
2817	transpo	occurs when metamarks are found in a document indicating that passages should be moved to a different position. Typically this may be done using arrows, asterisks or numbers, or other means. By definition the result of a transposition is not present in the document, and should not therefore be encoded, if the intention is to represent the actual appearance of the document. Instead, the following elements may be used to indicate the intended reordering:
2851	transpo	element to identify the sections of text being transposed. When (as in the following example) the whole of a line is to be transposed, there is no need to delimit the sections concerned:
2878	transpo	elements may be supplied either embedded within the text or in the
2896	alter	In this example two alternative readings are provided, but no preference is indicated. While the author apparently first composed the line
2902	alter	. The manuscript supplies no indication of which word Moore favours at this point, although in fact, in the first printed edition of
2912	alter	module gives a simple way of encoding the state of this manuscript, as follows:
2946	instantcorr	necessarily implies that the modifications they indicate were made at some time after the original writing. An exception to this is where a false start or
2948	instantcorr	correction has been identified: the author starts to write, and then immediately corrects what has been written.
2954	instantcorr	class to modify this default assumption. When the value of
2956	instantcorr	is set to
2958	instantcorr	, the addition or deletion is considered to belong to the same change as its parent element, while
2960	instantcorr	means some change later than that of its parent.
2962	instantcorr	An example of false start or instant correction can be seen in the following line:
2966	instantcorr	[I am a curse]
2970	instantcorr	in which we can detect the following sequence of events:
2974	instantcorr	is written and then immediately deleted
2983	instantcorr	is then deleted
2991	instantcorr	To indicate that the first of these acts must have taken place during the main act of writing, before the other deletion and additions, we might encode this revision campaign as follows:
3023	PH-surfzone	element is both to identify a specific area containing writing and to provide a two dimensional set of coordinates which can be used to position and provide dimensions for sub-parts of it. Furthermore, surfaces may nest within other surfaces, as in the case of
3025	PH-surfzone	or other written materials attached to the main writing surface. In the general case, the position and dimensions of such nested surfaces will be defined using the same coordinate system as that supplied by the parent
3038	PH-surfzone	when given on the
3040	PH-surfzone	element define the coordinate scheme, rather than specifying the location of that surface. We must therefore introduce an additional
3067	PH-surfzone	element that contains it. This zone, and the preceding one, which contains a sequence of
3073	PH-surfzone	elements occupy a rectangle with coordinates (1,1,10,10), while the nested surface occupies a rectangle with coordinates (4,4,20,20).
3075	PH-surfzone	Now suppose that we wish to define a finer scale grid for the newspaper patch, perhaps because we wish to localize zones within it with greater accuracy. To do this we will need to specify the position of the nested surface as in the previous example, but also to define the new coordinate system. We accomplish this as follows:
3091	PH-surfzone	As before, the second zone defines the position and size of the newspaper patch itself in terms of a coordinate system running from 0 to 50 on both X and Y axes. The nested
3093	PH-surfzone	element however defines a new scale for all of its components, running from 0 to 100 on both X and Y axes. The position of the nested zone containing the text
3099	PH-surfzone	attribute may be used to define non-rectangular zones as a series of points. For example, in the last of the Whitman examples discussed in section
3100	PH-surfzone	above, we might wish to record the exact shape of the zone containing the metamark
3104	PH-surfzone	attribute to indicate the points defining a polygon which contains it. The values used are expressed in terms of a coordinate space running from 0 to 229 in the X dimension, and 0 to 160 in the Y dimension.
3112	PH-surfzone	In exactly the same way, we may wish to identify the curved zone in the following image containing the word
3119	PH-surfzone	This curved zone might be encoded in the following way:
3129	PH-surfzone	does not need to be entirely contained within the two-dimensional space defined by its parent surface. For example, we might wish to encode the example in
3130	PH-surfzone	above not as a surface representing the whole of the two page spread, but as a surface representing only the written part of this opening. The written part appears 50 units from the left of the image and 20 units from the top, while the bottom right corner of the written part appears 400 units from the left of the image, and 280 units from the top. We therefore define the written surface within this image as follows:
3135	PH-surfzone	To describe the whole image, we will now need to define a zone of interest which represents an area larger than this surface. Using the same coordinate system as that defined for the surface, its coordinates are
3137	PH-surfzone	. This zone of interest can be defined by a
3139	PH-surfzone	element, within which we can place the uncropped
3153	PHLAY	The following methods are available to capture general aspects of the layout of material on a page where this is considered important. Within the
3184	PHLAY	s corresponding with each two page opening, for example where it is clear that the writer regarded each such opening as a single writing surface, with written zones or other features crossing the page divide. An example is shown here:
3193	PHLAY	The coloured lines added to this image indicate a number of zones of writing, colour coded to indicate the order in which they were written (purple, then green, then red). For example, the zone marked in red on the left contains a note referring to the purple zone on the right.
3196	PHLAY	This approach assumes that the transcription will primarily be organized in the same way as the physical layout of the source, using embedded transcription elements. Alternatively, where a non-embedded transcription has been provided, using the
3198	PHLAY	element, it is still possible to record gathering breaks, page breaks, column breaks, line breaks, etc. in the source, using the elements described in section
3199	PHLAY	. Detailed metadata about the physical make-up of a source will usually be summarized by the
3209	PHSP	The author or scribe may have left space for a word, or for an initial capital, and for some reason the word or capital was never supplied and the space left empty. The presence of significant space in the text being transcribed may be indicated by the
3214	PHSP	Note that this element should not be used to mark normal inter-word space or the like.
3216	PHSP	In line 694 of Chaucer's
3218	PHSP	in the Holkham manuscript the scribe has left a space for a word where other manuscripts read
3225	PHSP	element discussed in the previous section may be used to supply the text presumed missing:
3229	PHSP	Here, the fact of the space within the manuscript is indicated by the value of the
3231	PHSP	attribute. The source of the supplied text is shown by the value of the
3233	PHSP	attribute as the Hengwrt manuscript; the transcriber responsible for supplying the text is ES.
3239	PHLN	One of the more common forms of modification encountered in written documents of any kind is the presence of lines written under, beside, or through the text. Such lines may be of various types: they may be solid, dashed or dotted, doubled or tripled, wavy or straight, or a combination of these and other renderings. The line may be used for emphasis, or to mark a foreign or technical term, or to signal a quotation or a title, etc.: the elements
3249	PHLN	may be used for these. Where the line has a clear paratextual function the
3251	PHLN	element may be considered more appropriate. Frequently, a scholar may judge that a line is used to delete text: the
3274	PHLN	The above examples presume the common case where a single word or phrase is marked by a line, with no doubt as to where the marking begins or ends and with no overlapping of the area of text with other marked areas of text. Where there is doubt, the
3287	PHLN	Where the area of text marked overlaps other areas of text, for example crossing a structural division, one of the spanning mechanisms mentioned above must be used; for example where the line is thought to mark a deletion, the
3289	PHLN	element may be used. Where it is desired simply to record the marking of a span of text in circumstances where it is not possible to surround the text with a
3299	PHLN	More work needs to be done on clarifying the treatment of other textual features marked by lines which might so overlap or nest. For example, in many Middle English manuscripts (e.g. the Jesus and Digby verse collections), marginal sidebars may indicate metrical structure: couplets may be linked in pairs, with the pairs themselves linked into stanzas. Or, marginal sidebars may indicate emphasis, or may point out a region of text on which there is some annotation: in many manuscripts of Chaucer's
3307	PHLN	element, containing a prose description of the manuscript at this point, enhanced by a link to a visual representation (or facsimile) of the feature in question. For example, in the Chaucer example just cited, one may wish to record that the
3325	PHSK	Such information as page numbers, signatures, or catchwords may be recorded in a specialized
3327	PHSK	element provided for that purpose. Although the name derives from the term
3333	PHSK	element may be used for such features of any document, written or printed. Note that the purpose of this element is to record page numbers etc.
3346	PHSK	: since this information is usually provided by the encoder, it is not subject to the constraint that it should be present only if textually present in the source being encoded. In text-critical situations it may be useful to provide both a normalized version of the pagination and a representation of the catch-word or numbering, especially when the latter presents a variant reading, or is significant for compositor identification.
3361	PHSK	other material repeated from page to page, which falls outside the stream of the text
3386	PH-changes	A major purpose of genetic editing is the identification of
3390	PH-changes	. An editor may wish to assign a set of alterations (deletions, additions, substitutions, transpositions, etc.) or any other act of writing to a particular change, to indicate both that one or more of such phenomena preceded or followed another and also to indicate that they are related in some way, for example that one is a consequence of the other. They might also wish to group together certain revisions, regardless of when they might have occurred, based on a variety of other shared characteristics (e.g., corrections of factual errors or revisions that incorporate suggestions made by a given reader). To document this we need:
3392	PH-changes	a system to assign phenomena to a particular change
3394	PH-changes	a way to characterize a change, in itself and in relation to other changes.
3399	PH-changes	(within the TEI header profile description) contains all information relating to the genesis or production of a text. It may contain a
3401	PH-changes	element which contains a number of
3409	PH-changes	In the following example an editor has identified four distinct changes:
3435	PH-changes	(the default). The attribute specifies whether the order of child elements signifies a temporal order for the revision campaigns which they document. In the example above, the editor has asserted that the four stages distinguished are ordered chronologically according to the order of the
3440	PH-changes	elements can be nested hierarchically. This may be helpful in two cases. Firstly, one can build up hypotheses about related revisions step by step, starting with stages of smaller coverage, whose members are certainly related, and then in a subsequent pass grouping these stages in turn, thereby extending their reach.
3481	PH-changes	In addition to the possibility of ordering text stages in relation to each other,
3483	PH-changes	elements may carry a number of attributes from the
3497	PH-changes	) which allow each stage to be dated as exactly or inexactly as necessary, in the same way as is currently possible for the TEI
3542	PH-changes	element, apart from declaring a distinct change in the creation of the document, may also contain references to other annotations contained within the
3544	PH-changes	or in the document (as shown in the previous example). Such references, along with the textual content, are purely documentary and do not affect the textual stage associated with any element thus referred to. The association of a textual component with a change is always made explicitly, either by using the
3554	PH-changes	element is associated with some element, it is also associated with all of that element's children, unless otherwise indicated, for example by a new value for the
3558	PH-changes	In the following simple example, the text at one stage read
3570	PH-changes	In this example, however, the text originally read
3584	PH-changes	Note that in this case both the deletion and the addition are associated with the second stage. The word
3594	PH-changes	and the like carry an implied semantics concerning the order in which events in the writing of a document were carried out: something which is deleted must have been written before it was deleted; something which is added must have been added at a later stage of the writing. Even when a combination of such elements is used, the chronology can usually be inferred (see further
3595	PH-changes	). Explicit indication of the stage to which some modification belongs is mostly useful in situations where all the alterations identified in a document are to be grouped, for example chronologically.
3599	PH-changes	The interpretation of change assignments for a particular text passage is based on a number of implicit assumptions and constraints which have the effect of minimizing the amount of tagging necessary. The system is also flexible enough to support an explicit distinction between acts of writing and textual alterations, since either of these can be associated with changes described in the encoding. The following example shows an encoding in which the same passage is transcribed twice, once from a documentary perspective, and once from a textual one:
3655	PH-changes	The documentary transcription stresses the writing process, while the textual transcription emphasizes textual alterations. In either case, the change of writing activity associated with a particular feature in the transcript is explicitly indicated. From the documentary perspective, by assigning particular modifications to a specific change, we describe the writing process, in that they specify which segment has been written when
3656	PH-changes	. From the textual perspective, the markup concentrates simply on the existence of textual alterations and makes no explicit claims about the order of writing.
3663	PHTRXX	We repeat the advice given at the beginning of this chapter, that these recommendations are not intended to meet every transcriptional circumstance ever likely to be faced by any scholar. They are intended rather as a base to enable encoding of the most common phenomena found in the course of scholarly transcription of primary source materials. These guidelines particularly do not address the encoding of physical description of textual witnesses: the materials of the carrier, the medium of the inscribing implement, the organisation of the carrier materials themselves (as quiring, collation, etc.), authorial instructions or scribal markup, etc., except insofar as these are involved in the broader question of manuscript description, as addressed by the
3688	PH	The selection and combination of modules to form a TEI schema is described in

HD-Header.xml#13139

#	id	text
2	HD	The TEI Header
4	HD	This chapter addresses the problems of describing an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented. Such documentation is equally necessary for scholars using the texts, for software processing them, and for cataloguers in libraries and archives. Together these descriptions and declarations provide an electronic analogue to the title page attached to a printed work. They also constitute an equivalent for the content of the code books or introductory manuals customarily accompanying electronic data sets.
6	HD	Every TEI-conformant text must carry such a set of descriptions, prefixed to it and encoded as described in this chapter. The set is known as the
7	HD	TEI header
16	HD	, containing a full bibliographical description of the computer file itself, from which a user of the text could derive a proper bibliographic citation, or which a librarian or archivist could use in creating a catalogue entry recording its presence within a library or archive. The term
18	HD	here is to be understood as referring to the whole entity or document described by the header, even when this is stored in several distinct operating system files. The file description also includes information about the source or sources from which the electronic document was derived. The TEI elements used to encode the file description are described in section
25	HD	, which describes the relationship between an electronic text and its source or sources. It allows for detailed description of whether (or how) the text was normalized during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, and similar matters. The TEI elements used to encode the encoding description are described in section
29	HD	text profile
32	HD	, containing classificatory and contextual information about the text, such as its subject matter, the situation in which it was produced, the individuals described by or participating in producing it, and so forth. Such a text profile is of particular use in highly structured composite texts such as corpora or language collections, where it is often highly desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin. The text profile may however be of use in any form of automatic text processing. The TEI elements used to encode the profile description are described in section
36	HD	revision history
39	HD	, which allows the encoder to provide a history of changes made during the development of the electronic text. The revision history is important for
41	HD	and for resolving questions about the history of a file. The TEI elements used to encode the revision description are described in section
45	HD	A TEI header can be a very large and complex object, or it may be a very simple one. Some application areas (for example, the construction of language corpora and the transcription of spoken texts) may require more specialized and detailed information than others. The present proposals therefore define both a
46	HD	core
47	HD	set of elements (all of which may be used without formality in any TEI header) and some additional elements which become available within the header as the result of including additional specialized modules within the schema. When the module for language corpora (described in chapter
48	HD	) is in use, for example, several additional elements are available, as further detailed in that chapter.
50	HD	The next section of the present chapter briefly introduces the overall structure of the header and the kinds of data it may contain. This is followed by a detailed description of all the constituent elements which may be used in the core header. Section
51	HD	, at the end of the present chapter, discusses the recommended content of a minimal TEI header and its relation to standard library cataloguing practices.
53	HD1	Organization of the TEI Header
55	HD11	The TEI Header and Its Components
61	HD11	front matter
62	HD11	of the text itself (for which see section
63	HD11	). A composite text, such as a corpus or collection, may contain several headers, as further discussed below. In the general case, however, a TEI-conformant text will contain a single
71	HD11	The header element has the following description:
76	HD11	element has four principal components:
81	HD11	element is required in all TEI headers; the others are optional. Only one of the four components of the TEI header (the
84	HD11	below. The smallest possible valid TEI Header thus looks like this:
94	HD11	The content of the elements making up a TEI header may be given in any language, not necessarily that of the text to which the header applies, and not necessarily English. As elsewhere, the
96	HD11	attribute should be used at an appropriate level to specify the language. For example, in the following schematic example, an English text has been given a French header:
106	HD11	In the case of language corpora or collections, it may be desirable to record header information either at the level of the individual components in the corpus or collection, or at the level of the corpus or collection itself (more details concerning the tagging of composite texts are given in section
109	HD11	attribute may be used to indicate whether the header applies to a corpus or a single text. A corpus may thus take the form:
144	HD12	Types of Content in the TEI Header
146	HD12	The elements occurring within the TEI header may contain several types of content; the following list indicates how these types of content are described in the following sections:
151	HD12	should be understood to imply a series of paragraphs, each marked as a
165	HD12	) usually enclose a group of specialized elements recording some structured information. In the case of the bibliographic elements, the suffix
171	HD12	. On the relation between the TEI proposals and other standards for bibliographic description, see further section
173	HD12	In most cases grouping elements may contain prose descriptions as an alternative to the set of specialized elements, thus allowing the encoder to choose whether or not the information concerned should be presented in a structured form or in prose.
182	HD12	) enclose information about specific encoding practices applied in the electronic text; often these practices are described in coded form. Typically, such information takes the form of a series of declarations, identifying a code with some more complex structure or description. A declaration which applies to more than one text or division of a text need not be repeated in the header of each such text or subdivision. Instead, the
184	HD12	attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section
197	HD1	Model Classes in the TEI Header
199	HD1	The TEI header provides a very rich collection of metadata categories, but makes no claim to be exhaustive. It is certainly the case that individual projects may wish to record specialized metadata which either does not fit within one of the predefined categories identified by the TEI header or requires a more specialized element structure than is proposed here. To overcome this problem, the encoder may elect to define additional elements using the customization methods discussed in
200	HD1	. The TEI class system makes such customizations simpler to effect and easier to use in interchange.
202	HD1	These classes are specific to parts of the header:
224	HD2	The bibliographic description of a machine-readable or digital text resembles in structure that of a book, an article, or any other kind of textual object. The file description element of the TEI header has therefore been closely modelled on existing standards in library cataloguing; it should thus provide enough information to allow users to give standard bibliographic references to the electronic text, and to allow cataloguers to catalogue it. Bibliographic citations occurring elsewhere in the header, and also in the text itself, are derived from the same model (on bibliographic citations in general, see further section
228	HD2	The bibliographic description of an electronic text should be supplied by the mandatory
288	HD21	It contains the title given to the electronic work, together with one or more optional
295	HD21	element contains the chief name of the electronic work, including any alternative title or subtitles it may have. It may be repeated, if the work has more than one title (perhaps in different languages) and takes whatever form is considered appropriate by its creator. Where the electronic work is derived from an existing source text, it is strongly recommended that the title for the former should be derived from the latter, but clearly distinguishable from it, for example by the addition of a phrase such as
298	HD21	a digital edition
300	HD21	This will distinguish the electronic work from the source text in citations and in catalogues which contain descriptions of both types of material.
302	HD21	The electronic work will also have an external name (its
305	HD21	data set name
306	HD21	) or reference number on the computer system where it resides at any time. This name is likely to change frequently, as new copies of the file are made on the computer system. Its form is entirely dependent on the particular computer system in use and thus cannot always easily be transferred from one system to another. Moreover, a given work may be composed of many files. For these reasons, these Guidelines strongly recommend that such names should
329	HD21	which identify the person(s) responsible for the intellectual or artistic content of an item and any corporate bodies from which it emanates.
331	HD21	Any number of such statements may occur within the title statement. At a minimum, identify the author of the text and (where appropriate) the creator of the file. If the bibliographic description is for a corpus, identify the creator of the corpus.
332	HD21	Optionally include also names of others involved in the transcription or elaboration of the text, sponsors, and funding agencies. The name of the person responsible for physical data input need not normally be recorded, unless that person is also intellectually responsible for some aspect of the creation of the file.
334	HD21	Where the person whose responsibility is to be documented is not an author, sponsor, funding body, or principal researcher, the
340	HD21	element indicating the nature of the responsibility. No specific recommendations are made at this time as to appropriate content for the
344	HD21	Names given may be personal names or corporate names. Give all names in the form in which the persons or bodies wish to be publicly cited. This would usually be the fullest form of the name, including first names.
345	HD21	Agencies compiling catalogues of machine-readable files are recommended to use available authority lists, such as the Library of Congress Name Authority List, for all common personal names.
400	HD22	It contains either phrases or more specialized elements identifying the edition and those responsible for it:
404	HD22	edition
405	HD22	applies to the set of all the identical copies of an item produced from one master copy and issued by a particular publishing agency or a group of such agencies. A change in the identity of the distributing body or bodies does not normally constitute a change of edition, while a change in the master copy does.
409	HD22	is not entirely appropriate, since they are far more easily copied and modified than printed ones; nonetheless the term
410	HD22	edition
411	HD22	may be used for a particular state of a machine-readable text at which substantive changes are made and fixed. Synonymous terms used in these Guidelines are
424	HD22	changes have to be before they are regarded as producing a new edition, rather than a simple update. The general principle proposed here is that the production of a new edition entails a significant change in the intellectual content of the file, rather than its encoding or appearance. The addition of analytic coding to a text would thus constitute a new edition, while automatic conversion from one coded representation to another would not. Changes relating to the character code or physical storage details, corrections of misspellings, simple changes in the arrangement of the contents and changes in the output format do not normally constitute a new edition, whereas the addition of new information (e.g. a linguistic analysis expressed in part-of-speech tagging, sound or graphics, referential links to external data sets) almost always does.
426	HD22	Clearly, there will always be borderline cases and the matter is somewhat arbitrary. The simplest rule is: if you think that your file is a new edition, then call it such. An edition statement is optional for the first release of a computer file; it is mandatory for each later release, though this requirement cannot be enforced by the parser.
430	HD22	changes in a file considered significant, whether or not they are regarded as constituting a new edition or simply a new revision, should be independently noted in the revision description section of the file header (see section
435	HD22	element should contain phrases describing the edition or version, including the word
436	HD22	edition
439	HD22	, or equivalent, together with a number or date, or terms indicating difference from other editions such as
440	HD22	new edition
442	HD22	revised edition
443	HD22	etc. Any dates that occur within the edition statement should be marked with the
453	HD22	elements may also be used to supply statements of responsibility for the edition in question. These may refer to individuals or corporate bodies and can indicate functions such as that of a reviser, or can name the person or body responsible for the provision of supplementary matter, of appendices, etc., in a new edition. For further detail on the
487	HD23	For printed books, information about the carrier, such as the kind of medium used and its size, is of great importance in cataloguing procedures. The print-oriented rules for bibliographic description of an item's medium and extent need some re-interpretation when applied to electronic media. An electronic file exists as a distinct entity quite independently of its carrier and remains the same intellectual object whether it is stored on a magnetic tape, a CD-ROM, a set of floppy disks, or as a file on a mainframe computer. Since, moreover, these Guidelines are specifically aimed at facilitating transparent document storage and interchange, any purely machine-dependent information should be irrelevant as far as the file header is concerned.
497	HD23	Although it is equally system-dependent, some measure of the size of the computer file may be of use for cataloguing and other practical purposes. Because the measurement and expression of file size is fraught with difficulties, only very general recommendations are possible; the element
543	HD23	Note that when more than one
545	HD23	is supplied in a single
558	HD24	element and is mandatory. Its function is to name the agency by which a resource is made available (for example, a publisher or distributor) and to supply any additional information about the way in which it is made available such as licensing conditions, identifying numbers, etc.
562	HD24	These elements form the
564	HD24	class; if the agency making the resource available is unknown, but other structured information about it is available, an explicit statement such as
565	HD24	publisher unknown
569	HD24	publisher
570	HD24	is the person or institution by whose authority a given edition of the file is made public. The
571	HD24	distributor
572	HD24	is the person or institution from whom copies of the text may be obtained. Where a text is not considered formally published, but is nevertheless made available for circulation by some individual or organization, this person or institution is termed the
573	HD24	release authority
576	HD24	Whichever of these elements is chosen, it may be followed by one or more of the following elements, which together form the
596	HD24	elements all supply additional information relating to the publisher, distributor, or release authority immediately preceding them. In the following example, Benson is identified as responsible for distribution of some resource at the date and place cited:
605	HD24	A resource may have (for example) both a publisher and a distributor, or more than one publisher each using different identifiers for the same resource, and so on. For this reason, the sequence of at least one
611	HD24	The following example shows a resource published by one agency (Sigma Press) at one address and date, which is also distributed by another (Oxford Text Archive), with a specified identifier and a different date:
641	HD24	always refers to the date of publication, first distribution, or initial release. If the text was created at some other date, this may be recorded using the
645	HD24	element. Other useful dates (such as dates of collection of data) may be given using a note in the
663	HD24	attribute to point to a location from which the licence document itself may be obtained. Alternatively, the licence document may simply be contained within the
680	HD26	series
683	HD26	A group of separate items related to one another by the fact that each item bears, in addition to its own title proper, a collective title applying to the group as a whole. The individual items may or may not be numbered.
687	HD26	A separately numbered sequence of volumes within a series or serial.
695	HD26	may be used to supply any identifying number associated with the item, including both standard numbers such as an ISSN and particular issue numbers. (Arabic numerals separated by punctuation are recommended for this purpose:
701	HD26	attribute is used to categorize the number further, taking the value
737	HD27	the nature, scope, artistic form, or purpose of the file; also the genre or other intellectual category to which it may belong: e.g.
744	HD27	an abstract or summary of the content of a document which has been supplied by the encoder because no such abstract forms part of the content of the source. This should be supplied in the
751	HD27	summary description providing a factual, non-evaluative account of the subject content of the file: e.g.
758	HD27	bibliographic details relating to the source or sources of an electronic text: e.g.
759	HD27	Transcribed from the Norton facsimile of the 1623 Folio
765	HD27	further information relating to publication, distribution, or release of the text, including sources from which the text may be obtained, any restrictions on its use or formal terms on its availability. These should be placed in the appropriate division of the
771	HD27	ICPSR study number 1803
773	HD27	Oxford Text Archive text number 1243
785	HD27	dates, when they are relevant to the content or condition of the computer file: e.g.
790	HD27	names of persons or bodies connected with the technical production, administration, or consulting functions of the effort which produced the file, if these are not named in statements of responsibility in the title or edition statements of the file description: e.g.
793	HD27	availability of the file in an additional medium or information not already recorded about the availability of documentation: e.g.
796	HD27	language of work and abstract, if not encoded in the
801	HD27	The unique name assigned to a serial by the International Serials Data System (ISDS), if not encoded in an
804	HD27	lists of related publications, either describing the source itself, or concerned with the creation or use of the electronic work, e.g.
808	HD27	Each such item of information may be tagged using the general-purpose
819	HD27	There are advantages, however, to encoding such information with more precise elements elsewhere in the TEI header, when such elements are available. For example, the notes above might be encoded as follows:
847	HD3	element. It is a mandatory element and is used to record details of the source or sources from which a computer file is derived. This might be a printed text or manuscript, another computer file, an audio or video recording of some kind, or a combination of these. An electronic file may also have no source, if what is being catalogued is an original text created in electronic form.
852	HD3	element may contain little more than a simple prose description, or a brief note stating that the document has no source:
864	HD3	These classes make available by default a range of ways of providing bibliographic citations which specify the provenance of the text. For written or printed sources, the source may be described in the same way as any other bibliographic citation, using one of the following elements:
871	HD3	. Using them, a source might be described in very simple terms:
896	HD3	When the header describes a text derived from some pre-existing TEI-conformant or other digital document, it may be simpler to use the following element, which is designed specifically for documents derived from texts which were
912	HD3	class also makes available additional elements when additional modules are included. For example, when the
916	HD3	element may also include the following special-purpose elements, intended for cases where an electronic text is derived from a spoken text rather than a written one:
920	HD3	A single electronic text may be derived from multiple source documents, in whole or in part. The
935	HD3	may be used to associate parts of the encoded text with the bibliographic element from which it derives in either case.
937	HD3	The source description may also include lists of names, persons, places, etc. when these are considered to form part of the source for an encoded document. When such information is recorded using the specialized elements discussed in the
956	HD31	If a computer file (call it B) is derived not from a printed source but from another computer file (call it A) which includes a TEI file header, then the source text of computer file B is another computer file, A. The four sections of A's file header will need to be incorporated into the new header for B in slightly differing ways, as listed below:
957	HD31	fileDesc
964	HD31	profileDesc
969	HD31	encodingDesc
971	HD31	A's encoding practice may or (more likely) may not be the same as B's. Since the object of the encoding description is to define the relationship between the current file and its source, in principle only changes in encoding practice between A and B need be documented in B. The relationship between A and its source(s) is then only recoverable from the original header of A. In practice it may be more convenient to create a new complete
974	HD31	revisionDesc
988	HD5	element is the second major subdivision of the TEI header. It specifies the methods and editorial principles which governed the transcription or encoding of the text in hand and may also include sets of coded definitions used by other components of the header. Though not formally required, its use is highly recommended.
1022	HD51	element may be used to describe, in prose, the purpose for which a digital resource was created, together with any other relevant information concerning the process by which it was assembled or collected. This is of particular importance for corpora or miscellaneous collections, but may be of use for any text, for example to explain why one kind of encoding practice has been followed rather than another.
1048	HD52	the underlying population being sampled
1059	HD52	It may also include a simple description of any parts of the source text included or excluded.
1064	HD52	A sampling declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the
1066	HD52	attribute of each text (or subdivision of the text) to which the sampling declaration applies may be used to supply a cross-reference to it, as further described in section
1079	HD53	It may contain a prose description only, or one or more of a set of specialized elements, members of the TEI
1083	HD53	Some of these policy elements carry attributes to support automated processing of certain well-defined editorial decisions; all of them contain a prose description of the editorial principles adopted with respect to the particular feature concerned. Examples of the kinds of questions which these descriptions are intended to answer are given in the list below.
1091	HD53	Was the text corrected during or after data capture? If so, were corrections made silently or are they marked using the tags described in section
1092	HD53	? What principles have been adopted with respect to omissions, truncations, dubious corrections, alternate readings, false starts, repetitions, etc.?
1099	HD53	Was the text normalized, for example by regularizing any non-standard spellings, dialect forms, etc.? If so, were normalizations performed silently or are they marked using the tags described in section
1100	HD53	? What authority was used for the regularization? Also, what principles were used when normalizing numbers to provide the standard values for the
1110	HD53	How were quotation marks processed? Are apostrophes and quotation marks distinguished? How? Are quotation marks retained as content in the text or replaced by markup? Are there any special conventions regarding for example the use of single or double quotation marks when nested? Is the file consistent in its practice or has this not been checked? See section
1111	HD53	for discussion of ways in which quotation marks may be encoded.
1122	HD53	hyphens? What principle has been adopted with respect to end-of-line hyphenation where source lineation has not been retained? Have soft hyphens been silently removed, and if so what is the effect on lineation and pagination? See section
1123	HD53	for discussion of ways in which hyphenation may be encoded.
1130	HD53	How is the text segmented? If
1134	HD53	segmentation units have been used to divide up the text for analysis, how are they marked and how was the segmentation arrived at?
1153	HD53	Has any analytic or
1155	HD53	information been provided—that is, information which is felt to be non-obvious, or potentially contentious? If so, how was it generated? How was it encoded? If feature-structure analysis has been used, are
1166	HD53	How has the encoding of punctuation marks present in the original source been treated? For example, has it been normalised, or suppressed in favour of descriptive markup? If it has been retained, is it located within or around elements such as
1170	HD53	Any information about the editorial principles applied not falling under one of the above headings should be recorded in a distinct list of items. Experience shows that a full record should be kept of decisions relating to editorial principles and encoding practice, both for future users of the text and for the project which produced the text in the first instance. Some simple examples follow:
1202	HD53	An editorial practices declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the
1204	HD53	attribute of each text (or subdivision of the text) to which it applies may be used to supply a cross-reference to it, as further described in section
1213	HD57	the namespace to which elements appearing within the transcribed text belong.
1215	HD57	how often particular elements appear within the text, so that a recipient can validate the integrity of a text during interchange.
1219	HD57	a default rendition applicable to all instances of an element.
1230	HD57	element consists of an optional sequence of
1232	HD57	elements, each of which must bear a unique identifier, followed by an optional sequence of one or more
1234	HD57	elements, each of which contains a series of
1236	HD57	elements, up to one for each element type from that namespace occurring within the associated
1249	HD57-1	element allows the encoder to specify how one or more elements are rendered in the original source in any of the following ways:
1253	HD57-1	using a standard stylesheet language such as CSS or XSL-FO
1255	HD57-1	using a project-defined formal language
1264	HD57-1	element may be used to indicate a default rendition for all occurrences of the named element
1268	HD57-1	attribute may be used on any element to indicate its rendition, overriding or complementing any supplied default value
1279	HD57-1	elements are by default to be rendered using one set of specifications identified as
1306	HD57-1	As noted above, the content of a
1308	HD57-1	element may describe the appearance of the source material using prose, a project-defined formal language, or any standard languages such as the Cascading Stylesheet Language (
1313	HD57-1	) may be supplied within the
1327	HD57-1	First we define a rendition element for each aspect of the source page rendition that we wish to retain. Details of CSS are given in
1328	HD57-1	; we use it here simply to provide a vocabulary with which to describe such aspects as font size and style, letter and line spacing, colour, etc. Note that the purpose of this encoding is to describe the original, rather than specify how it should be reproduced, although the two are obviously closely linked.
1355	HD57-1	attribute can now be used to specify on any element which of the above rendition features apply to it. For example, a title page might be encoded as follows:
1393	HD57-1	pseudo-elements can be used, often in conjunction with the "content" property, to add characters before or after the element content so that it more closely resembles the appearance of the source.
1395	HD57-1	For example, assuming that a text has been encoded using the
1397	HD57-1	element to enclose passages in quotation marks, but the quotation marks themselves have been routinely omitted from the encoding, a set of renditions such as the following:
1409	HD57-1	element is actually rendered in the source with initial and final quotation marks, it may then be encoded as follows:
1420	HD57-2	element, if present, should contain up to one occurrence of a
1422	HD57-2	element for each element type from the given namespace that occurs within the outermost
1427	HD57-2	In the case of a TEI corpus (
1430	HD57-2	in a corpus header will describe tag usage across the whole corpus, while one in an individual text header will describe tag usage for the individual text concerned.
1433	HD57-2	element may be used to supply a count of the number of occurrences of this element within the text, which is given as the value of its
1435	HD57-2	attribute. It may also be used to hold any additional usage information, which is supplied as running prose within the element itself.
1447	HD57-2	attribute may optionally be used to specify how many of the occurrences of the element in question bear a value for the global
1455	HD57-2	The content of the
1461	HD57-2	attributes, but if it does, then the counts provided must correspond with the number of such elements present in the associated
1474	HD57-1a	The content of the
1476	HD57-1a	element and the value of the
1478	HD57-1a	attribute are expressed using one of a small number of formally defined style definition languages. For ease of processing, it is strongly recommended to use a single such language throughout an encoding project, although the TEI system permits a mixture.
1484	HD57-1a	element, is used to supply the name of the default style definition language. The name is supplied as the value of the
1490	HD57-1a	Informal free text description
1499	HD57-1a	A user-defined formal description language
1503	HD57-1a	attribute may be used to supply the precise version of the style definition language used, and the content of this element, if any, may supply additional information.
1507	HD57-1a	attribute is used, its value must always be expressed using whichever default style definition language is in force. If more than one occurrence of the
1509	HD57-1a	is provided, there will be more than one default available, and the
1522	HD54	It may contain either a series of prose paragraphs or the following specialized elements:
1527	HD54	Note that not all possible referencing schemes are equally easily supported by current software systems. A choice must be made between the convenience of the encoder and the likely efficiency of the particular software applications envisaged, in this context as in many others. For a more detailed discussion of referencing systems supported by these Guidelines, see section
1534	HD54	as a series of pairs of regular expressions and XPaths
1537	HD54	milestone
1538	HD54	s
1545	HD54	element can be included in the header if more than one canonical reference scheme is to be used in the same document, but the current proposals do not check for mutual inconsistency.
1551	HD54P	by a simple prose description. Such a description should indicate which elements carry identifying information, and whether this information is represented as attribute values or as content. Any special rules about how the information is to be interpreted when reading or generating a reference string should also be specified here. Such a prose description cannot be processed automatically, and this method of specifying the structure of a canonical reference system is therefore not recommended for automatic processing.
1592	HD54M	This method is appropriate when only
1593	HD54M	milestone
1597	HD54M	A reference based on milestone tags concatenates the values specified by one or more such tags. Since each tag marks the point at which a value changes, it may be regarded as specifying the
1598	HD54M	refState
1599	HD54M	of a variable. A reference declaration using this method therefore specifies the individual components of the canonical reference as a sequence of
1608	HD54M	might be thought of as representing the state of three variables: the
1610	HD54M	variable is in state
1614	HD54M	variable is in state
1618	HD54M	variable is in state
1620	HD54M	. If milestone tagging has been used, there should be a tag marking the point in the text at which each of the above
1625	HD54M	tag itself, what are here referred to as
1634	HD54M	therefore an application must scan left to right through the text, monitoring changes in the state of each of these three variables as it does so. When all three are simultaneously in the required state, the desired point will have been reached. There may of course be several such points.
1642	HD54M	tags in the text are to be checked for state-changes. A state-change is signalled whenever a new
1644	HD54M	tag is found with
1650	HD54M	element in question. The value for the new state may be given explicitly by the
1654	HD54M	element, or it may be implied, if the
1658	HD54M	For example, for canonical references in the form
1662	HD54M	represents the page number in the first edition, and
1664	HD54M	the line number within this page, a reference system declaration such as the following would be appropriate:
1668	HD54M	This implies that milestone tags of the form
1670	HD54M	will be found throughout the text, marking the positions at which page and line numbers change. Note that no value has been specified for the
1672	HD54M	attribute on the second milestone tag above; this implies that its value at each state change is monotonically increased. For more detail on the use of milestone tags, see section
1677	HD54M	The milestone referencing scheme, though conceptually simple, is not supported by a generic XML parser. Its use places a correspondingly greater burden of verification and accuracy on the encoder.
1687	HD54M	A reference system declaration which applies to more than one text or division of a text need not be repeated in the header of each such text. Instead, the
1689	HD54M	attribute of each text (or subdivision of the text) to which the declaration applies may be used to supply a cross-reference to it, as further described in section
1695	HD55	element is used to group together definitions or sources for any descriptive classification schemes used by other parts of the header. Each such scheme is represented by a
1705	HD55	element has two slightly different, but related, functions. For well-recognized and documented public classification schemes, such as Dewey or other published descriptive thesauri, it contains simply a bibliographic citation indicating where a full description of a particular taxonomy may be found.
1715	HD55	element contains a description of the taxonomy itself as well as an optional bibliographic citation. The description consists of a number of
1717	HD55	elements, each defining a single category within the given typology. The category is defined by the contents of a nested
1719	HD55	element, which may contain either a phrase describing the category, or any number of elements from the
1721	HD55	class. When the corpus module is included in a schema, this class provides the
1723	HD55	element whose components allow the definition of a text type in terms of a set of
1726	HD55	; if the corpus module is not included in a schema, this class is empty and the
1730	HD55	If the category is subdivided, each subdivision is represented by a nested
1732	HD55	element, having the same structure. Categories may be nested to an arbitrary depth in order to reflect the hierarchical structure of the taxonomy. Each
1766	HD55	Linkage between a particular text and a category within such a taxonomy is made by means of the
1771	HD55	. Where the taxonomy permits of classification along more than one dimension, more than one category will be referenced by a particular
1773	HD55	, as in the following example, which identifies a text with the sub-categories
1779	HD55	within the category
1787	HD55	child, when for example the category is described in more than one language, as in the following example:
1821	HDGDECL	The following element is provided to indicate (within the header of a document, or in an external location) that a particular coordinate notation, or a particular datum, has been employed in a text. The default notation is a string containing two real numbers separated by whitespace, of which the first indicates latitude and the second longitude according to the 1984 World Geodetic System (WGS84).
1833	HDSCHSPEC	, it allows embedding of a schema inside a TEI header; alternatively, this element may be used in the
1840	HDSCHSPEC	element contains all the information needed to generate schemas for a particular TEI customization, and the ODD documentation elements, by reference to the TEI, are more succinct than the schemas derived from them. Therefore you may find it convenient to make a copy of the
1844	HDSCHSPEC	itself, in addition to supplying an external schema and/or ODD file; if the XML file becomes separated from its schema, the schema can be regenerated at any time using the information in the
1864	HDAPP	to allow an application to discover that it has previously opened or edited a file, and what version of itself was used to do that;
1866	HDAPP	to show (through a date) which application last edited the file, to allow for diagnosis of any problems that might have been caused by that application;
1868	HDAPP	to allow users to discover information about an application used to edit the file
1870	HDAPP	to allow the application to declare an interest in elements of the file which it has edited, so that other applications or human editors may be more wary of making changes to those sections of the file.
1886	HDAPP	element identifies the current state of one software application with regard to the current file. This element is a member of the
1888	HDAPP	class, which provides a variety of attributes for associating this state with a date and time, or a temporal range. The
1892	HDAPP	attributes should be used to uniquely identify the application and its major version number (for example,
1894	HDAPP	). It is not intended that an application should add a new
1896	HDAPP	each time it touches the file.
1898	HDAPP	The following example shows how these elements might be used to document the fact that version 1.5 of an application called
1916	HDENCOTH	The elements discussed so far are available to any schema. When the schema in use includes some of the more specialized TEI modules, these make available other more module-specific components of the encoding description. These are discussed fully in the documentation for the module in question, but are also noted briefly here for convenience.
1919	HDENCOTH	element is available only when the
1921	HDENCOTH	module is included in a schema. Its purpose is to document the
1924	HDENCOTH	) underlying any analytic
1927	HDENCOTH	) present in the text documented by this header.
1930	HDENCOTH	element is available only when the
1932	HDENCOTH	module is included in a schema. Its purpose is to document any metrical notation scheme used in the text, as further discussed in section
1933	HDENCOTH	. It consists either of a prose description or a series of
1938	HDENCOTH	element is available only when the
1940	HDENCOTH	module is included in a schema. Its purpose is to document the method used to encode textual variants in the text, as discussed in section
1949	HD4	element is the third major subdivision of the TEI header. It is an optional element, the purpose of which is to enable information characterizing various descriptive aspects of a text or a corpus to be recorded within a single unified framework.
1952	HD4	In principle, almost any component of the header might be of importance as a means of characterizing a text. The author of a written text, its title or its date of publication, may all be regarded as characterizing it at least as strongly as any of the parameters discussed in this section. The rule of thumb applied has been to exclude from discussion here most of the information which generally forms part of a standard bibliographic style description, if only because such information has already been included elsewhere in the TEI header.
1958	HD4	element, followed by any number of additional elements taken from the
1960	HD4	class. The default members of this class are the following:
1991	HD4	. Its purpose is to group together a number of
1995	HD4	element can also appear within a structured manuscript description, when the
2000	HD4	element is actually declared within the header module, but is only accessible to a schema when one or other of the
2020	HD4C	element contains phrases describing the origin of the text, e.g. the date and place of its composition.
2023	HD4C	The date and place of composition are often of particular importance for studies of linguistic variation; since such information cannot be inferred with confidence from the bibliographic description of the copy text, the
2025	HD4C	element may be used to provide a consistent location for this information:
2044	HD41	elements, each of which provides information about a single language, notably the quantity of that language present in the text. Note that this element should
2056	HD41	element may be supplied for each different language used in a document. If used, its
2058	HD41	attribute should specify an appropriate language identifier, as further discussed in section
2059	HD41	. This is particularly important if extended language identifiers have been used as the value of
2079	HD43	element is used to classify a text in some way.
2087	HD43	by providing a set of keywords, as provided for example by British Library or Library of Congress Cataloguing in Publication data
2089	HD43	by referencing any other taxonomy of text categories recognized in the field concerned, or peculiar to the material in hand; this may include one based on recurring sets of values for the situational parameters defined in section
2101	HD43	element simply categorizes an individual text by supplying a list of keywords which may describe its topic or subject matter, its form, date, etc. In some schemes, the order of items in the list is significant, for example, from major topic to minor; in others, the list has an organized substructure of its own. No recommendations are made here as to which method is to be preferred. Wherever possible, such keywords should be taken from a recognized source, such as the British Library/Library of Congress Cataloguing in Publication data in the case of printed books, or a published thesaurus appropriate to the field.
2105	HD43	attribute is used to indicate the source of the keywords used, in the case where such a source exists. If the keywords are taken from an externally defined authority which is available online, this attribute should point directly to it, as in the following examples:
2125	HD43	If the authority file is not available online, but is generally recognized and commonly cited, a bibliographic description for it should be supplied within the
2130	HD43	attribute may then reference that
2154	HD43	If no authority file exists, perhaps because the keywords used were assigned directly by an author, the
2158	HD43	Alternatively, if the keyword vocabulary itself is locally defined, the
2172	HD43	element also categorizes an individual text, by supplying a numerical or other code rather than descriptive terms. Such codes constitute a recognized classification scheme, such as the Dewey Decimal Classification. On this element, the
2174	HD43	attribute is required; it indicates the source of the classification scheme in the same way as for keywords: this may be a pointer of any kind, either to a TEI element, possibly in the current document, as in the
2176	HD43	examples above, or to some canonical source for the scheme, as in the following example:
2183	HD43	element categorizes an individual text by pointing to one or more
2192	HD43	) holds information about a particular classification or category within a given taxonomy. Each such category must have a unique identifier, which may be supplied as the value of the
2196	HD43	elements which are regarded as falling within the category indicated.
2198	HD43	A text may, of course, fall into more than one category, in which case more than one identifier may be supplied as the value for the
2205	HD43	attribute may be supplied to specify the taxonomy to which the categories identified by the target attribute belong, if this is not adequately conveyed by the resource pointed to. For example,
2207	HD43	Here the same text has been classified as of categories
2213	HD43	), and as of category
2219	HD43	with multiple identifiers in the value of
2223	HD43	elements, each with a single identifier in the value of
2225	HD43	. However, note that maintenance of a TEI document with a large number of values within a single
2233	HD43	elements is that the values used as identifying codes are exhaustively enumerated for the former, typically within the TEI header. In the latter case, however, the values use any externally-defined scheme, and therefore may be taken from a more open-ended descriptive classification system.
2240	HD4ABS	The main purpose of the
2242	HD4ABS	element is to supply a brief résumé or abstract for an article which was originally published without such a component. An abstract or summary forming part of the document at its creation should usually appear in the front matter (
2265	HD4ABS	The same element may be used to provide other summary information supplied by the encoder, perhaps grouped together into a list of discrete items:
2310	HD44	Each such element contains one or more paragraphs of description for the calendar system concerned, and also supplies an identifying code for it as the value of its
2324	HD44	This identifying code may then be referenced from any element supplying a date expressed using that calendar system:
2348	HD44CD	This information is complementary to the detailed descriptions of physical objects (such as letters) associated with correspondence activities, which are typically provided by the sourceDesc element.
2367	HD44CD	element is used to group references relevant to the item of correspondence being described, typically to other items such as the item to which it is a reply, or the item which replies to it:
2394	HD44CD	to describe the sending of a letter by Adelbert von Chamisso from Vertus on 29 January 1807 to Louis de La Foye at Caen. The date of reception is unknown:
2414	HD44CD	to provide a normalized form of the date. The content of the
2416	HD44CD	element may also be omitted, since no underlying source is being transcribed.
2420	HD44CD	if the action is considered to apply to them all acting as a single group. In the following example two people are considered to have received the communication.
2459	HD44CD	The same person may be associated with many actions. For example, it will often be the case that the author and sender of a message are identical, and that many individual letters will need to be associated with the same person. The
2462	HD44CD	may be used to indicate that the same name applies to many actions. Its value will usually be the identifier of an element defining the person or name concerned, which is supplied elsewhere in the document.
2470	HD44CD	It is assumed that each correspondence action applies to a single act of communication. It may however be the case that the same physical object is involved in several such acts, if for example person A sends a letter to person B, who then annotates it and sends it on to person C, or if persons A and B both use the same document to convey quite different messages. In such situations, multiple
2472	HD44CD	elements should be supplied, one for each communication. In the following example, the same document contains distinct messages, sent by two different people to the same destination:
2520	HD6	The final sub-element of the TEI header, the
2522	HD6	element, provides a detailed change log in which each change made to a text may be recorded. Its use is optional but highly recommended. It provides essential information for the administration of large numbers of files which are being updated, corrected, or otherwise modified as well as extremely useful documentation for files being passed from researcher to researcher or system to system. Without change logs, it is easy to confuse different versions of a file, or to remain unaware of small but important changes made in the file by some earlier link in the chain of distribution. No significant change should be made in any TEI-conformant file without corresponding entries being made in the change log.
2529	HD6	The main purpose of the revision description is to record changes in the text to which a header is prefixed. However, it is recommended TEI practice to include entries also for significant changes in the header itself (other than the revision description itself, of course). At the very least, an entry should be supplied indicating the date of creation of the header.
2531	HD6	The log consists of a list of entries, one for each change. Changes may be grouped and organised using either the
2537	HD6	. Alternatively, a simple sequence of
2543	HD6	may be supplied for each
2545	HD6	element to indicate its date and the person responsible for it respectively. The description of the change itself can range from a simple phrase to a series of paragraphs. If a number is to be associated with one or more changes (for example, a revision number), the global
2628	HD7	The TEI header allows for the provision of a very large amount of information concerning the text itself, its source, its encodings, and revisions of it, as well as a wealth of descriptive information such as the languages it uses and the situation(s) in which it was produced, together with the setting and identity of participants within it. This diversity and richness reflects the diversity of uses to which it is envisaged that electronic texts conforming to these Guidelines will be put. It is emphatically
2630	HD7	intended that all of the elements described above should be present in every TEI Header.
2632	HD7	The amount of encoding in a header will depend on both the nature and the intended use of the text. At one extreme, an encoder may expect that the header will be needed only to provide a bibliographic identification of the text adequate to local needs. At the other, wishing to ensure that their texts can be used for the widest range of applications, encoders will want to document as explicitly as possible both bibliographic and descriptive information, in such a way that no prior or ancillary knowledge about the text is needed in order to process it. The header in such a case will be very full, approximating to the kind of documentation often supplied in the form of a manual. Most texts will lie somewhere between these extremes; textual corpora in particular will tend more to the latter extreme. In the remainder of this section we demonstrate first the minimal, and next a commonly recommended, level of encoding for the bibliographic information held by the TEI header.
2634	HD7	Supplying only the minimal level of encoding required, the TEI header of a single text might look like the following example:
2656	HD7	The only mandatory component of the TEI header is the
2664	HD7	are all required constituents. Within the title statement, a title is required, and an author should be specified, even if it is
2666	HD7	, as should some additional statement of responsibility, here given by the
2670	HD7	, a publisher, distributor, or other agency responsible for the file must be specified. Finally, the source description should contain at the least a loosely structured bibliographic citation identifying the source of the electronic text if (as is usually the case) there is one.
2672	HD7	We now present the same example header, expanded to include additionally recommended information, adequate to most bibliographic purposes, in particular to allow for the creation of an
2674	HD7	-conformant bibliographic record. We have also added information about the encoding principles used in this (imaginary) encoding, about the text itself (in the form of Library of Congress subject headings), and about the revision of the file.
2848	HD7	Many other examples of recommended usage for the elements discussed in this chapter are provided here, in the reference index and in the associated tutorials.
2852	HD8	A strong motivation in preparing the material in this chapter was to provide in the TEI header a viable chief source of information for cataloguing computer files. The TEI header is not a library catalogue record, and so will not make all of the distinctions essential in standard library work. It also includes much information generally excluded from standard bibliographic descriptions. It is the intention of the developers, however, to ensure that the information required for a catalogue record be retrievable from the TEI file header, and moreover that the mapping from the one to the other be as simple and straightforward as possible. Where the correspondence is not obvious, it may prove useful to consult one of the works which were influential in developing the content of the TEI header. These include:
2856	HD8	is an international standard setting out what information should be recorded in a description of a bibliographical item. Until a consolidated edition published in 2011, there was a general standard called ISBD(G) and separate ISBDs covering different types of material, e.g. ISBD(M) for monographs, ISBD(ER) for electronic resources. These separate ISBDs follow the same general scheme as the main ISBD(G), but provide appropriate interpretations for the specific materials under consideration.
2862	HD8	were published in 1978, with revisions appearing periodically through 2005. AACR2 provides guidelines for the construction of catalogues in general libraries in the English-speaking world. AACR2 is explicitly based on the general framework of the ISBD(G) and the subsidiary ISBDs: it gives a description of how to describe bibliographic items and how to create access points such as subject or name headings and uniform titles. Other national cataloguing codes exist as well, including the Z44 series of standards issued by the Association française de normalisation (AFNOR),
2865	HD8	Regole italiane di catalogazione per autore
2876	HD8	Since the TEI file description elements are based on the ISBD areas, it should be possible to use the content of the file description as the basis for a catalogue record for a TEI document. However, cataloguers should be aware that the permissive nature of the TEI Guidelines may lead to divergences between practice in using the TEI file description and the comparatively strict recommendations of AACR2 and other national cataloguing codes. Such divergences as the following may preclude automatic generation of catalogue records from TEI headers:
2878	HD8	The TEI Guidelines do not require that text be transcribed from the
2879	HD8	chief source of information
2880	HD8	using normalized capitalization and punctuation
2883	HD8	The TEI title statement may not categorize constituent titles in the same way as prescribed by a national cataloguing code.
2885	HD8	The TEI title statement contains authors, editors, and other responsible parties in separate elements, with names which may not have been normalized; it does not necessarily contain a single statement of responsibility
2888	HD8	There is no specific place in a TEI header to specify the
2889	HD8	main entry
2893	HD8	name or title headings under which a catalogue record is filed
2896	HD8	The TEI header does not require use of a particular vocabulary for subject headings nor require the use of subject headings.
2900	HD	The TEI Header Module
2904	header	The TEI Header
2913	HD	The selection and combination of modules to form a TEI schema is described in
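
As a rough illustration of the minimal header described in the HD7 rows above (a file description whose title statement, publication statement, and source description are all required), a sketch might look like the following. The element names are supplied from the TEI scheme and do not appear in the extracted text nodes; the title, names, and citation are hypothetical placeholders, not taken from the Guidelines' own example.

```xml
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <!-- fileDesc is the only mandatory component of the header -->
  <fileDesc>
    <titleStmt>
      <!-- a title is required; all values here are hypothetical -->
      <title>An example text: an electronic transcription</title>
      <!-- an author should be given, even if unknown -->
      <author>Unknown</author>
      <!-- an additional statement of responsibility -->
      <respStmt>
        <resp>transcribed by</resp>
        <name>A. N. Encoder</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <!-- a publisher, distributor, or other responsible agency -->
      <distributor>Example Text Archive</distributor>
    </publicationStmt>
    <sourceDesc>
      <!-- a loosely structured citation identifying the source -->
      <bibl>Placeholder citation for the printed source of the text.</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```

A fuller header would add the encoding description, text profile, and revision description discussed in the same rows; they are left out here to keep the sketch at the minimal level.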

TD-DocumentationElements.xml#13168

#	id	text
4	TD	This chapter describes a module which may be used for the documentation of the XML elements and element classes which make up any markup scheme, in particular that described by the TEI Guidelines, and also for the automatic generation of schemas or DTDs conforming to that documentation. It should be used also by those wishing to customize or modify these Guidelines in a conformant manner, as further described in chapters
6	TD	and may also be useful in the documentation of any other comparable encoding scheme, even though it contains some aspects which are specific to the TEI and may not be generally applicable.
13	TD	, and was the name invented by the original TEI Editors for the predecessor of the system currently used for this purpose. See further
16	TD	Like any other piece of XML software, an ODD processor may be instantiated in many ways: the current system uses a number of XSLT stylesheets which are freely available from the TEI, but this specification makes no particular assumptions about the tools which will be used to provide an ODD processing environment.
18	TD	As the name suggests, an ODD processor uses a single XML document to generate multiple outputs. These outputs will include:
23	TD	detailed descriptive documentation, embedding some parts of the formal reference documentation, such as the tag description lists provided in this and other chapters of these Guidelines;
25	TD	declarative code for one or more XML schema languages, such as RELAX NG, W3C Schema, ISO Schematron, or DTD.
30	TD	The input required to generate these outputs consists of running prose, and special purpose elements documenting the components (elements, classes, etc.) which are to be declared in the chosen schema language. All of this input is encoded in XML using elements defined in this chapter. In order to support more than one schema language, these elements constitute a comparatively high-level model which can then be mapped by an ODD processor to the specific constructs appropriate for the schema language in use. Although some modern schema languages such as RELAX NG or W3C Schema natively support self-documentary features of this kind, we have chosen to retain the ODD model, if only for reasons of compatibility with earlier versions of these Guidelines. For reasons of backwards compatibility, the ISO standard XML schema language RELAX NG (
31	TD	) may be used as a means of declaring content models and datatypes, but it is also possible to express content models using natively TEI XML constructs. We also use the ISO Schematron language to define additional constraints beyond those expressed in the content model, as further discussed in
34	TD	In the TEI system, a
38	TD	and has an identifier unique across the whole TEI scheme. For convenience, these specifications are grouped into a number of discrete
40	TD	, which can also be combined more or less as required. Each major chapter of these Guidelines defines a distinct module. Each module declares a number of
43	TD	classes
44	TD	. All classes are available globally, irrespective of the module in which they are declared; particular modules extend the meaning of a class by adding elements or attributes to it. Wherever possible, element content models are defined in terms of classes rather than in terms of specific elements. Modules can also declare particular
46	TD	, which act as short-cuts for commonly used content models or class references.
48	TD	In the present chapter, we discuss the components needed to support this system. In addition, section
49	TD	discusses some general purpose elements which may be useful in any kind of technical documentation, wherever there is need to talk about technical features of an XML encoding such as element names and attributes. Section
54	TD	provides a summary overview of the elements provided by this module.
62	TDphraseTE	In any kind of technical documentation, the following phrase-level elements may be found useful for marking up strings of text which need to be distinguished from the running text because they come from some formal language:
66	TDphraseTE	Like other phrase-level elements used to indicate the semantics of a typographically distinct string, these are members of the
68	TDphraseTE	class. They are available anywhere that running prose is permitted when the module defined by this chapter is included in a schema.
74	TDphraseTE	elements are intended for use when citing brief passages in some formal language such as a programming language, as in the following example:
91	TDphraseTE	A further group of similar phrase-level elements is also defined for the special case of representing parts of an XML document:
101	TDphraseTE	. They are also available anywhere that running prose is permitted when the module defined by this chapter is included in a schema.
103	TDphraseTE	As an example of the recommended use of these elements, we quote from an imaginary TEI working paper:
131	TDphraseTE	element may be used to enclose any kind of example, which will typically be rendered as a distinct block, possibly using particular formatting conventions, when the document is processed. It is a specialized form of the more general
133	TDphraseTE	element provided by the TEI core module. In documents containing examples of XML markup, the
136	TDphraseTE	, since the content of this element can be checked for well-formedness.
140	TDphraseTE	when this module is included in a schema. That class is a part of the general
152	TDphraseEA	Within the body of a document using this module, the following elements may be used to reference parts of the specification elements discussed in section
159	TDphraseEA	TEI practice recommends that a
161	TDphraseEA	listing the elements under discussion introduce each subsection of a module's documentation. The source for the present section, for example, begins as follows:
178	TDphraseEA	element in this example, an ODD processor might simply generate the section number and title of the section referred to, perhaps additionally inserting a link to the section. In a similar way, when processing the
184	TDphraseEA	in this case) from their associated declaration elements: typically, the details recovered will include a brief description of the element and its attributes. These, and other data, will be stored in a specification element elsewhere within the current document, or they may be supplied by the ODD processor in some other way, for example from a database. For this reason, the link to the required specification element is always made using a TEI-defined key rather than an XML IDREF value. The ODD processor uses this key as a means of accessing the specification element required. There is no requirement that this be performed using the XML ID/IDREF mechanism, but there is an assumption that the identifier be unique.
213	TDmodules	As mentioned above, the primary purpose of this module is to facilitate the documentation and creation of an XML schema derived from the TEI Guidelines. The following elements are provided for this purpose:
217	TDmodules	is a convenient way of grouping together element and other declarations, and of associating an externally-visible name with the resulting group. A
218	TDmodules	specification group
219	TDmodules	performs essentially the same function, but the resulting group is not accessible outside the scope of the ODD document in which it is defined, whereas a module can be accessed by name from any TEI schema specification. Elements, and their attributes, element classes, and patterns are all individually documented using further elements described in section
220	TDmodules	below; part of that specification includes the name of the module to which the component belongs.
224	TDmodules	element found. For example, the chapter documenting the TEI module for names and dates contains a module specification like the following:
241	TDmodules	attribute, the value of which is
242	TDmodules	namesdates
245	TDmodules	element above can thus generate a schema fragment for the TEI
249	TDmodules	In most realistic applications, it will be desirable to combine more than one module together to form a complete
251	TDmodules	. A schema consists of references to one or more modules or specification groups, and may also contain explicit declarations or redeclarations of elements (see further
253	TDmodules	The distinction between base and additional tagsets in earlier versions of the TEI scheme has not been carried forward into P5.
256	TDmodules	A schema can combine references to TEI modules with references to other (non-TEI) modules using different namespaces, for example to include mathematical markup expressed using MathML in a TEI document. By default, the effect of combining modules is to allow all of the components declared by the constituent modules to coexist (where this is syntactically possible; where it is not, for example because of name clashes, a schema cannot be generated). It is also possible to override declarations contained by a module, as further discussed in section
264	TDmodules	attribute, and may then be referenced from any point in an ODD document using the
266	TDmodules	element. This is useful if, for example, it is desired to describe particular groups of elements in a specific sequence. Note however that the order in which element declarations appear within the schema code generated from an ODD file element is not in general affected by the order of declarations within a
270	TDmodules	An ODD processor will generate a piece of schema code corresponding with the declarations contained by a
272	TDmodules	element in the documentation being output, and a cross-reference to such a piece of schema code when processing a
274	TDmodules	. For example, if the input text reads
285	TDmodules	then the output documentation will replace the two
287	TDmodules	elements above with a representation of the schema code declaring the elements
297	TDmodules	respectively. Similarly, if the input text contains elsewhere a passage such as
304	TDmodules	then the
306	TDmodules	elements may be replaced by an appropriate piece of reference text such as
331	TDcrystals	Unlike most elements in the TEI scheme, each of these
333	TDcrystals	has a fairly rigid internal structure consisting of a large number of child elements which are always presented in the same order.
334	TDcrystals	Furthermore, since these elements all describe markup objects in broadly similar ways, they have several child elements in common. In the remainder of this chapter, we discuss first the elements which are common to all the specification elements, and then those which are specific to a particular type.
338	TDcrystals	element, but the specification element for any particular component may only appear once (except in the case where a modification is being defined; see further
339	TDcrystals	). The order in which they appear will not affect the order in which they are presented within any schema module generated from the document. In documentation mode, however, an ODD processor will output the schema declarations corresponding with a specification element at the point in the text where they are encountered, provided that they are contained by a
342	TDcrystals	as discussed in the previous section. An ODD processor will also associate all declarations found with the nominated module, thus including them within the schema code generated for that module, and it will also generate a full reference description for the object concerned in a catalogue of markup objects. These latter two actions always occur irrespective of whether or not the declaration is included in a
355	TDcrystalsCE	This section discusses the child elements common to all of the specification elements; some of these are defined in the core module (
373	TDcrystalsCEdc	element may be used to provide a brief explanation for the name of the object if this is not self-explanatory. For example, the specification for the element
375	TDcrystalsCEdc	used to mark arbitrary blocks of text begins as follows:
382	TDcrystalsCEdc	may also be supplied for an attribute name or an attribute value in similar circumstances:
400	TDcrystalsCEdc	element is needed to explain the significance of the identifier for an item only when this is not apparent, for example because it is abbreviated, as in the above example. It should not be used to provide a full description of the intended meaning (this is the function of the
402	TDcrystalsCEdc	element), nor to comment on equivalent values in other schemes (this is the purpose of the
406	TDcrystalsCEdc	attribute value in other languages (this is the purpose of the
412	TDcrystalsCEdc	element provide a brief characterization of the intended function of the object being documented in a form that permits its quotation out of context, as in the following example:
428	TDcrystalsCEdc	Where specifications are supplied in multiple languages, the elements
432	TDcrystalsCEdc	may be repeated as often as needed. Each such description or gloss should carry both an
436	TDcrystalsCEdc	attribute to indicate the language used and the date on which the translated text was last checked against its source.
442	TDcrystalsCEdc	attribute is used to supply a pointer to some location where such external concepts are defined. For example, to indicate that the TEI
444	TDcrystalsCEdc	element corresponds to the concept defined by the CIDOC CRM category E69, the declaration for the former might begin as follows:
458	TDcrystalsCEdc	attributes to point to an implementation of the mapping. This is useful when a TEI customization (see
461	TDcrystalsCEdc	for convenience of data entry or markup readability. For example, suppose that in some TEI customization an element
464	TDcrystalsCEdc	hi rend='bold'
467	TDcrystalsCEdc	element can be converted to canonical TEI by obtaining a filter from the URI specified, and running the procedure with the name
471	TDcrystalsCEdc	attribute specifies the language (in this case XSL) in which the filter is written:
484	TDcrystalsCEdc	element is used to provide an alternative name for an object, for example using a different natural language. Thus, the following might be used to indicate that the
496	TDcrystalsCEdc	may also be referred to using the alternate identifier
512	TDcrystalsCEdc	of a component is identical to the value of its
518	TDcrystalsCEdc	element contains any additional commentary about how the item concerned may be used, details of implementation-related issues, suggestions for other ways of treating related information etc., as in the following example:
534	TDcrystalsCEdc	A specification element will usually conclude with a list of references, each tagged using the standard
538	TDcrystalsCEdc	element: in the case of the
540	TDcrystalsCEdc	element discussed above, the list is as follows:
545	TDcrystalsCEdc	where the value
570	TDeg	attribute may be used on either element to indicate the source from which an example is taken, typically by means of a pointer to an entry in an associated bibliography, as in the following example:
576	TDeg	element should be used. In such a case, it will clearly be necessary to distinguish the markup within the example from the markup of the document itself. In an XML environment, this is easily done by using a different namespace for the content of the
592	TDeg	If the XML contained in an example is not well-formed then it must either be enclosed in a CDATA marked section, or
606	TDeg	element should not be used to tag non-XML examples: the general purpose
616	TDcrystalsCEcl	In the TEI scheme elements are assigned to one or more
617	TDcrystalsCEcl	classes
630	TDcrystalsCEcl	element. It specifies the classes of which the element or class concerned is a member by means of one or more
679	DEFCON	may have three different kinds of content. It may express a content model directly using the TEI elements discussed in the remainder of this section. Alternatively, it may use a schema language of some kind, as defined by a pattern called
680	DEFCON	macro.schemaPattern
682	DEFCON	below. As a third possibility, the legal content for an element may be exhaustively specified using the
687	DEFCON	The following elements are used to define a content model:
707	DEFCON	provides the name of an element which may appear at a certain point in a content model. A
709	DEFCON	provides the name of a class, members of which may appear at a certain point in a content model. A
711	DEFCON	provides the name of a predefined macro, the expansion of which may be inserted at a certain point in a content model.
718	DEFCON	Finally, two wrapper elements are provided to indicate whether the components of a content model form a sequence or an alternation:
731	DEFCON	This is the content model for the macro
733	DEFCON	, which is defined as containing any number (including zero) of elements from the
745	DEFCON	This is the content model for the
747	DEFCON	element, which is defined as a sequence of components, firstly a mandatory
749	DEFCON	, followed by any number (including zero) of elements from the
759	TDTAGCONT	Alternatively, element content models may be defined using RELAX NG patterns, or by expressions in some other schema language, depending on the value of the
760	TDTAGCONT	macro.schemaPattern
769	TDTAGCONT	element appears will have a content model which is expressed in RELAX NG as
770	TDTAGCONT	text
771	TDTAGCONT	, using the RELAX NG namespace. This model will be copied unchanged to the output when RELAX NG schemas are being generated. When an XML DTD is being generated, an equivalent declaration (in this case
787	TDTAGCONT	This is the content model for the
793	TDTAGCONT	The RELAX NG language does not formally distinguish element names, attribute names, class names, or macro names: all names are patterns which are handled in the same way, as the above example shows. Within the TEI scheme, however, different naming conventions are used to distinguish amongst the objects being named. Unqualified names (
794	TDTAGCONT	fileDesc
796	TDTAGCONT	revisionDesc
805	TDTAGCONT	) are always class names. In DTD language, classes are represented by parameter entities (
810	TDTAGCONT	The RELAX NG pattern names generated by an ODD processor by default include a special prefix, the default value for which is set using the
815	TDTAGCONT	The purpose of this is to ensure that the pattern name generated is uniquely identified as belonging to a particular schema, and thus avoid name clashes. For example, in a RELAX NG schema combining the TEI element
822	TDTAGCONT	ident
823	TDTAGCONT	. Most of the time, this behaviour is entirely transparent to the user; the one occasion when it is not will be where a content model (expressed using RELAX NG syntax) needs explicitly to reference either the TEI
829	TDTAGCONT	may be used. For example, suppose that we wish to define a content model for
831	TDTAGCONT	which permits either a TEI
835	TDTAGCONT	defined by some other vocabulary. A suitable content model would be generated from the following
850	TDTAGCONS	element, a set of general
854	TDTAGCONS	attribute) in order that a TEI customization may override, delete or change them individually. Each
863	TDTAGCONS	assertion language
864	TDTAGCONS	, together with a RELAXNG to validate it. The Schematron assertion language provides a powerful way of expressing constraints on the content of any XML document in addition to those provided by other schema languages. Such constraints can be embedded within a TEI schema specification using the methods exemplified in this chapter. An ODD processor will typically process any
866	TDTAGCONS	elements in a TEI specification whose
870	TDTAGCONS	The TEI Guidelines include some additional constraints which are expressed using the ISO Schematron language. A conformant TEI document should respect these constraints, although automatic validation of them may not be possible for all processors. A TEI customization may likewise specify additional constraints using this mechanism. Some examples of what is possible using the Schematron language are given below.
872	TDTAGCONS	Constraints are generally used to model local rules which may be outside the scope of the target schema language. For example, in earlier versions of these Guidelines several constraints on the usage of the attributes of the TEI element
881	TDTAGCONS	may be supplied only if the attribute
884	TDTAGCONS	. Few schema languages support co-occurrence constraints such as the latter. In the current version of the Guidelines, constraint specifications expressed as Schematron rules have been added, as follows:
906	TDTAGCONS	The constraints in the preceding example all related to attributes in the empty namespace, and the Schematron rules therefore did not need to define a TEI namespace prefix. The Schematron language
908	TDTAGCONS	element should be used to do this when a constraint needs to refer to a TEI element, as in the following example, which models the constraint that a TEI
921	TDTAGCONS	Schematron rules are also useful where an application needs to enforce rules on attribute values, as in the following examples which check that various types of
939	TDTAGCONS	As a further example, Schematron may be used to enforce rules applicable to a TEI document which is going to be rendered into accessible HTML, for example to check that some sort of content is available from which the
956	TDTAGCONS	Schematron rules can also be used to enforce other HTML accessibility rules about tables; note here the use of a report and an assertion within one pattern:
973	TDTAGCONS	Constraints can be expressed using any convenient language. The following example uses a pattern matching language called SPITBOL to express the requirement that title and author should be different. However, implementing private schemes of this kind will generally be more problematic than simply adopting a widely deployed system such as ISO Schematron.
988	TDATT	element is used to document information about a collection of attributes, either within an
992	TDATT	. An attribute list can be organized either as a group of attribute definitions, all of which are understood to be available, or as a choice of attribute definitions, of which only one is understood to be available. An attribute list may thus contain nested attribute lists.
998	TDATT	elements are all to be made available, or whether only one of them may be used. For example, the attribute list for the element
1000	TDATT	contains a nested attribute list to indicate that either the
1020	TDATT	element is used to document a single attribute, using an appropriate selection from the common elements already mentioned and the following which are specific to attributes:
1034	TDATT	is used to specify only the attributes which are specific to that particular element. Instances of the element may carry other attributes which are declared by the classes of which the element is a member. These extra attributes, which are shared by other elements, or by all elements, are specified by an
1046	TD-datatypes	element is used to state what kind of value an attribute may have. The TEI defines a number of datatype macros, each with an identifier beginning
1048	TD-datatypes	, which are used in preference to the datatypes available natively from the target schema, since the facilities provided by different schema languages vary so widely. The available TEI datatypes are described in section
1051	TD-datatypes	A TEI schema specification using RELAX NG may choose to define datatypes directly using RELAX NG syntax, for example
1054	TD-datatypes	permits any string of Unicode characters not containing markup, and is thus the equivalent of
1058	TD-datatypes	The RELAX NG language also provides support for a number of more complex cases such as choices or lists.
1059	TD-datatypes	Such usages are permitted by the scheme documented here, but are not recommended when it is desired to remain independent of a particular schema language, since the full generality of one schema language cannot readily be converted to that of another. In the TEI abstract model, datatyping should preferably be carried out either by explicit enumeration of permitted values (using the TEI-specific
1061	TD-datatypes	element described below), by reference to an existing datatype macro, or by definition of a new datatype, using the
1070	TD-datatypes	are provided for the case where an attribute may take more than one value of the type specified. The
1083	TD-datatypes	attribute may take any number of values, each being of the type defined by the TEI
1085	TD-datatypes	macro. As is usual in XML, multiple values for a single attribute are separated by one or more white space characters. Hence, values such as
1098	TDATTvs	element may be used to describe constraints on data content in an informal way: for example
1115	TDATTvs	must take positive integer values less than 150, the datatype
1155	TDATTvs	Where all the possible values for an attribute can be enumerated, the datatype
1173	TDATTvs	element here to explain the otherwise less than obvious meaning of the codes used for these values. Since this value list specifies that it is of type
1181	TDATTvs	attribute will have the value
1212	TDATTvs	The datatype will be
1220	TDATTvs	element) to put constraints on the permitted content of an element, as noted at
1221	TDATTvs	. This use is not however supported by all schema languages, and is therefore not recommended if support for non-RELAX NG systems is a consideration.
1246	TDCLA	A model class specification does not list all of its members. Instead, its members declare that they belong to it by means of a
1252	TDCLA	element for each class of which the relevant element is a member, supplying the name of the relevant class. For example, the
1280	TDCLA	The function of a model class declaration is to provide another way of referring to a group of elements. It does not confer any other properties on the elements which constitute its membership.
1288	TDCLA	classes. In the case of attribute classes, the attributes provided by membership in the class are documented by an
1292	TDCLA	. In the case of model classes, no further information is needed to define the class beyond its description, its identifier, and optionally any classes of which it is a member.
1294	TDCLA	When a model class is referenced in the content model of an element (i.e. in the
1298	TDCLA	), its meaning will depend on the name used to reference the class.
1300	TDCLA	If the reference simply takes the form of the class name, it is interpreted to mean an alternated list of all the current members of the class. For example, suppose that the members of the class
1308	TDCLA	. Then a content model such as
1312	TDCLA	would be equivalent to the explicit content model:
1322	TDCLA	). However, a content model referencing the class as
1324	TDCLA	would be equivalent to the following explicit content model:
1334	TDCLA	The following suffixes, appended with an underscore, can be given to a class name when it is referenced in a content model:
1340	TDCLA	sequence
1342	TDCLA	members of the class are to be provided in sequence
1354	TDCLA	members of the class must be provided one or more times, in sequence
1360	TDCLA	in a content model would be equivalent to:
1384	TDCLA	sequence
1385	TDCLA	in which members of a class appear in a content model when one of the sequence options is used is that in which the elements are declared.
1391	TDCLA	attribute, which can be used to say that this particular model may only be referenced in a content model with the suffixes it specifies. For example, if the
1395	TDCLA	took the form
1396	TDCLA	classSpec ident="model.hiLike" generate="sequence sequenceOptional"
1397	TDCLA	then a content model referring to (say)
1411	TDCLA	defines a small set of attributes common to all elements which are members of that class: those attributes are listed by the
1423	TDCLA	, to which some modules contribute additional attributes when they are included in a schema.
1453	TDENT	element may be used to select a specific named pattern from those available. Patterns are used as a shorthand chiefly to describe common content models and datatypes, but may be used for any purpose. The following elements are used to represent patterns:
1488	TDbuild	specification elements also have an attribute which determines the namespace to which the object being created will belong. In the case of
1490	TDbuild	, this namespace is inherited by all the elements created in the schema, unless they have their own
1496	TDbuild	These attributes are used by an ODD processor to determine how declarations are to be combined to form a schema or DTD, as further discussed in this section.
1498	TDbuild	As noted above, a TEI schema is defined by a
1500	TDbuild	element containing an arbitrary mixture of explicit declarations for objects (i.e. elements, classes, patterns, or macro specifications) and references to other objects containing such declarations (i.e. references to specification groups, or to modules). A major purpose of this mechanism is to simplify the process of defining user customizations, by providing a formal method for the user to combine new declarations with existing ones, or to modify particular parts of existing declarations.
1506	TDbuild	An ODD processor, given such a document, should combine the declarations which belong to the named modules, and deliver the result as a schema of the requested type. It may also generate documentation for the elements declared by those modules. No source is specified for the modules, and the schema will therefore combine the declarations found in the most recent release version of the TEI Guidelines known to the ODD processor in use.
1508	TDbuild	The value specified for the
1510	TDbuild	attribute, when it is supplied as a URL, specifies any convenient location from which the relevant ODD files may be obtained. For the current release of the TEI Guidelines, a URL in the form
1516	TDbuild	. Alternatively, if the ODD files are locally installed, it may be more convenient to supply a value such as
1520	TDbuild	The value for the
1522	TDbuild	attribute may be any form of URI. A set of TEI-conformant specifications in a form directly usable by an ODD processor must be available at the location indicated. When no
1524	TDbuild	value is supplied, an ODD processor may either raise an error or assume that the location of the current release of the TEI Guidelines is intended.
1526	TDbuild	If the source is specified in the form of a private URI, the form recommended is
1530	TDbuild	is a prefix indicating the markup language in use, and
1534	TDbuild	should be used to reference release 1.2.1 of the current TEI Guidelines. When such a URI is used, it will usually be necessary to translate it before such a file can be used in blind interchange.
1542	TDbuild	which allow the encoder to supply an explicit list of elements from the stated module which are to be included or excluded respectively. For example:
1546	TDbuild	The schema specified here will include all the elements supplied by the core module except for
1558	TDbuild	elements from the linking module.
1567	TDbuild	Note that in this last case, there is no need to specify the name of the module in which the two element declarations are to be found; in the TEI scheme, element names are unique across all modules. The module is simply a convenient way of grouping together a number of related declarations.
1578	TDbuild	, which is not defined in the TEI scheme, will be added to the output schema. This element will also be added to the existing TEI class
1580	TDbuild	, and will thus be available in TEI conformant documents.
1590	TDbuild	The effect of this is to redefine the content model for the element
1600	TDbuild	which appear both in the original specification and in the new specification supplied above:
1602	TDbuild	in this example. Note that if the value for
1610	TDbuild	A schema may not contain more than two declarations for any given component. The value of the
1612	TDbuild	attribute is used to determine exactly how the second declaration (and its constituents) should be combined with the first. The following table summarizes how a processor should resolve duplicate declarations; the term
1619	TDbuild	mode value
1627	TDbuild	add
1631	TDbuild	add new declaration to schema; process its children in add mode
1635	TDbuild	add
1659	TDbuild	change
1667	TDbuild	change
1671	TDbuild	process identifiable children according to their modes; process unidentifiable children in replace mode; retain existing children where no replacement or change is provided
1694	ST-aliens	Combining TEI and Non-TEI Modules
1696	ST-aliens	In the simplest case, all that is needed to include a non-TEI module in a schema is to reference its RELAX NG source using the
1702	ST-aliens	(defining Scalable Vector Graphics) are included. To avoid any risk of name clashes, the schema specifies that all TEI patterns generated should be prefixed by the string "TEI_".
1712	ST-aliens	This specification generates a single schema which might be used to validate either a TEI document (with the root element
1714	ST-aliens	), or an SVG document (with a root element
1718	ST-aliens	validate a TEI document containing
1722	ST-aliens	element must become a member of a TEI model class (
1723	ST-aliens	), so that it may be referenced by other TEI elements. To achieve this, we modify the last
1735	ST-aliens	This states that when the declarations from the
1739	ST-aliens	in the TEI module should be extended to include the element
1741	ST-aliens	as an alternative. This has the effect that elements in the TEI scheme which define their content model in terms of that element class (notably
1743	ST-aliens	) can now include it. A RELAX NG schema generated from such a specification can be used to validate documents in which the TEI
1763	TD-LinkingSchemas	This example includes a standard RELAX NG schema, a Schematron schema which might be used for checking that all pointing attributes point at existing targets, and also a link to the TEI ODD file from which the RELAX NG schema was generated. See also
1764	TD-LinkingSchemas	for details of another method of linking an ODD specification into your file by including a
1778	tagdocs	Documentation of TEI modules
1787	TDformal	The selection and combination of modules to form a TEI schema is described in
1808	TDformal	). All of these classes are declared along with the other general TEI classes, in the basic structure module documented in
1815	TDformal	macro.schemaPattern
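
The module selection described in the TDbuild rows above (positions 1542 to 1558) refers to an example that is not captured in this table, because only matching text nodes are listed. The following is a minimal sketch of what such a specification might look like, not the example from the Guidelines themselves. The schema identifier and the particular element names given in the except and include attributes are illustrative assumptions.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="exampleSchema" start="TEI">
  <!-- modules taken over in full -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <!-- everything in core except the elements named here
       (the names are illustrative, not taken from this table) -->
  <moduleRef key="core" except="add del orig reg"/>
  <!-- only the elements named here from linking
       (again, illustrative names) -->
  <moduleRef key="linking" include="linkGrp link"/>
</schemaSpec>
```

In this sketch a module reference with neither attribute brings in every element of the module, while except and include narrow the selection by exclusion or by explicit listing respectively.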

MS-ManuscriptDescription.xml#12922

#	id	text
10	msov	This chapter is based on the work of the European MASTER (Manuscript Access through Standards for Electronic Records) project, funded by the European Union from January 1999 to June 2001, and led by Peter Robinson, then at the Centre for Technology and the Arts at De Montfort University, Leicester (UK). Significant input also came from a TEI Workgroup headed by Consuelo W. Dutschke of the Rare Book and Manuscript Library, Columbia University (USA) and Ambrogio Piazzoni of the Biblioteca Apostolica Vaticana (IT) during 1998-2000.
11	msov	defines a special purpose element which can be used to provide detailed descriptive information about handwritten primary sources. Although originally developed to meet the needs of cataloguers and scholars working with medieval manuscripts in the European tradition, the scheme presented here is general enough that it can also be extended to other traditions and materials, and is potentially useful for any kind of inscribed artefact.
13	msov	The scheme described here is also intended to accommodate the needs of many different classes of encoders. On the one hand, encoders may be engaged in
16	msov	ex nihilo
17	msov	, that is, creating new detailed descriptions for materials never before catalogued. Some may be primarily concerned to represent accurately the description itself, as opposed to the ideas and interpretations the description represents; others may have entirely opposite priorities. At one extreme, a project may simply wish to capture an existing catalogue in a form that can be displayed on the Web, and which can be searched for literal strings, or for such features as titles, authors and dates; at the other, a project may wish to create, in highly structured and encoded form, a detailed database of information about the physical characteristics, history, interpretation, etc. of the material, able to support practitioners of
21	msov	To cater for this diversity, here as elsewhere, these Guidelines propose a flexible strategy, in which encoders must choose for themselves the approach appropriate to their needs, and are provided with a choice of encoding mechanisms to support those differing degrees.
31	msdesc	element of the header of a TEI-conformant document, where the document being encoded is a digital representation of some manuscript original, whether as an encoded transcription, as a collection of digital images (as described in
32	msdesc	), or as some combination of the two. However, in cases where the document being encoded is essentially a collection of manuscript descriptions, the
40	msdesc	) making up the TEI element class
50	msdesc	element has the following components, which provide more detailed information under a number of headings. Each of these component elements is further described in the remainder of this chapter.
66	msdesc	), and then either one or more paragraphs, marked up as a series of
80	msdesc	). These elements are all optional, but if used they must appear in the order given here. Finally, in the case of a composite manuscript, a full description may also contain one or more
95	msdesc	The simplest way of digitizing this catalogue entry would simply be to key in the text, tagging the relevant parts of it which make up the mandatory
118	msdesc	and add some of the additional phrase-level elements available when this module is in use:
160	msdesc	Note that in this version the text has been slightly reorganized, but no actual rewriting has been necessary. The encoding now allows the user to search for such features as title, material, and date and place of origin; it is also possible to distinguish quoted material from descriptive passages and to search within descriptions relating to a particular topic (for example, history as distinct from material).
162	msdesc	This process could be continued further, restructuring the whole entry so as to take full advantage of many more of the encoding possibilities provided by the module described in this chapter:
279	msphrase	Within a manuscript description, many other standard TEI phrase level elements are available, notably those described in the Core module (
297	msdates	elements respectively, used to indicate specifically the date and place of origin of a manuscript or manuscript part. Such information would normally be encoded within the
304	msdates	can also be used to identify the place or date of origin of any aspect of the manuscript, such as its decoration or binding, when these are not of the same date or from the same location as the rest of the manuscript. Both these elements are members of the
312	msdates	class, and may thus also carry additional attributes giving normalized values for the associated dating.
320	msmat	element can be used to tag any specific term used for the physical material of which a manuscript (or binding, seal, etc.) is composed. The
322	msmat	element may be used to tag any term specifying the type of object or manuscript upon which the text is written.
327	msmat	These elements may appear wherever a term regarded as significant by the encoder occurs, as in the following examples:
356	mswat	These elements may appear wherever a term regarded as significant by the encoder occurs. The
369	mswat	element will typically appear when text from the source is being transcribed, for example within a rubric in the following case:
385	mswat	If, as here, any text contained by a stamp is included in its description it should be clearly distinguished from that description. The element
395	msdim	element can be used to specify the size of some aspect of the manuscript, and thus may be thought of as a specialized form of the existing TEI
403	msdim	element will normally occur within the element describing the particular feature or aspect of a manuscript whose dimensions are being given; thus the size of the leaves would be specified within the
410	msdim	), while the dimensions of other specific parts of a manuscript, such as accompanying materials, binding, etc., would be given in other parts of the description, as appropriate.
438	msdim	are used only when the measurement applies to several items, for example the size of all leaves in a manuscript; attributes
442	msdim	are used when the measurement applies to a single item, for example the size of a specific codex, but has had to be estimated. Attribute
444	msdim	is used when the measurement can be given exactly, and applies to a single item; this is the usual situation. In this case, the units in which dimensions are measured may be specified using the
446	msdim	attribute, which will normally take a value from a closed set appropriate to the project, using standard units of measurement wherever possible, such as the following values:
453	msdim	line
455	msdim	char
456	msdim	. If however the only data available for the measurement uses some other unit, or it is preferred to normalize it in some other way, then it may be supplied as a string value by means of the
464	msdim	More usually, the measurement will be normalized into a value and an appropriate SI unit:
466	msdim	Where the exact value is uncertain, the attributes
474	msdim	It is often convenient to supply a measurement which applies to a number of discrete observations: for example, the number of ruled lines on the pages of a manuscript (which may not all be the same), or the diameter of an object like a bell, which will differ depending on where it is measured. In such cases, the
488	msdim	element may be repeated as often as necessary, with appropriate attribute values to indicate the nature and scope of the measurement concerned. For example, in the following case the leaf size and ruled space of the leaves of the manuscript are specified:
498	msdim	This indicates that for most leaves of the manuscript being described the ruled space is 90 mm high and 48 mm wide, while the leaves throughout are between 157 and 160 mm in height and 105 mm in width.
502	msdim	element is provided for cases where some measurement other than height, width, or depth is required. Its
514	msdim	element may be supplied is not constrained.
525	msloc	element, used to indicate a location, or sequence of locations, within a manuscript.
532	msloc	element is used to reference a single location within a manuscript, typically to specify the location occupied by the element within which it appears. If, for example, it is used as the first component of a
537	msloc	below) then it is understood to specify the location (or locations) of that item within the manuscript being described.
543	msloc	element can be used to identify any reference to one or more folios within a manuscript, wherever such a reference is appropriate. Locations are conventionally specified as a sequence of folio or page numbers, but may also be a discontinuous list, or a combination of the two. This specification should be given as the content of the
553	msloc	A normalized form of the location can also be supplied, using special purpose attributes on the
563	msloc	When the item concerned occupies a discontinuous sequence of pages, this may simply be indicated in the body of the
572	msloc	Alternatively, if it is desired to indicate normalized values for each part of the sequence, a sequence of
587	msloc	Finally, the content of the
589	msloc	element may be omitted if a formatting application can construct it automatically from the values of the
609	msloc	attribute can also be used to associate a location within a manuscript with facsimile images of that location, using the
611	msloc	attribute, or with a transcription of the text occurring at that location. The former association is effected by means of the
619	msloc	is available only when the
640	msloc	attribute uses a URI reference to point directly to images of the relevant pages. This method may be found cumbersome when many images are to be associated with a single location. It is of most use when specific pages are referenced within a description, as in the following example:
690	msloc	When (as in this example) a sequence of elements is to be supplied as target value, it may be given explicitly as above, or using the XPointer range() syntax defined at
691	msloc	. Note however that support for this pointer mechanism is not widespread in current XML processing systems.
695	msloc	attribute should only be used to point to elements that contain or indicate a transcription of the locus being described. To associate a
706	msloc	attribute may be used to distinguish them. For example, MS 65 Corpus Christi College, Cambridge contains two fly leaves bearing music. These leaves have modern foliation 135 and 136 respectively, but are also marked with an older foliation. This may be preserved in an encoding such as the following:
721	msloc	attribute should be supplied on the
742	msnames	The standard TEI element
769	msnames	name
770	msnames	, not the person, place, or organization to which that name refers. In the last example above, the
772	msnames	attribute is used to associate the name with a more detailed description of the person named. This is provided by means of the
774	msnames	element, which becomes available when the
777	msnames	is included in a schema. An element such as the following might then be used to provide detailed information about the person indicated by the name:
792	msnames	element must be provided for each distinct
794	msnames	value specified. For example, in the case above, the value
800	msnames	element; the same value will be used as the
808	msnames	attribute may be used to supply a unique identifying code for the person referenced by the name independently of both the existence of a
810	msnames	element and the use of the standard URI reference mechanism. If, for example, a project maintains as its authority file some non-digital resource, or uses a database which cannot readily be integrated with other digital resources for this purpose, the unique codes used by such
815	msnames	, interchange is improved by use of tag URIs in
823	msnames	elements referenced by a particular document set should be collected together within a
826	msnames	element, located in the TEI header. This functions as a kind of prosopography for all the people referenced by the set of manuscripts being described, in much the same way as a
828	msnames	element in the back matter may be used to hold bibliographic information for all the works referenced.
843	msmisc	element is used to describe one method by which correct ordering of the quires of a codex is ensured. Typically, this takes the form of a word or phrase written in the lower margin of the last leaf verso of a gathering, which provides a preview of the first recto leaf of the successive gathering. This may be a simple phrase such as the following:
859	msmisc	element can be used for either leaf signatures, or a combination of quire and leaf signatures, whether the marking is alphabetic, alphanumeric, or some ad hoc system, as in the following more complex example:
869	msmisc	) taken from a specific known point in a codex (for example the first few words on the second leaf). Since these words will differ from one copy of a text to another, the practice of using them when cataloguing a manuscript originated in the Middle Ages, in order to distinguish individual copies of a work in a way that its opening words could not.
878	mshera	Descriptions of heraldic arms, supporters, devices, and mottos may appear at various points in the description of a manuscript, usually in the context of ownership information, binding descriptions, or detailed accounts of illustrations. A full description may also contain a detailed account of the heraldic components of a manuscript independently considered. Frequently, however, heraldic descriptions will be cited as short phrases within other parts of the record. The phrase level element
919	msid	element is intended to provide an unambiguous means of uniquely identifying a particular manuscript. This may be done in a structured way, by providing information about the holding institution and the call number, shelfmark, or other identifier used to indicate its location within that institution. Alternatively, or in addition, a manuscript may be identified simply by a commonly used name.
923	msid	A manuscript's actual physical location may occasionally be different from its place of ownership; at Cambridge University, for example, manuscripts owned by various colleges are kept in the central University Library. Normally, it is the ownership of the manuscript which should be specified in the manuscript identifier, while additional or more precise information on the physical location of the manuscript can be given within the
938	msid	These elements are all structurally equivalent to the standard TEI
940	msid	element with an appropriate value for its
948	msid	and they must, if present, appear in the order given.
958	msid	to reference a single standardized source of information about the entity named.
969	msid	Major manuscript repositories will usually have a preferred form of citation for manuscript shelfmarks, including rules about punctuation, spacing, abbreviation, etc., which should be adhered to. Where such a format also contains information which might additionally be supplied as a distinct subcomponent of the
971	msid	, for example a collection name, a decision must be taken as to whether to use the more specific element, or to include such information within the
1012	msid	In the former example, the preferred form of the identifier can be retrieved by prefixing the content of the
1028	msid	might be considered helpful in some circumstances (if, for example, some of the items in the Ellesmere collection had shelfmarks which did not begin
1032	msid	In some cases the shelfmark may contain no information about the collection; in other cases, the item may be regarded as belonging to more than one collection. The
1070	msid	Note in the latter case the use of the
1072	msid	element to provide a common name other than the shelfmark by which a manuscript is known. Where a manuscript has several such names, more than one of these elements may be used, as in the following example:
1090	msid	attribute has been used to specify the language of the alternative names.
1092	msid	In very rare cases a repository may have only one manuscript (or only one of any significance), which will have no shelfmark as such but will be known by a particular name or names. In such circumstances, the
1094	msid	element may be omitted, and the manuscript identified by the name or names used for it, using one or more
1111	msid	Where manuscripts have moved from one institution to another, or even within the same institution, they may have identifiers additional to the ones currently used, such as former shelfmarks, which are sometimes retained even after they have been officially superseded. In such cases it may be useful to supply an alternative identifier, with a detailed structure similar to that of the
1115	msid	in the collection of the Duque de Osuna, but which now has the shelfmark
1139	msid	, except in cases where a manuscript is likely still to be referred to or known by its former identifier. For example, an institution may have changed its call number system but still wish to retain a record of the earlier number, perhaps because the manuscript concerned is frequently cited in print under its previous number:
1153	msid	Where (as in this example) no repository is specified for the
1157	msid	. Where the holding institution has only one preferred form of citation but wishes to retain the other for internal administrative purposes, the secondary could be given within
1159	msid	with an appropriate value on the
1182	msid	, substantial parts of which are to be found in three separate repositories, in Ljubljana, Warsaw, and St. Petersburg. This should be represented using three distinct
1184	msid	elements, using an appropriate value on the type attribute to indicate that these three identifiers are not alternate ways of referring to the same physical object, but three parts of the same entity.
1217	msid	As mentioned above, the smallest possible description is one that contains only the element
1241	msdo	. This will often have been enough to identify a manuscript in a small collection because the identity of the author is implicit. Where a title does not imply the author, and is thus insufficient to identify the main text of a manuscript, the author should be stated explicitly (e.g.
1245	msdo	). Many inventories of manuscripts consist of no more than an author and title, with some form of copy-specific identifier, such as a shelfmark or
1253	msdo	); information on date and place of writing will sometimes also be included. The standard TEI element
1258	msdo	In this way the cataloguer or scholar can supply in one place a minimum of essential information, such as might be displayed or printed as the heading of a full description. For example:
1276	msdo	element is intended principally to contain a heading. More structured information concerning the contents, physical form, or history of the manuscript should be given within the specialized elements described below,
1284	msdo	element may also be used to supply an unstructured collection of such information, as in the example given above (
1293	msco	element is used to describe the intellectual content of a manuscript or manuscript part. It comprises
1295	msco	a series of informal prose paragraphs
1297	msco	a series of
1301	msco	elements, each of which provides a more detailed description of a single item contained within the manuscript. These may be prefaced, if desired, by a
1325	msco	This description may of course be expanded to include any of the TEI elements generally available within a
1394	msco	elements if it is desired to provide both a general summary of the contents of a manuscript and more detail about some or all of the individual items within it. It may not however be used within an individual
1419	mscoit	Each discrete item in a manuscript or manuscript part can be described within a distinct
1464	mscoit	is that in the former, the order and number of child elements are not constrained; any element, in other words, may be given in any order, and repeated as often as is judged necessary. In the latter, however, the sub-elements, if used, must be given in the order specified above and only some of them may be repeated; specifically,
1480	mscoit	may contain untagged running text, both permit an unstructured description to be provided in the form of one or more paragraphs of text. They differ in this respect also: if paragraphs are supplied as the content of an
1482	mscoit	, then none of the other component elements listed above is permitted; in the
1490	mscoit	elements may also nest, where a number of separate items in a manuscript are grouped under a single title or rubric, as is the case, for example, with a work like
1549	mscoit	; they are available only when the
1563	msat	element should be used to supply a regularized form of the item's title, as distinct from any rubric quoted from the manuscript. If the item concerned has a standardized distinctive title, e.g.
1565	msat	, then this should be the form given as content of the
1567	msat	element, with the value of the
1571	msat	. If no uniform title exists for an item, or none has been yet identified, or if one wishes to provide a general designation of the contents, then a
1572	msat	supplied
1573	msat	title can be given, e.g.
1575	msat	, in which case the
1579	msat	should be given the value
1580	msat	supplied
1583	msat	Similarly, if used within a manuscript description, the
1585	msat	element should always contain the normalized form of an author's name, irrespective of how (or whether) this form of the name is cited in the manuscript. If it is desired to retain the form of the author's name as given in the manuscript, this may be tagged as a distinct
1587	msat	element, within the text at the point where it occurs.
1594	msat	element carrying full details of the person concerned (see further
1599	msat	element can be used to supply the name and role of a person other than the author who is responsible for some aspect of the intellectual content of the manuscript:
1612	msat	element can also be used where there is a discrepancy between the author of an item as given in the manuscript and the accepted scholarly view, as in the following example:
1622	msat	Note that such attributions of authorship, both correct and incorrect, are frequently found in the rubric or final rubric (and occasionally also elsewhere in the text), and can therefore be transcribed and included in the description, if desired, using the
1633	mscorie	It is customary in a manuscript description to record the opening and closing words of a text as well as any headings or colophons it might have, and the specialized elements
1647	mscorie	, for recording other bits of the text not covered by these elements. Each of these elements has the same substructure, containing a mixture of phrase-level elements and plain text. A
1649	mscorie	element can be included within each, in order to specify the location of the component, as in the following example:
1667	mscorie	In the following example, standard TEI elements for the transcription of primary sources have been used to mark the expansion of abbreviations and other features present in the original:
1702	mscorie	to indicate that the text begins and ends defectively.
1716	mscorie	may always be used to identify the language of the text quoted, if this is different from the default language specified by the
1750	msclass	One or more text classification or text-type codes may be specified, either for the whole of the
1779	msclass	The value used for the
1791	msclass	element of the TEI header (
1820	mslangs	element should be used to provide information about the languages used within a manuscript item. It may take the form of a simple note, as in the following example:
1825	mslangs	Where, for validation and indexing purposes, it is thought convenient to add keywords identifying the particular languages used, the
1836	mslangs	A manuscript item will sometimes contain material in more than one language. The
1846	mslangs	Since Old Church Slavonic may be written in either Cyrillic or Glagolitic scripts, and even occasionally in both within the same manuscript, it might be preferable to use a more explicit identifier:
1851	mslangs	The form and scope of language identifiers recommended by these Guidelines is based on the IANA standard described at
1852	mslangs	and should be followed throughout. Where additional detail is needed to describe a language correctly, or to discuss its deployment in a given text, this should be done using the
1854	mslangs	element in the TEI header, within which individual
1861	mslangs	element defines a particular combination of human language and writing system. Only one
1863	mslangs	element may be supplied for each such combination. Standard TEI practice also allows this element to be referenced by any element using the global
1865	mslangs	attribute in order to specify the language applicable to the content of that element. For example, assuming that
1902	msph	we subsume a large number of different aspects generally regarded as useful in the description of a given manuscript. These include:
1904	msph	aspects of the form, support, extent, and quire structure of the manuscript object and of the way in which the text is laid out on the page (
1910	msph	and discussion of its binding, seals, and any accompanying material (
1914	msph	Most manuscript descriptions touch on several of these categories of information though few include them all, and not all distinguish them as clearly as we propose here. In particular, it is often the case that an existing description will include information for which we propose distinct elements within a single paragraph, or even sentence. The encoder must then decide whether to rewrite the description using the structure proposed here, or to retain the existing prose, marked up simply as a series of
1922	msph	element may thus be used in either of two distinct ways. It may contain a series of paragraphs addressing topics listed above and similar ones. Alternatively, it may act as a container for any choice of the more specialized elements described in the remainder of this section, each of which itself contains a series of paragraphs, and may also have more specific attributes.
1926	msph	element will normally contain either a series of
1928	msph	elements, or a sequence of specialized elements from the
1932	msph	the description already exists in a prose form where some of the specialized topics are treated together in paragraphs of prose, but others are treated distinctly;
1955	msph	The order in which specific elements may appear is also constrained by the content model; again this is for simplicity of processing. They may of course be processed or displayed in any desired order, but for ease of validation, they must be given in the order specified below.
1961	msph1	element is used to group together those parts of the physical description which relate specifically to the text-bearing object, its format, constitution, layout, etc. The
1963	msph1	attribute is used to indicate the specific type of writing vehicle being described, for example, as a codex, roll, tablet, etc. If used it must appear first in the sequence of specialized elements. The
1966	msph1	support
1967	msph1	, i.e. the physical carrier on which the text is inscribed; and a description of the
1968	msph1	layout
1969	msph1	, i.e. the way text is organized on the carrier.
1971	msph1	Taking these in turn, the description of the support is tagged using the following elements, each of which is discussed in more detail below:
1981	msph1	), may be used to tag specific terms of interest if so desired.
2007	msph1sup	element groups together information about the physical carrier. Typically, for western manuscripts, this will entail discussion of the material (parchment, paper, or a combination of the two) written on. For paper, a discussion of any watermarks present may also be useful. If this discussion makes reference to standard catalogues of such items, these may be tagged using the standard
2030	msph1ext	element, defined in the TEI header, may also be used in a manuscript description to specify the number of leaves a manuscript contains, as in the following example:
2070	msph1col	element, which is provided when the
2121	msphfo	element may be used to indicate the scheme, medium or location of folio, page, column, or line numbers written in the manuscript, frequently including a statement about when and, if known, by whom, the numbering was done.
2129	msphfo	Where a manuscript contains traces of more than one foliation, each should be recorded as a distinct
2131	msphfo	element and optionally given a distinct value for its
2136	msphfo	can then indicate which foliation scheme is being cited by means of its
2155	msphco	element is used to summarize the overall physical state of a manuscript, in particular where such information is not recorded elsewhere in the description. It should not, however, be used to describe changes or repairs to a manuscript, as these are more appropriately described as a part of its custodial history (see
2156	msphco	). It should be supplied within the
2158	msphco	element, if it discusses the condition of the physical support of the manuscript; within the
2163	msphco	) if it discusses only the condition of the binding or bindings concerned; or within the
2165	msphco	element if it discusses the condition of any seal attached to the manuscript.
2187	msphla	of the manuscript, that is the way in which text and illumination are arranged on the page, specifying for example the number of written, ruled, or pricked lines and columns per page, size of margins, distinct blocks such as glosses, commentaries, etc. This may be given as a simple series of paragraphs. Alternatively, one or more different layouts may be identified within a single manuscript, each described by its own
2196	msphla	element is used, the layout will often be sufficiently regular for the attributes on this element to convey all that is necessary; more usually however a more detailed treatment will be required. The attributes are provided as a convenient shorthand for commonly occurring cases, and should not be used except where the layout is regular. The value
2198	msphla	(not-applicable) should be used for cases where the layout is either very irregular, or where it cannot be characterized simply in terms of lines and columns, for example, where blocks of commentary and text are arranged in a regular but complex pattern on each page
2217	msphla	elements within the content of the element, as in the following example:
2239	msph2	The second group of elements within a structured physical description concerns aspects of the writing, illumination, or other notation (notably, music) found in a manuscript, including additions made in later hands—the
2240	msph2	text
2259	msphwr	element can contain a short description of the general characteristics of the writing observed in a manuscript, as in the following example:
2276	msphwr	Where several distinct hands have been identified, this fact can be registered by using the
2318	msphwr	can be used to link the relevant parts of the transcription to the appropriate
2321	msphwr	handShift new="#Eirsp-2"/
2334	msphwr	element can simply provide a summary description:
2357	msphwr	elements should be supplied. Similarly, in the following example, the source text is a typescript with extensive handwritten annotation:
2391	msphdec	It can be difficult to draw a clear distinction between aspects of a manuscript which are purely physical and those which form part of its intellectual content. This is particularly true of illuminations and other forms of decoration in a manuscript. We propose the following elements for the purpose of delimiting discussion of these aspects within a manuscript description, and for convenience locate them all within the physical description, despite the fact that the illustrative features of a manuscript will in many cases also be seen as constituting part of its intellectual content.
2401	msphdec	Alternatively, it may contain a series of more specific typed
2428	msphdec	Where more exact indexing of the decorative content of a manuscript is required, the standard TEI elements
2470	msphmu	element may be used to describe the form of notation employed, as in the following example:
2486	mspham	element can be used to list or describe any additions to the manuscript, such as marginalia, scribblings, doodles, etc., which are considered to be of interest or importance. Such topics may also be discussed or referenced elsewhere in a description, for example in the
2590	msph3	The third major component of the physical description relates to supporting but distinct physical components, such as bindings, seals and accompanying material. These may be described using the following specialist elements:
2602	msphbi	element contains a description of the state of the present and former bindings of a manuscript, including information about its material, any distinctive marks, and provenance information. This may be given as a series of paragraphs if only one binding is being described, or as a series of distinct
2604	msphbi	elements, each describing a distinct binding where these are separately described. For example:
2612	msphbi	Within a binding description, the elements
2639	msphbi	for paragraphs concerned exclusively with the condition of a binding, where this has not been supplied as part of the physical description.
2679	msadac	The circumstance may arise where material not originally part of a manuscript is bound into or otherwise kept with a manuscript. In some cases this material would best be treated in a separate
2682	msadac	below). There are, however, cases where the additional matter is not self-evidently a distinct manuscript: it might, for example, be a set of notes by a later scholar, or a file of correspondence relating to the manuscript. The
2688	msadac	Here is an example of the use of this element, describing a note by the Icelandic manuscript collector Árni Magnússon which has been bound with the manuscript:
2734	mshy	The following elements are used to record information about the history of a manuscript:
2752	mshy	Information about the origins of the manuscript, its place and date of writing, should be given as one or more paragraphs contained by a single
2754	mshy	element; following this, any available information on distinct stages in the history of the manuscript before its acquisition by its current holding institution should be included as paragraphs within one or more
2802	mshy	elements where distinct periods of ownership for the manuscript have been identified:
2841	msad	Three categories of additional information are provided for by the scheme described here, grouped together within the
2852	msad	is required. If any is supplied, it may appear once only; furthermore, the order in which elements are supplied should be as specified above.
2862	msadad	element is used to hold information relating to the curation and management of a manuscript. This may be supplied as a note using the global
2875	msrh	element may contain simply a series of paragraphs. Alternatively it may contain a
2877	msrh	element, followed by an optional series of
2886	msrh	element is used to document the primary source of information for the record containing it, in a similar way to the standard TEI
2888	msrh	element within a TEI Header. If the record is a new one, made without reference to anything other than the manuscript itself, then it may simply contain a
2895	msrh	Frequently, however, the record will be derived from some previously existing description, which may be specified using the standard TEI
2907	msrh	If, as is likely, a full bibliographic description of the source from which cataloguing information was taken is included within the
2911	msrh	element, or elsewhere in the current document, then it need not be repeated here. Instead, it should be referenced using the standard TEI
2947	msrh	element of the standard TEI header; its use here is intended to signal the similarity of function between the two container elements. Where the TEI header should be used to document the revision history of the whole electronic file to which it is prefixed, the
2960	msadch	element, which is also available in the TEI header, should be used here to supply any information concerning access to the current manuscript, such as its physical location (where this is not implicit in its identifier), any restrictions on access, information about copyright, etc.
2977	msadch	record is used to describe the custodial history of a manuscript, recording any significant events noted during the period that it has been located within its holding institution. It may contain either a series of
2979	msadch	elements, or a series of
2981	msadch	elements, each describing a distinct incident or event, further specified by a
3018	msadsu	element is used to provide information about surrogates of the manuscript, such as photographs or other representations, which may exist within the holding institution or elsewhere.
3028	msadsu	element. However, it is often also convenient to record information such as negative numbers or digital identifiers for unpublished collections of manuscript images maintained within the holding institution, as well as to provide more detailed descriptive information about the surrogate itself. Such information may be provided as prose paragraphs, within which identifying information about particular surrogates may be presented using the standard TEI
3056	msadsu	Note the use of the specialized form of title (
3057	msadsu	general material designation
3060	msadsu	At a later revision, the content of the
3062	msadsu	element is likely to be expanded to include elements more specifically intended to provide detailed information such as technical details of the process by which a digital or photographic image was made. For information about the inclusion of digital facsimile images within a TEI document, refer also to
3137	MSref	The selection and combination of modules to form a TEI schema is described in

SA-LinkingSegmentationAlignment.xml#13230

#	id	text
4	SA	This chapter discusses a number of ways in which encoders may represent analyses of the structure of a text which are not necessarily linear or hierarchic. The module defined by this chapter provides for the following common requirements:
6	SA	to link disparate elements using the
11	SA	to link disparate elements without using the
17	SA	to segment text into elements convenient for the encoder and to mark arbitrary points within documents (section
20	SA	to represent correspondence or alignment among groups of text elements, both those with content and those which are empty (section
22	SA	We use the term
24	SA	as a special case for the more general notion of correspondence. Using A as a short form for
27	SA	set to the value
29	SA	, and suppose elements A1, A2, and A3 occur in that order and form one group, while elements B1, B2, and B3 occur in that order and form another group. Then a relation in which A1 corresponds to B1, A2 corresponds to B2, and A3 corresponds to B3 is an alignment. On the other hand, a relation in which A1 corresponds to B2, B1 to C2, and C1 to A2 is not an alignment.
31	SA	to synchronize elements of a text, that is to represent temporal correspondences and alignments among text elements (section
32	SA	) and also to align them with specific points in time (section
35	SA	to specify that one text element is identical to or a copy of another (section
47	SA	to associate segments of a text with interpretations or analyses of their significance (section
51	SA	These facilities all use the same set of techniques based on the W3C XPointer framework (
63	SA	is extended to include eight additional attributes to support the various kinds of linking listed above. Each of these attributes is introduced in the appropriate section below. In addition, for many of the topics discussed, a choice of methods of encoding is offered, ranging from simple but less general ones, which use attribute values only, to more elaborate and more general ones, which use specialized elements.
70	SAPT	to others if the first has an attribute whose value is a reference to the others: such an element is called a
80	SAPT	. These elements all indicate an association between one place in the document (the location of the pointer itself) and one or more others (the elements whose identifiers are specified by the pointer's
83	SAPT	link
100	SAPTL	element, which represents an association between two (or more) locations by specifying each location explicitly. Its own location is irrelevant to the intended linkage. All three elements use the attribute
104	SAPTL	class as a means of indicating the location or locations referenced or pointed to.
114	SAPTL	between an element (which, in the case of a pure pointer, is simply a location in a document), and one or more others, known collectively as its
121	SAPTL	point, conceptually, at a single target, even if that target may be discontinuous in the document. The
126	SAPTL	These three elements also share a common set of attributes, derived from the
141	SAPTL	element. All that is required is that the value of the
143	SAPTL	(or other pointing) attribute of the one be the value of the
161	SAPTL	attribute may take as value one or more URI references. In the simplest case, each such reference will indicate an element in the current document (or in some other document), for example by supplying the value used for its global
163	SAPTL	attribute. It may however carry as value any form of URI, such as a URL pointing to some other document or location on the Internet. Pointing or linking to external documents and pointing and linking where identifiers are not available is described below in section
170	SAPTEG	As an example of the use of mechanisms which establish connections among elements, consider the practice (common in 18th century English verse and elsewhere) of providing footnotes citing parallel passages from classical authors.
172	POPE	The figure shows the original page of Pope's Dunciad which is discussed in the text.
178	SAPTEG	attribute, placed adjacent to the passage to which the note refers:
181	SAPTEG	attribute on the note is used to classify the notes using the typology established in the Advertisement to the work:
185	SAPTEG	In the source text, the text of the poem shares the page with two sets of notes, one headed
214	SAPTEG	implicit linking
215	SAPTEG	). It relies on the juxtaposition of the note to the text being commented on for the connection to be understood. If it is felt that the mere juxtaposition of the note to the text does not make it sufficiently clear exactly what text segment is being commented on (for example, is it the immediately preceding line, or the immediately preceding two lines, or what?), or if it is decided to place the note at some distance from the text, then the pointing or the linking must be made explicit. We now consider various methods for doing that.
219	SAPTEG	element might be placed at an appropriate point within the text to link it with the annotation:
242	SAPTEG	) to enable it to be specified as the target of the pointer element. Because there is nothing in the text to signal the existence of the annotation, the
244	SAPTEG	attribute has been given the value
254	SAPTEG	attribute has been supplied for the associated text:
264	SAPTEG	Given this encoding of the text itself, we can now link the various notes to it. In this case, the note itself contains a pointer to the place in the text which it is annotating; this could be encoded using a
268	SAPTEG	attribute of its own and contains a (slightly misquoted) extract from the text marked as a
292	SAPTEG	a pointer within one line indicates the note
294	SAPTEG	the note indicates the line
296	SAPTEG	a pointer within the note indicates the line
298	SAPTEG	Note that we do not have any way of pointing from the line itself to the note: the association is implied by containment of the pointer. We do not as yet have a true double link between text and note. To achieve that we will need to supply identifiers for the annotations as well as for the verse lines, and use a
331	SAPTEG	element here bears the identifier of the note followed by that of the verse line. We could also allocate an identifier to the reference within the note and encode the association between it and the verse line in the same way:
346	SAPTEG	s could be combined into one, as follows:
352	SAPTLG	Clearly, there are many reasons for which an encoder might wish to represent a link or association between different elements. For some of them, specific elements are provided in these Guidelines; some of these are discussed elsewhere in the present chapter. The
354	SAPTLG	element is a general-purpose element which may be used for any kind of association. The element
356	SAPTLG	may be used to group links of a particular type together in a single part of the document; such a collection may be used to represent what is sometimes referred to in the literature of Hypertext as a
358	SAPTLG	, a term introduced by the Brown University FRESS project in 1969, and not to be confused with the World Wide Web.
373	SAPTLG	element provides a convenient way of establishing a default for the
375	SAPTLG	attribute on a group of links of the same type: by default, the
379	SAPTLG	element has the same value as that given for
385	SAPTLG	Typical software might hide a web entirely from the user, but use it as a source of information about links, which are displayed independently at their referenced locations. Alternatively, software might provide a direct view of the link collection, along with added functions for manipulating the collection, as by filtering, sorting, and so on. To continue our previous example, this text contains many other notes of a kind similar to the one shown above. Here are a few more of the lines to which annotations have to be attached, followed by the annotations themselves:
426	SAPTLG	attribute can be used to identify the text elements within which the individual targets of the links are to be found. Suppose that the text under discussion is organized into a
428	SAPTLG	element, containing the text of the poem, and a
432	SAPTLG	attribute can have as its value the identifiers of the
436	SAPTLG	, to enable an application to verify that the link targets are in fact contained by appropriate elements, or to limit its search space:
448	SAPTLG	domain
449	SAPTLG	; if some notes are contained by a section with identifier
460	SAPTLG	attribute can be used to provide further information about the role or function of the various targets specified for each link in the group. The value of the
462	SAPTLG	attribute is a list of names (formally, name tokens), one for each of the targets in the link; these names can be chosen freely by the encoder, but their significance should be documented in the encoding description in the header.
463	SAPTLG	Since no special element is provided for this purpose in the present version of these Guidelines, the information should be supplied as a series of paragraphs at the end of the
467	SAPTLG	In the current example, we might think of the note as containing the
468	SAPTLG	source
469	SAPTLG	of the imitation and the verse line as containing the
489	SAPTIP	In the preceding examples, we have shown various ways of linking an annotation and a single verse line. However, the example cited in fact requires us to encode an association between the note and a
491	SAPTIP	of verse lines (lines 284 and 285); we call these two lines a
492	SAPTIP	span
495	SAPTIP	There are a number of possible ways of correcting this error: one could use the
497	SAPTIP	attribute to indicate one end of the span and the special purpose
501	SAPTIP	element to point to the other. Another possibility might be to create an element which represents the whole span itself and assign that an
503	SAPTIP	attribute, which can then be linked to the
531	SAPTIP	then provides an identifier which can be linked to the
540	SAPTIP	value of
546	SAPTIP	had the value
548	SAPTIP	, the link target would be the pointer itself, rather than the objects it points to.
552	SAPTIP	element is used to group a collection of
565	SAXP	This section introduces more formally the pointing mechanisms available in the TEI. In addition to those discussed so far, the TEI provides methods of pointing:
575	SAXP	at arbitrary content in any XML document using TEI-defined XPointer schemes.
579	SAXP	All TEI attributes used to point at something else are declared as having the datatype
599	SAUR	Like the ubiquitous if misnamed XHTML pointing attribute
601	SAUR	, the TEI pointing attributes can point to a document that is not the current document (the one that contains the pointing element) whether it is in the same local filesystem as the current document, or on a different system entirely. In either case, the pointing can be accomplished absolutely (using the entire address of the target document) or relatively (using an address relative to the current base URI in force). The
605	SAUR	. If there is none, the base URI is that of the current document. In common practice the current base URI in force is likely to be the value of the
616	SAUR	This example points explicitly to a location on the Web, accessible via HTTP
617	SAUR	. Suppose however that we wish to access a document stored locally in a file. Again we will supply an absolute URI reference, but this time using a different protocol:
631	SAUR	is specified here, the location of the resource
635	SAUR	In the following example, however, we first change the current base URI by setting a new value for
637	SAUR	. The resource required is then identified by means of a relative URI:
691	SABN	Because the default base URI is the current document, a pointer that is specified as a
692	SABN	bare name
694	SABN	In more recent W3C documents, the term
695	SABN	bare name
696	SABN	is deprecated in favour of the more explicit
720	SABN	of the target element as a bare name only (e.g.,
722	SABN	) is the simplest and often the best approach where it can be applied, i.e. where both the source element and target element are in the same XML document, and where the target element carries an identifier. It is the method used extensively in previous sections of this chapter and elsewhere in these Guidelines.
729	SAPU	is a useful way of handling the repeated use of long external URIs. However, it is less convenient when your text contains many references to a variety of different sources in different locations. Even in the case of relative links on the local file system,
733	SAPU	attributes may become quite lengthy and make XML code difficult to read. To deal with this problem, the TEI provides a useful method of using abbreviated pointers and documenting a way to dereference them automatically.
735	SAPU	Imagine a project which has a large collection of XML documents organized like this:
765	SAPU	If you want to link a
773	SAPU	file, the link will look like this:
777	SAPU	If there are many names to tag in a single paragraph, the XML encoding will be congested, and such lengthy links are prone to typographical error. In addition, if the project organization is changed, every relative link will have to be found and altered.
787	SAPU	element in the TEI header, as described in
788	SAPU	. However, such a link cannot be mechanically processed by an external system that does not know how to interpret it; a human will have to read the header explanation and write code explicitly to reconstruct the intended link.
794	SAPU	, and can therefore be used as the value of any attribute which has that datatype, such as
798	SAPU	. Such a scheme consists of a prefix with a colon, and then a value. You might, for example, use the prefix
800	SAPU	(for "person"), and structure your name tags like this:
806	SAPU	? Essentially, it isn't, except that TEI provides a structured method of dereferencing it (turning it into a computable path, such as
810	SAPU	in the TEI header, using the elements and attributes for prefix declaration:
831	SAPU	value is constructed with a
837	SAPU	, and it contains any number of
847	SAPU	provides the string which will be used as a replacement. In this example, using
849	SAPU	, the value
853	SAPU	, and also captured (through the parentheses in the regular expression); it would then be replaced by the value
869	SAPU	in the header to see if there is an available expansion for it, and if there is, it can automatically provide the expansion and generate a full or relative URI.
873	SAPU	element in the personography file, it might also be useful to point to an external source which is available on the network, representing the same information in a different way. So there might be a second
881	SAPU	Any number of
883	SAPU	elements may be provided for the same prefix. A processor may decide to process one or all of them; if it processes only one, it should choose the first one with the correct
891	SAPU	When creating private URI schemes, it is recommended that you avoid using any existing registered prefix. A list of registered prefixes is maintained by IANA at
906	SATS	TEI XPointer Schemes
908	SATS	The pointing schemes described in this chapter are part of a number of such schemes envisaged by the W3C, which together constitute a framework for addressing data within XML documents, known as the XPointer Framework (
912	SATS	. The W3C has predefined a set of such schemes, and maintains a register for their expansion.
917	SATS	. These Guidelines also define six other pointer schemes, which provide access to parts of an XML document such as points within data content or stretches of data content. These additional TEI pointer schemes are defined in sections
921	SATSin	Introduction to TEI Pointers
923	SATSin	Before discussing the TEI pointer schemes, we introduce slightly more formally the terminology used to define them. So far, we have discussed only ways of pointing at components of the XML information set node such as elements and attributes. However, there is often a need in text analysis to address additional types of location such as the
931	SATSin	that may arbitrarily cross the boundaries of nodes in a document. The content of an XML document is organized sequentially as well as hierarchically, and it makes sense to consider ranges of characters within a document independently of the nodes to which they belong. From the perspective of most of the pointer schemes discussed below, a TEI document is a tree structure superimposed upon a character stream. Nodes are entities available only in the tree, while points are available only in the stream. For this reason, the schemes below that rely upon character positions (
937	SATSin	) cannot take nodes into account. Similarly, XPath, being a method for locating nodes in the tree, treats those nodes as atomic, and is unable to address parts of nodes in their document context.
939	SATSin	The TEI pointer scheme thus distinguishes the following kinds of object:
943	SATSin	A node is an instance of one of the node kinds defined in the
945	SATSin	. It represents a single item in the XML information set for a document. For pointing purposes, the only nodes that are of interest are Text Nodes, Element Nodes, and Attribute Nodes.
949	SATSin	A Sequence follows the definition in the XPath 2.0 Data Model, with one alteration. A Sequence is an ordered collection of zero or more items, where an item is either a node or a partial text node.
953	SATSin	A Text Stream is the concatenation of the text nodes in a document and behaves as though all tags had been removed. A text stream begins at a reference node and encompasses all of the text inside that node (if any) and all the text following it in document order. In XPath terms, this would encompass all of the text nodes beginning at a particular node, and following it on the
959	SATSin	A Point represents a dimensionless point between nodes or characters in a document. Every point is adjacent to either characters or elements, and never to another point. Points can only be referenced in relation to an element or text node in the document (i.e. something addressable by either an XPath or a fragment identifier). Points occur either immediately before or after an element, or at a numbered position inside a text stream. Position zero in the stream would be immediately before the first character. Note that points within attribute values cannot mark the beginning or end of a range extending beyond the attribute value, because points indicate a position within a document. Since attribute nodes are by definition un-ordered, they cannot be said to have a fixed position.
963	SATSin	The TEI recommends the following seven pointer schemes:
967	SATSin	Addresses a node or nodeset using the XPath syntax. (
974	SATSin	addresses the point before (left) or after (right) a node or node set (
980	SATSin	addresses a point inside a text node (
994	SATSin	addresses a range which matches a specified string within a node (
1001	SATSin	scheme refers to the existing XPath specification which is adopted with one modification: the default namespace for any XPath used as a parameter to this scheme is assumed to be the TEI namespace
1007	SATSin	draft, but are individually much simpler. At the time of this writing, there is no current or scheduled activity at the W3C towards revising this draft or issuing it as a recommendation.
1009	SATSin	A note on namespaces
1014	SATSin	) which when prepended to a resolvable pointer allows for the definition of namespace prefixes to be used in XPaths in subsequent pointers. TEI Pointer schemes assume that un-prefixed element names in TEI Pointer XPaths are in the TEI namespace,
1018	SATSin	is thus optional, provided no new prefixes need to be defined. If the schemes described here are used to address non-TEI elements, then any new prefixes to be used in pointer XPaths may be defined using the
1030	SATSXP	scheme locates a node within an XML Information Set. The single argument
1038	SATSXP	scheme because they represent extracted values rather than locations in the source document. XPath expressions that address attribute nodes are only advisable in the
1042	SATSXP	The example below, and all subsequent examples in this section, refer to the following TEI fragment
1075	SATSXP	A TEI Pointer that referenced the "normalized" form in the
1076	SATSXP	choice
1077	SATSXP	in line 1 of the example might look like:
1081	SATSXP	When an XPath is interpreted by a TEI processor, the information set of the referenced document is interpreted without any additional information supplied by any schema processing that may or may not be present. In particular this means that no whitespace normalization is applied to a document before the XPath is interpreted.
1087	SATSXP	pointers more robust than the other mechanisms discussed in this section even if the designated document changes. For durability in the presence of editing, use of
1089	SATSXP	is always recommended when possible.
1101	SATSL	scheme locates the point immediately preceding the node addressed by its argument, which is either an
1105	SATSL	, the value of an
1112	SATSL	lb
1114	SATSL	gap
1134	SATSR	scheme locates the point immediately following the node addressed by its argument.
1139	SATSR	lb
1156	SATSSI	scheme locates a point based on character positions in a text stream relative to the node identified by the IDREF or XPATH parameter. The
1160	SATSSI	. An offset of 0 represents the position immediately before the first character in either the first text node descendant of the node addressed in the first parameter or the first following text node, if the addressed element contains no text node descendants.
1165	SATSSI	s
1170	SATSSI	in line 2.
1184	SATSRN	s, which are each members of the set
1196	SATSRN	locates a (possibly non-contiguous) sequence beginning at the first POINTER parameter and ending at the last. If the POINTER locates a node (i.e. is an XPATH or IDREF), then that node is a member of the addressed sequence. If a sequence addressed by a range pointer overlaps, but does not wholly contain, an element (i.e. it contains only the start but not the end tag or vice-versa), then that element is not part of the sequence.
1199	SATSRN	s may address sequences of non-contiguous nodes. For example, a range() might select text beginning before an
1201	SATSRN	, encompassing the content of a single
1210	SATSRN	line 4
1219	SATSRN	indicates the sequence
1225	SATSRN	indicates the non-contiguous sequence
1237	SATSSR	The string-range() scheme locates a sequence based on character positions in a text stream relative to the node identified by the first parameter. The location of the beginning of the addressed sequence is determined precisely as for
1245	SATSSR	parameter is a positive integer that denotes the length of the text stream captured by the sequence. As with
1247	SATSSR	, the addressed sequence may contain text nodes and/or elements. The
1249	SATSSR	scheme, can accept multiple OFFSET, LENGTH pairs to address a non-contiguous sequence in much the same way that range() can accept multiple pairs of pointers.
1251	SATSSR	Because string-range() addresses points in the text stream, tags are invisible to it. For example, if an empty tag like
1253	SATSSR	is encountered while processing a string-range(), it will be included in the resulting sequence, but the LENGTH count will not increment when it is captured.
1258	SATSSR	line 5
1259	SATSSR	from the text immediately following the
1260	SATSSR	lb
1262	SATSSR	ab
1267	SATSSR	indicates the sequence
1273	SATSSR	indicates the non-contiguous sequence
1285	SATSMA	The match scheme locates a sequence based on matching the REGEX parameter against a text stream relative to the reference node identified by the first parameter. REGEX is a regular expression as defined by
1299	SATSMA	are assumed to operate in multi-line mode. The end of the string to be matched against is either the end of the text contained by the element in the first parameter or the end of the document, if that parameter indicates an empty element. The meta-character
1301	SATSMA	therefore matches the beginning of the text stream inside or following the reference node, and the meta-character
1305	SATSMA	The optional INDEX parameter is an integer greater than 0 which specifies which match should be chosen when there is more than one possibility. If omitted, the first match in the text stream will be used.
1315	SATSMA	indicates the sequence
1318	SATSMA	line 5
1326	SATSMA	unclear
1329	SATSMA	, just their text children.
1343	SACR	, chapter 5, verse 7.
1344	SACR	They might then wish to translate the string
1357	SACR	Several elements in the TEI scheme (
1367	SACR	, just for this purpose. Using the system described in this section, an encoder may specify references to canonical works in a discipline-familiar format, and expect software to derive a complete URI from it. The value of the
1369	SACR	attribute is processed as described in this section, and the resulting URI reference is treated as if it were the value of the
1379	SACR	attribute to function as required, a mechanism is needed to define the mapping between (for example)
1385	SACR	in the TEI header, which contains an algorithm for translating a canonical reference string (like
1421	SACR	When an application encounters a canonical reference as the value of
1423	SACR	attribute, it might follow this sequence of specific steps to transform it into a URI reference:
1436	SACR	match the value of the
1438	SACR	attribute to the regular expression found as the value of the
1442	SACR	if the value of the
1446	SACR	take the value of the
1448	SACR	attribute and substitute the back references ($1, $2, etc.) with the corresponding matched substrings
1450	SACR	the result is taken as if it were a relative or absolute URI reference specified on the
1454	SACR	attribute value as usual
1456	SACR	no further processing of this value of the
1460	SACR	should take place
1464	SACR	if, however, the value of the
1466	SACR	attribute does not match the regular expression specified in the value of the
1478	SACR	The regular expression language used as the value of the
1486	SACR	tei
1487	SACR	matches any string that contains
1488	SACR	tei
1489	SACR	, in the W3C language it only matches the string
1490	SACR	tei
1492	SACR	The value of the
1498	SACR	are replaced by the corresponding substring match. Note that since a maximum of nine substring matches are permitted, the string
1501	SACR	the value of the first matched substring followed by the character
1505	SACR	. If there is a need for an actual string including a dollar sign followed by a digit that is not supposed to be replaced, the dollar sign should be written as
1519	SACRWE	above, an application comes across a
1521	SACRWE	value of
1529	SACRWE	. The application would first apply the regular expression
1539	SACRWE	. The application would then apply these substrings to the pattern
1549	SACRWE	If, however, the input string had been
1551	SACRWE	, the first regular expression would not have matched. The application would have then tried the second,
1557	SACRWE	. It would then have substituted those matched substrings into the pattern
1559	SACRWE	to produce a fragment identifier, which when appended to the
1565	SACRWE	If the input string had been
1567	SACRWE	, neither the first nor the second regular expression would have successfully matched. The application would have then tried the third,
1586	SACRex	In the above example, the value of
1639	SACRmu	Canonical reference pointers are intended for use by TEI encoders. However, this specification might be useful to the development of a process for recognizing canonical references in non-TEI documents (such as plain text documents), possibly as part of their conversion to TEI.
1647	SASE	In this section, we discuss three general-purpose elements which may be used to mark and categorize both a span of text and a point within one. These elements have several uses, most notably to provide elements which can be given identifiers for use when aligning or linking to parts of a document, as discussed elsewhere in this chapter. They also provide a convenient way of extending the semantics of the TEI markup scheme in a theory-neutral manner, by providing for two neutral or
1649	SASE	elements to which the encoder can add any meaning not supplied by other TEI defined elements.
1690	SASE	, it is useful where multiple views of a document are to be combined, for example, when a logical view based on paragraphs or verse lines is to be mapped on to a physical view based on manuscript lines. Like those elements, it is a member of the class
1692	SASE	and can therefore appear anywhere within a document when the module defined by this chapter is included in a schema. Unlike the other elements in its class, the
1695	SASE	, rather than as a means of marking segment boundaries for some arbitrary segmentation of a text.
1697	SASE	For example, suppose that we wish to mark the end of the fifth word following each occurrence of some term in a particular text, perhaps to assist with some collocational analysis. This can most easily be done with the help of the
1712	SASE	element may be used at the encoder's discretion to mark almost any segment of the text of interest for processing. One use of the element is to mark text features for which no appropriate markup is otherwise defined, i.e. as a simple extension mechanism. Another use is to provide an identifier for some segment which is to be pointed at by some other element, i.e. to provide a target, or a part of a target, for a
1720	SASE	as a means of marking segments significant in a metrical or rhyming analysis (see section
1723	SASE	as a means of marking typographic lines in drama (see section
1724	SASE	) or title pages (see section
1735	SASE	element simply delimits the extent of a stutter, a textual feature for which no element is provided in these Guidelines.
1759	SASE	elements may be nested directly within one another, to any degree of analysis considered appropriate. This is taken a little further in the following example, where the
1802	SASE	to facilitate this particular kind of analysis. These allow for the explicit markup of units called
1829	SASE	attribute of these specialized elements now carries the value carried by the
1833	SASE	element. For an analysis not using these traditional linguistic categories however, the
1837	SASE	In language corpora and similar material, the
1839	SASE	element may be used to provide an end-to-end segmentation as an alternative to the more specific
1848	SASE	element can then be used to mark both features within s-units and segments composed of s-units, as in the following example:
1850	SASE	, where the text from which this fragment is taken is analyzed.
1864	SASE	tag must be properly enclosed within other elements. Thus, a single
1866	SASE	element can be used to group together words in different sentences only if the sentences are not themselves tagged. The first of the following two encodings is legal, but the second is not.
1890	SASE	element has the same content as a paragraph in prose: it can therefore be used to group together consecutive sequences of
1892	SASE	class elements, such as lists, quotations, notes, stage directions, etc. as well as to contain sequences of phrase-level elements. It cannot however be used to group together sequences of paragraphs or similar text units such as verse lines; for this purpose, the encoder should use intermediate pointers, as described in section
1894	SASE	. It is particularly important that the encoder provide a clear description of the principles by which a text has been segmented, and the way in which that segmentation is represented. This should include a description of the method used and the significance of any categorization codes. The description should be provided as a series of paragraphs within the
1896	SASE	element of the encoding description in the TEI header, as described in section
1901	SASE	element may also be used to encode simultaneous or mutually exclusive variants of a text when the more special purpose elements for simple editorial changes, abbreviation and expansion, addition and deletion, or for a critical apparatus are not appropriate. In these circumstances, one
1903	SASE	is encoded for each possible variant, and the set of them is enclosed in a
1907	SASE	For example, if one were writing dual-platform instructions for installation of software, it might be useful to use
1916	SASE	Elsewhere in this chapter we provide a number of examples where the
1924	SASE	element, but is used for portions of the text which occur not within paragraphs or other component-level elements, but at the component level themselves. It is therefore a member of the
1930	SASE	element may be used, for example, to tag the canonical verse divisions of Biblical texts:
1948	SASE	In other cases, where the text clearly indicates paragraph divisions containing one or more verses, the
1950	SASE	element may be used to tag the paragraphs, and the
1978	SASE	element is also useful for marking dramatic speeches when it is not clear whether the speech is to be regarded as prose or verse. If, for example, an encoder does not wish to express an opinion as to whether the opening lines of Shakespeare's
2027	SACS	, which is a special kind of correspondence involving an ordered set of correspondences. Both cases may be represented using the
2032	SACS	. We also discuss the special case of alignment in time or
2034	SACS	, for which special purpose elements are proposed in section
2040	SACS1	A common requirement in text analysis is to represent correspondences between two or more parts of a single document, or between places in different documents. Provided that explicit elements are available to represent the parts or places to be linked, then the global linking attribute
2055	SACS1	element should be used, if no other element is available. Where the correspondence is between
2059	SACS1	element should be used, if no other element is available.
2063	SACS1	attribute with spans of content is illustrated by the following example:
2081	SACS1	attributes. This mechanism is simple to apply, but has the drawback that it is not possible to specify more exactly what kind of correspondence is intended. Where this attribute is used, therefore, encoders are encouraged to specify their intent in the associated encoding description in the TEI header.
2139	SACSAL	One very important application area for the alignment of parallel texts is multilingual corpora. Consider, for example, the need to align
2141	SACSAL	of sentences drawn from a corpus such as the Canadian Hansard, in which each sentence is given in both English and French. Concerning this problem, Gale and Church write:
2142	SACSAL	Most English sentences match exactly one French sentence, but it is possible for an English sentence to match two or more French sentences. The first two English sentences [in the example below] illustrate a particularly hard case where two English sentences align to two French sentences. No smaller alignments are possible because the clause
2144	SACSAL	in the first English sentence corresponds to (part of) the second French sentence. The next two alignments ... illustrate the more typical case where one English sentence aligns with exactly one French sentence. The final alignment matches two English sentences to a single French sentence. These alignments [which were produced by a computer program] agreed with the results produced by a human judge.
2146	SACSAL	, from which the example in the text is taken.
2148	SACSAL	The alignment produced by Gale and Church's program can be expressed in four different ways. The encoder must first decide whether to represent the alignment in terms of points within each text (using the
2152	SACSAL	element. To some extent the choice will depend on the process by which the software works out where alignment occurs, and the intention of the encoder. Secondly, the encoder may elect to represent the actual encoding using either
2183	SACSAL	attribute be specified in both English and French texts, since (as noted above) this attribute is defined as representing a mutual association. However, it may simplify processing to do so, and also avoids giving the impression that the English is translating the French, or vice versa. More seriously, this encoding does not make explicit that it is in fact the entire stretch of text between the anchors which is being aligned, not simply the points themselves. If for example one text contained material omitted from the other, this approach would not be appropriate.
2239	SACSXA	The preceding encoding of the alignment of parallel passages from two texts requires that those texts and the alignment all be part of the same document. If the texts are in separate documents, then complete URIs, whether absolute or relative (section
2240	SACSXA	), will be required. These external pointers may appear anywhere within the document, but if they are created solely for use in encoding links, they may for convenience be grouped within the
2250	SACSXA	Each topic covered in this work has three parts: a picture, a prose text in Latin describing the topic, and a carefully-aligned translation of the Latin into English, German, or some other vernacular. Key terms in the two texts are typographically distinct, and are linked to the picture by numbers, which appear in the two texts and within the picture as well.
2252	SACSXA	First, we consider the text portions. The English and Latin portions have been encoded as distinct
2299	SACSXA	Next we consider the non-textual parts of the page. Encoding this requires providing two distinct components: firstly a digitized rendering of the page itself, and secondly a representation of the areas within that image which are to be aligned. In section
2309	SACSXA	This example of SVG defines two rectangles at the locations with the specified x and y coordinates. A view is defined on these, enabling them to be mapped by an SVG processor to the image found at the URL specified (
2312	SACSXA	; for further discussion of using non-TEI XML vocabularies such as SVG within a TEI document, see section
2315	SACSXA	As printed, the Comenius text exhibits three kinds of alignment.
2321	SACSXA	Particular words or phrases are marked as terms in the two languages by a change of rendition: the English text, which otherwise uses black letter type throughout, has the words
2339	SACSXA	Numbered labels appear within the text portions, linking keywords to each other and to sections of the picture. These labels, which have been left out of the above encoding, are attached to the first, third, and last segments in each language quoted below, and also appear (rather indistinctly) within the picture itself. Thus, the images of the study, the student, and his books are each aligned with the correct term for them in the two languages.
2375	SACSXA	This map, of course, only aligns whole segments and image portions, since these are the only parts of our encoding which bear identifiers and can therefore be pointed to. To add to it the alignment between the typographically distinct words mentioned above, new elements must be defined, either within the text itself or externally by using stand off techniques. Encoding these word pairs as
2379	SACSXA	, although intuitively obvious, requires a non-trivial decision as to whether the Latin text is glossing the English, or vice versa. Tagging all the marked words as
2381	SACSXA	avoids the difficult decision, but might be thought by some encoders to convey the wrong information about the words in question. Simply tagging them as additional embedded
2385	SACSXA	These solutions all require the addition of further markup to the text. This may pose no problems, or it may be infeasible, for example because the text is held on a read-only medium. If it is not feasible to add more markup to the original text, some form of stand-off markup will be needed. Any item within the text that can be pointed to using the various pointer schemes discussed in this chapter may be used, not simply those which rely on the existence of an
2410	SACSXA	To express the same alignment mentioned above, we could use an XPath expression to identify the required
2422	SACSXA	correspond, we might express the link between them as follows:
2429	SASY	In the previous section we discussed two particular kinds of alignment: alignment of parallel texts in different languages; and alignment of texts and portions of an image. In this section we address another specialized form of alignment: synchronization. The need to mark the relative positions of text components with respect to time arises most naturally and frequently in transcribed spoken texts, but it may arise in any text in which quoted speech occurs, or events are described within a time frame. The methods described here are also generalizable for other kinds of alignment (for example, alignment of text elements with respect to space).
2434	SASYNC	Provided that explicit elements are available to represent the parts or places to be synchronized, then the global linking attribute
2443	SASYNC	elements may be used to make explicit the fact that the synchronous elements are aligned.
2445	SASYNC	To illustrate the use of these mechanisms for marking synchrony, consider the following representation of a spoken text:
2447	SASYNC	B: The first time in twenty five years, we've cooked Christmas (unclear) for a blooming great load of people. A: So you're [1] (unclear) [2] B: [1] It will be [2] nice in a way, but, [3] be strange. [4] A: [3] Yeah [4], yeah, cos it, it's [5] the [6] B: [5] not [6]
2456	SASYNC	To encode this we use the spoken texts module, described in chapter
2516	SASYNC	As with other forms of alignment, synchronization may be expressed between stretches of speech as well as between points. When complete utterances are synchronous, for example, if one person says
2529	SASYNC	(where one speaker starts speaking before another has finished) is thus to use the
2548	SASYNC	element and the content of a
2550	SASYNC	element, and between the content of an
2563	SASYMP	A synchronous alignment specifies which points in a spoken text occur at the same time, and the order in which they occur, but does not say at what time those points actually occur. If that information is available to the encoder it can be represented by means of the
2573	SASYMP	attribute, whose value is a string which specifies a particular time, or indirectly by means of the
2579	SASYMP	is used, then the
2583	SASYMP	attributes should also be used to indicate the amount of time that has elapsed since the time specified by the element pointed to by the
2585	SASYMP	attribute; the value
2591	SASYMP	elements are uniformly spaced in time, then the
2599	SASYMP	elements. If the intervals vary, but the units are all the same, then the
2615	SASYMP	element which specifies the reference or origin for the timings within the
2617	SASYMP	; this must, of course, specify its position in time absolutely. If the origin of a timeline is unknown, then this attribute may be omitted.
2643	SASYMP	To avoid the need for two distinct link groups (one marking the synchronization of anchors with each other, and the other marking their alignment with points on the time line) it would be better to link the
2656	SASYMP	Finally, suppose that a digitized audio recording is also available, together with an XML file that assigns identifiers to the various temporal spans of sound. For example, the following Synchronized Multimedia Integration Language (SMIL, pronounced "smile") fragment:
2682	SAIE	, that is, an element which is not explicitly present in a text, but the presence of which an application can infer from the encoding supplied. In this section, we are concerned with virtual elements made by simply cloning existing elements. In the next section (
2685	SAIE	Provided that explicit elements are available to represent the parts or places to be linked, then the global linking attributes
2694	SAIE	It is useful to be able to represent the fact that one element of text is identical to others, for analytical purposes, or (especially if the elements have lengthy content) to obviate the need to repeat the content. For example, consider the repetition of the
2708	SAIE	element above has identical content to the first. The
2710	SAIE	attribute is provided for this purpose. Using it, we could recode the last line of the above example as follows:
2716	SAIE	attribute may be used to document the fact that two elements have identical content. It may be regarded as a special kind of link. It should only be attached to an element with identical content to that which it targets, or to one the content of which clearly designates it as a repetition, such as the word
2720	SAIE	in the representation of the chorus of a song, the second time it is to be sung. The relation specified by the
2722	SAIE	attribute is symmetric: if a chorus is repeated three times and each repetition bears a
2728	SAIE	attribute is used in a similar way to indicate that the content of the element bearing it is identical to that of another. The difference is that the content is not itself repeated. The effect of this attribute is thus to create a
2730	SAIE	of the element indicated. Using this attribute, the repeated date in the first example above could be recoded as follows:
2732	SAIE	An application program should replace whatever is the actual content of an element bearing a
2734	SAIE	attribute with the content of the element specified by it. If the content of the element specified includes other elements, these will become embedded within the element bearing the attribute. Care must be taken to ensure that the document is valid both before and after this embedding takes place. If, for example, the element bearing a
2736	SAIE	attribute requires a mandatory sub-component, then this component must be present (though possibly empty), even though it will be replaced by the content of the targeted element.
2790	SAAG	Because of the strict hierarchical organization of elements, or for other reasons, it may not always be possible or desirable to include all the parts of a possibly fragmented text segment within a single element. In section
2791	SAAG	we introduced the notion of an intermediate pointer as a way of pointing to discontinuous segments of this kind. In this section we first describe another way of linking the parts of a discontinuous whole, using a set of linking attributes, which are made available for any tag by following the procedure described at the beginning of this chapter. We then describe how the
2795	SAAG	element, which is a special-purpose linking element specifically for representing the aggregation of parts, and the
2801	SAAG	The linking attributes for aggregation are
2814	SAAG	Here is the material on which we base our first illustration of the use of these mechanisms. Our problem is to represent the s-units identified below as
2844	SAAG	attributes, we can link the s-units with identifiers
2854	SAAG	Double linking of the two s-units, as illustrated by the last of these encodings, is equivalent to specifying a
2862	SAAG	attribute with a value of
2863	SAAG	join
2864	SAAG	to specify that the link is to be understood as joining its targets into a single aggregate.
2871	SAAG	join
2883	SAAG	element within a text is significant: it must be supplied at a position where the element indicated by its
2893	SAAG	As a further example, consider the following list of authors' names. The object of the
2895	SAAG	element here is to provide another list, composed of those authors from the larger list who happen to come from Heidelberg:
2917	SAAG	can be used to reconstruct a text cited in fragments presented out of order. The poem being remembered (an unusual translation of a well-known poem by Basho) runs
2958	SAAG	is available for use when a number of
2964	SAAG	if they are all of the same type, and also allows us to restrict the domain within which their target elements are to be found, in the same way as for
2971	SAAG	may appear only where the elements represented by its contents are legal. Thus if we had created many
2973	SAAG	tags of the sort just described, we could group them together, and require that their components are all contained by an element with the identifier
2985	SAAG	). It may also be used as a convenient way of representing a variety of analytic units, like the
2998	SAAG	And then he added,
3011	SAAG	Suppose now that we wish to represent an interpretation of the above passage in which we distinguish between the various
3015	SAAG	attribute has been used for this purpose; its value on each occasion supplies a pointer to the
3017	SAAG	to which each speech is attributed. (For convenience in this example, we use simply the first occurrence of the names used for each voice as the target for these pointers.) Note also that we add
3019	SAAG	attributes to each distinct speech fragment, which we can then use to link the material spoken by each voice:
3060	SAAG	s making up the
3068	SAAG	value for them.
3147	SAAT	if any of those elements could be present in a text, but one and only one of them is; in addition, we say that those elements are
3151	SAAT	if at least one (and possibly more) of them is present. The elements that are in alternation may also be called
3155	SAAT	The need to mark exclusive alternation arises frequently in text encoding. A common situation is one in which it can be determined that exactly one of several different words appears in a given location, but it cannot be determined which one. One way to mark such an exclusive alternation is to use the linking attribute
3157	SAAT	. Having marked an exclusive alternation, it can sometimes later be determined which of the alternants actually appears in the given location. To preserve the fact that an alternation was posited, one can add the linking attribute
3159	SAAT	to a tag which hierarchically encompasses the alternants, which points to the one which actually appears. To assign responsibility and degree of certainty to the choice, one can use the
3161	SAAT	tag described in chapter
3162	SAAT	. Also see that chapter for further discussion of certainty in general.
3172	SAAT	A more general way to mark alternation, encompassing both exclusive and inclusive alternation, is to use the linking element
3174	SAAT	. The description and attributes of this tag and of the associated grouping tag
3180	SAAT	To take a simple hypothetical example, suppose in transcribing a spoken text, we encounter an utterance that we can understand either as
3193	SAAT	If it is then determined that the speaker said
3197	SAAT	, the encoder could amend the text by deleting the alternant containing
3203	SAAT	value to the
3205	SAAT	attribute value on the
3225	SAAT	seg type="word"
3227	SAAT	seg type="character"
3252	SAAT	, but is certain that if it is
3254	SAAT	, then the other uncertain word is definitely
3290	SAAT	The value of the
3292	SAAT	attribute is defined as a list of identifiers; hence it can also be used to narrow down the range of alternants, as in:
3302	SAAT	element tag appears, and is thus equivalent to just the alternation of those two tags:
3311	SAAT	attribute can also be used in case there is uncertainty about the tag that appears in a certain position. For example, the occurrence of the word
3315	SAAT	can be interpreted, in the absence of other information, either as a person's name or as a date. The uncertainty can be rendered as follows, using the
3326	SAAT	; this avoids having to repeat the content of the element whose correct tagging is in doubt.
3341	SAAT	element in the body of a document, or as the first
3358	SAAT	attribute, if used, would appear on the
3384	SAAT	Now we define the specialized linking element
3407	SAAT	, which is to be used if one wishes to assign
3409	SAAT	to the targets (alternants). Its value is a list of numbers, corresponding to the targets, expressing the probability that each target appears.
3410	SAAT	If the alternants are mutually exclusive, then the weights must sum to 1.0.
3467	SAAT	alt mode="incl"
3472	SAAT	is the number of targets. If the sum is 0%, then the alternation is equivalent to exclusive alternation; if the sum is (100 x k)%, then all of the alternants must appear, and the situation is better encoded without an
3486	SAAT	attribute defaults to the value
3498	SAAT	, but that if the first word is
3500	SAAT	, then the third word is
3502	SAAT	. Now suppose we have the following additional information: if
3504	SAAT	occurs, then the probability that
3508	SAAT	occurs is 50%; if
3510	SAAT	occurs, then the probability that
3530	SAAT	As noted above, when the
3534	SAAT	has the value
3536	SAAT	, then each weight states the probability that the corresponding alternative occurs, given that at least one of the other alternatives occurs.
3546	SAAT	Another very similar example is the following regarding the text of a Broadway song. In three different versions of the song, the same line reads
3552	SAAT	The variant readings are found in the commercial sheet music, the performance score, and the Broadway cast recording.
3564	SAAT	Let us extend the example with a further (imaginary) variation, supposing for the sake of the argument that the next line is variously given as
3570	SAAT	element, we can express the conviction that if the first choice for the second line is correct, then the probability that the first line contains
3572	SAAT	is 90%, and each of the others 5%; whereas if the second choice for the second line is correct, then the probability that the first line contains
3616	SASOin	Most of the mechanisms defined in this chapter rely to a greater or lesser extent on the fact that tags in a marked-up document can both assert a property for a span of text which they enclose, and assert the existence of an association between themselves and some other span of text elsewhere. In stand-off markup, there is a clear separation of these two behaviours: the markup does not directly contain any part of the text, but instead includes it by reference. One specific mechanism recommended by these Guidelines for this purpose is the standard XInclude mechanism defined by the W3C; another is to use pointers as demonstrated elsewhere in this chapter.
3618	SASOin	There are many reasons for using stand-off markup: the source text might be read-only so that additional markup cannot be added, or a single text may need to be marked up according to several hierarchically incompatible schemes, or a single scheme may need to accommodate multiple hierarchical ambiguities, so that a single markup tree is not the most faithful representation of the source material.
3628	SASOin	source document
3631	SASOin	a document to which the stand-off markup refers (a source document can be either XML or plain text); there may be more than one source document.
3637	SASOin	markup that is already present in an XML source document
3643	SASOin	markup that is either outside of the source document and points in to it to the data it describes, or alternatively is in another part of the source document and points elsewhere within the document to the data it describes
3649	SASOin	a document that contains stand-off markup that points to a different, source document
3655	SASOin	the action of creating a new XML document with external markup and data integrated with the source document data, and possibly some source document markup as well
3661	SASOin	a process applied to markup from a pre-existing XML document, which splits it into two documents, an XML (external) document containing some of the markup of the original document, and another (source) XML document containing whatever text content and markup has not been extracted into the stand-off document; if all markup has been externalized from a document, the new source may be a plain text document
3667	SASOin	any valid TEI markup can be either internal or external,
3669	SASOin	external markup can be internalized by applying it to the document content by either substituting the existing markup or adding to it, to form a valid TEI document, and
3679	SASOov	Stand-off markup which relies on the inclusion of virtual content is adequately supported by the W3C XInclude recommendation, which is also recommended for use by these Guidelines.
3680	SASOov	The version on which this text is based is the
3685	SASOov	XInclude defines a namespace (
3695	SASOov	discussed elsewhere in this chapter to point to the actual fragments of text to be internalized. Although XInclude only requires support for the
3700	SASOov	XInclude is a W3C recommendation which specifies a syntax for the inclusion within an XML document of data fragments placed in different resources. Included resources can be either plain text or XML. XInclude instructions within an XML document are meant to be replaced by a resource targeted by a URI, possibly augmented by an XPointer that identifies the exact subresource to be included.
3706	SASOov	attribute to specify the location of the resource to be included; its value is a URI containing, if necessary, an XPointer. Additionally, it uses the
3709	SASOov	text
3712	SASOov	) to specify whether the included content is plain text or an XML fragment, and the
3714	SASOov	attribute to provide a hint, when the included fragment is text, of the character encoding of the fragment. An optional
3718	SASOov	; it specifies alternative content to be used when the external resource cannot be fetched for some reason. Its use is not however recommended for stand-off markup.
3722	SASOso	Stand-off Markup in TEI
3726	SASOso	internalization of one or more source documents' content into a stand-off document. TEI use of XInclude for stand-off markup enables use of XInclude-conformant software to perform this useful operation. However, internalization is not clearly defined for all stand-off files, because the structure of the internal and external markup trees may overlap. In particular, when an external markup document selects a range that overlaps partial elements in the source document, it is not clear how the semantics of internalization (inclusion) should work, since partial elements are not XML objects.
3728	SASOso	XInclude defines a semantics for this case that involves only complete elements.
3730	SASOso	When a range selection partially overlaps a number of elements in a source document, XInclude specifies that the partially overlapping elements should be included as well as all completely overlapping elements and characters (partially overlapping characters are not possible). The effect of this is that elements that straddle the start or end of a selected range will be included as wrappers for those of their children that are completely or partially selected by the range. For example, given the following source document:
3746	SASOso	The result of the inclusion is two paragraph elements, while the original range designated in the source document overlapped two paragraph fragments.
3747	SASOso	The semantics of XInclude require the creation of well-formed XML results even though the pointing mechanisms it uses do not necessarily respect the hierarchical structure of XML documents, as in this case. While this is a good way to ensure that internalization is always possible, it has implications for the use of XInclude as a notation for the
3751	SASOso	When overlapping hierarchies need to be represented for a single document, each hierarchy must be represented by a separate set of XInclude tags pointing to a common source document. This sort of structure corresponds to common practice in work with linguistic text corpora. In such corpora, each potentially overlapping hierarchy of elements for the text is represented as a separate stream of stand-off markup. Generally the source text contains markup for the smallest significant units of analysis in the corpus, such as words or morphemes, this information and its markup representing a layer of common information that is shared by all the various hierarchies. As a way of organizing the representation of complex data, this technique generally allows a large number of
3753	SASOso	attributes to be attached to the shared elements, providing robust anchors for links and facilitating adjustments to the source document without breaking external documents that reference it.
3756	SASOso	Any tag can be externalized by
3757	SASOso	removing its content and replacing it with an
3761	SASOso	For instance the following portion of a TEI document:
3777	SASOso	can be externalized by placing the actual text in a separate document, and providing exactly the same markup with the
3793	SASOso	Please note that this specification requires that the XInclude namespace declaration is present in all cases. The
3795	SASOso	element contains text or XML fragments to be placed in the document if the inclusion fails for any reason (for instance due to inaccessibility of an external resource). The
3797	SASOso	element is optional; if it is not present an XInclude processor must signal a fatal error when a resource is not found. This is the preferred behaviour for use with stand-off markup. These Guidelines recommend against the use of
3805	SASOva	The whole source fragment identified by an XInclude element, as well as any markup contained therein, is inserted in the position specified, and an XInclude processor is required to ensure that the resulting internalized document is well-formed. This has obvious implications when the external document contains XML markup. A plain text source document will always create a well-formed internalized document.
3807	SASOva	While a TEI customization may permit
3809	SASOva	elements in various places in a TEI document instance, in general these Guidelines suggest that validity be verified after the resolution of all the
3817	SASOfr	When the source text is plain text the overall form of the XPointer pointing to it is of minimal importance. The form of the XPointer matters considerably, on the other hand, when the source document is XML.
3819	SASOfr	In this case, it is rather important to distinguish whether we intend to substitute the source XML with the new one, or just to add new markup to it. The XPointers used in the references can express both cases.
3851	SASOfr	will select the whole poem, text content
3857	SASOfr	hypertext links (NB: in XPointer whitespace-only text nodes count).
3863	SASOfr	will only select the text of the poem, with no markup inside.
3881	SAAN	and elsewhere, provision is made for analytic and interpretive markup to be represented outside of textual markup, either in the same document or in a different document. The elements in these separate domains can be connected, either with the pointing attributes
3884	SAAN	analysis
3904	linking	Linking, segmentation and alignment
3913	SAref	The selection and combination of modules to form a TEI schema is described in
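
The SASOso and SASOva rows above describe externalizing a tag by replacing its content with an XInclude element, and internalizing it again with an XInclude processor. The following is a minimal sketch of that round trip, assuming Python's standard xml.etree.ElementInclude module as the processor. The in-memory SOURCES table, the file name source.txt, the sample text, and the helper names load_source and internalize are all hypothetical and do not come from the Guidelines. This module resolves only whole-resource inclusions (no XPointer ranges), so the sketch covers only the plain-text case.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_NS = "http://www.w3.org/2001/XInclude"
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical source documents, keyed by the href used to include them.
SOURCES = {"source.txt": "Sample text of the source document."}


def load_source(href, parse, encoding=None):
    """Loader handed to ElementInclude; returns None for unknown resources."""
    if parse == "text":
        return SOURCES.get(href)
    return None


# External (stand-off) markup: the <l> element holds no text of its own,
# only a pointer to the plain-text source.
STANDOFF = (
    f'<l xmlns="{TEI_NS}" xmlns:xi="{XI_NS}">'
    '<xi:include href="source.txt" parse="text"/>'
    "</l>"
)


def internalize(standoff_xml):
    """Parse stand-off markup and resolve its xi:include children in place."""
    root = ET.fromstring(standoff_xml)
    ElementInclude.include(root, loader=load_source)
    return root


line = internalize(STANDOFF)
print(line.text)
```

No fallback element is supplied, in line with the recommendation quoted above. In this sketch a missing resource makes the loader return None, which ElementInclude reports as a FatalIncludeError rather than silently continuing.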

USE.xml#13163

#	id	text
2	USE	Using the TEI
4	USE	This section discusses some technical topics concerning the deployment of the TEI markup scheme documented elsewhere in these Guidelines.
6	USE	we discuss the scope and variety of the TEI customization mechanisms, distinguishing between
8	USE	modifications, which result in a schema that supports a subset of the distinctions made in the full TEI system, on the one hand, from
12	USE	TEI Conformance
13	USE	, distinguishing documents which are algorithmically TEI-conformant ("TEI-conformable") from those which are intrinsically conformant ("TEI-conformant"); we also define the concept of a TEI extension. Since the ODD markup description language defined in chapter
14	USE	is fundamental to the way conformance and customization are handled in the TEI system, these two definitional sections are followed by a section (
20	MEDIATYPE	Serving TEI files with the TEI Media Type
22	MEDIATYPE	In February 2011, the media type
28	MEDIATYPE	). We recommend that any XML file whose root element is in the TEI namespace be served with the media type
30	MEDIATYPE	to enable and encourage automated recognition and processing of TEI files by external applications.
33	DT	Obtaining the TEI Schemas
36	DT	, the modules making up the TEI scheme are generated from a single set of XML source files. Schemas can be generated for TEI customizations in each of XML DTD language, W3C schema language, and RELAX NG schema language. In the body of the Guidelines, only the latter form is presented, using the compact syntax.
38	DT	The TEI schemas and Guidelines are widely available over the Internet and elsewhere. The canonical home for the TEI source, the schema fragments generated from it, and example modifications, is the TEI repository at
39	DT	; versions are also available in other formats, along with copies of the Guidelines and related materials, from the TEI web site at
46	MD	These Guidelines provide an encoding scheme suitable for encoding a very wide range of texts, and capable of supporting a wide variety of applications. For this reason, the TEI scheme supports a variety of different approaches to solving similar problems, and also defines a much richer set of elements than is likely to be necessary in any given project. Furthermore, the TEI scheme may be extended in well-defined and documented ways for texts that cannot be conveniently or appropriately encoded using what is provided. For these reasons, it is almost impossible to use the TEI scheme without customizing or personalizing it in some way.
48	MD	This section describes how the TEI encoding scheme may be customized, and should be read in conjunction with chapter
49	MD	, which describes how a specific application of the TEI encoding scheme should be documented. The documentation system described in that chapter is, like the rest of the TEI scheme, independent of any particular schema or document type definition language.
51	MD	Formally speaking, these Guidelines provide both syntactic rules about how elements and attributes may be used in valid documents and semantic recommendations about what interpretation should be attached to a given syntactic construct. In this sense, they provide both a
56	MD	TEI Abstract Model
57	MD	, which defines a set of related concepts, and the
58	MD	TEI schema
59	MD	which defines a set of syntactic rules and constraints. Many (though not all) of the semantic recommendations are provided solely as informal descriptive prose, though some of them are also enforced by means of such constructs as datatypes (see
62	MD	them in the sense of attaching slightly variant semantics to them.
68	MD	which can take on arbitrary string values, depending on how it is used in a document. A new type of
69	MD	note
70	MD	, therefore, requires no change in the existing model. On the other hand, for many applications, it may be desirable to constrain the possible values for the
72	MD	attribute to a small set of possibilities. A schema modified in this way would no longer necessarily regard as valid the same set of documents as the corresponding unmodified TEI schema, but would remain faithful to the same conceptual model.
74	MD	This section explains how the TEI scheme can be customized by suppressing elements, modifying classes of elements, adding elements, and renaming elements. Documents which validate against an application of the TEI scheme which has been customized in this way may or may not be considered
79	MD	The TEI scheme is designed to support modification and customization in a documented way that can be validated by an XML processor. This is achieved by writing a small TEI-conformant document, from which an appropriate processor can generate both human-readable documentation, and a schema expressed in a language such as RELAX NG or DTD. The mechanisms used to instantiate a TEI schema differ for different schema languages, and are therefore not defined here. In XML DTDs, for example, extensive use is made of parameter entities, while in RELAX NG schemas, extensive use is made of patterns. In either case, the names of elements and, wherever possible, their attributes and content models are defined indirectly. The syntax used to implement this indirection also varies with the schema language used, but the underlying constructs in the TEI Abstract Model are given the same names.
82	MD	, the TEI encoding scheme comprises a set of class and macro declarations, and a number of
84	MD	. Each module is made up of element and attribute declarations, and a schema is made by combining a particular set of modules together. In the absence of any other kind of personalization, when modules are combined together:
88	MD	each such element is identified by the canonical name given it in these Guidelines;
90	MD	the content model of each such element is as defined by these Guidelines;
94	MD	the elements comprising element classes and the meaning of macro declarations expressed in terms of element classes are determined by the particular combination of modules selected.
95	MD	The TEI personalization mechanisms allow the user to control this behaviour as follows:
97	MD	particular elements may be suppressed, removing them from any classes in which they are members, and also from any generated schema;
99	MD	within certain limits, the name (generic identifier) associated with an element may be changed, without changing the semantic or syntactic properties of the element;
101	MD	new elements may be added to an existing class, thus making them available in macros or content models defined in terms of those classes;
103	MD	additional attributes, or attribute values, may be specified for an individual element or for classes of elements;
105	MD	within certain limits, attributes, or attribute values, may also be removed either from an individual element or from classes of elements;
107	MD	the characteristics inherited by one class from another class may be modified by modifying its class membership: all members of the class then inherit the changed characteristics;
109	MD	the set of values legal for an attribute or attribute class may be constrained or relaxed by supplying or modifying a value list, or by modifying its datatype.
114	MD	; in the remainder of this section we give specific examples to illustrate how that system may be applied. An ODD processor, such as the Roma application supported by the TEI, or any other comparable set of stylesheets, will use the declarations provided by an ODD to generate appropriate sets of declarations in a specific schema language such as RELAX NG or the XML DTD language. We do not discuss in detail here how this should be done, since the details are schema language-specific; some background information about the methods used for XML DTD and RELAX NG schema generation is however provided in section
115	MD	. Several example ODD files are also provided as part of the standard TEI release: see further section
126	MDMD	modification of content models;
135	MDMD	Each kind of modification changes the set of documents that will be considered valid according to the resulting schema. Any combination of unchanged TEI modules may be thought of as defining a certain set of documents. Each schema resulting from a modified combination of TEI modules will define a different set of documents. The set of documents valid according to the unmodified schema may or may not be properly contained in the set of documents considered to be valid according to the modified schema. We use the term
137	MDMD	to describe a modification which regards as valid a subset of the documents considered valid by the same combination of TEI modules unmodified. Alternatively, the set of documents considered valid by the original schema might be disjoint from the set of documents considered valid by the modified schema, with neither being properly contained by the other. Modifications that have this result are called
141	MDMD	Cleanliness can only be assessed with reference to elements in the TEI namespace.
145	MDMDSU	The simplest way to modify the supplied modules is to suppress one or more of the supplied elements. This is simply done by setting the
153	MDMDSU	For example, if the
158	MDMDSU	attribute here supplies the canonical name of the element to be deleted, the
162	MDMDSU	attribute specifies what is to be done with it. Note that the module name must be supplied explicitly, and that the schema specification in which this declaration appears must also contain a reference to the module itself. The full specification for a schema in which this modification is applied would thus be something like the following:
169	MDMDSU	In most cases, deletion is a clean modification, since most elements are optional. Documents that are valid with respect to the modified schema are also valid according to the unmodified schema. To say this another way, the set of documents matching the new schema is contained by the set of documents matching the original schema.
171	MDMDSU	There are however some elements in the TEI scheme which have mandatory children; for example, the element
185	MDMDSU	In general, whenever the element deleted by a modification is mandatory within the content model of some other (undeleted) element, the result is an unclean modification, and may also break the TEI Abstract Model (
186	MDMDSU	). However, the parent of a mandatory child can be safely removed if it is itself optional.
188	MDMDSU	To determine whether or not an element is mandatory in a given context, the user must inspect the content model of the element concerned. In most cases, content models are expressed in terms of model classes rather than elements; hence, removing an element will generally be a clean modification, since there will generally be other members of the class available. If a class is completely depopulated by a modification, then the cleanliness of the modification will depend upon whether or not the class reference is mandatory or optional, in the same way as for an individual element.
193	MDMDNM	Every element and other named markup construct in the TEI scheme has a
194	MDMDNM	canonical name
195	MDMDNM	, usually in the English language: this name is supplied as the value of the
205	MDMDNM	used to define it. The element or attribute declaration used within a schema generated from that specification may however be different, thus permitting schemas to be written using elements with generic identifiers from a different language, or otherwise modified. There may be many alternative identifiers for the same markup construct, and an ODD processor may choose which of them to use for a given purpose. Each such alternative name is supplied by means of an
220	MDMDNM	now takes the value
221	MDMDNM	change
222	MDMDNM	to indicate that those parts of the element specification not supplied are to be inherited from the standard definition. The content of the
224	MDMDNM	element will be used in place of the canonical
226	MDMDNM	value in the schema generated.
230	MDMDNM	modification. Although it is an inherently unclean modification (because the set of documents matched by the resulting schema is disjoint from the set matched by its unmodified equivalent), the process of converting any document in which elements have been renamed into an exactly equivalent document using canonical names is completely deterministic, requiring only access to the ODD in which the renaming has been specified. This assumes that the renamed elements used are not placed in the TEI namespace but either use a null namespace or some user-defined namespace, as further discussed in
231	MDMDNM	; if this is not the case, care must be taken to avoid name collision between the new name and all existing TEI names. Furthermore, unclean modifications which do not specify a namespace are not conformant (see further
234	MDMDNM	The TEI provides a systematic set of renamings into languages other than English. These all use a language-specific namespace.
239	MDMDCM	The content model for an element in the TEI scheme is defined by means of a
243	MDMDCM	which specifies it. As shown elsewhere in these Guidelines, the content model is defined using RELAX NG syntax, whether the resulting schema is expressed in RELAX NG or in some other schema language.
254	MDMDCM	This indicates that the content model contains declarations taken from the RELAX NG namespace, and that it consists of a reference to a pattern called
256	MDMDCM	. Further examination shows that this pattern in turn expands to an optional repeatable alternation of text (
258	MDMDCM	) with references to three other classes (
264	MDMDCM	). For some particular application it might be preferable to insist that
276	MDMDCM	This is a clean modification which does not change the meaning of a TEI element; there is therefore no need to assign the element to some other namespace than that of the TEI, though it may be considered good practice; see further
279	MDMDCM	A change of this kind, which simplifies the possible content of an element by reducing its model to one of its existing components, is always clean, because the set of documents matched by the resulting schema is a subset of the set of documents which would have been matched by the unmodified schema.
281	MDMDCM	Note that content models are generally defined (as far as possible) in terms of references to model classes, rather than to explicit elements. This means that the need to modify content models is greatly reduced: if an element is deleted or modified, for example, then the deletion or modification will be available for every content model which references that element via its class, as well as those which reference it explicitly. For this reason it is not (in general) good practice to replace class references by explicit element references, since this may have unintended side effects.
283	MDMDCM	An unqualified reference to an element class within a content model generates a content model which is equivalent to an alternation of all the members of the class referenced. Thus, a content model which refers to the model class
285	MDMDCM	will generate a content model in which any one of the members of that class is equally acceptable. It is also possible to reference predefined content model fragments based on classes, such as
288	MDMDCM	a sequence containing no more than one of each member of the class
292	MDMDCM	Content model changes which are not simple restrictions on an existing model should be undertaken with caution. The set of documents matching the schema which results from such changes is likely to be disjoint from the set of documents matching the unmodified schema, and such changes are therefore regarded as unclean. When content models are changed or extended, care should be taken to respect the existing semantics of the element concerned as stated in the Guidelines. For example, the element
294	MDMDCM	is defined as containing a line of verse. It would not therefore make sense to redefine its content model so that it could also include members of the class
296	MDMDCM	: such a modification, although syntactically feasible, would not be regarded as TEI-conformant because it breaks the TEI Abstract Model.
307	MDMDAL	element. To add a new attribute to an element, the schema builder should therefore first check to see whether this attribute is already defined by some existing attribute class. If it is, then the simplest method of adding it will be to make the element in question a member of that class, as further discussed below. If this is not possible, then a new
320	MDMDAL	content
331	MDMDAL	Suppose, for example, that we wish to add two attributes to the
345	MDMDAL	element in fact has no local attributes defined for it at all: we will therefore need to add not only an
365	MDMDAL	The value supplied for the
370	MDMDAL	add
371	MDMDAL	; if this attribute already existed on the element we are modifying, this should generate an error, since a specification cannot have more than one attribute of the same name. If the attribute is already present, we can replace the whole of the existing declaration by supplying
373	MDMDAL	as the value for
375	MDMDAL	; alternatively, we can change some parts of an existing declaration only by supplying just the new parts, and setting
376	MDMDAL	change
377	MDMDAL	as the value for
381	MDMDAL	Because the new attribute is not defined by the TEI, we must specify a namespace for it on the
391	MDMDAL	The canonical name for the new attribute is
393	MDMDAL	, and is supplied on the
397	MDMDAL	element. In this simple example, we supply only a description and datatype for the new attribute; the former is given by the
402	MDMDAL	). The content of the
406	MDMDAL	element, uses patterns from the RELAX NG namespace, in this case to select one of the predefined TEI datatypes (
409	MDMDAL	It is often desirable to constrain the possible values for an attribute to a greater extent than is possible by simply supplying a TEI datatype for it. This facility is provided by the
413	MDMDAL	element. Suppose for example that, rather than supplying them as pointers to a bibliography, all that we wish to indicate about the source of our examples is that each comes from one of three predefined sources, which we call A, B, and C. A declaration like the following might be appropriate:
442	MDMDAL	supplied as part of any attribute in the TEI scheme.
444	MDMDAL	Depending on the modification, the set of documents matched by a schema generated from an ODD modified in this way may or may not be a subset of the set of documents matched by the unmodified schema. As such, it is difficult to tell in principle whether such modifications are intrinsically unclean.
449	MDMDCL	The concept of element classes was introduced in
450	MDMDCL	; an understanding of it is fundamental to successful use of the TEI scheme. As noted there, we distinguish
451	MDMDCL	model classes
453	MDMDCL	attribute classes
454	MDMDCL	, the members of which simply share a set of attributes.
458	MDMDCL	. All classes to which the element belongs must be specified within this, using a
462	MDMDCL	To add an element to a class in which it is not already a member, all that is needed is to supply a new
466	MDMDCL	element for the element concerned. For example, to add an element to the
477	MDMDCL	element is set to
478	MDMDCL	change
479	MDMDCL	(rather than its default value of
483	MDMDCL	element retains its membership of the two classes (
493	MDMDCL	defined in the core module is a member of two attribute classes,
510	MDMDCL	If the intention is to change the class membership of an element completely, rather than simply add or remove it to or from one or more classes, the value of
514	MDMDCL	can be set to
516	MDMDCL	(which is the default if no value is specified), indicating that the memberships indicated by its child
531	MDMDCL	attribute is set to
532	MDMDCL	change
537	MDMDCL	To change or remove attributes inherited from an attribute class for all members of the class (as opposed to specific members of that class), it is also possible to modify the class specification itself. For example, the class
561	MDMDCL	defining the attributes inherited through membership of this class has the value
562	MDMDCL	change
567	MDMDCL	The classes used in the TEI scheme are further discussed in chapter
568	MDMDCL	. Note in particular that classes are themselves classified: the attributes inherited by a member of attribute class A may come to it directly from that class, or from another class of which A is itself a member. For example, the class
570	MDMDCL	is itself a member of the classes
574	MDMDCL	. By default, these two classes are predefined as empty. However, if (for example) the
576	MDMDCL	module is included in a schema, a number of attributes (
584	MDMDCL	will then inherit these new attributes (see further section
593	MDMDCL	Such global changes should be undertaken with caution: in general, removing existing non-mandatory attributes from a class will be a clean modification, in the same way as removing non-mandatory elements. Adding a new attribute to a class, however, can be a clean modification only if the new attribute is labelled as belonging to some namespace other than the TEI.
595	MDMDCL	The same mechanisms are available for modification of model classes. Care should be taken when modifying the model class membership of existing elements since model class membership is what determines the content model of most elements in the TEI scheme, and a small change may have unintended consequences.
600	MDMDNE	Adding a completely new element to a schema involves providing a complete element specification for it, the
602	MDMDNE	element of which includes a reference to at least one TEI model class. Without such a reference, the new element will not be referenced by the content model of any other TEI element, and will therefore be inaccessible within a TEI document.
612	MDMDNE	. To add a fourth member (say
622	MDMDNE	The other parts of this declaration will typically include a description for the new element and information about its content model, its attributes, etc., as further described in
629	MDNS	All the elements defined by the TEI scheme are labelled as belonging to a single
630	MDNS	namespace
631	MDNS	, maintained by the TEI and with the URI
636	MDNS	used to represent TEI examples has its own namespace,
639	MDNS	Only elements which are unmodified or which have undergone a clean modification may use this namespace. In a TEI-conformant document, it is assumed that all attributes not explicitly labelled with a namespace (such as, for example
641	MDNS	) also belong to the TEI namespace, and are defined by the TEI.
643	MDNS	This implies that any other modification (including a renaming or reversible modification) must either specify a different namespace or specify no namespace at all. The
653	MDNS	Suppose, for example, that we wish to add a new attribute
655	MDNS	to the existing TEI element
657	MDNS	. In the absence of namespace considerations, this would be an unclean modification, since
659	MDNS	does not currently have such an attribute. The most appropriate action is to explicitly attach the new attribute to a new namespace by a declaration such as the following:
678	MDNS	is explicitly labelled as belonging to something other than the TEI namespace, we regard the modification which introduced it as clean. A namespace-aware processor will be able to validate those elements in the TEI namespace against the unmodified schema.
679	MDNS	Full namespace support does not exist in the DTD language, and therefore these techniques are available only to users of more modern schema languages such as RELAX NG or W3C XML Schema.
681	MDNS	Similar considerations apply when modification is made to the content model or some other aspect of an element, or when a new element is declared. Clean modification requires that all such changes be explicitly labelled as belonging to some non-TEI namespace or to no namespace at all.
685	MDNS	attribute is supplied on a
687	MDNS	element, it identifies the namespace applicable to all components of the schema being specified. Even if such a schema includes unmodified modules from the TEI namespace, the elements contained by such modules will now be regarded as belonging to the namespace specified on the
689	MDNS	. This can be useful if it is desired simply to avoid namespace processing. For example, the following schema specification results in a schema called
691	MDNS	which has no namespace, even though it comprises declarations from the TEI
698	MDNS	In addition to the TEI canonical namespace mentioned above, the TEI may also define namespaces for approved translations of the TEI scheme into other languages. These may be used as appropriate to indicate that a customization uses a standardized set of renamings. The namespace for such translations is the same as that for the canonical namespace, suffixed by the appropriate ISO language identifier (
699	MDNS	). A schema specification using the Chinese translation, for example, would use the namespace
705	MDDO	The elements used to define a TEI customization (
711	MDDO	, etc.) will typically be used within a TEI document which supplies further information about the intended use of the new schema, the meaning and application of any new or modified elements within it, and so on. This document will typically conform to a TEI (or other) schema which includes the module described in chapter
715	MDDO	Where the customization to be documented simply consists in a selection of modules, perhaps with some deletion of unwanted elements or attributes, the documentation need not specify anything further. Even here however it may be considered worthwhile to replace some of the semantic information provided by the unmodified TEI specification. For example, the
717	MDDO	element of an unmodified TEI
732	MDDO	elements are not required, or in which any other rule stated in these Guidelines is either not enforced or not enforceable. In fact, the mechanism, if used in an extreme way, permits replacement of all that the TEI has to say about every component of its scheme. Such revisions would result in documents that are not TEI-conformant in even the broadest sense, and it is not intended that encoders use the mechanism in this way. We discuss exactly what is meant by the concept of
733	MDDO	TEI conformance
739	MDlite	Several examples of customizations of the TEI are provided as part of the standard release. They include the following:
743	MDlite	The schema generated from this customization is the minimum needed for TEI Conformance. It provides only a handful of elements.
747	MDlite	The schema generated from this customization combines all available TEI modules, providing
752	MDlite	The schema generated from this customization combines all available TEI modules with three other non-TEI vocabularies, specifically MathML, SVG, and XInclude.
756	MDlite	It is unlikely that any project would wish to use any of these extremes unchanged. However, they form a useful starting point for customization, whether by removing modules from tei_all or tei_allPlus, or by replacing elements deleted from tei_bare. They also demonstrate how an ODD document may be constructed to provide a basic reference manual to accompany schemas generated from it.
758	MDlite	Shortly after publication of the first edition of these Guidelines, as a demonstration of how the TEI encoding scheme might be adopted to meet 90% of the needs of 90% of the TEI user community, the TEI editors produced a brief tutorial defining one specific
760	MDlite	modification of the TEI scheme, which they called TEI Lite. This tutorial and its associated DTD became very popular and are still available from the TEI web site at
761	MDlite	. The tutorial and associated schema specification are also included among the exemplars provided with TEI P5.
763	MDlite	The exemplars provided with TEI P5 also include a customization file from which a schema for the validation of other customization files may be generated. This ODD, called tei_odds, combines the four basic modules with the tagdocs, dictionaries, gaiji, linking, and figures modules as well as including the (non-TEI) module defining the RELAX NG language. This enables schemas derived from this customization file to validate examples contained within them in a number of ways, further described within the document.
771	CF	TEI Conformance
772	CF	is intended to assist in the description of the format and contents of a particular XML document instance or set of documents. It may be found useful in such situations as:
780	CF	specifying the form of documents to be produced by or for a given project.
782	CF	It is not intended to provide any other evaluation, for example of scholarly merit, intellectual integrity, or value for money. A document may be of major intellectual importance and yet not be TEI-conformant; a TEI-conformant document may be of no scholarly value whatsoever.
784	CF	In this section we explore several aspects of conformance, and in particular attempt to define how the term
786	CF	should be used. The terminology defined here should be considered normative: users and implementors of the TEI Guidelines should use the phrases
791	CF	TEI Extension
796	CF	if it:
802	CF	TEI Schema
803	CF	, that is, a schema derived from the TEI Guidelines (
806	CF	conforms to the TEI Abstract Model (
810	CF	TEI Namespace
817	CF	) which refers to the TEI Guidelines
821	CF	A document is said to be
823	CF	if it is a well-formed XML document which can be transformed algorithmically and automatically into a TEI-conformant document as defined above without loss of information. Such a document may informally be described as TEI-conformant; the terms
829	CF	A document is said to use a
830	CF	TEI Extension
831	CF	if it is a well-formed XML document which is valid against a TEI Schema which contains additional distinctions, representing concepts not present in the TEI Abstract Model, and therefore not documented in these Guidelines. Such a document cannot, in general, be algorithmically conformant since it cannot be automatically transformed without loss of information. However, since one of the goals of the TEI is to support extensions and modifications, it should not be assumed that no TEI document can include extensions: an extension which is expressed by means of the recommended mechanisms is also a TEI document provided that those parts of it which are not extensions are TEI-conformant, or -conformable.
833	CF	A TEI-conformant (or -conformable) document is said to follow
834	CF	TEI Recommended Practice
844	CFWF	. Other ways of representing the concepts of the TEI Abstract Model are possible, and other representations may be considered appropriate for use in particular situations (for example, for data capture, or project-internal processing). But such alternative representations are at best
851	CFWF	A TEI-conformant document must use the TEI namespace, and therefore must also include an XML-conformant namespace declaration, as defined below (
854	CFWF	The use of XML greatly reduces the need to consider hardware or software differences between processing environments when exchanging data. No special packing or interchange format is required for an XML document, beyond that defined by the W3C recommendations, and no special
856	CFWF	format is therefore proposed by these Guidelines. For discussion of encoding issues that may arise in the processing of special character sets or non-standard writing systems, see further chapter
861	CFWF	document, as being a well-formed document which matches a specific set of rules or syntactic constraints, defined by a
863	CFWF	. As noted above, TEI conformance implies that the schema used to determine validity of a given document should be derived from the present Guidelines, by means of an ODD which references and documents the schema fragments which the Guidelines define.
870	CFVL	documents must validate against a schema file that has been derived from the published TEI Guidelines, combined and documented in the manner described in section
872	CFVL	TEI Schema
875	CFVL	The TEI does not mandate use of any particular schema language, only that this schema
880	CFVL	TEI ODD file
881	CFVL	that references the TEI Guidelines. Currently available tools permit the expression of schemas in any or all of the XML DTD language, W3C XML Schema, and RELAX NG (both compact and XML formats). Some of what is syntactically possible using the ODD formalism cannot be represented by all schema languages; and there are some features of some schema languages which have no counterpart in ODD. No single schema language fully captures all the constraints implied by conformance to the TEI Abstract Model. A document which is valid according to a TEI schema represented using one schema language may not be valid against the same schema expressed in other languages; in particular the DTD language does not fully support namespaces. Features which cannot be represented in all schema languages are documented in chapters
886	CFVL	, many varieties of TEI schema are possible and not all of them are necessarily
888	CFVL	; derivation from an ODD is a necessary but not a sufficient condition for TEI Conformance.
892	CFAM	Conformance to the TEI Abstract Model
895	CFAM	TEI Abstract Model
896	CFAM	is the conceptual schema instantiated by the TEI Guidelines. These Guidelines define, both formally and informally, a set of abstract concepts such as
902	CFAM	s do not contain
904	CFAM	s. These Guidelines also define classes of elements, which have both semantic and structural properties in common. Those semantic and structural properties are also a part of the TEI Abstract Model; the class membership of an existing TEI element cannot therefore be changed without changing the model. Elements can however be removed from a class by deletion, and new non-TEI elements within their own namespaces can be added to existing TEI classes.
908	CFAMsc	It is an important condition of TEI conformance that elements defined in the TEI Guidelines as having one specific meaning should not be used with another. For example, the element
910	CFAMsc	is defined in the TEI Guidelines as containing a line of verse. A schema in which it is redefined to mean a typographic line, or an ordered queue of objects of some kind, cannot therefore be TEI-conformant, whatever its other properties.
912	CFAMsc	The semantics of elements defined in the TEI Guidelines are conveyed in a number of ways, ranging from formally verifiable datatypes to informal descriptive prose. In addition, a mapping between TEI elements and concepts in other conceptual models may be provided by the
916	CFAMsc	A schema which shares equivalent concepts to those of the TEI conceptual model may be mappable to the TEI Schema by means of such a mechanism. For example, the concept of paragraph expressed in the TEI scheme by the
920	CFAMsc	element. In this respect (though not in others) a DocBook-conformant document might therefore be considered to be TEI-conformable. Such areas of overlap facilitate interoperability, because elements from one namespace may be readily integrated with those from another, but do not affect the definition of conformance.
922	CFAMsc	A document is said to conform to the
923	CFAMsc	TEI Abstract Model
924	CFAMsc	if features for which an encoding is proposed by the TEI Guidelines are encoded within it using the markup and other syntactic properties defined by means of a valid
926	CFAMsc	schema. Hence, even though the names of elements or attributes may vary, a TEI-conformant document must respect the TEI Semantic Model, and be valid with respect to a TEI-conformant Schema. Although it may be possible to transform a document which follows the
927	CFAMsc	TEI Abstract Model
934	CFAMmc	Mandatory Components of a TEI Document
958	CFAMmc	in the case of a corpus or collection, a single overall
960	CFAMmc	element followed by a series of
973	CFAMmc	This should include the title of the TEI document expressed using a
979	CFAMmc	This should include the place and date of publication or distribution of the TEI document, expressed using the
994	CFNS	TEI Namespace
997	CFNS	) provides a way for an XML document to combine markup from different vocabularies without risking name collision and consequent processing difficulties. While the scope of the TEI is large, there are many areas in which it makes no particular recommendation, or where it recommends that other defined markup schemes should be adopted, such as graphics or mathematics. It is also considered desirable that users of other markup schemes should be able to integrate documents using TEI markup with their own system. To meet these objectives without compromising the reliability of its encoding, a TEI-conformant document is required to make appropriate use of the TEI namespace.
999	CFNS	Essentially all elements in a TEI Schema which represent concepts from the TEI Abstract Model belong to the TEI namespace,
1001	CFNS	, maintained by the TEI. A TEI-conformant document is required to declare the namespace for all the elements it contains whether these come from the TEI namespace or from other schemes.
1003	CFNS	A TEI Schema may be created which assigns TEI elements to some other namespace, or to no namespace at all. A document using such a schema must be regarded as a TEI extension and cannot be considered TEI-conformant, though it may be TEI-conformable. A document which places non-TEI elements or attributes within the TEI namespace cannot be TEI-conformant; such practices are strongly deprecated as they may lead to serious difficulties for processing or interchange.
1010	CFOD	above, a TEI Schema can only be generated from a TEI ODD, which also serves to document the semantics of the elements defined by it. A TEI-conformant document should therefore always be accompanied by (or refer to) a valid
1011	CFOD	TEI ODD file
1012	CFOD	specifying which modules, elements, classes, etc. are in use together with any modifications or renamings applied, and from which a TEI Schema can be generated to validate the document. The TEI supplies a number of predefined
1013	CFOD	TEI Customization exemplar ODD files
1015	CFOD	), but most projects will typically need to customize the TEI beyond what these examples provide. It is assumed, for example, that most projects will customize the TEI scheme by removing those elements that are not needed for the texts they are encoding, and by providing further constraints on the attribute values and element content models the TEI provides. All such customizations must be specified by means of a valid
1016	CFOD	TEI ODD
1019	CFOD	As different sorts of customization have different implications for the interchange and interoperability of TEI documents, it cannot be assumed that every customization will necessarily result in a schema that validates only TEI-conformant documents. The ODD language permits modifications which conflict with the TEI Abstract Model, even though observing this model is a requirement for TEI Conformance. The ODD language can in fact be used to describe many kinds of markup scheme, including schemes which have nothing to do with the TEI at all.
1021	CFOD	Equally, it is possible to construct a TEI Schema which is identical to that derived from a given TEI ODD file without using the ODD scheme. A schema can be constructed simply by combining the predefined schema language fragments corresponding with the required set of TEI modules and other statements in the relevant schema language. The status of such a schema with respect to the
1023	CFOD	schema cannot however be determined, in general; it may therefore be impossible to determine whether such a schema represents a clean modification or an extension. This is one reason for making the presence of a TEI ODD file a requirement for conformance.
1027	CFCATSCH	Varieties of TEI Conformance
1031	CFCATSCH	Is it a valid XML document, for which a TEI Schema exists? If not, then the document cannot be considered TEI-conformant in any sense.
1033	CFCATSCH	Is the document accompanied by a TEI-conformant ODD specification describing its markup scheme and intended semantics? If not, then the document can only be considered TEI-conformant if it validates against a predefined TEI Schema and conforms to the TEI abstract model.
1035	CFCATSCH	Does the markup in the document correctly represent the TEI abstract model? Though difficult to assess, this is essential to TEI conformance.
1037	CFCATSCH	Does the document claim that all of its elements come from some namespace other than the TEI (or no namespace)? If so, the document cannot be TEI-conformant.
1039	CFCATSCH	If the document claims to use the TEI namespace, in part or wholly, do the elements associated with that namespace in fact belong to it? If not, the document cannot be TEI-conformant; if so, and if all non-TEI elements and attributes are correctly associated with other namespaces, then the document may be TEI-conformant.
1041	CFCATSCH	Is the document valid according to a schema made by combining all TEI modules as well as valid according to the schema derived from its associated ODD specification? If so, the document is TEI-conformant.
1045	CFCATSCH	? If so, the document uses a TEI extension.
1049	CFCATSCH	, using only information supplied in the accompanying ODD and without loss of information? If so, the document is TEI-conformable.
1075	tab-conformance	Conforms to TEI Abstract Model
1135	tab-conformance	Uses TEI and other namespaces correctly
1176	tab-conformance	Document can be converted automatically to a form which is valid as a subset of
1200	CFCATSCH	The document in column A is TEI-conformant. Its tagging follows the TEI Abstract Model, both as regards syntactic constraints (its
1206	CFCATSCH	elements appear to contain verse lines rather than typographic ones). It is accompanied by a valid ODD which documents exactly how it uses the TEI. All the TEI-defined elements and attributes in the document are placed in the TEI namespace. The schema against which it is valid is a
1212	CFCATSCH	The document in column B is not a TEI document. Although it is accompanied by a valid TEI ODD, the resulting schema includes some
1214	CFCATSCH	modifications, and represents some concepts from the TEI Abstract Model using non-TEI elements; for example, it re-defines the content model of
1220	CFCATSCH	which appears to have the same meaning as the existing TEI
1222	CFCATSCH	element, but the equivalence is not made explicit in the ODD. It uses the TEI namespace correctly to identify the TEI elements it contains, but the ODD does not contain enough information automatically to convert its non-TEI elements into TEI equivalents.
1224	CFCATSCH	The document in column C is TEI-conformable. It is almost the same as the document in column A, except that the names of the elements used are not those specified by the TEI namespace. Because the ODD accompanying it contains an exact mapping for each element name (using the
1226	CFCATSCH	element) and there are no name conflicts, it is possible to make an automatic conversion of this document.
1228	CFCATSCH	The document in column D is a TEI Extension. It combines elements from its own namespace with unmodified TEI elements in the TEI namespace. Its usage of TEI elements conforms to the TEI Abstract Model. Its ODD defines a new
1230	CFCATSCH	element which has no exact TEI equivalent, but which is assigned to an existing TEI class; consequently its schema is not a clean subset of
1232	CFCATSCH	. If the associated ODD provided a way of mapping this element to an existing TEI element, then this would be TEI-conformable.
1234	CFCATSCH	The document in column E is superficially similar to document D, but because it does not use any namespace declarations (or, equivalently, it assigns unmodified TEI elements to its own namespace), it may contain name collisions; there is no way of knowing whether a
1238	CFCATSCH	or has some other meaning. The accompanying ODD file may be used to provide the human reader with information about equivalently named elements in the TEI namespace, and hence to determine whether the document is valid with respect to the TEI Abstract Model, but this is not an automatable process. In particular, cases of apparent conflict (for example use of an element
1240	CFCATSCH	to represent a concept not in the TEI Abstract Model but in the abstract model of some other system, whose namespace has been removed as well) cannot be reliably resolved. By our current definition therefore, this is not a TEI document.
1244	CFCATSCH	which is used in this document is a specialization of an existing TEI element, and the ODD in which it is defined specifies the mapping (a
1252	CFCATSCH	; if it does not, this would also be a case of TEI Extension.
1254	CFCATSCH	The document in column G is not a TEI document. Its structure is fully documented by a valid TEI ODD, but it does not claim to represent the TEI Abstract Model, does not use the TEI namespace, and is not intended to validate against any TEI schema.
1256	CFCATSCH	The document in column H is very like that in column A, but it lacks an accompanying ODD. Instead, the schema used to validate it is produced simply by combining TEI schema fragments in the same way as an ODD processor would, given the ODD. If the resulting schema is a clean subset of
1258	CFCATSCH	, such a document is indistinguishable from a TEI-conformant one, but there is no way of determining (without inspection) whether this is the case if any modification or extension has been applied. Its status is therefore, like that of Text E, impossible to determine.
1268	IM	The specifications in this section are illustrative but not normative. Their function is to further illustrate the intended scope and application of the elements documented in chapter
1269	IM	, since it is believed that these may have application beyond the areas directly addressed by the TEI.
1271	IM	An ODD processing system has to accomplish two main tasks. A set of selections, deletions, changes, and additions supplied by an ODD customization (as described in
1272	IM	) must first be merged with the published TEI P5 ODD specifications. Next, the resulting unified ODD must be processed to produce the desired outputs.
1274	IM	An ODD processor is not required to do these two stages in sequence, but that may well be the simplest approach; the ODD processing tools currently provided by the TEI Consortium, which are also used to process the source of these Guidelines, adopt this approach.
1288	IM-unified	attribute. This provides a name for the generated schema, which other components of the processing system may use to refer to the schema being generated, e.g. in issuing error messages or as part of the generated output schema file or files. The
1290	IM-unified	attribute may be used to specify the default namespace within which elements valid against the resulting schema belong, as discussed in
1295	IM-unified	element contains an unordered series of specialized elements, each of which is of one of the following four types:
1301	IM-unified	(by default
1315	IM-unified	add
1317	IM-unified	If the value of
1320	IM-unified	add
1321	IM-unified	, then the object is simply copied to the output, but if it is
1322	IM-unified	change
1327	IM-unified	, then it will be looked at by other parts of the process.
1336	IM-unified	element, in turn, groups together a set of ODD specifications (among other things, including further
1360	IM-unified	references to TEI Modules
1365	IM-unified	attributes refer to components of the TEI. The value of the
1371	IM-unified	element defining a TEI module. The
1373	IM-unified	must be dereferenced by some means, such as reading an XML file with the TEI ODD specification (either from the local hard drive or off the Web), or looking up the reference in an XML database (again, locally or remotely); whatever means is used, it should return a stream of XML containing the element, class, and macro specifications collected together in the specified module. These specification elements are then processed in the same way as if they had been supplied directly within the
1383	IM-unified	attribute; the content of such modules, which must be available in the RELAX NG XML syntax, is passed directly and without modification to the output schema when that is created.
1387	IM-unified	Each object obtained from the TEI ODD specification using
1395	IM-unified	if there is an object in the ODD customization with the same value for the
1399	IM-unified	value of
1401	IM-unified	, then the object from the module is ignored;
1403	IM-unified	if there is an object in the ODD customization with the same value for the
1407	IM-unified	value of
1409	IM-unified	, then the object from the module is ignored, and the one from the ODD customization is used in its place;
1411	IM-unified	if there is an object in the ODD customization with the same value for the
1415	IM-unified	value of
1416	IM-unified	change
1417	IM-unified	, then the two objects must be merged, as described below;
1419	IM-unified	if there is an object in the ODD customization with the same value for the
1423	IM-unified	value of
1424	IM-unified	add
1425	IM-unified	, then an error condition should be raised;
1441	IM-unified	elements). If such a component is found in the ODD customization, it will be copied to the output; if it is not found there, but is present in the TEI ODD specification, then that will be copied to the output.
1447	IM-unified	, for example); these are always copied to the output, and their children are then processed following the rules given in this list.
1481	IM-unified	elements. These should be copied from both the TEI ODD specification and the ODD customization, and all occurrences included in the output.
1522	IM-unified	This means that when
1523	IM-unified	memberOf key="att.typed"/
1524	IM-unified	is processed, that class is looked up, each attribute which it defines is examined in turn, and the customization is searched for an override. If the modification is of the attribute class itself, work proceeds as usual; if, however, the modification is at the element level, the class reference is deleted and a series of
1526	IM-unified	elements is added to the element, one for each attribute inherited from the class. Since attribute classes can themselves be members of other attribute classes, membership must be followed recursively.
1542	IM-unified	to provide an alternate description in another language. Nothing prevents the user from supplying
1554	IM-unified	In the processing of the content models of elements and the content of macros, deleted elements may require special attention.
1555	IM-unified	The carthago program behind the Pizza Chef application, written by Michael Sperberg-McQueen for TEI P3 and P4, went to very great efforts to get this right. The XSLT transformations used by the P5 Roma application are not as sophisticated, partly because the RELAX NG language is more forgiving than DTDs.
1556	IM-unified	A content model like this:
1575	IM-unified	requires no special treatment because everything is expressed in terms of model classes; if deletions result in
1577	IM-unified	having no members, then
1581	IM-unified	. An ODD processor may or may not elect to simplify the resulting choice between nothing and
1585	IM-unified	element. However, such simplification may be considerably more complex in the general case (if for example the
1591	IM-unified	), and an ODD processor is therefore likely to be more successful in carrying out such simplification as a distinct stage during processing of ODD sources.
1614	IM-unified	Note that deletion of required elements will cause the schema specification to accept as valid documents which cannot be TEI-conformant, since they no longer conform to the TEI Abstract Model; conformance topics are addressed in more detail in
1622	IM-unified	which contains a complete and internally consistent set of element, class, and macro specifications, possibly also including
1632	IMGS	Assuming that any modifications have been resolved, as outlined in the previous section, making a schema is now a four-stage process:
1634	IMGS	all datatype and other macro specifications must be collected together and declared at the start of the output schema;
1636	IMGS	all classes must be declared in the right order (since some classes reference others, the order is significant);
1646	IMGS	Working in this order gives the best chance of successfully supporting all the schema languages. However, there are a number of obstacles to overcome along the way.
1648	IMGS	An ODD processor may use any desired schema language or languages for its schema output. The TEI ODD specification uses RELAX NG to express content models, and is therefore biased towards this language. However, the current TEI ODD processing system is capable of producing schema output in the three main schema languages, as follows:
1650	IMGS	A RELAX NG (XML) schema is generated by creating wrappers around the content models taken directly from the ODD specification; a version re-expressed in the RELAX NG compact syntax is generated using James Clark's
1654	IMGS	A DTD schema is generated by converting the RELAX NG content models to DTD language, often simplifying them to allow for the less-sophisticated output language.
1656	IMGS	A W3C Schema schema is created by generating a RELAX NG schema and then using James Clark's
1666	IMGS	Secondly, it is possible to create two rather different styles of schema. On the one hand, the schema can try to maintain all the flexibility of ODD by using the facilities of the schema language for parameterization; on the other, it can remove all customization features and produce a flat result which is not suitable for further manipulation. The TEI project currently generates both styles of schema; the first as a set of schema fragments in DTD and RELAX NG languages, which can be included as modules in other schemas, and customized further; the second as the output from a processor such as Roma, in which many of the parameterization features have been removed.
1702	IMGS	performance = element performance { (model.divTop \| model.global), (model.common, model.global)+, (model.divBottom, model.global), att.global.attribute.xmlspace, att.global.attribute.xmlid, att.global.attribute.n, att.global.attribute.xmllang, att.global.attribute.rend, att.global.attribute.xmlbase, att.global.linking.attribute.corresp, att.global.linking.attribute.synch, att.global.linking.attribute.sameAs, att.global.linking.attribute.copyOf, att.global.linking.attribute.next, att.global.linking.attribute.prev, att.global.linking.attribute.exclude, att.global.linking.attribute.select }
1705	IMGS	) would have no effect, since references to such classes have been expanded to reference their constituent attributes.
1708	IMGS	performance = element performance { performance.content, performance.attributes } performance.content = (model.divTop \| model.global), (model.common, model.global)+, (model.divBottom, model.global) performance.attributes = att.global.attributes, empty
1711	IMGS	is provided via an explicit reference (
1713	IMGS	), and can therefore be redefined. Moreover, the attributes are separated from the content model, allowing either to be overridden.
1719	IMGS	are used to distinguish the two schema types. An ODD processor is not required to support both, though the simple schema output is generally preferable for most applications.
1744	IMGS	class. What happens if
1762	IMGS	it is impossible to be sure which rule is being used. This situation is not detected when RELAX NG is used, since the language is able to cope with non-deterministic content models of this kind and does not require that only a single rule be used.
1764	IMGS	Finally, an application will need to have some method of associating the schema with document instances that use it. The TEI does not mandate any particular method of doing this, since different schema languages and processors vary considerably in their requirements. ODD processors may wish to build in support for some of the methods for associating a document instance with a schema. The TEI does, however, suggest that those which are already part of XML (the DOCTYPE declaration for DTDs) and W3C Schema (the
1770	IMGS	attribute to be valid when a document is validated against either a DTD or a RELAX NG schema, ODD processors may wish to add declarations for this attribute and its namespace to the root element, even though these are not part of the TEI
1771	IMGS	per se
1774	IMGS	to the list of attributes on the root element, which permits the non-namespace-aware DTD language to recognize the
1776	IMGS	notation. For RELAX NG, the namespace and attribute would be declared in the usual way:
1777	IMGS	namespace xsi = "http://www.w3.org/2001/XMLSchema-instance"
1779	IMGS	attribute xsi:schemaLocation { list { data.namespace, data.pointer }+ }
1780	IMGS	inside the root element declaration.
1784	IMGS	attribute in a W3C Schema schema is not permitted. Therefore, if W3C Schemas are being generated by converting the RELAX NG schema (for example, with
1798	IM-naming	If a RELAX NG pattern or DTD parameter entity is being created, its name is the value of the corresponding
1800	IM-naming	attribute, prefixed by the value of any
1804	IM-naming	. This allows for elements from an external schema to be mixed in without risk of name clashes, since all TEI elements can be given a distinctive prefix such as
1814	IM-naming	tei_sp = element sp { ... }
1817	IM-naming	If an element or attribute is being created, its default name is the value of the
1819	IM-naming	attribute, but if there is an
1821	IM-naming	child, its content is used instead.
1827	IM-naming	should be copied into the generated schema. If there is only one occurrence of either of these elements, it should be used regardless, but if there are several, local processing rules will need to be applied. For example, if there are several with different values of
1829	IM-naming	, a locale indication in the processing environment might be used to decide which to use. For example,
1843	IM-naming	might generate a RELAX NG schema fragment like the following, if the locale is determined to be French:
1844	IM-naming	head = ## en-tête element head { head.content, head.attributes }
1847	IM-naming	Alternatively, a selection might be made on the basis of the value of the
1853	IM-naming	In addition, there are three conventions about naming patterns relating to classes; ODD processors need not follow them, but those reading the schemas generated by the TEI project will find it necessary to understand them:
1855	IM-naming	when a pattern for an attribute class is created, it is named after the attribute class identifier (as above) suffixed by
1861	IM-naming	when a pattern for an attribute is created, it is named after the attribute class identifier (as above) suffixed by
1863	IM-naming	and then the identifier of the attribute (e.g.
1868	IM-naming	when a parameterized schema is created, each element generates patterns for its attributes and its contents separately, suffixing respectively
1890	IMRN	element defining which elements can occur as the root of a document. The ODD
1896	IMRN	. A pattern normally corresponds to an element name, but if a prefix (see above,
1897	IMRN	) is supplied for an element, the pattern consists of the prefix name with the element name.
1902	IMMA	An ODD macro generates a corresponding RELAX NG pattern simply by copying the body of the
1930	IMMA	Although some versions of these Guidelines show the RELAX NG output in the compact syntax, both the content of the
1932	IMMA	element and the unified ODD specification generated by the TEI ODD processing software always store RELAX NG in the more verbose XML syntax. However, the two formats are interchangeable.
1952	IMCL	if the elements
1958	IMCL	are included. Depending on the value of the
1962	IMCL	, it may also generate a set of sequences as well as alternation patterns. Thus we may also generate the
2010	IMCL	where the pattern name is created by appending an underscore and the name of the generation sequence to the class name.
2012	IMCL	Attribute classes work by producing a pattern containing definitions of the appropriate attributes. So
2063	IMCL	Since the processor may have expanded the attribute classes already, separate patterns are generated for each attribute in the class as well as one for the class itself. This allows an element to refer directly to a member of a class. Notice that the
2065	IMCL	element is used to add an
2073	IMCL	Naturally, this behaviour is not mandatory, and other ODD processors may create documentation in other ways, or ignore those parts of the ODD specifications when creating schemas.
2084	IMCL	attribute in the namespace
2088	IMCL	. The body of the attribute is taken from the
2094	IMCL	value of
2096	IMCL	. In that case an
2146	IMCL	namespace to provide default values and documentation.
2156	IMEL	pattern by which other elements can refer to it, and then it must generate an
2158	IMEL	with the content model and attributes. It may be convenient to make two separate patterns, one for the element's attributes and one for its content model.
2160	IMEL	The content model is created simply by copying the body of the
2171	IM-makeDTD	. A DTD may not refer to an entity which has not yet been declared. Since both macros and classes generate DTD parameter entities, the TEI Guidelines are constructed so that they can be declared in the right order. A processor must therefore work in the following order:
2173	IM-makeDTD	declare all model classes which have a
2175	IM-makeDTD	value of
2180	IM-makeDTD	value of
2183	IM-makeDTD	declare all other classes
2209	IM-makeDTD	<!ENTITY % faith 'INCLUDE' > <![ %faith; [ <!--doc:specifies the faith, religion, or belief set of a person. --> <!ELEMENT %n.faith; %om.RR; %macro.phraseSeq;> <!ATTLIST %n.faith; xmlns CDATA "http://www.tei-c.org/ns/1.0"> <!ATTLIST %n.faith; %att.global.attributes; %att.editLike.attributes; %att.datable.attributes; > ]]>
2211	IM-makeDTD	), the element name is parameterized (see
2216	IM-makeDTD	. Note the additional attribute which provides a default
2218	IM-makeDTD	declaration for the element; the effect of this is that if the document is processed by a DTD-aware XML processor, the namespace declaration will be present automatically without the document author even being aware of it.
2220	IM-makeDTD	A simpler rendition for a flattened DTD generated from a customization will result in the following, with no containing marked section, and no parameterized name:
2221	IM-makeDTD	<!ELEMENT faith %macro.phraseSeq;> <!ATTLIST faith xmlns CDATA "http://www.tei-c.org/ns/1.0"> <!ATTLIST faith %att.global.attribute.xmlspace; %att.global.attribute.xmlid; %att.global.attribute.n; %att.global.attribute.xmllang; %att.global.attribute.rend; %att.global.attribute.xmlbase; %att.global.linking.attribute.corresp; %att.global.linking.attribute.synch; %att.global.linking.attribute.sameAs; %att.global.linking.attribute.copyOf; %att.global.linking.attribute.next; %att.global.linking.attribute.prev; %att.global.linking.attribute.exclude; %att.global.linking.attribute.select; %att.editLike.attribute.cert; %att.editLike.attribute.resp; %att.editLike.attribute.evidence; %att.datable.w3c.attribute.period; %att.datable.w3c.attribute.when; %att.datable.w3c.attribute.notBefore; %att.datable.w3c.attribute.notAfter; %att.datable.w3c.attribute.from; %att.datable.w3c.attribute.to;>
2222	IM-makeDTD	Here the attributes from classes have been expanded into individual entity references.
2241	IMGD	The generated documentation may be of two forms. On the one hand, we may document the customization itself, that is, only those elements (etc.) which differ in their specification from that provided by the TEI reference documentation. Alternatively, we may generate reference documentation for the complete subset of the TEI which results from applying the customization. The TEI Roma tools take the latter approach, and operate on the result of the first stage processing described in
2252	IMGD	for each element, by tracing which other elements have them as possible members of their content models.
2270	STPE	Using TEI Parameterized Schema Fragments
2272	STPE	The TEI parameterized DTD and RELAX NG fragments make use of parameter entities and patterns for several purposes. In this section we describe their interface for the user. In general we recommend use of ODD instead of this technique.
2276	STPED	Special-purpose parameter entities are used to specify which modules are to be combined into a TEI DTD. They take the form
2280	STPED	is the name of the module as given in table
2286	STPED	. All such parameter entities are declared by default with the value
2288	STPED	: to select a module, therefore, the encoder declares the appropriate parameter entities with the value
2292	STPED	For XML DTD fragments, note that some modules generate two DTD fragments: for example the
2298	STPED	. This is because the declarations they contain are needed at different points in the creation of an XML DTD.
2314	STPED	If TEI.linking has its default value of IGNORE, neither declaration has any effect. If however it has the value INCLUDE, then the content of each marked section is acted upon: the parameter entities
2318	STPED	are referenced, which has the effect of embedding the content of the files they represent at the appropriate point in the DTD.
2327	STPEEX	The TEI DTD fragments also use marked sections and parameter entity references to allow users to exclude the definitions of individual elements, in order either to make the elements illegal in a document or to allow the element to be redefined. The parameter entities used for this purpose have exactly the same name as the generic identifier of the element concerned. The default definition for these parameter entities is
2331	STPEEX	in order to exclude the standard element and attribute definition list declarations from the DTD.
2335	STPEEX	, for example, are preceded by a definition for a parameter entity with the name
2340	STPEEX	<!ENTITY % p 'INCLUDE' > <![ %p; [ <!-- element and attribute list declaration for p here --> ]]>
2350	STPEEX	<!ENTITY % p 'IGNORE' >
2351	STPEEX	is added earlier in the DTD than the default (see further
2354	STPEEX	Similarly, in the parameterized RELAX NG schemas, every element is defined by a pattern named after the element. To undefine an element therefore all that is necessary is to add a declaration like the following:
2355	STPEEX	p = notAllowed
2360	STPEGI	In the TEI DTD fragments, elements are not referred to directly by their generic identifiers; instead, the DTD fragments refer to parameter entities which expand to the standard generic identifiers. This allows users to rename elements by redefining the appropriate parameter entity. Parameter entities used for this purpose are formed by taking the standard generic identifier of the element and attaching the string
2372	STPEGI	These declarations are generated by an ODD processor when TEI DTD fragments are created.
2374	STPEGI	In the RELAX NG schemas, all elements are normally defined using a pattern with the same name as the element (as described in
2376	STPEGI	abbr = element abbr { abbr.content, abbr.attributes }
2378	STPEGI	abbr = element abbrev { abbr.content, abbr.attributes }
2379	STPEGI	More complex revisions, such as redefining the content of the element (defined by the pattern
2383	STPEGI	) can be accomplished in a similar way, using the features of the RELAX NG language. The recommended method of carrying out such modifications is however to use the ODD language as further described in section
2389	STOVLO	Any local modifications to a DTD (i.e. changes to a schema other than simple inclusion or exclusion of modules) are made by declarations stored in one of two local extension files, one containing modifications to the TEI parameter entities, and the other new or changed declarations of elements and their attributes. Entity declarations must be made which associate the names of these two files with the appropriate parameter entity so that the declarations they contain can be embedded within the TEI DTD at an appropriate point.
2393	STOVLO	file to embed portions of the TEI DTD fragments or locally developed extensions.
2396	STOVLO	identifies a local file containing extensions to the TEI parameter entities
2400	STOVLO	identifies a local file containing extensions to the TEI module
2403	STOVLO	For example, if the relevant files are called
2407	STOVLO	, then declarations like the following would be appropriate:
2410	STOVLO	When an entity is declared more than once, the first declaration is binding and the others are ignored. The local modifications to parameter entities should therefore be handled before the standard parameter entities themselves are declared in
2414	STOVLO	is referred to before any TEI declarations are handled, to allow the user's declarations to take priority. If the user does not provide a
2418	STOVLO	For example the encoder might wish to add two phrase-level elements
2423	STOVLO	hi rend='italics'
2425	STOVLO	hi rend='bold'
2427	STOVLO	, this involves two distinct steps: one to define the new elements, and the other to ensure that they are placed into the TEI document structure at the right place.
2429	STOVLO	Creating the new declarations is done in the same way for user-defined elements as for any other; the same parameter entities need to be defined so that they may be referenced by other elements. The content models of these new elements may also reference other parameter entities, which is why they need to be declared after other declarations.
2433	STOVLO	should be modified to include the generic identifiers for the new elements we wish to create. The declaration for each modifiable parameter entity in the DTD includes a reference to an additional parameter entity with the same name prefixed by an
2435	STOVLO	; these entities are declared by default as the null string. However, in the file containing local declarations they may be redeclared to include references to the new class members:
2437	STOVLO	and this declaration will take precedence over the default when the declaration for macro.phraseSeq is evaluated.
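
The precedence rule quoted in the rows above (when an entity is declared more than once, the first declaration is binding) can be checked with a small sketch. It is an illustration only, not part of the TEI source: it uses general entities in an internal DTD subset rather than the parameter entities and extension files the text describes, and the entity name `ext`, the element name `root`, and the helper `resolve_entity` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "ext" is declared twice in the internal
# subset. The first declaration stands in for a local override, the second
# for the standard default that is declared later.
DOC = """<!DOCTYPE root [
  <!ENTITY ext "local override">
  <!ENTITY ext "standard default">
]>
<root>&ext;</root>"""


def resolve_entity(xml_text: str) -> str:
    """Parse the document and return the root's text after entity expansion."""
    return ET.fromstring(xml_text).text


value = resolve_entity(DOC)
print(value)
```

Under the XML rule, the parser keeps the first declaration and ignores the second, which is why local declarations have to be read before the standard ones.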

CO-CoreElements.xml#13243

#	id	text
2	CO	Elements Available in All TEI Documents
4	CO	This chapter describes elements which may appear in any kind of text and the tags used to mark them in all TEI documents. Most of these elements are freely floating phrases, which can appear at any point within the textual structure, although they must generally be contained by a higher-level element of some kind (such as a paragraph). A few of the elements described in this chapter (for example, bibliographic citations and lists) have a comparatively well-defined internal structure, but most of them have no consistent inner structure of their own. In the general case, they contain only a few words, and are often identifiable in a conventionally printed text by the use of typographic conventions such as shifts of font, use of quotation or other punctuation marks, or other changes in layout.
8	CO	tag used to mark paragraphs, the prototypical formal unit for running text in many TEI modules. This is followed, in section
9	CO	, by a discussion of some specific problems associated with the interpretation of conventional punctuation, and the methods proposed by the Guidelines for resolving ambiguities therein.
12	CO	) describes a number of phrase-level elements commonly marked by typographic features (and thus well-represented in conventional markup languages). These include features commonly marked by font shifts (section
13	CO	) and features commonly marked by quotation marks (section
18	CO	introduces some phrase-level elements which may be used to record simple editorial interventions, such as emendation or correction of the encoded text. The elements described here constitute a simple subset of the full mechanisms for encoding such information (described in full in chapter
22	CO	) describes several phrase-level and inter-level elements which, although often of interest for analysis or processing, are rarely explicitly identified in conventional printing. These include names (section
35	CO	, describe two kinds of quasi-structural elements: lists and notes. These may appear either within chunk-level elements such as paragraphs, or between them. Several kinds of lists are catered for, of an arbitrary complexity. The section on notes discusses both notes found in the source and simple mechanisms for adding annotations of an interpretive nature during the encoding; again, only a subset of the facilities described in full elsewhere (specifically, in chapter
39	CO	introduces some simple ways of representing graphic or other non-textual content found in a text. A fuller discussion of the multimedia facilities supported by these Guidelines may be found in chapters
44	CO	, describes methods of encoding within a text the conventional system or systems used when making references to the text. Some reference systems have attained canonical authority and must be recorded to make the text useable in normal work; in other cases, a convenient reference system must be created by the creator or analyst of an electronic text.
49	CO	Additional elements for the encoding of passages of verse or drama (whether prose or verse) are discussed in section
53	CO	, describing the structure of the TEI document type definition.
57	COPA	The paragraph is the fundamental organizational unit for all prose texts, being the smallest regular unit into which prose can be divided. Prose can appear in all TEI texts, even those that are primarily of another genre (e.g., verse); thus the paragraph is described here, as an element which can appear in any kind of text.
59	COPA	Paragraphs can contain any of the other elements described within this chapter, as well as some other elements which are specific to individual text types. We distinguish
70	COPA	Because paragraphs may appear in different base or additional tag sets, their possible contents may differ in different kinds of documents. In particular, additional elements not listed in this chapter may appear in paragraphs in certain kinds of text. However, the elements described in this chapter are always by default available in all kinds of text.
86	COPA	Since paragraphs are usually explicitly marked in Western texts, typically by indentation, the application of the
88	COPA	tag usually presents few problems.
90	COPA	In some cases, the body of a text may comprise but a single paragraph:
107	COPA	The following extract from a Russian fairy tale demonstrates how other phrase level elements (in this case
139	COPU	Punctuation marks cause two distinct classes of problem for text markup: the marks may not be available in the character set used, and they may be significantly ambiguous. To some extent, the availability of the Unicode character set addresses the first of these problems, since it provides specific code points for most punctuation marks, and also the second to the extent that it distinguishes glyphs (such as stop, comma, and hyphen) which are used with different functions.
140	COPU	Where punctuation itself is the subject of study, the element
143	COPU	. Where the character used for a punctuation mark is not available in Unicode, the
150	COPU-1	Punctuation is itself a form of markup, historically introduced to provide the reader with an indication about how the text should be read. As such, it is unsurprising that encoders will often wish to encode directly the purpose for which punctuation was provided, as well as, or even instead of, the punctuation itself. We discuss some typical cases below.
157	COPU-1	respectively. However, there are independent reasons for tagging these, whether or not they are marked by full stops, and the polysemy of the full stop itself is perhaps no different from that of any other character in the writing system.
163	COPU-1	usually mark the end of orthographic sentences, but may also be used as a mid-sentence comment by the author (
167	COPU-1	to query a word or expression or mark a sentence as dubious in linguistic discussion). Such usages may be distinguished by marking S-units, in which case the mid-sentence uses of these punctuation marks may be left unmarked, or tagged using the
173	COPU-1	are used for a variety of purposes: as a mark of omission, insertion, or interruption; to show where a new speaker takes over (in dialogue); or to introduce a list item. In the latter two cases particularly, it is clearly desirable to mark the function as well as its rendition using the elements
182	COPU-1	may be removed from text contained by
186	COPU-1	elements on editorial grounds, or they may be marked in a variety of ways; see the discussion of quotation and related features in section
190	COPU-1	must be distinguished from single quote marks. As with hyphens, this disambiguation is best performed by selecting the appropriate Unicode character, though it may also be represented by using appropriate XML markup for quotations as suggested above. However, apostrophes have a variety of uses. In English they mark contractions, genitive forms, and (occasionally) plural forms. Full disambiguation of these uses belongs to the level of linguistic analysis and interpretation.
193	COPU-1	and other marks of suspension such as dashes or ellipses are often used to signal information about the syntactic structure of a text fragment. Full disambiguation of their uses also belongs to the level of linguistic analysis and interpretation, and will therefore need to use the mechanisms discussed in chapter
196	COPU-1	Where punctuation marks are disambiguated by tagging their assumed function in the text (for example, quotation), it may be debated whether they should be excluded or left as part of the text. In the case of quotation marks, it may be more convenient to distinguish opening from closing marks simply by using the appropriate Unicode character than to use the
200	COPU-1	Where segmentation of a text is performed automatically, the accuracy of the result may be considerably enhanced by a first pass in which the function of different punctuation characters is explicitly marked. This need not be done for all cases, but only where the structural function of the punctuation markup (for example as a word or phrase delimiter) is ambiguous. Thus, dots indicating abbreviation might be distinguished from dots indicating sentence end, and exclamation or question marks internal to a sentence distinguished from those which terminate one. Furthermore, when encoding historical materials, it may be considered essential to retain the original punctuation, whether by using an appropriate character code, if this is available (or using the
202	COPU-1	element where it is not) or by an explicit encoding using
204	COPU-1	. The particular method adopted will vary depending upon the feature concerned and upon the purpose of the project.
209	COPU-2	Hyphenation as a phenomenon is generally of most concern when producing formatted text for display in print or on screen: different languages and systems have developed quite sophisticated sets of rules about where hyphens may be introduced and for what reason. These generally do not concern the text encoder, since they belong to the domain of formatting and will generally be handled by the rendition software in use. In this section, we discuss issues arising from the appearance of hyphens in pre-existing formatted texts which are being re-encoded for analysis or other processing. Unicode distinguishes four characters visually similar to the hyphen, including the undifferentiated hyphen-minus (U+002D) which is retained for compatibility reasons. The hard hyphen (U+2010) is distinguished from the minus sign (U+2212) which is for use in mathematical expressions, and also from the soft hyphen (U+00AD) which may appear in
211	COPU-2	documents to indicate places where it is acceptable to insert a hyphen when the document is formatted.
213	COPU-2	Historically, the hard hyphen has been used in printed or manuscript documents for two distinct purposes. In many languages, it is used between words to show that they function as a single syntactic or lexical unit. For example, in French,
219	COPU-2	etc. It may also have an important role in disambiguation (for example, by distinguishing say a
223	COPU-2	). Such usages, although possibly problematic when a linguistic analysis is undertaken, are not generally of concern to text encoders: the hyphen character is usually retained in the text, because it may be regarded as part of the way a compound or other lexical item is spelled. Deciding whether a compound is to be decomposed into its constituent parts, and if so how, is a different question, involving consideration of many other phenomena in addition to the simple presence of a hyphen.
225	COPU-2	When it appears at the end of a printed or written line, however, the hard hyphen generally indicates that, contrary to what might be expected, a word is not yet complete, but continues on the next line (or over the next page or column or other boundary). The hyphen character is not, in this case, part of the word, but just a signal that the word continues over the break. Unfortunately, few languages distinguish these two cases visually, which necessarily poses a problem for text encoders. Suppose, for example, that we wish to investigate a diachronic English corpus for occurrences of "tea-pot" and "teapot", to find evidence for the point at which this compound becomes lexicalized. Any case where the word is hyphenated across a linebreak, like this:
231	COPU-2	They may decide simply to remove any end-of-line hyphenation from the encoded text, on the grounds that its presence is purely a secondary matter of formatting. This will obviously apply also if line endings are themselves regarded as unimportant.
233	COPU-2	Alternatively, they may decide to record the presence of the hyphen, perhaps on the grounds that it provides useful morphological information; perhaps in order to retain information about the visual appearance of the original source. In either case, they need to decide whether to record it explicitly, by including an appropriate punctuation character in the text data, or implicitly by supplying an appropriate symbolic value for one or more of the attributes on the
235	COPU-2	or other milestone element used to record the fact of the line division. If the hyphen is included in the character data of the TEI document, it might be marked up using the
242	COPU-2	A similar range of possibilities applies equally to the representation of other common punctuation marks, notably quotation marks, as discussed in
246	COPU-2	text data
249	COPU-2	, even if those units are not explicitly indicated by the XML markup. The ambiguity of the end-of-line hyphen also causes problems in the way a processor identifies such tokens in the absence of explicit markup. If token boundaries are not explicitly marked (for example using the
253	COPU-2	elements), for most languages a processor will rely on character class information to determine where they are to be found: some punctuation characters are considered to be word-breaking, while others are not. In XML, the newline character in text data is a kind of whitespace, and is therefore word breaking. However, it is generally unsafe to assume that whitespace adjacent to markup tags will always be preserved, and it is decidedly unsafe to assume that markup tags themselves are equivalent to whitespace.
261	COPU-2	elements are notable exceptions to this general rule, since their function is precisely to represent (or replace) line, page, or column breaks, which, as noted above, are generally considered to be equivalent to whitespace. These elements provide a more reliable way of preserving the lineation, pagination, etc. of a source document, since the encoder should not assume that (untagged) line breaks etc. in an XML source file will necessarily be preserved.
269	COPU-2	to indicate whether or not the element corresponds with a token boundary. The value
271	COPU-2	is also available, for cases where the encoder does not wish (or is unable) to determine whether the orthographic token concerned is broken by the line ending.
273	COPU-2	As a final complication, it should be noted that in some languages, particularly German and Dutch, the spelling of a word may be altered in the presence of end-of-line hyphenation. For example, in Dutch, the word
277	COPU-2	), occurring at the end of a line may be hyphenated as
279	COPU-2	, with a single letter a. An encoder wishing to preserve the original form of this orthographic token in a printed text while at the same time facilitating its recognition as the word
281	COPU-2	will therefore need to rely on a more sophisticated process than simply removing the hyphen. This is, however, essentially the same as any other form of normalization accompanying the recognition of variations in spelling or morphology: as such it may be encoded using the
284	COPU-2	, or the more sophisticated mechanisms for linguistic analysis discussed in chapter
291	COHQ	This section deals with a variety of textual features, all of which have in common that they are frequently realized in conventional printing practice by the use of such features as underlining, italic fonts, or quotation marks, collectively referred to here as
293	COHQ	. After an initial discussion of this phenomenon and alternate approaches to encoding it, this section describes ways of encoding the following textual features, all of which are conventionally rendered using some kind of highlighting:
295	COHQ	emphasis, foreign words and other linguistically distinct uses of highlighting
308	COHQW	typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings.
309	COHQW	Although the way in which a spoken text is performed (for example, the voice quality, loudness, etc.) might be regarded as analogous to
311	COHQW	in this sense, these Guidelines recommend distinct elements for the encoding of such
313	COHQW	in spoken texts. See further section
315	COHQW	The purpose of highlighting is generally to draw the reader's attention to some feature or characteristic of the passage highlighted; this section describes the elements recommended by these Guidelines for the encoding of such textual features.
319	COHQW	distinct in some way—as foreign, dialectal, archaic, technical, etc.
321	COHQW	emphatic, and which would for example be stressed when spoken
323	COHQW	not part of the body of the text, for example cross-references, titles, headings, labels, etc.
325	COHQW	identified with a distinct narrative stream, for example an internal monologue or commentary.
327	COHQW	attributed by the narrator to some other agency, either within the text or outside it: for example, direct speech or quotation.
329	COHQW	set apart from the text in some other way: for example, proverbial phrases, words mentioned but not used, names of persons and places in older texts, editorial corrections or additions, etc.
332	COHQW	The textual functions indicated by highlighting may not be rendered consistently in different parts of a text or in different texts. (For example, a foreign word may appear in italics if the surrounding text is in roman, but in roman if the surrounding text is in italics.) For this reason, these Guidelines distinguish between the encoding of rendering itself and the encoding of the underlying feature expressed by it.
341	COHQW	). This allows the encoder both to specify the function of a highlighted phrase or word, by selecting the appropriate element described here or elsewhere in the Guidelines, and to further describe the way in which it is highlighted, by means of an attribute. If the encoder wishes to offer no interpretation of the feature underlying the use of highlighting in the source text, then the
343	COHQW	element may be used, which indicates only that the text so tagged was highlighted in some way.
354	COHQW	attribute are not formally defined in this version of the Guidelines. It may be used to document any peculiarity of the way a given segment of text was rendered in the original source text, and may thus express a very large range of typographic or other features, by no means restricted to typeface, type size, etc. The
356	COHQW	attribute, by contrast, defines the way the source text was rendered using a formally defined style language, such as the W3C standard Cascading Stylesheet Language (
359	COHQW	attribute is used to point to one or more fragments expressed using such a language which have been predefined in the TEI header using the
370	COHQW	for analytic purposes, it is in general more useful to know the intended function of a highlighted phrase than simply that it is distinct.
373	COHQW	In many, if not most, cases the underlying function of a highlighted phrase will be obvious and non-controversial, since the distinctions indicated by a change of highlighting correspond with distinctions discussed elsewhere in these Guidelines. The elements available to record such distinctions are, for the most part, members of the
377	COHQW	class mentioned above constitute the
381	COHQW	The distinction between the two classes is simple, and typified by the two elements
385	COHQW	: the former marks simply that a passage is typographically distinct in some way, while the latter asserts that a passage is linguistically emphasized for some purpose. These two properties, though often combined, are not identical. It should be recognized, however, that cases do exist in which it is not economically feasible to mark the underlying function (e.g. in the preparation of large text corpora), as well as cases in which it is not intellectually appropriate (as in the transcription of some older materials, or in the preparation of material for the study of typographic practice). In such cases, the
408	COHQHF	Words or phrases which are not in the main language of the text should be tagged as such, at least where the fact is indicated in the text. Where the word or phrase concerned is already distinguished from the rest of the text by virtue of its function (for example, because it is a name, a technical term, a quotation, a mentioned word, etc.) then the global
410	COHQHF	attribute should be used to specify additionally that its language distinguishes it from the surrounding text. Any element in the TEI scheme may take a
412	COHQHF	attribute, which specifies both the writing system and the language used by its content (see sections
430	COHQHF	element should not be used to represent foreign words which are mentioned or glossed within the text: for these use the appropriate element from section
444	COHQHF	Elements which do not explicitly state the language of their content by means of an
446	COHQHF	attribute are understood to inherit a value for it from their parent element. In the general case, therefore, it is recommended practice to supply a default value for
448	COHQHF	on the root
468	COHQHE	element. In printed works, emphasis is generally indicated by devices such as the use of an italic font, a large typeface, or extra wide letter spacing; in manuscripts and typescripts, it is usually indicated by the use of underlining. As the following examples demonstrate, an encoder may choose whether or not to make explicit the particular type of rendition associated with the emphasis. If a source text consistently renders a particular feature (e.g. emphasis or words in foreign languages) in a particular way, the rendering associated with that feature may be described in the TEI header using the
476	COHQHE	attributes may then be used to describe examples which deviate from the norm. For example, assuming that the TEI header has defined a default rendering for the
483	COHQHE	If on the other hand no such default has been defined for the element, the encoder may specify it informally using the
489	COHQHE	If the encoder wishes to express information about the rendition used in the source using a formal language such as CSS, then the
497	COHQHE	In cases where the rendition of a source needs to be indicated several times in a document, it may be more convenient to provide a default value using the
499	COHQHE	element in the header. If a small number of distinct values are required, it may also be convenient to define them all by means of a series of
501	COHQHE	elements which can then be referenced from the elements in question by means of the global
528	COHQHE	attribute, as discussed above, without however taking a position as to the function of the highlighting. This may also be useful if the text is to be processed in two stages: representing simply typographic distinctions during a first pass, and then replacing the
554	COHQHE	in the sense
574	COHQHD	element is provided for this purpose. Its attributes allow for additional information characterizing the nature of the linguistic distinction to be made in two distinct ways: the
576	COHQHD	attribute simply assigns a user-defined code of some kind to the word or phrase which assigns it to some register, sub-language, etc. No recommendations as to the set of values for this attribute are provided at this time, as little consensus exists in the field.
578	COHQHD	Alternatively, the remaining three attributes may be used in combination to place a word or phrase on a three-dimensional scale sometimes used in descriptive linguistics, as for example in
598	COHQHD	that is, with respect to a social classification, for example as technical, polite, impolite, restricted, etc. Again, no recommendations are made for the values of these attributes at this time; the encoder should provide a description of the scheme used in the appropriate section of the header (see section
614	COHQHD	should be preferred to these simple characterizations. It may also be preferable to record the kinds of analysis suggested here by means of the simple annotation element
628	COHQQ	One form of presentational variation found particularly frequently in written and printed texts is the use of quotation marks. As with the typographic variations discussed in the preceding section, it is generally helpful to separate the encoding of the underlying textual feature (for example, a quotation or a piece of direct speech) from the encoding of its rendering (for example, the use of a particular style of quotation marks).
630	COHQQ	This section discusses the following elements, all of which are often rendered by the use of quotation marks:
663	COHQQ	The most common and important use of quotation marks is, of course, to mark
664	COHQQ	quotation
665	COHQQ	, by which we mean simply any part of the text which the author or narrator wishes to attribute to some agency other than the narrative voice. The
667	COHQQ	element may be used if no further distinction beyond this is judged necessary. If it is felt necessary to distinguish such passages further, for example to indicate whether they are regarded as speech, writing, or thought, either the
673	COHQQ	for words or phrases represented as being spoken or thought by people or characters within the current work. The
675	COHQQ	element is used for cases where the author or narrator distances himself or herself from the words in question without, however, attributing them to any other voice in particular. The
677	COHQQ	element is appropriate for a case where a word or phrase is being discussed in the body of a text rather than forming part of the text directly.
679	COHQQ	As noted above, if the distinction among these various reasons why a passage is offset from surrounding text cannot be made reliably, or is not of interest, then any representation of speech, thought, or writing may simply be marked using the
683	COHQQ	Quotation may be indicated in a printed source by changes in type face, by special punctuation marks (single or double or angled quotes, dashes, etc.) and by layout (indented paragraphs, etc.), or it may not be explicitly represented at all. If these characteristics are of interest, one or other of the global
690	COHQQ	Quotation marks themselves may, like other punctuation marks, be felt for some purposes to be worth retaining within a text, quite independently of their description by the
692	COHQQ	attribute. This should generally be done using the appropriate Unicode character, or, if this is not possible, a numeric character reference (see
693	COHQQ	). If the encoder decides both to retain the quotation marks and to represent their function by means of an explicit tag such as
695	COHQQ	, the quotation marks should be included within the element, rather than outside it, as in the first example below:
703	COHQQ	Alternatively, since this use of the leading em dash is very common typographic practice, it may be considered unnecessary to retain it in the encoding. Its presence in the source might instead be signalled using one of the attributes
711	COHQQ	element, which can then be referenced using the
729	COHQQ	element provided in the TEI header (see
730	COHQQ	) to indicate that quotation marks have not been retained in the encoding; their presence in the source is implied by the
734	COHQQ	Whether or not the quotation marks are suppressed, their presence and nature may be described using some appropriate set of conventions in the
748	COHQQ	. If the rendition of passages tagged as
750	COHQQ	is uniform throughout a text, then the
754	COHQQ	element in the header may be used to specify a default rendering, in which case the same section might simply be tagged:
779	COHQQ	This may be used to make explicit who is speaking:
794	COHQQ	attribute may be supplied whether or not an indication of the speaker is given explicitly in the text. It may take the form (as above) of a normalized form of the speaker's name, but its role is to act as a pointer to a location elsewhere in the text, or another document, where data about each speaker may be supplied. While this attribute could point to any source of information about the speaker available by a URI, the most appropriate place for such information is within the
796	COHQQ	component of the TEI header, as further discussed in
797	COHQQ	but for simple cases like the above, a simple list of speakers located in the front or back matter of the text may suffice.
799	COHQQ	It may also be useful to distinguish representations of speech from representations of thought, in modern printed texts often indicated by a change of typeface. The
809	COHQQ	Quoted matter may be embedded within quoted matter, as when one speaker reports the speech of another:
822	COHQQ	Direct speech nested in this way is treated in the same way as elsewhere: a change of rendition may occur, but the same element should be used. An encoder may however choose to distinguish between direct speech which contains quotations from extra-textual matter and direct speech itself, as in the following example:
839	COHQQ	element may be used to group together the quotation and its associated bibliographic reference, which should be encoded using the elements for bibliographic references discussed in section
860	COHQQ	Like other bibliographic references, the citation associated with a quotation may be represented simply by a cross-reference, as in this example:
869	COHQQ	impractical. In such circumstances, the quotation can be linked to a bibliographical reference using
883	COHQQ	Unlike most of the other elements discussed in this chapter, direct speech and quotations may frequently contain other high-level elements such as paragraphs or verse lines, as well as being themselves contained by such elements. Three possible solutions exist for this well-known structural problem:
885	COHQQ	the quotation is broken into segments, each of which is entirely contained within a paragraph
887	COHQQ	the quotation is marked up using stand-off markup
889	COHQQ	the quotation boundaries are represented by empty segment boundary delimiter elements
896	COHQQ	is provided for all cases in which quotation marks are used to distance the quoted text from the narrator or speaker. Common examples include the
932	COHQU	This section describes a set of textual elements which are used to provide a gloss, alternate identification, or description of something.
934	COHQU	Technical terms are often italicized or emboldened upon first mention in printed texts; an explanation or gloss is sometimes given in quotation marks. Linguistic analyses conventionally cite words in languages under discussion in italics, providing a gloss immediately following marked with single quotation marks. Other texts in which individual words or phrases are
935	COHQU	mentioned
943	COHQU	may mark them either with italics or with quotation marks, and will gloss them less regularly.
957	COHQU	is present, it may be linked to the term it is glossing by means of its
961	COHQU	value to the
965	COHQU	element and provide that id as the value of the
999	COHQU	For technical terminology in particular, and generally in terminological studies, it may be useful to associate an instance of a term within a text with a canonical definition for it, which is stored either elsewhere in the same text (for example in a glossary of terms) or externally, for example in a database, authority file, or published standard. The attributes
1008	COHQU	Another group of elements is used to supply different kinds of names for objects described by the TEI. Examples of this are documentation of elements, attributes, classes (and also attribute values where appropriate), and description of glyphs.
1015	COHQU	element mentioned above, these elements constitute the
1039	COHQHEG	This encoding would, however, lose the important distinction between an italicized title and an italicized foreign phrase. Many other phrases might also be italicized in the text, and a retrieval program seeking to identify foreign terms (for example) would not be able to produce reliable results by simply looking for italicized words. Where economic and intellectual constraints permit, therefore, it would be preferable to encode both the function of the highlighted phrases and their appearance, as follows:
1049	COHQHEG	debatings. She says I am
1068	COHQHEG	; the former is emphasized, while the latter is proverbial. It also provides an ironic gloss for the words
1074	COHQHEG	. The glossed phrases are not, however, technical terms or cited words, but quoted phrases, as if the writer were putting words into her own and her mother's mouths. Finally, the words
1111	COED	As in editing a printed text, so in encoding a text in electronic form, it may be necessary to accommodate editorial comment on the text and to render account of any changes made to the text in preparing it. The tags described in this section may be used to record such editorial interventions, whether made by the encoder, by the editor of a printed edition used as a copy text, by earlier editors, or by the copyists of manuscripts.
1117	COED	. The examples given here illustrate only simple cases of editorial intervention; in particular, they permit economical encoding of a simple set of alternative readings of a short span of text. To encode multiple views of large or heterogeneous spans of text, the mechanisms described in chapter
1123	COED	, that is, a code indicating the person or agency responsible for making the editorial intervention in question, and also an indication of the degree of
1124	COED	certainty
1138	COED	Many of the elements discussed here can be used in two ways. Their primary purpose is to indicate that the text encoded as the element's content represents an editorial intervention (or non-intervention) of a specific kind, indicated by the element itself. However, pairs or other meaningful groupings of such elements can also be supplied, wrapped within a special purpose
1143	COED	This element enables the encoder to represent for example a text in its
1145	COED	uncorrected and unaltered form, alongside the same text in one or more
1148	COED	view
1149	COED	of a text and another, so that (for example) a stylesheet may be set to display either the text in its original form or after the application of editorial interventions of particular kinds.
1153	COED	class. The default members of this class are
1177	COED	indication or correction of apparent errors
1188	COEDCOR	When the copy text is manifestly faulty, an encoder or transcriber may elect simply to correct it without comment, although for scholarly purposes it will often be more generally useful to record both the correction and the original state of the text. The elements described here enable all three approaches, and allow the last to be done in such a way as to make it easy for software to present either the original or the correction.
1193	COEDCOR	The following examples show alternative treatment of the same material. The copy text reads:
1194	COEDCOR	Another property of computer-assisted historical research is that data modelling must permit any one textual feature or part of a textual feature to be a part of more than one information model and to allow the researcher to draw on several such models simultaneously, for example, to select from a machine-readable text those marginal comments which indicate that the date's mentioned in the main body of the text are incorrect.
1196	COEDCOR	An encoder may choose to correct the typographic error, either silently or with an indication that a correction has been made, as follows:
1206	COEDCOR	If the encoder elects both to record the original source text and to provide a correction for the sake of word-search and other programs, both
1226	COEDCOR	If it is desired to indicate the person or edition responsible for the emendation, this might be done as follows:
1243	COEDCOR	attribute has been used to indicate responsibility for the correction. Its value (
1250	COEDCOR	element within the TEI header, but any element might be indicated in this way, including for example a
1269	COEDCOR	Where, as here, the correction takes the form of adding text not otherwise present in the text being encoded, the encoder should use the
1271	COEDCOR	element. Where the correction is present in the text being encoded, and consists of some combination of visible additions and deletions, the elements
1276	COEDCOR	below. Where the correction takes the form of addition of material not present in the original because of physical damage or illegibility, the
1279	COEDCOR	correction
1282	COEDCOR	element may be used. These and other elements to support the detailed encoding of authorial or scribal interventions of this kind are all provided by the module described in chapter
1292	COEDREG	When the source text makes extensive use of variant forms or non-standard spellings, it may be desirable for a number of reasons to
1299	COEDREG	In some contexts, the term
1304	COEDREG	As with other such changes to the copy text, the changes may be made silently (in which case the TEI header should specify the types of silent changes made) or may be explicitly marked using the following elements:
1340	COEDREG	Alternatively, the encoder may elect to record both old and new spellings, so that (for example) the same electronic text may serve as the basis of an old- or new-spelling edition:
1369	COEDADD	The following elements are used to indicate when words or phrases have been omitted from, added to, or marked for deletion from, a text. Like the other editorial elements, they allow for a wide range of editorial practices:
1376	COEDADD	Encoders may choose to omit parts of the copy text for reasons ranging from illegibility of the source or impossibility of transcribing it, to editorial policy, e.g. a systematic exclusion of poetry or prose from an encoding. The full details of the policy decisions concerned should be documented in the TEI header (see section
1377	COEDADD	). Each place in the text at which omission has taken place should be marked with a
1379	COEDADD	element, with optionally further information about the reason for the omission, its extent, and the person or agency responsible for it, as in the following examples:
1380	COEDADD	Note that the extent of the gap may be marked precisely using attributes
1386	COEDADD	attribute. Other, more detailed, options are also available for representing dimensions of any kind; see further
1391	COEDADD	element may be used to supply a description of the material omitted, where that is considered useful:
1407	COEDADD	elements may be used to record where words or phrases have been added or deleted in the copy text. They are not appropriate where longer passages have been added or deleted, which span several elements; for these, the elements
1414	COEDADD	Additions to a text may be recorded for a number of reasons. Sometimes they are marked in a distinctive way in the source text, for example by brackets or insertion above the line (
1417	COEDADD	additions
1429	COEDADD	element should not be used to mark editorial changes, such as supplying a word omitted by mistake from the source text or a passage present in another version. In these cases, either the
1438	COEDADD	element is used to mark passages in the original which cannot be read with confidence, or about which the transcriber is uncertain for other reasons, as for example when transcribing a partially inaudible or illegible source. Its
1444	COEDADD	element, to indicate the cause of uncertainty and the person responsible for the conjectured reading.
1450	COEDADD	or from a spoken text:
1456	COEDADD	Where the material affected is entirely illegible or inaudible, the
1462	COEDADD	element is used to mark material which is deleted in the source but which can still be read with some degree of confidence, as opposed to material which has been omitted by the encoder or transcriber either because it is entirely illegible or for some other reason. This is of particular importance in transcribing manuscript material, though deletion is also found in printed texts, sometimes for humorous purposes:
1476	COEDADD	attribute may be used to distinguish different methods of deletion in manuscript or typescript material, as in this line from the typescript of Eliot's
1492	COEDADD	provides a way of grouping additions and deletions of this kind.
1496	COEDADD	element should not be used where the deletion is such that material cannot be read with confidence, or read at all, or where the material has been omitted by the transcriber or editor for some other reason. Where the material deleted cannot be read with confidence, the
1498	COEDADD	tag should be used with the
1500	COEDADD	attribute indicating that the difficulty of transcription is due to deletion. Where material has been omitted by the transcriber or editor, this may be indicated by use of the
1506	COEDADD	element. Text supplied or marked as unnecessary by an editor should be marked with the
1515	COEDADD	. These two sets of elements allow the encoder to distinguish editorial changes from those visible in the source text.
1525	CONA	This section describes a number of textual features which it is often convenient to distinguish from their surrounding text. Names, dates, and numbers are likely to be of particular importance to the scholar treating a text as source for a database; distinguishing such items from the surrounding text is however equally important to the scholar primarily interested in lexis.
1534	CONARS	referring string
1571	CONARS	element may be used for any reference to a person, place, etc., not only to references in the form of a proper noun or noun phrase.
1580	CONARS	element by contrast is provided for the special case of referencing strings which consist only of proper nouns; it may be used synonymously with the
1582	CONARS	element, or nested within it if a referring string contains a mixture of common and proper nouns. The following example shows an alternative way of encoding the short sentence from
1594	CONARS	As the following example shows, a proper name may be nested within a referring string:
1599	CONARS	Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as
1603	CONARS	may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer.
1605	CONARS	Two issues arise in this context: firstly, there may be a need to encode a regularized form of a name, distinct from the actual form in the source to hand; secondly, there may be a need to identify the particular person, place, etc. referred to by the name, irrespective of whether the name itself is normalized or not. The element
1623	CONARS	A very useful application for them is as a means of gathering together all references to the same individual or location scattered throughout a document:
1641	CONARS	The value of the
1643	CONARS	attribute may be an unexpanded code, as in the examples above, with no particular significance. More usually however, it will be an externally defined code of some kind, as provided by a standard reference source.
1649	CONARS	The standard reference source should be documented using a
1651	CONARS	element in the TEI header.
1655	CONARS	attribute can be used to point directly to some other resource providing more information about the entity named by the element, such as an authority record in a database, an encyclopaedia entry, another element in the same or a different document, etc.
1663	CONARS	(regularization) element to provide the standard form of a referring string, as in this example:
1673	CONARS	attribute, since its form will depend entirely on practice within a given project. For the same reason, this attribute is not recommended in data interchange, since there is no way of ensuring that the values used by one project are distinct from those used by another. In such a situation, a preferable approach for magic tokens which follows standard practice on the Web is to use a
1675	CONARS	attribute whose value is a tag URI as defined in
1684	CONARS	The inclusion of the domain name of the party responsible for tagging (
1686	CONARS	), as specified in RFC 4151, helps ensure uniqueness of magic token values across TEI encoding projects, allowing for improved interchange of TEI documents.
1691	CONARS	may be used if it is desired to record both a normalized form of a name and the name used in the source being encoded:
1707	CONARS	may be more appropriate if the function of the regularization is to provide a consistent index:
1713	CONARS	Although adequate for many simple applications, these methods have two inconveniences: if the name occurs many times, then its regularized form must be repeated many times; and the burden of additional XML markup in the body of the text may be inconvenient to maintain and complex to process. For applications such as onomastics, relating to persons or places named rather than the name itself, or wherever a detailed analysis of the component parts of a name is needed, the specialized elements described in chapter
1730	CONAAD	elements; for other kinds of address this class may be extended by adding new elements if necessary.
1732	CONAAD	These Guidelines provide no particular means for encoding the substructure of an email address (for example, distinguishing the local part from the domain part), nor of distinguishing personal email addresses from generic or fictitious ones.
1738	CONAAD	The simplest way of encoding a postal address is to regard it as a series of distinct lines, just as they might be written on an envelope. The following element supports this view:
1739	CONAAD	Here is an example of a postal address encoded using this approach:
1749	CONAAD	Alternatively, an address may be encoded as a structure of more semantically rich elements. The class
1751	CONAAD	element class identifies a number of such possible components:
1756	CONAAD	Any number of elements from the
1758	CONAAD	class may appear within an address and in any order. None of them is required.
1760	CONAAD	Where code letters are commonly used in addresses (for example, to identify regions or countries) a useful practice is to supply the full name of the region or country as the content of the element, but to supply the abbreviatory code as the value of the global
1762	CONAAD	attribute, so that (for example) an application preparing formatted labels can readily find the required information. Other components of addresses may be represented using the general-purpose
1764	CONAAD	element or (when the additional module for names and dates is included) the more specialized elements provided for that purpose.
1766	CONAAD	Using just the elements defined by the core module, the above address could thus be represented as follows:
1778	CONAAD	The order of elements within an address is highly culture-specific, and is therefore unconstrained:
1792	CONAAD	A telephone number (normally outside of the
1798	CONAAD	, with the number itself appearing in the
1806	CONAAD	. A full postal address may also include the name of the addressee, tagged as above using the general purpose
1811	CONAAD	, a large number of more specific elements such as
1817	CONAAD	. The above example might then be encoded as follows:
1861	CONANU	element provides a convenient method of distinguishing numbers from the surrounding text. For other kinds of application, numbers are only useful if normalized: here the
1883	CONANU	; less frequently the number may be recognisable linguistically as such but may use a notation with which the encoder is unfamiliar. To help in these situations, the
1893	CONANU	measure
1894	CONANU	consists of a number, a phrase expressing units of measure and a phrase expressing the commodity being measured, though not all of these components need be present in every case. It may be helpful to distinguish measures from surrounding text for two reasons. Firstly, a measure may be expressed using a particular notation or system of abbreviations which the encoder does not wish to regard as lexical. Secondly, a quantitative application may wish to distinguish and normalize the internal components of a measure, in order to perform calculations on them.
1896	CONANU	Consider, as an example of the first case, the following list of Celia's charms, in which the encoder has chosen to make explicit the measurements:
1931	CONANU	In general, normalization of a measure will require specification of one or more of its three parts: the quantity, the units, and possibly also the commodity being measured. This is accomplished by supplying values for the three attributes
1937	CONANU	, which are supplied by the
1946	CONANU	Such techniques are particularly useful when representing historical data such as inventories:
1962	CONANU	element is provided as a means of grouping several related measurements together, either because the measurement involves several dimensions (for example height and width) or to avoid the need to repeat all the normalizing attributes:
1983	CONADA	Dates and times, like numbers, can appear in widely varying culture- and language-dependent forms, and can pose similar problems in automatic language processing. Such elements constitute the
1985	CONADA	class, of which the default members are:
1989	CONADA	These elements have some additional attributes by virtue of being members of the
1993	CONADA	classes which, in turn, are members of the
2017	CONADA	attribute by simply omitting a part of the value supplied. Imprecise dates or times (for example
2020	CONADA	some time after ten and before twelve
2021	CONADA	) may be expressed as date or time ranges.
2023	CONADA	These mechanisms are useful primarily for fully specified dates or times known with certainty. If component parts of dates or times are to be marked up, or if a more complex analysis of the meaning of a temporal expression is required, the techniques described in chapter
2026	CONADA	Where the certainty (i.e. reliability) of the date or time is in question, the encoder should record this fact using the mechanisms discussed in chapter
2027	CONADA	. The same chapter also discusses various methods of recording the precision of numerical or temporal assertions.
2040	CONADA	attribute always supplies a normalized representation of the date given as content of the
2047	CONADA	date
2059	CONADA	time
2063	CONADA	There is one exception: these Guidelines permit a time to be expressed as only a number of hours, or as a number of hours and minutes, as per ISO 8601:2004 sections 4.2.2.3 and 4.3.3. The W3C
2067	CONADA	datatypes require that the minutes and seconds be included in the normalized value if they are to be correctly processed for example when sorting.
2086	CONADA	Note in the last example the use of a normalized representation for the date string which includes a time: this example could thus equally well be tagged using the
2109	CONADA	attribute may be used to specify a date in any calendar system; if the
2111	CONADA	attribute is also supplied, it should specify the equivalent date in the Gregorian calendar.
2121	CONAAB	It is sometimes desirable to mark abbreviations in the copy text, whether to trigger special processing for them, to provide the full form of the word or phrase abbreviated, or to allow for different possible expansions of the abbreviation. Abbreviations may be transcribed as they stand, or expanded; they may be left unmarked, or marked using these tags:
2181	CONAAB	Abbreviation is a particularly important feature of manuscript and other source materials, the transcription of which needs more detailed treatment than is possible using these simple elements. A more detailed set of recommendations is discussed in
2182	CONAAB	, which includes additional elements made available for the purpose by the
2192	COXR	Cross-references or links between one location in a document and one or more other locations, either in the same or different XML documents, may be encoded using the elements
2198	COXR	from one location in a document, the place that the element itself appears, to another (or to several), specified by means of a
2200	COXR	attribute, supplied by the
2208	COXR	The value of the
2212	COXR	mechanism. This permits a range of complexity, from the very simple (a reference to the value of the target element's
2214	COXR	attribute) to the more complex usage of a full URI with embedded XPointers. For example, the source of the following paragraph looks something like this:
2226	COXR	Alternatively, if no explicit link is to be encoded, but it is simply required to mark the phrase as a cross-reference, the
2237	COXR	; for a discussion of TEI schemes for XPointer, see
2247	COXR	are the default members of the phrase-level model class
2249	COXR	. As members of the classes
2267	COXR	element may contain phrases specifying, or describing more exactly, the target of a cross-reference, which form the content of the element. Since its content thus serves as a human-readable pointer, in the simplest case a
2279	COXR	attribute, so that processing software can access it directly, for example to implement a linkage, to generate an appropriate reference, or to give an error message if it cannot be found. Assuming that section 12 in the previous example has been tagged
2282	COXR	then the same cross-reference might more exactly be encoded as
2288	COXR	If the cross-reference itself is to be generated according to a fixed pattern, or if no text is to appear in the body of the cross-reference, the
2300	COXR	); the definition it provides is used to translate the value of the
2302	COXR	attribute into a conventional pointer value, such as one that might be supplied by the
2312	COXR	attribute is used, a cross reference may point to any number of locations simultaneously, simply by giving more than one identifier as the value of its
2314	COXR	attribute. This may be particularly useful where an analytic index is to be encoded, as in the following example:
2328	COXR	, etc. have been provided in the body of the text, for example as page breaks
2337	COXR	A similar method may be used to link annotations on a text with the sigla used to encode their points of attachment in a text. For example:
2358	COXR	The value
2364	COXR	element here might be used to indicate that the object being referenced here is a bibliographic entry rather than a simple cross-reference to an illustration, as is the first
2366	COXR	. In either case, the value of the
2373	COXR	elements have many applications in addition to the simple cross-referencing facilities illustrated in this section. In conjunction with the analytic tools discussed in chapters
2376	COXR	, they may be used to link analyses of a text to their object, to combine corresponding segments of a text, or to align segments of a text with a temporal or other axis or with each other.
2406	COLI	list
2407	COLI	: numbered, lettered, bulleted, or unmarked. Lists formatted as such in the copy text should in general be encoded using this element, with an appropriate value for the
2425	COLI	Some of these values may of course be combined; a list may be inline, but also be rendered with numbers. An example appears below. For more sophisticated and detailed description of list rendering, consider using the
2431	COLI	Each distinct item in the list should be encoded as a distinct
2433	COLI	element. If the numbering or other identification for the items in a list is unremarkable and may be reconstructed by any processing program, no enumerator need be specified. If however an enumerator is retained in the encoded text, it may be supplied either by using the
2457	COLI	The two styles may not be mixed in the same list: if one item is preceded by a label, all must be.
2459	COLI	A list need not necessarily be displayed in list format. For example, the following is a reasonable encoding of a list which (in the original) is simply printed as a single paragraph:
2492	COLI	A list may be given a heading or title, for which the
2496	COLI	element to mark a tabular or glossary list in which each item is associated with a word or phrase rather than a numeric or alphabetic enumerator:
2522	COLI	In such a list, the individual items have internal structure. In complex cases, where list items contain many components, the list is better treated as a
2523	COLI	table
2528	COLI	. A particularly important instance of the simple two-column table is the
2529	COLI	glossary list
2530	COLI	, which should be marked by the tag
2531	COLI	list type="gloss"
2534	COLI	element contains a term and each
2536	COLI	its gloss; it is a semantic error for a list tagged with
2567	COLI	might be used to make explicit the role that each column in the glossary list has, as follows:
2608	COLI	) element what language the term is from. For further discussion of the
2617	COLI	element used to supply a title or heading for the whole list, headings for the two columns of a glossary-style list may be specified using the two special elements
2662	COLI	, including other lists. In this example, a glossary list contains two items, each of which is itself a simple list:
2705	CONONO	The following element is provided for the encoding of discursive notes, whether already present in the copy text or supplied by the encoder:
2708	CONONO	A note is any additional comment found in a text, marked in some way as being out of the main textual stream. All notes should be marked using the same tag,
2710	CONONO	, whether they appear as block notes in the main text area, at the foot of the page, at the end of the chapter or volume, in the margin, or in some other place.
2714	CONONO	A note is usually attached to a specific point or span within a text, which we term here its
2718	CONONO	When encoding such a text, it is conventional to replace this siglum by the content of the annotation, duly marked up with a
2720	CONONO	element. This may not always be possible, for example with marginal notes, which may not be anchored to an exact location. For ease of processing, it may be adequate to position marginal notes before the relevant paragraph or other element. In printed texts, it is sometimes conventional to group notes together at the foot of the page on which their points of attachment appear. This practice is not generally recommended for TEI-encoded texts, since the pagination of a particular printed text is unlikely to be of structural significance. In some cases, however, it may be desirable to transcribe notes not at their point of attachment to the text but at their point of appearance, typically at the end of the volume, or the end of the chapter. In such cases, the
2728	CONONO	element, pointing from that to the body of the
2732	CONONO	In cases where the note is applied not to a point but to a span of text, not itself represented as a TEI element, the
2736	CONONO	function to specify the span of attachment.
2743	CONONO	attribute is used to categorise the note as a gloss:
2757	CONONO	element, we may infer that its point of attachment is in the margin adjacent to the line in question. In the following version of the same text, however, it may be inferred that the note applies to the whole of the stanza:
2770	CONONO	This type of annotation, very common in the early printed texts which Coleridge may be presumed to be imitating in this case, may also be regarded as providing a heading or descriptive label for the passage concerned. The encoder may therefore prefer to use the
2785	CONONO	In the following example, a note which appears at the foot of the page in the printed source is given at its point of attachment within the text. The global
2787	CONONO	attribute is used to indicate the note number:
2801	CONONO	In addition to transcribing notes already present in the copy text, researchers may wish to add their own notes or comments to it. The
2811	CONONO	attribute may be used to point to a definition of the person or other agency responsible for the content of the note.
2813	CONONO	As a simple example, an edition of the
2829	CONONO	; thus in this case, the TEI header for this text might contain a title statement like the following:
2840	CONONO	When annotating the electronic text by means of analytic notes in some structured vocabulary, e.g. to specify the topics or themes of a text, the
2844	CONONO	elements may be more effective than the free form
2846	CONONO	element; these elements are available when the module for simple analysis is selected (see section
2852	CONOIX	The indexing of scholarly texts is a skilled activity, involving substantial amounts of human judgment and analysis. It should not therefore be assumed that simple searching and information retrieval software will be able to meet all the needs addressed by a well-crafted manual index, although it may complement them for example by providing free text search. The role of an index is to provide access via keywords and phrases which are not necessarily present in the text itself, but must be added by the skill of the indexer.
2856	CONOIXpre	When encoding a pre-existing text, therefore, if such an index is present it may be advisable to retain it along with the text, rather than attempt to regenerate it automatically. Elements discussed elsewhere in these Guidelines may be used for this purpose. For example, the
2860	CONOIXpre	element may be used to mark the section of the text containing the index and the
2862	CONOIXpre	element might be used to mark the index itself, each entry being represented by an
2864	CONOIXpre	element, possibly containing within it a series of
2896	CONOIXpre	Note that this simple representation does not capture the nested structure of the first of these index entries. A more accurate representation might entail the use of nested lists like the following:
2924	CONOIXpre	elements above, might also include direct links to the appropriate location in the encoded text, using (for example) a target attribute to supply the identifier of an associated page break element:
2932	CONOIXpre	. Note that similar methods may also be used to encode a table of contents, as further exemplified in section
2938	CONOIXgen	It can also be useful, however, to generate a new index from a machine-readable text, whether the text is being written for the first time with the tags here defined, or as an addition to a text transcribed from some other source. Depending on the complexity of the text and its subject matter, such an automatically generated index may not in itself satisfy all the needs of scholarly users. However, it can assist a professional indexer to construct a fully adequate index, which might then be post-edited into the digital text, marked up along the lines already suggested for preserving pre-existing index material.
2948	CONOIXgen	this element may be used simply to provide a descriptive or interpretive label of some kind for any location within a text, to be processed in any way by analytic software, but its main purpose is to facilitate the generation of an index for a printed version of the text. An
2950	CONOIXgen	element may be placed anywhere within a text, between or within other elements. The headwords to be used when making up this index are given by the
2954	CONOIXgen	element. The location of the generated index might be specified by means of a processing instruction within the text, such as the following (the exact form of the PI is of course dependent on the application software in use):
2956	CONOIXgen	Alternatively, the special purpose
2960	CONOIXgen	In the simplest case, a single headword is supplied by an
2972	CONOIXgen	The effect of this is to document an index entry for the term
2974	CONOIXgen	, which when processed could reference the location of the original
2978	CONOIXgen	If the subject of Arabic lemmatization is treated at length in a text, then the index entry generated may need to reference a sequence of locations (e.g. page numbers). In such a case it will be necessary to identify the end of the relevant span of text as well as its starting point. This is most conveniently done by supplying an empty
2994	CONOIXgen	This would generate the same index entries as the previous example, but the reference would be to the whole span of text between the location of the
2996	CONOIXgen	element and the location of the element identified by the code
2998	CONOIXgen	, rather than a single point, and thus might (for example) include a sequence of page numbers.
3002	CONOIXgen	element in the text provides the target location that will be specified in the generated index entry, no part of the text itself is used to construct that entry. Index terms appearing in the entry come solely from the content of
3004	CONOIXgen	elements, which consequently may have to repeat words or phrases from the text proper. This need not be done verbatim, thus giving scope for normalization of spelling (as in the example above) or other modifications which may assist generation of an index in a desired form or sequence.
3006	CONOIXgen	Sometimes, for example when index terms are taken from a different language or consist of mathematical formulae or other expressions, even a normalized form of an index term may be insufficient for an application to order it exactly as desired. The
3008	CONOIXgen	attribute may be used to address this problem, as in the following example:
3012	CONOIXgen	Here, an entry for the symbol @ will appear in the index, but will be sorted alphabetically as if it were the string
3014	CONOIXgen	. This technique is also useful when an index entry is to contain some non-Unicode character or glyph represented by the
3017	CONOIXgen	. In the following example, we assume that somewhere a definition for this glyph has been provided using the elements described in chapter
3018	CONOIXgen	, and given the code
3027	CONOIXgen	Note that if no value is supplied for the sortKey attribute, a sorting application should always use the content of the
3031	CONOIXgen	It is common practice to compile more than one index for a given text. A biography of a poet, for example, may offer an index of references to poems by the subject of the study, another index of works by other writers, an index of places or historical personages etc. The indexName attribute is used to assign index terms and locations to one or more specific indexes:
3039	CONOIXgen	TEI
3042	CONOIXgen	, an index may contain structured entries like
3043	CONOIXgen	TEI, markup practices, index terms
3044	CONOIXgen	, where a top level entry
3045	CONOIXgen	TEI
3046	CONOIXgen	is followed by a number of second-level subcategories, any or all of which may have a third-level list attached to them and so on. In order to reflect such a hierarchical index listing,
3048	CONOIXgen	elements may be nested to the required depth. For example, suppose that we wish to make a structured index entry for
3054	CONOIXgen	, etc. The example at the start of this section might then be encoded with nested
3067	CONOIXgen	The index entry from Burton's
3069	CONOIXgen	quoted above might be generated in a similar way. To generate such an entry, the body of the text might include, at page 193, an
3081	CONOIXgen	. Similarly, page 601 of the body text would include an
3109	CONOIXgen	elements, the duplication required to make the structure explicit will normally be removed, so as to produce entries like those quoted above. However, this is not required by the encoding recommended here.
3113	CONOIXgen	element may be used to mark the place at which an index generated from
3115	CONOIXgen	elements should be inserted into the output of a processing program; typically but not necessarily this will be at some point within the back matter of the document. If the
3117	CONOIXgen	element is used, then the
3119	CONOIXgen	attribute should be used to specify which kind of index is to be generated, and its value should correspond with that of the
3140	CONOIXgen	attribute may also be used to specify a name or identifier for the generated index itself in the usual way. Any additional headings etc. required for the generated index must be specified as content of the
3152	CONOIXgen	If a processing instruction is used, then these parameters for the generated index may be supplied in some other way.
3154	CONOIXgen	One final feature frequently found in manually-created indexes to printed works cannot readily be encoded by the means provided here, namely cross-references internal to the index term listing. For example, if all references to the TEI in a text have been indexed using the index term
3156	CONOIXgen	, it may also be helpful to include an entry under the term
3157	CONOIXgen	TEI
3158	CONOIXgen	containing some text such as
3171	COGR	Graphics, such as illustrations or diagrams, appear in many different kinds of text, and often with different purposes. Audio or video clips may also appear. In some cases, such media form an integral part of a text (indeed, some texts—comic books for example—may be almost entirely graphic); in others the graphic or video may be a kind of optional extra. In some cases, the text may be incomprehensible unless the media is included; in others, the presence of the media adds little to the sense of the work. It will therefore be a matter of encoding policy as to whether or how media found in a source text are transferred to a new encoded version of the same. In documents which are
3173	COGR	, media such as graphics and other non-textual components may be particularly salient, but their inclusion in an archival form of the document concerned remains an editorial decision.
3175	COGR	Considered as structural components, media may be anchored to a particular point in the text, or they may
3177	COGR	either completely freely, or within some defined scope, such as a chapter or section. Time-based media such as audio or video may need to be synchronized with particular parts of a written text. Media of all kinds often contain associated text such as a heading or label. These Guidelines provide the following different elements to indicate their appearance within a text:
3185	COGR	Media files may be encoded in a number of different ways:
3187	COGR	in some non-XML or binary format such as PNG, JPEG, MP3, MP4 etc.
3191	COGR	in a TEI XML format such as the notation for graphs and trees described in
3193	COGR	In the last two cases, the presence of the graphic will be indicated by an appropriate XML element, drawn from the SVG namespace in the second case, and its content will fully define the graphic to be produced. In the first case, however, one of the elements
3197	COGR	is used to mark the presence of the graphic only and the visual content itself is stored outside the XML document at a location referenced by means of an
3201	COGR	class. Alternatively, if it is small, the media information may be embedded directly within the document using some suitable binary format such as Base64; in this case the
3213	COGR	when this module is included in a schema. These elements are also members of the class
3220	COGR	For example, the following passage indicates that a copy of the image found in the source text may be recovered from the URL
3228	COGR	The media elements are phrase level elements which may be used anywhere that textual content is permitted, within but not between paragraphs or headings. In the following example, the encoder has decided to treat a specific printer's ornament as a heading:
3235	COGR	provides additional capabilities, for example the ability to combine a number of images into a hierarchically organized structure or a block of images. The
3239	COGR	attribute, which can be used to distinguish different kinds of graphic component within a single work, for example, maps as opposed to illustrations. It also provides the ability to associate an image with additional information such as a heading or a description.
3250	CORS	we mean the system by which names or references are associated with particular passages of a text (e.g.
3252	CORS	for the third verse of Psalm 23 or
3256	CORS	, book 2, poem 10, line 7). Such names make it possible to mark a place within a text and enable other readers to find it again. A reference system may be based on structural units (chapters, paragraphs, sentences; stanza and verse), typographic units (page and line numbers), or divisions created specifically for reference purposes (chapter and verse in Biblical texts). Where one exists, the traditional reference system for a text should be preserved in an electronic transcript of it, if only to make it easier to compare electronic and non-electronic versions of the text.
3260	CORS	where a reference system exists, and is based on the same logical structure as that of the text's markup, the reference for a passage may be recorded as the value of the global
3274	CORS	where a reference system exists which is not based on the same logical structure as that of the text's markup (for example, one based on the page and line numbers of particular editions of the text rather than on the structural divisions of it), any of a variety of methods for encoding the logical structure representing the reference system may be employed, as described in chapter
3277	CORS	where a reference system exists which does not correspond to any particular logical structure, or where the logical structure concerned is of no interest to the encoder except as a means of supporting the referencing system, then references may be encoded by means of
3279	CORS	elements, which simply mark points in the text at which values in the reference system change, as described below in section
3281	CORS	The specific method used to record traditional or new reference systems for a text should be declared in the TEI header, as further described in section
3285	CORS	When a text has no pre-existing associated reference system of any kind, these Guidelines recommend that, as a minimum, the page boundaries of the source text be marked using one of the methods outlined in this section. Retaining page breaks in the markup is also recommended for texts which have a detailed reference system of their own. Line breaks in prose texts may be, but need not be, tagged.
3286	CORS	Many encoders find it convenient to retain the line breaks of the original during data entry, to simplify proofreading, but this may be done without inserting a tag for each line break of the original.
3294	CORS1	When traditional reference schemes represent a hierarchical structuring of the text which mirrors that of the marked-up document, the
3298	CORS1	attribute may also be used to record the numbering of sections or list items in the copy text if the copy-text numbering is important for some reason, for example because the numbers are out of sequence.
3304	CORS1	—book 2, poem 10, line 7. Book, poem, and line are structural units of the work and will therefore be tagged in any case. (See chapter
3305	CORS1	for a discussion of structural units in verse collections.) In such cases, it is convenient to record traditional reference numbers of the structural units using the
3328	CORS1	One may also place the entire standard reference for each portion of the text into the appropriate value for the
3330	CORS1	attribute, though for obvious reasons this takes more space in the file:
3347	CORS1	If the names used by the traditional reference system can be formulated as identifiers, then the references can be given as values for the
3353	CORS1	attribute must be unique throughout the document. Our example then looks like this:
3370	CORS1	To document the usage and to allow automatic processing of these standard references, it is recommended that the TEI header be used to declare whether standard references are recorded in the
3379	CORS1	attribute one can specify only a single standard referencing system, a limitation not without problems, since some editions may define structural units differently and thus create alternative reference systems. For example, another edition of the
3381	CORS1	considers poem 10 a continuation of poem 9, and therefore would specify the same line as
3388	CORS2	If a text has no canonical reference system of its own, a new custom reference system may be used.
3402	CORS2	Determining a referencing system for a TEI encoding depends on many factors that may either be derived from textual structure, or influenced by extra-textual contingencies such as project and file management concerns. It is important, therefore, that the attribute used, the elements which can bear standard reference identifiers, and the method for constructing standard reference identifiers, should all be declared in the header as described in section
3410	CORS2-1	A new referencing system may be derived from the structure of the electronic text, specifically from the markup of the text. As with any reference system intended for long-term use, it is important to see the reference as an established, unchanging point in the text. Should the text be revised or rearranged, the reference-system identifiers associated with any section of text must stay with that section of text, even if it means the reference numbers fall out of sequence. (A new reference system may always be created beside the old one if out-of-sequence numbers must be avoided.)
3417	CORS2-1	domain-style address
3418	CORS2-1	comprising a series of components separated by full stops, with one component for each level of the document hierarchy. Two methods may be used. In the
3420	CORS2-1	form of identifier, each component in the identifier takes the form of an element identifier, a hyphen, and a number, for example
3422	CORS2-1	. The element name specifies what type of element is to be sought, and the number specifies which occurrence of that element type is to be selected. (The hyphen and number may be omitted if there is only one element of the given type.) In the
3424	CORS2-1	form of identifier, each component consists of a number, indicating which element in the sequence of nodes at each level is to be selected. To make the resulting identifier a valid XML identifier, it may need to be prefixed with an unchanging alphabetic letter.
3434	CORS2-1	element may be taken as a starting point only if identifiers need to be generated for the
3438	CORS2-1	element as a root would prevent assignment of identifiers for the front and back matter. The component corresponding to the root element can be omitted from identifiers, if no confusion will result. In collections and corpora, the component corresponding to the root may be replaced by the unique identifier assigned to the text or sample.
3446	CORS2-1	value; the latter are prefixed with the string
3490	CORS2-1	attribute is used to record the reference identifiers generated, each value should record the entire path. If the
3492	CORS2-1	attribute is used, each value may record either the entire path or only the subpath from the parent element. The attribute used, the elements which can bear standard reference identifiers, and the method for constructing standard reference identifiers, should all be declared in the header as described in section
3501	CORS2-2	attributes. Every convention will have strengths and weaknesses and it is left to encoders to make a decision that enables them to locate information in their TEI document.
3503	CORS2-2	Here are some examples of referencing systems that have been used in TEI projects:
3506	CORS2-2	identifiers constructed with a number of characters from the main document title, followed by an incremental number. E.g. HOL001, HOL002, etc. using a fixed number of digits; or without fixed digits: HOL1, HOL2, etc.
3509	CORS2-2	identifiers constructed on the markup itself, as described in the previous section. To facilitate uniqueness in a corpus, each identifier may be prefixed with the identifier of the root
3518	CORS2-2	XML well-formedness requires only that xml:id attributes be unique within a single document. However, it is also worth keeping in mind that for operating with referencing systems across a corpus of TEI files it is helpful (or even necessary in some circumstances) to have unique identifiers across the whole corpus.
3522	CORS2-2	may be populated either computationally or manually. In the latter case, it is advisable to put measures in place to avoid human error. Custom data types and Schematron rules may be defined in a customization ODD, and a check digit may be added to prevent unwanted changes.
3523	CORS2-2	A check digit is computed from the value of an identifier and appended to the value itself. If the identifier is subsequently changed, the check digit no longer matches and the value fails validation.
3530	CORS5	milestone
3534	CORS5	These elements simply mark the points in a text at which some category in a reference system changes. They have no content but subdivide the text into regions, rather in the same way as milestones mark points along a road, thus implicitly dividing it into segments. The elements
3542	CORS5	are specialized types of milestone, marking gathering, page, column, and line boundaries respectively. The global
3544	CORS5	attribute is used in each case to provide a value for the particular unit associated with this milestone (for example, the page or line number). Since it is not structural, validation of a reference system based on
3546	CORS5	s cannot readily be checked by an XML parser, so it will be the responsibility of the encoder or the application software to ensure that they are given in the correct order.
3548	CORS5	Milestone elements are often used as a simple means of capturing the original appearance of an early printed text, which will rarely coincide exactly with structural units, but they are generally useful wherever a text has two or more competing structures. For example, many English novels were first published as serial works, individual parts of which do not always contain a whole number of chapters. An encoder might decide to represent the chapter-based structure using
3603	CORS5	Similarly, when tagging dramatic verse one may wish to privilege stanzas and lines over speeches and speakers, particularly where speeches cross line and line group boundaries. One might also wish to mark changes in narrative voice in a prose text. In either case, a milestone tag may be used to indicate change of speaker:
3614	CORS5	Milestone tags also make it possible to record the reference systems used in a number of different editions of the same work. The reference system of any one edition can be recreated from a text in which all are marked by simply ignoring all elements that do not specify that edition on their
3618	CORS5	As a simple example, assuming that edition E1 of some collection of poems regards the first two poems as constituting the first book, while edition E2 regards the first poem as prefatory, a markup scheme like the following might be adopted:
3629	CORS5	In this case no
3631	CORS5	value is specified, since the numbers rise predictably and the application can keep a count from the start of the document, if desired.
3633	CORS5	The value of the
3649	CORS5	tags, line numbers may be supplied for every line or only periodically (every fifth, every tenth line). The latter may be simpler; the former is more reliable.
3659	CORS5	could have been used equally well if preferred. The special value
3661	CORS5	should be reserved for marking sections of text which fall outside the normal numbering system (e.g. chapter heads, poem numbers, titles, or speaker attributions in a verse drama).
3663	CORS5	By default, there are no constraints on the values supplied for the
3666	CORS5	may be used, for example to specify that the attribute must specify one of a predefined set of values.
3671	CORS5	Milestone elements may be used to mark any kind of shift in the properties associated with a piece of text, whether or not it would normally be considered a reference system. For example, they may be used to mark changes in narrative voice in a prose text, or changes of speaker in a dramatic text, where these are not marked using structural elements such as
3677	CORS5	above, milestone elements such as
3681	CORS5	represent whitespace and are therefore by default assumed to occur between orthographic tokens in the text, where these are not otherwise indicated. By default it is reasonable to assume that words are not broken across page or line boundaries, and that therefore a sequence such as
3694	CORS5	attribute is provided to change the default assumption. To make explicit that
3699	CORS5	Where hyphenation appears before a line or page break, the encoder may or may not choose to record the fact, either explicitly using an appropriate Unicode character, or descriptively for example by means of the
3714	CORS6	Whatever kind of reference system is used in an electronic text, it is recommended that the TEI header contain a description of its construction in the
3734	CORS6	tags. The header section for such an encoding should look something like this:
3807	CORS6	tags, but giving the reference string in full on each tag. If canonical references are made only to lines, the reference system could be declared as follows:
3810	CORS6	Since the entire regular expression is enclosed as a parenthetical subgroup, the entire canonical reference string is sought as the value of the
3820	CORS6	This declaration indicates that the entire reference string must be sought as the value of the
3832	CORS6	The third example encodes the same reference system, this time giving the entire reference string as the value of the
3837	CORS6	although in general there seems to be little advantage in this case: it is no more difficult to use a standard relative URI reference as the value of
3841	CORS6	Reference systems recorded by means of milestone tags can also be declared; the following prose description could be used to declare the example given in section
3846	CORS6	Or in this way, using a formal declaration for this reference scheme derived from edition
3859	COBI	Bibliographic references (that is, full descriptions of bibliographic items such as books, articles, films, broadcasts, songs, etc.) or pointers to them may appear at various places in a TEI text. They are required at several points within the TEI header's source description, as discussed in section
3860	COBI	; they may also appear within the body of a text, either singly (for example within a footnote), or collected together in a list as a distinct part of a text; detailed bibliographic descriptions of manuscript or other source materials may also be required. These Guidelines propose a number of specialized elements to encode such descriptions, which together constitute the
3869	COBI	In printed texts, the individual constituents of a bibliographic reference are conventionally marked off from each other and from the flow of text by such features as bracketing, italics, special punctuation conventions, underlining, etc. In electronic texts, such distinctions are also important, whether in order to produce acceptably formatted output or to facilitate intelligent retrieval processing,
3872	COBI	as an author's name from
3874	COBI	as a place of publication or as a component of a title.
3877	COBI	It should be emphasized that for references as for other textual features, the primary or sole consideration is not how the text should be formatted when it is printed. The distinctions permitted by the scheme outlined here may not necessarily be all that particular formatters or bibliographic styles require, although they should prove adequate to the needs of many such commonly used software systems.
3882	COBI	structures, though the nature of their design prevents a simple one-to-one mapping from their data elements to TEI elements. For further information, see section
3885	COBI	) constitute a set which has been useful for a wide range of bibliographic purposes and in many applications, and which moreover corresponds to a great extent with existing bibliographic and library cataloguing practice. For a fuller account of that practice as applied to electronic texts see section
3901	COBI	element; instead, the presence and order of child elements must be used to reconstruct the punctuation required by a particular style.
3905	COBI	allows for considerable flexibility in that it can include both delimiting punctuation and unmarked-up text; and its constituents can also be ordered in any way. This makes it suitable for marking up bibliographies in existing documents, where it is considered important to preserve the form of references in the original document, while also distinguishing important pieces of information such as authors, dates, publishers, and so on.
3907	COBI	may also be useful when encoding
3909	COBI	documents which require use of a specific style guide when rendering the content; its flexibility makes it easier to provide all the information for a reference in the exact sequence required by the target rendering, including any necessary punctuation and linking words, rather than using an XSLT stylesheet or similar to reorder and punctuate the data.
3915	COBI	, has a content model based on the
3917	COBI	element of the TEI header. Both are based on the International Standard for Bibliographic Description (ISBD), which forms the basis of several national standards for bibliographic citations. The order of child elements in both
3938	COBI	resource identifier and terms of availability area
3941	COBI	, used with its child elements and without delimiting punctuation, provides an appropriate granularity of encoding with elements that can easily be rendered for the reader. However, it is important to note that some ISBD-derived citation formats (such as ANSI/NISO Z39.29 and ГОСТ 7.1) are not entirely conformant to ISBD either, since they may begin with a statement of authorship that does not map to the ISBD statement of responsibility.
3947	COBITY	class all share a number of possible component sub-elements. For the
3957	COBITY	Different levels of specific tagging may be appropriate in different situations. In some cases, it may be felt necessary to mark just the extent of the reference itself, with perhaps a few distinctions being made within it (for example, between the part of the reference which identifies a title or author and the rest). Such references, containing a mixture of text with specialized bibliographic elements, are regarded as
3970	COBITY	Some bibliographic references are extremely elliptical, often only a string of the form
3972	COBITY	. If no further details of Baxter's book are given in the source text and none is supplied by the encoder, then the reference thus given should be tagged as a
4032	COBITY	element defined in the TEI header module. This element is provided as a means of embedding the file description of one existing digital text within that of another (see further section
4053	COBITY	A list of bibliographic items, of whatever kind, may be treated in the same way as any other list (see section
4068	COBITY	may contain only bibliographic elements, optionally preceded by a heading and a series of introductory paragraphs. For most purposes, good practice would usually require that a
4145	COBITY	s and
4149	COBITY	items, the key information is marked up, but it is presented in an order which makes it suitable for direct rendering, with the punctuation included.
4207	COBICO	analytic
4211	COBICO	series
4216	COBICO	information relating to the publication, pagination, etc. of an item (most of these constitute the default members of the
4227	COBICO	class, other phrase-level elements, and plain text may be combined without other constraint; within the latter, such of these elements as exist for a given reference must be distinguished, and must also be presented in a specific order, discussed further below (section
4232	COBICOL	In common library practice a clear distinction is made between an individual item within a larger collection and a free-standing book, journal, or collection. Similarly a book in a series is distinguished sharply from the series within which it appears. An article forming part of a collection which itself appears in a series thus has a bibliographic description with three quite distinct levels of information:
4235	COBICOL	analytic
4243	COBICOL	series
4244	COBICOL	level, giving the title of the series, possibly the names of its editors, etc., and the number of the volume within that series.
4245	COBICOL	In the same way, an article in a journal requires at least two levels of information: the analytic level describing the article itself, and the monographic level describing the journal.
4247	COBICOL	A different identifying number may be supplied for any of these three items, that is, for the analytic item, the monographic item, or the series.
4284	COBICOL	, the levels are distinguished by the use of the following distinct elements:
4287	COBICOL	For purposes of TEI encoding, journals and anthologies are both treated as monographs; a journal title should thus be tagged as a
4288	COBICOL	title level="j"
4292	COBICOL	analytic
4301	COBICOL	element. (Whether reprints of an article are treated in the same bibliographic reference or a separate one varies among different styles. Library lists typically use a different entry for each publication, while academic footnoting practice typically treats all publications of the same article in a single entry.)
4305	COBICOL	element is used to supply further information about the location of some part of a bibliographic reference. It specifies where to find the component in which it appears within the immediately preceding component of a different level.
4311	COBICOL	, which was itself the second of four volumes published together under the title
4313	COBICOL	; this last title constituted the 38th volume in the series of
4350	COBICOL	In the following example, the article cited has been published twice, once in a journal (where it appeared in volume 40, on pages 3-46 of the issue of October 1986) and once as a free-standing item, which appeared as number 11 of a German-language series.
4407	COBICOL	The practice of analytic vs. monographic citation, as described here, should be distinguished from the practice of including within one citation a reference to another work, which the encoder considers to be related to it in some way: see further
4410	COBICOL	If an identifier is available for the analytic item, it should be represented by means of an
4414	COBICOL	element, as in the following example where a DOI (Digital Object Identifier) is supplied for the article in question.
4462	COBICOL	Punctuation must not appear between the elements within a structured bibliographic entry encoded with
4510	COBICOL	, with all the relevant data items marked up appropriately. This markup approach can provide easy rendering, if only one style guide is targeted, or an original source document uses a specific style guide, while still allowing for automated recovery of key data items such as names of authors, titles etc.
4519	COBICOR	Bibliographic references typically include the title of the work being cited and the names of those intellectually responsible for it. For articles in journals or collections, such statements should appear both for the analytic and for the monographic level. The following elements are provided for tagging such statements:
4545	COBICOR	are the default members of the
4553	COBICOR	In bibliographic references, all titles should be tagged as such, whether analytic, monographic, or series titles. The single element
4567	COBICOR	It is a semantic error to give a value for the
4571	COBICOR	value
4573	COBICOR	implies the analytic level; the values
4574	COBICOR	m
4578	COBICOR	u
4579	COBICOR	imply the monographic level; the value
4580	COBICOR	s
4581	COBICOR	implies the series level. Note, however, that the semantic error occurs only if the nested title is directly enclosed by the
4587	COBICOR	element; if it is enclosed only indirectly (i.e., nested more deeply), no semantic error need be present. For example, the analytic title may contain a monographic title, as in the following example:
4615	COBICOR	In this case, the analytic title
4622	COBICOR	element; the monographic title contained within it,
4632	COBICOR	The following reference, from a national standard for bibliographic references, illustrates this type of analysis with its distinction between main and subordinate titles. Note that this uses the more flexible
4636	COBICOR	element: consequently, there is no requirement to tag all the components of the reference (notably the authors).
4653	COBICOR	Slightly more complex is the distinction made below among main, subordinate, and parallel titles, in an example from the same source (p. 63). The punctuation and the bibliographic analysis are those given in ANSI Z39.29-1977; the punctuation is in the style prescribed by the International Standard Bibliographic Description (ISBD).
4654	COBICOR	The analysis is not wholly unproblematic: as the text of the standard points out, the first subordinate title is subordinate only to the parallel title in French, while the second is subordinate to both the English main title and the French parallel title, without this relationship being made clear, either in the markup given in the example or in the reference structure offered by the standard.
4659	COBICOR	, that specific punctuation may be included between the component elements of the reference.
4678	COBICOR	element should be used for the person or agency with primary responsibility for a work's intellectual content, and the element
4681	COBICOR	editor
4683	COBICOR	author
4684	COBICOR	of a broadcast, for example, while the author of a government report will usually be the agency which produced it. A translator, illustrator, or compiler, may however be marked by means of the
4690	COBICOR	Many bibliographic and Linked Data applications require disambiguation of author names using unique identifiers. Both the
4696	COBICOR	elements, to supply such identifiers. Alternatively, if only a single identifier is to be recorded, the
4735	COBICOR	element may also be used for editors, if it is desired to record the specific terms in which their role is described.
4749	COBICOR	element may also occur. When one of these elements precedes or immediately follows a title, it applies to that title; when it follows an
4751	COBICOR	element or occurs within an edition statement, it applies to the edition in question.
4797	COBICOR	This example retains the original punctuation and editorial conventions of the source (ISO 690: 1987) and is therefore encoded using the
4803	COBICOR	element applies to the edition, and not to the collection
4804	COBICOR	per se
4807	COBICOR	element, the component elements have been reordered from their appearance on the title page of the volume in order to ensure the correct relationship of the collection title, the edition statement, and the statement of responsibility.
4848	COBICOR	The party with a particular responsibility for the intellectual content may vary over time. Likewise, a given individual's responsibility or role may change over time. These situations may be recorded with the
4850	COBICOR	element. For example, the following could be used when one proofreader took over for another.
4868	COBICOR	Another form of
4870	COBICOR	arises when a work is published as the outcome of a conference, workshop or similar meeting. The
4932	COBICOD	identifiers of various types because they do not include a statement of the title and the names of those intellectually responsible for it. The following elements may be used for such purposes:
4940	COBICOD	For example, a citation to a patent typically includes a country or organization code (a two-character code identifying a patent authority) and a serial number for the patent (whose structure varies by patent authority). The citation might also contain a
4941	COBICOD	kind code
4942	COBICOD	(which characterizes a particular publication for the patent and which corresponds to a specific stage in the patent procedure) and the date when the patent was filed with or published by the issuing authority. For bibliographic references to patents, the above elements may be used as follows:
4947	COBICOD	, may be used to contain the code of the patent authority. The
4949	COBICOD	attribute may be used to specify the type of patent authority (such as a national patent office or a supra-national patent organization).
4952	COBICOD	may be used to contain the serial number assigned by the corresponding patent authority.
4955	COBICOD	may be used to contain the kind code of the patent document.
4958	COBICOD	may be used to contain the date of the patent document. The
4960	COBICOD	attribute may be used to specify whether this corresponds to the filing date of a patent application or the publication date of a patent publication.
4988	COBICOI	imprint
4989	COBICOI	is meant all the information relating to the publication of a work: the person or organization by whose authority and in whose name a bibliographic entity such as a book is made public or distributed (whether a commercial publisher or some other organization), the place and the date of publication. It may also include a full address for the publisher or organization. A full bibliographic reference will usually also specify the number of pages in a print publication (or equivalent information for non-print materials), and possibly also the specific location of the material being cited within its containing publication. The following elements are provided to hold this information:
4998	COBICOI	Members of the model classes
5004	COBICOI	element in a specific location within a
5014	COBICOI	For bibliographic purposes, usually only the place (or places) of publication are required, possibly including the name of the country, rather than a full address; the element
5016	COBICOI	is provided for this purpose. Where however the full postal address is likely to be of importance in identifying or locating the bibliographic item concerned, it may be supplied and tagged using the
5019	COBICOI	. Alternatively, if desired, the
5024	COBICOI	may be used; this involves no claim that the information given is either a full address or the name of a city.
5026	COBICOI	The name of the publisher of an item should be marked using the
5028	COBICOI	element even if the item is made public (
5030	COBICOI	) by an organization other than a conventional publisher, as is frequently the case with technical reports:
5094	COBICOI	When an item has been reprinted, especially reprinted without change from a specific earlier edition, the reprint may appear in a
5098	COBICOI	and other details of the reprint. In the following example, a microform reprint has been issued without any change in the title or authorship. The series statement here applies only to the second
5141	COBICOI	This encoding can be extended to the case of patent documents, where the same patent application is published, with or without changes, at different stages of the patenting procedure. In this case, the kind code and, optionally, the publication date characterize different publications of the same patent application during the procedure. For example:
5167	COBICOI	The above bibliographic reference discloses different publications of the patent EP1558513 during the patenting procedure. The first publication from 3 August 2005 has the kind code "A1" indicating that it is a published patent application comprising the European search report issued after carrying out the search at the European Patent Office, whereas the second publication from 9 September 2009 has the kind code "B1" indicating that it was published after the patent application has been granted.
5178	COBICOB	Many bibliographic citations contain data limiting the citation to one or more volumes, issues, or pages, or to a name or number of a subdivision of the host work. These come in two varieties:
5188	COBICOB	Where it is desired to distinguish different classes of such information (volume number, page number, chapter number, etc.), the
5310	COBICOB	On the other hand, a cited range encodes that the author
5312	COBICOB	defined by this range. For example, a footnote following a quotation from page 378 of
5360	COBICOS	element. The title of the series may be tagged
5361	COBICOS	title level="s"
5362	COBICOS	, the volume number
5363	COBICOS	biblScope unit="vol"
5364	COBICOS	, and responsibility statements for the series (e.g. the name and affiliation of the editor, as in the example in section
5369	COBICOS	. Any identifier associated with the series itself should be marked using the
5376	COBIRI	related item
5377	COBIRI	is any bibliographic item which, though related to that being defined, is distinct from it. The distinction between analytic and monographic items made above may be thought of as a special case of this kind of
5379	COBIRI	item. More usually however, the term is applied to such items as translations, continuations, different versions, parts, etc.
5389	COBIRI	describes a facsimile edition, and the second describes the work of which it is a facsimile. The relation between the facsimile and its source is represented by means of a
5439	COBIRI	may contain any form of bibliographic reference. For example, one of the examples quoted above might also be encoded as follows:
5484	COBIRI	attribute should be used to indicate the relationship between the bibliographic item and any
5526	COBIRI	In this example, a full bibliographic description of the edition used as source for the translation is provided within the content of the
5528	COBIRI	. Alternatively this might be provided by means of a link, in which case the
5547	COBICON	Explanatory notes about the publication of unusual items, the form of an item (e.g.
5551	COBICON	), or its provenance (e.g.
5555	COBICON	element. The same element may be used for any descriptive annotation of a bibliographic entry in a database.
5575	COBICON	This element can take the form of a simple note such as:
5581	COBICON	attribute to record the chief language of the bibliographic item, and optionally the
5593	COBICON	attributes should both provide language identifiers in the same form as used for
5596	COBICON	. Where additional detail is needed to describe a language correctly, or to discuss its deployment in a given text, this should be done using the
5598	COBICON	element in the TEI header, within which individual
5625	COBICOO	element, if it occurs, must come first, followed by one or more
5631	COBICOO	element comes first), and then zero or more of the following in any order:
5647	COBICOO	, the title(s), author(s), editor(s), and other statements of responsibility may appear in any order; it is recommended that all forms of the title be given together. Within
5649	COBICOO	, the author, editor, and statements of responsibility may either come first or else follow the monographic title(s). Following these, the elements listed below, if present, must appear in the following order:
5652	COBICOO	s on the publication (and
5654	COBICOO	elements describing the conference, in the case of a proceedings volume)
5674	COBICOO	, the sequence of elements is not constrained.
5688	COBIXR	). As discussed in that section, cross-referencing within TEI texts is in general represented by means of
5694	COBIXR	attribute on these elements is used to supply an identifying value for the target of the cross-reference, which should be, in the case of bibliographic elements, a bibliographic reference of some kind. Where the form of the reference itself is unimportant, or may be reconstructed mechanically, or is not to be encoded, the
5701	COBIXR	Where the form of the reference is important, or contains additional qualifying information which is to be kept but distinguished from the surrounding text, the
5707	COBIXR	It may be important to distinguish between the short form of a bibliographic reference and some qualifying or additional information. The latter should not appear within the scope of the
5709	COBIXR	element when this is the case, as for example in an application concerned to normalize bibliographic references:
5717	COBIXR	element may also be used to provide a reference to a copy of the bibliographic item itself, particularly if this is available online, as in the following example:
5753	COBIOT	The BibTeX scheme is intentionally compatible with that of Scribe, although it omits some fields used by Scribe. Hence only one list of fields is given here.
5756	COBIOT	address
5758	COBIOT	tag as
5765	COBIOT	tag as
5768	COBIOT	author
5770	COBIOT	tag as
5775	COBIOT	tag as
5776	COBIOT	title level="m"
5784	COBIOT	tag as
5785	COBIOT	biblScope unit="chap"
5787	COBIOT	date
5789	COBIOT	used only to record the date on which the entry was made in the bibliographic database; not supported
5791	COBIOT	edition
5793	COBIOT	tag as
5796	COBIOT	editor
5798	COBIOT	tag as
5805	COBIOT	tag as multiple
5829	COBIOT	name type="org"
5833	COBIOT	tag as
5835	COBIOT	, possibly using the form
5836	COBIOT	note place="inline"
5838	COBIOT	institution
5840	COBIOT	used only for issuer of technical reports; tag as
5845	COBIOT	tag as
5846	COBIOT	title level="j"
5854	COBIOT	used to specify an alternate sort key for the bibliographic item, for use instead of author's or editor's name; not supported
5856	COBIOT	meeting
5858	COBIOT	tag as
5867	COBIOT	; if the date is not in a trivially parseable form, use the
5872	COBIOT	note
5874	COBIOT	tag as
5877	COBIOT	number
5879	COBIOT	tag as
5880	COBIOT	biblScope unit="issue"
5882	COBIOT	biblScope unit="number"
5884	COBIOT	idno type="docno"
5888	COBIOT	used only for sponsor of conference; use
5889	COBIOT	name type="org"
5898	COBIOT	tag as
5899	COBIOT	biblScope unit="pp"
5901	COBIOT	publisher
5903	COBIOT	tag as
5908	COBIOT	used only for institutions at which thesis work is done; tag as
5911	COBIOT	series
5913	COBIOT	tag as
5914	COBIOT	title level="s"
5920	COBIOT	title
5922	COBIOT	tag as
5926	COBIOT	value
5930	COBIOT	tag as
5931	COBIOT	biblScope unit="vol"
5935	COBIOT	tag as
5937	COBIOT	; if the date is not in a trivially parseable form, use the
5945	CODV	The following elements are included in the core module for the convenience of those encoding texts which include mixtures of prose, verse and drama.
5948	CODV	Full details of other, more specialized, elements for the encoding of texts which are predominantly verse or drama are described in the appropriate chapter of part three (for verse, see the verse base described in chapter
5949	CODV	; for performance texts, see the drama base described in chapter
5950	CODV	). In this section, we describe only the elements listed above, all of which can appear in any text, whichever of the three modes prose, verse, or drama may predominate in it.
5954	COVE	Like other written texts, verse texts or poems may be hierarchically subdivided, for example into books or cantos. These structural subdivisions should be encoded using the general purpose
5960	COVE	. The fundamental unit of a verse text is the verse line rather than the paragraph, however.
5964	COVE	element is used to mark up verse lines, that is, metrical rather than typographic lines. In some modern or free verse, it may be hard to decide whether the typographic line is to be regarded as a verse line or not, but the distinction is quite clear for verse following regular metrical patterns. Where a metrical line is interrupted by a typographic line break, the encoder may choose to ignore the fact entirely or to use the empty
5967	COVE	. By convention, the start of a metrical line implies the start of a typographic line; hence there is no need to introduce an
5969	COVE	tag at the start of every
5971	COVE	element, but only at places where a new typographic line starts within a metrical line, as in the following example:
5986	COVE	In the original copy text, the presence of an ornamental capital at the start of the poem means that the measure is not wide enough to print the first four lines on four lines; instead each metrical line occupies two typographic lines, with a break at the point indicated. Note that this encoding makes no attempt to preserve information about the whitespace or indentation associated with either kind of line; if regarded as essential, this information would be recorded using the
5994	COVE	element should not be used to represent typographic lines in non-verse materials: if the line-breaking points in a prose text are considered important for analysis, they should be marked with the
5996	COVE	element. Alternatively, a neutral segmentation element such as
6011	COVE	In some verse forms, regular groupings of lines are regarded as units of some kind, often identified by a regular verse scheme. In stichic verse and couplets, groups of lines analogous to paragraphs are often indicated by indentation. In other verse forms, lines are grouped into irregular sequences indicated simply by whitespace. The
6013	COVE	or line group element may be used to mark any such grouping of elements from the
6020	COVE	which may be used to further categorize the line group where this is felt desirable, as in the following example. This example also demonstrates the
6022	COVE	attribute to indicate whether or not a line is indented.
6048	COVE	For some kinds of analysis, it may be useful to identify different kinds of line group within the same piece of verse. Such line groups may self-nest, in much the same way as the un-numbered
6093	COVE	It is often the case that verse line boundaries conflict with the boundaries of other structural elements. In the following example, the single verse line
6095	COVE	is interrupted by a stage direction:
6119	COVE	The same technique may be used where verse lines are collected together into units such as verse paragraphs:
6142	COVE	element to indicate that it is incomplete, for example because it forms part of a group that is divided between two speakers, as in the following example:
6164	COVE	For alternative methods of aligning groups of lines which do not form simple hierarchic groups, or which are discontinuous, see the more detailed discussion in chapter
6174	CODR	performance texts
6175	CODR	such as cinema or TV scripts are often hierarchically organized, for example into acts and scenes. These structural subdivisions should be encoded using the general purpose
6181	CODR	. Within these divisions, the body of a performance text typically consists of
6183	CODR	, often prefixed by a phrase indicating who is speaking, and occasionally interspersed with stage directions of various kinds.
6210	CODR	In the following example, each speech consists of a sequence of verse lines, some of them being marked as metrically incomplete:
6266	CODR	, the printed speaker attributions need to be supplemented by use of the
6312	CODR	By contrast with the preceding examples, the following encodes an early printed edition without making any assumption about which parts are prose or verse:
6354	CODR	elements should also be used to mark parts of a text otherwise in prose which are presented as if they were dialogue in a play. The following example is taken from a 19th century novel in which passages of narrative and passages of dialogue are mixed within the same chapter:
6401	core	Elements common to all TEI documents
6410	COOV	The selection and combination of modules to form a TEI schema is described in

WD-NonStandardCharacters.xml#12945

#	id	text
6	WD	introduced the fundamental notions of language identification and character representation in an encoded TEI document. In this chapter we discuss some additional issues relating to the way that written language is represented in a TEI document. In sections
8	WD	we introduce markup which may be used to represent and document non-standard characters, that is, written symbols for which no codepoint exists in Unicode. The same markup may be used to annotate existing characters according to their visual or other properties, and thus process them as distinct glyphs (see section
12	WD	we discuss ways of documenting the writing mode used in a source text, that is, the directionality of the script, the orientation of individual characters, and related questions.
16	WDNE	Despite the availability of Unicode, text encoders still sometimes find that the published repertoire of available characters is inadequate to their needs. This is particularly the case when dealing with ancient languages, for which encoding standards do not yet exist, or where an encoder wishes to represent variant forms of a character or
34	WDNE	, and the associated character code charts. Alternatively, users can check the latest published version of
38	WDNE	), though the web site is often more up to date than the printed version, and should be checked for preference.
42	WDNE	) in the Unicode code charts are only meant to be representative, not definitive. If a specific form of an already encoded character is required for a project, refer to the guidelines contained below under
44	WDNE	. Remember that your encoded document may be rendered on a system which has different fonts from yours: if the specific form of a character is important to you, then you should document it.
47	WDNE	) to see whether the character is in line for approval.
49	WDNE	Ask on the Unicode email list (
54	WDNE	Since there are now close to 100,000 characters in Unicode, chances are good that what you need is already there, but it might not be easy to find, since it might have a different name in Unicode. Look again, this time at other sites, for example
55	WDNE	, which also provide searches based on scripts and languages. Take care, however, that all the properties of what seems to be a relevant character are consistent with those of the character you are looking for. For example, if your character is definitely a digit, but the properties of the best match you can find for it say that it is a letter, you may have a character not yet defined in Unicode.
59	WDNE	However, if the character you are looking for is being used in a notation (rather than as part of the orthography of a language) then it is quite acceptable to select characters from the Mathematical Operators block, provided that they have the appropriate properties (i.e.
69	WDNE	If, however, no suitable form of your character seems to exist, the next question will be:
70	WDNE	Does the graphical unit in question represent a variant form of a known character, or does it represent a completely unencoded character?
74	WDNE	These guidelines will help you proceed once you have identified a given graphical unit as either a variant or an unencoded character. Determining this will require knowledge of the contents of the document that you have. The first case will be called
76	WDNE	of a character, while the second case will be called
82	WDNE	While there is some overlap between these requirements, distinct specialized markup constructs have been created for each of these cases. These constructs are presented in section
91	D25-20	numeric character reference
94	D25-20	(A-umlaut). The encoder can also restrict the range of characters which are represented directly in a document (or part of it) by adding a suitable encoding declaration. For example, if a document begins with the declaration
96	D25-20	any Unicode characters which are not in the ISO-8859-1 character set must be represented by NCRs.
99	D25-20	gaiji
104	D25-20	.) This allows the encoder to distinguish characters and glyphs which Unicode regards as identical, to add new nonstandard characters or glyphs, and to represent Unicode characters not available in the document encoding by an alternative means.
122	D25-20	When the gaiji module is included in a schema, the
130	D25-20	The Unicode standard defines properties for all the characters it defines in the Unicode Character Database, knowledge of which is usually built into text processing systems. If the character represented by the
132	D25-20	element does not exist in Unicode at all, its properties are not available. If the character represented is an existing Unicode character, but is not available in the document character set recognized by a given text processing system, it may also be convenient to have access to its properties in the same way. The
136	D25-20	The list of attributes (properties) for characters is modelled on those in the Unicode Character Database, which distinguishes
140	D25-20	character properties. Additional, non-Unicode, properties may also be supplied. Since the list of properties will vary with different versions of the Unicode Standard, there may not be an exact correspondence between them and the list of properties defined in these Guidelines.
144	D25-20	. The gaiji module itself is formally defined in section
145	D25-20	below. It declares the following additional elements:
155	D25-20	when this module is included in a schema. The
159	D25-20	: this class is referenced as an alternative to plain text in almost every element which contains plain text, thus permitting the
161	D25-20	element also to appear at such places when this module is included in a schema.
182	D25-20	element) by providing a specific glyph that shows how a character appeared in the original document. This is necessary since Unicode code points refer not to a single, specific glyph shape of a character, but rather to a set of glyphs, any of which may be used to render the code point in question; in some cases they can differ considerably.
186	D25-20	element is provided for cases where the encoder wants to specify a particular glyph (or family of glyphs) out of all possible glyphs. Unfortunately, due to the way Unicode has been defined, there are cases where several glyphs that logically belong together have been given separate code points, especially in the blocks defining East Asian characters. In such cases,
188	D25-20	elements can also be used to express the view that these apparently distinct characters are to be regarded as instances of the same character (see further
191	D25-20	The Unicode Standard recommends naming conventions which should be followed strictly where the intention is to annotate an existing Unicode character, and which may also be used as a model when creating new names for characters or glyphs
192	D25-20	It should be noted, however, that this naming convention cannot meaningfully be applied to East Asian characters; the typical Unicode descriptions for these characters take the form
197	D25-20	is simply the Unicode code point value of the character in question. In cases where no Unicode code point exists, there is little hope of finding a name that helps to identify the character. Names should therefore be constructed in a way meaningful to local practice, for example by using a reference number from a well-known character dictionary or a project-specific serial number.
198	D25-20	. For convenience of processing, the following distinct elements are proposed for naming characters and glyphs:
225	D25-20	) are defined by other TEI modules, and their usage here is no different from their usage elsewhere. The
227	D25-20	element, however, is used here only to link to an image of the character or glyph under discussion, or to contain a representation of it in SVG. The
239	D25-20	element is similar to the standard TEI
241	D25-20	element. While the latter is used to express correspondence relationships between TEI concepts or elements and those in other systems or ontologies, the former is used to express any kind of relationship between the character or glyph under discussion and characters or glyphs defined elsewhere. It may contain any Unicode character, or a
276	D25-20	The mapping element may also be used to represent a mapping of the character or (more likely) glyph under discussion onto a character from the private use area as in this example:
289	D25-20	A more precise documentation of the properties of any character or glyph may be supplied using the generic
297	ucsprops	characters, defined by reference to a number of
299	ucsprops	(or attribute-value pairs) which they are said to possess. For example, a lowercase letter is said to have the value
305	ucsprops	properties (i.e. properties which form part of the definition of a given character), and
308	ucsprops	additional
330	ucsprops	For convenience, we list here some of the normative character properties and their values. For full information, refer to chapter 4 of
336	ucsprops	The general category (described in the Unicode Standard chapter 4 section 5) is an assignment to some major classes and subclasses of characters. Suggested values for this property are listed here:
384	ucsprops	Punctuation, initial quote
387	ucsprops	Punctuation, final quote
405	ucsprops	Separator, space
408	ucsprops	Separator, line
432	ucsprops	This property applies to all Unicode characters. It governs the application of the algorithm for bi-directional behaviour, as further specified in Unicode Annex 9,
518	ucsprops	Start of fixed position classes
521	ucsprops	End of fixed position classes
583	ucsprops	This property is defined for characters which may be decomposed, for example to a canonical form plus a typographic variation of some kind. For such characters the Unicode standard specifies both a decomposition type and a decomposition mapping (i.e. another Unicode character to which this one may be mapped in the way specified by the decomposition type). The following types of mapping are defined in the Unicode Standard:
589	ucsprops	A no-break version of a space or hyphen
592	ucsprops	An initial presentation form (Arabic)
595	ucsprops	A medial presentation form (Arabic)
598	ucsprops	A final presentation form (Arabic)
601	ucsprops	An isolated presentation form (Arabic)
604	ucsprops	An encircled form
607	ucsprops	A superscript form
610	ucsprops	A subscript form
613	ucsprops	A vertical layout presentation form
622	ucsprops	A small variant form (CNS compatibility)
628	ucsprops	A vulgar fraction form
637	ucsprops	This property applies for any character which expresses any kind of numeric value. Its value is the intended value in decimal notation.
643	ucsprops	independent of the text direction: it has the value
650	ucsprops	The Unicode Standard also defines a set of informative (but non-normative) properties for Unicode characters. If encoders want to provide such properties, they may be included using the suggested Unicode name, tagged using the
654	ucsprops	element to distinguish them. If a Unicode name exists for a given property, it should however always be preferred to a locally defined name. Locally defined names should be used only for properties which are not specified by the Unicode Standard.
661	D25-30	Annotation of a character becomes necessary when it is desired to distinguish it on the basis of certain aspects (typically, its graphical appearance) only. In a manuscript, for example, where distinctly different forms of the letter "r" can be recognized, it might be useful to distinguish them for analytic purposes, quite distinct from the need to provide an accurate representation of the page. A digital facsimile, particularly one linked to a transcribed and encoded version of the text, will always provide a superior visual representation (for information on how to link a digital facsimile to a transcribed text see
662	D25-30	), but cannot be used to support arguments based on the distribution of such different forms. Character annotation as described here provides a solution to this problem.
663	D25-30	It should be kept in mind that any kind of text encoding is an abstraction and an interpretation of the text at hand, which will not necessarily be useful in reproducing an exact facsimile of the appearance of a manuscript.
666	D25-30	Assuming that we wish to distinguish the variant glyphs from the standard representation for the character concerned, we will need to define distinct
693	D25-30	With these definitions in place, occurrences of these two special "r"s in the text can be annotated using the element
708	D25-30	element will be interpreted as an annotation on the content of the element
734	D25-30	ligature; the encoder may however prefer not to use it in order to simplify other text processing operations, such as indexing).
745	D25-30	which would enable the same material to be encoded as follows:
749	D25-30	The same technique may be used to represent particular abbreviation marks as well as to represent other characters or glyphs. For example, if we believe that the r-with-one-funny-stroke is being used as an abbreviation for
755	D25-30	Note however that this technique employs markup objects to provide a link between a character in the document and some annotation on that character. Therefore, it cannot be used in places where such markup constructs are not allowed, notably in attribute values.
757	D25-30	Since the need to use these constructs to annotate or define characters occurs frequently in Chinese, Korean, and Japanese documents, here are some issues that are specific to these documents. There are two slightly different versions of the problem. In the first case, due to the way Unicode is defined, there are occasions when more than one glyph is defined for a character. In such a case, one might want to retain the character as used, but add information in such a way that a normalizer (for search or indexing operations) could take advantage of this information. To achieve this, we simply define within a
777	D25-30	, simply maps our glyph to the code point where Unicode defined it. The other one, of type
779	D25-30	, encodes the fact that in our view, this glyph is a variation of the standard character given in the content of the element. We could then use this
783	D25-30	to refer to it from within a text as follows.
789	D25-30	A slightly different, but related problem occurs when we have multiple variants, none of which has been defined in Unicode. In this case, we need to define one as a new character using
808	D25-30	element then defines a variant glyph of this newly defined character. Additional properties should be specified in order to make these both identifiable.
814	D25-40	The creation of additional characters for use in text encoding is quite similar to the annotation of existing characters. The same element
816	D25-40	is used to provide a link from the character instance in the text to a character definition provided within the
818	D25-40	element. This character definition takes the form of a
822	D25-40	itself will usually be empty, but could contain a code point from the Private Use Area (PUA) of the Unicode Standard, which is an area set aside for the very purpose of privately adding new characters to a document. Recommendations on how to use such PUA characters are given in the following section.
824	D25-40	In some circumstances, it may be desirable to provide a single precomposed form of a character that is encoded in Unicode only as a sequence of code points. For example, in Medieval Nordic material, a character looking like a lowercase letter Y with a dot and an acute-accent above it may be encountered so frequently that the encoder wishes to treat it as a single precomposed character with one single coded value. In the transcription concerned, the encoder enters this letter as
826	D25-40	, which when the transcription is processed can then be expanded in one of three ways, depending on the mapping in force. The entity reference might be translated into the sequence of corresponding Unicode code points or into some locally-defined PUA character (say
828	D25-40	) for local processing only. Both these options have disadvantages; the former loses the fact that the sequence of composed characters is regarded as a single object; the second is not reliably portable. Therefore, the recommended representation is to use the
831	D25-40	. This makes it possible for the encoder to provide useful documentation for the particular character or glyph so referenced:
845	D25-40	This definition specifies the mapping between this composed character and the individual Unicode-defined code points which make it up. It also supplies a single locally-defined property (
847	D25-40	) for the character concerned, the purpose of which is to supply a recommended character entity name for the character.
849	D25-40	Under certain circumstances, Chinese Han characters can be written within a circle. Rather than considering this as simply an aspect of the rendering, an encoder may wish to treat such circled characters as entirely distinct derived characters. For a given character (say that represented by the numeric-character reference
880	D25-40	. The two mappings indicate firstly that the standard form of this character is the character
884	D25-40	. For convenience of local processing this PUA character may in fact appear as content of the
894	D25-50	The developers of the Unicode Standard have set aside an area of the codespace for the private use of software vendors, user groups, or individuals. As of this writing (Unicode 5.0), there are around 137,000 code points available in this area, which should be enough for most needs. No code point assignments will be made to this area by standards bodies and only some very basic default properties have been assigned (which may be overridden where necessary by the mechanism outlined in this chapter). Therefore, unlike all other code points defined by the Unicode Standard, PUA code points should
898	D25-50	In the two previous examples, we mentioned that the variant characters concerned might well be assigned specific code points from the PUA. This might, for example, facilitate the use of a particular font which displays the desired character at this code point in the local processing environment. Since however this assignment would be valid only on the local site, documents containing such code points are unsuitable for blind interchange. During the process of preparing such documents for interchange, any PUA code points should be replaced by an appropriate use of the
901	D25-50	g ref="#xxxx"
907	D25-50	, or retained as content of the
909	D25-50	element. However, since there is no requirement that the same PUA character be used to represent it at the receiving site, and since it may well be the case that this other site has already made an assignment of some other character to the original PUA code point, it is best practice to remove the locally-defined PUA character. It is to be expected that a further translation into the local processing environment at the receiving site will be necessary to handle such characters, during which variant letters can be converted to hitherto unused code points on the basis of the information provided in the
913	D25-50	This mechanism is rather weak in cases where DOM trees or parsed XML fragments are exchanged, which may increasingly be the case. The best an application can do here is to treat any occurrence of a PUA character only in the context of the local document and use the properties provided through the
917	D25-50	In the fullness of time, a character may become standardized, and thus assigned a specific code point outside the PUA. Documents which have been encoded using the mechanism must at the least ensure that this changed code point is recorded within the relevant
929	WDWM	The scripts used for writing human languages vary not only in the glyphs they use, but also in the way (or ways) that those glyphs are arranged on the writing surface. For the majority of modern languages, writing is arranged as a series of lines which are to be read from top to bottom. Within each line, individual characters are frequently presented from left to right (English, Russian, Greek), but there are also several widely-used scripts which run right-to-left (Arabic, Hebrew). Writing in which the lines of glyphs are presented vertically and read from right to left is also often encountered, notably in older East Asian scripts (Sinitic characters, Japanese Kana, Korean Hangul, Vietnamese chữ nôm). In many cases, a language normally uses the same
930	WDWM	writing mode
931	WDWM	(we use this term to refer to the orientation of individual glyphs within a line and the order in which glyphs and lines should be read), but there are exceptions in which the same language may appear in different modes, for example either vertically or horizontally. Many East Asian scripts were traditionally written from top to bottom within the line, with their lines sequenced from right to left. Although modern Japanese, Chinese, and Korean are often written horizontally, the traditional vertical writing mode is still widely used. There are also comparatively rare cases of ancient scripts written with lines running left to right, each line being read top to bottom (Ancient Uighur, classical Mongolian and Manchu), or scripts such as Ogham where the writing direction may start from the bottom left and run around the edge of an inscribed object.
933	WDWM	When different languages are combined, it is possible that different writing modes will be needed: for example, in Hebrew text, running right to left, sequences of Latin digits still run left to right. When different writing modes are available for the same language, it may be that different glyphs will be preferred when the script is used in different modes. For example, when Japanese is written horizontally, the Unicode character U+3001, the
935	WDWM	, is used in preference to Unicode character U+FE11, the vertical mode comma. This ensures that the comma appears in the correct position relative to the surrounding glyphs. Even for scripts which are usually written in exactly the same way, different writing modes may be encountered in particular contexts; for example when a language using Roman script is embedded within vertically-organized Chinese text, it may sometimes be displayed vertically and sometimes horizontally. The writing mode may also vary in response to layout constraints such as those imposed by a complex table, where column or row labels may be written vertically or diagonally to make the most effective use of available space, just as it may vary in response to the size and shape of the carrier in the case of a monumental inscription.
937	WDWM	For many, perhaps most, TEI documents there may be no need to encode the writing mode explicitly, even in so-called "mixed mode" texts containing passages written in languages which use different writing modes. Modern printed texts in most European languages, for instance, may be expected to use left-to-right/top-to-bottom directionality; while Arabic or Hebrew texts are expected to run right-to-left/top-to-bottom. In a TEI document, language and script are explicitly stated in the markup using the attribute
939	WDWM	; this indication will usually imply a particular default writing mode. Even where this attribute is not used, passages in different scripts will use different Unicode characters, and will thus imply a particular default writing mode.
941	WDWM	Consider the case of an English text containing a few Arabic words:
943	WDWM	The Arabic term قلم رصاص means "pencil".
945	WDWM	A correct TEI encoding might read as follows:
954	WDWM	attribute with value
956	WDWM	that causes processing software to display the Arabic from right to left, but in fact, this is not the case. The order in which the Arabic characters appear when rendered would be the same, even if the markup were not present:
961	WDWM	This is because Arabic glyphs are always displayed right to left, even when they appear within a left-to-right English sentence. Like most other codepoints in the Unicode standard, they have a specific directionality setting which helps any rendering software determine how they should be ordered. The Latin glyph "a" has a strong left-to-right bidirectionality setting, as do the digits 0 to 9; the Hebrew א (alef) is strongly right-to-left. Of course, some glyphs (common punctuation marks such as the period or comma for example) have weak or neutral settings because they may appear in several contexts.
965	WDWM	) defines a number of rules enabling software to render sequences of characters which have differing directionality properties in a predictable and reliable way, using only those properties.
966	WDWM	Because this algorithm may not always give the desired result, Unicode also provides a set of "directional formatting characters" (
967	WDWM	). These additional codepoints can be used to signal to rendering software that a specific directionality setting should be turned on or off. However, in the case of documents encoded in XML, there is no need to use such characters, and in fact the W3C explicitly advises against it. "In (X)HTML and XML do not use the paired Unicode bidi formatting code characters where equivalent markup is available." (
969	WDWM	. It should be remembered however that individual sequences of characters are always stored in a file in the order in which they should be read, irrespective of the order in which the characters making up a sequence should be displayed or rendered. For example, in an RTL language such as Hebrew, the first character in a file will be that which is displayed at the rightmost end of the first line of text.
971	WDWM	An encoder wishing to document or to control the order in which sequences of characters in a TEI document are displayed will usually do so by segmenting the text into sequences presented in the desired order and specifying an appropriate language code for each. In situations where this approach may result in ambiguity or lack of precision, or if the encoder wishes to record directional information explicitly in their encoding, we recommend using the global @style attribute to supply detail about the writing mode applicable to the content of any element. The
975	WDWM	At the time of writing, this W3C module has the status of a candidate recommendation: see further
978	WDWM	which permits direct specification of a number of useful properties associated with writing modes, notably
1004	WDWM	The global TEI
1010	WDWM	and then point to them using the global
1013	WDWM	. Although the CSS specifications are mainly used to provide instructions for software when rendering a digital text, they also provide a useful means of describing the visual properties of a pre-existing document in a formal and standardized way.
1015	WDWM	The next section presents some examples of how CSS can be used to describe a variety of writing modes. A full description of the appearance of a document will probably include many other properties of course.
1021	WDWMEG	The CSS recommendations provide several properties which can be used to encode aspects of the "writing mode". The most useful of these is the property "writing-mode" which may be used to specify a reading-order for both characters within a single line and lines within a single block of text. The property "text-orientation" may also be used to indicate the orientation of individual characters with respect to the line, and the property "direction" to determine the reading order of characters within a line only. We give some examples of each below.
1028	WDWMEG1	property is particularly useful for languages which can be written in different writing modes, such as Chinese and Japanese. Its possible values include
1034	WDWMEG1	. Each value has two components:
1038	WDWMEG1	specifies the inline writing direction, while the second component specifies the direction in which lines in a block, and blocks in a sequence are arranged: from top to bottom (as in most European languages, in which lines and paragraphs are arranged from top to bottom on a page), from right to left (as in the case of Japanese written vertically), or left-to-right (as in the case of Mongolian).
1088	WDWMEG1	to supply a value of
1092	WDWMEG1	attribute specifies a horizontal writing mode; this may seem superfluous, but vertically-written romaji is not unknown.
1098	WDWMEG2	When Japanese is written vertically, the glyph orientation remains the same as when it is written horizontally. In other words, glyphs are not rotated (although as noted above some different glyphs may be used for some characters, in particular for punctuation which needs to be positioned differently in vertical and in horizontal text). However, it is very common for languages written vertically to have embedded runs of text from languages which are normally written horizontally. This raises the issue of the orientation of the glyphs from the horizontal language. Are they written upright, as they would normally appear in horizontal text runs, or are they rotated? Consider this fragment from a Japanese article about the Indonesian language, which takes the form of a glossary list:
1108	WDWMEG2	The text-orientation property allows us to indicate whether or not glyphs are rotated. In the following example, we have indicated that the list uses a
1110	WDWMEG2	writing mode, but that the orientation of individual glyphs may vary:
1126	WDWMEG2	characters from horizontal-only scripts are set sideways, i.e. 90° clockwise from their standard orientation in horizontal text. Characters from vertical scripts are set with their intrinsic orientation
1129	WDWMEG2	). Since the default value for
1133	WDWMEG2	, this rule is not strictly required. However, if the Indonesian glyphs (which are roman characters) had been set vertically, like this:
1142	WDWMEG2	then an encoding like the following could be used to make this explicit:
1158	WDWMEG2	characters from horizontal-only scripts are rendered upright, i.e. in their standard horizontal orientation. Characters from vertical scripts are set with their intrinsic orientation and shaped normally
1169	WDWMEG3	It is not unusual to see text from horizontal languages written vertically even where no vertically-written script is involved. This example is a fragment from a table of information about agricultural development on Vancouver Island, written in 1855:
1180	WDWMEG3	Four of the subheading cells in this fragment contain English text written vertically, bottom-to-top, to conserve space on the page. To describe this sort of phenomenon, we can use the
1189	WDWMEG3	causes text to be set as if in a horizontal layout, but rotated 90° counter-clockwise.
1190	WDWMEG3	We might encode the third of the four cells containing vertical text like this:
1200	WDWMEG3	property captures the fact that the script is written vertically, and its lines are to be read from left to right (so the line containing
1203	WDWMEG3	Cash value
1206	WDWMEG3	value encodes the orientation (rotated 90° counter-clockwise). We might also add
1208	WDWMEG3	to the style, to express the fact that the text is centrally-aligned.
1214	WDWMEG4	Of the rather small number of scripts which appear to be written bottom-to-top, perhaps the best-known is Ogham, an alphabet used mainly to write Archaic Irish. Ogham is typically found inscribed along the edge of a standing stone, starting at its base. The CSS Writing Modes specification does not explicitly distinguish between vertical scripts which are written top-to-bottom and those which are written bottom-to-top. Instead, such bottom-to-top scripts are best treated as left-to-right horizontal scripts, oriented vertically because of the constraints of the medium on which they are inscribed. Such scripts are analogous to the vertical English text-runs in the table cells in the example above, and can be handled in exactly the same manner (
1216	WDWMEG4	). In cases where writing follows a curved path (such as Ogham running around the edge of a stone), a meticulous encoder might resort to the use of SVG to describe the path, rather than treating the phenomenon as a writing mode.
1225	WDWMEG5	The Arabic term قلم رصاص means "pencil".
1238	WDWMEG5	property to record the observed directionality of the text is unambiguous, even though it is (as we noted above) superfluous. The use of the
1240	WDWMEG5	property here may require some explanation. By default this property has the value
1242	WDWMEG5	, the effect of which in this context would be to ignore any value supplied for the direction property. The CSS Writing Modes specification stipulates that the direction property
1243	WDWMEG5	has no effect on bidi reordering when specified on inline boxes whose
1245	WDWMEG5	property’s value is
1247	WDWMEG5	, because the element does not open an additional level of embedding with respect to the bidirectional algorithm.
1250	WDWMEG5	Mixed horizontal directionality is very common in languages such as Arabic and Hebrew, particularly when numbers (which are always given LTR) or phrases from LTR languages are embedded. It is not impossible, though quite unusual, for ambiguities to arise in such situations, which may give rise to the parts of a document being displayed in unexpected ways that do not correspond to the natural reading order. A more detailed discussion of this issue from an HTML perspective is provided by a W3C Internationalization Working Group report
1251	WDWMEG5	Inline markup and bidirectional text in HTML
1260	WDWMEG	For most texts, information about text directionality need not be explicitly encoded in a TEI text, either because it follows unambiguously from
1262	WDWMEG	values, or because it can be expected to be handled unequivocally by the Unicode Bidi Algorithm. Where it is considered important to encode such information, properties and values taken from the CSS Writing Modes module may be used by means of the global TEI
1264	WDWMEG	attribute (or using the TEI
1275	WDWMTT	In what follows, we examine a range of textual phenomena which in some ways appear very similar to those examined above, and even overlap with them. We can categorize these as text transformation features, and suggest some strategies for encoding them based on the properties detailed in the
1286	WDWMTT	Here a block of text has been rotated around its z-axis. This is clearly not a
1287	WDWMTT	writing mode
1288	WDWMTT	; the writing mode for this text is horizontal, left to right. Furthermore, even if we wished to treat this as a writing mode, we could not do so, because there is no way to use writing modes properties to describe a text orientation which is angled at 45 degrees; no human languages are consistently written in this orientation. It is more appropriate to treat this as a rotational transformation. We can do this using two properties:
1292	WDWMTT	. (Both of these properties have quite complex value sets, and we will not look at all of them here. See the
1298	WDWMTT	property takes as its value one or more of the transform functions, one of which is the function
1304	WDWMTT	Any rotation must take place clockwise around an axis positioned relative to the element being rotated, and the
1306	WDWMTT	property can be used to specify the pivot point. By default, the value of
1310	WDWMTT	, the point at the centre of the element, but these values can be changed to reflect rotation around a different origin point. (The TEI
1316	WDWMTT	A block of text may also be rotated about either of its other axes. For example, this shows rotation around the Y (vertical) axis:
1330	WDWMTT	which are both normally printed in a rotated form so that they represent a pair of wings:
1351	WDWMTT	We might also argue that this is in fact a vertical writing mode by supplying
1353	WDWMTT	as the value for the
1357	WDWMTT	Rotation is also useful as a method of handling a true writing mode which is not covered by the CSS Writing Modes:
1359	WDWMTT	. This is a writing mode common in inscriptions in Latin, Greek and other languages, in which alternate lines run from left to right and from right to left
1360	WDWMTT	The name is taken from the Greek βουστροφηδόν, meaning
1364	WDWMTT	); that is, turning as an ox does when pulling a plough.
1366	WDWMTT	mirror writing
1389	WDWMTT	The 180-degree rotation around the Y (vertical) axis here describes what is happening in the RTL line in boustrophedon; the order of glyphs is reversed, and so is their individual orientation (in fact, we see them
1390	WDWMTT	from the back
1395	WDWMTT	in the sense of poetic lines; the text is continuous prose, and linebreaks are incidental.
1397	WDWMTT	There are obviously some unsatisfactory aspects of this manner of encoding boustrophedon. In the inscription above, some words run across linebreaks, so if we wished to tag both words and the right-to-left phenomena, one hierarchy would have to be privileged over the other. By using a transform function rather than a writing mode property, we are apparently suggesting that boustrophedon is not in fact a writing mode, whereas it clearly is. But the CSS Writing Modes specification does not provide support for boustrophedon, because it is a rather obscure historical phenomenon; using a rotational transform is one practical alternative.
1405	WDCAV	; the language is designed to describe how an HTML document should be formatted. This is not, of course, the case for the TEI, which lacks any explicit processing or formatting model, and attempts to define objects as far as possible without consideration of their visual appearance. As long as the properties and values from the CSS Transforms module are used as a convenient, well-specified descriptive language to capture features of a text, without any expectation of using them directly and reliably for rendering, this is not particularly problematic. CSS provides a useful and well-defined vocabulary to describe many aspects of the appearance of source texts, benefitting particularly from the clarity of definition provided by the specification. However, if there is any expectation of using this information to render a text in a predictable and accurate way, it will be essential to provide enough styling information throughout the document hierarchy to resolve all ambiguities with regard to size, positioning, block status, etc. before any element undergoes a transform operation.
1410	WSD-DEF	The gaiji module described in this chapter makes available the following components:
1413	gaiji	Character and glyph documentation
1422	WSD-DEF	The selection and combination of modules to form a TEI schema is described in
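
The remarks in the rows above on the directionality settings of Unicode characters can be checked informally. What follows is only a minimal sketch, not part of the TEI Guidelines or of the tables in this report; it assumes Python 3 and its standard unicodedata module, and the helper name bidi_class_of is hypothetical.

```python
import unicodedata


def bidi_class_of(char):
    """Return the Unicode bidirectional class recorded for a single character."""
    return unicodedata.bidirectional(char)


# Sample characters discussed in the text: Latin "a", Hebrew alef,
# the first letter of the Arabic example, and the period.
for sample in ["a", "\u05d0", "\u0642", "."]:
    print(f"U+{ord(sample):04X}", bidi_class_of(sample))
```

With the Unicode data shipped in current Python releases, this reports L for the Latin letter, R for alef, AL for the Arabic letter, and CS (one of the weak types) for the period.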

TS-TranscriptionsofSpeech.xml#12961

#	id	text
4	TS	The module described in this chapter is intended for use with a wide variety of transcribed spoken material. It should be stressed, however, that the present proposals are not intended to support unmodified every variety of research undertaken upon spoken material now or in the future; some discourse analysts, some phonologists, and doubtless others may wish to extend the scheme presented here to express more precisely the set of distinctions they wish to draw in their transcriptions. Speech regarded as a purely acoustic phenomenon may well require different methods from those outlined here, as may speech regarded solely as a process of social interaction.
6	TS	This chapter begins with a discussion of some of the problems commonly encountered in transcribing spoken language (section
8	TS	documents some additional TEI header elements which may be used to document the recording or other source from which transcribed text is taken. Section
10	TS	of this chapter reviews further problems specific to the encoding of spoken language, demonstrating how mechanisms and elements discussed elsewhere in these Guidelines may be applied to them.
21	TSOV	of speech. Speech varies according to a large number of dimensions, many of which have no counterpart in writing (for example, tempo, loudness, pitch, etc.). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. Spoken material may be transcribed in the course of linguistic, acoustic, anthropological, psychological, ethnographic, journalistic, or many other types of research. Even in the same field, the interests and theoretical perspectives of different transcribers may lead them to prefer different levels of detail in the transcript and different styles of visual display. The production and comprehension of speech are intimately bound up with the situation in which speech occurs, far more so than is the case for written texts. A speech transcript must therefore include some contextual features; determining which are relevant is not always simple. Moreover, the ethical problems in recording and making public what was produced in a private setting and intended for a limited audience are more frequently encountered in dealing with spoken texts than with written ones.
23	TSOV	Speech also poses difficult structural problems. Unlike a written text, a speech event takes place in time. Its beginning and end may be hard to determine and its internal composition difficult to define. Most researchers agree that the utterances or
25	TSOV	of individual speakers form an important structural component in most kinds of speech, but these are rarely as well-behaved (in the structural sense) as paragraphs or other analogous units in written texts: speakers frequently interrupt each other, use gestures as well as words, leave remarks unfinished and so on. Speech itself, though it may be represented as words, frequently contains items such as vocalized pauses which, although only semi-lexical, have immense importance in the analysis of spoken text. Even non-vocal elements such as gestures may be regarded as forming a component of spoken text for some analytic purposes. Below the level of the individual utterance, speech may be segmented into units defined by phonological, prosodic, or syntactic phenomena; no clear agreement exists, however, even as to appropriate names for such segments.
27	TSOV	Spoken texts transcribed according to the guidelines presented here are organized as follows. The overall structure of a TEI spoken text is identical to that of any other TEI text: the
29	TSOV	element for a spoken text contains a
33	TSOV	element. Even texts primarily composed of transcribed speech may also include conventional front and back matter, and may even be organized into divisions like printed texts.
39	TSOV	as organizing unit for spoken material
40	TSOV	A spoken
42	TSOV	might typically be a conversation between a small number of people, a lecture, a broadcast TV item, or a similar event. Each such unit has associated with it a
44	TSOV	providing detailed contextual information such as the source of the transcript, the identity of the participants, whether the speech is scripted or spontaneous, the physical and social setting in which the discourse takes place and a range of other aspects. Details of the header in general are provided in chapter
45	TSOV	; the particular elements it provides for use with spoken texts are described below (
46	TSOV	). Details concerning additional elements which may be used for the documentation of participant and contextual information are given in
49	TSOV	Defining the bounds of a spoken text is frequently a matter of arbitrary convention or convenience. In public or semi-public contexts, a text may be regarded as synonymous with, for example, a
52	TSOV	broadcast item
54	TSOV	meeting
55	TSOV	, etc. In informal or private contexts, a text may be simply a conversation involving a specific group of participants. Alternatively, researchers may elect to define spoken texts solely in terms of their duration in time or length in words. By default, these Guidelines assume of a text only that:
61	TSOV	it represents a single stretch of time with no significant discontinuities.
66	TSOV	element may take the value
68	TSOV	to specify that the components of the text are discrete) but is not recommended.
72	TSOV	it may be necessary to identify subdivisions of various kinds, if only for convenience of handling. The neutral
79	TSOV	A spoken text may contain any of the following components:
87	TSOV	kinesic (non-verbal, non-lexical) phenomena such as gestures
91	TSOV	writing, regarded as a special class of incident in that it can be transcribed, for example captions or overheads displayed during a lecture
93	TSOV	shifts or changes in vocal quality
96	TSOV	Elements to represent all of these features of spoken language are discussed in section
101	TSOV	) may contain lexical items interspersed with pauses and non-lexical vocal sounds; during an utterance, non-linguistic incidents may occur and written materials may be presented. The
107	TSOV	A spoken text itself may be without substructure, that is, it may consist simply of units such as utterances or pauses, not grouped together in any way, or it may be subdivided. If the notion of what constitutes a
108	TSOV	text
109	TSOV	in spoken discourse is inevitably rather an arbitrary one, the notion of formal subdivisions within such a
110	TSOV	text
112	TSOV	text
119	TSOV	, provided only that the set of all such divisions is coextensive with the text.
121	TSOV	Each such division of a spoken text should be represented by the numbered or unnumbered
124	TSOV	. For some detailed kinds of analysis a hierarchy of such divisions may be found useful; nested
126	TSOV	elements may be used for this purpose, as in the following example showing how a collection made up of transcribed
127	TSOV	sound bites
128	TSOV	taken from speeches given by a politician on different occasions might be encoded. Each extract is regarded as a distinct
148	TSOV	attribute, for use where the divisions of a text do not all share the same set of the contextual declarations specified in the TEI header. (See further section
154	HD32	Where a computer file is derived from a spoken text rather than a written one, it will usually be desirable to record additional information about the recording or broadcast which constitutes its source. Several additional elements are provided for this purpose within the source description component of the TEI header:
168	HD32	Note that detailed information about the participants or setting of an interview or other transcript of spoken language should be recorded in the appropriate division of the profile description, discussed in chapter
169	HD32	, rather than as part of the source description. The source description is used to hold information only about the source from which the transcribed speech was taken, for example, any script being read and any technical details of how the recording was produced. If the source was a previously-created transcript, it should be treated in the same way as any other source text.
173	HD32	element should be used where it is known that one or more of the participants in a spoken text is speaking from a previously prepared script. The script itself should be documented in the same way as any other written text, using one of the three citation tags mentioned above. Utterances or groups of utterances may be linked to the script concerned by means of the
192	HD32	is used to group together information relating to the recordings from which the spoken text was transcribed. The element may contain either a prose description or, more helpfully, one or more
194	HD32	elements, each corresponding with a particular recording. The linkage between utterances or groups of utterances and the relevant recording statement is made by means of the
201	HD32	element should be used to provide a description of how and by whom a recording was made. This information may be provided in the form of a prose description, within which such items as statements of responsibility, names, places, and dates may be identified using the appropriate phrase-level tags. Alternatively, a selection of elements from the
212	HD32	Specialized collections may wish to add further sub-elements to these major components. These elements should be used only for information relating to the recording process itself; information about the setting or participants (for example) is recorded elsewhere: see sections
251	HD32	When a recording has been made from a public broadcast, details of the broadcast itself should be supplied within the
255	HD32	element. A broadcast is closely analogous to a publication and the
263	HD32	. The broadcasting agency responsible for a broadcast is regarded as its author, while other participants (for example interviewers, interviewees, script writers, directors, producers, etc.) should be specified using the
294	HD32	When a broadcast contains several distinct recordings (for example a compilation), additional
318	TSBA	The following elements characterize spoken texts, transcribed according to these Guidelines:
323	TSBA	element may appear directly within a spoken text, and may contain any of the others; the others may also appear directly (for example, a
327	TSBA	element. In terms of the basic TEI model, therefore, we regard the
367	TSBA	(for sounds produced by the human vocal apparatus), and
377	TSBA	incident
383	TSBA	kinesic
389	TSBA	vocal
406	TSBA	vocal events
408	TSBA	usually involuntary noises. Equally, the distinction between utterances and vocals is not always clear, although for many analytic purposes it will be convenient to regard them as distinct. Individual scholars may differ in the way borderlines are drawn and should declare their definitions in the
410	TSBA	element of the header (see
413	TSBA	The following short extract exemplifies several of these elements. It is recoded from a text originally transcribed in the CHILDES format.
424	TSBA	). Non-verbal vocal effects such as the child's meowing are indicated either with orthographic transcriptions or with the
426	TSBA	element, and entirely non-linguistic but significant incidents such as the sound of the toy cat are represented by the
470	TSBA	This example also uses some elements common to all TEI texts, notably the
472	TSBA	tag for editorial regularization. Unusually stressed syllables have been encoded with the
479	TSBA	Contextual information is of particular importance in spoken texts, and should be provided by the TEI header of a text. In general, all of the information in a header is understood to be relevant to the whole of the associated text. The element
490	TSBAUT	Each distinct
492	TSBAUT	in a spoken text is represented by a
500	TSBAUT	attribute to associate the utterance with a particular speaker is recommended but not required. Its use implies as a further requirement that all speakers be identified by a
504	TSBAUT	element in the TEI header (see section
505	TSBAUT	), but it may also point to another external source of information about the speaker. Where utterances or other parts of the transcription cannot be attributed with confidence to any particular participant or group of participants, the encoder may choose to create
513	TSBAUT	, and perhaps give the root
517	TSBAUT	value of
519	TSBAUT	, then point to those as appropriate using
526	TSBAUT	. The value specified applies to the transition from the preceding utterance into the utterance bearing the attribute. For example:
527	TSBAUT	For the most part, the examples in this chapter use no sentence punctuation except to mark the rising intonation often found in interrogative statements; for further discussion, see section
541	TSBAUT	, while there is a marked pause between
552	TSBAUT	An utterance may contain either running text, or text within which other basic structural elements are nested. Where such nesting occurs, the
562	TSBAUT	; that is, a pause or shift (etc.) within an utterance is regarded as being produced by that speaker only, while a pause between utterances applies to all speakers.
564	TSBAUT	Occasionally, an utterance may seem to contain other utterances, for example where one speaker interrupts himself, or when another speaker produces a
566	TSBAUT	while they are still speaking. The present version of these Guidelines does not support nesting of one
568	TSBAUT	element within another. The transcriber must therefore decide whether such interruptions constitute a change of utterance, or whether other elements may be used. In the case of self-interruption, the
570	TSBAUT	element may be used to show that the speaker has changed the quality of their speech:
589	TSBAUT	Where this is not possible, it is simplest to regard the back-channel as a distinct utterance.
594	TSBAPA	Speakers differ very much in their rhythm and in particular in the amount of time they leave between words. The following element is provided to mark occasions where the transcriber judges that speech has been paused, irrespective of the actual amount of silence:
595	TSBAPA	A pause contained by an utterance applies to the speaker of that utterance. A pause between utterances applies to all speakers. The
607	TSBAPA	If detailed synchronization of pausing with other vocal phenomena is required, the alignment mechanism defined at section
610	TSBAPA	attribute mentioned in the previous section may also be used to characterize the degree of pausing between (but not within) utterances.
619	TSBAVO	attribute should be used to specify the person or group responsible for a
625	TSBAVO	which is contained within an utterance, if this differs from that of the enclosing utterance. The attribute must be supplied for a
635	TSBAVO	attribute may be used to indicate that the vocal, kinesic, or incident is repeated, for example
641	TSBAVO	, where what is being encoded is a shift in voice quality. For this last case, the
662	TSBAVO	element of the TEI header.
694	TSBAVO	The extent to which encoding of incidents or kinesics is included in a transcription will depend entirely on the purpose for which the transcription was made. As elsewhere, this will depend on the particular research agenda and the extent to which their presence is felt to be significant for the interpretation of spoken interactions.
698	TSBAWR	Written text may also be encountered when speech is transcribed, for example in a television broadcast or cinema performance, or where one participant shows written text to another. The
700	TSBAWR	element may be used to distinguish such written elements from the spoken text in which they are embedded.
702	TSBAWR	For example, if speaker A in the breakfast table conversation in section
703	TSBAWR	above had simply shown the newspaper passage to her interlocutor instead of reading it, the interaction might have been encoded as follows:
712	TSBAWR	If the source of the writing being displayed is known, bibliographic information about it may be stored in a
716	TSBAWR	element of the TEI header, and then pointed to using the
739	TSBATI	As noted above, utterances, vocals, pauses, kinesics, incidents, and writing elements all inherit attributes providing information about their position in time from the classes
743	TSBATI	. These attributes can be used to link parts of the transcription very exactly with points on a timeline, or simply to indicate their duration. Note that if
749	TSBATI	elements whose temporal distance from each other is specified in a timeline, then
756	TSBATI	) may be used as an alternative means of aligning the start and end of timed elements, and is required when the temporal alignment involves points within an element.
764	TSSASH	A common requirement in transcribing spoken language is to mark positions at which a variety of prosodic features change. Many paralinguistic features (pitch, prominence, loudness, etc.) characterize stretches of speech which are not co-extensive with utterances or any of the other units discussed so far. One simple method of encoding such units is simply to mark their boundaries. An empty element called
769	TSSASH	element may appear within an utterance or a segment to mark a significant change in the particular feature defined by its attributes, which is then understood to apply to all subsequent utterances for the same speaker, unless changed by a new shift for the same feature in the same speaker. Intervening utterances by other speakers do not normally carry the same feature. For example:
779	TSSASH	is spoken loudly, the words
791	TSSASH	); this list may be revised or supplemented using the methods outlined in section
796	TSSASH	attribute specifies the new state of the feature following the shift. If this attribute has the special value
800	TSSASH	A list of suggested values for each of the features proposed follows:
814	TSSASH	l
825	TSSASH	f
834	TSSASH	p
860	TSSASH	desc
888	TSSASH	legato, every syllable receiving more or less equal stress
949	TSSASH	A full definition of the sense of the values provided for each feature should be provided in the encoding description section of the text header (see section
965	TSSA	This section describes the following features characteristic of spoken texts for which elements are defined elsewhere in these Guidelines:
967	TSSA	segmentation below the utterance level
972	TSSA	The elements discussed here are not provided by the module for spoken texts. Some of them are included in the core module and others are contained in the modules for linking and for analysis respectively. The selection of modules and their combination to define a TEI schema is discussed in section
977	TSSASE	For some analytic purposes it may be desirable to subdivide the divisions of a spoken text into units smaller than the individual utterance or turn. Segmentation may be performed for a number of different purposes and in terms of a variety of speech phenomena. Common examples include units defined both prosodically (by intonation, pausing, etc.) and syntactically (clauses, phrases, etc.). The term
979	TSSASE	has been used by a number of researchers to define units peculiar to speech transcripts.
980	TSSASE	The term was apparently first proposed by
982	TSSASE	A text can be analysed as a sequence of segments which are internally connected by a network of syntactic relations and externally delimited by the absence of such relations with respect to neighbouring segments. Such a segment is a syntactic unit called a macrosyntagm
992	TSSASE	attribute to specify the kind of segmentation applicable to a particular segment, if more than one is possible in a text. A full definition of the segmentation scheme or schemes used should be provided in the
996	TSSASE	element in the TEI header (see
999	TSSASE	In the first example below, an utterance has been segmented according to a notion of syntactic completeness not necessarily marked by the speech, although in this case a pause has been recorded between the two sentence-like units. In the second, the segments are defined prosodically (an acute accent has been used to mark the position immediately following the syllable bearing the primary accent or stress), and may be thought of as
1017	TSSASE	element in the header of the text should specify the principles adopted to define the segments marked in this way.
1022	TSSASE	may be used, either as an alternative or in addition to the more general purpose
1059	TSSASE	In this example, recoded from a corpus of language-impaired speech prepared by Fletcher and Garman, the speaker's utterance has been fully segmented into clausal (
1077	TSSASE	has been used to define a particular characteristic of this corpus for which no element exists in the TEI scheme. See further chapter
1078	TSSASE	for a discussion of the way in which this kind of user-defined extension of the TEI scheme may be performed and chapter
1081	TSSASE	This example also uses the core elements
1088	TSSASE	It is often the case that the desired segmentation does not respect utterance boundaries; for example, syntactic units may cross utterance boundaries. For a detailed discussion of this problem, and the various methods proposed by these Guidelines for handling it, see chapter
1091	TSSASE	milestone
1094	TSSASE	tag discussed in section
1097	TSSASE	where several discontinuous segments are to be grouped together to form a syntactic unit (e.g. a phrasal verb with interposed complement), the
1104	TSSAPA	A major difference between spoken and written texts is the importance of the temporal dimension to the former. As a very simple example, consider the following, first as it might be represented in a playscript:
1126	TSSAPA	However, this does not allow us to indicate the extent to which Stig's utterance is overlapped, nor does it show that there are in fact three things which are synchronous: the end of Jane's utterance, Stig's whole utterance, and Lou's kinesic. To overcome these problems, more sophisticated techniques, employing the mechanisms for pointing and alignment discussed in detail in section
1127	TSSAPA	, are needed. If the module for linking has been enabled (as described in section
1137	TSSAPA	should be consulted. The rest of the present section, which should be read in conjunction with that more detailed discussion, presents a number of ways in which these mechanisms may be applied to the specific problem of representing temporal alignment, synchrony, or overlap in transcribing spoken texts.
1145	TSSAPA	attribute associated with this anchor point specifies the identifiers of the other two elements which are to be synchronized with it: specifically, the second utterance (
1147	TSSAPA	) and the kinesic (k1). Note that one of these elements has content and the other is empty.
1149	TSSAPA	This example demonstrates only a way of indicating a point within one utterance at which it can be synchronized with another utterance and a kinesic. For more complex kinds of alignment, involving possibly multiple synchronization points, an additional element is provided, known as a
1151	TSSAPA	. This consists of a series of
1161	TSSAPA	This timeline represents four points in time, named TS-P1, TS-P2, TS-P6, and TS-P3 (as with all attributes named
1163	TSSAPA	in the TEI scheme, the names must be unique within the document but have no other significance). TS-P1 is located absolutely, at 12:20:01:01 BST. TS-P2 is 4.5 seconds later than TS-P1 (i.e. at 12:20:46). TS-P6 is at some unspecified time later than TS-P2 and previous to TS-P3 (this is implied by its position within the timeline, as no attribute values have been specified for it). The fourth point, TS-P3, is 1.5 seconds later than TS-P6.
1165	TSSAPA	One or more such timelines may be specified within a spoken text, to suit the encoder's convenience. If more than one is supplied, the
1177	TSSAPA	elements in a time line are a fixed distance apart.
1179	TSSAPA	Three methods are available for aligning points or elements within a spoken text with the points in time defined by the
1185	TSSAPA	element as the value of one of the
1207	TSSAPA	For example, using the timeline given above:
1269	TSSAPA	Such conventions have the drawback that they are hard to generalize or to extend beyond the very simple case presented here. Their reliance on the accidentals of physical layout may also make them difficult to transport and to process computationally. These Guidelines recommend the following mechanisms to encode this.
1297	TSSAPA	(Note that if only the ordering or sequencing of utterances is needed, then specific timing information shown here in
1326	TSSAPA	To avoid deciding whether to point from the timeline to the text or vice versa, a
1377	TSREG	When speech is transcribed using ordinary orthographic notation, as is customary, some compromise must be made between the sounds produced and conventional orthography. Particularly when dealing with informal, dialectal, or other varieties of language, the transcriber will frequently have to decide whether a particular sound is to be treated as a distinct vocabulary item or not. For example, while in a given project
1379	TSREG	may not be worth distinguishing as a vocabulary item from
1389	TSREG	One rule of thumb might be to allow such variation only where a generally accepted orthographic form exists, for example, in published dictionaries of the language register being encoded; this has the disadvantage that such dictionaries may not exist. Another is to maintain a controlled (but extensible) set of normalized forms for all such words; this has the advantage of enforcing some degree of consistency among different transcribers. Occasionally, as for example when transcribing abbreviations or acronyms, it may be felt necessary to depart from conventional spelling to distinguish between cases where the abbreviation is spelled out letter by letter (e.g.
1397	TSREG	). Similar considerations might apply to pronunciation of foreign words (e.g.
1403	TSREG	In general, use of punctuation, capitalization, etc., in spoken transcripts should be carefully controlled. It is important to distinguish the transcriber's intuition as to what the punctuation should be from the marking of prosodic features such as pausing, intonation, etc.
1411	TSTPPR	In the absence of conventional punctuation, the marking of prosodic features assumes paramount importance, since these structure and organize the spoken message. Indeed, such prosodic features as points of primary or secondary stress may be represented by specialized punctuation marks, or other characters such as those provided by the Unicode Spacing Modifier Letters block. Pauses have already been dealt with in section
1412	TSTPPR	; while tone units (or intonational phrases) can be indicated by the segmentation tag discussed in section
1418	TSTPPR	In a more detailed phonological transcript, it is common practice to include a number of conventional signs to mark prosodic features of the surrounding or (more usually) preceding speech. Such signs may be used to record, for example, particular intonation patterns, truncation, vowel quality (long or short) etc. These signs may be preserved in a transcript either by using conventional punctuation or by marking their presence by
1426	TSTPPR	of the TEI header
1441	TSTPPR	These declarations might additionally provide information about how the characters concerned should be rendered, their equivalent IPA form, etc. In the transcript itself references to them can then be included as follows:
1493	TSTPPR	This example, which is taken from a corpus of bookshop service encounters,
1499	TSTPPR	. Where words are so unclear that only their extent can be recorded, the empty
1506	TSTPPR	For more detailed work, involving a detailed phonological transcript including representation of stress and pitch patterns, it is probably best to maintain the prosodic description in parallel with the conventional written transcript, rather than attempt to embed detailed prosodic information within it. The two parallel streams may be aligned with each other and with other streams, for example an acoustic encoding, using the general alignment mechanisms discussed in section
1515	TSTPSM	above), or to transcribe them using IPA or some other transcription system. To simplify analysis of the lexical features of a speech transcript, it may be felt useful to
1518	TSTPSM	, to make explicit the extent of regularization or normalization performed by the transcriber.
1544	TSTPSM	element may be used to indicate both the original and a corrected form of it:
1554	TSTPSM	, where a speaker switches from one language to another, may easily be represented in a transcript by using the
1556	TSTPSM	element provided by the core tagset:
1571	TSTPAC	The recommendations made here only concern the establishment of a basic text. Where a more sophisticated analysis is needed, more sophisticated methods of markup will also be appropriate, for example, using stand-off markup to indicate multiple segmentation of the stream of discourse, or complex alignment of several segments within it. Where additional annotations (sometimes called
1575	TSTPAC	) are used to represent such features as linguistic word class (noun, verb, etc.), type of speech act (imperative, concessive, etc.), or information status (theme/rheme, given/new, active/semi-active/new), etc., a selection from the general purpose analytic tools discussed in chapters
1597	TS	The selection and combination of modules to form a TEI schema is described in

FT-TablesFormulaeGraphics.xml#12973

#	id	text
5	FT	In addition to graphic images, documents often contain material presented in graphical or tabular format. In such materials, details of layout and presentation may also be of comparatively greater significance or complexity than they are for running text. Indeed, it may often be difficult to make a clear distinction between details relating purely to the rendition of information and those relating to the information itself.
13	FT	As with text markup in general, many incompatible formats have been proposed for the representation of graphics, formulæ, and tables in electronic form. Unfortunately, no single format as effective as XML in the domain of text has yet emerged for their interchange, to some extent because of the difficulty of representing the information these data formats convey independently of the way it is rendered.
15	FT	The module described in this chapter defines special purpose
20	FT	. Specific recommendations for the encoding of graphic figures may be found in section
21	FT	. The rest of the chapter is devoted to general problems of encoding graphic information.
23	FT	There is at the time of writing no consensus on formats for graphical images, and such formats vary in many ways. We therefore provide (in section
25	FT	) a list of formal names for those representations most popular at this time. Each one includes a very brief description. These Guidelines recommend a few particular representations as being the most widely supported and understood.
29	FTTAB	A table is the least
30	FTTAB	graphic
31	FTTAB	of the elements discussed in this chapter. Almost any text structure can be presented as a series of rows and columns: one might, for example, choose to show a glossary or other form of list in tabular form, without necessarily regarding it as a table. In such cases, the global
33	FTTAB	attribute is an appropriate way of indicating that some element is being presented in tabular format, for example by using an appropriate display property in CSS. When tabular presentation is regarded as of less intrinsic importance, it is correspondingly simpler to encode descriptive or functional information about the contents of the table, for example to identify one cell as containing a name and another as containing a date, though the two methods may be combined.
35	FTTAB	When, however, particular elements are required to encode the tabular arrangement itself, then one or other of the various
36	FTTAB	table schemas
37	FTTAB	now available may be preferable. The schemas in common use generally view a table as a special text element, made up of row elements, themselves composed of cells.
38	FTTAB	Table cells generally appear in row-major order, with the first row from left to right, then the second row, and so on. Details of appearance such as column widths, border lines, and alignment are generally encoded by numerous attributes. Beyond this, however, such schemas differ greatly. This section begins by describing a table schema of this kind; a brief summary of some other widely available table schemas is also provided in section
41	FTTAB1	TEI Tables
43	FTTAB1	For encoding tables of low to moderate complexity, these Guidelines provide the following special purpose elements:
52	FTTAB1	It is to a large extent arbitrary whether a table should be regarded as a series of rows or as a series of columns. For compatibility with currently available systems, however, these Guidelines require a row-by-row description of a table. It is also possible to describe a table simply as a series of cells; this may be useful for tabular material which is not presented as a simple matrix.
58	FTTAB1	may be used to indicate the size of a table, or to indicate that a particular cell or row of a table spans more than one row or column. For both tables and cells, rows and columns are always given in top-to-bottom, left-to-right order, although formatting properties such as those provided by CSS may be used to specify that they should be displayed differently. These Guidelines do not require that the size of a table be specified; for most formatting and many other applications, it will be necessary to process the whole table in two passes in any case.
60	FTTAB1	Where cells span more than one column or row, the encoder must determine whether this is a purely presentational effect (in which case the
62	FTTAB1	attribute may be more appropriate), whether the part of the table affected would be better treated as a nested table, or whether to use the spanning attributes listed above.
66	FTTAB1	attribute may be used to categorize a single cell, or set a default for all the cells in a given row. The present Guidelines distinguish the roles of
67	FTTAB1	label
73	FTTAB1	numeric
85	FTTAB1	The following simple example demonstrates how the data presented as a labelled list in section
128	FTTAB1	The following example demonstrates how a simple statistical table may be represented using this scheme:
184	FTTAB1	Note the use of a blank cell in the first row to ensure that the column labels are correctly aligned with the data. Again, this encoding does not explicitly represent the alignment between column and row labels and the data to which they apply. Where the primary emphasis of an encoding is on the semantic content of a table, a more explicit mechanism for the representation of structured information such as that provided by the feature structure mechanism described in chapter
185	FTTAB1	may be preferred. Alternatively, the general purpose linkage and alignment mechanisms described in chapter
188	FTTAB1	The content of a table cell need not be simply character data. It may also contain any sequence of the phrase-level elements described in chapter
189	FTTAB1	, thus allowing for the encoding of potentially more useful semantic information, as in the following example, where the fact that one cell contains a number and the other contains a place name has been explicitly recorded:
255	FTTAB1	The content of table elements is not limited to
269	FTTAB1	provide options for including text which is clearly part of the table, but outside the actual tabular layout. This example shows the use of
308	FTTAB2	Many authoring systems include built-in support for their own or for public table schemas. These provide an enhanced user interface and good formatting capabilities, but are often product-specific, despite their use of an XML markup language.
310	FTTAB2	The DTD developed by the Association of American Publishers (AAP) and standardized in ANSI Z39.59 provided a very simple encoding for correspondingly simple tables. This has been further developed, together with the table DTD documented in ISO Technical Report 9537, and now forms part of ISO 12083. The TEI table model described above has functionality very similar to that defined by ISO 12083.
312	FTTAB2	For more complex tables, the most effective publicly-available DTD is probably that developed by the US Department of Defense CALS project. This supports vertical and horizontal spanning and various kinds of text rotation and justification within cells and is also directly supported by a number of existing XML software systems.
314	FTTAB2	The CALS table model is much too complex to describe fully here; for historical background see
316	FTTAB2	. As with any other XML vocabulary, the XML version of the CALS model may readily be included in a TEI schema, using the techniques described in
321	FTTAB2	The XHTML table model (
322	FTTAB2	) is based on the HTML table model (
323	FTTAB2	). Both models support arrangement of arbitrary data into rows and columns of cells. Table rows and columns may be grouped to convey additional structural information and may be rendered by user agents in ways that emphasize this structure. Support for incremental rendering of tables and for rendering on
327	FTTAB2	). Stylesheets provide a far more effective means of controlling layout and other visual characteristics in both HTML and XML documents.
332	FTFOR	Mathematical and chemical formulæ pose problems similar to those posed by tables in that rendition may be of great significance and hard to disentangle from content. They also require access to a wide range of special characters, for most of which standard entity names already exist in the documented ISO entity sets (see further chapters
338	FTFOR	The AAP and ISO standards mentioned in section
339	FTFOR	above both provide DTDs for equations as well as for tables, which now form part of ISO 12083. The European Mathematical Trust, an organization set up specifically to enhance research support for European mathematicians, has also defined a general purpose mathematical DTD known as EuroMath (
342	FTFOR	Most if not all of the functionality provided by these DTDs can now be found in the OpenMath and MathML XML-based systems briefly described below.
344	FTFOR	As with tables, in all the XML solutions a tension exists between the need to encode the way a formula is written (its appearance) and the need to represent its semantics. If the object of the encoding is purely to act as an interchange format among different formatting programs, then there is no need to represent the mathematical meaning of an expression. If however the object is to use the encoding as input to an algebraic manipulation system (such as Mathematica or Maple) or a database system, then simply representing superscripts and subscripts will clearly be inadequate.
346	FTFOR	The present Guidelines make no attempt to add to the number of available DTDs for representing formulæ. Instead, we recommend that the user make an informed choice from those already available. The module described in this chapter makes available only the following element, which should be used to encode any formula, no matter what notation is employed:
357	FTFOR	must be escaped with entity references or numeric character references, e.g.
361	FTFOR	If desired, the content of the
366	FTFOR	When the content of a
377	FTFOR	attribute supplies the name of a notation (
389	FTFOR	structure of an expression. Most of its content elements correspond with the range of operators, relations, and named functions typically found at the high-school level of mathematics. The tortoise example given above in TeX can be re-expressed in MathML as
443	FTFOR	MathML 2.0 provides support for a
463	FTFOR	Encodings, both binary (
467	FTFOR	OpenMath and MathML have certain common aspects. They both use prefix operators, both are XML-based and they both construct their objects by applying certain rules recursively. Such similarities facilitate mapping between the two standards. There are also some key differences between MathML and OpenMath. OpenMath does not provide support for presentation of mathematical objects and its scope of semantically-oriented elements is much broader than that of MathML, with the expressive power to cover virtually all areas of computational mathematics. In fact, a particular set of Content Dictionaries, the
472	FTFOR	) is an extension of the OpenMath standard that supplies markup for structures such as axioms, theorems, proofs, definitions, texts (mixing formal content with mathematical text).
474	FTFOR	In-line versus block placement for an equation can be distinguished if desired, via the global
480	FTFOR	attributes may also be used to label or identify the formula, as in the following example:
525	FTNM	Music, like many other art forms, is often mentioned, discussed and described in writings of various kinds. This applies to both historical and contemporary documents, even though methods of notating music have changed considerably in western history. In most cases, music notation enters the text flow in a way similar to figures, images or graphs. On other occasions, elements of music notation are treated as inline characters in running text.
528	FTNM	provides a way to signal the presence of music notation in text, but defers to other representations, which are not covered by the TEI Guidelines, to describe the music notation itself. In fact, several commercial, academic and standards bodies have developed digital representations of music notation, and given the topic's complexity, these representations often focus on different aspects and adopt different methodologies. Therefore,
530	FTNM	only defines a container element to encode the occurrence of music notation and allows linking to the data format preferred by the encoder. (Note:
553	FTNM	can be used to indicate the location of a representation of the music notation.
556	FTNM	supplies the MIME type of the data format, when available.
566	FTNM	can be used to indicate the location of a graphical representation of the music notation.
570	FTNM	provides encoded binary data which constitutes another representation of the music notation (e.g. audio).
581	FTNM	supplies the MIME type of the data format when available. For example:
597	FTNM	It is possible to link to any kind of music notation data format. However, when a MIME type is not available, it is recommended that the format be specified in the description. See the following examples.
620	FTNM	It is possible to specify the location of digital objects representing the notated music in other media such as images or audio-visual files. The interpretation of the correspondence between the notated music and these digital objects is not encoded explicitly. We recommend the use of
624	FTNM	mainly as a fallback mechanism when the notated music format is not displayable by the application using the encoding. The alignment of encoded notated music, images carrying the notation, and audio files is a complex matter for which we refer the reader to other formats and specifications such as
634	FTNM	In modern printing, music notation positioned between blocks of text for illustrative purposes is usually referred to as a
635	FTNM	figure
674	FTGRA	The following special purpose elements are used to indicate the presence of graphic images within a document:
685	FTGRA	elements form part of the common core module, and are discussed in section
694	FTGRA	attribute provides the location of an image. For example:
696	FTGRA	Three kinds of content may be supplied inside a
700	FTGRA	may be used to transcribe (or supply) a descriptive heading or title for the graphic itself as in this example:
703	FTGRA	Figures are often accompanied not only by a title or heading (a caption), but by a paragraph or so of commentary (a legend) following the caption. One or more
708	FTGRA	may be used to transcribe any commentary on the figure in the source:
718	FTGRA	Here, the figure contains a heading
722	FTGRA	. Both of these are transcribed from the source, while the description is provided by the encoder, for use by applications which cannot display the graphic directly. In documents created in electronic form with the needs of print-handicapped readers in mind, the
724	FTGRA	element may be provided by the author rather than a subsequent encoder.
731	FTGRA	Where the graphic itself contains large amounts of text, perhaps with a complex structure, and perhaps difficult to distinguish from the graphic, the encoder should choose whether to regard the graphic as containing the text (in which case, a nested
735	FTGRA	element) or to regard the enclosed text as being a separate division of the
737	FTGRA	element in which the graphic appears. In this latter case, an appropriate
741	FTGRA	(etc.) element may be used for the text represented within the graphic, and the
743	FTGRA	element embedded within it. The choice will depend to a large degree on the encoder's understanding of the relationship between the graphic and the surrounding text.
745	FTGRA	A figure which is internally divided, or contains sub-figures, may be encoded with nested
766	FTGRA	Like any other element in the TEI scheme, figures may be given identifiers so that they can be aligned with other elements, and linked to or from them, as described in chapter
771	FTGRA	version which, when selected by the user, causes the other, high resolution, version to be accessed. In TEI terms, the thumbnail image acts as a
773	FTGRA	to the other. Supposing that a thumbnail version of the figure discussed above is available as
786	FTGRA	. When the module for transcription is included in a schema, specific attributes for parts of a text and parts (or all) of a digital image are available; these are discussed in
792	FTGRA	with chapter two of some text, and another portion of it with chapter three. The application may be thought of as a hypertext browser in which the user selects from a graphic image which part of a text to read next, but the mechanism is independent of this particular application.
794	FTGRA	The first requirement is some way of identifying and hence pointing to sub-parts of a graphic image. This may be done by pointing into an XML graphic representation, for example an SVG file. Thus
815	FTGRA	The next requirement is some way of identifying the parts of the document to which a link is to be made. The most obvious way of doing this is to use the global
824	FTGRA	Now, all that is needed to link these areas to the relevant chapters is a
833	FTGRA	In this example, the SVG representation of the graphic is stored externally to the TEI document and linked by means of a pointer. It is also possible to embed the SVG representation directly within the TEI by extending the content model of the
837	FTGRA	from the SVG namespace. Like other customizations of the TEI scheme, this is carried out using the techniques documented in section
848	FTGROV	The first major distinction in graphic representation is that between raster graphics and vector graphics. A
850	FTGROV	is a list of points, or dots. Scanners, fax machines and other simple devices easily produce digital raster images, and such images are therefore quite common. A
852	FTGROV	, in contrast, is a list of geometrical objects, such as lines, circles, arcs, or even cubes. These are much more difficult to produce, and so are mainly encountered as the output of sophisticated systems such as architectural and engineering CAD programs.
854	FTGROV	Raster images are difficult to modify because by definition they only encode single points: a line, for example, cannot grow or shrink as such, since it is not identified as such. Only its component parts are identified, and only they can be manipulated. Therefore the resolution or dot-size of a raster image is important, which is not the case with vector images. It is also far more difficult to convert raster images to vector images than to perform the opposite conversion. Raster images generally require more storage space than vector images, and a wide variety of methods exists for compressing them; the variation in these methods leads to corresponding variations in representations for storage and transmission of raster images.
856	FTGROV	Motion video usually consists of a long series of raster images. Data compression is even more effective on video than on single raster images (mainly owing to redundancy which arises from the usual similarity of adjacent frames). Notations for representing full-motion video are hotly debated at this time, and any user of these Guidelines would do well to obtain up-to-date expert advice before undertaking a project using them.
864	FTGROV	save space by discarding a small portion of the image's detail, such as fine distinctions of shading. When decompressed, therefore, such an image will be only a close approximation of the original. In contrast,
866	FTGROV	guarantees that the exact uncompressed image will be reproducible from the compressed form: only truly redundant information is removed. In general, therefore, lossless compression does not save quite so much space as lossy compression, though it does guarantee fidelity to the original uncompressed image.
870	FTGROV	, which is the number of dots per inch used to represent the image. Doubling the resolution will give a more precise image, but also quadruple the storage requirement (before compression), and affect processing time for any operations to be performed, such as displaying an image for a reader. Motion video also has resolution in time: the number of frames to be shown per second. Encoders should consider carefully what resolution(s) and frame rate(s) to use for particular applications; these Guidelines express no recommendation in this matter, save the universal ones of consistency and documentation.
872	FTGROV	Within any image, it is typical to refer to locations via Cartesian coordinate axes: values for x, y, and sometimes z and/or time. However, graphic notations vary in whether coordinates count from left-to-right and top-to-bottom, or another way. They also vary in whether coordinates are considered real (inches, millimeters, and so on), or virtual (dots). These Guidelines do not recommend any of these methods over another, but all decisions made should be applied consistently, and documented in the
874	FTGROV	section of the TEI header.
875	FTGROV	Since the current version of the Guidelines provides no special-purpose element for this, such information should be provided as one or more distinct paragraphs at the end of the
880	FTGROV	Methods of aligning images and text are discussed in
885	FTGROV	images, each point is rendered in some shade of gray, the number of shades varying from system to system. In true polychrome images, points are rendered in different hues, again with varying limitations affecting the number of distinct shades and the means by which they are displayed.
889	FTGRNO	As noted above, there exists a wide variety of different graphics formats, and the following list is in no way exhaustive. Moreover, inclusion of any format in this list should not be taken as indicating endorsement by the TEI of this format or any products associated with it. Some of the formats listed here are proprietary to a greater or lesser extent and cannot therefore be regarded as standards in any meaningful sense. They are however widely used by many different vendors.
920	FTGRNO	Brief descriptions of all the above are given below. Where possible, current addresses or other contact information are shown for the originator of each format. Many formal standards, especially those promulgated by ISO and many related national organizations (ANSI, DIN, BSI, and many more), are available from those national organizations. Addresses may be found in any standard organizational directory for the country in question.
930	FTGRAVGF	SVG is a language for describing two-dimensional vector and mixed vector or raster graphics in XML. It is defined by the Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, 04 September 2001, and is available at
946	FTGRARGF	Currently the most widely supported raster image format, especially for black and white images, TIFF is also one of the few formats commonly supported on more than one operating system. The drawback to TIFF is that it actually is a wrapper for several formats, and some TIFF-supporting software does not support all variants. TIFF files may use LZW, CCITT Group 4, or PackBits compression methods, or may use no compression at all. Also, TIFF files may be monochrome, grayscale, or polychromatic. All such options should be specified in prose at the end of the
948	FTGRARGF	section of the TEI header for any document including TIFF images. TIFF is owned by Aldus Corporation. Documentation on TIFF is available from them at Craigcook Castle, Craigcook Road, Edinburgh EH4 3UH, Scotland, or 411 First Avenue South, Seattle, Washington 98104 USA.
954	FTGRARGF	PBM files are easy to process, eschewing all compression in favor of transparency of file format. PBM files can, of course, be compressed by generic file-compression tools for storage and transfer. Public domain software exists which will convert many other formats to and from PBM. Documentation on PBM is copyright by Jeff Poskanzer, and is available widely on the Internet.
970	FTGRAMPEG	This standard is sponsored by CCITT and by ISO. It is ISO/IEC Draft International Standard 10918-1, and CCITT T.81. It handles monochrome and polychromatic images with a variety of compression techniques. JPEG per se, like CCITT Group 4, must be encapsulated before transmission; this can be done via TIFF, or via the JPEG File Interchange Format (JFIF), as commonly done for Internet delivery.
982	FTGRAMPEG	SMIL is a W3C Recommendation which supports the integration of independent multimedia objects into a synchronized multimedia presentation. It provides multimedia authors with easily-defined basic timing relationships, fine-tuned synchronization, spatial layout, direct inclusion of non-text and non-image media objects, hyperlink support for time-based media, and adaptiveness to varying user and system characteristics. SMIL 1.0 (
983	FTGRAMPEG	) became a W3C Recommendation on June 15, 1998, and was further developed in SMIL 2.0. SMIL 2.0 adds native support for transitions, animation, event-based interaction, extended layout facilities, and more sophisticated timing and synchronization primitives to the SMIL 1.0 language. It also allows reuse of SMIL syntax and semantics in other XML-based languages, in particular those that need to represent timing and synchronization. For example, SMIL 2.0 components are used for integrating timing into XHTML Document Types and into SVG. SMIL 2.0 also provides recommendations for Document Types based on SMIL 2.0 Modules (
985	FTGRAMPEG	). It contains support for all of the major SMIL 2.0 features including animation, content control, layout, linking, media object, meta-information, structure, timing, and transition effects and is designed for Web clients that support direct playback from SMIL 2.0 markup. SMIL 2.0 (
986	FTGRAMPEG	) became a W3C Recommendation on August 7, 2001, becoming the first vocabulary to provide XML Schema support and to have reached such status.
997	figures	Tables, formulæ, notated music, and figures
1009	FT	The selection and combination of modules to form a TEI schema is described in

AB-About.xml#12945

#	id	text
7	AB	They make recommendations about suitable ways of representing those features of textual resources which need to be identified explicitly in order to facilitate processing by computer programs. In particular, they specify a set of markers (or
9	AB	) which may be inserted in the electronic representation of the text, in order to mark the text structure and other features of interest. Many, or most, computer programs depend on the presence of such explicit markers for their functionality, since without them a digitized text appears to be nothing but a sequence of undifferentiated bits. The success of the World Wide Web, for example, is partly a consequence of its use of such markup to indicate such features as headings and lists on individual pages, and to indicate links between pages. The process of inserting such explicit markers for implicit textual features is often called
13	AB	; the term
15	AB	is also used informally. We use the term
18	AB	markup language
19	AB	to denote the complete set of rules associated with the use of markup in a given context; we use the term
21	AB	for the specific set of markers or named distinctions employed by a given encoding scheme. Thus, this work both describes the TEI encoding scheme, and documents the TEI markup vocabulary.
23	AB	The TEI encoding scheme is of particular usefulness in facilitating the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software. Since they contain an inventory of the features most often deployed for computer-based text processing, the Guidelines are also useful as a starting point for those designing new systems and creating new materials, even where interchange of information is not a primary objective.
25	AB	These Guidelines apply to texts in any natural language, of any date, in any literary genre or text type, without restriction on form or content. They treat both continuous materials (
26	AB	running text
27	AB	) and discontinuous materials such as dictionaries and linguistic corpora. Though principally directed to the needs of the scholarly research community, the Guidelines are not restricted to esoteric academic applications. They are also useful for librarians maintaining and documenting electronic materials, and for publishers and others creating or distributing electronic texts. Although they focus on problems of representing in electronic form texts which already exist in traditional media, these Guidelines are also applicable to textual material which is
31	AB	The rules and recommendations made in these Guidelines are expressed in terms of what is currently the most widely-used markup language for digital resources of all kinds: the Extensible Markup Language (XML), as defined by the World Wide Web Consortium's XML Recommendation. However, the TEI encoding scheme itself does not depend on this language; it was originally formulated in terms of SGML (the ISO Standard Generalized Markup Language), a predecessor of XML, and may in future years be re-expressed in other ways as the field of markup develops and matures. For more information on markup languages see chapter
35	AB	This document provides the authoritative and complete statement of the requirements and usage of the TEI encoding scheme. As such, although it includes numerous small examples, it must be stressed that this work is intended to be a reference manual rather than a tutorial guide.
37	AB	The remainder of this chapter comprises three sections. The first gives an overview of the structure and notational conventions used throughout these Guidelines. The second enumerates the design principles underlying the TEI scheme and the application environments in which it may be found useful. Finally, the third section gives a brief account of the origins and development of the Text Encoding Initiative itself.
41	ABSTRUNC	The remaining two sections of the front matter to the Guidelines provide background tutorial material for those unfamiliar with basic markup technologies. Following the present introductory section, we present a detailed introduction to XML itself, intended to cover in a relatively painless manner as much as the novice user of the TEI scheme needs to know about markup languages in general and XML in particular. This is followed by a discussion of the general principles underlying current practice in the representation of different languages and writing systems in digital form. This chapter is largely intended for the user unfamiliar with the Unicode encoding systems, though the expert may also find its historical overview of interest.
43	ABSTRUNC	The body of this edition of the Guidelines proper contains 23 chapters arranged in increasing order of specialist interest. The first five chapters discuss in depth matters likely to be of importance to anyone intending to apply the TEI scheme to virtually any kind of text. The next seven focus on particular kinds of text: verse, drama, spoken text, dictionaries, and manuscript materials. The next nine chapters deal with a wide range of topics, one or more of which are likely to be of interest in specialist applications of various kinds. The last two chapters deal with the XML encoding used to represent the TEI scheme itself, and provide technical information about its implementation. The last chapter also defines the notion of TEI conformance and its implications for interchange of materials produced according to these Guidelines.
45	ABSTRUNC	As noted above, this is a reference work, and is not intended to be read through from beginning to end. However, the reader wishing to understand the full potential of the TEI scheme will need a thorough grasp of the material covered by the first four chapters and the last two. Beyond that, the reader is recommended to select according to their specific interests: one of the strengths of the TEI architecture is its modular nature.
47	ABSTRUNC	As far as possible, extensive cross referencing is provided wherever related topics are dealt with; these are particularly effective in the online version of the Guidelines. In addition, a series of technical appendixes provide detailed formal definitions for every element, every class, and every macro discussed in the body of the work; these are also cross linked as appropriate. Finally, a detailed bibliography is provided, which identifies the source of many examples cited in the text as well as documenting works referred to, and listing other relevant publications.
49	ABSTRUNC	As an aid to the reader, most chapters of these Guidelines follow the same basic organization. The chapter begins with an overview of the subjects treated within it, linked to the following subsections. Within each section where new elements are described, a summary table is first given, which provides their names and a brief description of their intended usage. This is then followed where appropriate by further discussion of each element, including wherever possible usage examples taken somewhat eclectically from a variety of real sources. These examples are not intended to be exhaustive, but rather to suggest typical ways in which the elements concerned may usefully be applied. Where appropriate, a link to a statement of the source for most examples is provided in the online version. Within the examples, use of whitespace such as newlines or indentation is simply intended to aid legibility, and is not prescriptive or normative.
51	ABSTRUNC	Wherever TEI elements or classes are mentioned in the text, they are linked in the online version to the relevant reference specification for the element or class concerned. Element names are always given in the form
54	ABSTRUNC	name
61	ABSTRUNC	include a closing slash to distinguish them wherever they are discussed. References to attributes take the form
65	ABSTRUNC	is the name of the attribute. References to classes are also presented as links, for example
73	AB-namecon	TEI Naming Conventions
75	AB-namecon	These Guidelines use a more or less consistent set of conventions in the naming of XML elements and classes. This section summarizes those conventions.
80	AB-namecon	An unadorned name such as
82	AB-namecon	is the name of a TEI element or attribute.
83	AB-namecon	During generation of TEI RELAX NG schema fragments, the patterns corresponding to these TEI names are given a prefix
84	AB-namecon	tei
85	AB-namecon	to allow them to co-exist with names from other XML namespaces. This prefix is not visible to the end user, and is not used in TEI documentation. When generating multi-namespace schemas, however, the user needs to be aware of it.
88	AB-namecon	The following conventions apply to the choice of names:
94	AB-namecon	Where an element name contains more than one token, the first letter of the second token, and of any subsequent ones, is capitalized, as in for example
104	AB-namecon	The specification for an element or attribute whose name contains abbreviations generally also includes a
106	AB-namecon	element providing the expanded sense of the name.
110	AB-namecon	element; this is not however generally done in TEI P5.
116	AB-namecon	att
120	AB-namecon	bibl
126	AB-namecon	category, especially as used in text classification
128	AB-namecon	char
134	AB-namecon	document: this usually refers to the original source document which is being encoded,
138	AB-namecon	declaration: has a specific sense in the TEI Header, as discussed in
140	AB-namecon	desc
142	AB-namecon	description: has a specific sense in the TEI header, as discussed in
147	AB-namecon	group. In TEI usage, a group is distinguished from a list in that the former associates several objects which act as a single entity, while the latter does not. For example, a
153	AB-namecon	simply lists a number of otherwise unrelated
157	AB-namecon	interp
159	AB-namecon	interpretation or analysis
161	AB-namecon	lang
162	AB-namecon	(natural) language
167	AB-namecon	org
169	AB-namecon	organization, that is, a named group of people or legal entity
171	AB-namecon	rdg
173	AB-namecon	reading or version found in a specific witness
175	AB-namecon	ref
176	AB-namecon	reference or link
184	AB-namecon	statement: used in a specific sense in the TEI header, as discussed in
188	AB-namecon	structured: that is, containing a specific set of named elements rather than
189	AB-namecon	mixed content
191	AB-namecon	val
195	AB-namecon	wit
207	AB-namecon	is an additional name, not the name of an addition. Such inconsistencies are relatively few in number, and it is hoped to remove them in subsequent revisions of the Guidelines.
219	AB-namecon	(division) etc. We do not specifically list such elements here: as noted above, an expansion of each such abbreviated name is provided within the documentation using the
240	ABSTRUNC	att.global
244	ABSTRUNC	model.biblPart
248	ABSTRUNC	macro.paraContent
252	ABSTRUNC	data.pointer
257	ABSTRUNC	. Here we simply note some conventions about their naming.
261	ABSTRUNC	Attribute class names take the form
265	ABSTRUNC	is typically an adjective, or a series of adjectives separated by dots, describing a property common to the attributes which make up the class.
267	ABSTRUNC	Attributes with the same name are considered to have the same semantics, whether the attribute is inherited from a class, or locally defined.
273	ABSTRUNC	Model classes have names beginning
276	ABSTRUNC	root name
279	ABSTRUNC	A root name may be the name of an element, generally the prototypical parent or sibling for elements which are members of the class.
283	ABSTRUNC	, if the class members are all children of the element named rootname; or
285	ABSTRUNC	, if the class members are all siblings of the element named
291	ABSTRUNC	is used to indicate that class members are permitted anywhere in a TEI document.
297	ABSTRUNC	For example, the class of elements which can form part of a
301	ABSTRUNC	. This class includes as a subclass the elements which can form part of a
303	ABSTRUNC	in a spoken text, which is named
309	ABTEI2	Because of its roots in the humanities research community, the TEI scheme is driven by its original goal of serving the needs of research, and is therefore committed to providing a maximum of comprehensibility, flexibility, and extensibility. More specific design goals of the TEI have been that the Guidelines should:
315	ABTEI2	support the encoding of all kinds of features of all kinds of texts studied by researchers
317	ABTEI2	be application independent
318	ABTEI2	This has led to a number of important design decisions, such as:
320	ABTEI2	the choice of XML and Unicode
322	ABTEI2	the provision of a large predefined tag set
324	ABTEI2	encodings for different views of text
331	ABTEI2	The goal of creating a common interchange format which is application independent requires the definition of a specific markup syntax as well as the definition of a large set of elements or concepts. The syntax of the recommendations made in this document conforms to the World Wide Web Consortium's XML Recommendation (
334	ABTEI2	The goal of providing guidance for text encoding suggests that recommendations be made as to what textual features should be recorded in various situations. However, when selecting certain features for encoding in preference to others, these Guidelines have tended to prefer generic solutions to specific ones, and to avoid areas where no consensus exists, while attempting to accommodate as many diverse views as feasible. Consequently, the TEI Guidelines make (with relatively rare exceptions) no suggestions or restrictions as to the relative importance of textual features. The philosophy of the Guidelines is
335	ABTEI2	if you want to encode this feature, do it this way
338	ABTEI2	The requirement to support all kinds of materials likely to be of interest in research has largely conditioned the development of the TEI into a very flexible and modular system. The development of other XML vocabularies or standards is typically motivated by the desire to create a single fully specified encoding scheme for use in a well-defined application domain. By contrast, the TEI is intended for use in a large number of rather ill-defined and often overlapping domains. It achieves its generality by means of the modular architecture described in
341	ABTEI2	The Guidelines have been written largely with a focus on text capture (i.e. the representation in electronic form of an already existing copy text in another medium) rather than text creation (where no such copy text exists). Hence the frequent use of terms like
346	ABTEI2	copy text
347	ABTEI2	, etc. However, the Guidelines are equally applicable to text creation, although certain elements, such as
350	ABTEI2	the rendition indicators
353	ABTEI2	Concerning text capture, the TEI Guidelines do not specify a particular approach to the problem of fidelity to the source text and recoverability of the original; such a choice is the responsibility of the text encoder. The current version of these Guidelines, however, provides a more fully elaborated set of tags for markup of rhetorical, linguistic, and simple typographic characteristics of the text than for detailed markup of page layout or for fine distinctions among type fonts or manuscript hands. It should also be noted that, with the present version of the Guidelines, it is no longer necessarily the case that an unmediated version of the source text can be recovered from an encoded text simply by removing the markup.
362	ABTEI2	interpretation
363	ABTEI2	. These distinctions, though widely made and often useful in narrow, well-defined contexts, are perhaps best interpreted as distinctions between issues on which there is a scholarly consensus and issues where no such consensus exists. Such consensus has been, and no doubt will be, subject to change. The TEI Guidelines do not make suggestions or restrictions as to which of these features should be encoded. The use of the terms
367	ABTEI2	about different types of encoding in the Guidelines is not intended to support any particular view on these theoretical issues. Historically, it reflects a purely practical division of responsibility amongst the original working committees (see further
370	ABTEI2	In general, the accuracy and the reliability of the encoding and the appropriateness of the interpretation are for the individual user of the text to determine. The Guidelines provide a means of documenting the encoding in such a way that a user of the text can know the reasoning behind that encoding, and the general interpretive decisions on which it is based. The TEI header may be used to document and justify many such aspects of the encoding, but the choice of TEI elements for a particular feature is in itself a statement about the interpretation reached by the encoder.
372	ABTEI2	In many situations more than one view of a text is needed since no absolute recommendation to embody one specific view of text can apply to all texts and all approaches to them. Within limits, the syntax of XML ensures that some encodings can be ignored for some purposes. To enable encoding multiple views, these Guidelines not only treat a variety of textual features, but sometimes provide several alternative encodings for what appear to be identical textual phenomena. These Guidelines offer the possibility of encoding many different views of the text, simultaneously if necessary. Where different views of the formal structure of a text are required, as opposed to different annotations on a single structural view, however, the formal syntax of XML (which requires a single hierarchical view of text structure) poses some problems; recommendations concerning ways of overcoming or circumventing that restriction are discussed in chapter
375	ABTEI2	In brief, the TEI Guidelines define a general-purpose encoding scheme which makes it possible to encode different views of text, possibly intended for different applications, serving the majority of scholarly purposes of text studies in the humanities. Because no predefined encoding scheme can possibly serve all research purposes, the TEI scheme is designed to facilitate both selection from a wide range of predefined markup choices, and the addition of new (non-TEI) markup options. By providing a formally verifiable means of extending the TEI recommendations, the TEI makes it simple for such user-identified modifications to be incorporated into future releases of the Guidelines as they evolve. The underlying mechanisms which support these aspects of the scheme are introduced in chapter
383	ABAPP	guidance for individual or local practice in text creation and data capture;
385	ABAPP	support of data interchange;
387	ABAPP	support of application-independent local processing.
388	ABAPP	These three functions are so thoroughly interwoven in practice that it is hardly possible to address any one without addressing the others. However, the distinction provides a useful framework for discussing the possible role of the Guidelines in work with electronic texts.
394	ABAPP1	Problems specific to text creation or text
396	ABAPP1	have not been considered explicitly in this document. These Guidelines are not concerned with the process by which a digital text comes into being: it can be typed by hand, scanned from a printed book or typescript, read from a typesetter's tape, or acquired from another researcher who may have used another markup scheme (or no explicit markup at all).
400	ABAPP1	XML can appear distressingly verbose, particularly when (as in these Guidelines) the names of tags and attributes are chosen for clarity and not for brevity. Editor macros and keyboard shortcuts can allow a typist to enter frequently used tags with single keystrokes. It is often possible to transform word-processed or scanned text automatically. Markup-aware software can help with maintaining the hierarchical structure of the document, and display the document with visual formatting rather than raw tags.
403	ABAPP1	may be used to develop simpler TEI-conformant schemas for data capture, for example with limited numbers of elements, or with shorter names for the tags being used most often. Documents created with such schemas may then be automatically converted to a more elaborated TEI form.
408	ABAPP2	The TEI format may simply be used as an interchange format, permitting projects to share resources even when their local encoding schemes differ. If there are
414	ABAPP2	such mappings are needed. However, for such translations to be carried out without loss of information, the interchange format chosen must be as expressive (in a formal sense) as any of the target formats; this is a further reason for the TEI's provision of both highly abstract or generic encodings and highly specific ones.
422	ABAPP2	creating a suitable set of mappings.
425	ABAPP2	For example, to translate from encoding scheme X into the TEI scheme:
427	ABAPP2	Make a list of all the textual features distinguished in X.
429	ABAPP2	Identify the corresponding feature in the TEI scheme. There are three possibilities for each feature:
431	ABAPP2	the feature exists in both X and the TEI scheme;
433	ABAPP2	X has a feature which is absent from the TEI scheme;
435	ABAPP2	X has a feature which corresponds with more than one feature in the TEI scheme.
436	ABAPP2	The first case is a trivial renaming. The second will require an extension to the TEI scheme, as described in chapter
437	ABAPP2	. The third is more problematic, but not impossible, provided that a consistent choice can be made (and documented) amongst the alternatives.
442	ABAPP2	Translating from the TEI into scheme X follows the same pattern, except that if a TEI feature has no equivalent in X, and X cannot be extended, information must be lost in translation.
447	ABAPP2	The TEI
448	ABAPP2	abstract model
449	ABAPP2	(that is, the set of categorical distinctions which it defines) must be respected. The correspondence between a tag X and the semantic function assigned to it by these Guidelines may not be changed; such changes are known as
450	ABAPP2	tag abuse
453	ABAPP2	A TEI document must be expressed as a valid XML-conformant document which uses the TEI namespace appropriately. If, for example, the document encodes features not provided by the Guidelines, such extensions may not be associated with the TEI namespace.
455	ABAPP2	It must be possible to validate a TEI document against a schema derived from these Guidelines, possibly with extensions provided in the recommended manner.
461	ABAPP3	Machine-readable text can be manipulated in many ways; some users:
465	ABAPP3	edit, display, and link texts in hypertext systems
475	ABAPP3	perform content analysis on texts
485	ABAPP3	scan verse texts metrically
487	ABAPP3	link text and images
490	ABAPP3	These applications cover a wide range of likely uses but are by no means exhaustive. The aim has been to make the TEI Guidelines useful for encoding the same texts for different purposes. We have avoided anything which would restrict the use of the text for other applications. We have also tried not to omit anything essential to any single application.
492	ABAPP3	Because the TEI format is expressed using XML, almost any modern text processing system is able to process it, and new TEI-aware software systems are able to build on a solid base of existing software libraries.
497	ABTEI	The Text Encoding Initiative grew out of a planning conference sponsored by the Association for Computers and the Humanities (ACH) and funded by the U.S. National Endowment for the Humanities (NEH), which was held at Vassar College in November 1987. At this conference some thirty representatives of text archives, scholarly societies, and research projects met to discuss the feasibility of a standard encoding scheme and to make recommendations for its scope, structure, content, and drafting. During the conference, the Association for Computational Linguistics and the Association for Literary and Linguistic Computing agreed to join ACH as sponsors of a project to develop the Guidelines. The outcome of the conference was a set of principles (the
504	ABTEI	The Text Encoding Initiative project began in June 1988 with funding from the NEH, soon followed by further funding from the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. Four working committees, composed of distinguished scholars and researchers from both Europe and North America, were named to deal with problems of text documentation,
505	ABTEI	text representation, text analysis and interpretation,
515	ABTEI	) of the Guidelines was distributed in July 1990 under the title
518	ABTEI	Extensive public comment and further work on areas not covered in this version resulted in the drafting of a revised version, TEI P2, distribution of which began in April 1992. This version included substantial amounts of new material, resulting from work carried out by several specialist working groups, set up in 1990 and 1991 to propose extensions and revisions to the text of P1. The overall organization, both of the draft itself and of the scheme it describes, was entirely revised and reorganized in response to public comment on the first draft.
520	ABTEI	In June 1993 an Advisory Board met to review the current state of the TEI Guidelines, and recommended the formal publication of the work done to that time. That version of the TEI Guidelines, TEI P3, consolidated the work published as parts of TEI P2, along with some new material, and was finally published in May of 1994 without the label
525	ABTEI	XML was originally developed as a way of publishing on the World Wide Web richly encoded documents such as those for which the TEI was designed. Several TEI participants contributed heavily to the development of XML, most notably XML's senior co-editor C. M. Sperberg-McQueen, who served as the North American editor for the TEI Guidelines from their inception until 1999.
526	ABTEI	Following the rapid take-up of this new standard metalanguage, it became evident that the TEI Guidelines (which had been published originally as an SGML application) needed to be re-expressed in this new formalism if they were to survive. The TEI editors, with abundant assistance from others who had developed and used TEI, developed an update plan, and made tentative decisions on relevant syntactic issues.
528	ABTEI	In January of 1999, the University of Virginia and the University of Bergen formally proposed the creation of an international membership organization, to be known as the TEI Consortium, which would maintain, develop, and promote the TEI. Shortly thereafter, two further institutions with longstanding ties to the TEI (Brown University and Oxford University) joined them in formulating an Agreement to Establish a Consortium for the Maintenance of the Text Encoding Initiative (
529	ABTEI	), on which basis the TEI Consortium was eventually established and incorporated as a not-for-profit legal entity at the end of the year 2000. The first members of the new TEI Board took office during January of 2001.
531	ABTEI	The TEI Consortium was established in order to maintain a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, non-profit organization. In addition, the TEI Consortium was intended to foster a broad-based user community with sustained involvement in the future development and widespread use of the TEI Guidelines (
534	ABTEI	To oversee and manage the revision process in collaboration with the TEI Editors, the TEI Board formed a Technical Council, with a membership elected from the TEI user community. The Council met for the first time in January 2002 at King's College London. Its first task was to oversee production of an XML version of the TEI Guidelines, updating P3 to enable users to work with the emerging XML toolset. This, the P4 version of the Guidelines, was published in June 2002. It was essentially an XML version of P3, making no substantive changes to the constraints expressed in the schemas apart from those necessitated by the shift to XML, and changing only corrigible errors identified in the prose of the P3 Guidelines. However, given that P3 had by this time been in steady use since 1994, it was clear that a substantial revision of its content was necessary, and work began immediately on the P5 version of the Guidelines. This was planned as a thorough overhaul, involving a public call for features and new development in a number of important areas not previously addressed including character encoding, graphics, manuscript description, biographical and geographical data, and the encoding language in which the TEI Guidelines themselves are written.
536	ABTEI	The members of the TEI Council and its associated workgroups are listed in
537	ABTEI	. In preparing this edition, they have been attentive to the requirements and practice of the widest possible range of TEI users, who are now to be found in many different research communities across the world, and have been largely instrumental in transforming the TEI from a grant-supported international research project into a self-sustaining community-based effort. One effect of the incorporation of the TEI has been the legal requirement to hold an annual meeting of the Consortium members; these meetings have emerged as an invaluable opportunity to sustain and reinforce that sense of community.
544	ABTEI4	The encoding recommended by this document may be used without fear that future versions of the TEI scheme will be inconsistent with it in fundamental ways. The TEI will be sensitive, in revising these Guidelines, to the possible problems which revision might pose for those who are already using this version of the Guidelines.
546	ABTEI4	With TEI P5, a version numbering system is introduced following
548	ABTEI4	: the first digit identifies a major version number, the second digit a minor version number, and the third digit a sub-minor version number. The TEI undertakes that no change will be made to the formal expression of these Guidelines (that is, a TEI schema, as defined in
549	ABTEI4	) such that documents conformant to a given major numbered release cease to be compatible with a subsequent release of the same major number. Moreover, as far as possible, new minor releases will be made only for the purpose of adding new compatible features, or of correcting errors in existing features.
551	ABTEI4	The Guidelines are currently maintained as an open source project on the SourceForge site
554	ABTEI4	for information on how to find specific versions of TEI releases (Guidelines, schemas etc.). Notice of errors detected and enhancements requested may be submitted at
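
The translation procedure described in the ABAPP2 rows above (list the features of scheme X, find the corresponding TEI feature, then handle the three possible cases) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the scheme X feature names, the correspondence table, and the helper names `MappingPlan` and `plan_translation` are hypothetical and do not come from the Guidelines.

```python
from dataclasses import dataclass, field


@dataclass
class MappingPlan:
    """Result of comparing scheme X with the TEI scheme (hypothetical helper)."""
    renames: dict[str, str] = field(default_factory=dict)   # case 1: one-to-one
    extensions: list[str] = field(default_factory=list)     # case 2: absent from TEI
    choices: dict[str, str] = field(default_factory=dict)   # case 3: documented choice


def plan_translation(x_features, correspondences, preferred):
    """Classify each feature of scheme X into one of the three cases.

    x_features:      list of feature names distinguished in X
    correspondences: maps an X feature to the candidate TEI features
    preferred:       the documented choice for features with several candidates
    """
    plan = MappingPlan()
    for feature in x_features:
        candidates = correspondences.get(feature, [])
        if not candidates:
            # X has a feature absent from the TEI scheme: an extension is needed.
            plan.extensions.append(feature)
        elif len(candidates) == 1:
            # The feature exists in both schemes: a trivial renaming.
            plan.renames[feature] = candidates[0]
        else:
            # Several TEI candidates: a consistent, documented choice is required.
            choice = preferred.get(feature)
            if choice not in candidates:
                raise ValueError(f"no documented choice for {feature!r}")
            plan.choices[feature] = choice
    return plan


# Illustrative data only; the X feature names are invented for this sketch.
x_features = ["para", "emph-mark", "glyph-weight"]
correspondences = {"para": ["p"], "emph-mark": ["hi", "emph"]}
preferred = {"emph-mark": "hi"}

plan = plan_translation(x_features, correspondences, preferred)
```

Raising an error when no documented choice exists reflects the condition in the text that the third case is workable only if a consistent choice can be made and documented.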

SG-GentleIntroduction.xml#12945

#	id	text
4	SG	The encoding scheme defined by these Guidelines is formulated as an application of the Extensible Markup Language (XML) (
5	SG	). XML is widely used for the definition of device-independent, system-independent methods of storing and processing texts in electronic form. It is now also the interchange and communication format used by many applications on the World Wide Web. In the present chapter we informally introduce some of its basic concepts and attempt to explain to the reader encountering them for the first time how and why they are used in the TEI scheme. More detailed technical accounts of TEI practice in this respect are provided in chapters
12	SG	, that is, a language used to describe other languages, in this case,
16	SG	has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special codes inserted into electronic texts to govern formatting, printing, or other processing.
22	SG	, as any means of making explicit an interpretation of a text. Of course, all printed texts are implicitly encoded (or marked up) in this sense: punctuation marks, capitalization, disposition of letters around the page, even the spaces between words all might be regarded as a kind of markup, the purpose of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from
25	SG	continuous writing
27	SG	; it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted.
30	SG	markup language
31	SG	we mean a set of markup conventions used together for encoding texts. A markup language must specify how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means. XML provides the means for doing the first three; documentation such as these Guidelines is required for the last.
52	SG11	These three aspects are discussed briefly below, and then in more depth in the remainder of this chapter.
54	SG11	XML is frequently compared with HTML, the language in which web pages have generally been written, which shares some of the above characteristics. Compared with HTML, however, XML has some other important features:
57	SG11	: it does not consist of a fixed set of tags;
77	SG111	the following item is a paragraph
79	SG111	this is the end of the most recently begun list
83	SG111	move the left margin 2 quads left, move the right margin 2 quads right, skip down one line, and go to the new left margin,
84	SG111	etc. In XML, the instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the markup used to describe it.
86	SG111	Usually, the markup or other information needed to process a document will be maintained separately from the document itself, typically in a distinct document called a
88	SG111	, though it may do much more than simply define the rendition or visual appearance of a document.
94	SG111	When descriptive markup is used, the same document can readily be processed in many different ways, using only those parts of it which are considered relevant. For example, a content analysis program might disregard entirely the footnotes embedded in an annotated text, while a formatting program might extract and collect them all together for printing at the end of each chapter. Different kinds of processing can be carried out with the same part of a file. For example, one program might extract names of persons and places from a document to create an index or database, while another, operating on the same text, but using a different stylesheet, might print names of persons and places in a distinctive typeface.
105	SG112	title
107	SG112	author
109	SG112	abstract
110	SG112	and a sequence of one or more
112	SG112	. Anything lacking a title, according to this formal definition, would not formally be a report, and neither would a sequence of paragraphs followed by an abstract, whatever other report-like characteristics these might have for the human reader.
123	SG113	A basic design goal of XML is to ensure that documents encoded according to its provisions can move from one hardware and software environment to another without loss of information. The two features discussed so far both address this requirement at an abstract level; the third feature addresses it at the level of the strings of data characters that make up a document. All XML documents, whatever languages or writing systems they employ, use the same underlying character encoding (that is, the same method of representing as binary data those graphic forms making up a particular writing system).
132	SG113	which is implemented by a universal character set maintained by an industry group called the Unicode Consortium, and known as Unicode.
134	SG113	Unicode provides a standardized way of representing any of the many thousands of discrete symbols making up the world's writing systems, past and present.
137	SG113	Most modern computing systems now support Unicode directly; for those which do not, XML provides a mechanism for the indirect representation of single characters by means of their character number, known as
146	SG12	A text is not an undifferentiated sequence of words, much less of bytes. For different purposes, it may be divided into many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages.
148	SG12	Structural units of this kind are most often used to identify specific locations or refer to points within a text (
151	SG12	canto 10, line 1234
154	SG12	, etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes (
160	SG12	). Other structural units are more clearly analytic, in that they characterize a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. Such an analysis is less useful for locating parts of the text (
164	SG12	In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument, etc.), passages of different authorship and so forth. And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, line breaks, use of whitespace and so forth.
166	SG12	These textual structures overlap with one another in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organization of the book and the logical structure of the work it contains. Many great works (Sterne's
168	SG12	for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and presentational ones (such as page divisions). For many types of research, the interplay among different levels of analysis is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology.
176	SG131	The technical term used in XML for a textual unit, viewed as a structural component, is
186	SG131	of textual elements, because these are considered to be application dependent. It is up to the creators of XML vocabularies (such as these Guidelines) to choose intelligible element names and to define their intended use in text markup. That is the chief purpose of documents such as the TEI Guidelines. From the need to choose element names indicative of function comes the technical term for the name of an element type, which is
190	SG131	Within a marked-up text (a
192	SG131	), each element must be explicitly marked or tagged in some way. This is done by inserting a tag at the beginning of the element (a
196	SG131	). The start- and end-tag pair are used to bracket off element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, a quotation element in a text might be tagged as follows:
200	SG131	As this example shows, a start-tag takes the form
201	SG131	quote
203	SG131	quote
209	SG131	The material between the start-tag and the end-tag (the string of words
212	SG131	content
213	SG131	of the element. Sometimes there may be nothing between the start and the end-tag; in this case the two may optionally be merged together into a single composite tag with the solidus at the end, like this:
221	SG132	, that is, it may have no content at all, or it may contain just a sequence of characters with no other elements. Often, however, elements of one type will be
229	SG132	, and it consists of a series of
235	SG132	, each stanza having embedded within it a number of
236	SG132	line
237	SG132	elements. Fully marked up, a text conforming to this model might appear as follows:
270	SG132	a valid TEI document.
271	SG132	The element names here have been chosen for clarity of exposition; there is, however, a TEI element corresponding to each, so that this example may be regarded as TEI-conformable in the sense that this term is defined in
273	SG132	It will, however, serve as an introduction to the basic notions of XML. Whitespace and line breaks have been added to the example for the sake of visual clarity only; they have no particular significance in the XML encoding itself. Also, the line
284	SG132	root element
289	SG132	each element is completely contained by the root element, or by an element that is so contained; elements do not partially overlap one another;
291	SG132	a tag explicitly marks the start and end of each element.
295	SG132	A well-formed XML document can be processed in a number of useful ways. A simple indexing program could extract only the relevant text elements in order to make a list of headings, first lines, or words used in the poem text; a simple formatting program could insert blank lines between stanzas, perhaps indenting the first line of each, or inserting a stanza number. Different parts of each poem could be typeset in different ways. A more ambitious analytic program could relate the use of punctuation marks to stanzaic and metrical divisions.
298	SG132	Scholars wishing to see the implications of changing the stanza or line divisions chosen by the editor of this poem can do so simply by altering the position of the tags. And of course, the text as presented above can be transported from one computer to another and processed by any program (or person) capable of making sense of the tags embedded within it with no need for the sort of transformations and translations needed for files which have been saved in one or other of the proprietary formats preferred by most word-processing programs.
300	SG132	As we noted above, one of the attractions of XML is that it enables us to make up our own names for the elements rather than requiring us always to use names predefined by other agencies. Clearly, however, if we wish to exchange our poems with others, or to include poems others have marked up in our anthology, we will need to know a bit more about the names used for the tags. The means that XML provides for this is called a
301	SG132	namespace
303	SG132	qualified name
304	SG132	, that is, a name with an optional prefix identifying the set of names to which it belongs. For example, we have defined an element
306	SG132	for the purpose of marking lines of verse. Another person might, however, define an element called
308	SG132	for the purpose of marking typographic lines, or drawn lines. Because of these different meanings, if we wish to share data it will be necessary to distinguish the two
309	SG132	line
311	SG132	namespace prefix
314	SG132	This feature is particularly important if we have different definitions of what a
315	SG132	line
316	SG132	is, of course, but there are many occasions when it is useful to distinguish groups of tags belonging to different
319	SG132	). One particularly useful namespace prefix is predefined for XML: it is
323	SG132	Namespaces allow us to represent the fact that a name belongs to a group of names, but don't allow us to do much more by way of checking the integrity or accuracy of our tagging. Simple well-formedness alone is not enough for the full range of what might be useful in marking up a document. It might well be useful if, in the process of preparing our digital anthology, a computer system could check some basic rules about how stanzas, lines, and headings can sensibly co-occur in a document. It would be even more useful if the system could check that stanzas are always tagged
331	SG132	document, and the ability to perform such validation is one of the key advantages of using XML. To carry this out, some way of formally stating the criteria for successful validation is necessary: in XML this formal statement is provided by an additional document known as a
338	SG132	, both abbreviated as DTD, may also be encountered. Throughout these Guidelines we use the term
346	SG14	The design of a schema may be as lax or as restrictive as the occasion warrants. A balance must be struck between the convenience of following simple rules and the complexity of handling real texts. This is particularly the case when the rules being defined relate to texts that already exist: the designer may have only the haziest of notions as to an ancient text's original purpose or meaning and hence find it very difficult to specify consistent rules about its structure. On the other hand, where a new text is being prepared to an exact specification, for entry into a textual database of some kind for example, the more precisely stated the rules, the better they can be enforced. Even in the case where an existing text is being marked up, it may be beneficial to define a restrictive set of rules relating to one particular view or hypothesis about the text—if only as a means of testing the usefulness of that view or hypothesis. A schema designed for use by a small project or team is likely to take a different position on such issues than one intended for use by a large and possibly fragmented community. It is important to remember that every schema results from an interpretation of a text. There is no single schema encompassing the absolute truth about any text, although it may be convenient to privilege some schemas above others for particular types of analysis.
348	SG14	XML is widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross-references should be properly resolved and so forth. In such situations, documents are seen as raw material to match against predefined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of accurately tagging elements of less rigidly constrained texts. By making these rules explicit, the scholar reduces his or her own burdens in marking up and verifying the electronic text, while also being forced to make explicit an interpretation of the structure and significant particularities of the text being encoded.
353	SG141bis	A schema can be expressed in a number of different ways; frequently encountered methods include the Document Type Definition (DTD) language, which XML inherited from SGML; the XML Schema language (
354	SG141bis	) defined by the W3C; and the RELAX NG language (
359	SG141bis	of RELAX NG, but the specifications within these Guidelines are expressed in a way that is largely independent of the specific language in which a schema generated from them is expressed.
362	SG141bis	. In practice, the only part of a TEI element specification not expressed using TEI-defined syntax is the content model for an element, which is expressed using the RELAX NG schema language for reasons of processing convenience. RELAX NG uses its own XML vocabulary to define content models, which is adopted by the TEI for the same purpose.
366	SG141bis	anthology_p = element anthology { poem_p+ } poem_p = element poem { heading_p?, stanza_p+ } stanza_p = element stanza {line_p+} heading_p = element heading { text } line_p = element line { text } start = anthology_p
376	SG141bis	; that is, it defines a number of named patterns, each of which acts as a kind of template against which an input document can be matched. The meaning of a pattern is expressed in a schema by reference to other patterns, or to a small number of built-in fundamental concepts, as we shall see. In the example above, the word to the left of the equals sign is the pattern's name, and the material following it declares a meaning for the pattern. Patterns may also be of particular types; the ones that interest us here are called
380	SG141bis	. In this example we see definitions for five element patterns. Note that we have used similar names for the pattern and the element which the pattern describes: so, for example, the line
384	SG141bis	, the value of which defines an element called
386	SG141bis	. These naming conventions are arbitrary; we could use the same name for the pattern as for the element, since the two are syntactically quite distinct. The name, or
391	SG141bis	content model
394	SG141bis	The last line of the schema above tells a RELAX NG validator which element (or elements) in a document can be used as the root element: in our case only
397	SG141bis	entry point
423	SG141x	; the root element of a TEI-conformant document is
434	SG143	content model
435	SG143	of the element being defined, because it specifies what may legitimately be contained within it. In RELAX NG, the content model is defined in terms of other patterns, either by embedding them, or (as in our examples above) by naming or referring to them. The RELAX NG compact syntax also uses a small number of reserved words to identify other possible contents for an element, of which by far the most commonly encountered is
436	SG143	text
439	SG143	), then almost always, following the branches of the tree downwards (for example, from
450	SG143	text
455	SG143	are so defined, since their content models say
456	SG143	text
457	SG143	only and name no embedded elements.
467	SG144	may be repeated. There are three occurrence indicators: the plus sign, the question mark, and the asterisk or star. The plus sign means that the pattern can match one or more times; the question mark means that it may match at most once but is not mandatory; the star means that the pattern concerned is not mandatory, but may match more than once. Thus, if the content model for
483	SG145	The content model
491	SG145	(the comma) used between its components. The comma connector indicates that the patterns concerned must appear in the sequence given. Another commonly encountered connector is the vertical bar, representing alternation. If the comma in this example were replaced by a vertical bar, then a
497	SG146	In our example so far, the components of each content model have been either single patterns or
498	SG146	text
499	SG146	. It is quite permissible, however, to define content models in which the components are lists of patterns, combined by connectors. Such lists may also be modified by occurrence indicators and themselves combined by connectors. To demonstrate these facilities, let us expand our example to include non-stanzaic types of verse. For the sake of demonstration, we will categorize poems as one of the following:
507	SG146	). A blank-verse poem consists simply of lines (we ignore the possibility of verse paragraphs for the moment),
508	SG146	It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; see further section
510	SG146	so no additional elements need be defined for it. A couplet is defined as a
524	SG146	(which are distinguished to enable studies of rhyme scheme, for example
525	SG146	This is however a rather artificial example; XPath, for example, provides ways of distinguishing elements in an XML structure by their position without the need to give them distinct names.
526	SG146	); these will have exactly the same content model as the existing
528	SG146	element. We will therefore add the following two lines to our example schema:
530	SG146	Next, we can change the declaration for the
536	SG146	The second version, by applying the occurrence indicator to the group rather than to each element within it, would allow a single poem to contain a mixture of stanzas, couplets, and lines.
538	SG146	A group of this kind can contain
539	SG146	text
541	SG146	mixed content
542	SG146	, allows for elements in which the sub-components appear with intervening stretches of character data. For example, if we wished to mark place names wherever they appear inside our verse lines, then, assuming we have also added a pattern for the
544	SG146	element, we could change the definition for
547	SG146	line_p = element line { (text | name_p)* }
550	SG146	Some XML schema languages place no constraints on the way that mixed content models may be defined, but in the XML DTD language, when
551	SG146	text
552	SG146	appears with other elements in a content model, it must always appear as the first option in an alternation; it may appear once only, and in the outermost model group; and if the group containing it is repeated, the star operator must be used. Although these constraints do not apply to (for example) schemas expressed in the RELAX NG language, all TEI content models currently obey them.
554	SG146	Quite complex models can easily be built up in this way, to match the structural complexity of many types of text. As a further example, consider the case of stanzaic verse in which a refrain or chorus appears. Like a stanza, a refrain consists of repetitions of the line element. A refrain can appear at the start of a poem only, or as an optional addition following each stanza. This could be expressed by a pattern such as the following:
556	SG146	That is, a poem consists of an optional heading, followed by either a sequence of lines or an unnamed group, which starts with an optional refrain and is followed by one or more occurrences of another group, each member of which is composed of a stanza followed by an optional refrain. A sequence such as
558	SG146	follows this pattern, as does the sequence
560	SG146	. The sequence
562	SG146	does not, however, and neither does the sequence
564	SG146	Among other conditions made explicit by this content model are the requirements that at least one stanza must appear in a poem, if it is not composed simply of lines, and that if there is both a heading and a stanza they must appear in that order.
576	SG152	In the simple cases described so far, we have assumed that one can identify the immediate constituents of every element in a textual structure. A poem consists of stanzas, and an anthology consists of poems. Stanzas do not float around unattached to poems or combined into some other unrelated element; a poem cannot contain an anthology. All the elements of a given document type may be arranged into a hierarchic structure like a family tree, with a single ancestor at one end and many children (mostly the elements containing simple text) at the other. For example, we could represent an anthology containing two poems, the first of which contains two four-line stanzas and the second a single stanza, by a tree structure like the following figure:
580	SG152	This graphic representation of the structure of an XML document is close to the abstract model implicit in most XML processing systems. Most such systems now use a standardized way of accessing parts of an XML document called
587	SG152	XPath gives us a non-graphical way of referring to any part of an XML document: for example, we might refer to the last line of Blake's poem as
589	SG152	. The square brackets here indicate a numerical selection: we are talking about the fourth line in the second stanza of the first poem in the anthology. If we left out all the square-bracketed selections, the corresponding XPath expression would refer to all lines contained by stanzas contained by poems contained by anthologies. An XPath expression can refer to any collection of elements: for example, the expression
595	SG152	The solidus within an XPath expression behaves in much the same way as the solidus or backslash in a filename specification: it indicates that the item to the left directly contains the item to the right of it. In XPath it is also possible to indicate that any number of other items may intervene by repeating the solidus. For example, the XPath expression
597	SG152	will refer to the first line of each poem in the anthology, irrespective of whether it is in a stanza.
599	SG152	Clearly, there are many such trees that might be drawn to describe the structure of this or other anthologies. Some of them might be representable as further subdivisions of this tree: for example, we might subdivide the lines into individual words, since in our simple example no word crosses a line boundary. Surprisingly perhaps, this grossly simplified view of what text is (memorably termed an
600	SG152	ordered hierarchy of content objects
601	SG152	(OHCO) view of text by Renear
605	SG152	) turns out to be very effective for a large number of purposes. It is not, however, adequate for the full complexity of real textual structures, for which more complex mechanisms need to be employed. There are many other trees that might be drawn which do
609	SG152	In the OHCO model of text, representation of cases where different elements overlap so that several different trees may be identified in the same document is generally problematic. All the elements marked up in a document, no matter what namespace they belong to, must fit within a single hierarchy. To represent overlapping structures, therefore, a single hierarchy must be chosen, and the points at which other hierarchies intersect with it marked. For example, we might choose the verse structure as our primary hierarchy, and then mark the pagination by means of empty elements inserted at the boundary points between one page and the next. Or we could represent alternative hierarchies by means of the pointing and linking mechanisms described in chapter
619	SG16	, like some other words, has a specific technical sense. It is used to describe information that is in some sense descriptive of a specific element occurrence but not regarded as part of its content. For example, you might wish to add a
621	SG16	attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an
625	SG16	Although different elements may have attributes with the same name (for example, in the TEI scheme, every element is defined as having an attribute named
627	SG16	), they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as
631	SG16	The order in which attribute-value pairs are supplied inside a tag has no significance; they must, however, be separated by at least one whitespace (blank, newline, or tab) character. The value part must always be given inside matching quotation marks, either single or double
632	SG16	In the unlikely event that both kinds of quotation marks are needed within the quoted string, either or both can also be presented in escaped form, using the predefined character entities
652	SG16	attribute has the value
656	SG16	attribute has the value
662	SG16	attribute has the value
664	SG16	might be formatted differently from one in which the same attribute has the value
668	SG16	attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross-reference purposes, as discussed further below (
673	SG-att	Attributes are declared in a schema in the same way as elements. As well as specifying an attribute's name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute.
679	SG-att	, whose value is an attribute pattern defining an attribute named
681	SG-att	. Attribute names are subject to the same restrictions as other names in XML; they need not be unique across the whole schema, however, but only within the list of attributes for a given element.
683	SG-att	A pattern defining the possible values for this attribute is given within the curly braces, in just the same way as a content model is given for an element pattern. In this case, the attribute's value must be one of the strings presented explicitly above.
689	SG-att	In RELAX NG, an element pattern simply includes any attribute patterns applicable to it along with its other constituents, as shown above. Attribute patterns can also be grouped and alternated in the same way as element patterns, though this particular feature is not widely used in the TEI scheme, since it is not available to the same extent in all schema languages. Because a question mark follows the reference to the
697	SG-att	Instead of supplying a list of explicit values, an attribute pattern can specify that the attribute must have a value of a particular type, for example a text string, a numeric value, a normalized date, etc. This is accomplished by supplying a pattern that refers to a
698	SG-att	datatype
699	SG-att	. In the example above, because a list of acceptable values is predefined, a parser can check that no
711	SG-att	a parser would accept almost any unbroken string of characters (
717	SG-att	) as valid for this attribute. Sometimes, of course, the set of possible values cannot be predefined. Where it can, as in this case, it is generally better to do so.
719	SG-att	Schema languages vary widely in the extent to which they support validation of attribute values. Some languages predefine a small set of possibilities. Others allow the schema designer to use values from a predefined
721	SG-att	of possible datatypes, or to add their own definitions, possibly of great complexity. A
722	SG-att	datatype
723	SG-att	might be something fairly general (any positive integer), something very specific or idiosyncratic (any four-character string ending with "T"), or somewhere between the two. In the RELAX NG schemas used by the TEI, general patterns have been defined for about half a dozen datatypes (using the W3C Schema
726	SG-att	). In addition to the two possibilities already mentioned—plain text or an explicit list of possible strings—other datatypes likely to be encountered include the following:
732	SG-att	numeric
734	SG-att	values must represent a numeric quantity of some kind
736	SG-att	date
738	SG-att	values must represent a possible date and time in some calendar
751	SG-id	see note 6
754	SG-id	. When a text is being produced the actual numbers associated with the notes or chapters may not be certain. If we are using descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked-up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications). XML therefore predefines an attribute that may be used to provide any element occurrence with a special identifier, a kind of label, which may be used to refer to it from anywhere else: since it is defined in the XML namespace, the name of this attribute is
756	SG-id	and it is used throughout the TEI schema. Because it is intended to act as an identifier, its values must be unique within a given document. The cross-reference itself will be supplied by an element bearing an attribute of a specific kind, which must also be declared in the schema.
758	SG-id	Suppose, for example, we wish to include a reference within the notes on one poem that refers to another poem. We will first need to provide some way of attaching a label to each poem: this is easily done using the
772	SG-id	Next we need to define a new element for the cross-reference itself. This will not have any content—it is only a pointer—but it has an attribute, the value of which will be the identifier of the element pointed at. This is achieved by the following definition:
780	SG-id	. The value of this attribute must be a pointer or web reference of type
787	SG-id	(URI) may be supplied here. The accepted syntax for URIs is an Internet Standard, defined in
792	SG-id	defined by the W3C Schema datatype library.
793	SG-id	Furthermore, because there is no indication of optionality on the attribute pattern, it must be supplied on each occurrence—a
807	SG-id	A processor may take any number of actions when it encounters a link encoded in this way: a formatter might construct an exact page and line reference for the location of the poem in the current document and insert it, or just quote the poem's title or first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to, for example by displaying it in a new window. Note, however, that the purpose of the XML markup is simply to indicate that a cross-reference exists: it does not necessarily determine what the processor is to do with it.
813	SG-id	attribute of datatype URI:
814	SG-id	graphic_p = element graphic {att.url, empty} att.url = attribute url {anyURI}
815	SG-id	With these additions to the schema, we can now represent the location of the illustration within our text like this:
817	SG-id	By providing a location from which a reproduction of the required image can be downloaded, this encoding makes it possible for appropriate software to display the image as well as record its existence.
819	SG-id	Attributes form part of the structure of an XML document in the same way as elements, and can therefore be accessed using XPath. For example, to refer to all the poems in our anthology whose
821	SG-id	attribute has the value
833	SG-oth	In addition to the elements and attributes so far discussed, an XML document can contain a few other formally distinct things. An XML document may contain references to predefined strings of data that a validator must resolve before attempting to validate the document's structure; these are called
837	SG-oth	text or representing character data which cannot easily be keyboarded. An XML document may also contain arbitrary signals or flags for use when the document is processed in a particular way by some class of processor (a common example in document production is the need to force a formatter to start a new page at some specific point in a document); such flags are called
840	SG-oth	namespace
845	SG-er	As mentioned above, all XML documents use the same internal character encoding. Since not all computer systems currently support this encoding directly, a special syntax is defined that can be used to represent individual characters from the Unicode character set in a portable way by providing their numeric value, in decimal or hexadecimal notation.
849	SG-er	is represented within an XML document as the Unicode character with hexadecimal value
851	SG-er	. If such a document is being prepared on (or exported to) a system using a different character set in which this character is not available, it may instead be represented by the character reference
859	SG-er	To aid legibility, however, it is also possible to use a mnemonic name (such as
861	SG-er	) for such character references, provided that each such name is mapped to the required Unicode value by means of a construct known as an
863	SG-er	. A reference to a named character entity always takes the form of an ampersand, followed by the name, followed by a semicolon. For example an XML document containing the string
869	SG-er	There is a small set of such character entity references that do not have to be declared because they form part of the definition of XML. These include the names used for characters such as the ampersand (
873	SG-er	), which could not easily otherwise be included in an XML document without ambiguity. Other predeclared entity names are those for quotation marks (
881	SG-er	For all other named character entities, a set of entity declarations must be provided to an XML processor before the document referring to them can be validated. The declaration itself uses a non-XML syntax inherited from SGML; for example, to define an entity named
883	SG-er	with the replacement value é, the declaration could have any of the following forms:
892	SG-er	string substitution
893	SG-er	purposes, where the same text needs to be repeated uniformly throughout a text. For example, if a declaration such as
894	SG-er	<!ENTITY TEI "Text Encoding Initiative">
895	SG-er	is included with a document, then references such as
897	SG-er	may be used within it, each of which will be expanded in the same way and replaced by the string
899	SG-er	before the text is validated.
904	SG-pi	Although one of the aims of using XML is to remove any information specific to the processing of a document from the document itself, it is occasionally very convenient to be able to include such information—if only so that it can be clearly distinguished from the structure of the document. As suggested above, one common example is the need, when processing an XML document for printed output, to include a suggestion that the formatting processor might use to determine where to begin a new page of output. Page-breaking decisions are usually best made by the formatting engine alone, but there will always be occasions when it may be necessary to override these. An XML processing instruction inserted into the document is one very simple and effective way of doing this without interfering with other aspects of the markup.
912	SG-pi	. In between are two space-separated strings: by convention, the first is the name of some processor (
914	SG-pi	in the above example) and the second is some data intended for the use of that processor (in this case, the instruction to start a new page). The only constraint placed by XML on the strings is that the first one must be a valid XML name; the other can be any arbitrary sequence of characters, not including the closing character-sequence
920	SG-pi	which can be supplied at the beginning of an XML document, for example:
922	SG-pi	The XML declaration specifies the version number of the XML Recommendation applicable to the document it introduces (in this case, version 1.0), and optionally also the character encoding used to represent the Unicode characters within it. By default an XML document uses the character encoding UTF-8 or UTF-16; in this case, the 16-bit characters of Unicode have been mapped to the 8-bit character set known as ISO 8859-1; any characters present in the document but not available in the target character set will therefore need to be represented as character references (
923	SG-pi	). The XML declaration is purely documentary, but if it is wrong many XML-aware processors will be unable to process the associated text.
933	SGname	namespace
934	SGname	was introduced into the XML language as a means of addressing these and related problems. If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. Just as a document can contain words taken from different languages, so a well-formed XML document can include elements taken from different namespaces. A namespace resembles a schema in that we may say that a given set of elements
938	SGname	a given schema. However, a schema is a set of element definitions, whereas a namespace is really only a property of a collection of elements: the only tangible form it takes in an XML document is its distinctive
941	SGname	name
944	SGname	Suppose for example that we wish to extend our anthology to include a complex diagram. We might start by considering whether or not to extend our simple schema to include XML markup for such features as arcs, polygons, and other graphical elements. XML can be used to represent any kind of structure, not simply text, and there are clear advantages to having our text and our diagrams all expressed in the same way.
946	SGname	Fortunately we do not need to invent a schema for the representation of graphical components such as diagrams; it already exists in the shape of the Scalable Vector Graphics (SVG) language defined by the W3C.
949	SGname	SVG is a widely used and rich XML vocabulary for representing all kinds of two-dimensional graphics; it is also well supported by existing software. Using an SVG-aware drawing package, we can easily draw our diagram and save it in XML format for inclusion within our anthology. When we do so, we need to indicate that this part of the document contains elements taken from the SVG namespace, if only to ensure that processing software does not confuse our
955	SGname	An XML document need not specify any namespace: it is then said to use the
957	SGname	namespace. Alternatively, the root element of a document may supply a default namespace, understood to apply to all elements which have no namespace prefix. This is the function of the
959	SGname	attribute which provides a unique name for the default namespace, in the form of a URI:
964	SGname	In exactly the same way, on the root element for each part of our document which uses the SVG language, we might introduce the SVG namespace name:
973	SGname	Although a namespace name usually uses the URI (Uniform Resource Identifier) syntax, it is not treated as an online address and an XML processor regards it just as a string, providing a longer name for the namespace.
977	SGname	attribute can also be used to associate a short prefix name with the namespace it defines. This is very useful if we want to mingle elements from different namespaces within the same document, since the prefix can be attached to any element, overriding the implicit namespace for itself (but not its children):
988	SGname	There is no limit on the number of namespaces that a document can use. Provided that each is uniquely identified, an XML processor can identify those that are relevant, and validate them appropriately. To extend our example further, we might decide to add a linguistic analysis to each of the poems, using a set of elements such as
1016	SG-ms	We mentioned above that the syntax of XML requires the encoder to take special action if characters with a syntactic meaning in XML (such as the left angle bracket or ampersand) are to be used in a document to stand for themselves, rather than to signal the start of a tag or an entity reference respectively. The predefined entities
1022	SG-ms	provide one method of dealing with this problem, if the number of occurrences of such things is small. Other methods may be considered when the number is large, as in an XML document like the present Guidelines, which contains hundreds of examples of XML markup. One is to label the XML examples as belonging to a different namespace from that of the document itself, which is the approach taken in the present Guidelines. Another and simpler approach is provided by one of the features inherited by XML from its parent SGML: the
1026	SG-ms	A marked section is a block of text within an XML document introduced by the characters
1030	SG-ms	. Between these rather strange brackets, markup recognition is turned off, and any tags or entity references encountered are therefore treated as if they were plain text. For example, when we come to write the users' manual for our anthology, we may find ourselves often producing text like the following:
1043	SG18	if a document contains entity references that must be processed before the document can be validated, where are those entities defined?
1045	SG18	an XML document instance may be stored in a number of different operating system files; how should they be assembled together?
1047	SG18	how does a processor determine which stylesheets it should use when processing an XML document, or how to interpret any processing instructions it contains?
1053	SG18	Different schema languages and different XML processing systems take very different positions on all of these topics, since none of them is explicitly addressed in the XML specification itself. Consequently, the best answer is likely to be specific to a particular software environment and schema language. Since this chapter is concerned with XML considered independently of its processing environment, we only address them in summary detail here.
1060	SG-ass1	, which XML inherited from SGML. Different schema languages vary in the ways they make a collection of such definitions available to an XML processor, but fortunately there is one method that all current schema languages support.
1065	SG-ass1	statement. This declarative statement has been inherited by XML from SGML; in its full form it provides a large number of facilities, but we are here concerned only with the small subset of those facilities recognized by all schema languages.
1069	SG-ass1	Any XML processor encountering this statement will use it to add the two named entities it defines to those already predefined for XML. Before the document instance itself is validated, any references to these entities will be expanded to the character string given. Thus, wherever in the document instance the string
1072	SG-ass1	And, indeed, for those responsible for deciding the licensing conditions if they change their minds later.
1075	SG-ass1	following the string DOCTYPE in this example is, of course, the name of the root element of the document to which this declaration is prefixed; however, only an XML DTD processor will take note of this fact.
1088	SG-assoc	points to the location of the schema. This is the only mandatory pseudo-attribute, but others can be added to give more information about the schema:
1094	SG-assoc	This example includes a standard schema in XML Schema format, along with a Schematron schema which might be used for checking the format and linking of names.
1098	SG-assoc	Any modern XML processing software tool will provide convenient methods of validating documents which are appropriate to the particular schema language chosen. In the interests of maximizing portability of document instances, they should contain as little processing-specific information as possible.
1103	SG-mult	As we have already indicated, a single XML document may be made up of several different operating system files that need to be pulled together by a processor before the whole document can be validated. The XML DTD language defines a special kind of entity (a
1105	SG-mult	) that can be used to embed references to whole files into a document for this purpose, in much the same way as the character or string entities discussed in
1112	SG-mult	defines a generic mechanism for this purpose, which is supported by an increasing number of XML processors.
1116	SG-style	As mentioned above, the processing of an XML document will usually involve the use of one or more stylesheets, often but not exclusively to provide specific details of how the document should be displayed or rendered. In general, there is no reason to associate a document instance with any specific stylesheet and the schema languages we have discussed so far do not therefore make any special provision for such association. The association is made when the stylesheet processor is invoked, and is thus entirely application-specific.
1118	SG-style	However, since one very common application for XML documents is to serve them as browsable documents over the Web, the W3C has defined a procedure and a syntax for associating a document instance with its stylesheet (see
1119	SG-style	). This Recommendation allows a document to supply a link to a default stylesheet and also to categorize the stylesheet according to its
1121	SG-style	, for example to indicate whether the stylesheet is written in CSS or XSLT, using a specialized form of processing instruction.
1125	SG-style	which is available from the same location as the anthology itself, we could make it available over the Web simply by adding a processing instruction like the following to the anthology:
1128	SG-style	Multiple stylesheets can be defined for the same document, and options are available to specify how a web browser should select amongst them. For example, if the document also contained a directive:
1132	SG-style	could be used when rendering the document on a handheld device such as a mobile phone.
1134	SG-style	Most modern web browsers support CSS (although the extent of their implementation varies), and some of them support XSLT.
1138	SG-val	As we noted above, most schema languages provide some degree of datatype validation for attribute values (
1139	SG-val	). They vary greatly in the validation facilities they offer for the content of elements, other than the syntactic constraints already discussed. Thus, while we may very easily check that our
1145	SG-val	elements contain between five and 500 correctly-spelled English words, should we wish to constrain our poetry in such a way. Also, because attributes and elements are treated differently, it is difficult or impossible to express co-occurrence constraints: for example, if the
1153	SG-val	The XML DTD language offers very little beyond syntactic checking of element content. By contrast, a major impetus behind the design and development of the W3C schema language was the addition of a much more general and powerful constraint language to the existing structural constraints of XML DTDs. In RELAX NG the opposite approach was taken, in that all datatype validation, whether of attributes or element content, is regarded as external to the schema language. For attributes, as we have seen, RELAX NG makes use of the W3C Schema Datatype Library (but permits use of others). Because RELAX NG treats both elements and attributes as special cases of patterns, the same datatype validation facilities are available for element content as for attribute values; it is unlike other schema languages in this respect. In addition, for content validation, a different component of DSDL known as Schematron can be used. Schematron is a pattern matching (rather than a grammar-based) language, which allows us to test the components of a document against templates that express constraints such as those mentioned above.
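
The co-occurrence constraints mentioned in the SG-val rows above can be illustrated outside any schema language. What follows is a minimal sketch in Python using only the standard library; it is not Schematron, merely an imitation of the rule-and-assertion style that Schematron provides. The element name `poem` and the attributes `status` and `reviser` are hypothetical, invented for this illustration and not taken from the Guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: element and attribute names are invented
# for illustration only.
SAMPLE = """<anthology>
  <poem status="draft" reviser="LB"><line>one</line></poem>
  <poem status="draft"><line>two</line></poem>
  <poem status="final"><line>three</line></poem>
</anthology>"""


def check_cooccurrence(root):
    """Return one message per draft poem that lacks a reviser attribute."""
    failures = []
    for index, poem in enumerate(root.iter("poem"), start=1):
        # Rule context: every <poem>.
        # Assertion: status="draft" implies that @reviser is present.
        if poem.get("status") == "draft" and poem.get("reviser") is None:
            failures.append(
                f"poem {index}: status='draft' requires a reviser attribute"
            )
    return failures


failures = check_cooccurrence(ET.fromstring(SAMPLE))
for message in failures:
    print(message)
```

A constraint of this kind relates an attribute value to the presence of another attribute, which is why a pattern-matching check is easier to write than a grammar-based one.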

CH-LanguagesCharacterSets.xml#13235

#	id	text
4	CH	The documents which users of these Guidelines may wish to encode encompass all kinds of material, potentially expressed in the full range of written and spoken human languages, including the extinct, the non-existent, and the conjectural. Because of this wide scope, special attention has been paid to two particular aspects of the representation of linguistic information often taken for granted: language identification and character encoding.
6	CH	Even within a single document, material in many different languages may be encountered. Human culture, and the texts which embody it, is intrinsically multilingual, and shows no sign of ceasing to be so. Traditional philologists and modern computational linguists alike work in a polyglot world, in which code-switching (in the linguistic sense) and accurate representation of differing language systems constitute the norm, not the exception. The current increased interest in studies of linguistic diversity, most notably in the recording and documentation of endangered languages, is one aspect of this long standing tradition. Because of their historical importance, the needs of endangered and even extinct languages must be taken into account when formulating Guidelines and recommendations such as these.
8	CH	Beyond the sheer number and diversity of human languages, it should be remembered that in their written forms they may deploy a huge variety of scripts or writing systems. These scripts are in turn composed of smaller units, which for simplicity we term here characters. A primary goal when encoding a text should be to capture enough information for subsequent users of it to identify correctly its language, script, and constituent characters. In this chapter we address this requirement, and propose recommended mechanisms to indicate the languages, scripts and characters used in a document or a part thereof.
10	CH	Identification of language is dealt with in
11	CH	. In summary, it recommends the use of pre-defined identifiers for a language where these are available, as they increasingly are, in part as a result of the twin pressures of an increasing demand for language-specific software and an increased interest in language documentation. Where such identifiers are not available or not standardized, these Guidelines recommend a way of documenting language identifiers and their significance, in the same way as other metadata is documented in the TEI header.
13	CH	Standardization of the means available to represent characters and scripts has moved on considerably since the publication of the first version of these Guidelines. At that time, it was essential to explicitly document the characters and encoded character sets used by almost any digital resource if it was to have any chance of being usable across different computer platforms or environments, but this is no longer the case. With the availability of the Unicode standard, more than 110,000 different characters representing almost all of the world's current writing systems are available and usable in any XML processing environment without formality. Nevertheless, however large the number of standardized characters, there will always be a need to encode documents which use non-standard characters and glyphs, particularly but not exclusively in historical material. Furthermore, the full potential of Unicode is still not yet realized in all software which users of the Guidelines are likely to encounter. The second part of this chapter therefore discusses in some detail the concepts and practice underlying this standard, and also introduces the methods available for extending beyond it, which are more fully discussed in
18	CHSH	Identification of the language a document or part thereof is written in is a crucial requirement for many envisioned usages of an electronic document. The TEI therefore accommodates this need in the following way:
22	CHSH	is defined for all TEI elements. Its value identifies the language and writing system used.
24	CHSH	The TEI header has a section set aside for the information about the languages used in a document: see further
28	CHSH	The value of the attribute
30	CHSH	identifies the language using a coded value. For maximal compatibility with existing processes, modelling this value in the following way is recommended (this parallels the modelling of
34	CHSH	The identifier for the language should be constructed as in
41	CHSH	element in the TEI header, if one is present.
46	CHSH	, and proposes the following mechanism for constructing an identifier (tag) for languages as administered by the Internet Assigned Numbers Authority (IANA). The tag is assembled from a sequence of subtags separated by the hyphen (-, U+002D) character. It gives the language (possibly further identified with a sublanguage), a script and a region for this language, each possibly followed by a variant subtag.
48	CHSH	The authoritative list of registered subtags is maintained by IANA and is available at
49	CHSH	. For a good general overview of the construction of language tags, see
53	CHSH	In addition to the list of registered subtags, both BCP 47 and ISO 639-2 provide extensions that can be employed by private convention. The constructs provided can thus be used to generate identifiers for any language, past or present, used in any area of the world. If such private extensions are used within the context of the TEI, they should be documented within the
55	CHSH	element of the TEI header, which might also provide a prose description of the language described by the language tag.
57	CHSH	While language, region and script can be adequately identified using this mechanism, there is only very rough provision to express a dimension of time for the language of a document; those codes provided (e.g.
61	CHSH	in ISO 639-2) might not reflect the segments appropriate for a text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in BCP 47 and relate that to a
65	CHSH	section of the TEI header.
67	CHSH	Equivalences to language identifiers by other authorities can be given in the
71	CHSH	The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the
73	CHSH	attribute, including all elements and all attributes where a language might apply.
74	CHSH	This will exclude all attributes where a non-textual datatype has been specified, for example tokens, boolean values or predefined value lists.
81	CH	All document encoding has to do with representing one thing by another in an agreed and systematic way. Applied to the smallest distinctive units in any given writing system, which for the moment we may loosely call
88	D4-41	When the first methods of representing text for storage or transmission by machines were devised, long before the development of computers, the overriding aim was to identify the smallest set of symbols needed to convey the essential semantic content, and to encode that symbol set in the most economical way that the storage or transmission media allowed. The initial outcome was a set of systems that encoded only such content as could be expressed in uppercase letters in the Latin script, plus a few punctuation marks and some
92	D4-41	For many years after the invention of computers, the way they represented text continued to be constrained by the imperative to use expensive resources with maximal efficiency. Even when storage and processing costs began their dramatic fall, the Anglo-centric outlook of most hardware designers and software engineers hampered initiatives to devise a more generous and flexible model for text representation. The wish to retain compatibility with
94	D4-41	data was an additional disincentive. Eventually, tension in East Asia between commitment to technological progress and the inability of existing computers to cope with local writing systems led to decisive developments. Japanese, Korean and Chinese standards bodies, who long before the advent of computers had been engaged in the specification of character sets, joined with computer manufacturers and software houses to devise ways of mapping those character sets to numeric encodings and processing the resulting text data.
96	D4-41	Unfortunately, in the early years there was little or no co-ordination among either the national standards bodies or the manufacturers concerned, so that although commercial necessity dictated that these various local standards were all compatible with the representation of US-American English, they were not straightforwardly compatible with one another. Even within Japan itself there emerged a number of mutually incompatible systems, thanks to a mixture of commercial rivalry, disagreements about how best to manage certain intractable problems, and the fact that such pioneering work inevitably involved some false starts, leading to incompatibilities even between successive products of the same bodies. Roughly at the same time, and for similar reasons, multiple and incompatible ways of representing languages that use Cyrillic scripts were devised, along with methods of encoding ancient writing systems which inevitably could not aim for compatibility with other writing systems apart from basic Latin script. Many of the earliest projects that fed into the TEI were shaped in this developmental phase of the computerized representation of texts, and it was also the context in which SGML was devised and finalized.
98	D4-41	SGML had of necessity to offer ways of coping with multiple writing systems in multiple representations; or rather, it provided a framework within which SGML-compliant applications capable of handling such multiple representations might be developed by those with sufficient financial and personnel resources (such as are seldom found in academia). Earlier editions of these Guidelines offered advice on character set and writing system issues addressed to the condition of those for whom SGML was the only feasible option. That advice must now be substantially altered because of two closely-related developments: the availability of the ISO/Unicode character set as an international standard, and the emergence of XML and related technologies which are committed to the theory and practice of character representation which Unicode embodies.
118	D4-42	will not of itself take us very far towards greater terminological precision. It tends to be used to refer indiscriminately both to the visible symbol on a page and to the letter or ideograph which that symbol represents, two things that it is essential to keep conceptually distinct. The visible symbol obviously has some aspects by which we interpret it as representing one character rather than another; but its appearance may also be significantly determined by features that have no effect on our notion of which character in a writing system it represents. A familiar instance is the lowercase
122	D4-42	symbol (
123	D4-42	cf. figure 1
127	D4-42	figure 1
129	D4-42	abstract character
136	D4-42	in a serif typeface has additional strokes that are absent from the same letter when printed using a sans-serif typeface, so that once again we have differing glyphs standing for the same abstract character. In
137	D4-42	there is even a font, Capitals Regular, in which the glyph for the lowercase letter
139	D4-42	looks like a typical glyph for the character uppercase
141	D4-42	. The distinction between abstract characters and glyphs is fundamental to all machine processing of documents.
143	D4-42	In most scholarly encoding projects, the accurate recording of the abstract characters which make up the text is of prime importance, because it is the essential prerequisite of digitizing and processing the document without semantic loss. In many cases (though there are important exceptions, to be touched on shortly) it may not be necessary to encode the specific glyphs used to render those abstract characters in the original document. An encoding that faithfully registers the abstract characters of a document allows us to search and analyse our document's content, language and structure and access its full semantics. That same encoding, however, may not contain sufficient information to allow an exact visual representation of the glyphs in the source text or manuscript to be recreated.
145	D4-42	The importance of this distinction between information content and its visual representation is not always immediately apparent to people unused to the specific complexities of text handling by machine. Such users tend to ask first what (in order of conceptual priority) should actually be their very last question: how do I get a physical image that looks like character x in my source document to appear on to the screen or the output page? Their first question should in fact be: how can I get an abstract representation of character x into my encoded document in a way that will be universally and unambiguously identifiable, no matter what it happens to look like in printout or on any particular display? And occasionally the response they receive as a result of their misguided initial question is a custom
147	D4-42	that satisfies their immediate rendering wishes at the price of making their underlying document unintelligible to other users (or even to the original user in other times and places) because it encodes the abstract character in an idiosyncratic way.
149	D4-42	That said, there will certainly be documents or projects where it is a matter of scholarly significance that the compositor or scribe chose to represent a given abstract character using one particular glyph or set of strokes rather than a semantically-equivalent but visually distinct alternative, and in that case the specific appearance of the form will have to be encoded in one way or another. But that encoding need not (and in most cases will not) involve a notation that visually resembles the original, any more than italicized text in an original document will be represented by the use of italic characters in the encoded version.
151	D4-42	A collection of the abstract characters needed to represent documents in a given writing system is known as a
152	D4-42	character set
153	D4-42	, and the character set or
155	D4-42	of a processing or rendering device is the set of abstract characters that it is equipped to recognize and handle appropriately. There is, however, a subtle distinction between these two parallel uses of the same term, involving one more key concept which it is essential to grasp. The character set of a document (or the writing system in which it is recorded) is purely a collection of abstract characters. But the character set of a computing device is a set of abstract characters which have been mapped in a well-defined way to a set of numbers or
156	D4-42	code points
157	D4-42	by which the device represents those abstract characters internally. It can therefore be referred to as a
158	D4-42	coded character set
159	D4-42	, meaning a set of abstract characters each of which has been assigned a numerical code point (or in some instances a sequence of code points) which unambiguously identifies the character concerned.
161	D4-42	It is now possible to use this terminology to say what Unicode is: it is a coded character set, devised and actively maintained by an international public body, where each abstract character is identified by a unique name and assigned a distinctive code point.
162	D4-42	Although only Unicode is mentioned here explicitly, it should be noted that the character repertoire and assigned code points of Unicode and the ISO standard 10646 are identical and maintained in a way that ensures this continues to be the case.
163	D4-42	Unicode is distinguished from other, earlier and co-existing coded character sets by its (current and potential) size and scope; its built-in provision for (in practical terms) limitless expansion; the range and quality of linguistic and computational expertise on which it draws; the commitment in principle (and to an increasing degree in practice) to implement it by all important providers of hardware and software worldwide; and the stability, authority and accessibility it derives from its status as an international public standard.
169	D4-43	The distinction between abstract characters and glyphs can be crucial when devising an encoding scheme. Users performing text retrieval, searching or concordancing will expect the system to recognize and treat different glyphs as instances of the same character; but when perusing the text itself they may well expect to see glyph variants preserved and rendered. When encoding a pre-existing text, the encoder must determine whether a particular letter or symbol is a character or a glyphic variant. A detailed model of the relationship between characters and glyphs has been developed within the Unicode Consortium and an ISO work group (ISO/IEC JTC1 SC2/WG2). Its report (
171	D4-43	) will form the base for much future standards work.
173	D4-43	The model makes explicit the distinction between two different properties of the components of written language:
175	D4-43	their content, i.e. their meaning and phonetic value (represented by a character)
181	D4-43	When searching for information, a system generally operates on the content aspects of characters, with little or no attention to their appearance. A layout or formatting process, on the other hand, must of necessity be concerned with the exact appearance of characters. Of course, some operations (hyphenation for example) require attention to both kinds of feature, but in general the kind of text encoding described in these Guidelines tends to focus on content rather than appearance (see further
186	D4-43	the level of character encoding, using an appropriate Unicode code point to represent the glyph concerned
188	D4-43	the markup level, with the glyph indicated via appropriate elements and/or attributes
192	D4-43	The encoding practice adopted may be guided by, among other things, an assessment of the most frequent uses to which the encoded text will be put. For example, if recognition of identical characters represented by a variety of glyphs is the main priority, it may be advisable to represent the glyph variations at markup level, so that the character value can be immediately exposed to the indexing and retrieval software. Plainly, an encoding project will need to consider such issues carefully and embody the outcome of their deliberations in local manuals of procedure to ensure encoding consistency. Using Unicode code points to represent glyph information requires that such choices be documented in the TEI header. Such documentation cannot of itself guarantee proper display of the desired glyph but at least makes the intention of the encoder discoverable.
194	D4-43	At present the Unicode Standard does not offer detailed specifications for the encoding of glyph variations. These Guidelines do give some recommendations; some discussion of related matters is given in
204	D4-44	(IMEs) commonly used for the entry of logographic characters. This is most likely to be convenient where the display used for text entry and/or the printer used to produce output for proofreading purposes is capable of rendering the characters concerned using correct and readily identifiable glyphs. Where such easily checkable rendering is not available, or where there is no suitable method of inputting certain characters directly, they may be input by one of two possible forms of indirect notation or
208	D4-44	The first form of reference is a
210	D4-44	(NCR), which takes the general form
214	D4-44	is an integer representing the code point of the character in base 10, or
218	D4-44	is the code point in hexadecimal notation. This has the advantage that no declaration of what this notation means is required anywhere in the document instance or its associated schema. Every XML processor is capable of recognising NCRs and replacing them with the required code point value without needing access to any additional data. The disadvantage of NCRs as a means of entering, representing and proofing character data is that most human beings find them anything but
222	D4-44	The second form of reference is a
226	D4-44	that could be distinctively recognized by a processing system). Character entity references can (and indeed should) have names whose significance is apparent to humans, but each and every entity name has to be associated with its replacement (which as explained below should be a character value, possibly in the form of an NCR) via a formal declaration in the document's internal or external subset. This, however, is not needed for the character entities predefined by the XML standard, namely &amp; (&), &gt; (>), &lt; (<), &apos; ('), and &quot; ("). For a large number of characters defined by Unicode and commonly used in documents, there are ISO entity sets declaring mnemonic names which should be used wherever feasible: XML-compatible character entity declarations using ISO names and suitable for inclusion into the subset are available on the TEI web sites.
228	D4-44	Where characters are not defined in Unicode and so have to be assigned both a local code point and a local entity name of the project's choosing (see
229	D4-44	below) it is highly desirable to follow the same nomenclature principles as ISO and to emulate the practice in the ISO character entity declarations of appending a string giving the character a unique descriptive name as a comment to the actual entity declaration. In addition, where different groups or projects are working on texts with geographical, historical, linguistic or other similarities that give rise to common issues of character encoding, it is highly advisable in the interests of consistency that they should consult one another when devising entity names. The TEI mailing list may provide a suitable first point of contact for such consultations. Further advice on the matter of locally-defined characters is contained in
237	D4-45a	Rendering of the encoded text is a complicated process that depends largely on the purpose, external requirements, local equipment and so forth; it is thus outside the scope of these Guidelines.
239	D4-45a	It might nevertheless be helpful to put some of the terminology used for the rendering process in the context of the discussion of this chapter. As was mentioned above, Unicode encodes abstract characters, not specific glyphs. For any process that makes characters visible, however, concrete, specifically designed glyph shapes have to be used. For a printing process, for example, these shapes describe exactly at which point ink has to be put on the paper and which areas have to be left blank. If we want to print a character from the Latin script, besides the selection of the overall glyph shape, this process also requires the selection of a specific font weight, a specific size, and the degree to which the shape should be slanted. Beyond individual characters, the overall typesetting process also follows specific rules of how to calculate the distance between characters, how much whitespace occurs between words, at which points line breaks might occur and so forth.
241	D4-45a	If we concern ourselves only with the rendering process of the characters themselves, leaving out all these other parameters, we will realize that of all the information required for this process, only a small amount will be drawn from the encoded text itself. This information is the code point used to encode the character in the document. With this information, the font selected for printing will be queried to provide a glyph shape for this character. Some modern font formats (e.g. OpenType) do implement a sophisticated mapping from a code point to the glyph selected, which might take into account surrounding characters (to create ligatures where necessary) and the language or even area this character is printed for to accommodate different typesetting traditions and differences in the usage of glyphs.
243	D4-45a	A TEI document might provide some of the information that is required for this process, for example by identifying the linguistic context with the
245	D4-45a	attribute. The selection of fonts and sizes is usually done in a stylesheet, while the actual layout of a page is determined by the typesetting system used. Similarly, if a document is rendered for publication on the Web, information of this kind can be shipped with the document in a stylesheet.
252	D4-45b	The devisers of the XML standard took the view that Unicode should be the only means of representing abstract characters which conformant XML processors were obliged to support. That certainly does not preclude the use of other character encoding schemes or character sets in documents which are to be handled by XML processors, but it does mean that all the abstract characters which are encoded as characters (as distinct from being represented indirectly via markup) in an XML document must either possess an assigned code point within the public Unicode standard, or be assigned a code point devised by and specific to the local project, taken from a reserved range set aside by the standard expressly for this purpose, the so-called
254	D4-45b	or PUAs. For the vast majority of projects to which these Guidelines are applicable, the Unicode standard will already offer code points for all the abstract characters their documents employ, and so the requirement that all such characters should be resolvable by XML processors to Unicode code points will not involve any representation via markup or use of PUA code points. Indeed, such projects are not obliged by their choice of XML to use Unicode in their documents. Provided they correctly declare at the requisite points any non-Unicode coded character set they may use, ensure that all their XML processors support their declared encoding, and then consistently employ that encoding in strict conformity with their declarations, they need not consciously concern themselves with Unicode unless and until they feel it is appropriate to do so.
259	D4-45-1	There are, however, strict limits to the way conformant XML processors handle documents whose character set is not Unicode, and unless these limits are understood it is likely that projects not yet ready to commit to Unicode across the board will run into unexpected and baffling problems as they attempt to operate with their legacy character encodings. First, it must be repeated that nothing in the XML standard
261	D4-45-1	conformant processors to handle non-Unicode documents. But even if there were any actual processors which on that basis refused to process non-Unicode documents, that would not limit their usefulness as severely as might at first appear. The reason is that there is a way of internally representing Unicode code points (explained further in
262	D4-45-1	below) where there is no detectable difference between a document which is actually encoded in ASCII employing only 7-bit values and one which is encoded in Unicode but which happens to contain only the abstract characters encompassed by the 7-bit ASCII standard. And the XML standard specifies that this way of representing Unicode is the one which processors must assume as the default for any document that does not explicitly declare an encoding. At a stroke, this provision ensures that all pure 7-bit ASCII encoded documents can be processed without further ado by all conformant XML processors. Add to this the provision, also within the XML standard, that allows any Unicode code point to be indirectly specified using only 7-bit ASCII characters via a Numeric Character Reference (NCR), and the upshot is that all documents in non-Unicode encodings which can be pre-processed to rewrite any characters outside the 7-bit ASCII range as Unicode code points in NCR notation (a simple batch procedure for which software is readily available) can be handled even by processors which have no inbuilt support for any encoding other than Unicode.
266	D4-45-1	To avoid confusion when taking advantage of such encoding support, it is first of all essential to grasp that an encoding declaration in an XML document is indeed simply a declaration: it is not an incantation that magically converts the document that follows into the encoding concerned. It is a common error to think that simply declaring a document's encoding to be, say, ISO-8859-1 (or for that matter UTF-8 or UTF-16, the representations of Unicode for which support is mandatory) is sufficient to
268	D4-45-1	. Such a declaration is useless unless the document that follows actually is encoded strictly in conformance with the declaration. Some of the circumstances in which that may not in fact be the case are outlined in
269	D4-45-1	below. Secondly, an encoding declaration does not somehow switch an XML processor into a mode where it works entirely in the declared encoding for as long as the declaration is in scope. On the contrary, all it does is instruct the processor to pass its input through a filter that immediately converts all the code points in the declared encoding into their Unicode counterparts; from that point onwards the document as seen by all subsequent stages of processing is actually in Unicode, even though that may not be apparent to the user. Thirdly, this invariable internal conversion has a crucial consequence: the fact that a processor can successfully accept a document in a non-Unicode encoding does not mean that it will necessarily convert any output it may produce back into the declared input encoding. Internally, the document has been converted to and processed in Unicode, and there is nothing in the XML standard that requires the reverse conversion to be performed at the output stage. Most processors go beyond the standard by offering a facility to output in various encodings: but whether it is available and how to use it must be ascertained from the processor's documentation. Should it be unavailable or unreliable, the output may need to be post-processed through a character convertor to restore the original encoding, and again such software is freely available and easy to use.
275	D4-45-2	In the cases considered in the preceding section, there was a suitable Unicode code point corresponding to each abstract character contained in the non-Unicode character set of the input document. In such instances, the mandatory internal conversion to Unicode carried out by the processor can be more or less transparent to a user who wishes to continue to work with a non-Unicode character set. Things become rather different when the non-Unicode character set contains abstract characters for which there is no code point in the Unicode standard, or when a project that is attempting to work in Unicode throughout finds that it needs to represent abstract characters not currently provided for in the Unicode standard. Here, a significant difference between SGML and XML emerges in a rather troublesome way.
277	D4-45-2	Following their agenda to devise a subset of SGML that would be significantly easier to implement, the authors of the XML specification decided that one particular type of entity available in SGML, known as an internal SDATA entity, should not be carried over into XML. It would be idle to question that decision here, but its consequences for the handling of abstract characters for which there is no Unicode definition were significant.
279	D4-45-2	The procedures recommended in earlier versions of these Guidelines for encoding, processing and exchanging what we might call locally defined abstract characters were reliant on the availability of entities declared as of type SDATA, but that type is not supported in XML, and there is therefore no ready equivalent for XML-based projects to the recommendations previously offered.
280	D4-45-2	In essence, when an SGML parser encounters a reference to an entity of type SDATA, it supplies to the application which it is servicing the name of that entity, as found in the document, plus a pointer to a location somewhere on the local system, and what is present at that location may in turn allow or instruct the application to do one of a number of things, including looking up the entity name in a table and deriving information about the referenced entity which can trigger specific behaviours in the application appropriate to the processing of that abstract character. There is however no way to make an XML parser do anything of the kind in response to an entity reference.
281	D4-45-2	Entities in XML are really only of two basic types, parsed and unparsed. Unparsed entities are of no relevance here. References to parsed entities in an XML document result in only one kind of behaviour: when they appear in the parser's input stream, the parser expects to be able to resolve them by locating a declaration in the document's internal or external subset which maps the entity name to its replacement text. The parser then inserts that replacement text into the document in place of the entity reference, which is discarded without trace. The act of replacement is not notified to the application, except where it fails because the entity is undeclared or the declaration is in some way defective (in which case the parser signals a fatal error and stops).
283	D4-45-2	Though for explanatory convenience much XML-related documentation, including these Guidelines, refers specifically to Character Entities and Character Entity References, a character entity in XML is not a distinct
285	D4-45-2	in the sense that
287	D4-45-2	is understood in Computer Science terminology, for example when referring to the type of an attribute. Hence there is no way in which editing or other software can check that the replacement to be inserted is indeed a single character or its equivalent rather than an arbitrary chunk of text, possibly including markup. A character entity is simply a general entity whose replacement text happens to be declared as a character value or an NCR representing that value. This has two important consequences if it is proposed to use such an entity reference to stand for a character that has no Unicode equivalent. First, the entity name reference will disappear at an early stage in the parse and be replaced by the declared value of the entity, so that no processing which requires access in the parsed document to the entity reference as originally entered is possible. Secondly, if a character entity is to be used as a true equivalent to a normal character, and consequently be employed at all points in a document where a single character could legitimately occur (apart from in element and attribute names, where no references of any kind are allowed) then it is essential that its replacement value indeed be pure character data. If the replacement value of the entity were to contain any markup, or a processing instruction, there would be many places in a document where simple character data would be legitimate, but where the substitution of markup or some other replacement could cause the document to become invalid or malformed. Taken together, these considerations mean that the transparent use of a CER to stand for a non-Unicode character in an XML document is simply not possible.
299	D4-46-1	The principles of Unicode are judiciously tempered with pragmatism. This means, among other things, that the actual repertoire of characters which the standard encodes, especially those parts dating from its earlier days, include a number of items which on a strict interpretation of the Unicode Consortium's theoretical approach should not have been regarded as abstract characters in their own right. Some of these characters are grouped
302	D4-46-1	. Ligatures are a case in point. Ligatures (e.g. the joining of adjacent lowercase letters
303	D4-46-1	s
307	D4-46-1	f
310	D4-46-1	in Latin scripts, whether produced by a scribal practice of not lifting the pen between strokes or dictated by the aesthetics of a type design) are representational features with no added semantic value beyond that of the two letters they unite (though for historians of typography their presence and form in a given edition may be of scholarly significance). However, by the time the Unicode standard was first being debated, it had become common practice to include single glyphs representing the more common ligatures in the repertoires of some typesetting devices and high-end printers, and for the coded character sets built into those devices to use a single code point for such glyphs, even though they represent two distinct abstract characters. So as to increase the acceptance of Unicode among the makers and users of such devices, it was agreed that some such pseudo-characters should be incorporated into the standard as compatibility characters. Nevertheless, if a project requires the presence of such ligatured forms to be encoded, this should normally be done via markup, not by the use of a compatibility character. That way, the presence of the ligature can still be identified (and, if desired, rendered visually) where appropriate, but indexing and retrieval software will treat the code points in the document as a simple sequential occurrence of the two constituent characters concerned and so correctly align their semantics with non-ligatured equivalents. Such ligatures should not be confused with digraphs (usually) indicating diphthongs, as in the French word "cœur". A digraph is an atomic orthographic unit representing an abstract character in its own right, not purely an amalgamation of glyphs, and indexing and retrieval software must treat it as such. Where a digraph occurs in a source text, it should normally be encoded using the appropriate code point for the single abstract character which it represents, either by direct entry of the character concerned or through the appropriate CER or NCR.
316	D4-46-2	The treatment of characters with diacritical marks within Unicode shows a similar combination of rigour and pragmatism. It is obvious enough that it would be feasible to represent many characters with diacritical marks in Latin and some other scripts by a sequence of code points, where one code point designated the base character and the remainder represented one or more diacritical marks that were to be combined with the base character to produce an appropriate glyphic rendering of the abstract character concerned. From its earliest phase, the Unicode Consortium espoused this view in theory but was prepared in practice to compromise by assigning single code points to
318	D4-46-2	characters which were already commonly assigned a single distinctive code point in existing encoding schemes. This means, however, that for quite a large number of commonly-occurring abstract characters, Unicode has two different, but logically and semantically equivalent encodings: a
320	D4-46-2	single code point, and a code point sequence of a base character plus one or more
323	D4-46-2	normalization
324	D4-46-2	of Unicode documents. Normalization is the process of ensuring that a given abstract character is represented in one way only in a given Unicode document or document collection. The Unicode Consortium provides four standard normalization forms, of which the Normalization Form C (NFC) seems to be most appropriate for text encoding projects. The NFC, as far as possible, defines conversions for all base characters followed by one or more combining characters into the corresponding precomposed characters. The World Wide Web Consortium has produced a document entitled
328	D4-46-2	, which among other things discusses normalization issues and outlines some relevant principles. An authoritative reference is Unicode Standard Annex #15
331	D4-46-2	. Individual projects will have to decide how far their decisions on normalization need be influenced by the fact that at present, by no means all hardware or software can correctly render (or even consistently identify) abstract characters encoded using combining symbols.
333	D4-46-2	It is important that every Unicode-based project should agree on, consistently implement and fully document a comprehensive and coherent normalization practice. As well as ensuring data integrity within a given project, a consistently implemented and properly documented normalization policy is essential for successful document interchange.
339	D4-46-3	In addition to the Universal Character Set itself, the Unicode Consortium maintains a database of additional character semantics
340	D4-46-3	. This includes names for each character code point and normative properties for it. Character properties, as given in this database, determine the semantics and thus the intended use of a code point or character. It also contains information that might be needed for correctly processing this character for different purposes. This database is an important reference in determining which Unicode code point to use to encode a certain character.
342	D4-46-3	In addition to the printed documentation and lists made available by the Unicode Consortium, the information it contains may also be accessed by a number of search systems over the Web (e.g.
343	D4-46-3	). Examples of character properties included in the database include case, numeric value, directionality, and, where applicable, status as a
349	D4-46-3	. Where a project undertakes local definition of characters with code points in the PUA, it is desirable that any relevant additional information about the characters concerned should be recorded in an analogous way, as further discussed under
357	D4-47	An important difference between SGML and XML is that the latter allows for the processing of non-validated documents. Since validity and validation are central TEI concerns, it is unlikely that documents prepared according to these Guidelines will ever be designed or implemented as merely well-formed in the XML sense. However, in the domain of XML technologies, even where a document invokes a DTD or schema, it is not always necessarily the case that an XML processor will perform a full validation of it. XSLT transformation is a common case in point. By the workflow stage at which a document is handed off to an XSLT process for transformation, it is likely that its associated DTD or schema will already have fulfilled its role of integrity assurance and quality control, and so it may be undesirable to add validation to the processing overhead. For this reason, most XSLT processors do not attempt validation by default, even if a DTD or schema is declared and accessible. This can, however, create a problem where parsed entities (and character entities in particular in the present context) are referenced. A validating parser reads all entity declarations from the DTD (including those for character entities) in the initial phase of processing, so that they can be resolved as and when required. However, where no validation takes place, it cannot automatically be assumed that the parser will be able to resolve such entities in all circumstances. The XML standard requires a non-validating parser to read and act on entity declarations only if they are located within the document's internal subset (which does not, of course, mean that the entity declarations have to be manually merged into the document instance in advance of processing: character entity sets, for instance, count as being in the internal subset if they are placed there via a parameter entity, as is normal TEI practice). Some parsers when in non-validating mode will also access entity declarations in the external subset, but this behaviour is not mandated by the standard and should not be relied upon. Provided these facts are borne in mind, the presence of character entities in a document when parser validation is switched off should not cause any difficulties.
363	D4-48	In theory it should not be necessary for encoders to have any knowledge of the various ways in which Unicode code points can be represented internally within a document or in the memory of a processing system, but experience shows that problems frequently arise in this area because of mistaken practice or defective software, and in order to recognize the resulting symptoms and correct their causes an outline knowledge of certain aspects of Unicode internal representation is desirable.
368	D4-48-1	The code points assigned by Unicode 3.0 and later are notionally 32-bit integers, and the most straightforward way to represent each such integer in computer storage would be to use 4 eight-bit bytes. However, many of the code points for characters most commonly used in Latin scripts can be represented in one byte only and the vast majority of the remainder which are in common use (including those assigned from the most frequently used PUA range) can be expressed in two bytes alone. This accounts for the use of UTF-8 and UTF-16 and their special place in the XML standard. UTF-8 and UTF-16 are ways of representing 32-bit code points in an economical way.
369	D4-48-1	UTF-8 is a variable-length encoding: the more significant bits there are in the underlying code point (or in everyday terminology the bigger the number used to represent the character), the more bytes UTF-8 uses to encode it. What makes UTF-8 particularly attractive for representing Latin scripts, explaining its status as the default encoding in XML documents, is that all code points that can be expressed in seven or fewer bits (the 128 values in the original ASCII character set) are also encoded as the same seven or fewer bits (and therefore in a single byte) in UTF-8. That is why a document which is actually encoded in pure 7-bit ASCII can be fed to an XML processor without alteration and without its encoding being explicitly declared: the processor will regard it as being in the UTF-8 representation of Unicode and be able to handle it correctly on that basis.
371	D4-48-1	However, even within the domain of Latin-based scripts, some projects have documents which use characters from 8 bit extensions to ASCII, e.g. those in the ISO-8859-n series of encodings, and the way characters which under ISO-8859-n use all eight bits are encoded in UTF-8 is significantly different, giving rise to puzzling errors. Abstract characters that have a
373	D4-48-1	byte code point where the highest bit is set (that is, they have a decimal numeric representation between 128 and 255) are encoded in ISO-8859-n as a
375	D4-48-1	byte with the same value as the code point. But in UTF-8 code-point values inside that range are expressed as a
377	D4-48-1	byte sequence. That is to say, the abstract character in question is no longer represented in the file or in memory by the same number as its code-point value: it is
379	D4-48-1	(hence the T in UTF) into a sequence of two different numbers. Now as a side-effect of the way such UTF-8 sequences are derived from the underlying code-point value, many of the single-byte eight-bit values employed in ISO-8859-n encodings are illegal in UTF-8.
381	D4-48-1	This complicated situation has a simple consequence which can cause great bewilderment. XML processors will effortlessly handle character data in pure 7-bit ASCII without that encoding needing to be declared to the parser, and will similarly accept documents encoded in an undeclared ISO-8859-n encoding if they happen to use no characters outside the strict ASCII subset of the ISO character sets; but the parse will immediately fail if an eight-bit character from an ISO-8859-n set is encountered in the input stream, unless the document's encoding has been explicitly and correctly declared. Explicitly declaring the encoding ought to solve the problem, and if the file is correctly encoded throughout, it will do so. But since text editors and word processors are currently acquiring different degrees of Unicode support at different rates, projects are likely to find that they have to deal with some files encoded in UTF-8 along with others in, say, ISO-8859-1. Such encoding differences may go unnoticed, especially if the proportion of characters where the internal encodings are distinguishable is relatively small (for example in a long English text with a smattering of French words). If in the process of document preparation two such files have been merged, or intermixed via
389	D4-48-1	Where erroneously mixed encodings are the source of such an error, altering the encoding declaration will not solve the problem, though it may obfuscate it. Eight-bit character codes in a file declared as UTF-8 will always stop the parser. More insidiously, UTF-8 sequences in a file declared as ISO-8859-1 will not halt the parse, but will cause data corruption, because the parser will silently but erroneously convert each byte in every UTF-8 sequence into a spurious separate character, introducing semantic errors which may not become apparent until much later in the processing chain.
391	D4-48-1	In projects that routinely handle documents in non-Latin scripts, everyone is well aware of the need to ensure correct and consistent encoding, so in such places mixed encoding problems seldom arise, and when they do are readily identified and remedied. Real confusion tends to arise, however, in projects which have a low awareness of the issues because they employ predominantly unaccented Latin characters, with only thinly-distributed instances of accented letters, or other
394	D4-48-1	non-breaking space
395	D4-48-1	). Even, or especially, if such projects view themselves as concerned only with English documents, the close relationship between XML and Unicode means they will need to acquire an understanding of these encoding issues and develop procedures which assure consistency and integrity of encoding and its correct declaration, including the use of appropriate software for transcoding and verification.
401	D4-48-2	The advantages of UTF-8 as an internal representation of Unicode code points outlined above do not obtain where documents are in scripts other than Latin, Cyrillic or Hebrew. Where characters with code points in the sixteen-bit range (two-byte) predominate, UTF-8 is inappropriate, because it requires three or more bytes to represent each abstract character. Here the preferred representation of Unicode (which all XML-conformant parsers must support) is UTF-16, where each code point corresponding to an abstract character is represented in two eight-bit bytes
404	D4-48-2	values to represent code points beyond the 16-bit range is passed over here, since it adds a complication that does not affect the key points at issue
405	D4-48-2	. This encoding presents a different hazard, especially while support for Unicode in editing software is relatively uneven and immature. Because the code points are represented as sixteen-bit integers stored (in most popular computers) in two separate bytes, the order in which those bytes are stored becomes important. This is dependent on the underlying hardware. In the realm of desktop computing, Macintosh machines, for example, store (on disk as well as in memory) byte pairs representing 16-bit integers with the higher-value byte first, whereas PCs using Intel processors store the bytes in the reverse order (this is often referred to with Swiftian nomenclature as
409	D4-48-2	byte order). This means that if a semantically identical plain text file encoded in UTF-16 is prepared on a Macintosh and on a PC, and the two files are then saved to disk, each byte pair in one file will be in the reverse order from the corresponding byte pair in the other file. To avoid the obvious incompatibility problems, the XML standard requires that all documents whose declared encoding is UTF-16 must begin with a special pseudo-character which is not itself part of the document, but merely a Byte Order Mark (BOM) from which the processor can determine the byte order of the document that follows. Now the insertion of a correct BOM and the consistent maintenance of the byte order throughout the file ought to be taken care of transparently by software, but experience, especially from environments where work is distributed across big-endian and little-endian hardware, shows that this cannot always be taken for granted in the current state of software development. As with mixed encoding problems involving UTF-8, inconsistent byte-order in UTF-16 files seems to be the result of merging or cutting and pasting between files using software which does not correctly enforce byte order integrity, and out of misconceived
411	D4-48-2	which conceals byte-order inconsistencies from the user. Once more, the result can be files which look correct in an editor, but which the XML parser either rejects outright or silently passes on in a seriously garbled form. Again, to avoid the consequent errors, projects need to cultivate an informed awareness of relevant encoding issues and devise policies to avoid them in the first place or detect them at an early stage.
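
The encoding behaviour described in the D4-45-1 and D4-48 rows above can be reproduced with a few lines of Python. This is only an illustrative sketch, not part of the Guidelines: the sample string and all variable names are assumptions chosen for the example, and Python's codecs merely stand in for what a conformant XML parser does when it converts its input to Unicode.

```python
import codecs

# Hypothetical sample text: "deja vu" with e-acute (U+00E9) and a-grave (U+00E0).
sample = "d\u00e9j\u00e0 vu"

# The same abstract characters are stored as different bytes:
latin1_bytes = sample.encode("iso-8859-1")  # e-acute is the single byte 0xE9
utf8_bytes = sample.encode("utf-8")         # e-acute is the two bytes 0xC3 0xA9

# ISO-8859-1 bytes read as UTF-8 are illegal sequences, so decoding stops,
# much as a parser halts on an eight-bit file that is treated as UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_read_as_utf8_fails = False
except UnicodeDecodeError:
    latin1_read_as_utf8_fails = True

# UTF-8 bytes read as ISO-8859-1 raise no error: each byte silently becomes
# a spurious separate character (data corruption rather than a failure).
mojibake = utf8_bytes.decode("iso-8859-1")

# Rewriting non-ASCII characters as Numeric Character References leaves a
# pure 7-bit ASCII document that needs no encoding declaration.
ncr_form = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")

# UTF-16: the two byte orders store each 16-bit value reversed.
big_endian = sample.encode("utf-16-be")
little_endian = sample.encode("utf-16-le")

# A leading BOM tells the decoder which order follows; it is consumed and
# does not become part of the text.
with_bom = codecs.BOM_UTF16_BE + big_endian
decoded_with_bom = with_bom.decode("utf-16")

# Reading big-endian data as little-endian succeeds here but yields garbage.
garbled = big_endian.decode("utf-16-le")
```

In this sketch the wrongly decoded UTF-8 text is two characters longer than the original, which is the kind of silent corruption that may go unnoticed in a long English text containing only a few accented letters.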

ST-Infrastructure.xml#13092

#	id	text
2	ST	The TEI Infrastructure
9	ST	The TEI encoding scheme consists of a number of
12	ST	classes
13	ST	. Another part defines its possible content and attributes with reference to these classes. This indirection gives the TEI system much of its strength and its flexibility. Elements may be combined more or less freely to form a
15	ST	appropriate to a particular set of requirements. It is also easy to add new elements which reference existing classes or elements to a schema, as it is to exclude some of the elements provided by any module included in a schema.
17	ST	In principle, a TEI schema may be constructed using any combination of modules. However, certain TEI modules are of particular importance, and should always be included in all but exceptional circumstances: the module
25	ST	provides declarations for the metadata elements and attributes constituting the TEI header, a component which is required for TEI conformance, while the
30	ST	The specification for a TEI schema is itself a TEI document, using elements from the module described in chapter
40	ST	The bulk of this chapter describes the TEI infrastructure module itself. Although it may be skipped at a first reading, an understanding of the topics addressed here is essential for anyone planning to take full advantage of the TEI customization techniques described in chapter
43	ST	The chapter begins by briefly characterizing each of the modules available in the TEI scheme. Section
44	ST	describes in general terms the method of constructing a TEI schema in a specific schema language such as XML DTD language, RELAX NG, or W3C Schema.
46	ST	The next and largest part of the chapter introduces the attribute and element classes used to define groups of elements and their characteristics (section
52	ST	, which are used to express some commonly used content models, and lists the
54	ST	used to constrain the range of legal values for TEI attributes (section
58	STMA	TEI Modules
64	STMA	a formal declaration, expressed using a special-purpose XML vocabulary defined by these Guidelines in combination with elements taken from the ISO schema language RELAX NG
69	STMA	Each chapter of the Guidelines presents a group of related elements, and also defines a corresponding set of declarations, which we call a
71	STMA	. All the definitions are collected together in the reference sections provided as an appendix. Formal declarations for a given chapter are collected together within the corresponding module. For convenience, each element is assigned to a single module, typically for use in some specific application area, or to support a particular kind of usage. A module is thus simply a convenient way of grouping together a number of associated element declarations. In the simple case, a TEI schema is made by combining together a small number of modules, as further described in section
74	STMA	The following table lists the modules defined by the current release of the Guidelines:
78	tab-mods	Module name
86	tab-mods	analysis
93	tab-mods	certainty
100	tab-mods	core
107	tab-mods	corpus
115	tab-mods	dictionaries
122	tab-mods	drama
129	tab-mods	figures
136	tab-mods	gaiji
143	tab-mods	header
150	tab-mods	iso-fs
157	tab-mods	linking
164	tab-mods	msdescription
171	tab-mods	namesdates
178	tab-mods	nets
185	tab-mods	spoken
192	tab-mods	tagdocs
199	tab-mods	tei
201	tab-mods	TEI Infrastructure
207	tab-mods	textcrit
214	tab-mods	textstructure
221	tab-mods	transcr
228	tab-mods	verse
236	STMA	For each module listed above, the corresponding chapter gives a full description of the classes, elements, and macros which it makes available when it is included in a schema. Other chapters of these Guidelines explore other aspects of using the TEI scheme.
240	STIN	Defining a TEI Schema
243	STIN	. For a valid TEI document, this schema must be a conformant TEI schema, as further defined in chapter
246	STIN	be made explicit. The method of doing this recommended by these Guidelines is to provide explicitly or by reference a TEI schema specification against which the document may be validated.
248	STIN	A TEI-conformant schema is a specific combination of TEI modules, possibly also including additional declarations that modify the element and attribute declarations contained by each module, for example to suppress or rename some elements. The TEI provides an application-independent way of specifying a TEI schema by means of the
251	STIN	. The same system may also be used to specify a schema which extends the TEI by adding new elements explicitly, or by reference to other XML vocabularies. In either case, the specification may be processed to generate a formal schema, expressed in a variety of specific schema languages, such as XML DTD language, RELAX NG, or W3C Schema. These output schemas can then be used by an XML processor such as a validator or editor to validate or otherwise process documents. Further information about the processing of a TEI formal specification is given in chapter
257	STINsimpleExample	The simplest customization of the TEI scheme combines just the four recommended modules mentioned above. In ODD format, this schema specification takes this form:
272	STINsimpleExample	). An ODD processor will generate an appropriate schema from this set of declarations, expressed using the XML DTD language, the ISO RELAX NG language, the W3C Schema language, or in principle any other adequately powerful schema language. The resulting schema may then be associated with the document instance by one of a number of different mechanisms, as further described in chapter
273	STINsimpleExample	. The start point (or root element) of document instances to be validated against the schema is specified by means of the
282	STINlargerExample	These Guidelines introduce each of the modules making up the TEI scheme one by one, and therefore, for clarity of exposition, each chapter focusses on elements drawn from a single module. In reality, of course, the markup of a text will draw on elements taken from many different modules, partly because texts are heterogeneous objects, and partly because encoders have different goals. Some examples of this heterogeneity include:
284	STINlargerExample	a text may be a collection of other texts of different types: for example, an anthology of prose, verse, and drama;
286	STINlargerExample	a text may contain other smaller, embedded texts: for example, a poem or song included in a prose narrative;
288	STINlargerExample	some sections of a text may be written in one form, and others in a different form: for example, a novel where some chapters are in prose, others take the form of dictionary entries, and still others the form of scenes in a play;
290	STINlargerExample	an encoded text may include detailed analytic annotation, for example of rhetorical or linguistic features;
292	STINlargerExample	an encoded text may combine a literal transcription with a diplomatic edition of the same or different sources;
294	STINlargerExample	the description of a text may require additional specialized metadata elements, for example when describing manuscript material in detail.
297	STINlargerExample	The TEI provides mechanisms to support all of these and many other use cases. The architecture permits elements and attributes from any combination of modules to co-exist within a single schema. Within particular modules, elements and attributes are provided to support differing views of the
301	STINlargerExample	a definition of a corpus or collection as a series of
303	STINlargerExample	documents, sharing a common TEI header (see chapter
306	STINlargerExample	a definition of composite texts which combine optional front- and back-matter with a group of collected texts, themselves possibly composite (see section
317	STINlargerExample	Subsequent chapters of these Guidelines describe in detail markup constructs appropriate for these and many other possible features of interest. The markup constructs can be combined as needed for any given set of applications or project.
319	STINlargerExample	For example, a project aiming to produce an ambitious digital edition of a collection of manuscript materials, including detailed metadata about each source, digital images of the content, a detailed transcription of each source, and a supporting biographical and geographical database, might need a schema combining several modules, as follows:
348	STINlargerExample	The TEI architecture also supports more detailed customization beyond the simple selection of modules. A schema may suppress elements from a module, suppress some of their attributes, change their names, or even add new elements and attributes. Detailed discussion of the kind of modification possible in this way is provided in
349	STINlargerExample	and conformance rules relating to their application are discussed in
350	STINlargerExample	. These facilities are available for any schema language (though some features may not be available in all languages). The ODD language also makes it possible to combine TEI and non-TEI modules into a single schema, provided that the non-TEI module is expressed using the RELAX NG schema language (see further
356	STEC	The TEI Class System
358	STEC	The TEI scheme distinguishes about five hundred different elements. To aid comprehension, modularity, and modification, the majority of these elements are formally classified in some way. Classes are used to express two distinct kinds of commonality among elements. The elements of a class may share some set of attributes, or they may appear in the same locations in a content model. A class is known as an
360	STEC	if its members share attributes, and as a
362	STEC	if its members appear in the same locations. In either case, an element is said to
364	STEC	properties from any classes of which it is a member.
372	STEC	A basic understanding of the classes into which the TEI scheme is organized is strongly recommended and is essential for any successful customization of the system.
377	STECAT	An attribute class groups together elements which share some set of common attributes. Attribute classes are given names composed of the prefix
385	STECAT	attribute, both of which are inherited from their membership in the class rather than individually defined for each element. These attributes are said to be defined by (or inherited from) the
387	STECAT	class. If another element were to be added to the TEI scheme for which these attributes were considered useful, the simplest way to provide them would be to make the new element a member of the
389	STECAT	class. Note also that this method ensures that the attributes in question are always defined in the same way, taking the same default values etc., no matter which element they are attached to.
391	STECAT	Some attribute classes are defined within the
393	STECAT	infrastructural module and are thus globally available. Other attribute classes are specific to particular modules and thus defined in other chapters. Attributes defined by such classes will not be available unless the module concerned is included in a schema.
439	STECAT	when the
441	STECAT	module is included in a schema. If, however, this module is not included in a schema, then the
447	STECAT	, is common to all modules, and is therefore described in some detail in the next section. A full list of all attribute classes is given in
453	STGA	The following attributes are defined for every TEI element.
458	STGA	These attributes are optionally available for any TEI element; none of them is required. Their usage is discussed in the following subsections.
463	STGAid	The value supplied for the
466	STGAid	name
472	STGAid	The colon is also by default a valid name character; however, it has a specific purpose in XML (to indicate namespace prefixes), and may not therefore be used in any other way within a name.
476	STGAid	in an XML TEI document) uppercase and lowercase letters are distinguished, and thus
493	STGAid	attribute also provides an identifying name or number for an element, but in this case the information need not be a legal
495	STGAid	value. Its value may be any string of characters; typically it is a number or other similar enumerator or label. For example, the numbers given to the items of a numbered list may be recorded with the
497	STGAid	attribute; this would make it possible to record errors in the numeration of the original, as in this list of chapters, transcribed from a faulty original in which the number 10 is used twice, and 11 is omitted:
521	STGAid	As noted above there is no requirement to record a value for either the
525	STGAid	attribute. Any XML processor can identify the sequential position of one element within another in an XML document without any additional tagging. An encoding in which each line of a long poem is explicitly labelled with its numerical sequence such as the following
539	STGAla	attribute indicates the natural language and writing system applicable to the content of a given element. If it is not specified, the value is inherited from that of the immediately enclosing element. As a rule, therefore, it is simplest to specify the base language of the text on the
541	STGAla	element, and allow most elements to take the default value for
543	STGAla	; the language of an element then need be explicitly specified only for elements in languages other than the base language. For this reason, it is recommended practice to supply a default value for the
547	STGAla	root element, or on both the
551	STGAla	element. The latter is appropriate in the not uncommon case where the text element in a TEI document uses a different default language from that of the TEI header attached to it. Other language shifts in the source should be explicitly identified by use of the
555	STGAla	In the following example schematic, an English language TEI header is attached to an English language text:
565	STGAla	The same effect would be obtained by specifying the default language for both header and text:
575	STGAla	The latter approach is necessary in the case where the two differ: for example, where an English language header is applied to a French text:
585	STGAla	The same principle applies at any hierarchic level. In the following example, the default language of the text is French, but one section of it is in German:
614	STGAla	element, by contrast, because it is in the same language as its parent.
622	STGAla	Note that in cases where it is advisable or necessary to identify the language of the text that is pointed at, the (non-global) attribute
625	STGAla	the pointer references text written in French.
634	STGAla	Additional information about a particular language may be supplied in the
636	STGAla	element within the header (see section
649	STGAre	attributes are all used to give information about the physical presentation of the text in the source. In the following example,
651	STGAre	is used to indicate that both the emphasized word and the proper name are printed in italics:
669	STGAre	elements are rendered in the text by italics, it will be more convenient to register that fact in the TEI header once and for all (using the
675	STGAre	value only for any elements which deviate from the stated rendition.
681	STGAre	is that the value used for the former may contain one or more tokens from any vocabulary devised by the encoder, separated by space characters, whereas the value used for the latter must be a single string taken from a formally-defined style definition language such as CSS. The
683	STGAre	attribute values are a sequence-indeterminate set of whitespace-separated tokens, whereas
685	STGAre	values allow whitespace and sequence relationships as part of the formally-defined style definition language.
692	STGAre	element can then be associated with any element, either by default, or by means of the global
724	STGAre	elements, each of which defines some aspect of the rendering or appearance of the text in its original form. These details may most conveniently be described using a formal style definition language, such as CSS (
726	STGAre	); in some other formal language developed for a specific project; or even informally in running prose. Although languages such as CSS and XSL-FO are generally used to describe document output to screen or print, they nonetheless provide formal and precise mechanisms for describing the appearance of source documents, especially print documents, but also many aspects of manuscript documents. For example, both CSS and XSL-FO provide mechanisms for describing typefaces, weight, and styles; character and line spacing; and so on.
730	STGAre	attribute is provided for encoders wishing to describe the appearance of individual source elements using a language such as CSS directly rather than by reference to a
732	STGAre	element. Its value may be any expression in the chosen formal style definition language.
734	STGAre	Formal definition languages such as CSS typically identify a series of
738	STGAre	are specified. A sequence of such property-value pairs makes up a stylesheet. The TEI uses such languages simply to describe the appearance of a source document, rather than to control how it should be formatted.
740	STGAre	In the TEI scheme, it is possible to supply information about the appearance of elements within a source document in the following distinct ways:
742	STGAre	One or more properties may be specified as the default for all elements of a given type, using the
750	STGAre	attribute with any convenient set of one or more sequence-indeterminate tokens;
758	STGAre	One or more properties may be supplied explicitly for individual element occurrences, using the
764	STGAre	If the same property is specified in more than one of the above ways, the one with the highest number in the list above is understood to be applicable. The resulting properties from each way are then combined to provide the full set of property-value pairs applicable to the given element, and (by default) to all of its children.
768	STGAre	attribute to indicate a different language for one or more
772	STGAre	attribute, if this is used in combination with either
778	STGAre	Note that these TEI attributes always describe the rendition or appearance of the source document,
786	STGAba	Several TEI elements carry attributes whose values are defined as
788	STGAba	, meaning that such attributes supply a link or pointer, typically expressed as a URL. Like other XML applications, the TEI allows use of a special attribute to set the context within which relative URLs are to be evaluated. The global attribute
790	STGAba	is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. We do not describe it in detail here: reference information about
797	STGAba	is used to set a context for all relative URLs within the scope of the element on which it is specified. For example:
816	STGAba	which supplies a value for
824	STGAba	which does not change the default context, and its target is therefore some element within the current document with the value
828	STGAba	attribute. Further discussion of this element and its effect on TEI linking methods is provided in chapter
837	STGAxs	provides a mechanism for indicating to systems processing an XML file how they should treat whitespace, that is, any sequences of consecutive tab (#x09), space (#x20), carriage return (#x0D) or linefeed (#x0A) characters. Like
839	STGAxs	this attribute is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. Complete information about this attribute is provided by
841	STGAxs	; here we provide a summary of how its use affects users of the TEI scheme.
848	STGAxs	default
849	STGAxs	. The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is when the document is processed. The second (which is implied when the attribute is not supplied), indicates that whitespace should be handled
853	STGAxs	These Guidelines assume one of two different ways of processing whitespace will apply in a given case, depending on an element's content model. For an element that can contain only other elements with no intervening non-whitespace characters, whitespace is considered to have no semantic significance, and should therefore be discarded by a processor. For example, in a
863	STGAxs	since non-whitespace text is not permitted between the
875	STGAxs	element has a content model containing only elements: any punctuation or whitespace required between the lines of an address must therefore be supplied by the processor, as any whitespace present in the input document will be ignored.
877	STGAxs	Elements with content models of this type are comparatively unusual in the TEI: a list of them is provided in the TEI release file
883	STGAxs	Most TEI elements permit what is known as mixed-content: that is, they can contain both text and other elements. Here the assumption of these Guidelines is that whitespace will be normalized. This means that all space, carriage return, linefeed, and tab characters are converted into spaces, all consecutive spaces are then deleted and replaced by one space, and then space immediately after a start-tag or immediately before an end-tag is deleted. The result is that this encoding,
899	STGAxs	. The space before his name has been removed, a space is included between his forenames, the comma is preserved, and the newlines within his name have all been removed.
902	STGAxs	If the default treatment described above is not appropriate for a mixed content element, the processing required may be described in the
904	STGAxs	element of the TEI header, but generic XML processing tools may not take note of this.
908	STGAxs	attribute may be supplied with a value of
910	STGAxs	in order to indicate that every space, tab, carriage return and linefeed character found within that element in the document being processed is significant. Typically, the result of that processing will be to retain the whitespace characters in the output. Thus if the above example began
911	STGAxs	persName xml:space="preserve"
912	STGAxs	, the resulting text would most likely be rendered over five lines, indented, and with a blank line following.
916	STGAxs	attribute is rarely used in TEI documents because such layout features are generally captured with less risk and more precision by using native TEI elements such as
983	STECCM	As noted above, the members of a given TEI model class share the property that they can all appear in the same location within a document. Wherever possible, the content model of a TEI element is expressed not directly in terms of specific elements, but indirectly in terms of particular model classes. This makes content models simpler and more consistent; it also makes them much easier to understand and to modify.
985	STECCM	Like attribute classes, model classes may have subclasses or superclasses. Just as elements inherit from a class the ability to appear in certain locations of a document (wherever the class can appear), so all members of a subclass inherit the ability to appear wherever any superclass can appear. To some extent, the class system thus provides a way of reducing the whole TEI galaxy of elements into a tidy hierarchy. This is however not entirely the case.
987	STECCM	In fact, the nature of a given class of elements can be considered along two dimensions: as noted, it defines a set of places where the class members are permitted within the document hierarchy; it also implies a semantic grouping of some kind. For example, the very large class of elements which can appear within a paragraph comprises a number of other classes, all of which have the same structural property, but which differ in their field of application. Some are related to highlighting, while others relate to names or places, and so on. In some cases, the
988	STECCM	set of places where class members are permitted
989	STECCM	is very constrained: it may just be within one specific element, or one class of element, for example. In other cases, elements may be permitted to appear in very many places, or in more than one such set of places.
991	STECCM	These factors are reflected in the way that model classes are named. If a model class has a name containing
997	STECCM	then it is primarily defined in terms of its structural location. For example, those elements (or classes of element) which appear as content of a
1001	STECCM	class; those which appear as content of a
1005	STECCM	class. If, however, a model class has a name containing
1011	STECCM	, the implication is that its members all have some additional semantic property in common, for example containing a bibliographic description, or containing some form of name, respectively. These semantically-motivated classes often provide a useful way of dividing up large structurally-motivated classes: for example, the very general structural class
1014	STECCM	data elements that form part of a paragraph
1015	STECCM	) has four semantically-motivated member classes (
1025	STECCM	Although most classes are defined by the
1029	STECCM	, but instead gain their members as a consequence of individual elements' declaration of their membership. The same class may therefore contain different members, depending on which modules are active. Consequently, the content model of a given element (being expressed in terms of model classes) may differ depending on which modules are active.
1031	STECCM	Some classes contain only a single member, even when all modules are loaded. One reason for declaring such a class is to make it easier for a customization to add new member elements in a specific place, particularly in areas where the TEI does not make fully elaborated proposals. For example, the TEI class
1035	STECCM	module to include just the TEI
1037	STECCM	element. A project wishing to add an alternative way of structuring text-critical information could do so by defining its own elements and adding them to this class.
1039	STECCM	Another reason for declaring single-member classes is where the class members are not needed in all documents, but appear in the same place as elements which are very frequently required. For example, the specialized element
1041	STECCM	used to represent a non-Unicode character or glyph is provided as the only member of the
1043	STECCM	class when the
1045	STECCM	module is added to a schema. References to this class are included in almost every content model, since if it is used at all the
1047	STECCM	must be available wherever text is available; however, these references have no effect unless the gaiji module is loaded.
1049	STECCM	At the other end of the scale, a few of the classes predefined by the tei module are subsequently populated with very many members. For example, the class
1051	STECCM	groups all the classes of element for simple editorial correction and transcription which can appear within a
1061	STECCM	element is one of the basic building blocks of a TEI document, it is not surprising that each module will need to add elements to it. The class system here provides a very convenient way of controlling the resulting complexity. Typically, elements are not added directly to these very general classes, but via some intermediate semantically-motivated class.
1063	STECCM	Just as there are a few classes which have a single member, so there are some classes which are used only once in the TEI architecture. These classes, which have no superclass and therefore do not fit into the class hierarchy defined here, are a convenient way of maintaining elements which are highly structured internally, but which appear from the outside to be uniform objects like others at the same level.
1067	STECCM	Members of such classes can only ever appear within one element, or one class of elements. For example, the class
1069	STECCM	is used only to express the content model for the element
1071	STECCM	; it references some other classes of elements, which can appear elsewhere, and also some elements which can only appear inside an address.
1076	STBTC	Most TEI elements may also be informally classified as belonging to one of the following groupings:
1080	STBTC	high level, possibly self-nesting, major divisions of texts. These elements populate such classes as
1084	STBTC	, and typically form the largest component units of a text.
1091	STBTC	, either directly or by means of other classes such as
1105	STBTC	means any string of characters, and can apply to individual words, parts of words, and groups of words indifferently; it does not refer only to linguistically-motivated phrasal units. This may cause confusion for readers accustomed to applying the word in a more restrictive sense.
1109	STBTC	The TEI also identifies two further groupings derived from these three:
1121	STBTC	classes but rather a distinct grouping of elements which are both chunk-like and phrase-like. However, the classes
1132	STBTC	elements which can appear directly within texts or text divisions; this is a combination of the inter- and chunk- level elements defined above. These elements populate the class
1134	STBTC	, which is defined as a superset of the classes
1142	STBTC	Broadly speaking, the front, body, and back of a text each comprises a series of components, optionally grouped into divisions.
1144	STBTC	As noted above, some elements do not belong to any model class, and some model classes are not readily associated with any of the above informal groupings. However, over two-thirds of the
1145	STBTC	elements defined in the present edition of these Guidelines are classified in this way, and future editions of these recommendations will extend and develop this classification scheme.
1147	STBTC	A complete alphabetical list of all model classes is provided in
1269	STmacros	The infrastructure module defined by this chapter also declares a number of
1271	STmacros	, or shortcut names for frequently occurring parts of other declarations. Macros are used in two ways in the TEI scheme: to stand for frequently-encountered content models, or parts of content models (
1278	STECST	As far as possible, the TEI schemas use the following set of frequently-encountered content models to help achieve consistency among different elements.
1290	STECST	The present version of the TEI Guidelines includes some
1292	STECST	shows, in descending order of frequency, the seven most commonly used content models.
1306	DTYPES	The values which attributes may take in a TEI schema are defined, for the most part, by reference to a TEI
1307	DTYPES	datatype
1308	DTYPES	. Each such datatype is defined in terms of other primitive datatypes, derived mostly from
1310	DTYPES	, literal values, or other datatypes. This indirection makes it possible for a TEI application to set constraints either globally or in individual cases, by redefining the datatype definition or the reference to it respectively. In some cases, the TEI datatype includes additional usage constraints which cannot be enforced by existing schema languages, although a TEI-compliant processor should attempt to validate them (see further discussion in chapter
1313	DTYPES	Where literal values or name tokens are used in a datatype definition, an associated value list supplies definitions for the significance of suggested or (in the case of closed lists) all possible values.
1316	DTYPES	TEI-defined datatypes may be grouped into those which define normalized values for numeric quantities, probabilities, or temporal expressions, those which define various kinds of shorthand codes or keys, and those which define pointers or links.
1330	DTYPES	datatype include
1377	DTYPES	in the case of durations, times, and dates; W3C Schema datatypes in the case of truth values; BCP 47 in the case of language; and ISO 5218 in the case of sex.
1410	DTYPES	By far the largest number of TEI attributes take values which are coded values or names of some kind. These values may be constrained or defined in a number of different ways, each of which is given a different name, as follows:
1431	DTYPES	, are used to supply an identifier expressed as any kind of single token or word. The TEI places a few constraints on the characters which may be used for this purpose: only Unicode characters classified as letters, digits, punctuation characters, or symbols can appear in an attribute value of this kind. Note in particular that such values cannot include whitespace characters. Legal values include
1445	DTYPES	Where identifiers are defined externally, for example as part of a database or file system, the inability to include whitespace or other special characters in a value may be problematic. In other cases, it may also be simply more convenient to supply a short sequence of natural language words including spaces as a single value. For these reasons, we also provide a datatype
1459	DTYPES	. This datatype should be used with care since XML will not normalize whitespace characters within it: for example the values
1463	DTYPES	(three spaces) would be considered distinct. This case should be distinguished from that of an attribute permitting multiple values, each of which may be separated by whitespace which
1472	DTYPES	, but with the additional constraint that they must be legal XML identifiers, as defined by the XML 1.0 specification, or successors. Hence, they may not begin with digits or punctuation characters. Legal identifiers include
1494	DTYPES	supplied by
1498	DTYPES	above, with the added constraint that the word supplied is taken from a specific list of possibilities. In each case, the element or class specification which includes the definition for the attribute will also contain a list of possible values, together with a prose description of their intended significance. This list may be open (in which case the list is advisory), or closed (in which case it determines the range of legal values). In this latter case, the datatype will not be
1500	DTYPES	, but an explicit list of the possible values.
1515	DTYPES	An attribute may, of course, take more than one value of a given type, for example a list of pointer values, or a list of words. In the TEI scheme, this information is regarded as a property of the
1517	DTYPES	element used to document the attribute in question rather than as a distinct
1518	DTYPES	datatype
1525	STOV	The TEI Infrastructure Module
1529	STOV	module defined by this chapter is a required component of any TEI schema. It provides declarations for all datatypes, and initial declarations for the attribute classes, model classes, and macros used by other modules in the TEI scheme. Its components are listed below in alphabetical order:
1531	tei	TEI Infrastructure
1533	tei	Declarations for classes, datatypes, and macros available to all TEI modules
1547	STOV	The order in which declarations are made within the infrastructure module is critical, since several class declarations refer to others, which must therefore precede them. Other constraints on the order of declarations derive from the way in which the modularity of the TEI scheme is implemented in different schema languages. The XML DTD fragment implementing this TEI module makes extensive use of
1551	STOV	to effect a kind of conditional construction; the RELAX NG schema fragment similarly predeclares a number of patterns with null (
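
The whitespace normalization described in the STGAxs rows above (position 883) for mixed-content elements can be sketched in a few lines. This is only an illustrative sketch, not part of the TEI Guidelines or of any TEI tool: the function names collapse, normalize, and plain_text, and the sample persName markup, are assumptions made up for this example. It uses Python's standard xml.etree.ElementTree, treats every element as mixed content, and leaves any subtree marked xml:space="preserve" untouched (it does not handle a nested xml:space="default" override).

```python
import re
import xml.etree.ElementTree as ET

# Attribute name of xml:space as ElementTree reports it (XML namespace).
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Tab, space, carriage return and linefeed are the XML whitespace characters.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(s):
    """Turn every run of whitespace characters into a single space."""
    return _WS_RUN.sub(" ", s or "")


def normalize(elem):
    """Normalize whitespace in elem as if it were a mixed-content element."""
    if elem.get(XML_SPACE) == "preserve":
        return elem  # whitespace is declared significant: leave subtree as is
    children = list(elem)
    # Text right after elem's start-tag: drop the leading space.
    text = collapse(elem.text).lstrip(" ")
    if not children:
        # No children, so this text also ends right before elem's end-tag.
        text = text.rstrip(" ")
    elem.text = text
    for i, child in enumerate(children):
        normalize(child)
        tail = collapse(child.tail)
        if i == len(children) - 1:
            # Text right before elem's end-tag: drop the trailing space.
            tail = tail.rstrip(" ")
        child.tail = tail
    return elem


def plain_text(xml_string):
    """Parse a fragment, normalize it, and return its concatenated text."""
    return "".join(normalize(ET.fromstring(xml_string)).itertext())


# Hypothetical sample markup, spread over several indented lines.
sample = (
    "<persName>\n"
    "  <forename>Edward</forename>\n"
    "  <forename>George</forename>\n"
    "  <surname>Bulwer-Lytton</surname>,\n"
    "  Baron Lytton\n"
    "</persName>"
)

print(plain_text(sample))
```

With the sample above, the leading space is dropped, one space remains between the forenames, the comma is kept, and the newlines disappear. Working on the parsed tree rather than on the raw string keeps the "after a start-tag" and "before an end-tag" cases explicit, at the cost of ignoring content models: a real processor would first need to know which elements are element-only and which are mixed.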

ND-NamesDates.xml#13218

#	id	text
5	ND	it was noted that the elements provided in the core module allow an encoder to specify that a given text segment is a proper noun, or a
6	ND	referring string
7	ND	, and to specify the kind of object named or referred to only by supplying a value for the
11	ND	This module also provides elements for the representation of information about the person, place, or organization to which a given name is understood to refer and to represent the name itself, independently of its application. In simple terms, where the core module allows one simply to represent that a given piece of text is a
12	ND	name
14	ND	personal name
16	ND	person
18	ND	canonical name
23	ND	), place names (section
35	NDATTS	have specialized attributes which support linkage of a naming element with the entity (person, place, organization) being named; members of the class
37	NDATTS	have specialized attributes which support a number of ways of normalizing the date or time of the data encoded by the element concerned.
46	NDATTSnr	As discussed elsewhere, these attributes provide two different ways of associating any sort of name with its referent. For cases where all that is required is to provide some minimal information about the person named, for example their occupation or status, the
50	NDATTSnr	attribute. It also provides an additional attribute, which allows the name itself to be associated with a base or canonical form:
57	NDATTSnr	attribute should be used wherever it is possible to supply a direct link such as a URI to indicate the location of canonical information about the referent.
71	NDATTSnr	More than one URI may be supplied if the name refers to more than one person. For example, assuming the existence of another
85	NDATTSnr	attribute is provided for cases where no such direct link is required: for example because resolution of the reference is carried out by some local convention, or because the encoder judges that no such resolution is necessary. As an example of the first case, a project might maintain its own local database system containing canonical information about persons and places, each entry in which is accessed by means of some system-specific identifier constructed in a project-specific way from the value supplied for the
89	NDATTSnr	a similar method is used to link element descriptions to the modules or classes to which they belong, for example.
90	NDATTSnr	As an example of the second case, consider the use of well-established codifications such as country or airport codes, which it is probably unnecessary for an encoder to expand further:
98	NDATTSnr	, interchange is improved by use of tag URIs in
106	NDATTSnr	attribute has a more specialized use, where it is the name itself which is of interest rather than the person, place, or organization being named. See section
129	NDATTSda	attribute is used to specify a normalized form for any temporal expression, independently of how it is represented in the text, as in the following example:
138	NDATTSda	attribute provides a convenient way of associating an event or date with a named period. Its value is a pointer which should indicate some other element where the period concerned is more precisely defined. A convenient location for such definitions is the
144	NDATTSda	of a TEI Header. A
146	NDATTSda	may contain simply a bibliographic reference to an external definition for it. More usefully, it may also contain a series of
148	NDATTSda	elements, each with an identifier and a description. The identifier can then be used as the target for a
150	NDATTSda	attribute. For example, a taxonomy of named periods might be defined as follows:
186	NDATTSda	The other dating attributes provided by this class support a wide range of methods of specifying temporal information in a normalized form. Some simple examples follow:
204	NDATTSda	Normalization of date and time values permits the efficient processing of data (for example, to determine whether one event precedes or follows another). These examples all use the W3C standard format for representation of dates and times. Further examples, and discussion of some alternative approaches to normalization are given in section
214	NDPER	The core
218	NDPER	elements can distinguish names in a text but are insufficiently powerful to mark their internal components or structure. To conduct nominal record linkage or even to create an alphabetically sorted list of personal names, it is important to distinguish between a family name, a forename and an honorary title. Similarly, when confronted with a string such as
220	NDPER	, the analyst will often wish to distinguish amongst the various constituent elements present, since they provide additional information about the status, occupation, or residence of the person to whom the name belongs. The following elements are provided for these and related purposes:
225	NDPER	attributes mentioned above, all of the above elements are members of the class
234	NDPER	element irrespective of whether or not the components of the personal name are also to be marked.
238	NDPER	name type="person"
241	NDPER	attribute allows for further subcategorization of the personal name itself, for example as a
244	NDPER	birth
277	NDPER	elements because distinctive name components occurring within it can be marked as such.
280	NDPER	surname
281	NDPER	and additional personal names, often known as
311	NDPER	elements to provide further culture- or project-specific detail about the name component, for example:
340	NDPER	attribute are not constrained, and may be chosen as appropriate to the encoding needs of the project. They may be used to distinguish different kinds of forename or surname, as well as to indicate the function a name component fills within the whole. In this example, we indicate that a surname is toponymic, and also point to the specific place name from which it is derived:
353	NDPER	The value
355	NDPER	was suggested above for the not uncommon case where the whole of a surname is composed of several other surname elements. These nested surnames may be individually tagged as well, together with appropriate type values:
369	NDPER	attribute may be used to indicate whether a name is an abbreviation, initials, or given in full:
403	NDPER	Alternatively, it may be felt more appropriate to mark a patronymic as a distinct kind of name, neither a forename nor a surname, using the
429	NDPER	class; its effect is to state the sequence in which
433	NDPER	elements should be combined when constructing a sort key for the name.
471	NDPER	It is also often convenient to distinguish phrases (historically similar to the generational labels mentioned above) used to link parts of a name together, such as
477	NDPER	etc. It is often a matter of arbitrary choice whether such components are regarded as part of the surname or not; the
499	NDPER	elements are used to mark all name components other than those already listed. The distinction between them is that a
501	NDPER	encloses an associated name component such as an aristocratic or official title which exists in some sense independently of its bearer. The distinction is not always a clear one. As elsewhere, the
506	NDPER	An inherited or life-time title of nobility such as
515	NDPER	An academic or other honorific prefixed to a name e.g.
542	NDPER	role
543	NDPER	a person has in a given context (such as
544	NDPER	witness
549	NDPER	element, since this is intended to mark roles which function as part of a person's name, not the role of the person bearing the name in general. Information about roles, occupations, etc. of a person is encoded within the
588	NDPER	A name may have any combination of the above elements:
606	NDPER	Although highly flexible, these mechanisms for marking personal name components will not cater for every personal name, nor for every processing need. Where the internal structure of personal names is highly complex or where name components are particularly ambiguous, feature structures are recommended as the most appropriate mechanism to mark and analyze them, as further discussed in chapter
609	NDPER	White space is allowed and therefore significant between elements within
631	NDORG	In these Guidelines, we use the term
633	NDORG	for any named collection of people regarded as a single unit. Typical examples include institutions such as
645	NDORG	. Giving a loosely-defined group of individuals a name often serves a particular political or social agenda and an analysis of the way such phrases are constructed and used may therefore be of considerable importance to the social historian, even where the objective existence of an
647	NDORG	in this sense is harder to demonstrate than that of (say) a named person. In the case of businesses or other formally constituted institutions, the component parts of an organizational name may help to characterize the organization in terms of its perceived geographical location, ownership, likely number of employees, management structure, etc.
656	NDORG	This element is a member of the same attribute classes as
663	NDORG	element may be used to mark up any form of organizational name:
690	NDORG	attribute should be used to characterize the name (rather than the organization), for example as an acronym:
716	NDORG	The components of an organization's name may include place names as well as personal names:
724	NDORG	or role names:
760	NDPLAC	Like other proper nouns or noun phrases used as names, place names can simply be marked up with the
764	NDPLAC	element. For cartographers and historical geographers, however, the component parts of a place name provide important information about the relation between the name and some spot in space and time. They also provide important evidence in historical linguistics.
766	NDPLAC	These Guidelines distinguish three ways of referring to places. A place name (represented using the
769	NDPLAC	). A place named simply in terms of geographical features such as mountains or rivers is represented using the
772	NDPLAC	). Finally, an expression consisting of phrases expressing spatial or other kinds of relationship between other kinds of named place may itself be regarded as a way of referring to a place, and hence as a kind of named place (see section
785	NDPLAC	mentioned above. These attributes are primarily useful as a means of linking a place name with information about a specific place. Recommendations for the encoding of information about a place, as distinct from its name, are provided in
794	NDPLAC	name type="place"
796	NDPLAC	rs type="place"
798	NDPLAC	Strictly, a suitable value such as
800	NDPLAC	should be added to the two place names which are presented periphrastically in the second version of this example. This would preserve the distinction indicated by the choice of
827	NDPLGU	A place name may contain text with no indication of its internal structure:
829	NDPLGU	More usually however, a place name of this kind will be further analysed in terms of its constitutive geo-political or administrative units. These may be arranged in ascending sequence according to their size or administrative importance, for example:
845	NDPLGU	class, members of which may be used anywhere that text is permitted, including within each other as in the following examples:
924	NDPLGF	element for this component of the name and then point to it using the
932	NDPLR	All the place name specifications so far discussed are
934	NDPLR	, in the sense that they define only one place. A place may however be specified in terms of its relationship to another place, for example
939	NDPLR	relative place names
940	NDPLR	will contain a place name which acts as a referent (e.g.
944	NDPLR	). They will also contain a word or phrase indicating the position of the place being named in relation to the referent (e.g.
948	NDPLR	). A distance, possibly only vaguely specified, between the referent place and the place being indicated may also be present (e.g.
954	NDPLR	Relative place names may be encoded using the following elements in combination with either a
959	NDPLR	Some examples of relative place names are:
995	NDPLR	The internal structure of place names is like that of personal names—complex and subject to an enormous amount of variation across time and different cultures. The recommendations in this section should however be adequate for a majority of users and applications; they may be extended using the mechanisms described in chapter
996	NDPLR	to add new elements to the existing classes. When the focus of interest is on the name components themselves, as in place name studies for example, the elements discussed in
1019	NDPERS	This module defines a number of special purpose elements which can be used to mark up biographical, historical, and prosopographical data. We envisage a number of users and uses for these elements. For example, an encoder may be interested in creating or converting a set of biographical records, for example of the type found in a Dictionary of National Biography. Another use is the creation or conversion of a database-like collection of information about a group of people, such as the people referenced in a marked-up collection of documents, or persons who have served as informants in the creation of spoken corpora. It is also appropriate to use these elements to register information relating to those who have taken part in the creation of a TEI document.
1021	NDPERS	To cater for this diversity, these Guidelines propose a flexible strategy, in which encoders may choose for themselves the approach appropriate to their needs. If one were interested, for example, in converting existing DNB-type records, and wanted to preserve the text as is, the
1024	NDPERS	) could simply contain the text of an article, placed within
1030	NDPERS	to mark up features of that text. For a more structured entry, however, one would extract the data and place information contained in the text, and encode it directly using the more specific elements described in this section.
1035	NDPERSbp	Information about people, places, and organizations, of whatever type, essentially comprises a series of statements or assertions relating to:
1039	NDPERSbp	which do not, by and large, change over time
1043	NDPERSbp	which hold true only at a specific time
1046	NDPERSbp	or incidents which may lead to a change of state or, less frequently, trait.
1052	NDPERSbp	are typically independent of an individual's volition or action and can be either physical, such as sex or hair and eye colour, or cultural, such as ethnicity, caste, or faith. The distinction is not entirely straightforward, however: while sex is fairly obviously a physical trait, gender should rather be regarded as culturally determined, and the division of mankind into different
1054	NDPERSbp	, proposed by early (white European) anthropologists on the basis of physical characteristics such as skin colour, hair type and skull measurements, is now considered to be more a social or mental construct. Furthermore, while some characteristics will obviously change over time, hair colour for example, none, in principle—not even sex—is immutable.
1057	NDPERSbp	include, for example, marital status, place of residence and position or occupation. Such states have a definite duration, that is, they have a beginning and an end and are typically a consequence of the individual's own action or that of others.
1060	NDPERSbp	changes in state
1061	NDPERSbp	are meant the events in a person's life such as birth, marriage, or appointment to office; such events will normally be associated with a specific date or a fairly narrow date-range. Changes in states can also cause or be caused by changes in characteristics. Any statement or assertion on any of these aspects of a person's life will be based on some source, possibly multiple sources, possibly contradictory. Taking all this into account it follows that each such statement or assertion needs to be able to be documented, put into a time frame and be relatable to other statements or assertions of the same or any of the other types.
1063	NDPERSbp	The elements defined by the module described in this chapter may, for the most part, all be regarded as specializations of one or other of the above three classes. Generic elements for state, trait, and event are also defined:
1076	NDPERSE	Information about a person, as distinct from references to a person, for example by name, is grouped together within a
1078	NDPERSE	element. Information about a group of people regarded as a single entity (for example
1082	NDPERSE	element. Note however that information about a group of people with a distinct identity (for example a named theatrical troupe) should be recorded using the
1097	NDPERSE	elements may be supplied within the
1101	NDPERSE	element of a TEI header (see
1104	NDPERSE	can also appear within the body of a text when the module defined by this chapter is included in a schema.
1130	NDPERSE	element carries several attributes. As a member of the classes
1141	NDPERSE	In addition, a small number of very commonly used personal properties may be recorded using attributes specific to
1149	NDPERSE	These attributes are intended for use where only a small amount of data is to be encoded in a more or less normalized form, possibly for many person elements, for example when encoding basic facts about respondents to a questionnaire. When however a more detailed encoding is required for all kinds of information about a person, for example in a historical gazetteer, then it will be more appropriate to use the elements
1157	NDPERSE	attribute is not intended to record the person's age expressed in years, months, or other temporal unit. Rather it is intended to record into which age bracket, for the purposes of some analysis, the person falls. A simple (perhaps too simple to be useful) binary classification of age brackets would be
1161	NDPERSE	. The actual age brackets useful to various projects are likely to be varied and idiosyncratic, and thus these Guidelines make no particular recommendation as to possible values. Instead, individual projects are recommended to define the values they use in their own customization file, using a declaration like the following:
1201	NDPERSE	element may contain many sub-elements, each specifying a different property of the person being described. The remainder of this section describes these more specific elements. For convenience, these elements are grouped into three classes, corresponding with the tripartite division outlined above: one for traits, one for states and one for events. Each class contains both specific elements for common types of biographical information, and a generic element for other, user-defined, types of information.
1203	NDPERSE	All the elements in these three classes belong to the attribute class
1234	NDPERSEpc	, allow content of ordinary prose containing phrase-level elements.
1241	NDPERSEpc	The meanings of concepts such as sex, nationality, or age are highly culturally-dependent, and the encoder should take particular care to be explicit about any assumptions underlying their usage of them. For example, when recording personal age in different cultures, there may be different assumptions about the point from which age is reckoned. A statement of the practice adopted in a given encoding may usefully be provided in the
1248	NDPERSEpc	element contains either paragraphs or a number of
1253	NDPERSEpc	tag
1254	NDPERSEpc	s for the languages. The
1258	NDPERSEpc	attribute, which indicates the language with the same kind of
1259	NDPERSEpc	language tag
1261	NDPERSEpc	language tags
1291	NDPERSEpc	attribute to give values from a project-internal taxonomy, or an external standard, such as vCard's sex property
1317	NDPERSEpc	As elsewhere, these coded values may be used as an alternative to or normalization of the actual descriptive text contained in the element. The previous example might equally well be given as
1330	NDPERSEpc	These elements can be used to extend the range of information supplied about an individual's personal characteristics. Either may contain an optional
1332	NDPERSEpc	element, used to provide a human-readable specification for the characteristic concerned and a description of the feature itself supplied within a
1354	NDPERSEpc	These elements are provided as a simple means of extending the set of descriptive features available in a standardized way. For example, there are no predefined elements for such features as eye or hair colour. If these are to be recorded, they may simply be added as new types of trait:
1370	NDPERSEpc	If none of the more specialized elements listed above is appropriate, then a choice must be made between the two generic elements
1378	NDPERSEpc	for the latter. It may also be helpful to note that traits are typically, but not necessarily, independent of the volition or action of the holder. If the distinction between state and trait is not considered relevant or useful, use
1384	NDPERSEpc	element is repeatable and can, like all TEI elements, take the attribute
1386	NDPERSEpc	to indicate the language of the content of the element, as well as a
1388	NDPERSEpc	attribute to indicate the type of name, whether a nickname, maiden or birth name, alternative form, etc. This is useful in cases where, for example, a person is known by a Latin name and also by any number of vernacular names, many or all of which may have claims to
1390	NDPERSEpc	. In order to ensure uniformity, the method generally employed in the library world has been to accept the form found in some authority file, for example that of the American Library of Congress, as the
1396	NDPERSEpc	an overtly foreign form of the name of their local saint or hero. Within the
1398	NDPERSEpc	element any number of variant forms of a name can be given, with no prioritization, and hence less likelihood of offence. The Icelandic scholar and manuscript collector Árni Magnússon, to give his name in standard modern Icelandic spelling, is known in Danish as Arne Magnusson, the form which he himself, as a long term resident of Denmark, generally used; there is also a Latinized form, Arnas Magnæus, which he used in his scholarly writings. All three forms can be given, and in any order:
1410	NDPERSEpc	At the other extreme, a person may be named periphrastically as in the following example:
1484	NDPERSEpe	has a similar content model to that of
1490	NDPERSEpe	element to identify the name of the place where the event occurred. It is used to describe any event in the life of an individual or organization.
1492	NDPERSEpe	In the following example, we give a brief summary of the wedding of Jane Burden to the English writer, designer, and socialist William Morris, encoded as an
1496	NDPERSEpe	element used to record data about Morris, though we could equally well have embedded the event within the
1568	NDPERSEpe	elements point either to an external source or to a
1570	NDPERSEpe	element within which other information about the person named may be found. As further discussed below (
1573	NDPERSEpe	element may then be used to link them in a more meaningful way:
1580	NDPERSEpe	As mentioned above, all these elements, both the specific and the generic, are members of the
1582	NDPERSEpe	attribute class, which means they can be limited in terms of time. The following encoding, for example, demonstrates that the person named David Jones changed his name in 1966 to David Bowie:
1596	NDPERSEpe	classes. These classes make available the attributes
1604	NDPERSEpe	, a pointer to a resource from which the information derives. In this way it is possible, in the case of multiple and conflicting sources, to provide more than one view of what happened, as in the following example:
1626	NDPERSREL	attributes in the usual way. The value specified for either attribute on a
1634	NDPERSREL	, as defined here, may be any kind of describable link between specified participants. A participant (in this sense) might be a person, a place, or an organization. In the case of persons, therefore, a relationship might be a social relationship (such as employer/employee), a personal relationship (such as sibling, spouse, etc.) or something less precise such as
1640	NDPERSREL	relationship); or it may not be if participants are not identical with respect to their role in the relationship (for example, the
1642	NDPERSREL	relationship). For non-mutual relationships, only two kinds of role are currently supported; they are named
1648	NDPERSREL	, in the sense that they are most readily described by a transitive verb, or a verb phrase of the form
1687	NDPERSREL	This example defines the relationships amongst a number of people not further described here; we assume however that each person has been allocated an identifier such as
1695	NDPERSREL	, etc. Then the above set of
1729	ND-org	elements discussed elsewhere in this chapter, that is to provide a unique wrapper element for information about an entity, distinct from references to that entity which are typically encoded using a naming element such as
1730	ND-org	name type="org"
1733	ND-org	. The content of a naming element will represent the way an organization is named in a given context; the content of an
1737	ND-org	An organization is not the same thing as a list or group of people because it has an identity of its own. That identity may be expressed solely in the existence of a name (for example
1739	ND-org	), but is likely to consist in the combination of that name with a number of events, traits, or states which are considered to apply to the organization itself, rather than any of its members. For example, a sports team might be described in terms of its membership (a
1743	ND-org	), its geographical affiliation (a
1747	ND-org	attribute. However, it is the name of the sports team alone which identifies it.
1749	ND-org	The content model for
1776	ND-org	The names of the people making up an organization can also change over time (if they are known at all). For example:
1843	ND-org	element to group together a number of
1906	NDGEOG	we discuss various ways of naming places such as towns, countries, etc. In much the same way as these Guidelines distinguish between the encoding of names for people and the encoding of other data about people, so they also distinguish between the encoding of names for places and the encoding of other data about places. In this section we present elements which may be used to record in a structured way data about places of any kind which might be named or referenced within a text. Such data may be useful as a way of normalizing or standardizing references to particular places, as the raw material for a gazetteer or similar reference document associated with a particular text or set of texts, or in conjunction with any form of geographical information system.
1916	NDGEOG	class contains elements describing characteristics of a place which have a definite duration, such as its name. Any member of the
1924	NDGEOG	For example, the modern city of Lyon in France was in Roman times known as Lugdunum. Although the modern and the Roman city are not physically co-extensive, they have significant areas which overlap, and we may therefore wish to regard them as the same place, while supplying both names with an indication of the time period during which each was current.
1926	NDGEOG	A place is defined, however, by its physical location, which does not typically change over time. Locations may be specified in a number of ways: as a set of coordinates defining a point or an area on the surface of the earth, or by providing a description of how the place may be found, usually in terms of other place names. For example, we can identify the location of the Canadian city of London either by specifying its latitude and longitude, or by specifying that we mean the city called London located in the province called Ontario within the country called Canada.
1928	NDGEOG	In addition we may wish to supply a brief characterization of the place identified, for example to state that it is a city, an administrative area such as a country, or a landmark of some kind such as a monument or a battlefield. If our typology of places is simple, the open ended
1931	NDGEOG	place type="city"
1933	NDGEOG	place type="battlefield"
1938	NDGEOG	element, the following elements may be used to provide more information about specific aspects of the place in a structured form:
1946	NDGEOGva	A location may be specified in one or more of the following ways:
1948	NDGEOGva	by supplying a string representing its coordinates in some standardized way within a
1952	NDGEOGva	by supplying one or more place name component elements (e.g.
1956	NDGEOGva	etc.) to place it within a geo-political context
1970	NDGEOGva	The simplest method of specifying a location is by means of its geographic coordinates, supplied within the
1974	NDGEOGva	) used for the coordinate system itself. The default recommended by these Guidelines is to supply a string containing two real numbers separated by whitespace, of which the first indicates latitude and the second longitude according to the 1984 World Geodetic System (WGS84); this is the system currently used by most GPS applications which TEI users are likely to encounter.
1977	NDGEOGva	We might therefore record the information about the place known as
1991	NDGEOGva	Identifying Lyon by its geo-political status as a settlement within a country forming part of a larger political entity, we might represent the same
1992	NDGEOGva	place
2014	NDGEOGva	We may use the same procedure to represent the location of smaller places, such as a street or even an individual building:
2031	NDGEOGva	attribute to categorize more precisely both the kind of place concerned (a building) and the kind of name used to locate it, for example by characterizing the generic
2053	NDGEOGva	sometimes resembles a set of instructions for finding a place, rather than a name:
2073	NDGEOGva	may also be used to identify a location in terms of its postal or other address:
2095	NDGEOGva	When, as here, the same place is given multiple locations, the
2097	NDGEOGva	attribute should be used to characterize the kind of location, as a means of indicating that these are alternative ways of identifying the same place, rather than that the place is spread across several locations.
2101	NDGEOGva	element may thus identify a place to a greater or lesser degree of precision, using a variety of means: a name, a set of names, or a set of coordinates. The
2103	NDGEOGva	element introduced earlier is by default understood to supply a value expressed in a specific (and widely used) notation. If a
2107	NDGEOGva	, this is interpreted as being really the same place in the universe, but with different systems used to refer to it. If there is a lack of consensus about the location (of, for example, Camelot), more than one
2113	NDGEOGva	By default, the content of
2117	NDGEOGva	Firstly, the content of the
2140	NDGEOGva	In the following example, we have defined the location of the place
2165	NDGEOGva	to indicate the source of the location information.
2181	NDGEOGmp	A place may contain other places. This containment relation can be directly modelled in XML: thus we can say that the towns of Vilnius and Kaunas are both in a place called Lithuania (or Lietuva) as follows:
2204	NDGEOGmp	As a further example, the islands of Mauritius, Réunion, and Rodrigues are collectively known as the Mascarene Islands. Grouped together with Mauritius there are also several smaller offshore islands, with rather picturesque French names. These offshore islands do not however constitute an identifiable place as a whole. One way of representing this is as follows:
2234	NDGEOGmp	Here is a more complex example, showing the variety of names associated at different times and in different languages with a set of hierarchically grouped places—the settlement of Carmarthen Castle, within the town of Carmarthen, within the administrative county of Carmarthenshire, Wales.
2277	NDGEOGmp	place
2284	NDGEOGmp	elements should be distinguished from the (possibly simpler) case where a number of places with some property in common are being grouped together for convenience, for example, in a gazetteer. The
2286	NDGEOGmp	element is provided as a means of grouping places together where there is no implication that the grouped elements constitute a distinct place. For example:
2322	NDGEOGste	There are many different kinds of information which it might be considered useful to record for a place in addition to its name and location, and the categories selected are likely to be very project-specific. As with persons therefore these Guidelines make no claim to comprehensiveness in this context. Instead, the generic
2330	NDGEOGste	attribute. These are complemented by a small number of predefined elements of general utility:
2339	NDGEOGste	element. This element may be used for almost any kind of event in the life of a place; no specialized version of this element is proposed, nor do we attempt to enumerate the possible values which might be appropriate for the
2456	NDGEOGste	attribute are to be understood as cumulatively inherited, as elsewhere in the TEI scheme (for example on
2462	NDGEOGste	element concerns the squirrel population between the dates given. This is then broken down into red and gray squirrel populations, and within that into male and female:
2480	NDGEOGste	attribute: responsibility is not an additive property, and therefore an element either states it explicitly, or inherits it from its nearest ancestor. Dating is slightly different again, in that a child element may specify a date more precisely than its parent, as in the example above
2482	NDGEOGste	Events may also be subdivided into other events. For example, a two part meeting might be represented as follows:
2500	NDGEOGste	element is usually used to record information about a place, or a person; for this reason the element usually appears as content of a
2504	NDGEOGste	. However, it is also possible to describe events independently of either a person or a place. This may be useful in such applications as chronologies, lists of significant events such as battles, legislation, etc.
2564	place-rel	element may also be used to express relationships of various kinds between places, or between places and persons, in much the same way as it is used to express relationships between persons alone. Returning to the Mascarene Islands example cited above, we might define the island group and its constituents separately, but indicate the relationship by means of a
2594	place-rel	style of representation has the advantage that we can now also represent the fact that a place may be a
2596	place-rel	more than one other place; for example, Réunion is part of France, as well as part of the Mascarenes. If we add a declaration for France to the list above:
2653	NDNYM	So far we have discussed ways in which a name or referring string encountered in running text may be resolved by considering the object that the name refers to: in the case of a personal name, the name refers to a person; in the case of a place name, to a place, for example. The resolution of this reference is effected by means of the
2675	NDNYM	in Russian might all be regarded as existing independently of any person to which they are attached, and also independently of any variant forms that might be attested in different sources (such as Jon or Johnny in English, or Jehan or Jojo in French). We use the term
2676	NDNYM	nym
2677	NDNYM	to refer to the canonical or normalized form of a name regarded in such a way, and provide the following elements to encode it:
2687	NDNYM	to indicate the nym with which it corresponds. Thus, given the following
2689	NDNYM	for the name
2699	NDNYM	an occurrence of this name in running text might be encoded as follows:
2705	NDNYM	The person identified by this particular Tony may however be indicated independently using the
2707	NDNYM	attribute, either on the forename or on the whole name component:
2726	NDNYM	, etc. For example, we may show that the canonical form for a given nym has two orthographic variants in this way:
2790	NDNYM	element used here is provided by the TEI
2792	NDNYM	module, which would therefore also need to be included in a schema built to validate such markup. Other possibilities for more detailed linguistic analysis are provided by elements included in that and the
2802	NDNYM	might be regarded as a nym in its own right:
2812	NDNYM	Within running text, a name can specify all the nyms associated with it:
2818	NDNYM	is used to indicate its constituent parts, where these have been identified as distinct nyms:
2828	NDNYM	element may also combine a number of other
2830	NDNYM	elements together, where it is intended to show that they are all regarded as variations on the same root. Thus the different forms of the name John, all being derived from the same root, may be represented as a hierarchic structure like this:
2898	NDDATE	describes a date or time with reference to some other (absolute) temporal expression, and thus may contain an
2934	NDDATER	after the lamented death of the Doctor
2937	NDDATER	have two distinct components. As well as the absolute temporal expression or event to which reference is made (e.g.
2942	NDDATER	the death of the Doctor
2947	NDDATER	between the time or date which is indicated and the referent expression (e.g.
2954	NDDATER	offset
2955	NDDATER	describing the direction of the distance between the time or date indicated and the referent expression (e.g.
2974	NDDATER	offset
3013	NDDATER	and the cited date are parts of the same temporal expression, and hence to disambiguate the phrase
3039	NDDATER	Where more complex or ambiguous expressions are involved, and where it is desirable to make more explicit the interpretive processes required, the feature structure notation described in chapter
3054	NDDATER	). It is used here to link the temporal phrase with an interpretation of it. Like most traditional fairs and market days, the Glasgow Fair was established by local custom and could vary from year to year. Consequently, in order to provide such an interpretation, it is necessary to draw upon additional information which may or may not be located in the particular text in question. In this case, it is necessary at least to know the spatial and temporal context (year and place) of the fair referred to. These and other features required for the analysis of this particular temporal expression may be combined together as one feature structure of type
3081	NDDATEA	It may be useful to categorize a temporal expression which is given in terms of a named event, such as a public holiday, or a named time such as
3082	NDDATEA	tea time
3123	NDDATEISO	The attributes for normalization of dates and times so far described use a standard format defined by
3127	NDDATEISO	. The full ISO standard provides formats not available in the W3C recommendation, for example, the capability to refer to a date by its ordinal date or week date, or to refer to a century. It also provides ways of indicating duration and range.
3129	NDDATEISO	When this module is included in a schema, the following additional attributes are provided:
3133	NDDATEISO	These attributes may be used in preference to their W3C equivalent when it is necessary to provide a normalized value in some form not supported by the W3C attributes. For example, a century date in the W3C format must be expressed as a range, using the
3146	NDDATEISO	, however, it is possible to express the same normalized value in any of the following additional ways:
3170	NDDATECUSTOM	All date-related encoding described above makes use of the Gregorian calendar, on which both the ISO and W3C datetime formats are based. However, historical texts often pre-date the invention of the Gregorian calendar in the 16th century, or its adoption in Europe over the following centuries, and many other calendars are used in texts from other cultures and contexts. Non-Gregorian dates can be encoded using methods described below.
3172	NDDATECUSTOM	First, a Calendar Description element needs to be supplied in the
3199	NDDATECUSTOM	element in the header which defines and describes the calendar used.
3203	NDDATECUSTOM	attribute is used to specify the calendar used in the
3204	NDDATECUSTOM	text content
3211	NDDATECUSTOM	etc. to provide more precise expressions of dates and times in a constrained and computable form, it is often necessary to express a date or a date-range from a non-Gregorian calendar in a more precise manner. The attributes whose names end in
3215	NDDATECUSTOM	is used to identify the calendar used in the content of these attributes:
3224	NDDATECUSTOM	attribute specifies the calendar used in the text content of the
3228	NDDATECUSTOM	attribute signifies that the calendar used in the
3230	NDDATECUSTOM	attribute is also Julian. The schema could be customized in order to constrain the content of custom attributes in a manner similar to the constraints provided on regular Gregorian dating attributes such as
3236	NDDATECUSTOM	, providing the Gregorian calendar equivalent of the Julian date:
3259	ND	The selection and combination of modules to form a TEI schema is described in

DS-DefaultTextStructure.xml#13163

#	id	text
4	DS	This chapter describes the default high-level structure for TEI documents. A full TEI document combines metadata describing it, represented by a
10	DS	class, or the two in combination. This group of elements makes up a
23	DS	, is also defined for the representation of language corpora, or other collections of encoded texts. A
33	DS	. This permits the encoder to distinguish metadata applicable to the whole collection of encoded texts, which is represented by the outermost
37	DS	elements within the corpus. Further information about the organization and encoding of language corpora is given in chapter
40	DS	In summary, when the default structure module is included in a schema, the following elements are available for the representation of the outermost structure of a TEI document:
51	DS	). A TEI document may also contain elements from the
53	DS	class (such as a collection of facsimile images, or a feature system declaration) if the appropriate module is included in a schema (see further
61	DS	are available as major parts of a TEI document. These three elements are provided by the
70	DS	TEI texts may be regarded either as
74	DS	that is, consisting of several components which are in some important sense independent of each other. The distinction is not always entirely obvious: for example a collection of essays might be regarded as a single item in some circumstances, or as a number of distinct items in others. In such borderline cases, the encoder must choose whether to treat the text as unitary or composite; each may have advantages and disadvantages in a given situation.
76	DS	Whether unitary or composite, the text is marked with the
78	DS	tag and may contain front matter, a text body, and back matter. In unitary texts, the text body is tagged
80	DS	; in composite texts, where the text body consists of a series of subordinate texts or groups, it is tagged
85	DS	The overall structure of a unitary text is:
102	DS	The overall structure of a composite text made up of two unitary texts is:
137	DS	element is provided for the case where one text is embedded within another, but does not contribute to its hierarchical organization, for example because it interrupts it, or is simply quoted within it. This is useful in such common literary contexts as the
157	DS	elements, used for more complex or composite text structures, are further discussed in section
159	DS	, in the case of elements which can appear in any kind of document, or elsewhere in the case of elements specific to particular kinds of document.
163	DSDIV	In some texts, the body consists simply of a sequence of low-level structural items, referred to here as
168	DSDIV	). Examples in prose texts include paragraphs or lists; in dramatic texts, speeches and stage directions; in dictionaries, dictionary entries. In other cases sequences of such elements will be grouped together hierarchically into textual divisions and subdivisions, such as chapters or sections. The names used for these structural subdivisions of texts vary with the genre and period of the text, or even at the whim of the author, editor, or publisher. For example, a major subdivision of an epic or of the Bible is generally called a
176	DSDIV	—unless it is an epistolary novel, in which case it may be called a
178	DSDIV	. Even texts which are not organized as linear prose narratives, or not as narratives at all, will frequently be subdivided in a similar way: a drama into
202	DSDIV	, etc., where the number indicates the depth of this particular division within the hierarchy, the largest such division being
203	DSDIV	div1
205	DSDIV	div2
207	DSDIV	div3
225	DSDIV1	, this element has the following additional attributes:
228	DSDIV1	Using this style, the body of a text containing two parts, each composed of two chapters, might be represented as follows:
266	DSDIV2	these elements all bear the following additional attributes:
269	DSDIV2	The largest possible subdivision of the body is
279	DSDIV2	Using this style, the body of a text containing two parts, each composed of two chapters, might be represented as follows:
338	DSDIV3	The choice between numbered and un-numbered divisions will depend to some extent on the complexity of the material: un-numbered divisions allow for an arbitrary depth of nesting, while numbered divisions limit the depth of the tree which can be constructed. Where divisions at different levels should be processed differently (for example to ensure that chapters, but not sections, begin on a new page), numbered divisions slightly simplify the task of defining the desired processing for each level, though this distinction could also be made by supplying this information on the
342	DSDIV3	. Some software may find numbered divisions easier to process, as there is no need to maintain knowledge of the whole document structure in order to know the level at which a division occurs; such software may, however, find it difficult to cope with some other aspects of the TEI scheme. On the other hand, in a collection of many works it may prove difficult or impossible to ensure that the same numbered division always corresponds with the same type of textual feature: a
360	DSDIV3	class may be used to provide a name or description for the division. Typical values might be
368	DSDIV3	, or (for verse texts)
448	DSDIV3	), etc. For example, suppose that the body of a text consists of a series of diary entries, each of which is potentially divided into entries for the morning and the afternoon. This might be represented in any of the following ways. First, using the un-numbered style:
535	DSDIV3X	(etc.) elements will be both complete and identically organized with reference to the original source. For some purposes however, in particular where dealing with unusually large or unusually small texts, encoders may find it convenient to present as textual divisions sequences of text which are incomplete with reference to the original text, or which are in fact an ad hoc agglomeration of tiny texts. Moreover, in some kinds of texts it is difficult or impossible to determine the order in which individual subdivisions should be combined to form the next higher level of subdivision, as noted below.
537	DSDIV3X	To overcome these problems, the following additional attributes are defined for all elements in the
552	DSDIV3X	represents a number for the chapter, and the
554	DSDIV3X	attribute takes the value
556	DSDIV3X	to indicate that this division is incomplete in some respect. Other possible values for this attribute indicate whether material has been omitted initially (I), finally (F), or in the middle (M) of the division, while the
559	DSDIV3X	) may be used to indicate exactly where material has been omitted:
568	DSDIV3X	element in the TEI header should also be used to record the principles underlying the selection of incomplete samples, as further described in section
604	DSDIV3X	, are really quite independent of each other, although they are all marked as subdivisions of the whole group. They can be read in any order without affecting the sense of the piece; indeed, in some cases, divisions of this nature are printed in such a way as to make it impossible to determine the order in which they are intended to be read. Individual stories can be added or removed without affecting the existing components.
611	DSDTB	The divisions of any kind of text may sometimes begin with a brief heading or descriptive title, with or without a byline, an epigraph or brief quotation, or a salutation such as one finds at the start of a letter. They may also conclude with a brief trailer, byline, postscript, or signature. Many of these (e.g. a byline) may appear either at the start or at the end of a text division proper.
613	DSDTB	To support this heterogeneity, the TEI architecture defines five classes, all of which are populated by this module:
635	DSHD	Unlike some other markup schemes, the TEI scheme does
655	DSHD	is the sole member to include other such elements if required.
657	DSHD	In certain kinds of text (notably newspapers), there may be a need to categorize individual headings within the sequence at the start of a division, for example as
700	DSHD	may be longer than in modern works. When heading-like material appears in the middle of a text, the encoder must decide whether or not to treat it as the start of a new division. If the phrase in question appears to be more closely connected with what follows than with what precedes it, then it may be regarded as a heading and tagged as the
706	DSHD	often found in newspapers or magazines, then the
740	DSOC	In addition to headings of various kinds, divisions sometimes include more or less formulaic opening or closing passages, typically conveying such information as the name and address of the person to whom the division is addressed, the place or time of its production, a salutation or exhortation to the reader, and so on. Divisions in epistolary form are particularly liable to include such features. Additional elements for the detailed encoding of personal names, dates, and places are provided in chapter
753	DSOC	elements are used to encode headings which identify the authorship and provenance of a division. Although the terminology derives from newspaper usage, there is no implication that
777	DSOC	Where a sequence of such elements appears together, either at the beginning or end of an element, it may be convenient to group them together using one of the following elements:
844	DSAE	element may be used to encode the prefatory list of topics sometimes found at the start of a chapter or other division. It is most conveniently encoded as a list, since this allows each item to be distinguished, but may also simply be presented as a paragraph. The following are thus both equally valid ways of encoding the same argument:
881	DSAE	epigraph
882	DSAE	is a quotation from some other work, a saying, or a motto, appearing on a title page, or at the start of a division. It may be encoded using the special-purpose
894	DSAE	When an epigraph contains a quotation, this may often be associated with a bibliographic reference. In such cases, it is recommended additionally to group the quotation and its source together using the
915	DSAE	postscript
916	DSAE	is a passage added after the signature of a letter or, less frequently, the main portion of the body of a book, article, or essay. In English a postscript is often abbreviated as
975	DSCO	classes, every textual division (numbered or un-numbered) consists of a sequence of ungrouped
978	DSCO	). The actual elements available will depend on the modules in use; in all cases, at least the component-level structural elements defined in the core will be available (paragraphs, lists, dramatic speeches, verse lines and line groups etc.). If the drama module has been selected, then other component- or phrase- level items specialized for performance texts (for example, cast lists or camera angles) will be available, as defined in chapter
979	DSCO	) will be available. If the dictionary module is in use, then dictionary entries, related entries, etc. (as defined in chapter
980	DSCO	) will also be available; if the module for transcribed speech is in use, then utterances, pauses, vocals, kinesics, etc., as defined in chapter
983	DSCO	Where a text contains low-level elements from more than one module these may appear at any point; there is no requirement that elements from the same module be kept together.
1004	DSGRPF	should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes. The
1007	DSGRPF	should be used to represent an independent text which interrupts the text containing it at any point but after which the surrounding text resumes.
1014	DSGRP	element include anthologies and other collections. The presence of common front matter referring to the whole collection, possibly in addition to front matter relating to each individual text, is a good indication that a given text might usefully be encoded in this way; this structure may be found useful in other circumstances too.
1016	DSGRP	For example, the overall structure of a collection of short stories might be encoded as follows:
1091	DSGRP	A text which is a member of a group may itself contain groups. This is quite common in collections of verse, but may happen in any kind of text. As an example, consider the overall structure of a typical collection, such as the
1093	DSGRP	edition of Crashaw's poetry. Following a critical introduction and table of contents, this work contains the following major sections:
1096	DSGRP	(a collection of verse first published in 1648)
1105	DSGRP	I (a collection of fragments all taken from a single manuscript)
1108	DSGRP	II (a further collection of fragments, taken from a different manuscript)
1111	DSGRP	Each of the three collections published in Crashaw's lifetime has a reasonable claim to be considered as a text in its own right, and may therefore be encoded as such. It is rather more arbitrary as to whether the two posthumous collections should be treated as two groups, following the practice of the
1113	DSGRP	edition. An encoder might elect to combine the two into a single group or simply to treat each fragment as an ungrouped unitary text.
1117	DSGRP	edition reprints the whole of each of the three original collections, including their original front matter (title pages, dedications etc.). These should be encoded using the
1120	DSGRP	), while the body of each collection should be encoded as a single
1122	DSGRP	element. Each individual poem within the collections should be encoded as a distinct
1124	DSGRP	element. The beginning of the whole collection would thus appear as follows (for further discussion of the use of the elements
1237	DSGRP	element may be used in this way to encode any kind of collection of which the constituents are regarded by the encoder as texts in their own right. Examples include anthologies or collections of verse or prose by multiple authors, florilegia, or commonplace books, journals, day books, etc. As a fairly typical example, we consider
1254	DSGRP	Each titled section listed above comprises a group of extracts or complete texts from writers of a given historical period, preceded by an introductory essay. For example, the second group listed above contains, inter alia, the following:
1268	DSGRP	Each group of writings by a single author is preceded by a brief biographical notice. Some of the extracts are quite lengthy, containing several chapters or other divisions; others are quite short. As the above list indicates, the texts included range across all kinds of material: verse, prose, journals and letters.
1270	DSGRP	The easiest way of encoding such an anthology is to treat each individual extract as a text in its own right. A sequence of texts by a single author, together with the biographical note preceding it, can then be treated as a single
1274	DSGRP	formed by the section. The sequence of single or composite texts making up a single section of the work is likewise treated, together with its prefatory essay, as a single
1345	DSGRP	Note that the editor's introductory essays on each author may be treated as texts in their own right (as the essays on Lady Mary Wortley Montagu and Alexander Pope have been treated above), or as front matter to the embedded text, as the essay on Swift has been. The treatment in the example is intentionally inconsistent, to allow comparison of the two approaches. Consistency can be imposed either by treating the Swift section as a
1347	DSGRP	containing one text by Swift and one by the editor, or by treating the Montagu and Pope sections as
1349	DSGRP	elements containing the editor's essays as front matter. Marked in the second way, the Pope section of the book would look like this:
1370	DSGRP	front
1377	DSGRP	Where, as in this case, an anthology contains different kinds of text (for example, mixtures of prose and drama, or transcribed speech and dictionary entries, or letters and verse), the elements to be encoded will of course be drawn from more than one module. The elements provided by the core module described in chapter
1378	DSGRP	should however prove adequate for most simple purposes, where prose, drama, and verse are combined in a single collection.
1380	DSGRP	For anthologies of short extracts such as commonplace books, it may often be preferable to regard each extract not as a text in its own right but simply as a quotation or
1385	DSGRP	which appears in the front matter of Melville's
1432	DSFLT	An important characteristic of the unitary or composite text structures discussed so far is that they can be regarded as forming what is mathematically known as a
1434	DSFLT	covering the whole of the available text (or text division) at each hierarchic level. Just as an XML document has a single root element containing a single tree, each node of which forms a properly nested sub-tree, so it seems natural to think of the internal structure of a text as decomposable hierarchically into subparts, each of which is a properly nested subtree. While this is undoubtedly true of a large number of documents, it is not true of all. In particular, it is not true of texts which are only partly tessellated at a given level. For example, if a text A is contained by text B in such a way that part of B precedes A and part follows it, we cannot tessellate the whole of B. In such a case, we say that text A is a
1446	DSFLT	might be regarded as containing many floating texts embedded within another single text, the framing narrative, rather than as groups of discrete texts in which the fragments of framing narrative are regarded as front or back matter.
1448	DSFLT	As an example, we consider an 18th century text
1451	DSFLT	, by Jane Barker (1726). This lengthy narrative contains nearly a hundred distinct
1453	DSFLT	embedded (as the title suggests) in a single patchwork. The work begins by introducing the central character, Galecia, but within a few pages launches into a distinct narrative, the story of Captain Manly:
1504	DSFLT	In other multi-narrative texts, the individual nested tales may have greater significance than the framing narratives, and it may therefore be preferable to treat the fragments of framing narrative as front or back matter associated with each nested tale. This is commonly done, for example, in texts such as Chaucer's
1506	DSFLT	, where each tale is typically presented with front matter in which the teller of the tale is introduced, and back matter in which the pilgrims comment on it.
1514	DSFLT	suggest that its content derives from a source external to the current text,
1516	DSFLT	carries no such implication and is simply used whenever the richer content model that it provides is required to support the markup of a part of a text that is presented as a discrete
1518	DSFLT	In some cases, such inclusions could be considered external (e.g., enclosures, attachments, etc.); often however, as in the examples above, the included text bears no signs of emanating from outside.
1523	DSFLT	may be used in combination. For a text with rich internal structure that is quoted at length,
1536	DSVIRT	Where the whole of a division can be automatically generated, for example because it is derived from another part of this or another document, an encoder may prefer not to represent it explicitly but instead simply mark its location by means of a processing instruction, or by using the special purpose
1559	DSVIRT	For example, if the table of contents (toc) for a given work is simply derived by copying the first
1564	DSVIRT	Similarly, in a digital edition combining a transcribed version of some text with a translated version of it, it may be desired to represent the transcript, the translation, and an aligned version of the two as three distinct divisions. This could be achieved by an encoding like the following:
1568	DSVIRT	The processing to be carried out when a
1570	DSVIRT	element is rendered will be determined by the application program or stylesheet in use: the function of the TEI markup is simply to identify the location at which the virtual division is to be generated, and also to provide some information about the kind of division to be generated. As such it may be regarded as a special kind of processing instruction, and could equally well be represented by one.
1576	DSFRONT	front matter
1577	DSFRONT	we mean distinct sections of a text (usually, but not necessarily, a printed one), prefixed to it by way of introduction or identification as a part of its production. Features such as title pages or prefaces are clear examples; a less definite case might be the prologue attached to a play. The front matter of an encoded text should not be confused with the TEI header described in chapter
1578	DSFRONT	, which serves as a kind of front matter for the computer file itself, not the text it encodes.
1580	DSFRONT	An encoder may choose simply to ignore the front matter in a text, if the original presentation of the work is of no interest, or for other reasons; alternatively some or all components of the front matter may be thought worth including with the text as components of the
1586	DSFRONT	With the exception of the title page (on which see section
1587	DSFRONT	), front matter should be encoded using the same elements as the rest of a text. As with the divisions of the text body, no other specific tags are proposed here for the various kinds of subdivision which may appear within front matter: instead either numbered or un-numbered
1592	DSFRONT	for attributes, it is recommended that software written to handle TEI-conformant texts be prepared to recognize and handle these values when they occur, without limiting the user to the values in this list.
1595	DSFRONT	attribute may be used to distinguish various kinds of division characteristic of front matter:
1598	DSFRONT	A foreword or preface addressed to the reader in which the author or publisher explains the content, purpose, or origin of the text.
1601	DSFRONT	A formal declaration of acknowledgment by the author in which persons and institutions are thanked for their part in the creation of a text.
1604	DSFRONT	A formal offering or dedication of a text to one or more persons or institutions by the author.
1605	DSFRONT	abstract
1607	DSFRONT	A summary of the content of a text as continuous prose.
1610	DSFRONT	A table of contents, specifying the structure of a work and listing its constituents. The
1618	DSFRONT	The following extended example demonstrates how various parts of the front matter of a text may be encoded. The front part begins with a title page, which is presented in section
1619	DSFRONT	below. This is followed by a dedication and a preface, each of which is encoded as a distinct
1647	DSFRONT	The front matter concludes with another
1649	DSFRONT	element, shown in the next example, this time containing a table of contents, which contains a
1654	DSFRONT	element to provide page-references: the implication here is that the target identifiers supplied (fish1, fish2, etc.) will correspond with identifiers used for the
1656	DSFRONT	elements containing chapters of the text itself. (For the
1688	DSFRONT	Alternatively, the pointers in the index might link to the page breaks at which a chapter begins, assuming that these have been included in the markup:
1702	DSFRONT	The following example uses numbered divisions to mark up the front matter of a medieval text. Note that in this case no title page in the modern sense occurs; the title is simply given as a heading at the start of the front matter. Note also the use of the
1751	DSFRONT	If, however, the table of contents can be automatically generated from the remainder of the text, it may be preferable simply to mark its presence, either by means of an empty
1758	DSTITL	Detailed analysis of the title page and other
1760	DSTITL	of older printed books and manuscripts is of major importance in descriptive bibliography and the cataloguing of printed books; such analysis may require a rather more detailed module than that proposed here.
1761	DSTITL	The following elements are suggested as a means of encoding the major features of most title pages:
1782	DSTITL	class. Any number of elements from this class can appear grouped together within a
1786	DSTITL	element is included so as to enable encoders to record the presence of complex non-textual material on a title page. For simple cases such as printers' ornaments or illustrations the
1797	DSTITL	element without any need to group them together and encode a complete title page.
1799	DSTITL	Encoders wishing to add new elements to either class may do so using the methods described in section
1800	DSTITL	. Two examples of the use of these elements follow. First, the title page of the work discussed earlier in this section:
1822	DSTITL	tag to mark the line breaks of the original where necessary:
1868	DSTITL	Where, as here, it is considered important to encode salient features of the way a title page was originally rendered, the techniques exemplified in
1873	DSTITL	Where title pages are encoded, their physical rendition is often of considerable importance. One approach to this requirement would be to use the
1876	DSTITL	, to segment the typographic content of each part of the title page, and then use the global
1888	DSBACK	Conventions vary as to which elements are grouped as back matter and which as front. For example, some books place the table of contents at the front, and others at the back. Even title pages may appear at the back of a book as well as at the front. The content model for
1896	DSBACK	attribute on all division elements, in order to distinguish various kinds of division characteristic of back matter:
1899	DSBACK	An ancillary self-contained section of a work, often providing additional but in some sense extra-canonical text.
1902	DSBACK	A list of terms associated with definition texts (
1905	DSBACK	list type="gloss"
1913	DSBACK	A list of bibliographic citations: this should be encoded as a
1917	DSBACK	index
1919	DSBACK	Any form of index to the work.
1920	DSBACK	colophon
1925	DSBACK	No additional elements are proposed for the encoding of back matter at present. Some characteristic examples follow; first, an index (for the case in which a printed index is of sufficient interest to merit transcription):
1958	DSBACK	Note that if the page breaks in the original source have also been explicitly encoded, and given identifiers, the references to them in the above index can more usefully be recorded as links. For example, assuming that the encoding of page 461 of the original source starts like this:
1959	DSBACK	then the last item above might be encoded more usefully in either of the following forms:
1984	DSBACK	And finally, a list of corrigenda and addenda with pseudo-epistolary features:
2022	textstructure	Default text structure
2037	DSSTRUC	The selection and combination of modules to form a TEI schema is described in

TC-CriticalApparatus.xml#13092

#	id	text
6	TC	to the text. Witnesses to a text may include authorial or other manuscripts, printed editions of the work, early translations, or quotations of a work in other texts. Information concerning variant readings of a text may be accumulated in highly structured form in a critical apparatus of variants. This chapter defines a module for use in encoding such an apparatus of variants, which may be used in conjunction with any of the modules defined in these Guidelines. It also defines an element class which provides extra attributes for some elements of the core tag set when this module is selected.
8	TC	Information about variant readings (whether or not represented by a critical apparatus in the source text) may be recorded in a series of
10	TC	, each entry documenting one
12	TC	, or set of readings, in the text. Elements for the apparatus entry and readings, and for the documentation of the witnesses whose readings are included in the apparatus, are described in section
14	TC	. The available methods for embedding the apparatus in the rest of the text, or for linking an external apparatus to the base text, are described in section
15	TC	. Finally, several extra attributes for some tags of the core tag set, made available when the additional tag set for text criticism is selected, are documented in section
18	TC	Many examples given in this chapter refer to the following texts of the opening (usually just line 1) of Chaucer's
56	TCAPLL	methods of identifying which witnesses support a particular reading, and for describing the witnesses included in the apparatus: see section
59	TCAPLL	elements for indicating which portions of a text are covered by fragmentary witnesses: see section
65	TCAPLL	element is in one sense a more sophisticated and complex version of the
68	TCAPLL	as a way of marking points where the encoding of a passage in a single source may be carried out in more than one way. Unlike
79	TCAPEN	element, which groups together all the readings constituting the variation. The identification of discrete textual variations or apparatus entries is not a purely mechanical process; different editors may group readings differently. No rules are given here as to how to group readings into apparatus entries; the tags given here may be used to group readings in whatever way the editor finds most perspicuous or useful.
81	TCAPEN	The individual apparatus entry is encoded with the
93	TCAPEN	, are used to link the apparatus entry to the base text, if present. In such cases, several methods may be used for such linkage, each involving a slightly different usage for these attributes. Linkage between text and apparatus is described below in section
103	TCAPEN	or other elements, as described in the next section. A very simple partial apparatus for the first line of the
105	TCAPEN	might take a form something like this:
115	TCAPEN	, to indicate a preference for one reading, etc. The following sections on readings, subvariation, and witness information describe some of the more important complications which can arise.
124	TCAPLR	Individual readings are the crucial elements in any critical apparatus of variants. The following elements should be used to tag individual readings within an apparatus entry:
128	TCAPLR	N.B. the term
130	TCAPLR	is used here in the text-critical sense of
131	TCAPLR	the reading accepted as that of the original or of the base text
132	TCAPLR	. This sense differs from that in which the word is used elsewhere in the Guidelines, for example as in the attribute
134	TCAPLR	where the intended sense is
135	TCAPLR	the root form of an inflected word
137	TCAPLR	the heading of an entry in a reference book, especially a dictionary
140	TCAPLR	In recording readings within an apparatus entry, the
152	TCAPLR	element may also be used to record the base text of the source edition, to mark the readings of a base witness, to indicate the preference of an editor or encoder for a particular reading, or (e.g. in the case of an external apparatus) to indicate precisely to which portion of the main text the variation applies. Those who prefer to work without the notion of a base text or who are not using the parallel segmentation method may prefer not to use it at all. How it is used depends in part on the method chosen for linking the apparatus to the text; for more information, see section
160	TCAPLR	As members of the attribute classes
174	TCAPLR	As elsewhere, these attributes may be used to indicate the person responsible for the editorial decision being recorded, and also the degree of certainty associated with that decision by the person carrying out the encoding.
178	TCAPLR	attribute identifies the witnesses which have the reading in question. It is required if the apparatus gathers together readings from different witnesses, but may be omitted in an apparatus recording the readings of only one witness, e.g. substitutions, divergent opinions on what is in the witness or on how to expand abbreviations, etc. Even in such a one-witness apparatus, however, the
180	TCAPLR	attribute may still be useful when it is desired to record the occurrence of a particular reading in some other witness. For other methods of identifying the witnesses to a reading, see section
204	TCAPLR	attributes may be used to convey information on the sequence and cause of variation. In the following apparatus fragment, the reading
209	TCAPLR	per
244	TCAPLR	Similarly, if a witness is hard to decipher, it may be desired to indicate responsibility for the claim that a particular reading is supported by a particular witness. In line 2212a of
246	TCAPLR	, for example, the manuscript is read in different ways by different scholars; the editor Klaeber prints one text, using parentheses to indicate his expansion, and records in the apparatus two different accounts of the manuscript reading, by Zupitza and Chambers:
268	TCAPLR	attributes are intelligible only on an element recording a reading from a single witness, and should not be used if more than one witness is given on the same
272	TCAPLR	element. If more than one witness is given for the reading, they are undefined. To convey this information when the witness is one among several, the
277	TCAPLR	Where there is a greater weight of editorial discussion and interpretation than can conveniently be expressed through the attributes provided on these elements (for example where there are multiple witnesses for a single reading or multiple editorial responsibility for an emendation) this information can be attached to the apparatus in a note, or recorded in the feature structure notation defined in chapter
278	TCAPLR	. In particular, such recurring text-critical situations as palaeographic confusion of particular letters, or homœoarchy or homœoteleuton involving specific character groups, may lend themselves to feature structure treatment. Information concerning these recurrent situations may be encoded into database-like fragments within the text which would then be available to sophisticated computer-assisted analysis. Further work remains to be done on such mechanisms, however, and so no examples are given here of the use of feature structures in text-critical apparatus.
282	TCAPLR	element may also be used to record the specific wording of notes in the apparatus of the source edition, as here in a transcription of Friedrich Klaeber's note on
293	TCAPLR	Notes providing details of the reading of one particular witness should be encoded using the specialized
298	TCAPLR	Encoders should be aware of the distinct fields of use of the attribute values
310	TCAPLR	indicates the scholar responsible for asserting the existence of that reading in that physical entity. In some cases, the categories may blur: a scholar may produce an edition introducing readings for which he or she is responsible; that edition may itself become a witness in a later critical apparatus. Thus, readings introduced as corrections in the earlier edition will be seen in the later apparatus as witnessed by the earlier edition. As observed in the discussion concerning the discrimination of
328	TCAPSU	element may be used to group readings, either because they have identical values on one or more attributes, or because they are seen as forming a self-contained variant sequence, or for some other reason. This grouping of readings is entirely optional: no such grouping of readings is required.
356	TCAPSU	To indicate that both Hg and La vary only orthographically from the lemma, one might tag both readings
357	TCAPSU	rdg type='orthographic'
373	TCAPSU	may be used to organize the substantive variants of an apparatus entry. Editors may need to indicate that each of a group of witnesses may be taken as all supporting a particular reading, even though there may be variation concerning the exact form of that reading in, or the degree of support offered by, those witnesses. For example: one may identify three substantive variants on the first word of Chaucer's
381	TCAPSU	. In fact, the manuscripts display many different spellings of these words, and a scholar may wish both to show that the manuscripts have all these variant spellings and that these variant spellings actually support only the three regularized spelling forms. One may term these variant spellings as
387	TCAPSU	element by gathering the readings into three groups according to the normalized form of their reading. All the readings within each group may be accounted subvariants of the main reading for the group, which may be indicated by tagging it as a
390	TCAPSU	rdg type='groupBase'
428	TCAPSU	is supported by Ra2, even though the form differs in that manuscript. Accordingly, an application which recognizes that these apparatus entries show subvariation may then assign all the witnesses instanced as attesting the sub-variants on that lemma as actually supporting the reading of the lemma itself at a higher level of classification. Thus, Ha4 here supports the reading
434	TCAPSU	element might also be used to group readings in the same way. The example above is substantially identical to the following, which uses
465	TCAPSU	This expresses even more clearly than the previous encoding of this material that at the highest level of classification (apparatus entry A1), this variation has three normalized readings, and that the first of these is supported by manuscripts El, Hg, and Ha4; the second by Cp, Ld1, and La; and the third by Ra2. Some encoders may find the use of nested apparatus entries less intuitive than the use of reading groups, however, so both methods of classifying the readings of a variation are allowed.
467	TCAPSU	Reading groups may also be used to bring together variants which form an apparent developmental sequence, and to make clear that other readings are not part of that sequence, as in the following example, which makes clear that the variant sequence
506	TCAPLW	A given reading is associated with the set of witnesses attesting it by listing the witnesses in the
514	TCAPLW	element. Special mechanisms, described in the following sections, are needed to associate annotation on a reading with one specific witness among several (section
515	TCAPLW	), to transcribe witness information verbatim from a source edition (section
516	TCAPLW	), and to identify the formal lists of witnesses typically provided in the front matter of critical editions (section
522	TCAPWD	When it is desired to give additional information about a particular witness or witnesses for the reading, the information may be given in a
524	TCAPWD	element. This is a specialized form of note, which can be linked to both a reading and to one or more of the witnesses for that reading. The former linkage is effected by the
541	TCAPWD	cannot be included in the text at the point of attachment; it must point to the reading(s) being annotated by means of its
543	TCAPWD	attribute. To indicate, on the authority of editor PR, that the Ellesmere manuscript has an ornamental capital in the word
555	TCAPWD	This encoding makes clear that the ornamental capital mentioned is in the Ellesmere manuscript, and not in Hengwrt or Ha4. The
563	TCAPWD	may be used to record the specific wording of information in the source text, even when the information itself is captured in some more formal way elsewhere. The example from the
566	TCAPWD	), for example, might be extended thus, to record the wording of the note explaining the variant:
590	TCAPWD	Observe that a single witness detail element may be linked to several different readings (noting, for example, a recurrent phenomenon in a particular manuscript) by having the
592	TCAPWD	attribute point at all the readings in question. Similarly, feature structures containing information about the text in a witness (whether retroversion, regularization, or other) can also be linked to specific
606	TCSCWL	In the transcription of printed critical editions, it may be desirable to retain for future reference the exact form in which the source edition records the witnesses to a particular reading; this is particularly important in cases of ambiguity in the information, or uncertainty as to the correct interpretation. The
613	TCSCWL	list may appear following a
619	TCSCWL	element in any apparatus entry, and should be used only to transcribe the witness information in the form found in the source.
626	TCSCWL	The advantage of holding witness information in the
633	TCSCWL	an application can check that every sigil
634	TCSCWL	We use the term sigil as the English equivalent of the Latin term
639	TCSCWL	attribute has declared datatype of one or more
641	TCSCWL	values, a check can be made that readings are assigned only to witness sigla which have been identified (using the
646	TCSCWL	). Such checking is more difficult for witness sigla held as the content of a
649	TCSCWL	For this reason, it is recommended that encoders always hold witness information in the
655	TCSCWL	, where possible. Thus, as in the examples below, even when a reference to a witness is exactly reproduced in the
657	TCSCWL	element, the corresponding sigil for that witness can be written into the
663	TCSCWL	. However, in cases where it is uncertain how the witness reference contained in the
665	TCSCWL	element should be interpreted, or where no witness exists, the
703	TCSCWL	Of course, the sigil used for a particular witness in the source, as recorded in the
705	TCSCWL	element, may well differ from that used to indicate the same witness in the
707	TCSCWL	attribute, as shown particularly in the apparatus for the second line of the poem (Diet.1.2).
716	TCAPWL	A list of all identified witnesses should normally be supplied in the front matter of the edition, or in the
723	TCAPWL	element, which contains a series of
727	TCAPWL	element may contain a brief characterization of the witness, given as one or more prose paragraphs. If more detailed information about a manuscript witness is available, it should be represented using the
737	TCAPWL	Whether information about a particular witness is supplied by means of a
743	TCAPWL	element, a unique sigil for this source should always be supplied, using the global
745	TCAPWL	attribute. This identifier can then be used elsewhere to refer to this particular witness.
753	TCAPWL	The minimal information provided by a witness list is thus the set of sigla for all the witnesses named in the apparatus. For example, the witnesses referenced by the examples of this chapter might simply be listed thus:
770	TCAPWL	It is more helpful, however, for witness lists to be somewhat more informative: each
781	TCAPWL	As the last example shows, the witness description here may be complemented by a reference to a full description of the manuscript supplied elsewhere, typically as the content of a
821	TCAPWL	. Note also that if the witnesses being recorded are not manuscripts but printed works, it may be preferable to document them using the standard
838	TCAPWL	In text-critical work it is customary to refer to frequently occurring groups of witnesses by means of a single common sigil. Such sigla may be documented as pseudo-witnesses in their own right by including a nested witness list within the witness list, which uses the sigil for the group as its identifier, and supplies a fuller name for the group in its optional child
869	TCAPWL	Note that a single witness cannot appear more than once in a witness list, and therefore cannot be assigned to more than one group of witnesses.
871	TCAPWL	Situations commonly arise where there are many more or less fragmentary witnesses, such that there may be quite distinct groups of witnesses for different parts of a text or collection of texts. One may treat this with distinct
875	TCAPWL	element at the beginning of the file or in its header listing all the witnesses, partial and complete, for the text, with the attestation of fragmentary witnesses indicated within the apparatus by use of the
882	TCAPWL	If a witness list is provided, it may be unnecessary to give, in each apparatus entry, an exhaustive list of the witnesses which agree with the base text. An application program can—in principle—compare the witnesses given for each variant found with those given in the full list of witnesses, subtracting from this list all the witnesses not active at this point (perhaps because of lacuna, or because they contain a variation on a different, overlapping lemma) and thence calculate all the manuscripts agreeing with the base text. In practice, encoders may find it less error-prone to list all witnesses explicitly in each apparatus entry.
893	TCAPMI	If a witness is incomplete (whether a single fragment, a series of fragments, or a relatively complete text with one or more lacunae), it is usually desirable to record explicitly where its preserved portions begin and end. The following empty tags, which may occur within any
897	TCAPMI	element, indicate the beginning or end of a fragmentary witness or of a lacuna within a witness:
909	TCAPMI	when the module defined by this chapter is included in a schema.
913	TCAPMI	has a physical lacuna, and the text of the manuscript begins with
933	TCAPMI	both appear in witness X. In some cases, the apparatus in the source may commence recording the readings for a particular witness without its being clear whether the previous absence of readings for this witness is due to a lacuna, or to some other reason. The
955	TCAPLK	Three different methods may be used to link a critical apparatus to the text:
961	TCAPLK	the parallel segmentation method.
968	TCAPLK	apparatus, the former dispersed within the base text, the latter held in some separate location, within or outside the document with the base text. The parallel segmentation method does not use the concept of a base text and may only be used for in-line apparatus.
975	TCAPLK	element provides a useful means of grouping together a series of
993	TCAPLK	element of its TEI header, thus:
1000	TCAPLO	The location-referenced method of encoding apparatus provides a convenient method for encoding printed apparatus; in this method as in most printed editions, the apparatus is linked to the base text by indicating explicitly only the block of text on which there is a variant (noted usually by a canonical reference scheme, or by line number in the edition, such as
1003	TCAPLO	Page 15 line 1
1006	TCAPLO	If the location-referenced method is used for an apparatus stored externally to the base text, the TEI header must have the declaration:
1010	TCAPLO	of the document, the base text (here El) will appear:
1034	TCAPLO	If the same text is encoded using in-line storage, the apparatus is dispersed through the base text block to which it refers. In this case, the location of the variant can be read from the line in which it occurs.
1047	TCAPLO	Since the location is not required to be exact, the apparatus for a line might also appear at the end of the line:
1057	TCAPLO	When the apparatus is linked to the text by means of location references, as shown here, it is not possible to find automatically the precise portion of text varied by the readings. In order to show explicitly what portion of the base text is replaced by the variant readings, the
1071	TCAPLO	base text reading
1072	TCAPLO	and requiring no qualification, but it may optionally carry the normal attributes, as shown here. Some text critics prefer to abbreviate or elide the lemma, in order to save space or trouble; such practice is not forbidden by these Guidelines, but no recommendations are made for conventions of abbreviating the lemma, whether abbreviation of each word, or suppression of all but the first and last word, etc.
1080	TCAPDE	In the double end-point attachment method, the beginning and end of the lemma in the base text are both explicitly indicated. It thus differs from the location-referenced method, in which only the larger span of text containing the lemma is indicated. Double end-point attachment permits unambiguous matching of each variant reading against its lemma. It or the parallel-segmentation method should be used in all cases where this is desired, for example where the apparatus is intended to enable full reconstruction of the text, or of the substantives, of every witness.
1091	TCAPDE	. In cases where it is not possible to insert anchors within the base text (e.g. where the text is on a read-only medium) the beginning and end of the lemma may be indicated by using the
1096	TCAPDE	The double end-point attachment method may be used with in-line or external apparatus. In the latter case, the base text (here El) will appear with
1098	TCAPDE	elements inserted at every place where a variant begins or ends (unless some element with an identifier already begins or ends at that point):
1120	TCAPDE	attribute can use the identifier for the line as a whole; the lemma is assumed to run from the beginning of the element indicated by the
1124	TCAPDE	attribute. If no value is given for
1149	TCAPDE	element in this method, as it may be extracted reliably from the base text. If an exhaustive list of witnesses is available, it will also not be necessary to specify just which manuscripts agree with the base text to enable reconstruction of witnesses. An application will be able to determine the manuscripts that witness the base reading, by noting which witnesses are attested as having a variant reading, and inferring the base text reading for all others after adjusting for fragmentary witnesses and for witnesses carrying overlapping variant readings.
1151	TCAPDE	Alternatively, if it is desired to make an explicit record of the attestation of the base text, the
1166	TCAPDE	. For example, at line 117 of the Wife of Bath's Prologue, the manuscripts Hg (Hengwrt), El (Ellesmere), and Ha4 (British Library Harleian 7334) read:
1206	TCAPDE	The parallel segmentation method, to be discussed next, cannot handle overlaps among variants, and would require the individual variants to be split into pieces.
1208	TCAPDE	Because creation and interpretation of double end-point attachment apparatus will be lengthy and difficult, it is likely that they will usually be created and examined by scholars only with mechanical assistance.
1214	TCAPPS	This method differs from the double end-point attachment method in that all variants at any point of the text are expressed as variants on one another. In this method, no two variations can overlap, although they may nest. Thus, the concepts of a base text and of a lemma become unnecessary: the texts compared are divided into matching segments all synchronized with one another. This permits direct comparison of any span of text in any witness with that in any other witness. It is also very easy with this method for an application to extract the full text of any one witness from the apparatus.
1216	TCAPPS	This method will (by definition) always be satisfactory when there are just two texts for comparison (assuming they are in the same language and script). It will also be useful where editors do not wish to privilege a text as the
1218	TCAPPS	or when editors wish to present parallel texts. It will become less convenient as traditions become more complex and tension develops between the need to segment on the largest variation found and the need to express the finest detail of agreement between witnesses.
1220	TCAPPS	In the parallel segmentation method, each segment of text on which there is variation is marked by an
1224	TCAPPS	element; if it is desired to single out one reading as preferred, it may be tagged
1239	TCAPPS	This method cannot be used with external apparatus: it must be used in-line. Note that apparatus encoded with this method may be translated into the double end-point attachment method and back without loss of information. Where double-end-point-attachment encodings have no overlapping lemmata, translation of these to the parallel segmentation encoding and back will also be possible without loss of information.
1241	TCAPPS	For economy, the witnesses to the reading most widely attested need not be stated. Since all manuscripts must be represented in all apparatus entries, it will be possible for an application to read a
1243	TCAPPS	declaring all the witnesses to the text and then calculate which witnesses have not been named. In the example below, only La and Ra2 are identified explicitly with a reading; an application might successfully infer from this that
1260	TCAPPS	As noted, apparatus entries may nest in this method: if an imaginary fifth manuscript of the text read
1262	TCAPPS	, the variation on the individual words of the line would nest within that for the line as a whole:
1293	TCAPPS	Parallel segmentation cannot, however, deal very gracefully with variants which overlap without nesting: such variants must be broken up into pieces in order to keep all witnesses synchronized.
1300	TCAPLN	When an apparatus is provided it does not need to be given at the location in the transcription where the variation, emendation, attribution, or other apparatus observation occurs. Instead it may be stored in a separate place in the same file, or indeed in another file, and point to the location at which it is meant to be used. Storing apparatus entries separately can be beneficial when encoding multiple competing, potentially overlapping, interpretations of the same point in the source texts.
1302	TCAPLN	The location-referenced method can be used to point to a position in a text using the
1310	TCAPLN	or other element at the location where the apparatus observation takes place. The contents of an element pointed to are understood to be equivalent to a
1312	TCAPLN	if none exists in the
1314	TCAPLN	, and if a
1322	TCAPLN	datatype and thus contains a URI as a value. This means that it can point directly to an
1353	TCAPLN	is not provided in the source file.
1355	TCAPLN	In addition, URLs can contain XPointer schemes including xpath(), range(), and string-range() which can be used in providing the location of an
1357	TCAPLN	that is stored separately from the text to which it applies. Both
1361	TCAPLN	can be used, as in the double end-point attachment method, to identify the starting and ending location for an apparatus using XPointer schemes described in
1362	TCAPLN	section to more precisely identify this location where beneficial.
1379	TCAPLN	attribute is provided then it should be understood that this supplies the location of the textual variance that the apparatus documents. If the
1381	TCAPLN	attribute contains an XPointer scheme that identifies a range of text (or elements) then this is understood to record the starting and ending of the range as in the double end-point attachment method. In such a case a @to attribute is unnecessary.
1390	TCTR	element. An application may then construct different
1398	TCTR	element. Consider, for example, the three different transcriptions given below of line 105 of the Hengwrt manuscript of Chaucer's
1400	TCTR	. The last word of the line
1407	TCTR	u
1413	TCTR	u
1428	TCTR	This example uses special purpose elements
1456	TCTR	In most cases, elements used to indicate features of a primary textual source may be represented within an
1464	TCTR	elements in the example just given. However, in cases where the tagged feature extends across a span of text which might itself contain variant readings which it is desired to represent by
1466	TCTR	structures, some adaptation of the tagging may be necessary. For example, a span of text may be marked in the transcription of the primary source as a single deletion but it may be desirable to represent just a few words from this source as individual deletions within the context of a critical apparatus drawing together readings from this and several other witnesses. In this case, the tagging of the span of words as one deletion may need to be decomposed into a series of one-word deletions for encoding within the apparatus. If it is important to record the fact that all were deleted by the same act, the markup may use the
1495	TC	The selection and combination of modules to form a TEI schema is described in
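
The TCAPWL, TCAPDE and TCAPPS rows above describe how an application can work out which witnesses agree with the base text: take the full witness list, subtract the witnesses named on a variant reading, and subtract those not active at that point. A minimal sketch of that subtraction follows. The function name, the dictionary layout, and the placeholder reading labels are assumptions made for illustration; only the sigla (El, Hg, La, Ra2) come from the text.

```python
def infer_base_witnesses(all_witnesses, variant_readings, inactive=()):
    """Return the sigla assumed to agree with the base text.

    all_witnesses    -- sigla declared in the full witness list
    variant_readings -- hypothetical mapping: reading label -> sigla attesting it
    inactive         -- sigla not active here (lacuna, overlapping lemma)
    """
    named = set()
    for sigla in variant_readings.values():
        named.update(sigla)
    # Whatever is neither named on a variant nor inactive is inferred
    # to carry the base reading.
    return sorted(set(all_witnesses) - named - set(inactive))


witnesses = ["El", "Hg", "La", "Ra2"]
variants = {"reading-a": ["La"], "reading-b": ["Ra2"]}  # placeholder labels
print(infer_base_witnesses(witnesses, variants))
```

As the source rows note, this inference is only as reliable as the witness list and the record of lacunae; listing witnesses explicitly in each entry avoids depending on it.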

#	id	text
2	postCode	postal code
14	postCode	contains a numerical or alphanumeric code used as part of a postal address to simplify sorting or delivery of mail.
72	postCode	The position and nature of postal codes is highly country-specific; the conventions appropriate to the country concerned should be used.

#	id	text
2	att.pointing	defines a set of attributes used by all elements which point to other elements by means of one or more URI references.
18	att.pointing	specifies the language of the content to be found at the destination referenced by
21	att.pointing	language tag
33	att.pointing	if @target is specified.
52	att.pointing	The value must conform to BCP 47. If the value is a private use code (i.e., starts with
58	att.pointing	element with a matching value for its
60	att.pointing	attribute should be supplied in the TEI header to document this value. Such documentation may also optionally be supplied for non-private-use codes, though these must remain consistent with their
96	att.pointing	specifies the intended meaning when the target of a pointer is itself a pointer.
115	att.pointing	if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found which is not a pointer.
131	att.pointing	if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer.
164	att.pointing	If no value is given, the application program is responsible for deciding (possibly on the basis of user input) how far to trace a chain of pointers.
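
The two behaviours described in the att.pointing rows above (follow the chain until a non-pointer is reached, or take only the target of the pointed-to pointer) can be sketched as below. This is an illustration under stated assumptions, not TEI-defined processing: the mode names "all" and "one", the function name, and the dictionary representation of pointers are all hypothetical choices made here.

```python
def resolve(start, targets, mode):
    """Resolve the pointer `start`.

    targets -- hypothetical mapping: pointer id -> id it points at;
               ids absent from the mapping are ordinary, non-pointer elements
    mode    -- "all": trace the chain until a non-pointer is found
               "one": if the target is a pointer, take its target and stop
    """
    current = targets[start]
    if mode == "one":
        # One further hop at most, whether or not the result is a pointer.
        return targets.get(current, current)
    if mode == "all":
        seen = {start}
        while current in targets:
            if current in seen:
                raise ValueError("pointer chain contains a cycle")
            seen.add(current)
            current = targets[current]
        return current
    raise ValueError("unknown mode: " + repr(mode))


chain = {"p1": "p2", "p2": "p3", "p3": "e1"}  # p1 -> p2 -> p3 -> element e1
print(resolve("p1", chain, "all"))
print(resolve("p1", chain, "one"))
```

When no mode is given, the last att.pointing row leaves the decision to the application, so this sketch deliberately rejects any other value rather than guessing a default.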

#	id	text
21	att.naming	may be used to specify further information about the entity referenced by this name in the form of a set of whitespace-separated values, for example the occupation of a person, or the status of a place.
28	att.naming	reference to the canonical name
38	att.naming	provides a means of locating the canonical form (
39	att.naming	nym
64	att.naming	The value must point directly to one or more XML elements by means of one or more URIs, separated by whitespace. If more than one is supplied, the implication is that the name is associated with several distinct canonical names.

#	id	text
2	langKnown	language known
14	langKnown	summarizes the state of a person's linguistic competence, i.e., knowledge of a single language.
38	langKnown	supplies a valid language tag for the language concerned.
56	langKnown	The value for this attribute should be a language
57	langKnown	tag
79	langKnown	a code indicating the person's level of knowledge for this language

#	id	text
2	att.placement	provides attributes for describing where on the source page or object a textual element appears.
18	att.placement	specifies where this item is placed
27	att.placement	below the line
97	att.placement	on the other side of the leaf
113	att.placement	above the line
145	att.placement	within the body of the text.

#	id	text
2	form	form information group
14	form	groups all the information on the written and spoken forms of one headword.
49	form	classifies form as simple, compound, etc.
68	form	single free lexical item
100	form	a variant form
148	form	word in other than usual dictionary form
164	form	multiple-word lexical item

#	id	text
101	usg	domain
111	usg	domain or subject matter (e.g. scientific, literary etc.)
181	usg	language
191	usg	name of a language mentioned in etymological or other linguistic discussion.
405	usg	unclassifiable piece of information to guide sense choice

#	id	text
16	att.handFeatures	gives a name or other identifier for the scribe believed to be responsible for this hand.
34	att.handFeatures	points to a full description of the scribe concerned, typically supplied by a
43	att.handFeatures	characterizes the particular script or writing style used by this hand, for example
105	att.handFeatures	points to a full description of the script or writing style used by this hand, typically supplied by a
116	att.handFeatures	, or other writing medium, e.g.

#	id	text
2	listOrg	list of organizations
12	listOrg	contains a list of elements, each of which provides information about an identifiable organization.
80	listOrg	The type attribute may be used to distinguish lists of organizations of a particular type if convenient.

#	id	text
4	text	contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
156	text	The body of a text may be replaced by a group of nested texts, as in the following schematic:
175	text	This element should not be used to represent a text which is inserted at an arbitrary point within the structure of another, for example as in an embedded or quoted narrative; the

#	id	text
13	metDecl	documents the notation employed to represent a metrical pattern when this is specified as the value of a
19	metDecl	attribute on any structural element of a metrical text (e.g.
145	metDecl	indicates whether the notation conveys the abstract metrical form, its actual prosodic realization, or the rhyme scheme, or some combination thereof.
188	metDecl	declaration applies to the abstract metrical form recorded on the
266	metDecl	declaration applies to the rhyme scheme recorded on the
294	metDecl	element documents the notation used for metrical pattern and realization. It may also be used to document the notation used for rhyme scheme information; if not otherwise documented, the rhyme scheme notation defaults to the traditional
327	metDecl	specifies a regular expression defining any value that is legal for this notation.
346	metDecl	The value must be a valid regular expression per the World Wide Web Consortium's
370	metDecl	This example is intended for the far more restricted case typified by the Shakespearean iambic pentameter. Only metrical patterns containing exactly ten syllables, alternately stressed and unstressed, (except for the first two which may be in either order) to each metrical line can be expressed using this notation.
405	metDecl	may contain either a sequence of
407	metDecl	elements or, alternately, a series of paragraphs or other components. If the
411	metDecl	elements are used, then all the codes appearing within the
415	metDecl	Only usable within the header if the verse module is used.
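
Row 370 describes a restricted notation: exactly ten syllables, alternately unstressed and stressed, with the first two allowed in either order. A possible regular expression for that case is sketched below. The symbols `S` (stressed) and `U` (unstressed) are assumptions, and Python's `re` module stands in for the regular-expression dialect that row 346 refers to.

```python
import re

# Hypothetical notation: "S" = stressed syllable, "U" = unstressed syllable.
# The first foot may be in either order; the remaining four feet are U then S.
PENTAMETER = re.compile(r"(SU|US)(US){4}")


def is_legal(pattern):
    """True if `pattern` is exactly ten syllables in the restricted form."""
    return PENTAMETER.fullmatch(pattern) is not None


print(is_legal("USUSUSUSUS"))  # True
print(is_legal("SUUSUSUSUS"))  # True
print(is_legal("SSUSUSUSUS"))  # False
```

`fullmatch` is used so that a shorter or longer line, for example eight syllables, is rejected rather than partially matched.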

#	id	text
2	metSym	metrical notation symbol
13	metSym	documents the intended significance of a particular character or character sequence within a metrical notation, either explicitly or in terms of other symbol elements in the same metDecl.
39	metSym	specifies the character or character sequence being documented.
60	metSym	specifies whether the symbol is defined in terms of other symbols (
62	metSym	is set to
66	metSym	is set to
146	metSym	The value
148	metSym	indicates that the element contains a prose definition of its meaning; the value

#	id	text
2	relatedItem	contains or references some other bibliographic item which is related to the present one in some specified manner, for example as a constituent or alternative version of it.
32	relatedItem	is used, the relatedItem element must be empty
34	relatedItem	A relatedItem element should have either a 'target' attribute or a child element to indicate the related bibliographic item

#	id	text
31	node	provides the value of a node, which is a feature structure or other analytic element.
69	node	initial node in a transition network
85	node	final node in a transition network
265	node	attributes when the graph is undirected and vice versa if the graph is directed.
286	node	gives the in degree of the node, the number of nodes which are adjacent from the given node.
323	node	gives the out degree of the node, the number of nodes which are adjacent to the given node.
360	node	gives the degree of the node, the number of arcs with which the node is incident.
400	node	attributes when the graph is undirected and vice versa if the graph is directed.
442	node	provides a label for the arc; the second provides a second label for the arc, and should be used if a transducer is being encoded whose actions are associated with nodes rather than with arcs.

#	id	text
2	span	associates an interpretative annotation directly with a span of text.
27	span	Only one of the attributes @target and @from may be supplied on
34	span	Only one of the attributes @target and @to may be supplied on
41	span	If @to is supplied on
42	span	, @from must be supplied as well
49	span	may each contain only a single value
55	span	gives the identifier of the node which is the starting point of the span of text being annotated; if not accompanied by a
57	span	attribute, gives the identifier of the node of the entire span of text being annotated.
88	span	gives the identifier of the node which is the end-point of the span of text being annotated.
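
The co-occurrence rules in rows 27 to 49 can be read as a small validation routine. The sketch below is one possible reading, not the schema's own implementation. It assumes the attributes arrive as a plain dictionary and that the single-value rule in row 49 applies to `from` and `to`. The function name `check_span_attributes` is hypothetical.

```python
def check_span_attributes(attrs):
    """Return a list of rule violations for a dict of attribute name -> value."""
    problems = []
    if "target" in attrs and "from" in attrs:
        problems.append("only one of @target and @from may be supplied")
    if "target" in attrs and "to" in attrs:
        problems.append("only one of @target and @to may be supplied")
    if "to" in attrs and "from" not in attrs:
        problems.append("if @to is supplied, @from must be supplied as well")
    for name in ("from", "to"):
        # Assumption: values are whitespace-separated, so one token = one value.
        if name in attrs and len(attrs[name].split()) != 1:
            problems.append(f"@{name} may contain only a single value")
    return problems


print(check_span_attributes({"from": "#l1", "to": "#l4"}))  # []
print(check_span_attributes({"to": "#l4"}))
```

Returning every violation, rather than stopping at the first, keeps the report useful when several rules are broken at once.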

#	id	text
2	listPrefixDef	list of prefix definitions
4	listPrefixDef	contains a list of definitions of prefixing schemes used in
23	listPrefixDef	In this example, two private URI scheme prefixes are defined and patterns are provided for dereferencing them. Each prefix is also supplied with a human-readable explanation in a

#	id	text
2	att.datable.custom	provides attributes for normalization of elements that contain datable events to a custom dating system (i.e. other than the Gregorian used by W3 and ISO).
6	att.datable.custom	supplies the value of a date or time in some custom standard form.
12	att.datable.custom	The following are examples of custom date or time formats that are
34	att.datable.custom	Not all custom date formulations will have Gregorian equivalents.
38	att.datable.custom	attribute and other custom dating are not constrained to a datatype by the TEI, but individual projects are recommended to regularize and document their dating formats.
43	att.datable.custom	specifies the earliest possible date for the event in some custom standard form.
50	att.datable.custom	specifies the latest possible date for the event in some custom standard form.
81	att.datable.custom	supplies a pointer to some location defining a named point in time with reference to which the datable item is understood to have occurred
104	att.datable.custom	element for the Julian calendar, specifying that the text content of the
108	att.datable.custom	attribute also points to the Julian calendar to indicate that the content of the
110	att.datable.custom	attribute value is Julian too.
122	att.datable.custom	In this example, a date is given in a Mediaeval text measured "from the creation of the world", which is normalised (in
126	att.datable.custom	) to a machine-actionable, numeric version of the date from the Creation.
135	att.datable.custom	) defines the calendar or dating system to which the date described by the parent element is normalized (i.e. in the
141	att.datable.custom	the calendar of the original date in the element.

#	id	text
4	listRelation	provides information about relationships identified amongst people, places, and organizations, either informally as prose or as formally expressed relation links.
102	listRelation	May contain a prose description organized as paragraphs, or a sequence of

#	id	text
2	div	text division
16	div	contains a subdivision of the front, body, or back of a text.

#	id	text
14	attRef	points to the definition of an attribute or group of attributes.
36	attRef	the name of the attribute class
43	attRef	the name of the attribute

#	id	text
2	index	index entry
14	index	marks a location to be indexed for whatever purpose.
45	index	a single word which follows the rules defining a legal XML name (see
46	index	), supplying a name to specify which index (of several) the index entry belongs to.

#	id	text
2	valItem	documents a single value in a predefined list of values.
30	valItem	specifies the value concerned.

#	id	text
4	space	indicates the location of a significant space in the text.
45	space	indicates whether the space is horizontal or vertical.
64	space	the space is horizontal.
80	space	the space is vertical.
97	space	For irregular shapes in two dimensions, the value for this attribute should reflect the more important of the two dimensions. In conventional left-right scripts, a space with both vertical and horizontal components should be classed as
116	space	(responsible party) indicates the individual responsible for identifying and measuring the space
141	space	This element should be used wherever it is desired to record an unusual space in the source text, e.g. space left for a word to be filled in later, for later rubrication, etc. It is not intended to be used to mark normal inter-word space or the like.

#	id	text
26	att.combinable	add
46	att.combinable	if present already, the whole of the declaration for this object is removed from the current setup
62	att.combinable	this declaration changes the declaration of the same name in the current definition
78	att.combinable	this declaration replaces the declaration of the same name in the current definition
100	att.combinable	add
102	att.combinable	add
103	att.combinable	mode); raise an error if an object with the same identifier already exists
109	att.combinable	do not process this object or any existing object with the same identifier; raise an error if any new children supplied
110	att.combinable	change
112	att.combinable	change

#	id	text
60	pron	full form
111	pron	indicates what notation is used for the pronunciation, if more than one occurs in the machine-readable dictionary.
195	pron	The values used to specify the notation may be taken from any appropriate project-defined list of values. Typical values might be

words that maybe should be in <gi> or <ident>

specifications (i.e., https://svn.code.sf.net/p/tei/code/trunk/P5/Source/Specs/)

factuality.xml#13000

collection.xml#13000

damage.xml#13000

pubPlace.xml#13000

cond.xml#13000

classRef.xml#13000

event.xml#13012

model.pLike.front.xml#13000

principal.xml#13000

biblScope.xml#13095

gap.xml#13012

surrogates.xml#13000

head.xml#13000

stress.xml#13000

listForest.xml#13000

eLeaf.xml#13000

att.declaring.xml#13000

authority.xml#13000

undo.xml#13000

entryFree.xml#13000

alternate.xml#13056

climate.xml#13242

when.xml#13000

titlePage.xml#13000

substJoin.xml#13000

stage.xml#13000

data.truthValue.xml#13000

model.headLike.xml#13000

mood.xml#13000

dimensions.xml#13000

damageSpan.xml#13000

lbl.xml#13000

model.lLike.xml#13000

reg.xml#13000

xr.xml#13000

att.datable.iso.xml#13000

triangle.xml#13000

foreign.xml#13012

code.xml#13000

anchor.xml#13000

rendition.xml#13000

list.xml#13046

interleave.xml#

docTitle.xml#13000

num.xml#13000

orig.xml#13092

transpose.xml#13000

catDesc.xml#13000

item.xml#13000

district.xml#13000

postCode.xml#13000

fs.xml#13000

model.teiHeaderPart.xml#13000

att.pointing.xml#13229

att.naming.xml#13000

form.xml#13000

usg.xml#13012

notesStmt.xml#13000

langKnown.xml#13000

att.placement.xml#13000

adminInfo.xml#13000

att.handFeatures.xml#13000

location.xml#13242

listOrg.xml#13000

text.xml#13000

metDecl.xml#13000

metSym.xml#13000

colloc.xml#13000

data.name.xml#13000

relatedItem.xml#13000

cell.xml#13000

node.xml#13000

span.xml#13000

listPrefixDef.xml#13000

att.datable.custom.xml#13227

faith.xml#13000

listRelation.xml#13000

moduleSpec.xml#13000

words that maybe should be in `<gi>` or `<ident>`

specifications (i.e., `https://svn.code.sf.net/p/tei/code/trunk/P5/Source/Specs/`)

#	id	text
12	msContents	describes the intellectual content of a manuscript or manuscript part, either as a series of paragraphs or as a series of structured manuscript items.
60	msContents	identifies the text types or classifications applicable to this object by pointing to other elements or resources defining the classification concerned.
328	msContents	. This constraint is not currently enforced by the schema.

#	id	text
4	byline	contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
132	byline	The byline on a title page may include either the name or a description for the document's author. Where the name is included, it may optionally be tagged using the

#	id	text
47	cRefPattern	The result of the substitution may be either an absolute or a relative URI reference. In the latter case it is combined with the value of
49	cRefPattern	in force at the place where the
51	cRefPattern	attribute occurs to form an absolute URI in the usual manner as prescribed by
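
Rows 47 to 51 describe a substitution followed by resolution of a relative result against a base. The sketch below shows how that could look. Every concrete value in it is hypothetical: the match pattern, the replacement, the `BASE` URI, and the function name `resolve_cref`. It also uses Python's back-reference syntax rather than whatever syntax the specification prescribes.

```python
import re
from urllib.parse import urljoin

# Hypothetical values, for illustration only.
MATCH = r"([A-Za-z]+)\.(\d+)"        # e.g. "Gen.1"
REPLACEMENT = r"\1.xml#c\2"          # Python back-reference syntax
BASE = "http://example.org/texts/"   # stands in for the base value in force


def resolve_cref(cref, base=BASE):
    """Substitute the reference into the pattern, then absolutize the result."""
    if re.fullmatch(MATCH, cref) is None:
        return None  # the pattern does not apply to this reference
    uri = re.sub(MATCH, REPLACEMENT, cref)
    # urljoin leaves an absolute URI untouched and resolves a relative one
    # against the base.
    return urljoin(base, uri)


print(resolve_cref("Gen.1"))  # http://example.org/texts/Gen.xml#c1
```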

#	id	text
4	seal	contains a description of one seal or similar attachment applied to a manuscript.
35	seal	specifies whether or not the seal is contemporary with the item to which it is affixed

#	id	text
4	graph	encodes a graph, which is a collection of nodes, and arcs which connect the nodes.
83	graph	undirected graph
99	graph	directed graph
115	graph	a directed graph with distinguished initial and final nodes
131	graph	a transition network with up to two labels on each arc
152	graph	, then the distinction between the
158	graph	tag is neutralized. Also, the
168	graph	(or any other value which implies directionality), then the
239	graph	states the order of the graph, i.e., the number of its nodes.
258	graph	states the size of the graph, i.e., the number of its arcs.

#	id	text
2	recordHist	recorded history
13	recordHist	provides information about the source and revision status of the parent manuscript description itself.

#	id	text
4	tree	encodes a tree, which is made up of a root, internal nodes, leaves, and arcs from root to leaves.
46	tree	gives the maximum number of children of the root and internal nodes of the tree.
75	tree	indicates whether or not the tree is ordered, or if it is partially ordered.
95	tree	indicates that all of the branching nodes of the tree are ordered.
111	tree	indicates that some of the branching nodes of the tree are ordered and some are unordered.
127	tree	indicates that all of the branching nodes of the tree are unordered.
145	tree	gives the order of the tree, i.e., the number of its nodes.
163	tree	The size of a tree is always one less than its order, hence there is no need for both a
305	tree	A root, and zero or more internal nodes and leaves, but if there is an internal node, there must also be at least one leaf.
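
Rows 145 and 163 relate a tree's order (its number of nodes) to its size, which is taken here to mean its number of arcs, as in the graph table above. A short sketch of that relationship follows. It assumes the tree is given as a child-to-parent mapping; the function name `order_and_size` and the sample node names are hypothetical.

```python
def order_and_size(parent_of):
    """Return (order, size) for a tree given as a child -> parent mapping.

    The root is the one node that appears only as a parent.
    """
    nodes = set(parent_of) | set(parent_of.values())
    order = len(nodes)      # number of nodes
    size = len(parent_of)   # number of arcs: one per non-root node
    return order, size


tree = {"np": "s", "vp": "s", "det": "np", "n": "np"}
print(order_and_size(tree))  # (5, 4)
```

Because every node except the root has exactly one incoming arc, the size always comes out one less than the order, which is why recording both is redundant.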

#	id	text
2	iff	if and only if
13	iff	separates the condition from the consequence in a bicond element.

#	id	text
9	att.media	Where the media are displayed, indicates the display width
16	att.media	Where the media are displayed, indicates the display height
23	att.media	Where the media are displayed, indicates a scale factor to be applied when generating the desired display size

#	id	text
2	geogName	geographical name
14	geogName	identifies a name associated with some geographical feature such as Windrush Valley or Mount Sinai.