As with any other part of document production, when using URIs (web pointers) it is advantageous to check that they ‘work’ early in the production process. The aspects of a pointer we would want to test, in increasing order of difficulty, are that the URI:
(1) is syntactically correct;
(2) points to something that can be retrieved;
(3) points to the right kind of thing (e.g., that the ref of a <persName> points to a <person>); and
(4) points to the correct thing (e.g., that
<persName ref="#MLK">Martin Luther King, Jr.</persName>
should point to an entry for Martin Luther King, Jr., not an entry for Lois Lane).
Most of the main closed schema languages, including PureODD and RELAX NG (but excluding DTDs), have the capability to test that an information item (in our case, and thus hereafter, an attribute value) meets the syntactic constraints of a URI. That is, we can test (1) above just by using our schemas. But none of those languages (including PureODD without use of <constraintSpec>) have the capability to test (2)–(4).
The open schema language used by TEI (ISO Schematron) does allow for (2) and (3). However, a) it is difficult to associate the needed tests with all teidata.pointer attributes, and b) it requires special-case coding for (3), depending on what is considered ‘right’ in the current circumstances. In any case, (4) is much more difficult, and usually requires a human, if not a human with domain-specific knowledge.
Thus the TEI Technical Council does not plan to include any general tests for (2)–(4) directly in TEI P5, at least not in the near future. However, recognizing that at least (2) & (3) represent an important subset of validation tests that projects want to perform, we exemplify herein some mechanisms for doing so. The intent is that projects can copy-and-paste ODD fragments from this file into their own, and modify as desired to suit their local needs.
Note, however, that this document only discusses testing of URIs that are intended to return an XML document or fragment thereof. It is possible to use URIs to point to and retrieve other sorts of information objects including JSON, images, HTML (other than XHTML), audio files, word processor files, etc., but we do not consider these cases herein.
Also note that the examples in this document often use an attribute as the context node of an ISO Schematron rule. Some Schematron processors fail to process this correctly. (The oXygen XML Editor handles these perfectly well, as does David Maus’s SchXslt.)
There are over 100 attributes in TEI defined as teidata.pointer.[1] Roughly 60 of those are restricted to having only 1 URI as a value, but the other nearly 50 permit multiple URIs in the value. Processing multiple pointers is much more difficult than handling a single URI, and thus we consider the singleton cases separately from the cases in which multiple URI values of a single pointer attribute need to be tested.
We first discuss cases where items are being directly pointed to (first in the singleton case, then with multiple pointers), and then cases where an item is being referred to indirectly, either via an intermediate <link> or <alt> element, or via a prefix defined in a <prefixDef>. As a last case, we demonstrate ensuring that the validation is being performed after XInclude processing.
Projects that intend to always use one and only one URI as the value of an attribute which by default in TEI may take multiple URIs will probably find it best to constrain said attribute to only 1 URI. For example, if the (fictional) project ‘The Papers of Dr. Virgil Swann’ were to provide correspondence links between Dr. Swann’s English translation of and his transcription of an intercepted Kryptonian message, each <s> element with an xml:lang of "en" would bear a corresp attribute that pointed to the corresponding <s> element with an xml:lang of "x-kr". The TEI corresp attribute allows one or more URIs. But for this project, the corresp of <s> should be limited to only 1 URI. This can be accomplished as shown in Figure 1, PureODD to limit attribute to one URI.
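A minimal sketch of such a PureODD customization follows; it is offered only as an illustration and may differ in detail from the figure itself. It assumes the project's ODD already includes the analysis module (which defines <s>). The minOccurs and maxOccurs attributes of <datatype> both default to 1, so they are written out here only for clarity.

```xml
<elementSpec ident="s" module="analysis" mode="change">
  <attList>
    <!-- Override the inherited definition of corresp, for <s> only -->
    <attDef ident="corresp" mode="change">
      <!-- Exactly one teidata.pointer value, rather than one or more -->
      <datatype minOccurs="1" maxOccurs="1">
        <dataRef key="teidata.pointer"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>
```

Because the restriction is expressed in the closed schema language, a value containing whitespace-separated pointers would be flagged by ordinary schema validation, with no Schematron needed.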
The fragment identifier portion of a URI is that which follows the first # (read left-to-right). The portion of the URI before the first # locates the document of interest; the portion after the first # locates the element of interest within said document. The portion after the first # can take many forms, only one of which we consider in this document: the shorthand pointer fragment identifier. It is probably by far the most common form of fragment identifier; it refers to an element in the document in question by referring to its ID. (In TEI the ID is always indicated by the xml:id attribute.) This typically looks like either
https://www.w3.org/TR/xptr-framework/#shorthand
or
#DeRoseetal1990
If there is no document mentioned to the left of the first #, then the document being referred to is the base document, which is typically the current document in which the URI appears (but which can be modified by the xml:base attribute).
First, just to demonstrate how this works in general, we show in Figure 2, Ensure ref of g is a shorthand pointer how to test that the ref attribute of <g> has the correct syntax to point to something in the same file. However, we also note that this check could be expressed in PureODD without resorting to Schematron. This PureODD-only mechanism, exemplified in Figure 3, Ensure ref of g is a shorthand pointer, PureODD, has the sizable advantage that the testing is performed by the closed schema language. Note, however, that although this use of the restriction attribute is supported by the TEI’s current ODD processor, other uses may not be; in particular, restriction is currently only usable with the name attribute, not the key attribute of <dataRef>.
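As an illustration of the Schematron approach, a syntax-only check might be sketched as follows. This is not necessarily the code of the figure referred to above. The ident value is made up for this example, and we assume that the ODD processor binds the tei: prefix and that the Schematron is run with an XPath 2.0 (or later) query binding, which matches() requires. Older versions of TEI used scheme="isoschematron".

```xml
<elementSpec ident="g" module="gaiji" mode="change">
  <constraintSpec ident="g-ref-is-shorthand" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:rule context="tei:g/@ref">
        <!-- '#' followed by an NCName: a name start character (but not a
             colon), then zero or more name characters (again, no colons) -->
        <sch:assert test="matches( ., '^#[\i-[:]][\c-[:]]*$' )">
          The ref= of &lt;g&gt; should be a ‘#’ followed by an XML name, but is
          “<sch:value-of select="."/>”.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```

Note that this tests only the shape of the value; it says nothing about whether any element actually bears the ID in question.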
To check that a shorthand pointer fragment identifier URI points to something that can be retrieved (i.e., to test a pointer that looks like #duck for (2), above) we can take advantage of the XPath id() function. This technique is used in Figure 4, Check that ref of g points to something to ensure that the ref of a <g> actually points to something in the current document. A call to id('tho') returns the element from the same document, if there is one, that bears an xml:id of "tho". There is supposed to be at most one such element; if there are two or more, only the first is returned. Thus id('tho') may be thought of as (//*[@xml:id eq 'tho'])[1].[2]
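A hedged sketch of such an existence test is shown here; again, it is our illustration rather than the figure itself. It assumes the value has already been constrained to a single shorthand pointer, and that the processor in use recognizes xml:id attributes as IDs (without that, id() finds nothing). The ident is hypothetical.

```xml
<elementSpec ident="g" module="gaiji" mode="change">
  <constraintSpec ident="g-ref-points-somewhere" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <!-- Only consider values that look like same-document pointers -->
      <sch:rule context="tei:g/@ref[ starts-with( ., '#' ) ]">
        <!-- substring( ., 2 ) strips the leading '#'; id() then looks for
             an element with that ID in the document containing the attribute -->
        <sch:assert test="id( substring( ., 2 ) )">
          The ref= of &lt;g&gt; (“<sch:value-of select="."/>”) does not point
          to anything in this document.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```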
To check that a shorthand pointer fragment identifier URI specifically points to a particular element type (in this case <char> or <glyph>), we simply append a node test, as seen in Figure 5, Check that ref of g points to a char or a glyph. Note that references to TEI elements in the XPath expressions in the Schematron inside <constraint> need to be explicitly bound to the TEI namespace. Although an ODD author may define any namespace prefix for the purpose (using the Schematron <ns> element), TEI ODD software will automatically insert a definition that binds the prefix tei: to the TEI namespace.
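The appended node test might look like the following sketch, which is under the same assumptions as any id()-based test (a single same-document pointer, xml:id recognized as an ID, an XPath 2.0 query binding). The ident is invented for illustration.

```xml
<elementSpec ident="g" module="gaiji" mode="change">
  <constraintSpec ident="g-ref-points-to-char-or-glyph" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:rule context="tei:g/@ref[ starts-with( ., '#' ) ]">
        <!-- The trailing step keeps the target only if it is a TEI <char>
             or <glyph>; an empty result makes the assertion fail -->
        <sch:assert test="id( substring( ., 2 ) )/( self::tei:char | self::tei:glyph )">
          The ref= of &lt;g&gt; (“<sch:value-of select="."/>”) should point to
          a &lt;char&gt; or a &lt;glyph&gt;.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```

With this formulation a pointer to nothing and a pointer to the wrong kind of element produce the same message; a project wanting to distinguish the two could keep a separate existence assertion alongside this one.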
A URI may refer to a file on the local filesystem using either an absolute-path reference or a relative-path reference.[3] An absolute-path reference starts with a slash; a relative-path reference does not. A relative-path reference may start with a dot segment (./), unless the first segment of its path contains a colon, in which case it must start with a dot segment.
The URI that refers to a file on the local filesystem may have a # followed by a fragment identifier. Here we first consider testing that a URI points to a local file, then that it points to a local file with a particular file extension, then that it points to a local file with a particular root element, and last that it points to a particular element type in a local file.
A <moduleRef> element is typically used to refer to one of the TEI modules (for example, "core", "gaiji", or "namesdates") from a customization ODD file using its key attribute. But <moduleRef> can also refer to a non-TEI module using its url attribute. The TEI schema ensures that the value of the url attribute is, in fact, a URI (that is, it performs test (1)). It would be quite reasonable for a project to want to check that the value of url was a URI that referred to an existing, readable, local XML file.
It is possible to write generic constraints for this purpose that would allow any valid URI that referred to a local file (whether an absolute path reference, a relative path reference, or a file URL scheme; see Figure 20, Test for any local URI) or that a value uses a private URI scheme prefix defined by a <prefixDef> (see Figure 21, ref uses defined prefix). However, in most cases projects would probably want to constrain the value to a particular method for referring to an existing, readable, local XML file (if not a particular file, for which see Figure 6, Require the url of moduleRef to refer to a particular file). For example, Figure 7, Check that url of moduleRef refers to an RNG file in the same directory demonstrates a method for requiring that the url attribute of <moduleRef> refers to a local file that is in the same directory as the instance ODD and whose filename ends in ".rng" (and thus is probably a RELAX NG file in the XML syntax).
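One way such a same-directory, ".rng"-only restriction could be sketched is with a regular expression on the attribute value. The pattern below is our own simplification, not necessarily that of the figure: it allows an optional leading ./ and then a single path segment ending in ".rng", and for simplicity it disallows colons in the filename altogether.

```xml
<elementSpec ident="moduleRef" module="tagdocs" mode="change">
  <constraintSpec ident="moduleRef-url-local-rng" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:rule context="tei:moduleRef/@url">
        <!-- optional './', then one segment with no '/', ':', '#', or '?',
             ending in '.rng' -->
        <sch:assert test="matches( ., '^(\./)?[^/:#?]+\.rng$' )">
          The url= of &lt;moduleRef&gt; should name a ‘.rng’ file in the same
          directory as this ODD, but is “<sch:value-of select="."/>”.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```

This checks only the form of the value; whether the named file exists and is readable is a separate test.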
We may wish to ensure that the file referred to (whether local or remote) is readable, well-formed XML. Luckily, XPath provides the doc-available() function for this very purpose. The constraints demonstrated in Figure 8, Ensure file is readable XML first ensure that the information item referred to by the url attribute is a file, not an element, and then require that the file be readable, well-formed XML. The example at Figure 9, Ensure file is readable RELAX NG grammar duplicates these constraints, and also tests that the outermost element of the retrieved file is an <rng:grammar> element.
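A sketch combining these three tests is given below, as an illustration rather than a copy of either figure. One assumption deserves mention: a relative URI handed directly to doc-available() is resolved against the static base URI (typically that of the compiled Schematron), so here we first resolve it against the base URI of the element bearing the attribute. The ident and variable name are made up.

```xml
<elementSpec ident="moduleRef" module="tagdocs" mode="change">
  <constraintSpec ident="moduleRef-url-readable-rng" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:ns prefix="rng" uri="http://relaxng.org/ns/structure/1.0"/>
      <sch:rule context="tei:moduleRef/@url">
        <!-- Absolute form of the pointer, relative to the instance document -->
        <sch:let name="target" value="resolve-uri( ., base-uri( .. ) )"/>
        <sch:assert test="not( contains( ., '#' ) )">
          The url= of &lt;moduleRef&gt; should refer to a file, not to an
          element within a file.
        </sch:assert>
        <sch:assert test="doc-available( $target )">
          Cannot read “<sch:value-of select="$target"/>” as well-formed XML.
        </sch:assert>
        <!-- Only look at the root element if the file could be read;
             otherwise the previous assertion has already reported -->
        <sch:assert test="if ( doc-available( $target ) )
                          then exists( doc( $target )/rng:grammar )
                          else true()">
          The outermost element of “<sch:value-of select="$target"/>” is not
          &lt;rng:grammar&gt;.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```

The if/then/else guard is used instead of a plain ‘and’ because XPath does not promise the order in which the operands of ‘and’ are evaluated, and doc() raises an error on an unreadable file.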
The ref attribute of <persName> should typically point to a <person> element, which will often be in a separate ‘personography’ file. Presuming that file is in a known location in the local filesystem, Figure 10, Require persName to refer to a person in the local personography can be used to test that the ref attribute refers to a <person> in that file.
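Such a test might be sketched as follows. The filename personography.xml (taken here to sit in the same directory as the document being validated) is purely hypothetical, as is the ident. We also assume that ref has been restricted to a single pointer and that the processor recognizes xml:id as an ID in the personography file.

```xml
<elementSpec ident="persName" module="namesdates" mode="change">
  <constraintSpec ident="persName-ref-to-person" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:rule context="tei:persName/@ref">
        <!-- Split the pointer into its document part and its fragment -->
        <sch:let name="file" value="substring-before( ., '#' )"/>
        <sch:let name="frag" value="substring-after( ., '#' )"/>
        <!-- Hypothetical location of the project's personography -->
        <sch:let name="target"
                 value="resolve-uri( 'personography.xml', base-uri( .. ) )"/>
        <sch:assert test="$file eq 'personography.xml'">
          The ref= of &lt;persName&gt; should point into personography.xml,
          but is “<sch:value-of select="."/>”.
        </sch:assert>
        <!-- The two-argument form of id() searches the named document -->
        <sch:assert test="if ( doc-available( $target ) )
                          then exists( id( $frag, doc( $target ) )/self::tei:person )
                          else false()">
          No &lt;person&gt; with xml:id “<sch:value-of select="$frag"/>” could
          be found in <sch:value-of select="$target"/>.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```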
Checking remote pointers is in principle very similar to checking local pointers, but with different tests on the syntax of the URI. (They are also, of course, a bit harder to test because you must have a working internet connection (or alternatively use an XML Catalog), and have to worry about firewalls, proxies, same-source problems, cached files, etc.) For example, Figure 11, Ensure uri of equiv is present and refers to an item on a particular page ensures that any <equiv> element has a uri attribute, and further that said attribute refers to the (fictional) ‘markup_taxonomy’ page of the WWP website, allowing a reference to that page on either the production or test version of the website, and allowing access with or without specifying a secure connection. Note that this example only checks the syntax of the uri attribute, and does not ensure that there actually is such a page or specific element on that page.
The example in Figure 12, Require a filter on equiv that points to an XSLT program, on the other hand, ensures that the filter attribute of <equiv> points to an XSLT program. It does this by testing that either the namespace of the outermost element of the retrieved file is the XSLT namespace (because the outermost element of a ‘normal’ XSLT program could be either <xsl:stylesheet> or <xsl:transform>), or that the outermost element has an xsl:version attribute (because a simplified stylesheet must have such an attribute). Thus, in order for a filter attribute to pass this test, it not only needs the right syntax, but it must also point to an XSLT program that exists and is accessible via the web.
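The test just described could be sketched like this (our illustration, with an invented ident and variable names). To avoid having to declare an xsl: prefix inside Schematron that is itself likely to be compiled to XSLT, the sketch compares namespace URIs as strings.

```xml
<elementSpec ident="equiv" module="tagdocs" mode="change">
  <constraintSpec ident="equiv-filter-is-xslt" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:rule context="tei:equiv/@filter">
        <sch:let name="xslns" value="'http://www.w3.org/1999/XSL/Transform'"/>
        <sch:let name="target" value="resolve-uri( ., base-uri( .. ) )"/>
        <sch:assert test="doc-available( $target )">
          Cannot retrieve “<sch:value-of select="$target"/>” as XML.
        </sch:assert>
        <!-- Pass if the root element is in the XSLT namespace (stylesheet or
             transform), or if it bears an xsl:version attribute (simplified
             stylesheet); skip if the file could not be read at all -->
        <sch:assert test="if ( doc-available( $target ) )
                          then exists( doc( $target )/*[
                                 namespace-uri( . ) eq $xslns
                                 or @*[ local-name( . ) eq 'version'
                                        and namespace-uri( . ) eq $xslns ] ] )
                          else true()">
          The filter= of &lt;equiv&gt; should point to an XSLT program.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```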
If a pointer attribute may have multiple values, testing is mildly more difficult because the attribute needs to be parsed first. In the general case, delivering a precise error message is quite difficult, as the entire process needs to be handled in XPath because Schematron does not have an iteration construct. However, specific cases may be reasonably easy to handle. Figure 13, rendition points to 1 or 2 renditions tests that a rendition attribute refers to at most 2 <rendition> elements in the same file. This is not particularly difficult because of the restriction that there are at most 2 pointers.
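A sketch of the one-or-two case follows. Since rendition is a global attribute, the <constraintSpec> is shown on its own; where to place it (for example in an <elementSpec> or a <classSpec>) is left to the project. The ident is hypothetical, and only same-document shorthand pointers are handled.

```xml
<constraintSpec ident="rendition-1-or-2" scheme="schematron">
  <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <sch:rule context="tei:*/@rendition">
      <!-- Parse the value into its whitespace-separated pointers -->
      <sch:let name="ptrs" value="tokenize( normalize-space( . ), ' ' )"/>
      <sch:assert test="count( $ptrs ) = ( 1, 2 )">
        The rendition= attribute should have 1 or 2 pointers, but has
        <sch:value-of select="count( $ptrs )"/>.
      </sch:assert>
      <!-- Each pointer must be '#' plus the ID of a <rendition> element -->
      <sch:assert test="every $p in $ptrs satisfies
                        ( starts-with( $p, '#' )
                          and exists( id( substring( $p, 2 ) )/self::tei:rendition ) )">
        Each pointer in rendition= (“<sch:value-of select="."/>”) should
        refer to a &lt;rendition&gt; in this file.
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>
```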
But in many, if not most, cases there is no such restriction. For example, there may be dozens of witnesses to a particular manuscript manifestation. Figure 14, Check that each pointer in wit points to witness tests that each pointer in the value of a wit refers to a <witness> element in the same file. In addition to tests similar to the previous example, this example reports to the user what each pointer in a failed value is, in fact, pointing to.
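The unrestricted case can be handled by iterating within XPath itself. The sketch below is simpler than what is described above: it reports which pointers fail, but not what they point to. As before, the ident is invented, only same-document shorthand pointers are considered, and placement of the <constraintSpec> is left to the project.

```xml
<constraintSpec ident="wit-points-to-witness" scheme="schematron">
  <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <sch:rule context="tei:*/@wit">
      <!-- Collect every pointer that does NOT resolve to a <witness> -->
      <sch:let name="bad"
               value="for $p in tokenize( normalize-space( . ), ' ' )
                      return if ( starts-with( $p, '#' )
                                  and exists( id( substring( $p, 2 ) )/self::tei:witness ) )
                             then ()
                             else $p"/>
      <sch:assert test="empty( $bad )">
        These pointers in wit= do not refer to a &lt;witness&gt; in this file:
        <sch:value-of select="string-join( $bad, ', ' )"/>
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>
```

The ‘for’ expression does not change the context item, so id() still searches the document that contains the wit attribute.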
We have already demonstrated a methodology for ensuring that the ref of <persName> points to a <person>. One recommended method for encoding an ambiguous reference to a person is to use the <alt> element — the encoding of the ambiguous reference itself points to an <alt> element which, in turn, points to each of the possible <person>s. (See https://wwp.northeastern.edu/outreach/seminars/_current/presentations/contextual_encoding/advanced_context_09.xhtml for a sample encoding.) Similarly, a reference to multiple individuals could be encoded as multiple pointers on a single ref, or could be encoded as a single pointer on the ref that points to a <link> element which, in turn, points to each <person> referred to. One advantage to this latter method is that you can restrict the ref of <persName> to one and only one pointer, and use similar constraints for both ambiguous and multiple references. Figure 15, Required ref of persName eventually refers to person demonstrates such constraints.
A similar set of constraints can be expressed in a simpler, perhaps easier to follow, way by splitting them up as constraints on the ref of <persName> and the target of <link> and <alt> separately, as demonstrated in Figure 16, Required ref of persName eventually refers to person using abstract patterns. Because the code for <alt> and <link> is somewhat complicated, this technique expresses that code only once as an abstract Schematron pattern, which is instantiated separately, once for <alt> and once for <link>.
The TEI provides a method of indirection for both shortening URLs and having a single place to change a set of URIs. This mechanism makes use of a local URL prefix and a definition of how that prefix is mapped to a full URI which is expressed in a <prefixDef> element. Creating pointers using this method is simpler, shorter, easier to read (and thus proofread), and reduces the chance for errors in the first place. Checking pointers that use this method, however, is significantly more difficult. As with several other types of pointer checking, it is the general case that is most difficult; particular cases may be reasonably easy.
Here we will limit ourselves to the simple (and to our knowledge by far the most common) case in which the matchPattern follows the shorthand pointer syntax or a subset thereof, and the replacementPattern follows the syntax of a URL except that it ends with #$1; i.e., the case in which the matched bit is the shorthand pointer, as seen in an example from the Guidelines. While the example in Figure 17, Resolve prefixDef for references to people limits itself to <prefixDef>s of this sort, it does not limit itself to any particular prefix. The document is searched for possible prefix values.
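A much-reduced sketch of this kind of resolution is given below. It makes several assumptions beyond those stated above: that every ref containing a colon uses a private prefix (so ordinary http: URIs would need to be excluded first), that ref holds a single pointer, that the first <prefixDef> with a matching ident is the one to use, and that its matchPattern is a valid XPath regular expression that does not match the empty string. All ident and variable names are invented.

```xml
<elementSpec ident="persName" module="namesdates" mode="change">
  <constraintSpec ident="persName-ref-via-prefixDef" scheme="schematron">
    <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:rule context="tei:persName/@ref[ contains( ., ':' ) ]">
        <sch:let name="prefix" value="substring-before( ., ':' )"/>
        <!-- Search the document for a definition of this prefix -->
        <sch:let name="def" value="( //tei:prefixDef[ @ident eq $prefix ] )[1]"/>
        <!-- Apply the definition to the part after the colon -->
        <sch:let name="expanded"
                 value="if ( $def )
                        then replace( substring-after( ., ':' ),
                                      concat( '^', $def/@matchPattern, '$' ),
                                      $def/@replacementPattern )
                        else ''"/>
        <sch:let name="file"
                 value="resolve-uri( substring-before( $expanded, '#' ), base-uri( .. ) )"/>
        <sch:assert test="$def">
          No &lt;prefixDef&gt; found for the prefix
          “<sch:value-of select="$prefix"/>”.
        </sch:assert>
        <!-- Skip if there is no definition: the assertion above reports that -->
        <sch:assert test="if ( $def )
                          then ( if ( doc-available( $file ) )
                                 then exists( id( substring-after( $expanded, '#' ),
                                                  doc( $file ) )/self::tei:person )
                                 else false() )
                          else true()">
          “<sch:value-of select="."/>” expands to
          “<sch:value-of select="$expanded"/>”, which does not point to a
          &lt;person&gt;.
        </sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```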
It is often the case that we know certain tests will fail before XInclude processing, whereas we hope they would succeed after XInclude processing.[4] For example, imagine that at our ‘The Papers of Dr. Virgil Swann’ project the extant documents to be encoded (which include letters, scientific notebooks, satellite schematics, a dictionary, and translations of intercepted radio transmissions) have been categorized using a project-specific taxonomy. This taxonomy is encoded as a <taxonomy> element. Since each TEI document refers to this taxonomy (from its /TEI/teiHeader/profileDesc/textClass/catRef/@target), the project has chosen to have a copy of the entire project taxonomy in each TEI document’s header (in its /TEI/teiHeader/encodingDesc/classDecl). In order to avoid multiple copies of the same information, the project has chosen to store the <taxonomy> in a separate document and use XInclude to insert it into each TEI header.[5]
Given this situation, imagine now that the project wishes to check that the target attributes actually point to one or more <category> elements. This is problematic, because the <category> elements are not actually in the file as it sits unprocessed. There are a variety of ways this could be handled, probably the easiest of which is simply to skip the test (and warn the user it is being skipped) if XInclude processing has not yet taken place.
There are two straightforward methods of asking the question ‘has XInclude processing taken place yet?’. The first relies on the fact that after XInclude processing, no elements from the XInclude namespace should remain: they should have become the file to be included or, in case of error, the contents of the <xi:fallback>. This method is exemplified in Figure 18, XIncluded yet? — one size fits all.
The second relies on the fact that before XInclude processing the file does not have a <taxonomy>, and after XInclude processing it does. This method is exemplified in Figure 19, XIncluded yet? — per-test reporting.
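Both methods can be sketched briefly; these are our illustrations, with hypothetical ident values, and they are shown as free-standing <constraintSpec> fragments whose placement is left to the project. The second sketch relies on the Schematron behaviour that, within one pattern, a node is processed only by the first rule whose context matches it. It also assumes that target contains only same-document pointers.

```xml
<!-- Method 1: warn if any element in the XInclude namespace remains -->
<constraintSpec ident="xincluded-yet" scheme="schematron">
  <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <sch:ns prefix="xi" uri="http://www.w3.org/2001/XInclude"/>
    <sch:rule context="/*">
      <sch:report test="//xi:*" role="warning">
        XInclude processing does not appear to have taken place; some
        pointer tests may fail or be skipped.
      </sch:report>
    </sch:rule>
  </constraint>
</constraintSpec>

<!-- Method 2: skip the test (with a warning) if there is no <taxonomy> yet -->
<constraintSpec ident="catRef-target-to-category" scheme="schematron">
  <constraint xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <sch:pattern>
      <!-- This rule matches first when no <taxonomy> is present -->
      <sch:rule context="tei:catRef/@target[ not( //tei:taxonomy ) ]">
        <sch:report test="true()" role="warning">
          No &lt;taxonomy&gt; found (XInclude not yet processed?); the target=
          of &lt;catRef&gt; is not being checked.
        </sch:report>
      </sch:rule>
      <!-- Otherwise every pointer must resolve to a <category> -->
      <sch:rule context="tei:catRef/@target">
        <sch:assert test="every $p in tokenize( normalize-space( . ), ' ' )
                          satisfies exists( id( substring-after( $p, '#' ) )/self::tei:category )">
          Each pointer in target= (“<sch:value-of select="."/>”) should refer
          to a &lt;category&gt;.
        </sch:assert>
      </sch:rule>
    </sch:pattern>
  </constraint>
</constraintSpec>
```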
file: scheme is possible (in Figure 20, Test for any local URI) and the following comment on the general syntax of a file: scheme URI. Said format is file://host/path, where the //host portion is optional (defaulting to ‘localhost’), or may be expressed as just // (again defaulting to ‘localhost’). Thus file:/path and file:///path are both perfectly acceptable ways to refer to the file found at path on the local filesystem.