Lexicons

1. Lexicon

1.1. Lexicon Goals

The lexicons defined by graphein for use in my Notebooks are just personal lexicons for my own study of languages. They're not substitutes for actual scholarly lexicons. Rather, they're more like simple wordlists, with a few linguistic annotations which assist me in using them and in writing simple programs to produce, for example, sorted or subsetted wordlists from them.

1.2. Encoding Lexicon Entries

I'll extend the conventional lexicographer's term "headword" to mean the full minimal form that identifies a word in traditional dictionaries. For example, "ἀγορά [ᾰγορα̅], ἀγορᾶς, ἡ" is, in my lexicons, a "headword." "Heading identifier" or "entry identifier" would be better terms.

I'll encode headwords in Unicode UTF-8, including (in Greek) breathing and accent marks, but not including (in Greek at least) any short or long (macron) diacritical marks. In Greek, at least, I'll use precomposed characters for breathings and accents. I'll write out the versions of the word in full.

Optionally, I'll indicate long and short vowels in additions in [square brackets] after a sub-word of the headword. If it would get repetitive, I'll do this only after the first example. This is an extension of the practice of Liddell and Scott, who identify fragments of the word in this way. Thus, where Liddell and Scott have "ἀγορά [ᾰγ], ᾶς, ἡ" I have "ἀγορά [ᾰγορα̅], ἀγορᾶς, ἡ".

I'll use combining diacritical marks to indicate length: U+0305 ("COMBINING OVERLINE") for the long mark or macron, U+0306 ("COMBINING BREVE") for the short mark, and U+0309 ("COMBINING HOOK ABOVE") to mark vowels I'm unsure of (because it looks a bit like a question mark). The use of the breve and the overline/macron is relatively standard. The use of the "hook above" to indicate unknown length is entirely nonstandard.

I should emphasize that when I indicate that a length is unknown, I mean only that I don't know it. I don't mean that it is in general unknown to a better scholar than me, much less that it is unknowable. By the same token, when I indicate a known length, I'm not implying that I know this through any analysis of my own. I've simply copied this information from one or more of the standard sources.
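The three length marks can be handled mechanically. The following is a minimal Python sketch, not part of the lexicon tooling itself; the names LENGTH_MARKS, mark_length and strip_length_marks are invented for illustration. It attaches a mark to a vowel, strips the marks again, and prints the Unicode name of each mark.

```python
import unicodedata

# The three combining marks named in the text (the dict itself is illustrative).
LENGTH_MARKS = {
    "long": "\u0305",     # COMBINING OVERLINE
    "short": "\u0306",    # COMBINING BREVE
    "unknown": "\u0309",  # COMBINING HOOK ABOVE (nonstandard use)
}


def mark_length(vowel, length):
    """Return the vowel followed by the combining mark for the given length."""
    return vowel + LENGTH_MARKS[length]


def strip_length_marks(text):
    """Remove only the three length marks; accents and breathings are kept."""
    marks = set(LENGTH_MARKS.values())
    return "".join(ch for ch in text if ch not in marks)


if __name__ == "__main__":
    for label, mark in LENGTH_MARKS.items():
        print(label, f"U+{ord(mark):04X}", unicodedata.name(mark))
    # Greek small alpha (U+03B1) marked long, then stripped again.
    long_alpha = mark_length("\u03b1", "long")
    print(len(long_alpha), strip_length_marks(long_alpha) == "\u03b1")
```

Note that this sketch only removes combining characters. A precomposed vowel that already carries its length mark would need Unicode normalization first.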

I identify the <entry> in an "xml:id" attribute, the value of which is the first word in its "headword" (for example, just "ἀγορά", not "ἀγορά [ᾰγορα̅], ἀγορᾶς, ἡ"). I suppose I may encounter ambiguous situations where two headwords so abbreviated would be identical. When this happens, I'll extend the xml:id values on words I encounter subsequently to some non-ambiguous value. (Thus the result depends upon the order in which I create the lexicon. This dependency is necessary so as not to break actual uses of the lexicon data via XInclude.) In creating these xml:id values, I'll omit the [length information in square brackets]. Thus, if the "ἀγορά" entry were to turn out to be ambiguous, I'd extend it to "ἀγορά, ἀγορᾶς" rather than "ἀγορά [ᾰγορα̅], ἀγορᾶς".
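The first step of this naming rule, and the check for ambiguous cases, can be sketched in Python. This is only an illustration under assumptions: the function names strip_length_info, base_id and collisions are hypothetical, and the sketch reports collisions without choosing the extended identifier, since that choice depends on the order in which entries were created.

```python
import re
from collections import defaultdict


def strip_length_info(headword):
    """Drop any [bracketed length annotation] together with the space before it."""
    return re.sub(r"\s*\[[^\]]*\]", "", headword).strip()


def base_id(headword):
    """First comma-separated word of the headword, without length information."""
    return strip_length_info(headword).split(",")[0].strip()


def collisions(headwords):
    """Map each base id shared by several headwords to those headwords."""
    groups = defaultdict(list)
    for headword in headwords:
        groups[base_id(headword)].append(headword)
    return {key: group for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    print(base_id("ἀγορά [ᾰγορα̅], ἀγορᾶς, ἡ"))
    # Two made-up headwords sharing a first word, to show the ambiguity check.
    print(collisions(["foo, foos", "foo, fooi", "bar, bars"]))
```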

TEI P5 allows xml:lang, but only to specify RFC 3066 language identifiers for element content. This isn't quite what is needed. What is needed is something that will distinguish languages and dialects within the present lexicon and between lexicons (e.g., Attic vs. Ionic vs. Latin vs. Vedic Sanskrit). So I'll just use a <note>.

Note the inflected forms only for the minimal set (e.g., for Greek nouns, the nominative singular and genitive singular; for Greek verbs, the principal parts).

1.3. Further Encoding for Wordlist Entries

Ideally, there would be no need for any further work to get "wordlist" information into this lexicon and out of it. After all, what more could there be in a wordlist than there is in a lexicon? Unfortunately, the tools get in the way.

The wordlist data is used outside the lexicon itself, in other TEI-encoded documents in this Notebook. Because the TEI is a markup language defined in XML, there are two ways to handle external materials.

They may be encoded as XML external entities. However, an XML external entity has no internal structure per se (other than whatever markup is contained within it). So this would involve importing the whole lexicon, or splitting the lexicon into separate files for each entry, or perhaps even separate files for each field of each entry (but then it would lose the structure of the entry). So I opted against this.

The other way is to use the XML-related "XInclude" facility. Unfortunately, XInclude (like many of the alphabet soup of standards surrounding XML) has its difficulties. Further, the actual implementations of XInclude in the tools I'm using are flawed. (They're even more flawed in some other tools that I'm not using, of course.) Though the TEI assumes the full implementation of XInclude, I haven't got such a full implementation available to me. This has led to some "workarounds" where there are aspects of the markup forced by the tools. This should not be, but is.

The first problem is that the part of XInclude available to me for specifying sections of a target document consists of only the xpointer="element(id)" feature. This allows me to point into another document, but only to an individual named element. (It's named using the "xml:id" attribute.) So at the pointer end I would write, e.g.:

 
<xi:include 
xmlns:xi="http://www.w3.org/2001/XInclude" 
href="../lexicon/index.tei" 
xpointer="element(ἀγορά-wordlist-headword)" 
parse="xml"/>

At the pointee end, I would write, e.g.:

 
<entry xml:id="ἀγορά" type="main"> 
<form type="simple" xml:id="ἀγορά-wordlist-headword"> 
<orth>ἀγορά, ἀγορᾶς, ἡ [ᾰγ]</orth> 
</form>

That is, I cannot point to the <entry> element and navigate down from it to the actual sub-element I want. Instead, I must tag the sub-element directly.

The second problem is that there is no way to include only the contents of an element rather than the full tagged element. So actually the fragment above won't work. If I do it, I'll get:

 
<form type="simple" xml:id="ἀγορά-wordlist-headword"> 
<orth>ἀγορά, ἀγορᾶς, ἡ [ᾰγ]</orth> 
</form>

XIncluded into my document, tagging and all. This is a problem because the <form> tag won't be valid in this position of use.
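The difference between what an element(id) pointer delivers and what a wordlist wants can be shown with a small Python sketch. It is not an XInclude processor: it only emulates the lookup by xml:id with the standard library's ElementTree, and resolve_element_pointer is a name invented here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
WORD = "ἀγορά"

# A cut-down copy of the pointee fragment shown above.
LEXICON = f"""<entry xml:id="{WORD}" type="main">
<form type="simple" xml:id="{WORD}-wordlist-headword">
<orth>{WORD}, ἀγορᾶς, ἡ [ᾰγ]</orth>
</form>
</entry>"""


def resolve_element_pointer(root, target_id):
    """Return the element whose xml:id equals target_id, or None."""
    for element in root.iter():
        if element.get(XML_ID) == target_id:
            return element
    return None


root = ET.fromstring(LEXICON)
target = resolve_element_pointer(root, f"{WORD}-wordlist-headword")

# What an element(id) inclusion delivers: the element, tags and all.
included = ET.tostring(target, encoding="unicode").strip()
print(included)

# What the wordlist actually wants: only the text inside the element.
wanted = "".join(target.itertext()).strip()
print(wanted)
```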

The only workaround I can figure out is ugly: to duplicate in the lexicon entry the information that I want to use in wordlists. Thus, it is necessary to use an element of a type which can fit nearly anywhere. There aren't many of these within the <entry> element. So I'll just use a <note>, with an appropriate (made-up) attribute value.

 
<entry xml:id="ἀγορά" type="main"> 
<note type="wordlist-headword" xml:id="ἀγορά-headword">ἀγορά, ἀγορᾶς, ἡ [ᾰγ]</note>

The third problem follows from three facts: XInclude element(id) uses the xml:id attribute to identify an element; by definition (reasonable in its own context) every xml:id value must be unique in a document; and the XIncluded element carries its xml:id attribute with it. The XIncluded element can therefore be XIncluded only once in any document. So, for example, if I want to refer in the text of a document to "ἀγορά, ἀγορᾶς, ἡ [ᾰγ]" and also to include this in a wordlist within the same document, only one of these two uses can actually be an XInclude from the lexicon. This very nearly defeats the purpose of the whole enterprise, and is at the very least an example of a situation where the limitations of the tool will influence the contents of that which is written using it.
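The uniqueness clash is easy to reproduce. The Python sketch below assumes nothing about any particular XInclude tool: it simply copies the same <note> into a stand-in document twice, as two inclusions would, and lists the xml:id values that then occur more than once. The helper name duplicate_ids is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET
from collections import Counter

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
WORD = "ἀγορά"


def duplicate_ids(root):
    """Return the sorted xml:id values that occur more than once under root."""
    ids = (element.get(XML_ID) for element in root.iter())
    counts = Counter(value for value in ids if value is not None)
    return sorted(value for value, count in counts.items() if count > 1)


note = ET.fromstring(
    f'<note type="wordlist-headword" xml:id="{WORD}-headword">'
    f"{WORD}, ἀγορᾶς, ἡ [ᾰγ]</note>"
)

document = ET.Element("body")          # stand-in for the including document
document.append(copy.deepcopy(note))   # first use: in the running text
after_first = duplicate_ids(document)
document.append(copy.deepcopy(note))   # second use: in a wordlist
after_second = duplicate_ids(document)

print(after_first)    # no duplicates while the note appears once
print(after_second)   # the note's xml:id is reported once it appears twice
```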

This third problem could be addressed by a better XInclude which would either (a) allow me to point inside an element to its contents specifically (and in general navigate around the element hierarchy in an XPath style), and/or (b) allow me to transform the contents of an XIncluded entity's attributes on the fly.

Finally, avoid whitespace between the tags and contents of the "wordlist" notes, as it'll show up in the output and isn't always handled in ways which might be expected.
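A short Python sketch shows why the whitespace matters. It uses ElementTree as an example parser only; other tools may treat the whitespace differently, which is part of the problem.

```python
import xml.etree.ElementTree as ET

# The note written tightly, with no whitespace inside the tags.
compact = ET.fromstring(
    '<note type="wordlist-sense">marketplace and/or meetingplace</note>'
)

# The same note laid out over several lines with indentation.
spaced = ET.fromstring(
    '<note type="wordlist-sense">\n'
    "    marketplace and/or meetingplace\n"
    "</note>"
)

# repr() makes the leading and trailing whitespace visible.
print(repr(compact.text))
print(repr(spaced.text))
```

The parser keeps the line breaks and indentation as part of the text, so they travel along with any inclusion of the note.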

1.4. Lexicon Example

 
<entry xml:id="ἀγορά" type="main"> 
<note type="wordlist-headword" xml:id="ἀγορά-headword">ἀγορά, ἀγορᾶς, ἡ [ᾰγ]</note> 
<note type="wordlist-sense" xml:id="ἀγορά-wordlist-sense">marketplace and/or meetingplace</note> 
<form type="simple"> 
<orth>ἀγορά, ἀγορᾶς, ἡ [ᾰγ]</orth> 
</form> 
<form type="inflected"> 
<orth type="nominative singular">ἀ̆γορά</orth> 
<orth type="genitive singular">ἀ̆γορᾶς</orth> 
</form> 
<gramGrp> 
<pos>noun</pos> 
<gen>feminine</gen> 
<iType>first declension, subdivision 1</iType> 
</gramGrp> 
<sense n="simplistic"> 
<def>marketplace and/or meetingplace</def> 
</sense> 
<note type="in-vocabulary-of">cheadle1</note> 
<note type="in-vocabulary-of">mastronarde</note> 
<note type="language-or-dialect">attic</note> 
</entry>

ἀγορά, ἀγορᾶς, ἡ [ᾰγ]. ἀ̆γορά, ἀ̆γορᾶς. marketplace and/or meetingplace. C1 M
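As an illustration of the "simple programs" such entries are meant to support, here is a hedged Python sketch that builds a sorted, optionally subsetted wordlist from entries shaped like the example above. The second entry, its "hypothetical-source" vocabulary value, and the names sort_key and wordlist are all invented for the sketch. The sort key (strip diacritics, then compare) is a rough approximation and not a proper Greek collation.

```python
import unicodedata
import xml.etree.ElementTree as ET

WORD = "ἀγορά"

# Two cut-down entries; the first is a hypothetical placeholder entry.
LEXICON = f"""<div>
<entry xml:id="placeholder" type="main">
<note type="wordlist-headword">λόγος, λόγου, ὁ</note>
<note type="wordlist-sense">word, speech</note>
<note type="in-vocabulary-of">hypothetical-source</note>
</entry>
<entry xml:id="{WORD}" type="main">
<note type="wordlist-headword">{WORD}, ἀγορᾶς, ἡ [ᾰγ]</note>
<note type="wordlist-sense">marketplace and/or meetingplace</note>
<note type="in-vocabulary-of">cheadle1</note>
<note type="in-vocabulary-of">mastronarde</note>
</entry>
</div>"""


def sort_key(text):
    """Decompose, drop combining marks (accents, breathings), and lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def wordlist(root, vocabulary=None):
    """Return sorted (headword, sense) pairs, optionally limited to one vocabulary."""
    rows = []
    for entry in root.iter("entry"):
        sources = {n.text for n in entry.findall("note[@type='in-vocabulary-of']")}
        if vocabulary is not None and vocabulary not in sources:
            continue
        head = entry.findtext("note[@type='wordlist-headword']") or ""
        sense = entry.findtext("note[@type='wordlist-sense']") or ""
        rows.append((head, sense))
    return sorted(rows, key=lambda row: sort_key(row[0]))


root = ET.fromstring(LEXICON)
for head, sense in wordlist(root):
    print(f"{head}. {sense}.")
```

Calling wordlist(root, "cheadle1") restricts the list to the entries marked as belonging to that vocabulary.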