Lexicon File Format
About Lexicon File Format
When preparing a lexicon file, each line must follow the correct syntax so that the NLP engine processes your new dictionary entries as intended. This section describes the purpose and specific syntax of each part of the lexicon file.
- All lexicon files must be saved in the DCT (dictionary) file format.
- The first line in a DCT file should denote the lexicon type. See Lexicon Types for more information.
- All columns in a DCT file should be tab-delimited.
Qtip: Tab-delimited means that columns are separated by a tab character (the Tab key), not by spaces. To preserve tab-delimited formatting, prepare your lexicon file in a text editor (like Notepad++ on Windows or TextEdit on Mac), press the Tab key between columns, and save the file as a DCT file type.
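To check tab-delimited formatting before uploading, a short script can split each row on tab characters. The following is a minimal Python sketch, not an official Qualtrics tool. The function name `parse_entry`, the returned dictionary keys, and the decision to ignore a trailing empty column are assumptions made for illustration. The accepted codes are the ones described under Column 2: Synonym Code.

```python
# Hypothetical helper: splits one lexicon row into its tab-delimited columns.
KNOWN_CODES = {"SYN", "CSYN", "MSYN"}  # codes described under Column 2


def parse_entry(line):
    """Return the columns of one lexicon row as a dictionary."""
    columns = line.rstrip("\r\n").split("\t")
    # Assumption: a trailing tab (empty last column) is harmless, so drop it.
    while columns and columns[-1] == "":
        columns.pop()
    if not 2 <= len(columns) <= 4:
        raise ValueError("expected 2 to 4 tab-delimited columns, got %d" % len(columns))
    variation, code = columns[0], columns[1]
    if code not in KNOWN_CODES:
        raise ValueError("unknown synonym code: %r" % code)
    # If the normal form is omitted, the variation serves as the normal form.
    normal_form = columns[2] if len(columns) > 2 and columns[2] else variation
    tags = columns[3] if len(columns) > 3 else ""
    return {"variation": variation, "code": code, "normal_form": normal_form, "tags": tags}
```

A row typed with spaces instead of tabs collapses into a single column, so `parse_entry` raises a `ValueError` for it. The first line of the file (the lexicon type) is not a four-column row and should be skipped before calling this helper.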
Column 1: Pattern Variation
The first column of the lexicon file contains the variations, or synonyms, that you wish to map to normal forms (or chiclets; see Column 3: Normal Form).
Each line in your lexicon file should contain exactly one variation. Additional variations should be placed on additional lines. If a word does not have any variations, then you do not need to define it in your lexicon, although doing so would not be detrimental.
A variation may include a common misspelling, abbreviation, acronym, or potential alternative name. Values in this column should always be lowercase except for case-sensitive entries and title case (see Column 2: Synonym Code).
For any entity with two or more words, you will need to define it in the dictionary in its standard form. This step is not necessary for single-word lexicon entries, as the NLP engine already tokenizes all single words that are processed. However, if a single-word entry takes on a new meaning when capitalized, define it on its own row. See the examples below.
Example:
harley davidson | CSYN | harley davidson | {SpeechPart="Noun"} |
harley | CSYN | harley davidson | {SpeechPart="Noun"} |
Example: If a single-word entry is case-sensitive, do include an initial entry mapping the variation to the normal form such as this example referring to the Los Angeles International Airport (LAX). It is capitalized here to disambiguate it from the adjective “lax.”
LAX | SYN | Los Angeles International Airport | {SpeechPart="Noun"} |
Special Characters
The first column can contain special characters such as hyphens, apostrophes, or pound signs. No special escape characters are necessary when using special characters in your lexicon. The same applies to letters with diacritics such as accent marks, tildes, circumflexes, and so on.
However, do consider that when the NLP engine parses special characters, it sees them as separate words:
- The phrase “~two days” is parsed as three words: “~,” “two,” and “days.” If you create a lexicon entry for “~two days,” it will not match your data. Instead the lexicon should read “~ two days” (note the space between “~” and “two”).
- The phrase ‘Total Recall’ (with quotes included) is parsed as four words. To capture this phrase, the lexicon entry should have a space between each quotation mark and the adjacent word and read as ‘ Total Recall ’.
Periods
When an acronym contains a period after each letter, no extra space is needed. For example, the acronym “b.o.a.” can be entered as is. However, when a lexicon entry ends with a single letter followed by a period, such as “John D.”, a space is needed before the period, so the entry becomes “John D .”
Hashtags and @ Mentions
When adding a new entry, the hashtagged (#) or mentioned (@) forms of a word are not automatically included. If you wish for these to be part of your lexicon entry, please make separate rows.
Example: #qualtrics and @qualtrics will require separate lines to map to “qualtrics.” Note that you do not need a space between # or @ and your term in these cases.
qualtrics | CSYN | qualtrics | {SpeechPart="Noun"} |
#qualtrics | CSYN | qualtrics | {SpeechPart="Noun"} |
@qualtrics | CSYN | qualtrics | {SpeechPart="Noun"} |
Diacritics
If your variation includes a diacritical mark, the dictionary will only recognize that specific variation. However, if your variation does not include a diacritical mark, the dictionary will recognize both the unmarked form and the marked form. In many cases, it is better to use the unmarked form as the variation because it also captures the word when users omit the diacritical mark. However, be cautious, as many words change meaning completely when a diacritical mark is added.
Example: The following entry will capture te and té. These words have different meanings in Spanish though! Te = you, té = tea.
te | CSYN | té | {SpeechPart="Noun"} |
The entry below, however, will capture only té and not te.
té | CSYN | té | {SpeechPart="Noun"} |
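The diacritics rule can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not the NLP engine's actual implementation. The function names `strip_diacritics` and `variation_captures` are hypothetical, and capitalization handling is deliberately left out.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, so that "té" becomes "te"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def variation_captures(variation, word):
    """Model whether a column-one variation captures a word found in the data."""
    variation = unicodedata.normalize("NFC", variation)
    word = unicodedata.normalize("NFC", word)
    if variation == strip_diacritics(variation):
        # Unmarked variation: captures both the unmarked and the marked form.
        return strip_diacritics(word) == variation
    # Marked variation: captures only that exact form.
    return word == variation
```

With this model, the variation te captures both te and té, while the variation té captures only té.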
Asterisks
When working with data that’s been redacted with asterisks, add spaces between the asterisks in your lexicon entry.
* * * * * * | CSYN | [Redacted] | {SpeechPart="Noun"} |
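The spacing rules for special characters in this section can be sketched as a small Python helper that prepares a column-one variation. This is an illustration under stated assumptions, not the engine's tokenizer: the function name `space_special_characters` is hypothetical, only tildes, asterisks, and quotation marks that wrap a word are separated, and the period rule is applied only to a final single letter such as "D.". Other special characters are left untouched.

```python
import re

# Quotation marks that are split off when they wrap a word (assumed set).
QUOTES = "\"'‘’“”"


def _split_chunk(chunk, is_last):
    """Split wrapping quotes, and a final single-letter period, into separate words."""
    leading, trailing = [], []
    while chunk and chunk[0] in QUOTES:
        leading.append(chunk[0])
        chunk = chunk[1:]
    while chunk and chunk[-1] in QUOTES:
        trailing.insert(0, chunk[-1])
        chunk = chunk[:-1]
    if is_last and re.fullmatch(r"[^\W\d_]\.", chunk):
        core = [chunk[0], "."]  # "D." becomes "D ."
    else:
        core = [chunk] if chunk else []  # "b.o.a." stays intact
    return leading + core + trailing


def space_special_characters(variation):
    """Insert the spaces a variation needs around special characters."""
    spaced = re.sub(r"([~*])", r" \1 ", variation)  # tildes and asterisks stand alone
    chunks = spaced.split()
    words = []
    for index, chunk in enumerate(chunks):
        words.extend(_split_chunk(chunk, index == len(chunks) - 1))
    return " ".join(words)
```

For example, `space_special_characters("~two days")` returns `~ two days`, while `#qualtrics` and `b.o.a.` are returned unchanged.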
Column 2: Synonym Code
The second column of the lexicon file contains the synonym code that tells the NLP engine how to read the variation written in column one.
There are several accepted codes:
- SYN: This is a Synonym. The SYN code tells the NLP engine that the variation in column one should be interpreted exactly as written. Capitalization is taken into account. This code is useful when creating lexicons based on acronyms in which the acronym takes on a new meaning when not capitalized such as the airport abbreviation LAX and the word lax.
- CSYN: Under certain circumstances, this is a Case-insensitive synonym. The CSYN code, when used with any word that begins with a lowercase letter, tells the NLP engine that the variation in column one should be interpreted without regard to capitalization. Use this code when you only want to capture the specific form of the word listed in the variation, and when the variation is not a standard dictionary term.
- CSYN: Under certain circumstances, this is a Title-case synonym. The CSYN code, when used with any word that begins with a capital letter, tells the NLP engine that the variation in column one should be interpreted without regard to capitalization with the key caveat that the first letter of the first word must begin with a capital letter. Use this code when you want to capture a proper noun which may be ambiguous when uncapitalized such as the company “Best Buy” or the “Great Value” brand products from Walmart. Using CSYN Title Case should be a rare occurrence. Lexicons should be designed to be as tolerant to nonstandard capitalization as possible. However, in certain cases, this syntax will be useful.
- MSYN: This is a Morph-insensitive synonym. The MSYN code tells the NLP engine to expand the lexicon entry to include its related morphological forms (for example, “jump” includes “jumps,” “jumping,” and “jumped”). XM Discover includes all morphological variations regardless of the part of speech that you specify, so you do not need additional rows in the lexicon file to handle these different verb forms. The “-er” and “-est” suffixes are not part of the same normal form and are not included in an MSYN expansion.
Qtip: The MSYN code will only work for standard dictionary terms. The XM Discover dictionaries may not know the correct morphological forms for proper nouns such as Qualtrics. Also note that when using MSYN, all variations will be considered case insensitively. Use the MSYN code whenever your lexicon includes dictionary terms in which you want to include alternative suffixes. This method will make your lexicon list shorter and more inclusive of language variations.
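The capitalization behavior of the SYN and CSYN codes can be summarized with a short Python sketch. It models only the rules described above and is not how the NLP engine is implemented. The function name `variation_matches` is hypothetical, and MSYN is intentionally unsupported because its expansion depends on the XM Discover dictionaries.

```python
def variation_matches(code, variation, text):
    """Model whether text from the data matches a column-one variation."""
    if code == "SYN":
        # Interpreted exactly as written; capitalization is taken into account.
        return text == variation
    if code == "CSYN":
        same_letters = text.lower() == variation.lower()
        if variation[:1].isupper():
            # Title-case CSYN: case-insensitive, but the first letter
            # of the first word must be a capital letter.
            return same_letters and text[:1].isupper()
        # Lowercase CSYN: fully case-insensitive.
        return same_letters
    raise ValueError("code not modeled in this sketch: %r" % code)
```

Under this model, a SYN entry for LAX does not match "lax", a CSYN entry for harley matches "HARLEY", and a CSYN entry for Best Buy matches "Best buy" but not "best buy".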
Column 3: Normal Form
The third column in the lexicon file can optionally contain the “normal form” of the word.
The normal form, or master token, is the version that will appear in Designer. This word or phrase should be the standard version of the variations that you have defined in column one. The normal form should be repeated on subsequent lines in your lexicon file for each corresponding variation in column one.
The NLP engine automatically capitalizes the normal form when the lexicon dictionary is processed. As a result, it is not case-sensitive. If this column is omitted, the variation in column one will be assigned as the normal form.
Like column one, column three can contain special characters such as hyphens, apostrophes, or pound signs. No special escape characters are necessary when using special characters in your lexicon. The same applies to letters with diacritics such as accent marks, tildes, circumflexes, and so on.
Column 4: Tags
The fourth column of the lexicon file provides a place for you to define grammatical attributes for your specific lexicon entry.
In most cases, you will only need to indicate the SpeechPart in column 4. However, in some cases you may also want to specify the degree, tense, and so on. This point is especially true with non-English lexicons that require agreement in case, gender, number and so on between words. You may choose to add SemanticType where applicable. This metadata may be used in the future for Intelligent Entities.
One or many attributes can be defined for each lexicon entry. All attributes should be encapsulated in {curly brackets}. Each attribute value should be inside quotation marks. Multiple attributes are separated by a comma and a space.
stainless steel | MSYN | stainless steel | {SpeechPart="Noun", Sentiment="0"} |
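If you generate lexicon rows programmatically, the column 4 syntax can be assembled as in the following Python sketch. The function name `format_tags` is hypothetical, and the sketch covers only plain name and value pairs as described above.

```python
def format_tags(attributes):
    """Build a column 4 value from a mapping of attribute names to values."""
    # Each value goes inside straight quotation marks; attributes are
    # separated by a comma and a space, all inside curly brackets.
    parts = ['%s="%s"' % (name, value) for name, value in attributes.items()]
    return "{" + ", ".join(parts) + "}"
```

Calling `format_tags({"SpeechPart": "Noun", "Sentiment": 0})` produces `{SpeechPart="Noun", Sentiment="0"}`, which matches the example row above.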
The possible tags and their values are:
- Case (one of the following):
  - Undefined (default)
  - Nominative
  - Objective
  - Common
  - Possessive
- ControlFlags: Combination of the following values (semicolon-separated list):
  - Empty (default)
  - SubjectAnimate
  - SubjectInanimate
  - ObjectAnimate
  - ObjectInanimate
  - IndirectObjectAnimate
  - IndirectObjectInanimate
  - Infinitive
  - AdjectiveOrNoun
  - Adjective
  - PrepNoun
  - PrepAdj
  - ObjectSentence
  - SubjectSentence
  - SubjectInfinitive
  - AdverbModifier
  - ObjectVP
  - PhrasalVerb
  - ProperAdjective
- ControlPrepositions: A comma-separated list of prepositions. This attribute should be set if ControlFlags has one of the following values: PrepNoun, PrepAdj, or PhrasalVerb. By default it is empty.
  - Empty (default)
- Degree: One of the following strings:
  - Undefined (default)
  - Comparative
  - Superlative
- Gender: One of the following strings:
  - Undefined (default)
  - Masculine
  - Feminine
- Number: One of the following strings:
  - Undefined (default)
  - Singular
  - Plural
- Person: One of the following strings:
  - Undefined (default)
  - First
  - Second
  - Third
- PronounType: One of the following strings:
  - Undefined (default)
  - Personal
  - Possessive
  - Demonstrative
  - PossessiveAbsolute
  - Reflexive
  - Relative
- ProperType: One of the following strings:
  - No (default)
  - Unknown
  - Name
  - Surname
  - PersonName
  - Organization
  - Geography
- Semantic: Combination of the following values (semicolon-separated list):
  - Organization
  - Communication
  - Group
  - Act
  - Artifact
  - Location
  - Cognition
  - Relation
  - Time
  - Food
  - Substance
  - State
  - Process
  - Object
  - Possession
  - Phenomenon
  - Plant
  - Shape
  - Body
  - Person
  - Tops
  - Event
  - Attribute
  - Animal
  - Geography
  - Quantity
  - Feeling
  - Motive
- Sentiment: Integer value evaluating a measure of the corresponding word sentiment.
- SpeechPart: One of the following strings:
  - Unknown (default)
  - Adverb
  - Adjective
  - AdjectivePronoun
  - Pronoun
  - PronounInterrogative
  - Noun
  - Verb
  - ParticipleI
  - ParticipleII
  - Gerund
  - Aux
  - ModalVerb
  - Preposition
  - ConjunctionCoordinate
  - ConjunctionSubordinate
  - SentenceModifier
  - Partitive
  - Proform
  - Determiner
  - Introductory
  - NumeralCardinal
  - NumeralOrdinal
  - Particle
  - Article
  - InfinitiveMark
  - Special
  - Breaker
  - Delimiter
- Tense: One of the following strings:
  - Undefined (default)
  - PastSimple
  - PresentSimple
  - FutureSimple
  - PastContinuous
  - PresentContinuous
  - FutureContinuous
  - PastPerfect
  - PresentPerfect
  - FuturePerfect
  - PastPerfectContinuous
  - PresentPerfectContinuous
  - FuturePerfectContinuous
  - FutureInThePastSimple
  - FutureInThePastPerfect
  - FutureInThePastContinuous
  - FutureInThePastPerfectContinuous
  - Perfect
  - Continuous
  - Simple
  - PerfectContinuous
  - Indefinite
- Voice: One of the following strings:
  - Undefined (default)
  - Active
  - Passive
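Because several tags accept only one value from a fixed list, a quick check can catch typos before you upload a lexicon. The Python sketch below covers a handful of the single-value tags listed above. The names `SINGLE_VALUE_TAGS` and `invalid_tag_values` are hypothetical, and tags not included in the table are simply skipped rather than validated.

```python
import re

# Allowed values for a few single-value tags, copied from the list above.
SINGLE_VALUE_TAGS = {
    "Case": {"Undefined", "Nominative", "Objective", "Common", "Possessive"},
    "Degree": {"Undefined", "Comparative", "Superlative"},
    "Gender": {"Undefined", "Masculine", "Feminine"},
    "Number": {"Undefined", "Singular", "Plural"},
    "Person": {"Undefined", "First", "Second", "Third"},
    "Voice": {"Undefined", "Active", "Passive"},
}


def invalid_tag_values(tags):
    """Return (name, value) pairs whose value is not in the allowed list."""
    problems = []
    for name, value in re.findall(r'(\w+)="([^"]*)"', tags):
        allowed = SINGLE_VALUE_TAGS.get(name)
        if allowed is not None and value not in allowed:
            problems.append((name, value))
    return problems
```

An empty result means that none of the checked tags carries an unexpected value. It does not confirm that the whole column is valid.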
SpeechPart Tag
The SpeechPart tag defines when the lexicon should apply, not how it should be applied. By adding the SpeechPart="Noun" tag to a lexicon entry, you are telling the NLP engine to apply the lexicon when the term is used as a noun in any given sentence as detected by the NLP engine. This tag does not tell the NLP engine to set the lexicon as a noun. Be sure to define the correct part of speech when preparing your lexicon file.
The @match tag is a powerful syntax to use when you need to modify a standard word. When you add a lexicon, it adds an entry for the specific word to the XM Discover dictionaries that are installed with your account. When adding a brand new term such as “qualtrics,” which would not previously exist in the standard English dictionaries, the term receives one entry with the designated speech part. This entry will fire when the term is used as that part of speech in actual data.
For words that do already exist in the standard dictionaries, the lexicon entry will simply append another row to the dictionary for the designated part of speech. When the word occurs in your dataset, the NLP engine will determine its part of speech in that sentence and assign the corresponding linguistic attributes.
In some cases, adding a lexicon for a standard dictionary term will result in multiple entries with the same part of speech for a single word. When multiple entries with the same part of speech exist for a single word, the NLP engine may not assign the correct one. To avoid this issue, you can use the @match tag to override all pre-existing entries for that part of speech/word combination. In many cases, similar results can be achieved using positional exception rules with part of speech flags in Designer.
Example: By default, “issue” is listed as a neutral verb and a negative noun. However, you may want to override the negative noun with a neutral noun to account for cases like “issue of a magazine.” By using the @match tag, you tell the NLP engine to override any other entries for ISSUE as a noun with this entry which will set its sentiment to 0.
issue | MSYN | issue | {SpeechPart="Noun" @match, Sentiment="0"} |
Example: You discovered an error in which the adjective “stunning” was mapping to the verb form of “stun.” To change this to “stunning,” you can use SpeechPart="Adjective" with the @match tag to override the existing entry for “stunning” as an adjective.
stunning | CSYN | stunning | {SpeechPart="Adjective" @match} |
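The difference between appending an entry and overriding with @match can be pictured with a toy model in Python. This is a conceptual sketch only: the dictionary structure, the function name `add_lexicon_entry`, and the negative sentiment value of -1 are all assumptions for illustration and do not reflect the internal XM Discover data model.

```python
def add_lexicon_entry(dictionary, word, speech_part, attributes, match=False):
    """Add an entry for a word; with match=True, replace same-POS entries."""
    entries = dictionary.setdefault(word, [])
    if match:
        # @match: drop every pre-existing entry for this word/part of speech.
        entries[:] = [e for e in entries if e["SpeechPart"] != speech_part]
    # Without @match, the new entry is simply appended as another row.
    entries.append(dict(attributes, SpeechPart=speech_part))
    return dictionary


# Toy starting point: "issue" as a neutral verb and a negative noun.
standard = {
    "issue": [
        {"SpeechPart": "Verb", "Sentiment": 0},
        {"SpeechPart": "Noun", "Sentiment": -1},  # hypothetical value
    ]
}
add_lexicon_entry(standard, "issue", "Noun", {"Sentiment": 0}, match=True)
```

After the call, the model holds one verb entry and a single neutral noun entry for "issue". Without `match=True`, it would hold two competing noun entries, which is the ambiguity that @match is meant to avoid.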
Tips for Creating a Lexicon File
- Always write your lexicon variations (column one) in lowercase unless you have a specific use case for capitalization, such as an ambiguous acronym.
- If your lexicon is a single word, you likely do not need to define it by itself as the NLP engine will recognize it as an entity already. If your lexicon requires specific case sensitivity, then you will need to define it up front.
- Use MSYN when your lexicon contains standard dictionary terms. This will automatically include other word forms so that you do not have to create specific line items for each one.
- If you are uncertain whether your lexicon contains standard dictionary terms, use CSYN.
- If your lexicon entry contains a special character at the beginning or end of the word, your variation in column one should have a space between the character and the word. For example, a phrase in quotation marks such as “Black Friday” should be entered as “ Black Friday ” (note the spaces).
- Lexicons do not automatically include @ and # prefix variations. You should define these separately.
- Prepare your file in a text editor (like Notepad++ on Windows or TextEdit on Mac) and save the file as a DCT file type.
- If you are building a lexicon file on a Mac, be sure to use the Carriage Return Line Feed (CRLF) line break between rows. CRLF is the Windows line-ending convention and is readable on both Windows and Mac, whereas Mac text editors typically default to a lone Line Feed (LF). The distinction between line-ending styles is invisible in many text editors, including the TextEdit application that is native to macOS. We recommend using a downloadable application called TextWrangler. There is a setting at the bottom of this application that lets you select which line break style you want to use. Please select the Windows option before building your lexicon file.
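If you prefer to generate the file with a script, the tab and CRLF requirements can be enforced directly, regardless of operating system. The following is a minimal Python sketch. The function name `write_lexicon` is hypothetical, UTF-8 encoding is an assumption, and `header_line` stands for the lexicon type line, whose exact syntax is described under Lexicon Types and is not reproduced here.

```python
def write_lexicon(path, header_line, rows):
    """Write a lexicon file with tab-delimited columns and CRLF line breaks."""
    # newline="" disables newline translation, so the explicit "\r\n"
    # below is written unchanged on Windows, Mac, and Linux.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(header_line + "\r\n")
        for row in rows:
            handle.write("\t".join(row) + "\r\n")
```

Each row is a sequence of column values, for example `("harley", "CSYN", "harley davidson", '{SpeechPart="Noun"}')`, and the path should end in the DCT file extension.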