(c) 1995 RESNA Press. Reprinted with permission.

A LEXICAL DATABASE FOR INTELLIGENT AAC SYSTEMS

Wendy M. Zickus, Kathleen F. McCoy, Patrick W. Demasco, and Christopher A. Pennington
Applied Science and Engineering Laboratories, University of Delaware/A.I. duPont Institute
Wilmington, Delaware

Abstract

A typical non-computerized dictionary contains a wide range of information about words such as spelling, pronunciation, morphology, parts of speech, definitions, synonyms, antonyms, and other language features. The knowledge available in these dictionaries would be very useful for an intelligent AAC System; for instance, an AAC System that applies Natural Language Processing (NLP) in order to expand telegraphic messages. This project focuses on the development of a comprehensive language database that integrates several complementary lexical resources with a single unified programming interface. This database will be used in the development of several systems that employ natural language parsing and generation techniques.

Background/Motivation

The use of Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques in the development of AAC systems and devices continues to grow both in research laboratories and, more recently, in commercial products. The use of AI/NLP methods in any application area often requires significant language knowledge such as syntax and semantics (1). Within AAC, the need to support relatively unconstrained message production (in contrast to something such as a database query) requires that this knowledge be broad as well as detailed.

One example of an intelligent AAC technique is Compansion (2), an approach that takes telegraphic input from a user and expands it into a syntactically and semantically well-formed sentence.
The Compansion technique assumes a communication system based on words, pictures, or icons (i.e., non-spelling) and attempts to enhance the user's message production rate [FOOTNOTE: While the Compansion technique has been primarily described as a rate enhancement technique, it also has potential applications in helping users learn how to produce grammatical sentences.] by requiring only the selection of content words. One advantage of such a system is that it reduces the need to represent morphological information (e.g., verb inflections). This is potentially very beneficial for systems that use picture-based representations.

A major component of the Compansion system is the semantic parser, which takes a set of words and attempts to fit these items into a well-formed semantic structure, thus determining the intended meaning. In the current implementation, processing is non-incremental; all of the input words are taken together and a semantic representation is created that best accommodates the set of words as a whole. Generally there will be at most one word identified as the main verb in the input words; the parser must determine which semantic role is being played by each of the other words.

Consider the processing of the input "John break hammer". Once break is identified as the verb, the parser must decide which word of the input represents the agent (i.e., the person or thing doing the action), which represents the theme (i.e., the thing being acted upon), etc. (2). This information is represented in the semantic parser in the form of a case frame [FOOTNOTE: A semantic representation developed by Fillmore (3).]
for break (a simplified form of which is shown below):

    verb:        break
    agent:       [[human 3] [animate 2] [ergative 2]]
    theme:       [[physical 3] [object 1]]
    instrument:  [[tool_box 3] [tool 3] [solid 1]]
    goal:        [[human 3]]
    beneficiary: [[human 3] [organization 2]]
    location:    [[place 4]]

The above frame indicates that the agent role is preferred to be filled by a human, but that any animate object or ergative object (e.g., a car) would also be acceptable. The theme role is preferred to be filled by a physical object, but an abstract object could also serve as a filler (although less preferred). The basic idea of the semantic parser is to fit the non-verb words of the input into the case frame in the best way possible. In order to do this, the semantic parser must access type information associated with each word. For instance, it must be able to tell that John is a human and that hammer is not a human but a physical object. With this information the semantic parser can reason about the words of the input and generate the sentence "John breaks the hammer".

Statement of the Problem

One of the limitations of AAC devices today is the size of their dictionaries and the information available in them. While some may contain an adequate number of words, none of them contains sufficient information to do semantic and syntactic reasoning. For instance, the word information needed for the semantic parser described above is not generally available in current systems. In addition, while there is substantial interest in the development of natural language interfaces within the general software community, there are currently no lexical databases that provide both broad coverage (in terms of numbers of words) and sufficient depth of information (e.g., case frames) for individual words (1).

Fortunately, a variety of lexical resources are available, both commercially and from a number of research laboratories.
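As an aside on the semantic parser described in the Background section, the case-frame fitting can be illustrated with a minimal Python sketch. The frame for break is copied from the example above; everything else is an assumption made for illustration: the type sets for John and hammer, the additive scoring, the exhaustive search, and the tie-breaking by role order are not taken from the Compansion implementation.

```python
from itertools import permutations

# Simplified case frame for "break", copied from the example in the text.
BREAK_FRAME = {
    "agent": [("human", 3), ("animate", 2), ("ergative", 2)],
    "theme": [("physical", 3), ("object", 1)],
    "instrument": [("tool_box", 3), ("tool", 3), ("solid", 1)],
    "goal": [("human", 3)],
    "beneficiary": [("human", 3), ("organization", 2)],
    "location": [("place", 4)],
}

# Hypothetical type information; a real system would obtain it from a lexicon.
WORD_TYPES = {
    "John": {"human", "animate", "physical", "object"},
    "hammer": {"tool", "solid", "physical", "object"},
}


def role_score(frame, role, word, word_types=WORD_TYPES):
    """Best preference weight the word earns for this role (0 if no type fits)."""
    return max((w for t, w in frame[role] if t in word_types[word]), default=0)


def best_assignment(frame, words, word_types=WORD_TYPES):
    """Try every way of giving each word a distinct role; keep the best total.

    Assumption: ties are broken in favour of roles listed earlier in the
    frame, because only a strictly higher score replaces the current best.
    """
    best_roles, best_score = None, -1
    for roles in permutations(frame, len(words)):
        score = sum(role_score(frame, r, w, word_types) for r, w in zip(roles, words))
        if score > best_score:
            best_roles, best_score = roles, score
    return dict(zip(words, best_roles)), best_score


if __name__ == "__main__":
    # Prints: ({'John': 'agent', 'hammer': 'theme'}, 6)
    print(best_assignment(BREAK_FRAME, ["John", "hammer"]))
```

Several assignments tie at the top score here (for example John as agent and hammer as instrument), so the result depends on the assumed tie-breaking rule; a real parser would need further heuristics to choose among them.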
It would be advantageous to integrate these resources so that they could complement each other. This approach would allow a developer to extract desired information in a consistent, understandable and functional manner. This is the idea behind the Language Access Database (LAD).

Approach

The approach to designing LAD has been to create an implementation with C++ and Lisp [FOOTNOTE: In our laboratories, we often use Lisp to develop prototypes and C++ for commercial application development.] interfaces that allows a programmer to access several different databases (or lexical resources). The programmer is given as much or as little control as they need. For instance, they can simply query LAD about the frequency of a word, and LAD will return the frequency rating found for the most generally accepted meaning of that word in some default corpus. Alternatively, if the programmer prefers, they can specify a particular "sense" of the word they are interested in and which corpora they would like to use.

LAD accesses several different lexical resources. The most unusual of these is an on-line dictionary/thesaurus created at Princeton University called WordNet (4). It is WordNet that contains much of the semantic information needed for intelligent AAC applications.

WordNet

At first, one might think that a computer-based lexical resource ought to be set up just like a traditional dictionary. However, this approach has some limitations. One such shortcoming is that the information stored with a word is often incomplete. When one looks up a noun, for example platypus, one learns that it is a semiaquatic, egg-laying mammal, but unless one is an expert on mammals, there is no way other than by looking up mammal to find out if the platypus has hair. Dictionaries are ordered alphabetically and not grouped semantically, so such searches can be cumbersome.
This weakness in contemporary dictionaries demonstrates one of the major strengths of WordNet: its semantic and lexical relations. By using the WordNet on-line lexicon, it is easy to discover the attributes of a given noun by traversing the semantic relations of its superordinate term (i.e., its "parent" or category).

Another deficiency in contemporary dictionaries is the lack of information about coordinate terms (i.e., "sister" terms). Someone looking for information about other mammals would be forced to search the dictionary from beginning to end looking for terms that are classified as mammals. The prototypical lexical entry for a word points to its superordinate term, not laterally to its coordinate terms or downwards to its hyponyms (i.e., its "children" or subordinate terms). Again, these weaknesses are strengths of WordNet: its ability to reach related terms easily through its direct links to superordinate, coordinate and hyponymic terms makes searches for such information routine.

However, there are weaknesses in using only WordNet. Someone who needs phonetic information, morphological forms of a word, proper nouns, function words, or other terms outside the noun, verb, adjective and adverb categories must go to another source.

WordNet serves as a good foundation for developing a multi-purpose linguistic tool. Its breadth of coverage and sense information provide a wealth of lexical information. LAD is intended to utilize this knowledge and enhance it by using other database sources to create a centralized interface system that facilitates access to language.

Secondary Databases

Some of the other databases that LAD can access include an internally developed verb case frame database (where verb frames such as the one from our previous example are stored), a morphology database, a database containing phonetic information, a syllabification database, and statistical databases (e.g., frequency) derived from the Brown corpus and the Carterette corpus.
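To make the WordNet relations discussed above concrete (superordinate, coordinate and hyponymic links), here is a small Python sketch over a toy hierarchy. The data is a hand-made fragment for illustration only, not actual WordNet content, and the function names are hypothetical; they are not part of WordNet or LAD.

```python
# Toy child -> parent (superordinate) links. Illustrative data only,
# not an extract of the real WordNet database.
HYPERNYM = {
    "platypus": "monotreme",
    "echidna": "monotreme",
    "monotreme": "mammal",
    "dog": "mammal",
    "mammal": "animal",
}

# Attributes stored once, at the most general term they apply to.
ATTRIBUTES = {
    "mammal": {"has hair"},
    "monotreme": {"lays eggs"},
}


def hypernym_chain(word):
    """Follow superordinate ("parent") links upward until the root."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Direct subordinate ("child") terms of a word."""
    return sorted(w for w, parent in HYPERNYM.items() if parent == word)


def coordinate_terms(word):
    """"Sister" terms: other words that share the same parent."""
    parent = HYPERNYM.get(word)
    if parent is None:
        return []
    return [w for w in hyponyms(parent) if w != word]


def inherited_attributes(word):
    """Attributes of the word plus those of every superordinate term."""
    attrs = set(ATTRIBUTES.get(word, ()))
    for ancestor in hypernym_chain(word):
        attrs |= ATTRIBUTES.get(ancestor, set())
    return attrs


if __name__ == "__main__":
    print(hypernym_chain("platypus"))        # ['monotreme', 'mammal', 'animal']
    print(coordinate_terms("platypus"))      # ['echidna']
    print(sorted(inherited_attributes("platypus")))  # ['has hair', 'lays eggs']
```

With links stored this way, the platypus question from the WordNet section (does it have hair?) is answered by walking up to mammal rather than by a separate alphabetical lookup.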
The morphology database is important to systems like Compansion. For instance, given the input "John eat many apple", the system needs to be able to reason about the word many and change apple to apples. If the tense is present, it must change eat to eats; if the tense is past, it must change eat to ate. Phonetic information is important for systems that need to generate speech. The statistical databases are useful for traditional AAC techniques such as word prediction.

System Diagram:

    Language Access Database (LAD) <----- WordNet, Case Frames, Morphology
                  ^                       Phonetic Info, Syllabification,
                  |                       Frequency (Brown Corpus), Others
                  |
        Database Definition File

Integration of Resources

The figure above shows the overall structure of LAD. One important function of LAD is the integration of multiple lexical resources. These resources are shown to the right of LAD in the diagram. The architecture is extensible in that new lexical resources can be added without modification to the database engine. This is possible through the database definition file, which defines the set of lexical resources, their location, and what attributes (e.g., frequency) they contain. Secondary lexical resources are defined as files where each record contains the word, its attribute, and an optional WordNet sense specification. The coordination of secondary databases with WordNet senses is one of the major benefits of integration. For example, the noun "bow" is pronounced differently when it means an ornamental ribbon than when it means the front of a boat.

LAD is intended to be used in several different applications. Its functionality makes it a useful tool for abstracting the various types of word information required by different systems. In some cases this information might not be explicitly available. For example, in systems that need verb frame information, there may be some verbs that do not have frames (e.g., pummel). By default, LAD currently retrieves a verb frame from a secondary database.
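The lookup order for verb frames (the secondary database first, then WordNet synonyms, then the generic WordNet verb frame, as the surrounding text explains) can be sketched in Python as follows. This is only an illustration under stated assumptions: the three dictionaries stand in for the case frame database and WordNet, their contents are invented, and get_verb_frame is a hypothetical name, not LAD's actual C++ or Lisp interface.

```python
# Hypothetical stand-in for the secondary case frame database.
CASE_FRAMES = {
    "crush": {"agent": [("human", 3)], "theme": [("physical", 3)]},
}

# Hypothetical stand-in for WordNet synonym lookup.
SYNONYMS = {
    "pummel": ["crush"],
}

# Hypothetical stand-in for WordNet's generic verb frames.
GENERIC_FRAMES = {
    "pummel": "Somebody ----s something",
    "ponder": "Somebody ----s something",
}


def get_verb_frame(verb):
    """Return (source, frame) for a verb, or None if nothing is known.

    Order of attempts:
      1. the verb's own entry in the secondary case frame database;
      2. the entry of the first synonym that has one;
      3. the generic, less detailed WordNet verb frame.
    """
    if verb in CASE_FRAMES:
        return ("secondary", CASE_FRAMES[verb])
    for synonym in SYNONYMS.get(verb, []):
        if synonym in CASE_FRAMES:
            return ("synonym:" + synonym, CASE_FRAMES[synonym])
    if verb in GENERIC_FRAMES:
        return ("wordnet", GENERIC_FRAMES[verb])
    return None


if __name__ == "__main__":
    for verb in ("crush", "pummel", "ponder", "zzz"):
        print(verb, "->", get_verb_frame(verb))
```

Returning the source alongside the frame is a design choice of this sketch; it lets a caller tell a detailed case frame from the coarser generic one.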
In the case where the verb is not represented in the secondary database, a case frame is generated by first retrieving synonyms of the verb from WordNet (e.g., crush) and then checking the secondary database for these synonyms. If that search still fails, then a case frame is generated based on the WordNet verb frame which, although it lacks detail (e.g., Somebody ---s something), would still be useful in a system that was designed to be linguistically robust.

Discussion

LAD is designed to interact with multiple lexical databases in a transparent manner. The user/system treats LAD as a single dictionary. The resulting system will be a useful tool for various AAC applications. Other applications that could benefit from LAD include a speech synthesizer needing pronunciation information and a syntactic-based word predictor using morphological information to predict correct verb forms. LAD is currently being tested with a semantic parser based on the reasoning principles used in Compansion.

A number of enhancements are being planned that will increase the ultimate utility of the tool. These include a compiler that will produce a more compact version of the database based on an input list of words. This will reduce the overall memory and disk space requirements when LAD is used in a practical system. In addition, while LAD is intended to be used primarily by programmers, it will also be necessary for non-technical people to enter new information into the system. For this purpose, a front-end program will be developed to facilitate the process.

References

[1] McHale, M. & Crowter, J. (1994). Constructing a Lexicon from a Machine Readable Dictionary. Army Rome Laboratory Technical Report, #RL-TR-94-178, Rome Laboratory, Griffiss AFB, NY.
[2] McCoy, K., Demasco, P., Jones, M., Pennington, C., Vanderheyden, P., and Zickus, W. (1994). A communication tool for people with disabilities: Lexical semantics for filling in the pieces. In Proceedings of ASSETS '94.
[3] Fillmore, C. J.
(1977). The case for case reopened. In P. Cole and J. M. Sadock, editors, Syntax and Semantics VIII: Grammatical Relations, pp. 59-61, Academic Press, New York.
[4] Miller, Beckwith, Fellbaum, Gross, Miller (1990). Introduction to WordNet: An On-line Lexical Database, CSL Report 43, Revised March 1993.

Acknowledgments

This work has been supported by a Rehabilitation Engineering Research Center Grant from the National Institute on Disability and Rehabilitation Research of the U.S. Department of Education (#H133E30010). Additional support has been provided by the Nemours Research Programs.

Wendy M. Zickus
Applied Science and Engineering Laboratories
1600 Rockland Road, P.O. Box 269
Wilmington, Delaware 19899 USA
Internet: zickus@asel.udel.edu