Project 1: A Rule Compiler

This project is an example of increasing programmer productivity by writing a specification in a simple little language invented for just one application, and then using a scripting language to compile that specification into runnable code. Such an application-specific programming tool is said to increase a programmer's productivity ten-fold.

For our application, we want to build simple expert systems. These are programs based on a set of diagnostic rules that have been obtained from an expert. When started, the program asks the user a sequence of questions and then prints its diagnosis. One of the more famous rule sets in the expert system literature is a set of rules for identifying an animal. The program might ask "Can your animal fly?" or "Does your animal have hooves?". After several such questions, it might print "I think your animal is horse/zebra". (The rule set that we are going to use is small, so the program doesn't ask the obvious question that would distinguish a horse from a zebra; adding extra rules to do this is left as an exercise for the programmer.) True expert systems often have a facility that explains to the user why they are asking their questions, but we are not going to do that in this project.

As a design goal, we want to generate rule-based programs that ask each question only once, and that ask as few questions as possible to come up with a diagnosis. We will use the memoizing technique you learned in CISC 280 to keep from asking a question more than once. How we minimize the number of questions that get asked is where we can get creative.

The specification for a particular expert system is supplied in a specification file.
For this project, you will write a Perl script that compiles a specification file into a C++ program or a Java program, which will then be compiled and tested to verify that it asks each question only once and that it asks the questions in a sensible and, ideally, efficient order. On the due date, you will turn in your Perl script, the programs that it generates from the three test specification files animal.idt, doctor.idt, and glass.idt (available on the class web site), and a typescript file or other trace of the successful compilation of each program and its performance on a non-trivial diagnostic problem. (For example, pick one of the animals that the animal diagnostic program can identify and answer the questions truthfully about that animal.) Your submission will thus contain one Perl script, three C++ or Java programs, and three trace files, one for each program.

The Specification Language

The specification of a particular expert diagnostic program is a file consisting of three kinds of text lines: question_lines, rule_lines and goal_lines.

The syntax rule for a question_line is

    question_line -> "ask" name string

where name is a single sequence of non-space characters and string is any sequence of characters. The three components of a question_line are separated by a single space character. Here is an example of a question_line:

    ask rounded_shell Does the animal have a rounded shell?

The syntax rule for a rule_line is

    rule_line -> name value "if" name value ("and" name value)*

where value is a single sequence of non-space characters. All the lexemes are separated by a single space character. Here is an example of a rule_line:

    type_animal turtle if order scales and rounded_shell yes

(Note that the syntax rule for a rule_line is so simple that, although it is an Extended BNF rule that can easily be mapped into a recursive-descent function, Perl can break the rule_line up into its lexemes in an even simpler way.)
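Because every lexeme in a rule_line is separated by a single space, the line can be decomposed by splitting on spaces and checking where "if" and "and" fall. The following is only a minimal sketch of that idea, written in C++ for illustration rather than in the Perl the project calls for; the names Rule, split_spaces and parse_rule_line are hypothetical and are not part of the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one rule_line:
//   name value "if" name value ("and" name value)*
struct Rule {
    std::string name;   // function the rule belongs to
    std::string value;  // value the rule infers
    std::vector<std::pair<std::string, std::string>> tests;  // (name, value) conditions
};

// Split a line into its space-separated lexemes.
std::vector<std::string> split_spaces(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Parse a rule_line; returns false if the line does not match the syntax rule.
bool parse_rule_line(const std::string& line, Rule& out) {
    std::vector<std::string> w = split_spaces(line);
    // The shortest rule has 5 lexemes; each extra test adds 3 ("and" name value).
    if (w.size() < 5 || (w.size() - 5) % 3 != 0 || w[2] != "if") return false;
    Rule r;
    r.name = w[0];
    r.value = w[1];
    for (std::size_t i = 3; i < w.size(); i += 3) {
        assert(i + 1 < w.size());  // guaranteed by the length check above
        if (i > 3 && w[i - 1] != "and") return false;
        r.tests.emplace_back(w[i], w[i + 1]);
    }
    out = r;
    return true;
}
```

Calling parse_rule_line on the example rule_line above would fill in name type_animal, value turtle, and the two tests (order, scales) and (rounded_shell, yes). A question_line needs slightly different handling, since its string may itself contain spaces: everything after the second space belongs to the string.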
The syntax rule for a goal_line is

    goal_line -> "goal" name string

An example of a goal_line would be

    goal type_animal I think your animal is

There will be exactly one goal_line in the specification file.

The Translation

Basically, the names that appear in a specification line can be thought of as names of functions, and the values that immediately follow them are either the values that the named function will return or the test values that will be compared with the named function's returned value. All values are character strings. The strings that appear in the syntax rules above are the strings of text that will be displayed to ask a question or to show the final diagnosis. Here is how each kind of specification line might be translated.

A question_line will be translated into a function that asks a question (the first time it is called) and returns "yes" or "no". It will use a static local variable to save the answer so that the question will not be asked again when the function is called more than once. The example question_line above might be translated into C++ code that looks something like this:

    char * rounded_shell() {
        // "undecided" marks a question that has not been asked yet
        static char result[10] = "undecided";
        if (strcmp(result, "undecided") == 0) {
            cout << "Does the animal have a rounded shell?" << " ";
            // setw keeps a long answer from overflowing the buffer
            cin >> setw(sizeof result) >> result;
        }
        return result;
    }

(This code assumes that the user types yes or no as the answer to the question. You are free to come up with some other scheme, such as typing 1 or 0 to indicate yes or no, but the value that gets stored in result should be "yes" or "no". It also assumes that the header lines of the generated program include <iostream>, <iomanip> and <cstring> and make the names in namespace std visible.)

A rule_line will be translated into part of a function definition. All of the rule_lines that begin with the same name should be coded into the body of the function that has that name.

For undergraduates (CISC 470 students), the function definition will be relatively simple. Each rule_line for a given function name will translate into a single conditional statement.
For example, all the rules for type_animal will be grouped together and compiled into the following function:

    // const char * because string literals are const in standard C++;
    // callers should store the returned pointer in a const char * as well
    const char * type_animal() {
        static const char * result = "undecided";
        if (strcmp(result, "undecided") != 0) return result;
        ...
        if (strcmp(order(), "scales") == 0 && strcmp(rounded_shell(), "yes") == 0) {
            result = "turtle";
            return result;
        }
        ...
        result = "unknown";
        return result;
    }

Graduates (CISC 670 students) have to organize the rules in a somewhat more efficient manner. Once all the rules for a given function have been found, the name that appears most frequently on the test sides of the rules will be the first name that is called. Divide the rules into groups, one for each value that the name is tested against, and one group for all the rules that don't mention that name. Generate code that calls the name, followed by a conditional that tests the returned value against each of the values that the rules test for. For each branch of the conditional, take the group for that test value, remove the name-value pair from those rules, combine that group with the group of rules in which the name didn't appear, and repeat this process of generating conditional statements until a rule has no test part left. Then assign the value that the rule infers to the static variable to save the result. When all the values of the name have been tested for, the final else branch of the conditional consists of code generated from the group that didn't mention the name. This nesting of conditionals should ask the fewest questions (and make the fewest subroutine calls) before it decides what value the function should return. (If more than one name appears the same maximum number of times in the test sides of the rules, choose one of them arbitrarily.)
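The selection and grouping step described above can be sketched as follows. This is a rough illustration in C++, not the required Perl and not a complete code generator; Rule, Partition, most_frequent_name and partition_on are hypothetical names, ties are broken by taking the alphabetically first name (one arbitrary choice among many), and the sketch assumes that no rule tests the same name twice.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical form of one rule of a single function: the inferred value
// plus the (name, value) tests that must all hold.
struct Rule {
    std::string value;
    std::vector<std::pair<std::string, std::string>> tests;
};

// Name that occurs in the tests of the most rules; empty if no tests remain.
std::string most_frequent_name(const std::vector<Rule>& rules) {
    std::map<std::string, int> count;
    for (const Rule& r : rules)
        for (const auto& t : r.tests) ++count[t.first];
    std::string best;
    int best_count = 0;
    for (const auto& kv : count)  // sorted order, so ties go to the first name
        if (kv.second > best_count) { best = kv.first; best_count = kv.second; }
    return best;
}

struct Partition {
    // Rules that test the chosen name, keyed by the value tested for,
    // with that name-value pair already removed.
    std::map<std::string, std::vector<Rule>> by_value;
    // Rules that never mention the chosen name.
    std::vector<Rule> rest;
};

Partition partition_on(const std::vector<Rule>& rules, const std::string& name) {
    assert(!name.empty());
    Partition p;
    for (const Rule& r : rules) {
        Rule reduced;
        reduced.value = r.value;
        std::string tested_value;
        bool mentions = false;
        for (const auto& t : r.tests) {
            if (!mentions && t.first == name) {
                mentions = true;
                tested_value = t.second;
            } else {
                reduced.tests.push_back(t);
            }
        }
        if (mentions) p.by_value[tested_value].push_back(reduced);
        else p.rest.push_back(r);
    }
    return p;
}
```

A generator built on this sketch would emit one branch per key of by_value, recursing on that group followed by rest, and would emit the final else branch by recursing on rest alone.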
To illustrate, suppose that all the rules for type_animal are these:

    type_animal turtle if order scales and rounded_shell yes
    type_animal frog if order soft and jump yes
    type_animal bird if fly yes

The name that appears most often in the test sides is order, so that is the first decision that has to be made. There will be a test for the value scales, another for the value soft, and an else branch for the rule that doesn't mention order. That rule also has to be placed with the other rules inside each of the value branches, so that it is still considered when the rule that tests order fails. The resulting definition might look something like this:

    // string literals are const, so the pointers are declared const char *
    const char * type_animal() {
        static const char * result = "undecided";
        const char * test;
        if (strcmp(result, "undecided") != 0) return result;
        test = order();
        if (strcmp(test, "scales") == 0) {
            test = rounded_shell();
            if (strcmp(test, "yes") == 0) result = "turtle";
            else {
                test = fly();
                if (strcmp(test, "yes") == 0) result = "bird";
            }
        }
        else if (strcmp(test, "soft") == 0) {
            test = jump();
            if (strcmp(test, "yes") == 0) result = "frog";
            else {
                test = fly();
                if (strcmp(test, "yes") == 0) result = "bird";
            }
        }
        else {
            test = fly();
            if (strcmp(test, "yes") == 0) result = "bird";
        }
        if (strcmp(result, "undecided") == 0) result = "unknown";
        return result;
    }

The goal_line will translate into the main subroutine. For example, the goal_line shown above might be translated into

    int main() {
        const char * diagnosis;
        diagnosis = type_animal();
        cout << "I think your animal is" << " " << diagnosis << endl;
    }

It should be noted that there are many ways in which the scheme I just outlined can be made into more efficient or elegant code, but that is not the purpose of this assignment. Do the assignment in a straightforward way along the lines that I have presented; if you see an obvious improvement and can implement it without spending extra time on it, that is fine, but don't get bogged down in a clever approach that takes you extra time to make work.
Hints

One of the first things that you will want to do is push all the lines of the specification file into a list variable and then sort it. The default lexicographical sort is sufficient for this assignment. Then you can step through the list, processing the goal_line and the question_lines as you encounter them, and finding each group of rule_lines that belong to the same function and processing that group to produce the definition of that function.

It will probably be easiest to generate four intermediate files before creating the final file. File 1 will contain the header lines that will be needed in the C++ (or Java) program and all the function definitions created from question_lines. File 3 will contain all the definitions that come from the groups of rules. As these definitions are written to file 3, prototypes for the functions being defined are written to file 2. The main function generated from the goal_line is written to file 4. When all the specification lines have been processed, the final code is generated by concatenating files 1, 2, 3, and 4, in that order (for example, with cat).
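The sort-then-group hint works because a space sorts before every printable character, so all the rule_lines that start with the same name end up next to each other after a plain lexicographical sort. Here is a minimal sketch of that pass, in C++ for illustration only (the project itself calls for Perl); Grouped, first_word and sort_and_group are hypothetical names, and the sketch assumes that no rule function is itself named ask or goal.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// First space-delimited word of a line (the whole line if it has no space).
std::string first_word(const std::string& line) {
    return line.substr(0, line.find(' '));
}

// Hypothetical result of one pass over the sorted specification lines.
struct Grouped {
    std::vector<std::string> questions;  // "ask ..." lines
    std::vector<std::string> goals;      // "goal ..." lines (the spec promises one)
    std::vector<std::vector<std::string>> rule_groups;  // rule_lines sharing a name
};

// Sort the lines, then collect runs of rule_lines that begin with the same name.
Grouped sort_and_group(std::vector<std::string> lines) {
    std::sort(lines.begin(), lines.end());
    Grouped g;
    std::string current;  // name of the rule group being collected
    for (const std::string& line : lines) {
        if (line.empty()) continue;  // ignore blank lines
        std::string head = first_word(line);
        if (head == "ask") {
            g.questions.push_back(line);
        } else if (head == "goal") {
            g.goals.push_back(line);
        } else {
            if (g.rule_groups.empty() || head != current) {
                g.rule_groups.emplace_back();  // a new function starts here
                current = head;
            }
            g.rule_groups.back().push_back(line);
        }
    }
    return g;
}
```

Each entry of rule_groups then corresponds to one function definition destined for file 3 (with its prototype going to file 2), while the questions and the goal feed files 1 and 4.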