Summary
Constructing an interpreter in Java ties together many different programming techniques, and Java's object orientation makes the interpreter both straightforward to implement and easy to grasp. Part 2 of this column looks at the source code for the Java-based interpreter of Cocoa, the BASIC dialect I introduced in Part 1, with particular attention paid to the structure of the interpreter. (3,350 words)
In last month's column I discussed the idea of building an interpreter in Java and introduced my version of BASIC, named Cocoa. In this month's column we will jump right into the source code for the Cocoa interpreter. However, as the emphasis here is on building interpreters in Java, more attention will be paid to the structure than to the actual code. As always, the full source code for the column is available in the Resources section below.
The Program class
All the constituent classes of the interpreter are collected into a
single Java package. For this interpreter, that package is named
basic. (Here is Cocoa's programming
documentation.) This does contravene Sun's suggested naming
convention for packages for no good reason, but then again Sun's naming
scheme, as described in the Java Language spec, isn't particularly well
motivated either. :-) However, there is good reason to put the classes
together in a single package, and that is to provide visibility of
interpreter private methods to other interpreter classes without
exposing them to the interpreter's clients.
As I discussed last month, the primary class for the interpreter is
a public class with a factory method to create executable programs. I
chose to implement this load method in a class named basic.Program.
(For the rest of the article we'll assume the class names are all
preceded by the "basic" package name without expressly calling it
out.) Program exports a static method whose signature is:
public static Program load(InputStream source, PrintStream out) throws IOException, BASICSyntaxError { ... }
This method is responsible for getting the text in an
InputStream
parsed and collected into something that
can be interpreted. It returns an instance of a
Program
object, which is the parsed program. The other
important public method in Program is run,
whose signature is as follows:
public void run(InputStream in, OutputStream out) throws BASICRuntimeError { ... }
The run
method is not static; it actually runs the
instance of the program using the input and output streams that are
passed for doing character I/O.
As you can see, both of these methods throw exceptions. The first
throws an IOException
if an I/O error occurs while reading
the source code from the input stream. The load
method
also can throw a BASICSyntaxError
exception if the source
code it is reading fails to parse. The second method, run,
throws the exception BASICRuntimeError when an error
occurs during the execution of the program. The kinds of situations
that would throw this error include dividing by zero or reading from an
uninitialized variable.
These two methods, load and run, define the two halves of any
interpreter, loading and executing.
Getting the program loaded
The load
method in the Program
class
begins as follows:
public static Program load(InputStream source, PrintStream out) throws IOException, BASICSyntaxError {
    DataInputStream dis = null;
    char data[] = new char[256];
    LexicalTokenizer lt = new LexicalTokenizer(data);
    String lineData;
    Statement s;
    Token t;
    Program result = new Program();

    if (! (source instanceof DataInputStream))
        dis = new DataInputStream(new BufferedInputStream(source));
    else
        dis = (DataInputStream) source;
In the code above, the variables for the parsing section of the
interpreter are initialized. These include a lexical analyzer (in the
form of the LexicalTokenizer class) and a DataInputStream. There is a
simple optimization that checks whether the method was passed an
instance of a DataInputStream. The data array is an array of
characters the lexical analyzer will use while reading in the source
code; note that it puts a limit of 256 characters on the input line.
The bulk of the loading method is simply a loop that reads in data (in the form of strings) and then parses it. Here is the source code for this loop:
 1    while (true) {
 2        lineData = dis.readLine();
 3
 4        if (lineData == null) return result;
 5
 6        if (lineData.length() == 0) continue;
 7
 8        lt.reset(lineData);
 9        t = lt.nextToken();
10        if (t.typeNum() != Token.CONSTANT) {
11            throw new BASICSyntaxError("Line failed to start with a line number.");
12        }
13
14        try {
15            s = ParseStatement.statement(lt);
16        } catch (BASICSyntaxError bse) {
17            out.println("Syntax error: "+bse.getMsg());
18            out.println(lt.showError());
19            throw bse;
20        }
21        s.addText(lineData);
22        s.addLine((int) t.numValue());
23        result.add((int) t.numValue(), s);
24    }
25 }
This loop will run until the return statement in line 4 returns a
result, or the catch statement is executed in line 16 and the exception
is thrown. The most important statements in this loop are lines #15,
#8, and #23. In line 15 a valid BASIC statement is parsed. In line 8
the lexical analyzer is preloaded with the string containing a BASIC
statement. In line 23 the parsed statement is stored in result
(which is an instance of Program) by adding the parsed statement to a
binary search tree, indexed by the parsed line number.
All interpreters will need a collection of, or container for, statements. BASIC provides a convenient hook in the form of line numbers. My version uses a modified red-black tree; you may wish to refer to my column on containers if you are building your own interpreter.
In lines 21 and 22 I store a copy of the text that was read into a holding variable of the statement. Doing this allows me to implement a very simple program listing function where I simply iterate over the statement objects and print out the originally read line for each one.
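To make the container and listing ideas concrete, here is a minimal sketch that stands in for the article's hand-built red-black tree by using java.util.TreeMap (itself a red-black tree). The ProgramStore class and its string-only "statements" are illustrative assumptions, not Cocoa's actual types:

```java
import java.util.TreeMap;

// Minimal sketch of a line-number-indexed statement container.
// java.util.TreeMap is itself a red-black tree, so it stands in for
// the hand-built container described in the article.
public class ProgramStore {
    private final TreeMap<Integer, String> statements = new TreeMap<>();

    // Re-entering an existing line number replaces the old statement,
    // just as it does when typing into a classic BASIC interpreter.
    public void add(int lineNumber, String sourceText) {
        statements.put(lineNumber, sourceText);
    }

    // LIST is simply an in-order walk over the tree, printing the
    // originally read text that was stored with each statement.
    public String list() {
        StringBuilder sb = new StringBuilder();
        for (String line : statements.values()) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

Because the tree keeps its keys sorted, statements entered out of order still list (and later execute) in line-number order.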
Lexing BASIC statements
Decomposing the string into components that can be interpreted by the
parser is the job of the lexical analyzer. For a review on lexical
analysis see my previous Java In Depth columns
in JavaWorld.
The lexical analyzer I designed for this interpreter falls right in
the middle of the complexity range. It is too complicated to be
handled by either StringTokenizer or StreamTokenizer, but it is
simple enough that I found using a parser generator such as JavaCC to
be overkill. (See my article "Looking for lex and yacc for Java? You
don't know Jack" in the December issue of JavaWorld.) Added to the
complexity issue, or lack thereof, was the fact that support for
customizing automatically generated lexical analyzers is still
evolving in those tools. So it was harder to write the analyzer code
by hand but easier to plug it into the interpreter.
I won't go into great detail on the lexical analyzer, but I will cover two aspects of it that make the interpreter easier to write.
The first "trick" one can play is to design a
sophisticated Token
class. You will recall from the
discussion of lexical analyzers that the component parts produced are
tokens. Some of the tokens you will parse will end up as variables in
your interpreted code. I created a subclass of my Token class, named
Variable, which extends Token with a setValue method.
This allows me to store the variable directly after I receive it in the
parser, without having to create a new object instance to hold it.
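The relationship can be sketched as follows; the fields, type codes, and accessors here are simplified assumptions for illustration, not Cocoa's exact Token API:

```java
// Simplified sketch of the Token/Variable relationship.
class Token {
    final int type;
    final String text;
    Token(int type, String text) { this.type = type; this.text = text; }
}

class Variable extends Token {
    static final int VARIABLE = 1;   // illustrative type code
    private double value;

    Variable(String name) { super(VARIABLE, name); }

    // The extra setter is what lets the interpreter store a value into
    // the very token object the parser produced, with no new allocation.
    void setValue(double v) { value = v; }
    double numValue() { return value; }
}
```

The payoff is that the token the lexer hands back can double as the runtime storage cell for the variable it names.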
The second technique I used was to incorporate a mark
and resetToMark
feature in the analyzer so that I could
parse several possible choices, first by marking my position and trying a
parse, then resetting and trying a different parse path.
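A stripped-down sketch of that technique follows; only the two bookkeeping methods matter, and the word scanner is an illustrative stand-in for the real nextToken:

```java
// Sketch of the mark/resetToMark idea on a character buffer.
class MarkableTokenizer {
    private final char[] buffer;
    private int currentPos = 0;
    private int markPos = 0;

    MarkableTokenizer(String line) { buffer = line.toCharArray(); }

    void mark()        { markPos = currentPos; }   // remember position
    void resetToMark() { currentPos = markPos; }   // rewind to it

    // Consume one run of letters, or return null if none is present.
    String nextWord() {
        int start = currentPos;
        while (currentPos < buffer.length && Character.isLetter(buffer[currentPos])) {
            currentPos++;
        }
        return (currentPos > start) ? new String(buffer, start, currentPos - start) : null;
    }
}
```

A parser can mark(), attempt one parse path, and on failure resetToMark() and re-scan the same characters down a different path.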
The most commonly used method in the lexical analyzer is
nextToken. This method encapsulates all the rules
regarding how a stream of ASCII characters should be divided into
individual tokens. A version of this method is shown below to
demonstrate the flow, although I've edited out redundant parts to save
space. See the Resources section for links to
the complete source.
Token nextToken() {
    Token r;
    ....
    if (currentPos >= buffer.length)
        return EOLToken;
The conditional statement above guarantees that every time
nextToken
is called, it returns something, even if it is only
an end-of-line indication.
The statement below saves our position in the input stream so
the lexical analyzer can return to the current position should the
analysis of the incoming character stream either fail to produce a
valid token, or the token that is returned need to be pushed back
to be re-read by another part of the interpreter. This technique of
pushing back the last token to be re-read is used extensively in the
expression evaluator. (See the source code in ParseExpression.java.)
previousPos = currentPos;
The next step is to create a loop to consume all of the white space
(tabs, spaces, and so on) under the current parse point. The
definition of white space is encapsulated in the private
isSpace
method:
while (isSpace(buffer[currentPos])) currentPos++;
The bulk of the code is a giant switch statement, based on the character value, first checking for the operator characters as shown here:

switch (buffer[currentPos]) {
    case '+' :
        currentPos++;
        return new Token(Token.OPERATOR, "+", Expression.OP_ADD);

and then the multiple-character operators, as shown below. In the following code the analyzer is trying to match the tokens '<', '<=', or '<>':
    case '<' :
        if (buffer[currentPos+1] == '=') {
            currentPos += 2;
            return new Token(Token.OPERATOR, "<=", Expression.OP_LE);
        } else if (buffer[currentPos+1] == '>') {
            currentPos += 2;
            return new Token(Token.OPERATOR, "<>", Expression.OP_NE);
        }
        currentPos++;
        return new Token(Token.OPERATOR, "<", Expression.OP_LT);
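The longest-match-first rule in that case arm can be isolated in a toy lexer. The string token names below are illustrative stand-ins for Cocoa's Token objects, and unlike the excerpt above, this standalone version adds explicit bounds checks before each lookahead:

```java
// Toy lexer isolating the longest-match-first dispatch for '<', '<=',
// and '<>'. Token names are returned as plain strings for clarity.
class RelopLexer {
    private final char[] buffer;
    private int currentPos = 0;

    RelopLexer(String line) { buffer = line.toCharArray(); }

    String nextToken() {
        while (currentPos < buffer.length && buffer[currentPos] == ' ') currentPos++;
        if (currentPos >= buffer.length) return "EOL";
        switch (buffer[currentPos]) {
            case '<':
                // Try the two-character forms before settling for '<'.
                if (currentPos + 1 < buffer.length && buffer[currentPos + 1] == '=') {
                    currentPos += 2;
                    return "OP_LE";
                }
                if (currentPos + 1 < buffer.length && buffer[currentPos + 1] == '>') {
                    currentPos += 2;
                    return "OP_NE";
                }
                currentPos++;
                return "OP_LT";
            default:
                currentPos++;
                return "UNKNOWN";
        }
    }
}
```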
After the operators, various symbols are identified as shown below:
    case '(' :
    case '\'':
    ...
    case ',' :
        return new Token(Token.SYMBOL, (double) buffer[currentPos++]);
Once all possible symbols are checked for, the analyzer checks to see if the token could be a numeric constant. A number always starts with either a digit or a decimal point, so the analyzer checks for these characters before calling a helper method to actually convert the characters into a Java number:
    case '.' :
        if (r != null)
            return r;
    case '0': case '1': ... case '9':
        r = parseNumericConstant();
        if (r != null)
            return r;
        return new Token(Token.SYMBOL, (double) buffer[currentPos++]);
Once numeric characters and symbols are out of the way, the class attempts some "universal" end-of-line processing, hiding from the user the differences in line endings among Unix, DOS, and MacOS.
    case '\r' :
    case '\n' :
        while (currentPos < buffer.length)
            currentPos++;
        return EOLToken;
Next, if the current character isn't one of the above symbols, or a numeric constant, or an end of line, the penultimate step is to check to see if perhaps it is a double quote character, which indicates the beginning of a quoted string constant. If the character is a double quote, the code below extracts the string between the quotes and returns a string constant token:
    case '"' :
        StringBuffer sb = new StringBuffer();
        ... parse quoted string ...
    }
Finally, if the code above has not been able to identify the token, we can safely assume that the input character the lexical analyzer is looking at is a letter. This assumption is predicated on the fact that any non-letter character would have already been consumed by the earlier code.
The remaining code consumes the letters and digits following the initial letter, up to the next non-alphanumeric character, and processes the resulting string potentially as one of the following token types:
- a statement keyword, such as print, if, and so on;
- a command keyword, such as run, list, new, and so on;
- failing those, a variable name.
The final result is a relatively large amount of code converting the characters into a token object that is easily manipulated by the next stage, the statement parser.
Collecting tokens into BASIC statements
The next major class in the interpreter is a class used to parse
statements, named ParseStatement.
The ParseStatement class is a subclass of the Statement class. The
implementation of the Statement class was affected by a design issue:
I wanted some basic methods that all statements shared, such as
trace, execute, and unparse, and I didn't want a single monolithic
class that defined the entire set of all statements. By designing it
in this way, I was able to easily add additional statement types with
a minimum impact on the classes that were already in place.
ParseStatement's factory method is named doParse; it takes as an
argument an instance of the LexicalTokenizer class and returns an
instance of the Statement class. This structure is what makes the
simple "compiler" loop in the Program class's load method so trivial
to write.
The implementation of the doParse method is fairly
mechanical. There are three valid initial token streams for BASIC
statements:
1. A symbol token, either "?" or "'".
2. A keyword token, such as print or goto.
3. A variable token, which begins an implicit let statement.
The requirement for case #1 in the list above stems from old BASIC
interpreters that used these symbols in lieu of keywords to save
space in precious random-access memory. The doParse method treats
lines with these initial symbols exactly as it would if those lines
started with the keyword. The code that handles case #1 is shown
below:
 1    static Statement doParse(LexicalTokenizer lt) throws BASICSyntaxError {
 2        Statement s;
 3        Token t;
 4
 5        t = lt.nextToken();
 6        if (t.typeNum() == Token.SYMBOL) {
 7            switch ((int) t.numValue()) {
 8                case '?':
 9                    s = new PRINTStatement(lt);
10                    t = lt.nextToken();
11                    if ((t == null) || (t.typeNum() == Token.EOL))
12                        return s;
13
14                    if (t.isSymbol(':')) {
15                        s.nxt = statement(lt);
16                        return s;
17                    }
18                    throw new BASICSyntaxError(extraError);
19                case '\'':
20                    s = new REMStatement(lt);
21                    return s;
22                default :
23                    throw new BASICSyntaxError("Illegal statement symbol start?");
24            }
25        }
As you can see from the code listing, once a symbol is parsed as the
initial token, the only two exits from this code are either a return
statement, or the throwing of an exception. Take the case of the "?"
symbol. If it is returned by the lexical analyzer, the
doParse
method creates a new statement by instantiating an
instance of the PRINTStatement
class. As we will see a bit
later, this invokes a parser specific to print statements on the
remaining tokens on the input line. The next interesting bit is the
way in which the colon (":") character is handled. This character is
used by the BASIC interpreter to separate multiple statements on a
line, so once the print statement is parsed, the next token must be
either an end of line (EOL) or a symbol token (whose value is ":").
Otherwise, there is unparsed data after the statement proper, and this
always results in an error.
Parsing keyword statements for case #2, in the three-part list above, is very similar. As the code below shows, if the token is a keyword, then a large switch statement is entered. The cases for the switch statement are those keywords that can legally start a BASIC statement. The code parsing keyword statements is shown here:
 1    if (t.typeNum() == Token.KEYWORD) {
 2        switch ((int) t.numValue()) {
 3            case TRON:
 4                s = new TRONStatement(lt);
 5                t = lt.nextToken();
 6                if (t.isSymbol(':')) {
 7                    s.nxt = statement(lt);
 8                } else if (t.typeNum() != Token.EOL)
 9                    throw new BASICSyntaxError(extraError);
10                return s;
11
12            case TROFF:
13                s = new TROFFStatement(lt);
14                t = lt.nextToken();
15                if (t.isSymbol(':')) {
16                    s.nxt = statement(lt);
17                } else if (t.typeNum() != Token.EOL)
18                    throw new BASICSyntaxError(extraError);
19                return s;
            ...
            ...
This code demonstrates the "cookie cutter" nature of parsers -- a feature that makes them so amenable to automatic generation. In pseudo code each case can be expressed as:

case KEYWORD:
    s = new KEYWORDStatement(lt);
    t = lt.nextToken();
    if (t.isSymbol(':')) {
        s.nxt = statement(lt);
    } else if (t.typeNum() != Token.EOL)
        throw new BASICSyntaxError(extraError);
    return s;
Other than the class used to parse the tokens, the only variable is
whether or not it makes sense to allow additional statements after
the current statement. Some statements, such as end and rem, do not
allow another statement to follow them on the same line. Finally,
this version of BASIC allows you to specify a let statement (let is
the BASIC keyword for assignment statements) without actually
specifying the let keyword. Typically a statement that is entered as
100 A = 10
is interpreted as the let
statement shown below:
100 LET A = 10
The let keyword is implicit in this case. The implicit
let
statement is parsed by the following code:
21    } else if (t.typeNum() == Token.VARIABLE) {
22        lt.unGetToken();
23        s = new LETStatement(lt);
24        t = lt.nextToken();
25        if ((t == null) || (t.typeNum() == Token.EOL)) {
26            return s;
27        } else if (t.isSymbol(':')) {
28            s.nxt = statement(lt);
29            return s;
30        }
31        throw new BASICSyntaxError(extraError);
32    }
33    throw new BASICSyntaxError("Unrecognized statement");
As the code above shows, if the first token on the line is
a variable, it is first "pushed back" onto the input stream in line 22,
and then a new LETStatement
object is instantiated as if
the parser had actually seen the let keyword.
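The one-token pushback that unGetToken provides can be sketched in isolation. This PushbackTokenizer is a hypothetical stand-in for Cocoa's LexicalTokenizer, using whitespace-separated strings as its "tokens":

```java
import java.util.StringTokenizer;

// Sketch of one-token pushback, the mechanism behind unGetToken.
class PushbackTokenizer {
    private final StringTokenizer st;
    private String pushedBack = null;

    PushbackTokenizer(String line) { st = new StringTokenizer(line); }

    String nextToken() {
        if (pushedBack != null) {
            String t = pushedBack;    // re-deliver the pushed-back token
            pushedBack = null;
            return t;
        }
        return st.hasMoreTokens() ? st.nextToken() : null;
    }

    // Hand the last token back so the next reader sees it again.
    void unGetToken(String t) { pushedBack = t; }
}
```

This is exactly the shape of the implicit-let case: the statement parser reads a token, sees a variable, pushes it back, and the let-statement parser then re-reads the line from the variable onward.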
All the keyword statement subclasses have the parser built into
the constructor. An example of this is the GOTOStatement
class, whose constructor is simply:
GOTOStatement(LexicalTokenizer lt) throws BASICSyntaxError {
    super(GOTO);
    parse(this, lt);
}
Here you can see that the GOTOStatement class invokes the Statement
constructor that takes a keyword identifier as a parameter. Then it
invokes a static parse method, which parses the parameters that can
follow the GOTO keyword in a BASIC program. This class provides an
example of a constructor that throws an exception (in this case a
syntax error exception).
The parse method is shown below; as you can read in the code, it is
a private static method responsible for parsing the necessary
parameters for a GOTO. In the case of GOTO, the only valid token that
could follow the keyword is a line number; however, it could just as
easily have been an expression.
private static void parse(GOTOStatement s, LexicalTokenizer lt) throws BASICSyntaxError {
    Token t = lt.nextToken();

    if (t.typeNum() != Token.CONSTANT) {
        throw new BASICSyntaxError("Line number required after GOTO.");
    }
    s.lineTarget = (int) t.numValue();
}
Once complete, the parse method leaves the lexical analyzer ready to
fetch the first token after the GOTO's parameters. This next token
must be an end-of-line token rather than a follow-on statement,
because a GOTO transfers control of the BASIC program unconditionally
to the line whose number follows the keyword. Any statement that
followed the GOTO on the same line would be inaccessible, as it would
have no line number associated with it.
Wrapping up
Of course, there are many subclasses of Statement, one for
each keyword. The advantage of designing your interpreter using a
technique like the one described in this column is that you can work on
a statement with the confidence and knowledge that you aren't causing
other statements problems with side effects. This encapsulation is one
of the real benefits of an object-oriented language and it really
shines when writing applications such as this one. Check out the
pointers in the Resources section, and in
particular, check out the complete classes to get a feel for how the
entire parsing structure fairly clips together.
In the next and final installment in the BASIC series, I'll take a
look at how you can execute these statements now that they are parsed
and how you can hook the resulting interpreter into another Java
class.
About the author
Chuck McManis currently is the director of system software at FreeGate
Corp., a venture-funded start-up that is exploring opportunities in the
Internet marketplace. Before joining FreeGate, Chuck was a member of
the Java Group. He joined the Java Group just after the formation of
FirstPerson Inc. and was a member of the portable OS group (the group
responsible for the OS portion of Java). Later, when FirstPerson was
dissolved, he stayed with the group through the development of the
alpha and beta versions of the Java platform. He created the first "all
Java" home page on the Internet when he did the programming for the
Java version of the Sun home page in May 1995. He also developed a
cryptographic library for Java and versions of the Java class loader
that could screen classes based on digital signatures. Before joining
FirstPerson, Chuck worked in the operating systems area of SunSoft,
developing networking applications, where he did the initial design of
NIS+. You can reach him at chuck.mcmanis@javaworld.com.
Also, check out his home page.
URL: http://www.javaworld.com/javaworld/jw-06-1997/jw-06-indepth.html