Tutorial - Writing an interpreter

Started by Parker, January 25, 2006, 08:12:43 PM

Parker

January 25, 2006, 08:12:43 PM Last Edit: January 29, 2006, 07:47:03 PM by Parker
Over the next few (insert lengthy units of time here) I'll be writing a series of tutorials about how to write a simple interpreter in Aurora.

Anyway, this is the first step: the planning. Of course, before starting a project we have to decide on the goals. This project should be easy for others to follow and will be fairly simple. It's just going to give the idea of how an interpreter is put together, and it will feature a working interpreter when we're finished, but it's not going to be all that complex or something you'll want to write real programs with, unless I get carried away ;)

So first is how it will work. My decision has been not to interpret the source linearly as you might expect, but to first build a sort of intermediate representation of the code using various structures. This will also allow faster execution, since a function or section won't have to be reparsed every time you want to execute it. More on that later.
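Just to give a rough idea of what I mean, here's a sketch in C++ rather than Aurora; the names (Stmt, StmtKind, the fields) are made up for illustration and are not the design we'll end up with.

#include <string>
#include <vector>

// A made-up IR shape: each statement is parsed once into a node like
// this, so executing a loop body never re-scans the source text.
enum class StmtKind { Assign, If, While, Call, Return };

struct Stmt {
    StmtKind kind;                 // which statement this node represents
    std::string target;            // e.g. the variable name in an assignment
    std::vector<int> exprCode;     // the statement's expression, pre-parsed
    std::vector<Stmt> body;        // nested statements for IF/WHILE bodies
};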

The language, since it is the subject of a tutorial, will be simple. It will not be possible to call external code, only functions written in the source file.

The keywords that will be implemented include
- FUNCTION
- IF
- SELECT
- WHILE
- FOR
- END
- CALL
- DEF/DIM
- DECLARE
- RETURN

Types include
- INTEGER
- FLOAT
- DOUBLE
- CHAR
- SHORT
- LONG (a 64-bit integer, not the same as C's long)
- STRING

Built in functions include
- PRINT
- INPUT
for now, although more shouldn't be too hard to add in the end. Note that these are built-in functions, not keywords.

Expression support for numbers includes *, /, +, -, &, |, and %, and for comparison ==, <, <=, >, >=, and <>. Parentheses ( ) will also be supported for sub-expressions. For strings, only the + and == operators are supported.

For simplicity I've decided that ; will be the statement terminator and the == operator will be used for testing equality.

The next step (the lexer/scanner plus any required classes) will be posted within the next few days. After that, some of the tables will be designed and work on some of the parsers can start. Once enough parsers have been written, an executor will follow.

To clear that up: the parser is the part of the program that recognizes certain streams of tokens. Each parser in the group has the job of recognizing one kind of statement. A separate parser will be written for each keyword, plus a couple of other places where one is required, and together they will make up the full parser ;D
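If it helps to picture it, here's a rough sketch in C++ (not Aurora, and every name in it is made up for illustration) of how those per-keyword parsers might be dispatched:

#include <string>

// Illustrative only: one small parser per statement, plus a dispatcher.
enum class Keyword { If, While, Return, Other };

struct Stmt { std::string desc; };   // stand-in for a real statement node

Stmt ParseIf()     { return {"if-statement"}; }
Stmt ParseWhile()  { return {"while-statement"}; }
Stmt ParseReturn() { return {"return-statement"}; }
Stmt ParseOther()  { return {"call-or-assignment"}; }

// The full parser just looks at the leading token and hands control
// to the small parser responsible for that kind of statement.
Stmt ParseStatement(Keyword lead) {
    switch (lead) {
        case Keyword::If:     return ParseIf();
        case Keyword::While:  return ParseWhile();
        case Keyword::Return: return ParseReturn();
        default:              return ParseOther();
    }
}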

Bruce Peaslee

My suggestion at this point is to decide the level of the target audience. Is it someone who has never programmed a computer, or someone, say, moving over from BASIC? Or can it be both?

Parker

I suppose the target audience would be anyone with knowledge of Aurora, since that's how it will be implemented. I'm basing my approach somewhat on a book I have, "Writing Compilers and Interpreters". But it's not going to be very complex, not anything to use in a real production environment.

So an example might look something like this
FUNCTION PrintHello(STRING name)
    PRINT("Hello " + name);
END FUNCTION

FUNCTION main()
    DIM STRING name;
    INPUT("Type your name: ", name);
    IF (name == "") RETURN;
    PrintHello(name);
END FUNCTION


For someone who has never programmed before the language wouldn't be too hard to use, but understanding how I implement it probably will.

Parker

So here's the next part, writing the token scanner/lexical analyzer (lexer).

There are a couple of ways to accomplish this. The lex way is to have it scan for each token, for example looking for an 'i' and then an 'f' for the token IF. That's somewhat impractical if you're hand-writing a lexer, since it's very tedious to code. So here's the other way: all the reserved words are stored in an array which is carefully organized to make looking up a word fast. We just scan for a word and then look it up. The lookup part will come later; for now we're just looking for simple tokens.
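As a sketch of the idea (in C++ for now; the Aurora lookup code will come later), keeping the reserved words sorted lets us find a word with a binary search:

#include <algorithm>
#include <array>
#include <cstring>

// The tutorial's keywords, stored in alphabetical order so that a
// scanned word can be found with a binary search instead of being
// matched character by character, lex-style.
constexpr std::array<const char*, 11> kKeywords = {
    "CALL", "DECLARE", "DEF", "DIM", "END", "FOR",
    "FUNCTION", "IF", "RETURN", "SELECT", "WHILE"
};

bool IsKeyword(const char* word) {
    return std::binary_search(
        kKeywords.begin(), kKeywords.end(), word,
        [](const char* a, const char* b) { return std::strcmp(a, b) < 0; });
}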

This will be expanded at a later point in time. For now the goal is just to show how it works and give a working example.

There are a few different tokens we need to scan for.
- Identifier (in lex syntax [A-Za-z_][A-Za-z0-9_]* )
- Integer number (floating point, hex, exponential will come later. Lex syntax [0-9]+ )
- String (No escapes, newlines are allowed in strings for now to make it simple. Lex syntax \"[^\"]*\" )
- End of statement ( semicolon ; )
- Operator. Eventually each operator will have its own token code, but for now we'll just return the generic operator token code (tOper).
- Whitespace. This includes spaces, tabs, and newlines. But newlines increment the line number so that errors can be reported effectively. A newline is defined as a CR, LF, or CR LF. Comments will also count as whitespace but won't be implemented at this early stage.
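If you want to see those lex patterns in action before we hand-write the scanner, here's a quick C++ check using std::regex (just a demonstration, not part of the lexer we're building):

#include <iostream>
#include <regex>

// Each pattern below is the lex syntax from the list above.
int main() {
    std::regex ident(R"([A-Za-z_][A-Za-z0-9_]*)");
    std::regex number(R"([0-9]+)");
    std::regex str(R"("[^"]*")");

    std::cout << std::regex_match("hello_world", ident) << "\n"; // 1
    std::cout << std::regex_match("123", number)        << "\n"; // 1
    std::cout << std::regex_match("\"hi\"", str)        << "\n"; // 1
    std::cout << std::regex_match("9abc", ident)        << "\n"; // 0
    return 0;
}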

So first we'll write the function that determines what token type a certain character starts. It will only be called for the first character of a token, so nothing complex has to be written to allow digits inside identifiers. This function is part of the Lexer class, which I'll define a little later.

The first part defines the token codes. We'll deal with whitespace in a minute.

const tError = 0;
const tWord = 1;
const tNumber = 2;
const tString = 3;
const tEOS = 4;
const tOper = 5;

Lexer::CharTokenType(byte ch)
{
    if ((ch >= "A" and ch <= "Z") or (ch >= "a" and ch <= "z") or (ch = "_"))
    {
        return tWord;
    }
    else if (ch >= "0" and ch <= "9")
    {
        return tNumber;
    }
    else if (ch = "\"")
    {
        return tString;
    }
    else if (ch = ";")
    {
        return tEOS;
    }
    else
    {
        return tError;
    }
}

You'll notice I haven't dealt with operators. That is because there will be a table making this process much faster, but unfortunately it requires the use of single-quote ' ' character literals to do an array index. Much of the lexer will be changed eventually.
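To show what I mean by a table, here's a sketch in C++ (where character literals already work; the names and the extra token codes are just for illustration):

#include <array>

enum { tError = 0, tPlus, tMinus, tStar, tSlash, tSemi };

// One entry per possible byte value, so classifying an operator
// character is a single array lookup instead of a chain of compares.
constexpr std::array<int, 256> MakeOperTable() {
    std::array<int, 256> t{};            // everything defaults to tError
    t['+'] = tPlus;  t['-'] = tMinus;
    t['*'] = tStar;  t['/'] = tSlash;
    t[';'] = tSemi;
    return t;
}

const std::array<int, 256> kOperTable = MakeOperTable();

int OperTokenType(unsigned char ch) { return kOperTable[ch]; }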

The next step is to write the function that skips whitespace, but before that we have to understand what the lexer class will be like and write its constructor to initialize all the members.

The lexer class has the responsibility of holding the character buffer and its position. It also knows the length of that buffer, which is much more efficient than calling len() every time we need it.

This is the current declaration of that class.
class Lexer
{
    string *m_pText;
    unsigned int m_uPos;
    unsigned int m_uLen;
    unsigned int m_uLine;

    string *m_pToken;

    declare Lexer();
    declare _Lexer();

    declare CharTokenType(byte ch),int;
    declare SkipWhitespace();
}


Now we have to write the constructor and destructor. Their purpose is simple -- to initialize members to 0 when an instance of the class is created, and to free them when the class is destroyed. Once class constructors can take arguments, a new constructor may be written to load a source file, but for now that will be covered by another method a little later on.

Lexer::Lexer()
{
    m_pText = null;
    m_uPos = 0;
    m_uLen = 0;
    m_uLine = 1; // We start on line 1

    // This variable is used to store the value of the current token.
    m_pToken = 0;
}

Lexer::_Lexer()
{
    delete m_pText;
    delete m_pToken;
    // Both of those calls are safe since delete tests for zero before destroying them.
}


Now that those are out of the way, it's time to skip whitespace. To prevent crashes, we first check whether the text member is null. The function uses an infinite loop that skips one whitespace character at a time and returns when there is no more whitespace. What's nice is that the ASCII 0 character isn't whitespace, so it will automatically stop if it hits the end of the file.

Lexer::SkipWhitespace()
{
    if (m_pText = null) return;

    while (true)
    {
        if ((*m_pText[m_uPos] = " ") or (*m_pText[m_uPos] = "\t"))
        {
            m_uPos++;
        }
        else if (*m_pText[m_uPos] = 13)
        {
            // Either CR or CR LF (CR is ASCII 13, LF is ASCII 10).
            m_uPos++;
            if (*m_pText[m_uPos] = 10) m_uPos++;
            m_uLine++; // We've gone on to the next line
        }
        else if (*m_pText[m_uPos] = 10)
        {
            // A lone LF line terminator
            m_uPos++;
            m_uLine++;
        }
        else
        {
            return; // No more whitespace
        }
    }
}


So this is great: we've got a class capable of skipping whitespace characters. But it can't load a file, so that's next. Two methods will be added: LoadFile() and LoadString(). The first will open the file specified as the argument, and the second will just copy the argument.

Lexer::LoadFile(string fn)
{
    unsigned int hFile, filelen;

    hFile = OpenFile(fn, MODE_READ);
    if (hFile = 0) return; // Error.

    // Make sure there is no memory leak in the class
    delete m_pText;

    filelen = GetFileLength(hFile);
    if (filelen = 0)
    {
        m_uLen = 0;
        m_pText = new(byte, 1);
        *m_pText[0] = 0;
        CloseFile(hFile);
        return;
    }

    m_pText = new(byte, filelen+1); // Make room for the null terminator
    Read(hFile, *m_pText, filelen);
    m_uLen = filelen;

    CloseFile(hFile);

    return;
}

Lexer::LoadString(string str)
{
    delete m_pText;
    m_pText = new(byte, len(str)+1);
    m_uLen = len(str);
    *m_pText = str;
}


At this point we can test them. I made a test file at P:\Test.txt containing a bunch of tabs, spaces and new lines, and then a [ character. If all is good, calling SkipWhitespace and then printing *m_pText[m_uPos] will print the [ character. And it works! The file I used for testing looks like this

#include "lexer.inc"

global sub main()
{
Lexer l;
l.LoadFile("P:\\Test.txt");

l.SkipWhitespace();
writeln(chr$(l.*m_pText[l.m_uPos]));

while (GetKey() = "");
}


It's time to eat here but when I'm done I'll finish this up, which includes support for identifiers, numbers, and the end-of-statement operator ( ';' ).

I hope this will prove helpful for someone ;)
And let me know if there are any bugs along the way. Thanks
-Parker

Parker

Alright. The last method in the lexer class for now is called GetToken, which sets the current token variable and returns the type of the token. This approach is based on lex, where you might write { yylval = something; return tokentype; }

The GetToken function is the lexer: it is the only method that the parser will call, and it is what fetches the next token from the text. I'm not going to just post a big chunk of code and say "study", so I'll try to explain it first.

I gave the lex syntax for the tokens above, and I'll explain it now. Lex uses regular expressions to define tokens, which are compiled into a C file that tries to match those expressions in an input stream. I listed them because lex notation has become a standard way of representing tokens.

The first, identifier, is represented by [A-Za-z_][A-Za-z0-9_]*, which means that an identifier consists of one character that is an A through Z, an a through z, or an underscore ( _ ), followed by any number of characters that are upper or lower case A through Z, the digits 0 through 9, or the underscore. The difference between this and an Aurora identifier is that Aurora's can have a dollar sign tacked on the end if you like. A few examples are a, y, _2, abc123, _. You get the idea.

The next is a simple number. It can be any positive number of digits 0 through 9.

The only limitation on a string currently is that it cannot contain quotes. That is the main reason for escape characters, which appear in BASIC and Pascal as two quotes in a row, and in C as \". Normally strings cannot continue onto the next line, but in this simple tutorial I will allow it.
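If we ever do add BASIC/Pascal-style escapes, the scanning loop only changes a little. Here's a hypothetical sketch in C++ (not part of this tutorial's lexer) of treating "" inside a string as one escaped quote:

#include <string>

// 'src' is scanned starting at the character after the opening quote;
// 'pos' is advanced past the closing quote on success.
bool ScanString(const std::string& src, size_t& pos, std::string& out) {
    out.clear();
    while (pos < src.size()) {
        if (src[pos] == '"') {
            if (pos + 1 < src.size() && src[pos + 1] == '"') {
                out += '"';      // "" inside a string means one quote
                pos += 2;
            } else {
                pos++;           // consume the closing quote
                return true;
            }
        } else {
            out += src[pos++];
        }
    }
    return false;                // hit end of input: unterminated string
}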

The end of statement is probably the easiest token to get. Just set up the structure and return tEOS.

Operators aren't handled yet. I stated above that I'm waiting for the single quotes before I do that.

Whitespace is skipped (the function is called) at the beginning of GetToken(). The length of tokens is dynamic, which means the token d will take up 2 bytes and hello_world will take up 12, counting the terminating null. This is handled using new() and MID$().

I've modified the Lexer class a little. So here's the whole thing. GetToken appears at the bottom.
const tError = 0;
const tWord = 1;
const tNumber = 2;
const tString = 3;
const tEOS = 4;
const tOper = 5;

struct Token
{
    int nType;
    //union {
    string *pString;
    int integer;
    double real;
    //}
}

class Lexer
{
    string *m_pText;
    unsigned int m_uPos;
    unsigned int m_uLen;
    unsigned int m_uLine;

    Token m_tToken;

    declare Lexer();
    declare _Lexer();

    declare CharTokenType(byte ch),int;
    declare SkipWhitespace();

    declare LoadFile(string fn);
    declare LoadString(string str);

    declare GetToken();
}

Lexer::Lexer()
{
    m_pText = null;
    m_uPos = 0;
    m_uLen = 0;
    m_uLine = 1; // We start on line 1

    // This variable is used to store the value of the current token.
    m_tToken.nType = 0;
    m_tToken.real = 0.0;
    // These don't need to be here when a union is used!
    m_tToken.pString = 0;
    m_tToken.integer = 0;
}

Lexer::_Lexer()
{
    delete m_pText;
    if (m_tToken.nType = tString or m_tToken.nType = tWord)
        delete m_tToken.pString;
    // Both of those calls are safe since delete tests for zero before destroying them.
}

Lexer::CharTokenType(byte ch)
{
    if ((ch >= "A" and ch <= "Z") or (ch >= "a" and ch <= "z") or (ch = "_"))
    {
        return tWord;
    }
    else if (ch >= "0" and ch <= "9")
    {
        return tNumber;
    }
    else if (ch = "\"")
    {
        return tString;
    }
    else if (ch = ";")
    {
        return tEOS;
    }
    else
    {
        return tError;
    }
}

Lexer::SkipWhitespace()
{
    if (m_pText = null) return;

    while (true)
    {
        if ((*m_pText[m_uPos] = " ") or (*m_pText[m_uPos] = "\t"))
        {
            m_uPos++;
        }
        else if (*m_pText[m_uPos] = 13)
        {
            // Either CR or CR LF (CR is ASCII 13, LF is ASCII 10).
            m_uPos++;
            if (*m_pText[m_uPos] = 10) m_uPos++;
            m_uLine++; // We've gone on to the next line
        }
        else if (*m_pText[m_uPos] = 10)
        {
            // A lone LF line terminator
            m_uPos++;
            m_uLine++;
        }
        else
        {
            return; // No more whitespace
        }
    }
}

Lexer::LoadFile(string fn)
{
    unsigned int hFile, filelen;

    hFile = OpenFile(fn, MODE_READ);
    if (hFile = 0) return; // Error.

    // Make sure there is no memory leak in the class
    delete m_pText;

    filelen = GetFileLength(hFile);
    if (filelen = 0)
    {
        m_uLen = 0;
        m_pText = new(byte, 1);
        *m_pText[0] = 0;
        CloseFile(hFile);
        return;
    }

    m_pText = new(byte, filelen+1); // Make room for the null terminator
    Read(hFile, *m_pText, filelen);
    m_uLen = filelen;

    CloseFile(hFile);

    return;
}

Lexer::LoadString(string str)
{
    delete m_pText;
    m_pText = new(byte, len(str)+1);
    m_uLen = len(str);
    *m_pText = str;
}

Lexer::GetToken()
{
    int tokentype;
    unsigned int start;

    SkipWhitespace(); // This won't do anything if there is no whitespace

    tokentype = CharTokenType(*m_pText[m_uPos]);

    if (m_tToken.nType = tString or m_tToken.nType = tWord)
        delete m_tToken.pString;

    switch (tokentype)
    {
        case tWord:
            start = m_uPos;
            while (
                (*m_pText[m_uPos] >= "A" and *m_pText[m_uPos] <= "Z")
                or (*m_pText[m_uPos] >= "a" and *m_pText[m_uPos] <= "z")
                or (*m_pText[m_uPos] >= "0" and *m_pText[m_uPos] <= "9")
                or (*m_pText[m_uPos] = "_"))
            {
                m_uPos++;
            }
            // Now extract the token
            m_tToken.nType = tWord;
            m_tToken.pString = new(byte, (m_uPos-start)+1);
            // The MID$ function takes a 1 based starting position.
            // The index is 0 based, so 1 has to be added.
            m_tToken.*pString = mid$(*m_pText, start+1, m_uPos-start);

            return tWord;
        case tNumber:
            start = m_uPos;
            while (*m_pText[m_uPos] >= "0" and *m_pText[m_uPos] <= "9")
            {
                m_uPos++;
            }
            m_tToken.nType = tNumber;
            // The val() function is used here,
            // but C's atof() function is likely to be used later
            // for floating point numbers.
            m_tToken.integer = val(mid$(*m_pText, start+1, m_uPos-start));

            return tNumber;
        case tString:
            m_uPos++; // Don't take the quote
            start = m_uPos;
            m_tToken.nType = tString;
            if (*m_pText[m_uPos] = "\"")
            {
                // It's an empty string.
                m_uPos++; // Go past the closing quote
                m_tToken.pString = new(byte, 1);
                m_tToken.*pString[0] = 0;

                return tString;
            }
            while (
                (*m_pText[m_uPos] <> 0) and (*m_pText[m_uPos] <> "\""))
            {
                m_uPos++;
            }
            if (*m_pText[m_uPos] = 0)
            {
                // Invalid string error
                return tError;
            }
            m_uPos++; // Go past the quote

            m_tToken.pString = new(byte, m_uPos-start);
            m_tToken.*pString = mid$(*m_pText, start+1, (m_uPos-start)-1);

            return tString;
        case tEOS:
            m_uPos++;
            m_tToken.nType = tEOS;
            return tEOS;
        //case tError:
        default:
            m_uPos++;
            m_tToken.nType = tError;
            return tError;
    }
}


Please post any questions you have about it. I'm hoping my comments were good enough, but don't hesitate to ask; after all, that's the point of the tutorial.

My test code now looks like this
#include "lexer.inc"

global sub main()
{
Lexer l;
l.LoadFile("P:\\Test.txt");

l.GetToken();
writeln(l.m_tToken.*pString);

while (GetKey() = "");
}

Of course, my Test.txt now contains only a single string, just for testing. Printing pString like that would fail for an integer token, whose value is stored in the integer member instead.

Parker

Parsers are really interesting. It's taken me quite a while, but I've got an expression grammar written down on a piece of paper in yacc syntax. The operators handled (precedence levels separated by semicolons) are:
- (unary), NOT; ^ (power); *, /; +, -; &, |, XOR.

What I have written down, and I've traced its path through an expression to make sure it works, is this
expr: expr '&' addexp
    | expr '|' addexp
    | expr XOR addexp
    | addexp
    ;
addexp: addexp '+' mulexp
    | addexp '-' mulexp
    | mulexp
    ;
mulexp: mulexp '*' powexp
    | mulexp '/' powexp
    | powexp
    ;
powexp: num '^' powexp
    | num
    ;
num: '-' num
    | NOT num
    | '(' expr ')'
    | NUMBER
    ;


For example, consider the expression "1 + 2 ^ 2 / 2". If it were handled linearly, left to right, the result would be 4.5. When handled correctly, the result is 3. To see how it parses, start at expr and follow the tree down. First we find an addexp, and the next token isn't an &, |, or XOR, so expr matches only addexp. Now addexp must reduce the expression: it sees an addexp, a '+', and a mulexp, matching its first rule. The left side reduces through mulexp and powexp to a num, then to a NUMBER (a lex token). Then the mulexp to the right of the '+' is reduced: it sees a mulexp followed by a '/', matching its second rule. That inner mulexp is the powexp num '^' powexp (2 ^ 2), and the powexp after the '/' reduces to a num, then a NUMBER. So the power is evaluated first, its result is divided by 2, and only then is the sum with 1 taken.

A parse tree for that expression looks like this

          +
        /   \
       1     /
            / \
           ^   2
          / \
         2   2
The interpreter will parse everything before evaluating, so all this will be stored in various structures and lists. I'll get back to that when I've got it all figured out - I don't have Aurora with me at the moment.
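In the meantime, since I don't have Aurora handy, here's a quick recursive descent sketch of the same precedence scheme in C++. It covers only a subset of the operators, works on integers, has one function per grammar rule, and does no error handling (it assumes well-formed input):

#include <cctype>
#include <cmath>
#include <cstdio>

const char* p;                // cursor into the expression text

long Expr();                  // forward declaration

void SkipSpaces() { while (*p == ' ') p++; }

long Num() {                  // num: '-' num | '(' expr ')' | NUMBER
    SkipSpaces();
    if (*p == '-') { p++; return -Num(); }
    if (*p == '(') { p++; long v = Expr(); SkipSpaces(); p++; return v; }
    long v = 0;
    while (std::isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return v;
}

long PowExp() {               // powexp: num '^' powexp | num
    long base = Num();
    SkipSpaces();
    if (*p == '^') { p++; return (long)std::pow((double)base, (double)PowExp()); }
    return base;
}

long MulExp() {               // mulexp: mulexp ('*'|'/') powexp | powexp
    long v = PowExp();
    for (SkipSpaces(); *p == '*' || *p == '/'; SkipSpaces()) {
        char op = *p++;
        long r = PowExp();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

long AddExp() {               // addexp: addexp ('+'|'-') mulexp | mulexp
    long v = MulExp();
    for (SkipSpaces(); *p == '+' || *p == '-'; SkipSpaces()) {
        char op = *p++;
        long r = MulExp();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

long Expr() {                 // expr: expr ('&'|'|') addexp | addexp
    long v = AddExp();
    for (SkipSpaces(); *p == '&' || *p == '|'; SkipSpaces()) {
        char op = *p++;
        long r = AddExp();
        v = (op == '&') ? (v & r) : (v | r);
    }
    return v;
}

int main() {
    p = "1 + 2 ^ 2 / 2";
    std::printf("%ld\n", Expr()); // prints 3 (linear evaluation would give 4.5)
    return 0;
}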

aurelCB

Hi Mr. Parker,
I've read this topic for the first time and I have two questions.
First, you said you're basing this on a book. How can I find this book, I mean for download?
Second, your examples are very interesting for interpreter making.
As you may or may not know, I'm the author of a small interpreter written completely in Creative Basic.
I'm not good at Aurora programming (maybe because I'm too lazy to learn it), but would it be possible to translate your
examples to Creative code? I think that would be very useful for others, and of course for me.
Your approach is very different from mine.
regards
aurel

Parker

You got a lot further than I did. I always got distracted with other stuff, and on top of that I haven't really done any programming for quite a while (I've gotten very into drama and am writing a musical with a friend of mine). However, I have a couple of books from when I was really into this stuff (the idea still intrigues me, but I can choose either programming or the musical...). One is the "dragon book", which is basically a college textbook on compiler design. The other is called "Writing Compilers and Interpreters" or something similar - it details how to write a Pascal interpreter and compiler in C++. Hope this helps.

Duffer

aurelCB:

I really suggest the following two books in order of importance:

1.  Game Scripting Mastery by Alex Varanese
2.  Virtual Machine Design and Implementation in C/C++ by Bill Blunden

Do not let the name "Game Scripting Mastery" put you off; this book teaches you everything you need to know to build a programming language, and although it is comprehensive, it is very easy to understand. I rely on it so much that I am currently on my third copy, because my other two are falling apart from being read so much ;)

aurelCB

Thanks Brice...
The books look very interesting, but there is nothing to download...
I mean something like a free ebook or a PDF version...