Inefficient Parsing

So I created a unique computer architecture and developed an instruction set for it. Now, even I am not so much of an uber-nerd that I want to spend the rest of my life writing machine code. Give me something that at least kinda sorta looks like English.

At first I experimented with Forth. I like Forth, it's simple and elegant while still being very powerful. But of course, "I can do better."

One thing about Forth is that it is ridiculously easy to parse. Elements are either names, numbers, or the blanks between them (spaces, carriage returns, etc. - "whitespace").

Now, I know enough about compilers that I'm aware that even parsers aren't trivial. But I figured that just keying on the very first character of each token would give me a lot of flexibility, and yet stay simple enough that I could code it up pretty quickly in machine language. And I wouldn't have to have a compiler manual by my side the whole time.

One other note: I believe in efficiency. That's an old fashioned view these days. Chips can be clocked at 4 gigahertz, and if that's not fast enough, just add more cores per chip or stack up boxes of processors in a handy warehouse. No one cares if programs or languages are efficient anymore because there's plenty of speed for everybody. But that's not a "green" way of viewing things, and I believe that sooner or later energy concerns are going to come visit the computer industry.

Be that as it may, I wanted my hand coded, simple parser to be super efficient. Imagine my surprise when I found out just how much processing overhead there is merely to grind through a few lines of text. What follows below is a discussion of dealing with a single character, and not even one of those first characters of an actual name of something, but of a simple SPACE - you know, that real big key at the bottom of a keyboard.

Not to get overly tl;dr, but I should mention that the architecture I created is also designed for efficiency, at the hardware level. For example, since it has local logic for equality testing at each sequencer, it can work its way through input text without any ALU involvement. This saves having to use extra setup instructions for setting ALU operations.

There are also some powerful methods for accessing jump tables. One kind uses a single instruction for looking up an address and loading it into the program counter. From a true efficiency standpoint, there's no magic way to reduce the number of clock cycles to accomplish this, but it's still nice to have well compressed machine language code.

Well, without further ado, let's follow the steps for dealing with one simple space character:

    # Read the next character from the input buffer
Good start!
        Instruction count: 1

    # Save the char in the CURR_CHAR register
In this case this is sort of a waste, but while processing actual named elements we will often need easy access to the current char while processing the token. Thus we save it here as well.
        Instruction count: 2

    # Call subroutine (SR), pass thru char input jump table
So efficient! Much better than plowing through individual tests for every letter on the keyboard. Really saving those clock cycles now.
        Instruction count: 3

    # Save the char's special flag in the CURR_FLAG register
But still, the jump table only got us here. Now we have to put the applicable flag on the data bus and then save it.
        Instruction count: 6

    # Return from SR
        Instruction count: 7

    # Test flag IS_WHITE
Counting as two instructions because even though the test is TRUE, still have to bypass the possible jump address.
        Instruction count: 9

    # Look for next character
A Forced jump plus the jump address goes to the code block to do this.
        Instruction count: 11

    # Bus from the CURR_CHAR register
Wait, we're still on this character? It's a SPACE! It doesn't DO anything! It's a placeholder between real interesting things like function and variable names. Why aren't we DONE? Well, if this space is the first whitespace following such a name, we've got some housekeeping to prepare for the next named entity.
        Instruction count: 12

    # Test char vs NULL
Oh, but first this block of code has to look for the possible end of the input file. This has to be done often, every time whitespace is encountered in fact.
        Instruction count: 14

    # Clear char count
Counting the chars in names is part of the interpreting/compiling process, so this has to be done (and repeated if the next character is also whitespace, alas).
        Instruction count: 17

    # Look for next token
Finally! Forced jump to go do that.
        Instruction count: 19

    # Increment input char pointer
NOW we have advanced to the next character!
        Instruction count: 20

    # Save SR return address
But wait, there's more! We called a subroutine way up top in order to utilize the jump table. It's a pretty good idea to be able to return from it, so here we set that up.
        Instruction count: 23


Twenty three friggin' instructions merely to read a space. Gah. I have to admit that as soon as I looked at the output from a run, I thought I had made some stupid coding mistake. But when I single-stepped and looked at the trace just covered, everything is there for a good reason.

So the title for this section is a bit unfair. Even getting past this space character requires a number of things that have to be accounted for and checked off. No getting around them.


Some Thoughts

Maybe the elephant in the room is my hyper-efficient mindset. More experienced low level programmers may be thinking, "What's this guy's deal?" Fair enough. I'm just surprised, is all. It takes what it takes, so be it. I simply didn't expect quite so much overhead to be involved.

Speaking of overhead, processing the subroutine call uses 5 of the 23 instructions. I started coding without using any SRs, but the spaghetti got thick real fast.

I may be doing some thrashing with registers like CURR_CHAR; eventually I'll go back through the code and see where I can work directly from the input array whenever possible. I say I care about this parser only paying special attention to the first character of a token, but what does that mean, exactly? In truth, there is some look-ahead involved.

For example, I just finished the code to process comments, which start with the '#' character, and end with either a line feed or character return. Merely some simple and quick tests, right? However, there are also block comments for commenting out multiple lines of code. These have the format, Start: #/ End: #// . Suddenly I'm interested in the second and third characters too.

I'm using a brute force method of dealing with the succeeding letters, because I'm more interested in efficiency than in producing some canonically perfect LALR parser ("Compiler writers Hate him!"). So the local variable CURR_CHAR is proving useful, but maybe I can weed it out later.

Speaking of tests, I'm beginning to get annoyed with them. There are so damn many! Ok, part of it is that they are boring. "Ho hum, copy from there and paste here ... think hard, jump on zero or one? ... another test now, which order do I want them in?"

Actually, that last part is kind of fun. The steps above test the IS_WHITE flag right away. They probably shouldn't. In most source code there are more IS_GRAF letters than whitespace, unless you go really crazy with indentation. Looking back, I realize I put IS_WHITE first for readability: this test is immediately followed by a forced jump, so the else-clause is only a few lines below. The IS_GRAF test? Juuust a bit more processing. So I should switch them ... and bump this trace from 23 instructions to 25. Jeesh.

I guess my last thought concerns overhead in general. There's always the mandatory "Do early and often" parts of any program like the testing for NULL above. Quite useful to know when you've reached the end of input!

Then there are the ones that could be avoided if we just added more tests. Clearing the char count register EVERY TIME whitespace is encountered gets pretty inefficient when running through several spaces in a row, but what are you gonna do? Could assign a register to count whitespace hits but that would create more code overhead than it eliminates. The final irony would be having to clear that register once a token is found.

Actually, several of the instructions above show a similar redundancy. Perhaps I will eventually go back and look at the code and decide it is efficient to keep track of whitespace "strings" after all. But there is only a maximum (minimum?) degree to which one can granularize the granularity, so to speak. Eventually the checking and double-checking begins to get in its own way.

Obvious stuff, obviously, but I find it fun to think about. And to wish for some magically efficient way to avoid what can't really be avoided.

One final point: My standard FSA machine language files should end with the string #@! - yet another specialized form of comment. So why look for an end-of-file NULL at all? Well, maybe I don't trust everyone (me) to always remember to end with the #@!. This kind of naturally leads to the idea of reporting syntax errors. And that leads to whole new orders of testing complexity and inefficiency. There's always a tradeoff between a perfect language consuming the minimum number of instructions and a language that is reasonably human friendly. Seems like a topic for a future essay ... when I learn more.





         Introduction to "The Perfect Language"    —   Table of Contents
         Compiling On    —   Previous
         Next    —   Eventually
         Glossary




Started: January 26, 2016