On Compiling

I have a new BBF (Best Blogger Forever). Forget Graham or Spolsky, who needs them anymore? Now I've found Steve Yegge, who has dual blogs, no less: Stevey's Drunken Blog Rants™, followed by Stevey's Blog Rants. Apparently he sobered up by the time he moved from working at Amazon to working at Google. Good for him!

If you have stumbled on this essay before plowing through my previous eleven blogs on "The Perfect Language", you might want to read my articles ( Introduction to ... ) before surfing over to his. Why? Because mine are very elementary by comparison. Steve is really smart. He's unpretentious enough that he would probably deny that, but there's no denying that he knows one hell of a lot more about computer languages than I do (therefore ... I have to hate him). He's also a better writer than I am (yes, I truly hate him). So start with my elementary stuff first if you wish [ Where'd everybody go? Looks like they already all clicked away ].

Oh, you're still here? Thanks for hanging around! So I am going to quote a bit from his Rich Programmer Food, where he discusses - surprise - compilers. And I have discussed compiling off and on throughout my articles. What I want to do is compare what I have written about Forth-based compilers (including "North", my own version) with the high-level language compilers he writes about. The Forth model is much simpler. I didn't say better, just simpler. I didn't say worse, either ...

Let's start by stealing fro ... quoting his blog:

... compilers can logically be thought of as three separate phases -- so separate, in fact, that they constitute entirely different and mostly non-overlapping research domains.

The first big phase of the compilation pipeline is parsing. You need to take your input and turn it into a tree. So you go through preprocessing, lexical analysis (aka tokenization), and then syntax analysis and IR [intermediate representation] generation ... the output of this pipeline stage is usually a parse tree of some sort.

Parsing - Lexical analysis is already done in Forth/North. If a blob of text is surrounded by whitespace (spaces, tabs, returns and the like), it has already been analyzed. There should be a dictionary definition by the same name, which will either immediately run its constituent machine language (if interpreting) or will yield the pointer to that code, to be added to a list of addresses from previous tokens (if compiling). There's no looking ahead to succeeding tokens, no looking back; the token is its own context and knows what to do with itself (so to speak).
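To make that concrete, here is a minimal sketch of that dispatch loop in Python, not North - the word set, the literal handling, and all the names are illustrative stand-ins, not how my kernel is actually laid out. Whitespace is the entire lexer; each token either runs right now or has its pointer stacked onto the definition being compiled.

```python
# Toy model of the Forth-style "outer interpreter": whitespace-delimited
# tokens are looked up in a dictionary and either executed immediately
# (interpreting) or appended to a growing pointer list (compiling).
# Everything here is illustrative Python, not North.

def make_interpreter():
    stack = []
    dictionary = {
        "+":   lambda: stack.append(stack.pop() + stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
        ".":   lambda: print(stack.pop()),
    }
    return stack, dictionary

def outer_loop(line, stack, dictionary, compiling=False, target=None):
    for token in line.split():          # lexical analysis: whitespace, done
        word = dictionary.get(token)
        if word is not None:
            if compiling:
                target.append(word)     # stack the "pointer", don't run it
            else:
                word()                  # interpret: run it now
        else:
            stack.append(int(token))    # not in dictionary: try a literal

stack, dictionary = make_interpreter()
outer_loop("2 3 + dup +", stack, dictionary)
# stack is now [10]
```

Note there is no lookahead anywhere in that loop: each token is consumed, dispatched, and forgotten, which is the whole point.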

By the same token [pun] there is no need for syntax analysis, for the same basic reason. North tokens stand alone and are dealt with as they are encountered. They are dealt with completely, so there is no intermediate representation. The final representation is the above-mentioned pointer, which is stacked on top of previous pointers - we don't end up with any tree structure. The North wind blows the leaves off those trees! It is 4 freakin' AM, don't expect poetry ... More to the point, Forth-like languages are already written in a compiler-friendly manner. An expression in North to add two numbers might be:
   somevar anothervar +
which is in "Reverse Polish Notation" (RPN). Conversely, most HLLs use a more user-friendly "Infix Notation" like the math we learned in school:
   somevar + anothervar
yet part of converting to the IR in an HLL compiler is to internally restructure the infix notation into the stack-friendly "Postfix" format.
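For contrast, that restructuring step can be sketched with a toy shunting-yard pass - simplified to two precedence levels and no parentheses, and not taken from any real compiler:

```python
# Toy infix-to-postfix pass: this is the work the HLL compiler has to do
# internally, and that RPN source sidesteps entirely.

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def infix_to_postfix(tokens):
    output, ops = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # pop any operator that binds at least as tightly
            while ops and PRECEDENCE[ops[-1]] >= PRECEDENCE[tok]:
                output.append(ops.pop())
            ops.append(tok)
        else:
            output.append(tok)          # operand goes straight out
    while ops:
        output.append(ops.pop())
    return output

print(infix_to_postfix("somevar + anothervar".split()))
# ['somevar', 'anothervar', '+'] -- exactly the North source order
```

The output is the North source order; the North programmer has simply done the shunting in his head before typing.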

The early Forths had no preprocessing. There are so many Forth versions out there these days that I'm sure some percentage of them do now. North will have a simple C-style 'substitution' preprocessor, because the hardware modules of my Flexible System Architecture can be dropped into arbitrary sections of the final silicon. I'm trying to avoid hardware implementation details here in this language blog, but what I'm essentially saying is that for the same code to run on different chips, the different 'wheres' of hardware units on those different chips have to be mapped with the equivalent of C #define statements. So, two passes through the source code: the first merely to plug in different numbers, the second to do the compiling relevant to this essay.
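The first pass could be as dumb as this sketch - plain text substitution, the moral equivalent of C's #define. The directive name 'equ' and the hardware address are invented for illustration; North's actual directive syntax is undecided.

```python
# Toy substitution preprocessor: 'equ NAME VALUE' lines record a mapping
# and emit nothing; every other token is passed through, substituted if
# its name is known. Directive name and addresses are made up.

def preprocess(source, definitions=None):
    definitions = dict(definitions or {})
    output_lines = []
    for line in source.splitlines():
        parts = line.split()
        if parts[:1] == ["equ"]:            # e.g.  equ UART_BASE 0x40
            definitions[parts[1]] = parts[2]
            continue                        # directive lines emit nothing
        output_lines.append(" ".join(definitions.get(t, t) for t in parts))
    return "\n".join(output_lines)

src = "equ UART_BASE 0x40\nUART_BASE 5 +"
print(preprocess(src))
# 0x40 5 +
```

Per-chip mappings would just be a different set of 'equ' lines fed in ahead of the same source.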

The next big phase is Type Checking.

Type Checking - There IS no type checking. Forth is a typeless language; it says so right there in Wikipedia, so it must be right. In a previous blog I touched on the fact that Forth operators are typed - not formally, but practically. While running a program, if previous calls have put two 32-bit integers on the parameter stack, meaning four 16-bit stack slots are filled, then the North statement:
   some32var other32var +
is going to add the two halves of one of the 32-bit numbers to each other, put the resulting 16-bit value on the stack in place of them ... and leave three slots of junk. Crash time. The plus sign must be replaced with "D+", or "+32", or whatever name somebody wants to give to a completely different operator.
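The mishap is easy to simulate. In this Python sketch the function names (push32, plus16, plus32) are mine, invented for illustration; only the slot widths and the D+/+32 idea come from the text above.

```python
# Simulating 16-bit stack slots: a 32-bit value occupies two of them.

MASK16 = 0xFFFF

def push32(stack, value):                  # 32-bit value -> two slots
    stack.append((value >> 16) & MASK16)   # high half
    stack.append(value & MASK16)           # low half

def plus16(stack):                         # plays the role of plain '+'
    stack.append((stack.pop() + stack.pop()) & MASK16)

def plus32(stack):                         # plays the role of 'D+' / '+32'
    def pop32(s):
        lo, hi = s.pop(), s.pop()
        return (hi << 16) | lo
    push32(stack, (pop32(stack) + pop32(stack)) & 0xFFFFFFFF)

stack = []
push32(stack, 0x00020001)
push32(stack, 0x00040003)
plus16(stack)            # adds the two halves of ONE number: 3 + 4
print(stack)             # [2, 1, 7] -- three slots of junk

stack = []
push32(stack, 0x00020001)
push32(stack, 0x00040003)
plus32(stack)            # the right operator
print(stack)             # [6, 4] -- i.e. 0x00060004, in two slots
```

Same stack picture going in; only the operator knows the difference.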

In one of his drunker, er, earlier blogs, Scheming is Believing, Yegge made a point about data typing that was brand new to me: "type tags are essentially metadata ... types really are just a classification scheme to help us (and our tools) understand our semantic intent. They don't actually change the semantics; they just describe or explain the (intended) semantics."

Metadata is 'data about data,' and not the data themselves. For example, the North plus sign doesn't care if it is adding two numbers, two ASCII characters, or junk to junk. If I wanted to add some sort of type checking to North - which might be hard due to its every-token-stands-alone compiling model - then the compiler could notice the '+' after the two _32vars above and print an error message and maybe stop the compiling altogether. Yet typing the three entities wouldn't actually change the basic nature of any of them. Interesting.

Wait ... the more I sit here visualizing, the more I believe it would be easy to add type checking to the RPN format. At least simple type checking. Do I really want to? That would be the question. I want to keep North unencumbered, and its kernel/compiler fast. I don't want to overload the language with a bunch of fluff. Another idea I picked up from reading the Rants is to make compile-time type checking optional. Warnings only - no fatal errors! I'm pretty sure I would have come to this conclusion on my own, because one of my major aims for North is to keep it flexible. Anyway, sometimes when I actually start to implement something I've previously visualized, I find out my brain was full of fluff. We'll see when we get there ...
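One shape such an optional check could take - purely a sketch, with an invented stack-effect table; it tracks slot counts rather than real types, and warns without ever failing:

```python
# Each word declares a stack effect: slots consumed, slots produced.
# The checker walks a definition simulating the slot count, and prints
# a warning (never a fatal error) on a shortfall. The table is invented.

EFFECTS = {
    "+":    (2, 1),      # 16-bit add
    "D+":   (4, 2),      # 32-bit add
    "dup":  (1, 2),
}

def check_definition(tokens, warn=print):
    depth = 0
    warnings = 0
    for tok in tokens:
        consumed, produced = EFFECTS.get(tok, (0, 1))  # literal: pushes 1 slot
        if depth < consumed:
            warn(f"warning: '{tok}' wants {consumed} slots, stack has {depth}")
            warnings += 1
            depth = 0                  # keep going; warnings only
        else:
            depth -= consumed
        depth += produced
    return warnings

check_definition("1 2 D+".split())   # two slots on the stack, D+ wants four
```

Because every token stands alone, this stays a single linear pass - no tree to walk, which is about as unencumbered as a checker gets.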

The third camp, who tends to be the most isolated, is the code generation camp. Code generation is pretty straightforward, assuming you know enough recursion to realize your grandparents weren't Adam and Eve. So I'm really talking about Optimization, which is ... the art of producing correct code that is equivalent to the naive, expensive code written by your presumably naive, expensive programmers.

Code Generation - I suspect it was getting to be about 4 AM for Stevey at this point as well; he's getting a bit snarky. Be that as it may, code generation is so straightforward in North that it is already done. I build the top part of a definition as soon as the token for the name of the to-be-compiled entity has been read. Then comes a field tagging the remainder as an address list rather than direct machine code, and finally I start writing the compiled addresses right there in the definition (for a graphic, see here). I'm sure almost all Forth compilers do something very similar. I probably should have gone online and studied some, but I wanted the challenge of figuring it out myself (I did have one thin book handy).
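That layout can be mocked up in a few lines. The field names and the numeric "addresses" below are invented for the sketch; only the header-then-tag-then-address-list order comes from the description above.

```python
# Toy model of compiling a definition: header first, a tag marking the
# body as an address list, then one pointer appended per source token.

PRIMITIVES = {"dup": 0x100, "+": 0x104}   # pretend machine-code addresses

dictionary = dict(PRIMITIVES)   # name -> address of its code
memory = []                     # flat "dictionary space"; indices = addresses

def compile_definition(name, tokens):
    start = len(memory)
    memory.append(("header", name, "address-list"))   # name + tag field
    for tok in tokens:
        memory.append(dictionary[tok])                # write the pointer
    dictionary[name] = start    # word is callable the moment it's closed
    return start

compile_definition("double", "dup +".split())
print(memory)
# [('header', 'double', 'address-list'), 256, 260]
```

There is no separate "code generation phase" to speak of: the compiled artifact is finished the instant the last token of the definition has been read.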

I have given a lot of thought to optimization. Stacks seem somehow wasteful. When a compiled definition is called later in a program, that whole pointer list has to be copied over to the call stack. It would be preferable to simply run the calls from the original list. The thing is, something very beautiful happens when one compiled object calls another. The first address from the second entity's pointer list overwrites its own call and then the rest of its addresses are tacked on, and calling resumes. The stack grows and shrinks like an accordion, until the last call in the program gets its last beer and everything cycles back for more input from the keyboard.

To try to make these calls from within the definitions themselves would require some sort of 'called from' stack to remember where each compiled def jumped out of itself, so it could get back to the right place when the second (and third, and fourth, and nth) def(s) finished up. It might be faster, but I think it could end up like twenty octopi playing Twister. As soon as I get my updated instruction set running properly, I'll have to give it a try, but my hunch is that I'll be opening Pandora's can of worms.
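Here is a toy model of the accordion described above - a call to a compiled definition is overwritten by that definition's own pointer list, and calling resumes. The structures are illustrative Python only, not how North lays anything out.

```python
# The "accordion": 'pending' is the call stack of addresses still to run.
# A compiled word splices its pointer list in where its call sat; a
# primitive just executes. The list grows and shrinks as described.

def run_accordion(entry, defs, primitives):
    pending = list(defs[entry])              # copy the entry's pointer list
    while pending:
        addr = pending.pop(0)
        if addr in defs:                     # compiled word: splice it in
            pending = list(defs[addr]) + pending
        else:
            primitives[addr]()               # machine code: just run it

stack = []
primitives = {
    "one": lambda: stack.append(1),
    "dup": lambda: stack.append(stack[-1]),
    "+":   lambda: stack.append(stack.pop() + stack.pop()),
}
defs = {
    "double":    ["dup", "+"],
    "quadruple": ["double", "double"],
    "main":      ["one", "quadruple"],
}
run_accordion("main", defs, primitives)
print(stack)
# [4]
```

The 'called from' alternative would instead push a return address and jump into the original list - which, for what it's worth, is the route traditional threaded-code Forths take with their return stack.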

I'd call compiler optimization an endless chasm of eternal darkness, except that it's pretty fun.

Yes, it is!

North Rules

No, that title isn't bragging (can't rule if it hasn't even been born yet). The title is, rather, the beginning of a short section of notes to myself about things I need to remember as I try to extend North:

• Basic data structures - As alluded to above, the parameter fields of North definitions contain either machine code or stacks of call addresses. That's it. Simple. Someday I just know I'm going to try to get tricky and fudge in another format. No, No, NO! The run-time paradigm is call, then run some code. Sometimes it may be call-call-run, or call-call-call-call-call-run, depending on how deep the compiling gets layered. But nothing more than that ever happens, and there's no excuse for screwing up this simplicity.

• Reflection - North must ALWAYS be able to introspect its own definitions, and to modify them at run time. It's a basic difference between a powerful language and a crippled one.

• Stacks Stacks Stacks - This will be my biggest temptation. After writing C for 30 years I don't think in stacks. I think in functions. With local variables. As I quoted Charles Moore in the previous blog, "You can make local variables very efficient especially if you have local registers to store them in, but don't. It's bad. It's wrong." Aw, c'mon Chuck, just one little local variable? Or maybe two or three? No more than four, I promise. I already know that I am going to write crappy North code for a while until I get the hang of thinking in stacks. And RPN. And recursion.

• Speaking of the previous essay - It deserves a bullet point all its own. The whole file is my note to myself to try to implement a grown up language right from the beginning by leveraging the 30 years of experience of Forth's inventor.

Compiling a Summation

To quote Mr. Yegge one more time:

Gentle, yet insistent executive summary: If you don't know how compilers work, then you don't know how computers work. If you're not 100% sure whether you know how compilers work, then you don't know how they work.

You have to know you know, you know.

In fact, Compiler Construction is, in my own humble and probably embarrassingly wrong opinion, the second most important CS class you can take in an undergraduate computer science program.

So what does he consider the MOST important? The 'Rich Food' link's right up top, go and see - no spoiler here.

I don't know how compilers work. Not at anywhere near Yegge's level. A long time ago, I got a secondhand copy of the infamous "Red Dragon Book" and read the first chapter or three. I kind of understood lexical analysis and context-free grammar, but then I got overwhelmed. More than I could handle at the time. I've since learned that anything that deep has to be learned by doing, not reading.

And yet I have written a compiler. Unfinished, but it works on enough test input to know I know, you know? So did I cheat by going with a dollar-store compiling method like Forth? Or was I smart enough to avoid the unnecessary complication of all the things Steve has mastered about the subject that I haven't? The truth is that I was neither smart nor dumb; it's really only as I write this very blog that I realize the contrast between Forth and the big-boy compilers - the Dragon Book was and is dim in my memory.

And yet Forth programmers claim their language can tackle any problem "those other" languages can handle and get there quicker with smaller, faster code. They can be pretty obnoxious about it in fact. Me, I've written very little Forth OR North to this point, and haven't tackled any new large problem since I started writing the FSA simulator (in vanilla, generic C). So I gotta sit on the fence and stay neutral.

I do have one strong general opinion, however: SIMPLICITY! If I somehow got transported to a parallel universe perfectly identical to our world of the 1950's except for the single exception that it was up to me to pick the very first language beyond assembler to jump start their computer revolution, I'd go with Forth. I'd kick myself for sticking myself (and the world) with its bass-ackward syntax and its plethora of dinky little operators. Hell, even the plus sign is a subroutine. EVERYTHING in Forth is a subroutine. It is machine language with named calls. But it is a functional language - and I mean that in the formal sense. It is (I believe) a tiny Lisp without the parentheses. The proof of that will be how well North is able to write, then execute, more of itself. Much to learn.

Most importantly, it is extensible. You could layer static typing and local variables on top of it. You could layer the syntax of Algol-like, C-like languages on top of it. You could layer classes and object orientation on top of it. You could layer any abstractions you wanted to on top, and yet at bottom would be a strong foundation of tiny unit-tested pieces. And I feel that foundation would be more solid than what we have in this world now. I could be wrong.

I really do have to code up some North as soon as I can. Even after writing my call/response piece with Charles Moore in the previous blog, there are issues about Forthish languages I cannot get my mind around. I can sort of imagine how I could rewrite my simulator in North. Yet I can't even dream of how one might go about implementing a relational database in Forth, North or Zorth. I think in C. I've thought in C for decades. How do you do absolutely everything through a parameter stack, with an absolute minimum of variables (if you're writing to Moore's standard)? I can't figure out how to figure it out, if that makes sense. Problem for another day.

The opposite of simplicity is complexity. I've written on complexity elsewhere and I don't want to endlessly repeat myself. That dead horse is starting to smell and needs to be dragged away. But I will restate: "Complexity has to go somewhere." Do we really want it spread throughout a ladder of abstractions or do we want it at the very bottom, with well implemented, well documented, fanatically tested base units? Can we make the ladder shorter and still come up with a sufficient number of time-saving condensations? Will it matter anyway - how would we replace billions (trillions?) of lines of buggy code at this point? If the "machine stops" a la Forster, will we want to start over with something more reliable or will we wish for a whole different world without digital computing?

Maybe the Mayans know.

         Introduction to "The Perfect Language"    —   Table of Contents
         Why Forth? - Part 2    —   Previous
         Parceling the Parsing    —   Next

Started: December 21, 2011