Adding it all up

Before getting into this essay, a quick definition of the term "ALU": It stands for Arithmetic Logic Unit, the circuitry inside a computer that does the actual computing. Not all ALUs are the same, but they can usually at least add, subtract, and compare two numbers (the ALU has probably been performing all the 'tests' referred to in the previous essay). It is sort of an electronic hopper - bit patterns into the top, resulting bit patterns out the bottom.

Anyway, here we are three essays in from the intro, and we are still working on the statement:

somevar = 3 + 5

To reiterate from the previous essay: We have the numbers 3 and 5 on the stack, and we have a definite, unique place in memory, tied to the name "somevar", to store the ultimate sum. The only things left over are the '+' sign and the '=' sign.

The plus sign:

00101011    <-- ascii +

is the active part of the statement. It might even be called the 'verb' of the 'sentence'.

The plus sign basically has to "tell" the ALU it is going to be adding two numbers, it has to "know" where to find the 3 and 5 and get them to the ALU, and finally it has to do something with the 8 that the ALU is going to produce.

It somehow has to ultimately point to the beginning of a mini-program that can perform all the operations outlined in the previous paragraph.

The word "point" is a very good one to use at this, er, point, since "pointer" is a computer science term for the memory address of something. In this case we ultimately need to point to the start of the PLUS mini-program.

It would be convenient if the bit pattern for ascii '+' happened to actually be the starting address of its own program. But that never really happens. So somehow that ascii representation has to get translated into the true starting point(er).

One very straightforward way would be to use the '+' bit pattern as an index into a table of pointers. The basic ascii set uniquely encodes each of 128 characters, 94 of them printable English letters, numbers and punctuation (the rest are mostly "control" characters like the line feed or carriage return).

So single operations like +, -, * and / could each index into a small table of the starting addresses to the separate programs for add, subtract, multiply and divide. Those addresses would be pointers to the beginning of the respective routines, while the ascii bit patterns would become literally "pointers to pointers."

This is such a good idea that we'll explore it more fully very soon, but an alternative is to treat the plus sign in a way similar to the way we stored somevar.

Here is how the FSA [currently] stores the plus sign:

                  # The entry for "+"
        00000133  #   link field
        00002223  #   name field = LRS
        00000001  #   name field = # of chars
        00000223  #   ascii - +
        00000000  #   null termination
        00000001  #   stuf field
#       00000311      code field
                  # parm field
        23123032  #   Seq scope BUS fm param stack memory, pop
        03233103  #   LAT at BPP_ALU_APT
        03133203  #   direct immediate to ALU_CTRL_APT
        00002100  #   data - load PLUS code, no carry in
        23123022  #   BUS fm param stack memory
        23123122  #   LAT to param stack memory, push
        01000030  #   reset alu181 path (load 'straight' path)
        33203330  #   toggle clk enable of addr interp
        33303330  #   toggle clk enable of dictionary (self)
        00000000  #   final null

I say 'currently' because these essays reflect a development project and things are likely to change in future. For example, notice how the code field is commented out - it turns out not to be needed in the FSA.

The top part is made up of several 'fields', and if you look closely and ignore the "LRS" line for now, you'll see the resemblance to the way somevar is stored. Yes, it's part of a linked list of names, the name in this case being the single ascii character '+'.

This format is borrowed from a higher level computer language called "Forth". In Forth parlance, this whole data structure is called one "word" in a "dictionary" of many words, all connected together by their link fields.

Ignore three of the memory slots just below the + sign (including the one commented out), and there's the parm ("parameter") field, taking up a full nine slots. This is the active code for + in this system. The parm field is the mini-program that does the addition.
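
If it helps to see that layout in more familiar terms, here is a loose C sketch of one dictionary entry. The field names and sizes are mine (the real FSA just uses raw memory slots, as shown above), so treat it as a picture, not a spec:

    #include <stdio.h>
    #include <stdint.h>

    /* A loose C model of one dictionary "word".  Field names and sizes
       are mine; the real FSA layout is the raw slots in the dump above. */
    struct word {
        struct word *link;      /* link field: previous word in the dictionary */
        uint8_t      lrs;       /* name field: the Longitudinal Record Sum     */
        uint8_t      count;     /* name field: number of ascii characters      */
        char         name[16];  /* the ascii characters, null terminated       */
        uint16_t     stuf;      /* stuf field: what kind of parm field follows */
        uint16_t     parm[16];  /* parm field: the word's active code (or data)*/
    };

    int main(void)
    {
        /* The '+' entry, more or less; the LRS is ignored for now, just like the text says. */
        struct word plus = { 0, 0, 1, "+", 1, { 0 } };
        printf("%s has %d character(s)\n", plus.name, (int)plus.count);
        return 0;
    }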


The deck is stacked

Time for a friendly warning about this parm field code. Remember a couple of essays ago when I came up with two different examples of 3 + 5 that would be "WAY more readable" than assembly code? I had:

somevar = 3 + 5

And also:

3 5 +

Well, this FSA code is stack based, which means it doesn't give a fig about somevar, or the '=' sign, for that matter. That's because it is an implementation of Forth, and wherever possible, Forth uses the stack instead of named variables.

Which is better? Oooh, that's a long running argument among programmers, let's avoid that one for the time being!

That being said, most compilers end up reordering the 3 + 5 into something like 3 5 + internally so they can do their thing by way of a stack.

To get technical for a paragraph, that's because most compilers process the sorta-English of source code by using a "Pushdown automaton", which is a "Finite-state machine" plus a stack. And to use that stack, the tokens can't be in the 3 + 5 "infix" notation, but rather in the 3 5 + "postfix" notation, otherwise known as RPN (for Reverse Polish Notation). It's all on Wikipedia if you want to look it up.

For a high level language, Forth is a pretty low level language, if that makes sense. Some programmers swear by Forth. Most, it seems, swear at it, not least because Forth demands its input to be typed in postfix form.

For all of us who have been raised on umpteen years of math and algebra, having to change the right side of an equation such as

(3 + 5) * (4 + 6)

into

3 5 + 4 6 + *

does not come naturally.
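
For what it's worth, the mechanical shuffle a compiler does to get from one form to the other isn't magic. Here is a bare-bones C sketch of the classic conversion - single-digit numbers only, and nothing like what the FSA or any real compiler actually does internally:

    #include <stdio.h>
    #include <string.h>

    /* Precedence: * and / bind tighter than + and -. */
    static int prec(char op) { return (op == '+' || op == '-') ? 1 : 2; }

    /* A toy infix-to-postfix converter, single-digit operands only. */
    static void to_postfix(const char *in, char *out)
    {
        char ops[64];                 /* the operator stack */
        int  top = -1, o = 0;

        for (; *in; in++) {
            char c = *in;
            if (c >= '0' && c <= '9') {
                out[o++] = c; out[o++] = ' ';       /* numbers go straight out */
            } else if (c == '(') {
                ops[++top] = c;
            } else if (c == ')') {
                while (top >= 0 && ops[top] != '(') {
                    out[o++] = ops[top--]; out[o++] = ' ';
                }
                top--;                              /* discard the '('         */
            } else if (strchr("+-*/", c)) {
                while (top >= 0 && ops[top] != '(' && prec(ops[top]) >= prec(c)) {
                    out[o++] = ops[top--]; out[o++] = ' ';
                }
                ops[++top] = c;
            }                                       /* spaces just fall through */
        }
        while (top >= 0) { out[o++] = ops[top--]; out[o++] = ' '; }
        out[o] = '\0';
    }

    int main(void)
    {
        char out[128];
        to_postfix("(3 + 5) * (4 + 6)", out);
        printf("%s\n", out);                        /* prints: 3 5 + 4 6 + *   */
        return 0;
    }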

And it's not just numbers. All Forth input pretty much directly goes onto, or at least uses, the stack. So most lines of Forth source seem uncomfortably backwards.

In a future essay I'll go into more detail about why I'm implementing a version of Forth on the FSA, but the main reason is because it is indeed a low level high level language. This means two things: it is simpler to implement because it is so direct, and it is very powerful, powerful because it doesn't hide any of the underlying computing machinery from the programmer. Nor does it - and this is important - hide its own underlying data structures from itself. Most other languages do hide one or both to some degree or another.

To get back to our addition problem, the PLUS code needs to set the ALU circuitry into ADD mode and run the true binary values of 3 and 5 through that ALU to get binary 8 out:

0000000000000011 + 0000000000000101 = 0000000000001000

and it all happens on the stack, just like the fake stack oriented assembly language segment from a couple of blogs ago.

This doesn't mean that variables aren't allowed in Forth. In fact, the complete Forth version of

somevar = 3 + 5

would be

3 5 + somevar !

No, the exclamation mark isn't there because it's such an exciting line of code; '!' is another Forth one-character word, equivalent to the equals sign (at least when '=' is interpreted as an assignment operator).

Here's what the line of Forth looks like from the stack's point of view. The stack starts out empty,

----- (empty)

Then ascii 3 is read in and stacked after being converted to pure binary,

  3
-----

Same for ascii 5,

  5
  3
-----

The '+' word is looked up in the dictionary; it doesn't actually have to be stacked (could be, but that would just be a wasted step). Its active code pops off 3 & 5 into the ALU and pushes the ALU result onto the stack,

  8
-----

somevar is found in the dictionary, and its address is put on the stack,

  address of somevar
  8
-----

Finally, the '!' word's active code uses the address of somevar that it pops off the stack to direct the 8 that it next pops off, into somevar's value slot, leaving the stack empty again.

----- (empty)
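
If you'd rather see that whole walkthrough as something you can run, here it is squeezed into plain C. An array stands in for the parameter stack, and an ordinary C pointer stands in for the address that somevar leaves on the stack in real Forth - a simplification on my part, not the way Forth actually mixes them:

    #include <stdio.h>

    static int stack[16];
    static int sp = 0;                   /* next free slot; 0 means empty     */

    static void push(int v) { stack[sp++] = v; }
    static int  pop(void)   { return stack[--sp]; }

    int main(void)
    {
        int somevar = 0;                 /* somevar's value slot              */

        push(3);                         /* stack: 3                          */
        push(5);                         /* stack: 3 5                        */
        push(pop() + pop());             /* '+' pops both, pushes 8           */
        int *addr = &somevar;            /* somevar: its address, kept aside  */
        *addr = pop();                   /* '!' stores the 8; stack empty     */

        printf("somevar = %d, stack depth = %d\n", somevar, sp);   /* 8, 0    */
        return 0;
    }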

And here's a trick you might not be looking for ... somevar itself has a parameter field made up of active code which puts its own address on the stack!

In true Forth, all elements other than numbers are words in the dictionary, be they variables, constants, operators, etc., and almost all words perform some kind of self-contained action.

Direct, powerful, elegant, and hard to read.


Compilation

I really want to at least try to save most of my observations about Forth for its own essay (coming soon to a computer near you). Still, I can reiterate here that it makes a good first language for a new architecture because it is easy to implement, self-contained, and its stack orientation means it is similar to the internals of compilers for other languages.

That last is something that I hope will give me a leg up on approaching the "truly good new" computer language. After all, I claimed in the introduction that that is what this overall blog is about, didn't I?

Let's talk about compiling. So somevar and all other variables can have active code that places their own address on the stack. Fine, but how many memory slots does that use up? For each variable? Sounds like a lot of repetition, maybe a lot of memory space.

What if we put the code that pushes the address onto the stack somewhere else, and put a pointer to that code in the parm field of each variable? Yep, that works. Now there's just one copy of the active code and as many short, one-slot pointers to it as there are variables.
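
Here's that one-copy-of-code idea as a small C sketch. The "parm field" becomes a function pointer plus a value slot, and every variable points at the same routine - my names, my simplification, not the FSA's actual format:

    #include <stdio.h>

    struct variable {
        void (*code)(struct variable *self);   /* the single shared routine   */
        int    value;                          /* this variable's own storage */
    };

    static int *addr_stack[16];                /* a tiny stack of addresses   */
    static int  sp = 0;

    static void push_own_address(struct variable *self)
    {
        addr_stack[sp++] = &self->value;       /* "put my address on the stack" */
    }

    int main(void)
    {
        struct variable somevar  = { push_own_address, 0 };
        struct variable othervar = { push_own_address, 0 };   /* same code, no copy */

        somevar.code(&somevar);                /* somevar leaves its address ...    */
        *addr_stack[--sp] = 8;                 /* ... and '!' stores 8 through it   */

        othervar.code(&othervar);
        *addr_stack[--sp] = 42;

        printf("%d %d\n", somevar.value, othervar.value);     /* prints 8 42        */
        return 0;
    }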

And why stop at one pointer? Forth has two special operators, ':' and ';' that say, respectively, "begin compiling" and "end compiling". Suppose we take the Forth statement we used earlier:

3 5 + somevar !

and surround part of it with colon-semicolon:

: someword + somevar ! ;

See that there's one more new token, "someword". That is going to be the name of a new entry in the Forth dictionary. Its top part will look a lot like the word for '+' above, except of course that the character count will be eight, and all eight ascii letters of "someword" will follow the count.

Cruising through the line of input, Forth will hit the ':' and switch its internal functionality into compile mode. Then it will start creating a new entry named someword. From that point until it reaches the ';', it will interpret "+ somevar !" one last time.

That is to say Forth will find each of +, somevar, and ! in the dictionary and for each, it will store the address of the start of its active code into the parm field of someword.

But just before it does all that, it will place

        00000002

into the stuf field of someword, which is different than the

        00000001  

that the word for '+' has there.

The reason is that instead of having a short machine language mini-program in its parm field like '+' has, someword will have an array of pointers: the addresses of each of the mini-programs that +, somevar, and ! contain. Once all three have been processed, someword has been compiled. Finally, ';' switches the overall system state back to interpret mode.
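
In C terms - and only as a loose analogy - compiled someword boils down to an array of code addresses plus a little loop that calls them in order. Function pointers stand in for the FSA's raw addresses, and the names are mine:

    #include <stdio.h>
    #include <stdint.h>

    static intptr_t stack[16];
    static int      sp = 0;
    static intptr_t somevar_value;

    static void do_plus(void)
    {
        intptr_t b = stack[--sp], a = stack[--sp];
        stack[sp++] = a + b;
    }
    static void do_somevar(void) { stack[sp++] = (intptr_t)&somevar_value; }
    static void do_store(void)
    {
        intptr_t *addr = (intptr_t *)stack[--sp];
        *addr = stack[--sp];
    }

    typedef void (*xt)(void);      /* an "execution token": the address of active code */

    /* someword, once compiled, is just these three addresses. */
    static xt someword_parm[] = { do_plus, do_somevar, do_store };

    int main(void)
    {
        stack[sp++] = 3;
        stack[sp++] = 5;

        for (int i = 0; i < 3; i++)          /* no dictionary hunting, just pointers */
            someword_parm[i]();

        printf("%d\n", (int)somevar_value);  /* prints 8 */
        return 0;
    }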

The advantage of compiling is that even though the ultimate result is the same (8 being stored at somevar),

3 5 someword

runs a whole lot faster than

3 5 + somevar !

because going directly to the addresses of the active code of Forth words is much faster than hunting for them through the dictionary links.

This is true of compiling in general. Compiling is where the English gets fully squeezed out of the human - computer interface, and only machine language is left. For instance, once someword is compiled, you could overwrite the top half of the '+' entry with a bunch of random junk, yet so long as you left its parm field alone, someword would still run correctly. As far as someword is concerned, '+' is no longer a name or a token, it is simply an address of code to run.


Making a hash of it

Alright: the LRS field, what's that about? I mentioned a while back that I have a simulator for running code written for the Flexible System Architecture. It is in fact the base platform I am using for my computer language explorations.

To be blunt about it, this simulator runs too slow for my taste. There are various reasons for that, which I may even have cause to go into sometime, but one good thing about it: It makes me extremely sensitive to the efficiency of any code running on it.

The sim's speed wasn't really an issue when I was developing proof of principle math routines for it like the screen shot of the CORDIC sine & cosine on the genapro.com home page. The math algorithms maybe didn't break any speed records, but they ran fast enough that I could debug them in something reasonably close to real time.

But once I started programming the "kernel" of the Forth system, well, who knew that chugging through a Forth dictionary was such a computationally involved process? I'm constantly sitting there going, "Come on, come on already."

My very first test Forth dictionary had barely more than six entries. Four of those were the one-character words +, :, ; and '.'. You're already familiar with the first three; "dot", '.', simply pops whatever value is on the top of the stack and displays its numeric value (finally, something gets translated back to English!).

With all of them being single-character operators, the character count field was doing more harm than good, speed-wise, and on my slow simulator ... aaarrrgh!

So I decided to implement something I was planning to save for later: the Longitudinal Record Sum, or "LRS". Big name for something actually simple: to compute it, simply add all the ascii letters in a name together. Literally add them, treat them like numbers. And you get a number out - the LRS. Basically it's a "checksum", but since I don't use it for error detection like an official checksum, I decided to be accurate in my nomenclature for a change instead of giving in to my tendency for sloppiness.

It ends up something less than a true sum the way I treat it anyway. If the carries from the additions run over more than eight bits, I Zero them out. Then, in the byte that remains, if its most significant bit doesn't happen to be a One, I make it so. That way, the LRS looks different than any ascii near it in a memory display.
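
For concreteness, here's that recipe in C - a reconstruction from the description above, not a paste of the simulator's actual code, so treat the details as a best guess:

    #include <stdio.h>

    static unsigned char lrs(const char *name)
    {
        unsigned int sum = 0;
        while (*name)
            sum += (unsigned char)*name++;   /* literally add the letters         */
        sum &= 0xFF;                         /* toss any carries past eight bits  */
        return (unsigned char)(sum | 0x80);  /* force the most significant bit On */
    }

    int main(void)
    {
        /* Prints FD, at least with this particular reconstruction. */
        printf("LRS of \"somevar\" is %02X\n", (unsigned)lrs("somevar"));
        return 0;
    }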

So testing the LRS before anything else finally gives an efficient dictionary match/don't_match test that usually branches away at the very top of each entry.

Usually? Well, think about it: after all the manipulations done to it, there are only 128 unique LRS bit patterns. It won't take adding too many new dictionary words (much less than 128 - look up "Birthday Paradox" sometime) before two of them happen to match. But I can live with a 1/128 chance of a "collision" much better than the 4/6 chance the character count gave me in my first, tiny dictionary.

Interestingly, I found one collision pretty quickly. When I make changes to my Forth kernel, one of the last tests I often do is just type a random name into the input line to make sure the dictionary gets properly traversed all the way to the "variable not found" top entry. One time, after a vacation in Colorado spent playing video games, I put in "wii". Sure enough, out of the corner of my eye, I saw a little hitch in the display: wii's LRS had matched that of "dup", a Forth stack op. The character count of 3 also matched, so it didn't jump away until 'w' didn't match 'd'.

You could ask, I suppose, why I don't just use the whole originally computed LRS instead of clipping it back down to one byte. Wouldn't that drop the odds of a collision to way below 1/128? To which I'd answer, "Shut up, just shut up, I don't want to talk about it!"

OK, I'll talk about it. You see, sometimes I cheat. I'm still more comfortable writing in the C language than Forth or even FSA assembler, and the simulator is written in C, so I implemented the LRS algorithm in the simulator. Unfortunately, instead of using unsigned shorts to match the FSA 16 bit word size, I made use of the C string format which is basically an array of byte sized chars. Oops: 8 bits and 8 bits only. I'll fix it, I'll fix it, alright? Eventually.

Computer scientists would call the LRS a primitive "hash code". Unlike a lot of technical and mathematical jargon, "hashing" in CS-speak is actually very descriptive. Take a large pattern of bits (like a string of ascii characters) and shrink them, scrunch them, jumble them into a smaller pattern of bits, often much smaller. And why do this? There are various reasons. One is to save space, another is to use the hash code as a pointer, an index into an array, or table.

Let's tackle that last concept in easy steps. Just a few paragraphs from the top of this essay is one that begins, "It would be convenient if the bit pattern for ascii '+' happened to actually be the starting address of its own program." The next few paragraphs go on to describe how handy it would be to use an ascii character's unique bits as an index into a table of pointers. And then for each pointer in that array to be the address of the active code for the operation the character stands for.

The 94 ascii printable characters have values ranging from hex 21 to 7E (decimal 33 to 126), so if that pointer table could be squeezed up near address 0 in memory, the ascii values could be used directly as pointers to the pointers. This "indirect addressing" would be extremely fast. But even if the table has to be somewhere else in memory, a common offset can be added to the ascii values to make them address the pointer array. This would still be a very quick way to get to the start of the active code for each of the ascii operators. If one could invent a computer language made up only of the 94 single character tokens, it could run very efficiently. Hmmmm.

So now you get the idea of how a "hash table" works. Although, since the ascii characters have been neither shrunk, scrunched, nor jumbled, computer science would call them a 'trivial' hash code.
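
Just to make the trivial version concrete, here's the table-of-pointers idea from the top of the essay in a few lines of C. The function names are made up, and this is the bare concept, not anything out of the FSA:

    #include <stdio.h>

    static int op_add(int a, int b) { return a + b; }
    static int op_sub(int a, int b) { return a - b; }
    static int op_mul(int a, int b) { return a * b; }
    static int op_div(int a, int b) { return a / b; }

    /* A 128-entry table indexed directly by the ascii code of the operator.
       Most slots stay empty; only '+', '-', '*' and '/' are filled in.      */
    static int (*dispatch[128])(int, int);

    int main(void)
    {
        dispatch['+'] = op_add;
        dispatch['-'] = op_sub;
        dispatch['*'] = op_mul;
        dispatch['/'] = op_div;

        char op = '+';                           /* the ascii token from the input */
        printf("%d\n", dispatch[(int)op](3, 5)); /* the pointer-to-pointer hop: 8  */
        return 0;
    }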

What tricks could be played with multi-letter names? Even with two-character tokens, the table size gets big fast. If seven bits are needed to address a space large enough to fit the 94 possible one-char addresses, fourteen bits will be needed when two letters are joined into an address. Suddenly the size of the pointer array grows from 128 to 16,384. Yikes. It's one thing to have some wasted space (unaddressed, therefore unused, pointer slots) in a size-128 array; that is not too expensive in memory usage. But a mostly empty 16K array? Yikes. Time to scrunch those two characters into something smaller than 14 bits - time to hash them.

Admittedly, it's a bit artificial to limit ourselves to two-character tokens, but let's keep it simple for now and stick to them. Also for simplicity's sake, let's say that all these tokens will be for single integer variables. That means that the table, instead of being filled with pointers to the start of machine code routines, need only be filled with the integer values of the named variables. No extra jumps; the magic hash code will point exactly to the variable value to be stored or retrieved.

Let's try a hash algorithm similar to ones CS students learn: Take the variable "my", say, and start with its first letter (already in a 16 bit cell),

0000000001101101    <-- ascii m

and shift it to the left, by say, 3 bits,

0000001101101000

then take the second letter,

0000000001111001    <-- ascii y

and maybe add them?

Or try something a little different, "Exclusive OR" them. Exclusive OR ("XOR") is an operation a lot like adding, but without the carries. It is done bit by bit (but all at once unless it is a really crappy ALU) across the 16 bit values. Each pair of bits is looked at, and if they are the same, the output bit in the same position is set to Zero. If the bits are different, a One comes out. Thus the XOR of the shifted 'm' and 'y' is:

0000001101101000
0000000001111001
----------------
0000001100010001

Since the leftmost six bits are always going to be Zero, we get a ten bit hash code,

1100010001

which can be used to address a memory array of size 1024, quite a bit smaller than 16384.

There's some built in flexibility to this algorithm. Shift by 2 and you get a 9 bit hash, addressing a 512 element array. Shift by 1 and you get an 8 bit hash, addressing 256 elements. Going the other direction, shift by 4 and the hash is 11 bits, for 2048 elements. And etcetera.

Which is best? There are various tradeoffs and considerations. Two that work against each other are the total size of the hash table and the total number of indexes into it, that is, how full it is going to be.

The problem is collisions.

Suppose "my" is the very first 2-char variable the system has to deal with, and it hashes it into the 1100010001 bit pattern. So the storage cell for "my" is going to be 785 (512 + 256 + 16 + 1) cells down from the top of the table. That leaves 1023 empty cells for other variables. Not much chance the next hash of the next variable will land on top of it, right?

Not much chance at all - unless the next variable name is "f!". It also hashes to 1100010001 using this algorithm.

Wait a minute, that's not a legal variable name, is it? Well, it is in Forth. Maybe other languages would disallow it, but would they disallow "kI", or "iY"? They also hash to 1100010001. So who would name a variable iY anyway? Only some user somewhere, sometime, who will call you to report a bug right as you are packing for your well deserved vacation.

Remember: Don't assume! "Try to break it!"
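
In that try-to-break-it spirit, here's the shift-and-XOR hash from the walkthrough in C, along with a check of those four colliding names (the function name is mine):

    #include <stdio.h>

    /* First letter shifted left three bits, XORed with the second letter. */
    static unsigned int hash2(const char *name)
    {
        return ((unsigned int)name[0] << 3) ^ (unsigned int)name[1];
    }

    int main(void)
    {
        /* All four land on the same ten bit code, 1100010001 = 785. */
        printf("%u %u %u %u\n", hash2("my"), hash2("f!"), hash2("kI"), hash2("iY"));
        return 0;
    }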

So is hashing worth it? Sometimes. It's all about tradeoffs. There are stronger, more randomizing hash algorithms than the simple one above, but they take longer to compute, either because they make use of lots more shifts and XORs, or they make use of some other time consuming arcane processing.

Suppose it's known in advance that some application will have about 600 two-character variable names. A bit artificial? OK, but one of the underrated talents of a good programmer is imagination. Sometimes it can be useful to push beyond the bounds of absurdity to run a thought experiment that makes a point.

Imagine a Forth dictionary containing those 600 names. First, they'll take up a lot of room, and second, to find any one of them, the average traversal will be 300. Even with our old buddy LRS to help provide quick branching, that's at least 300 tests & jumps. The question is, can a good hash algorithm beat that average time? Keep in mind that the hash has to be computed every single time a name is encountered. Another possibly important question may be about the size of the hash table, compared against 600 freakin' dictionary words.

So for this example, maybe hashing comes out ahead, efficiency wise. Except there's still the issue of collisions. There are going to be collisions, especially as the table gets more and more full. They can be handled in various ways, but when they happen, the hash lookup slows down.

Collisions affect size, too. How to know if there even is a collision? How about looking at the variable value itself? Fill the table with all Zeros to start with and then check if the slot is now nonzero? But variables are random! If "my" stores a Zero, then "kI" is going to come along and say, "Hey, it's empty!" when it's not. This is the same problem we had in the last essay when we were first trying to figure out how to store "somevar" and attempted to delimit it with a "marker". Just like before, there is no Magic Marker™!

Either the hash table needs twice as many cells, or a mirror array beside it, to keep track of the collision situation.

And when collisions do happen, extra processing (read: time) has to be used to arbitrate among the offending hash codes to finally find a unique memory slot (read: even more space) for each variable.

There are many different collision resolution schemes - hash theory and practice has been around for a long time. Ironically, one of them involves making a linked list of any two or more names that crash into each other. In other words, this fix involves creating something for the variables involved that is very much like a tiny Forth dictionary.
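
Here's what that fix looks like as a quick C sketch - each table slot heads its own mini linked list, so "my" and "kI" can both keep their own value even though they hash alike. Names and sizes are arbitrary:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct entry {
        char          name[3];     /* two characters plus the terminator      */
        int           value;
        struct entry *next;        /* the next name that hashed to this slot  */
    };

    static struct entry *table[1024];

    static unsigned int hash2(const char *name)
    {
        return (((unsigned int)name[0] << 3) ^ (unsigned int)name[1]) & 0x3FF;
    }

    /* Find a two-character name, creating it (value 0) if it isn't there yet. */
    static struct entry *lookup(const char *name)
    {
        unsigned int  h = hash2(name);
        struct entry *e = table[h];
        while (e && strcmp(e->name, name) != 0)   /* walk the mini-dictionary  */
            e = e->next;
        if (!e) {
            e = calloc(1, sizeof *e);
            strcpy(e->name, name);                /* two-char names only here  */
            e->next  = table[h];
            table[h] = e;
        }
        return e;
    }

    int main(void)
    {
        lookup("my")->value = 0;      /* a Zero that really is stored          */
        lookup("kI")->value = 42;     /* collides, but keeps its own slot      */
        printf("%d %d\n", lookup("my")->value, lookup("kI")->value);  /* 0 42  */
        return 0;
    }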

For now, I'm going to stick with the dictionary. Maybe someday when whatever language I'm fumbling towards becomes huge and bloated, I'll need a hashing system. Just as likely, that will be the time to refactor back to something simple!


On further thought ...

... A month or so later, I realize that this all sort of reflects the old random-access versus serial memory question. The hash table, once the hash code is computed (however long that takes), provides a mostly constant lookup time for a data record.

The dictionary linked list, OTOH, is a classic serial lookup situation. For the long run, I have planned for a facility to read through a program and find the most commonly used Forth words within that application, and prioritize the order of the dictionary based on those statistics. Better yet would be a facility to monitor the program as it runs, as the most common words may be hidden inside loops.

Another approach would be to reorder a dictionary based on the LRS value. Then a binary search could be used to determine what quarter, or eighth, or even less, of the dictionary should be searched for any given word.
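
As a rough sketch of that idea - assuming the dictionary were copied out into an array sorted by LRS, and using LRS bytes from my guessed-at recipe earlier, so treat the numbers as illustrative:

    #include <stdio.h>
    #include <string.h>

    struct entry { unsigned char lrs; const char *name; };

    static struct entry dict[] = {        /* kept sorted by lrs */
        { 0xA1, "!" }, { 0xAB, "+" }, { 0xBB, "swap" }, { 0xC9, "dup" },
    };
    static const int ndict = (int)(sizeof dict / sizeof dict[0]);

    static const struct entry *find(const char *name, unsigned char lrs)
    {
        int lo = 0, hi = ndict - 1;
        while (lo <= hi) {                /* narrow down by the LRS byte alone */
            int mid = (lo + hi) / 2;
            if      (dict[mid].lrs < lrs) lo = mid + 1;
            else if (dict[mid].lrs > lrs) hi = mid - 1;
            else {
                while (mid > 0 && dict[mid - 1].lrs == lrs)
                    mid--;                /* back up to the start of the run   */
                for (; mid < ndict && dict[mid].lrs == lrs; mid++)
                    if (strcmp(dict[mid].name, name) == 0)
                        return &dict[mid];
                return NULL;              /* LRS matched, name didn't          */
            }
        }
        return NULL;
    }

    int main(void)
    {
        const struct entry *e = find("+", 0xAB);
        printf("%s\n", e ? e->name : "not found");   /* prints + */
        return 0;
    }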

Sometimes designing a computer language seems like it is its own long, drawn out binary search ...





         Introduction to "The Perfect Language"    —   Table of Contents
         What's in Store?    —   Previous
         Deciding to Get Loopy    —   Next
         Glossary




Started: August 27, 2010