What goes Where?

Let's go into more detail about how a computer might "understand" English.

(inline footnote: since we're speaking English here, we'll stick to ASCII here. ASCII stands for the American Standard Code for Information Interchange, which was developed in the 1960's, way before the Internet, and way before website developers wanted to create pages in other languages - many other languages. So now there's Unicode, UTF-8, UTF-16 and other encodings available for more and more of the world's human languages. The good news: the most popular of these are backward compatible with good old ASCII)

Sit at a keyboard and type

somevar = 3 + 5

into a likely interpreter or compiler. What does the computer see? It sees sixteen bytes encoded as:

01110011
01101111
01101101
01100101
01110110
01100001
01110010
00100000
00111101
00100000
00110011
00100000
00101011
00100000
00110101
00001010

and it is 'expected' to ultimately associate the sum, 8, with the English word "somevar".

Why 16 bytes? There are spaces between the symbols and a line feed (or perhaps a 00001101 carriage return (or both, if Windows)) at the end.

To think about it here, we don't have to do things in exactly the same order as a translating program, so let's start with the easiest things and isolate the numbers:

00110011   <-- ascii 3
00110101   <-- ascii 5

We can easily turn the ascii digits into pure binary by masking away the two leftmost Ones in each:

00000011   <-- binary 3
00000101   <-- binary 5

Replacing a couple of Ones with Zeros in the digits takes just a few simple machine instructions.

If the program is, say, translating for a 16 bit machine, it can extend those values to the larger word size:

0000000000000011
0000000000000101

Finally, and while there is no rule out there saying this is how it has to be done, by far the simplest way to store these values, which are pretty temporary anyway, is to stack them in a memory area reserved for just this sort of storage. Since the numbers are the last in, they'll be the first off - "LIFO". So these digits are, for present, out of sight if not out of mind.

The rest of the tokens aren't dealt with so easily. Take somevar:

01110011
01101111
01101101
01100101
01110110
01100001
01110010

One could add it to the stack as well, but there are two problems with that. The first is that dealing with a group of seven items on a stack is not as simple as dealing with a single item. Metaphorically, a big rubber band needs to be tied around the somevar cafeteria trays so they are all dealt with as a single group. This is not an impossible task by any means, but it makes the part of the translator program that operates the stack more complicated - and less efficient.

The bigger problem is that in the user's program somevar is not just a temporary value - else why give it a name at all? Presumably, at least one other line of English-code in the source file will also refer to somevar. The translating program, whether interpreter or compiler, really has to put somevar someplace more permanent, and be able to find it again when it has to.

Suppose the translator program has reserved a memory area for variables, so it knows where it is going to put somevar. It still has two problems facing it: How big a space will it need for the variable value itself, and how it will handle the English name associated with that value so it is in a unique enough form to be able to come back to it?

The first problem has to do with variable "typing," which is one of those fascinating issues I may devote a whole essay to later.

The good news is that in the previous blog we invented a complete program which already had declared somevar:

variable integer = somevar
somevar = 3 + 5
print somevar

The keyword "integer" can tell the translator that the "type" of somevar is a simple 16 bit number, so only one 16 bit memory slot will be needed for the result.

The second problem, uniqueness, can be dealt with in many increasingly sophisticated ways.

A brute force version would simply store the whole character string of somevar in seven contiguous memory words (again, our examples will be in a 16 bit form):

0000000001110011
0000000001101111
0000000001101101
0000000001100101
0000000001110110
0000000001100001
0000000001110010
0000000000000000

with an empty cell at the bottom to be eventually filled with the 3 + 5 sum.


Progressing from the What to the How

The brute force way of finding it each time it showed up in a line of source code would be to have an algorithm something like:

  Start at the very top of the variable memory space
    Go to each next memory cell looking for a
        match to 0000000001110011, the lower-case 's'
    If find a match, see
      If the very next memory cell matches
        0000000001101111, the lower-case 'o'
      Else
        start looking for another 's'
  And so on and so on until we've actually matched all seven letters
      of somevar

Even if you don't have much programming experience, you can probably tell that this is a very inefficient way to find somevar or any other variable. This could end up being a very slow interpreter / compiler.

Since we've digressed into talking about an algorithm, let's spend some time on an aspect of programming an algorithm which is too often ignored in the real world: "How can we break it?"

Suppose we declared two variables in this order:

variable integer = thisisnotsomevar
variable integer = somevar

It's bad enough that our algorithm would trip over every 's' in thisisnotsomevar, when it proceeded from the third 's' it would think it had found a match for the real somevar! This wouldn't be a problem if we declared thisisnotsomevar but never got around to actually using it.

But if we do ever use it, well, check out these lines of code:

somevar = 3 + 5
thisisnotsomevar = 20,000
someothervar = somevar + 10

someothervar is ultimately supposed to end up equal to 3 + 5 + 10, or 18. But instead it's going to equal 20,010 because things assigned to both somevar and thisisnotsomevar are going to end up in the same memory slot even though they aren't supposed to. And the last one assigned is going to 'win', so to speak. But we don't want them fighting each other.

How to fix this? Back up a bit and realize that any algorithm that reads its way through a memory space likely has a twin that wrote its way through that memory space in the first place. Some routine had to transfer the typed-in letters into the 16 bit name-plus-empty-slot format. So have the creation routine add another slot at the very top of each new variable and fill it with something that cannot be confused with a ascii character, say, 1111111111111111.

Now the read algorithm can only care about matching to an 's' if that 's' comes right after the all-Ones marker. No 's'?, then scoot on down to the next marker and try again. Yes 's'?, then look for the second letter, 'o', and if no 'o' scoot again and keep trying until there's finally a complete match to "somevar".

This isn't going to be much faster because it still is going to increment through every single memory space looking for a match, now mainly looking for hex ffff (marker) rather than hex 0073 ('s'). But at least it's now unbreakable, right?

Don't assume it! try to break it:

variable integer = somevarthisisnot
variable integer = somevar

If the read algorithm hits a marker immediately followed by somevarthisisnot, it is going to mess up worse than the previous "how can we break it" example. This time it's going to match the first seven letters of somevarthisisnot, think it's found somevar, and overwrite the first 't' with 8. Which means at least two bad things: it will now never be able to actually find somevarthisisnot because somevarthisisnot is now spelled somevar[^H]hisisnot, where [^H] is equal to 8, which is the ascii-encoded control char for 'backspace.' Probably even worse, what if somewhere in the source code is the line:

somevar = 65,535

65,535 is the same as hex ffff or binary 1111111111111111, which is of course, the marker for top-of-variable. So now the storage magically contains a new variable named "hisisnot", which will lead to even more confusion and even misplaced assignments if there just happens to be a real variable in the source with that name.

So who would name a variable "hisisnot" anyway? That's not the point. A programmer has to assume that if there is a way to break a program, some user, some where, some time, will. If there's never a program written with the variable examples here, there will be some other set of variables used that will eventually uncover the same weakness in this algorithm. How about: "our", "my", and "myfour"?

Sometimes, "when good programs go bad," you have to throw the whole thing out and start over. Other times, it merely takes some tweaking to overcome the problem(s).

The current problem (besides speed) with this marker system is that the marker can be confused with random data. We need a way to mark the markers so they somehow stand out from the general noise. What if the writing routine that transfers the typed-in letters into the variable memory area also saves each address of each marker? This would give the variable storage area something outside of itself to be in overall control. If there were a list of ten variables, there'd be an array of ten addresses of the tops of the variables.

This 'sideways addressing' can also give us a way to speed up searching for somevar. Now we can take each address in turn and check if the first letter is 's'. If it isn't, then we can immediately bypass all the rest of the letters of that particular variable name by getting the next address and jumping immediately to the next variable. And so on as often as necessary until we finally exactly match "somevar".

Wow, more speed! And unbreakable too.

Don't assume it - input is random! That empty slot at the bottom of each variable can eventually be filled with any of 65536 different bit patterns - including any ascii pattern.

Which means the empty cell can sometimes look like one more letter at the end of a variable name. We know by now this is not a good thing.

Obviously our algorithm cannot just keep matching its way through a variable name indefinitely, something has to stop it. Looking at the new, slightly improved data record for somevar:

1111111111111111 - marker: address saved elsewhere
0000000001110011
0000000001101111
0000000001101101
0000000001100101
0000000001110110
0000000001100001
0000000001110010
xxxxxxxxxxxxxxxx - slot for somevar value: any random data
1111111111111111 - marker: address saved elsewhere

at least it is surrounded by addresses saved off to the side that won't be randomly overwritten. Indeed, the value slot is at an address exactly one less than the next variable's marker. That is a value we can test against! Now the algorithm can match against all the letters of somevar and if it has reached the value cell for somevar, it has definitely found somevar, and only somevar.

Next question: do we really need a marker at all? Each address can point directly to the first letter of each saved variable, both saving a little space and making the search algorithm a little simpler.

However, there's a trick we can play if we leave the marker slots in. We can create something called a "linked list": the markers themselves can be addresses - not of themselves, but of their nearest neighbor, either the next higher or the next lower.

An example makes this clearer; here are 64 memory cells (numbered in hexadecimal) with some variable names stored:

00 - address 00
01 - address 00
02 - ascii 's'
03 - ascii 'o'
04 - ascii 'm'
05 - ascii 'e'
06 - ascii 'v'
07 - ascii 'a'
08 - ascii 'r'
09 - value
0a - address 01 
0b - ascii 'a'
0c - ascii 'n'
0d - ascii 'o'
0e - ascii 't'
0f - ascii 'h'
10 - ascii 'e'
11 - ascii 'r'
12 - ascii 'v'
13 - ascii 'a'
14 - ascii 'r'
15 - value
16 - address 0a
17 - ascii 'o'
18 - ascii 'u'
19 - ascii 'r'
1a - value
1b - address 16
1c - ascii 'm'
1d - ascii 'y'
1e - value
1f - address 1b
20 - ascii 'm'
21 - ascii 'y'
22 - ascii 'o'
23 - ascii 'n'
24 - ascii 'e'
25 - value
26 - address 1f
27 - ascii 'm'
28 - ascii 'y'
29 - ascii 't'
2a - ascii 'w'
2b - ascii 'o'
2c - value
2d - address 26
2e - ascii 'm'
2f - ascii 'y'
30 - ascii 't'
31 - ascii 'h'
32 - ascii 'r'
33 - ascii 'e'
34 - ascii 'e'
35 - value
36 - address 2d
37 - ascii 'm'
38 - ascii 'y'
39 - ascii 'f'
3a - ascii 'o'
3b - ascii 'u'
3c - ascii 'r'
3d - value
3e - address 36
3f - open ...

Notice we only need to store the starting point, address 3e, on the side. The search for somevar would use the contents of addr 3e to get to addr 36, save its contents (addr 2d) for future use, increment to addr 37 to compare its contents to 's', and finding no match, go on to addr 2d, and so on up to addr 01 where it would finally find the place where somevar is stored. Each value is still at one smaller address than where just jumped from, so the "stop comparing letters" test remains the same: stopping when a limiting address (previous address minus 1) is reached.

Notice also that the link of somevar points to addr 00. If no match to any variable name is found in the linked list, the algorithm would attempt to match the contents of addr 01, which is all Zeros, 0000000000000000, which doesn't match any printable ascii character. So the algorithm would then jump to the contents of addr 00, which points to itself! Things would then want to go into an "infinite loop" and just keep trying match some character to the all Zeros over and over again. Once again, there's a way to break things: simply type in a variable name that doesn't exist.

Obviously, the algorithm better contain a test for having reached the top and do something like print "variable not found." Simple fix, this time.

One simple fix and it's working fine, right? ... Right?

Look at those last three addresses, 26, 2d, and 36. Their bit patterns exactly match ascii '&' '-' and '6'. Those maybe are and maybe aren't characters which might be allowed in variable names (depending on the overall rules for the English-like language), but it wouldn't take many more variable declarations for the addresses to grow into the ascii sections for upper- and lower-case letters. Confusion time again?

Nope, not this time. The links still work fine as addresses, no matter being small enough to fall in the ascii space. Subtract One from a link, and that derived address still works fine to test against, no matter if it also happens to be an ascii value.

It's good that we asked the question, though.

OK, these essays aren't meant do be a programming tutorial, so I'll quit beating this dead horse. Just to sum up: "Test the hell out of the small parts of your program so when you combine them into bigger parts, the bigger parts will also work right."


Regressing back to the What and Where

Speaking of summing up, what ever happened to 3 + 5? We've sort of lost track through all these digressions. And yet, we're almost done. We have the numbers on the stack, and we now have a definite, unique place in memory to store the ultimate sum. The only things left over are the '=' sign and the '+' sign.

Ah yes, the plus sign. Such a tiny token, yet ultimately the plus sign is going to do most of the work. It will have to be the subject of the next blog entry.

Before leaving here however, we should ask if the variable storage data structure described so far is the only approach possible, and if not, is there a better one? Trust me, there are lots of approaches, and we'll look at at least one more in the, yep, next essay.

Can we improve this general approach first, either by speeding it up or having it take less room, or both?

Our algorithm is now more finesse and less brute force, now something like:

  For each link of the linked list
    Go to each next memory cell looking for a
        match to 0000000001110011, the lower-case 's'
    If find a match, move on to the very next memory cell, and
      Test its address against the address of the value slot
        If match that address
          Go to the next link
        If the very next memory cell matches
            0000000001101111, the lower-case 'o'
          Move on to the very next memory cell
        Else
          Go to the next link
  And so on and so on until we've actually matched all seven letters
      of somevar

This is kind of sloppily written "pseudo code", but it gives the general idea.

In the 64 cell example above, somevar was the only variable that started with the letter 's', so each 'wrong' variable got branched away from immediately. However, somevar shared that space with seven other variables, five of which started with the letters "my". If the above algorithm looks for "my" instead of "somevar", it will take longer, because it will have to dig down through the 'm' & 'y' of four other variables before it finally reaches "my" itself.

Believe it or not, it will take quite a bit more time to settle on "my" versus "somevar" - there will be a lot more looping and a lot more testing. The more often we can branch away due to the very first test, the faster things will go.

Early compiler writers came up with the trick of adding the character count to the top of each variable record. Takes a bit more memory, but tends to make early branches happen more often. Look at the number of chars versus the variable names for the current example:

 7 - somevar
10 - anothervar
 3 - our
 2 - my
 5 - myone
 5 - mytwo
 7 - mythree
 6 - myfour

Only two are the same, "myone" and "mytwo". Putting the character count just below the link and above the first character itself allows it to be easily tested first, causing the earliest possible branch.

Can we 'break' this by making all the variables the same length? No, that's just being annoyingly clever. That's not actually breaking it, it will still find each variable uniquely. Yes, if most programmers wrote most programs using equal length variable names, we'd want to leave out this character count improvement. But most programs end up with a fairly random assortment of name lengths.

One note on saving space: when ascii bytes are transferred from some typed-in line to a 16 bit memory word, the left half of the word always gets filled with Zeros, eg: 0000000001110011, the lower-case 's'. Variable names can be packed more densely, like:
0110111101110011 - 'o' & 's'
0110010101101101 - 'e' & 'm'
0110000101110110 - 'a' & 'v'
0000000001110010 - --- & 'r'

The algorithm for testing each letter then gets changed into an algorithm for testing two letters at once. Better, huh? Well, maybe. Remember that the letters were originally typed in as a string of bytes, eg, somevar:

01110011
01101111
01101101
01100101
01110110
01100001
01110010

If the operating system stored them that way, then to test them against the condensed format used in the variable storage area, they have to be shifted and combined into the same format. Or else the stored variables have to be shifted and masked to match the typed input. All of which may be worth the extra processing time if space is really at a premium.

If by this time you are asking, "Can this crap get any MORE complicated?" I wouldn't be blaming you. This low-level stuff can be tedious. If you've read this far maybe you have a knack for it.

Tedious or not, it is necessary. If you are programming successfully in a higher-level language, it is because somebody put the time into creating low-level crap like this as a foundation for it.

If nothing else, maybe you begin to see that "Computer Science" is as much an Art as a Science, with choices to be made about tradeoffs at nearly every step along the way.





         Introduction to "The Perfect Language"    —   Table of Contents
         English versus the Machine    —   Previous
         Hashing it Out    —   Next
         Glossary




Started: August 20, 2010