|
|
What goes Where?
Let's go into more detail about how a computer might "understand"
English.
(inline footnote: since we're speaking English here, we'll stick
to ASCII here. ASCII stands for the American Standard Code for
Information Interchange, which was developed in the 1960's, way
before the Internet, and way before website developers wanted to
create pages in other languages - many other languages. So now
there's Unicode, UTF-8, UTF-16 and other encodings available for
more and more of the world's human languages. The good news: the
most popular of these are backward compatible with good old ASCII)
Sit at a keyboard and type
somevar = 3 + 5
into a likely interpreter or compiler. What does the computer see?
It sees sixteen bytes encoded as:
01110011
01101111
01101101
01100101
01110110
01100001
01110010
00100000
00111101
00100000
00110011
00100000
00101011
00100000
00110101
00001010
and it is 'expected' to ultimately associate the sum, 8, with
the English word "somevar".
Why 16 bytes? There are spaces between the symbols and a line
feed (or perhaps a 00001101 carriage return (or both, if Windows))
at the end.
To think about it here, we don't have to do things in exactly
the same order as a translating program, so let's start with the
easiest things and isolate the numbers:
00110011 <-- ascii 3
00110101 <-- ascii 5
We can easily turn the ascii digits into pure binary by masking
away the two leftmost Ones in each:
00000011 <-- binary 3
00000101 <-- binary 5
Replacing a couple of Ones with Zeros in the digits takes just
a few simple machine instructions.
If the program is, say, translating for a 16 bit machine, it can
extend those values to the larger word size:
0000000000000011
0000000000000101
Finally, and while there is no rule out there saying this is how
it has to be done, by far the simplest way to store these values,
which are pretty temporary anyway, is to stack them in a memory
area reserved for just this sort of storage. Since the numbers are
the last in, they'll be the first off - "LIFO". So these digits are,
for present, out of sight if not out of mind.
The rest of the tokens aren't dealt with so easily. Take somevar:
01110011
01101111
01101101
01100101
01110110
01100001
01110010
One could add it to the stack as well, but there are two problems
with that. The first is that dealing with a group of seven items
on a stack is not as simple as dealing with a single item.
Metaphorically, a big rubber band
needs to be tied around the somevar cafeteria
trays so they are all dealt with as a single group. This is not an
impossible task by any means, but it makes the part of the translator
program that operates the stack more complicated - and less efficient.
The bigger problem is that in the user's program somevar is not just
a temporary value - else why give it a name at all? Presumably, at
least one other line of English-code in the source file will also
refer to somevar. The translating program, whether interpreter or
compiler, really has to put somevar someplace more permanent, and
be able to find it again when it has to.
Suppose the translator program has reserved a memory area for
variables, so it knows where it is going to put somevar. It still
has two problems facing it: How big a space will it need for the
variable value itself, and how it will handle the English name
associated with that value so it is in a unique enough form to be
able to come back to it?
The first problem has to do with variable "typing," which is one
of those fascinating issues I may devote a whole essay to later.
The good news is that in the previous blog we invented a complete
program which already had declared somevar:
variable integer = somevar
somevar = 3 + 5
print somevar
The keyword "integer" can tell the translator that the "type" of
somevar is a simple 16 bit number, so only one 16 bit memory slot
will be needed for the result.
The second problem, uniqueness, can be dealt with in many
increasingly sophisticated ways.
A brute force version would simply store the whole character string
of somevar in seven contiguous memory words (again, our examples will
be in a 16 bit form):
0000000001110011
0000000001101111
0000000001101101
0000000001100101
0000000001110110
0000000001100001
0000000001110010
0000000000000000
with an empty cell at the bottom to be eventually filled with the
3 + 5 sum.
Progressing from the What to the How
The brute force way of finding it each time it showed up in a line
of source code would be to have an algorithm something like:
Start at the very top of the variable memory space
Go to each next memory cell looking for a
match to 0000000001110011, the lower-case 's'
If find a match, see
If the very next memory cell matches
0000000001101111, the lower-case 'o'
Else
start looking for another 's'
And so on and so on until we've actually matched all seven letters
of somevar
Even if you don't have much programming experience, you can probably
tell that this is a very inefficient way to find somevar or any other
variable. This could end up being a very slow
interpreter / compiler.
Since we've digressed into talking about an algorithm, let's spend
some time on an aspect of programming an algorithm which is too often
ignored in the real world: "How can we break it?"
Suppose we declared two variables in this order:
variable integer = thisisnotsomevar
variable integer = somevar
It's bad enough that our algorithm would trip over every 's' in
thisisnotsomevar, when it proceeded from
the third 's' it would think it had
found a match for the real somevar! This wouldn't be a problem if
we declared thisisnotsomevar but never got around to actually
using it.
But if we do ever use it, well, check out these lines of code:
somevar = 3 + 5
thisisnotsomevar = 20,000
someothervar = somevar + 10
someothervar is ultimately supposed to end up equal to 3 + 5 + 10,
or 18. But instead it's going to equal 20,010 because things assigned
to both somevar and thisisnotsomevar are going to end up in the same
memory slot even though they aren't supposed to. And the last one
assigned is going to 'win', so to speak. But we don't want them
fighting each other.
How to fix this? Back up a bit and realize that any algorithm that
reads its way through a memory space likely has a twin that wrote
its way through that memory space in the first place. Some routine
had to transfer the typed-in letters into the 16 bit
name-plus-empty-slot format. So have the creation routine add another
slot at the very top of each new variable and fill it with something
that cannot be confused with a ascii character, say,
1111111111111111.
Now the read algorithm can only care about matching to an 's' if that
's' comes right after the all-Ones marker. No 's'?, then scoot on
down to the next marker and try again. Yes 's'?, then look for the
second letter, 'o', and if no 'o' scoot again and keep trying until
there's finally a complete match to "somevar".
This isn't going to be much faster because it still is going to
increment through every single memory space looking for a match, now
mainly looking for hex ffff (marker) rather than hex 0073 ('s').
But at least it's now unbreakable, right?
Don't assume it! try to break it:
variable integer = somevarthisisnot
variable integer = somevar
If the read algorithm hits a marker immediately followed by
somevarthisisnot, it is going to mess up worse than the previous
"how can we break it" example. This time it's going to match the
first seven letters of somevarthisisnot, think it's found somevar,
and overwrite the first 't' with 8. Which means at least two bad
things: it will now never be able to actually find somevarthisisnot
because somevarthisisnot is now spelled somevar[^H]hisisnot, where
[^H] is equal to 8, which is the ascii-encoded control char for
'backspace.' Probably even worse, what if somewhere in the source
code is the line:
somevar = 65,535
65,535 is the same as hex ffff or binary 1111111111111111, which
is of course, the marker for top-of-variable. So now the storage
magically contains a new variable named "hisisnot", which will lead
to even more confusion and even misplaced assignments if there just
happens to be a real variable in the source with that name.
So who would name a variable "hisisnot" anyway? That's not the point.
A programmer has to assume that if there is a way to break a program,
some user, some where, some time, will. If there's never a program
written with the variable examples here, there will be some other
set of variables used that will eventually uncover the same weakness
in this algorithm. How about: "our", "my", and "myfour"?
Sometimes, "when good programs go bad," you have to throw the whole
thing out and start over. Other times, it merely takes some tweaking
to overcome the problem(s).
The current problem (besides speed) with this marker system is that
the marker can be confused with random data. We need a way to mark
the markers so they somehow stand out from the general noise. What
if the writing routine that transfers the typed-in letters into the
variable memory area also
saves each address of each marker? This would
give the variable storage area something outside of itself to be
in overall control. If there were a list of ten variables, there'd
be an array of ten addresses of the tops of the variables.
This 'sideways addressing' can also give us a way to speed up
searching for somevar. Now we can take each address in turn and
check if the first letter is 's'. If it isn't, then we can
immediately bypass all the rest of the letters of that particular
variable name by getting the next address and jumping immediately to
the next variable. And so on as often as necessary until we finally
exactly match "somevar".
Wow, more speed! And unbreakable too.
Don't assume it - input is random! That empty slot at the bottom of
each variable can eventually be filled with
any
of 65536 different bit patterns - including any ascii pattern.
Which means the empty cell can sometimes look like one more letter
at the end of a variable name. We know by now this is not a good
thing.
Obviously our algorithm cannot just keep matching its way through
a variable name indefinitely, something has to stop it. Looking at
the new, slightly improved data record for somevar:
1111111111111111 - marker: address saved elsewhere
0000000001110011
0000000001101111
0000000001101101
0000000001100101
0000000001110110
0000000001100001
0000000001110010
xxxxxxxxxxxxxxxx - slot for somevar value: any random data
1111111111111111 - marker: address saved elsewhere
at least it is surrounded by addresses saved off to the side that
won't be randomly overwritten. Indeed, the value slot is at an
address exactly one less than the next variable's marker. That is
a value we can test against! Now the algorithm can match against
all the letters of somevar and if it has reached the value cell
for somevar, it has definitely found somevar, and only somevar.
Next question: do we really need a marker at all? Each address
can point directly to the first letter of each saved variable,
both saving a little space and making the search algorithm a little
simpler.
However, there's a trick we can play if we leave the marker slots
in. We can create something called a "linked list": the markers
themselves
can be addresses - not of themselves, but of their nearest neighbor,
either the next higher or the next lower.
An example makes this clearer; here are 64 memory cells (numbered in
hexadecimal) with some variable names stored:
00 - address 00
01 - address 00
02 - ascii 's'
03 - ascii 'o'
04 - ascii 'm'
05 - ascii 'e'
06 - ascii 'v'
07 - ascii 'a'
08 - ascii 'r'
09 - value
0a - address 01
0b - ascii 'a'
0c - ascii 'n'
0d - ascii 'o'
0e - ascii 't'
0f - ascii 'h'
10 - ascii 'e'
11 - ascii 'r'
12 - ascii 'v'
13 - ascii 'a'
14 - ascii 'r'
15 - value
16 - address 0a
17 - ascii 'o'
18 - ascii 'u'
19 - ascii 'r'
1a - value
1b - address 16
1c - ascii 'm'
1d - ascii 'y'
1e - value
1f - address 1b
20 - ascii 'm'
21 - ascii 'y'
22 - ascii 'o'
23 - ascii 'n'
24 - ascii 'e'
25 - value
26 - address 1f
27 - ascii 'm'
28 - ascii 'y'
29 - ascii 't'
2a - ascii 'w'
2b - ascii 'o'
2c - value
2d - address 26
2e - ascii 'm'
2f - ascii 'y'
30 - ascii 't'
31 - ascii 'h'
32 - ascii 'r'
33 - ascii 'e'
34 - ascii 'e'
35 - value
36 - address 2d
37 - ascii 'm'
38 - ascii 'y'
39 - ascii 'f'
3a - ascii 'o'
3b - ascii 'u'
3c - ascii 'r'
3d - value
3e - address 36
3f - open ...
Notice we only need to store the starting point, address 3e, on the
side. The search for somevar would use the contents of addr 3e to
get to addr 36, save its contents (addr 2d) for future use,
increment to addr 37 to compare its contents to 's', and finding no
match, go on to addr 2d, and so on up to addr 01 where it would
finally find the place where somevar is stored. Each value is
still at one smaller address than where just jumped from, so the
"stop comparing letters" test remains the same: stopping when a
limiting address (previous address minus 1) is reached.
Notice also that the link of somevar points to addr 00. If no match
to any variable name is found in the linked list, the algorithm would
attempt to match the contents of addr 01, which is all Zeros,
0000000000000000, which doesn't match any printable ascii character.
So the algorithm would then jump to the contents of addr 00, which
points to itself! Things would then want to go into an "infinite
loop" and just keep trying match some character to the all Zeros
over and over again. Once again, there's a way to break things:
simply type in a variable name that doesn't exist.
Obviously, the algorithm better contain a test for having reached
the top and do something like print "variable not found." Simple
fix, this time.
One simple fix and it's working fine, right? ... Right?
Look at those last three addresses, 26, 2d, and 36. Their bit
patterns exactly match ascii '&' '-' and '6'. Those maybe are and
maybe aren't characters which might be allowed in variable names
(depending on the overall rules for the English-like language),
but it wouldn't take many more variable declarations for the
addresses to grow into the ascii sections for upper- and lower-case
letters. Confusion time again?
Nope, not this time. The links still work fine as addresses, no
matter being small enough to fall in the ascii space. Subtract
One from a link, and that derived address
still works fine to test against, no matter if it also happens to
be an ascii value.
It's good that we asked the question, though.
OK, these essays
aren't meant do be a programming tutorial, so I'll quit beating
this dead horse. Just to sum up: "Test the hell out of the small
parts of your program so when you combine them into bigger parts,
the bigger parts will also work right."
Regressing back to the What and Where
Speaking of summing up, what ever happened to 3 + 5? We've sort of
lost track through all these digressions. And yet, we're almost done.
We have the numbers on the stack, and we now have a definite, unique
place in memory to store the ultimate sum. The only things left over
are the '=' sign and the '+' sign.
Ah yes, the plus sign. Such a tiny token, yet ultimately the plus
sign is going to do most of the work. It will have to be the subject
of the next blog entry.
Before leaving here however, we should ask if the variable storage
data structure described so far is the only approach possible, and
if not, is there a better one? Trust me, there are lots of approaches,
and we'll look at at least one more in the, yep, next essay.
Can we improve this general approach first, either by speeding it
up or having it take less room, or both?
Our algorithm is now more finesse and less brute force, now something
like:
For each link of the linked list
Go to each next memory cell looking for a
match to 0000000001110011, the lower-case 's'
If find a match, move on to the very next memory cell, and
Test its address against the address of the value slot
If match that address
Go to the next link
If the very next memory cell matches
0000000001101111, the lower-case 'o'
Move on to the very next memory cell
Else
Go to the next link
And so on and so on until we've actually matched all seven letters
of somevar
This is kind of sloppily written "pseudo code", but it gives the
general idea.
In the 64 cell example above, somevar was the only variable that
started with the letter 's', so each 'wrong' variable got branched
away from immediately. However, somevar shared that space with seven
other variables, five of which started with the letters "my". If the
above algorithm looks for "my" instead of "somevar", it will take
longer, because it will have to dig down through the 'm' & 'y' of
four other variables before it finally reaches "my" itself.
Believe it or not, it will take quite a bit more time to settle on
"my" versus "somevar" - there will be a lot more looping and a lot
more testing. The more often we can branch away due to the very first
test, the faster things will go.
Early compiler writers came up with the trick of adding the character
count to the top of each variable record. Takes a bit more memory,
but tends to make early branches happen more often. Look at the
number of chars versus the variable names for the current example:
7 - somevar
10 - anothervar
3 - our
2 - my
5 - myone
5 - mytwo
7 - mythree
6 - myfour
Only two are the same, "myone" and "mytwo". Putting the character
count just below the link and above the first character itself
allows it to be easily tested first, causing the earliest possible
branch.
Can we 'break' this by making all the variables the same length?
No, that's just being annoyingly clever. That's not actually
breaking it, it will still find each variable uniquely.
Yes, if most programmers wrote most programs using equal length
variable names, we'd want to leave out this character count
improvement. But most programs end up with a fairly random assortment
of name lengths.
One note on saving space: when ascii bytes are transferred from
some typed-in line to a 16 bit memory word, the left half of the
word always gets filled with Zeros, eg: 0000000001110011, the
lower-case 's'. Variable names can be packed more densely, like:
0110111101110011 - 'o' & 's'
0110010101101101 - 'e' & 'm'
0110000101110110 - 'a' & 'v'
0000000001110010 - --- & 'r'
The algorithm for testing each letter then gets changed into an
algorithm for testing two letters at once. Better, huh? Well, maybe.
Remember that the letters were originally typed in as a string of
bytes, eg, somevar:
01110011
01101111
01101101
01100101
01110110
01100001
01110010
If the operating system stored them that way, then to test them
against the condensed format used in the variable storage area,
they have to be shifted and combined into the same format. Or else
the stored variables have to be shifted and masked to match the
typed input. All of which may be worth the extra processing time
if space is really at a premium.
If by this time you are asking, "Can this crap get any
MORE
complicated?" I wouldn't be blaming you. This low-level stuff can
be tedious. If you've read this far maybe you have a knack for it.
Tedious or not, it is necessary. If you are programming successfully
in a higher-level language, it is because somebody put the time into
creating low-level crap like this as a foundation for it.
If nothing else, maybe you begin to see that "Computer Science" is
as much an Art as a Science, with choices to be made about tradeoffs
at nearly every step along the way.
|
|