Just Your Type

Way back in store.htm, when we were looking at the low level details of attaching a computer readable variable value to its English tag name, I wrote:

" Suppose the translator program has reserved a memory area for variables, so it knows where it is going to put somevar. It still has two problems facing it: How big a space will it need for the variable value itself, and how it will handle the English name associated with that value so it is in a unique enough form to be able to come back to it?

The first problem has to do with variable "typing," which is one of those fascinating issues I may devote a whole essay to later. "

I simplified things from store.htm until now by limiting my references to single 16 bit numbers. But that's about to change.

Later, in hash.htm, I described a more fleshed-out approach to storing named entities, a linked list system borrowed from the Forth language:

                  # The entry for "avariable"
        00000133  #   link field
        00002223  #   name field = LRS
        00000021  #   name field = # of chars
        00001201  #   ascii a
        ...           ...
        00001211  #   ascii e
        00000000  #   null termination
        xxxxxxxx  #   stuf field
                  # parm field
        xxxxxxxx  #   The binary 16 bit variable

The step facing us now is to handle variable (or named constant) data larger than one 16 bit memory word. Actually, even a single word can have its variations. It can be an integer in the range of 0 to 65535, or a signed int ranging from -32768 to +32767. It could be two bytes stored side by side (and in what format? ASCII, or ...). Could be a row of single bits bunched all together. Or combinations of one byte sharing storage with various smaller entities.

Then we can start adding words. Two words back to back can be a signed or unsigned 32 bit integer, or 4 bytes, or ... A pack of 100 storage words can be an array of 100 16 bit integers, or 50 32 bit integers, or an object made up of different sized variables mixed with pointers to things like arrays and functions and other objects. It can get complicated pretty fast.
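To make that concrete, here's a minimal C sketch (C standing in for the raw machine view here; the union trick and all the names are mine, not anything from the FSA) that reads the very same storage words back three different ways:

    #include <stdio.h>
    #include <stdint.h>

    /* Four 16 bit storage words, nothing more. The bits don't say what
       they are; only the reading algorithm decides that. */
    union block {
        uint16_t words[4];   /* four unsigned 16 bit integers  */
        int32_t  pair[2];    /* or two signed 32 bit integers  */
        uint8_t  bytes[8];   /* or eight raw bytes             */
    };

    int main(void) {
        union block b = { .words = { 0x4142, 0x4344, 0xFFFF, 0xFFFF } };

        printf("as 16 bit words: %u %u %u %u\n",
               b.words[0], b.words[1], b.words[2], b.words[3]);
        printf("as 32 bit ints : %d %d\n", (int)b.pair[0], (int)b.pair[1]);
        printf("as bytes       : %c%c%c%c ...\n",
               b.bytes[0], b.bytes[1], b.bytes[2], b.bytes[3]);
        return 0;
    }

Which printout is 'right' depends entirely on which algorithm reads the block - the bits themselves don't care, and the byte ordering even varies from machine to machine.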

There was a book written in 1975 by Niklaus Wirth entitled

   Algorithms + Data Structures = Programs

that wins my nomination for "best title of a book on programming, ever."

Wirth was primarily writing about A and DS relative to his own higher level language (HLL), Pascal (yes, of "Don't get me started on Pascal" in the previous blog), but even at this low level of dealing with simple stored variables, each unique data type will require its own specific machine language (ML) algorithm to process it.

Think about it - machine language just sees a block of data. Maybe, since the data is in a variable storage area, it 'knows' it is not a block of ML code to try to run, but it doesn't otherwise know anything about its size or structure. We have to figure out a way to 'tell' it those details.

Before we get to that, a further note about algorithms and data structures. Suppose there's an HLL expression which, when translated into English says, "I refer to the tenth function pointer down from the beginning of object X". What it's telling the machine language that underlies the HLL is, "Get me that tenth pointer and put it on the stack (or wherever) so the next expression can access it." Now, that function pointer is itself part of some algorithm written in the higher language. Maybe it points to a square root function to be used on some number in the overall HLL statement. And maybe that number is part of object Y.

Yet first, a lower algorithm has to search through object X in order to retrieve that pointer in a form the HLL can use. Then another low level routine has to dig the number out of object Y. So the overall result is algorithm built on algorithm built on underlying data structures. It doesn't matter how high the pyramid of algorithms is, or how far apart the data structures may be, the two concepts together are all there is to programming. Really.
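To act that story out in miniature, here's a hedged C sketch. The layouts of X and Y are hypothetical stand-ins, just concrete enough to show one low level routine digging a pointer out of one structure and a second routine digging a number out of another:

    #include <math.h>    /* build with: cc file.c -lm */
    #include <stdio.h>

    typedef double (*fn_ptr)(double);

    /* Hypothetical layouts, just enough structure to act out the story. */
    struct object_x { fn_ptr fns[16]; };     /* a table of function pointers */
    struct object_y { double values[8]; };   /* some numbers live here       */

    int main(void) {
        struct object_x x = { .fns    = { [9] = sqrt } };  /* the tenth slot */
        struct object_y y = { .values = { [3] = 2.0  } };

        fn_ptr f = x.fns[9];      /* low level routine 1: dig out the pointer */
        double n = y.values[3];   /* low level routine 2: dig out the number  */
        printf("%f\n", f(n));     /* the HLL statement's net effect: 1.414214 */
        return 0;
    }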

Algorithms can modify data structures, and often do, and data structures can contain the active code of algorithms (we saw in hash.htm that virtually all Forth data structures do so). Active and passive; actor and acted upon ...

Anyway, back to the current challenge. Actually, there are two challenges. First, to deal with multi-word data types, and second, to provide a generalized storage format that can support larger systems than I've covered so far in this blog.

The approach shown above works fine for small projects that can have their whole dictionary fit within a single 64K memory. Yet small projects become large projects eventually, and will require a way to get beyond such a limit.

And it's worth covering what I mean by 'large projects.' Again I have to touch on the platform on which I'm doing language development, the Flexible System Architecture. A traditional computer has a central processing unit and a main memory. More modern architectures have a CPU and several levels of memory, from local registers through various 'caches' through system RAM, ending in mass storage like a hard drive, thumb drive or CD ROM. Each succeeding memory level is bigger but slower. Now of course, there are multi-core systems with multiple CPUs.

The FSA is a multi-core system made up of many small 16 bit sequencers, each with its own 65536 words of very fast memory. A lot of ML functionality can be packed into 64K, and the FSA's main application target as a "control architecture" means that most applications will only reside in these small, fast spaces. Yet somewhere there will be a main memory, if only to hold large blocks of data that don't easily fit into 64K.

Homonym Alert - From this point, the term 'word' is used in this essay to refer to the Forth word, name, symbol, token, etc. Above, the term 'word' referred to a memory word, cell, space, slot, etc. In the very next blog, I will drive a stake into the heart of this confusion forever. I promise.

So the generalized storage format of Forth words must take into account all these factors. Without further ado, here is that updated format, with the caveat that things will almost certainly change in future:

                  # Generalized dictionary entry format
        xxxxxxxx  #   link field
        xxxxyyyy  #   left byte: namespace / right byte: LRS
        0000xxxx  #   name field = # of chars
        0000xxxx  #   ascii ...
        xxxx0000  #   left byte: stack effect / right: null termination
        xxxxxxxx  #   type field
        xxxxxxxx  #   reserved field
                  # parm field
        vvvvvvvv  #   variable, constant, code ...
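For readers who think better in structs than in memory dumps, here's the same format rendered as a C struct. It's strictly a reading aid - the real name and parm fields are variable length, and the field names are my own:

    #include <stdint.h>

    /* The generalized entry, rendered as a struct. NAME_MAX is an
       arbitrary example size; real entries are variable length. */
    #define NAME_MAX 15

    struct dict_entry {
        uint16_t link;              /* link field: previous entry in chain   */
        uint8_t  nspace;            /* left byte of next word: namespace     */
        uint8_t  lrs;               /* right byte: the LRS of the name       */
        uint16_t name_len;          /* name field: # of chars                */
        uint16_t name[NAME_MAX];    /* ascii chars, one per 16 bit word      */
        uint8_t  stack_eff;         /* left byte: stack effect message index */
        uint8_t  null_term;         /* right byte: null termination          */
        uint16_t type;              /* type field: what the parm field holds */
        uint16_t reserved;          /* reserved field: future expansion      */
        uint16_t parm[];            /* parm field: variable, constant, code  */
    };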

Starting near the bottom, note that the 'stuf' field is being changed to a 'type' field. This field will grow well beyond the two (base 4) values mentioned in hash.htm, 00000002 for runnable machine code in the parm ("parameter") field, and 00000001 for threaded address pointers in the parm field.

The current implementation of this Forth-like system also uses 00000003 to signal the end of compiling. This 00000003 tag is basically equivalent to the 'priority bit' of traditional Forth, meaning "interpret this token (ie, run the code in the parm field) even if the system is currently in compile mode."

You can see that these tags all relate to what is just below in the parm field. As of this writing I haven't yet created any other kinds of parm fields, although my previous description of storing 'somevar' pretty well demonstrates that named variables will fit into this format - they merely need their own special type tags.

With this burden of having to live within a larger system, the stuf field not only gets renamed the type field, it also gets revamped. So let's revamp it (values this time in binary):

0000000000000001 - parm field contains thread addresses
0000000000000010 - parm field contains executable code
0000000000000011 - priority tag: execute, don't compile; parm field empty
0000000000001000 - parm field contains 16 bit pointer to local sequencer 0
0000000000001001 - parm field contains 16 bit pointer to local sequencer 1
0000000000001010 - parm field contains 16 bit pointer to local sequencer 2
0000000000001011 - parm field contains 16 bit pointer to local sequencer 3
0000000000001100 - parm field contains 16 bit pointer to main memory
0000000000001101 - parm field contains 32 bit pointer to main memory
0000000000001110 - parm field contains 48 bit pointer to main memory
0000000000001111 - parm field contains 64 bit pointer to main memory
1xxxxxxxxxxxxxxx - parm field can contain various var or const types
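Or, for those who prefer hex to binary, the same table as a C sketch (the tag names are mine):

    /* Type field tags, in hex (names are mine). */
    enum type_tag {
        T_THREADS  = 0x0001,   /* parm field holds thread addresses        */
        T_CODE     = 0x0002,   /* parm field holds executable code         */
        T_PRIORITY = 0x0003,   /* execute, don't compile; parm field empty */
        T_PTR_SEQ0 = 0x0008,   /* 16 bit pointer into local sequencer 0    */
        T_PTR_SEQ1 = 0x0009,
        T_PTR_SEQ2 = 0x000A,
        T_PTR_SEQ3 = 0x000B,
        T_PTR_M16  = 0x000C,   /* 16 bit pointer to main memory            */
        T_PTR_M32  = 0x000D,   /* 32 bit pointer to main memory            */
        T_PTR_M48  = 0x000E,   /* 48 bit pointer to main memory            */
        T_PTR_M64  = 0x000F    /* 64 bit pointer to main memory            */
    };

    #define T_VARCONST 0x8000                        /* MSB set: var/const  */
    #define IS_VARCONST(t) (((t) & T_VARCONST) != 0) /* ~32K types to spare */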

I should first explain about the local pointers. I wrote above about the FSA being made up of small sequencers, and for this Forth-oriented language development, I bundle a "four-pack" of them together, one each for: input buffer, stacks, kernel and dictionary. A look at this graphic in the next essay should help make things clearer.

I'll admit these local pointers may be too clever for their own good. Being able to branch within the dictionary (seq 3) is pretty redundant, as that's what the 'thread addresses' tag already does. Jumping into either of the stacks (seq 1) is dangerous, although having the ability to 'goto' the call stack would be one way to implement looping, albeit a bit of an undisciplined way. One point to make is that if all four sequencers have their full 64K of memory, there would be lots of room for other machine language programs. Still, to call any ML routine outside of the dictionary kind of defeats the idea of having an English-like higher level language. The same objection applies to branching into the kernel (seq 2), but at least there may end up being legitimate reasons of efficiency for doing so.

The most intriguing possibility for the local jump is into the input buffer (seq 0). One of the multipliers of programming power in the Lisp language is its implementation of "macros", the ability to write code snippets in its own language and then run them as if a programmer had typed them in. Having access to the input buffer gives this new 'perfect language' a way to gain the same functionality. And remember, the format being described is for named entities, so we can have many different macros, each with its unique pointer.

Before leaving the type field altogether, note that the One in the MSB position allows for tagging over 32,000 variable and constant types. I'm still using up scratch paper (and brain cells) working out the details. This is one essay I will be coming back and updating after it is first published on the Web.

I will say I want the type field to have a small inner field of probably 2 to 4 bits to interact with the reserved field. For future expansion, is reserving merely one single memory cell planning far enough ahead? I hope so. I do know that my years of developing the FSA instruction set have given me a lot of practice in squeezing everything possible out of 16 bits.

One thing that may well go into the reserved space is a system of "protection levels" a la object oriented languages like Java and C++. More on this soon.

Another new thing is the namespace byte, analogous to "vocabularies" in traditional Forth. It's a way to possibly allow a group of Forth words to reside elsewhere than the dictionary sequencer's local memory. More importantly, it's also a way to allow the reuse of a token name. I'm thinking of borrowing Java's dot-separator, eg, "class.method" for this. To make it easy for the interpreter to recognize a namespace reference, I'll probably use a double dot format:

.namespace.forthword

The 'forthword' half can be any arbitrary name, and it can be the same name as used in other namespaces. They won't interfere with each other because they will have different "scopes". Indeed, this effectively gets rid of any reserved words in the language (although there may ultimately be reserved namespaces). The word 'dup' could be used as a variable name in one namespace without getting confused with the Forth 'dup' stack operator.
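Here's a hedged C sketch of how the interpreter might recognize and split that double dot form. The function name and the buffer handling are my own invention:

    #include <string.h>
    #include <stdio.h>

    /* A sketch of recognizing the .namespace.forthword form: a leading
       dot flags a namespace reference; the second dot splits the halves.
       Returns 1 and fills the two buffers on success, 0 otherwise. */
    int split_ns(const char *tok, char *ns, char *word, int max) {
        if (tok[0] != '.') return 0;             /* not a namespace ref  */
        const char *dot = strchr(tok + 1, '.');  /* find the second dot  */
        if (!dot || dot == tok + 1) return 0;    /* need a nonempty half */
        int nlen = (int)(dot - (tok + 1));
        if (nlen >= max || (int)strlen(dot + 1) >= max) return 0;
        memcpy(ns, tok + 1, nlen); ns[nlen] = '\0';
        strcpy(word, dot + 1);
        return 1;
    }

    int main(void) {
        char ns[32], word[32];
        if (split_ns(".math.dup", ns, word, 32))
            printf("namespace=%s word=%s\n", ns, word);  /* math, dup */
        return 0;
    }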


A Cup of Java

My path to all this language exploration had a kind of strange beginning: Java. I had bought a fairly recent edition of Herbert Schildt's dictionary-sized Complete Reference to help fix some old applets that didn't work right anymore, and generally to update my knowledge of the language.

Java is an OOP (Object Oriented Programming) HLL. OOP is one of those concepts that sounds absolutely wonderful when you read about it, but its real world execution tends to get bogged down in complications. At least that's the way I've found it to be. Be that as it may, I had already been working on a Forth system when I started studying the Java book and I began thinking about an object oriented Forth (OOF!). And that is what got me intrigued with the low level intricacies of storing and digging out symbolic references within computing machinery.

In his introduction, Schildt describes three advantages that the object oriented style provides: encapsulation, polymorphism, and inheritance. Let's take them one at a time.

Encapsulation - this is the idea of "information hiding", that an entity, say a variable used in one part of a program, is not visible to another part which doesn't need it. It is not that the other part shouldn't access it, but rather that it can't access it because the entity is protected by the design of the language. There are two schools of thought on this, by the way: "Don't handcuff good programmers" versus "Prevent mistakes."

Polymorphism - The idea here is to avoid multiple function names for similar operations. Instead of, say, 'isqrt' and 'fsqrt' to compute integer and floating point square roots, have a single 'sqrt' function that 'looks at' the parameters sent to it and 'knows' how to compute the correct output. This concept is often extended to "operator overloading". For example, in Java the plus sign works with various types of numbers of course, but also with text data. The Java statement:

System.out.println("A string literal = " + avariable);

might produce as output:

A string literal = 99

The '+' sign has connected two very different elements to display a sentence.

Inheritance - This is the core of OOP - objects made up of internal variables and "methods" (basically, functions that belong to the object). The idea is to create a top level object like 'animal' and give it basic methods such as each of the five senses. Then you can create child objects (warm blooded / cold blooded animals) that can inherit methods from the parent. Go further and you can divide warm blooded into two legged and four legged. And on down to cats, dogs, cows, horses... In each case, you don't have to reinvent the five senses, but rather inherit them from the top object. Along the way, it's easy to add new children (winged warm blooded) and reuse most of the code. If necessary at a given level, you can add new senses (bats: echolocation) or even overload and thereby replace a sense (eagles: really good vision).

Sounds good on paper, looks good on the screen. Yet as systems grow, the devil shows up in the details. Really big objects have trouble communicating with each other, so various kinds of interface objects have to be invented. Or methods have to creep higher up the inheritance tree - "Uh oh, dolphins also echolocate, better move that up from bats to warm blooded." Multiple inheritance may be allowed - "Well, we better have both 'animal' and 'senses' objects since 'animal' is getting a bit unwieldy." The underlying machinery supporting OOP becomes complicated and slows down. Multiple inheritance is so hard to stay on top of that the designers of Java decided right from the start, "Nope, ain't gonna happen, not in this language."

Some writers have called Forth an object oriented language. Ummm, if you consider a collection of hundreds of tiny containers as fitting into the OOP paradigm, maybe so, but I'd guess that most people involved in OOP development wouldn't agree. There's no inheritance (reusing code is not the same as having a formal inheritance system), no real polymorphism, and very minimal encapsulation.

I'm trying to accomplish two things with my so-called Perfect Language: 1) to create a full featured, powerful language that provides low level visibility, and 2) to provide a set of generalized building blocks to make it easy for others to build their own HLLs on top of it. Some minimal 'hooks' for an OOP environment would make these building blocks more robust and useful.

I'm still at too elementary a level in the creation of my own new language to clearly see if I will be able to provide good hooks to OOP. My guess would be no. Keeping things agile and low level and at the same time trying to support objects doesn't seem a likely mix. But we'll see.

The addition of namespaces does help in information hiding, and to a degree emulates polymorphism. Polymorphism is overrated in my opinion anyway. In a medium sized program, it does indeed make for clearer reading if one basic symbol can handle different inputs. But as systems grow, memory (of the human sort) becomes strained: "Does sqrt support long integers?" You have to look it up anyhow, just as you'd have to look up to see if there's a 'lsqrt' in addition to an 'isqrt' in a non-polymorphic language. Furthermore, memory of the machine sort can also take a hit from OOP. If you're writing an application that only requires floating point square roots, the system still has to store all the different methods of the sqrt object no matter how much space they take.


You Need Scope

Something I referred to above - protection levels - would be another way to provide information encapsulation. Here's a quote from Schildt's Seventh Edition Java book where he writes about Java's access control mechanism: "Classes and packages are both means of encapsulating and containing the name space and scope of variables and methods. Packages act as containers for classes and other subordinate packages. Classes act as containers for data and code."

Borrowing from the Java system, I scratched out a 3 bit encoding for possible access control:

000 - hackable_0
001 - hackable_1
010 - global_0
011 - global_1
100 - public
101 - protected
110 - no specific modifier
111 - private

The bottom four are from Java and enforce which elements of classes, package subclasses, and package non-subclasses can access each other's methods and variables. Private is the most restrictive, public the least. Java also allows effectively global variables (public static fields), visible to everything.

Here's another quote, this time from Paul Graham's essay on what makes a language popular (paulgraham.com/popular.html). In reason 4, "Hackability", he writes: "... being able to do what you want. In the history of programming languages a surprising amount of effort has gone into preventing programmers from doing things considered to be improper. This is a dangerously presumptuous plan. How can the language designer know what the programmer is going to need to do?"

In that spirit, I would provide the top four access control encodings to allow for overriding another element's protection level. That is, if I provide access control at all. One point is that it might have to be a dual system - 3 bits for the power to protect itself, 3 more for the power to override another element's protection. And if so, would a tie go to the caller or the callee? Whether such control bits would be in the type field or the reserved field can be determined later. I have a knack for tossing out ideas that have to be completely reworked later. This may be one of them.
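Purely for concreteness, here's one possible C rendering of that dual 3 bit idea, with the tie arbitrarily given to the caller. Every name here, and the comparison rule itself, is an assumption of mine, and like the rest of this scheme it may get completely reworked:

    #include <stdint.h>

    /* One guess at the dual scheme: bits 0-2 hold a word's own protection
       level, bits 3-5 its power to pierce another's (0 = hackable through
       7 = private, per the table above). */
    #define PROT_SELF(f)   ((f) & 0x7)
    #define PROT_PIERCE(f) (((f) >> 3) & 0x7)

    /* Tie goes to the caller here: access succeeds whenever the caller's
       piercing power is at least the callee's protection level. */
    int can_access(uint16_t caller_bits, uint16_t callee_bits) {
        return PROT_PIERCE(caller_bits) >= PROT_SELF(callee_bits);
    }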


Strong versus Weak

Data typing is something I may never implement. Like protection levels, it would primarily be a hook for higher level languages to be built upon the base system. The basic idea of typing is to prevent two very different kinds of data from being merged by an operation that doesn't apply to at least one of them. From the Java example above, if the plus sign were not overloaded, and could only be used for adding two numbers, adding an integer to a byte string would be ridiculous. Where in the string would you add it? How would you use the result? A type system warns, "Hey, you can't do that."

There are two basic kinds of typing: "static" and "dynamic". Static typing is done at compile time. The program text is checked for type mismatches as it is processed, and an error, or at least a warning, is issued if an illegality is discovered. Static typing is simply not going to be part of the FSA's Forth-like base language.

Why not? Because Forth compiling is done on tiny pieces of source code, one at a time. It is therefore "context-free" compilation, which is to say that when a Forth word is compiled it doesn't care what has gone before it or what comes after it. A Forth word's job is almost always to do something to the parameter stack. It may add to it, subtract from it, or leave it unchanged. Its job isn't to worry about what the following word may do with the data on the stack - that's the programmer's job.

Dynamic typing happens during run time. The arguments to an operation are checked against each other and a run time exception is generated if they don't match. If the types are simple enough, there's usually a mechanism for coercing one type to match another. For example, in many languages, an expression like (32.456 + 5) will promote the integer 5 to the bigger floating point format and go ahead and add them. The same would apply to a statement with variables like z = y + x, where x is an int and z & y are floating.
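Here's a tiny C sketch of those run time mechanics. It uses an explicit tag on every value, since C itself resolves such promotions at compile time; the layout is mine, not any particular language's:

    #include <stdio.h>

    /* A sketch of run time coercion: values carry a tag, and '+' checks
       the tags before acting, promoting int to float when the sides differ. */
    enum tag { T_INT, T_FLT };
    struct val { enum tag t; union { long i; double f; } u; };

    struct val add(struct val a, struct val b) {
        if (a.t == T_INT && b.t == T_INT)
            return (struct val){ T_INT, { .i = a.u.i + b.u.i } };
        /* mixed: promote whichever side is an int, then add as floats */
        double x = (a.t == T_INT) ? (double)a.u.i : a.u.f;
        double y = (b.t == T_INT) ? (double)b.u.i : b.u.f;
        return (struct val){ T_FLT, { .f = x + y } };
    }

    int main(void) {
        struct val r = add((struct val){ T_FLT, { .f = 32.456 } },
                           (struct val){ T_INT, { .i = 5 } });
        printf("%f\n", r.u.f);   /* 37.456000 */
        return 0;
    }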

For better or for worse, Forth is pretty much just as context free at run time as at compile time. Its concentration on stack based processing means a Forth operation isn't overly concerned with how any previous stack data got there. About the only 'type checking' it performs is to create a "stack empty" error if a word tries to pop something off the stack when there's nothing on it.

Forth is considered to be "weakly typed" (or even, according to Wikipedia, "typeless"). One way it gets around this potentially mistake-prone limitation is that it embodies the epitome of "unit testing". Forth programs are built in very small steps, and each piece is fully tested before adding it to the dictionary. So side effects and unpredictable behavior are kept to a minimum.

At this time, I'm thinking of dedicating either the type field or the reserved field to hold the size of the associated parm field. This could at least provide a "hey, they're the same size, do what you want with them" facility.
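As a sketch of that minimal facility in C (assuming, and it's only an assumption, that the size lives in the reserved field, counted in 16 bit words):

    #include <stdint.h>
    #include <string.h>

    /* If the reserved field held each parm field's size in 16 bit words,
       a move operation could at least refuse mismatched transfers. */
    int parm_copy(uint16_t *dst, uint16_t dst_words,
                  const uint16_t *src, uint16_t src_words) {
        if (dst_words != src_words) return -1;   /* "hey, you can't do that" */
        memcpy(dst, src, (size_t)src_words * sizeof(uint16_t));
        return 0;                                /* same size: carry on      */
    }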

One of the important things about Forth is its complete absence of operator overloading. Even addition, something de facto overloaded in almost all languages, has its explicit operators for each type. To add two 16 bit integers, use the '+' sign. To add two 32 bit ints, use 'D+'. Many programmers complain that this makes Forth too cryptic and hard to read, but I have to believe that in a very large program, having everything defined extremely literally would be an advantage. As I mentioned above, if a language has too many things built on top of each other, the programmer's memory eventually becomes strained. Explicitness may seem at first too cluttered and detailed, but at least you can simply look something up in a manual instead of hunting backwards and upwards through a pile of code to find hidden connections.

An interesting point: while Forth variables may be typeless, its operators, by strictly avoiding overloading, turn out to be strongly typed.

One final field note: My friend Dirk has suggested that each word in the dictionary should have a way to show its effect on the stack. I think this is a good idea, thus the stack effect byte. These values would be mini pointers into an array of text messages (they would fit nicely into Seq 0, which is already used for storing input text) describing possible stack results. Here are a couple of examples of typical Forth stack shorthand:

dup :  (n -> n n)
+   :  (n1 n2 -> sum)

The byte would allow for only 256 such canned messages, but some of them could be generic ('+' could alternately be described by (n n -> n) for example) and number 255 could be an escape: "effect unknown." Typing in a simple interactive statement such as:

eff forthword

could display the stack effect text for that word.
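Here's a sketch of that machinery in C. The table contents and the fallback handling are illustrative only:

    #include <stdio.h>

    /* A sketch of the stack effect byte as an index into canned texts
       (the real table would live in Seq 0 beside the input buffer).   */
    #define EFF_UNKNOWN 255

    static const char *eff_text[256] = {
        [0] = "(n -> n n)",         /* e.g. dup               */
        [1] = "(n1 n2 -> sum)",     /* e.g. +                 */
        [2] = "(n n -> n)",         /* generic two-in one-out */
        [EFF_UNKNOWN] = "effect unknown",
    };

    /* What 'eff forthword' might do with an entry's stack effect byte. */
    void eff(unsigned char stack_eff_byte) {
        const char *s = eff_text[stack_eff_byte];
        puts(s ? s : eff_text[EFF_UNKNOWN]);
    }

    int main(void) {
        eff(0);    /* (n -> n n)                   */
        eff(200);  /* unset slot -> effect unknown */
        return 0;
    }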

Well, I've written a whole essay about extensions to the Forth paradigm, and then managed to talk myself out of half of them! I've written enough bad code in my life by not thinking about it enough first; blogging like this could be a good new habit.

Let me, if only for my own sake, try to restate my 'perfect language' aims. Nobody of course can create a language that is perfect in everybody's eyes. If you're one who thinks polymorphism is the best thing since sliced bread, you probably won't like my end result much.

My first aim is to make the leap from the Ones & Zeros machine language of the Flexible System Architecture to a powerful but still low level abstraction of a Forth-like system, bypassing the development of a mnemonic based assembler. The next target is to expand that, if possible, into a higher level approach to programming. My attempt here has been to add extra fields to the basic Forth dictionary format to facilitate this expansion. I've gotten way ahead of my own current developments, so no doubt there will be future modifications to the format proposed here. Maybe many of them.

No matter what, I want my first pass to run as fast as possible. My main worry is that even the few changes to the format I've made so far are likely to expand the kernel to 3 or 4 times its present size. There's no problem with that from a memory usage standpoint, but the extra tests and processing will inevitably slow it down a little. As usual, the rules of the game come down to tradeoffs.

Always tradeoffs - how do you abstract a coder's ideas into a clear English form yet still efficiently use a machine's resources? To quote myself from the previous essay, loopy.htm, "Here we are beginning to discover that the English in the source code of higher level languages might impose a useful formalism on ... a program. Even if the resulting machine code isn't perfectly efficient ... the gain in readability and maintainability may be worth it."

... And I think I am edging into a subject better left for a future essay. But a couple of things come immediately to mind: First, add a parser that can recognize infix notation and convert it to Forth's postfix notation. The bad news is that the kernel would grow to 5X in size and real Forth programmers would hate me. But of course it would be optional!

A second idea is to tag variables and constants with "Hungarian Notation" (also optional). But watch out! As quoted from www.c2.com/cgi/wiki?HungarianNotation, "Hungarian notation inspires some of the most bitter religious wars among programmers." This is because a while back, Microsoft foisted a version of it that made their system variables look like they all came from the vowel-challenged Hungarian language itself. But it doesn't have to be that complicated. The basic idea is to prefix names with a lowercase letter or two denoting their type. For example, a 32 bit variable might be named dMyvar instead of myvar. You could add an underscore: d_myvar, or since it's Forth, save hitting the shift key and use: d-myvar. This notation doesn't cost the interpreter/compiler anything except slightly longer tokens - it's strictly for helping human eyes follow the code more easily. As another example, the Perl language standardizes with three one-character markers:

  $somevar   — scalars (both strings and numerics)
  @somearray — arrays
  %somehash  — hashes

One of the pitched battles in the Hungarian War was that it wasn't originally applied to the type of a variable, but rather its kind. That is, HN started out as a hint about how to use a var, not how to store it. In www.joelonsoftware.com/articles/Wrong.html, Joel Spolsky describes how, in the source code of Excel, 'rw' and 'col' were used as prefixes for different variables accessing the spreadsheet checkerboard. The different prefixes depended on how they were applied - even though all such variables were integers, their prefixes were different. As a form of self-documenting code, it helped to ensure someone wouldn't accidentally modify a column var with a row value, or vice-versa.

Yeah, I know, I probably won't use HN either. Still, maintainability is no joke. In the terseness versus typing war (not data typing this time, but the amount of keyboard typing), most programmers, including me, will always side with terseness. Yet going too far with it can lead to exasperation six months down the line when you can't figure out your own code.





         Introduction to "The Perfect Language"    —   Table of Contents
         Deciding to Get Loopy    —   Previous
         Why Forth? - Part 1    —   Next
         Glossary




Started October 22, 2010