String Theory

Such a difference between human and computer language! That very sentence started off these essays back in English versus the Machine. Since then we've seen that a computer has to make room to store even one word of English, figure out how to tag it uniquely so it can find it again, and arrange those tags so it can find it efficiently. Ultimately, during compilation, it does its best to get rid of the human words altogether.

And yet, what is this system I'm typing into right now doing? It is dealing with English! It can't get away from it after all. As long as machines have to interact with we the people, they have to ultimately communicate in a human tongue. In a sense, this blog has come full circle: from how to translate human language into machine language to how to translate in the other direction. Sort of, anyway; admittedly that is a bit of a stretchy metaphor.

When a computer "talks" to us, the term for what it reads, stores, or outputs is 'string'. Every mature computer language has commands for manipulating strings: comparing them, concatenating them, finding letters or patterns within them, and so on. The little secret is that all these commands aren't really part of most languages at all. And I'm back to making outrageous statements again.

Okay, I exaggerate. What I mean is that strings aren't actually a basic unit of computing. You aren't going to multiply one string by another, or find the fifth root of one, because strings are simply arrays of human language characters. An array is a line of equal-sized entities stored in memory one after another. A string is an array of ASCII bytes. It may be a word, a sentence, a paragraph or a whole encyclopedia (very rare to find one that long). Size doesn't matter, but the mere fact that strings can extend over such a huge range pretty much demonstrates that they aren't fundamental.
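For instance, here's a quick C sketch (the variable name is mine) of just how thin the "string" abstraction is: a string is an array you can index into, byte by byte.

  #include <stdio.h>

  int main(void)
  {
      char word[] = "String";      /* six letters in a row (C quietly adds an ending zero - more on that below) */
      printf("%c\n", word[2]);     /* prints 'r', the third element of the array */
      printf("%d\n", word[2]);     /* prints 114, the ascii code hiding under that 'r' */
      return 0;
  }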

A string function or operator in a typical high level language is often a part of a "string library", a collection of handy utilities that do basic things with strings like those manipulations mentioned two paragraphs back. Languages provide such libraries because some things programmers want from an HLL are so common that if the language didn't provide them, hundreds of programmers would have to create them for themselves. They would all continually reinvent each other's wheels, and even worse, there would be no standardization among them.


The Format

So the purveyors of the HLLs provide some simple string functionality that all programmers can access. Quickly now, what's wrong with that sentence? Answer: it contains the word "simple." Strings are not as simple as they first appear. First of all, nobody can seem to agree on how to format a string array. In C, for example, a string is the array of letters plus an ending NUL, or ascii zero. At first glance this seems pretty handy, because zero doesn't encode any letter, number or special character, so it is easy to use it as a marker for ending a string.

The ascii set also contains formatting characters. There's the space, of course, and also byte-sized elements like tab, carriage return, and line feed. Often, when you're outputting text to display on a screen, you don't want to rely on word wrapping to determine the end of a line, so you can add a "hard return" to the end of a sentence. In C, you would use the line feed special character '\n' which is ascii-coded as binary 00001010. For example:

  printf ("This is my sentence.\n");

The quote marks automatically tell C it's a string so it would be stored internally with the ending NUL (C special char '\0'):

  This is my sentence.\n\0

Things are slightly different if the output is to a text file. No one wants nulls in a text file because it should be made up of only printable characters. So the string is saved in the file as:

  This is my sentence.\n

Unless it is a DOS/Windows file rather than a Unix/Linux file, in which case it will be saved as:

  This is my sentence.\r\n

since Windows uses both a carriage return AND a line feed for its hard returns. Do understand that the backslash sequences are just shorthand for display here. The last five bytes as stored in the DOS text file would actually be:

  01100011 - Lowercase c
  01100101 - Lowercase e
  00101110 - Period
  00001101 - Carriage Return
  00001010 - Line Feed

A C implementation for Windows has to treat strings with returns slightly differently than one for Linux: its text-mode file I/O quietly adds the extra '\r' on the way out and strips it on the way back in. "Simple" is already becoming less so.
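Here's a rough sketch of the difference; the file names are made up for the example, and the 'b' in the second mode string asks for binary mode, which turns the translation off.

  #include <stdio.h>

  int main(void)
  {
      /* Text mode: on Windows the runtime writes '\n' to the file as '\r\n'. */
      FILE *ftext = fopen("text_mode.txt", "w");
      fputs("This is my sentence.\n", ftext);
      fclose(ftext);

      /* Binary mode: the bytes go out exactly as written, no translation. */
      FILE *fbin = fopen("binary_mode.txt", "wb");
      fputs("This is my sentence.\n", fbin);
      fclose(fbin);

      return 0;
  }

On Linux the two files come out identical; on Windows the first one picks up that extra carriage return.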

Notice that in C there is no information about the length of the string. This became a problem with the rise of the Internet as hackers discovered they could send longer strings across the Web than many of the programs running on servers were set up to handle. That let them write past the end of the memory area supposed to contain an input string, overwriting actual machine code and crashing the server program. Then they got really clever, added their own code to the ends of their fake strings, and started running their own programs on the server instead. The era of the computer virus got a major shot in the arm from this built-in C language vulnerability.
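Here's a sketch of the kind of code that caused so much grief; the 16-byte buffer and the function names are mine, but the pattern was depressingly common.

  #include <stdio.h>
  #include <string.h>

  void handle_request(const char *input)      /* 'input' arrives from the network */
  {
      char buffer[16];

      /* The classic mistake: strcpy() keeps copying until it finds the NUL,
         no matter how long 'input' is.  Anything past 16 bytes tramples
         whatever happens to live next to 'buffer' in memory:
             strcpy(buffer, input);
         The safer habit: copy at most sizeof(buffer)-1 bytes and add the
         ending zero yourself. */
      strncpy(buffer, input, sizeof(buffer) - 1);
      buffer[sizeof(buffer) - 1] = '\0';

      printf("Got: %s\n", buffer);
  }

  int main(void)
  {
      handle_request("This is my sentence, and then some, and then some more.");
      return 0;
  }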

Strings in the Pascal language don't end in zero but instead begin with the size of the string itself. However, since a byte can only count up to 255, that limits the length of Pascal strings. The original FORTRAN used something called Hollerith notation, where the number of characters preceded the letter H, which then preceded the string itself:

  20HThis is my sentence.

Depending on how big a number is allowed before the H, much larger strings than Pascal allows are possible. OTOH, the Fortran strings aren't much fun to read within the source code.
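If you wanted to mimic the Pascal counted-string idea in C, it might look something like the hypothetical sketch below (no real Pascal compiler is obliged to lay it out exactly this way).

  #include <stdio.h>

  /* A Pascal-style string: a count byte up front, then the characters.
     One byte of count is what limits it to 255 characters. */
  struct pstring {
      unsigned char length;
      char          chars[255];
  };

  int main(void)
  {
      struct pstring s = { 20, "This is my sentence." };
      printf("%.*s\n", s.length, s.chars);    /* print exactly 'length' characters */
      return 0;
  }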

That is only three possible string formats. Near the end of What's in Store?, I described a packed format to save room storing ascii bytes into the FSA 16 bit word - two bytes per word, where possible.

You can begin to see that manipulating strings is very dependent on the format the strings are in. The act of joining two strings together ('concatenating' them) is a common library function. Will the format help or get in the way? Do they have to be unpacked, joined, then repacked? Will the allowable maximum count be exceeded? Can the second one be written in memory right after the first, or will they both have to be moved to somewhere with more room? If they both end with carriage returns, should the return in between be eliminated?

On the face of it, the C strings would seem to be the easiest to join, but if you try to use the strcat() function to turn a lot of strings into a single one, there can be a performance hit because the C format doesn't know where the end of a string is (no length info). Each call to strcat() has to search byte by byte until it finds the magic zero. Every format has its own strengths and weaknesses.
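To see that strcat() penalty concretely, here's a sketch of mine: the easy loop rescans the growing string every time, while the second version just remembers where the end is.

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *pieces[] = { "This ", "is ", "my ", "sentence." };
      char slow[64] = "";
      char fast[64] = "";

      /* Easy but slow: every strcat() walks 'slow' from the start,
         byte by byte, to find the NUL before it can append. */
      for (int i = 0; i < 4; i++)
          strcat(slow, pieces[i]);

      /* Faster: keep a pointer to the current end ourselves, so each
         append starts right where the last one stopped. */
      char *end = fast;
      for (int i = 0; i < 4; i++) {
          size_t len = strlen(pieces[i]);
          memcpy(end, pieces[i], len);
          end += len;
      }
      *end = '\0';

      printf("%s\n%s\n", slow, fast);
      return 0;
  }

With four short pieces nobody will notice the difference, but turn that into thousands of strcat() calls and the rescanning starts to add up.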


Who's ASCIIng?

Pretty much every reference to characters or strings that I've made in this blog has been to the ascii system. All the formats discussed above have nevertheless been ascii examples. But of course human ingenuity would never settle for only one standard. Probably the first format for digital transmission of text was Morse Code. The dots & dashes it is made up of can maybe be considered as Ones & Twos rather than Zeros & Ones, but mathematically it is still binary.

Morse code is interesting because it is so condensed. If you're a newspaper buff ("Newspapers? what're they?"), you may be familiar with the nonsense phrase, "ETAOIN SHRDLU." These were the leftmost keys on a Linotype typesetting machine ("Linotype?") used when papers were printed from lead type. Anyway, "etaoin shrdlu" matched the order of the 12 most frequently used letters in publishing at the time (and to think - they had no computers to count them).

Standard Morse code encodes E as a single dot and T as a single dash, followed by the two 'bit' combinations for I, A, N, and M. Since Morse code has a time element embedded in its transmission (hmmm, maybe not strictly binary after all), the 3 dot S is actually faster to send than the 2 dash M. Why O ends up as 3 dashes is kind of strange - maybe because it makes 'SOS' sound good?

You can see the potential space savings compared to ascii. Even though unextended ascii is seven bits per character rather than the eight bits of the bytes it's usually stored in, both E and T would still take 1/7th the room of ascii, and I, A, N, & M each 2/7ths. Those six letters own a high percentage (~46%) of all the letters used in all the English text in all the world.

Not so fast. Since ascii comes in byte-sized chunks, it is easy to move from one ascii letter to another and tell them apart. How would you tell 3 Es from one S? You'd see three zeros in a row in each case. The time element of Morse code works against it now. The Morse standard calls for three dots' worth of space between letters and seven dots between words - the condensation is beginning to evaporate. Even worse, now there are three elements to deal with (dot, dash, and space), and suddenly we've broken the binary model altogether. We could get back to binary by representing the 'space' between letters with a pattern not already used, like six dots in a row, but by then we'll probably end up with something ultimately taking more space than ascii. Oh well, Morse only encodes 26 letters and 10 numbers anyway.

However, ya gotta ask, "Is there a way to take the frequency idea and come up with a way to store smaller strings than ascii?" Sure there is: it's called "Huffman coding". Starting with the frequency of letters, numbers, punctuation, and "whitespace" (in English, the space occurs even more often than E), the Huffman algorithm constructs a binary tree which yields small bit codes for common chars and longer ones for uncommon ones. Even better, the Huffman code is "self-synchronizing": no character's bit pattern is the beginning of any other's (the technical term is a "prefix code"), so the decoding algorithm can always tell a character from the one following it, even if they are all in a big line thousands of bits long.
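Here's a toy decoder to show the idea. The four-character code table is completely made up by me (a real Huffman tree would be grown from actual frequency counts over the whole character set), but it does have the prefix property, so the loop below never loses its place.

  #include <stdio.h>
  #include <string.h>

  /* A made-up prefix code for a four-character alphabet:
       e = 0,  t = 10,  a = 110,  space = 111
     No code is the start of another, so the decoder always knows
     where one character ends and the next begins. */
  int main(void)
  {
      const char *codes[]   = { "0", "10", "110", "111" };
      const char  letters[] = { 'e', 't', 'a', ' ' };

      const char *bits = "0110101110";       /* decodes to "eat e" */

      while (*bits) {
          for (int i = 0; i < 4; i++) {
              size_t len = strlen(codes[i]);
              if (strncmp(bits, codes[i], len) == 0) {
                  putchar(letters[i]);
                  bits += len;
                  break;
              }
          }
      }
      putchar('\n');
      return 0;
  }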

ASCII is also self-synchronizing, and very simply decoded: every eight bits there's a unique character. In comparison, decoding Huffman is a much more time consuming algorithm. Yet sometimes you want to trade time for space. Huffman coding can be used to compress other data besides text strings, and admittedly its more practical uses are for compressing pictures into formats like JPEG and sound into formats like MP3. Since David Huffman published his discovery in 1952, there have been many other methods of 'lossless data compression' developed, such as Arithmetic Coding, CTW, LZW ...

Enough with visiting the Ivory Tower; let's get back down to Earth. If you click the "View" menu in your browser and then choose something like "Character Encoding" or perhaps plain "Encoding," you will find lots of choices for the language your browser can display. These choices are really character encodings for the different alphabets of those languages. Yet one choice you probably will not find is ... ASCII.

Ascii is not exactly obsolete, but it has been swallowed up by the real world of the Internet and its hundreds of human languages with thousands of characters which simply won't fit into seven or eight bits. Here are some modern, or at least recent, coding systems for strings:

ANSEL
MARC #
IBM-OEM
ISO-8859-#
Windows-1252
Unicode
ISO 10646
UCS-#
UTF-#
GB 18030

The '#' means there are multiple versions of the standard to choose from. Go visit your favorite wiki or search engine and paste in some of the terms. You'll learn something new!

Chances are you'll also be a bit overwhelmed - I know I was. Is there an underlying rhyme or reason? Well, in all of them you can find your way back to ascii - sometimes easily, sometimes less so, but the backward compatibility is there. Ascii is the lingua franca of most programming languages, which is to say it embodies the characters that are allowed to be used in the source code of those langs. It's not going away any time soon. Will some future language make use of more characters? Will there be single-character operators based on the tilded 'n' or the umlauted 'u'? The Euro sign or the Yen sign? Perhaps.

If there's a fulcrum point on the list, it's "Unicode". Unicode has room for 1,112,064 different code points, way more than ascii's 128. It is the standard for the UCSs and UTFs listed below it. They come in different sizes, from 1 byte through 4, 8 bits through 32. Which is to say that UCS-4 is the same as UTF-32. And that UCS-2 is ... not ... the same as UTF-16. What the ...? Oh, both of them are 16 bits, but the first is fixed-width and the second is variable-width. While you're scratching your head over that one, don't forget to pay attention to the order of the 2 bytes within the 16 bits. Sorry it has to be so difficult.

If you want a fixed-size, brute-force format that will probably not be obsolete for a very long time, you can go with UTF-32. Of course, if most of your characters still happen to be ascii, you'll be storing only 7 bits of encoding in every 32 bit chunk. By necessity, any UTF-# less than 32 has to be shoehorned into a variable-sized storage format (in theory, a UTF-21 could be a complete fixed-size Unicode implementation, but who has ever built a 21 bit memory system?).

As you may imagine, variable sizing makes string processing even more complicated. The smaller UTF formats are actually quite cleverly designed, but there's no way to make working with them as straightforward as working with a simple array can be. To know where one letter ends and another begins, to stay synchronized in other words, can take a wee bit of processing.
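As a taste of that processing, here's a sketch of the UTF-8 half of it: the first byte of every character announces, in its top bits, how many bytes the character occupies.

  #include <stdio.h>

  /* UTF-8 lead bytes:
       0xxxxxxx -> 1 byte (plain ascii)
       110xxxxx -> 2 bytes
       1110xxxx -> 3 bytes
       11110xxx -> 4 bytes
     A byte of the form 10xxxxxx is a continuation byte in the middle
     of a character - landing on one means we've lost our place. */
  int utf8_char_length(unsigned char first_byte)
  {
      if (first_byte < 0x80)              return 1;
      if ((first_byte & 0xE0) == 0xC0)    return 2;
      if ((first_byte & 0xF0) == 0xE0)    return 3;
      if ((first_byte & 0xF8) == 0xF0)    return 4;
      return -1;      /* a continuation byte, or not valid UTF-8 at all */
  }

  int main(void)
  {
      /* The Euro sign, U+20AC, is stored in UTF-8 as the three bytes 0xE2 0x82 0xAC. */
      unsigned char euro[] = { 0xE2, 0x82, 0xAC };
      printf("%d\n", utf8_char_length(euro[0]));     /* prints 3 */
      return 0;
  }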

It is tempting to go with UCS-2, a fixed-width subset of UTF-16 that is considered obsolete because it only encodes the Unicode Basic Multilingual Plane (BMP), one "plane" out of a total of 17 planes in the Unicode standard. Yet most of the 1,112,064 code points have yet to be assigned, and the BMP currently has the lion's share of the ones that have been. Still, as more and more ideograms from Far Eastern languages get added, no doubt the empties will start to get full. Thus UCS-2 is obsolete. As a space-related version of Parkinson's Law would go: "Work expands so as to fill the space available for it to fit in."

It's almost time to move on to a lighter topic. Just a paragraph or two more to underline how mind-numbingly complex the subject of strings can be. First though, if you want to take a side trip to a good historical and practical overview of character encodings, visit Joel Spolsky's The Absolute Minimum ... (No Excuses!). There you will see a graphic of IBM-OEM, the first popular 8 bit extension to ascii - the one with the little smiley faces and suits of playing cards, even some astrological signs. It's fun to speculate about the arguments that may have surrounded the creation of this standard: "We need a code for pi much worse than we need one for Aquarius!" It ended up with both ... I miss IBM-OEM, although I expect it still exists on a Unicode plane somewhere.

You can probably imagine that translating between different character sets is really troublesome. Yep. When you're reading email or viewing a Web page do you sometimes see question marks, or rows of junk, or even square things that look like Mahjong tiles? Look closely and you'll see tiny hexadecimal numbers on those. What has happened is the application rendering the text can't handle certain characters being sent so it has to either default to a single "huh?" letter like a question mark, or else unknowingly mistranslate altogether, yielding unintelligible sequences of gibberish known as "mojibake". It's like reading a foreign language, only the language may indeed be your own, just in a 'foreign' format.

Ever try reading a Word document in Notepad? Mojibake Bonanza! That's because word processing programs like Word use all kinds of elaborate control codes for formatting text. Titles, tables, fonts and the like all have their special codes that don't translate into a simpler program like Notepad. One nice thing about HTML is that its formatting is actually in readable English. On this very browser, you can pick View —> Source from the menu and see what this very page looks like to me as I create it. All the links, new paragraphs, italics and whatnot are nestled between the brackets '<' & '>'. So why didn't the browser get confused when I used brackets right here? Because I didn't actually type them in from the keyboard, but rather put in a sequence like this one,

  ampersand-#62-semicolon   (but with no hyphens)

which prints the right bracket without actually being a right bracket.

Finally, for completeness sake, I'll mention "regular expressions". These are ways to really dig out patterns from strings. Here are two examples of matching to a pattern that I've borrowed from Wikipedia's article about them:

  cat|dog  matches  "cat" or "dog"  in a string.
  
  [hc]?at  matches  "hat", "cat", and "at"  in a string.

"Regexes" are very powerful, and are considered to be formal programming languages. They require an actual program to use them, and some languages like Perl & Awk have regular expression handling as part of their core syntax. Of course, the regex program still has to know which character encoding it's dealing with, hee hee.


Literally Confusing

As I said at the outset, computer languages pretend that strings are a fundamental data type when they are (in most languages) just another array. One way they pretend is with the "string literal". Remember this C example?

  printf ("This is my sentence.\n");

The quoted part of that statement, not including both quote marks, is the literal. A string literal can be assigned to a named array,

  char  buffer[] = "This is my sentence.\n";

then easily passed around to other parts of a program. What is worth noting is that the assignment automatically sizes the array to fit the string, minus the quotes but with the ending zero. It is equivalent to tediously sizing the array by hand and then initializing it with each individual ascii value:

  char  buffer[22] = {0x54, 0x68, 0x69, ... 0x65, 0x2e, 0xa, 0x0};

(so tedious in fact that I cheated with the 3 dots in the middle because I know you're smart enough to know what I'm getting at. I also used hex values just to make it as ugly as possible).
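One small way to watch the automatic sizing at work, assuming the buffer declaration above, is to ask the compiler itself:

  printf ("%zu\n", sizeof buffer);   /* prints 22: the sentence's 20 characters, the '\n', and the ending zero */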

So yes, strings are a part of the C language, yet they are a special abstraction of the more generalized system of C arrays. And why? Because computers have to speak English back to us! A language's usefulness can be maximized if it makes it easy for programmers to handle the I/O of human communication. In fact, "scripting languages" like Perl or JavaScript, which started out intended primarily for text handling, have been gaining popularity in recent years.

One of the quirky things I find fascinating about string literals is that they embody a sort of self-referential conundrum. Suppose you want to emphasize that the example string is "my" sentence:

  printf ("This is "my" sentence.\n");

This print statement is going to be very confused about seeing four quote marks rather than the expected two. It won't work right, probably won't even compile. What is needed is a way to 'escape' the inner quotes:

  printf ("This is \"my\" sentence.\n");

The left-leaning backslash is C's way of telling itself that the character following it is a special case: "Don't interpret the next char the way I usually would." Now try this one:

  printf ("In C, \n stands for line feed.\n");

This is going to display as:

  In C, 
   stands for line feed.

Looks like we have to escape the escape!

  printf ("In C, \\n stands for line feed.\n");

Hey, let's infect this string with "leaning toothpick syndrome" (no, I didn't invent that expression):

  printf ("In C, \"\\n\" stands for \"line feed\".\n");

It's all about avoiding "delimiter collision". Some languages use a different delimiter character than the double quote. Some have multiple delimiter choices; some, like Perl, even let you choose your own,

  print qq@This is "my" sentence.@;

where the first character after 'qq' becomes the delimiter for that string. This is clever and very flexible, but what if you want the string to display all the possible keys on a keyboard? Then you'll be glad that Perl also offers the common backslash escape mechanism.

I guess I like this subtopic because there appears to be no magic bullet, no perfect delimiting method that will absolutely guarantee that every desired string inserted into source code will always be taken literally, so to speak. Especially if the source code is itself the output of another program. Something might work right 99.99999% of the time, but there's always that .0000001 special case out there to trip over. Even if the method allows use of some random, garbled pattern of characters as the delimiter, somewhere, sometime, that same weird pattern may show up in someone's actual text string. There's no escape!





         Introduction to "The Perfect Language"    —   Table of Contents
         Controlling Complexity    —   Previous
         Secure Thoughts    —   Next
         Glossary




Started: June 22, 2011