Welcome to the home page of General Application Processing™. Read about the creation of a Flexible System Architecture™ for the next generation of computer, System-On-Chip, and multicore processor design.

Screen shot of the FSA™ Simulator thirteen cycles into a CORDIC computation of sine and cosine. Input: 45 degrees. Output: cosine is at the top of P1 at level C3, sine just below it. Both values are .70703125 in a fixed-point format with thirteen bits to the right of the binary point.

For fun: click the pic or HERE for an explanation of the CORDIC computing system
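For readers who'd like to see the idea in miniature, here is a hedged sketch of CORDIC's rotation mode in Python. It illustrates the general algorithm only; the FSA simulator's actual fixed-point pipeline is not reproduced, and the 13-iteration count simply matches the thirteen cycles in the screen shot.

```python
import math

def cordic_sin_cos(angle_deg, iterations=13):
    """Compute (sin, cos) via CORDIC rotation mode.

    Illustrative floating-point model only; the FSA simulator's
    fixed-point implementation is not reproduced here.
    """
    # Precompute the rotation angles atan(2^-i) and the CORDIC gain.
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for a in atans:
        gain *= math.cos(a)          # product of cos(atan(2^-i)) ~ 0.60725

    x, y = gain, 0.0                 # start pre-scaled so the result is unit length
    z = math.radians(angle_deg)      # residual angle still to rotate through
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate so the residual converges to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return y, x                      # (sin, cos)

s, c = cordic_sin_cos(45.0)
```

After thirteen iterations both results agree with sqrt(2)/2 ≈ 0.7071 to within about one least-significant bit of a 13-fraction-bit format, which is consistent with the .70703125 shown in the screen shot.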
Happy New Year — 2013
More serious: scroll DOWN for the NEW Home Page

So it has been a while since I updated this home page. Of course there's a reason —

I've read in various places that one thing you don't want to do is rewrite a big system from scratch. "You will regret it!" I actually believed this, in fact. It helped that examples were given - I think Netscape was one of them. Yahoo too, possibly.

But in my case my thinking went, "Well, my FSA Simulator isn't all that large, and besides, I wrote 99% of it, so I know it well. Shouldn't be a problem."

It's almost two years later and I'm only now seeing the light at the end of the tunnel. Youch!

It's not like I didn't have good reason. The old version just ran too slowly, and I had designed mucho new instructions to implement (and I distrusted the robustness of the original instruction processing code). Fair enough, probably had to be reworked anyway. But - Double Youch!

Besides the time hit, the worst thing is finding out how much really bad code you wrote in earlier versions. And you cannot stay willfully blind to that, because even though you think you are merely going to rewrite a small block of functions ... they call other functions. So sometimes you have to make changes all the way up branches into the leafy nodes. Meanwhile, time keeps on slippin' into the future...

Guest Pages

Brian's CNC Site
Brian's Tesla Lab

GenAPro welcomes two additions to the site. These areas showcase the hard work of Brian Foley and his active mind and hands.

New for 2010 / 2011   —— Essays ——   "The Perfect Language"

            Original static taxonomy graphic from EDN Magazine *

Distinguished Visitor —

I am changing the format of this homepage to a simpler, hopefully more readable question and answer format, similar to a column you might see in a technical magazine. Of course in this case, the 'editor' and the 'interviewee' are both the same person. But why not?

I've also reduced the links to the bar above. The original home page is here, with its greater detail, and most of the pages on the bar contain a fuller set of links to the site.

So, let's get right into the "interview":

In 25 words or less ...

The FSA provides extremely low-level control of things that run extremely fast.

Bit-twiddling, in other words.

Yes. On things that run the fastest: fast transistors, fast gates, fast logic.

And what makes the FSA able to keep up?

And hopefully, stay ahead! A couple of things:

Its "Standard Sequencer" is extremely small and simple. It has no microcode; indeed, it has very little control logic at all, which keeps instruction cycle times short. The instructions themselves are minimal and use minimal pipelining.

About half the instructions (and the most commonly used) operate directly out of memory. It takes one gate delay to recognize these instructions, and then BOOM, they're on their way; or, boom, they've checked a test bit and are already loading a response. The rest fall into a decoder that's simple enough to keep up with the clock rate.
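Since the real FSA encoding isn't public, here is a purely hypothetical sketch of that dispatch style: assume a 16-bit word whose top bit marks the memory-direct class, with everything else falling through to a small field decode. The bit positions and field widths are my own inventions, for illustration only.

```python
# Hypothetical illustration only: the real FSA instruction encoding
# is not public. Assume a 16-bit word whose top bit marks the class
# of instructions that operate directly out of memory (recognized by
# a single bit test, i.e. one gate delay), with the rest routed
# through a simple opcode-field decode.

def dispatch(word):
    if word & 0x8000:                    # one-bit test: "direct" instruction
        return ("direct", word & 0x7FFF)     # payload goes straight on its way
    opcode = (word >> 12) & 0x7          # otherwise a simple field decode
    return ("decoded", opcode, word & 0x0FFF)

print(dispatch(0x8123))   # a "direct" word
print(dispatch(0x3456))   # a word for the decoder
```

The point of the sketch is the asymmetry: the common case is settled by one bit test, and only the remainder touches the decoder at all.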

That covers speed, what about power?

If you're talking about power dissipation, I would expect the FSA to be used primarily within applications where the circuitry is running pretty hot anyway. Still, a 16-bit core should certainly use less power than a 32-bit core, if that's a requirement.

On the other hand, if you're asking about processing power, the FSA has a multi-path ALU available to crunch control data. One path is based on the venerable TTL 181, yielding common add, subtract, compare, and bitwise logic functionality. But there's more: other paths perform proprietary operations that a standard ALU can't do, at least not without multiple clock cycles. So both time and code size can be reduced. I should add that the ALU also has open paths available for custom development. Even more power can be factored in.
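As a rough illustration of that first path, here is a sketch of the '181-style operations named above (add, subtract, compare, and bitwise logic) on 16-bit words. This models only the common functionality under my own assumptions; the full 74181 function-select table and the proprietary FSA paths are not reproduced.

```python
MASK = 0xFFFF  # 16-bit word

# Sketch of the "181-style" ALU path only: the add/subtract/compare
# and bitwise operations named in the text, on 16-bit words with
# two's-complement wraparound. The proprietary FSA paths and the
# 74181's full function-select table are not modeled here.
def alu(op, a, b):
    a &= MASK
    b &= MASK
    if op == "add":
        return (a + b) & MASK
    if op == "sub":
        return (a - b) & MASK          # wraps modulo 2^16
    if op == "cmp":
        return (a > b) - (a < b)       # -1, 0, or 1, like a compare result
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "xor":
        return a ^ b
    raise ValueError(op)
```

For example, `alu("add", 0xFFFF, 1)` wraps to 0, and `alu("sub", 0, 1)` yields 0xFFFF, matching 16-bit two's-complement behavior.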

And, it doesn't stop there. Even a multi-path ALU can still be limited by bus contention issues, something that in a multicore system can bog things down very easily. So the FSA has an overlaid communication system that bypasses the ALU altogether. I can't really go into that any further here, except to say that two hands are usually better than one.

Imagine the dealer in a game of blackjack. The dealer has to pass out cards to the players around the table, and also has to be responsive to input from those players ("hit me"), which can affect the order of the card distribution.

The FSA is obviously the dealer in this analogy. What makes it different from the zillion other cores out there? First, the FSA is optimized to be the fastest dealer in the business.

One way to take advantage of its speed is to have a game with only smart players. That is to say, in a friendly game the dealer might offer some advice from time to time ("Should I stay on 16?"), but this just slows down play while the dealer responds. When the FSA is the 'dealer', the 'players' should definitely have primary responsibility for their own play.

To stretch the blackjack analogy further, the FSA is also the dealer with the most cards. Within its instruction set's addressing range, the architecture can pass a plethora of information to IP subsystems over multiple paths. Again, it all comes down to what a computer design is optimized for. Not only is the FSA a Control Architecture, but the control is fast and ... dense.

You wrote on your original home page that "the FSA machine language is its own microcode," yet you just said that there was no microcoding of the sequencer.

If you want to fit a general descriptive category, you could say it is closest to vertical microcode. The difference is, where is that microcode aimed? It's aimed at application logic rather than at its own hardware.

When I was first developing the instruction set, I had a type of indirect instruction that I assumed I would have to "microcode," that is, have two standard sequencers side by side and use one of them to step the other through the completion of the instruction. That turned out to be unnecessary. A closer look at the problem revealed that the supposed target sequencer had enough control built in to handle the instruction by itself.

Yet the facility is still there to put one sequencer beside another to handle more complex tasks. In fact, I like to bundle a "four-pack" of them together to allow for easy context switching and various kinds of threaded code. But so far, no need for one to take over and run another.

Speaking of threaded code, I recently wrote a Forth interpreter / compiler for the FSA. It almost seemed like cheating to have four interoperating sequencers available to spread the dictionary, stacks, and everything else into. I have to say I gained a lot of respect for Chuck Moore and other Forth pioneers who did their development on single micros with only one memory space and limited scratch registers.
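Threaded code itself is simple enough to sketch in a few lines. The following is a minimal direct-threaded interpreter in Python, purely an illustration of the general technique, not the FSA Forth; the word names and the list-of-functions encoding are my own stand-ins.

```python
# Minimal direct-threaded interpreter sketch. This illustrates the
# general "threaded code" idea only -- not the author's FSA Forth.
# A program is a flat list: each entry is either a word (a function)
# or inline data consumed by the preceding word (e.g. a literal).
stack = []

def lit(ip, code):            # push the literal stored in the next cell
    stack.append(code[ip])
    return ip + 1             # skip over the inline literal

def add(ip, code):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)
    return ip

def dup(ip, code):
    stack.append(stack[-1])
    return ip

def run(code):
    ip = 0
    while ip < len(code):
        word = code[ip]
        ip += 1
        ip = word(ip, code)   # each word advances the thread itself

# "3 4 + DUP +"  ->  (3 + 4) * 2 = 14
run([lit, 3, lit, 4, add, dup, add])
print(stack)  # [14]
```

The inner loop is just "fetch next word, call it," which is why threaded interpreters map so naturally onto small, fast sequencers.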

Anyway, to get back to my original point, the microinstructions are primarily meant to be sent directly to IP subsystems, the sequencer acting like the dealer in a game of cards. [see sidebar]

Of course, sometimes the IP wants to talk back, so a full quarter of the instruction set is dedicated to responding to what comes over an internal test bus.

And if that test bus itself gets overloaded?

Why, then just drop in more sequencers. They're small, and some IP may not need the full ALU capability, so localize the control resources, and buses, where they're most needed.

A lot of people may have trouble with the 16 bit size ...

I understand. "The '80s called and they want their architecture back." Look, this FSA core doesn't care whether your data paths are 64 bits or a thousand bits wide, or how much memory your application is addressing; its job is to act as a traffic cop directing the bigger, wider bustle going on around it. I mean, your head is smaller than your body, right?

I have long held that sixteen bits is the best word size for this kind of low-level control, and I'm absolutely convinced of it. Yes, if you're doing long computations or need floating point, you'll want more bits, but to control something? If you need more than 16 bits, you aren't thinking the problem through!

But is there any way to extend this architecture if that's what the market demands?

Well, the instructions that contain addresses have them on the left-hand side, so they could be extended indefinitely. But the instructions that don't contain addresses would then carry a lot of wasted bits, and I don't like wasted bits. So while I do think it's easier to expand a 16-bit design to 32 bits than the other way around, I believe using multiple 16-bit engines as distributed cores is a better way to go, particularly for control purposes.

So, what applications do you see for the FSA?

Hopefully you noticed the animated .gif at the top of this section. The FSA is meant to make more low-level decisions than any other architecture, and to make them faster. Being a general-purpose machine, it could handle the whole left side of the picture out to maybe a third, but I tried to make the graphic point to its sweet spot: a classification area where no architecture has gone before.

That is its general application target. If you want specifics, I'll admit I'm not sure yet exactly what they might be. One application I added to the Executive Summary a while back was network processing. Expensive custom logic ruled the day at the time, and the FSA seemed a perfect fit. By now there are several commercial NPUs out there, and typically, creating software for them is the biggest bottleneck - a game in which the FSA would now be playing catch-up. One aspect of network processing I think it would excel at even now is Deep Packet Inspection.

Still, you get the idea. Potential niches are always coming along. The core features of rapid processing, inherent parallelism, and a small silicon (or GaAs?) footprint will always apply to high performance and multicore design. Throw in superior "decision processing," and you have something that's pushing the envelope. The question is, which envelope?

How about standalone or embedded applications?

I remember when the first Palm Pilot came out; I was working on an earlier version of this architecture at the time, and I thought, "Damn, that should be MY chip in there!" But it wasn't yet ready for prime time.

I suppose you could say using an architecture as an on-chip core is a form of "deep embedding," so maybe after getting its instruction set entrenched in several such designs, the FSA could begin to gain popularity in standalone development. That should be a few years down the road if it happens.

Nowadays, while you could certainly make one heck of a single board computer out of the FSA, how many such systems would it be competing against? That niche seems pretty full to me. Yet, couple the FSA with its somewhat quirky simulator / development system, and there might be surprising acceptance from students and hobbyists. Time will tell.

Any final words?

Well, this may be merely an inventor's conceit, but the FSA is a beautiful architecture. Elegance may not always win in the marketplace, but it is certainly more enjoyable to develop within.

Bob Loy, Founder

* From EDN magazine, Issue 11, 6/11/2009
"Tee up your multiprocessing options" by Robert Cravotta