Three years ago, IBM, Sony and Toshiba announced a partnership aimed at developing a new processor for use in digital entertainment devices like the PlayStation. Since then, the product has seen a billion dollars in development work. Two fabs, one in Tokyo and one in East Fishkill, New York, have been custom-built to make the new processor in large volumes. On May 12th, IBM announced that the first commercial workstations based on this processor would become available to game-industry developers late this year.
A lot is known about this processor as planned, but relatively little real information about the product as built has yet leaked. To the extent that performance information has become available, it has been characterized by numbers so high that most people simply dismissed the reports. In November of last year, for example, a senior Sony executive told an internal audience that implementations would scale from uniprocessors to 64-way groupings that would deliver in excess of two teraflops, making it more than 10 times faster than Intel's Xeon.
Most of what we know about this machine comes from U.S. patent #6,526,491 as issued to Sony in February 2003 for a “memory protection system and method for computer architecture for broadband networks.”
Here’s the abstract:
A computer architecture and programming model for high speed processing over broadband networks are provided. The architecture employs a consistent modular structure, a common computing module and uniform software cells. The common computing module includes a control processor, a plurality of processing units, a plurality of local memories from which the processing units process programs, a direct memory access controller and a shared main memory.
A synchronized system and method for the coordinated reading and writing of data to and from the shared main memory by the processing units also are provided. A hardware sandbox structure is provided for security against the corruption of data among the programs being processed by the processing units. The uniform software cells contain both data and applications and are structured for processing by any of the processors of the network. Each software cell is uniquely identified on the network. A system and method for creating a dedicated pipeline for processing streaming data also are provided.
The machine is widely referred to as a cell processor, but the cells involved are software, not hardware. Thus a cell is a kind of TCP packet on steroids, containing both data and instructions and linked back to the task of which it forms part via unique identifiers that facilitate results assembly just as the TCP sequence number does.
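To make the analogy concrete, here is a minimal Python sketch of what such a cell might carry. The field names (task_id, sequence, program, data, cell_id) are illustrative assumptions drawn from the patent's general description, not its actual formats.

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class SoftwareCell:
    """Toy model of a patent-style software cell: code plus the data it
    operates on, tagged so results can be reassembled in order."""
    task_id: str    # identifies the larger job this cell belongs to
    sequence: int   # position within that job, like a TCP sequence number
    program: bytes  # the instructions to run
    data: bytes     # the operands the program works on
    cell_id: str = field(default_factory=lambda: str(uuid4()))  # unique on the network

def reassemble(results):
    """Order completed cells by task and sequence, as a receiver orders TCP segments."""
    return sorted(results, key=lambda c: (c.task_id, c.sequence))
```

The point is only that, as with TCP, the identifiers let results from independently executed cells be put back together in order, wherever on the network they were processed.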
Outrageous Performance Claims
The basic processor itself appears to be a PowerPC derivative with high-speed built-in local communications, high-speed access to local memory, and up to eight attached processing units broadly akin to the Altivec short array processor used by Apple. The actual product consists of one to eight of these on a chip — a true grid-on-a-chip approach in which a four-way assembly can, when fully populated, consist of four core CPUs, 32 attached processing units and 512 MB of local memory.
The per-cycle performance of the core CPU is undocumented but may be expected to be comparable to that of other PowerPC machines running at high cache hit rates. Specifications for the four or eight attached processors making up the array are known; each is expected to turn in one floating point operation per cycle, or around 32 Gigaflops for a fully populated eight-unit array at a nominal 4 GHz.
That’s where the apparently outrageous performance claims come from; a four-way assembly running at a planned 4 GHz offers 32 x 4 = 128 Gigaflops in potential floating-point execution. A 64-way supergrid made by stacking eight eight-way assemblies would have a total of 512 attached processors and could, therefore, break 2 teraflops if data transportation kept up with the processors.
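The arithmetic behind those numbers is easy to check. The short Python sketch below simply multiplies out the figures quoted above; it says nothing about achievable, as opposed to peak, throughput.

```python
CLOCK_GHZ = 4          # planned clock rate
FLOPS_PER_CYCLE = 1    # per attached processing unit, as assumed above
APUS_PER_CORE = 8      # fully populated core CPU

def peak_gflops(core_cpus):
    """Peak floating-point rate, ignoring data-transport and coordination costs."""
    return core_cpus * APUS_PER_CORE * FLOPS_PER_CYCLE * CLOCK_GHZ

print(peak_gflops(1))   # 32   Gigaflops: one fully populated core
print(peak_gflops(4))   # 128  Gigaflops: a four-way assembly
print(peak_gflops(64))  # 2048 Gigaflops (~2 teraflops): eight eight-way assemblies
```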
In practice, however, Apple has never succeeded in getting the bulk of its developers to make effective use of the Altivec, and Sun has had essentially no success getting people outside the military and intelligence communities to use the four-way SIMD capabilities built into its Sparc processors. Grid computing is slowly entering the commercial mainstream, but combining local array processing with grid computing requires a significant shift in programming paradigm that will not appeal to the mainstream Wintel and IBM customer base.
Gains Outweigh the Pain
For games developers, however, the potential gains, up to 50 times what the best x86-based processor and graphics board combinations can deliver, should outweigh the pain. Even minor software changes, the kind of thing Adobe does to take advantage of the Altivec in Photoshop, should offer significant advantages to a wider programming community and enable floating-point-intensive applications to run a full order of magnitude more quickly on this machine than on Intel's best.
An important point to bear in mind is that this processor will be inexpensive, and systems built around it will be inexpensive too because no external graphics or network boards will be needed. Both Sony and IBM have been building fabs specifically to make this device. Volumes will be high because Sony will use up to 20 million assemblies in the PlayStation, while 10 million or more that don't quite make the quality cut will get used in its digital televisions and other products.
Very little has been publicly revealed about the operating system for this thing, but it is quite obvious what it has to be and how it has to work. Each core will have its own local Unix kernel, with most just executing cells as they arrive from the dispatch manager and one managing the traffic-coordination hardware. In all likelihood, the kernel used will prove to be both Linux-derived and Linux-compatible — meaning that most Linux software will run out of the box on the uniprocessor configuration while software adapted for the grid environment will run unchanged on everything from the uniprocessor to configurations with hundreds or even thousands of processor assemblies.
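As a way of picturing that arrangement, here is a deliberately naive Python sketch of one control kernel feeding cells to per-core workers. Everything in it, including run_cell(), is an assumption made for illustration rather than anything IBM or Sony has described.

```python
from queue import Queue
from threading import Thread

def run_cell(cell):
    # Placeholder for handing a cell's program and data to an attached
    # processing unit and collecting the result.
    return ("done", cell)

def worker(inbox, results):
    # One worker per core: execute cells as they arrive from the dispatcher.
    while True:
        cell = inbox.get()
        if cell is None:          # shutdown signal from the control kernel
            break
        results.put(run_cell(cell))

inbox, results = Queue(), Queue()
workers = [Thread(target=worker, args=(inbox, results)) for _ in range(4)]
for w in workers:
    w.start()
for cell in ["cell-0", "cell-1", "cell-2", "cell-3"]:   # cells arriving off the network
    inbox.put(cell)
for _ in workers:
    inbox.put(None)
for w in workers:
    w.join()
```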
As users of Sun's open-source grid software have found, performance losses on single processes grow as you add processors because data flow and timing control issues become nonlinearly more complex as the system grows. Fundamentally, what happens is that the larger you make the total machine, whether on one piece of silicon or in a rack, the more cell transit time dominates execution time and the greater the performance cost imposed by the need to coordinate operations.
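A toy calculation illustrates the effect. In the sketch below, each additional processor is assumed to add a fixed two percent of coordination and transit delay to any single process; that figure is arbitrary and purely illustrative, not a measurement of anything.

```python
def single_process_efficiency(processors, overhead_per_peer=0.02):
    """Toy model only: if every extra processor adds a fixed slice of
    coordination and cell-transit delay, a single process's useful share
    of each cycle shrinks as the machine grows."""
    return 1.0 / (1 + overhead_per_peer * (processors - 1))

for n in (1, 8, 64, 512):
    print(n, round(single_process_efficiency(n), 2))
# prints roughly 1.0, 0.88, 0.44 and 0.09 as the machine scales up
```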
New Generation of Linux PCs
The patent mentions the use of no-ops (processor nulls) inserted into cells to get around timing problems associated with having components run at different speeds, with processor coordination initially enforced by setting TTL-like time budgets for cell execution. My guess, however, is that advances in cell isolation and in programming for asynchronous event handling have since made those solutions obsolete.
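The no-op idea is simple enough to sketch. Assuming, purely for illustration, that a time budget is expressed in cycles and that instruction timings are known, the padding calculation would look something like this:

```python
def pad_with_noops(instruction_count, cycles_per_instruction, budget_cycles):
    """Return how many no-ops to append so execution fills the whole budget.
    All names and units here are assumptions for illustration only."""
    used = instruction_count * cycles_per_instruction
    if used > budget_cycles:
        raise ValueError("cell exceeds its time budget")
    return budget_cycles - used

# A 1,000-instruction cell on a 1-cycle-per-instruction unit, given a
# 1,200-cycle budget, gets 200 no-ops appended.
print(pad_with_noops(1000, 1, 1200))
```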
I expect, therefore, that when the real thing appears, it will fully support both the traditional grid format for on-chip work and an asynchronous hypergrid for multi-assembly processes, on the model Thinking Machines hoped to achieve with its hypercube designs in the mid-1980s and that the NSA is rumored to actually have built on the Sparc-based CM-5 of the early 1990s.
Either way, however, the OS for this machine is likely to offer both Linux compatibility at the low end and enormous scalability for those willing to modify their software, which is why, as I discuss in next week's column, I expect IBM and Toshiba to soon launch a new generation of Linux PCs built around the combination of this CPU with IBM software products like Lotus Workplace for Linux.
See “Linux on Intel: Think Dead Man Walking” and “Grid vs. SMP: The Empire Tries Again” for additional coverage on this topic by Paul Murphy…
Paul Murphy, a LinuxInsider columnist, wrote and published The Unix Guide to Defenestration. Murphy is a 20-year veteran of the IT consulting industry, specializing in Unix and Unix-related management issues.