Differences between current version and revision by previous author of HowToParallelProcessingHOWTO.
Newer page: | version 3 | Last edited on Thursday, October 21, 2004 6:22:24 pm | by AristotlePagaltzis | |
Older page: | version 2 | Last edited on Saturday, July 20, 2002 12:02:53 am | by !PerryLorier | Revert |
@@ -1,5804 +1 @@
-Linux Parallel Processing HOWTO
-
-
-
-----
-
-!!!Linux Parallel Processing HOWTO
-
-!!Hank Dietz,
-pplinux@ecn.purdue.edu, v980105, 5 January 1998
-
-
-----
-''__Parallel Processing__ refers to the concept of speeding-up the
-execution of a program by dividing the program into multiple fragments
-that can execute simultaneously, each on its own processor. A program
-being executed across ''N'' processors might execute ''N''
-times faster than it would using a single processor. This document
-discusses the four basic approaches to parallel processing that are
-available to Linux users: SMP Linux systems, clusters of networked
-Linux systems, parallel execution using multimedia instructions (i.e.,
-MMX), and attached (parallel) processors hosted by a Linux system.''
-----
-
-
-
-
-!!1. Introduction
-
-
-*1.1 Is Parallel Processing What I Want?
-
-*1.2 Terminology
-
-*1.3 Example Algorithm
-
-*1.4 Organization Of This Document
-
-
-
-
-
-!!2. SMP Linux
-
-
-*2.1 SMP Hardware
-
-*2.2 Introduction To Shared Memory Programming
-
-*2.3 bb_threads
-
-*2.4 !LinuxThreads
-
-*2.5 System V Shared Memory
-
-*2.6 Memory Map Call
-
-
-
-
-
-!!3. Clusters Of Linux Systems
-
-
-*3.1 Why A Cluster?
-
-*3.2 Network Hardware
-
-*3.3 Network Software Interface
-
-*3.4 PVM (Parallel Virtual Machine)
-
-*3.5 MPI (Message Passing Interface)
-
-*3.6 AFAPI (Aggregate Function API)
-
-*3.7 Other Cluster Support Libraries
-
-*3.8 General Cluster References
-
-
-
-
-
-!!4. SIMD Within A Register (e.g., using MMX)
-
-
-*4.1 SWAR: What Is It Good For?
-
-*4.2 Introduction To SWAR Programming
-
-*4.3 MMX SWAR Under Linux
-
-
-
-
-
-!!5. Linux-Hosted Attached Processors
-
-
-*5.1 A Linux PC Is A Good Host
-
-*5.2 Did You DSP That?
-
-*5.3 FPGAs And Reconfigurable Logic Computing
-
-
-
-
-
-!!6. Of General Interest
-
-
-*6.1 Programming Languages And Compilers
-
-*6.2 Performance Issues
-
-*6.3 Conclusion - It's Out There
-
-----
-
-!!1. Introduction
-
-
-
-
-
-__Parallel Processing__ refers to the concept of speeding-up the
-execution of a program by dividing the program into multiple fragments
-that can execute simultaneously, each on its own processor. A program
-being executed across ''n'' processors might execute ''n''
-times faster than it would using a single processor.
-
-
-
-
-
-
-
-
-Traditionally, multiple processors were provided within a specially
-designed "parallel computer"; along these lines, Linux now supports
-__SMP__ systems (often sold as "servers") in which multiple
-processors share a single memory and bus interface within a single
-computer. It is also possible for a group of computers (for example,
-a group of PCs each running Linux) to be interconnected by a network
-to form a parallel-processing __cluster__. The third alternative
-for parallel computing using Linux is to use the __multimedia
-instruction extensions__ (i.e., MMX) to operate in parallel on
-vectors of integer data. Finally, it is also possible to use a Linux
-system as a "host" for a specialized __attached__ parallel
-processing compute engine. All these approaches are discussed in
-detail in this document.
-
-
-
-
-!!1.1 Is Parallel Processing What I Want?
-
-
-
-
-
-
-Although use of multiple processors can speed-up many operations, most
-applications cannot yet benefit from parallel processing. Basically,
-parallel processing is appropriate only if:
-
-
-
-
-
-*Your application has enough parallelism to make good use of
-multiple processors. In part, this is a matter of identifying
-portions of the program that can execute independently and
-simultaneously on separate processors, but you will also find that
-some things that ''could'' execute in parallel might actually slow
-execution if executed in parallel using a particular system. For
-example, a program that takes four seconds to execute within a single
-machine might be able to execute in only one second of processor time
-on each of four machines, but no speedup would be achieved if it took
-three seconds or more for these machines to coordinate their actions.
-
-*
-
-*Either the particular application program you are interested in
-already has been __parallelized__ (rewritten to take advantage of
-parallel processing) or you are willing to do at least some new coding
-to take advantage of parallel processing.
-
-*
-
-*You are interested in researching, or at least becoming familiar
-with, issues involving parallel processing. Parallel processing using
-Linux systems isn't necessarily difficult, but it is not familiar to
-most computer users, and there isn't any book called "Parallel
-Processing for Dummies"... at least not yet. This HOWTO is a good
-starting point, not all you need to know.
-*
-
-
-
-
-
-
-The good news is that if all the above are true, you'll find that
-parallel processing using Linux can yield supercomputer performance
-for some programs that perform complex computations or operate on
-large data sets. What's more, it can do that using cheap hardware...
-which you might already own. As an added bonus, it is also easy to
-use a parallel Linux system for other things when it is not busy
-executing a parallel job.
-
-
-If parallel processing is ''not'' what you want, but you would
-like to achieve at least a modest improvement in performance, there are
-still things you can do. For example, you can improve performance of
-sequential programs by moving to a faster processor, adding memory,
-replacing an IDE disk with fast wide SCSI, etc. If that's all you are
-interested in, jump to section 6.2; otherwise, read on.
-
-
-
-
-!!1.2 Terminology
-
-
-
-
-
-
-Although parallel processing has been used for many years in many
-systems, it is still somewhat unfamiliar to most computer users.
-Thus, before discussing the various alternatives, it is important to
-become familiar with a few commonly used terms.
-
-
-
-
-; __SIMD:__:
-
-SIMD (Single Instruction stream, Multiple Data stream) refers to a
-parallel execution model in which all processors execute the same
-operation at the same time, but each processor is allowed to operate
-upon its own data. This model naturally fits the concept of
-performing the same operation on every element of an array, and is
-thus often associated with vector or array manipulation. Because all
-operations are inherently synchronized, interactions among SIMD
-processors tend to be easily and efficiently implemented.
-
-
-
-; __MIMD:__:
-
-MIMD (Multiple Instruction stream, Multiple Data stream) refers to a
-parallel execution model in which each processor is essentially acting
-independently. This model most naturally fits the concept of
-decomposing a program for parallel execution on a functional basis;
-for example, one processor might update a database file while another
-processor generates a graphic display of the new entry. This is a
-more flexible model than SIMD execution, but it is achieved at the
-risk of debugging nightmares called __race conditions__, in which
-a program may intermittently fail due to timing variations reordering
-the operations of one processor relative to those of another.
-
-
-
-; __SPMD:__:
-
-SPMD (Single Program, Multiple Data) is a restricted version of MIMD
-in which all processors are running the same program. Unlike SIMD,
-each processor executing SPMD code may take a different control flow
-path through the program.
-
-
-
-; __Communication Bandwidth:__:
-
-The bandwidth of a communication system is the maximum amount of data
-that can be transmitted in a unit of time... once data transmission
-has begun. Bandwidth for serial connections is often measured in
-__baud__ or __bits/second (b/s)__, which generally
-correspond to 1/10 to 1/8 that many __Bytes/second (B/s)__. For
-example, a 1,200 baud modem transfers about 120 B/s, whereas a 155
-Mb/s ATM network connection is nearly 130,000 times faster,
-transferring about 17 MB/s. High bandwidth allows large blocks
-of data to be transferred efficiently between processors.
-
-
-
-; __Communication Latency:__:
-
-The latency of a communication system is the minimum time taken to
-transmit one object, including any send and receive software
-overhead. Latency is very important in parallel processing because it
-determines the minimum useful __grain size__, the minimum run
-time for a segment of code to yield speed-up through parallel
-execution. Basically, if a segment of code runs for less time than it
-takes to transmit its result value (i.e., latency), executing that
-code segment serially on the processor that needed the result value
-would be faster than parallel execution; serial execution would avoid
-the communication overhead.
-
-
-
-; __Message Passing:__:
-
-Message passing is a model for interactions between processors within
-a parallel system. In general, a message is constructed by software
-on one processor and is sent through an interconnection network to
-another processor, which then must accept and act upon the message
-contents. Although the overhead in handling each message (latency)
-may be high, there are typically few restrictions on how much
-information each message may contain. Thus, message passing can yield
-high bandwidth making it a very effective way to transmit a large
-block of data from one processor to another. However, to minimize the
-need for expensive message passing operations, data structures within
-a parallel program must be spread across the processors so that most
-data referenced by each processor is in its local memory... this task
-is known as __data layout__.
-
-
-
-; __Shared Memory:__:
-
-Shared memory is a model for interactions between processors within a
-parallel system. Systems like the multi-processor Pentium machines
-running Linux __physically__ share a single memory among
-their processors, so that a value written to shared memory by one
-processor can be directly accessed by any processor. Alternatively,
-__logically__ shared memory can be implemented for
-systems in which each processor has its own memory by converting each
-non-local memory reference into an appropriate inter-processor
-communication. Either implementation of shared memory is generally
-considered easier to use than message passing. Physically shared
-memory can have both high bandwidth and low latency, but only when
-multiple processors do not try to access the bus simultaneously; thus,
-data layout still can seriously impact performance, and cache effects,
-etc., can make it difficult to determine what the best layout is.
-
-
-
-; __Aggregate Functions:__:
-
-In both the message passing and shared memory models, a communication
-is initiated by a single processor; in contrast, aggregate function
-communication is an inherently parallel communication model in which
-an entire group of processors act together. The simplest such action
-is a __barrier synchronization__, in which each individual
-processor waits until every processor in the group has arrived at the
-barrier. By having each processor output a datum as a side-effect of
-reaching a barrier, it is possible to have the communication hardware
-return a value to each processor which is an arbitrary function of the
-values collected from all processors. For example, the return value
-might be the answer to the question "did any processor find a
-solution?" or it might be the sum of one value from each processor.
-Latency can be very low, but bandwidth per processor also tends to be
-low. Traditionally, this model is used primarily to control parallel
-execution rather than to distribute data values.
-
-
-
-; __Collective Communication:__:
-
-This is another name for aggregate functions, most often used when
-referring to aggregate functions that are constructed using multiple
-message-passing operations.
-
-
-
-; __SMP:__:
-
-SMP (Symmetric Multi-Processor) refers to the operating system concept
-of a group of processors working together as peers, so that any piece
-of work could be done equally well by any processor. Typically, SMP
-implies the combination of MIMD and shared memory. In the IA32 world,
-SMP generally means compliant with MPS (the Intel !MultiProcessor
-Specification); in the future, it may mean "Slot 2"....
-
-
-
-; __SWAR:__:
-
-SWAR (SIMD Within A Register) is a generic term for the concept of
-partitioning a register into multiple integer fields and using
-register-width operations to perform SIMD-parallel computations across
-those fields. Given a machine with ''k''-bit registers, data
-paths, and function units, it has long been known that ordinary
-register operations can function as SIMD parallel operations on as
-many as ''n'' field values of ''k''/''n'' bits each. Although
-this type of parallelism can be implemented using ordinary integer
-registers and instructions, many high-end microprocessors have
-recently added specialized instructions to enhance the performance of
-this technique for multimedia-oriented tasks. In addition to the
-Intel/AMD/Cyrix __MMX__ (!MultiMedia eXtensions), there are:
-Digital Alpha __MAX__ (MultimediA eXtensions), Hewlett-Packard
-PA-RISC __MAX__ (Multimedia Acceleration eXtensions), MIPS
-__MDMX__ (Digital Media eXtension, pronounced "Mad Max"), and Sun
-SPARC V9 __VIS__ (Visual Instruction Set). Aside from the three
-vendors who have agreed on MMX, all of these instruction set
-extensions are roughly comparable, but mutually incompatible.
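-
-To make the idea concrete, the following is a minimal sketch, in
-ordinary C rather than MMX code, of the masking trick that SWAR relies
-on: it adds four unsigned 8-bit fields packed into one 32-bit word
-without letting a carry in one field spill into its neighbor (the
-function name is ours, purely for illustration):
-
-----
-
-unsigned int
-swar_add_bytes(unsigned int a, unsigned int b)
-{
-/* add the low 7 bits of each field; these sums cannot carry out */
-unsigned int sum_low = (a & 0x7f7f7f7f) + (b & 0x7f7f7f7f);
-/* fold each field's top-bit contribution back in; at bit 7,
-XOR is the same as adding 128 modulo 256 */
-return(sum_low ^ ((a ^ b) & 0x80808080));
-}
-
-----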
-
-
-
-; __Attached Processors:__:
-
-Attached processors are essentially special-purpose computers that are
-connected to a __host__ system to accelerate specific types of
-computation. For example, many video and audio cards for PCs contain
-attached processors designed, respectively, to accelerate common
-graphics operations and audio __DSP__ (Digital Signal
-Processing). There is also a wide range of attached __array
-processors__, so called because they are designed to accelerate
-arithmetic operations on arrays. In fact, many commercial
-supercomputers are really attached processors with workstation hosts.
-
-
-
-; __RAID:__:
-
-RAID (Redundant Array of Inexpensive Disks) is a simple technology for
-increasing both the bandwidth and reliability of disk I/O. Although
-there are many different variations, all have two key concepts in
-common. First, each data block is __striped__ across a group of
-''n+k'' disk drives such that each drive only has to read or
-write 1/''n'' of the data... yielding ''n'' times the
-bandwidth of one drive. Second, redundant data is written so that
-data can be recovered if a disk drive fails; this is important because
-otherwise if any one of the ''n+k'' drives were to fail, the
-entire file system could be lost. A good overview of RAID in general
-is given at
-http://www.dpt.com/uraiddoc.html, and
-information about RAID options for Linux systems is at
-http://linas.org/linux/raid.html. Aside from specialized RAID
-hardware support, Linux also supports software RAID 0, 1, 4, and 5
-across multiple disks hosted by a single Linux system; see the
-Software RAID mini-HOWTO and the Multi-Disk System Tuning mini-HOWTO
-for details. RAID across disk drives ''on multiple machines in a
-cluster'' is not directly supported.
-
-
-
-; __IA32:__:
-
-IA32 (Intel Architecture, 32-bit) really has nothing to do with
-parallel processing, but rather refers to the class of processors whose
-instruction sets are generally compatible with that of the Intel 386.
-Basically, any Intel x86 processor after the 286 is compatible with
-the 32-bit flat memory model that characterizes IA32. AMD and Cyrix
-also make a multitude of IA32-compatible processors. Because Linux
-evolved primarily on IA32 processors and that is where the commodity
-market is centered, it is convenient to use IA32 to distinguish any of
-these processors from the PowerPC, Alpha, PA-RISC, MIPS, SPARC, etc.
-The upcoming IA64 (64-bit with EPIC, Explicitly Parallel Instruction
-Computing) will certainly complicate matters, but Merced, the first
-IA64 processor, is not scheduled for production until 1999.
-
-
-
-; __COTS:__:
-
-Since the demise of many parallel supercomputer companies, COTS
-(Commercial Off-The-Shelf) is commonly discussed as a requirement for
-parallel computing systems. To be fanatically pure about it, the only COTS
-parallel processing techniques using PCs are things like SMP Windows
-NT servers and various MMX Windows applications; it really doesn't pay
-to be that fanatical. The underlying concept of COTS is really
-minimization of development time and cost. Thus, a more useful, more
-common, meaning of COTS is that at least most subsystems benefit from
-commodity marketing, but other technologies are used where they are
-effective. Most often, COTS parallel processing refers to a cluster
-in which the nodes are commodity PCs, but the network interface and
-software are somewhat customized... typically running Linux and
-applications codes that are freely available (e.g., copyleft or public
-domain), but not literally COTS.
-
-
-!!1.3 Example Algorithm
-
-
-
-
-
-
-In order to better understand the use of the various parallel
-programming approaches outlined in this HOWTO, it is useful to have an
-example problem. Although just about any simple parallel algorithm
-would do, by selecting an algorithm that has been used to demonstrate
-various other parallel programming systems, it becomes a bit easier to
-compare and contrast approaches. M. J. Quinn's book, ''Parallel
-Computing Theory And Practice'', second edition, !McGraw Hill, New
-York, 1994, uses a parallel algorithm that computes the value of Pi to
-demonstrate a variety of different parallel supercomputer programming
-environments (e.g., nCUBE message passing, Sequent shared memory). In
-this HOWTO, we use the same basic algorithm.
-
-
-The algorithm computes the approximate value of Pi by summing the area
-under the curve 4/(1 + ''x''*''x'') between 0 and 1. As a purely sequential C program, the
-algorithm looks like:
-
-
-
-----
-
-#include <stdlib.h>
-#include <stdio.h>
-int
-main(int argc, char **argv)
-{
-register double width, sum;
-register int intervals, i;
-/* get the number of intervals */
-intervals = atoi(argv[[1]);
-width = 1.0 / intervals;
-/* do the computation */
-sum = 0;
-for (i=0; i<intervals; ++i) {
-register double x = (i + 0.5) * width;
-sum += 4.0 / (1.0 + x * x);
-}
-sum *= width;
-printf("Estimation of pi is %f\n", sum);
-return(0);
-}
-
-----
-
-
-However, this sequential algorithm easily yields an "embarrassingly
-parallel" implementation. The area is subdivided into intervals, and
-any number of processors can each independently sum the intervals
-assigned to it, with no need for interaction between processors. Once
-the local sums have been computed, they are added together to create a
-global sum; this step requires some level of coordination and
-communication between processors. Finally, this global sum is printed
-by one processor as the approximate value of Pi.
-
-
-In this HOWTO, the various parallel implementations of this algorithm
-appear where each of the different programming methods is discussed.
-
-
-
-
-!!1.4 Organization Of This Document
-
-
-
-
-
-
-The remainder of this document is divided into five parts. Sections
-2, 3, 4, and 5 correspond to the four different types of hardware
-configurations supporting parallel processing using Linux:
-
-
-
-
-
-*Section 2 discusses SMP Linux systems. These directly support
-MIMD execution using shared memory, although message passing also is
-implemented easily. Although Linux supports SMP configurations up to
-16 processors, most SMP PC systems have either two or four identical
-processors.
-
-*
-
-*Section 3 discusses clusters of networked machines, each running
-Linux. A cluster can be used as a parallel processing system that
-directly supports MIMD execution and message passing, perhaps also
-providing logically shared memory. Simulated SIMD execution and
-aggregate function communication also can be supported, depending on
-the networking method used. The number of processors in a cluster can
-range from two to thousands, primarily limited by the physical wiring
-constraints of the network. In some cases, various types of machines
-can be mixed within a cluster; for example, a network combining DEC
-Alpha and Pentium Linux systems would be a __heterogeneous
-cluster__.
-
-*
-
-*Section 4 discusses SWAR, SIMD Within A Register. This is a
-very restrictive type of parallel execution model, but on the other
-hand, it is a built-in capability of ordinary processors. Recently,
-MMX (and other) instruction set extensions to modern processors have
-made this approach even more effective.
-
-*
-
-*Section 5 discusses the use of Linux PCs as hosts for simple
-parallel computing systems. Either as an add-in card or as an
-external box, attached processors can provide a Linux system with
-formidable processing power for specific types of applications. For
-example, inexpensive ISA cards are available that provide multiple DSP
-processors offering hundreds of MFLOPS for compute-bound problems.
-However, these add-in boards are ''just'' processors; they
-generally do not run an OS, have disk or console I/O capability, etc.
-To make such systems useful, the Linux "host" must provide these
-functions.
-*
-
-
-
-
-
-
-The final section of this document covers aspects that are of general
-interest for parallel processing using Linux, not specific to a
-particular one of the approaches listed above.
-
-
-As you read this document, keep in mind that we haven't tested
-everything, and a lot of stuff reported
here "still has a research
-character" (a nice way to say "doesn't quite work like it should" ;-).
-However, parallel processing using Linux is useful now, and an
-increasingly large group is working to make it better.
-
-
-The author of this HOWTO is Hank Dietz, Ph.D., currently Associate
-Professor of Electrical and Computer Engineering at Purdue University,
-in West Lafayette, IN, 47907-1285. Dietz retains rights to this
-document as per the Linux Documentation Project guidelines. Although
-an effort has been made to ensure the correctness and fairness of this
-presentation, neither Dietz nor Purdue University can be held
-responsible for any problems or errors, and Purdue University does not
-endorse any of the work/products discussed.
-
-
-
-----
-
-!!2. SMP Linux
-
-
-
-
-
-This document gives a brief overview of how to use
-SMP Linux systems
-for parallel processing. The most up-to-date information on SMP Linux
-is probably available via the SMP Linux project mailing list; send
-email to
-majordomo@vger.rutgers.edu with the text subscribe
-linux-smp to join the list.
-
-
-Does SMP Linux really work? In June 1996, I purchased a brand new
-(well, new off-brand ;-) two-processor 100MHz Pentium system. The
-fully assembled system, including both processors, Asus motherboard,
-256K cache, 32M RAM, 1.6G disk, 6X CDROM, Stealth 64, and 15" Acer
-monitor, cost a total of $1,800. This was just a few hundred
-dollars more than a comparable uniprocessor system. Getting SMP Linux
-running was simply a matter of installing the "stock" uniprocessor
-Linux, recompiling the kernel with the SMP=1 line in the
-makefile uncommented (although I find setting SMP to
-1 a bit ironic ;-), and informing lilo about the new
-kernel. This system performs well enough, and has been stable enough,
-to serve as my primary workstation ever since. In summary, SMP Linux
-really does work.
-
-
-The next question is how much high-level support is available for
-writing and executing shared memory parallel programs under SMP Linux.
-Through early 1996, there wasn't much. Things have changed. For
-example, there is now a very complete POSIX threads library.
-
-
-Although performance may be lower than for native shared-memory
-mechanisms, an SMP Linux system also can use most parallel processing
-software that was originally developed for a workstation cluster using
-socket communication. Sockets (see section 3.3) work within an SMP
-Linux system, and even for multiple SMPs networked as a cluster.
-However, sockets imply a lot of unnecessary overhead for an SMP. Much
-of that overhead is within the kernel or interrupt handlers; this
-worsens the problem because SMP Linux generally allows only one
-processor to be in the kernel at a time and the interrupt controller
-is set so that only the boot processor can process interrupts.
-Despite this, typical SMP communication hardware is so much better
-than most cluster networks that cluster software will often run better
-on an SMP than on the cluster for which it was designed.
-
-
-The remainder of this section discusses SMP hardware, reviews the
-basic Linux mechanisms for sharing memory across the processes of a
-parallel program, makes a few observations about atomicity, volatility,
-locks, and cache lines, and finally gives some pointers to other
-shared memory parallel processing resources.
-
-
-
-
-!!2.1 SMP Hardware
-
-
-
-
-
-
-Although SMP systems have been around for many years, until very
-recently, each such machine tended to implement basic functions
-differently enough so that operating system support was not portable.
-The thing that has changed this situation is Intel's Multiprocessor
-Specification, often referred to as simply __MPS__. The MPS 1.4
-specification is currently available as a PDF file at
-http://www.intel.com/design/pro/datashts/242016.htm, and there
-is a brief overview of MPS 1.1 at
-http://support.intel.com/oem_developer/ial/support/9300.HTM,
-but be aware that Intel does re-arrange their WWW site often. A wide
-range of
-vendors are building MPS-compliant systems supporting up to
-four processors, but MPS theoretically allows many more processors.
-
-
-The only non-MPS, non-IA32, systems supported by SMP Linux are Sun4m
-multiprocessor SPARC machines. SMP Linux supports most Intel MPS
-version 1.1 or 1.4 compliant machines with up to sixteen 486DX,
-Pentium, Pentium MMX, Pentium Pro, or Pentium II processors.
-Unsupported IA32 processors include the Intel 386, Intel 486SX/SLC
-processors (the lack of floating point hardware interferes with the
-SMP mechanisms), and AMD & Cyrix processors (they require
-different SMP support chips that do not seem to be available at this
-writing).
-
-
-It is important to understand that the performance of MPS-compliant
-systems can vary widely. As expected, one cause for performance
-differences is processor speed: faster clock speeds tend to yield
-faster systems, and a Pentium Pro processor is faster than a Pentium.
-However, MPS does not really specify how hardware implements shared
-memory, but only how that implementation must function from a software
-point of view; this means that performance is also a function of how
-the shared memory implementation interacts with the characteristics of
-SMP Linux and your particular programs.
-
-
-The primary way in which systems that comply with MPS differ is in how
-they implement access to physically shared memory.
-
-
-
-
-!Does each processor have its own L2 cache?
-
-
-
-
-
-Some MPS Pentium systems, and all MPS Pentium Pro and Pentium II
-systems, have independent L2 caches. (The L2 cache is packaged within
-the Pentium Pro or Pentium II modules.) Separate L2 caches are
-generally viewed as maximizing compute performance, but things are not
-quite so obvious under Linux. The primary complication is that the
-current SMP Linux scheduler does not attempt to keep each process on
-the same processor, a concept known as __processor affinity__.
-This may change soon; there has recently been some discussion about
-this in the SMP Linux development community under the title "processor
-binding." Without processor affinity, having separate L2 caches may
-introduce significant overhead when a process is given a timeslice on
-a processor other than the one that was executing it last.
-
-
-Many relatively inexpensive systems are organized so that two Pentium
-processors share a single L2 cache. The bad news is that this causes
-contention for the cache, seriously degrading performance when running
-multiple independent sequential programs. The good news is that many
-parallel programs might actually benefit from the shared cache because
-if both processors want to access the same line from shared
-memory, only one has to fetch it into cache and contention for the bus
-is averted. The lack of processor affinity also causes less damage
-with a shared L2 cache. Thus, for parallel programs, it isn't really
-clear that sharing L2 cache is as harmful as one might expect.
-
-
-Experience with our dual Pentium shared 256K cache system shows quite
-a wide range of performance depending on the level of kernel activity
-required. At worst, we see only about 1.2x speedup. However, we also
-have seen up to 2.1x speedup, which suggests that compute-intensive
-SPMD-style code really does profit from the "shared fetch" effect.
-
-
-
-
-!Bus configuration?
-
-
-
-
-
-The first thing to say is that most modern systems connect the
-processors to one or more PCI buses that in turn are "bridged" to one
-or more ISA/EISA buses. These bridges add latency, and both EISA and
-ISA generally offer lower bandwidth than PCI (ISA being the lowest), so
-disk drives, video cards, and other high-performance devices generally
-should be connected via a PCI bus interface.
-
-
-Although an MPS system can achieve good speed-up for many
-compute-intensive parallel programs even if there is only one PCI bus,
-I/O operations occur at no better than uniprocessor performance...
-and probably a little worse due to bus contention from the
-processors. Thus, if you are looking to speed-up I/O, make sure that
-you get an MPS system with multiple independent PCI busses and I/O
-controllers (e.g., multiple SCSI chains). You will need to be careful
-to make sure SMP Linux supports what you get. Also keep in mind that
-the current SMP Linux essentially allows only one processor in the
-kernel at any time, so you should choose your I/O controllers
-carefully to pick ones that minimize the kernel time required for each
-I/O operation. For really high performance, you might even consider
-doing raw device I/O directly from user processes, without a system
-call... this isn't necessarily as hard as it sounds, and need not
-compromise security (see section 3.3 for a description of the basic
-techniques).
-
-
-It is important to note that the relationship between bus speed and
-processor clock rate has become very fuzzy over the past few years.
-Although most systems now use the same PCI clock rate, it is not
-uncommon to find a faster processor clock paired with a slower bus
-clock. The classic example of this was that the Pentium 133 generally
-used a faster bus than a Pentium 150, with appropriately
-strange-looking performance on various benchmarks. These effects are
-amplified in SMP systems; it is even more important to have a faster
-bus clock.
-
-
-
-
-!Memory interleaving and DRAM technologies?
-
-
-
-
-
-Memory interleaving actually has nothing whatsoever to do with MPS,
-but you will often see it mentioned for MPS systems because these
-systems are typically more demanding of memory bandwidth. Basically,
-two-way or four-way interleaving organizes RAM so that a block access
-is accomplished using multiple banks of RAM rather than just one.
-This provides higher memory access bandwidth, particularly for cache
-line loads and stores.
-
-
-The waters are a bit muddied about this, however, because EDO DRAM and
-various other memory technologies tend to improve similar kinds of
-operations. An excellent overview of DRAM technologies is given in
-http://www.pcguide.com/ref/ram/tech.htm.
-
-
-So, for example, is it better to have 2-way interleaved EDO DRAM or
-non-interleaved SDRAM? That is a very good question with no simple
-answer, because both interleaving and exotic DRAM technologies tend to
-be expensive. The same dollar investment in more ordinary memory
-configurations generally will give you a significantly larger main
-memory. Even the slowest DRAM is still a heck of a lot faster than
-using disk-based virtual memory....
-
-
-
-
-!!2.2 Introduction To Shared Memory Programming
-
-
-
-
-
-
-Ok, so you have decided that parallel processing on an SMP is a great
-thing to do... how do you get started? Well, the first step is to
-learn a little bit about how shared memory communication really works.
-
-
-It sounds like you simply have one processor store a value into memory
-and another processor load it; unfortunately, it isn't quite that
-simple. For example, the relationship between processes and
-processors is very blurry; however, if we have no more active
-processes than there are processors, the terms are roughly
-interchangeable. The remainder of this section briefly summarizes the
-key issues that could cause serious problems, if you were not aware of
-them: the two different models used to determine what is shared,
-atomicity issues, the concept of volatility, hardware lock
-instructions, cache line effects, and Linux scheduler issues.
-
-
-
-
-!Shared Everything Vs. Shared Something
-
-
-
-
-
-There are two fundamentally different models commonly used for shared
-memory programming: __shared everything__ and __shared
-something__. Both of these models allow processors to communicate
-by loads and stores from/into shared memory; the distinction comes in
-the fact that shared everything places all data structures in shared
-memory, while shared something requires the user to explicitly
-indicate which data structures are potentially shared and which are
-__private__ to a single processor.
-
-
-Which shared memory model should you use? That is mostly a question of
-religion. A lot of people like the shared everything model because
-they do not really need to identify which data structures should be
-shared at the time they are declared... you simply put locks around
-potentially-conflicting accesses to shared objects to ensure that only
-one process(or) has access at any moment. Then again, that really
-isn't all that simple... so many people prefer the relative safety of
-shared something.
-
-
-
-
-!Shared Everything
-
-
-
-
-
-The nice thing about sharing everything is that you can easily take an
-existing sequential program and incrementally convert it into a shared
-everything parallel program. You do not have to first determine which
-data need to be accessible by other processors.
-
-
-Put simply, the primary problem with sharing everything is that any
-action taken by one processor could affect the other processors. This
-problem surfaces in two ways:
-
-
-
-
-
-*Many libraries use data structures that simply are not
-sharable. For example, the UNIX convention is that most functions can
-return an error code in a variable called errno; if two shared
-everything processes perform various calls, they would interfere with
-each other because they share the same errno. Although there
-is now a library version that fixes the errno problem,
-similar problems still exist in most libraries. For example, unless
-special precautions are taken, the X library will not work if calls
-are made from multiple shared everything processes.
-
-*
-
-*Normally, the worst-case behavior for a program with a bad
-pointer or array subscript is that the process that contains the
-offending code dies. It might even generate a core file that
-clues you in to what happened. In shared everything parallel
-processing, it is very likely that the stray accesses will bring the
-demise of ''a process other than the one at fault'', making it
-nearly impossible to localize and correct the error.
-*
-
-
-
-Neither of these types of problems is common when shared something is
-used, because only the explicitly-marked data structures are shared.
-It also is fairly obvious that shared everything only works if all
-processors are executing the exact same memory image; you cannot use
-shared everything across multiple different code images (i.e., can use
-only SPMD, not general MIMD).
-
-
-The most common type of shared everything programming support is a
-__threads library__.
-Threads are essentially "light-weight" processes that might
-not be scheduled in the same way as regular UNIX processes and, most
-importantly, share access to a single memory map. The POSIX
-Pthreads package has been the focus of a number of porting
-efforts; the big question is whether any of these ports actually run
-the threads of a program in parallel under SMP Linux (ideally, with a
-processor for each thread). The POSIX API doesn't require it, and
-versions like
-http://www.aa.net/~mtp/PCthreads.html
-apparently do not implement parallel thread execution - all the threads
-of a program are kept within a single Linux process.
-
-
-The first threads library that supported SMP Linux parallelism was the
-now somewhat obsolete bb_threads library,
-ftp://caliban.physics.utoronto.ca/pub/linux/, a very small
-library that used the Linux clone() call to fork new,
-independently scheduled, Linux processes all sharing a single address
-space. SMP Linux machines can run multiple of these "threads" in
-parallel because each "thread" is a full Linux process; the trade-off
-is that you do not get the same "light-weight" scheduling control
-provided by some thread libraries under other operating systems. The
-library used a bit of C-wrapped assembly code to install a new chunk
-of memory as each thread's stack and to provide atomic access
-functions for an array of locks (mutex objects). Documentation
-consisted of a README and a short sample program.
-
-
-More recently, a version of POSIX threads using clone() has
-been developed. This library,
-!LinuxThreads, is clearly the preferred shared everything
-library for use under SMP Linux. POSIX threads are well documented,
-and the
-!LinuxThreads README and
-!LinuxThreads FAQ are very well done. The primary problem now
-is simply that POSIX threads have a lot of details to get right and
-!LinuxThreads is still a work in progress. There is also the problem
-that the POSIX thread standard has evolved through the standardization
-process, so you need to be a bit careful not to program for obsolete
-early versions of the standard.
-
-
-
-
-!Shared Something
-
-
-
-
-
-Shared something is really "only share what needs to be shared." This
-approach can work for general MIMD (not just SPMD) provided that care
-is taken for the shared objects to be allocated at the same places in
-each processor's memory map. More importantly, shared something makes
-it easier to predict and tune performance, debug code, etc. The only
-problems are:
-
-
-
-
-
-*It can be hard to know beforehand what really needs to be shared.
-
-*
-
-*The actual allocation of objects in shared memory may be awkward,
-especially for what would have been stack-allocated objects. For
-example, it may be necessary to explicitly allocate shared objects in
-a separate memory segment, requiring separate memory allocation
-routines and introducing extra pointer indirections in each reference.
-*
-
-
-
-Currently, there are two very similar mechanisms that allow groups of
-Linux processes to have independent memory spaces, all sharing only a
-relatively small memory segment. Assuming that you didn't foolishly
-exclude "System V IPC" when you configured your Linux system, Linux
-supports a very portable mechanism that has generally become known as
-"System V Shared Memory." The other alternative is a memory mapping
-facility whose implementation varies widely across different UNIX
-systems: the mmap() system call. You can, and should, learn
-about these calls from the manual pages... but a brief overview of
-each is given in sections 2.5 and 2.6 to help get you started.
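-
-As a small preview of the second route, the sketch below assumes a
-Linux mmap() that accepts the MAP_SHARED and MAP_ANONYMOUS flags to
-create a shared segment that is then inherited by any processes
-created with fork(); treat it as illustrative only, since section 2.6
-covers the real details:
-
-----
-
-#include <sys/mman.h>
-#include <stdio.h>
-int
-main(int argc, char **argv)
-{
-/* map one page of memory that later fork()ed children will share */
-volatile int *shared = (volatile int *)
-mmap(0, 4096, PROT_READ | PROT_WRITE,
-MAP_SHARED | MAP_ANONYMOUS, -1, 0);
-if (shared == (volatile int *) MAP_FAILED) {
-perror("mmap");
-return(1);
-}
-shared[[0] = 42; /* a store any sharing process can load */
-return(0);
-}
-
-----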
-
-
-
-
-!Atomicity And Ordering
-
-
-
-
-
-No matter which of the above two models you use, the result is pretty
-much the same: you get a pointer to a chunk of read/write memory that
-is accessible by all processes within your parallel program. Does
-that mean I can just have my parallel program access shared memory
-objects as though they were in ordinary local memory? Well, not
-quite....
-
-
-__Atomicity__ refers to the concept that an operation on an
-object is accomplished as an indivisible, uninterruptible, sequence.
-Unfortunately, sharing memory access does not imply that all
-operations on data in shared memory occur atomically. Unless special
-precautions are taken, only simple load or store operations that occur
-within a single bus transaction (i.e., aligned 8, 16, or 32-bit
-operations, but not misaligned nor 64-bit operations) are atomic.
-Worse still, "smart" compilers like GCC will often perform
-optimizations that could eliminate the memory operations needed to
-ensure that other processors can see what this processor has done.
-Fortunately, both these problems can be remedied... leaving only the
-relationship between access efficiency and cache line size for us to
-worry about.
-
-
-However, before discussing these issues, it is useful to point-out
-that all of this assumes that memory references for each processor
-happen in the order in which they were coded. The Pentium does this,
-but Intel's documentation notes that future processors might not. So, for future
-processors, keep in mind that it may be necessary to surround some
-shared memory accesses with instructions that cause all pending memory
-accesses to complete, thus providing memory access ordering. The
-CPUID instruction apparently is reserved to have this
-side-effect.
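-
-If you want to experiment now, a fence of that kind might be wrapped
-in a macro along the following lines; this is only a sketch based on
-the serializing behavior just described, not something taken from any
-library:
-
-----
-
-/* force pending memory accesses to complete before continuing;
-CPUID serializes the processor and the "memory" clobber keeps GCC
-from moving memory operations across this point */
-#define MEMORY_FENCE() \
-__asm__ __volatile__ ("xorl %%eax,%%eax\n\tcpuid" \
-: : : "eax", "ebx", "ecx", "edx", "memory")
-
-----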
-
-
-
-
-!Volatility
-
-
-
-
-
-To prevent GCC's optimizer from buffering values of shared memory
-objects in registers, all objects in shared memory should be declared
-as having types with the volatile attribute. If this is
-done, all shared object reads and writes that require just one word
-access will occur atomically. For example, suppose that ''p''
-is a pointer to an integer, where both the pointer and the integer it
-will point at are in shared memory; the ANSI C declaration might be:
-
-
-
-----
-
-volatile int * volatile p;
-
-----
-
-
-In this code, the first volatile refers to the int
-that p will eventually point at; the second volatile
-refers to the pointer itself. Yes, it is annoying, but it is the
-price one pays for enabling GCC to perform some very powerful
-optimizations. At least in theory, the -traditional option
-to GCC might suffice to produce correct code at the expense of some
-optimization, because pre-ANSI K&R C essentially claimed that all
-variables were volatile unless explicitly declared as
-register. Still, if your typical GCC compile looks like
-cc -O6 ''...'', you really will want to explicitly mark
-things as volatile only where necessary.
-
-
-There has been a rumor to the effect that using assembly-language
-locks that are marked as modifying all processor registers will cause
-GCC to appropriately flush all variables, thus avoiding the
-"inefficient" compiled code associated with things declared as
-volatile. This hack appears to work for statically allocated
-global variables using version 2.7.0 of GCC... however, that behavior
-is ''not'' required by the ANSI C standard. Still worse, other
-processes that are making only read accesses can buffer the values in
-registers forever, thus ''never'' noticing that the shared memory
-value has actually changed. In summary, do what you want, but only
-variables accessed through volatile are ''guaranteed''
-to work correctly.
-
-
-Note that you can cause a volatile access to an ordinary variable by
-using a type cast that imposes the volatile attribute. For
-example, the ordinary int i; can be referenced as a volatile
-by *((volatile int *) &i); thus, you can explicitly
-invoke the "overhead" of volatility only where it is critical.
-
-
-
-
-!Locks
-
-
-
-
-
-If you thought that ++i; would always work to add one to a
-variable i in shared memory, you've got a nasty little
-surprise coming: even if coded as a single instruction, the load and
-store of the result are separate memory transactions, and other
-processors could access i between these two transactions.
-For example, having two processes both perform ++i; might
-only increment i by one, rather than by two. According to
-the Intel Pentium "Architecture and Programming Manual," the
-LOCK prefix can be used to ensure that any of the following
-instructions is atomic relative to the data memory location it
-accesses:
-
-
-
-----
-
-BTS, BTR, BTC mem, reg/imm
-XCHG reg, mem
-XCHG mem, reg
-ADD, OR, ADC, SBB, AND, SUB, XOR mem, reg/imm
-NOT, NEG, INC, DEC mem
-CMPXCHG, XADD
-
-----
-
-
-However, it probably is not a good idea to use all these operations.
-For example, XADD did not even exist for the 386, so coding
-it may cause portability problems.
-
-
-The XCHG instruction ''always'' asserts a lock, even
-without the LOCK prefix, and thus is clearly the preferred
-atomic operation from which to build higher-level atomic constructs
-such as semaphores and shared queues. Of course, you can't get GCC to
-generate this instruction just by writing C code... instead, you must
-use a bit of in-line assembly code. Given a word-size volatile object
-''obj'' and a word-size register value ''reg'', the GCC
-in-line assembly code is:
-
-
-
-----
-
-__asm__ __volatile__ ("xchgl %1,%0"
-:"=r" (reg), "=m" (obj)
-:"r" (reg), "m" (obj));
-
-----
-
-
-Examples of GCC in-line assembly code using bit operations for locking
-are given in the source code for the
-bb_threads library.
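-
-To show where this leads, here is a minimal sketch, not taken from any
-library and with function names of our own choosing, of how the xchgl
-operation above can be turned into a simple spin lock (remember that
-the data the lock protects should still be declared volatile, as
-discussed earlier):
-
-----
-
-volatile int lck = 0; /* 0 = free, 1 = held */
-void
-spin_lock(volatile int *l)
-{
-register int reg;
-do {
-reg = 1;
-/* atomically swap 1 into the lock, getting the old value back */
-__asm__ __volatile__ ("xchgl %1,%0"
-:"=r" (reg), "=m" (*l)
-:"0" (reg), "m" (*l));
-} while (reg); /* old value 1 means another processor held it; retry */
-}
-void
-spin_unlock(volatile int *l)
-{
-*l = 0; /* an aligned word store releases the lock atomically */
-}
-/* usage: spin_lock(&lck); ...critical section...; spin_unlock(&lck); */
-
-----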
-
-
-It is important to remember, however, that there is a cost associated
-with making memory transactions atomic. A locking operation carries a
-fair amount of overhead and may delay memory activity from other
-processors, whereas ordinary references may use local cache. The best
-performance results when locking operations are used as infrequently
-as possible. Further, these IA32 atomic instructions obviously are not
-portable to other systems.
-
-
-There are many alternative approaches that allow ordinary instructions
-to be used to implement various synchronizations, including __mutual
-exclusion__ - ensuring that at most one processor is updating a
-given shared object at any moment. Most OS textbooks discuss at least
-one of these techniques. There is a fairly good discussion in the
-Fourth Edition of ''Operating System Concepts'', by Abraham
-Silberschatz and Peter B. Galvin, ISBN 0-201-50480-4.
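-
-To give a flavor of these techniques, the sketch below is the classic
-two-process algorithm usually credited to Peterson, written with
-ordinary loads and stores of volatile variables; it depends on the
-strongly ordered memory accesses discussed earlier and is shown only
-to convey the idea, not as a replacement for the locked instructions:
-
-----
-
-volatile int interested[[2] = {0, 0};
-volatile int turn = 0;
-void
-enter_region(int me) /* me is 0 or 1 */
-{
-register int other = 1 - me;
-interested[[me] = 1; /* announce interest... */
-turn = other; /* ...but yield priority to the other process */
-while (interested[[other] && (turn == other)) {
-/* busy wait until it is safe to enter the critical section */
-}
-}
-void
-leave_region(int me)
-{
-interested[[me] = 0;
-}
-
-----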
-
-
-
-
-!Cache Line Size
-
-
-
-
-
-One more fundamental atomicity concern can have a dramatic impact on
-SMP performance: cache line size. Although the MPS standard requires
-references to be coherent no matter what caching is used, the fact is
-that when one processor writes to a particular line of memory, every
-cached copy of the old line must be invalidated or updated. This
-implies that if two or more processors are both writing data to
-different portions of the same line a lot of cache and bus traffic may
-result, effectively to pass the line from cache to cache. This problem
-is known as __false sharing__. The solution is simply to try to
-''organize data so that what is accessed in parallel tends to come
-from a different cache line for each process''.
-
-
-You might be thinking that false sharing is not a problem using a
-system with a shared L2 cache, but remember that there are still
-separate L1 caches. Cache organization and number of separate levels
-can both vary, but the Pentium L1 cache line size is 32 bytes and
-typical external cache line sizes are around 256 bytes. Suppose that
-the addresses (physical or virtual) of two items are ''a'' and
-''b'' and that the largest per-processor cache line size is
-''c'', which we assume to be a power of two. To be very precise,
-if ((int) ''a'') & ~(''c'' - 1) is equal to
-((int) ''b'') & ~(''c'' - 1), then both
-references are in the same cache line. A simpler rule is that if
-shared objects being referenced in parallel are at least ''c''
-bytes apart, they should map to different cache lines.
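-
-In practice, that usually means padding per-process data out to the
-worst-case line size. A minimal sketch, assuming the 256-byte figure
-mentioned above (the structure and constant names are ours):
-
-----
-
-#define CACHE_LINE 256 /* assumed worst-case cache line size */
-struct percpu {
-volatile int value; /* the datum each process updates */
-char pad[[CACHE_LINE - sizeof(int)]; /* keep neighbors a line apart */
-};
-/* counts[[0] and counts[[1] are at least CACHE_LINE bytes apart */
-struct percpu counts[[2];
-
-----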
-
-
-
-
-!Linux Scheduler Issues
-
-
-
-
-
-Although the whole point of using shared memory for parallel
-processing is to avoid OS overhead, OS overhead can come from things
-other than communication per se. We have already said that the number
-of processes that should be constructed is less than or equal to the
-number of processors in the machine. But how do you decide exactly
-how many processes to make?
-
-
-For best performance, ''the number of processes in your parallel
-program should be equal to the expected number of your program's
-processes that simultaneously can be running on different
-processors''. For example, if a four-processor SMP typically has
-one process actively running for some other purpose (e.g., a WWW
-server), then your parallel program should use only three processes.
-You can get a rough idea of how many other processes are active on
-your system by looking at the "load average" quoted by the
-uptime command.
-
-
-Alternatively, you could boost the priority of the processes in your
-parallel program using, for example, the renice command or
-nice() system call. You must be privileged to increase
-priority. The idea is simply to force the other processes out of
-processors so that your program can run simultaneously across all
-processors. This can be accomplished somewhat more explicitly using
-the prototype version of SMP Linux at
-http://luz.cs.nmt.edu/~rtlinux/, which offers real-time
-schedulers.
-
-
-If you are not the only user treating your SMP system as a parallel
-machine, you may also have conflicts between the two or more parallel
-programs trying to execute simultaneously. The standard solution is
-__gang scheduling__ - i.e., manipulating scheduling priority so
-that at any given moment, only the processes of a single parallel
-program are running. It is useful to recall, however, that using more
-parallelism tends to have diminishing returns and scheduler activity
-adds overhead. Thus, for example, it is probably better for a
-four-processor machine to run two programs with two processes each
-rather than gang scheduling between two programs with four processes
-each.
-
-
-There is one more twist to this. Suppose that you are developing a
-program on a machine that is heavily used all day, but will be fully
-available for parallel execution at night. You need to write and test
-your code for correctness with the full number of processes, even
-though you know that your daytime test runs will be slow. Well, they
-will be ''very'' slow if you have processes __busy waiting__
-for shared memory values to be changed by other processes that are not
-currently running (on other processors). The same problem occurs if
-you develop and test your code on a single-processor system.
-
-
-The solution is to embed calls in your code, wherever it may loop
-awaiting an action from another processor, so that Linux will give
-another process a chance to run. I use a C macro, call it
-IDLE_ME, to do this: for a test run, compile with
-cc -DIDLE_ME=usleep(1); ...; for a "production" run,
-compile with cc -DIDLE_ME={} .... The
-usleep(1) call requests a 1 microsecond sleep, which has the
-effect of allowing the Linux scheduler to select a different process
-to run on that processor. If the number of processes is more than
-twice the number of processors available, it is not unusual for codes
-to run ten times faster with usleep(1) calls than without
-them.
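-
-A busy-wait loop written with this convention might look like the
-following sketch; flag and the surrounding code are hypothetical, but
-the placement of IDLE_ME is the point:
-
-----
-
-volatile int flag = 0; /* set to 1 by a process on another processor */
-void
-wait_for_flag(void)
-{
-while (!flag) {
-/* expands to usleep(1); for test runs, to {} for production runs */
-IDLE_ME
-}
-}
-
-----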
-
-
-
-
-!!2.3 bb_threads
-
-
-
-
-
-
-The bb_threads ("Bare Bones" threads) library,
-ftp://caliban.physics.utoronto.ca/pub/linux/, is a remarkably
-simple library that demonstrates use of the Linux clone()
-call. The gzip tar file is only 7K bytes! Although this
-library is essentially made obsolete by the !LinuxThreads library
-discussed in section 2.4, bb_threads is still usable, and it is
-small and simple enough to serve well as an introduction to use of
-Linux thread support. Certainly, it is far less daunting to read this
-source code than to browse the source code for !LinuxThreads. In
-summary, the bb_threads library is a good starting point, but
-is not really suitable for coding large projects.
-
-
-The basic program structure for using the bb_threads library is:
-
-
-
-
-
-#Start the program running as a single process.
-
-#
-
-#You will need to estimate the maximum stack space that will be
-required for each thread. Guessing large is relatively harmless (that
-is what virtual memory is for ;-), but remember that ''all'' the
-stacks are coming from a single virtual address space, so guessing
-huge is not a great idea. The demo suggests 64K. This size is set to
-''b'' bytes by
-bb_threads_stacksize(''b'').
-
-#
-
-#The next step is to initialize any locks that you will need.
-The lock mechanism built-into this library numbers locks from 0 to
-MAX_MUTEXES, and initializes lock ''i'' by
-bb_threads_mutexcreate(''i'').
-
-#
-
-#Spawning a new thread is done by calling a library routine that
-takes arguments specifying what function the new thread should execute
-and what arguments should be transmitted to it. To start a new thread
-executing the void-returning function ''f'' with the
-single argument ''arg'', you do something like
-bb_threads_newthread(''f'', &arg),
-where ''f'' should be declared something like void
-''f''(void *arg, size_t dummy). If you need to pass
-more than one argument, pass a pointer to a structure initialized to
-hold the argument values.
-
-#
-
-#Run parallel code, being careful to use
-bb_threads_lock(''n'') and
-bb_threads_unlock(''n'') where ''n''
-is an integer identifying which lock to use. Note that the lock and
-unlock operations in this library are very basic spin locks using
-atomic bus-locking instructions, which can cause excessive
-memory-reference interference and do not make any attempt to ensure
-fairness.
-The demo program packaged with bb_threads did not correctly use
-locks to prevent printf() from being executed simultaneously
-from within the functions fnn and main... and
-because of this, the demo does not always work. I'm not saying this
-to knock the demo, but rather to emphasize that this stuff is ''very
-tricky''; also, it is only slightly easier using !LinuxThreads.
-
-#
-
-#When a thread executes a return, it actually destroys
-the process... but the local stack memory is not automatically
-deallocated. To be precise, Linux doesn't support deallocation, but
-the memory space is not automatically added back to the
-malloc() free list. Thus, the parent process should reclaim
-the space for each dead child by
-bb_threads_cleanup(wait(NULL)).
-#
-
-
-
-
-
-
-The following C program uses the algorithm discussed in section 1.3 to
-compute the approximate value of Pi using two bb_threads
-threads.
-
-
-
-----
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <sys/types.h>
-#include <sys/wait.h>
-#include "bb_threads.h"
-volatile double pi = 0.0;
-volatile int intervals;
-volatile int pids[[2]; /* Unix PIDs of threads */
-void
-do_pi(void *data, size_t len)
-{
-register double width, localsum;
-register int i;
-register int iproc = (getpid() != pids[[0]);
-/* set width */
-width = 1.0 / intervals;
-/* do the local computations */
-localsum = 0;
-for (i=iproc; i<intervals; i+=2) {
-register double x = (i + 0.5) * width;
-localsum += 4.0 / (1.0 + x * x);
-}
-localsum *= width;
-/* get permission, update pi, and unlock */
-bb_threads_lock(0);
-pi += localsum;
-bb_threads_unlock(0);
-}
-int
-main(int argc, char **argv)
-{
-/* get the number of intervals */
-intervals = atoi(argv[[1]);
-/* set stack size and create lock... */
-bb_threads_stacksize(65536);
-bb_threads_mutexcreate(0);
-/* make two threads... */
-pids[[0] = bb_threads_newthread(do_pi, NULL);
-pids[[1] = bb_threads_newthread(do_pi, NULL);
-/* cleanup after two threads (really a barrier sync) */
-bb_threads_cleanup(wait(NULL));
-bb_threads_cleanup(wait(NULL));
-/* print the result */
-printf("Estimation of pi is %f\n", pi);
-/* check-out */
-exit(0);
-}
-
-----
-
-
-
-
-!!2.4 !LinuxThreads
-
-
-
-
-
-
-!LinuxThreads
-http://pauillac.inria.fr/~xleroy/linuxthreads/ is a fairly
-complete and solid implementation of "shared everything" as per the
-POSIX 1003.1c threads standard. Unlike other POSIX threads ports,
-!LinuxThreads uses the same Linux kernel threads facility
-(clone()) that is used by bb_threads. POSIX
-compatibility means that it is relatively easy to port quite a few
-threaded applications from other systems and various tutorial
-materials are available. In short, this is definitely the threads
-package to use under Linux for developing large-scale threaded
-programs.
-
-
-The basic program structure for using the !LinuxThreads library is:
-
-
-
-
-
-#Start the program running as a single process.
-
-#
-
-#The next step is to initialize any locks that you will need.
-Unlike bb_threads locks, which are identified by numbers, POSIX
-locks are declared as variables of type
-pthread_mutex_t lock. Use
-pthread_mutex_init(&lock,val) to initialize
-each one you will need to use.
-
-#
-
-#As with bb_threads, spawning a new thread is done by
-calling a library routine that takes arguments specifying what
-function the new thread should execute and what arguments should be
-transmitted to it. However, POSIX requires the user to declare a
-variable of type pthread_t to identify each thread. To
-create a thread pthread_t thread running f(),
-one calls pthread_create(&thread,NULL,f,&arg).
-
-#
-
-#Run parallel code, being careful to use
-pthread_mutex_lock(&lock) and
-pthread_mutex_unlock(&lock) as appropriate.
-
-#
-
-#Use pthread_join(thread,&retval) to clean-up
-after each thread.
-
-#
-
-#Use -D_REENTRANT when compiling your C code.
-#
-
-
-
-An example parallel computation of Pi using !LinuxThreads follows. The
-algorithm of section 1.3 is used and, as for the bb_threads
-example, two threads execute in parallel.
-
-
-
-----
-
-#include <stdio.h>
-#include <stdlib.h>
-#include "pthread.h"
-
-volatile double pi = 0.0;  /* Approximation to pi (shared) */
-pthread_mutex_t pi_lock;   /* Lock for above */
-volatile double intervals; /* How many intervals? */
-
-void *
-process(void *arg)
-{
-  register double width, localsum;
-  register int i;
-  register int iproc = (*((char *) arg) - '0');
-
-  /* Set width */
-  width = 1.0 / intervals;
-
-  /* Do the local computations */
-  localsum = 0;
-  for (i=iproc; i<intervals; i+=2) {
-    register double x = (i + 0.5) * width;
-    localsum += 4.0 / (1.0 + x * x);
-  }
-  localsum *= width;
-
-  /* Lock pi for update, update it, and unlock */
-  pthread_mutex_lock(&pi_lock);
-  pi += localsum;
-  pthread_mutex_unlock(&pi_lock);
-
-  return(NULL);
-}
-
-int
-main(int argc, char **argv)
-{
-  pthread_t thread0, thread1;
-  void * retval;
-
-  /* Get the number of intervals */
-  intervals = atoi(argv[[1]);
-
-  /* Initialize the lock on pi */
-  pthread_mutex_init(&pi_lock, NULL);
-
-  /* Make the two threads */
-  if (pthread_create(&thread0, NULL, process, "0") ||
-      pthread_create(&thread1, NULL, process, "1")) {
-    fprintf(stderr, "%s: cannot make thread\n", argv[[0]);
-    exit(1);
-  }
-
-  /* Join (collapse) the two threads */
-  if (pthread_join(thread0, &retval) ||
-      pthread_join(thread1, &retval)) {
-    fprintf(stderr, "%s: thread join failed\n", argv[[0]);
-    exit(1);
-  }
-
-  /* Print the result */
-  printf("Estimation of pi is %f\n", pi);
-
-  /* Check-out */
-  exit(0);
-}
-
-----
-
-
-
-
-!!2.5 System V Shared Memory
-
-
-
-
-
-
-The System V IPC (Inter-Process Communication) support consists of a
-number of system calls providing message queues, semaphores, and a
-shared memory mechanism. Of course, these mechanisms were originally
-intended to be used for multiple processes to communicate within a
-uniprocessor system. However, the same mechanisms should also work
-for communication between processes under SMP Linux, no matter which
-processors they run on.
-
-
-Before going into how these calls are used, it is important to
-understand that although System V IPC calls exist for things like
-semaphores and message transmission, you probably should not use
-them. Why not? These functions are generally slow and serialized
-under SMP Linux. Enough said.
-
-
-The basic procedure for creating a group of processes sharing access
-to a shared memory segment is:
-
-
-
-
-
-#Start the program running as a single process.
-
-#
-
-#Typically, you will want each run of a parallel program to have
-its own shared memory segment, so you will need to call
-shmget() to create a new segment of the desired size.
-Alternatively, this call can be used to get the ID of a pre-existing
-shared memory segment. In either case, the return value is either the
-shared memory segment ID or -1 for error. For example, to create a
-shared memory segment of ''b'' bytes, the call might be shmid
-= shmget(IPC_PRIVATE, ''b'', (IPC_CREAT | 0666)).
-
-#
-
-#The next step is to attach this shared memory segment to this
-process, literally adding it to the virtual memory map of this process.
-Although the shmat() call allows the programmer to specify
-the virtual address at which the segment should appear, the address
-selected must be aligned on a page boundary (i.e., be a multiple of
-the page size returned by getpagesize(), which is usually
-4096 bytes), and will override the mapping of any memory formerly at
-that address. Thus, we instead prefer to let the system pick the
-address. In either case, the return value is a pointer to the base
-virtual address of the segment just mapped. The code is shmptr =
-shmat(shmid, 0, 0).
-Notice that you can allocate all your static shared variables into
-this shared memory segment by simply declaring all shared variables as
-members of a struct type, and declaring ''shmptr'' to be
-a pointer to that type. Using this technique, shared variable
-''x'' would be accessed as
-''shmptr''->''x''.
-
-#
-
-#Since this shared memory segment should be destroyed when the
-last process with access to it terminates or detaches from it, we need
-to call shmctl() to set-up this default action. The code is
-something like shmctl(shmid, IPC_RMID, 0).
-
-#
-
-#Use the standard Linux fork() call to make the desired
-number of processes... each will inherit the shared memory segment.
-
-#
-
-#When a process is done using a shared memory segment, it really
-should detach from that shared memory segment. This is done by
-shmdt(shmptr).
-#
-
-
-
-
-
-
-Although the above set-up does require a few system calls, once the
-shared memory segment has been established, any change made by one
-processor to a value in that memory will automatically be visible to
-all processes. Most importantly, each communication operation will
-occur without the overhead of a system call.
-
-
-An example C program using System V shared memory segments follows.
-It computes Pi, using the same algorithm given in section 1.3.
-
-
-
-----
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-#include <sys/ipc.h>
-#include <sys/shm.h>
-#include <sys/wait.h>
-
-volatile struct shared { double pi; int lock; } *shared;
-
-inline extern int xchg(register int reg,
-                       volatile int * volatile obj)
-{
-  /* Atomic exchange instruction */
-  __asm__ __volatile__ ("xchgl %1,%0"
-                        :"=r" (reg), "=m" (*obj)
-                        :"r" (reg), "m" (*obj));
-  return(reg);
-}
-
-main(int argc, char **argv)
-{
-  register double width, localsum;
-  register int intervals, i;
-  register int shmid;
-  register int iproc = 0;
-
-  /* Allocate System V shared memory */
-  shmid = shmget(IPC_PRIVATE,
-                 sizeof(struct shared),
-                 (IPC_CREAT | 0600));
-  shared = ((volatile struct shared *) shmat(shmid, 0, 0));
-  shmctl(shmid, IPC_RMID, 0);
-
-  /* Initialize... */
-  shared->pi = 0.0;
-  shared->lock = 0;
-
-  /* Fork a child */
-  if (!fork()) ++iproc;
-
-  /* get the number of intervals */
-  intervals = atoi(argv[[1]);
-  width = 1.0 / intervals;
-
-  /* do the local computations */
-  localsum = 0;
-  for (i=iproc; i<intervals; i+=2) {
-    register double x = (i + 0.5) * width;
-    localsum += 4.0 / (1.0 + x * x);
-  }
-  localsum *= width;
-
-  /* Atomic spin lock, add, unlock... */
-  while (xchg((iproc + 1), &(shared->lock))) ;
-  shared->pi += localsum;
-  shared->lock = 0;
-
-  /* Terminate child (barrier sync) */
-  if (iproc == 0) {
-    wait(NULL);
-    printf("Estimation of pi is %f\n", shared->pi);
-  }
-
-  /* Check out */
-  return(0);
-}
-
-----
-
-
-In this example, I have used the IA32 atomic exchange instruction to
-implement locking. For better performance and portability, substitute
-a synchronization technique that avoids atomic bus-locking
-instructions (discussed in section 2.2).
-
-
-When debugging your code, it is useful to remember that the
-ipcs command will report the status of the System V IPC
-facilities currently in use.
-
-
-
-
-!!2.6 Memory Map Call
-
-
-
-
-
-
-Using system calls for file I/O can be very expensive; in fact, that is
-why there is a user-buffered file I/O library (getchar(),
-fwrite(), etc.). But user buffers don't work if multiple
-processes are accessing the same writeable file, and the user buffer
-management overhead is significant. The BSD UNIX fix for this was the
-addition of a system call that allows a portion of a file to be mapped
-into user memory, essentially using virtual memory paging mechanisms to
-cause updates. This same mechanism also has been used in systems from
-Sequent for many years as the basis for their shared memory parallel
-processing support. Despite some very negative comments in the (quite
-old) man page, Linux seems to correctly perform at least some of the
-basic functions, and it supports the degenerate use of this system
-call to map an anonymous segment of memory that can be shared across
-multiple processes.
-
-
-In essence, the Linux implementation of mmap() is a plug-in
-replacement for steps 2, 3, and 4 in the System V shared memory scheme
-outlined in section 2.5. To create an anonymous shared memory segment:
-
-
-
-----
-
-shmptr =
-    mmap(0,                        /* system assigns address */
-         b,                        /* size of shared memory segment */
-         (PROT_READ | PROT_WRITE), /* access rights, can be rwx */
-         (MAP_ANON | MAP_SHARED),  /* anonymous, shared */
-         0,                        /* file descriptor (not used) */
-         0);                       /* file offset (not used) */
-
-----
-
-
-The equivalent to the System V shared memory shmdt()
-call is munmap():
-
-
-
-----
-
-munmap(shmptr, b);
-
-----
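-
-
-Putting these pieces together, a minimal sketch of two processes
-sharing an anonymous mmap() segment is shown below. This sketch is
-not part of the original discussion; it simply combines the
-mmap() call above with fork(), and omits most error checking:
-
-
-----
-
-/* Sketch: share one int between parent and child via an
-   anonymous mmap() segment (assumes Linux MAP_ANON, as above) */
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <sys/mman.h>
-#include <sys/wait.h>
-
-int
-main(void)
-{
-  /* replaces steps 2, 3, and 4 of the System V procedure */
-  volatile int *shmptr =
-    mmap(0, sizeof(int),
-         (PROT_READ | PROT_WRITE),
-         (MAP_ANON | MAP_SHARED),
-         -1,                 /* fd is ignored for anonymous maps */
-         0);
-
-  *shmptr = 0;
-  if (!fork()) {
-    *shmptr = 42;            /* child updates the shared value */
-    exit(0);
-  }
-  wait(NULL);
-  printf("child stored %d\n", *shmptr);   /* parent sees 42 */
-  return(0);
-}
-
-----
-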
-
-
-In my opinion, there is no real benefit in using mmap()
-instead of the System V shared memory support.
-
-
-
-----
-
-!!3. Clusters Of Linux Systems
-
-
-
-
-
-This section attempts to give an overview of cluster parallel
-processing using Linux. Clusters are currently both the most popular
-and the most varied approach, ranging from a conventional network of
-workstations (__NOW__) to essentially custom parallel machines
-that just happen to use Linux PCs as processor nodes. There is also
-quite a lot of software support for parallel processing using clusters
-of Linux machines.
-
-
-
-
-!!3.1 Why A Cluster?
-
-
-
-
-
-
-Cluster parallel processing offers several important advantages:
-
-
-
-
-
-*Each of the machines in a cluster can be a complete system,
-usable for a wide range of other computing applications. This leads
-many people to suggest that cluster parallel computing can simply
-claim all the "wasted cycles" of workstations sitting idle on people's
-desks. It is not really so easy to salvage those cycles, and it will
-probably slow your co-worker's screen saver, but it can be done.
-
-*
-
-*The current explosion in networked systems means that most of the
-hardware for building a cluster is being sold in high volume, with
-correspondingly low "commodity" prices as the result. Further savings
-come from the fact that only one video card, monitor, and keyboard are
-needed for each cluster (although you may need to swap these into each
-machine to perform the initial installation of Linux; once running, a
-typical Linux PC does not need a "console"). In comparison, SMP and
-attached processors are much smaller markets, tending toward somewhat
-higher price per unit performance.
-
-*
-
-*Cluster computing can ''scale to very large systems''.
-While it is currently hard to find a Linux-compatible SMP with many
-more than four processors, most commonly available network hardware
-easily builds a cluster with up to 16 machines. With a little work,
-hundreds or even thousands of machines can be networked. In fact, the
-entire Internet can be viewed as one truly huge cluster.
-
-*
-
-*The fact that replacing a "bad machine" within a cluster is
-trivial compared to fixing a partly faulty SMP yields much higher
-availability for carefully designed cluster configurations. This
-becomes important not only for particular applications that cannot
-tolerate significant service interruptions, but also for general use
-of systems containing enough processors so that single-machine
-failures are fairly common. (For example, even though the average
-time to failure of a PC might be two years, in a cluster with 32
-machines, the probability that at least one will fail within 6 months
-is quite high.)
-*
-
-
-
-
-
-
-OK, so clusters are free or cheap and can be very large and highly
-available... why doesn't everyone use a cluster? Well, there are
-problems too:
-
-
-
-
-
-*With a few exceptions, network hardware is not designed for
-parallel processing. Typically latency is very high and bandwidth
-relatively low compared to SMP and attached processors. For example,
-SMP latency is generally no more than a few microseconds, but is
-commonly hundreds or thousands of microseconds for a cluster. SMP
-communication bandwidth is often more than 100 MBytes/second; although
-the fastest network hardware (e.g., "Gigabit Ethernet") offers
-comparable speed, the most commonly used networks are between 10 and
-1000 times slower.
-The performance of network hardware is poor enough even when used as
-an ''isolated cluster network''. If the network is not isolated from
-other traffic, as is often the case when using "machines that happen to be
-networked" rather than a system designed as a cluster, performance can
-be substantially worse.
-
-*
-
-*There is very little software support for treating a cluster as a
-single system. For example, the ps command only reports the
-processes running on one Linux system, not all processes running
-across a cluster of Linux systems.
-*
-
-
-
-
-
-
-Thus, the basic story is that clusters offer great potential, but that
-potential may be very difficult to achieve for most applications. The
-good news is that there is quite a lot of software support that will
-help you achieve good performance for programs that are well suited to
-this environment, and there are also networks designed specifically to
-widen the range of programs that can achieve good performance.
-
-
-
-
-!!3.2 Network Hardware
-
-
-
-
-
-
-Computer networking is an exploding field... but you already knew
-that. An ever-increasing range of networking technologies and
-products are being developed, and most are available in forms that
-could be applied to make a parallel-processing cluster out of a group
-of machines (i.e., PCs each running Linux).
-
-
-Unfortunately, no one network technology solves all problems best; in
-fact, the range of approach, cost, and performance is at first hard to
-believe. For example, using standard commercially-available hardware,
-the cost per machine networked ranges from less than $5 to over
-$4,000. The delivered bandwidth and latency each also vary
-over four orders of magnitude.
-
-
-Before trying to learn about specific networks, it is important to
-recognize that these things change like the wind (see
-http://www.uk.linux.org/!NetNews.html for Linux networking news),
-and it is very difficult to get accurate data about some networks.
-
-
-Where I was particularly uncertain,
-I've placed a ''?''. I have spent a lot of time researching this
-topic, but I'm sure my summary is full of errors and has omitted many
-important things. If you have any corrections or additions, please
-send email to
-pplinux@ecn.purdue.edu.
-
-
-Summaries like the LAN Technology Scorecard at
-http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html give
-some characteristics of many different types of networks and LAN
-standards. However, the summary in this HOWTO centers on the network
-properties that are most relevant to construction of Linux clusters.
-The section discussing each network begins with a short list of
-characteristics. The following defines what these entries mean.
-
-
-
-
-; __Linux support:__:
-
-If the answer is ''no'', the meaning is pretty clear. Other
-answers try to describe the basic program interface that is used to
-access the network. Most network hardware is interfaced via a kernel
-driver, typically supporting TCP/UDP communication. Some other
-networks use more direct (e.g., library) interfaces to reduce latency
-by bypassing the kernel.
-
-
-
-
-
-
-
-
-Years ago, it used to be considered perfectly acceptable to access a
-floating point unit via an OS call, but that is now clearly ludicrous;
-in my opinion, it is just as awkward for each communication between
-processors executing a parallel program to require an OS call. The
-problem is that computers haven't yet integrated these communication
-mechanisms, so non-kernel approaches tend to have portability problems.
-You are going to hear a lot more about this in the near future, mostly
-in the form of the new __Virtual Interface (VI) Architecture__,
-http://www.viarch.org/, which is a standardized method for
-most network interface operations to bypass the usual OS call layers.
-The VI standard is backed by Compaq, Intel, and Microsoft, and is sure
-to have a strong impact on SAN (System Area Network) designs over the
-next few years.
-
-
-
-; __Maximum bandwidth:__:
-
-This is the number everybody cares about. I have generally used the
-theoretical best case numbers; your mileage ''will'' vary.
-
-
-
-; __Minimum latency:__:
-
-In my opinion, this is the number everybody should care about even more
-than bandwidth. Again, I have used the unrealistic best-case numbers,
-but at least these numbers do include ''all'' sources of latency,
-both hardware and software. In most cases, the network latency is just
-a few microseconds; the much larger numbers reflect layers of
-inefficient hardware and software interfaces.
-
-
-
-; __Available as:__:
-
-Simply put, this describes how you get this type of network hardware.
-Commodity stuff is widely available from many vendors, with price as
-the primary distinguishing factor. Multiple-vendor things are
-available from more than one competing vendor, but there are
-significant differences and potential interoperability problems.
-Single-vendor networks leave you at the mercy of that supplier
-(however benevolent it may be). Public domain designs mean that even
-if you cannot find somebody to sell you one, you or anybody else can
-buy parts and make one. Research prototypes are just that; they are
-generally neither ready for external users nor available to them.
-
-
-
-; __Interface port/bus used:__:
-
-How does one hook-up this network? The highest performance and most
-common now is a PCI bus interface card. There are also EISA, VESA
-local bus (VL bus), and ISA bus cards. ISA was there first, and is
-still commonly used for low-performance cards. EISA is still around
-as the second bus in a lot of PCI machines, so there are a few cards.
-These days, you don't see much VL stuff (although
-http://www.vesa.org/ would beg to differ).
-
-
-
-
-
-
-
-
-Of course, any interface that you can use without having to open your
-PC's case has more than a little appeal. IrDA and USB interfaces are
-appearing with increasing frequency. The Standard Parallel Port (SPP)
-used to be what your printer was plugged into, but it has seen a lot
-of use lately as an external extension of the ISA bus; this new
-functionality is enhanced by the IEEE 1284 standard, which specifies
-EPP and ECP improvements. There is also the old, reliable, slow RS232
-serial port. I don't know of anybody connecting machines using VGA
-video connectors, keyboard, mouse, or game ports... so that's about
-it.
-
-
-
-; __Network structure:__:
-
-A bus is a wire, set of wires, or fiber. A hub is a little box that
-knows how to connect different wires/fibers plugged into it; switched
-hubs allow multiple connections to be actively transmitting data
-simultaneously.
-
-
-
-; __Cost per machine connected:__:
-
-Here's how to use these numbers. Suppose that, not counting the
-network connection, it costs $2,000 to purchase a PC for use as
-a node in your cluster. Adding a Fast Ethernet brings the per node
-cost to about $2,400; adding a Myrinet instead brings the cost
-to about $3,800. If you have about $20,000 to spend,
-that means you could have either 8 machines connected by Fast Ethernet
-or 5 machines connected by Myrinet. It also can be very reasonable to
-have multiple networks; e.g., $20,000 could buy 8 machines
-connected by both Fast Ethernet and TTL_PAPERS. Pick the
-network, or set of networks, that is most likely to yield a cluster
-that will run your application fastest.
-
-
-
-
-
-
-
-
-By the time you read this, these numbers will be wrong... heck,
-they're probably wrong already. There may also be quantity discounts,
-special deals, etc. Still, the prices quoted here aren't likely to be
-wrong enough to lead you to a totally inappropriate choice. It
-doesn't take a PhD (although I do have one ;-) to see that expensive
-networks only make sense if your application needs their special
-properties or if the PCs being clustered are relatively expensive.
-
-
-
-Now that you have the disclaimers, on with the show....
-
-
-
-
-!!ArcNet
-
-
-
-
-
-*Linux support: ''kernel drivers''
-*
-
-*Maximum bandwidth: ''2.5 Mb/s''
-*
-
-*Minimum latency: ''1,000 microseconds?''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''ISA''
-*
-
-*Network structure: ''unswitched hub or bus (logical ring)''
-*
-
-*Cost per machine connected: ''$200''
-*
-
-
-
-
-
-
-ARCNET is a local area network that is primarily intended for use in
-embedded real-time control systems. Like Ethernet, the network is
-physically organized either as taps on a bus or one or more hubs;
-however, unlike Ethernet, it uses a token-based protocol logically
-structuring the network as a ring. Packet headers are small (3 or 4
-bytes) and messages can carry as little as a single byte of data.
-Thus, ARCNET yields more consistent performance than Ethernet, with
-bounded delays, etc. Unfortunately, it is slower than Ethernet and
-less popular, making it more expensive. More information is available
-from the ARCNET Trade Association at
-http://www.arcnet.com/.
-
-
-
-
-!ATM
-
-
-
-
-
-*Linux support: ''kernel driver, AAL* library''
-*
-
-*Maximum bandwidth: ''155 Mb/s'' (soon, ''1,200 Mb/s'')
-*
-
-*Minimum latency: ''120 microseconds''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''switched hubs''
-*
-
-*Cost per machine connected: ''$3,000''
-*
-
-
-
-
-
-
-Unless you've been in a coma for the past few years, you have probably
-heard a lot about how ATM (Asynchronous Transfer Mode) ''is'' the
-future... well, sort-of. ATM is cheaper than HiPPI and faster than
-Fast Ethernet, and it can be used over the very long distances that
-the phone companies care about. The ATM network protocol is also
-designed to provide a lower-overhead software interface and to more
-efficiently manage small messages and real-time communications (e.g.,
-digital audio and video). It is also one of the highest-bandwidth
-networks that Linux currently supports. The bad news is that ATM isn't
-cheap, and there are still some compatibility problems across
-vendors. An overview of Linux ATM development is available at
-http://lrcwww.epfl.ch/linux-atm/.
-
-
-
-
-!CAPERS
-
-
-
-
-
-*Linux support: ''AFAPI library''
-*
-
-*Maximum bandwidth: ''1.2 Mb/s''
-*
-
-*Minimum latency: ''3 microseconds''
-*
-
-*Available as: ''commodity hardware''
-*
-
-*Interface port/bus used: ''SPP''
-*
-
-*Network structure: ''cable between 2 machines''
-*
-
-*Cost per machine connected: ''$2''
-*
-
-
-
-
-
-
-CAPERS (Cable Adapter for Parallel Execution and Rapid
-Synchronization) is a spin-off of the PAPERS project,
-http://garage.ecn.purdue.edu/~papers/, at the Purdue University
-School of Electrical and Computer Engineering. In essence, it defines
-a software protocol for using an ordinary "!LapLink" SPP-to-SPP cable
-to implement the PAPERS library for two Linux PCs. The idea doesn't
-scale, but you can't beat the price. As with TTL_PAPERS, to improve
-system security, there is a minor kernel patch recommended, but not
-required:
-http://garage.ecn.purdue.edu/~papers/giveioperm.html.
-
-
-
-
-!Ethernet
-
-
-
-
-
-*Linux support: ''kernel drivers''
-*
-
-*Maximum bandwidth: ''10 Mb/s''
-*
-
-*Minimum latency: ''100 microseconds''
-*
-
-*Available as: ''commodity hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''switched or unswitched hubs, or hubless bus''
-*
-
-*Cost per machine connected: ''$100'' (hubless, ''$50'')
-*
-
-
-
-
-
-
-For some years now, 10 Mbits/s Ethernet has been the standard network
-technology. Good Ethernet interface cards can be purchased for well
-under $50, and a fair number of PCs now have an Ethernet controller
-built into the motherboard. For lightly-used networks, Ethernet
-connections can be organized as a multi-tap bus without a hub; such
-configurations can serve up to 200 machines with minimal cost, but are
-not appropriate for parallel processing. Adding an unswitched hub
-does not really help performance. However, switched hubs that can
-provide full bandwidth to simultaneous connections cost only about
-$100 per port. Linux supports an amazing range of Ethernet
-interfaces, but it is important to keep in mind that variations in the
-interface hardware can yield significant performance differences. See
-the Hardware Compatibility HOWTO for comments on which are supported
-and how well they work; also see
-http://cesdis1.gsfc.nasa.gov/linux/drivers/.
-
-
-An interesting way to improve performance is offered by the 16-machine
-Linux cluster work done in the Beowulf project,
-http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html, at NASA
-CESDIS. There, Donald Becker, who is the author of many Ethernet card
-drivers, has developed support for load sharing across multiple
-Ethernet networks that shadow each other (i.e., share the same network
-addresses). This load sharing is built into the standard Linux
-distribution, and is done invisibly below the socket operation level.
-Because hub cost is significant, having each machine connected to two
-or more hubless or unswitched hub Ethernet networks can be a very
-cost-effective way to improve performance. In fact, in situations
-where one machine is the network performance bottleneck, load sharing
-using shadow networks works much better than using a single switched
-hub network.
-
-
-
-
-!Ethernet (Fast Ethernet)
-
-
-
-
-
-*Linux support: ''kernel drivers''
-*
-
-*Maximum bandwidth: ''100 Mb/s''
-*
-
-*Minimum latency: ''80 microseconds''
-*
-
-*Available as: ''commodity hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''switched or unswitched hubs''
-*
-
-*Cost per machine connected: ''$400?''
-*
-
-
-
-
-
-
-Although there are really quite a few different technologies calling
-themselves "Fast Ethernet," this term most often refers to a hub-based
-100 Mbits/s Ethernet that is somewhat compatible with older "10 BaseT"
-10 Mbits/s devices and cables. As might be expected, anything called
-Ethernet is generally priced for a volume market, and these interfaces
-are generally a small fraction of the price of 155 Mbits/s ATM cards.
-The catch is that having a bunch of machines dividing the bandwidth of
-a single 100 Mbits/s "bus" (using an unswitched hub) yields
-performance that might not even be as good on average as using 10
-Mbits/s Ethernet with a switched hub that can give each machine's
-connection a full 10 Mbits/s.
-
-
-Switched hubs that can provide 100 Mbits/s for each machine
-simultaneously are expensive, but prices are dropping every day, and
-these switches do yield much higher total network bandwidth than
-unswitched hubs. The thing that makes ATM switches so expensive is
-that they must switch for each (relatively short) ATM cell; some Fast
-Ethernet switches take advantage of the expected lower switching
-frequency by using techniques that may have low latency through the
-switch, but take multiple milliseconds to change the switch path...
-if your routing pattern changes frequently, avoid those switches. See
-http://cesdis1.gsfc.nasa.gov/linux/drivers/ for information
-about the various cards and drivers.
-
-
-Also note that, as described for Ethernet, the Beowulf project,
-http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html, at NASA
-has been developing support that offers improved performance by load
-sharing across multiple Fast Ethernets.
-
-
-
-
-!Ethernet (Gigabit Ethernet)
-
-
-
-
-
-*Linux support: ''kernel drivers''
-*
-
-*Maximum bandwidth: ''1,000 Mb/s''
-*
-
-*Minimum latency: ''300 microseconds?''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''switched hubs or FDRs''
-*
-
-*Cost per machine connected: ''$2,500?''
-*
-
-
-
-
-
-
-I'm not sure that Gigabit Ethernet,
-http://www.gigabit-ethernet.org/, has a good technological
-reason to be called Ethernet... but the name does accurately reflect
-the fact that this is intended to be a cheap, mass-market, computer
-network technology with native support for IP. However, current
-pricing reflects the fact that Gb/s hardware is still a tricky thing
-to build.
-
-
-Unlike other Ethernet technologies, Gigabit Ethernet provides for a
-level of flow control that should make it a more reliable network.
-FDRs, or Full-Duplex Repeaters, simply multiplex lines, using
-buffering and localized flow control to improve performance. Most
-switched hubs are being built as new interface modules for existing
-gigabit-capable switch fabrics. Switch/FDR products have been shipped
-or announced by at least
-http://www.acacianet.com/,
-http://www.baynetworks.com/,
-http://www.cabletron.com/,
-http://www.networks.digital.com/,
-http://www.extremenetworks.com/,
-http://www.foundrynet.com/,
-http://www.gigalabs.com/,
-http://www.packetengines.com/,
-http://www.plaintree.com/,
-http://www.prominet.com/,
-http://www.sun.com/, and
-http://www.xlnt.com/.
-
-
-There is a Linux driver,
-http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html, for
-the Packet Engines "Yellowfin" G-NIC,
-http://www.packetengines.com/. Early tests under Linux achieved
-about 2.5x higher bandwidth than could be achieved with the best 100
-Mb/s Fast Ethernet; with gigabit networks, careful tuning of PCI bus
-use is a critical factor. There is little doubt that driver
-improvements, and Linux drivers for other NICs, will follow.
-
-
-
-
-!FC (Fibre Channel)
-
-
-
-
-
-*Linux support: ''no''
-*
-
-*Maximum bandwidth: ''1,062 Mb/s''
-*
-
-*Minimum latency: ''?''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI?''
-*
-
-*Network structure: ''?''
-*
-
-*Cost per machine connected: ''?''
-*
-
-
-
-
-
-
-The goal of FC (Fibre Channel) is to provide high-performance block
-I/O (an FC frame carries a 2,048 byte data payload), particularly for
-sharing disks and other storage devices that can be directly connected
-to the FC rather than connected through a computer. Bandwidth-wise,
-FC is specified to be relatively fast, running anywhere between 133
-and 1,062 Mbits/s. If FC becomes popular as a high-end SCSI
-replacement, it may quickly become a cheap technology; for now, it is
-not cheap and is not supported by Linux. A good collection of FC
-references is maintained by the Fibre Channel Association at
-http://www.amdahl.com/ext/CARP/FCA/FCA.html
-
-
-
-!!FireWire (IEEE 1394)
-
-
-
-
-
-*Linux support: ''no''
-*
-
-*Maximum bandwidth: ''196.608 Mb/s'' (soon, ''393.216 Mb/s'')
-*
-
-*Minimum latency: ''?''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''random without cycles (self-configuring)''
-*
-
-*Cost per machine connected: ''$600''
-*
-
-
-
-
-
-
-!FireWire,
-http://www.firewire.org/, the IEEE 1394-1995
-standard, is destined to be the low-cost high-speed digital network
-for consumer electronics. The showcase application is connecting DV
-digital video camcorders to computers, but !FireWire is intended to be
-used for applications ranging from being a SCSI replacement to
-interconnecting the components of your home theater. It allows up to
-64K devices to be connected in any topology of busses and bridges
-that does not contain a cycle, and it automatically detects the
-configuration when components are added or removed. Short (four-byte
-"quadlet") low-latency messages are supported as well as ATM-like
-isochronous transmission (used to keep multimedia messages
-synchronized). Adaptec has !FireWire products that allow up to 63
-devices to be connected to a single PCI interface card, and also has
-good general !FireWire information at
-http://www.adaptec.com/serialio/.
-
-
-Although !FireWire will not be the highest bandwidth network available,
-the consumer-level market (which should drive prices very low) and low
-latency support might make this one of the best Linux PC cluster
-message-passing network technologies within the next year or so.
-
-
-
-
-!HiPPI And Serial HiPPI
-
-
-
-
-
-*Linux support: ''no''
-*
-
-*Maximum bandwidth: ''1,600 Mb/s'' (serial is ''1,200 Mb/s'')
-*
-
-*Minimum latency: ''?''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''EISA, PCI''
-*
-
-*Network structure: ''switched hubs''
-*
-
-*Cost per machine connected: ''$3,500'' (serial is ''$4,500'')
-*
-
-
-
-
-
-
-HiPPI (High Performance Parallel Interface) was originally intended to
-provide very high bandwidth for transfer of huge data sets between a
-supercomputer and another machine (a supercomputer, frame buffer, disk
-array, etc.), and has become the dominant standard for
-supercomputers. Although it is an oxymoron, __Serial HiPPI__ is
-also becoming popular, typically using a fiber optic cable instead of
-the 32-bit wide standard (parallel) HiPPI cables. Over the past few
-years, HiPPI crossbar switches have become common and prices have
-dropped sharply; unfortunately, serial HiPPI is still pricey, and that
-is what PCI bus interface cards generally support. Worse still, Linux
-doesn't yet support HiPPI. A good overview of HiPPI is maintained by
-CERN at
-http://www.cern.ch/HSI/hippi/; they also maintain
-a rather long list of HiPPI vendors at
-http://www.cern.ch/HSI/hippi/procintf/manufact.htm.
-
-
-
-
-!IrDA (Infrared Data Association)
-
-
-
-
-
-*Linux support: ''no?''
-*
-
-*Maximum bandwidth: ''1.15 Mb/s'' and ''4 Mb/s''
-*
-
-*Minimum latency: ''?''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''IrDA''
-*
-
-*Network structure: ''thin air'' ;-)
-*
-
-*Cost per machine connected: ''$0''
-*
-
-
-
-
-
-
-IrDA (Infrared Data Association,
-http://www.irda.org/) is
-that little infrared device on the side of a lot of laptop PCs. It is
-inherently difficult to connect more than two machines using this
-interface, so it is unlikely to be used for clustering. Don Becker
-did some preliminary work with IrDA.
-
-
-
-
-!Myrinet
-
-
-
-
-
-*Linux support: ''library''
-*
-
-*Maximum bandwidth: ''1,280 Mb/s''
-*
-
-*Minimum latency: ''9 microseconds''
-*
-
-*Available as: ''single-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''switched hubs''
-*
-
-*Cost per machine connected: ''$1,800''
-*
-
-
-
-
-
-
-Myrinet
-http://www.myri.com/ is a local area network (LAN)
-designed to also serve as a "system area network" (SAN), i.e., the
-network within a cabinet full of machines connected as a parallel
-system. The LAN and SAN versions use different physical media and
-have somewhat different characteristics; generally, the SAN version
-would be used within a cluster.
-
-
-Myrinet is fairly conventional in structure, but has a reputation for
-being particularly well-implemented. The drivers for Linux are said
-to perform very well, although shockingly large performance variations
-have been reported with different PCI bus implementations for the host
-computers.
-
-
-Currently, Myrinet is clearly the favorite network of cluster groups
-that are not too severely "budgetarily challenged." If your idea of a
-Linux PC is a high-end Pentium Pro or Pentium II with at least 256 MB
-RAM and a SCSI RAID, the cost of Myrinet is quite reasonable. However,
-using more ordinary PC configurations, you may find that your choice
-is between ''N'' machines linked by Myrinet or ''2N'' linked
-by multiple Fast Ethernets and TTL_PAPERS. It really depends
-on what your budget is and what types of computations you care about
-most.
-
-
-
-
-!Parastation
-
-
-
-
-
-*Linux support: ''HAL or socket library''
-*
-
-*Maximum bandwidth: ''125 Mb/s''
-*
-
-*Minimum latency: ''2 microseconds''
-*
-
-*Available as: ''single-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''hubless mesh''
-*
-
-*Cost per machine connected: ''> $1,000''
-*
-
-
-
-
-
-
-The !ParaStation project
-http://wwwipd.ira.uka.de/parastation at the University of Karlsruhe
-Department of Informatics is building a PVM-compatible custom
-low-latency network. They first constructed a two-processor ParaPC
-prototype using a custom EISA card interface and PCs running BSD UNIX,
-and then built larger clusters using DEC Alphas. Since January 1997,
-!ParaStation has been available for Linux. The PCI cards are being
-made in cooperation with a company called Hitex (see
-http://www.hitex.com:80/parastation/). Parastation hardware
-implements both fast, reliable, message transmission and simple barrier
-synchronization.
-
-
-
-
-!PLIP
-
-
-
-
-
-*Linux support: ''kernel driver''
-*
-
-*Maximum bandwidth: ''1.2 Mb/s''
-*
-
-*Minimum latency: ''1,000 microseconds?''
-*
-
-*Available as: ''commodity hardware''
-*
-
-*Interface port/bus used: ''SPP''
-*
-
-*Network structure: ''cable between 2 machines''
-*
-
-*Cost per machine connected: ''$2''
-*
-
-
-
-
-
-
-For just the cost of a "!LapLink" cable, PLIP (Parallel Line Interface
-Protocol) allows two Linux machines to communicate through standard
-parallel ports using standard socket-based software. In terms of
-bandwidth, latency, and scalability, this is not a very serious
-network technology; however, the near-zero cost and the software
-compatibility are useful. The driver is part of the standard Linux
-kernel distributions.
-
-
-
-
-!SCI
-
-
-
-
-
-*Linux support: ''no''
-*
-
-*Maximum bandwidth: ''4,000 Mb/s''
-*
-
-*Minimum latency: ''2.7 microseconds''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI, proprietary''
-*
-
-*Network structure: ''?''
-*
-
-*Cost per machine connected: ''> $1,000''
-*
-
-
-
-
-
-
-The goal of SCI (Scalable Coherent Interconnect, ANSI/IEEE 1596-1992)
-is essentially to provide a high performance mechanism that can
-support coherent shared memory access across large numbers of
-machines, as well as various types of block message transfers. It is
-fairly safe to say that the designed bandwidth and latency of SCI are
-both "awesome" in comparison to most other network technologies. The
-catch is that SCI is not widely available as cheap production units,
-and there isn't any Linux support.
-
-
-SCI primarily is used in various proprietary designs for
-logically-shared physically-distributed memory machines, such as the
-HP/Convex Exemplar SPP and the Sequent NUMA-Q 2000 (see
-http://www.sequent.com/). However, SCI is available as a PCI
-interface card and 4-way switches (up to 16 machines can be connected
-by cascading four 4-way switches) from Dolphin,
-http://www.dolphinics.com/, as their !CluStar product line. A
-good set of links overviewing SCI is maintained by CERN at
-http://www.cern.ch/HSI/sci/sci.html.
-
-
-
-
-!SCSI
-
-
-
-
-
-*Linux support: ''kernel drivers''
-*
-
-*Maximum bandwidth: ''5 Mb/s'' to over ''20 Mb/s''
-*
-
-*Minimum latency: ''?''
-*
-
-*Available as: ''multiple-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI, EISA, ISA card''
-*
-
-*Network structure: ''inter-machine bus sharing SCSI devices''
-*
-
-*Cost per machine connected: ''?''
-*
-
-
-
-
-
-
-SCSI (Small Computer Systems Interconnect) is essentially an I/O bus
-that is used for disk drives, CD ROMS, image scanners, etc. There are
-three separate standards, SCSI-1, SCSI-2, and SCSI-3; Fast and Ultra
-speeds; and data path widths of 8, 16, or 32 bits (with !FireWire
-compatibility also mentioned in SCSI-3). It is all pretty confusing,
-but we all know a good SCSI is somewhat faster than EIDE and can handle
-more devices more efficiently.
-
-
-What many people do not realize is that it is fairly simple for two
-computers to share a single SCSI bus. This type of configuration is
-very useful for sharing disk drives between machines and implementing
-__fail-over__ - having one machine take over database requests
-when the other machine fails. Currently, this is the only mechanism
-supported by Microsoft's PC cluster product, !WolfPack. However, the
-inability to scale to larger systems renders shared SCSI uninteresting
-for parallel processing in general.
-
-
-
-
-!!ServerNet
-
-
-
-
-
-*Linux support: ''no''
-*
-
-*Maximum bandwidth: ''400 Mb/s''
-*
-
-*Minimum latency: ''3 microseconds''
-*
-
-*Available as: ''single-vendor hardware''
-*
-
-*Interface port/bus used: ''PCI''
-*
-
-*Network structure: ''hexagonal tree/tetrahedral lattice of hubs''
-*
-
-*Cost per machine connected: ''?''
-*
-
-
-
-
-
-
-!ServerNet is the high-performance network hardware from Tandem,
-http://www.tandem.com. Especially in the online transaction
-processing (OLTP) world, Tandem is well known as a leading producer of
-high-reliability systems, so it is not surprising that their network
-claims not just high performance, but also "high data integrity and
-reliability." Another interesting aspect of !ServerNet is that it
-claims to be able to transfer data from any device directly to any
-device; not just between processors, but also disk drives, etc., in a
-one-sided style similar to that suggested by the MPI remote memory
-access mechanisms described in section 3.5. One last comment about
-!ServerNet: although there is just a single vendor, that vendor is
-powerful enough to potentially establish !ServerNet as a major
-standard... Tandem is owned by Compaq.
-
-
-
-
-!SHRIMP
-
-
-
-
-
-*Linux support: ''user-level memory mapped interface''
-*
-
-*Maximum bandwidth: ''180 Mb/s''
-*
-
-*Minimum latency: ''5 microseconds''
-*
-
-*Available as: ''research prototype''
-*
-
-*Interface port/bus used: ''EISA''
-*
-
-*Network structure: ''mesh backplane (as in Intel Paragon)''
-*
-
-*Cost per machine connected: ''?''
-*
-
-
-
-
-
-
-The SHRIMP project,
-http://www.CS.Princeton.EDU/shrimp/,
-at the Princeton University Computer Science Department is building a
-parallel computer using PCs running Linux as the processing elements.
-The first SHRIMP (Scalable, High-Performance, Really Inexpensive
-Multi-Processor) was a simple two-processor prototype using a
-dual-ported RAM on a custom EISA card interface. There is now a
-prototype that will scale to larger configurations using a custom
-interface card to connect to a "hub" that is essentially the same mesh
-routing network used in the Intel Paragon (see
-http://www.ssd.intel.com/paragon.html). Considerable effort
-has gone into developing low-overhead "virtual memory mapped
-communication" hardware and support software.
-
-
-
-
-!SLIP
-
-
-
-
-
-*Linux support: ''kernel drivers''
-*
-
-*Maximum bandwidth: ''0.1 Mb/s''
-*
-
-*Minimum latency: ''1,000 microseconds?''
-*
-
-*Available as: ''commodity hardware''
-*
-
-*Interface port/bus used: ''RS232C''
-*
-
-*Network structure: ''cable between 2 machines''
-*
-
-*Cost per machine connected: ''$2''
-*
-
-
-
-
-
-
-Although SLIP (Serial Line Interface Protocol) is firmly planted at
-the low end of the performance spectrum, SLIP (or CSLIP or PPP) allows
-two machines to perform socket communication via ordinary RS232 serial
-ports. The RS232 ports can be connected using a null-modem RS232
-serial cable, or they can even be connected via dial-up through a
-modem. In any case, latency is high and bandwidth is low, so SLIP
-should be used only when no other alternatives are available. It is
-worth noting, however, that most PCs have two RS232 ports, so it would
-be possible to network a group of machines simply by connecting the
-machines as a linear array or as a ring. There is even load sharing
-software called EQL.
-
-
-
-
-!TTL_PAPERS
-
-
-
-
-
-*Linux support: ''AFAPI library''
-*
-
-*Maximum bandwidth: ''1.6 Mb/s''
-*
-
-*Minimum latency: ''3 microseconds''
-*
-
-*Available as: ''public-domain design, single-vendor hardware''
-*
-
-*Interface port/bus used: ''SPP''
-*
-
-*Network structure: ''tree of hubs''
-*
-
-*Cost per machine connected: ''$100''
-*
-
-
-
-
-
-
-The PAPERS (Purdue's Adapter for Parallel Execution and Rapid
-Synchronization) project,
-http://garage.ecn.purdue.edu/~papers/, at the Purdue University
-School of Electrical and Computer Engineering is building scalable,
-low-latency, aggregate function communication hardware and software
-that allows a parallel supercomputer to be built using unmodified
-PCs/workstations as nodes.
-
-
-There have been over a dozen different types of PAPERS hardware built
-that connect to PCs/workstations via the SPP (Standard Parallel Port),
-roughly following two development lines. The versions called "PAPERS"
-target higher performance, using whatever technologies are appropriate;
-current work uses FPGAs, and high bandwidth PCI bus interface designs
-are also under development. In contrast, the versions called
-"TTL_PAPERS" are designed to be easily reproduced outside
-Purdue, and are remarkably simple public domain designs that can be
-built using ordinary TTL logic. One such design is produced
-commercially,
-http://chelsea.ios.com:80/~hgdietz/sbm4.html.
-
-
-Unlike the custom hardware designs from other universities,
-TTL_PAPERS clusters have been assembled at many universities
-from the USA to South Korea. Bandwidth is severely limited by the SPP
-connections, but PAPERS implements very low latency aggregate function
-communications; even the fastest message-oriented systems cannot
-provide comparable performance on those aggregate functions. Thus,
-PAPERS is particularly good for synchronizing the displays of a video
-wall (to be discussed further in the upcoming Video Wall HOWTO),
-scheduling accesses to a high-bandwidth network, evaluating global
-fitness in genetic searches, etc. Although PAPERS clusters have been
-built using IBM PowerPC AIX, DEC Alpha OSF/1, and HP PA-RISC HP-UX
-machines, Linux-based PCs are the platforms best supported.
-
-
-User programs using TTL_PAPERS AFAPI directly access the SPP
-hardware port registers under Linux, without an OS call for each
-access. To do this, AFAPI first gets port permission using either
-iopl() or ioperm(). The problem with these calls is
-that both require the user program to be privileged, yielding a
-potential security hole. The solution is an optional kernel patch,
-http://garage.ecn.purdue.edu/~papers/giveioperm.html, that
-allows a privileged process to control port permission for any process.
-
-
-
-
-!USB (Universal Serial Bus)
-
-
-
-
-
-*Linux support: ''kernel driver''
-*
-
-*Maximum bandwidth: ''12 Mb/s''
-*
-
-*Minimum latency: ''?''
-*
-
-*Available as: ''commodity hardware''
-*
-
-*Interface port/bus used: ''USB''
-*
-
-*Network structure: ''bus''
-*
-
-*Cost per machine connected: ''$5?''
-*
-
-
-
-
-
-
-USB (Universal Serial Bus,
-http://www.usb.org/) is a
-hot-pluggable, conventional-Ethernet-speed bus for up to 127
-peripherals ranging from keyboards to video conferencing cameras. It
-isn't really clear how multiple computers get connected to each other
-using USB. In any case, USB ports are quickly becoming as standard on
-PC motherboards as RS232 and SPP, so don't be surprised if one or two
-USB ports are lurking on the back of the next PC you buy. Development
-of a Linux driver is discussed at
-http://peloncho.fis.ucm.es/~inaky/USB.html.
-
-
-In some ways, USB is almost the low-performance, zero-cost version of
-!FireWire that you can purchase today.
-
-
-
-
-!WAPERS
-
-
-
-
-
-*Linux support: ''AFAPI library''
-*
-
-*Maximum bandwidth: ''0.4 Mb/s''
-*
-
-*Minimum latency: ''3 microseconds''
-*
-
-*Available as: ''public-domain design''
-*
-
-*Interface port/bus used: ''SPP''
-*
-
-*Network structure: ''wiring pattern between 2-64 machines''
-*
-
-*Cost per machine connected: ''$5''
-*
-
-
-
-
-
-
-WAPERS (Wired-AND Adapter for Parallel Execution and Rapid
-Synchronization) is a spin-off of the PAPERS project,
-http://garage.ecn.purdue.edu/~papers/, at the Purdue University
-School of Electrical and Computer Engineering. If implemented
-properly, the SPP has four bits of open-collector output that can be
-wired together across machines to implement a 4-bit wide wired AND.
-This wired-AND is electrically touchy, and the maximum number of
-machines that can be connected in this way critically depends on the
-analog properties of the ports (maximum sink current and pull-up
-resistor value); typically, up to 7 or 8 machines can be networked by
-WAPERS. Although cost and latency are very low, so is bandwidth;
-WAPERS is much better as a second network for aggregate operations
-than as the only network in a cluster. As with TTL_PAPERS, to
-improve system security, there is a minor kernel patch recommended,
-but not required:
-http://garage.ecn.purdue.edu/~papers/giveioperm.html.
-
-
-
-
-!!3.3 Network Software Interface
-
-
-
-
-
-
-Before moving on to discuss the software support for parallel
-applications, it is useful to first briefly cover the basics of
-low-level software interface to the network hardware. There are
-really only three basic choices: sockets, device drivers, and
-user-level libraries.
-
-
-
-
-!Sockets
-
-
-
-
-
-By far the most common low-level network interface is a socket
-interface. Sockets have been a part of unix for over a decade, and
-most standard network hardware is designed to support at least two
-types of socket protocols: UDP and TCP. Both types of socket allow
-you to send arbitrary size blocks of data from one machine to another,
-but there are several important differences. Typically, both yield a
-minimum latency of around 1,000 microseconds, although performance can
-be far worse depending on network traffic.
-
-
-These socket types are the basic network software interface for most
-of the portable, higher-level, parallel processing software; for
-example, PVM uses a combination of UDP and TCP, so knowing the
-difference will help you tune performance. For even better
-performance, you can also use these mechanisms directly in your
-program. The following is just a simple overview of UDP and TCP; see
-the manual pages and a good network programming book for details.
-
-
-
-
-!UDP Protocol (SOCK_DGRAM)
-
-
-
-
-
-__UDP__ is the User Datagram Protocol, but you more easily can
-remember the properties of UDP as Unreliable Datagram Processing. In
-other words, UDP allows each block to be sent as an individual message,
-but a message might be lost in transmission. In fact, depending on
-network traffic, UDP messages can be lost, can arrive multiple times,
-or can arrive in an order different from that in which they were
-sent. The sender of a UDP message does not automatically get an
-acknowledgment, so it is up to user-written code to detect and
-compensate for these problems. Fortunately, UDP does ensure that if a
-message arrives, the message contents are intact (i.e., you never get
-just part of a UDP message).
-
-
-The nice thing about UDP is that it tends to be the fastest socket
-protocol. Further, UDP is "connectionless," which means that each
-message is essentially independent of all others. A good analogy is
-that each message is like a letter to be mailed; you might send
-multiple letters to the same address, but each one is independent of
-the others and there is no limit on how many people you can send
-letters to.
-
-
-
-
-!TCP Protocol (SOCK_STREAM)
-
-
-
-
-
-Unlike UDP, __TCP__ is a reliable, connection-based protocol.
-Each block sent is not seen as a message, but as a block of data
-within an apparently continuous stream of bytes being transmitted
-through a connection between sender and receiver. This is very
-different from UDP messaging because each block is simply part of the
-byte stream and it is up to the user code to figure-out how to extract
-each block from the byte stream; there are no markings separating
-messages. Further, the connections are more fragile with respect to
-network problems, and only a limited number of connections can exist
-simultaneously for each process. Because it is reliable, TCP
-generally implies significantly more overhead than UDP.
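-
-
-Since there are no message boundaries in the byte stream, receiving
-code typically loops until a complete block has arrived. The helper
-below is only a sketch of that idea; it is not part of any library
-mentioned in this document:
-
-
-----
-
-/* Read exactly "len" bytes from a connected TCP socket; because
-   TCP marks no message boundaries, the caller imposes its own
-   framing (here, fixed-size blocks) */
-#include <unistd.h>
-
-int
-recv_block(int sock, char *buf, int len)
-{
-  int got = 0;
-
-  while (got < len) {
-    int n = read(sock, buf + got, len - got);
-    if (n <= 0) return(-1);   /* error or connection closed */
-    got += n;
-  }
-  return(got);
-}
-
-----
-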
-
-
-There are, however, a few pleasant surprises about TCP. One is that,
-if multiple messages are sent through a connection, TCP is able to
-pack them together in a buffer to better match network hardware packet
-sizes, potentially yielding better-than-UDP performance for groups of
-short or oddly-sized messages. The other bonus is that networks
-constructed using reliable direct physical links between machines can
-easily and efficiently simulate TCP connections. For example, this was
-done for the !ParaStation's "Socket Library" interface software, which
-provides TCP semantics using user-level calls that differ from the
-standard TCP OS calls only by the addition of the prefix
-PSS to each function name.
-
-
-
-
-!Device Drivers
-
-
-
-
-
-When it comes to actually pushing data onto the network or pulling data
-off the network, the standard unix software interface is a part of the
-unix kernel called a device driver. UDP and TCP don't just transport
-data; they also imply a fair amount of overhead for socket management.
-For example, something has to manage the fact that multiple TCP
-connections can share a single physical network interface. In
-contrast, a device driver for a dedicated network interface only needs
-to implement a few simple data transport functions. These device
-driver functions can then be invoked by user programs by using
-open() to identify the proper device and then using system
-calls like read() and write() on the open "file."
-Thus, each such operation could transport a block of data with little
-more than the overhead of a system call, which might be as fast as
-tens of microseconds.
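-
-
-For example, a user program might use such a dedicated driver roughly
-as sketched below; the device name /dev/mynet0 and the fixed 64-byte
-block size are purely hypothetical:
-
-
-----
-
-/* Hypothetical use of a dedicated network device driver */
-#include <fcntl.h>
-#include <unistd.h>
-
-int
-main(void)
-{
-  char block[[64] = { 0 };
-  int fd = open("/dev/mynet0", O_RDWR);  /* identify the device */
-
-  if (fd < 0) return(1);
-  write(fd, block, sizeof(block));  /* push a block onto the net */
-  read(fd, block, sizeof(block));   /* pull a block off the net */
-  close(fd);
-  return(0);
-}
-
-----
-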
-
-
-Writing a device driver to be used with Linux is not hard... if you
-know ''precisely'' how the device hardware works. If you are not
-sure how it works, don't guess. Debugging device drivers isn't fun
-and mistakes can fry hardware. However, if that hasn't scared you
-off, it may be possible to write a device driver to, for example, use
-dedicated Ethernet cards as dumb but fast direct machine-to-machine
-connections without the usual Ethernet protocol overhead. In fact,
-that's pretty much what some early Intel supercomputers did.... Look
-at the Device Driver HOWTO for more information.
-
-
-
-
-!User-Level Libraries
-
-
-
-
-
-If you've taken an OS course, user-level access to hardware device
-registers is exactly what you have been taught never to do, because
-one of the primary purposes of an OS is to control device access.
-However, an OS call is at least tens of microseconds of overhead. For
-custom network hardware like TTL_PAPERS, which can perform a
-basic network operation in just 3 microseconds, such OS call overhead
-is intolerable. The only way to avoid that overhead is to have
-user-level code - a user-level library - directly access hardware
-device registers. Thus, the question becomes one of how a user-level
-library can access hardware directly, yet not compromise the OS
-control of device access rights.
-
-
-On a typical system, the only way for a user-level library to directly
-access hardware device registers is to:
-
-
-
-
-
-#At user program start-up, use an OS call to map the page of
-memory address space containing the device registers into the user
-process virtual memory map. For some systems, the mmap()
-call (first mentioned in section 2.6) can be used to map a special
-file which represents the physical memory page addresses of the I/O
-devices. Alternatively, it is relatively simple to write a device
-driver to perform this function. Further, this device driver can
-control access by only mapping the page(s) containing the specific
-device registers needed, thereby maintaining OS access control (a sketch of this mapping approach appears just after this list).
-
-#
-
-#Access device registers without an OS call by simply loading or
-storing to the mapped addresses. For example, *((char *) 0x1234) =
-5; would store the byte value 5 into memory location 1234
-(hexadecimal).
-#
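-
-
-As promised above, here is a sketch of the memory-mapping approach,
-using the /dev/mem special file (which represents physical memory
-under Linux); the device register address is a placeholder that would
-be replaced by the address documented for your hardware:
-
-
-----
-
-/* Sketch: map the page containing a memory-mapped device register
-   into user space; DEV_REG_PHYS is a made-up placeholder address */
-#include <fcntl.h>
-#include <unistd.h>
-#include <sys/mman.h>
-
-#define DEV_REG_PHYS 0x000a0000L   /* substitute your device's address */
-
-volatile char *
-map_device_register(void)
-{
-  long pagesize = getpagesize();
-  int fd = open("/dev/mem", O_RDWR);
-  char *page;
-
-  if (fd < 0) return(0);
-  page = mmap(0, pagesize, (PROT_READ | PROT_WRITE), MAP_SHARED,
-              fd, DEV_REG_PHYS & ~(pagesize - 1));
-  if (page == MAP_FAILED) return(0);
-
-  /* ordinary loads and stores now reach the device registers */
-  return(page + (DEV_REG_PHYS & (pagesize - 1)));
-}
-
-----
-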
-
-
-
-Fortunately, it happens that Linux for the Intel 386 (and compatible
-processors) offers an even better solution:
-
-
-
-
-
-#Using the ioperm() OS call from a privileged process,
-get permission to access the precise I/O port addresses that
-correspond to the device registers. Alternatively, permission can be
-managed by an independent privileged user process (i.e., a "meta OS")
-using the
-giveioperm() OS call patch for Linux.
-
-#
-
-#Access device registers without an OS call by using 386 port I/O
-instructions.
-#
-
-
-
-
-
-
-This second solution is preferable because it is common that multiple
-I/O devices have their registers within a single page, in which case
-the first technique would not provide protection against accessing
-other device registers that happened to reside in the same page as the
-ones intended. Of course, the down side is that 386 port I/O
-instructions cannot be coded in C - instead, you will need to use a
-bit of assembly code. The GCC-wrapped (usable in C programs) inline
-assembly code function for a port input of a byte value is:
-
-
-
-----
-
-extern inline unsigned char
-inb(unsigned short port)
-{
-  unsigned char _v;
-  __asm__ __volatile__ ("inb %w1,%b0"
-                        :"=a" (_v)
-                        :"d" (port), "0" (0));
-  return _v;
-}
-
-----
-
-
-Similarly, the GCC-wrapped code for a byte port output is:
-
-
-
-----
-
-extern inline void
-outb(unsigned char value,
-     unsigned short port)
-{
-  __asm__ __volatile__ ("outb %b0,%w1"
-                        :/* no outputs */
-                        :"a" (value), "d" (port));
-}
-
-----
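-
-
-For completeness, here is a hedged sketch of how these wrappers might be
-used together with ioperm(). It is not from the original document:
-MY_PORT_BASE is a made-up register base address for some hypothetical
-device, and the program must run as root for ioperm() to succeed:
-
-
------
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/io.h>   /* ioperm(), plus inb()/outb() wrappers like those above */
-
-#define MY_PORT_BASE 0x300   /* hypothetical device register base address */
-
-int
-main(void)
-{
-  unsigned char status;
-  if (ioperm(MY_PORT_BASE, 8, 1)) {   /* request access to 8 consecutive ports */
-    perror("ioperm");
-    exit(1);
-  }
-  outb(1, MY_PORT_BASE);              /* write a (made-up) control byte */
-  status = inb(MY_PORT_BASE + 1);     /* read back a (made-up) status byte */
-  printf("status = 0x%02x\n", status);
-  return(0);
-}
-
------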
-
-
-
-
-!!3.4 PVM (Parallel Virtual Machine)
-
-
-
-
-
-
-PVM (Parallel Virtual Machine) is a freely-available, portable,
-message-passing library generally implemented on top of sockets. It
-is clearly established as the de-facto standard for message-passing
-cluster parallel computing.
-
-
-PVM supports single-processor and SMP Linux machines, as well as
-clusters of Linux machines linked by socket-capable networks (e.g.,
-SLIP, PLIP, Ethernet, ATM). In fact, PVM will even work across groups
-of machines in which a variety of different types of processors,
-configurations, and physical networks are used - __Heterogeneous
-Clusters__ - even to the scale of treating machines linked by the
-Internet as a parallel cluster. PVM also provides facilities for
-parallel job control across a cluster. Best of all, PVM has long been
-freely available (currently from
-http://www.epm.ornl.gov/pvm/pvm_home.html), which has
-led to many programming language compilers, application libraries,
-programming and debugging tools, etc., using it as their "portable
-message-passing target library." There is also a network newsgroup,
-comp.parallel.pvm.
-
-
-It is important to note, however, that PVM message-passing calls
-generally add significant overhead to standard socket operations,
-which already had high latency. Further, the message handling calls
-themselves do not constitute a particularly "friendly" programming
-model.
-
-
-Using the same Pi computation example first described in section 1.3,
-the version using C with PVM library calls is:
-
-
-
-----
-
-#include <stdlib.h>
-#include <stdio.h>
-#include <pvm3.h>
-#define NPROC 4
-main(int argc, char **argv)
-{
-  register double lsum, width;
-  double sum;
-  register int intervals, i;
-  int mytid, iproc, msgtag = 4;
-  int tids[[NPROC];  /* array of task ids */
-  /* enroll in pvm */
-  mytid = pvm_mytid();
-  /* Join a group and, if I am the first instance,
-     iproc=0, spawn more copies of myself
-  */
-  iproc = pvm_joingroup("pi");
-  if (iproc == 0) {
-    tids[[0] = pvm_mytid();
-    pvm_spawn("pvm_pi", &argv[[1], 0, NULL, NPROC-1, &tids[[1]);
-  }
-  /* make sure all processes are here */
-  pvm_barrier("pi", NPROC);
-  /* get the number of intervals */
-  intervals = atoi(argv[[1]);
-  width = 1.0 / intervals;
-  lsum = 0.0;
-  for (i = iproc; i<intervals; i+=NPROC) {
-    register double x = (i + 0.5) * width;
-    lsum += 4.0 / (1.0 + x * x);
-  }
-  /* sum across the local results & scale by width */
-  sum = lsum * width;
-  pvm_reduce(!PvmSum, &sum, 1, PVM_DOUBLE, msgtag, "pi", 0);
-  /* have only the console PE print the result */
-  if (iproc == 0) {
-    printf("Estimation of pi is %f\n", sum);
-  }
-  /* Check program finished, leave group, exit pvm */
-  pvm_barrier("pi", NPROC);
-  pvm_lvgroup("pi");
-  pvm_exit();
-  return(0);
-}
-
-----
-
-
-
-
-!!3.5 MPI (Message Passing Interface)
-
-
-
-
-
-
-Although PVM is the de-facto standard message-passing library, MPI
-(Message Passing Interface) is the relatively new official standard.
-The home page for the MPI standard is
-http://www.mcs.anl.gov:80/mpi/ and the newsgroup is
-comp.parallel.mpi.
-
-
-However, before discussing MPI, I feel compelled to say a little bit
-about the PVM vs. MPI religious war that has been going on for the
-past few years. I'm not really on either side. Here's my attempt at
-a relatively unbiased summary of the differences:
-
-
-
-
-; __Execution control environment.__:
-
-Put simply, PVM has one and
-MPI doesn't specify how/if one is implemented. Thus, things like
-starting a PVM program executing are done identically everywhere, while
-for MPI it depends on which implementation is being used.
-
-
-
-; __Support for heterogeneous clusters.__:
-
-PVM grew up in the
-workstation cycle-scavenging world, and thus directly manages
-heterogeneous mixes of machines and operating systems. In contrast,
-MPI largely assumes that the target is an MPP (Massively Parallel
-Processor) or a dedicated cluster of nearly identical workstations.
-
-
-
-; __Kitchen sink syndrome.__:
-
-PVM evidences a unity of purpose that
-MPI 2.0 doesn't. The new MPI 2.0 standard includes a lot of features
-that go way beyond the basic message passing model - things like RMA
-(Remote Memory Access) and parallel file I/O. Are these things
-useful? Of course they are... but learning MPI 2.0 is a lot like
-learning a complete new programming language.
-
-
-
-; __User interface design.__:
-
-MPI was designed after PVM, and
-clearly learned from it. MPI offers simpler, more efficient, buffer
-handling and higher-level abstractions allowing user-defined data
-structures to be transmitted in messages.
-
-
-
-; __The force of law.__:
-
-By my count, there are still
-significantly more things designed to use PVM than there are to use
-MPI; however, porting them to MPI is easy, and the fact that MPI is
-backed by a widely-supported formal standard means that using MPI is,
-for many institutions, a matter of policy.
-
-
-
-Conclusion? Well, there are at least three independently developed,
-freely available, versions of MPI that can run on clusters of Linux
-systems (and I wrote one of them):
-
-
-
-
-
-*LAM (Local Area Multicomputer) is a full implementation of the
-MPI 1.1 standard. It allows MPI programs to be executed within an
-individual Linux system or across a cluster of Linux systems using
-UDP/TCP socket communication. The system includes simple execution
-control facilities, as well as a variety of program development and
-debugging aids. It is freely available from
-http://www.osc.edu/lam.html.
-
-*
-
-*MPICH (MPI CHameleon) is designed as a highly portable full
-implementation of the MPI 1.1 standard. Like LAM, it allows MPI
-programs to be executed within an individual Linux system or across a
-cluster of Linux systems using UDP/TCP socket communication. However,
-the emphasis is definitely on promoting MPI by providing an efficient,
-easily retargetable, implementation. To port this MPI implementation,
-one implements either the five functions of the "channel interface"
-or, for better performance, the full MPICH ADI (Abstract Device
-Interface). MPICH, and lots of information about it and porting, are
-available from
-http://www.mcs.anl.gov/mpi/mpich/.
-
-*
-
-*AFMPI (Aggregate Function MPI) is a subset implementation of the
-MPI 2.0 standard. This is the one that I wrote. Built on top of the
-AFAPI, it is designed to showcase low-latency collective communication
-functions and RMAs, and thus provides only minimal support for MPI
-data types, communicators, etc. It allows C programs using MPI to run
-on an individual Linux system or across a cluster connected by
-AFAPI-capable network hardware. It is freely available from
-http://garage.ecn.purdue.edu/~papers/.
-*
-
-
-
-No matter which of these (or other) MPI implementations one uses, it
-is fairly simple to perform the most common types of communications.
-
-
-However, MPI 2.0 incorporates several communication paradigms that are
-fundamentally different enough so that a programmer using one of them
-might not even recognize the other coding styles as MPI. Thus, rather
-than giving a single example program, it is useful to have an example
-of each of the fundamentally different communication paradigms that
-MPI supports. All three programs implement the same basic algorithm
-(from section 1.3) that is used throughout this HOWTO to compute the
-value of Pi.
-
-
-The first MPI program uses basic MPI message-passing calls for each
-processor to send its partial sum to processor 0, which sums and
-prints the result:
-
-
-
-----
-
-#include <stdlib.h>
-#include <stdio.h>
-#include <mpi.h>
-main(int argc, char **argv)
-{
-  register double width;
-  double sum, lsum;
-  register int intervals, i;
-  int nproc, iproc;
-  MPI_Status status;
-  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
-  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
-  MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
-  intervals = atoi(argv[[1]);
-  width = 1.0 / intervals;
-  lsum = 0;
-  for (i=iproc; i<intervals; i+=nproc) {
-    register double x = (i + 0.5) * width;
-    lsum += 4.0 / (1.0 + x * x);
-  }
-  lsum *= width;
-  if (iproc != 0) {
-    MPI_Send(&lsum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
-  } else {
-    sum = lsum;
-    for (i=1; i<nproc; ++i) {
-      MPI_Recv(&lsum, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
-               MPI_ANY_TAG, MPI_COMM_WORLD, &status);
-      sum += lsum;
-    }
-    printf("Estimation of pi is %f\n", sum);
-  }
-  MPI_Finalize();
-  return(0);
-}
-
-----
-
-
-The second MPI version uses collective communication (which, for this
-particular application, is clearly the most appropriate):
-
-
-
-----
-
-#include <stdlib.h>
-#include <stdio.h>
-#include <mpi.h>
-main(int argc, char **argv)
-{
-  register double width;
-  double sum, lsum;
-  register int intervals, i;
-  int nproc, iproc;
-  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
-  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
-  MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
-  intervals = atoi(argv[[1]);
-  width = 1.0 / intervals;
-  lsum = 0;
-  for (i=iproc; i<intervals; i+=nproc) {
-    register double x = (i + 0.5) * width;
-    lsum += 4.0 / (1.0 + x * x);
-  }
-  lsum *= width;
-  MPI_Reduce(&lsum, &sum, 1, MPI_DOUBLE,
-             MPI_SUM, 0, MPI_COMM_WORLD);
-  if (iproc == 0) {
-    printf("Estimation of pi is %f\n", sum);
-  }
-  MPI_Finalize();
-  return(0);
-}
-
-----
-
-
-The third MPI version uses the MPI 2.0 RMA mechanism for each processor
-to add its local lsum into sum on processor 0:
-
-
-
-----
-
-#include <stdlib.h>
-#include <stdio.h>
-#include <mpi.h>
-main(int argc, char **argv)
-{
-  register double width;
-  double sum = 0, lsum;
-  register int intervals, i;
-  int nproc, iproc;
-  MPI_Win sum_win;
-  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
-  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
-  MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
-  MPI_Win_create(&sum, sizeof(sum), sizeof(sum),
-                 0, MPI_COMM_WORLD, &sum_win);
-  MPI_Win_fence(0, sum_win);
-  intervals = atoi(argv[[1]);
-  width = 1.0 / intervals;
-  lsum = 0;
-  for (i=iproc; i<intervals; i+=nproc) {
-    register double x = (i + 0.5) * width;
-    lsum += 4.0 / (1.0 + x * x);
-  }
-  lsum *= width;
-  MPI_Accumulate(&lsum, 1, MPI_DOUBLE, 0, 0,
-                 1, MPI_DOUBLE, MPI_SUM, sum_win);
-  MPI_Win_fence(0, sum_win);
-  if (iproc == 0) {
-    printf("Estimation of pi is %f\n", sum);
-  }
-  MPI_Finalize();
-  return(0);
-}
-
-----
-
-
-It is useful to note that the MPI 2.0 RMA mechanism very neatly
-overcomes any potential problems with the corresponding data structure
-on various processors residing at different memory locations. This is
-done by referencing a "window" that implies the base address,
-protection against out-of-bound accesses, and even address scaling.
-Efficient implementation is aided by the fact that RMA processing may
-be delayed until the next MPI_Win_fence. In
-summary, the RMA mechanism may be a strange cross between distributed
-shared memory and message passing, but it is a very clean interface
-that potentially generates very efficient communication.
-
-
-
-
-!!3.6 AFAPI (Aggregate Function API)
-
-
-
-
-
-
-Unlike PVM, MPI, etc., the AFAPI (Aggregate Function Application
-Program Interface) did not start life as an attempt to build a
-portable abstract interface layered on top of existing network
-hardware and software. Rather, AFAPI began as the very
-hardware-specific low-level support library for PAPERS (Purdue's
-Adapter for Parallel Execution and Rapid Synchronization; see
-http://garage.ecn.purdue.edu/~papers/).
-
-
-PAPERS was discussed briefly in section 3.2; it is a custom aggregate
-function network of public-domain design that delivers latencies as
-low as a few microseconds. However, the important thing about PAPERS
-is that it was developed as an attempt to build a supercomputer that
-would be a better target for compiler technology than existing
-supercomputers. This is qualitatively different from most Linux
-cluster efforts and PVM/MPI, which generally focus on trying to use
-standard networks for the relatively few sufficiently coarse-grain
-parallel applications. The fact that Linux PCs are used as components
-of PAPERS systems is simply an artifact of implementing prototypes in
-the most cost-effective way possible.
-
-
-The need for a common low-level software interface across more than a
-dozen different prototype implementations was what made the PAPERS
-library become standardized as AFAPI. However, the model used by
-AFAPI is inherently simpler and better suited for the finer-grain
-interactions typical of code compiled by parallelizing compilers or
-written for SIMD architectures. The simplicity of the model not only
-makes PAPERS hardware easy to build, but also yields surprisingly
-efficient AFAPI ports for a variety of other hardware systems, such as
-SMPs.
-
-
-AFAPI currently runs on Linux clusters connected using TTL_PAPERS,
-CAPERS, or WAPERS. It also runs (without OS calls or even bus-lock
-instructions, see section 2.2) on SMP systems using a System V Shared
-Memory library called SHMAPERS. A version that runs across Linux
-clusters using UDP broadcasts on conventional networks (e.g.,
-Ethernet) is under development. All released versions are available
-from
-http://garage.ecn.purdue.edu/~papers/. All versions
-of the AFAPI are designed to be called from C or C++.
-
-
-The following example program is the AFAPI version of the Pi
-computation described in section 1.3.
-
-
-
-----
-
-#include <stdlib.h>
-#include <stdio.h>
-#include "afapi.h"
-main(int argc, char **argv)
-{
-  register double width, sum;
-  register int intervals, i;
-  if (p_init()) exit(1);
-  intervals = atoi(argv[[1]);
-  width = 1.0 / intervals;
-  sum = 0;
-  for (i=IPROC; i<intervals; i+=NPROC) {
-    register double x = (i + 0.5) * width;
-    sum += 4.0 / (1.0 + x * x);
-  }
-  sum = p_reduceAdd64f(sum) * width;
-  if (IPROC == CPROC) {
-    printf("Estimation of pi is %f\n", sum);
-  }
-  p_exit();
-  return(0);
-}
-
-----
-
-
-
-
-!!3.7 Other Cluster Support Libraries
-
-
-
-
-
-
-In addition to PVM, MPI, and AFAPI, the following libraries offer
-features that may be useful in parallel computing using a cluster of
-Linux systems. These systems are given a lighter treatment in this
-document simply because, unlike PVM, MPI, and AFAPI, I have little or
-no direct experience with the use of these systems on Linux clusters.
-If you find any of these or other libraries to be especially useful,
-please send email to me at
-pplinux@ecn.purdue.edu describing what you've found, and I will
-consider adding an expanded section on that library.
-
-
-
-
-!Condor (process migration support)
-
-
-
-
-
-Condor is a distributed resource management system that can manage
-large heterogeneous clusters of workstations. Its design has been
-motivated by the needs of users who would like to use the unutilized
-capacity of such clusters for their long-running,
-computation-intensive jobs. Condor preserves a large measure of the
-originating machine's environment on the execution machine, even if
-the originating and execution machines do not share a common file
-system and/or password mechanisms. Condor jobs that consist of a
-single process are automatically checkpointed and migrated between
-workstations as needed to ensure eventual completion.
-
-
-Condor is available at
-http://www.cs.wisc.edu/condor/. A
-Linux port exists; more information is available at
-http://www.cs.wisc.edu/condor/linux/linux.html. Contact
-condor-admin@cs.wisc.edu for details.
-
-
-
-
-!DFN-RPC (German Research Network - Remote Procedure Call)
-
-
-
-
-
-The DFN-RPC (German Research Network Remote Procedure Call) tool was
-developed to distribute and parallelize scientific-technical
-application programs between a workstation and a compute server or a
-cluster. The interface is optimized for applications written in
-Fortran, but the DFN-RPC can also be used in a C environment. It has
-been ported to Linux. More information is at
-ftp://ftp.uni-stuttgart.de/pub/rus/dfn_rpc/README_dfnrpc.html.
-
-
-
-
-!DQS (Distributed Queueing System)
-
-
-
-
-
-Not exactly a library, DQS 3.0 (Distributed Queueing System) is a job
-queueing system that has been developed and tested under Linux. It is
-designed to allow both use and administration of a heterogeneous
-cluster as a single entity. It is available from
-http://www.scri.fsu.edu/~pasko/dqs.html.
-
-
-There is also a commercial version called CODINE 4.1.1 (COmputing in
-DIstributed Network Environments). Information on it is available
-from
-http://www.genias.de/genias_welcome.html.
-
-
-
-
-!!3.8 General Cluster References
-
-
-
-
-
-
-Because clusters can be constructed and used in so many different ways,
-there are quite a few groups that have made interesting contributions.
-The following are references to various cluster-related projects that
-may be of general interest. This includes a mix of Linux-specific and
-generic cluster references. The list is given in alphabetical order.
-
-
-
-
-!Beowulf
-
-
-
-
-
-The Beowulf project,
-http://cesdis1.gsfc.nasa.gov/beowulf/, centers on production of
-software for using off-the-shelf clustered workstations based on
-commodity PC-class hardware, a high-bandwidth cluster-internal
-network, and the Linux operating system.
-
-
-Thomas Sterling has been the driving force behind Beowulf, and
-continues to be an eloquent and outspoken proponent of Linux
-clustering for scientific computing in general. In fact, many groups
-now refer to their clusters as "Beowulf class" systems - even if the
-cluster isn't really all that similar to the official Beowulf design.
-
-
-Don Becker, working in support of the Beowulf project, has produced
-many of the network drivers used by Linux in general. Many of these
-drivers have even been adapted for use in BSD. Don is also
-responsible for making many of these Linux network drivers support
-load-sharing across multiple parallel connections to achieve higher
-bandwidth without expensive switched hubs. This type of load sharing
-was the original distinguishing feature of the Beowulf cluster.
-
-
-
-
-!Linux/AP+
-
-
-
-
-
-The Linux/AP+ project,
-http://cap.anu.edu.au/cap/projects/linux/, is not exactly about
-Linux clustering, but centers on porting Linux to the Fujitsu AP1000+
-and adding appropriate parallel processing enhancements. The AP1000+
-is a commercially available SPARC-based parallel machine that uses a
-custom network with a torus topology, 25 MB/s bandwidth, and 10
-microsecond latency... in short, it looks a lot like a SPARC Linux
-cluster.
-
-
-
-
-!Locust
-
-
-
-
-
-The Locust project,
-http://www.ecsl.cs.sunysb.edu/~manish/locust/, is building a
-distributed virtual shared memory system that uses compile-time
-information to hide message-latency and to reduce network traffic at
-run time. Pupa is the underlying communication subsystem of Locust,
-and is implemented using Ethernet to connect 486 PCs under FreeBSD.
-Linux?
-
-
-
-
-!Midway DSM (Distributed Shared Memory)
-
-
-
-
-
-Midway,
-[http://www.cs.cmu.edu/afs/cs.cmu.edu/project/midway/WWW/!HomePage.html],
-is a software-based DSM (Distributed Shared Memory) system, not unlike
-!TreadMarks. The good news is that it uses compile-time aids rather
-than relatively slow page-fault mechanisms, and it is free. The bad
-news is that it doesn't run on Linux clusters.
-
-
-
-
-!Mosix
-
-
-
-
-
-MOSIX modifies the BSDI BSD/OS to provide dynamic load balancing and
-preemptive process migration across a networked group of PCs. This is
-nice stuff not just for parallel processing, but for generally using a
-cluster much like a scalable SMP. Will there be a Linux version? Look
-at
-http://www.cs.huji.ac.il/mosix/ for more information.
-
-
-
-
-!NOW (Network Of Workstations)
-
-
-
-
-
-The Berkeley NOW (Network Of Workstations) project,
-http://now.cs.berkeley.edu/, has led much of the push toward
-parallel computing using networks of workstations. There is a lot of
-work going on here, all aimed toward "demonstrating a practical 100
-processor system in the next few years." Alas, they don't use Linux.
-
-
-
-
-!Parallel Processing Using Linux
-
-
-
-
-
-The parallel processing using Linux WWW site,
-http://yara.ecn.purdue.edu/~pplinux/, is the home of this HOWTO
-and many related documents including online slides for a full-day
-tutorial. Aside from the work on the PAPERS project, the Purdue
-University School of Electrical and Computer Engineering generally has
-been a leader in parallel processing; this site was established to
-help others apply Linux PCs for parallel processing.
-
-
-Since Purdue's first cluster of Linux PCs was assembled in February
-1994, there have been many Linux PC clusters assembled at Purdue,
-including several with video walls. Although these clusters used 386,
-486, and Pentium systems (no Pentium Pro systems), Intel recently
-awarded Purdue a donation which will allow it to construct multiple
-large clusters of Pentium II systems (with as many as 165 machines
-planned for a single cluster). Although these clusters all have/will
-have PAPERS networks, most also have conventional networks.
-
-
-
-
-!Pentium Pro Cluster Workshop
-
-
-
-
-
-In Des Moines, Iowa, April 10-11, 1997, AMES Laboratory held the
-Pentium Pro Cluster Workshop. The WWW site from this workshop,
-http://www.scl.ameslab.gov/workshops/PPCworkshop.html, contains
-a wealth of PC cluster information gathered from all the attendees.
-
-
-
-
-!!TreadMarks DSM (Distributed Shared Memory)
-
-
-
-
-
-DSM (Distributed Shared Memory) is a technique whereby a
-message-passing system can appear to behave as an SMP. There are
-quite a few such systems, most of which use the OS page-fault mechanism
-to trigger message transmissions. !TreadMarks,
-http://www.cs.rice.edu/~willy/!TreadMarks/overview.html, is one
-of the more efficient of such systems and does run on Linux clusters.
-The bad news is "!TreadMarks is being distributed at a small cost to
-universities and nonprofit institutions." For more information about
-the software, contact
-treadmarks@ece.rice.edu.
-
-
-
-
-!U-Net (User-level NETwork interface architecture)
-
-
-
-
-
-The U-Net (User-level NETwork interface architecture) project at
-Cornell,
-http://www2.cs.cornell.edu/U-Net/Default.html,
-attempts to provide low latency and high bandwidth using commodity
-network hardware by virtualizing the network interface so that
-applications can send and receive messages without operating system
-intervention. U-Net runs on Linux PCs using a DECchip DC21140 based
-Fast Ethernet card or a Fore Systems PCA-200 (not PCA-200E) ATM card.
-
-
-
-
-!WWT (Wisconsin Wind Tunnel)
-
-
-
-
-
-There is really quite a lot of cluster-related work at Wisconsin. The
-WWT (Wisconsin Wind Tunnel) project,
-http://www.cs.wisc.edu/~wwt/, is doing all sorts of work toward
-developing a "standard" interface between compilers and the underlying
-parallel hardware. There is the Wisconsin COW (Cluster Of
-Workstations), Cooperative Shared Memory and Tempest, the Paradyn
-Parallel Performance Tools, etc. Unfortunately, there is not much
-about Linux.
-
-
-
-----
-
-!!4. SIMD Within A Register (e.g., using MMX)
-
-
-
-
-
-SIMD (Single Instruction stream, Multiple Data stream) Within A
-Register (SWAR) isn't a new idea. Given a machine with ''k''-bit
-registers, data paths, and function units, it has long been known that
-ordinary register operations can function as SIMD parallel operations
-on ''n'', ''k''/''n''-bit, integer field values.
-However, it is only with the recent push for multimedia that the 2x to
-8x speedup offered by SWAR techniques has become a concern for
-mainstream computing. The 1997 versions of most microprocessors
-incorporate hardware support for SWAR:
-
-
-
-
-
-*
-AMD K6 MMX (!MultiMedia eXtensions)
-
-*
-
-*
-Cyrix M2 MMX (!MultiMedia eXtensions)
-
-*
-
-*
-Digital Alpha MAX (MultimediA eXtensions)
-
-*
-
-*
-Hewlett-Packard PA-RISC MAX (Multimedia Acceleration eXtensions)
-
-*
-
-*
-Intel Pentium II & Pentium with MMX (!MultiMedia eXtensions)
-
-*
-
-*
-Microunity Mediaprocessor SIGD (Single Instruction on Groups of Data)
-
-*
-
-*
-MIPS Digital Media eXtension (MDMX, pronounced Mad Max)
-
-*
-
-*
-Sun SPARC V9 VIS (Visual Instruction Set)
-*
-
-
-
-There are a few holes in the hardware support provided by the new
-microprocessors, quirks like only supporting some operations for some
-field sizes. It is important to remember, however, that you don't
-need any hardware support for many SWAR operations to be efficient.
-For example, bitwise operations are not affected by the logical
-partitioning of a register.
-
-
-
-
-!!4.1 SWAR: What Is It Good For?
-
-
-
-
-
-
-Although ''every'' modern processor is capable of executing with
-at least some SWAR parallelism, the sad fact is that even the best
-SWAR-enhanced instruction sets do not support very general-purpose
-parallelism. In fact, many people have noticed that the performance
-difference between Pentium and "Pentium with MMX technology" is often
-due to things like the larger L1 cache that coincided with appearance
-of MMX. So, realistically, what is SWAR (or MMX) good for?
-
-
-
-
-
-*Integers only, the smaller the better. Two 32-bit values fit in
-a 64-bit MMX register, but so do eight one-byte characters or even an
-entire chess board worth of one-bit values.
-Note: there ''will be a floating-point version of MMX'', although
-very little has been said about it at this writing. Cyrix has posted
-a set of slides,
-ftp://ftp.cyrix.com/developr/mpf97rm.pdf,
-that includes a few comments about __MMFP__. Apparently, MMFP
-will support two 32-bit floating-point numbers to be packed into a
-64-bit MMX register; combining this with two MMFP pipelines will yield
-four single-precision FLOPs per clock.
-
-*
-
-*SIMD or vector-style parallelism. The same operation is applied
-to all fields simultaneously. There are ways to nullify the effects on
-selected fields (i.e., equivalent to SIMD enable masking), but they
-complicate coding and hurt performance.
-
-*
-
-*Localized, regular (preferably packed), memory reference
-patterns. SWAR in general, and MMX in particular, are terrible at
-randomly-ordered accesses; gathering a vector x[[y] (where
-y is an index array) is prohibitively expensive.
-*
-
-
-
-These are serious restrictions, but this type of parallelism occurs in
-many parallel algorithms - not just multimedia applications. For the
-right type of algorithm, SWAR is more effective than SMP or cluster
-parallelism... and it doesn't cost anything to use it.
-
-
-
-
-!!4.2 Introduction To SWAR Programming
-
-
-
-
-
-
-The basic concept of SWAR, SIMD Within A Register, is that operations
-on word-length registers can be used to speed-up computations by
-performing SIMD parallel operations on ''n''
-''k''/''n''-bit field values. However, making use of
-SWAR technology can be awkward, and some SWAR operations are actually
-more expensive than the corresponding sequences of serial operations
-because they require additional instructions to enforce the field
-partitioning.
-
-
-To illustrate this point, let's consider a greatly simplified SWAR
-mechanism that manages four 8-bit fields within each 32-bit register.
-The values in two registers might be represented as:
-
-
-
-----
-
-PE3 PE2 PE1 PE0
-+-------+-------+-------+-------+
-Reg0 | D 7:0 | C 7:0 | B 7:0 | A 7:0 |
-+-------+-------+-------+-------+
-Reg1 | H 7:0 | G 7:0 | F 7:0 | E 7:0 |
-+-------+-------+-------+-------+
-
-----
-
-
-This simply indicates that each register is viewed as essentially a
-vector of four independent 8-bit integer values. Alternatively, think
-of A and E as values in Reg0 and Reg1 of processing
-element 0 (PE0), B and F as values in PE1's
-registers, and so forth.
-
-
-The remainder of this document briefly reviews the basic classes of
-SIMD parallel operations on these integer vectors and how these
-functions can be implemented.
-
-
-
-
-!Polymorphic Operations
-
-
-
-
-
-Some SWAR operations can be performed trivially using ordinary 32-bit
-integer operations, without concern for the fact that the operation is
-really intended to operate independently in parallel on these 8-bit
-fields. We call any such SWAR operation ''polymorphic'', since
-the function is unaffected by the field types (sizes).
-
-
-Testing if any field is non-zero is polymorphic, as are all bitwise
-logic operations. For example, an ordinary bitwise-and operation (C's
-& operator) performs a bitwise and no matter what the
-field sizes are. A simple bitwise and of the above registers yields:
-
-
-
-----
-
-PE3 PE2 PE1 PE0
-+---------+---------+---------+---------+
-Reg2 | D&H 7:0 | C&G 7:0 | B&F 7:0 | A&E 7:0 |
-+---------+---------+---------+---------+
-
-----
-
-
-Because the bitwise and operation always has the value of result bit
-''k'' affected only by the values of the operand bit ''k''
-values, all field sizes are supported using the same single
-instruction.
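-
-
-As a small illustration (mine, not from the original text), the
-following C fragment shows ordinary 32-bit operations serving as
-polymorphic SWAR operations; exactly the same code is correct whether
-the word is treated as four 8-bit fields, two 16-bit fields, or
-thirty-two 1-bit fields:
-
-
------
-
-/* polymorphic SWAR operations: the field size never appears in the code */
-unsigned int
-swar_and(unsigned int reg0, unsigned int reg1)
-{
-  return(reg0 & reg1);       /* bitwise and of every field at once */
-}
-
-int
-swar_any_nonzero(unsigned int reg0)
-{
-  return(reg0 != 0);         /* true if any field holds a non-zero value */
-}
-
------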
-
-
-
-
-!Partitioned Operations
-
-
-
-
-
-Unfortunately, lots of important SWAR operations are not polymorphic.
-Arithmetic operations such as add, subtract, multiply, and divide are
-all subject to carry/borrow interactions between fields. We call such
-SWAR operations ''partitioned'', because each such operation must
-effectively partition the operands and result to prevent interactions
-between fields. However, there are actually three different methods
-that can be used to achieve this effect.
-
-
-
-
-!Partitioned Instructions
-
-
-
-
-
-Perhaps the most obvious approach to implementing partitioned
-operations is to provide hardware support for "partitioned parallel
-instructions" that cut the carry/borrow logic between fields. This
-approach can yield the highest performance, but it requires a change
-to the processor's instruction set and generally places many
-restrictions on field size (e.g., 8-bit fields might be supported, but
-not 12-bit fields).
-
-
-The AMD/Cyrix/Intel MMX, Digital MAX, HP MAX, and Sun VIS all
-implement restricted versions of partitioned instructions.
-Unfortunately, these different instruction set extensions have
-significantly different restrictions, making algorithms somewhat
-non-portable between them. For example, consider the following
-sampling of partitioned operations:
-
-
-
-----
-
-Instruction AMD/Cyrix/Intel MMX DEC MAX HP MAX Sun VIS
-+---------------------+---------------------+---------+--------+---------+
-| Absolute Difference | | 8 | | 8 |
-+---------------------+---------------------+---------+--------+---------+
-| Merge Maximum | | 8, 16 | | |
-+---------------------+---------------------+---------+--------+---------+
-| Compare | 8, 16, 32 | | | 16, 32 |
-+---------------------+---------------------+---------+--------+---------+
-| Multiply | 16 | | | 8x16 |
-+---------------------+---------------------+---------+--------+---------+
-| Add | 8, 16, 32 | | 16 | 16, 32 |
-+---------------------+---------------------+---------+--------+---------+
-
-----
-
-
-In the table, the numbers indicate the field sizes, in bits, for which
-each operation is supported. Even though the table omits many
-instructions including all the more exotic ones, it is clear that
-there are many differences. The direct result is that high-level
-languages (HLLs) really are not very effective as programming models,
-and portability is generally poor.
-
-
-
-
-!Unpartitioned Operations With Correction Code
-
-
-
-
-
-Implementing partitioned operations using partitioned instructions can
-certainly be efficient, but what do you do if the partitioned
-operation you need is not supported by the hardware? The answer is
-that you use a series of ordinary instructions to perform the operation
-with carry/borrow across fields, and then correct for the undesired
-field interactions.
-
-
-This is a purely software approach, and the corrections do introduce
-overhead, but it works with fully general field partitioning. This
-approach is also fully general in that it can be used either to fill
-gaps in the hardware support for partitioned instructions, or it can
-be used to provide full functionality for target machines that have no
-hardware support at all. In fact, by expressing the code sequences in
-a language like C, this approach allows SWAR programs to be fully
-portable.
-
-
-The question immediately arises: precisely how inefficient is it to
-simulate SWAR partitioned operations using unpartitioned operations
-with correction code? Well, that is certainly the $64k question...
-but many operations are not as difficult as one might expect.
-
-
-Consider implementing a four-element 8-bit integer vector add of two
-source vectors, x+y, using ordinary 32-bit
-operations.
-
-
-An ordinary 32-bit add might actually yield the correct result, but
-not if any 8-bit field carries into the next field. Thus, our goal is
-simply to ensure that such a carry does not occur. Because adding two
-''k''-bit fields generates an at most ''k''+1 bit
-result, we can ensure that no carry occurs by simply "masking out" the
-most significant bit of each field. This is done by bitwise anding
-each operand with 0x7f7f7f7f and then performing an
-ordinary 32-bit add.
-
-
-
-----
-
-t = ((x & 0x7f7f7f7f) + (y & 0x7f7f7f7f));
-
-----
-
-
-That result is correct... except for the most significant bit within
-each field. Computing the correct value for each field is simply a
-matter of doing two 1-bit partitioned adds of the most significant
-bits from x and y to the 7-bit carry result
-which was computed for t. Fortunately, a 1-bit
-partitioned add is implemented by an ordinary exclusive or operation.
-Thus, the result is simply:
-
-
-
-----
-
-(t ^ ((x ^ y) & 0x80808080))
-
-----
-
-
-Ok, well, maybe that isn't so simple. After all, it is six operations
-to do just four adds. However, notice that the number of operations
-is not a function of how many fields there are... so, with more
-fields, we get speedup. In fact, we may get speedup anyway simply
-because the fields were loaded and stored in a single (integer vector)
-operation, register availability may be improved, and there are fewer
-dynamic code scheduling dependencies (because partial word references
-are avoided).
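-
-
-Packaged as a C function (my sketch, not code from the original HOWTO),
-the complete correction-code add of four 8-bit fields is:
-
-
------
-
-/* add four 8-bit fields packed in 32-bit words, using ordinary 32-bit
-   operations plus the correction step described above */
-unsigned int
-swar_add8(unsigned int x, unsigned int y)
-{
-  unsigned int t = ((x & 0x7f7f7f7f) + (y & 0x7f7f7f7f));  /* no inter-field carry */
-  return(t ^ ((x ^ y) & 0x80808080));                      /* fix each top bit */
-}
-
------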
-
-
-
-
-!Controlling Field Values
-
-
-
-
-
-While the other two approaches to partitioned operation implementation
-both center on getting the maximum space utilization for the registers,
-it can be computationally more efficient to instead control the field
-values so that inter-field carry/borrow events should never occur.
-For example, if we know that all the field values being added are such
-that no field overflow will occur, a partitioned add operation can be
-implemented using an ordinary add instruction; in fact, given this
-constraint, an ordinary add instruction appears polymorphic, and is
-usable for any field sizes without correction code. The question
-thus becomes how to ensure that field values will not cause
-carry/borrow events.
-
-
-One way to ensure this property is to implement partitioned
-instructions that can restrict the range of field values. The Digital
-MAX vector minimum and maximum instructions can be viewed as hardware
-support for clipping field values to avoid inter-field carry/borrow.
-
-
-However, suppose that we do not have partitioned instructions that can
-efficiently restrict the range of field values... is there a
-sufficient condition that can be cheaply imposed to ensure
-carry/borrow events will not interfere with adjacent fields? The
-answer lies in analysis of the arithmetic properties. Adding two
-''k''-bit numbers generates a result with at most
-''k''+1 bits; thus, a field of ''k''+1 bits can safely
-contain such an operation despite using ordinary instructions.
-
-
-Thus, suppose that the 8-bit fields in our earlier example are now
-7-bit fields with 1-bit "carry/borrow spacers":
-
-
-
-----
-
-PE3 PE2 PE1 PE0
-+----+-------+----+-------+----+-------+----+-------+
-Reg0 | D' | D 6:0 | C' | C 6:0 | B' | B 6:0 | A' | A 6:0 |
-+----+-------+----+-------+----+-------+----+-------+
-
-----
-
-
-A vector of 7-bit adds is performed as follows. Let us assume that,
-prior to the start of any partitioned operation, all the carry spacer
-bits (A', B', C', and D') have the
-value 0. By simply executing an ordinary add operation, all the
-fields obtain the correct 7-bit values; however, some spacer bit
-values might now be 1. We can correct this by just one more
-conventional operation, masking-out the spacer bits. Our 7-bit
-integer vector add, x+y, is thus:
-
-
-
-----
-
-((x + y) & 0x7f7f7f7f)
-
-----
-
-
-This is just two instructions for four adds, clearly yielding good
-speedup.
-
-
-The sharp reader may have noticed that setting the spacer bits to 0
-does not work for subtract operations. The correction is, however,
-remarkably simple. To compute x-y, we simply
-ensure the initial condition that the spacers in x are all
-1, while the spacers in y are all 0. In the worst case,
-we would thus get:
-
-
-
-----
-
-(((x | 0x80808080) - y) & 0x7f7f7f7f)
-
-----
-
-
-However, the additional bitwise or operation can often be optimized
-out by ensuring that the operation generating the value for
-x used | 0x80808080 rather than &
-0x7f7f7f7f as the last step.
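-
-
-Again as a sketch of mine rather than code from the original text, the
-two spacer-bit operations can be wrapped as:
-
-
------
-
-/* 7-bit fields with 1-bit carry/borrow spacers, as described above */
-unsigned int
-spacer_add(unsigned int x, unsigned int y)
-{
-  return((x + y) & 0x7f7f7f7f);                   /* add, then clear the spacers */
-}
-
-unsigned int
-spacer_sub(unsigned int x, unsigned int y)
-{
-  return(((x | 0x80808080) - y) & 0x7f7f7f7f);    /* set x spacers, subtract, clear */
-}
-
------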
-
-
-Which method should be used for SWAR partitioned operations? The
-answer is simply "whichever yields the best speedup." Interestingly,
-the ideal method to use may be different for different field sizes
-within the same program running on the same machine.
-
-
-
-
-!Communication & Type Conversion Operations
-
-
-
-
-
-Although some parallel computations, including many operations on image
-pixels, have the property that the ''i''th value in a vector is
-a function only of values that appear in the ''i''th position
-of the operand vectors, this is generally not the case. For example,
-even pixel operations such as smoothing require values from adjacent
-pixels as operands, and transformations like FFTs require more complex
-(less localized) communication patterns.
-
-
-It is not difficult to efficiently implement 1-dimensional nearest
-neighbor communication for SWAR using unpartitioned shift operations.
-For example, to move a value from PE''i'' to
-PE(''i''+1), a simple shift operation suffices.
-If the fields are 8 bits in length, we would use:
-
-
-
-----
-
-(x << 8)
-
-----
-
-
-Still, it isn't always quite that simple. For example, to move a
-value from PE''i'' to
-PE(''i''-1), a simple shift operation might
-suffice... but the C language does not specify if shifts right
-preserve the sign bit, and some machines only provide signed shift
-right. Thus, in the general case, we must explicitly zero the
-potentially replicated sign bits:
-
-
-
-----
-
-((x >> 8) & 0x00ffffff)
-
-----
-
-
-Adding "wrap-around connections" is also reasonably efficient using
-unpartitioned shifts. For example, to move a value from
-PE''i'' to PE(''i''+1) with
-wraparound:
-
-
-
-----
-
-((x << 8) | ((x >> 24) & 0x000000ff))
-
-----
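-
-
-Wrapped as a function (my sketch, assuming four 8-bit fields in a
-32-bit register), the wrap-around move from each PE''i'' to
-PE(''i''+1) is:
-
-
------
-
-/* move each 8-bit field from PEi to PE(i+1), wrapping PE3 around to PE0 */
-unsigned int
-swar_rotate(unsigned int x)
-{
-  return((x << 8) | ((x >> 24) & 0x000000ff));
-}
-
------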
-
-
-The real problem comes when more general communication patterns must
-be implemented. Only the HP MAX instruction set supports arbitrary
-rearrangement of fields with a single instruction, which is called
-Permute. This Permute instruction is really
-misnamed; not only can it perform an arbitrary permutation of the
-fields, but it also allows repetition. In short, it implements an
-arbitrary x[[y] operation.
-
-
-Unfortunately, x[[y] is very difficult to implement without
-such an instruction. The code sequence is generally both long and
-inefficient; in fact, it is sequential code. This is very
-disappointing. The relatively high speed of x[[y]
-operations in the !MasPar MP1/MP2 and Thinking Machines CM1/CM2/CM200
-SIMD supercomputers was one of the key reasons these machines performed
-well. However, x[[y] has always been slower than nearest
-neighbor communication, even on those supercomputers, so many
-algorithms have been designed to minimize the need for
-x[[y] operations. In short, without hardware support, it
-is probably best to develop SWAR algorithms as though
-x[[y] wasn't legal... or at least isn't cheap.
-
-
-
-
-!Recurrence Operations (Reductions, Scans, etc.)
-
-
-
-
-
-A recurrence is a computation in which there is an apparently
-sequential relationship between values being computed. However, if
-these recurrences involve associative operations, it may be possible
-to recode the computation using a tree-structured parallel algorithm.
-
-
-The most common type of parallelizable recurrence is probably the
-class known as associative reductions. For example, to compute the
-sum of a vector's values, one commonly writes purely sequential C code
-like:
-
-
-
-----
-
-t = 0;
-for (i=0; i<MAX; ++i) t += x[[i];
-
-----
-
-
-However, the order of the additions is rarely important. Floating
-point and saturation math can yield different answers if the order of
-additions is changed, but ordinary wrap-around integer additions will
-yield the same results independent of addition order. Thus, we can
-re-write this sequence into a tree-structured parallel summation in
-which we first add pairs of values, then pairs of those partial sums,
-and so forth, until a single final sum results. For a vector of four
-8-bit values, just two addition steps are needed; the first step does
-two 8-bit adds, yielding two 16-bit result fields (each containing a
-9-bit result):
-
-
-
-----
-
-t = ((x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff));
-
-----
-
-
-The second step adds these two 9-bit values in 16-bit fields to
-produce a single 10-bit result:
-
-
-
-----
-
-((t + (t >> 16)) & 0x000003ff)
-
-----
-
-
-Actually, the second step performs two 16-bit field adds... but the
-top 16-bit add is meaningless, which is why the result is masked to a
-single 10-bit result value.
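-
-
-Combining the two steps into one function (again my sketch, not the
-original's code), the complete tree-structured sum of four 8-bit fields
-is:
-
-
------
-
-/* tree-structured sum of the four 8-bit fields packed in x */
-unsigned int
-swar_sum8x4(unsigned int x)
-{
-  unsigned int t = ((x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff)); /* two 9-bit sums */
-  return((t + (t >> 16)) & 0x000003ff);                          /* one 10-bit sum */
-}
-
------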
-
-
-Scans, also known as "parallel prefix" operations, are somewhat harder
-to implement efficiently. This is because, unlike reductions, scans
-produce partitioned results. For this reason, scans can be implemented
-using a fairly obvious sequence of partitioned operations.
-
-
-
-
-!!4.3 MMX SWAR Under Linux
-
-
-
-
-
-
-For Linux, IA32 processors are our primary concern. The good news is
-that AMD, Cyrix, and Intel all implement the same MMX instructions.
-However, MMX performance varies; for example, the K6 has only one MMX
-pipeline - the Pentium with MMX has two. The only really bad news is
-that Intel is still running those stupid MMX commercials.... ;-)
-
-
-There are really three approaches to using MMX for SWAR:
-
-
-
-
-
-#Use routines from an MMX library. In particular, Intel has
-developed several "performance libraries,"
-http://developer.intel.com/drg/tools/ad.htm, that offer a
-variety of hand-optimized routines for common multimedia tasks. With
-a little effort, many non-multimedia algorithms can be reworked to
-enable some of the most compute-intensive portions to be implemented
-using one or more of these library routines. These libraries are not
-currently available for Linux, but could be ported.
-
-#
-
-#Use MMX instructions directly. This is somewhat complicated by
-two facts. The first problem is that MMX might not be available on
-the processor, so an alternative implementation must also be
-provided. The second problem is that the IA32 assembler generally
-used under Linux does not currently recognize MMX instructions.
-
-#
-
-#Use a high-level language or module compiler that can directly
-generate appropriate MMX instructions. Such tools are currently under
-development, but none is yet fully functional under Linux. For
-example, at Purdue University (
-http://dynamo.ecn.purdue.edu/~hankd/SWAR/) we are currently
-developing a compiler that will take functions written in an
-explicitly parallel C dialect and will generate SWAR modules that are
-callable as C functions, yet make use of whatever SWAR support is
-available, including MMX. The first prototype module compilers were
-built in Fall 1996, however, bringing this technology to a usable
-state is taking much longer than was originally expected.
-#
-
-
-
-In summary, MMX SWAR is still awkward to use. However, with a little
-extra effort, the second approach given above can be used now. Here
-are the basics:
-
-
-
-
-
-#You cannot use MMX if your processor does not support it. The
-following GCC code can be used to test if MMX is supported on your
-processor. It returns 0 if not, non-zero if it is supported.
-----
-
-inline extern
-int mmx_init(void)
-{
-  int mmx_available;
-  __asm__ __volatile__ (
-    /* Get CPU version information */
-    "movl $1, %%eax\n\t"
-    "cpuid\n\t"
-    "andl $0x800000, %%edx\n\t"
-    "movl %%edx, %0"
-    : "=q" (mmx_available)
-    : /* no input */
-  );
-  return mmx_available;
-}
-
-----
-
-#
-
-#An MMX register essentially holds one of what GCC would call an
-unsigned long long. Thus, memory-based variables of this type
-become the communication mechanism between your MMX modules and the C
-programs that call them. Alternatively, you can declare your MMX data
-as any 64-bit aligned data structure (it is convenient to ensure
-64-bit alignment by declaring your data type as a union with
-an unsigned long long field). A minimal sketch of such a union
-appears just after this list.
-
-#
-
-#If MMX is available, you can write your MMX code using
-the .byte assembler directive to encode each instruction.
-This is painful stuff to do by hand, but not difficult for a compiler
-to generate. For example, the MMX instruction PADDB MM0,MM1
-could be encoded as the GCC in-line assembly code:
-----
-
-__asm__ __volatile__ (".byte 0x0f, 0xfc, 0xc1\n\t");
-
-----
-Remember that MMX uses some of the same hardware that is used for
-floating point operations, so code intermixed with MMX code must not
-invoke any floating point operations. The floating point stack also
-should be empty before executing any MMX code; the floating point
-stack is normally empty at the beginning of a C function that does not
-use floating point.
-
-#
-
-#Exit your MMX code by executing the EMMS instruction,
-which can be encoded as:
-----
-
-__asm__ __volatile__ (".byte 0x0f, 0x77\n\t");
-
-----
-
-#
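-
-
-As referenced in the second item above, here is a minimal sketch (mine,
-not from the original) of a union that provides 64-bit aligned storage
-for passing data between C code and MMX modules; the type name
-mmx_t is made up:
-
-
------
-
-/* container for MMX operands: the unsigned long long member gives the
-   64-bit size and (as suggested above) the alignment, while the other
-   members are convenient views of the packed fields */
-typedef union {
-  unsigned long long q;     /* one 64-bit quadword */
-  unsigned int d[[2];       /* two 32-bit fields */
-  unsigned short w[[4];     /* four 16-bit fields */
-  unsigned char b[[8];      /* eight 8-bit fields */
-} mmx_t;
-
------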
-
-
-
-If the above looks very awkward and crude, it is. However, MMX is
-still quite young.... future versions of this document will offer
-better ways to program MMX SWAR.
-
-
-
-----
-
-!!5. Linux-Hosted Attached Processors
-
-
-
-
-
-Although this approach has recently fallen out of favor, it is
-virtually impossible for other parallel processing methods to achieve
-the low cost and high performance possible by using a Linux system to
-host an attached parallel computing system. The problem is that very
-little software support is available; you are pretty much on your own.
-
-
-
-
-!!5.1 A Linux PC Is A Good Host
-
-
-
-
-
-
-In general, attached parallel processors tend to be specialized to
-perform specific types of functions.
-
-
-Before becoming discouraged by the fact that you are somewhat on your
-own, it is useful to understand that, although it may be difficult to
-get a Linux PC to appropriately host a particular system, a Linux PC
-is one of the few platforms well suited to this type of use.
-
-
-PCs make a good host for two primary reasons. The first is the cheap
-and easy expansion capability; resources such as more memory, disks,
-networks, etc., are trivially added to a PC. The second is the ease
-of interfacing. Not only are ISA and PCI bus prototyping cards widely
-available, but the parallel port offers reasonable performance in a
-completely non-invasive interface. The IA32 separate I/O space also
-facilitates interfacing by providing hardware I/O address protection
-at the level of individual I/O port addresses.
-
-
-Linux also makes a good host OS. The free availability of full source
-code, and extensive "hacking" guides, obviously are a tremendous help.
-However, Linux also provides good near-real-time scheduling, and there
-is even a true real-time version of Linux at
-http://luz.cs.nmt.edu/~rtlinux/. Perhaps even more important
-is the fact that while providing a full UNIX environment, Linux can
-support development tools that were written to run under Microsoft DOS
-and/or Windows. MSDOS programs can execute within a Linux process
-using dosemu to provide a protected virtual machine that can
-literally run MSDOS. Linux support for Windows 3.xx programs is even
-more direct: free software such as wine,
-http://www.linpro.no/wine/, simulates Windows 3.11 well enough
-for most programs to execute correctly and efficiently within a UNIX/X
-environment.
-
-
-The following two sections give examples of attached parallel systems
-that I'd like to see supported under Linux....
-
-
-
-
-!!5.2 Did You DSP That?
-
-
-
-
-
-
-There is a thriving market for high-performance DSP (Digital Signal
-Processing) processors. Although these chips were generally designed
-to be embedded in application-specific systems, they also make great
-attached parallel computers. Why?
-
-
-
-
-
-*Many of them, such as the Texas Instruments (
-http://www.ti.com/) TMS320 and the Analog Devices (
-http://www.analog.com/) SHARC DSP families, are designed to
-construct parallel machines with little or no "glue" logic.
-
-*
-
-*They are cheap, especially per MIP or MFLOP. Including the cost
-of basic support logic, it is not unheard of for a DSP processor to be
-one tenth the cost of a PC processor with comparable performance.
-
-*
-
-*They do not use much power nor generate much heat. This means
-that it is possible to have a bunch of these chips powered by a
-conventional PC's power supply - and enclosing them in your PC's case
-will not turn it into an oven.
-
-*
-
-*There are strange-looking things in most DSP instruction sets
-that high-level (e.g., C) compilers are unlikely to use well - for
-example, "Bit Reverse Addressing." Using an attached parallel system,
-it is possible to straightforwardly compile and run most code on the
-host, while running the most time-consuming few algorithms on the DSPs
-as carefully hand-tuned code.
-
-*
-
-*These DSP processors are not really designed to run a UNIX-like
-OS, and generally are not very good as stand-alone general-purpose
-computer processors. For example, many do not have memory management
-hardware. In other words, they work best when hosted by a more
-general-purpose machine... such as a Linux PC.
-*
-
-
-
-Although some audio cards and modems include DSP processors that Linux
-drivers can access, the big payoff comes from using an attached
-parallel system that has four or more DSP processors.
-
-
-Because the Texas Instruments TMS320 series,
-http://www.ti.com/sc/docs/dsps/dsphome.htm, has been very
-popular for a long time, and it is trivial to construct a TMS320-based
-parallel processor, there are quite a few such systems available.
-There are both integer-only and floating-point capable versions of the
-TMS320; older designs used a somewhat unusual single-precision
-floating-point format, but the new models support IEEE formats. The
-older TMS320C4x (aka, 'C4x) achieves up to 80 MFLOPS using the
-TI-specific single-precision floating-point format; in contrast, a
-single 'C67x will provide up to 1 GFLOPS single-precision or 420
-MFLOPS double-precision for IEEE floating point calculations, using a
-VLIW-based chip architecture called VelociTI. Not only is it easy to
-configure a group of these chips as a multiprocessor, but in a single
-chip, the 'C8x multiprocessor will provide a 100 MFLOPS IEEE
-floating-point RISC master processor along with either two or four
-integer slave DSPs.
-
-
-The other DSP processor family that has been used in more than a few
-attached parallel systems lately is the SHARC (aka, ADSP-2106x) from
-Analog Devices
-http://www.analog.com/. These chips can be
-configured as a 6-processor shared memory multiprocessor without
-external glue logic, and larger systems also can be configured using
-six 4-bit links/chip. Most of the larger systems seem targeted to
-military applications, and are a bit pricey. However, Integrated
-Computing Engines, Inc.,
-http://www.iced.com/, makes an
-interesting little two-board PCI card set called GreenICE. This unit
-contains an array of 16 SHARC processors, and is capable of delivering
-a peak speed of about 1.9 GFLOPS using a single-precision IEEE format.
-GreenICE costs less than $5,000.
-
-
-In my opinion, attached parallel DSPs really deserve a lot more
-attention from the Linux parallel processing community....
-
-
-
-
-!!5.3 FPGAs And Reconfigurable Logic Computing
-
-
-
-
-
-
-If parallel processing is all about getting the highest speedup, then
-why not build custom hardware? Well, we all know the answers; it
-costs too much, takes too long to develop, becomes useless when we
-change the algorithm even slightly, etc. However, recent advances in
-electrically reprogrammable FPGAs (Field Programmable Gate Arrays)
-have nullified most of those objections. Now, the gate density is
-high enough so that an entire simple processor can be built within a
-single FPGA, and the time to reconfigure (reprogram) an FPGA has also
-been dropping to a level where it is reasonable to reconfigure even
-when moving from one phase of an algorithm to the next.
-
-
-This stuff is not for the weak of heart: you'll have to work with
-hardware description languages like VHDL for the FPGA configuration, as
-well as writing low-level code to interface to programs on the Linux
-host system. However, the cost of FPGAs is low, and especially for
-algorithms operating on low-precision integer data (actually, a small
-superset of the stuff SWAR is good at), FPGAs can perform complex
-operations just about as fast as you can feed them data. For example,
-simple FPGA-based systems have yielded better-than-supercomputer times
-for searching gene databases.
-
-
-There are other companies making appropriate FPGA-based hardware, but
-the following two companies represent a good sample.
-
-
-Virtual Computer Company offers a variety of products using
-dynamically reconfigurable SRAM-based Xilinx FPGAs. Their 8/16 bit
-"Virtual ISA Proto Board"
-http://www.vcc.com/products/isa.html is less than $2,000.
-
-
-The Altera ARC-PCI (Altera Reconfigurable Computer, PCI bus),
-http://www.altera.com/html/new/pressrel/pr_arc-pci.html,
-is a similar type of card, but uses Altera FPGAs and a PCI bus
-interface rather than ISA.
-
-
-Many of the design tools, hardware description languages, compilers,
-routers, mappers, etc., come as object code only that runs under
-Windows and/or DOS. You could simply keep a disk partition with
-DOS/Windows on your host PC and reboot whenever you need to use them,
-however, many of these software packages may work under Linux using
-dosemu or Windows emulators like wine.
-
-
-
-----
-
-!!6. Of General Interest
-
-
-
-
-
-The material covered in this section applies to all four parallel
-processing models for Linux.
-
-
-
-
-!!6.1 Programming Languages And Compilers
-
-
-
-
-
-
-I am primarily known as a compiler researcher, so I'd like to be able
-to say that there are lots of really great compilers automatically
-generating efficient parallel code for Linux systems. Unfortunately,
-the truth is that it is hard to beat the performance obtained by
-expressing your parallel program using various explicit communication
-and other parallel operations within C code that is compiled by GCC.
-
-
-The following language/compiler projects represent some of the best
-efforts toward producing reasonably efficient code from high-level
-languages. Generally, each is reasonably effective for the kinds of
-programming tasks it targets, but none is the powerful general-purpose
-language and compiler system that will make you forever stop writing C
-programs to compile with GCC... which is fine. Use these languages
-and compilers as they were intended, and you'll be rewarded with
-shorter development times, easier debugging and maintenance, etc.
-
-
-There are plenty of languages and compilers beyond those listed here
-(in alphabetical order). A list of freely available compilers (most
-of which have nothing to do with Linux parallel processing) is at
-http://www.idiom.com/free-compilers/.
-
-
-
-
-!Fortran 66/77/PCF/90/HPF/95
-
-
-
-
-
-At least in the scientific computing community, there will always be
-Fortran. Of course, now Fortran doesn't mean the same thing it did in
-the 1966 ANSI standard. Basically, Fortran 66 was pretty simple stuff.
-Fortran 77 added tons of features, the most noticeable of which were the
-improved support for character data and the change of DO loop
-semantics. PCF (Parallel Computing Forum) Fortran attempted to add a
-variety of parallel processing support features to 77. Fortran 90 is
-a fully-featured modern language, essentially adding C++-like
-object-oriented programming features and parallel array syntax to the
-77 language. HPF (High-Performance Fortran,
-http://www.crpc.rice.edu/HPFF/home.html), which has itself gone
-through two versions (HPF-1 and HPF-2), is essentially the enhanced,
-standardized version of what many of us used to know as CM Fortran,
-!MasPar Fortran, or Fortran D; it extends Fortran 90 with a variety of
-parallel processing enhancements, largely focused on specifying data
-layouts. Finally, Fortran 95 represents a relatively minor
-enhancement and refinement of 90.
-
-
-What works with C generally can also work with f2c,
-g77 (a nice Linux-specific overview is at
-http://linux.uni-regensburg.de/psi_linux/gcc/html_g77/g77_91.html),
-or the commercial Fortran 90/95 products from
-http://extweb.nag.co.uk/nagware/NCNJNKNM.html. This is because
-all of these compilers ultimately rely on the same code generation
-used in the back end of GCC.
-
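-For instance, with the usual g77/f2c conventions (a lowercase symbol
-name with a trailing underscore, and every argument passed by
-reference), a Fortran 77 kernel and a C driver link together freely;
-the same mechanism is what lets Fortran code call C-based libraries
-such as PVM or MPI.  The routine name below is hypothetical.
-
- /* Assume a Fortran 77 routine, compiled with g77 -c vscale.f:
-
-        SUBROUTINE VSCALE(N, A, X)
-        INTEGER N
-        DOUBLE PRECISION A, X(N)
-        DO 10 I = 1, N
-   10      X(I) = A * X(I)
-        END
-
-    Link the pieces with, e.g.:  g77 main.o vscale.o -o scale        */
- #include <stdio.h>
-
- extern void vscale_(int *n, double *a, double *x);  /* g77 symbol   */
-
- int main(void)
- {
-     double x[4] = { 1.0, 2.0, 3.0, 4.0 };
-     double a = 2.0;
-     int n = 4, i;
-
-     vscale_(&n, &a, x);            /* scalars passed by address     */
-     for (i = 0; i < n; i++)
-         printf("%f\n", x[i]);
-     return 0;
- }
-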
-
-Commercial Fortran parallelizers that can generate code for SMPs are
-available from
-http://www.kai.com/ and
-http://www.psrv.com/vast/vast_parallel.html. It is not
-clear if these compilers will work for SMP Linux, but it should be
-possible given that the standard POSIX threads (i.e., !LinuxThreads)
-work under SMP Linux.
-
-
-The Portland Group,
-http://www.pgroup.com/, has commercial
-parallelizing HPF Fortran (and C, C++) compilers that generate code for
-SMP Linux; they also have a version targeting clusters using MPI or
-PVM. FORGE/spf/xHPF products at
- http://www.apri.com/
-might also be useful for SMPs or clusters.
-
-
-Freely available parallelizing Fortrans that might be made to work
-with parallel Linux systems include:
-
-
-
-
-
-*ADAPTOR (Automatic DAta Parallelism TranslaTOR,
-http://www.gmd.de/SCAI/lab/adaptor/adaptor_home.html),
-which can translate HPF into Fortran 77/90 code with MPI or PVM calls,
-but its documentation does not mention Linux.
-
-*
-
-*Fx
-http://www.cs.cmu.edu/~fx/Fx at Carnegie Mellon
-targets some workstation clusters, but does it support Linux?
-
-*
-
-*HPFC (prototype HPF Compiler,
-http://www.cri.ensmp.fr/~coelho/hpfc.html) generates Fortran 77
-code with PVM calls. Is it usable on a Linux cluster?
-
-*
-
-*Can PARADIGM (PARAllelizing compiler for DIstributed-memory
-General-purpose Multicomputers,
-http://www.crhc.uiuc.edu/Paradigm/) be used with Linux?
-
-*
-
-*The Polaris compiler,
-http://ece.www.ecn.purdue.edu/~eigenman/polaris/, generates
-Fortran code for shared memory multiprocessors, and may soon be
-retargeted to PAPERS Linux clusters.
-
-*
-
-*PREPARE,
-http://www.irisa.fr/EXTERNE/projet/pampa/PREPARE/prepare.html,
-targets MPI clusters... it is not clear if it can generate code to
-run on IA32 processors.
-
-*
-
-*Combining ADAPT and ADLIB, shpf (Subset High Performance Fortran
-compilation system,
-http://www.ccg.ecs.soton.ac.uk/Projects/shpf/shpf.html) is
-public domain and generates Fortran 90 with MPI calls... so, if you
-have a Fortran 90 compiler under Linux....
-
-*
-
-*SUIF (Stanford University Intermediate Form, see
-http://suif.stanford.edu/) has parallelizing compilers for both
-C and Fortran. This is also the focus of the National Compiler
-Infrastructure Project... so, is anybody targeting parallel Linux
-systems?
-*
-
-
-
-I'm sure that I have omitted many potentially useful compilers for
-various dialects of Fortran, but there are so many that it is difficult
-to keep track. In the future, I would prefer to list only those
-compilers known to work with Linux. Please email comments and/or
-corrections to
-pplinux@ecn.purdue.edu.
-
-
-
-
-!GLU (Granular Lucid)
-
-
-
-
-
-GLU (Granular Lucid) is a very high-level programming system based on
-a hybrid programming model that combines intensional (Lucid) and
-imperative models. It supports both PVM and TCP sockets. Does it run
-under Linux? More information is available at
-http://www.csl.sri.com/GLU.html.
-
-
-
-
-!Jade And SAM
-
-
-
-
-
-Jade is a parallel programming language that extends C to exploit
-coarse-grain concurrency in sequential, imperative programs. It
-assumes a distributed shared memory model, which is implemented by SAM
-for workstation clusters using PVM. More information is available at
-http://suif.stanford.edu/~scales/sam.html.
-
-
-
-
-!Mentat And Legion
-
-
-
-
-
-Mentat is an object-oriented parallel processing system that works
-with workstation clusters and has been ported to Linux. Mentat
-Programming Language (MPL) is an object-oriented programming language
-based on C++. The Mentat run-time system uses something vaguely
-resembling non-blocking remote procedure calls. More information is
-available at
-http://www.cs.virginia.edu/~mentat/.
-
-
-Legion
-http://www.cs.virginia.edu/~legion/ is built on top
-of Mentat, providing the appearance of a single virtual machine across
-wide-area networked machines.
-
-
-
-
-!MPL (!MasPar Programming Language)
-
-
-
-
-
-Not to be confused with Mentat's MPL, this language was originally
-developed as the native parallel C dialect for the !MasPar SIMD
-supercomputers. Well, !MasPar isn't really in that business any more
-(they are now !NeoVista Solutions,
-http://www.neovista.com,
-a data mining company), but their MPL compiler was built using GCC, so
-it is still freely available. In a joint effort between the
-University of Alabama at Huntsville and Purdue University, !MasPar's
-MPL has been retargeted to generate C code with AFAPI calls (see
-section 3.6), and thus runs on both Linux SMPs and clusters. The
-compiler is, however, somewhat buggy... see
-http://www.math.luc.edu/~laufer/mspls/papers/cohen.ps.
-
-
-
-
-!PAMS (Parallel Application Management System)
-
-
-
-
-
-Myrias is a company selling a software product called PAMS (Parallel
-Application Management System). PAMS provides very simple directives
-for virtual shared memory parallel processing. Networks of Linux
-machines are not yet supported. See
-http://www.myrias.com/ for more information.
-
-
-
-
-!Parallaxis-III
-
-
-
-
-
-Parallaxis-III is a structured programming language that extends
-Modula-2 with "virtual processors and connections" for data
-parallelism (a SIMD model). The Parallaxis software comprises
-compilers for sequential and parallel computer systems, a debugger
-(extensions to the gdb and xgdb debuggers), and a large variety of
-sample algorithms from different areas, especially image processing.
-This runs on sequential Linux systems... an old version supported
-various parallel targets, and the new version will as well (e.g.,
-targeting a PVM cluster). More information is available at
-http://www.informatik.uni-stuttgart.de/ipvr/bv/p3/p3.html.
-
-
-
-
-!pC++/Sage++
-
-
-
-
-
-pC++/Sage++ is a language extension to C++ that permits data-parallel
-style operations using "collections of objects" from some base
-"element" class. It is a preprocessor generating C++ code that can
-run under PVM. Does it run under Linux? More information is
-available at
-http://www.extreme.indiana.edu/sage/.
-
-
-
-
-!SR (Synchronizing Resources)
-
-
-
-
-
-SR (Synchronizing Resources) is a concurrent programming language in
-which resources encapsulate processes and the variables they share;
-operations provide the primary mechanism for process interaction. SR
-provides a novel integration of the mechanisms for invoking and
-servicing operations. Consequently, all of local and remote procedure
-call, rendezvous, message passing, dynamic process creation,
-multicast, and semaphores are supported. SR also supports shared
-global variables and operations.
-
-
-It has been ported to Linux, but it isn't clear what parallelism it
-can execute with. More information is available at
-http://www.cs.arizona.edu/sr/www/index.html.
-
-
-
-
-!ZPL And !IronMan
-
-
-
-
-
-ZPL is an array-based programming language intended to support
-engineering and scientific applications. It generates calls to a
-simple message-passing interface called !IronMan, and the few functions
-which constitute this interface can be easily implemented using nearly
-any message-passing system. However, it is primarily targeted to PVM
-and MPI on workstation clusters, and Linux is supported. More
-information is available at
-http://www.cs.washington.edu/research/projects/orca3/zpl/www/.
-
-
-
-
-!!6.2 Performance Issues
-
-
-
-
-
-
-There are a lot of people who spend a lot of time benchmarking
-particular motherboards, network cards, etc., trying to determine
-which is the best. The problem with that approach is that by the time
-you've been able to benchmark something, it is no longer the best
-available; it even may have been taken off the market and replaced by
-a revised model with entirely different properties.
-
-
-Buying PC hardware is like buying orange juice. Usually, it is made
-with pretty good stuff no matter what company name is on the label.
-Few people know, or care, where the components (or orange juice
-concentrate) came from. That said, there are some hardware
-differences that you should pay attention to. My advice is simply
-that you be aware of what you can expect from the hardware under
-Linux, and then focus your attention on getting rapid delivery, a good
-price, and a reasonable policy for returns.
-
-
-An excellent overview of the different PC processors is given in
-http://www.pcguide.com/ref/cpu/fam/; in fact, the whole WWW
-site
-http://www.pcguide.com/ is full of good technical
-overviews of PC hardware. It is also useful to know a bit about
-performance of specific hardware configurations, and the Linux
-Benchmarking HOWTO
-http://sunsite.unc.edu/LDP/HOWTO/Benchmarking-HOWTO.html is a
-good place to start.
-
-
-The Intel IA32 processors have many special registers that can be used
-to measure the performance of a running system in exquisite detail.
-Intel VTune,
-http://developer.intel.com/design/perftool/vtune/, uses the
-performance registers extensively in a very complete code-tuning
-system... that unfortunately doesn't run under Linux. A loadable
-device driver module, plus library routines, for accessing the Pentium
-performance registers is available from
-http://www.cs.umd.edu/users/akinlar/driver.html. Keep in mind
-that these performance registers are different on different IA32
-processors; this code works only with Pentium, not with 486, Pentium
-Pro, Pentium II, K6, etc.
-
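-One related counter that is present on the Pentium and most later IA32
-processors, and is easy to read without any driver at all, is the
-time-stamp counter; the rough sketch below (my own, not part of the
-package above) reads it with GCC inline assembly to count the clock
-cycles spent in a region of code.
-
- /* Timing a code region with the IA32 time-stamp counter (Pentium
-    or later).  The result is in clock cycles; divide by the clock
-    rate to get seconds.                                             */
- #include <stdio.h>
-
- static __inline__ unsigned long long rdtsc(void)
- {
-     unsigned long lo, hi;
-     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
-     return (((unsigned long long) hi) << 32) | lo;
- }
-
- int main(void)
- {
-     unsigned long long start, stop;
-     volatile double x = 1.0;
-     int i;
-
-     start = rdtsc();
-     for (i = 0; i < 1000000; i++)    /* the region being measured  */
-         x = x * 1.000001;
-     stop = rdtsc();
-     printf("elapsed: %llu cycles\n", stop - start);
-     return 0;
- }
-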
-
-Another comment on performance is appropriate, especially for those
-of you who want to build big clusters and put them in small spaces.
-At least some modern processors incorporate thermal sensors and
-circuits that are used to slow the internal clock rate if operating
-temperature gets too high (an attempt to reduce heat output and
-improve reliability). I'm not suggesting that everyone should go buy
-a Peltier device (heat pump) to cool each CPU, but you should be aware
-that high operating temperature does not just shorten component life;
-it can also directly reduce system performance. Do not arrange your
-computers in physical configurations that block airflow, trap heat
-within confined areas, etc.
-
-
-Finally, performance isn't just speed, but also reliability and
-availability. High reliability means that your system almost never
-crashes, even when components fail... which generally requires
-special features like redundant power supplies and hot-swap
-motherboards. That usually isn't cheap. High availability refers to
-the concept that your system is available for use nearly all the
-time... the system may crash when components fail, but the system is
-quickly repaired and rebooted. There is a High-Availability HOWTO
-that discusses many of the basic issues. However, especially for
-clusters, high availability can be achieved simply by having a few
-spares. I recommend at least one spare, and prefer to have at least
-one spare for every 16 machines in a large cluster. Discarding faulty
-hardware and replacing it with a spare can yield both higher
-availability and lower cost than a maintenance contract.
-
-
-
-
-!!6.3 Conclusion - It's Out There
-
-
-
-
-
-
-So, is anybody doing parallel processing using Linux? Yes!
-
-
-It wasn't very long ago that a lot of people were wondering if the
-death of many parallel-processing supercomputer companies meant that
-parallel processing was on its way out. I didn't think it was dead
-then (see
-http://dynamo.ecn.purdue.edu/~hankd/Opinions/pardead.html for a
-fun overview of what I think really happened), and it seems quite
-clear now that parallel processing is again on the rise. Even Intel,
-which just recently stopped making parallel supercomputers, is proud
-of the parallel processing support in things like MMX and the upcoming
-IA64 EPIC (Explicitly Parallel Instruction Computing).
-
-
-If you search for "Linux" and "parallel" with your favorite search
-engine, you'll find quite a few places are involved in parallel
-processing using Linux. In particular, Linux PC clusters seem to be
-popping up everywhere. The appropriateness of Linux, combined with
-the low cost and high performance of PC hardware, has made parallel
-processing using Linux a popular approach to supercomputing for both
-small, budget-constrained, groups and large, well-funded, national
-research laboratories.
-
-
-Various projects listed elsewhere in this document maintain lists of
-"kindred" research sites that have similar parallel Linux
-configurations. However, at
-http://yara.ecn.purdue.edu/~pplinux/Sites/, there is a
-hypertext document intended to provide photographs, descriptions, and
-contact information for all the various sites using Linux systems for
-parallel processing. To have information about your site posted there:
-
-
-
-
-
-*You must have a "permanent" parallel Linux site: an SMP,
-cluster of machines, SWAR system, or PC with attached processor, which
-is configured to allow users to ''execute parallel programs under
-Linux''. A Linux-based software environment (e.g., PVM, MPI,
-AFAPI) that directly supports parallel processing must be installed on
-the system. However, the hardware need not be dedicated to parallel
-processing under Linux, and may be used for completely different
-purposes when parallel programs are not being run.
-
-*
-
-*Request that your site be listed. Send your site information to
-pplinux@ecn.purdue.edu. Please follow the format used in other
-entries for your site information. ''No site will be listed without
-an explicit request from the contact person for that site.''
-*
-
-
-
-There are 14 clusters in the current listing, but we are aware of at
-least several dozen Linux clusters world-wide. Of course, listing
-does not imply any endorsement, etc.; our hope is simply to increase
-awareness, research, and collaboration involving parallel processing
-using Linux.
-
-
-
-----
+Describe [HowToParallelProcessingHOWTO] here.