APPENDIX B

SPARC Behavior and Implementation

This appendix discusses issues related to the floating-point units used in SPARC® based workstations and describes a way to determine which code generation flags are best suited for a particular workstation.


B.1 Floating-Point Hardware

This section lists a number of SPARC floating-point units and describes the instruction sets and exception handling features they support. See the SPARC Architecture Manual Version 8 Appendix N, "SPARC IEEE 754 Implementation Recommendations", and Version 9 Appendix B, "IEEE Std 754-1985 Requirements for SPARC-V9", for brief descriptions of what happens when a floating-point trap is taken, the distinction between trapped and untrapped underflow, and recommended possible courses of action for SPARC implementations that provide a non-IEEE (nonstandard) arithmetic mode.

TABLE B-1 lists the hardware floating-point implementations used by SPARC workstations. Many early SPARC based systems have floating-point units derived from cores developed by TI or Weitek.

These two families of FPUs have been licensed to other workstation vendors, so chips from other semiconductor manufacturers may be found in some SPARC based workstations. Some of these other chips are also shown in the table.


TABLE B-1 SPARC Floating-Point Options

| FPU | Description or Processor Name | Appropriate for Machines | Notes | Optimum -xchip and -xarch |
|---|---|---|---|---|
| Weitek 1164/1165-based FPU or no FPU | Kernel emulates floating-point instructions | Obsolete | Slow; not recommended | -xchip=old -xarch=v7 |
| TI 8847-based FPU | TI 8847; controller from Fujitsu or LSI | Sun-4™/1xx, Sun-4/2xx, Sun-4/3xx, Sun-4/4xx, SPARCstation® 1 (4/60) | 1989. Most SPARCstation 1 workstations have Weitek 3170 | -xchip=old -xarch=v7 |
| Weitek 3170-based FPU | | SPARCstation 1 (4/60), SPARCstation 1+ (4/65) | 1989, 1990 | -xchip=old -xarch=v7 |
| TI 602a | | SPARCstation 2 (4/75) | 1990 | -xchip=old -xarch=v7 |
| Weitek 3172-based FPU | | SPARCstation SLC (4/20), SPARCstation IPC (4/40) | 1990 | -xchip=old -xarch=v7 |
| Weitek 8601 or Fujitsu 86903 | Integrated CPU and FPU | SPARCstation IPX (4/50), SPARCstation ELC (4/25) | 1991. IPX uses 40 MHz CPU/FPU; ELC uses 33 MHz | -xchip=old -xarch=v7 |
| Cypress 602 | Resides on Mbus Module | SPARCserver® 6xx | 1991 | -xchip=old -xarch=v7 |
| TI TMS390S10 (STP1010) | microSPARC®-I | SPARCstation LX, SPARCclassic | 1992. No FsMULd in hardware | -xchip=micro -xarch=v8a |
| Fujitsu 86904 (STP1012) | microSPARC-II | SPARCstation 4 and 5, SPARCstation Voyager | No FsMULd in hardware | -xchip=micro2 -xarch=v8a |
| TI TMS390Z50 (STP1020A) | SuperSPARC®-I | SPARCserver 6xx, SPARCstation 10, SPARCstation 20, SPARCserver 1000, SPARCcenter 2000 | | -xchip=super -xarch=v8 |
| STP1021A | SuperSPARC-II | SPARCserver 6xx, SPARCstation 10, SPARCstation 20, SPARCserver 1000, SPARCcenter 2000 | | -xchip=super2 -xarch=v8 |
| Ross RT620 | hyperSPARC® | SPARCstation 10/HSxx, SPARCstation 20/HSxx | | -xchip=hyper -xarch=v8 |
| Fujitsu 86907 | TurboSPARC | SPARCstation 4 and 5 | | -xchip=micro2 -xarch=v8 |
| STP1030A | UltraSPARC® I | Ultra-1, Ultra-2, Ex000 | V9+VIS | -xchip=ultra -xarch=v8plusa |
| STP1031 | UltraSPARC II | Ultra-2, E450, Ultra-30, Ultra-60, Ultra-80, Ex500, Ex000, E10000 | V9+VIS | -xchip=ultra2 -xarch=v8plusa |
| SME1040 | UltraSPARC IIi | Ultra-5, Ultra-10 | V9+VIS | -xchip=ultra2i -xarch=v8plusa |
| | UltraSPARC IIe | Sun Blade™ 100 | V9+VIS | -xchip=ultra2e -xarch=v8plusa |
| | UltraSPARC III | Sun Blade 1000, Sun Blade 2000 | V9+VIS II | -xchip=ultra3 -xarch=v8plusb* |
| | UltraSPARC IIIi | Sun Blade 1500, Sun Blade 2500 | V9+VIS II | -xchip=ultra3i -xarch=v8plusb* |
| | UltraSPARC IV | Sun Fire V490, Sun Fire V890, Sun Fire E2900, Sun Fire E4900, Sun Fire E6900, Sun Fire E20K, Sun Fire E25K | V9+VIS II | -xchip=ultra4 -xarch=v8plusb* |

*Programs compiled or linked with -xarch=v8plusb will work only on UltraSPARC III/IV systems. To create a program that can run on any UltraSPARC (I, II, III, IV) system, use -xarch=v8plusa.


The last column in the preceding table shows the compiler flags to use to obtain the fastest code for each FPU. These flags control two independent attributes of code generation: the -xarch flag determines the instruction set the compiler may use, and the -xchip flag determines the assumptions the compiler will make about a processor's performance characteristics in scheduling the code. Because all SPARC floating-point units implement at least the floating-point instruction set defined in the SPARC Architecture Manual Version 7, a program compiled with -xarch=v7 will run on any SPARC based system, although it may not take full advantage of the features of later processors. Likewise, a program compiled with a particular -xchip value will run on any SPARC based system that supports the instruction set specified with -xarch, but it may run more slowly on systems with processors other than the one specified.

The floating-point units listed in the table preceding the microSPARC-I implement the floating-point instruction set defined in the SPARC Architecture Manual Version 7. Programs that must run on systems with these FPUs should be compiled with -xarch=v7. The compilers make no special assumptions regarding the performance characteristics of these processors, so they all share the single -xchip option -xchip=old. (Not all of the systems listed in TABLE B-1 are still supported by the compilers; they are listed solely for historical purposes. Refer to the appropriate version of the Numerical Computation Guide for the code generation flags to use with compilers supporting these systems.)

The microSPARC-I and microSPARC-II floating-point units implement the floating-point instruction set defined in the SPARC Architecture Manual Version 8 except for the FsMULd and quad precision instructions. Programs compiled with -xarch=v8 will run on systems with these processors, but because unimplemented floating-point instructions must be emulated by the system kernel, programs that use FsMULd extensively (such as Fortran programs that perform a lot of single precision complex arithmetic), may encounter severe performance degradation. To avoid this, compile programs for systems with these processors with -xarch=v8a.

The SuperSPARC-I, SuperSPARC-II, hyperSPARC, and TurboSPARC floating-point units implement the floating-point instruction set defined in the SPARC Architecture Manual Version 8 except for the quad precision instructions. To get the best performance on systems with these processors, compile with -xarch=v8.

The UltraSPARC I, UltraSPARC II, UltraSPARC IIe, UltraSPARC IIi, UltraSPARC III, UltraSPARC IIIi, and UltraSPARC IV floating-point units implement the floating-point instruction set defined in the SPARC Architecture Manual Version 9 except for the quad precision instructions; in particular, they provide 32 double precision floating-point registers. To allow the compiler to use these registers, compile with -xarch=v8plus (for programs that run under a 32-bit OS) or -xarch=v9 (for programs that run under a 64-bit OS). These processors also provide extensions to the standard instruction set. The additional instructions, known as the Visual Instruction Set or VIS, are rarely generated automatically by the compilers, but they may be used in assembly code. Therefore, to take full advantage of the instruction set these processors support, use -xarch=v8plusa (32-bit) or -xarch=v9a (64-bit).
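The flag recommendations in TABLE B-1 can be captured in a small lookup table, for example in a build script helper. The following C sketch is illustrative only: the structure fpu_flags and the function optimum_flags are hypothetical names, the entries are a subset of the 32-bit flag pairs copied from TABLE B-1, and the fallback relies on the statement above that code compiled with -xarch=v7 runs on any SPARC based system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of TABLE B-1: processor name and its optimum 32-bit flags. */
struct fpu_flags {
    const char *processor;
    const char *flags;
};

/* Subset of TABLE B-1; values copied from the table's last column. */
static const struct fpu_flags fpu_table[] = {
    { "microSPARC-I",   "-xchip=micro -xarch=v8a" },
    { "microSPARC-II",  "-xchip=micro2 -xarch=v8a" },
    { "SuperSPARC-I",   "-xchip=super -xarch=v8" },
    { "SuperSPARC-II",  "-xchip=super2 -xarch=v8" },
    { "hyperSPARC",     "-xchip=hyper -xarch=v8" },
    { "TurboSPARC",     "-xchip=micro2 -xarch=v8" },
    { "UltraSPARC I",   "-xchip=ultra -xarch=v8plusa" },
    { "UltraSPARC II",  "-xchip=ultra2 -xarch=v8plusa" },
    { "UltraSPARC III", "-xchip=ultra3 -xarch=v8plusb" },
    { "UltraSPARC IV",  "-xchip=ultra4 -xarch=v8plusb" },
};

/* Hypothetical helper: return the flags for a processor name.
   Unknown names fall back to the Version 7 flags, which the text
   says produce code that runs on any SPARC based system. */
const char *optimum_flags(const char *processor)
{
    assert(processor != NULL);
    for (size_t i = 0; i < sizeof fpu_table / sizeof fpu_table[0]; i++) {
        if (strcmp(fpu_table[i].processor, processor) == 0)
            return fpu_table[i].flags;
    }
    return "-xchip=old -xarch=v7";
}
```

For example, optimum_flags("hyperSPARC") yields the string to pass to the compiler for a hyperSPARC system. A 64-bit build would need the -xarch=v9 family instead, which this sketch does not model.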

The -xarch and -xchip options can be specified simultaneously using the -xtarget macro option. (That is, the -xtarget flag simply expands to a suitable combination of -xarch, -xchip, and -xcache flags.) The default code generation option is -xtarget=generic. See the cc(1), CC(1), and f95(1) man pages and the compiler manuals for more information including a complete list of -xarch, -xchip, and -xtarget values. Additional -xarch information is provided in the Fortran User's Guide, C User's Guide, and C++ User's Guide.

B.1.1 Floating-Point Status Register and Queue

All SPARC floating-point units, regardless of which version of the SPARC architecture they implement, provide a floating-point status register (FSR) that contains status and control bits associated with the FPU. All SPARC FPUs that implement deferred floating-point traps provide a floating-point queue (FQ) that contains information about currently executing floating-point instructions. The FSR can be accessed by user software to detect floating-point exceptions that have occurred and to control rounding direction, trapping, and nonstandard arithmetic modes. The FQ is used by the operating system kernel to process floating-point traps and is normally invisible to user software.

Software accesses the floating-point status register via STFSR and LDFSR instructions that store the FSR in memory and load it from memory, respectively. In SPARC assembly language, these instructions are written as follows:


        st      %fsr, [addr]  ! store FSR at specified address
        ld      [addr], %fsr  ! load FSR from specified address

The inline template file libm.il located in the directory containing the libraries supplied with the Sun Studio compilers contains examples showing the use of STFSR and LDFSR instructions.

FIGURE B-1 shows the layout of bit fields in the floating-point status register.


FIGURE B-1 SPARC Floating-Point Status Register



In versions 7 and 8 of the SPARC architecture, the FSR occupies 32 bits as shown. In version 9, the FSR is extended to 64 bits, of which the lower 32 match the figure; the upper 32 are largely unused, containing only three additional floating point condition code fields.

Here res refers to bits that are reserved, ver is a read-only field that identifies the version of the FPU, and ftt and qne are used by the system when it processes floating-point traps. The remaining fields are described in the following table.


TABLE B-2 Floating-Point Status Register Fields

| Field | Contains |
|---|---|
| RM | rounding direction mode |
| TEM | trap enable modes |
| NS | nonstandard mode |
| fcc | floating point condition code |
| aexc | accrued exception flags |
| cexc | current exception flags |


The RM field holds two bits that specify the rounding direction for floating-point operations. The NS bit enables nonstandard arithmetic mode on SPARC FPUs that implement it; on others, this bit is ignored. The fcc field holds floating-point condition codes generated by floating-point compare instructions and used by branch and conditional move operations. Finally, the TEM, aexc, and cexc fields contain five bits that control trapping and record accrued and current exception flags for each of the five IEEE 754 floating-point exceptions. These fields are subdivided as shown in TABLE B-3.


TABLE B-3 Exception Handling Fields

| Field | Corresponding bits in register |
|---|---|
| TEM, trap enable modes | NVM 27, OFM 26, UFM 25, DZM 24, NXM 23 |
| aexc, accrued exception flags | nva 9, ofa 8, ufa 7, dza 6, nxa 5 |
| cexc, current exception flags | nvc 4, ofc 3, ufc 2, dzc 1, nxc 0 |


(The symbols NV, OF, UF, DZ, and NX above stand for the invalid operation, overflow, underflow, division-by-zero, and inexact exceptions respectively.)
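The field layout can be made concrete with a few shifts and masks. The following C sketch decodes a 32-bit FSR image that has already been stored to memory (for example with st %fsr). The names fsr_fields, fsr_decode, and fsr_rounding_name are hypothetical. The positions of TEM, aexc, and cexc come from TABLE B-3; the positions of RM (bits 31:30), NS (bit 22), and fcc (bits 11:10) and the RM encoding are assumptions taken from the SPARC Architecture Manual Version 8 rather than from this text.

```c
#include <assert.h>
#include <stdint.h>

/* Exception bits within each five-bit group (TEM, aexc, cexc),
   in the order given by TABLE B-3. */
enum {
    FP_NX = 0x01,  /* inexact */
    FP_DZ = 0x02,  /* division by zero */
    FP_UF = 0x04,  /* underflow */
    FP_OF = 0x08,  /* overflow */
    FP_NV = 0x10   /* invalid operation */
};

/* Decoded copy of the fields listed in TABLE B-2. */
struct fsr_fields {
    unsigned rm, tem, ns, fcc, aexc, cexc;
};

/* Hypothetical helper: split a 32-bit FSR image into its fields.
   RM, NS, and fcc positions are assumed from the V8 manual. */
struct fsr_fields fsr_decode(uint32_t fsr)
{
    struct fsr_fields f;
    f.rm   = (fsr >> 30) & 0x3u;   /* bits 31:30 (assumed) */
    f.tem  = (fsr >> 23) & 0x1fu;  /* bits 27:23 */
    f.ns   = (fsr >> 22) & 0x1u;   /* bit 22 (assumed) */
    f.fcc  = (fsr >> 10) & 0x3u;   /* bits 11:10 (assumed) */
    f.aexc = (fsr >> 5)  & 0x1fu;  /* bits 9:5 */
    f.cexc = fsr & 0x1fu;          /* bits 4:0 */
    return f;
}

/* Hypothetical helper: name the rounding direction held in RM.
   The encoding 0..3 is assumed from the V8 manual. */
const char *fsr_rounding_name(unsigned rm)
{
    static const char *const names[] = {
        "to nearest", "toward zero",
        "toward +infinity", "toward -infinity"
    };
    assert(rm <= 3u);
    return names[rm];
}
```

With this layout, the image 0x410000A1 decodes to rounding toward zero, the division-by-zero trap enabled, underflow and inexact accrued, and inexact as the current exception.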

B.1.2 Special Cases Requiring Software Support

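In most cases, SPARC floating-point units execute instructions completely in hardware without requiring software support. There are four situations, however, when the hardware will not successfully complete a floating-point instruction: (1) the floating-point unit is disabled (an fp_disabled trap); (2) the hardware does not implement the instruction (an unimplemented_FPop trap); (3) the hardware cannot complete the instruction for the given operands or result (an unfinished_FPop trap); and (4) the instruction incurs an IEEE 754 floating-point exception whose trap is enabled.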

In each situation, the initial response is the same: the process "traps" to the system kernel, which determines the cause of the trap and takes the appropriate action. (The term "trap" refers to an interruption of the normal flow of control.) In the first three situations, the kernel emulates the trapping instruction in software. Note that the emulated instruction can also incur an exception whose trap is enabled.

In the first three situations above, if the emulated instruction does not incur an IEEE floating-point exception whose trap is enabled, the kernel completes the instruction. If the instruction is a floating-point compare, the kernel updates the condition codes to reflect the result; if the instruction is an arithmetic operation, it delivers the appropriate result to the destination register. It also updates the current exception flags to reflect any (untrapped) exceptions raised by the instruction, and it "or"s those exceptions into the accrued exception flags. It then arranges to continue execution of the process at the point at which the trap was taken.

When an instruction executed by hardware or emulated by the kernel software incurs an IEEE floating-point exception whose trap is enabled, the instruction is not completed. The destination register, floating point condition codes, and accrued exception flags are unchanged, the current exception flags are set to reflect the particular exception that caused the trap, and the kernel sends a SIGFPE signal to the process.

The following pseudo-code summarizes the handling of floating-point traps. Note that the aexc field can normally only be cleared by software.


FPop provokes a trap;
if trap type is fp_disabled, unimplemented_FPop, or
  unfinished_FPop then
    emulate FPop;
texc ← all IEEE exceptions generated by FPop;
if (texc and TEM) = 0 then
    f[rd] ← fp_result;  // if fpop is an arithmetic op
    fcc ← fcc_result;   // if fpop is a compare
    cexc ← texc;
    aexc ← (aexc or texc);
else
    cexc ← trapped IEEE exception generated by FPop;
    throw SIGFPE;
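The flag updates in the pseudo-code above can be modeled in a few lines of C. This is a sketch of the bookkeeping only, not kernel code: the names fp_flags and fp_complete are hypothetical, the five-bit masks use the bit order of TABLE B-3 (invalid in bit 4 down to inexact in bit 0), and the rule that a single highest-priority trapped exception is reported in cexc is an assumption based on the SPARC Architecture Manual rather than on this text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Five-bit masks modeling the TEM, aexc, and cexc fields. */
struct fp_flags {
    unsigned tem, aexc, cexc;
};

/* Hypothetical model of the final step of trap handling.
   texc holds all IEEE exceptions generated by the operation.
   Returns true if the instruction completes, false if the
   kernel would send SIGFPE instead. */
bool fp_complete(struct fp_flags *s, unsigned texc)
{
    assert(s != NULL);
    unsigned trapped = texc & s->tem & 0x1fu;

    if (trapped == 0) {
        /* No enabled trap: record current flags and accrue them. */
        s->cexc = texc;
        s->aexc |= texc;
        return true;
    }

    /* Enabled trap: aexc is left unchanged, and cexc reports the
       highest-priority trapped exception (assumed priority:
       invalid, overflow, underflow, division by zero, inexact). */
    unsigned bit = 0x10u;
    while ((trapped & bit) == 0)
        bit >>= 1;
    s->cexc = bit;
    return false;
}
```

For instance, with only the overflow trap enabled, an operation that raises overflow and inexact leaves aexc untouched and sets cexc to overflow alone; with no traps enabled, both flags appear in cexc and are accrued.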

A program will encounter severe performance degradation when many floating-point instructions must be emulated by the kernel. The relative frequency with which this happens can depend on several factors including, of course, the type of trap.

Under normal circumstances, the fp_disabled trap should occur only once per process. The system kernel disables the floating-point unit when a process is first started, so the first floating-point operation executed by the process will cause a trap. After processing the trap, the kernel enables the floating-point unit, and it remains enabled for the duration of the process. (It is possible to disable the floating-point unit for the entire system, but this is not recommended and is done only for kernel or hardware debugging purposes.)

An unimplemented_FPop trap will obviously occur any time the floating-point unit encounters an instruction it does not implement. Since most current SPARC floating-point units implement at least the instruction set defined by the SPARC Architecture Manual Version 8 except for the quad precision instructions, and the Sun Studio compilers do not generate quad precision instructions, this type of trap should not occur on most systems. As mentioned above, two notable exceptions are the microSPARC-I and microSPARC-II processors, which do not implement the FsMULd instruction. To avoid unimplemented_FPop traps on these processors, compile programs with the -xarch=v8a option.

The remaining two trap types, unfinished_FPop and trapped IEEE exceptions, are usually associated with special computational situations involving NaNs, infinities, and subnormal numbers.

B.1.2.1 IEEE Floating-Point Exceptions, NaNs, and Infinities

When a floating-point instruction encounters an IEEE floating-point exception whose trap is enabled, the instruction is not completed; instead the system delivers a SIGFPE signal to the process. If the process has established a SIGFPE signal handler, that handler is invoked, and otherwise, the process aborts. Since trapping is most often enabled for the purpose of aborting the program when an exception occurs, either by invoking a signal handler that prints a message and terminates the program or by resorting to the system default behavior when no signal handler is installed, most programs do not incur many trapped IEEE floating-point exceptions. As described in Chapter 4, however, it is possible to arrange for a signal handler to supply a result for the trapping instruction and continue execution. Note that severe performance degradation can result if many floating-point exceptions are trapped and handled in this way.

Most SPARC floating-point units will also trap on at least some cases involving infinite or NaN operands or IEEE floating-point exceptions even when trapping is disabled or an instruction would not cause an exception whose trap is enabled. This happens when the hardware does not support such special cases; instead it generates an unfinished_FPop trap and leaves the kernel emulation software to complete the instruction. Different SPARC FPUs vary as to the conditions that result in an unfinished_FPop trap: for example, most early SPARC FPUs as well as the hyperSPARC FPU trap on all IEEE floating-point exceptions regardless of whether trapping is enabled, while UltraSPARC FPUs can trap "pessimistically" when a floating-point exception's trap is enabled and the hardware is unable to determine whether or not an instruction would raise that exception. On the other hand, the SuperSPARC-I, SuperSPARC-II, TurboSPARC, microSPARC-I, and microSPARC-II FPUs handle all exceptional cases in hardware and never generate unfinished_FPop traps.

Since most unfinished_FPop traps occur in conjunction with floating-point exceptions, a program can avoid incurring an excessive number of these traps by employing exception handling (i.e., testing the exception flags, trapping and substituting results, or aborting on exceptions). Of course, care must be taken to balance the cost of handling exceptions with that of allowing exceptions to result in unfinished_FPop traps.

B.1.2.2 Subnormal Numbers and Nonstandard Arithmetic

The most common situations in which some SPARC floating-point units will trap with an unfinished_FPop involve subnormal numbers. Many SPARC FPUs will trap whenever a floating-point operation involves subnormal operands or must generate a nonzero subnormal result (i.e., a result that incurs gradual underflow). Because underflow is somewhat rare but difficult to program around, and because the accuracy of underflowed intermediate results often has little effect on the overall accuracy of the final result of a computation, the SPARC architecture includes a nonstandard arithmetic mode that provides a way for a user to avoid the performance degradation associated with unfinished_FPop traps involving subnormal numbers.

The SPARC architecture does not precisely define nonstandard arithmetic mode; it merely states that when this mode is enabled, processors that support it may produce results that do not conform to the IEEE 754 standard. However, all existing SPARC implementations that support this mode use it to disable gradual underflow, replacing all subnormal operands and results with zero. (There is one exception: in nonstandard mode, Weitek 1164/1165 FPUs flush only subnormal results to zero; they do not treat subnormal operands as zero.)
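The effect of replacing subnormal operands and results with zero can be illustrated in software. The following C sketch does not enable any hardware mode; flush_subnormal, sub_nonstandard, and flush_demo are hypothetical helpers that imitate the behavior described above, on the assumption that the host itself performs IEEE gradual underflow.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical model: replace a subnormal value with a zero of the
   same sign; zeros, normal numbers, infinities, and NaNs pass through. */
double flush_subnormal(double x)
{
    double mag = x < 0.0 ? -x : x;
    if (mag > 0.0 && mag < DBL_MIN)
        return x < 0.0 ? -0.0 : 0.0;
    return x;
}

/* Subtraction with operands and result flushed, as in nonstandard mode. */
double sub_nonstandard(double x, double y)
{
    return flush_subnormal(flush_subnormal(x) - flush_subnormal(y));
}

/* Usage: two distinct normal numbers whose difference is subnormal. */
void flush_demo(void)
{
    double a = 1.5 * DBL_MIN;  /* normal */
    double b = DBL_MIN;        /* smallest positive normal */

    assert(a != b);
    assert(a - b != 0.0);                  /* gradual underflow keeps it */
    assert(sub_nonstandard(a, b) == 0.0);  /* flushing loses it */
}
```

The demonstration shows the property that gradual underflow preserves and flushing gives up: with IEEE arithmetic, a != b implies a - b != 0, whereas the flushed difference of two distinct numbers can be zero.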

Not all SPARC implementations provide a nonstandard mode. Specifically, the SuperSPARC-I, SuperSPARC-II, TurboSPARC, microSPARC-I, and microSPARC-II floating-point units handle subnormal operands and generate subnormal results entirely in hardware, so they do not need to support nonstandard arithmetic. (Any attempt to enable nonstandard mode on these processors is ignored.) Therefore, gradual underflow incurs no performance loss on these processors.

To determine whether gradual underflows are affecting the performance of a program, you should first determine whether underflows are occurring at all and then check how much system time is used by the program. To determine whether underflows are occurring, you can use the math library function ieee_retrospective() to see if the underflow exception flag is raised when the program exits. Fortran programs call ieee_retrospective() by default. C and C++ programs need to call ieee_retrospective() explicitly prior to exit. If any underflows have occurred, ieee_retrospective() prints a message similar to the following:


Note: IEEE floating-point exception flags raised:
    Inexact;  Underflow;
See the Numerical Computation Guide, ieee_flags(3M)

If the program encounters underflows, you might want to determine how much system time the program is using by timing the program execution with the time command.


demo% /bin/time myprog > myprog.output
305.3 real      32.4 user     271.9 sys

If the system time (the third figure shown above) is unusually high, multiple underflows might be the cause. If so, and if the program does not depend on the accuracy of gradual underflow, you can enable nonstandard mode for better performance. There are two ways to do this. First, you can compile with the -fns flag (which is implied as part of the macros -fast and -fnonstd) to enable nonstandard mode at program startup. Second, the value-added math library libsunmath provides two functions to enable and disable nonstandard mode, respectively: calling nonstandard_arithmetic() enables nonstandard mode (if it is supported), while calling standard_arithmetic() restores IEEE behavior. The C and Fortran syntax for calling these functions is as follows:


C, C++      nonstandard_arithmetic();
            standard_arithmetic();

Fortran     call nonstandard_arithmetic()
            call standard_arithmetic()





Caution - Since nonstandard arithmetic mode defeats the accuracy benefits of gradual underflow, you should use it with caution. For more information about gradual underflow, see Chapter 2.



B.1.2.3 Nonstandard Arithmetic and Kernel Emulation

On SPARC floating-point units that implement nonstandard mode, enabling this mode causes the hardware to treat subnormal operands as zero and flush subnormal results to zero. The kernel software that is used to emulate trapped floating-point instructions, however, does not implement nonstandard mode, in part because the effect of this mode is undefined and implementation-dependent and because the added cost of handling gradual underflow is negligible compared to the cost of emulating a floating-point operation in software.

If a floating-point operation that would be affected by nonstandard mode is interrupted (for example, it has been issued but not completed when a context switch occurs or another floating-point instruction causes a trap), it will be emulated by kernel software using standard IEEE arithmetic. Thus, under unusual circumstances, a program running in nonstandard mode might produce slightly varying results depending on system load. This behavior has not been observed in practice. It would affect only those programs that are very sensitive to whether one particular operation out of millions is executed with gradual underflow or with abrupt underflow.


B.2 fpversion(1) Function -- Finding Information About the FPU

The fpversion utility distributed with the compilers identifies the installed CPU and estimates the processor and system bus clock speeds. fpversion determines the CPU and FPU types by interpreting the identification information stored by the CPU and FPU. It estimates their clock speeds by timing a loop that executes simple instructions that run in a predictable amount of time. The loop is executed many times to increase the accuracy of the timing measurements. For this reason, fpversion is not instantaneous; it can take several seconds to run.
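The timing approach can be sketched in C as follows. This is not the loop fpversion actually runs, which is not documented here; loop_rate is a hypothetical function that times a simple loop with the standard clock() function and reports iterations per second. Converting that rate to a clock frequency would additionally require knowing how many cycles each iteration takes on the processor in question.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical sketch: time a loop of simple operations and return
   the measured rate in iterations per second of CPU time. Returns
   0.0 if the loop finished below the resolution of clock(). */
double loop_rate(unsigned long iterations)
{
    assert(iterations > 0);

    /* volatile forces the compiler to execute every iteration. */
    volatile unsigned long sink = 0;

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        sink += i;
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? (double)iterations / seconds : 0.0;
}
```

As the text notes for fpversion, a larger iteration count improves the accuracy of the estimate at the cost of a longer run.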

fpversion also reports the best -xtarget code generation option to use for the host system.

On an Ultra 4 workstation, fpversion displays information similar to the following. (There may be variations due to differences in timing or machine configuration.)


demo% fpversion 
 A SPARC-based CPU is available.
 CPU's clock rate appears to be approximately 461.1 MHz.
 Kernel says CPU's clock rate is 480.0 MHz.
 Kernel says main memory's clock rate is 120.0 MHz.
 
 Sun-4 floating-point controller version 0 found.
 An UltraSPARC chip is available.
 FPU's frequency appears to be approximately 492.7 MHz.
 
 Use "-xtarget=ultra2 -xcache=16/32/1:2048/64/1" code-generation option.
 
 Hostid = hardware_host_id

See the fpversion(1) manual page for more information.