SSE — An Overview
SSE is a newer SIMD extension to the Intel Pentium III and AMD AthlonXP
microprocessors. Unlike the MMX and 3DNow! extensions, which occupy the
same register space as the normal FPU registers, SSE adds a separate
register space to the microprocessor. Because the operating system has to
save and restore these extra registers on every task switch, SSE can only
be used on operating systems that support it. Fortunately, most recent
operating systems have support built in. All versions of Windows since
Windows 98 support SSE, as do Linux kernels since 2.2.
SSE was introduced in 1999, and was also known as "Katmai New Instructions" (or KNI)
after the Pentium III's core codename.
SSE adds 8 new 128-bit registers, each divided into 4 32-bit
(single precision) floating point values. These registers are called
XMM0 - XMM7. An additional control register, MXCSR, is also available to
control and check the status of SSE instructions.
SSE gives us access to 70 new instructions that operate on these 128-bit
registers, MMX registers, and sometimes even regular 32-bit registers.
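As a rough illustration of the register layout described above, the following C sketch treats one XMM register as four packed single-precision values. It assumes an x86 compiler that ships `<xmmintrin.h>`; the helper name `add4` is hypothetical, and the `_mm_*` intrinsics are the compiler's wrappers that typically compile to `movups` and `addps`.

```c
#include <xmmintrin.h>  /* SSE intrinsics; __m128 maps onto one XMM register */

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3. */
void add4(const float a[4], const float b[4], float out[4])
{
    __m128 va  = _mm_loadu_ps(a);     /* load 4 floats into one register */
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  /* addps: 4 additions at once */
    _mm_storeu_ps(out, sum);          /* write all 4 results back */
}
```

Calling `add4` on {1, 2, 3, 4} and {10, 20, 30, 40} fills `out` with {11, 22, 33, 44}.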
SSE — MXCSR
The MXCSR register is a 32-bit register containing flags for control and
status information regarding SSE instructions. As of SSE3, only bits 0-15
have been defined.
Mnemonic | Bit Location | Description |
FZ | bit 15 | Flush To Zero |
R+ | bit 14 | Round Positive |
R- | bit 13 | Round Negative |
RZ | bits 13 and 14 both set | Round To Zero |
RN | bits 13 and 14 both clear | Round To Nearest |
PM | bit 12 | Precision Mask |
UM | bit 11 | Underflow Mask |
OM | bit 10 | Overflow Mask |
ZM | bit 9 | Divide By Zero Mask |
DM | bit 8 | Denormal Mask |
IM | bit 7 | Invalid Operation Mask |
DAZ | bit 6 | Denormals Are Zero |
PE | bit 5 | Precision Flag |
UE | bit 4 | Underflow Flag |
OE | bit 3 | Overflow Flag |
ZE | bit 2 | Divide By Zero Flag |
DE | bit 1 | Denormal Flag |
IE | bit 0 | Invalid Operation Flag |
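The bit layout in the table can be turned into a few masks. The sketch below is plain C with no hardware dependency: it only decodes a 32-bit value that is assumed to have been read from MXCSR already. All macro, enum, and function names are hypothetical.

```c
#include <stdint.h>

/* Masks derived from the table above (names are illustrative only). */
#define MXCSR_FZ_BIT      (1u << 15)    /* Flush To Zero */
#define MXCSR_ROUND_FIELD (3u << 13)    /* bits 13-14, rounding control */
#define MXCSR_MASK_FIELD  (0x3Fu << 7)  /* IM..PM, bits 7-12 */
#define MXCSR_DAZ_BIT     (1u << 6)     /* Denormals Are Zero */
#define MXCSR_FLAG_FIELD  0x3Fu         /* IE..PE, bits 0-5 */

enum mxcsr_rounding {
    ROUND_NEAREST  = 0,  /* RN: bits 13 and 14 clear */
    ROUND_NEGATIVE = 1,  /* R-: bit 13 */
    ROUND_POSITIVE = 2,  /* R+: bit 14 */
    ROUND_ZERO     = 3   /* RZ: bits 13 and 14 set */
};

/* Extracts the rounding mode from a saved MXCSR value. */
enum mxcsr_rounding mxcsr_rounding_mode(uint32_t mxcsr)
{
    return (enum mxcsr_rounding)((mxcsr & MXCSR_ROUND_FIELD) >> 13);
}

/* Nonzero when all six exception masks are set. */
int mxcsr_all_masked(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_MASK_FIELD) == MXCSR_MASK_FIELD;
}

/* Returns the six sticky exception flags (bits 0-5). */
unsigned mxcsr_flags(uint32_t mxcsr)
{
    return mxcsr & MXCSR_FLAG_FIELD;
}
```

For example, 0x1F80 (the value MXCSR holds after a processor reset) decodes as round to nearest, all exceptions masked, and no flags set.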
FZ mode causes all underflowing operations to simply go to zero. This saves
some processing time, but loses precision.
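The effect of FZ can be observed with a product that lands in the denormal range. This is only a sketch: it assumes an x86 compiler with `<xmmintrin.h>`, assumes the underflow exception is masked (the usual default), and uses the hypothetical names `MXCSR_FZ` and `mul_with_fz`.

```c
#include <xmmintrin.h>

#define MXCSR_FZ 0x8000u  /* bit 15, Flush To Zero */

/* Multiplies a by b with mulss while FZ is forced on or off. */
float mul_with_fz(float a, float b, int fz_on)
{
    unsigned int saved = _mm_getcsr();      /* stmxcsr */
    /* volatile keeps the compiler from folding or moving the multiply */
    volatile float va = a, vb = b, result;

    _mm_setcsr(fz_on ? (saved | MXCSR_FZ) : (saved & ~MXCSR_FZ));
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    _mm_setcsr(saved);                      /* restore the caller's MXCSR */
    return result;
}
```

With FZ off, `mul_with_fz(1e-20f, 1e-20f, 0)` returns a tiny nonzero denormal; with FZ on, the same product comes back as zero.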
The R+, R-, RN, and RZ rounding modes determine how the lowest bit of a
result is generated when the exact value can't be represented. Normally,
RN is used.
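One way to see the rounding modes at work is `cvtss2si`, which converts a float to an integer using whatever mode MXCSR currently selects. The sketch below assumes an x86 compiler with `<xmmintrin.h>`; the macro names and `to_int_with_rounding` are hypothetical.

```c
#include <xmmintrin.h>

#define MXCSR_ROUND_BITS 0x6000u  /* bits 13-14 */
#define MXCSR_RN         0x0000u  /* round to nearest */
#define MXCSR_R_NEG      0x2000u  /* bit 13, round negative */
#define MXCSR_R_POS      0x4000u  /* bit 14, round positive */
#define MXCSR_RZ         0x6000u  /* both bits, round to zero */

/* Converts x to an integer under the given rounding mode. */
int to_int_with_rounding(float x, unsigned int mode)
{
    unsigned int saved = _mm_getcsr();
    /* volatile pins the conversion between the two MXCSR writes */
    volatile float vx = x;
    volatile int result;

    _mm_setcsr((saved & ~MXCSR_ROUND_BITS) | mode);
    result = _mm_cvtss_si32(_mm_set_ss(vx));  /* cvtss2si */
    _mm_setcsr(saved);                        /* restore previous mode */
    return result;
}
```

Converting 2.5 gives 2 under RN (ties go to the even value) and 3 under R+, while -2.5 gives -3 under R- and -2 under RZ.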
PM, UM, OM, ZM, DM, and IM are masks that tell the processor to ignore the
corresponding exceptions if they happen. This keeps the program from
having to deal with problems, but might cause invalid results.
DAZ tells the CPU to force all denormals to zero. A denormal is a number
so small that the FPU can't renormalize it because of the limited exponent
range. Denormals behave just like normal numbers, but they take
considerably longer to process. Note that not all processors support DAZ.
PE, UE, OE, ZE, DE, and IE are the exception flags. Each one is set when
the corresponding exception happens while it is masked. Programs can check
these to see if something interesting happened. These bits are "sticky",
which means that once they're set, they stay set until the program clears
them. This means that the indicated exception could have happened several
operations ago, but nobody bothered to clear it.
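The sticky behaviour can be checked with a division by zero. This sketch assumes an x86 compiler with `<xmmintrin.h>` and assumes the divide-by-zero exception is masked (the usual default); every name in it is hypothetical.

```c
#include <xmmintrin.h>

#define MXCSR_ZE        0x0004u  /* bit 2, Divide By Zero Flag */
#define MXCSR_FLAG_BITS 0x003Fu  /* bits 0-5, all six flags */

/* Nonzero if a divide by zero happened since the flags were cleared. */
int divide_by_zero_seen(void)
{
    return (_mm_getcsr() & MXCSR_ZE) != 0;
}

/* Clears all six sticky exception flags, leaving the rest of MXCSR alone. */
void clear_exception_flags(void)
{
    _mm_setcsr(_mm_getcsr() & ~MXCSR_FLAG_BITS);
}

/* Divides with divss; volatile stops the compiler from folding it away. */
float divide(float num, float den)
{
    volatile float vn = num, vd = den, q;
    q = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(vn), _mm_set_ss(vd)));
    return q;
}
```

After `clear_exception_flags()`, a call to `divide(1.0f, 0.0f)` sets ZE, and a later harmless `divide(6.0f, 3.0f)` leaves it set until the flags are cleared again.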
DAZ wasn't available in the first version of SSE. Since setting a reserved
bit in MXCSR causes a general protection fault, we need to be able to
check the availability of this feature without causing problems. To do
this, set up a 512-byte, 16-byte-aligned area of memory, save the SSE
state to it using fxsave, and then inspect bytes 28 through 31 for the
MXCSR_MASK value. If bit 6 is set, DAZ is supported; otherwise, it isn't.
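The check described above might look like the following. This is a sketch under several assumptions: GCC or Clang inline assembly syntax, an x86 target, and an operating system that has already enabled `fxsave`. The struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* fxsave wants a 512-byte save area aligned on a 16-byte boundary. */
struct fxsave_area {
    unsigned char bytes[512];
} __attribute__((aligned(16)));

/* Returns the MXCSR_MASK value stored at bytes 28-31 of the save area. */
uint32_t read_mxcsr_mask(void)
{
    struct fxsave_area area;
    uint32_t mask;

    memset(&area, 0, sizeof area);               /* start from a clean area */
    __asm__ volatile ("fxsave %0" : "=m" (area));
    memcpy(&mask, area.bytes + 28, sizeof mask); /* bytes 28 through 31 */
    return mask;
}

/* Nonzero when bit 6 of MXCSR_MASK is set, i.e. DAZ may be enabled. */
int daz_supported(void)
{
    return (read_mxcsr_mask() & (1u << 6)) != 0;
}
```

A program would call `daz_supported()` once at startup and only set bit 6 of MXCSR when it returns nonzero.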
SSE — OpCode List
(Still under construction. In the descriptions below, "lowest" means bits
0-31 of the register, not the smallest value of the set. The byte/word and
8-bit/16-bit terminology still needs to be regularized.)
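To make the meaning of "lowest" concrete, here is a small sketch using `subss` through its intrinsic. It assumes an x86 compiler with `<xmmintrin.h>`; `sub_lowest` is a hypothetical name. The input is chosen so that the lowest element (bits 0-31) is not the numerically smallest one.

```c
#include <xmmintrin.h>

/* out[0] = a[0] - b[0]; out[1..3] are copied from a unchanged (subss). */
void sub_lowest(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_sub_ss(va, vb));  /* only element 0 changes */
}
```

With a = {8, 1, 2, 3} and b = {5, 5, 5, 5}, the result is {3, 1, 2, 3}: element 0 is the one operated on even though it holds the largest value.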
Arithmetic:
addps
- Adds 4 single-precision (32bit) floating-point
values to 4 other single-precision floating-point values.
addss
- Adds the lowest single-precision values, top 3
remain unchanged.
subps
- Subtracts 4 single-precision floating-point
values from 4 other single-precision floating-point values.
subss
- Subtracts the lowest single-precision values, top
3 remain unchanged.
mulps
- Multiplies 4 single-precision floating-point
values with 4 other single-precision values.
mulss
- Multiplies the lowest single-precision values,
top 3 remain unchanged.
divps
- Divides 4 single-precision floating-point values
by 4 other single-precision floating-point values.
divss
- Divides the lowest single-precision values, top 3
remain unchanged.
rcpps
- Computes approximate reciprocals (1/x) of 4 single-precision
floating-point values.
rcpss
- Computes the approximate reciprocal of the lowest single-precision
value, top 3 remain unchanged.
sqrtps
- Square root of 4 single-precision values.
sqrtss
- Square root of lowest value, top 3 remain
unchanged.
rsqrtps
- Approximate reciprocal square root of 4 single-precision
floating-point values.
rsqrtss
- Approximate reciprocal square root of lowest
single-precision value, top 3 remain unchanged.
maxps
- Returns maximum of 2 values in each of 4
single-precision values.
maxss
- Returns maximum of 2 values in the lowest
single-precision value. Top 3 remain unchanged.
minps
- Returns minimum of 2 values in each of 4
single-precision values.
minss
- Returns minimum of 2 values in the lowest
single-precision value, top 3 remain unchanged.
pavgb
- Returns average of 2 values in each of 8 bytes.
pavgw
- Returns average of 2 values in each of 4 words.
psadbw
- Returns sum of absolute differences of 8 8bit
values. Result in bottom 16 bits.
pextrw
- Extracts 1 of 4 words.
pinsrw
- Inserts 1 of 4 words.
pmaxsw
- Returns maximum of 2 values in each of 4 signed
word values.
pmaxub
- Returns maximum of 2 values in each of 8
unsigned byte values.
pminsw
- Returns minimum of 2 values in each of 4 signed
word values.
pminub
- Returns minimum of 2 values in each of 8
unsigned byte values.
pmovmskb
- Builds a mask byte from the top bit of each of 8 byte
values.
pmulhuw
- Multiplies 4 unsigned word values and stores
the high
16bit result.
pshufw
- Shuffles 4 word values. Takes 2 64-bit MMX values (source and dest) and an 8-bit immediate value, and then fills in each dest 16-bit word from the source
16-bit word specified by the immediate. The immediate byte is broken into 4 2-bit values, one per dest word.
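How the immediate byte selects words in `pshufw` is easier to follow in code. The function below is a plain C software model, not the instruction itself: it treats a 64-bit MMX value as an array of four 16-bit words, and the name `pshufw_model` is hypothetical.

```c
#include <stdint.h>

/* Software model of pshufw: dest word i comes from the source word
   selected by bits (2*i+1):(2*i) of the immediate. */
void pshufw_model(uint16_t dest[4], const uint16_t src[4], uint8_t imm)
{
    uint16_t tmp[4];  /* temporary, so dest and src may be the same array */

    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm >> (2 * i)) & 3u;  /* 2-bit selector for word i */
        tmp[i] = src[sel];
    }
    for (int i = 0; i < 4; i++)
        dest[i] = tmp[i];
}
```

With src = {10, 11, 12, 13}, an immediate of 0x1B reverses the words to {13, 12, 11, 10}, 0xE4 leaves them in place, and 0x00 copies word 0 into every position.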
Logic:
andnps
- Logically ANDs 4 single-precision values with
the logical inverse (NOT) of 4 other single-precision values.
andps
- Logically ANDs 4 single-precision values with 4
other single-precision values.
orps
- Logically ORs 4 single-precision values with 4
other single-precision values.
xorps
- Logically XORs 4 single-precision values with 4
other single-precision values.
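A common use of the logic instructions is manipulating the sign bit of floating-point values. The sketch below clears it with `andnps` to get absolute values; this is one well-known idiom rather than anything the list above prescribes. It assumes an x86 compiler with `<xmmintrin.h>`, and `abs4` is a hypothetical name.

```c
#include <xmmintrin.h>

/* out[i] = |in[i]| for i = 0..3, computed by clearing bit 31 of each value. */
void abs4(const float in[4], float out[4])
{
    __m128 sign = _mm_set1_ps(-0.0f);  /* only the sign bit set in each value */
    __m128 v    = _mm_loadu_ps(in);

    /* andnps computes (NOT first operand) AND second operand */
    _mm_storeu_ps(out, _mm_andnot_ps(sign, v));
}
```

Note the operand order: the first argument of `_mm_andnot_ps` is the one that gets inverted.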
Compare:
cmpxxps
- Compares 4 single-precision values.
cmpxxss
- Compares lowest 2 single-precision values.
comiss
- Compares lowest 2 single-precision values and stores result in EFLAGS.
ucomiss
- Compares lowest 2 single-precision values and stores result in EFLAGS.
(QNaNs don't throw exceptions with ucomiss, unlike comiss.)
Compare Codes (the xx parts above):
eq
- Equal to.
lt
- Less than.
le
- Less than or equal to.
neq
- Not equal.
nlt
- Not less than.
nle
- Not less than or equal to.
ord
- Ordered.
unord
- Unordered.
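The packed compares do not touch EFLAGS; each result element becomes all ones when the condition holds and all zeros otherwise. A sketch of how that result is often collected into an ordinary integer with `movmskps` follows. It assumes an x86 compiler with `<xmmintrin.h>`; `less_than_mask` is a hypothetical name.

```c
#include <xmmintrin.h>

/* Returns a 4-bit mask: bit i is set when a[i] < b[i]. */
int less_than_mask(const float a[4], const float b[4])
{
    /* cmpltps: each element becomes 0xFFFFFFFF (true) or 0 (false) */
    __m128 m = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* movmskps: gathers the top bit of each element into bits 0-3 */
    return _mm_movemask_ps(m);
}
```

For a = {1, 5, 3, 7} and b = {2, 4, 6, 0}, elements 0 and 2 satisfy the comparison, so the function returns 5 (binary 0101).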
Conversion:
cvtpi2ps
- Converts 2 32bit integers to 32bit
floating-point values. Top 2 values remain unchanged.
cvtps2pi
- Converts 2 32bit floating-point values to
32bit integers.
cvtsi2ss
- Converts 1 32bit integer to 32bit
floating-point value. Top 3 values remain unchanged.
cvtss2si
- Converts 1 32bit floating-point value to 32bit
integer.
cvttps2pi
- Converts 2 32bit floating-point values to
32bit integers using truncation.
cvttss2si
- Converts 1 32bit floating-point value to
32bit integer using truncation.
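The difference between the rounding and truncating conversions shows up on values with a fractional part. This sketch assumes an x86 compiler with `<xmmintrin.h>` and the default round-to-nearest mode in MXCSR; both function names are hypothetical.

```c
#include <xmmintrin.h>

/* cvtss2si: rounds according to the current MXCSR rounding mode. */
int round_to_int(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: always truncates toward zero, whatever MXCSR says. */
int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

Under the default mode, `round_to_int(3.5f)` gives 4 while `truncate_to_int(3.5f)` gives 3, and for -2.7 the two return -3 and -2 respectively.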
State:
fxrstor
- Restores FP and SSE state.
fxsave
- Stores FP and SSE state.
ldmxcsr
- Loads the MXCSR register.
stmxcsr
- Stores the MXCSR register.
Load/Store:
movaps
- Moves a 128bit value.
movhlps
- Moves the high half of the source to the low half of the destination.
movlhps
- Moves the low half of the source to the high half of the destination.
movhps
- Moves a 64-bit value between memory and the top half of an xmm register.
movlps
- Moves a 64-bit value between memory and the bottom half of an xmm register.
movmskps
- Moves top bits of single-precision values into
bottom 4 bits of a 32bit register.
movss
- Moves the bottom single-precision value. The top 3 remain unchanged if
both operands are xmm registers; when loading from memory into an xmm
register, they're set to zero.
movups
- Moves a 128bit value. Address can be unaligned.
maskmovq
- Moves a 64bit value according to a mask.
movntps
- Moves a 128bit value directly to memory,
skipping the cache. (NT stands for "Non Temporal".)
movntq
- Moves a 64bit value directly to memory, skipping
the cache.
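The choice between `movaps` and `movups` comes down to whether the address is a multiple of 16. The copy routine below picks between the two at run time. It is a sketch that assumes an x86 compiler with `<xmmintrin.h>`; `copy_floats` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Copies count floats from src to dst, 4 at a time where possible. */
void copy_floats(float *dst, const float *src, size_t count)
{
    /* movaps requires both addresses to be 16-byte aligned */
    int aligned = (((uintptr_t)dst | (uintptr_t)src) & 15u) == 0;
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        if (aligned)
            _mm_store_ps(dst + i, _mm_load_ps(src + i));    /* movaps */
        else
            _mm_storeu_ps(dst + i, _mm_loadu_ps(src + i));  /* movups */
    }
    for (; i < count; i++)  /* leftover elements, one at a time */
        dst[i] = src[i];
}
```

Because each step advances by 16 bytes, the alignment test only has to be done once, before the loop.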
Shuffling:
shufps
- Shuffles 4 single-precision values. Complex.
unpckhps
- Unpacks single-precision values from high
halves.
unpcklps
- Unpacks single-precision values from low
halves.
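Since `shufps` is described above only as "complex", a few concrete cases may help. With the intrinsic, the two low result elements are picked from the first operand and the two high ones from the second, each by a 2-bit field of the immediate. The sketch assumes an x86 compiler with `<xmmintrin.h>`; the function names are hypothetical, while `_MM_SHUFFLE` is the helper macro from that header.

```c
#include <xmmintrin.h>

/* out = {in[0], in[0], in[0], in[0]} */
void broadcast_lowest(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)));
}

/* out = {in[3], in[2], in[1], in[0]} */
void reverse4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    /* _MM_SHUFFLE lists selectors from the highest element down to the lowest */
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* unpcklps: out = {a[0], b[0], a[1], b[1]} */
void interleave_low(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_unpacklo_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Passing the same register as both operands, as the first two functions do, turns `shufps` into an arbitrary rearrangement of a single vector.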
Cache Control:
prefetchT0
- Fetches a cache-line of data into all levels
of cache.
prefetchT1
- Fetches a cache-line of data into all but
the highest levels of cache.
prefetchT2
- Fetches a cache-line of data into all but
the two highest levels of cache.
prefetchNTA
- Fetches data into only the highest level of
cache, not the lower levels.
sfence
- Guarantees that all memory writes issued before the sfence instruction
are completed before any writes issued after the sfence instruction.
With prefetches, it's ok to access an invalid memory location (i.e. off the end of an array) -- however, generating the address must not fault.
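A sketch of prefetching ahead of a simple loop is shown below. It assumes an x86 compiler with `<xmmintrin.h>`; `PREFETCH_AHEAD` and `sum_with_prefetch` are hypothetical, and the 64-byte distance is an arbitrary choice, not a tuned value. Near the end of the array the prefetched address lies past the last element, which the prefetch instruction tolerates.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

#define PREFETCH_AHEAD 64  /* bytes to look ahead; arbitrary for this sketch */

/* Sums count floats while hinting the cache about upcoming data. */
float sum_with_prefetch(const float *data, size_t count)
{
    float total = 0.0f;

    for (size_t i = 0; i < count; i++) {
        /* The address is formed with integer arithmetic because it may
           point past the end of the array on the last iterations. */
        uintptr_t ahead = (uintptr_t)&data[i] + PREFETCH_AHEAD;

        _mm_prefetch((const char *)ahead, _MM_HINT_T0);  /* prefetchT0 */
        total += data[i];
    }
    return total;
}
```

Issuing one prefetch per element keeps the example short; real code would more likely issue one per cache line.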