Streaming SIMD Extensions 2 (SSE2)

SSE 2 — An Overview

SSE2 was first introduced on the Intel Pentium 4, and are also known sometimes as "Willamette" instructions. These instructions are very similar to the SSE instructions in structure, but allow us considerably more flexibility in crunching numbers. The biggest differences between SSE and SSE2 were the ability to deal with double-precision, or 64bit, floating-point values as well as with 32bit ones, along with the ability to now work on 128bit integer types in XMM registers as well. In total, 144 new instructions were added.

One instruction that was introduced with SSE2 is the CLFlush instruction. This instruction has its own bit in the cpuid feature flag which must be checked before using it, even if the CPU supports SSE2. See cpuid for details.

AMD didn't support SSE2 until 2003, with their Opteron and Athlon64 processors.

This instruction set was made available around February of 2000.

SSE2 — OpCode List

(under construction -- this list might be incomplete?!)
Additionally, with AMD64's 64/128 bit register extensions some of the functionality changes...

Arithmetic:
addpd - Adds 2 64bit doubles.
addsd - Adds bottom 64bit doubles.
subpd - Subtracts 2 64bit doubles.
subsd - Subtracts bottom 64bit doubles.
mulpd - Multiplies 2 64bit doubles.
mulsd - Multiplies bottom 64bit doubles.
divpd - Divides 2 64bit doubles.
divsd - Divides bottom 64bit doubles.
maxpd - Gets largest of 2 64bit doubles for 2 sets.
maxsd - Gets largets of 2 64bit doubles to bottom set.
minpd - Gets smallest of 2 64bit doubles for 2 sets.
minsd - Gets smallest of 2 64bit values for bottom set.
paddb - Adds 16 8bit integers.
paddw - Adds 8 16bit integers.
paddd - Adds 4 32bit integers.
paddq - Adds 2 64bit integers.
paddsb - Adds 16 8bit integers with saturation.
paddsw - Adds 8 16bit integers using saturation.
paddusb - Adds 16 8bit unsigned integers using saturation.
paddusw - Adds 8 16bit unsigned integers using saturation.
psubb - Subtracts 16 8bit integers.
psubw - Subtracts 8 16bit integers.
psubd - Subtracts 4 32bit integers.
psubq - Subtracts 2 64bit integers.
psubsb - Subtracts 16 8bit integers using saturation.
psubsw - Subtracts 8 16bit integers using saturation.
psubusb - Subtracts 16 8bit unsigned integers using saturation.
psubusw - Subtracts 8 16bit unsigned integers using saturation.
pmaddwd - Multiplies 16bit integers into 32bit results and adds results.
pmulhw - Multiplies 16bit integers and returns the high 16bits of the result.
pmullw - Multiplies 16bit integers and returns the low 16bits of the result.
pmuludq - Multiplies 2 32bit pairs and stores 2 64bit results.
rcpps - Approximates the reciprocal of 4 32bit singles.
rcpss - Approximates the reciprocal of bottom 32bit single.
sqrtpd - Returns square root of 2 64bit doubles.
sqrtsd - Returns square root of bottom 64bit double.

Logic:
andnpd - Logically NOT ANDs 2 64bit doubles.
andnps - Logically NOT ANDs 4 32bit singles.
andpd - Logically ANDs 2 64bit doubles.
pand - Logically ANDs 2 128bit registers.
pandn - Logically Inverts the first 128bit operand and ANDs with the second.
por - Logically ORs 2 128bit registers.
pslldq - Logically left shifts 1 128bit value.
psllq - Logically left shifts 2 64bit values.
pslld - Logically left shifts 4 32bit values.
psllw - Logically left shifts 8 16bit values.
psrad - Arithmetically right shifts 4 32bit values.
psraw - Arithmetically right shifts 8 16bit values.
psrldq - Logically right shifts 1 128bit values.
psrlq - Logically right shifts 2 64bit values.
psrld - Logically right shifts 4 32bit values.
psrlw - Logically right shifts 8 16bit values.
pxor - Logically XORs 2 128bit registers.
orpd - Logically ORs 2 64bit doubles.
xorpd - Logically XORs 2 64bit doubles.

Compare:
cmppd - Compares 2 pairs of 64bit doubles.
cmpsd - Compares bottom 64bit doubles.
comisd - Compares bottom 64bit doubles and stores result in EFLAGS.
ucomisd - Compares bottom 64bit doubles and stores result in EFLAGS. (QNaNs don't throw exceptions with ucomisd, unlike comisd.
pcmpxxb - Compares 16 8bit integers.
pcmpxxw - Compares 8 16bit integers.
pcmpxxd - Compares 4 32bit integers.
Compare Codes (the xx parts above):
eq - Equal to.
lt - Less than.
le - Less than or equal to.
ne - Not equal.
nlt - Not less than.
nle - Not less than or equal to.
ord - Ordered.
unord - Unordered.

Conversion:
cvtdq2pd - Converts 2 32bit integers into 2 64bit doubles.
cvtdq2ps - Converts 4 32bit integers into 4 32bit singles.
cvtpd2pi - Converts 2 64bit doubles into 2 32bit integers in an MMX register.
cvtpd2dq - Converts 2 64bit doubles into 2 32bit integers in the bottom of an XMM register.
cvtpd2ps - Converts 2 64bit doubles into 2 32bit singles in the bottom of an XMM register.
cvtpi2pd - Converts 2 32bit integers into 2 32bit singles in the bottom of an XMM register.
cvtps2dq - Converts 4 32bit singles into 4 32bit integers.
cvtps2pd - Converts 2 32bit singles into 2 64bit doubles.
cvtsd2si - Converts 1 64bit double to a 32bit integer in a GPR.
cvtsd2ss - Converts bottom 64bit double to a bottom 32bit single. Tops are unchanged.
cvtsi2sd - Converts a 32bit integer to the bottom 64bit double.
cvtsi2ss - Converts a 32bit integer to the bottom 32bit single.
cvtss2sd - Converts bottom 32bit single to bottom 64bit double.
cvtss2si - Converts bottom 32bit single to a 32bit integer in a GPR.
cvttpd2pi - Converts 2 64bit doubles to 2 32bit integers using truncation into an MMX register.
cvttpd2dq - Converts 2 64bit doubles to 2 32bit integers using truncation.
cvttps2dq - Converts 4 32bit singles to 4 32bit integers using truncation.
cvttps2pi - Converts 2 32bit singles to 2 32bit integers using truncation into an MMX register.
cvttsd2si - Converts a 64bit double to a 32bit integer using truncation into a GPR.
cvttss2si - Converts a 32bit single to a 32bit integer using truncation into a GPR.

Load/Store:
(is "minimize cache pollution" the same as "without using cache"??)
movq - Moves a 64bit value, clearing the top 64bits of an XMM register.
movsd - Moves a 64bit double, leaving tops unchanged if move is between two XMMregisters.
movapd - Moves 2 aligned 64bit doubles.
movupd - Moves 2 unaligned 64bit doubles.
movhpd - Moves top 64bit value to or from an XMM register.
movlpd - Moves bottom 64bit value to or from an XMM register.
movdq2q - Moves bottom 64bit value into an MMX register.
movq2dq - Moves an MMX register value to the bottom of an XMM register. Top is cleared to zero.
movntpd - Moves a 128bit value to memory without using the cache. NT is "Non Temporal."
movntdq - Moves a 128bit value to memory without using the cache.
movnti - Moves a 32bit value without using the cache.
maskmovdqu - Moves 16 bytes based on sign bits of another XMM register.
pmovmskb - Generates a 16bit Mask from the sign bits of each byte in an XMM register.

Shuffling:
pshufd - Shuffles 32bit values in a complex way.
pshufhw - Shuffles high 16bit values in a complex way.
pshuflw - Shuffles low 16bit values in a complex way.
unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128 bit sources into 1.
punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
punpckhqdq - Unpacks and interleaces top 64bit integers from 2 128bit sources into 1.
punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
punpcklqdq - Unpacks and interleaces bottom 64bit integers from 2 128bit sources into 1.
packssdw - Packs 32bit integers to 16bit integers using saturation.
packsswb - Packs 16bit integers to 8bit integers using saturation.
packuswb - Packs 16bit integers to 8bit unsigned integers unsing saturation.

Cache Control:
clflush - Flushes a Cache Line from all levels of cache.
lfence - Guarantees that all memory loads issued before the lfence instruction are completed before anyloads after the lfence instruction.
mfence - Guarantees that all memory reads and writes issued before the mfence instruction are completed before any reads or writes after the mfence instruction.
pause - Pauses execution for a set amount of time.

SSE3 — Other Changes

There were a few other changes that took place around the introduction of SSE3.

Reading performance counters using rdpmc could now just read the bottom 32bits into eax instead of reading all 40bits into edx:eax. This faster read can be activated by clearing bit 31 of ecx just before the read. Additionally, Branch Hints were introduced. These hint prefixes are used to help the processor perform better branch prediction.
hwnt - Hint Weakly Not Taken.
hst - Hint Strongly Taken.

Styles: Default · Green · Sianse