Guide: Using Paired Singles and GQRs

Punkline · Jul 27, 2018

GameCube and Wii consoles are equipped with hardware that supports a special floating point data type called “paired singles.” These are no different from any other single-precision floating point -- aside from the fact that there are 2 of them in one register.

There is a slew of special instructions (extensions to the PowerPC instruction set that are unique to the Gekko and Broadway processors) designed to handle pairs of floating point singles in roughly the same time it would normally take to operate on one floating point double -- making them very efficient in math-heavy algorithms.

---

Melee Code Manager supports paired singles!

If for some reason your assembler doesn't support them, you can try using these macros as an alternative:

Code:

# PSQA format:
.macro PSQAform, op, frD, rA, W, i, d
.long \op<<(31-5)+\frD<<(31-10)+\rA<<(31-15)+\W<<(31-16)+\i<<(31-19)+(\d&0xFFF)
.endm
# frD = destination register
# frS = source register
# d   = address displacement
# rA  = address register
# W   = bool -- if true, use 1.0 in place of second variable
# qrI = QR ID

# psq _l, _st, _lu, _stu
.macro psq_l, frD, d, rA, W, qrI
PSQAform 56, \frD, \rA, \W, \qrI, \d
.endm
# quantized load

.macro psq_st, frS, d, rA, W, qrI
PSQAform 60, \frS, \rA, \W, \qrI, \d
.endm
# quantized store

.macro psq_lu, frD, d, rA, W, qrI
PSQAform 57, \frD, \rA, \W, \qrI, \d
.endm
# quantized load update

.macro psq_stu, frS, d, rA, W, qrI
PSQAform 61, \frS, \rA, \W, \qrI, \d
.endm
# quantized store update
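
As a rough cross-check of the bit layout that the PSQAform macro packs, here is a minimal Python sketch of the same arithmetic. The function name `psqa_word` is only an illustration (it isn't part of any tool), and the field positions are copied straight from the macro above.

```python
def psqa_word(op, frD, rA, W, i, d):
    """Pack a PSQA-form word the same way the PSQAform macro does."""
    return (
        (op << (31 - 5))       # primary opcode, bits 0-5
        | (frD << (31 - 10))   # float register, bits 6-10
        | (rA << (31 - 15))    # address register, bits 11-15
        | (W << (31 - 16))     # W flag, bit 16
        | (i << (31 - 19))     # QR index, bits 17-19
        | (d & 0xFFF)          # 12-bit displacement, bits 20-31
    )

# psq_l 1, 0, 3, 0, 7  (opcode 56, as in the psq_l macro)
print(hex(psqa_word(56, 1, 3, 0, 7, 0)))   # 0xe0237000
# psq_l 2, 2, 3, 1, 7
print(hex(psqa_word(56, 2, 3, 1, 7, 2)))   # 0xe043f002
```

A negative displacement such as -4 is masked down to 12 bits (0xFFC), which is what the `(\d&0xFFF)` term in the macro is for.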

Code:

# PSQAB format:
.macro PSQABform, op, rD, A, B, W, i, op2
.long \op<<(31-5)+\rD<<(31-10)+\A<<(31-15)+\B<<(31-20)+\W<<(31-21)+\i<<(31-24)+\op2<<1
.endm
# frD = destination register
# frS = source register
# rA  = address register
# rB  = index register
# W   = bool -- if true, use 1.0 in place of second variable
# qrI = QR ID

# psq _lx, _stx, _lux, _stux
.macro psq_lx, frD, rA, rB, W, qrI
PSQABform 4, \frD, \rA, \rB, \W, \qrI, 6
.endm
# quantized load index

.macro psq_stx, frS, rA, rB, W, qrI
PSQABform 4, \frS, \rA, \rB, \W, \qrI, 7
.endm
# quantized store index

.macro psq_lux, frD, rA, rB, W, qrI
PSQABform 4, \frD, \rA, \rB, \W, \qrI, 38
.endm
# quantized load update index

.macro psq_stux, frS, rA, rB, W, qrI
PSQABform 4, \frS, \rA, \rB, \W, \qrI, 39
.endm
# quantized store update index

Code:

# PSAB format:
.macro PSABform, op, rD, A, B, op2, f
.long \op<<(31-5)+\rD<<(31-10)+\A<<(31-15)+\B<<(31-20)+\op2<<1+\f
.endm
# frD = destination register
# frA = operand pair 1
# frB = operand pair 2


# ps _mr, _abs, _nabs, _neg
.macro ps_mr, frD, frA
PSABform 4, \frD, 0, \frA, 72, 0
.endm
# move register
# D0 = A0
# D1 = A1

.macro ps_abs, frD, frA
PSABform 4, \frD, 0, \frA, 264, 0
.endm
# absolute value
# D0 = abs(A0)
# D1 = abs(A1)

.macro ps_neg, frD, frA
PSABform 4, \frD, 0, \frA, 40, 0
.endm
# negate value
# D0 = A0 * -1.0
# D1 = A1 * -1.0

.macro ps_nabs, frD, frA
PSABform 4, \frD, 0, \frA, 136, 0
.endm
# negated absolute value
# D0 = abs(A0) * -1.0
# D1 = abs(A1) * -1.0



# ps _merge00, _merge01, _merge10, _merge11
.macro ps_merge00, frD, frA, frB
PSABform 4, \frD, \frA, \frB, 528, 0
.endm
# merge high
# D0 = A0
# D1 = B0


.macro ps_merge01, frD, frA, frB
PSABform 4, \frD, \frA, \frB, 560, 0
.endm
# merge direct
# D0 = A0
# D1 = B1

.macro ps_merge10, frD, frA, frB
PSABform 4, \frD, \frA, \frB, 592, 0
.endm
# merge swapped
# D0 = A1
# D1 = B0

.macro ps_merge11, frD, frA, frB
PSABform 4, \frD, \frA, \frB, 624, 0
.endm
# merge low
# D0 = A1
# D1 = B1


# ps _cmpo0, cmpo1, cmpu0, cmpu1
.macro ps_cmpo0, crfD, frA, frB
PSABform 4, \crfD<<2, \frA, \frB, 32, 0
.endm
# compare ordered high
# crD = compare A0 vs B0

.macro ps_cmpo1, crfD, frA, frB
PSABform 4, \crfD<<2, \frA, \frB, 96, 0
.endm
# compare ordered low
# crD = compare A1 vs B1

.macro ps_cmpu0, crfD, frA, frB
PSABform 4, \crfD<<2, \frA, \frB, 0, 0
.endm
# compare unordered high
# crD = compare A0 vs B0

.macro ps_cmpu1, crfD, frA, frB
PSABform 4, \crfD<<2, \frA, \frB, 64, 0
.endm
# compare unordered low
# crD = compare A1 vs B1

Code:

# PSABC format:
.macro PSABCform, op, rD, A, B, C, op2, f
.long \op<<(31-5)+\rD<<(31-10)+\A<<(31-15)+\B<<(31-20)+\C<<(31-25)+\op2<<1+\f
.endm
# frD = destination register
# frA = operand pair 1
# frB = operand pair 2
# frC = operand pair 3

# ps _add, _sub, _mul, _div
.macro ps_add, frD, frA, frB
PSABCform 4, \frD, \frA, \frB, 0, 21, 0
.endm
# add
# D0 = A0 + B0
# D1 = A1 + B1

.macro ps_sub, frD, frA, frB
PSABCform 4, \frD, \frA, \frB, 0, 20, 0
.endm
# subtract
# D0 = A0 - B0
# D1 = A1 - B1

.macro ps_mul, frD, frA, frB
PSABCform 4, \frD, \frA, 0, \frB, 25, 0
.endm
# multiply
# D0 = A0 * B0
# D1 = A1 * B1

.macro ps_div, frD, frA, frB
PSABCform 4, \frD, \frA, \frB, 0, 18, 0
.endm
# divide
# D0 = A0 / B0
# D1 = A1 / B1


# ps _muls0, _muls1
.macro ps_muls0, frD, frA, frB
PSABCform 4, \frD, \frA, 0, \frB, 12, 0
.endm
# multiply scalar high
# D0 = A0 * B0
# D1 = A1 * B0

.macro ps_muls1, frD, frA, frB
PSABCform 4, \frD, \frA, 0, \frB, 13, 0
.endm
# multiply scalar low
# D0 = A0 * B1
# D1 = A1 * B1


# ps _sum0, _sum1
.macro ps_sum0, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 10, 0
.endm
# Vector sum high
# D0 = A0 + C1
# D1 = B1

.macro ps_sum1, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 11, 0
.endm
# Vector sum low
# D0 = B0
# D1 = A0 + C1


# ps _madd, _msub, _nmadd, _nmsub
.macro ps_madd, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 29, 0
.endm
# Multiply then Add
# D0 = A0 * B0 + C0
# D1 = A1 * B1 + C1

.macro ps_msub, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 28, 0
.endm
# Multiply then Subtract
# D0 = A0 * B0 - C0
# D1 = A1 * B1 - C1

.macro ps_nmadd, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 31, 0
.endm
# Multiply, Add, then Negate the result
# D0 = -(A0 * B0 + C0)
# D1 = -(A1 * B1 + C1)

.macro ps_nmsub, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 30, 0
.endm
# Multiply, Subtract, then Negate the result
# D0 = -(A0 * B0 - C0)
# D1 = -(A1 * B1 - C1)


# ps _madds0, _madds1
.macro ps_madds0, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 14, 0
.endm
# Multiply then add scalar high
# D0 = A0 * B0 + C0
# D1 = A1 * B0 + C1

.macro ps_madds1, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 15, 0
.endm
# Multiply then add scalar low
# D0 = A0 * B1 + C0
# D1 = A1 * B1 + C1

.macro ps_res, frD, frA
PSABCform 4, \frD, 0, \frA, 0, 24, 0
.endm
# Reciprocal (estimate)
# D0 = 1.0 / A0
# D1 = 1.0 / A1

.macro ps_rsqrte, frD, frA
PSABCform 4, \frD, 0, \frA, 0, 26, 0
.endm
# Reciprocal of Square Root (estimate)
# D0 = 1.0 / sqrt(A0)
# D1 = 1.0 / sqrt(A1)

.macro ps_sel, frD, frA, frB, frC
PSABCform 4, \frD, \frA, \frC, \frB, 23, 0
.endm
# Select (float)
# if   A0 >= 0.0
# then D0 = B0
# else D0 = C0
# if   A1 >= 0.0
# then D1 = B1
# else D1 = C1

Lightly tested -- let me know if you run into any problems while using these.
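
One detail in the 3-operand macros is easy to miss: the third macro operand lands in the C field of the instruction word and the fourth lands in the B field. The following Python sketch mirrors the PSABCform arithmetic to make that visible. The helper names are hypothetical and exist only for this illustration.

```python
def psabc_word(op, rD, A, B, C, op2, f=0):
    """Pack a PSABC-form word the same way the PSABCform macro does."""
    return (
        (op << (31 - 5))
        | (rD << (31 - 10))
        | (A << (31 - 15))
        | (B << (31 - 20))
        | (C << (31 - 25))
        | (op2 << 1)
        | f
    )

def ps_madd(frD, frA, frB, frC):
    # Mirrors the ps_madd macro: the 3rd operand goes to the C field
    # and the 4th operand goes to the B field.
    return psabc_word(4, frD, frA, frC, frB, 29)

def ps_mul(frD, frA, frB):
    # Mirrors the ps_mul macro: the multiplier goes to the C field.
    return psabc_word(4, frD, frA, 0, frB, 25)

word = ps_madd(1, 2, 3, 4)
print(hex(word))            # 0x102220fa
print((word >> 11) & 31)    # B field -> 4
print((word >> 6) & 31)     # C field -> 3
```

This is why the comments under each macro name operands by their position in the macro call rather than by the field they are encoded into.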

---

How Paired Singles use Floating Point Registers:

There are 32 floating point registers used by the processor for handling floating point data. Each is 64 bits wide -- enough room to accommodate a double-precision floating point. Paired singles are made possible by splitting these 64-bit registers in half, dedicating the low and high 32 bits to separate floating point singles. This creates a vector, or a “pair.”

If you open Dolphin in debug mode and take a look at the registers panel, you can see these registers as they’re being used. For whatever reason, Dolphin seems to use 128 bits per register.

As a mildly confusing consequence, a pair is incorrectly displayed as a pair of doubles.

Internally, these pairs are single precision rather than double precision.

Regardless of how it’s displayed -- you may use the two columns in the floating point register array to check the value of a floating point single, a floating point double, or a pair. When considering pairs, it’s easier to think of a floating point register as a small array of 2 elements, “fr[0, 1]” -- referred to as ps0 and ps1 in various documentation sources.

Dolphin will usually refer to these as “pairs” by using p1 instead of f1 when printing the operands in the code panel, but they may just as easily be thought of as f1 or simply 1. They all refer to the same thing -- a 64-bit floating point register.

---

While the system is in paired singles mode, regular floating point singles are treated like pairs. When you load a floating point using the instruction “lfs” -- you’re actually loading the same float twice. The same is not true of “lfd” however, because there are only enough bits for one double in a floating point register.

Melee is always in paired singles mode, so you can expect this behavior whenever floats are loaded: “lfs” fills both ps0 and ps1 with the same single, while “lfd” fills the register with one double.

---

Graphics Quantization Registers:

The instructions dedicated to loading and storing paired singles are particularly special, because they incorporate quantization. With these quantization instructions, it’s very easy to work with both integers and floats at the same time while using very few instructions to convert between the two.

There are 8 “GQRs” -- supervisor-level special-purpose registers that may be accessed with mtspr and mfspr instructions by using the following IDs:

912 = QR0
913 = QR1
914 = QR2
915 = QR3
916 = QR4
917 = QR5
918 = QR6
919 = QR7

Note - these should be treated like saved registers! If writing to a QR, first back it up, then make sure to restore it before returning.

---

Each GQR may specify a quantization setting for store instructions and a dequantization setting for load instructions. I've sometimes heard this referred to as "float packing" and "unpacking".

The GQR format comprises 4 bytes, built from 2 kinds of parameters that specify quantization settings:

SCALE: describes a power of 2 to scale the loaded/stored value
TYPE: describes whether to load/store as a type of integer

The GQR format aligns each parameter to 1 of 4 bytes in the register:
0x0 = 6-bit LD_SCALE # load
0x1 = 3-bit LD_TYPE
0x2 = 6-bit ST_SCALE # store
0x3 = 3-bit ST_TYPE
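
To make the byte layout concrete, here is a small Python sketch that builds and splits GQR values. The function names are made up for this example. Treating SCALE as a signed 6-bit two's complement number is an assumption of the sketch, though it agrees with the negative scale examples in this guide and with the 0x3d byte in Melee's default QR6.

```python
def pack_gqr(ld_scale, ld_type, st_scale, st_type):
    """Build a 32-bit GQR value; scales are signed 6-bit, types are 3-bit."""
    return (
        ((ld_scale & 0x3F) << 24)   # byte 0x0: LD_SCALE
        | ((ld_type & 0x7) << 16)   # byte 0x1: LD_TYPE
        | ((st_scale & 0x3F) << 8)  # byte 0x2: ST_SCALE
        | (st_type & 0x7)           # byte 0x3: ST_TYPE
    )

def unpack_gqr(value):
    """Split a GQR value into (ld_scale, ld_type, st_scale, st_type)."""
    def signed6(x):
        x &= 0x3F
        return x - 64 if x >= 32 else x
    return (
        signed6(value >> 24),
        (value >> 16) & 0x7,
        signed6(value >> 8),
        value & 0x7,
    )

print(f"{pack_gqr(8, 4, 8, 4):08X}")   # 08040804
print(unpack_gqr(0x3D043D04))          # (-3, 4, -3, 4)
```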

The TYPE parameter may be specified as any of the following:
0 - FPS - floating point singles (no casting)
1...3 - (invalid)
4 - 8UINT - cast to/from unsigned 8-bit fixed point integers
5 - 16UINT - cast to/from unsigned 16-bit fixed point integers
6 - 8SINT - cast to/from signed 8-bit fixed point integers
7 - 16SINT - cast to/from signed 16-bit fixed point integers

A type specifies the quantized (packed) format of each value in the pair. Once loaded, the values are unpacked into a pair of floating points, which creates an opportunity to cast between data formats. Ints larger than 16 bits still need to be converted using conventional casting methods.

If the packed form is type 0, then it will not be converted at all -- read or written directly as a pair of floating points.

The “fixed point” in said integer types is defined by the corresponding scale parameter used in conjunction with the type.

The possible sizes of packed forms include the following:
0 = 8-byte (pair of FPSs)
5, 7 = 4-byte (pair of 16INTs)
4, 6 = 2-byte (pair of 8INTs)

The SCALE parameter is a signed 6-bit value, where 0 describes a 1:1 ratio.

Adding to the scale is much like shifting the unpacked number right (x>>n) -- while subtracting is like shifting left (x<<n). However, the “shifts” are merely modifications to the resulting floating point’s exponent, so each shift direction is capable of describing a scale that surpasses the limitations of the packed format.

For example, the value “-4” would be quantized using increments of 16.0 for each integral unit (1 == 0x10.0)
This would give a 16-bit value the ability to describe 20-bit values at the cost of accuracy.

The value “4” on the other hand would be quantized into increments of 1/16 (1 == "0x00.1")
This would sacrifice 4 integer bits to give the value a 4-bit fractional part, for improved accuracy below 1.0.
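
The scale arithmetic can be sketched in Python as follows. This is only a model of the behavior described above, not the hardware's exact algorithm: truncation toward zero on store is an assumption, Python floats are doubles so single-precision rounding isn't modeled, and the names are invented for the example.

```python
import math

# (min, max) for each integer TYPE value from the list above
INT_RANGES = {4: (0, 255), 5: (0, 65535), 6: (-128, 127), 7: (-32768, 32767)}

def dequantize(raw, scale):
    """Load direction: packed integer -> float, divided by 2**scale."""
    return math.ldexp(float(raw), -scale)

def quantize(value, scale, type_id):
    """Store direction: float -> packed integer, multiplied by 2**scale.

    Truncation toward zero is an assumption of this sketch; the result is
    clamped to the range of the packed type.
    """
    lo, hi = INT_RANGES[type_id]
    scaled = math.trunc(math.ldexp(value, scale))
    return max(lo, min(hi, scaled))

print(dequantize(1, -4))      # 16.0
print(dequantize(1, 4))       # 0.0625
print(quantize(16.0, -4, 5))  # 1
```

With a scale of -3, `dequantize(255, -3)` gives 2040.0, which is an 8x scale on an unsigned byte.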

---

It’s important to note that Melee uses the following default QR values. They may serve the purpose of providing simple float casting options without the need to write a new set of QR params:

QR0 = 0x00000000 -- load/store: floats
QR1 = 0x00000000 -- load/store: floats
QR2 = 0x00040004 -- load/store: unsigned bytes
QR3 = 0x00050005 -- load/store: unsigned hwords
QR4 = 0x00060006 -- load/store: signed bytes
QR5 = 0x00070007 -- load/store: signed hwords
QR6 = 0x3d043d04 -- load/store: unsigned bytes (8* scale; 0.0 ... 2040.0)
QR7 = 0x00000000 -- load/store: floats

Edit - It's worth noting that there is a brief period before the game starts where QR6 is blank, like QR7.
You may use this code with a breakpoint to inquire about the current QR values for each global draw frame:

Code:

$GQR Query Test [Punkline]
C21A4D48 00000005
7CB0E2A6 7CD1E2A6
7CF2E2A6 7D13E2A6
7D34E2A6 7D55E2A6
7D76E2A6 7D97E2A6
3BC49D48 00000000

Enable this code, and put a breakpoint at 801a4d4c
r5 ... r12 will contain the most recent QR values in QR0...QR7.
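
If you want to confirm what the code's instruction words do, this Python sketch decodes them as mfspr instructions. The decoder is a hypothetical helper written for this illustration. It relies on the standard PowerPC mfspr encoding (primary opcode 31, extended opcode 339, with the two 5-bit halves of the SPR number swapped).

```python
GQR_QUERY_WORDS = [
    0x7CB0E2A6, 0x7CD1E2A6, 0x7CF2E2A6, 0x7D13E2A6,
    0x7D34E2A6, 0x7D55E2A6, 0x7D76E2A6, 0x7D97E2A6,
]

def decode_mfspr(word):
    """Return (rD, spr) for an mfspr word, or None if it is something else."""
    opcode = word >> 26
    xo = (word >> 1) & 0x3FF
    if opcode != 31 or xo != 339:
        return None
    rD = (word >> 21) & 0x1F
    # the 10-bit SPR number is stored with its two 5-bit halves swapped
    spr = (((word >> 11) & 0x1F) << 5) | ((word >> 16) & 0x1F)
    return rD, spr

for w in GQR_QUERY_WORDS:
    rD, spr = decode_mfspr(w)
    print(f"{w:08X}: mfspr r{rD}, {spr}  (QR{spr - 912})")
```

The first line printed is `7CB0E2A6: mfspr r5, 912  (QR0)` and the last is `7D97E2A6: mfspr r12, 919  (QR7)`.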

If writing to a QR is necessary for your operation, I recommend starting with QR7. Again, be sure to back up a QR if you intend to write to it -- and be especially careful about calling functions that use paired single instructions that target a modified QR.

---

Using Quantized Load/Store Instructions:

The “psq” instructions included in the paired singles family are the instructions that use the above GQRs to load and store values as packed/unpacked floating points.

These instructions take 2 more operands than other load/store instructions. If you’re used to loading/storing other data types, the rest of the syntax should look familiar:

Code:

lfd   fD, d(rA)
psq_l fD, d(rA), W, i
# W = bool  -- if true, load 1.0 in place of target ps1
# i = 3-bit -- select QR0...QR7

Note - If you’re compiling paired singles instructions with macros, then you’ll have to use a more literal syntax:

Code:

psq_l D, d, A, W, i
# use symbols or literals to describe each operand

---

To get an idea for how GQRs may be utilized with psq instructions, consider the following 2 examples:

Code:

 # r3 = address of RGBA color

mfspr r31, 919
# backup QR7 in r31

lis r4, 0x0804      # high order = LD_ SCALE, TYPE
ori r4, r4, 0x0804  # low  order = ST_ SCALE, TYPE
mtspr 919, r4
# QR7 scale:  8 (0x00 ... 0x0.FF)
# QR7 type:   4 (unsigned byte)

psq_l 1, 0, 3, 0, 7
psq_l 2, 2, 3, 1, 7
# f1[0], f1[1] = R, G
# f2[0], f2[1] = B, 100% Alpha

The above code snippet backs up QR7 and modifies it for the purpose of loading in color channels as coefficients. A scale of 8 causes each byte to be interpreted as a coefficient between 0.0 and 1.0 (0xFF loads as 255/256), effectively causing the full byte to represent only a fractional part.

Because the second psq_l instruction uses a “1” for the “W” operand, the alpha channel is always loaded as 1.0, regardless of the input color alpha. In this case, 1.0 is the same as 100%, so this may potentially be useful.

If stored again with QR7, the quantization will cap values at 0xFF to fit in range of the 8-bit UINT format.

Here's an example of the color red when loaded and re-quantized with the described QR settings:

FF 00 00 FF
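
The round trip can be modeled in Python to see where each byte ends up. This is a sketch under assumptions, not a cycle-accurate model: truncation on store is assumed, the function names are invented, and the input alpha of 0x80 is chosen only to show that it gets replaced.

```python
import math

def load_u8_pair(data, offset, scale, w):
    """Model of psq_l with an unsigned 8-bit type: returns (ps0, ps1)."""
    ps0 = math.ldexp(float(data[offset]), -scale)
    # with W set, only one value is read and ps1 becomes 1.0
    ps1 = 1.0 if w else math.ldexp(float(data[offset + 1]), -scale)
    return ps0, ps1

def store_u8_pair(pair, scale):
    """Model of psq_st (W = 0) with an unsigned 8-bit type: returns 2 bytes."""
    out = []
    for value in pair:
        raw = math.trunc(math.ldexp(value, scale))  # truncation is assumed
        out.append(max(0, min(255, raw)))           # clamp to 0x00 ... 0xFF
    return bytes(out)

def roundtrip_rgba(rgba, scale=8):
    f1 = load_u8_pair(rgba, 0, scale, w=0)   # R, G
    f2 = load_u8_pair(rgba, 2, scale, w=1)   # B, 1.0
    return store_u8_pair(f1, scale) + store_u8_pair(f2, scale)

print(roundtrip_rgba(bytes([0xFF, 0x00, 0x00, 0x80])).hex())   # ff0000ff
```

The alpha of 1.0 scales up to 256 on store, which is out of range for a byte and gets capped at 0xFF.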

Now consider the same scenario with the use of default QR2.

Using default QRs requires no setup -- but grants little to no control over scale.

Default QR2 specifies an 8-bit UINT type to quantize and dequantize, but it doesn’t scale the exponent. The values will be cast 1:1, creating a range between 0.0 and 255.0:

Code:

 # r3 = address of RGBA color

psq_l 1, 0, 3, 0, 2
psq_l 2, 2, 3, 0, 2
# f1[0], f1[1] = R, G
# f2[0], f2[1] = B, A

Example 2 requires no QR setup because of the integrity of the default QR2 value. So long as the default value remains protected, these instructions will function as intended.

---

Complex Instruction Formats:

I thought that a few of the formats could use a diagram or two to help explain the relationships between the input operand pairs. Let me know if you think I should cover any others.

---

Merge instructions are a bit like “move” instructions that can cross over the ps0/ps1 boundary. They’re very useful for setting up parallel calculations, but they can also be used to make ps1 values accessible via “stfs” instructions. The latter is particularly useful when writing to the GX FIFO pipe.

ps_merge00 = A0, B0 "high"
ps_merge01 = A0, B1 "direct"
ps_merge10 = A1, B0 "swapped"
ps_merge11 = A1, B1 "low"
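
In place of a diagram, the four merges can be written as plain tuple shuffles. This Python sketch treats each register as a (ps0, ps1) tuple, and the functions are only models of the table above.

```python
def ps_merge00(a, b):
    return (a[0], b[0])   # "high"

def ps_merge01(a, b):
    return (a[0], b[1])   # "direct"

def ps_merge10(a, b):
    return (a[1], b[0])   # "swapped"

def ps_merge11(a, b):
    return (a[1], b[1])   # "low"

f1 = (1.0, 2.0)
f2 = (3.0, 4.0)
print(ps_merge10(f1, f2))   # (2.0, 3.0)
print(ps_merge10(f1, f1))   # (2.0, 1.0) -- swaps the halves of one register
```

Merging a register with itself using ps_merge10 or ps_merge11 moves its ps1 value into ps0, which is the position an “stfs” can reach.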

---

In contrast to the ps_add instruction, which adds 2 pairs in parallel -- the ps_sum* instructions combine an addition operation with a merge operation.

ps_sum0 adds A0 + C1 into D0, and merges the result with B1 in D1.
ps_sum1 adds A0 + C1 into D1, and merges the result with B0 in D0.

These can be useful for adding a ps0 value with a ps1 value, which is not possible with a normal ps_add.
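
Here is a Python model of the two sums, with operands named by position (A, B, C) as in the description above. Registers are (ps0, ps1) tuples. The `dot2` helper is a hypothetical usage example rather than anything from the game.

```python
def ps_mul(a, b):
    return (a[0] * b[0], a[1] * b[1])

def ps_sum0(a, b, c):
    return (a[0] + c[1], b[1])

def ps_sum1(a, b, c):
    return (b[0], a[0] + c[1])

def dot2(u, v):
    """2D dot product: multiply in parallel, then add ps0 to ps1."""
    m = ps_mul(u, v)
    return ps_sum0(m, m, m)[0]

print(ps_sum0((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)))   # (7.0, 4.0)
print(ps_sum1((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)))   # (3.0, 7.0)
print(dot2((1.0, 2.0), (3.0, 4.0)))                  # 11.0
```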

---

Unlike ps_mul, its variations ps_muls0 and ps_muls1 can multiply from one single in a pair, as opposed to using both.

ps_muls0 multiplies both A0 and A1 by B0.
ps_muls1 multiplies both A0 and A1 by B1.

This makes it easier to load only one coefficient to scale things with, instead of redundantly loading something twice just to achieve the same effect.
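
A short Python model of the scalar multiplies, again using (ps0, ps1) tuples. The values are arbitrary and only meant to show which half of B is used.

```python
def ps_muls0(a, b):
    return (a[0] * b[0], a[1] * b[0])

def ps_muls1(a, b):
    return (a[0] * b[1], a[1] * b[1])

xy = (2.0, 3.0)
coef = (10.0, 100.0)
print(ps_muls0(xy, coef))   # (20.0, 30.0)
print(ps_muls1(xy, coef))   # (200.0, 300.0)
```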

---

Variations of ps_madd include the scalar multiply behavior seen in ps_muls0 and ps_muls1.

ps_madds0 multiplies both A0 and A1 by B0 before adding the results to C0 and C1.
ps_madds1 multiplies both A0 and A1 by B1 before adding the results to C0 and C1.
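
The whole ps_madd family can be modeled in a few lines of Python. Operands are named by position (A, B, C) as in the descriptions above, and registers are (ps0, ps1) tuples. Python floats are doubles and the hardware's fused rounding isn't modeled, so treat this as a sketch of the data flow only.

```python
def ps_madd(a, b, c):
    return (a[0] * b[0] + c[0], a[1] * b[1] + c[1])

def ps_msub(a, b, c):
    return (a[0] * b[0] - c[0], a[1] * b[1] - c[1])

def ps_nmadd(a, b, c):
    return (-(a[0] * b[0] + c[0]), -(a[1] * b[1] + c[1]))

def ps_nmsub(a, b, c):
    return (-(a[0] * b[0] - c[0]), -(a[1] * b[1] - c[1]))

def ps_madds0(a, b, c):
    return (a[0] * b[0] + c[0], a[1] * b[0] + c[1])

def ps_madds1(a, b, c):
    return (a[0] * b[1] + c[0], a[1] * b[1] + c[1])

a, b, c = (2.0, 3.0), (4.0, 5.0), (1.0, 1.0)
print(ps_madd(a, b, c))     # (9.0, 16.0)
print(ps_nmsub(a, b, c))    # (-7.0, -14.0)
print(ps_madds0(a, b, c))   # (9.0, 13.0)
print(ps_madds1(a, b, c))   # (11.0, 16.0)
```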

The number of calculations performed in these floating point operations makes ps_madd and its family curiously powerful. It would seem that under certain circumstances, performing a similar set of integer operations in the same quantity would actually take more clock cycles.

Consider the ps_nmsub instruction vs a sequence of 6 integer instructions that create an equivalent integer solution. Not only would the ps_nmsub be faster than the proposed sequence; it would take fewer registers and fewer instructions to complete, and may benefit from the added fidelity of floating point arithmetic over integer arithmetic.

UnclePunch · Jul 28, 2018

read the first few lines of the post and scrolled down to like it. took me like a minute to get down to the like button lol.

i dont do much heavy calculations but maybe i can find a use for this functionality in the future. ill definitely mess with the float casting stuff, i do that a lot.

great writeup as always punkline.

Punkline · Jul 28, 2018

UnclePunch said:
read the first few lines of the post and scrolled down to like it. took me like a minute to get down to the like button lol.

i dont do much heavy calculations but maybe i can find a use for this functionality in the future. great writeup as always punkline.

lol, yeah paired singles are fairly complex. I learned most of this through trial and error while using the little documentation available, so I did my best to cover the fundamentals with a little bit of extra detail.

Edit -- I added some spoilers, for your scrollbar :dizzy:

Edit 2021 -- I use this guide a lot to remind myself of QR details, and found the spoilers too obscuring -- so they're gone again.
---

UnclePunch

I kinda figured you wouldn’t be into the whole double math thing, but you should know that’s only half the story to paired singles. The other half of their power comes from their ability to cast and scale integers<>floats using only single instructions. As you probably know, casting normally requires exploiting something like this -- but not if you take advantage of the unique hardware included in the gamecube.

I tagged you mostly because of your experience with writing custom subaction event syntaxes. The thing here that would probably interest you most of all is the quantized load instruction “psq_l”. It'll let you load in a (shiftable) integer as a float -- allowing you to make extremely compact data representations, like a 4-bit floating point.
