ASM explanations...

Started by Stuff, August 24, 2011, 02:19:59 AM


dcx2

ps = paired single.  ps is a lame version of SIMD (Single Instruction Multiple Data).  Other less lame versions include SSE and AltiVec.

fregs are 64 bits wide.  Single-precision floats are 32 bits.  ps instructions pack two singles into an freg.  A Single Instruction can then operate upon Multiple Data values at once.

ps are specific to the Gekko processor architecture.  Since they're Nintendo-specific, there isn't much info on them on the web.  About the best resource available is YAGCD, but for _l and _st it doesn't have much other than opcodes and bit fields.  http://hitmen.c02.at/files/yagcd/yagcd/chap3.html#sec3.4

psq_l and psq_st will do 32-bit or 64-bit loads or stores depending on one of the arguments...I can't remember which.  I think it's the first of the two trailing ones.  The W bit field in 3.4.3 of the YAGCD link.

The Quantization registers are some kinda voodoo magic (more formally known as "SPRs" or Special Purpose Registers).  I think they're used for scaling and converting between ints and floats.  That part is actually pretty useful, if it weren't so difficult to leverage.

===

In case the dry description doesn't get the right picture in your head, the following are mostly equivalent.

lfs f0,0(r4)     # load two floats from r4
lfs f1,4(r4)

lfs f10,0(r5)    # load two more floats from r5
lfs f11,4(r5)

fadds f0,f0,f10  # add first set of floats
fadds f1,f1,f11  # add second set of floats

stfs f0,0(r3)    # store two floats to r3
stfs f1,4(r3)

--- or

psq_l   f0,0(r4),0,0 # load two floats from r4
psq_l   f1,0(r5),0,0 # load two floats from r5
ps_add   f0,f0,f1    # add first set to second set
psq_st   f0,0(r3),0,0 # store two floats to r3
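To make the "two floats per freg" idea concrete, here is a rough Python model of the paired-single version above. It is only a sketch: the helper names are made up for illustration, and it assumes the quantization register selected by the last operand is set up for plain unscaled floats, which the thread does not confirm.

```python
import struct

def to_single(x):
    """Round a Python float to single precision, like one slot of an freg."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def psq_l(memory, offset):
    """Hypothetical model of psq_l fD,offset(rA),0,0: read two big-endian singles."""
    return struct.unpack_from(">ff", memory, offset)

def ps_add(a, b):
    """Hypothetical model of ps_add: slot 0 + slot 0 and slot 1 + slot 1."""
    return (to_single(a[0] + b[0]), to_single(a[1] + b[1]))

def psq_st(memory, offset, pair):
    """Hypothetical model of psq_st fS,offset(rA),0,0: write both singles back."""
    struct.pack_into(">ff", memory, offset, *pair)

src_a = bytearray(struct.pack(">ff", 1.5, 2.25))   # the data r4 points at
src_b = bytearray(struct.pack(">ff", 0.5, 4.0))    # the data r5 points at
dst = bytearray(8)                                 # the buffer r3 points at

f0 = psq_l(src_a, 0)
f1 = psq_l(src_b, 0)
f0 = ps_add(f0, f1)
psq_st(dst, 0, f0)

result = struct.unpack(">ff", dst)
print(result)
```

Four modelled instructions do the work of the eight scalar ones, which is the whole point of ps.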

Stuff

Interesting. My head exploded, but I'm alright . :D

Do you know if there's a guide on the safety of the fregs like the one you have for registers? fregs sound like fun. I probably won't mess with it much, though.
.make Stuff happen.
Dropbox. If you don't have one, get it NOW! +250MB free if you follow my link :p.

Mod code Generator ~50% complete but very usable:
http://dl.dropbox.com/u/24514984/modcodes/modcodes.htm

dcx2

It's been a while since I looked, but it works pretty much the same way.  Except there aren't any special registers like r1/r2/r13.

EDIT:

In terms of good ASM explanations, I found this stuff from brkirch on rlwinm and rlwimi.  Very useful instructions.

http://wiird.l0nk.org/forum/index.php/topic,2844.msg29959.html#msg29959

Stuff

Ah good to know, I guess. I'll hope I'll never need more than f12 :D

oh man. That's very nice info. Another page I'll be looking at until it sticks. I was wondering what those were. That Collective is pretty beefy. lol.

wonder why I wasn't notified about this reply. >.>

dcx2

In keeping with the theme of the thread, I'm going to add links to posts that describe various things.   Here's an example code which takes three integers in r4, r5, r6, converts them to float, multiplies them by customizable floats embedded with the bl trick, and then converts the floats back to integers in r26,r27,r28.  It's nice to see the conversions to and from float all in one place.

http://wiird.l0nk.org/forum/index.php/topic,8757.msg73781.html#msg73781

It's worth noting that the thought "will I need more than f12" is the wrong approach.  You should start at f12 and work your way toward f0 ("I'll hope I'll never need less than f0").  Hence why my code starts at f12, and then uses f11, f10, and f9.

[spoiler=dcx2 said...]
Quote from: Xenom on August 26, 2011, 07:29:47 AM
I'm trying to REDUCE the exp multiplier and had a bit of success by using:
040C1550 1F45xxxx - although it was semi-successful, I have the feeling that I'm going about this all the wrong way (shifting bits?)

1F = multiply instruction?
44 = offset in memory to multiply with?
xxxx = unsigned 16-bit integer?

0x1F44XXXX = mulli r26,r4,0xXXXX.  The 1F doesn't neatly map to "mulli"; the first 6 bits are the op code for mulli (000111, i.e. 7, which shows up as 0x1C once you mask the top byte with 0xFC).  The next five bits are the first register operand, the next five bits are the second register operand, and the last 16 bits are the immediate.
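The bit slicing described above can be checked with a few lines of Python. This is just a sketch; decode_d_form is a made-up helper name, and it only pulls apart the 6/5/5/16 field layout without looking up mnemonics.

```python
def decode_d_form(word):
    """Split a 32-bit instruction into opcode, two register fields and a signed immediate."""
    opcode = (word >> 26) & 0x3F     # first 6 bits
    rd = (word >> 21) & 0x1F         # next 5 bits: first register operand
    ra = (word >> 16) & 0x1F         # next 5 bits: second register operand
    imm = word & 0xFFFF              # last 16 bits
    simm = imm - 0x10000 if imm & 0x8000 else imm   # the immediate is signed
    return opcode, rd, ra, simm

exp_fields = decode_d_form(0x1F440002)   # EXP code with xxxx = 0002
ap_fields = decode_d_form(0x1F650002)    # AP code
sp_fields = decode_d_form(0x1F860002)    # SP code
print(exp_fields, ap_fields, sp_fields)

# The top byte 0x1F with its low two bits cleared is the opcode shifted left by 2.
top_byte_masked = 0x1F & 0xFC
print(hex(top_byte_masked))
```

The three original codes come out as opcode 7 with r26,r4 / r27,r5 / r28,r6, which matches the registers used in the combined code.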

You could get by using right-shifts to divide by powers of 2, but that will only get you .5, .25, .125, etc.  Not very practical.  You could use fixed point multiplies but that's pretty difficult for other people to customize.

Quote from: mugwhump on September 02, 2011, 09:45:27 AM
I too am extremely interested in this, and I know quite a few other people who would be too. Unfortunately my USB Gecko broke.  :-\

Exactly how successful was your attempt? What value did you use?

I'm pretty sure it's just standard integer multiplication. We need some float multiplication.


Quote
Original codes:
$EXP multiplier [Thomas83Lin]
040C1550 1F44xxxx
**Ported from Unknown*

$AP multiplier [Thomas83Lin]
040C1554 1F65xxxx
**Ported from Unknown*

$SP multiplier [Thomas83Lin]
040C1558 1F86xxxx
**Ported from Unknown*

This code will do all three.  I don't have this game and can't test it, but I'm reasonably sure it will work.

This code uses the red zone.  The red zone is addressed by using "negative stack frames", i.e. -8(r1) and -4(r1).  It allows you to avoid creating a stack frame if you don't need to, but you must be careful because the USB Gecko debugger will over-write this area if you step through it.  Hence why it's a "red" zone.

The three 3F800000's at the beginning are the EXP, AP, and SP multipliers in single-precision floating-point format.  3F000000 = 0.5, 3F400000 = 0.75, 3FC00000 = 1.5 etc
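If you want a multiplier that isn't listed, a small Python sketch like this will produce the hex for any single-precision value. The function names are made up for illustration.

```python
import struct

def float_to_hex(value):
    """Return the big-endian single-precision bit pattern of value as 8 hex digits."""
    return "%08X" % struct.unpack(">I", struct.pack(">f", value))[0]

def hex_to_float(text):
    """Inverse of float_to_hex: turn 8 hex digits back into a float."""
    return struct.unpack(">f", struct.pack(">I", int(text, 16)))[0]

for multiplier in (0.5, 0.75, 1.0, 1.5, 2.0):
    print(multiplier, float_to_hex(multiplier))
```

Paste the resulting eight digits over the matching 3F800000 in the code.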

Some important things to note about the PowerPC architecture.  1) Double-precision instructions can take single-precision operands, but the result is double-precision.  That is why I can fmuls and then fctiwz and then stfd.  2) Single-precision instructions can take double-precision operands, but produce single-precision results.  That is why I can lfd and then fsubs.

[spoiler=source]hook  800C1558:  7CDC3378   mr   r28,r6

# multipliers

bl _SKIP_DATA
.float 1.0      # EXP
.float 1.0      # AP
.float 1.0      # SP
_SKIP_DATA:

# load fregs

lfd f12,-32360(r2)   # load magic float into f12, -32360(r2) varies for each game and region
stfd f12,-8(r1)      # store magic float onto stack to create the 0x43300000

xoris r26,r4,0x8000   # load flipped-int
stw r26,-4(r1)      # store flipped-int to stack
lfd f11,-8(r1)      # load flipped-int into f11
fsubs f11,f11,f12   # normalize f11

xoris r27,r5,0x8000
stw r27,-4(r1)   
lfd f10,-8(r1)   
fsubs f10,f10,f12   

xoris r28,r6,0x8000
stw r28,-4(r1)   
lfd f9,-8(r1)      
fsubs f9,f9,f12   

# f11/f10/f9 have EXP/AP/SP

mflr r12

lfs f12,0(r12)      # load EXP mul
fmuls f11,f11,f12   # scale EXP

lfs f12,4(r12)      # load AP mul
fmuls f10,f10,f12   # scale AP

lfs f12,8(r12)      # load SP mul
fmuls f9,f9,f12      # scale SP

# store scaled versions

fctiwz f11,f11      # convert from double to int
stfd f11,-8(r1)      # store conversion
lwz r26,-4(r1)      # load conversion

fctiwz f10,f10   
stfd f10,-8(r1)   
lwz r27,-4(r1)   

fctiwz f9,f9      
stfd f9,-8(r1)   
lwz r28,-4(r1)
[/spoiler]
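For anyone puzzled by the xoris/stw/lfd/fsubs dance in the source, here is a rough Python model of it. It is a sketch under two assumptions that are not stated in the post: that the magic double at -32360(r2) is 0x4330000080000000 (only the 0x43300000 half is mentioned above), and that values stay inside the signed 32-bit range. The function names are invented, and the subtraction is done in double precision rather than rounding to single like fsubs would.

```python
import struct

MAGIC_BITS = 0x4330000080000000   # assumed contents of the magic double

def bits_to_double(bits):
    """Reinterpret a 64-bit pattern as a big-endian double (what lfd does)."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def int_to_float(value):
    """Model of xoris / stw / lfd / fsubs for one signed 32-bit integer."""
    flipped = (value ^ 0x80000000) & 0xFFFFFFFF    # xoris rX,rY,0x8000 flips the sign bit
    raw = (0x43300000 << 32) | flipped             # stw drops it next to 0x43300000
    return bits_to_double(raw) - bits_to_double(MAGIC_BITS)   # subtract the magic double

def float_to_int(value):
    """Model of fctiwz for in-range values: round toward zero."""
    return int(value)

scaled_exp = float_to_int(int_to_float(100) * 1.5)
halved = float_to_int(int_to_float(7) * 0.5)
print(int_to_float(-5), scaled_exp, halved)
```

The trick works because 0x43300000 in the upper word makes the double equal to 2^52 plus whatever sits in the lower word, so subtracting the magic value leaves just the original integer.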

EXP/AP/SP float multiplier [dcx2]
C20C1558 00000012
48000011 3F800000
3F800000 3F800000
C9828198 D981FFF8
6C9A8000 9341FFFC
C961FFF8 ED6B6028
6CBB8000 9361FFFC
C941FFF8 ED4A6028
6CDC8000 9381FFFC
C921FFF8 ED296028
7D8802A6 C18C0000
ED6B0332 C18C0004
ED4A0332 C18C0008
ED290332 FD60581E
D961FFF8 8341FFFC
FD40501E D941FFF8
8361FFFC FD20481E
D921FFF8 8381FFFC
60000000 00000000
modified, based on Thomas83lin/Unknown's work

For just the EXP multiplier (again, change the 3F800000)
C20C1550 00000008
48000009 3F800000
C9828198 D981FFF8
6C9A8000 9341FFFC
C961FFF8 ED6B6028
7D8802A6 C18C0000
ED6B0332 FD60581E
D961FFF8 8341FFFC
60000000 00000000

[/spoiler]

Stuff

8004FE84:  38000005   li   r0,5
8004FE88:  7C0903A6   mtctr   r0
....
8004FEC4:  4200FFC8   bdnz+   0x8004fe8c

I did some googling first, I just want to make sure I got this right cuz bdnz sounds ridiculous. Before and after reading it.
So I can think of this as a loop, right? for (ctr=5; ctr>0; ctr--)
mtctr sounds simple, but is bdnz really doing ctr-- and then for lack of better words(I know it's a terrible example):
cmp "ctr", 0
bne 0x8004fe8c

Pretty awesome if so. I think I just learned an awesome combo.

dcx2

Yes, that is a correct interpretation of bdnz+.  Branch Decrement Not Zero.  The decrement to ctr happens first (important).  Then the cmpwi ctr, 0.  Then the b is taken or not taken.  The conditional branch instruction datasheet page indicates that you can also combine a conditional branch like bne with a bdnz, however I've not done this so I am unsure of what the mnemonic would look like.
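Because the decrement and test sit at the bottom of the loop, the shape is closer to a do/while than to a for loop. A small Python sketch of that, with a made-up function name and a max_passes guard that exists only so the model can't spin forever:

```python
def run_bdnz_loop(initial_ctr, max_passes=10):
    """Count how many times the loop body runs for a given mtctr value."""
    ctr = initial_ctr & 0xFFFFFFFF          # mtctr
    passes = 0
    while True:
        passes += 1                         # loop body runs before any test
        ctr = (ctr - 1) & 0xFFFFFFFF        # bdnz: decrement ctr first
        if ctr == 0 or passes >= max_passes:
            break                           # branch not taken once ctr hits zero
    return passes, ctr

five = run_bdnz_loop(5)
one = run_bdnz_loop(1)
zero = run_bdnz_loop(0, max_passes=3)
print(five, one, zero)
```

Note the last case: starting with ctr = 0 does not skip the loop. The decrement wraps to 0xFFFFFFFF and the loop keeps going.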

Stuff

What's the advantage of using ori instead of addi? I've seen this a few times:

lis rX, YYYY
ori rX, rX, ZZZZ
...

And always wondered what ori was for, and I saw it now when I thought about 'li word' being impossible unless I get funky with it. I was thinking something like this:

lis rX, YYYY
addi rX, rX, ZZZZ
...

And this works. But it looks like everyone prefers ori.

Also if I do this
lis r0, 8
will r0=80000000 or 00080000?

goemon_guy

I'm pretty sure that ori and addi can be used to achieve the same thing (loading an address); it's just a matter of preference.

Just like you could do:

lis rX, 0xZZZZ
lhz rY,0xTTTT(rX)
...

To load a value from an address as well.
I'd say it's just a preference.

If you do

lis r0, 8

It's the same as 0x0008, so:
r0=00080000
-Currently hacking the following game(s):
...
Request a code via PM, if you wish.

Stuff

Well I just realized that addi anything > 7FFF becomes subi. That's a problem. I was messing around with controller digits and came across this. 80660000-8000 != 80660000+8000 >.<. Of course, I could add 1 to the lis, but ori is superior here since it's not adding or subtracting. It just fills in the last 16 bits. Guess I'll be using ori now.

Good to know that lis does that. I was hoping for that. Cuz if lis 0x8000 and lis 8 did the same thing...>.>

dcx2

As you have discovered, addi and subi sign extend the immediate.  This makes them very unwieldy to use for loading addresses, because the sign-extended bits end up modifying the half-word loaded with lis.  This is also the reason that the assembler supports @ha - the a takes care of the sign extension by adding 1 if it is needed to compensate for the subi.

Displacement operands (i.e. the d in lwz rD,d(rA)) are also signed.  So if d > 0x7FFF, then you will also need to increment the lis value.

It's also worth pointing out that ori doesn't "fill" the lower half-word.  Rather, it uses the immediate as a bit mask: every 1 bit in the immediate gets set, and nothing gets cleared.  This is not normally a problem, because lis will clear the lower half-word.  But in the event that you try to get clever, it will backfire.  If you were to do something like the following, the second ori would not replace the 0x2344; it would OR with it.

lis r12,0x8012
ori r12,r12,0x2344    # r12 = 80122344
ori r12,r12,0x8044    # r12 = 8012A344
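The difference between the two instructions is easy to play with in Python. This is only a sketch: lis, ori and addi here are small model functions working on the raw 16-bit immediate field, not an assembler, and they assume a 32-bit register.

```python
MASK32 = 0xFFFFFFFF

def lis(imm16):
    """Load the immediate into the upper half-word; the lower half-word becomes 0."""
    return (imm16 << 16) & MASK32

def ori(reg, imm16):
    """OR the immediate into the register; bits can be set but never cleared."""
    return (reg | (imm16 & 0xFFFF)) & MASK32

def addi(reg, imm16):
    """Add the sign-extended 16-bit immediate to the register."""
    imm16 &= 0xFFFF
    simm = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return (reg + simm) & MASK32

print("%08X" % lis(8))                                     # 00080000
print("%08X" % ori(lis(0x8066), 0x8000))                   # 80668000
print("%08X" % addi(lis(0x8066), 0x8000))                  # 80658000, the immediate went negative
print("%08X" % addi(lis(0x8067), 0x8000))                  # 80668000 after adding 1 to the lis
print("%08X" % ori(ori(lis(0x8012), 0x2344), 0x8044))      # 8012A344
```

The third and fourth lines are the 80660000+8000 problem from earlier in the thread and its fix.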

Stuff

@ha? I googled it to understand. But Pyiiasmh doesn't handle it right. Or maybe this is what I did:

lis r5,33@ha
subi r5,r5,30827
| to gecko code
\/
3CA00000 38A58795
| to asm
\/
lis r5,0
subi r5,r5,30827

Is that @ha usable with displacement operands?

ori: well yeah. I just meant it if the lower half-word was 0 since lis shifts to the upper(?) half-word. I don't know why I would want to OR twice. I'm not there yet. But now I'll be messing with it to see what can be done with OR, XOR, and AND. :p

dcx2

er, it appears you misunderstand how @ha is supposed to work.  The code generated is technically correct.  @ha turns a 32-bit value into a 16-bit value that preserves the high half-word (bumped by 1 when the low half-word would be sign extended).  @l is the complementary way to load the low half-word.

Take for example 0x80128344.  Say you wanted to load this into r12 using fancy ASM.

.set some_pointer,0x80128344

lis r12,some_pointer@ha
ori r12,r12,some_pointer@l

When you tried "33@ha", what you got was the upper 16-bits of 0x00000021

Stuff

Oh wow. That's pretty awesome. I can't even explain what I thought it was before anymore.

I ran that through pyiiasmh and got an unwanted result though:
3D808013 618C8344
^would've been correct if it wasn't followed by ori. Still pretty awesome for addi/subi. Ori is still superior when it comes to making a register some word, but that whatever-they're-called was interesting.

dcx2

haha, that was my bad.  I'm so accustomed to using ori that I forgot you should use @h instead of @ha when using ori.  Or I forgot that you were trying to use subi.  Either way, @h doesn't compensate for sign extension.

.set some_pointer,0x80128344

lis r12,some_pointer@h        # r12 = 0x80120000
ori r12,r12,some_pointer@l

lis r11,some_pointer@ha       # r11 = 0x80130000
addi r11,r11,some_pointer@l

EDIT:

I should also note that when the @l part is below 0x8000 (so sign extension doesn't turn it negative), @ha is equivalent to @h

.set some_pointer,0x80122344

lis r12,some_pointer@h        # r12 = 0x80120000
ori r12,r12,some_pointer@l

lis r11,some_pointer@ha       # r11 = 0x80120000
addi r11,r11,some_pointer@l
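For checking these by hand, here is a short Python sketch of what the three operators compute. The function names h, ha and l are just stand-ins for @h, @ha and @l, and the rebuild helpers model the lis/ori and lis/addi pairs on a 32-bit register.

```python
MASK32 = 0xFFFFFFFF

def h(value):
    """@h: the upper 16 bits, untouched."""
    return (value >> 16) & 0xFFFF

def l(value):
    """@l: the lower 16 bits."""
    return value & 0xFFFF

def ha(value):
    """@ha: the upper 16 bits, plus 1 if the lower half has bit 15 set."""
    return ((value + 0x8000) >> 16) & 0xFFFF

def rebuild_with_ori(value):
    """lis rX,value@h then ori rX,rX,value@l."""
    return ((h(value) << 16) | l(value)) & MASK32

def rebuild_with_addi(value):
    """lis rX,value@ha then addi rX,rX,value@l (addi sign extends @l)."""
    low = l(value)
    signed_low = low - 0x10000 if low & 0x8000 else low
    return ((ha(value) << 16) + signed_low) & MASK32

for pointer in (0x80128344, 0x80122344, 33):
    print("%08X  @h=%04X  @ha=%04X  @l=%04X" % (pointer, h(pointer), ha(pointer), l(pointer)))
```

Both pairings rebuild the original value; mixing them (@ha with ori) is what produced the 3D808013 618C8344 result above.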