Re: Untagging by subtraction instead of masking on USE_LSB_TAG
From: Thien-Thi Nguyen
Subject: Re: Untagging by subtraction instead of masking on USE_LSB_TAG
Date: Mon, 28 Jan 2008 04:52:22 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.50 (gnu/linux)
() YAMAMOTO Mitsuharu <address@hidden>
() Mon, 28 Jan 2008 11:07:28 +0900
  _cons_to_long:                  _cons_to_long:
        andi. r0,r3,7                   andi. r0,r3,7
        srawi r0,r3,3                   srawi r0,r3,3
        beq cr0,L592                    beq cr0,L592
        rlwinm r2,r3,0,0,28
  A     lwz r9,4(r2)                    lwz r9,-1(r3)
  B     lwz r3,0(r2)                    lwz r3,-5(r3)
        rlwinm r0,r9,0,29,31            rlwinm r0,r9,0,29,31
        cmpwi cr7,r0,5                  cmpwi cr7,r0,5
        bne cr7,L593                    bne cr7,L593
        rlwinm r2,r9,0,0,28
  C     lwz r9,0(r2)                    lwz r9,-5(r9)
  L593:                           L593:
        rlwinm r2,r3,13,0,15            rlwinm r2,r3,13,0,15
        srawi r0,r9,3                   srawi r0,r9,3
        or r0,r2,r0                     or r0,r2,r0
  L592:                           L592:
        mr r3,r0                        mr r3,r0
        blr                             blr
This would make sense if the latency of load/store does not
depend on its displacement (I'm not sure if that is the case in
general). Comments?
For masking, I see offsets (lwz) of 4,0,0 (lines A,B,C).
For subtraction, -1,-5,-5.
It's very possible that the machine can handle 4,0,0 more
efficiently; those are all even (0, modulo 2) and in two cases
"nothing"! Furthermore, the maximum absolute offset for the
subtraction method is 5, which is larger (faaarther away) than 4.
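One way to check the gut feeling before trusting it: the load unit sees the effective address (base register plus displacement), not the displacement alone. Here is a minimal sketch in C that computes that address for both columns. It assumes a cons tag of 5 (as the "cmpwi cr7,r0,5" test in the listing suggests), 8-byte-aligned cells, and fields at byte offsets 0 and 4 as in lines A and B; the names are hypothetical, not Emacs's own macros.

```c
#include <stdint.h>

/* Assumed values, read off the listing: 3 tag bits, cons tag 5. */
enum { TAG_MASK = 7, CONS_TAG = 5 };

/* Masking column: the base register holds the masked pointer and
   the displacement is the plain field offset (4 or 0). */
uintptr_t ea_masking(uintptr_t tagged, int field_offset)
{
    uintptr_t base = tagged & ~(uintptr_t)TAG_MASK;
    return base + (uintptr_t)(intptr_t)field_offset;
}

/* Subtraction column: the base register still holds the tagged
   pointer and the tag is folded into the displacement (-1 or -5). */
uintptr_t ea_subtraction(uintptr_t tagged, int field_offset)
{
    int displacement = field_offset - CONS_TAG;
    return tagged + (uintptr_t)(intptr_t)displacement;
}
```

Under those assumptions the two functions return the same address for a cons-tagged pointer, so any difference between the columns would have to come from something other than where the load lands.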
Anyway, here is an excerpt from p.532 of "PowerPC 405, Embedded
Processor Core, User's Manual":
| C.2.6 Alignment in Scalar Load and Store Instructions
|
| The PPC405 requires an extra cycle to execute scalar loads and
| stores having unaligned big or little endian data (except for
| lwarx and stwcx., which require word-aligned operands). If the
| target data is not operand aligned, and the sum of the least two
| significant bits of the effective address (EA) and the byte count
| is greater than four, the PPC405 decomposes a load or store scalar
| into two load or store operations. That is, the PPC405 never
| presents the DCU with a request for a transfer that crosses a word
| boundary. For example, a lwz with an EA of 0b11 causes the PPC405
| to decompose the lwz into two load operations. The first load
| operation is for a byte at the starting effective address; the
| second load operation is for three bytes, starting at the next
| word address.
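The decomposition rule in that excerpt can be sketched in C as below. This is only my reading of the quoted paragraph, not code from the manual; the function names are made up for illustration, and the sketch assumes scalar accesses of 1, 2, or 4 bytes (for those sizes an operand-aligned access can never push the sum past four, so the alignment condition needs no separate check).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if, per the quoted rule, a scalar load or store of `nbytes`
   bytes at effective address `ea` would be decomposed into two
   operations: low two bits of the EA plus the byte count exceed 4. */
bool ppc405_splits_access(uint32_t ea, unsigned nbytes)
{
    unsigned low2 = ea & 3u;
    return low2 + nbytes > 4u;
}

/* Bytes covered by the first of the two operations, i.e. up to the
   next word boundary.  Only meaningful when the access is split. */
unsigned ppc405_first_part_bytes(uint32_t ea)
{
    return 4u - (ea & 3u);
}
```

For the manual's own example, a lwz (4 bytes) with an EA ending in 0b11 gives 3 + 4 = 7, so it is split: one byte first, then the remaining three from the next word address. A lwz at a word-aligned EA gives 0 + 4 = 4 and is not split.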
But don't heed my (mostly) ignorant gut feelings! Experience sez:
isolate the variable; build two versions; compare on "typical"
workload; if (dis)advantage is under some "wow!" threshold, write
down your findings in the notebook (for Emacs, comments would be
fine), but prioritize maintainability (i.e., refrain from
implementing).
I am interested in how you define "typical" and "wow!".
Seasons change, pipelines change. Keep in mind that sometimes
optimization now translates to pessimization down the road.
thi