Icebird Acorn Demos

Introduction to StrongARM assembler instructions                   2000/07/05
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                      mrh/icb

Some people asked me about the new StrongARM instructions, so I decided to
write a short article on this. Here we go.

LDRH, STRH - 16-bit load/store...


The StrongARM adds some extensions to the LDR/STR instructions. A very
nice one is the H option. It enables 16-bit memory access and works the same
way as the B option for byte access, e.g. LDRH R3,[R0,R1] loads a 16-bit
halfword (that is where the 'H' comes from) from address R0+R1.

Having 16-bit LDRs is rather nice for demo programming. It keeps look-up
tables small and cache-friendly, and thus speeds up your routines. Just
imagine a division table with 1024 entries: with 32-bit values it takes up
4 KB of cache, with 8-bit values it lacks precision, and with 16-bit values
it needs only 2 KB. So 16 bit is just what you need.
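
To make the size argument concrete, here is a small sketch in C (not ARM
code). The table contents (65535/i) and the names div_table, div_table_init
and div_lookup are made up for this illustration; only the 1024 entries and
the 16-bit width come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ENTRIES 1024u

/* Hypothetical division table: entry i holds 65535/i, entry 0 is unused.
   1024 entries of 16 bits occupy 2048 bytes instead of 4096. */
uint16_t div_table[DIV_ENTRIES];

void div_table_init(void)
{
    div_table[0] = 0;
    for (uint32_t i = 1; i < DIV_ENTRIES; i++)
        div_table[i] = (uint16_t)(65535u / i);
}

/* One zero-extended halfword read per lookup, which is the job a single
   LDRH does on the StrongARM. */
uint32_t div_lookup(uint32_t i)
{
    assert(i < DIV_ENTRIES);
    return div_table[i];
}
```

After div_table_init(), div_lookup(3) gives 21845 and sizeof div_table is
2048.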

But wait. LDRH and STRH do not work with the RiscPC memory bus, because the
RPC bus does not support 16-bit accesses. (If you use STRH on the RPC, only
the low byte of the source register is written to memory, using an 8-bit
access. The destination address is truncated to a multiple of 2, i.e. bit 0
of the address is forced to 0.)
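
The STRH behaviour described in the brackets can be written down as a small
C model. This is only a sketch of that description, not of a correctly
working STRH; the function name strh_on_riscpc and the byte array standing
in for RAM are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of STRH on the RiscPC as described in the text: bit 0 of the
   address is cleared, then only the low byte of the register is stored. */
void strh_on_riscpc(uint8_t *mem, size_t mem_size, uint32_t addr,
                    uint32_t reg)
{
    uint32_t forced = addr & ~UINT32_C(1);   /* bit 0 forced to 0 */
    assert(forced < mem_size);
    mem[forced] = (uint8_t)(reg & 0xFFu);    /* single 8-bit write */
}
```

Storing &1234 to address 3 therefore changes only the byte at address 2,
which becomes &34. The high byte &12 never reaches memory.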

Nevertheless LDRH works ok thanks to the SA data cache. If an LDRH is
executed and the requested memory location is already in the cache,
everything is fine and the SA fetches your 16-bit halfword from the cache.
If it is not in the cache, a whole cache line (that is 8 32-bit words) is
first fetched from memory into the cache, and then your 16-bit halfword is
fetched safely from the cache. The memory fetch is performed with 32-bit
accesses, so everything works out fine.

Oh, before I forget - when using LDRH you have to take care that your 16-Bit
data is at 2-byte aligned addresses. If you try to load with LDRH from an
unaligned address, bit 0 of the address is forced to 0 (as for STRH), thus
your LDRH will load odd data.
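
The alignment rule can be sketched in C as follows. The model assumes
little-endian byte order (as used under RISC OS) and the behaviour stated
above, that bit 0 of the address is simply forced to 0. The name ldrh_model
and the byte array standing in for memory are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of LDRH as described in the text: clear bit 0 of the address,
   read a little-endian halfword and zero-extend it to 32 bits. */
uint32_t ldrh_model(const uint8_t *mem, size_t mem_size, uint32_t addr)
{
    uint32_t forced = addr & ~UINT32_C(1);   /* bit 0 forced to 0 */
    assert(mem_size >= 2 && forced <= mem_size - 2);
    return (uint32_t)mem[forced] | ((uint32_t)mem[forced + 1] << 8);
}
```

With the bytes &11 &22 &33 &44 in memory, addresses 0 and 1 both return
&2211 and addresses 2 and 3 both return &4433, so an odd address never
gives you the halfword that straddles two aligned ones.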

LDRSB, LDRSH - sign extension...


Another fine extension to LDR is the S = "Sign Extension" option. It can be
used for 8-Bit LDRs as well as for 16-Bit. When a negative 8/16-Bit value is
loaded, i.e. its top-bit is set, it is extended to the full 32 bits of the
destination register.

Imagine you have a 16-Bit negative value in memory, let's say -2 that is
&FFFE as 16-Bit signed. If it is loaded with LDRSH the destination register
will contain &FFFFFFFE, i.e. -2 in 32-Bit signed. (A plain LDRH would give
&0000FFFE instead.)

For better understanding another example:

LDRSH R0,[R1]        respectively   LDRSB R0,[R1]

is equivalent to

LDRH R0,[R1]         respectively   LDRB R0,[R1]
MOV  R0,R0,LSL #16                  MOV  R0,R0,LSL #24
MOV  R0,R0,ASR #16                  MOV  R0,R0,ASR #24

However the S option costs an extra cycle at instruction execution. Therefore
it does not always speed up things but saves some bytes of memory, because
you do not have to do the sign-extension "by hand".


A little drawback when using the H or S option is that restrictions apply to
the load/store address. You cannot use shifts in the address calculation and
the immediate offset is only 8 bits wide, so its range is limited to
-255 <= offset <= +255.

Examples:
LDRH  R0,[R1,R2,LSL #2] ; ** invalid instruction because of shift **
LDRSB R0,[R1,#1023]     ; ** invalid instruction because offset > 255 **
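
If you generate code yourself, the offset restriction is easy to test for.
The following C sketch checks an offset and splits it the way the H and S
forms store it: an up/down bit plus the 8-bit magnitude in two 4-bit halves.
The names halfword_offset_ok, halfword_offset_fields and
halfword_offset_encode are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the offset fits the 8-bit magnitude of the H and S forms. */
bool halfword_offset_ok(int32_t offset)
{
    return offset >= -255 && offset <= 255;
}

/* The pieces stored in the instruction word: add/subtract flag and the
   high and low nibble of the offset magnitude. */
typedef struct {
    unsigned up;      /* 1 = add offset, 0 = subtract offset */
    unsigned imm_hi;  /* bits 7..4 of the magnitude */
    unsigned imm_lo;  /* bits 3..0 of the magnitude */
} halfword_offset_fields;

halfword_offset_fields halfword_offset_encode(int32_t offset)
{
    assert(halfword_offset_ok(offset));
    uint32_t mag = (uint32_t)(offset < 0 ? -offset : offset);
    halfword_offset_fields f = { offset >= 0, (mag >> 4) & 0xFu, mag & 0xFu };
    return f;
}
```

An offset of 1023 fails the check. In assembler the usual way out is to add
the large offset (or the shifted index) to the base register with a separate
ADD first and then use the H or S load with a small or zero offset.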

SMUL, SMLA, UMUL, UMLA - 64-Bit multiplications...


There are four instructions for 32x32-Bit multiplication with 64-Bit result.
(ARM's official mnemonics are SMULL, SMLAL, UMULL and UMLAL; the shorter
names are used here.)

SMUL - signed multiplication 
SMLA - signed multiplication with accumulator
UMUL - unsigned multiplication 
UMLA - unsigned multiplication with accumulator

All four instructions take 4 registers and work as follows:

instruction             operation

SMUL Rl,Rh,Rm,Rn        Rh:Rl = Rm * Rn
UMUL Rl,Rh,Rm,Rn      [64 Bit]

SMLA Rl,Rh,Rm,Rn        Rh:Rl = Rm * Rn + Rh:Rl
UMLA Rl,Rh,Rm,Rn      [64 Bit]           [64 Bit]

SMLA and UMLA perform a 64-Bit addition of the 64-Bit result and the 64-Bit
value made of the old contents of Rh:Rl.
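
What the four instructions compute can be modelled in C with 64-bit
integers. This is a sketch of the arithmetic only; the type reg_pair and the
function names are made up for the example, and the accumulating forms take
the old Rh:Rl as an extra argument.

```c
#include <stdint.h>

/* The register pair Rh:Rl. */
typedef struct { uint32_t lo, hi; } reg_pair;

reg_pair pair_from_u64(uint64_t v)
{
    reg_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t pair_to_u64(reg_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* UMUL: Rh:Rl = Rm * Rn, unsigned. */
reg_pair umul_model(uint32_t rm, uint32_t rn)
{
    return pair_from_u64((uint64_t)rm * rn);
}

/* SMUL: Rh:Rl = Rm * Rn, signed. */
reg_pair smul_model(int32_t rm, int32_t rn)
{
    return pair_from_u64((uint64_t)((int64_t)rm * rn));
}

/* UMLA: Rh:Rl = Rm * Rn + Rh:Rl, unsigned, wrapping at 64 bits. */
reg_pair umla_model(reg_pair acc, uint32_t rm, uint32_t rn)
{
    return pair_from_u64(pair_to_u64(acc) + (uint64_t)rm * rn);
}

/* SMLA: Rh:Rl = Rm * Rn + Rh:Rl, signed, wrapping at 64 bits. */
reg_pair smla_model(reg_pair acc, int32_t rm, int32_t rn)
{
    return pair_from_u64(pair_to_u64(acc) + (uint64_t)((int64_t)rm * rn));
}
```

For example umul_model(0xFFFFFFFF, 0xFFFFFFFF) gives Rh = &FFFFFFFE and
Rl = &00000001, while smul_model(-2, 3) gives Rh = &FFFFFFFF and
Rl = &FFFFFFFA, i.e. -6 as a 64-bit signed value.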

On StrongARM you should be careful when using the S option of the multiply
instructions, which compares the multiplication's result with 0 and sets the
flags accordingly. This can cost up to 3 extra cycles (see below for
timings). In almost all cases it is faster not to use the S option and use
an extra CMP instruction instead. By the way, this also applies to normal
32-Bit MULs.

Instruction timings...


instruction   execution cycles

LDR           1+f+e    f=1 if Rd is needed in next instruction
LDRB                   f=0 otherwise
LDRH                   e=1 if LDRSB/LDRSH sign-extension
                       e=0 otherwise

SMUL          1+x+f+s  f=1 if Rh is needed in next instruction or S used
SMLA                   f=0 otherwise
UMUL                   x=1 if ABS(Rn) in range &00000000-&000007FF
UMLA                   x=2 if ABS(Rn) in range &00000800-&007FFFFF
                       x=3 if ABS(Rn) in range &00800000-&7FFFFFFF
                       s=2 if S condition used
                       s=0 otherwise
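
The table translates directly into a small cycle estimator. The C sketch
below only restates the formulas above; the function names are made up, and
the value &80000000 (whose magnitude lies outside the listed ranges) is
assumed to count as x=3.

```c
#include <stdbool.h>
#include <stdint.h>

/* LDR/LDRB/LDRH and the sign-extending forms: 1 + f + e. */
unsigned load_cycles(bool result_used_next, bool sign_extending)
{
    return 1u + (result_used_next ? 1u : 0u) + (sign_extending ? 1u : 0u);
}

/* The x term, chosen by ABS(Rn). */
unsigned mul_x_term(int32_t rn)
{
    /* Magnitude in unsigned arithmetic, so the most negative value is
       handled without overflow. */
    uint32_t mag = rn < 0 ? UINT32_C(0) - (uint32_t)rn : (uint32_t)rn;
    if (mag <= UINT32_C(0x000007FF)) return 1;
    if (mag <= UINT32_C(0x007FFFFF)) return 2;
    return 3;   /* assumption: &80000000 is treated like the top range */
}

/* SMUL/SMLA/UMUL/UMLA: 1 + x + f + s. */
unsigned long_mul_cycles(int32_t rn, bool rh_used_next, bool s_flag)
{
    unsigned f = (rh_used_next || s_flag) ? 1u : 0u;
    unsigned s = s_flag ? 2u : 0u;
    return 1u + mul_x_term(rn) + f + s;
}
```

With Rn = 100 a long multiply comes out at 2 cycles without S and at 5
cycles with S, which is where the "up to 3 extra cycles" come from.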

There are some exceptions which have different cycle timings. If you are
interested in these, see the article "List of StrongARM instruction
execution cycles".