=============================================================================
FPU Tutorial v 1.02 (06/05/2000)               (c) 2000 Eli-Jean Leyssens

This tutorial can be downloaded from http://www.dse.nl/~topix
=============================================================================

This is an extremely short (or is it ;) tutorial on how to use the FPU,
Floating Point Unit. It shows you how to move values to and from the FPU and
how to use the data operations, like divide, square root etc.

First off, I assume you already know what floating point numbers are and
what single, double and extended precision means. If you don't know what they
are then you can probably still learn something from this tutorial and you
could probably incorporate some of the example code into your own programs.
However, I would certainly advise you to look around on the Internet for some
documents describing the general idea and workings of floating point numbers.
I've included some links at the end of this tutorial which could help you.

Secondly, note that to run the examples you'll either need RISC OS 4 or the
ExtBas module which extends the BASIC module to recognize and assemble FP
instructions. The ExtBas module is part of the archive:
ftp://mic2.hensa.ac.uk/local/riscos/programming/extbasdis.zip

---------------------------------------------
Floating Point Unit   on RISC OS machines
---------------------------------------------

On some machines the FPU is present in hardware as a coprocessor, called a
FPA, Floating Point Accelerator, but most RISC OS machines only have the FPE,
Floating Point Emulator. Note that even with a FPA fitted some instructions
may still be emulated in software.

There can be slight variations in accuracy between FPA and FPE
implementations, but generally speaking programs do not need to know whether
a FPA is fitted or not. The main difference between FPA and FPE is speed of
execution.

This means that you can write code that uses FP instructions without having
to worry whether they'll be executed by dedicated hardware or emulated in
software. When no FPA is present an FP instruction will yield an "Undefined
instruction" exception. The Undefined instruction vector is called, which is
claimed by the FP Emulator, which then emulates the "undefined" instruction.
Execution is then continued after the emulated FP instruction, without any
registers being corrupted (except for the ones requested by the FP
instruction of course).

The FPU has 8 (eight) floating point registers, known as F0 to F7 and also
a status and a control register. In this tutorial we'll only look at the
"normal" floating point registers, from here on called FP registers, not at
the status or control registers.

The format in which numbers are stored in FP registers is not specified.
The different FP formats only become visible when transferring a number from
or to memory:

Name                           Size             Exponent      Fraction

Single Precision (S)            4 bytes          8 bits       23 bits
Double Precision (D)            8 bytes         11 bits       52 bits
Double Extended Precision (E)  12 bytes         15 bits       64 bits

Packed Decimal (P)             12 bytes          4 digits     19 digits
Expanded Packed Decimal (EP)   16 bytes          6 digits     24 digits

If you look closely at the table above you'll notice that Packed
formats store the numbers as digits rather than bits. This is done by storing
1 digit per nibble (4 bits).
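
To make "1 digit per nibble" concrete, here is a small sketch in Python
(chosen only because it runs anywhere; it is not FPU code). The helper names
pack_digits and unpack_digits are made up for this illustration, and the
sketch says nothing about where the FPU places the sign and the exponent
digits inside the 12 or 16 bytes.

```python
def pack_digits(digits: str) -> bytes:
    """Pack decimal digits two per byte, one digit per 4-bit nibble."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of decimal digits")
    if len(digits) % 2:
        digits = "0" + digits            # pad to a whole number of bytes
    packed = bytearray()
    for i in range(0, len(digits), 2):
        high = int(digits[i])            # goes into bits 4..7
        low = int(digits[i + 1])         # goes into bits 0..3
        packed.append((high << 4) | low)
    return bytes(packed)


def unpack_digits(data: bytes) -> str:
    """Inverse of pack_digits: read one digit back out of every nibble."""
    return "".join(f"{byte >> 4}{byte & 0x0F}" for byte in data)


packed = pack_digits("1235")
print(packed.hex())            # 1235
print(unpack_digits(packed))   # 1235
```

A nice side effect of this encoding is that a hex dump of the bytes reads
exactly like the decimal digits, which is what makes it convenient for
humans.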

In almost all our examples we'll store numbers in memory at Single
Precision (S) and we won't even look into the Packed format as it's rather
silly ;) although it can of course be useful, especially when communicating
with humans as they better understand digits than bits :)

All basic floating point instructions operate as though the result were
computed to infinite precision and then rounded to the length, and in the
way, specified by the instruction. The rounding is selectable from:

- Round to nearest
- Round to +infinity (P)
- Round to -infinity (M)
- Round to zero (Z)

The default is "round to nearest"; in the event of a tie, this rounds to
"nearest even" as required by the IEEE 754 standard.

>>  NOTE! you should only use FP instructions in User Mode programs! <<

--------------------------
Moving data TO the FPU
--------------------------

Before we can tell the FPU to for instance divide two numbers we'll of
course need a way to tell it what these two numbers are. There are many ways
to load a number into a FP register; I'll only show the three most popular
ones here.

The first method is to load a value into a normal ARM register first and
then load the value of that register into any of the 8 FP registers. The
instruction used for the latter operation is FLT. Here's an example of how
you can load the value 123 into FP register f0:

mov     r0, #123        ; First set up an ARM register with the value
flts    f0, r0          ; Now transfer the value from the ARM
                        ; register into the FP register.

The s in flts means that we want to use single precision. Also note that
instead of r0 we could have used any other general purpose ARM register and
instead of f0 we could have used any of the 8 FP registers. So,

mov     r9, #123
flts    f3, r9

would also have worked, although then of course f3 would have 123 loaded into
it and not f0.

It should be fairly obvious that by using FLT we can only load integers
into the FP registers. I mean, you can't load 123 and a half into r0 and
therefore you can't load 123.5 into f0 either. At least, not by using FLT.

Which brings us to the second most popular instruction for loading values
into FP registers, namely LDF. To use LDF though you'll first need to set up
a floating point value in memory. And I do mean floating point value. So,
just setting up an integer value using equd won't work. Luckily there's an
instruction for defining a floating point value in memory as well, namely
EQUF. So, to load 123 and a half into f0 you could use:

.floataddress
equfs   123.5                   ; Reserve 4 bytes holding 123.5
.code
ldfs    f0, floataddress        ; f0 = 123.5

Easy huh?  Note that once again the s in both equfs and ldfs stands for
single precision. For ldfs this is particularly important as the precision
must match the precision you specified at the equf command.
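
If you're curious what the 4 bytes reserved by equfs 123.5 look like, you
can inspect them off-target. The Python sketch below assumes that single
precision is the usual IEEE 754 format (1 sign bit, 8 exponent bits with a
bias of 127 and 23 fraction bits) stored as one little-endian word; the
function name single_fields is purely illustrative.

```python
import struct


def single_fields(value: float):
    """Split the single precision encoding of value into its bit fields."""
    # Assumption: IEEE 754 single precision, one little-endian 32-bit word.
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    sign = word >> 31                  # 1 bit
    exponent = (word >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = word & 0x7FFFFF         # 23 bits, leading 1 is implied
    return word, sign, exponent, fraction


word, sign, exponent, fraction = single_fields(123.5)
print(f"word=&{word:08X} sign={sign} exponent={exponent} fraction=&{fraction:06X}")
# word=&42F70000 sign=0 exponent=133 fraction=&770000
```

123.5 is 1.9296875 * 2^6, which is why the biased exponent comes out as
127 + 6 = 133.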

The third method shown here for loading values into FP registers also uses
the FLT instruction, but instead of loading the value from an ARM register
the value is encoded in the FP instruction. There are only a small number of
values that can be loaded in this way though. They are: 0, 1, 2, 3, 4, 5, 10
and 0.5

flts    f0, #3          ; Load 3 into f0
flts    f1, #0.5        ; Load 0.5 into f1

These special values can be used in FP data operations as well, as you'll
find out later on.

Now, before we look at how we can perform operations on the FP registers,
let's first look at ways to move data from the FP registers back to the ARM
registers or memory.

----------------------------
Moving data FROM the FPU
----------------------------

For copying data from the FPU we'll look only at the two most popular ways:
transferring a single FP register to a single ARM register or to memory.

To transfer the value from a FP register into a normal ARM register you can
use the FIX instruction. So to transfer the value of f3 into r9:

fix     r9, f3

Note the absence of the precision identifier; fix doesn't take one. Also
note that ARM registers can only contain integers, so the number stored in r9
is the rounded value of f3. You can find out how to specify the rounding mode
further down in this document.

To save the value from f3 into memory:

stfs    f3, floataddress        ; Store f3 at floataddress, 4 bytes

Yes, once again you need to specify the precision. Note that Double and
Extended precision floating point numbers take up more bytes than single
precision ones. So, if you defined floataddress with equfs then you should
not use stfD as that will overwrite more bytes than you reserved with equfs.
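
The effect of such a precision mismatch can be modelled in a few lines of
Python. The 8-byte buffer below stands in for memory: the first 4 bytes play
the role of the word reserved with equfs and the next 4 bytes hold some
unrelated neighbouring word (the value &DEADBEEF is just an assumption for
the demonstration). This only models the byte counts, not the exact byte
order the FPU uses for double precision.

```python
import struct

NEIGHBOUR = 0xDEADBEEF   # assumed contents of the word after the float


def fresh_memory() -> bytearray:
    """4 bytes reserved for a single, followed by an unrelated word."""
    memory = bytearray(8)
    struct.pack_into("<I", memory, 4, NEIGHBOUR)
    return memory


def neighbour_word(memory: bytearray) -> int:
    return struct.unpack_from("<I", memory, 4)[0]


memory = fresh_memory()
struct.pack_into("<f", memory, 0, 123.5)    # like stfs: writes 4 bytes
after_single = neighbour_word(memory)

memory = fresh_memory()
struct.pack_into("<d", memory, 0, 123.5)    # like stfd: writes 8 bytes
after_double = neighbour_word(memory)

print(f"after single store: &{after_single:08X}")   # still &DEADBEEF
print(f"after double store: &{after_double:08X}")   # neighbour overwritten
```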

Right, now that we know how to move data to and from the FPU let's look at
some data operations.

---------------
Square root
---------------

One of the simplest operations is the Square root operation as it only
operates on one value. The instruction for it is SQT and it takes two
parameters. The first parameter indicates the FP register to store the
result in, the other indicates the FP register to take the Square root of.

sqts    f0, f1          ; f0 = sqt( f1)

It's as simple as that. So, the "entire" code to calculate the square root
of an ARM register, by using the FPU for the calculation would be:

; r0 = number
flts    f0, r0          ; f0 = r0
sqts    f0, f0          ; f0 = square root of f0, single precision
fix     r0, f0          ; r0 = f0 = sqt( r0)

The "sqroot" program included in this archive contains a working example.

----------------------
Divide and conquer
----------------------

Another handy operation is the divide operation. The instruction for it is
DVF and it takes three parameters, all indicating FP registers. The
parameters are for Quotient, Number and Divisor.

dvfs    f0, f1, f2      ; f0 = f1 / f2

So, the code to divide two ARM registers, by using the FPU for the
calculation would be:

; r0 = number
; r1 = divisor

flts    f0, r0          ; f0 = r0
flts    f1, r1          ; f1 = r1
dvfs    f0, f0, f1      ; f0 = f0 / f1
fix     r0, f0          ; r0 = f0 = f0 / f1 = r0 / r1

The "divide" program included in this archive contains a working example.
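
Be aware that the result in r0 is not necessarily what an integer divide
would give you, because fix rounds to nearest by default. The Python sketch
below models the last two instructions of the routine; it is an illustration
only, it ignores the single precision rounding of the quotient (harmless for
small values) and the function name fpu_divide is made up. The "Z" variant
corresponds to using fixz instead of fix.

```python
import math


def fpu_divide(number: int, divisor: int, rounding: str = "") -> int:
    """Model of dvfs followed by fix (default) or fixz (rounding="Z")."""
    quotient = number / divisor          # dvfs f0, f0, f1
    if rounding == "Z":
        return math.trunc(quotient)      # fixz: round towards zero
    return round(quotient)               # fix: nearest, ties to even


print(fpu_divide(7, 2))        # 4  (3.5 is a tie, rounds to even)
print(fpu_divide(9, 2))        # 4  (4.5 is a tie, rounds to even)
print(fpu_divide(7, 2, "Z"))   # 3  (same as an integer divide)
```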

------------------
Wave "Bye-Bye"
------------------

For our last example of data operation instructions we'll look at the sine
wave. As the FPU's sine (and cosine) calculations are extremely slow you will
almost certainly only want to use them to build a look up table. So, that's
just what I'm going to show you.

The first thing you need to know about FPU's sine, cosine, tangent etc
functions is that they work with radians, not degrees. So, a full sine period
is 2*PI (radians) and not 360 (degrees).

Right then, let's say we want to build a sine lookup table with 256 values
describing a whole period. We'll set the amplitude at 127. In BASIC you would
probably do it somewhat like this:

Steps% = 256      : REM Number of steps to divide one period in
Amplitude% = 127  : REM Amplitude of the sine wave

DIM SineTable% Steps%*4 : REM 4 bytes per value as we're storing
REM words, not bytes

FOR x% = 0 TO Steps%-1
SineTable%!( x% * 4) = Amplitude% * SIN( x% * ( 2*PI / Steps%))
NEXT

If you have a hard time understanding this BASIC version then I can only
advise you to dust off some old calculus books before you proceed to the FPU
version ;)

The assembly version using FPU isn't much different from the above. I'm not
going to type it in here though, just look at the "sine" example program.
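
One detail is worth checking before you compare a table built by the FPU
with one built by BASIC: how the real value is turned into an integer. As
far as I know BASIC truncates when storing a real into a word, whereas fix
rounds to nearest unless told otherwise. The Python sketch below is a
host-side model of the BASIC loop above (not the contents of the "sine"
example program) that builds the table both ways; the names are
illustrative.

```python
import math

STEPS = 256       # number of steps to divide one period in
AMPLITUDE = 127   # amplitude of the sine wave


def sine_table(to_int):
    """Build the table, converting each value with the given function."""
    step = 2 * math.pi / STEPS
    return [to_int(AMPLITUDE * math.sin(x * step)) for x in range(STEPS)]


truncated = sine_table(math.trunc)   # round towards zero (assumed BASIC behaviour)
nearest = sine_table(round)          # round to nearest, like a plain fix

print(truncated[0], truncated[64], truncated[128], truncated[192])
# 0 127 0 -127

differing = sum(1 for a, b in zip(truncated, nearest) if a != b)
print(differing, "of", STEPS, "entries differ, each by exactly 1")
```

If you want the FPU version to match a truncated table exactly, using fixz
instead of fix should do it.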

-----------------------------
Could you be more precise?
-----------------------------

Note that throughout these examples I've used single precision. This means
that only 23 bits will be used for the Fractional part of the floating point
number. However, due to the way floating point numbers work we effectively
get 24 significant bits. So, if you want to load/store numbers bigger than
&ffffff without losing information from the least significant bits then you
should use Double or Extended precision instead. Simply append a d or e
instead of an s after the floating point instruction. So, instead of flts,
you should use fltd or flte. Take a look at the "fltSfltD" example for
further clarification.
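
The 24-bit limit is easy to demonstrate off-target. The Python sketch below
pushes an integer through an IEEE 754 single or double and back, which is
roughly what flts or fltd followed by fix does. It is a model only (this is
not the "fltSfltD" example) and the function names are made up.

```python
import struct


def through_single(n: int) -> int:
    """Round n to single precision and convert back to an integer."""
    return int(struct.unpack("<f", struct.pack("<f", float(n)))[0])


def through_double(n: int) -> int:
    """Round n to double precision and convert back to an integer."""
    return int(struct.unpack("<d", struct.pack("<d", float(n)))[0])


for n in (0xFFFFFF, 0x1000001, 0x12345679):
    print(f"&{n:08X} single=&{through_single(n):08X} "
          f"double=&{through_double(n):08X}")
# &00FFFFFF single=&00FFFFFF double=&00FFFFFF
# &01000001 single=&01000000 double=&01000001
# &12345679 single=&12345680 double=&12345679
```

Everything up to &ffffff survives single precision untouched; above that the
least significant bits are rounded away, while double precision keeps every
32-bit integer intact.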

------------------------------------
"No, it's rounder" (c) 2000 Nike
------------------------------------

As you have probably read in the part "Floating Point Unit   on RISC OS
machines" there are several rounding modes. By default numbers are rounded to
nearest. Note that this rounding occurs not only when transferring values
from FP registers to ARM registers and when storing FP registers in memory,
but also, more importantly, internally in the FPU.

Assume we're transferring the value of f3 into r9 with fix. Let's see what
the results of the different rounding modes are for 8 different values of f3.

Rounding         -4.5   -3.6   -3.5   -3.4     3.4    3.6    3.5    4.5

(Nearest)          -4     -4     -4     -3       3      4      4      4
P(lus infinity)    -4     -3     -3     -3       4      4      4      5
M(inus infinity)   -5     -4     -4     -4       3      3      3      4
Z(ero)             -4     -3     -3     -3       3      3      3      4

So, Nearest is also nearest to what you're used to in everyday life,
except that a tie, that is x.5, is rounded to the "nearest even". So,
that's why 4.5 is not rounded to 5 (odd), but to 4 (even).

Plus infinity means it's always rounded up to the "higher" value. So, -3.6
is rounded up to -3 as -3 is higher than -4.

Minus infinity means it's always rounded down to the "lower" value. So, 3.6
is rounded down to 3 as that's lower than 4.

Zero is simply discarding the part after the point :)
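
If you want to double check the table above, or try other values, the four
modes map directly onto rounding modes of Python's decimal module. This is
just a host-side model of what fix does with each mode; the dictionary keys
and the function name are chosen for this sketch.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN)

MODES = {
    "Nearest": ROUND_HALF_EVEN,   # default, ties go to the even neighbour
    "P": ROUND_CEILING,           # towards plus infinity
    "M": ROUND_FLOOR,             # towards minus infinity
    "Z": ROUND_DOWN,              # towards zero
}
VALUES = ["-4.5", "-3.6", "-3.5", "-3.4", "3.4", "3.6", "3.5", "4.5"]


def fix(value: str, mode: str = "Nearest") -> int:
    """Round a decimal string to an integer using the given mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=MODES[mode]))


for mode in MODES:
    print(f"{mode:8}", *(f"{fix(value, mode):4}" for value in VALUES))
```

Running it prints the same four rows as the table.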

-----------------------
FP Instruction List
-----------------------

This list is in no way complete! It doesn't include instructions for
handling the status or control registers, nor does it include instructions

-- Register transfer --

Instruction syntax:

FLT{cond}prec{round}        Fn, Rd
FLT{cond}prec{round}        Fn, #Value
FIX{cond}{round}            Rd, Fm

Don't get fooled by the d in FLT... Fn, Rd. The destination register is
always the first one, just like with any other ARM instruction. So, FLT Fn,
Rd stores the ARM register Rd in FP register Fn.

{cond} is the standard ARM instruction condition (eq, ne, gt etc)
prec is the precision ( S, D, E etc)
{round} is the rounding mode ( P, M, Z)

{cond} and {round} are of course optional and default to Always and
Nearest respectively.

Value can be any of 0, 1, 2, 3, 4, 5, 10, 0.5

Instructions:

FLT   Integer to Floating Point     Fn := Rd
FIX   Floating Point to Integer     Rd := Fm

-- Data operations --

Instruction syntax:

unop{cond}prec{round}       Fd, Fm
unop{cond}prec{round}       Fd, #Value
binop{cond}prec{round}      Fd, Fn, Fm
binop{cond}prec{round}      Fd, Fn, #Value

unop, or unary operations, calculate with just one parameter
binop, or binary operations, calculate with two parameters

Value can be any of 0, 1, 2, 3, 4, 5, 10, 0.5

Instructions:

   ADF   Add                       Fd := Fn + Fm
   MUF   Multiply                  Fd := Fn * Fm
   SUF   Subtract                  Fd := Fn - Fm
   RSF   Reverse Subtract          Fd := Fm - Fn
   DVF   Divide                    Fd := Fn / Fm
   RDF   Reverse Divide            Fd := Fm / Fn
   POW   Power                     Fd := Fn to the power of Fm
   RPW   Reverse Power             Fd := Fm to the power of Fn
   RMF   Remainder                 Fd := remainder of Fn / Fm
                                      = Fn - Fm * integer value of ( Fn/Fm)
 * FML   Fast Multiply             Fd := Fn * Fm
 * FDV   Fast Divide               Fd := Fn / Fm
 * FRD   Fast Reverse Divide       Fd := Fm / Fn

   MVF   Move                      Fd := Fm
   MNF   Move Negated              Fd := -Fm
   ABS   Absolute value            Fd := ABS( Fm)
   RND   Round to integral value   Fd := integer value of Fm
   SQT   Square root               Fd := square root of Fm
   LOG   Logarithm to base 10      Fd := log Fm
   LGN   Logarithm to base e       Fd := ln Fm
   EXP   Exponent                  Fd := e to the power of Fm
   SIN   Sine                      Fd := sine of Fm
   COS   Cosine                    Fd := cosine of Fm
   TAN   Tangent                   Fd := tangent of Fm
** ASN   Arc Sine                  Fd := arcsine of Fm
   ACS   Arc Cosine                Fd := arccosine of Fm
   ATN   Arc Tangent               Fd := arctangent of Fm

* FML, FDV and FRD are only defined to work with single precision operands
and are not necessarily faster than MUF, DVF and RDF.

** Use ASN Fd, #1 to easily load Pi/2 into Fd.

Note that for all these unops and binops you can replace Fm by one of the
constants 0, 1, 2, 3, 4, 5, 10 and 0.5. As only the last operand can be a
constant, this is also why there are Reverse versions of some of the
instructions.
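
A tiny model in Python shows the point of the Reverse versions. The
functions dvf and rdf below simply mirror the definitions from the list
above, and as_constant (a name made up for this sketch) checks that a value
is one of the eight encodable constants.

```python
CONSTANTS = (0, 1, 2, 3, 4, 5, 10, 0.5)


def as_constant(value):
    """Return value if it can be encoded in an FP instruction, else complain."""
    if value not in CONSTANTS:
        raise ValueError(f"{value} cannot be encoded as an FP constant")
    return value


def dvf(fn, fm):
    return fn / fm    # Fd := Fn / Fm


def rdf(fn, fm):
    return fm / fn    # Fd := Fm / Fn


f1 = 8.0
print(dvf(f1, as_constant(2)))   # 4.0    like dvfs f0, f1, #2  (f1 / 2)
print(rdf(f1, as_constant(1)))   # 0.125  like rdfs f0, f1, #1  (1 / f1)
```

Without the reverse form you would first have to load the constant 1 into a
spare FP register just to compute 1 / f1.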

The rounding mode specified in the instruction is only applied in the
final stage. Any rounding done during the actual calculations to compute the
value is done with the Nearest rounding mode.

This is especially noticeable for RMF:

Fn := 18
Fm := 5
Fd := Fn - Fm * integer value, rounded to Nearest, of ( Fn / Fm)
   := 18 - 5 * integer value, rounded to Nearest, of ( 18 / 5)
   := 18 - 5 * integer value, rounded to Nearest, of 3.6
   := 18 - 5 * 4 <- !!!
   := 18 - 20
   := -2 !!!

You could correct for this by adding Fm to the remainder when the
remainder is less than zero.
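
Python's math.remainder (available from Python 3.7) uses the same
round-to-nearest definition, so it makes a handy model for checking what to
expect from RMF. The correction below follows the suggestion above and
assumes both operands are positive; the function names are illustrative.

```python
import math


def rmf(fn: float, fm: float) -> float:
    """Remainder with the quotient rounded to Nearest, as in the example."""
    return math.remainder(fn, fm)


def positive_remainder(fn: float, fm: float) -> float:
    """Add Fm when the remainder is negative (assumes fn, fm > 0)."""
    remainder = rmf(fn, fm)
    return remainder + fm if remainder < 0 else remainder


print(rmf(18, 5))                  # -2.0  (18 / 5 = 3.6 rounds to 4)
print(rmf(17, 5))                  # 2.0   (17 / 5 = 3.4 rounds to 3)
print(positive_remainder(18, 5))   # 3.0
```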

--------------
Link me up
--------------

Here are some links to documents you might find useful in respect to using
and coding for the FPU.

As mentioned at the start of this tutorial, you'll need something like the
ExtBas module to assemble FP instructions if you don't have RISC OS 4. This
module is part of the archive:

ftp://mic2.hensa.ac.uk/local/riscos/programming/extbasdis.zip

There is a whole chapter on the Floating Point Emulator in the RISC OS 3
PRMs (Programmer's Reference Manuals). It should probably have been called
Floating Point Unit instead and it's quite a good read:

Programmer's Reference Manual, Volume 4, Pages 4-163 to 4-184

Even more technical documentation can be found on the ARM Ltd site. The
documentation for the ARM7500FE contains three chapters on the FPA. The
documentation for the ARM7500FE has been split up into several files. Either
documentation. Note that the documentation is in PDF format. There are PDF
readers out in the Public Domain though.

http://www.arm.com/Documentation/UserMans/PDF/ARM7500FEvB.html

http://www.arm.com/Documentation/UserMans/PDF/ARM7500FEvB_5.pdf

Last but not least, you can learn quite a bit from looking at other
people's code. Many entries in the CodeCraft competition(s) use FP
instructions and as one of the rules of the competition(s) is that full
sources must be included they might prove to be valuable examples. If you're
lost in the high number of entries then I can only say that at least my entry
called HappyRGB, which can be found in the 1K Entries section of the
CodeCraft#2 competition, has a lot of FP code.

http://surf.to/codecraft

http://www.cybercable.tm.fr/~brooby/code.htm

http://www.dse.nl/~topix -> Click the CodeCraft menu entry

-----------
Credits
-----------

Many thanks to Tony Haines for proof reading this tutorial and making some
excellent suggestions on how to improve it.

Much information was gathered from the Floating Point Emulator chapter of
Acorn's Programmer's Reference manual and ARM Ltd's ARM7500FE documentation.
You can find links to both in the "Link me up" chapter above.

-------------
Copyright
-------------

This tutorial and the accompanying example programs have all been written
by Eli-Jean Leyssens, aka Pervect of Topix. Eli-Jean Leyssens holds the
copyright to this tutorial. The accompanying example programs are to be
considered an integral part, and as such this text may only be copied
/together/ with the example programs. Equally, if you wish to copy the
example programs then you must also include this text.

You are freely permitted to use the example routines in your own programs.
An acknowledgement of any help obtained would be appreciated.

This tutorial, in whole or in part, may not be published in any magazine,
digital or hardcopy, or on any website without the written permission of the