
OK, here is some information I put together over the last
couple of days. I would like some kind of response,
positive or negative, to what I've written. No point in
uploading it to x2ftp or cdrom if you all think it's crap.


------------- begin ftmap.txt



                Fast texture mapping
                --------------------

                        by

                  Mats Byggmastar
                       a.k.a.
                   MRI / Doomsday
                  mri@penti.sit.fi

            19 Jun. 1996  Espoo, Finland

    Read this today, it might be obsolete tomorrow.




About this document
-------------------

This document tries to describe how to do fast texture mapping. It is not
a complete tutorial as it only describes the time critical parts, which
happen to be the most interesting parts. The information here is not meant
to be a beginner's guide to texture mapping. It is aimed at people who may
already have a working texture mapper but are looking for more speed.

The information is based on my own work and findings, but also on information
found on the net and on ideas given to me by other coders, especially the
individuals that hang out on IRC channel #coders. I once spoke to a 'cool'
dude from .au who said: "Man, did you think we come here to talk about
coding?". Well, what can I say... Go screw yourself mister! Many of the
coders there are willing to share ideas and answer decent questions. Just
look for .fi in their email address... only joking, huhu.

I am not claiming that the method described here is THE fastest method of
doing texture mapping. But it is pretty fast. The resulting image will look
a bit more ragged than with ordinary u,v interpolation in the inner loop,
but you can't have both quality and speed in this case, I think.

Ah, yes. This document describes linear texture mapping, not perspective
correct mapping. So if you are looking for some information on how to do
scanline subdivision, look somewhere else. You won't find it here.
"Linear mapping should be enough for anyone..."


Disclaimer
----------

If water didn't exist, no one could learn how to swim, then everybody would
drown.

I take no responsibility for all the people drowning in Finland this summer.

Some parts of the technical discussion at the end of the document might not
be 100% accurate. But from a programmer's point of view the guidelines given
should apply anyway.


Assume the following
--------------------  

 - We are only doing triangles.
 - You agree that a fractional part of 8 bits is enough for interpolating
   u and v on the scanlines of the triangle.
 - Bitmaps always have a width of 256 pixels and a maximum height of 256
   pixels.
 - The CPU is a Pentium and we are in 32 bit protected mode with a flat
   memory model (TASM+WATCOM+PMODE/W).
 - You have made your own Phong shader or texture mapper that is not quite
   as fast as your friend's texture mapper, which he brags does 1.5 ticks
   per pixel and 70k polys per sec. (What a stupid way of expressing
   the speed of a texture mapper, really. Who wants to see a vector sequence
   at 1 frame per second anyway, hehe.)


The slow method
---------------

The slow method of doing texture mapping is to interpolate u and v (and
triangle x) on both the left and right sides of the triangle and then
calculate the delta u and v for each scanline. Then interpolate u and v
when drawing the scanline. To make this even slower you could first
interpolate u and v (and triangle x) for both sides of the triangle and
store the values in an edge buffer, then pick the values from the edge
buffer and draw each scanline.
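For reference, the per-scanline inner loop of the slow method could be
sketched in C roughly like this (hypothetical names; u and v as 16:16
fixed point, bitmap 256 pixels wide):

```c
#include <stdint.h>

/* Inner loop of the "slow" mapper: u and v are interpolated
   per pixel as 16:16 fixed point numbers. */
static void slow_scanline(uint8_t *dest, const uint8_t *bitmap,
                          int32_t u, int32_t v,
                          int32_t du, int32_t dv, int width)
{
    int x;
    for (x = 0; x < width; x++) {
        /* integer parts index a 256-pixel-wide bitmap */
        dest[x] = bitmap[(u >> 16) + (v >> 16) * 256];
        u += du;
        v += dv;
    }
}
```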


The fast method
---------------

Don't use an edge buffer as described in the slow method above. Calculate
the edge deltas when you need them and interpolate along the edges at the
same time as you draw each scanline.

One important thing you should realize is that when texturemapping a
triangle, you are interpolating two variables, u and v, whose deltas are
constant over the whole triangle. I repeat, the deltas are constant for the
whole triangle. Make sure you understand this because this is the key to fast
texture mapping (or any other type of linear shading for that matter).

Because the deltas (delta u and delta v) are constant, we only need to
calculate them once for the whole triangle. Also when interpolating u and v
along the edges of the triangle you only need to interpolate u and v on one
side of the triangle. (Triangle x must be interpolated on both sides.)

The deltas are constant, so the offsets into the bitmap for the pixels of
each scanline will also be constant. I.e. we can precalculate a whole run
and use that in the inner loop. The inner loops for this type of texture
mapper can look very different. The most radical must be to unroll it all
the way and plug the offsets right into the mov instructions, i.e.
selfmodifying code. But do be careful with selfmodifying code: it can slow
down your code instead of speeding it up. Each time you modify the code, it
will be reloaded into the code cache, which seems to take a long time. Also,
unrolling code is a very bad idea on the Pentium chip. Here is some crappy
code that shows the principle of this type of inner loop:

    jmp   ecx                   ; Jump to the right place in the 'loop'
    mov   al, [esi+12345]
    mov   [edi+319], al
    mov   al, [esi+12345]       ; Get pixel 
    mov   [edi+318], al         ; Plot pixel
    ......
    mov   al, [esi+12345]       ; '12345' is the selfmodifying part
    mov   [edi+2], al           ; that will be modified once per triangle
    mov   al, [esi+12345]
    mov   [edi+1], al
    mov   al, [esi+12345]
    mov   [edi+0], al

Note that we are doing it backwards, from right to left. This makes it easier
to set up esi and edi. You might want to know that each instruction in this
loop is 7 bytes, so you will be doing an X*14 when calculating the jump
offset. X*14 is X*16-X*2. You should of course not plug in the offsets for
the whole loop if you only have a small triangle. The length of the longest
scanline is a byproduct of the constant delta calculations.
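The jump offset calculation above could be sketched like this (hypothetical
helper name; assumes the 14 bytes per unrolled pixel stated above):

```c
/* Offset into the unrolled loop: skip 14 bytes for every pixel
   we don't draw.  x*14 computed as x*16 - x*2, shifts only. */
static int jump_offset(int skipped_pixels)
{
    return (skipped_pixels << 4) - (skipped_pixels << 1);
}
```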


So what about the 1.5 ticks per pixel loop?
Well, the following piece of code is usually what people think of. I'm not
really sure that this is actually 1.5 ticks per pixel, as the 'mov [edi+?],ax'
has a prefix byte. This code will need some work to make the instructions
pair on the Pentium (read about the code cache at the end).

    jmp   ecx
    ......
    mov   al, [esi+12345]
    mov   ah, [esi+12345]
    mov   [edi+4], ax
    mov   al, [esi+12345]
    mov   ah, [esi+12345]
    mov   [edi+2], ax
    mov   al, [esi+12345]
    mov   ah, [esi+12345]
    mov   [edi], ax


Now to a cooler method that is not selfmodifying and doesn't need to be
unrolled. The idea is very similar to the above, but in this loop we have
the offsets stored in a lookup table instead. A loop that plots one pixel
per turn could then look like this:

  destptr = screen + edge_x + y * screenwidth;
  srcptr  = bitmap + edge_u + edge_v * 256;
  for( i = 0; i < scanlinewidth; i++ )
  {
     *(destptr++) = srcptr[ *(lookup++) ];
  }


    ; esi = ptr to starting position in bitmap
    ; ebx = 0
    ; edx = ptr to lookup table
    ; edi = ptr to starting point in screen buffer
    ; ebp = width of scanline

  @@inner:
    mov   al, [esi+ebx]         ; Get pixel
    mov   ebx, [edx]            ; Get offset for next pixel
    mov   [edi], al             ; Plot pixel
    add   edx, 4
    inc   edi
    dec   ebp
    jnz   @@inner

The above loop should take something like 4 ticks per pixel on a Pentium if
we don't have any cache misses and such. Not very impressive. This loop can
on the other hand be changed to plot 4 pixels at a time:

  @@inner:
    mov   al, [esi+ebx]         ; Get pixel #1
    mov   ebx, [edx]          
    mov   ah, [esi+ecx]         ; Get pixel #2
    mov   ecx, [edx+4]         
    shl   eax, 16               ; Move pixels 1 and 2 to the high word
    add   edi, 4
    mov   al, [esi+ebx]         ; Get pixel #3
    mov   ebx, [edx+8]
    mov   ah, [esi+ecx]         ; Get pixel #4
    mov   ecx, [edx+12]          
    rol   eax, 16               ; Swap the high and low words
    add   edx, 16
    mov   [edi], eax            ; Plot all 4 pixels
    dec   ebp
    jnz   @@inner

Now this one is 9 ticks per 4 pixels, with the pixels written as a dword.
Very good if we align the write on a dword boundary. Use the other loop for
very short lines or to get this one aligned on a dword boundary, and use
this one for the rest of the scanline.
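The 4-pixels-per-dword idea could be sketched in C like this (a sketch
only, not a transcription of the assembly above: here the table holds an
offset for every pixel including the first, and a little-endian CPU and a
dword-aligned dest are assumed):

```c
#include <stdint.h>
#include <string.h>

/* Fetch four texels via the offset table, pack them into one
   dword and store it with a single 32-bit write. */
static void span4(uint8_t *dest, const uint8_t *src,
                  const int32_t *lookup, int dwords)
{
    int i;
    for (i = 0; i < dwords; i++) {
        uint32_t px = (uint32_t)src[lookup[0]]
                    | (uint32_t)src[lookup[1]] << 8
                    | (uint32_t)src[lookup[2]] << 16
                    | (uint32_t)src[lookup[3]] << 24;
        memcpy(dest, &px, 4);       /* one dword write per 4 pixels */
        dest   += 4;
        lookup += 4;
    }
}
```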


Calculate the lookup table with the following loop (this loop can also be
used to calculate the offsets in the selfmodifying example). u and v carry
an 8 bit fractional part, so the stored offset is built from their integer
parts:

   u = delta_u;
   v = delta_v;
   for( i = 0; i < tablewidth; i++ )
   {
      *(lookup++) = (u >> 8) + (v >> 8) * 256;
      u = u + delta_u;
      v = v + delta_v;
   }


   ; ebx = esi = delta u  (24:8 bit fixed point)
   ; ecx = edi = delta v  (24:8 bit fixed point)
   ; edx = ptr to lookup table
   ; ebp = length of table (the width of the longest scanline)

 @@mklookup:
    mov   eax, ecx          
    add   ecx, edi          ; Interpolate v
    mov   al, bh
    add   ebx, esi          ; Interpolate u
    mov   [edx], eax        ; lookup[edx] = u+256*v
    add   edx, 4
    dec   ebp
    jnz   @@mklookup

Note that the first offset stored in the lookup table is not 0.
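In C, the same table calculation might look like this (hypothetical names;
u and v with an 8 bit fractional part, as in the assembly):

```c
#include <stdint.h>

/* C version of the @@mklookup loop: each entry is
   int(u) + 256 * int(v).  Like the assembly, the first entry
   already includes one delta, so it is not 0. */
static void make_lookup(int32_t *lookup, int32_t delta_u,
                        int32_t delta_v, int width)
{
    int32_t u = delta_u;
    int32_t v = delta_v;
    int i;
    for (i = 0; i < width; i++) {
        lookup[i] = (u >> 8) + (v >> 8) * 256;
        u += delta_u;
        v += delta_v;
    }
}
```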


Equations for the constant deltas
---------------------------------

Sort the vertices in the triangle so that the topmost vertex is known as
x1,y1 and the bottom vertex is known as x3,y3. Like the drawing below.

                               x1,y1
                                p1
                                  /   
                               / /    
                            /   /     
                         /     /      
                      /       /       
                   /         /        
        x2,y2   /           /         
        p2   /_____________/          
             \    width   /          
               \         /             
                 \      /               
                   \   /             
                     \/        
                   x3,y3    
                    p3     

  xn,yn     - x,y screen coordinates at vertex n (integers)
  pn        - Value of variable at vertex n to calculate the constant delta
              for. Note that this variable is assumed to have a 6 bit
              fractional part (26:6 bit fixed point). 
  width     - Width of the longest scanline in the triangle


Sorting 3 vertices is no more than 3 compares. Another thing: don't load
all the x, y, u and v values of the vertices into registers. Use pointers
to the vertex structures instead. Something like this:

    ; EDX -> vertex 1
    ; ESI -> vertex 2
    ; EDI -> vertex 3
    mov     EAX, [EDX+vertex_y]
    cmp     EAX, [ESI+vertex_y]
    jle     short @@sorta
    xchg    EDX, ESI
 @@sorta:
    mov     EAX, [EDX+vertex_y]
    cmp     EAX, [EDI+vertex_y]
    jle     short @@sortb
    xchg    EDX, EDI
 @@sortb:
    mov     EAX, [ESI+vertex_y]
    cmp     EAX, [EDI+vertex_y]
    jle     short @@sortc
    xchg    ESI, EDI
 @@sortc:
    ; EDX -> topmost vertex
    ; ESI -> middle vertex
    ; EDI -> bottom vertex


The following two equations need only be calculated once for all the
constant deltas in the triangle. Skip the triangle if y3 == y1, i.e. if the
triangle has zero height. The width can be either positive or negative
depending on which side the x2,y2 vertex is on. This is useful information
when sorting out which is the left and which is the right side of the
triangle.

              (y2-y1) << 16
       temp = --------------
                  y3-y1

       width = temp * (x3-x1) + ((x1-x2) << 16)

The equation below is used to calculate the delta for a variable that should
be interpolated over the triangle, e.g. texture u. Beware of the denominator
in this equation! Make sure it won't cause a divide overflow in case the
width is less than one pixel. (Remember that width is a 16:16 bit fixed
point number.) Note the shift by 10 in the equation. This is because
p1,p2,p3 have a 6 bit fractional part. The resulting delta p is a 16:16 bit
number.

                  ( temp * (p3-p1) + ((p1-p2) << 16) ) << 10
       delta p = --------------------------------------------
                                 width  

So for a texture mapper where we have 2 variables (u,v) to interpolate over
the triangle, we have a total of 3 divs and 3 muls to calculate the needed
deltas.
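Putting the equations together, a C sketch could look like this
(hypothetical function; 64 bit intermediates are used here to keep the
sketch simple, where the assembly would juggle 32 bit registers; the
caller must make sure y3 != y1, as stated above):

```c
#include <stdint.h>

/* Constant delta over the triangle.  x,y are integer screen
   coordinates; p1,p2,p3 carry a 6 bit fractional part.
   The result is 16:16 fixed point. */
static int32_t triangle_delta(int x1, int y1, int x2, int y2,
                              int x3, int y3,
                              int32_t p1, int32_t p2, int32_t p3)
{
    int64_t temp  = ((int64_t)(y2 - y1) << 16) / (y3 - y1);
    int64_t width = temp * (x3 - x1) + ((int64_t)(x1 - x2) << 16);
    if (width == 0)
        return 0;           /* degenerate triangle: skip it */
    return (int32_t)(((temp * (p3 - p1) + ((int64_t)(p1 - p2) << 16)) << 10)
                     / width);
}
```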



Here is another equation that can be used to calculate the deltas. It was
posted to the newsgroup comp.graphics.algorithms the other day by Mark
Pursey.

  There is a cleaner way, which doesn't rely on finding the widest line:
  A-B-C: a triangle with screen x and y components, as well as t, a
  value which could represent lighting, texture coordinates etc.
  The following equation gives you the increment for t per horizontal pixel:
        
          (At-Ct)*(By-Cy) - (Bt-Ct)*(Ay-Cy)
  dt/dx = ---------------------------------
          (Ax-Cx)*(By-Cy) - (Bx-Cx)*(Ay-Cy)

The denominator is reusable for both u and v. This makes a total of 6 muls
and 2 divs. However, I do not know if this works as I don't know the theory
behind it (it looks like some vector algebra) and I haven't tested it
myself. Feel free to try it out.
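If you want to try it out, the posted equation is easy to drop into a
throwaway C function (plain doubles here, no fixed point):

```c
/* Mark Pursey's formula as posted: dt/dx from three vertices
   (Ax,Ay,At), (Bx,By,Bt), (Cx,Cy,Ct). */
static double dt_dx(double Ax, double Ay, double At,
                    double Bx, double By, double Bt,
                    double Cx, double Cy, double Ct)
{
    double num = (At - Ct) * (By - Cy) - (Bt - Ct) * (Ay - Cy);
    double den = (Ax - Cx) * (By - Cy) - (Bx - Cx) * (Ay - Cy);
    return num / den;
}
```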


Special case code
-----------------

It often pays off to make special case code that takes care of the delta
calculations when the triangle is 1, 2 or 4 pixels high. Then you can skip
the divs and use shifts instead. This applies when you are calculating the
deltas to interpolate along the edges of the triangle.
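The idea in C (a hypothetical sketch; values in 16:16 fixed point, height
assumed to be at least 1):

```c
#include <stdint.h>

/* Edge delta: for edges 1, 2 or 4 pixels high the divide by
   height becomes a shift (or nothing at all). */
static int32_t edge_delta(int32_t top, int32_t bottom, int height)
{
    switch (height) {
    case 1:  return bottom - top;
    case 2:  return (bottom - top) >> 1;
    case 4:  return (bottom - top) >> 2;
    default: return (bottom - top) / height;   /* general case */
    }
}
```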

I once made a histogram of the length of each scanline in the very popular
chrmface.3ds object. This object has about 850 triangles and was scaled up
so it just touched the top and the bottom of a 320x200 pixel screen. The
histogram showed that most scanlines were only 1 or 2 pixels wide. This
proves that the outer loop is just as important as the inner loop, and also
that it might be a good idea to have special case code for those 1 or 2
pixel lines.

width    number of scanlines
  1     *********************
  2     ******************
  3     **********
  4     ******
  5     ***
  6     **
  7     **


Clipping
--------

Clipping is most of the time a real pain in the ass to implement. It will
always mess up a nice looking routine with extra junk. One possibility is to
have two separate functions, one with clipping and one without. Then test
whether the triangle needs clipping before calling either of the functions.

The actual clipping code is not that difficult to implement, really. Say you
need to clip a texture mapped scanline: you first have to get the number of
pixels to skip at the end of the scanline and the number of pixels to skip
at the beginning of the scanline. Then subtract the number of skipped pixels
from the original scanline width. If you skipped some pixels at the start of
the scanline, the new starting u and v must be calculated. This is done by
multiplying the number of skipped pixels by delta u and delta v respectively,
and adding the original starting u and v of course.

The following code is what I'm using to sort out the stuff:

    movsx   EBP, word ptr [left_x+2]    ; Get the integer part from the
    movsx   ECX, word ptr [right_x+2]   ; 16:16 bit numbers.
    mov     EDX, EBP
    sub     EDX, ECX
    ; EDX = width of scanline
    ; ECX = x1
    ; EBP = x2
    mov     EBX, EDX
    sub     EBP, [_RightClip]
    jle     short @@rightok
    sub     EDX, EBP                    ; skip pixels at end
 @@rightok:
    xor     EAX, EAX
    cmp     ECX, [_LeftClip]
    jge     short @@leftok
    mov     EAX, [_LeftClip]
    sub     EAX, ECX
    mov     ECX, [_LeftClip]
 @@leftok:
    sub     EDX, EAX                    ; skip pixels at start
    jle     @@notvisible
    ; EAX = pixels skipped at start
    ; ECX = clipped x1
    ; EDX = clipped width of scanline

So now you just have to multiply EAX by delta u and add the original u to get
the clipped u. The same applies for v.
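The whole bookkeeping can be sketched in C like this (a generic sketch with
hypothetical names, not a transcription of the assembly above; u and v are
16:16 fixed point and the right clip edge is exclusive):

```c
#include <stdint.h>

typedef struct {
    int     x, width;       /* clipped start and width         */
    int32_t u, v;           /* clipped starting texture coords */
} clipped_span;

/* Clip one scanline against [left, right).  Returns 0 if the
   span is not visible at all. */
static int clip_span(clipped_span *out, int x, int width,
                     int32_t u, int32_t v,
                     int32_t du, int32_t dv,
                     int left, int right)
{
    int overrun = x + width - right;    /* pixels past right edge  */
    if (overrun > 0)
        width -= overrun;
    if (x < left) {                     /* pixels skipped at start */
        int skipped = left - x;
        width -= skipped;
        u += skipped * du;              /* advance u,v past the    */
        v += skipped * dv;              /* skipped pixels          */
        x = left;
    }
    if (width <= 0)
        return 0;                       /* not visible */
    out->x = x; out->width = width; out->u = u; out->v = v;
    return 1;
}
```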


The data cache
--------------

Although it is fun optimizing inner loops, there are other important factors
that one should look at. With today's processors the cache aspects are very
important, maybe more important than the speed of the inner loop. I don't
know how long this will stay true though, as the caches are getting smarter
and bigger. How about integrating a 1MB cache on the same chip as the CPU,
hehe?

I am no expert on how the cache works but I think that this is the general 
idea:

When the CPU has decoded an instruction that needs to get a variable from
memory, it first checks the cache to see if the variable is already there.
If it is, the CPU reads the variable from the cache. This is called a cache
hit. If the variable is not in the cache, the CPU first has to wait for the
data to be read from RAM (or the secondary cache) into the cache, and only
after that can it get the variable from the cache. The cache always loads a
full cacheline at a time, so this will take a few clock ticks. A cacheline
is 16 bytes on a 486 and 32 bytes on a Pentium. The advantage of this is
that when reading byte after byte from memory, the data will most of the
time already be loaded into the cache, because we accessed the same
cacheline just before.

So in the case of a texture mapper where we might be reading pixels in a
vertical line in the bitmap, the inner loop will be accessing pixels that
are >256 bytes apart. The CPU will then be busy filling cachelines for each
pixel. A 64k bitmap won't fit inside an 8k cache, you know. So what can we
do? Well, we can buy a bigger cache, or we might consider storing our
bitmaps some other way.

I got an interesting tip from Otto Chrons the other day about this. He said
that one should store the bitmap as a set of tiles, say 8 x 8 pixels,
instead of the usual 256 x 256 pixel layout. This means that a small part of
the bitmap (an 8 x 4 pixel area) fits in a single 32 byte cacheline. This
way new cachelines don't need to be loaded as often when reading pixels in a
vertical line in the bitmap.
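The addressing for such a tiled bitmap could look like this (a hypothetical
sketch for 8 x 8 tiles in a 256 x 256 bitmap: 32 tiles per row, 64 bytes
per tile):

```c
#include <stdint.h>

/* Byte offset of pixel (u,v) in a 256x256 bitmap stored as 8x8
   tiles.  Vertical neighbours inside one tile end up 8 bytes
   apart instead of 256. */
static uint32_t tiled_offset(uint32_t u, uint32_t v)
{
    uint32_t tile = (u >> 3) + (v >> 3) * 32;   /* which tile  */
    uint32_t in   = (u & 7)  + (v & 7)  * 8;    /* inside tile */
    return tile * 64 + in;
}
```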

The inner loop of this mapper will be pretty slow (or maybe not, actually)
but the gained speed from the better cache usage could possibly make it
faster than any other type of texture mapper.


The code cache
--------------

The cool thing about the Pentium is that it can execute two instructions in
parallel. This is called instruction pairing. But there are a lot of rules
that must be fulfilled for the pairing to take place. One rule is that both
instructions must already be in the code cache. This means that the first
time through an inner loop, no instructions will pair. There is one
exception to this rule: if the first instruction is a 1 byte instruction,
e.g. inc eax, then they will pair the first time.

So to have a completely unrolled loop that is selfmodifying is a pretty bad
idea on the Pentium. On the other hand, we are not modifying the loop for
each scanline so chances are that parts of it will be in the code cache from
drawing the previous scanline.


Some pairing rules
------------------

First of all, both instructions must be simple instructions, i.e. mov, inc,
dec, add, sub, and, or, xor, cmp, test, jcc, ...(?)  Instructions like neg,
xchg, idiv and imul will never pair.

Another rule is that the second instruction can't depend on the first.
Here are some examples:

    add   ecx, eax      ; store result in ecx
    add   edx, ecx      ; get result from ecx. No pairing!

    mov   ecx, eax
    mov   edx, ecx      ; No pairing!

    mov   al, bh        ; al and ah are in the same register
    mov   ah, ch        ; No pairing!


    mov   ecx, eax      ; 
    mov   edx, eax      ; Pairs ok.

    mov   ecx, eax
    add   eax, edx      ; Pairs ok.

There are two exceptions to this rule, namely the flag register and the
stack pointer. Intel has been kind enough to optimize these for us.

    dec   ecx           ; modifies the flag register
    jnz   @@inner       ; Pairs ok.

    push  eax           ; both instructions are accessing esp
    push  ebx           ; Pairs ok.


Take for example the loop we used to calculate the lookup table: all
instructions are simple and none depends on the previous one. The 8
instructions should execute in 4 clock ticks.

 @@mklookup:
    mov   eax, ecx          
    add   ecx, edi      ; Pairs ok.
    mov   al, bh
    add   ebx, esi      ; Pairs ok.
    mov   [edx], eax
    add   edx, 4        ; Pairs ok.
    dec   ebp
    jnz   @@mklookup    ; Pairs ok.


Another thing that one should watch out for is the effective address
calculations, i.e. what the CPU must do when you are using indexes to
access data in memory. You must not write to the index register just before
you access the data. Actually, there must be a few instructions between the
write and the access.

    add   esi, ebx      ; Move the array pointer.
    inc   edi
    mov   eax, [esi+8]  ; Ouch! You just modified esi.


    add   esi, ebx      ; Move the array pointer.
    add   ebx, ecx      ; Do something useful here
    inc   edi           ; while waiting for the
    add   ebp, 4        ; address to be resolved.
    mov   eax, [esi+8]  ; Now it's OK to access the data. Pairs ok.


If you don't have any useful instructions to fill the gap with, you could
try to swap the two instructions so that you access the data first and then
modify the index register.

    mov   eax, [esi+8]
    add   esi, ebx      ; Pairs ok.


There are a lot more rules one must follow, so I suggest you buy a good book
on the subject if you are really interested in this.


Branch prediction
-----------------

Just a few closing words about the branch prediction logic on the Pentium.
The Pentium has some sort of lookup table in which it stores the last 100
(?) jumps it has taken. In other words, for the most recent conditional
jumps it remembers whether it took the jump or not.

So when getting close to a jcc (conditional jump) instruction, it will at an
early stage of the instruction pipeline check the lookup table to see if it
jumped the last time or not. If it took the jump the last time, it will
assume that it should take the jump this time too. The CPU will then start
filling the instruction pipeline with instructions from where the jump
instruction is pointing. When the actual jump instruction arrives at the
execution stage, the pipeline is already filled with the right instructions,
so there will be no delay.

On the other hand, if the condition has changed this time and we do not take
the jump, the pipeline will be filled with completely wrong instructions.
Ouch! Now the pipeline must be refilled with the right instructions, which
in this case are those following right after the jcc instruction. This will
take a few clock ticks.

So you should avoid inner loops that jump differently at close intervals,
like the following:

    jmp   @@inner
  @@extra:
    ....                ; Do something extra when we get carry overflow
    dec   ebp
    jz    @@done
  @@inner:
    ....                ; Do something useful here
    add   eax, ebx
    jc    @@extra       ; Jump on carry overflow
    dec   ebp
    jnz   @@inner
  @@done:

In this loop it is the 'jc @@extra' instruction that will mess up the branch
prediction. Sometimes the jump will be taken and sometimes not. The typical
way of doing masking with compares and jumps has this problem too.
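One way around an unpredictable jump is to compute the result without
branching at all. A minimal C sketch of branch-free masking (hypothetical
helper; builds an all-ones or all-zeros mask from the condition and blends
with it):

```c
#include <stdint.h>

/* Branch-free select: returns if_true when a >= b, else
   if_false, with no conditional jump for the predictor to miss. */
static int32_t select_ge(int32_t a, int32_t b,
                         int32_t if_true, int32_t if_false)
{
    int32_t mask = -(int32_t)(a >= b);        /* 0 or 0xFFFFFFFF */
    return (if_true & mask) | (if_false & ~mask);
}
```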


Credits and greetings
---------------------

Nix / Logic Design
Otto Chrons  a.k.a. OCoke
jmagic     (Can I send greetings to jmagic? Well I'm doin it :)
MacFeenix
BX
thefear
FatPhil
LoZEr
Addict, Damac, Dice, Doom, Swallow / Doomsday

<EOF>




--
 --=(__Mats_Byggmastar____a.k.a._MRI__3D-coder__)=--
 --=(__email_mri@penti.sit.fi_(mri@surfnet.fi)__)=--
 --=(__www______________http://www.sit.fi/~mri__)=--
