                  PCI EIDE CONTROLLER FLAWS
                              
                              
Revision 15: 1995 August 31


SUMMARY OF RECENT CHANGES

1)   EIDEtest 1.5 and CDTest 1.0 released.
  
2)   Yet another suspect EIDE controller chip: the SMC
     37650.
  
3)   Intel contradicts itself on the performance hit from
     disabling prefetch to bypass the flaw.
  
4)   Software from IBM and Intel to detect both faulty chips
     directly.
  
5)   The precise mechanism of failure for both the RZ-1000
     and CMD 640B is now understood. The RZ-1000 and CMD 640B
     both have the prefetch flaw. The CMD 640B has two additional
     flaws.
  
6)   Explanation of what "Intel Inside" means.
  
7)   Dell offers upgrade BIOS to turn off the prefetch
     buffers.

8)   RZ-1000 flaw bypass for APAR PJ19409 for Warp now
     available.

9)   List of safe and unsafe operating system software.

10)  IBM hardware is clean.

11)  Stonewall rebuilds. Intel recants on offer to replace
     defective motherboard.

12)  Problem is showing up under Windows For WorkGroups in
     32 bit mode.

13)  Cleaning up past damage is very difficult.

14)  Assigning blame.

15)  The Triton chipset is immune. These chips are marked
     with an FX suffix.

16)  Windows-95, NT are immune.

17)  DOS and Windows 3.1 are immune if you have an Intel
     BIOS.

INTRODUCTION

There are serious flaws affecting about 1/3 of all PCI
motherboards. The flaws affect any motherboard or EIDE
controller paddleboard containing the PC-Tech RZ-1000 PCI
EIDE controller chip or the CMD PCIO 640B PCI EIDE
controller chip. There are preliminary reports of yet a
third flawed chip -- the SMC 37650.

The flaws affect motherboards from ASUSTeK, AT&T, Dell,
Gateway, Zeos and Intel. Since Intel makes so many of the
motherboards sold under other brand names, the flaws affect
many machines, both 486 and Pentium PCI.

The flaw shows up most frequently when you run a true
multitasking operating system such as OS/2 Warp. It also
shows up under Windows For WorkGroups in 32 bit mode during
tape or floppy backup and restore. In theory the flaw could
do damage under DOS, DESQview, Windows and Windows For
WorkGroups in 16 bit mode, but so far there have been no
damage reports. Recent versions of Microsoft NT and Windows-
95 contain code to bypass the flaw.


WHAT ARE THE SYMPTOMS?

When you are using an IDE or EIDE hard disk attached to the
EIDE motherboard port, the flaw subtly corrupts your files
by randomly changing bytes every once in a while. The flaw
introduces bugs into EXE files, subtle errors into your
spreadsheets, stray characters into your word processing
documents, changes to the deductions in last year's tax
return files, and random changes to engineering design
files.

This corruption happens when you are simultaneously using
your EIDE or IDE hard disk and some other device, most
commonly the floppy drive or mag tape backup.

The same sorts of problem may occur on reading a CD-ROM
drive attached to an EIDE port.


IS IT SERIOUS?

These flaws are nasty. They are causing hundreds of times
more havoc than the infamous Pentium divide flaw ever did.
"I am Pentium of Borg. You will be approximated."

Not only does this corruption occur, but it occurs quietly,
often going unnoticed.

If the system crashes, you usually put the blame on the
operating system software, or the application. It might
actually be a faulty RZ-1000 or CMD 640B EIDE controller
chip nailing you.

When a directory becomes corrupted, you may not notice it
until the damage is irreparable. If a spreadsheet
application reads a comma-delimited ASCII file, it may
simply miss a few bytes in a number, an error that may go
unnoticed, and that error could cascade through the rest of
the spreadsheet.

If you have had unexplained crashes in OS/2, you have
probably experienced the problem, and should make a thorough
check for hidden corruption. Remember that the bug may only
slightly alter your data, and the corruption may not be
obvious.

Keep in mind that not every problem is the RZ-1000's or the
CMD 640B's fault. Overheating, unrelated hardware faults and
design flaws, or software bugs can cause similar symptoms.
DMA channel conflicts also cause similar symptoms. Happily,
EIDEtest and CDTest can unmask all manner of simultaneous
I/O faults.

Unfortunately, correcting the problem just stops further
file corruption. It will not help to clean up the existing
damage to your files. Right now, the focus is on bypassing
the flaw. Preventing further corruption is child's play
compared with the nightmare of trying to track down all the
existing random errors in files. Backups even from day one
may be corrupted. If you have the flaw, you will probably
never be able to completely eliminate the effects of past
corruption.


HOW DO YOU TELL IF YOU HAVE THE FLAW?

There are four categories of motherboard:

1) Definitely safe. Motherboards may still have the flaw,
  but all software in use bypasses it.
  
2) Probably safe. In theory there could be problems, but
  no one has reported any so far.
  
3) Possibly dangerous. You will have to run EIDEtest,
  CDTest, or IOtest to find out.

4) Probably dangerous. You will still have to run the
  tests to find out for sure.
  

Definitely Safe

Definitely safe includes older machines with ISA, EISA, VESA
VL or MCA buses. The flaw only affects machines with the new
PCI bus. PCI machines that use the new Triton chipset from
Intel do not have the flaw.

PCI machines with Intel BIOSes that run only DOS, DESQview,
Windows 3.1, Windows-95 or NT 3.5 are safe. If you have a
non-Intel BIOS and run only DOS, DESQview, Windows 3.1,
Windows-95 or NT 3.5 and never use the "fast mode"
simultaneous disk I/O feature on floppy or tape
backup/restore, you are safe.

You still might want to test your machine. The tests will
also unmask similar problems that have other causes.


Probably Safe

If you have a non-Intel BIOS and run only DOS, DESQview,
Windows 3.1, or Windows For WorkGroups in 16 bit disk access
mode, you probably will not see the problem, even though you
may have one of the faulty chips.


Possibly Dangerous

Most auxiliary chipsets (e.g., OPTI Viper, SMC, Mercury and
Neptune) used on PCI motherboards do not include a built in
EIDE controller.  Such motherboards use a separate EIDE
controller chip -- often the flawed RZ-1000 or CMD 640B. If
you use a separate EIDE paddleboard, it will likely use
one of the flawed chips. In theory, the flaw could affect
DOS, Windows, and Windows For WorkGroups with 16 bit disk
access during floppy/tape backup and restore, though no one
has reported problems yet. Windows For WorkGroups with 32
bit disk access is dangerous if you have the flaw.


Probably Dangerous

PCI Motherboards (both 486 and Pentium) with the older
Mercury and Neptune chipsets are likely to have the flaw.
The Mercury chipset was popular in P60 and P66 systems, and
the Neptune in P75, P90 and P100 systems. Mercury chipsets
are labelled with an MX suffix and Neptune with NX. If you
are using NT 3.1, OS/2 Warp or Linux, you are likely to have
already experienced extensive file corruption if the flaw is
present.


TESTING FOR THE FLAW

Scot Llewelyn, one of the eight authors of
PowerQuest's PartitionMagic, discovered the RZ-1000 flaw and
made it public. Prior to that, only employees of PC-Tech,
Intel and Microsoft were aware of how to bypass the flaw. In
the process of tracking the RZ-1000 problem down, Internet
comp.os.os2.bugs participants discovered a second flawed
chip, the flawed CMD 640B, and are now suspicious about the
SMC 37650.

Scot did most of the initial work documenting the RZ-1000
flaw. He wrote a program called IOtest that can detect the
flaw if:

1)   You are using OS/2 Warp.
  
2)   You are willing to go through the hassle of creating a
separate small partition to run the test. You can use his
program, PartitionMagic, to make room to create one.
  
3)   You have an EIDE hard disk attached to your EIDE port.
It cannot detect the problem if you only have an EIDE CD-ROM,
or if the EIDE port is currently unused.

You can find his test program on the Internet at:

     http://www.powerquest.com/

Scot originally called his test program DMAtest because he
erroneously thought simultaneous DMA was the culprit.  Do
not confuse PowerQuest DMAtest with Gazelle's DMAtest which
only tests if the floppy drive will work happily
simultaneously with the hard disk.

The world needed an easier-to-use test that would run under
DESQview, Windows, Windows For WorkGroups, Windows-95, NT
and OS/2. So I wrote EIDEtest to test for the flaw without
requiring you to create a special partition or buy OS/2
Warp. I also wrote CDTest to test for the flaw when you have
an EIDE CD-ROM drive.

I posted them on the Internet at:

     ftp://ftp.cdrom.com/.4/os2/incoming/EIDEte15.zip

The file also contains a 16-page unabridged copy of this
article.

You can also get both programs from me by snail mail.

If these tests fail, it proves you have a serious problem,
but not necessarily that you have the RZ-1000 or CMD 640B
chip.

If the tests pass, you still may have a problem since,
especially under DOS, DESQview and Windows, the flaw may
only show its head very rarely. If you run the tests under
NT or Windows-95 they will always pass, even if you have the
defective chip, because the operating system already
bypasses the flaw. If you suspect trouble, run the tests
several times.


VISUAL INSPECTION

You can also have a look at your motherboard. Between the
PCI slots, at the edge of the motherboard, look for a
rectangular chip about 1 by 2 cm (0.5" x 0.75") that says RZ-
1000 near the top of the chip. There are variations on the
chip name e.g., RZ-1000BP. Unfortunately, the markings are
not always present, especially in ASUSTeK motherboards which
may have the CMD PCIO 640B chip.  The other suspect chip is
the SMC 37650.


DIRECT TESTS

The OS/2 Warp Bonus Pack Sysinfo 3.02 utility will report on
your EIDE controller. The signature for the RZ-1000 looks
like this:

manufacturer: PC TECHNOLOGY INC
class code : 0001
Vendor ID: 1042
Device ID: 1000
Revision ID: 0001

For the CMD 640B it will look like this:

manufacturer: CMD
class code : ???
Vendor ID: ???
Device ID: ???
Revision ID: ???
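
If you have many machines to check, you can match the
signature mechanically. What follows is only a minimal
sketch in Python, assuming the report is plain "name: value"
text laid out like the RZ-1000 sample above. The function
names and the lookup table are my own inventions for
illustration, not part of Sysinfo. The table holds only the
RZ-1000 IDs, since the CMD 640B IDs are still unknown here.

```python
# Known signature taken from the RZ-1000 report shown above.
# The CMD 640B IDs are not given in this article, so none are listed.
KNOWN_FLAWED = {("1042", "1000"): "PC-Tech RZ-1000"}


def parse_report(text):
    """Turn 'name: value' lines into a dict keyed by lowercased field name."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip().lower()] = value.strip()
    return fields


def identify_chip(text):
    """Return the flawed chip's name, or None if the IDs are not in the table."""
    fields = parse_report(text)
    key = (fields.get("vendor id"), fields.get("device id"))
    return KNOWN_FLAWED.get(key)


sample = """manufacturer: PC TECHNOLOGY INC
class code : 0001
Vendor ID: 1042
Device ID: 1000
Revision ID: 0001"""

print(identify_chip(sample))
```

A result of None only means the IDs are not in the table. It
does not prove the controller is sound.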

The Warp disk driver IBM1S506.ADD with the /V switch will
tell you if you have the RZ-1000 chip.

Intel posted a test that looks directly for either of the
two faulty chips:

http://www.intel.com/procs/support/ctrltest/ctrltest.exe

The Windows-95 Control panel will also report on the EIDE
controller chip.


WHERE HAS THE FLAW BEEN FOUND?

Via email, on BIX, and on the Internet in
comp.os.os2.bugs, people have reported finding flaws in the
following specific motherboards.

Motherboard          Chip     Reporters

ASUSTeK PCI/I P54SP4 CMD      Maurice Schekkerman
                     640B     (schekker@prl.philips.nl)

Dell Dimension XPS   RZ-1000  Scot Llewelyn
P100                          (scotl@itsnet.com)

Dell Dimension XPS   RZ-1000  Steve Ertman
P75                           (sertman@ocean.fsu.edu)

Dell Dimension XPS   RZ-1000  Larry Lai (lai@iastate.edu)
P90                           Mike Heath
                              (heath@rohan.sdsu.edu)
                              Moira Watson
                              (watson6@uwindsor.ca)
                              Pete (pag@interramp.com)
                              Wijadi Jodi
                              (r2nw@dax.cc.uakron.edu)

Dell Escom P60I      CMD      Tim Schofield,
                     640B     schofieldt@logica.com

Dell Optiplex 575    CMD      David W. Mittlefehldt
                     640B     (duck@snmail.jsc.nasa.gov)

Dell XPS-133c        neither  Blake Scholl
                              (bscholl@one.net)

EliteGroup UM8810P-  CMD      Bodo Huckestein
AIO                  640B     (bh@thp.Uni-Koeln.DE)

Escom P5/60          RZ-1000  Rogier van Wanroij
                              (wanroij@cs.utwente.nl)

Escom P90            RZ-1000  Karl Knoflach
                              (151579kk@student.eur.nl)
                              (Xav@mantra01.demon.co.uk)

Gateway 2000 P5-60,  RZ-1000  Angus Black
Intel Mercury Rev 3           (angus@spanner.hiway.co.uk)
                              Gary Farr
                              (garyfarr@ix.netcom.com)
                              Daron Davis
                              (daron_davis@dca.com)
                              Jerry Lynch
                              (lynch.94@osu.edu)
                              Keith Patterson
                              (dinosaur@buffnet.net)
                              Roy L. Smith
                              (smittyry@ix.netcom.com)

Gateway 2000 P5-66   RZ-1000  Randy Nerwick
                              (nerwick@netcom.com)

Gateway 2000 P5-90   RZ-1000  Roy L. Smith
                              (smittyry@ix.netcom.com)

Intel Hendrix        CMD      Clif Purkiser Intel Corp
                     640B     (support@cs.intel.com)

Intel Insight P5-60  ?        Jim Arnone
                              (arnone@primenet.com)

Intel Plato 90       RZ-1000  Clif Purkiser Intel Corp
                              (support@cs.intel.com)
                              Adrian Teo
                              (adriant@singnet.com.sg)
                              Alain Rassel
                              (Alain.Rassel@restena.lu)
                              Chris Norman
                              (cnorman@oboe.aix.calpoly.edu)
                              Kevin Chua
                              (chua@server.uwindsor.ca)
                              Kevin T. Van Maren
                              (vanmaren@cs.utah.edu)
                              Kim Hvarre
                              (kims@crash.ping.dk)
                              Rick Nelson
                              (rnelson2@ccmail.unl.edu)

Intel Premiere       RZ-1000  Clif Purkiser Intel Corp
                              (support@cs.intel.com)

Intel Premiere LPX   CMD      Clif Purkiser Intel Corp
                     640B     (support@cs.intel.com)

Intel Premiere MM    CMD      Clif Purkiser Intel Corp
                     640B     (support@cs.intel.com)

Intel Robin LC       CMD      Clif Purkiser Intel Corp
                     640B     (support@cs.intel.com)

Knowledgebase P90    RZ-1000  Andy Longton
laptop                        (alongton@clark.net)

Midwest Micro P90    RZ-1000  (412d25$e8j@clarknet.clark.net)

PCI-EIDE local       CMD      (whelk@ios.com)
clone, Phoenix BIOS  640B
4.04, ALI chipset

Quantex P5/90 PM-2   RZ-1000  Jay Schamus
                              (jaylord@rcinet.com)

Unknown 486 DX       SMC      Eric Stephen Mountain
                     37650    (esm1@oak70.doc.ic.ac.uk)

Unknown 90 MHz       ?        Andreas
                              (abenamou@galaxy.csc.calpoly.edu)
                              Carol Lim (law30185@nus.sg)

Vobis                RZ-1000  Thomas Wagner
                              (twagner@bix.com)

ZEOS Pantera         RZ-1000  Paul Whitelock
                              (paulw9DDFL3r.DDI@netcom.com)


WHAT CAN YOU DO IF YOU HAVE THE FLAW?

1)   Pester the manufacturer. Unfortunately, the EIDE
  controller chips are soldered in. The only way to repair the
  flaw is to replace the whole motherboard, recycling the
  socketed chips -- the CPU, DRAM and SRAM cache. It would be
  very expensive for computer and motherboard manufacturers to
  fix the flaw. After a month of stonewalling, Dell has
  announced it will offer a BIOS upgrade to turn off the
  prefetch buffers. You can contact Dell at
  support@us.dell.com or (800) 624-9896. Intel is now
  acknowledging the problem. For a short while, Intel offered
  to replace defective motherboards, then they reneged. You
  can contact them at support@cs.intel.com or call their tech
  support line (800) 628-8686. Select options 1-3-1. You can
  find international contact numbers at:
  http://www.intel.com/intel/intelis/contact.html.
  
2)   Buy a new unpopulated Triton PCI motherboard and
  recycle the CPU, DRAM and SRAM cache chips from the old
  motherboard.
  
3)   Run the controller in degraded mode. Some BIOSes have a
  feature to turn off the EIDE prefetch buffer. Vendors may
  offer a BIOS upgrade to allow prefetch to be configured or
  to turn it off automatically if either of the defective
  chips is present.
  
4)   Buy a PCI EIDE paddleboard controller such as the
  Promise 2300+ to replace the one on the motherboard. You
  must disable the controller on the motherboard. This fix
  will waste one of your precious slots. Be careful. You could
  be leaping out of the RZ-1000 frying pan into the CMD 640B
  fire since paddleboards often use the CMD 640B.
  
5)   Buy a SCSI hard disk and CD-ROM, and avoid using the
  EIDE ports entirely. Under OS/2 and Linux, SCSI gives better
  performance, but costs more. DOS, Windows, Windows For
  WorkGroups and Windows-95 are unable to exploit the advanced
  features of SCSI, but at least avoid the EIDE flaw when you
  go pure SCSI.
  
6)   Switch to Windows-95 or NT 3.5. Microsoft has already
  modified its EIDE drivers to bypass the flaws.
  
7)   Find a software work-around. The Warp fix for the RZ-
  1000 turns off the prefetch buffer. Fixpack 5 and pre-
  release Fixpack 9 do not bypass the flaw. Now that Intel and
  IBM have finally revealed the technical details, all the
  operating system writers can patch their EIDE drivers to
  bypass the flaw. The Warp fix for the CMD 640B should be
  available soon.
  
8)   Get a BIOS upgrade. For DOS, DESQview, and Windows 3.1,
  to bypass the flaw you may need a new BIOS -- an EPROM chip.
  If you have a flash BIOS, you can update it simply by
  downloading a file. Most BIOSes already have code to bypass
  the flaw for DOS, DESQview and Windows. However, more
  advanced operating systems bypass the BIOS, so even a smart
  BIOS will not protect you. However, the BIOS CMOS settings
  may allow you to disable prefetch, which protects you in
  true multitasking operating systems as well.
  
Whatever method you use to bypass the flaw, retest with
EIDEtest and CDTest afterwards to be sure your fix worked
and you caught all the problems.


CLEANING UP THE MESS

Once you have bypassed the flaw, you can start working on
the problem of cleaning up your files.

The first thing to do is to re-install your operating system
and all your application programs. This will replace any
damaged EXE and DLL files.

Catching errors in your data files is more difficult. Keep
your eyes peeled for any improbable spreadsheet results. You
may have to hire a programmer to write you some comb
programs to sniff through your databases, looking for
suspicious values.
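
To give a feel for what such a comb program does, here is a
minimal sketch in Python for a comma-delimited ASCII file.
It assumes every record should have the same number of
fields, that lines end with a plain newline, and that
legitimate data contains only printable characters. The name
comb and the sample data are hypothetical.

```python
def comb(text, expected_fields):
    """Return (line_number, reason) pairs for lines that look damaged."""
    findings = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue  # ignore blank lines such as the one after a trailing newline
        fields = line.split(",")
        if len(fields) != expected_fields:
            findings.append((line_number, "wrong field count"))
        elif any(not ch.isprintable() for ch in line):
            findings.append((line_number, "non-printable character"))
    return findings


# Line 2 contains a stray control character; line 3 has lost a field.
sample = "1995,rent,1200.00\n1995,fo\x07d,310.55\n1995,travel\n"
for finding in comb(sample, expected_fields=3):
    print(finding)
```

A comb like this catches only structural damage. It cannot
spot a digit that changed into another plausible digit, so
you still need range checks tailored to your own data.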

If you routinely use the verify feature of Lotus Magellan,
it can detect changes to files that should not have changed.
This may help you uncover some of the damage. The flaw is
not polite enough to redate the files it corrupts.
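
If you do not have Magellan, the same idea can be sketched
in a few lines of Python. This is not how Magellan works
internally, which I do not know. It is a stand-in that uses
SHA-256 digests. Take one snapshot, take another later, and
list the files whose contents changed even though the
timestamp did not. The function names are hypothetical.

```python
import hashlib
import os


def snapshot(root):
    """Map each file under root to (modification time, size, SHA-256 digest)."""
    result = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            result[path] = (info.st_mtime_ns, info.st_size, digest)
    return result


def silently_changed(before, after):
    """List files whose contents differ although the timestamp did not move."""
    changed = []
    for path, (mtime, _size, digest) in before.items():
        if path in after:
            new_mtime, _new_size, new_digest = after[path]
            if new_mtime == mtime and new_digest != digest:
                changed.append(path)
    return sorted(changed)
```

This only helps from the first snapshot onward. If the file
was already corrupt when you took the snapshot, the digest
faithfully records the corrupt contents.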

If you have backups from before the time you bought the
faulty machine, you can restore them and re-key everything.

Most people will not be so fortunate. All their backups will
also be corrupt.

Most people with the flaw will just have to put up with
random errors dotting their data files ever after.


SUMMARY

Operating System   Work Around

Netware            -no problems reported
Unixware 1.1
NEXTSTEP
Banyan
Solaris 2.4+
SCO Unix 3.1+
Windows NT 3.5
Windows-95

DOS                -no problems reported so far. If you do
DESQview           have trouble:
Windows 3.1        -turn off EIDE prefetch in CMOS
                   settings.
                   -Upgrade BIOS chip.
                   -Turn off simultaneous disk/floppy/tape
                   I/O in your backup programs.

Windows For        -turn off 32 bit disk access mode.
WorkGroups         -turn off EIDE prefetch in CMOS
                   settings.
                   -Upgrade BIOS chip.
                   -Turn off simultaneous disk/floppy/tape
                   I/O in your backup programs.

Windows NT 3.1     -turn off EIDE prefetch in CMOS
                   settings.
                   -apply ATDISK.SYS patch available at
                   http://www.microsoft.com/KB/softlib

OS/2 2.1           - disable prefetch buffer in CMOS
                   settings.
                   - Load the IBMINT13.I13 driver instead
                   of the IBM1S506.ADD driver. This trick
                   will only work if your BIOS has flaw
                   bypass code. It will be slow.
                   - upgrade to Warp

OS/2 Warp 3        - disable prefetch buffer in CMOS
                   settings.
                   - apply fix for APAR PJ19409 from IBM at
                   ftp://service.boulder.ibm.com/ps/product
                   s/os2/fixes/v3.0warp/english-
                   us/pj19409/pj19409.zip
                   - in a pinch, if you cannot do either of
                   the first two things, add the line
                   BASEDEV=IBMINT13.I13 to config.sys and
                   remove the line BASEDEV=IBM1S506.ADD.
                   The IBMINT13.I13 device driver lives in
                   C:\OS2\BOOT, on the first install
                   diskette, and on the CDROM in
                   \OS2IMAGE\DISK_1. This trick will work
                   only if your BIOS has flaw-bypass code.
                   It will be slow.

Linux              - disable prefetch buffer in CMOS
                   settings.
                   - To bypass the original CMD 640B flaw
                   use the boot time kernel parameter:
                   hda=serialize.
                   - Keep the default setting of the
                   external hard disk parameter utility,
                   hdparm, which leaves other interrupts
                   masked during disk I/O.

REPORTING YOUR FINDINGS

Whether or not you find the flaw, please email me at
Roedy@bix.com or post the following information in the
Internet newsgroup comp.os.os2.bugs:

1)   Test results. (I need to hear about machines both with
  and without the flaw.)
  
2)   Brand and model of your motherboard.
  
3)   Brand and model of your entire system.

4)   Which chip did you find, the RZ-1000, the CMD 640B, the
  SMC 37650?  What did SYSINFO 3.02 report about your EIDE
  controller chip?
  
5)   Have you noticed data file corruption?
  
6)   Which tests and versions did you use? (IOtest,
  EIDEtest, CDtest, RZtest, Ctrltest or visual inspection)
  
7)   What activities did you run in the background during
  the test?
  
8)   Which operating system and version did you use to run
  the test? (e.g. Warp Connect blue spine)
  
9)   Brand and model of EIDE hard disk
  
10)  Brand and model of EIDE CD-ROM

11)  Markings on the suspect chip, e.g., "RZ-1000BP", "CMD
     PCIO640B", "SMC 37650".

12)  Vendor's name

13)  Vendor's response on informing him of your problem.

Please do not bother to report after 1995 September 30. The
Internet is allowing the user community to rapidly sort this
problem out, and all will be well-documented by then.


WHOSE FAULT IS IT?

The wags will have fun tormenting Intel for using the flawed
RZ-1000 chip and the triply flawed CMD 640B in its
motherboard designs, even though Intel did not manufacture
either of the two faulty chips. Intel is not the only
company to manufacture motherboards with the faulty chips,
but Intel will bear the brunt of the bad publicity.

PC-Tech manufactured the faulty RZ-1000 EIDE controller chip
used in many PCI motherboards. PC-Tech is a subsidiary of
ZEOS, the clonemaker. In turn Micron Electronics owns ZEOS.
PC-Tech has offices just down the street from Zeos in
Minnesota. Intel bought the chips from PC-Tech, and in turn
many clone makers bought motherboards from Intel. Other
motherboard manufacturers also used the faulty chips.

PC-Tech, Intel and the clone makers all failed to test their
designs properly. The software makers did not test their
software on enough machines to show up the problem before
releasing it.

Even worse, in some motherboard designs, Intel used the CMD
640B chip. This goof was inexcusable, since the chip, by
deliberate design, is incapable of simultaneous I/O.

How did the triply-flawed CMD 640B chip and the RZ-1000 slip
through Quality Assurance testing? My guess is no one did
real world testing; technicians only tested under laboratory
conditions using only simple operating systems like DOS.
They might have ignored flaws that happened only
sporadically, blaming it on a faulty chip rather than a
faulty design. It is very hard to catch a flaw that only
manifests rarely.

CMD, PC-Tech, Intel, and Microsoft have known about how to
bypass these problems for quite some time. IBM was aware
there was a problem but was unaware of the solution. For
obvious reasons, these companies were reluctant to inform
the public of the danger of the ongoing subtle corruption.

The collective damage done by withholding information about
the flaw is huge, certainly many millions of dollars for
those large companies whose backups are corrupt as well. It
will be interesting to see if anyone launches a damage
lawsuit against CMD, PC-Tech, Intel or Microsoft. If they
do, it might make both hardware and software makers more
careful about releasing improperly tested products.

There is potential here for some massive lawsuits. No wonder
the companies who knew about the flaw have been so tight-
lipped. Think of the damage if Boeing or GM had its plans
for coming products stored on flawed machines. Literally,
this flaw could cause plane crashes.


INTEL'S SPIN

There are three levels of "Intel Inside".

1.   Your motherboard has an Intel CPU but a support chipset
 from another manufacturer.
 
2.   Your motherboard has an Intel CPU and Intel support
 chipset such as the Neptune or Triton, but some other
 company built the BIOS and motherboard.
 
3.   Your motherboard has an Intel CPU, Intel support
chipset, Intel motherboard and Intel BIOS.

Intel literature on the RZ-1000 and CMD 640B only refers to
(3). Intel cannot very well speak for (1) and (2) where the
PCI EIDE controller design was out of their control, even
though these machines bear the "Intel Inside" logo.

Intel does not make this distinction clear in their
literature.

According to Intel, "This problem is a consequence of the RZ-
1000's inability to fully compensate for all the
implications of running an IDE hard disk as an extension of
the PCI bus, instead of running as an extension of the AT
bus which it was originally designed to do."

Intel would have us believe the problem is not a flaw per
se, but rather a limitation that the programmers forgot to
take into consideration.

The truth is grey. UART chips have similar flaws.
Programmers have gradually learned to code around them. We
don't insist that all COM port hardware be recalled. We now
tend to blame a programmer if he does not bypass the known
UART flaws.

No one who understood the RZ-1000 and CMD 640B flaws
publicised their findings. If PC-TECH, Intel and Microsoft
had not been so secretive, the damage would have been
averted. Perhaps they were silent because the flaw primarily
hurt the customers of a competitor, IBM.

Given that software work-arounds are now possible, the
primary blame for any perpetuation of the problem shifts to
the software authors.

However, there are many other EIDE chip designs that do not
have this "limitation". Since the RZ-1000 chip was a
supposedly generic implementation of the ATA interface
standard, this flaw cannot be so lightly excused.

The CMD 640B is triply flawed:

1.   It has the same prefetch problem as the RZ-1000.
 
2.   It erroneously responds to floppy status commands, and
 even worse, in the process, corrupts hard disk data.
 
3.   It does not support simultaneous I/O on the primary and
 secondary EIDE ports.
 
The CMD 640B chip should never have been used in any PC. I
am unaware of any legitimate use for such a brain-damaged
chip. Intel and ASUSTeK must take full blame here for using
such an inappropriate part in their motherboards. In my
eyes, Intel and ASUSTeK have irreparably ruined their
reputations.


SPECULATION

Because setting the flaw right would be so expensive, I
suspect that clone makers and motherboard manufacturers will
continue to refuse to correct the flaw. At best they may
offer BIOS upgrades to bypass the flaws. Microsoft has
already added code to Windows-95 and NT 3.5 to bypass the
flaws. Clone makers will rely on software vendors to write
drivers that bypass the flaws for Warp, Linux and the
various UNIXes.

Now that the OS/2 patches will be out soon, the pressure to
set things right will dwindle. Since DOS, Windows in 16 bit
mode, Windows-95 and NT 3.5 are immune, little pressure to
correct the problem is likely to come from those camps.

The motherboard manufacturer has five options:

1)   Replace the motherboard. Recalls on a mass scale would
  be extremely costly for the motherboard manufacturers, so
  you can count on them to fight. ($400 parts + $250 labour)
  
2)   Provide a replacement paddleboard EIDE controller that
  takes up a PCI slot. ($75)
  
3)   Provide a new BIOS chip that bypasses potential
  problems for DOS and Windows. It could also turn off
  prefetch which would rescue multitasking operating systems
  that do not use the BIOS for I/O. ($10)
  
4)   Tell the users to upgrade to software that bypasses the
  flaw, and to turn off simultaneous disk/tape/floppy I/O in
  any backup software run under DOS, DESQview or Windows. ($0)

5)   Stonewall and refuse to even acknowledge the problem.
  This will be more difficult now that Intel and Dell have
  publicly admitted the problem. ($0)
  
Intel has already set the precedent by offering to replace
defective Pentiums, even though software can bypass its
divide flaw. The RZ-1000 flaw is far more serious, and the
CMD 640B is even more serious still.

Keeping this under wraps is going to be hard for the clone
builders. Brooke Crothers of Infoworld did a story based on
my compilation. I have been in contact with Jerry Pournelle
of Byte. I sent email to John Dvorak. Even the San Jose
Mercury Daily News did a story. A 1000-word abridged version
of this essay is appearing in The Computer Paper that goes
across Canada. The stonewall is coming tumbling down. As one
man pointed out, "I read your postings on the Internet, and
see them the next day quoted in my daily newspaper."


TECHNICALLY WHAT ARE THE FLAWS?

After the manner of Ionesco, Roedy Green said, "All great
programmers are paranoid." Programmers have to anticipate
problems that could happen only once in a trillion machine
cycles since such a problem would still show up on average
every three hours. The EIDE problem sometimes goes for days
without manifesting. Sometimes it shows up within seconds,
depending on the unrelated I/O activity in the machine.
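
The three-hour figure is easy to check. The sketch below
assumes one machine cycle per clock tick and clock rates
typical of the affected Pentiums. Both are simplifying
assumptions of mine. At 100 MHz, a trillion cycles take
10,000 seconds, or a little under three hours.

```python
def hours_between_failures(cycles_per_failure, clock_hz):
    """Average hours between failures for a fault that strikes once per N cycles."""
    seconds = cycles_per_failure / clock_hz
    return seconds / 3600


# One failure per trillion cycles at a few representative clock rates.
for mhz in (60, 90, 100):
    hours = hours_between_failures(1e12, mhz * 1e6)
    print(f"{mhz} MHz: {hours:.1f} hours")
```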

I have read about ten conflicting explanations from
authorities on the cause of the problems. I based my
explanations on postings from Sam Detweiler of IBM's Warp
Device Driver section (sdetweil@vnet.ibm.com).

The RZ-1000 and CMD 640B both have the prefetch flaw.  The
CMD 640B has two additional flaws: lack of simultaneous I/O
support and floppy controller interference.


Flaw 1: Prefetch Buffer Flaw

The RZ-1000 and CMD 640B both have the prefetch flaw.

The fatal co-incidence tends to happen when you have both
the EIDE controller (Hard disk or CD-ROM) and the floppy
controller (floppy or tape backup) working at once.

Data moves from the hard disk to RAM via a bit bucket
brigade. The RZ-1000 grabs data 16 bits at a time from a
buffer in the integrated controller on the hard disk, and
hands it off 32 bits at a time to the PCI bus. The CPU
sits in a tight loop grabbing data from the PCI bus and
storing it in RAM. In prefetch mode, the RZ-1000 keeps ahead
of the CPU, requesting two 16-bit chunks from the hard disk,
in order to have a 32 bit chunk ready when the CPU asks.

When you disable the prefetch buffer, you turn off the
parallelism and run in a degraded lock-step mode. In
degraded mode, the RZ-1000 waits until the CPU asks for a
32-bit chunk. Then it puts the CPU on hold while it asks the
hard disk for two 16-bit chunks. It glues them together,
puts them on the PCI bus and allows the CPU to continue.
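
The difference between the two modes can be sketched with a
toy model. This is not the chip's real logic. ToyController
and glue are hypothetical names, the model counts CPU stalls
rather than measuring time, and it assumes the first 16-bit
word read becomes the low half of the 32-bit chunk (the usual
x86 little-endian order).

```python
# Toy model of the data path: the drive hands out 16-bit words and the
# controller presents them to the CPU as 32-bit chunks.
# Assumption: the first word read becomes the low half of the chunk.

def glue(low_word, high_word):
    """Combine two 16-bit words into one 32-bit chunk."""
    return ((high_word & 0xFFFF) << 16) | (low_word & 0xFFFF)


class ToyController:
    def __init__(self, drive_words, prefetch):
        self._words = iter(drive_words)  # 16-bit words waiting in the drive buffer
        self._prefetch = prefetch
        self._ready = None               # chunk fetched ahead of the CPU, if any
        self.cpu_stalls = 0              # times the CPU was held waiting on the drive
        if prefetch:
            self._ready = self._fetch()  # get ahead before the CPU asks

    def _fetch(self):
        """Ask the drive for two 16-bit words; None when the buffer runs out."""
        try:
            return glue(next(self._words), next(self._words))
        except StopIteration:
            return None

    def read32(self):
        """What the CPU's tight loop calls to get the next 32-bit chunk."""
        if self._prefetch:
            chunk = self._ready
            self._ready = self._fetch()  # refill while the CPU stores the chunk
            return chunk
        self.cpu_stalls += 1             # lock-step: CPU held while the drive is read
        return self._fetch()


if __name__ == "__main__":
    words = [0x1111, 0x2222, 0x3333, 0x4444]
    for prefetch in (True, False):
        ctrl = ToyController(words, prefetch)
        chunks = [hex(ctrl.read32()), hex(ctrl.read32())]
        print("prefetch" if prefetch else "lock-step", chunks, ctrl.cpu_stalls)
```

Both modes deliver the same chunks. The only difference in
the model is that lock-step mode holds the CPU once for every
chunk it requests.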

When there is a delay from some other unrelated device
generating an interrupt or DMA bus cycles, the EIDE chip
sometimes becomes confused and stores status instead of data
into RAM, thus corrupting your data. This flaw is the result
of a shortcut in the chip design -- using the same registers
for both status and data.

There are two software techniques to bypass this flaw:

1)   Never schedule more than one I/O at a time. Use strict
  polled mode with no interrupts. Turn off all unrelated
  interrupts during I/O. This is the DOS/Windows approach. The
  disadvantage is poor performance and possible lost incoming
  modem characters.
  
2)   Turn off the prefetch buffer. In a lightly loaded
  system, there is sufficient spare capacity on the PCI bus so
  running in degraded mode only slows the disk down by 1%.
  However, programs making extensive use of the PCI bus such
  as LANs or video bit-map painting will also slow down. No
  one has yet done benchmarks to measure the amount of
  degradation. Both Intel and IBM tell us that turning off
  prefetch to bypass the flaw has negligible effect on
  performance. Yet in the Plato BIOS rev 12, Intel says that
  enabling the prefetch buffers will "significantly increase
  PCI IDE Hard Disk performance." They can't have it both
  ways.
  

Flaw 2: No Simultaneous I/O

Only the CMD 640B has this flaw.  The CMD 640B can't do more
than one I/O at a time. This flaw was so obvious everyone
found out about it long ago. No EIDE controller, even a
fully functioning one, can run master and slave
simultaneously. However, two separate EIDE controllers are
supposed to allow primary and secondary channels to run
simultaneously. The CMD 640B has dual controllers on one
chip, but unlike every other design, its primary and
secondary channels will not work simultaneously. For
example, you can't run your EIDE hard disk and EIDE CD-ROM
at the same time.

Simultaneous I/O speed is the reason we put two EIDE devices
on separate channels, both as masters, rather than making
one a master and one a slave on the same channel.

IBM has a bypass for this blunder. When it detects a CMD
640B, Warp never schedules more than one I/O at a time when
the CMD 640B is active, reducing the operating system to DOS-
like performance.
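
The idea behind such a bypass can be sketched as a single
lock shared by both channels. This is only an illustration of
"never more than one I/O at a time", not IBM's driver code.
SerializedController, do_io and demo are hypothetical names.

```python
import threading
import time


class SerializedController:
    """Toy model: one lock shared by both channels, so only one I/O is in flight."""

    def __init__(self):
        self._one_io_at_a_time = threading.Lock()
        self._stats = threading.Lock()
        self._in_flight = 0
        self.max_in_flight = 0   # highest number of overlapping I/Os observed
        self.log = []            # channels served, in the order they were served

    def do_io(self, channel, work):
        # Every request, whichever channel it targets, waits for the same lock.
        with self._one_io_at_a_time:
            with self._stats:
                self._in_flight += 1
                self.max_in_flight = max(self.max_in_flight, self._in_flight)
                self.log.append(channel)
            try:
                return work()
            finally:
                with self._stats:
                    self._in_flight -= 1


def demo():
    """Issue one transfer on each channel from separate threads."""
    ctrl = SerializedController()

    def slow_transfer():
        time.sleep(0.01)  # stand-in for a hard disk or CD-ROM transfer

    threads = [
        threading.Thread(target=ctrl.do_io, args=(name, slow_transfer))
        for name in ("primary", "secondary")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ctrl


if __name__ == "__main__":
    ctrl = demo()
    print(ctrl.max_in_flight, ctrl.log)
```

The second request simply waits until the first completes,
which is why the two channels gain nothing from being
separate.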


Flaw 3: Floppy Controller Interference

This flaw only affects some CMD 640B designs, not all.  The
CMD 640B controller contains logic that lets it also act as
a floppy controller. This feature is never used. However, some
motherboard manufacturers failed to hook the chip up
properly to fully disable this function. The CMD 640B thinks
it is in charge of floppy I/O when it is not. It erroneously
responds to status commands directed to the real floppy
controller.

What is worse, when it responds, it becomes confused and
corrupts any hard disk or CD-ROM I/O in progress.

IBM is working on a Warp fix for this problem. Primitive
operating systems are immune to this flaw since they never
attempt to run the hard disk and floppy at the same time.


BACKGROUND

If you read the literature on this problem, you will see
various daunting technical terms. Here is a rough
explanation.

There are six kinds of I/O used in PCs.

1.   PIO - Programmed I/O. The CPU spoon-feeds each byte to
 the I/O port. The port can usually accept data as fast as
 the CPU can feed it. Typical IDE drives work this way under
 DOS. For slower devices, the CPU polls the status to see if
 the device is ready for yet another byte.
 
2.   Scheduled I/O. This is a variant of PIO where the
 operating system feeds the I/O device some bytes, then
 calculates how long it should take for the I/O device to
 digest them, then it goes away for a while to do something
 else, then it comes back when it figures the I/O should be
 complete, and feeds the device a few more bytes. This is how
 Warp usually controls parallel port printers.
 
3.   Interrupt I/O. Every time the port is ready to eat
 another byte, it raises an interrupt and the CPU feeds it
 some more. This is the typical way COM ports work and how
 Warp uses printers with the /IRQ option. Warp EIDE drivers
 combine methods (1) and (3). The hard disk interrupts when
 it has completed the read into its on-board buffer. Then the
 CPU fetches data out of the buffer with PIO mode.

4.   Third party DMA. The DMA controller on the motherboard
 copies data from RAM to the port and generates an interrupt
 when it is done with a block. Floppy drives and inexpensive
 mag tape backup drives use this method. Because of the
 unfortunate original AT design compromises, this method is
 exceedingly slow. Third Party DMA is never used for PCI bus
 devices though it is still used for ISA or motherboard-based
 floppy controllers on PCI motherboards.
 
5.   First party DMA, sometimes called Bus Mastering. A DMA
 controller on the device copies data from RAM to the port
 and generates an interrupt when done. High-end SCSI cards,
 such as the Adaptec 2940 or 2940W, use this ultimate way to
 fly.
 
6.   Memory mapped I/O. The CPU copies data to a magic
 region of RAM which is actually on the I/O device. LAN cards
 or REGEN VRAM on video cards use this technique.
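
The contrast between kinds (1) and (3) above can be sketched
with a toy device. FakePort is a hypothetical stand-in and no
real hardware is touched. The "interrupt" is modelled as a
plain callback.

```python
# Toy contrast between polled PIO (kind 1) and interrupt I/O (kind 3).
# FakePort is a hypothetical stand-in for a device, not a real port.

class FakePort:
    def __init__(self, busy_polls):
        self._busy_polls = busy_polls  # status checks the device stays busy per byte
        self._countdown = 0
        self.received = []
        self.on_ready = None           # callback standing in for an interrupt handler

    def ready(self):
        """Status check: True once the device can accept another byte."""
        if self._countdown > 0:
            self._countdown -= 1
            return False
        return True

    def write(self, byte):
        self.received.append(byte)
        self._countdown = self._busy_polls

    def tick(self):
        """Advance the device; 'raise the interrupt' when it becomes ready."""
        if self.ready() and self.on_ready:
            self.on_ready()


def polled_write(port, data):
    """PIO: the CPU spins on the status until each byte is accepted."""
    wasted_polls = 0
    for byte in data:
        while not port.ready():
            wasted_polls += 1
        port.write(byte)
    return wasted_polls


def interrupt_write(port, data):
    """Interrupt I/O: the CPU runs the handler only when the device asks."""
    pending = list(data)

    def handler():
        if pending:
            port.write(pending.pop(0))

    port.on_ready = handler
    other_work = 0
    while pending:
        port.tick()       # the device proceeds on its own
        other_work += 1   # meanwhile the CPU is free for applications
    return other_work


if __name__ == "__main__":
    print(polled_write(FakePort(busy_polls=3), b"abc"))
    print(interrupt_write(FakePort(busy_polls=3), b"abc"))
```

In the polled version every unsuccessful status check is CPU
time thrown away. In the interrupt version the same waiting
time is available for other work.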
 
In a true multi-tasking system, such as OS/2, the CPU goes
off and works on behalf of applications when the port is
busy, and trusts an interrupt to bring it back when the
device needs more service. It schedules several I/Os
simultaneously. In contrast, DOS and Windows never do more
than one I/O at a time. Further, under DOS/Windows the CPU
idles while waiting for its single I/O to complete rather
than working on applications.


LEARNING MORE

You can use the Internet to learn more about this problem.
If you do not have Internet access, I can provide these
files on diskette.

Roedy Green's FAQ (Frequently Asked Questions), an unabridged
version of this article, including the EIDEtest and CDTest
programs:

     ftp://ftp.cdrom.com/.4/os2/incoming/eidete15.zip

Warp bypass for the RZ-1000 chip

     ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/pj19409/pj19409.zip

Intel's FAQ

     http://www.intel.com/procs/support/rz1000

Intel's RZ-1000 chip detect program

     http://www.intel.com/procs/support/rz1000/rztest.exe

Intel's CMD 640B and RZ-1000 chip detect program

     http://www.intel.com/procs/support/ctrltest/ctrltest.exe

IBM's bypass for the first CMD 640B chip flaw. IBM will soon
replace it with one that bypasses all three of the CMD
640B's faults.

     ftp://ftp-os2.cdrom.com/pub/os2/drivers/cmd640x.zip

IOTest from PowerQuest, the makers of Partition Magic, a
Warp test for the flaw.

     http://www.powerquest.com/

PC-Tech's essay:

     http://www.mei.micron.com/rz1000/rz1000.txt


CONTACTING THE AUTHOR

The author, Roedy Green, is a computer consultant who
prefers to work on Forth, C++, Delphi, DOS, OS/2 and
Internet Web projects.

If you send me $5 (US or Canadian) to cover duplication,
shipping and handling I will send you a diskette containing
all the relevant test programs, patches and essays.

Please report which machines you find the flaw in, and which
software and fixpacks you were using at the time. Send email
to:

     Roedy@bix.com

or discuss this problem in the Internet newsgroup:

     comp.os.os2.bugs

You can also write via snail mail:

Roedy Green
Canadian Mind Products
#601 - 1330 Burrard Street
Vancouver, BC  CANADA
V6Z 2B8
(604) 685-8412

-30-

