VOICE DIGITIZATION AND REPRODUCTION ON THE IBM PC/XT AND PC/AT BUILT-IN SPEAKER
--------------------------------------------------------------------------------

Alan D. Jones
July 1988

The speaker on the PC and its associated driver circuitry is quite simple and
crude, having been designed primarily for creating single square-wave tones of
various audio frequencies. This speaker is typically driven by a pair of
transistors used as a current amplifier, which is in turn driven directly by
the output of a TTL gate. This results in only two possible voltages across
the voice coil: 0 volts and 5 volts. Any sound to be reproduced by this system
must therefore be reduced to an approximation in the form of a stream of
constant-amplitude, variable-width rectangular pulses.

Examination of a speech waveform on an oscilloscope quickly tells us that it
is not going to be possible to even remotely mimic this waveform under the
above restrictions. Much of the information contained in the waveform is in
the form of amplitude variations, and this is the one attribute we cannot
reproduce.

It is initially tempting to try the technique of the "class D" amplifier to
create the waveform, using high-speed pulse width modulation and depending on
the mechanical characteristics of the speaker and those of the human ear to
provide the missing low-pass filtering. Assuming the sampling rate to be 8 kHz
(based on the Nyquist criterion) and, to conserve memory, assuming the samples
to contain only 4 bits of amplitude information (16 levels), data accumulates
at a rate of 4k bytes per second, which is certainly acceptable. The problem
comes when we try to play back the sound. Pulses occur at intervals of 125
microseconds, which doesn't seem too bad, but since each pulse can have 16
possible widths, it is necessary to time the pulses with a resolution of well
under 8 microseconds. This is only a couple of instruction times on a 4.77 MHz
XT, and even on a fast 80386 it doesn't give the CPU much time between pulses
to shift bits, read and increment a pointer, check the pointer to see if it's
done yet, and so on, not to mention the difficulty of servicing unrelated
interrupts.

The search for simpler (but still usable) and less CPU-intensive methods of
reproducing speech leads to the question of what information in the waveform
we can discard without an unacceptable loss of intelligibility. My experiments
with running speech signals through a graphic equalizer revealed that the
lower-frequency components, those which are most visible to the eye on the
oscilloscope, are actually of minimal importance in understanding speech. This
is also demonstrated by the fact that a whisper is just as understandable as
normal speech, even though it does not make use of vibrating vocal cords,
which are the primary source of low-frequency components in the voice.

The schematic created by printing the file SCHEMATC.PRT arose partly from the
above observations and partly from trial-and-error. The circuit consists of
two stages of voltage amplification with some high-pass filtering built into
the coupling capacitors, followed by a differentiator. The output of the
differentiator is fed to a voltage comparator, thus producing an output which
has approximately the following relationship to the input from the microphone:

    If the derivative of the speech waveform is positive, the output is
    logic zero; if the derivative of the speech waveform is negative, the
    output is logic one.
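In other words, the circuit reduces the speech to a one-bit record of whether
the waveform is rising or falling at each instant. The fragment below is
purely illustrative and is not part of the hardware or of RECORD; it simply
models that input/output relationship in software, using a handful of made-up
sample values.

    #include <stdio.h>

    /* One output bit per sample: 0 while the waveform is rising
       (positive derivative), 1 while it is falling (negative derivative). */
    void encode(const int *sample, int n, unsigned char *bits)
    {
        int i;
        bits[0] = 0;
        for (i = 1; i < n; i++)
            bits[i] = (sample[i] < sample[i - 1]) ? 1 : 0;
    }

    int main(void)
    {
        int wave[8] = { 0, 3, 5, 4, 1, -2, -1, 2 };   /* made-up samples */
        unsigned char bits[8];
        int i;

        encode(wave, 8, bits);
        for (i = 0; i < 8; i++)
            printf("%d", bits[i]);                    /* prints 00011100 */
        printf("\n");
        return 0;
    }

Playing such a bit stream back through the speaker amounts to swinging the
cone hard one way for a 1 and hard the other way for a 0, which preserves the
timing of the waveform's slope reversals even though the amplitude information
is gone.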
The transition timing at the output is entirely analog in nature; there is no
synchronizing clock signal anywhere in the circuit. If the output of this
circuit is connected directly to a speaker, the resulting sound will still be
an understandable version of the input. Since the output consists of nothing
but a digital bit stream, the job of the computer becomes simply that of
recording and accurately reproducing this bit stream.

The trimpot at the input of amplifier U3 is used to set the DC idle voltage
output from the differentiator to somewhere near the threshold of comparator
U4. There will be a considerable amount of noise at the output of U3,
originating at the microphone and within the input circuitry of U1, and highly
amplified by U1 and U2. The trimpot should be adjusted so that the comparator
threshold is just outside the normal excursion of the noise signal ("off to
one side"); otherwise "silence" at the microphone will become, at the speaker
output from the computer, a loud hiss with a strong component at half the
sampling frequency.

I used LF356's for U1, U2, and U3, and an LM393 for U4. Everything is powered
from +12 volts and ground. All amplifiers should have power supply bypass
capacitors (not shown). The microphone is a 600 ohm dynamic type. The 12 volt
power supply should be quiet and well-regulated; the one in the PC is too
noisy unless you use heavy filtering.

The two programs, RECORD and PLAY, are used as follows. Attach the circuit to
the CTS input on one of the PC's COM ports. Then type:

    RECORD <port> <filename>

where <port> is the COM port number and <filename> is the name of the disk
file to contain the voice data. RECORD will respond with "Press a key to start
and stop." Press the space bar and start talking. Press the space bar again to
end recording and write the data to disk. Play it back with:

    PLAY <port> <filename>

The sampling rate is about 16.5k bits per second. This means that about 30
seconds of voice will make a 64k disk file. This is a simple program; it runs
out of steam at 64k.

The programs both operate by reprogramming the 8253 timer chip to produce
hardware interrupts at the 16.5 kHz rate. The interrupt service routine then
manipulates the NAND gate driving the speaker based on bits read from the file
(a rough sketch of this playback scheme appears at the end of this file). The
16.5 kHz rate was chosen by trial-and-error; this is the audible "point of
diminishing returns", where a further increase in sampling rate didn't produce
enough of an improvement to warrant the increased memory usage.

This technique is somewhat limited in its usefulness. It necessitates writing
a "badly behaved" program which not only reprograms the timer chip but also
totally hogs the CPU for the duration of the voice output. Nevertheless, it
demonstrates a few interesting things about how humans hear speech.

I first developed this circuit over a year ago as a rebuttal to someone who
said "it couldn't be done". Not only can it be done, it is actually quite
simple. Certainly the circuit could be improved, at the possible expense of
increased complexity.

I'm waiting to hear from some of you. If anyone has questions, especially
about my sloppy code, I check for messages on CIS every three or four days.

- Alan 74030,554
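For the curious, here is a rough sketch of the playback scheme described
above. This is not the actual PLAY source; it assumes a DOS compiler in the
Turbo C family (dos.h supplying inportb/outportb, getvect/setvect, and the
"interrupt" keyword), and the buffer handling, names, and exact mode bits are
illustrative. Channel 0 of the 8253 is reprogrammed to fire IRQ0 at roughly
16.5 kHz, and each interrupt puts one recorded bit on the speaker-data line
(bit 1 of port 61h), which feeds the NAND gate driving the speaker.

    #include <dos.h>

    #define PIT_CH0   0x40              /* 8253 channel 0 data port         */
    #define PIT_CMD   0x43              /* 8253 mode/command register       */
    #define PORT_B    0x61              /* speaker control bits live here   */
    #define RATE_HZ   16500L            /* sampling rate from the text      */

    static unsigned char far *buf;      /* the recorded bit stream          */
    static unsigned long nbits;
    static volatile unsigned long pos;
    static unsigned char quiet;         /* port 61h with bits 0 and 1 low   */
    static void interrupt (*old_int8)(void);

    static void interrupt play_isr(void)
    {
        unsigned char out = quiet;

        /* Put the next recorded bit on the speaker-data line; the NAND
           gate and the driver transistors do the rest.                     */
        if (pos < nbits && (buf[pos >> 3] & (1 << (pos & 7))))
            out |= 0x02;
        pos++;

        outportb(PORT_B, out);
        outportb(0x20, 0x20);           /* end-of-interrupt to the 8259     */
    }

    void play_bits(unsigned char far *data, unsigned long count)
    {
        unsigned divisor = (unsigned)(1193182L / RATE_HZ);   /* about 72    */

        buf = data;  nbits = count;  pos = 0;
        quiet = inportb(PORT_B) & 0xFC; /* timer-2 gate off, speaker data 0 */

        old_int8 = getvect(8);          /* IRQ0 is interrupt vector 8       */
        setvect(8, play_isr);

        outportb(PIT_CMD, 0x34);        /* channel 0, lo/hi byte, mode 2    */
        outportb(PIT_CH0, divisor & 0xFF);
        outportb(PIT_CH0, divisor >> 8);

        while (pos < nbits)             /* hog the CPU until the data is out */
            ;

        outportb(PIT_CMD, 0x36);        /* restore the normal 18.2 Hz tick  */
        outportb(PIT_CH0, 0);           /* (divisor 65536, mode 3)          */
        outportb(PIT_CH0, 0);
        setvect(8, old_int8);
        outportb(PORT_B, quiet);
    }

Recording is the mirror image: RECORD presumably samples the CTS bit (bit 4 of
the UART's modem status register, at the COM port's base address plus 6) at
the same interrupt rate and packs the resulting bits into the file.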