VOICE DIGITIZATION AND REPRODUCTION ON THE IBM PC/XT AND PC/AT BUILT-IN SPEAKER
--------------------------------------------------------------------------------

Alan D. Jones
July 1988

The speaker on the PC and its associated driver circuitry is quite simple and
crude, having been designed primarily for creating single square-wave tones of
various audio frequencies. This speaker is typically driven by a pair of
transistors used as a current amplifier, which is in turn driven directly by
the output of a TTL gate. This results in only two possible voltages across
the voice coil: 0 volts and 5 volts. Any sound to be reproduced by this system
must therefore be reduced to an approximation in the form of a stream of
constant-amplitude, variable-width rectangular pulses.

Examination of a speech waveform on an oscilloscope quickly tells us that it
is not going to be possible to even remotely mimic this waveform under the
above restrictions. Much of the information contained in the waveform is in
the form of amplitude variations, and this is the one attribute we cannot
reproduce.

It is initially tempting to try the technique of the "class D" amplifier to
create the waveform, using high-speed pulse width modulation and depending on
the mechanical characteristics of the speaker and those of the human ear to
provide the missing low-pass filtering. Assuming the sampling rate to be 8 kHz
(based on the Nyquist criterion) and, to conserve memory, assuming the samples
to contain only 4 bits of amplitude information (16 levels), data accumulates
at a rate of 4k bytes per second, which is certainly acceptable. The problem
comes when we try to play back the sound. Pulses occur at intervals of 125
microseconds, which doesn't seem too bad, but since each pulse can have 16
possible widths, it is necessary to time the pulses with a resolution of well
under 8 microseconds. This is only a couple of instruction times on a 4.77 MHz
XT, and even on a fast 80386 it doesn't give the CPU much time between pulses
to shift bits, read and increment a pointer, check the pointer to see if it's
done yet, and so on, not to mention the difficulty of servicing unrelated
interrupts.

The search for simpler (but still usable) and less CPU-intensive methods of
reproducing speech leads to the question of what information in the waveform
we can discard without an unacceptable loss of intelligibility. My experiments
with running speech signals through a graphic equalizer revealed that the
lower-frequency components, those which are most visible to the eye on the
oscilloscope, are actually of minimal importance in understanding speech. This
is also demonstrated by the fact that a whisper is just as understandable as
normal speech, even though it does not make use of vibrating vocal cords,
which are the primary source of low-frequency components in the voice.

The schematic created by printing the file SCHEMATC.PRT arose partly from the
above observations and partly from trial-and-error. The circuit consists of
two stages of voltage amplification with some high-pass filtering built into
the coupling capacitors, followed by a differentiator. The output of the
differentiator is fed to a voltage comparator, thus producing an output which
has approximately the following relationship to the input from the microphone:

    If the derivative of the speech waveform is positive, the output is
    logic zero; if the derivative of the speech waveform is negative, the
    output is logic one.
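In other words, the circuit reduces the speech to a one-bit record of whether
the waveform is rising or falling at each instant. The fragment below is
purely illustrative and is not part of the hardware or of RECORD; it simply
models that input/output relationship in software, using a handful of made-up
sample values.

    #include <stdio.h>

    /* One output bit per sample: 0 while the waveform is rising
       (positive derivative), 1 while it is falling (negative derivative). */
    void encode(const int *sample, int n, unsigned char *bits)
    {
        int i;
        bits[0] = 0;
        for (i = 1; i < n; i++)
            bits[i] = (sample[i] < sample[i - 1]) ? 1 : 0;
    }

    int main(void)
    {
        int wave[8] = { 0, 3, 5, 4, 1, -2, -1, 2 };   /* made-up samples */
        unsigned char bits[8];
        int i;

        encode(wave, 8, bits);
        for (i = 0; i < 8; i++)
            printf("%d", bits[i]);                    /* prints 00011100 */
        printf("\n");
        return 0;
    }

Playing such a bit stream back through the speaker amounts to swinging the
cone hard one way for a 1 and hard the other way for a 0, which preserves the
timing of the waveform's slope reversals even though the amplitude information
is gone.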
The transition timing at the output is entirely analog in nature; there is no
synchronizing clock signal anywhere in the circuit. If the output of this
circuit is connected directly to a speaker, the resulting sound will still be
an understandable version of the input. Since the output consists of nothing
but a digital bit stream, the job of the computer becomes simply that of
recording and accurately reproducing this bit stream.

The trimpot at the input of amplifier U3 is used to set the DC idle voltage
output from the differentiator to somewhere near the threshold of comparator
U4. There will be a considerable amount of noise at the output of U3,
originating at the microphone and within the input circuitry of U1, and highly
amplified by U1 and U2. The trimpot should be adjusted so that the comparator
threshold is just outside the normal excursion of the noise signal ("off to
one side"); otherwise "silence" at the microphone will become, at the speaker
output from the computer, a loud hiss with a strong component at half the
sampling frequency.

I used LF356's for U1, U2, and U3, and an LM393 for U4. Everything is powered
from +12 volts and ground. All amplifiers should have power supply bypass
capacitors (not shown). The microphone is a 600 ohm dynamic type. The 12 volt
power supply should be quiet and well-regulated; the one in the PC is too
noisy unless you use heavy filtering.

The two programs, RECORD and PLAY, are used as follows. Attach the circuit to
the CTS input on one of the PC's COM ports. Then type:

    RECORD <port> <filename>

where <port> is the COM port number and <filename> is the name of the disk
file to contain the voice data. RECORD will respond with "Press a key to start
and stop." Press the space bar and start talking. Press the space bar again to
end recording and write the data to disk. Play it back with:

    PLAY <port> <filename>

The sampling rate is about 16.5k bits per second. This means that about 30
seconds of voice will make a 64k disk file. This is a simple program; it runs
out of steam at 64k.

The programs both operate by reprogramming the 8253 timer chip to produce
hardware interrupts at the 16.5 kHz rate. The interrupt service routine then
manipulates the NAND gate driving the speaker based on bits read from the file
(a rough sketch of this playback scheme appears at the end of this file). The
16.5 kHz rate was chosen by trial-and-error; this is the audible "point of
diminishing returns", where a further increase in sampling rate didn't produce
enough of an improvement to warrant the increased memory usage.

This technique is somewhat limited in its usefulness. It necessitates writing
a "badly behaved" program which not only reprograms the timer chip but also
totally hogs the CPU for the duration of the voice output. Nevertheless, it
demonstrates a few interesting things about how humans hear speech.

I first developed this circuit over a year ago as a rebuttal to someone who
said "it couldn't be done". Not only can it be done, it is actually quite
simple. Certainly the circuit could be improved, at the possible expense of
increased complexity.

I'm waiting to hear from some of you. If anyone has questions, especially
about my sloppy code, I check for messages on CIS every three or four days.

- Alan 74030,554
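For the curious, here is a rough sketch of the playback scheme described
above. This is not the actual PLAY source; it assumes a DOS compiler in the
Turbo C family (dos.h supplying inportb/outportb, getvect/setvect, and the
"interrupt" keyword), and the buffer handling, names, and exact mode bits are
illustrative. Channel 0 of the 8253 is reprogrammed to fire IRQ0 at roughly
16.5 kHz, and each interrupt puts one recorded bit on the speaker-data line
(bit 1 of port 61h), which feeds the NAND gate driving the speaker.

    #include <dos.h>

    #define PIT_CH0   0x40              /* 8253 channel 0 data port         */
    #define PIT_CMD   0x43              /* 8253 mode/command register       */
    #define PORT_B    0x61              /* speaker control bits live here   */
    #define RATE_HZ   16500L            /* sampling rate from the text      */

    static unsigned char far *buf;      /* the recorded bit stream          */
    static unsigned long nbits;
    static volatile unsigned long pos;
    static unsigned char quiet;         /* port 61h with bits 0 and 1 low   */
    static void interrupt (*old_int8)(void);

    static void interrupt play_isr(void)
    {
        unsigned char out = quiet;

        /* Put the next recorded bit on the speaker-data line; the NAND
           gate and the driver transistors do the rest.                     */
        if (pos < nbits && (buf[pos >> 3] & (1 << (pos & 7))))
            out |= 0x02;
        pos++;

        outportb(PORT_B, out);
        outportb(0x20, 0x20);           /* end-of-interrupt to the 8259     */
    }

    void play_bits(unsigned char far *data, unsigned long count)
    {
        unsigned divisor = (unsigned)(1193182L / RATE_HZ);   /* about 72    */

        buf = data;  nbits = count;  pos = 0;
        quiet = inportb(PORT_B) & 0xFC; /* timer-2 gate off, speaker data 0 */

        old_int8 = getvect(8);          /* IRQ0 is interrupt vector 8       */
        setvect(8, play_isr);

        outportb(PIT_CMD, 0x34);        /* channel 0, lo/hi byte, mode 2    */
        outportb(PIT_CH0, divisor & 0xFF);
        outportb(PIT_CH0, divisor >> 8);

        while (pos < nbits)             /* hog the CPU until the data is out */
            ;

        outportb(PIT_CMD, 0x36);        /* restore the normal 18.2 Hz tick  */
        outportb(PIT_CH0, 0);           /* (divisor 65536, mode 3)          */
        outportb(PIT_CH0, 0);
        setvect(8, old_int8);
        outportb(PORT_B, quiet);
    }

Recording is the mirror image: RECORD presumably samples the CTS bit (bit 4 of
the UART's modem status register, at the COM port's base address plus 6) at
the same interrupt rate and packs the resulting bits into the file.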