Vocal music on computers?
The desire to make inanimate objects spout forth more-or-less human utterances may possibly have something to do with Man's insistence upon anthropomorphizing everything he gets his grubby paws onto (viz. boats, cars, planes and family pets). The successful transplant of the refined tones of Kenneth Kendall from BBC TV news studio to a slice of silicon in the BBC Microcomputer must therefore represent a near ultimate example of the humanisation process, although, in essence, the transposition is nothing more than a technological update of good, old-fashioned ventriloquism. However, before we pat ourselves on the back for making such a tasteful transition from Detroit Dalek to Hereford Human, it's worth remembering that the origin of synthetic speech goes back a good deal further than even the first 'pocket calculator' of Charles Babbage. What's also rather interesting is that the two principal techniques for producing speech — recording/reproduction and synthesis through modelling of the vocal tract — have followed roughly parallel paths of development over the last century or so.
Confining our historical explorations to the last 200 years, one of the first well-documented accounts of a talking machine lies in the work of Johann Maelzel. Apart from inventing a speaking doll in 1823, Maelzel was the first person to attempt to simulate the vocal tract by using bellows ('lungs') to force air through a small diameter tube ('windpipe') with a moving flap ('tongue') to alter the resonant characteristics. Modelling of the vocal tract took a quite literal turn in the early part of the present century in the shape of Sir Richard Paget's Plasticene Resonators Producing Artificial Vowel Sounds. Paget's work stems from Helmholtz's observation in the 1860s that certain vowel sounds depended upon two resonances being set up simultaneously in the mouth. The Plasticene Resonators were constructed to demonstrate that all vowels stemmed from two such resonances. In a paper presented to the Musical Association of London in 1924, Paget proved his point by driving his vocal models from bellows, thereby simulating the eight basic vowel sounds. The reader can demonstrate some of Paget's observations for him/herself by speaking the words in Figure 1 and listening to the change in pitch on going from one to another. The actual resonant frequencies generated by the vocal tract will obviously vary from one individual to another, but it should be easy enough to hear the rising scale of the upper notes and the change in pitch that also occurs down below.
Modelling of the vocal tract took a more technological turn in the late 1930s at Bell Telephone Laboratories in the States. The voice synthesis system that emerged, the VODER (Voice Operation DEmonstratoR), consisted of a signal generator producing a buzz to simulate the vocal cords, a noise generator to simulate the rush of expired air, and a series of filters to imitate the resonant characteristics of the vocal tract. In the meantime, the other side of the speech synthesis fence was also being explored, in the shape of extensions of the 1936 speaking clock, using miniature glass records and transport mechanisms. Such synthesis by reproduction found its way into countless Barbie-like talking dolls, but also achieved some commercial success in elevators, information booths, and the like. The advent of cheap microprocessor technology hasn't actually changed the essential principles of speech synthesis, but it has made both techniques more efficient and more flexible.
The modern-day equivalent of recording and reproduction with a glass record is the digital implementation of the analogue tape recorder. Spoken words are digitized at a sample rate that's commensurate with the bandwidth of the voice and stored directly in memory. Speech is produced when the contents of the memory are accessed and fed to the input of a digital-to-analogue converter (DAC), then smoothed by a low-pass filter and amplified (Figure 2). This direct-storage method was widely used in the first talking calculators and by a number of microcomputer peripherals, including the Mountain Computer 'Supertalker' for the Apple II.
The main disadvantage of the direct-storage method is the large amount of memory space needed for even small amounts of speech. For example, the Supertalker operates with a maximum sampling rate of 4kHz, meaning that 4K of RAM will be used up for every second of speech, and the resultant 1.6 kHz bandwidth for the final speech output means that a lot of high frequency detail is lost in the wash. The only way of improving the quality of speech is to increase the sampling rate, but that automatically reduces the number of words digitizable in a given memory space. So, a measure of commonsense and compromise is rather important in putting the digitisation principle into effect!
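The trade-off is simple arithmetic. As a sketch, assuming 8-bit (one byte) samples and the ideal Nyquist limit (the Supertalker's quoted 1.6kHz is, of course, below that theoretical ceiling):

```python
# Direct-storage arithmetic, assuming one byte per sample.

def storage_per_second(sample_rate_hz, bytes_per_sample=1):
    """RAM consumed by one second of digitized speech."""
    return sample_rate_hz * bytes_per_sample

def nyquist_bandwidth(sample_rate_hz):
    """Theoretical upper limit on recoverable bandwidth."""
    return sample_rate_hz / 2

# At the Supertalker's maximum 4kHz sampling rate:
print(storage_per_second(4000))   # 4000 bytes, i.e. roughly 4K per second
print(nyquist_bandwidth(4000))    # 2000 Hz ceiling; the practical figure is lower
```

Doubling the sampling rate to 8kHz doubles both the bandwidth and the RAM bill, which is precisely the compromise discussed above.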
Commonsense also applies to which words one actually elects to store in memory. If, for example, one wanted the computer to speak a number that was the result of a calculation, a phrase for every possible answer could be digitized — provided one had infinite RAM and infinite patience. A much more sensible approach is to store the words, 'thousand', 'hundred', 'ninety', ... 'twenty' and 'nineteen', ... 'zero', a more manageable total of just 30 phrases, in an appropriate 'phrase table', with an address pointer for each entry, and then concatenate them together to produce whatever number is required to vocalise the calculation. The other thing that can be done to improve the vocal lot of the direct-storage method is to subject the speech to some form of data compression. An obvious starting-point is to remove the 'dead time' at the beginning and end of words, but special compression techniques can go as far as reducing the number of bits required for each digitized value to a quarter of the number used by straight digitization, though the quality of speech invariably suffers.
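The concatenation scheme itself is easy to sketch. Assuming the 30-odd stored phrases just described, a routine need only split a number into its constituent phrase names (the strings below stand in for the address pointers into the digitized speech data):

```python
# Phrase-table concatenation for speaking numbers up to 9999.
# The word strings stand in for address pointers into digitized speech.

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = [None, None, "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def speak_number(n):
    """Return the list of stored phrases to play, in order."""
    if n == 0:
        return ["zero"]
    words = []
    if n >= 1000:
        words += speak_number(n // 1000) + ["thousand"]
        n %= 1000
    if n >= 100:
        words += speak_number(n // 100) + ["hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(UNITS[n])
    return words

print(speak_number(1984))
# ['one', 'thousand', 'nine', 'hundred', 'eighty', 'four']
```

Six short fetches from the phrase table, rather than one enormous digitized recording per possible answer.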
For all the difficulties encountered in applying digitization to speech synthesis, there are two major advantages to the direct-storage method: firstly, the speech is your own voice and, therefore, is as natural as that normally is; and secondly, adding a new word or phrase to the computer's vocabulary is simply a matter of digitizing it and storing it in memory.
Using the direct-storage method as a starting-point for further discussion, it's pretty clear that economic speech synthesis needs some means of reducing the storage requirements and data output rate from memory while still retaining intelligibility. One technique that emerged in the mid-1970s was synthesis based on stringing together the basic speech sounds or 'phonemes' common to any language. Examples are the "ouuu" sound in 'zoo' or the "tch" sound in 'touch'. When any of the 64 English phonemes are strung together, words are created. The word 'pin', for example, would consist of the phoneme for "phh", followed by the phoneme for "ihh" and that for "nnn".
The results of phoneme synthesis are the classic examples of 'talking computer' speech — the IBM computer reciting a Shakespeare sonnet, for example — the robotic quality stemming from the lack of inflections at the ends of words or sentences and an over-regular cadence (ie., monotonous intonation). This method of speech synthesis produces speech that is the least understandable and most robotic of any technique, but it does have the considerable advantage of using minimal memory to store the parameters required to construct a word. Speech can be stored at data rates of 10 bytes per second of speech — a dramatic contrast with the thousands of bytes per second required by digitization techniques. Furthermore, unlike the direct storage of entire words, the phoneme storage and reconstruction method enables true speech synthesis to be carried out, though the large amount of concatenation needed to assemble text makes for a fairly bumpy ride.
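A sketch of why the data rate is so low: assuming a hypothetical table that assigns each phoneme a one-byte code (real 64-phoneme chips use their own numbering schemes), a word becomes nothing more than a short string of bytes:

```python
# Phoneme-based storage sketch. The codes below are invented for
# illustration; real phoneme chips define their own tables.

PHONEMES = {"phh": 0x20, "ihh": 0x11, "nnn": 0x18, "tch": 0x2A, "ouuu": 0x05}

def encode_word(phoneme_sequence):
    """One byte per phoneme: 'pin' costs 3 bytes, not thousands."""
    return bytes(PHONEMES[p] for p in phoneme_sequence)

pin = encode_word(["phh", "ihh", "nnn"])
print(len(pin))  # 3 bytes for the whole word
```

At a typical speaking rate, a few phonemes per second really does work out at around 10 bytes per second of speech.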
The third and most recent type of speech synthesis, linear predictive coding (LPC), offers a happy compromise as regards the amount of memory required for encoding speech, but it is also the first modern technique to return to the basic principles of vocal tract modelling. As LPC is the technique used by a number of new speech chips (including the TMS 5220 that is used in the BBC Micro and is to be examined in 'Chip Chat' shortly), it's worthwhile taking a break from discussing technology to have a look at what one's actually attempting to model — the human vocal tract (Figure 3) — and how it produces the sounds we call speech.
Speech is composed of two main components: voiced sounds and unvoiced sounds.
Voiced sounds are produced when air from the lungs is forced between the vocal cords, thereby forcing them to vibrate or buzz, making a pulsating column of air enter the mouth and nasal cavities. The fundamental pitch of the resultant sound is determined by the length, thickness, and tension of the vocal cords. During the production of voiced sounds, the vocal tract receives pulses whose harmonic spectrum is very complicated — not least because it's changing from one moment to the next. However, a crude approximation of the vocal cord buzz is a pulse wave of very narrow width, ie., that generated by the vocal cords opening and closing very rapidly as air passes through.
The harmonic spectrum of such a pulse wave gives equal weight to all the harmonics (Figure 4a), which isn't exactly like the real thing, but it turns out that passing this approximation through a series of filters does enable a reasonably convincing simulation of vowel sounds.
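That flat spectrum is easy to verify numerically. The sketch below uses an idealised single-sample pulse (a real vocal-cord pulse has finite width, so its spectrum droops gently rather than staying perfectly flat):

```python
# DFT of one period of a very narrow pulse wave, computed directly.
import cmath

def spectrum(signal):
    """Magnitude of each harmonic of one period of a waveform."""
    N = len(signal)
    return [abs(sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N))) for k in range(N)]

# One period: the 'vocal cords' open for a single sample out of 64.
pulse = [1.0] + [0.0] * 63
mags = spectrum(pulse)

# Every harmonic carries the same weight — the narrow pulse's spectrum is flat.
print(all(abs(m - mags[1]) < 1e-9 for m in mags[1:]))  # True
```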
If the air from the lungs is allowed to pass between the vocal cords more or less unchecked, then unvoiced sounds such as "f" or "h" are produced. Fortunately for merry modellers of the vocal tract, these are very similar in nature to the sounds produced by filtering a white noise source that has a rather broad and flat spectrum (Figure 4b).
These raw sounds then have to be twisted and turned into shapes more closely corresponding to speech, and that's where the rest of the vocal tract comes into play. The upper vocal tract acts as a complex filtering system that determines the tonal character of both voiced and unvoiced sounds. Important filtering elements are the shape of the mouth, the position of the tongue against the teeth or palate, and the characteristics of the nasal cavities. Sounding "ah" and slowly altering the shape of the mouth demonstrates the effect of changing the shape of the upper part of the vocal tract on the spectrum of these voiced sounds.
In the case of vowel production, the jaw and tongue positions uniquely determine the voicing of the vowels, a, e, i, o, and u (see Figure 5). Also, their production isn't pitch-dependent, ie., any vowel can be produced at any rate of vocal cord buzzing. In general, precise variations of tone quality are obtained by movements of the tongue and lips altering the resonant characteristics of the filter system, thereby creating areas in which certain frequencies are boosted and others cut. The ranges in which frequencies are boosted are known as formant bands and, for a given sound, produce the sort of spectrum shown in Figure 6. In addition to the filtering characteristics of the vocal tract, the lips and epiglottis also impose dynamic amplitude characteristics on the emerging air column to produce the percussive attack transients common to explosive consonants such as 'p', 't', and 'k'.
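The formant-band idea can be sketched as a source-filter model: a train of narrow pulses (the 'vocal cords') passed through a cascade of two-pole resonators, one per formant band. The centre frequencies and bandwidths below are purely illustrative values, loosely in the region of an 'ah' vowel:

```python
# Source-filter sketch: pulse train through two formant resonators.
import math

def resonator(signal, centre_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator: one formant band of the tract filter."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2 * math.pi * centre_hz / sample_rate
    a1, a2 = -2 * r * math.cos(theta), r * r
    gain = 1 - r  # rough amplitude normalisation
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x - a1 * y1 - a2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out

SR = 8000
# 'Vocal cords': one second of a 100 Hz train of single-sample pulses.
source = [1.0 if n % (SR // 100) == 0 else 0.0 for n in range(SR)]
# Two illustrative formant bands shape the buzz into a vowel-like sound.
voiced = resonator(resonator(source, 700, 130, SR), 1100, 160, SR)
```

Changing the two centre frequencies moves the formant bands, which is exactly what moving the tongue and lips does to the spectrum of the mouthed "ah".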
Overall, then, the vocal tract may be regarded as a complex sound generator consisting of an amplitude- and frequency-controlled oscillator (vocal cords and lungs), a noise generator (lungs), and a set of formant filters (mouth and nasal cavities). These basic ingredients are available on commercial synthesisers, but the real problem in transposing them to speech synthesis lies in making the extremely complicated filter changes necessary to simulate speech. To synthesise speech by duplicating these filter changes, one has to be able to analyse these changes, and that's where linear predictive coding steps in.
In the case of LPC, speech is gathered by the usual ADC sampling technique and then encoded and compressed, but, rather than being compressed in the time domain (at the waveform level), an algorithm first transforms the data to the frequency domain, ie., sorting out the formant bands, the speech equivalents to the harmonics of the average musical waveform. The results of this analysis are essentially the data needed to describe the filtering characteristics of the upper vocal tract and, given a suitable vocal tract synthesiser and raw voiced and unvoiced sounds, would enable speech to be shaped. The data collected by linearly predicting the characteristic formants of a particular word are stored in RAM or ROM. Retrieval of the compressed speech data is then used to control a digital lattice filter being fed the digital version of white noise (raw unvoiced sound) and pulse waves (raw voiced sound), with the digitally-filtered output being sent to the usual combination of DAC, low-pass filter, and amplifier (Figure 7).
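A much-simplified sketch of the analysis step, assuming the classic autocorrelation method with the Levinson-Durbin recursion: the reflection coefficients it yields are the kind of parameters that drive a lattice filter of the sort just described (real chips quantise and store them frame by frame):

```python
# LPC analysis sketch: autocorrelation method + Levinson-Durbin recursion.
import math

def lpc_reflection_coeffs(frame, order):
    """Return the reflection coefficients for a lattice filter of given order."""
    N = len(frame)
    # Short-term autocorrelation of one frame of speech samples.
    R = [sum(frame[n] * frame[n - i] for n in range(i, N))
         for i in range(order + 1)]
    a = [0.0] * (order + 1)   # predictor coefficients, built up in place
    ks = []                   # reflection (lattice) coefficients
    err = R[0]                # prediction error energy
    for i in range(1, order + 1):
        acc = R[i] + sum(a[j] * R[i - j] for j in range(1, i))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        err *= 1 - k * k      # each stage reduces the residual energy
    return ks

# One 'frame' of a synthetic voiced sound: a sinusoid at 0.1 cycles/sample.
frame = [math.sin(2 * math.pi * 0.1 * n) for n in range(200)]
ks = lpc_reflection_coeffs(frame, 2)
```

Two coefficients suffice to capture this single resonance, which is why a handful of bytes per frame can describe what raw digitization spends thousands on.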
The important point about LPC is that it mirrors the way the vocal tract goes about speech synthesis remarkably accurately. However, the fact that each word or phrase requires an exclusive area of memory means that a large vocabulary requires an extensive library of ROMs (typically, 16 x 16K to store 3000 words). Fortunately, a compromise between the low quality, infinite word number of the phoneme method and the high quality, limited word number of the LPC method has been developed, and that's by using the magical ingredient of allophones. Like phonemes, allophones are basic speech components, but, in contrast to the 64 phonemes needed to build up words in the English language, 302 allophones are required, meaning that a single allophone is, in effect, a rather more basic component of speech than a single phoneme.
In fact, allophones are best viewed as modifiers of phonemes, and analysis of English speech suggests that 40 or so allophones can provide most of the phoneme modifications necessary for realistic speech. For example, the phoneme for "phh" is rounded and aspirated in 'poke'; rounded and unaspirated in 'spoke'; aspirated in 'pie'; unaspirated in 'spy'; slightly aspirated in 'taper'; released in 'appetite'; and unreleased in 'apt'. These acoustically different versions of 'p' — the so-called voiceless bilabial stops — are allophonic variations of the phoneme "phh". So, allophonic speech synthesis includes the majority of the subtle variations each phoneme can encompass and, therefore, provides much better quality than standard phonemic speech synthesis.
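To make the idea concrete, here's a toy set of context rules for choosing among allophones of 'p' of the kind listed above. The rules and names are invented for illustration; real text-to-allophone rule sets are far more elaborate:

```python
# Toy context-sensitive allophone selection for the /p/ phoneme.
# Rule names are hypothetical; real rule sets cover far more contexts.

def p_allophone(prev, following):
    """Pick an allophone of 'p' from its neighbouring sounds."""
    if prev == "s":
        return "p_unaspirated"    # as in 'spoke' or 'spy'
    if following in ("t", "k"):
        return "p_unreleased"     # a following stop, as in 'apt'
    return "p_aspirated"          # the default, as in 'poke' or 'pie'

print(p_allophone("s", "o"))  # p_unaspirated
```

A table of such rules, plus pitch modifiers for inflection, is essentially what the 7K rule store mentioned below has to hold.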
The cost saving of the allophone approach over an all-LPC method of synthesis reflects the very modest memory requirements of the former: just 3K to store the library of 128 allophones and 7K to hold a set of rules (such as Texas Instruments' 'Text-to-Speech') for translating text into allophonic equivalents, and for contouring inflections with the help of pitch modifiers to make the speech sound more natural. A key element in constructing speech with allophones is to make the transition from one to the other as smooth as possible whilst including enough prosody (tonal or syllabic accents) to make everything sound reasonably natural. Apart from what happens at the level of a given word, good speech quality also needs a smooth energy contour over the length of concatenated phrases, and that's no easy task! We'll see how well that's actually achieved in practice when we come to look at the speech facilities now available for most micros.
Finally, to show that it's not just computers, dolls, clocks, and Plasticene Resonators that can be made to speak, here's another historical perspective on speech synthesis: "(Alexander Graham) Bell's youthful interest in speech production also led him to experiment with his pet Skye terrier. He taught the dog to sit up on his hind legs and growl continuously. At the same time, Bell manipulated the dog's vocal tract by hand. The dog's repertoire of sounds finally consisted of the vowels 'a' and 'u', the diphthong 'ou' and the syllables 'ma' and 'ga'. His greatest linguistic accomplishment consisted of the sentence, "How are you, Grandmama?" which must have actually sounded like "ou a u ga ma ma". This, according to Bell, is the only foundation to the rumour that he once taught a dog to speak."