The Texas TMS5220 Speech Chip comes under scrutiny.
Chip Chat is an occasional series that will make an appearance whenever there's a new chip with relevance to micro music that looks interesting. This month, we're looking at one of the new breed of speech chips, the Texas TMS5220, a device that's achieved some degree of notoriety on account of its ability to synthesise Kenneth Kendalls out of thin air.
Texas Instruments have been responsible for many of the advances in speech synthesis technology, particularly in the field of linear-predictive coding (LPC). Commercial evidence of this is provided in the shape of the (in)famous 'Speak & Spell' and 'Speak & Maths' educational toys, devices which use the TMS5100 4-bit speech synthesiser (incidentally, also the basis of E&MM's Wordmaker project, back in June 1981). This particular chip is designed for low-cost devices, but the 8-bit version, the TMS5200, offers much higher quality (because of the greater resolution gained from 8 bits) and general compatibility with 8-bit microprocessors.
The TMS5220 is basically an advanced version of the TMS5200, but with the important addition of allophone synthesis capability. Apart from it being the chip that's responsible for making the BBC Micro sound more like a Hereford human than a Detroit Dalek, the TMS5220 is also finding its way into other micros (the Echo II speech synthesiser for the Apple, for instance) and even into cars (the Austin Maestro).
Speech chips like the TMS5100, 5200, and 5220 operate from LPC data, which represents a 100:1 data compression of the information in spoken words. You'll recall from the article on speech synthesis in October's E&MM that this compression is derived from analysis of formant bands, the vocal equivalent to harmonics. Whilst this data could reach the speech chip from the micro's memory space, there are certain advantages in having the data come from a special speech ROM (what Acorn have dubbed a 'PHROM', short for PHrase ROM).
In fact, with the 5100 and 5200 speech chips, this is the only way the chips can get their daily diet of LPC data. The 5220, on the other hand, goes several steps further by offering a choice of high quality LPC synthesis, medium quality allophone synthesis, or a combination of both, and all with a minimum of supervision from the processor.
What's gained from allophone synthesis is more accurate speech than that generated with low-cost phoneme systems (like the Votrax SC-01 and National Semiconductor's 'Digitalker'), which can neither mesh neighbouring sounds smoothly nor provide all the sound variations necessary to model the intricacies of the vocal tract. However, allophones aren't exactly the pinnacle of achievement, because speech gained from stringing together allophones is a compromise between the greater naturalness of a complex and costly standard LPC system and the cruder, more mechanical effect of a phoneme system.
However, such apparent limitations of allophonic synthesis may have more to do with the way in which one applies the allophones than any inherent limitation in the principle of allophone synthesis. What's also worth remembering about the 5220 chip is that the combined allophonic-LPC mode of operation enables virtually any degree of speech synthesis to be obtained by simply adjusting the mix between the two.
In the case of LPC-based vocabularies that are pre-programmed in word ROMs, a single 5220 chip can access up to 16 ROMs, each of 16K, to provide something in the region of 30 minutes of continuous speech without the repetition of a single word - an operation that makes more sense when dreaming up advertising copy than in reality!
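That half-hour figure stands up to a rough check. Assuming, on my part, that '16K' means 16K bytes per ROM, and taking an LPC rate of about 1200 bits per second (the lower end of the range quoted later in this article):

```python
# Rough estimate of continuous speech from a full complement of speech ROMs.
# Assumes 16 ROMs of 16K bytes each and an LPC rate of ~1200 bits/s;
# both figures are illustrative, not taken from a data sheet.
ROM_COUNT = 16
ROM_BYTES = 16 * 1024
LPC_BITS_PER_SECOND = 1200

total_bits = ROM_COUNT * ROM_BYTES * 8
seconds = total_bits / LPC_BITS_PER_SECOND
print(f"{seconds / 60:.1f} minutes of speech")  # 29.1 minutes of speech
```

Call it 'something in the region of 30 minutes', just as the advertising copy would have it.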
Allophonic synthesis makes much more modest memory requirements: 3K to hold a 128 allophone library, 7K or so to store a set of rules (650 in the case of TI's 'Text-20 Speech') for text-to-allophone translation, and a small amount of memory for an algorithm to string together words from the rules and library, and also to 'naturalise' the resulting speech with smoothing parameters and intonation adjustments.
The library of 128 allophones and 650 rules takes the place of the unlimited ROM library that would otherwise be needed for an unlimited vocabulary. But, like any system with rules and regulations, a combination of letters will occasionally take a perverse delight in breaking the rules of text-to-allophone translation. What results is a mispronunciation, but, as 92% of text follows the 650 rules quite adequately, that's no great disaster. To obtain greater accuracy, more rules and a larger vocabulary can be added, though manual intervention in the speech construction program may be more cost-effective.
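To give a flavour of how such letter-to-sound rules work, here's a minimal sketch in Python. The rules and allophone names are invented for illustration and bear no relation to TI's actual rule set:

```python
# Context rules are tried longest-match-first down the list; a real
# text-to-allophone system would have hundreds of these plus defaults.
RULES = [
    ("igh", ["AY"]),   # as in 'night' -> N-AY-T
    ("th", ["TH"]),
    ("sh", ["SH"]),
    ("a", ["AE"]),
    ("n", ["N"]),
    ("i", ["IH"]),
    ("g", ["G"]),
    ("t", ["T"]),
]

def to_allophones(word):
    """Translate a word to an allophone string using the rule table."""
    out, i = [], 0
    while i < len(word):
        for pattern, allos in RULES:
            if word.startswith(pattern, i):
                out += allos
                i += len(pattern)
                break
        else:
            i += 1  # no rule matched: a real system would need a fallback
    return out

print(to_allophones("night"))  # ['N', 'AY', 'T']
```

A word like 'night' is exactly the sort of case that defeats naive letter-by-letter translation, which is why the rules match multi-letter contexts before single letters.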
In fact, recent developments on the allophone front have resulted in the original complement of 128 allophones being replaced with 302. Whilst this effectively means more variations on the same theme (more subdividing of the traditional phonemic approach), the ordering of such a large number of speech variants inevitably makes the construction of text-to-speech rules even more complicated. According to John Horton, who heads R&D at Acorn Computers, this is the main reason why their so-called 'allophone project' for the BBC Micro has been so slow in getting off the ground.
After the input text has been converted to its equivalent allophonic strings, the speech construction program changes the strings into a stream of LPC data (Figure 1). The TMS5220 decodes this data to control a time-varying digital filter that emulates the upper vocal tract. Digital representations of voiced and unvoiced sounds pass through the filter to be formed into words. The digital output from the filter then passes to an 8-bit DAC that produces the final analogue version of the speech.
One of the nice things about the 5220 is that it requires only minimal control from the host processor. Commands have to be passed to the 5220 to initiate specific activities, but the processor isn't itself involved in these activities. The list of available commands (Table 1) totals just six: Reset, Load Address, Speak, Read & Branch, Speak External, and Read Byte. The two commands of particular interest to us are Speak, which initiates speech from phrase data stored in an external ROM, and Speak External, which initiates the allophone-stringing mode. A typical system plan for interfacing the 5220 to a host processor is shown in Figure 2. Apart from the usual eight bidirectional data lines, there are also four control lines (READ SELECT (RS) and WRITE SELECT (WS) on the input, and READY and INTERRUPT on the output). Going to the 6100 ROMs, there are four address lines (ADD1, ADD2, ADD4, ADD8), two control lines (M0 and M1) and a synchronised clock (ROMCLK) (Figure 4).
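The write handshake described here (present a byte on the data lines, pull WS low, wait for READY) is simple enough to sketch in software. Everything in this sketch is illustrative: the bus helper is a stand-in for whatever port I/O a real host would use, and the opcode values are assumptions to be checked against TI's data sheet:

```python
# Illustrative opcode values only; consult the TMS5220 data sheet.
SPEAK = 0x50
SPEAK_EXTERNAL = 0x60

class MockBus:
    """Toy stand-in for real port I/O, so the handshake can be exercised."""
    def __init__(self):
        self.latched = None
        self._ready_low = False
    def write_data_lines(self, byte):
        self.latched = byte
    def set_ws(self, low):
        if low:
            self._ready_low = True   # pretend the chip latches the byte at once
        else:
            self._ready_low = False  # READY returns high when WS is released
    def ready_is_low(self):
        return self._ready_low

def send_command(bus, opcode):
    """Latch one command byte into the 5220's command register."""
    bus.write_data_lines(opcode)     # present the byte on D0-D7
    bus.set_ws(low=True)             # strobe WRITE SELECT
    while not bus.ready_is_low():
        pass                         # spin until the 5220 signals completion
    bus.set_ws(low=False)

bus = MockBus()
send_command(bus, SPEAK_EXTERNAL)
print(hex(bus.latched))  # 0x60
```

The point of the READY line is exactly what the article goes on to describe: once the byte is latched, the processor is free to get on with other work.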
Four on-chip registers handle all input and output of data: a 128-bit FIFO (First In First Out) buffer register and a command register receive inputs whilst a data register and status register hold outputs. When the WS line goes low, inputs are directed either to the FIFO buffer (in the case of a Speak External command) or to the command register. Once data is latched in a register, the 5220 lowers its READY line to signal to the processor that the data transfer is complete, thereby releasing the processor for other activities.
However, the processor can still keep track of the 5220's operations by checking its status bits and the INTERRUPT output. Upon receipt of a READ SELECT input, the 5220 delivers its status bits to the host processor over the data lines. The status bits indicate whether the 5220 is speaking (talk status), or whether its FIFO buffer is less than half full.
The 5220's FIFO buffer holds 16 bytes of speech data, which is about 50ms of speech sound (equal to two so-called 'frames' of speech) when operated at a system clock rate of 160 kHz. Synchronisation of the data transfer between the host processor and the 5220 may be accomplished using either interrupts or software polling of the 5220's status register. The FIFO control logic generates an interrupt to the processor when the number of bytes remaining in the buffer falls to eight or less, indicating to the processor that more data is needed.
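The buffer-and-interrupt arrangement can be modelled in a few lines. This is a toy model of the behaviour described above, not a cycle-accurate one:

```python
from collections import deque

class SpeechFIFO:
    """Toy model of the 5220's 16-byte FIFO and its half-empty interrupt."""
    CAPACITY = 16

    def __init__(self):
        self.buf = deque()

    def write(self, byte):
        """Host side: push one byte of Speak External data."""
        if len(self.buf) < self.CAPACITY:
            self.buf.append(byte)

    def consume(self):
        """Synthesiser side: pull one byte of LPC data."""
        return self.buf.popleft() if self.buf else None

    @property
    def interrupt(self):
        """Asserted when 8 or fewer bytes remain, asking the host for more."""
        return len(self.buf) <= 8

fifo = SpeechFIFO()
for b in range(16):
    fifo.write(b)
print(fifo.interrupt)   # False: buffer full, host can do other things
for _ in range(8):
    fifo.consume()
print(fifo.interrupt)   # True: down to 8 bytes, time to refill
```

At roughly 16 bytes per 50ms of speech, the half-empty threshold gives the host about 25ms of grace in which to respond, which is why the servicing overhead stays so low.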
(Table 2 footnote: with the Repeat bit, a frame totals 50 bits; Energy = 1111 is the stop code.)
The typical time needed by the processor to service a FIFO interrupt request for data is less than 500 microseconds. At 20 interrupts per second, something in the region of 1% of available processing time will be required for servicing speech synthesis after a Speak External command. In practice, high quality allophone-stringing may require as many as 30-35 interrupts per second, but the correspondingly increased data transfer overhead incurred in keeping the FIFO buffer adequately refreshed can be reduced by arranging for the 5220 to work directly with a 6100 ROM for part of the time - the combined allophonic-LPC mode of operation.
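Those figures are easy to verify with a little arithmetic:

```python
# Checking the article's 1% figure: 20 interrupts per second, each
# serviced in at most 500 microseconds.
interrupts_per_second = 20
service_time_s = 500e-6

load = interrupts_per_second * service_time_s  # CPU seconds used per second
print(f"{load:.0%} of processing time")        # 1% of processing time

# The worst case quoted for high-quality allophone stringing:
print(f"{35 * service_time_s:.2%} at 35 interrupts/s")  # 1.75% at 35 interrupts/s
```

Even the worst case is under 2%, so the real motivation for the combined allophonic-LPC mode is reducing the volume of data the host must shuffle, rather than raw processor time.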
In either mode of operation, a certain amount of data has to be put in to get a certain amount of speech out. However, in the LPC mode, the 5220 must operate at a higher bit rate (1200 to 1700 bits per second) than in the allophonic mode (400 to 600 bits per second) to attain the best speech quality inherent in each method of synthesis. Also, the closer the input bit rate is to the mode's optimum, the more closely the synthesised speech resembles a natural human voice.
Either way, the data that emerges from the digital filter imitation of the human vocal tract has an 8kHz sampling rate which, after passing through the DAC and low-pass filter, enables speech to be synthesised with a maximum bandwidth of around 3.5kHz. When synthesising speech from LPC data, the 5220 joins together 'frames' of speech data. A single frame represents the amount of data needed to specify the essential components for 25ms of speech.
Thus, LPC uses a 40Hz input frame rate for obtaining word data from ROM. The maximum of 50 bits in each frame (Table 2) defines the excitation (the quality of voiced/unvoiced input) and filter characteristics that are linearly interpolated every 3.125ms to produce smoothly-varying speech. Each 50-bit frame is composed of 13 parameters:
1 Energy (amplitude, 4 bits)
2 Pitch (fundamental frequency, 6 bits)
3 Repeat bit (repeated synthesis from a particular frame of data)
4 Ten reflection coefficients (K1 and K2, 5 bits each; K3-K7, 4 bits each; and K8-K10, 3 bits each)
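Adding up the widths listed above confirms the 50-bit maximum, and also gives the peak data rate at the 40Hz frame rate:

```python
# Bit budget for one maximal LPC frame, using the widths given above.
ENERGY, PITCH, REPEAT = 4, 6, 1
K_WIDTHS = [5, 5] + [4] * 5 + [3] * 3   # K1-K2, K3-K7, K8-K10

frame_bits = ENERGY + PITCH + REPEAT + sum(K_WIDTHS)
print(frame_bits)                        # 50

# Peak data rate at the 40Hz frame rate:
print(frame_bits * 40, "bits per second")  # 2000 bits per second
```

That 2000 bits per second is a peak, not an average: it sits above the 1200-1700 bits per second quoted earlier because not every frame is maximal (repeated and unvoiced frames carry fewer parameters).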
Turning our attention to the speech synthesiser side of the 5220 (Figure 4), these LPC parameters of speech feed serially from either the external ROM or the FIFO buffer into an input register. Here, the data are 'unpacked' and several tests performed to determine whether the repeat bit is set, the pitch is zero (signifying unvoiced speech only), or the energy is zero (set by the stop code of 1111). The unpacked data are then stored in the coded-parameter RAM and serve as index values for selecting appropriate values from the parameter look-up ROM.
Outputs from the look-up ROM are target values that the interpolation logic must reach in one 25ms frame period. During each of the eight 3.125ms interpolation intervals making up one frame period, the interpolation logic generates pitch and energy parameters for the noise and pulse wave generators, as well as the filter-excitation sequence and reflection parameter values for the lattice filter.
The reflection coefficients, K1 to K10, define the nature of the vocal tract modelling, and specifically reflect the linear predictive analysis originally carried out on speech data. The ten K parameters have been made pitch-dependent at the top end of the frequency range to give more natural female or higher pitched speech. These K parameters can also be manipulated to produce musical effects.
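For the curious, the filter that those K parameters control can be sketched in a few lines of Python. This is a textbook LPC lattice synthesis structure, not a copy of the 5220's internal design, which works in fixed point with the coefficients interpolated every 3.125ms:

```python
def lattice_synthesise(excitation, k):
    """All-pole lattice synthesis filter driven by reflection coefficients.

    k holds K1..Kn as floats between -1 and 1 (values inside that range
    guarantee a stable filter); excitation is the pulse/noise input.
    """
    n = len(k)
    b = [0.0] * (n + 1)   # backward prediction errors (the filter's state)
    out = []
    for x in excitation:
        f = x                             # forward error enters at stage n
        for i in range(n - 1, -1, -1):    # work down through the stages
            f = f - k[i] * b[i]
            b[i + 1] = b[i] + k[i] * f
        b[0] = f                          # stage 0 output is the speech sample
        out.append(f)
    return out

# A unit impulse excites the filter; the ringing that follows is the
# modelled vocal tract's impulse response.
response = lattice_synthesise([1.0] + [0.0] * 7, [0.5, -0.3, 0.2])
print(response[0])   # 1.0
```

In the chip itself, the excitation is either the pitch-controlled pulse generator (voiced sounds) or the noise generator (unvoiced sounds), and the K values arrive fresh from the interpolation logic eight times per frame.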
In addition, the pitch of the voiced excitation input to the filter can actually follow a well-tempered scale over 1.5 octaves, from 260Hz (middle C) to 696Hz (the F an 11th above). These features help to improve the naturalness and musicality of the synthetic speech and also offer the intriguing prospect of using the 5220 for singing.
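A quick check shows the quoted endpoints agree with an equal-tempered scale (C up to the F an 11th above is 17 semitones, which is a shade over the '1.5 octaves' quoted):

```python
# Equal-tempered check of the 5220's quoted voiced pitch range.
base = 260.0                    # the article's figure for middle C
semitones = 17                  # C up to the F an 11th above
top = base * 2 ** (semitones / 12)
print(f"{top:.0f} Hz")          # 694 Hz, within rounding of the quoted 696 Hz
```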
So how do the complex technicalities of LPC speech synthesis fit into the world of the average computer musician? Well, one hint of possible musical avenues is provided by the previous paragraph: the idea of changing the pitch of the voiced input to the vocal synthesiser to simulate singing. On top of that, the mass of control parameters used to shape the synthetic vocal tract can be manipulated in ways that are wholly beyond the average human being. For instance, the American composer Charles Dodge has taken standard speech synthesis techniques and applied them to more creative ends in a piece called 'Speech Songs'. By altering the natural resonance, pitch, contour, and speed of the voice, Dodge has been able to bend the synthetic voice in directions that are alternately musical, frightening, and downright whacky.
In principle, the set-up of the LPC speech synthesiser is similar to the traditional analogue synthesiser, with harmonically-rich inputs to filters and various control points scattered here and there. The big difference is that you're working digitally the whole time up until the DAC comes into the picture. In fact, LPC speech synthesis is a practical implementation of one of the great white hopes in computer music: the real-time digital filter. The problem is that the 12-stage filter in the 5220 chip works at a bandwidth of 3.5kHz, which is way below what's required for music. To step this up and, at the same time, provide a frame rate that allows filtering changes more rapid than every 25ms, needs a corresponding step up in the level of technology. That, of course, is still to come.
Feature by David Ellis