"We'll fix it in the bits..."
Digital voice editing
How to swing the on-screen scalpel
Few things benefit more from digital audio editing than that sublime vocal take which was almost (but not quite) perfect. Chris Meyer gets a George Bush speech in for major surgery, and prepares to wield the digital scalpel...
In the beginning, you simply had to get it right. Later, you rehearsed and recorded until you got it right. Even later than that, you might have tried to 'fix it in the mix'. But in this modern age of genetic engineering and digital technology, you can go one step further: you can edit the take you ended up with until you have a new, synthesised 'perfect take'. (Well, almost perfect. There are some performances so bad even eleven sacrifices to seven gods can't fix... But I digress.)
The one technological advance which has really made all this possible is the digital audio workstation. These days, this is usually a computer-based system running custom software offering a vast range of impressive editing options. Which is all very well. But new technologies demand new working methods. And with so many options to choose from, many newcomers to digital editing come to grief because they just haven't been able to work out a modus operandi that works for them and their material.
What follows, then, is a description of how I use a digital audio editor to clean up the worst narration tracks imaginable (usually, my own). Luckily, for those not yet making their own digital multimedia productions (which is a lot of us - Ed), most of this can be applied to sung and instrumental passages as well. I happen to use Passport's Alchemy on a Macintosh, but these techniques are transferable to virtually any program or platform.
Each step will more or less correspond to an increasing level of tweakiness; go as far as you feel comfortable. You may also develop a technique where you want to perform these in a different order, or balance several tweaks off each other interactively. Remember, this is art - do what works for you. Hopefully, not every take will require this much surgery in the first place.
First, record into your digital audio workstation the take or takes you want to work with. Also record a bit of 'silence' that is pretty representative of the ambient noise hanging around when no-one is singing or talking (this is often referred to as 'room tone'). Contrary to what your mother may have taught you, do not normalise your recordings to maximise their level just yet - this will make level-matching and some of the tweaks more difficult later on.
Next, listen to the takes to get a feel for the correspondence between the wiggles of the waveform on screen and the words that were spoken (the example we'll be working with is shown in Figure 1). It usually helps to highlight just portions of the take and listen, so as not to get lost in the middle. If the takes are long, go ahead and drop markers for important phrases or key points in the lyrics or script.
After you've become familiar with your raw material, get out your digital scissors and perform a rough edit to get one take which has the best-performed and matched versions of all the words and phrases, but which is otherwise flawed in the details. If you must, select the best sentences or phrases from different takes to build one 'Frankenstein' take. I try to resist the temptation to edit down to individual words; if you go too far, each word might be perfect, but the result might sound exactly like Frankenstein looked.
Next, edit out unnecessary words and utterances such as 'umms' and 'ahhs' or places where the speaker started again. Don't worry about editing right up against each word; we'll worry about timing and tightness later, and leaving a little space before and after now can save you some grief later.
I try to make all my cuts at same-direction zero-crossings in the waveform (where the wiggles cross from just below the centre to above it), choosing points where the general amplitude around the cut is the lowest (see Figure 2). This minimises the chance of audible glitches at cuts. Many programs have an automatic zero crossing function that will auto-correct your cuts to these points.
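For readers who like to see the idea spelled out, here is a minimal sketch of what an automatic zero-crossing function might do. It assumes the audio is a plain list of sample values centred on zero; the function names are made up for illustration and do not come from Alchemy or any other editor. It only snaps to the nearest upward crossing and does not try to pick the quietest spot, which you would still judge by eye and ear.

```python
def upward_zero_crossings(samples):
    """Indices where the wave passes from below zero to zero or above."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] < 0 <= samples[i]]


def snap_to_crossing(samples, cut):
    """Move a requested cut index to the nearest upward zero crossing."""
    crossings = upward_zero_crossings(samples)
    if not crossings:
        return cut  # nothing to snap to, so leave the cut where it is
    return min(crossings, key=lambda i: abs(i - cut))


# A tiny hand-made waveform, purely for demonstration.
wave = [0.5, 0.2, -0.3, -0.6, -0.1, 0.4, 0.7, 0.1, -0.2, -0.5, 0.3]
print(upward_zero_crossings(wave))
print(snap_to_crossing(wave, 7))
```

Using the same direction of crossing at both ends of a cut means the waveform carries on rising through zero across the join, rather than doubling back on itself.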
Some people prefer to leave a 'blending' option turned on, which performs a mini-crossfade (akin to a tape splice) at the cuts. This can help smooth over difficult cuts, but carries the risk of inducing phasing between old and new takes during the blend.
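As a rough illustration of what such a blend does under the bonnet, the sketch below overlaps the tail of one piece with the head of the next and fades linearly between them. The linear fade shape, the function name and the `fade_len` parameter are all assumptions for the example; real editors may use other curves and lengths.

```python
def splice_with_crossfade(a, b, fade_len):
    """Join a and b, overlapping the last fade_len samples of a
    with the first fade_len samples of b using a linear crossfade."""
    if fade_len == 0:
        return a + b  # a plain butt splice, no blending
    if fade_len > min(len(a), len(b)):
        raise ValueError("fade is longer than one of the pieces")
    out = a[:-fade_len]
    tail = a[-fade_len:]
    for k in range(fade_len):
        w = (k + 1) / (fade_len + 1)  # weight ramps up for b, down for a
        out.append(tail[k] * (1 - w) + b[k] * w)
    out.extend(b[fade_len:])
    return out


old_take = [1.0, 1.0, 1.0, 1.0]
new_take = [0.0, 0.0, 0.0, 0.0]
print(splice_with_crossfade(old_take, new_take, 3))
```

During the overlap both pieces are sounding at once, which is exactly where the phasing mentioned above can creep in if the two contain similar material.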
You should now have one take that has all the right words, and only the right words, in the right order. Save a copy under a new name, and let's start smoothing out those rough edges.
Now it's time to go after smaller noises that don't belong in our finished work. These may include too-loud breathing, lip smacks, and other clicks and pops that might be induced by electromechanical problems, such as a loose connector or some idiot who didn't notice that the 'record' light was on when he decided to get up out of his squeaky chair and leave the room.
The main job here is to isolate just these noises and chop them out. The main trick is not to remove any parts of the words themselves, and there are two ways to pull this off. One is to select and audition the word you want to keep just before or after the offensive noise, and reduce that selection until you find the area you simply cannot cut into. Then turn around and make the start of this selection the end of the selection for the noise you are about to remove. The second method is to just jump in, select the noise you are after, cut it out, audition the result, and if you blew it, pray that the magic Undo function will not let you down. (You can tell which I prefer.) Figure 3 shows the former method, used to remove the noise of a camera click during a pause.
Again, make your selections at those zero crossings, and avoid blending if you can, since it might blend back in parts of these noises (or blend out parts of words you want to keep).
Don't take out too little of the offending noise - you'll just end up with a smaller version of it - but at the same time, don't take out a millisecond more than you need. If the result sounds too artificial, consider leaving the blemish in. We are, after all, human (well, most of us are), and we do occasionally need to breathe and wet our lips. Save your work (under a new name, if you have the disk space) after each successful cut.
Okay, a bonus tip to those who bother to read more than the first paragraph in each section: you can remove some clipping artefacts this way as well. If only one or two spots on the waveform clipped (this is visibly noticeable as a flat top on the waveform at a positive or negative extreme), you can get rid of it by chopping out just one cycle of the waveform. Select the wave from the zero crossing before the clip to the matching zero crossing after, and cut. You'll be surprised how often you can get away with this little snip...
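Here is one way the snip might look in code, as a sketch only. It assumes a list of samples and that you have already spotted the index of a clipped sample; the function name is invented for the example. It walks back to the upward zero crossing before the clip, forward to the next upward crossing after it, and removes everything in between.

```python
def cut_clipped_cycle(samples, clip_index):
    """Remove the single waveform cycle that contains clip_index."""
    def is_upward_crossing(i):
        return samples[i - 1] < 0 <= samples[i]

    start = clip_index
    while start > 0 and not is_upward_crossing(start):
        start -= 1  # walk back to the crossing before the clip
    end = clip_index + 1
    while end < len(samples) and not is_upward_crossing(end):
        end += 1    # walk forward to the matching crossing after it
    return samples[:start] + samples[end:]


# Samples 2 and 3 are flat-topped at full scale.
clipped = [-0.2, 0.3, 1.0, 1.0, 0.3, -0.4, -0.2, 0.3, 0.6, 0.2, -0.3]
print(cut_clipped_cycle(clipped, 2))
```

Whether the missing cycle is audible depends on the material, so audition the result just as you would any other cut.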
The next step is going after annoying noises that are also, unfortunately, parts of the words, and therefore can't be removed. It's important to keep these noises in perspective. They belong; they're just louder than they should be, that's all. To cure them, we're going to apply a precise compressor/expander that happens to be hiding inside your audio editor: the simple gain-change command. Isolate the offending syllables in the same manner as you did clicks and smacks above; select just the noise in question and nothing more; make sure your selection begins and ends on zero crossings. Now, simply gain-reduce the selection. Figure 4 shows a before and after example.
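In code terms, a gain change is just a multiplication applied to the selected samples. The sketch below assumes the amount is given in decibels and uses the standard amplitude conversion, factor = 10^(dB/20); the function name and the idea of passing start and end indices are illustrative assumptions rather than any editor's actual command.

```python
def apply_gain(samples, start, end, gain_db):
    """Scale samples[start:end] by gain_db decibels, leaving the rest alone."""
    factor = 10 ** (gain_db / 20)  # -6 dB is roughly half the amplitude
    selection = [s * factor for s in samples[start:end]]
    return samples[:start] + selection + samples[end:]


word = [0.1, 0.8, -0.9, 0.7, 0.1]
quieter = apply_gain(word, 1, 4, -6.0)  # tame the loud middle by 6 dB
print(quieter)
```

Because only the selection is scaled, starting and ending it on zero crossings keeps the level change from creating a step in the waveform at the boundaries.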
Again, don't go overboard. Making these syllables too quiet will only require more editing later to bring them back up. Also, make sure there is no body of the word itself buried under all that hissing and popping. In the example we've been using, all of the "s"s in the word "consensus" slid together. In Figure 4, I actually selected the "sus" at the end and gain-reduced that; however, having a whole syllable attenuated altered the overall sound of the word too much. Remember to audition, and don't wreck a whole word in an attempt to tame a single fault.
And again, save after each successful edit. (Getting tired of me saying that? Get used to it. I'm going to remind you every step of the way. You'll thank me later.)
Now that we have all the individual words scrubbed clean, let's work on the rhythm between the words. If the track you're editing is a vocal to a song, chances are it already has pretty good pacing; at worst it may need a little tweaking to fit in the pocket. Narrators often don't have this guide to follow, and many don't think about it as they're speaking.
What we need to do is either shorten or lengthen the pauses between words until the overall phrasing sounds good. When choosing where to remove space, I tend to weight where I make my cut closer to the start of the next word than to the tail of the previous word (see Figure 5). This is because most sounds, including our voice, tend to have a faster 'attack' than 'decay' - the starts of words pop where the ends of words tend to trail off into silence (or room tone).
I still prefer to cut on zero crossings at low points in the overall amplitude, but blending also tends to work fine here. If you cut right up against the start of a word, blending can soften the attack of that word; try a crossfade shape similar to that in Figure 6. It allows the word to start properly, while dovetailing the room tone from the previous pause down to silence, masked by the word itself.
If you need to lengthen the pause between words, you'll now have use for that bit of room tone I suggested you record at the start. For rough edits, just using silence as a filler is fine until you home in on the timing you want; then go back and replace the silence with room tone of the same length (you can usually select and paste to fit). When removing glitches in the original, also consider replacing them with room tone instead of just cutting them out. In both cases, this is perhaps the best and safest place to use blending when you paste. Still, listen for possible phasing to be safe. If you stick to zero crossings, non-blended pastes usually work fine.
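A sketch of the 'paste to fit' idea follows, assuming the room tone and the narration are both plain lists of samples. The function names are hypothetical. If the recorded room tone is shorter than the gap, this version simply loops it, which is an assumption of the example; a very short loop can become audible as a pattern, so record more room tone than you think you need.

```python
def room_tone_fill(room_tone, length):
    """Return exactly `length` samples of room tone, looping it if needed."""
    if not room_tone:
        raise ValueError("no room tone recorded")
    repeats = -(-length // len(room_tone))  # ceiling division
    return (room_tone * repeats)[:length]


def lengthen_pause(samples, at, extra, room_tone):
    """Insert `extra` samples of room tone at index `at`."""
    return samples[:at] + room_tone_fill(room_tone, extra) + samples[at:]


tone = [0.01, -0.01, 0.02]
speech = [0.5, -0.4, 0.6, -0.5]
print(lengthen_pause(speech, 2, 5, tone))
```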
For me, timing is very important. Do not be shy about auditioning your changes over and over again. (And don't forget to save each variation you think you like.) A rhythm tends to emerge from even the simplest spoken narration, let alone while singing; tweaking the timing to reinforce this rhythm will make the final delivery that much stronger.
Almost as important as the timing between words is the timing within words. Each syllable is a rhythmic element to place in appropriate relationship to the others. In more obvious cases, a rushed syllable can make a word unintelligible; a dragged syllable can confuse listeners and make them stumble over your meaning.
Syllables inside words are not much harder to isolate than individual words in a sentence. Indeed, if you are new to the art of matching on-screen waveform squiggles to words from your throat, you will be surprised how often you might isolate individual syllables thinking they are whole words. As with the other tips above, make sure you don't grab too much or too little; audition often to be sure. Points of lowest amplitude are often the best to go for, although syllables run together, so sometimes you have to look for changes in the waveform instead (see Figure 7c). Again, have your selections start and end on same-direction zero crossings to try to minimise pops at the start and end.
Now use the time-stretch and compression facilities in your workstation or sample-editing program to alter the phrasing of the word. With my own speech patterns, I find I have to slow down syllables more often than I speed them up. This adds a lot of clarity to the enunciation. In any event, rarely should you have to use more than 25% change; the one exception might be extending the tail of a word that ended too quickly.
Keep an ear open for artefacts that might appear as a result of the time-stretching or compression process. With a good deal of the algorithms out there today, even a 10% change has the potential to start to sound choppy. Also, some algorithms may leave you with a click at the start or end; you might have to turn blending on for these edits.
Since we're onto really tricky territory here, don't forget to save copies before each attempt so you can recover from any excesses. Try to gauge how offensive these might be in the context of any other music tracks that might be going on at the same time to mask them. For example, you can usually get away with more on a backing-vocal track than on an unadorned narration.
We've played with the timing of your phrasing; the last bit of surgery to consider is pitch. With singing, the goal usually is to follow or set the pitch of the melody; any corrections you have to make will be tuning-related and should be very small (and might be made more complicated by singers realising they are off-pitch and correcting mid-note).
However, with voiceovers and narration, the relative pitch of words and syllables can change the meaning of a sentence from a question to a declaration. A small experiment. Say the phrase: "Kill the clown?"
Then say: "Kill the clown!"
Chances are you hit two or three distinctly different notes with the question, with the middle word being the lowest and the last ending on an upward slide of pitch. The declaration was probably in a different key and closer to a monotone, with a downward slide on the last word. (Well, that's the way I talk, anyway...)
Correcting these pitch anomalies is a bit trickier than correcting timing. But don't be discouraged from trying. Use the same rules as repeated ad nauseam above for selecting individual words or phrases, and try small amounts of the pitch-shift facility in your editor. If you have an option to keep the duration the same during these shifts, you might want to use it, lest you undo your timing work above. (In the end, audition both ways to see which sounds more realistic.)
Some programs even allow you to alter the pitch envelope of a selected region, to artificially impose or flatten out those upward and downward slides. Now you're into territory where even I fear to tread, but with care you can do amazing things. In our example, the word "people" ended with an anticipatory pitch-bend up; in Figure 8, I made it sound like a true ending by pitch-bending it down. With this type of correction, the duration changes that naturally result can actually help 'sell' the change.
Another trick I use is to change the pitch of an entire narration to add gravity to (or, alternatively, lighten) the weight of a voice. For example, I do not have a particularly resonant voice - especially when nervously performing voiceovers. So if I want to come off much more seriously than I do, I'll select my entire (edited) narration and pitch-shift the whole thing down a semitone or two, keeping the overall duration the same. You might even be tempted to use this trick to harmonise with yourself; I find this often works much better on sung vocals than on spoken passages.
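For the numerically curious, a shift in semitones corresponds to a frequency ratio of 2^(n/12) in equal temperament. The sketch below works that out, along with what happens to the length of a file if the shift is done by simple resampling without any duration correction. The function names are invented for the example, and how a given editor actually performs its shift is not something this sketch claims to know.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a pitch shift of `semitones` (negative = down)."""
    return 2 ** (semitones / 12)


def duration_after_resample(duration_s, semitones):
    """New length if the shift is done by plain resampling,
    that is, without any 'keep duration the same' option."""
    return duration_s / semitone_ratio(semitones)


for n in (-1, -2):
    print(n, round(semitone_ratio(n), 4), round(duration_after_resample(30.0, n), 2))
```

Dropping two semitones lowers every frequency by roughly 11%, which is why the duration-preserving option matters if you have already tweaked your timing.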
A word of warning, though: save a copy of what you think worked tonight and listen to it tomorrow; a really cool effect can sound fatally corny 24 hours later.
Only now, after you have torn apart your performance and put it back together, should you consider final clean-up duties. For me, this includes normalising the total recording to maximise signal-to-noise, using very selective EQ to notch out hum, maybe rolling out a bit of hiss, and compensating for response weirdnesses in the mic or room.
Be careful when going for clean-ups on the top or bottom - you'll run into harmonics of the vocal range far faster than you expect. For the male voice, going after any harmonic of hum above 180Hz starts to cut into the 'body' of a vocal. At the other end of the spectrum, shelving below 15kHz can start to cut into the overall 'air' of a vocal, even if there is hiss up there, too. Try to get by with a shelf filter or EQ shape rather than a narrow notch, if you can; narrow, deep notches have a way of ringing that will screw up your sound badly.
Here's another trick they don't teach you in Star Fleet Academy. Most filters and EQs - even digital ones - inflict phase shift and delay onto the signal they're processing. To offset this effect, quite often I'll reverse a sound file, EQ it at half the depth I was planning, reverse it again (now making it go straight), and EQ it one more time by the same amount. The result is zero phase delay, meaning you probably introduced fewer artefacts in the process.
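The reverse-and-repeat trick can be sketched with a deliberately simple stand-in for an EQ: a one-pole low-pass filter, chosen only because it fits in a few lines. The filter, its `alpha` setting and the function names are assumptions for the example. Running the same filter once in each direction applies its cut twice (which is why each pass is set to half the depth) while the delay of one pass cancels the delay of the other.

```python
def one_pole_lowpass(samples, alpha):
    """A very basic low-pass filter; it smears the signal later in time."""
    out = []
    prev = 0.0
    for s in samples:
        prev = alpha * s + (1 - alpha) * prev
        out.append(prev)
    return out


def zero_phase(samples, alpha):
    """Filter forwards, then backwards, so the two delays cancel out."""
    forward = one_pole_lowpass(samples, alpha)
    backward = one_pole_lowpass(forward[::-1], alpha)
    return backward[::-1]


# A single click in the middle of silence shows the difference clearly.
click = [0.0] * 101
click[50] = 1.0
single = one_pole_lowpass(click, 0.5)  # tail trails off after the click only
double = zero_phase(click, 0.5)        # spreads evenly either side of it
print(single[49], single[51])
print(double[49], double[51])
```

With the single pass, all the smearing lands after the click; with the two-pass version it is spread equally before and after, so the click itself stays put in time.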
We all know that boosting a frequency range will probably increase the overall amplitude of your signal, possibly resulting in clipping. But even cutting frequencies can shift harmonics around in ways that make peaks get sharper, also resulting in clipping. So, instead of normalising your soundfile, try leaving yourself 1-3dB of headroom before performing these EQ operations. Normalise after you're done, if you want. If you've been saving alternative versions of your soundfile after each edit, you'll at least have something to go back to if you blow it.
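Leaving headroom is easy to express as a peak target below full scale. The following sketch assumes samples in the range -1.0 to 1.0, with 1.0 as full scale, and a target expressed in decibels relative to that; the function name is made up for the example.

```python
def normalise_to(samples, target_db):
    """Scale so the loudest peak sits at target_db relative to full scale
    (0 dB = full scale, -3 dB leaves 3 dB of headroom)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence, nothing to scale
    factor = 10 ** (target_db / 20) / peak
    return [s * factor for s in samples]


take = [0.1, -0.25, 0.2]
with_headroom = normalise_to(take, -3.0)  # leave 3 dB spare before EQ
print(max(abs(s) for s in with_headroom))
```

After the EQ passes are done, calling the same function with a target of 0 dB would bring the peak back up to full scale.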
Yes, I know all this sounds incredibly tweaky and tedious. But so is practising and recording 357 takes. Not that you shouldn't practise and go for that best take; just like you can't polish a turd and really fix it in the mix, so digital editing is not a cure for negative amounts of talent. (Well, most of the time.)
But as I've moved from tape, a razor blade, and a mixing console to a hard disk, software, and mouse, these new tools have become as natural to me as the old ones. Add them to your arsenal, and you'll have that many more options available on your next project.
Feature by Chris Meyer