
[Header image: a close-up photo of several unravelling film reels intertwined with colourful ribbons in shades of yellow, green, and brown on a dusty floor.]

Transcribing oral history recordings with AI

I have recently been scoping a project to digitise a large collection of audio recordings made between the early 1970s and the late 1980s. Aside from the physical difficulties involved with magnetic open-reel tapes, such as “sticky-shed syndrome” (which often requires a tape to be literally baked under controlled conditions to temporarily restore its playability), part of that project is to provide transcriptions of each recording.

Transcriptions would enable content searchability across the entire audio archive, facilitating cataloguing. Moreover, they present an opportunity to leverage generative AI systems like ChatGPT or Claude for summarising and ‘chatting’ with transcribed material.

My past experience of transcription (around 2015) was a simple one: a pair of headphones, a folder of WAV recordings, and a Word document. Lots of typing, and lots of pressing pause. I tried many digital transcription tools, but they made so many mistakes that I soon gave up on them – correcting the errors often took as much time as, or more than, simply doing the work manually.

However, staring down at what, if the project is funded, could well be several thousand lengthy recordings, I felt that it was time to revisit automatic transcription in the age of AI.

The main problem with transcribing oral history recordings digitally is that, until recently, most of the software was aimed at the dictation market. Recordings that gave passable results usually featured a single, clear, English- or American-accented voice speaking at a steady, not-too-fast pace – quite far from the typical historical interview.

Here in Cornwall, especially back in the 1970s, the Cornish accent can, to the unaccustomed ear (or unaccustomed app), be difficult to understand. Sometimes recordings have multiple voices beyond those of the interviewer and their subject – some in the foreground and others in the background. Sometimes the microphone wasn’t as close as it could have been. Then there’s tape hiss, distortion, and a whole raft of potential audio artefacts requiring restoration. What chance does a computer have?

A quick bit of research led me to OpenAI’s Whisper technology, which was introduced in 2022.

OpenAI describes Whisper as “an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web”, claiming that “the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language”.

Intrigued by that claimed robustness to accents, I tested it on a recording of a Cornish centenarian born in the 1880s, interviewed later in life. Her distinctive and vibrant accent, peppered with dialectal expressions, would likely confound most systems.

Whisper is not just an app to download; it’s a sophisticated system that has to be installed and run from the command line on a computer with substantial memory and processing power. While some apps have integrated Whisper into a user-friendly interface, they lack granular control over how the audio is handled and the ability to provide prompts (natural-language guidance) akin to those used in ChatGPT or Copilot.
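To give a flavour of what running it involves, here is a minimal sketch using the open-source openai-whisper Python package – the model size and file name are illustrative assumptions rather than a recommendation:

    # Assumes: pip install openai-whisper (ffmpeg must also be on the system path)
    import whisper

    # "medium" balances accuracy against memory and speed; larger models are
    # more accurate but need considerably more RAM and time.
    model = whisper.load_model("medium")

    # "interview.wav" is a placeholder name for a digitised recording.
    result = model.transcribe("interview.wav", language="en")
    print(result["text"])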

Prompting Whisper is important. It’s a slightly different process from chatting with an LLM such as ChatGPT, but you can use the small context window of 224 tokens (about 198 words) to pass it information that guides the transcription. This could be used to ensure that a brand name has the correct spelling and capitalisation, or, in my case, to define some dialect words.
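By way of illustration, the open-source package accepts this guidance through an initial_prompt argument to transcribe(). A sketch, reusing the placeholder names above – the dialect terms are the ones from my test:

    import whisper

    model = whisper.load_model("medium")  # as in the earlier sketch

    # The initial_prompt primes Whisper's decoder with the spellings and
    # vocabulary it should prefer when the audio is ambiguous.
    dialect_prompt = "Cornish dialect terms: balmaidens, towser, bob a week."

    result = model.transcribe(
        "interview.wav",              # placeholder file name
        language="en",
        initial_prompt=dialect_prompt,
    )
    print(result["text"])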

Some lines from a test Whisper transcription before prompting:

For the bellmaidens breaking this rock.

We was waiting for that poor Bob Wake.
We had to live on that.

We used to make a toaster.
What they used to call a toaster, an apron.

After prompting with the dialect terms “balmaidens”, “towser”, and “bob a week”:

Were the Balmaidens breaking this rock?

We was waiting for that four bob a week.
We had to live on that.

We used to make a towser.
What they used to call the towser, an apron.

The prompted pass corrected the mistakes made in the first transcription, and used those Cornish dialect terms consistently throughout the remainder of the output.

The resulting complete transcript wasn’t perfect, but the prompt reduced the number of mistakes, making it a fairly quick task to read the text through and correct the remaining errors against the audio file. If you enable timestamps in the transcription, it’s quick to skip to the relevant portion of the recording.
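For anyone wondering where those timestamps come from, Whisper’s output includes per-segment start and end times in seconds, so a rough time index can be printed alongside the text – again only a sketch, reusing the placeholder names above:

    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe("interview.wav", language="en")

    # Each segment carries start/end times in seconds, which makes it easy
    # to jump back to the right point in the recording when checking a line.
    for segment in result["segments"]:
        minutes, seconds = divmod(int(segment["start"]), 60)
        print(f"[{minutes:02d}:{seconds:02d}] {segment['text'].strip()}")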

Using an M1 iMac with 16GB RAM, the 7-minute sample audio file took just under 2 minutes to transcribe.

The corrected transcription was then summarised into 100 words using ChatGPT (GPT-4). The summary was read and checked, and required no editing.
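I did the summarising in ChatGPT itself, but the same step could be scripted. A hedged sketch using the OpenAI API – the model name, file name, and exact wording of the instruction are assumptions standing in for what I did manually:

    # Assumes: pip install openai, with an OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    with open("interview_transcript.txt") as f:   # placeholder file name
        transcript = f.read()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Summarise this oral history transcript in about 100 words:\n\n" + transcript,
        }],
    )
    print(response.choices[0].message.content)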

Correctly prompted Whisper plus ChatGPT summarisation will prove an immensely powerful combination for oral history digitisation projects. Some projects will prioritise digitisation over transcription in order to preserve the recordings before the magnetic media deteriorates further and becomes unlistenable – and rightly so. Transcription often becomes a task delegated to volunteers, where they are available, and on large projects with hundreds or thousands of tapes it can take many years. Careful, considered use of AI, with plenty of checking and verification, can speed up the process of making digitised recordings more useful: transcripts make the audio searchable, and they and their summaries can be incorporated into catalogues.

AI will undoubtedly help to improve archive cataloguing, enabling the identification of interconnected themes and making searching even more rewarding. For larger projects with limited resources, systems such as these could make the difference between being viable or not. They could even unlock audio projects conducted in the past, where transcription was not possible at the time.

Contact me if you would like to discuss how AI might be able to help your project.