
Audio Transcription


Audio transcription is an essential task for applications such as voice assistants, podcast search, and video captioning. There are numerous open-source and commercial tools for audio transcription, and it can be difficult to know which one to use. OpenAI's Whisper is often people's go-to choice, but there are nine different models to choose from, with different sizes, speeds, and costs.

In this example, we'll use Zeno to compare the performance of the different models on the Speech Accent Archive dataset, which contains recordings of over 2,000 speakers from around the world reading the same English paragraph. We'll use the dataset to evaluate how the different models perform across accents and English fluency levels.

Dependencies

Let's start by installing the required dependencies for this project:

pip install jiwer pandas openai-whisper zeno-client torch transformers tqdm

Additionally, we'll need ffmpeg to run this example. You can test if it is installed by running ffmpeg --help. If it is not found, you should install it through your package manager. For example, if you are using conda, you can run the following (other package managers such as brew and apt work as well):

conda install ffmpeg
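
On macOS with Homebrew or on Debian/Ubuntu with apt, the equivalent commands are:

brew install ffmpeg
sudo apt install ffmpeg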

Imports

After this is all set up, we can now start running our analysis code and uploading data to Zeno. We'll first import relevant libraries which we're going to use:

from jiwer import wer
import os
import pandas as pd
import whisper
import zeno_client
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import requests
from io import BytesIO
import wave
import struct
from tqdm import tqdm

tqdm.pandas()
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

Loading Metadata

We'll evaluate the models on raw audio, but we'll use a metadata file with additional information about each audio file.

df = pd.read_csv("speech_accent_archive.csv")
df["data"] = "https://zenoml.s3.amazonaws.com/accents/" + df["id"]

Adding New Features

In Zeno, we'll often want to enrich our metadata with some extra fields that might be useful for our analysis. In this case, we are going to add the peak amplitude and length of each audio snippet as additional metadata fields:

# Define the function to get amplitude and length
def get_amplitude_and_length_from_url(url):
    # Download the WAV file content from the URL
    try:
        response = requests.get(url)
        response.raise_for_status()

        # Use the BytesIO object as input for the wave module
        with wave.open(BytesIO(response.content), 'rb') as wav_file:
            frame_rate = wav_file.getframerate()
            n_frames = wav_file.getnframes()
            n_channels = wav_file.getnchannels()
            sample_width = wav_file.getsampwidth()
            duration = n_frames / frame_rate

            frames = wav_file.readframes(n_frames)
            if sample_width == 1:  # 8-bit audio
                fmt = '{}B'.format(n_frames * n_channels)
            elif sample_width == 2:  # 16-bit audio
                fmt = '{}h'.format(n_frames * n_channels)
            else:
                raise ValueError("Only supports up to 16-bit audio.")

            frame_amplitudes = struct.unpack(fmt, frames)
            # Normalize the peak amplitude by half the sample range (e.g. 32768 for 16-bit audio)
            max_amplitude = max(frame_amplitudes)
            max_amplitude_normalized = max_amplitude / float(2 ** (8 * sample_width - 1))

            return max_amplitude_normalized, duration
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None, None

def apply_get_amplitude_and_length(row):
    url = row['data']  # The audio URL is stored in the 'data' column
    amplitude, length = get_amplitude_and_length_from_url(url)
    return pd.Series({'amplitude': amplitude, 'length': length})

df[['amplitude', 'length']] = df.progress_apply(apply_get_amplitude_and_length, axis=1)
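
As a quick spot-check, you can run the function on a single file; for 16-bit audio the peak is divided by 32768, so values fall between 0 and 1:

amplitude, length = get_amplitude_and_length_from_url(df["data"].iloc[0])
print(f"peak amplitude: {amplitude:.2f}, duration: {length:.1f}s")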

Create a Zeno Project

We can now upload our data to a Zeno project. You will need your ZENO_API_KEY here, which you can generate by clicking on your profile at the top right to navigate to your account page.

Once you have your API key, you can authenticate with the Zeno client and create a project as follows:

client = zeno_client.ZenoClient(YOUR_API_KEY)

project = client.create_project(
    name="Audio Transcription Accents Evaluation",
    view="audio-transcription",
    description="Comparison of multiple audio transcription models",
    metrics=[
        zeno_client.ZenoMetric(name="avg wer", type="mean", columns=["wer"])
    ]
)
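
Creating the project only registers it with Zeno; the dataset itself still needs to be uploaded. A minimal sketch using the client's upload_dataset method, assuming the id, data, and label columns we prepared above:

# Upload the metadata dataframe: "data" holds the audio URLs, "label" the reference transcripts
project.upload_dataset(df, id_column="id", data_column="data", label_column="label")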

We've already added a metric to our project that will help us track the average word error rate of different systems.

You can click on the link provided in the output to start exploring your data!

Running Inference and Uploading Results

We can already look at our data in Zeno, but we'd like to start evaluating the output transcriptions of models. Let's run inference for some of the popular OpenAI Whisper models.

# Define which models to run inference on
models = ["medium.en", "large-v1", "large-v2", "large-v3", "distil-medium.en", "distil-large-v2"]
os.makedirs("cache", exist_ok=True)

# Load inference results from the cache, or run inference for each model and add the data to a dataframe.
df_systems = []
for model_name in models:
    try:
        df_system = pd.read_parquet(f"cache/{model_name}.parquet")
    except FileNotFoundError:
        df_system = df[["id", "data", "label"]].copy()

        if "distil" in model_name:
            # Distil-Whisper models are loaded through the Hugging Face pipeline
            model_id = "distil-whisper/" + model_name
            model = AutoModelForSpeechSeq2Seq.from_pretrained(
                model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
            )
            model.to(device)

            processor = AutoProcessor.from_pretrained(model_id)
            pipe = pipeline(
                "automatic-speech-recognition",
                model=model,
                tokenizer=processor.tokenizer,
                feature_extractor=processor.feature_extractor,
                max_new_tokens=128,
                chunk_length_s=15,
                batch_size=16,
                torch_dtype=torch_dtype,
                device=device,
            )
            df_system["output"] = df_system["data"].progress_apply(lambda x: pipe(x)["text"])
        else:
            # The original Whisper checkpoints are loaded through the openai-whisper package
            whisper_model = whisper.load_model(model_name)
            df_system["output"] = df_system["data"].progress_apply(
                lambda x: whisper_model.transcribe(x)["text"]
            )

        # Compute the word error rate against the reference transcript and cache the results
        df_system["wer"] = df_system.progress_apply(lambda x: wer(x["label"], x["output"]), axis=1)
        df_system.to_parquet(f"cache/{model_name}.parquet", index=False)
    df_systems.append(df_system)

You can see that we also calculate the word error rate (WER) for each model, a common metric for evaluating transcription models.
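
For intuition, WER is the number of word-level substitutions, insertions, and deletions needed to turn the model output into the reference transcript, divided by the number of reference words. A minimal example with jiwer:

# One substitution ("dog" for "fox") over four reference words -> WER = 0.25
print(wer("the quick brown fox", "the quick brown dog"))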

We can now upload these results to our project:

for i, df_system in enumerate(df_systems):
    project.upload_system(
        df_system[["id", "output", "wer"]], name=models[i], id_column="id", output_column="output"
    )

Conclusion

If you've followed this example, you should now have a fully populated Zeno project. Filtering for high-WER samples using the histograms on the left is a great starting point for finding the limitations of these state-of-the-art models!