WhisperX

This module provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.

Options

  • model: string, identifier of the model to use, listed in ascending order of required (V)RAM:

    • tiny, tiny.en

    • base, base.en

    • small, small.en

    • medium, medium.en

    • large-v1

    • large-v2

    • large-v3

  • alignment_mode: string, alignment method to use:

    • raw: segments as identified by Whisper, without additional alignment

    • segment: improved segmentation using a separate alignment model; roughly equivalent to sentence-level alignment

    • word: improved segmentation using a separate alignment model; equivalent to word-level alignment

  • language: language code for transcription and alignment models. Supported languages:

    • ar, cs, da, de, el, en, es, fa, fi, fr, he, hu, it, ja, ko, nl, pl, pt, ru, te, tr, uk, ur, vi, zh

    • None: auto-detect the language from the first 30 seconds of audio

  • batch_size: number of samples to process at once; larger batches increase speed but also (V)RAM consumption
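As a sketch, the options above might be collected into a dictionary like the one below before being sent to the server. The option names follow the list above, but the value choices shown in the comments are illustrative; how the options are actually passed depends on the trainer file and server version.

```python
# Illustrative WhisperX option values (names taken from the list above).
options = {
    "model": "large-v2",       # tiny ... large-v3; ".en" variants are English-only
    "alignment_mode": "word",  # "raw", "segment", or "word"
    "language": None,          # None -> auto-detect from the first 30 seconds
    "batch_size": 16,          # larger = faster, but more (V)RAM
}

# Auto-detection is selected by leaving the language unset.
auto_detect = options["language"] is None
print(auto_detect)
```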

IO

Explanation of inputs and outputs as specified in the trainer file:

Input

  • audio (Audio): The input audio to transcribe

Output

The output of the model:

  • transcript (FreeAnnotation): The transcription
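To give an idea of how the transcript might be consumed afterwards, here is a minimal sketch. It assumes the FreeAnnotation is serialized as a JSON list of segments with "from"/"to" timestamps in seconds, a "name" field holding the text, and a confidence value; this schema is an assumption, so adjust it to the actual annotation format used by your framework.

```python
import json

# Hypothetical serialized FreeAnnotation content (assumed schema, see above).
sample = '[{"from": 0.0, "to": 2.5, "name": "hello world", "conf": 0.98}]'

# Parse the segments and print one line per segment with its time span.
segments = json.loads(sample)
for seg in segments:
    print(f'{seg["from"]:.2f}-{seg["to"]:.2f}: {seg["name"]}')
```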

Examples

Request

import requests
import json

# Describe the job: one audio input stream and one free-annotation output.
payload = {
  "jobID": "whisper_transcript",
  "data": json.dumps([
    {"src": "file:stream:audio", "type": "input", "id": "audio", "uri": "path/to/my/file.wav"},
    {"src": "file:annotation:free", "type": "output", "id": "transcript", "uri": "path/to/my/transcript.annotation"}
  ]),
  "trainerFilePath": "modules\\whisperx\\whisperx.trainer",
}

# Submit the job to a locally running server instance.
url = 'http://127.0.0.1:8080/process'
headers = {'Content-type': 'application/x-www-form-urlencoded'}
x = requests.post(url, headers=headers, data=payload)
print(x.text)