WhisperX
This module provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
Options
model: string, identifier of the model to use, sorted in ascending order of required (V)RAM: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2, large-v3
alignment_mode: string, alignment method to use
  raw: Segments as identified by Whisper
  segment: Improved segmentation using separate alignment model. Roughly equivalent to sentence alignment.
  word: Improved segmentation using separate alignment model. Equivalent to word alignment.
language: language code for transcription and alignment models. Supported languages: ar, cs, da, de, el, en, es, fa, fi, fr, he, hu, it, ja, ko, nl, pl, pt, ru, te, tr, uk, ur, vi, zh
  None: auto-detect language from first 30 seconds of audio
batch_size: how many samples to process at once; larger values increase speed but also (V)RAM consumption
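The option values listed above can be checked before a job is submitted. The following is a minimal sketch only: `validate_options` is a hypothetical helper and not part of this module, and the requirement that batch_size be a positive integer is an assumption, since the text above does not state its type.

```python
# Allowed values copied from the option list above.
MODELS = [
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v1", "large-v2", "large-v3",
]
ALIGNMENT_MODES = ["raw", "segment", "word"]
LANGUAGES = [
    "ar", "cs", "da", "de", "el", "en", "es", "fa", "fi", "fr", "he", "hu", "it",
    "ja", "ko", "nl", "pl", "pt", "ru", "te", "tr", "uk", "ur", "vi", "zh",
]


def validate_options(model, alignment_mode, language=None, batch_size=1):
    """Hypothetical helper: return a list of problems (empty if none were found)."""
    problems = []
    if model not in MODELS:
        problems.append(f"unknown model: {model!r}")
    if alignment_mode not in ALIGNMENT_MODES:
        problems.append(f"unknown alignment_mode: {alignment_mode!r}")
    # None means: auto-detect the language from the first 30 seconds of audio.
    if language is not None and language not in LANGUAGES:
        problems.append(f"unsupported language: {language!r}")
    # Assumption: batch_size is a positive integer (bool is excluded explicitly).
    if isinstance(batch_size, bool) or not isinstance(batch_size, int) or batch_size < 1:
        problems.append(f"batch_size must be a positive integer, got {batch_size!r}")
    return problems


if __name__ == "__main__":
    for problem in validate_options("large-v2", "word", language="en", batch_size=16):
        print(problem)
```

Returning a list of problems instead of raising on the first one lets a caller report every invalid option at once.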
IO
Explanation of inputs and outputs as specified in the trainer file:
Input
audio (Audio): The input audio to transcribe
Output
The output of the model:
transcript (FreeAnnotation): The transcription
Examples
Request
import json

import requests

# Describe the job: one audio input and one free-annotation output.
payload = {
    "jobID": "whisper_transcript",
    "data": json.dumps([
        {"src": "file:stream:audio", "type": "input", "id": "audio", "uri": "path/to/my/file.wav"},
        {"src": "file:annotation:free", "type": "output", "id": "transcript", "uri": "path/to/my/transcript.annotation"},
    ]),
    "trainerFilePath": "modules\\whisperx\\whisperx.trainer",
}

# Send the job to the server as form-encoded data and print the reply.
url = "http://127.0.0.1:8080/process"
headers = {"Content-type": "application/x-www-form-urlencoded"}
response = requests.post(url, headers=headers, data=payload)
print(response.text)
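When several files are transcribed, the payload from the request above can be built by a small function. This is a sketch under assumptions: `build_payload` and its parameter names are hypothetical, and the field values are simply copied from the example request.

```python
import json


def build_payload(audio_uri, transcript_uri,
                  job_id="whisper_transcript",
                  trainer_file_path="modules\\whisperx\\whisperx.trainer"):
    """Hypothetical helper: build the form payload used in the request example."""
    data = [
        # Input: the audio stream to transcribe.
        {"src": "file:stream:audio", "type": "input", "id": "audio", "uri": audio_uri},
        # Output: the free annotation that receives the transcript.
        {"src": "file:annotation:free", "type": "output", "id": "transcript", "uri": transcript_uri},
    ]
    return {
        "jobID": job_id,
        "data": json.dumps(data),  # the entry list is sent as a JSON string
        "trainerFilePath": trainer_file_path,
    }


if __name__ == "__main__":
    payload = build_payload("path/to/my/file.wav", "path/to/my/transcript.annotation")
    print(payload["data"])
```

The returned dictionary can be passed as `data=payload` to `requests.post`, exactly as in the request example.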