# WhisperX

This module provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
## Options

- `model`: string, identifier of the model to use, sorted ascending by required (V)RAM:
  `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large-v1`, `large-v2`, `large-v3`
- `alignment_mode`: string, alignment method to use:
  - `raw`: segments as identified by Whisper
  - `segment`: improved segmentation using a separate alignment model, roughly equivalent to sentence alignment
  - `word`: improved segmentation using a separate alignment model, equivalent to word alignment
- `language`: language code for the transcription and alignment models. Supported languages:
  `ar`, `cs`, `da`, `de`, `el`, `en`, `es`, `fa`, `fi`, `fr`, `he`, `hu`, `it`, `ja`, `ko`, `nl`, `pl`, `pt`, `ru`, `te`, `tr`, `uk`, `ur`, `vi`, `zh`. Set to `None` to auto-detect the language from the first 30 seconds of audio.
- `batch_size`: how many samples to process at once; increases speed but also (V)RAM consumption
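To make the constraints above concrete, here is a small sketch of a client-side helper that validates option values before building a request. This helper is not part of the module itself; its name and defaults are assumptions for illustration, while the allowed values are taken from the lists above.

```python
# Hypothetical helper (not part of WhisperX): validate option values
# against the sets documented above before sending them to the server.

WHISPERX_MODELS = {
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v1", "large-v2", "large-v3",
}
ALIGNMENT_MODES = {"raw", "segment", "word"}
LANGUAGES = {
    "ar", "cs", "da", "de", "el", "en", "es", "fa", "fi", "fr", "he", "hu",
    "it", "ja", "ko", "nl", "pl", "pt", "ru", "te", "tr", "uk", "ur", "vi", "zh",
}

def make_options(model="large-v2", alignment_mode="segment",
                 language=None, batch_size=16):
    """Return a validated option dict; language=None means auto-detect."""
    if model not in WHISPERX_MODELS:
        raise ValueError(f"unknown model: {model}")
    if alignment_mode not in ALIGNMENT_MODES:
        raise ValueError(f"unknown alignment_mode: {alignment_mode}")
    if language is not None and language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return {"model": model, "alignment_mode": alignment_mode,
            "language": language, "batch_size": batch_size}
```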
## IO

Explanation of inputs and outputs as specified in the trainer file:

### Input

- `audio` (Audio): the input audio to transcribe

### Output

- `transcript` (FreeAnnotation): the transcription
## Examples

### Request

```python
import json

import requests

payload = {
    # Unique identifier for this job
    "jobID": "whisper_transcript",
    # Input and output streams, serialized as JSON
    "data": json.dumps([
        {"src": "file:stream:audio", "type": "input", "id": "audio", "uri": "path/to/my/file.wav"},
        {"src": "file:annotation:free", "type": "output", "id": "transcript", "uri": "path/to/my/transcript.annotation"},
    ]),
    # Path to the trainer file on the server
    "trainerFilePath": "modules\\whisperx\\whisperx.trainer",
}

url = "http://127.0.0.1:8080/process"
headers = {"Content-type": "application/x-www-form-urlencoded"}

x = requests.post(url, headers=headers, data=payload)
print(x.text)
```
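For repeated calls it can help to separate payload construction from the POST itself. This is an illustrative sketch, not part of the module: the helper name and the split are assumptions, while the endpoint, form fields, and trainer path are taken verbatim from the request example above.

```python
import json

# Hypothetical helper (not part of WhisperX): build the /process payload
# for a given input audio file and output annotation path, mirroring the
# request example above.
def build_payload(audio_path, transcript_path):
    return {
        "jobID": "whisper_transcript",
        "data": json.dumps([
            {"src": "file:stream:audio", "type": "input",
             "id": "audio", "uri": audio_path},
            {"src": "file:annotation:free", "type": "output",
             "id": "transcript", "uri": transcript_path},
        ]),
        "trainerFilePath": "modules\\whisperx\\whisperx.trainer",
    }

# Usage (assumes the server from the example is running):
# import requests
# r = requests.post("http://127.0.0.1:8080/process",
#                   headers={"Content-type": "application/x-www-form-urlencoded"},
#                   data=build_payload("file.wav", "transcript.annotation"))
# print(r.text)
```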