Whisper can be used with a WhisperProcessor to preprocess audio inputs and postprocess output text. To transcribe speech in an audio file, run `whisper audio.flac audio.mp3 audio.wav --model medium`, or use the Python API to load the model and perform transcription on your own data.
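For example, a minimal sketch using the openai-whisper Python package (assuming it is installed and that an `audio.mp3` file exists locally) might look like:

```python
import whisper

# Load the medium multilingual checkpoint (weights are downloaded on first use).
model = whisper.load_model("medium")

# Transcribe a local audio file; the returned dict contains the decoded text
# along with per-segment timestamps.
result = model.transcribe("audio.mp3")
print(result["text"])
```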
However, because Whisper is trained on large-scale noisy data, its predictions may include text that is not actually spoken in the input audio (i.e., hallucinations). This likely happens because, given their general knowledge of language, the models combine trying to predict the next word with trying to transcribe the audio itself. The models also perform unevenly across languages and exhibit disparate performance on different accents and dialects; lower-resource and/or lower-discoverability languages may suffer more from hallucinations and repetitive text generation.
Despite these limitations, Whisper's transcription capabilities can be used to improve accessibility tools for people with hearing impairments or language barriers. The models do not perform real-time transcription out of the box, but their speed and size suggest that near-real-time speech recognition and translation applications can be built on top of them.
In terms of broader implications, while we hope the technology will primarily be used for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, since the models' speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, the models may have some ability to recognize specific individuals out of the box, which presents safety concerns related both to dual use and to disparate performance. In practice, we expect that the cost of transcription is not the limiting factor in scaling up surveillance projects.
The experiment was performed not only with the JSUT basic5000 and CV8.0 test sets but also with three evaluation sets from the Corpus of Spontaneous Japanese (CSJ) to assess the domain dependency of the ASR models. The CSJ data are recordings of actual and mock conference lectures, which contain more fillers, disfluencies, and mispronunciations than the other corpora; their transcription conventions also differ from those of written text. We measured the average real-time factor (RTF) on an NVIDIA T4 GPU using the first 100 utterances from JSUT basic5000.
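To illustrate how such RTF numbers can be obtained, the sketch below divides wall-clock decoding time by audio duration and averages over utterances; the file paths and the use of the soundfile package are assumptions for illustration, not part of the original setup.

```python
import time

import soundfile as sf
import whisper

model = whisper.load_model("medium")

def real_time_factor(audio_path: str) -> float:
    """RTF = wall-clock decoding time divided by audio duration."""
    info = sf.info(audio_path)
    duration = info.frames / info.samplerate
    start = time.perf_counter()
    model.transcribe(audio_path, language="ja")
    return (time.perf_counter() - start) / duration

# Hypothetical paths to the first 100 JSUT basic5000 utterances.
utterances = [f"jsut_basic5000/BASIC5000_{i:04d}.wav" for i in range(1, 101)]
avg_rtf = sum(real_time_factor(p) for p in utterances) / len(utterances)
print(f"average RTF: {avg_rtf:.3f}")
```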
As shown in Table 2, the proposed model achieved faster inference with DeepSpeed-Inference while maintaining recognition accuracy.
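The original work does not include code here, but a rough sketch of wrapping a PyTorch Whisper model with DeepSpeed-Inference is shown below; the exact `deepspeed.init_inference` arguments and kernel-injection coverage depend on the DeepSpeed version, so treat this as an assumption-laden illustration rather than the authors' implementation.

```python
import torch
import deepspeed
import whisper

model = whisper.load_model("medium")

# Replace eligible submodules with fused fp16 inference kernels.
# (Kernel-injection support for Whisper-style models varies across
# DeepSpeed releases; fall back to plain fp16 if injection is unsupported.)
engine = deepspeed.init_inference(
    model,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

# The wrapped module keeps the original transcribe() interface.
result = engine.module.transcribe("audio.wav", language="ja")
print(result["text"])
```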