Hi,
I got the whisper_stt extension running in Oobabooga and it (kinda) works. However, it seems really, really bad at understanding my speech, and recognition has been spotty at best.
I saw some YouTube tutorials where it seemed to have no problem understanding speech, even when spoken with quite a strong accent, and in my own experience it performs nowhere near as well as shown there.
So: are there things I can do to improve its performance? Or might the YouTube tutorials have been edited to give a false impression, and is spotty performance simply what to expect?
I'm very happy with silero_tts, and if I can get the speech-to-text working at the same level, I'd be a happy camper.
Edit:
It seems to be a memory problem. I can select several models in the extension interface: tiny, small, base, medium, ...
If I choose the tiny or small model, it works, but with the poor results I mentioned above. If I select the medium model, I get an OOM error, something like: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 11.99 GiB total capacity; 11.14 GiB already allocated; 0 bytes free; 11.22 GiB reserved in total by PyTorch).
It looks to me as if the language model reserves all of my VRAM (12 GB) and doesn't leave any for the extension. Is there a way to tweak that?
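For reference, this is how I've been checking how much VRAM is actually left once the language model is loaded; it's just a quick sketch using plain PyTorch, nothing specific to the extension:

```python
import torch

# Query free and total memory on GPU 0 (values are returned in bytes).
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 1024**3:.2f} GiB / total: {total / 1024**3:.2f} GiB")
```

It reports almost nothing free after the LLM loads, which lines up with the "0 bytes free" in the OOM message above.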
Edit 2:
OK, so if I use a smaller language model (like a 6B model), it seems to work perfectly fine with the medium whisper model, so it probably is a memory issue. I have already tried starting with the command flag --gpu-memory set to 5, 8, and 10, which doesn't seem to do anything. Are there other ways of managing memory?
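One workaround I'm considering (this assumes the extension loads its model through the openai-whisper package; I haven't verified that in the extension code): load the whisper model on the CPU so it doesn't compete with the LLM for VRAM at all. Something like:

```python
import whisper

# Load the medium model on the CPU instead of the GPU, so it doesn't
# fight the language model for VRAM. Transcription gets slower, but it
# can't trigger a CUDA OOM.
model = whisper.load_model("medium", device="cpu")

# "recording.wav" is just a placeholder for the microphone capture.
result = model.transcribe("recording.wav", fp16=False)  # fp16 isn't supported on CPU
print(result["text"])
```

If the extension exposes a device setting somewhere, pointing it at the CPU should presumably have the same effect.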