I think most people use something like exllamav2, vllm, or GGUF-based tooling for inference, and it seems none of those projects has properly implemented multimodality, or this specific model architecture, yet.
You might just be at the forefront of things and there isn't yet any beaten path you could follow.
The easiest thing you could do is use something that already exists, e.g. prebuilt 4bit models, wait a few weeks, and then upgrade. You can also always quantize models yourself and set the parameters however you like, provided you have an inference framework that supports your model (including the adapters for vision) and offers the quantization levels you're interested in...
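The core idea behind quantizing yourself isn't magic, by the way. Here's a minimal numpy sketch of symmetric 4-bit absmax quantization, just to illustrate the concept; real frameworks (GPTQ, AWQ, bitsandbytes, llama.cpp's K-quants) do this per-block with smarter rounding, so treat this as a toy, not what any of them actually implements:

```python
import numpy as np

def quantize_4bit(w):
    # symmetric absmax quantization to the signed 4-bit range [-7, 7]
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # recover approximate float weights from ints + one scale factor
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.9, 0.45, 0.0, 0.7], dtype=np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# worst-case reconstruction error is half a quantization step
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

The "parameters" you get to pick in real tools are mostly variations on this: block size (one scale per 32/64/128 weights instead of per tensor), bit width, and whether a calibration dataset is used to choose the rounding.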