You'll have one model that plays the game and tries a wider range of inputs and approaches to reach goals than human players would (similar to existing research like OpenAI training models to play Minecraft and mine diamonds, starting from a handful of videos with input data and then a lot of YouTube videos).

Then the outputs generated by that model would be passed through another process that looks specifically for things ranging from sequence breaks to clipping. Some of those, like sequence breaks, don't even need AI to detect, and depending on just what data is generated by the 'player' AIs, a fair number of other issues can be similarly detected with dumb approaches. The bugs that would be difficult for an AI to detect would be things like "I threw item A down 45 minutes ago but this NPC just had dialogue thanking me for bringing it back." But even things like this are going to be well within the capabilities of multimodal AI within a few years, as long as hardware continues to scale such that it doesn't become cost prohibitive.
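To make the "dumb approaches" point concrete, here's a rough sketch of a sequence break check that needs no AI at all. It assumes the 'player' AI logs an ordered list of milestone events per run and that the designers can write down which milestones are supposed to come first. The milestone names, `PREREQUISITES`, and `find_sequence_breaks` are all made up for illustration, not from any real tool.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical intended progression: milestone -> milestones that must happen first.
PREREQUISITES: Dict[str, Set[str]] = {
    "enter_castle": {"get_key"},
    "fight_boss": {"enter_castle", "get_sword"},
}


def find_sequence_breaks(
    events: List[str], prerequisites: Dict[str, Set[str]]
) -> List[Tuple[int, str, List[str]]]:
    """Return (event index, milestone, missing prerequisites) for each milestone reached early."""
    seen: Set[str] = set()
    breaks: List[Tuple[int, str, List[str]]] = []
    for index, event in enumerate(events):
        # Anything required for this milestone that the run has not hit yet.
        missing = sorted(prerequisites.get(event, set()) - seen)
        if missing:
            breaks.append((index, event, missing))
        seen.add(event)
    return breaks


# Assumed log from one synthetic run: the castle was entered without the key.
run_log = ["get_sword", "enter_castle", "fight_boss"]
print(find_sequence_breaks(run_log, PREREQUISITES))
# → [(1, 'enter_castle', ['get_key'])]
```

The only hard part is getting the 'player' side to emit a clean event log. Once that exists, the check is a set difference per event.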

The way it's going to start is that third-party companies dedicated to QA will begin feeding their own data and playtests into models to replicate and extend those behaviors, offering synthetic playtesting as a cheap additional service to find low-hanging fruit and cut down on the human tester hours needed. Over time it will shift more and more towards synthetic testing.

You'll still have human playtesters for broader quality questions like "is this fun" - but the QA that's already being outsourced for bugs is almost certainly going to go the way of AI replacing humans entirely, or nearly so.