I feel like this approach will make your game rendering engine not very generic. It will be tailored to a specific game's use case, and doing anything custom will be inefficient or a lot of work. I also think a lot of duplication will happen in renderer.draw for each backend, which could be avoided.
I agree, but assuming the renderer has all the features you need, I think my approach could actually make everything more efficient? The hard part will be identifying most of those custom features in the first place. The duplication means it will be more work to introduce a new backend, true, but in terms of actual code there is nothing preventing each backend from reusing parts of a shared library.
In an HH-style low-level command buffer renderer, the game-specific rendering logic happens in the game layer. It is common to all backends and tailored to the specific game, and it is as efficient as the game can make it - it knows everything about the game.
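For illustration, here is a rough sketch of what such a push-buffer interface might look like - all names are made up, not taken from any specific codebase:

```c
/* Rough sketch of an HH-style push-buffer renderer interface.
   All names are hypothetical; real code would also handle alignment,
   buffer growth, sorting, etc. */
#include <stddef.h>
#include <stdint.h>

typedef enum {
    RENDER_ENTRY_CLEAR,
    RENDER_ENTRY_TEXTURED_QUAD,
} RenderEntryType;

typedef struct {
    RenderEntryType type;
    /* entry payload follows the header in the buffer */
} RenderEntryHeader;

typedef struct {
    float color[4];
} RenderEntryClear;

typedef struct {
    float position[2];
    float size[2];
    uint32_t texture_id;
} RenderEntryTexturedQuad;

typedef struct {
    uint8_t *base;
    size_t size;
    size_t used;
} RenderCommandBuffer;

/* The game layer pushes commands; it knows nothing about GL/D3D/Vulkan.
   Returns a pointer to the payload for the caller to fill in. */
static void *push_render_entry(RenderCommandBuffer *buffer,
                               RenderEntryType type, size_t payload_size)
{
    size_t total = sizeof(RenderEntryHeader) + payload_size;
    if (buffer->used + total > buffer->size) {
        return NULL; /* out of space; real code would grow or flush */
    }
    RenderEntryHeader *header = (RenderEntryHeader *)(buffer->base + buffer->used);
    header->type = type;
    buffer->used += total;
    return header + 1;
}
```

Each backend then just walks the same buffer and translates the entries into its own API calls, so the game-specific logic stays in one place.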
This makes a lot of sense when you are targeting modern hardware, where the underlying platform graphics API is more or less the same. But as soon as you need to target ancient hardware, legacy APIs, or very specific rendering techniques (PBR), would that type of interface still work? I feel that the efficiency claim no longer holds, since that backend would have to emulate the behavior of the other backends that the interface is more tailored to.
It almost sounds like programming in a low-level vs. a high-level language.
If you ignore the part about me mentioning the scene being a scripting language, it could also just be a plain struct of data, with plain arrays for every concept. So you would actually eliminate a lot of function calls between the game and the renderer, and each renderer is free to optimize to its heart's content. For example, it would be trivial to find out exactly what changed from one frame to the next.
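As a rough sketch of what I mean (again, all names are made up), the scene could just be flat arrays the game fills in each frame:

```c
/* Hypothetical "scene as plain data" interface: the game fills flat arrays,
   the backend consumes them however it likes. Not a real API. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    float position[3];
    float rotation[4];    /* quaternion */
    uint32_t mesh_id;
    uint32_t material_id;
} SceneInstance;

typedef struct {
    float position[3];
    float color[3];
    float radius;
} SceneLight;

typedef struct {
    uint64_t frame_index;

    SceneInstance *instances;
    uint32_t instance_count;

    SceneLight *lights;
    uint32_t light_count;
} Scene;

/* A backend can keep the previous frame's scene around and diff the arrays,
   e.g. to re-upload only the instance data that actually changed. */
void backend_submit_scene(const Scene *previous, const Scene *current);
```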
But overall I agree with everything you said. I'm just not completely convinced yet this is a bad idea ;-)