Could the problem with the scaling calculation be that you are mixing real-world meters (the distance from the screen, d) and handmade-hero-world meters (p_z)? Shouldn't you convert the real-world meters into handmade-hero-world meters for this calculation, so that we see the perspective we would see when looking at the 'real' scene at handmade-hero-world scale? The result would then be converted to pixels for display.
This would explain why d = 20.0 gives better results than d = 0.3. It's like viewing a room from 20 meters above the world compared to viewing it from 30 cm above the world. When viewed from 30 cm above, we are not actually seeing anything in the room we are currently in, but rather rooms very far below.
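To make the unit argument concrete, here is a minimal sketch. It assumes a simple pinhole camera looking straight down from height d onto the ground plane at z = 0, with d and p_z both in handmade-hero-world meters. The function names are made up for illustration and this is not the actual Handmade Hero code; the real math may well differ.

```c
/* Hypothetical helpers, not from the Handmade Hero source.
   Assumption: pinhole camera at height d above the ground plane (z = 0),
   looking straight down; d and p_z are in the same unit. */

/* Nonzero if a point at height p_z is below the camera, i.e. visible. */
static int is_in_front_of_camera(float d, float p_z)
{
    return p_z < d;
}

/* How much larger a point at height p_z appears compared with a point
   on the ground plane. By similar triangles this is d / (d - p_z).
   Only meaningful when is_in_front_of_camera(d, p_z) is true. */
static float relative_scale(float d, float p_z)
{
    return d / (d - p_z);
}
```

With d = 20.0 and p_z = 1.0 the scale is 20/19, a mild enlargement. With d = 0.3 and p_z = 1.0 the point is above the camera, so the formula returns a negative value and the result is no longer a sensible scale.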
My first thought was that the z value, which is hardcoded to 1.0f, gets multiplied by 1 over some meters-per-pixel value, which will always make it smaller, so it is not a value in the 0 to 1 range.
It's the focal length that was the surprising thing to me, not the camera distance, but I guess maybe it should not have been. Needing to be 20m up seemed totally plausible but needing the "monitor" to be so far away didn't make a whole lot of sense to me. But, we will take a close look at the math on Monday and see if it's actually reasonable.