Unless you are multiplying matrices by vectors, why would you want to store the transposed matrix in memory? (Doing the multiply with SSE would require transposing the matrix with the current memory layout, so a transposed matrix would be a tiny bit more efficient in that case.)
Currently we are just putting floats into an array and sending them to GL, so there is no advantage in storing them transposed. Actually, with the current approach the matrix memory layout is compatible between the Direct3D 9 fixed-function pipeline and the OpenGL one (not that this is very important to us, but still). If you start storing matrices transposed, that layout will no longer be compatible.
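To make the compatibility point concrete, here is a minimal sketch in C. It is not our actual code: the helper names (make_translation, gl_element, d3d_element, check_translation_layout) are made up for illustration. It assumes the usual conventions, i.e. OpenGL reads the 16 floats as column-major and treats vectors as columns, while Direct3D 9 reads them as row-major and treats vectors as rows. Under those assumptions the translation ends up in the same three floats (12, 13, 14) for both APIs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: fill a 16-float array with a translation matrix.
   The translation goes into elements 12, 13 and 14. */
static void make_translation(float m[16], float tx, float ty, float tz)
{
    memset(m, 0, 16 * sizeof(float));
    m[0] = m[5] = m[10] = m[15] = 1.0f;  /* identity diagonal */
    m[12] = tx;
    m[13] = ty;
    m[14] = tz;
}

/* OpenGL view (assumed): column-major, element (row, col) is m[col * 4 + row]. */
static float gl_element(const float m[16], int row, int col)
{
    return m[col * 4 + row];
}

/* Direct3D 9 view (assumed): row-major, element (row, col) is m[row * 4 + col]. */
static float d3d_element(const float m[16], int row, int col)
{
    return m[row * 4 + col];
}

/* Check that both APIs find the translation where they expect it. */
static void check_translation_layout(void)
{
    float m[16];
    make_translation(m, 1.0f, 2.0f, 3.0f);

    /* OpenGL, column vectors: translation is the fourth column. */
    assert(gl_element(m, 0, 3) == 1.0f);
    assert(gl_element(m, 1, 3) == 2.0f);
    assert(gl_element(m, 2, 3) == 3.0f);

    /* Direct3D 9, row vectors: translation is the fourth row (_41, _42, _43). */
    assert(d3d_element(m, 3, 0) == 1.0f);
    assert(d3d_element(m, 3, 1) == 2.0f);
    assert(d3d_element(m, 3, 2) == 3.0f);
}
```

An array filled this way could be passed to glLoadMatrixf as-is, and the same 16 floats read as a D3DMATRIX would carry the same transform. Storing the transpose instead would move the translation to elements 3, 7 and 11, and the array would have to be flipped back before going to either API.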