Today I learned there is a glLoadTransposeMatrixf, which takes a row-major matrix, and I think that is the better way to go. It resonates for me with _mm_setr_ps, which takes its arguments in the reverse order of _mm_set_ps (that is, in memory order) and should always be the one to reach for.
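To make the row-major point concrete, here is a minimal sketch in plain C (no GL context needed). Calling glLoadTransposeMatrixf with a matrix has the same effect as calling glLoadMatrixf with its transpose, so the helper below shows the conversion you would otherwise do by hand. The name transpose4x4 is made up for illustration and is not part of any library.

```c
#include <assert.h>

/* Hypothetical helper: converts a row-major 4x4 matrix (element at
   row r, column c stored at in[r*4 + c]) into column-major order
   (same element stored at out[c*4 + r]), which is the order that
   glLoadMatrixf reads. in and out must not alias. */
void transpose4x4(const float in[16], float out[16])
{
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[col * 4 + row] = in[row * 4 + col];
}
```

With a row-major translation matrix, where the translation sits in the last column at indices 3, 7 and 11, the output holds it at indices 12, 13 and 14.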
Unless you are multiplying matrices with vectors, why would you want to store the transposed matrix in memory? (Doing the multiply with SSE would require transposing the matrix with the current memory layout, so a transposed matrix would be a tiny bit more efficient.)
Currently we just put the floats in an array and send it to GL. There is no advantage to storing them transposed. In fact, with the current approach the matrix memory layout is compatible between Direct3D 9 fixed functionality and the OpenGL one (not that this is very important to us, but still). If you start storing it transposed, it won't be compatible.
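A quick sketch of why the layouts line up, assuming the usual conventions: OpenGL treats vectors as columns and stores the matrix column-major, while Direct3D 9 fixed function treats vectors as rows and stores it row-major. The two differences cancel out, so a translation ends up in array elements 12, 13 and 14 either way. The function names here are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Fill m with the 4x4 identity. */
void set_identity(float m[16])
{
    for (int i = 0; i < 16; ++i)
        m[i] = (i % 5 == 0) ? 1.0f : 0.0f; /* diagonal: 0, 5, 10, 15 */
}

/* OpenGL convention: column vectors, column-major storage.
   Element (row r, column c) lives at m[c*4 + r]; the translation
   occupies column 3. */
void gl_translation(float tx, float ty, float tz, float m[16])
{
    set_identity(m);
    m[3 * 4 + 0] = tx;
    m[3 * 4 + 1] = ty;
    m[3 * 4 + 2] = tz;
}

/* Direct3D 9 convention: row vectors, row-major storage.
   Element (row r, column c) lives at m[r*4 + c]; the translation
   occupies row 3 (_41, _42, _43). */
void d3d_translation(float tx, float ty, float tz, float m[16])
{
    set_identity(m);
    m[3 * 4 + 0] = tx;
    m[3 * 4 + 1] = ty;
    m[3 * 4 + 2] = tz;
}

/* Returns 1 if both conventions produce byte-identical arrays. */
int layouts_match(float tx, float ty, float tz)
{
    float a[16], b[16];
    gl_translation(tx, ty, tz, a);
    d3d_translation(tx, ty, tz, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The two builders end up writing the same indices, which is the whole point: the same float array can be handed to either API. Storing the transpose instead would move the translation to indices 3, 7 and 11 and break that.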