Disclaimer: I haven't actually gotten to the OpenGL HMH episodes, I've just done some OpenGL stuff before.
I think the previous 3 answers have all the information collectively, but I'll add my response as a sort of summary that (hopefully) links them together.
Many of the vectors that you give to OpenGL for vertex positions are constant (vertices in a mesh loaded from disk for a character or level, as Croepha said). Ideally you don't want to be uploading constant data to the GPU over and over again, so if you could just upload it once and then modify it as necessary every time you wanted to use it, that'd be great. That's where the matrices come in.
As mrmixer said, there isn't such a thing as "actual" coordinates, just coordinates with respect to some set of axes. What you get when you load a model from disk is usually so-called "model-space coordinates", where all the positions are defined relative to some origin for the model (such as the centre/start of the level, a point between the character's feet, etc.). When you draw that to the screen though, you're not interested in model-space coordinates, you want to know where that vertex should be on your screen. So you first do a transformation (read: matrix multiplication) to put the model in the correct place relative to the centre of the world. Then you transform it so that it is in the correct place relative to the camera, so that if you turn around in a first-person game, you don't see the objects behind you. Finally you transform that position into the unit cube using whatever camera projection your application needs. For 3D this is almost always a perspective projection, which makes things smaller the further they are from the camera. This lets you upload the vertex positions to the GPU once, and then just upload the transformation matrices each frame.
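That chain of transformations can be sketched in a few lines of numpy. This is just an illustration, not OpenGL itself — the matrix values (object at x = 5, camera at z = 10, the particular projection) are all made up for the example:

```python
# Sketch of the model -> world -> view -> clip transformation chain.
# All values are illustrative, not taken from any real engine.
import numpy as np

def translation(tx, ty, tz):
    """4x4 matrix that moves points by (tx, ty, tz)."""
    m = np.identity(4)
    m[:3, 3] = [tx, ty, tz]
    return m

# Model-space vertex (w = 1 because it's a position).
vertex = np.array([0.0, 1.0, 0.0, 1.0])

# Model matrix: place the object at x = 5 in the world.
model = translation(5.0, 0.0, 0.0)

# View matrix: a camera sitting at z = 10 looking down -z is the same
# as moving the whole world by -10 in z.
view = translation(0.0, 0.0, -10.0)

# A simple perspective projection (near = 1, far = 100).
near, far = 1.0, 100.0
projection = np.array([
    [1.0, 0.0, 0.0,                          0.0],
    [0.0, 1.0, 0.0,                          0.0],
    [0.0, 0.0, -(far + near) / (far - near), -2 * far * near / (far - near)],
    [0.0, 0.0, -1.0,                         0.0],
])

# Note the right-to-left order: model first, then view, then projection.
clip = projection @ view @ model @ vertex

# The perspective divide lands us in the unit cube (normalized device coords).
ndc = clip[:3] / clip[3]
print(ndc)
```

The only per-frame upload here is the three small matrices; the vertex data itself never changes.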
So that's why we need to transform the data, but why matrix multiplication? As others have mentioned, it's something the GPU can do really efficiently, and I suspect that's partly because expressing everything as matrix multiplication gives the graphics card fewer distinct operations that it needs to make fast.
As you suggest, we could certainly achieve scaling by multiplication with a scalar. However, what if we want to scale differently in different directions? Say you have a cylinder of height 1 and radius r and you want to place a cylinder of height 2 and radius r. To achieve that by scalar multiplication you'd need a different scalar for each direction, which is exactly what you're doing when you specify scale via a matrix.
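For that cylinder case the per-axis scalars just sit on the diagonal of the matrix. A minimal sketch (the radius of 0.5 is an arbitrary example value):

```python
# Sketch: non-uniform scaling as a matrix. Values are illustrative.
import numpy as np

# Scale y by 2, leave x and z alone: each diagonal entry is the
# scale factor for one axis, so one matrix holds all three scalars.
scale = np.diag([1.0, 2.0, 1.0, 1.0])

# A point on the top rim of a height-1 cylinder of radius r = 0.5.
rim = np.array([0.5, 1.0, 0.0, 1.0])

scaled = scale @ rim
print(scaled)  # radius unchanged, height doubled
```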
Using matrix multiplication also allows us to do translation, scaling and rotation (by independent amounts in each axis) with a "single" operation. The 4th column of the matrix is used for displacement: you take your normal 3D coordinates plus an extra "w" coordinate which you just set to 1. When you do the matrix multiplication, that 1 gets multiplied by the translation entries in the matrix, which amounts to adding those entries (the vector specifying the translation) to the vertex's position (after it has been scaled/rotated).
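You can see that w = 1 trick directly if you write the translation matrix out. The offsets (3, 4, 5) below are arbitrary example values; the w = 0 case for direction vectors is a standard consequence of the same layout:

```python
# Sketch of how w = 1 turns matrix multiplication into addition.
# The translation offsets here are arbitrary example values.
import numpy as np

# Translation by (3, 4, 5): the offsets live in the last column.
T = np.array([
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 1.0, 0.0, 4.0],
    [0.0, 0.0, 1.0, 5.0],
    [0.0, 0.0, 0.0, 1.0],
])

p = np.array([1.0, 1.0, 1.0, 1.0])  # a position, so w = 1

# Each row dots with p; the trailing 1 picks up the last-column entry,
# so the result is just p + (3, 4, 5).
print(T @ p)

# A direction vector uses w = 0, so the same matrix leaves it untouched.
d = np.array([1.0, 0.0, 0.0, 0.0])
print(T @ d)
```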
So all 3 of these transformations (or any affine transformation, for that matter) can be achieved by matrix multiplication. Each of these transformations has a pretty simple matrix representation, and to get the composition of the three you just multiply the matrices together (though as mrmixer said, order is important) and then give the result to the GPU.
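A quick sketch of that composition, with made-up example values (uniform scale by 2, a 90-degree rotation about z, a translation of 10 in x), including a demonstration that the order matters:

```python
# Sketch: composing scale, rotation and translation into one matrix.
# All specific values are illustrative.
import numpy as np

S = np.diag([2.0, 2.0, 2.0, 1.0])            # uniform scale by 2

theta = np.pi / 2                             # 90 degrees about the z axis
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0, 0.0],
    [np.sin(theta),  np.cos(theta), 0.0, 0.0],
    [0.0,            0.0,           1.0, 0.0],
    [0.0,            0.0,           0.0, 1.0],
])

T = np.identity(4)
T[:3, 3] = [10.0, 0.0, 0.0]                   # translate by 10 in x

v = np.array([1.0, 0.0, 0.0, 1.0])

# One combined matrix: scale first, then rotate, then translate
# (matrices apply right-to-left).
combined = T @ R @ S
result = combined @ v
print(result)

# Same answer as applying them one at a time...
print(T @ (R @ (S @ v)))

# ...but a different order gives a different answer.
print((S @ T @ R) @ v)
```

On the GPU side this is the payoff: `combined` is a single 4x4 matrix, so each vertex costs one matrix-vector multiply no matter how many transformations went into it.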
I hope that helps (and that I didn't just reiterate what other people have already said :O)