The hardware does do transformations (generally vector * matrix), but you cannot EVER get the results back to the CPU. This is the problem.
All 3D libraries require matrix maths on the CPU for moving things around (model matrices), animation, logic, etc. For example, skeletal (boned) animation requires several layers of matrix construction and concatenation. The final world matrices can then be passed to the GPU, but you can't get the GPU to do all that concatenation/construction itself, as it does not have access to main memory (where the animation data is stored) and does not know what algorithms you are using.
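To make that concrete, here is a minimal sketch in plain C++ (not any particular engine's API) of the kind of per-bone concatenation the CPU does before handing the finished world matrices to the GPU. The `Mat4`/`Bone` types and the tiny three-bone skeleton are made up purely for illustration:

```cpp
// Sketch only: build world matrices for a small bone hierarchy on the CPU,
// then hand just the results to the GPU. Mat4 and Bone are illustrative types.
#include <cstdio>
#include <vector>

struct Mat4 {
    float m[16]; // column-major 4x4
};

// Plain 4x4 matrix multiply: out = a * b
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a.m[k * 4 + row] * b.m[col * 4 + k];
            out.m[col * 4 + row] = sum;
        }
    return out;
}

Mat4 identity() {
    Mat4 out{};
    out.m[0] = out.m[5] = out.m[10] = out.m[15] = 1.0f;
    return out;
}

struct Bone {
    int  parent; // -1 for the root
    Mat4 local;  // local transform sampled from the animation data
};

int main() {
    // Toy skeleton: root plus two children, identity local transforms.
    std::vector<Bone> bones = {
        { -1, identity() },
        {  0, identity() },
        {  1, identity() },
    };

    // Concatenate parent * child down the hierarchy on the CPU.
    std::vector<Mat4> world(bones.size());
    for (size_t i = 0; i < bones.size(); ++i) {
        if (bones[i].parent < 0)
            world[i] = bones[i].local;
        else
            world[i] = multiply(world[bones[i].parent], bones[i].local);
    }

    // Only the finished world matrices would be uploaded to the GPU
    // (e.g. as shader uniforms); the GPU never runs this loop.
    std::printf("Built %zu world matrices on the CPU\n", world.size());
    return 0;
}
```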
The logic behind this is the weight of numbers. If you have 100 objects, each of 1000 vertices, then you only need a few hundred matrix operations per frame on the CPU. The GPU, however, needs to transform 100,000 vertices per frame - hence the dedicated hardware to do it. The 100-odd matrix calcs for moving things around are trivial for a modern CPU.
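As a rough back-of-the-envelope check on those numbers (the 60 fps figure here is just an assumption for illustration, not from the original example):

```cpp
// Counts only; nothing here measures real hardware.
#include <cstdio>

int main() {
    const long objects           = 100;
    const long vertices_per_obj  = 1000;
    const long frames_per_second = 60;   // assumed frame rate

    const long cpu_matrix_ops = objects;                    // roughly one concat per object per frame
    const long gpu_vertex_ops = objects * vertices_per_obj; // every vertex transformed every frame

    std::printf("CPU: ~%ld matrix concatenations/frame (~%ld/s)\n",
                cpu_matrix_ops, cpu_matrix_ops * frames_per_second);
    std::printf("GPU: %ld vertex transforms/frame (%ld/s)\n",
                gpu_vertex_ops, gpu_vertex_ops * frames_per_second);
    return 0;
}
```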
Yet another reason is that graphics cards do their work whilst you are already preparing the NEXT frame on the CPU. Staggering the load like this keeps the two processors decoupled, with a single point of synchronisation per frame (decoupling of parallel processes). This means that the GPU does not have to wait around when the CPU is busy and vice versa. Because of this decoupling, if you asked for a matrix op, you would not get the results until the next frame - well after you needed it.
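A hypothetical frame loop (placeholder function names, not a real API) shows where that single synchronisation point sits and why reading a result back mid-frame would stall the pipeline:

```cpp
// Sketch of the staggered frame loop described above; the functions are stubs.
#include <cstdio>

void update_and_build_matrices() { /* CPU work for the NEXT frame */ }
void submit_draw_calls()         { /* queue commands; GPU may still be busy with the previous frame */ }
void present()                   { /* the single per-frame synchronisation point */ }

int main() {
    for (int frame = 0; frame < 3; ++frame) {
        update_and_build_matrices(); // CPU runs ahead while the GPU renders the previous frame
        submit_draw_calls();
        // Reading a transform result back here would force the CPU to wait for
        // the GPU to drain its queue, destroying the overlap between the two.
        present();
        std::printf("frame %d submitted\n", frame);
    }
    return 0;
}
```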
Hope this helps,