I still think we need to look at a BLAS approach. OK, maybe JNI isn’t the best way to go (though you probably don’t need to work on many vectors at once to win back the interface overhead). But following the discussion of moving vecmath into the core, BLAS would be a far better choice: at that point you’ve only got normal subroutine call overhead, which is insignificant.
A good example might be character skinning, where for each vertex you accumulate the product of each bone influence with a scalar weight. If you set your data up correctly (I have classes derived from the native buffer that handle Vector3fBuffer etc.) then you can do your character skinning in one BLAS call.
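To make the "one BLAS call" idea concrete, here is a rough sketch of one way the data could be laid out. This is my own assumption about the packing, not the Vector3fBuffer classes mentioned above: each vertex row holds its weight-scaled homogeneous rest position once per bone, the bone matrices are stacked as transposed 4x3 blocks, and the whole mesh is then a single matrix-matrix product. The `gemm` method is a plain-Java stand-in for a single-precision GEMM, and the class and method names are hypothetical.

```java
// Sketch only: the packing layout and the gemm() stand-in are assumptions,
// not a real BLAS binding.
final class SkinningSketch {

    // Plain-Java stand-in for a single-precision GEMM: C = A * B, row-major.
    // A is m x k, B is k x n, C is m x n.
    static void gemm(int m, int n, int k, float[] a, float[] b, float[] c) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int p = 0; p < k; p++) {
                    sum += a[i * k + p] * b[p * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    // Done once per mesh: row i holds w[i][j] * (x, y, z, 1) for each bone j,
    // giving a (vertexCount x 4*bones) matrix.
    static float[] packVertices(float[][] rest, float[][] weights, int bones) {
        float[] a = new float[rest.length * 4 * bones];
        for (int i = 0; i < rest.length; i++) {
            for (int j = 0; j < bones; j++) {
                int base = (i * bones + j) * 4;
                float w = weights[i][j];
                a[base]     = w * rest[i][0];
                a[base + 1] = w * rest[i][1];
                a[base + 2] = w * rest[i][2];
                a[base + 3] = w;
            }
        }
        return a;
    }

    // Done once per frame: bone j is a 3x4 row-major matrix, stored here
    // transposed as a 4x3 block, giving a (4*bones x 3) matrix.
    static float[] packBones(float[][] boneMatrices) {
        float[] s = new float[4 * boneMatrices.length * 3];
        for (int j = 0; j < boneMatrices.length; j++) {
            for (int r = 0; r < 3; r++) {
                for (int c = 0; c < 4; c++) {
                    s[(j * 4 + c) * 3 + r] = boneMatrices[j][r * 4 + c];
                }
            }
        }
        return s;
    }

    // The skinning itself: one GEMM for the whole mesh.
    // Row i of the result is sum_j w[i][j] * (M_j * v_i).
    static float[] skin(float[] packedVertices, float[] packedBones,
                        int vertexCount, int bones) {
        float[] out = new float[vertexCount * 3];
        gemm(vertexCount, 3, 4 * bones, packedVertices, packedBones, out);
        return out;
    }
}
```

This dense layout wastes work on zero weights when each vertex is only influenced by a few bones, so treat it as an illustration of the shape of the call rather than a tuned implementation.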
As I mentioned previously, there are highly optimised BLAS routines for every processor architecture out there, including versions which will handle long/short vectors and support tiling, prefetching and other superoptimisations.
And the really nice thing is that the compiler doesn’t need to work too hard to recognise many loops as a sequence of equivalent BLAS calls (particularly if you are allowed to relax the IEEE rules on operation order). This is what the Cray Fortran compilers did back in the late 80s.
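As a small illustration of the kind of loop-to-BLAS mapping I mean, the sketch below shows two hand-written loops next to the Level 1 BLAS operations they correspond to (SAXPY and SDOT). The `saxpy` and `sdot` methods are plain-Java stand-ins that follow the reference BLAS argument order but assume positive increments only; the class name and the other method names are hypothetical.

```java
// Sketch only: saxpy() and sdot() are plain-Java stand-ins for the
// BLAS Level 1 routines of the same name, assuming positive increments.
final class LoopToBlasSketch {

    // y := alpha * x + y
    static void saxpy(int n, float alpha, float[] x, int incx, float[] y, int incy) {
        for (int i = 0, ix = 0, iy = 0; i < n; i++, ix += incx, iy += incy) {
            y[iy] += alpha * x[ix];
        }
    }

    // returns sum over i of x[i] * y[i]
    static float sdot(int n, float[] x, int incx, float[] y, int incy) {
        float sum = 0f;
        for (int i = 0, ix = 0, iy = 0; i < n; i++, ix += incx, iy += incy) {
            sum += x[ix] * y[iy];
        }
        return sum;
    }

    // The loops as you would write them by hand.
    static float handWritten(float alpha, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = y[i] + alpha * x[i];   // pattern: SAXPY
        }
        float dot = 0f;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];           // pattern: SDOT (a reduction)
        }
        return dot;
    }

    // What a compiler could emit once it has recognised both patterns.
    static float asBlasCalls(float alpha, float[] x, float[] y) {
        saxpy(x.length, alpha, x, 1, y, 1);
        return sdot(x.length, x, 1, y, 1);
    }
}
```

The reduction is where the operation-order rules come in: an optimised dot product will typically accumulate partial sums in a different order from the simple loop, so the result can differ in the last bits unless the compiler is allowed to reorder.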