What you are after is called a “difference key” in video land. It’s a good technique for avoiding the coloured fringes (spill) that you can get with blue or green screen techniques. It does require a stationary camera, though (or it gets a whole lot harder).
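The simple thing to do is to get the array of raster data for the BufferedImages and then just construct a new array of differences, e.g. pixdif[i] = Math.abs(pix1[i] - pix2[i]); Do this per colour channel: if the arrays hold packed ARGB ints, subtracting the whole ints lets borrows spill from one channel into the next.
(or set the alpha component to 255 or 0 depending on whether the difference exceeds some threshold…)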
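Here is a rough sketch of that in Java, not a tuned implementation. The class and method names (DifferenceKey, channelDiff, key) are made up for illustration, and it assumes both images are the same size. It takes the largest per-channel difference as the "difference", which is only one of several reasonable choices, and uses getRGB/setRGB for simplicity rather than reaching into the raster directly.

```java
import java.awt.image.BufferedImage;

final class DifferenceKey {

    /** Largest absolute difference over R, G and B of two packed pixels. */
    static int channelDiff(int p1, int p2) {
        int dr = Math.abs(((p1 >> 16) & 0xFF) - ((p2 >> 16) & 0xFF));
        int dg = Math.abs(((p1 >> 8) & 0xFF) - ((p2 >> 8) & 0xFF));
        int db = Math.abs((p1 & 0xFF) - (p2 & 0xFF));
        return Math.max(dr, Math.max(dg, db));
    }

    /**
     * Returns a copy of frame with alpha 255 where it differs from the
     * background by more than threshold (0..255), and alpha 0 elsewhere.
     * Assumes frame and background have identical dimensions.
     */
    static BufferedImage key(BufferedImage frame, BufferedImage background, int threshold) {
        int w = frame.getWidth();
        int h = frame.getHeight();
        int[] pix1 = frame.getRGB(0, 0, w, h, null, 0, w);
        int[] pix2 = background.getRGB(0, 0, w, h, null, 0, w);
        int[] out = new int[pix1.length];
        for (int i = 0; i < pix1.length; i++) {
            int alpha = channelDiff(pix1[i], pix2[i]) > threshold ? 0xFF : 0x00;
            // Keep the frame's colour, replace only the alpha byte.
            out[i] = (alpha << 24) | (pix1[i] & 0x00FFFFFF);
        }
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        result.setRGB(0, 0, w, h, out, 0, w);
        return result;
    }
}
```

You would grab one clean frame of the empty scene as the background, then call something like DifferenceKey.key(frame, background, 24) on each new frame. The threshold value is a guess you will have to tune against your camera noise.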
Apply a threshold to this and scale it so you basically have a binary mask… (pixel is background or not), then you can experiment with more stuff to smooth edges, despeckle, etc. Getting rid of “holes” in the foreground, where it just happens to match the same colour as what was behind it, may be tricky.
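As one possible despeckle step (a sketch, and not the only way to do it), you can run a 3x3 majority filter over the binary mask: a pixel ends up as foreground only if at least 5 of the 9 cells around it are foreground. The MaskCleanup class and majority3x3 method are hypothetical names, and the mask is assumed to be a row-major boolean array where true means foreground.

```java
final class MaskCleanup {

    /**
     * 3x3 majority filter over a row-major mask of size w * h.
     * Cells outside the image count as background.
     */
    static boolean[] majority3x3(boolean[] mask, int w, int h) {
        boolean[] out = new boolean[mask.length];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int count = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int nx = x + dx;
                        int ny = y + dy;
                        if (nx >= 0 && nx < w && ny >= 0 && ny < h && mask[ny * w + nx]) {
                            count++;
                        }
                    }
                }
                // Foreground only if most of the neighbourhood agrees.
                out[y * w + x] = count >= 5;
            }
        }
        return out;
    }
}
```

This removes isolated specks and fills single-pixel holes, but it will not fix larger holes where the foreground really does match the background, and it nibbles at the mask along the image border because out-of-range cells are treated as background.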
You didn’t say the size of the images, but I think you should be able to do 320x240 or similar at 30fps (real-time for television video) without too much effort. That is 76,800 pixels per frame, or roughly 2.3 million pixel comparisons per second.