I realize this is getting somewhat ridiculous now, since it’s only Sokoban, but anyway, I like thinking about “nearly perfect” design.
Back on topic:
I don’t see a reason why you should distinguish between Player and Boxes. I liked my approach (storing only walls, targets and free spaces in the 2D array), because it separates movable objects from static ones.
Also, I wouldn’t call it “Grid”, because a grid encapsulates individual cells. At first glance, something typed “Grid[][]” looks like a 2D array with 2D arrays (grids) inside.
I would create the following types:
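enum Cell {
    WALL(false),
    FLOOR(true),
    TARGET(true),
    OUT_OF_LEVEL(false); // the semicolon ends the constant list

    // Whether a movable object may stand on this kind of cell
    public final boolean canBeMovedOn;

    Cell(boolean canBeMovedOn) {
        this.canBeMovedOn = canBeMovedOn;
    }
}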
abstract class Movable {
    private int x;
    private int y;

    public Movable(int startx, int starty) {
        this.x = startx;
        this.y = starty;
    }

    // Direction is the direction the user has pressed
    public abstract void act(Direction dir, SokobanWorld world);
}
class Box extends Movable {
    public Box(int sx, int sy) {
        super(sx, sy);
    }

    // Do nothing. A Box doesn't react to user input directly
    @Override
    public void act(Direction dir, SokobanWorld world) {}
}
class Player extends Movable {
    public Player(int sx, int sy) {
        super(sx, sy);
    }

    @Override
    public void act(Direction dir, SokobanWorld world) {
        // To be implemented
    }
}
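// Requires: import java.util.Map;
class SokobanWorld {
    // Static cells only (walls, floor, targets)
    public Cell[][] grid;
    // Movable objects (player and boxes), keyed by position
    public Map<Vec2i, Movable> movables;
}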
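To show how the split between static cells and movable objects could play out, here is a rough, self-contained sketch of the move/push rule. It is only one possible way to do it, not the finished design: Direction and Vec2i are never defined above, so the versions here are my assumptions, and Kind, SokobanSketch, cellAt and movePlayer are hypothetical names made up for the example. To keep it short, the map stores a Kind value instead of Movable instances, and I assume the grid is indexed as grid[y][x].

```java
import java.util.HashMap;
import java.util.Map;

enum Cell {
    WALL(false), FLOOR(true), TARGET(true), OUT_OF_LEVEL(false);
    final boolean canBeMovedOn;
    Cell(boolean canBeMovedOn) { this.canBeMovedOn = canBeMovedOn; }
}

// Assumption: four axis-aligned directions with (dx, dy) offsets.
enum Direction {
    UP(0, -1), DOWN(0, 1), LEFT(-1, 0), RIGHT(1, 0);
    final int dx, dy;
    Direction(int dx, int dy) { this.dx = dx; this.dy = dy; }
}

// Assumption: an immutable integer vector that can be used as a map key.
final class Vec2i {
    final int x, y;
    Vec2i(int x, int y) { this.x = x; this.y = y; }
    Vec2i step(Direction d) { return new Vec2i(x + d.dx, y + d.dy); }
    @Override public boolean equals(Object o) {
        return o instanceof Vec2i && ((Vec2i) o).x == x && ((Vec2i) o).y == y;
    }
    @Override public int hashCode() { return 31 * x + y; }
}

// Hypothetical stand-in for the Movable subclasses (Player, Box).
enum Kind { PLAYER, BOX }

final class SokobanSketch {
    final Cell[][] grid;                                  // static cells, grid[y][x]
    final Map<Vec2i, Kind> movables = new HashMap<>();    // player and boxes by position
    Vec2i player;

    SokobanSketch(Cell[][] grid, Vec2i player) {
        this.grid = grid;
        this.player = player;
        movables.put(player, Kind.PLAYER);
    }

    // Everything outside the array counts as OUT_OF_LEVEL, so no bounds errors.
    Cell cellAt(Vec2i p) {
        if (p.y < 0 || p.y >= grid.length || p.x < 0 || p.x >= grid[p.y].length) {
            return Cell.OUT_OF_LEVEL;
        }
        return grid[p.y][p.x];
    }

    // Moves the player one step; pushes a single box if one is in the way.
    // Returns false (and changes nothing) if the move is blocked.
    boolean movePlayer(Direction dir) {
        Vec2i next = player.step(dir);
        if (!cellAt(next).canBeMovedOn) return false;
        if (movables.get(next) == Kind.BOX) {
            Vec2i beyond = next.step(dir);
            // A box can only be pushed onto a free, walkable cell.
            if (!cellAt(beyond).canBeMovedOn || movables.containsKey(beyond)) return false;
            movables.remove(next);
            movables.put(beyond, Kind.BOX);
        }
        movables.remove(player);
        movables.put(next, Kind.PLAYER);
        player = next;
        return true;
    }
}
```

The nice part is that the grid is never modified while playing. Only the map of movables changes, so resetting a level just means restoring the start positions.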
But this might all be over-engineered 