Controllable 4D human-object interaction (HOI) synthesis requires specifying not only what action occurs, but also how the body and the object meet. Existing generators use language, object geometry, and sparse waypoints to constrain action semantics and object motion, yet leave grasp side, contact region, human pose, and relative layout ambiguous. A single reference image is a natural interface for this missing interaction state. However, an HOI image is not a single global condition: it contains contact, human-pose, object-pose, and spatial-relation cues whose temporal relevance differs across a sequence.
We introduce IMAGIN-4D, a diffusion-based HOI generator that treats the image as structured visual evidence. A frozen DINOv2 patch grid feeds four supervised Q-Former heads that extract role-specific factors (human pose, object pose, contact, and body-object layout) at the conditioning frame. Frame-aware visual tokens are then obtained by re-querying the same patches with frame- and text-conditioned queries, so that approach, contact, and release frames can retrieve different visual evidence. The denoiser keeps control sources separated: text, diffusion time, object geometry, and global image factors modulate a base AdaLN stream; sparse waypoints use a dedicated AdaLN stream; per-frame visual tokens enter through final-layer cross-attention.
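To make the source-separated routing concrete, the following is a minimal sketch in plain Python, not the IMAGIN-4D implementation. It assumes a single toy layer in which a base AdaLN stream is driven by one summed conditioning vector, a second AdaLN stream is applied only on frames that carry a waypoint, and per-frame visual tokens are read through a residual cross-attention step. All names (`routed_block`, `adaln`, `cross_attend`, the `params` keys), the dimensions, and the identity attention projections are hypothetical simplifications chosen for illustration.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight):
    """Bias-free linear map; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def adaln(x, cond, w_scale, w_shift):
    """AdaLN: layer-normalize x, then scale and shift it with values regressed from cond."""
    scale, shift = linear(cond, w_scale), linear(cond, w_shift)
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, tokens):
    """Single-head dot-product attention with identity projections (a simplification)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]

def routed_block(frames, base_cond, waypoint_conds, visual_tokens, params):
    """Route each control source through its own path (hypothetical toy layer).

    frames:         T feature vectors of noisy motion features
    base_cond:      one vector standing in for text + diffusion time + geometry + global image factors
    waypoint_conds: T entries, None on frames without a waypoint (sparse control)
    visual_tokens:  T lists of per-frame visual tokens
    """
    out = []
    for x, wp, toks in zip(frames, waypoint_conds, visual_tokens):
        h = adaln(x, base_cond, params["base_scale"], params["base_shift"])  # base stream
        if wp is not None:
            h = adaln(h, wp, params["wp_scale"], params["wp_shift"])  # dedicated waypoint stream
        attended = cross_attend(h, toks)  # per-frame visual evidence
        out.append([a + b for a, b in zip(h, attended)])  # residual connection
    return out

def random_matrix(rows, cols, rng, std=0.1):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed): feature dim D, conditioning dim C, T frames, K tokens per frame.
rng = random.Random(0)
D, C, T, K = 8, 4, 5, 3
params = {name: random_matrix(D, C, rng)
          for name in ("base_scale", "base_shift", "wp_scale", "wp_shift")}
frames = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
base_cond = [rng.gauss(0.0, 1.0) for _ in range(C)]
waypoint_conds = [None] * T
waypoint_conds[2] = [rng.gauss(0.0, 1.0) for _ in range(C)]  # one sparse waypoint at frame 2
visual_tokens = [[[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)] for _ in range(T)]

out = routed_block(frames, base_cond, waypoint_conds, visual_tokens, params)
```

In this sketch the waypoint signal never shares modulation weights with the text or image conditioning, so a waypoint on one frame alters only that frame's features. That is one plausible reading of why separated routing could preserve waypoint precision. The actual model applies cross-attention only at the final layer of a multi-layer denoiser, which this single-layer toy does not capture.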
We also introduce image adherence, a control-centric metric that measures whether a generated sequence realizes the depicted interaction. Spatial factorization and frame-aware re-querying improve image adherence over pooled and uniform image baselines, while source-separated routing keeps waypoint precision close to that of the text-and-waypoint baseline. The full system requires neither a 7B vision-language model nor a text-to-image API at inference time.