IMAGIN-4D: Image-Guided Controllable Interaction Generation

Anonymous Submission

Anonymous Authors
Image-conditioned 4D HOI generation teaser
Image-conditioned 4D HOI generation. Given a text prompt, object geometry, sparse object waypoints, and a reference image, IMAGIN-4D synthesizes a 4D human-object interaction sequence. Text and waypoints specify the action and the object trajectory, but leave fine-grained interaction details — body pose, contact region, grasp side, and body-object layout — ambiguous. A single reference image resolves this ambiguity. To test whether IMAGIN-4D actually follows the visual evidence, we keep the prompt, geometry, and waypoints fixed and mirror only the reference image: the generated motion changes consistently with the mirrored snapshot.



Abstract

Controllable 4D human-object interaction (HOI) synthesis requires specifying not only what action occurs, but how the body and the object meet. Existing generators use language, object geometry, and sparse waypoints to constrain action semantics and object motion, yet leave grasp side, contact region, human pose, and relative layout ambiguous. A single reference image is a natural interface for this missing interaction state. However, an HOI image is not a single global condition: it contains contact, human-pose, object-pose, and spatial-relation cues whose temporal relevance differs across a sequence.

We introduce IMAGIN-4D, a diffusion-based HOI generator that treats the image as structured visual evidence. A frozen DINOv2 patch grid feeds four supervised Q-Former heads that extract role-specific factors (human pose, object pose, contact, and body-object layout) at the conditioning frame. A frame-aware encoder re-queries the same patches with frame- and text-conditioned queries to produce per-frame visual tokens, so approach, contact, and release frames can retrieve different visual evidence. The denoiser keeps control sources separated: text, diffusion time, object geometry, and global image factors modulate a base AdaLN stream; sparse waypoints use a dedicated AdaLN stream; per-frame visual tokens enter through final-layer cross-attention.
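The frame-aware re-querying idea can be illustrated with a minimal single-head attention sketch. It is not the IMAGIN-4D implementation: the function name `requery_patches`, the additive combination of frame and text embeddings into a query, and the absence of learned projection matrices are all assumptions made for illustration, and the toy vectors stand in for frozen DINOv2 patch features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def requery_patches(patches, frame_embs, text_emb):
    """Return one visual token per frame by attending over shared image patches.

    patches:    P patch features, each a list of d floats (stand-in for DINOv2 patches)
    frame_embs: T per-frame embeddings, each a list of d floats (hypothetical)
    text_emb:   one text embedding of d floats (hypothetical)
    """
    d = len(text_emb)
    tokens, weights = [], []
    for frame_emb in frame_embs:
        # Assumed query construction: frame embedding plus text embedding.
        query = [f + t for f, t in zip(frame_emb, text_emb)]
        # Scaled dot-product score between this frame's query and every patch.
        scores = [sum(q * p for q, p in zip(query, patch)) / math.sqrt(d)
                  for patch in patches]
        attn = softmax(scores)
        # Attention-weighted average of the patch features.
        token = [sum(a * patch[i] for a, patch in zip(attn, patches))
                 for i in range(d)]
        tokens.append(token)
        weights.append(attn)
    return tokens, weights


# Toy example: 3 patches, 2 frames, feature dimension d = 2.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
frame_embs = [[4.0, 0.0], [0.0, 4.0]]  # e.g. an approach frame and a contact frame
text_emb = [0.0, 0.0]
tokens, weights = requery_patches(patches, frame_embs, text_emb)
```

In this toy setup the two frames share the same patch grid but place most of their attention on different patches, which is the behavior the frame-conditioned queries are meant to enable.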

We also introduce image adherence, a control-centric metric that measures whether a generated sequence realizes the depicted interaction. Spatial factorization and frame-aware re-querying improve image adherence over pooled and uniform image baselines, while source-separated routing preserves waypoint precision close to the text-and-waypoint baseline. At inference time, the full system requires neither a 7B vision-language model nor a text-to-image API.


Method Overview

IMAGIN-4D method overview
IMAGIN-4D uses two complementary image representations. A Spatially Factorized Image Encoder (SFIE) applies four supervised Q-Former heads on frozen DINOv2 patches to extract role-specific spatial tokens for human pose ($\boldsymbol{\rho}$), object pose ($\boldsymbol{\xi}$), contact ($\boldsymbol{\kappa}$), and body-object layout ($\boldsymbol{\nu}$); each token is supervised against the latent of a role-specific autoencoder trained on the paired motion. A Frame-Aware Image Encoder (FAIE) re-queries the same patch grid with frame- and text-conditioned queries to produce per-frame tokens $\mathbf{M}=\{\boldsymbol{\mu}_t\}_{t=1}^{T}$, supervised only by the diffusion objective. A reference-frame localizer predicts which frame $\hat{t}$ the reference image depicts, and the spatial tokens are gated by a smooth temporal window centered at $\hat{t}$ so they do not over-constrain frames far from the depicted snapshot. The denoiser is role-aware: text, diffusion time, object geometry, and window-gated spatial image evidence modulate a base AdaLN stream; sparse waypoints use a dedicated AdaLN stream; frame-aware visual tokens enter through final-layer cross-attention with the learnable motion tokens.
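A minimal sketch of the temporal gating step, under stated assumptions: the text above only says the window is smooth and centered at $\hat{t}$, so the Gaussian shape, the width parameter `sigma`, and the function names here are illustrative choices rather than the paper's definition. Frames are 0-indexed in the code.

```python
import math


def temporal_window(num_frames, t_hat, sigma):
    """Gaussian gate per frame: 1 at t_hat, decaying smoothly away from it.

    The Gaussian form and sigma are assumptions; any smooth window would fit
    the description.
    """
    return [math.exp(-((t - t_hat) ** 2) / (2.0 * sigma ** 2))
            for t in range(num_frames)]


def gate_spatial_token(token, gates):
    """Scale one spatial token by each frame's gate, giving a per-frame copy."""
    return [[g * v for v in token] for g in gates]


# Toy example: 9 frames, reference image localized at frame 4.
gates = temporal_window(num_frames=9, t_hat=4, sigma=1.5)
gated = gate_spatial_token([1.0, -2.0], gates)
```

With this gating, the spatial token passes through unchanged at the predicted reference frame and is progressively attenuated for frames farther away, so distant frames are driven mainly by the other conditioning sources.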

Image Adherence under Mirrored Reference


Same prompt and waypoints, mirrored reference image
The same text prompt and the same waypoints, with the reference image mirrored. IMAGIN-4D produces qualitatively different motions whose body pose, grasp side, and contact pattern follow the depicted snapshot. A pooled-token image baseline collapses both inputs onto a similar motion: text and waypoints dominate, and the fine-grained visual evidence is lost. This mirrored-reference test checks directly whether image conditioning contributes information beyond text and trajectory.
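The mirrored-reference protocol can be sketched as a simple check on one attribute, grasp side. This is an illustrative stand-in and not the image adherence metric, whose definition is not given here: `mirror_image`, `grasp_side`, and `follows_mirror` are hypothetical helpers, the nearest-hand rule is an assumed proxy for grasp side, and the hand and object trajectories below are hand-written placeholders for generator outputs.

```python
import math


def mirror_image(image):
    """Flip an image horizontally; image is a list of rows of pixel values."""
    return [list(reversed(row)) for row in image]


def grasp_side(left_hand, right_hand, obj):
    """Return which hand stays closer to the object on average.

    Each argument is a list of (x, y, z) positions, one per frame.
    Nearest-hand distance is an assumed proxy for grasp side.
    """
    left = sum(math.dist(h, o) for h, o in zip(left_hand, obj)) / len(obj)
    right = sum(math.dist(h, o) for h, o in zip(right_hand, obj)) / len(obj)
    return "left" if left < right else "right"


def follows_mirror(side_original, side_mirrored):
    """True if the grasp side swaps when only the reference image is mirrored."""
    return side_original != side_mirrored


# Build the mirrored reference; prompt, geometry, and waypoints stay fixed.
image = [[1, 2, 3], [4, 5, 6]]
mirrored = mirror_image(image)

# Placeholder trajectories standing in for the two generated sequences.
obj = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.5)]
near = [(0.1, 1.0, 0.0), (0.1, 1.0, 0.5)]
far = [(0.6, 0.9, 0.0), (0.6, 0.9, 0.5)]
side_a = grasp_side(left_hand=near, right_hand=far, obj=obj)  # original image
side_b = grasp_side(left_hand=far, right_hand=near, obj=obj)  # mirrored image
```

A generator that ignores the image would yield the same grasp side for both references, so `follows_mirror(side_a, side_b)` would be false; a generator that follows the image should swap sides.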

Comparison with State of the Art


Qualitative comparison with image-free and image-conditioned baselines
Qualitative comparison against image-free and image-conditioned baselines. Image-free baselines (e.g., CHOIS, InterDiff, MDM) follow the text prompt and the waypoints but cannot resolve grasp side, contact region, or relative layout. Single-token image-conditioned baselines (pooled CLIP / DINOv2 / Qwen-VL features, and our re-implementation of a concurrent image-conditioned method) improve average motion statistics but still misplace the human relative to the object. IMAGIN-4D recovers the depicted interaction state while preserving waypoint precision close to the text-and-waypoint baseline.

Sketch-to-Motion


Sketch-to-motion: line-drawing references drive HOI generation
Beyond RGB references. Re-training the image branch on line drawings yields a sketch-to-motion variant of IMAGIN-4D: an animator can scribble the interaction snapshot, and the same spatio-temporal conditioning mechanism drives the full motion. The SFIE + FAIE design transfers without modification, suggesting that the conditioning relies on structured visual evidence rather than RGB photorealism.

Synthetic Image Rendering Pipeline


SceneImg / MeshImg / EditImg rendering pipeline
Training images for image-conditioned HOI. Existing HOI datasets contain motion-capture sequences but no paired reference images. We render three image domains from the same motion data: SceneImg places the body, with BEDLAM textures, and the object in photoreal indoor scenes; MeshImg is a white-background body-object render; and EditImg applies an editing pass to SceneImg with a frozen text-to-image model. This lets us train on photoreal frames and probe cross-domain robustness at evaluation time.