IMAGIN-4D: Image-Guided Controllable Interaction Generation

Anonymous Submission

Anonymous Authors
Image-conditioned 4D HOI generation. Given a text prompt, object geometry, object waypoints, and a reference image, IMAGIN-4D synthesizes a 4D human-object interaction sequence. Text and waypoints specify the action and object trajectory, but leave fine-grained interaction details such as pose, contact, and layout ambiguous. We resolve this ambiguity with a reference image that specifies the interaction snapshot. To test whether IMAGIN-4D follows this visual evidence, we keep the text prompt, object geometry, and waypoints fixed, and mirror only the reference image. IMAGIN-4D generates different motions that satisfy the corresponding snapshot: body pose, object pose, contact, and body-object layout change consistently with the mirrored reference. This is achieved through spatio-temporal image conditioning, which separates spatial cues for the depicted interaction state from frame-aware cues for the surrounding motion. Unlike single-token image conditioning, this preserves fine-grained visual evidence while generating the HOI sequence.




Abstract

Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify the interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity with a reference image that serves as a visual specification of the desired interaction snapshot. A single global image representation is not sufficient for this purpose, because it conflates distinct cues and conditions all frames on identical visual evidence.

We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens.
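The per-frame querying described above can be illustrated with a minimal sketch, assuming plain scaled dot-product attention with no learned projections. The function name `frame_aware_tokens`, the toy patch vectors, and the hand-set queries are hypothetical and are not taken from IMAGIN-4D; in the actual model the queries are learned and conditioned on frame index and text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_aware_tokens(patches, frame_queries):
    """Return one visual token per frame via dot-product attention over patches.

    patches: list of N patch vectors (each a list of d floats).
    frame_queries: list of T query vectors (each a list of d floats).
    """
    d = len(patches[0])
    scale = 1.0 / math.sqrt(d)
    tokens = []
    for q in frame_queries:
        # Similarity of this frame's query to every image patch.
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        weights = softmax(scores)
        # Weighted average of the patches: the frame's visual token.
        token = [sum(w * p[k] for w, p in zip(weights, patches)) for k in range(d)]
        tokens.append(token)
    return tokens


# Toy example (hypothetical values): two patches, two frames whose
# queries favour different patches of the same image.
patches = [[1.0, 0.0], [0.0, 1.0]]
frame_queries = [[5.0, 0.0], [0.0, 5.0]]
mu = frame_aware_tokens(patches, frame_queries)
```

Because each frame has its own query, different segments of the sequence can draw on different regions of the same image, which a single global image token cannot express.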

Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality.


Method Overview

Given a reference image $\mathcal{I}$, text prompt $y$, object geometry $\mathcal{O}$, and sparse waypoints $\mathcal{W}$, IMAGIN-4D generates a 4D human-object motion sequence. A frozen image encoder extracts patch tokens $\mathbf{P}$ from $\mathcal{I}$. The Spatially Factorized Image Encoder (SFIE) reads these patches with role-specific queries and produces supervised latent tokens for contact $\boldsymbol{\kappa}$, human pose $\boldsymbol{\rho}$, object pose $\boldsymbol{\xi}$, and body-object spatial relation $\boldsymbol{\nu}$. These tokens are trained to match role-autoencoder latents derived from the paired motion sequence. Their concatenated summary $\bar{\boldsymbol{\zeta}}$ predicts the reference frame $\hat{t}$ depicted by the image. In parallel, the Frame-Aware Image Encoder re-queries $\mathbf{P}$ with frame- and text-conditioned queries to produce per-frame visual tokens $\boldsymbol{\mu}_t$. The motion denoiser routes conditions by role: base conditioning, waypoint features, and window-gated spatial image evidence modulate transformer layers through separate AdaLN streams, while $\boldsymbol{\mu}_t$ enters through late cross-attention. Sampling-time guidance improves image adherence.
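To make the role-aware routing more concrete, the following is a rough sketch of adaptive layer normalization (AdaLN) with several condition streams. It rests on assumptions that the page does not confirm: the per-role scale and shift are simply summed, and the window gate is a hard indicator around the reference frame $\hat{t}$. The names `adaln_multi_stream` and `window_gate`, and all numeric values, are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def window_gate(t, t_ref, half_width):
    """Hard gate (assumption): 1.0 inside the window around t_ref, else 0.0."""
    return 1.0 if abs(t - t_ref) <= half_width else 0.0


def adaln_multi_stream(x, streams, gates):
    """Modulate a normalized feature with several (scale, shift) streams.

    streams: dict role -> (scale, shift), each a list matching len(x).
    gates: dict role -> float in [0, 1]; roles not listed default to 1.0.
    Summing the streams is an assumption made for this sketch.
    """
    h = layer_norm(x)
    scale = [0.0] * len(x)
    shift = [0.0] * len(x)
    for role, (s, b) in streams.items():
        g = gates.get(role, 1.0)
        scale = [a + g * v for a, v in zip(scale, s)]
        shift = [a + g * v for a, v in zip(shift, b)]
    return [hv * (1.0 + sc) + sh for hv, sc, sh in zip(h, scale, shift)]


# Hypothetical modulation values for one motion token of width 4.
x = [1.0, 2.0, 3.0, 4.0]
streams = {
    "base": ([0.1] * 4, [0.0] * 4),
    "waypoint": ([0.0] * 4, [0.2] * 4),
    "image": ([0.5] * 4, [1.0] * 4),
}
# Frame 10 lies inside the window around reference frame 12; frame 40 does not.
near = adaln_multi_stream(x, streams, {"image": window_gate(10, 12, 4)})
far = adaln_multi_stream(x, streams, {"image": window_gate(40, 12, 4)})
```

In this sketch the image stream only affects frames near the depicted snapshot, while the base and waypoint streams modulate every frame.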

How Image Controls Motion Generation


Panels: original and mirrored reference images (Ref. Img, Mirrored Ref. Img), each paired with our result (Ours), for three prompts: "The person moves the table on the floor", "Pull the trashcan and set it down", and "Push the large box and set it down".
Mirrored-reference consistency test. We horizontally mirror only the reference image at inference time while keeping the text prompt, object geometry, and waypoints fixed. The generated contact side and body-object layout change with the mirrored image, showing that IMAGIN-4D uses visual evidence rather than ignoring the image condition. The motion is not an exact mirror because the unchanged non-image conditions must still be satisfied.
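The setup of this test can be sketched in a few lines, assuming the image is stored as a list of pixel rows. The helper `mirror_horizontal`, the toy pixel values, and the waypoint coordinates are hypothetical; only the image differs between the two sets of inputs.

```python
def mirror_horizontal(image):
    """Flip an image left-to-right. `image` is a list of rows of pixels."""
    return [list(reversed(row)) for row in image]


# Toy 2x3 "image" (hypothetical pixel values).
image = [[1, 2, 3], [4, 5, 6]]
mirrored = mirror_horizontal(image)

# Non-image conditions are shared between the two runs (waypoints are made up).
conditions = {
    "text": "Pull the trashcan and set it down",
    "waypoints": [(0.0, 0.0), (1.0, 0.5)],
}
original_inputs = {**conditions, "image": image}
mirrored_inputs = {**conditions, "image": mirrored}
```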

Comparison with State of the Art


Panels: reference image (Ref. Img) with results from Ours, CHOIS+Img, and ViHOI* for two prompts: "Lift the trashcan, move the trashcan and put it down" and "Lift the floor lamp, move the floor lamp and set it down".
Qualitative comparison on FullBodyManipulation (SceneImg). Each row shows the SceneImg reference and generated motion from CHOIS+Img, ViHOI*, and IMAGIN-4D under the same text prompt, object shape, waypoints, and reference image. Single-token image conditioning often misses contact, hand placement, object orientation, or body-object layout. IMAGIN-4D better matches the depicted interaction state.

IMAGIN-4D Gallery


Additional IMAGIN-4D results. Each pair shows the reference image and the generated motion under the same text prompt and object waypoints.

Sketch-to-Motion


Panels: sketch references, each paired with our result (Ours). Sketch r2: "Kick the base of floor lamp and set it down". Sketch r8: "Lift the suitcase, move the suitcase and put it down".
Sketch-to-motion. We replace the RGB reference image with a line drawing and retrain the model. Despite removing texture, color, and scene appearance, the model preserves the depicted interaction layout and generates a complete motion sequence. This shows that IMAGIN-4D can also support sketch-based conditioning, where users specify the desired contact and body-object arrangement with a simple drawing.

Synthetic Image Rendering Pipeline


Conditioning image domains. We render conditioning images for each ground-truth sequence in the FullBodyManipulation dataset and use the contact-centered frame for evaluation. MeshImg is a clean body-object render; SceneImg adds Replica scenes, body textures, and posed objects; and EditImg applies image editing to produce more photorealistic references. SceneImg is used for evaluation, while MeshImg and EditImg are used to analyze image-domain transfer.
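The page does not define how the contact-centered frame is chosen. One plausible reading, shown here purely as an assumption, is the middle frame of the longest contiguous run of body-object contact. The function name `contact_centered_frame` and the per-frame boolean contact labels are hypothetical.

```python
def contact_centered_frame(contact):
    """Return the middle index of the longest contiguous contact run, or None.

    contact: iterable of booleans, one per frame (True = body touches object).
    This selection rule is an assumption, not the documented procedure.
    """
    best_start, best_len = None, 0
    start = None
    # A trailing False acts as a sentinel that closes a run ending at the last frame.
    for i, c in enumerate(list(contact) + [False]):
        if c and start is None:
            start = i
        elif not c and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical contact labels for a 10-frame sequence.
contact = [False, False, True, True, True, True, True, False, True, False]
frame = contact_centered_frame(contact)
```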