IMAGIN-4D: Image-Guided Controllable Interaction Generation

Anonymous Submission

Anonymous Authors

Image-conditioned 4D HOI generation. Given a text prompt, object geometry, object waypoints, and a reference image, IMAGIN-4D synthesizes a 4D human-object interaction sequence. Text and waypoints specify the action and object trajectory, but leave fine-grained interaction details such as pose, contact, and layout ambiguous. We resolve this ambiguity with a reference image that specifies the interaction snapshot. To test whether IMAGIN-4D follows this visual evidence, we keep the text prompt, object geometry, and waypoints fixed, and mirror only the reference image. IMAGIN-4D generates different motions that satisfy the corresponding snapshot: body pose, object pose, contact, and body-object layout change consistently with the mirrored reference. This is achieved through spatio-temporal image conditioning, which separates spatial cues for the depicted interaction state from frame-aware cues for the surrounding motion. Unlike single-token image conditioning, this preserves fine-grained visual evidence while generating the HOI sequence.

Abstract

Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, which control action semantics and object trajectories. These signals underspecify the interaction, however: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity by using a reference image as a visual specification of the desired interaction snapshot. Conditioning on a single global image representation is not enough, because it conflates distinct cues and conditions all frames on identical visual evidence.

We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens.
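The two conditioning paths described above can be illustrated with a minimal sketch. It is not the IMAGIN-4D implementation: it assumes plain scaled dot-product attention for the frame-aware tokens and the common AdaLN form `(1 + scale) * norm(x) + shift`, and it omits all learned projections. The names `frame_aware_tokens`, `adaln`, `frame_queries`, and `patch_feats` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def frame_aware_tokens(frame_queries, patch_feats):
    """Attend over image patches once per generated frame.

    frame_queries: T query vectors of dim D, one per motion frame (assumed).
    patch_feats:   P patch vectors of dim D, used as both keys and values.
    Returns T pooled vectors of dim D and the T x P attention weights.
    """
    dim = len(patch_feats[0])
    scale = 1.0 / math.sqrt(dim)
    tokens, weights = [], []
    for q in frame_queries:
        attn = softmax([dot(q, k) * scale for k in patch_feats])
        token = [sum(w * v[d] for w, v in zip(attn, patch_feats))
                 for d in range(dim)]
        tokens.append(token)
        weights.append(attn)
    return tokens, weights


def adaln(x, shift, scale, eps=1e-5):
    """Adaptive LayerNorm: normalize x, then modulate it with a
    condition-derived shift and scale (assumed form, one stream)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [((v - mean) / std) * (1.0 + sc) + sh
            for v, sc, sh in zip(x, scale, shift)]
```

Because each frame has its own query, two frames can place their attention mass on different patches of the same image, which is what distinguishes this path from conditioning every frame on one global image token. In this sketch, separate AdaLN streams would correspond to separate `(shift, scale)` pairs computed from text, waypoints, and interaction-state tokens.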

Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality.

How Image Controls Motion Generation

Panels for each prompt: Ref. Img with Ours, then Mirrored Ref. Img with Ours.

"The person moves the table on the floor"

"Pull the trashcan and set it down"

"Push the large box and set it down"

Mirrored-reference consistency test. We horizontally mirror only the reference image at inference time while keeping the text prompt, object geometry, and waypoints fixed. The generated contact side and body-object layout change with the mirrored image, showing that IMAGIN-4D uses visual evidence rather than ignoring the image condition. The motion is not an exact mirror because the unchanged non-image conditions must still be satisfied.
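The setup of this test can be sketched as follows. The condition keys (`text`, `waypoints`, `image`) and the helper names are hypothetical, and the image is represented as a plain list of pixel rows for illustration; the sketch only shows that the two inputs differ in the reference image alone.

```python
def mirror_horizontal(image):
    """Flip an image left-right. `image` is a list of rows of pixels."""
    return [list(reversed(row)) for row in image]


def build_mirrored_pair(conditions):
    """Return (original, mirrored) condition dicts that differ only in 'image'.

    Text, waypoints, and any other entries are carried over unchanged.
    """
    mirrored = dict(conditions)  # shallow copy keeps the other conditions fixed
    mirrored["image"] = mirror_horizontal(conditions["image"])
    return conditions, mirrored
```

Running the generator on both dictionaries and comparing the contact side and body-object layout of the two outputs would then isolate the effect of the image condition.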

Comparison with State of the Art

Panels for each prompt: Ref. Img, Ours, CHOIS+Img, ViHOI*.

"Lift the trashcan, move the trashcan and put it down"

"Lift the floor lamp, move the floor lamp and set it down"

Qualitative comparison on FullBodyManipulation (SceneImg). Each row shows the SceneImg reference and generated motion from CHOIS+Img, ViHOI*, and IMAGIN-4D under the same text prompt, object shape, waypoints, and reference image. Single-token image conditioning often misses contact, hand placement, object orientation, or body-object layout. IMAGIN-4D better matches the depicted interaction state.

IMAGIN-4D Gallery

Panels for each prompt: Ref. Img, Ours.

"Lift the clothes stand, move it and put it down"

"Push the chair, drag it and set it back down"

"Push the plastic box and set it back down"

"Lift the chair, move it and put it down"

Additional IMAGIN-4D results. Each pair shows the reference image and the generated motion under the same text prompt and object waypoints.

Sketch-to-Motion

Panels for each prompt: Sketch, Ours.

"Kick the base of floor lamp and set it down"

"Lift the suitcase, move the suitcase and put it down"

Sketch-to-motion. We replace the RGB reference image with a line drawing and retrain the model. Despite removing texture, color, and scene appearance, the model preserves the depicted interaction layout and generates a complete motion sequence. This shows that IMAGIN-4D can also support sketch-based conditioning, where users specify the desired contact and body-object arrangement with a simple drawing.

Synthetic Image Rendering Pipeline

Conditioning image domains. We render conditioning images for each ground-truth sequence in the FullBodyManipulation dataset and use the contact-centered frame for evaluation. MeshImg is a clean body-object render; SceneImg adds Replica scenes, body textures, and posed objects; EditImg applies image editing to obtain more photorealistic references. SceneImg is used for evaluation, while MeshImg and EditImg are used to analyze image-domain transfer.
