SIGGRAPH 2026
TL;DR: We introduce Prox-E, a framework performing fine-grained 3D shape editing by abstracting 3D shapes into a primitive-based proxy shape, editing that shape using a VLM, and using the edited proxy shape to drive a 3D diffusion model.
Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object’s overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision–language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.
We present results on Edit3D-Bench, introduced in VoxHammer. Each column shows, in order: input, original proxy, edited proxy, and output. In the proxies, unchanged primitives are gray, edited primitives are blue, and new primitives are pink.
make the tail longer
make it wear a red hat
make the pot cubic
make it into a windmill
We present results on the ShapeTalk benchmark, from ShapeTalk / ChangeIt3D. Each column shows, in order: input, original proxy, edited proxy, and output. In the proxies, unchanged primitives are gray, edited primitives are blue, and new primitives are pink.
the legs are spaced further apart
the base has more layers
the shade is bigger
the top is more narrow
there is no stretcher connecting the front legs
the table has a sub-table underneath
Modern 2D editors are remarkably good at changing appearance or inserting new semantic content, but fine-grained structural edits pose a different challenge. These edits require reasoning about the geometry of existing parts and how they should change in 3D, often in explicitly metric terms, such as making the legs 1.5x longer or lowering the seat by a specific amount, rather than just synthesizing plausible pixels. This creates a gap between what image-based editors are optimized for and what controllable 3D editing actually demands.
Prox-E performs fine-grained 3D shape editing by first abstracting a shape into simple geometric primitives, editing that abstraction with a VLM, and then using it to guide 3D generation and appearance refinement.
Primitive-based abstraction. Given an input 3D shape, we first convert it into a compact proxy made of superquadrics.
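For readers unfamiliar with superquadrics, the sketch below evaluates the standard inside-outside function of a single primitive. The class name, field names, and the omission of rotation are assumptions made for illustration; the page does not state the exact parameterization Prox-E uses.

```python
from dataclasses import dataclass


@dataclass
class Superquadric:
    """Axis-aligned superquadric (hypothetical container; rotation omitted)."""
    scale: tuple   # semi-axis lengths (a1, a2, a3)
    eps: tuple     # shape exponents (e1, e2); (1, 1) gives an ellipsoid
    center: tuple = (0.0, 0.0, 0.0)

    def inside_outside(self, point):
        """Return F(point): < 1 inside, == 1 on the surface, > 1 outside."""
        x, y, z = ((point[i] - self.center[i]) / self.scale[i] for i in range(3))
        e1, e2 = self.eps
        xy_term = (abs(x) ** (2.0 / e2) + abs(y) ** (2.0 / e2)) ** (e2 / e1)
        return xy_term + abs(z) ** (2.0 / e1)


# Same size, different exponents: small exponents push the shape toward a box.
unit_sphere = Superquadric(scale=(1.0, 1.0, 1.0), eps=(1.0, 1.0))
boxy = Superquadric(scale=(1.0, 1.0, 1.0), eps=(0.1, 0.1))
corner = (0.9, 0.9, 0.9)
corner_in_sphere = unit_sphere.inside_outside(corner) < 1.0
corner_in_box = boxy.inside_outside(corner) < 1.0
```

A handful of scale, exponent, and position values per primitive is what makes the proxy compact enough to be written out as text and edited numerically.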
VLM-driven proxy editing. We render multi-view images of the proxy and provide them to a VLM together with an image of the original shape, the editing prompt, and the proxy in JSON format. The VLM then updates the primitive parameters directly, producing an edited proxy JSON that reflects the requested structural change.
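To make the JSON round trip concrete, here is a minimal sketch of what a proxy might look like and how an edited proxy returned by the VLM could be sanity-checked before use. The schema (`primitives`, `id`, `scale`, `eps`, `translation`) and the function `parse_proxy` are hypothetical; the page does not specify the actual format, and the "VLM response" below is a hand-written stand-in.

```python
import json

REQUIRED_KEYS = {"id", "scale", "eps", "translation"}  # hypothetical schema


def parse_proxy(raw_json):
    """Parse a proxy JSON string and check each primitive for basic sanity."""
    proxy = json.loads(raw_json)
    for prim in proxy["primitives"]:
        missing = REQUIRED_KEYS - set(prim)
        if missing:
            raise ValueError(f"primitive {prim.get('id')} is missing {sorted(missing)}")
        if len(prim["scale"]) != 3 or min(prim["scale"]) <= 0:
            raise ValueError(f"primitive {prim['id']} has an invalid scale")
    return proxy


original_proxy = {"primitives": [
    {"id": 0, "scale": [0.4, 0.2, 0.2], "eps": [1.0, 1.0], "translation": [0.0, 0.0, 0.0]},
    {"id": 1, "scale": [0.2, 0.05, 0.05], "eps": [1.0, 1.0], "translation": [0.6, 0.0, 0.0]},
]}

# Stand-in for a VLM reply to "make the tail longer": only primitive 1 changes.
vlm_reply = json.dumps({"primitives": [
    original_proxy["primitives"][0],
    {"id": 1, "scale": [0.3, 0.05, 0.05], "eps": [1.0, 1.0], "translation": [0.7, 0.0, 0.0]},
]})

edited_proxy = parse_proxy(vlm_reply)
```

Validating the reply before it reaches the generator is a design choice of this sketch: a free-form text model can omit a field or emit a non-positive scale, and catching that early is cheaper than debugging a failed generation.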
Iterative verification and correction. This editing process can be repeated: after obtaining an edited proxy, we render it again and feed the new views back to the VLM, allowing it to verify the edit or make further corrections when needed.
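The verify-and-correct loop can be sketched as follows. The stopping rule (stop once the VLM returns the proxy unchanged), the round limit, and the `render` and `vlm_edit` callables are all assumptions for illustration; the toy stand-ins below replace the real renderer and VLM.

```python
def iterative_edit(proxy, prompt, render, vlm_edit, max_rounds=3):
    """Repeat render -> VLM edit until the VLM leaves the proxy unchanged."""
    for round_index in range(max_rounds):
        views = render(proxy)
        updated = vlm_edit(views, proxy, prompt)
        if updated == proxy:
            # No further corrections: treat the edit as verified.
            return proxy, round_index + 1
        proxy = updated
    return proxy, max_rounds


def fake_render(proxy):
    """Toy stand-in for multi-view rendering: returns four view labels."""
    return [f"view_{i}" for i in range(4)]


def fake_vlm_edit(views, proxy, prompt):
    """Toy stand-in for the VLM: lengthens the tail once, then accepts it."""
    if proxy["tail_length"] < 1.5:
        return {**proxy, "tail_length": 1.5}
    return proxy


final_proxy, rounds_used = iterative_edit(
    {"tail_length": 1.0}, "make the tail longer", fake_render, fake_vlm_edit
)
```

In this toy run the first round applies the edit and the second round confirms it, so the loop ends after two rounds without reaching `max_rounds`.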
Proxy-guided structure generation. Once the edited proxy is ready, we use it to guide TRELLIS during sparse-structure generation. To do this, we classify primitives into three categories and inject a different source of information for each one.
Unchanged regions preserve the original shape. Primitives that remain unchanged mark parts of the object that should stay intact. In these regions, we inject information from the original shape to preserve its structure and identity.
New regions follow the edited proxy. Newly added primitives indicate regions where new structure should emerge. There, we inject the edited proxy, but only for a limited number of denoising steps, so that the model can produce a more natural and detailed structure in these parts.
Edited regions are warped from the original. For primitives whose pose or scale changed, we estimate the transformation from the original primitive to the edited one, warp the original shape accordingly, and inject information from this warped shape into the denoising process.
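The three-way split described above can be sketched in a few lines. Everything here is an illustrative assumption rather than the authors' implementation: primitives are matched by a hypothetical `id` field, the warp handles only per-axis scale and translation (no rotation), `new_region_steps=10` is an arbitrary placeholder, and injecting the other two categories at every step is assumed because the text does not say otherwise. Primitives removed by an edit are not handled.

```python
def classify_primitives(original, edited, tol=1e-6):
    """Split edited primitives into unchanged / edited / new by matching ids."""
    original_by_id = {p["id"]: p for p in original}
    groups = {"unchanged": [], "edited": [], "new": []}
    for prim in edited:
        source = original_by_id.get(prim["id"])
        if source is None:
            groups["new"].append(prim["id"])
        elif all(abs(a - b) <= tol
                 for key in ("scale", "translation")
                 for a, b in zip(source[key], prim[key])):
            groups["unchanged"].append(prim["id"])
        else:
            groups["edited"].append(prim["id"])
    return groups


def axis_aligned_warp(source, target):
    """Return a function mapping points of the source primitive onto the target."""
    ratio = [t / s for s, t in zip(source["scale"], target["scale"])]

    def warp(point):
        return tuple(
            target["translation"][i] + ratio[i] * (point[i] - source["translation"][i])
            for i in range(3)
        )
    return warp


def guidance_source(category, step, new_region_steps=10):
    """Name the signal injected for a region at a denoising step (None = none)."""
    if category == "unchanged":
        return "original_shape"
    if category == "edited":
        return "warped_original"
    if category == "new":
        return "edited_proxy" if step < new_region_steps else None
    raise ValueError(f"unknown category: {category}")


body = {"id": 0, "scale": [0.4, 0.2, 0.2], "translation": [0.0, 0.0, 0.0]}
tail = {"id": 1, "scale": [0.2, 0.05, 0.05], "translation": [0.6, 0.0, 0.0]}
longer_tail = {"id": 1, "scale": [0.3, 0.05, 0.05], "translation": [0.7, 0.0, 0.0]}
hat = {"id": 2, "scale": [0.1, 0.1, 0.05], "translation": [-0.3, 0.0, 0.25]}

groups = classify_primitives([body, tail], [body, longer_tail, hat])
warp_tail = axis_aligned_warp(tail, longer_tail)
tail_tip = warp_tail((0.8, 0.0, 0.0))   # tip of the original tail
tail_base = warp_tail((0.4, 0.0, 0.0))  # where the tail meets the body
```

In this toy example the tail tip moves outward while its base stays attached to the body, which is the behavior one would want from a warp that lengthens a part without detaching it.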
Appearance refinement for final details. After generating the edited sparse structure, we refine appearance and fine details using TRELLIS's appearance model, the edited proxy, and optionally a strong 2D image editor.
We present qualitative samples from the ShapeTalk benchmark, comparing our method against Spice-E, EditP23, VoxHammer, and TRELLIS with Kontext.
make the shade shorter
add a footrest
make the table shorter
@misc{sella2026proxefinegrained3dshape,
title={Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions},
author={Etai Sella and Hao Phung and Nitay Amiel and Or Litany and Or Patashnik and Hadar Averbuch-Elor},
year={2026},
eprint={2604.23774},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2604.23774},
}
We thank Prof. Daniel Cohen-Or for his valuable feedback and suggestions in early stages of the project.