ICML 2026

LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents

Tianhao Zhao*, Youjia Zhang*, Hang Long, Jinshen Zhang, Wenbing Li, Yang Yang, Gongbo Zhang, Jozef Hladky, Matthias Nießner, Wei Yang

Huazhong University of Science and Technology · Peking University · Independent Researcher · Technical University of Munich

Abstract

In this paper, we introduce LATO, a novel topology-preserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on the surface and uses a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent.

To decode the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. Finally, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing.

For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and then refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry and well-formed topology while remaining highly efficient at inference.

Paradigm

Topology-Preserving Generation

Comparison between topology-agnostic implicit generation, explicit mesh generation, and LATO
LATO vs. Existing Paradigms. Mainstream topology-agnostic approaches use vecset or voxel latents decoded into implicit fields, such as signed distance fields (SDFs), and rely on Marching Cubes for mesh extraction. Explicit mesh generation methods adopt per-face latents via autoregressive or diffusion models, but suffer from severe memory bottlenecks. LATO proposes T-Voxel latents to model topology explicitly, enabling direct generation of artist-friendly meshes.

Representation

Vertex Displacement Field

VDF is the representation that lets LATO move from a surface signal to explicit topology. For a mesh \(\mathcal{M}=(\mathbf{V},\mathbf{F})\), take a point \(\mathbf{p}\) sampled on a triangular face \(\mathbf{f}=[f_0,f_1,f_2]\). Instead of assigning a discrete vertex/edge/face label, LATO records the displacements from \(\mathbf{p}\) to the three vertices of that face.

This makes topology learnable as a dense continuous field: near-zero displacements localize vertices, while changes in the displacement triplet reveal edges between vertex pairs. The resulting point feature is voxelized and pooled into T-Voxels for scalable generation.

VDF definition

\[ \mathcal{F}(\mathbf{p}) = \left\{\mathbf{v}-\mathbf{p}\mid \mathbf{v}\in \{\mathbf{v}_{f_0},\mathbf{v}_{f_1},\mathbf{v}_{f_2}\}\right\}, \quad \mathbf{p}\in\mathbf{f} \]

The VAE input at each sampled point is \(\mathbf{x}_k=[\mathbf{p}_k,\mathcal{F}(\mathbf{p}_k),\mathbf{n}(\mathbf{p}_k)]\), combining position, VDF offsets, and surface normal.
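
The definition above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the three displacements are stored in face-vertex order (the definition only gives a set), and the flat face normal stands in for \(\mathbf{n}(\mathbf{p})\). Under those assumptions the concatenated input has 3 + 9 + 3 = 15 values.

```python
import math
import random

def sample_point_on_triangle(v0, v1, v2, rng):
    """Uniformly sample a point on a triangle using barycentric weights."""
    r1, r2 = rng.random(), rng.random()
    if r1 + r2 > 1.0:  # reflect samples that fall outside the triangle
        r1, r2 = 1.0 - r1, 1.0 - r2
    w0 = 1.0 - r1 - r2
    return tuple(w0 * a + r1 * b + r2 * c for a, b, c in zip(v0, v1, v2))

def face_normal(v0, v1, v2):
    """Unit normal of the triangle (assumed stand-in for n(p))."""
    e1 = [b - a for a, b in zip(v0, v1)]
    e2 = [c - a for a, c in zip(v0, v2)]
    n = (e1[1] * e2[2] - e1[2] * e2[1],
         e1[2] * e2[0] - e1[0] * e2[2],
         e1[0] * e2[1] - e1[1] * e2[0])
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

def vdf(p, face_vertices):
    """Displacements v - p from p to each of the three face vertices."""
    return [tuple(vc - pc for vc, pc in zip(v, p)) for v in face_vertices]

def vae_input(p, face_vertices):
    """Concatenate position, VDF offsets, and normal into one feature list."""
    feat = list(p)
    for d in vdf(p, face_vertices):
        feat.extend(d)
    feat.extend(face_normal(*face_vertices))
    return feat

# Toy triangle in the z = 0 plane (illustrative data only).
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
rng = random.Random(0)
p = sample_point_on_triangle(*tri, rng)
x = vae_input(p, tri)
```

Adding each displacement back to `p` recovers the corresponding vertex, and sampling exactly at a vertex yields a zero displacement for it, which is the cue that localizes vertices.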

Dense signal: Every surface sample contributes vertex displacement supervision.
Topology cue: Vertex locations and edge boundaries emerge from the displacement field.
Voxel-ready: Point features are pooled into sparse topology-aware T-Voxels.

Pipeline

Method

LATO turns surface-anchored VDF samples into topology-aware T-Voxels, then decodes explicit vertices and edges through a sparse hierarchical pipeline.

LATO pipeline for mesh encoding, T-Voxel generation, and topology reconstruction
Encode

Voxelize surface VDF samples with positions and normals, then compress the pooled signal into compact T-Voxels.
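
The voxelization step can be sketched as follows. This is an illustration under stated assumptions, not LATO's encoder: points are assumed to lie in the unit cube, pooling is assumed to be a per-voxel mean, and `voxelize_and_pool` is a hypothetical name. The learned VAE compression that follows pooling is not shown.

```python
def voxelize_and_pool(points, features, resolution):
    """Mean-pool per-point features into the sparse set of occupied voxels.

    points:   list of (x, y, z) with coordinates assumed to lie in [0, 1]
    features: list of equal-length feature lists, one per point
    Returns a dict mapping integer voxel index (i, j, k) to the pooled feature.
    """
    sums, counts = {}, {}
    for p, f in zip(points, features):
        # Clamp so that a coordinate of exactly 1.0 falls in the last voxel.
        key = tuple(min(int(c * resolution), resolution - 1) for c in p)
        if key not in sums:
            sums[key] = [0.0] * len(f)
            counts[key] = 0
        for i, val in enumerate(f):
            sums[key][i] += val
        counts[key] += 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}

# Three toy samples with a one-dimensional feature each.
pooled = voxelize_and_pool(
    [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)],
    [[1.0], [3.0], [5.0]],
    resolution=2,
)
```

Only voxels that contain at least one surface sample appear in the result, which is what keeps the representation sparse.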

Decode

Subdivide and prune latent voxels to place vertices; a connection head queries T-Voxels to recover edges directly.
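
A minimal sketch of the subdivide-and-prune loop is given below. It assumes an octree-style split of each voxel into eight children and a per-level keep score compared against a threshold. In LATO the pruning decision would come from the learned decoder, so the `keep_score` callables here are hypothetical stand-ins, and the connection head is not modeled.

```python
def subdivide(voxels):
    """Split every voxel (i, j, k) into its 8 children at the next finer level."""
    children = []
    for (i, j, k) in voxels:
        for di in (0, 1):
            for dj in (0, 1):
                for dk in (0, 1):
                    children.append((2 * i + di, 2 * j + dj, 2 * k + dk))
    return children

def prune(voxels, keep_score, threshold=0.5):
    """Keep only voxels whose score reaches the threshold."""
    return [v for v in voxels if keep_score(v) >= threshold]

def decode_structure(root_voxels, keep_score_per_level):
    """Alternate subdivision and pruning, one keep_score callable per level."""
    voxels = list(root_voxels)
    for keep_score in keep_score_per_level:
        voxels = prune(subdivide(voxels), keep_score)
    return voxels

def voxel_center(voxel, resolution):
    """Center of a voxel in the unit cube, a coarse proxy for a vertex position."""
    return tuple((c + 0.5) / resolution for c in voxel)

# Toy keep score: retain only voxels on the main diagonal of the grid.
on_diagonal = lambda v: 1.0 if v[0] == v[1] == v[2] else 0.0
leaves = decode_structure([(0, 0, 0)], [on_diagonal, on_diagonal])
```

After two levels the toy example keeps four of the 64 possible voxels at resolution 4. Pruning before the next split is what avoids the cubic growth of a dense grid.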

Generate

Use two-stage flow matching: synthesize sparse structure first, then generate topology features before explicit decoding.
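
The two-stage sampling order can be sketched with a generic flow-matching sampler that integrates a velocity field from noise at t = 0 to data at t = 1 using Euler steps. Everything below is an assumption made for illustration: the closed-form `straight_line_velocity` replaces the learned, conditioned velocity networks, and the candidate voxels and target values are made-up toy data. Only the ordering (structure first, then per-voxel features) follows the description above.

```python
import random

def euler_sample(x0, velocity, steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def straight_line_velocity(target):
    """Toy stand-in for a learned network: velocity along a straight path to target."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

rng = random.Random(0)

# Stage 1: sample one occupancy score per candidate voxel, then threshold.
candidates = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)]
structure_target = [1.0, -1.0, 1.0, -1.0]  # hypothetical: positive means occupied
noise = [rng.gauss(0.0, 1.0) for _ in candidates]
scores = euler_sample(noise, straight_line_velocity(structure_target))
active = [v for v, s in zip(candidates, scores) if s > 0.0]

# Stage 2: sample a feature vector only for the voxels kept by stage 1.
feature_target = [0.25, -0.5]  # hypothetical topology feature
features = {
    v: euler_sample([rng.gauss(0.0, 1.0) for _ in feature_target],
                    straight_line_velocity(feature_target))
    for v in active
}
```

Running the second stage only on active voxels is the reason for the split: the feature model never spends computation on empty space.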

Application

City Synthesis

LATO extends from object-level mesh generation to scene-scale urban synthesis by generating compositional block-wise building units.

Scene-scale urban generation results produced by LATO
Scene-scale urban generation. The structure model first synthesizes block-wise building envelopes, and detailed T-Voxel features then populate each unit into high-fidelity city blocks.
Interactive mesh

Compositional Urban Meshes

Block-wise building units are generated as sparse T-Voxels, then assembled into a larger explicit city mesh.
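
Assembling decoded blocks into one explicit mesh can be sketched as a translate-and-reindex merge. The page does not describe how LATO performs the assembly, so this is an assumed, minimal approach: each block is taken to carry its own vertex list, triangle list, and a world-space offset, and `assemble_blocks` is a hypothetical name.

```python
def assemble_blocks(blocks):
    """Merge per-block meshes into a single vertex list and face list.

    blocks: iterable of (vertices, faces, offset), where vertices are (x, y, z)
            tuples in block-local coordinates, faces are index triples into that
            block's vertices, and offset is the block's world-space translation.
    """
    all_vertices, all_faces = [], []
    for vertices, faces, offset in blocks:
        base = len(all_vertices)  # index shift for this block's faces
        all_vertices.extend(
            tuple(c + o for c, o in zip(v, offset)) for v in vertices
        )
        all_faces.extend(tuple(base + i for i in f) for f in faces)
    return all_vertices, all_faces

# Two toy single-triangle blocks placed side by side along the x axis.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
city_vertices, city_faces = assemble_blocks([
    (triangle, [(0, 1, 2)], (0.0, 0.0, 0.0)),
    (triangle, [(0, 1, 2)], (2.0, 0.0, 0.0)),
])
```

This sketch does not weld vertices shared between neighboring blocks. A real pipeline might merge coincident vertices at block boundaries.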

Experiments

Results

LATO produces explicit meshes with strong shape fidelity and more coherent topology across object and image-conditioned settings.

Geometry-conditioned generation: LATO is compared with recent explicit mesh baselines under the same geometry-conditioned setting.
Geometry-conditioned comparison between LATO and baseline methods
Additional examples: More geometry-conditioned results highlighting mesh completeness and topology quality.
Additional geometry-conditioned comparison examples for LATO
Image to 3D: Image-to-3D generation results.
Vertex number: Effect of the vertex-number condition. As the vertex-number condition scalar \(c_v\) increases, the model generates more vertices and triangles.
Effect of vertex number condition on generated mesh complexity
Inference time: Inference time comparison, evaluated on a single H100 GPU. LATO maintains rapid generation (3 to 10 s), whereas autoregressive methods scale poorly with mesh complexity, often requiring minutes for high-fidelity outputs.