Photon Labs Technical Report 2025-12

Multi-Material Control in Generative ArchViz: Solving the Semantic Contamination Problem

01. Abstract

The domain of architectural visualization is navigating a structural transition from deterministic, physics-based rendering engines toward probabilistic, generative pipelines. While Latent Diffusion Models (LDMs) excel at global synthesis, they have historically struggled to assign specific materials to precise geometric regions without semantic leakage.

This report analyzes the "Clay to Real" workflow—a methodology for transforming untextured massing models into photorealistic imagery while maintaining strict architectural fidelity.

02. The "Brick Wall" Problem

A fundamental tension exists between Structure (Geometry) and Style (Material). The diffusion process, by design, attempts to harmonize the entire latent space towards a global prompt vector. If a concept like "brick" is introduced, the model's self-attention mechanisms naturally propagate this feature across the entire image.
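The contamination pressure can be illustrated with a toy sketch of cross-attention. The dimensions, key/value vectors, and the "brick"/"glass" token labels below are all illustrative assumptions, not the model's actual tensors: when every spatial query is biased toward the dominant prompt token and nothing restricts attention spatially, every position, including the glazing, absorbs the brick features.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Standard scaled dot-product cross-attention: every spatial
    query attends to every prompt token, with no spatial restriction."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Two toy prompt tokens with orthogonal keys: "brick" and "glass".
keys = np.array([[1.0, 0.0],    # "brick" token key
                 [0.0, 1.0]])   # "glass" token key
values = np.array([[1.0, 1.0],     # toy brick material features
                   [-1.0, -1.0]])  # toy glass material features

# Four spatial positions (two facade, two glazing), all biased toward
# the dominant "brick" concept by the global prompt vector.
queries = np.tile([5.0, 0.0], (4, 1))

out = cross_attention(queries, keys, values)
print(out.mean())  # close to +1: brick features dominate every position
```

Even the positions that should render as glazing end up near the brick feature vector, which is precisely the "contamination" failure mode described above.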

In professional practice, this results in "contamination": masonry textures bleeding into glazing, or atmospheric lighting taking on the material properties of the dominant facade. To achieve production-grade results, specific material isolation must be enforced against the model's convergence tendencies.

03. Methodology: Regional Feature Injection

To resolve global contamination, strict decoupling of the 'What' (Material Token) from the 'Where' (Spatial Coordinates) is required. We achieve this through a Masked Cross-Attention Control strategy.

Rather than relying on global prompting ("a house made of brick and wood"), the pipeline operates by injecting distinct feature embeddings—derived from reference materials—directly into specific spatial tensors within the U-Net. This effectively "locks" the semantic interpretation of a region (e.g., "Timber Cladding") before the denoising scheduler can contaminate it with features from adjacent sectors.
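A minimal sketch of this masked cross-attention strategy is shown below, reusing the same toy dimensions as before. The `token_mask` encoding (one row per spatial position, one column per material token) is an illustrative assumption; production implementations operate on the U-Net's attention maps at multiple resolutions, but the mechanism of zeroing out disallowed token-position pairs before the softmax is the same.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, token_mask):
    """Cross-attention where token_mask[i, j] = 1 allows spatial
    position i to attend to prompt token j. Disallowed scores are set
    to -inf before the softmax, 'locking' each region to its token."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = np.where(token_mask.astype(bool), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

keys = np.array([[1.0, 0.0],    # "brick" token key
                 [0.0, 1.0]])   # "glass" token key
values = np.array([[1.0, 1.0],     # toy brick material features
                   [-1.0, -1.0]])  # toy glass material features

# All four queries are still globally biased toward "brick".
queries = np.tile([5.0, 0.0], (4, 1))

# Region mask: positions 0-1 are facade (brick only), positions 2-3 are
# glazing (glass only). The mask overrides the global prompt bias.
token_mask = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

out = masked_cross_attention(queries, keys, values, token_mask)
print(out[:2].mean(), out[2:].mean())  # facade -> +1.0, glazing -> -1.0
```

Despite the identical "brick-biased" queries, the glazing positions now receive pure glass features: the mask, not the prompt, decides where each material applies.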

04. Architecture Selection: U-Net vs. Transformer

A comparative analysis was conducted between established U-Net architectures (SDXL) and emerging Rectified Flow Transformers (Flux.1). While Transformers demonstrate superior prompt adherence and text encoding, the U-Net architecture presently offers a more mature ecosystem for granular control interventions.

For tasks requiring "pixel-perfect" structural adherence to a CAD model, lighter-weight guidance adapters in the SDXL ecosystem (specifically for line-art and depth) enforce a rigid "structural skeleton" that Transformers currently struggle to replicate without excessive computational overhead (~24 GB+ VRAM).
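The injection pattern these guidance adapters rely on can be sketched schematically. The block below is a toy stand-in, not an actual SDXL or ControlNet implementation: `unet_block` is a placeholder for one backbone stage, and `depth_features` stands in for the adapter's encoding of a depth map. The point it demonstrates is that structural guidance is added as a scaled residual on top of frozen backbone activations, so the "skeleton" can be strengthened or relaxed via a single conditioning scale without retraining anything.

```python
import numpy as np

def unet_block(x, weights):
    """Toy stand-in for one U-Net stage: a fixed linear map + nonlinearity."""
    return np.tanh(x @ weights)

def guided_block(x, depth_features, weights, conditioning_scale=1.0):
    """ControlNet-style injection: adapter features derived from a depth
    map are added to the backbone activations as a scaled residual,
    steering denoising toward the CAD geometry."""
    return unet_block(x, weights) + conditioning_scale * depth_features

rng = np.random.default_rng(1)
weights = rng.normal(size=(8, 8))
x = rng.normal(size=(4, 8))               # toy latent activations
depth_features = rng.normal(size=(4, 8))  # toy adapter output from a depth map

free = guided_block(x, depth_features, weights, conditioning_scale=0.0)
guided = guided_block(x, depth_features, weights, conditioning_scale=1.0)
# At scale 0 the adapter is inert; the difference is exactly the depth term.
print(np.allclose(guided - free, depth_features))  # True
```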

05. VRAM Economics & Tiled Upscaling

High-fidelity rendering implies resolutions (4K+) that far exceed the latent window of standard models. Simply stretching the tensors destroys coherence.

To bypass hardware limitations on local workstations, we implement a Sequential Tiled Processing algorithm: the image is segmented into overlapping high-resolution patches, each processed independently, and a "coherence pass" blends the overlaps to stitch tile boundaries. This democratizes production-quality rendering, allowing a single prosumer GPU to output imagery that previously required server-farm infrastructure.
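The tiling-and-blending mechanics can be sketched as follows. This is a minimal sketch under stated assumptions: a single-channel image, a hypothetical per-patch function `fn` standing in for the diffusion model, and linear feather weights for the blend. Passing an identity `fn` verifies that the overlap blending itself introduces no seams.

```python
import numpy as np

def tiled_process(image, tile, overlap, fn):
    """Sequential tiled processing: run fn on overlapping square patches
    and blend results with linear feather weights so seams average out."""
    h, w = image.shape
    out = np.zeros((h, w))
    acc = np.zeros((h, w))  # accumulated blend weights per pixel
    # Triangular ramp: weight peaks at the patch center, tapers at edges.
    ramp = lambda n: np.minimum(np.arange(1, n + 1), np.arange(n, 0, -1))
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            # Clamp so the last tile never runs past the image border.
            y0, x0 = min(y, h - tile), min(x, w - tile)
            patch = fn(image[y0:y0 + tile, x0:x0 + tile])
            wgt = np.outer(ramp(tile), ramp(tile)).astype(float)
            out[y0:y0 + tile, x0:x0 + tile] += patch * wgt
            acc[y0:y0 + tile, x0:x0 + tile] += wgt
    return out / acc

img = np.arange(64.0).reshape(8, 8)
# With an identity "denoiser" the blend must reconstruct the image exactly;
# a real pipeline would invoke the diffusion model on each patch instead.
result = tiled_process(img, tile=4, overlap=2, fn=lambda p: p)
print(np.allclose(result, img))  # True
```

In production the per-patch call is the expensive step, so peak VRAM is bounded by the tile size rather than the output resolution, which is what makes 4K+ output feasible on a single prosumer GPU.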

06. Conclusion

The solution to semantic contamination lies in treating the generative process not as a "prompt-and-pray" workflow, but as a series of composable, masked injections. By mathematically restricting where feature vectors can apply, architectural specificity is preserved within a probabilistic framework.

[Interactive figure: before/after comparer of the clay massing input vs. the rendered output]