Vivarium — Phase-1: face-quality experiment

The UltraShape comparison showed refinement doesn't sharpen faces, because at this scale a face is texture- and source-sprite-bound, not geometry-bound. Phase-1 tested two cheap, sprite-side levers on the archmage to see which actually moves the needle — control vs. each variant, same geometry & paint settings otherwise.

Pipeline under test: arcane sprite → Hunyuan3D-2.1 shape → decimate + PBR paint @200K → textured GLB
Only the input image (A) or the sprite framing (B) changed; everything downstream was held constant.

Variants

[Figure: control, A and B side by side. Panels: full front; face crop (zoomed).]

Variant	Lever tested	What changed	Cost
control	—	the shipped archmage (1024² sprite → shape → paint)	~4.5 min
A	higher-res input	RealESRGAN-upscale the sprite to 2048² before paint (same coarse mesh re-painted)	+~30 s
B	face-emphasis sprite	regenerate the sprite with a waist-up / "detailed face" prompt → new shape → paint	+~3.5 min

Verdict. A: no gain. Upscaling the input sprite is a dud: A is visually identical to control. The paint pipeline downsamples its conditioning, so the extra input resolution is thrown away. Don't repeat it.
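A toy illustration of why A can't help, under the assumption that the paint stage resizes its conditioning image to a fixed resolution (the report doesn't state that size). The nearest-neighbour upscale below is only a stand-in for RealESRGAN, which is not nearest-neighbour, so read this as the limiting case where the upscaler adds no new information:

```python
# Toy model only: assumes the paint stage resizes its conditioning image
# to a fixed resolution. Images are plain lists of grey values.

def upscale_nearest_2x(img):
    """Double width and height by repeating every pixel (adds no new detail)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def downsample_box_2x(img):
    """Halve width and height by averaging each 2x2 block."""
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
         for x in range(0, len(img[0]), 2)]
        for y in range(0, len(img), 2)
    ]

sprite = [[0, 64], [128, 255]]               # stand-in for the control sprite
variant_a = upscale_nearest_2x(sprite)       # stand-in for the upscaled input
conditioning = downsample_box_2x(variant_a)  # what the paint stage would see

print(conditioning == sprite)
```

A real upscaler would not round-trip this exactly, but whatever it adds above the conditioning resolution is averaged away in the same manner.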

B: inconclusive. The face-emphasis regen isn't meaningfully sharper, for two reasons: (1) Z-Image largely ignored the "waist-up" framing and drew a full figure anyway, so the face didn't gain pixels; (2) the archmage's face is occluded by hat brim and beard, which makes it a poor test subject with little face to improve.

Conclusion & next lever. The two cheapest sprite-side tricks don't help. The remaining promising lever is higher-resolution paint (more views + larger paint render resolution), which adds texels directly to whatever face exists — to be tested on an unobstructed-face character (apprentice / witch), with a forceful portrait prompt if framing is retested.

Phase-1b — forcing the framing (Z-Image vs SANA-Sprint)

The "waist-up" attempt (B) failed because the phrase was weak and buried in full-body context — not because Z-Image can't frame. Re-run with a portrait-first prompt (full-body tokens removed), 4 seeds each, on an unobstructed-face apprentice:

Both generators obey the strong prompt. Z-Image (top) frames the close-up crisply and on-style; SANA-Sprint (bottom) frames fine but renders softer and more realistic (style-drift risk). Conclusion: the framing fix is prompt surgery, not a generator swap, so keep Z-Image. A true close-up carries much more face detail (more pixels land on the face), but it reconstructs as a bust, so in-world full-body characters still want higher-res paint; the portrait is a separate hero/closeup asset.

On SANA for speed: its 2-step advantage only applies to the 2D sprite (~55 s of a ~270 s character) — shape + paint dominate, so it's marginal (~18%) for 3D characters, but a large win for 2D-only mass sprite generation.
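The arithmetic behind that estimate can be sketched as below. The 55 s and 270 s figures are the ones quoted above; the 8x sprite-step speedup is a hypothetical value chosen only for illustration, not a measurement from this experiment.

```python
SPRITE_S = 55.0   # 2D sprite step, as quoted above
TOTAL_S = 270.0   # whole 3D character, as quoted above

def saving_fraction(step_s, total_s, step_speedup):
    """Fraction of total wall time saved when one step runs `step_speedup`x faster."""
    return (step_s - step_s / step_speedup) / total_s

# Ceiling: even a free sprite step cannot save more than its share of the total.
ceiling = SPRITE_S / TOTAL_S

# Hypothetical speedup, not measured here.
assumed_speedup = 8.0
saving_3d = saving_fraction(SPRITE_S, TOTAL_S, assumed_speedup)
saving_2d = saving_fraction(SPRITE_S, SPRITE_S, assumed_speedup)

print(f"ceiling {ceiling:.1%}, 3D {saving_3d:.1%}, 2D-only {saving_2d:.1%}")
```

With these numbers the ceiling for a 3D character is about 20% and an 8x faster sprite step lands near the ~18% quoted, whereas a 2D-only job would save most of its time.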

Phase-1b — reconstructed in 3D

Best seed from each generator (s2) → Hunyuan shape + paint. Both reconstruct cleanly as busts: the close-up framing yields far more face detail than a full-body sprite carries.

Z-Image (left) reconstructs a crisp, on-style game face. SANA-Sprint (right) is softer / more photoreal and comes out slightly waxier (single-view reconstruction punishes realism), and it drifts from the stylized cast. Timings were identical bar the sprite step — shape ~114 s + paint ~90 s dominate — so SANA's 2-step speed is marginal for 3D characters. Keep Z-Image for 3D; SANA's speed only pays off for 2D-only mass sprite work.

Phase-2 — higher-res paint on the full body

Re-painted the full-body apprentice's shape with more views + higher per-view resolution (8 views @768 vs the baseline 6 @512; texture already 4096). Left = baseline, right = Phase-2:

Marginal. The Phase-2 face is slightly cleaner but not transformative — at ~2.5× the paint time (187 s vs 76 s). Higher paint resolution can only resample the source sprite's ~70 px face; it can't add detail that isn't there.
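A rough budget check for these numbers, assuming "@512" and "@768" mean square views of that side length (an assumption; the report gives only the single figure):

```python
def rendered_pixels(views, resolution):
    """Total pixels rendered across all paint views (square views assumed)."""
    return views * resolution * resolution

baseline_px = rendered_pixels(6, 512)   # baseline: 6 views @512
phase2_px = rendered_pixels(8, 768)     # Phase-2: 8 views @768

pixel_ratio = phase2_px / baseline_px   # how much more is rendered
time_ratio = 187 / 76                   # measured paint times, in seconds

# The source sprite's face is ~70 px across, so it carries at most about
# this many pixels of face information however many views are painted.
SOURCE_FACE_PX = 70
source_face_area = SOURCE_FACE_PX ** 2

print(f"{pixel_ratio:.1f}x pixels for {time_ratio:.2f}x paint time")
print(f"source face: about {source_face_area} px")
```

Phase-2 renders three times the pixels for roughly 2.5× the time, but the face information going in stays fixed at what those ~70 px contain.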

Every downstream lever so far (geometry refinement, input upscaling, higher-res paint) gives little or nothing; all signs point at the source sprite's small face as the bottleneck. Phase-2b asks: can we just upscale that source?

Phase-2b — does upscaling the source help?

Re-painted the same full-body shape @768 from an ESRGAN-upscaled 2048 source (face ~70 → ~140 px) vs the plain 1024 source — same geometry, same paint res, only the source differs:

No — near-identical. Upscaling adds pixels but not face information; ESRGAN can't invent detail the generator never drew.

Corrected final conclusion. Face quality is bounded by the face detail the generator actually draws, which is set by framing — a close-up makes the model render a real, detailed face; a full-body shot doesn't. No post-hoc pixel trick (upscaling, hi-res paint, geometry refinement) recovers it. So: hero / closeup → portrait busts; world full-body with a sharp face → a generative face pass (img2img / inpaint the face region at high detail before reconstruction), not upscaling.
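A minimal sketch of the region planning such a face pass would need, not an implementation of it. `Box`, `face_pass_plan`, the padding factor, the 1024 working size and the example face position are all hypothetical; the img2img / inpaint call itself is left out because the report names no tool for it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top

def face_pass_plan(face, image_size, work_size=1024, pad=0.5):
    """Return (crop_box, scale) for a square, padded crop around `face`.

    The crop would be resized by `scale` to `work_size`, redrawn by a
    generative pass (not shown), then scaled back and pasted into the sprite.
    """
    side = min(image_size, round(max(face.width, face.height) * (1 + 2 * pad)))
    cx = (face.left + face.right) // 2
    cy = (face.top + face.bottom) // 2
    # Clamp so the square crop stays inside the image.
    left = min(max(cx - side // 2, 0), image_size - side)
    top = min(max(cy - side // 2, 0), image_size - side)
    return Box(left, top, left + side, top + side), work_size / side

# Hypothetical ~70 px face in a 1024 px full-body sprite.
face = Box(477, 120, 547, 190)
crop, scale = face_pass_plan(face, image_size=1024)
print(crop, f"face spans about {face.width * scale:.0f} px in the working crop")
```

The point of the crop is that the generator gets to redraw the face at close-up scale, which is the condition under which the experiments above found it draws real detail; a plain resize of the same crop would be upscaling again.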
