Fused Collapsing for Wide BVH Construction

Wilhem Barbier

Mathias Paulin

IRIT, Université de Toulouse, CNRS

Real-time ray tracing

Direct lighting

Reflections

Path tracing

Needs an acceleration structure

Acceleration structures

Goal: efficiently find intersections between a ray and a set of triangles

⤷ Wide bounding volume hierarchy (BVH)

Category	Architecture	Structure
Hardware RT	AMD RDNA3	4-wide BVH
Hardware RT	AMD RDNA4	8-wide BVH
Hardware RT	NVIDIA Lovelace	??-wide BVH
Hardware RT	Intel Xe-HPG	6-wide BVH
Software RT	CWBVH	8-wide BVH
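To make "wide BVH" concrete, here is a minimal sketch of an N-wide node and of testing one ray against all of its child boxes. The names `AABB`, `WideNode`, `ray_hits_box` and `hit_children` are illustrative assumptions, not taken from any of the architectures listed above, and real layouts are compressed and hardware-specific.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class AABB:
    lo: Vec3  # minimum corner
    hi: Vec3  # maximum corner


@dataclass
class WideNode:
    # Hypothetical layout: up to N child boxes stored in the parent,
    # each paired with a reference (child node index or triangle index).
    child_boxes: List[AABB] = field(default_factory=list)
    child_refs: List[int] = field(default_factory=list)


def ray_hits_box(origin: Vec3, inv_dir: Vec3, box: AABB,
                 t_max: float = float("inf")) -> bool:
    """Standard slab test: intersect the ray's parameter intervals per axis."""
    t_near, t_far = 0.0, t_max
    for axis in range(3):
        t0 = (box.lo[axis] - origin[axis]) * inv_dir[axis]
        t1 = (box.hi[axis] - origin[axis]) * inv_dir[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
    return t_near <= t_far


def hit_children(node: WideNode, origin: Vec3, direction: Vec3) -> List[int]:
    """Return the references of all children whose box the ray enters."""
    # Assumption: zero direction components map to +inf; the degenerate case
    # of an origin lying exactly on a slab plane is not handled here.
    inv_dir = tuple(1.0 / d if d != 0.0 else float("inf") for d in direction)
    return [ref for box, ref in zip(node.child_boxes, node.child_refs)
            if ray_hits_box(origin, inv_dir, box)]
```

A wider node tests more boxes per visit but needs fewer visits to reach the triangles, which is why the width is tied to the hardware or software traversal unit.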

Motivation

Challenge: Building these acceleration structures efficiently on a GPU

Use case: complex dynamic scenes

Changing topology and significant deformations require a full rebuild

Very expensive!

Our focus: reducing build time, at the cost of some tracing speed

Worth it when build cost is a bottleneck

BVH construction on the GPU

Challenge: Building these acceleration structures efficiently on a GPU

Binary BVH builders only

Binary builder + separate collapsing

Overview

No additional traversal!

Maximize build speed, at the cost of some tracing performance.

Main contributions

Bottom-up collapsing

Traverse a binary BVH bottom-up, output a wide BVH.
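The slides do not spell out the collapsing rule, so the following is only a sequential sketch of one plausible bottom-up policy: each binary node is visited after its children, and a child's wide node is dissolved into the parent whenever the result still fits in `width` slots. `BinaryNode`, `WideNode` and `collapse_bottom_up` are hypothetical names, and the greedy "dissolve if it fits" rule is an assumption rather than the authors' exact method.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class BinaryNode:
    left: Optional["BinaryNode"] = None
    right: Optional["BinaryNode"] = None
    prim: Optional[int] = None  # set only for leaves


@dataclass
class WideNode:
    children: List[Union["WideNode", int]]  # sub-nodes or primitive ids


def collapse_bottom_up(node: BinaryNode, width: int) -> Union[WideNode, int]:
    """Post-order (bottom-up) collapse of a binary BVH into a wide BVH."""
    if node.prim is not None:
        return node.prim  # leaves pass through as primitive ids

    # Children are already collapsed by the time the parent is processed.
    children = [collapse_bottom_up(node.left, width),
                collapse_bottom_up(node.right, width)]

    # Assumed policy: pull a child's children up into this node while the
    # total number of slots stays within `width`.
    changed = True
    while changed:
        changed = False
        for i, child in enumerate(children):
            if (isinstance(child, WideNode)
                    and len(children) - 1 + len(child.children) <= width):
                children[i:i + 1] = child.children
                changed = True
                break
    return WideNode(children)
```

With `width=4`, a balanced binary tree over four leaves becomes a single full node, while one over eight leaves ends with a root that uses only two of its four slots. Under-filled nodes of this kind are one way a purely local bottom-up rule can cost tree quality.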

PLOC algorithm [Meister17]
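For readers who do not know PLOC, here is a sequential sketch of its loop as commonly described: clusters are kept in Morton order, each cluster searches a fixed radius for the neighbour that minimises the surface area of the merged box, and mutually nearest pairs are merged until a single cluster remains. The function names and the tuple-based box representation are assumptions made for this example. A GPU version runs each phase in parallel and compacts the cluster array between iterations.

```python
from math import inf


def surface_area(box):
    (lx, ly, lz), (hx, hy, hz) = box
    dx, dy, dz = hx - lx, hy - ly, hz - lz
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def union(a, b):
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


def ploc_step(clusters, radius):
    """One iteration. `clusters` is a Morton-ordered list of (box, node)."""
    n = len(clusters)

    # Phase 1: nearest-neighbour search within `radius` positions.
    nearest = []
    for i in range(n):
        best, best_cost = -1, inf
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            if j == i:
                continue
            cost = surface_area(union(clusters[i][0], clusters[j][0]))
            if cost < best_cost:
                best, best_cost = j, cost
        nearest.append(best)

    # Phase 2: merge mutually nearest pairs, keep everything else as-is.
    out = []
    for i in range(n):
        j = nearest[i]
        if nearest[j] == i:
            if i < j:  # the lower index emits the merged cluster
                box = union(clusters[i][0], clusters[j][0])
                out.append((box, (clusters[i][1], clusters[j][1])))
        else:
            out.append(clusters[i])
    return out


def ploc_build(leaves, radius=2):
    """Repeat merge iterations until a single root cluster remains."""
    clusters = list(leaves)
    while len(clusters) > 1:
        clusters = ploc_step(clusters, radius)
    return clusters[0]
```

Each iteration creates new binary nodes, which is what makes the loop a natural place to hook in additional per-merge work.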

Fusing with PLOC

We integrate our bottom-up collapsing within the PLOC loop.

Tree quality

Merge penalty

If two clusters have different numbers of references: penalize their distance
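The slide gives the idea of the penalty but not its formula, so the sketch below only illustrates the mechanism. The function name `penalized_distance` and the constant multiplicative factor `penalty=2.0` are assumptions chosen for the example, not values from the talk.

```python
def penalized_distance(base_distance: float, refs_a: int, refs_b: int,
                       penalty: float = 2.0) -> float:
    """Distance used in the nearest-neighbour search.

    `base_distance` would typically be the surface area of the merged box.
    Clusters with equal reference counts keep it unchanged; a mismatch is
    made less attractive by the (hypothetical) factor `penalty`.
    """
    if refs_a != refs_b:
        return base_distance * penalty
    return base_distance


# Two candidates for the same cluster (which holds 2 references):
same_count = penalized_distance(12.0, refs_a=2, refs_b=2)   # slightly larger box
mixed_count = penalized_distance(10.0, refs_a=2, refs_b=1)  # smaller box, mismatch
preferred = "same_count" if same_count < mixed_count else "mixed_count"
```

In this example the search prefers the equal-count neighbour even though its merged box is slightly larger.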

Results: experimental setting

Compared algorithms:

H-PLOC with fused collapsing (ours)
H-PLOC with top-down collapsing [Benthin24]

All measurements done on an NVIDIA RTX 3090

Evaluated both BVH4 and BVH8
Four complex dynamic scenes (1M to 7M tris)
Benchmarked using 1 diffuse secondary ray per pixel at 1920x1080
Software ray tracing

More results in the paper!

Test scenes

Results: timings

Build time: consistent reduction across all scenes \( (\times 0.65\text{--}0.71) \)

Combined time: overall reduction when build time dominates \( (\times 0.74\text{--}0.81)\)

Limitation: lower tree quality leads to higher trace time \( (\times 1.11\text{--}1.37) \)
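The trade-off between the three ratios above can be checked with a few lines of arithmetic. The helper names and any input times are assumptions made for this sketch; the slides report only the ratios, not per-scene timings.

```python
def combined_ratio(build_time: float, trace_time: float,
                   build_factor: float, trace_factor: float) -> float:
    """New (build + trace) time divided by the baseline (build + trace) time."""
    new_total = build_time * build_factor + trace_time * trace_factor
    return new_total / (build_time + trace_time)


def break_even_build_to_trace(build_factor: float, trace_factor: float) -> float:
    """Baseline build/trace ratio above which the combined time goes down.

    From b*fb + t*ft < b + t it follows that b/t > (ft - 1) / (1 - fb).
    Assumes build_factor < 1 and trace_factor >= 1.
    """
    return (trace_factor - 1.0) / (1.0 - build_factor)
```

For instance, pairing the least favourable build and trace factors (×0.65 and ×1.37, a pairing assumed here for illustration) gives a threshold of about 1.06, so the baseline build must take slightly longer than tracing before the combined time improves. Pairing ×0.71 with ×1.11 lowers the threshold to about 0.38.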

Conclusion

Building wide BVHs is an understudied problem.

We propose a construction algorithm that focuses on maximizing build speed.

There is room for new approaches!

Questions?

Partial reordering

Our construction procedure results in non-deterministic node ordering

Ordering matters for cache locality during tracing (node size < cache line)

Partial breadth-first reordering: top few levels only

Spawn a single thread group once
Queue in shared memory
Group synchronization

Very fast (~1% of build time)
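The reordering pass can be sketched on the CPU as a breadth-first walk limited to the top levels, with a queue standing in for the shared-memory queue mentioned above. The function name, the `children` adjacency-list representation and the `max_levels` parameter are assumptions for this example.

```python
from collections import deque


def partial_bfs_reorder(children, root, max_levels):
    """Move the top `max_levels` levels to the front in breadth-first order.

    `children[i]` lists the child node indices of node i (empty for leaves).
    Returns (order, remapped), where order[new_index] = old_index and
    `remapped` is the child list rewritten with the new indices.
    """
    order = []
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        order.append(node)
        if level + 1 < max_levels:  # stop descending below the top levels
            queue.extend((c, level + 1) for c in children[node])

    # Deeper nodes keep their original relative order after the top levels.
    placed = set(order)
    order.extend(i for i in range(len(children)) if i not in placed)

    new_index = {old: new for new, old in enumerate(order)}
    remapped = [[new_index[c] for c in children[old]] for old in order]
    return order, remapped
```

Only the first few levels are touched, so the pass stays cheap while still placing the nodes that most rays visit first next to each other in memory.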
