Sapphire Radeon RX 9060 XT Nitro+ OC Review - The Fastest RX 9060 XT


AMD RDNA 4 Architecture


AMD didn't provide a block diagram for Navi 44, so we pieced one together

The Radeon RX 9060 XT is powered by the new Navi 44, AMD's second gaming GPU silicon based on the RDNA 4 graphics architecture. This is a monolithic chip built on the 4 nm TSMC N4P foundry node, and packs 29.7 billion transistors across a die area of 199 mm². On paper, this is half of the Navi 48 chip powering the Radeon RX 9070 series, but with one key change that sets it apart from its predecessors, Navi 33 and Navi 23: the chip comes with a full PCI-Express 5.0 x16 host interface, just like Navi 48, and not a truncated x8 interface. This is also unlike the NVIDIA GB206 chip powering the RTX 5060 Ti, which still uses x8, as does its smaller sibling, the RTX 5060. The full x16-wide PCIe Gen 5 interface may help in certain cases, such as using this GPU on older systems with Gen 3 x16 host interfaces, where an x8 card would be left with only eight Gen 3 lanes. Given that AMD is still launching processors with PCIe Gen 3, such as the Ryzen 7 5700GT "Cezanne," the company might be onto something.
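To put rough numbers on the Gen 3 scenario, here is a minimal sketch of how link bandwidth works out once card and host negotiate the lower generation and the narrower lane count. The per-lane transfer rates and the 128b/130b encoding come from the PCIe specifications, not from this review, and the helper name `link_bandwidth_gbs` is ours; protocol overhead beyond line encoding is ignored.

```python
# Per-lane transfer rate in GT/s for each PCIe generation (assumption: taken
# from the PCIe specifications, not from the review text).
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130  # Gen 3 and newer use 128b/130b line encoding


def link_bandwidth_gbs(card_gen, card_lanes, host_gen, host_lanes):
    """Approximate one-direction bandwidth in GB/s after link negotiation.

    The link runs at the lower of the two generations and the narrower
    of the two lane counts.
    """
    gen = min(card_gen, host_gen)
    lanes = min(card_lanes, host_lanes)
    return GT_PER_LANE[gen] * ENCODING * lanes / 8  # bits -> bytes


# A Gen 5 x16 card and a Gen 5 x8 card, both in a Gen 3 x16 slot.
x16_on_gen3 = link_bandwidth_gbs(5, 16, 3, 16)
x8_on_gen3 = link_bandwidth_gbs(5, 8, 3, 16)

print(f"Gen 5 x16 card in Gen 3 x16 slot: {x16_on_gen3:.2f} GB/s")
print(f"Gen 5 x8 card in Gen 3 x16 slot:  {x8_on_gen3:.2f} GB/s")
```

In this simplified model the x16 card keeps twice the host bandwidth of an x8 card on the same Gen 3 platform; whether that translates into frame rates depends on the game and on how much data spills over the bus.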

Navi 44 comes with 32 compute units across 2 shader engines, each of which has 16 CUs arranged as 8 RDNA 4 dual compute units. Each CU contains 64 stream processors, so Navi 44 has 2,048 of them, along with 64 AI accelerators, 32 RT accelerators, 128 TMUs, and 64 ROPs. The chip also has 32 MB of Infinity Cache memory, all of which is enabled on the RX 9060 XT. The memory interface is 128-bit wide, and drives either 16 GB or 8 GB of 20 Gbps GDDR6 memory for 320 GB/s of memory bandwidth. This might be low compared to the RTX 5060 series with its 28 Gbps GDDR7 yielding 448 GB/s of bandwidth, but AMD claims that it has made several memory management advances with RDNA 4, which should generationally improve memory performance akin to a significant bandwidth increase. Both the 16 GB and 8 GB variants of the RX 9060 XT max out the Navi 44 silicon. The 16 GB variant comes with a total board power value of 160 W.
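The headline figures above follow from simple arithmetic, sketched below. The function name is ours, and the 128-bit bus width used for the RTX 5060 series is an assumption not stated in this article (it is consistent with the 448 GB/s quoted above).

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bus width times per-pin data rate, in bytes."""
    return bus_width_bits * data_rate_gbps / 8


# RX 9060 XT: 128-bit bus, 20 Gbps GDDR6 (figures from the article).
rx_9060_xt_bw = memory_bandwidth_gbs(128, 20)

# RTX 5060 series: 28 Gbps GDDR7 (from the article), 128-bit bus (assumption).
rtx_5060_bw = memory_bandwidth_gbs(128, 28)

# Shader count: 32 compute units with 64 stream processors each.
stream_processors = 32 * 64

print(f"RX 9060 XT: {rx_9060_xt_bw:.0f} GB/s")
print(f"RTX 5060 series: {rtx_5060_bw:.0f} GB/s")
print(f"Stream processors: {stream_processors}")
```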


At the heart of the RDNA 4 graphics architecture is the new dual compute unit, with a vastly improved memory sub-system, improvements made to the scalar units, a new technology called dynamic register allocation, and improvements to CU efficiency and engine clocks. Each CU has two scheduler blocks, driving a 192 KB general purpose register (GPR) file, an 8 KB scalar GPR file, 32 FMA ALUs, and 32 FMA+INT ALUs. There are also 8 transcendental logic units. RDNA 4 introduces the concept of dual SIMD32 vector units, for even more parallelism. The scalar unit comes with support for newer Float32 ops. Schedulers are updated with accelerated spill/fill operations, and instruction prefetching is improved.


The new generation AI accelerator offers 2x the 16-bit and 4x the 8-bit/4-bit dense matrix compute rates of its predecessor, support for 4:2 structured sparsity, which doubles throughput again, and matrix loads with transpose. AMD has incorporated many technologies from its CDNA 3 Instinct AI/ML accelerators into the AI accelerators of RDNA 4, including enhanced and power-optimized WMMA, improvements to the ops per CU, and support for the FP8 formats E4M3 and E5M2.
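For readers unfamiliar with structured sparsity, the idea is that only 2 values in every group of 4 weights are non-zero, so the hardware can store just those two plus their positions and skip half of the multiplications. The sketch below illustrates the general technique on a single dot product; it is not AMD's implementation, and the function names and storage layout are hypothetical.

```python
def compress_4_2(weights):
    """For every group of 4 weights, keep the 2 largest magnitudes and their indices."""
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    kept = []
    for base in range(0, len(weights), 4):
        group = range(base, base + 4)
        # Indices of the two largest-magnitude weights, restored to index order.
        top2 = sorted(sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:2])
        kept.extend((i, weights[i]) for i in top2)
    return kept


def sparse_dot(kept, activations):
    """Dot product that only touches the stored (index, weight) pairs."""
    return sum(w * activations[i] for i, w in kept)


# Weights that already follow the 2-of-4 pattern, so nothing is lost.
weights = [0.9, 0.0, -0.4, 0.0, 0.0, 0.7, 0.0, 0.2]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

kept = compress_4_2(weights)
dense_result = sum(w * a for w, a in zip(weights, activations))
sparse_result = sparse_dot(kept, activations)

print(f"multiplications: dense={len(weights)}, sparse={len(kept)}")
print(f"dense={dense_result:.2f}, sparse={sparse_result:.2f}")
```

The sparse path performs half as many multiply-adds for the same result, which is where the "doubling" comes from; models whose weights do not already follow this pattern have to be pruned first, which can cost accuracy.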


The new generation Ray Accelerator comes with double the box and triangle intersection resources of the RDNA 3 RT accelerator, support for hardware instance transforms, improvements to the RT stack management, BVH8 node compression, and a new feature called oriented bounding boxes. To limit the number of rays that actually need to be tested against an object, modern ray tracing technologies use a bounding box, which defines a region in which geometry has to be tested against rays. Most of the time, the geometry is much smaller than its bounding box and of a very different shape, which introduces false intersections and wastes ray testing resources. With oriented bounding boxes, AMD lets the box be rotated so that it is aligned with the object rather than with the coordinate axes, which makes it fit the shape more tightly and reduces the number of rays that need to be tested against the object.
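The effect is easy to reproduce in software. The sketch below uses the textbook slab test for ray-box intersection and applies it to a thin diagonal beam, once with an axis-aligned box and once with a box rotated to match the beam. It illustrates the general technique only; the scene, the function names, and the way the box orientation is stored are all our assumptions, not details of AMD's hardware.

```python
import math


def ray_hits_aabb(origin, direction, lo, hi):
    """Slab test: does origin + t * direction (t >= 0) pass through the box [lo, hi]?"""
    t_near, t_far = 0.0, math.inf
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:
                return False  # parallel to this slab and outside of it
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def ray_hits_obb(origin, direction, center, axes, half_extents):
    """Rotate the ray into the box's local frame, then reuse the slab test."""
    rel = [o - c for o, c in zip(origin, center)]
    local_o = [sum(a * r for a, r in zip(axis, rel)) for axis in axes]
    local_d = [sum(a * d for a, d in zip(axis, direction)) for axis in axes]
    return ray_hits_aabb(local_o, local_d, [-h for h in half_extents], half_extents)


def enclosing_aabb(center, axes, half_extents):
    """Smallest axis-aligned box that contains the oriented box."""
    reach = [sum(abs(axis[i]) * h for axis, h in zip(axes, half_extents)) for i in range(3)]
    return [c - r for c, r in zip(center, reach)], [c + r for c, r in zip(center, reach)]


# Hypothetical scene: a thin beam running diagonally from (0, 0, 0) to (1, 1, 0).
s = 1 / math.sqrt(2)
center = (0.5, 0.5, 0.0)
axes = [(s, s, 0.0), (-s, s, 0.0), (0.0, 0.0, 1.0)]  # orthonormal box axes
half_extents = (math.sqrt(2) / 2, 0.05, 0.05)
lo, hi = enclosing_aabb(center, axes, half_extents)

up = (0.0, 0.0, 1.0)
corner_ray = (0.9, 0.1, -1.0)  # passes through an empty corner of the AABB
center_ray = (0.5, 0.5, -1.0)  # passes through the beam itself

corner_aabb = ray_hits_aabb(corner_ray, up, lo, hi)
corner_obb = ray_hits_obb(corner_ray, up, center, axes, half_extents)
center_aabb = ray_hits_aabb(center_ray, up, lo, hi)
center_obb = ray_hits_obb(center_ray, up, center, axes, half_extents)

print(f"corner ray: AABB hit={corner_aabb}, OBB hit={corner_obb}")
print(f"center ray: AABB hit={center_aabb}, OBB hit={center_obb}")
```

The corner ray is a false intersection for the axis-aligned box, which would send it on to more expensive triangle tests, while the oriented box rejects it right away. The ray through the middle of the beam is accepted by both.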


This graph highlights the contribution of various components toward the 100% generational ray traversal performance gain over the RDNA 3 baseline, which allowed AMD to make do with a CU count of 64 on the larger Navi 48.


Both ray tracing and ML acceleration are memory-sensitive workloads, so AMD reworked its memory management system with the introduction of out-of-order memory. All math is executed in waves on an RDNA GPU, and dependencies between waves can stall the memory request stream, as one wave's memory requests wait in the queue for another wave's to complete. This is solved with a new out-of-order (relaxed ordering) memory management scheme, in which requests from different waves no longer have to be returned in the order they were issued.
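A toy model shows why this matters. In the sketch below, two waves share one memory request queue: wave A misses the cache and waits a long time, wave B hits and could continue almost immediately. The latencies, the one-request-per-cycle issue rate, and the function name are all hypothetical and chosen only for illustration; this is not a model of the actual RDNA 4 memory pipeline.

```python
def completion_times(requests, in_order):
    """Return the cycle at which each wave has received all of its data.

    requests: list of (wave, latency) tuples, issued one per cycle.
    in_order: if True, a request cannot return before the one issued ahead of it.
    """
    done = {}
    prev_return = 0
    for issue_cycle, (wave, latency) in enumerate(requests):
        ready = issue_cycle + latency
        if in_order:
            ready = max(ready, prev_return)  # stuck behind the older request
        prev_return = ready
        done[wave] = max(done.get(wave, 0), ready)
    return done


# Wave A misses the cache (hypothetical 400 cycles), wave B hits (hypothetical 20 cycles).
requests = [("A", 400), ("B", 20), ("B", 20), ("A", 400)]

in_order = completion_times(requests, in_order=True)
out_of_order = completion_times(requests, in_order=False)

print(f"in-order:     {in_order}")
print(f"out-of-order: {out_of_order}")
```

With strict ordering, wave B only gets its data once wave A's slow request has returned; with relaxed ordering it can resume after a couple dozen cycles, while wave A is no worse off.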


On AMD, a fairly big chunk of the ray tracing stack continues to be executed on shaders, but the company has made advances to ensure the cost of ray tracing on the shader resources of the GPU is minimal, with the introduction of Dynamic Registers to improve parallelism.


The new Radiance 2 Display Engine comes with major hardware updates that reduce GPU idle power draw in multi-monitor setups. The engine also comes with hardware flip-metering support (something NVIDIA also introduced with Blackwell, and which enables Multi-Frame Gen on the RTX 50-series). Flip-metering improves video frame pacing to the GPU and reduces CPU overhead for video playback. There is also a display engine level hardware image sharpening component that drives Radeon Image Sharpening. As for I/O, you get contemporary DisplayPort 2.1a and HDMI 2.1b, with DisplayPort topping out at UHBR 13.5.


Navi 44 comes with an updated media engine which can perform concurrent encoding and decoding. The new generation media engine offers a 25% increase in H.264 low-bitrate encode image quality, and an 11% improvement in HEVC encode quality. AV1 encode and decode get B-frame support, vastly improving bitrate efficiency. The media engine posts a 50% generational performance uplift (measured in encoder/decoder frame rates), with reductions in memory overhead.
Jun 13th, 2025 18:38 EEST