Welcome to the D3D12 LinAlg Matrix Preview release!
Today, we are excited to announce the preview release of the D3D12 Linear Algebra APIs! This feature set unlocks comprehensive hardware acceleration for matrix-oriented operations across a variety of use cases. Previously, we announced the WaveMMA and Cooperative Vectors features, which supported narrow matrix operation use cases; the LinAlg feature set announced today subsumes those APIs into a single set of orthogonal APIs. With today’s announcement, developers can both efficiently drive neural rendering techniques directly from individual shader threads in real-time graphics pipelines and utilize higher-bandwidth matrix MMA operations for ML and image processing applications, all in one combined API.
The application of machine learning techniques is now ubiquitous across the industry. For graphics development, neural network based rendering methods, which we’ve been calling neural rendering, are quickly growing in popularity. At the same time, demand for offloading high-bandwidth matrix compute onto the GPU is at an all-time high. As such, GPU vendors continue to adopt and expand specialized hardware for matrix operations, and the new LinAlg Matrix APIs put the power of that hardware into your hands!
This blog post is part of the larger SM6.10 preview announcement. See the parent blog post for the full feature set, and see the GDC 2026 blog here, which puts this feature in the context of the overall ML story for DirectX.
Motivation
Unlocking efficient use of the GPU’s specialized matrix hardware is the core motivation for the introduction of the LinAlg Matrix APIs. Thanks to the preview process, we were able to go back to the drawing board and evolve the previous design. Thank you for all the feedback and please keep it coming! We will continue to evolve the LinAlg Matrix APIs over the preview period in response to real world feedback.
The new API supports three modes of operation (called Matrix Scopes), each targeting a different key matrix use case:
MatrixScope::Thread (previously previewed as Cooperative Vectors)
A thread-scope matrix fits into a traditional rendering pipeline in place of another shader. One example is running inference on a neural network trained to compute lighting. Because only a single shader is replaced, ML techniques are easy to adopt. The driver understands the inference at a high level, so it can be mapped to dedicated hardware acceleration.
MatrixScope::Wave (previously previewed as WaveMMA)
High bandwidth dedicated matrix multiplication hardware is increasingly available in contemporary GPUs. A wave-scope matrix surfaces access to this hardware for complex machine learning and image processing applications. A typical application may employ smaller matrices or manually tiled larger matrices for hardware accelerated matrix-matrix multiplications.
MatrixScope::ThreadGroup
MatrixScope::ThreadGroup is new to the LinAlg Matrix API. It supports all the operations of a wave-scope matrix above while serving a different use case. The input and weight matrices used in LLM-like networks are much larger than the allowed sizes for wave-scope matrices. To serve this case with a wave-scope matrix, manual tiling is mandatory, and achieving cross-hardware performance would require multiple different kernels. Conversely, a threadgroup-scope matrix’s larger size avoids manual tiling. The tiling decision shifts to the driver, allowing you to ship a single implementation while still retaining optimal tiling.
Feature Overview
Shader Model 6.10 introduces high-level linear algebra APIs built on top of the Long Vectors and Native DXIL Vectors features released as part of SM6.9. The high-level API is lowered to a “mid level” API consumed by the driver. The mid-level API preserves rich context, enabling the driver to take advantage of the underlying hardware’s capabilities, while the high-level API enables better source-level usage rules and fast iteration. The API is centered on the new Matrix type, provided as a permissively licensed HLSL source header. Depending on how a specific Matrix instance is declared, various operations are enabled or disabled at compile time. For example, MatrixScope::Thread is roughly limited to matrix-vector operations, while MatrixScope::Wave and MatrixScope::ThreadGroup are roughly limited to matrix-matrix operations. You can view the full table of available operations in the LinAlg spec.
Code Examples
Below are examples of primary use cases for the Matrix header, each serving a different goal with different available operations. You can find some of these examples here on GitHub.
Cooperative Vectors Example
// Compiled with the line below
// bin/dxc -I ./include/hlsl -T cs_6_10 -enable-16bit-types coop-vec-example.hlsl
// System header containing the LinAlg Matrix APIs
#include <dx/linalg.h>
// The API is nested under dx::linalg. Simplify the example by using it
using namespace dx::linalg;
// Byte Address Buffer to load/store the matrices
ByteAddressBuffer InBuff : register(t0);
[numthreads(8, 1, 1)]
[shader("compute")]
void main() {
  // The Matrix type names can get quite long. Alias them for readability.
  // Looking at the template arguments we have:
  // ComponentType::F16 - The matrix holds an F16 type
  // 16 - The M dimension of the matrix is 16
  // 16 - The N dimension of the matrix is 16
  // MatrixUse::A - The Matrix is an "A" matrix, so it only fits into the "A"
  //   slot of various functions
  // MatrixScope::Thread - The Matrix is a "Thread" matrix, so it may only be
  //   used with "Thread Matrix" operations. These are the operations
  //   previously covered under the Cooperative Vector API
  using MatrixATy =
      Matrix<ComponentType::F16, 16, 16, MatrixUse::A, MatrixScope::Thread>;
  // Set up data for later by loading the matrix and creating null vectors
  vector<float16_t, 16> Vec = (vector<float16_t, 16>)0;
  vector<float16_t, 16> Bias = (vector<float16_t, 16>)0;
  MatrixATy MatA = MatrixATy::Load<MatrixLayout::RowMajor>(
      InBuff, 0, /* Row stride = number of columns * element size */ 16 * 2);
  // Do an F16 Matrix x Vector multiply
  vector<float16_t, 16> Layer1 = Multiply<float16_t>(MatA, Vec);
  // Do an F16 Matrix x Vector multiply with a bias vector
  vector<float16_t, 16> Layer2 = MultiplyAdd<float16_t>(MatA, Layer1, Bias);
  // Create a reference to an in-memory vector at offset 4096 in InBuff
  // without actually loading it
  VectorRef<ComponentType::F8_E4M3FN, 16> MemBias = {InBuff,
                                                     /*start offset*/ 4096};
  // Do an F16 Matrix x Vector multiply with a bias vector stored in memory
  vector<float16_t, 16> Layer3 = MultiplyAdd<float16_t>(MatA, Layer2, MemBias);
  // Create some packed data
  vector<uint8_t4_packed, 4> SomeData = (vector<uint8_t4_packed, 4>)0;
  // Do a MatVecMulAdd but reinterpret SomeData as F8_E4M3FN with a bias
  // stored in memory
  vector<float16_t, 16> Layer4 = MultiplyAdd<float16_t>(
      MatA, MakeInterpretedVector<ComponentType::F8_E4M3FN>(SomeData), MemBias);
  // Do a MatVecMulAdd but reinterpret SomeData as F8_E4M3FN with a regular
  // bias vector
  vector<float16_t, 16> Layer5 = MultiplyAdd<float16_t>(
      MatA, MakeInterpretedVector<ComponentType::F8_E4M3FN>(SomeData), Bias);
  // Create some uint data
  vector<uint, 16> SomeData2 = (vector<uint, 16>)0;
  // Do a MatVecMulAdd but convert SomeData2 from U32 to F8_E4M3FN first
  vector<float16_t, 16> Layer6 = MultiplyAdd<float16_t>(
      MatA, Convert<ComponentType::F8_E4M3FN, ComponentType::U32>(SomeData2),
      MemBias);
}
OuterProduct and InterlockedAccumulate Example
// Compiled with the line below
// bin/dxc -I ./include/hlsl -T cs_6_10 -enable-16bit-types outerproduct-example.hlsl
// System header containing the LinAlg Matrix APIs
#include <dx/linalg.h>
// The API is nested under dx::linalg. Simplify the example by using it
using namespace dx::linalg;
// Byte Address Buffer to load/store from
RWByteAddressBuffer OutBuff : register(u0);
[numthreads(8, 1, 1)]
[shader("compute")]
void main() {
  // The Matrix type names can get quite long. Alias them for readability.
  // Looking at the template arguments we have:
  // ComponentType::F16 - The matrix holds an F16 type
  // 16 - The M dimension of the matrix is 16
  // 8 - The N dimension of the matrix is 8
  // MatrixUse::Accumulator - The Matrix is an "Accumulator" matrix, so it
  //   only fits into the "Accumulator" slot of various functions
  // MatrixScope::Thread - The Matrix is a "Thread" matrix, so it may only be
  //   used with "Thread Matrix" operations.
  using MatrixAccumTy = Matrix<ComponentType::F16, 16, 8,
                               MatrixUse::Accumulator, MatrixScope::Thread>;
  // Create some F16 vectors with placeholder data
  vector<float16_t, 16> VecA = (vector<float16_t, 16>)0;
  vector<float16_t, 8> VecB = (vector<float16_t, 8>)0;
  // Create an Accumulator matrix by taking the outer product of the two vectors
  MatrixAccumTy MatAcc =
      OuterProduct<ComponentType::F16>(VecA, VecB);
  // Atomically accumulate the result into the output buffer
  MatAcc.InterlockedAccumulate(OutBuff, 0);
}
Wave Matrix Example
// Compiled with the line below
// bin/dxc -I ./include/hlsl -T cs_6_10 -enable-16bit-types linalg-wave.hlsl
// System header containing the LinAlg Matrix APIs
#include <dx/linalg.h>
// The API is nested under dx::linalg. Simplify the example by using it
using namespace dx::linalg;
// This shader performs matrix multiplication C = α*A*B + β*C
// where A, B, and C are matrices of dimensions MxK, KxN, and MxN respectively.
// The shader uses wave-level parallelism to compute tiles of the output matrix
// C. Each wave computes a TILE_SIZExTILE_SIZE tile of C. The dispatch must
// allocate waves for each tile of the MxN output matrix.
// GEMM constants
cbuffer GemmConstants : register(b0)
{
  float alpha; // Scalar multiplier for A*B
  float beta;  // Scalar multiplier for existing C
}
ByteAddressBuffer MatrixA;
ByteAddressBuffer MatrixB;
RWByteAddressBuffer MatrixC;
// Matrix dimensions - can be configured as needed
#define M 1024 // Rows in A and C
#define N 1024 // Columns in B and C
#define K 1024 // Columns in A, rows in B
#define TILE_SIZE 16
// Optimized GEMM using wave-level parallelism
[numthreads(TILE_SIZE, 1, 1)]
[shader("compute")]
void main(uint3 group_id : SV_GroupID)
{
  // Matrix type definitions for wave scope
  using MatrixATy = Matrix<ComponentType::F16, TILE_SIZE, TILE_SIZE,
                           MatrixUse::A, MatrixScope::Wave>;
  using MatrixBTy = Matrix<ComponentType::F16, TILE_SIZE, TILE_SIZE,
                           MatrixUse::B, MatrixScope::Wave>;
  using MatrixResultTy = Matrix<ComponentType::F32, TILE_SIZE, TILE_SIZE,
                                MatrixUse::Accumulator, MatrixScope::Wave>;
  // Calculate tile coordinates for this thread group
  uint tile_row = group_id.y;
  uint tile_col = group_id.x;
  // Initialize the accumulator
  MatrixResultTy c_tile = MatrixResultTy::Splat(0.0f);
  // Perform tiled matrix multiplication across the K dimension
  for (uint k = 0; k < K; k += TILE_SIZE)
  {
    // Calculate byte offsets for the A and B tiles
    uint a_offset = ((tile_row * TILE_SIZE) * K + k) * sizeof(half);
    uint b_offset = (k * N + (tile_col * TILE_SIZE)) * sizeof(half);
    // Load the A and B tiles for this K iteration from the ByteAddressBuffers
    MatrixATy a_k_tile = MatrixATy::Load(
        MatrixA, a_offset, K * sizeof(half), MatrixLayout::RowMajor);
    MatrixBTy b_k_tile = MatrixBTy::Load(
        MatrixB, b_offset, N * sizeof(half), MatrixLayout::RowMajor);
    // Multiply and accumulate with mixed precision (half inputs -> float
    // accumulation)
    c_tile.MultiplyAccumulate(a_k_tile, b_k_tile);
  }
  // Calculate the output offset for the GEMM equation: C = α*A*B + β*C
  uint c_offset = ((tile_row * TILE_SIZE) * N + (tile_col * TILE_SIZE)) * sizeof(float);
  // Load the existing C tile
  MatrixResultTy c_existing = MatrixResultTy::Load(
      MatrixC, c_offset, N * sizeof(float), MatrixLayout::RowMajor);
  // Apply the GEMM scaling element-wise: α*A*B + β*C
  for (uint i = 0; i < c_tile.Length(); i++) {
    float ab_val = c_tile.Get(i);
    float c_val = c_existing.Get(i);
    float result = alpha * ab_val + beta * c_val;
    c_tile.Set(i, result);
  }
  c_tile.Store(MatrixC, c_offset, N * sizeof(float), MatrixLayout::RowMajor);
}
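The dispatch for this wave example is sized so that every TILE_SIZE x TILE_SIZE tile of C gets its own thread group. Below is a hedged host-side sketch of that sizing; the commandList, gemmRootSig, and gemmPso names are illustrative assumptions rather than part of the LinAlg API, and resource binding is elided.
// Hedged host-side sketch: launch one thread group per output tile.
// gemmRootSig / gemmPso are hypothetical objects created elsewhere.
const UINT M = 1024;
const UINT N = 1024;
const UINT TILE_SIZE = 16;
commandList->SetComputeRootSignature(gemmRootSig.Get());
commandList->SetPipelineState(gemmPso.Get());
// ... bind GemmConstants, MatrixA, MatrixB, and MatrixC here ...
// group_id.x walks tile columns and group_id.y walks tile rows in the shader
commandList->Dispatch(N / TILE_SIZE, M / TILE_SIZE, 1);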
ThreadGroup Matrix Example
// Compiled with the line below
// bin/dxc -I ./include/hlsl -T cs_6_10 -enable-16bit-types linalg-threadgroup.hlsl
// System header containing the LinAlg Matrix APIs
#include <dx/linalg.h>
// The API is nested under dx::linalg. Simplify the example by using it
using namespace dx::linalg;
// This shader performs matrix multiplication C = α*A*B + β*C
// where A, B, and C are matrices of dimensions MxK, KxN, and MxN respectively.
// The shader uses threadgroup-level parallelism to compute tiles of the output
// matrix C. The GPU driver will generate code to split the matrix into optimal
// tiles based on the hardware capabilities.
// GEMM constants
cbuffer GemmConstants : register(b0)
{
  float alpha; // Scalar multiplier for A*B
  float beta;  // Scalar multiplier for existing C
}
ByteAddressBuffer MatrixA;
ByteAddressBuffer MatrixB;
RWByteAddressBuffer MatrixC;
// Matrix dimensions - can be configured as needed
#define M 1024 // Rows in A and C
#define N 1024 // Columns in B and C
#define K 1024 // Columns in A, rows in B
// Optimized GEMM using threadgroup-level parallelism
[numthreads(1024, 1, 1)]
[shader("compute")]
void main()
{
  // Matrix type definitions for threadgroup scope. Note that B is KxN to
  // match the GEMM equation above.
  using MatrixATy = Matrix<ComponentType::F16, M, K,
                           MatrixUse::A, MatrixScope::ThreadGroup>;
  using MatrixBTy = Matrix<ComponentType::F16, K, N,
                           MatrixUse::B, MatrixScope::ThreadGroup>;
  using MatrixResultTy = Matrix<ComponentType::F32, M, N,
                                MatrixUse::Accumulator, MatrixScope::ThreadGroup>;
  MatrixATy a_matrix = MatrixATy::Load(
      MatrixA, 0, K * sizeof(half), MatrixLayout::RowMajor);
  MatrixBTy b_matrix = MatrixBTy::Load(
      MatrixB, 0, N * sizeof(half), MatrixLayout::RowMajor);
  // Load the existing C matrix for the GEMM equation: C = α*A*B + β*C
  MatrixResultTy c_existing = MatrixResultTy::Load(
      MatrixC, 0, N * sizeof(float), MatrixLayout::RowMajor);
  // Compute A*B
  MatrixResultTy ab_result = Multiply<ComponentType::F32>(a_matrix, b_matrix);
  // Apply the GEMM scaling element-wise: α*A*B + β*C
  for (uint i = 0; i < ab_result.Length(); i++) {
    float ab_val = ab_result.Get(i);
    float c_val = c_existing.Get(i);
    float result = alpha * ab_val + beta * c_val;
    ab_result.Set(i, result);
  }
  ab_result.Store(MatrixC, 0, N * sizeof(float), MatrixLayout::RowMajor);
}
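Because the threadgroup-scope matrices span the full problem size and the shader never reads a group ID, a single thread group drives the whole multiplication, leaving the tiling to the driver. A hedged host-side sketch follows; as above, the object names are illustrative and binding is elided.
// Hedged sketch: one thread group of 1024 threads covers the entire GEMM;
// the driver chooses the hardware tiling. threadgroupGemmPso is hypothetical.
commandList->SetComputeRootSignature(gemmRootSig.Get());
commandList->SetPipelineState(threadgroupGemmPso.Get());
// ... bind GemmConstants, MatrixA, MatrixB, and MatrixC here ...
commandList->Dispatch(1, 1, 1);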
Data Preparation
There are a couple of D3D methods for converting weight and bias matrix data between formats:
enum D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT {
  D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_ROW_MAJOR,
  D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_COLUMN_MAJOR,
  D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_MUL_OPTIMAL,
  D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_OUTER_PRODUCT_OPTIMAL
};
For instance, D3D12_LINEAR_ALGEBRA_MATRIX_LAYOUT_MUL_OPTIMAL is a device-specific layout for optimal use with the Matrix-Vector operations such as MultiplyAdd in the code example above.
See ID3D12DevicePreview::GetLinearAlgebraMatrixConversionDestinationInfo() and ID3D12CommandListPreview::ConvertLinearAlgebraMatrix() in the D3D LinAlg spec here.
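As a rough illustration of that flow, the sketch below converts row-major weights into the device’s MUL_OPTIMAL layout. Only the two method names come from the spec; the struct names follow the methods’ naming and their fields are deliberately elided as assumptions, so consult the spec for the exact signatures.
// Hedged sketch of the conversion flow; see the D3D LinAlg spec for the
// real struct definitions.
D3D12_LINEAR_ALGEBRA_MATRIX_CONVERSION_DEST_INFO destInfo = {};
// ... describe the desired destination: MUL_OPTIMAL layout, matrix
// dimensions, and data type (assumed fields) ...
devicePreview->GetLinearAlgebraMatrixConversionDestinationInfo(&destInfo);
// destInfo now reports the required destination size; allocate a GPU buffer
// of that size, then record the conversion on the command list:
D3D12_LINEAR_ALGEBRA_MATRIX_CONVERSION_INFO convInfo = {};
// ... fill in the source GPU VA (row-major weights), the destination GPU VA,
// and the source/destination layouts (assumed fields) ...
commandListPreview->ConvertLinearAlgebraMatrix(&convInfo, /*DescCount*/ 1);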
Get Running
LinAlg is part of Shader Model 6.10, currently in preview. This requires:
Device Support:

| Vendor | Status |
|--------|--------|
| NVIDIA | Contact your developer relations representative for in-development driver access. |
| Intel | Support planned in an upcoming release. |
| AMD | AMD Software: AgilitySDK Developer Preview Edition 25.30.41.02 |
| WARP | Available in the latest WARP software rasterizer preview, available here. |
Checking for Support
To enable the LinAlg preview with the AgilitySDK above, turn on experimental feature support in code before creating a D3D12 device:
UUID Features[] = { D3D12ExperimentalShaderModels };
ThrowIfFailed(D3D12EnableExperimentalFeatures(_countof(Features), Features, nullptr, nullptr));
The API provides many different dimensions of hardware support. To fully explore the granular API, see the documentation here.
To quickly get started with LinAlg, query the device for Tier 1 support:
D3D12_FEATURE_DATA_LINEAR_ALGEBRA_SUPPORT linearAlgebraSupport = {};
HRESULT hr = device->CheckFeatureSupport(
    D3D12_FEATURE_LINEAR_ALGEBRA_SUPPORT,
    &linearAlgebraSupport,
    sizeof(linearAlgebraSupport));
if (SUCCEEDED(hr) && linearAlgebraSupport.LinearAlgebraTier >= D3D12_LINEAR_ALGEBRA_TIER_1)
{
    // Device supports Tier 1 linear algebra operations
}
Supported Tier 1 features are listed here; the other support tiers are described in the same spec.
PIX
As usual, Day One PIX support is available. Check here for the latest information.
Content from GPU Vendors
AMD
Linear Algebra Matrix is supported on AMD Radeon™ RX 9000 series graphics products using the AMD Software: AgilitySDK Developer Preview Edition 25.30.41.02 driver.
Intel
We’re working on a Linear Algebra implementation leveraging our XMX cores and expect to share it with ISVs later this year. This new API replaces cooperative vectors and enables efficient use of vector-matrix and matrix-matrix multiplication. It’s a key enabler for neural rendering techniques like texture set neural compression and more. We’re excited to see how developers will leverage this capability and can’t wait to see all the cool new rendering algorithms that will be developed on top of it!
– Matthäus Chajdas, Senior Principal Engineer
NVIDIA
Contact your developer relations representative for in-development driver access.