NOW LET US – AI RAG SaaS Studio TP.HCM

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP


Explore the inner workings of nn.Linear and how PyTorch optimizes operations using epilogues and views. This article analyzes the differences between Eager and Compiled modes, explaining why torch.compile shines when handling complex MLP blocks.

In the first part of this series, "Profiling in PyTorch", we used torch.add(torch.matmul(x, w), b) to learn how to read PyTorch profiler traces. We also discussed several other topics that came our way: the CPU dispatch chain, launch overhead, the difference between an overhead-bound and a compute-bound regime, and some internals of torch.compile.

In the second iteration (this blog post), we climb one rung up the ladder. We replace the hand-written matmul-add pair with an nn.Linear (with bias=True), a building block of nearly every deep learning model. We then stack three of them (a choice specific to our example), with an activation in between, to form a Multilayer Perceptron (MLP) block.

The scripts for this blog post live here: 02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py. Like before, it helps to open them in a separate tab and walk through the code as you read. We use an NVIDIA A100-SXM4-80GB GPU to run the scripts. It is really easy to set up a GPU on the Hugging Face infrastructure and experiment with the scripts using Dev Mode with Spaces. One could also run the scripts with the Hugging Face Jobs pipeline.

Before we begin, a quick recap of two ideas we will lean on repeatedly:

  • A GPU kernel is a program that runs in parallel on many threads of the GPU.
  • The CPU schedules and launches these kernels. Most of the PyTorch overhead you see in a profiler trace is this scheduling work.

nn.Linear is a module wrapper around the same matrix multiplication and addition we already profiled in Part 1. The only difference is that it owns its weight and bias as parameters and exposes the forward method that PyTorch users have grown familiar with.

# bias=True would truly emulate the multiplication and addition
# operations we have seen in part 1 of the series
linear_layer = nn.Linear(in_dim, out_dim, bias=True)
y = linear_layer(x)

The operation at hand can be written as:

y = x @ w.T + b

Where x is the input, w is the weight and b is the bias. Let's run 02_linear.py and check the profile.

uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64
uvx trace-util traces -b traces

trace-util is a utility that syncs your traces to a Hugging Face bucket and then prints the Perfetto URLs in your terminal.

Figure 1 shows the profiler trace of a forward call of the linear layer. We trace the forward call with the same kind of schedule setup as in the previous traces, with wait=1, warmup=1 and active=3. This is why we see three Profile Steps in the CPU and GPU lanes.
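The link between the schedule values and the three Profile Steps can be checked without running a profile. The sketch below is a minimal illustration and not the contents of 02_linear.py: it builds a torch.profiler.schedule with the same wait, warmup and active values and asks which action the profiler would take on each step. The five-step range and the variable names are assumptions made for the example.

```python
from torch.profiler import ProfilerAction, schedule

# Same values as in the text; every other argument is left at its default.
sched = schedule(wait=1, warmup=1, active=3)

# Ask the schedule what the profiler would do on each of the first five steps.
actions = [sched(step) for step in range(5)]
for step, action in enumerate(actions):
    print(step, action.name)

# Only the recording steps show up as Profile Steps in the trace.
recording = (ProfilerAction.RECORD, ProfilerAction.RECORD_AND_SAVE)
recorded = [a for a in actions if a in recording]
print(len(recorded), "recorded steps")
```

The first step is skipped, the second warms the profiler up, and the remaining three are recorded, which matches the three Profile Steps in the trace.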

If we zoom into the profiler trace, as we do in Figure 2, we notice an aten::t (transpose) op before the aten::addmm (multiplication and addition) op. We can already figure out that nn.Linear transposes the weight parameter and then multiplies it with the input. This is the reason we see an aten::t op.

An important thing to notice is that aten::t does not really copy or reorganize data: it only rewrites tensor metadata (shape and stride) on the CPU to represent the transposed matrix. It does not launch a kernel on the GPU. One can verify this in two ways: by looking at the GPU lane in the trace, or by checking the aten::t row in the profiler table and the time it took on CUDA.
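A third check needs no profiler at all. The sketch below, which runs on the CPU and uses a hypothetical weight tensor rather than the one from 02_linear.py, confirms that the transposed tensor starts at the same memory address as the original and that a write through one is visible through the other.

```python
import torch

# Hypothetical weight with the (out_dim, in_dim) layout that nn.Linear uses.
w = torch.randn(64, 32)
w_t = w.t()

# Both tensors start at the same address: no new buffer was allocated.
same_buffer = w_t.data_ptr() == w.data_ptr()
print(same_buffer)

# Only the metadata differs.
print(w.shape, w.stride())
print(w_t.shape, w_t.stride())

# Writing through the view changes the original, because the data is shared.
w_t[0, 1] = 42.0
print(w[1, 0].item())
```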

There is no aten::add (the bias addition) in the dispatch chain of the linear layer, as seen in Figure 3. This is because the bias addition has been folded into the matrix multiplication kernel, using what is called an epilogue.

An epilogue is a small computation that a GEMM (GEneral Matrix Multiply) kernel does at the very end, just before it writes its result back to HBM (High Bandwidth Memory, the GPU's main memory). Adding a bias, applying an activation, or scaling by a constant are all classic epilogues. The point of an epilogue is to avoid loading or writing to HBM a second time, since memory traffic makes an operation expensive.
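To get a feel for the traffic an epilogue avoids, here is a back-of-the-envelope sketch. It is a toy accounting and not a measurement: it assumes a bf16 output (2 bytes per element), takes the batch and out_dim values from the command used in this post, ignores caches, and leaves out the reads of x and w because they are identical in both variants.

```python
# Toy accounting of output-side HBM traffic in bytes.
batch, out_dim = 1024, 64   # sizes from the command used in this post
bytes_per_elem = 2          # assumption: bf16 output

out_bytes = batch * out_dim * bytes_per_elem
bias_bytes = out_dim * bytes_per_elem

# Separate kernels: the matmul writes its result, then the add kernel
# reads that result plus the bias and writes the sum back.
unfused = out_bytes + (out_bytes + bias_bytes) + out_bytes

# Epilogue: the bias is read and added before the single writeback.
fused = bias_bytes + out_bytes

print("unfused:", unfused)
print("fused:  ", fused)
print("saved:  ", unfused - fused)
```

Under these assumptions the separate add kernel costs one extra read and one extra write of the whole output, which is the traffic the epilogue removes.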

nn.Linear calls torch.nn.functional.linear, which, in turn, calls aten::linear. aten::linear looks at the inputs, notices that a bias was passed, and dispatches aten::addmm(bias, x, weight) instead of doing a matmul and an add separately. addmm computes:

out = x @ weight.T + bias

The cuBLAS GEMM kernel that runs on the GPU has a bias-add variant built in, and that's the kernel aten::addmm picks. The add never appears as a separate kernel because it is part of the matmul kernel's writeback, which is exactly what an epilogue is.
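The equivalence between the three ways of writing this operation can be checked numerically. The sketch below runs on the CPU with small, made-up shapes; it compares torch.nn.functional.linear against an explicit torch.addmm call on the transposed weight view and against the plain matmul-plus-add expression. The tensor names and sizes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 32)        # (batch, in_dim)
weight = torch.randn(64, 32)  # (out_dim, in_dim), the layout nn.Linear stores
bias = torch.randn(64)        # (out_dim,)

via_linear = F.linear(x, weight, bias)
# addmm(input, mat1, mat2) computes input + mat1 @ mat2; the bias is broadcast.
via_addmm = torch.addmm(bias, x, weight.t())
via_ops = x @ weight.T + bias

linear_matches_addmm = torch.allclose(via_linear, via_addmm, atol=1e-4)
linear_matches_ops = torch.allclose(via_linear, via_ops, atol=1e-4)
print(linear_matches_addmm, linear_matches_ops)
```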

This is the moment to notice something subtle. The kernel you saw in Part 1 under --compile (addmm) is the kernel that eager nn.Linear already uses. There is nothing left for torch.compile to fuse here, which is the next thing we will verify.

Let's compile the forward call and look at the profiler trace. (The profiler trace is visualized in the next section.)

uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64 --compile
uvx trace-util traces -b traces

If you compare the eager and compiled traces for a single nn.Linear's forward, you will find:

  • The same cuBLAS GEMM kernel on the GPU.
  • The same aten::addmm op on the CPU.
  • A few extra rows on the CPU lane unique to compile.

This is worth internalizing. A common reflex is to reach for torch.compile whenever a model feels slow. For a single GEMM-with-bias, compile has very little to do. This is not a bug: compile needs more than one operation before it has anything to fuse. Let's prove that by looking at an MLP.

A careful reader of the two traces (eager vs compile) will notice that the eager CPU dispatch chain has more in it than the compiled one.

Figure 4: Eager dispatch chain where aten::linear walks through aten::t (transpose) and then aten::addmm

The eager CPU dispatch chain inside aten::linear is aten::t followed by aten::addmm (Figure 4). To understand what aten::t actually does, we need a quick detour into strides and views.

A tensor stores its data as one flat, contiguous run of numbers in memory. The shape and stride are metadata that sit on top of that run and tell PyTorch how to walk it: a stride of (s0, s1) means "step s0 elements to move one row, step s1 to move one column". Change the metadata and you get a different view of the same raw data, with no copy:

>>> M = torch.tensor([[0, 1],
...                   [2, 3],
...                   [4, 5]])
>>> M.shape, M.stride()
(torch.Size([3, 2]), (2, 1))  # two steps per row, one step per column
>>> T = M.t()  # transpose
>>> T.shape, T.stride()
(torch.Size([2, 3]), (1, 2))  # shape and stride swapped, data untouched
>>> T
tensor([[0, 2, 4],
        [1, 3, 5]])
>>> T.flatten()  # forced to materialize, so the data is reordered
tensor([0, 2, 4, 1, 3, 5])

M.t() did not move a single number. It returned a new view whose strides are swapped, so reading it row-by-row now walks the original buffer 0, 1, 2, 3, 4, 5 in transposed order. The underlying data is identical; only the metadata differs.

This is exactly what aten::t does inside the linear layer: it does not allocate a new tensor or copy any data, it produces a view of the weight with rewritten strides.
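The mechanism is simple enough to mimic in plain Python. The sketch below is a simplified model and not how PyTorch is implemented: read_view is a hypothetical helper that reads a 2-D view out of a flat list using nothing but a shape and a stride, applied to the same six numbers as the example above.

```python
def read_view(buffer, shape, stride):
    """Read a 2-D view out of a flat buffer using only shape and stride."""
    rows, cols = shape
    s0, s1 = stride
    # Element (r, c) lives at offset r * s0 + c * s1 in the flat buffer.
    return [[buffer[r * s0 + c * s1] for c in range(cols)] for r in range(rows)]


buffer = [0, 1, 2, 3, 4, 5]  # the raw data, never modified or copied

M = read_view(buffer, shape=(3, 2), stride=(2, 1))
T = read_view(buffer, shape=(2, 3), stride=(1, 2))  # shape and stride swapped

print(M)  # [[0, 1], [2, 3], [4, 5]]
print(T)  # [[0, 2, 4], [1, 3, 5]]
```

Both results come from the same buffer; swapping the two stride entries is all it takes to read it as the transpose.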

As we can see in Figure 5, compile did not remove a GPU kernel: it removed the CPU overhead of dispatching that view. Inductor traced through the view chain at compile time, computed the resulting strides once, and emitted a direct aten::addmm call with those strides hard-coded. A few microseconds of CPU work disappear while the GPU does identical math.

As one would expect, when the input data violates the strides precomputed by the compiler, it will throw an error.

If you look at the GPU lane in both traces, there is exactly one kernel per forward, and it is the same kernel both times:

cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8

If no transpose kernel ran, who taught the GEMM to read the weight mat

© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog
