Compare commits
2 Commits
master ... rule/add-R

Author | SHA1 | Date
---|---|---
 | 36a7291544 |
 | 0660c9d2a2 |

2 rules/S6981/metadata.json Normal file
@@ -0,0 +1,2 @@
{
}

23 rules/S6981/python/metadata.json Normal file
@@ -0,0 +1,23 @@
{
  "title": "Gradients should be scaled when using mixed precision",
  "type": "BUG",
  "status": "ready",
  "remediation": {
    "func": "Constant\/Issue",
    "constantCost": "5min"
  },
  "tags": [
  ],
  "defaultSeverity": "Major",
  "ruleSpecification": "RSPEC-6981",
  "sqKey": "S6981",
  "scope": "All",
  "defaultQualityProfiles": ["Sonar way"],
  "quickfix": "infeasible",
  "code": {
    "impacts": {
      "RELIABILITY": "HIGH"
    },
    "attribute": "COMPLETE"
  }
}

112 rules/S6981/python/rule.adoc Normal file
@@ -0,0 +1,112 @@
This rule raises an issue when an unscaled loss is used for the backward pass and when the forward pass happened in a mixed-precision context.

== Why is this an issue?

When using mixed precision training, tensors can be cast to lower precision variants to save memory and computing power.
The gradients accumulated during the forward pass might also be cast to a lower precision variant. If the resulting gradients have a small enough magnitude, they might underflow.
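
As a rough illustration (the value and variable names here are illustrative only, not part of the rule), a number that is representable in `float32` but smaller than the smallest subnormal `float16` value (about 6e-8) flushes to zero when cast:

[source,python]
----
import torch

# A gradient-sized value that fits comfortably in float32...
grad_fp32 = torch.tensor(1e-8, dtype=torch.float32)

# ...underflows to zero once cast to float16.
grad_fp16 = grad_fp32.to(torch.float16)

print(grad_fp32.item())  # ~1e-08
print(grad_fp16.item())  # 0.0: the gradient information is lost
----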

=== What is the potential impact?

If the gradients underflow, the model might not learn properly and the training might be unstable.

== How to fix it

To fix this issue, you can use the relevant implementation of `GradScaler`, depending on the `autocast` context and device you are using.
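
For instance, the scaler can be created for the same device that the `autocast` context targets. This is a minimal sketch and assumes a PyTorch version recent enough to expose the device-agnostic `torch.amp.GradScaler` entry point:

[source,python]
----
import torch

# Pick the device that the autocast context will target.
device = "cuda" if torch.cuda.is_available() else "cpu"

# torch.amp.GradScaler(device) is the device-agnostic spelling;
# on CUDA it behaves like torch.cuda.amp.GradScaler().
scaler = torch.amp.GradScaler(device)
----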

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
import torch

model = torch.nn.Linear(28*28, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

x = torch.randn(1, 1*28*28)
y = torch.rand(1, 10)

optimizer.zero_grad()
with torch.autocast(device_type="cuda"):
    output = model(x)
    loss = torch.nn.functional.cross_entropy(output, y)
loss.backward() # Noncompliant: The loss is used without being scaled
optimizer.step()
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
import torch

model = torch.nn.Linear(28*28, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(1, 1*28*28)
y = torch.rand(1, 10)

optimizer.zero_grad()
with torch.autocast(device_type="cuda"):
    output = model(x)
    loss = torch.nn.functional.cross_entropy(output, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
----

=== How does this work?

The `GradScaler` class is used to scale the loss before calling the `backward` method. This ensures that the gradients do not underflow.
The calls to `backward()` and `step()` are replaced by `scaler.scale(loss).backward()` and `scaler.step(optimizer)` respectively.
We also need to add a call to `scaler.update()` to correctly update the scaler.

== Resources

=== Documentation

* PyTorch documentation - https://pytorch.org/docs/stable/amp.html#gradient-scaling[Gradient Scaling]

ifdef::env-github,rspecator-view[]

(visible only on this page)

== Implementation specification

This is a tough implementation, with lots of false negatives in sight.

There are multiple ways to have an autocast context: with the context manager, or with a decorator on the `forward` method of the model (see the sketch below).
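
A minimal sketch of the decorator form (the model and shapes here are hypothetical; only the autocast usage matters):

[source,python]
----
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(28*28, 10)

    # autocast applied as a decorator instead of a `with` block:
    # the forward pass runs in mixed precision, so the detection
    # logic also has to recognize this pattern.
    @torch.autocast(device_type="cuda")
    def forward(self, x):
        return self.linear(x)
----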

I think the implementation should not try too hard to find the issue.

Find one function that has the following properties:

- Has the autocast context manager, which contains a call to a subclass of `nn.Module`
OR
- Contains a call to a subclass of `nn.Module`, with the `@autocast` decorator on the `forward` method

- Has a call to the `backward` method of a tensor

- Has a call to the `step` method (possibly filtered to an object in the optimizer module?)

=== Message

Primary: Use a GradScaler to avoid underflows

Secondary: Autocast context started here
Secondary: The optimizer step should be proxied by a GradScaler


=== Issue location

Primary: on the entire `.backward()` call

Secondary: The autocast context or decorator

Secondary: The `optimizer.step()` call


=== Quickfix

No

endif::env-github,rspecator-view[]