Unanswered
Hey There, Since A Bit I Often Find Experiments Being Stuck While Training A Model.
It Seems To Happen Randomly And I Could Not Find A Reproducible Scenario So Far, But It Happens Often Enough To Be Annoying (I'D Say 1 Out Of 5 Experiments).
The Symptoms
There seems to be a problem with multiprocessing: Although I stopped the task,
You mean you "aborted the task" from the UI?
- There is a memory leak somewhere, please see the screenshot of datadog memory consumptionI'm assuming from the leftover processes ?
Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
157 Views
0
Answers
one year ago
one year ago