Oh, and I do not change the task's status in my code. I just create it at the beginning of my training.
```
import clearml

# `parser` here is my own config parser; it returns an object with a `name` attribute
configuration = parser.parse(config_path)
task = clearml.Task.init(project_name='Foo',
                         task_name=configuration.name)
```
It's on the way, but not yet possible
but updating to the latest version is always a good idea
What you're seeing is basically the SDK's response to the Task's status being changed mid-run, or to someone clicking "Stop" in the UI
It seems like I regained connection. At least I can see all values up until the task got terminated, and after the HTTPTimeOut warning in my logs the training ran for another 200 iterations (~1.5 hours)
After some investigation, this might be related to an issue in ClearML SDK 1.0.2 with the subprocesses support - I suggest upgrading to ClearML SDK 1.0.4
Since I do not manage the cluster, I do not have permission to access system logs. In the docker logs, the last thing that gets printed is the clearml.Task WARNING
It seems like I lost connection during the run of my experiment. But this happened about 200 epochs before the process got terminated
There are literally only two things that can cause that specific message to be printed
Is there a way to check how much storage I am using on the community server?
Here are the machine monitoring scalars. Seems fine to me. I am currently trying to reproduce results from a paper, so I do not tune batch_size etc. to use all available resources.
I am using the community server at https://app.community.clear.ml . In my environment I use clearml==1.0.2, so I probably should update to the latest version
Hi ShallowKitten67.
Can you send the logs? Can you share the machine monitoring (from the scalars section)?
ShallowKitten67 this could happen if you're changing your task's status somewhere in your code - are you?
Little update here: it happened again after updating to ClearML SDK 1.0.4, but this time it happened immediately after I lost the HTTP connection. This makes sense with your explanations. Can I suppress this by setting sdk.development.support_stopping to false in the config?
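For reference, a minimal sketch of where that flag would go in clearml.conf, assuming the standard sdk.development section layout (comments are mine):
```
# clearml.conf
sdk {
    development {
        # Disable the stop-signal support so an external status change
        # does not abort the locally running process
        support_stopping: false
    }
}
```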
So that doesn't explain why the task's status was changed...
I use TensorBoard and rely on automatic logging for all of my scalar reporting. However, I periodically log some scatter plots using clearml.Logger.report_plotly, and I use report_text to log some information about training progress to the console.
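Roughly what those explicit calls look like (a simplified sketch; the figure data, titles, and variables here are made up, only report_plotly and report_text are the actual methods I call):
```python
import plotly.graph_objects as go
from clearml import Logger

logger = Logger.current_logger()
epoch, loss = 10, 0.42  # placeholder values for illustration

# Explicitly report a scatter plot; it shows up under the task's Plots section
fig = go.Figure(go.Scatter(x=[0, 1, 2], y=[0.5, 0.7, 0.9], mode="markers"))
logger.report_plotly(title="validation scatter", series="points",
                     iteration=epoch, figure=fig)

# Free-form progress text, which appears in the task's console log
logger.report_text(f"epoch {epoch}: loss={loss:.4f}")
```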
ShallowKitten67 are you relying on the automatic reporting (so just creating a task and doing nothing clearml-related afterwards), or are you explicitly calling any clearml methods in your code?
I mean, assuming you lost connection to the server and stopped reporting
Yes, also on my machine, where I store the TensorBoard logs together with additional results (meshes and model checkpoints) of all experiments, I only use about 1 GB
Well, there's a watchdog on the server that automatically stops tasks that haven't reported for a long time - I guess that's what happened...