I use TensorBoard and rely on automatic logging for all of my scalar reporting. However, I periodically log some scatter plots using `clearml.Logger.report_plotly`, and I use `report_text` to log some information about training progress to the console.
It seems like I regained connection. At least I can see all values up until the task got terminated, and after the HTTPTimeOut warning in my logs, the training ran for another 200 iterations (~1.5 hours).
Yes, and on my machine, where I store the TensorBoard logs together with additional results (meshes and model checkpoints) of all experiments, I only use about 1 GB.
Is there a way to check how much storage I am using on the community server?
Hi ShallowKitten67 .
Can you send the logs? Can you share the machine monitoring (from the scalars section)?
It's on the way, but not yet possible.
Here are the machine monitoring scalars. They seem fine to me. I am currently trying to reproduce results from a paper, so I do not tune batch_size etc. to use all available resources.
What you're seeing is basically the SDK's response to the task's status being changed mid-run, or to someone clicking "Stop" in the UI.
There are literally only two things that can cause that specific message to be printed.
ShallowKitten67 are you relying on the automatic reporting (so just creating a task and doing nothing clearml-related afterwards), or are you explicitly calling any clearml methods in your code?
Oh, and I do not change the task's status in my code. I just create it at the beginning of my training:
```
configuration = parser.parse(config_path)
task = clearml.Task.init(project_name='Foo',
                         task_name=configuration.name)
```
Since I do not manage the cluster, I do not have permission to access system logs. In the Docker logs, the last thing that gets printed is the `clearml.Task` WARNING.
Well, there's a watchdog on the server that automatically stops tasks that haven't reported for a long time - I guess that's what happened...
Little update here: it happened again after updating to ClearML SDK 1.0.4, but this time it happened immediately after I lost the HTTP connection. This makes sense given your explanations. Can I suppress this by setting `sdk.development.support_stopping` to `false` in the config?
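For reference, a sketch of what that override would look like in `clearml.conf` (assuming the `sdk.development.support_stopping` setting discussed above and the standard clearml.conf section layout):

```
sdk {
    development {
        # Disable external task stopping, so a status change on the
        # server side won't terminate the locally running process
        support_stopping: false
    }
}
```

Note that with this disabled, clicking "Stop" in the UI would no longer abort the running process either.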
But updating to the latest version is always a good idea.
ShallowKitten67 this could happen if you're changing your task's status somewhere in your code - are you?
I mean, assuming you lost connection to the server and stopped reporting
It seems like I lost connection during the run of my experiment. But this happened about 200 epochs before the process got terminated.
I am using the community server at https://app.community.clear.ml . In my environment I use `clearml==1.0.2`, so I probably should update to the latest version.
So that doesn't explain why the task's status was changed...
After some investigation, this might be related to an issue in ClearML SDK 1.0.2 with the subprocess support - I suggest upgrading to ClearML SDK 1.0.4.