Yes, everything is that way (work dir and args are ok) except the script path. It shows -m module arg1 arg2.
Yes! I think that's what I will do 🙂 Let me know if there is a way to contribute a mode to keep logging off. We just don't want to pollute the server when debugging.
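For context, what I have in mind is roughly ClearML's offline mode; a minimal sketch, assuming Task.set_offline exists in the installed clearml version (project/task names are illustrative):
```
from clearml import Task

# Switch to offline mode *before* Task.init so nothing is sent to the server;
# everything is recorded locally instead.
Task.set_offline(offline_mode=True)

task = Task.init(project_name="debug", task_name="local-debug-run")
```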
I'll give that a try! Thanks CostlyOstrich36
CostlyOstrich36 That seemed to do the job! No message after the first epoch, with the caveat of losing resource monitoring. Any idea what could be causing this? If the resource monitor is the first plot, does the iteration detection fail? Are there any hacks to keep the resource monitoring? Thanks a lot! 🙂
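For anyone reading later: one way to disable the resource monitor (consistent with the caveat above) is the auto_resource_monitoring flag on Task.init; a minimal sketch, assuming that flag exists in this clearml version (names are illustrative):
```
from clearml import Task

# Disable the automatic CPU/GPU/memory monitor so the resource plot is never
# the first thing reported; the trade-off is losing the utilization graphs.
task = Task.init(
    project_name="my_project",   # illustrative
    task_name="my_experiment",   # illustrative
    auto_resource_monitoring=False,
)
```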
Hey CostlyOstrich36! I am using clearml==1.1.2 and clearml-agent==1.1.0. Stopped is not the right word, more like frozen: it just froze at an epoch. The console on the agent shows epoch 33, first batch, and the one on the server shows epoch 32, last batch. The experiment had been running for ~6 hours.
Hey CostlyOstrich36, I am doing a lot of things before the first plot is reported! Is the seconds_from_start parameter unbounded? What should I do if it takes a long time to report the first plot?
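For anyone following along, seconds_from_start here refers to Task.set_resource_monitor_iteration_timeout; a minimal sketch of raising it when the first report comes late, assuming the call is a classmethod as in recent clearml versions (the value mirrors the one mentioned below):
```
from clearml import Task

task = Task.init(project_name="my_project", task_name="slow_first_plot")  # illustrative names

# Give the run more time before the resource monitor gives up on detecting
# training iterations and falls back to reporting by seconds from start.
Task.set_resource_monitor_iteration_timeout(seconds_from_start=200000)
```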
Side note: when running src.train as a module, the server gets the command as src and it has to be modified to src.train.
With pip I get the first error I showed. I tried conda and it starts running, but at some point it crashes with: clearml_agent: ERROR: 'NoneType' object has no attribute 'lower'
AgitatedDove14 Well, I have a loss function which is something like:
```
class MyLoss(...):
    def forward(...):
        weights = self.compute_weights(...)
        return (weights * (target - preds)).mean()
```
There seems to be a problem on a certain batch when computing the weights. What would be the best way to log the batch that causes the problem, along with the weights being computed?
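Just to make the question concrete, a minimal sketch of the kind of logging I mean, assuming compute_weights is my own method defined elsewhere and using Task.current_task / upload_artifact to dump the offending tensors (artifact names are illustrative):
```
import torch
from clearml import Task


class MyLoss(torch.nn.Module):
    def forward(self, preds, target):
        weights = self.compute_weights(preds, target)  # my own method, defined elsewhere

        # If the weights blow up, save the batch and the weights for later inspection
        if not torch.isfinite(weights).all():
            task = Task.current_task()
            if task is not None:
                task.upload_artifact("bad_batch_preds", preds.detach().cpu().numpy())
                task.upload_artifact("bad_batch_target", target.detach().cpu().numpy())
                task.upload_artifact("bad_weights", weights.detach().cpu().numpy())

        return (weights * (target - preds)).mean()
```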
AgitatedDove14 Thanks! I'll give it a try! Makes sense 🙂
Sure! I enqueue the experiment from my local machine:
```
python -m src.train model=my_model loss=my_loss dataset=my_dataset
```
Then I go to the server, run the experiment, and create a copy to run with a new model. On the copy, I go to the script path and modify it to be:
```
-m src.train model=my_other_model loss=my_loss dataset=my_dataset
```
The new experiment, even though the script path has my_new_model as the default, starts training using my_model.
I can also see ...
So should I set them all with a default value? The working dir is the project one, the one that contains the module package.
I am using the code inside on_train_epoch_end, inside a metric. So the important part is:
```
fig = plt.figure()
# ... my plot ...
logger.experiment.add_figure("fig", fig)
plt.close(fig)
```
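For completeness, a minimal sketch of where that snippet sits, assuming a LightningModule with a TensorBoard logger attached (module name and plot data are illustrative):
```
import matplotlib.pyplot as plt
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    def on_train_epoch_end(self):
        # Build the figure (illustrative data; in my case it comes from a metric)
        fig = plt.figure()
        plt.plot([0, 1, 2], [0, 1, 4])
        # With TensorBoardLogger, logger.experiment is the underlying SummaryWriter
        self.logger.experiment.add_figure("fig", fig, global_step=self.current_epoch)
        # Close the figure so it does not accumulate across epochs
        plt.close(fig)
```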
Works like a charm 🙂 thanks!
Not yet AgitatedDove14, does the agent use by default the Python version the command is run with? I installed conda and tried using package_manager.type=conda, but then I get an error: clearml_agent: ERROR: 'NoneType' object has no attribute 'lower'
I set it to 200000! But the problem stems from when the first plot is the ClearML CPU and GPU monitoring; were you able to reproduce it? Even if I set the number fairly large, the message appeared once the monitoring plot was reported.
Is this caused by running the script with the arguments?
AgitatedDove14 Thanks! I'm trying to figure out how to create a minimum working example! I am also working with Hydra, so that may be a thing. The extension is what's causing it to fail (haven't figured out why).