I am still getting the error even with the v0.16.3 agent. Is there something else we need to do besides updating it?
SuccessfulKoala55 just to let you know: since I opened the link straight from the GCP console, the address was using https instead of http, hence the error. Thanks a lot for your help!
Using detect_with_pip_freeze: true runs into "package version not found" errors for some of the packages I have installed locally.
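For context, this is roughly how I enabled it in clearml.conf (assuming the flag belongs under the sdk.development section, which is where I understand it lives):

sdk {
  development {
    # record the environment with pip freeze instead of analyzing imports
    detect_with_pip_freeze: true
  }
}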
There are also ways to override the parameters from the command line, as described in https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_cli.html#use-of-command-line-arguments .
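A minimal sketch of what I mean, assuming a LightningCLI entry point (the model and datamodule names are placeholders, and the import path may differ between Lightning versions):

from pytorch_lightning.cli import LightningCLI  # may live elsewhere in older/newer PL versions
from my_project import MyModel, MyDataModule    # placeholder model and datamodule

if __name__ == "__main__":
    # any config value can then be overridden from the command line, e.g.
    #   python train.py fit --trainer.max_epochs 10 --model.lr 0.001
    LightningCLI(MyModel, MyDataModule)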
Hey CostlyOstrich36! I am using clearml==1.1.2 and clearml-agent==1.1.0. Stopped is not the right word, more like frozen: it just froze at an epoch. The console on the agent shows epoch 33, first batch, while the one on the server shows epoch 32, last batch. The experiment had been running for ~6 hours.
Best thing ever, thanks AgitatedDove14 !
AgitatedDove14 from this thread I understand Hydra is not supported, and therefore overriding the parameters from the UI won't work, but is there still a way to track and add the parameters to the experiment? Will task.connect_configuration work with the YAML files?
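Something like this is what I have in mind, a sketch with placeholder project, task, and config names:

from clearml import Task

task = Task.init(project_name="examples", task_name="hydra-run")  # illustrative names

# connect_configuration accepts a dict or a file path; with a path, the YAML
# contents should show up under the task's CONFIGURATION section in the UI
config_path = task.connect_configuration("configs/train.yaml", name="train_config")  # placeholder path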
Managed to get:
clearml_agent: ERROR: Command '['/home/ramon/.clearml/venvs-builds/3.9/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/var/tmp/requirements_tb0x2i3j.txt', '--extra-index-url', '
died with <Signals.SIGKILL: 9>.
while building the task from its ID on the agent.
It is failing exactly when the download finishes. Not sure if it matters, but in ~/.clearml/pip-download-cache only an empty cu120 folder appears. Should the torch wheel be saved there?
Sure! For torch I have:
torch==2.0.1
# via
# monai
# pytorch-lightning
# torchio
# torchmetrics
CostlyOstrich36 Thanks for the help! It ended up being a mistake on my side: I misconfigured the VM's memory and it had only 3.75 GB, so it failed when installing torch.
I just want to retrieve the weights in a script that tests models I have trained in the past.
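Roughly what I'm trying to do, a sketch with placeholder project and task names:

from clearml import Task

# fetch the past training run and download its latest output model weights
prev_task = Task.get_task(project_name="examples", task_name="my-training-run")  # placeholders
output_model = prev_task.models["output"][-1]   # last model the run registered
weights_path = output_model.get_local_copy()    # local path to the weights file

# then load them in the test script, e.g. torch.load(weights_path)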
CostlyOstrich36 That seemed to do the job! No message after the first epoch, with the caveat of losing resource monitoring. Any idea what could be causing this? If the resource monitor is the first plot, does the iteration detection fail? Are there any hacks to keep the resource monitoring? Thanks a lot! 🙌
Hey CostlyOstrich36, I am doing a lot of things before the first plot is reported! Is the seconds_from_start parameter unbounded? What should I do if it takes a long time to report the first plot?
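For reference, this is how I'm setting it, if I understand the API correctly (the timeout value and names are just examples):

from clearml import Task

# allow more time before the resource monitor expects the first real iteration
Task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)

task = Task.init(project_name="examples", task_name="long-setup-run")  # illustrative names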
CostlyOstrich36 PyTorch Lightning exposes current_epoch on the trainer, not sure if that is what you mean.
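If manual reporting is what you mean, this is the kind of callback I could add (a sketch; it assumes a metric logged under the name "val_loss"):

import pytorch_lightning as pl
from clearml import Task

class ClearMLEpochReporter(pl.Callback):
    # report validation loss against trainer.current_epoch so ClearML picks up the iteration
    def on_validation_epoch_end(self, trainer, pl_module):
        val_loss = trainer.callback_metrics.get("val_loss")  # assumes a "val_loss" metric exists
        if val_loss is not None:
            Task.current_task().get_logger().report_scalar(
                title="val", series="loss", value=float(val_loss), iteration=trainer.current_epoch
            )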
Sure! Could you point me to how it's done?
Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it still fails when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track CPU and GPU usage. Do you think this could be the cause, or should it also detect the iteration?
I set the number to a crazy value and it fails around the same iteration
Oh, I think I am wrong! Then it must be the ClearML monitoring. Still, it fails way before the timer ends.