If you try:
ModelCheckpoint('best_model.hdf5', save_best_only=True)
does it work too?
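Something like this is what I have in mind (just a sketch; the model and data names are placeholders, and I am assuming the standard Keras callback signature):
` from tensorflow.keras.callbacks import ModelCheckpoint

# Save weights only when the monitored metric (val_loss by default) improves.
checkpoint = ModelCheckpoint('best_model.hdf5', save_best_only=True)

# model, x_train, y_train are placeholders for the real model and data.
model.fit(x_train, y_train, validation_split=0.2, epochs=10, callbacks=[checkpoint]) `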
Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it still fails when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track the CPU and GPU usage. Do you think this could be the cause? Or should it also detect the iteration?
CostlyOstrich36 That seemed to do the job! No message after the first epoch, with the caveat of losing resource monitoring. Any idea what could be causing this? If the resource monitor is the first plot, does the iteration detection fail? Are there any hacks to keep the resource monitoring? Thanks a lot!
CostlyOstrich36 PyTorch Lightning exposes current_epoch on the trainer; not sure if that is what you mean.
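For example (a sketch; the callback itself is hypothetical, it just shows where current_epoch is exposed):
` import pytorch_lightning as pl

class EpochPrinter(pl.Callback):
    # Hypothetical callback: trainer.current_epoch is available in any hook.
    def on_train_epoch_end(self, trainer, pl_module):
        print(f"finished epoch {trainer.current_epoch}") `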
Hey CostlyOstrich36, I am doing a lot of things before the first plot is reported! Is the seconds_from_start parameter unbounded? What should I do if it takes a long time to report the first plot?
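For context, this is roughly what I am doing (a sketch; I am assuming the seconds_from_start here is the argument of Task.set_resource_monitor_iteration_timeout, and the project/task names are placeholders):
` from clearml import Task

# Give the resource monitor more time to wait for the first real iteration/plot
# (value in seconds); set here before Task.init.
Task.set_resource_monitor_iteration_timeout(seconds_from_start=200000)

task = Task.init(project_name="my_project", task_name="training") `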
Oh, I think I am wrong! Then it must be the ClearML monitoring. Still, it fails way before the timer ends.
Managed to get:
clearml_agent: ERROR: Command '['/home/ramon/.clearml/venvs-builds/3.9/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/var/tmp/requirements_tb0x2i3j.txt', '--extra-index-url', '
died with <Signals.SIGKILL: 9>.
while building the task with that ID on the agent
I'll give that a try! Thanks CostlyOstrich36
Sure! Could you point me to how it's done?
Yes, everything is that way (working dir and args are OK) except the script path. It shows -m module arg1 arg2.
I set it to 200000! But the problem stems from the first plot being the ClearML CPU and GPU monitoring; were you able to reproduce it? Even if I set the number fairly large, the message appeared as soon as the monitoring plot was reported.
I set the number to a crazy value and it still fails around the same iteration.
Sure! For torch I have:
` torch==2.0.1
# via
# monai
# pytorch-lightning
# torchio
# torchmetrics `
AgitatedDove14 I am not sure why the packages get different versions; maybe since the package is not directly imported in my code, it can resolve to a different version than the one I have locally (?). Should all library versions match exactly between my local environment and the code that runs on the agent? The Task.add_requirements(package_name, package_version=None) workaround works perfectly! I just pin the previous version that doesn't break the code. Yes, a force flag could definitely help ...
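Concretely, the workaround looks like this (a sketch; torch 2.0.1 is just the pin from my list above, the actual package may differ, and the project/task names are placeholders):
` from clearml import Task

# Pin the exact version I have locally so the agent resolves the same one.
# add_requirements must be called before Task.init to affect the captured requirements.
Task.add_requirements("torch", package_version="2.0.1")

task = Task.init(project_name="my_project", task_name="training") `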
AgitatedDove14 I filed an issue on fire for them to point us to the argument-parsing method: https://github.com/google/python-fire/issues/291
Thanks SuccessfulKoala55 !
My bad :man-facepalming: It was just a matter of specifying weights_path=dirpath, since the first argument is weights_filename.
Yes AgitatedDove14! I'll PM you
I am using the code inside the on_train_epoch_end hook, inside a metric. So the important part is:
` import matplotlib.pyplot as plt

fig = plt.figure()
# ... build the actual plot here (placeholder for my plotting code) ...
logger.experiment.add_figure("fig", fig)  # logger.experiment is e.g. the TensorBoard SummaryWriter
plt.close(fig) `
Sure, I'll share it through a private message!