Yes! I think that's what I will do. Let me know if there is a way to contribute a mode to keep logging off. We just don't want to pollute the server when debugging.
Makes sense! Then where would I have to add output_uri to save the weights?
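For reference, output_uri is usually passed when the task is created. A minimal sketch, assuming the standard clearml Python package; the project name, task name, and bucket URL are placeholders and not values from this thread:

```python
def build_task_settings(output_uri):
    """Keyword arguments for Task.init; the names below are placeholders."""
    return {
        "project_name": "examples",   # hypothetical project name
        "task_name": "train-model",   # hypothetical task name
        "output_uri": output_uri,     # destination for uploaded model checkpoints
    }


def start_task(output_uri="s3://my-bucket/models"):
    # Imported lazily so build_task_settings works without clearml installed.
    from clearml import Task

    return Task.init(**build_task_settings(output_uri))
```

Calling start_task() at the top of the training script should be enough for framework checkpoints to be uploaded to that destination, assuming the storage credentials are already configured.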
For option 2 do I have to configure it on all agents or on the server?
I just want to retrieve the weights in a script that tests models I have trained in the past.
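One way this could look, as a sketch only: it assumes the weights were registered as output models of the original training task, and the task ID is something you copy from the web UI. The helper names are made up for illustration.

```python
def latest_weights_path(task):
    """Return a local copy of the most recent output model of a task."""
    output_models = task.models["output"]
    if not output_models:
        raise ValueError("task has no output models")
    # get_local_copy() downloads the weights file and returns its local path
    return output_models[-1].get_local_copy()


def fetch_weights(task_id):
    # Imported lazily so latest_weights_path can be exercised without clearml.
    from clearml import Task

    return latest_weights_path(Task.get_task(task_id=task_id))
```

The test script would then call fetch_weights("<task id>") and load the returned file with the framework that produced it.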
SuccessfulKoala55 on both 8080 and 8008 I get: Safari can't open the page http://<External IP>:80XX because Safari can't establish a secure connection to the server http://<External IP>:80XX.
Hey AgitatedDove14, after playing around it seems that if the callback filepath points to an hdf5 file, it is not uploaded.
Basically, one points to an hdf5 file and the other one has no extension.
If you try: ModelCheckpoint('best_model.hdf5', save_best_only=True) does it work too?
CostlyOstrich36 That seemed to do the job! No message after the first epoch, with the caveat of losing resource monitoring. Any idea what could be causing this? If the resource monitor is the first plot, will the iteration detection fail? Are there any hacks to keep the resource monitoring? Thanks a lot!
Hey CostlyOstrich36, I am doing a lot of things before the first plot is reported! Is the seconds_from_start parameter unbounded? What should I do if it takes a lot of time to report the first plot?
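For context, a sketch of how the timeout might be raised. The thread does not name the method; this assumes seconds_from_start refers to Task.set_resource_monitor_iteration_timeout in the clearml SDK, and the helper name and the 60 minute value are illustrative only.

```python
def relax_iteration_timeout(task, minutes):
    """Give the resource monitor more time before it stops waiting for an iteration."""
    seconds = float(minutes) * 60.0
    # Assumed API: Task.set_resource_monitor_iteration_timeout(seconds_from_start=...)
    task.set_resource_monitor_iteration_timeout(seconds_from_start=seconds)
    return seconds
```

With a real task this would be called right after Task.init, for example relax_iteration_timeout(task, minutes=60).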
CostlyOstrich36 PyTorch Lightning exposes the current_epoch in the trainer, not sure if that is what you mean.
Sure! Could you point me to how it's done?
Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it still fails when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track the CPU and GPU usage. Do you think this could be the cause, or should it also detect the iteration?
I set the number to a crazy value and it fails around the same iteration
Oh I think I am wrong! Then it must be the clearml monitoring. Still it fails way before the timer ends.
I'll give that a try! Thanks CostlyOstrich36
I am about to try everything AgitatedDove14, but I ran into a GitLab error from the agent. I added the username and password to the configuration file but still get a Host key verification failed. Is it common that the cloning message shows the SSH link instead of the HTTPS one when a username and password are provided?
AgitatedDove14 task.set_archived(True) + the cleanup service should do it. If we run in debug mode, the experiment goes directly to the archive and gets cleaned, and we don't pollute the main experiment page.
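Roughly what that looks like in code, as a sketch: the DEBUG_RUN environment variable and the helper names are our own convention for illustration, not something ClearML defines. Only task.set_archived(True) comes from the SDK.

```python
import os


def is_debug_run(env=None):
    """Read a hypothetical DEBUG_RUN flag; '1' marks a throwaway run."""
    env = os.environ if env is None else env
    return env.get("DEBUG_RUN", "0") == "1"


def archive_if_debug(task, debug):
    """Send debug experiments straight to the archive for the cleanup service."""
    if debug:
        task.set_archived(True)
    return debug
```

In the training script this would run right after the task is created, e.g. archive_if_debug(task, is_debug_run()).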
On the server through the command line?