Last question CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it still fails when the first plots are reported. The first plots are generated automatically by pytorch lightning and track the CPU and GPU usage. Do you think this could be the cause? Or should it also detect the iteration?
Hey AgitatedDove14, do you have an implementation for gcloud? This is awesome!
I feel it’s easier not to report than to clean up afterwards, but please correct me if I am overthinking it. I’ll check whether I can wrap the code in something that calls Task.delete when debugging.
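Something like this is what I have in mind, as a minimal sketch. The helper name discard_if_debugging and the debugging flag are my own placeholders, and it only assumes the object passed in has a delete() method, as clearml.Task does. I haven't checked whether ClearML allows deleting a task that is still running, so it may need to be closed first.

```python
import contextlib


@contextlib.contextmanager
def discard_if_debugging(task, debugging):
    """Yield the task and delete it on exit when running in debug mode.

    Hypothetical helper: `task` is any object with a delete() method.
    """
    try:
        yield task
    finally:
        if debugging:
            # Assumption: delete() removes the experiment from the server.
            task.delete()
```

The training code would go inside the with block, so the cleanup also runs if the debug run crashes.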
My bad :man-facepalming: It was just a matter of specifying weights_path=dirpath, since the first argument is weights_filename.
I need to fetch a dataset for some simple tests, but since it doesn’t have credentials for the self-hosted server it won’t find the dataset.
Sure! Could you point me to how it’s done?
Is this caused by running the script with the arguments?
I set the number to a crazy value and it fails around the same iteration
Sure, I’ll share it through a private message!
For option 2, do I have to configure it on all agents or on the server?
I am using pytorch_lightning, I’ll try to create a snippet I can share! Thanks 🙌
SuccessfulKoala55 just to let you know: since I opened the link straight from the GCP console, it was using https on the address instead of http, hence the error. Thanks a lot for your help!
There are also ways to override the parameters, as described here: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_cli.html#use-of-command-line-arguments
Hey CostlyOstrich36! I am using clearml==1.1.2 and clearml-agent==1.1.0. "Stopped" is not the right word, more like frozen: it just froze at an epoch. The console on the agent shows epoch 33, first batch, while the one on the server shows epoch 32, last batch. The experiment had been running for ~6 hours.
Yes AgitatedDove14 ! I’ll PM you
` File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/trains/backend_api/session/token_manager.py", line 72, in _get_token_exp
return jwt.decode(token, verify=False).get('exp', sys.maxsize)
File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/jwt/api_jwt.py", line 113, in decode
decoded = self.decode_complete(jwt, key, algorithms, options, **kwargs)
File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/jwt/api_jwt.py", line 80, in decode_c...
Yes, exactly! Unfortunately I am not so familiar with the internals of the library but I could take a look and figure that out.
Sure! For torch I have:
torch==2.0.1
# via
# monai
# pytorch-lightning
# torchio
# torchmetrics
Using detect_with_pip_freeze: true runs into "package version not found" for some of the ones I have locally.
Pigar is capturing different versions than the ones I have installed on my local machine (not a problem except for one). I just want to force the version of that package so that I don’t have to change it manually from the UI for every experiment.
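What I'd like is roughly the following, as a sketch only. apply_pins and PINS are names I made up, and torch==2.0.1 is just the pin from my requirements above used as an example. The callback stands in for whatever the library offers for this; my understanding is that Task.add_requirements(package_name, package_version), called before Task.init, is meant for it, but I have not verified that against my version.

```python
def apply_pins(pins, add_requirement):
    """Call add_requirement(name, version_spec) once per pinned package.

    Hypothetical helper: `pins` maps a package name to an exact version.
    Returns the pip-style specs that were applied, for logging.
    """
    applied = []
    for name, version in sorted(pins.items()):
        spec = "==" + version
        add_requirement(name, spec)
        applied.append(name + spec)
    return applied


# Example pin only; replace with the package that is detected wrongly.
PINS = {"torch": "2.0.1"}
```

Keeping the pins in one dict in the script would mean the version travels with the code instead of being edited in the UI each time.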
CostlyOstrich36 Thanks for the help! It ended up being a mistake on my side. I had misconfigured the VM's memory and it had only 3.75 GB, so it failed when installing torch.