Reputation
Badges 1
119 × Eureka!I configured a firewall rule that opened the ports for the instance (not 100% sure if this is the right way) using network tags. Yes, the whole screen is black and no trains logo show up: Safari can’t open the page because the server where this page is located isn’t responding.
AgitatedDove14 Well I have a loss function which is something like:class MyLoss(...): def forward(...): weights = self.compute_weights(...) return (weights * (target-preds)).mean()There seems to be a problem on certain batch when computing the weights. What would be the best way to log the batch that causes the problem, along with the weights being computed.
It is failing exactly when the download finishes. Not sure if it is something but on the ~/.clearml/pip-download-cache only a cu120 empty folder appears. Should the torch wheel be saved there?
Is this caused by running the script with the arguments?
I'll give that a try! Thanks CostlyOstrich36
Also, should I allow 8080 , 8008 , and 8081 on ingress and egress on GCP or is only egress enough?
` File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/trains/backend_api/session/token_manager.py", line 72, in _get_token_exp
return jwt.decode(token, verify=False).get('exp', sys.maxsize)
File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/jwt/api_jwt.py", line 113, in decode
decoded = self.decode_complete(jwt, key, algorithms, options, **kwargs)
File "/home/ramon/.trains/venvs-builds/3.7/lib/python3.7/site-packages/jwt/api_jwt.py", line 80, in decode_c...
SuccessfulKoala55 just to let you know: since I opened the link straight from the GCP console it was using https on the address instead of http hence the error. Thanks a lot for your help!
I set it to 200000 ! But the problem stems from when the first plot is the clearml cpu and gpu monitoring, were you able to reproduce it? Even if I set the number fairly large when the monitoring plot was reported the message appeared.
AgitatedDove14 Downloading a dataset would not be possible using this right? I want to be able to access the data just avoid reporting the experiment results
It works perfectly! AgitatedDove14 There is something weird on my side 😢
` [package_manager.force_repo_requirements_txt=true] Skipping requirements, using repository "requirements.txt"
Using base prefix '/opt/conda'
New python executable in /home/ramon/.clearml/venvs-builds/3.7/bin/python3.7
Also creating executable in /home/ramon/.clearml/venvs-builds/3.7/bin/python
Installing setuptools, pip, wheel...
2021-06-10 09:57:56
done.
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: p...
Sure! I enqueue the experiment from my local machine:python -m src.train model=my_model loss=my_loss dataset=my_dataset
Then I go to the server and run the experiment and create a copy to run with a new model. On the copy, I go to the script path and modify it to be:-m src.train model=my_other_model loss=my_loss dataset=my_dataset
The new experiment, even though the script path has my_new_model default, starts training using my_model .
I can also see ...
My bad :man-facepalming: It was just specifying weights_path=dirpath since the first argument is weights_filename
For option 2 do I have to configure it on all agents or on the server?
Sure! For torch I have:
torch==2.0.1
# via
# monai
# pytorch-lightning
# torchio
# torchmetrics