
So should I set them all with a default value? The working dir is the project one, the one that contains the module package
TimelyPenguin76 I found out it's just one package that is causing the error (cloudpickle breaks everything). Is there a way to use Pigar but force a single package to have a version?
No, I have all the packages with a version. I just want to know if there is a way to override the requirement versions detected by Pigar when using detect_with_pip_freeze: false. I have cloudpickle==1.4.1 locally, but when running the code and sending the task to the node the environment uses cloudpickle==1.6.0. I have to manually change the version in the UI. Is there a way to force this single package to have a version? Maybe in the requirements.txt or something similar?
I configured a firewall rule that opened the ports for the instance (not 100% sure if this is the right way) using network tags. Yes, the whole screen is black and no trains logo shows up: Safari can’t open the page because the server where this page is located isn’t responding.
So I would have to disconnect pytorch? And then upload the model at the end
It works perfectly! AgitatedDove14 There is something weird on my side 😢
Also, should I allow 8080, 8008, and 8081 on ingress and egress on GCP, or is only egress enough?
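One way to see whether the firewall rule actually lets traffic in is to try a TCP connection to each port from your own machine. The snippet below is only a sketch using the standard library; `check_ports` and `DEFAULT_PORTS` are names made up for illustration, and a failed connection can also mean the server itself is not listening, not only that the firewall blocks it.

```python
import socket

# The three ports mentioned in the question above.
DEFAULT_PORTS = (8080, 8008, 8081)


def check_ports(host, ports=DEFAULT_PORTS, timeout=2.0):
    """Return {port: True/False} for whether a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or an unreachable host.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Calling `check_ports("<External IP>")` from outside GCP (with the real address filled in) shows which ports accept inbound connections. Since the browser connects to the instance, it is the inbound side that this check exercises.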
Hi AgitatedDove14, thanks for your reply. With the dashboard I meant the Web-App (UI). I am trying to access http://<External IP>:8080 but unfortunately nothing shows up.
Hey CostlyOstrich36 sorry to ping you! Let's say I enqueue multiple experiments on a couple of agents and one of them fails. Is it possible to restart the experiment from the UI using the latest checkpoint? What if the experiment gets assigned to the other agent? I am not sure how the continue_last_task flag would help in this case.
Thanks AgitatedDove14 ! seems to be subclassed model + extension
Thanks SuccessfulKoala55 !
Yes! What env variables should I pass
AgitatedDove14 Well I have a loss function which is something like:
    class MyLoss(...):
        def forward(...):
            weights = self.compute_weights(...)
            return (weights * (target-preds)).mean()
There seems to be a problem on a certain batch when computing the weights. What would be the best way to log the batch that causes the problem, along with the weights being computed?
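A rough sketch of one way to catch the offending batch, assuming the problem shows up as NaN or infinite weights (the message does not say what the problem is). Plain Python lists stand in for tensors so the snippet runs without PyTorch; `weighted_loss`, `batch_idx` and the logger name are hypothetical. In PyTorch the same check could be written with `torch.isfinite(weights).all()`.

```python
import logging
import math

logger = logging.getLogger("loss_debug")


def weighted_loss(preds, targets, weights, batch_idx):
    """Mean of weights * (target - pred); logs the batch if any weight is NaN/inf.

    Returns (loss, bad_positions) so the caller can also react to a bad batch.
    """
    # Positions of weights that are NaN or +/-inf.
    bad = [i for i, w in enumerate(weights) if not math.isfinite(w)]
    if bad:
        logger.warning(
            "batch %d: non-finite weights at positions %s | weights=%s preds=%s targets=%s",
            batch_idx, bad, weights, preds, targets,
        )
    terms = [w * (t - p) for w, t, p in zip(weights, targets, preds)]
    return sum(terms) / len(terms), bad
```

Logging the batch index together with the inputs keeps the training loop running while still recording what is needed to reproduce the bad batch later.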
AgitatedDove14 Thanks! I'm trying to figure out how to create a minimum working example! I am also working with Hydra so that may be a thing. The extension is what's causing it to fail (haven’t figured out why).
Hi CostlyOstrich36 ! The message is the following: clearml.model - INFO - Selected model id: 27c1a1700b0b4e25a4344dc4ef9868fa
They are not models, those are intermediate tensors I am caching to make training faster. I don't need to log them.
Yes Martin! I have a package installed from GitHub but it's using the PyPI version
Side note: When running src.train as a module the server gets the command as src and has to be modified to be src.train
AgitatedDove14 Thanks! I’ll give it a try! Makes sense 👌
AgitatedDove14 I am not sure why the packages get different versions. Maybe since the package is not directly imported in my code it is possible to get a different version to what I have locally (?). Should all the library versions match exactly between local and the code that runs in the agent? The Task.add_requirements(package_name, package_version=None) workaround works perfectly! I just add the previous version that doesn’t break the code. Yes, definitely a force flag could help ...
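To illustrate what forcing a single package version amounts to, here is a small sketch that overrides one entry in a list of requirement lines. It is not how ClearML or Pigar does it internally; `override_requirement` is a hypothetical helper. The workaround itself is the call quoted above, e.g. `Task.add_requirements("cloudpickle", "1.4.1")`, which as far as I know has to run before `Task.init`.

```python
import re


def override_requirement(lines, package, version):
    """Return requirement lines with `package` pinned to `version`.

    An existing entry for the package is replaced; otherwise the pin is appended.
    """
    pinned = f"{package}=={version}"
    # Package name at line start, followed by a version specifier or end of line,
    # so that e.g. "cloudpickle-extra" is not mistaken for "cloudpickle".
    pattern = re.compile(
        rf"^{re.escape(package)}\s*(==|>=|<=|~=|!=|<|>|$)", re.IGNORECASE
    )
    out, replaced = [], False
    for line in lines:
        if pattern.match(line.strip()):
            out.append(pinned)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(pinned)
    return out
```

For example, `override_requirement(["numpy==1.19.0", "cloudpickle==1.6.0"], "cloudpickle", "1.4.1")` swaps only the cloudpickle line and leaves the rest as detected.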
Not yet AgitatedDove14 , does the agent use by default the python version the command is run with? I installed conda and tried using package_manager.type=conda but then get an error: clearml_agent: ERROR: 'NoneType' object has no attribute 'lower'