I would probably leave it to the ClearML team to answer you, I am not using the UI app and for me it worked just well with different regions. Maybe check permissions of the key/secrets?
but if you do that and the package is already installed it will not install using the git repo, this is an issue with pip
Exactly, that’s my problem: I want to remove it to make sure it is reinstalled (because the version can change)
I think that since the agent installs everything from scratch it should work for you. Wdyt?
With env caching enabled, it won’t reinstall this private dependency, right?
I have two controller tasks running in parallel in the trains-agent services queue
If I don’t start clearml-session, I can easily connect to the agent, so clearml-session is doing something that messes up the ssh config and prevents me from SSHing into the agent afterwards
Should I try to disable dynamic mapping before doing the reindex operation?
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but not that suspicious
So I created a symlink in /opt/train/data -> /data
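A minimal sketch of the symlink step above, assuming the intent is for the path under /opt to resolve to the real data directory. The helper name `link_data_dir` and the temporary demo paths are hypothetical, chosen so the example runs without touching /opt or /data.

```python
import os
import tempfile


def link_data_dir(target: str, link_path: str) -> str:
    """Create link_path as a symlink pointing at target (hypothetical helper).

    Returns the path the link points to. If link_path is already a symlink,
    it is left untouched and its current destination is returned.
    """
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    if os.path.islink(link_path):
        return os.readlink(link_path)
    # Raises FileExistsError if link_path exists as a regular file or directory.
    os.symlink(target, link_path, target_is_directory=True)
    return os.readlink(link_path)


# Demo inside a temporary directory instead of the real /opt and /data paths.
root = tempfile.mkdtemp()
target = os.path.join(root, "data")
os.makedirs(target)
link = os.path.join(root, "opt", "train", "data")
resolved = link_data_dir(target, link)
```

On a real host the equivalent call would use the actual data directory as `target` and the path the server expects as `link_path`.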
Or even better: would it be possible to have a support for HTML files as artifacts?
AgitatedDove14 Yes exactly, I tried the fix suggested in the github issue urllib3>=1.25.4 and the ImportError disappeared 🙂
Hi AgitatedDove14 , that’s super exciting news! 🤩 🚀
Regarding the two outstanding points:
In my case, I’d maintain a client python package that takes care of the pre/post processing of each request, so that I only send the raw data to the inference service and I post process the raw output of the model returned by the inference service. But I understand why it might be desirable for the users to have these steps happening on the server. What is challenging in this context? Defining how t...
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors:
Could be, but not sure -> from 0.16.2 to 0.16.3
now I can do nvcc --version and I get: Cuda compilation tools, release 10.1, V10.1.243
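For checking that output programmatically, here is a small sketch that extracts the version from the line quoted above. The function name `parse_nvcc_release` is hypothetical, and the regular expression only assumes the "release X.Y, VX.Y.Z" shape seen in this message.

```python
import re


def parse_nvcc_release(output: str):
    """Extract the CUDA version from `nvcc --version` output (hypothetical helper)."""
    match = re.search(r"release (\d+)\.(\d+), V(\d+(?:\.\d+)*)", output)
    if match is None:
        return None
    return {
        "major": int(match.group(1)),
        "minor": int(match.group(2)),
        "full": match.group(3),
    }


info = parse_nvcc_release("Cuda compilation tools, release 10.1, V10.1.243")
```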
Thanks! I will investigate further, I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
Thanks a lot, I will play with that!
No worries! I asked more to be informed, I don't have a real use-case behind. This means that you guys internally catch the argparser object somehow right? Because you could also simply use sys argv to find the parameters, right?
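To illustrate the difference the question is getting at, here is a sketch comparing parameters read through an argparse parser with parameters scraped from a raw argument list. It does not show how ClearML does this internally; the argument names `--lr` and `--epochs` and both helper functions are made up for the example.

```python
import argparse


def parse_with_argparse(argv):
    """Parse argv with a parser, which knows types and defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=10)
    return vars(parser.parse_args(argv))


def parse_raw_argv(argv):
    """Naively pair up `--key value` tokens; values stay strings."""
    params = {}
    tokens = iter(argv)
    for token in tokens:
        if token.startswith("--"):
            params[token[2:]] = next(tokens, None)
    return params


argv = ["--lr", "0.1"]
typed = parse_with_argparse(argv)  # includes the default for --epochs
raw = parse_raw_argv(argv)         # only sees what was typed on the command line
```

The raw approach misses arguments left at their defaults and loses type information, which is one reason a tool might prefer to look at the parser object rather than at sys.argv.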
Hi AgitatedDove14 , How should we proceed to fix this bug? Should I open an issue in github? Should I try to make a minimal reproducible example? It’s blocking me atm
PS: in the new env, I’ve set num_replicas: 0, so I’m only talking about primary shards…
Yea, the config is not appearing in the webUI anymore with this method 😞
No I agree, it’s probably not worth it
ClearML has a task.set_initial_iteration, I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
But still the same issue, I am not sure whether I use it correctly and if it’s a bug or not, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Yes that’s correct - the weird thing is that the error shows the right detected region
and the agent says agent.cudnn_version = 0
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists