AgitatedDove14 Yes exactly, I tried the fix suggested in the github issue (urllib3>=1.25.4) and the ImportError disappeared
Hi AgitatedDove14, that's super exciting news! 🤩
Regarding the two outstanding points:
In my case, I'd maintain a client python package that takes care of the pre/post processing of each request, so that I only send the raw data to the inference service and I post-process the raw output of the model returned by the inference service. But I understand why it might be desirable for the users to have these steps happening on the server. What is challenging in this context? Defining how t...
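To make it more concrete, this is roughly the shape of the client package I have in mind (the endpoint URL and the preprocess/postprocess functions are placeholders, not an existing API):

import requests

def preprocess(sample: dict) -> dict:
    # client-side preparation of the request, e.g. normalization / tokenization
    return sample

def postprocess(raw_output: dict) -> dict:
    # map the raw model output returned by the service back to something usable
    return raw_output

class InferenceClient:
    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def predict(self, sample: dict) -> dict:
        payload = preprocess(sample)
        response = requests.post(self.endpoint, json=payload, timeout=30)
        response.raise_for_status()
        return postprocess(response.json())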
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors:
Could be, but not sure -> from 0.16.2 to 0.16.3
now I can do nvcc --version and I get: Cuda compilation tools, release 10.1, V10.1.243
Thanks! I will investigate further, I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
Thanks a lot, I will play with that!
No worries! I asked more to be informed, I don't have a real use case behind it. This means that you guys internally catch the argparser object somehow, right? Because you could also simply use sys.argv to find the parameters, right?
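Just to illustrate what I mean by the sys.argv route (a toy sketch, not how clearml actually does it):

import sys

def capture_cli_parameters(argv=None):
    # collect --key value pairs straight from the command line, no argparse needed
    argv = sys.argv[1:] if argv is None else argv
    params, key = {}, None
    for token in argv:
        if token.startswith("--"):
            key = token.lstrip("-")
            params[key] = True  # treat as a flag until a value follows
        elif key is not None:
            params[key] = token
            key = None
    return params

if __name__ == "__main__":
    print(capture_cli_parameters())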
Hi AgitatedDove14, how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It's blocking me atm
PS: in the new env, I've set num_replicas: 0, so I'm only talking about primary shards…
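(For the record, one way this can be applied on existing indices is through the Elasticsearch settings API; the index pattern and host below are just examples, assuming ES is reachable on localhost:9200:)

import requests

resp = requests.put(
    "http://localhost:9200/events-*/_settings",  # example index pattern, adjust to your indices
    json={"index": {"number_of_replicas": 0}},
    timeout=10,
)
print(resp.status_code, resp.json())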
Yea, the config is not appearing in the webUI anymore with this method
No I agree, it's probably not worth it
ClearML has a task.set_initial_iteration, I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
But still the same issue, I am not sure whether I use it correctly and if it's a bug or not, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
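For context, the surrounding code is roughly this (the checkpoint path, the to_save dict and the ignite engine are placeholders standing in for my real training setup):

import torch
from clearml import Task
from ignite.engine import Engine
from ignite.handlers import Checkpoint

task = Task.init(project_name="demo", task_name="resume-from-checkpoint")  # placeholder names

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
trainer = Engine(lambda engine, batch: None)  # dummy update function
to_save = {"model": model, "optimizer": optimizer, "trainer": trainer}

# restore model / optimizer / engine state from the checkpoint file
checkpoint_fp = "checkpoint_1000.pt"  # placeholder path
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)

# tell ClearML which iteration training resumes from, so the scalars line up
task.set_initial_iteration(trainer.state.iteration)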
Yes that's correct - the weird thing is that the error shows the right detected region
and the agent says agent.cudnn_version = 0
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists
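(Roughly the get-or-create pattern I mean, with illustrative names:)

from clearml import Task

def get_or_create_subtask(project_name: str, task_name: str) -> Task:
    # look for an existing task with the same name first
    existing = Task.get_tasks(project_name=project_name, task_name=task_name)
    if existing:
        return existing[0]
    # otherwise create a fresh one
    return Task.create(project_name=project_name, task_name=task_name)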
AgitatedDove14 Is it fixed with trains-server 0.15.1?
Is there any channel where we can see when new self-hosted server versions are published?
See my answer in the issue - I am not using docker
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment to the queue at that point (trains-agent-1 running two experiments at the same time), that experiment will stay in the queue (trains-agent-2 and trains-agent-3 will not pick it because they are also running experiments)