PS: in the new env, I've set num_replicas: 0, so I'm only talking about primary shards…
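(For context, a minimal sketch of how the replica count can be changed on existing Elasticsearch indices; the endpoint is a placeholder and this isn't necessarily the exact way the setting was applied in my env:)
```python
import requests

# Placeholder: local Elasticsearch endpoint used by the server
ES = "http://localhost:9200"

# Set number_of_replicas to 0 on all indices so only primary shards are allocated
resp = requests.put(
    f"{ES}/_all/_settings",
    json={"index": {"number_of_replicas": 0}},
)
print(resp.json())
```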
Yea, the config is not appearing in the webUI anymore with this method 😞
No I agree, it’s probably not worth it
I could delete the files manually with sudo rm (sudo is required, otherwise I get Permission Denied)
ClearML has a task.set_initial_iteration, I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
But still the same issue, I am not sure whether I use it correctly and if it's a bug or not, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Yes that’s correct - the weird thing is that the error shows the right detected region
and the agent says agent.cudnn_version = 0
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists; roughly the pattern sketched below.
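(A rough sketch of what I mean by "retrieve or create"; the helper and names are hypothetical, not my actual code:)
```python
from clearml import Task

def get_or_create_subtask(project_name: str, task_name: str) -> Task:
    """Return an existing task with this name if there is one, otherwise create it."""
    existing = Task.get_tasks(project_name=project_name, task_name=task_name)
    if existing:
        # Reuse the task created by a previous run
        return existing[0]
    # No match found: create a fresh (draft) task
    return Task.create(project_name=project_name, task_name=task_name)
```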
AgitatedDove14 Is it fixed with trains-server 0.15.1?
Is there any channel where we can see when new self-hosted server versions are published?
See my answer in the issue - I am not using docker
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment to the queue at that point (trains-agent-1 running two experiments at the same time), that experiment will stay in the queue (trains-agent-2 and trains-agent-3 will not pick it because they are also running experiments)
(I am not part of the awesome ClearML team, just a happy user 🙂 )
AppetizingMouse58 btw I had to delete the old logs index before creating the alias, otherwise ES won’t let me create an alias with the same name as an existing index
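(Roughly what I did, as a sketch; the endpoint and index/alias names are placeholders, not the real ones:)
```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Delete the old index first: ES refuses an alias whose name collides with an existing index
requests.delete(f"{ES}/old_logs_index")

# Then add the alias with that name, pointing at the new index
requests.post(
    f"{ES}/_aliases",
    json={"actions": [{"add": {"index": "new_logs_index", "alias": "old_logs_index"}}]},
)
```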
AgitatedDove14 I see that the default is sample_frequency_per_sec=2.0, but in the UI I don't see that resolution (i.e. it logs every ~120 iterations, corresponding to ~30 secs). What is the difference with report_frequency_sec=30.0?
Hi SuccessfulKoala55, it's not really wrong, rather I don't understand it: the docker image with the args after it
Just found it, yea, very cool! Thanks!
So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)
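(For anyone else hitting this: the wheel can be installed straight from that URL, e.g. pip install https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl )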
Maybe there is a setting in Docker to move the space used to a different location? I can simply increase the storage of the first disk, no problem with that
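(One option I'm aware of, not sure it's the recommended one here, is Docker's data-root setting in /etc/docker/daemon.json, which relocates everything Docker stores; the path below is just a placeholder, and the Docker daemon needs a restart afterwards:)
```
{
  "data-root": "/mnt/bigdisk/docker"
}
```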