This https://stackoverflow.com/questions/65109764/wildcard-search-issue-with-long-datatype-in-elasticsearch says long types can be converted to string to do the search
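For illustration, a minimal sketch of that workaround (the index name, field name and the multi-field mapping are my assumptions, not from the question):
```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Map the long field with a keyword sub-field, so its value is also indexed
# as a string that wildcard queries can match (done once, at index creation)
es.indices.create(index="events", mappings={
    "properties": {
        "task_id": {
            "type": "long",
            "fields": {"as_string": {"type": "keyword"}},
        }
    }
})

# Wildcard search runs against the string copy of the numeric value
hits = es.search(index="events", query={
    "wildcard": {"task_id.as_string": {"value": "98*"}}
})
```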
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
How about the overhead of running the training in Docker on a VM?
SuccessfulKoala55 I tried to set up the clearml-agent on a different machine and now I get a different error message in the logs:
Warning: could not locate requested Python version 3.6, reverting to version 3.6
clearml_agent: ERROR: Python executable with version '3.6' defined in configuration file, key 'agent.default_python', not found in path, tried: ('python3.6', 'python3', 'python')
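For reference, these are the two clearml.conf keys I'd expect to matter here, as a sketch (the 3.8 version and the binary path are assumptions for a machine where 3.6 isn't installed):
```
agent {
    # must match a Python version that actually exists on the machine...
    default_python: 3.8
    # ...or point the agent at an explicit interpreter instead
    python_binary: "/usr/bin/python3.8"
}
```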
This is new, right? It detects the local package, uninstalls it and reinstalls it?
Even if I moved the GitHub workers internally, where they could have access to the prod server, I am not sure I would like that, because it would pile up unnecessary test data in the prod server
Ok, I could reproduce with Firefox and Chromium. Steps:
1. Add creds (either via the popup or in the settings)
2. Go to /settings/webapp-configuration -> creds should be there
3. Hit F5 -> creds are gone
CostlyOstrich36 How is clearml-session setting the ssh config?
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
AgitatedDove14 Unfortunately no, I already had the problem before using the function, I added it hoping it would fix the issue but it didn’t
self.clearml_task.get_initial_iteration() also gives me the correct number
Oh wow! Is it possible to not specify a remote task? (If I am working with Task.set_offline(True))
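For context, this is the offline flow I mean, as a minimal sketch (project/task names are placeholders, and the import step at the end is my assumption of how the recorded session gets brought back online later):
```python
from clearml import Task

# Run fully offline: nothing is sent to a server, everything is recorded locally
Task.set_offline(offline_mode=True)

task = Task.init(project_name="demo", task_name="offline-run")
task.connect({"lr": 0.001})
task.close()  # writes the offline session to a local folder/zip

# Later, on a machine that can reach the server, import the recorded session:
# Task.import_offline_session("/path/to/offline/session.zip")
```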
I don’t have a registry to push my image to. I think I can get around it actually - will it work if I just build the image locally once, then start the agent? Docker would recognise that image locally and just use it, right? I won’t need to update that image often anyway
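i.e. roughly this (image tag and queue name are made up, and I haven't verified whether the agent tries to pull before falling back to the local image):
```
docker build -t my-training-image:latest .
clearml-agent daemon --queue default --docker my-training-image:latest
```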
same as the first one described
I don’t think it is; I was rather wondering how you handled it, to understand potential sources of slowdown in the training code
I am actually calling the following later in the start_training function:
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training_func)
So my backend should be nccl and not gloo, right? Not sure how important it is, but I read in the https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training
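To make the CPU/GPU distinction explicit, a minimal sketch of picking the backend at runtime (training_func stands in for the actual training loop):
```python
import torch
import ignite.distributed as idist

def training_func(local_rank):
    ...  # the actual training loop

# nccl is the recommended backend for distributed GPU training, gloo for CPU
backend = "nccl" if torch.cuda.is_available() else "gloo"

with idist.Parallel(backend=backend) as parallel:
    parallel.run(training_func)
```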
I will try with clearml==1.1.5rc2
Yes, I set:
auth {
    cookies {
        httponly: true
        secure: true
        domain: ".clearml.xyz.com"
        max_age: 99999999999
    }
}
It always worked for me this way
Ok, so what worked for me in the end was:
config = task.connect_configuration(read_yaml(conf_path))
cfg = OmegaConf.create(config._to_dict())
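In context, a minimal sketch of that workaround (read_yaml is reconstructed here as an assumption, and _to_dict() is a private method on ClearML's proxy object, so it may change across versions):
```python
import yaml
from clearml import Task
from omegaconf import OmegaConf

def read_yaml(path):
    # assumed helper: load the YAML file into a plain dict
    with open(path) as f:
        return yaml.safe_load(f)

task = Task.init(project_name="demo", task_name="omegaconf-config")  # placeholder names
conf_path = "conf/config.yaml"  # placeholder path

# connect_configuration returns a proxy around the dict; rebuild a DictConfig from it
config = task.connect_configuration(read_yaml(conf_path))
cfg = OmegaConf.create(config._to_dict())
```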
GrumpyPenguin23 yes, it is the latest
AgitatedDove14, what I was looking for was: parent_task = Task.get_task(task.parent)
It broke holding shift to select multiple experiments, btw
I just moved one experiment to another project; after moving it I am taken to the new project, where the layout is then reset
CostlyOstrich36, actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other
sorry, the clearml-session. The error is the one I shared at the beginning of this thread
I still don't see why you would change the type of the cloned Task, I'm assuming the original Task had the correct type, no?
Because it is easier for me to create a training task out of the controller task by cloning it (so that parameters are prefilled and I can set the parent task id)
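Roughly what I mean, as a minimal sketch (the task id, name and queue are made-up placeholders):
```python
from clearml import Task

# Clone the controller so the training task starts with the same parameters
controller = Task.get_task(task_id="<controller-task-id>")
training = Task.clone(
    source_task=controller,
    name="training (cloned from controller)",
    parent=controller.id,  # keep the lineage explicit
)
Task.enqueue(training, queue_name="default")
```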
Maybe there is a setting in Docker to move the space it uses to a different location? I can simply increase the storage of the first disk, no problem with that
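(For the record, the setting I had in mind would be Docker's data-root key in /etc/docker/daemon.json, something like the sketch below with a made-up path, followed by a daemon restart.)
```
{
  "data-root": "/mnt/bigdisk/docker"
}
```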
The task with id a445e40b53c5417da1a6489aad616fee is not aborted and is still running