So I guess the problem is that the following snippet:
from clearml import Task
Task.init()
should be added before the if __name__ == "__main__": line?
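For reference, a minimal sketch of the placement being asked about (project and task names are placeholders, and whether this is the right fix is exactly the question):

```python
from clearml import Task

# Initialize ClearML at module level, i.e. before the main guard
task = Task.init(project_name="examples", task_name="cifar10-distributed")  # placeholder names

def main():
    ...  # training code goes here

if __name__ == "__main__":
    main()
```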
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
I made some progress TimelyPenguin76. Now the task runs, but I get the following error from docker:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
But according to the disk graphs, the OS disk is being used, but not the data disk.
Sure! Here are the relevant parts:
...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
AgitatedDove14 Should I create an issue for this to keep track of it?
Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2
I added the pass_hashed option and restarted the server, but I still get the same problem.
Relevant issue in Elasticsearch forums: https://discuss.elastic.co/t/elasticsearch-5-6-license-renewal/206420
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist.
I was able to fix it by applying for a license and registering it.
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run.
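A rough sketch of what that controller could look like (the task ID, queue names and interval below are placeholders, and this assumes Task.clone / Task.enqueue from the clearml SDK):

```python
import time
from clearml import Task

# Controller task, meant to run in the "services" queue
controller = Task.init(project_name="controllers", task_name="periodic scheduler")

TEMPLATE_TASK_ID = "<template-task-id>"  # placeholder: the task to re-run
EXECUTION_QUEUE = "default"              # placeholder: queue served by the worker agents
INTERVAL_SECONDS = 60 * 60               # placeholder: run once an hour

while True:
    # Clone the template task and enqueue the clone for execution
    cloned = Task.clone(source_task=TEMPLATE_TASK_ID, name="scheduled run")
    Task.enqueue(cloned, queue_name=EXECUTION_QUEUE)
    time.sleep(INTERVAL_SECONDS)
```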
Isn't it overkill to run a whole Ubuntu 18.04 container just to run a dead-simple controller task?
I opened an issue ( https://github.com/pytorch/ignite/issues/2343 ) in ignite's repo and a PR ( https://github.com/pytorch/ignite/pull/2344 ), could you please have a look? There might be a bug in clearml Task.init in distributed environments.
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file to limit the size of the logs. Since there is no limit by default, their size will grow forever, which doesn't sound ideal: https://docs.docker.com/compose/compose-file/#logging
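For example, a minimal sketch using the standard json-file logging options (the service name and size limits are placeholders to adapt per service in the clearml docker-compose file):

```yaml
services:
  apiserver:            # placeholder: repeat for each service in the compose file
    logging:
      driver: json-file
      options:
        max-size: "10m"  # rotate each log file once it reaches 10 MB
        max-file: "3"    # keep at most 3 rotated files per container
```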
There is no way to filter on long types? I can’t believe it
Mmmh, unfortunately not easily… I will try to debug deeper today. Is there a way to resume a task from code so I can debug locally?
Something like replacing Task.init with Task.get_task, so that Task.current_task is the same task as the output of Task.get_task?
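A rough sketch of the idea (the task ID is a placeholder, and whether Task.current_task can be made to point at it this way is exactly the open question):

```python
from clearml import Task

# Desired workflow: instead of creating a new task with Task.init(),
# reattach to the existing (remotely created) task by its ID...
existing = Task.get_task(task_id="<existing-task-id>")  # placeholder ID

# ...so that code relying on the "current" task could be debugged locally
# against that same task.
print(existing.id, existing.name)
```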
Thanks a lot, I will play with that!
I still don't see why you would change the type of the cloned Task; I'm assuming the original Task had the correct type, no?
Because it is easier for me to create a training task out of the controller task by cloning it (so that the parameters are prefilled and I can set the parent task ID).
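Roughly, as a sketch (the task name and queue are placeholders; this assumes the parent argument of Task.clone is the way to set the parent task ID):

```python
from clearml import Task

controller = Task.current_task()  # the running controller task

# Clone the controller so its parameters come prefilled,
# and mark the controller as the parent of the new training task.
training = Task.clone(
    source_task=controller,
    name="training task",   # placeholder name
    parent=controller.id,
)
Task.enqueue(training, queue_name="default")  # placeholder queue
```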
So the migration from one server to another + adding new accounts with password worked, thanks for your help!
I don’t have a registry to push my image to. I think I can get around it actually: will it work if I just build the image locally once and then start the agent? Docker would recognise that image locally and just use it, right? I won’t need to update that image often anyway.
It worked for the other folder, so I assume yes: I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, updated the permissions, and now it works.
But most likely I need to update the permissions of /data as well.
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors:
AgitatedDove14 So I copy-pasted locally the cifar10-distributed.py example from pytorch-ignite ( https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py ). Then I added a requirements.txt and called clearml-task to run it on one of my agents. I adapted the script a bit (removed python-fire, since it’s not yet supported by clearml).
So I created a symlink /opt/trains/data -> /data.
Oh nice, thanks for pointing this out!