The issue was that .ssh wasn't propagated, so the git repository couldn't be cloned.
No worries, and sorry for pinging; I was just making sure you (or anyone else who might help) don't miss it 🙂
I use Task.add_requirements("requirements.txt") right before the Task.init.
In main, I parse the command-line arguments, call add_requirements, initialize the Task, and call execute_remotely. After that it's all pretty much the usual workflow: initialize the model, set up the dataloaders and optimizer, and run the training. I'm using pytorch-ignite and have a model checkpoint made on the validation evaluator COMPL...
I worked around it by setting api.files_server for the agent to the public URL, but ideally I'd avoid going through the reverse proxy if there's some path_substitution equivalent for this. Thanks
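For reference, the workaround is just overriding the files server URL in the agent-side clearml.conf, roughly like this (the URL is a placeholder):

api {
    files_server: "https://files.my-public-domain.example"
}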
clearml-1.13.1
from clearml import Task
# Queue requirements.txt so the agent installs from it; must be called before Task.init
Task.add_requirements("requirements.txt")
task = Task.init(project_name="My project", task_name="My task")
# Stop local execution here and enqueue the task for a remote agent
task.execute_remotely(queue_name="default")
...
One more related question (I hope there's a similar solution): when I log images, they appear in the UI with http://<my-ip> so they are inaccessible (they should be translated to None ). Is there any path_substitution variant for this scenario in the config? I can't seem to find it in the docs. Thanks!
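For anyone else hitting this, the kind of rule I had in mind is the sdk-side path substitution sketched below (both prefixes are placeholders, and I'm not sure it applies to image URLs shown in the UI):

sdk {
    storage {
        path_substitution = [
            {
                registered_prefix = "http://<my-ip>:8081"
                local_prefix = "https://files.my-public-domain.example"
            }
        ]
    }
}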
Doesn't work unfortunately 😕 Thanks either way!
Having a bit of trouble with this one (sorry for possibly dumb questions).
Are there any docs on how to add certs to the docker image? I see this ( None ), which is where letsencrypt points me, but I'm not sure what the proper way to do this on the webapp docker is (I'd assume there's a non-hacky way to do it, since others are presumably using the same setup I'm trying to get working).
I've tried that one, but it behaves the same :/
Probably not; I'm trying to access it via the external IP. Could you point me to the instructions for that in the docs? I don't remember seeing it anywhere. Thanks!
@<1523701087100473344:profile|SuccessfulKoala55> kind reminder not to miss this when you catch time, thanks!
So after publishing a task (right click / Publish in the WebUI), one of the models got its ID changed to __DELETED__4be00...
The other one (last_model on the screenshot below) is all good and didn't get deleted in this way.
"best_model" exists on the disk and I can access it by taking last_model's URL and just changing the file name, but I cannot normally access it via id (which has now changed to __DELETED__4be00...). Any ideas why this might have happened?
Do you happen to know how to fix this? Thanks!
Found this, seems to be exactly this: None
It appears that running docker with --privileged resolves the issue, which is easier for me than editing all of the instances I've already created. Is there an easy way to add a docker argument in the Python script?
I've tried task.set_base_docker(docker_arguments="--privileged") right after Task.init but it doesn't seem to work.
Thanks!
Ooooh, I didn't notice that field is editable. Thanks!
So I should use add_requirements before Task.init and delete the list from webUI when needed?
@<1523701087100473344:profile|SuccessfulKoala55> Kind reminder again, thanks and sorry!
I'll try to reproduce it and will get back to you. The HPO task (this task's parent) was indeed deleted, but that shouldn't matter, should it? One of the models was deleted but the other one wasn't.
I know about clearml.conf, but I wanted to avoid ssh-ing into 50 instances to edit it.
task.set_base_docker does the job, but docker_arguments doesn't propagate if I leave docker_image as None (it just uses both the image and the arguments from the agent's clearml.conf). If I explicitly set both docker_image and docker_arguments in task.set_base_docker, it works fine.
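For the record, the call that ended up working looked roughly like this (the image name is just an example):

task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",
    docker_arguments="--privileged",
)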
"Executing: ['docker', 'run', '-t', '--gpus', '"device=0"'" - so the container is executed with --gpus.
However, torch.cuda.is_available() returns False.
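A quick sanity check I can run inside the container (if torch.version.cuda prints None, the installed wheel is a CPU-only build, which would also make is_available() return False despite --gpus):

import torch
print(torch.__version__, torch.version.cuda)  # None here means a CPU-only wheel
print(torch.cuda.is_available(), torch.cuda.device_count())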
Yeah, I'm starting to lean towards enterprise solution more and more 😁
Thanks!
Single version. The issue seems to be at creation time. If I use "clearml-data sync --folder ." it says it uploaded all the files, running "clearml-data verify --folder ." says it's all good, and the metadata in the WebUI reports the expected number of files. However, once I extract the zips (or download the dataset through the Python API or the CLI), not all the files are there.
"clearml-data add --folder ./*" seems to fix this issue, though it doesn't preserve my directory structure, so I'd have to write a scrip...
Weird. When I spawn the agent with sudo I get this behaviour; without sudo everything works fine.
Neither; a metric is a number you report through the Logger:
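A minimal sketch (project/task names and values are illustrative):

from clearml import Task

task = Task.init(project_name="My project", task_name="My task")
task.get_logger().report_scalar(title="accuracy", series="validation", value=0.93, iteration=100)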
I just added the secrets/keys to docker-compose.yml and restarted everything, but no change.
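(For context, the docker-compose.yml change I mean is roughly this, assuming the agent-services credentials are the ones in question; the key values are placeholders:)

agent-services:
  environment:
    CLEARML_API_ACCESS_KEY: "<access_key>"
    CLEARML_API_SECRET_KEY: "<secret_key>"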