
Weird. When I spawn the agent with sudo I get this behaviour. Without sudo everything works fine.
@AgitatedDove14 Any ideas on this issue? Thanks!
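clearml-1.13.1
from clearml import Task

# Register the requirements file before the task is created
Task.add_requirements("requirements.txt")
task = Task.init(project_name="My project", task_name="My task")
# Stop local execution and enqueue the task for an agent
task.execute_remotely(queue_name="default")
...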
Once I used "clearml-data add --folder *" everything works correctly (though all files, recursively, ended up in the root; luckily they all had different names).
I hacked around the problem by setting api.files_server for the agent to the public URL, but ideally I'd avoid going through the reverse proxy if there's some path_substitution equivalent for this. Thanks
I'll check the docker command next time this happens, thanks! As for the machines, all of them have GPUs (they are in fact identical, cloned VMs), and if I rerun the task and land on the exact same machine it works, so the problem is somewhere in "GPU detection" or similar. Hopefully we'll know more once it happens again, thanks.
Yeah, I'm starting to lean towards enterprise solution more and more 😁
Thanks!
One more related question (I hope there's a similar solution): when I log images, they appear in the UI with http://<my-ip>, so they are inaccessible (they should be translated to None). Is there any path_substitution variant for this scenario in the config? I can't seem to find it in the docs. Thanks!
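To make the translation I'm asking about concrete, here is a minimal sketch of the rewrite in plain Python. It is not a ClearML feature or config option; rewrite_to_public, the internal address 10.0.0.5 and the domain files.example.com are made-up placeholders for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_to_public(url: str, internal_host: str, public_base: str) -> str:
    """Swap scheme and host of `url` for those of `public_base`
    when `url` points at `internal_host` (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.hostname != internal_host:
        return url  # not an internal link, leave it untouched
    public = urlsplit(public_base)
    # Keep path, query and fragment; replace only scheme and host[:port].
    return urlunsplit(
        (public.scheme, public.netloc, parts.path, parts.query, parts.fragment)
    )


# Placeholder values, assumed for the example only.
print(rewrite_to_public(
    "http://10.0.0.5:8081/proj/img.png", "10.0.0.5", "https://files.example.com"
))
```

With the placeholder values above this prints https://files.example.com/proj/img.png, which is the form the UI links would need to take.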
Not a ClearML employee (just a recent user), but maybe this will help? None
To make sure I understand: I need to set up a domain with a cert and it should work, no additional ClearML config is required?
Ooooh, I didn't notice that field is editable. Thanks!
So I should use add_requirements before Task.init and delete the list from webUI when needed?
I've tried that one, but it behaves the same :/
I'll try to reproduce it and will get back to you. The HPO task (parent of this task) was indeed deleted, but that shouldn't matter? One of the models was deleted but the other one wasn't.
Perfect, exactly what I needed, thanks!
Additional info:
-Public URL uses HTTPS, internal traffic doesn't.
-clearml.storage fails while trying to fetch None ...
Meaning the internal IP was replaced with the public URL at some point for some reason; that combination (http with the public URL) doesn't exist in that form anywhere in any of the configs.
Single version. The issue seems to be the creation. If I use "clearml-data sync --folder ." it says it uploaded all the files. Running "clearml-data verify --folder ." says it's all good. Metadata on the WebUI reports the expected number of files. However, once I extract the zips (or download the dataset through Python API or CLI) not all the files are there.
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure so I'd have to write a scrip...
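For anyone who wants to check the same thing, this is roughly how the missing files can be listed: compare relative paths in the original folder against the extracted copy. It is a standard-library sketch and does not use the ClearML API; relative_files and missing_files are names I made up, and the temporary folders only stand in for the real source and extracted dataset directories.

```python
import tempfile
from pathlib import Path


def relative_files(root):
    """Return the set of file paths under `root`, relative to it."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def missing_files(source_dir, extracted_dir):
    """List files present in `source_dir` but absent from `extracted_dir`."""
    return sorted(relative_files(source_dir) - relative_files(extracted_dir))


# Tiny demo with throwaway folders standing in for the real ones.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "a").mkdir()
    (Path(src) / "a" / "x.txt").write_text("x")
    (Path(src) / "y.txt").write_text("y")
    (Path(dst) / "y.txt").write_text("y")
    print(missing_files(src, dst))  # ['a/x.txt']
```

Pointing missing_files at the synced folder and at the extracted download shows exactly which relative paths were dropped.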
"Executing: ['docker', 'run', '-t', '--gpus', '"device=0"'" - so the container is executed with --gpus.
However, torch.cuda.is_available() returns False.
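A small diagnostic that could be run inside the container the next time this happens, as a sketch only. gpu_diagnostics is a made-up helper; it assumes nothing beyond the standard library and reports PyTorch details only if torch happens to be importable.

```python
import os
import shutil


def gpu_diagnostics(env=None):
    """Collect a few facts that usually explain a False from
    torch.cuda.is_available() (hypothetical helper)."""
    env = os.environ if env is None else env
    report = {
        # Is the NVIDIA driver tooling visible inside this environment?
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Variables that restrict which devices are exposed.
        "NVIDIA_VISIBLE_DEVICES": env.get("NVIDIA_VISIBLE_DEVICES"),
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch

        report["torch_version"] = torch.__version__
        report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["torch_version"] = None  # torch not installed here
    return report


print(gpu_diagnostics())
```

If torch_cuda_build comes back as None, a CPU-only wheel was installed and the --gpus flag is not the problem; if nvidia_smi_on_path is False, the driver was not exposed to the container.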
Yes, SSH_AUTH_SOCK is defined on the host. Should I then add the SSH mount manually through "extra flags"?
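By "extra flags" I mean something like the following, which is the usual Docker pattern for forwarding an SSH agent socket. This is a sketch: ssh_agent_docker_args and the in-container path /ssh-agent are my own placeholders, and where exactly the agent accepts such flags should be checked against the ClearML agent documentation.

```python
import os


def ssh_agent_docker_args(env=None, container_sock="/ssh-agent"):
    """Build `docker run` flags that forward the host SSH agent socket
    into the container (hypothetical helper)."""
    env = os.environ if env is None else env
    host_sock = env.get("SSH_AUTH_SOCK")
    if not host_sock:
        return []  # no agent running on the host, nothing to forward
    return [
        "-v", f"{host_sock}:{container_sock}",    # bind-mount the socket
        "-e", f"SSH_AUTH_SOCK={container_sock}",  # point ssh/git at it
    ]


# Example socket path, assumed for illustration.
print(ssh_agent_docker_args({"SSH_AUTH_SOCK": "/tmp/ssh-abc/agent.123"}))
```

Note that if the agent is started with sudo, SSH_AUTH_SOCK is often not passed through to the root environment, in which case the list comes back empty.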
@CheekyDolphin49 You should probably use 'General/coupling' and 'General/rep'
The issue was that .ssh wasn't propagated, so the git repository couldn't be cloned.
No worries, sorry for pinging, I was just making sure you (or anyone else who might help) don't miss it 🙂
I use Task.add_requirements("requirements.txt") right before the Task.init.
In main, I parse the command-line arguments, call add_requirements, initialize the Task and call execute_remotely. After that it's all pretty much the usual workflow: initialize the model, set up the dataloaders and optimizer, and run the training. I'm using pytorch-ignite and have model checkpoint made on validation evaluator COMPL...
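Stripped down, the order of calls described above looks roughly like this. It is a sketch, not my actual script: the --queue and --epochs arguments are made up for illustration, and the training part is left out.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (argument names are illustrative)."""
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument("--queue", default="default")
    parser.add_argument("--epochs", type=int, default=10)  # hypothetical option
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Imported here so argument parsing works even without clearml installed.
    from clearml import Task

    # 1. Register requirements before the task exists.
    Task.add_requirements("requirements.txt")
    # 2. Create the task.
    task = Task.init(project_name="My project", task_name="My task")
    # 3. Stop local execution and enqueue for an agent.
    task.execute_remotely(queue_name=args.queue)

    # 4. From here on the code runs on the agent: model, dataloaders,
    #    optimizer and the training loop would follow.


# main() is meant to be called from the script's entry point.
```

The important part is only the ordering of steps 1 to 3; everything after execute_remotely runs on the remote machine.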
I just added the secrets/keys to docker-compose.yml and restarted everything but no change.
@SuccessfulKoala55 kind reminder not to miss this when you catch time, thanks!
Having a bit of trouble with this one (sorry for possibly dumb questions).
Are there any docs on how to add certs to the docker image? I see this ( None ), which is where letsencrypt points me to, but I'm not sure what the proper way to do this on the webapp docker is (I'd assume there's a non-hacky way to do it, as others are presumably using the same setup I'm trying to make work).