
Single version. The issue seems to be the creation. If I use "clearml-data sync --folder ." it says it uploaded all the files. Running "clearml-data verify --folder ." says it's all good. Metadata on the WebUI reports the expected number of files. However, once I extract the zips (or download the dataset through the Python API or CLI), not all the files are there.
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure so I'd have to write a scrip...
Got it. Is there any way to skip a point at some iteration? If I just don't report it at iteration t, I'll get interpolation from t-1 to t+1.
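To make the scenario concrete, this is roughly the pattern I mean (project/task/metric names are made up, and the metric is a stand-in):

    from clearml import Task

    task = Task.init(project_name="my_project", task_name="skip_demo")  # placeholder names
    logger = task.get_logger()

    for t in range(100):
        value = float(t) ** 0.5  # stand-in for the real metric
        if t == 50:
            # Not reporting iteration t: the scalar plot then draws a straight
            # line between t-1 and t+1 instead of showing a gap.
            continue
        logger.report_scalar(title="metric", series="val", value=value, iteration=t)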
Tried it, but it didn't help. I suspect the issue is here: "'docker', 'run', '-t', '--gpus', '"device=0"', '-v', '/tmp/ssh-krPvUxRks5/agent.1949:/tmp/ssh-krPvUxRks5/agent.1949', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-krPvUxRks5/agent.1949'"
It passes the SSH socket instead of the .ssh directory (not sure why; an agent I have running on my own machine behaves differently). Do you happen to know how to fix this? Thanks!
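In case anyone hits the same thing, my current guess at a workaround is forcing the mount through the agent's clearml.conf; I haven't verified that this fixes it, and the host path below is obviously machine-specific:

    # clearml.conf on the agent machine -- untested sketch
    agent {
        # Extra arguments passed to `docker run`, so the container gets the
        # .ssh directory and not just the forwarded SSH agent socket.
        extra_docker_arguments: ["-v", "/home/myuser/.ssh:/root/.ssh:ro"]
    }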
I'll try to reproduce it and will get back to you. The HPO task (the parent of this task) was indeed deleted, but that shouldn't matter, right? One of the models was deleted but the other one wasn't.
I'll check the docker command next time this happens, thanks! As for the machines, all of them have GPUs (they're in fact identical/cloned VMs), and if I rerun it and get the exact same machine again it works, so it's probably something in the "GPU detection" part or similar. We'll hopefully know more once it happens again, thanks.
I have a dataset of ~24GB and I've tried uploading it with the sync function multiple times (a quick check for the missing files is sketched after this list).
- The cache doesn't work; it attempts to download the dataset every time.
- It "misses" some files somehow, so once the job runs it fails due to missing files.
- I've run verify afterwards (from the machine I used to upload the data) and it says it's all good. However, once I inspect the zip files on the server (looking for the files in the specific zip the state json says they're in), the files are indee...
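A quick way to see which files are missing from the downloaded copy might be something like this (the dataset id is a placeholder; untested sketch):

    import os
    from clearml import Dataset

    ds = Dataset.get(dataset_id="<dataset-id>")  # placeholder id
    local_copy = ds.get_local_copy()  # cached, extracted copy of the dataset

    # Compare what the dataset claims to contain with what actually landed on disk.
    listed = set(ds.list_files())
    on_disk = {
        os.path.relpath(os.path.join(root, f), local_copy).replace(os.sep, "/")
        for root, _, files in os.walk(local_copy)
        for f in files
    }
    missing = listed - on_disk
    print(f"{len(missing)} files listed but not extracted")
    for path in sorted(missing):
        print(" ", path)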
Additional info:
- Public URL uses HTTPS, internal traffic doesn't.
- clearml.storage fails while trying to fetch None ...
Meaning it just replaced the internal IP with the URL at some point for some reason; it doesn't exist in that form anywhere in any of the configs (http and public URL).
Doesn't work unfortunately 😕 Thanks either way!
Perfect, exactly what I needed, thanks!
I hacked around it by setting api.files_server for the agent to the public URL, but ideally I'd avoid going through the reverse proxy if there's some path_substitution equivalent for this. Thanks
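For reference, the workaround I mentioned amounts to roughly this in the agent's clearml.conf (the public URL is a placeholder):

    api {
        # Point the agent at the public files server URL instead of the internal IP.
        files_server: "https://files.my-public-domain.example"
    }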
Found this, seems to be exactly this: None
It appears that running docker with --privileged resolves the issue, which is easier for me than editing all of the instances I've already created. Is there an easy way to add a docker argument in the Python script?
I've tried task.set_base_docker(docker_arguments="--privileged") right after Task.init, but it doesn't seem to work.
Thanks!
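For completeness, what I tried looks like the snippet below; my (possibly wrong) understanding is that it only affects the task when an agent later runs it inside a container, not the current local run (project/task names are placeholders):

    from clearml import Task

    task = Task.init(project_name="my_project", task_name="privileged_demo")  # placeholder names

    # Request extra `docker run` arguments for when an agent executes this task
    # in a container.
    task.set_base_docker(docker_arguments="--privileged")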
Ooooh, I didn't notice that field is editable. Thanks!
So I should use add_requirements before Task.init and delete the list from the WebUI when needed?
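i.e. something along these lines, if I understand correctly (package name/version are placeholders):

    from clearml import Task

    # Register the requirement before Task.init so it ends up in the task's
    # installed-packages list that the agent installs from.
    Task.add_requirements("some_package", "1.2.3")  # placeholder package/version

    task = Task.init(project_name="my_project", task_name="requirements_demo")  # placeholders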
One more related question (I hope there's a similar solution): when I log images, they appear in the UI with http://<my-ip>, so they are inaccessible (they should be translated to None). Is there any path_substitution variant for this scenario in the config? I can't seem to find it in the docs. Thanks!
The issue was that .ssh wasn't propagated, so the git repository couldn't be cloned.
Weird. When I spawn the agent with sudo I get this behaviour; without sudo everything works fine.
Not ClearML employee (just a recent user), but maybe this will help? None