I have a dataset of ~24GB and I've tried multiple times to upload it with the sync function.
- The cache doesn't work, it attempts to download the dataset every time.
- It "misses" some files somehow. So once the job runs it fails due to missing files.
- I've run verify afterwards (from the machine I used to upload the data) and it says it's all good. However, once I inspect the zip files on the server (looking for the files in the specific zip the state json says they're in), the files are indee...
OSX 12.5.1
Python 3.8.1
ClearML 1.13.1
"clearml-data add --folder ./*" always flattens everything, I have that reproducible 100%.
@<1523701205467926528:profile|AgitatedDove14> Any ideas on this issue? Thanks!
It seems that task.set_base_docker must be called with docker_image as well (otherwise docker_arguments don't propagate), not sure if it's a bug or not, but I have a workaround now, thanks!
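For reference, a minimal sketch of what works for me now (image and arguments are just placeholders):

from clearml import Task

task = Task.init(project_name="My project", task_name="My task")
# passing docker_image explicitly; with only docker_arguments set, the arguments didn't propagate
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",
    docker_arguments="--ipc=host",
)
task.execute_remotely(queue_name="default")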
One more related question (I hope there's a similar solution): when I log images, they appear in the UI with http://<my-ip> so they are inaccessible (they should be translated to the public URL). Is there any path_substitution variant for this scenario in the config? I can't seem to find it in the docs. Thanks!
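Something along the lines of the path_substitution block I've seen for storage is what I'm after, e.g. in clearml.conf (the prefixes here are made up):

sdk {
    storage {
        path_substitution = [
            {
                # prefix as registered on the server (internal address)
                registered_prefix: "http://10.0.0.1:8081"
                # prefix clients should actually use (public URL)
                local_prefix: "https://files.example.com"
            }
        ]
    }
}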
Neither, metric is a number you report through the Logger:
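For example (title/series/value here are placeholders):

from clearml import Logger

# report a single scalar value for the given iteration
Logger.current_logger().report_scalar(
    title="accuracy", series="validation", value=0.92, iteration=100
)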
In the Task info tab there is no GPU and pytorch doesn't see the cuda device.
I'm not sure how to debug it; that would be my first question. So I should first check whether docker is executed with --gpus? I'll pay attention to this the next time it happens, thanks.
@<1714813627506102272:profile|CheekyDolphin49> You should probably use 'General/coupling' and 'General/rep'
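i.e. parameters connected as a plain dict land in the General section, so as a minimal sketch (values are placeholders):

from clearml import Task

task = Task.init(project_name="My project", task_name="My task")
params = {"coupling": 0.5, "rep": 2}
# a connected dict shows up under the "General" section in the UI,
# hence "General/coupling" and "General/rep"
task.connect(params)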
I've tried that one, but it behaves the same :/
Got it. Is there any way to skip a point at some iteration? If I just don't report it at iteration t I'll get interpolation from t-1 to t+1.
Once I used the clearml-data add --folder * command everything worked correctly (though all files recursively ended up in the root; luckily they were all named differently).
I just added the secrets/keys to docker-compose.yml and restarted everything but no change.
Single version. The issue seems to be the creation. If I use "clearml-data sync --folder ." it says it uploaded all the files. Running "clearml-data verify --folder ." says it's all good. Metadata on the WebUI reports the expected number of files. However, once I extract the zips (or download the dataset through Python API or CLI) not all the files are there.
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure so I'd have to write a scrip...
I'll try to reproduce it and will get back to you. The HPO task (the parent of this task) was indeed deleted, but that shouldn't matter? One of the models was deleted but the other one wasn't.
Doesn't work unfortunately 😕 Thanks either way!
clearml-1.13.1
from clearml import Task

Task.add_requirements("requirements.txt")
task = Task.init(project_name="My project", task_name="My task")
task.execute_remotely(queue_name="default")
...
Oh, I misunderstood the docs/examples then, sorry. I'm using pytorch-ignite.
Thanks for the tip!
Ooooh, I didn't notice that field is editable. Thanks!
So I should use add_requirements before Task.init and delete the list from webUI when needed?
Kind ping on this thread, thanks! 🙂
No worries, sorry for pinging, was just making sure you (or anyone else who might help) doesn't miss it 🙂
I use Task.add_requirements("requirements.txt") right before the Task.init.
In main, I parse command-line arguments, call add_requirements, initialize the Task and call execute_remotely. After that it's all pretty much the usual workflow: initialize the model, set up dataloaders and the optimizer, and run the training. I'm using pytorch-ignite and have a model checkpoint made on the validation evaluator's COMPL...
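Roughly this shape, as a minimal runnable sketch (dummy model/data, project/queue names are placeholders):

import argparse
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from clearml import Task
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import ModelCheckpoint

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=2)
    args = parser.parse_args()

    Task.add_requirements("requirements.txt")
    task = Task.init(project_name="My project", task_name="My task")
    task.execute_remotely(queue_name="default")

    # dummy model/data standing in for the real setup
    model = nn.Linear(4, 2)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,))), batch_size=8)

    trainer = create_supervised_trainer(model, optimizer, criterion)
    evaluator = create_supervised_evaluator(model)

    # save a model checkpoint when the validation evaluator's COMPLETED event fires
    handler = ModelCheckpoint(dirname="checkpoints", filename_prefix="best", n_saved=1)
    evaluator.add_event_handler(Events.COMPLETED, handler, {"model": model})

    @trainer.on(Events.EPOCH_COMPLETED)
    def run_validation(engine):
        evaluator.run(loader)

    trainer.run(loader, max_epochs=args.epochs)

if __name__ == "__main__":
    main()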
@<1523701087100473344:profile|SuccessfulKoala55> Kind reminder again, thanks and sorry!
Added -v /home/uname/.ssh:/root/.ssh and it resolved the issue. I assume this is some sort of bug then?
Additional info:
- Public URL uses HTTPS, internal traffic doesn't.
- clearml.storage fails while trying to fetch that rewritten URL ...
Meaning it just replaced the internal IP with the URL at some point for some reason; it doesn't exist in that form anywhere in any configs (http and public URL).
Yes, SSH_AUTH_SOCK is defined on the host. Should I manually add SSH mounting then through "extra flags"?
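i.e. something like this in the agent's clearml.conf (the path is a guess for my setup):

agent {
    # mount the host's .ssh into the container so git-over-ssh works
    extra_docker_arguments: ["-v", "/home/uname/.ssh:/root/.ssh"]
}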
Probably not, I'm trying to access it via external IP. Could you point me to instructions for that in the docs, I don't remember seeing it anywhere? Thanks!
@<1523701087100473344:profile|SuccessfulKoala55> kind reminder not to miss this when you catch time, thanks!
Tried but it didn't help. I suspect the issue is here: "'docker', 'run', '-t', '--gpus', '"device=0"', '-v', '/tmp/ssh-krPvUxRks5/agent.1949:/tmp/ssh-krPvUxRks5/agent.1949', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-krPvUxRks5/agent.1949'"
It passes the SSH socket instead of the .ssh directory (not sure why; an agent I have running on my own machine behaves differently). Do you happen to know how to fix this? Thanks!