Reputation
Badges 1
46 × Eureka!It seems that task.set_base_docker must be called with docker_image as well (otherwise docker_arguments don't propagate), not sure if it's a bug or not, but I have a workaround now, thanks!
Not ClearML employee (just a recent user), but maybe this will help? None
@<1523701205467926528:profile|AgitatedDove14> Any ideas on this issue? Thanks!
In the Task info tab there is no GPU and pytorch doesn't see the cuda device.
I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.
Failed to initialize NVML: Unknown Error
To make sure I understand, I need to setup a domain with a cert and it should work, no additional ClearML config is required?
OSX 12.5.1
Python 3.8.1.
Clearml 1.13.1
"clearml-data add --folder ./*" always flattens everything, I have that reproducible 100%.
model_checkpoint = ModelCheckpoint(
"checkpoint",
n_saved=2,
filename_prefix="best",
score_function=score_function,
score_name="accuracy",
global_step_transform=global_step_from_engine(trainer),
)
# Save the model after every epoch of val_evaluator is completed
val_evaluator.add_event_handler(
Events.COMPLETED, model_checkpoint, {"model": model}
)
I have a dataset of ~24GB and I've tried multiple times uploading it with the sync function.
- The cache doesn't work, it attempts to download the dataset every time.
- It "misses" some files somehow. So once the job runs it fails due to missing files.
- I've ran verify afterwards (from the machine I used to upload the data) and it says it's all good. However, once I inspect the zip files on the server (look for the files in the specific zip the state json says they're in) the files are indee...
Perfect, exactly what I needed, thanks!
Added -v /home/uname/.ssh:/root/.ssh and it resolved the issue. I assume this is some sort of a bug then?
Kind ping on this thread, thanks! 🙂
@<1714813627506102272:profile|CheekyDolphin49> You should probably use 'General/coupling' and 'General/rep'
Additional info:
-Public URL uses HTTPS, internal traffic doesn't.
-clearml.storage fails while trying to fetch None ...
Meaning it just replaced the internal IP with the URL at some point for some reason, it doesn't exist in that form anywhere in any configs (http and public URL).
I'll check the docker command next time this happens, thanks! For the machines, all of them have GPUs (and are in fact identical/cloned VMs) and if I rerun it and get the same exact machine again it works so it's some part of "GPU detection" or something, we'll know more hopefully once it happens again, thanks.
Yes SSH_AUTH_SOCK is defined on the host. Should I manually add SSH mounting then through "extra flags"?
Once I used clearml-data add --folder * API everything works correctly (though all files recursively ended up in the root, I had luck all were named differently).