Tried it, but it didn't help. I suspect the issue is here: `'docker', 'run', '-t', '--gpus', '"device=0"', '-v', '/tmp/ssh-krPvUxRks5/agent.1949:/tmp/ssh-krPvUxRks5/agent.1949', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-krPvUxRks5/agent.1949'`
It passes the SSH agent socket instead of the .ssh directory (not sure why; an agent I have running on my own machine behaves differently). Do you happen to know how to fix this? Thanks!
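The argument list above can be inspected mechanically. A minimal sketch, using a hypothetical helper `ssh_forwarding` (not part of ClearML), that reports whether SSH access reaches the container through a forwarded `SSH_AUTH_SOCK` socket or through a mounted `.ssh` directory:

```python
from typing import Dict, List


def ssh_forwarding(docker_cmd: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: report how SSH credentials reach a docker container.

    Collects SSH_AUTH_SOCK values passed with -e, and -v mounts whose
    container-side path is a .ssh directory.
    """
    found: Dict[str, List[str]] = {"auth_sock": [], "ssh_dir_mounts": []}
    # Look at each flag together with the argument that follows it
    for flag, value in zip(docker_cmd, docker_cmd[1:]):
        if flag == "-e" and value.startswith("SSH_AUTH_SOCK="):
            found["auth_sock"].append(value)
        elif flag == "-v":
            parts = value.split(":")
            # host:container[:options]; a bare path is an anonymous volume
            container_path = parts[1] if len(parts) > 1 else parts[0]
            if container_path.rstrip("/").endswith("/.ssh"):
                found["ssh_dir_mounts"].append(value)
    return found
```

For the command quoted above, `auth_sock` holds the socket setting and `ssh_dir_mounts` is empty, i.e. no `.ssh` directory is mounted.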
I ran it before Task.init.
Having a bit of trouble with this one (sorry if these are dumb questions).
Are there any docs on how to add certs to the docker image? I see this ( None ), which is where Let's Encrypt points me, but I'm not sure what the proper way to do this is for the webapp docker. I'd assume there's a non-hacky way, since others are presumably running the same setup I'm trying to get working.
Ooooh, I didn't notice that field is editable. Thanks!
So I should call add_requirements before Task.init, and delete the list from the web UI when needed?
I've tried that one, but it behaves the same :/
One more related question (I hope there's a similar solution): when I log images, they appear in the UI with `http://<my-ip>`, so they are inaccessible (they should be translated to None). Is there a path_substitution variant for this scenario in the config? I can't seem to find it in the docs. Thanks!
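For reference, `sdk.storage.path_substitution` in clearml.conf is a list of `registered_prefix`/`local_prefix` pairs, and the rewrite it describes is a prefix swap. A standalone sketch of applying the same kind of pairs to a logged image URL, e.g. in a script that fetches debug images; `substitute_prefix` and the URLs are hypothetical, it is not ClearML code, and it does not change what the web UI displays:

```python
from typing import Iterable, Mapping


def substitute_prefix(url: str, substitutions: Iterable[Mapping[str, str]]) -> str:
    """Hypothetical helper: swap the first matching registered prefix for its replacement.

    Each rule uses the registered_prefix / local_prefix keys of
    sdk.storage.path_substitution entries; URLs that match no rule are returned unchanged.
    """
    for rule in substitutions:
        prefix = rule["registered_prefix"]
        if url.startswith(prefix):
            return rule["local_prefix"] + url[len(prefix):]
    return url
```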
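```python
from ignite.engine import Events
from ignite.handlers import ModelCheckpoint, global_step_from_engine

# `model`, `trainer` and `val_evaluator` are assumed to be defined elsewhere,
# with `val_evaluator` computing an "accuracy" metric.

def score_function(engine):
    # ModelCheckpoint keeps the n_saved checkpoints with the highest score
    return engine.state.metrics["accuracy"]

model_checkpoint = ModelCheckpoint(
    "checkpoint",
    n_saved=2,
    filename_prefix="best",
    score_function=score_function,
    score_name="accuracy",
    global_step_transform=global_step_from_engine(trainer),
)
# Save the model each time val_evaluator completes a run (once per validation pass)
val_evaluator.add_event_handler(
    Events.COMPLETED, model_checkpoint, {"model": model}
)
```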
Not a ClearML employee (just a recent user), but maybe this will help? None
Yes, SSH_AUTH_SOCK is defined on the host. Should I add the SSH mount manually through the "extra flags" then?
Neither; the metric is a number you report through the Logger.
I just added the secrets/keys to docker-compose.yml and restarted everything, but nothing changed.
- OSX 12.5.1
- Python 3.8.1
- ClearML 1.13.1

"clearml-data add --folder ./*" always flattens everything; I can reproduce that 100% of the time.
I'll try to reproduce it and get back to you. The HPO task (this task's parent) was indeed deleted, but that shouldn't matter, right? One of the models was deleted, but the other one wasn't.
@<1714813627506102272:profile|CheekyDolphin49> You should probably use 'General/coupling' and 'General/rep'
Single version. The issue seems to be in the creation step. If I use "clearml-data sync --folder .", it says it uploaded all the files, and running "clearml-data verify --folder ." says everything is fine. The metadata in the web UI reports the expected number of files. However, once I extract the zips (or download the dataset through the Python API or CLI), not all the files are there.
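To pin down exactly which files go missing, one option is to diff the relative paths of the source folder against the downloaded copy. A plain-Python sketch; both helper names are hypothetical:

```python
from pathlib import Path
from typing import Set


def relative_files(root: str) -> Set[str]:
    """All file paths under root (recursively), relative to root, in POSIX form."""
    base = Path(root)
    return {p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()}


def missing_files(source: str, downloaded: str) -> Set[str]:
    """Files present in the source folder but absent from the downloaded copy."""
    return relative_files(source) - relative_files(downloaded)
```

`downloaded` could be, for example, the folder returned by `Dataset.get(dataset_id=...).get_local_copy()`.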
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure so I'd have to write a scrip...
Probably not; I'm trying to access it via the external IP. Could you point me to instructions for that in the docs? I don't remember seeing them anywhere. Thanks!
@<1523701205467926528:profile|AgitatedDove14> Any ideas on this issue? Thanks!
So after publishing a task (right-click > Publish in the web UI), one of the models had its ID changed to `__DELETED__4be00...`
The other one (last_model in the screenshot below) is fine and didn't get deleted this way.
"best_model" exists on disk, and I can access it by taking last_model's URL and just changing the file name, but I can't access it normally via its ID (which has now changed to `__DELETED__4be00...`). Any ideas why this might have happened?
![image](https://clearml-web-assets....
Once I used the clearml-data add --folder * command, everything worked correctly (though all files recursively ended up in the root; luckily they all had different names).
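If the CLI keeps flattening, the Python `Dataset` API can add each top-level entry separately while keeping its folder name through the `dataset_path` argument of `add_files`. A sketch under that assumption; I haven't checked whether it also avoids the missing-files issue. `add_tree` is a hypothetical helper that receives the dataset object as a parameter, so the snippet itself doesn't need a server:

```python
from pathlib import Path


def add_tree(dataset, root: str) -> None:
    """Hypothetical helper: add each top-level entry of root to a ClearML Dataset.

    Sub-folders are added recursively with dataset_path set to their own name,
    so their contents stay under that folder inside the dataset; top-level
    files are added as-is.
    """
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            dataset.add_files(path=str(entry), dataset_path=entry.name, recursive=True)
        else:
            dataset.add_files(path=str(entry))
```

In practice: `ds = Dataset.create(dataset_name=..., dataset_project=...)`, then `add_tree(ds, "./data")`, `ds.upload()` and `ds.finalize()`; the folder name is a placeholder.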
In the Task info tab there is no GPU, and PyTorch doesn't see the CUDA device.
I'm not sure how to debug it; that would be my first question. So I should first check whether docker is executed with --gpus? I'll pay attention to this next time it happens, thanks.
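One way to check is to look at the docker command the agent ran, like the argument list quoted at the top of this thread. A small sketch with a hypothetical helper `gpus_flag` that extracts the `--gpus` value from such a list:

```python
from typing import List, Optional


def gpus_flag(docker_cmd: List[str]) -> Optional[str]:
    """Hypothetical helper: return the value given to --gpus, or None if absent."""
    for i, arg in enumerate(docker_cmd):
        # Separate form: --gpus "device=0"
        if arg == "--gpus" and i + 1 < len(docker_cmd):
            return docker_cmd[i + 1]
        # Joined form: --gpus=all
        if arg.startswith("--gpus="):
            return arg.split("=", 1)[1]
    return None
```

A `None` result most likely means the container was started without GPU access; if the flag is present, the next thing to check is inside the container itself (for example `torch.cuda.is_available()`).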
Adding -v /home/uname/.ssh:/root/.ssh resolved the issue. I assume this is some sort of bug, then?
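For reference, the same mount can be set once for the agent rather than per task, via `extra_docker_arguments` in the `agent` section of the agent machine's clearml.conf. A sketch that builds that line; both helper names are hypothetical, and the home directory is the one from the post:

```python
import json
from pathlib import Path
from typing import List


def ssh_mount_args(home: str, container_home: str = "/root") -> List[str]:
    """Hypothetical helper: docker -v arguments mounting the host's ~/.ssh into the container."""
    host_ssh = Path(home) / ".ssh"
    return ["-v", f"{host_ssh.as_posix()}:{container_home}/.ssh"]


def extra_docker_arguments_line(args: List[str]) -> str:
    """Hypothetical helper: format args as an extra_docker_arguments entry (JSON list syntax)."""
    return f"extra_docker_arguments: {json.dumps(args)}"
```

`extra_docker_arguments_line(ssh_mount_args("/home/uname"))` gives `extra_docker_arguments: ["-v", "/home/uname/.ssh:/root/.ssh"]`.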
I worked around it by setting api.files_server for the agent to the public URL, but ideally I'd avoid going through the reverse proxy if there's some path_substitution equivalent for this. Thanks
Doesn't work unfortunately 😕 Thanks either way!
Additional info:
- The public URL uses HTTPS; internal traffic doesn't.
- clearml.storage fails while trying to fetch None ...

In other words, at some point it replaced the internal IP with the public URL for some reason, and that URL (http plus the public URL) doesn't exist in that form anywhere in my configs.
Perfect, exactly what I needed, thanks!
Oh, I misunderstood the docs/examples then, sorry. I'm using pytorch-ignite.
Thanks for the tip!
@<1523701087100473344:profile|SuccessfulKoala55> Kind reminder again, thanks and sorry!