
I just added the secrets/keys to docker-compose.yml and restarted everything, but nothing changed.
To make sure I understand: I need to set up a domain with a cert and it should work, with no additional ClearML config required?
@<1523701087100473344:profile|SuccessfulKoala55> Kind reminder again, thanks and sorry!
In the Task info tab there is no GPU and pytorch doesn't see the cuda device.
I'm not sure how to debug it; that would be my first question. So I should first check whether docker is executed with --gpus? I'll pay attention to this the next time it happens, thanks.
Yeah, I'm starting to lean towards the enterprise solution more and more 😁
Thanks!
"Executing: ['docker', 'run', '-t', '--gpus', '"device=0"'" - so the container is executed with --gpus.
However, torch.cuda.is_available() returns False.
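A minimal sketch of how this could be checked from inside the container, assuming Python is available there. The function name gpu_diagnostics is made up for illustration; it only collects what nvidia-smi and torch report and does not fix anything.

```python
import shutil
import subprocess


def gpu_diagnostics():
    """Collect basic facts about GPU visibility in the current environment."""
    report = {}

    # Is the NVIDIA command-line tool on PATH at all?
    smi = shutil.which("nvidia-smi")
    report["nvidia_smi_found"] = smi is not None

    if smi is not None:
        # `nvidia-smi -L` lists the visible GPUs. A non-zero return code or an
        # NVML error in the output points at the driver / container runtime
        # rather than at the training code.
        try:
            proc = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi_returncode"] = proc.returncode
            report["nvidia_smi_output"] = (proc.stdout + proc.stderr).strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi_error"] = str(exc)

    # torch is imported lazily so the script still runs where it is missing.
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
    else:
        report["torch_installed"] = True
        report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()

    return report


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If nvidia-smi itself fails while --gpus is present on the docker command line, the problem is likely below PyTorch; if nvidia-smi works but torch_cuda_build is None, the installed torch is probably a CPU-only build.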
from ignite.engine import Events
from ignite.handlers import ModelCheckpoint, global_step_from_engine

# Keep the 2 best checkpoints, ranked by score_function
model_checkpoint = ModelCheckpoint(
    "checkpoint",
    n_saved=2,
    filename_prefix="best",
    score_function=score_function,
    score_name="accuracy",
    global_step_transform=global_step_from_engine(trainer),
)
# Save the model after every epoch of val_evaluator is completed
val_evaluator.add_event_handler(
    Events.COMPLETED, model_checkpoint, {"model": model}
)
Probably not, I'm trying to access it via an external IP. Could you point me to instructions for that in the docs? I don't remember seeing it anywhere. Thanks!
Having a bit of trouble with this one (sorry for possibly dumb questions).
Are there any docs on how to add certs to the docker image? I see the page that letsencrypt points me to, but I'm not sure what the proper way is to do this on the webapp docker (I'd assume there's a non-hacky way to do it, as I guess others are using the same setup I'm trying to make work).
@<1523701087100473344:profile|SuccessfulKoala55> kind reminder not to miss this when you catch time, thanks!
clearml-1.13.1
from clearml import Task

Task.add_requirements("requirements.txt")
task = Task.init(project_name="My project", task_name="My task")
task.execute_remotely(queue_name="default")
...
Failed to initialize NVML: Unknown Error
Oh, I misunderstood the docs/examples then, sorry. I'm using pytorch-ignite.
Thanks for the tip!
It seems that task.set_base_docker must be called with docker_image as well, otherwise docker_arguments don't propagate. I'm not sure whether that's a bug, but I have a workaround now, thanks!
I've tried that one, but it behaves the same :/
@<1714813627506102272:profile|CheekyDolphin49> You should probably use 'General/coupling' and 'General/rep'
Neither, metric is a number you report through the Logger:
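A minimal sketch of what reporting such a number might look like. The title "accuracy" and series "validation" are made-up names, and report_metric / RecordingLogger are helpers invented here so the snippet can be run without a server; with ClearML, the object passed in would be the task's logger, whose report_scalar takes the same keyword arguments.

```python
def report_metric(logger, value, iteration, title="accuracy", series="validation"):
    """Report one number as a scalar; title and series identify the metric."""
    logger.report_scalar(
        title=title, series=series, value=float(value), iteration=int(iteration)
    )


class RecordingLogger:
    """Stand-in that records calls, used only for a dry run."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, title, series, value, iteration):
        self.calls.append((title, series, value, iteration))


if __name__ == "__main__":
    dry_run = RecordingLogger()
    for epoch, acc in enumerate([0.71, 0.78, 0.83], start=1):
        report_metric(dry_run, acc, epoch)
    print(dry_run.calls)
```

In a real run, Logger.current_logger() from clearml (or task.get_logger()) would replace RecordingLogger, and the same title/series pair is what you would then refer to as the metric.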
So after publishing a task (right click/Publish from the WebUI), one of the models had its id changed to __DELETED__4be00...
The other one (last_model on the screenshot below) is all good and didn't get deleted in this way.
"best_model" exists on the disk and I can access it by taking last_model's URL and just changing the file name, but I cannot normally access it via id (which has now changed to __DELETED__4be00...). Any ideas why this might have happened?
 doesn't miss it 🙂
I use Task.add_requirements("requirements.txt") right before the Task.init.
In main, I parse the command-line arguments, call add_requirements, initialize the Task and call execute_remotely. After that it's pretty much the usual workflow: initialize the model, set up the dataloaders and optimizer, and run the training. I'm using pytorch-ignite and have model checkpoint made on validation evaluator COMPL...
I know about clearml.conf but wanted to avoid ssh-ing through 50 instances to edit it.
task.set_base_docker does the job, but docker_arguments doesn't propagate if I leave docker_image as None (it just uses both image and arguments from clearml.conf of the agent). If I explicitly state docker_image and docker_arguments in task.set_base_docker it works fine.
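A minimal sketch of that workaround, assuming the behaviour described above. apply_docker_settings and FakeTask are names made up for illustration, and the image name in the dry run is a placeholder; the only ClearML call involved is task.set_base_docker with its docker_image and docker_arguments keywords.

```python
def apply_docker_settings(task, docker_image, docker_arguments):
    """Set the image and the arguments together on the task.

    Refuses an empty image, since in the behaviour described above the
    arguments were ignored whenever docker_image was left as None.
    """
    if not docker_image:
        raise ValueError("docker_image must be given explicitly")
    task.set_base_docker(
        docker_image=docker_image, docker_arguments=docker_arguments
    )


class FakeTask:
    """Stand-in that records the call, used only for a dry run."""

    def __init__(self):
        self.calls = []

    def set_base_docker(self, docker_image=None, docker_arguments=None):
        self.calls.append((docker_image, docker_arguments))


if __name__ == "__main__":
    dry_run = FakeTask()
    # Placeholder image and argument, for illustration only.
    apply_docker_settings(dry_run, "my-registry/my-image:latest", "--ipc=host")
    print(dry_run.calls)
```

With a real task, the object returned by Task.init would be passed instead of FakeTask, before calling execute_remotely, so nothing has to be edited in clearml.conf on the agents.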
Once I used clearml-data add --folder *, everything worked correctly (though all files, recursively, ended up in the root; luckily they all had different names).
OSX 12.5.1
Python 3.8.1
ClearML 1.13.1
"clearml-data add --folder ./*" always flattens everything, I have that reproducible 100%.
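One possible contributing factor (an assumption, not something confirmed here): the shell expands ./* before clearml-data ever sees it, so the tool receives each top-level entry as its own argument instead of the single parent folder. The sketch below only reproduces that shell-side expansion with a temporary directory and made-up file names; how clearml-data then maps those arguments to dataset paths is not shown or verified.

```python
import glob
import os
import tempfile


def shell_style_expansion(folder):
    """Return what a POSIX shell would pass for `<folder>/*` (sorted, no dotfiles)."""
    return sorted(glob.glob(os.path.join(folder, "*")))


def demo():
    """Build a small nested tree and show the arguments `./*` turns into."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "train", "cats"))
        os.makedirs(os.path.join(root, "val"))
        for rel in ("train/cats/a.jpg", "val/b.jpg", "readme.txt"):
            with open(os.path.join(root, *rel.split("/")), "w") as handle:
                handle.write("x")
        expanded = shell_style_expansion(root)
        # Report paths relative to the root for readability.
        return [os.path.relpath(path, root) for path in expanded]


if __name__ == "__main__":
    # The command receives three separate arguments, not one folder.
    print(demo())  # ['readme.txt', 'train', 'val']
```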