if you click on the experiment name here, you get a 404 because the link looks like this:
https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID
when it should look like this:
https://DOMAIN/projects/PROJECT_ID/experiments/EXPERIMENT_ID
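to make the difference between the two links concrete, here is a minimal sketch that rewrites the broken shape into the expected one. The helper name fix_experiment_link is hypothetical and is not part of the web UI; it only assumes the two URL shapes quoted above.

```python
from urllib.parse import urlsplit, urlunsplit


def fix_experiment_link(url: str) -> str:
    """Insert the missing 'experiments' path segment into a broken link.

    Hypothetical helper: it only handles the two URL shapes quoted above.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Broken shape: projects/<project_id>/<experiment_id> (three segments).
    if len(segments) == 3 and segments[0] == "projects":
        segments.insert(2, "experiments")
    return urlunsplit(parts._replace(path="/" + "/".join(segments)))


broken = "https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID"
print(fix_experiment_link(broken))
```

a link that already contains the experiments segment has four path segments and is returned unchanged.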
m5.xlarge EC2 instance (4 vCPUs, 16 GB RAM), 100GB disk
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
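a rough sketch of why the two approaches can clash, assuming the initial iteration acts as an offset that is added to every step the training code reports (my understanding of set_initial_iteration, not something verified here). The function name reported_iteration is made up for the illustration.

```python
def reported_iteration(initial_iteration: int, global_step: int) -> int:
    """Step that ends up on the chart, ASSUMING the offset is simply added."""
    return initial_iteration + global_step


last_iteration = 10   # where the previous run stopped (example value)
epoch = 11            # absolute epoch passed as global_step after resuming

# Offset set to the last iteration AND absolute global_step: counted twice.
print(reported_iteration(last_iteration, epoch))

# Offset left at zero with an absolute global_step: the step stays as passed.
print(reported_iteration(0, epoch))
```

under that assumption, a script that already passes an absolute global_step would want the offset at zero.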
do you have any idea why cleanup task keeps failing then (it used to work before the update)
sorry, my bad, after some manipulations I made it work. I have to manually change HTTP to HTTPS in config file for Web and Files (not API) server after initialization, but besides that it works
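the manual edit could be scripted roughly like this. It is a sketch under assumptions: the key names web_server, files_server and api_server follow the api section of clearml.conf, while the host name and ports in the sample are placeholders and not taken from my setup.

```python
import re

# Placeholder config text; host and ports are illustrative only.
SAMPLE = """\
api {
    web_server: http://clearml.example.com:8080
    api_server: http://clearml.example.com:8008
    files_server: http://clearml.example.com:8081
}
"""


def force_https(conf_text: str, keys=("web_server", "files_server")) -> str:
    """Switch http:// to https:// only on lines that set the given keys."""
    key_group = "|".join(re.escape(key) for key in keys)
    pattern = re.compile(
        r'^(\s*(?:%s)\s*[:=]\s*"?)http://' % key_group, re.MULTILINE
    )
    return pattern.sub(r"\1https://", conf_text)


print(force_https(SAMPLE))
```

the api_server line is left alone because its key is not in the keys tuple.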
nice, thanks! I'll check if it solves the issue first thing tomorrow in the morning
nope, old cleanup task fails with trains_agent: ERROR: Could not find task id=e7725856e9a04271aab846d77d6f7d66 (for host: ) Exception: 'Tasks' object has no attribute 'id'
weirdly enough, curl http://apiserver:8008 from inside the container works
copy-pasting entire training command into command line 😃
I updated S3 credentials, I'll check if they work later
it doesn't explain inability to delete logged images and texts though
more like collapse/expand, I guess. or pipelines that you can compose after running experiments to see that experiments are connected to each other
parents and children. maybe tags, maybe separate tab or section, idk. I wonder if anyone else is interested in this functionality, for us this is a very common case
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
Error
Failed to get Scalar Charts
what if cleanup service is launched using ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help
in order to use private repositories for our experiments I add agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from non-existing repo, ClearML will print
fatal: repository 'https://username:token@github.com/our_organization/non_existing_repo.git/' not found
exposing the real token
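one possible way to avoid this is to mask the credentials before the message is printed. The sketch below is not how the agent handles it; mask_credentials is a hypothetical helper that strips the user:password part from any URL in a log line.

```python
import re

# scheme://<anything without '/', whitespace or '@'>@  -> the credentials part
_CREDENTIALS = re.compile(r"(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*://)[^/\s@]+@")


def mask_credentials(message: str) -> str:
    """Replace 'user:token@' inside URLs with '***@' (hypothetical helper)."""
    return _CREDENTIALS.sub(r"\g<scheme>***@", message)


line = (
    "fatal: repository "
    "'https://username:token@github.com/our_organization/non_existing_repo.git/'"
    " not found"
)
print(mask_credentials(line))
```

URLs without embedded credentials do not match the pattern and pass through unchanged.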
it prints an empty dict
I’m doing Task.init() in the script, maybe it somehow resets connected parameters… but it used to work before, weird
it also happens sometimes during the run when tensorboard is trying to write smth to the disk and there are multiple experiments running. so it must be smth similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
we already have cleanup service set up and running, so we should be good from now on
it’s a pretty standard pytorch train/eval loop, using pytorch dataloader and https://docs.monai.io/en/stable/_modules/monai/data/dataset.html
it might be that there is not enough space on our SSD, experiments cache a lot of preprocessed data during the first epoch...
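a quick way to test that theory would be to check free space on the cache disk before the first epoch starts. This is a minimal sketch using the standard library; the path and the 50 GB threshold are made-up example values.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Example values only: check the current directory against a 50 GB budget.
if not has_free_space(".", 50):
    print("warning: cache disk may fill up during the first epoch")
```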