it might be that there is not enough space on our SSD, experiments cache a lot of preprocessed data during the first epoch...
nice! exactly what I need, thank you!
the code that is used for training the model is also inside the image
this is how it looks if I zoom in on the epochs that ran before the crash
I don't think so, because the max value of each metric is calculated independently of the other metrics
the weird part is that the old job continues running when I recreate the worker and enqueue the new job
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
this would be great. I could then just pass it as a hyperparameter
we're using the latest version of clearml, clearml agent and clearml server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess 😃
that's right
for example, there are tasks A, B, C
we run multiple experiments for A, finetune some of them in separate tasks, then choose one or more of the best checkpoints, run some experiments for task B, choose the best experiment, and finally run task C
so we get a chain of tasks: A - A-ft - B - C
a ClearML pipeline doesn't quite fit here because we would like to analyze the results of each step before starting the next task
but it would be great to see the predecessors of each experiment in the chain
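for now, we could probably record the link ourselves when we create each follow-up task, something like this (just a sketch; the parameter name and where the predecessor id comes from are made up):
from clearml import Task

# current experiment (e.g. a task-B run)
task = Task.current_task()

# predecessor_id would be whichever A-ft experiment the checkpoint was taken from (example value)
predecessor_id = "id-of-the-chosen-A-ft-task"

# make the link visible: set it as the parent and also store it as a hyperparameter
task.set_parent(predecessor_id)
task.set_parameters_as_dict({"General": {"predecessor_task_id": predecessor_id}})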
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
agent.hide_docker_command_env_vars.extra_keys: ["DB_PASSWORD=password"]
like this? or ["DB_PASSWORD", "password"]
it works, but it's not very helpful since everybody can see the secret in the logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'DB_PASSWORD=password']
we're using os.getenv in the script to get the values of these secrets
any suggestions on how to fix it?
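for reference, I would have expected that listing only the variable names works, something like this in the agent config (not sure this is the exact syntax):
agent {
    hide_docker_command_env_vars {
        enabled: true
        extra_keys: ["DB_PASSWORD"]
    }
}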
I guess this could overcomplicate the UI; I don't see a good solution yet.
as a quick hack, we can just use a separate name (e.g. "best_val_roc_auc") for all metric values of the current best checkpoint. then we can just add columns with the last value of this metric
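roughly like this (just a sketch, the variable values are placeholders):
from clearml import Logger

logger = Logger.current_logger()

# placeholder values for the metrics of the current best checkpoint
best_roc_auc, epoch = 0.93, 17

# re-report the best checkpoint's metric under a dedicated name whenever the best
# checkpoint changes, so the "last" value of this series always belongs to it
logger.report_scalar(title="best_val_roc_auc", series="val", value=best_roc_auc, iteration=epoch)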
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
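i.e. something along these lines (assuming agent.package_manager.extra_index_url is the right key for a find-links page like this):
agent {
    package_manager {
        extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
    }
}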
I now have another problem: my code looks for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably overwrites the folder when cloning the repo. is there any workaround?
from clearml import Task

# re-enqueue the task pointing at the new checkpoint, with the iteration offset reset
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict({
    "General": {
        "checkpoint_file": model.url,
        "restart_optimizer": False,
    }
})
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
maybe I should use explicit reporting instead of TensorBoard
the new icons are slick! it would be even better if you could upload custom icons for the different projects
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out-of-memory error, but still
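e.g. I could just start two agents pinned to the same card and let them share it (queue names here are made up):
clearml-agent daemon --detached --queue gpu0_shared_a --gpus 0 --docker
clearml-agent daemon --detached --queue gpu0_shared_b --gpus 0 --docker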
so the max values that I get can be reached at different epochs