thanks, this one worked after we changed the package version
okay, what do I do if it IS installed?
isn't this parameter related to communication with the ClearML Server? I'm trying to make sure that the checkpoint will be downloaded from AWS S3 even if there are temporary connection problems
there's a TransferConfig parameter in boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig), but I'm not sure if there's an easy way to pass this parameter to StorageManager
I'm not sure, since the names of these parameters don't match the boto3 names, and num_download_attempt is passed as container.config.retries here: https://github.com/allegroai/clearml/blob/3d3a835435cc2f01ff19fe0a58a8d7db10fd2de2/clearml/storage/helper.py#L1439
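for context, here's a minimal plain-boto3 sketch (bucket and key names are made up) of the two different retry knobs: the botocore Config retries that the helper sets, versus boto3's own TransferConfig num_download_attempts; I'm not saying StorageManager exposes the latter:
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# client/resource-level retries: this is the kind of knob the clearml helper
# sets via container.config.retries
s3 = boto3.resource(
    "s3", config=Config(retries={"max_attempts": 10, "mode": "standard"})
)

# transfer-level retries: boto3's own num_download_attempts for a single download
transfer_config = TransferConfig(num_download_attempts=10)

# made-up bucket/key, just to show where each config is applied
s3.Bucket("my-bucket").download_file(
    "checkpoints/model.ckpt", "/tmp/model.ckpt", Config=transfer_config
)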
it might be that there is not enough space on our SSD, experiments cache a lot of preprocessed data during the first epoch...
example of the failed experiment
I guess this could overcomplicate the UI; I don't see a good solution yet.
as a quick hack, we can just use separate names (e.g. "best_val_roc_auc") for all metric values of the current best checkpoint. then we can just add columns with the last value of each such metric
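something like this is what I had in mind; a rough sketch, the function and metric names are just illustrative:
from clearml import Task


def report_best_metric(val_roc_auc, best_so_far, epoch):
    # keep the running best and report it under its own metric name, so a
    # "last value" column in the experiments table always shows the score
    # of the best checkpoint
    best_so_far = max(best_so_far, val_roc_auc)
    Task.current_task().get_logger().report_scalar(
        title="best_val_roc_auc",
        series="best",
        value=best_so_far,
        iteration=epoch,
    )
    return best_so_far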
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
the weird part is that the old job continues running when I recreate the worker and enqueue the new job
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory error, but still
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID 😃
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
well okay, it's probably not that weird considering that the worker just runs the container
is it in the documentation somewhere?
no, I even added the argument to specify tensorboard log_dir to make sure this is not happening
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
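by the way, the same setting can also be applied from code; a small sketch (project and task names are placeholders), using the single-string form that takes the image followed by extra docker arguments:
from clearml import Task

task = Task.init(project_name="my-project", task_name="train")
# same "Base Docker image" value as in the UI: image plus extra docker run args
task.set_base_docker("my-docker-hub/my-repo:latest -v /data/project/data:/data")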
this is how it looks if I zoom in on the epochs that ran before the crash
from clearml import Task

# args.task_id and model.url come from the surrounding script.
# mark the finished task as started so it can be edited, point it at the
# latest checkpoint, reset the iteration counter, then mark it stopped
# and put it back into its original queue
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)