yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
yeah, server (1.0.0) and client (1.0.1)
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning those up? And what is the recommended way of providing S3 credentials to the cleanup task?
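not sure if this is the recommended way, but one fallback I'm assuming would work: since the storage side uses boto3 under the hood, the standard AWS environment variables should get picked up by the cleanup script (placeholder values below, and the pickup behaviour with a non-empty clearml.conf is an assumption on my part):

import os

# assumption: boto3's default credential chain (env vars / IAM role) is used
# when no S3 keys are set in clearml.conf; values below are placeholders
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# ... then run the cleanup logic; any S3 access that goes through boto3 will see these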
isn't this parameter related to communication with the ClearML Server? I'm trying to make sure that the checkpoint will be downloaded from AWS S3 even if there are temporary connection problems
there's a https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig parameter in boto3, but I'm not sure if there's an easy way to pass this parameter to StorageManager
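for reference, this is roughly what that knob looks like in plain boto3 (bucket/key/paths below are just placeholders) - the question is how to pass the equivalent through StorageManager:

import boto3
from boto3.s3.transfer import TransferConfig

# num_download_attempts controls how many times the managed transfer
# retries a failed download (placeholder bucket/key/paths)
transfer_config = TransferConfig(num_download_attempts=10)

s3 = boto3.client("s3")
s3.download_file(
    Bucket="my-bucket",
    Key="checkpoints/model.ckpt",
    Filename="/tmp/model.ckpt",
    Config=transfer_config,
)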
[2020-06-09 16:03:19,851] [8] [ERROR] [trains.mongo.initialize] Failed creating fixed user John Doe: 'key'
not necessarily, the command usually stays the same irrespective of the machine
Just in case - trains still works after that; it's just that the new user is not added and hence is not able to log in
original task name contains a double space -> saved checkpoint path also contains a double space -> the MODEL URL field in the model description of this checkpoint in ClearML converts the double space into a single space, so when you copy & paste it somewhere, it'll be incorrect
yeah, I was thinking mainly about AWS. we use force to make sure we are using the correct latest checkpoint, but this increases costs when we are running a lot of experiments
it works, but it's not very helpful since everybody can see the secret in the logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'DB_PASSWORD=password']
I'm not sure, since the names of these parameters don't match the boto3 names, and num_download_attempt is passed (https://github.com/allegroai/clearml/blob/3d3a835435cc2f01ff19fe0a58a8d7db10fd2de2/clearml/storage/helper.py#L1439) as container.config.retries
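if I read that line right, container.config.retries corresponds to the botocore client-level retry setting, which is a different knob from TransferConfig.num_download_attempts - in plain boto3 the client-level one looks roughly like this (just an illustration of the two knobs, not what clearml actually does internally):

import boto3
from botocore.config import Config

# client-level retries: applies to individual API calls made by the client,
# not to the managed-transfer retry loop that TransferConfig controls
client_config = Config(retries={"max_attempts": 10})
s3 = boto3.client("s3", config=client_config)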
maybe the db somehow got corrupted or something like that? I'm clueless
that was tough, but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem I still encounter is that sometimes there are random errors at the beginning of runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
{
  username: "username"
  password: "password"
  name: "John Doe"
},