we're using the latest version of clearml, clearml agent and clearml server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess
I'm so happy to see that this problem has been finally solved!
perhaps I need to do task.set_initial_iteration(0)?
this would be great. I could just then pass it as a hyperparameter
does this mean that setting initial iteration to 0 should help?
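roughly what I have in mind, just a sketch (assuming the run is resumed with continue_last_task; the project/task names here are placeholders):

from clearml import Task

# resume the previous run instead of creating a new task
task = Task.init(
    project_name="my_project",
    task_name="my_experiment",
    continue_last_task=True,
)

# reset the reported-iteration offset so scalars start from 0
# instead of continuing from the last reported iteration
task.set_initial_iteration(0)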
nice! exactly what I need, thank you!
for me, increasing shm-size usually helps. what does this RC fix?
this is how it looks if I zoom in on the epochs that ran before the crash
thank you, I'll let you know if setting it to zero worked
same here, changing arguments in the Args section of Hyperparameters doesn't work, the training script starts with the default values.
trains 0.16.0
trains-agent 0.16.0
trains-server 0.16.0
I change the arguments in the Web UI, but it looks like they are not parsed by trains
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news
isn't this parameter related to communication with the ClearML Server? I'm trying to make sure that the checkpoint will be downloaded from AWS S3 even if there are temporary connection problems
there's the TransferConfig parameter in boto3 ( https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig ), but I'm not sure if there's an easy way to pass this parameter to StorageManager
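for reference, this is roughly how the retry count looks in plain boto3 (just a sketch, bucket/key/paths are made up; whether StorageManager accepts something equivalent is exactly what I'm asking):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# retry a flaky download up to 10 times before giving up
transfer_config = TransferConfig(num_download_attempts=10)
s3.download_file(
    "my-bucket",
    "models/checkpoint.pth",
    "/tmp/checkpoint.pth",
    Config=transfer_config,
)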
I use Docker for training, which means that the log_dir contents are removed for the continued experiment, btw
I guess I could manually explore different containers and their content. as far as I remember, I had to update Elastic records when we moved to the new cloud provider in order to update model URLs
I'm not sure, since the names of these parameters do not match the boto3 names, and num_download_attempt is passed as container.config.retries (see https://github.com/allegroai/clearml/blob/3d3a835435cc2f01ff19fe0a58a8d7db10fd2de2/clearml/storage/helper.py#L1439 )
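if I read that line correctly, config.retries is the client-level retry setting, which in plain boto3/botocore would look something like this (again just a guess/sketch, the values are made up), as opposed to TransferConfig's num_download_attempts:

import boto3
from botocore.config import Config

# client-level retries (what config.retries controls),
# separate from TransferConfig's num_download_attempts
client = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)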
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems. the slack bot works though!
thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager
this is the artifactory; this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...
it also happens sometimes during the run when tensorboard is trying to write something to the disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
thanks! we copy S3 URLs quite often. I know that it's better to avoid double spaces in task names, but shit happens