we're using the latest version of clearml, clearml agent and clearml server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID
tags are somewhat fine for this, I guess, but there will be too many of them eventually, and they do not reflect the sequential nature of the experiments
that's right
for example, there are tasks A, B, C
we run multiple experiments for A, finetune some of them in separate tasks, then choose one or more best checkpoints, run some experiments for task B, choose the best experiment, and finally run task C
so we get a chain of tasks: A - A-ft - B - C
ClearML pipeline doesn't quite work here because we would like to analyze the results of each step before starting the next task
but it would be great to see predecessors of each experiment in the chain
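in the meantime, this is roughly the workaround I have in mind - a minimal sketch using plain clearml SDK calls, with project/task names and the task ID as placeholders:
```python
from clearml import Task

# Sketch (not an official lineage feature): record the predecessor of each
# experiment explicitly so the chain A -> A-ft -> B -> C can be traced later.
# Project/task names and the task ID below are placeholders.
task = Task.init(project_name="our-project", task_name="B/best-candidate")

predecessor_id = "<task id of the chosen A-ft experiment>"
task.set_parent(predecessor_id)                                    # shows up as the task's parent
task.set_parameter("General/predecessor_task_id", predecessor_id)  # easy to filter/search on
```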
the weird part is that the old job continues running when I recreate the worker and enqueue the new job
isn't this parameter related to communication with ClearML Server? I'm trying to make sure the checkpoint will be downloaded from AWS S3 even if there are temporary connection problems
there's a TransferConfig parameter in boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig), but I'm not sure if there's an easy way to pass it to StorageManager
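for reference, the direct-boto3 fallback I was considering, assuming we bypass StorageManager for that one download - bucket, key and local path are placeholders:
```python
import boto3
from boto3.s3.transfer import TransferConfig

# Fallback sketch: download the checkpoint with boto3 directly so the retry
# behaviour is under our control. Bucket, key and local path are placeholders.
s3 = boto3.client("s3")
transfer_config = TransferConfig(num_download_attempts=10)  # boto3 default is 5
s3.download_file(
    "our-bucket",
    "experiments/checkpoints/model_best.ckpt",
    "/tmp/model_best.ckpt",
    Config=transfer_config,
)
```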
we already have cleanup service set up and running, so we should be good from now on
wow, thanks, just updated our server!
can't seem to find these metrics snapshot plots =) how do I plot one?
on a side note, is there any way to automatically give more meaningful names to the running docker containers?
WARNING: You are using pip version 20.1.1; however, version 20.3.3 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
trains_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the TRAINS API server http://apiserver:8008 ?
http://OUR_IP:8081
http://OUR_IP:8080
http://apiserver:8008
WARNING: You are using pip version 20.1.1; however, version 20.3.3 is available.
...
we're using os.getenv in the script to get a value for these secrets
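roughly like this - the variable name is illustrative:
```python
import os

# What the script does today, roughly: secrets are read from the environment
# at runtime (the variable name is illustrative) and we fail fast if missing.
aws_secret = os.getenv("OUR_AWS_SECRET_KEY")
if aws_secret is None:
    raise RuntimeError("OUR_AWS_SECRET_KEY is not set in the environment")
```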
ValueError: Task has no hyperparams section defined
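for context, this is roughly how we register hyperparameters - my understanding is that calling task.connect() is what creates that section (names and values below are made up):
```python
from clearml import Task

# Illustrative snippet: registering a parameter dict creates a hyperparameters
# section (by default "General") on the task. Names and values are made up.
task = Task.init(project_name="our-project", task_name="hyperparam-demo")
params = {"learning_rate": 1e-3, "batch_size": 32, "epochs": 20}
task.connect(params)
```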
our GPUs have 48 GB each, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out-of-memory error, but still
what if the cleanup service is launched using the ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help
this definitely would be a nice addition. the number of hyperparameters in our models often goes up to 100
I'm not sure, since the names of these parameters don't match the boto3 names, and num_download_attempt is passed as container.config.retries here: https://github.com/allegroai/clearml/blob/3d3a835435cc2f01ff19fe0a58a8d7db10fd2de2/clearml/storage/helper.py#L1439
okay, so if there's no workaround atm, should I create a GitHub issue?
copy-pasting the entire training command into the command line
I'll get back to you with the logs when the problem occurs again
yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
yes, this is the use case, I think we can use something like Redis for this communication
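something along these lines, assuming a plain redis-py client - the host and key names are placeholders:
```python
import redis

# Sketch of the inter-task handshake via Redis (host and key are placeholders).
# The producing task publishes the chosen checkpoint's task ID, the next task reads it.
r = redis.Redis(host="our-redis-host", port=6379, decode_responses=True)

# producer side (end of task A / A-ft):
r.set("chain:best_checkpoint_task_id", "a1b2c3d4")

# consumer side (start of task B):
predecessor_id = r.get("chain:best_checkpoint_task_id")
```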
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones