nope, old clenup task fails with trains_agent: ERROR: Could not find task id=e7725856e9a04271aab846d77d6f7d66 (for host: )
Exception: 'Tasks' object has no attribute 'id
weirdly enough, curl
http://apiserver:8008 from inside the container works
WARNING: You are using pip version 20.1.1; however, version 20.3.3 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
trains_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the TRAINS API server
http://apiserver:8008 ?
http://OUR_IP:8081 http://OUR_IP:8080
http://apiserver:8008
WARNING: You are using pip version 20.1.1; however, version 20.3.3 is available.
`...
we already have cleanup service set up and running, so we should be good from now on
what if cleanup service is launched using ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help
do you have any idea why cleanup task keeps failing then (it used to work before the update)
we're using os.getenv in the script to get a value for these secrets
nice, thanks! I'll check if it solves the issue first thing tomorrow in the morning
we're using the latest version of clearml, clearml agent and clearml server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess 😃
yeah, server (1.0.0) and client (1.0.1)
agent.hide_docker_command_env_vars.extra_keys: ["DB_PASSWORD=password"]
like this? or ["DB_PASSWORD", "password"]
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
oh wow, I didn't see delete_artifacts_and_models option
I guess we'll have to manually find old artifacts that are related to already deleted tasks
more like collapse/expand, I guess. or pipelines that you can compose after running experiments to see that experiments are connected to each other
that's right
for example, there are tasks A, B, C
we run multiple experiments for A, finetune some of them in separate tasks, then choose one or more best checkpoints, run some experiments for task B, choose the best experiment, and finally run task C
so we get a chain of tasks: A - A-ft - B- C
ClearML pipeline doesn't quite work here because we would like to analyze results of each step before starting next task
but it would be great to see predecessors of each experiment in the chain
nope, that's the point, quite often we run experiments separately, but they are related to each other. currently there's no way to see that one experiment is using checkpoint from the previous experiment since we need to manually insert S3 link as a hyperparameter. it would be useful to see these connections. maybe instead of grouping we could see which experiments are using artifacts of this experiment
tags are somewhat fine for this, I guess, but there will be too many of them eventually, and they do not reflect sequential nature of the experiments
we've already restarted everything, so I don't have any logs on hands right now. I'll let you know if we face any problems 😃 slack bot works though! 🎉
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning them? What is the recommended way of providing S3 credentials to cleanup task?
JIC - trains still works after that, it's just that the new user is not added and hence is not able to login