do you have any idea why cleanup task keeps failing then (it used to work before the update)
nope, the old cleanup task fails with trains_agent: ERROR: Could not find task id=e7725856e9a04271aab846d77d6f7d66 (for host: )
Exception: 'Tasks' object has no attribute 'id'
weirdly enough, curl http://apiserver:8008 from inside the container works
well okay, it's probably not that weird considering that worker just runs the container
tags are somewhat fine for this, I guess, but there will be too many of them eventually, and they don't reflect the sequential nature of the experiments
nope, that's the point: quite often we run experiments separately, but they are related to each other. currently there's no way to see that one experiment is using a checkpoint from a previous experiment, since we have to manually insert the S3 link as a hyperparameter. it would be useful to see these connections. maybe instead of grouping we could see which experiments are using the artifacts of this experiment
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
parents and children. maybe tags, maybe separate tab or section, idk. I wonder if anyone else is interested in this functionality, for us this is a very common case
we're using os.getenv in the script to get a value for these secrets
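for context, it's basically this pattern (variable names here are made up, ours are project-specific):

```python
import os

def get_secret(name: str, default=None):
    # Look up a secret from the environment; fail loudly if it's missing
    # and no fallback was provided.
    value = os.getenv(name, default)
    if value is None:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# Dummy value so the snippet runs standalone; in practice the agent/container
# is expected to have the real variable set.
os.environ.setdefault("MY_API_TOKEN", "dummy-token")
token = get_secret("MY_API_TOKEN")
```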
we do log a lot of different metrics, maybe that's part of the problem
some of the "tasks.get_all_ex" POST requests fail, as far as I can see
it’s a pretty standard pytorch train/eval loop, using pytorch dataloader and https://docs.monai.io/en/stable/_modules/monai/data/dataset.html
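roughly this shape (minimal sketch, not our actual code: the MONAI dataset is swapped for a TensorDataset with dummy shapes so it's self-contained):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the MONAI dataset; shapes are hypothetical.
xs = torch.randn(32, 8)
ys = torch.randn(32, 1)
loader = DataLoader(TensorDataset(xs, ys), batch_size=8, shuffle=True)

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Train phase
model.train()
for xb, yb in loader:
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()

# Eval phase
model.eval()
with torch.no_grad():
    eval_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in loader) / len(loader)
```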