well okay, it's probably not that weird considering that worker just runs the container
I don't connect anything explicitly, I'm using argparse, it used to work before the update
yes. we upload artifacts to Yandex.Cloud S3 using ClearML. we set "s3://storage.yandexcloud.net/clearml-models" as the output uri parameter and add this section to the config:
{
    host: "http://storage.yandexcloud.net"
    key: "KEY"
    secret: "SECRET_KEY"
    secure: true
}
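for context, in a typical ClearML/Trains setup a per-host credentials section like the one above lives under sdk.aws.s3.credentials in the config file. a hedged sketch of how our snippet would fit there (bucket endpoint and keys are placeholders, not our real values):

```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    host: "storage.yandexcloud.net"
                    key: "KEY"
                    secret: "SECRET_KEY"
                    secure: true
                }
            ]
        }
    }
}
```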
this works like a charm. but download button in UI is not working
I guess this could overcomplicate the UI, I don't see a good solution yet.
as a quick hack, we can just use a separate name (e.g. "best_val_roc_auc") for the metric values of the current best checkpoint. then we can just add columns showing the last value of that metric
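a minimal sketch of that hack. in real code `report` would be ClearML's `Logger.report_scalar(title, series, value, iteration)`; here it's a stub recorder so the logic is self-contained, and the metric values are made up:

```python
# Stub recorder standing in for ClearML's Logger.report_scalar.
reported = []

def report(title, series, value, iteration):
    reported.append((title, series, value, iteration))

def log_epoch(epoch, val_roc_auc, state):
    # Always log the raw per-epoch metric.
    report("metrics", "val_roc_auc", val_roc_auc, epoch)
    # Additionally re-log it under a dedicated series name whenever a new
    # best checkpoint appears, so a "last value" column in the UI always
    # shows the best checkpoint's metric.
    if val_roc_auc > state.get("best", float("-inf")):
        state["best"] = val_roc_auc
        report("metrics", "best_val_roc_auc", val_roc_auc, epoch)

state = {}
for epoch, auc in enumerate([0.71, 0.78, 0.75, 0.81]):
    log_epoch(epoch, auc, state)
```

the "last value" of `best_val_roc_auc` then tracks the best checkpoint even after worse epochs, which is exactly what a sortable column needs.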
after the very first click, there is a popup with credentials request. nothing happens after that
example of the failed experiment
hard to say, maybe just "related experiments" in the experiment info would be enough. I'll think about it
I assume a temporary fix is to switch to trains-server?
hmmm allegroai/trains:latest whatever it is
I decided to restart the containers one more time, this is what I got.
I had to restart Docker service to remove the containers
as a sidenote, I am not able to pull the newest release, looks like it's not pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I've already pulled the new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!
I'll get back to you with the logs when the problem occurs again
it might be that there is not enough space on our SSD, experiments cache a lot of preprocessed data during the first epoch…
I change the arguments in Web UI, but it looks like they are not parsed by trains
it prints an empty dict
I'm doing Task.init() in the script, maybe it somehow resets connected parameters… but it used to work before, weird
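for reference, what usually bites here is call order: Trains/ClearML patches argparse when Task.init() runs, so parse_args() should happen after Task.init() for Web UI overrides to be picked up. a minimal stdlib-only sketch of the intended structure (the Task.init call is left as a comment so the snippet runs without a server; the argument names are illustrative, not from our actual script):

```python
import argparse

# from trains import Task
# task = Task.init(project_name="demo", task_name="args-demo")  # must run BEFORE parse_args()

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--epochs", type=int, default=10)

# With Task.init() already done, the patched parse_args() reports these
# values to the server and applies any edits made in the Web UI.
args = parser.parse_args(["--lr", "0.01"])
print(vars(args))
```

if parse_args() runs before Task.init(), the task can end up with an empty parameter dict, which matches the symptom above.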
nope, that's the point: quite often we run experiments separately, but they are related to each other. currently there's no way to see that one experiment uses a checkpoint from a previous experiment, since we have to manually insert the S3 link as a hyperparameter. it would be useful to see these connections. maybe instead of grouping we could see which experiments are using artifacts of this experiment
don't know if it's relevant, but I also added a new user to apiserver.conf today
I've done it many times, using different devices. sometimes it works, sometimes it doesn't