I decided to restart the containers one more time, this is what I got.
I had to restart the Docker service to remove the containers
I'll get back to you with the logs when the problem occurs again
as a side note, I am not able to pull the newest release, looks like it hasn't been pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I assume the temporary fix is to switch to trains-server?
hmmm allegroai/trains:latest whatever it is
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
I guess I could manually explore different containers and their content. as far as I remember, I had to update Elastic records when we moved to the new cloud provider in order to update model URLs
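something along the lines of the sketch below is what I mean by updating Elastic records; the index pattern, the field name and the bucket paths are just placeholders, not what the ClearML server actually stores:
```python
# rough sketch only: rewrite stored URLs with an Elasticsearch update-by-query.
# "events-*", the "url" field and the bucket paths are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # ES endpoint exposed by the server stack (assumed)

es.update_by_query(
    index="events-*",  # placeholder index pattern
    body={
        "query": {"prefix": {"url": "s3://old-bucket/"}},
        "script": {
            "lang": "painless",
            "source": "ctx._source.url = ctx._source.url.replace(params.old, params.new)",
            "params": {"old": "s3://old-bucket/", "new": "s3://new-bucket/"},
        },
    },
)
```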
it also happens sometimes during the run when tensorboard is trying to write something to the disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
yeah, it works for new projects and for old projects that already had a description
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news
self-hosted ClearML server 1.2.0
SDK version 1.1.6
okay, so if there's no workaround atm, should I create a GitHub issue?
more like collapse/expand, I guess. or pipelines that you can compose after running experiments, to see that the experiments are connected to each other
I updated S3 credentials, I'll check if they work later
it doesn't explain the inability to delete logged images and texts though
sounds like overkill for this problem, but I don't see any other clean solution
we're using os.getenv in the script to read the values of these secrets
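roughly like this (the variable names are just placeholders for our actual secret names):
```python
# read the secrets from environment variables; the names below are placeholders
import os

access_key = os.getenv("MY_S3_ACCESS_KEY")
secret_key = os.getenv("MY_S3_SECRET_KEY")

if access_key is None or secret_key is None:
    raise RuntimeError("expected the S3 credentials to be set in the environment")
```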
what if the cleanup service is launched using the ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help