I guess I could manually explore the different containers and their contents 🙂 as far as I remember, I had to update the Elastic records when we moved to the new cloud provider in order to update the model URLs
I'm not sure, since the names of these parameters don't match the boto3 names, and num_download_attempt is passed as container.config.retries here: https://github.com/allegroai/clearml/blob/3d3a835435cc2f01ff19fe0a58a8d7db10fd2de2/clearml/storage/helper.py#L1439
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems 🙂 the slack bot works though! 🙂
thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager
this is the artifactory; this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...
it also happens sometimes during a run, when tensorboard is trying to write something to disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen, since I'm running four separate workers
thanks! we copy S3 URLs quite often. I know that it's better to avoid double spaces in task names, but shit happens 🙂
I'm not sure it's related to the domain switch since we upgraded to the newest ClearML server version at the same time
I assume a temporary fix is to switch to trains-server?
new icons are slick, it would be even better if you could upload custom icons for the different projects
I updated the version in the Installed packages section before starting the experiment
btw, there are "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" in the apiserver logs again
running docker network prune before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
m5.xlarge EC2 instance (4 vCPUs, 16 GB RAM), 100GB disk
btw, are there any examples of exporting metrics using the Python client? I could only find the last_metrics attribute of the task
as a side note, I'm not able to pull the newest release; looks like it hasn't been pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
Error
Failed to get Scalar Charts
any suggestions on how to fix it?
nope, the only changes we made to the config are adding web-auth and the non-responsive tasks watchdog
just in case: this warning disappeared after I applied the fix from https://stackoverflow.com/questions/49638699/docker-compose-restart-connection-pool-full