More like collapse/expand, I guess. Or pipelines that you can compose after running experiments, to see how the experiments are connected to each other.
I'll get back to you with the logs when the problem occurs again
Hmm, allegroai/trains:latest, whatever that is.
I decided to restart the containers one more time; this is what I got.
I had to restart the Docker service to remove the containers.
I'm not sure it's related to the domain switch since we upgraded to the newest ClearML server version at the same time
If you click on the experiment name here, you get a 404 because the link looks like this:
https://DOMAIN/projects/PROJECT_ID/EXPERIMENT_ID
when it should look like this:
https://DOMAIN/projects/PROJECT_ID/experiments/EXPERIMENT_ID
Sorry, my bad, after some fiddling I got it to work. I have to manually change HTTP to HTTPS in the config file for the Web and Files servers (not the API server) after initialization, but besides that it works.
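For reference, the relevant section ends up looking roughly like this (a sketch only, assuming this refers to the clearml.conf generated by clearml-init and the default app/api/files subdomain layout; the domain names are placeholders):
```
api {
    # Web and Files servers switched to HTTPS manually after initialization
    web_server: https://app.DOMAIN
    files_server: https://files.DOMAIN
    # API server left on HTTP in this setup
    api_server: http://api.DOMAIN
}
```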
Yeah, it works for new projects and for old projects that already had a description.
I updated S3 credentials, I'll check if they work later
It doesn't explain the inability to delete logged images and text, though.
self-hosted ClearML server 1.2.0
SDK version 1.1.6
Requirement already satisfied (use --upgrade to upgrade): celsusutils==0.0.1
thanks, this one worked after we changed the package version
okay, what do I do if it IS installed?
Isn't this parameter related to communication with the ClearML Server? I'm trying to make sure that the checkpoint will be downloaded from AWS S3 even if there are temporary connection problems.
There's the https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig parameter in boto3, but I'm not sure there's an easy way to pass it to StorageManager.
I'm not sure, since the names of these parameters don't match the boto3 names, and num_download_attempt is passed as container.config.retries here: https://github.com/allegroai/clearml/blob/3d3a835435cc2f01ff19fe0a58a8d7db10fd2de2/clearml/storage/helper.py#L1439
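For illustration, a direct boto3 download where the retry behaviour is explicit would look roughly like this (a sketch only; the bucket, key, and local path are placeholders, and this bypasses StorageManager entirely):
```python
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# Client-level retries for the API calls, plus TransferConfig's
# num_download_attempts (the parameter discussed above) for the transfer itself
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))
transfer_config = TransferConfig(num_download_attempts=10)

# Placeholder bucket/key/destination
s3.download_file("my-bucket", "models/checkpoint.pt", "/tmp/checkpoint.pt", Config=transfer_config)
```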
It's a pretty standard PyTorch train/eval loop, using the PyTorch DataLoader and https://docs.monai.io/en/stable/_modules/monai/data/dataset.html
We're using the latest ClearML server and client versions (1.2.0).
It might be that there isn't enough space on our SSD; the experiments cache a lot of preprocessed data during the first epoch...
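For context, the caching setup looks roughly like this (a sketch only, assuming the disk-backed PersistentDataset from that module; the file list and cache_dir path are placeholders):
```python
from monai.data import PersistentDataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd

# Placeholder file list; in practice this comes from the experiment's data split
data = [{"image": "case_001.nii.gz"}, {"image": "case_002.nii.gz"}]
transforms = Compose([LoadImaged(keys="image"), EnsureChannelFirstd(keys="image")])

# Preprocessed samples are written to cache_dir during the first epoch and
# re-read afterwards, so cache_dir should point at a volume with enough free space
dataset = PersistentDataset(data=data, transform=transforms, cache_dir="/data/monai_cache")
loader = DataLoader(dataset, batch_size=2, num_workers=4)
```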
example of the failed experiment