
DilapidatedParrot58
42
Questions,
205
Answers
Active since 10 January 2023
Last activity
2 years ago
Reputation
0
Badges 1
186 × Eureka!hey guys, I'm experiencing seemingly random problems with the experiments. there are 4 GPUs and 8 workers (2 workers per GPU) , and sometimes experiments ran...
4 years ago
I updated trains-server today, and now it's very unstable, Web interface randomly stops working. anyone had the same problem? I've never had any problems wit...
4 years ago
Step 3 Task ( https://github.com/allegroai/trains/blob/master/examples/pipeline/step3_train_model.py ) - Loads the processed data (from Step 2) and clearml a...
4 years ago
I'm getting A LOT of errors when running cleanup service Failed deleting the following URIs - script fails to delete image and text files ERROR - Failed dele...
3 years ago
I'm probably stupid, but how do I specify worker name? usecase - I want to create two workers using the same GPU, and new worker just overwrites the old one
4 years ago
anyone having problems with ClearML slowing down pytorch experiments? auto_connect_framework={“pytorch”: False} helps, but it’s not a great solution. we thin...
3 years ago
when we train the models, we often choose checkpoint based on the validation accuracy, but test set accuracy (or specific class validation accuracy) is not n...
4 years ago
hey guys, is there a ready script that can delete all models from S3 (or other storage) that are related to deleted or archived experiments?
3 years ago
hey guys, a question about monthly worker_stats indices each of them takes up about 1gb for us. do we really need to keep all of them? is there any way to fr...
4 years ago
feature request: we have several servers with multiple GPUs, and atm we have to manually check which GPU has enough memory before queuing each experiment int...
3 years ago
hey guys the first time I'm seeing this behavior I'm adding a new user to /opt/trains/config/apiserver.conf and restarting the containers. all old users are ...
4 years ago
yo guys, I'm getting Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(, 'Connection to O...
4 years ago
Show more results
questions