we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems. the slack bot works though!
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning them up? What is the recommended way of providing S3 credentials to the cleanup task?
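to be concrete about the credentials question, I mean something like the usual S3 block in clearml.conf (a sketch with placeholder values; I'm not sure this is the right place for the cleanup task):
sdk {
    aws {
        s3 {
            key: "my-access-key"      # placeholder
            secret: "my-secret-key"   # placeholder
            region: "us-east-1"
        }
    }
}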
JIC - trains still works after that, it's just that the new user is not added and hence is not able to log in
right now we can pass GitHub secrets to the clearml agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations
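something like extra env vars on the agent's training containers would work for us. a sketch of what I mean in clearml.conf (the variable name is made up):
agent {
    # made-up variable, just to illustrate passing a DB secret into every training container
    extra_docker_arguments: ["-e", "ANNOTATIONS_DB_PASSWORD=supersecret"]
}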
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
the code that is used for training the model is also inside the image
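if it helps, I believe the same setting can also be set from code, roughly like this (project/task names are placeholders):
from clearml import Task

task = Task.init(project_name="my_project", task_name="train")
# the same string as in the UI "Base Docker image" field: image plus extra docker arguments
task.set_base_docker("my-docker-hub/my-repo:latest -v /data/project/data:/data")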
thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager
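roughly what I'm planning to try (the bucket and path are just placeholders):
from clearml import StorageManager

# download (and cache) the heavy file from shared storage instead of baking it into the image
local_path = StorageManager.get_local_copy(remote_url="s3://my-bucket/extra_data/file.pkl")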
that was tough but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news!
no, I even added the argument to specify tensorboard log_dir to make sure this is not happening
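this is what I mean by specifying the log_dir, a sketch (the exact path is made up):
from torch.utils.tensorboard import SummaryWriter
from clearml import Task

task = Task.init(project_name="my_project", task_name="train")
# every run writes to its own directory, so two workers should never touch the same files
writer = SummaryWriter(log_dir=f"runs/{task.id}")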
parents and children. maybe tags, maybe a separate tab or section, idk. I wonder if anyone else is interested in this functionality; for us this is a very common case
Requirement already satisfied (use --upgrade to upgrade): celsusutils==0.0.1
it also happens sometimes during the run, when tensorboard is trying to write something to the disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
thanks, this one worked after we changed the package version
in order to use private repositories for our experiments I add agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from a non-existent repo, ClearML will print
fatal: repository 'https://username:token@github.com/our_organization/non_existing_repo.git/' not found
exposing the real token
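for reference, the relevant part of our clearml.conf looks roughly like this (values are placeholders):
agent {
    git_user: "username"
    # personal access token; this is the value that ends up in the printed git error
    git_pass: "token"
}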
this is the artifactory, this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
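from memory, the section I touched was something like this (not 100% sure about the exact key name):
agent {
    package_manager {
        # extra index / find-links URLs the agent uses when resolving packages
        extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
    }
}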
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably overwrites the folder when cloning the repo. is there any workaround?
it's a pretty standard pytorch train/eval loop, using pytorch dataloader and https://docs.monai.io/en/stable/_modules/monai/data/dataset.html
well okay, it's probably not that weird considering that the worker just runs the container
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory, but still
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
is it in the documentation somewhere?