thanks, this one worked after we changed the package version
okay, what do I do if it IS installed?
isn't this parameter related to communication with the ClearML Server? I'm trying to make sure that the checkpoint will be downloaded from AWS S3 even if there are temporary connection problems
there's a TransferConfig parameter in boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig), but I'm not sure if there's an easy way to pass this parameter to StorageManager
I'm not sure, since the names of these parameters don't match the boto3 names, and num_download_attempt is passed as container.config.retries here: https://github.com/allegroai/clearml/blob/3d3a835435cc2f01ff19fe0a58a8d7db10fd2de2/clearml/storage/helper.py#L1439
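for context, here's a minimal plain-boto3 sketch (bucket and key names are made up) of the two different retry knobs: the botocore Config retries that the helper sets, versus boto3's own TransferConfig num_download_attempts; I'm not saying StorageManager exposes the latter:
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# client/resource-level retries: this is the kind of knob the clearml helper
# sets via container.config.retries
s3 = boto3.resource(
    "s3", config=Config(retries={"max_attempts": 10, "mode": "standard"})
)

# transfer-level retries: boto3's own num_download_attempts for a single download
transfer_config = TransferConfig(num_download_attempts=10)

# made-up bucket/key, just to show where each config is applied
s3.Bucket("my-bucket").download_file(
    "checkpoints/model.ckpt", "/tmp/model.ckpt", Config=transfer_config
)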
it might be that there is not enough space on our SSD, experiments cache a lot of preprocessed data during the first epoch...
example of the failed experiment
I guess this could overcomplicate the UI; I don't see a good solution yet.
as a quick hack, we can just use separate names (e.g. "best_val_roc_auc") for all metric values of the current best checkpoint. then we can just add columns with the last value of each such metric
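something like this is what I had in mind; a rough sketch, the function and metric names are just illustrative:
from clearml import Task


def report_best_metric(val_roc_auc, best_so_far, epoch):
    # keep the running best and report it under its own metric name, so a
    # "last value" column in the experiments table always shows the score
    # of the best checkpoint
    best_so_far = max(best_so_far, val_roc_auc)
    Task.current_task().get_logger().report_scalar(
        title="best_val_roc_auc",
        series="best",
        value=best_so_far,
        iteration=epoch,
    )
    return best_so_far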
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
the weird part is that the old job continues running when I recreate the worker and enqueue the new job
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory error, but still
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID 😃
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
well okay, it's probably not that weird considering that the worker just runs the container
is it in the documentation somewhere?
no, I even added the argument to specify tensorboard log_dir to make sure this is not happening
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
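by the way, the same setting can also be applied from code; a small sketch (project and task names are placeholders), using the single-string form that takes the image followed by extra docker arguments:
from clearml import Task

task = Task.init(project_name="my-project", task_name="train")
# same "Base Docker image" value as in the UI: image plus extra docker run args
task.set_base_docker("my-docker-hub/my-repo:latest -v /data/project/data:/data")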
this is how it looks if I zoom in on the epochs that ran before the crash
from clearml import Task

# args.task_id and model.url come from the surrounding script.
# mark the finished task as started so it can be edited, point it at the
# latest checkpoint, reset the iteration counter, then mark it stopped
# and put it back into its original queue
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)