Eureka! This would be great. I could just then pass it as a hyperparameter
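roughly what I have in mind, just a sketch - the project/parameter names here are made up:
from clearml import Task

task = Task.init(project_name="my-project", task_name="resume-demo")  # hypothetical names

# hypothetical hyperparameter that holds the iteration offset, editable from the UI
params = {"initial_iteration": 0}
task.connect(params)

task.set_initial_iteration(params["initial_iteration"])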
still no luck, I tried everything =( any updates?
thank you, I'll let you know if setting it to zero worked
perhaps I need to do task.set_initial_iteration(0)?
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to values that are explicitly reported, no?
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
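for reference, the training side looks roughly like this (simplified sketch, the numbers and log dir are made up):
import math
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/resume-demo")  # hypothetical log dir
start_epoch, num_epochs = 10, 12  # pretend we resume from epoch 10
for epoch in range(start_epoch, num_epochs):
    train_loss = math.exp(-epoch)  # stand-in for the real training loss
    # the epoch number is passed explicitly as the global step
    writer.add_scalar("train/loss", train_loss, global_step=epoch)
writer.close()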
well okay, it's probably not that weird considering the worker just runs the container
m5.xlarge EC2 instance (4 vCPUs, 16 GB RAM), 100GB disk
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems. The Slack bot works, though!
python3 slack_alerts.py --channel trains-alerts --slack_api "OUR_KEY" --include_completed_experiments --include_manual_experiments
new icons are slick, it would be even better if you could upload custom icons for the different projects
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
perhaps it's happening because it's an old project that was moved to the new root project?
maybe I should use explicit reporting instead of TensorBoard
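i.e. something along these lines (just a sketch, the title/series names are made up):
from clearml import Task

task = Task.init(project_name="my-project", task_name="explicit-reporting-demo")  # hypothetical names
logger = task.get_logger()

for epoch in range(3):
    loss = 1.0 / (epoch + 1)  # stand-in for the real training loss
    # report the scalar directly with the epoch as the iteration, bypassing TensorBoard
    logger.report_scalar(title="train", series="loss", value=loss, iteration=epoch)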
btw, I use Docker for training, which means that the log_dir contents are removed for the continued experiment
does this mean that setting initial iteration to 0 should help?
there is no method for setting the last iteration, which is what's used for reporting when continuing the same task. maybe I could somehow change this value for the task?
okay, so if there's no workaround atm, should I create a GitHub issue?
self-hosted ClearML server 1.2.0
SDK version 1.1.6
I'm so happy to see that this problem has been finally solved!
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID
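for context, this is roughly what I do now (shell sketch; the --stop flag below is my assumption about the intended way, please correct me if that's wrong):
# what I do today: find the agent daemon's PID and kill it by hand
kill "$(pgrep -f 'clearml-agent daemon')"
# what I assume is the cleaner way:
# clearml-agent daemon --stop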