example of the failed experiment
okay, so if there's no workaround atm, should I create a GitHub issue?
python3 slack_alerts.py --channel trains-alerts --slack_api "OUR_KEY" --include_completed_experiments --include_manual_experiments
maybe I should use explicit reporting instead of TensorBoard
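(for reference, a minimal sketch of what explicit reporting would look like; the project/task names and the metric are placeholders, not from the thread:)
from clearml import Task

task = Task.init(project_name="examples", task_name="explicit-reporting")  # placeholder names
logger = task.get_logger()
for step in range(10):
    # report scalars directly instead of relying on TensorBoard auto-logging
    logger.report_scalar(title="loss", series="train", value=1.0 / (step + 1), iteration=step)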
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
do you have any idea why the cleanup task keeps failing then? (it used to work before the update)
in order to use private repositories for our experiments I add agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from a non-existent repo, ClearML will print
fatal: repository ' https://username:token@github.com/our_organization/non_existing_repo.git/ ' not found
exposing the real token
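(for context, the relevant clearml.conf section looks roughly like this; the values below are placeholders:)
agent {
    # credentials the agent uses to clone private repositories
    git_user: "username"
    git_pass: "personal_access_token"
}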
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
new icons are slick, it would be even better if you could upload custom icons for the different projects
# fetch the existing task by id
task = Task.get_task(task_id=args.task_id)
task.mark_started()
# update the parameters used for the continued run
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
# re-enqueue the task on its original queue
Task.enqueue(task=task, queue_name=task.data.execution.queue)
nope, the old cleanup task fails with trains_agent: ERROR: Could not find task id=e7725856e9a04271aab846d77d6f7d66 (for host: )
Exception: 'Tasks' object has no attribute 'id'
weirdly enough, curl http://apiserver:8008 from inside the container works
I decided to restart the containers one more time, this is what I got.
I had to restart Docker service to remove the containers
nice, thanks! I'll check if it solves the issue first thing tomorrow morning
it also happens sometimes during the run, when TensorBoard is trying to write something to disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
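(one possible workaround sketch, assuming Task.add_requirements fits this case; it has to run before Task.init, and the project/task names are placeholders:)
from clearml import Task

# pin the exact nightly build so it ends up in the installed packages section
Task.add_requirements("torch", "==1.6.0.dev20200430+cu101")
task = Task.init(project_name="examples", task_name="nightly-torch")  # placeholder names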
ValueError: Task has no hyperparams section defined
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
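(roughly what the two pieces look like side by side; the task id, log_dir, and loss value are illustrative:)
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# resume script: continue counting from the last reported iteration
task = Task.get_task(task_id="...")  # placeholder id
task.set_initial_iteration(task.get_last_iteration())

# training code: the epoch is passed to the writer explicitly
writer = SummaryWriter(log_dir="runs/resume")  # placeholder log_dir
for epoch in range(10):
    loss = 1.0 / (epoch + 1)  # dummy value
    writer.add_scalar("loss", loss, global_step=epoch)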
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news
thank you, I'll let you know if setting it to zero worked