ShallowKitten67

1 Question, 11 Answers

Active since 10 January 2023

Last activity one year ago

Reputation

Badges 1

11 × Eureka!

0 Votes 28 Answers 908 Views

Hi, I have a problem that I am not sure how to track down: I sometimes get the following message that kills my running process after a few hours:

clearml

3 years ago

Currently 38

3 years ago

Since I do not manage the cluster, I do not have permission to access system logs. In the docker logs, the last thing printed is the clearml.Task WARNING.

3 years ago

Yes, also on my machine, where I store the TensorBoard logs together with additional results (meshes and model checkpoints) of all experiments, I only use about 1 GB.

3 years ago

It seems I regained the connection. At least I can see all values up to the point where the task was terminated, and after the HTTPTimeOut warning in my logs the training ran for another 200 iterations (~1.5 hours).

3 years ago

It seems I lost the connection during my experiment's run, but this happened about 200 epochs before the process was terminated.

3 years ago

I am using the community server at https://app.community.clear.ml. In my environment I use clearml==1.0.2, so I should probably update to the latest version.

3 years ago

Is there a way to check how much storage I am using on the community server?

3 years ago

A small update: it happened again after updating to ClearML SDK 1.0.4, but this time it happened immediately after I lost the HTTP connection. This is consistent with your explanations. Can I suppress this by setting sdk.development.support_stopping to false in the config?

3 years ago

Oh, and I do not change the task’s status in my code. I just create it at the beginning of my training.

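`import clearml

# `parser` and `config_path` come from my own training code
configuration = parser.parse(config_path)

# The task is created once, at the start of training
task = clearml.Task.init(project_name='Foo',
                         task_name=configuration.name)`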

3 years ago

Here are the machine monitoring scalars. They seem fine to me. I am currently trying to reproduce results from a paper, so I do not tune batch_size etc. to use all available resources.

3 years ago

I use TensorBoard and rely on automatic logging for all of my scalar reporting. However, I periodically log some scatter plots using clearml.Logger.report_plotly, and I use report_text to log some information about training progress to the console.

3 years ago
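
For readers following the thread, the periodic scatter-plot reporting described in the last reply might look roughly like the sketch below. It is not the poster's actual code. The helper names (`make_scatter_figure`, `should_report`, `report_scatter`), the reporting interval of 50, and the title/series strings are all hypothetical. The logger is passed in as a parameter and is assumed to expose `report_plotly` and `report_text` as the ClearML `Logger` does. Keyword arguments are used so the call does not depend on positional order.

```python
from typing import Any, Dict, Sequence


def make_scatter_figure(xs: Sequence[float], ys: Sequence[float], name: str) -> Dict[str, Any]:
    """Build a plain-dict Plotly figure holding one scatter trace."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return {
        "data": [
            {"type": "scatter", "mode": "markers", "x": list(xs), "y": list(ys), "name": name}
        ],
        "layout": {"title": {"text": name}},
    }


def should_report(iteration: int, every: int) -> bool:
    """Return True on every `every`-th iteration (0, every, 2 * every, ...)."""
    return every > 0 and iteration % every == 0


def report_scatter(logger: Any, iteration: int, xs: Sequence[float],
                   ys: Sequence[float], every: int = 50) -> bool:
    """Report a scatter plot through `logger` when the iteration is due.

    `logger` is assumed to behave like a ClearML Logger; the interval and
    the title/series names are placeholders chosen for illustration.
    """
    if not should_report(iteration, every):
        return False
    figure = make_scatter_figure(xs, ys, name="predictions")
    logger.report_plotly(title="scatter", series="predictions",
                         iteration=iteration, figure=figure)
    logger.report_text("reported scatter plot at iteration {}".format(iteration))
    return True
```

In a training script, the logger would typically be obtained with `task.get_logger()` after `Task.init`, and `report_scatter` would be called once per iteration inside the training loop.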