Answered

Hi, I have a problem that I am not really sure how to track down: I sometimes get the following message, which kills my running process after a few hours: clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###. This time it happened while I was asleep, so I didn't do anything. My server was up all the time. I am running my training in a Docker container on a cluster that is not managed by me and report to the ClearML Community Server. Has anyone ever experienced something similar?

  
  
Posted one year ago

Answers 28


Here are the machine monitoring scalars. Seems fine to me. I am currently trying to reproduce results from a paper, so I do not tune batch_size etc. to use all the available resources.

  
  
Posted one year ago

ShallowKitten67, this could happen if you're changing your task's status somewhere in your code - are you?
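(For reference, an explicit status change from code would look roughly like the sketch below. This is purely illustrative, using ClearML Task methods as I understand them, not something taken from your code.)

```python
# Illustrative only: explicit status changes that would make the SDK print the
# "TASK STOPPED - USER ABORTED - STATUS CHANGED" warning and stop the process.
from clearml import Task

task = Task.current_task()  # assumes Task.init() was already called earlier
task.mark_stopped()         # or e.g. task.mark_failed()
```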

  
  
Posted one year ago

Is there a way to check how much storage I am using on the community server?

  
  
Posted one year ago

It's on the way, but not yet possible πŸ™‚

  
  
Posted one year ago

How many experiments do you have?

  
  
Posted one year ago

It seems like I lost connection during the run of my experiment, but this happened around 200 epochs before the process got terminated.

  
  
Posted one year ago

I mean, assuming you lost connection to the server and stopped reporting

  
  
Posted one year ago

Currently 38

  
  
Posted one year ago

I am using the community server at https://app.community.clear.ml. In my environment I use clearml==1.0.2, so I probably should update to the latest version.

  
  
Posted one year ago

What you're seeing is basically the SDK's response to the task's status being changed mid-run, or to someone clicking "Stop" in the UI.

  
  
Posted one year ago

Doesn't seem too large πŸ™‚

  
  
Posted one year ago

It seems like I regained the connection. At least I can see all values until the task got terminated, and after the HTTPTimeOut warning in my logs the training ran for another 200 iterations (~1.5 hours).

  
  
Posted one year ago

Since I do not manage the cluster, I do not have permission to access the system logs. In the Docker logs, the last thing that gets printed is the clearml.Task WARNING.

  
  
Posted one year ago

Yes. Also, on my machine, where I store the TensorBoard logs together with additional results (meshes and model checkpoints) of all experiments, I only use about 1 GB.

  
  
Posted one year ago

ShallowKitten67 are you relying on the automatic reporting (so just creating a task and doing nothing clearml-related afterwards), or are you explicitly calling any clearml methods in your code?

  
  
Posted one year ago

That seems OK

  
  
Posted one year ago

but updating to the latest version is always a good idea πŸ™‚

  
  
Posted one year ago

Little update here: it happened again after updating to ClearML SDK 1.0.4, but this time it happened immediately after I lost the HTTP connection, which is consistent with your explanation. Can I suppress this by setting sdk.development.support_stopping to false in the config?
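(For reference, in clearml.conf that setting would sit under the sdk.development section, roughly like the sketch below; the comment is mine.)

```
sdk {
    development {
        # When false, the SDK does not abort the local process in response to
        # the task's status being changed externally (e.g. "Stop" in the UI
        # or the server watchdog).
        support_stopping: false
    }
}
```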

  
  
Posted one year ago

Hi ShallowKitten67.

Can you send the logs? Can you share the machine monitoring (from the scalars section)?

  
  
Posted one year ago

Oh, and I do not change the task’s status in my code. I just create it at the beginning of my training.

```python
import clearml

configuration = parser.parse(config_path)

# The task is created once here and its status is never touched afterwards
task = clearml.Task.init(project_name='Foo',
                         task_name=configuration.name)
```

  
  
Posted one year ago

Yeah, it should disable this behavior

  
  
Posted one year ago

Well, there's a watchdog on the server that automatically stops tasks that haven't reported for a long time - I guess that's what happened...

  
  
Posted one year ago

After some investigation, this might be related to an issue in ClearML SDK 1.0.2 with the subprocesses support - I suggest upgrading to ClearML SDK 1.0.4 πŸ™‚
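(For reference, upgrading is just the usual pip command, e.g.:)

```
pip install clearml==1.0.4
```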

  
  
Posted one year ago

OK, those can't cause any issue πŸ™‚

  
  
Posted one year ago

So that doesn't explain why the task's status was changed...

  
  
Posted one year ago

What server are you using?

  
  
Posted one year ago

There are literally only two things that can cause that specific message to be printed πŸ™‚

  
  
Posted one year ago

I use TensorBoard and rely on automatic logging for all of my scalar reporting. However, I periodically log some scatter plots using clearml.Logger.report_plotly, and I use report_text to log some information about training progress to the console.
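(For reference, that kind of explicit reporting looks roughly like the sketch below; the titles, series names, and values are made up, and it assumes the standard ClearML Logger API.)

```python
# Illustrative sketch of periodic explicit reporting alongside automatic
# TensorBoard logging; assumes a task was already created with Task.init().
import plotly.graph_objects as go
from clearml import Logger

logger = Logger.current_logger()

# Occasional scatter plot, reported as a plotly figure
fig = go.Figure(go.Scatter(x=[1, 2, 3], y=[0.9, 0.5, 0.2], mode="markers"))
logger.report_plotly(title="validation scatter", series="epoch 10",
                     iteration=10, figure=fig)

# Training-progress text, shown in the console/log
logger.report_text("Epoch 10 finished, loss=0.2")
```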

  
  
Posted one year ago