Answered
Experiments randomly getting stuck while training a model

Hey there, for a while now I often find experiments getting stuck while training a model.
It seems to happen randomly and I could not find a reproducible scenario so far, but it happens often enough to be annoying (I'd say 1 out of 5 experiments).
The symptoms are:

  • The task is stuck: no more logging, no more CPU/GPU/disk/network activity.
  • The task is marked as running, and it stays marked as running forever if I don't stop it manually.
  • The last logs before the experiment gets stuck are always about saving a checkpoint at the end of an epoch. We do it with https://pytorch.org/ignite/generated/ignite.contrib.handlers.clearml_logger.html#ignite.contrib.handlers.clearml_logger.ClearMLSaver (pytorch-ignite). See the logs below, and the checkpointing sketch right after them.
  • The task/process/agent are responsive: if I manually abort the task from the WebUI, the task is stopped properly and I do see "User aborted: stopping task (1)" and "Process aborted by user" logged. I can then start other tasks on the same agent without any problem.

What could be the reason for the task being stuck like that? Inter-process communication? Thread locks?
2022-10-08 01:23:10,692 valid_dataset INFO: Epoch[1] Complete. Time taken: 00:03:36
2022-10-08 01:23:10,693 valid_dataset INFO: Engine run complete. Time taken: 00:03:37
2022-10-08 01:23:10,693 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_zcgp8dow.tmp => s3-artefacts/project/experiment_name.experiment_hash/models/best_checkpoint_0.pt
2022-10-08 01:23:11,041 train INFO: Epoch[28] Complete. Time taken: 00:21:05
2022-10-08 01:23:11,093 - clearml.storage - INFO - Uploading: 5.00MB / 37.65MB @ 12.55MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp
2022-10-08 01:23:11,183 - clearml.storage - INFO - Uploading: 10.00MB / 37.65MB @ 55.48MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp
2022-10-08 01:23:11,205 - clearml.storage - INFO - Uploading: 15.00MB / 37.65MB @ 230.58MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp
2022-10-08 01:23:11,227 - clearml.storage - INFO - Uploading: 20.15MB / 37.65MB @ 238.17MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp
2022-10-08 01:23:11,250 - clearml.storage - INFO - Uploading: 25.15MB / 37.65MB @ 212.19MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp
2022-10-08 01:23:11,282 - clearml.storage - INFO - Uploading: 30.15MB / 37.65MB @ 158.42MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp
2022-10-08 01:23:11,304 - clearml.storage - INFO - Uploading: 35.15MB / 37.65MB @ 220.15MBs from /tmp/.clearml.upload_model_zcgp8dow.tmp
2022-10-08 01:23:12,222 - clearml.Task - INFO - Completed model upload to
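For context, here is a minimal sketch of the kind of setup described above (pytorch-ignite's ClearMLSaver attached through a Checkpoint handler). This is not the actual training code: the model, data, score function, and project/task names are placeholders, and it assumes a configured ClearML environment.

# Hypothetical sketch of the checkpointing setup described in the question.
# Only ClearMLSaver/ClearMLLogger/Checkpoint come from the libraries above;
# everything else (model, data, names) is a placeholder.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import Checkpoint, global_step_from_engine
from ignite.contrib.handlers.clearml_logger import ClearMLLogger, ClearMLSaver

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
train_loader = DataLoader(data, batch_size=8, num_workers=8)  # worker subprocesses, as in the report
valid_loader = DataLoader(data, batch_size=8, num_workers=8)

trainer = create_supervised_trainer(model, optimizer, criterion)
evaluator = create_supervised_evaluator(model)

# Creates/attaches the ClearML Task (assumes a configured clearml.conf and output storage).
clearml_logger = ClearMLLogger(project_name="project", task_name="experiment_name")

# Save the best checkpoint through ClearML storage at the end of each validation run.
checkpoint_handler = Checkpoint(
    {"model": model, "optimizer": optimizer},
    ClearMLSaver(),
    filename_prefix="best",
    n_saved=1,
    score_function=lambda engine: -engine.state.metrics.get("loss", 0.0),
    global_step_transform=global_step_from_engine(trainer),
)
evaluator.add_event_handler(Events.COMPLETED, checkpoint_handler)


@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    evaluator.run(valid_loader)


trainer.run(train_loader, max_epochs=2)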

  
  
Posted 2 years ago

Answers 12


You mean you "aborted the task" from the UI?

Yes exactly

I'm assuming from the leftover processes?

Most likely yes, but I don't see how clearml would have an impact here, I am more inclined to think it would be a pytorch dataloader issue, although I don't see why

From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)

yes in venv mode, I'll try with the latest version as well

  
  
Posted one year ago

Any chance this is reproducible?

Unfortunately not at the moment, I could not find a reproducible scenario. If I clone a task that was stuck and start it, it might not get stuck

How many processes do you see running (i.e. ps -Af | grep python)?

I will check that the next time one gets blocked 👍

What is the training framework? Is it multiprocess? How are you launching the process itself? Is it Linux OS? Is it running inside a specific container?

I train with pytorch (1.11) and ignite (0.4.8), using multiprocessing (via the dataloader with num_workers=8) on Linux, not running inside a docker container
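For illustration, a minimal sketch of that kind of data-loading setup (the dataset and parameters are placeholders, not the actual training code); each worker is a separate subprocess, and DataLoader's optional timeout argument makes a stalled worker raise instead of hanging silently.

# Hypothetical sketch of the reported data-loading setup: a PyTorch DataLoader
# with 8 worker subprocesses. Dataset and parameters are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # 8 worker subprocesses, as described above
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
    timeout=120,              # optional: raise instead of hanging if a worker stalls
)

for features, labels in loader:
    pass  # training step would go here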

  
  
Posted 2 years ago

Hi AgitatedDove14, sorry somehow this message got lost 😄
clearml version is the latest at the time, 1.7.1
Yes, I always see the "model uploaded completed" for such stuck tasks
I am using python 3.8.10

  
  
Posted 2 years ago

Hi @<1523701205467926528:profile|AgitatedDove14>, I want to circle back on this issue. It is still relevant, and I could collect the following on an EC2 instance where a clearml-agent was running a stuck task:

  • There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process. I guess these are zombies. Please check the htop tree (and the psutil sketch below for listing them).
  • There is a memory leak somewhere, please see the screenshot of Datadog memory consumption.

I don't know yet what to explore first. My first assumption is that this bug is due to a recent version of clearml-sdk/clearml-agent/python/pytorch (the training used to work smoothly a couple of months ago). Now I get this problem on all my experiments.
Note: I think the memory leak is a consequence of the multiprocessing zombie bug, because on some experiments the memory grows but the experiment gets stuck before reaching the memory limit (see the 2nd screenshot of Datadog memory consumption). But that's just a hypothesis.
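For the next occurrence, here is a small sketch of how the leftover processes could be listed programmatically. This assumes psutil is installed; TRAIN_PID is a placeholder taken from ps -Af | grep python.

# Hypothetical helper for inspecting leftover children of the (stuck) training
# process; assumes `pip install psutil` and that TRAIN_PID comes from
# `ps -Af | grep python`.
import psutil

TRAIN_PID = 12345  # placeholder: PID of the main training process

parent = psutil.Process(TRAIN_PID)
for child in parent.children(recursive=True):
    try:
        print(child.pid, child.status(), " ".join(child.cmdline()))
    except psutil.NoSuchProcess:
        pass  # child exited between listing and inspection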

Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1

Right now my next steps for debugging would be:

  • Train without the clearml integration -> if it works, check which version of the clearml sdk/agent is responsible
  • Train with an older version of the training code -> if it works, look for the guilty code changes
  • Train with a different python/pytorch version
[Screenshot: htop tree of the stuck training process]
[Screenshot: Datadog memory consumption]
  
  
Posted one year ago

There seems to be a problem with multiprocessing: Although I stopped the task,

You mean you "aborted the task" from the UI?

  • There is a memory leak somewhere, please see the screenshot of datadog memory consumption

I'm assuming from the leftover processes?

Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1

From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)

  
  
Posted one year ago

JitteryCoyote63 what's the clearml version?
Are you always seeing the "model uploaded completed" message?
What's the python version you are using?

  
  
Posted 2 years ago

Note: Could be related to https://github.com/allegroai/clearml/issues/790 , not sure

  
  
Posted 2 years ago

@<1523701205467926528:profile|AgitatedDove14> I see other RCs on PyPI but no corresponding tags in the clearml-agent repo; are these releases legit?

  
  
Posted one year ago

What is the latest RC of clearml-agent? 1.5.2rc0?

  
  
Posted one year ago

Most likely yes, but I don't see how clearml would have an impact here, I am more inclined to think it would be a pytorch dataloader issue, although I don't see why

These are most certainly dataloader processes. But when clearml-agent kills the process it should also kill all subprocesses, and it might be that something is preventing it from killing the subprocesses ...
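For illustration only (this is not clearml-agent's actual implementation): the usual way to make sure worker subprocesses don't outlive an aborted job is to terminate the whole process tree, e.g. with psutil.

# Illustrative sketch only -- not clearml-agent's implementation. Terminates a
# process and all of its descendants (e.g. DataLoader workers), escalating to
# SIGKILL for anything that does not exit in time. Requires psutil.
import psutil


def kill_process_tree(pid: int, timeout: float = 5.0) -> None:
    parent = psutil.Process(pid)
    procs = parent.children(recursive=True) + [parent]
    for proc in procs:
        try:
            proc.terminate()  # SIGTERM first, give processes a chance to clean up
        except psutil.NoSuchProcess:
            pass
    gone, alive = psutil.wait_procs(procs, timeout=timeout)
    for proc in alive:
        proc.kill()  # SIGKILL stragglers that ignored SIGTERM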

Is this easily reproducible? Can you verify it is still the case with the latest RC of clearml-agent?

  
  
Posted one year ago

Hmm, #790 should be solved in 1.7.2
Yes, I always see the "model uploaded completed" for such stuck tasks

Any chance this is reproducible?
How many processes do you see running (i.e. ps -Af | grep python)?
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it Linux OS? Is it running inside a specific container?

  
  
Posted 2 years ago

Any insight will help. If you can provide the log of the Task that got stuck, that would be a good start.

  
  
Posted 2 years ago