Sometimes I notice that at the end of an experiment ClearML keeps hanging (something with repository detection?) and the script does not end. Do more people see this? Especially in our continuous integration pipeline this gives problems because tests are getting a timeout.

Answered

Sometimes I notice that at the end of an experiment clearml keeps hanging (something with repository detection?) and the script does not end. Do more people see this? This is a problem especially in our continuous integration pipeline, because tests time out. To give a bit more detail: it hangs at the task.close() line.

Here is the relevant part of the stack trace. I hope it is helpful.
` self = <clearml.task.Task object at 0x7febb2082128>

def close(self):
    """
    Close the current Task. Enables you to manually shutdown the task.

    .. warning::
       Only call :meth:`Task.close` if you are certain the Task is not needed.
    """
    if self._at_exit_called:
        return

    # store is main before we call at_exit, because will will Null it
    is_main = self.is_main_task()
    is_sub_process = self.__is_subprocess()

    # wait for repository detection (5 minutes should be reasonable time to detect all packages)
    if self._logger and not self.__is_subprocess():
        self._wait_for_repo_detection(timeout=300.)

  self.__shutdown()

venv/lib/python3.6/site-packages/clearml/task.py:1415:

self = <clearml.task.Task object at 0x7febb2082128>

def __shutdown(self):
    """
    Will happen automatically once we exit code, i.e. atexit
    :return:
    """
    # protect sub-process at_exit
    if self._at_exit_called:
        # if we are called twice (signal in the middle of the shutdown),
        # make sure we flush stdout, this is the best we can do.
        if self._at_exit_called == get_current_thread_id() and self._logger and self.__is_subprocess():
            self._logger.set_flush_period(None)
            # noinspection PyProtectedMember
            self._logger._close_stdout_handler(wait=True)
            self._at_exit_called = True
        return

    # from here only a single thread can re-enter
    self._at_exit_called = get_current_thread_id()

    # disable lock on signal callbacks, to avoid deadlocks.
    if self.__exit_hook and self.__exit_hook.signal is not None:
        self.__edit_lock = False

    is_sub_process = self.__is_subprocess()

    # noinspection PyBroadException
    try:
        wait_for_uploads = True
        # first thing mark task as stopped, so we will not end up with "running" on lost tasks
        # if we are running remotely, the daemon will take care of it
        task_status = None
        wait_for_std_log = True
        if (not running_remotely() or DEBUG_SIMULATE_REMOTE_TASK.get()) \
                and self.is_main_task() and not is_sub_process:
            # check if we crashed, ot the signal is not interrupt (manual break)
            task_status = ('stopped', )
            if self.__exit_hook:
                is_exception = self.__exit_hook.exception
                # check if we are running inside a debugger
                if not is_exception and sys.modules.get('pydevd'):
                    # noinspection PyBroadException
                    try:
                        is_exception = sys.last_type
                    except Exception:
                        pass

                # only if we have an exception (and not ctrl-break) or signal is not SIGTERM / SIGINT
                if (is_exception and not isinstance(is_exception, KeyboardInterrupt)
                    and is_exception != KeyboardInterrupt) \
                        or (not self.__exit_hook.remote_user_aborted and
                            self.__exit_hook.signal not in (None, 2, 15)):
                    task_status = (
                        'failed',
                        'Exception {}'.format(is_exception) if is_exception else
                        'Signal {}'.format(self.__exit_hook.signal))
                    wait_for_uploads = False
                else:
                    wait_for_uploads = (self.__exit_hook.remote_user_aborted or self.__exit_hook.signal is None)
                    if not self.__exit_hook.remote_user_aborted and self.__exit_hook.signal is None and \
                            not is_exception:
                        task_status = ('completed', )
                    else:
                        task_status = ('stopped', )
                        # user aborted. do not bother flushing the stdout logs
                        wait_for_std_log = self.__exit_hook.signal is not None

        # wait for repository detection (if we didn't crash)
        if wait_for_uploads and self._logger:
            # we should print summary here
            self._summary_artifacts()
            # make sure that if we crashed the thread we are not waiting forever
            if not is_sub_process:
                self._wait_for_repo_detection(timeout=10.)

        # kill the repo thread (negative timeout, do not wait), if it hasn't finished yet.
        if not is_sub_process:
            self._wait_for_repo_detection(timeout=-1)

        # wait for uploads
        print_done_waiting = False
        if wait_for_uploads and (BackendModel.get_num_results() > 0 or
                                 (self.__reporter and self.__reporter.events_waiting())):
            self.log.info('Waiting to finish uploads')
            print_done_waiting = True
        # from here, do not send log in background thread
        if wait_for_uploads:

          self.flush(wait_for_uploads=True)

venv/lib/python3.6/site-packages/clearml/task.py:3022:

self = <clearml.task.Task object at 0x7febb2082128>, wait_for_uploads = True

def flush(self, wait_for_uploads=False):
    # type: (bool) -> bool
    """
    Flush any outstanding reports or console logs.

    :param bool wait_for_uploads: Wait for all outstanding uploads to complete

        - ``True`` - Wait
        - ``False`` - Do not wait (default)
    """

    # make sure model upload is done
    if BackendModel.get_num_results() > 0 and wait_for_uploads:
        BackendModel.wait_for_results()

    # flush any outstanding logs
    if self._logger:
        # noinspection PyProtectedMember
        self._logger._flush_stdout_handler()
    if self.__reporter:
        self.__reporter.flush()
        if wait_for_uploads:

          self.__reporter.wait_for_events()

venv/lib/python3.6/site-packages/clearml/task.py:1371:

self = <clearml.backend_interface.metrics.reporter.Reporter object at 0x7febb1e96668>
timeout = None

def wait_for_events(self, timeout=None):
    if self._report_service:

      return self._report_service.wait_for_events(timeout=timeout)

venv/lib/python3.6/site-packages/clearml/backend_interface/metrics/reporter.py:223:

self = <clearml.backend_interface.metrics.reporter.BackgroundReportService object at 0x7febb1e96630>
timeout = None

def wait_for_events(self, timeout=None):
    # noinspection PyProtectedMember
    if self._is_subprocess_mode_and_not_parent_process():
        while self._queue and not self._queue.empty():
            sleep(0.1)
        return
    self._empty_state_event.clear()

  return self._empty_state_event.wait(timeout)

venv/lib/python3.6/site-packages/clearml/backend_interface/metrics/reporter.py:80:

self = <clearml.utilities.process.mp.SafeEvent object at 0x7febb1e96128>
timeout = None

def wait(self, timeout=None):

  return self._event.wait(timeout=timeout)

venv/lib/python3.6/site-packages/clearml/utilities/process/mp.py:241:

self = <multiprocessing.synchronize.Event object at 0x7febb1e967f0>
timeout = None

def wait(self, timeout=None):
    with self._cond:
        if self._flag.acquire(False):
            self._flag.release()
        else:

          self._cond.wait(timeout)

/usr/lib/python3.6/multiprocessing/synchronize.py:360:

self = <Condition(<Lock(owner=None)>, 0)>, timeout = None

def wait(self, timeout=None):
    assert self._lock._semlock._is_mine(), \
           'must acquire() condition before using wait()'

    # indicate that this thread is going to sleep
    self._sleeping_count.release()

    # release lock
    count = self._lock._semlock._count()
    for i in range(count):
        self._lock.release()

    try:
        # wait for notification or timeout

      return self._wait_semaphore.acquire(True, timeout)

/usr/lib/python3.6/multiprocessing/synchronize.py:261:

self = <clearml.task.Task.__register_at_exit.<locals>.ExitHooks object at 0x7febb207ecc0>
sig = 2, frame = <frame object at 0x7feba4008f68>

def signal_handler(self, sig, frame):
    self.signal = sig

    org_handler = self._org_handlers.get(sig)
    signal.signal(sig, org_handler or signal.SIG_DFL)

    # if this is a sig term, we wait until __at_exit is called (basically do nothing)
    if sig == signal.SIGINT:
        # return original handler result

      return org_handler if not callable(org_handler) else org_handler(sig, frame)

E KeyboardInterrupt

venv/lib/python3.6/site-packages/clearml/task.py:3205: KeyboardInterrupt
================ 7 deselected, 8 warnings in 551.67s (0:09:11) ================= `
As you can see I pressed Ctrl+C after more than 9 minutes. Often this test is finished within 5 seconds, but sometimes not...

  				
Posted 4 years ago by GreasyPenguin14


Answers 30

YEY!

  				
Posted 3 years ago by AgitatedDove14

Yes, that seems to solve the issue. Thanks.

  				
Posted 3 years ago by EcstaticGoat95

SolidSealion72 EcstaticGoat95 I'm hoping the issue is now resolved 🤞
can you verify with ?
pip install git+

  				
Posted 3 years ago by AgitatedDove14

SolidSealion72 I'm able to reproduce, hurrah!
(and a fix is already being tested, I will keep you guys updated)

  				
Posted 3 years ago by AgitatedDove14

Thanks SolidSealion72 !

Also, I found out that adding "pool.join()" after pool.close() seems to solve the issue in the minimal example.

This is interesting, I'm pretty sure it has something to do with the subprocess not "closing" properly (or too fast or something)
Let me see if I can reproduce

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 Also, I found out that adding "pool.join()" after pool.close() seems to solve the issue in the minimal example.
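
A minimal sketch of that workaround, assuming a plain `multiprocessing.Pool` (the worker function `square`, the helper `run_jobs` and the pool size are placeholders for illustration, not the code from the reproduction repository). Calling `join()` after `close()` makes the parent wait until every worker process has exited before the script moves on to its shutdown path.

```python
from multiprocessing import Pool


def square(x):
    # Placeholder worker; stands in for whatever the real job does.
    return x * x


def run_jobs(values, processes=2):
    pool = Pool(processes=processes)
    try:
        results = pool.map(square, values)
    finally:
        pool.close()  # no more work will be submitted
        pool.join()   # block until all worker processes have exited
    return results


if __name__ == "__main__":
    print(run_jobs([1, 2, 3]))
```

Note that `with Pool(...) as pool:` is not equivalent here: leaving the `with` block calls `terminate()`, not `close()` followed by `join()`.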

  				
Posted 3 years ago by SolidSealion72

AgitatedDove14 I managed to reproduce on Ubuntu (but not on Windows):
Not every run gets stuck, sometimes it's 1 in 10 runs that gets stuck.
https://github.com/maor121/clearml-bug-reproduction

  				
Posted 3 years ago by SolidSealion72

EcstaticGoat95 I can see the experiment but I cannot access the notebook (I get Binder inaccessible)
Is this the exact script as here? https://clearml.slack.com/archives/CTK20V944/p1636536308385700?thread_ts=1634910855.059900&cid=CTK20V944

  				
Posted 3 years ago by AgitatedDove14

Thank you EcstaticGoat95 !

  				
Posted 3 years ago by AgitatedDove14

Go to https://hub.gke2.mybinder.org/user/jupyterlab-jupyterlab-demo-0570jy0h/lab/tree/demo and run the Jupyter notebook called "main.ipynb". I've run the only cell in it several times and now it's stuck.
You can see the corresponding task at https://demoapp.trains.allegro.ai/projects/98cdb5ace38946a690daec8efd668a76/experiments/db0088e1207440319d80acc3ac89aafb/execution?columns=selected&columns=type&columns=name&columns=tags&columns=status&columns=project.name&columns=users&columns=started&columns=last_update&columns=last_iteration&columns=parent.name&order=-last_update . There are now two tasks in a "Running" state, one of them is generated by another run that I've made in a terminal and terminated.

  				
Posted 3 years ago by EcstaticGoat95

EcstaticGoat95 any chance you have an idea on how to reproduce? (even 1 out of 6 is a good start)

  				
Posted 3 years ago by AgitatedDove14

Yes, that's what I observe on one machine. On another machine it hangs with a lower probability (1/6), but it still happens.

  				
Posted 3 years ago by EcstaticGoat95

It does work about 50% of the time

EcstaticGoat95 what do you mean by "work about 50%" ? do you mean the other 50% it hangs ?

  				
Posted 3 years ago by AgitatedDove14

On another machine it gets stuck once every 6 runs on average.

  				
Posted 3 years ago by EcstaticGoat95

It does work about 50% of the time, so running the script several times may reveal the problem.
I tried running it in a fresh virtualenv with only clearml installed, and I see the same issue.

  				
Posted 3 years ago by EcstaticGoat95

Hi CostlyOstrich36 , thanks for the quick reply.
I'm running that on an NVIDIA DGX-A100 computer with Ubuntu 20.04.3 LTS installed, Python 3.8.10, and clearml==1.1.4.
The ClearML server is not of the latest version though, I used "docker-compose.yml" version 3.6 to launch it.

  				
Posted 3 years ago by EcstaticGoat95

EcstaticGoat95 , I couldn't reproduce the issue with 1.1.4 and the provided code. I tried with and without task.close() at the end. Can you please specify which OS / Python version you're using?

  				
Posted 3 years ago by CostlyOstrich36

EcstaticGoat95 , thanks a lot! Will take a look 🙂

  				
Posted 3 years ago by CostlyOstrich36

Multiprocessing Bug
Hi guys, I'm having a similar issue with clearml 1.1.4, and I have written a small script reproducing it.
There are two scripts attached, one invoking the clearml.Task and the other one is a module containing the multiprocessing code. The issue only happens when the multiprocessing code is in another file (module).
The behavior I observe is that the local execution gets stuck after the following messages, and the task in the ClearML server stays in a "Running" state, even after I terminate the local execution.
2021-11-10 10:54:03,066 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis
2021-11-10 10:54:27,945 - clearml.Task - INFO - Finished repository detection and package analysis
I tried adding an explicit invocation of task.close() with a print after it, and the code doesn't reach this print.

  				
Posted 3 years ago by EcstaticGoat95

AgitatedDove14 GreasyPenguin14 Awesome!

  				
Posted 4 years ago by GrittyKangaroo27

Yey!!!!!

  				
Posted 4 years ago by AgitatedDove14

Tested with clearml 1.1.3 and I could not reproduce the issue 👍

  				
Posted 4 years ago by GreasyPenguin14

GreasyPenguin14 GrittyKangaroo27 the new release contains a fix. Could you verify it solves the issue in your scenario as well? There is now a smart timeout to detect the inconsistent state, which means the close/exit procedure might be delayed (by 10 sec) instead of hanging in these specific rare scenarios.

  				
Posted 4 years ago by AgitatedDove14

Quick update, I might have been able to reproduce the issue ( GreasyPenguin14 working "offline" is a great hack to accelerate debugging this issue, thank you!)
It seems it is related to the known and very annoying Python forking issue (and this is why changing to "spawn" method solves the issue):
https://bugs.python.org/issue6721
Long story short, in some cases when forking (e.g. with ProcessPoolExecutor), Python can copy locks in a "bad" state. This means you can end up with a lock that was acquired by a process that has since died, and anything that later waits on that lock will hang.
I think the only way to get around it is with a few predefined timeouts, so that we do not end up hanging the main process.
I'll post here once a fix is pushed to GitHub for you guys to test
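
To illustrate the timeout idea in general terms (this is not ClearML's actual fix, only a sketch of the pattern; the helper name `wait_or_give_up` and its 10-second default are made up for illustration): waiting on an event with a deadline hands control back to the caller even if the event is never set, for example because whoever should have set it is gone. The sketch uses `threading.Event`; `multiprocessing.Event.wait(timeout)` follows the same contract.

```python
import threading


def wait_or_give_up(event, timeout=10.0):
    """Wait for `event`, but never longer than `timeout` seconds.

    Returns True if the event was set, False if the deadline passed.
    """
    # Event.wait(timeout) returns False on timeout instead of blocking forever.
    return event.wait(timeout)


if __name__ == "__main__":
    never_set = threading.Event()
    if not wait_or_give_up(never_set, timeout=0.5):
        print("gave up waiting; continuing shutdown instead of hanging")
```

The trade-off is the one described in this thread: in the rare bad case, shutdown is delayed by the timeout instead of blocking indefinitely.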

  				
Posted 4 years ago by AgitatedDove14

GreasyPenguin14

In the process MyProcess other processes are created via a ProcessPoolExecutor.

Hmm that is interesting, the sub-process has an additional ProcessPoolExecutor inside it ?
GrittyKangaroo27 if you can help with reproducible code that will be great (or any insight on reproducing the issue)

  				
Posted 4 years ago by AgitatedDove14

I face the same problem.
When running the pipeline, some tasks that use multiprocessing would never be completed.

  				
Posted 4 years ago by GrittyKangaroo27

In the process MyProcess other processes are created via a ProcessPoolExecutor. In these processes calls to logger.report_matplotlib_figure are made, but I get the same issue when I remove these calls.

It looks like I don't have hanging issues when I use mp.set_start_method('spawn') at the top of the script.

I don't have a fully reproducible example that I can share, sorry for that.
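
For reference, the `spawn` workaround mentioned above can be sketched like this (a minimal stand-alone example; `work` and `main` are placeholder names and no ClearML calls are shown). With `spawn`, worker processes start from a fresh interpreter instead of inheriting a forked copy of the parent's memory and lock state.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def work(x):
    # Placeholder job executed in a worker process.
    return x + 1


def main():
    # Must be called once, before any process or pool is created.
    mp.set_start_method("spawn")
    with ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(work, range(3)))


if __name__ == "__main__":
    # The guard is required with "spawn": workers re-import this module.
    print(main())
```

The cost is slower worker start-up, and everything passed to the workers must be picklable.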

  				
Posted 4 years ago by GreasyPenguin14

Hi GreasyPenguin14
This is what I did, but I could not reproduce the hang, how is this different from your code?
` from multiprocessing import Process
import numpy as np
from matplotlib import pyplot as plt
from clearml import Task, StorageManager

class MyProcess(Process):
    def run(self):
        # in another process
        global logger
        # Create a plot
        N = 50
        x = np.random.rand(N)
        y = np.random.rand(N)
        colors = np.random.rand(N)
        area = (30 * np.random.rand(N)) ** 2  # 0 to 15 point radii
        plt.scatter(x, y, s=area, c=colors, alpha=0.5)
        # Plot will be reported automatically
        task.logger.report_matplotlib_figure(title="My Plot Title", series="My Plot Series", iteration=10, figure=plt)

Task.set_offline(True)

task = Task.init(
    project_name='debug',
    task_name='exit subprocess',
    auto_connect_arg_parser=True,
    auto_connect_streams=True,
    auto_connect_frameworks=True,
    auto_resource_monitoring=True,
)
parameters = dict(key='value')
task.connect_configuration(parameters)
logger = task.get_logger()

p = MyProcess()
p.start()

csv_file = StorageManager.get_local_copy(" ")
logger.report_table("table", "csv", iteration=0, csv=csv_file)

p.join()
task.close() `

  				
Posted 4 years ago by AgitatedDove14

Ubuntu 18.04 and Python 3.6. The subprocess is created by subclassing multiprocessing.Process and then calling the .start() method.

  				
Posted 4 years ago by GreasyPenguin14

btw:
# in another process
How do you spin up the subprocess, is it with Popen?
Also, what OS and Python version are you using?

  				
Posted 4 years ago by AgitatedDove14
