Task Stuck At

Answered

Task stuck at task.flush(wait_for_uploads=True):

I've been running a model training task - a variation on this clearml dataset example:
https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py

It uses from ignite.contrib.handlers import TensorboardLogger to save matplotlib figures to a TensorBoard events file.

My CLEARML_FILES_HOST is a gs://<bucket>. My ClearML server is self-hosted.

I'm using clearml 1.8.1.

It was uploading fine for most of the day, but now it is not uploading metrics, and at the end of the run it hangs during task.close() on this line:
https://github.com/allegroai/clearml/blob/762bc5325f790b0ba0614f4aefddc1f881a2644c/clearml/task.py#L3634

This is the stack when I interrupt it:
  File "/src/classifier_training/classifier_training.py", line 221, in train_model
  File "/usr/local/lib/python3.8/dist-packages/clearml/task.py", line 1762, in close
    self.__shutdown()
  File "/usr/local/lib/python3.8/dist-packages/clearml/task.py", line 3634, in __shutdown
    self.flush(wait_for_uploads=True)
  File "/usr/local/lib/python3.8/dist-packages/clearml/task.py", line 1718, in flush
    self.__reporter.wait_for_events()
  File "/usr/local/lib/python3.8/dist-packages/clearml/backend_interface/metrics/reporter.py", line 311, in wait_for_events
    return self._report_service.wait_for_events(timeout=timeout)
  File "/usr/local/lib/python3.8/dist-packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events
    if self._empty_state_event.wait(timeout=1.0):
  File "/usr/local/lib/python3.8/dist-packages/clearml/utilities/process/mp.py", line 445, in wait
    return self._event.wait(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 349, in wait
    self._cond.wait(timeout)
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)

Any idea why it may be failing to upload?
I'm running locally (no agent, but from within a local docker container)
The local tensorboard events files are fine (viewing them using tensorboard)
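
For anyone hitting the same hang: instead of interrupting the process by hand, the standard-library faulthandler module can dump every thread's stack after a delay. The sketch below is only an illustration of that idea; the helper name capture_stacks_after and its arguments are made up for this example, and in a real script you would more likely pass file=sys.stderr and wrap the call to task.close().

```python
import faulthandler
import tempfile
import time


def capture_stacks_after(delay_s, work):
    """Run work(); if it is still running after delay_s seconds, record all thread stacks.

    Returns the recorded text, or an empty string if work() finished in time.
    (Hypothetical helper, not part of ClearML.)
    """
    with tempfile.TemporaryFile() as sink:
        # The watchdog writes the stacks straight to the file descriptor of `sink`.
        faulthandler.dump_traceback_later(delay_s, file=sink)
        try:
            work()
        finally:
            # Always cancel, so the watchdog never fires after `sink` is closed.
            faulthandler.cancel_dump_traceback_later()
        sink.seek(0)
        return sink.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # time.sleep stands in for a call that blocks longer than expected.
    print(capture_stacks_after(0.1, lambda: time.sleep(0.4)))
```

The dump does not stop the program; it only shows where each thread was at that moment, which is the same information as the interrupted stack above.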

  				
Posted 2 years ago by PanickyMoth78

Answers 20

Hi PanickyMoth78

it was uploading fine for most of the day but now it is not uploading metrics and at the end

Where are you uploading metrics to (i.e. where is the clearml-server)?
Are you seeing any retry logging on your console?

packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events

This seems to be consistent with waiting for metrics to be flushed to the backend, but usually you will see retry messages on your console when that happens.
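
The frame at reporter.py line 124 polls an event with a one-second timeout, which suggests a loop that keeps waiting until the background reporter signals that it has nothing left to send. Here is a minimal standard-library sketch of that pattern. It is not ClearML's code, and the names wait_until_empty, empty_event and deadline_s are hypothetical. It shows how such a wait can block silently when the signal never arrives, and how a deadline turns the hang into a visible result.

```python
import threading
import time


def wait_until_empty(empty_event, deadline_s=None, poll_s=1.0):
    """Poll empty_event until it is set or until deadline_s seconds have passed.

    Returns True if the event was set, False if the deadline expired.
    With deadline_s=None this loops forever when nobody sets the event.
    """
    start = time.monotonic()
    while True:
        if empty_event.wait(timeout=poll_s):
            return True
        if deadline_s is not None and time.monotonic() - start >= deadline_s:
            return False


if __name__ == "__main__":
    stuck = threading.Event()  # never set: simulates a reporter that never drains
    print(wait_until_empty(stuck, deadline_s=0.3, poll_s=0.1))

    done = threading.Event()
    threading.Timer(0.1, done.set).start()  # simulates a reporter that finishes
    print(wait_until_empty(done, deadline_s=2.0, poll_s=0.05))
```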

  				
Posted 2 years ago by AgitatedDove14

I think this was the issue: None
That caused the TF binding to skip logging the scalars, and from that point on it broke the iteration numbering and so on.

  				
Posted 2 years ago by AgitatedDove14

Thanks PanickyMoth78 for pinging, let me check if I can find something in the commit log, I think there was a fix there...

  				
Posted 2 years ago by AgitatedDove14

No retry messages.
CLEARML_FILES_HOST is gs.
CLEARML_API_HOST is a self-hosted clearml server (in Google Compute Engine).

Note that earlier in the process the code uploads a dataset just fine.
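
Since two different destinations are involved here (a self-hosted API server and a gs:// bucket for files), it can help to print what the running process actually sees, especially inside a docker container. This is a small sketch using only the standard library. The helper name describe_hosts and the example values are hypothetical; only the two environment variable names come from this thread.

```python
import os
from urllib.parse import urlparse


def describe_hosts(env=None):
    """Return the value and URL scheme of the two ClearML host variables.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    summary = {}
    for name in ("CLEARML_API_HOST", "CLEARML_FILES_HOST"):
        value = env.get(name)
        scheme = urlparse(value).scheme if value else None
        summary[name] = {"value": value, "scheme": scheme}
    return summary


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = {
        "CLEARML_API_HOST": "http://localhost:8008",
        "CLEARML_FILES_HOST": "gs://example-bucket",
    }
    for name, info in describe_hosts(example).items():
        print(name, info)
```

Calling describe_hosts() with no argument reads the real environment of the process.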

  				
Posted 2 years ago by PanickyMoth78

all done
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
^CTraceback (most recent call last):
  File "/home/zanini/repo/RecSys/src/cli/retraining_script.py", line 710, in <module>
    mr.retrain()
  File "/home/zanini/repo/RecSys/src/cli/retraining_script.py", line 701, in retrain
    self.task.close()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 1783, in close
    self.__shutdown()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 3692, in __shutdown
    self.flush(wait_for_uploads=True)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 1738, in flush
    self.__reporter.wait_for_events()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 316, in wait_for_events
    return self._report_service.wait_for_events(timeout=timeout)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 129, in wait_for_events
    if self._empty_state_event.wait(timeout=1.0):
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/utilities/process/mp.py", line 445, in wait
    return self._event.wait(timeout=timeout)
  File "/home/zanini/anaconda3/lib/python3.9/multiprocessing/synchronize.py", line 349, in wait
    self._cond.wait(timeout)
  File "/home/zanini/anaconda3/lib/python3.9/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 3898, in signal_handler
    return org_handler if not callable(org_handler) else org_handler(sig, frame)
KeyboardInterrupt

  				
Posted 2 years ago by ShallowGoldfish8

Also, I was using tensorboard

  				
Posted 2 years ago by ShallowGoldfish8

After commenting out all the metric/plot reporting, we noticed the model was not uploading the artifacts to S3. A solution was to pass wait_on_upload=True to task.upload_artifact()
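
As far as I know, the blocking-upload flag on Task.upload_artifact is spelled wait_on_upload. A minimal sketch of the workaround follows. The wrapper upload_blocking and the stand-in class RecordingTask are hypothetical and exist only so the example runs without a server; in real code `task` would be the object returned by Task.init().

```python
def upload_blocking(task, name, artifact_object):
    """Upload an artifact and return only once the upload call has completed."""
    return task.upload_artifact(
        name=name,
        artifact_object=artifact_object,
        wait_on_upload=True,  # block here instead of uploading in the background
    )


class RecordingTask:
    """Stand-in for a ClearML task (assumption: only upload_artifact is needed)."""

    def __init__(self):
        self.calls = []

    def upload_artifact(self, name, artifact_object, wait_on_upload=False):
        self.calls.append((name, wait_on_upload))
        return True


if __name__ == "__main__":
    fake = RecordingTask()
    upload_blocking(fake, "metrics_summary", {"accuracy": 0.9})
    print(fake.calls)
```

Blocking on each upload makes a storage problem show up at the upload call itself rather than later in task.close().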

  				
Posted 2 years ago by ShallowGoldfish8

there may have been some interaction between the training task and a preceding dataset creation task :shrug:

  				
Posted 2 years ago by PanickyMoth78

I've already tried restarting my laptop (and the docker container where my code is running)

  				
Posted 2 years ago by PanickyMoth78

I mean that it was uploading console logs, scalar plots and images fine just a while ago, and then it seems to have stopped uploading all scalar plot metrics and the figures, but log upload was still fine.

Anyway, it is back to working properly now without any code change (as far as I can tell. I tried commenting out a line or two and then brought them all back)

If I end up with something reproducible I'll post here.

  				
Posted 2 years ago by PanickyMoth78

The only weird thing to me is not getting any "connection warnings" if this is indeed a network issue ...

  				
Posted 2 years ago by AgitatedDove14

any news on this? I also got a similar issue

For me the problem sort of went away. My code evolved a bit after posting this so that dataset creation and training tasks run in separate python sessions. I did not investigate further.
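
A rough sketch of what "separate python sessions" can look like, using only the standard library: each stage runs in its own interpreter, so nothing from the dataset stage (background threads, global state) carries over into the training stage. The helper run_stages is hypothetical, and the inline -c snippets are placeholders for real scripts.

```python
import subprocess
import sys


def run_stages(stages):
    """Run each stage in a fresh Python interpreter, stopping at the first failure.

    `stages` is a list of argument lists passed to the interpreter.
    Returns the list of exit codes of the stages that were run.
    """
    codes = []
    for args in stages:
        completed = subprocess.run([sys.executable, *args], check=False)
        codes.append(completed.returncode)
        if completed.returncode != 0:
            break
    return codes


if __name__ == "__main__":
    # Placeholders: in practice these would be the dataset and training scripts.
    stages = [
        ["-c", "print('stage 1: dataset creation would run here')"],
        ["-c", "print('stage 2: training would run here')"],
    ]
    print(run_stages(stages))
```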

  				
Posted 2 years ago by PanickyMoth78

ShallowGoldfish8 the models are uploaded in the background. task.close() actually waits for them, but wait_on_upload is also a good solution.

where it seems to be waiting for the metrics, etc but never finishes. No retry message is shown as well.

From the description it sounds like there is a problem with sending the metrics. task.close() waits for all the metrics to be sent, and it seems that for some reason they are not being sent, which is why close keeps waiting on them.
Are you running your own clearml-server? Is the simple TB example (see https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorboard_toy.py or https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py ) working for you?

  				
Posted 2 years ago by AgitatedDove14

Hi Martin, I updated clearml but the problem persists

  				
Posted 2 years ago by ShallowGoldfish8

ShallowGoldfish8 I believe it was solved in 1.9.0, can you verify?
pip install clearml==1.9.0
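
To confirm which version the training process actually imports (a docker image can easily lag behind the host), something like the following standard-library check may help. The helpers parse_version and installed_at_least are hypothetical names for this sketch, and the parsing is deliberately simple: it only looks at the leading numeric parts of the version string.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.9.0' into (1, 9, 0); a non-numeric suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_at_least(package, minimum):
    """Return True or False, or None if the package is not installed."""
    try:
        current = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    a, b = parse_version(current), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.9' and '1.9.0' compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print(installed_at_least("clearml", "1.9.0"))
```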

  				
Posted 2 years ago by AgitatedDove14

PanickyMoth78 AgitatedDove14 any news on this? I also got a similar issue

  				
Posted 2 years ago by ShallowGoldfish8

Something is off here ... Can you try to run the TB examples and the artifacts example and see if they work?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py

  				
Posted 2 years ago by AgitatedDove14

it was uploading fine for most of the day

What do you mean by uploading fine most of the day? Are you suggesting the upload to GS got stuck? Are you seeing the other metrics (scalars, console logs, etc.)?

  				
Posted 2 years ago by AgitatedDove14

My code pretty much creates a dataset, uploads it, trains a model (that's where the current task starts), evaluates it and uploads all the artifacts and metrics. The artifacts and configurations are uploaded all right, but the metrics and plots are not. As with Lavi, my code hangs on task.close(), where it seems to be waiting for the metrics, etc. but never finishes. No retry message is shown either.
After a print I added for debugging right before task.close(), the only message I get in the console is: ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
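
One way to check whether scalars reach the server at all, independently of the TensorBoard binding, is to report a few values explicitly with an iteration number through the task's logger. The sketch below assumes the Logger.report_scalar(title, series, value, iteration) call. The helper report_probe_scalars and the stand-in RecordingLogger are hypothetical and only there so the example runs offline; in real code the logger would come from task.get_logger().

```python
def report_probe_scalars(logger, values, title="debug", series="probe"):
    """Report each value as a scalar, using its position as the iteration number."""
    values = list(values)
    for iteration, value in enumerate(values):
        logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return len(values)


class RecordingLogger:
    """Stand-in for a ClearML logger (assumption: only report_scalar is needed)."""

    def __init__(self):
        self.reports = []

    def report_scalar(self, title, series, value, iteration):
        self.reports.append((title, series, value, iteration))


if __name__ == "__main__":
    fake = RecordingLogger()
    report_probe_scalars(fake, [0.5, 0.25, 0.125])
    print(fake.reports)
```

If these probe values show up in the web UI while the TensorBoard scalars do not, that would point at the binding rather than at the connection to the server.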

  				
Posted 2 years ago by ShallowGoldfish8

Yes, it seems it was indeed waiting for the uploads, which weren't happening (I did give it quite a while to try to finish the process in my tests). I thought it was a problem with the metrics, but apparently it was more about the artifacts before them. The artifacts were shown in the web UI dashboard, but were not on S3.

  				
Posted 2 years ago by ShallowGoldfish8
