Answered
Task Stuck At

Task stuck at task.flush(wait_for_uploads=True):

I've been running a model training task - a variation on this clearml dataset example:
https://github.com/allegroai/clearml/blob/master/examples/datasets/data_ingestion.py

It uses from ignite.contrib.handlers import TensorboardLogger to save matplotlib figures to a tensorboard events file.

My CLEARML_FILES_HOST is a gs://<bucket>. My clearml server is self-hosted.

using clearml 1.8.1

it was uploading fine for most of the day but now it is not uploading metrics and at the end of the run it hangs during task.close() on this line:
https://github.com/allegroai/clearml/blob/762bc5325f790b0ba0614f4aefddc1f881a2644c/clearml/task.py#L3634

This is the stack when I interrupt it:
  File "/src/classifier_training/classifier_training.py", line 221, in train_model
  File "/usr/local/lib/python3.8/dist-packages/clearml/task.py", line 1762, in close
    self.__shutdown()
  File "/usr/local/lib/python3.8/dist-packages/clearml/task.py", line 3634, in __shutdown
    self.flush(wait_for_uploads=True)
  File "/usr/local/lib/python3.8/dist-packages/clearml/task.py", line 1718, in flush
    self.__reporter.wait_for_events()
  File "/usr/local/lib/python3.8/dist-packages/clearml/backend_interface/metrics/reporter.py", line 311, in wait_for_events
    return self._report_service.wait_for_events(timeout=timeout)
  File "/usr/local/lib/python3.8/dist-packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events
    if self._empty_state_event.wait(timeout=1.0):
  File "/usr/local/lib/python3.8/dist-packages/clearml/utilities/process/mp.py", line 445, in wait
    return self._event.wait(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 349, in wait
    self._cond.wait(timeout)
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)

Any idea why it may be failing to upload?
I'm running locally (no agent, but from within a local docker container)
The local tensorboard events files are fine (viewing them using tensorboard)

  
  
Posted one year ago

Answers 20


Hi PanickyMoth78

it was uploading fine for most of the day but now it is not uploading metrics and at the end

Where are you uploading metrics to (i.e. where is the clearml-server) ?
Are you seeing any retry logging on your console ?
packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events

This seems to be consistent with waiting for metrics to be flushed to the backend, but usually you will see retry messages on your console when that happens.
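To make the hang easier to picture: the frame at reporter.py line 124 waits on an event with timeout=1.0, which suggests a polling loop that only ends once a background reporter signals that its queue is empty. Below is a minimal sketch of that pattern, not ClearML's actual code. It uses a plain threading.Event (ClearML's own event is multiprocessing-based), and the helper name wait_until_flushed and the deadline_s parameter are made up for illustration.

```python
import threading
import time


def wait_until_flushed(flushed, poll_s=1.0, deadline_s=None):
    """Poll `flushed` (a threading.Event) in short timed waits.

    Returns True once the event is set, or False if `deadline_s` elapses
    first. With deadline_s=None this loops forever if nothing ever sets
    the event, which is what a hang in close() would look like.
    """
    start = time.monotonic()
    while not flushed.wait(timeout=poll_s):
        if deadline_s is not None and time.monotonic() - start >= deadline_s:
            return False
    return True


# Stand-in for a background reporter that finishes after 0.2 s.
done = threading.Event()
threading.Timer(0.2, done.set).start()
print(wait_until_flushed(done, poll_s=0.05, deadline_s=2.0))

# Stand-in for a reporter that never drains its queue: gives up at the deadline.
stuck = threading.Event()
print(wait_until_flushed(stuck, poll_s=0.05, deadline_s=0.3))
```

If the real loop has no overall deadline, interrupting it would land in exactly the frames shown in the stack above.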

  
  
Posted one year ago

ShallowGoldfish8 I believe it was solved in 1.9.0, can you verify?
pip install clearml==1.9.0
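A quick way to confirm which version the training environment actually picks up (handy inside a docker container, where the image may pin an older one). This is only a sketch using the standard library. The helper names are made up, and the version parsing is deliberately naive: it ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.8.1' into (1, 8, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Compare two dotted version strings numerically, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_clearml_version():
    """Return the installed clearml version string, or None if not installed."""
    try:
        return metadata.version("clearml")
    except metadata.PackageNotFoundError:
        return None


version = installed_clearml_version()
if version is None:
    print("clearml is not installed in this environment")
else:
    print(version, "ok" if is_at_least(version, "1.9.0") else "older than 1.9.0")
```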

  
  
Posted one year ago

PanickyMoth78 AgitatedDove14 any news on this? I also got a similar issue

  
  
Posted one year ago

ShallowGoldfish8 the models are uploaded in the background, and task.close() is actually waiting for them, but wait_on_upload=True is also a good solution.

where it seems to be waiting for the metrics, etc but never finishes. No retry message is shown as well.

From the description it sounds like there is a problem with sending the metrics?! the task.close is waiting for all the metrics to be sent, and it seems like for some reason they are not, and this is why close is waiting on them
Are you running your own clearml-server? is the simple TB example (see https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorboard_toy.py or https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py ) working for you?

  
  
Posted one year ago

any news on this? I also got a similar issue

For me the problem sort of went away. My code evolved a bit after posting this so that dataset creation and training tasks run in separate python sessions. I did not investigate further.

  
  
Posted one year ago

No retry messages.
CLEARML_FILES_HOST is gs
CLEARML_API_HOST is a self hosted clearml server (in google compute engine).

Note that earlier in the process the code uploads a dataset just fine
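Since the API host and the files host point at different places here, it can help to print what the process actually sees before digging further. A small sketch, assuming the settings come from environment variables only (they can also come from clearml.conf, which this does not read). The helper name describe_hosts is made up.

```python
import os
from urllib.parse import urlparse


def describe_hosts(env):
    """Summarise where the API and files hosts point, based on env vars only."""
    summary = {}
    for name in ("CLEARML_API_HOST", "CLEARML_FILES_HOST"):
        value = env.get(name)
        scheme = urlparse(value).scheme if value else None
        summary[name] = {"value": value, "scheme": scheme}
    return summary


for name, info in describe_hosts(os.environ).items():
    print(name, info)
```

A scheme of gs for the files host and http or https for the API host would match the setup described in this post.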

  
  
Posted one year ago

Thanks PanickyMoth78 for pinging, let me check if I can find something in the commit log, I think there was a fix there...

  
  
Posted one year ago

My code pretty much creates a dataset, uploads it, trains a model (that's where the current task starts), evaluates it and uploads all the artifacts and metrics. The artifacts and configurations are uploaded all right, but the metrics and plots are not. As with Lavi, my code hangs on the task.close(), where it seems to be waiting for the metrics, etc but never finishes. No retry message is shown as well.
After a print I added for debugging right before task.close(), the only message I get in the console is "ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start".
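The "Could not detect iteration reporting" message suggests the monitor never saw a scalar reported with an iteration number. One way to rule out the automatic TensorBoard binding is to report a scalar explicitly. The sketch below assumes a logger that behaves like clearml's Logger (for example the one returned by task.get_logger()) and calls its report_scalar(title, series, value, iteration) method. RecordingLogger and report_loss are made-up names, there only so the snippet runs without a server.

```python
def report_loss(logger, losses, title="train", series="loss"):
    """Report one scalar per step with an explicit iteration number.

    `logger` is expected to behave like clearml's Logger; it is passed in
    so that this sketch does not need a server connection.
    """
    for iteration, value in enumerate(losses):
        logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return len(losses)


class RecordingLogger:
    """Stand-in that only records calls (hypothetical, for illustration)."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, title, series, value, iteration):
        self.calls.append((title, series, value, iteration))


demo_logger = RecordingLogger()
report_loss(demo_logger, [0.9, 0.7, 0.5])
print(demo_logger.calls)
```

If explicitly reported scalars reach the server while the TensorBoard ones do not, that would point at the binding rather than the network.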

  
  
Posted one year ago

it was uploading fine for most of the day

What do you mean by uploading fine most of the day? Are you suggesting the upload to GS got stuck? Are you seeing the other metrics (scalars, console logs, etc.)?

  
  
Posted one year ago

I mean that it was uploading console logs, scalar plots and images fine just a while ago, and then it seems to have stopped uploading all scalar plot metrics and the figures, but log upload was still fine.

Anyway, it is back to working properly now without any code change (as far as I can tell. I tried commenting out a line or two and then brought them all back)

If I end up with something reproducible I'll post here.

  
  
Posted one year ago

I've already tried restarting my laptop (and the docker container where my code is running)

  
  
Posted one year ago

I think this was the issue: None
And that caused TF binding to skip logging the scalars and from that point it broke the iteration numbering and so on.

  
  
Posted one year ago

Hi Martin, I updated clearml but the problem persists

  
  
Posted one year ago

all done
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
^CTraceback (most recent call last):
  File "/home/zanini/repo/RecSys/src/cli/retraining_script.py", line 710, in <module>
    mr.retrain()
  File "/home/zanini/repo/RecSys/src/cli/retraining_script.py", line 701, in retrain
    self.task.close()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 1783, in close
    self.__shutdown()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 3692, in __shutdown
    self.flush(wait_for_uploads=True)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 1738, in flush
    self.__reporter.wait_for_events()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 316, in wait_for_events
    return self._report_service.wait_for_events(timeout=timeout)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 129, in wait_for_events
    if self._empty_state_event.wait(timeout=1.0):
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/utilities/process/mp.py", line 445, in wait
    return self._event.wait(timeout=timeout)
  File "/home/zanini/anaconda3/lib/python3.9/multiprocessing/synchronize.py", line 349, in wait
    self._cond.wait(timeout)
  File "/home/zanini/anaconda3/lib/python3.9/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 3898, in signal_handler
    return org_handler if not callable(org_handler) else org_handler(sig, frame)
KeyboardInterrupt

  
  
Posted one year ago

Yes, it seems it was indeed waiting for the uploads, which weren't happening (I did give it quite a while to try to finish the process in my tests). I thought it was a problem with the metrics, but apparently it was more like the artifacts before them. The artifacts were shown in the web UI dashboard, but were not on S3.

  
  
Posted one year ago

there may have been some interaction between the training task and a preceding dataset creation task :shrug:

  
  
Posted one year ago

Also, I was using tensorboard

  
  
Posted one year ago

After commenting out all the metric/plot reporting, we noticed the model was not uploading the artifacts to S3. A solution was to add wait_on_upload=True in task.upload_artifact().
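For anyone landing here, the blocking-upload workaround could look roughly like the sketch below. It assumes that Task.upload_artifact accepts a wait_on_upload keyword, which is worth confirming against the installed clearml version. RecordingTask, upload_blocking and the artifact name are made up so that the snippet runs without a server.

```python
def upload_blocking(task, name, artifact):
    """Upload an artifact and return only after the upload call has finished.

    `task` is expected to behave like a clearml Task. Assumption: the
    keyword for a blocking upload is `wait_on_upload`.
    """
    return task.upload_artifact(name=name, artifact_object=artifact, wait_on_upload=True)


class RecordingTask:
    """Stand-in used only to exercise the helper without a server."""

    def __init__(self):
        self.calls = []

    def upload_artifact(self, name, artifact_object, wait_on_upload=False):
        self.calls.append((name, artifact_object, wait_on_upload))
        return True


demo_task = RecordingTask()
print(upload_blocking(demo_task, "confusion_matrix", {"tp": 10, "fp": 2}))
```

With a real task, a failing upload should then surface at the upload call instead of showing up later as a hang in task.close().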

  
  
Posted one year ago

The only weird thing to me is not getting any "connection warnings" if this is indeed a network issue ...

  
  
Posted one year ago