My code pretty much creates a dataset, uploads it, trains a model (that's where the current task starts), evaluates it and uploads all the artifacts and metrics. The artifacts and configurations are uploaded all right, but the metrics and plots are not. As with Lavi, my code hangs on task.close(), where it seems to be waiting for the metrics, etc. but never finishes. No retry message is shown either.
I added a debug print right before task.close(). After it, the only message I get in the console is "ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start"
Hi PanickyMoth78
it was uploading fine for most of the day but now it is not uploading metrics and at the end
Where are you uploading metrics to (i.e. where is the clearml-server) ?
Are you seeing any retry logging on your console?
packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events
This seems to be consistent with waiting for metrics to be flushed to the backend, but usually you will see retry messages on your console when that happens
After commenting out all the metric/plot reporting, we noticed the model was not uploading the artifacts to S3. A solution was to add wait_on_upload=True in task.upload_artifact()
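A minimal sketch of the workaround described above, assuming the flag meant here is the wait_on_upload argument of Task.upload_artifact(). The helper name upload_artifact_blocking is hypothetical and not part of ClearML. It accepts any task-like object, so it can be tried without a server:

```python
def upload_artifact_blocking(task, name, artifact_object):
    """Upload an artifact and block until the storage upload has finished.

    `task` is expected to behave like a clearml.Task. The helper itself is
    hypothetical and only wraps the existing upload_artifact() call.
    """
    # wait_on_upload=True makes the call synchronous instead of handing the
    # file to the background uploader.
    ok = task.upload_artifact(
        name=name,
        artifact_object=artifact_object,
        wait_on_upload=True,
    )
    if not ok:
        # upload_artifact() reports failure through its boolean return value.
        raise RuntimeError(f"artifact {name!r} was not uploaded")
    return ok
```

Failing loudly at upload time makes a storage problem visible right away, instead of surfacing later as a hang in task.close().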
I've already tried restarting my laptop (and the docker container where my code is running)
ShallowGoldfish8 I believe it was solved in 1.9.0, can you verify?
pip install clearml==1.9.0
PanickyMoth78 AgitatedDove14 any news on this? I also got a similar issue
it was uploading fine for most of the day
What do you mean by "uploading fine most of the day"? Are you suggesting the upload to GS got stuck? Are you seeing the other metrics (scalars, console logs, etc.)?
I think this was the issue: None
And that caused TF binding to skip logging the scalars and from that point it broke the iteration numbering and so on.
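While debugging this kind of problem, one option is to bypass the automatic TF binding and report scalars explicitly with an iteration number, so the iteration numbering does not depend on the binding. A minimal sketch, assuming `logger` behaves like the object returned by task.get_logger(); the helper name, the "train" series and the metric names are hypothetical:

```python
def report_epoch_metrics(logger, epoch, metrics, series="train"):
    """Report each metric as a scalar with an explicit iteration number.

    `logger` is expected to expose report_scalar(title, series, value,
    iteration), as the ClearML Logger does. Returns the number of scalars sent.
    """
    sent = 0
    for title, value in sorted(metrics.items()):
        logger.report_scalar(
            title=title,
            series=series,
            value=float(value),
            iteration=int(epoch),  # explicit iteration, not inferred from TF
        )
        sent += 1
    return sent
```

If scalars reported this way show up in the UI while the TensorBoard ones do not, that points at the binding rather than at the connection to the server.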
all done
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
^CTraceback (most recent call last):
  File "/home/zanini/repo/RecSys/src/cli/retraining_script.py", line 710, in <module>
    mr.retrain()
  File "/home/zanini/repo/RecSys/src/cli/retraining_script.py", line 701, in retrain
    self.task.close()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 1783, in close
    self.__shutdown()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 3692, in __shutdown
    self.flush(wait_for_uploads=True)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 1738, in flush
    self.__reporter.wait_for_events()
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 316, in wait_for_events
    return self._report_service.wait_for_events(timeout=timeout)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 129, in wait_for_events
    if self._empty_state_event.wait(timeout=1.0):
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/utilities/process/mp.py", line 445, in wait
    return self._event.wait(timeout=timeout)
  File "/home/zanini/anaconda3/lib/python3.9/multiprocessing/synchronize.py", line 349, in wait
    self._cond.wait(timeout)
  File "/home/zanini/anaconda3/lib/python3.9/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)
  File "/home/zanini/repo/RecSys/.venv/lib/python3.9/site-packages/clearml/task.py", line 3898, in signal_handler
    return org_handler if not callable(org_handler) else org_handler(sig, frame)
KeyboardInterrupt
ShallowGoldfish8 the models are uploaded in the background and task.close() is actually waiting for them, but wait_on_upload=True is also a good solution.
where it seems to be waiting for the metrics, etc. but never finishes. No retry message is shown either.
From the description it sounds like there is a problem with sending the metrics?! task.close() is waiting for all the metrics to be sent, and it seems like for some reason they are not, and this is why close is waiting on them
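To see where a blocking close is stuck without pressing Ctrl-C, the standard-library faulthandler module can dump all thread stacks after a timeout. A minimal sketch: the helper name close_with_stack_dump and the 300-second default are hypothetical, and close_fn would be something like task.close:

```python
import faulthandler
import sys


def close_with_stack_dump(close_fn, timeout_sec=300.0, file=None):
    """Call close_fn(); if it is still running after timeout_sec seconds,
    write the stacks of all threads to `file` (stderr by default).

    The call is not interrupted, only made observable.
    """
    out = file if file is not None else sys.stderr
    # Arm a one-shot watchdog; `out` must be backed by a real file descriptor.
    faulthandler.dump_traceback_later(timeout_sec, repeat=False, file=out, exit=False)
    try:
        return close_fn()
    finally:
        # Disarm the watchdog whether close_fn returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

Usage would be close_with_stack_dump(task.close, timeout_sec=120). A dump that ends in wait_for_events points at the metrics flush described above.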
Are you running your own clearml-server? Is the simple TB example (see https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorboard_toy.py or https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py ) working for you?
Hi Martin, I updated clearml but the problem persists
I mean that it was uploading console logs, scalar plots and images fine just a while ago, and then it seems to have stopped uploading all scalar plot metrics and the figures, but log upload was still fine.
Anyway, it is back to working properly now without any code change (as far as I can tell; I tried commenting out a line or two and then brought them all back).
If I end up with something reproducible I'll post here.
there may have been some interaction between the training task and a preceding dataset creation task :shrug:
any news on this? I also got a similar issue
For me the problem sort of went away. My code evolved a bit after posting this so that dataset creation and training tasks run in separate python sessions. I did not investigate further.
Something is off here ... Can you try to run the TB examples and the artifacts example and see if they work?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
The only weird thing to me is not getting any "connection warnings" if this is indeed a network issue ...
Yes, it seems it was indeed waiting for the uploads, which weren't happening (I did give it quite a while to try to finish the process in my tests). I thought it was a problem with metrics, but apparently it was more like the artifacts before them. The artifacts were shown in the web UI dashboard, but were not on S3.
no retry messages
CLEARML_FILES_HOST is gs
CLEARML_API_HOST is a self hosted clearml server (in google compute engine).
Note that earlier in the process the code uploads a dataset just fine
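When answering "where are you uploading to?", it can help to print what the process actually sees. A minimal sketch, assuming the two hosts are set through the environment variables named above; the helper name describe_clearml_hosts is hypothetical:

```python
import os
from urllib.parse import urlparse


def describe_clearml_hosts(env=None):
    """Return the value and URL scheme of the ClearML host variables.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    report = {}
    for var in ("CLEARML_API_HOST", "CLEARML_FILES_HOST"):
        value = env.get(var, "")
        scheme = urlparse(value).scheme if value else ""
        # None marks "not set here"; the value may still come from a config file.
        report[var] = {"value": value or None, "scheme": scheme or None}
    return report
```

A scheme of gs or s3 on CLEARML_FILES_HOST means files go to object storage rather than to the ClearML files server, so storage credentials are a separate thing to check from the API connection.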
Thanks PanickyMoth78 for pinging, let me check if I can find something in the commit log, I think there was a fix there...