
Yes, it is reproducible. Do you want a snippet?
Already fixed 🙂 please ping tomorrow, I think an RC should be out soon with the fix
Hi PanickyMoth78
it was uploading fine for most of the day but now it is not uploading metrics and at the end
Where are you uploading metrics to (i.e. where is the clearml-server) ?
Are you seeing any retry logging on your console ?
```
packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events
```
This seems to be consistent with waiting for metrics to be flushed to the backend, but usually you will see retry messages on your console when that happens
I think this was the issue: None
And that caused the TF binding to skip logging the scalars, and from that point it broke the iteration numbering and so on.
ShallowGoldfish8 the models are uploaded in the background, task.close() is actually waiting for them, but wait_for_upload is also a good solution.
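For reference, a minimal sketch of the two approaches mentioned above (standard ClearML SDK; project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="model upload")

# ... training code; model checkpoints are uploaded in a background thread,
# so the script itself does not block on the upload

# Option 1: explicitly block until all pending uploads have finished
task.flush(wait_for_uploads=True)

# Option 2: task.close() also waits for pending uploads/metrics before returning
task.close()
```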
where it seems to be waiting for the metrics, etc. but never finishes. No retry message is shown either.
From the description it sounds like there is a problem with sending the metrics. task.close() waits for all the metrics to be sent, and it seems that for some reason they are not, which is why close() hangs waiting on them.
Something is off here ... Can you try to run the TB examples and the artifacts example and see if they work?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
it was uploading fine for most of the day
What do you mean by uploading fine for most of the day? Are you suggesting the upload got stuck going to GS? Are you seeing the other metrics (scalars, console logs, etc.)?
Thanks @<1523701713440083968:profile|PanickyMoth78> for pinging, let me check if I can find something in the commit log, I think there was a fix there...
The only weird thing to me is not getting any "connection warnings" if this is indeed a network issue ...
Actually that is less interesting, as it is quite straight forward
ShallowGoldfish8 I believe it was solved in 1.9.0, can you verify?
```
pip install clearml==1.9.0
```
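If it helps, a quick way to check the installed version after upgrading (assuming a standard install):
```python
import clearml
print(clearml.__version__)
```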
Yes, but as you mentioned everything is created inside the lib, which means the Python side is not able to intercept the metrics so that clearml can send them to the backend.
Hi FiercePenguin76
Is catboost actually using TB, or is it just writing .tfevents files on its own ?
it certainly does not use tensorboard python lib
Hmm, yes I assume this is why the automagic is not working 😞
Does it have a pythonic interface for the metrics ?
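If there is no hook the automagic can patch, a fallback sketch is to report the metrics manually with the ClearML Logger (the loop and values below are illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="manual metrics")
logger = task.get_logger()

# Illustrative training loop: report whatever metrics the library exposes
for iteration, loss in enumerate([0.9, 0.7, 0.5]):
    logger.report_scalar(title="train", series="loss", value=loss, iteration=iteration)
```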
Hi SmallDeer34
ClearML automagical logging works on the current Python process. But in your example your Bash is running another Python script (that has nothing to do with the original notebook), hence the clearml automagic is not aware of it (i.e. it cannot "patch" the tensorboard calls).
In order to make it work, you should do something like:
```python
from joeynmt import train
train.main(...)
```
Or something similar 🙂
Make sense ?
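To make the pattern concrete, a sketch of what the notebook cell could look like, assuming the in-process call above (the config path argument is hypothetical and depends on the joeynmt version you use):
```python
from clearml import Task

# Task.init must run in the same Python process as the training code,
# so the automagic can patch the TensorBoard calls
task = Task.init(project_name="examples", task_name="joeynmt training")

from joeynmt import train

# Hypothetical invocation: pass the same config/arguments you currently
# give on the command line
train.main("configs/my_config.yaml")
```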
Okay, here is standalone code that should be close enough (if I missed anything let me know):
```python
import tempfile
from datetime import datetime
from pathlib import Path

import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task

task = Task.init(project_name="debug", task_name="test")

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)


def normalize_img(image, label):
    # Scale images from uint8 [0, 255] to float32 [0, 1]
    return tf.cast(image, tf.float32) / 255., label

# ... (the rest of the snippet was truncated in the export; it follows the
# standard TF/tfds MNIST example linked above: dataset pipeline, Keras model,
# TensorBoard callback, model.fit)
```
Thanks BoredHedgehog47 !
And yes, if the Task.init() call was only in main.py, then the TB inside the subprocess (train.py) would, as you perceived, not be captured.
Did you by any chance test calling Task.init() in both main.py and train.py?
I think the crux of the issue is the subprocess calls I removed.
That kind of makes sense, though if the subprocess function also had a Task.init() call, it should have worked.
Would that be the setup to try to replicate?
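Putting the described setup into a minimal two-file sketch (file names taken from the conversation; the contents are an assumption of what the repro could look like):
```python
# main.py
import subprocess
import sys

from clearml import Task

task = Task.init(project_name="debug", task_name="subprocess repro")

# Launch the training script as a separate process, as in the original setup
subprocess.run([sys.executable, "train.py"], check=True)
```
```python
# train.py
import tensorflow as tf
from clearml import Task

# Called again inside the subprocess; per the discussion above this should
# attach to the task created in main.py
task = Task.init(project_name="debug", task_name="subprocess repro")

writer = tf.summary.create_file_writer("logs")
with writer.as_default():
    for step in range(10):
        tf.summary.scalar("loss", 1.0 / (step + 1), step=step)
```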
```python
callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
)
```
Might be! What's the actual value you are passing there?
Maybe before everything else, can you share some background on the rationale for starting a new subprocess?
BoredHedgehog47 can you test this one? Is it close to your code ?
So in summary: subprocess calls appear to break ClearML tracking, even if I do Task.init() in both main.py and train.py.
Okay let me see if we can reproduce & fix this, it should not be long
I basically moved the Task.init() call below the imports
Okay that is odd, can you copy paste the before/after of the imports, so we can fix that?!
BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...
Well, if the "video" from TB is not in mp4/gif format then someone will have to encode it.
I was just pointing out that for the encoding part we might need an additional package.
Hi @<1664079296102141952:profile|DangerousStarfish38>
You mean spin the agent on multiple Windows machines? Yes, that is supported. I think it is limited to venv (i.e. not docker) mode, but other than that it should work out of the box.
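For reference, a hedged sketch of bringing up an agent on each Windows machine (the queue name is just an example):
```
pip install clearml-agent
clearml-agent init
clearml-agent daemon --queue default
```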
Hi LazyLeopard18
I suggest removing the trains.conf and running:
```
trains-init
```
At the end of the wizard it verifies the credentials, so you should be good to go.
I would also recommend using the machine IP and not localhost, as on some setups (Windows / VM etc.) localhost will not be bridged to the VM/Docker, but the machine IP will be.
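For reference, a sketch of the relevant api section in trains.conf (the IP and ports below are illustrative defaults; use the values the wizard prints for your server):
```
api {
    # use the machine IP rather than localhost
    web_server: http://192.168.1.50:8080
    api_server: http://192.168.1.50:8008
    files_server: http://192.168.1.50:8081
    credentials {
        access_key: "YOUR_ACCESS_KEY"
        secret_key: "YOUR_SECRET_KEY"
    }
}
```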