
Hi @<1558986821491232768:profile|FunnyAlligator17>
What do you mean by: "We are able to set_initial_iteration to 0 but not get_last_iteration"?
Are you saying that if your code looks like:
Task.set_initial_iteration(0)
task = Task.init(...)
and you abort and re-enqueue, you still have a gap in the scalars ?
Hi ZanyPig66
I used tensorboard as clearml claims to auto-capture tensorboard outputs, but it was a no go.
The auto TB logging should work out of the box, where is it failing ?
Also, task = Task.current_task()
Why aren't you using Task.init in the original script?
The idea is that you run your code on your machine (where the environment works), ClearML auto detects code + python packages + args etc.
Then you clone it in the UI and launch it on a remote machine.
What am I missing ...
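For reference, the local-run side is usually just a one-liner at the top of the training script, roughly like this sketch (project/task names here are placeholders):
from clearml import Task

# One call is enough for the local run; ClearML records the repo,
# uncommitted changes, installed python packages and argparse args.
task = Task.init(project_name="examples", task_name="local run to clone later")

# ... the rest of the training code runs unchanged ...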
Yes, it is fully supported and should work.
Could you share the full execution log ?
I can but that is not a configuration we would want to run with in production
Agreed, I just want to isolate the issue. I think this is the bottom python interface missing some configuration or environment variables
is there a way to visualize the pipeline such that this step is "stuck" in executing?
Yes there is, the pipeline plot (see the Plots section on the Pipeline Task) will show the current state of the pipeline.
But I have a feeling you have something else in mind?
Maybe add Tag on the pipeline Task itself (then remove it when it continues) ?
I'm assuming you need something that is quite prominent in the UI, so someone knows ?
(BTW I would think of integrating it with the slack monitor, to p...
So "wait" is a better metaphor for me
So I would do something like (I might have a few typos but that's the gist):
def post_execute_callback_example(a_pipeline, a_node):
    # type: (PipelineController, PipelineController.Node) -> None
    print('Completed Task id={}'.format(a_node.executed))
    # wait until model is tagged, then pass it as argument
    while True:
        found = Model.query_models(...)  # model filter here, including tag and project
        if found:
            ...
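A slightly fuller version of the same idea, as a sketch (the project name and the 'ready' tag below are placeholders):
from time import sleep

from clearml import Model
from clearml.automation import PipelineController


def post_execute_callback_example(a_pipeline, a_node):
    # type: (PipelineController, PipelineController.Node) -> None
    print('Completed Task id={}'.format(a_node.executed))
    # Block here until a model carrying the expected tag appears,
    # then the pipeline can pick it up and continue.
    while True:
        found = Model.query_models(project_name='my_project', tags=['ready'])
        if found:
            break
        sleep(60)  # poll once a minute
    print('Found model id={}'.format(found[0].id))
    # register via pipe.add_step(..., post_execute_callback=post_execute_callback_example)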
I think we added it somewhere in 0.14, anyhow I just checked the Logger doc, it is there now 🙂
Does this file look familiar to you?
file not found: archive/constants.pkl
Hi @<1523722267119325184:profile|PunySquid88> I guess it's a good thing we talk, because I believe that what you are looking for is already available :)
Logger.current_logger().report_media('title', 'series', iteration=1337, local_path='/tmp/bunny.mp4')
This will actually work on any file, that said, the UI might display the wrong icon (which will be fixed in the next version).
We usually think of artifacts as data you want to reuse, so all the files uploaded there are accessibl...
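For comparison, a minimal artifact upload would look roughly like this sketch (the artifact name and path are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact upload")
# Artifacts are registered on the task and can be pulled later by other
# tasks or pipeline steps.
task.upload_artifact(name="predictions", artifact_object="/tmp/predictions.csv")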
ReassuredTiger98 I guess this is a plotly feature, nonetheless I think you can shift the Y axis manually (click and drag)
last iteration is not reset and I still have a gap in my scalars
Hmm is this reproducible ? can you check with the latest clearml version (1.10.3) ?
btw: I'm assuming continue_last_task=0
I think I found the issue, the fact the agent is launching it causes it to ignore the "overridden" set_initial_iteration
Yes it is reproducible, do you want a snippet?
Already fixed 🙂 please ping tomorrow, I think an RC should be out soon with the fix
Hi PanickyMoth78
it was uploading fine for most of the day but now it is not uploading metrics and at the end
Where are you uploading metrics to (i.e. where is the clearml-server) ?
Are you seeing any retry logging on your console ?
packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events
This seems to be consistent with waiting for metrics to be flushed to the backend, but usually you will see retry messages on your console when that happens
I think this was the issue: None
And that caused TF binding to skip logging the scalars and from that point it broke the iteration numbering and so on.
ShallowGoldfish8 the models are uploaded in the background, task.close() is actually waiting for them, but wait_for_upload is also a good solution.
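i.e. roughly this pattern at the end of the script (a sketch, project/task names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="model training")
# ... training code that saves checkpoints (picked up as output models) ...

# Make sure background uploads (models, artifacts, metrics) have finished
# before the process exits; close() waits on them as well.
task.flush(wait_for_uploads=True)
task.close()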
where it seems to be waiting for the metrics, etc but never finishes. No retry message is shown as well.
From the description it sounds like there is a problem with sending the metrics?! The task.close
is waiting for all the metrics to be sent, and it seems like for some reason they are not, which is why close is waiting on them
A...
Something is off here ... Can you try to run the TB examples and the artifacts example and see if they work?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
it was uploading fine for most of the day
What do you mean by uploading fine most of the day ? are you suggesting the upload stuck to the GS ? are you seeing the other metrics (scalars console logs etc) ?
Thanks @<1523701713440083968:profile|PanickyMoth78> for pinging, let me check if I can find something in the commit log, I think there was a fix there...
The only weird thing to me is not getting any "connection warnings" if this is indeed a network issue ...
Actually that is less interesting, as it is quite straight forward
ShallowGoldfish8 I believe it was solved in 1.9.0, can you verify?
pip install clearml==1.9.0
Yes, but as you mentioned everything is created inside the lib, which means the python is not able to intercept the metrics so that clearml can send them to the backend.
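In that case the fallback is explicit reporting, e.g. something like this sketch (the values are dummies for illustration):
from clearml import Logger, Task

task = Task.init(project_name="examples", task_name="manual metric reporting")

# When the library computes metrics internally and nothing is auto-captured,
# they can still be reported explicitly, e.g. from a per-epoch callback:
for epoch, loss in enumerate([0.9, 0.7, 0.55]):
    Logger.current_logger().report_scalar(
        title="loss", series="train", value=loss, iteration=epoch
    )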
Hi FiercePenguin76
Is catboost actually using TB or is it just writing to .tfevent on its own ?
it certainly does not use tensorboard python lib
Hmm, yes I assume this is why the automagic is not working 🙂
Does it have a pythonic interface for the metrics ?
Hi SmallDeer34
ClearML automagical logging will work on the current python process. But in your example your Bash is running another python script (that has nothing to do with the original notebook), hence clearml automagic is not aware of it (i.e. it cannot "patch" the tensorboard calls).
In order to make it work, you should do something like:
from joeynmt import train
train.main(...)
Or something similar 🙂
Make sense ?
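Something along these lines (a sketch; the joeynmt entrypoint signature and config path below are assumptions, check the real API in the joeynmt docs):
from clearml import Task

# Init ClearML in the same python process that will run the training,
# so the TensorBoard / argparse bindings can actually be patched.
task = Task.init(project_name="examples", task_name="joeynmt training")

# Call the training entrypoint in-process instead of shelling out via Bash.
from joeynmt import train
train.main("configs/my_config.yaml")  # hypothetical argument / signature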
Okay here is a standalone code snippet that should be close enough (if I missed anything let me know):
import tempfile
from datetime import datetime
from pathlib import Path
import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task
task = Task.init(project_name="debug", task_name="test")
(ds_train, ds_test), ds_info = tfds.load(
'mnist',
split=['train', 'test'],
shuffle_files=True,
as_supervised=True,
with_info=True,
)
def normalize_img(image, labe...
Thanks BoredHedgehog47 !
And yes, if the Task.init() call was only in main.py then the TB inside the subprocess (train.py) would, as you perceived, not be captured.
Did you by any chance test calling Task.init in both main.py and train.py ?
I think the crux of the issue is the subprocess calls I removed.
That kind of makes sense, though if the subprocess function also had Task.init call it should have worked.
Would that be the setup to try to replicate?
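A sketch of the setup being discussed, for reference (file, project and task names here are just placeholders; two files, each calling Task.init):
# --- main.py ---
import subprocess
import sys

from clearml import Task

task = Task.init(project_name="debug", task_name="main process")
# The training runs in a separate python process, so any TensorBoard
# writes happen there, not in this process.
subprocess.check_call([sys.executable, "train.py"])

# --- train.py ---
from clearml import Task

task = Task.init(project_name="debug", task_name="train subprocess")
# ... TensorBoard / tf.summary writes here should now be auto-captured ...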