I commented out the upload_artifact at the end of the code and it finishes correctly now
upload_artifact caused the "failed" issue ?
Hmm... any idea on what's different with this one ?
MagnificentSeaurchin79 let's make sure the basics work.
Can you see the 3D plots under the Plot section ?
Regarding the Tensors, could you provide a toy example for us to test ?
DeliciousBluewhale87 Is it ML or DL serving you are after ?
DeliciousBluewhale87 basically any solution that is compliant with the S3 protocol will work. An example: output_uri="s3://<host>:PORT/bucket/folder"
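For instance, a minimal sketch of pointing a task's output at an S3-compatible endpoint (the host, port, and bucket here are hypothetical):
```python
from clearml import Task

# Hypothetical endpoint details for an S3-compatible store (e.g. MinIO)
task = Task.init(
    project_name="examples",
    task_name="s3 output demo",
    output_uri="s3://my-storage-host:9000/my-bucket/models",
)
```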
Are you sure Nexus supports this protocol ?
I "think" nexus sits on top of a storage solution (like am object storage), meaning we can use the same storage solution Nexus is using.
Just to clarify, we do not support the artifactory protocol Nexus provides for storing models/artifacts. But we do support it as a source for Python packages used by the a...
Hi DeliciousBluewhale87
Yes that should have worked, can you verify the task status ?
print(Task.get_task(...).get_status())
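Spelled out as a runnable snippet (the task ID is a placeholder):
```python
from clearml import Task

# Placeholder ID; copy the actual ID of the task from the UI
task = Task.get_task(task_id="<your-task-id>")
print(task.get_status())  # e.g. "completed", "failed", "in_progress"
```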
Hi DeliciousBluewhale87
I think we had a docker that does exactly that, and then you would spin up the docker as a k8s service. Is this what you are referring to?
The warning just lets you know the current process stopped and it is being launched on a remote machine.
What am I missing? Is the agent failing to run the job that you create manually ?
(notice that when creating a job manually, there is no "execute_remotely", you just enqueue it, as it is not actually "running locally")
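A minimal sketch of the execute_remotely flow, assuming an agent is listening on a hypothetical "default" queue:
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote launch demo")
# Stops the local process and re-launches this task on an agent
# serving the (hypothetical) "default" queue
task.execute_remotely(queue_name="default", exit_process=True)
```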
Make sense ?
This is odd, and it is marked as failed ?
Are all the Tasks marked failed, or is it just this one ?
Hmm, that should not make a difference.
Could you verify it still doesn't work with TF 2.4 ?
Oh task_id is the Task ID of step 2.
Basically the idea is: you run your code once (let's call it debugging / programming); that run creates a task in the system, and the task stores the environment definition and the arguments used. Then you can clone that Task and launch it on another machine using the Agent (which basically will set up the environment based on the Task definition and will run your code with the new arguments). The Pipeline is basically doing that for you (i.e. cloning a task chan...
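A minimal sketch of that clone-and-launch flow (project, task, and queue names are hypothetical):
```python
from clearml import Task

# Find the task created by the original ("debugging") run
template = Task.get_task(project_name="examples", task_name="my training run")

# Clone it, override an argument, and hand it to an agent
cloned = Task.clone(source_task=template, name="my training run (clone)")
cloned.set_parameter("Args/learning_rate", 0.001)
Task.enqueue(cloned, queue_name="default")
```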
Hi MagnificentSeaurchin79
Could you test with the tensorflow toy example?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorboard_toy.py
```python
from time import sleep

import tqdm
from clearml import Task

task = Task.init(project_name='debug', task_name='test tqdm cr cl')
print('start')
for i in tqdm.tqdm(range(100), dynamic_ncols=True):
    sleep(1)
print('done')
```
This code snippet works as expected (console will show the progress at the flush interval without values in between). What's the difference ?!
What's the clearml version? Is this with the latest from GitHub?
In any case, do you have any suggestion of how I could at least hack tqdm to make it behave? Thanks
I think I know what the issue is: it seems tqdm is using an escape sequence instead of a plain CR. This is the 1b 5b 41 (ESC [ A, i.e. cursor-up) sequence I see in the binary log.
Let me see if I can hack something for you to test 🙂
https://stackoverflow.com/questions/60860121/plotly-how-to-make-an-annotated-confusion-matrix-using-a-heatmap
MagnificentSeaurchin79 see plotly example here:
https://allegro.ai/clearml/docs/docs/examples/reporting/plotly_reporting.html
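A minimal sketch tying the two links together, assuming the standard Logger.report_plotly call:
```python
import plotly.graph_objects as go
from clearml import Task

task = Task.init(project_name="examples", task_name="plotly heatmap demo")

# A toy confusion-matrix style heatmap, as in the Stack Overflow link above
fig = go.Figure(data=go.Heatmap(z=[[5, 1], [2, 7]],
                                x=["pred 0", "pred 1"],
                                y=["true 0", "true 1"]))
task.get_logger().report_plotly(title="confusion matrix", series="val",
                                figure=fig, iteration=0)
```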
WittyOwl57 this is what I'm getting on my console (notice the new lines! Not a single one is overwritten, as I would expect):
```
Training: 10%|█         | 1/10 [00:00<?, ?it/s]
Training: 20%|██        | 2/10 [00:00<00:00, 9.93it/s]
Training: 30%|███       | 3/10 [00:00<00:00, 9.89it/s]
Training: 40%|████      | 4/10 [00:00<00:00, 9.87it/s]
Training: 50%|█████     | 5/10 [00:00<00:00, 9.87it/s]
Training: 60%|██████    | 6/10 [00:00<00:00, 9.88it/s]
Training: 70%|███████   | 7/10 [00:00<00...
```
WittyOwl57 I can verify the issue reproduces! 🙂
And I know what happens: TQDM is sending an "up arrow" escape sequence. If you are running inside bash, that behaves like CR (i.e. moves the cursor to the beginning of the line), but when running inside other terminals (like PyCharm or the ClearML log) this "arrow key" is just a unicode character to print, it does nothing, and we end up with multiple lines.
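A toy illustration of the difference (plain Python, no tqdm):
```python
import sys
import time

# A real CR ("\r") rewrites the current line in place; the ANSI
# cursor-up sequence ("\x1b[A") is only interpreted by terminals
# that understand escape codes, and prints nothing useful elsewhere.
for i in range(5):
    sys.stdout.write("\rprogress: %d/5" % (i + 1))
    sys.stdout.flush()
    time.sleep(0.2)
print()
```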
Let me see if we can fix it 🙂
Okay, some progress, so what is the difference ?
Any chance the issue can be reproduced with a small toy code ?
Can you run the tqdm loop inside the code that exhibits the CR issue ? (maybe some initialization thing that is causing it to ignore the value?!)
Can you reproduce this behavior outside of lightning? or in a toy example (because I could not)
What I'd really want is the same behaviour in the console (one smooth progress bar) and one line per epoch in the logs; high hopes, right?
I think they send some "odd" character instead of CR, otherwise I cannot explain the difference.
Can you point to a toy example demonstrating the same issue ?
Also I just tried the pytorch-lightning RichProgressBar (not yet released) instead of the default (which is unfortunately based on tqdm) and it works great.
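For reference, a minimal sketch of swapping in RichProgressBar (assuming a pytorch-lightning build that already ships it):
```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import RichProgressBar

# Replaces the default tqdm-based progress bar with the rich-based one
trainer = pl.Trainer(callbacks=[RichProgressBar()])
```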
Yey!
I mean, just add the toy tqdm loop somewhere before starting the lightning train function. I just want to verify that it works, or maybe there is something in the specific setup happening in real time that changes it.
I understand that it uses time in seconds when there is no report being logged... but it has already logged three times...
Hmm could it be the reporting started 3 min after the Task started ?
It reverts back, but it cannot "delete" the last reported iteration value.
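If you want the x-axis to stay in iterations, a minimal sketch using the standard Logger API (names are illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="explicit iteration demo")
logger = task.get_logger()

for it in range(3):
    # Passing an explicit iteration keeps the scalar x-axis in iterations
    # instead of falling back to seconds-from-start
    logger.report_scalar(title="loss", series="train",
                         value=1.0 / (it + 1), iteration=it)
```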
Make sense ?