Sadly no :)
(I mean you could quickly write a reader for TB and report it, but it is not built into the SDK)
Did you run `clearml-init` after the pip install?
How do I reproduce it? When I use add_step with the wrong parameter it throws an exception before the pipeline even starts ...
Follow-up question: how does clearML "inject" the argparse arguments before the task is initialized?
It patches the actual `parse_args` call. To make sure it works, you just need to make sure `clearml` was imported before the actual call takes place.
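To illustrate the general idea of patching `parse_args`, here is a minimal standard-library sketch. It is not ClearML's actual implementation: `OVERRIDES`, `install_patch`, `remove_patch` and `parse_training_args` are hypothetical names made up for this example, and the override values stand in for whatever would be stored on a task.

```python
import argparse

# Hypothetical stand-in for argument values stored elsewhere (e.g. on a task).
OVERRIDES = {"lr": "0.5"}

# Keep a reference to the real method so it can still be called and restored.
_original_parse_args = argparse.ArgumentParser.parse_args


def _patched_parse_args(self, args=None, namespace=None):
    # Run the real parser first, then overwrite any argument we have a value for.
    parsed = _original_parse_args(self, args, namespace)
    for action in self._actions:
        if action.dest in OVERRIDES:
            value = OVERRIDES[action.dest]
            # Reuse the argument's own type converter when it is a callable.
            if callable(action.type):
                value = action.type(value)
            setattr(parsed, action.dest, value)
    return parsed


def install_patch():
    # Has to run before the script calls parse_args(), otherwise it has no effect.
    argparse.ArgumentParser.parse_args = _patched_parse_args


def remove_patch():
    argparse.ArgumentParser.parse_args = _original_parse_args


def parse_training_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    return parser.parse_args(argv)
```

This also shows why the import order matters: a parser that already ran before `install_patch()` was called never sees the overrides.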
I had to do another workaround since when `torch.distributed.run` called its `ArgumentParser`, it was getting the arguments from my script (and from my task) instead of the ones I passed it
Are you saying...
BTW: you still can get race/starvation cases... But at least no crash
MelancholyElk85
How do I add files without uploading them anywhere?
The files themselves need to be packaged into a zip file (so we have an immutable copy of the dataset). This means you cannot "register" existing files (in your example, files on your S3 bucket?!). The idea is to make sure your dataset is protected against changes on the one hand, but on the other to allow you to change it, and only store the changeset.
Does that make sense ?
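A rough way to picture "only store the changeset" is to compare content hashes between a parent version and a child version. The sketch below is only an illustration under that assumption, not how ClearML implements it: `file_hashes` and `changeset` are hypothetical helpers, and in-memory byte strings stand in for real files.

```python
import hashlib


def file_hashes(files):
    """Map each relative path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changeset(parent_files, child_files):
    """Return which paths a child version adds, modifies, or removes."""
    parent = file_hashes(parent_files)
    child = file_hashes(child_files)
    added = sorted(p for p in child if p not in parent)
    modified = sorted(p for p in child if p in parent and child[p] != parent[p])
    removed = sorted(p for p in parent if p not in child)
    return {"added": added, "modified": modified, "removed": removed}


# Version 1 is the immutable parent; version 2 changes one file and adds one.
v1 = {"a.csv": b"1,2,3", "b.csv": b"4,5,6"}
v2 = {"a.csv": b"1,2,3", "b.csv": b"4,5,7", "c.csv": b"7,8,9"}
delta = changeset(v1, v2)
```

Only the paths listed under `added` and `modified` would need to be packaged for the new version; unchanged files stay with the parent.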
Sure, just set up `clearml-agent` on any machine :)
(The app.community server is the control plane)
GreasyPenguin14 yes there is :)
https://github.com/allegroai/clearml/issues/209
Set environment variable CLEARML_NO_DEFAULT_SERVER=1
can you bump me to that thread?
https://clearml.slack.com/archives/CTK20V944/p1630610430171200
I realise I'll need to catalogue all the dataset ids created by ppl separately on a spreadsheet.
Okay this part I missed, why would you need to add additional "catalog" when you have the UI?
TenseOstrich47 this sounds like a good idea.
When you have a script, please feel free to share, I think it will be useful for other users as well :)
With remote_execution it is `command="[...]"`, but on local it is `command='train'` like it is supposed to be.
I'm not sure I follow, could you expand ?
And is "requirements-dev.txt" in your git root folder?
What is your clearml-agent version?
Really stoked to start using it and introduce a more sane ML ops workflow at my workplace lol.
Totally with you :)
... would that be a `Model Registry Store` plugin?
YES please ❤
So we actually just introduced "Applications" into the clearml free tier, https://app.community.clear.ml/applications
Allowing you to take any Task in the system and make it an "application" (a python script running on one of the service agents), with the ability to configu...
And you have the exact same folder structure / content, and server A/B give a different set of experiments ?
(is serverB empty, meaning no experiments at all?)
ReassuredTiger98 I think it is using `moviepy` for the encoding... No?
from what I gather there is a lightly documented concept
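Yes ... :) the reason for it is that actually one could do:
` from clearml import PipelineDecorator

@PipelineDecorator.pipeline(...)
def pipeline(i):
    ...

if __name__ == '__main__':
    # each call launches the pipeline again with a different argument
    pipeline(0)
    pipeline(1)
    pipeline(2) `
Basically rerunning the pipeline 3 times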
This support was added as some users found a use case for it, but I think this would be a rare one
Hi Guys, just curious here, what was the final issue?
Also out of curiosity, what does that mean? "1.12.2 because some bug that make fastai lag 2x" ?
there was a problem with index order when converting from pytorch tensor to numpy array
HealthyStarfish45 I'm assuming you are sending numpy to report_image (which makes sense). If you want to debug it, you can also test tensorboard add_image or matplotlib imshow; both will send debug images
If you passed the correct path it should work (if it fails it would have failed right at the beginning).
BTW: I think it is clearml-agent --config-file <file here> daemon ...
Container environment setup overhead?
GreasyPenguin14 GrittyKangaroo27 the new release contains a fix, could you verify it solves the issue in your scenario as well? There is now a smart timeout to detect the inconsistent state, which means the close/exit procedure might be delayed (10sec) instead of hanging in these specific rare scenarios.
Hi GrievingKoala83
Two tasks are created, but the training does not begin, both tasks are in perpetual running.
Can you print something after the `task.launch_multi_node(args.nodes)` call? I'm assuming the two Tasks are running and are blocked on the "Trainer" class
If `args.gpus=2` and `args.nodes=2` are specified, three tasks are created.
This is really odd, can you add some prints with task id and rank after the ...
Hi PunyGoose16 ,
next release includes it (eta after this weekend :) )
Not sure on the cause but if you do:
mp.set_start_method('fork', force=True)
There is no semaphore leakage
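For context, a minimal sketch of where such a call would sit in a script, assuming `mp` is the standard library `multiprocessing` module (it could equally be `torch.multiprocessing` in the original setup). `configure_start_method` and `worker` are hypothetical names for this example, and the fallback for platforms without `fork` is my addition.

```python
import multiprocessing as mp


def configure_start_method(preferred="fork"):
    """Force the preferred start method when the platform supports it."""
    # 'fork' is only available on POSIX systems; elsewhere keep the default.
    if preferred in mp.get_all_start_methods():
        mp.set_start_method(preferred, force=True)
    return mp.get_start_method()


def worker(queue):
    # Trivial child-process work, just to show the process starts and reports back.
    queue.put("done")


if __name__ == "__main__":
    # Set the start method once, early, before any Process or Pool is created.
    method = configure_start_method()
    queue = mp.Queue()
    proc = mp.Process(target=worker, args=(queue,))
    proc.start()
    print(method, queue.get())
    proc.join()
```

`force=True` is what allows overriding a start method that was already set elsewhere in the process.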