We used subprocess for it, ...
Popen? os.system? fork?
Hi @<1523701949617147904:profile|PricklyRaven28>
Sorry, we missed that one
we need to invoke it with `accelerate launch`, so we use `subprocess.run`
So you have two options: either you change the script entry of the Task from your `script.py` to `-m accelerate launch script.py`, or you manually do that inside your entry point (i.e. call accelerate launch)
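For the second option, a minimal sketch of what the entry point could do (the script name and argument forwarding are placeholders):
```python
import subprocess
import sys

# hypothetical entry point: hand control over to `accelerate launch`
# so the actual training script runs under Accelerate's launcher
result = subprocess.run(
    ["accelerate", "launch", "script.py", *sys.argv[1:]],
    check=False,  # inspect the return code ourselves
)
sys.exit(result.returncode)
```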
BTW, I "think" we added an "auto detect" for it, so that if you launched it manually this wa...
Hi @<1523701601770934272:profile|GiganticMole91>
to use https although the scheduled task is using ssh for git?
Sure, as long as it has git_user / git_pass configured in the agent's clearml.conf, it will automatically convert the ssh to https git pull
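Something along these lines in the agent's clearml.conf (a sketch; the values are placeholders):
```
agent {
    # credentials the agent uses when it rewrites ssh git URLs to https
    git_user: "my-git-user"
    git_pass: "my-git-token"
}
```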
Thanks @<1523701601770934272:profile|GiganticMole91> !
(As usual MS decided to invent a new "standard")
I'll make sure the guys look at it and get an RC out with a fix
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117.
The thing is, the agent used to do all the heavy parsing because pytorch never actually had a pip compatible artifactory
But now they do, so the agent basically passed the parsing to pip and just added the correct additional pytorch pip repo.
It seems we need to switch back... wdyt?
if this is the case, pytorch really messed things up; this means they removed packages
Let me check something
I am not sure what switching back would solve; here the wheel should have been correct, it's just that the architecture of the card is incompatible
So I tested the "old" code that did the parsing and matching, and it did resolve to the correct wheel (i.e. found that there is no cu117 build, only cu115, and installed that one)
I think we should switch back, and have a configuration option to control which mechanism the agent uses, wdyt?
The wheel you download from pip, for example torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl, is actually both CPU and CUDA 117
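As a quick sanity check after installation, something like this shows which build pip actually resolved:
```python
import torch

# which torch build got installed, and whether this GPU is usable with it
print("torch:", torch.__version__)           # e.g. "1.11.0+cu115"
print("built for CUDA:", torch.version.cuda)
print("GPU usable:", torch.cuda.is_available())
```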
Hi @<1523701066867150848:profile|JitteryCoyote63>
RC is out,
pip3 install clearml-agent==1.5.3rc3
Then in the agent's clearml.conf set pytorch_resolve: "direct"
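A sketch of where that would go (assuming the option lives under agent.package_manager, like the other package-manager settings):
```
agent {
    package_manager {
        # "direct": the agent parses and matches the PyTorch wheel itself,
        # instead of delegating the resolution to pip
        pytorch_resolve: "direct"
    }
}
```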
Let me know if it worked
Hi @<1523701066867150848:profile|JitteryCoyote63>
Thank you for bringing it up! Can you verify with the latest clearml-agent 1.5.3rc2?
I looked at your task log on the GitHub issue. It seems the main issue is that your notebook is not stored as Python code. Are you running it in Jupyter Notebook, or is it IPython you are running it in? Is this reproducible? If so, what are the Jupyter, Python, and OS versions?
Are they ephemeral, or later used by other Tasks, executions, etc.?
For example: configuration files are specific to an execution, and someone will edit them.
Initial weights files are something that multiple executions might need, and they will be used to restore an execution. Data, even if changing, is usually used by multiple executions/tasks, etc.
It seems like you treat these files as "configurations", is that right ?
Hmm, so what I'm thinking is "extending" the capabilities of the "configuration" section (as it seems this is the right context): allow uploading a bunch of files (with the same mechanism as artifacts) as a zip file, and in the editable "configuration" section keep the URL storing the zip, together with the target folder. wdyt?
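To make it concrete, a hypothetical sketch of that flow using today's APIs (the names config_files / config_target are made up):
```python
from pathlib import Path
from clearml import Task

task = Task.init(project_name="examples", task_name="config-files-demo")

# a folder passed to upload_artifact is zipped and uploaded automatically,
# which is roughly the "bunch of files as a zip" part of the proposal
task.upload_artifact(name="config_files", artifact_object=Path("./configs"))

# the "target folder" part, kept as an editable user property for now
task.set_user_properties(config_target="./configs")
```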
So one can override the queue ID but not the worker
apparently ... I can't think of a good reason for that actually ...
It's always the details... Is the new Task running inside a new subprocess ?
basically there is a difference between:

- a remote task spawning new tasks (as subprocesses, or as jobs on a remote machine), with the remote task still running
- a remote task being replaced by a spawned task (same process?!)

UnevenDolphin73 am I missing a 3rd option? Which of these is your case?
P.S. I have a suspicion that there might be a misuse of "Task" here?! What are you considering a Task? (from clearml's perspective a Task...
hmm, yes, but then this is kind of a hacky solution... The original #340 was about packaging source code that was not in git... Now we want to add "data" (even if ephemeral) on top of it, no?
My thinking is to somehow make sure a Task can reference a "Dataset" to be downloaded by the agent before it starts?!
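Roughly what that reference could look like from the task's side today (project/dataset names are placeholders):
```python
from clearml import Dataset

# fetch (or reuse a cached copy of) the dataset before the real work starts;
# when running under an agent, this goes at the top of the entry point
local_path = Dataset.get(
    dataset_project="examples",      # placeholder project
    dataset_name="my_input_files",   # placeholder dataset name
).get_local_copy()
print("dataset available at:", local_path)
```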
Hi WittyOwl57
I think what happens is it auto-logs the joblib load/save calls; these calls track models used/created by the code, and attach them to the model repository entries representing these models.
I'm assuming there are multiple load/save calls, and there are multiple model instances pointing to the same local file "file:///tmp/...". The warning basically says it is re-registering existing models.
Make sense ?
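If the re-registration warnings are just noise for you, a minimal sketch of turning off the joblib auto-logging at init (assuming the dict form of auto_connect_frameworks):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="joblib-demo",
    # keep other frameworks auto-logged, but skip joblib load/save tracking
    auto_connect_frameworks={"joblib": False},
)
```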
Hi @<1671689437261598720:profile|FranticWhale40>
Are you positive the Triton container finished syncing ?
Could you provide the docker logs (both the serving and the triton containers)?
What is the clearml-serving version you are using ?
Could you add a print in the "preprocess" function, just to validate you are getting to the correct model version ?
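Something like this, based on the Preprocess class interface used in the clearml-serving examples (the print is just a temporary debug sketch):
```python
from typing import Any


class Preprocess(object):
    # same shape as the preprocess.py examples shipped with clearml-serving
    def preprocess(self, body: dict, state: dict, collect_custom_statistics_fn=None) -> Any:
        # temporary debug print: shows up in the serving container log and
        # confirms the request reached this model version's preprocess code
        print("preprocess called, body:", body, flush=True)
        return body
```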
Okay we got to the bottom of this. This was actually because of the load balancer timeout settings we had, which was also 30 seconds and confusing us.
Nice!
btw:
in the clearml.conf we put this:
for future reference, you are missing the sdk section:
sdk.http.timeout: 300
(the `.` notation works as well as `{}`)
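i.e. these two clearml.conf forms are equivalent:
```
# dot notation
sdk.http.timeout: 300

# equivalent nested-section notation
sdk {
    http {
        timeout: 300
    }
}
```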
Hi @<1562973083189383168:profile|GrievingDuck15>
Thanks for noticing, yes the api is always versioned, we should make it clear in the docs. Also if you need the latest one, use version 999; it will default to the latest one the server can support
"Requested version: 2.28, Used version 1.0" for some reason
This is fine, it means there is no change in that API
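For example, a sketch of hitting a versioned endpoint directly (the server URL is a placeholder, and authentication headers may be required depending on your server setup):
```python
import requests

# v999 is resolved by the server to the newest version of the endpoint it
# supports; debug.ping is a lightweight endpoint to test against
resp = requests.post("https://api.clear.ml/v999/debug.ping")
print(resp.status_code, resp.text)
```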
Thank you so much @<1572395184505753600:profile|GleamingSeagull15> !
looks like your faq.clear.ml site is missing from your main site's sitemap files,
Thank you for noticing! I'll check with the webdevs
Also missing the robots meta tag on that site,
🙏
Last tip is to add a link on the faq.clear.ml site back to clear.ml for search index relevancy (connects the two sites as being related in content...
Hi EnthusiasticCoyote38
But once one process finished, it changed the task status to completed. Maybe you know some safe way to deal with such a situation? Or maybe the best way is to check the task status before uploading the object?
Well, you can actually forcefully set the state of the Task to running, then add artifacts, then close it?
would that work?
```
my_other_task.reload()                        # refresh the task state from the server
my_other_task.mark_started(force=True)        # forcefully set the status back to "running"
my_other_task.upload_artifact(...)            # artifact uploads are now accepted
my_other_task.flush(wait_for_uploads=True)    # block until the uploads finish
my_othe...
```
Hi MiniatureCrocodile39
Which packages do you need to run the viewer? I suppose a dicom reader is a must?
and since the update the docs seem to be a bit off but I think I got it
Working on a whole new site 😉
sorry, I meant the point where you select the interpreter for PyCharm
Oh I see...
GrumpyPenguin23 could you help and point us to an overview/getting-started video?
however setting up the interpreter on PyCharm is different on Mac for some reason, and the video just didn't match what I see
MiniatureCrocodile39 Are you running on a remote machine (i.e. PyCharm + remote ssh) ?