Oh no, I just saw the message @<1541954607595393024:profile|BattyCrocodile47> is this still relevant?
Thanks @<1657918706052763648:profile|SillyRobin38> this is still in the internal git repo (we usually do not develop directly on github)
I want to get familiar with it and, if possible, contribute to the project.
This is a good place to start: None
we are still debating whether to use it directly or as part of Triton ( None ), would love to get your feedback
In short, I was not able to do it with Task.clone and Task.create , the behavior differs from what is described in the docs and docstrings (this is another story - I can submit an issue on GitHub later)
The easiest is to use task_overrides
Then pass: task_overrides = {'script': {'diff': '', 'branch': 'main'}}
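For context, a minimal sketch of how that override mapping is shaped (the exact keys under 'script' depend on the Task structure on your server; 'diff' and 'branch' are the ones mentioned above). Note that dict('script': ...) is not valid Python syntax; use a dict literal or the keyword form:

```python
# task_overrides that clears any recorded uncommitted diff and pins the
# cloned task to the 'main' branch.
task_overrides = {
    "script": {
        "diff": "",        # drop any uncommitted changes stored on the task
        "branch": "main",  # force the clone to run from the main branch
    }
}

# Keyword form, equivalent to the literal above:
task_overrides_kw = dict(script=dict(diff="", branch="main"))

print(task_overrides == task_overrides_kw)  # → True
```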
You can check the example here, just make sure you add the callback and you are good to go 🙂
https://github.com/allegroai/trains/blob/master/examples/frameworks/keras/keras_tensorboard.py#L107
A single query will return if the agent is running anything, and for how long, but I do not think you can get the idle time ...
@<1541954607595393024:profile|BattyCrocodile47>
Is that instance only able to handle one task at a time?
You could have multiple agents on the same machine, each one with its own dedicated GPU, but you will not be able to change the allocation (i.e. now I want 2 GPUs on one agent) without restarting the agents on the instance. In either case, this is for a "bare-metal" machine, and in the AWS autoscaler case, this goes under "dynamic" GPUs (see above)
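A sketch of what that looks like on the command line (queue names here are placeholders; pick whatever fits your setup):

```shell
# One agent per GPU on the same machine:
clearml-agent daemon --gpus 0 --queue single_gpu_queue --detached
clearml-agent daemon --gpus 1 --queue single_gpu_queue --detached

# To change the allocation (e.g. one agent with both GPUs), stop the
# agents and relaunch, for example:
# clearml-agent daemon --gpus 0,1 --queue dual_gpu_queue --detached
```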
As a result, I need to do something which copies the files (e.g. cp -r or StorageManager.upload_folder('b', 'a')), but this is expensive
You are saying the copy is just wasteful (but you do have the files locally)?
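If the files are already local and the copy itself is the only concern, the cp -r equivalent can be sketched with the standard library (paths below are placeholders, and the demo folder is made up for illustration):

```python
import shutil
import tempfile
from pathlib import Path

def copy_tree(src: str, dst: str) -> int:
    """Copy a folder tree locally (the cp -r equivalent); return the file count."""
    shutil.copytree(src, dst, dirs_exist_ok=True)  # Python 3.8+
    return sum(1 for p in Path(dst).rglob("*") if p.is_file())

# Tiny self-contained demo using a temporary folder:
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "b"
    src.mkdir()
    (src / "weights.bin").write_bytes(b"\x00" * 16)
    n = copy_tree(str(src), str(Path(tmp) / "a"))
    print(n)  # → 1
```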
So if any step corresponding to 'inference_orchestrator_1' fails, then 'inference_orchestrator_2' keeps running.
GiganticTurtle0 I'm not sure it makes sense to halt the entire pipeline if one step fails.
That said, how about using the post_execution callback? There you can check whether the step failed and, if so, stop the entire pipeline (and any running steps). What do you think?
Sometimes it is working fine, but sometimes I get this error message
@<1523704461418041344:profile|EnormousCormorant39> can I assume there is a gateway at --remote-gateway <internal-ip> ?
Could it be that this gateway has some network firewall blocking some of the traffic ?
If this is all local network, why do you need to pass --remote-gateway ?
Hi RipeGoose2
when I'm using the set_credentials approach does it mean the trains.conf is redundant? if
Yes, this means there is no need for trains.conf ; all the important stuff (i.e. server + credentials) you provide from code.
BTW: When you execute the same code (i.e. code with the set_credentials call) under an agent, the agent's configuration will override what you have there, so you will be able to run the Task later either on prem/cloud without needing to change the code itself 🙂
Since this fix is all about synchronizing different processes, we wanted to be extra careful with the release. That said I think that what we have now should be quite stable. Plan is to have the RC available right after the weekend.
PleasantGiraffe85
it took the repo from the cache. When I delete the cache, it can't get the repo any longer.
what error are you getting ? (are we talking about the internal repo)
FrustratingWalrus87 If you need an interactive one, I think there is currently no alternative to TB's tSNE, it is truly great 🙂
That said you can use plotly for the graph:
https://plotly.com/python/t-sne-and-umap-projections/#project-data-into-3d-with-tsne-and-pxscatter3d
and report it to ClearML with Logger.report_plotly :
https://github.com/allegroai/clearml/blob/e9f8fc949db7f82b6a6f1c1ca64f94347196f4c0/examples/reporting/plotly_reporting.py#L20
Should I use update_weights_package ?
Yes
BTW, config.pbtxt you should pass when "registering" the endpoint with the CLI
one can containerise the whole pipeline and run it pretty much anywhere.
Does that mean the entire pipeline will be running on the instance spinning the container ?
From here: this is what I understand:
https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html
My thinking was I can use one command and run all steps locally while still registering all "nodes/functions/inputs/outputs etc" with clearml such that I could also then later go into the interface and clone an...
Hmm, can you test with the latest RC? Or even better, from GitHub (that said, the GitHub version will break the interface, as we upgraded the pipeline 🙂)
BTW: 0.14.3 solved the issue you are referring to, so you can import trains before / parsing the args without an issue. Regarding passing project/name as parameters, a few thoughts: (1) you can always rename / move projects from the UI (2) If you are running it with trains-agent there is no meaning to these arguments, as by definition the Task was already created... Maybe we should give an option to exclude a few arguments from argparser, I think this topic came up a few times... What d...
ElegantKangaroo44 I think TrainsCheckpoint would probably be the easiest solution. I mean it will not be a must, but another option to deepen the integration, and allow us more flexibility.
Now in case I needed to do it, can I add new parameters to cloned experiment or will these get deleted?
Adding new parameters is supported 🙂
It appears that "they sell that" as Triton Management Service, part of
. It is possible to do it through their API, but it would need to be explicit.
We support that, but this is Not dynamically loaded; this is just removing and adding models, it does not unload them from the GRAM.
That's the main issue: when we unload the model, it is unloaded. To do dynamic, they need to be able to save it in RAM and unload it from GRAM; that's the feature that is missing on all Triton deployme...
So you are saying 156 chunks, with each chunk about ~6500 files ?
but it is still not able to run any task after I abort and rerun another task
When you "run" a task you are pushing it to a queue, so how come a queue is empty? what happens after you push your newly cloned task to the queue ?
Hi @<1715175986749771776:profile|FuzzySeaanemone21>
and then run "clearml-agent daemon --gpus 0 --queue gcp-l4" to start the worker.
I'm assuming the docker service cannot spin a container with GPU access, usually this means you are missing the nvidia docker runtime component