
That's the theory, but I still don't see it there.
Hover over the border (I would suggest using full screen, i.e. maximizing the window).
I know there is a possibility to set up a budget - for example, the number of seconds after which the optimization stops. But is it possible to specify a boolean condition for when the work should stop?
RoundMosquito25 you mean stopping when you reach a limit such as loss < threshold, or something similar?
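If so, one way is to poll the optimizer yourself and stop it once the condition holds. A hedged sketch, assuming an already-configured clearml.automation.HyperParameterOptimizer instance named optimizer, with placeholder metric names and threshold:

```python
from time import sleep

# Hedged sketch: `optimizer` is assumed to be an already-configured
# clearml.automation.HyperParameterOptimizer instance; the metric title/series
# ("validation", "loss") and the threshold are placeholders for whatever your
# training tasks actually report.
loss_threshold = 0.05

optimizer.start()
while True:
    top = optimizer.get_top_experiments(top_k=1)   # best tasks found so far
    if top:
        metrics = top[0].get_last_scalar_metrics()
        best_loss = metrics.get("validation", {}).get("loss", {}).get("last")
        if best_loss is not None and best_loss < loss_threshold:
            optimizer.stop()                       # stop once the condition is met
            break
    sleep(60)                                      # poll once a minute
```

Combining this with the time budget as a safety net is probably a good idea.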
Hi PlainSealion45
- I used this initial model to create the endpoint with
model add
command.
I think that the initial model needs to be added with model auto-update,
not with model add.
Basically, do not call model add - it is static, always using the model ID specified (you can deploy new models by manually calling model add on the same endpoint and specifying a different model ID, but again, that is manual).
To Automatically have the m...
Yes! That's exactly what I meant. As you can see, the Triton backend was not able to load your model, I'm assuming because it was not converted to TorchScript like we do in the original example:
https://github.com/allegroai/clearml-serving/blob/6c4bece6638a7341388507a77d6993f447e8c088/examples/pytorch/train_pytorch_mnist.py#L136
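For reference, a minimal sketch of the conversion (model and example_input are placeholders; the linked script shows the actual code):

```python
import torch

# Hedged sketch: converting the trained model to TorchScript so the Triton
# backend can load it. `model` and `example_input` are placeholders; the
# linked training script shows the actual code.
model.eval()
scripted_model = torch.jit.trace(model, example_input)  # or torch.jit.script(model)
scripted_model.save("serving_model.pt")                 # this is the file to register/upload
```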
I notice that, in my Serving Service situated in the DevOps project, the "endpoints" section doesn't seem to get updated when I tag a new model with "released".
It takes a few minutes (I think 5 min is the default) to update.
Notice that you need to add the model with
model auto-update --engine triton --endpoint "test_model_pytorch_auto" ...
Not with model add (if for some reason that does not work, please let me know).
No need to pass the model version (i.e. the 1)
you can ...
I am not sure this is related to the fact that the model is not correctly converted to TorchScript.
Because Triton only supports TorchScript (not regular torch models) 🙂
MelancholyChicken65 found it! Thank you for finding this issue.
I'm hoping to get an update soon 🙂
MelancholyChicken65 what's the clearml-serving version you are using? (I believe this issue was fixed in 1.2)
I see, let me check the code and get back to you, this seems indeed like an issue with the Triton configuration in the model monitoring scenario.
Hmm, is "model_monitoring_eps" another version of the model that does not have all the properties of the "original" one?
Check whether the fileserver docker container is running with docker ps.
PompousParrot44 with pleasure. If during your search for a solution you come across something that solves it and might integrate well with the agent, do not hesitate to suggest it :)
Hi PompousParrot44
Well, this kind of control is tricky. If you don't mind processes "fighting over the CPU", you can just spin up two trains-agents in CPU mode. It will work as long as they have different TRAINS_WORKER_NAME values.
The other option (which might be a bit of overkill) is to use K8s, which will set the CPU % for the entire agent.
What do you think?
PompousParrot44 now that I think about it, you might be able to limit the CPU affinity, would that help?
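For example (a hedged, Linux-only sketch; the core IDs are placeholders, and the same effect can be had from the shell with taskset when launching the agent):

```python
import os

# Hedged sketch (Linux only): restrict the current process (and any children
# it spawns) to a subset of CPU cores. The core IDs are placeholders.
os.sched_setaffinity(0, {0, 1, 2, 3})            # pid 0 means "this process"
print("allowed cores:", os.sched_getaffinity(0))
```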
I mean the python package, not the trains-server version.
btw: both should work fine
PompousBeetle71 you can also use OutputModel.update_weights_package to store multiple files at once (they will all be packaged into a single zip, and unpacked when you get them back via InputModel). Would that help?
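For reference, a minimal sketch of that flow (class and method names as I recall them from the trains SDK; project, task and file names are placeholders):

```python
from trains import Task, OutputModel, InputModel

# Hedged sketch: packaging several files into a single model entry.
# Project/task names and file names are placeholders.
task = Task.init(project_name="examples", task_name="multi-file model")

output_model = OutputModel(task=task)
output_model.update_weights_package(
    weights_filenames=["encoder.pt", "decoder.pt", "vocab.json"]
)

# Getting the model back unpacks the zip locally:
input_model = InputModel(model_id=output_model.id)
local_files = input_model.get_weights_package()
```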
BTW: how are you using them? Should we have a direct interface for those?
PompousBeetle71 notice that starting with this version, when you set model tags they will be stored as user tags, which you can change and edit in the UI. So if you still need the system tags, you have to access them directly.
Hi PompousBeetle71, what exactly is the scenario/problem we are trying to solve?
PompousBeetle71, these are CUDA versions; I'm looking for the NVIDIA driver version, for example 440.xx or 418.xx.
The reason is that we set an OS environment variable for the driver, and I remember that old drivers did not support it. Basically, they do not support NVIDIA_VISIBLE_DEVICES=all, so I'm trying to see if that's the case; then we could add a fix.
P.S. any chance you can get me the NVIDIA driver version? I can't seem to find the one for v22 on Amazon.
PompousBeetle71 so basically exclude parameters that are considered "local" only, so that other people will not accidentally use them?
Hi PompousBeetle71 I'm with SteadyFox10 on this one. Unless you choose a file name based on the epoch or step, you are literally overwriting the model file, which Trains will reflect. If you use the epoch in the filename, you will end up with all your models logged by Trains. BTW, we are actively working on integration with PyTorch Ignite, so if you have any suggestions, now is the time :)
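For example, a minimal sketch of the epoch-based naming (model, train_one_epoch and loader are placeholders):

```python
import torch

# Hedged sketch: put the epoch in the checkpoint filename so every save is
# logged as a separate model instead of overwriting one file.
# `model`, `train_one_epoch` and `loader` are placeholders.
num_epochs = 10
for epoch in range(num_epochs):
    train_one_epoch(model, loader)
    torch.save(model.state_dict(), f"model_epoch_{epoch:03d}.pt")
```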
SuccessfulKoala55 please post here once the code is available in your pytorch_ignite 🙂
PompousBeetle71 could you try trains-agent 0.15.0rc0? Which OS are you using? Are you running in docker mode? If so, what's the docker version?
Let's call it an applicative project, which has experiments, and an abstract/parent project (or some other name) that groups applicative projects.
That was my way of thinking; the guys argued it would soon "deteriorate" into the first option :)
PompousBeetle71 that actually brings me to another question: how do you feel about a "parent" experiment?
PompousBeetle71 you can check this example:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_torch_distributed.py
I think it should help. If you want a more manual approach, you can check the Popen subprocesses here:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_subprocess.py
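For reference, a minimal sketch of the manual Popen approach (worker.py and its --rank argument are placeholders; the linked example shows the full pattern):

```python
import subprocess
import sys

# Hedged sketch of the manual approach: the main task spawns one worker
# subprocess per rank and waits for all of them. "worker.py" and the --rank
# argument are placeholders; the linked example shows the full pattern.
world_size = 4
workers = [
    subprocess.Popen([sys.executable, "worker.py", "--rank", str(rank)])
    for rank in range(world_size)
]
for proc in workers:
    proc.wait()
```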