Hi @<1636175432829112320:profile|PlainSealion45>
- I used this initial model to create the endpoint with the model add command.
I think that the initial model needs to be added with model auto-update, not with model add.
Basically, do not call model add: it is static and always uses the model ID specified (you can deploy new models by manually calling model add on the same endpoint and specifying a different model ID, but again, that is manual)
To Automatically have the m...
Thanks for the logs @<1627478122452488192:profile|AdorableDeer85>
Notice that the log you attached means the preprocessing is executed and the GPU backend is returning an error.
Could you provide the log of the docker compose? Specifically, the interesting part is the Triton container; I want to verify it loads the model properly
Hi CloudyHamster42
how do I have the trains-agent install my requirements.txt file from my repo when creating the environment?
BTW if you clear all the "installed packages", then trains-agent will use requirements.txt and update all the packages back in the UI
VexedCat68 yes :) you can also pass the parent folder and it will zip all of its subfolders into a single artifact
I am not sure this is related to the fact the model is not correctly converted to TorchScript
Because Triton only supports TorchScript (not torch models) :)
JumpyPig73 Do you see all the configurations under the Args section in the "Configuration" Tab ?
(Maybe I'm wrong and the latest RC does Not include the python-fire support)
Hmm, notice that it does store symlinks to parent data versions (to avoid multiple copies of the same file). If you call get_mutable_local_copy() you will get a standalone copy
ColossalAnt7 I would do the following:
- Configure trains-server user/pass, mounting the API server configuration file as pointed out in the trains-server documentation (intermediate temporary step)
- Start by providing the ML guys with VPN access that allows them to directly access the trains-server api/web/file pos (caveat is the IP/sub-domain needs to be solved)
- Configure a ConfigMap to do the routing/ingest (this solves the IP/Sub-Domain issue) and allow the VPN to access the single entrypoint...
By the way, will downloading still happen if the dataset is available in the cache folder?
If it is cached, then there is no need to re-download :)
I notice that, in my Serving Service situated in the DevOps project, the "endpoints" section doesn't seem to get updated when I tag a new model with "released".
It takes it a few minutes (I think 5 min is the default) to update.
Notice that you need to add the model with
model auto-update --engine triton --endpoint "test_model_pytorch_auto" ...
Not with model add (if for some reason that does not work please let me know)
No need to pass the model version i.e. 1
you can ...
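A minimal sketch of the command being described, written as a small Python helper that only builds the argument list. The `--engine` and `--endpoint` values are the ones quoted above; the leading `--id <service_id>` and the helper name are assumptions for illustration, and the options elided with "..." are deliberately not filled in.

```python
import shlex


def build_auto_update_command(service_id, endpoint, engine="triton"):
    """Build the argv for registering an auto-updating endpoint.

    Only --engine and --endpoint come from the message above. The
    `--id <service_id>` prefix is an assumption about how the serving
    service is selected; the remaining (elided) options must be appended.
    """
    return [
        "clearml-serving", "--id", service_id,
        "model", "auto-update",  # not "model add", which pins one model ID
        "--engine", engine,
        "--endpoint", endpoint,
    ]


command = build_auto_update_command("<service_id>", "test_model_pytorch_auto")
print(shlex.join(command))
```

The point of keeping it as a list is that the same helper can be reused from a script without worrying about shell quoting.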
Hi @<1636175432829112320:profile|PlainSealion45>
I am trying to automatically generate an online endpoint for inference when manually adding the tag "released" to a model.
So the "automatic" here means that the model endpoint will be updated with the latest model, but not that a new endpoint will be created.
Does that make sense ?
To add a new endpoint on tagging a model, you should combine it with ModelTrigger
and have a function that calls the clearml-serving to cr...
Yes! That's exactly what I meant, as you can see the Triton backend was not able to load your model. I'm assuming that is because it was not converted to TorchScript, like we do in the original example
https://github.com/allegroai/clearml-serving/blob/6c4bece6638a7341388507a77d6993f447e8c088/examples/pytorch/train_pytorch_mnist.py#L136
Hi @<1658281099807166464:profile|SmallCamel52>
Lack of authentication in all versions of the fileserver component
Are you leaving the fileserver open to the world ?
basically the default_output_uri will cause all models to be uploaded to this server (with specific subfolder per project/task)
You can have the same value there as the files_server.
The files_server is where you have all your artifacts / debug samples
Hi HealthyStarfish45
- is there an advantage in using tensorboard over your reporting?
Not unless your code already uses TB or has some built in TB loggers.
html reporting looks powerful, can one inject some javascript inside?
As long as the JS is self contained in the html script, anything goes :)
HealthyStarfish45
No, it should work :)
Okay, what you can do is the following:
assuming you want to launch task id aabb12
The actual slurm command will be: trains-agent execute --full-monitoring --id aabb12
You can test it on your local machine as well.
Make sure the trains.conf is available in the slurm job
(use trains-agent --config-file to point to a globally shared one)
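As a rough sketch of how that command could be handed to Slurm: the `trains-agent` arguments are the ones quoted above, while submitting through `sbatch --wrap`, the job name, and the `/shared/trains.conf` path are assumptions made for the example.

```python
import shlex


def build_slurm_submit_command(task_id, config_file=None):
    """Wrap `trains-agent execute` for one task in an sbatch submission."""
    agent_cmd = ["trains-agent"]
    if config_file:
        # Point the job at a globally shared trains.conf
        agent_cmd += ["--config-file", config_file]
    agent_cmd += ["execute", "--full-monitoring", "--id", task_id]
    # sbatch --wrap takes the whole command as a single string
    return [
        "sbatch",
        "--job-name", f"trains-{task_id}",  # hypothetical naming scheme
        "--wrap", shlex.join(agent_cmd),
    ]


command = build_slurm_submit_command("aabb12", "/shared/trains.conf")
print(shlex.join(command))
```

Running the inner `trains-agent execute ...` string directly on a local machine is the quick way to check it before involving the scheduler.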
What do you think?
do you have docker installed on all slurm agent/worker machines
Docker support?
Hi RoughTiger69
seems to not take the packages that are in the requirements.txt
The reason for not taking the entire set of Python packages is that it will most likely break when trying to run inside the agent.
The directly imported packages will essentially pull in their required packages, and thus create a stable env on the remote machine. The agent will then store the entire env, as it assumes it will be able to fully replicate it the next time it runs.
If the "Installed Packages" section is empty...
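To make "directly imported packages" concrete, here is a small illustration using only the standard library. It is not how the package detection is actually implemented; it just shows the idea of collecting the top-level names a script imports. Note that it does not filter out standard-library modules.

```python
import ast


def directly_imported_packages(source):
    """Return the top-level package names imported by a script."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import x"
            packages.add(node.module.split(".")[0])
    return sorted(packages)


example_script = """
import numpy as np
from torch.utils import data
from . import local_helpers
"""
print(directly_imported_packages(example_script))  # ['numpy', 'torch']
```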
Okay, this is more complicated but possible.
The idea is to write a glue layer (service) that pulls from the (i.e. UI) queue,
sets up the slurm job,
and puts it in a pending queue (so you know the job is waiting in the slurm scheduler)
There is a template here:
https://github.com/allegroai/trains-agent/blob/master/trains_agent/glue/k8s.py
I would love to help and setup a slurm glue in a similar manner
what do you think?
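The three steps above can be sketched roughly like this. Everything here is a stand-in: the queues are plain in-memory deques and `submit_to_slurm` is any callable that takes a task ID and returns a job ID, so this shows only the control flow, not the k8s template or any real server API.

```python
from collections import deque


def run_glue_once(ui_queue, pending_queue, submit_to_slurm):
    """Move every waiting task from the UI queue into the pending queue."""
    submitted = []
    while ui_queue:
        task_id = ui_queue.popleft()        # pull from the (UI) queue
        job_id = submit_to_slurm(task_id)   # set up the slurm job
        pending_queue.append(task_id)       # now waiting in the slurm scheduler
        submitted.append((task_id, job_id))
    return submitted


def fake_submit(task_id):
    # Hypothetical submitter; a real one would call the slurm scheduler
    return f"slurm-job-for-{task_id}"


ui_queue = deque(["aabb12", "ccdd34"])
pending_queue = deque()
submitted = run_glue_once(ui_queue, pending_queue, fake_submit)
print(submitted)
```

A real service would run this in a polling loop and replace the deques with calls to the server.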
HealthyStarfish45 could you take a look at the code, see if it makes sense to you?
What I'm getting to, is maybe we build a template, then you could fill in the gaps ?
but I need to dig deeper into the architecture to understand what we need exactly from k8s glue.
Once you do, feel free to share. Basically there are two options: use the k8s scheduler with dynamic pods, or spin the trains-agent as a service pod and let it spin the jobs
I hope you can do this without containers.
I think you should be fine, the only caveat is CUDA drivers, nothing we can do about that ...
HealthyStarfish45 if I understand correctly, the trains-agent is running as a daemon (i.e. automatically pulling jobs and executing them); the only caveat might be that cancelling a daemon will cause the Task executed by that daemon to be canceled as well.
Other than that, sounds great!
HealthyStarfish45
Is there a way to say to a worker that it should not take new tasks? If there is such a feature then one could avoid the race condition
Still undocumented, but yes, you can tag it as disabled.
Let me check exactly how.
PompousParrot44 What is the "working directory" on the experiment itself? and the "script path"?
Based on what you wrote above, in order for it to work you should have:
working directory: "."
script path: "-m test.scripts.script"
notice no "--args" and working directory is "." (i.e. the root of the repository)
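A tiny helper to show how the "script path" value relates to the file layout. The file path `test/scripts/script.py` is an assumption inferred from the module name above, and the helper itself is only illustrative.

```python
from pathlib import PurePosixPath


def module_script_path(relative_file):
    """Turn a repo-relative .py file into the "-m package.module" form."""
    path = PurePosixPath(relative_file)
    if path.suffix != ".py":
        raise ValueError(f"expected a .py file, got {relative_file!r}")
    # Drop the extension and join the path components with dots
    return "-m " + ".".join(path.with_suffix("").parts)


print(module_script_path("test/scripts/script.py"))  # -m test.scripts.script
```

The dotted name only resolves when the working directory is the repository root, which is why it is set to "." above.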