Hi BoredHedgehog47, I'm assuming the nginx on the k8s ingress is refusing the upload to the fileserver
JuicyFox94 wdyt?
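(If it is indeed nginx rejecting the upload, e.g. with an HTTP 413, the usual suspect is the ingress body-size limit; a quick sketch, with a placeholder ingress name:)
```bash
# "0" disables the nginx body-size limit (an explicit size like "512m" also works)
kubectl annotate ingress clearml-fileserver \
  nginx.ingress.kubernetes.io/proxy-body-size="0" --overwrite
```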
WackyRabbit7 you can configure the AWS autoscaler with two types of instances, with priority given to one of them. So in theory you do not need two autoscaler processes; with that in mind, I "think" a single IAM should suffice
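(Roughly along these lines; a sketch following the shape of the aws_autoscaler example config, with placeholder instance types and counts:)
```python
# Two resource types, one spot and one on-demand, serving the same queue.
resource_configurations = {
    "spot_gpu":      {"instance_type": "g4dn.xlarge", "is_spot": True},
    "on_demand_gpu": {"instance_type": "g4dn.xlarge", "is_spot": False},
}
queues = {
    # Listed in priority order: try spot first, fall back to on-demand.
    # Each tuple is (resource_name, max_instances_of_that_type).
    "default": [("spot_gpu", 2), ("on_demand_gpu", 2)],
}
```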
It's the correct way to do it, right?
Yep 🙂 That said, this is not running as a service, so you will need to spin it up on your machine. You can definitely connect it with the free SaaS server, and spin up the serving on your machine with docker-compose
SolidSealion72 this makes sense, clearml deletes artifacts/models after they are uploaded, so I have to assume these are torch internal files
could you send the entire log here?
i.e. from the "docker-compose" command line and onward
Notice that we are using the same version:
https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/clearml_serving/engines/triton/Dockerfile#L2
The reason was that the previous version did not support TorchScript (similar to the error you reported)
My question is, why don't you use the "allegroai/clearml-serving-triton:latest" container ?
RobustRat47
What exactly is the error you are getting ? (I remember only the latest Triton solved some issue there)
Hi RobustRat47
My guess is it's something from converting the PyTorch code to TorchScript. I'm getting this error when trying the
I think you are correct see here:
https://github.com/allegroai/clearml-serving/blob/d15bfcade54c7bdd8f3765408adc480d5ceb4b45/examples/pytorch/train_pytorch_mnist.py#L136
you have to convert the model to TorchScript for Triton to serve it
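(For example, a minimal sketch using torch.jit.trace; the model and input shape here are placeholders:)
```python
import torch
import torch.nn as nn

# Placeholder model; substitute your trained network.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

# Trace with a dummy input matching the expected shape (MNIST here),
# then save the TorchScript artifact that Triton can load.
example_input = torch.randn(1, 1, 28, 28)
scripted = torch.jit.trace(model, example_input)
scripted.save("serving_model.pt")
```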
RobustRat47 what's the Triton container you are using ?
BTW, the Triton error is:
model_repository_manager.cc:1152] failed to load 'test_model_pytorch' version 1: Internal: unable to create stream: the provided PTX was compiled with an unsupported toolchain.
https://github.com/triton-inference-server/server/issues/3877
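(For reference, that PTX error usually means the CUDA toolkit the container was built against is newer than what the host NVIDIA driver supports; the quickest check, with the fix being a host driver upgrade as in the linked issue:)
```bash
# The header of nvidia-smi shows the host driver version and the highest
# CUDA version that driver supports; compare it against the CUDA version
# listed for your Triton image in NVIDIA's release notes.
nvidia-smi
```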
I can raise this as an issue on the repo if that is useful?
I think this is a good idea, at least increased visibility 🙂
Please do 🙂
Thanks RobustRat47 !
Should we document this requirement somewhere (i.e. the NVIDIA drivers)?
Is this really a must?
it would be clearml-server's job to distribute to each user internally?
So you mean the user will never know their own S3 access credentials?
Are those credentials unique per user, or once "hidden", the same for all of them?
Hmm, so the concept of "company-wide" configuration is supported in the enterprise version.
I'm trying to think of a "hack" to just pass these env/conf ...
How are you spinning the agent machines?
Hi AbruptCow41
I just want them to be able to write to them, without the credentials appearing in their clearml.conf or in their environment variables.
So where would they put them ? (or is it pre baked into the docker?)
…every user on the server has the same credentials, and they don't need to know them... makes sense?
Make sense, single credentials for everyone, without the need to distribute
Is that correct?
I see, so basically pull a fixed set of configuration for everyone from the server.
Currently only the scale/enterprise version supports such a feature 🙂
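(For completeness, the "hack" above would look roughly like this on the agent machines, assuming docker mode; extra_docker_arguments is the standard clearml.conf agent option, and the values are placeholders:)
```
# agent machine's clearml.conf only (users never see this file);
# the agent injects these env vars into every task container it spins up
agent {
    extra_docker_arguments: [
        "-e", "AWS_ACCESS_KEY_ID=<shared-key>",
        "-e", "AWS_SECRET_ACCESS_KEY=<shared-secret>",
    ]
}
```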
NastySeahorse61 I would try to open in incognito mode (i.e. no cookies etc.), did you also change the address of the server?
Hmm, maybe the right way to do so is to abuse "models", which have their own entity: you can specify a system_tag on them, they can store a folder (and extract it if you need), they live in projects, and they can be cloned and changed.
wdyt?
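(If it helps, a rough sketch of that "models" hack with the clearml SDK; project, model names, and paths are placeholders, and I used regular tags here rather than a system_tag, but the idea is the same:)
```python
from clearml import Task, OutputModel, Model

# Publish a folder as a "model" (the folder is zipped and uploaded as a package)
task = Task.init(project_name="shared", task_name="publish config")
out = OutputModel(task=task, name="team-config", tags=["team-config"])
out.update_weights_package(weights_path="/path/to/config_folder")

# On any other machine: look the model up by tag and get an extracted local copy
# (assumes at least one matching model exists)
model = Model.query_models(project_name="shared", tags=["team-config"])[0]
local_folder = model.get_local_copy(extract_archive=True)
```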
DrabCockroach54 notice here there is no aarch64 wheel for anything other than python 3.5...
(and in both cases only py 3.5/3.6 builds, everything else will be built from code)
https://pypi.org/project/pycryptodome/#files
"regular" worker will run one job at a time, services worker will spin multiple tasks at the same time But their setup (i.e. before running the actual task) is one at a time..
would those containers best be started from something in services mode?
Yes as long as the machine has enough cpu/ram
Notice that services mode will only start a second parallel Task after the first one is done setting up its env. If running with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL, with containers that have git/python/clearml-agent preinstalled, that setup should be minimal.
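(E.g., roughly like this; the image name is a placeholder, and I'm assuming the env var propagates from the daemon into the container:)
```bash
# skip python env installation entirely; assumes the image already has
# git/python/clearml-agent preinstalled
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
clearml-agent daemon --services-mode --queue services --docker my-prebaked-image --detached
```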
or is it possible to get no-overhead with my approach of worker-inside-docker?
No, do not do that, see above...
of what task? i'm running lots of them and benchmarking
If you are skipping every installation it should be the same
because if you set CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
it will not install Anything at all
This is why it's odd to me...
wdyt?
I'm not running in docker mode though
hmmm that might be the first issue. it cannot skip venv creation, it can however use a pre-existing venv (but it will change it every time it installs a missing package)
so setting CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 in non-docker mode has no effect
what if the preexisting venv is just the system python? My base image is python:3.10.10 and I just pip install all requirements in that image. Does that not avoid the venv still?
it will basically create a new venv inside the container, forking the existing preinstalled stuff (i.e. the new venv already has everything the system python has preinstalled)
then it will call "pip install" on all the "installed packages" of the Task.
Which should just check everything is there and install nothing...
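(Conceptually something like this, with illustrative paths:)
```bash
# new venv that sees everything preinstalled in the system python
python3 -m venv --system-site-packages /tmp/task_venv
# pip only verifies each requirement; already-satisfied packages install nothing
/tmp/task_venv/bin/pip install -r task_requirements.txt
```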
from task pick-up to "git clone" is now ~30s, much better.
This is "spent" calling apt update && update install && pip install clearml-agent
if you have those preinstalled it should be quick
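(i.e., bake something like the following into the base image, e.g. as Dockerfile RUN steps:)
```bash
apt-get update && apt-get install -y git
python3 -m pip install clearml-agent
```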
though as far as I understand, the recommendation is still to not run workers-in-docker like this:
if you do not want it to install anything and just use the existing venv (leaving the venv as is), and if something is missing then so be it, then yes, sure, that's the way to go
@<1689446563463565312:profile|SmallTurkey79> could you attach the full log of the Task?
also I would recommend "export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1" (not "true")
Usually binary env vars are 0/1
(I can see that the docs here: None never mention it, I'll ask them to add that)
I can see all the steps like git clone,
"git clone" has nothing to do with "env setup", this is bringing the code, you cannot skip that one. That said, this is why the git itself is cached on the host machine, so it is fast
... There may be some odd package that needs to be installed because one of our DSs is experimenting ... But for all of that, we can see what is happening.
even if everything is preinstalled, it verifies the packages match, and this might take a long time. It's just pip being ...
- try with the latest RC, 1.8.1rc2
...it feels like after git clone, it spends minutes without outputting anything
yeah that is odd, can you run the agent with --debug (add it before the daemon command), and then at the end of the command add --foreground
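(i.e., something like this, with a placeholder queue name:)
```bash
clearml-agent --debug daemon --queue default --foreground
```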
Now launch the same task on that queue, you will have a verbose log in the console.
Let us know what you see