Suppose that I have three models and these models can't be loaded simultaneously on GPU memory(
Oh!!!
For now, this is the behavior I observe: Suppose I have two models, A and B. ....
Correct
Yes this is a current limitation of the Triton backend BUT!
We are working on a new version that does exactly what you mentioned (it is a common case that some models are not used very frequently).
The main caveat is the loading time, re-loading models from dist...
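To make the idea concrete, on-demand loading can be pictured as a least-recently-used cache over a fixed GPU memory budget. This is only a minimal sketch of that idea, not how the Triton backend or clearml-serving implements it; `ModelCache`, `capacity_mb`, `load_fn` and `unload_fn` are hypothetical names used for illustration.

```python
from collections import OrderedDict
from typing import Callable, Dict


class ModelCache:
    """Toy LRU cache: keeps models loaded until the memory budget is exceeded."""

    def __init__(self, capacity_mb: int,
                 load_fn: Callable[[str], object],
                 unload_fn: Callable[[str], None]) -> None:
        self.capacity_mb = capacity_mb
        self.load_fn = load_fn      # hypothetical: loads a model from disk
        self.unload_fn = unload_fn  # hypothetical: frees the model's GPU memory
        self.sizes_mb: Dict[str, int] = {}
        self.loaded: "OrderedDict[str, object]" = OrderedDict()

    def used_mb(self) -> int:
        return sum(self.sizes_mb[name] for name in self.loaded)

    def get(self, name: str, size_mb: int) -> object:
        if name in self.loaded:
            # Cache hit: mark the model as most recently used.
            self.loaded.move_to_end(name)
            return self.loaded[name]
        if size_mb > self.capacity_mb:
            raise MemoryError(f"{name} does not fit in {self.capacity_mb} MB")
        # Evict least recently used models until the new one fits.
        while self.loaded and self.used_mb() + size_mb > self.capacity_mb:
            victim, _ = self.loaded.popitem(last=False)
            self.unload_fn(victim)
            del self.sizes_mb[victim]
        self.sizes_mb[name] = size_mb
        self.loaded[name] = self.load_fn(name)  # cache miss pays the load time
        return self.loaded[name]
```

Every cache miss pays the full model load time, which is exactly the caveat mentioned above.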
Hi @<1545216070686609408:profile|EnthusiasticCow4>
Many of the datasets we work with are generated by SQL queries.
The main question in these scenarios is: are those DBs stable?
By that I mean that, generally speaking, DBs serve applications, and from time to time they undergo migration (i.e. changes in schema, more/less data, etc.).
The most stable way is to create a script that runs the SQL query and creates a ClearML dataset from it (that script becomes part of the Dataset, to have full tracta...
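A minimal sketch of the "script that runs the SQL query" part, assuming a DB-API connection (`sqlite3` is used here only as a stand-in for the real database). `export_query_to_csv` and the `data.csv` / `query.sql` file names are hypothetical choices, not anything ClearML prescribes.

```python
import csv
import sqlite3
from pathlib import Path


def export_query_to_csv(conn: sqlite3.Connection, query: str, out_dir: Path) -> Path:
    """Run `query` and write the result set (with a header row) to out_dir/data.csv."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "data.csv"
    cursor = conn.execute(query)
    header = [col[0] for col in cursor.description]
    with out_file.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(cursor)
    # Keep the query next to the data so the snapshot records how it was produced.
    (out_dir / "query.sql").write_text(query)
    return out_file
```

The resulting folder is a frozen snapshot of the query result, so a later schema migration does not change it. It could then be registered as a dataset version, for example with the ClearML `Dataset` class (`Dataset.create`, `add_files`, `upload`, `finalize`).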
The image is
allegroai/clearml:1.0.2-108
Yep, that makes sense, seems like a backwards compatibility issue
Hi @<1541954607595393024:profile|BattyCrocodile47>
see here: None
Try with app.clearml.mlops-club.org
and the rest of them
Hi @<1649221394904387584:profile|RattySparrow90>
: Are the models I defined to be served e.g. via the CLI downloaded to the serving pod
Yes, this is done automatically and online (i.e. when you update them using the CLI/API), based on the models/endpoints you set
So that they are physically lying there as a file I can see in the filesystem?
They are, and cached there
Or is it more the case that the pod gets the model when needed/when an API call for this model is incoming?
I...
I see.
You can get the offline folder programmatically, then copy the folder content (it's the same as the zip, and you can also pass a folder instead of a zip to the import function): task.get_offline_mode_folder()
You can also have a soft link of the offline folder (if you are working on a Linux machine): ln -s myoffline_folder ~/.trains/cache/offline
LudicrousParrot69
I "think" I have a better handle on what you wish to do.
Is it kind of generic "serving" solution?
FYI:
A model artifact is usually a weights/model file. The idea is that later you will be able to access it and serve it. Now the problem is (and I think this is what you are referring to) that there is usually a specific piece of code tied to that model that can use it (a.k.a. pyfunc)
A few ideas:
These days everyone is trying to build their models with generic interface, so that scik...
SubstantialElk6 Ohh okay I see.
Let's start with background on how the agent works:
When the agent pulls a job (Task), it will clone the code based on the git credentials available on the host itself, or based on the git_user/git_pass configured in ~/clearml.conf
https://github.com/allegroai/clearml-agent/blob/77d6ff6630e97ec9a322e6d265cd874d0ab00c87/docs/clearml.conf#L18
The agent can work in two modes:
Virtual environment mode, where it will create a new venv for each experiment ba...
Do we launch multiple groups of these in different projects?
Actually Triton can serve multiple models and the endpoints/models are controlled from the clearml-serving.
The only issue is adding a load-balancer in front of multiple nodes to balance the requests between them. wdyt?
packages are updated, and I don't know which python version I get, + changing the python version of the OS is not really recommended
Wait I'm confused, this is inside a container, no?
and the python version running my code should not depend on the python version running the clearml-agent (especially for experiments running in containers)
Generally speaking you are correct, but some packages will not have the same version for all python versions
Specifically in this case I think...
This is also set in the command line.
--cpu-only or maybe without any --gpus flag at all
That’s the question i want to raise too,
No file size limit
Let me try to run it myself
Hi ClumsyElephant70
Is there a way to run all pipeline steps, not in isolation but consecutively in the same environment?
You mean as part of a real-time inference process ?
Yes, that means the nvidia drivers are present (as you mentioned the GPU seems to be detected).
Could you check you have libnvidia-ml.so.1 inside the container ?
For example in /usr/lib/nvidia-XYZ/
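If it is easier to check from Python inside the container, here is a minimal sketch that searches a few directories for the library file. `find_library` is a hypothetical helper, and the search roots in the usage note are assumptions that depend on the image layout.

```python
from pathlib import Path
from typing import Iterable, List


def find_library(name: str, roots: Iterable[str]) -> List[Path]:
    """Return every path under `roots` whose file name equals `name`."""
    matches: List[Path] = []
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue  # skip search roots that do not exist in this container
        matches.extend(
            p for p in root_path.rglob(name) if p.is_file() or p.is_symlink()
        )
    return sorted(matches)
```

For example, `find_library("libnvidia-ml.so.1", ["/usr/lib", "/usr/lib64"])` returns an empty list if the library is not visible under those roots.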
Setting the credentials on agent machine means the users cannot use their own credentials since a k8s glue agent serves multiple users.
Correct, I think "vault" option is only available on the paid tier 😞
but how should we do this for the credentials?
I'm not sure how to pass them, wouldn't it make sense to give the agent all-access credentials?
is this a config file on your side or something I can change, if we had enterprise version?
Yes, this is one of the things you can configure
ConfusedPig65 could you send the full log (console) of this execution?
Is there an option to do this from a pipeline, from within the add_step method? Can you link a reference to cloning and editing a task programmatically?
Hmm, I think there is an open GitHub issue requesting a similar ability , let me check on the progress ...
nope, it works well for the pipeline when I don't choose to continue_pipeline
Could you send the full log please?
EnchantingWorm39 you have great timing ;)
New version will contain much more advanced search (including all the task fields)
are there any more fields in this function with partial matching? for example project? tags?
Yes they can all be filtered (basically everything you see in the UI)
notice: tags are strings (you can provide list of tags), project is an ID of the project
(Use Task.get_project_id, I think)
Hi @<1523702969063706624:profile|PoisedShark13>
However, INSTALLED PACKAGES of my task is missing many of the installed packages (any idea why?)
It automatically detects the directly imported packages, literally analyzing your code base and looking for imports
The derivative packages (i.e. the ones that any of the "main" packages need) will be listed after the first time the agent installs everything
If something specific is missing, you can manually add it with:
Task.add_requiremen...
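To illustrate why only directly imported packages show up, here is a minimal sketch of import-based detection using the standard `ast` module. It is not the actual ClearML analysis code; `top_level_imports` is a hypothetical helper.

```python
import ast
from typing import Set


def top_level_imports(source: str) -> Set[str]:
    """Return the top-level package names imported by `source`."""
    packages: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to the project itself.
            if node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return packages
```

A scan like this only sees what the code imports. Packages that those imports depend on never appear in the source, which is why they are only listed once everything has actually been installed.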
@<1523703080200179712:profile|NastySeahorse61> / @<1523702868694011904:profile|AbruptCow41>
Is there a way to avoid having each task create a new environment?
You can just define CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
it will just use whatever you have there (notice it will totally ignore requirements.txt and "installed packages" on the Task)
BTW I would recommend turning on the venv caching, this is per docker/python/packages caching so the next time you are using the exact requi...
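As an illustration of what "per docker/python/packages" caching means, here is a minimal sketch of a cache key built from those three inputs. This is an assumption about the general technique, not the agent's actual implementation; `venv_cache_key` is a hypothetical name.

```python
import hashlib
from typing import Iterable


def venv_cache_key(docker_image: str, python_version: str,
                   requirements: Iterable[str]) -> str:
    """Hash the inputs that determine an environment into a stable cache key."""
    # Sort and strip so that line order and whitespace do not change the key.
    normalized = sorted(line.strip() for line in requirements if line.strip())
    payload = "\n".join([docker_image, python_version, *normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two tasks with the same image, Python version and package list map to the same key and can reuse one cached environment. Changing any of the three produces a different key.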
A few implementation / design details:
When you run code with Trains (and call init), it will record your environment (python packages, git code, uncommitted changes, etc.). Everything is stored on the Task object in the trains-server. When you clone a task, you literally create a copy of the Task object (i.e. a second experiment). On the cloned experiment, you can edit everything (parameters, git, base docker image, etc.). When you enqueue a Task, you add its ID to the execution queue list a trains-a...
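The clone / edit / enqueue flow described above can be modeled with a toy in-memory sketch. None of these names come from the Trains SDK; `TaskRecord`, `clone` and `enqueue` are hypothetical and only mirror the description.

```python
import copy
import uuid
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class TaskRecord:
    """Stand-in for the Task object stored on the server."""
    name: str
    parameters: Dict[str, object] = field(default_factory=dict)
    docker_image: str = ""
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def clone(task: TaskRecord) -> TaskRecord:
    """Copy every recorded field, but give the copy its own ID."""
    new_task = copy.deepcopy(task)
    new_task.task_id = uuid.uuid4().hex
    new_task.name = f"Clone of {task.name}"
    return new_task


def enqueue(queue: Deque[str], task: TaskRecord) -> None:
    """Only the task ID goes into the queue; the record itself stays in the store."""
    queue.append(task.task_id)
```

Editing the clone leaves the original experiment untouched, and the queue only ever holds IDs.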
That sounds like an internal tritonserver error.
https://forums.developer.nvidia.com/t/provided-ptx-was-compiled-with-an-unsupported-toolchain-error-using-cub/168292
Okay ConfusedPig65 I found the problem. For some reason the latest TF.keras.load_model / save_model is not tracked.
I'll make sure we push a fix later today
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log this is what conda is installing, it should have worked
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
-...
Long story short, work in progress.
BTW: are you referring to manual execution or trains-agent?
Hi
The Squash operation copies all the data and is no longer linked to previous commits?
Yes, basically the idea is that if you have a data version that relies on many parents that need to be merged, the squash will create a merged copy and push it all as a single version, and then yes, the parent versions are no longer needed
I thought this operation is like git squash but it seems to me
yeah... we did not want to actually delete the parents because unlike git, the operation is done ...
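Conceptually, the squash described above can be sketched as merging the file listings of the parent versions into one standalone listing. This is a simplified model, not the actual dataset implementation. `FileMap` and `squash` are hypothetical names, and the "newer version wins on conflict" rule is an assumption.

```python
from typing import Dict, Iterable

FileMap = Dict[str, str]  # relative path -> content hash


def squash(versions: Iterable[FileMap]) -> FileMap:
    """Merge versions, oldest first, into one standalone file map."""
    merged: FileMap = {}
    for version in versions:
        merged.update(version)  # entries from newer versions override older ones
    return merged
```

The input versions are left untouched. The merged result simply no longer depends on them.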
BTW: Full RestAPI reference here
https://allegro.ai/clearml/docs/rst/references/clearml_api_ref/index.html
Hi Guys, just curious here, what was the final issue?
Also out of curiosity, what does that mean? "1.12.2 because some bug that make fastai lag 2x" ?