I would like to know if it is possible to run any PyTorch model on the basic docker-compose file, without Triton?
I have my Task.init inside a train() function inside the flask command. We basically have flask commands that allow us to trigger specific behaviors. When running it locally, everything works properly except for the repository information. The use case is linked to the way our codebase works: for example, I run flask train {arguments}
and it triggers the training of a model (that I want to track).
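For context, here is a minimal sketch of that setup (the project/task names and the training body are placeholders, not our real codebase):

# app.py -- minimal sketch of a Flask CLI command wrapping ClearML tracking
from clearml import Task
from flask import Flask

app = Flask(__name__)

@app.cli.command("train")  # exposed on the command line as `flask train`
def train():
    # Tracking starts only when the command is actually invoked
    task = Task.init(project_name="example-project", task_name="example-training")
    # ... actual training code goes here ...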
I stopped the autoscaler and deleted it manually. I did it because I want to test...
Thanks, my question was dumb indeed 🙂 Thanks for the reply!
Sure, here is the updated clearml.conf file of the AWS autoscaler instance:
agent {
    vcs_cache.enabled: false
    package_manager: {
        type: poetry,
        poetry_version: "1.4.2",
    }
}
sdk {
    development {
        store_code_diff_from_remote: false,
    }
}
I see uncommitted changes, whereas I would like to see none.
If I may ask about another issue in this thread that is taking up a lot of my time:
Poetry Enabled: Ignoring requested python packages, using repository poetry lock file!
Creating virtualenv alfred-Rp77Shgw-py3.9 in /root/.cache/pypoetry/virtualenvs
Installing dependencies from lock file
2023-04-17 10:17:57
Package operations: 351 installs, 1 update, 1 removal
failed installing poetry requirements: Command '['poetry', 'install', '-n']' returned non-zero exit status 1.
Ignorin...
Ok. I spun up three AWS autoscalers, each with a different conf. I also fixed a submodule issue in my repo (which I believed was the cause of the git diff) and every run now gets past this point and fails later (a different problem). So I think store_code_diff_from_remote
is of no help to me, but my problem is gone...
These changes reflect the modifications I have in my working tree (not committed, not put in the staging area with git add
). But I would like to remove this uncommitted section from ClearML and not be blocked by it.
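For reference, this is the setting I am experimenting with to drop the uncommitted diff from the task (a sketch; I have not verified it solves my case):

sdk {
    development {
        # do not store the uncommitted git diff in the experiment manifest
        store_uncommitted_code_diff: false
    }
}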
I basically would like to know if we can serve the model without the TensorRT format, which is highly efficient but more complicated to produce.
In production, we should use the clearml-helm-charts,
right? The docker-compose in clearml-serving is more for local testing.
Thank you! I will try this 🙂
No problem. I guess this might be a small visualisation bug, but I really have the impression that these workers still pick up tasks, which is strange. I should test again to be sure.
The flask
command is run inside the git project, which is the strange part. It is executed in ~/code/repo/ as flask train ...
I read that the hosted ClearML server is periodically reset. Does that mean my team would lose all our work?
I still do not get the usefulness of the K8s ClearML server, then?
I do not remember, but I was afraid... Thanks for the output! Maybe it was in a bad dream? 😜
For now, I am uploading to the freely available ClearML server to store my data, but I will soon use S3 buckets to store it. So the question applies to both use cases 🙂
Hi @<1523701087100473344:profile|SuccessfulKoala55>, the EC2 instance is spun up by the AWS autoscaler provided by ClearML. I use the following docker image: nvidia/cuda:11.8.0-devel-ubuntu20.0
So the EC2 instance runs a docker container
I tried playing with those, but I did not manage to have any effect on the source code detection. I can modify the env variables, but nothing changes on the ClearML server, unfortunately.
Prerequisites, PyTorch models require Triton engine support, please use docker-compose-triton.yml / docker-compose-triton-gpu.yml or if running on Kubernetes, the matching helm chart.
@<1523701070390366208:profile|CostlyOstrich36> @<1523701205467926528:profile|AgitatedDove14> Any ideas on this one?
Sorry to come back to this! Regarding the Kubernetes serving helm chart, I can see horizontal scaling of docker containers. What about vertical scaling? Is it implemented? More specifically, where is the SKU of the VMs in use defined?
Okay, thanks @<1523701205467926528:profile|AgitatedDove14>, and what would be the advantage of using clearml-server on K8s compared to the ClearML-hosted one?
@<1523701205467926528:profile|AgitatedDove14> If you have any other insights, please do not hesitate! Thanks a lot
It is due to the caching mechanism of ClearML. Is there a Python command to update the venvs cache?
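As far as I can tell there is no dedicated SDK call for this, so one option would be to just clear the cache folder on the worker; a minimal sketch, assuming the default agent.venvs_cache.path of ~/.clearml/venvs-cache:

import shutil
from pathlib import Path

# Default venvs cache location; adjust if agent.venvs_cache.path is customized
venvs_cache = Path.home() / ".clearml" / "venvs-cache"
if venvs_cache.exists():
    shutil.rmtree(venvs_cache)  # forces the agent to rebuild its cached venvs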
I will check that. Do you think we could bypass it using Task.create
and passing all the needed params?
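Something along these lines is what I had in mind (a sketch; the repo URL, branch, and script path are placeholders):

from clearml import Task

# Sketch: pass the repository information explicitly instead of relying on auto-detection
task = Task.create(
    project_name="example-project",
    task_name="example-training",
    repo="git@github.com:org/repo.git",  # placeholder
    branch="main",
    script="path/to/train.py",  # placeholder
)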
One possible solution I could also see is moving the data storage to an S3 bucket to improve download performance, since it is the same cloud provider: no transfer latency.
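For the S3 case, this is roughly what I am considering (a sketch; the bucket name and paths are placeholders):

from clearml import Dataset

# Sketch: store the dataset contents on S3 instead of the ClearML file server
ds = Dataset.create(dataset_name="example-dataset", dataset_project="example-project")
ds.add_files("path/to/local/data")  # placeholder path
ds.upload(output_uri="s3://my-bucket/datasets")  # data goes to S3, metadata to the server
ds.finalize()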
Task.set_base_docker 🙂
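Something like this is what I mean (a sketch; recent clearml versions take the docker_image keyword, older ones a single docker command string, and the image tag here is just an example):

from clearml import Task

task = Task.init(project_name="example-project", task_name="example-training")
# Sketch: pin the container image the agent should use when running this task
task.set_base_docker(docker_image="nvidia/cuda:11.8.0-devel-ubuntu20.04")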
Using a pyenv virtual env, then exporting the LOCALPYTHON env var
I literally connected to it at runtime, ran poetry install -n
, and it worked
I am literally trying with one package and Python, and it fails. I tried with Python 3.8, 3.9, and 3.9.16, and it always fails, so it is not linked to the Python version. What is the problem then? I am wondering if there is an intrinsic bug.