I read that the hosted ClearML server is periodically reset. Does that mean my team would lose all our work?
These changes reflect the modifications in my working tree (not committed, not put in the staging area with git add). But I would like to remove this uncommitted section from ClearML and not be blocked by it.
How can I make sure that the Python version is correct?
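A sketch of what I mean, assuming agent.python_binary in clearml.conf is the right knob for this (an assumption on my part, not something I verified on this setup), instead of relying on whatever python resolves to inside the image:

agent {
    # assumption: point the agent at an explicit interpreter binary
    python_binary: "/usr/bin/python3.9"
}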
Prerequisites: PyTorch models require Triton engine support; please use docker-compose-triton.yml / docker-compose-triton-gpu.yml, or, if running on Kubernetes, the matching Helm chart.
In production we should use the clearml-helm-charts, right? The docker-compose in clearml-serving is more for local testing.
I still don't get the usefulness of the K8s ClearML server then?
Basically, I would like to know whether we can serve the model without the TensorRT format, which is highly efficient but more complicated to produce.
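To make the question concrete, here is a minimal sketch of the alternative I have in mind: exporting the PyTorch model to TorchScript, which (as far as I understand) Triton's libtorch backend can serve directly, with no TensorRT conversion step. The model below is just a stand-in:

import torch
import torch.nn as nn

# stand-in model; in practice this would be our trained network
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
example_input = torch.randn(1, 1, 28, 28)

# trace to TorchScript so it can be served as-is, skipping TensorRT
scripted = torch.jit.trace(model, example_input)
scripted.save("model.pt")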
The flask command is run from inside the git project, which is why the behavior is strange. It is executed in ~/code/repo/ as flask train ...
@<1523701070390366208:profile|CostlyOstrich36> Poetry is installed as part of the task's bash script.
The init script of the AWS autoscaler only contains the three export statements I set.
@<1523701205467926528:profile|AgitatedDove14> If you have any other insights, please do not hesitate! Thanks a lot
If I may also ask about another issue in this thread that is taking up a lot of my time:
Poetry Enabled: Ignoring requested python packages, using repository poetry lock file!
Creating virtualenv alfred-Rp77Shgw-py3.9 in /root/.cache/pypoetry/virtualenvs
Installing dependencies from lock file
2023-04-17 10:17:57
Package operations: 351 installs, 1 update, 1 removal
failed installing poetry requirements: Command '['poetry', 'install', '-n']' returned non-zero exit status 1.
Ignorin...
I tried that too. I do not have any more logs from the ClearML agent 😞
@<1523701087100473344:profile|SuccessfulKoala55> Do you think it is possible to run docker mode in the AWS autoscaler and do the cloning and installation inside the task's init bash script?
How do you explain that it works when I SSHed into the same AWS instance from the autoscaler?
@<1523701070390366208:profile|CostlyOstrich36> The base docker image of the AWS autoscaler is nvidia/cuda:10.2-runtime-ubuntu18.04. As far as I can tell, the Python version is not set inside the image, but I might be wrong and it could indeed be the problem...?
Yes indeed, but what about the possibility of doing the clone/Poetry installation ourselves in the task's init bash script?
Because I was SSHed into it before the failure. When Poetry fails, it installs everything using pip.
Sorry to come back to this! Regarding the Kubernetes serving Helm chart, I can see horizontal scaling of Docker containers. What about vertical scaling? Is it implemented? More specifically, where is the SKU of the VMs in use defined?
I have my Task.init inside a train() function inside the flask command. We basically have flask commands that trigger specific behaviors. When running it locally, everything works properly except for the repository information. The use case is linked to the way our codebase works. For example, I run flask train {arguments} and it triggers the training of a model (that I want to track).
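Roughly, the code is structured like this (a simplified sketch; the project and command names are illustrative, not our real ones):

from clearml import Task
from flask import Flask

app = Flask(__name__)

# invoked as `flask train` from inside ~/code/repo/
@app.cli.command("train")
def train():
    # Task.init is called inside the command, not at module import time
    task = Task.init(project_name="my-project", task_name="flask-train")
    # ... actual training code here ...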
I stopped the autoscaler and deleted it manually. I did it because I wanted to test...
Related to that,
Is it possible to do Dataset.add_external_files() with source_url and destination_url being two separate Azure storage containers?
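To make it concrete, this is the kind of thing I am after (a sketch; the account and container names are made up, and I am assuming output_uri on Dataset.create is how the destination side is selected):

from clearml import Dataset

# the external files stay in the source container and are only referenced,
# while the dataset itself is registered against the destination container
ds = Dataset.create(
    dataset_project="my-project",
    dataset_name="external-azure-files",
    output_uri="azure://destaccount.blob.core.windows.net/dest-container",
)
ds.add_external_files(
    source_url="azure://srcaccount.blob.core.windows.net/src-container/data/"
)
ds.upload()
ds.finalize()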
I am literally trying with one package plus Python, and it fails. I tried with Python 3.8, 3.9, and 3.9.16, and it always fails, so it is not linked to the Python version. What is the problem then? I am wondering whether there is an intrinsic bug.
Yes, I take the export statements from the task's bash script.
but I still had time to go inside the container, export the PATH variables for my Poetry and Python versions, and run the poetry install command there
Thank you for the quick replies!
I might be doing it the wrong way, but the above snippet is the additional clearml.conf content I add to the AWS autoscaler. Should I add a complete clearml.conf file instead?
That is a good question @<1537605940121964544:profile|EnthusiasticShrimp49>! I am not sure the image has Python 3.9. I tried to check but did not find the answer. I am using the following AMI: AWS Deep Learning AMI (Ubuntu 18.04) with Support by Terracloudx (Nvidia deep learni...
I guess it makes no sense given the steps a clearml-agent goes through...
I also thought about switching to pip mode, but unfortunately not all packages are detected from our poetry.lock file, so I cannot do that.
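For reference, my understanding is that the mode is selected by the package manager type in the agent config, along these lines:

agent {
    package_manager {
        # "poetry" uses the repo's lock file; "pip" installs the
        # packages ClearML detected from the script instead
        type: pip
    }
}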
How do I set that up inside clearml.conf (or somewhere else) so it knows which credentials to load?
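What I would expect to put there, if I read the config layout correctly (the account names and keys below are placeholders), is one entry per storage account under sdk.azure.storage:

sdk {
    azure.storage {
        containers: [
            {
                account_name: "srcaccount"
                account_key: "<key-for-source-account>"
            },
            {
                account_name: "destaccount"
                account_key: "<key-for-destination-account>"
            }
        ]
    }
}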
I literally connected to it at runtime, ran poetry install -n, and it worked.
I also did it the following way:
- I put a sleep inside the bash script
- I SSHed into the fresh container and ran all the commands myself (cloning, installation), and again it worked...
I would like to know whether it is possible to run any PyTorch model with the basic docker-compose file, without Triton?