
Yes, I take the export statements from the bash script of the task
Yes indeed, but what about the possibility of doing the clone/poetry installation ourselves in the init bash script of the task?
Ok. I spun up three AWS autoscalers, each with a different conf. I also fixed a submodule issue in my repo (which I believed was the cause of the git diff problem), and every run now passes (they still fail later, but not with this problem). So I think store_code_diff_from_remote
is of no help for me, but my problem is gone...
I tried too. I do not have any more logs inside the ClearML agent 😞
but I still had time to go inside the container, export the PATH variables for my poetry and python versions, and run the poetry install command there
Task.set_base_docker
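For the record, a minimal sketch of what I mean, with the init bash script attached through Task.set_base_docker (the image name, PATH entries and install commands are made up, and I am assuming the docker_setup_bash_script argument from the docs):

from clearml import Task

task = Task.init(project_name="debug", task_name="poetry-env-test")

# Assumed usage of docker_setup_bash_script: these lines would run inside
# the container before the task starts; image and paths are placeholders
task.set_base_docker(
    docker_image="python:3.9",
    docker_setup_bash_script=[
        "export PATH=/root/.local/bin:$PATH",  # make poetry visible
        "curl -sSL https://install.python-poetry.org | python3 -",
        "poetry install",
    ],
)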
🙂
Related to that, is it possible to do Dataset.add_external_files() with source_url and destination_url being two separate Azure storage containers?
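Something like this is what I have in mind (account and container names are made up; as far as I understand, add_external_files only registers links to the source, while output_uri decides where the dataset itself is stored):

from clearml import Dataset

dataset = Dataset.create(dataset_name="my-dataset", dataset_project="debug")

# Register links to files living in the source container (no copy is made)
dataset.add_external_files(
    source_url="azure://myaccount.blob.core.windows.net/container-source/data/"
)

# Upload the dataset state to a different destination container
dataset.upload(output_uri="azure://myaccount.blob.core.windows.net/container-dest")
dataset.finalize()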
These changes reflect the modifications I have in my working tree (not committed, not staged with git add). But I would like to remove this uncommitted section from ClearML and not be blocked by it.
Sure, here is the updated clearml.conf file of the AWS autoscaler instance:
agent {
    vcs_cache.enabled: false
    package_manager: {
        type: poetry,
        poetry_version: "1.4.2",
    }
}
sdk {
    development {
        store_code_diff_from_remote: false,
    }
}
I see uncommitted changes, whereas I would like to see none.
I basically would like to know if we can serve the model without the TensorRT format, which is highly efficient but more complicated to produce.
How do you explain that it works when I SSH-ed into the same AWS container instance from the autoscaler?
Using a pyenv virtual env then exporting LOCALPYTHON env var
Yes should be correct. Inside the bash script of the task.
@<1523701070390366208:profile|CostlyOstrich36> @<1523701087100473344:profile|SuccessfulKoala55> I tried with a dummy repo, using ONLY the python and stripe packages in the pyproject.toml.
Here is my result (still failing):
Poetry Enabled: Ignoring requested python packages, using repository poetry lock file!
Creating virtualenv debug in /root/.clearml/venvs-builds/3.9/task_repository/clearmldebug.git/.venv
Using virtualenv: /root/.clearml/venvs-builds/3.9/task_repository/clearmldebug.git/...
This is really extremely hard to debug. I am thinking of creating another repo and iterating on the packages to hopefully find the problem, but it will take ages.
When the task finally failed, I was kicked out of the container.
How do I set that up inside clearml.conf, or somewhere else, so it knows which credentials to load?
Thanks, my question was indeed dumb 🙂 Thanks for the reply!
I would like to know if it is possible to run any PyTorch model with the basic docker-compose file? Without Triton?
Sorry to come back to this! Regarding the Kubernetes Serving helm chart, I can see horizontal scaling of docker containers. What about vertical scaling? Is it implemented? More specifically, where is the SKU of the VMs in use defined?
@<1523701118159294464:profile|ExasperatedCrab78> do you have any inputs for this one? 🙂
Prerequisites, PyTorch models require Triton engine support, please use docker-compose-triton.yml / docker-compose-triton-gpu.yml or if running on Kubernetes, the matching helm chart.
Great, and can we specify a ClearML environment variable that directly overrides the Azure config from clearml.conf, or do something similar? I do not want to ask every engineer on my team to modify their clearml.conf file. @<1523701070390366208:profile|CostlyOstrich36> Thanks
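Something like this is what I had in mind, e.g. in a shared bootstrap module (I am assuming ClearML falls back to the standard AZURE_STORAGE_ACCOUNT / AZURE_STORAGE_KEY environment variables when clearml.conf has no azure section, as the storage docs suggest):

import os

# Assumption: these env vars are picked up by ClearML's Azure storage driver
os.environ["AZURE_STORAGE_ACCOUNT"] = "myaccount"
os.environ["AZURE_STORAGE_KEY"] = "<storage-account-key>"

from clearml import StorageManager

# Then storage calls work without per-engineer clearml.conf edits
local_copy = StorageManager.get_local_copy(
    remote_url="azure://myaccount.blob.core.windows.net/container/some/file.bin"
)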
The flask command is run inside the git project, which is the strange part: it is executed in ~/code/repo/ as flask train ...
@<1523701087100473344:profile|SuccessfulKoala55> Do you think it is possible to run the AWS autoscaler in docker mode, and do the cloning and installation inside the init bash script of the task?
Thank you! I will try this 🙂
Thanks! So regarding question 2, it means that I can spin up a K8s cluster with Triton enabled, and by specifying the type of model while creating the endpoint, it will decide whether or not to use the Triton engine.
Linked to that, is the Triton engine expecting the tensorrt
format, or is it just an improvement step compared to other model weights?
Finally, last question (I swear 😛): how is the serving-on-Kubernetes flow supposed to look? Is it something like this:
- Create en...
One possible solution I could see as well is moving the data storage to an S3 bucket to improve download performance, as it is the same cloud provider: no transfer latency.
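i.e. something like this on our side (bucket name is a placeholder):

from clearml import Dataset

dataset = Dataset.create(dataset_name="training-data", dataset_project="debug")
dataset.add_files(path="./data")

# Store the dataset in an S3 bucket in the same provider/region as the workers
dataset.upload(output_uri="s3://my-bucket/clearml-datasets")
dataset.finalize()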
I have my Task.init inside a train() function inside the flask command. We basically have flask commands that allow us to trigger specific behaviors. When running it locally, everything works properly except the repository information. The use case is linked to the way our codebase works: for example, I am going to run flask train {arguments}
and it will trigger the training of a model (that I want to track).
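The structure looks roughly like this (names simplified):

from flask import Flask
from clearml import Task

app = Flask(__name__)

@app.cli.command("train")
def train():
    # Task.init runs inside the flask CLI command, so this is the context
    # in which ClearML tries to auto-detect the repository
    task = Task.init(project_name="my-project", task_name="training")
    # ... actual training code ...

# invoked as: flask train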
I stopped the autoscaler and deleted it manually. I did it because I want to test...