@<1523701087100473344:profile|SuccessfulKoala55> Do you think it is possible to run docker mode in the AWS autoscaler, and to add the cloning and installation inside the init bash script of the task?
How do you explain that it works when I ssh-ed into the same AWS container instance from the autoscaler?
I will check that. Do you think we could bypass it using Task.create and passing all the needed params?
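Something roughly like this is what I had in mind (a minimal sketch; the repo URL, docker image and queue name are placeholders, and I am assuming parameters like docker_bash_setup_script exist in the clearml version we run):
```python
from clearml import Task

# Minimal sketch; repo URL, branch, script, image and queue name are hypothetical.
task = Task.create(
    project_name="debug",
    task_name="flask-train",
    repo="git@github.com:my-org/clearmldebug.git",
    branch="main",
    script="app.py",
    packages=["flask", "clearml"],          # explicit packages instead of the poetry lock file
    docker="python:3.9",                    # run the task in docker mode
    docker_bash_setup_script="pip install poetry==1.4.2",  # init commands run inside the container
)
Task.enqueue(task, queue_name="aws-autoscaler")
```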
The flask command is run inside the git project, which is the strange part. It is executed in ~/code/repo/ as flask train ...
I also did that in the following way:
- I put a sleep inside the bash script
- I SSH-ed into the fresh container and ran all the commands myself (cloning, installation), and again it worked...
Because I was SSH-ing into it before the failure. When poetry fails, it installs everything using pip.
When the task finally failed, I was kicked out of the container.
Yes indeed, but what about the possibility of doing the clone/poetry installation ourselves in the init bash script of the task?
Yes, that should be correct. Inside the bash script of the task.
And I just tried with Python 3.8 (default version of the image) and it still fails.
Poetry Enabled: Ignoring requested python packages, using repository poetry lock file!
Creating virtualenv debug in /root/.clearml/venvs-builds/3.8/task_repository/clearmldebug.git/.venv
Using virtualenv: /root/.clearml/venvs-builds/3.8/task_repository/clearmldebug.git/.venv
2023-04-18 15:03:52
Installing dependencies from lock file
Finding the necessary packages for the current system
Package operation...
@<1523701070390366208:profile|CostlyOstrich36> poetry is installed as part of the bash script of the task.
The init script of the AWS autoscaler only contains three export variables I set.
One possible solution I could see as well is moving the data storage to an S3 bucket to improve download performance, since it is the same cloud provider, so no cross-provider transfer latency.
How can I make sure that the Python version is correct?
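For what it's worth, the quickest check I know is logging the interpreter at the very start of the entry point (trivial sketch below); if I understand correctly, the agent side can also be pinned via agent.python_binary in clearml.conf:
```python
import sys

# Print the interpreter the task actually runs under, to compare with the expected version.
print(f"Running under Python {sys.version.split()[0]} at {sys.executable}")
```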
Ok. I spun up three AWS autoscalers, each with a different conf. I also fixed a submodule issue in my repo (which I believed was the cause of the git diff problem), and every run now passes this stage and fails later (a different problem). So I think store_code_diff_from_remote is of no help to me, but my problem is gone...
I tried that too. I do not get any more logs from the ClearML agent 😞
For now, I am uploading to the freely available ClearML server to store my data, but I will soon use S3 buckets instead. So the question applies to both use cases 🙂
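For the S3 case, I was thinking of something along these lines (a sketch, with hypothetical bucket and project names; dropping output_url would keep everything on the ClearML fileserver as today):
```python
from clearml import Dataset

# Sketch only: bucket, project and folder names are placeholders.
dataset = Dataset.create(dataset_name="training-data", dataset_project="debug")
dataset.add_files(path="./data")                              # local data folder
dataset.upload(output_url="s3://my-bucket/clearml-datasets")  # store the files on S3
dataset.finalize()
```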
It just allows me to have access to poetry and python installed on the container.
Sorry to come back to this! Regarding the Kubernetes Serving helm chart, I can see horizontal scaling of docker containers. What about vertical scaling? Is it implemented? More specifically, where is the SKU of the VMs in use defined?
Prerequisites, PyTorch models require Triton engine support, please use docker-compose-triton.yml / docker-compose-triton-gpu.yml or if running on Kubernetes, the matching helm chart.
I tried playing with those, but I do not manage to have any effect on the source code detection. I can modify the env variables, but nothing changes on the ClearML server, unfortunately.
These changes reflect the modifications I have in my working tree (not committed, not added to the staging area with git add). But I would like to remove this uncommitted section from ClearML and not be blocked by it.
Sure, here is the updated clearml.conf file of the AWS autoscaler instance:
agent {
    vcs_cache.enabled: false
    package_manager: {
        type: poetry,
        poetry_version: "1.4.2",
    }
}
sdk {
    development {
        store_code_diff_from_remote: false,
    }
}
I see uncommitted changes, whereas I would like to see none.
If I may also ask about another issue in this thread that is taking up a lot of my time:
Poetry Enabled: Ignoring requested python packages, using repository poetry lock file!
Creating virtualenv alfred-Rp77Shgw-py3.9 in /root/.cache/pypoetry/virtualenvs
Installing dependencies from lock file
2023-04-17 10:17:57
Package operations: 351 installs, 1 update, 1 removal
failed installing poetry requirements: Command '['poetry', 'install', '-n']' returned non-zero exit status 1.
Ignorin...
I am currently trying with a new dummy repo, iterating over the dependencies in the pyproject.toml.
Okay, thanks @<1523701205467926528:profile|AgitatedDove14>. And what would be the advantage of using clearml-server on K8s compared to the ClearML hosted one?
I have my Task.init inside a train() function inside the flask command. We basically have flask commands that allow us to trigger specific behaviors. When running it locally, everything works properly except the repository information. The use case is linked to the way our codebase works. For example, I run flask train {arguments} and it triggers the training of a model (that I want to track).
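Simplified, our setup looks something like this (names are illustrative, the real commands do much more):
```python
import click
from flask import Flask
from clearml import Task

app = Flask(__name__)

def run_training(epochs):
    # placeholder for the actual model training code
    pass

@app.cli.command("train")              # invoked as: flask train --epochs 10
@click.option("--epochs", default=10)
def train(epochs):
    # Task.init is called inside the flask command rather than in a top-level
    # script, which seems to be why the repository info is not auto-detected.
    task = Task.init(project_name="debug", task_name="flask-train")
    task.connect({"epochs": epochs})
    run_training(epochs)
```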
I stopped the autoscaler and deleted it manually. I did it because I wanted to test...
Yes, I take the export statements from the bash script of the task.