so I guess it needs to be set inside the container
We need to focus first on why it takes minutes to reach "Using env".
In our case, we have a container that has all packages installed straight in the system, no venv in the container. Thus we don't use CLEARML_AGENT_SKIP_PIP_VENV_INSTALL
But then when a task is pulled, I can see all the steps like git clone, a bunch of "Requirement already satisfied" .... There may be some odd package that needs to be installed because one of our DS is experimenting ... But all that we can see what is...
we are not using Docker Compose. We are deploying in Azure with each database as a standalone service
Found a trick to get an empty "Installed packages" section: clearml.Task.force_requirements_env_freeze(force=True, requirements_file="/dev/null")
Not sure if this is the right way or not ...
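Roughly how it looks in the training script (project/task names are just placeholders, so take it as a sketch):

from clearml import Task

# Force the "Installed packages" section to be frozen from an empty requirements file.
# Must be called before Task.init().
Task.force_requirements_env_freeze(force=True, requirements_file="/dev/null")

task = Task.init(project_name="my_project", task_name="my_task")  # placeholder names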
Actually, I can set agent.package_manager.pip_version="" in the clearml.conf
And after reading the doc 4x, I can use the env var: CLEARML_AGENT__AGENT__PACKAGE_MANAGER__PIP_VERSION
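A rough sketch of how that can be set before starting the agent (the queue name is a placeholder; the CLEARML_AGENT__<section>__<key> pattern with double underscores maps to the matching clearml.conf entry):

import os
import subprocess

# Override agent.package_manager.pip_version for the agent started below
os.environ["CLEARML_AGENT__AGENT__PACKAGE_MANAGER__PIP_VERSION"] = ""
subprocess.run(["clearml-agent", "daemon", "--queue", "default"], check=True)  # "default" is a placeholder queue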
I also use this: None
Which can give more control
Just a +1 here. When we use the same name for 3 different images, the thumbnails show 3 different images, but when clicking on any of them, only one is displayed. No way to display the others
what is the difference between VSCode via clearml-session and VSCode via the Remote SSH extension?
in that case yes. What happens in docker mode is:
you run a clearml agent, which then receives a task
it creates a container
installs another agent inside that container
then runs that second agent inside the container
that second agent then pulls the task and does the usual build/install
CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true needs to be set on that second agent somehow ...
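One way (I think) is to inject it from the task side, so the first agent passes it into the container it creates. A rough sketch, assuming a recent clearml SDK where set_base_docker takes docker_image/docker_arguments (the image name is a placeholder):

from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")  # placeholder names
# Ask the agent to add "-e CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1" when it starts the container,
# so the second agent inside skips the python env install
task.set_base_docker(
    docker_image="python:3.10",  # placeholder image
    docker_arguments=["-e", "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1"],
)

The other option, I believe, is to set it globally on the spawning agent via agent.extra_docker_arguments in its clearml.conf.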
I also have the same issue. Default arguments are fine, but all arguments supplied on the command line become duplicated!
I really like how you make all this decoupled !! 🎉
I understand for clearml-agent
What I mean is that I have 2 self-deployed servers. I want to switch between the 2 configs when running the code locally, not inside the agent
Found it: None
And credentials are set with:
sdk {
    azure.storage {
        containers: [
            {
                account_name: "account"
                account_key: "xxxx"
                container_name: "clearml"
            }
        ]
    }
}
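And then on the SDK side I just point the task output there, something like this (I'm not 100% sure this is the exact URI format, double check against the docs; project/task names are placeholders):

from clearml import Task

task = Task.init(
    project_name="my_project",  # placeholder
    task_name="my_task",        # placeholder
    # assumed format: azure://<account>.blob.core.windows.net/<container>
    output_uri="azure://account.blob.core.windows.net/clearml",
)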
@<1523701070390366208:profile|CostlyOstrich36> I would like to point to Azure Blob Storage, what kind of URL schema should I use? And also, where do you configure the credentials for the ClearML server to access Azure Blob as the file_server? I couldn't find any documentation around this topic 😞
TIA
can you make train1.py use clearml.conf.server1 and train2.py use clearml.conf2?? In which case I would be interested @<1523701087100473344:profile|SuccessfulKoala55>
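One thing that should work (I think) is selecting the config per process with the CLEARML_CONFIG_FILE environment variable, roughly like this (paths and names are placeholders):

import os

# Point this process at a specific config before clearml is imported/used
os.environ["CLEARML_CONFIG_FILE"] = "/path/to/clearml.conf.server1"  # placeholder path

from clearml import Task
task = Task.init(project_name="my_project", task_name="train1")  # placeholder names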
Just keep in mind that your bottleneck will be the transfer rate. So mounting will not save you anything, as you still need to transfer the whole dataset to your GPU instance sooner or later.
One solution is as Jake suggests. The other can be to pre-download the data to your instance with a cheap CPU-only instance type, then restart the instance with a GPU.
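If the data is registered as a ClearML Dataset, the pre-download step can be as simple as this sketch (dataset name/project are placeholders):

from clearml import Dataset

# Download (and cache) the dataset locally on the cheap CPU instance,
# so the same local copy can be reused once the instance is restarted with a GPU
local_path = Dataset.get(dataset_name="my_dataset", dataset_project="my_project").get_local_copy()
print(local_path)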
may be specific to fastai as I cannot reproduce it with another training using yolov5
About the caching: how does it work? ClearML maintains its own cache and monitors if any of your code changes? Even code that gets changed inside an import?
@<1523701087100473344:profile|SuccessfulKoala55> Actually it failed now: it failed to talk to our storage in Azure:
ClearML Task: created new task id=c47dd71dea2f421db05647a21d78ed26
2024-01-25 21:45:23,926 - clearml.storage - ERROR - Failed uploading: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2024-01-25 21:46:48,877 - clearml.storage - WARNING - Storage helper problem for .clearml.0149daec-7a03-4853-a0cd-a7e2b295...
In summary:
Spin down the local server
Back up the data folder
In the cloud, extract the data backup
Spin up the cloud server
following your example, if the seeds are hard-coded in the code, then the git hash will detect whether changes happened and whether the step needs to be rerun or not
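For reference, a minimal sketch of where that caching is enabled on a pipeline step (pipeline/step names and the step function are placeholders):

from clearml import PipelineController

def preprocess():  # placeholder step function
    return 42

pipe = PipelineController(name="my_pipeline", project="my_project", version="1.0.0")  # placeholders
# With cache_executed_step=True the step is skipped (and its previous outputs reused)
# when the code -- git commit/diff -- and the step inputs are unchanged
pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    cache_executed_step=True,
)
pipe.start_locally(run_pipeline_steps_locally=True)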
if you are on github.com, you can use fine-grained PAT tokens to limit access to the minimum. Although the token will be tied to an account, it's quite easy to change to another one from another account.
@<1523701087100473344:profile|SuccessfulKoala55> it is set to "all", as:
NV_LIBCUBLAS_VERSION=12.2.5.6-1
NVIDIA_VISIBLE_DEVICES=all
CLRML_API_SERVER_URL=https://<redacted>
HOSTNAME=1b6a5b546a6b
NVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=qua...
I don't think agents are aware of each other. Which means that you can have as many agents as you want, and depending on your task usage, they will be fighting for CPU and GPU ...
but afaik this only works locally and not if you run your task on a clearml-agent!
Isn't the agent using the same clearml.conf?
We have our agent running tasks and uploading everything to the cloud. As I said, we don't even have the file server running
@<1523701087100473344:profile|SuccessfulKoala55> Should I raise a github issue ?
from what I understand, docker mode was designed for apt-based images and also for running as root inside the container.
We have containers that are not apt-based and that don't run as root
We also do some "start up" work that fetches credentials from Key Vault prior to running the agent
you should be able to explicitly upload a file of your choice as an artefact using something like this: None
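Roughly like this (project/task names and the file path are placeholders, just a sketch):

from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")  # placeholder names
# Explicitly upload a file of your choice as an artifact
task.upload_artifact(name="my_file", artifact_object="/path/to/file.csv")  # placeholder path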
