I am playing around with agents on our self-hosted ClearML, and I'm having some trouble with the environment configuration. It's a bit magical which version of torch I end up with on the agents.
Our setup:
- Server on one VM
- 5x VMs, each hosting its own agent (technically through an Azure VMSS; eventually this should be replaced by on-prem hardware, but currently just testing things out)
- Local dev on my own workstation
Background: I've created a Task with some model training (using Torch & Lightning) where I call task.execute_remotely(...) to send execution to a remote worker once everything initializes OK. This works fine. Until it doesn't 🙂 I started setting up hyperparameter optimization, inspired by https://clear.ml/docs/latest/docs/guides/optimization/hyper-parameter-optimization/examples_hyperparam_opt/ , and spawned an optimizer based on the aforementioned task. The optimizer enqueues lots of tasks, and my VMs can start paying their bills.
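For context, the flow is roughly the sketch below. Project, queue, parameter names and ranges are placeholders, not my actual config, and it assumes the standard HyperParameterOptimizer API from the linked example:

```python
# Sketch of the two scripts involved (all names and ranges are placeholders).

# --- train.py: the base training task ---
from clearml import Task

task = Task.init(project_name="my-project", task_name="train-base")
# once local initialization succeeds, hand execution over to an agent queue
task.execute_remotely(queue_name="default", exit_process=True)
# ... Torch & Lightning training continues on the worker ...

# --- optimize.py: the HPO controller, spawning clones of the base task ---
from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer

optimizer = HyperParameterOptimizer(
    base_task_id="<base-task-id>",          # the training task above
    hyper_parameters=[
        DiscreteParameterRange("General/batch_size", values=[32, 64, 128]),
    ],
    objective_metric_title="val_loss",
    objective_metric_series="val_loss",
    objective_metric_sign="min",
    execution_queue="default",              # the queue my 5 agents listen to
    max_number_of_concurrent_tasks=5,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```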
However, jobs from one worker consistently fail. It appears that this worker has cached a different version of torch than the others (and than what I'm running locally), causing my scripts to fail. Looking into the tasks' installed packages, the failing worker lists:
https://download.pytorch.org/whl/cpu/torch-1.13.1%2Bcpu-cp310-cp310-linux_x86_64.whl
while others list:
torch @ file:///home/azureuser/.clearml/pip-download-cache/cu0/torch-1.12.1%2Bcpu-cp310-cp310-linux_x86_64.whl
and my local dev environment:
torch @ file:///Users/runner/miniforge3/conda-bld/pytorch-recipe_1669983320380/work
(using conda)
Btw. the various workers are 100% identical.
The reason for the failure is known and beside the point. The point is that I do not have full control over which package versions are installed.
Is there something I’m missing? Is it because we’re using conda for dev?
I'm assuming that when I send a task to execute remotely, it does a pip freeze in the background and then does its best to replicate that environment on the workers, which in turn have their own caches, etc.
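One thing I might try (if I understand the docs correctly) is pinning the torch version explicitly so the agents don't resolve it themselves. A minimal sketch, assuming Task.add_requirements works the way I think it does (and that it must be called before Task.init); the project/task/queue names are just placeholders:

```python
from clearml import Task

# Pin torch explicitly so every agent should resolve the same version,
# instead of whatever its cache happens to contain.
Task.add_requirements("torch", "1.12.1")

task = Task.init(project_name="my-project", task_name="train-base")  # placeholder names
task.execute_remotely(queue_name="default", exit_process=True)
```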
We currently use Azure Machine Learning for compute, where we specify an environment.yaml conda specification and pre-build our environment for remote usage, hence we've built our codebases around conda.
Note: as I'm writing this, it occurs to me that I haven't tried switching the agents from pip to conda mode, which could possibly solve the issue. I can check this tomorrow. But it still wouldn't explain the inconsistency across workers.
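For reference, if switching the agents to conda mode turns out to be the fix, I believe it's this setting in each agent's clearml.conf (not verified yet on my side):

```
agent {
    package_manager {
        # supported options include pip and conda;
        # switching to conda so the agent builds the env the same way we do locally
        type: conda,
    }
}
```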