GiganticTurtle0, let me add some background. The idea is that at some point you had your code running on your machine (when developing it, for example),
and when you actually executed the code in development you called Task.init (to track the development process, for example). This Task.init call did the analysis of the code and its python package dependencies and stored it on the Task. Then when you clone the Task, it already lists all the python packages your code directly imports (see the "Installed Packages" section).
When the agent needs to run this Task, it will create a new venv, clone the code, apply uncommitted changes, and install all required packages as listed in "Installed Packages".
The agent will also update the Task with the full (pip freeze) list of python packages installed inside this new venv, so the run is later fully reproducible.
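To make that concrete, the development-time call that triggers the dependency analysis is just the usual Task.init (the project/task names below are placeholders):
```python
from clearml import Task

# Task.init scans the running script's imports and records them
# under the Task's "Installed Packages" section
task = Task.init(project_name="examples", task_name="dev_run")
```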
The caching mechanism basically skips the creation of the venv if the host (i.e. the machine running the agent) has already created the exact same venv before (by default the last 10 venvs are stored).
After all that background, back to the point at hand,
Yep, I've already unmarked the venv caching setting,
Just making sure, after unmarking it in the conf, did you restart the agent (i.e. stop it and start it again; the conf is loaded only when the process starts)?
Maybe it has to do with the fact that I am not working on a Git repository and ClearML is not able to locate the requirements.txt file?
By default the agent will only install what is listed in the "Installed Packages" section of the Task (see Execution Tab -> Installed Packages).
If you press the "Clear" button (hover over the section to see it) and clear the entire section, the agent will instead look for a "requirements.txt" inside the repository and use that one.
Does that make sense ?
I mean that I have a script for a data preprocessing task where I need the following dependencies:
```python
import sys
from pathlib import Path
from contextlib import contextmanager

import numpy as np
from clearml import Task

with add_temporary_module_search_path("/home/user/myclearML/"):
    from helpers import (
        read_netcdf_dataset,
        write_records,
    )
```
However, the xarray package is a dependency of the helpers module, which is required by the read_netcdf_dataset function. Since helpers is a custom module that is imported into the preprocessing task script, ClearML is unable to detect it as a dependency and, therefore, does not install it in the environment it creates for the preprocessing task.
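(The add_temporary_module_search_path helper itself isn't shown above; a minimal sketch of what such a context manager usually looks like, just for context:)
```python
import sys
from contextlib import contextmanager

@contextmanager
def add_temporary_module_search_path(path):
    # temporarily prepend a folder to sys.path so local modules (e.g. helpers) can be imported
    sys.path.insert(0, str(path))
    try:
        yield
    finally:
        sys.path.remove(str(path))
```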
That is the reason why I add the Task.add_requirements part, to indicate to the agent that I will need those dependencies. The problem is that the agent reinstalls all the requirements for the next task (the training task), even though both tasks share the same environment. So my question is whether there is a way to tell the PipelineController to only generate the packages environment once and use it in the training task as well.
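For reference, that part looks roughly like this; Task.add_requirements has to be called before Task.init so the agent picks it up (project/task names are placeholders):
```python
from clearml import Task

# explicitly declare a package the automatic import analysis cannot see,
# because it is only imported inside the custom helpers module
Task.add_requirements("xarray")

task = Task.init(project_name="examples", task_name="data_preprocessing")
```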
When you said clearml-agent initial setup, are you talking about the agent section in the clearml.conf or the CLI instructions? If it is the second case, I am starting the agent with the basic command: clearml-agent daemon --queue default
Are there any other settings I should specify to the agent?
Thanks for the background. I now have a big picture of the process ClearML goes through. It was helpful in clarifying some of the questions that I didn't know how to ask properly. So, the idea is that a base task is already stored on the ClearML server for later use in a production environment. This is because such a task will always be created during the model development process.
Going back to my initial question, as far as I understood, if the environment caching option is enabled, the clearml-agent will not only avoid reinstalling all the packages of an environment shared by all the tasks of the same pipeline, but it will also be able to re-use an identical environment that has already been used for other tasks stored on the server.
So, the only remaining part of my question is whether there is a way for Task.init to automatically detect the packages imported in a custom module (helpers.py in this case), which contains functions and classes that I need to use in the script where I call Task.init; for them to work, the agent must install the packages imported in this custom module.
From what I understood, ClearML creates a virtual environment from scratch for each task it runs. To detect the dependencies of each script, apparently it inspects the script for its imports and for the packages specified in Task.add_requirements. You mean that's not the convenient way for ClearML to create the environments for each task? What is the right way to proceed in this case?
I "think" you are referring to the venvs cash, correct?
If so, then you have to set it in the clearml.conf running on the host (agent) machine, make sense ?
with PipelineController, is there any way to avoid creating a new development environment for each step of the pipeline?
You are in luck, we are expanding the PipelineController to support functions, basically allowing you to run the step on the node running the entire pipeline, but I'm not sure this covers all angles of the problem.
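As a rough sketch of what that function-based API looks like in later clearml releases (step names, paths, and project names here are made up, and the exact arguments may differ by version):
```python
from clearml import PipelineController

def preprocess(raw_path):
    # imports used inside the function are what the agent installs for this step
    import xarray as xr
    return xr.open_dataset(raw_path)

pipe = PipelineController(name="function_pipeline", project="examples", version="0.0.1")
pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    function_kwargs=dict(raw_path="data.nc"),
    function_return=["dataset"],
)
# run the steps on the same node that runs the pipeline logic
pipe.start_locally(run_pipeline_steps_locally=True)
```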
My main question here is, who/how is the initial setup created by the clearml-agent?
I would like to be more efficient and re-use that environment once configured in the first task.
You have full venv caching, which means the second time a node creates the same env it will reuse the previous one. By default this is turned off, because the storage requirement on the node might increase (a copy of the entire python env might be a few GB).
Un-comment this line to activate it:
https://github.com/allegroai/clearml-agent/blob/aede6f4bac71c8fc56e7cf982318a48527953a3c/docs/clearml.conf#L104
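For context, the relevant block in clearml.conf looks roughly like this (exact defaults may differ between agent versions); un-commenting the path line is what turns the caching on:
```
agent {
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum free space (GB) required to allow a new cache entry
        free_space_threshold_gb: 2.0
        # un-comment the path to enable virtual environment caching
        path: ~/.clearml/venvs-cache
    }
}
```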
Yep, I've already unmarked the venv caching setting, but still the agent reinstalls all the requirements again.
Maybe it has to do with the fact that I am not working on a Git repository and ClearML is not able to locate the requirements.txt file?
I see, that means xarray is not an actual package but a folder added to the python path.
This explains why Task.add_requirements fails, as it is supposed to add python packages to the equivalent of "requirements.txt" ...
Is the folder part of the git repository? How would you pass it to the remote machine the clearml-agent is running on?
Hi GiganticTurtle0
The problem is that the packages that I define in 'required_packages' are not in the scripts corresponding
What do you mean by that? Is "Xarray" a wheel package? Is it installable from a git repo (example: pip install git+http://github.com/user/xarray/axrray.git)?
Hmm what do you mean? Isn't it under installed packages?
Hi Martin,
Actually Task.add_requirements behaves as I expect, since that part of the code is in the preprocessing script, and for that task it does install all the specified packages. So, my question could be rephrased as the following: when working with PipelineController, is there any way to avoid creating a new development environment for each step of the pipeline?
According to the clearml-agent page in the official ClearML documentation ( https://clear.ml/docs/latest/docs/clearml_agent ), this seems to be the standard way in which the clearml-agent operates (create a new environment for each task). However, as all the tasks I have added to the pipeline could run in the same environment, I would like to be more efficient and re-use that environment once it is configured in the first task.
Since I am testing ClearML, I do not yet have a Git repository linked to the project. I am working on VS Code from a local device connected via SSH to a remote server. I spin up the agent on the remote machine and it is listening to the queue where the pipeline is placed.
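For reference, the kind of controller-based pipeline being described here generally looks something like the following (project, task, and queue names are placeholders, and the constructor arguments may differ slightly between clearml versions):
```python
from clearml import PipelineController

pipe = PipelineController(name="preprocess_and_train", project="examples", version="0.0.1")
pipe.set_default_execution_queue("default")

# each step clones an existing base task stored on the ClearML server
pipe.add_step(
    name="preprocess",
    base_task_project="examples",
    base_task_name="data_preprocessing",
)
pipe.add_step(
    name="train",
    parents=["preprocess"],
    base_task_project="examples",
    base_task_name="training",
)

pipe.start()
```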