Answered

Hello folks!

I don't know if this issue has already been addressed. I have a basic PipelineController script with two steps: one task is for preprocessing purposes and the other for training a model. Currently I am placing the code related to the package requirements in the script corresponding to step 1 of the pipeline (before Task.init):

required_packages = ["scipy", "xarray", "dask", "tensorflow"]
for package_name in required_packages:
    Task.add_requirements(package_name)
However, during the training step I get an error saying one of the packages (in this case, xarray) that was installed in the preprocessing step is not available now. In an attempt to solve this, I tried to put the above snippet in the script where I define the pipeline, but the agent does not install the packages when I run that script.

The problem is that the packages I define in required_packages are not imported in the scripts corresponding to the Tasks, but in custom modules that I import into those scripts, and that is why ClearML is not able to log the packages.

The code structure is similar to that of the pipeline_controller example: https://clear.ml/docs/latest/docs/guides/pipeline/pipeline_controller/

Is there a convenient place to specify the package requirements so that the installed packages can be used for the subsequent Tasks?

  
  
Posted 3 years ago

Answers 12


Hmm, what do you mean? Isn't it under Installed Packages?

  
  
Posted 3 years ago

Hi Martin,

Actually Task.add_requirements behaves as I expect, since that part of the code is in the preprocessing script, and for that task it does install all the specified packages. So my question could be rephrased as follows: when working with PipelineController, is there any way to avoid creating a new development environment for each step of the pipeline?

According to the clearml-agent page in the official ClearML documentation ( https://clear.ml/docs/latest/docs/clearml_agent ), this seems to be the standard way in which the clearml-agent operates (creating a new environment for each task). However, since all the tasks I have added to the pipeline could run in the same environment, I would like to be more efficient and re-use the environment configured in the first task.

Since I am testing ClearML, I do not yet have a Git repository linked to the project. I am working on VS Code from a local device connected via SSH to a remote server. I spin up the agent on the remote machine and it is listening to the queue where the pipeline is placed.

  
  
Posted 3 years ago

From what I understood, ClearML creates a virtual environment from scratch for each task it runs. To detect the dependencies of each script, it apparently inspects the script for its imports plus the packages specified via Task.add_requirements. You mean that's not the intended way for ClearML to create the environments for each task? What is the right way to proceed in this case?

  
  
Posted 3 years ago

Thanks for the background. I now have a big-picture view of the process ClearML goes through, which helped clarify some of the questions I didn't know how to ask properly. So, the idea is that a base task is already stored on the ClearML server for later use in a production environment, because such a task will always be created during the model development process.

Going back to my initial question, as far as I understood, if the environment caching option is enabled, the clearml-agent will not only avoid reinstalling all the packages of an environment shared by all the tasks of the same pipeline, but it will also be able to re-use an identical environment that has already been created for other tasks stored on the server.

So, the only remaining part of my question is whether there is a way for Task.init to automatically detect the packages imported in a custom module (helpers.py in this case). That module contains functions and classes that I need to use in the script where I call Task.init, and for them to work the agent must install the packages imported in that custom module.

  
  
Posted 3 years ago

with PipelineController, is there any way to avoid creating a new development environment for each step of the pipeline?

You are in luck, we are expanding the PipelineController to support functions, basically allowing you to run a step on the node running the entire pipeline, but I'm not sure this covers all angles of the problem.
My main question here is: who/how is the initial setup created by the clearml-agent?

I would like to be more efficient and re-use that environment once configured in the first task.

You have full venv caching, which means the second time a node creates the same env it will reuse the previous one. By default this is turned off because the storage requirements on the node might increase (a copy of the entire Python env might be a few GB).
Un-comment this line to activate it:
https://github.com/allegroai/clearml-agent/blob/aede6f4bac71c8fc56e7cf982318a48527953a3c/docs/clearml.conf#L104
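
For reference, the relevant block in the default clearml.conf looks roughly like this (a sketch; exact keys and comments may differ between agent versions); un-commenting the path entry is what enables the cache:

agent {
    # cached virtual environment folder
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space (GB) to allow a new cache entry
        free_space_threshold_gb: 2.0
        # un-comment to enable virtual environment caching
        # path: ~/.clearml/venvs-cache
    }
}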

  
  
Posted 3 years ago

I mean that I have a script for the data preprocessing task where I need the following dependencies:

import sys
from pathlib import Path
from contextlib import contextmanager

import numpy as np
from clearml import Task

with add_temporary_module_search_path("/home/user/myclearML/"):
    from helpers import (
        read_netcdf_dataset,
        write_records,
    )

However, the xarray package is a dependency of the helpers module, required by the read_netcdf_dataset function. Since helpers is a custom module that is imported into the preprocessing task script, ClearML is unable to detect it as a dependency and, therefore, does not install it in the environment it creates for the preprocessing task.

That is the reason why I added the Task.add_requirements part, to indicate to the agent that I will need those dependencies. The problem is that the agent reinstalls all the requirements for the next task (the training task), even though both tasks share the same environment. So my question is whether there is a way to tell the PipelineController to generate the package environment only once and use it in the training task as well.
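
Concretely, the top of my preprocessing script looks roughly like this (a minimal sketch; the project and task names are just placeholders):

from clearml import Task

# helpers.py imports xarray internally, so ClearML's import analysis cannot
# see that dependency from this script; add the indirect requirements explicitly
for package_name in ["scipy", "xarray", "dask", "tensorflow"]:
    Task.add_requirements(package_name)

# add_requirements is called before Task.init so the extra packages end up
# in this task's "Installed Packages" section
task = Task.init(project_name="myclearML", task_name="preprocessing")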

  
  
Posted 3 years ago

Yep, I've already un-commented the venv caching setting, but the agent still reinstalls all the requirements.
Maybe it has to do with the fact that I am not working in a Git repository and ClearML is not able to locate the requirements.txt file?

  
  
Posted 3 years ago

Hi GiganticTurtle0

The problem is that the packages that I define in 'required_packages' are not in the scripts corresponding

What do you mean by that? Is "Xarray" a wheel package? Is it installable from a git repo (example: pip install git+http://github.com/user/xarray/xarray.git )?

  
  
Posted 3 years ago

I see, that means xarray is not an actual package but a folder added to the Python path.
This explains why Task.add_requirements fails, as it is supposed to add Python packages to the equivalent of "requirements.txt" ...
Is the folder part of the git repository? How would you pass it to the remote machine the clearml-agent is running on?

  
  
Posted 3 years ago

GiganticTurtle0, let me add some background. The idea is that at some point you had your code running on your machine (when developing it, for example).
When you actually executed the code in development, you called Task.init (to track the development process, for example). This Task.init call did the analysis of the code and Python package dependencies and stored it on the Task. Then when you clone the Task, it already lists all the Python packages your code directly imports (see the "Installed Packages" section).
When the agent needs to run this Task, it will create a new venv, clone the code, apply uncommitted changes, and install all required packages as listed in "Installed Packages".
The agent will also update the Task with the full (pip freeze) list of Python packages installed inside this new venv, so it is later fully reproducible.
The caching mechanism basically skips the creation of the venv if the host (i.e. the machine running the agent) already created the exact same venv before (by default the last 10 venvs are stored).
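
As a minimal sketch of that first step (hypothetical project/task names), running something like this locally during development is what populates the "Installed Packages" section the agent later installs from; note that only direct imports of the script are detected:

import numpy as np          # a direct import, so the analysis picks it up
from clearml import Task

# Executing this locally analyzes the script's direct imports and stores them
# on the Task under "Installed Packages"
task = Task.init(project_name="examples", task_name="development run")

data = np.random.rand(100)  # ... the actual development code ...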

After all that background, back to the point at hand,

Yep, I've already un-commented the venv caching setting,

Just making sure: after un-commenting it in the conf, did you restart the agent (i.e. stop it and start it again)? The conf is loaded only when the process starts.

Maybe it has to do with the fact that I am not working in a Git repository and ClearML is not able to locate the requirements.txt file?

By default the agent will only install what is listed in the "Installed Packages" section of the Task (see Execution tab -> Installed Packages).
If you press the "Clear" button (hover over the section to see it) and clear the entire section, the agent will look for a requirements.txt inside the repository and use that instead.
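
For example, a repository-level requirements.txt along these lines (a sketch based on the packages mentioned above, versions omitted) would then be picked up:

scipy
xarray
dask
tensorflow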

Does that make sense?

  
  
Posted 3 years ago

I "think" you are referring to the venvs cash, correct?
If so, then you have to set it in the clearml.conf running on the host (agent) machine, make sense ?

  
  
Posted 3 years ago

When you said clearml-agent initial setup, are you talking about the agent section in the clearml.conf or the CLI instructions? If it is the second case, I am starting the agent with the basic command:
clearml-agent daemon --queue default
Are there any other settings I should specify for the agent?

  
  
Posted 3 years ago