So when you run it standalone it works fine? How are you creating the pipeline?
I created the pipeline on another machine via an interactive Python shell. The pipeline is picked up by ClearML, as I can see it in the web UI.
The error occurs on the worker node when it tries to initialize the environment for the pipeline.
Can you add a larger piece of the error/log? Do you have a code snippet that also reproduces this?
If I look at the code of ClearML's controller.py, I see that it uses a relative import, so it expects additional code in the same package folder.
I do not get more information than I just showed
If I go to the folder mentioned in the error and then one level up, I see no other packages present.
My worker node is not running in Docker; it is Linux with a conda environment.
Can you add the full log & the dependencies detected in original code? How are you building the pipeline?
Full console log of the worker:
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue b5fe1e72614247f7a77e5f6cdac35580
No tasks in Queues, sleeping for 5.0 seconds
task 30ad27a7a1244b6e8aa722d81cb6015c pulled from b5fe1e72614247f7a77e5f6cdac35580 by worker NLEIN-315GNH2:0
Running task '30ad27a7a1244b6e8aa722d81cb6015c'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.sppvun4p.txt', '/tmp/.clearml_agent_out.sppvun4p.txt'
Current configuration (clearml_agent v1.4.1, location: /tmp/.clearml_agent.gss2zozj.cfg):
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
agent.worker_id = NLEIN-315GNH2:0
agent.worker_name = NLEIN-315GNH2
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.venvs_dir = /home/thermo/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/thermo/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/thermo/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/thermo/.clearml/pip-cache
agent.docker_apt_cache = /home/thermo/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.git_user = MichaelThermo
agent.default_python = 3.8
agent.cuda_version = 0
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = https://api.clear.ml
api.web_server = https://app.clear.ml
api.files_server = https://files.clear.ml
api.credentials.access_key = HCH1PO3TF2EZY0226XUS
api.host = https://api.clear.ml
Executing task id [30ad27a7a1244b6e8aa722d81cb6015c]:
repository =
branch =
version_num =
tag =
docker_cmd =
entry_point = controller.py
working_dir = .
::: Python virtual environment cache is disabled. To accelerate spin-up time set agent.venvs_cache.path=~/.clearml/venvs-cache
:::
created virtual environment CPython3.8.0.final.0-64 in 239ms
creator CPython3Posix(dest=/home/thermo/.clearml/venvs-builds/3.8, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/thermo/.local/share/virtualenv)
added seed packages: pip==22.3, setuptools==65.5.0, wheel==0.37.1
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 22.3
Uninstalling pip-22.3:
Successfully uninstalled pip-22.3
Successfully installed pip-20.1.1
Collecting Cython
Using cached Cython-0.29.32-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.32
Collecting attrs==21.4.0
Using cached attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Collecting pathlib2==2.3.7.post1
Using cached pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting six==1.16.0
Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting clearml==1.7.2
Using cached clearml-1.7.2-py2.py3-none-any.whl (950 kB)
Collecting pyjwt<2.5.0,>=2.4.0; python_version > "3.5"
Using cached PyJWT-2.4.0-py3-none-any.whl (18 kB)
Collecting jsonschema>=2.6.0
Using cached jsonschema-4.17.0-py3-none-any.whl (83 kB)
Collecting urllib3>=1.21.1
Using cached urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
Collecting psutil>=3.4.2
Using cached psutil-5.9.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (295 kB)
Collecting python-dateutil>=2.6.1
Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting furl>=2.0.0
Using cached furl-2.1.3-py2.py3-none-any.whl (20 kB)
Collecting pyparsing>=2.0.3
Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Collecting Pillow>=4.1.1
Using cached Pillow-9.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Collecting numpy>=1.10
Using cached numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting PyYAML>=3.12
Using cached PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)
Collecting requests>=2.20.0
Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting pkgutil-resolve-name>=1.3.10; python_version < "3.9"
Using cached pkgutil_resolve_name-1.3.10-py3-none-any.whl (4.7 kB)
Collecting pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0
Using cached pyrsistent-0.19.1-py3-none-any.whl (57 kB)
Collecting importlib-resources>=1.4.0; python_version < "3.9"
Using cached importlib_resources-5.10.0-py3-none-any.whl (34 kB)
Collecting orderedmultidict>=1.0.1
Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Collecting charset-normalizer<3,>=2
Using cached charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
Collecting certifi>=2017.4.17
Using cached certifi-2022.9.24-py3-none-any.whl (161 kB)
Collecting idna<4,>=2.5
Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting zipp>=3.1.0; python_version < "3.10"
Using cached zipp-3.10.0-py3-none-any.whl (6.2 kB)
Installing collected packages: attrs, six, pathlib2, pyjwt, pkgutil-resolve-name, pyrsistent, zipp, importlib-resources, jsonschema, urllib3, psutil, python-dateutil, orderedmultidict, furl, pyparsing, Pillow, numpy, PyYAML, charset-normalizer, certifi, idna, requests, clearml
Successfully installed Pillow-9.3.0 PyYAML-6.0 attrs-21.4.0 certifi-2022.9.24 charset-normalizer-2.1.1 clearml-1.7.2 furl-2.1.3 idna-3.4 importlib-resources-5.10.0 jsonschema-4.17.0 numpy-1.23.4 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 psutil-5.9.3 pyjwt-2.4.0 pyparsing-3.0.9 pyrsistent-0.19.1 python-dateutil-2.8.2 requests-2.28.1 six-1.16.0 urllib3-1.26.12 zipp-3.10.0
Adding venv into cache: /home/thermo/.clearml/venvs-builds/3.8
Running task id [30ad27a7a1244b6e8aa722d81cb6015c]:
[.]$ /home/thermo/.clearml/venvs-builds/3.8/bin/python -u /home/thermo/.clearml/venvs-builds/3.8/code/controller.py
Summary - installed python packages:
pip:
- attrs==21.4.0
- certifi==2022.9.24
- charset-normalizer==2.1.1
- clearml==1.7.2
- Cython==0.29.32
- furl==2.1.3
- idna==3.4
- importlib-resources==5.10.0
- jsonschema==4.17.0
- numpy==1.23.4
- orderedmultidict==1.0.1
- pathlib2==2.3.7.post1
- Pillow==9.3.0
- pkgutil-resolve-name==1.3.10
- psutil==5.9.3
- PyJWT==2.4.0
- pyparsing==3.0.9
- pyrsistent==0.19.1
- python-dateutil==2.8.2
- PyYAML==6.0
- requests==2.28.1
- six==1.16.0
- urllib3==1.26.12
- zipp==3.10.0
Environment setup completed successfully
Starting Task Execution:
Traceback (most recent call last):
File "/home/thermo/.clearml/venvs-builds/3.8/code/controller.py", line 20, in <module>
from .job import LocalClearmlJob, RunningJob, BaseJob
ImportError: attempted relative import with no known parent package
Leaving process id 1273
DONE: Running task '30ad27a7a1244b6e8aa722d81cb6015c', exit status 1
Process failed, exit code 1
No tasks in queue b5fe1e72614247f7a77e5f6cdac35580
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue b5fe1e72614247f7a77e5f6cdac35580
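The ImportError in the traceback above is standard Python behavior whenever a file that uses a relative import is executed by path instead of as part of its package. Below is a minimal sketch that reproduces the same message outside ClearML. The package name `toypkg` and its contents are hypothetical and only mirror the `from .job import ...` line in the log. It does not explain why controller.py became the task's entry point in the first place.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def reproduce_relative_import_error():
    """Run a package module by file path and via -m; return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "toypkg"  # hypothetical package, not part of ClearML
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "job.py").write_text("VALUE = 42\n")
        (pkg / "controller.py").write_text(
            "from .job import VALUE\nprint(VALUE)\n"
        )
        # Executed by file path, like the entry point in the worker log:
        # Python has no parent package for the module, so '.job' cannot resolve.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "controller.py")],
            capture_output=True, text=True,
        )
        # Executed as a module from the package's parent directory:
        # the parent package is known and the relative import resolves.
        as_module = subprocess.run(
            [sys.executable, "-m", "toypkg.controller"],
            cwd=tmp, capture_output=True, text=True,
        )
        return as_script, as_module


if __name__ == "__main__":
    script_run, module_run = reproduce_relative_import_error()
    print("by path:", script_run.returncode, script_run.stderr.strip().splitlines()[-1])
    print("with -m:", module_run.returncode, module_run.stdout.strip())
```

The first run fails with the same "attempted relative import with no known parent package" message as the worker, while the second succeeds. That is consistent with the agent running a copy of the library's controller.py as a standalone script.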
Hi John, I've done more experiments and found that this only happens if you try to run the pipeline remotely, directly from the Python interpreter.
Initially, I had only one queue and one worker set up. If the pipeline's 'default execution queue' is the same as the queue used in pipe.start('the queue'), it gets into a sort of deadlock and waits forever.
When I set up two queues and two workers, set the default execution queue to one queue, and use the other queue for pipe.start, it all works.
But the behavior is different depending on whether you kick it off from a (local) Jupyter notebook or from a Python script.
In the case of the local Jupyter notebook, I create the pipeline and, when I start it, it all works without having to add the notebook to git.
But if I run exactly the same code from a Python script (which also calls start on the pipeline), the worker node tries to check out the script and run it (or fails if you haven't checked it into git yet).
The notebook behavior is how I expect it to work; the behavior via the script is strange.
FYI: this is my pipeline script
from clearml import PipelineController
pipe = PipelineController(name="My Pipe", project="Gridsquare-Training", version="0.0.5")
pipe.add_step(name="pipe step 1", base_task_project="Gridsquare-Training", base_task_name="remo2")
pipe.add_step(name="pipe step 2", base_task_project="Gridsquare-Training", base_task_name="remo2", parents=["pipe step 1"])
pipe.set_default_execution_queue("myqueue")
pipe.start("service")
(the 'remo2' task is an existing experiment)
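The script above already uses separate queues for the controller ("service") and the steps ("myqueue"). The single-queue hang reported earlier in the thread can be illustrated with a toy model. This is only a sketch under stated assumptions: each worker listens on one queue and runs one task at a time, and the controller task occupies its worker until its step has finished. The `simulate` function and the worker names are hypothetical. It is not ClearML's actual scheduler.

```python
from collections import deque


def simulate(controller_queue, step_queue, workers, max_ticks=20):
    """Toy model of a pipeline controller and one step sharing workers.

    `workers` maps a worker name to the queue it listens on.
    Returns True if the controller sees its step finish within
    `max_ticks`, otherwise False (the pipeline is stuck).
    """
    queue_names = set(workers.values()) | {controller_queue, step_queue}
    queues = {name: deque() for name in queue_names}
    queues[controller_queue].append("controller")
    busy = {}              # worker name -> task it is currently running
    step_enqueued = False
    step_done = False

    for _ in range(max_ticks):
        # Idle workers pull the next task from the queue they listen on.
        for worker, queue_name in workers.items():
            if worker not in busy and queues[queue_name]:
                busy[worker] = queues[queue_name].popleft()

        for worker, task in list(busy.items()):
            if task == "controller":
                if not step_enqueued:
                    # The controller launches its step, then waits for it.
                    queues[step_queue].append("step")
                    step_enqueued = True
                elif step_done:
                    return True
                # Otherwise the controller keeps occupying this worker.
            else:
                # A step finishes within one tick and frees its worker.
                step_done = True
                del busy[worker]
    return False


if __name__ == "__main__":
    # One queue, one worker: the controller holds the only worker.
    print(simulate("myqueue", "myqueue", {"worker0": "myqueue"}))
    # Two queues, two workers: the step runs while the controller waits.
    print(simulate("service", "myqueue",
                   {"worker0": "service", "worker1": "myqueue"}))
```

In this model the single-queue setup never finishes, because the step sits behind a controller that is waiting for it. The two-queue setup completes, which matches the behavior described in the thread.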