This is exactly my problem, too, which I described above! If you find any solution, I would be glad if you could share it. 🙂 Of course, I will also share mine when I get one.
Relevant pipeline logs:
1721797252114 myDNSName:cpu:1:service:a6d316cad2b54f36a0cb960dc04e5ab7 DEBUG Installing dependencies from lock file
Package operations: 26 installs, 0 updates, 0 removals
- Installing attrs (23.2.0)
- Installing rpds-py (0.19.0)
- Installing referencing (0.35.1)
- Installing six (1.16.0)
- Installing certifi (2024.7.4)
- Installing charset-normalizer (3.3.2)
- Installing idna (3.7)
- Installing jsonschema-specifications (2023.12.1)
- Installin...
1721735525702 myHostName info ClearML Task: created new task id=a08289ce6f2e47f2afc4e9f4e8540575
ClearML results page:
1721735525998 myHostName info ClearML pipeline page:
Starting the pipeline in queue pipeline
1721735542512 dnsName:cpu:0 INFO task a08289ce6f2e47f2afc4e9f4e8540575 pulled from cbf8757ca3f44980a450f9ea8a1c300a by worker dnsName:cpu:0
1721735547620 dnsName:cpu:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): ipAddrOfClea...
I noticed in the documentation that --services-mode only works together with Docker mode. So this is probably the cause, since I did not start the agent with Docker. I would have expected an error when trying to start in --services-mode without Docker, though.
Now, with docker mode activated, I get a new error: clearml_agent is not installed (using python:3.11.9-slim-bookworm as base image).
Traceback (most recent call last):
File "/opt/clearml/.venv/lib/python3.11/site-packages/clearml_agent/commands/worker.py", line 3221, in install_requirements_for_package_api
package_api.load_requirements(cached_requirements)
File "/opt/clearml/.venv/lib/python3.11/site-packages/clearml_agent/helper/package/pip_api/venv.py", line 41, in load_requirements
super(VirtualenvPip, self).load_requirements(requirements)
File "/opt/clearml/.venv/lib/python3.11/site-packages/clearml...
Actually, it already does not work if my tasks themselves import modules which I wrote myself. I.e., let's say I have the following files in my repo:
pipeline.py
task1.py
task2.py
utils.py
and pipeline.py imports task1 and task2, and e.g. task1 does import utils. Then I get a ModuleNotFoundError on the remote agent.
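To make the layout concrete, here is a minimal sketch of what I mean (the function names and step wiring are placeholders, not my actual code; I am assuming the steps are added as function steps):

# utils.py -- a small local helper module
def greet(name: str) -> str:
    return f"Hello, {name}!"

# task1.py -- one pipeline step, importing the local helper
from utils import greet

def step_one() -> str:
    return greet("World")

# pipeline.py -- builds the pipeline from the step functions
from clearml import PipelineController
from task1 import step_one

pipe = PipelineController(name="example-pipeline", project="examples", version="1.0")
pipe.add_function_step(name="step_one", function=step_one)

Locally this runs fine, but on the remote agent the step_one step fails with a ModuleNotFoundError for utils.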
Yes, you are right, thanks. Now, I am using two agents with one using a queue dedicated only to the pipeline, and one dedicated to the single tasks. It works. However, still, it sometimes takes a strangely long time for the agent to pick up the next task (or process it), even if it is only "Hello World".
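For reference, the queue wiring in the pipeline definition then looks roughly like this (a sketch; the queue names and the one-agent-per-queue setup are just what I use):

from clearml import PipelineController

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0")

# steps that do not specify their own queue fall back to this one,
# which is served by the agent dedicated to the single tasks
pipe.set_default_execution_queue("tasks")

# ... add_function_step(...) calls for the individual steps ...

# the controller task itself goes to the queue served by the pipeline agent
pipe.start(queue="pipeline")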
This is true, yes. I do pipe.set_default_execution_queue("default") and also pipe.start(queue="default"), where the single steps do not specify queues. Also, the GUI confirms that this is the case.
Logs from the first task itself:
1721797284959 myDNSName:0 INFO task 20e5aaa6a8df4b1bbdc9fd5ccdcb5e7d pulled from 179eb7123cc04aa7a9a701890ac8ba0b by worker myDNSName:0
1721797290216 myDNSName:0 DEBUG DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): serverAddress:8008
DEBUG:urllib3.connectionpool:
"GET /auth.login HTTP/1.1" 200 615
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): serverAddress:8008
DEBUG:urllib3.connectionpool:
"GET...
Package operations: 26 installs, 0 updates, 0 removals
- Installing attrs (23.2.0)
- Installing rpds-py (0.19.0)
1721735566405 dnsName:cpu:0:service:a08289ce6f2e47f2afc4e9f4e8540575 DEBUG - Installing referencing (0.35.1)
- Installing six (1.16.0)
- Installing certifi (2024.7.4)
- Installing charset-normalizer (3.3.2)
- Installing idna (3.7)
- Installing jsonschema-specifications (2023.12.1)
- Installing orderedmultidict (1.0.1)
- Installing urllib3 (2.2.2)
- Ins...
I have one agent running on the machine. I also have only one task running. This only happens to us when we use pipelines, not single tasks. It does not depend on parameters like cache. There are no other tasks running in the meantime. I can boil it down even to "Hello World" tasks.
Notably, the example given here also causes the observed behavior.
Yes, I do have my files in the git repo. Although I have not quite understood which part it takes from the remote git repo, and which part it takes from my local system.
It seems that one also needs to explicitly pass the git repo in the pipeline and task definitions via PipelineController, since otherwise the agent starts the task process in some arbitrary working directory where, of course, it cannot find any of my own modules.
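For example, something along these lines seems to do the trick (a sketch; the URL and branch are placeholders, and I am assuming add_function_step accepts these repo arguments):

pipe.add_function_step(
    name="step_one",
    function=step_one,
    # tell the agent which repo (and branch) to clone for this step, so that
    # local modules like utils.py are importable, instead of the step running
    # from some arbitrary working directory
    repo="https://github.com/me/my-repo.git",
    repo_branch="main",
)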
And a second thing: When starting a pipeline, it seems that Cl...
Just noting that it also does not work with two agents listening to the same queue, which I tried because I thought maybe the controller task of the pipeline blocks the execution of the actual tasks.
Update:
- It does seem to work sometimes, but it takes an unreasonably long time. Even a step that just does print("Hello World") takes about a minute (after the environment has been fully set up).
- I needed to trigger the pipeline twice; the first time, not even the pipeline itself started.
I could solve the error by using the image python:3.11.9-bookworm, since gcc was missing in the slim image (it would be good to have a guideline on the minimum requirements a Docker base image should meet).
Now I am stuck with an error which seems to be related to the interplay of Poetry and pip: Poetry installs my own project, which then shows up as a requirement for pip; pip (obviously) does not find it on the PyPI server and complains that it can't install it. I run the pipeline in ser...
I can even comment out the separate steps and create dummy ones in main_pipeline.py and load them, and still, the error persists.
I find this line a little strange:
poetry run python -u main_pipeline.py
because that's the file from which the pipeline is started, i.e. it should not be called again, I guess?
I have the main pipeline code in that file, and my steps in separate files which I load into the main file. This is all in one repo, so it should work.
Well, rather, it takes a minute to complete.