Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

If there was an SSL issue it should log to console right?

correct, also the agent is able to report, so I'm assuming configuration is correct
@<1724960464275771392:profile|DepravedBee82> could you try to put the clearml import + Task .init at the top of your code?

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

None

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

This is exactly my problem, too, which I described above! If you find any solution, would be glad if you could share. 🙂 Of course, I also share mine when I get one.

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					CumbersomeSealion22
				
					0
					 × 1

Hi @<1523701205467926528:profile|AgitatedDove14> , here's my code with some more prints:

from clearml import Task

print("Before Task.init")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
print("Before task.set_repo")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
print("Before task.set_packages")
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

print("After import")

I've attached the full log (using RC2). Still getting stuck at Task.init - very weird

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Will try non-root and get back to you. I’m also trying to reproduce on a different machine too

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Although it's still really weird how it was failing silently

totally agree, I think the main issue was the agent had the correct configuration, but the container / env the agent was spinning was missing it,
I'll double check how come it did not print anything

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Ah yes you were right, it does still print on remote. Here you go:

environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/root', 'LOGNAME': 'root', 'USER': 'root', 'SHELL': '/bin/bash', 'INVOCATION_ID': '2cf51dc43b78470cb14c29f5f653ee18', 'JOURNAL_STREAM': '8:224108', 'SYSTEMD_EXEC_PID': '134947', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARML_CONFIG_FILE': '/tmp/.clearml_agent.vw6k62pq.cfg', 'TRAINS_CONFIG_FILE': '/tmp/.clearml_agent.vw6k62pq.cfg', 'CLEARML_TASK_ID': 'b0abe1da01bd4539a8e06699141c893a', 'TRAINS_TASK_ID': 'b0abe1da01bd4539a8e06699141c893a', 'CLEARML_LOG_LEVEL': 'INFO', 'TRAINS_LOG_LEVEL': 'INFO', 'CLEARML_LOG_TASK_TO_BACKEND': '0', 'TRAINS_LOG_TASK_TO_BACKEND': '0', 'PYTHONPATH': '/root/.clearml/venvs-builds/3.9/task_repository/ml-queue-test:/root/.clearml/venvs-builds/3.9/task_repository/ml-queue-test::/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/root/.clearml/venvs-builds/3.9/lib64/python3.9/site-packages:/root/.clearml/venvs-builds/3.9/lib/python3.9/site-packages'})

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Hi @<1523701205467926528:profile|AgitatedDove14> , I reordered the imports:

from clearml import Task

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks


for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

And here's the output:

Environment setup completed successfully
Starting Task Execution:
Before task

Still looks like it's getting stuck at Task.init

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

My understanding is that on remote execution Task.init is supposed to be a no-op right?

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Hmm, I'm without, no reason why it will get stuck .
Removing all the auto loggers, this can be done with

Task.init(..., auto_connect_frameworks=False)

which would disconnect all the automatic loggers (Hydra etc) off course this is for debugging purposes

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

He confirmed that it’s not inside a container. Trying to figure out why it’s running as root but would it make a difference if it was? Is it better to run the agent from a user profile?

Edit: it might be a container! Just checking now...

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Can this be reproducible using a simple script that we can also run?

Not really unfortunately - happy to share my code, but I've managed to reproduce this with different codebases.

As a summary of what I've tried:

Agent on the H100 machine, Server on Kube - Fail
Agent on laptop, Server on Kube - Fail
Agent on laptop, Server on Docker Desktop - Pass
So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube rather than an issue with the agents or code. As for which of the 7 containers could be at fault... :man-shrugging: . I'm not seeing anything out of the ordinary in the logs. Is there a verbose setting in the agent that could help us diagnose, i.e. each step of what goes on in Task.init ?

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

I've added that flag, removed all PL loggers & callbacks and all references to Hydra, but no luck 😞

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

I managed to set up my (Windows) laptop as a worker and reproduce the issue. Would that suggest an issue with ClearML server?

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc. work, but if there are any obvious things we should check, let me know and I can ask our DevOps engineer

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Hi @<1724960464275771392:profile|DepravedBee82> , can you perhaps add a simple print at the start of your code before any import?

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

This is so odd,
could you add prints right after the Task.init?
Also could you verify it still gets stuck with the latest RC

clearml==1.16.3rc2

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I think I've found a clue after running with debug:

Before Task.init
Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
Retrying (Retry(total=238, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:07
Retrying (Retry(total=237, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:12
Retrying (Retry(total=236, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:20:33
Retrying (Retry(total=235, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:21:03
Retrying (Retry(total=234, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login
2024-07-30 10:22:08
Retrying (Retry(total=233, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login

So my current theory is that the env the agent is building doesn't have the corporate TLS/SSL certificates. It's weird how it was failing silently without the --debug flag though...

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Agent on laptop, Server on Kube - Fail

So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube

Yeah that feels like a network config issue...

Is there a verbose setting in the agent that could help us diagnose,

yes running with debug turned on on.
since you managed to reproduce on your latop you can try to run the agent with --debug to test, specifically:

clearml-agent --debug daemon ....

if you are running it in venv mode (which I think the setup) you can also just specify the Task ID and test that (no daemon just execution)

clearml-agent --debug execute --id <task_id_here>

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

My money is on the Redis container although comparing the logs between Kube & Docker Desktop, nothing looks out of the ordinary...

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Can you add before the Task.init

import os
print(os.environ)

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

@<1523701205467926528:profile|AgitatedDove14> we've now configured the server to have it's own user account to run the agent so it is no longer running as root, but no luck 😞

Before os.environ
environ({'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/clearml', 'LOGNAME': 'clearml', 'USER': 'clearml', 'SHELL': '/bin/bash', 'INVOCATION_ID': 'da8e36a03c7348efbb7db360755e92b3', 'JOURNAL_STREAM': '8:244189055', 'SYSTEMD_EXEC_PID': '1970812', 'PYTHONUNBUFFERED': '1', 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID', 'CLEARML_WORKER_ID': 'mrl-plswh100:0', 'TRAINS_WORKER_ID': 'mrl-plswh100:0', 'CLEARML_CONFIG_FILE': '/tmp/.clearml_agent.4ll2u471.cfg', 'TRAINS_CONFIG_FILE': '/tmp/.clearml_agent.4ll2u471.cfg', 'CLEARML_TASK_ID': '4ab4c22b02ed4d1f86ff4fac663828f0', 'TRAINS_TASK_ID': '4ab4c22b02ed4d1f86ff4fac663828f0', 'CLEARML_LOG_LEVEL': 'INFO', 'TRAINS_LOG_LEVEL': 'INFO', 'CLEARML_LOG_TASK_TO_BACKEND': '0', 'TRAINS_LOG_TASK_TO_BACKEND': '0', 'PYTHONPATH': '/home/clearml/.clearml/venvs-builds/3.9/task_repository/ml-queue-test:/home/clearml/.clearml/venvs-builds/3.9/task_repository/ml-queue-test::/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/home/clearml/.clearml/venvs-builds/3.9/lib64/python3.9/site-packages:/home/clearml/.clearml/venvs-builds/3.9/lib/python3.9/site-packages'})
Before Task.init

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Our server is deployed on a kube cluster. I'm not too clear on how Helm charts etc.

The only thing that I can think of is that something is not right the the load balancer on the server so maybe some requests coming from an instance on the cluster are blocked ...
Hmm, saying that aloud that actually could be?! Try to add the following line to the end of the clearml.conf on the machine running the agent:

api.http.default_method: "put"

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thanks for the response @<1523701205467926528:profile|AgitatedDove14> ! The code is a small FMNIST test training job written in PyTorch Lightning. On my local job (laptop GPU, Windows) it completes in ~ 5min. On the server (Linux, H100s) it just hangs at Starting Task Execution: . Neither of these are in Docker.

I would expect to see the standard PL progress bars outputted to the console, but since nothing is outputted, so I'm not sure how to go about debugging this. I've attached the full logs for local and remote

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Hi @<1724960464275771392:profile|DepravedBee82>
After

Starting Task Execution:

It will literally start the process running your code,
Can you send the full log of the Task? what is the code doing? which system is running the agent (i.e. Windows/Mac/Linux docker etc)

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Ok so my train.py now looks like this:

print("Before import")

from pathlib import Path

import hydra
import lightning as L
import torch
from coolname import generate_slug
from omegaconf import DictConfig

from src.datasets import JobDataModule
from src.models import JobModel
from src.utils import LogSummaryCallback, get_num_steps, prepare_loggers_and_callbacks

from clearml import Task

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

print("Before task")

task = Task.init(project_name="ClearML Testing", task_name="FMNIST")
task.set_repo(
    repo="git@ssh.dev.azure.com:v3/mclarenracing/Application%20Engineering/ml-queue-test"
)
task.set_packages("requirements.txt")

print("After task")

And the log looks like this:

Starting Task Execution:
Before import
2024-07-19 09:06:09
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
NVIDIA H100 80GB HBM3
Before task

So it looks like it's getting stuck at Task.init

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Here's what the agent was logging:

 anjum.sayed@M209886    clearml-agent --debug daemon --queue default
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool:

 "PUT /auth.login HTTP/1.1" 200 603
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.clearml.dev.mrl:443
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/queues.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/queues.get_all HTTP/1.1" 200 332
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): updates.clear.ml:443
DEBUG:clearml_agent.session:Run by interpreter: C:\Users\anjum.sayed\AppData\Local\Programs\Python\Python39\python.exe
Current configuration (clearml_agent v1.8.1, location: C:\Users\anjum.sayed/clearml.conf):
----------------------
agent.worker_id =
agent.worker_name = M209886
agent.force_git_ssh_protocol = true
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >\= '3.10'
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = nvidia
agent.package_manager.conda_channels.3 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.venvs_dir = C:/Users/anjum.sayed/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = ~/.clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = C:/Users/anjum.sayed/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = C:/Users/anjum.sayed/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = C:/Users/anjum.sayed/.clearml/pip-cache
agent.docker_apt_cache = C:/Users/anjum.sayed/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
agent.enable_task_env = false
agent.sanitize_config_printout = ****
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.disable_task_docker_override = false
agent.git_user =
agent.git_pass = ****
agent.git_host =
agent.debug = true
agent.default_python = 3.9
agent.cuda_version = 123
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.http.default_method = put
api.auth.token_expiration_threshold_sec = ****
api.api_server =


api.web_server =


api.files_server =


api.credentials.access_key = 1N33K4IXUYO64HVT4S3PXVDIX4K2CS
api.credentials.secret_key = ****
api.host =


sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.network.file_upload_retries = 3
sdk.aws.s3.key =
sdk.aws.s3.secret = ****
sdk.aws.s3.region =
sdk.aws.s3.use_credentials_chain = false
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.aws.boto3.multipart_threshold = 8388608
sdk.aws.boto3.multipart_chunksize = 8388608
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
sdk.development.worker.report_event_flush_threshold = 100
sdk.development.worker.console_cr_flush_period = 10
sdk.apply_environment = false
sdk.apply_files = false

DEBUG:clearml_agent.commands.worker:starting resource monitor thread
Worker "M209886:0" - Listening to queues:
+----------------------------------+---------+-------+
| id                               | name    | tags  |
+----------------------------------+---------+-------+
| 3e9973e15a6048c5ae5419ea7d097f9c | default |       |
+----------------------------------+---------+-------+

DEBUG:urllib3.connectionpool:

 "PUT /workers.register HTTP/1.1" 200 278
Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to C:\Users\ANJUM~1.SAY\AppData\Local\Temp\.clearml_agent_daemon_outg5aq488v.txt
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/queues.get_all HTTP/1.1" 200 337
DEBUG:urllib3.connectionpool:

 "PUT /workers.get_runtime_properties HTTP/1.1" 404 371
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/queues.get_next_task HTTP/1.1" 200 282
.................. truncating due to Slack char limit.........
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/tasks.ping HTTP/1.1" 200 271
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /workers.status_report HTTP/1.1" 200 283
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /workers.status_report HTTP/1.1" 200 283
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/tasks.ping HTTP/1.1" 200 271
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/tasks.get_all HTTP/1.1" 200 363
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/tasks.get_by_id HTTP/1.1" 200 3490
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/tasks.stopped HTTP/1.1" 200 304
INFO:clearml_agent.commands.worker:Task process terminated
INFO:clearml_agent.commands.worker:Task interrupted: stopping
DEBUG:urllib3.connectionpool:

 "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/tasks.stopped HTTP/1.1" 200 333
DEBUG:urllib3.connectionpool:

 "PUT /workers.status_report HTTP/1.1" 200 283
DEBUG:urllib3.connectionpool:

 "PUT /v2.5/queues.get_all HTTP/1.1" 200 337
DEBUG:urllib3.connectionpool:

 "PUT /v2.14/queues.get_next_task HTTP/1.1" 200 282
DEBUG:urllib3.connectionpool:

 "PUT /workers.unregister HTTP/1.1" 200 280
DEBUG:urllib3.connectionpool:

 "PUT /workers.unregister HTTP/1.1" 200 280

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Retrying (Retry(total=239, connect=240, read=240, redirect=240, status=240)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /auth.login

OH that makes sense I'm assuming on your local machine the certificate is installed but not on remote machines / containers
Add the following to your clearml.conf:

api.verify_certificate: false

None

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thanks Martin - will try that and see what I can find. Really appreciate your patience with this! 🙂

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Sorry, on the remote machine (i.e. enqueue it and let the agent run it), this will also log the print 🙂

  				
Posted 
	4 months ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

THAT WORKED! 🎉

  				
Posted 
	3 months ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

Answers 46