I still don't understand how it's happening - can you share your code?
I've also overridden CLEARML_FILES_HOST=None, and configured it in the clearml.conf file. Don't know where it's picking up 8081 😕
Can you add the api section of your clearml.conf
and also a log of a task?
Hi @<1535069219354316800:profile|PerplexedRaccoon19> , can you share the result of running this?
python -c "from clearml import Task; print(Task._get_default_session().get_files_server_host())"
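Since both environment variables and clearml.conf are in play in this thread, it can help to list which host variables are set in the same shell before comparing against the one-liner above. A minimal sketch: `CLEARML_FILES_HOST` comes from this thread, `CLEARML_API_HOST` and `CLEARML_WEB_HOST` are the other documented host variables, and the helper names `host_overrides` and `resolved_files_server` are hypothetical.

```python
import os

# Environment variables that can override the api section of clearml.conf.
HOST_VARS = ("CLEARML_API_HOST", "CLEARML_WEB_HOST", "CLEARML_FILES_HOST")


def host_overrides(environ=os.environ):
    """Return the ClearML host variables that are set, including empty ones."""
    return {name: environ[name] for name in HOST_VARS if name in environ}


def resolved_files_server():
    """Ask the SDK which files server it resolved (same call as the one-liner)."""
    # Imported lazily so host_overrides() can be used without clearml installed.
    from clearml import Task

    return Task._get_default_session().get_files_server_host()


print("env overrides:", host_overrides())
```

Calling `resolved_files_server()` afterwards, in an environment where clearml is installed and configured, shows the value the SDK actually ends up with.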
This doesn't interrupt jobs, but it slows them down, and they take a long time to quit (adds ~2 hours for the process to end)
It's a simple training loop that trains models for 2-3 epochs, for a total of 200-300 iterations, saves a few checkpoints, and saves a final model at the end
It worked. The env variables definitely do not work! Had to use clearml.conf along with use_credential_chain=True
Note that you used an env variable, I want to try the config directly first 🙂
@<1523701070390366208:profile|CostlyOstrich36> , as written above, I've done that. It still tries to send to 8081
As mentioned above, I've tried both (env and clearml.conf). Here are my configs (I've blacked out urls and creds)
conf file
api {
    web_server:
    api_server:
    files_server:
    credentials {
        "access_key" = "xyz"
        "secret_key" = "xyz"
    }
}
Relevant log (it uploads to S3, I can see the artefact fine on clearml's experiment tracker, but it still causes the job to hang)
2023-12-11 16:06:44,008 - clearml.storage - INFO - Uploading: 5325.00MB / 5348.15MB @ 134.86MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,053 - clearml.storage - INFO - Uploading: 5330.00MB / 5348.15MB @ 113.02MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,101 - clearml.storage - INFO - Uploading: 5335.00MB / 5348.15MB @ 103.35MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,148 - clearml.storage - INFO - Uploading: 5340.15MB / 5348.15MB @ 109.98MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,169 - clearml.storage - INFO - Uploading: 5345.15MB / 5348.15MB @ 240.57MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,492 - clearml.Task - INFO - Completed model upload to
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08674550>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676560>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863eec0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08675780>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d5d0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d990>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863ef80>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863f640>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676d70>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863e6e0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677b20>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676680>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
2023-12-11 16:17:58,911 - clearml.metrics - WARNING - Failed uploading to
(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a086771f0>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,913 - clearml.metrics - WARNING - Failed uploading to
(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677d30>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,914 - clearml.metrics - ERROR - Not uploading 2/5 events because the data upload failed
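To see at a glance which endpoint the failed uploads were aimed at, the `HTTPSConnectionPool(host=..., port=...)` fragments in a log like the one above can be counted. This is only a sketch: the function name `failing_endpoints` is hypothetical, and the sample text is shortened from the log lines above.

```python
import re
from collections import Counter

# Matches the endpoint urllib3 reports in "Max retries exceeded" messages.
POOL_RE = re.compile(r"HTTPS?ConnectionPool\(host='([^']+)', port=(\d+)\)")


def failing_endpoints(log_text: str) -> Counter:
    """Count (host, port) pairs that appear in connection-pool failure messages."""
    return Counter((host, int(port)) for host, port in POOL_RE.findall(log_text))


# Two shortened lines in the same shape as the warnings in the log above.
sample = (
    "(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded "
    "with url: / (Caused by ConnectTimeoutError(...)))\n"
    "(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded "
    "with url: / (Caused by ConnectTimeoutError(...)))\n"
)

for (host, port), count in failing_endpoints(sample).items():
    print(f"{host}:{port} failed {count} time(s)")
```

In the log above, every such failure points at port 8081 and comes from `clearml.metrics`, while the model upload logged by `clearml.storage` completes.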
I tried that earlier - that checks out, it matches the S3 path I provide in the conf
And what are you doing in your code, exactly?
Hi @<1535069219354316800:profile|PerplexedRaccoon19> you can set up api.files_server in clearml.conf to point to your S3 bucket
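As an alternative to the config setting, the upload destinations can also be set from code. The sketch below is not taken from this thread and rests on assumptions: the function name, project name, and task name are hypothetical, and the idea that the `clearml.metrics` uploads are debug samples still going to the default files server is a guess. `Task.init(output_uri=...)` and `Logger.set_default_upload_destination(...)` are existing SDK calls.

```python
def init_task_with_s3(bucket_uri: str):
    """Start a task whose models and debug samples are sent to bucket_uri."""
    # Lazy import: needs clearml installed and a reachable ClearML server.
    from clearml import Task

    task = Task.init(
        project_name="examples",      # hypothetical project name
        task_name="s3 upload check",  # hypothetical task name
        output_uri=bucket_uri,        # destination for model checkpoints
    )
    # Debug samples use a separate default destination, so redirect them too.
    task.get_logger().set_default_upload_destination(bucket_uri)
    return task
```

It would be called once at the start of the training script with the bucket path, in place of a plain `Task.init(...)`.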