Can you add the api section of your clearml.conf
and also a log of a task?
@<1523701070390366208:profile|CostlyOstrich36> , as written above, I've done that. It still tries to send to 8081
It's a simple training loop that trains models for 2-3 epochs, for a total of 200-300 iterations, saves a few checkpoints, and saves a final model at the end.
Hi @<1535069219354316800:profile|PerplexedRaccoon19> , can you share the result of running this?
python -c "from clearml import Task; print(Task._get_default_session().get_files_server_host())"
I've also overridden CLEARML_FILES_HOST= None, and configured it in the clearml.conf file. I don't know where it's picking up 8081 😕
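A side note on the env-variable attempt: environment variables are always strings, so a value written as None reaches the process as text rather than as an unset variable. Below is a minimal sketch for spotting that before the run starts. The helper name `suspicious_files_host` is made up for illustration, and the check makes no claim about how ClearML itself interprets such a value.

```python
import os


def suspicious_files_host(env=None):
    """Return a warning if CLEARML_FILES_HOST looks like a placeholder, else None."""
    env = os.environ if env is None else env
    raw = env.get("CLEARML_FILES_HOST")
    if raw is None:
        # The variable is genuinely unset, so there is nothing to flag.
        return None
    value = raw.strip().lower()
    if value in {"", "none", "null"}:
        # The variable exists but holds placeholder text, not a URL.
        return f"CLEARML_FILES_HOST is set to the literal text {raw!r}, not unset"
    return None


if __name__ == "__main__":
    print(suspicious_files_host() or "CLEARML_FILES_HOST looks fine or is unset")
```

Passing a plain dict instead of `os.environ` makes it easy to try the check against values you are considering.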
Note that you used an env variable, I want to try the config directly first 🙂
I tried that earlier. That checks out: it matches the S3 path I provide in the conf.
And what are you doing in your code, exactly?
It worked. The env variables definitely do not work! I had to use clearml.conf along with use_credentials_chain=True
This doesn't interrupt jobs, but it slows them down, and the process takes a long time to exit (it adds ~2 hours).
As mentioned above, I've tried both (env and clearml.conf). Here are my configs (I've blacked out urls and creds)
conf file
api {
web_server:
api_server:
files_server:
credentials {
"access_key" = "xyz"
"secret_key" = "xyz"
}
}
Relevant log (it uploads to S3, I can see the artefact fine on clearml's experiment tracker, but it still causes the job to hang)
2023-12-11 16:06:44,008 - clearml.storage - INFO - Uploading: 5325.00MB / 5348.15MB @ 134.86MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,053 - clearml.storage - INFO - Uploading: 5330.00MB / 5348.15MB @ 113.02MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,101 - clearml.storage - INFO - Uploading: 5335.00MB / 5348.15MB @ 103.35MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,148 - clearml.storage - INFO - Uploading: 5340.15MB / 5348.15MB @ 109.98MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,169 - clearml.storage - INFO - Uploading: 5345.15MB / 5348.15MB @ 240.57MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,492 - clearml.Task - INFO - Completed model upload to
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08674550>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676560>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863eec0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08675780>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d5d0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d990>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863ef80>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863f640>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676d70>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863e6e0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677b20>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676680>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
2023-12-11 16:17:58,911 - clearml.metrics - WARNING - Failed uploading to
(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a086771f0>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,913 - clearml.metrics - WARNING - Failed uploading to
(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677d30>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,914 - clearml.metrics - ERROR - Not uploading 2/5 events because the data upload failed
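To confirm which endpoint the failed uploads are actually aimed at, the host and port can be pulled out of the "Max retries exceeded" lines. This is a rough sketch, not part of ClearML: the function name `failing_endpoints` and the regex are assumptions based only on the log format pasted above.

```python
import re
from collections import Counter

# Matches the urllib3 pool description as it appears in the log above.
POOL_RE = re.compile(r"HTTPSConnectionPool\(host='([^']+)', port=(\d+)\)")


def failing_endpoints(log_text):
    """Count (host, port) pairs named in 'Max retries exceeded' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Max retries exceeded" not in line:
            continue
        match = POOL_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample = """\
(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a086771f0>, 'Connection to xyz.com timed out. (connect timeout=30)')))
(HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677d30>, 'Connection to xyz.com timed out. (connect timeout=30)')))
"""

if __name__ == "__main__":
    for (host, port), n in failing_endpoints(sample).items():
        print(f"{n} failed upload(s) to {host}:{port}")
```

In this log both failures come from clearml.metrics and target port 8081, while the model upload through clearml.storage completed, so only the event uploads seem to be using the old files server address.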
Hi @<1535069219354316800:profile|PerplexedRaccoon19> you can set up api.files_server in clearml.conf to point to your S3 bucket
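To double-check what a given files_server value points at, here is a small sketch using only the standard library. The function name `describe_files_server` and the example URLs are hypothetical; it only parses the string and says nothing about what ClearML does with it.

```python
from urllib.parse import urlparse


def describe_files_server(url):
    """Summarise a files_server value as an S3 location or an HTTP(S) host:port."""
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        return f"S3 bucket {parsed.netloc!r}, prefix {parsed.path!r}"
    if parsed.scheme in ("http", "https"):
        # Fall back to the scheme's default port when none is given.
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
        return f"HTTP(S) files server {parsed.hostname}:{port}"
    return f"unrecognised scheme {parsed.scheme!r}"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(describe_files_server("s3://my-bucket/clearml"))
    print(describe_files_server("https://xyz.com:8081"))
```

If the value printed by the session check earlier in the thread comes back as an HTTP(S) host on 8081 rather than an S3 location, the conf change has not been picked up.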
I still don't understand how it's happening - can you share your code?