Answered
What is the best way to set S3 as a files server?

We have a ClearML deployment without a files server, but during or after a training run clearml.metrics always fails with a connection error while trying to call <url>:8081 (we don't have port 8081 because there is no files server).

  
  
Posted 11 months ago

Answers 15


Hi @<1535069219354316800:profile|PerplexedRaccoon19> you can set api.files_server in clearml.conf to point to your S3 bucket

  
  
Posted 11 months ago

It worked. The env variables definitely do not work! I had to use clearml.conf along with use_credentials_chain=True
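For anyone landing here later, the setup that ended up working can be sketched as two settings in clearml.conf. The snippet below only prints such a fragment so it can be merged into an existing file by hand. It assumes the option is spelled use_credentials_chain under sdk.aws.s3, as in ClearML's default configuration file, and the bucket path is a placeholder rather than a value from this thread.

```python
def render_clearml_conf(files_server: str) -> str:
    """Return a clearml.conf fragment that routes uploads to object storage."""
    return f'''api {{
    # Object-storage URI used in place of the default files server (port 8081)
    files_server: "{files_server}"
}}
sdk {{
    aws {{
        s3 {{
            # Let boto3 resolve credentials (env vars, credentials file, IAM role)
            use_credentials_chain: true
        }}
    }}
}}
'''


if __name__ == "__main__":
    # "s3://my-bucket/clearml" is a hypothetical placeholder
    print(render_clearml_conf("s3://my-bucket/clearml"))
```

The api section of a real clearml.conf also carries web_server, api_server and credentials, so the printed fragment is meant to be merged into the file, not to replace it.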

  
  
Posted 10 months ago

I still don't understand how it's happening - can you share your code?

  
  
Posted 11 months ago

It uses clearml 1.13.2

  
  
Posted 11 months ago

@<1523701070390366208:profile|CostlyOstrich36> , as written above, I've done that. It still tries to send to 8081

  
  
Posted 11 months ago

Can you add the api section of your clearml.conf and also a log of a task?

  
  
Posted 11 months ago

Note that you used an env variable, I want to try the config directly first 🙂

  
  
Posted 11 months ago

As mentioned above, I've tried both (env and clearml.conf). Here are my configs (I've redacted the URLs and creds)

conf file

api {
    web_server:
    api_server:
    files_server:

    credentials {
        "access_key" = "xyz"
        "secret_key" = "xyz"
    }
}

Relevant log (it uploads to S3, I can see the artefact fine on clearml's experiment tracker, but it still causes the job to hang)

2023-12-11 16:06:44,008 - clearml.storage - INFO - Uploading: 5325.00MB / 5348.15MB @ 134.86MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,053 - clearml.storage - INFO - Uploading: 5330.00MB / 5348.15MB @ 113.02MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,101 - clearml.storage - INFO - Uploading: 5335.00MB / 5348.15MB @ 103.35MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,148 - clearml.storage - INFO - Uploading: 5340.15MB / 5348.15MB @ 109.98MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,169 - clearml.storage - INFO - Uploading: 5345.15MB / 5348.15MB @ 240.57MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,492 - clearml.Task - INFO - Completed model upload to 

Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08674550>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676560>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863eec0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08675780>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d5d0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d990>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863ef80>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863f640>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676d70>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863e6e0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677b20>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676680>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
2023-12-11 16:17:58,911 - clearml.metrics - WARNING - Failed uploading to 
 (HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a086771f0>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,913 - clearml.metrics - WARNING - Failed uploading to 
 (HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677d30>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,914 - clearml.metrics - ERROR - Not uploading 2/5 events because the data upload failed
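
To confirm which endpoint the failed uploads are actually hitting, a log like the one above can be scanned for the host and port that urllib3 reports. This is a rough sketch using only the standard library. The function name is made up for illustration, and the sample text is a shortened copy of the warning above.

```python
import re
from collections import Counter

# Matches the endpoint reported by urllib3 in "Max retries exceeded" messages
POOL_RE = re.compile(r"HTTPS?ConnectionPool\(host='([^']+)', port=(\d+)\)")


def failed_endpoints(log_text: str) -> Counter:
    """Count (host, port) pairs that appear in connection-pool failures."""
    return Counter((host, int(port)) for host, port in POOL_RE.findall(log_text))


if __name__ == "__main__":
    sample = (
        "2023-12-11 16:17:58,911 - clearml.metrics - WARNING - Failed uploading to \n"
        " (HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: /)\n"
    )
    for (host, port), count in failed_endpoints(sample).items():
        print(f"{host}:{port} failed {count} time(s)")
```

If every failure points at port 8081 while the model upload itself goes to S3, that suggests the metrics uploader is still using a files-server URL from somewhere other than the configured S3 path.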
  
  
Posted 11 months ago

It's a simple training loop that trains models for 2-3 epochs for a total of 200-300 iterations, saves a few checkpoints, and saves a final model at the end

  
  
Posted 11 months ago

Hi @<1535069219354316800:profile|PerplexedRaccoon19> , can you share the result of running this?

python -c "from clearml import Task; print(Task._get_default_session().get_files_server_host())"
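
Whatever the command above prints can be sanity-checked with a small helper that tells an object-storage URI apart from an HTTP files server. This is only a sketch: the helper name is hypothetical, the list of schemes is an assumption about what ClearML accepts, and it inspects the string alone without contacting any server.

```python
from urllib.parse import urlsplit

# Schemes assumed to denote object storage rather than an HTTP files server
OBJECT_STORAGE_SCHEMES = {"s3", "gs", "azure"}


def describe_files_server(value: str) -> str:
    """Classify a files-server value by its URL scheme."""
    parts = urlsplit(value)
    if parts.scheme in OBJECT_STORAGE_SCHEMES:
        return f"object storage ({parts.scheme}://{parts.netloc})"
    if parts.scheme in {"http", "https"}:
        # Fall back to the scheme's default port when none is given
        port = parts.port or (443 if parts.scheme == "https" else 80)
        return f"HTTP files server at {parts.hostname}:{port}"
    return "unrecognized value"


if __name__ == "__main__":
    # Both values are placeholders, not taken from the thread
    print(describe_files_server("s3://my-bucket/clearml"))
    print(describe_files_server("https://xyz.com:8081"))
```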
  
  
Posted 11 months ago

I tried that earlier. It checks out: it matches the S3 path I provided in the conf

  
  
Posted 11 months ago

And what are you doing in your code, exactly?

  
  
Posted 11 months ago

Any luck? Is this expected?

  
  
Posted 11 months ago

This doesn't interrupt jobs, but it slows them down, and the process takes a long time to quit (it adds ~2 hours before the process ends)

  
  
Posted 11 months ago

I've also overridden CLEARML_FILES_HOST= None , and configured it in the clearml.conf file. I don't know where it's picking up 8081 😕
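
Since the environment and clearml.conf can both supply these values, it can help to list what the training process actually sees before it starts. The sketch below uses only the standard library. The helper names are hypothetical, and whether ClearML treats an empty or literal "None" value as unset is an assumption this thread does not settle, so such values are only flagged for a manual look.

```python
import os

# Variables ClearML reads for server endpoints and the config file location
CLEARML_ENV_VARS = (
    "CLEARML_FILES_HOST",
    "CLEARML_API_HOST",
    "CLEARML_WEB_HOST",
    "CLEARML_CONFIG_FILE",
)


def clearml_env_overrides(environ=os.environ) -> dict:
    """Return the ClearML-related variables that are set, including empty ones."""
    return {name: environ[name] for name in CLEARML_ENV_VARS if name in environ}


def suspicious_values(overrides: dict) -> list:
    """Flag values that look like placeholders rather than real values."""
    return [name for name, value in overrides.items()
            if value.strip().lower() in {"", "none", "null"}]


if __name__ == "__main__":
    found = clearml_env_overrides()
    for name, value in found.items():
        print(f"{name}={value!r}")
    print("check manually:", suspicious_values(found))
```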

  
  
Posted 11 months ago