Answered

What is the best way to set S3 as a files server?

We have a ClearML deployment without a files server, but during or after a training run, clearml.metrics always fails with a connection error while trying to call <url>:8081 (nothing listens on 8081 because we have no files server).

  
  
Posted one year ago

Answers 15


I've also overridden CLEARML_FILES_HOST=None and configured it in the clearml.conf file. I don't know where it's picking up 8081 from 😕

  
  
Posted one year ago

Hi @PerplexedRaccoon19, you can set up api.files_server in clearml.conf to point to your S3 bucket.

  
  
Posted one year ago

@CostlyOstrich36, as written above, I've done that. It still tries to send to 8081.

  
  
Posted one year ago

Can you add the api section of your clearml.conf and also a log of a task?

  
  
Posted one year ago

Note that you used an env variable; I want to try the config directly first 🙂

  
  
Posted one year ago

As mentioned above, I've tried both (env and clearml.conf). Here are my configs (I've redacted URLs and credentials).

conf file

api { 
    web_server: 

    api_server: 

    files_server: 


    credentials {
        "access_key" = "xyz"
        "secret_key"  = "xyz"
    }
}

Relevant log (it uploads to S3, I can see the artefact fine on clearml's experiment tracker, but it still causes the job to hang)

2023-12-11 16:06:44,008 - clearml.storage - INFO - Uploading: 5325.00MB / 5348.15MB @ 134.86MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,053 - clearml.storage - INFO - Uploading: 5330.00MB / 5348.15MB @ 113.02MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,101 - clearml.storage - INFO - Uploading: 5335.00MB / 5348.15MB @ 103.35MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,148 - clearml.storage - INFO - Uploading: 5340.15MB / 5348.15MB @ 109.98MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,169 - clearml.storage - INFO - Uploading: 5345.15MB / 5348.15MB @ 240.57MBs from /tmp/.clearml.upload_model_05djjpwq.tmp
2023-12-11 16:06:44,492 - clearml.Task - INFO - Completed model upload to 

Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08674550>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676560>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863eec0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08675780>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d5d0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863d990>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863ef80>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863f640>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676d70>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=1, connect=1, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a0863e6e0>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677b20>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
Retrying (Retry(total=0, connect=0, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08676680>, 'Connection to xyz.com timed out. (connect timeout=30)')': /
2023-12-11 16:17:58,911 - clearml.metrics - WARNING - Failed uploading to 
 (HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a086771f0>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,913 - clearml.metrics - WARNING - Failed uploading to 
 (HTTPSConnectionPool(host='xyz.com', port=8081): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f1a08677d30>, 'Connection to xyz.com timed out. (connect timeout=30)')))
2023-12-11 16:17:58,914 - clearml.metrics - ERROR - Not uploading 2/5 events because the data upload failed
  
  
Posted one year ago

I'm using clearml 1.13.2.

  
  
Posted one year ago

This doesn't interrupt jobs, but it slows them down, and the process takes a long time to quit (it adds ~2 hours before the process ends).

  
  
Posted one year ago

Hi @PerplexedRaccoon19, can you share the result of running this?

python -c "from clearml import Task; print(Task._get_default_session().get_files_server_host())"
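
To make the printed value easier to interpret, here is a minimal sketch (standard library only) that classifies a files server URL by scheme, host, and port. The function name `describe_files_server` and the sample URLs are hypothetical; the idea is to paste in whatever the one-liner above prints and check whether it is an object-storage URI or an HTTP(S) endpoint such as port 8081.

```python
from urllib.parse import urlparse

# URI schemes treated here as object storage (an assumption for this sketch).
OBJECT_STORAGE_SCHEMES = {"s3", "gs", "azure"}


def describe_files_server(url: str) -> dict:
    """Split a files server URL into the parts relevant for debugging."""
    parsed = urlparse(url)
    return {
        "scheme": parsed.scheme,
        "host": parsed.hostname,
        "port": parsed.port,  # None when the URL carries no explicit port
        "is_object_storage": parsed.scheme in OBJECT_STORAGE_SCHEMES,
    }


if __name__ == "__main__":
    # Hypothetical values; replace with the output of the one-liner above.
    for candidate in ("s3://my-bucket/clearml", "https://xyz.com:8081"):
        print(candidate, "->", describe_files_server(candidate))
```

If the result shows an HTTP(S) scheme with port 8081, the SDK is still resolving a files server endpoint rather than the bucket.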
  
  
Posted one year ago

I tried that earlier. It checks out: it matches the S3 path I provide in the conf.

  
  
Posted one year ago

And what are you doing in your code, exactly?

  
  
Posted one year ago

It's a simple training loop that trains models for 2-3 epochs, for a total of 200-300 iterations. It saves a few checkpoints along the way and a final model at the end.

  
  
Posted one year ago

Any luck? Is this expected?

  
  
Posted one year ago

I still don't understand how it's happening - can you share your code?

  
  
Posted one year ago

It worked. The env variables definitely do not work! I had to use clearml.conf along with use_credential_chain=True.
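
For readers landing here, a minimal sketch of what that configuration might look like. It assumes the setting referred to above is the `sdk.aws.s3.use_credentials_chain` key in clearml.conf, and it uses a hypothetical bucket URI; neither the bucket nor the exact layout comes from the poster's redacted config. The snippet only renders the fragment as text so it can be merged into an existing clearml.conf by hand.

```python
# Hypothetical bucket and prefix; replace with your own.
BUCKET_URI = "s3://my-bucket/clearml"


def render_clearml_conf(files_server: str) -> str:
    """Return a clearml.conf fragment that points the files server at S3."""
    return (
        "api {\n"
        f'    files_server: "{files_server}"\n'
        "}\n"
        "sdk {\n"
        "    aws {\n"
        "        s3 {\n"
        "            # Assumed key: let boto3 resolve credentials itself\n"
        "            use_credentials_chain: true\n"
        "        }\n"
        "    }\n"
        "}\n"
    )


if __name__ == "__main__":
    print(render_clearml_conf(BUCKET_URI))
```

Merge the printed `api` and `sdk` entries into the matching sections of your existing clearml.conf rather than replacing the file, so the server URLs and credentials already there are kept.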

  
  
Posted 11 months ago