Hi SuccessfulKoala55 , just to add, my clearml.conf (client) and clearml.agent.conf (agent) can have differing values. I'm not sure which one takes precedence and if this could be the cause.
Well, by default, the SDK and agent both use the clearml.conf file. If those files reside on different machines, there should be no confusion.
SubstantialElk6 your agent configuration file does not have any AWS credentials defined - is that on purpose?
Yes, it's on purpose; each user would have their own AWS credentials for default_output_uri.
My assumption is that the agent will have pulled that off the client's clearml.conf.
It won't, as it's not on the same machine... and credentials are never sent to the server, so basically the experiment running in the agent will have no credentials
I see. Can I take it that when the client uses task.execute_remotely(queue_name="1gpu", exit_process=True), then none of the content in its clearml.conf will be used, except for the API part, and ClearML simply uses whatever is on the agent side?
` api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp,
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "12345", "secret_key": "67890"},
    verify_certificate: false
} `
execute_remotely simply creates a task and enqueues it - nothing of the original clearml.conf file is used in the new task - the clearml.conf file is always local to the executing machine.
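To illustrate, a minimal sketch of that flow (the project and queue names here are just placeholders):
` from clearml import Task

task = Task.init(project_name="examples", task_name="remote-run-demo")

# Registers the task and enqueues it on "1gpu"; the local process exits here.
# Everything after this call is executed by the agent, which reads its own local
# clearml.conf - only the api credentials above were needed to create/enqueue the task.
task.execute_remotely(queue_name="1gpu", exit_process=True)

# ... training code below runs on the agent's machine ... `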
Ok thanks, that explains a lot. We have been doing this wrongly the whole time, thinking that the clearml.conf on the client side would be acknowledged by the remote agent execution. In reality, only the API section is utilised.
Hi, we are still not getting the model repo to work, mainly due to clearml.storage failing to save the models.
We tried vanilla boto3 code and it works, but we can't figure out why we get ConnectionResetError 104 when ClearML does it.
How do we configure clearml in correspondence to following boto code?
s3 = boto3.resource('s3', endpoint_url='https://ecs.ai', aws_access_key_id='mykey', aws_secret_access_key='mysecret', config=Config(signature_version='s3v4'), region_name='us-east-1', verify='/mycerts/ca.crt')
Hi SubstantialElk6
If you are using boto to access anything that is not AWS S3, you have to add both address and port, and make sure you configure the "secure" flag.
See example in clearml.conf :
https://github.com/allegroai/clearml-agent/blob/176b4a4cdec9c4303a946a82e22a579ae22c3355/docs/clearml.conf#L247
` aws {
    s3 {
        credentials: [
            {
                host: "my-minio-host:9000"
                key: "12345678"
                secret: "12345678"
                multipart: false
                secure: false
            }
        ]
    }
} `
And in the ` output_uri ` you should have something like: s3://my_host:9000/bucket/folder
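For example, to set that per task from code (just a sketch; the project, task and bucket names are placeholders, and the same value can also be set globally via default_output_uri in clearml.conf or per task in the UI):
` from clearml import Task

# output_uri is where models/artifacts for this task get uploaded;
# the matching key/secret/secure settings for that host are read from the clearml.conf
# of the machine that actually executes the task (the agent, for remote runs).
task = Task.init(
    project_name="examples",
    task_name="minio-output-demo",
    output_uri="s3://my_host:9000/bucket/folder",
) `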
Ok, let me check this out first thing on Monday, thanks AgitatedDove14 .
Hi.
We tried as advised above and it still didn't work.
Host: http://ecs.ai:443
output_uri = S3://ecs.ai:443/bucketname
This time round the client gave this error.
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: 'http://ecs.ai/bucketname/.clearml.test'.
It's quite apparent that whatever ClearML passed to boto3 ends up as an http call instead of https, which is wrong.
SubstantialElk6
Notice that if you are using a manual setup, the default is "secure: false"; you have to change it to "secure: true":
https://github.com/allegroai/clearml-agent/blob/176b4a4cdec9c4303a946a82e22a579ae22c3355/docs/clearml.conf#L251
Since you actually need https, you should use the "secure" property in the bucket configuration in clearml.conf
Thanks. We set this configuration and the client ran and submitted the job for remote execution (agent running k8s glue). However, when the job runs and tries to save into the model repo, this error came up.
clearml.storage - ERROR - Failed creating storage object s3://ecs.ai Reason: Missing key and secret for S3 storage access (s3://ecs.ai).
I remember being told that the clearml.conf on the client will not be used in a remote execution like the above, so I think this was the problem. I also tried to set the same credentials in 'Web App Cloud Access' under the profile page. That didn't help either.
The implication is that all models can only be saved to a single model repo defined on the ClearML agent end; users have no choice in defining their own repo destination.
Is there any way out of this? I find it strange that users can't choose a model repo destination of their choice and have to rely on just one public bucket.
I remember being told that the clearml.conf on the client will not be used in a remote execution like the above, so I think this was the problem.
SubstantialElk6 the configuration should be set on the agent's machine (i.e. clearml.conf that is on the machine running the agent)
- Users have no choice of defining their own repo destination of choice.
In the UI you can specify, in the "Execution" tab, Output "Destination", a different destination for the models/artifacts. Is this what you are looking for?
Setting the credentials on the agent machine means the users cannot use their own credentials, since a k8s glue agent serves multiple users.
Referencing your suggestion, we can configure output_uri on task.set_base_docker() but how should we do this for the credentials?
Setting the credentials on the agent machine means the users cannot use their own credentials, since a k8s glue agent serves multiple users.
Correct, I think the "vault" option is only available on the paid tier 😞
but how should we do this for the credentials?
I'm not sure how to pass them; wouldn't it make sense to give the agent all-access credentials?
Do you have more info on vault?
Actually it only makes sense if the entire department or organisation is saving their models in a common repo. In our case this is not possible due to client security (e.g. training data from clients can potentially be 'reverse engineered' from trained models in future). So each department, and even each project, will need its own repo.
In our case this is not possible due to client security (e.g. training data from clients can potentially be 'reverse engineered' from trained models in future).
Hmm I see, wouldn't it make more sense to separate clients like a multi-tenant SaaS solution?
It would make sense on a very large resource cluster. Unfortunately we have fewer than 50 GPUs to share. A multi-tenant SaaS would cut the resources into even smaller clusters and not help with efficiency. Or would you have a suggestion?
Hmm that makes sense. I "think" the enterprise offering has a solution for that as well (i.e. full separation over a static cluster), but probably the best way to pursue this avenue is to talk to Sales (I'm assuming they'll set up a call to discuss the details).
Going back to the open source version, I think that adding the credentials as part of the source code might allow the "credentials" to auto-populate as part of the remote execution, wdyt?
Going back to the open source version, I think that adding the credentials as part of the source code might allow the "credentials" to auto-populate as part of the remote execution, wdyt?
Not sure how this will work when I can't supply the credentials to ClearML programmatically.
I thought of another potential way but not sure if the SDK supports it.
We will perform a manual save and upload of the model using vanilla boto3, with credentials passed in as env vars, and then use the ClearML SDK to update the model repo with the location of the model, without ClearML uploading it explicitly.
Would the above work?
Hi SubstantialElk6, you can do it by:
1. Uploading the model file yourself
2. Creating an OutputModel object, associated with your task (it will automatically be associated with the main task unless otherwise specified)
3. Calling the model's update_weights() method with the register_uri argument pointing to the URI of the previously uploaded model file
WDYT?
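Roughly something like this (just a sketch; the endpoint, cert path, bucket, object key and model name are placeholders, and boto3 is assumed to pick up your credentials from the usual AWS environment variables):
` import boto3
from botocore.client import Config
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="manual-model-upload")

# 1. Upload the model file yourself with vanilla boto3
#    (credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars)
s3 = boto3.resource('s3', endpoint_url='https://ecs.ai',
                    config=Config(signature_version='s3v4'), verify='/mycerts/ca.crt')
s3.Bucket('bucketname').upload_file('model.pt', 'models/model.pt')

# 2. Create an OutputModel associated with the task
output_model = OutputModel(task=task, name="my-model")

# 3. Register the already-uploaded file; ClearML records the URI without uploading anything
output_model.update_weights(register_uri='s3://ecs.ai:443/bucketname/models/model.pt') `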
Hi Jake, thanks for the suggestion, let me try it out.