Thanks, that seems to work. I have a question: does it save the best model or the model from the last epoch?
Hi, I'm going to hijack this thread a bit. My community uses ClearML and is looking at various model deployment strategies. We are looking for a seamless integration with Triton, but noted that Triton does not support deployment strategies. ClearML-Serving seems to, but the strategies are rather limited. Is there a roadmap to expand ClearML-Serving?
Do you have more info on vault?
Actually it only makes sense if the entire department or organisation is saving their models in a common repo. In our case this is not possible due to client security (e.g. training data from clients could potentially be 'reverse engineered' from trained models in the future), so each department, and even each project, will need its own repo.
Hi, we are still not getting the model repo to work, mainly due to clearml.storage failing to save the models.
We tried vanilla boto3 code and it works, but we can't figure out why we get ConnectionResetError 104 when ClearML does it.
How do we configure ClearML to correspond to the following boto3 code?
```
s3 = boto3.resource(
    's3',
    endpoint_url='https://ecs.ai',
    aws_access_key_id='mykey',
    aws_secret_access_key='mysecret',
    config=Config(signature_version='s3v4'),
    region_name='us-east-1',
    ve...
```
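For reference, the closest clearml.conf mapping we could come up with looks like this (the host:port and the secure flag are our guesses for the ecs.ai endpoint):
```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # non-AWS S3 endpoint: host includes the port
                    host: "ecs.ai:443"   # assumed port
                    key: "mykey"
                    secret: "mysecret"
                    multipart: false
                    secure: true         # https endpoint
                }
            ]
        }
    }
}
```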
Ok thanks, that explains a lot. We have been doing this wrongly the whole time, thinking that the clearml.conf on the client side would be acknowledged by the remote agent execution. In reality, only the api section is used.
I see. Can I take it that when the client uses `task.execute_remotely(queue_name="1gpu", exit_process=True)`, then none of the content in its clearml.conf will be used except for the api part, and ClearML simply uses whatever is on the agent side?
```
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server:
    # Credentials are generated using the webapp, ...
    # Override with os environment: ...
}
```
Going back to the open source version: I think that adding the credentials as part of the source code might allow the "credentials" to auto-populate as part of the remote execution, wdyt?
Not sure how this will work when I can't supply the credentials to ClearML programmatically.
Yes, it's on purpose: each user would have their own AWS credentials for default_output_uri.
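Roughly what we had in mind (endpoint, bucket and keys are placeholders; I'm assuming ClearML's S3 driver, boto3 underneath, picks up the standard AWS env vars when no key/secret is set in clearml.conf):
```python
import os
from clearml import Task

# each user exports their own credentials before launching;
# boto3's default credential chain should pick these up
os.environ['AWS_ACCESS_KEY_ID'] = 'per-user-key'         # placeholder
os.environ['AWS_SECRET_ACCESS_KEY'] = 'per-user-secret'  # placeholder

task = Task.init(
    project_name='demo',
    task_name='per-user-output-uri',
    # non-AWS endpoint syntax: s3://host:port/bucket (assumed)
    output_uri='s3://ecs.ai:443/clearml-models',
)
```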
I thought of another potential way but not sure if the SDK supports it.
We would perform a manual save and upload of the model using vanilla boto3, with credentials passed in as env vars, and then use the ClearML SDK to update the Model Repo with the location of the model, without ClearML uploading it explicitly.
Would the above work?
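Something like this sketch, if the SDK supports it (bucket and paths are made up; I'm assuming `update_weights(register_uri=...)` registers the link without uploading anything):
```python
import boto3
from clearml import Task, OutputModel

# 1) upload the weights ourselves with vanilla boto3,
#    credentials coming from the usual AWS_* env vars
s3 = boto3.resource('s3', endpoint_url='https://ecs.ai')
s3.Bucket('clearml-models').upload_file('model.pt', 'proj/model.pt')

# 2) only register the final location with ClearML, no SDK upload
task = Task.init(project_name='proj', task_name='manual-model-upload')
output_model = OutputModel(task=task)
output_model.update_weights(
    register_uri='s3://ecs.ai:443/clearml-models/proj/model.pt'
)
```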
I didn't track which version this change in behaviour happened in, but the last time I tried, it was able to download the content after I provided the credentials.
Hi,
I'm running on a Dell ECS storage appliance, which offers S3 compatibility.
Yes, http://ECS.ai is the DNS name of the server.
ClearML-models is the bucket.
Let me try with ip:port.
Thanks, it's attached.
I also noted that the status on ClearML is always 'Pending', unlike others which say 'Running'. Is this a side effect of using the k8s glue?
Hi, thanks. How about the Agent: does its docker mode or k8s mode require docker.sock to be exposed?
Hi, please correct me if I am wrong; to use the glue, I need the following:
- A k8s cluster
- A kubectl that is connected to the k8s cluster
- A pip install of clearml-agent 0.17.1
I did all of the above, but I'm not sure what is meant by running the entire thing on my own machine.
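Concretely, what I ran (assuming the k8s_glue_example.py script from the clearml-agent repository and our 1gpu queue):
```bash
pip install clearml-agent==0.17.1
# launch the k8s glue: it polls the queue and spawns a pod per task
python k8s_glue_example.py --queue 1gpu
```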
Unfortunately it's not. The problem previously encountered with the docker method surfaced again. In this case, the BASE DOCKER IMAGE `nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true` is not taking effect with the k8s glue.
Thanks 👍. Should I create an issue on GitHub?
I meant the dataset id.
Hi, it makes sense to automate this part just like you automate the rest of the MLOps flow, especially when you already support Data Versioning/Lineage; Data Provenance (how a dataset ties into the experiment and serves as a model source) should be in too. Although I agree that, technically, it's probably not possible to tell whether users actually used the indicated datasets after they do a datasets.get_copy().
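For context, the access pattern I mean is roughly this (dataset ID is made up, and I'm using `Dataset.get(...).get_local_copy()` as the concrete form of the get-copy call):
```python
from clearml import Dataset

# fetch a local, read-only copy of a registered dataset version;
# after this point ClearML has no way to verify the files were actually used
ds = Dataset.get(dataset_id='a1b2c3d4e5')  # hypothetical ID
local_path = ds.get_local_copy()
print(local_path)
```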
AgitatedDove14, I'm Jax, not Manoj! lol 😅 😅
Sorry AgitatedDove14, can you bump me to that thread?
Does the bash script need clearml-agent to be able to communicate with the HTTPS clearml-server first? If yes, there's a chicken-and-egg problem here.
Sorry, in case I misunderstood you: are you referring to the extra_docker_shell_script?
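i.e. the agent-side setting that injects shell lines into the task container before anything else runs; a sketch of what I think we'd put there (the CA file and mount path are assumptions for our self-signed cert):
```
agent {
    # each line runs inside the task container at startup,
    # before the agent process in the container talks to the server
    extra_docker_shell_script: [
        "cp /mnt/certs/my-ca.crt /usr/local/share/ca-certificates/",  # assumed mount path
        "update-ca-certificates",
    ]
}
```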
Some breakthrough: the problem is that we switched the web, api and files servers to HTTPS (SSL) endpoints. I switched back to HTTP endpoints to test this theory.
Although it's not printing the error, I suspect it's not able to connect due to the lack of the self-signed cert. Previously this wasn't an issue; not sure what changed in clearml_agent=1.1.0.
There's a secondary issue resulting from this; I will put it in a new thread.
It's running as a long-running pod on K8s. I'm using log -f to track its stdout.
Is there a way for the k8s glue to pass self-signed cert information on to the agent pods?
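In the meantime, I guess the blunt workaround on the agent side would be to disable verification in clearml.conf (the proper fix is still installing the CA into the pods):
```
api {
    # workaround for a self-signed server cert: skip SSL verification
    verify_certificate: false
}
```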
Ok, I get the logic now: extra_docker_shell_script executes before clearml-agent talks to the clearml server.
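So, as I understand it (my paraphrase of the startup order, not the actual entrypoint):
```bash
# 1. the pod starts and runs each agent.extra_docker_shell_script line
#    (e.g. installing our self-signed CA into the container's trust store)
# 2. only then does clearml-agent connect to the HTTPS clearml-server
# 3. the agent clones the repo, installs requirements, and runs the task
```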