So I'm setting up the server like this:
```yaml
global:
  defaultStorageClass: $STORAGE_CLASS
apiserver:
  replicaCount: 1
  resources:
    requests:
      cpu: "200m"
      memory: "512Mi"
    limits:
      cpu: "2000m"
      memory: "4Gi"
  service:
    type: NodePort
    nodePort: 30008
    port: 8008
  ad...
```
```bash
# Step 2: Login via web UI API
LOGIN_URL="http://${NODE_IP}:30080/api/v2.31/auth.login"
LOGIN_PAYLOAD='{"username": "k8s-agent"}'
LOGIN_RESPONSE=$(curl -s -X POST \
  -H "Content-Type: application/json" \
  -d "$LOGIN_PAYLOAD" \
  "$LOGIN_URL")
```
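Not part of the paste, but a hedged way to see whether that POST actually succeeded: ClearML API responses generally wrap results in a `meta` envelope, so the result code can be pulled out of `LOGIN_RESPONSE`. The sample JSON below is an assumed shape standing in for real server output:

```shell
# Sample envelope (assumed shape), standing in for the real $LOGIN_RESPONSE.
LOGIN_RESPONSE='{"meta":{"result_code":200,"result_msg":"OK"},"data":{}}'
# Extract meta.result_code with sed so the check works even without jq installed.
RESULT_CODE=$(echo "$LOGIN_RESPONSE" | sed -n 's/.*"result_code":\([0-9]*\).*/\1/p')
echo "result_code=$RESULT_CODE"
```

If the code is not 200, the `result_msg` field in the same envelope usually explains why the call was rejected.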
From what I could find, since the serving endpoint is not treated as an independent environment, the packages are being installed into a Python 3.8.10 environment, while the endpoint is trying to load them from another Python version that does not contain the packages. But I cannot change either version, and I don't understand why...
Hi, thanks for the reply!
I have set it and it's downloading (I checked the container logs), but when I try to POST I get that error.
But as of right now I can access and serve models; I just cannot get them listed on the server interface.
Yeah, that is also my understanding. I have it on a separate machine, as resources are not an issue.
OK, perfect, that was what I was looking for. Thank you so much!
Do you have any idea what might cause this?
```bash
then
  helm install clearml clearml/clearml \
    --namespace "$NS" \
    --values /tmp/server-values.yaml \
    --wait \
    --timeout "$TMO"
```
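For context, a minimal sketch of how `/tmp/server-values.yaml` could be rendered before that `helm install` runs; the `STORAGE_CLASS` value here is a placeholder, and only keys that appear earlier in the thread are used:

```shell
# Render the values file consumed by the helm install above.
# STORAGE_CLASS is a placeholder value for illustration.
STORAGE_CLASS="standard"
cat > /tmp/server-values.yaml <<EOF
global:
  defaultStorageClass: ${STORAGE_CLASS}
apiserver:
  service:
    type: NodePort
    nodePort: 30008
EOF
grep "defaultStorageClass" /tmp/server-values.yaml
```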
Examples that I tested:
```bash
# Step 1: Create a user via the web UI API
CREATE_USER_URL="http://${NODE_IP}:30080/api/v2.31/auth.create_user"
USER_PAYLOAD='{"email": "k8s-agent@clearml.ai", "name": "k8s-agent", "company": "clearml", "given_name": "k8s", "family_name": "agent"}'
CREATE_USER_RESPONSE=$(curl -s -X POST \
  -H "Content-Type: application/json" \
  -d "$USER_PAYLOAD" \
  "$CREATE_USER_URL")
```
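A small variation on Step 1, not from the original thread: building `USER_PAYLOAD` from variables so the same snippet can register more than one agent user. Only fields already present in the paste are used:

```shell
# Hypothetical helper: assemble the create_user payload from variables
# instead of a hardcoded string.
USER_NAME="k8s-agent"
USER_EMAIL="k8s-agent@clearml.ai"
USER_PAYLOAD=$(printf '{"email": "%s", "name": "%s", "company": "clearml"}' \
  "$USER_EMAIL" "$USER_NAME")
echo "$USER_PAYLOAD"
```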
Because when I check, it references something from 3 years ago, and I am following this: None
with the following values in Helm:
```
helm get values clearml-agent -n clearml-prod
USER-SUPPLIED VALUES:
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  image:
    pullPolicy: Always
    repository: allegroai/clearml-agent-k8s-base
    tag: latest
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlRefe...
```
If I run `helm get values clearml-agent -n clearml-prod`, the output is the following:
```
USER-SUPPLIED VALUES:
agentk8sglue:
  apiServerUrlReference: None
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference: None
  image:
    pullPolicy: Always
    repository: allegroai/clearml-agent-k8s-base
    tag: 1.25-1
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests...
```
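The `None` values for `apiServerUrlReference` and `fileServerUrlReference` look suspicious: the agent appears to have no server endpoints to talk to. A hedged sketch of what those values might look like, using the ports that appear earlier in the thread (the fileserver port below is a placeholder I made up; substitute your real NodePorts and node IP):

```yaml
agentk8sglue:
  apiServerUrlReference: "http://<NODE_IP>:30008"   # apiserver NodePort from the server values
  webServerUrlReference: "http://<NODE_IP>:30080"   # webserver port used in the curl calls
  fileServerUrlReference: "http://<NODE_IP>:30081"  # placeholder; use your fileserver NodePort
```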
Hey! @<1729671499981262848:profile|CooperativeKitten94> Are there any tips you can give me on this?
It seems like the most recent version supported for Kubernetes is clearml-agent==1.9.2?
thanks again!
With no success, @<1523701070390366208:profile|CostlyOstrich36>. I hope this provides a clear idea of what I am trying; any help is fantastic.
I will try to:
1. Update the agent with these values
2. Run Argo with those changes
I also see these logs:
```bash
/root/entrypoint.sh: line 28: /root/clearml.conf: Read-only file system
```
This indicates that the container's filesystem is mounted read-only, preventing the agent from writing its configuration file.
From:
```yaml
podSecurityContext:
  readOnlyRootFilesystem: true  # This causes the issue
```
This can also be enforced by:
- PodSecurityPolicies
- Security Context Constraints (OpenShift)
- Admission controllers enforcing read-only filesystems
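If the read-only root filesystem is coming from the chart's own security context (rather than a cluster-wide policy), a sketch of what the override might look like; the exact key path under the chart's values is an assumption and worth checking against the chart:

```yaml
agentk8sglue:
  podSecurityContext:
    readOnlyRootFilesystem: false  # let the entrypoint write /root/clearml.conf
```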
What I intended to do was the same thing via API calls, so I can automate it.
`cat values-prod.yaml`:
```yaml
agent:
  api:
    web_server: ""
    api_server: ""
    files_server: ""
    credentials:
      access_key: "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
      secret_key: "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
```
Following up on this: I was unable to fix the issue, but I ended up finding another complication. When uploading an ONNX model using the upload command, it keeps getting tagged as a TensorFlow model, even with the correct file structure, and that leads back to the previous issue, since the serving module will then search for a different format than ONNX.
As far as I could see, this comes from the helper inside the Triton engine, but as of right now I could not fix it.
Is there anything I might be doing ...
I also tried to match it with secure.conf:
```
Defaulted container "clearml-apiserver" out of: clearml-apiserver, init-apiserver (init)
{
    "http": {
        "session_secret": {
            "apiserver": "V8gcW3EneNDcNfO7G_TSUsWe7uLozyacc9_I33o7bxUo8rCN31VLRg"
        }
    },
    "auth": {
        "fixed_users": {
            "enabled": true,
            "pass_hashed": false,
            "users": [
                {"username": "admin", "password": "clearml123!", "name": "Administrator"...
```