Hi, I don't think the clearml-agent actually ran at that point in time. All I can see in the pod is:
apt install of the libpthread-stubs, libx11, libxau and libxcb1 packages, then pip install of clearml-agent. After the above succeed, the pod just hangs there.
Ok. Any idea what could be going on between setting up clearml-agent and initialising the clearml-agent itself? Does the clearml-agent try to communicate with any internet address? From another perspective, it looks like a long timeout issue. I happen to be deploying on a disconnected on-premise setup.
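In the meantime, here's the kind of sanity check I can run from inside the pod, assuming the api server exposes the usual debug.ping health endpoint (the hostname below is a placeholder for our on-prem address):
```
# minimal connectivity check from inside the agent pod;
# the api server URL is a placeholder for the on-prem address
import requests

resp = requests.get("http://apiserver.clearml.local:8008/debug.ping", timeout=5)
print(resp.status_code, resp.text)
```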
Does the glue write any error logs anywhere? I only see `CLEARML_AGENT_UPDATE_VERSION =` and nothing else.
Is there a way for the k8s glue to pass self-signed cert information on to the agent pods?
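For reference, what I'd expect to end up with inside the pod is something along these lines in the agent's clearml.conf (a sketch; turning verification off would only be for testing, otherwise I'd point python-requests at the CA via the standard REQUESTS_CA_BUNDLE env variable):
```
# sketch: agent-side clearml.conf on the pod (testing only)
api {
    verify_certificate = false
}
```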
Alright, thanks. It's important we clarify that it works before we migrate the infra.
It's hard to tell, but the agent change was a significant one. Unless Python versions have something to do with it.
Hi,
I'm running on a Dell ECS storage appliance, which offers S3 compatibility.
Yes, http://ECS.ai is the DNS name of the server.
ClearML-models is the bucket.
Let me try with ip:port.
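For reference, the sdk.aws.s3 section in clearml.conf that I'm testing looks roughly like this (the ip:port and keys below are placeholders):
```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # ECS S3-compatible endpoint as ip:port (placeholder values)
                    host: "10.0.0.1:9020"
                    bucket: "clearml-models"
                    key: "ACCESS_KEY"
                    secret: "SECRET_KEY"
                    multipart: false
                    secure: false  # http endpoint, not https
                }
            ]
        }
    }
}
```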
Hi, by agent logs I suppose you meant the logs from the ClearML server console panel?
Hi, just wondering if I did something wrong here. Would the k8s glue be the reason it's not working? I'm purchasing the enterprise version, and if vault has the same problem it'll be a big issue.
Can I dig into MongoDB or ES to pull this data?
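For example, something like this against the ES container is what I had in mind (host/port and index pattern are assumptions based on the defaults):
```
# sketch: read-only query against the ClearML Elasticsearch backend;
# host/port and index pattern are assumptions based on default deployments
import requests

es = "http://localhost:9200"
resp = requests.get(f"{es}/events-log-*/_search", json={"size": 10})
print(resp.json())
```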
I think in general, the 'published' action can be considered an 'approval'. The question is, how do we control who has the authority to 'publish'? The Web UI today does not support any uploads outside of the coding environment; it would be nice if that were supported. For now, the only workaround is to include parameters that store document URLs in the user properties.
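For context, that workaround looks roughly like this (the property names and URL are my own placeholders):
```
from clearml import Task

task = Task.get_task(task_id="<task_id>")  # placeholder task id
# attach approval metadata as user properties, visible in the web UI
task.set_user_properties(
    approval_doc="https://intranet.example/docs/approval-123",  # placeholder URL
    approved_by="reviewer_name",
)
```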
```
docker exec clearml-elastic curl
zsh: no matches found:
[root@2c7498711bef elasticsearch]# curl -XGET
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_...
```
Some breakthrough. The problem is that we switched the web, api and files servers to https (ssl) endpoints. I switched back to http endpoints to test this theory.
Although it's not printing the error, I suspect it's not able to connect due to the lack of the self-signed cert. Previously this wasn't an issue; not sure what changed in clearml_agent=1.1.0.
There's a secondary issue resulting from this; I will put it on a new thread.
Nice, what are the names of the talks?
Hi, this is the setup.
client
```
from clearml import Task, Logger

task = Task.init(project_name='DETECTRON2', task_name='Train', task_type='training')
task.set_base_docker("quay.io/fb/detectron2:v3 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="single_gpu", exit_process=True)
```
k8s_glue_example.py spawned a pod and the task started running.
ClearML UI -> Experiment -> Results -> Console.
At the top it will pri...
Sorry, I don't quite understand this. The task itself was submitted when I ran the code on the client. I suppose the dependency requirements would be copied over when the experiment is cloned?
Ah ok. So it will be fixed on the ClearML server web UI as well? (See screenshots).
Hi. Yup, the model was not physically uploaded with the ip:port into the bucket, although ClearML does indicate that it's there, except that I can't download it. I also verified this with another S3 client; the model was not there either.
AgitatedDove14 , would you elaborate on this resolution process?
Ok thanks, looking forward to it. Could you share details on the bug you encountered?
Hi, this is what I got. No mention of the env variables.
```
Current configuration (clearml_agent v0.17.2, location: /home/jax/clearml.conf):
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
ap...
```
Ok thanks. This would mean that increasing the disk space for my ClearML server is the only option, as we are not at liberty to delete.
Can this issue be solved with vault? It doesn't make sense to expose secrets like that.
Hi, the latest k8s_glue_example.py was last committed about 4 months ago. Are you referring to that version?
Ok. I noted this is due to the venv_update setting. It needs to be disabled, as it has a dependency on an internet URL. We can close this.
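For anyone hitting the same thing, the setting I mean lives in the agent section of clearml.conf, roughly like this (assuming the key layout of the agent's default config):
```
agent {
    venv_update {
        # disable, since updating depends on reaching an internet URL
        enabled: false
    }
}
```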