Just a couple of notes:
From the k8s perspective a pod is ephemeral, so if it’s gone for any reason, it’s gone. Obviously there are structures that can ensure a running state (like Deployments or StatefulSets), so if a pod dies, another one takes its place. We didn’t go in this direction because pods are not idempotent, so it’s not straightforward to simply replace them. Btw this looks like an interesting topic to me, so I’d like to include SuccessfulKoala55 on this also because I’m involved more in the infra side of the equation and I ma...
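For context, a minimal sketch of such a structure; every name and the image here are just placeholders:
```
# Minimal Deployment sketch: if a pod dies, the ReplicaSet controller
# immediately creates a replacement to keep `replicas` pods running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-worker   # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-worker
  template:
    metadata:
      labels:
        app: example-worker
    spec:
      containers:
        - name: worker
          image: busybox   # placeholder image
          command: ["sleep", "infinity"]
```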
ok, but describing the pod you should see, at least, the termination cause
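e.g. (pod name and namespace are placeholders):
```
# Look at "Last State" / "Reason" and the Events section to see why
# the container terminated (e.g. OOMKilled, Error, Evicted)
kubectl describe pod <pod-name> -n clearml
```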
what kind of ClearML installation did you do on the machine? Are there processes listening on these ports?
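e.g., assuming a default docker-compose install (webserver on 8080, apiserver on 8008, fileserver on 8081), something like:
```
# Check whether anything is already bound to the ClearML default ports
ss -tlnp | grep -E ':(8080|8008|8081)'
# or, with docker-compose, verify the containers themselves are up
docker ps --format '{{.Names}}\t{{.Ports}}'
```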
I suggest to exec into the pod and issue the command `kubectl delete pod -l=CLEARML=agent-74b23a8f --namespace=clearml --field-selector=status.phase!=Pending,status.phase!=Running --output name`
so you can see the output from inside the pod. This should help understand what is going on with the command
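i.e. something along these lines (the pod name is a placeholder):
```
# Get a shell inside the agent pod, then run the delete from there
kubectl exec -it <agent-pod-name> -n clearml -- /bin/bash
# inside the pod:
kubectl delete pod -l=CLEARML=agent-74b23a8f --namespace=clearml \
  --field-selector=status.phase!=Pending,status.phase!=Running --output name
```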
I need to evaluate a better way to handle image versioning for the future; btw, the latest version should be fixed. Apologies folks!
AgitatedDove14 trainsConfig is totally optional and you can put the config file itself in it, e.g.:
```
trainsConfig: |-
  sdk {
    aws {
      s3 {
        key: ""
        secret: ""
        region: ""
        credentials: [
          {
            host: "minio.minio:9000"
            key: "DEMOaccessKey"
            secret: "DEMOsecretKey"
            ...
```
how your cluster reacts is about scaling infra as much as needed (Karpenter or any other cloud autoscaler should work)
I don’t think it’s related to how the agent talks with the apiserver or fileserver. It’s more related to the fact that the kubectl inside the agent pod cannot contact the Kubernetes apiserver
iptables is used by Docker itself, so you need to be careful when making mods: https://docs.docker.com/network/packet-filtering-firewalls/
it will be easier for me to reproduce
I think we can find a solution pretty quickly after some checks. Can you pls open an issue on the new Helm chart repo so I can take care of it in the coming days?
did you try to create a debug pod with a mount using the Ceph storage class? You can start from here: https://downey.io/notes/dev/ubuntu-sleep-pod-yaml/ then add the PVC and the mount. Then you should exec into the pod and try to write a dummy file on the mount; I suspect the problem is there
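A rough sketch of what I mean, assuming your Ceph storage class is named `ceph-rbd` (adjust names to your setup):
```
# Debug pod that sleeps forever with a PVC mounted, so you can
# exec into it and test writes on the Ceph-backed volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: debug-pvc
spec:
  storageClassName: ceph-rbd   # assumption: your Ceph storage class name
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleep
spec:
  containers:
    - name: ubuntu
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: test-vol
          mountPath: /mnt/test
  volumes:
    - name: test-vol
      persistentVolumeClaim:
        claimName: debug-pvc
```
then something like `kubectl exec -it ubuntu-sleep -- touch /mnt/test/dummy-file` to test the write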
Can you pls share all 3 health checks?
so do you want to mount files into the agent pod?
our data engineers directly write code in PyCharm and test it on the fly with breakpoints. When it’s good, we simply commit to git and set a "prod ready" tag
adding SuccessfulKoala55 to the conversation because I’m not totally sure the problem lies in the ingress; it looks to be a bad token, but it shouldn’t be, since init was good
Ok, I’d like to test it more with you; the credentials exposed in the chart values are system ones and it’s better not to change them; let’s forget about them for now. If you create a new access key/secret key pair in the UI, you should use those in your agents and they should not get overwritten in any way; can you confirm it works without touching the credentials section?
then I enqueue it and it's created but obv empty
btw, judging from the screenshots the services are ok but the pods are not up; elastic, redis and mongodb especially are Pending
so it means k8s didn’t schedule them for some reason, which you can find by describing these pods
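e.g. (pod name is a placeholder):
```
# For a Pending pod the Events section usually shows the scheduling
# failure (insufficient CPU/memory, unbound PVC, node selector, ...)
kubectl describe pod <elastic-pod-name> -n clearml
# or look at the recent events directly
kubectl get events -n clearml --sort-by=.lastTimestamp
```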
mmmmm it should not be related to the chart as far as I know; I’m going to ping SuccessfulKoala55, maybe he can chime in because I’m not sure why it’s happening
maybe this can cause the issue?
This is K8s infra management specific; usually I use Velero for backups
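a minimal sketch, assuming Velero is already installed and the release lives in the `clearml` namespace:
```
# Back up everything in the clearml namespace (PVs included only if a
# volume snapshotter / file-system backup integration is configured)
velero backup create clearml-backup --include-namespaces clearml
# restore it later
velero restore create --from-backup clearml-backup
```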
or maybe default?
there are workarounds tbh, but they are tricks that require a lot of k8s expertise and they are risky
accessing the apiserver from a pod doesn’t require a kubeconfig
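the serviceaccount credentials auto-mounted into every pod are enough, e.g. from inside a pod:
```
# kubectl (and curl) can use the auto-mounted serviceaccount token,
# so no kubeconfig file is needed inside the pod
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc/api
```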
btw, a good practice is to keep infrastructural stuff decoupled from applications. What about using https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner? After applying that chart you can simply use the generated storage class; wdyt?
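installing it is basically (the NFS server and path are placeholders for your export):
```
helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=<NFS_SERVER_IP> \
  --set nfs.path=/exported/path
```
PVCs can then reference the generated storage class (named `nfs-client` by default)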
I guess apiServerUrlReference should be fixed