I just opened a shell with the api and tried to curl my files URL, and the curl just hangs. No response.
maybe a cors issue?
` curl --insecure -sw %{http_code} -o /dev/null
init-k8s-glue waiting for apiserver ... `
I think the issue is that pod-to-pod comms can't resolve my Route 53 DNS records
those are the init container logs from k8glueagent
I think this is VPN related now
yep, using references like clearml-webserver.clearml.svc.cluster.local:80 fixed it
I think if I use the local service URL this problem is fixed
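A hanging curl can mean either that the name never resolves or that the connection never completes. Below is a minimal sketch for telling the two apart from inside a pod. The function names `split_host_port` and `probe` are made up for illustration, and the only URL used is the in-cluster one mentioned above. The connect step has a timeout; the DNS lookup itself uses the system resolver's own timeout.

```python
import socket
from urllib.parse import urlsplit


def split_host_port(url):
    """Return (host, port) for a URL, falling back to the scheme's default port."""
    parts = urlsplit(url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def probe(url, timeout=3.0):
    """Report whether the URL's host resolves and accepts a TCP connection."""
    host, port = split_host_port(url)
    result = {"url": url, "resolved": False, "address": None,
              "connected": False, "error": None}
    try:
        # Name resolution only; raises socket.gaierror if the name is unknown.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = str(exc)
        return result
    result["resolved"] = True
    result["address"] = infos[0][4][0]
    try:
        # The timeout turns a silent hang into an explicit error.
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:
        result["error"] = str(exc)
    return result


if __name__ == "__main__":
    # Substitute the URLs from your own clearml.conf here.
    for candidate in ["http://clearml-webserver.clearml.svc.cluster.local:80"]:
        print(probe(candidate))
```

If the public URL shows `resolved: False` while the `svc.cluster.local` name connects, that points at DNS rather than at the server itself.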
ok yes, this is the problem
thank you for the help!
the worker is now in the dashboard
I'm not familiar enough with Helm to clone this, fix it, and then test it
The task pod (experiment) started reaching out to an IP associated with malicious activity. The IP was associated with 1000+ domain names. The activity was flagged by AWS GuardDuty with a high severity level.
Would using Ubuntu 22.04 still work for the task execution?
` "ipAddressV4": "165.160.15.20",
"organization": {
"asn": "19574",
"asnOrg": "CSC",
"isp": "Corporation Service Company",
"org": "Corporation Service Company"
},
"country": {
"countryName": "United States"
},
"city": {
...
"title": "Unusual outbound communication seen from EC2 instance i-<> on server port 80.",
I suppose a short-term hack would be to just edit the /etc/hosts file and redirect the public URL to the k8s DNS URL?
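One caveat with the /etc/hosts idea: a hosts entry maps a hostname to an IP address, not to another hostname, so the in-cluster service name has to be resolved to its IP first. A rough sketch of that step follows; `hosts_entry` and `entry_for_service` are hypothetical helper names, and the resolution only works when run inside the cluster.

```python
import socket


def hosts_entry(ip_address, *hostnames):
    """Format one /etc/hosts line: an IP followed by the names that map to it."""
    if not hostnames:
        raise ValueError("at least one hostname is required")
    return f"{ip_address}\t{' '.join(hostnames)}"


def entry_for_service(service_dns, public_hostname):
    """Resolve the in-cluster service name and point the public hostname at its IP."""
    # gethostbyname returns a single IPv4 address for the given name.
    cluster_ip = socket.gethostbyname(service_dns)
    return hosts_entry(cluster_ip, public_hostname)
```

This only overrides name resolution. It does not change the scheme or port, so it will not help if the public URL is served over HTTPS on 443 while the service listens on plain HTTP on 80, and the entry goes stale if the service is recreated with a different IP.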
"additionalInfo": { "inBytes": "438", "localPort": "9134", "outBytes": "401", "unusual": "80", "value": "{\"inBytes\":\"438\",\"localPort\":\"9134\",\"outBytes\":\"401\",\"unusual\":\"80\"}", "type": "default" },
This is to address the PYTHONPATH issues
Also what is the base path where the git repo is cloned? So if my repo is called myProject.git, what would the full path be?
so it's not the files server, it's every server
Made some progress getting the GPU nodes to provision, but got this error on my task: K8S glue status: Unschedulable (0/4 nodes are available: 1 node(s) had taint {nvidia.com/gpu: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.)
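For reference, the first half of that scheduler message means the pod spec has no toleration matching the GPU node's taint. Here is a simplified sketch of how a toleration is matched against a taint, written as plain Python over dicts shaped like the Kubernetes fields. The `NoSchedule` effect is an assumption, since the message does not show the effect, and the function names are illustrative only.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    # An empty key is only meaningful with operator Exists, where it matches any key.
    if key == "":
        if operator != "Exists":
            return False
    elif key != taint["key"]:
        return False
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def pod_tolerates_all(tolerations, taints):
    """A pod can land on a node only if every taint is matched by some toleration."""
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


# Taint key and value from the error message; the effect is assumed.
GPU_TAINT = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
# A toleration that would match it, to go in the pod template used for tasks.
GPU_TOLERATION = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
```

The second half of the message (3 nodes not matching the node affinity/selector) is a separate check, and a toleration does not affect it.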
When I exec into the pod, it says I need sudo, but I'm wondering if extra_docker_shell_script is already executed with root privileges?
ahhh, it's possible my clearml.conf was using the public URLs when I made it. Let me try this
AgitatedDove14 SmugDolphin23 Would the following subprocess calls break the auto connect to frameworks like tensorboard?
` exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
    cmd += ["--training-run", str(training_run_id)]
logging.info("Training classifier with command:\n%s", " ".join(cmd))
returncode = subprocess.Popen(cmd).wait() `Note ` /home/npuser...
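Separately from the auto-connect question, the interpreter path in the snippet is hardcoded to one venv location. A possible variation, assuming the parent script is already running under the interpreter the child should use, is to pass `sys.executable` instead. The helper names `build_train_command` and `run_training` are made up for this sketch. It does not by itself settle whether framework logging carries over into the child process.

```python
import logging
import subprocess
import sys


def build_train_command(strategy_pipeline, train_config_path,
                        training_run_id=None, python=sys.executable):
    """Build the argv list for the training script without a hardcoded venv path."""
    exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
    cmd = [python, exe, train_config_path]
    if training_run_id:
        cmd += ["--training-run", str(training_run_id)]
    return cmd


def run_training(cmd):
    """Run the command, wait for it to finish, and return its exit code."""
    logging.info("Training classifier with command:\n%s", " ".join(cmd))
    # subprocess.run blocks until the child exits and returns a CompletedProcess.
    return subprocess.run(cmd).returncode
```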