Thanks, in DM I sent you the conf we use to deploy the agents.
thanks a lot! So as long as we have the StorageClass in our Kubernetes cluster configured correctly, the new Helm chart should work out of the box?
do I need something else in the clearml.conf?
CostlyOstrich36 so I don't have to write the clearml.conf?
I would like to set things up so that a data scientist working on a project doesn't have to know about buckets and this sort of thing... Ideally the server and the agents are configured with a default bucket...
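For reference, a minimal sketch of what I have in mind for the agents' clearml.conf (bucket name, credentials and region here are placeholders, not real values):
` sdk {
    development {
        # default destination for every task's artifacts and models,
        # so users never have to set a bucket themselves
        default_output_uri: "s3://company-clearml-bucket/experiments"
    }
    aws {
        s3 {
            # credentials the agents use to reach that bucket
            key: "AWS_ACCESS_KEY_PLACEHOLDER"
            secret: "AWS_SECRET_KEY_PLACEHOLDER"
            region: "eu-west-1"
        }
    }
} `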
If I now abort the experiment (which is in a pending state and not running) and re-enqueue it to the CPU queue -- no parameter modifications this time -- I see that it is sent to the right queue, and after a few seconds the job enters a running state and completes correctly
After trying Gaspard's changes to the Helm chart values, I now see that a pod for the agentservice is also deployed,
and some of the logs point to a misconfiguration on my side (the fact that it can't access resources externally);
some others I don't understand:
` Err:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out
Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out
Could not connect to archive.ubuntu.com... `
In the ClearML UI it stays in a Pending state
I can see the outputs from Argo, so I know that some resources have been created, but I can't inspect the full logs;
the ones I have available are all records similar to ` No tasks in queue 80247f703053470fa60718b4dff7a576 `
Exactly that :) if I go to the queue tab, I see a new queue (that I didn't create)
with a name like "4gh637aqetc"
I have tried this several times and the behaviour is always the same. It looks like, when I modify some hyperparameter and then enqueue the experiment to a queue, things don't work unless I have previously set the value of k8s-queue to the name of the queue I want to use. If I don't modify the configuration (e.g. I abort or reset the job and enqueue it again, or clone and enqueue it without modifying the hyperparameters), then everything works as expected.
Thanks, adding environment variables to the agentservice solved it, but for the agentgroup agent I can't see any obvious way to inject environment variables: in the Helm chart template I don't see any way to pass custom environment variables to the pod.
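For reference, this is the kind of standard Kubernetes env stanza the rendered pod spec would need (the proxy variables are just an illustration of what we'd want to inject; our actual values differ):
` env:
  - name: HTTP_PROXY
    value: "http://proxy.internal:3128"   # example value only
  - name: NO_PROXY
    value: "localhost,.cluster.local" `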
but I was a bit thrown off track by seeing errors in the logs
Hi Jake thanks for your answer!
So I just have a very simple file "project.py" with this content:
` from clearml import Task
task = Task.init(project_name='project-no-git', task_name='experiment-1')
import pandas as pd
print("OK") `
If I run ` python project.py ` from a folder that is not in a git repository, I can clone the task and enqueue it from the UI, and it runs in the agent with no problems.
If I copy the same file into a folder that is in a git repository, when I enqueue the ex...
Thanks SuccessfulKoala55 . Any idea why going to the address https://allegroai.github.io/clearml-helm-charts
returns a 404 error?
For other repositories used in Argo CD examples (e.g. https://bitnami-labs.github.io/sealed-secrets , which is also hosted on GitHub), instead of returning a 404, the index.yaml page is loaded.
I suspect this might be the reason why I can't make it work with ClearML.
I see in bitnami's gh-pages branch a file, https://github.com/bitnami-labs/sealed-secrets/blob/gh-pages/index.html , that does the redirect and contains:
` <html>
<head>
<meta http-equiv="refresh" content="0; url=...">
</head>
<p><a href="...">Redirect to repo index.yaml</a></p>
</html> `
A similar file is missing in the ` clearml-helm-charts ` ` gh-pages ` branch.
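Presumably an analogous file in the ` clearml-helm-charts ` ` gh-pages ` branch would look like this (the target URL is my assumption, based on the standard Helm repository layout):
` <html>
<head>
<meta http-equiv="refresh" content="0; url=https://allegroai.github.io/clearml-helm-charts/index.yaml">
</head>
<p><a href="https://allegroai.github.io/clearml-helm-charts/index.yaml">Redirect to repo index.yaml</a></p>
</html> `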
PunyWoodpecker71 just create a Personal Access Token and use it as the value for CLEARML_AGENT_GIT_PASS, https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
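For example, in a Kubernetes deployment it could look like this sketch (the secret name is made up; CLEARML_AGENT_GIT_USER should be the account the token belongs to):
` env:
  - name: CLEARML_AGENT_GIT_USER
    value: "your-github-username"
  - name: CLEARML_AGENT_GIT_PASS
    valueFrom:
      secretKeyRef:
        name: clearml-git-pat   # hypothetical secret holding the PAT
        key: token `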
actually there are some network issues right now, I'll share the output as soon as I manage to run it
I see... because the problem would then be with permissions when creating artifacts to store in the "/shared" folder
thanks for the help!
Hi Jake, unfortunately I realized we put a load balancer in front, so any address like address.domain would ping
This is the list of all the environment variables (starting with CLEARML) available in the Pod spawned by the K8s Glue Agent:
` CLEARML_MONGODB_PORT_27017_TCP_PORT
CLEARML_FILESERVER_PORT_8081_TCP_ADDR
CLEARML_ELASTIC_MASTER_PORT_9200_TCP
CLEARML_APISERVER_PORT_8008_TCP_PROTO
CLEARML_FILESERVER_PORT_8081_TCP_PORT
CLEARML_ELASTIC_MASTER_SERVICE_PORT_TRANSPORT
CLEARML_WEBSERVER_PORT_80_TCP
CLEARML_ELASTIC_MASTER_SERVICE_PORT
CLEARML_MONGODB_PORT_27017_TCP_ADDR
CLEARML_FILESERVER_PORT_8081_TCP_P...
OK. In the pod spawned by the K8s Glue Agent, clearml.conf is the same as the K8s Glue Agent's
when I run it on my laptop...
what I am trying to achieve is not having to worry about this setting, and having all the artifacts and models uploaded to the file server automatically
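As I understand it, the per-task way to do this is the output_uri argument of Task.init (a sketch; with output_uri=True everything goes to the configured files server), but I'd rather have a server/agent-side default so no code change is needed:
` from clearml import Task

# output_uri=True uploads artifacts and models to the default files server
task = Task.init(
    project_name='project-no-git',
    task_name='experiment-1',
    output_uri=True,
) `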
ah I see.. yes it makes sense 🙂
and one more question, in the values, I also see the values for the default tokens:
` credentials:
  apiserver:
    # -- Set for apiserver_key field
    accessKey: "5442F3443MJMORWZA3ZH"
    # -- Set for apiserver_secret field
    secretKey: "BxapIRo9ZINi8x25CRxz8Wdmr2pQjzuWVB4PNASZqCtTyWgWVQ"
  tests:
    # -- Set for tests_user_key field
    accessKey: "ENP39EQM4SLACGD5FXB7"
    # -- Set for tests_user_secret field
    secretKey: "lPcm0imbcBZ8mwgO7tpadutiS3gnJD05x9j7a... `
Indeed, kubectl commands don't work from within the agent pod; I'll try to figure out why
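My current guess is RBAC: the pod's ServiceAccount probably isn't bound to a Role with the needed verbs. A standard Role/RoleBinding sketch of what I'd expect to need (all names here are made up):
` apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: clearml-agent-role      # hypothetical
  namespace: clearml
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: clearml-agent-binding   # hypothetical
  namespace: clearml
subjects:
  - kind: ServiceAccount
    name: clearml-agent-sa      # whatever ServiceAccount the pod runs as
    namespace: clearml
roleRef:
  kind: Role
  name: clearml-agent-role
  apiGroup: rbac.authorization.k8s.io `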
OK, so... when executed locally "train" prints:
` train:
SepalLength SepalWidth PetalLength PetalWidth Species
122 7.7 2.8 6.7 2.0 2.0
86 6.7 3.1 4.7 1.5 1.0
59 5.2 2.7 3.9 1.4 1.0
4 5.0 3.6 1.4 0.2 0.0
77 6.7 3.0 5.0 1.7 1.0
..           ...         ...          ...        ...     ...