is no agent listening to the "k8s_scheduler"
There should not be one. This queue is purely "virtual", so that users understand the k8s cluster is spinning up their pod (sometimes it takes time, imagine EKS etc.; it is just for visibility)
unfortunately I can't get info from the cluster
You should be able to see the pod in the cluster, no?!
What does the Task Info panel say? Can you share a screenshot?
Hi ColossalAnt7, I think we ran into it on a few dockers, and I believe the bug was fixed in the latest trains-agent RC. Could you verify please?
I do it to get project name
you can still get it from the task object (even after closing it)
another place I was using was to see if i am in a pipeline task
Yes, that makes sense, this is one of the use cases (getting access to the Task that is currently running). The bug itself will only happen after closing the Task (it needs to clear an OS environment variable).
You can either upgrade to 1.0.6rc2 or hack/fix it with:
` os.environ.pop('CLEARML_PROC_MASTER_ID', None)
os.envi...
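The workaround above relies on `os.environ.pop(name, None)`, which removes a variable without raising if it is already gone. A minimal sketch of that pattern follows; the helper name `clear_env_vars` and the placeholder value are hypothetical, and only `CLEARML_PROC_MASTER_ID` is taken from the message above (the rest of the original snippet is truncated and is not reconstructed here).

```python
import os


def clear_env_vars(names):
    """Remove the given environment variables; missing ones are ignored.

    Hypothetical helper, not part of clearml. Returns the names actually removed.
    """
    removed = []
    for name in names:
        # pop() with a default never raises KeyError, so this is safe to repeat.
        if os.environ.pop(name, None) is not None:
            removed.append(name)
    return removed


# Simulate a leftover variable (placeholder value), then clear it.
os.environ["CLEARML_PROC_MASTER_ID"] = "placeholder"
removed = clear_env_vars(["CLEARML_PROC_MASTER_ID"])
```

Calling the helper a second time is harmless and simply returns an empty list.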
okay but still I want to take only a row of each artifact
What do you mean?
How do I get from the node to the task object?
pipeline_task = Task.get_task(task_id=Task.current_task().parent)
The way I understand it is that K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent
SarcasticSquirrel56
Good start. When you say you see the Task in the "k8s_scheduler" queue, did you originally enqueue it to "default"?
StickyLizard47 apologies for https://github.com/allegroai/clearml-server/issues/140 not being followed up (it probably slipped through the cracks with the backend guys; I can see the 1.5 release happened in parallel). Let me make sure it is followed up.
SarcasticSquirrel56 specifically, did you also spin up a clearml k8s glue, or are the agents statically allocated in the helm chart?
I see, good point. It does look like mostly boilerplate code. I'm not sure where it actually runs the python command, but I'm sure it is there (python.ts, but I could not locate who is actually using it)
Okay, that actually makes sense. Let me check, I think I know what's going on
Hi ColossalAnt7
Following up on SuccessfulKoala55's answer
I saw that there is a config file where you can specify specific users and passwords, but it currently requires
- mount the configuration file (the one holding the user/pass) into the pod from a persistent volume.
I think the k8s way to do this would be to use mounted config maps and secrets.
You can use ConfigMaps to make sure the routing is always correct, then add a load-balancer (a.k.a. a fixed IP) for the users a...
We do upload the final model manually.
If this is the case, just name it based on the parameters, no? Am I missing something?
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/model.py#L1229
I was just wondering if i can make the autologging usable.
It kind of assumes these are different "checkpoints" on the same experiment, and then stores them based on the file name
You can however change the model names later:
` Task.current_task().mo...
For that I need more info: what exactly do you need (or what are you trying to achieve)?
the unclear part is how do I sample another point in the optimization space from the optimizer
Just so I'm clear on the issue: do you want multiple machines to access the internals of the optimizer class? Or do you just want a way to understand the optimizer's sampling space (i.e. the parameters and the options per parameter)?
(with older clearml versions though…).
Yes, we added a content type header for the files when uploading to S3 (so it is easier for users to serve them back). But it seems that on Python 3.5 the cast from Path to str breaks the mimetype call....
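To illustrate the kind of call involved: the standard library's `mimetypes.guess_type` only gained support for path-like objects in Python 3.8, so on older interpreters the path has to be a plain `str`. The sketch below is not the clearml upload code; the function name `guess_content_type` and the fallback value are assumptions made for illustration.

```python
import mimetypes
from pathlib import Path


def guess_content_type(path, default="application/octet-stream"):
    """Guess a Content-Type header value for a local file path.

    Hypothetical helper. The path is converted to str explicitly, so the
    call behaves the same whether a str or a pathlib.Path is passed in.
    """
    content_type, _encoding = mimetypes.guess_type(str(path))
    # guess_type returns None for unknown extensions; fall back to a generic type.
    return content_type or default


html_type = guess_content_type(Path("report.html"))
unknown_type = guess_content_type(Path("weights.unknownext"))
```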
think it's because the proxy env var are not passed to the container ...
Yes, this seems correct. The errors point to a network issue, i.e. the container does not seem to be able to connect to the clearml-server
Verified, and already fixed with 1.0.6rc2
Hi CurvedHedgehog15
User aborted: stopping task (3)
?
This means "someone" externally aborted the Task. In your case the HPO aborted it: the sophisticated HyperBand Bayesian optimization algorithms we use (both Optuna and HpBandster) will early-stop experiments based on their performance and continue them later if needed
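As a rough illustration of why an optimizer aborts experiments on its own, here is a minimal successive-halving sketch, the elimination scheme that HyperBand-style methods build on. It is a simplified toy, not the Optuna or HpBandSter implementation, and it does not resume aborted runs; the function name, the `evaluate` callback and the toy objective are all hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round.

    evaluate(config, budget) must return a loss (lower is better).
    Returns the winning configuration and the list of aborted ones.
    """
    budget = min_budget
    survivors = list(configs)
    aborted = []
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget.
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        keep = max(1, len(ranked) // eta)
        # The weaker configurations are stopped early ("aborted").
        aborted.extend(ranked[keep:])
        survivors = ranked[:keep]
        # Survivors get a larger budget in the next round.
        budget *= eta
    return survivors[0], aborted


# Toy objective: the loss shrinks with budget and is lowest at lr = 0.3.
best, aborted = successive_halving(
    [0.1, 0.5, 0.9, 0.3],
    evaluate=lambda lr, budget: abs(lr - 0.3) / budget,
)
```

In this toy run three of the four configurations end up in `aborted`, which mirrors the "User aborted: stopping task" message that an early-stopped experiment would show.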
FreshReindeer51
Could you provide some logs ?
Martin I told you I can't access the resources in the cluster unfortunately
so it seems there is some misconfiguration of the k8s glue, because we can see it can "talk" to the clearml-server, but it seems it fails to actually create the k8s pod/job. I would start with debugging the k8s glue (not the services agents). Regardless, I think the next step is to get a log of the k8s glue pod, and better understand the issue.
wdyt?
Hi LooseClams37
From the docker compose, I see the agent is running in venv mode, is that correct?
Also note that when configuring the minio credentials you can specify whether this is an https connection (secure: true); by default it is not.
See here: https://github.com/allegroai/clearml-agent/blob/5a6caf6399a0128ad81e8723d0a847e2ded5b75e/docs/clearml.conf#L287
Assuming you are using docker-compose, the console output is a good start
Could you send the logs?
Hi ColossalAnt7
Try ctrl-F5 and refresh the page?!
It seems you are missing a few buttons
it's in the docker image, doesn't the git clone command run in the container
Then this should have worked.
Did you set this in the configuration: force_git_ssh_protocol: true ?
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L25
Hi ObedientTurkey46
How can I connect clearml to a relational database, and have sql query as a dataset? (e.g. dataset.add_references(query = "select * from images where label = '1'")).
hmm interesting, you have a couple of options that I can think of:
- You can have your query as an argument to the Task, which means it is logged and can be changed later from the UI when you are relaunching it.
- You can have the query as an argument for a preprocessin...
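The first option above can be sketched as follows: the query lives in a plain parameters dictionary, and the code only ever reads the query from that dictionary. To keep the example runnable without a ClearML server, the ClearML call is mentioned only in a comment, and an in-memory SQLite database with made-up rows stands in for the real relational database; the table, column names and helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical parameters. With ClearML one would register this dict with
# task.connect(params), so the query is logged and can be edited from the UI.
params = {"query": "SELECT path FROM images WHERE label = '1' ORDER BY path"}


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE images (path TEXT, label TEXT)")
    conn.executemany(
        "INSERT INTO images VALUES (?, ?)",
        [("a.jpg", "1"), ("b.jpg", "0"), ("c.jpg", "1")],
    )
    conn.commit()
    return conn


def run_query(conn, query):
    """Run the (possibly UI-overridden) query and return the first column."""
    return [row[0] for row in conn.execute(query)]


conn = build_demo_db()
selected = run_query(conn, params["query"])
conn.close()
```

Because the query string is the only thing that changes between runs, relaunching with a different query would select a different subset without touching the code.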
I hope you can do this without containers.
I think you should be fine, the only caveat is CUDA drivers, nothing we can do about that ...
Hi HollowPeacock58
parameters = task.connect(config, name='config_params')
It seems that your DotDict does not support the python copy operator?
i.e.
from copy import copy
copy(DotDict())
fails ?
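For reference, a dot-access dictionary can be made copy-friendly with very little code. The class below is a hypothetical stand-in, since the actual DotDict from the question is not shown. The key assumption is that the failure comes from attribute lookup: `__getattr__` should raise `AttributeError` (not `KeyError`) for missing names, because the `copy` module probes objects for special methods that way.

```python
from copy import copy, deepcopy


class DotDict(dict):
    """Hypothetical dict with attribute-style access that supports copy()."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            # Missing attributes must raise AttributeError so that
            # getattr(obj, "__deepcopy__", None) style probes work.
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value

    def __copy__(self):
        # Shallow copy: same values, new container of the same type.
        return type(self)(self)

    def __deepcopy__(self, memo):
        # Deep copy: recursively copy every value.
        return type(self)((k, deepcopy(v, memo)) for k, v in self.items())


config = DotDict(lr=0.1, model={"layers": 3})
shallow = copy(config)
deep = deepcopy(config)
deep.model["layers"] = 5  # must not leak back into the original
```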
Great ascii tree
GrittyKangaroo27 assuming you are doing:
@PipelineDecorator.component(..., repo='.')
def my_component():
    ...
The function my_component will be running in the repository root, so in theory it could access the packages 1/2
(I'm assuming here directory "project" is the repository root)
Does that make sense ?
BTW: when you pass repo='.' to @PipelineDecorator.component it takes the current repository that exists on the local machine running the pipel...