StickyLizard47 apologies for https://github.com/allegroai/clearml-server/issues/140 not being followed up on (it probably slipped through the cracks on the backend side; I can see the 1.5 release happened in parallel). Let me make sure it is followed up.
SarcasticSquirrel56 specifically, did you also spin a clearml-k8s glue? or are the agents statically allocated on the helm chart?
I see, good point. It does look like mostly boilerplate code; I'm not sure where it actually runs the python command, but I'm sure it is there (python.ts, though I could not locate who is actually using it)
Okay, that actually makes sense. Let me check; I think I know what's going on
Hi ColossalAnt7
Following on SuccessfulKoala55 answer
I saw that there is a config file where you can specify specific users and passwords, but it currently requires
- mounting the configuration file (the one holding the user/pass) into the pod from a persistent volume.
I think the k8s way to do this would be to use mounted config maps and secrets.
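A minimal sketch of what that could look like, using a Secret instead of a persistent volume (all names, keys, and the mount path here are placeholder assumptions, not the chart's actual values):

```
apiVersion: v1
kind: Secret
metadata:
  name: clearml-users          # hypothetical secret holding the user/pass config
stringData:
  apiserver.conf: |
    auth.fixed_users.enabled: true
---
# In the apiserver pod spec (sketch):
#   volumes:
#     - name: users-conf
#       secret:
#         secretName: clearml-users
#   volumeMounts:
#     - name: users-conf
#       mountPath: /opt/clearml/config
```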
You can use ConfigMaps to make sure the routing is always correct, then add a load-balancer (a.k.a a fixed IP) for the users a...
We do upload the final model manually.
If this is the case, just name it based on the parameters, no? Am I missing something?
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/model.py#L1229
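A quick sketch of the "name it based on the parameters" idea in plain Python (the parameter names and naming scheme are illustrative):

```python
# Build a deterministic model name from the parameter values, so each
# configuration produces a distinct, recognizable file name.
params = {"lr": 0.01, "batch_size": 32}
model_name = "model_" + "_".join(f"{k}-{v}" for k, v in sorted(params.items()))
print(model_name)  # model_batch_size-32_lr-0.01
```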
I was just wondering if I can make the autologging usable.
It kind of assumes these are different "checkpoints" on the same experiment, and then stores them based on the file name
You can however change the model names later:
` Task.current_task().mo...
For that I need more info, what exactly do you need (or trying to achieve) ?
the unclear part is how do I sample another point in the optimization space from the optimizer
Just so I'm clear on the issue, you want multiple machines to access the internals of the optimizer class ? or Do you just want a way to understand what is the optimizer sampling space (i.e. the parameters and options per parameter) ?
(with older clearml versions though…).
Yes, we added a content-type header for the files when uploading to S3 (so it is easier for users to serve them back). But it seems the Python 3.5 casting from Path to str breaks the mimetype call....
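For context, a minimal sketch of the mimetype lookup the upload path relies on (the file name is illustrative): `mimetypes.guess_type()` expects a string, which is why a `pathlib.Path` has to be cast first.

```python
import mimetypes
from pathlib import Path

# Guess the Content-Type from the file name; the str() cast is the step
# that could break on older Pythons (3.5) when handed a Path object.
content_type, _ = mimetypes.guess_type(str(Path("weights/model.json")))
print(content_type)  # application/json
```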
I think it's because the proxy env vars are not passed to the container ...
Yes this seems correct, the errors point to a network issues, i.e. the container does not seem to be able to connect to the clearml-server
Verified, and already fixed with 1.0.6rc2
Hi CurvedHedgehog15
User aborted: stopping task (3)
?
This means "someone" externally aborted the Task; in your case the HPO aborted it. The sophisticated HyperBand/Bayesian optimization algorithms we use (both Optuna and HpBandster) will early-stop experiments based on their performance, and can resume them later if needed
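To illustrate the early-stopping idea (this is a simplified sketch, not ClearML's, Optuna's, or HpBandster's actual code): abort a trial whose metric at a given step falls below the median of what its peers reached at the same step.

```python
# Median-based early stopping sketch, assuming higher metric is better.
def should_early_stop(trial_curve, peer_curves, step):
    # Metric values other trials reached at the same step
    peers = [curve[step] for curve in peer_curves if len(curve) > step]
    if not peers:
        return False  # nothing to compare against yet
    median = sorted(peers)[len(peers) // 2]
    return trial_curve[step] < median

peers = [[0.2, 0.5, 0.7], [0.3, 0.6, 0.8]]
print(should_early_stop([0.1, 0.2], peers, 1))  # True: 0.2 < median 0.6
```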
FreshReindeer51
Could you provide some logs ?
Martin I told you I can't access the resources in the cluster unfortunately
🙂
so it seems there is some misconfiguration of the k8s glue, because we can see it can "talk" to the clearml-server, but it seems it fails to actually create the k8s pod/job. I would start with debugging the k8s glue (not the services agents). Regardless, I think the next step is to get a log of the k8s glue pod, and better understand the issue.
wdyt?
Hi LooseClams37
From the docker compose, I see the agent is running in venv mode, is that correct?
Also notice that when configuring the minio credentials you can specify if this is an https connection (secure: true) which by default it is not.
See here: https://github.com/allegroai/clearml-agent/blob/5a6caf6399a0128ad81e8723d0a847e2ded5b75e/docs/clearml.conf#L287
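A sketch of the relevant clearml.conf entry (host, key, and secret values are placeholders):

```
sdk {
  aws {
    s3 {
      credentials: [
        {
          host: "minio.example.com:9000"
          key: "ACCESS_KEY"
          secret: "SECRET_KEY"
          secure: true  # use https when connecting (defaults to false)
        }
      ]
    }
  }
}
```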
Assuming you are using docker-compose, the console output is a good start
Could you send the logs?
Hi ColossalAnt7
Try ctrl-F5 and refresh the page?!
It seems you are missing a few buttons 🙂
It's in the docker image; doesn't the git clone command run in the container?
Then this should have worked.
Did you pass in the configuration: force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L25
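The setting lives in the agent section of clearml.conf, for example:

```
agent {
  # force the agent to clone repositories over ssh instead of https
  force_git_ssh_protocol: true
}
```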
Hi @<1695969549783928832:profile|ObedientTurkey46>
How can I connect clearml to a relational database, and have a sql query as a dataset? (e.g. dataset.add_references(query="select * from images where label = '1'")).
hmm interesting, you have a couple of options that I can think of:
- You can have your query and an argument to the Task, which means it is logged and can be changed later from the UI when you are relaunching it.
- You can have the query as an argument for a preprocessin...
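The first option can be sketched with plain argparse (ClearML auto-logs argparse arguments, which is what makes the value editable in the UI on relaunch); the `--query` argument and the SQL string are illustrative:

```python
import argparse

# Expose the SQL query as a command-line argument so it is logged with the
# Task and can be changed from the UI when relaunching.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--query", default="select * from images where label = '1'"
)
args = parser.parse_args([])  # empty list: fall back to the default query
print(args.query)
```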
I hope you can do this without containers.
I think you should be fine, the only caveat is CUDA drivers, nothing we can do about that ...
Hi @<1571308003204796416:profile|HollowPeacock58>
parameters = task.connect(config, name='config_params')
It seems that your DotDict does not support the python copy operator?
i.e.
from copy import copy
copy(DotDict())
fails ?
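For comparison, a minimal dot-access dict that does survive the copy operator, simply by subclassing dict (this is an illustrative sketch, not the user's actual DotDict):

```python
from copy import copy

class DotDict(dict):
    """Dict with attribute access; copy() works via the dict machinery."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

d = DotDict(a=1)
d2 = copy(d)
print(type(d2).__name__, d2.a)  # DotDict 1
```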
Great ascii tree 🙂
GrittyKangaroo27 assuming you are doing:
@PipelineDecorator.component(..., repo='.')
def my_component(): ...
The function my_component will be running in the repository root, so in theory it could access the packages
(I'm assuming here directory "project" is the repository root)
Does that make sense ?
BTW: when you pass repo='.' to @PipelineDecorator.component it takes the current repository that exists on the local machine running the pipel...
Hmm, makes sense. Then I would call export_task once (kind of the easiest way to get the entire Task object description pre-filled for you); with that, you can create as many as needed by calling import_task.
Would that help?
and the step is "queued" or is it "queued" in the pipeline state (i.e. the visualization did not update) ?
MassiveHippopotamus56
the "iteration" entry is actually the "max reported iteration over all graphs" per graph there is different max iteration. Make sense ?
Or maybe do you plan to solve this problem in the very short term?
Yes we will 🙂
BoredHedgehog47
is this ( https://clearml.slack.com/archives/CTK20V944/p1665426268897429?thread_ts=1665422655.799449&cid=CTK20V944 ) the same issue (or solution) ?
ERROR: Could not install packages due to an EnvironmentError:
[Errno 28] No space left on device
BTW: @<1523703080200179712:profile|NastySeahorse61> this sounds like docker running out of space on the main disk (`/var/`) where it stores all the images and temp file systems
This will cause your code to fail, as any runtime change to the container file system will raise this out-of-disk-space error