That wasn't scheduled by ClearML).
This means that from Clearml perspective they are "manual" i.e the job it self (by calling Task.init) create the experiment in the system, and fills in all the fields.
But for a k8s job, I'm still unsuccessful.
HelpfulDeer76 When you say "unsuccessful" what exactly do you mean ?
Could it be they are reported to the clearml demo server (the default server if no configuration is found) ?
ChubbyLouse32 and this works when running python code and not when the agent is running ?
On the same machine ?
Hi NonsensicalSeaanemone47
I'm assuming you mean k8s as compute cluster?
If so, then yes clearml adds priority scheduling on top of your existing kl8s cluster. It also allows you to reuse images as the k8s spins the base container image and then inside the container image the agent sets the environment of the experiment (clones code, apply diff, install missing python packages etc.)
It also gives visibility into the executed pods.
Make sense ?
Hi CynicalBee90
Always great to have people joining the conversation, especially if they are the decision makers a.k.a can amend mistakes š
If I can summarize a few points here (and feel free to fill in / edit any mistake or leftovers)
Open-Source license: This is basically the mongodb license, which is as open as possible with the ability to, at the end, offer some protection against Amazon giants stealing APIs (like they did for both mongodb and elastic search) Platform & language agno...
JitteryCoyote63 Hmmm in theory, yes.
In practice you need to change this line:
https://github.com/allegroai/clearml/blob/fbbae0b8bc933fbbb9811faeabb9b6d9a0ea8d97/clearml/automation/aws_auto_scaler.py#L78
` python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 0 --detached
python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 1 --detached
python -m clearml_agent --config-file '/root/clearml.conf' d...
Notice you should be able to override them in the UI (under Args seciton)
Hi JumpyDragonfly13
- is "10.19.20.15" accessible from your machine (i.e. can you ping to it)?
- Can you manually SSH to 10.19.20.15 on port 10022 ?
but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
How so?
Hi StickyWhale51
I think this issue is due to some internal race condition, anyhow I think we have an RC out solving it, can you try with:pip install clearml==1.2.0rc2
Try:task.update_requirements('\n'.join([".", ]))Ā
DeterminedToad86 were you running a jupyter notebook or a jupyter console ?
And how did you connect your example,yaml?
Yes, the mechanisms under the hood are quite complex, the automagic does not come for "free" š
Anyhow, your perspective is understood. And as you mentioned I think your use case might be a bit less common. Nonetheless we will try to come-up with a solution (probably an argument for Task.init so you could specify a few more options for the auto package detection)
Hi ShinyPuppy47 ,
Yes that is correct. Use Task.init for automagic logging
So obviously the straight forward solution is to report normalize the step value when reporting to TB, i.e. int(step/batch_size). Which makes sense as I suppose the batch size is known and is part of the hyper-parameters. Normalization itself can be done when comparing experiments in the UI, and in the backend can do that, if given the correct normalization parameter. I think this feature request should actually be posted on GitHub, as it is not as simple as one might think (the UI needs to a...
Let me check what's the subsampling threshold
Hi ThankfulOwl72 checkout TrainsJob
object. It should essentially do what you need:
https://github.com/allegroai/trains/blob/master/trains/automation/job.py#L14
which is probably why it does not work for me, right?
Correct, you need to pass the entire configuration (it is stored as a blob, as opposed to the hyperparameters that are stored as individual values)
` :param configuration_overrides: Optional, override Task configuration objects.
Expected dictionary of configuration object name and configuration object content.
Examples:
{'General': dict(key='value')}
{'General': 'config...
Nooooooooooooooooooooooo
BTW updating the values in grafana is basically configuration of the heatmap graph, so it is fairly easy to do, just not.automatic
AbruptHedgehog21 looking at the error, seems like you are out of storage š
How is this different from argparser btw?
Not different, just a dedicated section š Maybe we should do that automatically, the only "downside" is you will have to name the Dataset when getting it (so it will have an entry name in the Dataset section), wdyt ?
Ohh, clearml is designed so that you should not worry about that, download_dataset = StorageManger.get_local_copy()
this is cashed, meaning the machine that runs that like the second time will not re download the path.
This means step 1 is redundant, no?
Usually when data is passed between components it is automatically uploaded as artifact to the Task (stored on the files server or object storage etc.) then downloaded and passed to the next steps.
How large is the data that you are wo...