When we enqueue the task using the web-ui we have the above error
ShallowGoldfish8 I think I understand the issue,
basically I think the issue is:
task.connect(model_params, 'model_params')
Since this is a nested dict:
model_params = { "loss_function": "Logloss", "eval_metric": "AUC", "class_weights": {0: 1, 1: 60}, "learning_rate": 0.1 }
The class_weights keys are stored as strings, but CatBoost expects int keys, hence it fails.
One op...
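If it helps, here is a minimal workaround sketch (assuming the dict layout above; the int-cast after connect() is just an illustration, not the only fix): cast the class_weights keys back to int before handing the params to CatBoost.
```
from clearml import Task

task = Task.init(project_name="examples", task_name="catboost class weights")

model_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "class_weights": {0: 1, 1: 60},
    "learning_rate": 0.1,
}

# connect() serializes nested dict keys as strings ("0", "1"),
# so when the task runs remotely the keys come back as str
task.connect(model_params, "model_params")

# cast the class_weights keys back to int before passing them to CatBoost
model_params["class_weights"] = {
    int(k): v for k, v in model_params["class_weights"].items()
}
```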
RobustRat47 what's the Triton container you are using ?
BTW, the Triton error is:
model_repository_manager.cc:1152] failed to load 'test_model_pytorch' version 1: Internal: unable to create stream: the provided PTX was compiled with an unsupported toolchain.
https://github.com/triton-inference-server/server/issues/3877
BattyLion34 the closest I can think of is the monitoring class, which can easily be extended.
Datasets are a type of Task, so we can monitor a project and trigger an action when we see a change in number of Tasks/Datasets that are completed.
Monitoring class:
https://github.com/allegroai/clearml/blob/master/clearml/automation/monitor.py
Monitoring example:
https://github.com/allegroai/clearml/blob/master/examples/services/monitoring/slack_alerts.py
I think a dataset monitoring example wil...
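As a rough idea of what such a monitor could do (this is my sketch, not the Monitor class itself; it just polls with Task.get_tasks, and the project name, the "data_processing" type filter and the trigger function are assumptions/placeholders):
```
import time
from clearml import Task

def on_new_datasets(tasks):
    # placeholder action, e.g. launch a retraining pipeline or send a Slack alert
    print("new completed dataset tasks:", [t.id for t in tasks])

seen = set()
while True:
    # Dataset objects are backed by Tasks, so we can query them like any other Task
    tasks = Task.get_tasks(
        project_name="my_dataset_project",
        task_filter={"status": ["completed"], "type": ["data_processing"]},
    )
    new = [t for t in tasks if t.id not in seen]
    if new:
        on_new_datasets(new)
        seen.update(t.id for t in new)
    time.sleep(60)
```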
Verified @<1643060801088524288:profile|HarebrainedOstrich43>, RC will be out soon for you to test. Thank you again for catching it, not sure how the internal tests missed it (btw the pipeline is created, it's just not shown in the right place due to some internal typo)
And you pass:
scheduler.add_task(..., reuse_task=True)
?
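For context, roughly like this (a sketch of the scheduler usage; the task id, queue names and schedule are placeholders):
```
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
# re-enqueue the same task instead of cloning a new one each run
scheduler.add_task(
    schedule_task_id="<base_task_id>",
    queue="default",
    minute=0,
    hour=6,
    recurring=True,
    reuse_task=True,
)
scheduler.start_remotely(queue="services")
```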
@<1546303293918023680:profile|MiniatureRobin9>
, not the pipeline itself. And that's the last part I'm looking for.
Good point, any chance you want to PR this code snippet ?
def add_tags(self, tags):
    # type: (Union[Sequence[str], str]) -> None
    """
    Add Tags to this pipeline. Old tags are not deleted.
    When executing a Pipeline remotely (i.e. launching the pipeline from the UI/enqueuing it), this method has no effect.

    :param tags: A li...
When you install using pip <filename> you should end up with something like:
minerva @ file://... or minerva @ https://...
Instead you can do: TRAINS_WORKER_NAME="trains-agent:$DYNAMIC_INSTANCE_ID"
Then the Worker ID will have the running instance appended to the worker name. This means that as long as each instance has its own $DYNAMIC_INSTANCE_ID, you will not have two agents registering under the same name.
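For illustration only, a small Python sketch of the same idea (setting the env var before spawning the agent; the queue name and the source of the instance id are assumptions):
```
import os
import subprocess

dynamic_instance_id = os.environ.get("DYNAMIC_INSTANCE_ID", "0")

# give every running instance a unique worker name/ID
env = dict(os.environ, TRAINS_WORKER_NAME="trains-agent:{}".format(dynamic_instance_id))

# launch the agent with the per-instance worker name
subprocess.check_call(["trains-agent", "daemon", "--queue", "default"], env=env)
```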
I still don't get resource logging when I run in an agent.
@<1533620191232004096:profile|NuttyLobster9> there should be no difference ... are we still talking about <30 sec? or a sleep test? (no resource logging at all?)
have a separate task that is logging metrics with tensorboard. When running locally, I see the metrics appear in the "scalars" tab in ClearML, but when running in an agent, nothing. Any suggestions on where to look?
This is odd and somewhat consistent with actu...
WickedGoat98 Same for me, let me ask the UI guys, I think this is a UI bug.
Also maybe before you post the article we could release a fix to both, what do you think?
EDIT:
Never mind 🙂 I just saw the medium link, very cool!!!
The driver script (the one that initializes models and starts the training sequence) was not in the git repo; besides that one, everything is.
Yes, there is an issue when you have both a git repo and a totally uncommitted file: since clearml can store either a standalone script or a git repository, the mix of the two is not actually supported. Does that make sense?
CharmingStarfish14 can you check something from code, just to see if this would solve the issue?
Hi @<1653207659978952704:profile|LovelyStork78>
I have a docker container with all the dependencies.
Well I think the main question is: are you using the clearml-agent to launch jobs/experiments? If you do, it makes sense to specify your docker as the "base docker image" (in the UI, look under the Execution tab, Container section).
This means the agent will use the pre-installed environment and will add anything that your Task needs on top of it, this of course includes pushing your codebase i...
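If you prefer setting it from code rather than the UI, something like this should work (a sketch; the image name is a placeholder, and the exact set_base_docker keyword may differ between clearml versions, older ones take a single docker command string):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="train in my container")
# tell the agent which container to run this task in (docker mode)
task.set_base_docker(docker_image="my-registry/my-image:latest")
```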
Hi @<1688721797135994880:profile|ThoughtfulPeacock83>
the configuration vault parameters of a pipeline step with the add_function_step method?
The configuration vault parameters are set per user/project/company and applied at execution time.
What would be the value you need to override ? and what is the use case?
I think that clearml should be able to do parameter sweeps using pipelines in a manner that makes use of parallelisation.
Use the HPO, it is basically doing the same thing with some more sophisticated algorithm (BOHB):
https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
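Roughly what that looks like (a condensed sketch along the lines of the linked example; the base task id, metric names, ranges and queue are placeholders):
```
from clearml.automation import (
    HyperParameterOptimizer, UniformParameterRange, DiscreteParameterRange
)

optimizer = HyperParameterOptimizer(
    base_task_id="<template_task_id>",   # the task to clone per trial
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/batch_size", values=[32, 64, 128]),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    max_number_of_concurrent_tasks=4,
    execution_queue="default",
    total_max_jobs=20,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```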
For example - how would this task-based example be done with pipelines?
Sure, you could do something like:
` from clearml import Pi...
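(For completeness, here is a rough version of what a pipeline-based sweep could look like; this is my sketch, not the truncated snippet above, and the project/task names, values and queues are placeholders:)
```
from clearml import PipelineController

pipe = PipelineController(name="param sweep", project="examples", version="1.0")

# one step per hyperparameter value, each cloning the same template task
for lr in (0.001, 0.01, 0.1):
    pipe.add_step(
        name="train_lr_{}".format(lr),
        base_task_project="examples",
        base_task_name="train template",
        parameter_override={"General/learning_rate": lr},
        execution_queue="default",
    )

pipe.start(queue="services")
```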
can I add user properties to a scheduler configuration?
please expand, what do you mean by user property and how would one use it?
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
No worries I totally feel you.
As a quick hack in the actual code of the Task itself, is it reasonable to have:
task = Task.init(....)
task.set_initial_iteration(0)
ElegantCoyote26 could be, if the Task run is under 30sec?!
GrotesqueDog77 when you say "the second issue" , do you mean the fact that both step 1 and step 2 should have access to the same filesystem?
Hi CooperativeFox72
Sure 🙂
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
So essentially, the server helm chart creates randomly generated secret pair and deploys it as a shared k8 secret that pods can access.
This is the tricky part: for the helm chart to be able to create it, it would have to log in to the server, which means there is a secret embedded in the helm chart that lets you access the default server. You see my point?
Hmm, so what is the difference ?
IrritableJellyfish76 hmm, maybe we should add an extra argument partial_name_matching=False
to maintain backwards compatibility?
Yes the "epoch_loss" is the training epoch loss (as expected I assume).
thought that was just the loss reported at the end of the train epoch via tf
It is, isn't that what you are seeing ?
what do you mean? the same env for all components? If they are using/importing exactly the same packages and using the same container, then yes, it could be.
DeliciousSeal67
are we talking about the agent failing to install the package ?
how did you try to restart them ?
Yes, but how did you restart the agent on the remote machine ?
PungentLouse55 I'm checking something here, you might have stumbled on a bug in parameter overriding. Updating here soon ...