Hi CooperativeFox72 trains 0.16 is out, did it solve this issue? (btw: you can upgrade trains to 0.16 without upgrading the trains-server)
How do I set this configuration, and where does it go?
In your clearml.conf on the machine with the agent, just add the following at the bottom of the file: agent.venvs_cache.path=~/.clearml/venvs-cache
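For reference, the same setting in the nested section style used elsewhere in clearml.conf would look something like this (the path is just the example value from above):
agent {
    venvs_cache {
        # folder where cached virtual environments are stored and reused between runs
        path: ~/.clearml/venvs-cache
    }
}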
SarcasticSquirrel56 when the process dies (i.e. is killed) it does not have time to update the state, so the server watchdog will set the state to aborted after X amount of time of inactivity (default is 2 hours)
Hmm, what's the clearml version? What's the python version, what's the OS? And the pytorch version?
Does it work if you remove the Task.init call?
Hi Martin, of course not,
Smart!
I was just wondering if it has been patched yet, and if not, what the expected timeline for patching it is
Yes, I believe the target is a patch version 1.15.1 to be released in a couple of weeks. This is not a major issue, but it's always better to have it fixed. (btw: the enterprise version never had this issue to begin with, because it is of course authenticated, and it also has an additional RBAC layer on top.)
Hi PompousParrot44
Could you send the "Installed Packages" list?
I think there is a bug in the current trains-agent (there is already a fix, but the RC is still not out),
where "package @ git+http" packages ignore the git+http link.
You can solve it manually by editing the "Installed packages" (when the Task is in draft mode, the section becomes editable): remove the "package @" part and leave the "git+http" link.
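For example (the package name and git URL here are just placeholders), a line like
some_package @ git+https://github.com/example-org/some_package.git
would become just
git+https://github.com/example-org/some_package.git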
Not sure: They also have the feature store (data management), as mentioned, which is pretty MLOps-y
Right, sorry, I was thinking about "Nuclio", my bad.
How would you compare those to ClearML?
At least based on the documentation and git state I would say this is very early stages. In terms of features they "tick all the boxes", but I'll be a bit skeptical about the ability to scale and support these features.
Taking a look at the screenshots from the docs, it also seem...
CooperativeFox72 of course, anything trains related, this is the place 🙂
Fire away
Sounds great! I really like that approach, thanks GrotesqueDog77 !
Hi ItchyHippopotamus18
The iteration reporting is automatically detected if you are using tensorboard, matplotlib, or explicitly with trains.Logger
I'm assuming there were no reports, so the monitoring falls back to reporting every 30 seconds, where the "iterations" are seconds from start (the thing is, this is a time series, so you have to have an X axis...)
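If you want to control the X axis yourself, a minimal sketch of explicit reporting with trains.Logger (project/task names and the loss value are just placeholders) would be:
from trains import Task, Logger

task = Task.init(project_name='examples', task_name='explicit iteration reporting')
for iteration in range(100):
    loss = 1.0 / (iteration + 1)  # dummy value, just for illustration
    # reporting with an explicit iteration gives the scalar a proper X axis
    Logger.current_logger().report_scalar(title='loss', series='train', value=loss, iteration=iteration)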
Make sense ?
Can you post the actual line here? Seems like we can fix it to also support this scenario (if we could test it)
Hi @<1558986821491232768:profile|FunnyAlligator17>
What do you mean by "We are able to set_initial_iteration to 0 but not get_last_iteration"?
Are you saying that if your code looks like:
Task.set_initial_iteration(0)
task = Task.init(...)
and you abort and re-enqueue, you still have a gap in the scalars ?
[Assuming the above is what you are seeing]
What I "think" is happening is that the Pipeline creates it's own Task. When the pipeline completes, it closes it's own Task, basically making any later calls to Tasl.current_task() return None, because there is no active Task. I think this is the reason that when you are calling process_results(...) you end up with None.
For a quick fix, you can do:
pipeline = Pipeline(...)
MedianPredictionCollector.process_results(pipeline._task)
Maybe we should...
Okay Now I get it!
Let me think about it for an hour or two 😄
Thank you ElegantCoyote26 for catching that! 😍
No worries, glad to hear you found it 😄
Let me verify something in the code,
What's the "working dir" ? (where in the repo the script is executed from)
Now I suspect what happened is it stayed on another node, and your k8s never took care of that
In that case, I think it is stuck on a previous Node, I can't think of any other reason.
Do you have something else on the same PV that was lost ? like api server configuration?
When we enqueue the task using the web-ui we have the above error
ShallowGoldfish8 I think I understand the issue,
basically I think the issue is:
task.connect(model_params, 'model_params')
Since this is a nested dict:
model_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "class_weights": {0: 1, 1: 60},
    "learning_rate": 0.1
}
The class_weights keys are stored as String keys, but catboost expects int keys, hence it fails.
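A possible workaround (just a sketch, assuming the model_params dict above; project/task names are placeholders) is to cast the keys back to int after connect and before handing the dict to catboost:
from clearml import Task  # `from trains import Task` on older versions

task = Task.init(project_name='examples', task_name='catboost training')

model_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "class_weights": {0: 1, 1: 60},
    "learning_rate": 0.1,
}
task.connect(model_params, 'model_params')
# when running remotely, the nested class_weights keys may come back as strings,
# so cast them back to int before passing the dict to catboost
model_params['class_weights'] = {int(k): v for k, v in model_params['class_weights'].items()}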
One op...
@<1671689437261598720:profile|FranticWhale40> this one: None
Meaning the node restarted (or actually moved)
That somehow the PV never worked and it was all local inside the pod
The second problem that I am running into now is that one of the dependencies in the package is actually hosted in a private repo.
Add your private repo to the extra index section in the clearml.conf:
None
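For example (the URL below is just a placeholder for your private index), the relevant section in clearml.conf would look roughly like:
agent {
    package_manager {
        # extra pip index URLs to search in addition to the default PyPI index
        extra_index_url: ["https://my-private-pypi.example.com/simple"]
    }
}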