(fyi: once we have a solid idea here, please open a github issue on the feature request, I'll try to see if we can push it fwd for the next RC)
Hi CluelessElephant89
Hi guys, if I spot issues with the documentation, where should I post them?
The best way from our perspective is to PR the fix, this is why we put it on GitHub
It runs into the above error when I clone the task or reset it.
from here:
AssertionError: ERROR: --resume checkpoint does not exist
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change? In other words, why would it be looking for the file only when cloning? Could it be you put the state on the Task, then you clone it (i.e. clone the exact same dict, and now the newly cloned Task "thinks" it is resuming)?
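To make the guess above concrete, here is a minimal sketch, assuming the training script keeps a resume path in its config dict and that a cloned Task inherits that same dict. The `resolve_resume` helper and the "resume" key are hypothetical names for illustration, not taken from the YOLOv5 code or the ClearML API.

```python
import os


def resolve_resume(config):
    """Return the checkpoint path to resume from, or None to start fresh.

    `config` is a plain dict; the "resume" key is a hypothetical name.
    """
    path = config.get("resume")
    # A cloned Task carries over the same dict, so the stored path may point
    # to a file that only existed on the machine of the original run.
    if path and os.path.isfile(path):
        return path
    return None
```

With a guard like this, a cloned run whose stored path no longer exists would start from scratch instead of failing on the assertion.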
Verified, and already fixed with 1.0.6rc2
My bad I wrote refresh and then edited it to the correct "reload"
I'd prefer to use config_dict, I think it's cleaner
I'm definitely with you
Good news:
new best_model is saved, add a tag best,
Already supported, (you just can't see the tag, but it is there :))
My question is, what do you think would be the easiest interface to tell (post/pre) store, tag/mark this model as best so far (btw, obviously if we know it's not good, why do we bother to store it in the first place...)
Just making sure, pip package installed on your Conda env, correct?
however when I clone or reset said task after completion and then enqueue it again, I get the above error.
This part is somewhat confusing... There is no magic happening behind the scenes, cloning a Task and creating it, is basically the same ... Do you have a reference to the YOLOv5 code base itself, maybe I can figure out what's the issue?
Hey IntriguedRat44 ,
Is this what you are after?
https://github.com/allegroai/trains/issues/181
That didn't give useful info, the issue was that docker was not installed on the agent machine x)
JitteryCoyote63 you mean "docker" was not installed and it did not throw an error ?
So essentially, the server helm chart creates a randomly generated secret pair and deploys it as a shared k8s secret that pods can access.
This is the tricky part: for the helm chart to be able to create it, it has to be able to log in to the server, which means there is a secret embedded in the helm chart that lets you access the default server. You see my point?
From code ? or the CLI ?
In both cases the dataset needs to upload the parent version somewhere, azure blob supported.
basically @<1554638166823014400:profile|ExuberantBat24> you can think of hyper-datasets as a "feature-store for unstructured data"
I want to optimize hyperparameters with trains.automation but: ...
Yes, you are correct. In case of the example code, it should be "General/...", and if you have ArgParser, it should be "Args/...". Yes, it looks like the metric is wrong, it should be "epoch_accuracy" & "epoch_accuracy"
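A small sketch of the naming rule above, in plain Python. The helper `section_name` and the parameter name "batch_size" are made up for illustration; only the "General/" and "Args/" prefixes come from the explanation above.

```python
def section_name(param, uses_argparse):
    """Prefix a hyperparameter name with the section it is stored under."""
    # "Args" when the script parses its parameters with argparse,
    # "General" otherwise (as in the example code discussed above).
    section = "Args" if uses_argparse else "General"
    return f"{section}/{param}"


print(section_name("batch_size", uses_argparse=False))
print(section_name("batch_size", uses_argparse=True))
```

The resulting string is the name the optimizer would need to reference for that parameter.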
Hi @<1645597514990096384:profile|GrievingFish90>
You mean the agent itself inside a docker then the agent spins sibling dockers for the Tasks ?
Hi @<1610808279263350784:profile|FriendlyShrimp96>
Is there a way to get a list of variants given a metric, or even just a full list of metrics and variants for a given task id?
Try this
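from clearml.backend_api.session.client import APIClient

# Client for querying the ClearML backend API directly
c = APIClient()
# Ask which metrics have events of the given type for this task
metrics = c.events.get_task_metrics(tasks=["TASK_ID_HERE"], event_type="training_debug_image")
print(metrics)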
I think API ...
WackyRabbit7 I guess we are discussing this one on a diff thread, but yes, should totally work, that's the idea
That wasn't scheduled by ClearML).
This means that from ClearML's perspective they are "manual", i.e. the job itself (by calling Task.init) creates the experiment in the system, and fills in all the fields.
But for a k8s job, I'm still unsuccessful.
HelpfulDeer76 When you say "unsuccessful" what exactly do you mean ?
Could it be they are reported to the clearml demo server (the default server if no configuration is found) ?
Hi @<1569858449813016576:profile|JumpyRaven4> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
EnviousPanda91 this seems like a specific issue with the clearml-task cli, could that be ?
Can you send a full clearml-task command-line to test ?
ShinyPuppy47 the code that is being launched, does it call task.init?
Hi AstonishingWorm64
Is this the same ?
https://github.com/allegroai/clearml-serving/issues/1
(I think it was fixed on the later branch, we are releasing 0.3.2 later today with a fix)
Can you try: pip install git+
I have to admit, I'm not sure...
Let me talk to backend guys, in theory you are correct the "initial secret" can be injected via the helm env var, but I'm not sure how that would work in this specific case
EmbarrassedSpider34
Sync_folder and upload
Several times along the code and then
Do notice they overwrite one another...
Hi DeliciousKoala34
This means the pycharm plugin was not able to run git on your local machine.
What's your OS ?
could it be that if you open cmd / shell "git" is not in the path ?
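One quick way to check this from Python, as a rough sketch: `shutil.which` is from the standard library and looks up an executable on the PATH of the current process. Note that the PATH PyCharm sees may differ from the one in your shell, so this only tells you about the environment it is run from. The `find_git` wrapper is just an illustrative name.

```python
import shutil


def find_git():
    """Return the full path of the git executable, or None if it is not on PATH."""
    return shutil.which("git")


git_path = find_git()
if git_path is None:
    print("git was not found in PATH")
else:
    print(f"git found at: {git_path}")
```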