NastySeahorse61 it might be that the frequency at which it tests the metric storage is only once a day (or maybe half a day), let me see if I can ask around
(just making sure you can still log in to the platform?)
JitteryCoyote63 Should be quite safe, there is no major change that I'm aware of on the ClearML side that could affect it.
That said, wait until after the weekend, we are releasing a new ClearML package. I remember there was something with the model logging; it might not be directly related to ignite, but it's worth testing on the latest version.
so what should the value of "upload_uri" be set to, e.g. the fileserver_url?
yes, that would work.
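A minimal sketch of setting this from code, assuming you use Task.init's output_uri parameter; the URL is an example, replace it with your own files server address:

```python
from clearml import Task

# a minimal sketch: point artifact/model uploads at the files server.
# the URL below is an example; replace it with your own fileserver_url
task = Task.init(
    project_name="examples",
    task_name="upload destination demo",
    output_uri="http://localhost:8081",
)
```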
Apparently the error comes when I try to access the pipeline component load_model from get_model_and_features. It works if load_model is not set as a pipeline component and only as a helper function (provided it is declared before the component that calls it; I already understood that and fixed it, different from the code I sent above).
ShallowGoldfish8 so now I'm a bit confused, are you saying that now it works as expected?
I'm thinking it's generally a kernel gateway issue, but I'm not sure if other platforms are using that yet
The odd thing is that you can access the notebook, but it returns zero kernels...
Hi BewilderedDove91
It's all about the databases under the hood, so 8 GB is really a must
Do you mean it recently became part of the enterprise version?
I do not think so, but it seems the support for the open-source version is more like a PoC
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
Is there a way to document these non-standard entry points?
BattyCrocodile47 you should see the "run" in the Args section under Configuration
in case of HF you should see "-m huggingface" and then the rest in the Args section
(if this does not work, then I assume this is a bug 🙂 )
The idea is of course that you can always enqueue and reproduce, so if that part is broken we should fix it 😊
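If it helps, a minimal sketch of that clone-and-enqueue flow from Python; the task ID and queue name are placeholders:

```python
from clearml import Task

# a minimal sketch: clone an existing task and enqueue it for an agent.
# the task ID and queue name are placeholders
source = Task.get_task(task_id="<task_id>")
cloned = Task.clone(source_task=source, name="re-run of original")
Task.enqueue(cloned, queue_name="default")
```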
Hi Team, can I clone an experiment shared by someone via a link?
You mean someone that is not in your workspace? (I'm assuming app.clear.ml?)
orchestration module
When you previously mentioned cloning the Task in the UI and then running it, how do you actually run it?
Regarding the exception stack:
It's pointing to a stdout that was closed?! How could that be? Any chance you can provide a toy example for us to debug?
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
It's building a GPU-enabled docker image...
you might want a different container, or to specify --cpu-only
AntsyElk37
and when I try to use --output-uri I can't pass true, because obviously I can't pass a boolean, only strings
hmm, that sounds right, I think we should fix that so when using --output-uri true the value that is passed is actually True, not the string "true".
Regarding the issue itself:
Are you saying --skip-task-init is being ignored, and it always adds the Task.init call? You can also pass --output-uri https://files.clear.ml (which is the same as True), ...
We are planning an RC later this week, I'll make sure this fix is part of it
Hi StickyWhale51
I think this issue is due to some internal race condition; anyhow I think we have an RC out solving it, can you try with: pip install clearml==1.2.0rc2
Can you run the entire thing on your own machine (just making sure it doesn't give this odd error) ?
If a Task is in the 'Completed' state, I think the only option is to 'Reset' it (see image).
In the UI yes, in code you can do task.mark_aborted(force=True)
You do clear the previous run execution but I think for a repetitive task this is fine.
I would avoid that, no?
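For completeness, a minimal sketch of the in-code route mentioned above; the task ID is a placeholder:

```python
from clearml import Task

# a minimal sketch: force a 'Completed' task back to 'Aborted' from code.
# "<task_id>" is a placeholder for the actual task ID
task = Task.get_task(task_id="<task_id>")
task.mark_aborted(force=True)
```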
replace it with: git+
No need for the repository name; this will ensure you always reinstall it (again, a pip feature)
Try: task.flush(wait_for_uploads=True)
Should do the trick 🙂
Are you running the agent in docker mode or venv mode?
Also, in the same open docker session, can you try: $LOCAL_PYTHON -m clearml_agent execute --disable-monitoring --id <task_id_here>
where the Task ID is one of the failed executions (just reset it beforehand)
Hi SmallDeer34
Generally, any torch.save(...) call is logged/uploaded by ClearML automatically. Specifically in your case I think the only missing one is trainer_state.json, which I assume is a general JSON file that is part of the huggingface framework. You can easily upload it as an additional artifact with Task.upload_artifact, wdyt?
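A minimal sketch of that upload, assuming a Task is already initialized; the file path is an assumption based on the default huggingface output directory:

```python
from clearml import Task

# a minimal sketch: upload the HF trainer state file as an extra artifact.
# assumes Task.init was already called; the path is an assumption based
# on the default huggingface output directory
task = Task.current_task()
task.upload_artifact(
    name="trainer_state",
    artifact_object="output/trainer_state.json",
)
```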
So in theory you can clone yourself 2 extra times and push into an execution queue, but the issue might be actually making sure the resources are available. what did you have in mind?
The agent is installing the "Installed Packages" section of the Task (think of it as a requirements.txt)
And again, what do you have there? Is it the outcome of the Task.init auto populating it?
Hmm, maybe the right way to do this is to (ab)use "models", which are full entities: you can specify a system_tag on them, they can store a folder (and extract it if you need), they live in projects, and they can be cloned and changed.
wdyt?
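For example, a minimal sketch of storing a folder through a model; the project, task, and model names and the folder path are all placeholders:

```python
from clearml import Task, OutputModel

# a minimal sketch: store an entire folder as a model "weights package".
# the project/task/model names and folder path are placeholders
task = Task.init(project_name="examples", task_name="folder-as-model")
model = OutputModel(task=task, name="my_folder_store")
model.update_weights_package(weights_path="./my_folder")
```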
3.a
Regarding the model query: sure, from Python or the REST API you can query based on any metadata
https://clear.ml/docs/latest/docs/references/sdk/model_model/#modelquery_modelsmodels
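A minimal sketch using that SDK call; the project name and tag are placeholders, and metadata-based filtering depends on your clearml version:

```python
from clearml import Model

# a minimal sketch: query registered models by project and tags.
# the project name and tag are placeholders
models = Model.query_models(project_name="examples", tags=["production"])
for m in models:
    print(m.id, m.name)
```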
3.b
If you are using clearml-serving then check the docs / readme, but in a nutshell yes you can.
If the inference code is batch processing, which means a Task, then of course you can launch it; check the clearml agent f...
btw, I looked deeper into the log:
File "/tmp/tmpfa8ifmka.py", line 80, in <module>
model.train(data='coco128.yaml',epochs=20)
I'm assuming this all starts here. I think the pipeline is not running the code from the same folder, and you are just missing the 'coco128.yaml'; try passing a full path, wdyt?
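Something along these lines, a sketch assuming the YAML sits next to the training script and the ultralytics YOLO API seen in the log; the weights file name is an assumption:

```python
from pathlib import Path

from ultralytics import YOLO  # matching the model.train(...) call in the log

# a minimal sketch: resolve the data file relative to this script instead
# of the process working directory; the weights file name is an assumption
data_path = Path(__file__).parent / "coco128.yaml"
model = YOLO("yolov8n.pt")
model.train(data=str(data_path), epochs=20)
```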
NastyOtter17 can you provide some more info ?