Hi PanickyFish98
It verifies it has access to it when actually creating the Task, maybe it should be a warning?!
fyi: you can also change the value from the UI (under Execution output) or have a default one set in the clearml.conf used by the agent
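For example, assuming the value here is the default output destination (bucket path is illustrative), the agent's clearml.conf could set:
` sdk {
    development {
        # default destination for uploaded models / artifacts
        default_output_uri: "s3://my-bucket/models"
    }
} `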
Looking at the `supervisor` method of the base `AutoScaler` class, where are the worker IDs kept? Is it in the class attribute `queues`?
Actually the supervisor is passing a fixed prefix, then it queries the clearml-server for workers whose names start with that prefix.
This way we can have a fixed init script for all agents, while still being able to differentiate them from the other agent instances in the system. Makes sense?
PungentLouse55 hmmm
Do you have an idea on how we could quickly reproduce it?
FlutteringWorm14 Can you verify that even with the clearml.conf it has no effect?
Could it be that this is the callback that causes it?
None
Oh, this is only in the SaaS server ...
(I'm sorry I was not clear on that)
It can be a different agent.
If inside a docker, then: `clearml-agent execute --id <task_id here> --docker`
If you need a venv, do: `clearml-agent execute --id <task_id here>`
You can run that on any machine and it will respin and continue your Task
(obviously your code needs to be aware of that and be able to pull its own last model checkpoint from the Task artifacts / models)
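A minimal sketch (assuming the checkpoint was registered as an output model on the Task) of how the code could pick up its last checkpoint:
` from clearml import Task

task = Task.current_task()
output_models = task.models["output"]
if output_models:
    # download the most recently registered checkpoint locally
    checkpoint_path = output_models[-1].get_local_copy()
    # ... load the weights from checkpoint_path and continue training `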
Is this what you are after?
NastySeahorse61 it might be that the frequency at which it checks the metric storage is only once a day (or maybe every half day), let me see if I can ask around
(just making sure you can still login to the platform?)
FreshParrot56 we could add this capability, but the main caveat is that if your version depends on multiple parent versions, you still need to download and extract all the parent versions, which means that when you clear them you might hurt later performance. Does that make sense? What is the use-case / scenario for you?
I am trying to use the `configuration vault` option but it doesn't seem to apply the variables I am using.
Hi EmbarrassedSpider34 I think this is an enterprise feature...
Managed to get the credentials attached to the configuration when the task is spun up,
I'm assuming env variables ?
btw: what's the OS and python version?
Hi @<1547028116780617728:profile|TimelyRabbit96>
It should process the new request A (this is a multi threading / async implementation)
Is this consistent with what you are seeing ?
Regarding resetting it via code: if you need, I can write a few lines for you to do that, although it might be a bit hacky.
Maybe we should just add a flag saying, use requirements.txt ?
What do you think?
Yes, or at least credentials and API...
Maybe inside your code you can later copy the model into a fixed location?
This way you have the model in the model repository and a copy in a fixed location (StorageManager can upload to a specific bucket/folder with the same credentials you already have)
Would that work?
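Something along these lines (bucket / folder are illustrative):
` from clearml import StorageManager

# copy the local checkpoint to a fixed, well-known location,
# reusing the storage credentials already configured in clearml.conf
StorageManager.upload_file(
    local_file="model.pt",
    remote_url="s3://my-bucket/fixed-folder/model.pt",
) `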
Hi SarcasticSparrow10 ,
So the bad news is the UI is actually escaping the query, so you cannot search with a regexp from the UI. The good news: you can achieve that from Python:
` from trains import Task
tasks = Task._query_tasks(task_name='exp.*i1') `
🤔 maybe we should have "sub nodes" as just visual functions running inside the same actual pipeline component ?
Yep 🙂 but only in RC (or github)
Decorators are good 🙂
Something along the lines of
` @PipelineDecorator.pipeline(...)
def pipeline(skip_a=False):
    if not skip_a:
        a = step_a()
    else:
        # somehow get a previous A?
        # let's call it cached A
        a = "replace with real"
    step_b(a)
    ... `Is this the gist?
If it is, this looks like, "how can I control whether A is cached or not", is that correct?
I think what you are looking for is `clearml-agent daemon`
https://clear.ml/docs/latest/docs/clearml_agent
https://clear.ml/docs/latest/docs/getting_started/video_tutorials/agent_remote_execution_and_automation
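For example (queue name is illustrative):
` clearml-agent daemon --queue default --docker `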
Hi @<1523711619815706624:profile|StrangePelican34>
You can either report on the Model itself:
None
or you can force it on the Task:
task = Task.get_task("task id here")
task.mark_started(force=True)  # force the completed Task back into a started state
task.get_logger().report_scalar(...)  # report whatever scalars you need
task.mark_completed(force=True)  # close it again when done
Awesome ! thank you so much!
1.0.2 will be out in an hour
One thing though - I am running agent on behalf of a regular user.
Oh that might be a credentials / docker service issue (i.e. the user might not have the ability to run a docker with --gpus, but as you mentioned, that seems like an arch thing 🙂 )
GrotesqueDog77 this should just work, decorate the functions with @PipelineDecorator.component
and call the functions one after the other:
` paths = step_one()
step_two(paths) `
ClearML will make sure it serializes the strings and pass them to step two (of course step two should actually run on a machine with access to the same folder, but this is another issue 🙂 )
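A minimal runnable sketch of this flow (project / step names are illustrative):
` from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["paths"])
def step_one():
    # produce something for the next step (illustrative)
    return ["/data/a.txt", "/data/b.txt"]

@PipelineDecorator.component()
def step_two(paths):
    print("processing", paths)

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def run_pipeline():
    paths = step_one()
    step_two(paths)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug in the local process; drop this to launch on agents
    run_pipeline() `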
These both point to an nvidia docker runtime installation issue.
I'm assuming that in both cases you cannot run the docker manually as well, which is essentially what the agent will have to do ...
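As a quick sanity check (CUDA image tag is illustrative), try running a GPU docker manually:
` docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi `
If this fails, the fix is on the nvidia-container-toolkit / docker side, not ClearML.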
From the top
1. trains-agent pulls a service Task
2. Task is marked as running
3. trains-agent worker points to the Task
4. Docker is spun up
5. environment is installed inside the docker (results are shown in the service Task log)
6. trains-agent inside the docker is launched, and a new node appears in the system as <host_agent_name>:service:<task_id> with the Task service listed as running on it
7. the main trains-agent is back to idle, and its worker now has no experiment listed as running
Where do you think it breaks?
It seems like the web server doesn’t log the call to AWS, I just see this:
This points to the browser actually sending the AWS delete command. Let me check with FE tomorrow
Anyway, in the docs, there is a function called task.register_artifact()
Yes, this is rather deprecated... The idea is that it will monitor an object and auto-sync it (i.e. serialize and upload).
That said, it is just so much easier to do task.upload_artifact
and you can always update/overwrite if you are passing the same name, so I cannot really see the actual use case. Does that make sense? What are you using it for ?
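For instance, a quick sketch of the upload/overwrite flow (names are illustrative):
` from clearml import Task

task = Task.init(project_name="examples", task_name="artifact demo")
stats = {"accuracy": 0.91, "epoch": 10}
task.upload_artifact(name="training_stats", artifact_object=stats)

# uploading again with the same name simply overwrites the artifact
stats["epoch"] = 11
task.upload_artifact(name="training_stats", artifact_object=stats) `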
Hi GreasyLeopard35
I try to resume a stopped or aborted parameter optimization experiment,
How are you continuing the HPO? Are you running everything locally? Is this with an agent? Are you seeing the '[0, 0]' value in the configuration when launching the HPO or when continuing it ?