
The odd thing is that it was able to authenticate, but then it could not find the Task to delete.
Could it be that someone already deleted the Task?
(BTW: a new version of the cleanup service is in the works 🙂 )
Did you run trains-agent?
Hi JitteryCoyote63
The easiest is to inherit the ResourceMonitor class and change the default logging rate (you could also disable some of the metrics).
https://github.com/allegroai/clearml/blob/701fca9f395c05324dc6a5d8c61ba20e363190cf/clearml/task.py#L565
Then pass the new class to Task.init as auto_resource_monitoring
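Roughly something like this (just a sketch: the import path and the report_frequency_sec argument name are taken from the ResourceMonitor constructor in the clearml source linked above, so double-check them against your installed version; the project/task names and values are arbitrary):

from clearml import Task
# import path may differ between clearml versions - see the link above
from clearml.utilities.resource_monitor import ResourceMonitor


class SlowResourceMonitor(ResourceMonitor):
    """Same monitoring, but reporting machine vitals less frequently."""

    def __init__(self, *args, **kwargs):
        # 'report_frequency_sec' controls the reporting rate
        # (the 300s value here is just an example)
        kwargs.setdefault("report_frequency_sec", 300)
        super().__init__(*args, **kwargs)


task = Task.init(
    project_name="examples",
    task_name="custom resource monitor",
    auto_resource_monitoring=SlowResourceMonitor,
)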
Yey!
My pleasure 🙂
Hover over the border (I would suggest using full screen, i.e. maximize)
oh, then this is user/pass (pass is the same as app key / secret)
None
Full markdown edit on the project so you can create your own reports and share them (you can also put links to the experiments themselves inside the markdown). Notice this is not per experiment reporting (we kind of assumed maintaining a per experiment report is not realistic)
Like, let's say I want "a 15GB GPU or better" and there's 4 queues, two of which fit the description... is there any way to set it so that ClearML will just queue it up on whichever one's available?
How do you know that? Also, if you know that, what do you know about the queues?
Generally speaking this type of granularity is k8s, but it has lots of caveats, specifically that you need to know what you need in terms of resources, that you can specify resources that do not exist, and that...
Hi SkinnyPanda43
Let's say that I install the shared libs with pip in editable mode on my development environment, how will the clearml-agent handle those libraries if I submit a job?
So installing packages from local folders with "-e" is in general ill-advised.
But using a full git path should work out of the box, for example if you pip install
https://github.com/user/repo/repo.git then the agent will be able to install it on the remote machine. The main challenge...
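For example, something along these lines on your dev machine (hypothetical repo URL; pip's VCS syntax uses a git+ prefix):

# hypothetical repo URL - the agent can then reinstall the exact same source remotely
pip install "git+https://github.com/user/repo.git"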
Hi @<1603198134261911552:profile|ColossalReindeer77>
When you select poetry as the package manager, the agent passes control to poetry, which means poetry needs to decide on the correct torch wheel based on your CUDA. I do not think poetry can do that, but I do think you can specify the extra index url to take the torch wheel from:
None
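Something along these lines in pyproject.toml, assuming a Poetry version that supports explicit sources (the source name, index URL / CUDA tag and torch version below are placeholders):

[[tool.poetry.source]]
name = "pytorch-cu118"                          # placeholder source name
url = "https://download.pytorch.org/whl/cu118"  # pick the index matching your CUDA version
priority = "explicit"

[tool.poetry.dependencies]
torch = { version = "^2.1", source = "pytorch-cu118" }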
The notebook path goes through a symlink a few levels up the file system (before hitting the repo root, though)
Hmm sounds interesting, how can I reproduce it?
The notebook kernel is also not the default kernel,
What do you mean?
BattyLion34 I have a theory, I think that any Task on the "default" queue will fail if a Task is running on the "service" queue.
Could you create a toy Task that just prints "." and sleeps for 5 seconds and then prints again?
Then, while that Task is running, from the UI launch the Task that passed on the "default" queue. If my theory holds it should fail, and then we will be getting somewhere 🙂
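Something like this should do for the toy Task (project/task names are arbitrary):

from time import sleep
from clearml import Task

# toy task: prints "." every 5 seconds so it keeps running for a few minutes
task = Task.init(project_name="debug", task_name="toy sleep task")
for _ in range(60):
    print(".")
    sleep(5)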
Oh that makes sense. This depends on how you set up the clearml k8s glue (because the resource allocation is done by k8s). A good hack to limit the number of containers per GPU is to set a RAM limitation per pod, then k8s will know to limit the number of pods on the same GPU machine.
wdyt?
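As a rough sketch of that hack (container name and memory values here are illustrative, not the glue's actual defaults), the pod template the glue applies would carry a memory limit, e.g.:

spec:
  containers:
    - name: clearml-task        # hypothetical container name
      resources:
        requests:
          memory: "8Gi"         # illustrative value
        limits:
          memory: "8Gi"         # k8s schedules fewer pods per node because of this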
but then an error message in the web-app pops up
Fetch parents failed
and the Scheduler task disappears
And the Task is still running? What's the clearml python version and webui version?
ConvolutedChicken69
basically clearml-data needs to store an immutable copy of the delta changes per version; if the files are already uploaded, there is a good chance they could be modified...
So in order to make sure you have a clean immutable copy, it will always upload the data (notice it also packages everything into a single zip file, so it is easy to manage).
ElegantCoyote26 point me to where Keras stores the data 🙂
If in the process of integration you had to add a logger/callback to your Keras code, that is the equivalent of using TensorBoard.
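i.e. if you already have something like the TensorBoard callback in there, ClearML picks up whatever it logs. A minimal sketch (toy data/model, arbitrary names and paths):

import numpy as np
import tensorflow as tf
from clearml import Task

task = Task.init(project_name="examples", task_name="keras with tensorboard")

# toy data and model just so there is something to fit
x = np.random.rand(128, 4).astype("float32")
y = np.random.rand(128, 1).astype("float32")
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# whatever the TensorBoard callback writes is captured automatically
model.fit(x, y, epochs=3, callbacks=[tf.keras.callbacks.TensorBoard(log_dir="./tb_logs")])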
TenseOstrich47 it's based on a free "index", so the first index not in use will be captured, but if you remove agents then the order will change (e.g. if you take down worker #1, the next worker you spin up will be #1 because it is not taken).
Hi FlutteringWorm14
Is there some way to limit that?
What do you mean by that? are you referring to the Free tier ?
however, this will also turn off metrics
For the sake of future readers, let me clarify this one: turning it off with auto_connect_frameworks={'pytorch': False}
only affects the auto logging of torch.save/torch.load.
(Side note: the reason is that pytorch does not have built-in metric reporting, i.e. it is usually done manually, and these days most probably with tensorboard; for example, lightning / ignite will use tensorboard as the default metric reporting.)
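In other words, something like this (arbitrary names) keeps tensorboard / explicit Logger reports while skipping only the model auto-logging:

from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="no pytorch auto-logging",
    # disables only the torch.save/torch.load model auto-logging;
    # tensorboard / explicit Logger reports still come through
    auto_connect_frameworks={"pytorch": False},
)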
Hi SubstantialElk6
I think you are absolutely correct, it seems the glue pops all the arguments, when in fact it should maybe process them and convert the --env/-e
What do you think?
Also, I assume if these are the default arguments they should actually be part of the k8s apply.yaml template, no?
Hi @<1694157594333024256:profile|DisturbedParrot38>
You mean how to tell the agent to pull only some submodules of your git?
If this is the case you can actually remove them on your git branch; a submodule is a file with a soft link. Wdyt?
Hi DeliciousBluewhale87
This sounds like a great workflow to implement.
I guess my first question is how do you imagine the manager/director interacting with the system? What will they be shown, to allow them to approve/decline the model promotion ?
SoreDragonfly16 notice that if you abort a task in the web UI, it will do exactly what you described: print a message and quit the process. Any chance someone did that?
Are they ephemeral or later used by other Tasks, executions, etc.?
For example: configuration files, they are specific for an execution, and someone will edit them.
Initial weights files are something that multiple executions might need, and they will be used to restore an execution. Data, even if changing, is usually used by multiple executions/tasks, etc.
It seems like you treat these files as "configurations", is that right ?
Unfortunately that is correct. It continues as if nothing happened!
oh dear, let me make sure this is taken care of
And thank you for the reproduce code!!!
Yes, including this. (There was a fix to an issue with trains-agent and disabling frameworks, it is already part of 0.16.3.)