So basically what I'm looking for and what I have now is something like the following:
(Local) I have a well-defined `aws_autoscaler.yaml` that is used to run the AWS autoscaler. That same autoscaler is also run with `CLEARML_CONFIG_FILE=...`.
(Remotely) The autoscaler launches, listens to the predefined queue, and is able to launch instances as needed. I would run a remote execution task object that's appended to the autoscaler queue. The autoscaler picks it up, launches a new instance...
We're wondering how many on-premise machines we'd like to deprecate. For that, we want to see how often our "on premise" queue is used (how often a task is submitted and run), for how long, how many resources it consumes (on average), etc.
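The kind of aggregation we want can be sketched in plain Python. Note that the record fields (`started`, `finished`, `gpus`) and the sample data below are illustrative assumptions, not the actual ClearML task schema:

```python
from datetime import datetime
from statistics import mean

# Hypothetical task records pulled from the "on premise" queue
# (field names and values are illustrative only).
tasks = [
    {"started": datetime(2023, 1, 2, 9, 0), "finished": datetime(2023, 1, 2, 11, 30), "gpus": 2},
    {"started": datetime(2023, 1, 3, 14, 0), "finished": datetime(2023, 1, 3, 14, 45), "gpus": 1},
    {"started": datetime(2023, 1, 5, 8, 0), "finished": datetime(2023, 1, 5, 20, 0), "gpus": 4},
]

def queue_usage(tasks):
    """Summarize how often the queue is used, for how long, and with what resources."""
    durations = [(t["finished"] - t["started"]).total_seconds() / 3600 for t in tasks]
    return {
        "task_count": len(tasks),
        "avg_hours": mean(durations),
        "total_hours": sum(durations),
        "avg_gpus": mean(t["gpus"] for t in tasks),
    }

stats = queue_usage(tasks)
```

With the sample data this yields 3 tasks and 15.25 total compute hours, which is the sort of number we'd compare against keeping a machine around.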
I don't think there's a PR issue for that yet, at least I haven't created one.
I could have a look at this and maybe make a PR.
Not sure what the recommended flow would be, though 🤔
Thanks CostlyOstrich36 !
And can I make sure the same budget applies to two different queues?
So that, for example, an autoscaler would have a resource budget of 6 instances, and it would listen to `aws` and `default` as needed?
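I don't know whether the autoscaler supports a shared budget natively; as a sketch of the behavior I'm after (illustrative only, not the ClearML autoscaler implementation), draining two queues round-robin against one instance budget:

```python
from collections import deque

BUDGET = 6  # shared cap on instances across both queues

def allocate(queues, budget=BUDGET):
    """Assign pending tasks to instances, alternating between queues,
    never exceeding the shared budget."""
    assignments = []
    names = deque(queues)
    while len(assignments) < budget and any(queues[n] for n in queues):
        name = names[0]
        names.rotate(-1)  # round-robin to the next queue
        if queues[name]:
            assignments.append((name, queues[name].pop(0)))
    return assignments

queues = {"aws": ["t1", "t2", "t3", "t4", "t5"], "default": ["t6", "t7"]}
plan = allocate(queues)
```

Here only 6 of the 7 pending tasks get an instance; `t5` stays queued until budget frees up.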
Thanks for the reply CostlyOstrich36 !
Does the task read/use the `cache_dir` directly? It's fine for it to be a cache and then removed from the fileserver; if users want the data to stay they will use the ClearML Dataset 🙂
The S3 solution is bad for us since we have to create a folder for each task (before the task is created), and hope it doesn't get overwritten by the time it executes.
Argument augmentation - say I run my code with `python train.py my_config.yaml -e admin.env`
...
The S3 bucket credentials are defined on the agent, as the bucket is also running locally on the same machine - but I would love for the code to download and apply the file automatically!
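For clarity, the hypothetical CLI shape behind `python train.py my_config.yaml -e admin.env` (the script name and flags are from my example above, not a ClearML API):

```python
import argparse

def parse_args(argv=None):
    """Parse the illustrative train.py command line: a positional YAML
    config plus an optional -e/--env file."""
    parser = argparse.ArgumentParser(description="Training entry point (illustrative)")
    parser.add_argument("config", help="Path to the YAML configuration, e.g. my_config.yaml")
    parser.add_argument("-e", "--env", default=None, help="Optional env file, e.g. admin.env")
    return parser.parse_args(argv)

args = parse_args(["my_config.yaml", "-e", "admin.env"])
```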
Looks great, looking forward to all the new treats 🙂
Happy new year! 🙂
Would be good if that's mentioned explicitly in the docs 🙂 Thanks!
Parquet file in this instance (used to be CSV, but that was even larger as everything is stored as a string...)
One must then ask, of course, what to do if e.g. a text refers to a dictionary configuration object? 🤔
Is it currently broken? 🤔
I cannot, the instance is long gone... But it's no different from any other scaled instance; it seems it just took a while to register in ClearML.
Note that it would succeed if e.g. run with `pytest -s`
We have a more complicated case but I'll work around it 🙂
Follow up though - can configuration objects refer to one-another internally in ClearML?
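I don't know whether ClearML resolves such references natively; as a sketch of what "refer to one another" could mean, here is a hand-rolled `${section.key}` resolver over a dict of configuration objects (the placeholder syntax is my own assumption, purely illustrative):

```python
import re

def resolve(configs):
    """Resolve "${section.key}" placeholders across a dict of configuration
    objects (single level of indirection, illustrative only)."""
    pattern = re.compile(r"\$\{(\w+)\.(\w+)\}")
    resolved = {}
    for section, cfg in configs.items():
        resolved[section] = {
            k: pattern.sub(lambda m: str(configs[m.group(1)][m.group(2)]), v)
            if isinstance(v, str) else v
            for k, v in cfg.items()
        }
    return resolved

configs = {
    "paths": {"root": "/data"},
    "train": {"dataset": "${paths.root}/train.parquet", "epochs": 10},
}
out = resolve(configs)
```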
I'll have a look; at least it seems to only use `from clearml import Task`, so unless mlflow changed their SDK, it might still work!
Bump SuccessfulKoala55 ?
-ish, still debugging some weird stuff. Sometimes ClearML picks `ip` and sometimes `ip2`, and I can't tell why 🤔
That's what I thought @<1523701087100473344:profile|SuccessfulKoala55> , but the server URL is correct (and WebUI is functional and responsive).
In part of our code, we look for projects with a given name, and pull all tasks in that project. That's the crash point, and it seems to be related to having running tasks in that project.
Yeah, and just thinking out loud what I like about the numpy/pandas documentation
No, I have no running agents listening to that queue. It's as if it's retained in some memory somewhere and the server keeps creating it.
Hmmm, what 🙂
CostlyOstrich36 so internal references are not resolved somehow? Or, how should one achieve:

```python
def my_step():
    from ..utils import foo
    foo("bar")
```
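When a step runs as a standalone script (no package context), a relative import like `from ..utils import foo` raises an ImportError, so one workaround is to depend on a top-level module instead. A self-contained sketch, where the `utils` module is simulated in-memory purely for illustration:

```python
import sys
import types

# Simulate a top-level `utils` module being importable on the worker
# (illustrative stand-in for a real utils.py shipped with the step).
utils = types.ModuleType("utils")
utils.foo = lambda arg: f"foo({arg})"
sys.modules["utils"] = utils

def my_step():
    from utils import foo  # absolute import works without a parent package
    return foo("bar")

result = my_step()
```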
Hm. Is there a simple way to test tasks, one at a time?
I realized it might work too, but looking for a more definitive answer 🙂 Has no-one attempted this? 🤔
AgitatedDove14 Unfortunately not, the queues tab shows only the number of tasks, but not the resources used in the queue. I can toggle between the different workers, but then I don't get the full picture.
Still; anyone? 🥹 @<1523701070390366208:profile|CostlyOstrich36> @<1523701205467926528:profile|AgitatedDove14>
Well the individual tasks do not seem to have the expected environment.
We can change the project names, of course, if there's a suggestion/guide that will make them see past the namespace…
It is installed on the machine creating the pipeline.
I have no idea why it did not automatically detect it 🙂