Hi @<1523701491863392256:profile|VastShells9> , the GCP autoscaler is not available in the open source version, I'm afraid - only in Pro licenses and up
Hi @<1614069770586427392:profile|FlutteringFrog26> , if I'm not mistaken ClearML doesn't support running from different repos. You can only clone one code repository per task. Is there a specific reason these repos are separate?
If it's deployed by you, then try running clearml-init
from the same machine the server is on. Doesn't matter if it's a cloud machine really
Hi @<1654294828365647872:profile|GorgeousShrimp11> , you can set it with an env var - CLEARML_CONFIG_FILE
None
From the environment variable
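To illustrate, pointing ClearML at a non-default configuration file via that environment variable might look like this (the path below is just a placeholder for your own config file):

```python
import os

# Point ClearML at a custom configuration file.
# The path is a placeholder; use the location of your own clearml.conf.
os.environ["CLEARML_CONFIG_FILE"] = "/path/to/custom_clearml.conf"

# Any subsequent ClearML call in this process (e.g. Task.init) will read
# its settings from that file instead of the default ~/clearml.conf.
print(os.environ["CLEARML_CONFIG_FILE"])
```

You can also export it in the shell before launching your script, which avoids touching the code at all.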
Basically the same capabilities that are offered for unstructured data - the ability to register files, keep track of and manage them with links, and the ability to query all of their metadata and then connect it to the experiment as a query on the metadata across different versions - basically giving you a feature store.
I am of course oversimplifying, as the HyperDatasets feature is an extremely powerful tool for managing unstructured data.
RoughTiger69 , regarding the dataset loading, we are actually thinking of adding it as another "hyper parameter" section, and I think the idea came up a few times in the last month, so we should definitely do that. The question is how do we support multiple entries (i.e. two datasets loaded)? Should we force users to "name" the dataset when they "get" it?
Regarding cloning, we had a lot of internal discussions on it. "Parent" is a field on a Task, so the information can be easily stored, th...
For the SDK you have to provide it one by one so you'd have to iterate over the list
Hi SubstantialElk6 ,
That's an interesting idea. If you want to preprocess a lot of data, I think the best approach would be using multiple datasets (one per process) or different versions of a dataset. Although I think you can also pull specific chunks of a dataset and then use just the one - I'm not sure about the last point.
What do you think?
You can use Task.set_base_docker
( None )
To specify arguments, there is an example there 🙂
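As a rough sketch of what that call looks like (assuming the clearml package is installed; the project/task names, image name, and docker arguments below are placeholders, not recommendations):

```python
def set_task_docker():
    """Attach a base docker image and extra docker run arguments to a task."""
    # Assumes clearml is installed and configured against a server.
    from clearml import Task

    # Placeholder project/task names for illustration.
    task = Task.init(project_name="examples", task_name="docker demo")

    # The agent running this task (in docker mode) will use this image
    # and pass the extra arguments to `docker run`.
    task.set_base_docker(
        docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",  # placeholder image
        docker_arguments="--ipc=host --shm-size=8g",            # placeholder args
    )
```

Note the agent has to run in docker mode for the base image to take effect.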
Hi @<1523701304709353472:profile|OddShrimp85> , I would suggest looking at the examples here:
None
Can you add a full log from startup of both Elastic and apiserver containers?
Hi @<1603198163143888896:profile|LonelyKangaroo55> , you can change the value of files_server in your clearml.conf
to control it as well.
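For reference, the relevant snippet in clearml.conf looks something like this (the URL below is a placeholder for your own files server address):

```
api {
    # Default destination for uploaded artifacts and debug samples.
    # Replace with the address of your own files server.
    files_server: "http://localhost:8081"
}
```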
In the task hyper parameters section you have a section called Hydra. In that section there should be a configuration called _allow_omegaconf_edit_
, what is it set to?
Click on step_one and on Full details
Can you attach the console log? What GPUs are you using? I assume nvidia-smi runs without issue?
I think that Model is used to do general actions as allowed by the SDK. InputModel is for an easier interface when working with the Task object directly.
What is your use case?
What version of clearml-agent are you using? Can you add the full log here?
@<1533257278776414208:profile|SuperiorCockroach75> , excuse my ignorance, but doesn't it depend on the output model i.e. the training run that created it?
What do you mean by drop of many GB? Can you please elaborate on what happens exactly?
I know that Elastic can sometimes create disk corruptions and requires regular backups.
JitteryCoyote63 , are you on a self-hosted server? It seems the issue was solved for the 3.8 release and I think it should be included in the next self-hosted release
Hi @<1734020162731905024:profile|RattyBluewhale45> , what version of pytorch are you specifying?
Hi GiganticMole91 ,
I see that the storage settings are also available through environment variables, but I'm worried that the environment variables have already been parsed at that time.
I'm not sure I understand. Can you elaborate? How do you run it remotely? Do you raise an instance each time or are your instances persistent?
It would work from your machine as well, but the machine needs to be turned on - like an EC2 instance that is running.
I see, thanks for the input!
Do you have a log of the triton server?
Hi @<1523704674534821888:profile|SourLion48> , making sure I understand - you push a job into a queue that an autoscaler is listening to. A machine is spun up by the autoscaler, takes the job, and runs it. Afterwards, during the idle time, you push another job to the same queue; it is picked up by the machine that was spun up by the autoscaler, and that one fails?
It can take some time if the file or the folder is very large. This can also depend on connectivity. If the folder is very large, please keep in mind that zipping it can also be resource-intensive.
How long has it been hanging, and how large is the folder?
@<1581454875005292544:profile|SuccessfulOtter28> , I don't think there is such a capability currently. I'd suggest opening a GitHub feature request for this.
Hi @<1523706700006166528:profile|DizzyHippopotamus13> , you can simply do it in the experiments dashboard in table view. You can rearrange columns, add custom columns according to metrics and hyper parameters. And of course you can sort the columns