Hi UnevenOstrich23
if --docker is enabled, does that mean every new experiment will be executed in a dedicated agent worker container?
Correct
I think the missing part is how to specify the docker image for the experiment?
If this is the case: in the web UI, clone your experiment (which creates a draft copy you can edit), then in the Execution tab scroll down to the "base docker image" and specify the docker image to use.
Notice that you can also add flags after the docker im...
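If you prefer to set it from code rather than the UI, something along these lines should also work (a minimal sketch; the project/task names, image and flag are placeholders, and I'm assuming the single-string form of Task.set_base_docker):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker example")  # hypothetical names

# Base docker image for the agent to use; flags can be appended after the image name.
# The image and flag here are placeholders - replace with whatever your experiment needs.
task.set_base_docker("nvidia/cuda:11.6.2-runtime-ubuntu20.04 --ipc=host")
```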
"General" is the parameter section name (like Args)
Hi IrritableGiraffe81
Can you share a code snippet ?
Generally I would try: task = Task.init(..., auto_connect_frameworks={'pytorch': False, 'tensorflow': False})
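i.e. something like this (a sketch; adjust the project/task names and keep whatever other Task.init arguments you already pass):
```python
from clearml import Task

# disable the automatic pytorch / tensorflow bindings, keep everything else on
task = Task.init(
    project_name="examples",      # hypothetical name
    task_name="manual logging",   # hypothetical name
    auto_connect_frameworks={"pytorch": False, "tensorflow": False},
)
```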
Ok, just my ignorance then?
LOL, no it is just that with a single discrete parameter the strategy makes less sense 🙂
I still see things being installed when the experiment starts. Why does that happen?
This only means no new venv is created; it basically means installing into the "default" python env (usually whatever is preset inside the docker)
Make sense ?
Why would you skip the entire python env setup ? Did you turn on venvs cache ? (basically caching the entire venv, even if running inside a container)
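For reference, the venvs cache is enabled in clearml.conf on the agent machine, roughly like this (a sketch based on the sample clearml.conf; setting the path is what turns it on, values are placeholders):
```
agent {
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum free disk space (GB) to allow the cache to grow
        free_space_threshold_gb: 2.0
        # setting the path enables the venvs cache
        path: ~/.clearml/venvs-cache
    }
}
```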
FrustratingWalrus87 Unfortunately TB's TSNE is not automatically captured by ClearML (Scalars, histograms etc. are)
That said, matplotlib will be automatically captured, so you can run your own PCA/tSNE and use matplotlib to visualize (ClearML will capture it).
The same applies for plotly.
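For example, something like this (a sketch; the project/task names and data are placeholders) - ClearML will pick up the matplotlib figure automatically:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from clearml import Task

task = Task.init(project_name="examples", task_name="tsne plot")  # hypothetical names

# stand-in data: replace with your real embeddings / labels
embeddings = np.random.randn(200, 64)
labels = np.random.randint(0, 5, size=200)

points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE projection")
plt.show()  # the figure is captured and reported to the ClearML UI
```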
What do you think?
Now that we have the free tier (a.k.a community server) we might change the default behavior.
The idea is always to allow an easy way to on-board and test the system.
ReassuredTiger98
BTW: what's the scenario where your machine reverted to the default configuration (i.e. no configuration file) ?
Hi FierceHamster54
This is already supported. Unfortunately, the open-source version only supports static allocation (i.e. you can spin up multiple agents and connect each one to a specific set of GPUs); the dynamic option (where a single agent allocates jobs to multiple GPUs / slices) is only part of the enterprise edition
(there is the hidden assumption there that if you spent so much on a DGX you are probably not a small team 🙂 )
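Static allocation in practice just means launching one agent per GPU (or GPU set), e.g. something like this (queue names are placeholders):
```
clearml-agent daemon --queue gpu0_queue --gpus 0 --docker --detached
clearml-agent daemon --queue gpu1_queue --gpus 1 --docker --detached
```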
Although I didn't understand why you mentioned
torch
in my case?
Just a guess 🙂 other frameworks do multi-process as well,
I would guess it relates to parallelization of Tasks execution of the
HyperParameterOptimizer
class?
Yes, that might be it; it's basically a by-product of using python's "Process" class for multiprocessing. We are working on a fix, not a trivial one unfortunately
Okay so my thinking is, on the PipelineController / decorator we will have: abort_all_running_steps_on_failure=False (if True, when a step fails it will abort all running steps and leave)
Then per step / component decorator we will have: continue_pipeline_on_failure=False (if True, when that step fails, the rest of the pipeline DAG will continue)
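A rough sketch of how that could look with the decorator syntax (note: these flags are only the proposal above, not an existing API; names/project/version are placeholders):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(continue_pipeline_on_failure=True)  # proposed per-step flag
def step_one():
    ...

@PipelineDecorator.pipeline(
    name="example pipeline", project="examples", version="0.1",  # placeholders
    abort_all_running_steps_on_failure=False,                    # proposed controller-level flag
)
def pipeline_logic():
    step_one()
```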
GiganticTurtle0 wdyt?
So sorry for the delay here! totally lost track of this thread
Should I be deploying the entire docker file for every model, with the updated requirements?
It's for every "environment" i.e. if models need the same set of python packages , you canshare
if this is the case pytorch really messed things up, this means they removed packages
Let me check something
Makes sense 🙂
Just make sure you configure the git user/pass in the docker-compose so the agent has your credentials for the repo clone.
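Something along these lines in the docker-compose (a sketch; I'm assuming the CLEARML_AGENT_GIT_* environment variables and the service name, adjust to your compose file):
```yaml
services:
  agent-services:                          # whatever your agent service is called
    environment:
      CLEARML_AGENT_GIT_USER: my-git-user  # placeholder
      CLEARML_AGENT_GIT_PASS: my-git-token # placeholder - use a personal access token, not your password
```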
Okay, this seems to be the problem
Hmmm:
WOOT WOOT we broke the record! Objective reached 17.071016994817196
WOOT WOOT we broke the record! Objective reached 17.14302934610711
These two seem strange, let me look into it
Hi TrickySheep9
You should probably check the new https://github.com/allegroai/clearml-server-helm-cloud-ready helm chart 🙂
ClearML automatically gets these reported metrics from TB; since you mentioned you see the scalars, I assume huggingface reports to TB. Could you verify? Is there a quick code sample to reproduce?
Regarding the artifact, yes that makes sense, I guess this is why there is an "input" type for an artifact; the actual use case was never found (I guess until now?! what is your point there?)
Regarding the configuration
It's very useful for us to be able to see the contents of the configuration and understand
Wouldn't that just do exactly what you are looking for:
` local_config_file_that_i_can_always_open = task.connect_configuration("important", "/path/to/config/I/only/have/on/my/machi...
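For completeness, a minimal sketch of that flow (assuming the connect_configuration(configuration, name=...) signature where the file path comes first; paths and names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="config example")  # hypothetical names

# when running locally: stores the file content with the task
# when executed by an agent: returns a local copy of the stored configuration
local_config = task.connect_configuration("/path/to/my_config.yaml", name="important")
print(open(local_config).read())
```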
Hmm I think the easiest is using the helm chart:
https://github.com/allegroai/clearml-server-helm-cloud-ready
I know there is work on a Terraform template, not sure about Istio.
Is helm okay for you ?
Hmm, this is a good question. I "think" the easiest is to mount the .ssh folder from the host to the container itself, then also mount clearml.conf into the container with force_git_ssh_protocol: true, see here
https://github.com/allegroai/clearml-agent/blob/6c5087e425bcc9911c78751e2a6ae3e1c0640180/docs/clearml.conf#L25
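e.g. in the agent's clearml.conf, something like this (a sketch; mounting .ssh via extra_docker_arguments is one way to do it, paths are placeholders):
```
agent {
    # force cloning over SSH instead of converting to https
    force_git_ssh_protocol: true

    # mount the host's ssh keys into the task container (adjust paths to your setup)
    extra_docker_arguments: ["-v", "/home/myuser/.ssh:/root/.ssh:ro"]
}
```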
btw: ssh credentials, even though they sound more secure, are usually less so (since they easily grant too broad credentials and other access rights), just my 2 cents 🙂 I ...
Hi StraightDog31
I am having trouble using the
StorageManager
to upload files to GCP bucket
Are you using the StorageManager directly? Or are you using task.upload_artifact?
Did you provide the GS credentials in the clearml.conf file, see example here:
https://github.com/allegroai/clearml/blob/c9121debc2998ec6245fe858781eae11c62abd84/docs/clearml.conf#L110
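Once the credentials are in place, uploading directly should be as simple as (a sketch; the bucket name and paths are placeholders):
```python
from clearml import StorageManager

# assumes google.storage credentials are configured in clearml.conf (see link above)
remote_url = StorageManager.upload_file(
    local_file="/tmp/model.pkl",
    remote_url="gs://my-bucket/models/model.pkl",
)
print(remote_url)
```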
Verified, you are correct: "." in label enumeration will break the clone.
I'll make sure this bug is passed to backend guys to fix. Thanks TenseOstrich47 !
meanwhile maybe "_" instead? 🙂
BTW: I think we had a better example, I'll try to look for one
Hi @<1542316991337992192:profile|AverageMoth57>
Not sure I follow what you have in mind regarding the Gerrit integration
Sounds interesting ...
wdyt?
WackyRabbit7 in that case:
from trains.utilities.pyhocon import ConfigFactory, HOCONConverter
from trains.config import config_obj
new_conf_text = HOCONConverter.to_hocon(config=ConfigFactory.from_dict(config_obj.as_plain_ordered_dict()), compact=False, level=0, indent=2)
print(new_conf_text)
Hmm, let me see if you can somehow "signal" to the subprocess that it should not use the main process Task. (btw: are you forking or spawning a subprocess?)