I want to inject a bash command after the repo has been cloned (and maybe even after the venv has been installed).
LazyTurkey38 the created venv inherits from the system environment, so in theory you can do all the installation on the system python and the created venv will just inherit the packages, no?
(btw: just to clarify, there is only one entry point for the custom bash script and that is before everything, so users can configure the container before the agent starts)
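For reference, the relevant knobs in clearml.conf look something like this (values are just examples, and I'm assuming docker mode for the shell-script hook):
```
agent {
    # executed inside the container before the agent starts the experiment
    extra_docker_shell_script: ["apt-get update", "apt-get install -y vim"]

    package_manager {
        # let the per-experiment venv see the system site-packages
        system_site_packages: true
    }
}
```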
These paths are pathlib.Path. Would that be a problem?
No need to worry, it should work (I'm assuming "/src/clearml_evaluation/" actually exists on the remote machine, otherwise it's useless 🙂)
Hi EmbarrassedSpider34
Long story (see below) short, yes you can ignore this warning :)
Specifically, torch is spinning up processes and killing them; every process will have a reference to the parent semaphore (for internal clearml bookkeeping). Now, Python is not very good with this kind of thing (it is getting better in newer Python versions); bottom line, Python "thinks" someone lost a semaphore, but in reality the subprocess never created it in the first place. Does that make sen...
That makes total sense. The question was about the Mac users and the OS environment in the configuration file, and having that OS environment set in code (this is my assumption, as it seems that at import time it does not exist). What am I missing here?
Also btw, is this supposed to be a screenshot from the community version?
Hmm, seems like a screenshot from an enterprise version, I'll ask them to update 🙂
I am also not understanding how clearml-serving is doing the versioning for models in Triton.
Basically you have two Tasks: one is the "controller", checking for model changes and updating itself.
The other is the engine, checking on the "controller" Task for which models it needs to download/configure, and replacing them.
This way you can ha...
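A rough sketch of that polling pattern, just to illustrate the idea (this is not the actual clearml-serving code; the task id and configuration name are made up):
```python
import time
from clearml import Task

controller = Task.get_task(task_id="<controller-task-id>")  # placeholder id

while True:
    controller.reload()  # pick up whatever the controller updated on itself
    endpoints = controller.get_configuration_object("endpoints")  # name assumed
    if endpoints:
        pass  # download/configure the listed models, then swap them into the engine
    time.sleep(60)
```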
I think you are onto a good flow, quick iterations / discussions here, then if we need more support or an action-item then we can switch to GitHub. For example with feature requests we usually wait to see if different people find them useful, then we bump their priority internally, this is best done using GitHub Issues 🙂
GrievingTurkey78 I see,
Basically the arguments after the -m src.train in the remote execution should be ignored (they are not needed).
Change the m argument in the Args section under the configuration. Let me know if it solves it.
Sadly, I think we need to add another option like task_init_kwargs to the component decorator.
What do you think would make sense?
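Maybe something along these lines (task_init_kwargs is hypothetical, it doesn't exist yet):
```python
from clearml.automation.controller import PipelineDecorator

# hypothetical option -- the idea is it would be forwarded to the
# underlying Task.init() call of the component's Task
@PipelineDecorator.component(
    return_values=["model_path"],
    task_init_kwargs={"output_uri": "s3://my-bucket/models"},  # does not exist today
)
def train_step(dataset_id):
    model_path = "model.pkl"
    return model_path
```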
Can I run it on an agent that doesn't have a GPU?
Sure, this is fully supported
When I run clearml-serving it throws me an error: "please provide specific config.pbtxt definition"
Yes, this is a small file that tells the Triton server how to load the model:
Here is an example:
https://github.com/triton-inference-server/server/blob/main/docs/examples/model_repository/inception_graphdef/config.pbtxt
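Roughly what that example looks like (trimmed, and going from memory; see the link for the full version):
```
name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]
```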
Hi MelancholyBeetle72, that's a very interesting case. I can totally understand how storing a model and then immediately renaming it breaks the upload. A few questions: is there a way for PyTorch Lightning not to rename the model? Also, I wonder if this scenario (storing a model and then changing it) happens a lot. I think the best solution is for Trains to create a copy of the file and upload it in the background. That said, the name will still end with .part. What do you think?
Hi PanickyMoth78
You mean like another Task? or maybe Slack message?
Hi FantasticSeaurchin8
You mean in the UI, or when reporting via the SDK?
the task is being Aborted rather than being in Draft. Am I missing something?
Yes, the reason is so you don't miss anything that you might have reported on it.
And usually execute_remotely will get the execution queue as a parameter (i.e. immediately launching the Task).
You can now (starting v1.0) enqueue an aborted Task, so it should not make a difference; you can also reset the Task and edit it in the UI
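For example (queue name is just a placeholder):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
# everything above this line runs locally; this call enqueues the Task
# on the given queue and (with exit_process=True) terminates the local run
task.execute_remotely(queue_name="default", exit_process=True)
```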
What do you have here in your docker compose:
None
Anything that can be done?
Yes. Though again, just highlighting that the naming of foo-mod is arbitrary. The actual module simply has a folder structured with an implicit namespace:
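i.e. something along these lines (layout is illustrative):
```
foo-mod/
└── foo/
    └── mod/          # no __init__.py anywhere -> implicit namespace package (PEP 420)
        └── utils.py
```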
Yep I think this is exactly why it fails detecting it, let me check that
And it's failing on type hints for functions passed in pipe.add_function_step(…, helper_function=[…]) … I guess those aren't being removed like the wrapped function step?
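Minimal repro sketch, in case it helps (names are made up, and I'm going from memory on the parameter name):
```python
from clearml import PipelineController

def helper(x: int) -> int:  # the type hints here seem to be what trips it up
    return x * 2

def my_step(a: int = 1):
    return helper(a)

pipe = PipelineController(name="demo-pipeline", project="examples", version="1.0")
pipe.add_function_step(
    name="step_one",
    function=my_step,
    helper_functions=[helper],
)
```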
Can you provide the log? I think I'm missing what e...
BTW: the new documentation should contain a full search over the docstrings
BattyLion34 if everything is installed and used to work, what's the difference from the previous run that worked ?
(You can compare the working vs non-working runs in the UI and check the installed packages; it would highlight the diff, maybe the answer is there)
but the requirement was already satisfied.
I'm assuming it is satisfied in the host Python environment; do notice that the agent is creating a new clean venv for each experiment. If you are not running in docker mode, then you ca...
you should have a gpu argument there, set it to true
Are you running it in venv mode or docker mode?
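(For context, the mode is picked when launching the agent; the queue name is just an example:)
```
# venv mode -- the agent builds a fresh virtualenv per experiment
clearml-agent daemon --queue default

# docker mode -- each experiment runs inside its own container
clearml-agent daemon --queue default --docker
```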
Hi BroadMole98
What I think I am understanding about Trains so far is that it's great at tracking one-off script runs and storing artifacts and metadata about training jobs, but doesn't replace Kubeflow or Snakemake's DAG as a first-class citizen. How does Allegro handle DAGgy workflows?
Long story short, yes you are correct. Kubeflow, and Snakemake for that matter, are all about DAGs where each node runs a docker (bash) for you. The missing portions (for both) are:
How do I cr...
maybe I should use explicit reporting instead of Tensorboard
It will do just the same 🙂
There is no method for setting last iteration, which is used for reporting when continuing the same task. Maybe I could somehow change this value for the task?
Let me double check that...
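If it helps, something like this might do it (assuming set_initial_iteration is the relevant call, I'd verify first):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="training",
                 continue_last_task=True)  # keep reporting into the same task
# assuming this is the right knob: it sets the iteration offset used
# when reporting continues on the task
task.set_initial_iteration(0)
```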
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...
Nice! I'll see if we can have better error handling for it, or solve it altogether 🙂
DepressedChimpanzee34
What's the Hydra version?
I tested with 1.1.0dev3 and it worked for me
Not really 🙂
Everyone can do everything; the idea is shareability and accessibility.
I do know that in the paid tier they have full access control, roles, SSO, etc., but unfortunately it's way too complicated for the open source version.
Basically what I'm saying is: trust your fellow colleagues 🙂
The quickest workaround would be, in your final code, to just do something like:
```
my_params_for_hpo = {'key': omegaconf.key}
task.connect(my_params_for_hpo, name='hpo_params')
call_training_with_value(my_params_for_hpo['key'])
```
This will initialize my_params_for_hpo with the values from OmegaConf, and allow you to override them in the hyperparameter section (task.connect is two-way: in manual mode it stores the data on the Task, in agent mode it takes the values from the Task and puts them ba...
Are hparams saved in the hyperparameter section superior to hparams saved in configuration objects?
Well, I'm not sure about "superior", but they are structured, as opposed to a configuration object, which is as generic as can be
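To make the distinction concrete, a minimal sketch (standard SDK calls; names are just examples):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="hparams vs config")

# hyperparameters: flat, structured key/value pairs, editable one by one in the UI
task.connect({"lr": 0.001, "batch_size": 32}, name="training")

# configuration object: a free-form blob stored as-is (dict, text, or file)
task.connect_configuration({"model": {"layers": [64, 64]}}, name="model_config")
```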
Can you provide some further explanation, please? Sorry, I am a beginner.
My bad, I was thinking out loud about improving the HPO process and allowing users to modify the configuration_object, not just the hyperparameters