Hi GiddyTurkey39,
When you say trains agent, are you referring to the trains agent command ...
I mean running the trains-agent daemon on a machine. This means you have a daemon pulling jobs from the execution queue and executing them (either in a virtual environment or inside a Docker container).
You can read more at https://github.com/allegroai/trains-agent and https://allegro.ai/docs/concepts_arch/concepts_arch/
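To make the "daemon pulling jobs from the execution queue" idea concrete, here is a toy sketch in plain Python. It is not how trains-agent is implemented; the names run_worker, execute and max_idle_polls are made up for illustration, and a real agent polls the server's queue and never stops.

```python
import queue


def run_worker(job_queue, execute, max_idle_polls=1):
    """Pull jobs one at a time and execute them.

    A real agent loops forever; this toy version stops after the queue
    has been found empty `max_idle_polls` times so the example terminates.
    """
    results = []
    idle_polls = 0
    while idle_polls < max_idle_polls:
        try:
            job = job_queue.get(timeout=0.01)
        except queue.Empty:
            idle_polls += 1
            continue
        idle_polls = 0
        # In the real agent this is where the environment (venv or docker)
        # would be prepared before the job runs.
        results.append(execute(job))
    return results


# Hypothetical task IDs pushed by whoever enqueues experiments.
jobs = queue.Queue()
for task_id in ["task-a", "task-b"]:
    jobs.put(task_id)

executed = run_worker(jobs, execute=lambda task: f"ran {task}")
```

The point is only the division of labor: enqueueing is separate from executing, and whichever machine runs the worker loop does the actual work.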
Is it sufficient to queue the experiments
Yes there is no ne...
This points to the wrong cu117 / driver - could that be?
But thanks to you I realized one thing: I use hparams further in the code, not normalize_and_flat_config(hparams).
This is the main issue. Any reason not to use normalize_and_flat_config(hparams) later in the code?
Or maybe update hparams back from it?
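A rough sketch of the "update back" option, assuming normalize_and_flat_config does something like flattening a nested dict. Its real behavior is not shown in this thread, so flatten and unflatten below are hypothetical stand-ins, and the "/" separator is an assumption.

```python
def flatten(config, parent_key="", sep="/"):
    """Turn {"a": {"b": 1}} into {"a/b": 1}."""
    flat = {}
    for key, value in config.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def unflatten(flat, sep="/"):
    """Inverse of flatten: rebuild the nested dict from "a/b" style keys."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(sep)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


hparams = {"optimizer": {"lr": 0.1, "name": "sgd"}, "epochs": 3}

# The flat copy is what would be connected to the Task and possibly overridden.
flat = flatten(hparams)
flat["optimizer/lr"] = 0.01  # stands in for a value changed from the UI

# Write the (possibly overridden) values back so later code sees them.
hparams = unflatten(flat)
```

Either approach works as long as the code that runs later reads from the same object that received the overrides.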
Was trying to figure out how the method knows that the docker image ID belongs to ECR. Do you have any insight into that?
Basically, you should log docker in to the registry before running the agent; the agent then uses docker to run the image from ECR.
Make sense?
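On the "how does it know the image belongs to ECR" part: as far as I know nothing special is needed, because docker takes the registry from the host prefix of the image name. Here is a small sketch that extracts that prefix and builds the usual login command. The function name ecr_login_command and the example image are made up, and the regex only covers the standard amazonaws.com host pattern.

```python
import re

# Standard private ECR registry host: <12-digit account>.dkr.ecr.<region>.amazonaws.com
ECR_HOST = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com$"
)


def ecr_login_command(image):
    """Return the shell command that logs docker in to the image's ECR
    registry, or None if the image name does not point at ECR."""
    host = image.split("/", 1)[0]
    match = ECR_HOST.match(host)
    if match is None:
        return None
    region = match.group("region")
    return (
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {host}"
    )


# Hypothetical image name, for illustration only.
command = ecr_login_command(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest"
)
```

Run the resulting command on the agent machine (as the same user that runs the agent) before starting the agent; ECR login tokens expire, so it has to be refreshed periodically.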
Just to make sure, the first two steps are working?
Maybe it has to do with the fact that the "training" step specifies a docker image, could you try to remove it and check?
BTW: A few pointers
The return_values is used to specify multiple returned objects stored individually, not the type of the object. If there is a single object, there is no need to specify it.
The parents argument is optional; the pipeline component optimizes execution based on inputs, for example in your code, all pipeline comp...
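To illustrate what "optimizes execution based on inputs" means, here is a toy model in plain Python. It is not ClearML's implementation; infer_parents and the step names are invented. The idea is that if a step consumes a value another step returned, the dependency is already implied and does not have to be spelled out with parents.

```python
def infer_parents(steps):
    """steps maps step name -> (input names, returned value names).

    Returns step name -> sorted list of steps that produce its inputs.
    """
    producer = {
        output: name
        for name, (_, outputs) in steps.items()
        for output in outputs
    }
    return {
        name: sorted({producer[i] for i in inputs if i in producer})
        for name, (inputs, _) in steps.items()
    }


# Hypothetical pipeline: each returned object is named individually,
# the way return_values lists them.
steps = {
    "load": ([], ["raw"]),
    "preprocess": (["raw"], ["x_train", "y_train"]),
    "train": (["x_train", "y_train"], ["model"]),
}

parents = infer_parents(steps)
```

An explicit parents list is only needed when a step must wait for another step whose outputs it does not consume.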
so would that be "tags" "parents" ?
Shouldn't this be a real value and not a template?
You mean the value being pulled to the pod that failed?
If you are using the latest RC:
pip install clearml==0.17.5rc5
You can pass True, and it will use the "files_server" as configured in your clearml.conf
I used the http link as a filler to point to the files_server.
Make sense?
Hi GiddyTurkey39
Are you referring to an already executed Task or the current running one?
(Also, what is the use case here? Is it because the "installed packages" are inaccurate?)
I cannot reproduce, tested with the same matplotlib version and python against the community server
FrothyShark37, what was different in your script?
It is not possible to specify the full output destination, right?
Correct 😞
then when we triggered an inference deploy it failed
How would you control it? Is it based on a Task, like a property "match python version"?
I'm so glad you mentioned the cron job, it would have taken us hours to figure
Hi RoughTiger69
unfortunately, the model was serialized with a different module structure - it was originally placed in a (root) module called model ....
Is this like a pickle issue?
Unfortunately, this doesn’t work inside clear.ml since there is some mechanism that overrides the import mechanism using import_bind.__patched_import3
What error are you getting? (meaning why isn't it working)
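For context on the pickle side of this: pickle stores the module path of a class, so a model saved from a root module called model cannot be loaded once that module is gone. One common workaround, sketched below with the standard library only, is to remap the module name in Unpickler.find_class. The class Net and the new module name new_model are hypothetical, and this has not been checked against the import patching mentioned above.

```python
import io
import pickle
import sys
import types


class Net:
    def __init__(self, weights):
        self.weights = weights


# Pretend Net lived in a root module called "model" when it was saved.
Net.__qualname__ = "Net"
Net.__module__ = "model"
old_module = types.ModuleType("model")
old_module.Net = Net
sys.modules["model"] = old_module
payload = pickle.dumps(Net([1, 2, 3]))

# Simulate the refactor: "model" is gone, the class now lives in "new_model".
del sys.modules["model"]
Net.__module__ = "new_model"
new_module = types.ModuleType("new_model")
new_module.Net = Net
sys.modules["new_model"] = new_module


class RenamingUnpickler(pickle.Unpickler):
    """Redirect lookups for moved modules to their new location."""

    MODULE_RENAMES = {"model": "new_model"}

    def find_class(self, module, name):
        module = self.MODULE_RENAMES.get(module, module)
        return super().find_class(module, name)


restored = RenamingUnpickler(io.BytesIO(payload)).load()
```

If the file is loaded by a framework call rather than by pickle directly, check whether that call accepts a custom pickle module or unpickler before relying on this.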
Oh right, I missed the fact the helper functions are also decorated, yes it makes sense we add the tags as well.
Regarding nested pipelines, I think my main question is: are they independent, or are we generating everything from the same code base?
If you spin up two agents on the same GPU, they are not aware of one another, so this is expected behavior.
Make sense?
you mean in the enterprise
Enterprise with the smarter GPU scheduler. This is an inherent problem of sharing resources and there is no perfect solution: you either have fairness, but then you get idle GPUs, or you have races, where you can get starvation.
It's relatively new and it is great: from the usage aspect it is exactly like user/pass, only the pass is the PAT. It really makes life easier.
Hi PerplexedGoat65
it appears, in a practical sense, this means to mount the second drive, and then bind them in ClearML’s configuration
Yes, the entire data folder (the reason is, if you lose it, you lose all the server storage / artifacts)
Also, thinking about Docker and slower access speed for Docker mounts and such,
If the host OS is Linux, you have nothing to worry about, speed will be the same.
This is part of a more advanced set of features of the scheduler, but only available in the enterprise edition 🙂
Hi WackyRabbit7
the services (or the agent running there) is spinning multiple Tasks at once (as opposed to a regular agent, where it is one Task at a time).
how can I give this agent git access?
In the docker-compose you can configure the git credentials (user/pass or user/key, it is the same).
https://github.com/allegroai/clearml-server/blob/d0e2313a24eb1248ebf0ddf31bf589de0d675562/docker/docker-compose.yml#L137
Notice that you can embed links to a specific view of an experiment by copying the full address bar when viewing it.
The downstream stages are rankN scripts, they are waiting for the IP address of the first stage.
Is this like multi-node training, rather than a pipeline?
MuddyRobin9 are you sure it was able to spin up the EC2 instance? Which clearml version and autoscaler are you running?