Hi CheerfulGorilla72
Notice all posts on that channel are @channel
Oh, yes, that might be (the threshold is 3 minutes if no reports), but you can change that: task.set_resource_monitor_iteration_timeout(seconds_from_start=10)
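A minimal sketch of how that could be wired up (the project/task names here are just placeholders):
```python
from clearml import Task

# placeholder project/task names for illustration
task = Task.init(project_name="examples", task_name="resource-monitor-timeout")

# if no iteration reports arrive within the first 10 seconds,
# the resource monitor falls back to reporting machine stats by seconds
task.set_resource_monitor_iteration_timeout(seconds_from_start=10)
```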
Yes, I mean use the helm chart to deploy the server, but manually deploy the agent glue.
wdyt?
Hmm yeah I can see why...
Now that I think about it, at least in theory the second process that torch creates should inherit from the main one, and as such Task.init is basically "ignored"
Now I wonder why your first version of the code did not work?
Could it be that we patched the argparser on the subprocess and that we should not have?
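In code, the theory above would look roughly like this (project/task names are placeholders, and the "inherited task" behaviour is exactly the assumption being discussed, not a guarantee):
```python
import torch.multiprocessing as mp
from clearml import Task

def train_worker(rank: int) -> None:
    # assumption under discussion: this subprocess inherits the parent's task,
    # so Task.init here is effectively ignored and returns the same task
    task = Task.init(project_name="examples", task_name="ddp-example")
    task.get_logger().report_text(f"worker {rank} attached to task {task.id}")

if __name__ == "__main__":
    Task.init(project_name="examples", task_name="ddp-example")
    mp.spawn(train_worker, nprocs=2)
```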
The agent does not auto-refresh the configuration; after a conf file change you should restart the agent, and it will load the new configuration on startup.
These both point to an NVIDIA docker runtime installation issue.
I'm assuming that in both cases you cannot run the docker manually as well, which is essentially what the agent will have to do ...
UnevenDolphin73
fatal: could not read Username for ' ': terminal prompts disabled .. fatal: clone of ' ' into submodule path '/root/.clearml/vcs-cache/xxx.60db3666b11ac2df511a851e269817ef/xxx/xxx' failed
It seems it tries to clone a submodule and fails due to missing keys for the submodule.
https://stackoverflow.com/questions/7714326/git-submodule-url-not-including-username
wdyt?
Hmm I see your point.
Any chance you can open a github issue with a small code snippet to make sure we can reproduce and fix it?
The docker itself does not have the host configured.
JitteryCoyote63
Sure, just please add a github issue request, so it does not get forgotten.
BTW: wouldn't it be more convenient to configure it in trains.conf?
The point is, "leap" is properly installed, this is the main issue. And although installed, it is missing the ".so"? What am I missing? What are you doing manually that does not show in the log?
In other words, how did you install it "manually" inside the docker, when you mentioned it worked for you when running without the agent?
Not really, the OS will almost never allow for that; it is actually based on fairness and priority. We can set the entire agent to have the same low priority for all of them, then the OS will always take CPU when needed (most of the time it won't), and all the agents will split the CPUs among them, so no one will get starved. With GPUs it is a different story: there is no actual context switching or fairness mechanism like with CPUs.
ReassuredTiger98 could you provide more information? (versions, scenario, etc.)
Will such a docker image need a trains configuration file?
If you need to configure things other than credentials (see above), then yes, you might need to map trains.conf into the pod.
Specifically, if you need, map your trains.conf to /root/.trains inside the pod/container
Hi FancyWhale93, in your clearml.conf configure the default output URI; you can specify the file server as default, or any object storage:
https://github.com/allegroai/clearml-agent/blob/9054ea37c2ef9152f8eca18ee4173893784c5f95/docs/clearml.conf#L409
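If you prefer to set it per task rather than in clearml.conf, it would look roughly like this (the bucket path below is a placeholder):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="upload-artifacts",
    # placeholder bucket; output_uri=True would use the default file server instead
    output_uri="s3://my-bucket/models",
)
```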
QuaintJellyfish58 Notice it tries to access AWS, not your minio.
"This seems like a bug?!"
Can you quickly verify with the previous version?
Also notice you have to provide the minio section in the clearml.conf so it knows how to access the endpoint:
https://github.com/allegroai/clearml/blob/bd53d44d71fb85435f6ce8677fbfe74bf81f7b18/docs/clearml.conf#L113
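Assuming the matching host/credentials entry exists under sdk.aws.s3.credentials in clearml.conf, a minio object is then addressed with the host:port prefix, something like this (host, bucket, and path are placeholders):
```python
from clearml import StorageManager

# minio is addressed as s3://<host>:<port>/<bucket>/<path>
local_path = StorageManager.get_local_copy(
    remote_url="s3://my-minio-host:9000/my-bucket/data/file.csv"
)
print(local_path)
```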
But what should I do? It does not work, it says incorrect password as you can see
How are you spinning the agent machine ?
Basically the 10022 port from the host (agent machine) is routed into the container, but it still needs to be open on the host machine. Could it be it is behind a firewall? Are you (client side running clearml-session) on the same network as the machine running the agent?
but I still need the load balancer ...
No, you are good to go. As long as someone registers the pod's IP automatically on a DNS service (local/public), you can use the registered address instead of the IP itself (obviously with the port suffix).
Thanks for your support
With pleasure!
OK - the issue was the firewall rules that we had.
Nice!
But now there is an issue with the
Setting up connection to remote session
OutrageousSheep60 this is just a warning, basically saying we are using the default signed SSH server key (has nothing to do with the random password, just the identifying key being used for the remote ssh session)
Bottom line, I think you have everything working
Hi FiercePenguin76
should return all datasets from all projects?
Correct
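Something like this sketch, assuming the returned entries are dicts with id/project/name fields:
```python
from clearml import Dataset

# no project filter -> datasets from all projects
for ds in Dataset.list_datasets():
    print(ds.get("id"), ds.get("project"), ds.get("name"))
```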
Hi @<1523715429694967808:profile|ThickCrow29>
Is there a way to specify a callback upon an abort action from the user
You mean abort of the entire pipeline?
None
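For a single task (and the pipeline controller is itself a task), one option, assuming your clearml version exposes Task.register_abort_callback, would be roughly:
```python
from clearml import Task

# placeholder project/task names
task = Task.init(project_name="examples", task_name="abort-callback")

def on_abort():
    # assumption: invoked when the user aborts the task from the UI,
    # before the process is terminated, so cleanup can run here
    print("Task aborted by the user, cleaning up...")

# assumption: register_abort_callback is available in your clearml SDK version
task.register_abort_callback(on_abort)
```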
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production?
(I'm assuming by event-based you mean triggered by events, not streaming data, i.e. ETL etc.)
I know of at least a few large organizations doing that as we speak, so I cannot see any reason not to.
That'd include monitoring and alerting. I'm afraid that Metaflow will look far more compelling to our teams for that reason.
Sure, then use Metaflow. The main issue with Metaflow...
DilapidatedDucks58 You might be able to; check the links, they might be embedded into the docker, so you can map a different png file from the host
BTW: what would you change the icons to?
Oh that is odd... let me check something
- Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
No, you can specify a different code base, see here:
None
- The component code still needs to be self-composed (or, function component can also be quite complex)
Well, it can address the additional repo (it will be automatically added to the PYTHONPATH), and you c...
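A sketch of a component pointing at a different code base (the repo URL, branch, and package names below are placeholders):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    repo="https://github.com/example-org/other-repo.git",  # placeholder repo
    repo_branch="main",
    packages=["pandas"],  # extra packages this step needs
)
def preprocess(data_path: str) -> str:
    # the referenced repo is cloned for this step and added to the PYTHONPATH,
    # so its modules can be imported here
    import pandas as pd  # noqa: F401
    return data_path
```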