Let me try to build a minimal reproducible version
Thank you!
JumpyPig73 Do you see all the configurations under the Args section in the "Configuration" tab?
(Maybe I'm wrong and the latest RC does Not include the python-fire support)
but I believe it should have worked with 0.14.1 as well
Correct
make sure you follow all the steps:
https://clear.ml/docs/latest/docs/deploying_clearml/upgrade_server_linux_mac
(basically make sure you get the latest docker-compose.yml and then pull it)
curl -o /opt/clearml/docker-compose.yml <docker-compose.yml URL from the upgrade docs above>
docker-compose -f /opt/clearml/docker-compose.yml pull
docker-compose -f /opt/clearml/docker-compose.yml up -d
HugeArcticwolf77 oh no, I think you are correct 😞
Do you want to quickly PR a fix?
I have to admit, I'm not sure...
Let me talk to the backend guys. In theory you are correct, the "initial secret" can be injected via the helm env var, but I'm not sure how that would work in this specific case
PompousHawk82 unfortunately this is kind of binary, either you have full tracking of load/save operations or you do not.
This warning message will disappear in the next version as we will be able to log multiple models under the same Task :)
Thanks TroubledJellyfish71 I managed to locate the bug (and indeed it's the new aarch64 package support)
I'll make sure we push an RC in the next few days; until then, as a workaround, you can put the full link (http) to the torch wheel
BTW: 1.11 is the first version to support aarch64, if you request a lower torch version, you will not encounter the bug
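For the workaround, the Task's "Installed Packages" would end up with the direct wheel URL instead of the version pin. Just a sketch, the URL below is a placeholder; grab the real one for your Python/platform from the official PyTorch wheel index:
```
# instead of:
# torch==1.11.0
# use the full link to the wheel, e.g. (placeholder URL):
https://download.pytorch.org/whl/<platform>/torch-<version>-<abi>-<platform>.whl
```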
Hmm HandsomeGiraffe70
This seem like a bug, let me see what we can do about that 🙂
Could it be the parent version was created with an older version of the clearml SDK?
Hi @<1578193384537853952:profile|MoodyOx45>
I have a task A that creates another task B via subprocess.
So the thing about the agent: when it runs the code, there is only One task to rule them all. Basically any fork/spawn of a subprocess will automatically be logged into the parent Task
I think that what you want is to build a pipeline from those Tasks? Or create a Task and enqueue it manually directly from Task A?
(btw: you can forcefully cause the subprocess to create its own Task b...
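If a pipeline from those Tasks is what you're after, a rough sketch would look something like this (project/task/queue names are placeholders, not your actual ones):
```
from clearml import PipelineController

# create the pipeline controller (names are placeholders)
pipe = PipelineController(name="my pipeline", project="examples", version="1.0")

# reuse existing Tasks as pipeline steps
pipe.add_step(name="step_A", base_task_project="examples", base_task_name="Task A")
pipe.add_step(name="step_B", parents=["step_A"],
              base_task_project="examples", base_task_name="Task B")

# launch the pipeline (use start_locally() if you want to debug the controller itself)
pipe.start(queue="default")
```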
Each user creates a .env file for their needs or exports them in the shell running the python code. Currently I copy the environment variables to an S3 bucket and download it from there
That is a great hack, but who carries the credentials for the S3 bucket? The reason for asking is I'm thinking maybe the code would directly do that (meaning download the .env file and apply the variables?!)
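Something along these lines is what I had in mind; a minimal sketch, assuming python-dotenv is installed and the bucket path is just illustrative:
```
from clearml import StorageManager
from dotenv import load_dotenv

# download (and cache) the .env file from the bucket (path is a placeholder)
local_env = StorageManager.get_local_copy(remote_url="s3://my-bucket/configs/.env")

# apply the variables to the current process
load_dotenv(local_env)
```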
Is it possible to perform debugging operations with the PyCharm integration using a remote session?
Sure, use clearml-session; it will open an SSH connection to the remote machine, then you can use PyCharm
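Roughly like this (queue name and docker image are placeholders, pick whatever fits your setup):
```
pip install clearml-session

# spin up an interactive session on a worker pulling from the "default" queue
clearml-session --queue default --docker nvidia/cuda:11.6.2-runtime-ubuntu20.04
```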
docstring ?
Usually the preferred way is StorageManager
https://clear.ml/docs/latest/docs/references/sdk/storage
https://clear.ml/docs/latest/docs/integrations/storage
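A minimal sketch of what I mean (file names and bucket URLs are placeholders):
```
from clearml import StorageManager

# upload a local file to any supported storage (S3/GS/Azure/...)
remote_url = StorageManager.upload_file(
    local_file="data/model.pkl",
    remote_url="s3://my-bucket/models/model.pkl",
)

# later, download (and cache) it on any machine
local_copy = StorageManager.get_local_copy(remote_url="s3://my-bucket/models/model.pkl")
```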
but I still need the load balancer ...
No, you are good to go. As long as someone registers the pod's IP automatically on a DNS service (local/public), you can use the registered address instead of the IP itself (obviously with the port suffix)
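For example, assuming the server gets registered as clearml-server.internal (hypothetical hostname), the clients' clearml.conf would just point at that name with the usual ports:
```
api {
    web_server:   http://clearml-server.internal:8080
    api_server:   http://clearml-server.internal:8008
    files_server: http://clearml-server.internal:8081
}
```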
Thanks for your support
With pleasure!
Suppose that I have three models and these models can't be loaded simultaneously into GPU memory
Oh!!!
For now, this is the behavior I observe: Suppose I have two models, A and B. ....
Correct
Yes this is a current limitation of the Triton backend BUT!
we are working on a new version that does Exactly what you mentioned (because it is such a common case, where some models are not used that frequently)
The main caveat is the loading time, re-loading models from dist...
The remaining problem is that this way, they are visible in the ClearML web UI which is potentially unsafe / bad practice, see screenshot below.
Ohhh that makes sense now, thank you 🙂
Assuming these are one-time credentials for every agent, you can add these arguments in the "extra_docker_arguments" section in clearml.conf
Then make sure they are also listed in hide_docker_command_env_vars, which should cover the console log as well
https://github.com/allegroai/clearml-agent/blob/26e6...
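If I remember correctly, the agent section of clearml.conf supports something like the following (variable name/value are placeholders, double-check the field names against the linked config reference):
```
agent {
    # extra arguments passed to every docker run the agent spins up
    extra_docker_arguments: ["-e", "MY_SECRET_TOKEN=abcd1234"]

    hide_docker_command_env_vars {
        enabled: true
        # also mask this variable in the printed docker command / console log
        extra_keys: ["MY_SECRET_TOKEN"]
    }
}
```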
ScantWorm7
TensorBoard is automatically captured and sent to the trains server. This is in addition to the local copy of your TB files; actually, in most cases the local copy is redundant
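In other words, the only thing needed is a Task.init call before the TB writer is created; a minimal sketch (project/task names are placeholders):
```
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name="examples", task_name="tb demo")

writer = SummaryWriter("runs/tb_demo")
writer.add_scalar("loss", 0.1, global_step=1)  # also shows up in the server UI
```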
let me check when a fix can be deployed for Hydra...
To auto upload the model you have to tell clearml to upload it somewhere, usually by passing output_uri to Task.init or setting the default_output_uri in the clearml.conf
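For example (the bucket is a placeholder; any output_uri destination works):
```
from clearml import Task

# any model stored by the framework will also be uploaded to this destination
task = Task.init(
    project_name="examples",
    task_name="train",
    output_uri="s3://my-bucket/models",
)
```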
What will I do to fix my problem?
What is the problem? We just proved the upload speed is fine, no?
Sorry @<1798525199860109312:profile|IntriguedGoldfish14> just noticed your reply
Yes, two inference containers, running simultaneously on the cluster. As you said, each one with its own environment (assuming here that the requirements of the models collide)
Makes sense
I have a model and hundreds of thousands of inference records for that model.
What would be the query? Are you reporting 100+ different scalars?
As we can’t create keys in our AWS due to infosec requirements
Hmmm
correct on both.
notice that with upload you can specify any storage (S3/GS/Azure etc.)
From the docs, I think what's going on is that https://opennmt.net/OpenNMT-tf/package/opennmt.Runner.html#opennmt.Runner.train spins up a new subprocess, and the training itself happens in the subprocess.
If this is the case this will explain the lack of automagic, as the subprocess is lacking the "Task.init" call
wdyt, could that be the case?