Nice! So out of curiosity why didn't it work this time and you had to do it manually?
CheerfulGorilla72 could it be the server address changed when migrating?
Yes, it is reproducible. Do you want a snippet?
Already fixed 🙂 please ping tomorrow, I think an RC with the fix should be out soon
this is very odd, can you post the log?
@<1523702932069945344:profile|CheerfulGorilla72> use the following bucket name when you are configuring your files/output uri
s3://<iphere>:<porthere>/<bucket_here>
From there everything should work as expected
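For reference, a minimal sketch of what that looks like in code (project/task names are just examples, and the endpoint/bucket are the placeholders from above; the matching credentials still need to be set in your clearml.conf under sdk.aws.s3):

    from clearml import Task

    # point the task's output (artifacts/models) at the S3-compatible server
    task = Task.init(
        project_name="examples",
        task_name="s3 output",
        output_uri="s3://<iphere>:<porthere>/<bucket_here>",
    )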
Yep it is the scale 🙂 and yes it should appear once you upgrade
Hi @<1739818374189289472:profile|SourSpider22>
could you send the entire console log? maybe there is a hint somewhere there?
(basically what happens after that is the agent is supposed to be running from inside the container, but maybe it cannot access the clearml-server for some reason)
we run in containers without venv, in the main section, and then delete it or use it for similar experiments
If this is the case, then the idea is that the venv creation is actually cached; you can turn it on here (uncomment the line):
https://github.com/allegroai/clearml-agent/blob/51eb0a713cc78bd35ca15ed9440ddc92ffe7f37c/docs/clearml.conf#L116
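For convenience, this is roughly the relevant block in clearml.conf (a sketch based on the linked default config; the exact defaults may differ in your version):

    # clearml.conf
    agent {
        venvs_cache: {
            max_entries: 10
            free_space_threshold_gb: 2.0
            # uncomment ("unmark") this line to enable venv caching
            path: ~/.clearml/venvs-cache
        }
    }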
@<1671689437261598720:profile|FranticWhale40> could you test the fix? just pull & run
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
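i.e. something along the lines of (assuming a plain docker setup, adjust to your compose file if needed):

    docker pull allegroai/clearml-serving-triton:1.3.1
    docker pull allegroai/clearml-serving-inference:1.3.1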
Hi GrotesqueOctopus42 ,
BTW: is it better to post the long error message on a reply to avoid polluting the channel?
Yes, that is appreciated 🙂
Basically logs in the thread of the initial message.
To fix this I had to spin up the agent using the --cpu-only flag (--docker --cpu-only)
Yes, if you do not specify --cpu-only it will default to trying to access the GPUs
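For example (a sketch; the queue name is just an example):

    clearml-agent daemon --queue default --docker --cpu-only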
Nice!
ElegantKangaroo44 I think TrainsCheckpoint
would probably be the easiest solution. I mean it will not be a must, but another option to deepen the integration, and allow us more flexibility.
which part of the code?
the main script?!
but is not part of the package
is the repo itself a package?
But first I want to make sure the verify argument is actually used, hence False
Maybe we should rename it?! it actually creates a Task but will not auto connect it...
Added -v /home/uname/.ssh:/root/.ssh and it resolved the issue. I assume this is some sort of a bug then?
That is supposed to be automatically mounted. Having SSH_AUTH_SOCK defined means the agent has to mount the SSH_AUTH_SOCK socket so that the container can access it.
Try running with SSH_AUTH_SOCK undefined and keep force_git_ssh_protocol
(no need to manually add the .ssh mount it will do that for you)
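Roughly, the idea is something like this (a sketch, adjust to your environment; the queue name is just an example):

    # make sure the ssh-agent socket is NOT forwarded to the agent
    unset SSH_AUTH_SOCK
    # keep forcing SSH for git in clearml.conf:
    #   agent.force_git_ssh_protocol: true
    clearml-agent daemon --queue default --docker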
When I start the serving containers it can't retrieve the model:
Hi BrightRabbit75
I think you need to pass the credentials for your S3 account to the clearml-serving containers
Basically just add AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY to your docker compose:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L110
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666e...
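i.e. something like this under the relevant serving services in the compose file (a sketch; the values are placeholders):

    environment:
      AWS_ACCESS_KEY_ID: "<your_access_key>"
      AWS_SECRET_ACCESS_KEY: "<your_secret_key>"
      AWS_DEFAULT_REGION: "<your_region>"  # if your bucket needs it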
os.environ['TRAINS_PROC_MASTER_ID'] = args.trains_id
it should be '1:'+args.trains_id
os.environ['TRAINS_PROC_MASTER_ID'] = '1:{}'.format(args.trains_id)
Also str(randint(1, sys.maxsize))
Could it be in a python at_exit event ?
Hi AttractiveCockroach17
In your "Installed Packages" (when the task is in draft mode, you can edit it like any requirements.txt), you need to add:package @ git+
You can also make sure you have in in the first place bu addingTask.add_requirements("package", "@ git+
") task = Task.init(...)
so if I plot an image with matplotlib it would not upload? I need to use the logger.
Correct, if you have no "main" task, no automagic 🙂
so how can i make it run with the "auto magic"
Automagic logs a single instance... unless those are subprocesses, in which case, the main task takes care of "copying" itself to the subprocess.
Again what is the use case for multiple machines?
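(For the matplotlib question above, a minimal sketch of getting the automagic capture; project/task names are just examples:)

    from clearml import Task
    import matplotlib.pyplot as plt

    # as long as a "main" task exists, shown figures are captured automatically
    task = Task.init(project_name="examples", task_name="matplotlib automagic")

    plt.plot([1, 2, 3], [4, 5, 6])
    plt.title("my plot")
    plt.show()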
Hi @<1523701132025663488:profile|SlimyElephant79>
I would like to save only the last & best checkpoints and not all of them if possible.
Basically it will mimic the local file system, so if you overwrite the local files it will overwrite the remote model.
You can also disable auto logging, and manually upload the models. In Task.init, pass auto_connect_frameworks with False for the specific framework, see:
https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk/#automatic-lo...
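Something along these lines (a sketch; the framework key and file name are examples):

    from clearml import Task, OutputModel

    # disable automatic model logging for a specific framework (e.g. pytorch)
    task = Task.init(
        project_name="examples",
        task_name="manual model upload",
        auto_connect_frameworks={"pytorch": False},
    )

    # ... training loop, keep overwriting best_model.pt locally ...

    # manually upload only the checkpoint you care about
    output_model = OutputModel(task=task)
    output_model.update_weights(weights_filename="best_model.pt")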
RoundMosquito25 how is that possible? Could it be they are connected to a different server?
Hi RoundMosquito25
Hi, are there available somewhere examples of testing in ClearML? For example unit tests that check if parameters are passed correctly to new tasks etc.?
What do you mean by "testing in ClearML" ?
For example unit tests that check if parameters are passed correctly
Passed where / how? Are we thinking agents here ?
Hmm, I see the jump from 50 to 100, is that consistent with the last iteration on the aborted Task (before continuing)?
TenseOstrich47 this looks like elasticsearch is out of space...
we will try to use Triton, but it's a bit hard with a transformer model.
Yes ...
All extra packages we add in serving)
So it should work. You can also run your preprocess class manually on your own machine (for debugging) if you pass it a local file (basically the model file downloaded from the UI), it should work
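e.g. roughly (a sketch, assuming your Preprocess class follows the clearml-serving preprocess.py template; method names/signatures may differ in your version):

    # run the serving preprocess code locally, outside the container
    from preprocess import Preprocess  # your own clearml-serving preprocess.py

    p = Preprocess()
    p.load("/path/to/model_file_downloaded_from_the_ui")  # local copy of the model
    sample_body = {"some_input": [1, 2, 3]}               # whatever your endpoint expects
    print(p.preprocess(sample_body, state={}, collect_custom_statistics_fn=None))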
it. But it's maybe not the best solution
Yes... it is not; separating the pre/post to a CPU instance and letting Triton do the GPU serving is a lot more efficient...
Sure thing, anyhow we will fix this bug so next version there is no need for a workaround (but the workaround will still hold so you won't need to change anything)
Oh, then just make sure you call Task.init in your code,
as long as you have clearml.conf in the container or pass the ENV variables to configure your clearml, it should just work
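For example (a sketch; hosts, keys, image and script names are all placeholders, and the env var names are the standard ClearML configuration variables):

    docker run \
      -e CLEARML_API_HOST="https://api.clear.ml" \
      -e CLEARML_WEB_HOST="https://app.clear.ml" \
      -e CLEARML_FILES_HOST="https://files.clear.ml" \
      -e CLEARML_API_ACCESS_KEY="<access_key>" \
      -e CLEARML_API_SECRET_KEY="<secret_key>" \
      my-training-image python train.py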
@<1533620191232004096:profile|NuttyLobster9> I think we found the issue: when you are passing a direct link to the python venv, the agent fails to detect the python version, and since the python version is required for fetching the correct torch, it fails to install it. This is why passing CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none worked,
because it skipped resolving the torch / cuda version (which requires parsing the python version)
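i.e. as a workaround, something like (the queue name is just an example):

    CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none clearml-agent daemon --queue default --docker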