This means that if something happens with the k8s node the pod runs on,
Actually, if the pod crashed (the pod, not the Task), k8s should re-spin it, no?
I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.
From the k8s perspective, if the Task ended (failed/completed) the pod always returns with exit code 0, i.e. success, because the agent was able to spin the Task. We do not want Tasks that raised an exception to litter the k8s with endless r...
LOL, my pleasure - I guess we should have a link in the docstring of add_requirements to set_packages, I will tell the guys
:) yes, on your gateway/firewall point http://demoapi.trains.allegro.ai to 127.0.0.1. That's always good practice ;)
Hi @<1558624430622511104:profile|PanickyBee11>
You mean this is not automatically logged? Do you have a callback that logs it in HF?
Thanks! A few thoughts below 🙂
- not true - you can specify the image you want for each step
My apologies, looking at the release notes, it was added a while back and I had not noticed 🙂
- re: role-based access control - see Outerbounds Platform that provides a layer of security and auth features required by enterprises
Role-based access meaning limiting access in Metaflow, i.e. specific users/groups can only access specific projects, etc. ...
@<1545216070686609408:profile|EnthusiasticCow4> git+ssh:// will be converted automatically to git+https if you have user/pass configured in your clearml.conf on the agent machine.
Moreover, git packages are always installed after all other packages (because pip cannot resolve the requirements inside the git repo in time).
Hi @<1523702932069945344:profile|CheerfulGorilla72>
This is a property on the Model object
model.published
Not sure why we do not have it here...
(I'll ask them to fix that)
"erasing" all the packages that had been set in the base task I'm cloning from. I
Set is not add - if you are calling set_packages, you are overwriting all of them with this single call.
You can however do:
task_data = task.export_task()
# the "pip" entry holds the requirements as a requirements.txt style string
requirements = task_data["script"]["requirements"]["pip"]
requirements += "\nnew_package"  # placeholder for the package(s) you want to add
task.set_packages(requirements.splitlines())
I guess we should have get_requirements ?!
Hi @<1546665666675740672:profile|AttractiveFrog67>
- Make sure you stored the model's checkpoint (either pass output_uri=True in Task.init or manually upload)
- When you call Task.init pass "continue_last_task=True"
- Now you can do last_checkpoint = task.models["output"][-1].get_local_copy() and all you need is to load last_checkpoint (see the sketch below)
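A minimal sketch of those steps, assuming placeholder project/task names and a PyTorch checkpoint:

from clearml import Task

# re-attach to the previous run instead of creating a new task
task = Task.init(
    project_name="examples",   # placeholder
    task_name="my_training",   # placeholder
    continue_last_task=True,
    output_uri=True,           # make sure checkpoints are uploaded
)

# fetch the latest output model registered on the task
last_checkpoint = task.models["output"][-1].get_local_copy()

# load it with your framework of choice, e.g.:
# model.load_state_dict(torch.load(last_checkpoint))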
Is it possible to launch a task from Machine C to the queue that Machine B's agent is listening to?
Yes, that's the idea
Do I have to have anything installed (aside from the trains PIP package) on Machine C to do so?
Nothing, pure magic 🙂
Could it be these packages (i.e. numpy etc.) are not installed as system packages in the docker (i.e. they are inside a venv, inside the docker)?
I am writing quite a bit of documentation on the topic of pipelines. I am happy to share the article here, once my questions are answered and we can make a pull request for the official documentation out of it.
Amazing, please share once done, I will make sure we merge it into the docs!
Does this mean that within component or add_function_step I cannot use any code of my current directory's code base, only code from external packages that are imported - unless I add my code with ...
ScaryKoala63
When it fails, what's the number of files you have in /home/developer/.clearml/cache/storage_manager/global/ ?
Hi @<1687643893996195840:profile|RoundCat60>, I just saw the message,
Just by chance I set the SSH deploy keys to write access and now we're able to clone the repo. Why would the SSH key need write access to the repo to be able to clone?
Let me explain: the default use case for the agent is to use user/pass (as configured in the clearml.conf file).
It will change any ssh links to https links and will add the credentials to clone the repository.
You can also provide SSH keys (basicall...
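For reference, a minimal sketch of the relevant section in the agent's clearml.conf (placeholder values, adjust to your git server):

agent {
    # with these set, the agent rewrites ssh:// links to https:// and injects the credentials when cloning
    git_user: "my_git_user"
    git_pass: "my_git_token_or_password"
}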
Hi RobustRat47
What do you mean by "log space for hyperparameter", what would be the difference? (Notice that on the graph itself you can switch to log scale when viewing in the UI)
Or are you referring to the hyperparameter optimization, allowing you to add log space?
Hi SoreDragonfly16
Sadly no, the idea is to give full visibility to all users in the system (basically saying: share everything with your colleagues).
That said, I know the enterprise version has permission/security features; I'm sure it covers this scenario as well.
but somewhere along the way, the request actually removes the header
Where are you seeing the returned value?
If there is a new issue, I will let you know in a new thread.
Thanks! I would really like to understand what is the correct configuration
and I install the tar
I think the only way to do that is to add it to the docker bash setup script (this is a bash script executed inside the container before the Task starts) - see the sketch below.
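If it helps, a minimal sketch using set_base_docker (the image name, paths and commands are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="docker setup example")  # placeholder names

# these bash commands run inside the container before the Task itself starts
task.set_base_docker(
    docker_image="python:3.9",
    docker_setup_bash_script=[
        "tar -xzf /data/my_package.tar.gz -C /tmp",  # hypothetical tar location
        "pip install /tmp/my_package",
    ],
)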
I see, let me check the code and get back to you, this seems indeed like an issue with the Triton configuration in the model monitoring scenario.
Hi @<1538330703932952576:profile|ThickSeaurchin47>
Specifically I'm getting the error "could not access credentials"
Put your minio credentials here:
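A minimal sketch of the relevant clearml.conf section (placeholder endpoint and keys, adjust to your minio setup):

sdk {
    aws {
        s3 {
            credentials: [
                {
                    # placeholder minio endpoint and access keys
                    host: "my-minio-host:9000"
                    key: "minio_access_key"
                    secret: "minio_secret_key"
                    multipart: false
                    secure: false
                }
            ]
        }
    }
}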
Hi PerplexedCow66
I would like to know how to serve a model, even if I do not use any serving engine
What do you mean no serving engine, i.e. custom code?
Besides that, how can I manage authorization between multiple endpoints?
Are you referring to limiting access to all the endpoints?
How can I manage API keys to control who can access my endpoints?
Just to be clear, accessing the endpoints has nothing to do with the clearml-server credentials, so are you asking how to...
Should be fairly easy to add no?
at the end of the manual execution
More detailed instructions:
https://github.com/allegroai/trains-agent#installing-the-trains-agent
clearml-agent repo please 🙂
Yes! That's exactly what I meant. As you can see, the Triton backend was not able to load your model - I'm assuming because it was not converted to TorchScript, like we do in the original example:
https://github.com/allegroai/clearml-serving/blob/6c4bece6638a7341388507a77d6993f447e8c088/examples/pytorch/train_pytorch_mnist.py#L136
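For reference, a minimal sketch of such a conversion (a toy model stands in for your trained network, the filename is a placeholder):

import torch
import torch.nn as nn

# toy model standing in for the trained network
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

# convert to TorchScript so the Triton PyTorch backend can load it
scripted_model = torch.jit.script(model)
scripted_model.save("serving_model.pt")  # placeholder filename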
Hmm SuccessfulKoala55 any chance the nginx http was pushed to v1.1 on the latest cloud helm chart?
Hi PricklyGiraffe97
Is this related?
https://clear.ml/blog/increase-huggingface-triton-throughput-by-193/