About .get_local_copy... would that then work in the agent though?
Yes it would work both locally (i.e. without agent) and remotely
Because I understand that there might not be a local copy in the Agent?
If the file does not exist locally it will be downloaded and cached for you
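For example, a minimal sketch (the project and dataset names are hypothetical):

```python
from clearml import Dataset

# hypothetical project/dataset names, for illustration only
ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
# returns a local cached path; the files are downloaded first if they are not already there
local_path = ds.get_local_copy()
print(local_path)
```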
@<1569496075083976704:profile|SweetShells3> remove these from your config.pbtxt:
name: "conformer_encoder"
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"
Second, what do you have in your preprocess_encoder.py?
And where are you getting the error? (is it from the Triton container, or from the REST request?)
Nice SoreHorse95 !
BTW: you can edit the entire OmegaConf YAML externally with the set/get configuration-object calls (name="OmegaConf"); do notice you will need to change Hydra/_allow_omegaconf_edit_ to true
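For example, a rough sketch (the task id and the string edit are illustrative only):

```python
from clearml import Task

task = Task.get_task(task_id="aabbccdd")  # hypothetical task id
# pull the full OmegaConf YAML as text, tweak it, and push it back
yaml_text = task.get_configuration_object(name="OmegaConf")
yaml_text = yaml_text.replace("lr: 0.001", "lr: 0.01")  # naive string edit, illustration only
task.set_configuration_object(name="OmegaConf", config_text=yaml_text)
```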
Hi @<1610083503607648256:profile|DiminutiveToad80>
This depends on how you configure the agents in your clearml.conf
You can do https if user/pass are configured, and you can force SSH and it will auto-mount your host SSH folder into the container and use it.
https://github.com/allegroai/clearml-agent/blob/0254279ed5987fbc69cebae245efaea33aec1ff2/docs/cl...
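As a sketch, the relevant agent-side clearml.conf settings look roughly like this (values are illustrative):

```
agent {
    # force cloning over SSH; the agent auto-mounts the host SSH folder into the container
    force_git_ssh_protocol: true

    # or, for https cloning, configure credentials instead:
    # git_user: "my-git-user"
    # git_pass: "my-access-token"
}
```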
Hi @<1610083503607648256:profile|DiminutiveToad80>
do you have a full log? can you share the code you are trying to run?
for a GPU with more than 16GB GRAM and less than 40GB, so sometimes we need to provision an A100 to get the training speed we want, but we don't use all the GRAM
Oh that makes sense...
Just saw this one, this might help?
https://www.globenewswire.com/news-release/2022/10/24/2539924/0/en/ClearML-and-Genesis-Cloud-Announce-New-MLOps-Partnership-Delivering-100-Green-Energy-Compute-Solution-for-Machine-Learning.html
Please do, just so it won't be forgotten (it won't, but for the sake of transparency)
well it should fail, but I think the error message should be fixed 🙂
maybe: `ValueError: dataset 'tmp_datset' not found in project 'lavi-testing'` wdyt?
Where did you add the Task.init call ?
My apologies, you are correct: 1.8.1rc0 🙂
Hi PanickyMoth78
Yes, I think you are correct, this looks like GS throttling your connection. You can control the number of concurrent uploads with max_workers=1
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/datasets/dataset.py#L604
Let me know if it works
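For reference, a small sketch (project/dataset names and the local path are hypothetical):

```python
from clearml import Dataset

ds = Dataset.create(dataset_project="examples", dataset_name="my_dataset")
ds.add_files("/path/to/local/files")
# serialize the uploads to avoid GS throttling
ds.upload(max_workers=1)
ds.finalize()
```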
PanickyMoth78 quick update the fix is already being tested, I'm hoping an RC tomorrow 🙂
Hi @<1533620191232004096:profile|NuttyLobster9>
base_task_factory is a function that gets the node definition and returns a Task to be enqueued,
pseudo code looks like:

from clearml import Task
from clearml.automation import PipelineController

def my_node_task_factory(node: PipelineController.Node) -> Task:
    # build a new Task from the node definition (fill in the Task.create arguments you need)
    task = Task.create(...)
    return task
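And a rough usage sketch, assuming add_step is given the factory via its base_task_factory argument (pipeline/step names are hypothetical):

```python
from clearml.automation import PipelineController

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")
# each Task for this step is created by the factory above
pipe.add_step(name="step_one", base_task_factory=my_node_task_factory)
pipe.start()
```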
Make sense ?
You mean parameters of the pipeline? Is this a pipeline from Tasks or from function decorators?
WickedGoat98 you mean the server is on your home network and the agents are in a VPS?
If this is the case, then regular "clearml-server" port forwarding is the only thing you need.
TCP ports 8008/8080/8081
Notice that on the agents you will have to specify the address of your home IP.
I would recommend using a host name and not an IP, since the artifact/debug sample links will contain direct links into the file server, and it is always safer to have a host name rather than an IP that can change.
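For example, the agent-side clearml.conf api section would point at the host name (the host name itself is illustrative):

```
api {
    web_server: http://clearml-server.home:8080
    api_server: http://clearml-server.home:8008
    files_server: http://clearml-server.home:8081
}
```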
UnevenDolphin73 I would use the APIClient:

from clearml.backend_api.session.client import APIClient
APIClient().projects.edit(project=project_id, system_tags=[])

*I might have a few typos above but that should be the gist
Bummer... that seems like a bit of an oversight tbh.
There is never a solution for those, unless the helm chart "knows" something about the server before spinning it up for the first time, which basically means a predefined access key, and I do not think we want that 😉
Sometimes it is working fine, but sometimes I get this error message
@<1523704461418041344:profile|EnormousCormorant39> can I assume there is a gateway at --remote-gateway <internal-ip>?
Could it be that this gateway has some network firewall blocking some of the traffic ?
If this is all local network, why do you need to pass --remote-gateway ?
Sure. JitteryCoyote63 so what was the problem? can we fix something?
Hey JitteryCoyote63 I think I need to better explain the config feature:
agent.package_manager.post_packages = ["PyJWT"]
Basically this means that IF you have PyJWT in the installed packages, it will be installed after everything else is installed.
This doesn't mean it will always be installed.
Think for example "horovod" has to be installed after you have TF / PyTorch installed.
(The same goes for "pre_package" and Cython)
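As a sketch, the relevant clearml.conf section (priority_packages is my assumption for the "pre" side, based on the agent's sample conf):

```
agent {
    package_manager {
        # installed before anything else (e.g. Cython)
        priority_packages: ["cython", "numpy", "setuptools"]
        # installed only after all other requirements (e.g. horovod needs TF/PyTorch first)
        post_packages: ["horovod", "PyJWT"]
    }
}
```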
BTW: there is a full Pipeline class that does everything for you, example here:
https://github.com/allegroai/clearml/tree/master/examples/pipeline
Hi JitteryCoyote63 ,
When you shut down the task (manually with close() or when the process finishes) it waits for the uploads...
Why do you need to specifically wait for all the artifact uploads? (currently you can stop the artifacts upload thread and wait for all the artifacts, but that seems like a bad hack)
okay, wait, I'll see if I can come up with something.
Ohh, the controller task itself holds the artifacts ?
I see. If you are creating the task externally (i.e. from the controller), you should probably call task.close(); it will return when everything is in order (including artifacts uploaded, and other async stuff).
Will that work?
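Something like this minimal sketch (project/task names and the artifact are hypothetical):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="my-task")
task.upload_artifact(name="results", artifact_object={"acc": 0.9})
# close() returns only when everything is in order, including async artifact uploads
task.close()
```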
Let me check, it was supposed to be automatically aborted
GentleSwallow91 what you are looking for is here 🙂
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149