I'm assuming those errors are from the triton containers? Were you able to run the simple pytorch mnist example serving from the repo?
@<1523707653782507520:profile|MelancholyElk85>
What's the clearml version you are using?
Just making sure... base_task_id has to point to a Task that is in "draft" mode, for the pipeline to use it
@<1523707653782507520:profile|MelancholyElk85> I just ran a single-step pipeline and it seemed to use the "base_task_id" without cloning it...
Any insight on how to reproduce ?
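(For reference, my single-step test looked roughly like this; the project name and task ID below are placeholders:)
from clearml.automation import PipelineController
pipe = PipelineController(name="single step test", project="examples", version="1.0.0")
# base_task_id must point to a Task that is in "draft" mode
pipe.add_step(name="step_one", base_task_id="<draft task id>")
pipe.start_locally(run_pipeline_steps_locally=True)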
Hi @<1699955693882183680:profile|UpsetSeaturtle37>
What's your clearml-session version? where is the remote machine ?
And yes, if the network connection is bad we have seen this behavior, you can try with --keepalive=true
Notice that these are SSH networking issues, not something to do with the clearml-session layer; the --keepalive is trying to automatically detect these disconnects and make sure it reconnects for you.
And after having called Task.init() the second time, the automatic logging of resources and tensorboard plots works as well. I would recommend adding an explanation to the docs for
Oh yeah! You always need to call Task.init first; Task.current_task() can be called from anywhere you like, but only after Task.init was called.
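i.e. roughly (project/task names are placeholders):
from clearml import Task
task = Task.init(project_name="examples", task_name="demo")  # always call this first
def helper():
    # anywhere later in the code, after Task.init was already called
    current = Task.current_task()
    current.get_logger().report_text("reporting from a helper function")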
Hi JitteryCoyote63 , let me check, this backwards compatibility might only apply for API version mismatch between the client and server.
The log is missing, but the Kedro logger prints to sys.stdout in my local terminal.
I think the issue might be that it starts a new subprocess, and that subprocess is not "patched" to capture the console output.
That said if an agent is running the entire pipeline, then everything is logged from the outside, so whatever is written to stdout/stderr is captured.
This code will give you one graph titled "loss" with two series: (1) trains (2) loss
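For reference, the equivalent with the explicit Logger API would look roughly like this (project/task names and values are made up):
from clearml import Task
task = Task.init(project_name="examples", task_name="loss demo")
logger = task.get_logger()
for i in range(100):
    # both series end up on the same "loss" graph, as two separate lines
    logger.report_scalar(title="loss", series="trains", value=1.0 / (i + 1), iteration=i)
    logger.report_scalar(title="loss", series="loss", value=0.5 / (i + 1), iteration=i)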
StickyLizard47 apologies for https://github.com/allegroai/clearml-server/issues/140 not being followed up (it probably slipped through the cracks on the backend side, I can see the 1.5 release happened in parallel). Let me make sure it is followed up on.
SarcasticSquirrel56 specifically, did you also spin a clearml-k8s glue? or are the agents statically allocated on the helm chart?
Hi @<1523701066867150848:profile|JitteryCoyote63>
Hi, how does agent.enable_git_ask_pass work?
Basically it pushes the password through stdin to git when git asks for it (it is a git feature)
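If you want to set it explicitly in clearml.conf, I believe it would look something like this (a sketch, assuming it sits under the agent section as the dotted name suggests):
agent {
    # the agent feeds the git password to git over stdin when git asks for it
    enable_git_ask_pass: true
}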
Hi @<1715900788393381888:profile|BitingSpider17>
Notice that you need __ (double underscore) for converting "." in the clearml.conf file,
this means agent.docker_internal_mounts.sdk_cache will be CLEARML_AGENT__AGENT__DOCKER_INTERNAL_MOUNTS__SDK_CACHE
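For example, to override that mount path via an environment variable when launching the agent (the path value here is just an illustration):
# example only, pick whatever container path you actually want mounted
export CLEARML_AGENT__AGENT__DOCKER_INTERNAL_MOUNTS__SDK_CACHE=/clearml_agent_cache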
See here:
https://download.pytorch.org/whl/torch_stable.html
cu110/* has no torch 1.3.1 only 1.7.0
I didn't see this "publish" option for pipelines, just for models, is this a new feature?
Kind of hidden in the UI (not sure if on purpose), but if you click on the pipeline then go to details, in the new tab (of the pipeline Task) you can publish the Task (aka the pipeline)
In this example:
https://github.com/allegroai/clearml-actions-train-model/blob/7f47f16b438a4b05b91537f88e8813182f39f1fe/train_model.py#L14
replace with something like:
` task = Task.get_tasks(project_name="pipel...
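Something along these lines (the project/task names here are placeholders, not the actual values from the truncated snippet above):
from clearml import Task
# query existing tasks by project (and optionally by name), then pick the one you need
tasks = Task.get_tasks(project_name="my_pipeline_project", task_name="train_model")
task = tasks[0]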
But I am starting to wonder whether it would be easier just changing sys.path in the scripts that use the sibling libs.
that depends, how would the sibling packages get to a remote machine ?
The docker crashes and I want to be able to debug it exactly as it is run by the agent
On your machine (any machine)
pip install clearml-agent
clearml-agent build --id <taskID> --docker "local_mydocker_name"
docker run -it local_mydocker_name bash
- Agent on laptop, Server on Kube - Fail
So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube
Yeah that feels like a network config issue...
Is there a verbose setting in the agent that could help us diagnose,
yes, running with debug turned on.
since you managed to reproduce on your laptop you can try to run the agent with --debug to test, specifically:
clearml-agent --debug daemon ....
if you are running it in venv mode (which I think ...
Hi FierceHamster54
Dataset download is already multi-threaded
But yes get_local_copy() is thread / process safe
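i.e. something like this is safe to call from multiple threads/processes (the dataset project/name are placeholders):
from clearml import Dataset
# every caller gets the same cached local copy once the download finishes
local_path = Dataset.get(dataset_project="examples", dataset_name="my_dataset").get_local_copy()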
Notice that if you are using TB, everything you report to the TB will appear as well π
The easiest way would be to rename a queue to "1xgpu 16gb", then make sure only machines with >16gb GPUs listen to it.
Note that an agent can listen to multiple queues
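e.g. something like (queue names are just examples):
clearml-agent daemon --queue "1xgpu 16gb" default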
Make sense. BTW: you can manually add data visualization to a Dataset with dataset.get_logger().report_table(...)
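A rough sketch of what that could look like (dataset/project names are made up, and the DataFrame is dummy data):
from clearml import Dataset
import pandas as pd
dataset = Dataset.create(dataset_project="examples", dataset_name="my_dataset")
preview = pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 0]})
# attaches the table to the Dataset's Task, so it shows up as a preview in the UI
dataset.get_logger().report_table(title="Data Sample", series="head", table_plot=preview)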
error in my-package setup command:
Okay this seems like an error in the setup.py you have in the "mypackage" folder
If it cannot find the Task ID I'm guessing it is trying to connect to the demo server and not your server (i.e. configuration is missing)
…every user in the server has the same credentials, and they don't need to know them... makes sense?
Makes sense, single credentials for everyone, without the need to distribute them
Is that correct?
ConvolutedSealion94 what's your python version?
(the error itself is clearml failing to execute git diff, or read the output, I suspect unicode or something, assuming you were able to run the same command manually)
Yes (Mine isn't and it is working π )
Hi GracefulDog98
Are argument parameters to the script not passed on to the workers, or am I missing something?
The arguments are passed directly when the code is executed (i.e. the argparser parse_args is called).
If the code fails, I'm assuming the argparse is called before clearml is imported, could that be the case ?
Maybe the configuration file changed?
The logic is: if the name and project are the same, there are no artifacts/models, and the Task was created less than 72 hours ago, reuse the Task
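And if you ever want to skip that reuse behavior, passing reuse_last_task_id=False to Task.init should force a fresh Task (a sketch, names are placeholders):
from clearml import Task
# always create a brand new Task, even if a matching draft from the last 72 hours exists
task = Task.init(project_name="examples", task_name="my_experiment", reuse_last_task_id=False)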