quick update, still trying to reproduce ...
Hi BoredHedgehog47, I'm assuming the nginx on the k8s ingress is refusing the upload to the files server
JuicyFox94 wdyt?
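If that's the case, a common fix is to raise the ingress body-size limit. A sketch, assuming the standard nginx ingress controller (not verified against your setup):
metadata:
  annotations:
    # "0" disables the client body size check (the default 1m rejects large uploads)
    nginx.ingress.kubernetes.io/proxy-body-size: "0"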
To avoid this, downgrade to clearml==1.9.1 for now.
I will make sure this is solved in clearml==1.9.3 & clearml-session==0.5.0 quickly
What about calling Task.init without the agent?
Nice SubstantialElk6 !
BTW: you can configure your clearml client to store the changes against the latest pushed commit (instead of the default, which is the latest local commit)
see store_code_diff_from_remote: in clearml.conf:
https://github.com/allegroai/clearml/blob/9b962bae4b1ccc448e1807e1688fe193454c1da1/docs/clearml.conf#L150
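For reference, a minimal sketch of the relevant clearml.conf entry (assuming the default layout, where development sits under the sdk section):
sdk {
  development {
    # store the uncommitted diff against the latest pushed commit
    # instead of the latest local commit
    store_code_diff_from_remote: true
  }
}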
setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
This seems like a question for GCS; maybe we should open an issue there, since their backend does the rate limiting
My main concern now is that this may happen within a pipeline leading to unreliable data handling.
I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?
If
...
Hi PanickyMoth78
Yes, I think you are correct, this looks like GCS throttling your connection. You can control the number of concurrent uploads with max_workers=1
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/datasets/dataset.py#L604
Let me know if it works
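For example, a minimal sketch (the parameter comes from the linked signature; project/dataset names and paths are placeholders):
from clearml import Dataset

ds = Dataset.create(dataset_project="examples", dataset_name="my_dataset")
ds.add_files("data/")
# a single worker uploads chunks sequentially, which should stay under the GCS rate limit
ds.upload(max_workers=1)
ds.finalize()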
Hi CrookedAlligator14
Hi, I just started using clearml, and it is amazing!
Thank you! 😍
When I enqueue the task, the venv is set up and starts to install all the packages from the requirements.txt file, but at the end I get the following in the console:
Can you try with the latest agent? We improved the support for pytorch (they now have a proper pypi compatible repo), can you see if that solves it?
pip3 install clearml-agent==1.5.0rc0
SubstantialElk6
Hmm do you have torch in the "installed packages" section of the Task ?
(This is what the agent uses to set up the environment inside the docker, running as a pod)
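If it is missing there, one hedged way to force it in (add_requirements must be called before Task.init; project/task names are placeholders):
from clearml import Task

# make sure torch ends up in the Task's "installed packages" section
Task.add_requirements("torch")
task = Task.init(project_name="examples", task_name="train")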
MoodyCentipede68 from your log
clearml-serving-triton | E0620 03:08:27.822945 41 model_repository_manager.cc:1234] failed to load 'test_model_lstm2' version 1: Invalid argument: unexpected inference output 'dense', allowed outputs are: time_distributed
This seems to be the main issue: Triton failing to load the model.
Does that make sense to you? how did you configure the endpoint model?
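If you registered it with the clearml-serving CLI, the output name has to match what the model actually exposes; a sketch, assuming the triton engine, with placeholder sizes/types:
clearml-serving --id <service_id> model add \
  --engine triton --endpoint "test_model_lstm2" \
  --input-name "input" --input-type float32 --input-size 1 128 \
  --output-name "time_distributed" --output-type float32 --output-size -1 10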
SourSwallow36 it is possible.
Assuming you are not logging metrics by the same name, it should work.
try:
Task.init('examples', 'training', continue_last_task='<previous_task_id_here>')
Hi @<1539055479878062080:profile|FranticLobster21>
Like this?
https://github.com/allegroai/clearml/blob/4ebe714165cfdacdcc48b8cf6cc5bddb3c15a89f[…]ation/hyper-parameter-optimization/hyper_parameter_optimizer.py
Sorry my bad, you are looking for:
None
I am using importlib and this is probably why everything's weird.
Yes, that would explain a lot 🙂
No worries, glad to hear it worked out
BTW: I think an easy fix could be:
if running_remotely():
    pipeline.start()
else:
    pipeline.create_draft()
I see.
You can get the offline folder programmatically, then copy the folder content (it's the same as the zip, and you can also pass a folder instead of a zip to the import function):
task.get_offline_mode_folder()
You can also soft link the offline folder (if you are working on a linux machine):
ln -s myoffline_folder ~/.trains/cache/offline
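A minimal sketch of the full offline round-trip (paths and names are placeholders):
from clearml import Task

Task.set_offline(offline_mode=True)
task = Task.init(project_name="examples", task_name="offline run")
print(task.get_offline_mode_folder())  # this is the folder you can copy or soft link
task.close()

# later, on a machine that can reach the server:
Task.import_offline_session(session_folder_zip="/path/to/offline_folder_or_zip")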
StraightDog31 how did you get these ?
It seems like it is coming from matplotlib, no?
The downstream stages are rankN scripts; they are waiting for the IP address of the first stage.
Is this like a multi-node training, rather than a pipeline ?
it looks like nvidia is going to come up with a UI for TAO too
Interesting, any reference we could look at ?
Hi GleamingGrasshopper63
How well can the ML Ops component handle job queuing on a multi-GPU server
This is fully supported 🙂
You can think of queues as a way to simplify resources for users (you can do more than that, but let's start simple)
Basically you can create a queue per type of GPU, for example a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then, when you spin up the agents, you attach each agent to the "correct" queue for its machine type.
Int...
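For instance, a sketch of spinning two agents on one machine (queue names as above; GPU indices are placeholders):
clearml-agent daemon --queue on_prem_1gpu --gpus 0 --detached
clearml-agent daemon --queue on_prem_2gpus --gpus 1,2 --detached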
Hi SplendidToad10
In order to run a pipeline you first have to create the steps (i.e. Tasks).
This is usually done by running the code once (basically, running any code with a Task.init call will create a Task for that specific code, including the environment definition the Agent needs to reproduce it).
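i.e. something like (project/task names are placeholders):
from clearml import Task

# running this once registers the script as a Task the pipeline can later clone
task = Task.init(project_name="pipeline-steps", task_name="step1")
# ... the actual step code ...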
Hi DeliciousBluewhale87
This sounds like a great workflow to implement.
I guess my first question is how do you imagine the manager/director interacting with the system? What will they be shown, to allow them to approve/decline the model promotion ?
Correct 🙂
I'm assuming the Task object is not your current task, but a different one?
Just making sure, the original code was executed on python 3?
BTW:
This is very odd: "~/.clearml/venvs-builds.3/3.6/bin/python" thinks it is using "python 3.6", but it is linked with python 2.7 ...
No idea how that could happen
HugeArcticwolf77 actually it is more than that, you can now embed the graphs in the markdown: when you hover over any plot/graph/image you now have a new button that copies the embed text, so that you can paste it directly into your markdown editor (internal or external)
More documentation and screenshots are coming after the holiday; in the meantime you can check:
https://clear.ml/docs/latest/docs/webapp/webapp_reports
https://clear.ml/docs/latest/assets/images/webapp_report-695dddd2ec8064938bf8...
BTW: make sure to install the agent with the system python packages and not inside any venv.
I have one agent running on the machine. I also have only one task running. This only happens to us when we use pipelines
@<1724960468822396928:profile|CumbersomeSealion22> notice that when you are launching a pipeline you are actually running two Tasks: one is the "pipeline" itself (i.e. the logic) and the other is the component in the pipeline (i.e. the step)
If you have one agent, I'm assuming what happens is the pipeline itself (the one that you launch on your machine)...
Hi LazyFish41
Could it be some permission issue on /home/quetalasj/.clearml/cache/ ?
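One quick hedged check, assuming a linux box:
# check who owns the cache folder
ls -ld /home/quetalasj/.clearml/cache/
# if it is owned by root (e.g. created by a sudo run), reclaim it:
sudo chown -R "$USER":"$USER" /home/quetalasj/.clearml/cache/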