
I start the TaskScheduler, register a task, and stop the scheduler. How do I restart the TaskScheduler in a way that re-registers the tasks?
if it's aborted, just re-enqueue it?
(it serializes itself and stores its state on the Task object, so when re-launched it will deserialize from the last state)
Hmm, I guess we should state that better; I'll pass it on.
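For reference, a minimal sketch of launching such a scheduler as its own re-enqueueable Task; the task ID and queue names below are hypothetical placeholders:

from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
# '<task_id_to_launch>' and both queue names are placeholders
scheduler.add_task(schedule_task_id='<task_id_to_launch>', queue='default', minute=30)
scheduler.start_remotely(queue='services')  # the scheduler then runs as its own Task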
Yes, albeit not actually "intercept", as the user will be able to directly put Tasks in queues B_machine_a/B_machine_b, but any time the user pushes Tasks into queue B, this service will pull them and push them to the individual machines' queues.
what do you think?
Hi @<1545216070686609408:profile|EnthusiasticCow4>
Now that it's running, how does one add a new task or remove an existing task from the scheduler?
Did you notice the scheduler stores its own configuration as a config object on the Task?
Notice that you can abort/reset the scheduler, change its configuration in the UI, and relaunch it (i.e. enqueue it). It will use the configuration from the UI (backend) and not the original code that created it. Does that make sense?
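A possible way to do that relaunch programmatically, assuming you have the scheduler's task ID (a placeholder here):

from clearml import Task

scheduler_task = Task.get_task(task_id='<scheduler_task_id>')  # placeholder ID
scheduler_task.reset()  # clear the previous run's state
Task.enqueue(scheduler_task, queue_name='services')  # relaunches with the UI (backend) configuration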
Do note that the needed module is just a local folder with scripts.
Oh that is the issue, is it in the git repo ?
but then an error message in the web-app pops up
Fetch parents failed
and the Scheduler task disappears
And the Task is still running? What's the clearml python version and webui version?
Should work out of the box, maybe the only thing to notice is that you will get a Task for every local_rank 0 process
does that make sense?
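If you want to gate the Task creation explicitly, a minimal sketch assuming the standard LOCAL_RANK environment variable set by torchrun (project/task names are placeholders):

import os
from clearml import Task

# only the local_rank 0 process initializes a Task
if int(os.environ.get('LOCAL_RANK', '0')) == 0:
    task = Task.init(project_name='examples', task_name='ddp run')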
@<1523701601770934272:profile|GiganticMole91> really nice!
but can we schedule a new task here?
@<1523701260895653888:profile|QuaintJellyfish58> do you mean schedule a Task from the scheduled function? If yes, you can do something similar to @<1523701601770934272:profile|GiganticMole91>: you create/clone an existing Task, change its arguments, and push it into an execution queue. wdyt?
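Something along these lines, as a sketch (the template task ID and parameter name are hypothetical):

from clearml import Task

def schedule_new_task():
    template = Task.get_task(task_id='<template_task_id>')  # hypothetical template Task
    cloned = Task.clone(source_task=template, name='scheduled run')
    cloned.set_parameter('Args/epochs', 20)  # assumed parameter name
    Task.enqueue(cloned, queue_name='default')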
EnchantingWorm39 you have great timing ;)
However, there is still a delay of approximately 2 minutes between the completion of setup,
Where is that delay in the log?
(btw: it seems your container is missing clearml-agent & git, installing those might add some time)
Yes
Are you trying to upload_artifact to a Task that is already completed?
@<1523702932069945344:profile|CheerfulGorilla72> use the following bucket name when you are configuring your files/output URI:
s3://<ip_here>:<port_here>/<bucket_here>
From there everything should work as expected
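For example, setting it per Task (endpoint and bucket are the same placeholders as above):

from clearml import Task

task = Task.init(
    project_name='examples',  # placeholder
    task_name='s3 output demo',  # placeholder
    output_uri='s3://<ip_here>:<port_here>/<bucket_here>',
)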
So what you're saying is to first kick off a new run and then rename the underlying Pipeline Task, which will cause that particular run to become a new pipeline name?
Correct, basically you are not changing the "pipeline" per se, but the execution name of the pipeline, if that makes sense
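A sketch of doing that rename via the SDK (the run's task ID is a placeholder):

from clearml import Task

pipeline_task = Task.get_task(task_id='<pipeline_run_task_id>')  # placeholder ID
pipeline_task.set_name('my-new-pipeline-name')  # the run now appears under the new name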
What would be most ideal would be to be able to right-click on a pipeline run and have a "clone" option, like you can with a task, where you can start a new run with a new name in a single step.
...
, but what I really want to achieve is to share this code:
You mean to share the code between them? Unless this is a "preinstalled" package in the container, each endpoint has its own separate set of modules / files
(this is on purpose, so you could actually change them, just imagine different versions of the same common.py file)
(Just a thought, maybe we just need to combine it with Kedro-Viz?)
we also provide a custom aux-config file. We also had to make sure to update the name inside config.pbtxt so that Triton is happy:
Good point, what would be the logic of the auto "config.pbtxt" patching we should employ?
- Artifacts and models will be uploaded to the output URI; debug images are uploaded to the default file server. This can be changed via the Logger.
- Hmm is this like a configuration file?
You can do:
local_text_file = task.connect_configuration('filenotingit.txt')
Then open 'local_text_file'; it will create a local copy of the data at runtime, and the content will be stored on the Task itself. - This is how the agent installs the python packages, but if the docker already contains th...
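Putting it together, a minimal sketch (the file name is just an example):

from clearml import Task

task = Task.init(project_name='examples', task_name='config demo')  # placeholder names
# returns a local path; when executed by an agent the file content is
# recreated from what is stored on the Task
local_text_file = task.connect_configuration('filenotingit.txt')
with open(local_text_file) as f:
    data = f.read()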
Hi @<1697056701116583936:profile|JealousArcticwolf24> just saw the reply
Image looks okay?! What is the query? Basically I'm trying to understand if Grafana is connected to the Prometheus, and if the Prometheus has any data in it
Secondly, just to make sure, the kafka service should be able to connect directly to the container running the actual inference
Hi @<1526371965655322624:profile|NuttyCamel41>
How are you creating the model? specifically what do you have in "config.pbtxt"
specifically any python code should be in the pre/post processing code (actually not running on the GPU instance)
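A minimal sketch of such a pre/post processing class, following the pattern in the clearml-serving examples (treat the exact signatures as an assumption to verify against your version):

from typing import Any, Callable, Optional

class Preprocess(object):
    def preprocess(self, body: dict, state: dict,
                   collect_custom_statistics_fn: Optional[Callable] = None) -> Any:
        # runs on the serving container CPU, not on the GPU/Triton instance
        return body.get('data')

    def postprocess(self, data: Any, state: dict,
                    collect_custom_statistics_fn: Optional[Callable] = None) -> dict:
        return {'result': data}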
GiganticTurtle0
That definitely makes sense. Where can I specify callbacks in the PipelineDecorator API?
Hmm, there isn't one actually... (the interface I was thinking about was PipelineController...)
Would it make sense to throw an exception in the pipeline execution code?
BTW: I just verified, if the pipeline step fails an exception is raised (ValueError)
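With PipelineController the callbacks can be attached per step; a minimal sketch (project/task names are placeholders):

from clearml import PipelineController

def step_completed(pipeline, node):
    print(f'step {node.name} finished')

pipe = PipelineController(name='demo pipeline', project='examples', version='1.0')
pipe.add_step(
    name='stage_one',
    base_task_project='examples',  # placeholder template Task project
    base_task_name='stage one template',  # placeholder template Task name
    post_execute_callback=step_completed,
)
pipe.start()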
ReassuredTiger98
How can I make clearml-agent use the pre-installed version from the nvidia/pytorch container?
If the same version is required, the agent will not try to reinstall it (the new venv the agent creates inside the container inherits from the preinstalled system packages)
Comes with PyTorch version 1.12 based on a commit. I tried torch >= 1.11, torch == 1.12
If in your installed packages you have torch==1.12, the agent should not tr...
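One way to make sure the exact preinstalled version ends up in the installed packages, a sketch using Task.add_requirements (must be called before Task.init; project/task names are placeholders):

from clearml import Task

# pin the version that ships with the container so the agent skips reinstalling it
Task.add_requirements('torch', '==1.12')
task = Task.init(project_name='examples', task_name='torch pin demo')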
Okay, I'll pass to front-end, see what they can do about it.
SoreDragonfly16 the torchvision warning has nothing to do with the Trains warning.
The Trains warning means that somehow someone changed the state of the Task from running (in_progress) to "stopped" (aborted). Could it be one of the subprocesses raised an exception?
Hi @<1526371965655322624:profile|NuttyCamel41>
so sorry, I just realized I have not answered it!
I just tried the pytorch example from the clearml-serving repo and got the error about the wrong model name
okay that is odd, are you using the exact same containers / docker-compose? what is the difference?
I0603 09:44:02.665851 41 model_lifecycle.cc:693] successfully loaded 'test_model_pytorch' version 1
does that mean that even though there is a warning there, you can curl to ...
I hope you can do this without containers.
I think you should be fine, the only caveat is CUDA drivers, nothing we can do about that ...
Hmm there was a commit there that was "fixing" some stuff, apparently it also broke it
Let me see if we can push an RC (I know a few things are already waiting to be released)
DepressedChimpanzee34
What's the hydra version?
I tested with 1.1.0dev3 and it worked for me
The latest RC (0.17.5rc6) moved all logs into separate subprocess to improve speed with pytorch dataloaders