I wonder, does it launch all "step two" instances in parallel?
In theory it should, but in practice, since these are the same "template", I'm not sure what would happen.
One last note: you can call PipelineDecorator.debug_pipeline() to debug the pipeline locally. It has the exact same behavior, only it runs the steps as subprocesses.
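A minimal sketch of that local-debug call, assuming the pipeline is built with the decorator API (the step and pipeline names here are made up):
```
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=['data'])
def step_one():
    return [1, 2, 3]

@PipelineDecorator.component(return_values=['total'])
def step_two(data):
    return sum(data)

@PipelineDecorator.pipeline(name='debug example', project='examples', version='0.0.1')
def pipeline_logic():
    data = step_one()
    print(step_two(data))

if __name__ == '__main__':
    # run the whole pipeline locally instead of enqueueing the steps to agents
    PipelineDecorator.debug_pipeline()
    pipeline_logic()
```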
DilapidatedDucks58 by default, if you continue the execution, it will automatically continue reporting from the last iteration. I think this is what you are seeing.
Yep it should :)
I assume you add the previous iteration somewhere else, and this is the cause for the issue?
Where are they stored? I could not find a backend they work with, what am I missing?
Lol, :)
I think the issue is that you do not need to manually set the initial iteration, it's supposed to get it, as it is stored on the Task itself
Hi FantasticSquid9
There is some backwards compatibility issue with 1.2 (I think).
Basically what you need is to spin a new one on a new session ID and re-register the endpoints
😞 DilapidatedDucks58 how exactly are you "relaunching/continuing" the execution? And what exactly are you setting?
But PyTorch has no specific backend, it uses TB (TensorBoard).
No?! Can you point me to an example? What I mostly find is how to calculate metrics, not a standard way to then store them...
I think we should open a GitHub Issue and get some more feedback. Maybe we should just add support on the backend side?
LOL, thanks!
We actually plan to create different queues for different types of workloads; we are still seeing what the actual usage is to define which types of workloads make sense for us.
That sounds like a great path to take, it will make it very clear for users what they will be getting and why they should use a specific queue.
As for the memory, yes, the reasoning is clear. The main thing we'll have to see is how to define the limits, because we have nodes with quite different resources availab...
Hi UnevenDolphin73, are those per user/project/system environment variables?
If these are secrets (that you do not want to expose), maybe it is best just to have them on the agent's machine?
BTW, I think there is some "vault" support in the paid tiers for this kind of secret, not sure on which level (i.e. user/system/project)
SmarmySeaurchin8 regarding (2)
I'm not sure the current visualization supports it. I mean we can put "{}", but that would imply you can edit it, which we would then have to support. Possible, but weird, and this is why:
task.connect({'a': {}, 'b': {'nested': 'value'}})
will become
'a' = '{}'
'b/nested' = 'value'
But then if you edit to:
'a' = '{"nested": "value"}'
'b/nested' = 'value'
you have two different ways of presenting the same type of structure...
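A minimal sketch of that flattening behavior (project/task names are made up):
```
from clearml import Task

task = Task.init(project_name='examples', task_name='nested config demo')

config = {'a': {}, 'b': {'nested': 'value'}}
task.connect(config)
# in the UI this shows up flattened, roughly as:
#   'a' = '{}'
#   'b/nested' = 'value'
```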
Hi OutrageousSheep60
Do you mean something like:
https://github.com/allegroai/clearml/tree/master/examples/datasets
?
When you have a bit of experience, please suggest a path forward, it would be great to integrate it.
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
No worries I totally feel you.
As a quick hack in the actual code of the Task itself, is it reasonable to have:
task = Task.init(....)
task.set_initial_iteration(0)
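Spelled out as a runnable snippet (the project/task names are placeholders, not from this thread):
```
from clearml import Task

task = Task.init(project_name='examples', task_name='continued run')
# force reporting to start from iteration 0 instead of the last stored iteration
task.set_initial_iteration(0)
```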
Hmm, so Task.init should be called on the main process, this way the subprocess knows the Task is already created (you can call Task.init twice to get the task object). I wonder if we could somehow communicate between the subprocesses without initializing in the main one...
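A minimal sketch of that pattern, assuming the subprocess is spawned from the same script (names are made up):
```
from multiprocessing import Process
from clearml import Task

def worker():
    # calling Task.init again here just returns the task the main process already created
    task = Task.init(project_name='examples', task_name='main process task')
    task.get_logger().report_text('hello from the subprocess')

if __name__ == '__main__':
    # Task.init is called on the main process first
    Task.init(project_name='examples', task_name='main process task')
    p = Process(target=worker)
    p.start()
    p.join()
```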
So does that mean "origin" solves the issue ?
Oh that makes sense. This depends on how you set up the clearml k8s glue (because the resource allocation is done by k8s). A good hack to limit the number of containers per GPU is to set a RAM limitation per pod, then k8s will know to limit the number of pods on the same GPU machine.
wdyt?
DefeatedMoth52 how many agents do you have running on the same GPU ?
single task in the DAG is an entire ClearML pipeline.
just making sure details are not lost, "entire ClearML pipeline": the pipeline logic is process A running on machine AA.
Every step of that pipeline can be (1) a subprocess, but that means the exact same environment is used for everything, or (2) the DEFAULT behavior, where each step B is running on a different machine BB.
The non-ClearML steps would orchestrate putting messages into a queue, doing retry logic, and tr...
No worries, it's always good to know what can be built later.
I would start with a static .env file (i.e. the same for everyone), or start by hacking the Python code to load the .env at the beginning 🤞
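A minimal sketch of that hack, assuming python-dotenv is available (the file path and variable name are made up):
```
import os
from dotenv import load_dotenv

# load the shared static .env before anything else (including Task.init)
load_dotenv('/path/to/shared/.env')

from clearml import Task

task = Task.init(project_name='examples', task_name='env demo')
print(os.environ.get('MY_SECRET_ENDPOINT'))  # hypothetical variable defined in the .env file
```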
I wonder if this hack would work
Assume you upload an artifact/model to 's3://storage.yandexcloud.net:443/clearml-models' (notice the port is added). Would that trigger a popup in the UI?
Also, what happens if you add the credentials manually in the profile page?
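A minimal sketch of that hack, using the bucket URL from the message (whether the UI then pops up asking for credentials is exactly the open question):
```
from clearml import Task

task = Task.init(
    project_name='examples',
    task_name='output_uri with explicit port',
    output_uri='s3://storage.yandexcloud.net:443/clearml-models',
)
```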
I simplified the code just so I could test it; this one seems to work, feel free to add the missing argparser parts :)
```
from argparse import ArgumentParser
from trains import Task

model_snapshots_path = 'mnt/trains'

task = Task.init(project_name='examples', task_name='test argparser', output_uri=model_snapshots_path)
logger = task.get_logger()


def main(args):
    print('Got args: %s' % args)


if __name__ == '__main__':
    parent_parser = ArgumentParser(add_help=False)
    parent_parser....
```