The only workaround I can think of is: `series = series + 'IoU>X'`
It doesn't look that bad 🙂
According to you, the VPN shouldn't be a problem, right?
Correct, as long as all parties are on the same VPN it should work. All the connections are plain HTTP, so it's basically trivial communication.
Regarding pipelines: can I control the tags of the tasks a pipeline creates?
`add_pipeline_tags` adds tags from the pipeline to the tasks, I suppose? But I also need to clear the existing tags on those created tasks.
`add_pipeline_tags` will add the unique ID of the pipeline execution. If you want to set specific tags you can use `task_overrides` and provide: `pipe.add_step(..., task_overrides={'tags': ['my', 'tags']})`
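For example, a fuller sketch (project/task/step names here are placeholders); since `task_overrides` sets the tags field directly, I'd expect it to replace whatever tags the base task already had, which should also cover clearing them:
```
from clearml import PipelineController

# placeholder pipeline/project/task names
pipe = PipelineController(name='pipeline demo', project='examples', version='1.0')
pipe.add_step(
    name='train',
    base_task_project='examples',
    base_task_name='training task',
    # sets the tags field on the created task directly,
    # so it should replace (i.e. clear) any pre-existing tags
    task_overrides={'tags': ['my', 'tags']},
)
```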
Hi BurlyRaccoon64
Yes, we did; the latest clearml-agent solves the issue, please try:
`pip3 install -U --pre clearml-agent`
FreshParrot56 we could add this capability, but the main caveat is that if your version depends on multiple parent versions, you still need to download and extract all the parent versions, which means that when you clear them you might hurt later performance. Does that make sense? What is the use-case / scenario for you?
JitteryCoyote63 any chance you have a log of the failed torch 1.7.0 ?
That makes no sense to me?!
Are you absolutely sure the nntrain is executed on the same queue? (basically, could it be that the nntraining is executed on a different queue in these two cases?)
RoughTiger69 the easiest thing would be to use the override option of Hydra:
```
parameter_override={'Args/overrides': '[the_hydra_key={}]'.format(a_new_value)}
```
wdyt?
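For context, a minimal sketch (all names are placeholders, and `the_hydra_key` stands for whichever Hydra key you want to override):
```
from clearml import PipelineController

# placeholder pipeline/project/task names
pipe = PipelineController(name='pipeline demo', project='examples', version='1.0')
a_new_value = 0.01  # placeholder value
pipe.add_step(
    name='train',
    base_task_project='examples',
    base_task_name='hydra training',
    # override a single Hydra key on the cloned task
    parameter_override={'Args/overrides': '[the_hydra_key={}]'.format(a_new_value)},
)
```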
I think this is the temp requirements file it creates, not your requirements file. If you attach a log here with the "installed packages" section, maybe we can help debug it.
@<1523707653782507520:profile|MelancholyElk85> I just ran a single-step pipeline and it seemed to use the `base_task_id` without cloning it...
Any insight on how to reproduce?
I want each remote task to execute one instance of the hydra multirun, but I suspect the remote will try to run the full multirun by itself
```
if config.clearml.remote and task.running_locally():
    task.execute_remotely(
        queue_name=config.clearml.queue_name,
        clone=True,
        exit_process=False,
    )
    return
```
I think this ensures the local execution actually triggers the remote one, so it should be as you expect, no?
I still think the issue is getting boto3 credentials
It might be the case
Are you using clearml-agent or are you running it manually ?
SubstantialElk6 is this the pip to install the agent, or the pip the agent is using to install the packages for the specific experiment ?
Ohh, then use the AWS autoscaler, it's basically what you want: it spins up an EC2 instance and sets an agent there; then if the EC2 instance goes down (for example, if it is a spot instance), it will spin it up again automatically with the running Task on it.
wdyt?
HighOtter69 inside the legend, click on the color rectangle next to the series name; you can change the color of the series on the graph. This property is stored, so it will always remember your color preferences (yes, even when logging from another machine 🙂)
`CLEARML_AGENT_GIT_USER` is your git user (on whatever git host/server you are using: GitHub/GitLab/BitBucket etc.)
Yes, that's the reason: basically there is a background thread analyzing the code, and at the end of the execution, if it is still running (hence the question regarding execution time), we give it an extra 10 seconds to come up with answers; otherwise we terminate it so the code won't get stuck. Does that make sense?
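If the script is very short, one workaround (a sketch, assuming the standard SDK calls; names are placeholders) is to close the task explicitly so all background reporting is flushed before the process exits:
```
from clearml import Task

task = Task.init(project_name='examples', task_name='short script')  # placeholder names
# ... very short script body ...
task.close()  # flushes background reporting before the process exits
```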
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Actually I am as well. This is Kubernetes doing the resource scheduling, and Kubernetes decided it is okay to run two pods on the same GPU, which is cool, but I was not aware Nvidia had already added this feature (I know it was in beta for a long time)
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
I also see they added dynamic slicing and Memory Protection:
Notice you can control ...
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space) when queuing the task, and the agents pick up these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to; the users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation?
Okay, this is a bit tricky (and come to think about it, we should allow a more direct interface):
```
pipe.add_step(
    name='train',
    parents=['data_pipeline', ],
    base_task_project='xxx',
    base_task_name='yyy',
    task_overrides={
        'configuration.OmegaConf': dict(
            value=yaml.dump(MY_NEW_CONFIG),
            name='OmegaConf',
            type='OmegaConf YAML',
        )
    },
)
```
Notice that if you had any other configuration on the base task, you should add them as well (basically it overwrites the configurati...
If I use `report_image`, can I get a URL to it somehow?
Let me check ...
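For reference, the call being discussed (a minimal sketch; all names and paths are placeholders):
```
from clearml import Task

task = Task.init(project_name='examples', task_name='image demo')  # placeholder names
task.get_logger().report_image(
    title='debug',                    # placeholder title
    series='sample',                  # placeholder series
    iteration=0,
    local_path='/path/to/image.png',  # placeholder path to a local image file
)
```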
Please hit Ctrl-F5 to refresh the entire page, and see if it is still empty...
DeterminedToad86
Yes, I think this is the issue: on SageMaker a specific compiled version of torchvision was installed (probably part of the image).
Edit the Task (before enqueuing) and change the torchvision URL to: `torchvision==0.7.0`
Let me know if it worked.
Hmm I wonder, can you try with this line before?
```
Task._report_subprocess_enabled = False
frameworks = {
    'tensorboard': True,
    'pytorch': False,
}
Task.init(...)
```
CourageousKoala93 when you call `Task.close()` it will mark the task as completed, so there is no need to do that manually. The idea with `mark_completed` is that you can forcefully change the state if needed, or externally stop the task and mark it completed. Does that make sense?
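A minimal sketch of the two flows (all names and the task id are placeholders):
```
from clearml import Task

task = Task.init(project_name='examples', task_name='demo')  # placeholder names
# normal flow: close() flushes everything and marks the task completed
task.close()

# forceful flow: mark a (possibly externally stopped) task as completed
other_task = Task.get_task(task_id='aabbccdd...')  # placeholder task id
other_task.mark_completed()
```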
GiddyTurkey39
BTW: you can always add the missing package via code: `Task.add_requirements('torch', optional_version)`
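A minimal sketch (placeholder names); note that, if I remember correctly, `add_requirements` has to be called before `Task.init` so the requirements analysis picks it up:
```
from clearml import Task

# call before Task.init() so it is included in the detected requirements
Task.add_requirements('torch')             # auto-detected / latest version
# Task.add_requirements('torch', '1.7.0')  # or pin a specific version (placeholder)

task = Task.init(project_name='examples', task_name='demo')  # placeholder names
```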
Regarding the limit interface, let me check; I think this is being worked on (i.e. a nicer interface that should be pushed in the next few days). Let me get back to you on this one.
How will imposing an instance limit prevent or allow the --order-fairness feature, for example, which exists when running the clearml-agent version compared to the k8s_glue_example version?
A bit of background on how the glue works:
It pulls jobs from the clearml queue, then it prepares a k8s Job, and launches it...
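Conceptually it is something like this (a simplified sketch, not the actual implementation; all helpers are placeholders):
```
import time
from typing import Optional

def pull_next_task(queue: str) -> Optional[dict]:
    """Placeholder: pop the next task from the clearml queue (via the clearml API)."""
    return None

def build_k8s_job(task: dict) -> dict:
    """Placeholder: render a k8s Job spec from a template plus the task id/queue."""
    return {'metadata': {'name': 'clearml-task-{}'.format(task['id'])}}

def launch_k8s_job(job_spec: dict) -> None:
    """Placeholder: submit the Job to the cluster (kubectl apply / k8s python client)."""

def k8s_glue_loop(queue: str, poll_interval: float = 5.0) -> None:
    # conceptual glue loop: clearml queue -> k8s Job
    while True:
        task = pull_next_task(queue)
        if task is None:
            time.sleep(poll_interval)
            continue
        launch_k8s_job(build_k8s_job(task))
```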
That is exactly it: the trains-agent replicates the code from the git repo, and tries to apply the git diff (see the uncommitted changes section). Obviously it failed 🙂