My task starts up and checks the mounted EFS volume for x data; if the x data does not exist there, it then pulls it from S3.
BoredHedgehog47 you can just use StorageManager and point the clearml cache at the EFS, it will essentially do the same 🙂
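For example, a rough sketch (the EFS mount path, bucket and file names here are just placeholders): in clearml.conf point the cache at the EFS mount:
sdk.storage.cache.default_base_dir = "/mnt/efs/clearml-cache"
and then in the task just ask StorageManager for a local copy; it downloads from S3 only if the file is not already cached on the EFS:
from clearml import StorageManager
# returns the cached copy from the EFS if it exists, otherwise downloads from S3 first
local_path = StorageManager.get_local_copy(remote_url="s3://my-bucket/x_data.tar.gz")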
Regarding the helm chart with EFS,
you need to configure the clearml-glue pod template with the EFS mount
example :
https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/e7f647f4e6fc76f983d61522e635353005f1472f/examples/kubernetes/volu...
Yes, you have to spin up the server in order to generate the access/secret key...
LovelyHamster1 NICE! 👍
These are maybe good features to include in ClearML:
Sure, we should probably add a section into the doc explaining how to do that
Another approach is creating my own API on top of the clearml-serving endpoints, where I control each tenant's authentication.
I have to admit that to me this is a much better solution (than my/bento integrated JWT option). Generally speaking I think this is the best approach, as it separates the authentication layer from the execution ...
You can control it with the auto_* arguments in the Task.init call
https://clear.ml/docs/latest/docs/references/sdk/task#taskinit
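For example, a minimal sketch (project/task names are placeholders):
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="auto connect demo",
    auto_connect_frameworks=False,   # disable automatic framework logging (or pass a dict to pick specific frameworks)
    auto_connect_arg_parser=False,   # do not automatically capture argparse arguments
    auto_resource_monitoring=False,  # no CPU/GPU/memory monitoring reports
)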
Hover near the edge of the plot, then you should get a "bar" you can click on to resize
Verified, you are correct: "." in label enumeration will break the clone.
I'll make sure this bug is passed to the backend guys to fix. Thanks TenseOstrich47!
meanwhile maybe use "_" instead? 😁
No, by definition the agent will only execute one Task at a time; you can spin up a second agent on the same GPU :)
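e.g. something along these lines, two agents sharing GPU 0 and each pulling Tasks from its own queue (queue names are just examples):
clearml-agent daemon --queue default --gpus 0 --detached
clearml-agent daemon --queue second_queue --gpus 0 --detached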
Seems like passing the Task object is not working as expected (I'll make sure it is fixed).
Try: dataset._task.set_parent(Task.current_task().id)
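i.e. in context, something like this (dataset name/project are placeholders):
from clearml import Dataset, Task

dataset = Dataset.create(dataset_name="my dataset", dataset_project="examples")
# workaround: set the current task as the parent of the dataset's underlying task, by id
dataset._task.set_parent(Task.current_task().id)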
Thanks StaleKangaroo85, the bug is verified. Let me check to see where exactly the bug is.
Two points:
Notice that x_labels should be the same size as the histogram. It seems that you also have to pass the labels (otherwise you get "trace-0"), so if you add labels=['random histogram'] and labels=['random histogram2'], you'll get the correct legend.
Anyhow I'll make sure we also fix it in the code so that the labels automatically default to [series] if not specified, thanks!
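For reference, a quick sketch of what I mean (the values are just dummy data):
from clearml import Task
import numpy as np

task = Task.init(project_name="examples", task_name="histogram labels")
logger = task.get_logger()

values = np.random.randint(10, size=10)
logger.report_histogram(
    title="histograms",
    series="random histogram",
    values=values,
    iteration=0,
    labels=["random histogram"],                   # explicit label, so the legend is not "trace-0"
    xlabels=[str(i) for i in range(len(values))],  # x_labels must match the histogram size
)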
it will only if the OOM killer is enabled
true, but you will still get OOM (I believe). I think the main issue is that even from inside the container, when you query the memory, you see the entire machine's memory... I'm not sure what we can do about that
FlutteringWorm14 an RC is out (1.7.3rc1) with the ability to configure this from clearml.conf
you can now set sdk.development.worker.report_event_flush_threshold from clearml.conf
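i.e. add something like this to your clearml.conf (the value here is just an example):
sdk.development.worker.report_event_flush_threshold = 100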
Bottom line, the driver version on the host machine does not support the CUDA version you have in the docker container
Hi SarcasticSparrow10
I think the default search is any partial match, let me check if there is a way to do some regexp / wildcard
Hi MagnificentSeaurchin79
This sounds like a deeper bug (of a sort), I think the best approach is to open a GitHub issue with some code that can reproduce this behavior, or at least enough information so that we could try to catch the bug.
This way we will make sure it is not forgotten.
Sounds good ?
Notice that in your execute_remotely() you did not specify a queue to put the current Task into
What it does is stop the currently running code and put the newly created task into the specified queue; if you do not specify a queue, it will just abort it and wait for you to manually enqueue it.
To solve it: task.execute_remotely(queue_name='my_queue')
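For example, a minimal sketch (queue name is just a placeholder):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
# stops local execution here and enqueues the task into 'my_queue' for an agent to pick up
task.execute_remotely(queue_name="my_queue", exit_process=True)
# ...the rest of the script only runs when an agent executes the enqueued task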
Hi @<1541229812243238912:profile|PoisedMoth54>
We should probably add a better interface (please feel free to open a github issue on the interface), until then you can use:
dataset._task.connect_configuration(configuration="path/to/file", name="my config")
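Note that the call also returns the config path, so (assuming the same setup) you can use the returned value and it will keep working when the task is executed remotely:
config_path = dataset._task.connect_configuration(configuration="path/to/file", name="my config")
# when an agent runs the task, config_path points to a local copy fetched from the server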
yup! that's exactly what I was hoping you could help me change the timing of. Is there an option I can override to make the retry more aggressive?
you mean wait for less?
add to your clearml.conf:
api.http.retries.backoff_factor = 0.1
Hi QuaintPelican38
Can you ssh to {instance_public_ip_address}:10022 (something like ssh -p 10022 user@IP_HERE)?
Basically just getting the password prompt means you are okay.
I suspect that you have some AWS security definition (firewall) that prevents a direct access to the instance, could that be?
Martin I told you I can't access the resources in the cluster unfortunately
😞
so it seems there is some misconfiguration of the k8s glue: we can see it can "talk" to the clearml-server, but it fails to actually create the k8s pod/job. I would start with debugging the k8s glue (not the services agents). Regardless, I think the next step is to get a log of the k8s glue pod and better understand the issue.
wdyt?
CrookedWalrus33 can you send the entire log? (you can DM it to me)
CourageousLizard33 if the two series are on the same graph, just click on the series in the legend, you can enable/disable it, and the scale will adjust automatically.
Regarding grouping, this is a feature that can be turned off; the idea is that we split the tag into title/series... So if you have the same prefix, the TF scalars are grouped on the same graph, otherwise each gets a graph with its own title. That said, you can force it to have a series per graph like in TB. Makes sense?
Hi VexedKangaroo32 , funny enough this is one of the fixes we will be releasing soon. There is a release scheduled for later this week, right after that I'll put here a link to an RC containing a fix to this exact issue.
So the naming is a by-product of the many TBs created (one per experiment); if you add different naming to the TB files, then this is what you'll be seeing in the UI. Make sense?
Hi @<1569858449813016576:profile|JumpyRaven4>
task.add_requirements()
This is the problem: if you look closely, this is a class method, meant for helping Task.init better capture python packages; it does Not change the task requirements.
To do that, use task.set_packages()
btw, I looked deeper into the log:
File "/tmp/tmpfa8ifmka.py", line 80, in <module>
model.train(data='coco128.yaml',epochs=20)
I'm assuming this all starts here; I think that the pipeline is Not running the code from the same folder, and you are just missing the 'coco128.yaml'. Try to pass a full path, wdyt?
How are you starting the agent?