Hi FuzzySeaanemone21
and then run "clearml-agent daemon --gpus 0 --queue gcp-l4" to start the worker.
I'm assuming the docker service cannot spin up a container with GPU access; usually this means you are missing the nvidia docker runtime component
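A quick way to verify is to run a GPU container directly, e.g. something like "docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi" (the image tag here is just an example) — if that fails, the NVIDIA container runtime is not installed/configured.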
We should probably make sure it is properly stated in the documentation...
And if you could also update the docs with all the env vars that can be set, it would be awesome!
Yes, I'll pass it on, that is a good point
Thanks! Yes, this could be great!
Could you please open a GitHub issue, so we remember to update the feature?
JitteryCoyote63
IAM role to the web app could access
you mean the web client key/secret to access S3 data?
LudicrousParrot69 you mean post execution or while you are executing the hyperparameter optimizer?
BTW: RelievedDuck3 we just released 1.3.1 with better debugging; it prints the full exception stack on failure to the clearml Serving Session Task.
I suggest you pull the latest image, re-run the docker compose, and check what you have on the serving session Task in the UI
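For example, something along the lines of "docker compose pull" followed by "docker compose up -d" from the clearml-serving docker-compose folder (the exact compose file and env-file flags depend on your setup, so treat this as a sketch).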
and since the update the docs seem to be a bit off but I think I got it
Working on a whole new site 😉
This seems to be more complicated than it looks (UI/backend combination). Not that we are not working on it, just that it might take some time, as it passes control to the backend (which by design does not touch external storage points).
Maybe we should create an S3 cleanup service, listing buckets and removing objects whose Task ID no longer exists. wdyt?
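A minimal sketch of what such a cleanup could look like (boto3 for the bucket listing; task_exists() is a hypothetical helper you would implement against the ClearML API, and the key layout is an assumption):

import boto3

def task_exists(task_id: str) -> bool:
    # hypothetical helper: query the ClearML backend and return True if the task still exists
    raise NotImplementedError

s3 = boto3.client("s3")
bucket = "my-artifacts-bucket"  # placeholder bucket name
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        task_id = obj["Key"].split("/")[0]  # assumes keys are laid out as <task_id>/...
        if not task_exists(task_id):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])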
Yes, this is definitely the issue, the agent assumes the docker user is "root".
Let me check something
Hi AbruptHedgehog21
How can I add S3 credentials for an S3 bucket in example.env for clearml-serving-triton? I need to add the bucket name, keys and endpoint
Basically boto (s3) environment variables would just work:
https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving#advanced-setup---s3gsazure-access-optional
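For example, in example.env something along these lines would be picked up by boto (a sketch; for a custom endpoint you may need the aws.s3 section in clearml.conf instead, so treat that part as an assumption):

AWS_ACCESS_KEY_ID=<your-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-key>
AWS_DEFAULT_REGION=<your-region>

The bucket name itself usually goes into the model URI (e.g. s3://<bucket>/...), not into an env var.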
Are you using tensorboard or do you want to log directly to trains?
I managed to do it by using logger.report_scalar, thanks!
Sure, but for future reference, where (in the ignite callbacks) did you add the report_scalar call?
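For reference, a minimal sketch of such a call inside an ignite handler (the engine setup and names here are assumptions, not the user's actual code):

from clearml import Logger
from ignite.engine import Engine, Events

def train_step(engine, batch):
    # placeholder training step returning a dummy loss value
    return 0.0

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED)
def log_loss(engine):
    # report the current loss as a scalar to the ClearML task
    Logger.current_logger().report_scalar(
        title="train", series="loss",
        value=engine.state.output, iteration=engine.state.iteration)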
Hi CheekyElephant36
First you need to run it once on your machine; once this is done (only a few steps are enough), you can clone it and enqueue it. Then, to actually connect the AWS autoscaler (the part that spins up machines and runs tasks), go to Applications and select the AWS autoscaler.
BTW I think the next video will be about YOLO + autoscaler
BeefyCow3 see this https://allegroai-trains.slack.com/archives/CTK20V944/p1593077204051100 :)
It seems there is some async behavior going on. After ending a run, this prompt just hangs for a long time:
2021-04-18 22:55:06,467 - clearml.Task - INFO - Waiting to finish uploads
And there's no sign of updates on the dashboard
Hmm, that could point to an issue uploading the last images (which are larger than regular scalars). Could you try flushing and waiting?
i.e. task.flush() followed by sleep(45)
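In full, something like (a sketch, assuming the task object returned by Task.init / Task.current_task):

from time import sleep
from clearml import Task

task = Task.current_task()  # the task created earlier via Task.init
task.flush()  # push any pending reports/uploads
sleep(45)     # give the async uploader time to finish the last images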
It will store the entire content of the file, then you can edit it in the UI, and in remote it will return a new local copy of the file (based on the data in the UI) for you to read.
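If this refers to Task.connect_configuration (my assumption from the thread), a minimal sketch:

from clearml import Task

task = Task.init(project_name="examples", task_name="config demo")  # placeholder names
# stores the file content on the task; when executed remotely, the returned path
# points to a fresh local copy generated from whatever is currently in the UI
config_path = task.connect_configuration("my_config.yaml")  # placeholder file name
with open(config_path) as f:
    config_text = f.read()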
Something is off here ... Can you try to run the TB examples and the artifacts example and see if they work?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
however can you see the inconsistency between the key and the name there:
Yes that was my point on "uniqueness" ... 😞
the model-key must be unique, and it is based on the filename itself (the context is known, since it is inside the Task), but the Model Name is an entity, so it should have the Task Name as part of the entity name. Does that make sense?
Hi MistakenDragonfly51
Hello everyone! First, thanks a lot to everyone that made ClearML possible,
❤
To your questions 🙂
long story short, no, unless you really want to compile the dockers yourself, and I can't see a real upside here
Yes, add the following volume mount: /opt/clearml.conf:/root/clearml.conf
here: https://github.com/allegroai/clearml-server/blob/5de7c120621c2831730e01a864cc892c1702099a/docker/docker-compose.yml#L154
and configure your host's "/opt/clearml.conf"
with ...
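For reference, the volume entry could look roughly like this in the relevant docker-compose service (a sketch; "agent-services" is just an example service name):

  agent-services:
    volumes:
      - /opt/clearml.conf:/root/clearml.conf   # host clearml.conf mounted into the container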
Hi JitteryRaven85
I have also deleted some hyper-params but they appear again when training starts.
Yes you cannot "delete" parameters, as any missing parameter is synced back (making sure you have a full log).
The problem is that when I clone an experiment and change the hyper params some change and some remain the same
Could you expand on which parameters stay the same? (obviously this should not happen)
So first, yes, I totally agree. This is why clearml-serving has a dedicated statistics module that creates histograms over time; we then push them into Prometheus and connect Grafana to it for dashboards and alerts.
To be honest, I would just use it instead of reporting manually, wdyt?
I would recommend reading this blog post, it should give you a glimpse of what can be built 🙂
https://medium.com/pytorch/how-trigo-built-a-scalable-ai-development-deployment-pipeline-for-frictionless-retail-b583d25d0dd
BTW: if you only need the git diff, you can just copy it from the UI into a txt file and run "git apply <copied-diff.txt>"
But my previous ques and other query are still not figured out.
What do you mean by "previous ques and other query" ?
Tested with two sub folders, seems to work.
Could you please test with the latest RC: pip install clearml==0.17.5rc4
I'm glad to hear 🙂
If you can reproduce it, let me know
GiddyTurkey39 do you mean to delete them from the server?