Could you test if this is working:
https://github.com/allegroai/clearml/blob/master/examples/reporting/matplotlib_manual_reporting.py
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
JitteryCoyote63 What do you mean by that?
Hmmm, make sure the task doing the cloning is using 0.16.1 or above, because in 0.16 we added sections and compatibility is tied to the version. Meaning, if you have tasks generated with trains 0.16, you need trains 0.16 to clone them from code (so you can properly control the arguments).
The cool thing about using the trains-agent is that you can change any experiment parameter and automate the process, so you get hyper-parameter optimization out of the box, and you can build complicated pipelines:
https://github.com/allegroai/trains/tree/master/examples/optimization/hyper-parameter-optimization
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
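As a minimal sketch of that kind of automation (project, task and queue names here are just placeholders; with the older trains package simply swap the import):
```
from clearml import Task

# Take an existing (template) experiment and clone it
template = Task.get_task(project_name="examples", task_name="my_experiment")
cloned = Task.clone(source_task=template, name="my_experiment (lr=0.01)")

# Override any hyper-parameter before execution (sections apply from 0.16 on, e.g. "Args/...")
cloned.set_parameter("Args/learning_rate", 0.01)

# Enqueue it; an agent listening on that queue will execute it
Task.enqueue(cloned, queue_name="default")
```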
Dynamic GPU option only available with Enterprise version right?
Correct 🙂
Quick update: Nexus supports direct http upload, which means that, as CostlyOstrich36 mentioned, just pointing to the Nexus http upload endpoint would work:
output_uri="http://<nexus>:<port>/repository/something/"
See docs:
https://support.sonatype.com/hc/en-us/articles/115006744008-How-can-I-programmatically-upload-files-into-Nexus-3-
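For example, a minimal sketch (project/task names and the repository path are just placeholders):
```
from clearml import Task

# Anything uploaded by the task (models/artifacts) goes to the Nexus http upload endpoint
task = Task.init(
    project_name="examples",
    task_name="nexus upload",
    output_uri="http://<nexus>:<port>/repository/something/",
)
```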
Since this fix is all about synchronizing different processes, we wanted to be extra careful with the release. That said I think that what we have now should be quite stable. Plan is to have the RC available right after the weekend.
One last thing: make sure you spin the pod container in privileged mode, because the trains-agent docker will spin a sibling docker for your actual experiment.
How so? They are in one place; the creation of the venv is transparent, and the packages there are everything you have in the docker, plus the ability to override them from the UI.
What am I missing here?
CluelessFlamingo93 I would also fix the pip version requirements to:
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
Regarding the first direction, this was just pushed 🙂
https://github.com/allegroai/clearml/commit/597a7ed05e2376ec48604465cf5ebd752cebae9c
Regarding the opposite direction:
That is a good question, I really like the idea of just adding another section named Datasets
SucculentBeetle7 should we do that automatically?
Hi RoughTiger69
I'm actually not sure about DVC support as well; see these links: syncing and registering is a link, not creating an immutable copy.
And the sync between the local and remote seems like it is downloading the remote and comparing it to the local copy.
Basically, adding a remote source does not mean DVC will create an immutable copy of the content; it's just a pointer to a bucket (feel free to correct me if I misunderstood their capability)
https://dvc.org/doc/command-reference/...
Oh I see, this seems like a Triton configuration issue; usually dim -1 means flexible. I can also mention that serving 1.1 should be released later this week with better multiple-input support for Triton. Does that make sense?
Hi IrritableGiraffe81
Yes it deploys all ClearML (including web).
ClearML-serving unfortunately is a bit more complicated to spin, as it needs actual compute nodes.
That said, we are working on making it a lot easier 🙂
Are you doing from keras import ... or from tensorflow.keras import ... ?
Hmm, I think it is this line:
WARNING - Model configuration only supports dictionary or string objects
done
Let me check something.
SarcasticSquirrel56
if I configure manually the pods for the different nodes, how do I make clearml server aware that those agents exist?
Basically the agents register themselves on your clearml-server, and they register which Queue(s) they listen to. In other words, the interface for choosing the different types of machines/GPUs is enqueuing the Task to different queues.
For example: Queue(1): "CUDA11_GPUx1" , Queue(2): "CUDA10_GPUx1"
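Each agent registers on its queue when you start it (e.g. clearml-agent daemon --queue CUDA11_GPUx1), so from code you pick the machine type just by choosing the queue. A quick sketch (the task id is a placeholder):
```
from clearml import Task

# Send an existing experiment to the queue served by the CUDA11 machines
task = Task.get_task(task_id="<task_id>")
Task.enqueue(task, queue_name="CUDA11_GPUx1")
```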
Make sense ?
EDIT:
I guess to achieve what I w...
Would this be best if it were executed in the Triton execution environment?
It seems the issue is unrelated to the Triton ...
Could I use the clearml-agent build command and the Triton serving engine task ID to create a docker container that I could then use interactively to run these tests?
Yep, that should do it 🙂
I would start simple, no need to get the docker itself; it seems like a clearml credentials issue?!
With the warning ?
I was able to reproduce it on the old versions, but it seems fixed on the latest from GitHub.
@<1523704157695905792:profile|VivaciousBadger56> regarding: None
Is this a discussion or PR ?
(general ranting is saved for our slack channel 🙂)
So clearml server already contains an authentication layer (JWT Token), and you do have full user management on top:
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#web-login-authentication
Basically what I'm saying is: if you add httpS on top of the communication and only open the 3 ports, you should be good to go. Now if you really need SSO (AD included) for user login etc., unfortunately this is not part of the open source, but I know they have it in the scale/ent...
It may have been killed or evicted or something after a day or 2.
Actually the ideal setup is to have a "services" pod running all these services on a single pod, with clearml-agent --services-mode. This Pod should always be on and pull jobs from a dedicated queue.
Maybe a nice way to do that is to have the single Task serialize itself, then have a Pod run the Task every X hours and spin it down.
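As a rough sketch of the "every X hours" part (the task id and queue name are placeholders; you could also use the scheduling utilities instead of a plain loop):
```
import time
from clearml import Task

TEMPLATE_TASK_ID = "<task_id>"  # the serialized Task to re-run
QUEUE = "services"              # queue the services-mode agent listens on
HOURS = 6

while True:
    # Clone the template and enqueue it; the services pod / agent will execute it
    cloned = Task.clone(source_task=TEMPLATE_TASK_ID, name="periodic run")
    Task.enqueue(cloned, queue_name=QUEUE)
    time.sleep(HOURS * 60 * 60)
```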
So I would like to know what it sends to the server to create the task/pipeline, ...
Sure GiddyTurkey39, check out the cleanup service:
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py
Please hit Ctrl-F5 to refresh the entire page, and see if it is still empty...
We actually plan to create different queues for different types of workloads; we are still figuring out what the actual usage is to define what types of workloads make sense for us.
That sounds like a great path to take, it will make it very clear for users what they will be getting and why they should use a specific queue.
As for the memory, yes, the reasoning is clear, the main thing we'll have to see is how to define the limits, because we have nodes with quite different resources availab...
I think it would make sense to have one task per run to make the comparison of hyper-parameters easier
I agree. Could you maybe open a GitHub issue on it? I want to make sure we solve this issue 🙂
It is available of course, but I think you have to have clearml-server 1.9+
Which version are you running ?
Okay I found it, this is due to the fact that the newer versions are sending the events/images in a subprocess (it used to be a thread).
The creation of the object is done on the main process, updating the file index (in a round-robin manner), but the check itself happens in the subprocess, which is not "aware" of the used indexes (i.e. it is always 0, hence when exceeding the history size, it skips it)