Sure thing, any specific reason for querying on multi pod per GPU?
Is this for a remote development process?
BTW: the funny thing is, on bare metal machines multi GPU works out of the box, and deploying it with bare metal clearml-agents is very simple
PleasantOwl46 any chance there are subprojects under the requested project?
So basically development on a "shared" GPU?
StaleMole4 you are printing the values before Task.init had a chance to populate them.
Basically try moving the print to after closing the Task (closing the task waits for the async update to complete)
Make sense?
Also I would suggest using Task.execute_remotely
https://clear.ml/docs/latest/docs/references/sdk/task#execute_remotely
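For example, a minimal sketch (project/queue names and the connected dict are just illustrative):

from clearml import Task

task = Task.init(project_name="examples", task_name="param check")
params = {"batch_size": 32}
task.connect(params)  # when executed by an agent, the backend may override these values

task.close()          # closing waits for the async update to finish
print(params)         # print after close, so the populated values are visible

# to send the same script to an agent instead of running it locally:
# task.execute_remotely(queue_name="default", exit_process=True)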
Hi IrritableGiraffe81
Yes it deploys all ClearML (including web).
ClearML-serving unfortunately is a bit more complicated to spin, as it needs actual compute nodes.
That said we are working on making it a lot easier 🙂
I have the problem that "debug samples" are not shown anymore after running many iterations.
ReassuredTiger98 could you expand on it? What do you mean by "not shown anymore" ?
Can you see other reports ?
So net-net does this mean it's behaving as expected,
It is as expected.
If no "Installed Packages" are listed, then it cannot pull a cached venv (because requirements.txt is not a full env, and it never analyzed it)).
It does however create a venv cache based on it (after installing it)
The clone of this Task (i.e. right click on the experiment in the UI, clone it, enqueue it) will use the cached copy because the full packages are listed in the "Installed Packages" section of the Task.
Make sense...
Hi MagnificentSeaurchin79
This means the tensorflow was not directly imported in the repository (which is odd; it might point to the auto package analysis failing to find the package, if this is the case please let me know)
Regardless, if you need to make sure a package is listed in the requirements, either import it or use Task.add_requirements('tensorflow')
or Task.add_requirements('tensorflow', '2.3.1')
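A minimal sketch (note that add_requirements needs to be called before Task.init; project/task names are illustrative):

from clearml import Task

Task.add_requirements("tensorflow", "2.3.1")  # must come before Task.init
task = Task.init(project_name="examples", task_name="tf training")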
BTW: there is a full Pipeline class that does everything for you, example here:
https://github.com/allegroai/clearml/tree/master/examples/pipeline
Hi AstonishingWorm64
I think you are correct, there is no external interface to change the docker.
Could you open a GitHub issue so we do not forget to add an interface for that ?
As a temp hack, you can manually clone the "triton serving engine" Task and edit the container image (under the Execution tab).
wdyt?
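If the UI route is inconvenient, a rough SDK equivalent could look like this (the serving engine task ID, the docker image and the queue name are assumptions, adjust to your setup):

from clearml import Task

serving = Task.get_task(task_id="<triton_serving_engine_task_id>")   # hypothetical ID
cloned = Task.clone(source_task=serving, name="triton serving engine - custom image")
cloned.set_base_docker("nvcr.io/nvidia/tritonserver:22.07-py3")      # pick your container image
Task.enqueue(cloned, queue_name="services")                           # assumption: the services queue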
UnsightlySeagull42 the assumption is that the agent has a read-only all access user.
At the moment there is no way to configure a different user/pass per repository in the clearml.conf
You can however:
1. Embed the user/pass in the repository link (not very secure)
2. Use an ssh-key and have it under .ssh on the host machine
3. Use .git-credentials and configure them (with per-project user/pass)
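The single "all access" user mentioned above normally lives in the agent's clearml.conf, roughly like this (values are placeholders):

# clearml.conf on the agent machine
agent {
    git_user: "ci-readonly-user"
    git_pass: "token-or-password"
    # leave these empty to fall back to ~/.ssh keys or ~/.git-credentials
}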
Hi RipeGoose2
Just to clarify, the issue with the html stuck in cache is a UI thing: basically the webapp needs to tell the browser not to cache the artifacts; it has nothing to do with how the artifacts are created.
Regardless, we love improvements, so feel free to mess around with the code and PR once you get something useful 🙂
Specifically this is where the html conversion happens
https://github.com/allegroai/clearml/blob/9d108d855f784e1fe7f5691d3b7bf3be64576218/clearml/backend_in...
Yes, that sounds like the issue, is the file actually there ?
Any chance you can open a GitHub issue so we do not forget this feature ?
in ... issues a delete command to the ClearML API server, ...
almost, it issues the boto S3 delete commands (directly to the S3 server, not through the clearml-server)
And that I need to enter an AWS key/secret in the profile page of the web app here?
correct
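For reference, the "boto S3 delete" it performs boils down to something like this (bucket, key and credentials are placeholders):

import boto3

s3 = boto3.client("s3", aws_access_key_id="<key>", aws_secret_access_key="<secret>")
s3.delete_object(Bucket="my-clearml-artifacts", Key="path/to/artifact.pkl")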
But once I see it on the UI it means it is already launched somewhere, so I didn't quite get you.
The idea is you run it locally once (think debugging your code, or testing it)
While running the code the Task is automatically created, then once in the system you can clone / launch it.
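Once the Task is in the system, the clone / launch can also be done programmatically, along these lines (project, task and queue names are placeholders):

from clearml import Task

original = Task.get_task(project_name="examples", task_name="my experiment")
cloned = Task.clone(source_task=original, name="my experiment - remote run")
Task.enqueue(cloned, queue_name="default")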
Also, I want to launch my experiments on a kubernetes cluster and I don't actually have any docs on how to do that, so an example would be helpful here.
We are working on documenting the full process, ...
I can't think of any hack that will satisfy your IT other than an actual vault...
wdyt?
Hmm good question, I'm actually not sure if you can pass 24GB (this is not a limit on the GPU memory, this affects the memblock size, I think)
SubstantialElk6 (2) yes definitely will be fixed
Regarding (1), what do you mean by "via the code"? Do you mean like as a Task docker cmd?
It seems to follow a structure specific to clearml,
Actually plotly.js 🙂
Hi @<1559711593736966144:profile|SoggyCow20>
I would first like to say how amazing clearml is!
Thank you! 🙂
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
yes sdk.agent.default_docker.image = python:3.10.0-alpine
should be agent.default_docker.image = python:3.10.0-alpine
Notice the scope is agent, not sdk
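i.e. in clearml.conf it would look roughly like:

# clearml.conf
agent {
    default_docker {
        image: "python:3.10.0-alpine"
    }
}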
Try the following example.env:
CLEARML_SERVING_PORT=9090
CLEARML_WEB_HOST="http://<IP>:8080"
CLEARML_API_HOST="http://<IP>:8008"
CLEARML_FILES_HOST="http://<IP>:8081"
(I think the localhost is resolved to inside the container and not the host machine, hence the error)
DistressedGoat23
We are running a hyperparameter tuning (using some cv) which might take a long time and might be even aborted unexpectedly due to machine resources.
We therefore want to see the progress
On the HPO Task itself (not the individual experiments, but the one controlling it all) there is the global progress of the optimization metric. Is this what you are looking for? Am I missing something?
Hi TeenyFly97
Can I super-impose the graphs while comparing experiments?
Hmm not at the moment, I think someone asked for the option to control it, in both comparison mode and "standalone" mode.
There is a long discussion on this feature here:
https://github.com/allegroai/trains/issues/81#issuecomment-645425450
Feel free to chime in 🙂
I think that the latest agreement is a switch in the UI, separating or collecting (super-imposing) those graphs.
Hi GleamingGrasshopper63
How well can the ML Ops component handle job queuing on a multi-GPU server
This is fully supported 🙂
You can think of queues as a way to simplify resource allocation for users (you can do more than that, but let's start simple)
Basically you can create a queue per type of GPU, for example a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then when you spin up the agents, per type of machine, you attach each agent to the "correct" queue.
Int...
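For example, two agents on the same multi-GPU box, each serving a different queue (queue names follow the example above):

# agent handling single-GPU jobs on GPU 0
clearml-agent daemon --queue on_prem_1gpu --gpus 0 --docker --detached
# agent handling two-GPU jobs on GPUs 1 and 2
clearml-agent daemon --queue on_prem_2gpus --gpus 1,2 --docker --detached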
it worked!!!!
YEY!
I pass the IDs to the docker container as environment variables, so this does need a restart of the docker container, but I guess we can live with that for now
So this would help you decide which actual Model file to download? (trying to understand how the argument is being used; meaning, should we have it stored somewhere? There is meta-data on the Model itself, so we can use that to store the data)
Hi FrothyShark37
Can you verify with the latest version?
pip install -U clearml