SuperiorPanda77 I have to admit, not sure what would cause the slowness only on GCP ... (if anything, I would expect the network infrastructure to be faster)
Hi CharmingPuppy6
Basically yes there is.
The way clearml is designed is to have queues abstract different types of resources. For example, a queue for single-GPU jobs (let's name it "single_gpu") and a queue for dual-GPU jobs (let's name it "dual_gpu").
Then you spin up agents on machines and have the agents pull jobs from specific queues based on the hardware they have. For example, we can have a 4-GPU machine with 3 agents, one agent connected to 2xGPUs and pulling Tasks from the "dual_gpu...
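For illustration, a minimal sketch of what that layout could look like on the 4-GPU machine (the queue names, GPU indices, and the --detached flag are my own example, not from the thread):

# one agent serves the "dual_gpu" queue with two GPUs,
# two agents serve the "single_gpu" queue with one GPU each
clearml-agent daemon --gpus 0,1 --queue dual_gpu --detached
clearml-agent daemon --gpus 2 --queue single_gpu --detached
clearml-agent daemon --gpus 3 --queue single_gpu --detached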
Do you have any experience and things to watch out for?
Yes, for testing start with cheap node instances 🙂
If I remember correctly everything is preconfigured to support GPU instances (aka nvidia runtime).
You can take one of the templates from here as a starting point:
https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/
Should not be complicated, it's basically here
https://github.com/allegroai/clearml/blob/1eee271f01a141e41542296ef4649eeead2e7284/clearml/task.py#L2763
wdyt?
UnevenDolphin73
fatal: could not read Username for '': terminal prompts disabled ... fatal: clone of '' into submodule path '/root/.clearml/vcs-cache/xxx.60db3666b11ac2df511a851e269817ef/xxx/xxx' failed
It seems it tries to clone a submodule and fails due to missing keys for the submodule.
https://stackoverflow.com/questions/7714326/git-submodule-url-not-including-username
wdyt?
Also what do you have in the "Configuration" section of the serving inference Task?
this?
ids = [t.id for t in top_task]
agent.package_manager.system_site_packages can be used to inherit packages
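For reference, a minimal sketch of how that key could be set in clearml.conf (the surrounding layout simply follows the default config file):

agent {
    package_manager {
        # inherit packages already installed in the system / base environment
        system_site_packages: true
    }
}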
Correct, it is basically venv with --system-site-packages
I do not think virtualenv nesting is supported; if it were, then in theory you could have executed the clearml-agent from a virtual environment with system_site_packages turned on and it would inherit from it. But again, I'm not sure virtualenv supports it.
BTW: the latest clearml-agent RC already has venv caching (both pip/conda) 🙂
You mean to add these two to the model when deploying?
│   ├── model_NVIDIA_GeForce_RTX_3080.plan
│   └── model_Tesla_T4.plan
Notice the preprocess.py is not running on the GPU instance; it is running on a CPU instance (technically not the same machine)
But I do not know how it can help me:(
In your code itself, after the Task.init call, add:
task.set_initial_iteration(0)
See reply here:
https://github.com/allegroai/clearml/issues/496#issuecomment-980037382
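Put together, a minimal sketch (project and task names are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="resume training")
# reset the reported iteration offset so scalars start from iteration 0
task.set_initial_iteration(0)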
PungentLouse55 could you test again with the latest from GitHub? I think the issue should be solved:
pip install git+
i would like to have it also save on the bucket
oh if this is the case, you can just change the clearml file server to point to a GS bucket, everything will be stored there.
Just change your clearml.conf:
files_server: ""
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/docs/clearml.conf#L10
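For example, that line could point at a GS bucket like this (the bucket name is a placeholder, not from the thread):

files_server: "gs://my-bucket/clearml"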
Hi RipeAnt6
What would be the best way to add another model from another project say C to the same triton server serving the previous model?
You can add multiple calls to clearml-serving, each one with a new endpoint and a new project/model to watch; when you launch it, it will set up all endpoints on a single Triton server (the model optimization loading is taken care of by Triton anyhow)
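As an illustration only, with the current clearml-serving v2 CLI the extra model could be added roughly like this (the service id, endpoint, project/model names, and tensor shapes are all placeholders, and older versions use different flags):

clearml-serving --id <service_id> model add --engine triton \
    --endpoint "model_c" --project "Project C" --name "model C" \
    --input-size 1 28 28 --input-name "INPUT__0" --input-type float32 \
    --output-size -1 10 --output-name "OUTPUT__0" --output-type float32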
Thanks JuicyFox94 for letting us know.
I'm checking what's the status with it
data is going to S3 as well as EBS. Why? It should only go to S3
This sounds odd; if this is mounted then it goes to S3 (the link will point to the file server, but it will be stored on the mounted drive, i.e. S3)
wdyt?
could you send the entire log here?
i.e. from the "docker-compose" command line and onward
@<1545216070686609408:profile|EnthusiasticCow4>
Is there currently a way to bind the same GPU to multiple queues? I believe the agent complains last time I tried (which was a bit ago)
Run multiple agents on the same GPU:
CLEARML_WORKER_NAME=host-gpu0a clearml-agent daemon --gpus 0
CLEARML_WORKER_NAME=host-gpu0b clearml-agent daemon --gpus 0
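If each of those agents should also pull from its own queue, the --queue flag can be added as well (queue names here are hypothetical):

CLEARML_WORKER_NAME=host-gpu0a clearml-agent daemon --gpus 0 --queue queue_a
CLEARML_WORKER_NAME=host-gpu0b clearml-agent daemon --gpus 0 --queue queue_b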
Follow-up question: how does clearML "inject" the argparse arguments before the task is initialized?
it patches the actual parse_args call; to make sure it works, you just need to make sure clearml was imported before the actual call takes place
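A minimal sketch of that ordering (project, task, and argument names are placeholders):

import argparse
from clearml import Task

# clearml is imported (and the Task created) before parse_args is called,
# so the patched parse_args can inject values coming from the Task
task = Task.init(project_name="examples", task_name="argparse demo")

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
args = parser.parse_args()  # overridden values are injected here when run by an agent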
I had to do another workaround, since when torch.distributed.run called its ArgumentParser, it was getting the arguments from my script (and from my task) instead of the ones I passed it
Are you saying...
DefeatedOstrich93 can you verify lightning actually only stored it once?
GentleSwallow91 notice this part:
Hi Martin. Sorry - missed your reply.
Yeap I am aware that docker_internal_mounts is inside agent section.
'-v', '/tmp/ssh-XXXXXXnfYTo5/agent.8946:/tmp/ssh-XXXXXXnfYTo5/agent.8946', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-XXXXXXnfYTo5/agent.8946',
It is creating a copy of the ssh folder and setting the SSH_AUTH_SOCK env to it. You can just map the entire ssh folder automatically by un-setting SSH_AUTH_SOCK before running the agent:
SSH_AUTH_SOCK= clearml-agent ...
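For instance, a sketch of the full command under that suggestion (the --docker / --queue arguments are just an example):

# clearing SSH_AUTH_SOCK makes the agent map the ~/.ssh folder into the container
# instead of forwarding the ssh-agent socket
SSH_AUTH_SOCK= clearml-agent daemon --docker --queue default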
These are the prerequisites for the docker service installed on the host machine (where the agent is running)
Basically follow: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
https://docs.docker.com/compose/gpu-support/
Could not locate channel name 'gg_clearml'
CheerfulGorilla72 these are the permissions:
https://github.com/allegroai/clearml/blob/427b98270cc846b5d7e4af49f9732e3eb8d7d3ae/examples/services/monitoring/slack_alerts.py#L13
channels:join channels:read chat:write
My use case is: when I have a merge request for a model modification, I need to provide several pieces of information for our Quality Management System, one of which is to show that the experiment is a success and the model has some improvement over the previous iteration.
Sounds like a good approach 🙂
Obviously I don't want the reviewer to see all ...
Maybe publish the experiment and move it to a dedicated folder? Then even if they see all other experiments, they are under "development" p...
Hi @<1523703472304689152:profile|UpsetTurkey67>
I circumvented the problem by putting timestamp in task name, but I don't think this is necessary.
Just pass reuse_last_task_id=False to Task.init, and it will never try to reuse them 🙂
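A minimal sketch of that call (project and task names are placeholders):

from clearml import Task

# always create a new Task instead of reusing the previous unused one
task = Task.init(
    project_name="examples",
    task_name="my experiment",
    reuse_last_task_id=False,
)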