Transform feature engineering and data processing code into recurring data ingestion workflows. Start building data stores, develop, automate, and schedule complex data processing jobs.
Yeah that'll cover the first two points, but I don't see how it'll end up as a dataset catalogue as advertised.
They don't have the same version. I've noticed that if the client is using Python 3.8, remote execution will try to use that same version even though the docker image doesn't have it installed.
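If it helps, the agent can be pinned to the interpreter baked into the image via clearml.conf; a minimal sketch, assuming the image ships its interpreter at the path below:
` agent {
    # use the image's own interpreter instead of matching the client's version
    python_binary: "/usr/bin/python3.6"  # assumption: adjust to the image's actual path
} `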
Hi, so you meant I need to install virtualenv in my base image?
Hi, it looks like the entire http://clear.ml domain has been offline for more than 12 hours. The main pages and documentation are inaccessible as well.
Hi, this is the log. I didn't see any attempt by the agent to install virtualenv in the base image.
` 1618369068169 clearml-gpu-id-b926b4b809f544c49e99625380a1534b:gpuGPU-4ad68290-0daf-4634-6768-16fad73d47a3 DEBUG Current configuration (clearml_agent v0.17.2, location: /tmp/.clearml_agent.wgsmv2t9.cfg):
agent.worker_id = clearml-gpu-id-b926b4b809f544c49e99625380a1534b:gpuGPU-4ad68290-0daf-4634-6768-16fad73d47a3
agent.worker_name = clearml-gpu-id-b926b4b809f544c49e99625...
Congrats on v1.0. 🎉
The first is probably done using pipeline controllers, the second using Datasets or HyperDatasets. It's not very clear how the last one is achieved, especially the searchable data catalogs.
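On the pipelines point, a minimal sketch of chaining two already-registered tasks with a pipeline controller (project and task names are placeholders, and constructor arguments vary a bit across clearml versions):
` from clearml.automation import PipelineController

# chain two existing tasks into one pipeline (names below are placeholders)
pipe = PipelineController(default_execution_queue="default")
pipe.add_step(name="ingest", base_task_project="demo", base_task_name="ingest-data")
pipe.add_step(name="train", parents=["ingest"],
              base_task_project="demo", base_task_name="train-model")
pipe.start()
pipe.wait() `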
Hi, Self-hosted using docker-compose.
Ok. The problem was resolved with the latest versions of clearml-agent and clearml.
In the ClearML config that's being run by the ClearML container?
Thanks. The challenge we encountered is that we only expose our Devs to the ClearML queues, so users have no idea what's beyond the queue except that it will offer them the resources associated with the queue. In the backend, each queue is associated with more than one host.
So what we tried is as follows.
We create a train.py script much like what Tobias shared above. In this script, we use the socket library to pull the IP address:
` import socket
hostname = socket.gethostname()
ipaddr = dock... `
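The last line is truncated above; a minimal sketch of one way that lookup could go, assuming a plain DNS resolution of the container's hostname:
` import socket

# resolve this host's name, then look up its IP address
hostname = socket.gethostname()
ipaddr = socket.gethostbyname(hostname)  # assumption: hostname resolves locally
print(hostname, ipaddr) `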
From a ClearML perspective, how would we enable this, considering we don't have direct control over the agents, or even their IPs?
Ok, I'll wait till I get my hands on the vault then. Thanks.
Thanks SuccessfulKoala55. I verified your last comment and it works.
Yeah... the issue is that ClearML is unable to talk to the nodes, because PyTorch distributed needs to know their IPs. There is some sort of integration missing that would enable this.
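For context, this is roughly what PyTorch distributed expects at startup; a minimal sketch (the address, port, rank, and world size are placeholders):
` import os
import torch.distributed as dist

# every rank must be able to reach the master node at a known IP and port
os.environ["MASTER_ADDR"] = "10.0.0.1"  # placeholder: master node IP
os.environ["MASTER_PORT"] = "29500"     # placeholder: any open port
dist.init_process_group(backend="nccl", rank=0, world_size=2) `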
Sorry, by "dev end" I was referring to my developers.
I didn't think Horovod needs to be as complicated as you described. It can also work by running on multiple known nodes. How would I add glue for multinode?
Horovod does also work with other similar products such as yours (e.g. Polyaxon).
I think a related question is: ClearML relies heavily on Triton (a good thing), but Triton only supports a few frameworks out of the box. So this 'engine' needs to make sure it can work with Triton and use all its wonderful features such as request batching, GPU reuse, etc.
I used the nvcr PyTorch image and instructed clearml to inherit global dependencies. No need to install torch, and it works well.
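For anyone hitting the same thing, a minimal sketch of that setup in clearml.conf, assuming "inherit global dependencies" maps to the system-site-packages setting (the image tag is just an example):
` agent {
    # let the task environment see packages preinstalled in the docker image
    package_manager.system_site_packages: true
    default_docker.image: "nvcr.io/nvidia/pytorch:21.03-py3"  # example nvcr tag
} `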
Yes! I definitely think this is important, and hopefully we will see something there (or at least in the docs).
Hi AgitatedDove14, any updates in the docs to demonstrate this yet?
Next step is to figure out if I can do all that in the Python code instead of the UI.
Hi, it did: nvidia/cuda:10.1-runtime-ubuntu18.04.
So if I need to set this every time, what is the following config for? And how do I pass in new env parameters?
` default_docker: {
    # default docker image to use when running in docker mode
    image: "dockerrepo/mydocker:custom"
    # optional arguments to pass to docker image
    # arguments: ["--ipc=host", ]
    arguments: ["--env GIT_SSL_NO_VERIFY=true",]
} `
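On passing new env parameters, one option is setting the docker command per task from code; a minimal sketch, assuming Task.set_base_docker accepts the extra docker arguments in the same string (the image name is a placeholder):
` from clearml import Task

task = Task.init(project_name="examples", task_name="docker-env-demo")
# image plus extra docker run arguments in a single string (placeholder image)
task.set_base_docker("dockerrepo/mydocker:custom --env GIT_SSL_NO_VERIFY=true") `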
Ok thanks, that worked.
No issues. I know it's hard to track open threads with Slack. I wish there was a plugin for this too. 🙂
Any idea where i can find the relevant API calls for this?
Oh, this means I have been using the latest agent, which is v1.0.0. The problems were still there.
Unfortunately, due to security, clients can't have direct access to the nodes. Are there any possible workarounds at the moment?
Thanks. Have a better understanding now.
Like create multiple datasets?
create parent (all) - upload to S3
create child1 (first 100k)
create child2 (second 100k), and so on
Then only pull the needed indices from the children. Technically workable, but not sure if it's the best approach since different people have different batch sizes in mind.
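A minimal sketch of that parent/child layout with the clearml Dataset API (the project name, bucket, and subset-selection logic are placeholders):
` from clearml import Dataset

# parent dataset holding everything, stored on S3 (placeholder bucket)
parent = Dataset.create(dataset_project="demo", dataset_name="all")
parent.add_files("data/")                      # local folder with all samples
parent.upload(output_url="s3://my-bucket/ds")  # placeholder bucket
parent.finalize()

# child referencing the parent; file selection for the first 100k omitted
child1 = Dataset.create(dataset_project="demo", dataset_name="first-100k",
                        parent_datasets=[parent.id])
child1.finalize() `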
Hi SuccessfulKoala55, is there a channel here that posts version updates?