Any news on this bug?
Oh, OK. So I can actually remove this if those workers are no longer in use.
Ideally, I want to avoid reinventing the wheel, so if this functionality already exists with some examples, it would be great if someone could point me to it.
Thanks Ariel, will give it a watch now
We resolved this issue by developing a package that handles connecting to and querying databases. This package is then used inside the task to pull data from the data warehouse. There is a DevOps component here for authorising access to the relevant secret (we used Secrets Manager on AWS): the clearml-agent instances are launched with role permissions that allow access to the relevant secrets. Hope that is helpful to you.
Where are you storing your secret, JitteryCoyote63?
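A minimal sketch of the task-side lookup, assuming the credentials are stored as a JSON key/value secret. The helper `get_db_credentials` and the secret name are hypothetical; the boto3 client is passed in so the function can be exercised without AWS access.

```python
import json
from typing import Any, Dict


def get_db_credentials(secrets_client: Any, secret_id: str) -> Dict[str, Any]:
    """Fetch a JSON key/value secret (e.g. warehouse credentials) from AWS Secrets Manager.

    `secrets_client` is expected to be boto3.client("secretsmanager"). On an
    agent launched with an IAM role, boto3 resolves the role's credentials
    itself, so no access keys appear in the task code.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    # Key/value secrets are returned as a JSON string under "SecretString".
    return json.loads(response["SecretString"])


# Usage inside a task (secret name is hypothetical):
#   import boto3
#   creds = get_db_credentials(boto3.client("secretsmanager"), "data-warehouse/readonly")
```

Keeping the lookup inside the shared package means individual tasks never handle credentials directly; access is governed entirely by the instance role.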
Yes, it does 🙂 I suspected this was the process. Thanks Jake. One last question, more about the architecture design: is it advisable to run the clearml server instance and a 'worker' instance listening to the queue on separate remote machines, or can I use the same instance for the web UI and as a worker? I understand that processing pipelines may be compute-intensive enough to consume all resources and break the web UI, but I was wondering whether using a single large instance is a po...
Thanks AnxiousSeal95, will check it out! 🙂
Awesome, thank you Jake! Very helpful. For a lot of the models we run, we do not require GPU resources, so it's good to know that a beefy instance should be able to run the experiments.
I thought nothing should be stored locally on the agent? Shouldn't all files be logged to the storage rather than the instance itself?
@RoundCat60 Hey Alex, could you take a look at this when you're free later on, please?
As in, an object directly from memory, without having to export the file first. I thought boto3 could handle this, but looking at the docs again, it doesn't look like it. "File-like objects" is their term, so maybe an export is required.
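For what it's worth, a "file-like object" does not have to be a file on disk: an in-memory `io.BytesIO` buffer qualifies, so boto3's `upload_fileobj` can take an object serialised in memory. A minimal sketch, assuming pickle is an acceptable serialisation format; the helper name is hypothetical.

```python
import io
import pickle
from typing import Any


def upload_object_in_memory(s3_client: Any, obj: Any, bucket: str, key: str) -> int:
    """Serialise `obj` into an in-memory buffer and upload it without touching disk.

    `s3_client` is expected to be boto3.client("s3"); its upload_fileobj()
    accepts any readable binary file-like object, which io.BytesIO is.
    Returns the number of bytes uploaded.
    """
    buffer = io.BytesIO()
    pickle.dump(obj, buffer)
    size = buffer.tell()
    buffer.seek(0)  # rewind so the upload reads from the start of the buffer
    s3_client.upload_fileobj(buffer, bucket, key)
    return size
```

`put_object(Bucket=..., Key=..., Body=...)` also accepts raw bytes, which is a simpler option for small objects.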
Another update: the tasks run fine and install the packages from the correct index URL. However, by default, py_db @ git .. is added in the installed packages panel. Could this be from a requirements.txt file somewhere? To get it to work, I have to remove the @ git part, and then it works. It's just very strange that it defaults to a git pip install 🤔
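The manual fix (dropping the `@ git` part) can be scripted if it has to be repeated. A minimal sketch, assuming the installed-packages list is available as plain requirement lines; the helper is hypothetical and is not a ClearML feature.

```python
import re
from typing import List

# Matches PEP 508 direct references such as "py_db @ git+ssh://...".
_DIRECT_REF = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*(?:\[[^\]]*\])?)\s*@\s*\S+")


def strip_vcs_references(requirements: List[str]) -> List[str]:
    """Replace 'name @ <url>' entries with a bare 'name' so pip resolves them from the index.

    Anything after the URL (e.g. environment markers) is dropped as well.
    """
    cleaned = []
    for line in requirements:
        match = _DIRECT_REF.match(line)
        cleaned.append(match.group(1) if match else line)
    return cleaned
```

Lines without a direct reference, such as `numpy==1.21.0`, pass through unchanged.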
The only downside, which is not related to ClearML, is that CodeArtifact authorisation tokens have a minimum lifespan of 15 minutes. Setting up envs before task execution usually takes less than a couple of minutes, so the token lingers in the background. Nonetheless, everything works as expected!
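If the token is requested from Python rather than through `aws codeartifact login`, its lifetime can be pinned to that 15-minute floor explicitly. A minimal sketch with boto3, assuming the accepted range is 900 to 43,200 seconds (0 is also accepted and ties the token to the caller's session); the helper name is hypothetical.

```python
from typing import Any, Optional

MIN_TOKEN_SECONDS = 900      # 15 minutes: the shortest non-zero duration assumed to be accepted
MAX_TOKEN_SECONDS = 43_200   # 12 hours


def fetch_codeartifact_token(client: Any, domain: str, seconds: int = MIN_TOKEN_SECONDS,
                             domain_owner: Optional[str] = None) -> str:
    """Request a short-lived CodeArtifact token, clamped to the assumed valid range.

    `client` is expected to be boto3.client("codeartifact").
    """
    seconds = max(MIN_TOKEN_SECONDS, min(seconds, MAX_TOKEN_SECONDS))
    kwargs = {"domain": domain, "durationSeconds": seconds}
    if domain_owner:
        kwargs["domainOwner"] = domain_owner
    return client.get_authorization_token(**kwargs)["authorizationToken"]
```

This does not remove the lingering period; it only ensures the token never outlives the minimum.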
```
Using SSH credentials - replacing https url '
' with ssh url '
' Replacing original pip vcs 'git+
' with '
'
Collecting py_db
Cloning ssh://@github.com/15gifts/py-db.git (to revision 851daa87317e73b4602bc1bddeca7ff16e1ac865) to /tmp/pip-install-zpiar1hv/py-db
Running command git clone -q 'ssh://@github.com/15gifts/py-db.git' /tmp/pip-install-zpiar1hv/py-db
2021-12-08 15:56:31
ERROR: Repository not found.
fatal: Could not read from remote repository.
Please...
```
I can authorise CodeArtifact if I SSH into the server, and install the private package with no issues. It seems like something is forcing clearml-agent to install it by cloning from GitHub rather than directly with pip. I'm not sure if this is a configuration I have set up myself, or whether the server is configured to do this.
I don't think we explicitly pass the package path to the agent. I expect it to run a regular pip install, but it seems to be doing it via git somehow.
This is included as part of the config file at ~/clearml.conf on the clearml-agent:
```
extra_docker_shell_script: [
    "apt-get install -y awscli",
    "aws codeartifact login --tool pip --repository data-live --domain ds-15gifts-code",
]
```
Not sure how to get a log from the CLI, but I can get the error from the ClearML server UI, one sec.
Oh, that may work. Are there any docs/demos on this?
Okay, solved the problem. It is using the version that is installed locally (on my laptop). Is there a way to prevent this? Perhaps a requirements.txt or something like that?
While we're here, how can I retrieve the model accuracy (or any performance metric, for that matter) for the model(s) belonging to a particular task? Is this information stored anywhere, or do I need to log this data explicitly?
In particular, I am trying to find a neat way to query all available models and use tags to know the context. As it stands, I log the model accuracies/RMEs as part of the metadata, alongside the training data filepath. The issue is that this is not a neat way of querying models across tasks without a lot of laborious manual work. Suggestions welcome.
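Once model records have been pulled into plain Python objects (however that is done; the `ModelRecord` shape below is hypothetical and is not ClearML's model class), filtering by tags and ranking on a logged metric is a short function. A minimal sketch, assuming the metric is stored as a metadata value that parses as a number.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class ModelRecord:
    """Hypothetical flattened view of a registered model."""
    model_id: str
    tags: List[str]
    metadata: Dict[str, object] = field(default_factory=dict)


def best_models(models: Iterable[ModelRecord], required_tags: Iterable[str],
                metric: str, higher_is_better: bool = True,
                top_k: Optional[int] = None) -> List[ModelRecord]:
    """Keep models carrying every required tag and a numeric `metric`, then rank them."""
    wanted = set(required_tags)
    scored = []
    for m in models:
        if not wanted.issubset(m.tags) or metric not in m.metadata:
            continue
        try:
            score = float(m.metadata[metric])
        except (TypeError, ValueError):
            continue  # skip entries whose metric was logged as non-numeric text
        scored.append((score, m))
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_better)
    ranked = [m for _, m in scored]
    return ranked if top_k is None else ranked[:top_k]
```

For error metrics such as RMSE, pass `higher_is_better=False` so the lowest value ranks first.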
We are planning to use Airflow as an extension of ClearML itself, for several tasks:
We want to isolate the data validation steps from the general training pipeline; the validation will be handled using some base logic plus more advanced validations using something like Great Expectations. Our training data will be a snapshot of the most recent 2 weeks, and this training data will be used across multiple tasks to automate the scheduling and execution of training pipelines periodically e...
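The two-week snapshot and the "base logic" validation could look roughly like this before anything is handed to Great Expectations. A minimal sketch in plain Python, assuming rows carry a timestamp field; the field name `event_time` and both helpers are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def two_week_snapshot(rows: Iterable[Row], now: datetime, ts_key: str = "event_time",
                      days: int = 14) -> List[Row]:
    """Keep rows whose timestamp falls in the half-open window [now - days, now)."""
    start = now - timedelta(days=days)
    return [r for r in rows if start <= r[ts_key] < now]


def basic_checks(rows: List[Row], required: Iterable[str]) -> Tuple[bool, List[str]]:
    """Cheap validation run before any heavier suite: non-empty and no missing required fields."""
    problems = []
    if not rows:
        problems.append("snapshot is empty")
    for col in required:
        missing = sum(1 for r in rows if r.get(col) is None)
        if missing:
            problems.append(f"{col}: {missing} missing value(s)")
    return (not problems, problems)
```

Passing `now` explicitly, rather than calling `datetime.now()` inside, keeps a scheduled run reproducible: re-running it for the same date yields the same snapshot.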
From what I can tell, Docker has some leakage here: temp files are not removed correctly, resulting in a build-up of disk usage.
See the following for more details:
https://stackoverflow.com/questions/46672001/is-it-safe-to-clean-docker-overlay2
https://forums.docker.com/t/some-way-to-clean-up-identify-contents-of-var-lib-docker-overlay/30604
https://docs.docker.com/storage/storagedriver/overlayfs-driver/
I'm going to write a clean-up script and add it to cron. I don't bel...
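The threads linked above generally advise against deleting files under `overlay2` by hand and recommend letting Docker reclaim the space itself. A minimal cron-able sketch along those lines, assuming the Docker CLI is on the PATH and that removing unused images, containers and networks older than 24 hours is acceptable on these hosts; the thresholds and helper names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def prune_command(older_than_hours: int = 24) -> List[str]:
    """Build a `docker system prune` call that only removes unused data older than the cutoff."""
    return ["docker", "system", "prune", "--all", "--force",
            "--filter", f"until={older_than_hours}h"]


def cleanup_if_needed(path: str = "/var/lib/docker", threshold: float = 0.8,
                      older_than_hours: int = 24) -> bool:
    """Prune through the Docker CLI (not by deleting overlay2 directories) once disk
    usage at `path` reaches `threshold`. Returns True if a prune was run."""
    usage = shutil.disk_usage(path)
    if usage.used / usage.total < threshold:
        return False
    subprocess.run(prune_command(older_than_hours), check=True)
    return True
```

From cron, something like `0 3 * * * python3 -c 'import docker_cleanup; docker_cleanup.cleanup_if_needed()'` would run it nightly (module name hypothetical). Volumes are deliberately left out of the prune.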
I think there is more complexity to what I am trying to achieve, but this will be a good start. Thanks!
I will need to log the dataset ID, the transformer (not the NN architecture, just a data transformer), the model (with all hyperparameters and metadata), etc., and how all these things link together.
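One way to make the links explicit is a single record per trained model that carries the IDs of everything upstream of it and is stored alongside the model (for example as an artifact or as metadata). A minimal sketch, assuming JSON is an acceptable storage format; the class and field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class LineageRecord:
    """Hypothetical record tying one trained model back to its inputs."""
    dataset_id: str
    transformer_name: str
    transformer_params: Dict[str, Any] = field(default_factory=dict)
    model_id: str = ""
    hyperparameters: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys keep the serialised form stable, which makes records easy to diff.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "LineageRecord":
        return cls(**json.loads(text))
```

Because every record names its dataset and transformer, "which models were trained on dataset X" becomes a simple filter over stored records.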
Locally or on the remote server?
Are the envs named after the worker enumeration? E.g., is venv-builds-0 linked to worker 0?
One question: you can also set agent.package_manager.extra_index_url, but since this URL is dynamic, will pip install still add the extra index URL from the pip config file? Or does it have to be set in this agent config variable?
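Because the token changes, the index URL has to be regenerated each time. A minimal sketch of building it in Python, assuming the URL format that `aws codeartifact login --tool pip` writes (the token embedded as basic-auth credentials for the user `aws`); the helper is hypothetical. pip also reads `PIP_EXTRA_INDEX_URL` from the environment, which is one way to pass the result to a process without editing config files.

```python
from typing import Any
from urllib.parse import quote


def codeartifact_index_url(client: Any, domain: str, repository: str) -> str:
    """Build a pip index URL with a fresh CodeArtifact token embedded as basic auth.

    `client` is expected to be boto3.client("codeartifact").
    """
    token = client.get_authorization_token(domain=domain)["authorizationToken"]
    endpoint = client.get_repository_endpoint(
        domain=domain, repository=repository, format="pypi"
    )["repositoryEndpoint"]
    # endpoint looks like https://<domain>-<owner>.d.codeartifact.<region>.amazonaws.com/pypi/<repo>/
    scheme, rest = endpoint.split("://", 1)
    return f"{scheme}://aws:{quote(token, safe='')}@{rest.rstrip('/')}/simple/"
```

Whether clearml-agent picks up a value exported this way, as opposed to one set in its own config, is exactly the open question above.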
Thanks maestro. Will give this a go