We resolved this issue by developing a package that handles connecting to and querying databases. This package is then used inside the task for pulling the data from the data warehouse. There is a DevOps component here for authorising access to the relevant secret (we used Secrets Manager on AWS). The clearml-agent instances are launched with role permissions which allow access to the relevant secrets. Hope that is helpful to you
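For reference, a minimal sketch of the secret-retrieval side, assuming boto3 and a hypothetical secret name (the actual package wraps this plus the DB connection logic):

```
import json

import boto3


def get_db_credentials(secret_name: str = "prod/dwh/readonly") -> dict:
    """Fetch DB credentials from AWS Secrets Manager.

    Relies on the instance role attached to the clearml-agent machine,
    so no keys need to be baked into the task or the image.
    (secret_name is a placeholder.)
    """
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
```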
Okay, solved the problem. It is using the version that is installed locally (on my laptop). Is there a way to prevent this? Perhaps a requirements.txt or something like that?
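One way to pin what the agent installs, assuming the SDK's `Task.add_requirements` (called before `Task.init`), rather than letting it freeze whatever is in the local environment:

```
from clearml import Task

# Pin the package version the agent should install, instead of
# inheriting whatever happens to be installed on the laptop.
# Package name and version here are hypothetical.
Task.add_requirements("py-db", "1.2.0")

task = Task.init(project_name="examples", task_name="pinned-requirements")
```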
Yes it does, but that requires me to manually create a new agent every time I want to run a different env, no?
Our model store consists of metadata stored in the DWH and model artifacts stored in S3. We technically use ClearML for managing the hardware resources for running experiments, but we have our own custom logging of metrics etc. Just wondering how tricky integrating a trigger would be for that
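For what it's worth, a rough sketch of what a model trigger could look like with the SDK's `TriggerScheduler` (available in more recent clearml versions; names, IDs and queues below are placeholders):

```
from clearml.automation import TriggerScheduler

# Poll the backend every few minutes for newly published models.
scheduler = TriggerScheduler(pooling_frequency_minutes=3)

scheduler.add_model_trigger(
    name="retrain-on-new-model",     # hypothetical trigger name
    schedule_task_id="aabbccdd",     # hypothetical task to clone and run
    schedule_queue="default",
    trigger_project="model-store",   # hypothetical project
    trigger_on_publish=True,         # fire when a model is published
)

scheduler.start()
```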
ECR access should be enabled as part of the role the agent instance assumes when it runs a task
could it be how I am trying to log the figure manually?
Only downside, which is not related to ClearML, is that CodeArtifact authorisation tokens have to have a minimum lifespan of 15 mins. Usually, setting up envs before task execution takes less than a couple of minutes, so the token lingers in the background. Nonetheless, all works as expected!
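For context, the token gets issued along these lines (boto3 sketch; domain and account ID are placeholders), and `durationSeconds` indeed bottoms out at 900 seconds:

```
import boto3

client = boto3.client("codeartifact")

# 900 seconds (15 minutes) is the minimum allowed token lifespan,
# even though env setup usually finishes well within that window.
response = client.get_authorization_token(
    domain="my-domain",           # hypothetical domain
    domainOwner="123456789012",   # hypothetical account id
    durationSeconds=900,
)
token = response["authorizationToken"]
```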
Awesome, thank you Jake! Very helpful. For a lot of the models we run, we do not require GPU resources, so it's good to know that a beefy instance should be able to run the experiments.
Are the envs named after the worker enumeration? E.g. is venv-builds-0 linked to worker 0?
Thanks AnxiousSeal95 , will check it out! 🙂
Any news on this bug?
After some additional inspection, it seems like the issue is Docker related. 7.7G /var/lib/docker/overlay2/ is the directory which is causing the device storage issues.
SuccessfulKoala55 thanks for your help as always. I will try to create a DAG on Airflow using the SDK to implement some form of retention policy which removes things that are not necessary. We independently store metadata on the artefacts we produce, and mostly use ClearML as the experiment manager, so a lot of the events data can be cleared.
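A rough sketch of what the DAG's cleanup step might do with the SDK (project name is a placeholder; the real policy would check our own metadata before deleting anything):

```
from clearml import Task

# Fetch completed tasks in the project and delete the ones we no
# longer need, including their artifacts, to free up storage.
old_tasks = Task.get_tasks(
    project_name="experiments",             # hypothetical project
    task_filter={"status": ["completed"]},
)
for task in old_tasks:
    task.delete(delete_artifacts_and_models=True)
```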
` # Plot the confusion matrix for predictions
sns.heatmap(
    preds_confusion_percentage, annot=True, fmt=".3f", linewidths=.5,
    square=True, cmap='Blues_r'
)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
title_str = f'Accuracy Score: {round(score, 2)}\n{TRANSFORM_TYPE}'
plt.title(title_str, size=15)
task.logger.report_matplotlib_figure(
    title=f"Performance Heatmap - {model_export_name}",
    series="Device Brand Predictions",
    iteration=0,
    figure=pl...
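For comparison, a self-contained version of that snippet, assuming `task` is an initialized ClearML task and using `task.get_logger()`; the data and names here are dummies just to make it runnable:

```
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from clearml import Task

task = Task.init(project_name="examples", task_name="heatmap-logging")

# Dummy stand-ins for the real confusion matrix and metadata.
preds_confusion_percentage = np.random.rand(5, 5)
score, TRANSFORM_TYPE, model_export_name = 0.87, "tf-idf", "brand-clf"

sns.heatmap(
    preds_confusion_percentage, annot=True, fmt=".3f", linewidths=.5,
    square=True, cmap="Blues_r",
)
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
plt.title(f"Accuracy Score: {round(score, 2)}\n{TRANSFORM_TYPE}", size=15)

# Explicitly pass the current figure rather than relying on auto-capture.
task.get_logger().report_matplotlib_figure(
    title=f"Performance Heatmap - {model_export_name}",
    series="Device Brand Predictions",
    iteration=0,
    figure=plt.gcf(),
)
```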
Hey guys. Installing from the private repo is still failing. We have added the relevant deploy key to the repo, but I still get an error when trying to clone and install. Any ideas?
This is what I see under the 'Installed Packages' section
` # Python 3.8.5 (default, Sep 4 2020, 02:22:02) [Clang 10.0.0 ]
azure_storage_blob == 12.6.0
boto3 == 1.11.17
clearml == 0.17.4
git+ssh://git@github.com/15gifts/py-db.git
Detailed import analysis
**************************
IMPORT PACKAGE azure_st...
ohhh ok. so I can actually remove this if those workers are no longer in use
Thanks GrumpyPenguin23 , will have a look shortly 🙂
One question: you can also set `agent.package_manager.extra_index_url`, but since this is dynamic, will `pip install` still pick up the extra index URL from the pip config file, or does it have to be set in this agent config variable?
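For reference, the static way to set it is in `clearml.conf` on the agent machine (the URL below is a placeholder); whether pip's own config is also honoured on top of this is exactly the open question:

```
agent {
    package_manager {
        # extra PyPI-compatible index URLs the agent passes to pip
        extra_index_url: ["https://my-domain-123456789012.d.codeartifact.eu-west-1.amazonaws.com/pypi/my-repo/simple/"]
    }
}
```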
I don't think we explicitly pass the package path to the agent. I expect it to run a regular pip install but it seems to be doing it via git somehow
I don't even know if I have a valid concern for this. Just a little worried as airflow is accessible by more departments than just DS, which could result in some disasters
That's a good question, which I don't have an answer to 😅 I was hoping to be able to store the config file in some kind of secrets vault, and authenticating via some in-memory trace or so
It should be a draft, so that it can be enqueued
Rightttttt, I think I am starting to understand the architecture now lol. Thank you so much for your help!
Sorry, just revisiting this as I'm only getting around to implementation now. How do you pass the ECR container ID to the defined task?
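In case it helps anyone later: one way to do this from the SDK is `Task.set_base_docker` on the task object before enqueuing (the image URI below is a placeholder):

```
from clearml import Task

task = Task.init(project_name="examples", task_name="ecr-image-task")

# Tell the agent which container to run this task in; the agent's
# instance role must allow pulling from this ECR registry.
task.set_base_docker("123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-repo:latest")
```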
From what I can tell, Docker has some leakage here. Temp files are not removed correctly, resulting in the build-up of disk storage usage.
See the following for more details
https://stackoverflow.com/questions/46672001/is-it-safe-to-clean-docker-overlay2
https://forums.docker.com/t/some-way-to-clean-up-identify-contents-of-var-lib-docker-overlay/30604
https://docs.docker.com/storage/storagedriver/overlayfs-driver/
I'm going to write a clean-up script and add that to the cron. I don't bel...
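Something along these lines is what I have in mind for the cron job (a sketch; the 80% threshold is arbitrary):

```
import shutil
import subprocess

# Only prune when the Docker partition is actually getting full.
usage = shutil.disk_usage("/var/lib/docker")
if usage.used / usage.total > 0.8:  # arbitrary 80% threshold
    # Remove stopped containers, dangling images, unused networks and volumes.
    subprocess.run(
        ["docker", "system", "prune", "--force", "--volumes"],
        check=True,
    )
```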
@<1687643893996195840:profile|RoundCat60> Hey Alex. Could you take a look at this when you're free later on, please?
Thanks maestro. Will give this a go