Yeah I would say that a demo on this would be great. I think this task is difficult as is given the differences in deployment architectures, but for common tasks it would be good to have some additional docs/examples 🙂
In particular, I am trying to find a neat way to query all available models, and to use tags to know the context. As it stands, I log the model accuracies/RMSEs as part of the metadata, alongside the training data filepath. The issue is that this doesn't give me a clean way to query models across tasks without a lot of laborious manual lifting. Suggestions welcome.
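For context, this is roughly what I'm doing today to find them (just a sketch; the project name and tag values are made up):
```python
from clearml import Model

# Query registered models by project and tag (project/tag values here are made up)
models = Model.query_models(
    project_name="propensity-models",
    tags=["production", "2-week-snapshot"],
    only_published=False,
)

for m in models:
    # Today I stuff the accuracy/RMSE and the training data filepath into the
    # model metadata/tags, and then have to parse them back out by hand
    print(m.id, m.name, m.tags)
```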
Here is the error message from the console:
Collecting git+ssh://****@github.com/15gifts/py-db.git
  Cloning ssh://****@github.com/15gifts/py-db.git to /tmp/pip-req-build-xai2xts_
  Running command git clone -q 'ssh://****@github.com/15gifts/py-db.git' /tmp/pip-req-build-xai2xts_
  ERROR: Repository not found.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
I thought nothing should be stored locally on the agent? Shouldn't all files be logged to storage rather than to the instance itself?
The reason I am asking is that we have servers with large RAM capacity but minimal storage capacity, meaning that objects held in memory can sometimes exceed the available disk space if an export is required.
Thanks GrumpyPenguin23, will have a look shortly 🙂
Ideally, I want to avoid re-inventing the wheel, so if this functionality already exists with some examples, it would be great if someone could point me to it.
That's a good question, which I don't have an answer to 😅 I was hoping to be able to store the config file in some kind of secrets vault, and to authenticate via some in-memory trace or similar.
While we're here, how can I return the model accuracy (or any performance metric for that matter) for a model or models belonging to a particular task? Is this information stored anywhere, or do I need to explicitly log this data somehow?
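Something like this is what I'm after, if it exists (a sketch; I'm assuming the metric was reported as a scalar under a title/series I already know):
```python
from clearml import Task

# Fetch a task and read back whatever scalars were reported on it
task = Task.get_task(project_name="propensity-models", task_name="train-model")  # names made up
scalars = task.get_last_scalar_metrics()

# Nested dict along the lines of {title: {series: {"last": ..., "min": ..., "max": ...}}}
accuracy = scalars.get("validation", {}).get("accuracy", {}).get("last")
print(accuracy)
```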
So how do I ensure that artefacts are uploaded to the correct bucket from within clearml?
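To make it concrete, is it just a matter of passing output_uri like this (sketch; the project, task and bucket names are made up), or does it have to go through clearml.conf?
```python
from clearml import Task

# Point all task outputs (artifacts, models) at a specific S3 bucket
task = Task.init(
    project_name="propensity-models",                # made up
    task_name="train-model",                         # made up
    output_uri="s3://my-company-clearml/artifacts",  # made-up bucket
)

# Anything uploaded from here on should land in that bucket
task.upload_artifact(name="training-data-snapshot", artifact_object="data/snapshot.parquet")
```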
Locally or on the remote server?
We are planning to use Airflow as an extension of clearml itself, for several tasks:
- we want to isolate the data validation steps from the general training pipeline; the validation will be handled using some base logic and some more advanced validations using something like Great Expectations
- our training data will be a snapshot from the most recent 2 weeks, and this training data will be used across multiple tasks to automate the scheduling and execution of training pipelines periodically e...
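Roughly what I have in mind for the scheduling part (very rough sketch; the DAG id, schedule and queue/project names are placeholders, and this assumes Airflow 2.x plus a pre-existing template task to clone):
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from clearml import Task


def trigger_training():
    # Clone a pre-existing template training task and push it onto an agent queue
    template = Task.get_task(project_name="propensity-models", task_name="train-template")  # names made up
    cloned = Task.clone(source_task=template, name="scheduled-training")
    Task.enqueue(cloned, queue_name="default")


with DAG(
    dag_id="clearml_training_schedule",  # placeholder
    schedule_interval="0 6 * * 1",       # e.g. weekly; TBD
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="trigger_clearml_training", python_callable=trigger_training)
```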
Thanks Jake, I will have a look. Is there a reason a lot of disk space would be used on the server instance? Is there something in the config I can change to ensure that minimal local storage is used on that server, and that S3 is mostly used for storage?
After some additional inspection, seems like the issue is docker related:
7.7G    /var/lib/docker/overlay2/
This is the directory which is causing the device storage issues.
As in an object from memory directly, without having to export the file first. I thought boto3 could handle this, but looking at the docs again, it doesn't look like it. "File-like objects" is their term, so maybe an export is required.
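To make it concrete, this is the kind of thing I was trying to do (sketch; the bucket/key are made up, and the object still gets pickled into an in-memory buffer rather than exported to disk):
```python
import io
import pickle

import boto3

# Stand-in for whatever large object is actually held in RAM
my_large_object = {"weights": [0.1] * 1_000_000}

# Serialise into a file-like buffer instead of exporting to disk first
buffer = io.BytesIO(pickle.dumps(my_large_object))

s3 = boto3.client("s3")
s3.upload_fileobj(buffer, Bucket="my-company-clearml", Key="artifacts/my_large_object.pkl")  # made-up bucket/key
```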
SuccessfulKoala55 thanks for your help as always. I will try to create a DAG in Airflow using the SDK to implement some form of retention policy which removes things that are not necessary. We independently store metadata on the artefacts we produce, and mostly use clearml as the experiment manager, so a lot of the events data can be cleared.
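For the retention DAG, I'm imagining the callable looking something along these lines (rough sketch; the project name, retention window and filter syntax are my own assumptions and would need checking against the docs):
```python
from datetime import datetime, timedelta

from clearml import Task

# Hypothetical retention window
CUTOFF = datetime.utcnow() - timedelta(days=60)


def cleanup_old_tasks():
    # The filter dict is passed through to the backend; syntax needs verifying against the docs
    old_tasks = Task.get_tasks(
        project_name="propensity-models",  # made up
        task_filter={"status_changed": ["<{}".format(CUTOFF.strftime("%Y-%m-%d"))]},
    )
    for t in old_tasks:
        if t.status in ("completed", "failed"):
            t.delete()  # I believe this also removes the associated artifacts/models by default
```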
I can't figure out from the examples how the external trigger works. All of our model performance stats are in the DWH, and we want to build triggers based on that. Is it possible to integrate that with ClearML triggers and schedulers?
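In case it helps, by an external trigger I basically mean this kind of polling logic, where get_latest_model_stats is our own hypothetical DWH query rather than anything from clearml (sketch):
```python
from clearml import Task


def get_latest_model_stats():
    # Placeholder for our own DWH query (e.g. via a SQL client), not part of clearml
    return {"model": "propensity", "auc": 0.71}


def maybe_trigger_retraining(auc_threshold=0.75):
    stats = get_latest_model_stats()
    if stats["auc"] < auc_threshold:
        # Re-queue the training task when performance in the DWH drops below the threshold
        template = Task.get_task(project_name="propensity-models", task_name="train-template")  # names made up
        Task.enqueue(Task.clone(source_task=template, name="retrain-on-drift"), queue_name="default")
```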
That is a neat way of making it work! Thanks Martin. Once I've added the SSH key to the deployment keys in that repo, the change in the config should work, right? I'm guessing the extra index URL can be a URL to the GitHub repo of interest? (not another privately hosted PyPI repo)
Any news on this bug?
Yes it does, but that requires me to manually create a new agent every time I want to run a different env, no?
Sorry, just revisiting this as I'm only getting around to implementation now. How do you pass the ECR container ID to the defined task?
ECR access should be enabled as part of the role the agent instance assumes when it runs a task
Oh great, thanks! Was trying to figure out how the method knows that the docker image ID belongs to ECR. Do you have any insight into that?
And how will it know that the container is on ECR instead of some other container repository?
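Just to check my understanding of the mechanics, is it simply that the image reference is the fully qualified ECR URI, something like this (sketch; the account, region, repo and project names are made up)?
```python
from clearml import Task

task = Task.init(project_name="propensity-models", task_name="train-model")  # names made up

# The "container ID" here is just the fully qualified ECR image URI; the agent's
# docker pull resolves the registry from it, and auth comes from the instance role
task.set_base_docker("123456789012.dkr.ecr.eu-west-1.amazonaws.com/training-image:latest")
task.execute_remotely(queue_name="default")
```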
/home/ubuntu/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/utilities/plotlympl/mpltools.py:371: MatplotlibDeprecationWarning: The is_frame_like function was deprecated in Matplotlib 3.1 and will be removed in 3.3.
This is the last print statement before it hangs
I don't think it's that, it's a 20 KB file upload. This was the last message just printed:
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
It just hangs when trying to upload. Maybe that is the reason the plots are not logging?
I removed it and I still get the same error 😞