Hey GiganticTurtle0,
So basically the issue is that the pipeline function (`prediction_service`) is getting a dict as input, while it expects basic types. If you were to do the following, it would have worked as expected:
`prediction_service(**default_config)`
I will make sure we flatten any dictionary so that we end up with `config/start` instead of a serialized version of the dict.
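A minimal sketch of the difference (the config keys and function signature here are hypothetical):

```python
# Passing the dict as-is hands the pipeline step a single dict argument,
# while ** unpacking maps each key to a separate basic-type argument.
default_config = {"config": "prod", "start": 5}

def prediction_service(config=None, start=None):
    print(config, start)

prediction_service(default_config)    # config gets the whole dict, start stays None
prediction_service(**default_config)  # config="prod", start=5 -- works as expected
```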
wdyt?
Hi SmarmyDolphin68
I see this in between my training epochs, what could be causing this?
This is basically saying we are saving a second model on the same Task, and even though both are logged, only the last one is stored on the Task itself.
This will change in the next version, when a Task will be able to hold references to multiple models in the artifactory 🙂
Yeah the docstring is always the most updated 🙂
Hi VastShells9
`2022-12-20 12:48:02,560 - clearml.automation.optimization - WARNING - Could not find requested hyper-parameters ['duration'] on base task a6262a151f3b454cba9e22a77f4861e3`
Basically it is telling you it is setting a parameter it never found on the original Task you want to run the HPO on.
The parameter name should be (based on the screenshot) "Args/duration" (you have to add the section name to the HPO params). Make sense?
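A minimal sketch (the objective metric names and value range here are hypothetical), showing the "Args/" section prefix on the parameter name:

```python
from clearml.automation import HyperParameterOptimizer, UniformParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id='a6262a151f3b454cba9e22a77f4861e3',
    hyper_parameters=[
        # section name ("Args") + parameter name, exactly as it appears on the base Task
        UniformParameterRange('Args/duration', min_value=1, max_value=10),
    ],
    objective_metric_title='validation',
    objective_metric_series='loss',
    objective_metric_sign='min',
)
```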
there is probably some way to make an S3 path open up in the browser by default
You should have a pop-up asking for credentials ...
Could you check whether it works if you add the credentials in the profile page?
The API server by default spins up multiple processes (they might all be busy at the time with a huge flood of requests, but it is still multi-process). Let me check if there is an easy way to set more processes
WickedGoat98 sure that will not be complicated:
try something along the lines of:

```yaml
agent:
  networks:
    - backend
  container_name: clearml-agent
  image: allegroai/clearml-agent:latest
  restart: unless-stopped
  privileged: true
  environment:
    CLEARML_HOST_IP: ${CLEARML_HOST_IP}
    CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
    CLEARML_API_HOST: ${CLEARML_API_HOST:-}
    CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
    CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
    ...
```
Is this a bug, or an issue with clearml not working correctly with hydra?
It might be a bug?! Hydra is fully supported, i.e. logging the state and allowing you to change the Arguments from the UI.
Is this example working as expected ?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
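For reference, a minimal sketch along the lines of that example (project/task names are placeholders): `Task.init` inside the `@hydra.main` entry point, so the composed config gets logged:

```python
import hydra
from omegaconf import OmegaConf
from clearml import Task

@hydra.main(config_path='.', config_name='config')
def my_app(cfg):
    # Task.init hooks into hydra, so the full composed configuration
    # shows up under the Task's Configuration tab in the UI
    Task.init(project_name='examples', task_name='hydra example')
    print(OmegaConf.to_yaml(cfg))

if __name__ == '__main__':
    my_app()
```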
If you're referring to the run executed by the agent, it ends after this message because my script does not get the right args and so does not know what to...
ValueError: Missing key and secret for S3 storage access
Yes that makes sense. I think we should make sure we do not suppress this warning, it is too important.
Bottom line: the configuration section is missing from your clearml.conf
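For example, something along these lines in clearml.conf (the key/secret/region values are of course placeholders):

```
sdk {
    aws {
        s3 {
            # default S3 credentials, used when no bucket-specific entry matches
            key: "MY_S3_ACCESS_KEY"
            secret: "MY_S3_SECRET_KEY"
            region: "us-east-1"
        }
    }
}
```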
would an implementation of this kind be interesting for you, or do you suggest to fork?
You mean adding a config map storing a default trains.conf for the agent?
I can definitely see your point from the "DevOps" perspective, but from the user perspective it puts the "liability" on me to "optimize" the resources, which to me sounds a bit much to put on my tiny shoulders; I just have general knowledge of what I need. For example lots of CPUs (because I know my process scales well with more CPUs), or large memory (because I have an entire dataset in memory). Personally (and really only my personal perspective), I'd rather have the option to select from a...
confirmed that the change had been added by
Make sure you see them in the Task log in the UI (the agent print it when it starts)
Any insight on how we can reproduce the issue?
Can this be reproducible using a simple script that we can also run?
Regarding the project name:
set_project will support project_name in the next version 🙂 Until then:
`project_id = [p.id for p in Task.get_projects() if p.name == project_name][0]`
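A minimal sketch of the workaround (the task id here is hypothetical):

```python
from clearml import Task

project_name = 'my_project'
# resolve the project name to its id, since set_project currently takes an id
project_id = [p.id for p in Task.get_projects() if p.name == project_name][0]

task = Task.get_task(task_id='aabbccdd')  # hypothetical task id
task.set_project(project_id)
```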
The easiest is to pass an entire trains.conf file
I am trying to see if the user can submit a list of resource requirements (e.g 4GPUs, 12 cores, 100GB diskspace) for the task when queuing the task and the agents pick these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation ?
I am trying to see if the user can submit a list of resource requirements (e.g 4GPUs, 12 cores, 100GB diskspace)
This will be quite easy to implement using the clearml k8s glue, just use user-properties and change the template based on them. I can point you to where you need to modify the code
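A minimal sketch of the user-properties side (the property names are hypothetical; the k8s glue side would read them and patch the pod template accordingly):

```python
from clearml import Task

task = Task.init(project_name='examples', task_name='resource-request')
# store the requested resources as user-properties on the Task,
# visible (and editable) in the UI on the task's info panel
task.set_user_properties(gpus='4', cpu_cores='12', disk_gb='100')
```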
Hi GrievingTurkey78
Turning off PyTorch auto-logging:
`Task.init(..., auto_connect_frameworks={'pytorch': False})`
To manually log a model:

```python
from clearml import OutputModel
OutputModel().update_weights('my_best_model.pt')
```
Well done man!
Nice!
is trainsConfig a pure text blob?
SubstantialElk6
Notice that if you are using a manual setup, the default is `secure: false`; you have to change it to `secure: true`:
https://github.com/allegroai/clearml-agent/blob/176b4a4cdec9c4303a946a82e22a579ae22c3355/docs/clearml.conf#L251
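Roughly what the relevant section in clearml.conf looks like (host/key/secret here are placeholders, e.g. for a self-hosted MinIO):

```
aws {
    s3 {
        credentials: [
            {
                host: "my-minio-host:9000"
                key: "MY_ACCESS_KEY"
                secret: "MY_SECRET_KEY"
                multipart: false
                secure: true
            }
        ]
    }
}
```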
Hi SmugLizard24
The question is what is the reason for the issue?
That is a good question, could it be out of memory? (trying to compress or send the file in one chunk?)
EnviousStarfish54 are those scalars reported ?
If they are, you can just do:

```python
task_reporting = Task.init(project_name='project', task_name='report')
tasks = Task.get_tasks(project_name='project', task_name='partial_task_name_here')
for t in tasks:
    t.get_last_scalar_metrics()
    task_reporting.get_logger().report_something
```
HealthyStarfish45 what exactly did you have in mind, in terms of the widget ?
My pleasure 🙂
Maybe we should do a webinar... I have a feeling the MLOps aspects are not as straight forward as we would like to think ...
Hi WickedGoat98
A few background notions:
Docker containers do not store their state, so if you install something inside a container, the moment you leave it is gone, and the next time you start the same docker image you start from the same initial setup (this is a great feature of Docker). It seems the docker image you are using is missing wget. You could build a new docker image (see the Docker website for more details on how to use a Dockerfile). The way trains-agent works in dockers is it installs everything you ne...
You can definitely configure the watchdog to set the timeout to 15 min. It should not have any effect on running processes, as they basically send an alive ping every 30 sec