And is there an easy way to get all the metrics associated with a project?
Metrics are per Task, but you can get the min/max/last of all the tasks in a project. Is that it?
Yes, I can communicate with the server. I managed to put tasks in the queue and retrieve them, as well as run tasks with metrics reporting.
Through the UI or Python code?
Hi MammothGoat53
Do you mean working with the REST API directly?
https://clear.ml/docs/latest/docs/references/api/events
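For example, a minimal sketch of pulling the min/max/last metrics for every task in a project via the Python APIClient (the project name is a placeholder and the field handling is illustrative, not a tested snippet):

from clearml.backend_api.session.client import APIClient

client = APIClient()

# Resolve the project ID by name (name is a placeholder)
project_id = client.projects.get_all(name="My Project")[0].id

# last_metrics holds the min/max/last values per metric for each task
tasks = client.tasks.get_all(
    project=[project_id],
    only_fields=["id", "name", "last_metrics"],
)
for t in tasks:
    print(t.id, t.name, t.last_metrics)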
However, regarding your recommendation of using the StorageManager class to delete the URL, it seems that this class only contains methods for checking the existence of files, downloading files, and uploading files, but no method for actually deleting files based on their URL (see the docs).
Yes, you are correct 😞 you should use a "deeper" class:
helper = StorageHelper.get(remote_url)
helper.delete(remote_url)
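For reference, a minimal end-to-end sketch (note that StorageHelper is an internal class, so the import path may change between clearml versions, and the URL below is a placeholder):

from clearml.storage.helper import StorageHelper

# URL of the remote file to remove (placeholder)
remote_url = "s3://my-bucket/clearml/artifacts/model.bin"

helper = StorageHelper.get(remote_url)  # resolves the matching storage driver
helper.delete(remote_url)               # deletes the object behind the URL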
Okay, this is more complicated, but possible.
The idea is to write a glue layer (service) that pulls from the queue (i.e. the one fed by the UI), submits the SLURM job, and puts the task in a pending queue (so you know the job is waiting in the SLURM scheduler).
There is a template here:
https://github.com/allegroai/trains-agent/blob/master/trains_agent/glue/k8s.py
I would love to help set up a SLURM glue in a similar manner.
What do you think?
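To make that concrete, a rough sketch of such a glue service, assuming the APIClient and an sbatch wrapper script that runs clearml-agent execute --id <task_id> (the queue names, wrapper path, and polling loop are all illustrative, not a tested implementation):

import subprocess
import time

from clearml.backend_api.session.client import APIClient

client = APIClient()
# Resolve queue IDs by name (names are placeholders)
pull_queue = client.queues.get_all(name="slurm_jobs")[0].id
pending_queue = client.queues.get_all(name="slurm_pending")[0].id

while True:
    # Pull the next task waiting in the incoming queue
    result = client.queues.get_next_task(queue=pull_queue)
    if result and result.entry:
        task_id = result.entry.task
        # Submit a SLURM job; the wrapper script calls: clearml-agent execute --id <task_id>
        subprocess.check_call(["sbatch", "run_clearml_task.sh", task_id])
        # Park the task in the pending queue so you can see it is waiting in SLURM
        client.tasks.enqueue(task=task_id, queue=pending_queue)
    else:
        time.sleep(5)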
Hi @<1533620191232004096:profile|NuttyLobster9>
First, nice workaround!
Second, could you send the full log? When the venv is skipped, pytorch resolving should be skipped as well, and no error should be raised...
And lastly, could you also send the log of the task that executed correctly (the one you cloned)? Because you are correct, it should have been the same.
This makes no sense to me 😞
Both are reading the exact same file, and using the same session / flow ...
Maybe there is an issue with the "verify_certificate" setting on the agent?
Hi RobustRat47
What do you mean by "log space for hyperparameter"? What would be the difference? (Notice that on the graph itself you can switch to log scale when viewing in the UI.)
Or are you referring to hyperparameter optimization, allowing you to add a log-space parameter range?
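If it is the HPO case, a minimal sketch using the clearml.automation parameter ranges (I am assuming LogUniformParameterRange with exponent-based bounds here; please verify against the SDK version you have installed):

from clearml.automation import LogUniformParameterRange, UniformParameterRange

# Search the learning rate on a log scale, roughly 10**-5 .. 10**-1 (assumed semantics)
lr_range = LogUniformParameterRange("General/learning_rate", min_value=-5, max_value=-1)

# For comparison, a plain linear range
dropout_range = UniformParameterRange("General/dropout", min_value=0.0, max_value=0.5, step_size=0.05)

Both can then be passed in the hyper_parameters list of the HyperParameterOptimizer.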
So what is the difference ? both running from the same machine ?
Thanks OutrageousGiraffe8
Any chance you can expand the example code into a fully reproducible toy example? (I would really like to make sure we fix it.)
Thanks! I think I was able to locate the issue, but I wanted to verify 🙂
Thanks for pinging OutrageousGiraffe8
I think I was able to reproduce.
The model is saved to ClearML as an output model when b is not a dictionary.
How did you make the example work with the automagic?
The issue itself is changing the default user.
USER appuser
WORKDIR /home/appuser
Any reason for it?
This looks exactly like the timeout you are getting.
I'm just not sure what the difference is between the model auto-upload and the manual upload.
Hi @<1541954607595393024:profile|BattyCrocodile47>
It seems to me that instead of implementing webhooks to react to things like adding a tag to a model
Did you look at this example?
None
Can we straightforwardly stream ALL ClearML events to another system?
What would you consider an event?
The "basic" object type is Task; a state in a task is changed via an API call, would that be an e...
LudicrousParrot69 I would advise the following:
Put all the experiments in a new project
Filter based on the HPO tag, and sort the experiments based on the metric we are optimizing (see adding custom columns to the experiment table)
Select + archive the experiments that are not used (see the sketch below)
BTW: I think someone already suggested we do the auto-archiving inside the HPO process itself. Thoughts?
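A minimal sketch of the filter + archive part in code (the tag name and the "archived" system-tag mechanism are assumptions on my side; multi-select in the UI is usually easier):

from clearml import Task

# All HPO experiments in the project (names are placeholders)
tasks = Task.get_tasks(project_name="HPO Project", tags=["HPO"])
for task in tasks:
    # Archiving a task == adding the "archived" system tag
    task.set_system_tags((task.get_system_tags() or []) + ["archived"])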
CooperativeFox72 yes, 20 experiments in parallel means that you always have at least 20 connections coming from different machines, and then you have the UI adding on top of it. I'm assuming the sluggishness you feel is the requests being delayed.
You can configure the API server to have more process workers; you just need to make sure the machine has enough memory to support it.
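For the docker-compose deployment it should be something along these lines (the exact CLEARML_USE_GUNICORN / CLEARML_GUNICORN_WORKERS variable names are from memory, double-check them against the clearml-server docs):

services:
  apiserver:
    environment:
      CLEARML_USE_GUNICORN: "1"      # run the API server with gunicorn workers
      CLEARML_GUNICORN_WORKERS: "8"  # more workers == more concurrent requests, and more RAM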
ReassuredTiger98 no, but I might be missing something.
How do you mean project-specific?
... grab the model artifacts for each, put them into the parent HPO model as its artifacts, and then go through and archive everything.
Nice. Wouldn't it make more sense to "store" a link to the "winning" experiment, so you know how to reproduce it and the set of HPs that were chosen?
Not that the model is bad, but how would I know how to reproduce it, or retrain when I have more data, etc.?
Hi @<1523715429694967808:profile|ThickCrow29>
I am using the PipelineController with abort_on_failure set to False.
Is this a pipeline from code or from Tasks?
What is the ClearML version?
Lastly, if a component fails and another component depends on its output, how would it run? And if it is not dependent, why is it a child component?
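For reference, a minimal pipeline-from-code sketch with abort_on_failure left at its default False (step names and functions are placeholders):

from clearml import PipelineController

def step_one():
    return 42

def step_two(value):
    print("got", value)

pipe = PipelineController(
    name="example pipeline",
    project="examples",
    version="1.0.0",
    abort_on_failure=False,  # a failing step will not abort the entire pipeline
)
pipe.add_function_step(name="step_one", function=step_one, function_return=["value"])
pipe.add_function_step(
    name="step_two",
    function=step_two,
    function_kwargs=dict(value="${step_one.value}"),  # child: depends on step_one's output
)
pipe.start_locally(run_pipeline_steps_locally=True)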
Hi @<1644147961996775424:profile|HurtStarfish47>
I see "Add image.jpg" being printed for all my data items ...
I assume you forgot to call upload? The sync call "marks" files for upload / deletion, but the upload call actually does the work.
Kind of like git add / push, if that makes sense?
Let me verify something in the code,
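In the meantime, a minimal sketch of the full flow (dataset name, project, and folder path are placeholders):

from clearml import Dataset

ds = Dataset.create(dataset_name="my dataset", dataset_project="examples")
ds.sync_folder(local_path="/data/images")  # like git add: marks added/changed/removed files
ds.upload()                                # like git push: actually uploads the file content
ds.finalize()                              # close the dataset version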
Hi RoundMosquito25
The main problem here is that there is no way to know, before running the Task, how much memory it will need... And without that parameter, maximizing GPU utilization is quite challenging. wdyt?