I'll make sure we have conda ignore git:// packages, and pass them to the second pip stage.
Follow up: I see that if I move an Experiment to a new project, it does not copy the associated model files and must be done manually. Once I moved the models to the new project, the query works as expected.
Correct
Nice catch!
Check the log, the container has torch 1.13.0 but the task requires torch==1.13.1
Now, the torch package inside those NVIDIA prepackaged containers is compiled a bit differently. What I suspect happens is that the torch wheel from pytorch is not compatible with this container. Easiest fix: change the task requirements to torch 1.13.0 (the version already inside the container).
Wdyt?
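For context, a minimal sketch of pinning the requirement from code, assuming the standard Task.add_requirements call (the project/task names and exact version string are illustrative); you could equally edit the "Installed Packages" of the cloned task in the UI:
from clearml import Task

# Pin torch to the version already shipped inside the NVIDIA container,
# so the agent does not try to replace it with an incompatible wheel.
# Must be called before Task.init(); the version string is illustrative.
Task.add_requirements("torch", "1.13.0")

task = Task.init(project_name="examples", task_name="torch-version-pin")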
Profile page top left corner
Hi CluelessFlamingo93
I think the latest clearml-agent 1.5.1 fixed that issue (this is basically pip trying to "protect" you from mismatched packages)
can you upgrade your clearml-agent and test?
pip3 install clearml-agent==1.5.1
The package detection is done when running the code on your laptop; that is when it first logs the packages and versions. Following that, what do you have on your laptop? OS / Conda / Python?
Hi DeliciousKoala34
Happened when cloning and running a task on an agent on a different machine. I
Sounds like a torch internal issue, can you send the full log of the remote Task?
CluelessFlamingo93 I would also fix the pip version requirements to:
pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
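If it helps, a rough sketch of where that setting usually sits in the agent's clearml.conf (section names follow the standard agent config layout; adjust to your own file):
agent {
    package_manager {
        # pin pip per Python version, as suggested above
        pip_version: ["<20.2 ; python_version < '3.10'", "<22.3 ; python_version >= '3.10'"]
    }
}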
In order for the sample to work you have to run the template experiment once. Then the HP optimizer will find the best HP for it.
you could also use:
https://github.com/allegroai/clearml/blob/ce7e77a00e869a2690f31cbc578636ce88bc4613/docs/clearml.conf#L188
and set up the clearml.conf
on the user's machine to automatically log the environment variables at run time (stored under the Configuration tab).
Then the agent will pull these same variables at execution time and set them.
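A minimal sketch of that clearml.conf snippet, assuming the option at the linked line is the log_os_environments list under sdk.development (the variable patterns here are illustrative):
sdk {
    development {
        # environment variables matching these patterns are logged at run time
        # and stored under the task's Configuration tab; the agent sets them
        # again at execution time
        log_os_environments: ["AWS_*", "MY_APP_*"]
    }
}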
Thanks VivaciousPenguin66 !
BTW: if you are running the local code with conda, you can set the agent to use conda as well (notice that if you are running locally with pip, the agent's conda env will use pip to install the packages to avoid version mismatch)
See if this helps
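For reference, a sketch of the agent-side setting, assuming the standard package_manager section of clearml.conf:
agent {
    package_manager {
        # have the agent build the execution environment with conda;
        # if the original run used pip, packages are still installed with pip
        # inside that conda env to avoid version mismatches
        type: conda
    }
}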
I just cloned it from the examples that are available in the SaaS console upon account creation
Ohhh! That would explain it. Maybe it is broken there?! Let me check, one second.
Hi CloudySwallow27
This error occurs randomly during training (in other words training does successfully start).
What's the clearml-agent version you are using, and the clearml version?
SillyPuppy19 yes, you are correct. Actually, I can promise you the callback will be called from a different thread (basically the monitoring thread), so it's on the user to make sure the callback can handle it.
How about we move this discussion to GitHub?
upload_artifact
will actually do two things:
1. upload the file to the trains-server
2. register it as an artifact on the experiment
What did you mean by "register the artifact manually"? You still need to upload the file to the trains-server (so it is later accessible)
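A minimal sketch of that call (the project/task/artifact names and the file path are illustrative):
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact-upload")

# one call: uploads the local file to the configured storage/server
# and registers it as an artifact on this experiment
task.upload_artifact(name="data", artifact_object="/tmp/data.csv")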
Thanks MuddyCrab47 !!!
I found it!
It turns out the artifact upload will always upload from stream (aka no multi-upload). I will make sure we fix it in the next RC (I think the plan is to have it out this weekend)
PunySquid88 do you want to test a fix?
It does not seem to be related to the upload. The upload itself finished... What's your Trains version?
Anyhow if the StorageManager.upload was fast, the upload_artifact is calling that exact function. So I don't think we actually have an issue here. What do you think?
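For comparison, a sketch of calling the storage layer directly, assuming the current StorageManager.upload_file API (the destination URL is illustrative); upload_artifact ends up going through this same upload path:
from clearml import StorageManager

# direct upload of a local file to a storage target
StorageManager.upload_file(local_file="/tmp/data.csv", remote_url="s3://my-bucket/artifacts/data.csv")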
Do you have a specific numpy version you are installing? Why is it trying to install the wheel from code?
Hmm, I cannot think of something that will provide this on a per-user basis.
Wouldn't a global set of credentials that the agent is using be enough ?
(on the local machine, user can keep using the "definitions.py")
however, this will also turn off metrics
For the sake of future readers, let me clarify this one: turning it off with auto_connect_frameworks={'pytorch': False}
only affects the auto logging of torch.save/load
(side note: the reason is that pytorch does not have built-in metric reporting, i.e. it is usually done manually, these days most probably with tensorboard; for example, lightning / ignite will use tensorboard as the default metric reporting).
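A minimal sketch of what that looks like in code (project/task names are illustrative):
from clearml import Task

# disable only the automatic torch.save()/torch.load() model logging;
# scalars reported through tensorboard (e.g. by lightning / ignite) are still captured
task = Task.init(
    project_name="examples",
    task_name="no-pytorch-autolog",
    auto_connect_frameworks={"pytorch": False},
)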
query tasks that are both Running --> You mean status=["in_progress"]
Yes!
How do I figure out other possible parameters I can use with the status parameter?
https://clear.ml/docs/latest/docs/references/api/tasks#post-tasksget_all
https://clear.ml/docs/latest/docs/references/api/definitions#taskstask
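A minimal sketch of that query from Python, assuming the Task.get_tasks task_filter passthrough to tasks.get_all:
from clearml import Task

# "in_progress" is the backend status name for tasks shown as Running in the UI
running_tasks = Task.get_tasks(task_filter={"status": ["in_progress"]})
print([t.id for t in running_tasks])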
Filter only tasks that started, say, 10 min ago. Is there any parameter for that as well?
last_update or created then use...
maybe I should use explicit reporting instead of Tensorboard
It will do just the same π
there is no method for setting last iteration, which is used for reporting when continuing the same task. Maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...
DrabCockroach54 that is quite cool!
Basically here is what I would do:
Query Tasks that are both Running and do not have the system tag "development" (that means running on agents) + filter only tasks that started, say, 10 min ago
Go over the list and see if (1) they have GPU scalars reported (meaning GPU is accessible) (2) min/max/val of GPU utilization is under 5%
wdyt?
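Putting the whole flow together, a rough sketch (assuming Task.get_tasks / get_last_scalar_metrics and the "-" prefix for excluding a system tag; field and series names are best-effort and worth double checking):
from datetime import datetime, timedelta
from clearml import Task

# running tasks that do not carry the "development" system tag (i.e. executed by agents)
tasks = Task.get_tasks(
    task_filter={
        "status": ["in_progress"],
        "system_tags": ["-development"],  # "-" prefix excludes the tag
    }
)

cutoff = datetime.utcnow() - timedelta(minutes=10)
for t in tasks:
    # keep only recently updated/started tasks (field access is illustrative)
    last_update = t.data.last_update
    if last_update and last_update.replace(tzinfo=None) < cutoff:
        continue
    # returns {"title": {"series": {"last": .., "min": .., "max": ..}}}
    scalars = t.get_last_scalar_metrics()
    gpu = scalars.get(":monitor:gpu", {})
    utilization = {name: values for name, values in gpu.items() if "utilization" in name}
    if not utilization:
        print(t.id, "no GPU utilization reported")
    elif all(values.get("max", 0) < 5 for values in utilization.values()):
        print(t.id, "GPU utilization under 5%")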