Reputation
Badges 1
25 × Eureka!SpotlessFish46 So the expected behavior is to have the single script inside the diff, but you get empty string ?
DilapidatedDucks58 I see ...
This might be more complicated that one would imagine, a simple solution might be to store a snapshot of the values every-time we reach a new maximum, a quick hack might be to add it as text on one of the task's parameters or properties (that we can later add to the table as custom column).
wdyt?
so firs yes, I totally agree. This is why the clearml-serving
has a dedicated statistics module that creates histograms over time, then we push it into Prometheus and connect grafana to it for dashboards and alerts.
To be honest, I would just use it instead of reporting manually, wdyt?
parser.add_argument( "--dataset_mean", type
=
float, nargs
=
"+", default
=
0.5)
I think providing nargs='+ ' assumes the type is a list. nonetheless we should be able to support it. Could you please add a GitHub issue so we do not forget ?
on the side note, is there any way to automatically give more meaningful names to the running docker containers?
What do you mean by that? running where? and where will you see them ?
It seems to follow a structure specific to clearml,
Actually plotly.js π
Hmm, I think I need more to try and reproduce, what exactly did you do, what was the expected behavior vs reality ?
ERROR: torch-1.12.0+cu102-cp38-cp38-linux_x86_64.whl is not a supported wheel on this platform
TartBear70 could it be you are running on a new Mac M1/2 ?
Also quick question, any chance you can test with the latest RC?pip3 install clearml-agent==1.3.1rc6
@<1523701868901961728:profile|ReassuredTiger98> in the UI can you see it in the "installed packages" section under the Execution Tab ?
VexedCat68
. So the checkpoints just added up. I've stopped the training for now. I need to delete all of those checkpoints before I start training again.
Are you uploading the checkpoints manually with artifacts? or is it autologged & uploaded ?
Also why no reuse and overwrite older checkpoints ?
maybe I should use explicit reporting instead of Tensorboard
It will do just the same π
there is no method for settingΒ
last iteration
, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass th...
BTW: could it be the Task.init is Not called on the "module.name" entry point, but somewhere internally ?
So this is verry odd, it looks like a pip bug:
The agent is trying to install torch==2.1.0.*
because by default it ignores the 4th+ parts (they are unstable and torch have tendency to remove them) . and for some reason pip will not match 2.1.0.*
with for example "2.1.0.dev20230306+cu118"
but based on the docs it should work:
see here: None
As a workaround you can always edit and change to the final url for example: so ...
I think it's inside the container since it's after the worker pulls the image
Oh that makes more sense, I mean it should not build the from source, but make sense
To solve for build for source:
Add to the "Additional ClearML Configuration" section the following line:agent.package_manager.pip_version: "<21"
You can also turn on venv caching
Add to the "Additional ClearML Configuration" section the following line:agent.venvs_cache.path: ~/.clearml/venvs-cache
I will make sure w...
PompousParrot44 the venv created in the docker always inherits form the docker system-wide packages, so in essence if you are using the same set pf python packages, nothing will get reinstalled.
Hi @<1541954607595393024:profile|BattyCrocodile47>
But the files API is still open to the world, right?
No, of course not π (i.e. API is authenticated with JWT header, this is why you need to generate the secret/key in the UI)
That said, the login process itself is user/pass stored on the server, but other than that the web/api are secured. The file server on the other hand is plain http storage and does not verify the connection like the API does. So if you are going the self-ho...
Hi FancyChicken53
This is a noble cause you are after π
Could you be more specific on what you had in mind, I'll try to find the best example once I have more understanding ...
Lol, :)
I think the issue is that you do not need to manually set the initial iteration, it's supposed to get it , as it is stored on the Task itself
I "think" you are referring to the venvs cash, correct?
If so, then you have to set it in the clearml.conf running on the host (agent) machine, make sense ?
. In short, I was not able to do it withΒ
Task.clone
Β andΒ
Task.create
Β , the behavior differs from what is described in docs and docstrings (this is another story - I can submit an issue on github later)
The easiest is to use task_ task_overrides
Then pass:task_overrides = dict('script': dict(diff='', branch='main'))
Questions
I want to trigger a retrain task when F1
That means that in inference you are reporting the F1 score, correct?
As part of the retraining I have to train all the models and then have to choose best one and deploy it
Are you using passing output_uri to Task.init? are you storing the model as artifact?
You can tag your model/task with "best" tag (and untag the previous one). Then in production , look for the "best" task and get its model
Thoughts?
ohh right, my bad:docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && pip install trains-agent && echo done"
See here:
https://download.pytorch.org/whl/torch_stable.html
cu110/* has no torch 1.3.1 only 1.7.0
I'll try to go with this option, I think its actually perfect for my needs
Great!
seems like the network inside the running code cannot access the localhost (even though you have --network=host
. Could you test it with the machine's IP?
(Actually the best practice is to add a name to the machine (in your hosts file), so that if later you move the server, all the links will be valid)
JitteryCoyote63 it should just "freeze" after a while as it will constantly try to resend logs. Basically you should be fine π
(If for some reason something crashed, please let me know so we can fix it)