I mean, when sending data from the clearml-agents, does it block the training while sending metrics, or is it done in parallel to the main thread?
That would be awesome, yes, only from my side I have 0 knowledge of the pip codebase 😄
Still investigating: task.data.last_iteration is correct (equal to engine.state["iteration"]) when I resume the training
You mean it will resolve by itself in the following days or should I do something? Or there is nothing to do and it will stay this way?
If the reporting is done in a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could that be the case?
There is no way to filter on long types? I can’t believe it
Maybe the agent could be adapted to have a max_batch_size parameter?
Something like that?
curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "variant": "loss_model"
          }
        },
        {
          "match": {
            "task": "8f88e4b8cff84f23bde74ed4b7213ec6"
          }
        }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}...
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞
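To make the doubling concrete, here is a minimal sketch of what I suspect is happening. It assumes (not confirmed) that the reporter adds the task's initial iteration offset to every value I log, while my engine already logs absolute iterations after resuming. `reported_iteration` is a made-up helper for illustration, not a ClearML function.

```python
def reported_iteration(initial_offset: int, logged_iteration: int) -> int:
    """Iteration shown in the UI, ASSUMING the offset is added to every logged value."""
    return initial_offset + logged_iteration


# Resuming at iteration 1000: the engine restores its own absolute counter.
resume_point = 1000
engine_iteration = resume_point + 1  # first step after resuming

# Offset left at the resume point: the absolute iteration gets shifted again.
doubled = reported_iteration(initial_offset=resume_point, logged_iteration=engine_iteration)

# Offset reset to 0: the absolute iteration is reported unchanged.
expected = reported_iteration(initial_offset=0, logged_iteration=engine_iteration)

print(doubled, expected)  # 2001 1001
```

If this is right, resetting the offset to 0 has to take effect in whichever process actually does the reporting, otherwise the shift stays.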
with the CLI, on a conda env located in /data
So it can be that when restarting the docker-compose, it used another volume, hence the loss of data
I have no idea what's going on
Thanks SuccessfulKoala55 ! So CLEARML_NO_DEFAULT_SERVER=1 by default, right?
As for disk space: I have 21 GB available (8 GB used), and the /opt/trains/data folder is about 600 MB
Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and the logs indices? Having more would allow scaling and later moving them to separate nodes if needed - the default heap size being 2 GB, it should be possible, right?
Something was triggered, you can see the CPU usage starting right when the instance became unresponsive - maybe a merge operation from ES?
Ok, I could reproduce with Firefox and Chromium. Steps:
1. Add creds (either via the popup or in the settings)
2. Go to /settings/webapp-configuration -> Creds should be there
3. Hit F5
4. Creds are gone
AgitatedDove14 It was only on comparison as far as I remember
self.clearml_task.get_initial_iteration() also gives me the correct number
I would let the trains team answer this in detail, but as a user moving from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you will have to take care of the whole maintenance of your machines, design interactions between them, etc. This is where trains shines, it provides these features out-of-t...
AgitatedDove14 I do continue an aborted Task, yes - so I shouldn’t even need to call the task.set_initial_iteration function, interesting! Do you have any idea what could cause the behavior I am observing? I am trying to find ways to debug it
Now I know which experiments have the most metrics. I want to downsample these metrics by 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) only the documents whose iteration ends with 0?
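Something like this is what I have in mind, as a rough sketch only: build a request body that matches the documents of one task whose iter is NOT a multiple of 10 (the ones to drop, so that multiples of 10 are kept), using a script filter. It assumes iter is a numeric field with doc values and that scripting is enabled. `downsample_delete_query` and `would_delete` are names I made up, and the body should be tried with _search before being sent to _delete_by_query.

```python
import json


def downsample_delete_query(task_id: str, keep_every: int = 10) -> dict:
    """Body matching documents of one task whose `iter` is not a multiple of keep_every."""
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    return {
        "query": {
            "bool": {
                "must": [{"match": {"task": task_id}}],
                "filter": [
                    {
                        "script": {
                            "script": {
                                "lang": "painless",
                                # Assumption: `iter` is a numeric field with doc values.
                                "source": "doc['iter'].value % params.n != 0",
                                "params": {"n": keep_every},
                            }
                        }
                    }
                ],
            }
        }
    }


def would_delete(iteration: int, keep_every: int = 10) -> bool:
    """Pure-Python mirror of the script condition, for checking the logic locally."""
    return iteration % keep_every != 0


body = downsample_delete_query("8f88e4b8cff84f23bde74ed4b7213ec6")
print(json.dumps(body, indent=2))
```

The local mirror makes it easy to check which iterations would go: 7 would be deleted, 20 would be kept.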
but if you do that and the package is already installed it will not install using the git repo, this is an issue with pip
Exactly, that’s my problem: I want to remove it to make sure it is reinstalled (because the version can change)
I think that since the agent installs everything from scratch it should work for you. Wdyt?
With env caching enabled, it won’t reinstall this private dependency, right?
I also would like to avoid any copy of these artifacts on s3 (to avoid double costs, since some folders might be big)
I am trying to upload an artifact during the execution
nothing wrong from ClearML side 🙂