yes, because it won't install the local package which has this setup.py with the problem in its install_requires described in my previous message
ha nice, where can I find the mapping template of the original clearml so that I can copy and adapt it?
I am using pip as a package manager, but I start the trains-agent inside a conda env
CostlyOstrich36 super, thanks for confirming! I then have a follow-up question: are the artifacts duplicated (copied) or just referenced?
So I want to be able to visualise it quickly as a table in the UI and also download it as a dataframe. Which of report_media or artifact is better for that?
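For context, a minimal sketch of the two routes I am comparing: report_table (rather than report_media) to render it in the UI, and an artifact to get the dataframe back. Project/task names and the dataframe here are just illustrative.
```python
import pandas as pd
from clearml import Task

# Minimal sketch; project/task names and data are illustrative
task = Task.init(project_name="demo", task_name="table-vs-artifact")
df = pd.DataFrame({"epoch": [1, 2, 3], "loss": [0.9, 0.5, 0.3]})

# Option 1: show the dataframe as a table in the experiment's PLOTS tab
task.get_logger().report_table(
    title="metrics", series="summary", iteration=0, table_plot=df
)

# Option 2: upload it as an artifact so it can be downloaded back as a dataframe
task.upload_artifact(name="metrics_df", artifact_object=df)
```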
` # Set the python version to use when creating the virtual environment and launching the experiment
# Example values: "/usr/bin/python3" or "/usr/local/bin/python3.6"
# The default is the python executing the clearml_agent
python_binary: ""
# ignore any requested python version (Default: False, if a Task was using a
# specific python version and the system supports multiple python the agent will use the requested python version)
# ignore_requested_python_version: ...
I mean when sending data from the clearml-agents, does it block the training while sending metrics, or is it done in parallel to the main thread?
That would be awesome, yes; the only thing is that on my side I have zero knowledge of the pip codebase
Still investigating: task.data.last_iteration is correct (equal to engine.state["iteration"]) when I resume the training
You mean it will resolve by itself in the coming days, or should I do something? Or is there nothing to do and it will stay this way?
If the reporting is done in a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could that be the case?
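For reference, this is roughly how I call it; a minimal sketch, assuming the task is resumed via continue_last_task (project/task names are placeholders).
```python
from clearml import Task

# Minimal sketch: resume the previous task and reset the iteration offset
# in the main process, before any metric reporting happens.
# Project/task names are placeholders.
task = Task.init(
    project_name="my-project",
    task_name="my-training",
    continue_last_task=True,  # resume the previously created task
)
task.set_initial_iteration(0)  # intended to remove the iteration offset on resume
```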
There is no way to filter on long types? I can't believe it
Maybe the agent could be adapted to have a max_batch_size parameter?
Something like that?
` curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ]
    }
  },
  "aggs": {
    "series": {
      "terms": { "field": "iter" }
    }
  }
}'`
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming
with the CLI, in a conda env located in /data
MagnificentSeaurchin79 You could also just fork the tensorflow repo, make changes in a specific branch and specify your forked repo with your custom branch in the install_requires of your setup.py
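Something like this, as a minimal sketch (the fork URL, branch name and package name are placeholders):
```python
# setup.py -- minimal sketch; the fork URL, branch and package name are placeholders
from setuptools import setup

setup(
    name="my-package",
    version="0.1.0",
    install_requires=[
        # PEP 508 direct reference: install tensorflow from the forked repo's custom branch
        "tensorflow @ git+https://github.com/your-user/tensorflow.git@my-custom-branch",
    ],
)
```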
So it could be that when restarting the docker-compose, it used another volume, hence the loss of data
I have no idea what's going on
Thanks SuccessfulKoala55! So CLEARML_NO_DEFAULT_SERVER=1 is set by default, right?
As for disk space: I have 21GB available (8GB used); the /opt/trains/data folder is about 600MB
Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and the logs indices? Having more shards allows scaling and, if needed, later moving them to separate nodes; with the default heap size being 2GB, that should be possible, shouldn't it?
Something was triggered; you can see the CPU usage start rising right when the instance became unresponsive - maybe a merge operation from ES?
Ok, I could reproduce with Firefox and Chromium. Steps:
1. Add creds (either via the popup or in the settings)
2. Go to /settings/webapp-configuration -> creds should be there
3. Hit F5
4. Creds are gone
AgitatedDove14 It was only on comparison, as far as I remember