Maybe the agent could be adapted to have a max_batch_size parameter?
Something like that?
` curl "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{
"match": {
"variant": "loss_model"
}
},
{
"match": {
"task": "8f88e4b8cff84f23bde74ed4b7213ec6"
}
}
]
}
},
"aggs": {
"series": {
"terms": { "field": "iter" }
}
}
}...
Although `task.data.last_iteration` is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞
with the CLI, on a conda env located in /data
MagnificentSeaurchin79 You could also just fork the tensorflow repo, make your changes in a specific branch, and point the `install_requires` of your setup.py at your forked repo and custom branch
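A minimal sketch of what that could look like (the fork URL, branch name, and package name below are illustrative placeholders, not from this thread):

```python
# setup.py - sketch; fork URL and branch are hypothetical
from setuptools import setup

setup(
    name="my_project",  # placeholder project name
    install_requires=[
        # PEP 508 direct reference: install tensorflow from a fork's branch
        "tensorflow @ git+https://github.com/your-user/tensorflow.git@custom-branch",
    ],
)
```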
So it could be that when restarting the docker-compose, it used another volume, hence the loss of data
I see that I have several volumes:
```
$ docker volume ls
DRIVER    VOLUME NAME
local     5b0bfe5ab1a3d645bd635b2fb6f2aefd2b657d566019343c8305959903996c67
local     43b60287d60db798dc9d1defe1d7d861334c9c8299aefad6da2f20db278cfc5b
local     1406d50aa65ab55d323500d1fb23f19adfc6e721261ab6103a59d20e82146099
local     7367a215bd42a4e888e5d88ce708bf74aedc48a6e9417c72a19739cb80f25e6d
local     7413c39f5e4b6568304832d9d2e925ebdbf47ad31ad22d77830d3618af79237b
local     a55cb71edff48c2138a5da9d8d1e26df3b...
```
I have no idea what's going on
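Anonymous volumes like these are easy to lose across restarts. One way to avoid this - a sketch assuming a generic docker-compose setup (service name, image, and paths are illustrative, not the actual ClearML compose file) - is to declare named volumes explicitly, so `down`/`up` cycles always reattach the same storage:

```yaml
# docker-compose.yml - sketch with an explicit named volume
services:
  elasticsearch:
    image: elasticsearch:7.17.0
    volumes:
      - es_data:/usr/share/elasticsearch/data

volumes:
  es_data:  # named volume: keeps a stable identity across down/up cycles
```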
Thanks SuccessfulKoala55 ! So CLEARML_NO_DEFAULT_SERVER=1 by default, right?
As for disk space: I have 21GB available (8GB used), and the /opt/trains/data folder is about 600MB
Also, what is the benefit of defaulting to index.number_of_shards = 1
for the metrics and logs indices? Having more shards would allow scaling out and later moving them to separate nodes if needed - with the default heap size being 2GB, that should be possible, shouldn't it?
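For reference, a sketch of how that default could be overridden with a legacy index template (template name, index pattern, and shard count are illustrative; this only affects newly created indices, it does not reshard existing ones):

```
curl -X PUT "localhost:9200/_template/events_shards" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["events-*"],
  "settings": { "index": { "number_of_shards": 2 } }
}'
```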
Something was triggered - you can see the CPU usage starting to climb right when the instance became unresponsive. Maybe a merge operation from ES?
Ok, I could reproduce with Firefox and Chromium. Steps:
1. Add creds (either via the popup or in the settings)
2. Go to /settings/webapp-configuration -> creds should be there
3. Hit F5 -> creds are gone
AgitatedDove14 It was only on comparison as far as I remember
`self.clearml_task.get_initial_iteration()` also gives me the correct number
I would let the trains team answer this in detail, but as a user who moved from MLflow to trains, I can share the following insights:
MLflow and trains overlap when it comes to having a system with a nice web UI to compare/log experiments/models/metrics. But MLflow lacks a crucial feature IMO, which is ML/DevOps: using MLflow, you have to take care of the whole maintenance of your machines, design the interactions between them, etc. This is where trains shines, it provides these features out-of-t...
AgitatedDove14 I do continue an aborted Task, yes - so I shouldn't even need to call the task.set_initial_iteration
function, interesting! Do you have any idea what could be the reason for the behavior I am observing? I am trying to find ways to debug it
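For anyone hitting the same doubling, a minimal sketch of the workaround under discussion (project/task names are placeholders; `continue_last_task` is assumed here as the way the task is resumed):

```python
from clearml import Task

# Re-attach to the previous (aborted) task instead of creating a new one
task = Task.init(
    project_name="my_project",    # placeholder
    task_name="my_experiment",    # placeholder
    continue_last_task=True,
)

# In principle this happens automatically when continuing a task; setting it
# explicitly forces reported iterations to start after the last recorded one,
# which should avoid overlapping/doubled scalars.
task.set_initial_iteration(task.data.last_iteration)
```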
Now I know which experiments have the most metrics. I want to downsample these metrics by 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) all the documents whose iteration is not a multiple of 10?
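One possible approach - a sketch, not tested against this index - is Elasticsearch's `_delete_by_query` with a script filter that drops every document whose `iter` is not a multiple of 10 (reusing the same variant/task filters as the earlier query):

```
curl -X POST "localhost:9200/events-training_stats_scalar-adx3r00cad1bdfvsw2a3b0sa5b1e52b/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "variant": "loss_model" } },
        { "match": { "task": "8f88e4b8cff84f23bde74ed4b7213ec6" } }
      ],
      "filter": {
        "script": {
          "script": { "source": "doc[\"iter\"].value % 10 != 0" }
        }
      }
    }
  }
}'
```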
But if you do that and the package is already installed, it will not be installed from the git repo - this is an issue with pip
Exactly, that’s my problem: I want to remove it to make sure it is reinstalled (because the version can change)
I think that since the agent installs everything from scratch it should work for you. Wdyt?
With env caching enabled, it won’t reinstall this private dependency, right?
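For reference, env caching is controlled by the agent's venvs_cache section in clearml.conf - a sketch with illustrative values; when the path entry is commented out, caching is off and the environment (private dependency included) is rebuilt on every run:

```
# clearml.conf (agent machine) - sketch, values illustrative
agent {
    venvs_cache: {
        max_entries: 10
        free_space_threshold_gb: 2.0
        # comment out `path` to disable virtual environment caching
        path: ~/.clearml/venvs-cache
    }
}
```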
I would also like to avoid any copy of these artifacts on S3 (to avoid double costs, since some folders might be big)
I am trying to upload an artifact during the execution
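For context, a minimal sketch of what that looks like with the ClearML SDK (the artifact name and path are placeholders):

```python
from clearml import Task

# Grab the task currently executing
task = Task.current_task()

# upload_artifact registers and uploads the object while the task is running
task.upload_artifact(name="checkpoint_folder", artifact_object="/data/checkpoints")
```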
nothing wrong from ClearML side 🙂
did you try with another availability zone?
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn't update the apiserver.auth.cookies.domain
setting - I did that, restarted, and now it works 🙂 Thanks!
RobustRat47 It can also simply be that the instance type you declared is not available in the zone you defined