Well, no luck - using matplotlib.use('agg') in my training codebase doesn't solve the memory leak
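For reference, a minimal sketch of what this looks like (the plotting helper and the explicit plt.close call are illustrative, not my actual training code):
` import matplotlib
matplotlib.use("agg")  # select the non-interactive backend before pyplot is imported
import matplotlib.pyplot as plt

def plot_metrics(values):
    # hypothetical plotting helper, not the real training code
    fig, ax = plt.subplots()
    ax.plot(values)
    fig.savefig("metrics.png")
    plt.close(fig)  # release the figure explicitly to limit memory growth `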
Hi CostlyOstrich36 , most of the time I want to compare two experiments in the DEBUG SAMPLES section, so if I click on one sample to enlarge it, I cannot see the others. Also, once I close the panel, the iteration number is not updated
Sure, it’s because of a very annoying bug that I shared in this thread: https://clearml.slack.com/archives/CTK20V944/p1648647503942759 , which I haven’t been able to solve so far.
I’m not sure you can downgrade that easily ...
Yea, that’s what I thought. That’s a bit of a pain for me now; I hope I can find a way to fix the bug somehow
In all the steps I want to store them as artifacts in s3 because it’s very convenient.
The last step should merge them all, i.e. it needs to know about all the artifacts of the previous steps
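For context, a minimal sketch of that pattern, assuming hypothetical artifact names, bucket, and a way to pass the step task ids to the merge step (none of these come from my actual pipeline):
` from clearml import Task

# in each intermediate step: upload the partial result produced by the step as an artifact to s3
task = Task.init(project_name="my_project", task_name="step_1",
                 output_uri="s3://my-bucket/artifacts")
task.upload_artifact("partial_result", artifact_object="partial_1.csv")  # local file produced by this step

# in the final merge step: fetch the artifacts of all the previous steps
step_task_ids = ["step_1_task_id", "step_2_task_id"]  # however the ids get passed along
partials = [
    Task.get_task(task_id=tid).artifacts["partial_result"].get_local_copy()
    for tid in step_task_ids
] `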
and then call task.connect_configuration probably
Sorry, it’s actually task.update_requirements(["."])
GrumpyPenguin23 yes, it is the latest
AgitatedDove14 , what I was looking for was: parent_task = Task.get_task(task.parent)
Bottom line is: trains-server uses the elasticsearch image http://docker.elastic.co/elasticsearch/elasticsearch:5.6.16 , which does not have an unlimited license (only a free license that expires after some time). From version 6.3, elasticsearch provides an unlimited free license. Trains should use >=6.3, WDYT?
Hi AgitatedDove14 , initially I was doing this, but then I realised that with the approach you suggest, all the packages of the local environment also end up in the “installed packages”, while in reality I only need the dependencies of the local package. That’s why I use _update_requirements
With this approach, only the required package will be installed by the agent
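For future reference, this is the call in question - a minimal sketch, assuming the task is created with Task.init from the local package’s root (note that _update_requirements is a private SDK method, as discussed above):
` from clearml import Task

task = Task.init(project_name="my_project", task_name="training")
# private helper discussed above: replace the auto-detected environment with
# just the local package ("."), so only its declared dependencies are listed
task._update_requirements(["."]) `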
no, at least not in clearml-server version 1.1.1-135 • 1.1.1 • 2.14
I didn’t use ignite callbacks, for future reference:
` from ignite.engine import Events
from ignite.handlers import EarlyStopping

early_stopping_handler = EarlyStopping(...)
def log_patience(_):
    # report the current patience counter of the EarlyStopping handler to ClearML
    clearml_logger.report_scalar("patience", "early_stopping", early_stopping_handler.counter, engine.state.epoch)

engine.add_event_handler(Events.EPOCH_COMPLETED, early_stopping_handler)
engine.add_event_handler(Events.EPOCH_COMPLETED, log_patience) `
Thanks! I will investigate further; I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
Awesome, thanks WackyRabbit7 , AgitatedDove14 !
AgitatedDove14 Is it fixed with trains-server 0.15.1?
AgitatedDove14 ok, but this happens on my local machine, not in the agent
Yes, that was my assumption as well. There could be several causes, to be honest, now that I see that matplotlib itself is also leaking 😄
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
Default would be venv, and only use docker if an image is passed. Use case: not having to duplicate all the queues to accept both docker and venv agents on the same instances
Otherwise I can try loading the file with a custom loader, saving it as a temp file, passing the temp file to connect_configuration (which will return another temp file with the overridden config), and then passing this new file to OmegaConf
ProxyDictPostWrite._to_dict() will recursively convert it to a plain dict, and OmegaConf will not complain then
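A minimal sketch of that approach, assuming a made-up config dict (ProxyDictPostWrite is what task.connect() returns for a mutable dict, and _to_dict is its private helper):
` from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="my_project", task_name="training")
# connect() wraps the dict in a ProxyDictPostWrite so that edits are tracked
config = task.connect({"lr": 0.001, "batch_size": 32})
# convert the proxy back to a plain dict before handing it to OmegaConf
cfg = OmegaConf.create(config._to_dict()) `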
So I need to have this merging of small configuration files to build the bigger one
I am confused now because I see that in the master branch, the clearml.conf file has the following section:
` # Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false `
So it states that IAM role using metadata service should be supported, right?
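If so, then enabling it should just be a matter of flipping that flag in clearml.conf - a sketch, with the section path taken from the default config in the repo (the rest is my assumption):
` sdk {
    aws {
        s3 {
            # let Boto3 resolve credentials on its own: environment variables,
            # credential file, or the IAM role via the metadata service
            use_credentials_chain: true
        }
    }
} `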
This one doesn’t have _to_dict
unfortunately
I was asking to exclude this possibility from my debugging journey 😁
This is the mapping of the faulty index:
` {
"events-plot-d1bd92a3b039400cbafc60a7a5b1e52b_new" : {
"mappings" : {
"dynamic" : "strict",
"properties" : {
"@timestamp" : {
"type" : "date"
},
"iter" : {
"type" : "long"
},
"metric" : {
"type" : "keyword"
},
"plot_data" : {
"type" : "binary"
},
"plot_len" : {
"type" : "long"
},
"plot_str" : {
...
Nevertheless there might still be some value in that, because it would allow reducing the startup time by removing the initial setup of the agent and the download of the data to the instance - but not as much as I described initially, if stopped instances are bound to the same capacity limitations as newly launched instances
AgitatedDove14 Yes, exactly! It is shown in the recording above
Ok, in that case it probably doesn’t work, because if the default value is 10 secs, it doesn’t match what I get in the logs of the experiment: every second tqdm adds a new line
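For reference, the setting I assume is being discussed here (the exact key name is my assumption from memory, so double-check it against the default clearml.conf):
` sdk {
    development {
        worker {
            # how often (in seconds) carriage-return updates (e.g. tqdm
            # progress bars) are flushed to the console log
            console_cr_flush_period: 10
        }
    }
} `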