
I will try adding `sudo sh -c "echo '\n* soft nofile 65535\n* hard nofile 65535' >> /etc/security/limits.conf"` to the `extra_vm_bash_script`, maybe that’s enough actually
Woohoo! Thanks 👌
But that was too complicated, I found an easier approach
btw CostlyOstrich36, I can see `Version: 1.1.1-135 • 1.1.1 • 2.14` in my Profile. What do these numbers correspond to?
(docker was installed with `sudo snap install docker`)
Ok, I guess I’ll just delete the whole loss series. Thanks!
but not as much as the ELB reports
I also don't understand what you mean by "unless the domain is different"...
The same way ssh keys are global, I would have expected the git creds to be used for any git operation
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49>, the issue above seemed to be the memory leak, and it looks like there is no problem from the ClearML side.
I trained successfully without the mem leak with `num_workers=0` and I am now testing with `num_workers=8`.
Sorry for the false positive :man-bowing:
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified `number_of_shards: 4` for the `events-log-d1bd92a3b039400cbafc60a7a5b1e52b` index. I then copied the documents from the old ES using the `_reindex` API. The index is 7.5 GB on one shard.
Now I see that this index on the new ES cluster is ~19.4 GB 🤔 The index is divided into the 4 shards, but each shard is between 4.7 GB and 5 GB!
I was expecting to have the same index size as in the previous e...
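For reference, this is roughly the remote `_reindex` call I mean, as a sketch using Python's `requests` (the hosts here are placeholders, and the old cluster address must be listed in `reindex.remote.whitelist` on the new cluster):
```python
import requests

# Copy documents from the old cluster into the new one (placeholder hosts).
# wait_for_completion=false runs the reindex as a background task.
resp = requests.post(
    "http://localhost:9200/_reindex?wait_for_completion=false",
    json={
        "source": {
            "remote": {"host": "http://old-es:9200"},
            "index": "events-log-d1bd92a3b039400cbafc60a7a5b1e52b",
        },
        "dest": {"index": "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"},
    },
)
resp.raise_for_status()
print(resp.json())  # contains the task id, which can be polled via the _tasks API
```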
Early debugging signals show that `auto_connect_frameworks={'matplotlib': False, 'joblib': False}` seems to have a positive impact - it is running now, I will confirm in a bit
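In case it helps anyone else, this is roughly where I pass that argument, a minimal sketch with placeholder project/task names:
```python
from clearml import Task

# Disable matplotlib and joblib auto-logging when initializing the task.
task = Task.init(
    project_name="examples",     # placeholder
    task_name="debug-mem-leak",  # placeholder
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)
```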
Sure, I opened an issue https://github.com/allegroai/clearml/issues/288 unfortunately I don't have time to open a PR 🙏
Ok, so what worked for me in the end was:
config = task.connect_configuration(read_yaml(conf_path))
cfg = OmegaConf.create(config._to_dict())
Note: I can verify that `post_packages` is well picked up by the trains-agent, since in the experiment log I see:
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.post_packages.0 = PyJWT==1.7.1
I ended up dropping omegaconf altogether
If I remove `security_group_ids` and just leave `subnet_id` in the configuration, it is not taken into account (the instances are created in the default subnet)
Could you please point me to the relevant component? I am not familiar with typescript unfortunately 😞
As a quick fix, can you test with auto refresh (see the top right button with the pause sign you have in the video)?
That doesn’t work unfortunately
The workaround I could find for now is to add the following to CONTAINER > SETUP SHELL SCRIPT:
mkdir -p ~/git/credential
chmod 0700 ~/git/credential
git config --global credential.helper 'cache --socket ~/git/credential/socket'
It could be yes, but the difference between `now` and `last_report_time` doesn’t match the difference I observe
No I agree, it’s probably not worth it
Is it safe to turn off replication while a reindex operation is happening? The reindexing is rather slow and I am wondering if turning off replication will speed up the process
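If it turns out to be safe, this is roughly how I would drop replicas and restore them afterwards, a sketch with a placeholder host and assuming the original replica count is 1:
```python
import requests

ES = "http://localhost:9200"  # placeholder address of the new cluster
INDEX = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# Drop replicas so only the primary shards are written during the reindex.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 0}}).raise_for_status()

# ... run the _reindex ...

# Restore the original replica count once the reindex finishes.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 1}}).raise_for_status()
```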
since we removed "." from the requirements?
So it could be that when restarting docker-compose, it used another volume, hence the loss of data
`Task.get_project_object().default_output_destination = None`
Yes, it did spin two instances for the same task