
Mmmh unfortunately not easily… I will try to debug deeper today, is there a way to resume a task from code to debug locally?
Something like replacing `Task.init` with `Task.get_task`, so that `Task.current_task` is the same task as the output of `Task.get_task`.
I also tried `task.set_initial_iteration(-task.data.last_iteration)`, hoping it would counteract the bug; it didn't work.
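Roughly, this is the kind of thing I'm picturing (just a sketch; the task ID is a placeholder, and I don't actually know whether this makes `Task.current_task` return the fetched task):
```python
from clearml import Task

# Instead of Task.init (which would create/reuse a run), fetch the existing task
# so the rest of the code can be debugged locally against it.
task = Task.get_task(task_id="<existing-task-id>")  # placeholder task ID

# What I'd like: Task.current_task() now returning this same task object
# assert Task.current_task() is task  # not sure this holds, hence the question

# And the iteration-offset attempt mentioned above:
task.set_initial_iteration(-task.data.last_iteration)
```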
Yes I did, I found the problem: docker-compose was using trains-server 0.15 because it didn't see the new version of trains-server. Hence I had trains-server 0.15 running with ES7.
-> I deleted all the containers and it successfully pulled trains-server 0.16. Now everything is running properly 🙂
This one doesn't have `_to_dict`, unfortunately.
Ha nice, makes perfect sense thanks AgitatedDove14 !
Now I'm curious, what did you end up doing?
In my repo I maintain a bash script to set up a separate python env. Then in my task I spawn a subprocess and I don't pass the env variables, so that the subprocess properly picks up the separate python env.
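Roughly like this (a minimal sketch; the interpreter path and script name are placeholders for whatever the bash script actually sets up):
```python
import subprocess

# Interpreter of the separate env created by the bash script (placeholder path)
other_env_python = "/opt/envs/engine-env/bin/python"

# Pass a minimal environment instead of os.environ, so the child process does not
# inherit PYTHONPATH / VIRTUAL_ENV from the agent and resolves packages from its own env.
subprocess.run(
    [other_env_python, "run_engine.py"],  # placeholder entry point
    env={"PATH": "/usr/bin:/bin"},
    check=True,
)
```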
Nevertheless there might still be some value in that, because it would reduce the startup time by removing the initial setup of the agent and the download of the data to the instance - but not as much as I described initially, if stopped instances are bound to the same capacity limitations as newly launched instances.
Looks like it's a hurray then 😄 🎉 🍾
Why is it required in the case where boto3 can figure them out itself within the EC2 instance?
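For reference, this is the behavior I mean (bucket name is a placeholder): on an EC2 instance with an attached IAM role, boto3 picks up credentials on its own:
```python
import boto3

# No keys configured anywhere: on an EC2 instance with an IAM role attached,
# boto3 resolves credentials from the instance metadata service automatically.
s3 = boto3.client("s3")
print(s3.list_objects_v2(Bucket="example-bucket"))  # placeholder bucket name
```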
but the post_packages does not reinstall version 1.7.1
yes, in setup.py I have:
..., install_requires=[ "my-private-dep @ git+ ... ", ... ], ...
Installing collected packages: my-engine
  Attempting uninstall: my-engine
    Found existing installation: my-engine 1.0.0
    Uninstalling my-engine-1.0.0:
      Successfully uninstalled my-engine-1.0.0
Successfully installed my-engine-1.0.0
This is the mapping of the faulty index:
{
"events-plot-d1bd92a3b039400cbafc60a7a5b1e52b_new" : {
"mappings" : {
"dynamic" : "strict",
"properties" : {
"@timestamp" : {
"type" : "date"
},
"iter" : {
"type" : "long"
},
"metric" : {
"type" : "keyword"
},
"plot_data" : {
"type" : "binary"
},
"plot_len" : {
"type" : "long"
},
"plot_str" : {
...
Yes AnxiousSeal95 , a stopped instance means you don't pay for it, only for its storage, as described in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html . So AgitatedDove14 , increasing the IDLE timeout would still make me pay for the instances while they are idle.
Do you get stopped instances instantly when you ask for them?
Well that’s a good question, that’s what I observed some time ago, but according to their https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/...
Here is the console with some errors
Hi CostlyOstrich36 , this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
So probably only the main process (rank=0) should attach the ClearMLLogger?
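Something along these lines is what I mean (a sketch only; it assumes the launcher sets the RANK env variable and that ClearMLLogger is pytorch-ignite's handler - project/task names are placeholders):
```python
import os

from ignite.contrib.handlers.clearml_logger import ClearMLLogger  # assumed import path

# Assumes the distributed launcher exposes the process rank via RANK
rank = int(os.environ.get("RANK", "0"))

if rank == 0:
    # Only the main process creates and attaches the ClearML logger;
    # the other ranks skip reporting entirely.
    clearml_logger = ClearMLLogger(project_name="examples", task_name="ddp-run")
    # ... attach output/metric handlers to the trainer here ...
```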
I assume you’re using a self-hosted server?
Yes
So two possible cases for trains-agent-1, either:
- It picks a new experiment -> it shows randomly one of the two experiments in the "workers" tab
- There is no new experiment in the default queue to start -> it shows randomly either no experiment or the one that it is running
Hi CostlyOstrich36 , there was no DB migration necessary since 1.6, right?
erf, I have the same problem with ProxyDictPreWrite 😄 What is the use case of this one?
Hi DilapidatedDucks58 , I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
- Update the experiment status to stopped (if it is failed, you won't be able to re-enqueue it)
- Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directly: I simply add a `resume` tag to the task, and check at runtime if this tag exists; if yes, I fetch the latest checkpoint of the task - see the sketch below)
- Use https://clea...
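A rough sketch of that runtime check (the project/task/artifact names are placeholders, and the checkpoint-loading line depends on the framework):
```python
from clearml import Task

task = Task.init(project_name="my-project", task_name="training")  # placeholder names

if "resume" in (task.get_tags() or []):
    # The task was re-enqueued with the "resume" tag: fetch the latest
    # checkpoint previously registered on this task and load it.
    artifact = task.artifacts.get("latest_checkpoint")  # placeholder artifact name
    if artifact is not None:
        checkpoint_path = artifact.get_local_copy()
        # model.load_state_dict(torch.load(checkpoint_path))  # framework-specific
```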
I am already trying with the latest pip 😞
Indeed, I actually had the old configuration that was not JSON - I converted it to JSON, and now it works 🙂
Note: I can verify that post_packages is well picked up by the trains-agent, since in the experiment log I see:
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.post_packages.0 = PyJWT==1.7.1
I don't have a registry to push my image to. I think I can get around it actually - will it work if I just build the image locally once, then start the agent? Docker would recognise that image locally and just use it, right? I won't need to update that image often anyway.
So I changed `ebs_device_name = "/dev/sda1"`, and now I correctly get the 100GB EBS volume mounted on `/`. All good 👍