Thanks for the explanations,
Yes, that was the case. This is also what I would think, although I double-checked yesterday (rough code sketch below):
- I create a task on my local machine with trains 0.16.2rc0
- This task calls task.execute_remotely()
- The task is sent to an agent running with 0.16
- The agent installs trains 0.16.2rc0
- The agent runs the task, clones it and enqueues the cloned task
- The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
- When I clone the task manually usin...
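In code, the flow looks roughly like this (project and queue names are placeholders, and the clone/enqueue part is just my understanding of what happens on the agent side):
```
from trains import Task  # trains 0.16.2rc0 on my local machine

# created locally
task = Task.init(project_name="examples", task_name="remote test")

# sent to the agent's queue; the local process stops here
task.execute_remotely(queue_name="default", exit_process=True)

# from this point on the code runs on the agent, where the task
# clones itself and enqueues the clone, e.g.:
cloned = Task.clone(source_task=task)
Task.enqueue(cloned, queue_name="default")
```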
should I try to roll back to clearml-server 1.0.2? I am very anxious now…
it actually looks like I don’t need such a high number of files opened at the same time
SuccessfulKoala55 I want to avoid writing creds in plain text in the config file
This allows me to inject yaml files into other yaml files
trains-elastic container fails with the following error:
Ok, so what worked for me in the end was:
config = task.connect_configuration(read_yaml(conf_path))
cfg = OmegaConf.create(config._to_dict())
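For completeness, a self-contained version of that workaround (read_yaml is just my thin PyYAML helper, conf_path is a placeholder, and _to_dict() is a private method on the object trains returns, so this may break in future versions):
```
import yaml
from omegaconf import OmegaConf
from trains import Task

def read_yaml(path):
    # plain PyYAML load -> dict the task can store as a configuration object
    with open(path) as f:
        return yaml.safe_load(f)

task = Task.init(project_name="examples", task_name="omegaconf config")
conf_path = "conf/config.yaml"  # placeholder

# connect_configuration stores the dict on the task and, when running
# under an agent, returns the values as edited in the UI
config = task.connect_configuration(read_yaml(conf_path))

# rebuild an OmegaConf object so yaml interpolation/merging keeps working downstream
cfg = OmegaConf.create(config._to_dict())
```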
So it is already there, but commented out. Any reason why?
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
-> wrong numpy version
As you can see, there is more hard waiting (an initial sleep), and then before each apt action I make sure there is no lock.
Trying your code now… should take a couple of mins
I have no idea what's going on
is there a command / file for that?
SuccessfulKoala55 Here is the trains-elastic error
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs: since there is no limit by default, their size will grow forever, which doesn't sound ideal (e.g. the json-file logging driver's max-size option). https://docs.docker.com/compose/compose-file/#logging
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
Changing redis from version 6.2 to 6.2.11 fixed it, but I have new issues now 😄
I cannot share the file itself, but here are some potential helpful points:
- Multiple lines are empty
- One line is empty but has spaces (6 to be exact)
- The last line of the file is empty
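In case it helps to reproduce, a quick sketch of how I spotted those lines (the file name is a placeholder):
```
# report empty lines, whitespace-only lines, and a trailing empty line
path = "the_file.txt"  # placeholder

with open(path) as f:
    lines = f.read().split("\n")

for i, line in enumerate(lines, start=1):
    if line == "":
        print(f"line {i}: empty")
    elif line.strip() == "":
        print(f"line {i}: whitespace only ({len(line)} spaces)")

print("last line empty:", lines[-1] == "")
```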
And after the update, the loss graph appears
so most likely one hard requirement installs version 2 of pyjwt while setting up the experiment
and the agent says agent.cudnn_version = 0
but post_packages does not reinstall version 1.7.1
yes, exactly: I run python my_script.py
, the script executes, creates the task, calls task.execute_remotely(exit_process=True)
and returns to bash. Then, in the bash console, after some time, I see some messages being logged from clearml
but if the task is now running on an agent, isn’t that a possible source of conflict? I would expect that after calling Task.enqueue(exit=True), the local task is closed and no processes related to it are running
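For context, this is the behaviour I expect from my script (assuming execute_remotely(exit_process=True) is the call that actually does the enqueueing; names are placeholders):
```
from trains import Task

task = Task.init(project_name="examples", task_name="enqueue test")

# up to here everything runs locally; with exit_process=True the task is
# enqueued for the agent and this local process terminates, so nothing
# related to the task should keep running on my machine
task.execute_remotely(queue_name="default", exit_process=True)

# this should only ever execute on the agent, never locally
print("running on the agent")
```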
Thanks @<1523701087100473344:profile|SuccessfulKoala55> ! Are alive workers sending pings to notify the server that they are alive, or does the server infer that they are alive based on their last communication?
Sure 🙂 Opened https://github.com/allegroai/clearml/issues/568
Yes AnxiousSeal95, a stopped instance means you don’t pay for it, just for its storage, as described here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html . So AgitatedDove14, increasing the IDLE timeout would still make me pay for the instances while they are idle.
Do you get stopped instances instantly when you ask for them?
Well, that’s a good question. That’s what I observed some time ago, but according to their https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/...
Sure, where can I find this file?
Thanks for the help SuccessfulKoala55 , the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to run docker-compose down and then docker-compose up -d
afterwards, and not docker-compose restart