I assume you’re using a self-hosted server?
Yes
Oh nice, thanks for pointing this out!
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
So it is there already, but commented out, any reason why?
Thanks AgitatedDove14 ! I created a project with a default output destination pointing to an S3 bucket, but I don't have local access to this bucket (only agents have access to it, for security reasons). Because of that, I cannot create a task in this project programmatically on my machine, because it tries to access the bucket and fails. And there is no easy way to change the default output location (not in the web UI, not in the SDK).
yes, the only thing I changed is:
install_requires=[ ... "my-dep @ git+ " ]
to:
install_requires=[ ... "git+ " ]
torch==1.7.1 git+
I am already trying with the latest pip 😞
Hey SuccessfulKoala55 , unfortunately this doesn’t work, because the dict contains other dicts: only the first-level dict becomes a plain dict, while the inner dicts are still ProxyDictPostWrite
and will make OmegaConf.create fail
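For reference, a minimal sketch of the workaround I have in mind, assuming a plain recursive conversion is acceptable (connected_config is just a placeholder name for the dict returned by task.connect):
from collections.abc import Mapping
from omegaconf import OmegaConf

def to_plain_dict(obj):
    # Recursively turn any mapping (including ProxyDictPostWrite) into a plain dict
    if isinstance(obj, Mapping):
        return {key: to_plain_dict(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain_dict(value) for value in obj]
    return obj

cfg = OmegaConf.create(to_plain_dict(connected_config))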
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment to the queue at that point (when trains-agent-1 is running two experiments at the same time), that experiment will stay in the queue (trains-agent-2 and trains-agent-3 will not pick it because they are also running experiments)
line 13 is empty 🤔
No space, I will add and test 🙂
Hi CostlyOstrich36 , I am not using Hydra, only OmegaConf, so you mean just calling OmegaConf.load should be enough?
with open(path, "r") as stream:
    return yaml.load(stream, Loader=yaml.FullLoader)
Note: Could be related to https://github.com/allegroai/clearml/issues/790 , not sure
I have a custom way of reading the config file
it would be nice if Task.connect_configuration could support custom YAML file readers for me
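Something like this is what I would want to end up with (a rough sketch only; read_config is my custom reader from above, and the project/task/file names are just placeholders):
import yaml
from clearml import Task

def read_config(path):
    # my custom reader: plain yaml.load with FullLoader
    with open(path, "r") as stream:
        return yaml.load(stream, Loader=yaml.FullLoader)

task = Task.init(project_name="my-project", task_name="my-task")
# connect the already-parsed dict so the custom reader stays in my hands
config = task.connect_configuration(read_config("config.yaml"), name="config")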
And since I ran the task locally with python3.9, it used that version in the docker container
Ooh, that’s cool! I could place torch==1.3.1 there
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO post it could be related to the size of the repo. Might be a good idea to clone with --depth 1
in the agents?
Or more generally, try to catch this error and retry a few times?
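Something along these lines is what I mean (just a generic sketch of the retry idea, not the agent’s actual code):
import subprocess
import time

def clone_with_retries(repo_url, dest, attempts=3, delay=5):
    # Shallow clone, retrying a few times before giving up
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(["git", "clone", "--depth", "1", repo_url, dest], check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(delay)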
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to find the issue: I was creating a queue and a worker subprocess that were not properly cleaned up
Unfortunately this is difficult to reproduce... Nevertheless it would be important to me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they have already run (a mechanism similar to what Luigi ( https://luigi.readthedocs.io/en/stable/ ) offers). That would allow restarting the pipeline and skipping tasks up to the point where the task failed
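Roughly what I have in mind (a sketch only, assuming an already-completed task with the same project/name can be looked up and reused; get_or_run and the status filter are illustrative, not an existing pipeline feature):
from clearml import Task

def get_or_run(project_name, task_name, create_and_run):
    # Look for a previously completed task with the same project/name
    previous = Task.get_tasks(
        project_name=project_name,
        task_name=task_name,
        task_filter={"status": ["completed"]},
    )
    if previous:
        return previous[0]  # skip: reuse the finished task
    return create_and_run()  # otherwise run the step for real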
ClearML has a task.set_initial_iteration , I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
But I still get the same issue; I am not sure whether I am using it correctly or whether it’s a bug, AgitatedDove14 ? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Hi SmugDolphin23 thanks for the input! Will try now, but that seems hacky: to get it working I have to specify python3.8 twice:
once in the agent config file (agent.default_python is already python3.8, but it seems to be ignored), and once by making sure it is available (using the python:3.8 docker image). Is there a way to prevent this redundancy? I.e. if I want to change the Python version, can I control it from a single place?
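For clarity, these are the two places I mean (a sketch of my clearml.conf agent section and the docker image choice; the exact image tag is just what I happen to use):
agent {
    # what I set, but it seems to be ignored in docker mode
    default_python: "3.8"

    default_docker {
        # what actually decides the interpreter inside the container
        image: "python:3.8"
    }
}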