ok, thanks SuccessfulKoala55 !
I’d like to move to a setup where I don’t need these tricks
So it is there already, but commented out, any reason why?
Thanks AgitatedDove14 ! I created a project with a default output destination pointing to an S3 bucket, but I don't have local access to this bucket (only the agents have access to it, for security reasons). Because of that, I cannot create a task in this project programmatically from my local machine: it tries to access the bucket and fails. And there is no easy way to change the default output location (not in the web UI, not in the SDK)
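For context, this is roughly the call that fails locally (project and task names here are placeholders):
```python
from clearml import Task

# The project was created with a default output destination of s3://<restricted-bucket>.
# Calling Task.init locally then tries to reach that bucket and fails,
# because only the agents hold credentials for it.
task = Task.init(project_name="my-project", task_name="local-debug-run")
```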
yes, the only thing I changed is:
```
install_requires=[
    ...
    "my-dep @ git+..."
]
```
to:
```
install_requires=[
    ...
    "git+..."
]
```
```
torch==1.7.1
git+...
```
I am already trying with the latest pip 😞
Hey SuccessfulKoala55 , unfortunately this doesn't work: the dict contains other dicts, and only the first-level dict becomes a plain dict; the inner dicts are still ProxyDictPostWrite instances, and that makes OmegaConf.create fail
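Something like this recursive conversion is what seems to be needed before handing the config to OmegaConf (just a sketch; to_plain_dict and connected_config are names I made up):
```python
from omegaconf import OmegaConf

def to_plain_dict(obj):
    # Recursively convert nested mappings (e.g. ClearML's ProxyDictPostWrite
    # wrappers, which behave like dicts) and lists into plain Python objects
    # so that OmegaConf.create accepts them.
    if isinstance(obj, dict):
        return {k: to_plain_dict(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_plain_dict(v) for v in obj]
    return obj

cfg = OmegaConf.create(to_plain_dict(connected_config))
```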
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick up another one. If I add one more experiment to the queue at that point (so trains-agent-1 is running two experiments at the same time), that experiment stays in the queue (trains-agent-2 and trains-agent-3 won't pick it up because they are also running experiments)
line 13 is empty 🤔
No space, I will add and test 🙂
hooo now I understand, thanks for clarifying AgitatedDove14 !
Hi CostlyOstrich36 , I am not using Hydra, only OmegaConf, so you mean just calling OmegaConf.load should be enough?
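i.e., just something like this (the path is a placeholder)?
```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("config.yaml")
```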
```python
with open(path, "r") as stream:
    return yaml.load(stream, Loader=yaml.FullLoader)
```
Note: Could be related to https://github.com/allegroai/clearml/issues/790 , not sure
I have a custom way of reading the config file
it would be nice if Task.connect_configuration could support custom yaml file readers for me
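In the meantime, something like this sketch is the kind of workaround I'm thinking of: parse the file with the custom reader and connect the resulting dict instead of the file path (project/task names are placeholders):
```python
import yaml
from clearml import Task

def read_config(path):
    # the custom reader mentioned above
    with open(path, "r") as stream:
        return yaml.load(stream, Loader=yaml.FullLoader)

task = Task.init(project_name="my-project", task_name="example")
config = read_config("config.yaml")
# connect the already-parsed dict rather than pointing ClearML at the file
config = task.connect_configuration(config, name="config")
```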
And since I ran the task locally with Python 3.9, it used that version in the Docker container
CostlyOstrich36 I don’t see such a number, can you please share a screenshot of where to look?
hoo that’s cool! I could place torch==1.3.1 there
Sorry both of you, my problem was actually lying somewhere else (both buckets are in the same region) - thanks for your time!
AgitatedDove14 Didn’t work 😞
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO post it could be related to the size of the repo. Might be a good idea to clone with --depth 1
in the agents?
Or more generally, try to catch this error and retry a few times?
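Roughly the pattern I mean (a generic sketch; the actual retry would have to live around the agent's clone step, and clone_repository is a made-up name):
```python
import time

def with_retries(fn, attempts=3, delay=5.0):
    # Call fn(), retrying a few times with a fixed delay between attempts;
    # re-raise the last error if all attempts fail.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)

# e.g. with_retries(lambda: clone_repository(repo_url))
```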
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to find the issue, I was creating a queue and worker subprocess that were not properly cleaned up
Unfortunately this is difficult to reproduce... Nevertheless it would be important for me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they already ran (a mechanism similar to what https://luigi.readthedocs.io/en/stable/ offers). That would allow restarting the pipeline and skipping tasks up to the point where it failed
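Something along these lines is what I have in mind, sketched with Task.get_tasks (the helper and the project/task names are just illustrative):
```python
from clearml import Task

def already_completed(project_name, task_name):
    # Look for a previously completed task with the same name, so the
    # pipeline step can be skipped instead of re-run, roughly like
    # Luigi's "output already exists" check.
    existing = Task.get_tasks(
        project_name=project_name,
        task_name=task_name,
        task_filter={"status": ["completed"]},
    )
    return len(existing) > 0

if not already_completed("my-project", "preprocess"):
    # run or enqueue the step here
    ...
```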
ClearML has a task.set_initial_iteration, I used it as such:
```python
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
```
But it's still the same issue; I am not sure whether I'm using it correctly or whether it's a bug, AgitatedDove14 ? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)