
AgitatedDove14 This looks awesome! Unfortunately this would require a lot of changes in my current code, so for that project I found a workaround. But I will surely use it for the next pipelines I build!
Basically what I did is:
` if task_name is not None:
    project_name = parent_task.get_project_name()
    task = Task.get_task(project_name=project_name, task_name=task_name)
    if task is not None:
        return task
# Otherwise, create the Task here `
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists
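The pattern generalizes beyond clearml; a minimal sketch with the lookup and creation steps as injectable callables (`get_or_create`, `lookup`, and `create` are hypothetical names, standing in for `Task.get_task` and the task-creation code):

```python
def get_or_create(name, lookup, create):
    # Return the existing object if the lookup finds one,
    # otherwise build and return a fresh one.
    if name is not None:
        existing = lookup(name)
        if existing is not None:
            return existing
    return create(name)
```

Calling it twice with the same name then returns the same object instead of creating a duplicate subtask.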
Not sure about that; I think you guys solved it with your PipelineController implementation. I would need to test it before giving any feedback.
continue_last_task
is almost what I want; the only problem with it is that it will start the task even if the task is completed
Oh, I wasn't aware of that new implementation; was it introduced silently? I don't remember reading about it in the release notes! To answer your question: no, for GCP I used the old version, but for Azure I will use this one, and maybe send a PR if the code is clean.
AgitatedDove14 Up! I would like to know if I should wait for the next release of trains or if I can already start implementing Azure support
Both ^^, I already adapted the code for GCP and I was planning to adapt it to Azure now
ProxyDictPostWrite._to_dict()
will recursively convert to a plain dict, and then OmegaConf will not complain
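`ProxyDictPostWrite` is clearml-internal, but the idea can be sketched generically: recursively unwrap a dict-like proxy into plain builtins before handing it to `OmegaConf.create()` (the `to_plain_dict` helper below is a hypothetical stand-in for `_to_dict()`):

```python
def to_plain_dict(obj):
    # Recursively convert dict/list subclasses (e.g. a write-tracking proxy)
    # into plain builtins, which OmegaConf accepts without complaint.
    if isinstance(obj, dict):
        return {k: to_plain_dict(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain_dict(v) for v in obj]
    return obj
```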
Hi AgitatedDove14, I upgraded to 1.3.1 and the bug of missing logs in the console is still there…
I made another recording so that you can understand what it is about:
I enqueue a task; the task starts; the logs shown in the console are very sparse; I scroll up and down to try to fetch the missing logs, without success; I download the logs, open the file, and there I see the full logs
Now I'm curious, what did you end up doing ?
in my repo I maintain a bash script to setup a separate python env. then in my task I spawn a subprocess and I don't pass the env variables, so that the subprocess properly picks up the separate python env
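A sketch of that spawn-with-clean-environment step (the minimal set of variables kept here is an assumption; adjust it to whatever the script actually needs):

```python
import os
import subprocess

def run_in_separate_env(python_bin, script_path):
    # Build a minimal environment instead of inheriting os.environ, so the
    # subprocess resolves the separate python env rather than the parent's.
    clean_env = {
        "PATH": os.path.dirname(python_bin) + os.pathsep + "/usr/bin:/bin",
        "HOME": os.environ.get("HOME", "/tmp"),
    }
    return subprocess.run(
        [python_bin, script_path],
        env=clean_env,
        capture_output=True,
        text=True,
    )
```

Because `env=` replaces the inherited environment entirely, variables set in the parent process never leak into the child.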
Hi DeterminedCrab71 Version: 1.1.1-135 β’ 1.1.1 β’ 2.14
Nice, the preview param will do! By the way, I love the new docs layout!
my docker-compose for the master node of the ES cluster is the following:
` version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
clust...
The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with User aborted: stopping task (3)
at some point (while installing the packages). The error message is surprising since I did not do anything. And then all following experiments are queued to the services queue and get stuck there.
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
-> wrong numpy version
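For context, the hex codes in that RuntimeError are numpy C-API version numbers (0xe = 14, 0xd = 13): the extension module was built against a newer numpy than the one installed. A tiny hypothetical helper to read the error:

```python
def diagnose_numpy_abi(built_api, installed_api):
    # numpy reports C-API versions in hex, e.g. 0xe (14) vs 0xd (13).
    if built_api > installed_api:
        return "upgrade numpy (or rebuild the extension against it)"
    if built_api < installed_api:
        return "rebuild the extension against the installed numpy"
    return "versions match"

print(diagnose_numpy_abi(0xe, 0xd))  # the case from the error above
```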
Hi there, yes I was able to make it work with some glue code:
Save your model, optimizer, and scheduler every epoch. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME. Have a service that periodically pulls failed experiments with the tag TO_RESUME, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra parameter.
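The watcher thread from the second step can be sketched like this (the GCP preemption metadata endpoint is the real one, but `instance_marked_for_stop` and `mark_to_resume` are illustrative names; any checker callable can be plugged in):

```python
import threading
import urllib.request

def instance_marked_for_stop():
    # On GCP, this metadata key flips to TRUE when the VM is being preempted.
    req = urllib.request.Request(
        "http://metadata.google.internal/computeMetadata/v1/instance/preempted",
        headers={"Metadata-Flavor": "Google"},
    )
    try:
        return urllib.request.urlopen(req, timeout=2).read().strip() == b"TRUE"
    except OSError:
        return False

def watch_for_preemption(mark_to_resume, check=instance_marked_for_stop,
                         interval=30.0):
    # Poll the metadata; once the instance is marked for stop, call
    # mark_to_resume() (e.g. add a TO_RESUME tag to the running task).
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            if check():
                mark_to_resume()
                return
            stop_event.wait(interval)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop_event, thread
```

Returning the event lets the main process shut the watcher down cleanly when the task finishes normally.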
ok, so there is no way to cache it and detect when the ref changes?
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
Thanks! (Maybe this could be added to the docs?)
Nevermind, I was able to make it work, but no idea how
with 1.1.1 I get User aborted: stopping task (3)