
Maybe then we can extend task.upload_artifact?
def upload_artifact(..., wait_for_upload: bool = False):
    ...
    if wait_for_upload:
        self.flush(wait_for_uploads=True)
Not really, because the call sits in the middle of the controller task; there are other things to be done afterwards (retrieving results, logging new artifacts, creating new tasks, etc.).
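In other words, this is roughly what the controller needs to be able to do (a rough sketch, with made-up project and artifact names, assuming flush(wait_for_uploads=True) blocks until pending uploads complete):

from clearml import Task

# Controller task: upload an artifact, then block until the upload finishes
# before moving on to the rest of the pipeline logic.
controller = Task.init(project_name="pipelines", task_name="controller")  # hypothetical names

controller.upload_artifact("intermediate_data", artifact_object={"foo": 1})
controller.flush(wait_for_uploads=True)  # wait here instead of at the end of the task

# ... continue with the controller work: retrieve results, log new artifacts,
# create and enqueue follow-up tasks, etc.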
Here is the minimal reproducible example.
Run test_task_a.py: it will register a dummy artifact, create a new task, set a parameter in that task, and enqueue it. test_task_b will then try to retrieve the parameter from the parent task and fail.
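Roughly, the two scripts look like this (a sketch, not the exact files: the project, queue, and parameter names are made up, and passing the parent task id as a parameter is just my assumption about how the link is established):

# test_task_a.py -- controller side
from clearml import Task

task_a = Task.init(project_name="repro", task_name="test_task_a")
task_a.upload_artifact("dummy", artifact_object={"some": "data"})

task_b = Task.create(project_name="repro", task_name="test_task_b",
                     script="test_task_b.py")
task_b.set_parameter("General/parent_task_id", task_a.id)
Task.enqueue(task_b, queue_name="default")

# test_task_b.py -- executed by the agent
from clearml import Task

me = Task.init(project_name="repro", task_name="test_task_b")
parent_id = me.get_parameter("General/parent_task_id")
parent = Task.get_task(task_id=parent_id)
# This is where the failure shows up: the parent's data is not visible yet.
print(parent.get_parameters())
print(parent.artifacts.keys())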
Unfortunately this is difficult to reproduce... Nevertheless it would be important for me to be robust against it, because if this error happens in a task in the middle of my pipeline, the whole process fails.
This ties into another, wider topic I think: how to "skip" tasks if they have already run (a mechanism similar to what [ https://luigi.readthedocs.io/en/stable/ ] offers). That would allow restarting the pipeline and skipping tasks up to the point where it failed.
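Something like this is what I have in mind, as a rough sketch (the project/task names and the "completed means skippable" rule are just assumptions on my side):

from clearml import Task

def already_completed(project_name: str, task_name: str) -> bool:
    # Look for a previous run of the same step that finished successfully.
    previous = Task.get_tasks(project_name=project_name, task_name=task_name,
                              task_filter={"status": ["completed"]})
    return len(previous) > 0

# In the controller: only enqueue the step if no completed run exists yet.
if not already_completed("repro", "test_task_b"):
    step = Task.create(project_name="repro", task_name="test_task_b",
                       script="test_task_b.py")
    Task.enqueue(step, queue_name="default")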
AgitatedDove14 I cannot confirm it 100%, the context is different (see previous messages), but it could be the same bug behind the scenes...
AnxiousSeal95 Any update on this topic? I am very excited to see where this can go
I should also rename the /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 folder to /opt/trains/data/elastic before running the migration tool, right?
Hi SuccessfulKoala55, super, that's what I was looking for
Not really: I just need to find the one that is compatible with torch==1.3.1
I now have a different question: when installing torch from wheel files, I am guaranteed to have the corresponding CUDA library and cuDNN bundled together, right?
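For what it's worth, this is how I would sanity-check what the installed wheel actually ships:

import torch

# Versions reported by the torch wheel itself
print(torch.__version__)                # e.g. 1.3.1
print(torch.version.cuda)               # CUDA runtime version the wheel was built against
print(torch.backends.cudnn.version())   # cuDNN version bundled with the wheel
print(torch.cuda.is_available())        # whether the local driver can actually use it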
I will try to isolate the bug; if I can, I will open an issue in trains-agent
Thanks SuccessfulKoala55! So CLEARML_NO_DEFAULT_SERVER=1 is set by default, right?
Sure, will be happy to debug that
Will from clearml import Task raise an error if no clearml.conf exists? Or only when features that actually require the server (such as Task.init) are called?
Which commit corresponds to the RC version? So far we have tested with the latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff)
My use case is: on a spot instance marked for termination by AWS (2-minute notice), I want to close the running task and prevent the clearml-agent from picking up a new task afterwards.
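Roughly what I picture, as a sketch (the metadata URL is the standard AWS spot termination notice endpoint; whether "clearml-agent daemon --stop" is the right way to stop the agent here is an assumption on my part):

import subprocess
import time

import requests
from clearml import Task

SPOT_NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def termination_scheduled() -> bool:
    # AWS returns 404 here until the instance is marked for termination.
    try:
        return requests.get(SPOT_NOTICE_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False

while not termination_scheduled():
    time.sleep(5)

# Mark the running task as stopped (could also be task.close(), depending on the desired status)...
task = Task.current_task()
if task is not None:
    task.mark_stopped()

# ...and ask the agent not to pick up anything new.
# Assumption: the daemon on this machine can be stopped this way.
subprocess.run(["clearml-agent", "daemon", "--stop"], check=False)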
I am using 0.17.5; it could be either a bug in ignite or indeed a delay in the send. I will try to build a simple reproducible example to understand the cause.
Also, maybe we are not on the same page: by "clean up" I mean killing a detached subprocess on the machine executing the agent.
I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt?
That sounds great!
From the answers I saw on the internet, it is most likely related to a mismatch between the CUDA and cuDNN versions.
The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with User aborted: stopping task (3) at some point (while installing the packages). The error message is surprising since I did not do anything. And then all following experiments are queued to the services queue and stuck there.