Btw, the instance listed in the console has no name, is it normal?
But I am not sure it will connect the parameters properly, I will check now
I have 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for cuda 10.1; I guess this is because no wheel exists for torch==1.3.1 and cuda 11.0
I managed to do it by using logger.report_scalar, thanks!
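For reference, roughly the call I mean, assuming the clearml SDK (project, task, title, series and values here are just placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="scalar demo")
logger = task.get_logger()

# report a single scalar point: title = plot name, series = line within the plot
logger.report_scalar(title="loss", series="train", value=0.05, iteration=10)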
I tried removing type=str but I got the same problem
Hi DilapidatedDucks58, I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
- Update the experiment status to stopped (if it is failed, you won't be able to re-enqueue it)
- Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directly: I simply add a resume tag to the task and check at runtime if this tag exists; if yes, I fetch the latest checkpoint of the task)
- Use https://clea...
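A rough sketch of that step with the SDK (the task id, parameter name, checkpoint path and queue name are placeholders, and the exact status-update call may depend on the task's current state):

from clearml import Task

# fetch the previously executed task
task = Task.get_task(task_id="<task-id>")

# 1. move it to "stopped" so it can be re-enqueued (a failed task cannot be)
task.mark_stopped()

# 2. point the task at the latest checkpoint via a parameter
task.set_parameter("General/resume_checkpoint", "s3://my-bucket/checkpoints/latest.pt")

# 3. put it back in an execution queue
Task.enqueue(task, queue_name="default")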
Not sure about that, I think you guys solved it with your PipelineController implementation. I would need to test it before giving any feedback
I did change the replica setting on the same index, yes. I reverted it back from 1 to 0 afterwards
Here is the minimal reproducible example.
Run test_task_a.py - it will register a dummy artifact, create a new task, set a parameter in that task, and enqueue it. test_task_b will then try to retrieve the parameter from the parent task and fail.
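Roughly, test_task_a.py does something like this (a simplified sketch: the artifact name, project/queue names and the exact creation/enqueue calls are placeholders):

from trains import Task

# register a dummy artifact on the current task
task = Task.init("test", "test_task_a")
task.upload_artifact("my_artifact", artifact_object={"dummy": 42})

# create a second task, point it at the artifact via a parameter, and enqueue it
child = Task.create(project_name="test", task_name="test_task_b")
child.set_parameter("General/artifact_name", "my_artifact")
Task.enqueue(child, queue_name="default")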
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs? (See my previous message)
Hi SuccessfulKoala55, super, that's what I was looking for
ok, now I actually remember why I used _update_requirements instead of add_requirements: the first overwrites all the others, the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment
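To illustrate the difference (a sketch, assuming the clearml SDK; the package names are placeholders, and _update_requirements is a private API so its behavior may change between versions):

from clearml import Task

# add_requirements: appended on top of the automatically detected packages
# (must be called before Task.init)
Task.add_requirements("pandas", "1.1.0")

task = Task.init(project_name="test", task_name="requirements demo")

# _update_requirements: replaces the detected requirements list entirely
task._update_requirements(["my-project==0.1.0"])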
I should also rename /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 folder to /opt/trains/data/elastic before running the migration tool right?
Alright, thanks for the answer! Seems legit then
Will from clearml import Task raise an error if no clearml.conf exists? Or only when features that actually require the server (such as Task.init) are called?
Yes, it did spin two instances for the same task
Thanks for your inputs, I will try that! For completion, here is how I retrieve the parameters:
from trains import Task

# initialize the current task and fetch its parent
task = Task.init("test", "test")
parent_task = Task.get_task(task.parent)

# read the artifact name from this task's parameters, then fetch the artifact from the parent
task.get_logger().report_text(task.get_parameters())
artifact_name = task.get_parameter("General/artifact_name")
artifact = parent_task.artifacts[artifact_name].get()
the first problem I had, which didn't give useful info, was that docker was not installed on the agent machine x)
the instances take so much time to start, like 5 mins
meaning the REST API returns nothing, is that correct?
Yes exactly, this is the response from the api server when I try to scroll down on the console to get more logs
This is no coincidence - any data versioning tool you will find is somewhat close to how git works (dvc, etc.), since they all aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out imo is the straightforward CLI combined with the Pythonic API that allows you to register/retrieve datasets very easily
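For example, registering and retrieving a dataset from Python is only a few lines (a sketch; the project/dataset names and local path are placeholders):

from clearml import Dataset

# register a dataset
ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files("/path/to/local/data")
ds.upload()
ds.finalize()

# retrieve it later (returns a local cached copy of the files)
local_copy = Dataset.get(dataset_project="my_project", dataset_name="my_dataset").get_local_copy()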
Never mind, I was able to make it work, but no idea how
Yes, it works now! Yay!
thanks for your help!
Hi TimelyPenguin76, I guess it tries to spin them down a second time, hence the double print
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers on the system
I can probably have a python script that checks if there are any tasks running/pending, and if not, runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS volume, then waits until it is finished, then restarts the clearml-server, wdyt?
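Something along these lines (a sketch only; the compose file path, volume id and the status filter are assumptions):

import subprocess

import boto3
from clearml import Task

COMPOSE_FILE = "/opt/clearml/docker-compose.yml"  # placeholder path
EBS_VOLUME_ID = "vol-0123456789abcdef0"           # placeholder volume id

# 1. only proceed if nothing is running or queued
active = Task.get_tasks(task_filter={"status": ["in_progress", "queued"]})
if active:
    raise SystemExit("Tasks still running/pending, skipping snapshot")

# 2. stop the clearml-server
subprocess.run(["docker-compose", "-f", COMPOSE_FILE, "down"], check=True)

# 3. snapshot the EBS volume and wait until it is completed
ec2 = boto3.client("ec2")
snapshot = ec2.create_snapshot(VolumeId=EBS_VOLUME_ID, Description="clearml-server data backup")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# 4. restart the clearml-server
subprocess.run(["docker-compose", "-f", COMPOSE_FILE, "up", "-d"], check=True)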
--- /data ----------
 48.4 GiB [##########] /elastic_7
  1.8 GiB [          ] /shared
879.1 MiB [          ] /fileserver
163.5 MiB [          ] /clearml_cache
 38.6 MiB [          ] /mongo
  8.0 KiB [          ] /redis
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far