To help you debug this: in the /dashboard endpoint, all projects were still there, but empty (no experiments inside). No experiments were archived either.
Still failing with the same error
Hi CostlyOstrich36, one more observation: it looks like when I don't open the experiment in the webUI before it is finished, I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don't see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or rarely) afterwards. Is that possible?
Yes! Not a strong use case though; I rather wanted to ask if it was supported somehow
but if the task is now running on an agent, isn't it a possible source of conflict? I would expect that after calling Task.enqueue(exit=True), the local task is closed and no processes related to it are running
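To make it concrete, this is the kind of flow I have in mind (a minimal sketch, assuming execute_remotely(exit_process=True) is the call that matches what I described; project, task and queue names are placeholders):
```
from clearml import Task

# Create the task locally, then hand it over to an agent.
task = Task.init(project_name="test", task_name="test")

# Assumption: with exit_process=True, the task is enqueued and the local
# process terminates here, so nothing related to it keeps running locally.
task.execute_remotely(queue_name="default", exit_process=True)

# Nothing below this line should run on the local machine.
```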
Thanks a lot AgitatedDove14 !
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO it could be related to the size of the repo. Might it be a good idea to clone with --depth 1 in the agents?
Or more generally, try to catch this error and retry a few times?
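Something along these lines, for example (a rough sketch, not actual trains-agent code; the repo URL and retry count are made up):
```
import subprocess
import time

REPO_URL = "https://github.com/example/my-repo.git"  # placeholder URL


def shallow_clone_with_retries(url, dest, retries=3, delay=10):
    """Clone with --depth 1 and retry a few times on transient failures."""
    for attempt in range(1, retries + 1):
        result = subprocess.run(
            ["git", "clone", "--depth", "1", url, dest],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return
        print(f"clone attempt {attempt} failed: {result.stderr.strip()}")
        time.sleep(delay)
    raise RuntimeError(f"could not clone {url} after {retries} attempts")


shallow_clone_with_retries(REPO_URL, "/tmp/my-repo")
```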
Hoo I found:
user@trains-agent-1: ps -ax
 5199 ?  Sl  29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
 6096 ?  Sl  30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
Will from clearml import Task raise an error if no clearml.conf exists? Or only when features that actually require the server (such as Task.init) are called?
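To make the question concrete, this is the pattern I'd like to be able to rely on (just a sketch of what I hope is safe, not a statement of how clearml actually behaves):
```
# The import alone should (I hope) not need clearml.conf or a server.
from clearml import Task


def maybe_init_task():
    """Only touch the server when we actually need it."""
    try:
        # This is the point where I expect missing configuration to surface.
        return Task.init(project_name="test", task_name="test")
    except Exception as exc:
        print(f"clearml not configured, running without tracking: {exc}")
        return None


task = maybe_init_task()
```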
Yes, it did spin two instances for the same task
Thanks for your inputs, I will try that! For completeness, here is how I retrieve the parameters:
```
from trains import Task

# Current task (the one executed by the agent).
task = Task.init("test", "test")

# The task that enqueued this one and holds the artifact.
parent_task = Task.get_task(task.parent)
task.get_logger().report_text(task.get_parameters())

# The artifact name is passed in through the task parameters.
artifact_name = task.get_parameter("General/artifact_name")
artifact = parent_task.artifacts[artifact_name].get()
```
the first problem I had, which didn't give useful info, was that docker was not installed on the agent machine x)
the instances take so much time to start, around 5 minutes
meaning the REST API returns nothing, is that correct?
Yes exactly, this is the response from the API server when I try to scroll down in the console to get more logs
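For reference, this is roughly how I'm checking it outside the web UI (a sketch using the APIClient; the task ID is a placeholder, and I'm assuming events.get_task_log is the endpoint the console scrolling uses):
```
from clearml.backend_api.session.client import APIClient

client = APIClient()

# Placeholder ID of the experiment whose console logs look truncated.
TASK_ID = "0123456789abcdef0123456789abcdef"

# Assumption: this is the same call the web UI console makes when scrolling.
response = client.events.get_task_log(
    task=TASK_ID,
    batch_size=100,
    navigate_earlier=True,
)
print(len(response.events), "log events returned")
```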
This is no coincidence - any data versioning tool you will find is somehow close to how git works (dvc, etc.), since they all aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out imo is the straightforward CLI combined with the Pythonic API that allows you to register/retrieve datasets very easily
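For example, registering and then retrieving a dataset from Python looks roughly like this (just a sketch; dataset name, project and local path are placeholders):
```
from clearml import Dataset

# Register a new dataset version from a local folder.
ds = Dataset.create(dataset_name="my-dataset", dataset_project="datasets")
ds.add_files(path="data/")   # placeholder local folder
ds.upload()                  # push the files to the storage backend
ds.finalize()                # freeze this version

# Later, anywhere else: retrieve a read-only local copy by name.
local_path = Dataset.get(
    dataset_name="my-dataset", dataset_project="datasets"
).get_local_copy()
print(local_path)
```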
Never mind, I was able to make it work, but I have no idea how
Yes, it works now! Yay!
thanks for your help!
Hi TimelyPenguin76 , I guess it tries to spin them down a second time, hence the double print
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the CUDA drivers on the system
I can probably have a python script that checks whether any tasks are running or pending and, if not, runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS volume, waits until it is finished, and then restarts the clearml-server. wdyt?
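Roughly something like this (just a sketch; the volume ID, compose directory and status filter are assumptions on my side):
```
import subprocess
import time

import boto3
from clearml import Task

COMPOSE_DIR = "/opt/clearml"             # assumed docker-compose location
EBS_VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume ID


def busy():
    """Return True if any task is still running or waiting."""
    tasks = Task.get_tasks(task_filter={"status": ["in_progress", "queued"]})
    return bool(tasks)


while busy():
    time.sleep(60)

# Stop the server so the EBS snapshot is consistent.
subprocess.run(["docker-compose", "down"], cwd=COMPOSE_DIR, check=True)

ec2 = boto3.client("ec2")
snapshot = ec2.create_snapshot(VolumeId=EBS_VOLUME_ID,
                               Description="clearml-server backup")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# Bring the server back up once the snapshot is done.
subprocess.run(["docker-compose", "up", "-d"], cwd=COMPOSE_DIR, check=True)
```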
--- /data ----------
   48.4 GiB [##########] /elastic_7
    1.8 GiB [          ] /shared
  879.1 MiB [          ] /fileserver
  163.5 MiB [          ] /clearml_cache
   38.6 MiB [          ] /mongo
    8.0 KiB [          ] /redis
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far
Very nice! Maybe we could have this option as a toggle setting in the user profile page, so that by default we keep the current behaviour, and users like me can change it. wdyt?
no, one worker (trains-agent-1) "forgets" from time to time the current experiment it is running and picks up another experiment on top of the one it is currently running