Hi SuccessfulKoala55, AgitatedDove14,
I updated to 1.4.0 (the web UI shows: WebApp: 1.5.0-186 • Server: 1.5.0-186 • API: 2.18).
Unfortunately, the bug is still there.
I don't see errors in the console anymore, though!
I had another look and modified an events.get_task_logs request with a very old timestamp to try to retrieve all logs; it returned only the few logs already displayed in the console. So I think the problem doesn't come from the web UI, but from the...
Alright, thanks for the answer! Seems legit then.
I came up with the same code, thanks for the fast answer (yes, having a setter for that would be cool!)
AgitatedDove14, I have now tested with a real experiment. It works, but I saw two issues:
At first it doesn't detect torch and downloads it, but then it says that it is already installed, so it doesn't install it. One of the dependencies of my repository is another repository (repo-2 in the logs). Both of my repositories require numpy. When installing the first repository, it says Requirement already satisfied: numpy in /home/workeruser/.local/lib/python3.6/site-packages. Correct. But then it says `...
I think it comes from the web UI of version 1.2.0 of clearml-server, because I didn't change anything else.
Hi SuccessfulKoala55, thanks for the idea! The function registered with atexit.register() isn't called, though; maybe the way the agent kills the task is not supported by atexit.
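Python's atexit handlers run on a normal interpreter exit, but not when the process is killed by a signal that Python does not handle. A minimal sketch, assuming the agent stops the task by sending SIGTERM (not confirmed here; a SIGKILL cannot be intercepted at all), is to turn SIGTERM into a regular exit so the registered function still runs. `save_state` is a hypothetical stand-in for the real cleanup.

```python
import atexit
import signal
import sys


def save_state() -> None:
    # Hypothetical cleanup hook, standing in for "mark the experiment stopped + save artifacts".
    print("saving state before exit", flush=True)


def exit_on_sigterm(signum, frame) -> None:
    # Convert SIGTERM into a normal interpreter exit: SystemExit unwinds the stack,
    # so the functions registered with atexit are executed afterwards.
    # Assumption: the agent sends SIGTERM when it stops the task.
    sys.exit(128 + signum)


atexit.register(save_state)
# signal.signal() must be called from the main thread.
signal.signal(signal.SIGTERM, exit_on_sigterm)
```

The exit code 128 + signal number mirrors the shell convention for "terminated by signal", which keeps the exit status recognizable for whatever supervises the process.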
Could you please point me to the relevant component? I am not familiar with TypeScript, unfortunately.
I want the clearml-agent/instance to stop right after the experiment/training is βpausedβ (experiment marked as stopped + artifacts saved)
AgitatedDove14, I finally solved it: the problem was that --network='host' should be --network=host
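One plausible explanation, assuming the docker arguments are forwarded to docker as separate argv entries without going through a shell: a shell strips the quotes, but a verbatim argument keeps them, so docker would look for a network literally named 'host' with the quote characters included. The sketch below only illustrates that quoting difference with Python's standard `shlex` module; the two helper names are hypothetical.

```python
import shlex
from typing import List


def shell_parsed(arg: str) -> List[str]:
    # What a POSIX shell would hand to the program after quote removal.
    return shlex.split(arg)


def verbatim(arg: str) -> List[str]:
    # What the program receives if the string is passed as one argv entry, no shell involved.
    return [arg]


quoted = "--network='host'"
print(shell_parsed(quoted))  # ['--network=host']
print(verbatim(quoted))      # ["--network='host'"]
```

Writing the value without quotes (`--network=host`) gives the same argument either way, which is why it works regardless of how the string is forwarded.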
Not sure about that; I think you guys solved it with your PipelineController implementation. I would need to test it before giving any feedback.
The task is created using Task.clone(), yes.
AgitatedDove14 Is it fixed with trains-server 0.15.1?
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to SSH into the machine; I will keep you informed.
I will go for lunch, actually; back in ~1h.
I am confused now because I see that in the master branch, the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?
I mean that I have a taskA (controller) that is in charge of creating a taskB with the same argv parameters (I just change the entry point of taskB)
When an experiment on trains-agent-1 is finished, I randomly see either no experiment or a long experiment, and when two experiments are running, I randomly see only one of the two.
Downloading the artifacts is done only when actually calling get()/get_local_copy()
Yes, I rather meant: reproduce this behavior even for getting metadata on the artifacts.
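A minimal sketch of the behaviour being asked for here: both the metadata lookup and the file download are deferred until first use, then cached. This is not ClearML's Artifact class; `LazyArtifact`, `fetch_metadata` and `download` are hypothetical names, and the method is only called `get_local_copy` to echo the call mentioned above.

```python
from typing import Callable, Dict, Optional


class LazyArtifact:
    """Hypothetical artifact handle that defers every backend call until it is needed."""

    def __init__(self, name: str,
                 fetch_metadata: Callable[[str], Dict[str, str]],
                 download: Callable[[str], str]) -> None:
        self.name = name
        self._fetch_metadata = fetch_metadata
        self._download = download
        self._metadata: Optional[Dict[str, str]] = None
        self._local_path: Optional[str] = None

    @property
    def metadata(self) -> Dict[str, str]:
        # The first access triggers the lookup; later accesses reuse the cached result.
        if self._metadata is None:
            self._metadata = self._fetch_metadata(self.name)
        return self._metadata

    def get_local_copy(self) -> str:
        # The file is only downloaded when a local copy is actually requested, and only once.
        if self._local_path is None:
            self._local_path = self._download(self.name)
        return self._local_path
```

Creating many handles is then cheap: nothing is fetched until `metadata` or `get_local_copy()` is touched for a given artifact.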
To clarify: trains-agent runs a single service Task only.
Isn't it overkill to run a whole Ubuntu 18.04 just to run a dead-simple controller task?
Not really, because it is in the middle of the controller task; there are other things to be done afterwards (retrieving results, logging new artifacts, creating new tasks, etc.).
AppetizingMouse58, the events_plot.json template is missing the plot_len declaration. Could you please give me the definition of this field? (Reindexing with dynamic: strict fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed".)
OK, I won't have time to check the different database components. The first option (shutting down the server) sounds like the easiest for me; I would then run the script manually once a month or so.
That would be awesome, yes; only, on my side I have zero knowledge of the pip codebase.
Yes, in the Task being executed in the agents, I have:
from trains import Task
task = Task.init(...)
task.get_logger().report_text(str(task.get_parameters()))