I was trying to copy the content of that file
When I run the first one, `clearml-agent --config-file ~/clearml-iris.conf`, it outputs the help info.
Then I run the second one, and it basically outputs the same as just `init`.
Before I renamed it, I could log the experiment successfully; I basically added `task.init` to the Python script and then just ran that script.
I'm not sure if I'm using the command correctly.
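My guess is that `clearml-agent` with no subcommand just prints the help, so something like this is what I'd try next; a sketch, not verified on my setup, and the queue name `default` is just a placeholder:
```bash
# Point the agent at the per-host config and have it pull jobs from a queue.
clearml-agent --config-file ~/clearml-iris.conf daemon --queue default
```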
Do you think the local agent will be supported someday?
Guess I'll need to implement job scheduling myself.
I think this is not related to PyTorch, because it shows the same problem with `mp.spawn`.
So basically, spawn will run a function in several separate processes, so I followed the link you gave above and put `task.init` into that function.
I guess this way there will be multiple `task.init` calls running.
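Roughly what I have now looks like this; a minimal sketch, assuming `torch.multiprocessing` and placeholder project/task names:
```python
import torch.multiprocessing as mp
from clearml import Task

def worker(rank):
    # Task.init inside the spawned function, per the link above; each
    # process seems to end up registering its own experiment this way.
    task = Task.init(project_name="my-project", task_name=f"worker-{rank}")
    # ... per-rank training code ...

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)
```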
I tried to run the trains docker-compose without `-d` to see the logs,
trains-agent-services | trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server http://apiserver:8008 ?
trains-agent-services | http://192.5.53.86:8081 http://192.5.53.86:8080 http://apiserver:8008
I didn't assign anything to TRAINS_HOST_IP; not sure if the `apiserver:8008` caused the problem.
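What I plan to try next; a sketch assuming the server's docker-compose.yml is in the current directory and 192.5.53.86 (from the log above) is this host's IP:
```bash
# Set the host IP explicitly before bringing the stack up (no -d, so the
# logs stay in the foreground).
export TRAINS_HOST_IP=192.5.53.86
docker-compose -f docker-compose.yml up
```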
I'm trying to install it on my lab server, but the same problem happens. When I try to create credentials, it says error, but this time it gives more info:
Error 301 : Invalid user id: id=f46262bde88b4928997351a657901d8b, company=d1bd92a3b039400cbafc60a7a5b1e52b
Never done this before, let me do a quick search.
Not for now; I think it can only run on multiple GPUs on a single node.
Can you tell me how I can find out where the scalar logs are stored?
It's shared, but only the user files, i.e. everything under the ~/ directory.
I'm setting the environment variable in the Python script like this: `os.environ['CLEARML_CONFIG_FILE'] = str(Path.home() / f"clearml-{socket.getfqdn()}.conf")`
and then `task.init`.
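For completeness, the full snippet with its imports; the project/task names are placeholders:
```python
import os
import socket
from pathlib import Path

# Point ClearML at a per-host config file, e.g. ~/clearml-<fqdn>.conf;
# setting this before importing clearml is the safe ordering.
os.environ['CLEARML_CONFIG_FILE'] = str(Path.home() / f"clearml-{socket.getfqdn()}.conf")

from clearml import Task

task = Task.init(project_name="my-project", task_name="my-task")  # placeholder names
```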
Yeah, I'm done with the test; now I can run it as you said.
Yes, when I put the task init into the spawn function, it can run without error, but it seems that each of the child processes has its own experiment:
ClearML Task: created new task id=54ce0761934c42dbacb02a5c059314da
ClearML Task: created new task id=fe66f8ec29a1476c8e6176989a4c67e9
ClearML results page:
ClearML results page:
ClearML Task: overwriting (reusing) task id=de46ccdfb6c047f689db6e50e6fb8291
ClearML Task: created new task id=91f891a272364713a4c3019d0afa058e
ClearML re...
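One workaround I'm considering; a sketch under the assumption that only rank 0 actually needs to report, so a single experiment gets created (not verified against ClearML's multiprocessing handling):
```python
from clearml import Task

def worker(rank):
    # Only the rank-0 process initializes a task, so the run shows up as
    # one experiment instead of one per child process (assumption: the
    # other ranks don't need to log anything themselves).
    task = Task.init(project_name="my-project", task_name="multi-node-run") if rank == 0 else None
    # ... per-rank training code ...
```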
I found the server API here: https://allegro.ai/clearml/docs/rst/references/clearml_api_ref, but I'm not sure how to use it. For example, for /debug.ping, should I POST the request to "http://localhost:8080/debug/ping" or "http://localhost:8080/debug.ping"?
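My current guess, going by the dot-separated names in that reference and by the api_server address in the logs above (port 8008 rather than the web UI's 8080); unverified:
```bash
# Guessing the endpoint keeps the dotted form and lives on the API server port:
curl -X POST http://localhost:8008/debug.ping
```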
Thanks, it seems I need to forward all three ports.
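For the record, what I set up; assuming the default ports from the logs above (8080 web UI, 8008 API server, 8081 file server) and `user@labserver` as a placeholder:
```bash
# Forward the web UI, API server, and file server ports to the lab machine.
ssh -L 8080:localhost:8080 -L 8008:localhost:8008 -L 8081:localhost:8081 user@labserver
```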
I've added multi-node support to my code, and I found that our lab only seems to share the user files (everything under ~/), because I installed trains on one node but it doesn't appear on the others.
Is there any documentation for this?