I’ve added multi-node support to my code, and I found our lab only seems to share user files, because I installed Trains on one node but it doesn’t appear on the others
I’m just curious how the Trains setup on different nodes communicates about the task queue
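For context: in Trains/ClearML the queue lives on the server, and agents on each node poll it over the REST API rather than talking to each other directly. A toy sketch of that pull-based pattern (an in-process stand-in, not the real agent code):

```python
import queue
import threading

# Toy model: one central queue (the "server"), several workers (the
# "agents" on different nodes) pulling tasks independently. In the real
# system the queue is on the trains-server and agents poll it over HTTP.
task_queue = queue.Queue()
results = []

def agent(name):
    while True:
        try:
            task = task_queue.get(timeout=0.2)
        except queue.Empty:
            return  # queue drained, agent goes idle
        results.append((name, task))
        task_queue.task_done()

for t in ["exp-1", "exp-2", "exp-3", "exp-4"]:
    task_queue.put(t)

workers = [threading.Thread(target=agent, args=(f"node-{i}",)) for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Each "node" pulls whatever is next, so no node-to-node coordination is needed.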
We all use conda, so I guess there’s no need for Docker
Yeah, I’m done with the test; now I can run it as you said
I see. Now we’re trying to let the agents pop the experiments separately and see if they can communicate with each other, right?
I’m not sure if I’m using the command correctly
But the thing is that I can only log everything from the master node
It’s shared, but only the user files, i.e. everything under the ~/ directory
Never done this before; let me do a quick search
But the solution in that answer doesn’t help, because when I do a reverse tunnel with -R, the server can’t be brought up
I don’t think this is related to PyTorch, because the same problem shows up with mp.spawn
I tried to run trains-compose without -d to see the log:
trains-agent-services | trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server http://apiserver:8008 ?
trains-agent-services | http://192.5.53.86:8081 http://192.5.53.86:8080 http://apiserver:8008
I didn’t assign anything to TRAINS_HOST_IP; I’m not sure if the apiserver:8008 caused the problem
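A hedged sketch, assuming a standard trains-server docker-compose deployment where the agent-services container resolves the API address from TRAINS_HOST_IP: export the host’s real IP before bringing the stack up (the IP and compose file name here are placeholders, not taken from your setup):

```shell
# Set the externally reachable IP of the machine running trains-server,
# so trains-agent-services does not fall back to internal names like
# apiserver:8008 that other hosts cannot resolve.
export TRAINS_HOST_IP=192.5.53.86

# Then (re)start the stack; -d detaches, omit it to watch the logs.
docker-compose -f docker-compose.yml up -d
```

This is only a sketch of the usual pattern; check your server’s own compose file for the exact variable it reads.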
It only works when we set CLEARML_CONFIG_FILE before running the script
Yes, I think Trains might wrap the torch.load function, but the thing is that I need to load parts of the dataset using torch.load, so this error shows up many times during training. I found I can use this line:

task = Task.init(project_name="Alfred", task_name="trains_plot", auto_connect_frameworks={'pytorch': False})

but does it mean I cannot monitor the torch.load function any more?
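To make the trade-off concrete, here is a pure-Python sketch (no clearml or torch needed) of how framework auto-connect typically works: the integration wraps a library function, and disabling it restores the original, so you lose the monitoring but also the wrapper’s side effects. The names here are illustrative, not ClearML internals:

```python
import types

# Stand-in for the torch module; torch.load just returns a marker string.
torch = types.SimpleNamespace()
torch.load = lambda path: f"tensor from {path}"

_original_load = torch.load
calls = []  # what the tracker would have logged

def _patched_load(path):
    calls.append(path)              # monitoring hook
    return _original_load(path)

def connect_pytorch(enabled):
    # Like auto_connect_frameworks={'pytorch': enabled}: swap the wrapper
    # in or out without changing the function's return value.
    torch.load = _patched_load if enabled else _original_load

connect_pytorch(True)
torch.load("part1.pt")              # recorded by the hook
connect_pytorch(False)
torch.load("part2.pt")              # not recorded: no monitoring, no wrapper errors
```

So with the 'pytorch' hook off, torch.load calls behave exactly as vanilla PyTorch, and nothing about them is reported.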
I think so, let me give it a try. By the way, I just found the server API but I’m not sure how to use it. For example, for /debug.ping, should I POST the request to “ http://localhost:8080/debug/ping ” or “ http://localhost:8080/debug.ping ”?
Thanks, I’ll give it a try
Not for now; I think it can only run on multiple GPUs on one node
I tried to set the environment variable right before importing clearml, but it doesn’t work as expected:

import os
import socket
from pathlib import Path

os.environ['CLEARML_CONFIG_FILE'] = str(Path.home() / f"clearml-{socket.getfqdn()}.conf")

from clearml import Task
Task.init(project_name="Alfred", task_name="finalized", auto_connect_frameworks={'pytorch': False})
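A hedged workaround sketch: set the variable in the launching shell instead, so it is already in the process environment when Python starts and clearml is imported (the clearml-&lt;fqdn&gt;.conf naming mirrors the snippet above; the python3 one-liner is just a placeholder for the training script):

```shell
# The variable is part of the environment before the interpreter starts,
# so anything that reads it at import time sees the right value.
CLEARML_CONFIG_FILE="$HOME/clearml-$(hostname -f).conf" \
python3 -c 'import os; print(os.environ["CLEARML_CONFIG_FILE"])'
```

Setting it this way sidesteps any question of whether the library reads the variable before or after your os.environ assignment runs.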
I’m trying to install it on my lab server, but the same problem happens. When I try to create credentials it reports an error, but this time it gives more info:
Error 301 : Invalid user id: id=f46262bde88b4928997351a657901d8b, company=d1bd92a3b039400cbafc60a7a5b1e52b
Before I renamed it, I could log the experiment successfully; I basically added Task.init to the Python script and then just ran that script
I found the server API here: https://allegro.ai/clearml/docs/rst/references/clearml_api_ref , but I’m not sure how to use it. For example, for /debug.ping, should I POST the request to “ http://localhost:8080/debug/ping ” or “ http://localhost:8080/debug.ping ”?
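For what it’s worth, the server API uses “service.action” paths, so the dotted form /debug.ping is the endpoint shape, and API calls go to the API server port (8008 by default), not the web UI on 8080. A small sketch of building such a URL (host and port are assumptions for a default local install):

```python
# ClearML/Trains API endpoints are "<service>.<action>", e.g. debug.ping,
# served by the apiserver (default port 8008), not the web UI (8080).
base = "http://localhost:8008"
service, action = "debug", "ping"
url = f"{base}/{service}.{action}"
print(url)  # http://localhost:8008/debug.ping

# A request would then be, e.g., requests.post(url, json={}) with your
# credentials attached (not run here).
```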
Then access port 8008 through the tunnel
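A minimal sketch of that tunnel, assuming you can SSH into the node running the server as user@lab-server (both names are placeholders): forward the relevant ports to your local machine with -L (local forwarding, which avoids the -R issue mentioned earlier), then the API server is reachable at http://localhost:8008.

```shell
# -N: no remote command, just forwarding.
# 8008 = API server, 8080 = web UI, 8081 = file server.
ssh -N \
    -L 8008:localhost:8008 \
    -L 8080:localhost:8080 \
    -L 8081:localhost:8081 \
    user@lab-server
```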