PompousBeetle71 is this an argparse argument or a connected dictionary?
I have to leave; I'll be back online in a couple of hours.
Meanwhile, check that the ports are correct (just curl all the ports and see if you get an answer). If everything is okay, try running the text example again.
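If you'd rather script the port check, here is a minimal sketch, assuming the default trains-server layout (8080 web, 8008 api, 8081 files) on localhost:
```python
# Minimal port probe; host and ports are assumptions (default trains-server layout).
import socket

for port in (8080, 8008, 8081):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        answering = s.connect_ex(("localhost", port)) == 0
        print(f"port {port}: {'answering' if answering else 'no answer'}")
```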
Hi WickedGoat98
"Failed uploading to //:8081/files_server:"
That seems like the problem. What do you have defined as files_server in trains.conf?
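For reference, the relevant section of trains.conf typically looks like this (the host and ports below are placeholders for a default local server):
```
api {
    web_server: http://localhost:8080
    api_server: http://localhost:8008
    files_server: http://localhost:8081
}
```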
Hi ColossalAnt7
Try Ctrl-F5 to hard-refresh the page?!
It seems you are missing a few buttons 😉
Assuming you are using docker-compose, the console output is a good start
You can query the system and get all the experiments based on date, then grab the machine GPU metrics.
DefeatedCrab47 check the cleanup service, it queries the system with the APIClient.
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/examples/services/cleanup/cleanup_service.py#L72
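A rough sketch of that pattern, based on the linked cleanup service (the date filter is a placeholder):
```python
# Sketch only: list tasks whose status changed before a given date,
# paging through results the same way the cleanup service does.
from trains.backend_api.session.client import APIClient

client = APIClient()
page = 0
while True:
    tasks = client.tasks.get_all(
        status_changed=["<2020-12-01"],  # placeholder date filter
        order_by=["-last_update"],
        page_size=100,
        page=page,
    )
    if not tasks:
        break
    for task in tasks:
        # machine GPU metrics are typically reported as scalars
        # under the ':monitor:gpu' title on each task
        print(task.id, task.name, task.last_update)
    page += 1
```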
Hi DefeatedCrab47
You mean by trains-agent, or accumulated over all experiments?
Yes, that makes sense. I think what happened is that one of the processes completed the Task (i.e. closed it) before the others did, and so they threw an exception.
I switched to running each task in a separate process
I think that's probably the best (performance-wise as well), nice!
Hi GrotesqueOctopus42
Despite having reuse_last_task_id=True on Task.init, it always creates a new task ID. Has anyone ever had this issue?
So the way reuse_last_task_id=True works is: if there are no artifacts on the Task, it will be reused; but when running inside Jupyter the Task always has artifacts (the notebook itself), so it starts a new Task.
You can however pass a specific Task ID and it will reuse it: reuse_last_task_id="aabb11". Would that help?
Hmm, I'm sorry, it might be continue_last_task. Can you try:
```
Task.init(..., continue_last_task="aabb11")
```
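A minimal sketch of the full call ("aabb11" stands in for the Task ID from the thread; the project and task names are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="examples",          # placeholder
    task_name="notebook experiment",  # placeholder
    continue_last_task="aabb11",      # continue logging into this existing Task
)
```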
Hi GrotesqueOctopus42 ,
BTW: is it better to post the long error message in a reply, to avoid polluting the channel?
Yes, that is appreciated 🙂
Basically, post the logs in the thread of the initial message.
To fix this I had to spin up the agent using the --cpu-only flag (--docker --cpu-only)
Yes, if you do not specify --cpu-only, the agent will default to trying to access the GPUs.
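For example (the queue name here is just a placeholder):
```
clearml-agent daemon --queue default --docker --cpu-only
```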
Nice!
Hmm, can you run the agent in debug mode and check the specific console log?
```
clearml-agent --debug daemon --foreground ...
```
Did you set force_git_ssh_protocol: true?
https://github.com/allegroai/clearml-agent/blob/249b51a31bee97d63f41c6d5542e657962008b68/docs/clearml.conf#L39
Actually, it is better to leave it as is; it will just automatically mount the .ssh folder into the container. I will make sure the docs point to this option first.
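For reference, the option in the linked clearml.conf looks like this:
```
agent {
    # force git cloning over SSH regardless of the repository URL
    force_git_ssh_protocol: true
}
```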
WickedGoat98 is this related to plotly opening a web page when you call the show() method?
You can do:
```
if not Task.running_locally():
    fig.show()
```
Okay, I was able to reproduce it (this is odd) let me check ...
WickedGoat98 what's the clearml version you are using?
Hi GrotesqueOctopus42
In theory it can be built; the main hurdle is getting the ELK/Mongo/Redis containers for arm64 ...
Hi WickedGoat98
I'm trying to write an article on Medium about ClearML and I'm facing a problem with plotly figures.
This is awesome !
I ran the plotly_reporting.py example locally and the uploaded plot was ok.
So are you saying the same example code from the repository worked okay on your server, but showed nothing on the hosted server?
WickedGoat98 until the next RC release (should not take long), this will solve it:
```
df = pd.concat([tickerDf.Close, tickerDf_Change.Close_pcent], axis=1)
df = df[1:]
df.index = df.index.astype(str)
setattr(df, 'ticker', args.symbol)
```
Basically, removing the NaN and converting the datetime to a string representation (so plotly.js likes it).
Hey WickedGoat98
I found the bug: it is due to the fact that the numpy data (passed to plotly) contains both datetime and NaN values, and plotly.js does not like that. I'll make sure this is fixed; in the meantime you can just remove the first row (it contains the NaN):
```
df = pd.concat([tickerDf.Close, tickerDf_Change.Close_pcent], axis=1)
df = df[1:]
```
WickedGoat98 Nice!!!
BTW: The fix should solve both (i.e. no need to manually cast), I'll make sure the fix is on GitHub so you'll be able to verify 🙂
WickedGoat98 Same for me, let me ask the UI guys, I think this is a UI bug.
Also maybe before you post the article we could release a fix to both, what do you think?
EDIT:
Never mind 🙂 I just saw the Medium link, very cool!!!
WickedGoat98 this is awesome! Let me know how I could help 🙂
BTW: I checked regarding the plot comparison; this is a backend issue due to the size of the plot. I was told a fix will be deployed in a day or two.
WickedGoat98 give me a minute, I'm not sure it is not ClearML related
One last thing: make sure you spin up the pod container in privileged mode, because the trains-agent docker will spin up a sibling docker for your actual experiment.
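A sketch of what that could look like in the pod spec (the image name and the docker-socket mount are assumptions; the key part is privileged: true):
```
spec:
  containers:
    - name: trains-agent
      image: allegroai/trains-agent   # assumed image name
      securityContext:
        privileged: true              # lets the agent spin sibling dockers
      volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock
  volumes:
    - name: docker-sock
      hostPath:
        path: /var/run/docker.sock
```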
WickedGoat98 Basically you have two options:
1. Build a docker image with wget installed, then in the UI specify this image as the "Base Docker Image".
2. Configure the trains.conf file on the machine running the trains-agent with the above script (see the sketch after this list). This will cause trains-agent to install wget on any container it runs, so it is available for you to use (saving you the trouble of building your own container).

With either of these two, by the time your code is executed, wget is installed and available.
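For option 2, the trains.conf piece could look like this (a sketch using the agent's extra_docker_shell_script option):
```
agent {
    # commands executed inside every container the agent spins up,
    # before the experiment itself starts
    extra_docker_shell_script: ["apt-get update", "apt-get install -y wget"]
}
```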
trains-agent should be deployed to GPU instances, not the trains-server.
The purpose of trains-agent is to let you send jobs to a GPU instance (at least in most cases).
The trains-server is the control plane: it basically tells the agent what to run (by storing the execution queue and tasks). Make sense?
WickedGoat98 sorry, I missed the thread...
"... that the trains.conf has to be located on the node running the trains-agent"
Correct 🙂
The easiest way to check is to see if you can curl to the ip:port from inside the docker.
If that fails, it is probably the wrong IP.
The IP you need to use is the IP of the machine running the docker-compose (not the IP of the docker inside that machine).
Make sense?