Is there anything specific about the logs we're looking for? Because if I just dump them, it will take me a while to verify that no sensitive data or naming is in there
but the task's pending status says it's in the queue
By the way, just inspecting: the CUDA version in the output of nvidia-smi matches the driver installed on the host, not the container - look at the image below
The latest - I curled the docker-compose like 10 minutes ago
BTW, is the if not cached_file: return cached_file legit or a bug?
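To illustrate why it reads like a bug to me - a toy sketch, not the actual trains code (the fetch function and the stand-in download step are made up): a cache lookup usually early-returns only on a hit, while "if not cached_file" early-returns exactly on the miss, handing the caller back None/"".

def fetch(url: str, cache: dict) -> str:
    cached_file = cache.get(url, "")       # "" means no cached copy yet
    if cached_file:                        # expected guard: return early only on a real hit
        return cached_file
    fresh_file = f"/tmp/{abs(hash(url))}"  # stand-in for the actual download step
    cache[url] = fresh_file
    return fresh_file

cache = {}
print(fetch("http://example.com/model.bin", cache))  # miss: "downloads" and caches
print(fetch("http://example.com/model.bin", cache))  # hit: returns the cached path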
I'm asking that because our DSes work on multiple projects, and they have only one trains.conf
file; I wouldn't want them to edit it each time they switch projects
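Just to make the scenario concrete - the per-project switch happens in code, roughly like the sketch below (project/task names are placeholders), and the hope is that the shared trains.conf can stay untouched:

from trains import Task

# placeholder project/task names - the per-project part lives in code,
# while ~/trains.conf stays one shared file
task = Task.init(project_name='project-a', task_name='experiment-1')
logger = task.get_logger()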
Yes, I have a metric I want to monitor so I will be able to sort my experiments by it. It is logged in this manner
logger.report_scalar(title='Mean Top 4 Accuracy', series=ARGS.model, iteration=0, value=results['top_4_acc'].mean())
When looking at my dashboard, this is how it looks
this is the selection from the column settings menu
I'm using iteration = 0 at the moment, and I "choose" the max and it shows as a column... But the column header is not the scalar name (because it truncates it and adds the > sign to signal max).
For the sake of comparing and sorting, it makes sense to log a scalar with a given name without the iteration dimension
doesn't contain the number 4
Moreover, I think I found a bug
Committing that notebook with changes solved it, but I wonder why it failed
From the examples I figured this would appear as a scatter plot with X and Y axes and only one point... Does it avoid that?
Could be. My point is that, in general, the ability to attach a named scalar (without an iteration/series dimension) to an experiment is valuable and basic when you want to track a metric across different experiments
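Something like this is what I have in mind - a hypothetical wrapper on top of today's report_scalar (report_single_value is a name I'm making up here, not an existing trains call); right now I emulate it by logging at iteration 0 so last/min/max all coincide:

from trains import Task

def report_single_value(logger, name, value):
    # made-up convenience wrapper: one named number per experiment,
    # emulated today as a scalar at iteration 0 with series == title
    logger.report_scalar(title=name, series=name, iteration=0, value=value)

task = Task.init(project_name='demo', task_name='single-value metric')
report_single_value(task.get_logger(), 'Mean Top 4 Accuracy', 0.87)  # example value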
That is not very informative
AgitatedDove14 all I did was create this metric as "last", then turn on "max" and "min", and then turn them off
I can't reproduce it now but:
I restarted the services and it didn't help. I deleted the columns and created them again after a while, and that helped
Oh, I get it - that also makes sense with the docs directing this at inference jobs and avoiding GPU, because of the 1-N thing
I'll just exclude .cfg files from the deletion. My question is how to recover - must I recreate the agents, or is there another way?
Maybe something similar to Docker, where I could name each of my trains agents and then refer to them by name, something like
trains-agent daemon --name agent_1 ...
Then trains-agent stop/start
I dealt with this earlier today: I had set up 2 agents, one for each GPU on a machine, and after editing the configuration I wanted to restart only one of them (because the other was busy working), and then I noticed I didn't know which one to kill
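Something like this would already help - a rough psutil sketch (psutil and the assumption that each daemon was started with its own --gpus argument are mine) to map agent PID to GPU before killing one:

import psutil

# list running trains-agent daemons and the GPU(s) each was started with,
# assuming they were launched like: trains-agent daemon --gpus 0 ...
for proc in psutil.process_iter(['pid', 'cmdline']):
    cmd = proc.info['cmdline'] or []
    if any('trains-agent' in part for part in cmd) and 'daemon' in cmd:
        if '--gpus' in cmd:
            idx = cmd.index('--gpus')
            gpus = cmd[idx + 1] if idx + 1 < len(cmd) else '?'
        else:
            gpus = '?'
        print(proc.info['pid'], 'gpus:', gpus)
# then kill the right PID instead of guessing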