Hi SubstantialElk6
The ClearML session ended up tunneling into the physical machine that my agent is running on,
Yes, that is the correct behavior. Basically, clearml-session uses the agent to "schedule" a machine, then spins up a container with JupyterLab/VSCode, and finally connects your CLI directly to that machine.
You can think of it as a way to solve the resource allocation problem.
Make sense?
WittyOwl57 could it be the EC2 instance is too small (i.e. not enough storage / memory)?
WittyOwl57 what about vm.max_map_count?
echo "vm.max_map_count=262144" > /tmp/99-clearml.conf
sudo mv /tmp/99-clearml.conf /etc/sysctl.d/99-clearml.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_linux_mac
Hi WittyOwl57
Are you starting a new server from scratch or is it running on previously stored data?
No, an old experiment changed, nothing was rerun
Ohh, that is strange. I think the max iteration value is stored in the DB, so it is odd that it changed after an update.
BTW: just making sure, could it be these Tasks were imported ? (i.e. offline execution + import)
and about a month later for some reason the initial iteration seems to have changed to 0
Hmm, I see your point. Just so I fully understand, you are not saying old experiments were changed, but that new experiments (running the same code-ish) have a totally different max iterations value. Is this correct?
this is not the case as all the scalars report the same iterations
MassiveHippopotamus56 could it be the machine statistics? (i.e. cpu/gpu etc., these are considered scalars as well...)
MassiveHippopotamus56
the "iteration" entry is actually the "max reported iteration over all graphs" per graph there is different max iteration. Make sense ?
JitteryCoyote63
Yes, this is extremely annoying. I think it was updated on the community server, let me check if we deployed a new docker with a fix...
JitteryCoyote63 while it's running, could you give me a few details on the setup, maybe I can reproduce it.
Is it using pytorch distributed ?
Are all models uploaded to S3 ?
etc.
"Updates a few seconds ago"
That just means that the process is not dead.
Yes, that seemed to be stuck.
Any chance you can verify with the RC version?
I'll try to dig into the commits, maybe I can come up with an explanation ...
(It would be nice to have all the Pypi releases tagged in github btw)
I wanted to say, we listen... and point to the tag, but for some reason it was not pushed LOL.
Also, I would upgrade the backend to 0.15.1; a few bugs were fixed since 0.14.x, some having to do with the plots...
Hi GrotesqueMonkey62 any chance you can be a bit more specific? Maybe a screen grab?
Here is how it works: if you look at an individual experiment, scalars are grouped by title (i.e. multiple series on the same graph if they have the same title).
When comparing experiments, any unique combination of title/series will get its own graph, then the different series on the graph are the experiments themselves.
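As a rough sketch (the names are made up) of what that grouping means in code:

from clearml import Task

# Hypothetical names, only to illustrate the grouping rules
task = Task.init(project_name="examples", task_name="scalar-grouping-demo")
logger = task.get_logger()

for i in range(10):
    # Same title ("loss"), two series: inside a single experiment both series
    # are drawn on the same graph.
    logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i)
    logger.report_scalar(title="loss", series="validation", value=1.5 / (i + 1), iteration=i)

# In the experiment comparison, each unique title/series combination
# ("loss/train", "loss/validation") becomes its own graph, and the series on
# that graph are the compared experiments themselves.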
Where do you think the problem lies?
What's the trains-server version ?
Are you sure you mean the trains-server and not the trains package (i.e. the backend)?
JitteryCoyote63 okay... but let me explain a bit so you get a better intuition for next time
The Task.init call, when running remotely, assumes the Task object already exists in the backend, so it ignores whatever was in the code and uses the data stored on the trains-server, similar to what's happening with Task.connect and the argparser.
This gives you the option of adding/changing the "output_uri" for any Task regardless of the code. In the Execution tab, change the "Output Destina...
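For example (a minimal sketch, the bucket and names are made up), the code can set a default and the server-side value wins when running remotely:

from clearml import Task

# Hypothetical project/bucket names, for illustration only
task = Task.init(
    project_name="examples",
    task_name="remote-output-demo",
    output_uri="s3://my-bucket/models",  # default destination for models/artifacts
)

# When an agent later executes this Task remotely, the Task already exists in
# the backend, so whatever "Output Destination" is stored on the server (e.g.
# edited in the Execution tab) overrides the value written here.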
JitteryCoyote63 with pleasure
BTW: the Ignite TrainsLogger will be fixed soon (I think the fix for the bug ElegantKangaroo44 found is already on a branch by SuccessfulKoala55). It should be in the RC next week.
Hi JitteryCoyote63 a few implementation details on the services-mode, because I'm not certain I understand the issue.
The docker-agent (running in services mode) will pick a Task from the services queue, then it will set up the docker for it, spin it up, and make sure the Task starts running inside the docker (once it is running inside the docker you will see the service Task registered as an additional node in the system, until the Task ends). Once that happens, the trains-agent will try to fetch the...
shows that the trains-agent is stuck running the first experiment, not
The trains_agent execute --full-monitoring --id a445e40b53c5417da1a6489aad616fee is the second trains-agent instance running inside the docker; if the task is aborted, this process should have quit...
Any suggestions on how I can reproduce it?
I'd prefer to use config_dict, I think it's cleaner
I'm definitely with you
Good news:
"new best_model is saved, add a tag best,"
Already supported (you just can't see the tag, but it is there :))
My question is, what do you think would be the easiest interface to tell (post/pre) store, tag/mark this model as best so far (btw, obviously if we know it's not good, why do we bother to store it in the first place...)
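Just to make the idea concrete, a rough sketch of one possible way to do it with OutputModel (the names and checkpoint path are made up, and this is not necessarily the final interface):

from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="best-model-demo")

# Hypothetical: register the current checkpoint as an output model and mark it
# as the best one so far with a "best" tag.
best_model = OutputModel(task=task, name="best_model", tags=["best"])
best_model.update_weights(weights_filename="checkpoints/best_model.pt")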
Hmm ElegantKangaroo44, low memory, that might explain the behavior.
BTW: 1==stop request, 3=Task Aborted/Failed
Which makes sense if it crashed on low memory...
From the top
1. trains-agent pulls a service Task
2. the Task is marked as running, and the trains-agent worker points to the Task
3. the docker is spun up
4. the environment is installed inside the docker (results are shown in the service Task Log)
5. the trains-agent inside the docker is launched, and a new node appears in the system (<host_agent_name>:service:<task_id>) with the service Task listed as running on it
6. the main trains-agent goes back to idle, and its worker now has no experiment listed as running
Where do you think it breaks?
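For reference, a service Task usually lands in that queue either from the UI or programmatically; a minimal sketch, assuming a queue named "services" and made-up project/task names:

from clearml import Task

# Hypothetical template task to run as a service
template = Task.get_task(project_name="DevOps", task_name="my-service")
service = Task.clone(source_task=template, name="my-service run")

# Push it onto the queue the services-mode agent is listening on
Task.enqueue(service, queue_name="services")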
ElegantKangaroo44 my bad, I missed the nuance in the description.
There seems to be an issue in the web ui -> viewing plots in "view in experiment table" doesn't respect the "scalars to display" one sets when viewing in "view in fullscreen".
Yes, the info-panel does not respect the full-view selection. It's on the to-do list to add this ability, but it is still not implemented...
Feel free to add to the UI request list:
https://github.com/allegroai/trains/issues/81
JitteryCoyote63
I agree that its name is not search-engine friendly,
LOL
It was an internal joke; the guys decided to call it "trains" because, you know, it trains...
It was unstoppable, we should probably do a line of merchandise with AI...
Anyhow, this one definitely backfired...