From the docs, I think what's going on is that https://opennmt.net/OpenNMT-tf/package/opennmt.Runner.html#opennmt.Runner.train is spinning up a new subprocess, and the training itself happens in that subprocess.
If this is the case, it would explain the lack of automagic, as the subprocess is missing the "Task.init" call
wdyt, could that be the case ?
GiddyTurkey39 can you ping the server-address (just making sure, this should be the IP of the server not 'localhost')
Okay that kind of makes sense, now my followup question is how are you using the ASG? I mean the clearml autoscaler does not use it, so I just wonder what the big picture is, before we solve this little annoyance 🙂
based on this one:
https://stackoverflow.com/questions/31436407/git-ls-remote-returns-fatal-no-remote-configured-to-list-refs-from
I think this is a specific issue with the local git repo configuration, can you verify?
(btw: I tested with git 2.17.1: git ls-remote --get-url returns the remote URL, without an error)
GiddyTurkey39
I would guess your VM cannot access the trains-server, meaning an actual network configuration issue.
What are the VM IP and the trains-server IP? (the first two numbers are enough, e.g. 10.1.X.Y 174.4.X.Y)
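If ping is blocked, a quick TCP check from the VM can tell you whether the server ports are reachable at all. This is only a minimal sketch: the helper names are made up for illustration, and the ports in the usage note (8080 web, 8008 API, 8081 files) assume a default trains-server setup, so adjust them if yours differs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False


def check_ports(host: str, ports: dict) -> dict:
    """Map each service name to whether its port is reachable on the host."""
    return {name: can_connect(host, port) for name, port in ports.items()}
```

Run it from the VM with the server IP (not 'localhost'), e.g. check_ports("<server-ip>", {"web": 8080, "api": 8008, "files": 8081}). If everything comes back False, it is most likely a network/firewall issue rather than a trains issue.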
Hmm I just tested on the community version and it seems to work there, let me check with the frontend guys. Can you verify it works for you on https://app.community.clear.ml/ ?
Hi JitteryCoyote63
Is it possible to rollback from 1.2.0 to 1.1.0?
Not really, there was a DB migration, so an out-of-the-box downgrade is not really supported.
That said, v1.3.1 is already out, with what seems like a fix:
As a quick fix, can you test with auto refresh (see top right button with the pause sign you have on the video)
It does not use key auth, instead sets up some weird password and then fails to auth:
AdventurousButterfly15 it SSHes into the container; inside the container it sets up a new daemon with a new, random, very long password
It will not SSH into the host machine (i.e. the agent needs to run in docker mode, not venv mode), makes sense ?
Hi SubstantialElk6
No need for that, you can use the helm chart (or spin them up once with kubectl), then they take care of scheduling by themselves.
You can also use the k8s glue (basically spinning kubernetes pods automatically for you, based on the Tasks that you push into the ClearML queue)
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
In short, two possible deployments
Static k8s pod running the agent (then the agent runs all the experiments inside t...
Hi MoodyCentipede68 , I think I saw something like it, can you post the full log? The triton error is above, also I think it restarted the container automatically and then it worked
Is there any way to debug these sessions through clearml? Thanks!
Yes this is a real problem, AWS does not make it easy to get the data...
Can you check the AWS console, see what you have there ?
In theory this should have worked.
Maybe you are missing some escaping for the "extra_vm_bash_script" ?
I'm hoping the console output will tell us
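On the escaping point: if the script text is assembled from Python before it goes into "extra_vm_bash_script" (that is an assumption about your setup, and export_line is just a hypothetical helper), the standard library's shlex.quote is a safe way to embed arbitrary values. A rough sketch:

```python
import shlex


def export_line(name: str, value: str) -> str:
    """Build a bash 'export' line with the value quoted so bash reads it literally."""
    return f"export {name}={shlex.quote(value)}"


# Hypothetical values, only to show that spaces, '$' and quotes survive intact.
script = "\n".join([
    export_line("MY_TOKEN", "a b$c"),
    export_line("MY_NOTE", "it's a $test"),
])
```

Without the quoting, bash would expand $c and $test and split on the spaces, which is the kind of thing that fails silently until you look at the console output.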
that might be it.
Is the web UI working properly ?
What ports are you using?
One suggestion is to make sure all agents have the same configuration. Another is to add pip into the "installed packages" section.
(Notice that in the next release we will specifically include it there, to avoid these kinds of scenarios)
Sounds great! I really like that approach, thanks GrotesqueDog77 !
Okay how do I reproduce it ?
With the warning ?
I was able to reproduce it on the old versions, but it seems fixed on the latest from GitHub.
sudo curl -L "https://github.com/docker/compose/releases/download/<VERSION>/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
UnevenOstrich23
but interesting that auto-reload config is not working as I expected.
Unfortunately the trains-agent does not support auto reloading the config file yet. If you think this will be a great feature, please feel free to open a GitHub feature request issue 🙂
It works. However, still, it sometimes takes a strangely long time for the agent to pick up the next task (or process it), even if it is only "Hello World".
The agent checks every 2/5 seconds if there is a new Task to be launched, could that be it?
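To put a number on it, here is a toy model of the polling delay (not the agent's actual code; it assumes the agent polls at fixed intervals starting at t = 0, and pickup_delay is a made-up name):

```python
import math


def pickup_delay(enqueue_time: float, poll_interval: float) -> float:
    """Seconds a task waits for the next poll, with polls at 0, interval, 2*interval, ..."""
    next_poll = math.ceil(enqueue_time / poll_interval) * poll_interval
    return next_poll - enqueue_time


# A task enqueued 1 second after a poll, with a 5 second interval, waits 4 seconds.
delay = pickup_delay(1.0, 5.0)
```

So polling by itself adds at most one interval. If you are seeing much longer than that, the extra time is probably spent elsewhere (e.g. setting up the environment for the task).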
Hi AdorableDeer85
I'm sorry I'm a bit confused here, any chance you can share the entire notebook ?
Also, any reason why this is pointing to "localhost" and not the IP/host of the clearml-server ? Is the agent running on the same machine ?
You can put a breakpoint here, and see what you are sending:
https://github.com/allegroai/trains/blob/17f7d51a93deb52a0e7d6cdd59da7038b0e2dd0a/trains/backend_api/session/session.py#L220
LethalCentipede31 sure:
task.upload_artifact(name, artifact_object=object_or_file)
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py