correct on both.
Notice that with upload you can specify any storage (S3/GS/Azure etc.)
From the docs I think what's going on is that https://opennmt.net/OpenNMT-tf/package/opennmt.Runner.html#opennmt.Runner.train is spinning up a new subprocess, and the training itself happens in the subprocess.
If this is the case, it would explain the lack of automagic, as the subprocess is missing the "Task.init" call.
Wdyt, could that be the case?
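If it is, a quick sanity check (just a sketch, assuming you can add a couple of lines to whatever script the subprocess actually runs; project/task names are placeholders) would be to call Task.init there too:

    # sketch -- project/task names are placeholders
    from clearml import Task  # "from trains import Task" on older versions

    # depending on how the subprocess is spawned this either attaches to the
    # parent task or creates a new one -- either way the automagic hooks load here
    task = Task.init(project_name="opennmt", task_name="subprocess-train")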
Hi CleanWhale17, at least for the moment the code, although open ( https://github.com/allegroai/trains-web ), has no external theme/customization interface.
That said, we do have some thoughts on it... What did you have in mind?
Thanks StrongHorse8
Where do you think would be a good place to put a more advanced setup? Maybe we should add an entry for DevOps? Wdyt?
SarcasticSparrow10 sure, see "execute_remotely", it does exactly that:
https://allegro.ai/docs/task.html#trains.task.Task.execute_remotely
It will stop the current process (after syncing everything) and launch itself remotely (i.e. enqueue itself)
When the same code is run by the "trains-agent", the execute_remotely call becomes a no-op and is basically skipped
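Roughly like this (a minimal sketch, the queue name is just an example):

    # minimal sketch -- the queue name is an example
    from trains import Task  # "from clearml import Task" on newer versions

    task = Task.init(project_name="examples", task_name="remote run")
    # stops the local process (after syncing) and enqueues this exact script;
    # when the agent later runs the script, this call is a no-op
    task.execute_remotely(queue_name="default", exit_process=True)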
Hi JealousParrot68
spinning up the clearml-agent with docker support (i.e. each experiment runs inside its own container):
https://clear.ml/docs/latest/docs/clearml_agent#docker-mode
Basically you can specify a default docker image to use (per agent) and a specific docker container to use per Task (configured in the UI under the Execution section, at the bottom)
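For example (a sketch, the queue name and image are placeholders):

    clearml-agent daemon --queue default --docker nvidia/cuda:11.7.1-runtime-ubuntu22.04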
So basically take the data from TensorBoard, read it, and report it to the cloud?
It seems like the configuration is cached somehow, even when you change the CLI parameters.
@<1523704461418041344:profile|EnormousCormorant39> nice!
Yes the configuration is cached so that after you set it once you can just call clearml-session again without all the arguments
What was the actual issue ? Should we add something to the printout?
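i.e. something along these lines (a sketch, the flags/values are just examples):

    # first run -- full configuration (cached afterwards)
    clearml-session --queue interactive --docker python:3.9
    # later runs -- reuse the cached configuration
    clearml-session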
The API server by default spins up multiple processes (they might all be busy at the time with a huge flood of requests, but it is still multi-process). Let me check if there is an easy way to set more processes
We're not using a load balancer at the moment.
The easiest way is to add an ELB and have Amazon terminate HTTPS on top (basically a few clicks in their console)
BoredGoat1
Hmm, that means it should have worked with Trains as well.
Could you run the attached script and see if it works?
Hi SuperiorDucks36
you have such a great and clear GUI
😊
I personally would love to do it with a CLI
Actually a lot of stuff is harder to get from the UI (like the current state of your local repository, etc.), but I think your point stands 🙂 We will start with the CLI, because it is faster to deploy/iterate, and then once you guys say it's a winner we will add a wizard in the UI.
What do you think?
then will clearml associate that image with my experiment and always use that image with it,
when you say "agent to use my docker image", I'm assuming you mean the configuration file or the --docker argument; in both cases this means the default container.
This means that if the Task does not specify a docker, it will use the one you set in the conf/argument, but Tasks can always specify a different docker to use, and the agent will pull the requested docker based on the Task's entry.
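You can also set it per Task from code, something like this (a minimal sketch, keyword names as in recent clearml versions; the image/arguments are examples):

    # minimal sketch -- image and arguments are examples
    from clearml import Task

    task = Task.init(project_name="examples", task_name="custom docker")
    # stored on the Task; the agent will pull and use this container instead of its default
    task.set_base_docker(
        docker_image="nvidia/cuda:11.7.1-runtime-ubuntu22.04",
        docker_arguments="--ipc=host",
    )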
Eve...
well I do not think you set your pytorch lightning to use cuda:
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/code/.venv/lib/python3.9/site-packages/lightning/pytorch/trainer/setup.py:176: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`.
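i.e. something like this, based on the warning above (a sketch):

    # based on the warning above -- tell Lightning to actually use the GPU
    from lightning.pytorch import Trainer

    trainer = Trainer(accelerator="gpu", devices=1)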
SoggyBeetle95 is this secret a per-Task secret, or is it for the agent itself (i.e. for all Tasks the agent will spin)?
Wait, who is creating this file? I thought you removed it in the uncommitted changes
You can always specify diff clearml.conf files with --config-file 🙂
Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the workers share a folder on the host machine?
It might be a syncing issue (not directly related to the trains-agent, but to the fact that you have 4 processes trying to access the same resource simultaneously)
BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support 🙂
ScantMoth28 it should work, I think the default deployment also has an NGINX reverse proxy on it, switching from " http://clearml-server.domain.com/api " to " http://api.clearml-server.domain.com "
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
Wait, are you using conda as the package manager?
EDIT: meaning configured in trains.conf as the package manager
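i.e. something like this in trains.conf / clearml.conf (a sketch of just the relevant section):

    agent {
        package_manager {
            # pip / conda / poetry
            type: conda
        }
    }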
Meaning if I create a sleep endpoint that is async
Hmm, are you calling "time.sleep" or "asyncio.sleep"?
Also, are you running the serving service with gunicorn or uvicorn?
see here:
None
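The distinction matters because time.sleep blocks the event loop while asyncio.sleep yields to it, e.g. (a minimal sketch of a FastAPI-style async endpoint, not the actual serving code):

    # minimal sketch -- a FastAPI-style async endpoint, not the actual serving code
    import asyncio
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/sleep")
    async def sleep_endpoint():
        # time.sleep(5) here would block the event loop and stall other requests
        await asyncio.sleep(5)  # yields control, other requests keep being served
        return {"done": True}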
Hi ElegantCoyote26
What's the docker / docker-compose version?
What's the OS?
LOL @<1545216070686609408:profile|EnthusiasticCow4>
I assume this is a hidden folder?
For example, datasets are hidden folders that can be viewed if you go to the settings page and turn on "show hidden folders"
Any chance you can test with the latest RC ? 1.8.4rc2
So the original looks good, could it be you tried to clone a Task that was executed with an agent with pip, and then pushed into an agent running conda?