ItchyJellyfish73
Unfortunately this needs backend support, and is only available in the enterprise version. What is your use case for it? (It was designed to allow out-of-the-box bare-metal multi-GPU dynamic allocation; think a DGX with 8 GPUs where, instead of spinning down agents when you want to change the queue->num-gpu mapping, you can do it on the fly)
server-->agent is fast, but agent-->server is slow.
Then multiple connections will not help; the bottleneck is the upload speed of your machine, regardless of what the target is (file-server, S3, etc...)
Hi @<1576381444509405184:profile|ManiacalLizard2>
If you make sure all server access is via a host name (i.e. instead of IP:port, use host_address:port), you should be able to replace it with a cloud host on the same port
ReassuredTiger98 yes this is exactly it 🙂
agent.package_manager.type will select whether the agent should use conda or pip to do the installation. Basically, if you develop on conda you should select conda.
The agent will first try to install packages using conda, then it will collect the missing packages and install them into the same environment using pip.
the hack doesn't work if conda is not installed
Of course conda needs to be installed, it is using a pre-existing conda env, no?! what am I missing
Ideally it would just pull an experiment from a dedicated HPO queue and run it inplace
And the assumption is the code is also there ?
However, SNPE performs quantization with precompiled CLI binary instead of python library (which also needs to be installed). What would be the pipeline in this case?
I would imagine a container with preinstalled SNPE compiler / quantizer, and a python script triggering the process ?
one more question: in case of triggering the quantization process, will it be considered as separate task?
I think this makes sense, since you probably want a container with the SNPE environment, m...
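A rough sketch of how such a step could look, assuming it runs inside a container that already ships the SNPE toolchain; the binary name, flags and file paths below are placeholders, not the real SNPE CLI:
import subprocess
from clearml import Task

# this step is assumed to run in a container that already has the SNPE binaries on PATH
task = Task.init(project_name="examples", task_name="snpe quantization step")

# placeholder command - swap in the actual SNPE quantizer binary and its flags
cmd = ["snpe-quantizer-placeholder", "--input", "model.dlc", "--output", "model_quantized.dlc"]
subprocess.run(cmd, check=True)

# register the produced file on the task, so it is tracked like any other artifact
task.upload_artifact(name="quantized_model", artifact_object="model_quantized.dlc")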
This will allow them to experiment outside of clearml and only switch to it when they are in an OK state. This will also help not to pollute clearml spaces with half-baked ideas
What's the value of running outside of an experiment management context? Don't you want to log it?
There is no real penalty here, no?!
ResponsiveHedgehong88 so I would suggest using execute_remotely in your code: basically you start locally, make sure everything is passed as intended, then from within the code you call task.execute_remotely(...)
which will stop the current process and enqueue the Task on the selected queue for the agent to execute.
https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/examples/advanced/execute_remotely_example.py#L127
This way you can both easily test...
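For reference, a minimal sketch of that flow (project, task and queue names here are just examples):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote execution test")

# everything up to this point runs locally, so you can verify all arguments are passed as intended
task.execute_remotely(queue_name="default", exit_process=True)

# this part only runs once an agent pulls the task from the "default" queue
print("running on the agent")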
WickedGoat98 Nice!!!
BTW: The fix should solve both (i.e. no need to manually cast), I'll make sure the fix is on GitHub so you'll be able to verify 🙂
Is task.parent something that could help?
Exactly 🙂 something like:
# my step is running here
the_pipeline_task = Task.get_task(task_id=task.parent)
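A slightly fuller sketch of the same idea, assuming the step was launched by a pipeline controller (the calls below just read information from the parent/pipeline task):
from clearml import Task

# my step is running here
task = Task.current_task()

# the controlling pipeline task is the step's parent
the_pipeline_task = Task.get_task(task_id=task.parent)

# for example, read pipeline-level information
print(the_pipeline_task.name)
print(the_pipeline_task.get_parameters())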
OddAlligator72 I like this idea.
The single thing I'm not sure about is the "function entry point"
Why would one do that? Meaning why wouldn't you have a proper python entry-point.
The reason I'm reluctant is that you might have calls/functions/variables in the global scope of the file storing the function, and then users will not know why something broke, and it will be very cumbersome to debug.
A simple script entry point seems trivial to launch and debug locally.
What do you think ? What woul...
Hi OutrageousGrasshopper93
which framework are you using? trains-agent will pull the correct torch based on the cuda version it detects, but there is no such thing for TF.
In the default venv mode, trains-agent creates a new venv for the experiment (not conda), and everything is installed there. If you need conda you need to change the package_manager to conda: https://github.com/allegroai/trains-agent/blob/de332b9e6b66a2e7c6736d12614de9870eff48bc/docs/trains.conf#L49
The safest way to control CUDA dri...
Yes, the same will work with artifacts, just pass the full url as the artifact_object
it should just register it as is.
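Something along these lines (the URL is a placeholder):
from clearml import Task

task = Task.init(project_name="examples", task_name="register external artifact")

# passing a full URL registers the artifact as-is, no re-upload of the file itself
task.upload_artifact(name="external_data", artifact_object="s3://my-bucket/path/to/data.csv")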
what do you say to me manually killing the services agent and launching one myself?
Makes sense 🙂
Hi SillyPuppy19
I think I lost you half way through.
I have a single script that launches training jobs for various models.
Is this like the automation example on the Github, i.e. cloning/enqueue experiments?
flag which is the model name, and dynamically loading the module to train it.
a Model has a UUID in the system as well, so you can use that instead of name (which is not unique), would that solve the problem?
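For example, something like this, using the model ID (UUID) from the UI instead of the non-unique name (the ID string is a placeholder):
from clearml import InputModel

# the model ID (UUID) is shown on the model's page in the UI
model = InputModel(model_id="aabbccddeeff00112233445566778899")

# fetch a local copy of the weights file
local_weights = model.get_local_copy()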
This didn't mesh well with Trains, because the project a...
Ohh then you do docker sibling:
Basically you map the docker socket into the agent's docker, which lets the agent launch another docker on the host machine.
You can see an example here:
https://github.com/allegroai/clearml-server/blob/6434f1028e6e7fd2479b22fe553f7bca3f8a716f/docker/docker-compose.yml#L144
but it is not optimal if one of the agents is only able to handle tasks of a single queue (e.g. if the second agent can only work on tasks of type B).
How so?
Hi @<1545216070686609408:profile|EnthusiasticCow4>
is there a way to get the date from the InputModel?
You should be able to with model._get_model_data()
But I think we should have it all exposed, wdyt?
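A rough sketch of what that could look like; note that _get_model_data() is an internal call, and the exact fields on the returned object (e.g. a creation timestamp) are an assumption here:
from clearml import InputModel

model = InputModel(model_id="aabbccddeeff00112233445566778899")  # placeholder ID

# internal call - returns the backend model object
model_data = model._get_model_data()

# assumption: the backend object exposes a creation timestamp field
print(getattr(model_data, "created", None))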
When I look at the model artifact details in the ClearML UI, it's been saved the usual way, and none of the tags I added in the OutputModel constructor are there.
Did you disable the autologging ? Are you saying the tags not appearing is a bug (it might be) ?
Also, I don't mind auto logging either if I have control over publishing the model or not directly from that script, and adding tags etc, like OutputModel.
Sure, you can publish models / add tags etc, either from the UI or pr...
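For example, a minimal sketch (names, tags and the weights file path are illustrative):
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="manual model logging")

# tags can be set directly in the OutputModel constructor
output_model = OutputModel(task=task, name="my_model", tags=["reviewed", "candidate"])

# register the weights file, then decide explicitly when to publish
output_model.update_weights(weights_filename="model.pt")
output_model.publish()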
Hi @<1526371965655322624:profile|NuttyCamel41>
I think that the only way to actually get a huge number of API calls is with a lot of machines.
For example, regardless of the amount of console logs you print, it will only be a single call, as these are packaged every 2-10 seconds. The same with metric reporting etc.
On the free tier you can already test the amount of API calls, I think the mechanism is exactly the same
fyi: I would put this question in the channel
ldconfig from /etc/profile which is put there by the interactive_session_task
LackadaisicalOtter14 are you sure ? maybe this is done as part of the installation the interactive session runs ?
Could that be the issue?
apt-get update && apt-get install -y openssh-server
Hi LackadaisicalOtter14
However, whenever we spin up a session, ... always gets run and overwrites our configs
what do you mean by that?
What configs are being overwritten? (generally speaking, it just adds the OS environment it needs for the setup process)
That would be great! Might have to use 2>/dev/null in some of my bash scripts
Feel free to test and PR :)
One other question regarding connecting. We have setup sshd inside the docker image we are using.
Actually the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port)
It actually runs an additional sshd inside the docker, setting its port.
And the clearml-session will ssh directly into the container sshd...
Hi LackadaisicalOtter14
Is it possible to remove this line to stop it from being executed
Everything is possible 🙂 I think the main question is why it is there (which, to the best of my understanding, is to solve for any cuda drivers and installed packages, meaning anything that is installed at runtime)
I think we can suppress the error, wdyt?
'echo "ldconfig" 2>/dev/null >> /etc/profile && '
logger.report_scalar("loss", "train", iteration=0, value=100)
logger.report_scalar("loss", "test", iteration=0, value=200)
Hi JitteryCoyote63
Is this close ?
https://github.com/allegroai/clearml/issues/283
This only talks about bug reporting and enhancement suggestions
I'll make sure this is fixed 🙂