Thanks BroadSeaturtle49
I think I was able to locate the issue: `!=` breaks the pytorch lookup.
I will make sure we fix asap and release an RC.
BTW: how come 0.13.x has No linux x64 support? And the same for 0.12.x
https://download.pytorch.org/whl/cu111/torch_stable.html
Out of interest, is there a reason these are read-only?
Yes, we should probably change that... they are designed to be pre-populated, but there should not be any reason you could not remove them
The code for these tasks is on github right?
Correct
Hmm I assume it is not running from the code directory...
(I'm still amazed it worked the first time)
Are you actually using "." ?
Thanks JitteryCoyote63 !
Any chance you want to open a github issue with the exact details, or fix it with a PR ?
(I just want to make sure we fix it as soon as we can 🙂 )
Any reason not to do so in the conf file ?
Hi GrotesqueOctopus42 ,
BTW: is it better to post the long error message in a reply to avoid polluting the channel?
Yes, that is appreciated 🙂
Basically logs in the thread of the initial message.
To fix this I had to spin up the agent using the --cpu-only flag (--docker --cpu-only)
Yes, if you do not specify --cpu-only it will default to trying to access GPUs
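For example, a docker-mode agent forced to CPU only (the queue name is just a placeholder):
```bash
# example only: agent in docker mode without GPU access
clearml-agent daemon --queue default --docker --cpu-only
```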
Nice!
Instead you can do: `TRAINS_WORKER_NAME="trains-agent:$DYNAMIC_INSTANCE_ID"`
Then the Worker ID will have the running instance appended to the worker name. This means that even if you spin up two instances with the same base worker name, you will not have two agents registering under the same name.
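A hedged sketch of what that could look like in the instance startup script (the metadata endpoint is just an illustration of where $DYNAMIC_INSTANCE_ID could come from, and the queue name is a placeholder):
```bash
# example only: derive the dynamic suffix from the cloud instance ID
DYNAMIC_INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
export TRAINS_WORKER_NAME="trains-agent:${DYNAMIC_INSTANCE_ID}"
trains-agent daemon --queue default
```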
I execute the `clearml-session` with the `--docker` flag.
This is to control the docker image the agent will spin up for you (think of the dev environment you want to work in, like the nvidia pytorch container that already has everything you need)
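For example (the image name here is just an illustration of a pre-built dev environment):
```bash
# example only: spin the session inside an nvidia pytorch container
clearml-session --docker nvcr.io/nvidia/pytorch:22.12-py3
```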
As we use a custom CUDA image, we do not want this running on user login and producing ugly error messages about missing symlinks.
You can customize the startup bash script (running inside Any container) here:
https://github.com/allegroai/clearml-agent/blob/bf07b7f76d3236c1118b81730c6d9718705a795a/docs/clearml.conf#L145
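A minimal sketch of that conf section, assuming the linked key is `agent.extra_docker_shell_script` (the commands are placeholders):
```
agent {
    # bash commands executed at the start of any container the agent spins up
    extra_docker_shell_script: ["apt-get install -y vim", "echo container ready"]
}
```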
LackadaisicalOtter14 Would that help?
Hmm could it be this is on the "helper functions" ?
WackyRabbit7 I do 'pkill -f trains' but it's the same... If you need to debug and test, run with --foreground and just hit Ctrl-C to end the process (it will never switch to background...), for example as sketched below. Does that help?
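A minimal example (the queue name is a placeholder):
```bash
# run the agent attached to the terminal; Ctrl-C stops it cleanly
trains-agent daemon --queue default --foreground
```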
LOL my pleasure - I guess we should have a link in the docstring of `add_requirements` to `set_packages`, I will tell the guys
Change to `add_missing_installed_packages=False` here, and see if you end up with a git diff:
https://github.com/allegroai/clearml/blob/1f82b0c4010799be6157f5c845c7f6ac48e71c0c/clearml/backend_interface/task/populate.py#L158
WackyRabbit7 If you have an idea on an interface to shut it down, please feel free to suggest?
Unfortunately this sounds like a classic case of RBAC (role based access control), and only the enterprise version has that feature (I think there is a contact us button on the website for those queries).
The easiest way to support the use case you describe is to share on a Task level 😞
I'm just curious how the trains server on different nodes communicates about the task queue
We start manually: we tell the agent to just execute the task (notice we never enqueued it); if all goes well we will get to the multi-node part 🙂
And do you need to run your code inside a docker, or is venv enough ?
(sure, we can try, conda is sometimes flaky but it is supported)
1. specify conda as the package manager (see the conf sketch after these steps): https://github.com/allegroai/trains-agent/blob/9a3f950ac689c50ba3415c42749a4bd8059e89a7/docs/trains.conf#L49
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right click on the experiment and clone it. Then press on the ID button next to the experiment name, and copy the task ID
4. ssh to each node and run:
` trains-agent execute --id <...
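For step 1, a minimal sketch of the relevant section in trains.conf (assuming the default layout the link above points to):
```
agent {
    package_manager {
        # switch from the default "pip" to "conda"
        type: conda
    }
}
```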
I think it can only run on multiple GPUs on one node
Okay, the first step is to make sure your code is multi-node enabled, there is no magic for that 🙂
WittyOwl57 what about vm.max_map_count?
echo "vm.max_map_count=262144" > /tmp/99-clearml.conf
sudo mv /tmp/99-clearml.conf /etc/sysctl.d/99-clearml.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_linux_mac
Hi CheekyAnt38
However, now I would like to evaluate my machine learning model via API requests, directly over clearml. Is that possible?
This basically means serving the model, is this what you mean?
Hi RipeGoose2
So the http://app.community.clear.ml already contains it.
Next release of the standalone server (a.k.a clearml-server) will include it as well.
I think the ETA is end of the year (i.e. 2 weeks), but I'm not sure on the exact timeframe.
Sounds good ?
Should work with report_surface. Notice that this is not triangles; the assumption is that this is a fixed sampling of the surface: the sample size is the numpy matrix shape, and the sample value (i.e. Z) is the value in the matrix. This means that if you have a set of mesh triangles, you have to project and sample them.
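For context, a minimal sketch of reporting such a fixed-grid sample (project, task and series names are placeholders):
```python
import numpy as np
from clearml import Task

# placeholder project/task names
task = Task.init(project_name="examples", task_name="surface report")

# each cell is the sampled Z value on a fixed XY grid (50x50 samples)
z = np.random.rand(50, 50)
task.get_logger().report_surface(
    title="sampled surface", series="demo", iteration=0, matrix=z
)
```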
I think this is what you are after https://trimsh.org/trimesh.voxel.base.html?highlight=matrix#trimesh.voxel.base.VoxelGrid.matrix
Thanks!
Hmm from here: None
Could it be you do not have privileges to the resource, or that you did not provide credentials ?
Did that autoscaler work before ?
Oh I see, that kind of makes sense
I think this is the section you should use:
None
But instead of the clearml-services container you should use the regular container (or just have it installed as part of the entry-point on any ubuntu based container)
Notice the important parts here are:
[None](https://github.com/allegroai/clearml-server/blob/6a1fc04d1e8b112fb334c8743d...
Hi @<1523702932069945344:profile|CheerfulGorilla72>
the agent Always inherits from the docker's system-installed environment
If you have a custom venv inside the docker that is Not activated by default, you can set the agent to use it:
None
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL
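e.g. something along these lines (the venv path is hypothetical):
```bash
# assumption: point the agent at the python binary of the venv baked into the image
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/opt/myvenv/bin/python
```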
But it does make me think, if instead of changing the optimizer I launch a few workers that "pull" enqueued tasks, and then report values for them in such a way that the optimizer is triggered to collect the results? would it be possible?
But this is Exactly how the optimizer works.
Regardless of the optimizer (OptimizerOptuna or OptimizerBOHB), both set the next step based on the scalars reported by the tasks executed by agents (on remote machines), then decide on the next set of para...
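For reference, a minimal sketch of that flow (the base task ID, metric names, queue and parameter range are all placeholders):
```python
from clearml import Task
from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

# the controller task that runs the optimization loop
task = Task.init(project_name="examples", task_name="HPO controller",
                 task_type=Task.TaskTypes.optimization)

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",  # template task the agents will clone and execute
    hyper_parameters=[UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1)],
    objective_metric_title="validation",  # scalar the executed tasks report
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=OptimizerOptuna,
    execution_queue="default",            # agents pulling this queue run the trials
    max_number_of_concurrent_tasks=2,
)
optimizer.start()  # next parameter sets are chosen from the scalars the trials report
optimizer.wait()
optimizer.stop()
```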