Manually I was installing the leap package through "python -m pip install ." when building the docker container.
NaughtyFish36 what happens if you add /opt/keras-hannd to your "installed packages"? This should translate to "pip install /opt/keras-hannd", which seems like exactly what you want, no?
So could it be that "pip install --no-deps ." is the missing piece?
What happens if you add "/opt/keras-hannd" to the installed packages?
Hmm CourageousLizard33 seems you stumbled on a weird bug,
This piece of code only tries to get the username for the current UID, but since you are running inside a docker container and probably set the UID via an environment variable, there is no "actual" entry for that UID in /etc/passwd, so it cannot be resolved.
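For reference, this is roughly the failure mode and a guarded workaround (a minimal sketch, assuming the lookup goes through Python's pwd module; the helper name is just for illustration):

import os
import pwd

def get_current_username():
    # pwd.getpwuid() raises KeyError when the UID has no entry in /etc/passwd,
    # which is typical for docker containers started with an arbitrary --user UID.
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        # Fall back to environment hints, then to the raw UID as a string.
        return os.environ.get("USER") or os.environ.get("LOGNAME") or str(os.getuid())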
I'm attaching a quick fix, please let me know if it solved the problem.
I'd like to make sure we have it in the next RC as soon as possible.
Yep 🙂 but only in RC (or github)
function and just seem to be getting an "IsADirectoryError"?
Can you post here what you are getting? Which clearml version are you using?
also tried manually adding leap==0.4.1 in the task UI which didn't work.
That has to work; if it did not, can you send the log for the failed Task (or the Task that did not install it)?
The environment in the logs does show that leap is being installed potentially from a cache?
- leap @ file:///opt/keras-hannd...
containing the Extension module
Not sure I follow, what is the Extension module? What were you running manually that is not just "pip install /opt/keras-hannd"?
😞 CooperativeFox72 please see if you can send a code snippet to reproduce the issue. I'd be happy to solve it ...
If this is the case, then you have to set up a shared PV for the pods; this way they can actually have a persistent cache, which would also be shared.
BTW: a single function call might not be a perfect match for a pipeline component; the overhead of starting a node might not be negligible, as it needs to install the required python packages, bring the code, etc.
CooperativeFox72 we are aware of Pool throwing an exception that causes things to hang. The fix will be deployed in 0.16 (due to be released tomorrow).
Do you have code to reproduce it, so I can verify the fix solves the issue?
Hi CooperativeFox72 trains 0.16 is out, did it solve this issue? (btw: you can upgrade trains to 0.16 without upgrading the trains-server)
CooperativeFox72 can you start by checking the latest RC? :) pip install trains==0.15.2rc0
CooperativeFox72 this is indeed sad news 😞
When you have the time, please see if you can send a code snippet to reproduce the issue. I'd like to have it fixed
Thanks CooperativeFox72 ! I'll test and keep you posted 🙂
I use Yaml config for data and model. Each of them would be a nested yaml (could be more than 2 layers), so it won't be a flexible solution and I would need to manually flatten the dictionary
Yes, you are correct, the recommended option would be to store it with task.connect_configuration
Its goal is to store exactly these types of configuration files/objects.
You can also store the yaml file itself directly, just pass a Path object instead of a dict/string
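A minimal sketch of both options (the project/task names and file paths below are just placeholders):

from pathlib import Path
from clearml import Task

task = Task.init(project_name="examples", task_name="config demo")

# Store a nested dict as a configuration object (no flattening needed)
model_cfg = {"backbone": {"name": "resnet50", "layers": 50}, "head": {"dropout": 0.1}}
task.connect_configuration(model_cfg, name="model")

# Or store the yaml file itself by passing a Path instead of a dict/string
task.connect_configuration(Path("configs/data.yaml"), name="data")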
Yey! BTW: what's the setup you are running it with? Does it include "manual" tasks? Do you also report on completed experiments (not just failed ones)? Do you filter by iteration numbers?
Yes EnviousStarfish54, the comparison is line by line and compared only to the left experiment (like any multi comparison, you have to set the baseline, which is always the left column here; do notice you can reorder the columns and the comparison will be updated)
If this is a simple two-level nesting:
You can use the section name:
task.connect(param['data'], name='data')
task.connect(param['model'], name='model')
Would that help?
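Something along these lines, as a rough sketch (project/task names and the parameter values are just placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="nested config")

param = {
    "data": {"path": "/data/train", "batch_size": 32},
    "model": {"layers": 4, "dropout": 0.1},
}
# Connect each top-level section separately so the keys show up under "data" / "model"
task.connect(param["data"], name="data")
task.connect(param["model"], name="model")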
The comparison reflects the way the data is stored, in the configuration context. That means section name & key value (which is what the code above does)
Hi EnviousStarfish54
I think this is what you are after
task.connect_configuration(my_dict_here, name='my_section_name')
BTW:
if you do task.connect(a_flat_dict, name='new section') you will have the key/value pairs in a section called "new section"
When exactly are you getting this error?
Hi IntriguedRat44
Sorry, I missed this message...
I'm assuming you are running in manual mode (i.e. not through the agent); in that case we do not change the CUDA_VISIBLE_DEVICES.
What do you see in the resource monitoring? Is it a single GPU or multiple GPUs?
(Check the :monitor:gpu in the Scalar tab under results.)
Also, what's the Trains/ClearML version you are using, and the OS?
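If you do want to pin a single GPU yourself in manual mode, a minimal sketch (assuming you set the variable before the framework initializes; the GPU index is just an example):

import os

# Must be set before torch / tensorflow are imported, otherwise it has no effect
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0 to this process

import torch
print(torch.cuda.device_count())  # should now report 1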
How about this one:
None
wdym 'executed on different machines'? The assumption is that you have machines (i.e. clearml-agents) connected to clearml, which would be running all the different components of the pipeline. Think out-of-the-box scale-up. Each component will become a standalone Job and the data will be passed (i.e. stored and loaded) automatically on the clearml-server (can be configured to be external object storage as well). This means if you have a step that needs a GPU it will be launched on a GPU machine...
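As a rough sketch of what that looks like with the decorator-based pipeline API (queue names, project, and URLs below are placeholders):

from clearml import PipelineDecorator

# Each component becomes a standalone Task; return values are stored as artifacts
# and loaded automatically by the next step.
@PipelineDecorator.component(return_values=["dataset"], execution_queue="cpu-queue")
def prepare_data(source_url):
    # ... download / preprocess ...
    return {"url": source_url, "rows": 1000}

@PipelineDecorator.component(return_values=["model_path"], execution_queue="gpu-queue")
def train(dataset):
    # This step is routed to a GPU machine via its execution queue
    return "model.pkl"

@PipelineDecorator.pipeline(name="demo pipeline", project="examples", version="0.1")
def run_pipeline(source_url="https://example.com/data.csv"):
    dataset = prepare_data(source_url)
    return train(dataset)

if __name__ == "__main__":
    # PipelineDecorator.run_locally()  # uncomment to debug everything in the local process
    run_pipeline()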
CourageousLizard33 VM?! I thought we were talking about a fresh install on Ubuntu 18.04?!
Is the Ubuntu in a VM? If so, I'm pretty sure 8GB will do, maybe less, but I haven't checked.
How much did you end up giving it?
CourageousLizard33 Are you using the docker-compose to setup the trains-server?
(Venv mode makes sense if running inside a container; if you need docker support you will need to mount the docker socket inside)
What exactly is the error you're getting from clearml? And what do you have in the configuration file?
SmallBluewhale13
And Task.init registers 0.17.2, even though it prints (while running the same code from the same venv) 0.17.2?
Hmmm, are you running inside PyCharm, or similar?
None
No, they are not; they are taking the vscode backend and putting it behind a webserver-ish layer