I think it is to do with the
build-essential issue. Let me talk you through the process:
1. Run a docker image locally called keras-hannd-cml (i.e. the one that is then used by the agent as the base image later on).
2. Run the training script to register the task, which works fine; all dependencies work, i.e. the C++ packages are working correctly in that container.
3. Execute the task on an agent running in docker mode with the same image that the task was registered with, i.e. keras-hannd-cml. The task fails since it's missing the C++ module somehow.
I've sent you the most recent logs. Can you see anything incorrect with the above work process?
The point is, "leap" is properly installed, this is the main issue. And although it is installed, it is missing the ".so"? What am I missing? What are you doing manually that does not show in the log?
In other words, how did you install it manually inside the docker when you mentioned it worked for you when running without the agent?
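One way to check whether the compiled extension is actually present in the interpreter the agent uses is to ask the import system where it would load a module from. The following is a minimal sketch using only the standard library; `describe_module` is a hypothetical helper name, and the module name in the example call is the one from the error reported in this thread.

```python
import importlib.util
from importlib.machinery import EXTENSION_SUFFIXES


def describe_module(name):
    """Report whether `name` can be found and whether it is a compiled extension."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # A missing parent package raises instead of returning None.
        spec = None
    if spec is None:
        return {"name": name, "found": False, "origin": None, "is_extension": False}
    origin = spec.origin
    is_extension = bool(origin) and origin.endswith(tuple(EXTENSION_SUFFIXES))
    return {"name": name, "found": True, "origin": origin, "is_extension": is_extension}


if __name__ == "__main__":
    # Module name taken from the ModuleNotFoundError in the task log.
    print(describe_module("leap.learn.data_tools.merge_data.merge_data"))
```

Running this both in the plain container and inside the task on the agent should show whether the two interpreters resolve the module to the same file, or to nothing at all.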
AgitatedDove14 DM'd you the log file for the failed task. I have tried using a task startup script to install g++, gcc etc. but it didn't seem to work, I'll try
build-essential too. I'm also interested in the way that the environments are set up in clearml, I read in the docs that the task looks for a
requirements.txt file to construct the env, but does this prevent a local package being built correctly i.e. through
setup.py when running a remote task?
C++ module fails to import, anyone have any insight? The required C++ compilers seem to be installed on the docker container.
Can you provide the log for the failed Task?
BTW: if you need
build-essential you can add it to the Task startup script
apt-get update && apt-get install -y build-essential
I've tried the add_requirements("leap") function and just seem to be getting an "isadirectory" error?
Can you post here what you are getting? Which clearml version are you using?
I also tried manually adding leap==0.4.1 in the task UI which didn't work.
That has to work, if it did not, can you send the log for the failed Task (or the Task that did not install it)?
The environment in the logs does show that leap is being installed potentially from a cache?
- leap @ file:///opt/keras-hannd
This is true, I have double-checked your logs and you are correct, it seems to be installed.
So I do not get how come you get:
ModuleNotFoundError: No module named 'leap.learn.data_tools.merge_data.merge_data'
Could it be you are installing the wrong version? Or maybe the wrong package?
Is this the leap you need? Where do you install it from?
Lastly, does this still relate to the "build-essential" issue? It seems that we are talking about a whole different issue?!
AgitatedDove14 Yeah I added it into the initial bash script to test whether that would fix the issue. The task is created using the SDK in the model training script i.e.
Task.init() . I was under the impression the local package would be installed due to replication of the environment I initialised the task under, however I've tried the
add_requirements("leap") function and just seem to be getting an "isadirectory" error? I also tried manually adding
leap==0.4.1 in the task UI which didn't work. The environment in the logs does show that leap is being installed potentially from a cache?
- leap @ file:///opt/keras-hannd
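The `name @ url` form in that log line looks like the direct-reference syntax for requirements (PEP 508), which would point at a local source directory rather than at a pip cache. Below is a small standard-library sketch that picks such entries out of an installed-packages list; `parse_direct_reference` is a hypothetical helper, not part of clearml or pip.

```python
from urllib.parse import urlparse


def parse_direct_reference(line):
    """Split a 'name @ url' requirement line; return None for ordinary pins."""
    text = line.strip().lstrip("-").strip()
    if " @ " not in text:
        return None
    name, url = (part.strip() for part in text.split(" @ ", 1))
    parsed = urlparse(url)
    return {
        "name": name,
        "url": url,
        # file:// URLs refer to a path on the machine doing the install.
        "is_local_path": parsed.scheme == "file",
        "path": parsed.path if parsed.scheme == "file" else None,
    }


if __name__ == "__main__":
    print(parse_direct_reference("- leap @ file:///opt/keras-hannd"))
    print(parse_direct_reference("leap==0.4.1"))
```

If the entry is a local path, it can only be installed where that same directory exists, so it is worth confirming that /opt/keras-hannd is present inside the container the agent starts.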
AgitatedDove14, sorry, what .so are you referring to here? I can't see that in the logs. The docker image installs the package by first installing the requirements, i.e.
RUN pip install --no-cache-dir -r /tmp/requirements.txt
then the repo is copied locally, and then leap is installed through
RUN cd /opt/keras-hannd && pip install --no-deps .
We have the following
ext_modules=[
    Extension(
        'leap.learn.data_tools.file_io.extio',
        sources=['leap/learn/data_tools/file_io/extio.cpp'],
        depends=['leap/learn/data_tools/file_io/samples.h'],
        define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_9_API_VERSION')],
        extra_compile_args=['-std=c++11'],
        libraries=['rt'] if platform.system() == 'Linux' else [],
        include_dirs=[GetNumpyIncludeDirectoryLazy()],
        optional=True
    ),
in our
setup.py which I believe isn't being built correctly when the task is running on the agent.
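Since the Extension above is declared with optional=True, a failed compile is skipped rather than aborting the build, so a wheel can be produced without the compiled module inside it. One way to check is to list the compiled files in the built wheel. This is a minimal standard-library sketch; `compiled_members` is a hypothetical helper, and which wheel file to inspect is left to the reader.

```python
import zipfile

# Suffixes of compiled extension modules on Linux/macOS (.so) and Windows (.pyd).
COMPILED_SUFFIXES = (".so", ".pyd")


def compiled_members(wheel_path):
    """Return the names of compiled extension files inside a wheel (a zip archive)."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name for name in wheel.namelist() if name.endswith(COMPILED_SUFFIXES)
        )
```

Calling `compiled_members` on the leap wheel built inside the agent's container and getting an empty list would suggest the C++ sources were not compiled there, for example because a compiler or header was missing at that point.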
Manually I was installing the
leap package through
python -m pip install . when building the docker container. My thinking was that when the task's environment was then replicated on the agent, the
leap package would be installed correctly through its
setup.py with the
Extension which I've listed above.
NaughtyFish36 what happens if you add to your "installed packages"
/opt/keras-hannd ? This should translate to "pip install /opt/keras-hannd" which seems like exactly what you want, no?
So I see this in the build, which means it works and compiles. What is missing?
` Building wheels for collected packages: leap
Building wheel for leap (setup.py) ... done
Created wheel for leap: filename=leap-0.4.1-cp38-cp38-linux_x86_64.whl size=1052746 sha256=1dcffa8da97522b2611f7b3e18ef4847f8938610180132a75fd9369f7cbcf0b6
Stored in directory: /root/.cache/pip/wheels/b4/0c/2c/37102da47f10c22620075914c8bb4a9a2b1f858263021ca437
Successfully built leap
Installing collected packages: leap
Attempting uninstall: leap
Found existing installation: leap 0.4.1
Not uninstalling leap at /usr/local/lib/python3.8/dist-packages, outside environment /root/.clearml/venvs-builds/3.8
Can't uninstall 'leap'. No files were found to uninstall.
Successfully installed leap-0.4.1 `
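The log above mentions two places: an existing leap under /usr/local/lib/python3.8/dist-packages and the agent's environment under /root/.clearml/venvs-builds/3.8. To see which copy the running interpreter actually picks up, something like the following could be run at the start of the task. It is a minimal standard-library sketch (Python 3.8+); `where_installed` is a hypothetical helper.

```python
import sys
from importlib import metadata


def where_installed(dist_name):
    """Return (version, location) of an installed distribution, or None if absent."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # locate_file("") resolves to the directory that holds the distribution's files.
    return dist.version, str(dist.locate_file(""))


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("leap:", where_installed("leap"))
```

If the reported location differs between the plain container and the agent run, the two are importing different copies of leap, and only one of them may contain the compiled modules.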
AgitatedDove14 The issue seems to be that the
setup.py containing the
Extension module we need isn't being run in the clearml virtual environment within the docker container. What is the correct process for installing local packages so they're replicated correctly when running remotely on an agent?
No module named 'leap.learn.data_tools.merge_data.merge_data'
This seems to be the error, but I cannot see
leap in the installed packages. Notice that if the Task has an "Installed Packages" section then the agent will use that, not the "requirements.txt"; only if this section is empty will it revert to the "requirements.txt" in the repo.
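The rule described above can be written down as a small decision function. This is only an illustration of the behaviour as stated in this thread, not clearml-agent's actual code; `pick_requirements` and its arguments are hypothetical.

```python
def pick_requirements(installed_packages, repo_requirements):
    """Mimic the rule stated above: a non-empty "Installed Packages" section wins,
    and the repo's requirements.txt is only used when that section is empty."""
    task_lines = [line.strip() for line in installed_packages if line.strip()]
    if task_lines:
        return "installed_packages", task_lines
    repo_lines = [line.strip() for line in repo_requirements if line.strip()]
    return "requirements.txt", repo_lines


if __name__ == "__main__":
    # Hypothetical inputs; numpy is just a placeholder entry.
    print(pick_requirements(["numpy"], ["leap==0.4.1"]))
    print(pick_requirements([], ["leap==0.4.1"]))
```

Under this rule, a leap entry that exists only in requirements.txt is never consulted once the Task has any "Installed Packages" at all, which would be consistent with leap not showing up in that section here.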
How did you create the Task in the first place?
I see that you added "leap" into the initial bash script; actually you should add it into the requirements with
Task.add_requirements("leap") task = Task.add_requirements