This is the installation for a locally used package in the task FYI, so it's imported from the training script.
Seems like this was a hidden SSH key error that wasn't being revealed; it was using a cached repo rather than cloning the remote repo.
AgitatedDove14 Yeah, I added it into the initial bash script to test whether that would fix the issue. The task is created using the SDK in the model training script, i.e. `Task.init()`. I was under the impression the local package would be installed due to replication of the environment I initialised the task under; however, I've tried the `add_requirements("leap")` function and just seem to be getting an `IsADirectoryError`. I also tried manually adding `leap==0.4.1` in the task...
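For anyone hitting the same thing, here's a minimal sketch of the call order that `Task.add_requirements()` expects — it must run *before* `Task.init()` so the requirement lands in the task's recorded package list. The package name and version (`leap`, `0.4.1`) come from this thread; the import guard is only there so the snippet loads cleanly where clearml isn't installed:

```python
# Sketch only, not the thread author's exact code.
try:
    from clearml import Task

    # Record the pinned requirement; this must happen BEFORE Task.init()
    # is called in the training script so the agent installs it remotely.
    # Pinning name+version (rather than passing a directory path) is one
    # plausible way to avoid an IsADirectoryError.
    Task.add_requirements("leap", package_version="0.4.1")
except ImportError:
    Task = None  # clearml not available in this environment
```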
AgitatedDove14 FYI, I do install build-essential manually in the logs I just sent you, and it still fails.
Installing build-essential at startup didn't work either, unfortunately.
AgitatedDove14 Unfortunately that didn't work either. I agree that should run the setup.py correctly, but something still seems to be breaking; I've sent you the most recent logs. I think it is to do with the build-essential issue. Let me talk you through the process:
1. Run a docker image locally called keras-hannd-cml (i.e. the one that is then used by the agent as the base image later on).
2. Run the training script to register the task, which works fine; all dependencies work, i.e. the C++ packages are working correctly on that container.
3. Execute the task on an agent running in docker mode with the same image that the task was registered with, i.e. keras-hannd-...
We have

```python
ext_modules=[
    Extension(
        'leap.learn.data_tools.file_io.extio',
        sources=['leap/learn/data_tools/file_io/extio.cpp'],
        depends=['leap/learn/data_tools/file_io/samples.h'],
        define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_9_API_VERSION')],
        extra_compile_args=['-std=c++11'],
        libraries=['rt'] if platform.system() == 'Linux' else [],
        include_dirs=[GetNumpyIncludeDirectoryLazy()],
        optional=True
    ),
]
```

in our setup.py, which I belie...
AgitatedDove14 DM'd you the log file for the failed task. I have tried using a task startup script to install g++, gcc, etc. but it didn't seem to work; I'll try build-essential too. I'm also interested in the way that the environments are set up in ClearML. I read in the docs that the task looks for a requirements.txt file to construct the env, but does this prevent a local package being built correctly, i.e. through setup.py, when running a remote task?
Update on this: it seems to be an error in our code which isn't being appropriately raised, by the looks of things! I'll dig into it further, but for now this can be left. Thanks for replying!
and it's clearml version 1.7.2
AgitatedDove14 Sorry, what .so are you referring to here? I can't see that in the logs. The docker image installs the package by first installing requirements, i.e. `RUN pip install --no-cache-dir -r /tmp/requirements.txt`; the repo is copied locally, and then leap is installed through `RUN cd /opt/keras-hannd && pip install --no-deps .`
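Putting the steps just described together, the image build presumably looks something like this sketch (the base image, requirements path, and `/opt/keras-hannd` path are taken from the thread; everything else is an assumption):

```dockerfile
# Sketch of the build described above, not the actual Dockerfile.
FROM python:3.9-slim

# build-essential provides the gcc/g++ toolchain needed to compile the
# C++ extension declared in setup.py
RUN apt-get update && apt-get install -y --no-install-recommends build-essential

# install pinned dependencies first (path from the thread)
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# copy the repo and install the local package, running its setup.py
COPY . /opt/keras-hannd
RUN cd /opt/keras-hannd && pip install --no-deps .
```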
Solved this, but going to leave it up in case it's useful to anyone: just used the pod template in the values.yaml for the clearml-agent helm chart to mount the hostPath as a volume mount, i.e.:
```yaml
podTemplate:
  # -- volumes definition for pods spawned to consume ClearML Task (example in values.yaml comments)
  volumes:
    - name: x11-host-dir
      hostPath:
        path: /tmp/.X11-unix
  volumeMounts:
    - name: x11-host-dir
      mountPath: '/tmp/.X11-unix'
```
AgitatedDove14 The issue seems to be that the setup.py containing the `Extension` module we need isn't being run in the ClearML virtual environment within the docker container. What is the correct process for installing local packages so they're replicated correctly when running remotely on an agent?
Hmm, yeah, I have monitored some of the resource metrics and it didn't seem to be an issue. I'll attempt to install Prometheus/Grafana. This is a PoC, however, so I was hoping not to have to install too many tools.
The code running is basically this:
```python
if __name__ == "__main__":
    # initiate ClearML task
    task = Task.init(
        project_name="hannd-0.1",
        task_name="train-endtoend-0.2",
        auto_connect_streams={'stdout': True, 'stderr': True, 'logging': True}
    )
    tas...
```
```
error: could not write config file /root/.gitconfig: Device or resource busy
Using cached repository in "/root/.clearml/vcs-cache/{repo}.git.{commit}/{repo}.git"
```
I have noticed this, is there a reason it's using a cached repo here?
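A one-off workaround sketch, if the cached copy is the problem: delete the agent's VCS cache so the next run clones the remote repo fresh. The path comes from the log line above; clearml-agent also has a `agent.vcs_cache.enabled` setting in its config that can disable the cache entirely (check your clearml.conf before relying on that).

```shell
# Force a fresh clone on the next task run by removing the cached repos.
# Path taken from the "Using cached repository in ..." log line.
rm -rf ~/.clearml/vcs-cache
```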
CostlyOstrich36 I use `task.set_base_docker(docker_image="some_image")` to set the docker image for the task for future experiment runs; I don't think ClearML detects the image I'm running on locally when registering the task.
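For context, a minimal sketch of where that call sits — `set_base_docker()` is called on the task object after `Task.init()`, and as noted it must be set explicitly because the local container image isn't auto-detected. `"some_image"` is the placeholder from the message above, not a real image name:

```python
# Sketch only: pin the container an agent should use for remote runs.
def pin_base_docker(task):
    # "some_image" is a placeholder; use the image the task was
    # registered under, e.g. keras-hannd-cml from earlier in the thread.
    task.set_base_docker(docker_image="some_image")
```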
Yeah just checked this, the commit checks out on a different machine
SuccessfulKoala55 Agent ver is 1.4.1, clearml sdk 1.7.2
pushed to a branch