
Seems like this was a hidden SSH key error that wasn't being surfaced; it was using a cached repo rather than cloning the remote repo.
Hmm yeah, I have monitored some of the resource metrics and it didn't seem to be an issue. I'll attempt to install Prometheus / Grafana. This is a PoC, however, so I was hoping not to have to install too many tools.
The code running is basically this:
from clearml import Task

if __name__ == "__main__":
    # initiate ClearML task
    task = Task.init(
        project_name="hannd-0.1",
        task_name="train-endtoend-0.2",
        auto_connect_streams={'stdout': True, 'stderr': True, 'logging': True}
    )
    tas...
Update on this - seems like it's an error in our code which isn't being appropriately raised, by the looks of things! I'll dig into it further, but for now this can be left. Thanks for replying!
pushed to a branch
SuccessfulKoala55 Agent version is 1.4.1, ClearML SDK 1.7.2
Yeah just checked this, the commit checks out on a different machine
error: could not write config file /root/.gitconfig: Device or resource busy
Using cached repository in "/root/.clearml/vcs-cache/{repo}.git.{commit}/{repo}.git"
I have noticed this, is there a reason it's using a cached repo here?
AgitatedDove14 Unfortunately that didn't work either. I agree that should run the setup.py correctly, but something still seems to be breaking; I've sent you the most recent logs.
AgitatedDove14 fyi I do install build-essential manually in the logs I just sent you, and it still fails
AgitatedDove14 Yeah, I added it into the initial bash script to test whether that would fix the issue. The task is created using the SDK in the model training script, i.e. Task.init(). I was under the impression the local package would be installed due to replication of the environment I initialised the task under; however, I've tried the add_requirements("leap") function and just seem to be getting an IsADirectoryError? I also tried manually adding leap==0.4.1 in the task...
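For context, this is roughly how I understood add_requirements should be used: a minimal sketch, assuming it's called as a classmethod before Task.init() and takes a package name plus an optional version (the leap name and version are just from our repo):

from clearml import Task

# Pin the local package as an explicit requirement *before* Task.init(),
# so the agent includes it when it rebuilds the environment remotely.
# "leap" / "0.4.1" are just the package name and version from our repo.
Task.add_requirements("leap", "0.4.1")

task = Task.init(
    project_name="hannd-0.1",
    task_name="train-endtoend-0.2",
)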
I think it is to do with the build-essential issue. Let me talk you through the process:
1. Run a docker image locally called keras-hannd-cml (i.e. the one that is then used by the agent as the base image later on).
2. Run the training script to register the task, which works fine; all dependencies work, i.e. the C++ packages are working correctly on that container.
3. Execute the task on an agent running in docker mode with the same image that the task was registered with, i.e. keras-hannd-...
CostlyOstrich36 I use task.set_base_docker(docker_image="some_image") to set the docker image for the task for future experiment runs; I don't think clearml detects the image I'm running on locally when registering the task.
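To be clear, the flow I described above looks roughly like this (a minimal sketch; the queue name "default" is an assumption, and execute_remotely is just one way of handing the registered task to a docker-mode agent):

from clearml import Task

# Step 2: register the task locally; the git commit, uncommitted changes
# and python environment are captured at this point.
task = Task.init(project_name="hannd-0.1", task_name="train-endtoend-0.2")

# Tell the agent which docker image to use for future (remote) runs.
task.set_base_docker(docker_image="keras-hannd-cml")

# Step 3: stop executing locally and enqueue the task for the agent.
task.execute_remotely(queue_name="default", exit_process=True)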
FYI, this is the installation for a locally used package in the task, so it's imported from the training script.
Unfortunately, installing build-essential at startup didn't work.
AgitatedDove14 DM'd you the log file for the failed task. I have tried using a task startup script to install g++, gcc etc., but it didn't seem to work; I'll try build-essential too. I'm also interested in the way that the environments are set up in ClearML. I read in the docs that the task looks for a requirements.txt file to construct the env, but does this prevent a local package being built correctly, i.e. through setup.py, when running a remote task?
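If it helps, the startup-script approach I tried looks roughly like this (a sketch, assuming set_base_docker accepts a docker_setup_bash_script argument that the agent runs inside the container before building the environment):

from clearml import Task

task = Task.init(project_name="hannd-0.1", task_name="train-endtoend-0.2")

# Ask the agent to install the compiler toolchain inside the container
# before it creates the venv / installs the task's requirements.
task.set_base_docker(
    docker_image="keras-hannd-cml",
    docker_setup_bash_script=[
        "apt-get update",
        "apt-get install -y build-essential",
    ],
)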
AgitatedDove14 Sorry, what .so are you referring to here? I can't see that in the logs. The docker image installs the package by first installing requirements, i.e. RUN pip install --no-cache-dir -r /tmp/requirements.txt, then the repo is copied locally, and then leap is installed through RUN cd /opt/keras-hannd && pip install --no-deps .
AgitatedDove14 The issue seems to be that the setup.py containing the Extension module we need isn't being run in the clearml virtual environment within the docker container. What is the correct process for installing local packages so they're replicated correctly when running remotely on an agent?
We have

ext_modules=[
    Extension(
        'leap.learn.data_tools.file_io.extio',
        sources=['leap/learn/data_tools/file_io/extio.cpp'],
        depends=['leap/learn/data_tools/file_io/samples.h'],
        define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_9_API_VERSION')],
        extra_compile_args=['-std=c++11'],
        libraries=['rt'] if platform.system() == 'Linux' else [],
        include_dirs=[GetNumpyIncludeDirectoryLazy()],
        optional=True
    ),
]

in our setup.py, which I belie...
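For completeness, that Extension sits inside a fairly standard setuptools setup() call, something like this sketch (the name, version and find_packages usage are illustrative, and numpy.get_include() is a stand-in for our own GetNumpyIncludeDirectoryLazy helper):

import platform

import numpy
from setuptools import Extension, find_packages, setup

# The C++ extension only builds when a compiler toolchain (gcc/g++) is
# present in the environment, which is why the build-essential question matters.
extio = Extension(
    'leap.learn.data_tools.file_io.extio',
    sources=['leap/learn/data_tools/file_io/extio.cpp'],
    define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_9_API_VERSION')],
    extra_compile_args=['-std=c++11'],
    libraries=['rt'] if platform.system() == 'Linux' else [],
    include_dirs=[numpy.get_include()],  # stand-in for GetNumpyIncludeDirectoryLazy()
    optional=True,
)

setup(
    name='leap',       # illustrative; matches the package we pip install
    version='0.4.1',
    packages=find_packages(),
    ext_modules=[extio],
)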
and it's clearml version 1.7.2
Solved this, but going to leave it up in case it's useful to anyone: just used the podTemplate in the values.yml for the clearml-agent Helm chart to mount the hostPath as a volume mount, i.e.:
podTemplate:
  # -- volumes definition for pods spawned to consume ClearML Task (example in values.yaml comments)
  volumes:
    - name: x11-host-dir
      hostPath:
        path: /tmp/.X11-unix
  volumeMounts:
    - name: x11-host-dir
      mountPath: '/tmp/.X11-unix'