
I was asking to exclude this possibility from my debugging journey 😁
CostlyOstrich36 I don’t see such a number, can you please share a screenshot of where to look?
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
-> wrong numpy version
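A minimal sanity check for this, assuming the mismatch comes from an extension module (e.g. torch) built against a newer numpy C API than the one currently installed:

```python
# Check which numpy is actually being imported (assumption: an extension module
# was compiled against a newer numpy C API than the installed version).
import numpy
print(numpy.__version__)
# Upgrading numpy, e.g. `pip install --upgrade numpy`, typically resolves the
# "module compiled against API version ..." error.
```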
I’m not too fond of many user configurations; it’s confusing.
100% agree; nevertheless, how much is too many? Currently there are only two settings in the user preferences category, so one more wouldn’t hurt?
however, clearml is open source; nothing stops you from adding the code and sending a PR
I’d be super happy to contribute, yes! Nevertheless, I am not sure where to start: the clearml-server repo? The clearml-web repo?
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding cuda and cudnn libraries along with it?
I was rather wondering why clearml was taking up space when I had configured it to use the /data volume. But as you described AgitatedDove14, it looks like an edge case, so I don’t mind 🙂
Now it starts; I’ll see if this solves the issue
torch==1.7.1 git+
.
so what worked for me was the following startup userscript:
```
#!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
```
So in my use case each step would create a (potentially big) folder and store it as an artifact. The last step should “merge” all the previous folders. The idea is to split the work among multiple machines (in parallel). I would like to avoid these potentially big folder artifacts also being stored in the pipeline task, because that task will be running on the services queue of the clearml-server instance, which will definitely not have enough space to handle all of them
So if all artifacts are logged in the pipeline controller task, I need the last step to access all the artifacts from the pipeline task, i.e. execute something like PipelineController.get_artifact() in the last step task
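As a rough sketch of what the last step could do instead, assuming the previous steps’ task IDs are passed in as a parameter (the names step_task_ids and "out_folder" are made up for this example) and each step uploaded its folder with upload_artifact():

```python
from clearml import Task

# Hypothetical helper for the final "merge" step: download the folder artifact
# uploaded by each previous step directly from its own task (stored on s3),
# without going through the pipeline controller task.
def collect_step_outputs(step_task_ids):
    local_folders = []
    for task_id in step_task_ids:
        step_task = Task.get_task(task_id=task_id)
        # get_local_copy() downloads the artifact and caches it locally
        local_folders.append(step_task.artifacts["out_folder"].get_local_copy())
    return local_folders
```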
with the CLI, on a conda env located in /data
Oh I see, I think we are now touching on a very important point:
I thought that torch wheels already included the cuda/cudnn libraries, so you don’t need to care about the system cuda/cudnn version because eventually only the cuda/cudnn libraries extracted from the torch wheels are used. Is this correct? If not, does that mean that one should use conda to install the correct cuda/cudnn via cudatoolkit?
As you can see: more hard waiting (the initial sleep), and then before each apt action, making sure there is no lock
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the cuda/cudnn libraries required by torch
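A quick way to verify this in a given environment (assuming a CUDA-enabled torch wheel is installed) is to print the versions bundled with the wheel:

```python
import torch

# The CUDA/cuDNN versions reported here come from the libraries shipped inside
# the torch wheel, independent of any system-wide CUDA toolkit installation.
print(torch.__version__)
print(torch.version.cuda)               # CUDA runtime bundled with the wheel
print(torch.backends.cudnn.version())   # cuDNN bundled with the wheel
print(torch.cuda.is_available())        # still requires a compatible NVIDIA driver
```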
there is no error from this side; I think the aws autoscaler just waits for the agent to connect, which will never happen since the agent won’t start because the userdata script fails
That’s why I said “not really” 😄
CostlyOstrich36 How is clearml-session setting the ssh config?
There it is: https://github.com/allegroai/clearml/issues/493
AgitatedDove14 I see at https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21 that a key pair is hardcoded in the repo. Is it being used to ssh into the instance?
Will it freeze/crash/break/stop the ongoing experiments?
sorry, the clearml-session. The error is the one I shared at the beginning of this thread
In all the steps I want to store them as artifacts on s3 because it’s very convenient.
The last step should merge them all, i.e. it needs to know about all the artifacts from the previous steps
AgitatedDove14 Yes, with the command you shared I can now ssh manually to the agent again, but clearml-agent will still raise the same error
(BTW: it will work with elevated credentials, but that’s probably not recommended)
What does that mean? I’m not sure I understand
So this message appears when I try to ssh directly into the instance