Exactly, it should have auto-detected the package.
The instance that took a while to terminate (or has taken a while to disappear from the idle workers)
FWIW, it's also listed in other places @<1523704157695905792:profile|VivaciousBadger56>, e.g. None says:
In order to make sure we also automatically upload the model snapshot (instead of saving its local path), we need to pass a storage location for the model files to be uploaded to.
For example, upload all snapshots to an S3 bucket…
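For reference, a minimal sketch of what that looks like with `Task.init`, assuming a hypothetical bucket and project/task names:

```python
from clearml import Task

# With output_uri set, model checkpoints saved by the task are uploaded to the
# bucket instead of only having their local path recorded.
task = Task.init(
    project_name="examples",                       # hypothetical project name
    task_name="train with remote model storage",   # hypothetical task name
    output_uri="s3://my-bucket/models",            # hypothetical S3 bucket
)
```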
Honestly I wouldn't mind building the image myself, but the glue-k8s setup is missing some documentation so I'm not sure how to proceed
I've been answering there as well 🤕
I cannot, the instance is long gone... But it's no different from any other scaled instance; it seems it just took a while to register in ClearML
Great to hear @<1523701087100473344:profile|SuccessfulKoala55> ! Is there an estimated timeline for these releases?
I'll try a hacky way around it with `sed -i 's/include-system-site-packages = false/include-system-site-packages = true/g' clearml_agent_venv/pyvenv.cfg` and report back.
Sure SuccessfulKoala55, and thanks for looking into it.
As an alternative (for now, or in general), we could consider reverting back to pip. The issue we encounter is that we have a monorepo, so frozen requirements should specify relative paths, but `pip freeze` does not seem to do that, so ClearML also fails in `pip` mode.
SuccessfulKoala55 help me out here 🙂
It seems all the changes I make in the AWS autoscaler apply directly to the virtual environment set up for the autoscaler itself, but nothing from that propagates down to the launched instances.
So e.g. the autoscaler environment has `poetry` installed, but then the instance fails because it does not have it available?
Or to be clear, the environment installed by the autoscaler under `/clearml_agent_venv` has poetry installed, and it uses that to set up the environment for the executed task, e.g. in `root/.clearml/venvs-builds/3.10/task_repository/.../.venv`, but the latter does not have poetry installed, and so it crashes?
I've also tried e.g. setting `agent.package_manager.priority_packages = ["poetry"]`, and/or `agent.package_manager.poetry_version = ">1.2.0"`, and other flags, but these affect only the main `/clearml_agent_venv` environment, and not the one actually generated by the clearml-agent when executing the task.
I also tried adding `agent.package_manager.system_site_packages = true` to ensure these virtual environments have access btw, still to no avail.
Still crashing, I think that may not be the correct virtual environment to edit 🤔
It's the one created later down the line
That still seems to crash SuccessfulKoala55 🤔
EDIT: No, wait, the environment still needs updating. One moment still...
Now my `extra_vm_bash_script` looks like so:
```bash
deactivate
apt-get install -y gfortran libopenblas-dev liblapack-dev libpq-dev python-is-python3 python3-pip python3-dev proj-bin libgraphviz-dev graphviz graphviz-dev libgdal-dev
apt-get install software-properties-common -y
add-apt-repository ppa:deadsnakes/ppa -y
apt update
apt install python3.7 python3.8 python3.9 python3.7-distutils python3.8-distutils python3.9-distutils python3.10-distutils python3.7-dev python3.8-dev python3.9-dev pyt...
```
AgitatedDove14 I will try! I remember there were some issues with it, where I had to resort to this method first, but maybe things have changed since :)
In any case @<1537605940121964544:profile|EnthusiasticShrimp49> this seems like a good approach, but it's not quite there yet. For example, even if I'd provide a simple `def run_step(…)` function, I'd still need to pass the instance to the function. Passing it along in the `kwargs` for `create_function_task` does not seem to work, so now I need to also upload the inputs, etc. I'm bringing this up because the pipelines already do this for you.
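For future readers, a rough sketch of that artifact-based workaround, assuming hypothetical names (`my_instance`, the project/task/queue names) and that the instance is pickleable; only plain strings are passed as `create_function_task` kwargs here, since they survive serialization:

```python
from clearml import Task

def run_step(instance_task_id: str, artifact_name: str = "instance"):
    # Fetch the instance from the task that uploaded it, instead of passing
    # the object itself through the function-task kwargs.
    source = Task.get_task(task_id=instance_task_id)
    instance = source.artifacts[artifact_name].get()
    # ... do the actual step work with `instance` ...

task = Task.init(project_name="examples", task_name="driver")  # hypothetical names
my_instance = {"some": "state"}              # stands in for the real instance
task.upload_artifact("instance", my_instance)
step = task.create_function_task(
    run_step,
    func_name="run_step",
    task_name="run step",
    instance_task_id=task.id,
    artifact_name="instance",
)
Task.enqueue(step, queue_name="default")     # hypothetical queue name
```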
Ultimately we're trying to avoid Docker in the AWS autoscaler (virtualization on top of virtualization seems redundant), and instead we maintain an AMI for a faster boot sequence.
We had no issues when we used `pip`, but now when trying to work with `poetry` all these issues came up.
The way I understand `poetry` to work is that it is expected there is one system-wide installation that is used for virtual environment creation and manipulation. So at least it may be desired that the ...
AgitatedDove14 yeah I see this now; this was an issue because I later had to "disconnect" the remote task, so it can, itself, create new tasks (using `clearml.config.remote.override_current_task_id(None)`). I guess you might remember that discussion? 😁
EDIT: It's the discussion we had here, for reference. https://clearml.slack.com/archives/CTK20V944/p1640955599257500?thread_ts=1640867211.238900&cid=CTK20V944
So probably not needed in JitteryCoyote63's case, we still have some...
We're wondering how many on-premise machines we'd like to deprecate. For that, we want to see how often our "on premise" queue is used (how often a task is submitted and run), for how long, how many resources it consumes (on average), etc.
AgitatedDove14 Unfortunately not, the queues tab shows only the number of tasks, but not the resources used in the queue. I can toggle between the different workers but then I don't get the full picture.
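For what it's worth, a rough sketch of estimating that from the SDK rather than the UI, assuming a hypothetical project name and that the `started`/`completed` fields come back populated as datetimes (this counts per project, not per queue):

```python
from clearml import Task

# Pull completed tasks and add up their runtimes as a rough usage estimate.
tasks = Task.get_tasks(
    project_name="on-premise-experiments",   # hypothetical project name
    task_filter={"status": ["completed"]},
)
total_seconds = sum(
    (t.data.completed - t.data.started).total_seconds()
    for t in tasks
    if t.data.started and t.data.completed
)
print(f"{len(tasks)} completed tasks, ~{total_seconds / 3600:.1f} hours of runtime")
```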
I guess in theory I could write a `run_step.py`, similarly to how the pipeline in ClearML works… 🤔 And then use `Task.create()` etc?
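A minimal sketch of that idea, assuming a hypothetical repository URL, script path, and queue name:

```python
from clearml import Task

# Create a draft task from a standalone script, then enqueue it for an agent.
step = Task.create(
    project_name="examples",                       # hypothetical project name
    task_name="run step",
    repo="https://github.com/my-org/my-repo.git",  # hypothetical repository
    branch="main",
    script="steps/run_step.py",                    # hypothetical script path
)
Task.enqueue(step, queue_name="default")           # hypothetical queue name
```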
But to be fair, I've also tried with `python3.X -m pip install poetry` etc. I get the same error.