Hi FierceHamster54
This is already supported; unfortunately, the open-source version only supports static allocation (i.e. you can spin up multiple agents and connect each one to a specific number of GPUs). The dynamic option (where a single agent allocates jobs to multiple GPUs / slices) is only part of the enterprise edition
(there is the hidden assumption there that if you spent so much on a DGX you are probably not a small team 🙂)
I want in my CI tests to reproduce a run in an agent
you mean to run it on the CI machine ?
because the env changes and some things break in agents and not locally
That should not happen, no? Maybe there is a bug that needs fixing on clearml-agent ?
Since I'm assuming there is no actual task to run, and you do not need to setup the environment (is that correct?)
you can do:
```
$ CLEARML_OFFLINE_MODE=1 python3 my_main.py
```
wdyt?
It's the same but done from outside, you want the same and "offline" as well right?
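For reference, the same switch can also be flipped from inside the code instead of the environment variable; a rough sketch (the project/task names are just placeholders):
```
from clearml import Task

# Equivalent to running with CLEARML_OFFLINE_MODE=1: nothing is sent to the
# server, everything is recorded locally so the CI run stays self-contained.
# Note: this must be called before Task.init()
Task.set_offline(offline_mode=True)

task = Task.init(project_name="ci-tests", task_name="reproduce-run")
```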
Hi ShinyRabbit94
system_site_packages: true
This is set automatically when running in "docker mode", no need to worry 🙂
What is exactly the error you are getting ?
Could it be the container itself has the python packages installed in a venv not as "system packages" ?
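For context, this is the flag being discussed as it sits in clearml.conf on the agent machine (just a sketch of the relevant keys, not a full file):
```
agent {
    package_manager: {
        # reuse the python packages already installed in the container / system python
        system_site_packages: true,
    }
}
```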
GentleSwallow91 what you are looking for is here 🙂
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
Oh if this is the case, then by all means push it into your Task's docker_setup_bash_script
It does not seem to have to be done after the git clone; the only part that I can see is setting the PYTHONPATH to the additional repo you are pulling, and that should work.
The main hurdle might be passing credentials to git, but if you are using SSH it should be transparent
wdyt?
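Something along these lines (a rough sketch: the repo URL, image and paths are placeholders, and depending on the agent version you may prefer to set PYTHONPATH on the task's environment rather than exporting it in the setup script):
```
from clearml import Task

task = Task.init(project_name="example", task_name="extra-repo")

# The setup script runs inside the container before the task starts; with SSH
# keys mounted into the container, the extra git clone should be transparent.
task.set_base_docker(
    "python:3.9-bullseye",
    docker_setup_bash_script=[
        "git clone git@github.com:my-org/extra-repo.git /opt/extra-repo",
        "export PYTHONPATH=/opt/extra-repo:$PYTHONPATH",
    ],
)
```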
can the ClearML File server be configured to any kind of storage ? Example hdfs or even a database etc..
DeliciousBluewhale87 long story short, no 🙂 the file server will just store/retrieve/delete files from a local/mounted folder
Is there any way we can scale this file server when our data volume explodes? Maybe it wouldn't be an issue in the K8s environment anyway. Or can it also be configured such that all data is stored in the hdfs (which helps with scalability)? I would su...
Hi @<1566596960691949568:profile|UpsetWalrus59>
just wondering - shouldn't the job still work if I didn't push the commit yet
How would that work? It does not know which commit to take, and it would also fail on the git diff apply, no?
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of simultaneous jobs it will try to launch (it will launch more after the launching is done; notice this is only the launching, not the actual execution), but this is just a way to limit it.
Hi VivaciousWalrus21
After restarting training huge gaps appear in iteration axis (see the screenshot).
The Task.init actually tries to understand what the last reported iteration was and continue from that iteration. I'm assuming that what happens is that your code does that as well, which creates a "double shift" that you see as the jump. I think the next version will try to be "smarter" about it and detect this double gap.
In the meantime, you can do:
```
task = Task.init(...)
...
```
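The snippet above is cut off in the log; assuming the workaround is to zero the reported-iteration offset with Task.set_initial_iteration, it would look roughly like this (project/task names are placeholders):
```
from clearml import Task

task = Task.init(project_name="example", task_name="resume-training")
# Assumption: reset the iteration offset so that when your own training code
# also resumes from its checkpointed iteration, the reports are not shifted twice.
task.set_initial_iteration(0)
```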
PreciousParrot26 I think this is really a matter of the CI process having very limited resources. Just to be clear, you are correct and the steps themselves are not executed inside the CI environment, but it seems that even running the pipeline logic is somehow "too much" for the limited resources... Make sense?
The bug was fixed 🙂
Hmm I think this is not doable ... 🙂
(the underlying data is stored in DBs and changing it is not really possible without messing about with the DB)
Hi @<1551376687504035840:profile|StraightSealion9>
AWS Autoscaler to create a new instance when you enqueue a task to the relevant queue.
Does that mean that you were able to enqueue a Task and have it launch on the remote EC2 machine ?
Ok.. so I should generally avoid connecting complex objects? I guess I would create a 'mini dictionary' with a subset of params, and connect this instead.
In theory it should always work, but this specific one fails on a very pythonic paradigm (see below)
```
from copy import copy
an_object = copy(object)
```
A good rule of thumb is to connect any object/dict that you want to track or change later
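So the 'mini dictionary' approach would look something like this (parameter names and values are just placeholders):
```
from clearml import Task

task = Task.init(project_name="example", task_name="connect-params")

# Connect only the plain, serializable parameters you want to track/override,
# instead of the complex object that fails on copy().
params = {"learning_rate": 0.001, "batch_size": 32, "epochs": 10}
params = task.connect(params)  # the returned dict reflects any UI overrides
```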
Is there any references (vlog/blog) on deploying real-time model and do the continuous training pipeline in clear-ml?
Something along the lines of this one ?
https://clear.ml/blog/creating-a-fully-automatic-retraining-loop-using-clearml-data/
Or this one?
https://www.youtube.com/watch?v=uNB6FKIi8Wg
Curious what advantage it would be to use the StorageManager
Basically if you set the clearml cache folder to the EFS, users can always do:
```
from clearml import StorageManager
local_file = StorageManager.get_local_copy("<remote file url>")
```
where local_file is stored on the persistent cache (EFS) and the cache is automatically cleaned based on the last accessed file
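Pointing the cache at the EFS mount is a clearml.conf setting on the user machines; a sketch (the mount path here is just an example):
```
sdk {
    storage {
        cache {
            # shared EFS mount used as the ClearML download cache
            default_base_dir: "/mnt/efs/clearml-cache"
        }
    }
}
```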
Apparently the error comes when I try to access from get_model_and_features the pipeline component load_model. If it is not set as a pipeline component and only as a helper function it works (provided it is declared before the component that calls it; I already understood that and fixed it, different from the code I sent above).
ShallowGoldfish8 so now I'm a bit confused, are you saying that now it works as expected ?
but can it NOT use /tmp for this? I'm merging about 100GB
You mean to configure your Temp folder for when squashing ?
you can do the following hack:
```
import tempfile
tempfile.tempdir = "/my/new/temp"
# Dataset squash goes here
tempfile.tempdir = None
```
But regardless, I think this is worth a GitHub issue with a feature request to set the temp folder
VexedCat68 both are valid. In case the step was cached (i.e. already executed) the node.job will be None, so it is probably safer to get the Task based on the "executed" field which stores the Task ID used.
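In code it would look roughly like this; get_step_task is a hypothetical helper, and node is assumed to be the pipeline step node you already hold (e.g. the one passed to the controller's pre/post execution callbacks):
```
from clearml import Task

def get_step_task(node):
    """Return the Task behind a pipeline step node, cached or not.

    If the step was cached, node.job is None, but node.executed still holds
    the ID of the Task that produced the (cached) result.
    """
    return Task.get_task(task_id=node.executed)
```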
It takes 20mins to build the venv environment needed by the clearml-agent
You are joking?! 🙂
it does apt-get install python3-pip and pip install clearml-agent, how is that 20min?
When looking at the worker details, it says "No queues currently assigned to this worker"
Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing is listed there. It is "listening" to queues and launching machines, and those machines will be listening to the queue. I wonder if it is just easier to also make sure it is listed as "assigned" to those queues. wdyt?
Why does my task execution freeze after pip installation (running agent in foreground mode)?
Hi AdventurousButterfly15
Are you running in agent docker mode or venv mode ?
What do you mean freeze? Do you see anything in the Task console log in the UI? What's the host OS?
how to put or handle this configuration and where?
In your clearml.conf on the machine with the agent, just add at the bottom of the file:
```
agent.venvs_cache.path=~/.clearml/venvs-cache
```
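If you prefer the full section over the single dotted key, the reference clearml.conf has it roughly like this (values mirror the shipped defaults, adjust as needed):
```
agent {
    venvs_cache: {
        # maximum number of cached venvs to keep around
        max_entries: 10
        # minimum free space (GB) to keep on the cache drive
        free_space_threshold_gb: 2.0
        path: ~/.clearml/venvs-cache
    }
}
```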