Any insight will help. If you can provide the log of the Task that got stuck, that would be a good start.
JitteryCoyote63 what's the clearml version?
Are you always seeing the "model uploaded completed" message?
What's the Python version you are using?
MysteriousBee56 once you execute your code, it will appear in the server (with all fields pre-populated based on your setup, git, etc.). Once it is there, you can "clone" it and move it around.
Is this what you mean?
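If so, a hedged sketch of doing the same clone-and-enqueue from code, assuming the ClearML SDK; the project, task, and queue names are placeholders:

```python
from clearml import Task

# Find the task that was created when you ran your code once
original = Task.get_task(project_name="examples", task_name="my experiment")

# Clone it (creates an editable draft copy) and push it to an execution queue
cloned = Task.clone(source_task=original, name="my experiment (clone)")
Task.enqueue(cloned, queue_name="default")
```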
A bit of background: the idea behind Trains is that the environment definition (i.e. git repo, packages, code entry point, arguments, etc.) is collected when executing the code. This avoids the tedious task of generating and maintaining YAML/JSON configuration files.
What is exa...
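For context, a minimal sketch of the single call that triggers this collection; the project and task names are placeholders:

```python
from clearml import Task

# Calling Task.init() in the training script is what captures the git repo,
# uncommitted changes, installed packages, and command-line arguments.
task = Task.init(project_name="examples", task_name="my experiment")
```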
BattyLizard6 to my knowledge the main issue with fractional GPUs is that there is no real restriction on GPU memory allocation (with the exception of MIG slices, which are limited in other ways).
Basically one process/container can consume the maximum GPU RAM on the allocated card (this also includes the http://run.ai fractional solution, at least from what I understand).
This means that developer A can allocate memory so that developer B on the same GPU will start getting out-of-memory errors
(Notice in a...
Correct (basically pip freeze results)
Train Data Params/a = {}
Train Data Params/b = ...
Then maybe we could "hack" it so that if you edit it in the UI like so:
Train Data Params/a = {'new': 'value'}
Train Data Params/b = ...
you end up with:
param = {'a': {'new': 'value'}, 'b': ...}
What do you think?
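For reference, a rough sketch of how such a dict could be connected under a section name so it shows up as "Train Data Params/..." keys in the UI; the structure and values here are placeholders:

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="nested params sketch")

# Hypothetical parameter structure, mirroring the discussion above
train_data_params = {"a": {}, "b": {"nested": "value"}}

# Connecting under a section name produces flattened keys in the UI,
# e.g. "Train Data Params/a" and "Train Data Params/b/nested"
task.connect(train_data_params, name="Train Data Params")
```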
Oh, this is so that internally the background thread can signal it is not deferred. Are you saying there is a bug, or that the code is odd?
Wait, how did you end up with
clearml_task_id = os.environ['CLEARML_TASK_ID']
printing "01b77a220869442d80af42efce82c617"?
This means you are running via an agent?!
And the trains version?
Hi @<1674588542971416576:profile|SmarmyGorilla62>
You mean on your elastic / mongo local disk storage ?
WackyRabbit7 my apologies for the lack of background in my answer
Let me start from the top: one of the goals of the trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed Python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is proba...
Could not find a version that satisfies the requirement open3d==0.15.2 .. from versions: 0.10.0.0, 0.11.0, 0.11.1, 0.11.2, 0.12.0, 0.13.0)
This points to the agent installing with a different Python version than the one you used to run the original code; I would guess python3.6.
YummyWhale40 from the code snippet, it seems like the argument is passed.
"reuse_last_task_id=True" is the default, and it means that if the previous run of the task did not create any artifacts/models and was executed 72 hours ago (configurable), The Task will be reset (i.e. all logs cleared) and will be reused in the current run.
no, I set the env variable CLEARML_TASK_ID myself
Do not, this is the issue.
This variable is used internally, and setting it yourself messes up the internal state; basically it is one of the signals telling the SDK that an agent is taking care of things (for example, logging the entire console output)
Use any other variable, for example MY_CLEARML_TASK_ID
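For example, a small sketch of exporting your own variable instead; the variable name is just illustrative:

```python
import os
from clearml import Task

task = Task.init(project_name="examples", task_name="env var sketch")

# CLEARML_TASK_ID is reserved for the SDK/agent; export your own variable instead.
os.environ["MY_CLEARML_TASK_ID"] = task.id
```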
Another question: do you have the argparse argument defined with type=str?
Hmm, I still wonder what the "correct" answer is for most people. Is an empty-string default in argparse redundant anyhow? Will anyone ever use it?
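For clarity, this is the argparse pattern being discussed; the argument name and default are illustrative:

```python
import argparse

# An argument declared with type=str and an empty-string default
parser = argparse.ArgumentParser()
parser.add_argument("--data-path", type=str, default="")
args = parser.parse_args()
```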
Hmm, it might be sub-sampling on large scalar plots (so that we do not "kill" the UI), but I remember that it only happens above 50k samples. (When you zoom in, do you still get the 0.5 values?)
SmarmySeaurchin8 regarding (2)
I'm not sure the current visualization supports it. I mean, we could put "{}", but that would imply you can edit it, which we would then have to support; possible but weird, and this is why:
task.connect({'a': {}, 'b': {'nested': 'value'}})
will become:
'a' = '{}'
'b/nested' = 'value'
But then if you edit to:
'a' = '{'nested': 'value'}'
'b/nested' = 'value'
you end up with two different ways of presenting the same type of structure...
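To make the flattening concrete, a small sketch assuming the ClearML SDK; the dict contents are placeholders:

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="flattening sketch")

# A nested dict: the empty dict 'a' stays a single key, while 'b' is
# flattened into 'b/nested' in the UI, which is the ambiguity described above.
params = {'a': {}, 'b': {'nested': 'value'}}
task.connect(params)
```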
When you install using pip <filename> you should end up with something like:
minerva @ file://...
or
minerva @ https://...
Hi UnevenOstrich23
if --docker is enabled, does that mean every new experiment will be executed inside a dedicated agent worker container?
Correct
I think the missing part is how to specify the docker for the experiment?
If this is the case, in the web UI clone your experiment (which will create a draft copy that you can edit), then in the Execution tab scroll down to "base docker image" and specify the docker image to use.
Notice that you can also add flags after the docker im...
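For completeness, the same can also be set from code; a hedged sketch, where the image name is only an example:

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="base docker sketch")

# Set the container image the agent should use when running this task in docker mode.
task.set_base_docker("nvidia/cuda:11.7.1-runtime-ubuntu22.04")
```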
Ohh, then you use the docker sibling setup:
Basically you map the docker socket into the agent's docker container; that lets the agent launch another docker container on the host machine.
You can see an example here:
https://github.com/allegroai/clearml-server/blob/6434f1028e6e7fd2479b22fe553f7bca3f8a716f/docker/docker-compose.yml#L144
What if the preexisting venv is just the system Python? My base image is python:3.10.10 and I just pip install all requirements in that image. Does that not avoid the venv still?
It will basically create a new venv inside the container, forking the existing preinstalled stuff (i.e. the new venv already has everything the system Python has preinstalled).
Then it will call "pip install" on all the "installed packages" of the Task.
Which should just check everything is there and install nothing...
Actually we just added venv support as well. The reasoning is/was that inside a docker container it is easier to separate the running processes; with venvs we had to support multiple venvs running at the same time and reusing those venvs (just a bit more logic). Anyhow, this is now supported :)
Q. Would someone mind outlining the steps to configure the default storage locations, such that any artefacts or data pushed to the server are stored by default on the Azure Blob Store?
Hi VivaciousPenguin66
See my reply here on configuring the default output uri on the agent: https://clearml.slack.com/archives/CTK20V944/p1621603564139700?thread_ts=1621600028.135500&cid=CTK20V944
Regarding permission setup:
You need to make sure you have the Azure blob credenti...
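As a reference point, a minimal sketch of setting a default output destination from code; the account and container names are placeholders:

```python
from clearml import Task

# Everything the task uploads (models, artifacts) will default to this destination.
task = Task.init(
    project_name="examples",
    task_name="azure output sketch",
    output_uri="azure://<account-name>.blob.core.windows.net/<container-name>",
)
```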
I suppose the same would need to be done for any client PC running clearml such that you are submitting dataset upload jobs?
Correct
That is, the dataset is perhaps local to my laptop, or on a development VM that is not in the clearml system, but from there I want to submit a copy of a dataset; then I would need to configure the storage section in the same way as well?
Correct
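For illustration, a rough sketch of uploading a local dataset directly to Azure blob storage; all names and paths below are placeholders:

```python
from clearml import Dataset

# Create a dataset entry, add a local folder, and upload it to Azure blob storage
ds = Dataset.create(dataset_project="examples", dataset_name="my dataset")
ds.add_files(path="/path/to/local/data")
ds.upload(output_url="azure://<account-name>.blob.core.windows.net/<container-name>")
ds.finalize()
```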
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything.
Sure man, no rush, I appreciate the gesture regardless of the outcome
Many thanks!
Sounds good.
BTW, when the clearml-agent is set to use "conda" as its package manager it will automatically install the correct cudatoolkit in any new venv it creates. The cudatoolkit version is picked up directly when "developing" the code, assuming you have conda installed as your development environment (basically you can transparently do end-to-end conda, and not worry about CUDA at all).