BTW: what would be a reason to go back to self-hosted? (not sure about the SaaS cost, but I remember it was relatively cheap)
Yes JitteryCoyote63 I think you are correct, this is currently the easiest to do. PompousParrot44 notice that you should have a "services" queue with a trains-agent in "services mode" running to enqueue those types of mostly sleeping services
I was thinking we can quickly create a service that does that, maybe leverage one of these ?
https://github.com/mehrdadmhd/scheduler-py
https://github.com/dbader/schedule
WDYT?
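A rough sketch of what such a service could look like, using only the standard library rather than either of the linked packages. `PeriodicJob`, `tick` and `serve_forever` are hypothetical names, and the lambda stands in for a real enqueue call:

```python
import time
from typing import Callable, List


class PeriodicJob:
    """Run `action` every `interval_sec` seconds (hypothetical helper, not a ClearML API)."""

    def __init__(self, interval_sec: float, action: Callable[[], None]) -> None:
        self.interval_sec = interval_sec
        self.action = action
        self.next_run = 0.0  # due on the first tick

    def run_if_due(self, now: float) -> bool:
        """Fire the action when it is due and schedule the next run."""
        if now < self.next_run:
            return False
        self.action()
        self.next_run = now + self.interval_sec
        return True


def tick(jobs: List[PeriodicJob], now: float) -> int:
    """Run every due job once; return how many fired."""
    return sum(job.run_if_due(now) for job in jobs)


def serve_forever(jobs: List[PeriodicJob], poll_sec: float = 1.0) -> None:
    """Mostly-sleeping loop, the kind of process that fits a "services" queue."""
    while True:
        tick(jobs, time.monotonic())
        time.sleep(poll_sec)


# Example wiring: `launched` stands in for a real enqueue call.
launched: List[str] = []
jobs = [PeriodicJob(60.0, lambda: launched.append("nightly-train"))]
fired_at_0 = tick(jobs, now=0.0)    # due immediately
fired_at_30 = tick(jobs, now=30.0)  # not due yet
fired_at_60 = tick(jobs, now=60.0)  # due again
```

Passing `now` explicitly keeps the scheduling logic easy to check without actually sleeping.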
EnviousPanda91
in your clearml.conf I think you are missing a section:
agent.git_user=""
agent.git_pass=""
agent.git_host=""
agent.force_git_ssh_protocol: true
docstring ?
Usually the preferred way is StorageManager
https://clear.ml/docs/latest/docs/references/sdk/storage
https://clear.ml/docs/latest/docs/integrations/storage
Hmm UnevenDolphin73 I just checked with v.1.1.6, the first time the configuration file is loaded is when calling Task.init (if not running with an agent, which is your case).
But the main point I just realized I missed: "http://"${CLEARML_ENDPOINT}":8080"
The code does not try to resolve OS environments there!
Which, well, is a nice feature to add
https://github.com/allegroai/clearml/blob/d3e986393ac8d1a1ea48302224962570ab8e6f9e/clearml/backend_api/session/session.py#L576
should p...
A quick fix will be:
` import os
import dotenv
dotenv.load_dotenv(os.path.expanduser('~/.env'))  # expand "~" explicitly before loading
from clearml import Task  # Now we can load it.
import argparse
if __name__ == "__main__":
    pass  # do stuff `
wdyt?
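Another option, as a sketch only: resolve the `${CLEARML_ENDPOINT}` reference yourself before the value reaches clearml. `resolve_env` is a hypothetical helper and the host name is made up for the example:

```python
import os


def resolve_env(value: str) -> str:
    """Expand ${VAR} / $VAR references using the current OS environment."""
    return os.path.expandvars(value)


# CLEARML_ENDPOINT is the variable name from the message above; the value is made up.
os.environ["CLEARML_ENDPOINT"] = "clearml.example.com"
web_host = resolve_env("http://${CLEARML_ENDPOINT}:8080")
```

The resolved string could then be passed to something like `Task.set_credentials(web_host=...)` before `Task.init`, assuming that fits your setup. Note that `os.path.expandvars` leaves unknown variables untouched instead of raising.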
StraightDog31 can you elaborate? where are the parameters stored? who is trying to access them, and maybe for what purpose ?
Yes, I basically plan to use ClearML as a user-friendly cluster manager
and it is
I think the main "drawback" is that you cannot "reserve" nodes for the multi-node training. The easiest solution is to have a high-priority queue that is never used, and then have the DDP master process push into the high-priority queue, which will ensure these are the next Tasks to be executed (now the only thing that is missing is preemption of running Tasks, but this automation policy is unfortunate...
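A toy simulation of the reserved high-priority queue idea, assuming the agent always drains the queues in the order they are listed. The queue names and `next_task` are made up for illustration, this is not the agent's actual code:

```python
from collections import deque
from typing import Deque, Dict, List, Optional


def next_task(queues: Dict[str, Deque[str]], priority: List[str]) -> Optional[str]:
    """Pop from the first non-empty queue, scanning in priority order."""
    for name in priority:
        if queues[name]:
            return queues[name].popleft()
    return None


queues = {
    "ddp_high": deque(),                   # normally empty, reserved for DDP workers
    "default": deque(["job-a", "job-b"]),  # regular backlog
}
# The DDP master pushes its worker tasks into the reserved queue.
queues["ddp_high"].extend(["ddp-worker-1", "ddp-worker-2"])

order = []
while (task := next_task(queues, ["ddp_high", "default"])) is not None:
    order.append(task)
```

The DDP workers jump ahead of the backlog, but anything already running is not interrupted, which is the missing preemption part.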
RoughTiger69 yes I think "Scale" tier covers it
looks like service-writing-time for me!
Nice!
persist/restore state so that tasks are restartable?
You mean if you write preemption-ready training code ?
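For reference, a minimal sketch of what "preemption-ready" could mean in practice: persist the progress after every step and resume from it on restart. The JSON file and the function names are assumptions for the example, a real job would checkpoint model and optimizer state as well:

```python
import json
import os
import tempfile
from typing import Dict


def load_state(path: str) -> Dict[str, int]:
    """Return the saved state, or a fresh one when no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def save_state(path: str, state: Dict[str, int]) -> None:
    """Write to a temp file and rename, so a kill mid-write keeps the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path: str, total_steps: int, stop_after: int) -> int:
    """Run at most `stop_after` steps in this session, resuming from the checkpoint."""
    state = load_state(path)
    for _ in range(stop_after):
        if state["step"] >= total_steps:
            break
        state["step"] += 1  # stand-in for one real training step
        save_state(path, state)
    return state["step"]


ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
first = train(ckpt, total_steps=10, stop_after=4)     # "preempted" after 4 steps
second = train(ckpt, total_steps=10, stop_after=100)  # restarted, continues from step 4
```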
hmm this might help:
https://pip.pypa.io/en/stable/topics/configuration/#environment-variables
basically you might be able to define: PIP_NO_USE_PEP517=1
ShakyOstrich31
I am reusing an old task ...
Which means that the old Task stores the requirements on the Task itself (see the "Installed Packages" section). Notice it also stores the exact git commit to use.
When you are cloning the Task (i.e. in the pipeline), you should probably:
1. set the commit / branch to the latest in the branch
2. clear the "installed packages" section, which would cause the agent to use the "requirements.txt" stored in the git repo itself.
As far as I understand this s...
BTW: is this on the community server or self-hosted (aka docker-compose)?
Where did you add the Task.init call ?
Still this issue inside a child thread was not detected as failure and the training task resulted in "completed". This error happens now with the Task.init inside the if __name__ == "__main__": as seen above in the code snippet.
I'm not sure I follow, the error seems like an issue in your internal code, does that mean clearml works as expected ?
Still I wonder if it is normal behavior that clearml exits the experiments with status "completed" and not with failure
Well, that depends on the process exit code: if for some reason (not sure why) the process exits with return code 0, it means everything was okay.
I assume the "Detected an exited process, so exiting main" message is an internal print of your code, so I guess it just leaves the process with exit code 0
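To illustrate the exit code point with plain Python (no clearml involved): an uncaught exception in a child thread prints a traceback but does not change the process exit code, so the main thread has to propagate the failure itself. The two inline scripts below are made up for the demonstration:

```python
import subprocess
import sys
import textwrap

# The worker thread raises, but the main thread finishes normally.
UNCHECKED = textwrap.dedent("""
    import threading

    def worker():
        raise RuntimeError("boom")

    t = threading.Thread(target=worker)
    t.start()
    t.join()
""")

# Same worker, but the main thread records the failure and exits non-zero.
CHECKED = textwrap.dedent("""
    import sys
    import threading

    errors = []

    def worker():
        try:
            raise RuntimeError("boom")
        except Exception as exc:
            errors.append(exc)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    sys.exit(1 if errors else 0)
""")


def run(code: str) -> int:
    """Run a Python snippet in a fresh interpreter and return its exit code."""
    return subprocess.run([sys.executable, "-c", code], capture_output=True).returncode


unchecked_rc = run(UNCHECKED)  # thread failure is invisible to the exit code
checked_rc = run(CHECKED)      # main thread turns the failure into exit code 1
```

With a non-zero exit code, whatever supervises the process has something to react to.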
wouldn't it be possible to store this information in the clearml server so that it can be implicitly added to the requirements?
I think you are correct, and if we detect that we are using pandas to upload an artifact, we should try and make sure it is listed in the requirements
(obviously this is easier said than done)
And if instead I want to force "get()" to return me the path (e.g. I want to read the csv with a library that is not pandas) do we have an option for that?
Yes, c...
Hi EnviousStarfish54
I remember this feature request, let me check where it stands..
ScantMoth28 it should work, I think the default deployment also has an NGINX with a reverse proxy on it, switching from "http://clearml-server.domain.com/api" to "http://api.clearml-server.domain.com"
Hi FierceHamster54
Dataset already downloads multi-threaded
But yes get_local_copy() is thread / process safe
RoughTiger69
1. Move the files locally (i.e. based on the example, move folder b into folder a).
2. Create a new version with two parents ('a' and 'b').
3. Sync the local root folder ('a' in your case). Only the meta-data should change (because the referenced files are already in one of the datasets).
wdyt?
Oh then this should just work: cp -R --link b a/
You can create the same links from Python as well
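A sketch of the Python version, assuming both folders live on the same filesystem (`cp --link` creates hard links, and so does `os.link`). The temp folders and the file name are made up for the example:

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
src = os.path.join(root, "b")
dst_parent = os.path.join(root, "a")
os.makedirs(src)
os.makedirs(dst_parent)
with open(os.path.join(src, "data.csv"), "w") as f:
    f.write("x,y\n1,2\n")

# Equivalent of `cp -R --link b a/`: recreate the tree under a/b,
# hard-linking each file instead of copying its bytes.
shutil.copytree(src, os.path.join(dst_parent, "b"), copy_function=os.link)

linked = os.path.join(dst_parent, "b", "data.csv")
same_file = os.path.samefile(os.path.join(src, "data.csv"), linked)
```

Since no bytes are duplicated, the merged folder costs almost no extra disk space.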
Anyhow from your response is it safe to assume that mixing in code with the core ML task code has not occurred to you as something problematic to start with?
Correct. Actually we believe it makes it easier, as worst case scenario you can always run clearml in "offline" mode without the need for the backend, and later if needed you can import that run.
That said, regarding (3), the "mid" interaction is always the challenge, clearml will do the auto tracking/upload of the mod...
I understand, but how do you launch the clearml-agent itself: clearml-agent daemon --detached --queue default --docker
so I didn't have much time to upgrade all the packs because I have some issues with that but it is on my todo list
No worries
Quick question, if you run https://github.com/allegroai/trains/blob/master/examples/frameworks/keras/legacy/keras_tensorboard.py
Do you see models in the artifacts tab?