if you have 2 agents serving the same queue and then send 2 tasks to that queue, each agent should take one task
But if you queue sequentially, i.e. send one task, wait for it to finish, and then queue the next: then it is random which agent will take the task. It can be the same one as for the previous task
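For example, something like this (a minimal sketch; the project/task/queue names are placeholders):
from clearml import Task

template = Task.get_task(project_name='my_project', task_name='my_template_task')
for i in range(2):
    cloned = Task.clone(source_task=template, name=f'run_{i}')
    Task.enqueue(cloned, queue_name='my_queue')
With two agents listening on 'my_queue', each agent should pick up one of the two tasks.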
Are you saying that you have 1 agent running a task and 1 agent sitting idle while there is a task waiting in the queue and no one is processing it??
Should I get all the workers,
then go through them and count how many are in my queue of interest?
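Something like this should work (a rough sketch, assuming the worker entries returned by workers.get_all() expose their queue assignments in a queues field):
from clearml.backend_api.session.client import APIClient

client = APIClient()
workers = client.workers.get_all()
# count the workers that are serving the queue of interest
count = sum(1 for w in workers if any(q.name == 'my_queue' for q in (w.queues or [])))
print(f'{count} worker(s) serving my_queue')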
in that case yes. What happens in docker mode is:
you run a clearml agent, which then receives a task
creates a container
installs another agent inside that container
then runs that second agent inside the container
that second agent then pulls the task and does the usual build/install
CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true
needs to be set on that second agent somehow ...
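One way to get it into the container (a sketch, adjust to your setup) is to pass it through the host agent's clearml.conf, e.g. via extra_docker_arguments:
agent {
    extra_docker_arguments: ["-e", "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=true"]
}
That way every container the agent spins up gets the variable, and the agent running inside it skips the python env install.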
my code looks like this:
parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config-file', type=str, default='train_config.yaml',
                    help='train config file')
parser.add_argument('-t', '--train-times', type=int, default=1,
                    help='train the same model several times')
parser.add_argument('--dataset_dir', help='path to folder containing the preped dataset.', required=True)
parser.add_argument('--backup', action='s...
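For reference, ClearML should pick the argparse arguments up automatically as long as Task.init() is called in the same script, something like this (project/task names are placeholders):
import argparse
from clearml import Task

task = Task.init(project_name='my_project', task_name='train')

parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config-file', type=str, default='train_config.yaml', help='train config file')
args = parser.parse_args()
# the parsed arguments show up under Configuration > Hyperparameters > Args in the UI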
I don't use submodules so I don't really know how they behave with ClearML
Are the uncommitted changes in untracked files?
In other words: clearml will only save uncommitted changes from files that are tracked by your local git repo
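If the changes are in new, untracked files, one possible workaround (my assumption, not an official ClearML feature) is to mark them as intent-to-add so they show up in git diff:
git add --intent-to-add path/to/new_file.py
After that, git diff includes the new file's content, so it should end up in the task's uncommitted changes.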
Looks like your issue is not that ClearML is not tracking your changes, but more that your Configuration is being overwritten.
This often happens to me. The way I debug this is to put a lot of print statements along the code to track when the Configuration is overwritten and narrow down why. The print statements will show up in the Console tab.
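For example (a sketch; 'my_config' and the names are placeholders). Keep in mind that when the task runs under an agent, task.connect() hands back the values stored on the server/UI, which override your local defaults:
from clearml import Task

task = Task.init(project_name='my_project', task_name='debug_config')

my_config = {'lr': 0.001, 'batch_size': 32}
print('before connect:', my_config)
my_config = task.connect(my_config)  # under an agent, values come back from the server/UI
print('after connect:', my_config)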
You don't need an agent on your local machine.
You want an agent running on the GPU machine.
Local code will create an experiment in the ClearML Server, then run up to the line execute_remotely()
then stop
Once the local code stops, the ClearML Server takes over and enqueues the experiment to the prescribed queue
The agent on the GPU machine sees there is an experiment in its queue, then pulls it and executes it. This time, the clearml lib magic will make the code on the GPU machine, launched by the agent, run...
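In code it looks roughly like this (the queue name is a placeholder):
from clearml import Task

task = Task.init(project_name='my_project', task_name='train')
# everything up to here runs locally and registers the experiment on the server
task.execute_remotely(queue_name='gpu_queue', exit_process=True)
# from here on, the code only runs on the agent that pulled the task from 'gpu_queue'
# ... actual training code ...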
Feels like Docker/Kubernetes are a better fit for that purpose ...
nope, we are self-hosted in Azure
oh ..... did not know about that ...
please provide the full logs and error message.
@<1523701087100473344:profile|SuccessfulKoala55> Thanks. Managed to get it working now with
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/zscaler.crt
(Ubuntu system)
Ok I think I found the issue. I had to point the file server to azure storage:
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server: ""
    credentials {"access_key": "REDACTED", "secret_key": "REDACTED"}
}
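For reference, the overall shape is something like this (account/container names are placeholders, and the exact azure:// URI format may depend on your ClearML version; the Azure storage credentials themselves go in the sdk section):
api {
    files_server: "azure://mystorageaccount.blob.core.windows.net/clearml-artifacts"
}
sdk {
    azure.storage {
        containers: [
            {
                account_name: "mystorageaccount"
                account_key: "REDACTED"
                container_name: "clearml-artifacts"
            }
        ]
    }
}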
something like this?
thanks for all the pointers! I will try to have a good play around
interesting, the issue happens with a mamba venv. Now I use a native python venv and it is detected correctly
that format is correct as I can run pip install -r requirements.txt
using the exact same file
Ok. Found the solution.
The important thing is to use this:
Task.add_requirements("requirements.txt")
task = Task.init(project_name='hieutest', task_name='foo',reuse_last_task_id=False)
And not:
task = Task.init(project_name='hieutest', task_name='foo',reuse_last_task_id=False)
task.add_requirements("requirements.txt")
but then it is still missing a bunch of libraries in the Task (that succeeded) > Execution > INSTALLED PACKAGES
So when I do a clone of that task and try to run the clone, the task fails because it is missing python packages 😞
is task.add_requirements("requirements.txt") redundant?
Does ClearML always look for a requirements.txt in the repo root?
following your example, if the seeds are hard coded in the code, then the git hash will detect whether changes happened and whether the step needs to be run or not
how does it work if I create my pipeline from code? Does the task get the git repo state when first run and use the commit hash and uncommitted changes as a "signature"?
To me the whole point of having a pipeline is to have a system that "knows" the previous state and makes "smart" decisions on what should run and what not. If it's just about if/then/else, then the code already handles all that.
And what I struggle a bit with is finding docs on how it determines the existing state and how it decides what to run, thus the initial question
maybe I will play around a bit and ask more specific questions .... It's just that I cannot find much documentation on how the pipeline caching works (which is the main point of a pipeline?)
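From what I can tell so far, the relevant knob when building a pipeline from code is the per-step cache flag, something like this (project/task names are placeholders):
from clearml import PipelineController

pipe = PipelineController(name='my_pipeline', project='my_project', version='1.0.0')
pipe.add_step(
    name='preprocess',
    base_task_project='my_project',
    base_task_name='preprocess_template',
    cache_executed_step=True,  # reuse the previous run when code + parameters are unchanged
)
pipe.start()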
and in train.py, I have task.add_requirements("requirements.txt")
if you are on github.com, you can use a fine-grained PAT (personal access token) to limit access to the minimum. Although the token will be tied to an account, it's quite easy to swap in another one from another account.
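If the agent needs that token to clone the repo, it can go in the agent's clearml.conf (values are placeholders):
agent {
    git_user: "my-github-username"
    git_pass: "github_pat_XXXXXXXX"
}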