The regular trains-agent modus operandi is one job at a time (i.e. until the Task is done, no other Tasks will be pulled from the queue).
When adding --services-mode, it is not 1-1 but 1-N, meaning a single trains-agent will launch as many Tasks as it can.
The trains-agent pulls a job from the queue, spins up a docker (only dockers are supported for the time being) and lets the job run in the background (the job itself will be registered as another "worker" in the system). Then the...
LOL I keep typing clara without noticing (maybe it's the nvidia thing I keep thinking about)
Carla makes much more sense 🙂
Hi GiganticTurtle0
You can have clearml follow the dictionary and auto-update the UI:
args = task.connect(args)
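As a quick sketch (project/task names and the argument values are just placeholders): when the Task is executed remotely, values edited in the UI are written back into the connected dict.
` from clearml import Task

# placeholder defaults; these get overridden from the UI on a remote run
args = {'batch_size': 32, 'learning_rate': 0.001}

task = Task.init(project_name='examples', task_name='connect args')
# connect() registers the dict as hyperparameters and keeps it in sync
args = task.connect(args)
print(args['batch_size']) `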
Hi @<1541954607595393024:profile|BattyCrocodile47>
This looks like a docker issue running on Mac M2
wdyt?
It's always the details... Is the new Task running inside a new subprocess?
Basically there is a difference between:
1. The remote task spawning new tasks (as subprocesses, or as jobs on a remote machine), with the remote task still running
2. The remote task being replaced by a spawned task (same process?!)
UnevenDolphin73 am I missing a 3rd option? Which of these is your case?
P.S. I have a suspicion that there might be a misuse of "Task" here?! What are you considering a Task? (from clearml perspective a Task...
I see now.
Let's assume you know which snapshot that was:
` from clearml import Task

prev_task = Task.get_task(task_id='the_first_training_task_id')
# get the second-from-last checkpoint
checkpoint_url = prev_task.models['output'][-2].url
prev_scalars = prev_task.get_reported_scalars()

new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
# do some for loop and report the prev_scalars with logger.report_scalar
new_task.flush(wait_for_uploads=True)
new_task.set_initial_iteration(22000)
# start the training `
Hi ItchyJellyfish73
The behavior should not have changed.
"force_repo_requirements_txt" was always a "catch all option" to set a behavior for an agent, but should generally be avoided
That said, I think there was an issue with v1.0 (clearml-server) where when you cleared the "Installed Packages" it did not actually clear it, but set it to empty.
That sounds like the issue you are describing.
Could you upgrade the clearml-server and test?
Yes, though the main caveat is the data is not really immutable 🙂
Yes clearml is much better 🙂
(joking aside, mlops & orchestration in clearml is miles better)
CheerfulGorilla72 What are you looking for?
I think the main difference is that I can see a value of having access to the raw format within the cloud vendor and not only have it as an archive
I see, it does make sense.
Two options:
1. As you mentioned, use the ClearML StorageManager to upload the files, then register them as external links with Dataset (see the sketch below).
2. I know the enterprise tier has HyperDatasets, which are essentially what you describe, with version control over the "metadata" and the "raw storage" on GCP, including the ab...
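A rough sketch of option one (bucket, paths and names are placeholders; assumes GCS credentials are configured and a clearml version that supports external files):
` from clearml import Dataset, StorageManager

# upload the raw file to your own cloud storage, keeping its original format
remote_url = StorageManager.upload_file(
    local_file='data/img_001.png',
    remote_url='gs://my-bucket/raw/img_001.png',
)

# register it as an external link - the Dataset stores only the reference,
# the raw object stays on GCS
dataset = Dataset.create(dataset_name='raw images', dataset_project='examples')
dataset.add_external_files(source_url=remote_url)
dataset.upload()     # uploads only the metadata / file listing
dataset.finalize() `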
I thought about the fact that maybe we need to write everything in one place
It will be in the same place, under the main Task
Should work out of the box
I'm still unclear on why cloning the repo in use happens automatically for the pipeline task and not for component tasks.
I think in the pipeline it was the original default, but it turns out for a lot of users this was not their default use case ...
Anyhow you can also pass repo="."
which will load + detect the repo in the execution environment and automatically fill it in
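For example, a minimal sketch with a decorator-based pipeline component (the component itself is made up for illustration; assumes a clearml version where the component accepts repo):
` from clearml.automation.controller import PipelineDecorator

# repo="." -> use the repository detected in the current execution
# environment instead of leaving the component without a repo
@PipelineDecorator.component(repo=".", return_values=["doubled"])
def preprocess(value: int):
    doubled = value * 2
    return doubled `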
Sorry, my bad:
config_obj['sdk']['stuff']['here'] = value
that machine will be able to pull and report multiple trials without restarting
What do you mean by "pull and report multiple trials" ? Spawn multiple processes with different parameters ?
If this is the case: the internals of the optimizer could be synced to the Task so you can access them, but this is basically the internal representation, which is optimizer dependent, which one did you have in mind?
Another option is to pull Tasks from a dedicated queue and use the LocalClearMLJob ...
Hi DisturbedWalrus17
This is a bit of a hack, but will work:
` from clearml.backend_interface.metrics.events import UploadEvent
UploadEvent._file_history_size = 10 `
Maybe we should expose it somewhere, what do you think?
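For context, a sketch of how the override would be used (project/task names and the image path are placeholders); since it touches an internal attribute, it has to run before any images are reported:
` from clearml import Task
from clearml.backend_interface.metrics.events import UploadEvent

# keep the last 10 debug samples per title/series (internal attribute, may change)
UploadEvent._file_history_size = 10

task = Task.init(project_name='debug', task_name='image history')
logger = task.get_logger()
for i in range(100):
    # 'img.png' is a placeholder local image file
    logger.report_image(title='samples', series='raw', iteration=i, local_path='img.png') `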
I think the main issue is running with python -m module.name --args
Which is a bit different when trying to "understand" what the actual repository is.
Can you try to run it from the repository folder (same command), just to see if it will have any effect on the detected packages?
trains-agent should be deployed to GPU instances, not the trains-server.
The trains-agent purpose is for you to be able to send jobs to a GPU (at least in most cases) instance.
The "trains-server" is a control plane, basically telling the agent what to run (by storing the execution queue and tasks). Make sense?
Assuming you are using docker-compose, the console output is a good start
Actually this is the default for any multi-node training framework, torch DDP / openmpi etc.
We're lucky that they let the developers see their code...
LOL 🙂
and it is also set in the /clearml-agent/.ssh/config and it still can't clone it. So it must be some security issue internally.
Wait, are you using docker mode or venv mode? In both cases your SSH credentials should be at the default ~/.ssh
Hmm that makes sense to me, any chance you can open a github issue so we do not forget ? (I do not think it should be very complicated to fix)
Is it being used to ssh to the instance?
It is used for the SSH client so it "knows" the SSH server (does that make sense) ?
os.system
Yes that's the culprit, it actually runs a new process, and clearml assumes that there are no other scripts in the repository that are used, so it does not analyze them
A few options:
1. Manually add the missing requirement with Task.add_requirements('package_name'); make sure you call it before the Task.init (see the sketch after this list).
2. Import the second script from the first script. This will tell clearml to analyze it as well.
3. Force clearml to analyze the entire repository: https://g...
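A minimal sketch of option 1 ('package_name' is a placeholder for the missing package):
` from clearml import Task

# manually add the requirement the automatic analysis missed;
# must be called before Task.init()
Task.add_requirements('package_name')

task = Task.init(project_name='examples', task_name='manual requirement') `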
Now I suspect what happened is it stayed on another node, and your k8s never took care of that
well, it's only when adding a `- name` to the template
Nonetheless it should not break it 🙂
Usually in the /tmp folder under a temp filename (it is generated automatically when spun up)
In case of the services, this will be inside the docker itself
` from time import sleep
from clearml import Task
import tqdm

task = Task.init(project_name='debug', task_name='test tqdm cr cl')
print('start')
for i in tqdm.tqdm(range(100)):
    sleep(1)
print('done') `
The above example code will output a line every 10 seconds (with the default console_cr_flush_period=10), can you verify it works for you?
Hmm that is a good question, are you mounting the clearml.conf somehow?