
Hi SkinnyPanda43
Do you mean the clearml-agent or the clearml python package (a.k.a. the auto package detection)?
MelancholyChicken65 found it! Thank you for finding this issue.
I'm hoping to get an update soon
Hi DangerousDragonfly8
You mean you want to trigger something when users archive a Task ?
I guess I would need to put this in the extra_vm_bash_script param of the auto-scaler, but it will reboot in a loop right? Isn't there an easier way to achieve that?
You can edit the extra_vm_bash_script, which means the next time an instance is booted the bash script will be executed.
In the meantime, you can ssh to the running instance and change the ulimit manually, wdyt?
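For illustration, something along these lines could go into that field (just a sketch of the bash content; the exact configuration structure around it is an assumption):
` # Hypothetical sketch: bash content for the autoscaler's extra_vm_bash_script
# field, so every newly booted instance raises its open-file limit at startup.
extra_vm_bash_script = "\n".join([
    "ulimit -n 65535",  # raise the soft limit for open files in this shell and its children
]) `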
Is there a way to do this all elegantly?
Oh yes there is, this is how the TaskB code will look:
` from clearml import Task
import torch

task = Task.init(..., 'task b')
param = {'TaskA': 'TaskA ID HERE'}
task.connect(param)
taska_model = Task.get_task(param['TaskA']).models['output'][-1]
model = torch.load(taska_model.get_local_copy())
# ... train ...
torch.save(model, 'model_b.pt') `I might have missed something there, but generally speaking this will let you:
Select TaskA as a parameter of TaskB's training process. It will automagically register Task A's...
because fastai's tensorboard doesn't work in multi gpu
keep me posted when this is solved, so we can also update the fastai2 interface,
https://github.com/allegroai/clearml/issues/199
Seems already supported for a while now ...
What's the general pattern for running a pipeline - train a model, evaluate metrics, and publish the model if satisfactory (based on a threshold, for example)?
Basically I would do (see the sketch after this list):
parameters for pipeline:
TaskA = Training model Task (think of it as our template Task)
Metric = title/series/sign we want to choose based on, where sign is max/min
Project = Project to compare the performance so that we could decide to publish based on the best Metric.
Pipeline:
Clone TaskA Change TaskA argu...
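Putting those steps together, a rough sketch could look like this (the metric title/series, the threshold, and the publish call are assumptions, not an official recipe):
` from clearml import Task
from clearml.backend_api.session.client import APIClient

# Clone the template training Task (TaskA) and enqueue it for execution
cloned = Task.clone(source_task='TaskA ID HERE', name='pipeline training step')
Task.enqueue(cloned, queue_name='default')
cloned.wait_for_status(status=[Task.TaskStatusEnum.completed])
cloned.reload()

# Pull the last reported value of the chosen Metric (title/series are placeholders)
metric = cloned.get_last_scalar_metrics()['validation']['accuracy']['last']

# Decide whether to publish, e.g. against a threshold or the best run in the Project
if metric > 0.9:
    APIClient().tasks.publish(task=cloned.id)  # assumption: publishing via the REST API client `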
WackyRabbit7 basically starting with v1.1, if you are running code without any configuration file you will get an error (in contrast to previous versions, where it defaulted to the demo-server)
- There is a workaround for fastai.launch that is probably similar to this one:
I think you can do the launching "manually", something like:
https://github.com/allegroai/clearml/blob/fd2d6c6f5d46cad3e406e88eeb4d805455b5b3d8/examples/frameworks/pytorch/pytorch_distributed_example.py#L160
At least until we understand how to fix it automatically
Hmm you mean like overrides ?
Maybe store both before/after resolving ?
(Although that might be confusing, as the before-resolving version should actually be read-only)
Yes, that should work. The only thing is you need to call Task.init on the master process (and make sure you call Task.current_task() on the subprocesses, if you want the automagic to kick in). That said, usually there is no need, they are supposed to report everything back to the main one anyhow.
basically
` @call_parse
def main(
    gpus:Param("The GPUs to use for distributed training", str)='all',
    script:Param("Script to run", str, opt=False)='',
    args:Param("Args to pass to script", nargs=...
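And a minimal sketch of that master/worker split (not the fastai integration itself, just the Task.init / Task.current_task() pattern, loosely following the linked pytorch_distributed_example):
` from multiprocessing import Process

from clearml import Task


def worker(rank):
    # Re-attach to the Task created by the master process (works with the default fork start method)
    task = Task.current_task()
    task.get_logger().report_scalar('workers', 'rank', value=rank, iteration=0)


if __name__ == '__main__':
    # Only the master process calls Task.init
    task = Task.init(project_name='examples', task_name='manual distributed launch')
    processes = [Process(target=worker, args=(i,)) for i in range(2)]
    [p.start() for p in processes]
    [p.join() for p in processes] `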
Can you add the following before the Task.init call?
import os
print(os.environ)
Set it on the PID of the agent process itself (i.e. the clearml-agent python process)
That would match what add_dataset_trigger and add_model_trigger already have, so it would be good
Sounds good, any chance you can open a github issue, so that we do not forget?
Another parameter for when the task is deleted might also be useful
That actually might be more complicated, because there might be a race condition, basically missing the delete operation...
What would be the use case?
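For reference, this is roughly what the existing triggers look like today (a sketch, parameter names are from memory and may differ slightly):
` from clearml.automation import TriggerScheduler


def on_model_publish(model_id):
    # Called with the ID of the model that fired the trigger
    print('model published:', model_id)


trigger = TriggerScheduler(pooling_frequency_minutes=3)
trigger.add_model_trigger(
    schedule_function=on_model_publish,
    trigger_project='examples',
    trigger_on_publish=True,
)
trigger.start() `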
DepressedChimpanzee34
I am actually curious now, why is the default like this? maybe more people are facing similar bottlenecks?
On "regular" load there is no need for multiple processes, and the memory consumption might be more important than reply lag (at least before you start to scale)
DisturbedWalrus17
By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot
Can you try with even more ...
BTW: for future reference, if you set the ulimit in the bash session, all processes created after that should inherit the new ulimit
Hi PompousParrot44
You can check the cleanup service example.
It sleeps for 24 hours then spins up and does its thing.
You can always launch these service tasks on the services queue, whose purpose is to run such services on the trains-server machine as additional CPU services. They will also be registered as service nodes, so you have visibility into which services are running.
In order to clone a task and wait for its completion, use the TrainsJob:
https://github.com/allegroai/trains/blob/65a4a...
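A minimal sketch of that flow, assuming the TrainsJob interface in the linked file (the import path and argument names may differ, please check the source):
` from trains.automation.job import TrainsJob

job = TrainsJob(base_task_id='template Task ID HERE')
job.launch(queue_name='default')
job.wait()  # block until the cloned task finishes
print('cloned task finished with status:', job.status()) `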
Hmm that is a good idea, and I think you are correct, it cannot currently support it. But it would be easy to do, maybe adding an argument trigger_on_archive? wdyt?
Hmm is "model_monitoring_eps" another version of the model and it does not have all the properties of the "original" one?
Hi OddAlligator72
for instance - remove all the metrics from some step onward?
I think that as long as the Task is not published, you could do such a thing directly with the RestAPI (a.k.a. APIClient from python)
What's the use case?
is the model overridden or is its version automatically increased?
You will have another model, with the same name (assuming the second Task has the same name), but a new ID. So if I understand you correctly, we have auto-versioning :)
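You can see it from the SDK as well, something along these lines (the task IDs are placeholders):
` from clearml import Task

for task_id in ('first run ID HERE', 'second run ID HERE'):
    model = Task.get_task(task_id=task_id).models['output'][-1]
    print(model.name, model.id)  # same model name, different model ID `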
Hi @<1570583227918192640:profile|FloppySwallow46>
Hey I have a question, Can you monitor the time for one pipeline,
you mean to see the start / end time of the pipeline?
Click on the details link on the right hand side and you will have all the details on the pipeline task, including running time
GreasyPenguin14 I think this is what you are looking for: Task.get_project_id('project_name')
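For example (assuming the project already exists, 'project_name' is a placeholder):
` from clearml import Task

project_id = Task.get_project_id('project_name')
print(project_id)  # the project's unique ID (or None if it does not exist) `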