What is not clear to me is how you would use the callbacks to run the step locally. Are there some properties that need to be set in the task? I see that there is a start_controller_locally option for the main @PipelineDecorator.pipeline, but I don't see it for @PipelineDecorator.component
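To make it concrete, this is the kind of thing I'm after (just a sketch; PipelineDecorator.run_locally() is the closest I've found, but it forces the whole pipeline to run locally rather than a single step, and load_data and the paths are placeholders):
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=['data'])
def load_data(path):
    # placeholder step
    return path

@PipelineDecorator.pipeline(name='Main', project='examples', version='1.0')
def run_pipeline(path):
    data = load_data(path)
    return data

if __name__ == '__main__':
    # assumption: this makes both the controller and the components run in the local process
    PipelineDecorator.run_locally()
    run_pipeline('some/path')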
Hi @<1523701087100473344:profile|SuccessfulKoala55>, thanks, and how can I get the "id" to use with update for the dataset folder case?
Hi, yes I’m using the same clearml.conf on the agent, in the logs I can see that console_cr_flush_period is set to 30
Yes these are the only actions. The task is moved correctly though, I can see it under f'{config.project_id}/.pipelines' in the UI; the issue is that it's not visible under PIPELINES. I haven't tried with pipelines from tasks or functions yet.
This would work to load the local modules, but I'm also using poetry and the pyproject.toml is in the subdirectory, so the agent won't install any dependencies if I don't set the work_dir
In the meantime, any suggestions on how to set the working_dir some other way? We are moving to this new code structure and I'd like to have ClearML up and running
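For a plain task I'd imagine something along these lines (just a sketch, assuming Task.set_script() with working_dir does what I think; 'my_project', 'projects/main' and 'main.py' are placeholders), but I don't see how to do the equivalent for a pipeline component:
from clearml import Task

task = Task.init(project_name='my_project', task_name='my_task')  # placeholder names
if Task.running_locally():
    # assumption: this makes the agent cd into the subfolder,
    # so it picks up that folder's pyproject.toml
    task.set_script(working_dir='projects/main', entry_point='main.py')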
Hi @<1523701205467926528:profile|AgitatedDove14>, in my case all the code is in a subfolder, like projects/main, so if I run from the git root it can't find the local modules
So the issue is that I would like to keep the list of hyperparams and metrics; if I clean them up, I would lose them. But I agree that I might be overthinking it
Also: what's the purpose of storing the pipeline arguments as artifacts then? When it runs remotely it still runs the main script as the entrypoint and not the pipeline function directly, so all the arguments will be replaced by whatever is passed to the function during the remote execution, right?
Hi @<1523701087100473344:profile|SuccessfulKoala55> , thanks for the answer, I'll try that. Would you suggest any other simpler way to achieve the same result? I just want to get the best model according to a logged metric.
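For now I've put together something like this (a sketch; 'my_project' and the validation/accuracy metric are placeholders, and I'm assuming get_last_scalar_metrics() returns a {title: {series: {'last': ...}}} dict):
from clearml import Task

tasks = Task.get_tasks(project_name='my_project', task_filter={'status': ['completed']})

def last_accuracy(t):
    # assumption about the nesting returned by get_last_scalar_metrics()
    metrics = t.get_last_scalar_metrics()
    return metrics.get('validation', {}).get('accuracy', {}).get('last', float('-inf'))

best_task = max(tasks, key=last_accuracy)
best_model = best_task.models['output'][-1]  # last output model of the best task
print(best_model.url)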
Hi @<1523701070390366208:profile|CostlyOstrich36>, thanks, but in this case I'd like to also get the IDs of the running workers, so that I can selectively stop some of them. Is that possible somehow?
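Just so it's clear what I mean, roughly (a sketch, assuming the workers.get_all endpoint is reachable through APIClient):
from clearml.backend_api.session.client import APIClient

client = APIClient()
for worker in client.workers.get_all():
    # worker.id is what I'd use to decide which ones to stop
    print(worker.id)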
I think I found a solution using pipeline_task.move_to_project(new_project_name=f'{config.project_id}/.pipelines/{config.run_name}', system_tags=['hidden', 'pipeline'])
Same thing, it's not visible under PIPELINES
Hi @<1523701070390366208:profile|CostlyOstrich36> , sorry how would you use them exactly?
Oh nice thanks, will try with that combination
Thanks @<1523701087100473344:profile|SuccessfulKoala55> , I’ll take a look
Hi @<1523701070390366208:profile|CostlyOstrich36>, yes it's specifically with datasets. Probably the option I need is size.max_used_bytes, but it looks like it's available only for the enterprise plan? Is there any other way to clean the cache after each task?
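The only workaround I can think of right now is wiping the dataset cache myself at the end of each task, something like this (the path is an assumption based on the default cache layout and may differ depending on clearml.conf):
import shutil
from pathlib import Path

# assumed default location of the ClearML dataset cache
dataset_cache = Path.home() / '.clearml' / 'cache' / 'storage_manager' / 'datasets'
if dataset_cache.exists():
    shutil.rmtree(dataset_cache)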
For instance, I have in my_pipeline/__main__.py:
import yaml
import argparse

from my_pipeline.pipeline import run_pipeline

parser = argparse.ArgumentParser()
parser.add_argument('--config', type=str, required=True)

if __name__ == '__main__':
    args = parser.parse_args()
    with open(args.config) as f:
        config = yaml.load(f, yaml.FullLoader)
    run_pipeline(config)
and in my_pipeline/pipeline.py:
from clearml import PipelineDecorator

@PipelineDecorator.pipeline(
    name='Main',
    project=...
Hi @<1523701087100473344:profile|SuccessfulKoala55>, I think the issue is where to put the connect_configuration call. I can't put it inside run_pipeline because that only runs remotely and doesn't have access to the file, and I can't put it in the script before the call to run_pipeline since the task has not been initialized yet.
I've uploaded an example here for simplicity: None
@<1523701435869433856:profile|SmugDolphin23> then the issue is that config is not set. I also tried with:
import yaml
import argparse

from my_pipeline.pipeline import run_pipeline
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument('--config', type=str, required=True)

if __name__ == '__main__':
    if Task.running_locally():
        args = parser.parse_args()
        with open(args.config) as f:
            config = yaml.load(f, yaml.FullLoader)
    else:
        ...
I just have a for loop in some pipeline components, processing some files. I know the setting increases the flush interval, and it works when run locally: I only see a new line from tqdm every ~30s. It's just that when I run the same script in Docker using the agent, I get a new line every ~5s
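If the agent side can't be changed, I could also throttle it from my side with tqdm's mininterval, something like this sketch (files and process are placeholders):
from tqdm import tqdm

for item in tqdm(files, mininterval=30):  # refresh at most every ~30s
    process(item)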
Hi @<1523701087100473344:profile|SuccessfulKoala55> , is there any workaround?
Basically I want to run a function in parallel and have that function create multiple tasks. So I was thinking of setting up a pipeline with this hierarchy: main -> parallelized_function -> init_task_function. But I guess I could also just call Task.create in init_task_function and achieve the same result
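By that I mean something along these lines (a rough sketch; the project, names, script path and queue are placeholders):
from clearml import Task

def init_task_function(idx):
    task = Task.create(
        project_name='my_project',
        task_name=f'sub_task_{idx}',
        script='scripts/sub_task.py',
    )
    Task.enqueue(task, queue_name='default')
    return task.id

task_ids = [init_task_function(i) for i in range(4)]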
Hi @<1523701435869433856:profile|SmugDolphin23>, I just tried it but Task.current_task() returns None even when running remotely
Yes I can read it using this. I was just wondering if there is a way to read the file downloaded directly from the UI
Is there any other way to specify it besides directly in the component?
I deleted a few experiments, but they had the same kind of plots and metrics, so I don't think they would release much space
I have some git diffs logged but they are very small. For the configurations, I saw that the dataset tasks have a fairly large "Dataset Content" config (~2MB), but I only have 5 dataset tasks