
My colleague @ZealousCoyote89 has been looking at this – I think he used the relevant kwarg in the component decorator to specify the packages, and I think it worked, but I’m not 100% sure. Connah?
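For illustration, I believe the kwarg in question is the `packages` argument of `PipelineDecorator.component` – a minimal sketch, with placeholder package names:
```python
from clearml.automation.controller import PipelineDecorator

# Sketch only: declare the step's pip requirements explicitly via `packages`.
# The listed packages are placeholders, not the actual project requirements.
@PipelineDecorator.component(
    return_values=["features"],
    cache=True,
    packages=["pandas>=1.5", "numpy", "package1", "package2"],
)
def engineer_features(dataset_id: str):
    # imports live inside the component so the step is self-contained
    import pandas as pd
    ...
```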
Ahh. This is a shame. I really want to use ClearML to efficiently compute features but it’s proving a challenge!
Thanks
I am using the PipelineDecorator form of the pipeline and I am passing arguments as function arguments to the pipeline components
Hi John, we are using a self-hosted server with:
- WebApp: 1.9.2-317
- Server: 1.9.2-317
- API: 2.23
edit: clearml==1.11.0
To illustrate, here’s an example repo:
```
repo/
    package1/
    package2/       # installed separately to package1
    task_script.py  # requires package1 and package2 to have been pip installed
```
There are no experiments in the project, let alone the pipeline; they’ve all been archived
And the app has presumably crashed, because I can’t click the “Close” button – the whole page is totally unresponsive and I have to refresh, at which point the pipeline still exists (i.e. it was not deleted).
At one point I left it on the deletion screen (screenshot) for 20-30 mins and it didn’t do anything, so this seems to be a bug
I’m just the messenger here, didn’t set up the web app...
```python
from tempfile import mkdtemp

new_folder = with_feature.get_mutable_local_copy(mkdtemp())
```
It’s the `get_mutable_local_copy` call that causes the issue
Do notice this will only work for pipelines built from Tasks – is this a good fit for you?
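For concreteness, a rough sketch of that Task-based flavour – project, task, and queue names below are placeholders:
```python
from clearml.automation.controller import PipelineController

# Sketch: a pipeline assembled from pre-existing Tasks on the server.
pipe = PipelineController(name="pipeline_a", project="my_project", version="0.1")
pipe.add_step(
    name="stage_one",
    base_task_project="my_project",  # where the existing step Task lives
    base_task_name="step one",       # that Task is cloned for each run
)
pipe.start(queue="default")  # placeholder queue name
```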
The issue with this is that we would then presumably have to run/“build” each of the Tasks (pipeline steps) separately to put them on the ClearML server, and then get their Task IDs in order to even write the code for the Pipeline, which increases the complexity of any automated CI/CD flow. Correct me if I’m wrong.
Essentially, I think the key thing here is we want to be able to build the entire Pipe...
I used `task.flush(wait_for_uploads=True)` in the final cell of the notebook
Thanks, yes I am familiar with all of the above.
We want to validate the entire pipeline. I am not talking about using a ClearML Pipeline as the validator (which is the case in your examples).
Here is some further detail that will hopefully make things more obvious:
- The pipeline is a series of steps which creates a feature store – in fact, you might even call it a feature pipeline!
- Each pipeline step takes responsibility for a different bit of feature engineering.
- We want to val...
Basically, for a bit more context, this is part of an effort to incorporate ClearML Pipelines into a CI/CD framework. Changes to the pipeline script `create_pipeline_a.py` that are pushed to the GitHub `master` branch would trigger the build and testing of the pipeline.
And I’d rather the testing/validation etc lived outside of the ClearML Pipeline itself, as stated earlier – and that’s what your pseudo code allows, so if it’s possible that would be great. 🙂
Sorry, I don’t understand how this helps with validating the pipeline run.
Where would the validation code sit?
And the ClearML Pipeline run needs to be available on the ClearML Server (at least as a draft) so that it can be marked as in-production and cloned in the future
The issue here is that I don’t have the pipeline ID, as it is a new version of the pipeline – i.e. the code has been updated. I want to run the updated pipeline (for the first time), get its ID, and then analyse the run / update its tags (for example).
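A sketch of the workaround I’m imagining, assuming the new run’s controller Task can be looked up by project/name once it exists on the server (all names below are placeholders):
```python
from clearml import Task

# Sketch: find the newest controller Task for the freshly-run pipeline,
# then keep its ID for later analysis and tag it.
tasks = Task.get_tasks(
    project_name="my_project",         # placeholder project
    task_name="pipeline_a",            # placeholder pipeline name
    task_filter={
        "type": ["controller"],        # pipeline controllers only
        "order_by": ["-last_update"],  # newest run first
    },
)
if tasks:
    pipeline_task = tasks[0]
    print("pipeline run id:", pipeline_task.id)
    pipeline_task.add_tags(["in-production"])  # e.g. mark the run
```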
Yes, sorry, the final cell has the flush followed by the close
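i.e. something like this in the final cell (sketch):
```python
# final notebook cell: block until uploads finish, then close the task
task.flush(wait_for_uploads=True)
task.close()
```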
Yep, that’s it. Obviously it would be nice not to have to go via the shell, but that’s by the by (edit: I don’t know of a way to build or run a new version of a pipeline without going via the shell, so this isn’t a big deal).
The pseudo-code you wrote previously is what would be required, I believe
be able to get the pipeline’s Task ID back at the end
This is the missing piece. We can’t perform validation without this, afaik
Is there a rule whereby only Python native datatypes can be used as the “outer” variable? I have a dict of numpy `np.array`s elsewhere in my code and that works fine with caching.
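For example, roughly this shape works with caching (names are illustrative):
```python
from clearml.automation.controller import PipelineDecorator

# Sketch: a cached component returning a dict of numpy arrays –
# the pattern that caches fine elsewhere in the code.
@PipelineDecorator.component(return_values=["arrays"], cache=True)
def make_arrays(n: int):
    import numpy as np  # import inside the component
    return {"train": np.zeros(n), "test": np.ones(n)}
```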
The Pipeline is defined using PipelineDecorators, so currently “building and running” it just involves running the script it is defined in (which enqueues it, runs it, etc.).
This is not ideal, as I need to access the Task ID and the only methods I can see are for use within the Task/Pipeline (`Task.current_task` and `PipelineDecorator.get_current_pipeline`)
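i.e. the only handles I can find are the in-process ones, roughly (sketch; names are placeholders):
```python
from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.pipeline(name="pipeline_a", project="my_project", version="0.1")
def pipeline(dataset_id: str):
    # Both of these only work from *inside* the running pipeline,
    # not from an external CI script – which is the problem.
    controller = PipelineDecorator.get_current_pipeline()
    print("controller:", controller, "task id:", Task.current_task().id)
```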
The reason I want to check completion etc outside the Pipeline Task is that I want to run data validation etc once when the pipe...
I get an error about incorrect Task IDs – in the above pseudo-code it would be the ID of the step Task that was displayed in the error
e.g. pseudo for illustration only
```python
def get_list(dataset_id):
    from clearml import Dataset

    ds = Dataset.get(dataset_id=dataset_id)
    ds_dir = ds.get_local_copy()
    # etc...
    return list_of_objs  # one for each file, for example


def pipeline(dataset_id):
    list_of_obj = get_list(dataset_id)
    list_of_results = []
    for obj in list_of_obj:
        list_of_results.append(step(obj))
    combine(list_of_results)
```
One benefit is being able to make use of the Pipeline caching so if ne...
I have already tested that the for loop does work, including caching, when spinning out multiple Tasks.
As I say, the issue is grouping the results of the tasks into a list and passing them into another step
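To spell the pattern out – a sketch with `step` and `combine` as decorator components; `do_something`, `merge`, and `get_list` are hypothetical placeholders:
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"], cache=True)
def step(obj):
    # fan-out: one Task per element, each cached on its own input
    return do_something(obj)  # placeholder

@PipelineDecorator.component(return_values=["combined"], cache=True)
def combine(list_of_results):
    # fan-in: the collected results arrive as a single list argument
    return merge(list_of_results)  # placeholder

@PipelineDecorator.pipeline(name="pipeline_a", project="my_project", version="0.1")
def pipeline(dataset_id):
    list_of_obj = get_list(dataset_id)            # as in the pseudo-code above
    results = [step(obj) for obj in list_of_obj]  # spins out N step Tasks
    combine(results)                              # this hand-off is what fails
```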
The Dataset object itself is not being passed around. The point of showing you that was to say that the Dataset may change, and therefore the number of objects (loaded from the Dataset, e.g. a number of pandas DataFrames that were CSVs in the dataset) could change
(including caching, even if the number of elements in the list of vals changes)
Producing it now – thanks for your help, won’t be long (a few mins)
Yep, I’d be happy to run locally, but I want to automate this – so does running locally help with getting the pipeline run ID (programmatically)?