Yes exactly, sorry
Point 3: being showcased in GrumpyPenguin23's video
Hmm, this is odd, is this a download issue? If this is reproducible maybe we should investigate further...
I'll keep you informed as I play around with it 🙂
Minimum example:
```python
import pandas as pd
from clearml import Task

task = Task.init(project_name='examples', task_name='artifact test')  # hypothetical project/task names
df = pd.DataFrame([[1, 2, 3], [1, 2, 3]])
task.upload_artifact('test', df)
task.artifacts['test'].get()  # this is the call that fails for me
```
Ah I see, ok I'll have to wait then thanks
Ok, the caching part is nice. I think the tricky part (as always) is going to be all the edge cases. E.g. in my preprocessing pipeline I might have a lot of tasks so that I can parallelise nicely, but at the cost of quite a lot of boilerplate code for getting and writing artefacts, as well as having a lot of tasks in the UI. Let's see
Also the docstring is a bit inconclusive:
    Launch every 15 minutes
        add_task(task_id='1235', queue='default', minute=15)
    Launch every 1 hour
        add_task(task_id='1235', queue='default', hour=1)
but then later:
    :param minute: If specified launch Task at a specific minute of the day (Valid values 0-60)
    :param hour: If specified launch Task at a specific hour (24h) of the day (Valid values 0-24)
The first seems to imply that 15 will launch every 15 minu...
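For reference, this is how the ambiguity plays out in code (just a sketch; the add_task call and its argument names are copied from the docstring quoted above, and the actual behaviour is exactly what I'm unsure about):
```python
from clearml.automation import TaskScheduler  # assuming this is the scheduler the docstring belongs to

scheduler = TaskScheduler()

# The example section reads as "launch every 15 minutes",
# while the ":param minute:" description reads as "launch at minute 15 of each hour".
# Same call, two very different schedules:
scheduler.add_task(task_id='1235', queue='default', minute=15)
```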
A bit fiddly to figure out, but I think it works. I can't seem to filter by artifact names (AgitatedDove14 correct me here please), but other filters work fine.
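For context, this is roughly the query I was playing with (a sketch; the task_filter fields are my guess at the syntax, and filtering on artifact names is exactly the part I couldn't get working):
```python
from clearml import Task

# project/status filters work fine for me; artifact-name filtering is the missing piece
tasks = Task.get_tasks(
    project_name='my_project',              # hypothetical project name
    task_filter={'status': ['completed']},
)
```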
AgitatedDove14
That's definitely very easy. I'm still not sure how Kedro scales on clusters. From what I saw, and I might have missed it, it seems more like a single instance with sub-processes, but no real ability to set up a different environment for the different steps in the pipeline, is this correct?
Sub-processes is an option, but it supports much more: https://kedro.readthedocs.io/en/stable/10_deployment/01_deployment_guide.html one can containerise the whole pipeline and run it pretty m...
Does that API endpoint return the same thing as get_tasks?
yes that was what I was looking for 🙂 ok no worries I have some ideas on a workaround for now 🙂
AgitatedDove14 HollowKangaroo16 have you two had any further success on the kedro/clearml front?
I have been looking into this as well. The impression I have so far is that clearml is similar to mlflow just on steroids because it provides additional capabilities around orchestration and experimentation.
AgitatedDove14
Kedro in my opinion is a really nice tool to keep a clean code base for building complex Data Science projects (consisting of one or more pipelines). The UI is really se...
are you using kedro with dagster?
AgitatedDove14 good morning first of all 😄 yeah I know the decorator is coming and that is what I am looking for. But nonetheless I still wanted to play around with things a bit and was curious about the behaviour I saw. But I also saw that it is documented 😄 sorry
any idea when that hot fix is coming?
ok so that way I'll run my own requests against the API endpoint
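i.e. something along these lines, going straight at tasks.get_all (a sketch, assuming the APIClient wrapper; the filter fields are illustrative):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
# tasks.get_all should be the endpoint backing Task.get_tasks, as far as I understand
tasks = client.tasks.get_all(
    project=['<project_id>'],   # placeholder project id
    status=['completed'],
)
```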
AgitatedDove14 any idea on what that is?
If I have a task and I upload a dataframe with task.upload_artifact('test', dataframe)
and then on the same task call task.artifacts['test'].get(), I always get an error ...
Are you able to reproduce it?
AgitatedDove14 as always much obliged for your fast responses, this is actually incredible!
Yeah, a bit clearer, something like this in the docs would be really helpful 😉 At least the last part, as StorageManager is actually quite clear.
Maybe I can sum up my understanding?
So am I right in the assumption that I can manage data, and the passing of it between tasks, either by
Managing it in a folder structure via datasets, with the potential issue of syncing a lot of data between tasks ...
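To make that first option concrete, this is roughly what I mean by managing data via datasets (a sketch based on my understanding of the Dataset API; names and paths are made up):
```python
from clearml import Dataset

# producing task: register a folder as a dataset version
ds = Dataset.create(dataset_name='preprocessed', dataset_project='my_project')
ds.add_files('data/preprocessed/')
ds.upload()
ds.finalize()

# consuming task: pull the data locally (potentially syncing a lot of it)
local_dir = Dataset.get(dataset_name='preprocessed', dataset_project='my_project').get_local_copy()
```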
AgitatedDove14 Might be just an error on my side, but if I use a pandas DataFrame as an artefact and then use the .get() method in another task, I get a compression error. If I use .get_local_copy() I can use: df = pd.read_csv(task.artifacts['bla'].get_local_copy(), compression=None)
and it works. But I need the compression=None, otherwise I get the same error as with .get()
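For completeness, this is the workaround I ended up with in the consuming task (a sketch; the producer task id and the 'bla' artifact name are placeholders from my side):
```python
import pandas as pd
from clearml import Task

# fetch the task that produced the artifact
producer = Task.get_task(task_id='<producer_task_id>')  # placeholder id

# .get() raises the compression error for me, so go through the local copy instead
local_path = producer.artifacts['bla'].get_local_copy()
df = pd.read_csv(local_path, compression=None)  # compression=None is what makes it work
```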
I'll build a minimal example tomorrow for you
Does that mean the entire pipeline will be running on the instance spinning the container ?
From here: this is what I understand:
Yes, I think that is the easiest case. However, I don't think it would be all that difficult to add metadata to the nodes that specifies what kind of queue or node they should run on.
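Something along these lines is what I have in mind (just a sketch; the gpu_queue tag convention is made up, and a small translation layer would still have to map the tag to a clearml queue):
```python
from kedro.pipeline import node


def preprocess(raw_data):
    ...


# tag the node with the queue it should be scheduled on
preprocess_node = node(
    preprocess,
    inputs="raw_data",
    outputs="preprocessed_data",
    tags=["gpu_queue"],
)
```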
Yep, this is exactly what's coming in the next release of Pipelines (RC should be out in a week or so)
Well if that is coming out soon I'll wait with further developmen...
I won't ask when the decorator is coming 😉
Would that mean that if you are running 2-3 clearml agents for 2-3 projects, their environments would have to be such that each agent could run any of the 3 projects (each having different requirements)?
What is the pattern for starting an agent within the project-specific docker container based on the task? Would that be handled via the service queue? Or can you already configure that at the task level by providing a docker file?
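For context, the pattern I was imagining (a sketch of my assumption, not necessarily how it actually works): the agent runs in docker mode and each task declares its own base image, e.g.:
```python
from clearml import Task

task = Task.init(project_name='project_a', task_name='train')  # hypothetical names

# assumption: each task carries its own container spec, so a docker-mode agent
# (e.g. `clearml-agent daemon --queue default --docker`) can spin up the right
# environment per task instead of one shared environment for all projects
task.set_base_docker('python:3.9-slim')
```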