Each of these steps,
[2], [3], [4], [5 & 6]
can be thought of as independent Kedro nodes that can be reused in the future. Now, how to integrate this with ClearML is unclear to us.
The same can be said for ClearML: each of these steps is a clearml Task (with its own repo/environment)
It sounds like (and I might be completely off here, so please feel free to correct me) the main use for Kedro is the nice web UI of the pipeline (which I agree looks very cool).
Th...
so moving b into a won't work if some subfolders are already there
I thought that if they are already there you would merge / overwrite, isn't that what you need?
a/b/c/2.txt seems like the result of moving b from Dataset B into folder b in Dataset A, what am I missing?
(My assumption is that you have both datasets locally on the same machine and that you can just copy the files from b of Dataset B into the b folder of Dataset A)
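For what it's worth, a rough sketch of that copy/merge using the Dataset API (projects/names are placeholders, and it assumes both datasets are reachable from this machine and that the merged result should become a new version of Dataset A):

from clearml import Dataset

dataset_b = Dataset.get(dataset_project="examples", dataset_name="Dataset B")
b_local = dataset_b.get_local_copy()  # read-only cached copy of Dataset B

# new version of Dataset A, inheriting all of A's existing files
dataset_a = Dataset.get(dataset_project="examples", dataset_name="Dataset A")
merged = Dataset.create(
    dataset_name="Dataset A",
    dataset_project="examples",
    parent_datasets=[dataset_a.id],
)

# register the content of B's b/ folder under the b/ folder of the new version
merged.add_files(path=f"{b_local}/b", dataset_path="b")

merged.upload()
merged.finalize()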
This is something that we do need if we are going to keep using ClearML Pipelines, and we need it to be reliable and maintainable, so I don’t know whether it would be wise to cobble together a lower-level solution that has to be updated each time ClearML changes its serialisation code
Sorry if I was not clear, I do not mean for you to do unstable low-level access; I meant that pipelines are designed to be editable externally, they always deserialize themselves.
The only part that is mi...
DefiantHippopotamus88
HTTPConnectionPool(host='localhost', port=8081)
This will not work, because inside the container of the second docker compose "fileserver" is not defined:
CLEARML_FILES_HOST=""
You have two options:
1. Configure the docker compose to use the network host on all containers (as opposed to the isolated mode they are running in now).
2. Configure all of the CLEARML_* to point to the Host IP address (e.g. 192.168.1.55), then rerun the entire thing.
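For option 2, a small sketch of what the override could look like from the Python side (the IP is just an example; normally you would set the same variables in the environment section of the second docker compose):

import os

host_ip = "192.168.1.55"
os.environ["CLEARML_API_HOST"] = f"http://{host_ip}:8008"
os.environ["CLEARML_WEB_HOST"] = f"http://{host_ip}:8080"
os.environ["CLEARML_FILES_HOST"] = f"http://{host_ip}:8081"

from clearml import Task  # import only after the environment is set

task = Task.init(project_name="examples", task_name="host-ip-override-demo")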
- At its simplest, this could just mean checking that all of the steps and the pipeline itself have completed successfully (by checking their "Task status").
If a pipeline step ends with "failed" status, an exception will be raised in the pipeline execution function; if the exception is not caught, the pipeline itself will also fail.
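For illustration, a minimal decorator-based sketch of that behavior (names and project are placeholders):

from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def flaky_step():
    raise RuntimeError("simulated failure")  # this step ends with "failed" status

@PipelineDecorator.pipeline(name="failure-handling-demo", project="examples", version="0.0.1")
def pipeline_logic():
    try:
        result = flaky_step()
        print(result)  # using the value is typically where the exception surfaces
    except Exception as ex:
        # swallow the error so the pipeline itself can still complete successfully
        print(f"step failed, continuing: {ex}")

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # execute everything in the local process
    pipeline_logic()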
run pipeline_script.py which contains the pipeline code as decorators.
So in theory the following should actually work.
Let's assume you ...
Btw I sometimes get a gzip error when I am accessing artefacts via the '.get()' part.
Hmm this is odd, is this a download issue? if this is reproducible maybe we should investigate further...
FlutteringWorm14 any insight on the Task that it fails to delete? Or how to reproduce it?
JitteryCoyote63 I remember something with "!" in the name or maybe "/" in the name that might cause this behavior. May I suggest checking with clearml-server 1.3 ?
yea the api server configuration also went away
okay that proves it
At the top there should be the URL of the notebook (I think)
If you could provide the specific task ID then it could fetch the training data and study from the previous task and continue with the specified number of trainings.
Yes exactly, and also all the definitions for the HPO process (variables space, study etc.)
The reason that being able to continue from a past study would be useful is that the study provides a base for pruning and optimization of the task. The task would be stopped by aborting when the gpu-rig that it is using is neede...
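For illustration, a rough sketch of continuing from a previous task (task ID, artifact name and the toy objective are placeholders; it assumes the earlier run uploaded its Optuna study as an artifact named "study"):

import optuna  # needed so the pickled study object can be loaded
from clearml import Task

prev_task = Task.get_task(task_id="<previous_hpo_task_id>")

# pull the study object stored by the previous run
study = prev_task.artifacts["study"].get()

def objective(trial):
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

# the study keeps its trial history, so the sampler/pruner build on the past trials
study.optimize(objective, n_trials=20)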
This is the clearml python client, no need to change the server
same: Not Found (#404)
May I suggest DMing it to me (so it is not public)?
My question is what should be the path to the requirements.txt file?
Is it relative to the repo base?
This is actually resolved at runtime (i.e. when running the code), so it is relative to the working directory. Make sense? (You can specify an absolute path, though that is probably something I would avoid in the code base...)
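For example, assuming this is about Task.add_requirements (the exact call was not shown here), a small sketch:

from clearml import Task

# relative path, resolved against the working directory when the code runs;
# note it has to be called before Task.init()
Task.add_requirements("requirements.txt")

task = Task.init(project_name="examples", task_name="requirements-path-demo")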
Thank you! 😊
When we enqueue the task using the web-ui we get the above error
ShallowGoldfish8 I think I understand the issue,
basically I think the issue is:
task.connect(model_params, 'model_params')
Since this is a nested dict:
model_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "class_weights": {0: 1, 1: 60},
    "learning_rate": 0.1
}
The class_weights keys are stored as String keys, but catboost expects "int" keys, hence it fails.
One op...
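As an illustration only (this may not be the option being described here), one possible workaround is to cast the keys back to int right after connect():

from clearml import Task

task = Task.init(project_name="examples", task_name="nested-params-demo")

model_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "class_weights": {0: 1, 1: 60},
    "learning_rate": 0.1,
}
task.connect(model_params, "model_params")

# connect() round-trips the nested dict through the backend/UI, which turns the
# int keys into strings; convert them back so catboost gets {0: ..., 1: ...} again
model_params["class_weights"] = {
    int(k): v for k, v in model_params["class_weights"].items()
}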
K8s + clearml-agent integration.
Hmm is this an on-prem k8s cluster?
So the issue is that you have two reference branches on the local git, one to gitlab and one to gitea, and it fails to understand which one is the correct remote ...
I wonder if "git ls-remote --get-url" will always work ?!
My bad you have to pass it to the container itself:
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L149
extra_docker_arguments: ["-e", "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1"]
Hi ScaryLeopard77
You can probably do:
Task.init(..., continue_last_task='task_id_here')
This will continue a previously executed Task and log both steps in the same place.
Does that help?
BTW: you can also of course manually report to any Task as it is still running with:
aux_task = Task.get_task(task_id_here)
aux_task.get_logger().report_scalar(...)
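Putting the two together, a slightly fuller sketch (IDs and names are placeholders):

from clearml import Task

# continue a previously executed task: new logs/scalars land in the same Task
task = Task.init(
    project_name="examples",
    task_name="continued-run",
    continue_last_task="<task_id_here>",
)

# manually report into some other (still running) task
aux_task = Task.get_task(task_id="<other_task_id>")
aux_task.get_logger().report_scalar(
    title="loss", series="val", value=0.123, iteration=10
)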
EnviousPanda91 'connect' will log the object properties; the automagic logging is controlled in the Task.init call. Specifically, which framework produces metrics that are not logged? Your sample code manually reports some scalars/values, do you see these as well?
I think the main issue is that for some reason the running container changed one of the files inside the temp folder. Then the host machine is "stuck" with a file that the root user owned/changed, and now it cannot reuse / delete the temp folder.
I think the fix is to make sure the container deletes the temp folder when it is done
Hi UnevenDolphin73
In theory it "might" work, I have to admit that personally I'm not a fan of what Amazon did to Mongo, i.e. forking their code base and selling it as a service, just bad open-source practice
(The main issue might be API calls that might not fully match)
wdyt?
Basic setup:
- one glue service per "job template" (e.g. k8s resources, for example CPU requirement, or GPU requirement)
- one queue per glue service, e.g. a cpu_machine queue and a 1xGPU queue
wdyt?
I see, so in theory you could call add_step with a pipeline parameter (i.e. pipe.add_parameter etc.)
But currently the implementation is such that if you are starting the pipeline from the UI
(i.e. rerunning it with a different argument), the pipeline DAG is deserialized from the Pipeline Task (the idea being that one could control the entire DAG externally without changing the code)
I think a good idea would be to actually allow the pipeline class to have an argument saying always create from cod...
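For reference, a minimal sketch of the add_parameter / parameter_override wiring (project, task names and the Args section are placeholders):

from clearml import PipelineController

pipe = PipelineController(name="param-demo", project="examples", version="0.0.1")

# pipeline-level parameter, editable from the UI when the pipeline is re-run
pipe.add_parameter(name="dataset_id", default="abc123")

pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train_task",
    # pass the pipeline parameter down into the step's hyperparameters
    parameter_override={"Args/dataset_id": "${pipeline.dataset_id}"},
)

pipe.start_locally(run_pipeline_steps_locally=True)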
One last thing: make sure you spin up the pod container in privileged mode, because the trains-agent docker will spin up a sibling docker for your actual experiment.
WickedGoat98 sorry, I missed the thread...
that the trains.conf has to be located on the node running the trains-agent.
Correct 🙂
The easiest way to check is to see if you can curl to the ip:port from the docker.
If that fails, it is probably the wrong IP.
The IP you need to use is the IP of the machine running the docker-compose (not the IP of the docker inside that machine).
Make sense ?
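If curl is not handy inside the container, a quick Python equivalent (the IP is a placeholder):

import urllib.error
import urllib.request

host_ip = "192.168.1.55"  # the machine running docker-compose, NOT the container IP
for port in (8008, 8080, 8081):  # api server, web server, fileserver
    url = f"http://{host_ip}:{port}"
    try:
        urllib.request.urlopen(url, timeout=5)
        print(f"{url} reachable")
    except urllib.error.HTTPError as ex:
        print(f"{url} reachable (HTTP {ex.code})")  # the server answered, even if with an error code
    except Exception as ex:
        print(f"{url} NOT reachable: {ex}")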