
Reputation
Badges 1
25 × Eureka!you mean to spin a pod with the agent inside it (daemon in services mode).
Or connect the services queue to the k8s cluster (i.e. define the pod template that uses cpu with not a lot of ram)?
Hi ColossalDeer61 ,
Xxx is the module where my main experiment script resides.
So I think there are two options,
Assuming you have a similar folder structure-main_folder
--package_folder
--script_folder
---script.py
Then if you set the "working directory" in the execution section to "." and the entry point to "script_folder/script.py", then your code could do:from package_folder import ABC
2. After cloning the original experiment, you can edit the "installed packages", and ad...
I want to use services queue for running services, and I want to do it on k8s
So yes, as a standalone pod with the agent in venv mode (as opposed to docker mode)
Does that make sense to you?
I guess it wonโt due to the nature of services?
Correct, k8s glue works differently, that said I would actually use the helm to spin a pod woth the agent in services mode and venv mode.
Makes sense to add it to docker run by default if GPUs are mentioned in agent.
I think this is an arch thing, --privileged is not needed on ubuntu flavor, that said you can always have it if you add it here:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
clearml-agent daemon --gpus 0 --queue default --docker
But docker still sees all GPUs.
Yes --gpus should be enough, are you sure regrading the --privileged flag ?
SmugLizard25 are you saying that with the latest version it does not work?
one can containerise the whole pipeline and run it pretty much anywhere.
Does that mean the entire pipeline will be running on the instance spinning the container ?
From here: this is what I understand:
https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html
My thinking was I can use one command and run all steps locally while still registering all "nodes/functions/inputs/outputs etc" with clearml such that I could also then later go into the interface and clone an...
One example is a node that resizes the images, this node receives as input a Dataset and iterates over each image, resizes it an outputs a new Dataset, which is used in the next node downstream in the Pipeline.
I agree, this sounds like a "function" rather than a job, so better suited for Kedro.
organization structureย
ย and see for yourself (this pipeline has two nodesย
train_model
ย andย
predict
ย )
Interesting! let me dive into that and ...
This is odd, can you send th full log of the failed Task and if possible the code?
Bake to the error:
clearml_agent: ERROR: Failed getting token (error 401 from
): Unauthorized (invalid credentials) (failed to locate provided credentials)
See here:
https://github.com/allegroai/clearml-server/blob/3f2b96266bc51bfce680bd759c7fa9d635ae36d3/docker/docker-compose.yml#L131
You need to provide an access key so it can actually "talk" to the server next to it.
I think there is a bug on the UI that causes series with "." to only use the first part of the series name for the color selection. This means "epsilon 0" and "epsilon 0.1" will always get the same color, and this will explain why it works on other graphs
Hi BitterStarfish58
Where are you uploading it to?
It might be the file upload was broken?
Hmm maybe we should add a test once the download is done, comparing the expected file size and the actual file size, and if they are different we should redownload ?
Hmm BitterStarfish58 what's the error you are getting ?
Any chance you are over the free tier quota ?
BitterStarfish58 could you open a GitHub issue on it? I really want to make sure we support it (and I think it should not be very difficult)
I'm not sure the files-server supports "continue" from last position...
Since the error says network error, is it possible because I'm in Taiwan? Like downloading from Asia leads to this kind of issue
Can you download it from the browser ? (I mean the file size after download , is it 400mb?)
(currently I think the implementation expects that if the download completed, it was successful)
So are you saying the large file size download is the issue ? (i.e. network issues)
Thanks BitterStarfish58 !
BitterStarfish58 I would suspect the upload was corrupted (I think this is the discrepancy between the files size logged, to the actual file size uploaded)
metric=image is the name in the dropdown of the denugimages
HandsomeCrow5client.events.debug_images(metrics=[dict(task='6adb929f66d14731bc76e3493ab89d80', metric='image')])
Basically try with the latest RC ๐
pip install trains 0.15.2rc0
to get all the image metrics:client.events.get_task_metrics(tasks=['6adb929f66d14731bc76e3493ab89d80'], event_type='training_debug_image')