From code? Or from the CLI?
In both cases the dataset needs to upload the parent version somewhere; Azure blob storage is supported.
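For reference, a rough sketch of creating a child dataset version and uploading it to Azure (the project, container, and parent id below are just placeholders):
` from clearml import Dataset

# create a new version on top of an existing parent (id is a placeholder)
ds = Dataset.create(
    dataset_project="examples",
    dataset_name="my-dataset",
    parent_datasets=["<parent_dataset_id>"],
)
ds.add_files("data/")

# upload the new files to Azure blob storage, then close the version
ds.upload(output_url="azure://my-container/datasets")
ds.finalize() `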
But once I see it in the UI that means it is already launched somewhere, so I didn't quite get you.
The idea is you run it locally once (think debugging your code, or testing it)
While running the code the Task is automatically created, then once in the system you can clone / launch it.
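In code, the flow looks roughly like this (project and queue names are just for illustration):
` from clearml import Task

# running this locally once registers the Task (code, packages, arguments) in the system
task = Task.init(project_name="examples", task_name="my-experiment")

# ... your training / debugging code runs here ...

# later (from the UI or from code) clone the registered Task and enqueue it for an agent
cloned = Task.clone(source_task=task, name="my-experiment (remote)")
Task.enqueue(cloned, queue_name="default") `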
Also, I want to launch my experiments on a Kubernetes cluster and I don't actually have any docs on how to do that, so an example would be helpful here.
We are working on documenting the full process, ...
BroadSeaturtle49 agent RC is out with a fix: `pip3 install clearml-agent==1.5.0rc0`
Let me know if it solved the issue
@<1523704157695905792:profile|VivaciousBadger56>
Is the idea here the following? You want to use inversion-of-control such that I provide a function `f` to a component that takes the above dict as an input. Then I can do whatever I like inside the function `f` and return a different dict as output. If the output dict of `f` changes, the component is rerun; otherwise, the old output of the component is used?
Yes exactly! This way you...
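Something along these lines (a sketch of my own, assuming the decorator-based components):
` from clearml.automation.controller import PipelineDecorator

# cache=True: if the component code and its input dict are unchanged,
# the stored output from the previous run is reused instead of rerunning
@PipelineDecorator.component(cache=True)
def f(config: dict) -> dict:
    # do whatever you like here and return a (possibly different) dict
    config["processed"] = True
    return config `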
That is a good question. Usually the CUDA version is automatically detected, unless you override it with the conf file or an OS env. What's the setup? Are you using conda as the package manager? (conda actually installs CUDA drivers; if the original Task was executed on a machine with conda, it will take the CUDA version automatically, the reason is to match CUDA/Torch/TF)
AstonishingWorm64 I found the issue.
The clearml-serving assumes the agent is working in docker mode, as it has to have the Triton docker (where the Triton engine is installed).
Since you are running in venv mode, tritonserver is not installed, hence the error.
Hi AstonishingWorm64
I think you are correct, there is no external interface to change the docker.
Could you open a GitHub issue so we do not forget to add an interface for that?
As a temp hack, you can manually clone the "triton serving engine" Task and edit the container image (under the Execution tab).
wdyt?
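If it helps, the same temp hack can presumably be done from code as well (the task lookup and image name below are placeholders):
` from clearml import Task

# clone the "triton serving engine" Task and point the clone at another container image
serving_task = Task.get_task(project_name="serving", task_name="triton serving engine")
cloned = Task.clone(source_task=serving_task, name="triton serving engine - custom image")
cloned.set_base_docker("nvcr.io/nvidia/tritonserver:22.02-py3")
Task.enqueue(cloned, queue_name="default") `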
If there is a new issue, I will let you know in a new thread.
Thanks! I would really like to understand what the correct configuration is.
Can someone show me an example of how `PipelineController.create_draft` is used?
I think the idea is to store a draft version of the pipeline (not the decorator type, I think, but the one launching pre-executed Tasks).
GiganticTurtle0 I'm not sure I fully understand how / why you are using it, can you expand?
EDIT:
However, my intention is ONLY to create it, to be executed later on.
Hmm, so maybe enqueue it?
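For reference, a sketch of what I had in mind (names are illustrative, and I'm assuming the pre-executed-Tasks flavor of the pipeline):
` from clearml.automation import PipelineController

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="stage_one",
    base_task_project="examples",
    base_task_name="step 1",
)

# store the pipeline as a draft Task instead of launching it,
# so it can be cloned / enqueued later
pipe.create_draft() `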
Hmm should be pushed later today, meanwhile:
` from clearml import Task
from clearml.automation.trigger import TriggerScheduler


def func(*args, **kwargs):
    print('test', args, kwargs)


if __name__ == '__main__':
    s = TriggerScheduler(pooling_frequency_minutes=1.0)
    s.add_model_trigger(
        name='trigger 1', schedule_function=func,
        trigger_project='examples', trigger_on_tags=['deploy']
    )
    s.add_model_trigger(
        name='trigger 2',
        schedule_task_id='3f7...
Oh I see, the pipeline controller itself (not the components) is the one with the repo.
To fix that, add the following at the top of the script:
` from clearml import Task

Task.force_store_standalone_script()


@PipelineDecorator.pipeline(...) `
That should do the trick
Pretty confusing that neither `services` ...
StickyLizard47 basically this is how a services-queue agent should be spun up:
https://github.com/allegroai/clearml-server/blob/9b108740da21f25407bd2c59583ca1c86f8e1faa/docker/docker-compose.yml#L123
When spinning on a k8s cluster, this is a bit more complicated, as it needs to work with the clearml-k8s-glue.
See here how to spin it on k8s
https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue
The way I understand it is that the K8s glue agent is enabled by default (and I do see a Deployment for `clearml-k8sagent`)
SarcasticSquirrel56
Good start. When you say you see the Task in the "k8s_scheduler" queue, did you originally enqueue it to "default"?
Click on the "k8s_scheduler" queue, then on the right-hand side you should see your Task; click on it to open the Task page. There, click on the "Info" tab and look for "STATUS MESSAGE" and "STATUS REASON". What do you have there?
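If it's easier to check from code, a rough equivalent (the task id is a placeholder):
` from clearml import Task

t = Task.get_task(task_id="<task_id>")
print(t.get_status())           # e.g. 'queued', 'in_progress', 'failed'
print(t.data.status_message)    # the same STATUS MESSAGE shown in the Info tab
print(t.data.status_reason)     # the same STATUS REASON shown in the Info tab `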
I think this was the issue: None
And that caused TF binding to skip logging the scalars and from that point it broke the iteration numbering and so on.
Hi PanickyMoth78
it was uploading fine for most of the day but now it is not uploading metrics and at the end
Where are you uploading metrics to (i.e. where is the clearml-server) ?
Are you seeing any retry logging on your console?
`packages/clearml/backend_interface/metrics/reporter.py", line 124, in wait_for_events`
This seems to be consistent with waiting for metrics to be flushed to the backend, but usually you will see retry messages on your console when that happens
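If you want to rule out a local buffering issue, you could force a flush before the process exits; a minimal sketch:
` from clearml import Task

task = Task.init(project_name="debug", task_name="metrics-flush-test")
# ... report scalars / metrics here ...

# block until all pending metrics and uploads are actually sent to the server
task.flush(wait_for_uploads=True) `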
Thanks @<1523701713440083968:profile|PanickyMoth78> for pinging, let me check if I can find something in the commit log, I think there was a fix there...
Okay, here is standalone code that should be close enough (if I missed anything, let me know):
` import tempfile
from datetime import datetime
from pathlib import Path
import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task
task = Task.init(project_name="debug", task_name="test")
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
def normalize_img(image, labe...
IrateBee40
Check the first steps here:
https://clear.ml/docs/latest/docs/getting_started/ds/ds_first_steps
(Basically you have to generate credentials / configure your machine so it knows where the server is and how to access it)
Make sense ?
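If you'd rather not edit clearml.conf, credentials can also be set in code; a hedged sketch (hosts and keys are placeholders):
` from clearml import Task

# normally clearml-init writes these values to ~/clearml.conf for you
Task.set_credentials(
    api_host="https://api.clear.ml",
    web_host="https://app.clear.ml",
    files_host="https://files.clear.ml",
    key="<access_key>",
    secret="<secret_key>",
)
task = Task.init(project_name="examples", task_name="first-task") `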
` callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
) `
Might be! What's the actual value you are passing there?
BoredHedgehog47 can you test this one? Is it close to your code ?
I basically moved the Task.init() call below the imports
Okay, that is odd. Can you copy-paste the before/after of the import, so we can fix that?!
Okay, the type is inferred from the default value of the function step itself. That means both
`data_frame = step_one(pickle_url, extra=1337)`
and
`data_frame = step_one(pickle_url, 1337)`
will pass `extra` as an int.
That said, if the default value of the argument is missing, it will revert to str.
In order to use the type hints as casting hints, we actually need to improve `task.connect` to support the type casting (the values are stored as strings).
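A tiny illustration of that inference rule (the step itself is hypothetical):
` from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component()
def step_one(pickle_url, extra=43):
    # 'extra' has an int default, so remote execution casts the stored value back to int
    return pickle_url, extra

# both call forms below therefore pass extra as an int
data_frame = step_one("s3://bucket/data.pkl", extra=1337)
data_frame = step_one("s3://bucket/data.pkl", 1337) `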
Ex: `Expecting value: line 1 column 1 (char 0)`
`K8S Glue pods monitor: Failed parsing kubectl output:`
Run with --debug as the first parameter
Are you running the latest from the git repo ?
Seems correct.
I'm assuming something is wrong with the key/secret quoting ?!
Could you generate another one and test it ?
(you can have multiple key/secret pairs for the same user)
LudicrousDeer3 when using Logger you can provide an 'iteration' argument; is this what you are looking for?
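For reference, a minimal sketch (the values are just illustrative):
` from clearml import Task

task = Task.init(project_name="examples", task_name="manual-reporting")
logger = task.get_logger()

# 'iteration' sets the x-axis position of the reported scalar
logger.report_scalar(title="loss", series="train", value=0.23, iteration=100) `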