It also seems that PipelineDecorator.upload_artifact is not compatible with caching, sadly,
Both use the exact same mechanism for uploading artifacts (including caching for downloaded artifacts). As for caching pipeline components, that works at the component level (i.e. same code/task + same arguments = cache hit)
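For example, a minimal sketch (the component name and arguments here are just illustrative) of enabling the per-component cache:
` from clearml.automation.controller import PipelineDecorator

# cache=True: same component code + same arguments => cache hit,
# the previously stored outputs are reused instead of re-running the step
@PipelineDecorator.component(return_values=["resized_dataset_id"], cache=True)
def resize_images(dataset_id: str, size: int = 256):
    # illustrative body, replace with the actual resizing logic
    resized_dataset_id = f"resized-{dataset_id}-{size}"
    return resized_dataset_id `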
What exactly are you getting? How is it that PipelineDecorator.upload_artifact uploads to a different storage? Is that reproducible?
assuming you have hparams.my_param
my suggestion is:
` import hydra
from omegaconf import DictConfig
from clearml import Task

@hydra.main(config_path="solver/config", config_name="config")
def train(hparams: DictConfig):
    task = Task.init(hparams.task_name, hparams.tag)
    overrides = {'my_param': hparams.value}
    task.connect(overrides, name='overrides')
    # in remote execution this will print the value we put in "overrides/my_param"
    print(overrides['my_param'])
    # now we actually use overrides['my_param'] `
Make sense ?
BTW:
Error response from daemon: cannot set both Count and DeviceIDs on device request.
Googling it points to a docker issue (which makes sense, considering the error):
https://github.com/NVIDIA/nvidia-docker/issues/1026
What is the host OS?
Are you running inside a kubernetes cluster ?
a task of queue B if the next task is of type A it will have to wait,
It seems you imply there are two types of Tasks and they need to be executed one after the other ?
ClumsyElephant70
Could it be virtualenv package is not installed on the host machine ?
(From the log it seems you are running in venv mode, is that correct?)
One example is a node that resizes the images: this node receives a Dataset as input, iterates over each image, resizes it and outputs a new Dataset, which is used in the next node downstream in the Pipeline.
I agree, this sounds like a "function" rather than a job, so better suited for Kedro.
organization structure
and see for yourself (this pipeline has two nodes: train_model and predict)
Interesting! Let me dive into that and ...
I do have the SSH key placed at /root/.ssh/id_rsa on the machine,
@<1541954607595393024:profile|BattyCrocodile47> is the SSH key part of the containers? or are you saying it is on the EC2 instance ?
Hi @<1649221394904387584:profile|RattySparrow90>
Are the models I defined to be served e.g. via the CLI downloaded to the serving pod
Yes, this is done automatically and online (i.e. when you update using the CLI/API), based on the models/endpoints you set
So that they are physically lying there as a file I can see in the filesystem?
They are, and cached there
Or is it more the case that the pod gets the model when needed/when an API call for this model is incoming?
I...
Hmm I see what you mean. It is on the roadmap to add multiple models per Task (ETA the next version, 0.17; 0.16 is due in a week or so) so it is easier to see the connections in the UI. I'm assuming this will solve the problem?
Hi OutrageousGiraffe8
when I save model using tf.keras.save_model
This should create a new Model in the system (not an artifact); models have their own entity and UID.
Are you creating the Task with output_uri="gs://bucket/folder" ?
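For reference, a minimal sketch (the project/task names and the bucket path are illustrative):
` from clearml import Task

# output_uri is the destination where saved models (e.g. tf.keras.save_model output) get uploaded
task = Task.init(
    project_name="examples",
    task_name="keras training",
    output_uri="gs://bucket/folder",
) `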
RoundMosquito25 are you using clearml-agent daemon --stop, or are you killing them ?
Killing them basically means you lose them in the UI when they time out: the backend does not see them for 10 min, so it assumes they died. When you call clearml-agent daemon --stop they will unregister themselves and disappear immediately.
PompousParrot44 I think the website should address that:
https://allegro.ai/
But the TL;DR is: the enterprise version adds full Dataset Versioning on top, with end-to-end integration from code to DLOps (e.g. data sampling, database query capabilities, data visualization, multi-site support, permissions, etc.)
(Just a thought, maybe we just need to combine it with Kedro-Viz ?)
You can change the CWD folder: if you put . in the working dir it will be the root of the git repo, but you can use any subfolder; obviously you need to change the script path to match the folder, e.g. ./folder/script.py etc.
Basically just change the helm yaml: queue: my_second_queue_name_here
Actually this is the default for any multi-node training framework (torch DDP / OpenMPI etc.)
oh dear 😞 if that's the case I think you should open an Issue on pypa/pip , I'm not sure what we can do other than that ...
PompousBeetle71 Check the beginning of the log, it should print the configuration, including the access key (excluding the secret) see if it makes sense...
Yep 🙂
Basically:
` from time import sleep
from clearml import Task

task = Task.get_task(task_id='aaaa')
while task.status not in ('completed', 'stopped',):
    sleep(15)  # do something here, then poll again `
(Notice task.status / task.get_status() will refresh the Task status on every call)
Thanks SubstantialElk6 !
Happy new year 🎉 🍺 🍾 🎇
Wait I might be completely off.
Is this the line that "hangs" ?
task.execute_remotely(..., exit_process=True)
It uses only one CPU core, could I use multiprocessing somehow?
Hi EcstaticMouse10
Hmm, yes it should be multi core:
https://github.com/allegroai/clearml/blob/a9774c3842ea526d222044092172980ae505e24f/clearml/datasets/dataset.py#L1175
wdyt?
Sorry @<1524922424720625664:profile|TartLeopard58> 😞 we probably missed it
clearml-session is still being developed 🙂
Which issue are you referring to ?
I've been running my script from VSCode for the first time,
In the initial Task (the one created when running inside VSCode) do you have all the packages listed in the "Installed Packages" section ?
yes you are correct, OS environment: TRAINS_PROC_MASTER_ID=1:task_id_here
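For reference, a small sketch (assuming the value format shown above, "<master id>:<task id>") of reading it from a subprocess:
` import os

# TRAINS_PROC_MASTER_ID is set by the master process for its subprocesses
master_id = os.environ.get("TRAINS_PROC_MASTER_ID", "")
master_part, _, task_id = master_id.partition(":") `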