SubstantialElk6
The ~<package name with the first letter dropped> == a.b.c entry
is a known conda/pip issue: it is a leftover from a previous (interrupted) package install.
The easiest fix is to find the site-packages folder and delete the leftover package, or create a new virtual environment.
BTW:
pip freeze will also list these broken packages
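If it helps, something along these lines can spot the leftovers before you delete them manually (just a sketch; site.getsitepackages() and the cleanup step are assumptions about your setup, so verify before removing anything):
```
import site
from pathlib import Path

# pip/conda leave interrupted installs behind as "~<name>" folders inside
# site-packages; these are the entries that show up broken in `pip freeze`
for sp in site.getsitepackages():
    for leftover in Path(sp).glob("~*"):
        print("leftover install:", leftover)
        # once verified, delete the folder manually, e.g.:
        # import shutil; shutil.rmtree(leftover)
```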
Sure:
```
Dataset.create(..., use_current_task=True)
```
This will basically attach/make the main Task the Dataset itself (Dataset is a type of a Task, with logic built on top of it)
wdyt ?
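For reference, a minimal sketch of the idea (the project/dataset names and file path here are just placeholders):
```
from clearml import Task, Dataset

task = Task.init(project_name="examples", task_name="create dataset")

# use_current_task=True attaches the Dataset to the current (main) Task,
# i.e. the main Task becomes the Dataset task itself
dataset = Dataset.create(
    dataset_project="examples",
    dataset_name="my_dataset",
    use_current_task=True,
)
dataset.add_files("./data")
dataset.upload()
dataset.finalize()
```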
Hi PanickyMoth78
So the current implementation of the pipeline parallelization works exactly like python async function calls:
```
for dataset_conf in dataset_configs:
    dataset = make_dataset_component(dataset_conf)
    for training_conf in training_configs:
        model_path = train_image_classifier_component(training_conf)
        eval_result_path = eval_model_component(model_path)
```
Specifically here, since you are passing the output of one function to another, imagine what happens is a wait operation, hence it ...
Come to think of it, maybe we should have "parallel_for" as a utility for the pipeline since this is so useful
model_path/run_2022_07_20T22_11_15.209_0.zip , err: [Errno 28] No space left on device
Where was it running?
I take it that these files are also brought onto the pipeline task's local disk?
Unless you changed the object, then no, they should not be downloaded (the "link" is passed)
I assume now it downloads "more" data as this is running in parallel (and yes I assume that before it deleted the files it did not need)
But actually, at least from a first glance, I do not think it should download it at all...
Could it be that the "run_model_path" is a "complex" object of a sort, and it needs to test the values inside ?
How did you define the decorator of "train_image_classifier_component" ?
Did you define:
```
@PipelineDecorator.component(return_values=['run_model_path', 'run_tb_path'], ...
```
Notice two return values
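i.e. something along these lines (just a sketch; the body and paths are illustrative, and returning str(...) instead of pathlib.Path is only a suggestion to rule out the "complex object" concern above):
```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["run_model_path", "run_tb_path"])
def train_image_classifier_component(training_conf):
    from pathlib import Path
    # ... training happens here ...
    run_model_path = Path("model_path") / "run.zip"   # illustrative
    run_tb_path = Path("tb_path") / "run"             # illustrative
    # return plain strings rather than pathlib.Path objects
    return str(run_model_path), str(run_tb_path)
```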
These paths are pathlib.Path. Would that be a problem?
No need to worry, it should work (I'm assuming "/src/clearml_evaluation/" actually exists on the remote machine, otherwise useless 🙂)
I located the issue, I'm assuming the fix will be in the next RC 🙂
(probably tomorrow or before the weekend)
You can try a direct API call for all the Tasks together:
```
Task._query_tasks(task_ids=[IDS here], only_fields=['last_metrics'])
```
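For example (a sketch; this is an internal call, and the exact shape of the returned objects is an assumption, so treat it as illustrative only):
```
from clearml import Task

task_ids = ["<task_id_1>", "<task_id_2>"]  # your task IDs here
results = Task._query_tasks(task_ids=task_ids, only_fields=["last_metrics"])
for t in results:
    print(t.id, t.last_metrics)
```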
it does appear on the task in the UI, just somehow not repopulated in the remote run if it's not a part of the default empty dict…
Hmm that is the odd thing... what's the missing field ? Could it be that it is failing to Cast to a specific type because the default value is missing?
(also, is the issue present in the latest clearml RC? It seems like a task.connect issue)
Hmm good point, it should probably return the clearml python version. Is this what you mean?
Hmm I think this is not doable ... 😞
(the underlying data is stored in DBs and changing it is not really possible without messing about with the DB)
Hi SmallDeer34
Can you see it in TB ? and if so where ?
Yes it should
here is a fastai example, just in case 🙂
https://github.com/allegroai/clearml/blob/master/examples/frameworks/fastai/fastai_with_tensorboard_example.py
Hi NastyFox63 could you verify the fix works?
```
pip install git+
```
EnviousStarfish54 generally speaking the hyper parameters are flat key/value pairs. you can have as many sections as you like, but inside each section, key/value pairs. If you pass a nested dict, it will be stored as path/to/key:value (as you witnessed).
If you need to store a more complicated configuration dict (nesting, lists etc), use the connect_configuration, it will convert your dict to text (in HOCON format) and store that.
In both cases you can edit the configuration and then when ru...
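For example (a sketch; the project, section, and configuration names are just placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="config demo")

# flat key/value hyper parameters (stored per section, editable in the UI)
params = {"lr": 0.001, "batch_size": 32}
params = task.connect(params, name="training")

# nested / complex configuration: stored as an editable text configuration
config = {"model": {"layers": [64, 128], "dropout": 0.1}}
config = task.connect_configuration(config, name="model_config")
```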
Another point I see is that in the workers & queues view the GPU usage is not being reported
It should be reported; if it is not, maybe you are running the trains-agent in cpu mode? (try adding --gpus)
models being trained stored ...
mongodb will store the url links, the upload itself is controlled via the "output_uri" argument to the Task
If None is provided, Trains logs the locally stored model (i.e. a link to where you stored your model); if you provide one, Trains will automatically upload the model (into a new subfolder) and store the link to that subfolder.
- how can I enable tensorboard and have the graphs stored in trains?
Basically if you call Task.init all your...
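For example (a sketch; the output_uri value is just a placeholder):
```
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="training",
    # with output_uri set, models saved by your framework are uploaded there
    # and the link is stored; with the default (None) only the local path is logged
    output_uri="s3://my-bucket/models",
)
# from this point on, TensorBoard reports and framework model saves are picked up automatically
```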
WickedGoat98
The trains-agent-services docker is always CPU, the idea is to put long-lasting services there (like the auto cleanup or slack integration or HPO etc.)
To spin up an agent with a GPU on any machine (regardless of where the trains-server is) you can check the trains-agent readme.
https://github.com/allegroai/trains-agent#running-the-trains-agent
Thanks @<1689446563463565312:profile|SmallTurkey79> ! 🙂
WickedGoat98 the mechanism of cloning and parameter overriding works only when the trains-agent is launching the experiment. Think of it this way:
Manual execution: trains sends data to server
Automatic (trains-agent) execution: trains pulls data from the server
This applies to argparse, connect, and connect_configuration alike.
The trains code itself acts differently when it is executed from the 'trains-agent' context.
Does that help clear things ?
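A quick sketch of what that means in code (values are just placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="override demo")

params = {"lr": 0.01, "epochs": 10}
# manual run: these values are sent to the server and recorded
# trains-agent run (cloned task): the values edited in the UI are pulled from
# the server and override the dict returned here
params = task.connect(params)
print(params["lr"])  # reflects the UI value when executed by the agent
```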
(the payload is not in the correct form, can that be a problem?)
It might, but I assume you will get a different error
Hi JitteryCoyote63
I think there is a GitHub issue (request) on it; this is not very trivial to build (basically you need the agent to first temporarily pull the git repo, apply the changes, build the docker, remove the temp build, and restart with the new image)
Any specific reason for not pushing a docker, or using the extra docker bash script on the Task itself?
GreasyPenguin14 could you test with the matplotlib lib example ? (I cannot reproduce it and it seems like something to do with pycharm and matplotlib backend)
https://github.com/allegroai/clearml/blob/master/examples/frameworks/matplotlib/matplotlib_example.py
GreasyPenguin14 we never had trouble with Task.init (or any other clearml calls) while working with the pycharm debugger, we use it quite extensively ...
Actually on a very similar setup...
Could you send the full log?
Or maybe a code snippet to reproduce this behavior ?
(We did notice they fixed a few issues with the debugger in 2020.3.3 so it's worth upgrading)
GreasyPenguin14 could you test with 0.17.5rc4?
Also what's the PyCharm / OS?