I get gaps in the graphs.
For example, the first time I run, I create a task and run a loop:
Hi SourOx12
Is this related to this one?
https://github.com/allegroai/clearml/issues/496
TrickySheep9
You are absolutely correct
VivaciousWalrus99
Yes, this is odd:
1608392232071 spectralab:gpu0 DEBUG New python executable in /cs/usr/gal.hyams/.trains/venvs-builds/3.7/bin/python2
So it thinks it has Python 3.7, but it is using python2 in the venv...
In your trains.conf file, set agent.python_binary to the python3.7 binary. It should be something like:
agent.python_binary=/path/to/python/python3.7
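i.e. in ~/trains.conf something along these lines (just a sketch, the interpreter path here is a placeholder, point it at whatever python3.7 resolves to on your machine):

agent {
    # force the agent to build its venvs with this interpreter
    python_binary: "/usr/bin/python3.7"
}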
I am logging debug images via TensorBoard (via the add_image function), however apparently these debug images are not collected within the fileserver,
ZanyPig66 what do you mean not collected to the file server? are you saying the TB add_image is not automatically uploading images? or that you cannot access the files on your files server?
I am trying to use the configuration vault option but it doesn't seem to apply the variables I am using.
Hi EmbarrassedSpider34 I think this is an enterprise feature...
Managed to get the credentials attached to the configuration when the task is spun up,
I'm assuming env variables?
Hi OutrageousGiraffe8
when I save a model using tf.keras.save_model
This should create a new Model in the system (not artifact), models have their own entity and UID.
Are you creating the Task with output_uri="gs://bucket/folder" ?
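For reference, that would be something like this (project/task names here are placeholders):

from clearml import Task
task = Task.init(project_name="examples", task_name="keras training", output_uri="gs://bucket/folder")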
OutrageousGiraffe8 this sounds like a bug, how can we reproduce it?
Maybe add another layer here?
https://github.com/allegroai/clearml/blob/a47f127679ebf5912690f7c3e60791a2daa5c984/examples/frameworks/tensorflow/tensorflow_mnist.py#L40
Hi OutrageousGiraffe8
I was not able to reproduce
Python 3.8 Ubuntu + TF 2.8
I get both metrics and model stored and uploaded
Any idea?
OutrageousGiraffe8 so basically replacing it with:
self.d1 = ReLU()
Wtf? Can you try with = (notice single, not double)?
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
ZanyPig66 is this reproducible? This sounds like a bug. What's the TB version and OS you are using?
Is this example working for you (i.e. do you see debug images)?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_tensorboard.py
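The basic pattern is just Task.init plus TensorBoard's add_image, roughly this minimal sketch (project/task names and the random image are placeholders, not taken from that example):

from clearml import Task
from torch.utils.tensorboard import SummaryWriter
import numpy as np

task = Task.init(project_name="examples", task_name="tb debug images")  # placeholder names
writer = SummaryWriter()
img = np.random.randint(0, 255, (3, 64, 64), dtype=np.uint8)  # CHW image
writer.add_image("debug/sample", img, global_step=0)  # should appear as a debug sample
writer.close()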
Sounds good. CheerfulGorilla72 could I ask you to open a GitHub issue and suggest it? Just so we do not forget?
Hi ExcitedFish86
In Pytorch-Lightning I use DDP
I think a fix for pytorch multi-node / process distribution was committed to 1.0.4rc1, could you verify it solves the issue? (rc1 should fix this specific issue)
BTW: no problem working with clearml-server < 1
Are you suggesting just taking the read_and_process_file function out of the read_dataset method?
Yes
As for the second option, you mean create the task in the init method of the NetCDFReader class?
Correct
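Just to illustrate both points, a rough sketch (read_and_process_file, read_dataset and NetCDFReader are from your code, the signatures here are guesses):

from clearml import Task

def read_and_process_file(path):
    # now a standalone function, no longer a method of NetCDFReader
    ...

class NetCDFReader:
    def __init__(self, project_name, task_name):
        # second option: create the Task when the reader is constructed
        self.task = Task.init(project_name=project_name, task_name=task_name)

    def read_dataset(self, paths):
        return [read_and_process_file(p) for p in paths]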
It would be a great idea to make the Task picklable,
Adding that to the next version's to-do list
at that point we define a queue and the agents will take care of training
This is my preferred way as well :)
hi @<1546303293918023680:profile|MiniatureRobin9>
I can still see the metrics in Grafana.
It will not delete it from Grafana, it means it's no longer collected. Make sense?
Hi MinuteGiraffe30
Are you saying that when you are running your code locally with a Gitea repository, clearml incorrectly adds a link to GitLab?
GrotesqueDog77 one issue with this design: in order to run a sub-component, the call must be done from the parent component. Does that make sense?
def step_one(data):
    return data

def step_two(path):
    return model

def both_steps():
    path = step_one("stuff")
    return step_two(path)

def pipeline():
    both_steps()

Which would make both_steps a component and step_one and step_two sub-components
wdyt?
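For completeness, one way this could look with PipelineDecorator (a sketch of the idea above, not a drop-in: here only both_steps is registered as a component, and step_one / step_two are packed in as helper functions it calls directly):

from clearml import PipelineDecorator

def step_one(data):
    return data

def step_two(path):
    return path  # stand-in for the trained model

@PipelineDecorator.component(return_values=["model"], helper_functions=[step_one, step_two])
def both_steps():
    path = step_one("stuff")
    return step_two(path)

@PipelineDecorator.pipeline(name="pipeline", project="examples", version="1.0")
def pipeline():
    both_steps()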
No, it is zipped and stored, so in order to open the zipfile and read the files you have to download them.
That said, everything is cached, so if the machine already downloaded the dataset there is zero download / unzipping.
Make sense?
Would an implementation of this kind be interesting for you, or do you suggest forking?
You mean adding a config map storing a default trains.conf for the agent?
I think this is great! That said, it only applies when you are spinning up agents (the default helm is for the server). So maybe we need another one? Or an option?
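i.e. roughly something like this (purely illustrative, the name and the config values are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: trains-agent-conf
data:
  trains.conf: |
    agent {
      # default agent settings, e.g. git credentials
      git_user: ""
      git_pass: ""
    }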
Hi TrickyRaccoon92
TKinter is suddenly used as the backend, and instead of writing to the dashboard I get popups per figure.
Are you running with an agent or manually executing the code?
Hi SpotlessLeopard9
I got many tasks that were just hanging at the end of the script without ...
I remember this exact issue was fixed with 1.1.5rc0, see here:
https://clearml.slack.com/archives/CTK20V944/p1634910855059900
Can you verify with the latest RC?
pip install clearml==1.1.5rc3
Still feels super hacky tho, think it would be nice to have a simpler way or at least some nice documentation
YES you are absolutely correct, we should add it to the Task interface.
Any chance you can add a GitHub issue so we do not forget?
I have to leave, I'll be back online in a couple of hours.
Meanwhile, see if the ports are correct (just curl all the ports and see if you get an answer). If everything is okay, try again to run the text example
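e.g. assuming the default clearml-server ports:

curl http://<server-ip>:8080   # web UI
curl http://<server-ip>:8008   # API server
curl http://<server-ip>:8081   # file server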
YEY
or do you mean the machine I ran the experiment on locally?
Yes, this one