I'm so glad you mentioned the cron job; it would have taken us hours to figure
Hi DepressedChimpanzee34
How do I reproduce the issue?
What are we expecting to get there?
Is that a Colab issue or a hyper-parameter encoding issue?
WackyRabbit7 If you have an idea for an interface to shut it down, please feel free to suggest it!
If I point directly to the data.yaml, the training starts without any problem
What do you mean? How do you know where the extracted file is?
basically:
data_path = Dataset.get(...).get_local_copy()
then you should be able to open your file with open(data_path + "/data.yaml", "rt")
does that work?
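For reference, a self-contained version of that pattern (the dataset name/project here are just placeholders):
```python
import os
from clearml import Dataset

# Fetch (or reuse a cached) local copy of the dataset
ds = Dataset.get(dataset_name="my_dataset", dataset_project="my_project")  # placeholder names
data_path = ds.get_local_copy()

# The extracted files live under data_path
with open(os.path.join(data_path, "data.yaml"), "rt") as f:
    print(f.read())
```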
Hi SmoggyGoat53
There is a storage limit on the file server (basically a 2GB per-file limit); this is the cause of the error.
You can upload the 10GB to any S3-like solution (or a shared folder). Just set the "output_uri" on the Task (either at Task.init or with Task.output_uri = "s3://bucket")
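Something like this (the bucket name is just an example):
```python
from clearml import Task

# Option 1: set the default output destination at init time
task = Task.init(
    project_name="examples",        # placeholder
    task_name="big-artifacts",      # placeholder
    output_uri="s3://my-bucket/models",
)

# Option 2: set it on an existing task
task.output_uri = "s3://my-bucket/models"
```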
Hmm, so yes, that is true: if you change the bucket values you will also have to adjust them manually in Grafana. I wonder if there is a shortcut here; the data is stored in Prometheus, and I would rather avoid deleting old data. Wdyt?
I can't find out how to pass my custom clearml.conf
Hi @<1544491301435609088:profile|TeenyElk27>
The easiest is to map it into the container in your docker-compose
(map a host clearml.conf into /root/clearml.conf inside the container)
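Something along these lines in the docker-compose file (the service name and host path here are just examples; adjust them to your setup):
```yaml
services:
  agent-services:                  # example service; add the mapping where relevant
    volumes:
      - /path/on/host/clearml.conf:/root/clearml.conf
```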
No, TB (TensorBoard) is not enabled.
That explains it 🙂 Did you manage to get it working?
Can you see the repo itself? The commit ID?
From the docs, I think what's going on is that https://opennmt.net/OpenNMT-tf/package/opennmt.Runner.html#opennmt.Runner.train is spawning a new subprocess, and the training itself happens in that subprocess.
If this is the case, it would explain the lack of automagic, as the subprocess never makes the "Task.init" call
wdyt, could that be the case?
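To illustrate the suspicion (using multiprocessing as a stand-in for whatever OpenNMT-tf does internally; names here are illustrative):
```python
import multiprocessing
from clearml import Task

def train_worker():
    # This runs in a separate process: if the automagic hooks from the
    # parent's Task.init are not inherited here, framework calls made in
    # this process would not be auto-logged.
    pass

if __name__ == "__main__":
    task = Task.init(project_name="opennmt", task_name="train")  # placeholder names
    p = multiprocessing.Process(target=train_worker)
    p.start()
    p.join()
```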
This is by design: they cannot use the exact same venv, because if the code starts creating or changing files, that happens inside the venv and might cause them to crash.
That said, if you are running with venv caching, the first one will create the venv and the second one will create a copy from the cache.
Seems like an okay clearml.conf file
Notice this is the error: 404. Can you curl to this address? Are you sure you have httpS and not http? Was the DNS configured?
Hi @<1528908687685455872:profile|MassiveBat21>
However, no useful template is created for downstream executions - the source code template is all messed up,
Interesting. Could you provide the code that is "created", or even better some way to reproduce it? It sounds like a bug, or maybe a missing feature.
My question is - what is a best practice in this case to be able to run exported scripts (python code not made availa...
UnevenDolphin73 I would use the APIClient:
from clearml.backend_api.session.client import APIClient
APIClient().projects.edit(project=project_id, system_tags=[])
I might have a few typos above, but that should be the gist
(just using a local server not connected to the Internet), am I right?
You can, if you host your own git server. Or, if your code is a single file / Jupyter notebook, the entire code is stored on the Task.
btw: what is the exact setup, how come there is no git repo?
@<1545216077846286336:profile|DistraughtSquirrel81> shoot an email to "support@clear.ml" and provide all the information you can on the "lost account" (i.e. the one you had the data on): this means the email account that created it (or your colleagues' emails), and any other information that might help locate it.
Where do you store those?
For visibility: after close inspection of the API calls, it turns out there was no work against the SaaS server, hence no data
BTW: any specific reason for going the REST API way and not using the Python SDK?
These instructions should create the exact chart:
None
What am I missing?
Hi StrangePelican34
What exactly is not working? Are you getting any TB reports?
Regarding the demo app, this is just a default server that allows you to start playing around with ClearML without needing to set up any of your own servers or sign up
That said, I would recommend signing up (totally free) on the community server
https://app.community.clear.ml/
Okay, there should not be any difference ... 🙂
Hi @<1534706830800850944:profile|ZealousCoyote89>
We'd like to have pipeline A trigger pipeline B
Basically a Pipeline is a Task (of a specific type), so you can have pipeline A clone/enqueue the pipeline B Task and wait until it is done. wdyt?
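A rough sketch of pipeline A doing that (assuming pipeline B already exists as a Task; project/task/queue names are placeholders):
```python
from clearml import Task

# Clone the pipeline B Task, enqueue it, and block until it finishes
template = Task.get_task(project_name="pipelines", task_name="pipeline B")  # placeholders
pipeline_b = Task.clone(source_task=template)
Task.enqueue(pipeline_b, queue_name="services")
pipeline_b.wait_for_status()  # blocks until completed (raises if it failed)
```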
FloppyDeer99 what am I seeing in the screenshot?
But why is the URL in ES different from the one in the web UI?
They are not really different, but sometimes "url quoting" is an issue (this is the process by which a browser takes a URL string like a/b and converts it to a%2fb).
I remember there was an issue involving double quoting (this is when you have: a/b -> a%2fb -> a%252fb); notice that the last step replaces "%" with "%25", as in your example...
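You can see both steps with Python's stdlib:
```python
from urllib.parse import quote

once = quote("a/b", safe="")   # 'a%2Fb'   - quoted once, as expected
twice = quote(once, safe="")   # 'a%252Fb' - the '%' itself got re-quoted
print(once, twice)
```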
Let me know i...
SoreDragonfly16 the torchvision warning has nothing to do with the Trains warning.
The Trains warning means that somehow the state of the Task was changed from running (in_progress) to "stopped" (aborted). Could it be that one of the subprocesses raised an exception?
So we basically have two options: one is that when you call Dataset.get_local_copy(), we register it on the Task automatically; the other is more explicit, with something like:
```python
ds = Dataset.get(...)
folder = ds.get_local_copy()
task.connect(ds, name="train")
...
ds_val = Dataset.get(...)
folder = ds_val.get_local_copy()
task.connect(ds_val, name="validate")
```
wdyt?
and then?
The thing is, programmatically this is not easy to expose as an API, because in the end the "function" (i.e. LCI) never returns; it connects over SSH and stays there
But you can query the Task it creates: the project is known, the user is known, and it has a special type/tag
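Something like this should find them (the project name and tag are just examples; match them to whatever the app actually sets):
```python
from clearml import Task

# Query running Tasks by the metadata the app sets on them
tasks = Task.get_tasks(
    project_name="DevOps",                    # example project
    tags=["interactive"],                     # example tag
    task_filter={"status": ["in_progress"]},
)
for t in tasks:
    print(t.id, t.name)
```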