data["encoded_lengths"]
This makes no sense to me, data is a numpy array, not a pandas frame...
Wait, are you saying it is disappearing? Meaning when you cloned the Pipeline (i.e. in draft mode) the configuration was there, and then it disappeared?
Does clearml resolve the CUDA Version from driver or conda?
Actually it starts with the default CUDA based on the host driver, but when it installs the conda env it takes it from the "installed packages" (i.e. the one you used to execute the code in the first place)
Regarding the link, I could not find the exact version but this is close enough I guess:
None
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log this is what conda is installing, it should have worked
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
-...
A true mystery 🙂
That said, I hardly think it is directly related to the trains-agent ...
Do you have any more insights on when / how it happens ?
What if I have multiple files that are not in the same folder? (That is the current use-case)
I think you can do: weights_filenames=['a_folder/firstfile.bin', 'b_folder/secondfile.bin']
(it will look for a common file path for both so it retains the folder structure)
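If it helps, a minimal sketch of what I mean (assuming this is OutputModel.update_weights_package ; the project/task/file names are just placeholders):
from clearml import Task, OutputModel
task = Task.init(project_name="examples", task_name="multi-folder model")  # placeholder names
model = OutputModel(task=task)
# both files go into a single package; the common path prefix is used as the root,
# so the a_folder/ and b_folder/ structure is kept inside the package
model.update_weights_package(weights_filenames=["a_folder/firstfile.bin", "b_folder/secondfile.bin"])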
Our workaround now, for using a Dataset as we do, is to store the dataset ID as a configuration parameter, so it's always included too
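For reference, a rough sketch of that workaround (the parameter name and project/task names are placeholders):
from clearml import Task, Dataset
task = Task.init(project_name="examples", task_name="train")  # placeholder names
# store the dataset ID as a configuration parameter, so it is always logged with the Task
params = task.connect({"dataset_id": "<dataset-id>"})
dataset = Dataset.get(dataset_id=params["dataset_id"])
local_copy = dataset.get_local_copy()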
Exactly, so with an Input Model it's kind of the same, only ...
Notice you have in the Path:
/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi
But you should have:
/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/
The difference is that running the agent in daemon mode means the "daemon" itself is a job in SLURM.
What I was saying is: pulling jobs from the ClearML queue and then pushing them as individual SLURM jobs, does that make sense?
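To make it a bit more concrete, a rough sketch of what I mean (the sbatch flags and job naming are my assumptions; clearml-agent execute is the one-shot mode that runs a single task):
import subprocess

def submit_to_slurm(task_id: str) -> None:
    # wrap a one-shot agent run in its own SLURM job, so every ClearML task
    # pulled from the queue becomes an individual SLURM job
    cmd = f"clearml-agent execute --id {task_id}"
    subprocess.run(["sbatch", "--job-name", f"clearml-{task_id}", "--wrap", cmd], check=True)

# the task IDs would come from polling the ClearML queue (e.g. via the queues API);
# this list is just a placeholder
for task_id in ["<task-id>"]:
    submit_to_slurm(task_id)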
Hmm EmbarrassedPeacock82
Let's try with:
--input-size -1 60 1 --aux-config input.format=FORMAT_NCHW
BTW: this seems like a Triton LSTM configuration issue, we might want to move the discussion to the Triton server issue, wdyt?
Great, please feel free to share your thoughts here 🙂
Hi JitteryCoyote63
Somehow I thought it was solved 😞
1) Yes, please add a GitHub issue so we can keep track
2)
Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Is this the main issue?
Hi SubstantialElk6
you can do:
from clearml.config import config_obj
config_obj.get('sdk')
You will get the entire configuration tree of the SDK section (if you need sub-sections, you can access them with '.' notation, e.g. sdk.storage)
Docker mode. They do share the same folder with the training data mounted as a volume, but only for reading the data.
Any chance they try to store the TensorBoard on this folder ? This could lead to "No such file or directory: 'runs'" if one is deleting it, and the other is trying to access, or similar scenarios
I think the ClearmlLogger is kind of deprecated ...
Basically all you need is Task.init at the beginning; the default TensorBoard logger will be caught by clearml
that should have worked, do you want to send the log?
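For reference, the minimal pattern I mean (using PyTorch's SummaryWriter as an example; project/task names are placeholders):
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Task.init is all that is needed; the TensorBoard reports are picked up automatically
task = Task.init(project_name="examples", task_name="tensorboard auto-logging")  # placeholder names
writer = SummaryWriter(log_dir="runs")
for step in range(10):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)
writer.close()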
JitteryCoyote63
I agree that its name is not search-engine friendly,
LOL 😄
It was an internal joke, the guys decided to call it "trains" because, you know, it trains...
It was unstoppable, we should probably do a line of merchandise with AI 🚆 😉
Anyhow, this one definitely backfired...
Hi ClumsyElephant70
What's the clearml version you are using?
(The first error is a by-product of a python process.Event being created before the forkserver is created, some internal python issue. I thought it was solved, let me take a look at the code you attached)
'config.pbtxt' could not be inferred. please provide specific config.pbtxt definition.
This basically means there is no configuration on how to serve the model, i.e. size/type of the input layer and the output layer.
You can either store the configuration on the creating Task, like is done here:
https://github.com/allegroai/clearml-serving/blob/b5f5d72046f878bd09505606ca1147d93a5df069/examples/keras/keras_mnist.py#L51
Or you can provide it as a standalone file when registering the mo...
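Roughly what I mean by the first option, storing it on the creating Task (a sketch only; the layer names/dims are placeholders and the use of set_configuration_object here is my assumption, the linked example shows the exact call used):
from clearml import Task

task = Task.init(project_name="examples", task_name="register model")  # placeholder names
# store the Triton model configuration on the Task, next to the registered model
config_pbtxt = """
input [{ name: "input", data_type: TYPE_FP32, dims: [-1, 60, 1] }]
output [{ name: "output", data_type: TYPE_FP32, dims: [-1, 1] }]
"""
task.set_configuration_object(name="config.pbtxt", config_text=config_pbtxt)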
Hi @<1523701337353621504:profile|FlutteringSheep58>
are you asking how to convert a worker IP into a DNS-resolved host name?
CooperativeFox72 this is indeed sad news 😞
When you have the time, please see if you can send a code snippet to reproduce the issue. I'd like to have it fixed
PunySquid88 RC1 is out with a fix:
pip install trains-agent==0.14.2rc1
but realized calling that from the extension would be hard, so we opted to have the TypeScript code make calls to the ClearML API server directly, e.g.
POST /tasks.get_all_ex
did you manage to get that working?
- To get the credentials, we read the ~/clearml.conf file. I tried hard, but couldn't get a TypeScript library to parse the HOCON config file format... so I eventually resorted to using (likely brittle) regex to grab the ClearML endpoint and API ke...
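For what it's worth, the same flow in Python looks roughly like this (the server URL and credentials are placeholders; I'm assuming the usual auth.login / Bearer-token pattern the SDK itself uses):
import requests

api_server = "https://api.clear.ml"  # placeholder, taken from ~/clearml.conf
access_key, secret_key = "<access_key>", "<secret_key>"  # placeholders, also from ~/clearml.conf

# exchange the credentials for a token, then call the endpoint with it
token = requests.post(f"{api_server}/auth.login", auth=(access_key, secret_key)).json()["data"]["token"]
resp = requests.post(
    f"{api_server}/tasks.get_all_ex",
    headers={"Authorization": f"Bearer {token}"},
    json={"page_size": 10},
)
print(resp.json())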
Yes I suspect it is too large 😞
Notice that most parts have default values so there is no need to specify them
Hi ClumsyElephant70
extra_docker_shell_script: ["export SECRET=SECRET", ]
I think ${SECRET} will not get resolved, you have to specifically put the text value there.
That said it is a good idea to resolve it if possible, wdyt?