Why would you need to manually change the current run? You already provide the values either as defaults or on the command line, right?
what am I missing here?
ResponsiveHedgehong88 I'm not sure I stated it, but the argparser arguments and values are collected automatically from your current run and put on the Task; there is no need to manually set them if you have the argparser running on your machine. Basically it collects the current settings (i.e. from the process running on your machine) and "copies" them ...
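For example, a minimal sketch of what that looks like (the project/task names and arguments here are just placeholders):

import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=32)
parser.add_argument("--lr", type=float, default=0.001)

# Task.init is enough; the argparse arguments and their current values
# are collected automatically and appear on the Task's hyperparameters
task = Task.init(project_name="examples", task_name="argparse autologging")
args = parser.parse_args()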
Try to add here:
None
server_info['url'] = f"http://{server_info['hostname']}:{server_info['port']}/"
HandsomeCrow5
BTW: out of curiosity, how do you generate the HTML reports? I remember a few users suggesting trains should have report-generating functionality
Hi @<1655744373268156416:profile|StickyShrimp60>
My hydra OmegaConf configuration object is not always being picked up, and I am unable to consistently reproduce it.
... I am using clearml v1.14.4,
Hmm, how can we reproduce it? What are you seeing when it does "miss" the hydra config, i.e. are you seeing any Hydra section at all? How are you running the code (manually, or with an agent)?
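For reference, this is roughly the minimal kind of Hydra script we could use to try to reproduce it (config path/name and project/task names are placeholders, not taken from your setup):

import hydra
from omegaconf import DictConfig, OmegaConf
from clearml import Task


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # the OmegaConf object should be picked up automatically and shown in the task's Hydra configuration section
    task = Task.init(project_name="examples", task_name="hydra repro")
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()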
data["encoded_lengths"]
This makes no sense to me, data is a numpy array, not a pandas DataFrame...
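To illustrate the point (a toy example, not your data):

import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])
try:
    arr["encoded_lengths"]  # a plain ndarray cannot be indexed with a string key
except IndexError as err:
    print("numpy:", err)

df = pd.DataFrame({"encoded_lengths": [1, 2, 3]})
print(df["encoded_lengths"])  # column access like this only works on a DataFrame (or a numpy structured array)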
Wait, are you saying it is disappearing? Meaning when you cloned the Pipeline (i.e. in draft mode) the configuration was there, and then the configuration disappeared?
Does clearml resolve the CUDA Version from driver or conda?
Actually it starts with the default CUDA based on the host driver, but when it installs the conda env it takes it from the "installed packages" (i.e. the one you used to execute the code in the first place)
Regarding the link, I could not find the exact version but this is close enough I guess:
None
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log this is what conda is installing, it should have worked
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
-...
A true mystery 🙂
That said, I hardly think it is directly related to the trains-agent ...
Do you have any more insights on when / how it happens ?
What if I have multiple files that are not in the same folder? (That is the current use-case)
I think you can do weights_filenames=['a_folder/firstfile.bin', 'b_folder/secondfile.bin']
(it will look for a common file path for both so it retains the folder structure)
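Something along these lines (a sketch, assuming you are calling OutputModel.update_weights_package; the file and project/task names are placeholders):

from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="multi-file model")
output_model = OutputModel(task=task)
# both files are packaged together; a common base path is used so the folder structure is kept
output_model.update_weights_package(
    weights_filenames=["a_folder/firstfile.bin", "b_folder/secondfile.bin"]
)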
Our workaround now for using a Dataset as we do is to store the dataset ID as a configuration parameter, so it's always included too
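Concretely, something like this sketch (parameter and project/task names are placeholders):

from clearml import Task, Dataset

task = Task.init(project_name="examples", task_name="train")

# store the dataset ID as a configuration parameter so it is always part of the task
params = {"dataset_id": "<dataset_id>"}
params = task.connect(params)

# fetch the dataset by the (possibly overridden) ID
dataset = Dataset.get(dataset_id=params["dataset_id"])
local_path = dataset.get_local_copy()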
Exactly, so with Input Model it's the same only kind of ...
Notice you have in the Path: /home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi
But you should have: /home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/
The difference is that running the agent in daemon mode, means the "daemon" itself is a job in SLURM.
What I was suggesting is pulling jobs from the clearml queue and then pushing them as individual SLURM jobs, does that make sense?
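Roughly something like this glue-script sketch (the queue name, sbatch script and the specific APIClient calls here are assumptions for illustration, not a tested implementation):

import subprocess
import time

from clearml.backend_api.session.client import APIClient

client = APIClient()
queue_id = client.queues.get_all(name="slurm_queue")[0].id  # assumed queue name

while True:
    response = client.queues.get_next_task(queue=queue_id)
    entry = getattr(response, "entry", None)
    if not entry:
        time.sleep(30)
        continue
    # hand the pulled task over to SLURM; the agent inside the job actually executes it
    subprocess.check_call(
        ["sbatch", f"--export=ALL,CLEARML_TASK_ID={entry.task}", "run_clearml_task.sh"]
    )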
Hmm EmbarrassedPeacock82
Let's try with:
--input-size -1 60 1 --aux-config input.format=FORMAT_NCHW
BTW: this seems like a Triton LSTM configuration issue, we might want to move the discussion to the Triton server issue, wdyt?
Great, please feel free to share your thoughts here 🙂
Hi JitteryCoyote63
Somehow I thought it was solved 😞
1) Yes, please add a GitHub issue so we can keep track
2)
Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Is this the main issue ?
Hi SubstantialElk6
you can do:
from clearml.config import config_obj
config_obj.get('sdk')
You will get the entire configuration tree of the SDK section (if you need sub-sections, you can access them with '.' notation, e.g. sdk.storage)
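For example:

from clearml.config import config_obj

sdk_config = config_obj.get('sdk')        # the entire SDK configuration tree
storage = config_obj.get('sdk.storage')   # a sub-section via the '.' notation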
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
Any chance they try to store the TensorBoard logs in this folder? This could lead to "No such file or directory: 'runs'" if one is deleting it while the other is trying to access it, or similar scenarios
I think the ClearmlLogger is kind of deprecated ...
Basically all you need is Task.init at the beginning; the default tensorboard logger will be picked up by clearml
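i.e. something like this minimal sketch (project/task names are placeholders):

from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Task.init at the beginning is all that is needed; the TensorBoard reports below are picked up automatically
task = Task.init(project_name="examples", task_name="tensorboard autolog")

writer = SummaryWriter("runs")
for step in range(10):
    writer.add_scalar("loss", 1.0 / (step + 1), step)
writer.close()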
MelancholyBeetle72 thanks! I'll see if we could release an RC with a fix soon, for you to test :)
that should have worked, do you want to send the log?
JitteryCoyote63
I agree that its name is not search-engine friendly,
LOL 😄
It was an internal joke; the guys decided to call it "trains" because, you know, it trains...
It was unstoppable, we should probably do a line of merchandise with AI 🚆 😉
Anyhow, this one definitely backfired...
I think you can watch it after GTC on the nvidia website, and a week after that we will be able to upload it to the youtube channel 🙂
BTW: is this on the community server or self-hosted (aka docker-compose)?
Hi ClumsyElephant70
What's the clearml version you are using?
(The first error is a by-product of a python process Event being created before a forkserver is created, some internal python issue. I thought it was solved, let me take a look at the code you attached)