Okay here is a standalone code that should be close enough? (if I missed anything let me know)
` import tempfile
from datetime import datetime
from pathlib import Path
import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task
task = Task.init(project_name="debug", task_name="test")
(ds_train, ds_test), ds_info = tfds.load(
'mnist',
split=['train', 'test'],
shuffle_files=True,
as_supervised=True,
with_info=True,
)
def normalize_img(image, label):
"""Normalizes images: uint8 -> float32."""
return tf.cast(image, tf.float32) / 255., label
ds_train = ds_train.map(
normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.map(
normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10)
])
model.compile(
optimizer=tf.keras.optimizers.Adam(0.001),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
log_dir = Path(tempfile.gettempdir()) / datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(str(log_dir / "metrics"))
file_writer.set_as_default()
cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq='epoch')
model.fit(
ds_train,
epochs=6,
validation_data=ds_test,
callbacks=[cb]
) `
Thanks  BoredHedgehog47  !
And yes if the Task.init() call was only in  main.py  then the TB inside the subprocess (train.py) would as you perceived not be captured.
Did you by any chance test calling Task.init in Both  main.py   and  train.py  ?
I was able to get this working by putting Task.init() under  __ main__
well I did something on my end, its magically working now
How would I do os.fork? I'm not familiar with that
Its a legacy code base. There was issues around the memory not being cleared in the GPU when subprocesses were not used. At this point I refactored out the subprocess as it just adds more complexity.
However, the subprocess calls are somewhat important to our code base thus the problem
Hi  BoredHedgehog47 ! We tried to reproduce this, but failed. What we tried is running the attached  main.py  which  Popen s  sub.py .
Can you please run  main.py  as well and tell us if you still encounter the bug? If not, is there anything else you can think of that could trigger this bug besides creating a subprocess?
Thank you!
Maybe before everything else, can you share some background on the rational if starting a new sub process?
BoredHedgehog47 can you test this one? Is it close to your code ?
So in summary: subprocess calls appear to break clearML tracking, even if I do Task.init() in both main.py and train.py.
Okay let me see if we can reproduce & fix this, it should not be long
I basically moved the Task.init() call below the imports
Okay that is odd, can you copy pate the before/after of the import, so we can fix that?!
If I do both everything works, except then I lose clearML tracking (scalars, outputs, etc)
yea let me unwind some changes so I can pinpoint the issue
I think the crux of the issue is the subprocess calls I removed.
That kind of makes sense, though if the subprocess function also had Task.init call it should have worked.
Would that be the setup to try to replicate?
Yea I did similar. I think the crux of the issue is the subprocess calls I removed.
So in summary: subprocess calls appear to break clearML tracking, even if I do Task.init() in both main.py and train.py. However the script does run end to end successfully. If I remove the subprocess calls, I only need Task.init() in main.py for everything to work (scalars, reporting, etc).
So if my main script is called  main.py  and in  main.py  I call a script called  train.py  via a  subprocess.Popen()
BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...
when I did Task.init() in  train.py  the CLI arguments needed for  main.py  don't get captured and the script fails right away. Note this is running  --skip-task-init  since train.py has Task.init()
BoredHedgehog47  you need to make sure "<path here>/train.py" also calls Task.init (again no need to worry about calling it twice with different project/name)
The Task.init call will make sure the auto-connect works.
BTW: if you do os.fork , then there is no need for the Task.init, the main difference is that POpen starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. calling Task.init)
Oh that makes sense:
` # Create a child process
using os.fork() method
pid = os.fork()
if pid > 0 :
# pid greater than 0 represents
# the parent process
print("I am parent process:")
print("Process ID:", os.getpid())
print("Child's process ID:", pid)
else :
# pid equal to 0 represents
# the created child process
print("\nI am child process - this is still fully auto logged")
print("Process ID:", os.getpid())
print("Parent's process ID:", os.getppid()) `https://www.geeksforgeeks.org/python-os-fork-method/
AgitatedDove14   SmugDolphin23  Would the following subprocess calls break the auto connect to frameworks like tensorboard?
`     exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
cmd += ["--training-run", str(training_run_id)]
logging.info("Training classifier with command:\n%s", " ".join(cmd))
returncode = subprocess.Popen(cmd).wait() `Note  ` /home/npuser/.clearml/venvs-builds/3.7/bin/python `
		I basically moved the Task.init() call below the imports
callbacks.append( tensorflow.keras.callbacks.TensorBoard( log_dir=str(log_dir), update_freq=tensorboard_config.get("update_freq", "epoch"), ) )Might be! what's the actual value you are passing there?