Hi BoredHedgehog47! We tried to reproduce this, but failed. What we tried is running the attached main.py, which Popens sub.py.
Can you please run main.py as well and tell us if you still encounter the bug? If not, is there anything else you can think of that could trigger this bug besides creating a subprocess?
Thank you!
However, the subprocess calls are somewhat important to our code base, hence the problem.
So in summary: subprocess calls appear to break clearML tracking, even if I do Task.init() in both main.py and train.py. However the script does run end to end successfully. If I remove the subprocess calls, I only need Task.init() in main.py for everything to work (scalars, reporting, etc).
BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...
So if my main script is called main.py and in main.py I call a script called train.py via a subprocess.Popen()
Yea I did similar. I think the crux of the issue is the subprocess calls I removed.
Okay, here is standalone code that should be close enough? (If I missed anything, let me know.)
` import tempfile
from datetime import datetime
from pathlib import Path
import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task
task = Task.init(project_name="debug", task_name="test")
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)


def normalize_img(image, label):
    """Normalizes images: uint8 -> float32."""
    return tf.cast(image, tf.float32) / 255., label


ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

# Write TensorBoard logs to a timestamped folder under the temp directory
log_dir = Path(tempfile.gettempdir()) / datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(str(log_dir / "metrics"))
file_writer.set_as_default()
cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq='epoch')

model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
    callbacks=[cb]
) `
BoredHedgehog47 you need to make sure "<path here>/train.py" also calls Task.init (again no need to worry about calling it twice with different project/name)
The Task.init call will make sure the auto-connect works.
BTW: if you use os.fork, there is no need for the Task.init call. The main difference is that Popen starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. by calling Task.init).
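To make the fork vs. Popen difference concrete, here is a minimal sketch that does not use ClearML at all. It assumes a POSIX system (os.fork is not available on Windows). The marker attribute and the helper names (`install_marker`, `marker_in_fork_child`, `marker_in_popen_child`) are hypothetical stand-ins for whatever in-process setup Task.init performs; they are not part of any library.

```python
import os
import subprocess
import sys
import types

# Hypothetical stand-in for in-process setup such as Task.init():
# we tag an already-imported stdlib module with a marker attribute.
MARKER = "_demo_autoconnect"


def install_marker():
    """Modify state that lives only in the current process's memory."""
    setattr(types, MARKER, True)


def marker_in_fork_child():
    """Return True if a forked child still sees the marker (POSIX only)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it starts as a copy of the parent's memory.
        os.close(read_fd)
        os.write(write_fd, b"1" if hasattr(types, MARKER) else b"0")
        os._exit(0)
    # Parent process: read the child's answer, then reap the child.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"


def marker_in_popen_child():
    """Return True if a child started with Popen sees the marker."""
    # The child is a brand-new interpreter that imports `types` from scratch.
    code = f"import types; print(int(hasattr(types, {MARKER!r})))"
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    out, _ = proc.communicate()
    return out.strip() == "1"
```

After calling `install_marker()` in the parent, the forked child reports the marker and the Popen child does not, which is why a script started with Popen has to redo its own setup.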
` callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
) `
Might be! what's the actual value you are passing there?
I basically moved the Task.init() call below the imports
Okay, that is odd. Can you copy-paste the before/after of the imports, so we can fix that?!
Maybe before everything else, can you share some background on the rationale for starting a new subprocess?
I was able to get this working by putting Task.init() under __main__
Oh that makes sense:
` import os

# Create a child process using the os.fork() method (POSIX only)
pid = os.fork()
if pid > 0:
    # pid greater than 0 represents the parent process
    print("I am parent process:")
    print("Process ID:", os.getpid())
    print("Child's process ID:", pid)
else:
    # pid equal to 0 represents the created child process
    print("\nI am child process - this is still fully auto logged")
    print("Process ID:", os.getpid())
    print("Parent's process ID:", os.getppid()) `
https://www.geeksforgeeks.org/python-os-fork-method/
It's a legacy code base. There were issues around GPU memory not being cleared when subprocesses were not used. At this point I have refactored out the subprocess, as it just adds more complexity.
When I did Task.init() in train.py, the CLI arguments needed for main.py don't get captured and the script fails right away. Note this is running --skip-task-init since train.py has Task.init()
If I do both everything works, except then I lose clearML tracking (scalars, outputs, etc)
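For reference, a sketch of how a flag like --skip-task-init could be parsed next to the other arguments without aborting on options meant for another script. This is an assumption about the CLI, not the actual code base: `build_parser`, `parse_cli`, and the positional `train_config_path` are hypothetical names, and only the two flag names come from this thread. It does not diagnose the failure described above.

```python
import argparse


def build_parser():
    """Build a hypothetical parser for a train.py-style script."""
    parser = argparse.ArgumentParser(description="hypothetical train.py CLI")
    parser.add_argument("train_config_path")
    parser.add_argument("--training-run", default=None)
    parser.add_argument("--skip-task-init", action="store_true")
    return parser


def parse_cli(argv):
    """Parse argv and return (known_args, leftover_args).

    parse_known_args collects unrecognized options in a list
    instead of exiting with an error.
    """
    args, unknown = build_parser().parse_known_args(argv)
    return args, unknown
```

A script could then call Task.init() only when `args.skip_task_init` is False, and log `unknown` to see which arguments it did not recognize.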
How would I do os.fork? I'm not familiar with that
Thanks BoredHedgehog47 !
And yes, if the Task.init() call was only in main.py, then the TB inside the subprocess (train.py) would, as you observed, not be captured.
Did you by any chance test calling Task.init in both main.py and train.py?
I think the crux of the issue is the subprocess calls I removed.
That kind of makes sense, though if the subprocess function also had Task.init call it should have worked.
Would that be the setup to try to replicate?
So in summary: subprocess calls appear to break clearML tracking, even if I do Task.init() in both main.py and train.py.
Okay let me see if we can reproduce & fix this, it should not be long
Well, I did something on my end, it's magically working now
AgitatedDove14 SmugDolphin23 Would the following subprocess calls break the auto connect to frameworks like tensorboard?
` exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
    cmd += ["--training-run", str(training_run_id)]
logging.info("Training classifier with command:\n%s", " ".join(cmd))
returncode = subprocess.Popen(cmd).wait() `
Note ` /home/npuser/.clearml/venvs-builds/3.7/bin/python `
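A possible variant of the command construction above, sketched under the assumption that the child should run with the same interpreter as the parent. Using `sys.executable` in place of the hard-coded venv path is a suggestion, not what the original code does, and `build_train_cmd` and `run_train` are hypothetical helper names.

```python
import subprocess
import sys


def build_train_cmd(strategy_pipeline, train_config_path, training_run_id=None):
    """Build the argv list for the training script."""
    exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
    # sys.executable is the interpreter running the current script,
    # so the child uses the same virtual environment as the parent.
    cmd = [sys.executable, exe, train_config_path]
    if training_run_id:
        cmd += ["--training-run", str(training_run_id)]
    return cmd


def run_train(cmd):
    """Run the command, wait for it to finish, and return its exit code."""
    return subprocess.run(cmd).returncode
```

The child is still a brand-new process either way, so train.py would still need its own Task.init call for auto-connect to work.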
BoredHedgehog47 can you test this one? Is it close to your code ?
yea let me unwind some changes so I can pinpoint the issue