Answered
When I Run An Experiment (Self Hosted), I Only See Scalars For Gpu And System Performance. How Do I See Additional Scalars? I Have

When I run an experiment (self hosted), I only see scalars for GPU and System performance. How do I see additional scalars? I have ` "tensorboard": { "enabled": true }, ` in my configuration. Note: in my Python script, I make a subprocess call, ` subprocess.Popen(cmd).wait() `, where cmd is ` python3 foo.py `. Is it possible this subprocess call is breaking the TensorBoard graphs under Scalars?

  
  
Posted 2 years ago

Answers 30


So in summary: subprocess calls appear to break clearML tracking, even if I do Task.init() in both main.py and train.py. However the script does run end to end successfully. If I remove the subprocess calls, I only need Task.init() in main.py for everything to work (scalars, reporting, etc).

  
  
Posted 2 years ago

Okay here is a standalone code that should be close enough? (if I missed anything let me know)

` import tempfile
from datetime import datetime
from pathlib import Path

import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task

task = Task.init(project_name="debug", task_name="test")
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
    """Normalizes images: uint8 -> float32."""
    return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)

ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

log_dir = Path(tempfile.gettempdir()) / datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(str(log_dir / "metrics"))
file_writer.set_as_default()

cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq='epoch')
model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
    callbacks=[cb]
) `

  
  
Posted 2 years ago

AgitatedDove14 SmugDolphin23 Would the following subprocess calls break the auto-connect to frameworks like TensorBoard?
` exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
    cmd += ["--training-run", str(training_run_id)]

logging.info("Training classifier with command:\n%s", " ".join(cmd))

returncode = subprocess.Popen(cmd).wait() `
Note  ` /home/npuser/.clearml/venvs-builds/3.7/bin/python `
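
For readers following along, here is a minimal sketch of the same launch pattern that avoids hardcoding the interpreter path. It assumes the child should run under the same interpreter as the parent, which is what ` sys.executable ` points to. The helper names ` build_command ` and ` run_child ` are hypothetical, and the script and config paths in the usage line are placeholders, not the poster's actual files. This only changes how the child is launched; it does not by itself make the child's TensorBoard output show up in ClearML.

```python
import subprocess
import sys


def build_command(script, config_path, training_run_id=None):
    """Build the child command using the interpreter that runs this script."""
    # sys.executable is the absolute path of the current Python interpreter,
    # so the child uses the same virtual environment as the parent.
    cmd = [sys.executable, script, config_path]
    if training_run_id:
        cmd += ["--training-run", str(training_run_id)]
    return cmd


def run_child(cmd):
    """Run the child process, wait for it, and fail loudly on a non-zero exit."""
    returncode = subprocess.Popen(cmd).wait()
    if returncode != 0:
        raise RuntimeError(f"Child process exited with code {returncode}")
    return returncode


if __name__ == "__main__":
    # Placeholder arguments, for illustration only.
    print(build_command("train.py", "config.yaml", training_run_id=7))
```

Checking the return code means a crash inside the child is not silently ignored by the parent script.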
  
  
Posted 2 years ago

Thanks BoredHedgehog47!
And yes, if the Task.init() call was only in main.py, then the TensorBoard output inside the subprocess (train.py) would not be captured, as you observed.
Did you by any chance test calling Task.init() in both main.py and train.py?

  
  
Posted 2 years ago

When I called Task.init() in train.py, the CLI arguments needed for main.py were not captured and the script failed right away. Note this is running with --skip-task-init, since train.py has Task.init().

  
  
Posted 2 years ago

So if my main script is called main.py and in main.py I call a script called train.py via a subprocess.Popen()

  
  
Posted 2 years ago

I was able to get this working by putting Task.init() under ` __main__ `

  
  
Posted 2 years ago

I basically moved the Task.init() call below the imports

  
  
Posted 2 years ago

sounds good

  
  
Posted 2 years ago

I think the crux of the issue is the subprocess calls I removed.

That kind of makes sense, though if the subprocess function also had Task.init call it should have worked.
Would that be the setup to try to replicate?

  
  
Posted 2 years ago

However, the subprocess calls are somewhat important to our code base thus the problem

  
  
Posted 2 years ago

Yea I did similar. I think the crux of the issue is the subprocess calls I removed.

  
  
Posted 2 years ago

Well, I did something on my end, and it's magically working now

  
  
Posted 2 years ago

How would I do os.fork? I'm not familiar with that

  
  
Posted 2 years ago

Hi BoredHedgehog47! We tried to reproduce this, but failed. What we tried is running the attached main.py, which uses Popen to launch sub.py.
Can you please run main.py as well and tell us if you still encounter the bug? If not, is there anything else you can think of that could trigger this bug besides creating a subprocess?
Thank you!

  
  
Posted 2 years ago

` callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
) `
Might be! What's the actual value you are passing there?

  
  
Posted 2 years ago

BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...

  
  
Posted 2 years ago

If I do both everything works, except then I lose clearML tracking (scalars, outputs, etc)

  
  
Posted 2 years ago

BoredHedgehog47 can you test this one? Is it close to your code ?

  
  
Posted 2 years ago

it uses the default of epoch

  
  
Posted 2 years ago

Yes I will try that

  
  
Posted 2 years ago

Thank you!!!

  
  
Posted 2 years ago

Oh that makes sense:
` import os

# Create a child process using the os.fork() method
pid = os.fork()

if pid > 0:
    # pid greater than 0 represents
    # the parent process
    print("I am parent process:")
    print("Process ID:", os.getpid())
    print("Child's process ID:", pid)

else:
    # pid equal to 0 represents
    # the created child process
    print("\nI am child process - this is still fully auto logged")
    print("Process ID:", os.getpid())
    print("Parent's process ID:", os.getppid()) `
https://www.geeksforgeeks.org/python-os-fork-method/

  
  
Posted 2 years ago

BoredHedgehog47 you need to make sure "<path here>/train.py" also calls Task.init (again, no need to worry about calling it twice with a different project/name).
The Task.init call will make sure the auto-connect works.
BTW: if you use os.fork, there is no need for the extra Task.init. The main difference is that Popen starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. by calling Task.init).
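
The fork vs. Popen difference described above can be seen with a small standard-library sketch. It does not use ClearML: ` install_hooks ` is a hypothetical stand-in for the in-process setup that Task.init performs, modeled here as an attribute set on the ` sys ` module. The forked child inherits that in-memory state, while the interpreter started by Popen begins from scratch and does not see it. This assumes a POSIX system, since os.fork is not available on Windows.

```python
import os
import subprocess
import sys


def install_hooks():
    """Hypothetical stand-in for in-process setup such as Task.init()."""
    sys._demo_hooks_installed = True


def forked_child_sees_hooks():
    """Fork a child and report whether it still sees the parent's state."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: a copy of the parent's memory, so the attribute is present
        # if the parent set it before forking.
        try:
            os.close(read_fd)
            seen = hasattr(sys, "_demo_hooks_installed")
            os.write(write_fd, b"1" if seen else b"0")
        finally:
            os._exit(0)
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return data == b"1"


def popen_child_sees_hooks():
    """Start a fresh interpreter and report whether it sees the parent's state."""
    code = "import sys; print(hasattr(sys, '_demo_hooks_installed'))"
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    out, _ = proc.communicate()
    return out.strip() == "True"


if __name__ == "__main__":
    install_hooks()
    print("fork child sees hooks:", forked_child_sees_hooks())
    print("Popen child sees hooks:", popen_child_sees_hooks())
```

This is why a script launched through Popen has to repeat its own setup, which in the ClearML case means calling Task.init inside train.py as well.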

  
  
Posted 2 years ago

So in summary: subprocess calls appear to break clearML tracking, even if I do Task.init() in both main.py and train.py.

Okay let me see if we can reproduce & fix this, it should not be long

  
  
Posted 2 years ago

I basically moved the Task.init() call below the imports

Okay, that is odd. Can you copy-paste the before/after of the imports, so we can fix that?

  
  
Posted 2 years ago

Maybe before everything else, can you share some background on the rationale for starting a new subprocess?

  
  
Posted 2 years ago

yea let me unwind some changes so I can pinpoint the issue

  
  
Posted 2 years ago

It's a legacy code base. There were issues around GPU memory not being cleared when subprocesses were not used. At this point I have refactored out the subprocess, as it just adds more complexity.

  
  
Posted 2 years ago

and removed the duplicate Task.init()

  
  
Posted 2 years ago