Answered
When I Run An Experiment (Self Hosted), I Only See Scalars For Gpu And System Performance. How Do I See Additional Scalars? I Have

When I run an experiment (self hosted), I only see scalars for GPU and System performance. How do I see additional scalars? I have
"tensorboard": { "enabled": true }, in my configuration. Note: in my Python script, I make a subprocess call, subprocess.Popen(cmd).wait(), where cmd is python3 foo.py. Is it possible that this subprocess call is breaking the TensorBoard graphs under Scalars?

  
  
Posted one year ago

Answers 30


So if my main script is called main.py and in main.py I call a script called train.py via a subprocess.Popen()

  
  
Posted one year ago

So in summary: subprocess calls appear to break ClearML tracking, even if I do Task.init() in both main.py and train.py.

Okay let me see if we can reproduce & fix this, it should not be long

  
  
Posted one year ago

Yea I did similar. I think the crux of the issue is the subprocess calls I removed.

  
  
Posted one year ago

` callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
) `
Might be! What's the actual value you are passing there?

  
  
Posted one year ago

it uses the default of epoch

  
  
Posted one year ago

Thanks BoredHedgehog47 !
And yes, if the Task.init() call was only in main.py, then the TensorBoard output inside the subprocess (train.py) would, as you observed, not be captured.
Did you by any chance test calling Task.init() in both main.py and train.py?

  
  
Posted one year ago

When I did Task.init() in train.py, the CLI arguments needed for main.py didn't get captured and the script failed right away. Note that this is running with --skip-task-init, since train.py has Task.init().

  
  
Posted one year ago

BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...

  
  
Posted one year ago

If I do both, everything works, except that I lose ClearML tracking (scalars, outputs, etc.).

  
  
Posted one year ago

Yes I will try that

  
  
Posted one year ago

Oh that makes sense:
` import os

# Create a child process using the os.fork() method
pid = os.fork()

if pid > 0:
    # pid greater than 0 represents
    # the parent process
    print("I am parent process:")
    print("Process ID:", os.getpid())
    print("Child's process ID:", pid)
else:
    # pid equal to 0 represents
    # the created child process
    print("\nI am child process - this is still fully auto logged")
    print("Process ID:", os.getpid())
    print("Parent's process ID:", os.getppid()) `
https://www.geeksforgeeks.org/python-os-fork-method/

  
  
Posted one year ago

I basically moved the Task.init() call below the imports

Okay, that is odd. Can you copy-paste the before/after of the imports, so we can fix that?

  
  
Posted one year ago

yea let me unwind some changes so I can pinpoint the issue

  
  
Posted one year ago

Well, I did something on my end, and it's magically working now.

  
  
Posted one year ago

I basically moved the Task.init() call below the imports

  
  
Posted one year ago

and removed the duplicate Task.init()

  
  
Posted one year ago

Thank you!!!

  
  
Posted one year ago

AgitatedDove14 SmugDolphin23 Would the following subprocess calls break the auto connect to frameworks like tensorboard?
` exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
    cmd += ["--training-run", str(training_run_id)]

logging.info("Training classifier with command:\n%s", " ".join(cmd))

returncode = subprocess.Popen(cmd).wait() `Note  ` /home/npuser/.clearml/venvs-builds/3.7/bin/python `
  
  
Posted one year ago

BoredHedgehog47 you need to make sure "<path here>/train.py" also calls Task.init (again no need to worry about calling it twice with different project/name)
The Task.init call will make sure the auto-connect works.
BTW: if you use os.fork, there is no need for the extra Task.init call. The main difference is that Popen starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. by calling Task.init in it).
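
The difference between the two can be sketched with the standard library alone. This is only an illustration under stated assumptions: HOOKS_INSTALLED and install_hooks() are hypothetical stand-ins for whatever in-process state Task.init sets up, not part of ClearML, and os.fork is only available on POSIX systems.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the in-process state that Task.init() sets up.
HOOKS_INSTALLED = False


def install_hooks() -> None:
    """Pretend to install auto-logging hooks in the current process."""
    global HOOKS_INSTALLED
    HOOKS_INSTALLED = True


def seen_by_popen_child() -> str:
    """Start a brand-new interpreter; nothing from the parent's memory exists there."""
    code = "print(globals().get('HOOKS_INSTALLED', False))"
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def seen_by_forked_child() -> str:
    """Fork the current process; the child starts with a copy of the parent's memory."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report the flag through the pipe, then exit without cleanup handlers.
        os.close(read_fd)
        os.write(write_fd, str(HOOKS_INSTALLED).encode())
        os._exit(0)
    # Parent: read what the child reported and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        value = reader.read()
    os.waitpid(pid, 0)
    return value


install_hooks()
popen_view = seen_by_popen_child()
fork_view = seen_by_forked_child()
print("Popen child sees hooks:", popen_view)
print("Forked child sees hooks:", fork_view)
```

The Popen child reports that the flag is missing because it is a fresh interpreter, which is why the script it runs needs its own initialisation. The forked child inherits the flag from the parent.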

  
  
Posted one year ago

Hi BoredHedgehog47! We tried to reproduce this, but failed. What we tried is running the attached main.py, which runs sub.py through Popen.
Can you please run main.py as well and tell us if you still encounter the bug? If not, is there anything else you can think of that could trigger this bug besides creating a subprocess?
Thank you!

  
  
Posted one year ago

I was able to get this working by putting Task.init() under __main__
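
For readers unfamiliar with the guard, here is a minimal sketch of what it changes, assuming a hypothetical train.py in which a print() call stands in for Task.init(). Nothing here calls ClearML. Code under the guard runs when the file is executed as a script (which is what a Popen call does), but not when the file is imported as a module.

```python
import importlib.util
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical train.py; the first print() is a stand-in for Task.init(...).
TRAIN_SOURCE = '''
def train():
    return "training"

if __name__ == "__main__":
    print("tracking initialised")
    print(train())
'''

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "train.py"
    script.write_text(TRAIN_SOURCE)

    # Executed as a script in a new process: the guarded block runs.
    run_output = subprocess.run(
        [sys.executable, str(script)], capture_output=True, text=True, check=True
    ).stdout.splitlines()

    # Imported as a module in this process: the guarded block is skipped.
    spec = importlib.util.spec_from_file_location("train_as_module", script)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    import_result = module.train()

print(run_output)
print(import_result)
```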

  
  
Posted one year ago

How would I do os.fork? I'm not familiar with that

  
  
Posted one year ago

sounds good

  
  
Posted one year ago

It's a legacy code base. There were issues with GPU memory not being cleared when subprocesses were not used. At this point I have refactored out the subprocess, as it just adds more complexity.

  
  
Posted one year ago

Okay here is a standalone code that should be close enough? (if I missed anything let me know)

` import tempfile
from datetime import datetime
from pathlib import Path

import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task

task = Task.init(project_name="debug", task_name="test")
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)


def normalize_img(image, label):
    """Normalizes images: uint8 -> float32."""
    return tf.cast(image, tf.float32) / 255., label


ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

log_dir = Path(tempfile.gettempdir()) / datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(str(log_dir / "metrics"))
file_writer.set_as_default()

cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq='epoch')
model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
    callbacks=[cb]
) `

  
  
Posted one year ago

However, the subprocess calls are somewhat important to our code base, hence the problem.

  
  
Posted one year ago

I think the crux of the issue is the subprocess calls I removed.

That kind of makes sense, though if the subprocess function also had Task.init call it should have worked.
Would that be the setup to try to replicate?

  
  
Posted one year ago

BoredHedgehog47 can you test this one? Is it close to your code ?

  
  
Posted one year ago

So in summary: subprocess calls appear to break ClearML tracking, even if I do Task.init() in both main.py and train.py. However, the script does run end to end successfully. If I remove the subprocess calls, I only need Task.init() in main.py for everything to work (scalars, reporting, etc.).

  
  
Posted one year ago

Maybe before everything else, can you share some background on the rationale for starting a new subprocess?

  
  
Posted one year ago
639 Views
30 Answers