Answered

When I run an experiment (self-hosted), I only see scalars for GPU and system performance. How do I see additional scalars?

When I run an experiment (self-hosted), I only see scalars for GPU and system performance. How do I see additional scalars? I have `"tensorboard": { "enabled": true },` in my configuration.
Note: in my Python script, I make a subprocess call, `subprocess.Popen(cmd).wait()`, where `cmd` is `python3 foo.py`. Is it possible that this subprocess call is breaking the TensorBoard graphs under Scalars?

  				
Posted 2 years ago by BoredHedgehog47


Answers 30

Yea I did similar. I think the crux of the issue is the subprocess calls I removed.

  				
Posted 2 years ago by BoredHedgehog47

It's a legacy codebase. There were issues with GPU memory not being cleared when subprocesses were not used. At this point I have refactored out the subprocess, as it just adds more complexity.

  				
Posted 2 years ago by BoredHedgehog47

I was able to get this working by putting `Task.init()` under the `if __name__ == "__main__":` guard.

  				
Posted 2 years ago by BoredHedgehog47

So in summary: subprocess calls appear to break ClearML tracking, even if I do `Task.init()` in both main.py and train.py.

Okay let me see if we can reproduce & fix this, it should not be long

  				
Posted 2 years ago by AgitatedDove14

Oh that makes sense:
`
import os

# Create a child process
# using os.fork() method
pid = os.fork()

if pid > 0:
    # pid greater than 0 represents
    # the parent process
    print("I am parent process:")
    print("Process ID:", os.getpid())
    print("Child's process ID:", pid)
else:
    # pid equal to 0 represents
    # the created child process
    print("\nI am child process - this is still fully auto logged")
    print("Process ID:", os.getpid())
    print("Parent's process ID:", os.getppid())
`
https://www.geeksforgeeks.org/python-os-fork-method/

  				
Posted 2 years ago by AgitatedDove14

When I do `Task.init()` in train.py, the CLI arguments needed for main.py don't get captured and the script fails right away. Note that this is running with `--skip-task-init`, since train.py has `Task.init()`.

  				
Posted 2 years ago by BoredHedgehog47

AgitatedDove14 SmugDolphin23 Would the following subprocess calls break the auto-connect to frameworks like TensorBoard?
`
exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
    cmd += ["--training-run", str(training_run_id)]

logging.info("Training classifier with command:\n%s", " ".join(cmd))

returncode = subprocess.Popen(cmd).wait()
`
Note `/home/npuser/.clearml/venvs-builds/3.7/bin/python`

  				
Posted 2 years ago by BoredHedgehog47

yea let me unwind some changes so I can pinpoint the issue

  				
Posted 2 years ago by BoredHedgehog47

Thank you!!!

  				
Posted 2 years ago by AgitatedDove14

I basically moved the Task.init() call below the imports

Okay, that is odd. Can you copy-paste the before/after of the imports so we can fix that?

  				
Posted 2 years ago by AgitatedDove14

How would I do os.fork? I'm not familiar with that

  				
Posted 2 years ago by BoredHedgehog47

Thanks BoredHedgehog47!
And yes, if the `Task.init()` call was only in main.py, then the TensorBoard output inside the subprocess (train.py) would, as you observed, not be captured.
Did you by any chance test calling `Task.init` in both main.py and train.py?

  				
Posted 2 years ago by AgitatedDove14

`
callbacks.append(
    tensorflow.keras.callbacks.TensorBoard(
        log_dir=str(log_dir),
        update_freq=tensorboard_config.get("update_freq", "epoch"),
    )
)
`
Might be! What's the actual value you are passing there?

  				
Posted 2 years ago by AgitatedDove14

I basically moved the Task.init() call below the imports

  				
Posted 2 years ago by BoredHedgehog47

However, the subprocess calls are somewhat important to our code base thus the problem

  				
Posted 2 years ago by BoredHedgehog47

Hi BoredHedgehog47! We tried to reproduce this, but failed. What we tried was running the attached main.py, which `Popen`s sub.py.
Can you please run main.py as well and tell us if you still encounter the bug? If not, is there anything else you can think of that could trigger this bug besides creating a subprocess?
Thank you!

  				
Posted 2 years ago by SmugDolphin23

and removed the duplicate Task.init()

  				
Posted 2 years ago by BoredHedgehog47

BoredHedgehog47 I tried changing the order of imports on the sample code I shared before, it worked in both cases ...

  				
Posted 2 years ago by AgitatedDove14

Maybe before everything else, can you share some background on the rationale for starting a new subprocess?

  				
Posted 2 years ago by AgitatedDove14

BoredHedgehog47 can you test this one? Is it close to your code ?

  				
Posted 2 years ago by AgitatedDove14

Okay here is a standalone code that should be close enough? (if I missed anything let me know)

`
import tempfile
from datetime import datetime
from pathlib import Path

import tensorflow as tf
import tensorflow_datasets as tfds
from clearml import Task

task = Task.init(project_name="debug", task_name="test")
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)


def normalize_img(image, label):
    """Normalizes images: uint8 -> float32."""
    return tf.cast(image, tf.float32) / 255., label


ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

log_dir = Path(tempfile.gettempdir()) / datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(str(log_dir / "metrics"))
file_writer.set_as_default()

cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir, update_freq='epoch')
model.fit(
    ds_train,
    epochs=6,
    validation_data=ds_test,
    callbacks=[cb]
)
`

  				
Posted 2 years ago by AgitatedDove14

BoredHedgehog47 you need to make sure "<path here>/train.py" also calls `Task.init` (again, no need to worry about calling it twice with a different project/name).
The `Task.init` call will make sure the auto-connect works.
BTW: if you use `os.fork`, there is no need for the extra `Task.init`. The main difference is that `Popen` starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. by calling `Task.init`).
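A minimal sketch of that difference, using only the standard library. The `TRACKING_HOOK_INSTALLED` attribute is a made-up stand-in for whatever in-process hooks a tracker installs at init time, not anything ClearML actually defines. It shows that a process started via `subprocess` begins with a fresh interpreter, while a forked child (POSIX only) inherits the parent's in-memory state:

```python
import builtins
import os
import subprocess
import sys

# Illustrative stand-in for in-process hooks installed at init time (assumption).
builtins.TRACKING_HOOK_INSTALLED = True

CHILD_CODE = "import builtins; print(hasattr(builtins, 'TRACKING_HOOK_INSTALLED'))"


def hook_visible_after_popen() -> bool:
    """Start a brand-new interpreter and ask whether it sees the hook."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "True"


def hook_visible_after_fork() -> bool:
    """Fork (POSIX only) and ask the child whether it sees the hook."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: a copy of the parent's memory, so the hook is still present.
        os.close(read_fd)
        seen = hasattr(builtins, "TRACKING_HOOK_INSTALLED")
        os.write(write_fd, b"1" if seen else b"0")
        os._exit(0)
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"
```

On a POSIX system, `hook_visible_after_popen()` is expected to return `False` and `hook_visible_after_fork()` to return `True`, which is consistent with the advice that a script launched through `Popen` needs its own `Task.init` call.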

  				
Posted 2 years ago by AgitatedDove14

So in summary: subprocess calls appear to break ClearML tracking, even if I do `Task.init()` in both main.py and train.py. However, the script does run end to end successfully. If I remove the subprocess calls, I only need `Task.init()` in main.py for everything to work (scalars, reporting, etc.).

  				
Posted 2 years ago by BoredHedgehog47

Well, I did something on my end, and it's magically working now.

  				
Posted 2 years ago by BoredHedgehog47

it uses the default of epoch
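A small sketch of why that is the case, assuming the configuration contains only the `"tensorboard": { "enabled": true }` entry quoted in the question: `dict.get` falls back to its second argument when the key is absent.

```python
# Config as quoted in the question; no "update_freq" key is present (assumption).
config = {"tensorboard": {"enabled": True}}
tensorboard_config = config["tensorboard"]

# Same lookup as in the callback setup: falls back to "epoch" when the key is missing.
update_freq = tensorboard_config.get("update_freq", "epoch")
print(update_freq)
```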

  				
Posted 2 years ago by BoredHedgehog47

Yes I will try that

  				
Posted 2 years ago by BoredHedgehog47

If I do both, everything works, except that I lose ClearML tracking (scalars, outputs, etc.).

  				
Posted 2 years ago by BoredHedgehog47

I think the crux of the issue is the subprocess calls I removed.

That kind of makes sense, though if the subprocess function also had Task.init call it should have worked.
Would that be the setup to try to replicate?

  				
Posted 2 years ago by AgitatedDove14

So if my main script is called main.py and in main.py I call a script called train.py via a subprocess.Popen()

  				
Posted 2 years ago by BoredHedgehog47

sounds good

  				
Posted 2 years ago by BoredHedgehog47
