Hi, I have a ClearML experiment that failed to load its scalar plots after a few hours of training. When I look at the log locally with TensorBoard it seems to work fine. Any idea what's going on?

Answered

Hi, I have a ClearML experiment that failed to load its scalar plots after a few hours of training. When I look at the log locally with TensorBoard, it seems to work fine. Any idea what's going on?

  				
Posted 2 years ago by VivaciousCoyote85


Answers 10

This is what the console showed when I tried to load it:

  				
Posted 2 years ago by VivaciousCoyote85

All of the experiments in this particular project behave like this.
The console works fine and I'm still able to view debug images.
Task.init() is called in main of the training script with a user-specified project and task name.

  				
Posted 2 years ago by VivaciousCoyote85

This is how the task gets created:

import sys

from clearml import Task


def create_clearml_task(
    project_name,
    task_name,
    script,
    args,
    docker_args="",
    docker_image_name="<docker image name>",
    add_task_init_call=True,
    requirements_file=None,
    **kwargs):
    print(
        "Creating task: project_name: {project_name}, task_name: {task_name}, script:{script} and args: \n {args}"
        .format(
            project_name=project_name,
            task_name=task_name,
            script=script,
            args=args,
        ))
    arg_tuples = args_to_tuples(args)
    # Remove the argument to execute on clearML before queueing up otherwise we will just keep calling
    # remote execution recursively without ever doing the work.
    unset_clearml_execute(arg_tuples)
    return Task.create(
        argparse_args=arg_tuples,
        project_name=project_name,
        task_name=task_name,
        script=script,
        add_task_init_call=add_task_init_call,
        repo='git@<repo>.git',
        packages=find_current_packages() if requirements_file is None else None,
        requirements_file=requirements_file,
        docker=docker_image_name,
        commit=get_current_commit(),
        docker_bash_setup_script=bash_setup_string,
        # Trailing space keeps any extra docker_args separate from "--shm-size 50G".
        docker_args="-v /home:/home -v /data:/data -v /mnt:/mnt -v /etc/aws:/etc/aws --shm-size 50G "
        + docker_args,
        **kwargs)

===============================================

if args.clearml_taskname is not None and args.clearml_execute is not None:
    args_except_execute = {k: v for k, v in vars(args).items() if k != "clearml_execute"}
    task = create_clearml_task(
        project_name=project_name,
        task_name=args.clearml_taskname,
        script="train.py",
        args=args_except_execute,
        docker_image_name=docker_img,
        requirements_file=requirements_file,
        add_task_init_call=False)
    task.connect(config_dict)
    Task.enqueue(task, queue_name=args.clearml_execute)
    sys.exit(0)

# inside main:
task = Task.init(project_name, clearml_taskname)
task.connect(config_dict)

I import Task from clearml, and I also use PyTorch Lightning's TensorBoardLogger.

  				
Posted 2 years ago by VivaciousCoyote85

The easiest way to understand what's going on is to open your browser's Developer Tools (F12) while trying to load the scalars, and share the contents of the Network section.

  				
Posted 2 years ago by SuccessfulKoala55

I see.

  				
Posted 2 years ago by VivaciousCoyote85

Is there a way to retrieve ClearML error logs for situations like this?

  				
Posted 2 years ago by VivaciousCoyote85
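Besides the browser's Network panel, one way to check what the server actually holds is to query the task's scalars from Python. The following is only a sketch, not an official diagnostic: `summarize_scalars` and `fetch_scalar_summary` are hypothetical helper names, and it assumes `Task.get_reported_scalars()` returns a nested `{title: {series: {"x": [...], "y": [...]}}}` dict, which is how the ClearML SDK documents it. If the last `x` value stops well before the end of training, the scalars probably never reached the server; if the data is complete, the problem is more likely on the UI or server side.

```python
import sys
from typing import Dict, List, Optional, Tuple


def summarize_scalars(
    scalars: Dict[str, Dict[str, Dict[str, list]]]
) -> List[Tuple[str, str, int, Optional[float]]]:
    """Return (title, series, point_count, last_x) for every reported series."""
    rows = []
    for title, series_map in sorted(scalars.items()):
        for series, points in sorted(series_map.items()):
            xs = points.get("x", [])
            rows.append((title, series, len(xs), xs[-1] if xs else None))
    return rows


def fetch_scalar_summary(task_id: str):
    """Fetch the scalars stored on the server for one task and summarize them."""
    # Imported lazily so summarize_scalars() is usable without clearml installed.
    from clearml import Task

    task = Task.get_task(task_id=task_id)
    # Assumption: the returned dict has the nested title/series/x/y layout.
    return summarize_scalars(task.get_reported_scalars())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_scalars.py <task_id>
    for row in fetch_scalar_summary(sys.argv[1]):
        print(row)
```

The task ID is shown in the experiment's page in the web UI; the script name in the usage comment is a placeholder.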

I don't think you can connect to a task that was not created using Task.init()

  				
Posted 2 years ago by SuccessfulKoala55
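If connecting a config to a task built with Task.create() is indeed the issue, one possible alternative is to create the task with Task.init(), connect the config, and then hand the run over to a queue with `task.execute_remotely()`. This is a minimal sketch under assumptions, not the poster's actual script: the project name, the `--lr` argument, and the shape of `config_dict` are placeholders, and whether this resolves the scalar-loading problem is not established in this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument names mirror the thread; --lr is a hypothetical training option."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--clearml_taskname", default=None)
    parser.add_argument("--clearml_execute", default=None, help="queue to run on")
    parser.add_argument("--lr", type=float, default=1e-3)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)

    # Imported lazily so the parser can be exercised without clearml installed.
    from clearml import Task

    # "my_project" is a placeholder project name.
    task = Task.init(
        project_name="my_project",
        task_name=args.clearml_taskname or "debug",
    )

    config_dict = {"lr": args.lr}  # placeholder for the real config
    task.connect(config_dict)

    if args.clearml_execute is not None:
        # Stops the local process and enqueues this task on the given queue.
        # When the agent later runs the script, this call is skipped.
        task.execute_remotely(queue_name=args.clearml_execute, exit_process=True)

    # ... training code would follow here ...
```

Call `main()` from the script's entry point. Docker image and arguments would then need to be set on the task or the agent separately, which this sketch does not cover.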

Can you perhaps share a code example of how your code starts and what it imports?

  				
Posted 2 years ago by SuccessfulKoala55

Sorry, not quite sure I understand. I am calling Task.init inside main. My plots load on ClearML correctly for the first few hours or so, but freeze after that.

  				
Posted 2 years ago by VivaciousCoyote85

Hi @VivaciousCoyote85, is this something new, or do all experiments behave this way? Do you see the console logs? Can you share how your code runs Task.init()?

  				
Posted 2 years ago by SuccessfulKoala55
