Answered
Hi All

Hi all 😄

The hyperparameter tuner functionality has just stopped working. When I try to launch an instance of the tuner, it gets stuck at "This instance has been scheduled for launch. This may take a few moments." There are available agents in the queue, and nothing is being added to the queue, so as far as I can see it can't really be a problem on our end. I've cloned and re-run older tunings that previously worked, and they have the same issue. It's been this way for a few days now.

I'm using None with the pro plan. I'm launching the instances from the web client.

@<1780043419314294784:profile|LargeHamster21>

  
  
Posted 2 days ago

Answers 19


@<1545216070686609408:profile|EnthusiasticCow4> I believe you are correct. Can you try another optimization method while we look into this?

  
  
Posted 2 days ago

As far as I can tell, nothing else is running that isn't on our own hardware. Is there some way to see which application instances are active?

  
  
Posted 2 days ago

Thanks a lot for looking into this. Unfortunately, the HPO is still not working. Although the instance starts now, we are getting the error below in the console of the application instance. When we manually enqueue the same experiment it does work, which leads us to believe this is not an environment issue. It also happens for cloned runs that worked in the past. Could this be related to the previous issue?

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 506, in ser_default_class
    value = dict(vars(obj))  # make sure we can serialize anything that resembles a dict
                 ^^^^^^^^^
TypeError: vars() argument must have __dict__ attribute

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/site-packages/hpbandster/core/dispatcher.py", line 296, in job_runner
    worker.proxy.start_computation(self, job.id, **job.kwargs)
  File "/usr/local/lib/python3.11/site-packages/Pyro4/core.py", line 185, in __call__
    return self.__send(self.__name, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/Pyro4/core.py", line 437, in _pyroInvoke
    data, compressed = serializer.serializeCall(objectId, methodname, vargs, kwargs, compress=config.COMPRESSION)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/Pyro4/util.py", line 176, in serializeCall
    data = self.dumpsCall(obj, method, vargs, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/Pyro4/util.py", line 602, in dumpsCall
    return serpent.dumps((obj, method, vargs, kwargs), module_in_classname=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 69, in dumps
    return Serializer(indent, module_in_classname, bytes_repr).serialize(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 229, in serialize
    self._serialize(obj, out, 0)
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 255, in _serialize
    return self.dispatch[t](self, obj, out, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 319, in ser_builtins_tuple
    serialize(elt, out, level + 1)
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 255, in _serialize
    return self.dispatch[t](self, obj, out, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 392, in ser_builtins_dict
    serialize(value, out, level + 1)
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 255, in _serialize
    return self.dispatch[t](self, obj, out, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 392, in ser_builtins_dict
    serialize(value, out, level + 1)
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 274, in _serialize
    func(self, obj, out, level)
  File "/usr/local/lib/python3.11/site-packages/serpent.py", line 516, in ser_default_class
    raise TypeError("don't know how to serialize class " +
TypeError: don't know how to serialize class <class 'numpy.int64'>. Give it vars() or an appropriate __getstate__
  
  
Posted 2 days ago

From the logs it looks like the HPO application finds a worker from the queue, attempts to serialize the config sent to the worker, and crashes because of the version conflict with Pyro4. But I don't think we control any of that. I might be misunderstanding something. 🙃

  
  
Posted 2 days ago

Hi @<1523701435869433856:profile|SmugDolphin23>

I'm a bit confused by your suggestion. To be clear, these are the logs from the HPO application instance that is spun up when you start the HPO process. I don't think we have any control over which Python or Pyro version is used in the application instance. I think this error occurs before any code on our end is run.

  
  
Posted 2 days ago

Hi @<1545216070686609408:profile|EnthusiasticCow4> , on the PRO plan you are limited to a maximum number of parallel application instances. If you kill some running applications, your HPO application will start running.

  
  
Posted 2 days ago

Hey John, I'm Nathan's colleague; thanks for looking into this. It seems the application is starting now. What was the issue in the end?

  
  
Posted 2 days ago

I think it was a minor misconfiguration on one of the servers running the applications; an incorrect limit was set.

  
  
Posted 2 days ago

Interesting, checking on my account as well O:

  
  
Posted 2 days ago

Also, it does not kill the instance, it continues to send progress reports every 2 minutes. I added the full log in the attachment.

  
  
Posted 2 days ago

Not using it

  
  
Posted 2 days ago

I just checked the clearml.conf and I'm not specifying any Python version for the agents.

  
  
Posted 2 days ago

I'm rather confused as to what else would be running.

  
  
Posted 2 days ago

@<1545216070686609408:profile|EnthusiasticCow4> , can you check again please?

  
  
Posted 2 days ago

I'm curious as well

  
  
Posted 2 days ago

Multiple instances of the autoscaler maybe?

  
  
Posted 2 days ago

Hi @<1780043419314294784:profile|LargeHamster21> ! It looks like you are using Python 3.11 (agent.default_python=3.11), while Pyro4 is incompatible with this Python version: None
I would suggest downgrading the Python version or migrating to Pyro5.
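For anyone running their own agents and hitting this, the pin mentioned above lives in clearml.conf. A sketch only, based on the agent.default_python key quoted in this thread (the exact Pyro4-compatible version should be checked against the linked issue):

```
# clearml.conf — illustrative fragment, not a verified config
agent {
    # Pyro4 reportedly breaks on Python 3.11; pin an earlier interpreter
    default_python: "3.10"
}
```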

  
  
Posted 2 days ago

Yes, but only because you asked so nicely 😚

  
  
Posted 2 days ago

Thank you 😊

  
  
Posted 2 days ago