I Have Open Source Version Of Clearml-Server 1.4.0 Installed In Our K8S Cluster. I Was Manually Cleaning Up Old Experiments Through Ui. For That I Have Selected All Experiments In A Single Subproject, And Archived Them. Then Went To Archive And Deleted Th

Answered

I have the open source version of clearml-server 1.4.0 installed in our k8s cluster. I was manually cleaning up old experiments through the UI. To do that, I selected all experiments in a single subproject and archived them, then went to the archive and deleted them. A bit later I saw that ALL of the experiments in ALL of the projects of ALL teams were gone. I thought that maybe I had done the archive+delete in “All projects”, but I checked the logs and I see POST /api/v2.18/tasks.delete_many only for the intended project.

  				
Posted 2 years ago
FiercePenguin76


Answers 30

task_trash_trash is probably irrelevant, as the latest entry there is from Dec 2021

  				
FiercePenguin76

The server's code only has a reference to the trash collection when deleting tasks, nowhere else 😮

  				
SuccessfulKoala55

and my problem occurred right after I tried to delete ~1.5K tasks from a single subproject

  				
FiercePenguin76

OK, that's a hint... I'll try to look at the code with that in mind

  				
SuccessfulKoala55

I’m rather sure that after a restart everything will be back to normal. Do you want me to invoke something via the SDK or REST while the server is still in this state?

  				
FiercePenguin76

nothing I can think of

  				
SuccessfulKoala55

but if you can check with me tomorrow before restarting that would be cool - I might think of something...

  				
SuccessfulKoala55

FiercePenguin76 one question - did you change by any chance anything related to the way gunicorn is spawning processes / threads when launching the apiserver pods?

  				
SuccessfulKoala55

need to check with infra engineers

  				
FiercePenguin76

DisgustedDove53 ?

  				
FiercePenguin76

(this smells like a threading issue)

  				
SuccessfulKoala55

FiercePenguin76 I have a theory that this is caused by a thread-safety issue - the apiserver code-base is not designed to run in multiple threads right now, and scaling is handled by processes. Enabling threads in gunicorn may in theory cause this exact behavior

  				
SuccessfulKoala55
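
To make the thread-safety theory above concrete, here is a minimal, purely illustrative sketch. It is not ClearML's code: `TaskModel`, `use_collection` and the collection names are hypothetical stand-ins. It only shows how state kept on a class (shared by all threads in one process) can leak from one request into another when a server runs threads instead of separate processes.

```python
import threading
from contextlib import contextmanager


class TaskModel:
    """Hypothetical stand-in for a model bound to one database collection."""
    collection = "task"  # class-level state, shared by every thread in the process


@contextmanager
def use_collection(name):
    """Temporarily point the model at another collection (not thread-safe)."""
    previous = TaskModel.collection
    TaskModel.collection = name
    try:
        yield
    finally:
        TaskModel.collection = previous


def demo():
    inside_delete = threading.Event()
    reader_done = threading.Event()
    seen = []

    def deleter():
        # Simulates a long mass delete that works against the trash collection.
        with use_collection("trash"):
            inside_delete.set()
            reader_done.wait(timeout=5)

    def reader():
        # Simulates an unrelated request that expects to read from "task".
        inside_delete.wait(timeout=5)
        seen.append(TaskModel.collection)
        reader_done.set()

    threads = [threading.Thread(target=deleter), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # (what the reader saw mid-delete, the value after everything finished)
    return seen[0], TaskModel.collection


if __name__ == "__main__":
    print(demo())
```

In this sketch the reader sees `"trash"` while the delete is in flight, even though it never asked for it. With one process per worker, each worker has its own copy of the class and the leak cannot cross requests. If two such blocks overlap, the second one can save the already-switched value as its "previous" and restore it last, leaving the shared value stuck until the process restarts.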

we certainly modified some deployment conf, but let's wait for answers tomorrow

  				
FiercePenguin76

SuccessfulKoala55 any ideas or should we restart?

  				
FiercePenguin76

I'm now pretty sure multi-threading is the reason for this issue, and that restarting will solve it (but you will still need to move your new tasks from the trash collection to the normal collection).
However, I would like to understand the deployment changes you made, since if you do not fix them, this might happen again...

  				
SuccessfulKoala55

SuccessfulKoala55 Hi Jake
We didn’t change anything related to gunicorn. Is there any specific thing I can check for?
Also, I noticed that it’s not running gunicorn as a command but loads it in the Python code, so I don’t think it’s possible to change the threading via env vars that way.

  				
DisgustedDove53

we’re running it with the older helm chart if that matters. anyways I can’t see anything related to Gunicorn in chart or configs.

  				
DisgustedDove53

That's strange... Can you perhaps share the env vars passed to the apiserver deployment?

  				
SuccessfulKoala55

- name: CLEARML__APISERVER__PRE_POPULATE__ENABLED
  value: "false"
- name: CLEARML__APISERVER__PRE_POPULATE__ZIP_FILES
  value: /opt/clearml/db-pre-populate
- name: CLEARML_SERVER_DEPLOYMENT_TYPE
  value: helm-cloud

The rest are clearly credentials…

  				
DisgustedDove53

So apparently it's possible threading is turned on by default (at least for specific Flask versions), so that's probably it

  				
SuccessfulKoala55

This is indeed a vulnerability and we'll fix that as soon as possible

  				
SuccessfulKoala55

I assume it's only triggered in mass deletes for now, so your options are either to wait for a patch server version (a new version is about to be released, so we'll either make it to this version or push a patch version immediately after), or to change your deployment to use gunicorn (which is a change of behavior, I know)

  				
SuccessfulKoala55

I think we can live without mass deleting for a while

  				
FiercePenguin76

OK, we're working on a fix, stay tuned for either v1.5.0 or v1.5.1 🙂

  				
SuccessfulKoala55

I expect it to be released in the next few days

  				
SuccessfulKoala55

A restart of clearml-server helped, as expected. Now we see all experiments (except for those that were written into task__trash during the “dark times”)

  				
FiercePenguin76

You can simply move them from the task_trash collection to the task collection 🙂

  				
SuccessfulKoala55

Although you need to make sure you won't move experiments that actually belong in the trash 🙂

  				
SuccessfulKoala55

I think their ID will provide a clue

  				
SuccessfulKoala55
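
The move suggested above could be sketched roughly as follows. This is a hedged sketch, not an official ClearML procedure: `restore_from_trash` is a hypothetical helper, and it assumes `trash` and `tasks` are pymongo `Collection` objects (or anything exposing `find`, `insert_one` and `delete_one`) for the trash and task collections. The `query` that separates wrongly trashed experiments from ones that really belong in the trash is left to the caller, and the default is a dry run.

```python
def restore_from_trash(trash, tasks, query, dry_run=True):
    """Move documents matching `query` from `trash` to `tasks`.

    Returns the list of _id values that were (or, in a dry run, would be) moved.
    """
    moved = []
    # Materialize the matches first so we do not modify `trash` while iterating.
    for doc in list(trash.find(query)):
        moved.append(doc["_id"])
        if dry_run:
            continue
        # Insert first, delete second: a failure in between leaves a duplicate
        # in the trash rather than losing the document.
        tasks.insert_one(doc)
        trash.delete_one({"_id": doc["_id"]})
    return moved
```

It is probably worth backing up the database and reviewing the IDs returned by a dry run before calling it again with `dry_run=False`.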

we’ll see, thanks for your help!

  				
FiercePenguin76
