CostlyOstrich36 Hi,
I have a question related to ClearML’s indexing mechanism for cached datasets. We noticed that when storing the dataset cache folder on an NFS (Network File System), running `clearml-data get` triggers a cache indexing process, which takes a significant amount of time. However, if we remove the NFS cache folder, the command runs almost instantly.
Could you explain how caching works in ClearML? Specifically:
- Why does ClearML perform `global` folder indexing before the script starts?
- Why does it index the dataset cache folder when executing `clearml-data get`?
- Is there an option to disable cache indexing or control its behavior to optimize performance, especially when using NFS?

Any insights or workarounds to speed up the process would be greatly appreciated.
Thanks!
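For reference, a minimal sketch of the equivalent operation through the Python SDK (the project and dataset names below are placeholders, not from this thread); `clearml-data get` materializes the dataset into the local cache directory, which is the folder that is slow to scan when it lives on NFS:

```python
# Rough SDK equivalent of `clearml-data get` -- a sketch, not ClearML internals.
from clearml import Dataset

# Resolve the dataset (placeholder project/name) and pull a local copy into the
# configured cache directory (sdk.storage.cache.default_base_dir in clearml.conf,
# ~/.clearml/cache by default).
ds = Dataset.get(
    dataset_project="my_project",  # placeholder
    dataset_name="my_dataset",     # placeholder
)
local_path = ds.get_local_copy()
print(local_path)
```

Keeping that cache directory on fast local disk rather than on the NFS mount avoids scanning the shared folder on every fetch.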
- Werkzeug==2.2.3
- xdoctest==1.0.2
- xgboost @ file:///rapids/xgboost-1.7.1-cp38-cp38-linux_x86_64.whl
- yarl @ file:///rapids/yarl-1.8.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- zict @ file:///rapids/zict-2.2.0-py2.py3-none-any.whl
- zipp==3.15.0
Environment setup completed successfully
Starting Task Execution:
2025-01-27 13:22:37 ClearML results page: files_server: None
2025-01-27 13:25:38 ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2025-01-27 14:08:44 Import libs
2025-01-27 14:08:49 Start Task
CostlyOstrich36 Fixed: it was a cache issue on NFS. However, we discovered an important detail: there were two folders in the cache, `datasets` and `global`. When we started the ClearML script, it began indexing the entire `global` folder, which was the reason the script got stuck. After mounting only the `datasets` folder, there was no delay anymore.
Do you know how to disable indexing? If we mount the `global` folder on all instances, it grows very fast, and each new task adds more results to index.
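Not an authoritative answer, but a sketch of two knobs that may help keep that general cache under control (treat the exact value below as an assumption): the cache base directory is set by `sdk.storage.cache.default_base_dir` in clearml.conf, so it can point at fast local disk instead of the NFS mount, and the SDK exposes a file-count limit for the download cache:

```python
# Sketch: cap how many downloaded files ClearML keeps in its download cache,
# so the cache folder does not grow unbounded across tasks. The limit value
# here is an arbitrary example.
from clearml import StorageManager

StorageManager.set_cache_file_limit(100)  # keep at most ~100 cached files
```

Whether this also prevents the per-run scan of the `global` folder is worth verifying; relocating that folder off NFS is the more direct workaround.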
Current configuration (clearml_agent v1.9.3, location: /tmp/clearml.conf):