but the call used to start the script was python -m module.name --args
actually no
hmm, are those packages correct?
we do use all those packages, and the version numbers are correct
Could it be that pandas was not installed on the local machine?
in the main script, these are the first imports:
` import argparse
import time
import json
import pytorch_lightning as pl
from pytorch_lightning.accelerators import accelerator `
then after that we import stuff from the repo, and the listed packages are imported in those files
` $ conda list | grep pandas
geopandas         0.9.0    pyhd8ed1ab_1    conda-forge
geopandas-base    0.9.0    pyhd8ed1ab_1    conda-forge
pandas            1.3.3    py39hde0f152_0  conda-forge `
(also, the training code, which uses pandas, worked)
` $ pip freeze | grep pandas
geopandas @ file:///home/conda/feedstock_root/build_artifacts/geopandas_1623249625470/work
pandas==1.3.3 `
` $ conda list | grep matplotlib
matplotlib        3.4.3    py39hf3d152e_1  conda-forge
matplotlib-base   3.4.3    py39h2fa2bec_1  conda-forge `
I can't seem to find a difference between the two; why would matplotlib get listed and pandas not... Is any other package missing?
BTW: as an immediate "hack", before your Task.init call add the following:
Task.add_requirements("pandas")
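For context, a minimal sketch of that hack (project/task names below are placeholders, not taken from this script):
` from clearml import Task

# force the missing package into the task's requirements;
# this must be called before Task.init()
Task.add_requirements("pandas")  # or pin it: Task.add_requirements("pandas", "1.3.3")

task = Task.init(project_name="examples", task_name="training")  # placeholder names `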
BTW: how is torch missing from the listing? Do you have "import torch" in the code?
yeah, it’s in one of the imports from the repo
and it’s in the “installed packages” from the child task:
` absl-py==0.14.0
aiohttp==3.7.4.post0
async-timeout==3.0.1
attrs==21.2.0
cachetools==4.2.2
certifi==2021.5.30
chardet==4.0.0
charset-normalizer==2.0.6
clearml==1.1.1
cycler==0.10.0
Cython==0.29.24
fsspec==2021.9.0
furl==2.1.2
future==0.18.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
grpcio==1.40.0
idna==3.2
joblib==1.0.1
jsonschema==3.2.0
kiwisolver==1.3.2
Markdown==3.3.4
matplotlib==3.4.3
multidict==5.1.0
numpy==1.21.2
oauthlib==3.1.1
orderedmultidict==1.0.1
packaging==21.0
pathlib2==2.3.6
Pillow==8.3.2
protobuf==3.18.0
psutil==5.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.1
PyJWT==2.1.0
pyparsing==2.4.7
pyrsistent==0.18.0
python-dateutil==2.8.2
pytorch-lightning==1.4.8
PyYAML==5.4.1
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-learn==0.24.2
scipy==1.7.1
six==1.16.0
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
threadpoolctl==2.2.0
torch==1.9.1
torchmetrics==0.5.1
tqdm==4.62.3
typing-extensions==3.10.0.2
urllib3==1.26.7
Werkzeug==2.0.1
yarl==1.6.3 `
I think the main issue is running with python -m module.name --args
Which is a bit different when trying to "understand" what the actual repository is.
Can you try to run it from the repository folder? (same command, just to see if it has any effect on the detected packages)
and it’s in the “installed packages” from the child task:
This is because the agent always updates back the full venv setup, so you will be able to always reproduce the entire thing (as opposed to dev time, where it lists only the directly imported packages)
that must have been it. here’s the installed packages when not using -m:
` # Python 3.9.7 | packaged by conda-forge | (default, Sep 23 2021, 07:28:37) [GCC 9.4.0]
Local modules found - skipping:
modulename == ../pathto/modulename/__init__.py
PyYAML == 5.4.1
Shapely == 1.7.1
clearml == 1.1.1
click == 7.1.2
matplotlib == 3.4.3
numpy == 1.21.2
pandas == 1.3.3
python_dateutil == 2.8.2
pytorch_lightning == 1.4.8
pytz == 2021.1
rasterio == 1.2.8
scikit_image == 0.18.3
scikit_learn == 0.24.2
scipy == 1.7.1
tensorboard == 2.6.0
torch == 1.9.1
torchvision == 0.2.2
tqdm == 4.62.3 `
i’ll clone and enqueue, but i’m guessing that’s the issue
getting different issues (torchvision vs. cuda compatibility, will work on that), but i’m betting that was the issue
(torchvision vs. cuda compatibility, will work on that),
The agent will pull the correct torch based on the cuda version that is available at runtime (or configured via the clearml.conf)
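If you ever need to pin it explicitly, the relevant options live in the agent section of clearml.conf; a sketch with illustrative values (not taken from this setup):
` agent {
    # cuda / cudnn versions used when resolving the torch wheel
    # (when left commented out, the agent auto-detects them at runtime)
    cuda_version: 11.1
    cudnn_version: 8.0
} `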
that must have been it. here’s the installed packages when not using -m:
Hmm yes, can you open a GitHub issue on that? (this seems like a bug)
BTW: could it be that Task.init is not called in the "module.name" entry point, but somewhere internally?
actually yes, Task.init is called inside a class in one of the internal imports
okay, i have a few things on my todo list, they will take a while. we will call Task.init in the entry point instead of how it’s done now, and we will re-try python -m. if it doesn’t work, we will file an issue. if it does work, yay!
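For reference, a minimal sketch of that planned layout; module, project, and function names (mypackage, run_training, etc.) are hypothetical placeholders:
` # mypackage/train.py  --  launched as: python -m mypackage.train --args
import argparse

from clearml import Task

from mypackage.trainer import run_training  # internal import, no Task.init in there anymore


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    # Task.init now lives in the entry-point module itself,
    # instead of inside a class in an internal import
    Task.init(project_name="my-project", task_name="training")

    run_training(args)


if __name__ == "__main__":
    main() `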
either way, thanks much for your help today, i really appreciate it.
Sounds great! let me know what you find out 🙂
okay, so here’s what i found out:
calling the training entry point directly (e.g. /path/to/train.py), and not instantiating the clearml Task in train.py (e.g. calling a method in a different module where the task is instantiated), does work.
calling the entrypoint with python -m, but instantiating the clearml Task within train.py, also works.
so the only thing that doesn’t work is calling the entrypoint with python -m and calling a method from a different module that instantiates the task.
not sure if this is considered a bug or not! but I’d happily make an issue on github if needed.
thanks again for all your help.
not sure if this is considered a bug or not! but I’d happily make an issue on github if needed.
I think we should, at least for the sake of transparency and visibility 🙂
thanks again for all your help.
My pleasure 🙂