but the call used to start the script was python -m module.name --args
actually no
hmm, are those packages correct?
we do use all those packages, and the version numbers are correct
Could it be that pandas was not installed on the local machine?
in the main script, these are the first imports:
` import argparse
import time
import json
import pytorch_lightning as pl
from pytorch_lightning.accelerators import accelerator `
then after that we import stuff from the repo, and the listed packages are imported in those files
` $ conda list | grep pandas
geopandas         0.9.0    pyhd8ed1ab_1    conda-forge
geopandas-base    0.9.0    pyhd8ed1ab_1    conda-forge
pandas            1.3.3    py39hde0f152_0  conda-forge `
(also, the training code, which uses pandas, worked)
` $ pip freeze | grep pandas
geopandas @ file:///home/conda/feedstock_root/build_artifacts/geopandas_1623249625470/work
pandas==1.3.3 `
` $ conda list | grep matplotlib
matplotlib        3.4.3    py39hf3d152e_1  conda-forge
matplotlib-base   3.4.3    py39h2fa2bec_1  conda-forge `
I can't seem to find a difference between the two; why would matplotlib get listed and pandas not... Is any other package missing?
BTW: as an immediate "hack", before your Task.init call add the following:
Task.add_requirements("pandas")
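For context, a minimal sketch of that hack (project/task names below are placeholders, not taken from this script):
` from clearml import Task

# force the missing package into the task's requirements;
# this must be called before Task.init()
Task.add_requirements("pandas")  # or pin it: Task.add_requirements("pandas", "1.3.3")

task = Task.init(project_name="examples", task_name="training")  # placeholder names `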
BTW: how is torch missing from the listing? Do you have "import torch" in the code?
yeah, it’s in one of the imports from the repo
and it’s in the “installed packages” from the child task:
` absl-py==0.14.0
aiohttp==3.7.4.post0
async-timeout==3.0.1
attrs==21.2.0
cachetools==4.2.2
certifi==2021.5.30
chardet==4.0.0
charset-normalizer==2.0.6
clearml==1.1.1
cycler==0.10.0
Cython==0.29.24
fsspec==2021.9.0
furl==2.1.2
future==0.18.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
grpcio==1.40.0
idna==3.2
joblib==1.0.1
jsonschema==3.2.0
kiwisolver==1.3.2
Markdown==3.3.4
matplotlib==3.4.3
multidict==5.1.0
numpy==1.21.2
oauthlib==3.1.1
orderedmultidict==1.0.1
packaging==21.0
pathlib2==2.3.6
Pillow==8.3.2
protobuf==3.18.0
psutil==5.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.1
PyJWT==2.1.0
pyparsing==2.4.7
pyrsistent==0.18.0
python-dateutil==2.8.2
pytorch-lightning==1.4.8
PyYAML==5.4.1
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-learn==0.24.2
scipy==1.7.1
six==1.16.0
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
threadpoolctl==2.2.0
torch==1.9.1
torchmetrics==0.5.1
tqdm==4.62.3
typing-extensions==3.10.0.2
urllib3==1.26.7
Werkzeug==2.0.1
yarl==1.6.3 `
I think the main issue is running with python -m module.name --args
Which is a bit different when trying to "understand" what the actual repository is.
Can you try to run it from the repository folder? (same command, just to see if it has any effect on the detected packages)
and it’s in the “installed packages” from the child task:
This is because the agent always updates back the full venv setup, so you will be able to always reproduce the entire thing (as opposed to dev time, where it lists only the directly imported packages)
that must have been it. here’s the installed packages when not using -m:
` # Python 3.9.7 | packaged by conda-forge | (default, Sep 23 2021, 07:28:37) [GCC 9.4.0]
Local modules found - skipping:
modulename == ../pathto/modulename/__init__.py
PyYAML == 5.4.1
Shapely == 1.7.1
clearml == 1.1.1
click == 7.1.2
matplotlib == 3.4.3
numpy == 1.21.2
pandas == 1.3.3
python_dateutil == 2.8.2
pytorch_lightning == 1.4.8
pytz == 2021.1
rasterio == 1.2.8
scikit_image == 0.18.3
scikit_learn == 0.24.2
scipy == 1.7.1
tensorboard == 2.6.0
torch == 1.9.1
torchvision == 0.2.2
tqdm == 4.62.3 `
i’ll clone and enqueue, but i’m guessing that’s the issue
getting different issues (torchvision vs. cuda compatibility, will work on that), but i’m betting that was the issue
(torchvision vs. cuda compatibility, will work on that),
The agent will pull the correct torch based on the cuda version that is available at runtime (or configured via the clearml.conf)
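If you ever need to pin it explicitly, the relevant options live in the agent section of clearml.conf; a sketch with illustrative values (not taken from this setup):
` agent {
    # cuda / cudnn versions used when resolving the torch wheel
    # (when left commented out, the agent auto-detects them at runtime)
    cuda_version: 11.1
    cudnn_version: 8.0
} `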
that must have been it. here’s the installed packages when not using -m:
Hmm yes, can you open a GitHub issue on that? (this seems like a bug)
BTW: could it be that Task.init is not called in the "module.name" entry point, but somewhere internally?
actually yes, Task.init is called inside a class in one of the internal imports
okay, i have a few things on my todo list, they will take a while. we will call Task.init in the entry point instead of how it’s done now, and we will re-try python -m. if it doesn’t work, we will file an issue. if it does work, yay!
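For reference, a minimal sketch of that planned layout; module, project, and function names (mypackage, run_training, etc.) are hypothetical placeholders:
` # mypackage/train.py  --  launched as: python -m mypackage.train --args
import argparse

from clearml import Task

from mypackage.trainer import run_training  # internal import, no Task.init in there anymore


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    # Task.init now lives in the entry-point module itself,
    # instead of inside a class in an internal import
    Task.init(project_name="my-project", task_name="training")

    run_training(args)


if __name__ == "__main__":
    main() `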
either way, thanks much for your help today, i really appreciate it.
Sounds great! let me know what you find out 🙂
okay, so here’s what i found out:
calling the training entry point directly (e.g. /path/to/train.py), and not instantiating the clearml Task in train.py (e.g. calling a method in a different module where the task is instantiated), does work.
calling the entrypoint with python -m, but instantiating the clearml Task within train.py, also works.
so the only thing that doesn’t work is calling the entrypoint with python -m and calling a method from a different module that instantiates the task.
not sure if this is considered a bug or not! but I’d happily make an issue on github if needed.
thanks again for all your help.
not sure if this is considered a bug or not! but I’d happily make an issue on github if needed.
I think we should, at least for the sake of transparency and visibility 🙂
thanks again for all your help.
My pleasure 🙂