not sure if this is considered a bug or not! but I’d happily make an issue on github if needed.
I think we should, at least for the sake of transparency and visibility 🙂
thanks again for all your help.
My pleasure 🙂
okay, so here’s what I found out:
calling the training entry point directly (e.g. /path/to/train.py), and not instantiating the ClearML Task in train.py (e.g. calling a method in a different module where the task is instantiated), does work.
calling the entry point with python -m, but instantiating the ClearML Task within train.py, also works.
so the only thing that doesn’t work is calling the entry point with python -m and calling a method from a different module that instantiates the task.
Sounds great! let me know what you find out 🙂
okay, I have a few things on my todo list, and they will take a while. we will call Task.init in the entry point instead of how it’s done now, and we will retry python -m. if it doesn’t work, we will file an issue. if it does work, yay!
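A minimal sketch of what "Task.init in the entry point" could look like. The file layout, the `--epochs` flag, and the project and task names are placeholders made up for illustration, not taken from the actual repository; the ClearML import is done inside `main` only so the sketch can be read without ClearML installed.

```python
"""train.py: hypothetical entry point, launched as `python -m module.name --args`."""
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)  # hypothetical flag
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Create the task here, in the entry point itself, rather than inside
    # a class that lives in one of the internal imports.
    from clearml import Task
    task = Task.init(project_name="my_project", task_name="train")  # placeholder names

    # Internal code would receive `task` instead of creating its own.
    return task, args
```

In a real train.py, `main()` would be called under the usual `if __name__ == "__main__":` guard.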
either way, thanks much for your help today, i really appreciate it.
actually yes: Task.init is called inside of a class in one of the internal imports
BTW: could it be that Task.init is not called in the "module.name" entry point, but somewhere internally?
Hmm yes, can you open a GitHub issue on that? (this seems like a bug)
The agent will pull the correct torch based on the cuda version that is available at runtime (or configured via the clearml.conf)
getting different issues (torchvision vs. cuda compatibility, will work on that), but i’m betting that was the issue
i’ll clone and enqueue, but i’m guessing that’s the issue
that must have been it. here’s the installed packages when not using -m:
` # Python 3.9.7 | packaged by conda-forge | (default, Sep 23 2021, 07:28:37) [GCC 9.4.0]
Local modules found - skipping:
modulename == ../pathto/modulename/init.py
PyYAML == 5.4.1
Shapely == 1.7.1
clearml == 1.1.1
click == 7.1.2
matplotlib == 3.4.3
numpy == 1.21.2
pandas == 1.3.3
python_dateutil == 2.8.2
pytorch_lightning == 1.4.8
pytz == 2021.1
rasterio == 1.2.8
scikit_image == 0.18.3
scikit_learn == 0.24.2
scipy == 1.7.1
tensorboard == 2.6.0
torch == 1.9.1
torchvision == 0.2.2
tqdm == 4.62.3 `
This is because the agent always updates back the full venv setup, so you will be able to always reproduce the entire thing (as opposed to dev time, where it lists only the directly imported packages)
I think the main issue is running with python -m module.name --args, which is a bit different when trying to "understand" what the actual repository is.
Can you try to run it from the repository folder (same command, just to see if it will have any effect on the detected packages)?
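To illustrate why the two launch styles look different from inside the process, here is a small sketch using only the standard library. It is not how ClearML detects the repository; it only shows that Python itself records the launch differently: when started with -m, the `__main__` module typically carries a `__spec__` with the dotted module name, while a plain script run leaves `__spec__` as None. The helper name `launch_mode` is made up for this example.

```python
import sys


def launch_mode(main_module=None):
    """Return ("module", dotted_name) for `python -m ...`, else ("script", None)."""
    if main_module is None:
        main_module = sys.modules["__main__"]
    spec = getattr(main_module, "__spec__", None)
    if spec is None:
        # Typical for `python /path/to/train.py`.
        return ("script", None)
    # Typical for `python -m module.name`: spec.name is the dotted module name.
    return ("module", spec.name)
```

Calling `launch_mode()` at the top of the entry point and printing the result (together with `sys.argv[0]` and `os.getcwd()`) is a quick way to compare the two runs.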
and it’s in the “installed packages” from the child task:
absl-py==0.14.0
aiohttp==3.7.4.post0
async-timeout==3.0.1
attrs==21.2.0
cachetools==4.2.2
certifi==2021.5.30
chardet==4.0.0
charset-normalizer==2.0.6
clearml==1.1.1
cycler==0.10.0
Cython==0.29.24
fsspec==2021.9.0
furl==2.1.2
future==0.18.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
grpcio==1.40.0
idna==3.2
joblib==1.0.1
jsonschema==3.2.0
kiwisolver==1.3.2
Markdown==3.3.4
matplotlib==3.4.3
multidict==5.1.0
numpy==1.21.2
oauthlib==3.1.1
orderedmultidict==1.0.1
packaging==21.0
pathlib2==2.3.6
Pillow==8.3.2
protobuf==3.18.0
psutil==5.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.1
PyJWT==2.1.0
pyparsing==2.4.7
pyrsistent==0.18.0
python-dateutil==2.8.2
pytorch-lightning==1.4.8
PyYAML==5.4.1
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-learn==0.24.2
scipy==1.7.1
six==1.16.0
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
threadpoolctl==2.2.0
torch==1.9.1
torchmetrics==0.5.1
tqdm==4.62.3
typing-extensions==3.10.0.2
urllib3==1.26.7
Werkzeug==2.0.1
yarl==1.6.3
yeah, it’s in one of the imports from the repo
BTW: how is it missing torch from the list? Do you have "import torch" in the code?
I can't seem to find a difference between the two. Why would matplotlib get listed while pandas does not? Is any other package missing?
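One way to answer "is any other package missing?" without eyeballing the two lists is a small comparison script. This is only a sketch: the function names are made up, and it assumes both lists are pasted as "name == version" or "name==version" lines. Names are normalized the way pip does (lowercase, with runs of "-", "_" and "." treated as the same separator), so scikit_learn and scikit-learn count as one package.

```python
import re


def normalize(name):
    # Same normalization pip applies to project names (PEP 503).
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirements(text):
    """Map normalized package name -> version for every pinned line in text."""
    packages = {}
    for line in text.splitlines():
        line = line.strip().strip("`").strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = (part.strip() for part in line.split("==", 1))
        if "/" in version:
            # Local module entries point at a path, not a version.
            continue
        packages[normalize(name)] = version
    return packages


def missing_from(reference, other):
    """Packages present in reference but absent from other."""
    return sorted(set(reference) - set(other))


dev_list = "pandas == 1.3.3\nscikit_learn == 0.24.2\ntorch == 1.9.1"
child_list = "scikit-learn==0.24.2\ntorch==1.9.1"
print(missing_from(parse_requirements(dev_list), parse_requirements(child_list)))
# prints ['pandas']
```

Swapping the two arguments of `missing_from` lists what the second run has that the first one lacks.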
BTW: as an immediate "hack", before your Task.init call add the following: Task.add_requirements("pandas")
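Spelled out, the workaround could look like the sketch below. The wrapper function, its arguments, and the project and task names are hypothetical; the only parts taken from the thread are that `Task.add_requirements` has to run before `Task.init`. The import sits inside the function so the snippet can be read without ClearML installed.

```python
def init_task_with_requirements(project_name, task_name, extra_packages):
    """Register packages the automatic detection missed, then create the task."""
    from clearml import Task

    for package in extra_packages:
        # Must happen before Task.init so the package lands in the
        # task's "installed packages" section.
        Task.add_requirements(package)

    return Task.init(project_name=project_name, task_name=task_name)
```

For the case discussed here the call would be something like `init_task_with_requirements("my_project", "train", ["pandas"])`, with the names replaced by the real ones.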
$ conda list | grep matplotlib
matplotlib 3.4.3 py39hf3d152e_1 conda-forge
matplotlib-base 3.4.3 py39h2fa2bec_1 conda-forge
$ pip freeze | grep pandas
geopandas @ file:///home/conda/feedstock_root/build_artifacts/geopandas_1623249625470/work
pandas==1.3.3
(also, the training code, which uses pandas, worked)
$ conda list | grep pandas
geopandas 0.9.0 pyhd8ed1ab_1 conda-forge
geopandas-base 0.9.0 pyhd8ed1ab_1 conda-forge
pandas 1.3.3 py39hde0f152_0 conda-forge
in the main script, these are the first imports:
import argparse
import time
import json
import pytorch_lightning as pl
from pytorch_lightning.accelerators import accelerator
then after that we import stuff from the repo, and the listed packages are imported in those files
Could it be that pandas was not installed on the local machine?
we do use all those packages, and the version numbers are correct
actually no
hmm, are those packages correct?
but, the call used to start the script was python -m module.name --args