
Well, I did something on my end, it's magically working now
I basically moved the Task.init() call below the imports
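i.e. roughly this ordering (minimal sketch; project/task names are placeholders):

```python
# main.py -- Task.init() placed after the imports, before anything else runs
import argparse

from clearml import Task

# creating the task up front lets ClearML auto-connect argparse and
# framework outputs for the rest of the run
task = Task.init(project_name="imagery", task_name="train-classifier")

parser = argparse.ArgumentParser()
parser.add_argument("--training-run", default=None)
args = parser.parse_args()
```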
yea let me unwind some changes so I can pinpoint the issue
When I did Task.init() in train.py, the CLI arguments needed for main.py don't get captured and the script fails right away.
Note this is running with --skip-task-init, since train.py has Task.init()
So in summary: subprocess calls appear to break ClearML tracking, even if I do Task.init() in both main.py and train.py. However, the script does run end to end successfully. If I remove the subprocess calls, I only need Task.init() in main.py for everything to work (scalars, reporting, etc.).
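Roughly what removing the subprocess looks like on my side (sketch only; the imported entry point is a placeholder for the real train.py main function):

```python
# main.py -- call the training entry point in-process instead of
# spawning train.py via subprocess.Popen, so the single Task.init()
# here also captures the training scalars/outputs
from clearml import Task

from training_entry import run_training  # hypothetical import standing in for train.py's main

task = Task.init(project_name="imagery", task_name="train-classifier")

def main(train_config_path, training_run_id=None):
    # previously: subprocess.Popen([python_bin, exe, train_config_path, ...]).wait()
    run_training(train_config_path, training_run_id=training_run_id)
```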
Yea, I did something similar. I think the crux of the issue is the subprocess calls I removed.
If I do both, everything works, except then I lose ClearML tracking (scalars, outputs, etc.)
AgitatedDove14 SmugDolphin23 Would the following subprocess calls break the auto-connect to frameworks like TensorBoard?
```python
exe = f"sfi/imagery/models/{strategy_pipeline}/train.py"
cmd = ["/home/npuser/.clearml/venvs-builds/3.7/bin/python", exe, train_config_path]
if training_run_id:
    cmd += ["--training-run", str(training_run_id)]
logging.info("Training classifier with command:\n%s", " ".join(cmd))
returncode = subprocess.Popen(cmd).wait()
```
Note `/home/npuser...`
How would I do `os.fork()`? I'm not familiar with that
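For my own notes, I think the os.fork() shape would be roughly this (untested sketch; the imported entry point is a placeholder for train.py's main):

```python
import os

from train import main as train_main  # hypothetical import of train.py's entry point

def run_training_forked(train_config_path):
    pid = os.fork()  # clone the current process
    if pid == 0:
        # child: run the training code in-process, then exit without cleanup
        code = train_main(train_config_path)
        os._exit(code or 0)
    # parent: wait for the child and recover its exit status
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```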
It's a legacy code base. There were issues around GPU memory not being cleared when subprocesses were not used. At this point I've refactored out the subprocess calls, as they just add more complexity.
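If the GPU memory issue ever resurfaces, a spawned multiprocessing child might give the same isolation without shelling out to a separate interpreter (sketch; the entry point is a placeholder):

```python
import multiprocessing as mp

from train import main as train_main  # hypothetical train.py entry point

def run_training_isolated(train_config_path):
    # run training in a fresh "spawn" process so GPU memory is released
    # when the child exits, without a raw subprocess call
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=train_main, args=(train_config_path,))
    proc.start()
    proc.join()
    return proc.exitcode
```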
It seems like https://github.com/allegroai/clearml-helm-charts/blob/main/charts/clearml-agent/values.yaml#L72-L80 doesn't actually do anything as the values set here aren't applied in the agent template
It takes about 30 seconds here for that step
```
PYTHONPATH: /home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi:/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py:/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi/imagery/models/training::/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi:/usr/lib64/python37.zip:/usr/lib64/python3.7:/usr/lib64/python3.7/lib-dynload:/home/npuser/.clearml/venvs-builds/3.7/lib6...
```
Yea IDK, the git repo is a Python library. Is it possible to run something like `pip install -e .` so I can utilize the setup.py?
Yea I've done that already but I can do it again
```
SysPath: ['/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi/imagery/models/training', '/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py/sfi', '/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py', '/usr/lib64/python37.zip', '/usr/lib64/python3.7', '/usr/lib64/python3.7/lib-dynload', '/home/npuser/.clearml/venvs-builds/3.7/lib64/python3.7/site-packages', '/home/npuser/.clearml/venvs-builds/3.7/l...
```
Seems like it has everything I would need
If you look lower, it is there: `/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py`
For instance, in my repo I have a setup.py; how would I run `pip install -e .`?
When I run `from sfi.imagery import models`, it works fine locally, so the repo is set up for proper imports. But it fails in ClearML tasks.
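For reference, the kind of setup.py I have in mind (minimal sketch; the name/version are placeholders, only the find_packages part matters):

```python
# setup.py -- minimal sketch so `pip install -e .` exposes the sfi.* packages
from setuptools import find_packages, setup

setup(
    name="commons-imagery-models",  # placeholder name
    version="0.0.1",
    packages=find_packages(include=["sfi", "sfi.*"]),
)
```

Then running `pip install -e .` from the repo root should make `from sfi.imagery import models` resolve the same way it does locally.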
Note: `/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py` is the correct path
When I run this line locally, it works fine: `from sfi.imagery.models.chip_classifier.eval import eval_chip_classifier`
Not yet AgitatedDove14. Perhaps we can pair on this Monday.
I was hoping to use `docker_bash_setup_script`, but it didn't work when I ran `pip install -e .` in the respective script.
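Roughly what I put in it (sketch; the cd path is an assumption about where the repo ends up inside the container):

```bash
#!/bin/bash
# hypothetical docker bash setup script: install the repo in editable mode
set -e
cd /home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py  # assumed clone location
pip install -e .
```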
Seems related to this https://github.com/allegroai/clearml/issues/241
I used the values from the dashboard/configuration/api keys
Yep I updated those as well
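For reference, the block I filled in has roughly this shape (keys redacted; the server URLs shown are just the defaults and depend on the deployment):

```
api {
    web_server: https://app.clear.ml
    api_server: https://api.clear.ml
    files_server: https://files.clear.ml
    credentials {
        "access_key" = "REDACTED"
        "secret_key" = "REDACTED"
    }
}
```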