BattyLion34 if everything is installed and used to work, what's the difference from the previous run that worked?
(You can compare the working vs non-working runs in the UI and check the installed packages - it will highlight the diff. Maybe the answer is there.)
but the requirement was already satisfied.
I'm assuming it is satisfied in the host python environment. Do notice that the agent creates a new clean venv for each experiment. If you are not running in docker mode, you can tell the agent to inherit the venv it creates from the system environment (that will make sure that if object_detection is already installed in the host environment, it will be available to your code).
To do that change:
https://github.com/allegroai/clearml-agent/blob/81edd2860fbc09e2a179985d8315ffaba851dcd7/docs/clearml.conf#L57
system_site_packages: true
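For reference, the relevant snippet in clearml.conf would look something like this (a sketch based on the linked default config; surrounding keys omitted):

```
agent {
    package_manager: {
        # inherit the system python environment packages into the new venv
        system_site_packages: true,
    }
}
```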
BattyLion34 is this running with an agent ?
What's the comparison with a previously working Task (in terms of python packages) ?
AgitatedDove14 Just in case, I've created toy examples of the processes I'm running - one for classification, another for object detection. Maybe it will make clearer what I'm trying to get: https://gitlab.com/kuznip/clml_cl_toy , https://gitlab.com/kuznip/clml_od_toy .
AgitatedDove14
No, I do not use the --docker flag for clearml-agent. On Windows, setting system_site_packages to true allowed all stages in the pipeline to start - but it doesn't work on Linux. I've deleted the tfrecords from the master branch, committed the removal, and set the tfrecords folder to be ignored in .gitignore. I'm trying to find which changes are considered uncommitted. By cache files I mean the files in the folder C:\Users\Super\.clearml\vcs-cache - based on the error message, clearml tries to load the git repository first from that folder.
AgitatedDove14 I've set system_site_packages: true. Almost succeeded. The current pipeline has the following stages: 1) convert annotations from labelme into coco format, 2) convert annotations in coco format and the corresponding images to tfrecords, 3) run MASK RCNN training. The process previously failed on the second stage. After setting system_site_packages: true, the pipeline starts the third stage, but fails with a git issue:
diff --git a/work/tfrecord/test.record b/work/tfrecord/test.record
index e69de29..d168708 100644
Binary files a/work/tfrecord/test.record and b/work/tfrecord/test.record differ
diff --git a/work/tfrecord/train.record b/work/tfrecord/train.record
index e69de29..79f9768 100644
Binary files a/work/tfrecord/train.record and b/work/tfrecord/train.record differ
ERROR! Failed applying git diff, see diff above.
I've previously included those tfrecord files in the git repository. Now I've deleted the files, and set .gitignore to ignore files in the work/tfrecord folder. Additionally, I deleted the cached repositories in the system folder 'C:\Users\Super\.clearml\vcs-cache' - but the issue remains. Can you help me with this?
BTW, is there any way to automatically clean cached files?
Also, I just forgot to note that I'm running the clearml-agent and clearml processes in a virtual environment - a conda environment on Windows and venv on Linux.
AgitatedDove14 The fact is that I use docker for running the clearml server, both on Linux and Windows. When I run tasks one-by-one from the command line, they run OK - but in this case clearml doesn't create a venv and runs the tasks in the host environment. When I start tasks in a pipeline, clearml creates a venv for executing the tasks - that is where the issue arises.
AgitatedDove14 Set system_site_packages to true on Linux - getting the same error:
ERROR: Could not find a version that satisfies the requirement object_detection==0.1 (from -r /tmp/cached-reqsjhs2q2gm.txt (line 7)) (from versions: 0.0.3)
AgitatedDove14 git diff gives nothing - the current local repository is up-to-date with the gitlab origin.
Yes that is the git repository cache, you are correct. I wonder what happened there ?
So far my local and remote gitlab repositories are synchronized. I suspect that the 'Failed applying git diff, see diff above' error is caused by the cached repository from which clearml tries to run the process. I've cleaned the cache, but it hasn't helped.
The installed packages list is fully editable, like any requirements.txt, with the same formatting
Yes, but where can I find the file with the list of the packages to be installed?
AgitatedDove14
Linux: resetting the task in the UI and removing object_detection from the list of libraries to be installed for stage 2 (generating tfrecords) and stage 3 (training the nn) helped to pass stage 2 and start stage 3, where training crashed - it seems the system cannot import some files from the 'object_detection' folder.
Windows: I cannot store generated files as configuration on the Task - there are several files to be generated and some may be pretty large, up to a few gigs. It looks like the issues on Windows arise from the wrong logic of working with git - but I do not understand this logic clearly enough. Say, I have scriptA, which runs TaskA. This TaskA generates some data (different each time) which should be used by the subsequent TaskB, run by scriptB. Should this data be somehow automatically committed and pushed to the remote repository? In other words, should the pipeline be:
1. TaskA (generates data)
2. TaskB (uses data generated by TaskA)
or should the pipeline be:
1. TaskA (generates data)
2. commit and push data to the remote repository
3. TaskB (uses data generated by TaskA by downloading it from the remote repository)
?
Hi BattyLion34
script_a.py generates file test.json in the project folder
So let's assume "script_a" generates something and puts it under /tmp/my_data
Then it can create a dataset from the folder /tmp/my_data, with Dataset.create() -> Dataset.sync -> Dataset.upload -> Dataset.finalize
See example: https://github.com/alguchg/clearml-demo/blob/main/process_dataset.py
Then "script_b" can get a copy of the dataset using "Dataset.get()", see example:
https://github.com/alguchg/clearml-demo/blob/main/sklearn_example.py
BTW:
In the gitlab links, what is "script_a" and what is "script_b"?
Regarding the diff issue - I just found that the empty folder 'tfrecord', in which the tfrecords should be created, didn't exist in the gitlab origin repository. I added it there, then pulled from origin. Still having the diff issue, but I'll run a few trials to be sure there's nothing else to create the issue.
As for the "installed packages" list. To create a pipeline, I first run each stage (as a script) from cmd. After all the stages are created and can be seen in the UI, I run the pipeline. As far as I understand, clearml tracks each library called from the scripts and saves the list of these libraries somewhere (as I assume, this list is saved as a requirements.txt file somewhere - which is later loaded into the venv when the pipeline is running). Can I edit this file (just to comment out the row with "object_detection==0.1")?
BTW, regarding the object_detection library. My training scripts have calls like:
from object_detection import model_lib_v2
where object_detection is not an installed library, but the name of a folder where all my scripts, files and TF ObjectDetection API scripts are located. That's why I want to comment out the installation of 'object_detection' in the requirements file.
Hi BattyLion34
The windows issue seems like it is coming from missing QT installed on the Host machine
Check the pyqt5 version in your "Installed packages"
see here:
https://superuser.com/questions/1433913/qtpy-pythonqterror-no-qt-bindings-could-be-found
Regarding the linux, it seems you are missing the object_detection package - where do you usually install it from?
AgitatedDove14
Regarding Windows - pyqt5 is installed. That's the result of pip freeze:
PyQt5==5.15.2
pyqt5-plugins==5.15.2.2.1.0
PyQt5-sip==12.8.1
pyqt5-tools==5.15.2.3.0.2
Following your link, I've used the last advice and installed PySide2 (pip install PySide2) - I have Python 3.7.7. That didn't help, the issue is the same.
Regarding Linux, I've tried to install object_detection==0.1, but the requirement was already satisfied. I need to note that the whole project is placed in the "object_detection" folder of the tensorflow models ( https://github.com/tensorflow/models/tree/master/research/object_detection ) - thus I expect that all the necessary functions for training OD models will be accessible from the project's folder.
AgitatedDove14 In the "Results -> Console" tab of the UI, I see that the issue with running object detection on Linux is the following:
ERROR: Could not find a version that satisfies the requirement object_detection==0.1 (from -r /tmp/cached-reqsypv09bhw.txt (line 7)) (from versions: 0.0.3)
Is it possible to comment out the line object_detection==0.1? Actually, no such version of this or a similar library exists. I guess this requirement is not necessary. Can I turn off the installation of this library?
but instead, they cannot be run if the files they produce were not committed.
The thing with git, if you have new files and you did not add them, they will not appear in the git diff, hence missing when running from the agent. Does that sound like your case?
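This is easy to verify with plain git - a minimal sketch using a throwaway repo under /tmp (the paths and file names are illustrative, not from the project):

```shell
# A freshly generated file never passed to `git add` is invisible to `git diff`,
# so the agent's "uncommitted changes" patch will not contain it.
rm -rf /tmp/diff_demo && mkdir -p /tmp/diff_demo && cd /tmp/diff_demo
git init -q .
echo '{}' > tracked.json
git add tracked.json
git -c user.email=demo@example.com -c user.name=demo commit -qm "initial"
echo '{"generated": true}' > test.json   # produced by the script, never added
git diff > diff_out.txt                  # empty: untracked files are not diffed
git status --porcelain > status_out.txt  # but git does see it as '?? test.json'
```

So the script "runs fine" locally, while the agent clone never receives test.json.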
AgitatedDove14 Yes, it's running with an agent. I've updated clearml from version 0.17.4 to 0.17.5. Sorry, I didn't notice the other libraries which were automatically updated along with the new ClearML version.
However, is there any way to manipulate the packages which will be installed in the venv when running the pipeline? I've tried to run the pipeline on a Linux server (clearml v0.17.4) and got the following issue:
Requirement already satisfied: numpy==1.19.5 in /root/.clearml/venvs-builds.2/3.8/lib/python3.8/site-packages (from -r /tmp/cached-reqsjhs2q2gm.txt (line 6)) (1.19.5)
ERROR: Could not find a version that satisfies the requirement object_detection==0.1 (from -r /tmp/cached-reqsjhs2q2gm.txt (line 7)) (from versions: 0.0.3)
ERROR: No matching distribution found for object_detection==0.1 (from -r /tmp/cached-reqsjhs2q2gm.txt (line 7))
clearml_agent: ERROR: Could not install task requirements!
Command '['/root/.clearml/venvs-builds.2/3.8/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsjhs2q2gm.txt']' returned non-zero exit status 1.
Thanks for your help.
AgitatedDove14 Does it make any sense to change system_site_packages to true if I run clearml using Docker?
So I think this is a good example of pipelines and data:
Basically Task A generates data stored using clearml-data (see the Dataset class). The output of that is the ID of the Dataset. Then Task B uses that ID to retrieve the Dataset created by Task A.
Documentation: https://github.com/allegroai/clearml/blob/master/docs/datasets.md
Example:
Step A creating Dataset:
https://github.com/alguchg/clearml-demo/blob/main/process_dataset.py
Step B training model using the Dataset created in step A:
https://github.com/alguchg/clearml-demo/blob/main/sklearn_example.py
To automate the process, we could use a pipeline, but first we need to understand the manual workflow
So far my local and remote gitlab repositories are synchronized. I suspect that the 'Failed applying git diff, see diff above' error is caused by the cached repository from which clearml tries to run the process. I've cleaned the cache, but it hasn't helped.
Hmm can you test with empty "uncommitted changes" ?
Just making sure - when you say it still doesn't work, you are not trying to run the Task with the git diff that includes the binary data, right?
Yes, but where can I find the file with the list of the packages to be installed?
What do you mean? The list of packages is exactly what you see in "installed packages" - this is what the agent will install.
In Windows setting system_site_packages to true allowed all stages in the pipeline to start - but it doesn't work on Linux.
Notice that it will inherit from the system packages not the venv the agent is installed in
I've deleted the tfrecords from the master branch, committed the removal, and set the folder for tfrecords to be ignored in .gitignore. I'm trying to find which changes are considered uncommitted.
you can run git diff - it is essentially what will happen in the background
By cache files I mean the files in the folder C:\Users\Super\.clearml\vcs-cache - based on the error message, clearml tries to load the git repository first from that folder.
Yes that is the git repository cache, you are correct. I wonder what happened there ?
Is it possible to comment out the line object_detection==0.1? Actually, no such version of this or a similar library exists. I guess this requirement is not necessary. Can I turn off the installation of this library?
The installed packages list is fully editable, like any requirements.txt, with the same formatting :)
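For example, after resetting the Task, the "Installed packages" section could be edited in the UI to comment out the bogus requirement (the numpy pin is taken from the log above; the rest of the list is omitted):

```
numpy==1.19.5
# object_detection==0.1   <- not a real PyPI package; it is the local repo folder
```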
Hmm, so if I understand what's going on: convert_test.py needs to have the test.json. Since it creates the test.json but does not call git add on it, the test.json will not be part of the git diff, hence missing when executing remotely by the agent.
If test.json is relatively small (i.e. not 10s of MB) you could store it as configuration on the Task. for example:
```
local_copy_of_test_json = task.connect_configuration('/path/to/test.json', name='test config')
print(local_copy_of_test_json)
# when running locally we will get '/path/to/test.json'
# when running inside an agent we will get '/tmp/temp_path/test.json',
# and the content of this json will come from the UI "configuration object" 'test config'
```
This will store the content of test.json on the Task as a configuration object named 'test config'; when running remotely it will create a temp file with the same content (coming from the Task).
If you need to move the temp file to the current execution dir, you can always copy it.
wdyt?
AgitatedDove14
Ok, will check this tomorrow. Thank you for your help!
After a few commit-push-pulls I got no diff issue on Windows. But I just got a weird behavior - when stages run in a pipeline, they do not create new files; instead, they cannot be run if the files they produce were not committed. I do not really understand the logic of this. To be exact:
I have 3 stages, each implemented as a separate script: 1) converting annotations into coco test.json and train.json files, 2) converting json files and images into test.record and train.record files, 3) training the network using *.record files. Well, if I delete test.json and train.json from the local folder (d:\object_detection) and from the gitlab origin, the 1st stage running in the pipeline cannot be executed, because these (resulting!) files are not found while downloading the repository and setting up the environment. The same is true for the 2nd stage. And, moreover, the test.json, train.json, train.record, test.record files are not created in the "d:\object_detection" folder. There I can find the old versions of the files, which were created while running each stage script separately (not in a pipeline). Are they created somewhere on the C disk, which is the system disk and where ClearML and Python are installed? Really, I do not understand the logic.
AgitatedDove14
For the classification example (clml_cl_toy) - script A is image_augmentation.py, which creates augmented images; script B is train_1st_nn.py (or train_2nd_nn.py, which does the same), which trains an ANN based on the augmented images. For the object detection example, script A is represented by two scripts - annotation_conversion_test.py, which creates the file test.json, and annotation_conversion_train.py, which creates the file train.json. These files are used by script B - tf_create.py, which creates the files test.record and train.record.
AgitatedDove14
No, I meant a different thing. It's not easy to explain, sorry. Let me try. Say, I have a project in the folder "d:\object_detection". There I have a script which converts annotations from labelme format to coco format. This script's name is convert_test.py and it runs a process registered under the same name in clearml. This script, being run separately from the command prompt, creates a new file in the project folder - test.json. I delete this file and sync the local and remote repos, both without this file (it must be generated during pipeline execution). Now I start my pipeline, and it crashes on the convert_test process, because test.json IS MISSING. If I add test.json back to the local and remote repos, the process runs smoothly, reporting a successfully generated test.json (but in the project folder "d:\object_detection", there's the old test.json which I added to the repos before running the pipeline).
I just got an idea that cleaning the clearml repository cache may fix it. I'll check this tomorrow.
AgitatedDove14 Ok, I'll try to do this with clearml-data. However, I've found that I don't understand the logic of where newly generated data (by the pipeline) is placed. I think it's a major issue with my code. And also, I should understand this for using clearml-data as well.
Say, script_a.py generates the file test.json in the project folder. script_b.py should use this file for further processing. When I run script-by-script, test.json is generated and used OK. However, when I run the pipeline with those processes, test.json is reported to be created, but I can't find it - neither in the project folder, nor in the C:\Users\User\.clearml\ folder (I mean that all the operations are done under a Windows environment). Can you explain this to me? Thank you.
as far as I understand, clearml tracks each library called from the scripts and saves the list of these libraries somewhere (as I assume, this list is saved as a requirements.txt file somewhere - which is later loaded into the venv when the pipeline is running).
Correct
Can I edit this file (just to comment out the row with "object_detection==0.1")?
BTW, regarding the object-detection library. My training scripts have calls like:
Yes, in the UI you can right click on the Task, select "reset", then it becomes fully editable ("installed packages", "git diff", basically everything)
where object_detection is not an installed library, but a name of a folder,
Oh.... Now I see. So based on the fact it was detected by clearml as a package, I assume it was installed with "pip install -e <folder>", and this is why it was detected.
Regarding running it with the agent, this is a good use case for the "working directory" entry in the execution section:
Assume we have:
./script.py
./object_detection/__init__.py
...
Then we need to make sure the working dir is set to "." and the entry script is script.py
Make sense?
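A minimal sketch of why this works (the file names mirror the layout above; the temp-dir scaffolding is only for illustration): when the entry script sits in the repo root, Python puts that directory on sys.path, so the local object_detection folder is importable without pip-installing anything.

```python
import os
import subprocess
import sys
import tempfile

# Recreate the layout: ./script.py and ./object_detection/__init__.py
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "object_detection")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("SOURCE = 'local-folder'\n")
with open(os.path.join(root, "script.py"), "w") as f:
    f.write("import object_detection\nprint(object_detection.SOURCE)\n")

# Running `python script.py` with the repo root as the working dir:
# the script's directory is prepended to sys.path, so the folder is found.
result = subprocess.run([sys.executable, "script.py"],
                        cwd=root, capture_output=True, text=True)
print(result.stdout.strip())  # -> local-folder
```

This is why there is no object_detection==0.1 package to install - the import is satisfied by the folder, provided the working directory is the repo root.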
The fact is that I use docker for running clearml server both on Linux and Windows.
My question was about running the agent - is it running with the --docker flag, i.e. docker mode?
Also, I just forgot to note that I'm running the clearml-agent and clearml processes in a virtual environment - a conda environment on Windows and venv on Linux.
Yep that answers my question above 🙂
Does it make any sense to change system_site_packages to true if I run clearml using Docker?
No need, it does that automatically, and inherits the docker python system packages 🙂
I've previously included those tfrecord files in the git repository. Now I've deleted the files, and set .gitignore to ignore files in the work/tfrecord folder. Additionally, I deleted the cached repositories in the system folder 'C:\Users\Super\.clearml\vcs-cache' - but the issue remains. Can you help me with this?
I'm assuming you have some binary uncommitted changes (the tfrecords, for example), which makes sense - they could break git diff
Binary files a/work/tfrecord/train.record and b/work/tfrecord/train.record differ
Make sure you remove the tfrecords from git (always a good practice), then commit the removal; then you will not have them in the uncommitted changes (the .gitignore you already took care of)
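The removal can be sketched with plain git commands (throwaway repo under /tmp; the paths mirror the ones in the diff error above):

```shell
# Stop tracking the generated .record files, keep them on disk, and
# ignore them going forward so they never enter the diff again.
rm -rf /tmp/tfrecord_demo && mkdir -p /tmp/tfrecord_demo/work/tfrecord
cd /tmp/tfrecord_demo
git init -q .
touch work/tfrecord/train.record work/tfrecord/test.record
git add -A
git -c user.email=demo@example.com -c user.name=demo commit -qm "add records"
git rm -q --cached work/tfrecord/*.record      # untrack, but keep local copies
echo "work/tfrecord/" > .gitignore
git add .gitignore
git -c user.email=demo@example.com -c user.name=demo commit -qm "ignore records"
git status --porcelain > status_out.txt        # clean: nothing left to diff
```

After this, the record files still exist locally but are invisible to git, so the agent's uncommitted-changes patch stays free of binary data.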
BTW, is there any way to automatically clean cached files?
What do you mean by cache files? clearml will manage its own cache (and clean it up)