allegroai/trains-agent-service image hash 03dc85869afe
i think for now it should do the trick... was just thinking about the roadmap part
thanks for letting me know.. but it turns out that after i recreated my whole system environment from scratch, trains agent is working as expected..
AgitatedDove14 when using OutputModel(task, name='LightGBM model', framework='LightGBM').update_weights(f"{args.out}/model.pkl") i am seeing this in the logs: "No output storage destination defined, registering local model /tmp/model.pkl". when i go to the trains UI i see the model name and details, but when i try to download it, it points to the path file:///tmp/model.pkl which is incorrect.. wondering how to fix it
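A minimal sketch of one possible fix, assuming the cause is that the task has no output destination: pass `output_uri` to `Task.init` so the weights are uploaded instead of being registered by their local path. The helper name `register_lightgbm_model`, the default project/task names, and whatever storage URI you pass in are placeholders for illustration, not something confirmed in this thread.

```python
def register_lightgbm_model(model_path, output_uri,
                            project_name="Trains Test", task_name="LightGBM"):
    """Create a task with an explicit output destination and attach the model file.

    `output_uri` is a placeholder, e.g. a bucket or shared-folder URI you own.
    """
    # Imported lazily so this sketch can be read and loaded without trains installed.
    from trains import Task, OutputModel

    # With output_uri set, model files are uploaded to that destination
    # rather than registered as a local file:// path.
    task = Task.init(project_name=project_name, task_name=task_name,
                     output_uri=output_uri)

    # Same call as in the message above, now made on a task that has a destination.
    model = OutputModel(task, name="LightGBM model", framework="LightGBM")
    model.update_weights(model_path)
    return task
```

If the destination is already set in trains.conf, it may be worth checking that the script actually loads that config file before changing any code.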
is it because something is wrong with this package build from its owner, or something else
also one thing i noticed.. when i report a confusion matrix and some other plots, e.g. seaborn with matplotlib.. on the server side i can see the plots are there but they are not visible at all
any example in the repo which i can go through
ok will report back
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://localhost:8081/Trains%20Test/LightGBM.56ca0c9c9ebf4800b7e4f537295d942c/metrics/LightGBM%20Feature%20Importance%20above%200.0%20threshold/plot%20image/LightGBM%20Feature%20Importance%20above%200.0%20threshold_plot%20image_00000000.png . (Reason: CORS request did not succeed).
seems like port forwarding had an issue.. fixed that.. now running the test again to see if things work out as expected
looking at the code https://github.com/allegroai/trains/blob/65a4aa7aa90fc867993cf0d5e36c214e6c044270/trains/model.py#L1146 this happens when storage_uri is not defined, whereas i have this under trains.conf, so the task should have it?
ok so the controller task is a simple placeholder which runs indefinitely, fetches a task template and queues it..
ok... is there any way to enforce using a given system wide env.. so agent doesn't need to spend time with env. creation
AgitatedDove14 it seems i am having issues when i restart the agent... it fails in creating/setting up the env again... when i clean up the .trains/venv-builds folder and run a job for agent.. it is able to create the env fine and run job successfully.. when i restart the agent it fails with messages like
` Requirement already satisfied: cffi@ file:///home/conda/feedstock_root/build_artifacts/cffi_1595805535531/work from file:///home/conda/feedstock_root/build_artifacts/cffi_1595805535...
yes, delete experiments which are old or which for some other reason don't need to be kept around
couldn't find the licensing price for enterprise version
i guess i was not so clear maybe.. say e.g. you are running lightgbm model training: by default it will take all the cpus available on the box and run that many threads. now another task gets scheduled on the same box, so you have 2x the threads with the same number of CPUs to schedule them on. So yes, the jobs will progress, but not at the same rate, because context switches will happen way more often than if we had allowed only 1/2x the threads for each job
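To make the thread-budget idea concrete, here is a rough sketch assuming an even split of CPUs across the jobs sharing a box. The helper `threads_per_job` and the example numbers (2 jobs, 16 CPUs) are made up for illustration; `num_threads` is the LightGBM parameter that caps its thread count.

```python
import os


def threads_per_job(concurrent_jobs, total_cpus=None):
    """Split the box's CPUs evenly across jobs, never going below one thread."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    if total_cpus is None:
        # os.cpu_count() can return None, so fall back to a single CPU.
        total_cpus = os.cpu_count() or 1
    return max(1, total_cpus // concurrent_jobs)


# Hypothetical example: two jobs sharing a 16-CPU box get 8 threads each.
params = {
    "objective": "binary",
    "num_threads": threads_per_job(2, total_cpus=16),
}
```

This only caps each job's own threads; it does not stop a second job from being scheduled on the same box in the first place.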
yeah that would solve it i think.. so what is the normal cadence for release.. every month or quarter ?
while you guys are working on it.. just a small feature addition to it.. it would be cool to have a DAG figure which shows how models are linked under this task, and the ability to just click a circle in that DAG figure to navigate to the given task... i think it will be very useful UX 🙂
i mean linking more in UI.. as when i go to model detail page, i can see that a given experiment created this model and click on that to see its detail... so something similar to that for ensemble models
look forward to the new job workflow part in 0.16 then 🙂
i know it supports conda.. but i have another system wide env which is not base.. say ml, so wondering if i can configure trains-agent to use that... not standard practice but just asking if it is possible
it still tries to create a new env
simply changing to show doesn't work in my case as i am displaying CM.. what about if i use matshow
the use case i have is to allow people from my team to run their workloads on a set of servers without stepping on each other..
thanks Martin.. at least something to go with.. as if i have any issue then i know which component logs to look for
TimelyPenguin76 also is there any reason for treating show and imshow differently
as i am seeing my plots now, but they are landing in the metrics section, not the plots section.