AbruptHedgehog21 could it be the console log itself is huge ?
Hi SolidSealion72
"/tmp" contained alot of artifacts from ClearML past runs (1.6T in our case).
How did you end up with 1.6TB of artifacts there? what are the workflows on that machine? at least in theory, there should not be any leftover in the tmp folder, after the process is completed.
LOL, great minds and so on π
UnevenDolphin73 you mean the clearml-server
helm chart ?
Hi CleanPigeon16
Yes there is, when you are cloning the pipeline in the UI, go to the Configuration/Pipeline/continue_pipeline and change it to True
Is there an option to do this from a pipeline, from within theΒ
add_step
Β method? Can you link a reference to cloning and editing a task programmatically?
Hmm, I think there is an open GitHub issue requesting a similar ability , let me check on the progress ...
nope, it works well for the pipeline when not I don't choose to continue_pipeline
Could you send the full log please?
CleanPigeon16 Can you send also the "Configuration Object" "Pipeline" section ?
Hi @<1562610699555835904:profile|VirtuousHedgehong97>
I think you need to upgrade your self-hosted clearml-server, could that be the case?
Hi ProudMosquito87
so you mean to mount your data folder onto the the docker so that the code could access it, correct?
If that is the case, is there a specific version not to use absolute path? (e.g. /mnt/data/mine
)?
ScaryKoala63
When it fails what's the number of files you have in:/home/developer/.clearml/cache/storage_manager/global/
?
In the UI you can see all the agents and their IDs
Then you can so
clearml-agent daemon --stop <agent id>
BoredHedgehog47 that actually depends on the container, are you running as root inside the container ?
if not I think the easiest hack is to always map /etc/hosts as a k8s secret file?
load_model
will get a link to a previously registered URL (i.e. it search a model pointing to the specific URL, if it finds it, it will get you the Model object)
Oh I see, what you need is to pass '--script script.py' as entry-point and ' --cwd folder' as working dir
Hmm I think this is not doable ... π
(the underlying data is stored in DBs and changing it is not really possible without messing about with the DB)
Hi @<1533257278776414208:profile|SuperiorCockroach75>
ModuleNotFoundError: No module named 'statsmodels'
seems like this package is missing from the Task
wither import it manually import statsmodels
(so the automagic logs it)
Or add before task init:
Task.add_requirements("statsmodels")
task = Task.init(...)
ps: no need to @ so many people ...
I am symlinking the .clearml directory to a NAS server and this is perhaps part of the problem.
Yep, that sounds about right, it uses Posix file system for internal lock mechanisms (multi process locks), and my guess is that the NAS for some reason does not support it...
Hi @<1552101447716311040:profile|SteadySeahorse58>
ValueError: Could not find queue named "services"
Did you set an agent / auto-scaler ? where is the pipeline and its components will be running ?
If you are using the "default" queue for the agent, notice you might need to run the agent with --services-mode
to allow for multiple pipeline components on the same machine
I would ideally just want to have NVIDIA drivers and Docker on the on-prem nodes (along with the clearML agents). Would that allow me to get by with basic job scheduling/queues through clearML?
Yes this is fully supported and very easy to setup.
Regrading limiting users usage. This is doable, I think the easiest solution both for users and management of the cluster is introducing priority into the queue, basically a user can push job into low priority, and only some users can push into high...
However, when we try to access the webapi from remote through the VPN we fail. The VPN logs don't show any blockage. Any ideas?
Maybe the VPN firewall blocks http connections ? or it might be BrightRabbit75 case, that sounds quite logical to never show anywhere
using only a subset of the features
ShallowGoldfish8 if you have some parameter that controls it (i.e. select different features) then you can launch it with two sets f parameters.
Am I missing something?
for example:
` my_features_select = {"type": "set_a"}
Task.current_task().connect(my_features_select)
if my_features_select["type"] == "set_a":
do something
else
do something else `wdyt?
well it should fail, but I think the error message should be fixed π
maybe:ValueError: dataset 'tmp_datset' not found in project
lavi-testing' `wdyt?
Hi VirtuousFish83 ,
Is it throwing an exception? Are you seeing the plot in the UI but the title is incorrect?