I'm not running in docker mode though - I'm running a clearml worker in a docker container (and then multiplying the container)
(the "magic" of the env detection is nice but man... it has its surprises)
I just need to understand what I should be expecting. I thought going from putting it into the queue in the UI to "running my code remotely" (especially with packages preloaded) should be a fairly fast turnaround - certainly not three minutes... I'll have to change my whole pipeline design if that's the case.
thank you!
I'll take that design into consideration.
re: CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL in "docker venv mode" - I'm still not quite sure I understand correctly: since the agent is running in a container, as far as it's concerned it may as well be on bare metal.
is it just that there's no way for that worker to avoid venv? (i.e. the only way to bypass venv is to use docker-mode?)
everything else is 200 except these two
For digitalocean:
host: "(region).digitaloceanspaces.com:443"
bucket: "(bucket name)"
key: "(key)"
secret: "(secret)"
multipart: false
secure: true
(verify commented out entirely)
So for you - make sure to add your creds that have the right scope (r/w), and try specifying the bucket.
Then in the clearml tasks themselves you tell the task using output_uri="s3://(region).digitaloceanspaces.com:443/clearml/"
(I import this as a constant from a _constants.py file...)
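e.g. in the task code itself, roughly (a minimal sketch - project/task names are made up, and the URI is the same placeholder as above):

from clearml import Task

# hypothetical constant, e.g. kept in _constants.py
DEFAULT_OUTPUT_URI = "s3://(region).digitaloceanspaces.com:443/clearml/"

task = Task.init(
    project_name="my-project",      # made-up names
    task_name="train",
    output_uri=DEFAULT_OUTPUT_URI,  # artifacts and models get uploaded to the Spaces bucket
)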
for now I'm just avoiding restarts of the service, but I do want to get to the bottom of it using a fresh instance.
as a backup plan: is there a way to have an API key set up prior to running docker compose up? Like, I need at least one set of credentials that I can reliably have remote agents use, one that I know persists across restarts and upgrades.
I'd try the static image approach if you don't care about interactivity,
or for more control, switch to plotly and report_plotly for better consistency.
I've not had the best luck with the automated conversion on complex plots before. And when it's sufficiently large it doesn't even show up in preview.
Static images are in my opinion the most reliable, though limited by your choice of resolution when you create the figure.
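something along these lines (a rough sketch, assuming matplotlib and plotly are installed; project/titles are made up):

import matplotlib.pyplot as plt
import plotly.graph_objects as go
from clearml import Task

task = Task.init(project_name="plots-demo", task_name="reporting")  # made-up names
logger = task.get_logger()

# static route: render the matplotlib figure and upload it as an image
fig = plt.figure()
plt.plot([1, 2, 3], [1, 4, 9])
logger.report_matplotlib_figure(
    title="results", series="static", figure=fig, iteration=0,
    report_image=True,  # skip the automatic plotly conversion, upload a rendered image instead
)

# plotly route: build the figure yourself and report it directly
pfig = go.Figure(data=go.Scatter(x=[1, 2, 3], y=[1, 4, 9]))
logger.report_plotly(title="results", series="interactive", figure=pfig, iteration=0)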
nothing came up in the logs. all 200's
I am definitely not seeing it persist after upgrading. Previously it wasn't a problem on other upgrades.
# imports
from clearml import PipelineController
...

if __name__ == "__main__":
    pipe = PipelineController(...)
    # after instantiation, before "the code" that creates the pipeline.
    # normal tasks can handle task.execute_remotely() at this stage...
    pipe = add_steps_to_pipe(pipe)
    ...
    # after the pipeline is defined. best I can tell, *has* to be the last thing in the code.
    pipe.start_locally()  # or just .start()
ugh. again. it launched all these tasks and then just died. logs go silent.


enqueuing: pipe.start("default") - but I think it's picking up on my local clearml install instead of what I told it to use.
my tasks have this in them... what's the equivalent for pipeline controllers?
Yeah, locally it did run. I then ran another via the UI, spawned from the successful one; it showed cached steps and then refused to run the bottom one, disappearing again. No status message, no status reason. (not running... actually dead)
I just ran a pipeline that took about 2h (more than half of this time was just the DAG), with about a hundred tasks. I'm taking a look at them now to see what the logs show for runtimes.
Yeah, this problem seems to happen on 1.15.1 and 1.16.2 as well; prior runs were on the same version even. It just feels like it happens absolutely randomly (but often).
just happened again to me.
The pipeline is constructed from tasks; it basically does map/reduce: prepare data -> model training + evaluation -> backtesting performance summary.
It figures out how wide to go by parsing the date range supplied as an input parameter. Been running stuff like this for months but only recently did ...
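roughly, the controller looks something like this (heavily simplified sketch - the step functions and names here are made up, it's just the fan-out/fan-in shape):

from datetime import date, timedelta
from clearml import PipelineController

def prepare(chunk_start: str, chunk_end: str):
    # placeholder: slice the raw data down to this window
    return f"{chunk_start}..{chunk_end}"

def train_and_eval(window: str):
    # placeholder: fit and evaluate a model on the window
    return 0.0

def summarize():
    # placeholder: reduce step, aggregates the per-window results
    pass

def windows(start: date, end: date, days: int = 30):
    # split the requested date range into fixed-width chunks (this decides how wide the DAG goes)
    cur = start
    while cur < end:
        yield cur, min(cur + timedelta(days=days), end)
        cur += timedelta(days=days)

if __name__ == "__main__":
    pipe = PipelineController(name="backtest", project="demo", version="1.0.0")
    train_steps = []
    for i, (lo, hi) in enumerate(windows(date(2024, 1, 1), date(2024, 7, 1))):
        pipe.add_function_step(
            name=f"prepare_{i}", function=prepare,
            function_kwargs={"chunk_start": str(lo), "chunk_end": str(hi)},
            function_return=["window"],
        )
        pipe.add_function_step(
            name=f"train_{i}", function=train_and_eval,
            function_kwargs={"window": f"${{prepare_{i}.window}}"},
            function_return=["score"],
            parents=[f"prepare_{i}"],
        )
        train_steps.append(f"train_{i}")
    # fan-in: the summary waits for every training branch
    pipe.add_function_step(name="summary", function=summarize, parents=train_steps)
    pipe.start_locally(run_pipeline_steps_locally=True)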
Is there a way for me to toggle ClearML's log level? I'm doing some manual task-debugging in IPython and think it would be helpful to see network requests and timeouts if they're occurring.
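(what I have in mind is plain Python logging - I'm assuming clearml and its HTTP layer go through the standard logging module / urllib3, something like:)

import logging

logging.basicConfig(level=logging.DEBUG)              # root handler so DEBUG records actually print
logging.getLogger("clearml").setLevel(logging.DEBUG)  # assuming clearml loggers live under the "clearml" name
logging.getLogger("urllib3").setLevel(logging.DEBUG)  # underlying HTTP layer - should show retries/timeouts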
yup, but you can modify them after task creation in the UI (if it's in draft state)
It's upon runtime instantiation of the PipelineController class.
So far it seems that turning off caching like this is my "best option".
And for what it's worth, it seems I don't have anything special for agent cloning.
I did find agent.vcs_cache.clone_on_pull_fail to be helpful. But yeah, updating the agent was the biggest fix.
Yes, I actually have been able to turn caching on again after rc2 of the agent! It's been working much better.
I ended up pinning the Dockerfile instruction to 1.18, but before that I was letting the entrypoint script do the install (so, latest).
Much appreciate the env var tip. That's more elegant than what I did.
Since I've turned off caching I've had much better luck. Is what I'm experiencing a bug? (neither Bitbucket nor GitHub private repositories work on the second task per worker)
I can see agent.vcs_cache.enabled = true as a printout in the Console, but cannot find docs on how to set this via environment variable, since I'm trying to keep these containers from needing a clearml.conf file (though I can generate one in the entrypoint script if need be with a <<EOF heredoc).
It took me a while to deliver enough functionality to my team to justify working on open source... but I finally got back around to investigating this to write a proper issue, and ended up figuring it out myself and opening a PR:
None
thank you!
By any chance do you have insights into github.com/allegroai/clearml-server/issues/248? I don't know if it's related to this at all or not, but it is an issue I experienced after upgrading.
basically the git hash of the executed experiment + a hash on the inputs to the task.
Yeah, that's how I've been generating credentials for agents as well as for my dev environment.
The workers connect to the clearml server via SSH tunnels, so they all talk to "localhost" despite being deployed in different places. Each task creates artifacts and metrics that are used downstream.