OddAlligator72 FYI you can also import/export an entire Task (basically allowing you to create it from scratch/JSON, even without calling Task.create): Task.import_task(...) and Task.export_task(...)
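A minimal sketch of the export/import round trip mentioned above, assuming the clearml package is installed, a server is configured, and the exported dictionary is JSON-serializable. Task.export_task() and Task.import_task() are the calls named in the message; the JSON file helpers and the copy_task_via_json name are hypothetical, added only for illustration.

```python
import json
from pathlib import Path


def save_task_dict(task_data: dict, path: str) -> None:
    """Write an exported task dictionary to a JSON file."""
    Path(path).write_text(json.dumps(task_data, indent=2))


def load_task_dict(path: str) -> dict:
    """Read a task dictionary back from a JSON file."""
    return json.loads(Path(path).read_text())


def copy_task_via_json(task_id: str, path: str = "task.json"):
    """Export an existing task to JSON, then import that file as a task.

    Hypothetical helper; it needs the clearml package and a reachable server.
    """
    from clearml import Task  # imported lazily so the helpers above work without clearml

    source = Task.get_task(task_id=task_id)
    save_task_dict(source.export_task(), path)     # export_task() returns a dictionary
    return Task.import_task(load_task_dict(path))  # no target_task is passed here
```

The JSON file in between can be edited by hand, which is the "create it from JSON" part of the suggestion.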
you can also set the agent.package_manager.extra_index_url, but since this is dynamic,...
You are correct, since this is dynamic there is no need to set the "extra_index_url" configuration in clearml.conf; the additional bash script will configure pip directly. Make sense?
Yes, it could, crontab uses the user it is running from (root if used with sudo)
@<1610083503607648256:profile|DiminutiveToad80> try to turn on:
enable_git_ask_pass: true
and the inet of the same card ?
@<1545216070686609408:profile|EnthusiasticCow4> git+ssh:// will be converted automatically to git+https if you have user/pass configured in your clearml.conf on the agent machine.
Moreover, git packages are always installed after all other packages are installed (because pip cannot resolve the requirements inside the git repo in time).
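To make the two behaviours above concrete, here is a simplified sketch. It is not the agent's actual implementation: the function names are hypothetical, only URL-style git+ssh:// requirements are handled, and no credentials are injected (the agent would use the user/pass from clearml.conf).

```python
from typing import List


def ssh_to_https(requirement: str) -> str:
    """Rewrite a git+ssh:// requirement as git+https://, dropping the ssh user."""
    prefix = "git+ssh://"
    if not requirement.startswith(prefix):
        return requirement
    host, sep, path = requirement[len(prefix):].partition("/")
    host = host.split("@", 1)[-1]  # drop a leading "user@" such as "git@"
    return "git+https://" + host + sep + path


def is_git_requirement(requirement: str) -> bool:
    """Treat any requirement that references a git+ URL as a git package."""
    return "git+" in requirement


def order_requirements(requirements: List[str]) -> List[str]:
    """Return the requirements with git packages moved to the end."""
    regular = [r for r in requirements if not is_git_requirement(r)]
    git = [r for r in requirements if is_git_requirement(r)]
    return regular + git


requirements = ["git+ssh://git@github.com/org/repo.git", "numpy", "requests>=2.0"]
ordered = order_requirements([ssh_to_https(r) for r in requirements])
print(ordered)
```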
BTW: trains-agent is leaner, and does not need plotly. And you can use the APIClient to basically query the entire system, would that be a better solution? See https://github.com/allegroai/trains-agent/blob/master/examples/archive_experiments.py
This task is picked up by first agent; it runs DDP launch script for itself and then creates clones of itself with task.create_function_task() and passes its address as argument to the function
Hi UnevenHorse85
Interesting use case, just for my understanding, the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct ?
passes its address as argument to the function
This seems like a great solution.
the queu...
Hi @<1625303806923247616:profile|ItchyCow80>
Could you add some prints? Is it working without the Task.init call? The code looks okay, and the "No repository found" message basically says it logs it as a standalone script (which makes sense).
Hi @<1631102016807768064:profile|ZanySealion18>
ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up.
What do you mean by "does not pick up"? Is it that the container is up but was not executed with --gpus, so it has no GPU access?
I think task.init flag would be great!
Failed to initialize NVML: Unknown Error
Yeah, this is a driver issue. I think you need to check whether the drivers in the VM image match the GPU on that machine.
HelplessCrocodile8 I just tried it, everything seems to work (Ubuntu 20.04).
What OS are you using? Python version? Is it conda?
And how is the endpoint registered ?
Also, what is the additional p doing at the last line of the screenshot?
Ohh "~/trains.conf" is root probably
Hi @<1724960468822396928:profile|CumbersomeSealion22>
It starts the pipeline, logs that the first step is started, and then...does nothing anymore.
How many agents do you have running? By default an agent runs one Task at a time (unless executed with --services-mode, which allows it to run an unlimited number of tasks in parallel).
Okay, that makes sense. best_diabetes_detection is different from your example curl -X POST "None", notice best_mage_diabetes_detection?
JitteryCoyote63 what am I missing?
What are the errors you are getting (with / without the envs)
EnviousStarfish54 Notice that you can configure it on the agent machine only, so in development you are not "wasting" storage when uploading debug checkpoints/models.
WackyRabbit7 this section is what you need, uncomment it and fill it in:
https://github.com/allegroai/trains/blob/c9fac89bcd87550b7eb40e6be64bd19d4384b515/docs/trains.conf#L88
OddAlligator72 I like this idea.
The single thing I'm not sure about is the "function entry point"
Why would one do that? Meaning why wouldn't you have a proper python entry-point.
The reason I'm reluctant is that you might have calls/functions/variables in the global scope of the file storing the function. Users will then not know why something broke, and it will be very cumbersome to debug.
A simple script entry point seems trivial to launch and debug locally.
What do you think ? What woul...
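The global-scope concern above can be illustrated with a small sketch. The module source, the entry_point function and the load_function helper are all hypothetical; the point is only that loading a file to reach one function also runs everything at the top level of that file.

```python
import types

# Hypothetical user file: it has top-level statements next to the function.
MODULE_SOURCE = '''
side_effects = []
side_effects.append("global code ran at import")  # executed when the file is loaded

def entry_point(x):
    return x * 2
'''


def load_function(source: str, func_name: str):
    """Load a function from module source; the whole module body is executed first."""
    module = types.ModuleType("user_module")
    exec(source, module.__dict__)
    return getattr(module, func_name), module


func, module = load_function(MODULE_SOURCE, "entry_point")
print(module.side_effects)  # the top-level code has already run at this point
print(func(21))
```

If the top-level code fails or depends on local state, the launch breaks before the function itself is ever called, which is the debugging difficulty described above.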
GiddyTurkey39 my bad, try this one: task._update_requirements({})
SmarmyDolphin68
Debug Samples tab and not the Plots,
Are you doing plt.imshow?
Also make sure you have report_image=False when calling report_matplotlib_figure (if it is True, it will upload the figure as an image to "debug samples").
Would it also be possible to query based on multiple user properties
Multiple key/value pairs are, I think, currently not that easy to query, but multiple tags are quite easy to do:
tags=["__$all", "tag1", "tag2"],
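A short sketch of how the multi-tag filter above could be passed to a query, assuming the clearml package is installed and a server is configured. Task.get_tasks and its tags argument are part of the SDK; the two helper names and the project name argument are hypothetical.

```python
from typing import List


def all_tags_filter(*tags: str) -> List[str]:
    """Build a tags filter that starts with the "__$all" marker."""
    return ["__$all", *tags]


def find_tasks_with_all_tags(project_name: str, *tags: str):
    """Hypothetical helper: query tasks in a project that carry every given tag."""
    from clearml import Task  # imported lazily; needs clearml and a reachable server

    return Task.get_tasks(project_name=project_name, tags=all_tags_filter(*tags))


print(all_tags_filter("tag1", "tag2"))
```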
This smells like a driver/image issue on the instance VM
What are you getting if you add this inside your code?
os.system('nvidia-smi')
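A slightly more defensive variant of the check above, as a sketch: it captures the output so it ends up in the task log even when the command is missing. The gpu_report name is hypothetical; only the standard library is used.

```python
import shutil
import subprocess


def gpu_report() -> str:
    """Return whatever nvidia-smi prints, or a short message if it cannot be run."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found on PATH"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # Combine stdout and stderr so an error message from the tool is not lost.
    output = (result.stdout + result.stderr).strip()
    return output or f"nvidia-smi exited with code {result.returncode}"


print(gpu_report())
```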