Hi RipeGoose2
Can you try with the latest from git?
pip install -U git+
Hmm, yes, but then this is kind of a hacky solution... The original #340 was about packaging source code that was not in git... Now we want to add "data" (even if ephemeral) on top of it, no?
My thinking is to somehow make sure a Task can reference a "Dataset" that the agent downloads before the Task starts?!
Your code should have worked, i.e. you should see the 'model.h5' in the artifacts tab. What do you have there?
It should look something like this one:
https://demoapp.trains.allegro.ai/projects/531785e122644ca5b85b2e19b0321def/experiments/e185cf31b2634e95abc7f9fbdef60e0f/artifacts/output-model
BTW:
To manually register any model:
from trains import Task, OutputModel
task = Task.init('examples', 'my model')
OutputModel().update_weights('my_best_model.h5')
Okay, I'll make sure we change the default image to the runtime flavor of nvidia/cuda
I want that last python program to be executed with the environment that was created by the agent for this specific task
Well basically they all inherit the Python environment that points to the venv they started from, so at least in theory it should be transparent when the agent is spinning the initial process.
I eventually found a different way of achieving what I needed
Now I'm curious, what did you end up doing ?
So the agent installed okay. It's the specific Task that the agent is failing to create the environment for, correct?
If this is the case, what do you have in the "Installed Packages" section of the Task (see under the Execution tab)?
Actually it is better to leave it as is, it will just automatically mount the .ssh folder into the container. I will make sure the docs point to this option first
It may have been killed or evicted or something after a day or 2.
Actually the ideal setup is to have a "services" pod running all these services on a single pod, with clearml-agent --services-mode. This Pod should always be on and pull jobs from a dedicated queue.
Maybe a nice way to do that is to have the single Task serialize itself, then have a Pod run the Task every X hours and spin it down
So I would like to know what it sends to the server to create the task/pipeline, ...
When you are running epoch n+1, you get 2*n+1 reported
RipeGoose2 like twice the gap, i.e. internally it adds an offset of the last iteration... is this easily reproducible?
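To make the arithmetic concrete, here is a minimal sketch of how such a double count could arise. It assumes the reported value is simply the last iteration of the previous run plus whatever the training loop reports; `reported_iteration` is a hypothetical helper for illustration, not a ClearML function.

```python
def reported_iteration(last_iteration: int, local_iteration: int) -> int:
    """Iteration shown in the UI if an offset equal to the previous
    run's last iteration is added to every newly reported value."""
    return last_iteration + local_iteration


n = 10
# The previous run stopped at epoch n. The resumed training loop restores
# its own counter and reports n + 1, but the offset n is added on top.
print(reported_iteration(last_iteration=n, local_iteration=n + 1))  # 21, i.e. 2*n + 1
```

If the training loop restarted its counter from 1 instead, the same offset would give n + 1, which is the value one would expect to see.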
Yes, it could, crontab uses the user it is running from (root if used with sudo)
That works AND the feature works!
YEY
Quick follow up question, is there any way to abort a pipeline and all of the tasks it ran?
Hmm yes, currently if you abort the pipeline it has no "time" to abort the running Tasks (the DAG itself will stop, because the pipeline controller was aborted, but the running Tasks will continue).
In order to have better support, we need to add a previously requested feature for an "abort" callback. This is actually not as straightforward as it sounds...
Hmmm, I'm not sure that you can disable it. But I think you are correct it should be possible. We will add it as another argument to Task.init. That said, FriendlyKoala70 what's the use case for disabling the code detection? You don't have to use it later, but it is always nice to know :)
Yes, but does add_external_files make chunked zips as add_files does?
No, it only references them (i.e. it stores meta-data and does not actually do anything with the files themselves)
I need the zipping and chunking to manage millions of files
That makes sense. If that's the case you will have to download those files anyway, and then add them with add_files
You can use the StorageManager to download them, and then add them from the local copy (this will zip/chunk them)
[None](https://clear.ml/docs/la...
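A rough sketch of the download-then-add flow suggested above. It assumes `StorageManager.get_local_copy` for the download and an already created ClearML `Dataset` object passed in as `dataset`. The helper name `add_remote_files`, the `fetch` parameter, and treating a falsy result as a failed download are illustration choices, not part of the ClearML API.

```python
def add_remote_files(dataset, urls, fetch=None):
    """Download each remote file and add the local copy to `dataset`.

    `dataset` is expected to expose add_files(path=...), as a ClearML
    Dataset does. `fetch` maps a URL to a local path; by default it is
    StorageManager.get_local_copy (imported lazily so the helper can be
    tried out with stand-ins).
    """
    if fetch is None:
        from clearml import StorageManager
        fetch = StorageManager.get_local_copy

    added = []
    for url in urls:
        local_path = fetch(url)
        if not local_path:
            # Assumption: a falsy result means the download failed, so skip it
            continue
        # Adding from the local copy is what lets the files be zipped/chunked
        dataset.add_files(path=local_path)
        added.append(local_path)
    return added
```

With a real dataset you would still call `upload()` and `finalize()` on it afterwards. For millions of files, downloading a whole folder first and adding that folder in one call is likely to be faster than going file by file.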
OddAlligator72 okay, that is possible, how would you specify the main python script entry point? (wouldn't that make more sense rather than a function call?)
How do you determine which packages to require now?
Analysis of the actual repository (i.e. it will actually look for imports 🙂 ). This way you get the exact versions you have, but not the clutter of the entire virtual environment
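As an illustration of the general idea of looking for imports, here is a minimal sketch using Python's standard `ast` module. It is not ClearML's actual implementation; `top_level_imports` is a hypothetical helper, and it only collects import names (mapping them to installed package versions is a separate step).

```python
import ast
from typing import Set


def top_level_imports(source: str) -> Set[str]:
    """Return the top-level module names imported by a piece of source code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import, i.e. code from the repo itself
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found


sample = "import os.path\nfrom numpy.linalg import norm\nfrom . import sibling\n"
print(sorted(top_level_imports(sample)))  # ['numpy', 'os']
```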
Thank you GreasyPenguin14 , I think you are correct, in offline mode it should not check the "demo server" configuration (as it will not try to connect to a server anyhow).
Could you open a GitHub issue, so this is addressed quickly?
nfs version 3
That's the thing, NFS will automatically set file access and flags based on the mount options; you cannot change them post mount.
How about creating a new user just for the agent? It makes sense from a security / credentials perspective
/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py
Yep I see it now, could you simulate locally (i.e. have the other folders in the path as well)?
Could it be you also have a file somewhere that is called sfi or imagery or models or chip_classifier that it accidentally tries to import first?
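One way to check for this kind of shadowing is to ask Python where each name would be imported from, without actually importing it. The sketch below uses the standard `importlib.util.find_spec`; `module_origin` is a hypothetical helper, and the module names are the ones mentioned above.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module name resolves to, or None.

    None means the name cannot be found on the current sys.path (or it
    is a namespace package with no single file behind it).
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return None if spec is None else spec.origin


for name in ("sfi", "imagery", "models", "chip_classifier"):
    print(name, "->", module_origin(name))
```

Run it from the same working directory and with the same PYTHONPATH as the failing Task; a path outside the repository would point at the file that gets picked up first.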
Actually you cannot breakpoint at "atexit" calls (or at least it doesn't work with my gdb)
But I would add a few prints here:
https://github.com/allegroai/clearml/blob/aa4e5ea7454e8f15b99bb2c77c4599fac2373c9d/clearml/task.py#L3166
Hi FranticCormorant35
So Tasks have a parent field that links one to another.
Unfortunately there is no visual representation for it.
What we did with the hyper-parameter optimization, for example, was to also add a tag with the ID of the "parent" Task. This makes sense if you have multiple Tasks all generated from the same "parent", like in hyper-parameter optimization.
What's your use case? Is it a single evaluation Task per training, or multiple, or something cron job alike?
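The parent-plus-tag approach described above could be sketched roughly like this. It assumes the child is a ClearML `Task` (which exposes `set_parent` and `add_tags`); the helper names and the `parent:` tag prefix are illustration choices only, and using the bare parent ID as the tag would work just as well.

```python
def parent_tag(parent_id: str) -> str:
    """Hypothetical tag format so children can be filtered by parent in the UI."""
    return "parent:" + parent_id


def link_to_parent(child, parent_id: str) -> str:
    """Set the parent field on `child` and tag it with the parent's ID.

    `child` is expected to behave like a clearml Task, i.e. to provide
    set_parent(...) and add_tags(...).
    """
    tag = parent_tag(parent_id)
    child.set_parent(parent_id)  # the actual link between the two Tasks
    child.add_tags([tag])        # searchable in the UI, since the link itself is not shown
    return tag
```

For example, an evaluation script could call `link_to_parent(Task.init(...), training_task_id)` right after it is initialized, where `training_task_id` is the ID of the training Task it evaluates.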
I was just wondering if, instead of using local subprocesses, several agents could serve the same purpose (running several pipelines concurrently)
Wouldn't --services-mode (read as multiple simultaneous Tasks on the same agent) solve the issue?
(BTW: if you set the pipeline component target queue to "services" , this is exactly what will happen)
ThickDove42 looking at the code, I suspect it fails interacting with the actual jupyter server (that is running on the same machine, but still).
Any chance you have a firewall on the Windows machine ?
Hmm, maybe the right way to do so is to abuse "models", which are their own entity: you can specify a system_tag on them, they can store a folder (and extract it if you need), they live in projects, and they can be cloned and changed.
wdyt?
Btw I sometimes get a gzip error when I am accessing artefacts via the '.get()' part.
Hmm, this is odd. Is this a download issue? If this is reproducible, maybe we should investigate further...
BTW:
I have very small text files that make up a dataset and compression seems to take most of the upload time
How long does it take? And how come it is not smaller in size?