Yea the "-e ." seems to fit this problem the best.
👍
It seems like whatever I add to `docker_bash_setup_script` is having no effect.
If this is running with the k8s glue, the console output of the `docker_bash_setup_script` is currently not logged into the Task (this bug will be solved in the next version), but the code is being executed. You can see the full logs with kubectl, or test with a simple export test:
docker_bash_setup_script: `export MY...`
Hmm, maybe this is the issue:
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
This makes no sense: conda is saying pytorch~=1.8.0 needs cudatoolkit <10.2 or <10.3, but it actually needs cudatoolkit 11.1
ColossalAnt7 I would do the following:
- Configure trains-server user/pass, mounting the API server configuration file as described in the trains-server documentation (intermediate temporary step)
- Start by providing the ML guys with VPN access that lets them reach the trains-server api/web/file ports directly (caveat: the IP/sub-domain needs to be solved)
- Configure a ConfigMap to do the routing/ingest (this solves the IP/sub-domain issue) and allow the VPN to access the single entrypoint...
Hi ThoughtfulBadger56
Just add --stop to the clearml-agent
(the exact same command as you used to spin it, just add --stop at the end and it will stop it, or just do clearml-agent daemon --stop and it will iteratively close them)
Hi RipeGoose2
Can you try with the latest from git?
pip install -U git+
hmm, yes, but then this is kind of a hacky solution... The original #340 was about packaging source code that was not in git... Now we want to add "data" (even if ephemeral) onto it, no?
My thinking is to somehow make sure a Task can reference a "Dataset" to be downloaded by the agent before it starts?!
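A minimal sketch of that idea, assuming the task code itself pulls the dataset before the real work starts (the parameter name and dataset ID below are hypothetical placeholders):

from clearml import Task, Dataset

task = Task.init(project_name='examples', task_name='train with dataset')

# hypothetical: the dataset reference is stored as a task parameter the agent/user can edit
dataset_id = task.get_parameters().get('General/dataset_id', '<dataset-id>')

# fetch a local (cached) copy of the dataset before the actual work begins
local_path = Dataset.get(dataset_id=dataset_id).get_local_copy()
print('dataset available at', local_path)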
Your code should have worked, i.e. you should see the 'model.h5' in the artifacts tab. What do you have there?
It should look something like this one:
https://demoapp.trains.allegro.ai/projects/531785e122644ca5b85b2e19b0321def/experiments/e185cf31b2634e95abc7f9fbdef60e0f/artifacts/output-model
BTW:
To manually register any model:
from trains import Task, OutputModel
task = Task.init('examples', 'my model')
OutputModel().update_weights('my_best_model.h5')
Okay, I'll make sure we change the default image to the runtime flavor of nvidia/cuda
I want that last python program to be executed with the environment that was created by the agent for this specific task
Well basically they all inherit the Python environment that points to the venv they started from, so at least in theory it should be transparent when the agent is spinning the initial process.
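A minimal sketch of that, assuming the "last python program" is launched as a subprocess from inside the task code (the script name is a hypothetical placeholder):

import subprocess
import sys

# sys.executable points at the python of the venv the agent created for this task,
# so the child process inherits the exact same environment and packages
subprocess.check_call([sys.executable, 'my_last_program.py'])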
I eventually found a different way of achieving what I needed
Now I'm curious, what did you end up doing ?
Actually it is better to leave it as is, it will just automatically mount the .ssh folder into the container. I will make sure the docs point to this option first.
when you are running the n+1 epoch you get the 2*n+1 reported
RipeGoose2 like twice the gap, i.e. internally it adds an offset of the last iteration... is this easily reproducible?
Yes, it could, crontab uses the user it is running from (root if used with sudo)
That works AND the feature works!
YEY
Quick follow up question, is there any way to abort a pipeline and all of the tasks it ran?
Hmm, yes, currently if you abort the pipeline it has no "time" to abort the running Tasks (the DAG itself will stop, because the pipeline controller was aborted, but the running Tasks will continue).
In order to have better support, we need to add a previously requested feature for an "abort" callback. This is actually not as straightforward as it sounds...
Hmmm, I'm not sure that you can disable it. But I think you are correct it should be possible. We will add it as another argument to Task.init. That said, FriendlyKoala70 what's the use case for disabling the code detection? You don't have to use it later, but it is always nice to know :)
Yes, but does add_external_files make chunked zips like add_files does?
No, it references them (i.e. meta-data only, not actually doing anything with the files themselves)
I need the zipping, chunking to manage millions of files
That makes sense; if that's the case you will have to download those files anyway, and then add them with add_files.
You can use the StorageManager to download them, and then add them from the local copy (this will zip/chunk them)
[None](https://clear.ml/docs/la...
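A rough sketch of that flow, assuming the remote files live in a single archive; the URL, dataset name and project below are hypothetical:

from clearml import Dataset, StorageManager

# download a local (cached) copy of the remote archive first
local_copy = StorageManager.get_local_copy(remote_url='s3://my-bucket/my-files.zip')

# then add the local files to a dataset; this is where the zipping/chunking happens
dataset = Dataset.create(dataset_name='my_dataset', dataset_project='examples')
dataset.add_files(path=local_copy)
dataset.upload()
dataset.finalize()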
OddAlligator72 okay, that is possible, how would you specify the main python script entry point? (wouldn't that make more sense rather than a function call?)
How do you determine which packages to require now?
Analysis of the actual repository (i.e. it will actually look for imports 🙂), this way you get the exact versions you have, but not the clutter of the entire virtual environment
Thank you GreasyPenguin14 , I think you are correct, in offline mode it should not check the "demo server" configuration (as it will not try to connect to a server anyhow).
Could you open a GitHub issue, so this is addressed quickly?
nfs version 3
That's the thing, NFS will automatically set file access and flags based on the mount options; you cannot change them post-mount.
How about creating a new user just for the agent? It makes sense from a security/credentials perspective.
/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py
Yep, I see it now, could you simulate it locally (i.e. have the other folders in the path as well)?
could it be you also have a file somewhere that is called sfi or imagery or models or chip_classifier that it accidentally tries to import from first?
I was just wondering if, instead of using local subprocesses, several agents could serve the same purpose (running several pipelines concurrently)
wouldn't --service-mode (read as multiple simultaneous Tasks on the same agent) solve the issue?
(BTW: if you set the pipeline component target queue to "services" , this is exactly what will happen)
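A small sketch of that, assuming a PipelineController whose steps are routed to the "services" queue (the project, step and base task names are hypothetical):

from clearml import PipelineController

pipe = PipelineController(name='my_pipeline', project='examples', version='1.0')

# route the step to the 'services' queue, where the agent can run
# multiple such Tasks simultaneously (service-mode behaviour)
pipe.add_step(
    name='step_one',
    base_task_project='examples',
    base_task_name='my base task',
    execution_queue='services',
)

pipe.start(queue='services')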
ThickDove42 looking at the code, I suspect it fails interacting with the actual jupyter server (that is running on the same machine, but still).
Any chance you have a firewall on the Windows machine ?
Hmm, maybe the right way to do so is to abuse "models", which are an entity: you can specify a system_tag on them, they can store a folder (and extract it if you need), they belong to projects, and they are cloned and can be changed.
wdyt?
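A rough sketch of the "abuse models" idea, covering just the folder-store part; assuming a folder is packaged as an OutputModel and pulled back elsewhere (names and paths are hypothetical):

from clearml import Task, OutputModel, InputModel

task = Task.init(project_name='examples', task_name='store folder as model')

# package an entire folder as the model "weights" (it is zipped automatically)
out_model = OutputModel(task=task, name='my_folder_store')
out_model.update_weights_package(weights_path='./my_data_folder')

# later / elsewhere: fetch the model and get the extracted folder back
in_model = InputModel(model_id=out_model.id)
local_folder = in_model.get_local_copy(extract_archive=True)
print('folder extracted to', local_folder)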
BTW:
I have very small text files that make up a dataset and compression seems to take most of the upload time
How long does it take? And how come it is not smaller in size?
To clarify, there might be cases where we get helm charts / k8s manifests to deploy an inference service. A black box to us.
I see. In that event, yes, you could use clearml queues to do that; as long as you have the credentials, the "Task" is basically just a helm deployment task.
You could also have monitoring code there, so that the same Task is pure logic: spinning up the helm chart, monitoring the usage, and taking it down when it's done.