One of my biggest points of criticism of Python is its slow cold start time. I especially notice this when I use it as a scripting language for CLIs. The startup time of a simple .py script can easily be in the 100 to 300 ms range, whereas a C, Rust, or Go program with the same functionality can start in under 10 ms. This becomes even more frustrating when piping several scripts together, because the accumulated startup latency adds up quickly.
Yes, that is also my feeling. But comparing an interpreted language with a compiled one is not really fair.
Here is my quick benchmark. I refrain from using Python for most scripting/prototyping tasks but really like Janet [0] - here is a comparison for printing the current time in Unix epoch:
$ hyperfine --shell=none --warmup 2 "python3 -c 'import time;print(time.time())'" "janet -e '(print (os/time))'"
Benchmark 1: python3 -c 'import time;print(time.time())'
Time (mean ± σ): 22.3 ms ± 0.9 ms [User: 12.1 ms, System: 4.2 ms]
Range (min … max): 20.8 ms … 25.6 ms 126 runs
Benchmark 2: janet -e '(print (os/time))'
Time (mean ± σ): 3.9 ms ± 0.2 ms [User: 1.2 ms, System: 0.5 ms]
Range (min … max): 3.6 ms … 5.1 ms 699 runs
Summary
'janet -e '(print (os/time))'' ran
5.75 ± 0.39 times faster than 'python3 -c 'import time;print(time.time())''
[0]: https://janet-lang.org/
Well, Python is also compiled, technically.
> The startup time of a simple .py script can easily be in the 100 to 300 ms range
I can't say I've ever experienced this. Are you sure it's not related to other things in the script?
I wrote a single-file Python script that's a few thousand lines long. It can process a 10,000-line CSV file and do a lot of calculations, to the point where I built an entire CLI income/expense tracker with it [0].
End to end, the command takes 100ms to process those 10k lines, measured with `time`, on hardware from 2014 running Python 3.13. It takes ~550ms to fully process 100k lines. I spent zero time optimizing the script but did try to avoid common pitfalls (deeply nested loops, etc.).
[0]: https://github.com/nickjj/plutus
> I can't say I've ever experienced this. Are you sure it's not related to other things in the script? I wrote a single file Python script, it's a few thousand lines long.
It's because of module imports, primarily and generally. It's worse with many small files than a few large ones (Python 3 adds a little additional overhead because it needs extra system calls and complexity in the import process to handle `__pycache__` folders). A great way to demonstrate it is to ask pip to do something trivial (like `pip --version`, or `pip install` with no packages specified), or to compare the performance of pip installed in a venv to pip used cross-environment (with `--python`). Pip imports literally hundreds of modules at startup, and hundreds more the first time it hits the network.
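As a rough illustration, here is a minimal sketch (module choices are arbitrary) that times a batch of typical stdlib imports from inside the interpreter; CPython's `-X importtime` flag gives a much more detailed per-module breakdown.
```
# Minimal sketch: time a batch of typical CLI-tool imports from inside Python.
# (For a per-module breakdown, run: python3 -X importtime your_script.py)
import time

t0 = time.perf_counter()
import argparse, csv, json, pathlib, subprocess  # arbitrary "typical" imports
t1 = time.perf_counter()

print(f"imports took {(t1 - t0) * 1000:.1f} ms")
```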
Makes sense, most of my scripts are standalone zero dependency scripts that import a few things from the standard library.
`time pip3 --version` takes 230ms on my machine.
That proves the point, right?
`time pip3 --version` takes ~200ms on my machine. `time go help` takes 25ms, and prints out 30x more lines than `pip3 --version`.
Yep, running `time` on my tool's `--version` takes 50ms, and funny enough, processing 10k CSV lines with ~2k lines of Python code takes 100ms, so 50ms of that is just Python preparing to run by importing 20 or so standard library modules.
> so 50ms of that is just Python preparing things to run by importing 20 or so standard library modules.
Probably a decent chunk of that actually is the Python runtime starting up. I don't know what all you `import` that isn't implied at startup, though.
Another chunk might be garbage collection at process exit.
And it's worse if your python libraries might be on network storage - like in a user's homedir in a shared compute environment.
Exactly this. The time to start Python is roughly a function of timeof(stat) * numberof(stat calls), and on a network filesystem that can often be orders of magnitude larger than on a local filesystem.
I do wonder, on a local filesystem, how much of the time is statting paths vs. reading the file contents vs. unmarshaling code objects. (Top-level code also runs when a module is imported, but the cost of that is of course highly module-dependent.)
Maybe you could take the stat timings, the read timings (both from strace) and somehow instrument Python to output timing for unmarshalling code (or just instrument everything in python).
Either way, at least on my system with cached file attributes, Python can start up in 10ms, so it's not clear whether you truly need to optimize much more than that (by identifying the remaining bits to optimize), versus solving the problem another way (not statting 500 files, most of which don't exist, every time you start up).
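For what it's worth, here is a rough sketch of splitting one module's cost into read time vs. unmarshal time (assuming CPython 3.7+ and that a cached .pyc already exists for the module; the stat overhead itself is easier to see with strace):
```
# Rough sketch: split one module's import cost into file read vs. unmarshal.
import importlib.util, marshal, time

spec = importlib.util.find_spec("argparse")          # any pure-Python stdlib module
pyc = importlib.util.cache_from_source(spec.origin)  # its __pycache__/...pyc path

t0 = time.perf_counter()
data = open(pyc, "rb").read()
t1 = time.perf_counter()
code = marshal.loads(data[16:])   # skip the 16-byte .pyc header (PEP 552 layout)
t2 = time.perf_counter()

print(f"read: {(t1 - t0) * 1e3:.2f} ms, unmarshal: {(t2 - t1) * 1e3:.2f} ms")
```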
Here is a benchmark https://github.com/bdrung/startup-time
This benchmark is a little bit outdated but the problem remains the same.
Interpreter initialization: Python builds and initializes its entire virtual machine and built-in object structures at startup. Native programs already have their machine code ready and need very little runtime scaffolding.
Dynamic import system: Python’s module import machinery dynamically locates, loads, parses, compiles, and executes modules at runtime. A compiled binary has already linked its dependencies.
Heavy standard library usage: Many Python programs import large parts of the standard library or third-party packages at startup, each of which runs top-level initialization code.
This is especially noticeable if you do not run on an M1 Ultra, but on some slower hardware. From the results on a Raspberry Pi 3:
C: 2.19 ms
Go: 4.10 ms
Python3: 197.79 ms
This is about 200ms startup latency for a print("Hello World!") in Python3.
Interesting. The tests use Python 3.6, which on my system replicates the huge difference shown in startup time using and not using `-S`. From 3.7 onwards, it makes a much smaller percentage change. There's also a noticeable difference the first time; I guess because of Linux caching various things. (That effect is much bigger with Rust executables, such as uv, in my testing.)
Anyway, your analysis of causes reads like something AI generated and pasted in. It's awkward in the context of the rest of your post, and 2 of the 3 points are clearly irrelevant to a "hello world" benchmark.
A python file with just `import requests` takes 250ms on my i9 on python 3.13. A go program with an equivalent (unused) network import takes < 10ms.
This is not an apples-to-apples comparison. Python needs to load and interpret the whole requests module when you run the above program. The golang linker does dead code elimination, so it probably doesn't run anything and doesn't actually do the import when you launch it.
Sure, it's not an apples-to-apples comparison - Python is interpreted and Go is statically compiled. But that doesn't change the fact that in practice a "simple" Python program/script can take longer to start up than Go takes to run your entire program.
Still, you are comparing a non-empty program to an empty program.
Even if you actually use the network module in Go, just so that the compiler wouldn't strip it away, you would still have a startup latency in Go way below 25 ms from my experience with writing CLI tools.
Whereas with Python, even in the latest version, you're already looking at at least 10x that startup latency in practice.
Note: this excludes the actual time taken by the network call, which can of course also add quite some milliseconds, depending on how far away on planet earth your destination is.
You're missing the point. The point is that python is slow to start up _because_ it's not the same.
Compare the two: I get a similar gap (different hardware as I'm at home). I wrote another program that counts the lines in a file and tested it against https://www.gutenberg.org/cache/epub/2600/pg2600.txt - same kind of gap. These are toy programs, but IME these gaps stay as your programs get bigger.
It's not interpreting - Python is loading the already byte-compiled version.
But it's also statting several files (various extensions).
I believe in the past people have looked at putting the standard library in a zip file instead of splatting it out into a bunch of files in a dirtree. In that case, I think Python would just do a few stats, find the zipfile, load the whole thing into RAM, and then index into the file.
> In that case, I think python would just do a few stats, find the zipfile, loaded the whole thing into RAM, and then index into the file.
"If python was implemented totally different it might be fast" - sure, but it's not!
No, this feature already exists.
Great - how do I use it?
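One way to see the mechanism in action is the stdlib zipimport machinery - a minimal sketch, with made-up zip and module names:
```
# Minimal sketch: put a zip archive on sys.path and import a module from it.
import sys, zipfile

with zipfile.ZipFile("bundle.zip", "w") as zf:
    zf.writestr("hello.py", "def greet():\n    return 'hi'\n")

# Imports from the zip are resolved via zipimport: one archive open and index
# lookup instead of a stat walk over many individual files.
sys.path.insert(0, "bundle.zip")
import hello
print(hello.greet())
```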
You should look at the self-executing .pex file format (https://docs.pex-tool.org/whatispex.html). The whole python program exists as a single file. You can also unzip the .pex and inspect the dependency tree.
It's tooling agnostic and there are a couple of ways to generate them, but the easiest is to just use pants build.
Pants also does dependency traversal (that's the main reason we started using it, deploying a microservices monorepo) so it only packages the necessary modules.
I haven't profiled it yet for cold starts, maybe I'll test that real quick.
https://www.pantsbuild.org/dev/docs/python/overview/pex
Edit: just ran it on a hello world with py3.14 on an M3 MacBook Pro: about 100 ±30 ms for `python -m hello` and 300-400 ms (but wild variance) for executing the pex with `./hello/binary.pex`.
I'm not sure if a pants expert could eke out more speed gains and I'm also not sure if this strategy would win out with a lot of dependencies. I'm guessing the time required to stat every imported file pales in comparison to the actual load time, and with pex, everything needs to be unzipped first.
Pex is honestly best when you want to build and distribute an application as a single file (there are flags to bundle the python interpreter too).
The other option is mypyc, though again that seems to mostly speed up runtime https://github.com/mypyc/mypyc
Now if I use `python -S` (disables `import site` on initialization), that gets down to ~15ms execution time for hello world. But that gain gets killed as soon as you start trying to import certain modules (there is a very limited set of modules you can work with and still keep the speedup). So if your whole script is pure Python with no imports, you could probably have a 20ms cold start.
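A quick way to reproduce that comparison from Python itself (a sketch; assumes `python3` is on PATH, and the numbers will obviously vary by machine):
```
# Compare bare startup with and without `import site` (-S), best of a few runs.
import subprocess, time

def startup_ms(cmd, runs=5):
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, (time.perf_counter() - t0) * 1000)
    return best

print(f"python3 -c pass    : {startup_ms(['python3', '-c', 'pass']):.1f} ms")
print(f"python3 -S -c pass : {startup_ms(['python3', '-S', '-c', 'pass']):.1f} ms")
```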
Just a guess - but perhaps the startup time is before `time` is even imported?
`time` is a shell command that you can use to invoke other commands and track their runtime.
It might not be the fastest but I suspect something weird is happening with python resolution.
For instance `uv run` has its own fair share of overhead.
Run strace on Python starting up- you will see it statting hundreds if not thousands of files. That gets much worse the slower your filesystem is.
On my linux system where all the file attributes are cached, it takes about 12ms to completely start, run a pass statement, and exit.
I don't know why people care so much about a few hundred milliseconds for Python scripts versus compiled languages that take ten times less.
Real question: what would you do with the time saved? Are you in that much of a hurry in your life?
Completely agree on this.
Regarding cold starts, I strongly believe V8 snapshots are perhaps not the best way to achieve fast cold starts with Python (they may be if you are tied to using V8, though!), and they will have wide side effects if you go outside the standard packages included in the Pyodide bundle.
To put this in perspective: V8 snapshots store the whole state of an application (including its compiled modules). This means that for a Python package that is using Python (one wasm module) + Pydantic-core (one wasm module) + FastAPI... all of those will be included in one snapshot (as well as the application state). This makes sense for browsers, where you want to be able to inspect/recover everything at once.
The issue with this design is that the compiled artifacts and the application state are bundled into a single artifact (this is not great for AOT-designed runtimes, though it might be the optimal design for JITs).
Ideally, you would separate each of the compiled modules from the state of the application. When you do this, you have some advantages: you can deserialize the compiled modules in parallel, and untie the "deserialization" from recovering the state of the application. This design doesn't adapt that well into the V8 architecture (and how it compiles stuff) when JavaScript is the main driver of the execution, however it's ideal when you just use WebAssembly.
This is what we have done at Wasmer, which allows for cold starts much faster than 1 second. Because we cache each of the compiled modules separately, and recover the state of the application later, we can achieve cold starts that are an order of magnitude faster than Cloudflare's state of the art (when using pydantic, fastapi and httpx).
If anyone is curious, here is a blogpost where we presented fast-cold starts for the application state (note that the deserialization technique for Wasm modules is applied automatically in Wasmer, and we don't showcase it on the blogpost): https://wasmer.io/posts/announcing-instaboot-instant-cold-st...
Side note: congrats to the Cloudflare team on their work on Python on Workers, it's inspiring to all providers in the space... keep it up and let's keep challenging the status quo!
Big packages shouldn't be imported until the CLI args have been parsed and handed off to main. There's been work to do this automatically, but it's good hygiene to avoid early imports anyway.
A modern machine shouldn’t take this long, so likely something big is being imported unnecessarily at startup. If the big package itself is the issue, file it on their tracker.
Are you comparing the startup time of an interpreted language with the startup time of a compiled language? or you mean that `time python hello.py` > `( time gcc -O2 -o hello hello.c ) && ( time ./hello )` ?
I'm referring to the startup time as benchmarked in the following manner: https://github.com/bdrung/startup-time
Here's the thing - I don't really care if it's because the interpreter has to start up, or there's a remote HTTP call, or we scan the disks for integrity - the end user experience on every run is slower.
You can run .pyc stuff “directly” with some creativity, and there are some tools to pack “executables” that are just chunked blobs of bytecode.
it depends somewhat on what you import, too. some people would sell their grandmothers to get below 1s when you start importing numpys and scikits.
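For the .pyc route mentioned above, a minimal sketch of pre-compiling a script to a standalone bytecode file (assumes a `hello.py` next to it; CPython can then be pointed directly at the resulting .pyc):
```
# Compile hello.py to a standalone bytecode file.
import py_compile

pyc_path = py_compile.compile("hello.py", cfile="hello.pyc")
print("bytecode written to", pyc_path)   # run it with: python3 hello.pyc
```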
The upcoming lazy import system may help with startup time…but if the underlying issue wasn't "Python startup is slow" but rather "a specific program imports modules that take a long time to load", it'll only shift the time consumption to runtime.
That's totally fine, because many CLI tools are organized like `mytool subcommand --params=a,b...`, and breaking out those subcommands into their own modules and lazy loading everything (which good CLI tools already know to do) means unused code never gets imported.
You can already lazy import in python, but the new system makes the syntax sweeter and avoids having to have in-function `import module` calls, which some linters complain about.
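A minimal sketch of that pattern with today's in-function imports (tool and command names are made up; the stand-in "heavy" import only happens when its subcommand actually runs):
```
# Lazy per-subcommand imports: the fast path never pays for the heavy module.
import argparse

def cmd_report(args):
    import sqlite3          # stand-in for a heavy dependency, imported lazily
    print("sqlite", sqlite3.sqlite_version)

def cmd_version(args):
    print("mytool 1.0")     # fast path: no heavy imports at all

parser = argparse.ArgumentParser(prog="mytool")
sub = parser.add_subparsers(required=True)
sub.add_parser("report").set_defaults(func=cmd_report)
sub.add_parser("version").set_defaults(func=cmd_version)

args = parser.parse_args()
args.func(args)
```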
Reminds me of the Mercurial VCS!!
Yes, it's bad enough that there's `chg` to (barely) improve the command latency.
(Side note: this is why jj is awesome. A `jj log` is almost as fast as `ls`.)
The most interesting bit here is not the “2.4x faster than Lambda” part, it is the constraints they quietly codify to make snapshots safe. The post describes how they run your top-level Python code once at deploy, snapshot the entire Pyodide heap, then effectively forbid PRNG use during that phase and reseed after restore. That means a bunch of familiar CPython patterns at import time (reading entropy, doing I/O, starting background threads, even some “random”-driven config) are now treated as bugs and turned into deployment failures rather than “it works on my laptop.”
In practice, Workers + Pyodide is forcing a much sharper line between init-time and request-time state than most Python codebases have today. If you lean into that model, you get very cheap isolates and global deploys with fast cold starts. If your app depends on the broader CPython/C-extension ecosystem behaving like a mutable Unix process, you are still in container land for now. My hunch is the long-term story here will be less about the benchmark numbers and more about how much of “normal” Python can be nudged into these snapshot-friendly constraints.
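To make that constraint concrete, here is a plain-Python sketch of the init-time vs. request-time split (not the Workers API; names are illustrative): deterministic setup stays at module top level, while anything needing entropy or clocks is deferred until the first request after restore.
```
# Illustrative snapshot-friendly split: deterministic import-time setup,
# entropy/clock access deferred to request time (i.e. after restore).
import functools, random, time

CONFIG = {"greeting": "hello"}      # snapshot-safe: pure, deterministic

@functools.lru_cache(maxsize=1)
def request_time_state():
    # Runs on the first request, after any snapshot restore, so reading
    # entropy and clocks here is fine.
    rng = random.Random()           # seeded from OS entropy at call time
    return {"instance_id": rng.getrandbits(64), "started": time.time()}

def handle_request(path):
    state = request_time_state()
    return f"{CONFIG['greeting']} {path} from {state['instance_id']:x}"

print(handle_request("/"))
```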
I'm betting against wasm and going with containers instead.
I have a warm pool of lightweight containers that can be reused between runs. And that's the crucial detail that makes or breaks it. The good news is that you can lock it down with seccomp while still allowing normal execution. This gives you 10-30ms starts with pre-compiled Python packages inside the container. A cold start is as fast as spinning up a new container, 200-ish ms. If you run this setup close to your data, you get fast access to your files, which is huge for data-related tasks.
But this is not suitable for the type of deployment Cloudflare is doing. The question is whether you even want that global availability, because you will trade it for performance. At the end of the day, they are trying to reuse their isolates infra, which is very smart and opens doors to other wasm-based deployments.
It’s 2025 and choosing a region for your resources is still an enterprise feature on cloudflare.
In contrast, AWS provides this as a baseline: you choose where your services run. In a world where you can't do anything without hundreds of compliance requirements, many of which require geolocation-based access control or data retention, this is absurd.
it's only absurd if you don't want to pay cloudflare money
You can't pay and get it even if you want.
There is no paid business plan that supports this. You have to be millions of dollars worth of enterprise on their enterprise plan to get it through your dedicated account manager.
That's basically not how Cloudflare works.
Your app works distributed/globally on the go.
Additionally, every Enterprise feature will become available in time (discussed during their previous quarterly earnings call). It will be bound to regions (e.g. the EU).
```
BREAKING CHANGE The following packages are removed from the Pyodide distribution because of the build issues. We will try to fix them in the future:
arro3-compute
arro3-core
arro3-io
Cartopy
duckdb
geopandas
...
polars
pyarrow
pygame-ce
pyproj
zarr
```
https://pyodide.org/en/stable/project/changelog.html#version...
Bummer, looks like a lot of useful geo/data tools got removed from the Pyodide distribution recently. Being able to use some of these tools in a Worker in combination with R2 would unlock some powerful server-side workflows. I hope they can get added back. I'd love to adopt CF more widely for some of my projects, and it seems like support for some of this stuff would make adoption by startups easier.
Do you happen to know what build issues they're referring to?
Checked out the Cloudflare post... they now support Pyodide-compatible packages through uv... so you can pull in whatever Python libs you need, not just a curated list.
ALSO the benchmarks show about a one second cold start when importing httpx, fastapi and pydantic... that's faster than Lambda and Cloud Run, thanks to memory snapshots and isolate-based infra.
BUT the default global deployment model raises questions about compliance when you need specific regions... and I'd love to know how well packages with native extensions are supported.
If anyone from Cloudflare comes here - it's not possible to create D1 databases on the fly and interact with them, because databases must be mentioned in the worker bindings.
This hampers the per-user database workflow.
Would be awesome if a fix lands.
Try Durable Objects. D1 is actually just a thin layer over Durable Objects. In the past D1 provided a lot of DX benefits like better observability, but those are increasingly being merged back into DO directly.
What is a Durable Object? It's just a Worker that has a name, so you can route messages specifically to it from other Workers. Each one also has its own SQLite database attached. In fact, the SQLite database is local, so you can query it synchronously (no awaits), which makes a lot of stuff faster and easier. You can easily create millions of Durable Objects.
(I am the lead engineer for Workers.)
(I work at Cloudflare, but not on D1)
I believe this is possible, you can create D1 databases[1] using Cloudflare's APIs and then deploy a worker using the API as well[2].
1 - https://developers.cloudflare.com/api/resources/d1/subresour...
2 - https://developers.cloudflare.com/api/resources/workers/subr...
Thank you! That's great and it is possible but... With some limitations.
The idea is to go from a sign-up form to a D1 database that can be accessed from the worker itself.
That's not possible without updating worker bindings like you showed, and further, there is an upper limit of 5,000 bindings per worker, so just 5,000 users becomes the ceiling, even though D1 easily allows 50,000 databases, with more possible by requesting a limit increase.
edit: Missed opening.
Hey, would you happen to know if/when D1 can get support for ICU (https://sqlite.org/src/dir/ext/icu) and transactions?
Transactions are supported in Durable Objects. In fact, with DO you are interacting with the SQLite database locally and synchronously, so transactions are essentially free with no possibility of conflicts and no worry about blocking other queries.
Extensions are easy to enable, file a bug on https://github.com/cloudflare/workerd . (Though this one might be trickier than most as we might have to do some build engineering.)
Re DO - I am definitely not rewriting my web wasm rust-sqlx app to use DO.
Re filing an issue - sounds straightforward, will do!
You can run your wasm app in a DO, same as you run it in a Worker.
I'm always a little hesitant to use D1 due to some of these constraints. I know I may not ever hit 10GB for some of my side projects so I just neglect sharding, but also it unsettles me that it's a hard cap.
Why not durable objects? I think it's the recommended pattern for having a db per user
The comparison with AWS Lambda seems to ignore the AWS memory snapshot option called "SnapStart for Python". I'd be interested in seeing the timing comparison extended to include SnapStart.
"SnapStart for Python" costs extra though. If we are paying then you can even have prewarmed Python lambdas with no cold start on AWS (Provisioned Concurrency).
Unless I misunderstand, AWS SnapStart and their memory snapshots are the same feature (taking memory snapshots to speed up cold start). It doesn't seem a fair comparison to ignore this and my assumption is because AWS Lambda SnapStart is faster.
I think it's fair because AWS charges extra for it.
They are comparing the baseline product of all three platforms. Why should we take paid add-ons into account for one platform?
As I mentioned, if you are OK with paying, then you should also compare Provisioned Concurrency on AWS as well, which has zero cold start (they keep a prewarmed Lambda for you).
Product comparisons are not purely technical in nature. As a user, if I'm paying extra, I would much rather have the zero cold start than just a reduced cold start, especially with all these additional complexities.
It wasn't an intentional omission, we weren't aware of this feature in AWS Lambda. The blog post has been updated to reflect that the numbers are for Lambda without SnapStart enabled.
Python Workers use snapshots by default and unlike SnapStart we don't charge extra for it. For many use cases, you can run Python Workers completely for free on our platform and benefit from the faster cold starts.
In the linked detailed benchmark results they include Lambda SnapStart which seems to be faster than Cloudflare:
> AWS Lambda (No SnapStart) Mean Cold Start: 2.513s Data Points: 1008
> AWS Lambda (SnapStart) Mean Cold Start: 0.855s Data Points: 17
> Google Cloud Run Mean Cold Start: 3.030s Data Points: 394
> Cloudflare Workers Mean Cold Start: 1.004s Data Points: 981
https://cold.edgeworker.net
I still don't get it - what is the use case for Cloudflare Workers or Lambda?
I used both for years. Nothing beats a VPS/bare metal. Alright, they give lower latency, and they're maybe cheaper, but they're a big nightmare to manage at the same time. Hello, microservices architecture.
Think about how often that box needs patching for code outside of your app and you’re talking load-balancing, autoscaling, etc. to avoid downtime or overloads, but also paying for idle capacity. Then, of course, you have to think about security partitions if anything you run on that box shouldn’t have access to everything else. None of that is unknown, we have decades of experience and tooling dealing with it, etc. but it’s a job that you can just choose not to have and people often do, especially for things which are bursty. There’s something really nice about not needing to patch anything for years because all you’re using is the Python stdlib and scaling from zero to many, many thousands with no added effort.
Scale down to actual 0, really easy edge distribution, stupidly simple to deploy to. That's really it.
I hope Cloudflare improves Next.js support on Workers.
Currently the pagespeed.web.dev score drops by around 20 points compared to the self-hosted version. One of the best features of Next.js, image optimization, doesn't have out-of-the-box support. You need a separate image optimization service, which also did not work for me for local images (images in the bundle).
Pyodide is a great enabler for this kind of thing, but most of the libraries I want to use tend to be native or just weird. Still, I wonder how fast things like Pillow, Pandas and the like are these days - benchmarks would be nice.
Very interesting, but the limitation on which libraries you can use is a big one.
I wonder if they plan to invest seriously into this?
I wish they would contribute stuff like this memory snapshotting to CPython.
It relies entirely on the WebAssembly runtime; see the discussion of how ASLR problems don't occur with WASM. Doing this with WASM is pretty easy; doing it with system memory is quite tricky.
Anybody using it for something serious? I can't see a use case beyond "I need a quick script running that's not worth setting up a VPS for."
nice