pg_mooncake 🤝 drizzle: typed SDKs for your analytic workloads
A thing that surprised us in the pg_mooncake launch was how much developers cared about ORMs for their analytical workloads. We saw two typical requests:
1. I have a bunch of Parquet files. I want full ORM support over them.
2. I use Drizzle/Prisma for my transactional database. I also want to use it for my analytics and dashboards.
We've always believed that columnar storage in Postgres must look and feel like a regular heap table. It's where we innovated with pg_mooncake columnstore tables; you can run transactions, updates, deletes, and joins with heap tables.
A nice bonus of native columnstore in Postgres is your favorite ORM works out of the box. In this blog post, we'll use Drizzle.
We will build a simple analytics app on a reddit dataset. And run all analytics queries using Drizzle :)
1. Create a pg_mooncake columnstore table
We be using a reddit dataset from Huggingface.
First, create a pg_mooncake columnstore table.
CREATE TABLE reddit_comments (
author TEXT,
body TEXT,
controversiality BIGINT,
created_utc BIGINT,
link_id TEXT,
score BIGINT,
subreddit TEXT,
subreddit_id TEXT,
id TEXT
) using columnstore;
Next, copy the data from Huggingface into reddit_comments using mooncake.read_parquet. Here we just copy 1 parquet file (15M rows):
COPY reddit_comments FROM 'hf://datasets/fddemarco/pushshift-reddit-comments/data/RC_2012-01.parquet';
Great. Loading data from Huggingface in Postgres is pretty simple now. And fast too.
2. Set up Drizzle schema for the columnstore table.
After we set up Drizzle Driver for our Postgres instance, we can set up the schema for our columnstore table:
// schema.ts
export const redditComments = pgTable("reddit_comments", {
id: text("id").primaryKey(),
author: text("author"),
body: text("body"),
controversiality: bigint("controversiality", { mode: "number" }),
created_utc: bigint("created_utc", { mode: "number" }),
link_id: text("link_id"),
score: bigint("score", { mode: "number" }),
subreddit: text("subreddit"),
subreddit_id: text("subreddit_id")
});
3. That's it. Start querying your data model with full type safety.
For example, to get subreddit statistics:
routes.ts
const stats = db
.select({
subreddit: redditComments.subreddit,
commentCount: sql<number>`count(${redditComments.id})`,
avgScore: sql<number>`avg(${redditComments.score})::float`,
avgControversiality: sql<number>`avg(${redditComments.controversiality})::float`,
totalScore: sql<number>`sum(${redditComments.score})`
})
.from(redditComments)
.groupBy(redditComments.subreddit)
.orderBy(desc(sql`count(${redditComments.id})`))
.limit(10);
You can write fast analytics queries with sytnax you're comfortable with.
4. Get started with pg_mooncake today.
pg_mooncake v0.1 is live! It is available on Neon and will be coming to Supabase very shortly.
Also, check out our friends at Drizzle. And it's never been easier to use your ORM for your analytics.
We're now an official Drizzle sponsor.
We're excited to take on projects like onedollarstats together. Stay tuned for more.