Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.
The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.
What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
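As a rough illustration of the dump-to-static-HTML step (a minimal sketch, not the tool's actual code): the Reddit dumps are newline-delimited JSON once decompressed, so each line can be parsed and rendered on its own. The helper names `iter_posts` and `render_post` are hypothetical, and the field names `title`, `author`, and `selftext` are assumptions about the dump layout. Decompressing `.zst` needs a third-party package such as `zstandard`, so that step is left out and the sketch takes already-decoded lines.

```python
import html
import json
from typing import Iterable, Iterator


def iter_posts(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one post dict per NDJSON line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a corrupt line instead of aborting the run
        if isinstance(record, dict):
            yield record


def render_post(post: dict) -> str:
    """Render one post as a standalone HTML page with no scripts or external requests."""
    # Escape everything that came from users so it cannot inject markup.
    title = html.escape(str(post.get("title", "(untitled)")))
    author = html.escape(str(post.get("author", "[deleted]")))
    body = html.escape(str(post.get("selftext", "")))
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body><h1>{title}</h1><p>by {author}</p><p>{body}</p></body></html>\n"
    )
```

Because `iter_posts` is a generator, only one record is held in memory at a time, which is what makes multi-gigabyte dumps workable on small hardware.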
API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
Self-hosting options:
- USB drive / local folder (just open the HTML files)
- Home server on your LAN
- Tor hidden service (2 commands, no port forwarding needed)
- VPS with HTTPS
- GitHub Pages for small archives
Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.
Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B-post dataset, run multiple instances by topic.
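One way to picture the constant-memory claim is batch-wise loading: rows are pulled from a stream in fixed-size chunks and handed to the database, so peak memory depends on the batch size rather than the dump size. The sketch below is an assumption-laden illustration, not the project's ingest code: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs without a server, and the `posts` table, `batched`, and `load_posts` are hypothetical names.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable[tuple], size: int) -> Iterator[list]:
    """Yield lists of at most `size` rows without materializing the whole input."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_posts(conn: sqlite3.Connection, rows: Iterable[tuple], batch_size: int = 1000) -> int:
    """Insert (id, title) rows batch by batch and return how many rows were read."""
    # Hypothetical schema, used only for this illustration.
    conn.execute("CREATE TABLE IF NOT EXISTS posts (id TEXT PRIMARY KEY, title TEXT)")
    total = 0
    for batch in batched(rows, batch_size):
        conn.executemany("INSERT OR IGNORE INTO posts (id, title) VALUES (?, ?)", batch)
        conn.commit()  # commit per batch so no single transaction grows unbounded
        total += len(batch)
    return total
```

Feeding `load_posts` a generator (for example, one that parses the dump line by line) keeps only `batch_size` rows in memory at any moment.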
How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is “trust but verify” – it accelerates the boring parts but you still own the architecture.
Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)
Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4



It would be inviting a lawsuit for sure. I like the essence of the idea, but it’s probably more trouble than it’s worth for all but the most fanatic.
Is it though? That is (or was, and should be again) publicly accessible information that was created over the years by random internet users. I refuse the notion that an American company can “own it” just because they ran the servers. Sure they can hold copyright for their frontend and backend code, name and whatever. But posts and comments, no way.
Of course it would be dumb for someone under US jurisdiction, but we’ll see how much an international DMCA claim is worth anyway, considering the current relations.
They don’t own it; the individual posters own the content of their own posts. However, from the Reddit terms of service:
And with each of those rights granted, Reddit’s lawyers can defend those rights. So no, they don’t own it “just because they ran the servers” - they own specific rights to copy granted to them by each poster.
(I don’t like this arrangement, but ignorance of the terms of service isn’t going to help someone who uploads a full copy of works that Reddit holds extensive rights to.) On this subject, I think there needs to be an extensive overhaul to narrow what terms you can extend to the general public. The problem is that I straight up don’t trust anyone currently in power to have our interests in mind when making such a change.
Might be easiest to set up an instance in a country that doesn’t give a fuck about western IP law, then others can federate to it.
So yeah, fanatic levels of effort.
Post and comments are not Reddit’s IP anyway :3
They might have set up the user agreement for it. Stack Exchange did, and their whole business model was about catching businesses where some worker copy/pasted code from a Stack Exchange answer and getting a settlement out of it.
I agree with you in principle (hell, I’d even take it further and think only trademarks should be protected, other than maybe a short period for copyright and patent protection, like a few years), but the legal system might disagree.
Edit: I’d also make trademarks non-transferable and apply to individuals rather than corporations, so they can go back to representing quality rather than business decisions. Especially when some new entity that never had any relation to the original trademark user just throws some money at them or their estate to buy the trust associated with the trademark.
/u/Buddahriffic put it better than I could.
I agree, it shouldn’t be reddit’s intellectual property. But the law binds the poor and protects the rich.
this is one reason i support tor deployment out of the box 😋
Brb, setting up a Lemmy server in Red Star OS
(The machine with the only Steam account active in North Korea ~~would like to~~ already knows your location)
The chances are pretty high that it’s Kim’s computer, aren’t they?