MediaCrawler — Multi-Platform Social Media Crawler

Published: December 01, 2023

MediaCrawler is an open-source project, intended for learning and research, that crawls public social media data from multiple platforms, including Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. The project demonstrates practical browser automation with Playwright: it keeps a logged-in browser session and obtains the data it needs through that session, without reverse-engineering complex JavaScript encryption.
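
To make the session-preservation idea concrete, here is a minimal sketch using Playwright's synchronous Python API. It is not MediaCrawler's actual code: the file name `login_state.json` and the helper names `context_options` and `fetch_title_with_cached_login` are assumptions made for illustration. The sketch saves cookies and local storage after a manual login and restores them on later runs, so the browser itself handles any request signing.

```python
from pathlib import Path

# Hypothetical location for the cached login state (not a MediaCrawler path).
STATE_FILE = Path("login_state.json")


def context_options(state_file: Path) -> dict:
    """Return keyword arguments for new_context() that restore a saved session, if one exists."""
    return {"storage_state": str(state_file)} if state_file.exists() else {}


def fetch_title_with_cached_login(url: str, state_file: Path = STATE_FILE) -> str:
    """Open `url` in a browser, reusing a cached login when available, and return the page title."""
    # Imported lazily so context_options() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    first_run = not state_file.exists()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(**context_options(state_file))
        page = context.new_page()
        page.goto(url)
        if first_run:
            # No cached session yet: the user logs in by hand in the opened window.
            input("Log in in the browser window, then press Enter to continue...")
        title = page.title()
        # Persist cookies and local storage so the next run skips the login step.
        context.storage_state(path=str(state_file))
        browser.close()
    return title
```

On the first call the function waits for a manual login; later calls reuse the saved state file until the platform expires the session.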

Key functionalities include:

Keyword-based search and specific post ID crawling
Crawling second-level comments (replies) and creator homepages
Login state caching and IP proxy support
Data export to SQLite, MySQL, CSV, or JSON
Generating comment word clouds for analytics
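
As an illustration of the export step, the following sketch writes the same set of comment records to JSON, CSV, and SQLite using only the Python standard library. The record fields (`comment_id`, `note_id`, `content`, `like_count`), the table name, and the function names are hypothetical and do not reflect MediaCrawler's actual schema or storage code.

```python
import csv
import io
import json
import sqlite3

# Hypothetical comment records; the field names are illustrative only.
FIELDS = ["comment_id", "note_id", "content", "like_count"]
COMMENTS = [
    {"comment_id": "c1", "note_id": "n1", "content": "Great post", "like_count": 12},
    {"comment_id": "c2", "note_id": "n1", "content": "Thanks for sharing", "like_count": 3},
]


def to_json(rows):
    """Serialize rows as JSON, keeping non-ASCII text (e.g. Chinese) readable."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows, conn):
    """Upsert rows into a `comments` table so repeated crawls do not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "comment_id TEXT PRIMARY KEY, note_id TEXT, content TEXT, like_count INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO comments "
        "VALUES (:comment_id, :note_id, :content, :like_count)",
        rows,
    )
    conn.commit()


if __name__ == "__main__":
    print(to_csv(COMMENTS))
    connection = sqlite3.connect(":memory:")
    to_sqlite(COMMENTS, connection)
    print(connection.execute("SELECT COUNT(*) FROM comments").fetchone()[0])
```

Keying the SQLite table on the comment ID is one simple way to make re-running a crawl idempotent; a MySQL backend could follow the same pattern with its own upsert syntax.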

The project emphasizes learning modern web scraping architecture. A separate MediaCrawlerPro version provides advanced features such as multi-account support, resumable crawling, Linux compatibility, and decoupled JS signature logic, and aims for enterprise-level code quality.

Repository: https://github.com/NanmiCoder/MediaCrawler
Documentation & Tutorial: https://nanmicoder.github.io/MediaCrawler/
MediaCrawlerPro: https://github.com/MediaCrawlerPro
Technology Stack: Python 3, Node.js, Playwright, SQLite/MySQL, IP Proxy Pools

My Contribution
I contributed to the project by reviewing, reorganizing, and enhancing the documentation, ensuring that installation guides, usage instructions, and configuration explanations are clear and easy to follow. This improvement makes the project more accessible to learners and developers, and helps users quickly get started with multi-platform social media crawling using MediaCrawler.

Usage Notice: This project is strictly for learning and research purposes. Commercial or illegal use is prohibited, and the developer assumes no legal responsibility for misuse.

Zhuhan Bao