Leiden Weibo Corpus - Open access
We believe it's important for researchers to make their research data freely available to others. That's why most of the data collected for this project can be downloaded below. These files are released under the CC BY-NC-SA 3.0 license; see the legal details.

These CSV files are in ZIP archives. You can uncompress them by using a utility like WinZip or Linux unzip. All files were last updated on March 20th, 2012.
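Because the largest CSV is several gigabytes once uncompressed, it can be convenient to read it straight from the archive instead of extracting it first. The following is a minimal sketch using Python's standard zipfile module. The function name iter_zip_lines is hypothetical, and the sketch assumes that each archive holds a single CSV member, which this page does not state.

```python
import io
import zipfile


def iter_zip_lines(zip_path, member=None):
    """Yield decoded text lines from one member of a ZIP archive.

    Assumption: if no member name is given, the first entry in the
    archive is the CSV file of interest.
    """
    with zipfile.ZipFile(zip_path) as archive:
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            # Decode as UTF-8; newline="" keeps line endings as stored.
            for line in io.TextIOWrapper(raw, encoding="utf-8", newline=""):
                yield line
```

The generator holds only one line in memory at a time, so it also works for the larger downloads.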

All files are CSV files exported with the mysqldump utility and encoded in UTF-8. Fields are enclosed in double quotes and separated by commas, and special characters inside a field are escaped with backslashes. Rows are separated by a line break (\n). Empty fields are set to NULL (\N), not to an empty string ("").
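A generic CSV reader configured with a backslash escape character would turn the NULL marker \N into a plain "N", so NULL fields need separate handling. The following is a minimal sketch of a parser for the format described above, not the tool used to build the corpus. The names parse_dump and read_chars are hypothetical, the escape table follows the usual MySQL conventions, and the sketch assumes that \N only ever appears unquoted.

```python
def read_chars(path, chunk_size=1 << 20):
    """Yield the characters of a UTF-8 file, reading it chunk by chunk."""
    with open(path, encoding="utf-8", newline="") as handle:
        while chunk := handle.read(chunk_size):
            yield from chunk


def parse_dump(chars):
    """Yield rows as lists of str, with None for NULL (\\N) fields.

    `chars` is any iterable of characters: a string or read_chars(path).
    """
    # Backslash escapes as commonly written by MySQL (assumed here).
    escapes = {"0": "\0", "b": "\b", "n": "\n", "r": "\r",
               "t": "\t", "Z": "\x1a"}
    row, buf = [], []
    in_quotes = is_null = pending = False
    chars = iter(chars)
    for ch in chars:
        if ch == "\n" and not in_quotes:
            # Unquoted line break: finish the field and the row.
            row.append(None if is_null else "".join(buf))
            yield row
            row, buf, is_null, pending = [], [], False, False
            continue
        pending = True
        if ch == "\\":
            nxt = next(chars, "")
            if nxt == "N" and not in_quotes and not buf:
                is_null = True  # a bare \N marks a NULL field
            else:
                buf.append(escapes.get(nxt, nxt))
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            row.append(None if is_null else "".join(buf))
            buf, is_null = [], False
        else:
            buf.append(ch)
    if pending:
        # Last row when the input does not end with a line break.
        row.append(None if is_null else "".join(buf))
        yield row


SAMPLE = '"1","hello, \\"world\\"",\\N\n"2","",\\N\n'
for parsed in parse_dump(SAMPLE):
    print(parsed)
# ['1', 'hello, "world"', None]
# ['2', '', None]
```

The second sample row shows why the distinction matters: the empty string and NULL come out as different values. For a real file, pass read_chars with the path of the extracted CSV instead of a string; processing one character at a time is simple but slow, so a production loader would likely want something faster.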

You can download a file (ZIP: 932MB; CSV: 2.5GB) containing the original text of all messages in the corpus, along with some metadata and linguistic annotations. For each message, this CSV file contains:
This file (ZIP: 20MB; CSV: 47MB) contains a list of the words that appear in the corpus. The columns are:
Geo-lexical frequency statistics
This file (ZIP: 50MB; CSV: 342MB) contains the LWC's geo-lexical frequency statistics. The columns are:
This two-column file (ZIP/CSV: 1KB) contains a list of Sina Weibo province codes and their names. The names are given in Hànyǔ Pīnyīn (with tone marks), followed by simplified characters.

This file (ZIP: 24KB; CSV: 60KB) contains all known Sina Weibo city codes together with some useful metadata. The columns are:
This file (ZIP: 35KB; CSV: 240KB) contains all Sina Weibo emoticons that existed in January 2012. This list was used to filter emoticons from the LWC's indexes. There are three columns:
