Leiden Weibo Corpus

We believe it is important for researchers to make their research data freely available to others. That is why most of the data collected for this project can be downloaded below. These files are released under the CC BY-NC-SA 3.0 license; see the legal details.

The CSV files are distributed as ZIP archives, which you can uncompress with a utility such as WinZip or, on Linux, unzip. All files were last updated on March 20th, 2012.
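The archives do not have to be unpacked to disk before reading. The following is a minimal sketch, assuming Python 3 and assuming each archive holds a single CSV member; the member names are not documented here, so the code simply takes the first one. The function name `open_csv_from_zip` is made up for this example.

```python
import io
import zipfile
from contextlib import contextmanager

@contextmanager
def open_csv_from_zip(zip_source):
    """Yield a UTF-8 text stream over the first member of a ZIP archive.

    Assumption: the archive contains exactly one CSV file. Its name is
    not documented, so the first member is used.
    """
    with zipfile.ZipFile(zip_source) as archive:
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            # newline="" leaves line breaks untouched for a CSV parser.
            yield io.TextIOWrapper(raw, encoding="utf-8", newline="")
```

Reading the data as a stream avoids holding a multi-gigabyte file in memory, which matters most for the messages file.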

All files were exported with the mysqldump utility and are encoded in UTF-8. Fields are enclosed in double quotes and separated by commas; special characters inside a field are escaped with backslashes. Rows are separated by a line break (\n). Empty fields are set to NULL (\N), not to an empty string ("").
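Generic CSV readers do not always tell a NULL field (\N) apart from a quoted string, so a small hand-written parser can be useful. The sketch below follows the format described above and assumes the backslash escapes are the ones MySQL normally writes (\0, \b, \n, \r, \t, \Z); the name `parse_rows` is made up for this example, and the character-by-character loop favours clarity over speed.

```python
import io

# Escape sequences as MySQL usually writes them (an assumption about
# these files); any other escaped character stands for itself.
ESCAPES = {"0": "\0", "b": "\b", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}

def parse_rows(stream):
    """Yield each row as a list of str, with None for NULL (\\N) fields."""
    chars = iter(lambda: stream.read(1), "")
    row, buf = [], []
    in_quotes = was_quoted = is_null = False

    def finish_field():
        nonlocal buf, was_quoted, is_null
        row.append(None if is_null else "".join(buf))
        buf, was_quoted, is_null = [], False, False

    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # A bare \N (outside quotes, nothing before it) means NULL.
            if nxt == "N" and not in_quotes and not was_quoted and not buf:
                is_null = True
            else:
                buf.append(ESCAPES.get(nxt, nxt))
        elif ch == '"':
            in_quotes = not in_quotes
            was_quoted = True
        elif ch == "," and not in_quotes:
            finish_field()
        elif ch == "\n" and not in_quotes:
            finish_field()
            yield row
            row = []
        else:
            buf.append(ch)
    # Flush a final row that lacks a trailing line break.
    if row or buf or was_quoted or is_null:
        finish_field()
        yield row
```

With this parser a quoted "N" stays the one-letter string, a bare \N becomes None, and "" becomes an empty string.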

Messages
You can download a file (ZIP: 932MB; CSV: 2.5GB) containing the original text of all messages in the corpus, along with some metadata and linguistic annotations. For each message, this CSV file contains:

the Sina Weibo message ID
the original message text
the Sina Weibo code for the user's province
the Sina Weibo code for the user's city
the user's gender (m/f)
the user's screen name
the number of words in this message
the message text with word boundaries marked
the message text with word boundaries marked and POS tags indicated
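To make the nine columns easier to work with, each row can be wrapped in a small record type. This is only a sketch: it assumes the columns appear in the order listed above and that a row has already been parsed into a list of strings, with None for NULL. The field names and the `to_message` helper are made up for this example.

```python
from typing import NamedTuple, Optional

class Message(NamedTuple):
    """One row of the messages file (field names are illustrative)."""
    message_id: Optional[str]
    text: Optional[str]
    province_code: Optional[str]
    city_code: Optional[str]
    gender: Optional[str]          # "m" or "f"
    screen_name: Optional[str]
    word_count: Optional[int]
    segmented_text: Optional[str]  # word boundaries marked
    tagged_text: Optional[str]     # word boundaries and POS tags

def to_message(row):
    """Convert a parsed nine-field row into a Message."""
    if len(row) != 9:
        raise ValueError(f"expected 9 fields, got {len(row)}")
    fields = list(row)
    # The word count is the only column treated as a number here.
    fields[6] = None if fields[6] is None else int(fields[6])
    return Message(*fields)
```

The message ID is kept as a string so that large IDs are never altered by numeric conversion.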

Words
This file (ZIP: 20MB; CSV: 47MB) contains a list of the words that appear in the corpus. The columns are:

LWC word ID
Word
Reverse of this word (useful for retrograde searching)
Number of occurrences of this word
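The reversed column turns a search for words with a given ending into a search for a given beginning, which a sorted list or a database index can answer quickly. The sketch below illustrates this, assuming rows have already been parsed into four-element lists in the column order above. The function names and the three sample rows in the usage note are made up for this example.

```python
import bisect

def build_retrograde_index(rows):
    """Return (reversed word, word) pairs sorted by the reversed form."""
    return sorted(
        (reverse, word)
        for _word_id, word, reverse, _count in rows
        if reverse is not None
    )

def words_ending_with(index, suffix):
    """Find words that end in `suffix` via a prefix search on the index."""
    prefix = suffix[::-1]
    # (prefix,) sorts before every (prefix..., word) pair, so this finds
    # the first entry whose reversed form is >= prefix.
    start = bisect.bisect_left(index, (prefix,))
    matches = []
    for i in range(start, len(index)):
        reverse, word = index[i]
        if not reverse.startswith(prefix):
            break
        matches.append(word)
    return matches
```

For example, with the invented rows ["1", "老师", "师老", "10"], ["2", "教师", "师教", "5"] and ["3", "学生", "生学", "7"], searching for the ending 师 returns 老师 and 教师 but not 学生.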

Geo-lexical frequency statistics
This file (ZIP: 50MB; CSV: 342MB) contains the LWC's geo-lexical frequency statistics. The columns are:

Unique row ID
Word ID (see Words above)
Province ID
City ID
Frequency per 1,000 words in this city
Absolute number of occurrences in this city
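The two frequency columns are presumably related through the city's total word count, which is listed in the cities file. The sketch below shows that relation as an assumption, not as the documented way the figures were computed; the function name is made up for this example.

```python
def frequency_per_thousand(occurrences, total_words):
    """Relative frequency of a word per 1,000 words of city text.

    Assumption: the relative frequency is the absolute count divided by
    the city's total word count, scaled by 1,000.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1000.0 * occurrences / total_words
```

Under this assumption, a word that occurs 25 times in a city with 50,000 words in total has a frequency of 0.5 per 1,000 words.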

Provinces
This two-column file (ZIP/CSV: 1KB) lists the Sina Weibo province codes and their names. The names are given in Hànyǔ Pīnyīn (with tone marks), followed by simplified characters.

Cities
This file (ZIP: 24KB; CSV: 60KB) contains all known Sina Weibo city codes together with some useful metadata. The columns are:

Unique LWC city ID
Sina Weibo province code
Sina Weibo city code
Name in Hànyǔ Pīnyīn (with tone marks), followed by simplified characters
Checksum consisting of province code, hash and city code
Latitude
Longitude
Province and city in simplified characters, as given by Sina Weibo
Total number of words in messages from this city in the LWC
Number of distinct words in messages from this city in the LWC
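Since the messages and the frequency statistics refer to cities by province and city code, a lookup table keyed on that pair is a convenient way to attach names and coordinates. The sketch below assumes the ten columns appear in the order listed above and that rows are already parsed into lists of strings, with None for NULL. The function names and dictionary keys are made up for this example.

```python
def _maybe(convert, value):
    """Apply `convert` unless the field was NULL."""
    return None if value is None else convert(value)

def index_cities(rows):
    """Map (province code, city code) to a dict of city metadata."""
    cities = {}
    for row in rows:
        if len(row) != 10:
            raise ValueError(f"expected 10 fields, got {len(row)}")
        cities[(row[1], row[2])] = {
            "lwc_city_id": row[0],
            "name": row[3],
            "latitude": _maybe(float, row[5]),
            "longitude": _maybe(float, row[6]),
            "total_words": _maybe(int, row[8]),
            "distinct_words": _maybe(int, row[9]),
        }
    return cities
```

Codes are kept as strings so that they compare equal to the codes as they appear in the other files.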

Emoticons
This file (ZIP: 35KB; CSV: 240KB) contains all Sina Weibo emoticons that existed in January 2012. This list was used to filter emoticons from the LWC's indexes. There are three columns:

Unique LWC emoticon ID
Sina Weibo code, e.g. [囧]
URL to image from Sina Weibo
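A list like this can be used to remove emoticon codes from message text while leaving other bracketed text alone. The sketch below shows one way to do that; it is not necessarily how the LWC's own filtering worked. It assumes the codes from the second column have been loaded into a set, and the function name is made up for this example.

```python
import re

# Any bracketed token without nested brackets, e.g. [囧].
BRACKETED = re.compile(r"\[[^\[\]]+\]")

def strip_emoticons(text, emoticon_codes):
    """Remove bracketed tokens that appear in the set of known codes."""
    def replace(match):
        token = match.group(0)
        return "" if token in emoticon_codes else token
    return BRACKETED.sub(replace, text)
```

Checking each token against the set means that bracketed text which is not a known emoticon stays in the message.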