We believe it's important for researchers to make their research data freely available to others. That's why most of the data collected for this project can be downloaded below. These files are released under the CC BY-NC-SA 3.0 license; see the legal details.
The CSV files are distributed as ZIP archives, which you can extract with a utility such as WinZip or the Linux unzip command. All files were last updated on March 20th, 2012.
All files are CSV files exported with the mysqldump utility and encoded in UTF-8. Special characters inside fields are escaped with backslashes; fields are enclosed in double quotes and separated by commas. Rows are separated by a line break (\n). Empty fields are set to NULL (\N), not to an empty string ("").
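The format described above can be read with a small hand-written parser. The following Python sketch is only an illustration: the function name `parse_rows` and the sample data are hypothetical, and it assumes that a backslash makes the next character literal and that NULL appears as an unquoted \N. A hand-written loop is used because Python's built-in csv module, when given a backslash escape character, would turn an unquoted \N into the plain letter N.

```python
def parse_rows(text):
    """Yield rows as lists of strings; an unquoted backslash-N becomes None."""
    row, buf = [], []
    in_quotes = False   # True while inside a double-quoted field
    is_null = False     # True if the current field is an unquoted NULL marker
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "N" and not in_quotes and not buf:
                is_null = True      # NULL marker, not the letter N
            else:
                buf.append(nxt)     # assumption: escaped character is literal
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch in ",\n" and not in_quotes:
            # A separator outside quotes ends the current field.
            row.append(None if is_null else "".join(buf))
            buf, is_null = [], False
            if ch == "\n":
                yield row
                row = []
        else:
            buf.append(ch)
        i += 1
    if buf or row or is_null:
        # Final row without a trailing line break.
        row.append(None if is_null else "".join(buf))
        yield row


# Made-up sample in the described format (not taken from the corpus).
sample = '1,"a \\"quoted\\" word",\\N,""\n2,"x,y","N",\\N\n'
rows = list(parse_rows(sample))
```

In this sketch a NULL field comes back as `None`, while an empty string ("") and a quoted letter "N" stay ordinary strings. For the larger files you would want to feed the parser from a file in chunks rather than loading everything into memory.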
You can download a file (ZIP: 932MB; CSV: 2.5GB) containing the original text of all messages in the corpus, along with some metadata and linguistic annotations. For each message, this CSV file contains:
This file (ZIP: 20MB; CSV: 47MB) contains a list of the words that appear in the corpus. The columns are:
- LWC word ID
- Reverse of this word (useful for retrograde searching)
- Number of occurrences of this word
Geo-lexical frequency statistics
This file (ZIP: 50MB; CSV: 342MB) contains the LWC's geo-lexical frequency statistics. The columns are:
- Unique row ID
- Word ID (see words above)
- Province ID
- City ID
- Frequency per 1,000 words in this city
- Absolute number of occurrences in this city
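The relation between the last two columns can be sketched in Python. This is an assumption rather than the documented method: it supposes that the frequency per 1,000 words is the absolute number of occurrences divided by the total number of words for that city (as listed in the city file), multiplied by 1,000. The function name `per_thousand` is hypothetical.

```python
def per_thousand(occurrences: int, total_words: int) -> float:
    """Relative frequency per 1,000 words (assumed definition, see above)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1000 * occurrences / total_words


# Made-up numbers: 25 occurrences in a city with 50,000 words in total.
example = per_thousand(25, 50_000)  # 0.5
```

Comparing this value with the frequency column for a few rows is a quick way to check whether the assumption holds for the downloaded data.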
This two-column file (ZIP/CSV: 1KB) contains a list of Sina Weibo province codes and their names. The names are given in Hànyǔ Pīnyīn (with tone marks), followed by simplified characters.
This file (ZIP: 24KB; CSV: 60KB) contains all known Sina Weibo city codes together with some useful metadata. The columns are:
- Unique LWC city ID
- Sina Weibo province code
- Sina Weibo city code
- Name in Hànyǔ Pīnyīn (with tone marks), followed by simplified characters
- Checksum consisting of province code, hash and city code
- Province and city in simplified characters, as given by Sina Weibo
- Total number of words in messages from this city in the LWC
- Number of distinct words in messages from this city in the LWC
This file (ZIP: 35KB; CSV: 240KB) contains all Sina Weibo emoticons that existed in January 2012. This list was used to filter emoticons from the LWC's indexes. There are three columns:
- Unique LWC emoticon ID
- Sina Weibo code, e.g. [囧]
- URL to image from Sina Weibo
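As an illustration of how such a list can be used for filtering, the Python sketch below removes bracketed codes from a message only when they appear in a set of known emoticon codes. It is not the LWC's actual filtering code; the function name `strip_emoticons` and the sample message are hypothetical, and the set of codes is assumed to come from the second column of this file.

```python
import re

# Matches any bracketed token such as [囧]; nested brackets are not handled.
BRACKETED = re.compile(r"\[[^\[\]]+\]")


def strip_emoticons(message: str, known_codes: set) -> str:
    """Remove bracketed tokens that are known emoticon codes; keep the rest."""
    def replace(match):
        token = match.group(0)
        return "" if token in known_codes else token
    return BRACKETED.sub(replace, message)


# Made-up message: [囧] is a known code, [abc] is not.
cleaned = strip_emoticons("今天好[囧]啊[abc]", {"[囧]"})
```

Checking each token against the list, rather than deleting everything in square brackets, keeps bracketed text that is not an emoticon intact.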