Leiden Weibo Corpus

Methodology
Citing
Grammar patterns
Part-of-speech tags
Mistakes in the LWC
Estimates
PMI (Pointwise mutual information)
Translations
Learning Chinese?
Difficulties viewing the LWC?

Can't find what you are looking for? Feel free to send me an e-mail.

Methodology

To gather the data for this corpus, the Sina Weibo API was used. A script fetched the 200 most recently posted messages every minute, twenty-four hours a day, seven days a week, for three weeks between January 8th, 2012 and January 30th, 2012. Along with every message, the script also stored the metadata provided by the Sina Weibo API, such as the user’s screen name, location and gender.
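The collection loop described above can be sketched roughly as follows. This is an illustration only, not the script that was actually used: `fetch_latest` is a hypothetical stand-in for the Sina Weibo API call, and skipping messages whose ID has already been stored is an assumption, made because consecutive batches of "most recent" messages may overlap.

```python
import time
from typing import Callable, Dict, Iterable


def collect(fetch_latest: Callable[[], Iterable[dict]],
            store: Dict[str, dict],
            rounds: int,
            interval_seconds: float = 60.0,
            sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll `fetch_latest` `rounds` times and keep each message once, keyed by ID.

    `fetch_latest` is a hypothetical callable returning the latest messages as
    dicts with at least an "id" key. Returns the number of newly stored messages.
    """
    added = 0
    for i in range(rounds):
        for message in fetch_latest():
            message_id = str(message["id"])
            if message_id not in store:
                # Store the whole record, so the metadata (screen name,
                # location, gender, ...) stays attached to the message text.
                store[message_id] = message
                added += 1
        if i < rounds - 1:
            sleep(interval_seconds)  # wait before the next poll
    return added
```

At 200 messages per minute, such a loop can see at most 200 × 60 × 24 = 288,000 messages per day.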

The period in which the corpus was built was chosen so as to contain a “normal” business week, as well as the holiday period around the Chinese New Year, which was on January 23rd, 2012. These dates were chosen to allow users of the LWC to compare messages that are broadly on the same topic, i.e. the Chinese New Year, across China, while also enabling them to see how topics discussed in a normal business week vary from region to region.

All messages were processed using tools from the Stanford Natural Language Processing Group, which were configured to use the Penn Chinese Treebank standard for Chinese word segmentation and part-of-speech tagging. The resulting indices were stored in a MySQL database, to which this site provides a web interface.

If you have any further questions on how the LWC was built, please feel free to get in touch.

Citing

If you cite messages from the LWC, you can refer to their message IDs. These are unique identifiers which can be used to look up messages via the search interface. Please attribute your data to the Leiden Weibo Corpus, including a link to the LWC home page (http://lwc.daanvanesch.nl/). LWC data is freely available, so in general, you may use data from the LWC in academic papers without getting prior written permission. However, some rights are reserved: see the legal page. I would also appreciate it if you could let me know if you're using LWC data in your research. Such information is not only useful when applying for grants, which are necessary to keep the LWC freely and publicly available, but I'm also simply interested to know what people use the LWC for.

Grammar patterns

Grammar patterns are expressed in building blocks that look like this: [POS | word], where "POS" stands for any number of part-of-speech tags, and "word" for any number of words. For POS tags, we use the Penn Chinese Treebank standards (see below). You can also use the wildcard "any", which pretty much does what it says, and the minus symbol to exclude a (list of) value(s). Keep in mind that using the "any" wildcard may somewhat slow down your query. As a rule, the more specific your query, the faster your results will be returned. Some example queries:

[NN | 书] = look for the noun 书
[NN | any] = look for any noun
[NN | -书杂志] = look for a noun, but exclude 书 and 杂志 (if the minus symbol is the first character given, this excludes all following values)
[-NN | 说明] = look for 说明 where it is not used as a noun

As indicated above, you can also use more than one part-of-speech tag or word per building block, e.g.:

[NN VV | 说明] = look for the word 说明, but only where it appears as a noun or a verb

You can also combine such building blocks:

[DEG | 的] [NN VV | 了解] = look for the word 了解, but only where it appears as a noun or a verb, and where it is preceded by 的
[DEG | 的] [NN | any] = look for any noun preceded by 的
[PN | 他] [DEG | 的] [NN | any] = look for any noun preceded by a subordinating 的 and the pronoun 他
[BA | any] [NN | any] [VV | any] [AS | any] = look for a BA-construction with a noun immediately after BA, and then a verb and an aspect marker

The tag "PU" denotes punctuation marks (the fullwidth full stop, comma, question and exclamation mark). You can use "any" as the word value, or specify this further by using the following values:

[PU | .] = the fullwidth full stop (。)
[PU | ,] = the fullwidth comma (，)
[PU | !] = the fullwidth exclamation mark (！)
[PU | ?] = the fullwidth question mark (？)

The word value "num" is a wildcard for all Arabic numerals (0-∞). Also, when specifying words, you can use * at the beginning or end of a word as a wildcard, if you surround the word with parentheses: [any | (*们)] will match every word ending in 们, for example. Finally, please note that the search interface will only accept up to five building blocks per query.
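To make the matching rules above more concrete, here is a minimal Python sketch of how such building blocks could be parsed and matched against a sequence of (word, tag) pairs. It is not the LWC's actual implementation, and all names in it (`parse_query`, `find_matches`, and so on) are made up for illustration. It assumes that multiple values inside a block are separated by whitespace, that "num" means a run of the digits 0-9, and that a wildcard word has * at exactly one end.

```python
import re
from typing import List, Sequence, Tuple

Token = Tuple[str, str]        # (word, POS tag)
Side = Tuple[bool, List[str]]  # (negated?, values)

# ASCII shorthand from the PU examples, mapped to the fullwidth marks.
PUNCTUATION = {".": "。", ",": "，", "!": "！", "?": "？"}
MAX_BLOCKS = 5
BLOCK_RE = re.compile(r"\[([^\[\]|]*)\|([^\[\]|]*)\]")


def parse_side(text: str) -> Side:
    """Split one side of a block; a leading minus negates all its values."""
    text = text.strip()
    negated = text.startswith("-")
    return negated, (text[1:] if negated else text).split()


def parse_query(query: str) -> List[Tuple[Side, Side]]:
    blocks = [(parse_side(p), parse_side(w)) for p, w in BLOCK_RE.findall(query)]
    if not 1 <= len(blocks) <= MAX_BLOCKS:
        raise ValueError("a query needs between one and five building blocks")
    return blocks


def value_matches(value: str, item: str, is_word: bool) -> bool:
    if value == "any":
        return True
    if not is_word:
        return value == item  # POS tags are compared literally
    if value == "num":
        return re.fullmatch(r"[0-9]+", item) is not None
    if value.startswith("(") and value.endswith(")"):
        inner = value[1:-1]
        if inner.startswith("*"):
            return item.endswith(inner[1:])
        if inner.endswith("*"):
            return item.startswith(inner[:-1])
        return item == inner
    return PUNCTUATION.get(value, value) == item


def side_matches(side: Side, item: str, is_word: bool) -> bool:
    negated, values = side
    hit = any(value_matches(v, item, is_word) for v in values)
    return hit != negated


def find_matches(query: str, tokens: Sequence[Token]) -> List[int]:
    """Return the start index of every run of tokens that matches the query."""
    blocks = parse_query(query)
    starts = []
    for start in range(len(tokens) - len(blocks) + 1):
        window = tokens[start:start + len(blocks)]
        if all(side_matches(pos, tag, False) and side_matches(word, w, True)
               for (pos, word), (w, tag) in zip(blocks, window)):
            starts.append(start)
    return starts
```

For example, `find_matches("[PN | 他] [DEG | 的] [NN | any]", tokens)` returns the positions at which 他 is directly followed by 的 and then any noun. Under the whitespace assumption, the exclusion example would be written as `[NN | -书 杂志]` in this sketch.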

Part-of-speech tags

All data in the corpus has been processed by a part-of-speech tagger, which automatically determines the most probable tag for a given word in a given context. The Leiden Weibo Corpus uses the following tags from the Penn Chinese Treebank:

POS tag	Description	Example
AD	adverb	还
AS	aspect marker	着, 了
BA	bǎ 把 and jiāng 将 in bǎ-construction	将, 把
CC	coordinating conjunction	和
CD	cardinal number	一百
CS	subordinating conjunction	虽然
DEC	de 的 in a relative-clause	的
DEG	associative de 的	的
DER	de 得 in V-de construction and V-de-R	得
DEV	de 地 before VP	地
DT	determiner	这
ETC	for words děng 等 and děngděng 等等	等, 等等
FW	foreign words	ISO
IJ	interjection	啊
JJ	other noun-modifier	男, 共同
LB	bèi 被 in long bèi-construction (with agent)	被, 给
LC	localizer	里
M	measure word	个
MSP	other particle	所
NN	common noun	书
NR	proper noun	美国
NT	temporal noun	今天
OD	ordinal number	第一
ON	onomatopoeia	哈哈, 哗哗
P	preposition excluding bèi 被 and bǎ 把	从
PN	pronoun	他
PU	punctuation	，。？！
SB	bèi 被 in short bèi-construction	被
SP	sentence-final particle	吗
VA	predicative adjective	红
VC	shì 是	是
VE	yǒu 有 as the main verb	有
VV	other verb	走

This overview is based on the Penn Chinese Treebank documentation, specifically Xue et al. (2005: 17).

Mistakes in the LWC

The LWC relies heavily on natural-language processing tools to be able to process millions of messages within a reasonable time frame. Unfortunately, these tools are not perfect; the best ones can boast accuracy rates of approximately 95%. So you may occasionally find a mistake in the LWC's word segmentation or part-of-speech tags. Hopefully, in a future version, users will be able to correct these mistakes online. In the meantime, there is not much that can be done about them.

However, please note that these accuracy rates are similar to or better than the inter-annotator agreement rate for human annotators. That is to say, even if enough human annotators could be found to process 5.1 million messages, there would probably be a similar number of mistakes. For the LWC, the accuracy rates may be slightly lower, because the LWC contains a lot of informal language and slang. While there is not much I can do to manually fix all these mistakes, I would appreciate it if you would share your experiences.

Estimates

Sometimes the LWC may display an estimated number of hits rather than an exact number. This is not because the LWC can't calculate an exact number, but because for some queries, estimating the number of results is much faster than doing an exact count. If the LWC were running on a more powerful server, doing an exact count would not be a problem, but unfortunately my resources are limited.

PMI (Pointwise mutual information)

In the LWC's tables with geo-lexical statistics, you may see a column called PMI. PMI is short for pointwise mutual information, a statistical association measure. This column shows how strongly associated a word is with a given region. A value of 0 means there is no association, i.e. the word does not occur more or less often in this region than in the whole corpus. Negative values indicate disassociation (the word occurs less often in this region than in the corpus as a whole), while positive values indicate that the word occurs more often in this region. For more information on this measure, please see Wikipedia.
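As a rough illustration, the PMI of a word and a region can be computed from four counts by comparing the word's relative frequency in the region with its relative frequency in the whole corpus. The sketch below follows the standard definition of the measure; the function name, the parameter names and the choice of a base-2 logarithm are assumptions, and the LWC's own calculation may differ in details such as the logarithm base or smoothing.

```python
import math


def pmi(word_in_region: int, region_total: int,
        word_in_corpus: int, corpus_total: int) -> float:
    """Pointwise mutual information of a word and a region (log base 2).

    Computes log2(p(word | region) / p(word)) from raw token counts.
    """
    if min(word_in_region, region_total, word_in_corpus, corpus_total) <= 0:
        raise ValueError("all counts must be positive")
    p_word_given_region = word_in_region / region_total
    p_word = word_in_corpus / corpus_total
    return math.log2(p_word_given_region / p_word)
```

With this definition, a word that makes up the same share of a region's tokens as of the whole corpus gets a PMI of 0, a word that is twice as frequent in the region gets 1, and a word that is half as frequent gets -1.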

Translations

Translations come from Google Translate, using the Google Translate JavaScript API. They're quite likely to be wrong: Google Translate seems to be much more at ease translating relatively formal written language than informal written language. Also, internet slang is not one of its strong points. These translations will hopefully get better as machine translation software continues to improve.

Learning Chinese?

If you're studying Chinese and using the LWC to find example sentences, you may want to automatically convert the messages in the LWC into traditional characters. There are many useful browser plug-ins that can do this for you; see here and here. You may also want to look into pop-up dictionaries; again, see here or here. Happy learning!

Difficulties viewing the LWC?

Sorry to hear that. The LWC complies with modern web standards and should render fine in Google Chrome, Mozilla Firefox, Internet Explorer, Safari, Safari for iOS, and the default Android browser. If you are having trouble viewing pages from the LWC, please ensure you are using a modern browser, such as Google Chrome 18, Mozilla Firefox 10, Safari 5, or Internet Explorer 9 (or higher). You must also have JavaScript enabled. If this does not solve your problem, please get in touch. Don't forget to include some information on your system configuration, e.g. what operating system and browser you are using.
