Can't find what you are looking for? Feel free to send me an e-mail.
To gather the data for this corpus, the Sina Weibo API was used. A script fetched the 200 most recently posted messages every minute, twenty-four hours a day, seven days a week, for three weeks between January 8th, 2012 and January 30th, 2012. This script also stored the metadata provided by the Sina Weibo API along with every message, such as the user's screen name, location and gender.
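The collection loop described above can be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function `collect`, the callable `fetch_latest` (a stand-in for a real Sina Weibo API call) and the field name "id" are all hypothetical, and deduplicating on the message ID is an assumption about how overlapping polls might be handled.

```python
import time
from typing import Callable, Dict, Iterable


def collect(fetch_latest: Callable[[], Iterable[dict]],
            rounds: int,
            interval_seconds: float = 60.0,
            sleep: Callable[[float], None] = time.sleep) -> Dict[str, dict]:
    """Poll fetch_latest() `rounds` times and keep each message once, keyed by ID.

    fetch_latest is a hypothetical callable returning the most recent messages
    as dicts that include their metadata (screen name, location, gender, ...).
    """
    seen: Dict[str, dict] = {}
    for i in range(rounds):
        for message in fetch_latest():
            # Consecutive polls may return the same message twice,
            # so only the first copy of each ID is stored (assumption).
            seen.setdefault(message["id"], message)
        if i < rounds - 1:
            sleep(interval_seconds)  # wait one minute between polls
    return seen
```

At 200 messages per poll and one poll per minute, such a loop can store at most 200 × 1,440 = 288,000 messages per day.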
The period in which the corpus was built was chosen so as to contain a “normal” business week, as well as the holiday period around the Chinese New Year, which was on January 23rd, 2012. These dates were chosen to allow users of the LWC to compare messages that are broadly on the same topic, i.e. the Chinese New Year, across China, while also enabling them to see how topics discussed in a normal business week vary from region to region.
All messages were processed using tools from the Stanford Natural Language Processing Group, which were configured to use the Penn Chinese Treebank standard for Chinese word segmentation and part-of-speech tagging. The resulting indices were stored in a MySQL database, to which this site provides a web interface.
If you have any further questions on how the LWC was built, please feel free to get in touch.
If you cite messages from the LWC, you can refer to their message IDs. These are unique identifiers which can be used to look up messages via the search interface. Please attribute your data to the Leiden Weibo Corpus, including a link to the LWC home page (http://lwc.daanvanesch.nl/). LWC data is freely available, so in general, you may use data from the LWC in academic papers without getting prior written permission. However, some rights are reserved: see the legal page. I would also appreciate it if you could let me know if you're using LWC data in your research. Such information is useful when applying for grants, which are necessary to keep the LWC freely and publicly available, and I'm also simply interested to know what people use the LWC for.
Grammar patterns are expressed in building blocks that look like this: [POS | word], where "POS" stands for any number of part-of-speech tags, and "word" for any number of words. For POS tags, we use the Penn Chinese Treebank standards (see below). You can also use the wildcard "any", which matches every tag or word, and the minus symbol to exclude a value or a list of values. Keep in mind that using the "any" wildcard may slow down your query somewhat. As a rule, the more specific your query, the faster your results will be returned.
Some example queries:
- [NN | 书] = look for the noun 书
- [NN | any] = look for any noun
- [NN | -书 杂志] = look for a noun, but exclude 书 and 杂志 (if the minus symbol is the first character given, this excludes all following values)
- [-NN | 说明] = look for 说明 where it is not used as a noun
As indicated above, you can also use more than one part-of-speech tag or word per building block, e.g.:
- [NN VV | 说明] = look for the word 说明, but only where it appears as a noun or a verb
You can also combine such building blocks:
- [DEG | 的] [NN VV | 了解] = look for the word 了解, but only where it appears as a noun or a verb, and where it is preceded by 的
- [DEG | 的] [NN | any] = look for any noun preceded by 的
- [PN | 他] [DEG | 的] [NN | any] = look for any noun preceded by a subordinating 的 and the pronoun 他
- [BA | any] [NN | any] [VV | any] [AS | any] = look for a BA-construction with a noun immediately after BA, and then a verb and an aspect marker
The tag "PU" denotes punctuation marks (the fullwidth full stop, comma, question and exclamation mark). You can use "any" as the word value, or specify this further by using the following values:
- [PU | .] = the fullwidth full stop (。)
- [PU | ,] = the fullwidth comma (，)
- [PU | !] = the fullwidth exclamation mark (！)
- [PU | ?] = the fullwidth question mark (？)
The word value "num" is a wildcard for all Arabic numerals (0-∞). Also, when specifying words, you can use * at the beginning or end of a word as a wildcard, if you surround the word with parentheses: [any | (*们)] will match every word ending in 们, for example. Finally, please note that the search interface will only accept up to five building blocks per query.
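To make the building-block syntax more concrete, here is a minimal sketch of how such queries could be parsed and matched against a list of (word, tag) pairs. It is not the LWC's actual implementation: all names (`Field`, `parse_query`, `find_matches`) are hypothetical, it assumes that consecutive building blocks must match adjacent tokens, and it leaves out the ASCII shorthand for fullwidth punctuation under "PU".

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# One building block: "[" POS part "|" word part "]"
BLOCK = re.compile(r"\[([^\[\]|]*)\|([^\[\]|]*)\]")


def _match_value(value: str, item: str) -> bool:
    if value == "num":  # wildcard for Arabic numerals
        return item.isascii() and item.isdigit()
    if value.startswith("(") and value.endswith(")"):
        inner = value[1:-1]
        if inner.startswith("*"):  # (*们) matches words ending in 们
            return item.endswith(inner[1:])
        if inner.endswith("*"):    # (说*) matches words starting with 说
            return item.startswith(inner[:-1])
        return item == inner
    return item == value


@dataclass
class Field:
    values: Tuple[str, ...]  # the listed tags or words
    negated: bool            # True if the list started with "-"
    wildcard: bool           # True for the keyword "any"

    def matches(self, item: str) -> bool:
        if self.wildcard:
            return True
        hit = any(_match_value(v, item) for v in self.values)
        return not hit if self.negated else hit


def parse_field(text: str) -> Field:
    text = text.strip()
    if text == "any":
        return Field((), False, True)
    negated = text.startswith("-")  # a leading minus excludes all values
    if negated:
        text = text[1:]
    return Field(tuple(text.split()), negated, False)


def parse_query(query: str) -> List[Tuple[Field, Field]]:
    blocks = [(parse_field(p), parse_field(w)) for p, w in BLOCK.findall(query)]
    if not 1 <= len(blocks) <= 5:  # the interface accepts up to five blocks
        raise ValueError("a query needs between one and five building blocks")
    return blocks


def find_matches(query: str, tokens: List[Tuple[str, str]]) -> List[int]:
    """Return the start indices where adjacent (word, tag) tokens match."""
    blocks = parse_query(query)
    starts = []
    for i in range(len(tokens) - len(blocks) + 1):
        window = tokens[i:i + len(blocks)]
        if all(pos.matches(tag) and word.matches(w)
               for (pos, word), (w, tag) in zip(blocks, window)):
            starts.append(i)
    return starts
```

For example, with the tokens [("他", "PN"), ("的", "DEG"), ("书", "NN")], the query "[PN | 他] [DEG | 的] [NN | any]" matches at index 0, while "[NN | -书 杂志]" finds nothing because 书 is excluded.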
Mistakes in the LWC
The LWC relies heavily on natural-language processing tools to be able to process millions of messages within a reasonable time frame. Unfortunately, these tools are not perfect; the best ones can boast of accuracy rates of approximately 95%. So you may occasionally find a mistake in the LWC's word segmentation or part-of-speech tags. Hopefully, in a future version, users will be able to correct these mistakes online. In the meantime, there is not much that can be done about them.
However, please note that these accuracy rates are similar to or better than the inter-annotator agreement rate for human annotators. That is to say, even if enough human annotators could be found to process 5.1 million messages, there would probably be a similar number of mistakes. For the LWC, the accuracy rates may be slightly lower, because the LWC contains a lot of informal language and slang. While there is not much I can do to manually fix all these mistakes, I would appreciate it if you would share your experiences.
Sometimes the LWC may display an estimated number of hits rather than an exact number. This is not because the LWC can't calculate an exact number, but because for some queries, estimating the number of results is much faster than doing an exact count. If the LWC were running on a more powerful server, doing an exact count would not be a problem, but unfortunately my resources are limited.
PMI (Pointwise mutual information)
In the LWC's tables with geo-lexical statistics, you may see a column called PMI. PMI is short for pointwise mutual information, a statistical association measure. This column shows how strongly associated a word is with a given region. 0 means there is no association, i.e. the word does not occur here more or less often than in the whole corpus. Negative values indicate disassociation, while positive values indicate the word is more strongly associated with this region. For more information on this measure, please see Wikipedia.
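As a rough illustration of how such a value can be computed, the sketch below uses the standard definition PMI(word, region) = log( p(word, region) / (p(word) × p(region)) ), estimated from raw counts. The function name `pmi`, its parameters and the choice of base-2 logarithms are assumptions made for this example; the LWC's own calculation may differ in its details.

```python
import math


def pmi(word_in_region: int, region_total: int,
        word_in_corpus: int, corpus_total: int) -> float:
    """Pointwise mutual information (base 2) between a word and a region.

    Equivalent to log2(p(word | region) / p(word)), using raw token counts.
    """
    if word_in_region == 0:
        return float("-inf")  # the word never occurs in this region
    ratio = (word_in_region * corpus_total) / (region_total * word_in_corpus)
    return math.log2(ratio)
```

If a word makes up 1% of the tokens in both the region and the whole corpus, the ratio is 1 and the PMI is 0. If it is twice as frequent in the region as in the corpus, the PMI is 1; if it is half as frequent, the PMI is -1.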
If you're studying Chinese and using the LWC to find example sentences, you may want to automatically convert the messages in the LWC into traditional characters. There are many useful browser plug-ins that can do this for you; see here and here. You may also want to look into pop-up dictionaries; again, see here or here. Happy learning!
Difficulties viewing the LWC?