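Distribution characteristics of word frequency rank differences and lexical similarities and differences between texts (詞語序差的分佈特點與文本間詞匯異同)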

劉銳, 孫碧澤, 龍云飛, 王珊

Research output: Contribution to journal (Article)

Abstract

Building on earlier work on the "frequency level" and "frequency rank" of words, this paper applies lexical quantitative analysis to two corpora of different types to examine how the frequency rank difference (FRD) of words is distributed. For the set of words shared by two texts, the FRDs are symmetrically distributed and concentrated around the median, with some outlying values. On an FRD plot this appears as a "two-tailed distribution": flat in the middle section and curving at both tails. Based on this pattern, the shared words can be divided into three levels: the middle section, the lower tail, and the upper tail. Words in the middle section reflect what the two texts have in common, while words in the lower and upper tails reflect how they differ. These features are linguistically meaningful as indicators of the texts' thematic content and stylistic features.
Copyright © 2017 中國科學院軟件研究所 (Institute of Software, Chinese Academy of Sciences).
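The abstract does not spell out the computation, so the following is only a minimal Python sketch of one way the FRD of shared words could be computed and split into three levels. Everything named here is an assumption for illustration, not the authors' procedure: the function names, dense ranking of tied frequencies, the sign convention (rank in text A minus rank in text B), and the use of Tukey's 1.5 × IQR fences to separate the tails from the middle section.

```python
from collections import Counter
from statistics import quantiles


def frequency_ranks(tokens):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Assumption: dense ranking, so words with equal frequency share a rank.
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of_count = {c: i + 1 for i, c in enumerate(distinct)}
    return {word: rank_of_count[c] for word, c in counts.items()}


def rank_differences(tokens_a, tokens_b):
    """FRD of every shared word, taken here as rank in A minus rank in B."""
    ranks_a = frequency_ranks(tokens_a)
    ranks_b = frequency_ranks(tokens_b)
    shared = ranks_a.keys() & ranks_b.keys()
    return {word: ranks_a[word] - ranks_b[word] for word in shared}


def split_levels(frd, k=1.5):
    """Split shared words into lower tail, middle section, and upper tail.

    Assumption: the tails are the outliers under Tukey's k * IQR fences.
    Needs at least two shared words.
    """
    values = sorted(frd.values())
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    levels = {"lower_tail": [], "middle": [], "upper_tail": []}
    # Walk the words in ascending FRD order, as they would appear on an FRD plot.
    for word, diff in sorted(frd.items(), key=lambda item: item[1]):
        if diff < low:
            levels["lower_tail"].append(word)
        elif diff > high:
            levels["upper_tail"].append(word)
        else:
            levels["middle"].append(word)
    return levels
```

Given two tokenized texts, `split_levels(rank_differences(tokens_a, tokens_b))` returns the three word lists. Under this sign convention, words in the lower tail rank much higher in text A than in text B, and words in the upper tail the reverse.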
Original language: Chinese
Pages (from-to): 8-13
Journal: 中文信息學報 (Journal of Chinese Information Processing)
Volume: 31
Issue number: 5
Publication status: Published - Sep 2017

Citation

劉銳, 孫碧澤, 龍云飛, & 王珊 (2017). 詞語序差的分佈特點與文本間詞匯異同. 中文信息學報, 31(5), 8-13.

Keywords

  • Frequency rank difference (序差)
  • Two-tailed distribution (雙尾分佈)
  • Thematic content (主題內容)
  • Stylistic features of the texts (文體風格)
  • Alt. title: Lexical frequency rank difference distributions between texts