1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158
| ''' 词频统计 首先明确这属于动态查找,适合使用散列表。 先设计方法把单词"切"出来: 除了大小写字母、数字、下划线,包括空格的其它字符均视为分隔符。 对新读入的单词在单词表中查找,如果已经存在,将单词词频+1, 如果不存在,则插入该单词并记录词频为1。
注意: 词频记录什么时候 +1 要看 只有在已经占领位置的key和待插入元素一样时, 才会词频 +1 。 而不是只要有一个冲突就 +1 ! 有冲突且冲突位置元素和待插入元素一样,词频才+1。 (词频存放在散列表中,这样只需要加一个data属性,方便一些。) 目的: 显示词频前4% 的所有单词。
''' import re
class HashTable(): class Data(): def __init__(self,element=None,inf=None,WordNum=0): self.element = element self.WordNum = WordNum self.inf = inf def __init__(self,tablesize,thecells=[]): MinTableSize = 10 if tablesize < MinTableSize: print('散列表太小,直接用数组即可。') return None self.tablesize = tablesize self.TableSize = self.NextProme(tablesize) self.TheCells = [self.Data(None,None,0) for _ in range(self.TableSize)] def NextProme(self,tablesize): return 101 def Hash(self,word): h = 0 i = 0 while i < len(word): h = (h << 5) + ord(word[i]) i += 1 return h % self.tablesize def Insert(self,item): Pos = self.Find(item) self.TheCells[Pos].inf = '占用' self.TheCells[Pos].element = item if self.TheCells[Pos].element == item: self.TheCells[Pos].WordNum += 1 else: self.TheCells[Pos].WordNum = 1 def Find(self,key): cNum = 0 CurrentPos = self.Hash(key) NewPos = CurrentPos while self.TheCells[NewPos].inf != None and \ self.TheCells[NewPos].element != key: if cNum%2: NewPos = CurrentPos + (cNum//2 + 1)**2 while NewPos >= self.TableSize: NewPos -= self.TableSize else: NewPos = CurrentPos - (cNum//2)**2 while NewPos < 0: NewPos += self.TableSize cNum += 1 return NewPos def Delete(self,positon): self.TheCells[positon].inf = None print('已经删除了散列表%d位置的元素'%positon)
class Count_Words_Number(): def __init__(self,document): self.TableSize = 100 self.Wordcount = 0 self.Document = document self.H = HashTable(self.TableSize) list_ = self.Make_Words_list() for word in list_: if len(word) >= 3: self.Wordcount += 1 self.H.Insert(word) def Make_Words_list(self): list_ = [] with open(self.Document) as f: lines = f.readline() while lines=='\n': lines = f.readline() pat = '[a-zA-Z\_0-9]+' while lines: list_ += (re.findall(pat, lines)) lines = f.readline() return list_ def show(self): list_ = [] for data in self.H.TheCells: if data.inf: d = data.inf,data.element,data.WordNum list_.append(d) print(list_) list_ = sorted(list_,key=lambda x:x[2]) print(list_) return list_
if __name__ == '__main__': cwn = Count_Words_Number('单词表.txt') wordstalbe = cwn.Make_Words_list() print(wordstalbe) result = cwn.show() result_list = [] for i in range(4): result_list.append(result.pop()) print('\n词频前4,单词分别如下:\n',result_list)
|