python正向最大匹配分词和逆向最大匹配分词的实例

yipeiwu_com6年前Python基础

正向最大匹配

# -*- coding:utf-8 -*-
 
CODEC='utf-8'
 
def u(s, encoding):
  'converted other encoding to unicode encoding'
  if isinstance(s, unicode):
    return s
  else:
    return unicode(s, encoding)
 
def fwd_mm_seg(wordDict, maxLen, str):
  'forward max match segment'
  wordList = []
  segStr = str
  segStrLen = len(segStr)
  for word in wordDict:
    print 'word: ', word
  print "\n"
  while segStrLen > 0:
    if segStrLen > maxLen:
      wordLen = maxLen
    else:
      wordLen = segStrLen
    subStr = segStr[0:wordLen]
    print "subStr: ", subStr
    while wordLen > 1:
      if subStr in wordDict:
        print "subStr1: %r" % subStr
        break
      else:
        print "subStr2: %r" % subStr
        wordLen = wordLen - 1
        subStr = subStr[0:wordLen]
#      print "subStr3: ", subStr
    wordList.append(subStr)
    segStr = segStr[wordLen:]
    segStrLen = segStrLen - wordLen
  for wordstr in wordList:
    print "wordstr: ", wordstr
  return wordList
    
      
def main():
  fp_dict = open('words.dic')
  wordDict = {}
  for eachWord in fp_dict:
    wordDict[u(eachWord.strip(), 'utf-8')] = 1
  segStr = u'你好世界hello world'
  print segStr
  wordList = fwd_mm_seg(wordDict, 10, segStr)
  print "==".join(wordList)
  
 
if __name__ == '__main__':
  main()
  

逆向最大匹配

# -*- coding:utf-8 -*-
 
 
def u(s, encoding):
  'converted other encoding to unicode encoding'
  if isinstance(s, unicode):
    return s
  else:
    return unicode(s, encoding)
 
CODEC='utf-8'
 
def bwd_mm_seg(wordDict, maxLen, str):
  'forward max match segment'
  wordList = []
  segStr = str
  segStrLen = len(segStr)
  for word in wordDict:
    print 'word: ', word
  print "\n"
  while segStrLen > 0:
    if segStrLen > maxLen:
      wordLen = maxLen
    else:
      wordLen = segStrLen
    subStr = segStr[-wordLen:None]
    print "subStr: ", subStr
    while wordLen > 1:
      if subStr in wordDict:
        print "subStr1: %r" % subStr
        break
      else:
        print "subStr2: %r" % subStr
        wordLen = wordLen - 1
        subStr = subStr[-wordLen:None]
#      print "subStr3: ", subStr
    wordList.append(subStr)
    segStr = segStr[0: -wordLen]
    segStrLen = segStrLen - wordLen
  wordList.reverse()
  for wordstr in wordList:
    print "wordstr: ", wordstr
  return wordList
    
      
def main():
  fp_dict = open('words.dic')
  wordDict = {}
  for eachWord in fp_dict:
    wordDict[u(eachWord.strip(), 'utf-8')] = 1
  segStr = ur'你好世界hello world'
  print segStr
  wordList = bwd_mm_seg(wordDict, 10, segStr)
  print "==".join(wordList)
 
if __name__ == '__main__':
  main()
  

以上这篇python正向最大匹配分词和逆向最大匹配分词的实例就是小编分享给大家的全部内容了,希望能给大家一个参考,也希望大家多多支持【听图阁-专注于Python设计】。

相关文章

获取django框架orm query执行的sql语句实现方法分析

获取django框架orm query执行的sql语句实现方法分析

本文实例讲述了获取django框架orm query执行的sql语句实现方法。分享给大家供大家参考,具体如下: 利用Django orM 可以很方便的写出很多查询,但有时候,我们需要检查...

Python HTTP客户端自定义Cookie实现实例

Python HTTP客户端自定义Cookie实现实例 几乎所有脚本语言都提供了方便的 HTTP 客户端处理的功能,Python 也不例外,使用 urllib 和 urllib2 可以很...

在python中的socket模块使用代理实例

说socket代理之前,先来说说http代理,python的urllib2是自带http代理功能的,可以用如下代码实现:复制代码 代码如下:proxy_handler = urllib2...

python进阶教程之模块(module)介绍

我们之前看到了函数和对象。从本质上来说,它们都是为了更好的组织已经有的程序,以方便重复利用。 模块(module)也是为了同样的目的。在Python中,一个.py文件就构成一个模块。通过...

pycharm修改文件的默认打开方式的步骤

pycharm修改文件的默认打开方式的步骤

有时我们用pycharm打开某个文件的时候,默认的打开方式是不正确的,那么如何设置呢?下面小编给大家分享一下。 首先我们点击File菜单,然后选择Setting,如下图所示 接着找到E...