Efficient Search for Plagiarism on the Web

J. Malcolm, P.C.R. Lane

Research output: Contribution to journal › Article › peer-review

Abstract

Understanding the characteristics of written English allows Internet search for the source of a document to be carried out efficiently. There is a Zipfian distribution of word frequencies in natural language, with some words common and many words rare. If we take a group of three words, the rarity of most of these triples is extreme. This can be exploited to detect web pages similar to a given target document: while a Google search for some triples from the target may return many hits, other triples will only be found in a few documents on the Internet. These documents may well be similar to the target, and are certainly worth examining more closely. Initial experiments show that this approach is very promising, and it is being implemented in a software tool called WebFerret.
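
The triple-based idea can be illustrated with a minimal sketch. It is not the WebFerret implementation: it assumes a simple regular-expression tokenizer, and it uses a small local list of documents (the hypothetical `corpus` argument) as a stand-in for search-engine hit counts. All function names are illustrative.

```python
import re
from collections import Counter


def word_triples(text):
    """Return the set of consecutive lowercase word triples in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def document_frequency(corpus):
    """Count, for each triple, how many corpus documents contain it."""
    counts = Counter()
    for doc in corpus:
        counts.update(word_triples(doc))  # a set, so each doc counts once
    return counts


def rarest_triples(target, corpus, max_hits=1):
    """Triples of the target found in at least one, but at most
    max_hits, corpus documents (assumed proxy for 'few web hits')."""
    df = document_frequency(corpus)
    return sorted(t for t in word_triples(target) if 0 < df[t] <= max_hits)


def candidate_sources(target, corpus, max_hits=1):
    """Indices of corpus documents sharing a rare triple with the target."""
    rare = set(rarest_triples(target, corpus, max_hits))
    return [i for i, doc in enumerate(corpus) if rare & word_triples(doc)]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat sat on the sofa quietly",
        "dogs bark loudly at night",
    ]
    target = "yesterday the cat sat on the sofa quietly"
    print(rarest_triples(target, corpus))
    print(candidate_sources(target, corpus))
```

In this toy corpus the triple "the cat sat" occurs in two documents and so is skipped as too common, while the triples that occur in only one document point to that document as the one worth examining more closely.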

Original language: English
Pages (from-to): 206-211
Journal: Proceedings (International Conference on Technology, Communication and Education)
Volume: 2008
Publication status: Published - 2008

Keywords

  • plagiarism
  • search engines
  • ferret
  • natural language processing
