Discovering common phrases in multiple blocks of text

Closed Posted Jul 12, 2006 Paid on delivery

This project will build a tool that efficiently finds common phrases in a large volume of discrete text blocks. The text blocks will be read from a database, and the set of available blocks will grow continually. The tool must be able to use its knowledge of the existing set of blocks to process incoming blocks efficiently and find phrases that occur multiple times anywhere in the entire set of text blocks. Most text blocks will be around 1000-6000 words in length, although some may be significantly shorter or longer.
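To make the task concrete, here is a minimal Python sketch of incremental phrase discovery. It is an illustration under assumptions, not the required design: the class name `PhraseIndex`, the tokenizer, and the default phrase length of 3 words are all hypothetical, and the index lives in an in-memory dictionary, which a real solution at this scale would have to replace with a disk-backed or partitioned store.

```python
import re
from collections import defaultdict

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


class PhraseIndex:
    """Hypothetical in-memory index of fixed-length word phrases."""

    def __init__(self, phrase_len=3):
        if phrase_len < 1:
            raise ValueError("phrase_len must be at least 1")
        self.phrase_len = phrase_len
        # phrase -> list of (block_id, word_offset) occurrences
        self.occurrences = defaultdict(list)

    def add_block(self, block_id, text):
        """Index one new block without rescanning earlier blocks."""
        words = WORD_RE.findall(text.lower())
        n = self.phrase_len
        # Slide a window of n words across the block.
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            self.occurrences[phrase].append((block_id, i))

    def common_phrases(self, min_count=2):
        """Return phrases seen at least min_count times across all blocks."""
        return {
            phrase: places
            for phrase, places in self.occurrences.items()
            if len(places) >= min_count
        }


index = PhraseIndex(phrase_len=3)
index.add_block(1, "The quick brown fox jumps over the lazy dog")
index.add_block(2, "A quick brown fox appeared")
for phrase, places in index.common_phrases().items():
    print(phrase, places)
```

Each incoming block is scanned once, so the work per block is proportional to its word count for a fixed phrase length, and earlier blocks are never reread.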

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):

a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.

b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).

Requirements:

* Use an efficient “Order N” algorithm; that is, its processing requirements must scale linearly (or sub-linearly) with the quantity of text being processed

* Be able to run inside of 4GB of RAM regardless of the volume of text being processed

* Capable of operating in a 64-bit Linux environment

* Be capable of using multiple processors concurrently by using multiple threads or processes on each system

* Be capable of running concurrently on multiple systems, preferably using a MySQL database as a control and locking mechanism

* Written in the developer’s choice of Perl, Java, Python, or C++

* Able to scale to millions of text blocks in the dataset

* Ability to stop the application and restart at the same point

* A variable setting for the number of words that define a "phrase"


## Platform

Linux - 64-bit

Skills: Engineering, Linux, MySQL, PHP, Software Architecture, Software Testing

Project ID: #3641373

About the project

10 proposals Remote project Active Aug 22, 2006

10 freelancers are bidding on average $1522 for this job

* SovDyn: $850 USD in 30 days (83 Reviews, 8.2). See private message.
* etags: $552.5 USD in 30 days (45 Reviews, 5.9). See private message.
* vw1852498vw: $5610 USD in 30 days (4 Reviews, 5.3). See private message.
* javaj2eeoracle: $850 USD in 30 days (8 Reviews, 3.4). See private message.
* infostarvw: $552.5 USD in 30 days (3 Reviews, 3.3). See private message.
* bytefoundryvw: $1062.5 USD in 30 days (9 Reviews, 2.9). See private message.
* vw2141512vw: $2975 USD in 30 days (3 Reviews, 0.6). See private message.
* meetinfotech: $552.5 USD in 30 days (0 Reviews, 0.0). See private message.
* alex5555vw: $1190 USD in 30 days (0 Reviews, 0.0). See private message.
* davidrn: $1020 USD in 30 days (0 Reviews, 0.0). See private message.