Assignment: Inverted Index

Introduction
Today, top search engines like Google and Yahoo use a data structure called Inverted Index for their matching of queries to the documents and give users the relevant documents according to their rank. Inverted Index is basically a mapping from a word to its position of occurrence in the document. Since a word may appear more than once in the document, storing all the positions and the frequency of a word in the document gives an idea of the relevance of this document for a particular word.
If such an inverted index is a build-up for each document in the collection, then when a query is fired, a search can be done for the query in these indexes, and ranking is obtained according to the frequency. Mathematically, an inverted index for a document D and strings s1 , s2 , … , sn is of the form s1 – > a1 , a1 , … 1 2 s2 – > a2 , a2 , … 1 2 . . . sn – > an , an , … 2 1 where ak denotes the lth position of k th word in the document D. l To build up this kind of data structure efficiently, Tries are used. Tries are a good data structure for strings as searching becomes very simple here with every leaf node describing one word. To build up an inverted index given a set of documents using trie, following steps are followed

Save Time On Research and Writing
Hire a Pro to Write You a 100% Plagiarism-Free Paper.
Get My Paper

Traverse one document and insert words into a trie. As a leaf node is reached, assign it a number (in increasing order) representing its location in the index (starting from 0). Add the position of this word into the index. Now for a word that occurs more than once in the document, when an attempt for the second insertion into the trie is made, a leaf node already containing that word would be found and its value would tell the location in the index. So simply go to this index and add another position for this word.
Do this till the end of the document is reached. Now, you have a trie and an inverted index for the first document.
Repeat this procedure for the rest of the documents. 1 Now follow the below steps to search for a word from the inverted indexes and tries of all the documents
For every document, first, search for the word in the corresponding trie and get its location in the inverted index of that document.
Then traverse through all the positions and see which document has the most frequency and arrange the documents accordingly (in decreasing order). Also, in every document, there are special words called “anchor texts” which have more importance than a normal text word. For example – a download link. So for the same word, its occurrence as an anchor text increases the relevance of that document over its normal occurrence.

Problem Statement
For this assignment, you need to create an inverted index for a collection C of documents from 1 to n. Every document will be a plain text file with the first-line storing its id from 1 to n and the next few lines containing space or newline-separated words. The index should be an array of lists with the size of the array equal to the total number of distinct words in the array and the list for each word contains the locations of the word in the document. The trie used for this construction can be represented in any form (array/linked list/trees etc. ).
So you would have n such tries and inverted indexes. Then you should ask the user for the queries (single-word) and give the order of documents in decreasing order of relevance. For our case, the anchor texts are represented by following the word with a ?. So if you have something like – “Rats fear cats and cats* fear dogs. ” then here 1st cat is a normal word whereas 2nd cat is an anchor text. So now your array size will be 2 – total number of distinct words in the document as you would store positions of normal text and anchor text separately for a given word.
And now relevance should first be decided by the frequency of anchor texts and within them, the collision should be resolved by frequency of normal text. D1 D2 D3 1 it is what it is 2 what is it 3 it is a banana Below are the corresponding tries and inverted indexes for the 3 documents (figure 1). 2 Figure 1: Trie and Inverted Index for Documents 1, 2 and 3 Now if the query is “it” – then search in 1st index gives – 0, 3(f req = 2), 2nd index gives 2(f req = 1) and 3rd one gives 0(f req = 1).
So, our output is – 1, 2, 3or1, 3, 2 (as document 2 and 3 have equal relevance).
Note:

The names of the data files should be taken from the command line. After 3 building the inverted index, you should ask for a query again from the command prompt and also give an option of quitting any time the user wants.
The inverted indexes should be written to files named as “1… n. txt” with each line corresponding to one word in the document.
You can ignore case-sensitive words i. e. , Cat and cat are the same.
Also, ignore symbols in the text (if any) like.

Save Time On Research and Writing
Hire a Pro to Write You a 100% Plagiarism-Free Paper.
Get My Paper
Place your order
(550 words)

Approximate price: $22

Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
$26
The price is based on these factors:
Academic level
Number of pages
Urgency
Basic features
  • Free title page and bibliography
  • Unlimited revisions
  • Plagiarism-free guarantee
  • Money-back guarantee
  • 24/7 support
On-demand options
  • Writer’s samples
  • Part-by-part delivery
  • Overnight delivery
  • Copies of used sources
  • Expert Proofreading
Paper format
  • 275 words per page
  • 12 pt Arial/Times New Roman
  • Double line spacing
  • Any citation style (APA, MLA, Chicago/Turabian, Harvard)

Our Guarantees

Delivering a high-quality product at a reasonable price is not enough anymore.
That’s why we have developed 5 beneficial guarantees that will make your experience with our service enjoyable, easy, and safe.

Money-back guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

Read more

Zero-plagiarism guarantee

Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.

Read more

Free-revision policy

Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.

Read more

Privacy policy

Your email is safe, as we store it according to international data protection rules. Your bank details are secure, as we use only reliable payment systems.

Read more

Fair-cooperation guarantee

By sending us your money, you buy the service we provide. Check out our terms and conditions if you prefer business talks to be laid out in official language.

Read more

Online Class Help Services Available from $100 to $150 Weekly We Handle Everything