Information Retrieval

INT309 – Imformation Retrieval: Intercestion and Merge Algorithms

Intersection and merge algorithms are two basic operations of information retrieval. For example, when we search Windows install on the search engine, the expected results should be relevant to both Windows and installation; if we search underground or subway, the results should be merged from the documents relevant to one of the underground and subway. We assume that a data structure, posting list, is already constructed which stores all the IDs of the documents revelent to a keyword in order. This lab requires us to find the intersection or union of two posting lists, to allow users to query more flexibly.

Intersection – Standard Algorithm

In this section we complete the intersection in a complexity of \(O(m+n)\), where \(m\) and \(n\) are numbers of documents in two posting lists. The pseudo code and Java template code were given for this part. My completed method is shown below.

while (p1.hasNext() && p2.hasNext()) {
    Posting posting1 = p1.next();
    Posting posting2 = p2.next();
    if (posting1.docID == posting2.docID) {
        answer.add(posting1.docID);
    }
    else if (posting1.docID > posting2.docID) {
        p1.previous();
    }
    else {
        p2.previous();
    }
}

In this code, p1 and p2 are in ListIterator<Posting> type, since the abstrat class Iterator has neither a method peek to access the current element without point to the next one, nor method previous to point to the previous element. The corresponding variables’ type in the templete code are also modified.

Merge Algorithm

The merge operation is a little more complex than intersection. We need to merge two posting lists into one in order with preventing a unique document shows twice in the result. We have to consider that when one pointer reaches the end of its posting list, there may be some documents remaining in another posting list. Hence, another while loop is added in this method.

while (p1.hasNext() && p2.hasNext()) {
    Posting posting1 = p1.next();
    Posting posting2 = p2.next();
    if (posting1.docID == posting2.docID) {
        answer.add(posting1.docID);
    }
    else if (posting1.docID > posting2.docID) {
        p1.previous();
        answer.add(posting2.docID);
    }
    else {
        p2.previous();
        answer.add(posting1.docID);
    }
}

while (p1.hasNext() || p2.hasNext()) {
    if (p1.hasNext()) {
        answer.add(p1.next().docID);
    } else {
        answer.add(p2.next().docID);
    }
}

Intersection – Faster Algorithm via Skip Pointers

In some cases, the sizes of two posting lists may be vary widely. For example, there are \(m = 10^8\) documents contain the word manual, but only \(n = 10 ^ 4\) documents contain Leica. Then we do not have to iterate all records in the posting list of manual, instead, we can skip some documents in this posting list since it is ordered.

To skip the documents, firstly two members, the skip pointer and document ID via skip pointer are added in Posting class, as the code shown below.

static class Posting {
    int docID;
    List<Integer> positions;
    int skipDocId;                  // added
    ListIterator<Posting> skipIter; // added
    public Posting(int docID, List<Integer> positions) {
        this.docID = docID;
        this.positions = positions;
    }
    public String toString() {
        return docID + ":" + positions;
    }
}

Then before we do faster intersection, we need to initialize the skip pointers. If the current position of pointer added with a specified stride is a valid position, then we let the skip pointer point to the strided position, else point to null instead.

static ListIterator<Posting> initSkippingIter(ListIterator<Posting> iter, int stride) {
    List<Posting> postingList = new ArrayList<Posting>();
    while (iter.hasNext()) {
        postingList.add(iter.next());
    }
    iter = postingList.listIterator();
    int current = 0;

    while (iter.hasNext()) {
        Posting currentPosting = iter.next();
        if (current + stride < postingList.size()) {
            currentPosting.skipDocId = postingList.get(current + stride).docID;
            currentPosting.skipIter = postingList.listIterator(current + stride);
        } else {
            currentPosting.skipIter = null;
        }
        ++current;
    }
    return postingList.listIterator();
}

In the intersection method, some comparisons are added. If a skipped document ID is still less or equal than another document ID, then this skip is safe.

p1 = initSkippingIter(p1, stride);
p2 = initSkippingIter(p2, stride);

while (p1.hasNext() && p2.hasNext()) {
    Posting posting1 = p1.next();
    Posting posting2 = p2.next();
    if (posting1.docID == posting2.docID) {
        answer.add(posting1.docID);
    }
    else if (posting1.docID > posting2.docID) {
        p1.previous();
        if (posting2.skipIter != null && posting2.skipDocId <= posting1.docID) {
            p2 = posting2.skipIter;
        }
    }
    else {
        p2.previous();
        if (posting1.skipIter != null && posting1.skipDocId <= posting2.docID) {
            p1 = posting1.skipIter;
        }
    }
}

Result

In the experiment, the stride is set to 3, and the last group of test case in Or.java is used for testing faster intersection algorithm. The output is shown below. There are 71 successful skip and the answer is correct.

Intersection of 3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;999
            and 324;335;418;466;505;686: 
Successful skip: 71
Answer:         [335, 418, 466, 686]