Intersection and merge algorithms are two basic operations of information retrieval. For example, when we search Windows install on the search engine, the expected results should be relevant to both Windows and installation; if we search underground or subway, the results should be merged from the documents relevant to one of the underground and subway. We assume that a data structure, posting list, is already constructed which stores all the IDs of the documents revelent to a keyword in order. This lab requires us to find the intersection or union of two posting lists, to allow users to query more flexibly.
Intersection – Standard Algorithm
In this section we complete the intersection in a complexity of \(O(m+n)\), where \(m\) and \(n\) are numbers of documents in two posting lists. The pseudo code and Java template code were given for this part. My completed method is shown below.
while (p1.hasNext() && p2.hasNext()) {
Posting posting1 = p1.next();
Posting posting2 = p2.next();
if (posting1.docID == posting2.docID) {
answer.add(posting1.docID);
}
else if (posting1.docID > posting2.docID) {
p1.previous();
}
else {
p2.previous();
}
}
In this code, p1
and p2
are in ListIterator<Posting>
type, since the abstrat class Iterator
has neither a method peek
to access the current element without point to the next one, nor method previous
to point to the previous element. The corresponding variables’ type in the templete code are also modified.
Merge Algorithm
The merge operation is a little more complex than intersection. We need to merge two posting lists into one in order with preventing a unique document shows twice in the result. We have to consider that when one pointer reaches the end of its posting list, there may be some documents remaining in another posting list. Hence, another while
loop is added in this method.
while (p1.hasNext() && p2.hasNext()) {
Posting posting1 = p1.next();
Posting posting2 = p2.next();
if (posting1.docID == posting2.docID) {
answer.add(posting1.docID);
}
else if (posting1.docID > posting2.docID) {
p1.previous();
answer.add(posting2.docID);
}
else {
p2.previous();
answer.add(posting1.docID);
}
}
while (p1.hasNext() || p2.hasNext()) {
if (p1.hasNext()) {
answer.add(p1.next().docID);
} else {
answer.add(p2.next().docID);
}
}
Intersection – Faster Algorithm via Skip Pointers
In some cases, the sizes of two posting lists may be vary widely. For example, there are \(m = 10^8\) documents contain the word manual, but only \(n = 10 ^ 4\) documents contain Leica. Then we do not have to iterate all records in the posting list of manual, instead, we can skip some documents in this posting list since it is ordered.
To skip the documents, firstly two members, the skip pointer and document ID via skip pointer are added in Posting
class, as the code shown below.
static class Posting {
int docID;
List<Integer> positions;
int skipDocId; // added
ListIterator<Posting> skipIter; // added
public Posting(int docID, List<Integer> positions) {
this.docID = docID;
this.positions = positions;
}
public String toString() {
return docID + ":" + positions;
}
}
Then before we do faster intersection, we need to initialize the skip pointers. If the current position of pointer added with a specified stride is a valid position, then we let the skip pointer point to the strided position, else point to null
instead.
static ListIterator<Posting> initSkippingIter(ListIterator<Posting> iter, int stride) {
List<Posting> postingList = new ArrayList<Posting>();
while (iter.hasNext()) {
postingList.add(iter.next());
}
iter = postingList.listIterator();
int current = 0;
while (iter.hasNext()) {
Posting currentPosting = iter.next();
if (current + stride < postingList.size()) {
currentPosting.skipDocId = postingList.get(current + stride).docID;
currentPosting.skipIter = postingList.listIterator(current + stride);
} else {
currentPosting.skipIter = null;
}
++current;
}
return postingList.listIterator();
}
In the intersection method, some comparisons are added. If a skipped document ID is still less or equal than another document ID, then this skip is safe.
p1 = initSkippingIter(p1, stride);
p2 = initSkippingIter(p2, stride);
while (p1.hasNext() && p2.hasNext()) {
Posting posting1 = p1.next();
Posting posting2 = p2.next();
if (posting1.docID == posting2.docID) {
answer.add(posting1.docID);
}
else if (posting1.docID > posting2.docID) {
p1.previous();
if (posting2.skipIter != null && posting2.skipDocId <= posting1.docID) {
p2 = posting2.skipIter;
}
}
else {
p2.previous();
if (posting1.skipIter != null && posting1.skipDocId <= posting2.docID) {
p1 = posting1.skipIter;
}
}
}
Result
In the experiment, the stride is set to 3, and the last group of test case in Or.java
is used for testing faster intersection algorithm. The output is shown below. There are 71 successful skip and the answer is correct.
Intersection of 3;4;9;16;19;24;25;27;28;30;31;32;33;35;36;43;46;47;52;55;57;60;61;62;64;65;66;77;78;80;83;86;91;98;99;100;101;102;103;104;106;108;112;113;116;117;119;120;127;141;147;151;156;158;168;170;172;175;179;182;184;185;187;195;197;199;202;206;207;208;209;210;213;221;225;227;228;233;238;249;252;255;256;266;267;268;270;271;273;274;281;284;285;289;290;292;294;299;301;302;303;306;308;312;320;321;322;325;326;328;329;332;334;335;337;341;342;344;345;347;349;356;357;358;360;364;376;377;379;382;383;385;395;397;403;404;405;406;410;412;417;418;423;430;431;432;433;434;437;440;441;445;446;452;453;454;461;464;466;469;477;480;486;487;488;495;496;506;507;511;512;517;518;520;522;524;526;532;535;540;543;549;550;558;562;563;564;571;574;581;586;587;592;597;598;604;607;608;615;620;621;622;625;633;634;635;636;639;640;642;653;654;656;658;660;668;671;676;680;681;683;686;687;689;694;697;702;703;708;710;711;714;722;723;729;730;737;739;742;746;747;750;756;757;758;759;764;765;766;769;770;772;777;780;782;783;784;791;795;798;801;807;812;815;816;822;823;824;825;828;830;833;835;836;837;841;852;854;863;864;865;868;870;873;880;882;884;887;888;889;897;902;906;912;914;918;922;924;925;928;929;932;933;934;938;939;941;943;944;947;948;952;955;961;962;963;968;971;973;975;979;980;983;984;987;989;993;995;996;999
and 324;335;418;466;505;686:
Successful skip: 71
Answer: [335, 418, 466, 686]