Improving Google Scholar

I love Google. I love Google Scholar even more. I also use DBLP and Citeseer to search for computer science articles, but neither of these sites indexes as many articles as Google Scholar. Furthermore, Scholar indexes articles in disciplines other than computer science. This is helpful for those bioinformatics articles that I can't always find on the aforementioned sites.

With that said, I noticed a couple of things that I would change in Google Scholar. The additions I would make include:

Finally, I would work on improving the quality of the search results. I noticed that Google indexes a lot of articles that are cited incorrectly. The quality of these citations can be improved fairly simply. I was able to improve these results by using a black-box entity resolution approach outlined by Hector Garcia-Molina. The algorithm is generic but requires two specialized functions: Match and Merge.
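To make the "generic algorithm plus two plug-in functions" idea concrete, here is a minimal Python sketch of one simple way to drive a Match and a Merge function. It is an assumption about the general shape of such a loop, not necessarily the exact procedure from the cited work; the name resolve and the toy set-based records in the usage note are hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

R = TypeVar("R")


def resolve(
    records: Iterable[R],
    match: Callable[[R, R], bool],
    merge: Callable[[R, R], R],
) -> List[R]:
    """Repeatedly merge matching records until no two results match.

    The loop knows nothing about citations: all domain knowledge lives
    in the caller-supplied match and merge functions.
    """
    pending = list(records)
    resolved: List[R] = []
    while pending:
        current = pending.pop()
        # Look for an already-resolved record that matches this one.
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record may now match other records, so it is
            # put back in the queue and compared again.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved
```

As a toy usage example, records can be sets of tokens, with "match" meaning the sets overlap and "merge" meaning their union: resolve(sets, lambda a, b: bool(a & b), lambda a, b: a | b).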

Match Function

The match function that I implemented first performs matching on the title, then on the author list, publication date, and publisher. The domain for the match distance is R+, where a match distance of 0 indicates an exact match. For each dimension on which two articles match, the match distance is decreased, and for each dimension on which two articles do not match, the match distance is increased. After all of the comparisons are made, if the match distance is below a certain threshold, the two articles are said to match.
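The distance-based match described above could look roughly like the following Python sketch. The Citation record, the per-field weights, the starting baseline, and the threshold value are all assumptions made for illustration; the original implementation may differ in every one of these details.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Citation:
    # Hypothetical record layout, for illustration only.
    title: str
    authors: Tuple[str, ...] = ()
    year: Optional[int] = None
    publisher: Optional[str] = None


# Assumed weights and threshold; these would need tuning on real data.
WEIGHTS = {"title": 3.0, "authors": 2.0, "year": 1.0, "publisher": 1.0}
THRESHOLD = 3.0


def normalize(text: Optional[str]) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (text or "").lower())


def match_distance(a: Citation, b: Citation) -> float:
    """Return a distance in R+; 0 means every dimension agrees."""
    agrees = {
        "title": normalize(a.title) == normalize(b.title),
        "authors": tuple(map(normalize, a.authors)) == tuple(map(normalize, b.authors)),
        "year": a.year == b.year,  # two missing years count as agreeing
        "publisher": normalize(a.publisher) == normalize(b.publisher),
    }
    # Start from the sum of the weights so the result never goes negative:
    # each agreeing dimension lowers the distance, each mismatch raises it.
    distance = sum(WEIGHTS.values())
    for field, same in agrees.items():
        distance += -WEIGHTS[field] if same else WEIGHTS[field]
    return distance


def is_match(a: Citation, b: Citation, threshold: float = THRESHOLD) -> bool:
    return match_distance(a, b) < threshold
```

With these assumed weights, two records that agree on title, authors, and year but spell the publisher differently still fall under the threshold, while two records with different titles do not.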

Merge Function

The merge function is pretty simple. When merging two citation records, data is taken, whenever possible, from the article that is cited more. If that record is missing some information, the merge function attempts to fill it in from the article that is cited less.
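A minimal Python sketch of that merge rule follows. The CitationRecord layout and the cited_by field are hypothetical, and summing the two citation counts for the merged record is my own assumption; the text above does not say what happens to the counts.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class CitationRecord:
    # Hypothetical record layout, for illustration only.
    title: Optional[str] = None
    authors: Tuple[str, ...] = ()
    year: Optional[int] = None
    publisher: Optional[str] = None
    cited_by: int = 0


def is_missing(value: object) -> bool:
    """A field counts as missing if it is None, an empty string, or empty tuple."""
    return value is None or value == "" or value == ()


def merge(a: CitationRecord, b: CitationRecord) -> CitationRecord:
    """Prefer the more-cited record; fill its gaps from the less-cited one."""
    primary, secondary = (a, b) if a.cited_by >= b.cited_by else (b, a)
    merged = {}
    for field in fields(CitationRecord):
        if field.name == "cited_by":
            continue
        value = getattr(primary, field.name)
        if is_missing(value):
            value = getattr(secondary, field.name)
        merged[field.name] = value
    # Assumption: the merged record inherits both citation counts.
    return CitationRecord(cited_by=a.cited_by + b.cited_by, **merged)
```

For example, if the more-cited record has a title and authors but no year, and the less-cited record has a year and publisher, the merged record keeps the first record's title and authors and picks up the year and publisher from the second.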

Results

For demonstration purposes I used a simpler match function that matches on article title alone, and then I clustered the matching titles. This illustrates how similar-looking citations can be grouped together. With a merge function and a more complex match function, one can imagine how better citations could be constructed. I created a demo to show off the improvements listed above. Unfortunately, the demo cannot be shown on this server because the server does not support database views. (Sorry.)
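Title-only clustering of this kind can be sketched in a few lines of Python. The similarity measure (difflib.SequenceMatcher from the standard library), the 0.85 threshold, and the greedy compare-against-the-first-member strategy are assumptions chosen for illustration; the demo itself may have matched titles in a different way.

```python
import re
from difflib import SequenceMatcher
from typing import List

SIMILARITY_THRESHOLD = 0.85  # assumed cut-off, would need tuning


def normalize(title: str) -> str:
    """Lowercase and drop spaces and punctuation, so that
    'A theoryof inferred causation' equals 'A theory of inferred causation'."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def titles_match(a: str, b: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def cluster_titles(titles: List[str]) -> List[List[str]]:
    """Greedy clustering: each title joins the first cluster whose
    first member it matches, otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for title in titles:
        for cluster in clusters:
            if titles_match(title, cluster[0]):
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters
```

If the input list is already ordered by search rank, each cluster comes out in rank order too, with the highest-ranked title acting as the cluster's representative.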

Below is a list of clustered results for a given author. Notice that each cluster is sorted by the citation number on the left side. This citation number gives a pretty good indication of the relative quality of a given citation to others in a given cluster. This means that titles closer to the top of the cluster are generally "better" than ones near the bottom.

Cit. #  Title                                                                                              
------  ----------------------------------------------------------------------------------------------------

  1:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference"
104:   "Probabilistic Reasoning in Expert Systems: Networks of Plausible Inference"
139:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988"
155:   "Probabilistic Inference in Intelligent Systems: Networks of Plausible Inference"
216:   "ProbabilisticReasoninginIntelligentSystems: Networks of Plausible Inference. Morgan  ..."
224:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA:  ..."
242:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA:  ..."
349:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference Morgan Kauffman,  ..."
350:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible"
358:   "88a, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference"
419:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann,  ..."
424:   "Probabilistic Reasoning in Intelligent Systems: Networks of"
438:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference San Mateo"
449:   "Probabilistic Reasoning in Intelligent Systems< SBT> Networks of Plausible Inference"
455:   "Possibilistic Reasoning in Intelligent Systems: Networks of Plausible Inference"
550:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann,  ..."
564:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference Morgan Kauffman,  ..."
600:   "Probabilistic Reasoning in Intelligent Systems: Networks of Probable Inference"
644:   "Problematic reasoning in intelligent systems: Network of plausible inference"
653:   "Probabilistic reasoning in intelligent systems: networks of plausible inference, Rev. 2nd printing"
873:   "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.  1990  Morgan  ..."

  2:   "Probablistic Reasoning in Intelligence Systems"
189:   "Probabilistic Reasoning in Intelligent Systems. 1988"
195:   "Probabilistic Reasoning in Intelligent Systems. 1988"
276:   "Probabilistic Reasoning in Expert Systems"
363:   "Probabilistic Reasoning in Intelligent"
437:   "Probabilistic reasoning in intelligent sytems"
442:   "Probabilist Reasoning in Intelligent Systems"
589:   "Probability reasoning in intelligence system"
632:   "Probabilistic Reasoning in Uncertain Systems"
642:   "ProbabilisticReasoninginIntelligenceSystemsÕ"

  3:   "Heuristics: Intelligent Search Strategies for Computer Problem Solving"
204:   "Heuristics: Intelligent Strategies for Computer Problem Solving"
293:   "Heuristics: Intelligent Search Strategies for Computer Problem Solving,(1984)"
335:   "Heuristics: Intelligent Search Strategies for Computer Problem Solving,''AddisonWesley"
473:   "Heuristics: Intelligent Search Strategy for Computer Solving"
479:   "Heuristics: Intelligent Search Strategy for Computer Solving"
587:   "Heuristics: Intelligent Search Strategies for Computer Problem Solving, Chapter 1"

  4:   "Causality: models, reasoning, and inference"
248:   "Causation: Models, Reasoning and Inference"
332:   "Causality: Models, Reasoning, and Inference. 2000"
421:   "Models, reasoning, and inference"

  5:   "Temporal constraint networks"
348:   "Temporal constraint networks"

  6:   "Fusion, propagation, and structuring in belief networks"
124:   "Fusion propagation and structuring in Bayesian networks"
287:   "Fusion, propagationandstructuringinbelief networks.""
290:   "Fusion, propagationandstructuringinbelief networks.""
655:   "Fusion, propagation, and structuring in belief networks, volume 29"
715:   "Fusion, Propagation and Structure in Bayesian Networks"

  7:   "Network-Based Heuristics for Constraint-Satisfaction Problems"
597:   "Network-based heunstjcs for constraint-satlsfdction problems"
666:   "Network-based heuristicsforconstraint-satisfactionproblems"
672:   "Network-based heuristicsforconstraint-satisfactionproblems"
775:   "NetworkÐBased Heuristics for ConstraintÐSatisfaction Problems"

  8:   "Enhancement Schemes for Constraint Processing: Backjumping, Learning, and Cutset Decomposition"

  9:   "Causality"

 10:   "CURRENT DISTRIBUTION IN SUPERCONDUCTING FILMS CARRYING QUANTIZED FLUXOIDS"

 11:   "Tree clustering for constraint networks (research note)"

 12:   "A theory of inferred causation"
409:   "A theoryof inferred causation"
426:   "A theory of inferred causality"
635:   "A theory of referred causation"
794:   "A theory of mferred causation"

 13:   "Combining qualitative and quantitative constraints in temporal reasoning"

 14:   "Rejoinder to Discussions of Causal Diagrams for Empirical Research"

 15:   "On the logic of iterated belief revision"
647:   "On the logic of iterated theory revision"

 16:   "Equivalence and synthesis of causal models"
174:   "Equivalence and synthesis of causal models"

 17:   "Detection of crystalline hematite mineralization on Mars by the Thermal Emission Spectrometer-  ..."

 18:   "Evidential reasoning using stochastic simulation of causal models"

 19:   "Readings in uncertain reasoning"
545:   "Readings in Uncertain Reasoning"
584:   "editors. Readings in Uncertain Reasoning"

 20:   "Causal diagrams for epidemiologic research"

 21:   "Identifying independence in Bayesian networks"

 22:   "Conditional entailment: bridging two approaches to default reasoning"
664:   "Conditional entailment: bridging to approaches to default reasoning"
670:   "Conditional entailment: bridging to approaches to default reasoning"

 23:   "Qualitative probabilities for default reasoning, belief revision, and causal modeling"
626:   "Qualitative probabilities for default reasoning, belief revision and causal modeling. UCLA Computer  ..."

 24:   "Generalized best-first search strategies and the optimality of A*"

 25:   "A Computational Model for Causal and Diagnostic Reasoning in Inference Systems"
103:   "A computational model for causal and diagnostic reasoning in inference engines"
389:   "A computational model for causal and diagnostic reasoning in inference systems"
508:   "A computational modelforcausalanddiagnosticreasoningininference engines"

 26:   "Structure identification in relational data"

 27:   "A maximum entropy approach to nonmonotonic reasoning"

 28:   "Rank-based Systems: A Simple Approach to Belief Revision, Belief Update, and Reasoning about  ..."
643:   "Rank-based systems: A simple approach to belief revision, belief  ..."

 29:   "Confounding and Collapsibility in Causal Inference"

 30:   "System Z: A natural ordering of defaults with tractable applications to default reasoning"
 42:   "System Z: a natural ordering of defaults with tractable applications to nonmonotonic reasoning"
411:   "System Z: A natural ordering of defaults withtractable applications to nonmonotonic reasoning"
445:   "System Z: A natural ordering of defaults with tractable applications to nonmonotonic reasoning: A  ..."
590:   "System Z: A natural ordering of defaults woth tractable applications to default reasoning"
620:   "System Z: Anatural ordering ofdefaults with tractableapplications todefault reasoning"
649:   "SystemZ: anaturalorderingof defaultswith tractableapplicationsto default reasoning"
695:   "System Z: a natural ordering of defaults with tractable applications to nonmonotonic logic"