Adding a search engine
Adding a search engine to the project is intended to be pretty straightforward. Please note the following rules:
- The engine MUST be implemented in the `engines` module in the `com.wegtam.search.engines` package.
- The engine MUST implement the `SearchEngine` interface.
- The parser of the engine MUST implement the `SearchEngineParser` interface.
- The `name` of the implemented engine MUST be unique.
- The implemented engine MUST be added to the `all` function of the `SearchEnginesLoader` object.
Here we will go forward with an example implementation of a search engine. First we have to pick an engine that we want to implement. Please keep in mind that the easy route is using a freely accessible engine which can be easily scraped for results. We’ll pick https://arxiv.org/search/ for starters.
Several search engines offer an “advanced search mode” and it might be worthwhile to support it, but in general: please start out simple! You’ll get to a working solution much faster and can always extend it later on.
Examine the website or API
Usually we should start by going to the website and issuing some search queries through the web form. While doing that, pay attention to the possible filter and sorting options and to the URL parameters used by the engine.
As our chosen target offers no API, we must resort to web scraping to get our results.
Play around with manipulating parameters directly in the URL (for example the number of results) to find out whether the engine is liberal about this or throws an error when given arbitrary values.
The next step should be running some searches from the command line or a REPL to ensure that the engine still works outside of a web browser.
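A quick way to do that is a plain `curl` call against the search URL, with the parameters copied from the browser’s address bar; the exact query below is only meant as an illustration:
% curl -s "https://arxiv.org/search/?searchtype=all&query=test" | head -n 40
If the response already contains the expected result markup (and not an error page or a CAPTCHA), scraping from outside the browser should be feasible.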
Finally, put the developer tools of your browser to good use on the result page to figure out the CSS selectors that locate the relevant information. For many engines this is quite easy; only some require more sophisticated measures or even pre- or post-processing of the raw HTML code.
In our case we can note the following:
- usage of the HTTP GET method for the search
- base URL is `https://arxiv.org/search/`
- several parameters:
  - `searchtype`
    - narrows the search to specific fields or modes
    - defaults to `all`
    - other options available (`title`, etc.)
    - choosing “Full Text” in the form redirects to another “engine”
  - `query`
    - contains the URL encoded search query
  - `abstracts`
    - whether to display the abstract of a found publication
    - can be `show` or `hide`
    - should be `show` because that offers more content
  - `order`
    - sort order of the results
    - defaults to being empty which indicates “relevance”
    - other options available (`-announced_date_first`, etc.)
  - `size`
    - the number of returned results
    - cannot be arbitrary (fixed set: 25, 50, 100, 200)
- CSS selector for a search result: `.arxiv-result`
- CSS selector for the title: `.title`
- CSS selector for the URL: `.list-title a[href]`
- CSS selector for the description: `.abstract`
Regarding the description we stick to the abstract, although it would be nice to also extract more information such as authors and publication date.
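Before writing any engine code it can help to double-check the selectors against a saved result page. The following is a minimal sketch using jsoup, chosen here purely for illustration (it is not necessarily the HTML parser used by the project), and assuming the page was saved locally as ArXiv.html:
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

val html = scala.io.Source.fromFile("ArXiv.html", "UTF-8").mkString
val doc  = Jsoup.parse(html, "https://arxiv.org") // base URI so relative links can be resolved
doc.select(".arxiv-result").asScala.take(3).foreach { result =>
  println(result.select(".title").text())
  println(result.select(".list-title a[href]").attr("abs:href")) // absolute result URL
  println(result.select(".abstract").text().take(80))
}
If the printed titles, URLs and abstracts look sensible, the selectors are good enough to be used in the parser configuration later on.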
Implement the engine
Every new engine must implement the `SearchEngine` interface, which also implies implementing a `SearchEngineParser`.
The first members you have to implement are as follows:
override val capabilities: NonEmptyList[SearchEngineCapabilities] = NonEmptyList.of(SearchEngineCapabilities.Paging)
override val modes: NonEmptyList[SearchMode] = NonEmptyList.of(SearchMode.GENERIC)
override val name: SearchEngineName = ArXiv.ENGINE_NAME
The `capabilities` value returns a list of things the engine is “capable of”, which usually is `Paging` but may include other flags like `RegionalSearch` or `TimeFrameSearch`.
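As a purely hypothetical illustration (ArXiv itself only supports paging), an engine that also offers regional and time frame filtering might declare its capabilities like this:
override val capabilities: NonEmptyList[SearchEngineCapabilities] =
  NonEmptyList.of(
    SearchEngineCapabilities.Paging,
    SearchEngineCapabilities.RegionalSearch,
    SearchEngineCapabilities.TimeFrameSearch
  )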
Next up is `modes`, which returns the supported search modes. In general it is okay to simply return `GENERIC` here.
Now for the search engine name, which is taken from the companion object, as you can see:
object ArXiv {
  val ENGINE_NAME: SearchEngineName = "ArXiv"
}
Please note that the name of the engine must be unique across all engines!
Building the parser
The good news is that the parser is already implemented and you only need to provide certain parameters for it. Here you apply the CSS selectors you figured out in the beginning. The names should be self-explanatory.
override protected val parser: SearchEngineParser[F] = new SearchEngineParser[F] {
  override protected val patterns: Map[SearchMode, Map[SearchEngineParserPatternType, SearchEngineParserPattern]] =
    Map(
      SearchMode.GENERIC -> Map(
        EXTRACT_RESULT -> ".arxiv-result",
        EXTRACT_RESULT_DESCRIPTION -> ".abstract",
        EXTRACT_RESULT_TITLE -> ".title",
        EXTRACT_RESULT_URL -> ".list-title a[href]"
      )
    )
}
The real search function
Last but not least, the `search` function must be implemented. It is quite straightforward and may be copied from another engine. However, here you can tweak the code to do special things required to use the engine you chose.
It is considered good practice to pull out some magic strings or numbers into separate variables.
private final val BASE_URL = "https://arxiv.org"
private final val PARAMETER_NAME_PAGING = "start"
private final val DEFAULT_RESULTS_PER_PAGE = 25
override def search(q: SearchQuery)(implicit
    backend: sttp.client3.SttpBackend[F, sttp.capabilities.fs2.Fs2Streams[F] with sttp.capabilities.WebSockets]
): Stream[F, SearchResult] = {
  val requests = Stream.emits((0 to q.results / DEFAULT_RESULTS_PER_PAGE).map { page =>
    basicRequest
      .get(
        uri"$BASE_URL/search/?abstracts=show&searchtype=all&size=$DEFAULT_RESULTS_PER_PAGE&order=&query=${q.query}&$PARAMETER_NAME_PAGING=${page * DEFAULT_RESULTS_PER_PAGE}"
      )
      .header("User-Agent", "Mozilla/5.0 ...")
      .readTimeout(FiniteDuration(30, SECONDS))
      .response(asStringAlways.map(SearchEngineOutput.from))
  })
  val parse = parser.parseResults(BASE_URL.some)(name)(SearchMode.GENERIC)(_)
  val results = requests
    .evalMap(_.send(backend))
    .evalMap(_.body.traverse(parse))
    .flatMap(r => Stream.emits(r.getOrElse(List.empty)))
    .take(q.results.toLong)
  results
}
We create a stream of requests depending on the number of requests we have to make (derived from the desired number of results and the necessary paging); for example, with `results = 60` and 25 results per page we issue three requests with `start` values of 0, 25 and 50 and later keep only the first 60 results. The URL for each request is built accordingly and, importantly, the `User-Agent` header must be set to a sensible value to avoid being blocked. Usually the identifier of a common web browser should be used.
Next we create our parser function (`val parse = ...`) and build the stream execution pipeline which we return to our caller. Because ArXiv uses relative URLs we pass the optional `baseUri` parameter to be able to extract correct result URLs.
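At this point the engine can already be exercised directly, for example from a REPL or a scratch file. The following is only a rough usage sketch, assuming the async-http-client fs2 backend that is also used in the online tests further below and a cats-effect IO runtime:
import cats.effect.IO
import sttp.client3.asynchttpclient.fs2.AsyncHttpClientFs2Backend

val program = AsyncHttpClientFs2Backend.resource[IO]().use { implicit backend =>
  new ArXiv[IO]
    .search(SearchQuery("higgs boson", region = None, results = 5))
    .compile
    .toList
}
// Running `program` (e.g. via unsafeRunSync() with an IORuntime in scope)
// should yield up to five SearchResult values.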
Make the engine available
Due to trouble with dynamically loading classes we stick to adding any new search engine to the `all` function of the `SearchEnginesLoader` object.
def all[F[_]: Sync](): List[SearchEngine[F]] =
  List(
    new ArXiv[F],
    new Bing[F],
    // etc. ...
  )
See how it works
You can now run the `--list-engines` command of the CLI from the sbt console:
sbt:wegtam> cli/run --list-engines
[info] running com.wegtam.search.cli.WegtamSearchAgent --list-engines
ArXiv (Paging)
Bing (LanguageSearch, Paging, RegionalSearch, TimeFrameSearch)
DuckDuckGo (RegionalSearch, TimeFrameSearch)
...
Et voilà! Looks like we can try it out for real now:
sbt:wegtam> cli/run --engine ArXiv --results 5 --query "higgs boson"
[info] running com.wegtam.search.cli.WegtamSearchAgent --engine ArXiv --results 5 --query "higgs boson"
https://arxiv.org/abs/2104.03408 (1, 20)
https://arxiv.org/abs/2103.02682 (1, 21)
https://arxiv.org/abs/2103.00409 (1, 22)
https://arxiv.org/abs/2103.02752 (1, 23)
https://arxiv.org/abs/2103.12123 (1, 24)
sbt:wegtam>
Congratulations, you have implemented your first search engine driver! :-)
Adding tests
Of course we are only halfway done. ;-) Because we want to ensure that our engine works correctly, we add some tests. The tests are located under the `engines/src/test` folder and are separated into offline and online tests.
Writing offline tests
For the offline tests we need to fetch a search result page from the engine and store it in a file. By convention these are stored under `engines/src/test/resources` using the package path `com/wegtam/search/engines`, like the engines themselves, and should be named after the search engine. So in our case the file name will be `ArXiv.html`.
Fetching the results
To best match the HTML that the engine parser will see, we do not recommend saving the site via your browser: dynamic code (JavaScript) might get executed and modify the page before you save it. So far, fetching the data via command line tools like `curl` or `wget` has worked very well.
The following command serves as an example of using `curl`. Please replace `QUERY` with your search query.
% curl -o engines/src/test/resources/com/wegtam/search/engines/ArXiv.html \
"https://arxiv.org/search/?abstracts=show&searchtype=all&order=&query=QUERY"
Test code
We are using the munit test framework, which provides fixtures, assertions and other useful things. Please take a look at its documentation for details.
First we need to load our saved search results into a fixture.
val resultsFile = ResourceSuiteLocalFixture(
  "search-results-file",
  Resource.make(
    IO.blocking(
      scala.io.Source
        .fromInputStream(
          getClass().getClassLoader().getResourceAsStream("com/wegtam/search/engines/ArXiv.html"),
          "UTF-8"
        )
        .mkString
    )
  )(_ => IO.unit)
)
override def munitFixtures = List(resultsFile)
This may look intimidating, but it is just the creation of a standard fixture using a cats-effect `Resource` over `IO`, and as you can see the code in the innermost part is quite straightforward: we read a classpath resource into a string, and that is all there is to it.
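As a small aside, the underlying `Source` could also be managed by the `Resource` itself so that it gets closed once the suite has finished. This is merely a sketch of an alternative, assuming the same fixture helper as above:
val resultsFile = ResourceSuiteLocalFixture(
  "search-results-file",
  Resource
    .fromAutoCloseable(
      IO.blocking(
        scala.io.Source.fromInputStream(
          getClass().getClassLoader().getResourceAsStream("com/wegtam/search/engines/ArXiv.html"),
          "UTF-8"
        )
      )
    )
    .map(_.mkString) // read the whole page into a String while the Source is still open
)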
Next up is the actual test code in which we use the fixture to access the results file.
test("must parse results correctly".tag(OfflineTest)) {
SearchEngineOutput.from(resultsFile()) match {
case Left(_) => fail("No valid search engine output in test file!")
case Right(o) =>
val engine = new ArXiv[IO]
val parser = engine.parser
val results = parser.parseResults(None)(engine.name)(SearchMode.GENERIC)(o)
results.map(_.size).assertEquals(25)
}
}
The first important part here is the tagging of the test via `.tag(OfflineTest)` to mark it as an offline test. We can use this to execute only offline tests, only online tests, all tests except online tests, and so on.
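If you only want to execute a particular group, munit supports filtering by tag from sbt; assuming the tag names match the objects used here (`OfflineTest`, `OnlineTest`), an invocation along these lines should do it:
sbt:wegtam> engines/testOnly *ArXiv* -- --exclude-tags=OnlineTest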
Regarding the code, you can see that we create the search engine, get the parser from it and call the `parseResults` function with the content of the results file. Finally we do some assertions on the returned results; the expected count of 25 matches the default number of results per page (see the `size` parameter above).
Writing online tests
By convention we group the online tests into a different test class and also try to provide a small list of search queries to avoid hitting the server with the same query multiple times upon a test run.
class ArXivOnlineTest extends munit.CatsEffectSuite {
  val queries = List("category theory", "cognitive computing", "higgs boson", "hilbert space", "particle plasma")
  val searchQuery = ResourceSuiteLocalFixture(
    "search-query",
    Resource.make(IO.delay(queries(scala.util.Random.nextInt(queries.length))))(_ => IO.unit)
  )
  val sttpBackend = ResourceSuiteLocalFixture("sttp-backend", AsyncHttpClientFs2Backend.resource[IO]())
  override def munitFixtures = List(searchQuery, sttpBackend)
}
As you can see we generate two fixtures this time. The first one is simply a random search query (out of our list of provided ones).
Please ensure that all of your queries return a sufficient number of search results.
The second fixture is a backend for our search engine. We are using the sttp library, which can be powered by several backends; here we chose the fs2-based one.
Last but not least we want to write the actual tests. The basic test case is querying the engine for results and asserting that some are returned. If the search engine supports paging through the results, we should add a second test that tries to fetch more than the default number of search results and asserts that indeed that many are returned.
Please note that the tests are tagged via `.tag(OnlineTest)` this time!
test("must search online".tag(OnlineTest)) {
val engine = new ArXiv[IO]
val query = SearchQuery(searchQuery(), region = None, results = 25)
engine.search(query)(sttpBackend()).compile.toList.map(_.size).map(s => assert(s > 0))
}
test("must search online with paging".tag(OnlineTest)) {
val engine = new ArXiv[IO]
val query = SearchQuery(searchQuery(), region = None, results = 50)
engine.search(query)(sttpBackend()).compile.toList.map(_.size).map(s => assert(s > 25))
}
Hooray! You have not only implemented a search engine driver but also tests to ensure its quality!