Learn, Share, Build

October 07, 2017, at 11:15 PM

I’m building an online dictionary. The search must support:

  • auto-complete
  • misspellings
  • searching by word derivatives (write, wrote, written, writing)

To do so I calculate all possible versions of a word and give them a percentage score:

  • Original: write 100%
  • N-grams: w 2%, wr 8%, wri 30%, writ 60%
  • Derivatives and their N-grams: wro 4%, wrot 8%, wrote 30%, writte 20%, written 40%, writi 20%, writin 30%, writing 40%
  • A phonetic hash of everything: W 1%, WR 4%, WRT 10%, WRTN 7%, WRTNG 7%

A single dictionary entry might have a handful of indexed words, and each of those words might have up to 100 possible versions or N-grams as shown above.
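A minimal sketch of how the versions above could be generated, for illustration only. The function names, the `derivative_weight` parameter, and the scoring rule (squared length ratio) are assumptions made up for this example; they do not reproduce the percentages listed above, and the phonetic hash is left out.

```python
def prefixes(word, min_len=1):
    """Return every proper prefix of `word` (edge n-grams), shortest first."""
    return [word[:i] for i in range(min_len, len(word))]


def build_variants(original, derivatives, derivative_weight=0.4):
    """Map each searchable text to a score in percent.

    Hypothetical scoring: a prefix scores by the square of the fraction of the
    word it covers, and derivatives are scaled down by `derivative_weight`.
    """
    variants = {}
    forms = [(original, 1.0)] + [(d, derivative_weight) for d in derivatives]
    for form, weight in forms:
        for text in prefixes(form) + [form]:
            score = 100.0 * weight * (len(text) / len(form)) ** 2
            # A text reachable from several forms keeps its best score.
            variants[text] = max(variants.get(text, 0.0), score)
    return variants


variants = build_variants("write", ["wrote", "written", "writing"])
```

Each `(text, score)` pair in `variants` would become one indexed version of the dictionary entry, which is where the count of up to 100 versions per word comes from.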

I implemented this in Elasticsearch. Elasticsearch is:

  • +1: Made for search, and will attempt to find the best score
  • +1: Quick
  • +1: Uses inverted indexes, which make a lot of sense for all those derivatives
  • +1: Good at compressing
  • -1: Requires a lot of memory
  • -1: Difficult to update

But would it be satisfactory to store all this in MySQL?

  • -1: Forward indexing, so a lot of repetitions
  • -1: I haven't seen anyone store derivatives and N-grams in SQL
  • +1: Easy to update

This query doesn't seem so bad, but the suggestions table would end up with almost 50 million rows.

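SELECT words.word
FROM (
  -- Total score per dictionary entry across all matching suggestion rows
  SELECT
    word_id,
    SUM(score) AS score
  FROM suggestions
  -- 'input' and 'phonetic(input)' stand for the search term and its phonetic hash
  WHERE text IN ('input', 'phonetic(input)')
  GROUP BY word_id
) AS suggestion
JOIN words ON words.id = suggestion.word_id
GROUP BY words.word
-- Best match first; an ORDER BY inside the derived table is not guaranteed to survive the outer GROUP BY
ORDER BY MAX(suggestion.score) DESC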
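The lookup can be tried on a small scale before committing to 50 million rows. The sketch below is an assumption-heavy stand-in: it uses Python's built-in `sqlite3` module instead of MySQL, invents the column types, the index name `idx_suggestions_text`, and the sample rows, and flattens the query into a single join. The part that matters at scale is the composite index on `text`, which lets each lookup touch only the rows for the two search keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL);
CREATE TABLE suggestions (
    text    TEXT    NOT NULL,
    word_id INTEGER NOT NULL REFERENCES words(id),
    score   REAL    NOT NULL
);
-- Composite index: a lookup by text can be answered from the index alone.
CREATE INDEX idx_suggestions_text ON suggestions (text, word_id, score);
""")

# Made-up sample data: two entries that share the prefix "wri".
conn.executemany("INSERT INTO words (id, word) VALUES (?, ?)",
                 [(1, "write"), (2, "wrist")])
conn.executemany("INSERT INTO suggestions (text, word_id, score) VALUES (?, ?, ?)",
                 [("wri", 1, 30.0), ("WR", 1, 4.0), ("wri", 2, 30.0), ("wrote", 1, 30.0)])


def search(conn, term, phonetic_term, limit=10):
    """Return (word, total_score) pairs, best match first."""
    return conn.execute(
        """
        SELECT words.word, SUM(suggestions.score) AS total
        FROM suggestions
        JOIN words ON words.id = suggestions.word_id
        WHERE suggestions.text IN (?, ?)
        GROUP BY words.id, words.word
        ORDER BY total DESC
        LIMIT ?
        """,
        (term, phonetic_term, limit),
    ).fetchall()


results = search(conn, "wri", "WR")
```

With this sample data, "write" ranks above "wrist" because it matches both the text key and the phonetic key. In MySQL, running `EXPLAIN` on the equivalent query would show whether the index on `text` is actually used.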

Is this an adequate way of doing searches?
