#+title: FimFinder Source
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: advanced search for fimfiction.net
#+keywords: my little pony search linear programming web design
#+setupfile: ../../aurellem/org/setup.org
#+include: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes

This is the source for [[./search.html][fimfinder.net]], a search engine for
[[http://fimfiction.net][fimfiction.net]].

* Organization

Want a website that enables easy top-X lists and statistics for the
stories on [[fimfiction.net]].

 - working-name: [[./search.html][fimfinder.net]]

** Features

 - Lists of the top pony stories, where "top" is determined by
   multiple different metrics.
 - Use minimal bandwidth from fimfiction.net
 - Precalculate all results; this website is just a convenient view on
   info that has already been calculated (to ensure low bandwidth on
   my server and stability of the website.)
 - Literate Programming -- every aspect of the site from html to
   backend should be open source and available through the website.
 - Link back to fimfiction.net for the actual stories; the site
   cannot store any of the stories themselves, just the metadata and
   statistics of the stories.

** Implementation

 - Two pages: a search page and the literate programming page.
 - Backend clojure, frontend HTML/jquery
 - There are at most ~300,000 stories, so just use individual files to
   store the statistics of each story rather than a database (~500MB
   tops).
 - Use ajax to only download the particular list that the user wants.
 - Use tiered lists to reduce bandwidth (such as 16, 64 and 512 for
   each statistic).
 - Store list output as list-formatted html in separate files which
   are included via jquery.

* Search Page

This is the main page of the site and the one that will appear when
the user goes to [[./search.html][fimfinder.net]]. It has a brief description of the
purpose of the site and how to use the search functions.
The search section should already be displaying some results using the
"best" algorithm, and provide options to change the algorithm and the
number of stories which are displayed, up to a reasonable maximum
(maybe 500?)

** Useful Utility Functions

Would be nice to get exactly six such functions, one for each pony.

 - straight up total likes (accumulated popularity) {rarity}
 - (- likes dislikes) per word (density of goodness) (use word count
   instead of chapter count to compensate for the fact that some
   stories have much longer chapters than other stories.) {applejack}
 - ratings over total time the story has existed (staying power)
   {fluttershy}
 - ratings per view (most remarkable) {rainbow dash}
 - total number of comments (most notable) {pinkie pie}
 - linear ML utility function over combinations of the other search
   utility functions (overall goodness) {twilight sparkle}

* Clojure Backend

The clojure backend must do two things:
 - get story metadata from fimfiction.net
 - generate lists of the best stories by various criteria.

There are four independent dimensions by which a user can select
search results:
 - Search Algorithm
 - Mature?
 - Complete?
 - Number of Results

Both the search results and story metadata are small enough that the
information can just be stored in files instead of a database without
any real loss in performance. There are currently only around 40,000
stories in fimfiction.net, so one file per story is quite
reasonable. Likewise, 6 search algorithms and three choices for each
of the other three options give 6 x 3 x 3 x 3 = 162 different possible
results, each of which fits comfortably in its own file.

Thus, the clojure backend is all about maintaining a database of
files, each of which represents a story, and creating a database of
search results, one for each combination of story-searching options.

** Story Database

These functions define the file-based database for story metadata.
Each story goes in a file as a json string and is read as a clojure
map.
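The one-file-per-story layout can be sketched in a few lines of
javascript. This is only an illustration: the temporary directory, the
=headerFile=, =writeHeader= and =readHeader= helpers, and the sample
header are assumptions made for the example and are not part of the
backend, which is written in clojure.

```javascript
// Hypothetical sketch of a file-per-story store; not the real backend.
const fs = require("fs");
const os = require("os");
const path = require("path");

// Assumed location: a throwaway directory instead of the real database root.
const databaseRoot = fs.mkdtempSync(path.join(os.tmpdir(), "pony-db-"));

// Each story lives in its own file, named after the story id.
function headerFile(storyId) {
    return path.join(databaseRoot, storyId + ".h");
}

// Store the header as a JSON string.
function writeHeader(storyId, header) {
    fs.writeFileSync(headerFile(storyId), JSON.stringify(header));
}

// Read the JSON string back into a plain object.
function readHeader(storyId) {
    return JSON.parse(fs.readFileSync(headerFile(storyId), "utf8"));
}

// Made-up header, using field names that appear in the story metadata.
const sample = {story: {id: 7, title: "Example", likes: 12, dislikes: 3}};
writeHeader(7, sample);
const roundTrip = readHeader(7);
console.log(roundTrip.story.title);
```

Because every story is an independent file, refreshing one story only
rewrites that one file.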
The database can be updated with minimal use of fimfiction's bandwidth
by first reading all the stories into memory, then only updating those
stories that are valid.

#+name: pony-clj-db
#+begin_src clojure
(in-ns 'org.aurellem.pony)

(def database-root
  (File. "/home/r/proj/pony-stories/backend/db"))

;; Download the raw json header for a story, retrying up to `count`
;; times with a 3 second pause between attempts.
(defn pony-header-net
  ([count pony-num]
     (if (zero? count) nil
         (try
           (slurp (URL. (str "http://www.fimfiction.net/api/story.php?story="
                             pony-num)))
           (catch Exception ex
             (error ex "bad download, count = " count)
             (Thread/sleep 3000)
             (pony-header-net (dec count) pony-num)))))
  ([pony-num]
     (log :info (str "Download " pony-num))
     (pony-header-net 4 pony-num)))

(defn pony-header-file [pony-num]
  (File. database-root (str pony-num ".h")))

(defn write-pony-header! [pony-num]
  (spit (pony-header-file pony-num)
        (pony-header-net pony-num)))

;; Read a header from disk, downloading it first if it is missing.
(defn pony-header [pony-num]
  (clojure.data.json/read-json
   (let [target (pony-header-file pony-num)]
     (if (.exists target) (slurp target)
         (do (write-pony-header! pony-num)
             (slurp target))))))

;; Headers older than this many days are downloaded again.
(def stale-days 5)

(defn new-pony-header! [pony-num]
  (let [days-old
        (.getStandardDays
         (Duration.
          (Instant. (.lastModified (pony-header-file pony-num)))
          (Instant.)))]
    (if (> days-old stale-days)
      (write-pony-header! pony-num)
      (log :info (str pony-num " is " days-old
                      " days old; not downloading")))))

(alter-var-root #'pony-header memoize)

;; A header counts as valid when it differs from the header of story 0.
(defn valid? [header]
  (not= header (pony-header 0)))

;; Size of the batch of story ids probed when looking for new stories.
(def dead-margin 50)

(defn clear-header! [pony-num]
  (log :info (str "Delete " pony-num))
  (.delete (pony-header-file pony-num)))

(defn pony-header-count []
  (dec (count (file-seq database-root))))

(defn clear-dead-margin! []
  (let [frontier-end (pony-header-count)
        frontier-start (max 0 (- frontier-end dead-margin 2))]
    (log :info "clearing dead margin.")
    (dorun (map clear-header! (range frontier-start frontier-end)))))

(defn download-new-stories! []
  (log :info "Downloading new stories.")
  (clear-dead-margin!)
  (loop [current-point (pony-header-count)]
    (log :info (str current-point "-" (dec (+ dead-margin current-point))))
    (let [headers (map pony-header
                       (range current-point
                              (+ dead-margin current-point)))
          num-valid (count (filter valid? headers))]
      (log :info (str num-valid " valid"))
      (if (not= 0 num-valid)
        (recur (+ dead-margin current-point))
        (log :info "done.")))))

(defn update-existing-stories! []
  (dorun (map new-pony-header!
              (filter (comp valid? pony-header)
                      (range (pony-header-count))))))

(defn update-header-db! []
  (update-existing-stories!)
  (download-new-stories!))
#+end_src

** HTML Formatting

These functions define the mapping between the json-encoded story
headers and HTML. I use [[https://github.com/weavejester/hiccup/][hiccup]] to generate the HTML which will be
loaded into the results section via javascript.

=pony-header->li= transforms a single json header into HTML. The HTML
is structured as a list of important information about the story, with
the Title and Author first, followed by lists of important statistics
about the story. The entire story element is a link to the actual
story on fimfiction.net.

To transform multiple headers into HTML, transform each individual
header into HTML and then assemble them all together into an HTML list
using =pony-headers->html=.

#+name: pony-clj-html
#+begin_src clojure
(defn date->str [date]
  (.format (java.text.DateFormat/getDateInstance) date))

(def results-root (File. "/home/r/proj/pony-stories/html/results"))

(def default-story
  "A story which will not win by any reasonable metric."
  {:story
   {:status "Incomplete",
    :full_image "",
    :words 1,
    :chapters
    [{:id 0,
      :title "No Chapter Title",
      :words 1,
      :views 0,
      :link "",
      :date_modified 0}],
    :author {:id "0", :name "Nobody"},
    :image "",
    :likes 0,
    :total_views 0,
    :title "No Title",
    :dislikes 0,
    :views 0,
    :url "",
    :categories
    {:Romance false,
     :Dark false,
     (keyword "Slice of Life") false,
     :Sad false,
     :Human false,
     :Random false,
     (keyword "Alternate Universe") false,
     :Comedy false,
     :Tragedy true,
     :Crossover false,
     :Adventure false},
    :date_modified 0,
    :chapter_count 1,
    :short_description "No Short Description",
    :content_rating 0,
    :comments 0,
    :content_rating_text "Everyone",
    :id 0,
    :description "No Description"}})

(defn sanitize [header]
  {:story (merge (:story default-story) (:story header))})

(defn pony-header->li [header index]
  (let [story (:story (sanitize header))
        categories
        (sort (map (comp name first)
                   (filter (comp (partial = true) second)
                           (:categories story))))
        title (:title story)
        author (:name (:author story))
        url (:url story)
        description (:short_description story)
        status (h (:status story))
        num-chapters (h (:chapter_count story))
        num-words (h (:words story))
        content (h (:content_rating_text story))
        id (h (:id story))
        likes (h (:likes story))
        dislikes (h (:dislikes story))
        total-views (h (:total_views story))
        comments (h (:comments story))
        last-modified
        (h (date->str
            (java.util.Date.
             (long (* 1000 (:date_modified story))))))]
    (html
     [:li {:class "pony-list-item"}
      [:a {:class "pony-link" :href url}
       [:div {:class "pony-list-item"}
        [:div {:class "pony-header"}
         [:em {:class "pony-title"} title " "]
         [:em {:class "pony-by"} " by "]
         [:em {:class "pony-author"} author]
         [:div {:class "pony-description"} description]]
        [:div {:class "pony-info"}
         [:div {:class "pony-summary"}
          [:ul {:class "pony-summary"}
           [:li {:class "pony-status"} status]
           [:li {:class "pony-content"} content]
           [:li {:class "pony-chapters"} "Chapters: " num-chapters]
           [:li {:class "pony-words"} "Words: " num-words]
           [:li {:class "pony-id"} "Modified: " last-modified]]]
         [:div {:class "pony-stats"}
          [:ul {:class "pony-stats"}
           (map (fn [desc value]
                  [:li
                   [:div {:class "pony-stats-container"}
                    [:div {:class "pony-stats-desc"} desc]
                    [:div {:class "pony-stats-value"} value]]])
                ["Views" "Likes" "Dislikes" "Comments" "ID"]
                [total-views likes dislikes comments id])]]
         [:div {:class "pony-categories"}
          (if (empty? categories) " "
              [:ul {:class "pony-categories"}
               (map #(vector :li %) categories)])]
         [:div {:class "pony-index"}
          [:div {:class "inner-index"} index]]]]]])))

(defn pony-headers->html [headers]
  (html
   [:ol {:class "pony-list"}
    (map pony-header->li headers (range 1 (inc (count headers))))]))
#+end_src

** Utility Functions

Here is where the stories are sorted according to the six different
utility functions and the static HTML result files are created.

There are 5 basic utility functions, =overall-ratings=,
=total-comments=, =ratings-per-view=, =ratings-per-word=, and
=likes-per-total-time=, which each compute a simple numerical value
ranking the story in relation to other stories. For example,
=ratings-per-word= returns:

$$\frac{\text{likes} - \text{dislikes}}{\text{total number of words}}$$

for each story.
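As a quick worked example of this formula, the javascript sketch below
scores a single made-up story. The numbers and the =ratingsPerWord=
name are assumptions for illustration only; the actual utility
functions are the clojure ones that follow.

```javascript
// Illustrative only: (likes - dislikes) / words for one story header.
function ratingsPerWord(header) {
    var story = header.story;
    return (story.likes - story.dislikes) / story.words;
}

// Made-up story: 300 likes, 20 dislikes, 56000 words.
var example = {story: {likes: 300, dislikes: 20, words: 56000}};
console.log(ratingsPerWord(example)); // 280 / 56000 = 0.005
```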
#+name: pony-clj-utility
#+begin_src clojure
(defn overall-ratings [header]
  (:likes (:story header)))

(defn total-comments [header]
  (:comments (:story header)))

(defn adjusted-ratings [header]
  (- (:likes (:story header)) (:dislikes (:story header))))

(defn ratings-per-view [header]
  (let [views (:total_views (:story header))
        adjusted-views (if (< views 100) 1e9 views)]
    ;; penalize stories with very few views.
    (/ (adjusted-ratings header) adjusted-views)))

(defn ratings-per-word [header]
  (let [words (:words (:story header))
        adjusted-words (if (< words 250) 1e9 words)]
    ;; penalize stories with fewer than 250 words.
    (/ (adjusted-ratings header) adjusted-words)))

(defn story-time [header]
  (- (/ (System/currentTimeMillis) 1000.)
     (:date_modified (first (:chapters (:story header))))))

(defn likes-per-total-time [header]
  (let [time (story-time header)]
    (/ (adjusted-ratings header) time)))

(defn mature? [header]
  (= "Mature" (:content_rating_text (:story header {}))))

(defn complete? [header]
  (= "Complete" (:status (:story header {}))))

(def only-mature (partial filter mature?))
(def show-mature identity)
(def no-mature (partial filter (comp not mature?)))

(def complete (partial filter complete?))
(def incomplete (partial filter (comp not complete?)))
(def any_status identity)
#+end_src

** Results File Generation

=generate-all-html!= uses each utility function in turn to sort the
list of stories, then produces static HTML files based on the name of
the function.
So for =overall-ratings=, =generate-all-html!= will first sort all the
stories by their total likes, then filter the resulting list by
maturity and completion status, then take the first 16, 64, and 512
results from the final list and run them through =pony-headers->html=
to generate files named:

#+BEGIN_EXAMPLE
overall-ratings_only-mature_complete_16.html
overall-ratings_only-mature_complete_64.html
overall-ratings_only-mature_complete_512.html
overall-ratings_only-mature_incomplete_16.html
overall-ratings_only-mature_incomplete_64.html
overall-ratings_only-mature_incomplete_512.html
overall-ratings_only-mature_any_status_16.html
overall-ratings_only-mature_any_status_64.html
overall-ratings_only-mature_any_status_512.html
overall-ratings_show-mature_complete_16.html
overall-ratings_show-mature_complete_64.html
overall-ratings_show-mature_complete_512.html
overall-ratings_show-mature_incomplete_16.html
overall-ratings_show-mature_incomplete_64.html
overall-ratings_show-mature_incomplete_512.html
overall-ratings_show-mature_any_status_16.html
overall-ratings_show-mature_any_status_64.html
overall-ratings_show-mature_any_status_512.html
overall-ratings_no-mature_complete_16.html
overall-ratings_no-mature_complete_64.html
overall-ratings_no-mature_complete_512.html
overall-ratings_no-mature_incomplete_16.html
overall-ratings_no-mature_incomplete_64.html
overall-ratings_no-mature_incomplete_512.html
overall-ratings_no-mature_any_status_16.html
overall-ratings_no-mature_any_status_64.html
overall-ratings_no-mature_any_status_512.html
#+END_EXAMPLE

For a total of 27 static files per utility function.
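The naming scheme can be reproduced with three nested loops. The
sketch below is written in javascript (the frontend's language) and is
not the backend code; =resultFileNames= is a hypothetical helper,
while the option names and limits are the ones that appear in the file
names of this section.

```javascript
// Hypothetical helper: list every result file for one ranking function.
var matureOptions = ["only-mature", "show-mature", "no-mature"];
var completeOptions = ["complete", "incomplete", "any_status"];
var limits = [16, 64, 512];

function resultFileNames(ranking) {
    var names = [];
    matureOptions.forEach(function (mature) {
        completeOptions.forEach(function (complete) {
            limits.forEach(function (limit) {
                // Same order as the file names: ranking, mature, complete, limit.
                names.push([ranking, mature, complete, limit].join("_") + ".html");
            });
        });
    });
    return names;
}

var names = resultFileNames("overall-ratings");
console.log(names.length);
console.log(names[0]);
```

With six ranking functions this gives the 6 x 27 = 162 result files
that the site serves.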
#+name: pony-clj-results
#+begin_src clojure
(in-ns 'org.aurellem.pony)

;; TODO memoize utility and filter fns for greater speed +
;; elegance

(def ranking-vars
  [#'likes-per-total-time
   #'total-comments
   #'overall-ratings
   #'ratings-per-word
   #'ratings-per-view
   #'ml-rating-function])

(def mature-vars
  [#'only-mature
   #'show-mature
   #'no-mature])

(def complete-vars
  [#'complete
   #'incomplete
   #'any_status])

(def limits [16 64 512])

(defn generate-all-html!
  ([limit]
     ;; remove old static results files
     (dorun (map #(.delete %)
                 (rest
                  (file-seq
                   (File. "/home/r/proj/pony-stories/results")))))
     ;; generated last-modified file
     (spit (File. results-root "last-updated.html")
           (date->str (.toDate (Instant.))))
     (let [valid-headers
           (filter valid?
                   (map pony-header
                        (range (min (pony-header-count) limit))))]
       (dorun
        (for [ranking-var ranking-vars]
          (let [sorted-headers
                (sort-by (comp - (var-get ranking-var))
                         (map sanitize valid-headers))]
            (dorun
             (for [mature-var mature-vars
                   complete-var complete-vars]
               (let [final-headers
                     ((var-get complete-var)
                      ((var-get mature-var) sorted-headers))]
                 (dorun
                  (for [limit limits]
                    (let [html (pony-headers->html
                                (take limit final-headers))
                          target
                          (File. results-root
                                 (str
                                  (clojure.string/join
                                   "_"
                                   [(:name (meta ranking-var))
                                    (:name (meta mature-var))
                                    (:name (meta complete-var))
                                    limit])
                                  ".html"))]
                      (println (.getName target))
                      (spit target html))))))))))))
  ([]
     (generate-all-html! (pony-header-count))))

(defn redeploy-site []
  (update-header-db!)
  (generate-all-html!))
#+end_src

** Linear Programming

Just for fun, I use the linear programming package [[http://lpsolve.sourceforge.net/5.5/][lpsolve]] to
automatically construct a sixth utility function out of a linearly
weighted combination of the other utility functions, based on some
training data. The clojure interface to lpsolve which I am using here
is described in [[http://aurellem.org/pokemon-types/html/lpsolve.html][this blog post]].
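To make "a linearly weighted combination" concrete, here is a small
javascript sketch. The weights are invented for the example (on the
real site they come out of lpsolve), and =linearCombination=,
=adjustedRatings= and =totalComments= are hypothetical javascript
stand-ins, not the clojure functions themselves.

```javascript
// Illustrative sketch: build one scoring function out of several others.
// weights[i] multiplies the value returned by fns[i]; the results are summed.
function linearCombination(weights, fns) {
    return function (header) {
        return fns.reduce(function (total, f, i) {
            return total + weights[i] * f(header);
        }, 0);
    };
}

// Two simple scoring functions over a story header.
function adjustedRatings(header) {
    return header.story.likes - header.story.dislikes;
}
function totalComments(header) {
    return header.story.comments;
}

// Invented weights; a solver would normally choose these.
var combined = linearCombination([0.5, 2], [adjustedRatings, totalComments]);
var story = {story: {likes: 10, dislikes: 4, comments: 3}};
console.log(combined(story)); // 0.5 * 6 + 2 * 3 = 9
```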
#+name: pony-clj-lp
#+begin_src clojure
(def story-ratings
  "This is a sequence of [story-id goodness] for selected stories."
  [[1888  50] ;; My Little Dashie
   [4656  80] ;; Anthropology'
   [6195  75] ;; It Takes a Village
   [1422  60] ;; Romance Reports
   [755   60] ;; On a Cross And Arrow
   [21583 75] ;; Twilight's List
   [29271 60] ;; Princess Celestia Hates Tea
   [9329  40] ;; Beating the Heat
   [25350 45] ;; Twilight Sparkle Earns the Feature Box
   [18786 65] ;; Field Notes on Alicorn Reproductive Behavior
   [25944 40] ;; Twilight's First Time
   [1526  30] ;; Haylo: A New World
   ])

(def pony-dimensions
  [["adjusted ratings" adjusted-ratings 1]
   ["comments" total-comments 1]
   ["ratings per word" ratings-per-word 1]
   ["ratings per view" ratings-per-view 1]
   ])

(defn normalize [vals]
  (let [minimum (apply min vals)
        width (- (apply max vals) (apply min vals))]
    (map (fn [val] (float (/ (- val minimum) width)))
         vals)))

(defn optimize-pony-stories
  ([story-ratings pony-dimensions]
     (let [headers (mapv (comp pony-header first) story-ratings)
           columns
           (vec (for [f (map second pony-dimensions)]
                  (normalize (map f headers))))
           constraint-matrix
           (vec (for [n (range (count headers))]
                  (mapv #(nth % n) columns)))
           b (mapv second story-ratings)
           c (mapv #(nth % 2) pony-dimensions)]
       (clojure.pprint/pprint constraint-matrix)
       (clojure.pprint/pprint b)
       (clojure.pprint/pprint c)
       (lp-solve constraint-matrix b c
                 (set-variable-names lps (map first pony-dimensions))
                 (set-constraints lps LpSolve/LE)
                 (.setMaxim lps)
                 (results lps))))
  ([] (optimize-pony-stories story-ratings pony-dimensions)))

(def ml-rating-function
  (let [linear-coefficients
        (:optimal-values
         (optimize-pony-stories story-ratings pony-dimensions))]
    (comp (partial reduce +)
          (apply juxt
                 (map (fn [coef f] (comp (partial * coef) f))
                      linear-coefficients
                      (map second pony-dimensions))))))
#+end_src

** Clojure Packages

This is the namespace declaration for all of the clojure code above.
#+name: pony-clj-header
#+begin_src clojure
(ns org.aurellem.pony
  ;; clojure.pprint and clojure.string are referenced by their full
  ;; names in the code above, so they are required here as well.
  (:require clojure.data.json clojure.pprint clojure.string)
  (:use hiccup.core)
  (:use clojure.tools.logging)
  (:use pokemon.lpsolve)
  (:import lpsolve.LpSolve)
  (:import java.net.URL java.io.File)
  (:import (org.joda.time Instant Duration)))
#+end_src

* jquery/HTML Frontend

The site is a simple view of static html files that will have been
already generated by the clojure backend.

** javascript defines the actions

The following is some simple javascript that will gather the four
search dimensions from radio buttons, construct the appropriate static
file name, and display the file using jquery's =load= function
whenever the state of any radio button changes.

#+name: pony-js
#+begin_src javascript
var pony = {};

pony.search_result_target =
    function (search_alg, mature, complete, num_results) {
        return "results/" + search_alg + "_" +
            mature + "_" + complete + "_" + num_results + ".html";
    };

// get the desired search options from the radio buttons
pony.gather_target = function () {
    var search_alg  = $('input[name=search_alg]:checked').val();
    var mature      = $('input[name=mature]:checked').val();
    var complete    = $('input[name=complete]:checked').val();
    var num_results = $('input[name=num_results]:checked').val();
    return pony.search_result_target(
        search_alg, mature, complete, num_results);
};

pony.last_updated = function () {
    $("#last-updated").load("results/last-updated.html");
};

pony.update_search_results = function () {
    $("#search-results").load(pony.gather_target());
};

pony.ajax_init = function () {
    $('input:radio').change(pony.update_search_results);
};

pony.main_init = function() {
    pony.last_updated();
    pony.update_search_results();
    pony.ajax_init();
};

$(document).ready(pony.main_init);
#+end_src

** html defines the structure

Now that the interactive portion of the site is defined it is time to
create the overall structure of the website.
The structure of the site is a list of lists: there are lists of
radio buttons which serve as controls for the user to view a list of
results. This html defines the radio buttons which will serve as
controls and leaves an empty div for the list of results, which will
be generated by clojure and loaded by jquery depending on the state of
the radio buttons. The containing divs, ids, and classes exist so that
everything can be styled in css later.

#+name: pony-html
#+begin_src html