Using junction tables in PHP and MySQL to categorize and include and exclude categories

32
February 11, 2019, at 06:10 AM

I'm trying to analyze tweets using manually assigned categories. Everything's stored in a MySQL database. I can add and remove tweets, categories, and the relationships between them without any problems.

Including categories using OR logic works as expected. If I want to find tweets categorized as "Venezuela" or "Maduro," I send those two terms over in an array called $include with $include_logic set to "or". Tweets categorized under either category are returned. Great!

The problems start when I try to using AND logic (that is, tweets categorized under all included terms, e.g., both Venezuela and Maduro) or when I try excluding categories.

Here's the code:

function filter_tweets($db, $user_id, $from_utc, $to_utc, $include = null, $include_logic = null, $exclude = null) {
    $include_sql = '';
    if (isset($include)) {
        $include_sql = 'AND (';
        $logic_op = '';
        foreach ($include as $cat) {
            $include_sql .= "{$logic_op}cats.name = '$cat' ";
            $logic_op = ($include_logic != 'and') ? 'OR ' : 'AND '; # AND doesn't work here
        }
        $include_sql .= ')';
    }
    $exclude_sql = ''; # Nothing I've tried with this works.
    $sql = "
        SELECT DISTINCT tweets.id FROM tweets
            LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
            LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
        WHERE tweets.user_id = $user_id
            AND created_at
                BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
                    AND '{$to_utc->format('Y-m-d H:i:s')}'
            $include_sql
            $exclude_sql
        ORDER BY tweets.created_at ASC;";
    return db_fetch_all($db, $sql);   
}

where db_fetch_all() is

function db_fetch_all($con, $sql) {
    if ($result = mysqli_query($con, $sql)) {
        $rows = mysqli_fetch_all($result);
        mysqli_free_result($result);
        return $rows;
    }
    die("Failed: " . mysqli_error($con)); 
}

and tweets_cats is the junction table between the tweets and cats tables.

After reading up on joins and junction tables, I understand why my code doesn't work in the two cases mentioned. It can only look at one tweet and corresponding category at a time. So asking it to omit a tweet categorized as "X" is moot because it won't omit it when the same tweet is encountered and categorized as "Y".

What I don't understand is how to modify the code so that it does work. I haven't found any examples of people trying to do anything similar. Perhaps I'm not searching for the right terms. I'd be grateful if someone could point me to a good resource for working with junction tables in MySQL similar to how I'm using them.

Edit: Here's working SQL created by the function using the example mentioned above including "Venezuela" OR "Maduro" on the VP twitter account with the date range set to tweets so far this month (EST converted to UTC).

SELECT DISTINCT tweets.id FROM tweets
    LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
    LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id
WHERE tweets.user_id = 818910970567344128
    AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
    AND (cats.name = 'Venezuela' OR cats.name = 'Maduro' )
ORDER BY tweets.created_at ASC;

Update: Here's working SQL that adheres to AND logic for included categories. Many thanks to @Strawberry for the suggestion!

SELECT tweets.id FROM tweets
    LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
    LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
WHERE tweets.user_id = 818910970567344128
    AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
    AND cats.name IN ('Venezuela', 'Maduro')
GROUP BY tweets.id
HAVING COUNT(*) = 2
ORDER BY tweets.created_at ASC;

This is a bit beyond my SQL comprehension, though. I'm glad it works. I just wish I understood how.

Update 2: Here's working SQL that excludes categories. I realized that the AND/OR logic that applies to included categories also applies to excluded ones. This example uses OR logic. The syntax is essentially Q1 NOT IN (Q2), where Q2 is what's excluded, which is basically the same query used for inclusion.

SELECT id FROM tweets
WHERE user_id = 818910970567344128
    AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
    AND id NOT IN (
        SELECT tweets.id FROM tweets
            LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
            LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
        WHERE tweets.user_id = 818910970567344128
            AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
            AND cats.name IN ('Venezuela','Maduro')
    )
ORDER BY created_at ASC;

Update 3: Here's the working code.

function filter_tweets($db, $user_id, $from_utc, $to_utc,
                       $include = null, $include_logic = null,
                       $exclude = null, $exclude_logic = null) {
    if (isset($exclude)) {
        $exclude_sql = "
              AND tweets.id NOT IN (\n"
            . include_tweets($user_id, $from_utc, $to_utc, $exclude, $exclude_logic)
            . "\n)";
    } else {
        $exclude_sql = '';
    }
    if (isset($include)) {
        $sql = include_tweets($user_id, $from_utc, $to_utc, $include, $include_logic, $exclude_sql);
    } else {
        $sql = "
            SELECT id FROM tweets
            WHERE user_id = $user_id
              AND created_at
                BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
                    AND '{$to_utc  ->format('Y-m-d H:i:s')}'
              $exclude_sql";
    }
    $sql .= "\nORDER BY tweets.created_at ASC;";
    return db_fetch_all($db, $sql);   
}

which relies on this additional function for generating SQL:

function include_tweets($user_id, $from_utc, $to_utc, $include, $logic, $exclude_sql = '') {
    $group_sql = '';
    $include_sql = 'AND cats.name IN (';
    $comma = '';
    foreach ($include as $cat) {
        $include_sql .= "$comma'$cat'";
        $comma = ',';
    }
    $include_sql .= ')';
    if ($logic == 'and')
        $group_sql = 'GROUP BY tweets.id HAVING COUNT(*) = ' . count($include);
    return "
        SELECT tweets.id FROM tweets
          LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
          LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
        WHERE tweets.user_id = $user_id
          AND created_at
            BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
                AND '{$to_utc  ->format('Y-m-d H:i:s')}'
          $include_sql
        $group_sql
        $exclude_sql";
}
Answer 1

One way to do this is to join your tweets table against the junction table multiple times, e.g. like this:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
  JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
WHERE
  tweet_cats_foo.name = 'foo' AND tweet_cats_bar.name = 'bar'

or, equivalently, like this:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
  JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
    AND tweet_cats_bar.name = 'bar'

Note that, for simplicity, I'm assuming above that your junction table directly contains the category names. If you insist on using numeric category IDs but searching for categories by name, I'd recommend creating a view that joins the category and junction tables together using the numeric category ID and using that view instead of the actual junction table in your query. This saves you from having to include a whole bunch of unnecessary boilerplate code in the query just for finding the numeric category IDs.

For exclusion queries, you can use a LEFT JOIN and check that no matching record exists in the junction table (in which case all the columns from that table will be NULL), like this:

SELECT tweets.*
FROM tweets
  LEFT JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
WHERE
  tweet_cats_foo.tweet_id IS NULL  -- could use any non-null column here

(Using this method, you do need to include the tweet_cats_foo.name = 'foo' condition in the LEFT JOIN clause instead of the WHERE clause.)

Of course, you can also combine these. For example, to find tweets in category foo but not in bar, you could do:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
  LEFT JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
    AND tweet_cats_bar.name = 'bar'
WHERE
  tweet_cats_bar.tweet_id IS NULL

or, again equivalently:

SELECT tweets.*
FROM tweets
  LEFT JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
  LEFT JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
    AND tweet_cats_bar.name = 'bar'
WHERE
  tweet_cats_foo.tweet_id IS NOT NULL
  AND tweet_cats_bar.tweet_id IS NULL
READ ALSO
pgloader mysql to Postgres command line exclude table

pgloader mysql to Postgres command line exclude table

I have the following command that works

63
Phpmyadmin, I want to restore the backup of the databases

Phpmyadmin, I want to restore the backup of the databases

I've gone and reviewed all the databases and I chose the export option in phpmyadmin and he collect all the databases in one file now I want to restore this filePhpmyadmin Say Select Database

46
How can I select from the same table with different where conditions and differnet group by conditions in this case?

How can I select from the same table with different where conditions and differnet group by conditions in this case?

Basically this is what I am trying to achieve but I don't know how to do it in a single query, I don't know if it can be done in a single query

34
Calculate the average number of hours in table per month

Calculate the average number of hours in table per month

I have a table named "loginhistory", and I need to calculate the most active hour of the monthHow can i do it?

33