# Roasting My Old SQL Queries
Today, I will roast my old SQL queries.
My data analytics journey began around this time two years ago. With high hopes and lots of confidence, I knew I could one day make a name for myself in the data space and build a lucrative career. So, I started learning the typical tools data influencers say you must know to be a data analyst. That includes Excel, Tableau, Python, and, of course, our good pal SQL.
I built this project for my portfolio, where I analyzed my old tweets about marketing to discover if a relationship existed between tweet length and performance. My sample size was small, and the project sucked. But that doesn’t matter. I still landed a job and am doing quite well right now.
I remember the high I felt after finishing the write-up for my Twitter project. I felt as though I wrote queries on par with what a seasoned data analyst would do. Oh, if only that were true.
Two years later, I stumbled across the project again while looking through old repos on my GitHub. And when I saw the queries I wrote, I wanted to beat myself up for being so foolish. Some of them are horrendous. Absolutely embarrassing.
If I wrote something similar at my job today, my team would question my sanity and second-guess whether I should even be a data analyst. But, fortunately, I’m smarter now and can write some damn nice SQL. So, we don’t have to worry about that.
In the spirit of having fun, I decided to roast the terrible SQL I previously wrote. Just note that these old queries do not reflect who I am now as a data analyst. I have to preface this because after you see the third query I roast, you’ll laugh at me, too.
To make myself feel better, I’ll be gentle with the first query I roast.
This one finds the top five tweets in each tweet-length range and orders them by impressions descending.
```sql
SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER(PARTITION BY tweet_length_range ORDER BY impressions DESC) AS row_num
    FROM twitter
) AS sub_query
WHERE row_num BETWEEN 1 AND 5
ORDER BY tweet_length_range, impressions DESC;
```
This isn’t that bad, actually. The query should execute efficiently, and there isn’t much to enhance for performance.
However, if I want to be nitpicky, I would make three small changes.
- Change the subquery into a CTE for better readability (though this won't impact performance).
- Specify the columns I select rather than doing `select *`.
- Use `row_num <= 5` instead of `row_num BETWEEN 1 AND 5`, since the `row_number` function starts at 1. This is purely for brevity; both options should have similar performance.
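To sanity-check the top-N-per-group pattern itself, here's a minimal sketch using Python's `sqlite3` with an in-memory database (SQLite 3.25+ for window functions). The table and column names mirror the project, but the values are invented:

```python
import sqlite3

# Made-up sample data: two tweet-length ranges, seven tweets each
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twitter (tweet_length_range TEXT, impressions INT)")
conn.executemany(
    "INSERT INTO twitter VALUES (?, ?)",
    [(rng, imp) for rng in ("short", "long") for imp in range(100, 800, 100)],
)

template = """
WITH OrderedTweets AS (
    SELECT tweet_length_range,
           impressions,
           ROW_NUMBER() OVER (
               PARTITION BY tweet_length_range
               ORDER BY impressions DESC
           ) AS row_num
    FROM twitter
)
SELECT tweet_length_range, impressions, row_num
FROM OrderedTweets
WHERE {predicate}
ORDER BY tweet_length_range, impressions DESC
"""

brief = conn.execute(template.format(predicate="row_num <= 5")).fetchall()
verbose = conn.execute(template.format(predicate="row_num BETWEEN 1 AND 5")).fetchall()

# Same ten rows either way; row_number() never goes below 1
assert brief == verbose
print(brief)
```

Both predicates keep exactly the five highest-impression tweets per range; the shorter one just reads better.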
With these slight changes, my new query looks like this.
```sql
with OrderedTweets as (
    select tweet_id
        ,tweet_text
        ,tweet_time
        ,tweet_length_range
        ,impressions
        ,engagements
        ,engagement_rate
        ,retweets
        ,replies
        ,likes
        ,row_number() over(partition by tweet_length_range order by impressions desc) as row_num
    from twitter
)
select tweet_id
    ,tweet_text
    ,tweet_time
    ,tweet_length_range
    ,impressions
    ,engagements
    ,engagement_rate
    ,retweets
    ,replies
    ,likes
    ,row_num
from OrderedTweets
where row_num <= 5
order by tweet_length_range, impressions desc;
```
Usually, `select *` isn't terrible, especially for simple, one-off queries like this. But if I were writing something that would execute many times (like a view, stored procedure, or report query), writing out all the column names helps with debugging. Your datasets usually aren't big for practice and portfolio projects, and you won't have an overwhelming number of columns. But in an organization's live production database, tables could have 50+ columns. In that scenario, you likely wouldn't need all of them, so using `select *` would force your queries to process more data than necessary.
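The difference is easy to see even in a toy setup. Here's a hypothetical sketch using Python's `sqlite3` with an invented three-column table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twitter (tweet_id INT, tweet_text TEXT, impressions INT)")
conn.execute("INSERT INTO twitter VALUES (1, 'hello', 500)")

# select * pulls back every column, and its shape silently changes
# whenever the table schema changes...
star = conn.execute("SELECT * FROM twitter").fetchone()

# ...while named columns pin down exactly what the query depends on
named = conn.execute("SELECT tweet_id, impressions FROM twitter").fetchone()

print(star)   # all three columns
print(named)  # only the two we asked for
```

With three columns this is harmless; with a 50-column production table, the named version is the one that fails loudly (and cheaply) when something changes underneath it.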
Okay, that wasn’t too bad, right? Well, things are about to get a little worse now (though certainly not as bad as the last one will be… just wait).
The next SQL query I'll roast finds the percentage of tweets by length range where all key metrics (impressions, replies, likes, engagement rate, and profile clicks) are above the overall average for all tweets.
My original monstrosity looks like this.
```sql
 1  WITH cte_1 AS (
 2      SELECT tweet_length_range, COUNT(*) AS num_tweets
 3      FROM (
 4          SELECT *
 5          FROM twitter
 6          WHERE impressions > (
 7              SELECT AVG(impressions)
 8              FROM twitter)
 9          AND replies > (
10              SELECT AVG(replies)
11              FROM twitter)
12          AND likes > (
13              SELECT AVG(likes)
14              FROM twitter)
15          AND engagement_rate > (
16              SELECT AVG(engagement_rate)
17              FROM twitter)
18          AND profile_clicks > (
19              SELECT AVG(profile_clicks)
20              FROM twitter)
21      ) sub_query
22      GROUP BY 1),
23  cte_2 AS (
24      SELECT tweet_length_range, COUNT(*) AS total
25      FROM twitter
26      GROUP BY 1)
27  SELECT c.tweet_length_range,
28      ROUND(c.num_tweets / d.total,2) AS percent_of_total
29  FROM cte_1 c
30  JOIN cte_2 d
31      ON c.tweet_length_range = d.tweet_length_range
32  GROUP BY 1
33  ORDER BY 2 DESC;
```
Eww. It's not pretty to look at, and it's challenging to decipher. My team and I at work are good at formatting our queries and using easy-to-understand naming conventions. So, when I see a new query, I can usually figure out what it's trying to do with a quick skim. But I cannot do that with this.
I almost don’t know where to begin.
Here’s why I think this query sucks and what I’d do to fix it.
- Too many subqueries between lines 7 and 20. I use five different subqueries to calculate averages at the same granularity. Instead, I could combine them into one subquery or CTE to prevent redundancy.
- Poor names for my CTEs. I use the worst names ever: `cte_1` and `cte_2`. These names don't tell you what the CTEs calculate, so my refactored query uses stronger naming conventions here.
- I use `GROUP BY 1` and `ORDER BY 2`, which I no longer like. This isn't terrible for small one-off queries, but it's bad practice when you select many columns from wide tables: it makes it difficult to detect which column you're actually grouping or ordering by. Instead, I prefer to write out the actual columns.
This is how my new query looks.
```sql
 1  with KeyMetricAverages as (
 2      select avg(impressions) as avg_impressions
 3          ,avg(replies) as avg_replies
 4          ,avg(likes) as avg_likes
 5          ,avg(engagement_rate) as avg_engagement_rate
 6          ,avg(profile_clicks) as avg_profile_clicks
 7      from twitter
 8  )
 9  , TotalTweetsPerRange as (
10      select t.tweet_length_range
11          ,count(t.tweet_length_range) as total_num_tweets
12      from twitter t
13      group by t.tweet_length_range
14  )
15  , MetricsExceedingAllAverages as (
16      select t.tweet_length_range
17          ,count(t.tweet_length_range) as count_per_range
18      from twitter t
19      cross join KeyMetricAverages k
20      where t.impressions > k.avg_impressions
21          and t.replies > k.avg_replies
22          and t.likes > k.avg_likes
23          and t.engagement_rate > k.avg_engagement_rate
24          and t.profile_clicks > k.avg_profile_clicks
25      group by t.tweet_length_range
26  )
27  select m.tweet_length_range
28      ,round(m.count_per_range * 1.0 / t.total_num_tweets, 2) as percent_of_total -- * 1.0 avoids integer division
29  from MetricsExceedingAllAverages m
30  join TotalTweetsPerRange t on t.tweet_length_range = m.tweet_length_range;
```
This is much easier to read, and skimming it gives you a better understanding of what the SQL is trying to do.
An interesting choice here is the `cross join` on line 19. Since I aggregate my averages in a single CTE rather than five subqueries, that CTE contains only one row of data. There's no natural `left join` or `inner join` to do, so a `cross join` applies that one record to each row in my twitter table. At the time, I had no idea `cross join` existed, hence why I foolishly used five subqueries.
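If a `cross join` against a one-row CTE sounds odd, here's a small runnable sketch of the same pattern (Python's `sqlite3`, invented data, and only two of the metrics to keep it short):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE twitter (tweet_length_range TEXT, impressions INT, likes INT)"
)
conn.executemany(
    "INSERT INTO twitter VALUES (?, ?, ?)",
    [("short", 100, 5), ("short", 900, 50), ("long", 200, 10), ("long", 800, 40)],
)

query = """
WITH KeyMetricAverages AS (
    SELECT AVG(impressions) AS avg_impressions,
           AVG(likes) AS avg_likes
    FROM twitter
)
SELECT t.tweet_length_range, t.impressions
FROM twitter t
CROSS JOIN KeyMetricAverages k  -- one row, so no row multiplication
WHERE t.impressions > k.avg_impressions
  AND t.likes > k.avg_likes
"""
above_average = conn.execute(query).fetchall()
print(above_average)
```

Because `KeyMetricAverages` has exactly one row, the cross join doesn't multiply rows; it just makes the averages visible next to every tweet so they can be compared in a single `WHERE` clause.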
Okay…
It’s time for me to roast my final query from my Twitter project. This is the worst of them all. I’m almost embarrassed to share this, but who doesn’t like making fun of their past self?
```sql
WITH cte_1 AS (
    WITH cte_2 AS (
        SELECT ROUND(AVG(impressions),0) AS average_impressions
        FROM twitter)
    SELECT t.tweet_length_range,
        ROUND(AVG(t.impressions),0) AS Avg_For_Range,
        (SELECT average_impressions
         FROM cte_2) AS Overall_Average
    FROM twitter t
    GROUP BY 1
    ORDER BY 2 DESC)
SELECT *,
    (CASE
        WHEN Avg_For_Range > Overall_Average THEN 'Better'
        WHEN Avg_For_Range < Overall_Average THEN 'Worse'
        ELSE NULL
    END) AS Better_or_Worse_Than_Overall_Average
FROM cte_1;
```
Look at that CTE embedded within another CTE. Absolutely disgusting (and not the good kind). Deploy code like this to production, and your team will surely scold you. So, never, never, ever make this mistake. I repeat, DO NOT EMBED A CTE WITHIN ANOTHER CTE. Some databases won't even accept the syntax, and the ones that do leave you with a query that's painful to read and debug and makes you look like a complete amateur.
Aside from that, what I wrote isn’t terrible. But my CTE blunder is enough to crown this the worst query I’ve ever written.
The only other fixes are similar to the ones I made in my previous two queries.
- Don't use `select *`.
- Don't use `GROUP BY 1` or `ORDER BY 2`.
- Use better naming conventions for CTEs.
Applying these fixes (especially untangling the hideous CTEs) produces the following.
```sql
with AverageImpressions as (
    select round(avg(impressions),0) as average_impressions
    from twitter
)
, RangeAverages as (
    select twitter.tweet_length_range
        ,round(avg(twitter.impressions),0) as average_for_range
    from twitter
    group by twitter.tweet_length_range
)
select RangeAverages.tweet_length_range
    ,RangeAverages.average_for_range
    ,case
        when RangeAverages.average_for_range > AverageImpressions.average_impressions then 'Better'
        when RangeAverages.average_for_range < AverageImpressions.average_impressions then 'Worse'
        else 'Equal'
    end as compared_to_average
from RangeAverages
cross join AverageImpressions
order by RangeAverages.average_for_range desc;
```
Ah yes… Now that looks a whole lot better. I can sleep better tonight with this.
I hope you learned a thing or two from this roast, specifically what not to do when writing SQL queries. While I cringe and make fun of myself for my past mistakes, know that it’s part of the learning process. It’s natural.
No one comes out of the womb writing perfect SQL. As with any new skill, there’s a learning curve. It’s inevitable that you’ll make mistakes early in your journey. There’s no way around it. Once you’re more proficient down the road, however, all you can—and should—do is laugh and roast your old self.
Your journey is more fun that way, and it reassures you that you're making progress.