Find Top N terms in a text string

I want to see how often the most common words are used in a list of movie titles.

Sample PBIX

I split, grouped, and counted the words in the titles in Power Query. I created a measure to find the TopN based on a slicer. Now, I need to look at every title and return true if it contains one of the terms in the TopN list. I don’t care if this is TRUE/FALSE, or 1/0, or “Has Top Word”/BLANK - anything as long as I can filter on it for calculations.

Desired outcome

Top N Measure calc, in case it matters (Top N Selection is a separate table for the slicer):

Top N Word = 
    var _TopGroup=
        TOPN(
            SELECTEDVALUE('Top N Selection'[Top N Selection]),
            SUMMARIZE(
                ALLSELECTED(TopWords[Top Words]),
                TopWords[Top Words],
                "Used",
                SUM(TopWords[Times Used])
            ),
            [Used]
        ) 

    var _Used = CALCULATE(
                    SUM(TopWords[Times Used]),
                    FILTER(TopWords, CONTAINS(_TopGroup,[Used],TopWords[Times Used]))
                )

    var _IsUsed = IF(NOT(ISBLANK(_Used)), SELECTEDVALUE(TopWords[Top Words]))
    
    
    RETURN
    _IsUsed
1 Like

@Gaelan ,

Cool problem. Can you please post the actual PBIX file rather than just the PTW version?

Thanks.

  • Brian
1 Like

@BrianJ Yep, sorry - I didn’t realize that you can’t download from PBI Online (sigh…). See if this link works instead:

2 Likes

@Gaelan ,

Thanks! I worked on this til all hours of the night last night. It turns out to be quite a difficult problem, since Power Query is FAR better suite to this type of task, but the highly dynamic nature of this particular problem makes that not a feasible option.

I think I am close to cracking it, but would be helpful if you could also post the supporting Excel data file so that I can run some test in PQ.

Thanks!

  • Brian
1 Like

Wow, thanks! You actually make me feel better that this is as tough as I thought it was. I was really worried I was missing something abundantly obvious.

Here’s the Excel file

1 Like

@Gaelan ,

OK, I finally cracked this one, but it did not surrender easily. As I mentioned above, this was difficult because separating text into columns is two clicks in Power Query, but difficult in DAX. The best way I’ve found to do it is using a trick from DAX wizard @AntrikshSharma , where you replace the spaces in the text with the vertical bar symbol ("|") and then use the PATHLENGTH and PATHITEM series of functions to manipulate the string into a tabular column.

Once you do this, you can create a one column virtual table of the Top N most used words, pulling the parameter from the top N slider, and another one column virtual table of the “split by delimited” movie title words using the approach above.

You can then take the intersection of these two tables and count the rows in them - every movie for which this count > 0 will have a word in it from the dynamic top N list.

Here’s the measure that does the heavy lifting, which you place in the filter pane on the visual and set equal to 1.

Top Word in Film Path =

VAR vTopNWords =

FILTER(

    ALL(TopWords[Top Words] ),

    [RankTopWords] <= [Top N Selection Value]

)

VAR vFilmPathWords =

ADDCOLUMNS(

    CROSSJOIN(

        VALUES(Movies[Film Path] ),

        VALUES( 'Max Path Index'[Value] )

    ),

    "@FilmPathWords", PATHITEM( Movies[Film Path], 'Max Path Index'[Value])

       

)

VAR Intersection =

COUNTROWS(

    INTERSECT(

        SELECTCOLUMNS( vTopNWords, "@Col1", [Top Words] ),

        SELECTCOLUMNS( vFilmPathWords, "@Col2", [@FilmPathWords]    )

    )

)

VAR Result =

IF( Intersection > 0, 1, 0 )

RETURN Result

And here’s what it looks like all put together:

Full solution file attached. I hope this is helpful.

Fun problem!

All the best,

4 Likes

Wow! I can’t wait to dive into this and see how it works. Very much appreciated!

2 Likes

@Gaelan ,

I actually found this problem so interesting and potentially useful to others that I’m in the process of recording a YouTube video about it as we speak…should be posted later this week.

Note: please use the solution file below. I realized in prepping for the video that it’s best to apply a cleaning function, such as the one below to the film titles so that punctuation doesn’t screw things up…

= Table.TransformColumns(#“Removed Duplicates”, {“Film”, each Text.Remove(_, {",", “:”, “-”, “;”} )})

  • Brian

Splitting Columns by Delimeters in DAX.pbix (142.4 KB)

3 Likes

@BrianJ , this is absolutely incredible. I don’t understand a few chunks of it, but it works, and that’s what I care about.

BUT WAIT, THERE’S MORE!

It turns out that I need a few phrases, not just single words. Which, of course, introduces spaces…which doesn’t play well with the path calcs.

So I did a little magic, and I made it work. I created shadow columns to do all my cleaning, including removing spaces from between the words of the spaces I wanted. I’ll see if I can reproduce it in the sample file.

2 Likes

@Gaelan - a remarkably timely comment. On Sunday, I posted on LinkedIn inviting people to share their Power BI Worst Practices. It’s a funny thread, but the answers are really insightful. Here’s one that was echoed in different ways by a number of people:

I agree completely with Steven and others about not using code you don’t fully understand - it may look like it works, but have weird edge cases or other ticking bombs in it. That what is great about this forum - the people who repond to questions don’t just throw code at you, but work to make sure the logic behind it is clear.

See if the video that will post tomorrow answers your remaining questions, but if not please feel free to give me a shout…

  • Brian
1 Like

Oh yeah, I definitely try hard to pull it all apart and really understand it. That said, tight deadlines mean sometimes it has to go on the to-learn list for the time being.

That said, I was able to get the phrase lookup into the test file, if you want to check that out. I put notes in the relevant PowerQuery steps so hopefully others can follow it.

TopN with phrases

@Gaelan ,

I just walked through your Top N with Phrases solution - very clever adaptation of the earlier solution.

Well done - thanks for sharing!

  • Brian

P.S. My YouTube video based on this problem just dropped this morning…