Latest Enterprise DNA Initiatives

Cleaning Dirty data in Power Query

Hello,

I have a comments column in my dataset on which I need to apply text analytics. Now I am dealing with lot of messy text, some of which I am unable to clean through User Interface in power Query here are some examples :

  1. there are several long urls that have different lengths varying from 40 to 500 characters

  2. Comments like : #8d9fc1ca-b858-4af6-8d4a-d4980ad46954-session-1629408347593_8de43ac12d564e449409aecb4bed2815Thanks

  3. bff1bf79-9221-4df6-bcdb-9597f1ce6bf2desk:

4){color:white;font-size:14px;margin:4px 0;font-weight:bold;}#TOU5256405

  1. прочитано.

  2. SentencesWithNoSpaceBetweenWords - It would be interesting to know if there is a way to get this type of text in separate meaningful words

What will be the best approach to handle/ rather clean such data, I am looking to create a word cloud and use the Cognitive services for sentiment analysis.

@jps ,

Have you seen this video from @Melissa? I think the general approach she lays out to developing custom text cleaning functions would be highly relevant here:

I also think this video from @AntrikshSharma will be helpful:

  • Brian
2 Likes

Thank you @BrianJ , Both of them are such a great resource,

I have just attempted he steps shared by @AntrikshSharma in his Video, I am getting stuck with an error

So Basically I had a table, I removed all other columns and kept just the comments column. and applied the Following Code:

let
    Source = #"Feedback Data",
    #"Removed Other Columns" = Table.SelectColumns(Source,{"Comments"}),
    ToKeep=
            let 
                CapitalAlphabets = {"A".."Z"},
                SmallAlphabets = {"a".."z"},
                OneList = CapitalAlphabets & SmallAlphabets
            in 
                OneList,
    CleanText = Text.Select(#"Removed Other Columns",ToKeep)
in
    CleanText

I am now getting the following error :

 Expression.Error: We cannot convert a value of type Table to type Text.
    Details:
        Value=[Table]
        Type=[Type] 

I have tried changing it to type text, changed it to list, I have had errors . Any guidance on how to sort this.

I have actually gone all the way to the end, and created a function as per the video, however the same error is appearing in the final query aswell

@jps ,

Can you please post your PBIX? It’s much easier for me to sort out these sorts of problems when I’ve got the data to test on.

But I think the problem is here:

Text.Select is expecting a text value, and the first parameter here is a table.

  • Brian

Hey Brian,
Thank you , I have attached the PBIX file, with Dummy data, to replicate the type of data I have.

I did manage to solve above problem and I managed to get the Query Working, The Problem now is - The Function is eating up the space between words, … So a sentence is becoming a one big string. And I am not sure what’s causing that

FeedbackQueryProb.pbix (35.9 KB)

So The Column named Comments has been Converted to Query1


sample for probsolving.xlsx (13.2 KB)

@jps ,

Very nice job. Easy fix to keep the spaces and email addresses intact - just add these characters to the Characters to Keep list (see below):

I hope this is helpful.

  • Brian

Oh Great. That resolves the issue. There is One last thing tough,

I would need to have a space where ever a special character exists. So My Guess was to replace these special characters with space. and then extract text with space

So I attempted the following code :

(InputString as text)=>

let
    Source = InputString,

    ToReplace=
        let
            CharactersToReplace = {"~", "`", "!", "#", "$", "%", "^", "&", "*", "(", ")", "-", "_", "+", "=", "{", "}", "[", "]", "|", "\", "/", ":", ";", "<", ">", ".", "?"}
        in
            CharactersToReplace,
    ReplaceValuesWithSpace= Table.ReplaceValue(Source, ToReplace, " ",Replacer.ReplaceText,{"Comments"}),

    ToKeep=
        let 
            CapitalAlphabets = {"A".."Z", " ", "@","."},
            SmallAlphabets = {"a".."z"},
            OneList = CapitalAlphabets & SmallAlphabets
        in 
            OneList,
    CleanText = Text.Select(ReplaceValuesWithSpace,ToKeep)

in
    CleanText`

The above code got me following error:

    Expression.Error: We cannot convert the value "01cfe925d9d265e0][ht..." to type Table.
Details:
    Value=01cfe925d9d265e0][httpsursfdjkhgeleilprcnisdfhlkjgunkkeybgd/gjghlfkz68gfhd2awulpro9dp91cmhaejpgfhjghj872urr8gbygkhy7b0hsn2fiioapvfnb985Testing98pxdhpx5xomsd5d3kdaf9th2dfkqhy9wpjzglwa3c4kkrhenylo67jrwhxalwnt36PowerBI1][httpsurlemailprotectionlinkbjatsaf0scgbax4wxwbzb1qytkofmjoqydh7zakgcszeqonyxlbydigrf2bkwhm87juzpxjfpodamybkka2eq3kynqmcmltzjxmlisb0tdals7lkwks5f1bojjckuu][cidimage010jpg
    Type=[Type]

@BrianJ @Melissa @AntrikshSharma

I’ve not managed to find a solution yet, Would you have any suggestions on the above, please?

Hi @jps,

You are passing a string to a function parameter that expects a table.

I hope this is helpful

1 Like

Aah - Got you. Thank you for pointing this out.