Latest Enterprise DNA Initiatives


Cleaning Dirty data in Power Query

Hello,

I have a comments column in my dataset on which I need to apply text analytics. Now I am dealing with lot of messy text, some of which I am unable to clean through User Interface in power Query here are some examples :

  1. there are several long urls that have different lengths varying from 40 to 500 characters

  2. Comments like : #8d9fc1ca-b858-4af6-8d4a-d4980ad46954-session-1629408347593_8de43ac12d564e449409aecb4bed2815Thanks

  3. bff1bf79-9221-4df6-bcdb-9597f1ce6bf2desk:

4){color:white;font-size:14px;margin:4px 0;font-weight:bold;}#TOU5256405

  1. прочитано.

  2. SentencesWithNoSpaceBetweenWords - It would be interesting to know if there is a way to get this type of text in separate meaningful words

What will be the best approach to handle/ rather clean such data, I am looking to create a word cloud and use the Cognitive services for sentiment analysis.

@jps ,

Have you seen this video from @Melissa? I think the general approach she lays out to developing custom text cleaning functions would be highly relevant here:

I also think this video from @AntrikshSharma will be helpful:

  • Brian
3 Likes

Thank you @BrianJ , Both of them are such a great resource,

I have just attempted he steps shared by @AntrikshSharma in his Video, I am getting stuck with an error

So Basically I had a table, I removed all other columns and kept just the comments column. and applied the Following Code:

let
    Source = #"Feedback Data",
    #"Removed Other Columns" = Table.SelectColumns(Source,{"Comments"}),
    ToKeep=
            let 
                CapitalAlphabets = {"A".."Z"},
                SmallAlphabets = {"a".."z"},
                OneList = CapitalAlphabets & SmallAlphabets
            in 
                OneList,
    CleanText = Text.Select(#"Removed Other Columns",ToKeep)
in
    CleanText

I am now getting the following error :

 Expression.Error: We cannot convert a value of type Table to type Text.
    Details:
        Value=[Table]
        Type=[Type] 

I have tried changing it to type text, changed it to list, I have had errors . Any guidance on how to sort this.

I have actually gone all the way to the end, and created a function as per the video, however the same error is appearing in the final query aswell

@jps ,

Can you please post your PBIX? It’s much easier for me to sort out these sorts of problems when I’ve got the data to test on.

But I think the problem is here:

Text.Select is expecting a text value, and the first parameter here is a table.

  • Brian

Hey Brian,
Thank you , I have attached the PBIX file, with Dummy data, to replicate the type of data I have.

I did manage to solve above problem and I managed to get the Query Working, The Problem now is - The Function is eating up the space between words, … So a sentence is becoming a one big string. And I am not sure what’s causing that

FeedbackQueryProb.pbix (35.9 KB)

So The Column named Comments has been Converted to Query1


sample for probsolving.xlsx (13.2 KB)

@jps ,

Very nice job. Easy fix to keep the spaces and email addresses intact - just add these characters to the Characters to Keep list (see below):

I hope this is helpful.

  • Brian

Oh Great. That resolves the issue. There is One last thing tough,

I would need to have a space where ever a special character exists. So My Guess was to replace these special characters with space. and then extract text with space

So I attempted the following code :

(InputString as text)=>

let
    Source = InputString,

    ToReplace=
        let
            CharactersToReplace = {"~", "`", "!", "#", "$", "%", "^", "&", "*", "(", ")", "-", "_", "+", "=", "{", "}", "[", "]", "|", "\", "/", ":", ";", "<", ">", ".", "?"}
        in
            CharactersToReplace,
    ReplaceValuesWithSpace= Table.ReplaceValue(Source, ToReplace, " ",Replacer.ReplaceText,{"Comments"}),

    ToKeep=
        let 
            CapitalAlphabets = {"A".."Z", " ", "@","."},
            SmallAlphabets = {"a".."z"},
            OneList = CapitalAlphabets & SmallAlphabets
        in 
            OneList,
    CleanText = Text.Select(ReplaceValuesWithSpace,ToKeep)

in
    CleanText`

The above code got me following error:

    Expression.Error: We cannot convert the value "01cfe925d9d265e0][ht..." to type Table.
Details:
    Value=01cfe925d9d265e0][httpsursfdjkhgeleilprcnisdfhlkjgunkkeybgd/gjghlfkz68gfhd2awulpro9dp91cmhaejpgfhjghj872urr8gbygkhy7b0hsn2fiioapvfnb985Testing98pxdhpx5xomsd5d3kdaf9th2dfkqhy9wpjzglwa3c4kkrhenylo67jrwhxalwnt36PowerBI1][httpsurlemailprotectionlinkbjatsaf0scgbax4wxwbzb1qytkofmjoqydh7zakgcszeqonyxlbydigrf2bkwhm87juzpxjfpodamybkka2eq3kynqmcmltzjxmlisb0tdals7lkwks5f1bojjckuu][cidimage010jpg
    Type=[Type]

@BrianJ @Melissa @AntrikshSharma

I’ve not managed to find a solution yet, Would you have any suggestions on the above, please?

Hi @jps,

You are passing a string to a function parameter that expects a table.

I hope this is helpful

1 Like

Aah - Got you. Thank you for pointing this out.

Hi @jps , did the response provided by the other users help you solve your query?

If not, how far did you get and what kind of help you need further?

If yes, kindly mark as solution the answer that solved your query.

Hi @jps, we’ve noticed that no response has been received from you since September 16. We just want to check if you still need further help with this post? In case there won’t be any activity on it in the next few days, we’ll be tagging this post as Solved.