  1. #1
    Just Joined!
    Join Date
    Dec 2010
    Posts
    2

    Checking file types, not extensions

    I know there is a file command, but having read the man page I don't think there is a simple way for a script to determine what type a file is, since the result is printed as a sentence meant for a human to read.

    The reason I'm not looking to check for file extensions is because they can be changed. Perhaps I'm being too paranoid?

    EDIT: Forgot to say it's in bash.

  2. #2
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    It depends on what you are trying to do. If you are trying to tell exactly what programming language something is in, this is obviously fairly complicated, and it can't be guaranteed that file would ever know. However, if you're looking to tell if it's a certain type of binary file, there are options.

    In the general case, "file" does have some advice:
    Code:
         The type printed will usually contain one of the words text (the file contains only
         printing characters and a few common control characters and is probably safe to read
         on an ASCII terminal), executable (the file contains the result of compiling a program
         in a form understandable to some UNIX kernel or another), or data meaning anything else
         (data is usually “binary” or non-printable).
    If you know that the file is binary, and you want to know exactly what type of file it is, you can look at the magic number. This is usually the first few bytes of a binary file. For example, PDF files always start with the ASCII bytes for "%PDF".
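
    A minimal sketch of that check in bash. The function name is_pdf is made up for illustration, and it assumes a head that supports -c (GNU coreutils does):

```shell
#!/usr/bin/env bash
# Succeed if the file named by $1 begins with the four ASCII bytes "%PDF".
# tr strips NUL bytes so that binary input does not trigger shell warnings.
is_pdf() {
    [ "$(head -c 4 < "$1" | tr -d '\0')" = "%PDF" ]
}
```

    It would be used as: is_pdf report.pdf && echo "looks like a PDF". The file extension is never consulted, only the leading bytes.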

    If you only care about specific file types, you could keep track of the magic numbers that are important to you. If you are writing a general file type detector, well, that's really what "file" is for, right?
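
    For scripting, file can also print a short MIME type instead of a sentence. The sketch below assumes a version of file that supports -b (brief) and --mime-type; the function names mime_of and describe are hypothetical, and the exact MIME strings can differ between versions (gzip, for example, may be reported as application/gzip or application/x-gzip):

```shell
#!/usr/bin/env bash
# Print only the MIME type of the file named by $1.
# -b drops the "filename:" prefix; --mime-type omits the charset part.
mime_of() {
    file -b --mime-type -- "$1"
}

# Map the MIME type to a short label that a script can branch on.
describe() {
    case "$(mime_of "$1")" in
        application/gzip|application/x-gzip) echo gzip ;;
        application/x-bzip2)                 echo bzip2 ;;
        application/zip)                     echo zip ;;
        application/x-tar)                   echo tar ;;
        *)                                   echo other ;;
    esac
}
```

    Because the decision is based on content, renaming archive.tar.gz to archive.txt should not change what describe reports.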
    DISTRO=Arch
    Registered Linux User #388732

  3. #3
    Just Joined!
    Join Date
    Dec 2010
    Posts
    2
    I'm just trying to differentiate between tar, zip, tar.bz2 and tar.gz. I assume I can find the magic numbers somewhere for these types?

  4. #4
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    So here's where things get interesting.

    tar files do not have a magic number at the beginning of the file. As per tar(5) ("man 5 tar"), the tar file format is simply a series of headers (the first 100 bytes of which are the file name), each followed by the file contents.

    For bzip2 and gzip there are magic numbers, but they cannot tell you whether a file is specifically a .tar.bz2 or just any bzip2 file. For bzip2, the first three bytes are "BZh" (as per Wikipedia); for gzip, the first two bytes are 0x1f 0x8b (as per RFC 1952).

    I don't know much about zip, but I suspect it has a magic number as well.
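
    A rough sketch of checking these leading bytes from bash. The function name magic_type is made up for illustration. The zip signature "PK\003\004" is not taken from the references above; it is the usual local file header signature of the ZIP format, so treat that line as an assumption to verify:

```shell
#!/usr/bin/env bash
# Classify the file named by $1 by its first four bytes.
magic_type() {
    local hex
    # od prints the bytes as lowercase hex pairs; tr joins them, e.g. "1f8b0800".
    hex=$(od -An -tx1 -N4 < "$1" | tr -d ' \n')
    case "$hex" in
        1f8b*)    echo gzip ;;     # 0x1f 0x8b
        425a68*)  echo bzip2 ;;    # "BZh"
        504b0304) echo zip ;;      # "PK\003\004" (assumed, see note above)
        *)        echo unknown ;;
    esac
}
```

    Note that "gzip" here means any gzip stream; telling a .tar.gz from another gzip file means looking inside the compressed data.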

  5. #5
    Linux Guru Lakshmipathi's Avatar
    Join Date
    Sep 2006
    Location
    3rd rock from sun - Often seen near moon
    Posts
    1,568
    Some time back, while dealing with file magic signatures, I found this link useful: File Signatures
    - Lakshmipathi.G
    -------------------
    FOSS India Award winning ext3fs Undelete tool and tutorials www.giis.co.in
    First they criticize you,Then they laugh at you,Then they fight with you,Then you win. - M.K.Gandhi
    -------------------

  6. #6
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    Using the link that Lakshmipathi provided, there is a signature in a tar file for some versions of tar:
    Code:
    #!/usr/bin/env bash
    
    # @(#) s1	Demonstrate ustar, tar, POSIX, GNU signature.
    
    # Utility functions: print-as-echo, print-line-with-visual-space, debug.
    # export PATH="/usr/local/bin:/usr/bin:/bin"
    pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
    pl() { pe;pe "-----" ;pe "$*"; }
    db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
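    db() { : ; }    # Redefine db as a no-op to silence debug output.
    # Run the author's local "context" helper if it exists.
    C=$HOME/bin/context && [ -f "$C" ] && "$C" tar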
    
    FILE=${1-data1}
    
    # Create tar file.
    pl " Characteristics of test files:"
    pe x > f1
    tar cf f2 f1
    ls -lgG f1 f2
    
    pl " Identification by command \"file\":"
    file f1 f2
    
    pl " Results:"
    # Dump f2 as characters and hex, starting at octal offset 0400 (256 decimal).
    od -cx f2 |
    sed -n '/^0000400/,$p' |
    head -n 2 |
    cut -c1-36
    
    exit 0
    producing:
    Code:
    % ./s1
    
    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
    Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
    GNU bash 3.2.39
    tar (GNU tar) 1.20
    
    -----
     Characteristics of test files:
    -rw-r----- 1     2 Jan  4 05:54 f1
    -rw-r----- 1 10240 Jan  4 05:54 f2
    
    -----
     Identification by command "file":
    f1: ASCII text
    f2: POSIX tar archive (GNU)
    
    -----
     Results:
    0000400  \0   u   s   t   a   r     
            7500 7473 7261 2020 6400 6e6
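    The signature is not at the beginning of the file but embedded at byte offset 257. The dump line starts at octal 0400 (256 decimal), so the \0 shown first is the last byte of the preceding header field and "ustar" begins one byte later.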
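
    Building on that offset, here is a hedged sketch of how a script might separate tar, tar.gz and tar.bz2. The names has_ustar and archive_kind are hypothetical, and the check only recognises archives that carry the "ustar" signature (as noted, only some versions of tar write it). It also assumes gzip and bzip2 are installed:

```shell
#!/usr/bin/env bash
# Succeed if standard input carries "ustar" at byte offset 257.
# dd skips 257 single-byte blocks and copies the next 5; tr drops NULs.
has_ustar() {
    [ "$(dd bs=1 skip=257 count=5 2>/dev/null | tr -d '\0')" = "ustar" ]
}

# Try the raw file first, then the gzip and bzip2 decompressed streams.
archive_kind() {
    local f=$1
    if has_ustar < "$f"; then
        echo tar
    elif gzip -dc < "$f" 2>/dev/null | has_ustar; then
        echo tar.gz
    elif bzip2 -dc < "$f" 2>/dev/null | has_ustar; then
        echo tar.bz2
    else
        echo other
    fi
}
```

    Only the first few hundred bytes of each stream are read, so this stays cheap even for large archives.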

    Best wishes ... cheers, drl

  7. #7
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    I don't know much about the magic.mgc file that the file command uses, but you could potentially parse that to get the information directly. I wouldn't be surprised if there were a library out there to make this easier.
