Object recognition in digital images is crucial for further automation in everyday life and industry. Basic object categories can already be distinguished well, but the automated recognition of detailed categories, called fine-grained recognition, remains challenging. Approaches in this field are usually based on an explicit or implicit normalization of the object pose. Explicit approaches describe an object by the appearance of its parts. Most previous works rely on annotated locations of semantic parts in all training images, but such annotations are expensive to obtain. Implicit approaches compute numerous local features and aggregate them without considering their spatial position. This leads to an implicit matching of the appearance of corresponding parts in the distance function of the classifier. This concept does not require annotated part locations, but the resulting features are not necessarily optimal, because the features might not lie on a Euclidean manifold and the aggregation strategy is chosen manually using validation data. In this thesis, we address the drawbacks of previous approaches with novel recognition and visualization techniques. We present approaches for explicit pose normalization that do not require part annotations. They are based on generating numerous generic part proposals and selecting the relevant ones for classification. We also improve existing implicit approaches by addressing their main issues. For example, we introduce a novel generalized aggregation scheme that allows the optimal strategy to be learned. The recognition approaches are complemented by two visualization techniques. We also analyze and predict the influence of random noise on recognition models. We evaluate and discuss all presented ideas extensively, both qualitatively and quantitatively, on widely used benchmark datasets. Our recognition approaches improve the accuracy of the base CNNs by up to 20.6% and also work in other domains such as action recognition.